* [PATCH 0/3] x86: Update memcpy/memset inline strategies
@ 2021-03-22 13:16 H.J. Lu
  2021-03-22 13:16 ` [PATCH 1/3] x86: Update memcpy/memset inline strategies for Ice Lake H.J. Lu
  ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: H.J. Lu @ 2021-03-22 13:16 UTC (permalink / raw)
To: gcc-patches; +Cc: Jan Hubicka, Uros Bizjak, Hongtao Liu, Hongyu Wang

Simplify memcpy and memset inline strategies to avoid branches:

1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
   load and store for up to 16 * 16 (256) bytes when the data size is
   fixed and known.
2. Inline only if the data size is known to be <= 256.
   a. Use "rep movsb/stosb" with a simple code sequence if the data
      size is a constant.
   b. Use a loop if the data size is not a constant.
3. Use the memcpy/memset library function if the data size is unknown
   or > 256.

There are no significant performance impacts on SPEC CPU 2017.  There
are visible performance improvements on eembc benchmarks, with one
regression.

H.J. Lu (3):
  x86: Update memcpy/memset inline strategies for Ice Lake
  x86: Update memcpy/memset inline strategies for Skylake family CPUs
  x86: Update memcpy/memset inline strategies for -mtune=generic

 gcc/config/i386/i386-expand.c                 |  11 +-
 gcc/config/i386/i386-options.c                |  12 +-
 gcc/config/i386/i386.h                        |   2 +
 gcc/config/i386/x86-tune-costs.h              | 185 ++++++++++++++++--
 gcc/config/i386/x86-tune.def                  |   6 +
 .../gcc.target/i386/memcpy-strategy-10.c      |  11 ++
 .../gcc.target/i386/memcpy-strategy-11.c      |  18 ++
 .../gcc.target/i386/memcpy-strategy-12.c      |   9 +
 .../gcc.target/i386/memcpy-strategy-13.c      |  11 ++
 .../gcc.target/i386/memcpy-strategy-5.c       |  11 ++
 .../gcc.target/i386/memcpy-strategy-6.c       |  18 ++
 .../gcc.target/i386/memcpy-strategy-7.c       |   9 +
 .../gcc.target/i386/memcpy-strategy-8.c       |  18 ++
 .../gcc.target/i386/memcpy-strategy-9.c       |   9 +
 .../gcc.target/i386/memset-strategy-10.c      |  11 ++
 .../gcc.target/i386/memset-strategy-11.c      |   9 +
 .../gcc.target/i386/memset-strategy-3.c       |  17 ++
 .../gcc.target/i386/memset-strategy-4.c       |  17 ++
 .../gcc.target/i386/memset-strategy-5.c       |  11 ++
 .../gcc.target/i386/memset-strategy-6.c       |   9 +
 .../gcc.target/i386/memset-strategy-7.c       |  11 ++
 .../gcc.target/i386/memset-strategy-8.c       |   9 +
 .../gcc.target/i386/memset-strategy-9.c       |  17 ++
 gcc/testsuite/gcc.target/i386/shrink_wrap_1.c |   2 +-
 gcc/testsuite/gcc.target/i386/sw-1.c          |   2 +-
 25 files changed, 413 insertions(+), 32 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-10.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-11.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-5.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-6.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-7.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-8.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-9.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-10.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-11.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-5.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-6.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-7.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-8.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-9.c
-- 
2.30.2

^ permalink raw reply	[flat|nested] 31+ messages in thread
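A minimal sketch of how the three tiers above are intended to play out
(function names are illustrative, not from the patch; the code GCC
actually emits depends on -march/-mtune and the enabled ISA):

#include <string.h>

void
tier1_constant_small (char *d, const char *s)
{
  /* Constant size <= 256: inlined, either with integer/vector moves
     (MOVE_RATIO == 17 allows up to 16 16-byte moves) or, when vector
     moves are not available (e.g. -mno-sse), with "rep movsb".  */
  memcpy (d, s, 64);
}

void
tier2_variable_small (char *d, const char *s, unsigned char n)
{
  /* Size bounded but not constant: expanded as a loop rather than
     "rep movsb".  */
  memcpy (d, s, n & 127);
}

void
tier3_large (char *d, const char *s)
{
  /* Constant size > 256: call the memcpy library function.  */
  memcpy (d, s, 1024);
}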
* [PATCH 1/3] x86: Update memcpy/memset inline strategies for Ice Lake
  2021-03-22 13:16 [PATCH 0/3] x86: Update memcpy/memset inline strategies H.J. Lu
@ 2021-03-22 13:16 ` H.J. Lu
  2021-03-22 14:10   ` Jan Hubicka
  2021-03-22 13:16 ` [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs H.J. Lu
  2021-03-22 13:16 ` [PATCH 3/3] x86: Update memcpy/memset inline strategies for -mtune=generic H.J. Lu
  2 siblings, 1 reply; 31+ messages in thread
From: H.J. Lu @ 2021-03-22 13:16 UTC (permalink / raw)
To: gcc-patches; +Cc: Jan Hubicka, Uros Bizjak, Hongtao Liu, Hongyu Wang

Simplify memcpy and memset inline strategies to avoid branches for
-mtune=icelake:

1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
   load and store for up to 16 * 16 (256) bytes when the data size is
   fixed and known.
2. Inline only if the data size is known to be <= 256.
   a. Use "rep movsb/stosb" with a simple code sequence if the data
      size is a constant.
   b. Use a loop if the data size is not a constant.
3. Use the memcpy/memset library function if the data size is unknown
   or > 256.

On an Ice Lake processor with -march=native -Ofast -flto,

1. Performance impacts on the SPEC CPU 2017 rate suites are:

   500.perlbench_r  -0.93%
   502.gcc_r         0.36%
   505.mcf_r         0.31%
   520.omnetpp_r    -0.07%
   523.xalancbmk_r  -0.53%
   525.x264_r       -0.09%
   531.deepsjeng_r  -0.19%
   541.leela_r       0.16%
   548.exchange2_r   0.22%
   557.xz_r         -1.64%
   Geomean          -0.24%

   503.bwaves_r     -0.01%
   507.cactuBSSN_r   0.00%
   508.namd_r        0.12%
   510.parest_r      0.07%
   511.povray_r      0.29%
   519.lbm_r         0.00%
   521.wrf_r        -0.38%
   526.blender_r     0.16%
   527.cam4_r        0.18%
   538.imagick_r     0.76%
   544.nab_r        -0.84%
   549.fotonik3d_r  -0.07%
   554.roms_r       -0.01%
   Geomean           0.02%

2. Significant impacts on the eembc benchmarks are:

   eembc/nnet_test       9.90%
   eembc/mp2decoddata2  16.42%
   eembc/textv2data3    -4.86%
   eembc/qos            12.90%

gcc/

	* config/i386/i386-expand.c (expand_set_or_cpymem_via_rep):
	For TARGET_PREFER_KNOWN_REP_MOVSB_STOSB, don't convert QImode
	to SImode.
	(decide_alg): For TARGET_PREFER_KNOWN_REP_MOVSB_STOSB, use
	"rep movsb/stosb" only for known sizes.
	* config/i386/i386-options.c (processor_cost_table): Use Ice
	Lake cost for Cannon Lake, Ice Lake, Tiger Lake, Sapphire
	Rapids and Alder Lake.
	* config/i386/i386.h (TARGET_PREFER_KNOWN_REP_MOVSB_STOSB): New.
	* config/i386/x86-tune-costs.h (icelake_memcpy): New.
	(icelake_memset): Likewise.
	(icelake_cost): Likewise.
	* config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
	New.

gcc/testsuite/

	* gcc.target/i386/memcpy-strategy-5.c: New test.
	* gcc.target/i386/memcpy-strategy-6.c: Likewise.
	* gcc.target/i386/memcpy-strategy-7.c: Likewise.
	* gcc.target/i386/memcpy-strategy-8.c: Likewise.
	* gcc.target/i386/memset-strategy-3.c: Likewise.
	* gcc.target/i386/memset-strategy-4.c: Likewise.
	* gcc.target/i386/memset-strategy-5.c: Likewise.
	* gcc.target/i386/memset-strategy-6.c: Likewise.
--- gcc/config/i386/i386-expand.c | 11 +- gcc/config/i386/i386-options.c | 12 +- gcc/config/i386/i386.h | 2 + gcc/config/i386/x86-tune-costs.h | 127 ++++++++++++++++++ gcc/config/i386/x86-tune.def | 7 + .../gcc.target/i386/memcpy-strategy-5.c | 11 ++ .../gcc.target/i386/memcpy-strategy-6.c | 18 +++ .../gcc.target/i386/memcpy-strategy-7.c | 9 ++ .../gcc.target/i386/memcpy-strategy-8.c | 18 +++ .../gcc.target/i386/memset-strategy-3.c | 17 +++ .../gcc.target/i386/memset-strategy-4.c | 17 +++ .../gcc.target/i386/memset-strategy-5.c | 11 ++ .../gcc.target/i386/memset-strategy-6.c | 9 ++ 13 files changed, 260 insertions(+), 9 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-5.c create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-6.c create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-7.c create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-8.c create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-3.c create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-4.c create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-5.c create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-6.c diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c index ac69eed4d32..00efe090d97 100644 --- a/gcc/config/i386/i386-expand.c +++ b/gcc/config/i386/i386-expand.c @@ -5976,6 +5976,7 @@ expand_set_or_cpymem_via_rep (rtx destmem, rtx srcmem, /* If possible, it is shorter to use rep movs. TODO: Maybe it is better to move this logic to decide_alg. */ if (mode == QImode && CONST_INT_P (count) && !(INTVAL (count) & 3) + && !TARGET_PREFER_KNOWN_REP_MOVSB_STOSB && (!issetmem || orig_value == const0_rtx)) mode = SImode; @@ -6890,6 +6891,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, const struct processor_costs *cost; int i; bool any_alg_usable_p = false; + bool known_size_p = expected_size != -1; *noalign = false; *dynamic_check = -1; @@ -6899,7 +6901,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, if (optimize_function_for_size_p (cfun) || (optimize_insn_for_size_p () && (max_size < 256 - || (expected_size != -1 && expected_size < 256)))) + || (known_size_p && expected_size < 256)))) optimize_for_speed = false; else optimize_for_speed = true; @@ -6925,7 +6927,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, so inline version is a win, set expected size into the range. */ if (((max > 1 && (unsigned HOST_WIDE_INT) max >= max_size) || max == -1) - && expected_size == -1) + && !known_size_p) expected_size = min_size / 2 + max_size / 2; /* If user specified the algorithm, honor it if possible. 
*/ @@ -6984,7 +6986,10 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, else if (!any_alg_usable_p) break; } - else if (alg_usable_p (candidate, memset, have_as)) + else if (alg_usable_p (candidate, memset, have_as) + && !(TARGET_PREFER_KNOWN_REP_MOVSB_STOSB + && candidate == rep_prefix_1_byte + && !known_size_p)) { *noalign = algs->size[i].noalign; return candidate; diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c index b653527d266..bd52ce6ffec 100644 --- a/gcc/config/i386/i386-options.c +++ b/gcc/config/i386/i386-options.c @@ -721,14 +721,14 @@ static const struct processor_costs *processor_cost_table[] = &slm_cost, &skylake_cost, &skylake_cost, + &icelake_cost, + &icelake_cost, + &icelake_cost, &skylake_cost, + &icelake_cost, &skylake_cost, - &skylake_cost, - &skylake_cost, - &skylake_cost, - &skylake_cost, - &skylake_cost, - &skylake_cost, + &icelake_cost, + &icelake_cost, &intel_cost, &geode_cost, &k6_cost, diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h index 058c1cc25b2..b4001d21b70 100644 --- a/gcc/config/i386/i386.h +++ b/gcc/config/i386/i386.h @@ -523,6 +523,8 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST]; #define TARGET_PROMOTE_QImode ix86_tune_features[X86_TUNE_PROMOTE_QIMODE] #define TARGET_FAST_PREFIX ix86_tune_features[X86_TUNE_FAST_PREFIX] #define TARGET_SINGLE_STRINGOP ix86_tune_features[X86_TUNE_SINGLE_STRINGOP] +#define TARGET_PREFER_KNOWN_REP_MOVSB_STOSB \ + ix86_tune_features[X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB] #define TARGET_MISALIGNED_MOVE_STRING_PRO_EPILOGUES \ ix86_tune_features[X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES] #define TARGET_QIMODE_MATH ix86_tune_features[X86_TUNE_QIMODE_MATH] diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h index 58b3b81985b..0e00ff99df3 100644 --- a/gcc/config/i386/x86-tune-costs.h +++ b/gcc/config/i386/x86-tune-costs.h @@ -1936,6 +1936,133 @@ struct processor_costs skylake_cost = { "0:0:8", /* Label alignment. */ "16", /* Func alignment. */ }; + +/* icelake_cost should produce code tuned for Icelake family of CPUs. + NB: rep_prefix_1_byte is used only for known size. */ + +static stringop_algs icelake_memcpy[2] = { + {libcall, + {{256, rep_prefix_1_byte, true}, + {256, loop, false}, + {-1, libcall, false}}}, + {libcall, + {{256, rep_prefix_1_byte, true}, + {256, loop, false}, + {-1, libcall, false}}}}; + +static stringop_algs icelake_memset[2] = { + {libcall, + {{256, rep_prefix_1_byte, true}, + {256, loop, false}, + {-1, libcall, false}}}, + {libcall, + {{256, rep_prefix_1_byte, true}, + {256, loop, false}, + {-1, libcall, false}}}}; + +static const +struct processor_costs icelake_cost = { + { + /* Start of register allocator costs. integer->integer move cost is 2. */ + 6, /* cost for loading QImode using movzbl */ + {4, 4, 4}, /* cost of loading integer registers + in QImode, HImode and SImode. + Relative to reg-reg move (2). 
*/ + {6, 6, 6}, /* cost of storing integer registers */ + 2, /* cost of reg,reg fld/fst */ + {6, 6, 8}, /* cost of loading fp registers + in SFmode, DFmode and XFmode */ + {6, 6, 10}, /* cost of storing fp registers + in SFmode, DFmode and XFmode */ + 2, /* cost of moving MMX register */ + {6, 6}, /* cost of loading MMX registers + in SImode and DImode */ + {6, 6}, /* cost of storing MMX registers + in SImode and DImode */ + 2, 2, 4, /* cost of moving XMM,YMM,ZMM register */ + {6, 6, 6, 10, 20}, /* cost of loading SSE registers + in 32,64,128,256 and 512-bit */ + {8, 8, 8, 12, 24}, /* cost of storing SSE registers + in 32,64,128,256 and 512-bit */ + 6, 6, /* SSE->integer and integer->SSE moves */ + 5, 5, /* mask->integer and integer->mask moves */ + {8, 8, 8}, /* cost of loading mask register + in QImode, HImode, SImode. */ + {6, 6, 6}, /* cost if storing mask register + in QImode, HImode, SImode. */ + 3, /* cost of moving mask register. */ + /* End of register allocator costs. */ + }, + + COSTS_N_INSNS (1), /* cost of an add instruction */ + COSTS_N_INSNS (1)+1, /* cost of a lea instruction */ + COSTS_N_INSNS (1), /* variable shift costs */ + COSTS_N_INSNS (1), /* constant shift costs */ + {COSTS_N_INSNS (3), /* cost of starting multiply for QI */ + COSTS_N_INSNS (4), /* HI */ + COSTS_N_INSNS (3), /* SI */ + COSTS_N_INSNS (3), /* DI */ + COSTS_N_INSNS (3)}, /* other */ + 0, /* cost of multiply per each bit set */ + /* Expanding div/mod currently doesn't consider parallelism. So the cost + model is not realistic. We compensate by increasing the latencies a bit. */ + {COSTS_N_INSNS (11), /* cost of a divide/mod for QI */ + COSTS_N_INSNS (11), /* HI */ + COSTS_N_INSNS (14), /* SI */ + COSTS_N_INSNS (76), /* DI */ + COSTS_N_INSNS (76)}, /* other */ + COSTS_N_INSNS (1), /* cost of movsx */ + COSTS_N_INSNS (0), /* cost of movzx */ + 8, /* "large" insn */ + 17, /* MOVE_RATIO */ + 17, /* CLEAR_RATIO */ + {4, 4, 4}, /* cost of loading integer registers + in QImode, HImode and SImode. + Relative to reg-reg move (2). */ + {6, 6, 6}, /* cost of storing integer registers */ + {6, 6, 6, 10, 20}, /* cost of loading SSE register + in 32bit, 64bit, 128bit, 256bit and 512bit */ + {8, 8, 8, 12, 24}, /* cost of storing SSE register + in 32bit, 64bit, 128bit, 256bit and 512bit */ + {6, 6, 6, 10, 20}, /* cost of unaligned loads. */ + {8, 8, 8, 8, 16}, /* cost of unaligned stores. */ + 2, 2, 4, /* cost of moving XMM,YMM,ZMM register */ + 6, /* cost of moving SSE register to integer. */ + 20, 8, /* Gather load static, per_elt. */ + 22, 10, /* Gather store static, per_elt. */ + 64, /* size of l1 cache. */ + 512, /* size of l2 cache. */ + 64, /* size of prefetch block */ + 6, /* number of parallel prefetches */ + 3, /* Branch cost */ + COSTS_N_INSNS (3), /* cost of FADD and FSUB insns. */ + COSTS_N_INSNS (4), /* cost of FMUL instruction. */ + COSTS_N_INSNS (20), /* cost of FDIV instruction. */ + COSTS_N_INSNS (1), /* cost of FABS instruction. */ + COSTS_N_INSNS (1), /* cost of FCHS instruction. */ + COSTS_N_INSNS (20), /* cost of FSQRT instruction. */ + + COSTS_N_INSNS (1), /* cost of cheap SSE instruction. */ + COSTS_N_INSNS (4), /* cost of ADDSS/SD SUBSS/SD insns. */ + COSTS_N_INSNS (4), /* cost of MULSS instruction. */ + COSTS_N_INSNS (4), /* cost of MULSD instruction. */ + COSTS_N_INSNS (4), /* cost of FMA SS instruction. */ + COSTS_N_INSNS (4), /* cost of FMA SD instruction. */ + COSTS_N_INSNS (11), /* cost of DIVSS instruction. */ + COSTS_N_INSNS (14), /* cost of DIVSD instruction. 
*/ + COSTS_N_INSNS (12), /* cost of SQRTSS instruction. */ + COSTS_N_INSNS (18), /* cost of SQRTSD instruction. */ + 1, 4, 2, 2, /* reassoc int, fp, vec_int, vec_fp. */ + icelake_memcpy, + icelake_memset, + COSTS_N_INSNS (3), /* cond_taken_branch_cost. */ + COSTS_N_INSNS (1), /* cond_not_taken_branch_cost. */ + "16:11:8", /* Loop alignment. */ + "16:11:8", /* Jump alignment. */ + "0:0:8", /* Label alignment. */ + "16", /* Func alignment. */ +}; + /* BTVER1 has optimized REP instruction for medium sized blocks, but for very small blocks it is better to use loop. For large blocks, libcall can do nontemporary accesses and beat inline considerably. */ diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def index caebf76736e..134916cc972 100644 --- a/gcc/config/i386/x86-tune.def +++ b/gcc/config/i386/x86-tune.def @@ -269,6 +269,13 @@ DEF_TUNE (X86_TUNE_AVOID_MEM_OPND_FOR_CMOVE, "avoid_mem_opnd_for_cmove", as MOVS and STOS (without a REP prefix) to move/set sequences of bytes. */ DEF_TUNE (X86_TUNE_SINGLE_STRINGOP, "single_stringop", m_386 | m_P4_NOCONA) +/* X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB: Enable use of REP MOVSB/STOSB to + move/set sequences of bytes with known size. */ +DEF_TUNE (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB, + "prefer_known_rep_movsb_stosb", + m_CANNONLAKE | m_ICELAKE_CLIENT | m_ICELAKE_SERVER | m_TIGERLAKE + | m_ALDERLAKE | m_SAPPHIRERAPIDS) + /* X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES: Enable generation of compact prologues and epilogues by issuing a misaligned moves. This requires target to handle misaligned moves and partial memory stalls diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-5.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-5.c new file mode 100644 index 00000000000..83c333b551d --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-5.c @@ -0,0 +1,11 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=tigerlake -mno-sse" } */ +/* { dg-final { scan-assembler "jmp\tmemcpy" { target { ! ia32 } } } } */ +/* { dg-final { scan-assembler "call\tmemcpy" { target ia32 } } } */ +/* { dg-final { scan-assembler-not "rep movsb" } } */ + +void +foo (char *dest, char *src) +{ + __builtin_memcpy (dest, src, 257); +} diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-6.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-6.c new file mode 100644 index 00000000000..ed963dec853 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-6.c @@ -0,0 +1,18 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mtune=generic" } */ +/* { dg-final { scan-assembler-not "jmp\tmemcpy" { target { ! 
ia32 } } } } */ +/* { dg-final { scan-assembler-not "call\tmemcpy" { target ia32 } } } */ +/* { dg-final { scan-assembler-not "rep movsb" } } */ + +typedef unsigned char e_u8; + +#define MAXBC 8 + +void MixColumn(e_u8 a[4][MAXBC], e_u8 BC) +{ + e_u8 b[4][MAXBC]; + int i, j; + + for(i = 0; i < 4; i++) + for(j = 0; j < BC; j++) a[i][j] = b[i][j]; +} diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-7.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-7.c new file mode 100644 index 00000000000..be66d6b8426 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-7.c @@ -0,0 +1,9 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=tigerlake -mno-sse" } */ +/* { dg-final { scan-assembler "rep movsb" } } */ + +void +foo (char *dest, char *src) +{ + __builtin_memcpy (dest, src, 256); +} diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-8.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-8.c new file mode 100644 index 00000000000..e8fe0a66c98 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-8.c @@ -0,0 +1,18 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=tigerlake" } */ +/* { dg-final { scan-assembler-not "jmp\tmemcpy" { target { ! ia32 } } } } */ +/* { dg-final { scan-assembler-not "call\tmemcpy" { target ia32 } } } */ +/* { dg-final { scan-assembler-not "rep movsb" } } */ + +typedef unsigned char e_u8; + +#define MAXBC 8 + +void MixColumn(e_u8 a[4][MAXBC], e_u8 BC) +{ + e_u8 b[4][MAXBC]; + int i, j; + + for(i = 0; i < 4; i++) + for(j = 0; j < BC; j++) a[i][j] = b[i][j]; +} diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-3.c b/gcc/testsuite/gcc.target/i386/memset-strategy-3.c new file mode 100644 index 00000000000..9ea1e1ae7c2 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memset-strategy-3.c @@ -0,0 +1,17 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mtune=generic" } */ +/* { dg-final { scan-assembler-not "jmp\tmemset" { target { ! ia32 } } } } */ +/* { dg-final { scan-assembler-not "call\tmemset" { target ia32 } } } */ +/* { dg-final { scan-assembler-not "rep stosb" } } */ + +typedef unsigned char e_u8; + +#define MAXBC 8 + +void MixColumn(e_u8 a[4][MAXBC], e_u8 BC) +{ + int i, j; + + for(i = 0; i < 4; i++) + for(j = 0; j < BC; j++) a[i][j] = 1; +} diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-4.c b/gcc/testsuite/gcc.target/i386/memset-strategy-4.c new file mode 100644 index 00000000000..00d82f13ff8 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memset-strategy-4.c @@ -0,0 +1,17 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=tigerlake" } */ +/* { dg-final { scan-assembler-not "jmp\tmemset" { target { ! ia32 } } } } */ +/* { dg-final { scan-assembler-not "call\tmemset" { target ia32 } } } */ +/* { dg-final { scan-assembler-not "rep stosb" } } */ + +typedef unsigned char e_u8; + +#define MAXBC 8 + +void MixColumn(e_u8 a[4][MAXBC], e_u8 BC) +{ + int i, j; + + for(i = 0; i < 4; i++) + for(j = 0; j < BC; j++) a[i][j] = 1; +} diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-5.c b/gcc/testsuite/gcc.target/i386/memset-strategy-5.c new file mode 100644 index 00000000000..dc1de8e79c2 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memset-strategy-5.c @@ -0,0 +1,11 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=tigerlake -mno-sse" } */ +/* { dg-final { scan-assembler "jmp\tmemset" { target { ! 
ia32 } } } } */ +/* { dg-final { scan-assembler "call\tmemset" { target ia32 } } } */ +/* { dg-final { scan-assembler-not "rep stosb" } } */ + +void +foo (char *dest) +{ + __builtin_memset (dest, 0, 257); +} diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-6.c b/gcc/testsuite/gcc.target/i386/memset-strategy-6.c new file mode 100644 index 00000000000..e51af3b730f --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memset-strategy-6.c @@ -0,0 +1,9 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=tigerlake -mno-sse" } */ +/* { dg-final { scan-assembler "rep stosb" } } */ + +void +foo (char *dest) +{ + __builtin_memset (dest, 0, 256); +} -- 2.30.2 ^ permalink raw reply [flat|nested] 31+ messages in thread
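The 256-byte boundary in the new icelake tuning can be probed directly;
this sketch is not part of the patch, and the exact instruction
selection depends on the enabled ISA.  Compile with e.g.
"gcc -O2 -march=icelake-client -S" and compare the two functions:

struct blob256 { char b[256]; };
struct blob257 { char b[257]; };

void
copy256 (struct blob256 *d, const struct blob256 *s)
{
  *d = *s;	/* 256 == 16 * 16 bytes: expected to be expanded inline.  */
}

void
copy257 (struct blob257 *d, const struct blob257 *s)
{
  *d = *s;	/* 257 > 256: expected to become a call to memcpy.  */
}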
* Re: [PATCH 1/3] x86: Update memcpy/memset inline strategies for Ice Lake
  2021-03-22 13:16 ` [PATCH 1/3] x86: Update memcpy/memset inline strategies for Ice Lake H.J. Lu
@ 2021-03-22 14:10 ` Jan Hubicka
  2021-03-22 23:57   ` [PATCH v2 " H.J. Lu
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Hubicka @ 2021-03-22 14:10 UTC (permalink / raw)
To: H.J. Lu; +Cc: gcc-patches, Hongtao Liu, Hongyu Wang

>
> gcc/
>
> 	* config/i386/i386-expand.c (expand_set_or_cpymem_via_rep):
> 	For TARGET_PREFER_KNOWN_REP_MOVSB_STOSB, don't convert QImode
> 	to SImode.
> 	(decide_alg): For TARGET_PREFER_KNOWN_REP_MOVSB_STOSB, use
> 	"rep movsb/stosb" only for known sizes.
> 	* config/i386/i386-options.c (processor_cost_table): Use Ice
> 	Lake cost for Cannon Lake, Ice Lake, Tiger Lake, Sapphire
> 	Rapids and Alder Lake.
> 	* config/i386/i386.h (TARGET_PREFER_KNOWN_REP_MOVSB_STOSB): New.
> 	* config/i386/x86-tune-costs.h (icelake_memcpy): New.
> 	(icelake_memset): Likewise.
> 	(icelake_cost): Likewise.
> 	* config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
> 	New.

It looks like X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB is quite obviously
beneficial and independent of the rest of the changes.  I think we will
need to discuss the move ratio and the code size/uop cache pollution
issues a bit more - one option would be to use the increased limits for
-O3 only.

Can you break this out into an independent patch?  I also wonder if it
would not be more readable to special-case this just at the beginning
of decide_alg.

> @@ -6890,6 +6891,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size,
>    const struct processor_costs *cost;
>    int i;
>    bool any_alg_usable_p = false;
> +  bool known_size_p = expected_size != -1;

expected_size is not -1 if we have profile feedback and we detected the
average size of a block from the histogram.  It seems to me from the
description that you want the count to be an actual compile-time
constant, which would be min_size == max_size I guess.

Honza
^ permalink raw reply	[flat|nested] 31+ messages in thread
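A short sketch of the distinction raised here (sizes are illustrative;
the comments describe the expander's view of each call):

void
known (char *d, const char *s)
{
  /* Literal count: min_size == max_size == 200, a true compile-time
     constant.  */
  __builtin_memcpy (d, s, 200);
}

void
bounded (char *d, const char *s, unsigned int n)
{
  if (n > 200)
    n = 200;
  /* Value-range info gives max_size == 200, but min_size != max_size;
     with profile feedback, expected_size may additionally be set from
     the block-size histogram even though the count is not constant.  */
  __builtin_memcpy (d, s, n);
}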
* [PATCH v2 1/3] x86: Update memcpy/memset inline strategies for Ice Lake
  2021-03-22 14:10 ` Jan Hubicka
@ 2021-03-22 23:57   ` H.J. Lu
  2021-03-29 13:43     ` H.J. Lu
  ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: H.J. Lu @ 2021-03-22 23:57 UTC (permalink / raw)
To: Jan Hubicka; +Cc: GCC Patches, Hongtao Liu, Hongyu Wang

[-- Attachment #1: Type: text/plain, Size: 2292 bytes --]

On Mon, Mar 22, 2021 at 7:10 AM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> > gcc/
> >
> > 	* config/i386/i386-expand.c (expand_set_or_cpymem_via_rep):
> > 	For TARGET_PREFER_KNOWN_REP_MOVSB_STOSB, don't convert QImode
> > 	to SImode.
> > 	(decide_alg): For TARGET_PREFER_KNOWN_REP_MOVSB_STOSB, use
> > 	"rep movsb/stosb" only for known sizes.
> > 	* config/i386/i386-options.c (processor_cost_table): Use Ice
> > 	Lake cost for Cannon Lake, Ice Lake, Tiger Lake, Sapphire
> > 	Rapids and Alder Lake.
> > 	* config/i386/i386.h (TARGET_PREFER_KNOWN_REP_MOVSB_STOSB): New.
> > 	* config/i386/x86-tune-costs.h (icelake_memcpy): New.
> > 	(icelake_memset): Likewise.
> > 	(icelake_cost): Likewise.
> > 	* config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
> > 	New.
>
> It looks like X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB is quite obviously
> beneficial and independent of the rest of the changes.  I think we will
> need to discuss the move ratio and the code size/uop cache pollution
> issues a bit more - one option would be to use the increased limits for
> -O3 only.

My change only increases CLEAR_RATIO, not MOVE_RATIO.  We are
checking the code size impact on SPEC CPU 2017 and eembc.

> Can you break this out into an independent patch?  I also wonder if it
> would

X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB improves performance
only when the memcpy/memset costs and MOVE_RATIO are updated at the
same time, like:

https://gcc.gnu.org/pipermail/gcc-patches/2021-March/567096.html

Making it standalone would mean moving it from the Ice Lake patch to
the Skylake patch.

> not be more readable to special-case this just at the beginning
> of decide_alg.
> > @@ -6890,6 +6891,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size,
> >    const struct processor_costs *cost;
> >    int i;
> >    bool any_alg_usable_p = false;
> > +  bool known_size_p = expected_size != -1;
>
> expected_size is not -1 if we have profile feedback and we detected the
> average size of a block from the histogram.  It seems to me from the
> description that you want the count to be an actual compile-time
> constant, which would be min_size == max_size I guess.
>

You are right.  Here is the v2 patch with a min_size != max_size check
for unknown sizes.

Thanks.

-- 
H.J.

[-- Attachment #2: v2-0001-x86-Update-memcpy-memset-inline-strategies-for-Ic.patch --]
[-- Type: application/x-patch, Size: 18453 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH v2 1/3] x86: Update memcpy/memset inline strategies for Ice Lake 2021-03-22 23:57 ` [PATCH v2 " H.J. Lu @ 2021-03-29 13:43 ` H.J. Lu 2021-03-31 6:59 ` Richard Biener 2021-03-31 8:05 ` Jan Hubicka 2 siblings, 0 replies; 31+ messages in thread From: H.J. Lu @ 2021-03-29 13:43 UTC (permalink / raw) To: Jan Hubicka; +Cc: GCC Patches, Hongtao Liu, Hongyu Wang On Mon, Mar 22, 2021 at 4:57 PM H.J. Lu <hjl.tools@gmail.com> wrote: > > On Mon, Mar 22, 2021 at 7:10 AM Jan Hubicka <hubicka@ucw.cz> wrote: > > > > > > > > gcc/ > > > > > > * config/i386/i386-expand.c (expand_set_or_cpymem_via_rep): > > > For TARGET_PREFER_KNOWN_REP_MOVSB_STOSB, don't convert QImode > > > to SImode. > > > (decide_alg): For TARGET_PREFER_KNOWN_REP_MOVSB_STOSB, use > > > "rep movsb/stosb" only for known sizes. > > > * config/i386/i386-options.c (processor_cost_table): Use Ice > > > Lake cost for Cannon Lake, Ice Lake, Tiger Lake, Sapphire > > > Rapids and Alder Lake. > > > * config/i386/i386.h (TARGET_PREFER_KNOWN_REP_MOVSB_STOSB): New. > > > * config/i386/x86-tune-costs.h (icelake_memcpy): New. > > > (icelake_memset): Likewise. > > > (icelake_cost): Likewise. > > > * config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB): > > > New. > > > > It looks like X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB is quite obviously > > benefical and independent of the rest of changes. I think we will need > > to discuss bit more the move ratio and the code size/uop cache polution > > issues - one option would be to use increased limits for -O3 only. > > My change only increases CLEAR_RATIO, not MOVE_RATIO. We are > checking code size impacts on SPEC CPU 2017 and eembc. > > > Can you break this out to independent patch? I also wonder if it owuld > > X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB improves performance > only when memcpy/memset costs and MOVE_RATIO are updated the same time, > like: > > https://gcc.gnu.org/pipermail/gcc-patches/2021-March/567096.html > > Make it a standalone means moving from Ice Lake patch to Skylake patch. > > > not be more readable to special case this just on the beggining of > > decide_alg. > > > @@ -6890,6 +6891,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, > > > const struct processor_costs *cost; > > > int i; > > > bool any_alg_usable_p = false; > > > + bool known_size_p = expected_size != -1; > > > > expected_size is not -1 if we have profile feedback and we detected from > > histogram average size of a block. It seems to me that from description > > that you want the const to be actual compile time constant that would be > > min_size == max_size I guess. > > > > You are right. Here is the v2 patch with min_size != max_size check for > unknown size. > Hi Honza, This patch only impacts Ice Lake. Do you have any comments for the v2 patch? Thanks. -- H.J. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 1/3] x86: Update memcpy/memset inline strategies for Ice Lake 2021-03-22 23:57 ` [PATCH v2 " H.J. Lu 2021-03-29 13:43 ` H.J. Lu @ 2021-03-31 6:59 ` Richard Biener 2021-03-31 8:05 ` Jan Hubicka 2 siblings, 0 replies; 31+ messages in thread From: Richard Biener @ 2021-03-31 6:59 UTC (permalink / raw) To: H.J. Lu; +Cc: Jan Hubicka, Hongtao Liu, GCC Patches, Hongyu Wang On Tue, Mar 23, 2021 at 12:59 AM H.J. Lu via Gcc-patches <gcc-patches@gcc.gnu.org> wrote: > > On Mon, Mar 22, 2021 at 7:10 AM Jan Hubicka <hubicka@ucw.cz> wrote: > > > > > > > > gcc/ > > > > > > * config/i386/i386-expand.c (expand_set_or_cpymem_via_rep): > > > For TARGET_PREFER_KNOWN_REP_MOVSB_STOSB, don't convert QImode > > > to SImode. > > > (decide_alg): For TARGET_PREFER_KNOWN_REP_MOVSB_STOSB, use > > > "rep movsb/stosb" only for known sizes. > > > * config/i386/i386-options.c (processor_cost_table): Use Ice > > > Lake cost for Cannon Lake, Ice Lake, Tiger Lake, Sapphire > > > Rapids and Alder Lake. > > > * config/i386/i386.h (TARGET_PREFER_KNOWN_REP_MOVSB_STOSB): New. > > > * config/i386/x86-tune-costs.h (icelake_memcpy): New. > > > (icelake_memset): Likewise. > > > (icelake_cost): Likewise. > > > * config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB): > > > New. > > > > It looks like X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB is quite obviously > > benefical and independent of the rest of changes. I think we will need > > to discuss bit more the move ratio and the code size/uop cache polution > > issues - one option would be to use increased limits for -O3 only. > > My change only increases CLEAR_RATIO, not MOVE_RATIO. We are > checking code size impacts on SPEC CPU 2017 and eembc. > > > Can you break this out to independent patch? I also wonder if it owuld > > X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB improves performance > only when memcpy/memset costs and MOVE_RATIO are updated the same time, > like: > > https://gcc.gnu.org/pipermail/gcc-patches/2021-March/567096.html > > Make it a standalone means moving from Ice Lake patch to Skylake patch. > > > not be more readable to special case this just on the beggining of > > decide_alg. > > > @@ -6890,6 +6891,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, > > > const struct processor_costs *cost; > > > int i; > > > bool any_alg_usable_p = false; > > > + bool known_size_p = expected_size != -1; > > > > expected_size is not -1 if we have profile feedback and we detected from > > histogram average size of a block. It seems to me that from description > > that you want the const to be actual compile time constant that would be > > min_size == max_size I guess. > > > > You are right. Here is the v2 patch with min_size != max_size check for > unknown size. OK. Thanks, Richard. > Thanks. > > -- > H.J. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 1/3] x86: Update memcpy/memset inline strategies for Ice Lake
  2021-03-22 23:57 ` [PATCH v2 " H.J. Lu
  2021-03-29 13:43   ` H.J. Lu
  2021-03-31  6:59   ` Richard Biener
@ 2021-03-31  8:05 ` Jan Hubicka
  2021-03-31 13:09   ` H.J. Lu
  2 siblings, 1 reply; 31+ messages in thread
From: Jan Hubicka @ 2021-03-31 8:05 UTC (permalink / raw)
To: H.J. Lu; +Cc: Hongtao Liu, GCC Patches, Hongyu Wang

> > It looks like X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB is quite obviously
> > beneficial and independent of the rest of the changes.  I think we will
> > need to discuss the move ratio and the code size/uop cache pollution
> > issues a bit more - one option would be to use the increased limits for
> > -O3 only.
>
> My change only increases CLEAR_RATIO, not MOVE_RATIO.  We are
> checking the code size impact on SPEC CPU 2017 and eembc.
>
> > Can you break this out into an independent patch?  I also wonder if it
> > would
>
> X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB improves performance
> only when the memcpy/memset costs and MOVE_RATIO are updated at the
> same time, like:
>
> https://gcc.gnu.org/pipermail/gcc-patches/2021-March/567096.html
>
> Making it standalone would mean moving it from the Ice Lake patch to
> the Skylake patch.
>
> > not be more readable to special-case this just at the beginning
> > of decide_alg.
> > > @@ -6890,6 +6891,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size,
> > >    const struct processor_costs *cost;
> > >    int i;
> > >    bool any_alg_usable_p = false;
> > > +  bool known_size_p = expected_size != -1;
> >
> > expected_size is not -1 if we have profile feedback and we detected the
> > average size of a block from the histogram.  It seems to me from the
> > description that you want the count to be an actual compile-time
> > constant, which would be min_size == max_size I guess.
> >
>
> You are right.  Here is the v2 patch with a min_size != max_size check
> for unknown sizes.

Patch is OK now.  I was wondering about using avx256 for moves of known
size (per the comment on MOVE_MAX_PIECES there is an issue with
MAX_FIXED_MODE_SIZE, but that seems not hard to fix).  Did you look
into it?

Honza
^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH v2 1/3] x86: Update memcpy/memset inline strategies for Ice Lake 2021-03-31 8:05 ` Jan Hubicka @ 2021-03-31 13:09 ` H.J. Lu 2021-03-31 13:40 ` Jan Hubicka 0 siblings, 1 reply; 31+ messages in thread From: H.J. Lu @ 2021-03-31 13:09 UTC (permalink / raw) To: Jan Hubicka; +Cc: Hongtao Liu, GCC Patches, Hongyu Wang On Wed, Mar 31, 2021 at 1:05 AM Jan Hubicka <hubicka@ucw.cz> wrote: > > > > It looks like X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB is quite obviously > > > benefical and independent of the rest of changes. I think we will need > > > to discuss bit more the move ratio and the code size/uop cache polution > > > issues - one option would be to use increased limits for -O3 only. > > > > My change only increases CLEAR_RATIO, not MOVE_RATIO. We are > > checking code size impacts on SPEC CPU 2017 and eembc. > > > > > Can you break this out to independent patch? I also wonder if it owuld > > > > X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB improves performance > > only when memcpy/memset costs and MOVE_RATIO are updated the same time, > > like: > > > > https://gcc.gnu.org/pipermail/gcc-patches/2021-March/567096.html > > > > Make it a standalone means moving from Ice Lake patch to Skylake patch. > > > > > not be more readable to special case this just on the beggining of > > > decide_alg. > > > > @@ -6890,6 +6891,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, > > > > const struct processor_costs *cost; > > > > int i; > > > > bool any_alg_usable_p = false; > > > > + bool known_size_p = expected_size != -1; > > > > > > expected_size is not -1 if we have profile feedback and we detected from > > > histogram average size of a block. It seems to me that from description > > > that you want the const to be actual compile time constant that would be > > > min_size == max_size I guess. > > > > > > > You are right. Here is the v2 patch with min_size != max_size check for > > unknown size. > > Patch is OK now. I was wondering about using avx256 for moves of known Done. X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB is in now. Can you take a look at the patch for Skylake: https://gcc.gnu.org/pipermail/gcc-patches/2021-March/567096.html > size (per comment on MOVE_MAX_PIECES there is issue with > MAX_FIXED_MODE_SIZE, but that seems not hard to fix). Did you look into > it? It requires some changes in the middle-end. See users/hjl/pieces/master branch: https://gitlab.com/x86-gcc/gcc/-/tree/users/hjl/pieces/master I am rebasing it. -- H.J. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 1/3] x86: Update memcpy/memset inline strategies for Ice Lake
  2021-03-31 13:09 ` H.J. Lu
@ 2021-03-31 13:40   ` Jan Hubicka
  2021-03-31 13:47     ` Jan Hubicka
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Hubicka @ 2021-03-31 13:40 UTC (permalink / raw)
To: H.J. Lu; +Cc: Hongtao Liu, GCC Patches, Hongyu Wang

> >
> > Patch is OK now.  I was wondering about using avx256 for moves of known
>
> Done.  X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB is in now.  Can
> you take a look at the patch for Skylake:
>
> https://gcc.gnu.org/pipermail/gcc-patches/2021-March/567096.html

I was wondering: if the CPU prefers rep movsb when rcx is a
compile-time constant, it probably does some logic at decode time
(i.e. expands it into some sequence), and if so, it may require the
code setting the register to be near the rep (via fusing or a similar
mechanism).

Perhaps we want to have a fusing pattern for this, so we do not move
them far apart?

>
> > size (per comment on MOVE_MAX_PIECES there is an issue with
> > MAX_FIXED_MODE_SIZE, but that seems not hard to fix).  Did you look into
> > it?
>
> It requires some changes in the middle-end.  See

yep, I know - I tried that too for the zen3 tuning :)

> users/hjl/pieces/master branch:
>
> https://gitlab.com/x86-gcc/gcc/-/tree/users/hjl/pieces/master
>
> I am rebasing it.

Thanks, it would also help to reduce the code size bloat by bumping up
the move-by-pieces limits.  Clang is using those.

Honza
>
> --
> H.J.
^ permalink raw reply	[flat|nested] 31+ messages in thread
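A minimal sketch of the sequence in question; the expansion shown in
the comment is hypothetical, and the actual register allocation and
scheduling may differ:

void
copy245 (char *dest, char *src)
{
  __builtin_memcpy (dest, src, 245);
  /* Plausible expansion, with dest/src already in %rdi/%rsi under the
     x86-64 SysV ABI:
	movl	$245, %ecx
	rep movsb
     If the decoder special-cases a constant count in %rcx, the mov
     setting %ecx should not be scheduled away from the rep movsb.  */
}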
* Re: [PATCH v2 1/3] x86: Update memcpy/memset inline strategies for Ice Lake
  2021-03-31 13:40 ` Jan Hubicka
@ 2021-03-31 13:47   ` Jan Hubicka
  2021-03-31 15:41     ` H.J. Lu
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Hubicka @ 2021-03-31 13:47 UTC (permalink / raw)
To: H.J. Lu; +Cc: Hongtao Liu, GCC Patches, Hongyu Wang

> > >
> > > Patch is OK now.  I was wondering about using avx256 for moves of known
> >
> > Done.  X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB is in now.  Can
> > you take a look at the patch for Skylake:
> >
> > https://gcc.gnu.org/pipermail/gcc-patches/2021-March/567096.html
>
> I was wondering: if the CPU prefers rep movsb when rcx is a
> compile-time constant, it probably does some logic at decode time
> (i.e. expands it into some sequence), and if so, it may require the
> code setting the register to be near the rep (via fusing or a similar
> mechanism).
>
> Perhaps we want to have a fusing pattern for this, so we do not move
> them far apart?

Reading through the optimization manual, it seems that movsb is fast
for small blocks no matter whether the size is hard-wired.  In that
case you probably want to check whether max_size or expected_size is
known to be small, rather than requiring max_size == min_size with
both being small.

But it depends on what the CPU really does.

Honza
^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH v2 1/3] x86: Update memcpy/memset inline strategies for Ice Lake
  2021-03-31 13:47 ` Jan Hubicka
@ 2021-03-31 15:41   ` H.J. Lu
  2021-03-31 17:43     ` Jan Hubicka
  0 siblings, 1 reply; 31+ messages in thread
From: H.J. Lu @ 2021-03-31 15:41 UTC (permalink / raw)
To: Jan Hubicka; +Cc: Hongtao Liu, GCC Patches, Hongyu Wang

On Wed, Mar 31, 2021 at 6:47 AM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> > > >
> > > > Patch is OK now.  I was wondering about using avx256 for moves of known
> > >
> > > Done.  X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB is in now.  Can
> > > you take a look at the patch for Skylake:
> > >
> > > https://gcc.gnu.org/pipermail/gcc-patches/2021-March/567096.html
> >
> > I was wondering: if the CPU prefers rep movsb when rcx is a
> > compile-time constant, it probably does some logic at decode time
> > (i.e. expands it into some sequence), and if so, it may require the
> > code setting the register to be near the rep (via fusing or a similar
> > mechanism).
> >
> > Perhaps we want to have a fusing pattern for this, so we do not move
> > them far apart?
>
> Reading through the optimization manual, it seems that movsb is fast
> for small blocks no matter whether the size is hard-wired.  In that
> case you probably want to check whether max_size or expected_size is
> known to be small, rather than requiring max_size == min_size with
> both being small.
>
> But it depends on what the CPU really does.
> Honza

For small data sizes, rep movsb is faster only under certain
conditions.  We can continue fine-tuning rep movsb.

-- 
H.J.
^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH v2 1/3] x86: Update memcpy/memset inline strategies for Ice Lake
  2021-03-31 15:41 ` H.J. Lu
@ 2021-03-31 17:43   ` Jan Hubicka
  2021-03-31 17:54     ` H.J. Lu
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Hubicka @ 2021-03-31 17:43 UTC (permalink / raw)
To: H.J. Lu; +Cc: Hongtao Liu, GCC Patches, Hongyu Wang

> > Reading through the optimization manual, it seems that movsb is fast
> > for small blocks no matter whether the size is hard-wired.  In that
> > case you probably want to check whether max_size or expected_size is
> > known to be small, rather than requiring max_size == min_size with
> > both being small.
> >
> > But it depends on what the CPU really does.
> > Honza
>
> For small data sizes, rep movsb is faster only under certain
> conditions.  We can continue fine-tuning rep movsb.

OK, I however wonder why you need the condition maxsize == minsize.
- If the CPU is looking for movl $cst, %rcx then we probably want to
  be sure that it is not moved away from rep movsb, by adding a fused
  pattern.
- If rep movsb is slower than a loop for very small blocks, then you
  want to set a lower bound on minsize & expected size, but you do not
  need to require maxsize == minsize.
- If rep movsb is slower than a sequence of moves for small blocks,
  then one needs to tweak move-by-pieces.
- If rep movsb is slower for larger blocks, then you want to test
  maxsize and expected size.
So in none of those scenarios does testing maxsize == minsize alone
make much sense to me...  What was the original motivation for
differentiating between precisely known sizes?

I am mostly curious because it is not that uncommon to have a small
maxsize, because we are able to track the object size, and using a
short sequence for those would be nice.

Having a non-trivial minsize may not be that uncommon these days
either, given that we track value ranges (and under the assumption
that the memcpy/memset expanders were updated to take these into
account).

Honza
^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH v2 1/3] x86: Update memcpy/memset inline strategies for Ice Lake
  2021-03-31 17:43 ` Jan Hubicka
@ 2021-03-31 17:54   ` H.J. Lu
  2021-04-01  5:57     ` Hongyu Wang
  0 siblings, 1 reply; 31+ messages in thread
From: H.J. Lu @ 2021-03-31 17:54 UTC (permalink / raw)
To: Jan Hubicka; +Cc: Hongtao Liu, GCC Patches, Hongyu Wang

On Wed, Mar 31, 2021 at 10:43 AM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> > > Reading through the optimization manual, it seems that movsb is fast
> > > for small blocks no matter whether the size is hard-wired.  In that
> > > case you probably want to check whether max_size or expected_size is
> > > known to be small, rather than requiring max_size == min_size with
> > > both being small.
> > >
> > > But it depends on what the CPU really does.
> > > Honza
> >
> > For small data sizes, rep movsb is faster only under certain
> > conditions.  We can continue fine-tuning rep movsb.
>
> OK, I however wonder why you need the condition maxsize == minsize.
> - If the CPU is looking for movl $cst, %rcx then we probably want to
>   be sure that it is not moved away from rep movsb, by adding a fused
>   pattern.
> - If rep movsb is slower than a loop for very small blocks, then you
>   want to set a lower bound on minsize & expected size, but you do not
>   need to require maxsize == minsize.
> - If rep movsb is slower than a sequence of moves for small blocks,
>   then one needs to tweak move-by-pieces.
> - If rep movsb is slower for larger blocks, then you want to test
>   maxsize and expected size.
> So in none of those scenarios does testing maxsize == minsize alone
> make much sense to me...  What was the original motivation for
> differentiating between precisely known sizes?
>
> I am mostly curious because it is not that uncommon to have a small
> maxsize, because we are able to track the object size, and using a
> short sequence for those would be nice.
>
> Having a non-trivial minsize may not be that uncommon these days
> either, given that we track value ranges (and under the assumption
> that the memcpy/memset expanders were updated to take these into
> account).
>

Hongyu has done some analysis on this.  Hongyu, can you share what
you got?

Thanks.

-- 
H.J.
^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH v2 1/3] x86: Update memcpy/memset inline strategies for Ice Lake
  2021-03-31 17:54 ` H.J. Lu
@ 2021-04-01  5:57   ` Hongyu Wang
  0 siblings, 0 replies; 31+ messages in thread
From: Hongyu Wang @ 2021-04-01 5:57 UTC (permalink / raw)
To: H.J. Lu; +Cc: Jan Hubicka, Hongtao Liu, GCC Patches, Hongyu Wang

> > So in none of those scenarios does testing maxsize == minsize alone
> > make much sense to me...  What was the original motivation for
> > differentiating between precisely known sizes?

There is a case that can produce a small maxsize:

https://godbolt.org/z/489Tf7ssj

typedef unsigned char e_u8;
#define MAXBC 8

void MixColumn(e_u8 a[4][MAXBC], e_u8 BC)
{
  e_u8 b[4][MAXBC];
  int i, j;

  for(i = 0; i < 4; i++)
    for(j = 0; j < BC; j++) a[i][j] = b[i][j];
}

Since BC is unsigned char, maxsize will be 256.  If we set stringop_alg
to rep_1_byte, the code looks like:

	movzbl	%sil, %r8d
	movq	%rdi, %rdx
	leaq	-40(%rsp), %rax
	movq	%r8, %r9
	leaq	-8(%rsp), %r10
	testb	%r9b, %r9b
	je	.L5
	movq	%rdx, %rdi
	movq	%rax, %rsi
	movq	%r8, %rcx
	rep movsb
	addq	$8, %rax
	addq	$8, %rdx
	cmpq	%r10, %rax
	jne	.L2
	ret

In our tests we found this is much slower than current trunk, because
rep movsb triggers machine clear events, while on current trunk such
small sizes are handled in the loop mov epilogue and rep movsq is
never executed.  So here we disabled inlining for unknown sizes to
avoid potential issues like this.

H.J. Lu via Gcc-patches <gcc-patches@gcc.gnu.org> wrote on Thu, Apr 1,
2021 at 1:55 AM:
>
> On Wed, Mar 31, 2021 at 10:43 AM Jan Hubicka <hubicka@ucw.cz> wrote:
> >
> > > > Reading through the optimization manual, it seems that movsb is fast
> > > > for small blocks no matter whether the size is hard-wired.  In that
> > > > case you probably want to check whether max_size or expected_size is
> > > > known to be small, rather than requiring max_size == min_size with
> > > > both being small.
> > > >
> > > > But it depends on what the CPU really does.
> > > > Honza
> > >
> > > For small data sizes, rep movsb is faster only under certain
> > > conditions.  We can continue fine-tuning rep movsb.
> >
> > OK, I however wonder why you need the condition maxsize == minsize.
> > - If the CPU is looking for movl $cst, %rcx then we probably want to
> >   be sure that it is not moved away from rep movsb, by adding a fused
> >   pattern.
> > - If rep movsb is slower than a loop for very small blocks, then you
> >   want to set a lower bound on minsize & expected size, but you do not
> >   need to require maxsize == minsize.
> > - If rep movsb is slower than a sequence of moves for small blocks,
> >   then one needs to tweak move-by-pieces.
> > - If rep movsb is slower for larger blocks, then you want to test
> >   maxsize and expected size.
> > So in none of those scenarios does testing maxsize == minsize alone
> > make much sense to me...  What was the original motivation for
> > differentiating between precisely known sizes?
> >
> > I am mostly curious because it is not that uncommon to have a small
> > maxsize, because we are able to track the object size, and using a
> > short sequence for those would be nice.
> >
> > Having a non-trivial minsize may not be that uncommon these days
> > either, given that we track value ranges (and under the assumption
> > that the memcpy/memset expanders were updated to take these into
> > account).
> >
>
> Hongyu has done some analysis on this.  Hongyu, can you share what
> you got?
>
> Thanks.
>
> --
> H.J.

-- 
Regards,

Hongyu, Wang
^ permalink raw reply	[flat|nested] 31+ messages in thread
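A related usage note: the effect Hongyu describes can be approximated
on an unpatched compiler by forcing the algorithm from the command
line, assuming "rep_byte" as the -mstringop-strategy= spelling of the
internal rep_1_byte/rep_prefix_1_byte algorithm:

  gcc -O2 -march=cascadelake -mstringop-strategy=rep_byte -S mixcolumn.c

where mixcolumn.c contains the MixColumn function shown above.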
* [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs
  2021-03-22 13:16 [PATCH 0/3] x86: Update memcpy/memset inline strategies H.J. Lu
  2021-03-22 13:16 ` [PATCH 1/3] x86: Update memcpy/memset inline strategies for Ice Lake H.J. Lu
@ 2021-03-22 13:16 ` H.J. Lu
  2021-04-05 13:45   ` H.J. Lu
  2021-03-22 13:16 ` [PATCH 3/3] x86: Update memcpy/memset inline strategies for -mtune=generic H.J. Lu
  2 siblings, 1 reply; 31+ messages in thread
From: H.J. Lu @ 2021-03-22 13:16 UTC (permalink / raw)
To: gcc-patches; +Cc: Jan Hubicka, Uros Bizjak, Hongtao Liu, Hongyu Wang

Simplify memcpy and memset inline strategies to avoid branches for
Skylake family CPUs:

1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
   load and store for up to 16 * 16 (256) bytes when the data size is
   fixed and known.
2. Inline only if the data size is known to be <= 256.
   a. Use "rep movsb/stosb" with a simple code sequence if the data
      size is a constant.
   b. Use a loop if the data size is not a constant.
3. Use the memcpy/memset library function if the data size is unknown
   or > 256.

On a Cascadelake processor with -march=native -Ofast -flto,

1. Performance impacts on the SPEC CPU 2017 rate suites are:

   500.perlbench_r   0.17%
   502.gcc_r        -0.36%
   505.mcf_r         0.00%
   520.omnetpp_r     0.08%
   523.xalancbmk_r  -0.62%
   525.x264_r        1.04%
   531.deepsjeng_r   0.11%
   541.leela_r      -1.09%
   548.exchange2_r  -0.25%
   557.xz_r          0.17%
   Geomean          -0.08%

   503.bwaves_r      0.00%
   507.cactuBSSN_r   0.69%
   508.namd_r       -0.07%
   510.parest_r      1.12%
   511.povray_r      1.82%
   519.lbm_r         0.00%
   521.wrf_r        -1.32%
   526.blender_r    -0.47%
   527.cam4_r        0.23%
   538.imagick_r    -1.72%
   544.nab_r        -0.56%
   549.fotonik3d_r   0.12%
   554.roms_r        0.43%
   Geomean           0.02%

2. Significant impacts on the eembc benchmarks are:

   eembc/idctrn01    9.23%
   eembc/nnet_test  29.26%

gcc/

	* config/i386/x86-tune-costs.h (skylake_memcpy): Updated.
	(skylake_memset): Likewise.
	(skylake_cost): Change CLEAR_RATIO to 17.
	* config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
	Replace m_CANNONLAKE, m_ICELAKE_CLIENT, m_ICELAKE_SERVER,
	m_TIGERLAKE and m_SAPPHIRERAPIDS with m_SKYLAKE and
	m_CORE_AVX512.

gcc/testsuite/

	* gcc.target/i386/memcpy-strategy-9.c: New test.
	* gcc.target/i386/memcpy-strategy-10.c: Likewise.
	* gcc.target/i386/memcpy-strategy-11.c: Likewise.
	* gcc.target/i386/memset-strategy-7.c: Likewise.
	* gcc.target/i386/memset-strategy-8.c: Likewise.
	* gcc.target/i386/memset-strategy-9.c: Likewise.
--- gcc/config/i386/x86-tune-costs.h | 27 ++++++++++++------- gcc/config/i386/x86-tune.def | 3 +-- .../gcc.target/i386/memcpy-strategy-10.c | 11 ++++++++ .../gcc.target/i386/memcpy-strategy-11.c | 18 +++++++++++++ .../gcc.target/i386/memcpy-strategy-9.c | 9 +++++++ .../gcc.target/i386/memset-strategy-7.c | 11 ++++++++ .../gcc.target/i386/memset-strategy-8.c | 9 +++++++ .../gcc.target/i386/memset-strategy-9.c | 17 ++++++++++++ 8 files changed, 93 insertions(+), 12 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-10.c create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-11.c create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-9.c create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-7.c create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-8.c create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-9.c diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h index 0e00ff99df3..ffe810f2bcb 100644 --- a/gcc/config/i386/x86-tune-costs.h +++ b/gcc/config/i386/x86-tune-costs.h @@ -1822,17 +1822,24 @@ struct processor_costs znver3_cost = { /* skylake_cost should produce code tuned for Skylake familly of CPUs. */ static stringop_algs skylake_memcpy[2] = { - {libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}}, - {libcall, {{16, loop, false}, {512, unrolled_loop, false}, - {-1, libcall, false}}}}; + {libcall, + {{256, rep_prefix_1_byte, true}, + {256, loop, false}, + {-1, libcall, false}}}, + {libcall, + {{256, rep_prefix_1_byte, true}, + {256, loop, false}, + {-1, libcall, false}}}}; static stringop_algs skylake_memset[2] = { - {libcall, {{6, loop_1_byte, true}, - {24, loop, true}, - {8192, rep_prefix_4_byte, true}, - {-1, libcall, false}}}, - {libcall, {{24, loop, true}, {512, unrolled_loop, false}, - {-1, libcall, false}}}}; + {libcall, + {{256, rep_prefix_1_byte, true}, + {256, loop, false}, + {-1, libcall, false}}}, + {libcall, + {{256, rep_prefix_1_byte, true}, + {256, loop, false}, + {-1, libcall, false}}}}; static const struct processor_costs skylake_cost = { @@ -1889,7 +1896,7 @@ struct processor_costs skylake_cost = { COSTS_N_INSNS (0), /* cost of movzx */ 8, /* "large" insn */ 17, /* MOVE_RATIO */ - 6, /* CLEAR_RATIO */ + 17, /* CLEAR_RATIO */ {4, 4, 4}, /* cost of loading integer registers in QImode, HImode and SImode. Relative to reg-reg move (2). */ diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def index 134916cc972..eb057a67750 100644 --- a/gcc/config/i386/x86-tune.def +++ b/gcc/config/i386/x86-tune.def @@ -273,8 +273,7 @@ DEF_TUNE (X86_TUNE_SINGLE_STRINGOP, "single_stringop", m_386 | m_P4_NOCONA) move/set sequences of bytes with known size. */ DEF_TUNE (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB, "prefer_known_rep_movsb_stosb", - m_CANNONLAKE | m_ICELAKE_CLIENT | m_ICELAKE_SERVER | m_TIGERLAKE - | m_ALDERLAKE | m_SAPPHIRERAPIDS) + m_SKYLAKE | m_ALDERLAKE | m_CORE_AVX512) /* X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES: Enable generation of compact prologues and epilogues by issuing a misaligned moves. This diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-10.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-10.c new file mode 100644 index 00000000000..970aa741971 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-10.c @@ -0,0 +1,11 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=skylake -mno-sse" } */ +/* { dg-final { scan-assembler "jmp\tmemcpy" { target { ! 
ia32 } } } } */ +/* { dg-final { scan-assembler "call\tmemcpy" { target ia32 } } } */ +/* { dg-final { scan-assembler-not "rep movsb" } } */ + +void +foo (char *dest, char *src) +{ + __builtin_memcpy (dest, src, 257); +} diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-11.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-11.c new file mode 100644 index 00000000000..b6041944630 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-11.c @@ -0,0 +1,18 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=skylake" } */ +/* { dg-final { scan-assembler-not "jmp\tmemcpy" { target { ! ia32 } } } } */ +/* { dg-final { scan-assembler-not "call\tmemcpy" { target ia32 } } } */ +/* { dg-final { scan-assembler-not "rep movsb" } } */ + +typedef unsigned char e_u8; + +#define MAXBC 8 + +void MixColumn(e_u8 a[4][MAXBC], e_u8 BC) +{ + e_u8 b[4][MAXBC]; + int i, j; + + for(i = 0; i < 4; i++) + for(j = 0; j < BC; j++) a[i][j] = b[i][j]; +} diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-9.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-9.c new file mode 100644 index 00000000000..b0dc7484d09 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-9.c @@ -0,0 +1,9 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=skylake -mno-sse" } */ +/* { dg-final { scan-assembler "rep movsb" } } */ + +void +foo (char *dest, char *src) +{ + __builtin_memcpy (dest, src, 256); +} diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-7.c b/gcc/testsuite/gcc.target/i386/memset-strategy-7.c new file mode 100644 index 00000000000..07c2816910c --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memset-strategy-7.c @@ -0,0 +1,11 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=skylake -mno-sse" } */ +/* { dg-final { scan-assembler "jmp\tmemset" { target { ! ia32 } } } } */ +/* { dg-final { scan-assembler "call\tmemset" { target ia32 } } } */ +/* { dg-final { scan-assembler-not "rep stosb" } } */ + +void +foo (char *dest) +{ + __builtin_memset (dest, 0, 257); +} diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-8.c b/gcc/testsuite/gcc.target/i386/memset-strategy-8.c new file mode 100644 index 00000000000..52ea882c814 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memset-strategy-8.c @@ -0,0 +1,9 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=skylake -mno-sse" } */ +/* { dg-final { scan-assembler "rep stosb" } } */ + +void +foo (char *dest) +{ + __builtin_memset (dest, 0, 256); +} diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-9.c b/gcc/testsuite/gcc.target/i386/memset-strategy-9.c new file mode 100644 index 00000000000..d4db031958f --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memset-strategy-9.c @@ -0,0 +1,17 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=skylake" } */ +/* { dg-final { scan-assembler-not "jmp\tmemset" { target { ! ia32 } } } } */ +/* { dg-final { scan-assembler-not "call\tmemset" { target ia32 } } } */ +/* { dg-final { scan-assembler-not "rep stosb" } } */ + +typedef unsigned char e_u8; + +#define MAXBC 8 + +void MixColumn(e_u8 a[4][MAXBC], e_u8 BC) +{ + int i, j; + + for(i = 0; i < 4; i++) + for(j = 0; j < BC; j++) a[i][j] = 1; +} -- 2.30.2 ^ permalink raw reply [flat|nested] 31+ messages in thread
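A sketch of what the CLEAR_RATIO bump from 6 to 17 buys in this patch;
the size is illustrative and the exact store sequence depends on the
enabled vector ISA:

struct s128 { char b[128]; };

void
clear128 (struct s128 *p)
{
  /* 128 bytes fits in 8 16-byte (or 4 32-byte) stores, within the new
     CLEAR_RATIO of 17, so this can be expanded as straight-line
     stores; the old CLEAR_RATIO of 6 would not allow that.  */
  __builtin_memset (p, 0, 128);
}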
* Re: [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs 2021-03-22 13:16 ` [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs H.J. Lu @ 2021-04-05 13:45 ` H.J. Lu 2021-04-05 21:14 ` Jan Hubicka 0 siblings, 1 reply; 31+ messages in thread From: H.J. Lu @ 2021-04-05 13:45 UTC (permalink / raw) To: GCC Patches; +Cc: Jan Hubicka, Uros Bizjak, Hongtao Liu, Hongyu Wang On Mon, Mar 22, 2021 at 6:16 AM H.J. Lu <hjl.tools@gmail.com> wrote: > > Simply memcpy and memset inline strategies to avoid branches for > Skylake family CPUs: > > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector > load and store for up to 16 * 16 (256) bytes when the data size is > fixed and known. > 2. Inline only if data size is known to be <= 256. > a. Use "rep movsb/stosb" with simple code sequence if the data size > is a constant. > b. Use loop if data size is not a constant. > 3. Use memcpy/memset libray function if data size is unknown or > 256. > > On Cascadelake processor with -march=native -Ofast -flto, > > 1. Performance impacts of SPEC CPU 2017 rate are: > > 500.perlbench_r 0.17% > 502.gcc_r -0.36% > 505.mcf_r 0.00% > 520.omnetpp_r 0.08% > 523.xalancbmk_r -0.62% > 525.x264_r 1.04% > 531.deepsjeng_r 0.11% > 541.leela_r -1.09% > 548.exchange2_r -0.25% > 557.xz_r 0.17% > Geomean -0.08% > > 503.bwaves_r 0.00% > 507.cactuBSSN_r 0.69% > 508.namd_r -0.07% > 510.parest_r 1.12% > 511.povray_r 1.82% > 519.lbm_r 0.00% > 521.wrf_r -1.32% > 526.blender_r -0.47% > 527.cam4_r 0.23% > 538.imagick_r -1.72% > 544.nab_r -0.56% > 549.fotonik3d_r 0.12% > 554.roms_r 0.43% > Geomean 0.02% > > 2. Significant impacts on eembc benchmarks are: > > eembc/idctrn01 9.23% > eembc/nnet_test 29.26% > > gcc/ > > * config/i386/x86-tune-costs.h (skylake_memcpy): Updated. > (skylake_memset): Likewise. > (skylake_cost): Change CLEAR_RATIO to 17. > * config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB): > Replace m_CANNONLAKE, m_ICELAKE_CLIENT, m_ICELAKE_SERVER, > m_TIGERLAKE and m_SAPPHIRERAPIDS with m_SKYLAKE and m_CORE_AVX512. > > gcc/testsuite/ > > * gcc.target/i386/memcpy-strategy-9.c: New test. > * gcc.target/i386/memcpy-strategy-10.c: Likewise. > * gcc.target/i386/memcpy-strategy-11.c: Likewise. > * gcc.target/i386/memset-strategy-7.c: Likewise. > * gcc.target/i386/memset-strategy-8.c: Likewise. > * gcc.target/i386/memset-strategy-9.c: Likewise. 
> --- > gcc/config/i386/x86-tune-costs.h | 27 ++++++++++++------- > gcc/config/i386/x86-tune.def | 3 +-- > .../gcc.target/i386/memcpy-strategy-10.c | 11 ++++++++ > .../gcc.target/i386/memcpy-strategy-11.c | 18 +++++++++++++ > .../gcc.target/i386/memcpy-strategy-9.c | 9 +++++++ > .../gcc.target/i386/memset-strategy-7.c | 11 ++++++++ > .../gcc.target/i386/memset-strategy-8.c | 9 +++++++ > .../gcc.target/i386/memset-strategy-9.c | 17 ++++++++++++ > 8 files changed, 93 insertions(+), 12 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-10.c > create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-11.c > create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-9.c > create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-7.c > create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-8.c > create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-9.c > > diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h > index 0e00ff99df3..ffe810f2bcb 100644 > --- a/gcc/config/i386/x86-tune-costs.h > +++ b/gcc/config/i386/x86-tune-costs.h > @@ -1822,17 +1822,24 @@ struct processor_costs znver3_cost = { > > /* skylake_cost should produce code tuned for Skylake familly of CPUs. */ > static stringop_algs skylake_memcpy[2] = { > - {libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}}, > - {libcall, {{16, loop, false}, {512, unrolled_loop, false}, > - {-1, libcall, false}}}}; > + {libcall, > + {{256, rep_prefix_1_byte, true}, > + {256, loop, false}, > + {-1, libcall, false}}}, > + {libcall, > + {{256, rep_prefix_1_byte, true}, > + {256, loop, false}, > + {-1, libcall, false}}}}; > > static stringop_algs skylake_memset[2] = { > - {libcall, {{6, loop_1_byte, true}, > - {24, loop, true}, > - {8192, rep_prefix_4_byte, true}, > - {-1, libcall, false}}}, > - {libcall, {{24, loop, true}, {512, unrolled_loop, false}, > - {-1, libcall, false}}}}; > + {libcall, > + {{256, rep_prefix_1_byte, true}, > + {256, loop, false}, > + {-1, libcall, false}}}, > + {libcall, > + {{256, rep_prefix_1_byte, true}, > + {256, loop, false}, > + {-1, libcall, false}}}}; > > static const > struct processor_costs skylake_cost = { > @@ -1889,7 +1896,7 @@ struct processor_costs skylake_cost = { > COSTS_N_INSNS (0), /* cost of movzx */ > 8, /* "large" insn */ > 17, /* MOVE_RATIO */ > - 6, /* CLEAR_RATIO */ > + 17, /* CLEAR_RATIO */ > {4, 4, 4}, /* cost of loading integer registers > in QImode, HImode and SImode. > Relative to reg-reg move (2). */ > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def > index 134916cc972..eb057a67750 100644 > --- a/gcc/config/i386/x86-tune.def > +++ b/gcc/config/i386/x86-tune.def > @@ -273,8 +273,7 @@ DEF_TUNE (X86_TUNE_SINGLE_STRINGOP, "single_stringop", m_386 | m_P4_NOCONA) > move/set sequences of bytes with known size. */ > DEF_TUNE (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB, > "prefer_known_rep_movsb_stosb", > - m_CANNONLAKE | m_ICELAKE_CLIENT | m_ICELAKE_SERVER | m_TIGERLAKE > - | m_ALDERLAKE | m_SAPPHIRERAPIDS) > + m_SKYLAKE | m_ALDERLAKE | m_CORE_AVX512) > > /* X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES: Enable generation of > compact prologues and epilogues by issuing a misaligned moves. 
This > diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-10.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-10.c > new file mode 100644 > index 00000000000..970aa741971 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-10.c > @@ -0,0 +1,11 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -march=skylake -mno-sse" } */ > +/* { dg-final { scan-assembler "jmp\tmemcpy" { target { ! ia32 } } } } */ > +/* { dg-final { scan-assembler "call\tmemcpy" { target ia32 } } } */ > +/* { dg-final { scan-assembler-not "rep movsb" } } */ > + > +void > +foo (char *dest, char *src) > +{ > + __builtin_memcpy (dest, src, 257); > +} > diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-11.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-11.c > new file mode 100644 > index 00000000000..b6041944630 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-11.c > @@ -0,0 +1,18 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -march=skylake" } */ > +/* { dg-final { scan-assembler-not "jmp\tmemcpy" { target { ! ia32 } } } } */ > +/* { dg-final { scan-assembler-not "call\tmemcpy" { target ia32 } } } */ > +/* { dg-final { scan-assembler-not "rep movsb" } } */ > + > +typedef unsigned char e_u8; > + > +#define MAXBC 8 > + > +void MixColumn(e_u8 a[4][MAXBC], e_u8 BC) > +{ > + e_u8 b[4][MAXBC]; > + int i, j; > + > + for(i = 0; i < 4; i++) > + for(j = 0; j < BC; j++) a[i][j] = b[i][j]; > +} > diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-9.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-9.c > new file mode 100644 > index 00000000000..b0dc7484d09 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-9.c > @@ -0,0 +1,9 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -march=skylake -mno-sse" } */ > +/* { dg-final { scan-assembler "rep movsb" } } */ > + > +void > +foo (char *dest, char *src) > +{ > + __builtin_memcpy (dest, src, 256); > +} > diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-7.c b/gcc/testsuite/gcc.target/i386/memset-strategy-7.c > new file mode 100644 > index 00000000000..07c2816910c > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/memset-strategy-7.c > @@ -0,0 +1,11 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -march=skylake -mno-sse" } */ > +/* { dg-final { scan-assembler "jmp\tmemset" { target { ! ia32 } } } } */ > +/* { dg-final { scan-assembler "call\tmemset" { target ia32 } } } */ > +/* { dg-final { scan-assembler-not "rep stosb" } } */ > + > +void > +foo (char *dest) > +{ > + __builtin_memset (dest, 0, 257); > +} > diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-8.c b/gcc/testsuite/gcc.target/i386/memset-strategy-8.c > new file mode 100644 > index 00000000000..52ea882c814 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/memset-strategy-8.c > @@ -0,0 +1,9 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -march=skylake -mno-sse" } */ > +/* { dg-final { scan-assembler "rep stosb" } } */ > + > +void > +foo (char *dest) > +{ > + __builtin_memset (dest, 0, 256); > +} > diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-9.c b/gcc/testsuite/gcc.target/i386/memset-strategy-9.c > new file mode 100644 > index 00000000000..d4db031958f > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/memset-strategy-9.c > @@ -0,0 +1,17 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -march=skylake" } */ > +/* { dg-final { scan-assembler-not "jmp\tmemset" { target { ! 
ia32 } } } } */ > +/* { dg-final { scan-assembler-not "call\tmemset" { target ia32 } } } */ > +/* { dg-final { scan-assembler-not "rep stosb" } } */ > + > +typedef unsigned char e_u8; > + > +#define MAXBC 8 > + > +void MixColumn(e_u8 a[4][MAXBC], e_u8 BC) > +{ > + int i, j; > + > + for(i = 0; i < 4; i++) > + for(j = 0; j < BC; j++) a[i][j] = 1; > +} > -- > 2.30.2 > If there are no objections, I will check it in on Wednesday. -- H.J. ^ permalink raw reply [flat|nested] 31+ messages in thread
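To make the quoted rules 1-3 concrete, here is roughly how the cutoffs play out under the new tuning (a sketch with invented function names; the comments describe the intended expansion, not guaranteed assembler output):

/* gcc -O2 -march=skylake */
#include <string.h>

void
copy_small (char *d, const char *s)
{
  memcpy (d, s, 96);   /* rule 1: constant size <= 256 with vector moves
                          available: by-pieces load/store pairs, no
                          branches, no call.  */
}

void
copy_bounded (char *d, const char *s, size_t n)
{
  if (n > 256)
    return;
  memcpy (d, s, n);    /* rule 2b: size not constant but known <= 256:
                          inline loop.  */
}

void
copy_large (char *d, const char *s)
{
  memcpy (d, s, 300);  /* rule 3: size > 256: call the libc memcpy.  */
}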
* Re: [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs 2021-04-05 13:45 ` H.J. Lu @ 2021-04-05 21:14 ` Jan Hubicka 2021-04-05 21:53 ` H.J. Lu 0 siblings, 1 reply; 31+ messages in thread From: Jan Hubicka @ 2021-04-05 21:14 UTC (permalink / raw) To: H.J. Lu; +Cc: GCC Patches, Hongtao Liu, Hongyu Wang > > /* skylake_cost should produce code tuned for Skylake familly of CPUs. */ > > static stringop_algs skylake_memcpy[2] = { > > - {libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}}, > > - {libcall, {{16, loop, false}, {512, unrolled_loop, false}, > > - {-1, libcall, false}}}}; > > + {libcall, > > + {{256, rep_prefix_1_byte, true}, > > + {256, loop, false}, > > + {-1, libcall, false}}}, > > + {libcall, > > + {{256, rep_prefix_1_byte, true}, > > + {256, loop, false}, > > + {-1, libcall, false}}}}; > > > > static stringop_algs skylake_memset[2] = { > > - {libcall, {{6, loop_1_byte, true}, > > - {24, loop, true}, > > - {8192, rep_prefix_4_byte, true}, > > - {-1, libcall, false}}}, > > - {libcall, {{24, loop, true}, {512, unrolled_loop, false}, > > - {-1, libcall, false}}}}; > > + {libcall, > > + {{256, rep_prefix_1_byte, true}, > > + {256, loop, false}, > > + {-1, libcall, false}}}, > > + {libcall, > > + {{256, rep_prefix_1_byte, true}, > > + {256, loop, false}, > > + {-1, libcall, false}}}}; > > > > If there are no objections, I will check it in on Wednesday. On my skylake notebook if I run the benchmarking script I get: jan@skylake:~/trunk/contrib> ./bench-stringop 64 640000000 gcc -march=native memcpy block size libcall rep1 noalg rep4 noalg rep8 noalg loop noalg unrl noalg sse noalg byte PGO dynamic BEST 8192000 0:00.23 0:00.21 0:00.21 0:00.21 0:00.21 0:00.22 0:00.24 0:00.28 0:00.22 0:00.20 0:00.21 0:00.19 0:00.19 0:00.77 0:00.18 0:00.18 0:00.19 sse 819200 0:00.09 0:00.18 0:00.18 0:00.18 0:00.18 0:00.18 0:00.20 0:00.19 0:00.16 0:00.15 0:00.16 0:00.13 0:00.14 0:00.63 0:00.09 0:00.09 0:00.09 libcall 81920 0:00.06 0:00.07 0:00.07 0:00.06 0:00.06 0:00.06 0:00.06 0:00.12 0:00.11 0:00.11 0:00.10 0:00.07 0:00.08 0:00.66 0:00.11 0:00.06 0:00.06 libcall 20480 0:00.06 0:00.07 0:00.05 0:00.06 0:00.07 0:00.07 0:00.08 0:00.14 0:00.14 0:00.10 0:00.11 0:00.06 0:00.07 0:01.11 0:00.07 0:00.09 0:00.05 rep1noalign 8192 0:00.06 0:00.05 0:00.04 0:00.05 0:00.06 0:00.07 0:00.07 0:00.12 0:00.15 0:00.11 0:00.10 0:00.06 0:00.06 0:00.64 0:00.06 0:00.05 0:00.04 rep1noalign 4096 0:00.05 0:00.05 0:00.05 0:00.06 0:00.07 0:00.05 0:00.05 0:00.09 0:00.14 0:00.11 0:00.10 0:00.07 0:00.06 0:00.61 0:00.05 0:00.07 0:00.05 libcall 2048 0:00.04 0:00.05 0:00.05 0:00.05 0:00.05 0:00.05 0:00.05 0:00.10 0:00.14 0:00.09 0:00.09 0:00.09 0:00.07 0:00.64 0:00.06 0:00.07 0:00.04 libcall 1024 0:00.06 0:00.08 0:00.08 0:00.10 0:00.11 0:00.06 0:00.06 0:00.12 0:00.15 0:00.09 0:00.09 0:00.16 0:00.09 0:00.63 0:00.05 0:00.06 0:00.06 libcall 512 0:00.06 0:00.07 0:00.08 0:00.12 0:00.08 0:00.10 0:00.09 0:00.13 0:00.16 0:00.10 0:00.10 0:00.28 0:00.18 0:00.66 0:00.13 0:00.08 0:00.06 libcall 256 0:00.10 0:00.12 0:00.11 0:00.14 0:00.11 0:00.12 0:00.13 0:00.14 0:00.16 0:00.13 0:00.12 0:00.49 0:00.30 0:00.68 0:00.14 0:00.12 0:00.10 libcall 128 0:00.15 0:00.19 0:00.18 0:00.20 0:00.19 0:00.20 0:00.18 0:00.19 0:00.21 0:00.17 0:00.15 0:00.49 0:00.43 0:00.72 0:00.17 0:00.17 0:00.15 libcall 64 0:00.29 0:00.28 0:00.29 0:00.33 0:00.33 0:00.34 0:00.29 0:00.25 0:00.29 0:00.26 0:00.26 0:01.01 0:00.97 0:01.13 0:00.32 0:00.28 0:00.25 loop 48 0:00.37 0:00.39 0:00.38 0:00.45 0:00.41 0:00.45 0:00.44 0:00.45 0:00.33 0:00.32 
0:00.33 0:02.21 0:02.22 0:00.87 0:00.32 0:00.31 0:00.32 unrl 32 0:00.54 0:00.52 0:00.50 0:00.60 0:00.62 0:00.61 0:00.52 0:00.42 0:00.43 0:00.40 0:00.42 0:01.18 0:01.16 0:01.14 0:00.39 0:00.40 0:00.40 unrl 24 0:00.71 0:00.74 0:00.77 0:00.83 0:00.78 0:00.81 0:00.75 0:00.52 0:00.52 0:00.52 0:00.50 0:02.28 0:02.27 0:00.94 0:00.49 0:00.50 0:00.50 unrlnoalign 16 0:00.97 0:01.03 0:01.20 0:01.52 0:01.37 0:01.84 0:01.10 0:00.90 0:00.86 0:00.79 0:00.77 0:01.27 0:01.32 0:01.25 0:00.91 0:00.91 0:00.77 unrlnoalign 14 0:01.35 0:01.37 0:01.39 0:01.76 0:01.44 0:01.53 0:01.58 0:01.01 0:00.99 0:00.94 0:00.94 0:01.34 0:01.29 0:01.28 0:01.01 0:00.99 0:00.94 unrl 12 0:01.48 0:01.55 0:01.55 0:01.70 0:01.55 0:02.01 0:01.52 0:01.11 0:01.07 0:01.02 0:01.04 0:02.21 0:02.25 0:01.19 0:01.11 0:01.10 0:01.02 unrl 10 0:01.73 0:01.90 0:01.88 0:02.05 0:01.86 0:02.09 0:01.78 0:01.32 0:01.41 0:01.25 0:01.23 0:02.46 0:02.25 0:01.36 0:01.50 0:01.38 0:01.23 unrlnoalign 8 0:02.22 0:02.17 0:02.18 0:02.43 0:02.09 0:02.55 0:01.92 0:01.54 0:01.46 0:01.38 0:01.38 0:01.51 0:01.62 0:01.54 0:01.55 0:01.55 0:01.38 unrl So indeed rep byte seems to consistently outperform rep4/rep8; however, the unrolled variant seems to be better than rep byte for small block sizes. Do you have some data showing blocks of size 8...256 to be faster with rep1 compared to the unrolled loop for perhaps more real-world benchmarks? The difference seems to get quite big for small blocks in the 8...16 byte range. I noticed that before and sort of concluded that it is probably the branch prediction playing relatively well for those small block sizes. On the other hand, winding up the relatively long unrolled loop is not very cool just to catch this case. Do you know which of the three changes (preferring rep movsb/stosb, CLEAR_RATIO and the algorithm choice changes) causes the two speedups on eembc? Honza > > -- > H.J. ^ permalink raw reply [flat|nested] 31+ messages in thread
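(For anyone reproducing this: the script is contrib/bench-stringop, and each cell of the table boils down to timing something like the loop below, rebuilt per cell with a different block size and with -mmemcpy-strategy=/-mstringop-strategy= forcing the algorithm under test. A simplified sketch, not the actual script:)

#include <string.h>

#define BLOCK 64            /* first script argument: block size */
#define TOTAL 640000000L    /* second script argument: bytes to move */

static char dst[BLOCK + 64], src[BLOCK + 64];

int
main (void)
{
  for (long i = 0; i < TOTAL / BLOCK; i++)
    {
      memcpy (dst, src, BLOCK);
      /* Keep the copy from being optimized away.  */
      __asm__ __volatile__ ("" : : "r" (dst), "r" (src) : "memory");
    }
  return 0;
}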
* Re: [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs 2021-04-05 21:14 ` Jan Hubicka @ 2021-04-05 21:53 ` H.J. Lu 2021-04-06 9:09 ` Hongyu Wang 0 siblings, 1 reply; 31+ messages in thread From: H.J. Lu @ 2021-04-05 21:53 UTC (permalink / raw) To: Jan Hubicka; +Cc: GCC Patches, Hongtao Liu, Hongyu Wang On Mon, Apr 5, 2021 at 2:14 PM Jan Hubicka <hubicka@ucw.cz> wrote: > > > > /* skylake_cost should produce code tuned for Skylake familly of CPUs. */ > > > static stringop_algs skylake_memcpy[2] = { > > > - {libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}}, > > > - {libcall, {{16, loop, false}, {512, unrolled_loop, false}, > > > - {-1, libcall, false}}}}; > > > + {libcall, > > > + {{256, rep_prefix_1_byte, true}, > > > + {256, loop, false}, > > > + {-1, libcall, false}}}, > > > + {libcall, > > > + {{256, rep_prefix_1_byte, true}, > > > + {256, loop, false}, > > > + {-1, libcall, false}}}}; > > > > > > static stringop_algs skylake_memset[2] = { > > > - {libcall, {{6, loop_1_byte, true}, > > > - {24, loop, true}, > > > - {8192, rep_prefix_4_byte, true}, > > > - {-1, libcall, false}}}, > > > - {libcall, {{24, loop, true}, {512, unrolled_loop, false}, > > > - {-1, libcall, false}}}}; > > > + {libcall, > > > + {{256, rep_prefix_1_byte, true}, > > > + {256, loop, false}, > > > + {-1, libcall, false}}}, > > > + {libcall, > > > + {{256, rep_prefix_1_byte, true}, > > > + {256, loop, false}, > > > + {-1, libcall, false}}}}; > > > > > > > If there are no objections, I will check it in on Wednesday. > > On my skylake notebook if I run the benchmarking script I get: > > jan@skylake:~/trunk/contrib> ./bench-stringop 64 640000000 gcc -march=native > memcpy > block size libcall rep1 noalg rep4 noalg rep8 noalg loop noalg unrl noalg sse noalg byte PGO dynamic BEST > 8192000 0:00.23 0:00.21 0:00.21 0:00.21 0:00.21 0:00.22 0:00.24 0:00.28 0:00.22 0:00.20 0:00.21 0:00.19 0:00.19 0:00.77 0:00.18 0:00.18 0:00.19 sse > 819200 0:00.09 0:00.18 0:00.18 0:00.18 0:00.18 0:00.18 0:00.20 0:00.19 0:00.16 0:00.15 0:00.16 0:00.13 0:00.14 0:00.63 0:00.09 0:00.09 0:00.09 libcall > 81920 0:00.06 0:00.07 0:00.07 0:00.06 0:00.06 0:00.06 0:00.06 0:00.12 0:00.11 0:00.11 0:00.10 0:00.07 0:00.08 0:00.66 0:00.11 0:00.06 0:00.06 libcall > 20480 0:00.06 0:00.07 0:00.05 0:00.06 0:00.07 0:00.07 0:00.08 0:00.14 0:00.14 0:00.10 0:00.11 0:00.06 0:00.07 0:01.11 0:00.07 0:00.09 0:00.05 rep1noalign > 8192 0:00.06 0:00.05 0:00.04 0:00.05 0:00.06 0:00.07 0:00.07 0:00.12 0:00.15 0:00.11 0:00.10 0:00.06 0:00.06 0:00.64 0:00.06 0:00.05 0:00.04 rep1noalign > 4096 0:00.05 0:00.05 0:00.05 0:00.06 0:00.07 0:00.05 0:00.05 0:00.09 0:00.14 0:00.11 0:00.10 0:00.07 0:00.06 0:00.61 0:00.05 0:00.07 0:00.05 libcall > 2048 0:00.04 0:00.05 0:00.05 0:00.05 0:00.05 0:00.05 0:00.05 0:00.10 0:00.14 0:00.09 0:00.09 0:00.09 0:00.07 0:00.64 0:00.06 0:00.07 0:00.04 libcall > 1024 0:00.06 0:00.08 0:00.08 0:00.10 0:00.11 0:00.06 0:00.06 0:00.12 0:00.15 0:00.09 0:00.09 0:00.16 0:00.09 0:00.63 0:00.05 0:00.06 0:00.06 libcall > 512 0:00.06 0:00.07 0:00.08 0:00.12 0:00.08 0:00.10 0:00.09 0:00.13 0:00.16 0:00.10 0:00.10 0:00.28 0:00.18 0:00.66 0:00.13 0:00.08 0:00.06 libcall > 256 0:00.10 0:00.12 0:00.11 0:00.14 0:00.11 0:00.12 0:00.13 0:00.14 0:00.16 0:00.13 0:00.12 0:00.49 0:00.30 0:00.68 0:00.14 0:00.12 0:00.10 libcall > 128 0:00.15 0:00.19 0:00.18 0:00.20 0:00.19 0:00.20 0:00.18 0:00.19 0:00.21 0:00.17 0:00.15 0:00.49 0:00.43 0:00.72 0:00.17 0:00.17 0:00.15 libcall > 64 0:00.29 0:00.28 0:00.29 0:00.33 0:00.33 0:00.34 0:00.29 
0:00.25 0:00.29 0:00.26 0:00.26 0:01.01 0:00.97 0:01.13 0:00.32 0:00.28 0:00.25 loop > 48 0:00.37 0:00.39 0:00.38 0:00.45 0:00.41 0:00.45 0:00.44 0:00.45 0:00.33 0:00.32 0:00.33 0:02.21 0:02.22 0:00.87 0:00.32 0:00.31 0:00.32 unrl > 32 0:00.54 0:00.52 0:00.50 0:00.60 0:00.62 0:00.61 0:00.52 0:00.42 0:00.43 0:00.40 0:00.42 0:01.18 0:01.16 0:01.14 0:00.39 0:00.40 0:00.40 unrl > 24 0:00.71 0:00.74 0:00.77 0:00.83 0:00.78 0:00.81 0:00.75 0:00.52 0:00.52 0:00.52 0:00.50 0:02.28 0:02.27 0:00.94 0:00.49 0:00.50 0:00.50 unrlnoalign > 16 0:00.97 0:01.03 0:01.20 0:01.52 0:01.37 0:01.84 0:01.10 0:00.90 0:00.86 0:00.79 0:00.77 0:01.27 0:01.32 0:01.25 0:00.91 0:00.91 0:00.77 unrlnoalign > 14 0:01.35 0:01.37 0:01.39 0:01.76 0:01.44 0:01.53 0:01.58 0:01.01 0:00.99 0:00.94 0:00.94 0:01.34 0:01.29 0:01.28 0:01.01 0:00.99 0:00.94 unrl > 12 0:01.48 0:01.55 0:01.55 0:01.70 0:01.55 0:02.01 0:01.52 0:01.11 0:01.07 0:01.02 0:01.04 0:02.21 0:02.25 0:01.19 0:01.11 0:01.10 0:01.02 unrl > 10 0:01.73 0:01.90 0:01.88 0:02.05 0:01.86 0:02.09 0:01.78 0:01.32 0:01.41 0:01.25 0:01.23 0:02.46 0:02.25 0:01.36 0:01.50 0:01.38 0:01.23 unrlnoalign > 8 0:02.22 0:02.17 0:02.18 0:02.43 0:02.09 0:02.55 0:01.92 0:01.54 0:01.46 0:01.38 0:01.38 0:01.51 0:01.62 0:01.54 0:01.55 0:01.55 0:01.38 unrl > So indeed rep byte seems consistently outperforming rep4/rep8 however > urolled variant seems to be better than rep byte for small block sizes. My patch generates "rep movsb" only in very limited cases: 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector load and store for up to 16 * 16 (256) bytes when the data size is fixed and known. 2. Inline only if data size is known to be <= 256. a. Use "rep movsb/stosb" with a simple code sequence if the data size is a constant. b. Use loop if data size is not a constant. As a result, "rep stosb" is generated only when 128 < data size < 256 with -mno-sse. > Do you have some data for blocks in size 8...256 to be faster with rep1 > compared to unrolled loop for perhaps more real world benchmarks? "rep movsb" isn't generated with my patch in this case since MOVE_RATIO == 17 can copy up to 16 * 16 (256) bytes with XMM registers. > The difference seems to get quite big for small locks in range 8...16 > bytes. I noticed that before and sort of conlcuded that it is probably > the branch prediction playing relatively well for those small block > sizes. On the other hand winding up the relatively long unrolled loop is > not very cool just to catch this case. > > Do you know what of the three changes (preferring reps/stosb, > CLEAR_RATIO and algorithm choice changes) cause the two speedups > on eebmc? Hongyu, can you find out where the speedup came from? Thanks. -- H.J. ^ permalink raw reply [flat|nested] 31+ messages in thread
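Put differently: with -mno-sse the by-pieces expansion only has 8-byte integer registers, so a ratio of 17 covers at most 16 * 8 = 128 bytes, and the rep-prefixed sequence fills the gap up to 256. A sketch of the three regimes (sizes chosen per the new tests; comments give the expected expansion):

/* gcc -O2 -march=skylake -mno-sse */
void set_128 (char *d) { __builtin_memset (d, 0, 128); }  /* by-pieces GPR stores */
void set_200 (char *d) { __builtin_memset (d, 0, 200); }  /* 128 < n <= 256: "rep stosb" */
void set_300 (char *d) { __builtin_memset (d, 0, 300); }  /* n > 256: call memset */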
* Re: [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs 2021-04-05 21:53 ` H.J. Lu @ 2021-04-06 9:09 ` Hongyu Wang 2021-04-06 9:51 ` Jan Hubicka 0 siblings, 1 reply; 31+ messages in thread From: Hongyu Wang @ 2021-04-06 9:09 UTC (permalink / raw) To: H.J. Lu; +Cc: Jan Hubicka, Hongtao Liu, GCC Patches, Hongyu Wang > Do you know what of the three changes (preferring reps/stosb, > CLEAR_RATIO and algorithm choice changes) cause the two speedups > on eebmc? An extracted testcase from nnet_test is at https://godbolt.org/z/c8KdsohTP This loop is transformed to builtin_memcpy and builtin_memset with size 280. The current strategy for skylake is {512, unrolled_loop, false} for such a size, so it will generate unrolled loops with mov, while the patch generates a memcpy/memset libcall and uses vector moves. For idctrn01 it is memset with size 512. So the speedups come from the algorithm change. H.J. Lu via Gcc-patches <gcc-patches@gcc.gnu.org> wrote on Tue, Apr 6, 2021 at 5:55 AM: > > On Mon, Apr 5, 2021 at 2:14 PM Jan Hubicka <hubicka@ucw.cz> wrote: > > > > > > /* skylake_cost should produce code tuned for Skylake familly of CPUs. */ > > > > static stringop_algs skylake_memcpy[2] = { > > > > - {libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}}, > > > > - {libcall, {{16, loop, false}, {512, unrolled_loop, false}, > > > > - {-1, libcall, false}}}}; > > > > + {libcall, > > > > + {{256, rep_prefix_1_byte, true}, > > > > + {256, loop, false}, > > > > + {-1, libcall, false}}}, > > > > + {libcall, > > > > + {{256, rep_prefix_1_byte, true}, > > > > + {256, loop, false}, > > > > + {-1, libcall, false}}}}; > > > > > > > > static stringop_algs skylake_memset[2] = { > > > > - {libcall, {{6, loop_1_byte, true}, > > > > - {24, loop, true}, > > > > - {8192, rep_prefix_4_byte, true}, > > > > - {-1, libcall, false}}}, > > > > - {libcall, {{24, loop, true}, {512, unrolled_loop, false}, > > > > - {-1, libcall, false}}}}; > > > > + {libcall, > > > > + {{256, rep_prefix_1_byte, true}, > > > > + {256, loop, false}, > > > > + {-1, libcall, false}}}, > > > > + {libcall, > > > > + {{256, rep_prefix_1_byte, true}, > > > > + {256, loop, false}, > > > > + {-1, libcall, false}}}}; > > > > > > > > > > If there are no objections, I will check it in on Wednesday.
> > > > On my skylake notebook if I run the benchmarking script I get: > > > > jan@skylake:~/trunk/contrib> ./bench-stringop 64 640000000 gcc -march=native > > memcpy > > block size libcall rep1 noalg rep4 noalg rep8 noalg loop noalg unrl noalg sse noalg byte PGO dynamic BEST > > 8192000 0:00.23 0:00.21 0:00.21 0:00.21 0:00.21 0:00.22 0:00.24 0:00.28 0:00.22 0:00.20 0:00.21 0:00.19 0:00.19 0:00.77 0:00.18 0:00.18 0:00.19 sse > > 819200 0:00.09 0:00.18 0:00.18 0:00.18 0:00.18 0:00.18 0:00.20 0:00.19 0:00.16 0:00.15 0:00.16 0:00.13 0:00.14 0:00.63 0:00.09 0:00.09 0:00.09 libcall > > 81920 0:00.06 0:00.07 0:00.07 0:00.06 0:00.06 0:00.06 0:00.06 0:00.12 0:00.11 0:00.11 0:00.10 0:00.07 0:00.08 0:00.66 0:00.11 0:00.06 0:00.06 libcall > > 20480 0:00.06 0:00.07 0:00.05 0:00.06 0:00.07 0:00.07 0:00.08 0:00.14 0:00.14 0:00.10 0:00.11 0:00.06 0:00.07 0:01.11 0:00.07 0:00.09 0:00.05 rep1noalign > > 8192 0:00.06 0:00.05 0:00.04 0:00.05 0:00.06 0:00.07 0:00.07 0:00.12 0:00.15 0:00.11 0:00.10 0:00.06 0:00.06 0:00.64 0:00.06 0:00.05 0:00.04 rep1noalign > > 4096 0:00.05 0:00.05 0:00.05 0:00.06 0:00.07 0:00.05 0:00.05 0:00.09 0:00.14 0:00.11 0:00.10 0:00.07 0:00.06 0:00.61 0:00.05 0:00.07 0:00.05 libcall > > 2048 0:00.04 0:00.05 0:00.05 0:00.05 0:00.05 0:00.05 0:00.05 0:00.10 0:00.14 0:00.09 0:00.09 0:00.09 0:00.07 0:00.64 0:00.06 0:00.07 0:00.04 libcall > > 1024 0:00.06 0:00.08 0:00.08 0:00.10 0:00.11 0:00.06 0:00.06 0:00.12 0:00.15 0:00.09 0:00.09 0:00.16 0:00.09 0:00.63 0:00.05 0:00.06 0:00.06 libcall > > 512 0:00.06 0:00.07 0:00.08 0:00.12 0:00.08 0:00.10 0:00.09 0:00.13 0:00.16 0:00.10 0:00.10 0:00.28 0:00.18 0:00.66 0:00.13 0:00.08 0:00.06 libcall > > 256 0:00.10 0:00.12 0:00.11 0:00.14 0:00.11 0:00.12 0:00.13 0:00.14 0:00.16 0:00.13 0:00.12 0:00.49 0:00.30 0:00.68 0:00.14 0:00.12 0:00.10 libcall > > 128 0:00.15 0:00.19 0:00.18 0:00.20 0:00.19 0:00.20 0:00.18 0:00.19 0:00.21 0:00.17 0:00.15 0:00.49 0:00.43 0:00.72 0:00.17 0:00.17 0:00.15 libcall > > 64 0:00.29 0:00.28 0:00.29 0:00.33 0:00.33 0:00.34 0:00.29 0:00.25 0:00.29 0:00.26 0:00.26 0:01.01 0:00.97 0:01.13 0:00.32 0:00.28 0:00.25 loop > > 48 0:00.37 0:00.39 0:00.38 0:00.45 0:00.41 0:00.45 0:00.44 0:00.45 0:00.33 0:00.32 0:00.33 0:02.21 0:02.22 0:00.87 0:00.32 0:00.31 0:00.32 unrl > > 32 0:00.54 0:00.52 0:00.50 0:00.60 0:00.62 0:00.61 0:00.52 0:00.42 0:00.43 0:00.40 0:00.42 0:01.18 0:01.16 0:01.14 0:00.39 0:00.40 0:00.40 unrl > > 24 0:00.71 0:00.74 0:00.77 0:00.83 0:00.78 0:00.81 0:00.75 0:00.52 0:00.52 0:00.52 0:00.50 0:02.28 0:02.27 0:00.94 0:00.49 0:00.50 0:00.50 unrlnoalign > > 16 0:00.97 0:01.03 0:01.20 0:01.52 0:01.37 0:01.84 0:01.10 0:00.90 0:00.86 0:00.79 0:00.77 0:01.27 0:01.32 0:01.25 0:00.91 0:00.91 0:00.77 unrlnoalign > > 14 0:01.35 0:01.37 0:01.39 0:01.76 0:01.44 0:01.53 0:01.58 0:01.01 0:00.99 0:00.94 0:00.94 0:01.34 0:01.29 0:01.28 0:01.01 0:00.99 0:00.94 unrl > > 12 0:01.48 0:01.55 0:01.55 0:01.70 0:01.55 0:02.01 0:01.52 0:01.11 0:01.07 0:01.02 0:01.04 0:02.21 0:02.25 0:01.19 0:01.11 0:01.10 0:01.02 unrl > > 10 0:01.73 0:01.90 0:01.88 0:02.05 0:01.86 0:02.09 0:01.78 0:01.32 0:01.41 0:01.25 0:01.23 0:02.46 0:02.25 0:01.36 0:01.50 0:01.38 0:01.23 unrlnoalign > > 8 0:02.22 0:02.17 0:02.18 0:02.43 0:02.09 0:02.55 0:01.92 0:01.54 0:01.46 0:01.38 0:01.38 0:01.51 0:01.62 0:01.54 0:01.55 0:01.55 0:01.38 unrl > > So indeed rep byte seems consistently outperforming rep4/rep8 however > > urolled variant seems to be better than rep byte for small block sizes. > > My patch generates "rep movsb" only in a very limited cases: > > 1. 
With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector > load and store for up to 16 * 16 (256) bytes when the data size is > fixed and known. > 2. Inline only if data size is known to be <= 256. > a. Use "rep movsb/stosb" with a simple code sequence if the data size > is a constant. > b. Use loop if data size is not a constant. > > As a result, "rep stosb" is generated only when 128 < data size < 256 > with -mno-sse. > > > Do you have some data for blocks in size 8...256 to be faster with rep1 > > compared to unrolled loop for perhaps more real world benchmarks? > > "rep movsb" isn't generated with my patch in this case since > MOVE_RATIO == 17 can copy up to 16 * 16 (256) bytes with > XMM registers. > > > The difference seems to get quite big for small locks in range 8...16 > > bytes. I noticed that before and sort of conlcuded that it is probably > > the branch prediction playing relatively well for those small block > > sizes. On the other hand winding up the relatively long unrolled loop is > > not very cool just to catch this case. > > > > Do you know what of the three changes (preferring reps/stosb, > > CLEAR_RATIO and algorithm choice changes) cause the two speedups > > on eebmc? > > Hongyu, can you find out where the speedup came from? > > Thanks. > > -- > H.J. ^ permalink raw reply [flat|nested] 31+ messages in thread
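In case the godbolt link goes stale: the extracted kernel is along the lines below (a reconstruction from the description above — per-iteration copies/clears of 280 bytes that loop distribution turns into __builtin_memcpy/__builtin_memset — not the exact reduced source):

#define N 35    /* 35 * sizeof (double) == 280 bytes */

void
copy_weights (double *dst, const double *src)
{
  for (int i = 0; i < N; i++)
    dst[i] = src[i];   /* becomes __builtin_memcpy (dst, src, 280) */
}

void
clear_weights (double *dst)
{
  for (int i = 0; i < N; i++)
    dst[i] = 0.0;      /* becomes __builtin_memset (dst, 0, 280) */
}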
* Re: [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs 2021-04-06 9:09 ` Hongyu Wang @ 2021-04-06 9:51 ` Jan Hubicka 2021-04-06 12:34 ` H.J. Lu 0 siblings, 1 reply; 31+ messages in thread From: Jan Hubicka @ 2021-04-06 9:51 UTC (permalink / raw) To: Hongyu Wang; +Cc: H.J. Lu, Hongtao Liu, GCC Patches, Hongyu Wang > > Do you know what of the three changes (preferring reps/stosb, > > CLEAR_RATIO and algorithm choice changes) cause the two speedups > > on eebmc? > > A extracted testcase from nnet_test in https://godbolt.org/z/c8KdsohTP > > This loop is transformed to builtin_memcpy and builtin_memset with size 280. > > Current strategy for skylake is {512, unrolled_loop, false} for such > size, so it will generate unrolled loops with mov, while the patch > generates memcpy/memset libcall and uses vector move. This is good - I originally set the table based on this micro-benchmarking script and apparently the glibc used at that time had a more expensive memcpy for small blocks. One thing to consider, however, is that calling external memcpy also has the additional cost of clobbering all caller-saved registers. Especially for code that uses SSE this is painful since everything needs to go to the stack in that case. So I am not completely sure how representative the micro-benchmark is in this respect, since it does not use any SSE and register pressure is generally small. So with current glibc it seems a libcall is a win for blocks of size greater than 64 or 128, at least if the register pressure is not big. In this respect your change looks good. > > > > My patch generates "rep movsb" only in a very limited cases: > > > > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector > > load and store for up to 16 * 16 (256) bytes when the data size is > > fixed and known. > > 2. Inline only if data size is known to be <= 256. > > a. Use "rep movsb/stosb" with a simple code sequence if the data size > > is a constant. > > b. Use loop if data size is not a constant. Aha, this is very hard to read from the algorithm descriptor. So we still have the check that maxsize==minsize and use rep movsb only for constant-sized blocks when the corresponding TARGET macro is defined. I think it would be more readable if we introduced rep_1_byte_constant. The descriptor is supposed to read as a sequence of rules where the first match applies. It is not obvious that we have another TARGET_* macro that makes rep_1_byte be ignored in some cases. (The TARGET macro will also interfere with the micro-benchmarking script.) Still I do not understand why a compile-time constant makes rep movsb/stosb better than a loop. Is it the CPU special-casing it at decode time and requiring an explicit mov instruction? Or is it only because rep movsb is not good for blocks smaller than 128 bits? > > > > As a result, "rep stosb" is generated only when 128 < data size < 256 > > with -mno-sse. > > > > > Do you have some data for blocks in size 8...256 to be faster with rep1 > > > compared to unrolled loop for perhaps more real world benchmarks? > > > > "rep movsb" isn't generated with my patch in this case since > > MOVE_RATIO == 17 can copy up to 16 * 16 (256) bytes with > > XMM registers. OK, so I guess: {libcall, {{256, rep_1_byte, true}, {256, unrolled_loop, false}, {-1, libcall, false}}}, {libcall, {{256, rep_1_loop, true}, {256, unrolled_loop, false}, {-1, libcall, false}}}}; may still perform better, but the difference between loop and unrolled loop is within a 10% margin.
So I guess the patch is OK and we should look into cleaning up the descriptors. I can make a patch for that once I understand the logic above. Honza > > > > > The difference seems to get quite big for small locks in range 8...16 > > > bytes. I noticed that before and sort of conlcuded that it is probably > > > the branch prediction playing relatively well for those small block > > > sizes. On the other hand winding up the relatively long unrolled loop is > > > not very cool just to catch this case. > > > > > > Do you know what of the three changes (preferring reps/stosb, > > > CLEAR_RATIO and algorithm choice changes) cause the two speedups > > > on eebmc? > > > > Hongyu, can you find out where the speedup came from? > > > > Thanks. > > > > -- > > H.J. ^ permalink raw reply [flat|nested] 31+ messages in thread
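Spelling out the rep_1_byte_constant idea (hypothetical — no such enumerator exists today; this only makes the proposal above concrete), the constant/variable split would become visible in the table itself instead of hiding behind X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB:

static stringop_algs skylake_memcpy[2] = {
  {libcall,
   {{256, rep_1_byte_constant, true},  /* only if size is a compile-time constant */
    {256, loop, false},                /* other sizes known <= 256 */
    {-1, libcall, false}}},            /* everything else */
  {libcall,
   {{256, rep_1_byte_constant, true},
    {256, loop, false},
    {-1, libcall, false}}}};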
* Re: [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs 2021-04-06 9:51 ` Jan Hubicka @ 2021-04-06 12:34 ` H.J. Lu 0 siblings, 0 replies; 31+ messages in thread From: H.J. Lu @ 2021-04-06 12:34 UTC (permalink / raw) To: Jan Hubicka; +Cc: Hongyu Wang, Hongtao Liu, GCC Patches, Hongyu Wang On Tue, Apr 6, 2021 at 2:51 AM Jan Hubicka <hubicka@ucw.cz> wrote: > > > > Do you know what of the three changes (preferring reps/stosb, > > > CLEAR_RATIO and algorithm choice changes) cause the two speedups > > > on eebmc? > > > > A extracted testcase from nnet_test in https://godbolt.org/z/c8KdsohTP > > > > This loop is transformed to builtin_memcpy and builtin_memset with size 280. > > > > Current strategy for skylake is {512, unrolled_loop, false} for such > > size, so it will generate unrolled loops with mov, while the patch > > generates memcpy/memset libcall and uses vector move. > > This is good - I originally set the table based on this > micro-benchmarking script and apparently glibc used at that time had > more expensive memcpy for small blocks. > > One thing to consider is, however, that calling external memcpy has also > additional cost of clobbering all caller saved registers. Especially > for code that uses SSE this is painful since all needs to go to stack in > that case. So I am not completely sure how representative the > micro-benchmark is to this respect since it does not use any SSE and > register pressure is generally small. > > So with current glibc it seems libcall is win for blocks of size greater > than 64 or 128 at least if the register pressure is not big. > With this respect your change looks good. > > > > > > My patch generates "rep movsb" only in a very limited cases: > > > > > > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector > > > load and store for up to 16 * 16 (256) bytes when the data size is > > > fixed and known. > > > 2. Inline only if data size is known to be <= 256. > > > a. Use "rep movsb/stosb" with a simple code sequence if the data size > > > is a constant. > > > b. Use loop if data size is not a constant. > > Aha, this is very hard to read from the algorithm descriptor. So we > still have the check that maxsize==minsize and use rep mosb only for > constant sized blocks when the corresponding TARGET macro is defined. > > I think it would be more readable if we introduced rep_1_byte_constant. > The descriptor is supposed to read as a sequence of rules where fist > applies. It is not obvious that we have another TARGET_* macro that > makes rep_1_byte to be ignored in some cases. > (TARGET macro will also interfere with the microbenchmarking script). > > Still I do not understand why compile time constant makes rep mosb/stosb > better than loop. Is it CPU special casing it at decoder time and > requiring explicit mov instruction? Or is it only becuase rep mosb is > not good for blocks smaller than 128bit? Non-constant "rep movsb" triggers more machine clear events (https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/mo-machine-clear-overhead.html) in hot loops of some workloads. > > > > > > As a result, "rep stosb" is generated only when 128 < data size < 256 > > > with -mno-sse. > > > > > > > Do you have some data for blocks in size 8...256 to be faster with rep1 > > > > compared to unrolled loop for perhaps more real world benchmarks?
> > > > > > "rep movsb" isn't generated with my patch in this case since > > > MOVE_RATIO == 17 can copy up to 16 * 16 (256) bytes with > > > XMM registers. > > OK, so I guess: > {libcall, > {{256, rep_1_byte, true}, > {256, unrolled_loop, false}, > {-1, libcall, false}}}, > {libcall, > {{256, rep_1_loop, true}, > {256, unrolled_loop, false}, > {-1, libcall, false}}}}; > > may still perform better but the differnece between loop and unrolled > loop is within 10% margin.. > > So i guess patch is OK and we should look into cleaning up the > descriptors. I can make patch for that once I understand the logic above. I am checking in my patch. We can improve it for GCC 12. We will also revisit https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90773 for GCC 12. Thanks. -- H.J. ^ permalink raw reply [flat|nested] 31+ messages in thread
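The shape behind that observation is a hot loop issuing "rep movsb" with a length the hardware only sees at run time, e.g. (a sketch of the pattern only; whether a given workload actually takes machine clears depends on how the stores overlap with nearby loads):

void
pack_rows (char *out, const char *const *rows, const unsigned *lens, int n)
{
  for (int i = 0; i < n; i++)
    {
      /* If this variable-size copy were expanded as "rep movsb",
         closely following loads that alias the destination could be
         hit by memory-ordering machine clears; a libcall or a plain
         loop avoids that corner.  */
      __builtin_memcpy (out, rows[i], lens[i]);
      out += lens[i];
    }
}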
* [PATCH 3/3] x86: Update memcpy/memset inline strategies for -mtune=generic 2021-03-22 13:16 [PATCH 0/3] x86: Update memcpy/memset inline strategies H.J. Lu 2021-03-22 13:16 ` [PATCH 1/3] x86: Update memcpy/memset inline strategies for Ice Lake H.J. Lu 2021-03-22 13:16 ` [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs H.J. Lu @ 2021-03-22 13:16 ` H.J. Lu 2021-03-22 13:29 ` Richard Biener 2 siblings, 1 reply; 31+ messages in thread From: H.J. Lu @ 2021-03-22 13:16 UTC (permalink / raw) To: gcc-patches; +Cc: Jan Hubicka, Uros Bizjak, Hongtao Liu, Hongyu Wang Simplify memcpy and memset inline strategies to avoid branches for -mtune=generic: 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector load and store for up to 16 * 16 (256) bytes when the data size is fixed and known. 2. Inline only if data size is known to be <= 256. a. Use "rep movsb/stosb" with a simple code sequence if the data size is a constant. b. Use loop if data size is not a constant. 3. Use the memcpy/memset library function if data size is unknown or > 256. With -mtune=generic -O2, 1. On Ice Lake processor, Performance impacts on SPEC CPU 2017: 500.perlbench_r 0.51% 502.gcc_r 0.55% 505.mcf_r 0.38% 520.omnetpp_r -0.74% 523.xalancbmk_r -0.35% 525.x264_r 2.99% 531.deepsjeng_r -0.17% 541.leela_r -0.98% 548.exchange2_r 0.89% 557.xz_r 0.70% Geomean 0.37% 503.bwaves_r 0.04% 507.cactuBSSN_r -0.01% 508.namd_r -0.45% 510.parest_r -0.09% 511.povray_r -1.37% 519.lbm_r 0.00% 521.wrf_r -2.56% 526.blender_r -0.01% 527.cam4_r -0.05% 538.imagick_r 0.36% 544.nab_r 0.08% 549.fotonik3d_r -0.06% 554.roms_r 0.05% Geomean -0.34% Significant impacts on eembc benchmarks: eembc/nnet_test 14.85% eembc/mp2decoddata2 13.57% 2. On Cascadelake processor, Performance impacts on SPEC CPU 2017: 500.perlbench_r -0.02% 502.gcc_r 0.10% 505.mcf_r -1.14% 520.omnetpp_r -0.22% 523.xalancbmk_r 0.21% 525.x264_r 0.94% 531.deepsjeng_r -0.37% 541.leela_r -0.46% 548.exchange2_r -0.40% 557.xz_r 0.60% Geomean -0.08% 503.bwaves_r -0.50% 507.cactuBSSN_r 0.05% 508.namd_r -0.02% 510.parest_r 0.09% 511.povray_r -1.35% 519.lbm_r 0.00% 521.wrf_r -0.03% 526.blender_r -0.83% 527.cam4_r 1.23% 538.imagick_r 0.97% 544.nab_r -0.02% 549.fotonik3d_r -0.12% 554.roms_r 0.55% Geomean 0.00% Significant impacts on eembc benchmarks: eembc/nnet_test 9.90% eembc/mp2decoddata2 16.42% eembc/textv2data3 -4.86% eembc/qos 12.90% 3. On Znver3 processor, Performance impacts on SPEC CPU 2017: 500.perlbench_r -0.96% 502.gcc_r -1.06% 505.mcf_r -0.01% 520.omnetpp_r -1.45% 523.xalancbmk_r 2.89% 525.x264_r 4.98% 531.deepsjeng_r 0.18% 541.leela_r -1.54% 548.exchange2_r -1.25% 557.xz_r -0.01% Geomean 0.16% 503.bwaves_r 0.04% 507.cactuBSSN_r 0.85% 508.namd_r -0.13% 510.parest_r 0.39% 511.povray_r 0.00% 519.lbm_r 0.00% 521.wrf_r 0.28% 526.blender_r -0.10% 527.cam4_r -0.58% 538.imagick_r 0.69% 544.nab_r -0.04% 549.fotonik3d_r -0.04% 554.roms_r 0.40% Geomean 0.15% Significant impacts on eembc benchmarks: eembc/aifftr01 13.95% eembc/idctrn01 8.41% eembc/nnet_test 30.25% eembc/mp2decoddata2 5.05% eembc/textv2data3 6.43% eembc/qos -5.79% gcc/ * config/i386/x86-tune-costs.h (generic_memcpy): Updated. (generic_memset): Likewise. (generic_cost): Change CLEAR_RATIO to 17. * config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB): Add m_GENERIC. gcc/testsuite/ * gcc.target/i386/memcpy-strategy-12.c: New test. * gcc.target/i386/memcpy-strategy-13.c: Likewise. * gcc.target/i386/memset-strategy-10.c: Likewise. * gcc.target/i386/memset-strategy-11.c: Likewise.
* gcc.target/i386/shrink_wrap_1.c: Also pass -mmemset-strategy=rep_8byte:-1:align. * gcc.target/i386/sw-1.c: Also pass -mstringop-strategy=rep_byte. --- gcc/config/i386/x86-tune-costs.h | 31 ++++++++++++------- gcc/config/i386/x86-tune.def | 2 +- .../gcc.target/i386/memcpy-strategy-12.c | 9 ++++++ .../gcc.target/i386/memcpy-strategy-13.c | 11 +++++++ .../gcc.target/i386/memset-strategy-10.c | 11 +++++++ .../gcc.target/i386/memset-strategy-11.c | 9 ++++++ gcc/testsuite/gcc.target/i386/shrink_wrap_1.c | 2 +- gcc/testsuite/gcc.target/i386/sw-1.c | 2 +- 8 files changed, 63 insertions(+), 14 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-10.c create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-11.c diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h index ffe810f2bcb..30e7c3e4261 100644 --- a/gcc/config/i386/x86-tune-costs.h +++ b/gcc/config/i386/x86-tune-costs.h @@ -2844,19 +2844,28 @@ struct processor_costs intel_cost = { "16", /* Func alignment. */ }; -/* Generic should produce code tuned for Core-i7 (and newer chips) - and btver1 (and newer chips). */ +/* Generic should produce code tuned for Haswell (and newer chips) + and znver1 (and newer chips). NB: rep_prefix_1_byte is used only + for known size. */ static stringop_algs generic_memcpy[2] = { - {libcall, {{32, loop, false}, {8192, rep_prefix_4_byte, false}, - {-1, libcall, false}}}, - {libcall, {{32, loop, false}, {8192, rep_prefix_8_byte, false}, - {-1, libcall, false}}}}; + {libcall, + {{256, rep_prefix_1_byte, true}, + {256, loop, false}, + {-1, libcall, false}}}, + {libcall, + {{256, rep_prefix_1_byte, true}, + {256, loop, false}, + {-1, libcall, false}}}}; static stringop_algs generic_memset[2] = { - {libcall, {{32, loop, false}, {8192, rep_prefix_4_byte, false}, - {-1, libcall, false}}}, - {libcall, {{32, loop, false}, {8192, rep_prefix_8_byte, false}, - {-1, libcall, false}}}}; + {libcall, + {{256, rep_prefix_1_byte, true}, + {256, loop, false}, + {-1, libcall, false}}}, + {libcall, + {{256, rep_prefix_1_byte, true}, + {256, loop, false}, + {-1, libcall, false}}}}; static const struct processor_costs generic_cost = { { @@ -2913,7 +2922,7 @@ struct processor_costs generic_cost = { COSTS_N_INSNS (1), /* cost of movzx */ 8, /* "large" insn */ 17, /* MOVE_RATIO */ - 6, /* CLEAR_RATIO */ + 17, /* CLEAR_RATIO */ {6, 6, 6}, /* cost of loading integer registers in QImode, HImode and SImode. Relative to reg-reg move (2). */ diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def index eb057a67750..fd9c011a3f5 100644 --- a/gcc/config/i386/x86-tune.def +++ b/gcc/config/i386/x86-tune.def @@ -273,7 +273,7 @@ DEF_TUNE (X86_TUNE_SINGLE_STRINGOP, "single_stringop", m_386 | m_P4_NOCONA) move/set sequences of bytes with known size. */ DEF_TUNE (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB, "prefer_known_rep_movsb_stosb", - m_SKYLAKE | m_ALDERLAKE | m_CORE_AVX512) + m_SKYLAKE | m_ALDERLAKE | m_CORE_AVX512 | m_GENERIC) /* X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES: Enable generation of compact prologues and epilogues by issuing a misaligned moves. 
This diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c new file mode 100644 index 00000000000..87f03352736 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c @@ -0,0 +1,9 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mtune=generic" } */ +/* { dg-final { scan-assembler "rep movsb" } } */ + +void +foo (char *dest, char *src) +{ + __builtin_memcpy (dest, src, 249); +} diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c new file mode 100644 index 00000000000..cfc3cfba623 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c @@ -0,0 +1,11 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mtune=generic" } */ +/* { dg-final { scan-assembler "jmp\tmemcpy" { target { ! ia32 } } } } */ +/* { dg-final { scan-assembler "call\tmemcpy" { target ia32 } } } */ +/* { dg-final { scan-assembler-not "rep movsb" } } */ + +void +foo (char *dest, char *src) +{ + __builtin_memcpy (dest, src, 257); +} diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-10.c b/gcc/testsuite/gcc.target/i386/memset-strategy-10.c new file mode 100644 index 00000000000..ade5e8da42c --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memset-strategy-10.c @@ -0,0 +1,11 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mtune=generic" } */ +/* { dg-final { scan-assembler "jmp\tmemset" { target { ! ia32 } } } } */ +/* { dg-final { scan-assembler "call\tmemset" { target ia32 } } } */ +/* { dg-final { scan-assembler-not "rep stosb" } } */ + +void +foo (char *dest) +{ + __builtin_memset (dest, 0, 257); +} diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-11.c b/gcc/testsuite/gcc.target/i386/memset-strategy-11.c new file mode 100644 index 00000000000..d1b86152474 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memset-strategy-11.c @@ -0,0 +1,9 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mtune=generic" } */ +/* { dg-final { scan-assembler "rep stosb" } } */ + +void +foo (char *dest) +{ + __builtin_memset (dest, 0, 253); +} diff --git a/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c b/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c index 94dadd6cdbd..44fe7d2836e 100644 --- a/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c +++ b/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c @@ -1,5 +1,5 @@ /* { dg-do compile { target { ! ia32 } } } */ -/* { dg-options "-O2 -fdump-rtl-pro_and_epilogue" } */ +/* { dg-options "-O2 -mmemset-strategy=rep_8byte:-1:align -fdump-rtl-pro_and_epilogue" } */ enum machine_mode { diff --git a/gcc/testsuite/gcc.target/i386/sw-1.c b/gcc/testsuite/gcc.target/i386/sw-1.c index aec095eda62..f61621e42bf 100644 --- a/gcc/testsuite/gcc.target/i386/sw-1.c +++ b/gcc/testsuite/gcc.target/i386/sw-1.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -mtune=generic -fshrink-wrap -fdump-rtl-pro_and_epilogue" } */ +/* { dg-options "-O2 -mtune=generic -mstringop-strategy=rep_byte -fshrink-wrap -fdump-rtl-pro_and_epilogue" } */ /* { dg-skip-if "No shrink-wrapping preformed" { x86_64-*-mingw* } } */ #include <string.h> -- 2.30.2 ^ permalink raw reply [flat|nested] 31+ messages in thread
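A note on the two dg-options tweaks at the end: they pin the pre-patch behavior through the strategy override options. -mstringop-strategy=alg forces one algorithm (here rep_byte, i.e. "rep movsb"/"rep stosb") for all string operations, while -mmemcpy-strategy=/-mmemset-strategy= take alg:max_size:dest_align triplets, with a max_size of -1 meaning no size limit. For example:

/* Clear blocks of any size with "rep stosq", aligning the destination
   first (as shrink_wrap_1.c now does):  */
/* { dg-options "-O2 -mmemset-strategy=rep_8byte:-1:align" } */

/* Force "rep movsb"/"rep stosb" for every string operation regardless
   of size (as sw-1.c now does):  */
/* { dg-options "-O2 -mtune=generic -mstringop-strategy=rep_byte" } */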
* Re: [PATCH 3/3] x86: Update memcpy/memset inline strategies for -mtune=generic 2021-03-22 13:16 ` [PATCH 3/3] x86: Update memcpy/memset inline strategies for -mtune=generic H.J. Lu @ 2021-03-22 13:29 ` Richard Biener 2021-03-22 13:38 ` H.J. Lu 0 siblings, 1 reply; 31+ messages in thread From: Richard Biener @ 2021-03-22 13:29 UTC (permalink / raw) To: H.J. Lu; +Cc: GCC Patches, Jan Hubicka, Hongtao Liu, Hongyu Wang On Mon, Mar 22, 2021 at 2:19 PM H.J. Lu via Gcc-patches <gcc-patches@gcc.gnu.org> wrote: > > Simply memcpy and memset inline strategies to avoid branches for > -mtune=generic: > > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector > load and store for up to 16 * 16 (256) bytes when the data size is > fixed and known. > 2. Inline only if data size is known to be <= 256. > a. Use "rep movsb/stosb" with simple code sequence if the data size > is a constant. > b. Use loop if data size is not a constant. > 3. Use memcpy/memset libray function if data size is unknown or > 256. > > With -mtune=generic -O2, Is there any visible code-size effect of increasing CLEAR_RATIO on SPEC/eembc? Did you play with other values of MOVE/CLEAR_RATIO? 17 memory-to-memory/memory-clear insns looks quite a lot. > 1. On Ice Lake processor, > > Performance impacts on SPEC CPU 2017: > > 500.perlbench_r 0.51% > 502.gcc_r 0.55% > 505.mcf_r 0.38% > 520.omnetpp_r -0.74% > 523.xalancbmk_r -0.35% > 525.x264_r 2.99% > 531.deepsjeng_r -0.17% > 541.leela_r -0.98% > 548.exchange2_r 0.89% > 557.xz_r 0.70% > Geomean 0.37% > > 503.bwaves_r 0.04% > 507.cactuBSSN_r -0.01% > 508.namd_r -0.45% > 510.parest_r -0.09% > 511.povray_r -1.37% > 519.lbm_r 0.00% > 521.wrf_r -2.56% > 526.blender_r -0.01% > 527.cam4_r -0.05% > 538.imagick_r 0.36% > 544.nab_r 0.08% > 549.fotonik3d_r -0.06% > 554.roms_r 0.05% > Geomean -0.34% > > Significant impacts on eembc benchmarks: > > eembc/nnet_test 14.85% > eembc/mp2decoddata2 13.57% > > 2. On Cascadelake processor, > > Performance impacts on SPEC CPU 2017: > > 500.perlbench_r -0.02% > 502.gcc_r 0.10% > 505.mcf_r -1.14% > 520.omnetpp_r -0.22% > 523.xalancbmk_r 0.21% > 525.x264_r 0.94% > 531.deepsjeng_r -0.37% > 541.leela_r -0.46% > 548.exchange2_r -0.40% > 557.xz_r 0.60% > Geomean -0.08% > > 503.bwaves_r -0.50% > 507.cactuBSSN_r 0.05% > 508.namd_r -0.02% > 510.parest_r 0.09% > 511.povray_r -1.35% > 519.lbm_r 0.00% > 521.wrf_r -0.03% > 526.blender_r -0.83% > 527.cam4_r 1.23% > 538.imagick_r 0.97% > 544.nab_r -0.02% > 549.fotonik3d_r -0.12% > 554.roms_r 0.55% > Geomean 0.00% > > Significant impacts on eembc benchmarks: > > eembc/nnet_test 9.90% > eembc/mp2decoddata2 16.42% > eembc/textv2data3 -4.86% > eembc/qos 12.90% > > 3. On Znver3 processor, > > Performance impacts on SPEC CPU 2017: > > 500.perlbench_r -0.96% > 502.gcc_r -1.06% > 505.mcf_r -0.01% > 520.omnetpp_r -1.45% > 523.xalancbmk_r 2.89% > 525.x264_r 4.98% > 531.deepsjeng_r 0.18% > 541.leela_r -1.54% > 548.exchange2_r -1.25% > 557.xz_r -0.01% > Geomean 0.16% > > 503.bwaves_r 0.04% > 507.cactuBSSN_r 0.85% > 508.namd_r -0.13% > 510.parest_r 0.39% > 511.povray_r 0.00% > 519.lbm_r 0.00% > 521.wrf_r 0.28% > 526.blender_r -0.10% > 527.cam4_r -0.58% > 538.imagick_r 0.69% > 544.nab_r -0.04% > 549.fotonik3d_r -0.04% > 554.roms_r 0.40% > Geomean 0.15% > > Significant impacts on eembc benchmarks: > > eembc/aifftr01 13.95% > eembc/idctrn01 8.41% > eembc/nnet_test 30.25% > eembc/mp2decoddata2 5.05% > eembc/textv2data3 6.43% > eembc/qos -5.79% > > gcc/ > > * config/i386/x86-tune-costs.h (generic_memcpy): Updated. 
> (generic_memset): Likewise. > (generic_cost): Change CLEAR_RATIO to 17. > * config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB): > Add m_GENERIC. > > gcc/testsuite/ > > * gcc.target/i386/memcpy-strategy-12.c: New test. > * gcc.target/i386/memcpy-strategy-13.c: Likewise. > * gcc.target/i386/memset-strategy-10.c: Likewise. > * gcc.target/i386/memset-strategy-11.c: Likewise. > * gcc.target/i386/shrink_wrap_1.c: Also pass > -mmemset-strategy=rep_8byte:-1:align. > * gcc.target/i386/sw-1.c: Also pass -mstringop-strategy=rep_byte. > --- > gcc/config/i386/x86-tune-costs.h | 31 ++++++++++++------- > gcc/config/i386/x86-tune.def | 2 +- > .../gcc.target/i386/memcpy-strategy-12.c | 9 ++++++ > .../gcc.target/i386/memcpy-strategy-13.c | 11 +++++++ > .../gcc.target/i386/memset-strategy-10.c | 11 +++++++ > .../gcc.target/i386/memset-strategy-11.c | 9 ++++++ > gcc/testsuite/gcc.target/i386/shrink_wrap_1.c | 2 +- > gcc/testsuite/gcc.target/i386/sw-1.c | 2 +- > 8 files changed, 63 insertions(+), 14 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c > create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c > create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-10.c > create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-11.c > > diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h > index ffe810f2bcb..30e7c3e4261 100644 > --- a/gcc/config/i386/x86-tune-costs.h > +++ b/gcc/config/i386/x86-tune-costs.h > @@ -2844,19 +2844,28 @@ struct processor_costs intel_cost = { > "16", /* Func alignment. */ > }; > > -/* Generic should produce code tuned for Core-i7 (and newer chips) > - and btver1 (and newer chips). */ > +/* Generic should produce code tuned for Haswell (and newer chips) > + and znver1 (and newer chips). NB: rep_prefix_1_byte is used only > + for known size. */ > > static stringop_algs generic_memcpy[2] = { > - {libcall, {{32, loop, false}, {8192, rep_prefix_4_byte, false}, > - {-1, libcall, false}}}, > - {libcall, {{32, loop, false}, {8192, rep_prefix_8_byte, false}, > - {-1, libcall, false}}}}; > + {libcall, > + {{256, rep_prefix_1_byte, true}, > + {256, loop, false}, > + {-1, libcall, false}}}, > + {libcall, > + {{256, rep_prefix_1_byte, true}, > + {256, loop, false}, > + {-1, libcall, false}}}}; > static stringop_algs generic_memset[2] = { > - {libcall, {{32, loop, false}, {8192, rep_prefix_4_byte, false}, > - {-1, libcall, false}}}, > - {libcall, {{32, loop, false}, {8192, rep_prefix_8_byte, false}, > - {-1, libcall, false}}}}; > + {libcall, > + {{256, rep_prefix_1_byte, true}, > + {256, loop, false}, > + {-1, libcall, false}}}, > + {libcall, > + {{256, rep_prefix_1_byte, true}, > + {256, loop, false}, > + {-1, libcall, false}}}}; > static const > struct processor_costs generic_cost = { > { > @@ -2913,7 +2922,7 @@ struct processor_costs generic_cost = { > COSTS_N_INSNS (1), /* cost of movzx */ > 8, /* "large" insn */ > 17, /* MOVE_RATIO */ > - 6, /* CLEAR_RATIO */ > + 17, /* CLEAR_RATIO */ > {6, 6, 6}, /* cost of loading integer registers > in QImode, HImode and SImode. > Relative to reg-reg move (2). */ > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def > index eb057a67750..fd9c011a3f5 100644 > --- a/gcc/config/i386/x86-tune.def > +++ b/gcc/config/i386/x86-tune.def > @@ -273,7 +273,7 @@ DEF_TUNE (X86_TUNE_SINGLE_STRINGOP, "single_stringop", m_386 | m_P4_NOCONA) > move/set sequences of bytes with known size. 
*/ > DEF_TUNE (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB, > "prefer_known_rep_movsb_stosb", > - m_SKYLAKE | m_ALDERLAKE | m_CORE_AVX512) > + m_SKYLAKE | m_ALDERLAKE | m_CORE_AVX512 | m_GENERIC) > > /* X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES: Enable generation of > compact prologues and epilogues by issuing a misaligned moves. This > diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c > new file mode 100644 > index 00000000000..87f03352736 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c > @@ -0,0 +1,9 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -mtune=generic" } */ > +/* { dg-final { scan-assembler "rep movsb" } } */ > + > +void > +foo (char *dest, char *src) > +{ > + __builtin_memcpy (dest, src, 249); > +} > diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c > new file mode 100644 > index 00000000000..cfc3cfba623 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c > @@ -0,0 +1,11 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -mtune=generic" } */ > +/* { dg-final { scan-assembler "jmp\tmemcpy" { target { ! ia32 } } } } */ > +/* { dg-final { scan-assembler "call\tmemcpy" { target ia32 } } } */ > +/* { dg-final { scan-assembler-not "rep movsb" } } */ > + > +void > +foo (char *dest, char *src) > +{ > + __builtin_memcpy (dest, src, 257); > +} > diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-10.c b/gcc/testsuite/gcc.target/i386/memset-strategy-10.c > new file mode 100644 > index 00000000000..ade5e8da42c > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/memset-strategy-10.c > @@ -0,0 +1,11 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -mtune=generic" } */ > +/* { dg-final { scan-assembler "jmp\tmemset" { target { ! ia32 } } } } */ > +/* { dg-final { scan-assembler "call\tmemset" { target ia32 } } } */ > +/* { dg-final { scan-assembler-not "rep stosb" } } */ > + > +void > +foo (char *dest) > +{ > + __builtin_memset (dest, 0, 257); > +} > diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-11.c b/gcc/testsuite/gcc.target/i386/memset-strategy-11.c > new file mode 100644 > index 00000000000..d1b86152474 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/memset-strategy-11.c > @@ -0,0 +1,9 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -mtune=generic" } */ > +/* { dg-final { scan-assembler "rep stosb" } } */ > + > +void > +foo (char *dest) > +{ > + __builtin_memset (dest, 0, 253); > +} > diff --git a/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c b/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c > index 94dadd6cdbd..44fe7d2836e 100644 > --- a/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c > +++ b/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c > @@ -1,5 +1,5 @@ > /* { dg-do compile { target { ! 
ia32 } } } */ > -/* { dg-options "-O2 -fdump-rtl-pro_and_epilogue" } */ > +/* { dg-options "-O2 -mmemset-strategy=rep_8byte:-1:align -fdump-rtl-pro_and_epilogue" } */ > > enum machine_mode > { > diff --git a/gcc/testsuite/gcc.target/i386/sw-1.c b/gcc/testsuite/gcc.target/i386/sw-1.c > index aec095eda62..f61621e42bf 100644 > --- a/gcc/testsuite/gcc.target/i386/sw-1.c > +++ b/gcc/testsuite/gcc.target/i386/sw-1.c > @@ -1,5 +1,5 @@ > /* { dg-do compile } */ > -/* { dg-options "-O2 -mtune=generic -fshrink-wrap -fdump-rtl-pro_and_epilogue" } */ > +/* { dg-options "-O2 -mtune=generic -mstringop-strategy=rep_byte -fshrink-wrap -fdump-rtl-pro_and_epilogue" } */ > /* { dg-skip-if "No shrink-wrapping preformed" { x86_64-*-mingw* } } */ > > #include <string.h> > -- > 2.30.2 > ^ permalink raw reply [flat|nested] 31+ messages in thread
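Each triple in the generic_memcpy/generic_memset initializers above is, assuming the field layout of struct stringop_algs in gcc/config/i386/i386.h, a {max, alg, noalign} entry: alg handles sizes up to max, and {-1, libcall, false} terminates the list. Below is a minimal sketch of how such a table is scanned; pick_alg is a hypothetical helper, not GCC's actual decide_alg in i386-expand.c, which also weighs costs and tuning flags.

    /* Sketch only: models the {max, alg, noalign} triples above,
       not the exact GCC implementation.  */

    enum stringop_alg { libcall, loop, rep_prefix_1_byte };

    struct algs
    {
      int max;                    /* largest size this entry covers */
      enum stringop_alg alg;     /* expansion algorithm to use */
      int noalign;               /* skip the alignment prologue? */
    };

    static enum stringop_alg
    pick_alg (const struct algs *table, long size, int size_is_constant)
    {
      for (int i = 0; table[i].max != -1; i++)
        {
          /* Per the NB in the comment above, rep_prefix_1_byte entries
             are honoured only when the size is a compile-time constant.  */
          if (table[i].alg == rep_prefix_1_byte && !size_is_constant)
            continue;
          if (size <= table[i].max)
            return table[i].alg;
        }
      return libcall;             /* the {-1, libcall, false} terminator */
    }

Applied to the new tables, this reproduces the three cases of the cover letter: a constant size <= 256 selects rep_prefix_1_byte ("rep movsb"/"rep stosb", gated by X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB), a non-constant size bounded by 256 falls through to loop, and everything else becomes a libcall.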
* Re: [PATCH 3/3] x86: Update memcpy/memset inline strategies for -mtune=generic 2021-03-22 13:29 ` Richard Biener @ 2021-03-22 13:38 ` H.J. Lu 2021-03-23 2:41 ` Hongyu Wang 0 siblings, 1 reply; 31+ messages in thread From: H.J. Lu @ 2021-03-22 13:38 UTC (permalink / raw) To: Richard Biener; +Cc: GCC Patches, Jan Hubicka, Hongtao Liu, Hongyu Wang On Mon, Mar 22, 2021 at 6:29 AM Richard Biener <richard.guenther@gmail.com> wrote: > > On Mon, Mar 22, 2021 at 2:19 PM H.J. Lu via Gcc-patches > <gcc-patches@gcc.gnu.org> wrote: > > > > Simplify memcpy and memset inline strategies to avoid branches for > > -mtune=generic: > > > > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector > > load and store for up to 16 * 16 (256) bytes when the data size is > > fixed and known. > > 2. Inline only if data size is known to be <= 256. > > a. Use "rep movsb/stosb" with simple code sequence if the data size > > is a constant. > > b. Use loop if data size is not a constant. > > 3. Use memcpy/memset library function if data size is unknown or > 256. > > > > With -mtune=generic -O2, > > Is there any visible code-size effect of increasing CLEAR_RATIO on Hongyue, please collect code size differences on SPEC CPU 2017 and eembc. > SPEC/eembc? Did you play with other values of MOVE/CLEAR_RATIO? > 17 memory-to-memory/memory-clear insns looks like quite a lot. > Yes, we did. 256 bytes is the threshold above which memcpy/memset in libc win. Below 256 bytes, up to 16 by_pieces moves/stores are faster. -- H.J. ^ permalink raw reply [flat|nested] 31+ messages in thread
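To make the three regimes above concrete, here is roughly what they mean for user code under -mtune=generic -O2. This is a sketch inferred from the description and the new tests; the middle case additionally depends on GCC proving the <= 256 bound:

    #include <string.h>

    char dst[512], src[512];

    void
    copy_const_small (void)
    {
      /* Constant 249 <= 256: expanded as "rep movsb"
         (cf. memcpy-strategy-12.c above).  */
      memcpy (dst, src, 249);
    }

    void
    copy_bounded (size_t n)
    {
      /* Size not constant but provably <= 256: expanded as an
         inline loop instead of a branchy rep/libcall hybrid.  */
      if (n > 256)
        n = 256;
      memcpy (dst, src, n);
    }

    void
    copy_const_large (void)
    {
      /* Constant 257 > 256: left as a memcpy call
         (cf. memcpy-strategy-13.c above).  */
      memcpy (dst, src, 257);
    }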
* Re: [PATCH 3/3] x86: Update memcpy/memset inline strategies for -mtune=generic 2021-03-22 13:38 ` H.J. Lu @ 2021-03-23 2:41 ` Hongyu Wang 2021-03-23 8:19 ` Richard Biener 0 siblings, 1 reply; 31+ messages in thread From: Hongyu Wang @ 2021-03-23 2:41 UTC (permalink / raw) To: H.J. Lu Cc: Richard Biener, Jan Hubicka, Hongtao Liu, GCC Patches, Hongyu Wang > Hongyue, please collect code size differences on SPEC CPU 2017 and > eembc. Here is the code size difference for this patch:

SPEC CPU 2017        difference    w patch    w/o patch
500.perlbench_r       0.051%       1622637    1621805
502.gcc_r             0.039%       6930877    6928141
505.mcf_r             0.098%         16413      16397
520.omnetpp_r         0.083%       1327757    1326653
523.xalancbmk_r       0.001%       3575709    3575677
525.x264_r           -0.067%        769095     769607
531.deepsjeng_r       0.071%         67629      67581
541.leela_r          -3.062%        127629     131661
548.exchange2_r      -0.338%         66141      66365
557.xz_r              0.946%        128061     126861
503.bwaves_r          0.534%         33117      32941
507.cactuBSSN_r       0.004%       2993645    2993517
508.namd_r            0.006%        851677     851629
510.parest_r          0.488%       6741277    6708557
511.povray_r         -0.021%        849290     849466
521.wrf_r             0.022%      29682154   29675530
526.blender_r         0.054%       7544057    7540009
527.cam4_r            0.043%       6102234    6099594
538.imagick_r        -0.015%       1625770    1626010
544.nab_r             0.155%        155453     155213
549.fotonik3d_r       0.000%        351757     351757
554.roms_r            0.041%        735837     735533

eembc                difference    w patch    w/o patch
aifftr01              0.762%         14813      14701
aiifft01              0.556%         14477      14397
idctrn01              0.101%         15853      15837
cjpeg-rose7-preset    0.114%         56125      56061
nnet_test            -0.848%         35549      35853
aes                   0.125%         38493      38445
cjpegv2data           0.108%         59213      59149
djpegv2data           0.025%         63821      63805
huffde               -0.104%         30621      30653
mp2decoddata         -0.047%         68285      68317
mp2enf32data1         0.018%         86925      86909
mp2enf32data2         0.018%         89357      89341
mp2enf32data3         0.018%         88253      88237
mp3playerfixeddata    0.103%         46877      46829
ip_pktcheckb1m        0.191%         25213      25165
nat                   0.527%         45757      45517
ospfv2                0.196%         24573      24525
routelookup           0.189%         25389      25341
tcpbulk               0.155%         30925      30877
textv2data            0.055%         29101      29085

H.J. Lu via Gcc-patches <gcc-patches@gcc.gnu.org> wrote on Mon, Mar 22, 2021 at 9:39 PM: > > On Mon, Mar 22, 2021 at 6:29 AM Richard Biener > <richard.guenther@gmail.com> wrote: > > > > On Mon, Mar 22, 2021 at 2:19 PM H.J. Lu via Gcc-patches > > <gcc-patches@gcc.gnu.org> wrote: > > > > > > Simplify memcpy and memset inline strategies to avoid branches for > > > -mtune=generic: > > > > > > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector > > > load and store for up to 16 * 16 (256) bytes when the data size is > > > fixed and known. > > > 2. Inline only if data size is known to be <= 256. > > > a. Use "rep movsb/stosb" with simple code sequence if the data size > > > is a constant. > > > b. Use loop if data size is not a constant. > > > 3. Use memcpy/memset library function if data size is unknown or > 256. > > > > > > With -mtune=generic -O2, > > > > Is there any visible code-size effect of increasing CLEAR_RATIO on > > Hongyue, please collect code size differences on SPEC CPU 2017 and > eembc. > > > SPEC/eembc? Did you play with other values of MOVE/CLEAR_RATIO? > > 17 memory-to-memory/memory-clear insns looks like quite a lot. > > > > Yes, we did. 256 bytes is the threshold above which memcpy/memset in libc > win. Below 256 bytes, up to 16 by_pieces moves/stores are faster. > > -- > H.J. -- Regards, Hongyu, Wang ^ permalink raw reply [flat|nested] 31+ messages in thread
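For reference, the difference column in these tables is (w patch - w/o patch) / (w/o patch); for example, 541.leela_r gives (127629 - 131661) / 131661 = -3.062%, the one sizable code-size reduction in the set.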
* Re: [PATCH 3/3] x86: Update memcpy/memset inline strategies for -mtune=generic 2021-03-23 2:41 ` Hongyu Wang @ 2021-03-23 8:19 ` Richard Biener 2021-08-22 15:28 ` PING [PATCH] " H.J. Lu 0 siblings, 1 reply; 31+ messages in thread From: Richard Biener @ 2021-03-23 8:19 UTC (permalink / raw) To: Hongyu Wang; +Cc: H.J. Lu, Jan Hubicka, Hongtao Liu, GCC Patches, Hongyu Wang On Tue, Mar 23, 2021 at 3:41 AM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote: > > > Hongyue, please collect code size differences on SPEC CPU 2017 and > > eembc. > > Here is code size difference for this patch Thanks, nothing too bad although slightly larger impacts than envisioned. > SPEC CPU 2017 > difference w patch w/o patch > 500.perlbench_r 0.051% 1622637 1621805 > 502.gcc_r 0.039% 6930877 6928141 > 505.mcf_r 0.098% 16413 16397 > 520.omnetpp_r 0.083% 1327757 1326653 > 523.xalancbmk_r 0.001% 3575709 3575677 > 525.x264_r -0.067% 769095 769607 > 531.deepsjeng_r 0.071% 67629 67581 > 541.leela_r -3.062% 127629 131661 > 548.exchange2_r -0.338% 66141 66365 > 557.xz_r 0.946% 128061 126861 > > 503.bwaves_r 0.534% 33117 32941 > 507.cactuBSSN_r 0.004% 2993645 2993517 > 508.namd_r 0.006% 851677 851629 > 510.parest_r 0.488% 6741277 6708557 > 511.povray_r -0.021% 849290 849466 > 521.wrf_r 0.022% 29682154 29675530 > 526.blender_r 0.054% 7544057 7540009 > 527.cam4_r 0.043% 6102234 6099594 > 538.imagick_r -0.015% 1625770 1626010 > 544.nab_r 0.155% 155453 155213 > 549.fotonik3d_r 0.000% 351757 351757 > 554.roms_r 0.041% 735837 735533 > > eembc > difference w patch w/o patch > aifftr01 0.762% 14813 14701 > aiifft01 0.556% 14477 14397 > idctrn01 0.101% 15853 15837 > cjpeg-rose7-preset 0.114% 56125 56061 > nnet_test -0.848% 35549 35853 > aes 0.125% 38493 38445 > cjpegv2data 0.108% 59213 59149 > djpegv2data 0.025% 63821 63805 > huffde -0.104% 30621 30653 > mp2decoddata -0.047% 68285 68317 > mp2enf32data1 0.018% 86925 86909 > mp2enf32data2 0.018% 89357 89341 > mp2enf32data3 0.018% 88253 88237 > mp3playerfixeddata 0.103% 46877 46829 > ip_pktcheckb1m 0.191% 25213 25165 > nat 0.527% 45757 45517 > ospfv2 0.196% 24573 24525 > routelookup 0.189% 25389 25341 > tcpbulk 0.155% 30925 30877 > textv2data 0.055% 29101 29085 > > H.J. Lu via Gcc-patches <gcc-patches@gcc.gnu.org> 于2021年3月22日周一 下午9:39写道: > > > > On Mon, Mar 22, 2021 at 6:29 AM Richard Biener > > <richard.guenther@gmail.com> wrote: > > > > > > On Mon, Mar 22, 2021 at 2:19 PM H.J. Lu via Gcc-patches > > > <gcc-patches@gcc.gnu.org> wrote: > > > > > > > > Simply memcpy and memset inline strategies to avoid branches for > > > > -mtune=generic: > > > > > > > > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector > > > > load and store for up to 16 * 16 (256) bytes when the data size is > > > > fixed and known. > > > > 2. Inline only if data size is known to be <= 256. > > > > a. Use "rep movsb/stosb" with simple code sequence if the data size > > > > is a constant. > > > > b. Use loop if data size is not a constant. > > > > 3. Use memcpy/memset libray function if data size is unknown or > 256. > > > > > > > > With -mtune=generic -O2, > > > > > > Is there any visible code-size effect of increasing CLEAR_RATIO on > > > > Hongyue, please collect code size differences on SPEC CPU 2017 and > > eembc. > > > > > SPEC/eembc? Did you play with other values of MOVE/CLEAR_RATIO? > > > 17 memory-to-memory/memory-clear insns looks quite a lot. > > > > > > > Yes, we did. 256 bytes is the threshold above which memcpy/memset in libc > > win. Below 256 bytes, 16 by_pieces move/store is faster. 
> > > > -- > > H.J. > > -- > Regards, > > Hongyu, Wang ^ permalink raw reply [flat|nested] 31+ messages in thread
* PING [PATCH] x86: Update memcpy/memset inline strategies for -mtune=generic 2021-03-23 8:19 ` Richard Biener @ 2021-08-22 15:28 ` H.J. Lu 2021-09-08 3:01 ` PING^2 " H.J. Lu 0 siblings, 1 reply; 31+ messages in thread From: H.J. Lu @ 2021-08-22 15:28 UTC (permalink / raw) To: Richard Biener Cc: Uros Bizjak, Jan Hubicka, Hongtao Liu, GCC Patches, Hongyu Wang On Tue, Mar 23, 2021 at 09:19:38AM +0100, Richard Biener wrote: > On Tue, Mar 23, 2021 at 3:41 AM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote: > > > > > Hongyue, please collect code size differences on SPEC CPU 2017 and > > > eembc. > > > > Here is code size difference for this patch > > Thanks, nothing too bad although slightly larger impacts than envisioned. > PING. OK for master branch? Thanks. H.J. --- Simplify memcpy and memset inline strategies to avoid branches for -mtune=generic: 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector load and store for up to 16 * 16 (256) bytes when the data size is fixed and known. 2. Inline only if data size is known to be <= 256. a. Use "rep movsb/stosb" with simple code sequence if the data size is a constant. b. Use loop if data size is not a constant. 3. Use memcpy/memset library function if data size is unknown or > 256. With -mtune=generic -O2:

1. On Ice Lake processor, performance impacts on SPEC CPU 2017:

500.perlbench_r     0.51%
502.gcc_r           0.55%
505.mcf_r           0.38%
520.omnetpp_r      -0.74%
523.xalancbmk_r    -0.35%
525.x264_r          2.99%
531.deepsjeng_r    -0.17%
541.leela_r        -0.98%
548.exchange2_r     0.89%
557.xz_r            0.70%
Geomean             0.37%

503.bwaves_r        0.04%
507.cactuBSSN_r    -0.01%
508.namd_r         -0.45%
510.parest_r       -0.09%
511.povray_r       -1.37%
519.lbm_r           0.00%
521.wrf_r          -2.56%
526.blender_r      -0.01%
527.cam4_r         -0.05%
538.imagick_r       0.36%
544.nab_r           0.08%
549.fotonik3d_r    -0.06%
554.roms_r          0.05%
Geomean            -0.34%

Significant impacts on eembc benchmarks:

eembc/nnet_test        14.85%
eembc/mp2decoddata2    13.57%

2. On Cascadelake processor, performance impacts on SPEC CPU 2017:

500.perlbench_r    -0.02%
502.gcc_r           0.10%
505.mcf_r          -1.14%
520.omnetpp_r      -0.22%
523.xalancbmk_r     0.21%
525.x264_r          0.94%
531.deepsjeng_r    -0.37%
541.leela_r        -0.46%
548.exchange2_r    -0.40%
557.xz_r            0.60%
Geomean            -0.08%

503.bwaves_r       -0.50%
507.cactuBSSN_r     0.05%
508.namd_r         -0.02%
510.parest_r        0.09%
511.povray_r       -1.35%
519.lbm_r           0.00%
521.wrf_r          -0.03%
526.blender_r      -0.83%
527.cam4_r          1.23%
538.imagick_r       0.97%
544.nab_r          -0.02%
549.fotonik3d_r    -0.12%
554.roms_r          0.55%
Geomean             0.00%

Significant impacts on eembc benchmarks:

eembc/nnet_test         9.90%
eembc/mp2decoddata2    16.42%
eembc/textv2data3      -4.86%
eembc/qos              12.90%
3. On Znver3 processor, performance impacts on SPEC CPU 2017:

500.perlbench_r    -0.96%
502.gcc_r          -1.06%
505.mcf_r          -0.01%
520.omnetpp_r      -1.45%
523.xalancbmk_r     2.89%
525.x264_r          4.98%
531.deepsjeng_r     0.18%
541.leela_r        -1.54%
548.exchange2_r    -1.25%
557.xz_r           -0.01%
Geomean             0.16%

503.bwaves_r        0.04%
507.cactuBSSN_r     0.85%
508.namd_r         -0.13%
510.parest_r        0.39%
511.povray_r        0.00%
519.lbm_r           0.00%
521.wrf_r           0.28%
526.blender_r      -0.10%
527.cam4_r         -0.58%
538.imagick_r       0.69%
544.nab_r          -0.04%
549.fotonik3d_r    -0.04%
554.roms_r          0.40%
Geomean             0.15%

Significant impacts on eembc benchmarks:

eembc/aifftr01         13.95%
eembc/idctrn01          8.41%
eembc/nnet_test        30.25%
eembc/mp2decoddata2     5.05%
eembc/textv2data3       6.43%
eembc/qos              -5.79%

Code size differences are:

SPEC CPU 2017        difference    w patch    w/o patch
500.perlbench_r       0.051%       1622637    1621805
502.gcc_r             0.039%       6930877    6928141
505.mcf_r             0.098%         16413      16397
520.omnetpp_r         0.083%       1327757    1326653
523.xalancbmk_r       0.001%       3575709    3575677
525.x264_r           -0.067%        769095     769607
531.deepsjeng_r       0.071%         67629      67581
541.leela_r          -3.062%        127629     131661
548.exchange2_r      -0.338%         66141      66365
557.xz_r              0.946%        128061     126861
503.bwaves_r          0.534%         33117      32941
507.cactuBSSN_r       0.004%       2993645    2993517
508.namd_r            0.006%        851677     851629
510.parest_r          0.488%       6741277    6708557
511.povray_r         -0.021%        849290     849466
521.wrf_r             0.022%      29682154   29675530
526.blender_r         0.054%       7544057    7540009
527.cam4_r            0.043%       6102234    6099594
538.imagick_r        -0.015%       1625770    1626010
544.nab_r             0.155%        155453     155213
549.fotonik3d_r       0.000%        351757     351757
554.roms_r            0.041%        735837     735533

eembc                difference    w patch    w/o patch
aifftr01              0.762%         14813      14701
aiifft01              0.556%         14477      14397
idctrn01              0.101%         15853      15837
cjpeg-rose7-preset    0.114%         56125      56061
nnet_test            -0.848%         35549      35853
aes                   0.125%         38493      38445
cjpegv2data           0.108%         59213      59149
djpegv2data           0.025%         63821      63805
huffde               -0.104%         30621      30653
mp2decoddata         -0.047%         68285      68317
mp2enf32data1         0.018%         86925      86909
mp2enf32data2         0.018%         89357      89341
mp2enf32data3         0.018%         88253      88237
mp3playerfixeddata    0.103%         46877      46829
ip_pktcheckb1m        0.191%         25213      25165
nat                   0.527%         45757      45517
ospfv2                0.196%         24573      24525
routelookup           0.189%         25389      25341
tcpbulk               0.155%         30925      30877
textv2data            0.055%         29101      29085

gcc/

	* config/i386/x86-tune-costs.h (generic_memcpy): Updated.
	(generic_memset): Likewise.
	(generic_cost): Change CLEAR_RATIO to 17.
	* config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
	Add m_GENERIC.

gcc/testsuite/

	* gcc.target/i386/memcpy-strategy-12.c: New test.
	* gcc.target/i386/memcpy-strategy-13.c: Likewise.
	* gcc.target/i386/memset-strategy-10.c: Likewise.
	* gcc.target/i386/memset-strategy-11.c: Likewise.
	* gcc.target/i386/shrink_wrap_1.c: Also pass
	-mmemset-strategy=rep_8byte:-1:align.
	* gcc.target/i386/sw-1.c: Also pass -mstringop-strategy=rep_byte.
--- gcc/config/i386/x86-tune-costs.h | 31 ++++++++++++------- gcc/config/i386/x86-tune.def | 2 +- .../gcc.target/i386/memcpy-strategy-12.c | 9 ++++++ .../gcc.target/i386/memcpy-strategy-13.c | 11 +++++++ .../gcc.target/i386/memset-strategy-10.c | 11 +++++++ .../gcc.target/i386/memset-strategy-11.c | 9 ++++++ gcc/testsuite/gcc.target/i386/shrink_wrap_1.c | 2 +- gcc/testsuite/gcc.target/i386/sw-1.c | 2 +- 8 files changed, 63 insertions(+), 14 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-10.c create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-11.c diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h index ffe810f2bcb..30e7c3e4261 100644 --- a/gcc/config/i386/x86-tune-costs.h +++ b/gcc/config/i386/x86-tune-costs.h @@ -2844,19 +2844,28 @@ struct processor_costs intel_cost = { "16", /* Func alignment. */ }; -/* Generic should produce code tuned for Core-i7 (and newer chips) - and btver1 (and newer chips). */ +/* Generic should produce code tuned for Haswell (and newer chips) + and znver1 (and newer chips). NB: rep_prefix_1_byte is used only + for known size. */ static stringop_algs generic_memcpy[2] = { - {libcall, {{32, loop, false}, {8192, rep_prefix_4_byte, false}, - {-1, libcall, false}}}, - {libcall, {{32, loop, false}, {8192, rep_prefix_8_byte, false}, - {-1, libcall, false}}}}; + {libcall, + {{256, rep_prefix_1_byte, true}, + {256, loop, false}, + {-1, libcall, false}}}, + {libcall, + {{256, rep_prefix_1_byte, true}, + {256, loop, false}, + {-1, libcall, false}}}}; static stringop_algs generic_memset[2] = { - {libcall, {{32, loop, false}, {8192, rep_prefix_4_byte, false}, - {-1, libcall, false}}}, - {libcall, {{32, loop, false}, {8192, rep_prefix_8_byte, false}, - {-1, libcall, false}}}}; + {libcall, + {{256, rep_prefix_1_byte, true}, + {256, loop, false}, + {-1, libcall, false}}}, + {libcall, + {{256, rep_prefix_1_byte, true}, + {256, loop, false}, + {-1, libcall, false}}}}; static const struct processor_costs generic_cost = { { @@ -2913,7 +2922,7 @@ struct processor_costs generic_cost = { COSTS_N_INSNS (1), /* cost of movzx */ 8, /* "large" insn */ 17, /* MOVE_RATIO */ - 6, /* CLEAR_RATIO */ + 17, /* CLEAR_RATIO */ {6, 6, 6}, /* cost of loading integer registers in QImode, HImode and SImode. Relative to reg-reg move (2). */ diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def index 8f55da89c92..a9a023f33f5 100644 --- a/gcc/config/i386/x86-tune.def +++ b/gcc/config/i386/x86-tune.def @@ -273,7 +273,7 @@ DEF_TUNE (X86_TUNE_SINGLE_STRINGOP, "single_stringop", m_386 | m_P4_NOCONA) move/set sequences of bytes with known size. */ DEF_TUNE (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB, "prefer_known_rep_movsb_stosb", - m_SKYLAKE | m_ALDERLAKE | m_CORE_AVX512) + m_SKYLAKE | m_ALDERLAKE | m_CORE_AVX512 | m_GENERIC) /* X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES: Enable generation of compact prologues and epilogues by issuing a misaligned moves. 
This diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c new file mode 100644 index 00000000000..e9998b70ab2 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c @@ -0,0 +1,9 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mtune=generic -mno-sse" } */ +/* { dg-final { scan-assembler "rep movsb" } } */ + +void +foo (char *dest, char *src) +{ + __builtin_memcpy (dest, src, 249); +} diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c new file mode 100644 index 00000000000..109bd675a51 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c @@ -0,0 +1,11 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mtune=generic -mno-avx" } */ +/* { dg-final { scan-assembler "jmp\tmemcpy" { target { ! ia32 } } } } */ +/* { dg-final { scan-assembler "call\tmemcpy" { target ia32 } } } */ +/* { dg-final { scan-assembler-not "rep movsb" } } */ + +void +foo (char *dest, char *src) +{ + __builtin_memcpy (dest, src, 257); +} diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-10.c b/gcc/testsuite/gcc.target/i386/memset-strategy-10.c new file mode 100644 index 00000000000..685d6e5a5c2 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memset-strategy-10.c @@ -0,0 +1,11 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mtune=generic -mno-avx" } */ +/* { dg-final { scan-assembler "jmp\tmemset" { target { ! ia32 } } } } */ +/* { dg-final { scan-assembler "call\tmemset" { target ia32 } } } */ +/* { dg-final { scan-assembler-not "rep stosb" } } */ + +void +foo (char *dest) +{ + __builtin_memset (dest, 0, 257); +} diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-11.c b/gcc/testsuite/gcc.target/i386/memset-strategy-11.c new file mode 100644 index 00000000000..61ee463a8cf --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memset-strategy-11.c @@ -0,0 +1,9 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mtune=generic -mno-sse" } */ +/* { dg-final { scan-assembler "rep stosb" } } */ + +void +foo (char *dest) +{ + __builtin_memset (dest, 0, 253); +} diff --git a/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c b/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c index 94dadd6cdbd..44fe7d2836e 100644 --- a/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c +++ b/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c @@ -1,5 +1,5 @@ /* { dg-do compile { target { ! ia32 } } } */ -/* { dg-options "-O2 -fdump-rtl-pro_and_epilogue" } */ +/* { dg-options "-O2 -mmemset-strategy=rep_8byte:-1:align -fdump-rtl-pro_and_epilogue" } */ enum machine_mode { diff --git a/gcc/testsuite/gcc.target/i386/sw-1.c b/gcc/testsuite/gcc.target/i386/sw-1.c index a9c89fca4ec..234db0e67c2 100644 --- a/gcc/testsuite/gcc.target/i386/sw-1.c +++ b/gcc/testsuite/gcc.target/i386/sw-1.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -mtune=generic -fshrink-wrap -fdump-rtl-pro_and_epilogue" } */ +/* { dg-options "-O2 -mtune=generic -mstringop-strategy=rep_byte -fshrink-wrap -fdump-rtl-pro_and_epilogue" } */ /* { dg-additional-options "-mno-avx" { target ia32 } } */ /* { dg-skip-if "No shrink-wrapping preformed" { x86_64-*-mingw* } } */ -- 2.31.1 ^ permalink raw reply [flat|nested] 31+ messages in thread
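A note on the two option forms used in the testsuite tweaks above, per my reading of the GCC manual: -mstringop-strategy= takes a single algorithm name (rep_byte here) that overrides the strategy tables for all string-op expansions, while -mmemset-strategy= (and -mmemcpy-strategy=) take alg:max_size:dest_align triplets, naming the algorithm, the largest size it applies to (-1 for unlimited), and align/noalign to control whether alignment code is emitted. So rep_8byte:-1:align pins memset expansion to rep_8byte for every size, presumably keeping the shrink-wrap dumps these tests check independent of the new generic tables. A hypothetical testcase in the style of the ones above (the "rep stosq" expectation is an assumption, not taken from the patch):

    /* { dg-do compile } */
    /* { dg-options "-O2 -mtune=generic -mmemset-strategy=rep_8byte:-1:align" } */
    /* { dg-final { scan-assembler "rep stosq" { target { ! ia32 } } } } */

    void
    foo (char *dest)
    {
      /* With the strategy pinned to rep_8byte for any size, even a
         large constant memset stays inline instead of becoming a call.  */
      __builtin_memset (dest, 0, 1024);
    }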
* PING^2 [PATCH] x86: Update memcpy/memset inline strategies for -mtune=generic 2021-08-22 15:28 ` PING [PATCH] " H.J. Lu @ 2021-09-08 3:01 ` H.J. Lu 2021-09-13 13:38 ` H.J. Lu 0 siblings, 1 reply; 31+ messages in thread From: H.J. Lu @ 2021-09-08 3:01 UTC (permalink / raw) To: Richard Biener, Lili Cui Cc: Uros Bizjak, Jan Hubicka, Hongtao Liu, GCC Patches, Hongyu Wang On Sun, Aug 22, 2021 at 8:28 AM H.J. Lu <hjl.tools@gmail.com> wrote: > > On Tue, Mar 23, 2021 at 09:19:38AM +0100, Richard Biener wrote: > > On Tue, Mar 23, 2021 at 3:41 AM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote: > > > > > > > Hongyue, please collect code size differences on SPEC CPU 2017 and > > > > eembc. > > > > > > Here is code size difference for this patch > > > > Thanks, nothing too bad although slightly larger impacts than envisioned. > > > > PING. > > OK for master branch? > > Thanks. > > H.J. > --- > Simplify memcpy and memset inline strategies to avoid branches for > -mtune=generic: > > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector > load and store for up to 16 * 16 (256) bytes when the data size is > fixed and known. > 2. Inline only if data size is known to be <= 256. > a. Use "rep movsb/stosb" with simple code sequence if the data size > is a constant. > b. Use loop if data size is not a constant. > 3. Use memcpy/memset libray function if data size is unknown or > 256. > PING: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577889.html -- H.J. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: PING^2 [PATCH] x86: Update memcpy/memset inline strategies for -mtune=generic 2021-09-08 3:01 ` PING^2 " H.J. Lu @ 2021-09-13 13:38 ` H.J. Lu 2021-09-20 17:06 ` PING^3 " H.J. Lu 0 siblings, 1 reply; 31+ messages in thread From: H.J. Lu @ 2021-09-13 13:38 UTC (permalink / raw) To: Richard Biener, Lili Cui Cc: Uros Bizjak, Jan Hubicka, Hongtao Liu, GCC Patches, Hongyu Wang On Tue, Sep 7, 2021 at 8:01 PM H.J. Lu <hjl.tools@gmail.com> wrote: > > On Sun, Aug 22, 2021 at 8:28 AM H.J. Lu <hjl.tools@gmail.com> wrote: > > > > On Tue, Mar 23, 2021 at 09:19:38AM +0100, Richard Biener wrote: > > > On Tue, Mar 23, 2021 at 3:41 AM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote: > > > > > > > > > Hongyue, please collect code size differences on SPEC CPU 2017 and > > > > > eembc. > > > > > > > > Here is code size difference for this patch > > > > > > Thanks, nothing too bad although slightly larger impacts than envisioned. > > > > > > > PING. > > > > OK for master branch? > > > > Thanks. > > > > H.J. > > --- > > Simplify memcpy and memset inline strategies to avoid branches for > > -mtune=generic: > > > > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector > > load and store for up to 16 * 16 (256) bytes when the data size is > > fixed and known. > > 2. Inline only if data size is known to be <= 256. > > a. Use "rep movsb/stosb" with simple code sequence if the data size > > is a constant. > > b. Use loop if data size is not a constant. > > 3. Use memcpy/memset libray function if data size is unknown or > 256. > > > > PING: > > https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577889.html > PING. This should fix: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294 -- H.J. ^ permalink raw reply [flat|nested] 31+ messages in thread
* PING^3 [PATCH] x86: Update memcpy/memset inline strategies for -mtune=generic 2021-09-13 13:38 ` H.J. Lu @ 2021-09-20 17:06 ` H.J. Lu 2021-10-01 15:24 ` PING^4 " H.J. Lu 0 siblings, 1 reply; 31+ messages in thread From: H.J. Lu @ 2021-09-20 17:06 UTC (permalink / raw) To: Richard Biener, Lili Cui Cc: Uros Bizjak, Jan Hubicka, Hongtao Liu, GCC Patches, Hongyu Wang On Mon, Sep 13, 2021 at 6:38 AM H.J. Lu <hjl.tools@gmail.com> wrote: > > On Tue, Sep 7, 2021 at 8:01 PM H.J. Lu <hjl.tools@gmail.com> wrote: > > > > On Sun, Aug 22, 2021 at 8:28 AM H.J. Lu <hjl.tools@gmail.com> wrote: > > > > > > On Tue, Mar 23, 2021 at 09:19:38AM +0100, Richard Biener wrote: > > > > On Tue, Mar 23, 2021 at 3:41 AM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote: > > > > > > > > > > > Hongyue, please collect code size differences on SPEC CPU 2017 and > > > > > > eembc. > > > > > > > > > > Here is code size difference for this patch > > > > > > > > Thanks, nothing too bad although slightly larger impacts than envisioned. > > > > > > > > > > PING. > > > > > > OK for master branch? > > > > > > Thanks. > > > > > > H.J. > > > --- > > > Simplify memcpy and memset inline strategies to avoid branches for > > > -mtune=generic: > > > > > > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector > > > load and store for up to 16 * 16 (256) bytes when the data size is > > > fixed and known. > > > 2. Inline only if data size is known to be <= 256. > > > a. Use "rep movsb/stosb" with simple code sequence if the data size > > > is a constant. > > > b. Use loop if data size is not a constant. > > > 3. Use memcpy/memset libray function if data size is unknown or > 256. > > > > > > > PING: > > > > https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577889.html > > > > PING. This should fix: > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294 > PING. -- H.J. ^ permalink raw reply [flat|nested] 31+ messages in thread
* PING^4 [PATCH] x86: Update memcpy/memset inline strategies for -mtune=generic 2021-09-20 17:06 ` PING^3 " H.J. Lu @ 2021-10-01 15:24 ` H.J. Lu 0 siblings, 0 replies; 31+ messages in thread From: H.J. Lu @ 2021-10-01 15:24 UTC (permalink / raw) To: Richard Biener, Lili Cui, Jeff Law, Jakub Jelinek Cc: Uros Bizjak, Jan Hubicka, Hongtao Liu, GCC Patches, Hongyu Wang On Mon, Sep 20, 2021 at 10:06 AM H.J. Lu <hjl.tools@gmail.com> wrote: > > On Mon, Sep 13, 2021 at 6:38 AM H.J. Lu <hjl.tools@gmail.com> wrote: > > > > On Tue, Sep 7, 2021 at 8:01 PM H.J. Lu <hjl.tools@gmail.com> wrote: > > > > > > On Sun, Aug 22, 2021 at 8:28 AM H.J. Lu <hjl.tools@gmail.com> wrote: > > > > > > > > On Tue, Mar 23, 2021 at 09:19:38AM +0100, Richard Biener wrote: > > > > > On Tue, Mar 23, 2021 at 3:41 AM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote: > > > > > > > > > > > > > Hongyue, please collect code size differences on SPEC CPU 2017 and > > > > > > > eembc. > > > > > > > > > > > > Here is code size difference for this patch > > > > > > > > > > Thanks, nothing too bad although slightly larger impacts than envisioned. > > > > > > > > > > > > > PING. > > > > > > > > OK for master branch? > > > > > > > > Thanks. > > > > > > > > H.J. > > > > --- > > > > Simplify memcpy and memset inline strategies to avoid branches for > > > > -mtune=generic: > > > > > > > > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector > > > > load and store for up to 16 * 16 (256) bytes when the data size is > > > > fixed and known. > > > > 2. Inline only if data size is known to be <= 256. > > > > a. Use "rep movsb/stosb" with simple code sequence if the data size > > > > is a constant. > > > > b. Use loop if data size is not a constant. > > > > 3. Use memcpy/memset libray function if data size is unknown or > 256. > > > > > > > > > > PING: > > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577889.html > > > > > > > PING. This should fix: > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294 > > > > PING. > Any comments or objections to this patch? -- H.J. ^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2021-10-01 15:25 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-22 13:16 [PATCH 0/3] x86: Update memcpy/memset inline strategies H.J. Lu
2021-03-22 13:16 ` [PATCH 1/3] x86: Update memcpy/memset inline strategies for Ice Lake H.J. Lu
2021-03-22 14:10 ` Jan Hubicka
2021-03-22 23:57 ` [PATCH v2 " H.J. Lu
2021-03-29 13:43 ` H.J. Lu
2021-03-31 6:59 ` Richard Biener
2021-03-31 8:05 ` Jan Hubicka
2021-03-31 13:09 ` H.J. Lu
2021-03-31 13:40 ` Jan Hubicka
2021-03-31 13:47 ` Jan Hubicka
2021-03-31 15:41 ` H.J. Lu
2021-03-31 17:43 ` Jan Hubicka
2021-03-31 17:54 ` H.J. Lu
2021-04-01 5:57 ` Hongyu Wang
2021-03-22 13:16 ` [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs H.J. Lu
2021-04-05 13:45 ` H.J. Lu
2021-04-05 21:14 ` Jan Hubicka
2021-04-05 21:53 ` H.J. Lu
2021-04-06 9:09 ` Hongyu Wang
2021-04-06 9:51 ` Jan Hubicka
2021-04-06 12:34 ` H.J. Lu
2021-03-22 13:16 ` [PATCH 3/3] x86: Update memcpy/memset inline strategies for -mtune=generic H.J. Lu
2021-03-22 13:29 ` Richard Biener
2021-03-22 13:38 ` H.J. Lu
2021-03-23 2:41 ` Hongyu Wang
2021-03-23 8:19 ` Richard Biener
2021-08-22 15:28 ` PING [PATCH] " H.J. Lu
2021-09-08 3:01 ` PING^2 " H.J. Lu
2021-09-13 13:38 ` H.J. Lu
2021-09-20 17:06 ` PING^3 " H.J. Lu
2021-10-01 15:24 ` PING^4 " H.J. Lu