* [PATCH] [i386] Remove pass_cpb which is related to enable avx512 embedded broadcast from constant pool. @ 2021-07-14 2:34 liuhongt 2021-07-14 4:40 ` Hongtao Liu 0 siblings, 1 reply; 6+ messages in thread From: liuhongt @ 2021-07-14 2:34 UTC (permalink / raw) To: gcc-patches By optimizing vector movement to broadcast in ix86_expand_vector_move during pass_expand, pass_reload/LRA can automatically generate an avx512 embedded broadcast, pass_cpb is not needed. Considering that in the absence of avx512f, broadcast from memory is still slightly faster than loading the entire memory, so always enable broadcast. benchmark: https://gitlab.com/x86-benchmarks/microbenchmark/-/tree/vaddps/broadcast The performance diff strategy : cycles memory : 1046611188 memory : 1255420817 memory : 1044720793 memory : 1253414145 average : 1097868397 broadcast : 1044430688 broadcast : 1044477630 broadcast : 1253554603 broadcast : 1044561934 average : 1096756213 But however broadcast has larger size. the size diff size broadcast.o text data bss dec hex filename 137 0 0 137 89 broadcast.o size memory.o text data bss dec hex filename 115 0 0 115 73 memory.o Bootstrapped and regtested on x86_64-linux-gnu{-m32,} gcc/ChangeLog: * config/i386/i386-expand.c (ix86_broadcast_from_integer_constant): Rename to .. (ix86_broadcast_from_constant): .. this, and extend it to handle float mode. (ix86_expand_vector_move): Extend to float mode. * config/i386/i386-features.c (replace_constant_pool_with_broadcast): Remove. (remove_partial_avx_dependency_gate): Ditto. (constant_pool_broadcast): Ditto. (class pass_constant_pool_broadcast): Ditto. (make_pass_constant_pool_broadcast): Ditto. (remove_partial_avx_dependency): Adjust gate. * config/i386/i386-passes.def: Remove pass_constant_pool_broadcast. * config/i386/i386-protos.h (make_pass_constant_pool_broadcast): Remove. gcc/testsuite/ChangeLog: * gcc.target/i386/fuse-caller-save-xmm.c: Adjust testcase. --- gcc/config/i386/i386-expand.c | 29 +++- gcc/config/i386/i386-features.c | 157 +----------------- gcc/config/i386/i386-passes.def | 1 - gcc/config/i386/i386-protos.h | 1 - .../gcc.target/i386/fuse-caller-save-xmm.c | 2 +- 5 files changed, 26 insertions(+), 164 deletions(-) diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c index 69ea79e6123..ba870145acd 100644 --- a/gcc/config/i386/i386-expand.c +++ b/gcc/config/i386/i386-expand.c @@ -453,8 +453,10 @@ ix86_expand_move (machine_mode mode, rtx operands[]) emit_insn (gen_rtx_SET (op0, op1)); } +/* OP is a memref of CONST_VECTOR, return scalar constant mem + if CONST_VECTOR is a vec_duplicate, else return NULL. */ static rtx -ix86_broadcast_from_integer_constant (machine_mode mode, rtx op) +ix86_broadcast_from_constant (machine_mode mode, rtx op) { int nunits = GET_MODE_NUNITS (mode); if (nunits < 2) @@ -462,7 +464,8 @@ ix86_broadcast_from_integer_constant (machine_mode mode, rtx op) /* Don't use integer vector broadcast if we can't move from GPR to SSE register directly. */ - if (!TARGET_INTER_UNIT_MOVES_TO_VEC) + if (!TARGET_INTER_UNIT_MOVES_TO_VEC + && INTEGRAL_MODE_P (mode)) return nullptr; /* Convert CONST_VECTOR to a non-standard SSE constant integer @@ -470,12 +473,17 @@ ix86_broadcast_from_integer_constant (machine_mode mode, rtx op) if (!(TARGET_AVX2 || (TARGET_AVX && (GET_MODE_INNER (mode) == SImode - || GET_MODE_INNER (mode) == DImode))) + || GET_MODE_INNER (mode) == DImode)) + || FLOAT_MODE_P (mode)) || standard_sse_constant_p (op, mode)) return nullptr; - /* Don't broadcast from a 64-bit integer constant in 32-bit mode. */ - if (GET_MODE_INNER (mode) == DImode && !TARGET_64BIT) + /* Don't broadcast from a 64-bit integer constant in 32-bit mode. + We can still put 64-bit integer constant in memory when + avx512 embed broadcast is available. */ + if (GET_MODE_INNER (mode) == DImode && !TARGET_64BIT + && (!TARGET_AVX512F + || (GET_MODE_SIZE (mode) < 64 && !TARGET_AVX512VL))) return nullptr; if (GET_MODE_INNER (mode) == TImode) @@ -561,17 +569,20 @@ ix86_expand_vector_move (machine_mode mode, rtx operands[]) if (can_create_pseudo_p () && GET_MODE_SIZE (mode) >= 16 - && GET_MODE_CLASS (mode) == MODE_VECTOR_INT + && VECTOR_MODE_P (mode) && (MEM_P (op1) && SYMBOL_REF_P (XEXP (op1, 0)) && CONSTANT_POOL_ADDRESS_P (XEXP (op1, 0)))) { - rtx first = ix86_broadcast_from_integer_constant (mode, op1); + rtx first = ix86_broadcast_from_constant (mode, op1); if (first != nullptr) { /* Broadcast to XMM/YMM/ZMM register from an integer - constant. */ - op1 = ix86_gen_scratch_sse_rtx (mode); + constant or scalar mem. */ + op1 = gen_reg_rtx (mode); + if (FLOAT_MODE_P (mode) + || (!TARGET_64BIT && GET_MODE_INNER (mode) == DImode)) + first = force_const_mem (GET_MODE_INNER (mode), first); bool ok = ix86_expand_vector_init_duplicate (false, mode, op1, first); gcc_assert (ok); diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c index cbd430a2ecf..d9c6652d1c9 100644 --- a/gcc/config/i386/i386-features.c +++ b/gcc/config/i386/i386-features.c @@ -2136,81 +2136,6 @@ make_pass_insert_endbr_and_patchable_area (gcc::context *ctxt) return new pass_insert_endbr_and_patchable_area (ctxt); } -/* Replace all one-value const vector that are referenced by SYMBOL_REFs in x - with embedded broadcast. i.e.transform - - vpaddq .LC0(%rip), %zmm0, %zmm0 - ret - .LC0: - .quad 3 - .quad 3 - .quad 3 - .quad 3 - .quad 3 - .quad 3 - .quad 3 - .quad 3 - - to - - vpaddq .LC0(%rip){1to8}, %zmm0, %zmm0 - ret - .LC0: - .quad 3 */ -static void -replace_constant_pool_with_broadcast (rtx_insn *insn) -{ - subrtx_ptr_iterator::array_type array; - FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL) - { - rtx *loc = *iter; - rtx x = *loc; - rtx broadcast_mem, vec_dup, constant, first; - machine_mode mode; - - /* Constant pool. */ - if (!MEM_P (x) - || !SYMBOL_REF_P (XEXP (x, 0)) - || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0))) - continue; - - /* Const vector. */ - mode = GET_MODE (x); - if (!VECTOR_MODE_P (mode)) - return; - constant = get_pool_constant (XEXP (x, 0)); - if (GET_CODE (constant) != CONST_VECTOR) - return; - - /* There could be some rtx like - (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1"))) - but with "*.LC1" refer to V2DI constant vector. */ - if (GET_MODE (constant) != mode) - { - constant = simplify_subreg (mode, constant, GET_MODE (constant), 0); - if (constant == NULL_RTX || GET_CODE (constant) != CONST_VECTOR) - return; - } - first = XVECEXP (constant, 0, 0); - - for (int i = 1; i < GET_MODE_NUNITS (mode); ++i) - { - rtx tmp = XVECEXP (constant, 0, i); - /* Vector duplicate value. */ - if (!rtx_equal_p (tmp, first)) - return; - } - - /* Replace with embedded broadcast. */ - broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first); - vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem); - validate_change (insn, loc, vec_dup, 0); - - /* At most 1 memory_operand in an insn. */ - return; - } -} - /* At entry of the nearest common dominator for basic blocks with conversions, generate a single vxorps %xmmN, %xmmN, %xmmN @@ -2249,10 +2174,6 @@ remove_partial_avx_dependency (void) if (!NONDEBUG_INSN_P (insn)) continue; - /* Handle AVX512 embedded broadcast here to save compile time. */ - if (TARGET_AVX512F) - replace_constant_pool_with_broadcast (insn); - set = single_set (insn); if (!set) continue; @@ -2384,16 +2305,6 @@ remove_partial_avx_dependency (void) return 0; } -static bool -remove_partial_avx_dependency_gate () -{ - return (TARGET_AVX - && TARGET_SSE_PARTIAL_REG_DEPENDENCY - && TARGET_SSE_MATH - && optimize - && optimize_function_for_speed_p (cfun)); -} - namespace { const pass_data pass_data_remove_partial_avx_dependency = @@ -2419,7 +2330,11 @@ public: /* opt_pass methods: */ virtual bool gate (function *) { - return remove_partial_avx_dependency_gate (); + return (TARGET_AVX + && TARGET_SSE_PARTIAL_REG_DEPENDENCY + && TARGET_SSE_MATH + && optimize + && optimize_function_for_speed_p (cfun)); } virtual unsigned int execute (function *) @@ -2436,68 +2351,6 @@ make_pass_remove_partial_avx_dependency (gcc::context *ctxt) return new pass_remove_partial_avx_dependency (ctxt); } -/* For const vector having one duplicated value, there's no need to put - whole vector in the constant pool when target supports embedded broadcast. */ -static unsigned int -constant_pool_broadcast (void) -{ - timevar_push (TV_MACH_DEP); - rtx_insn *insn; - - for (insn = get_insns (); insn; insn = NEXT_INSN (insn)) - { - if (INSN_P (insn)) - replace_constant_pool_with_broadcast (insn); - } - timevar_pop (TV_MACH_DEP); - return 0; -} - -namespace { - -const pass_data pass_data_constant_pool_broadcast = -{ - RTL_PASS, /* type */ - "cpb", /* name */ - OPTGROUP_NONE, /* optinfo_flags */ - TV_MACH_DEP, /* tv_id */ - 0, /* properties_required */ - 0, /* properties_provided */ - 0, /* properties_destroyed */ - 0, /* todo_flags_start */ - TODO_df_finish, /* todo_flags_finish */ -}; - -class pass_constant_pool_broadcast : public rtl_opt_pass -{ -public: - pass_constant_pool_broadcast (gcc::context *ctxt) - : rtl_opt_pass (pass_data_constant_pool_broadcast, ctxt) - {} - - /* opt_pass methods: */ - virtual bool gate (function *) - { - /* Return false if rpad pass gate is true. - replace_constant_pool_with_broadcast is called - from both this pass and rpad pass. */ - return (TARGET_AVX512F && !remove_partial_avx_dependency_gate ()); - } - - virtual unsigned int execute (function *) - { - return constant_pool_broadcast (); - } -}; // class pass_cpb - -} // anon namespace - -rtl_opt_pass * -make_pass_constant_pool_broadcast (gcc::context *ctxt) -{ - return new pass_constant_pool_broadcast (ctxt); -} - /* This compares the priority of target features in function DECL1 and DECL2. It returns positive value if DECL1 is higher priority, negative value if DECL2 is higher priority and 0 if they are the diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def index 44df00e94ac..29baf8acd0b 100644 --- a/gcc/config/i386/i386-passes.def +++ b/gcc/config/i386/i386-passes.def @@ -33,4 +33,3 @@ along with GCC; see the file COPYING3. If not see INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbr_and_patchable_area); INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency); - INSERT_PASS_AFTER (pass_combine, 1, pass_constant_pool_broadcast); diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h index 51376fcc454..07ac02aff69 100644 --- a/gcc/config/i386/i386-protos.h +++ b/gcc/config/i386/i386-protos.h @@ -395,4 +395,3 @@ extern rtl_opt_pass *make_pass_insert_endbr_and_patchable_area (gcc::context *); extern rtl_opt_pass *make_pass_remove_partial_avx_dependency (gcc::context *); -extern rtl_opt_pass *make_pass_constant_pool_broadcast (gcc::context *); diff --git a/gcc/testsuite/gcc.target/i386/fuse-caller-save-xmm.c b/gcc/testsuite/gcc.target/i386/fuse-caller-save-xmm.c index 4deff93c1e8..b0d3dc38a0c 100644 --- a/gcc/testsuite/gcc.target/i386/fuse-caller-save-xmm.c +++ b/gcc/testsuite/gcc.target/i386/fuse-caller-save-xmm.c @@ -6,7 +6,7 @@ typedef double v2df __attribute__((vector_size (16))); static v2df __attribute__((noinline)) bar (v2df a) { - return a + (v2df){ 3.0, 3.0 }; + return a + (v2df){ 3.0, 4.0 }; } v2df __attribute__((noinline)) -- 2.18.1 ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] [i386] Remove pass_cpb which is related to enable avx512 embedded broadcast from constant pool. 2021-07-14 2:34 [PATCH] [i386] Remove pass_cpb which is related to enable avx512 embedded broadcast from constant pool liuhongt @ 2021-07-14 4:40 ` Hongtao Liu 2021-07-14 12:37 ` H.J. Lu 0 siblings, 1 reply; 6+ messages in thread From: Hongtao Liu @ 2021-07-14 4:40 UTC (permalink / raw) To: H. J. Lu, Uros Bizjak; +Cc: GCC Patches, liuhongt On Wed, Jul 14, 2021 at 10:34 AM liuhongt <hongtao.liu@intel.com> wrote: > > By optimizing vector movement to broadcast in ix86_expand_vector_move > during pass_expand, pass_reload/LRA can automatically generate an avx512 > embedded broadcast, pass_cpb is not needed. > > Considering that in the absence of avx512f, broadcast from memory is > still slightly faster than loading the entire memory, so always enable > broadcast. > > benchmark: > https://gitlab.com/x86-benchmarks/microbenchmark/-/tree/vaddps/broadcast > > The performance diff > > strategy : cycles > memory : 1046611188 > memory : 1255420817 > memory : 1044720793 > memory : 1253414145 > average : 1097868397 > > broadcast : 1044430688 > broadcast : 1044477630 > broadcast : 1253554603 > broadcast : 1044561934 > average : 1096756213 > > But however broadcast has larger size. > > the size diff > > size broadcast.o > text data bss dec hex filename > 137 0 0 137 89 broadcast.o > > size memory.o > text data bss dec hex filename > 115 0 0 115 73 memory.o > > Bootstrapped and regtested on x86_64-linux-gnu{-m32,} > > gcc/ChangeLog: > > * config/i386/i386-expand.c > (ix86_broadcast_from_integer_constant): Rename to .. > (ix86_broadcast_from_constant): .. this, and extend it to > handle float mode. > (ix86_expand_vector_move): Extend to float mode. > * config/i386/i386-features.c > (replace_constant_pool_with_broadcast): Remove. > (remove_partial_avx_dependency_gate): Ditto. > (constant_pool_broadcast): Ditto. > (class pass_constant_pool_broadcast): Ditto. > (make_pass_constant_pool_broadcast): Ditto. > (remove_partial_avx_dependency): Adjust gate. > * config/i386/i386-passes.def: Remove pass_constant_pool_broadcast. > * config/i386/i386-protos.h > (make_pass_constant_pool_broadcast): Remove. > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/fuse-caller-save-xmm.c: Adjust testcase. > --- > gcc/config/i386/i386-expand.c | 29 +++- > gcc/config/i386/i386-features.c | 157 +----------------- > gcc/config/i386/i386-passes.def | 1 - > gcc/config/i386/i386-protos.h | 1 - > .../gcc.target/i386/fuse-caller-save-xmm.c | 2 +- > 5 files changed, 26 insertions(+), 164 deletions(-) > > diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c > index 69ea79e6123..ba870145acd 100644 > --- a/gcc/config/i386/i386-expand.c > +++ b/gcc/config/i386/i386-expand.c > @@ -453,8 +453,10 @@ ix86_expand_move (machine_mode mode, rtx operands[]) > emit_insn (gen_rtx_SET (op0, op1)); > } > > +/* OP is a memref of CONST_VECTOR, return scalar constant mem > + if CONST_VECTOR is a vec_duplicate, else return NULL. */ > static rtx > -ix86_broadcast_from_integer_constant (machine_mode mode, rtx op) > +ix86_broadcast_from_constant (machine_mode mode, rtx op) > { > int nunits = GET_MODE_NUNITS (mode); > if (nunits < 2) > @@ -462,7 +464,8 @@ ix86_broadcast_from_integer_constant (machine_mode mode, rtx op) > > /* Don't use integer vector broadcast if we can't move from GPR to SSE > register directly. */ > - if (!TARGET_INTER_UNIT_MOVES_TO_VEC) > + if (!TARGET_INTER_UNIT_MOVES_TO_VEC > + && INTEGRAL_MODE_P (mode)) > return nullptr; > > /* Convert CONST_VECTOR to a non-standard SSE constant integer > @@ -470,12 +473,17 @@ ix86_broadcast_from_integer_constant (machine_mode mode, rtx op) > if (!(TARGET_AVX2 > || (TARGET_AVX > && (GET_MODE_INNER (mode) == SImode > - || GET_MODE_INNER (mode) == DImode))) > + || GET_MODE_INNER (mode) == DImode)) > + || FLOAT_MODE_P (mode)) > || standard_sse_constant_p (op, mode)) > return nullptr; > > - /* Don't broadcast from a 64-bit integer constant in 32-bit mode. */ > - if (GET_MODE_INNER (mode) == DImode && !TARGET_64BIT) > + /* Don't broadcast from a 64-bit integer constant in 32-bit mode. > + We can still put 64-bit integer constant in memory when > + avx512 embed broadcast is available. */ > + if (GET_MODE_INNER (mode) == DImode && !TARGET_64BIT > + && (!TARGET_AVX512F > + || (GET_MODE_SIZE (mode) < 64 && !TARGET_AVX512VL))) > return nullptr; > > if (GET_MODE_INNER (mode) == TImode) > @@ -561,17 +569,20 @@ ix86_expand_vector_move (machine_mode mode, rtx operands[]) > > if (can_create_pseudo_p () > && GET_MODE_SIZE (mode) >= 16 > - && GET_MODE_CLASS (mode) == MODE_VECTOR_INT > + && VECTOR_MODE_P (mode) > && (MEM_P (op1) > && SYMBOL_REF_P (XEXP (op1, 0)) > && CONSTANT_POOL_ADDRESS_P (XEXP (op1, 0)))) > { > - rtx first = ix86_broadcast_from_integer_constant (mode, op1); > + rtx first = ix86_broadcast_from_constant (mode, op1); > if (first != nullptr) > { > /* Broadcast to XMM/YMM/ZMM register from an integer > - constant. */ > - op1 = ix86_gen_scratch_sse_rtx (mode); > + constant or scalar mem. */ > + op1 = gen_reg_rtx (mode); I've changed ix86_gen_scratch_sse_rtx to gen_reg_rtx to let LRA/reload make the final decision for register usage, would that make sense? w/o that, hard register used here will prevent LRA/reload combine (set op1 (vec_duplicate: mem)) and (add (op1, op2)) to (add (op1, vec_duplicate: mem)) > + if (FLOAT_MODE_P (mode) > + || (!TARGET_64BIT && GET_MODE_INNER (mode) == DImode)) > + first = force_const_mem (GET_MODE_INNER (mode), first); > bool ok = ix86_expand_vector_init_duplicate (false, mode, > op1, first); > gcc_assert (ok); > diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c > index cbd430a2ecf..d9c6652d1c9 100644 > --- a/gcc/config/i386/i386-features.c > +++ b/gcc/config/i386/i386-features.c > @@ -2136,81 +2136,6 @@ make_pass_insert_endbr_and_patchable_area (gcc::context *ctxt) > return new pass_insert_endbr_and_patchable_area (ctxt); > } > > -/* Replace all one-value const vector that are referenced by SYMBOL_REFs in x > - with embedded broadcast. i.e.transform > - > - vpaddq .LC0(%rip), %zmm0, %zmm0 > - ret > - .LC0: > - .quad 3 > - .quad 3 > - .quad 3 > - .quad 3 > - .quad 3 > - .quad 3 > - .quad 3 > - .quad 3 > - > - to > - > - vpaddq .LC0(%rip){1to8}, %zmm0, %zmm0 > - ret > - .LC0: > - .quad 3 */ > -static void > -replace_constant_pool_with_broadcast (rtx_insn *insn) > -{ > - subrtx_ptr_iterator::array_type array; > - FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL) > - { > - rtx *loc = *iter; > - rtx x = *loc; > - rtx broadcast_mem, vec_dup, constant, first; > - machine_mode mode; > - > - /* Constant pool. */ > - if (!MEM_P (x) > - || !SYMBOL_REF_P (XEXP (x, 0)) > - || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0))) > - continue; > - > - /* Const vector. */ > - mode = GET_MODE (x); > - if (!VECTOR_MODE_P (mode)) > - return; > - constant = get_pool_constant (XEXP (x, 0)); > - if (GET_CODE (constant) != CONST_VECTOR) > - return; > - > - /* There could be some rtx like > - (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1"))) > - but with "*.LC1" refer to V2DI constant vector. */ > - if (GET_MODE (constant) != mode) > - { > - constant = simplify_subreg (mode, constant, GET_MODE (constant), 0); > - if (constant == NULL_RTX || GET_CODE (constant) != CONST_VECTOR) > - return; > - } > - first = XVECEXP (constant, 0, 0); > - > - for (int i = 1; i < GET_MODE_NUNITS (mode); ++i) > - { > - rtx tmp = XVECEXP (constant, 0, i); > - /* Vector duplicate value. */ > - if (!rtx_equal_p (tmp, first)) > - return; > - } > - > - /* Replace with embedded broadcast. */ > - broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first); > - vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem); > - validate_change (insn, loc, vec_dup, 0); > - > - /* At most 1 memory_operand in an insn. */ > - return; > - } > -} > - > /* At entry of the nearest common dominator for basic blocks with > conversions, generate a single > vxorps %xmmN, %xmmN, %xmmN > @@ -2249,10 +2174,6 @@ remove_partial_avx_dependency (void) > if (!NONDEBUG_INSN_P (insn)) > continue; > > - /* Handle AVX512 embedded broadcast here to save compile time. */ > - if (TARGET_AVX512F) > - replace_constant_pool_with_broadcast (insn); > - > set = single_set (insn); > if (!set) > continue; > @@ -2384,16 +2305,6 @@ remove_partial_avx_dependency (void) > return 0; > } > > -static bool > -remove_partial_avx_dependency_gate () > -{ > - return (TARGET_AVX > - && TARGET_SSE_PARTIAL_REG_DEPENDENCY > - && TARGET_SSE_MATH > - && optimize > - && optimize_function_for_speed_p (cfun)); > -} > - > namespace { > > const pass_data pass_data_remove_partial_avx_dependency = > @@ -2419,7 +2330,11 @@ public: > /* opt_pass methods: */ > virtual bool gate (function *) > { > - return remove_partial_avx_dependency_gate (); > + return (TARGET_AVX > + && TARGET_SSE_PARTIAL_REG_DEPENDENCY > + && TARGET_SSE_MATH > + && optimize > + && optimize_function_for_speed_p (cfun)); > } > > virtual unsigned int execute (function *) > @@ -2436,68 +2351,6 @@ make_pass_remove_partial_avx_dependency (gcc::context *ctxt) > return new pass_remove_partial_avx_dependency (ctxt); > } > > -/* For const vector having one duplicated value, there's no need to put > - whole vector in the constant pool when target supports embedded broadcast. */ > -static unsigned int > -constant_pool_broadcast (void) > -{ > - timevar_push (TV_MACH_DEP); > - rtx_insn *insn; > - > - for (insn = get_insns (); insn; insn = NEXT_INSN (insn)) > - { > - if (INSN_P (insn)) > - replace_constant_pool_with_broadcast (insn); > - } > - timevar_pop (TV_MACH_DEP); > - return 0; > -} > - > -namespace { > - > -const pass_data pass_data_constant_pool_broadcast = > -{ > - RTL_PASS, /* type */ > - "cpb", /* name */ > - OPTGROUP_NONE, /* optinfo_flags */ > - TV_MACH_DEP, /* tv_id */ > - 0, /* properties_required */ > - 0, /* properties_provided */ > - 0, /* properties_destroyed */ > - 0, /* todo_flags_start */ > - TODO_df_finish, /* todo_flags_finish */ > -}; > - > -class pass_constant_pool_broadcast : public rtl_opt_pass > -{ > -public: > - pass_constant_pool_broadcast (gcc::context *ctxt) > - : rtl_opt_pass (pass_data_constant_pool_broadcast, ctxt) > - {} > - > - /* opt_pass methods: */ > - virtual bool gate (function *) > - { > - /* Return false if rpad pass gate is true. > - replace_constant_pool_with_broadcast is called > - from both this pass and rpad pass. */ > - return (TARGET_AVX512F && !remove_partial_avx_dependency_gate ()); > - } > - > - virtual unsigned int execute (function *) > - { > - return constant_pool_broadcast (); > - } > -}; // class pass_cpb > - > -} // anon namespace > - > -rtl_opt_pass * > -make_pass_constant_pool_broadcast (gcc::context *ctxt) > -{ > - return new pass_constant_pool_broadcast (ctxt); > -} > - > /* This compares the priority of target features in function DECL1 > and DECL2. It returns positive value if DECL1 is higher priority, > negative value if DECL2 is higher priority and 0 if they are the > diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def > index 44df00e94ac..29baf8acd0b 100644 > --- a/gcc/config/i386/i386-passes.def > +++ b/gcc/config/i386/i386-passes.def > @@ -33,4 +33,3 @@ along with GCC; see the file COPYING3. If not see > INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbr_and_patchable_area); > > INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency); > - INSERT_PASS_AFTER (pass_combine, 1, pass_constant_pool_broadcast); > diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h > index 51376fcc454..07ac02aff69 100644 > --- a/gcc/config/i386/i386-protos.h > +++ b/gcc/config/i386/i386-protos.h > @@ -395,4 +395,3 @@ extern rtl_opt_pass *make_pass_insert_endbr_and_patchable_area > (gcc::context *); > extern rtl_opt_pass *make_pass_remove_partial_avx_dependency > (gcc::context *); > -extern rtl_opt_pass *make_pass_constant_pool_broadcast (gcc::context *); > diff --git a/gcc/testsuite/gcc.target/i386/fuse-caller-save-xmm.c b/gcc/testsuite/gcc.target/i386/fuse-caller-save-xmm.c > index 4deff93c1e8..b0d3dc38a0c 100644 > --- a/gcc/testsuite/gcc.target/i386/fuse-caller-save-xmm.c > +++ b/gcc/testsuite/gcc.target/i386/fuse-caller-save-xmm.c > @@ -6,7 +6,7 @@ typedef double v2df __attribute__((vector_size (16))); > static v2df __attribute__((noinline)) > bar (v2df a) > { > - return a + (v2df){ 3.0, 3.0 }; > + return a + (v2df){ 3.0, 4.0 }; > } > > v2df __attribute__((noinline)) > -- > 2.18.1 > -- BR, Hongtao ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] [i386] Remove pass_cpb which is related to enable avx512 embedded broadcast from constant pool. 2021-07-14 4:40 ` Hongtao Liu @ 2021-07-14 12:37 ` H.J. Lu 2021-07-22 5:12 ` Hongtao Liu 2021-08-06 6:26 ` Hongtao Liu 0 siblings, 2 replies; 6+ messages in thread From: H.J. Lu @ 2021-07-14 12:37 UTC (permalink / raw) To: Hongtao Liu; +Cc: Uros Bizjak, GCC Patches, liuhongt On Tue, Jul 13, 2021 at 9:35 PM Hongtao Liu <crazylht@gmail.com> wrote: > > On Wed, Jul 14, 2021 at 10:34 AM liuhongt <hongtao.liu@intel.com> wrote: > > > > By optimizing vector movement to broadcast in ix86_expand_vector_move > > during pass_expand, pass_reload/LRA can automatically generate an avx512 > > embedded broadcast, pass_cpb is not needed. > > > > Considering that in the absence of avx512f, broadcast from memory is > > still slightly faster than loading the entire memory, so always enable > > broadcast. > > > > benchmark: > > https://gitlab.com/x86-benchmarks/microbenchmark/-/tree/vaddps/broadcast > > > > The performance diff > > > > strategy : cycles > > memory : 1046611188 > > memory : 1255420817 > > memory : 1044720793 > > memory : 1253414145 > > average : 1097868397 > > > > broadcast : 1044430688 > > broadcast : 1044477630 > > broadcast : 1253554603 > > broadcast : 1044561934 > > average : 1096756213 > > > > But however broadcast has larger size. > > > > the size diff > > > > size broadcast.o > > text data bss dec hex filename > > 137 0 0 137 89 broadcast.o > > > > size memory.o > > text data bss dec hex filename > > 115 0 0 115 73 memory.o > > > > Bootstrapped and regtested on x86_64-linux-gnu{-m32,} > > > > gcc/ChangeLog: > > > > * config/i386/i386-expand.c > > (ix86_broadcast_from_integer_constant): Rename to .. > > (ix86_broadcast_from_constant): .. this, and extend it to > > handle float mode. > > (ix86_expand_vector_move): Extend to float mode. > > * config/i386/i386-features.c > > (replace_constant_pool_with_broadcast): Remove. > > (remove_partial_avx_dependency_gate): Ditto. > > (constant_pool_broadcast): Ditto. > > (class pass_constant_pool_broadcast): Ditto. > > (make_pass_constant_pool_broadcast): Ditto. > > (remove_partial_avx_dependency): Adjust gate. > > * config/i386/i386-passes.def: Remove pass_constant_pool_broadcast. > > * config/i386/i386-protos.h > > (make_pass_constant_pool_broadcast): Remove. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.target/i386/fuse-caller-save-xmm.c: Adjust testcase. > > --- > > gcc/config/i386/i386-expand.c | 29 +++- > > gcc/config/i386/i386-features.c | 157 +----------------- > > gcc/config/i386/i386-passes.def | 1 - > > gcc/config/i386/i386-protos.h | 1 - > > .../gcc.target/i386/fuse-caller-save-xmm.c | 2 +- > > 5 files changed, 26 insertions(+), 164 deletions(-) > > > > diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c > > index 69ea79e6123..ba870145acd 100644 > > --- a/gcc/config/i386/i386-expand.c > > +++ b/gcc/config/i386/i386-expand.c > > @@ -453,8 +453,10 @@ ix86_expand_move (machine_mode mode, rtx operands[]) > > emit_insn (gen_rtx_SET (op0, op1)); > > } > > > > +/* OP is a memref of CONST_VECTOR, return scalar constant mem > > + if CONST_VECTOR is a vec_duplicate, else return NULL. */ > > static rtx > > -ix86_broadcast_from_integer_constant (machine_mode mode, rtx op) > > +ix86_broadcast_from_constant (machine_mode mode, rtx op) > > { > > int nunits = GET_MODE_NUNITS (mode); > > if (nunits < 2) > > @@ -462,7 +464,8 @@ ix86_broadcast_from_integer_constant (machine_mode mode, rtx op) > > > > /* Don't use integer vector broadcast if we can't move from GPR to SSE > > register directly. */ > > - if (!TARGET_INTER_UNIT_MOVES_TO_VEC) > > + if (!TARGET_INTER_UNIT_MOVES_TO_VEC > > + && INTEGRAL_MODE_P (mode)) > > return nullptr; > > > > /* Convert CONST_VECTOR to a non-standard SSE constant integer > > @@ -470,12 +473,17 @@ ix86_broadcast_from_integer_constant (machine_mode mode, rtx op) > > if (!(TARGET_AVX2 > > || (TARGET_AVX > > && (GET_MODE_INNER (mode) == SImode > > - || GET_MODE_INNER (mode) == DImode))) > > + || GET_MODE_INNER (mode) == DImode)) > > + || FLOAT_MODE_P (mode)) > > || standard_sse_constant_p (op, mode)) > > return nullptr; > > > > - /* Don't broadcast from a 64-bit integer constant in 32-bit mode. */ > > - if (GET_MODE_INNER (mode) == DImode && !TARGET_64BIT) > > + /* Don't broadcast from a 64-bit integer constant in 32-bit mode. > > + We can still put 64-bit integer constant in memory when > > + avx512 embed broadcast is available. */ > > + if (GET_MODE_INNER (mode) == DImode && !TARGET_64BIT > > + && (!TARGET_AVX512F > > + || (GET_MODE_SIZE (mode) < 64 && !TARGET_AVX512VL))) > > return nullptr; > > > > if (GET_MODE_INNER (mode) == TImode) > > @@ -561,17 +569,20 @@ ix86_expand_vector_move (machine_mode mode, rtx operands[]) > > > > if (can_create_pseudo_p () > > && GET_MODE_SIZE (mode) >= 16 > > - && GET_MODE_CLASS (mode) == MODE_VECTOR_INT > > + && VECTOR_MODE_P (mode) > > && (MEM_P (op1) > > && SYMBOL_REF_P (XEXP (op1, 0)) > > && CONSTANT_POOL_ADDRESS_P (XEXP (op1, 0)))) > > { > > - rtx first = ix86_broadcast_from_integer_constant (mode, op1); > > + rtx first = ix86_broadcast_from_constant (mode, op1); > > if (first != nullptr) > > { > > /* Broadcast to XMM/YMM/ZMM register from an integer > > - constant. */ > > - op1 = ix86_gen_scratch_sse_rtx (mode); > > + constant or scalar mem. */ > > + op1 = gen_reg_rtx (mode); > I've changed ix86_gen_scratch_sse_rtx to gen_reg_rtx to let LRA/reload > make the final decision for register usage, would that make sense? Hard registers are used for 2 purposes: 1. Prevent stack realignment when the original code doesn't use vector registers, which is the same for memcpy and memset. 2. Prevent combine to convert constant broadcast to load from constant pool. Please add an argument to ix86_gen_scratch_sse_rtx to use hard registers for the above 2 cases. You can get pseudo registers for your use cases. > w/o that, hard register used here will prevent LRA/reload combine (set > op1 (vec_duplicate: mem)) and (add (op1, op2)) to (add (op1, > vec_duplicate: mem)) > > > + if (FLOAT_MODE_P (mode) > > + || (!TARGET_64BIT && GET_MODE_INNER (mode) == DImode)) > > + first = force_const_mem (GET_MODE_INNER (mode), first); > > bool ok = ix86_expand_vector_init_duplicate (false, mode, > > op1, first); > > gcc_assert (ok); > > diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c > > index cbd430a2ecf..d9c6652d1c9 100644 > > --- a/gcc/config/i386/i386-features.c > > +++ b/gcc/config/i386/i386-features.c > > @@ -2136,81 +2136,6 @@ make_pass_insert_endbr_and_patchable_area (gcc::context *ctxt) > > return new pass_insert_endbr_and_patchable_area (ctxt); > > } > > > > -/* Replace all one-value const vector that are referenced by SYMBOL_REFs in x > > - with embedded broadcast. i.e.transform > > - > > - vpaddq .LC0(%rip), %zmm0, %zmm0 > > - ret > > - .LC0: > > - .quad 3 > > - .quad 3 > > - .quad 3 > > - .quad 3 > > - .quad 3 > > - .quad 3 > > - .quad 3 > > - .quad 3 > > - > > - to > > - > > - vpaddq .LC0(%rip){1to8}, %zmm0, %zmm0 > > - ret > > - .LC0: > > - .quad 3 */ > > -static void > > -replace_constant_pool_with_broadcast (rtx_insn *insn) > > -{ > > - subrtx_ptr_iterator::array_type array; > > - FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL) > > - { > > - rtx *loc = *iter; > > - rtx x = *loc; > > - rtx broadcast_mem, vec_dup, constant, first; > > - machine_mode mode; > > - > > - /* Constant pool. */ > > - if (!MEM_P (x) > > - || !SYMBOL_REF_P (XEXP (x, 0)) > > - || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0))) > > - continue; > > - > > - /* Const vector. */ > > - mode = GET_MODE (x); > > - if (!VECTOR_MODE_P (mode)) > > - return; > > - constant = get_pool_constant (XEXP (x, 0)); > > - if (GET_CODE (constant) != CONST_VECTOR) > > - return; > > - > > - /* There could be some rtx like > > - (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1"))) > > - but with "*.LC1" refer to V2DI constant vector. */ > > - if (GET_MODE (constant) != mode) > > - { > > - constant = simplify_subreg (mode, constant, GET_MODE (constant), 0); > > - if (constant == NULL_RTX || GET_CODE (constant) != CONST_VECTOR) > > - return; > > - } > > - first = XVECEXP (constant, 0, 0); > > - > > - for (int i = 1; i < GET_MODE_NUNITS (mode); ++i) > > - { > > - rtx tmp = XVECEXP (constant, 0, i); > > - /* Vector duplicate value. */ > > - if (!rtx_equal_p (tmp, first)) > > - return; > > - } > > - > > - /* Replace with embedded broadcast. */ > > - broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first); > > - vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem); > > - validate_change (insn, loc, vec_dup, 0); > > - > > - /* At most 1 memory_operand in an insn. */ > > - return; > > - } > > -} > > - > > /* At entry of the nearest common dominator for basic blocks with > > conversions, generate a single > > vxorps %xmmN, %xmmN, %xmmN > > @@ -2249,10 +2174,6 @@ remove_partial_avx_dependency (void) > > if (!NONDEBUG_INSN_P (insn)) > > continue; > > > > - /* Handle AVX512 embedded broadcast here to save compile time. */ > > - if (TARGET_AVX512F) > > - replace_constant_pool_with_broadcast (insn); > > - > > set = single_set (insn); > > if (!set) > > continue; > > @@ -2384,16 +2305,6 @@ remove_partial_avx_dependency (void) > > return 0; > > } > > > > -static bool > > -remove_partial_avx_dependency_gate () > > -{ > > - return (TARGET_AVX > > - && TARGET_SSE_PARTIAL_REG_DEPENDENCY > > - && TARGET_SSE_MATH > > - && optimize > > - && optimize_function_for_speed_p (cfun)); > > -} > > - > > namespace { > > > > const pass_data pass_data_remove_partial_avx_dependency = > > @@ -2419,7 +2330,11 @@ public: > > /* opt_pass methods: */ > > virtual bool gate (function *) > > { > > - return remove_partial_avx_dependency_gate (); > > + return (TARGET_AVX > > + && TARGET_SSE_PARTIAL_REG_DEPENDENCY > > + && TARGET_SSE_MATH > > + && optimize > > + && optimize_function_for_speed_p (cfun)); > > } > > > > virtual unsigned int execute (function *) > > @@ -2436,68 +2351,6 @@ make_pass_remove_partial_avx_dependency (gcc::context *ctxt) > > return new pass_remove_partial_avx_dependency (ctxt); > > } > > > > -/* For const vector having one duplicated value, there's no need to put > > - whole vector in the constant pool when target supports embedded broadcast. */ > > -static unsigned int > > -constant_pool_broadcast (void) > > -{ > > - timevar_push (TV_MACH_DEP); > > - rtx_insn *insn; > > - > > - for (insn = get_insns (); insn; insn = NEXT_INSN (insn)) > > - { > > - if (INSN_P (insn)) > > - replace_constant_pool_with_broadcast (insn); > > - } > > - timevar_pop (TV_MACH_DEP); > > - return 0; > > -} > > - > > -namespace { > > - > > -const pass_data pass_data_constant_pool_broadcast = > > -{ > > - RTL_PASS, /* type */ > > - "cpb", /* name */ > > - OPTGROUP_NONE, /* optinfo_flags */ > > - TV_MACH_DEP, /* tv_id */ > > - 0, /* properties_required */ > > - 0, /* properties_provided */ > > - 0, /* properties_destroyed */ > > - 0, /* todo_flags_start */ > > - TODO_df_finish, /* todo_flags_finish */ > > -}; > > - > > -class pass_constant_pool_broadcast : public rtl_opt_pass > > -{ > > -public: > > - pass_constant_pool_broadcast (gcc::context *ctxt) > > - : rtl_opt_pass (pass_data_constant_pool_broadcast, ctxt) > > - {} > > - > > - /* opt_pass methods: */ > > - virtual bool gate (function *) > > - { > > - /* Return false if rpad pass gate is true. > > - replace_constant_pool_with_broadcast is called > > - from both this pass and rpad pass. */ > > - return (TARGET_AVX512F && !remove_partial_avx_dependency_gate ()); > > - } > > - > > - virtual unsigned int execute (function *) > > - { > > - return constant_pool_broadcast (); > > - } > > -}; // class pass_cpb > > - > > -} // anon namespace > > - > > -rtl_opt_pass * > > -make_pass_constant_pool_broadcast (gcc::context *ctxt) > > -{ > > - return new pass_constant_pool_broadcast (ctxt); > > -} > > - > > /* This compares the priority of target features in function DECL1 > > and DECL2. It returns positive value if DECL1 is higher priority, > > negative value if DECL2 is higher priority and 0 if they are the > > diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def > > index 44df00e94ac..29baf8acd0b 100644 > > --- a/gcc/config/i386/i386-passes.def > > +++ b/gcc/config/i386/i386-passes.def > > @@ -33,4 +33,3 @@ along with GCC; see the file COPYING3. If not see > > INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbr_and_patchable_area); > > > > INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency); > > - INSERT_PASS_AFTER (pass_combine, 1, pass_constant_pool_broadcast); > > diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h > > index 51376fcc454..07ac02aff69 100644 > > --- a/gcc/config/i386/i386-protos.h > > +++ b/gcc/config/i386/i386-protos.h > > @@ -395,4 +395,3 @@ extern rtl_opt_pass *make_pass_insert_endbr_and_patchable_area > > (gcc::context *); > > extern rtl_opt_pass *make_pass_remove_partial_avx_dependency > > (gcc::context *); > > -extern rtl_opt_pass *make_pass_constant_pool_broadcast (gcc::context *); > > diff --git a/gcc/testsuite/gcc.target/i386/fuse-caller-save-xmm.c b/gcc/testsuite/gcc.target/i386/fuse-caller-save-xmm.c > > index 4deff93c1e8..b0d3dc38a0c 100644 > > --- a/gcc/testsuite/gcc.target/i386/fuse-caller-save-xmm.c > > +++ b/gcc/testsuite/gcc.target/i386/fuse-caller-save-xmm.c > > @@ -6,7 +6,7 @@ typedef double v2df __attribute__((vector_size (16))); > > static v2df __attribute__((noinline)) > > bar (v2df a) > > { > > - return a + (v2df){ 3.0, 3.0 }; > > + return a + (v2df){ 3.0, 4.0 }; > > } > > > > v2df __attribute__((noinline)) > > -- > > 2.18.1 > > > > > -- > BR, > Hongtao -- H.J. ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] [i386] Remove pass_cpb which is related to enable avx512 embedded broadcast from constant pool. 2021-07-14 12:37 ` H.J. Lu @ 2021-07-22 5:12 ` Hongtao Liu 2021-08-06 6:26 ` Hongtao Liu 1 sibling, 0 replies; 6+ messages in thread From: Hongtao Liu @ 2021-07-22 5:12 UTC (permalink / raw) To: H.J. Lu; +Cc: Uros Bizjak, GCC Patches, liuhongt [-- Attachment #1: Type: text/plain, Size: 15622 bytes --] On Wed, Jul 14, 2021 at 8:38 PM H.J. Lu <hjl.tools@gmail.com> wrote: > > On Tue, Jul 13, 2021 at 9:35 PM Hongtao Liu <crazylht@gmail.com> wrote: > > > > On Wed, Jul 14, 2021 at 10:34 AM liuhongt <hongtao.liu@intel.com> wrote: > > > > > > By optimizing vector movement to broadcast in ix86_expand_vector_move > > > during pass_expand, pass_reload/LRA can automatically generate an avx512 > > > embedded broadcast, pass_cpb is not needed. > > > > > > Considering that in the absence of avx512f, broadcast from memory is > > > still slightly faster than loading the entire memory, so always enable > > > broadcast. > > > > > > benchmark: > > > https://gitlab.com/x86-benchmarks/microbenchmark/-/tree/vaddps/broadcast > > > > > > The performance diff > > > > > > strategy : cycles > > > memory : 1046611188 > > > memory : 1255420817 > > > memory : 1044720793 > > > memory : 1253414145 > > > average : 1097868397 > > > > > > broadcast : 1044430688 > > > broadcast : 1044477630 > > > broadcast : 1253554603 > > > broadcast : 1044561934 > > > average : 1096756213 > > > > > > But however broadcast has larger size. > > > > > > the size diff > > > > > > size broadcast.o > > > text data bss dec hex filename > > > 137 0 0 137 89 broadcast.o > > > > > > size memory.o > > > text data bss dec hex filename > > > 115 0 0 115 73 memory.o > > > > > > Bootstrapped and regtested on x86_64-linux-gnu{-m32,} > > > > > > gcc/ChangeLog: > > > > > > * config/i386/i386-expand.c > > > (ix86_broadcast_from_integer_constant): Rename to .. > > > (ix86_broadcast_from_constant): .. this, and extend it to > > > handle float mode. > > > (ix86_expand_vector_move): Extend to float mode. > > > * config/i386/i386-features.c > > > (replace_constant_pool_with_broadcast): Remove. > > > (remove_partial_avx_dependency_gate): Ditto. > > > (constant_pool_broadcast): Ditto. > > > (class pass_constant_pool_broadcast): Ditto. > > > (make_pass_constant_pool_broadcast): Ditto. > > > (remove_partial_avx_dependency): Adjust gate. > > > * config/i386/i386-passes.def: Remove pass_constant_pool_broadcast. > > > * config/i386/i386-protos.h > > > (make_pass_constant_pool_broadcast): Remove. > > > > > > gcc/testsuite/ChangeLog: > > > > > > * gcc.target/i386/fuse-caller-save-xmm.c: Adjust testcase. > > > --- > > > gcc/config/i386/i386-expand.c | 29 +++- > > > gcc/config/i386/i386-features.c | 157 +----------------- > > > gcc/config/i386/i386-passes.def | 1 - > > > gcc/config/i386/i386-protos.h | 1 - > > > .../gcc.target/i386/fuse-caller-save-xmm.c | 2 +- > > > 5 files changed, 26 insertions(+), 164 deletions(-) > > > > > > diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c > > > index 69ea79e6123..ba870145acd 100644 > > > --- a/gcc/config/i386/i386-expand.c > > > +++ b/gcc/config/i386/i386-expand.c > > > @@ -453,8 +453,10 @@ ix86_expand_move (machine_mode mode, rtx operands[]) > > > emit_insn (gen_rtx_SET (op0, op1)); > > > } > > > > > > +/* OP is a memref of CONST_VECTOR, return scalar constant mem > > > + if CONST_VECTOR is a vec_duplicate, else return NULL. */ > > > static rtx > > > -ix86_broadcast_from_integer_constant (machine_mode mode, rtx op) > > > +ix86_broadcast_from_constant (machine_mode mode, rtx op) > > > { > > > int nunits = GET_MODE_NUNITS (mode); > > > if (nunits < 2) > > > @@ -462,7 +464,8 @@ ix86_broadcast_from_integer_constant (machine_mode mode, rtx op) > > > > > > /* Don't use integer vector broadcast if we can't move from GPR to SSE > > > register directly. */ > > > - if (!TARGET_INTER_UNIT_MOVES_TO_VEC) > > > + if (!TARGET_INTER_UNIT_MOVES_TO_VEC > > > + && INTEGRAL_MODE_P (mode)) > > > return nullptr; > > > > > > /* Convert CONST_VECTOR to a non-standard SSE constant integer > > > @@ -470,12 +473,17 @@ ix86_broadcast_from_integer_constant (machine_mode mode, rtx op) > > > if (!(TARGET_AVX2 > > > || (TARGET_AVX > > > && (GET_MODE_INNER (mode) == SImode > > > - || GET_MODE_INNER (mode) == DImode))) > > > + || GET_MODE_INNER (mode) == DImode)) > > > + || FLOAT_MODE_P (mode)) > > > || standard_sse_constant_p (op, mode)) > > > return nullptr; > > > > > > - /* Don't broadcast from a 64-bit integer constant in 32-bit mode. */ > > > - if (GET_MODE_INNER (mode) == DImode && !TARGET_64BIT) > > > + /* Don't broadcast from a 64-bit integer constant in 32-bit mode. > > > + We can still put 64-bit integer constant in memory when > > > + avx512 embed broadcast is available. */ > > > + if (GET_MODE_INNER (mode) == DImode && !TARGET_64BIT > > > + && (!TARGET_AVX512F > > > + || (GET_MODE_SIZE (mode) < 64 && !TARGET_AVX512VL))) > > > return nullptr; > > > > > > if (GET_MODE_INNER (mode) == TImode) > > > @@ -561,17 +569,20 @@ ix86_expand_vector_move (machine_mode mode, rtx operands[]) > > > > > > if (can_create_pseudo_p () > > > && GET_MODE_SIZE (mode) >= 16 > > > - && GET_MODE_CLASS (mode) == MODE_VECTOR_INT > > > + && VECTOR_MODE_P (mode) > > > && (MEM_P (op1) > > > && SYMBOL_REF_P (XEXP (op1, 0)) > > > && CONSTANT_POOL_ADDRESS_P (XEXP (op1, 0)))) > > > { > > > - rtx first = ix86_broadcast_from_integer_constant (mode, op1); > > > + rtx first = ix86_broadcast_from_constant (mode, op1); > > > if (first != nullptr) > > > { > > > /* Broadcast to XMM/YMM/ZMM register from an integer > > > - constant. */ > > > - op1 = ix86_gen_scratch_sse_rtx (mode); > > > + constant or scalar mem. */ > > > + op1 = gen_reg_rtx (mode); > > I've changed ix86_gen_scratch_sse_rtx to gen_reg_rtx to let LRA/reload > > make the final decision for register usage, would that make sense? > > Hard registers are used for 2 purposes: > > 1. Prevent stack realignment when the original code doesn't use vector > registers, which is the same for memcpy and memset. > 2. Prevent combine to convert constant broadcast to load from constant > pool. > > Please add an argument to ix86_gen_scratch_sse_rtx to use hard registers > for the above 2 cases. You can get pseudo registers for your use cases. > Ok, Here's the patch i'm going to check in. > > w/o that, hard register used here will prevent LRA/reload combine (set > > op1 (vec_duplicate: mem)) and (add (op1, op2)) to (add (op1, > > vec_duplicate: mem)) > > > > > + if (FLOAT_MODE_P (mode) > > > + || (!TARGET_64BIT && GET_MODE_INNER (mode) == DImode)) > > > + first = force_const_mem (GET_MODE_INNER (mode), first); > > > bool ok = ix86_expand_vector_init_duplicate (false, mode, > > > op1, first); > > > gcc_assert (ok); > > > diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c > > > index cbd430a2ecf..d9c6652d1c9 100644 > > > --- a/gcc/config/i386/i386-features.c > > > +++ b/gcc/config/i386/i386-features.c > > > @@ -2136,81 +2136,6 @@ make_pass_insert_endbr_and_patchable_area (gcc::context *ctxt) > > > return new pass_insert_endbr_and_patchable_area (ctxt); > > > } > > > > > > -/* Replace all one-value const vector that are referenced by SYMBOL_REFs in x > > > - with embedded broadcast. i.e.transform > > > - > > > - vpaddq .LC0(%rip), %zmm0, %zmm0 > > > - ret > > > - .LC0: > > > - .quad 3 > > > - .quad 3 > > > - .quad 3 > > > - .quad 3 > > > - .quad 3 > > > - .quad 3 > > > - .quad 3 > > > - .quad 3 > > > - > > > - to > > > - > > > - vpaddq .LC0(%rip){1to8}, %zmm0, %zmm0 > > > - ret > > > - .LC0: > > > - .quad 3 */ > > > -static void > > > -replace_constant_pool_with_broadcast (rtx_insn *insn) > > > -{ > > > - subrtx_ptr_iterator::array_type array; > > > - FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL) > > > - { > > > - rtx *loc = *iter; > > > - rtx x = *loc; > > > - rtx broadcast_mem, vec_dup, constant, first; > > > - machine_mode mode; > > > - > > > - /* Constant pool. */ > > > - if (!MEM_P (x) > > > - || !SYMBOL_REF_P (XEXP (x, 0)) > > > - || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0))) > > > - continue; > > > - > > > - /* Const vector. */ > > > - mode = GET_MODE (x); > > > - if (!VECTOR_MODE_P (mode)) > > > - return; > > > - constant = get_pool_constant (XEXP (x, 0)); > > > - if (GET_CODE (constant) != CONST_VECTOR) > > > - return; > > > - > > > - /* There could be some rtx like > > > - (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1"))) > > > - but with "*.LC1" refer to V2DI constant vector. */ > > > - if (GET_MODE (constant) != mode) > > > - { > > > - constant = simplify_subreg (mode, constant, GET_MODE (constant), 0); > > > - if (constant == NULL_RTX || GET_CODE (constant) != CONST_VECTOR) > > > - return; > > > - } > > > - first = XVECEXP (constant, 0, 0); > > > - > > > - for (int i = 1; i < GET_MODE_NUNITS (mode); ++i) > > > - { > > > - rtx tmp = XVECEXP (constant, 0, i); > > > - /* Vector duplicate value. */ > > > - if (!rtx_equal_p (tmp, first)) > > > - return; > > > - } > > > - > > > - /* Replace with embedded broadcast. */ > > > - broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first); > > > - vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem); > > > - validate_change (insn, loc, vec_dup, 0); > > > - > > > - /* At most 1 memory_operand in an insn. */ > > > - return; > > > - } > > > -} > > > - > > > /* At entry of the nearest common dominator for basic blocks with > > > conversions, generate a single > > > vxorps %xmmN, %xmmN, %xmmN > > > @@ -2249,10 +2174,6 @@ remove_partial_avx_dependency (void) > > > if (!NONDEBUG_INSN_P (insn)) > > > continue; > > > > > > - /* Handle AVX512 embedded broadcast here to save compile time. */ > > > - if (TARGET_AVX512F) > > > - replace_constant_pool_with_broadcast (insn); > > > - > > > set = single_set (insn); > > > if (!set) > > > continue; > > > @@ -2384,16 +2305,6 @@ remove_partial_avx_dependency (void) > > > return 0; > > > } > > > > > > -static bool > > > -remove_partial_avx_dependency_gate () > > > -{ > > > - return (TARGET_AVX > > > - && TARGET_SSE_PARTIAL_REG_DEPENDENCY > > > - && TARGET_SSE_MATH > > > - && optimize > > > - && optimize_function_for_speed_p (cfun)); > > > -} > > > - > > > namespace { > > > > > > const pass_data pass_data_remove_partial_avx_dependency = > > > @@ -2419,7 +2330,11 @@ public: > > > /* opt_pass methods: */ > > > virtual bool gate (function *) > > > { > > > - return remove_partial_avx_dependency_gate (); > > > + return (TARGET_AVX > > > + && TARGET_SSE_PARTIAL_REG_DEPENDENCY > > > + && TARGET_SSE_MATH > > > + && optimize > > > + && optimize_function_for_speed_p (cfun)); > > > } > > > > > > virtual unsigned int execute (function *) > > > @@ -2436,68 +2351,6 @@ make_pass_remove_partial_avx_dependency (gcc::context *ctxt) > > > return new pass_remove_partial_avx_dependency (ctxt); > > > } > > > > > > -/* For const vector having one duplicated value, there's no need to put > > > - whole vector in the constant pool when target supports embedded broadcast. */ > > > -static unsigned int > > > -constant_pool_broadcast (void) > > > -{ > > > - timevar_push (TV_MACH_DEP); > > > - rtx_insn *insn; > > > - > > > - for (insn = get_insns (); insn; insn = NEXT_INSN (insn)) > > > - { > > > - if (INSN_P (insn)) > > > - replace_constant_pool_with_broadcast (insn); > > > - } > > > - timevar_pop (TV_MACH_DEP); > > > - return 0; > > > -} > > > - > > > -namespace { > > > - > > > -const pass_data pass_data_constant_pool_broadcast = > > > -{ > > > - RTL_PASS, /* type */ > > > - "cpb", /* name */ > > > - OPTGROUP_NONE, /* optinfo_flags */ > > > - TV_MACH_DEP, /* tv_id */ > > > - 0, /* properties_required */ > > > - 0, /* properties_provided */ > > > - 0, /* properties_destroyed */ > > > - 0, /* todo_flags_start */ > > > - TODO_df_finish, /* todo_flags_finish */ > > > -}; > > > - > > > -class pass_constant_pool_broadcast : public rtl_opt_pass > > > -{ > > > -public: > > > - pass_constant_pool_broadcast (gcc::context *ctxt) > > > - : rtl_opt_pass (pass_data_constant_pool_broadcast, ctxt) > > > - {} > > > - > > > - /* opt_pass methods: */ > > > - virtual bool gate (function *) > > > - { > > > - /* Return false if rpad pass gate is true. > > > - replace_constant_pool_with_broadcast is called > > > - from both this pass and rpad pass. */ > > > - return (TARGET_AVX512F && !remove_partial_avx_dependency_gate ()); > > > - } > > > - > > > - virtual unsigned int execute (function *) > > > - { > > > - return constant_pool_broadcast (); > > > - } > > > -}; // class pass_cpb > > > - > > > -} // anon namespace > > > - > > > -rtl_opt_pass * > > > -make_pass_constant_pool_broadcast (gcc::context *ctxt) > > > -{ > > > - return new pass_constant_pool_broadcast (ctxt); > > > -} > > > - > > > /* This compares the priority of target features in function DECL1 > > > and DECL2. It returns positive value if DECL1 is higher priority, > > > negative value if DECL2 is higher priority and 0 if they are the > > > diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def > > > index 44df00e94ac..29baf8acd0b 100644 > > > --- a/gcc/config/i386/i386-passes.def > > > +++ b/gcc/config/i386/i386-passes.def > > > @@ -33,4 +33,3 @@ along with GCC; see the file COPYING3. If not see > > > INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbr_and_patchable_area); > > > > > > INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency); > > > - INSERT_PASS_AFTER (pass_combine, 1, pass_constant_pool_broadcast); > > > diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h > > > index 51376fcc454..07ac02aff69 100644 > > > --- a/gcc/config/i386/i386-protos.h > > > +++ b/gcc/config/i386/i386-protos.h > > > @@ -395,4 +395,3 @@ extern rtl_opt_pass *make_pass_insert_endbr_and_patchable_area > > > (gcc::context *); > > > extern rtl_opt_pass *make_pass_remove_partial_avx_dependency > > > (gcc::context *); > > > -extern rtl_opt_pass *make_pass_constant_pool_broadcast (gcc::context *); > > > diff --git a/gcc/testsuite/gcc.target/i386/fuse-caller-save-xmm.c b/gcc/testsuite/gcc.target/i386/fuse-caller-save-xmm.c > > > index 4deff93c1e8..b0d3dc38a0c 100644 > > > --- a/gcc/testsuite/gcc.target/i386/fuse-caller-save-xmm.c > > > +++ b/gcc/testsuite/gcc.target/i386/fuse-caller-save-xmm.c > > > @@ -6,7 +6,7 @@ typedef double v2df __attribute__((vector_size (16))); > > > static v2df __attribute__((noinline)) > > > bar (v2df a) > > > { > > > - return a + (v2df){ 3.0, 3.0 }; > > > + return a + (v2df){ 3.0, 4.0 }; > > > } > > > > > > v2df __attribute__((noinline)) > > > -- > > > 2.18.1 > > > > > > > > > -- > > BR, > > Hongtao > > > > -- > H.J. -- BR, Hongtao [-- Attachment #2: V2-0001-Remove-pass_cpb-which-is-related-to-enable-avx512-em.patch --] [-- Type: text/x-patch, Size: 12366 bytes --] From c6cf0603b262bc41ec06f29f647214daf2a5965c Mon Sep 17 00:00:00 2001 From: liuhongt <hongtao.liu@intel.com> Date: Tue, 13 Jul 2021 18:22:03 +0800 Subject: [PATCH] Remove pass_cpb which is related to enable avx512 embedded broadcast from constant pool. By optimizing vector movement to broadcast in ix86_expand_vector_move during pass_expand, pass_reload/LRA can automatically generate an avx512 embedded broadcast, pass_cpb is not needed. Considering that in the absence of avx512f, broadcast from memory is still slightly faster than loading the entire memory, so always enable broadcast. benchmark: https://gitlab.com/x86-benchmarks/microbenchmark/-/tree/vaddps/broadcast The performance diff strategy : cycles memory : 1046611188 memory : 1255420817 memory : 1044720793 memory : 1253414145 average : 1097868397 broadcast : 1044430688 broadcast : 1044477630 broadcast : 1253554603 broadcast : 1044561934 average : 1096756213 But however broadcast has larger size. the size diff size broadcast.o text data bss dec hex filename 137 0 0 137 89 broadcast.o size memory.o text data bss dec hex filename 115 0 0 115 73 memory.o gcc/ChangeLog: * config/i386/i386-expand.c (ix86_broadcast_from_integer_constant): Rename to .. (ix86_broadcast_from_constant): .. this, and extend it to handle float mode. (ix86_expand_vector_move): Extend to float mode. * config/i386/i386-features.c (replace_constant_pool_with_broadcast): Remove. (remove_partial_avx_dependency_gate): Ditto. (constant_pool_broadcast): Ditto. (class pass_constant_pool_broadcast): Ditto. (make_pass_constant_pool_broadcast): Ditto. (remove_partial_avx_dependency): Adjust gate. * config/i386/i386-passes.def: Remove pass_constant_pool_broadcast. * config/i386/i386-protos.h (make_pass_constant_pool_broadcast): Remove. gcc/testsuite/ChangeLog: * gcc.target/i386/fuse-caller-save-xmm.c: Adjust testcase. --- gcc/config/i386/i386-expand.c | 36 +++- gcc/config/i386/i386-features.c | 157 +----------------- gcc/config/i386/i386-passes.def | 1 - gcc/config/i386/i386-protos.h | 1 - .../gcc.target/i386/fuse-caller-save-xmm.c | 2 +- 5 files changed, 34 insertions(+), 163 deletions(-) diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c index 69ea79e6123..896bd685b1f 100644 --- a/gcc/config/i386/i386-expand.c +++ b/gcc/config/i386/i386-expand.c @@ -453,8 +453,10 @@ ix86_expand_move (machine_mode mode, rtx operands[]) emit_insn (gen_rtx_SET (op0, op1)); } +/* OP is a memref of CONST_VECTOR, return scalar constant mem + if CONST_VECTOR is a vec_duplicate, else return NULL. */ static rtx -ix86_broadcast_from_integer_constant (machine_mode mode, rtx op) +ix86_broadcast_from_constant (machine_mode mode, rtx op) { int nunits = GET_MODE_NUNITS (mode); if (nunits < 2) @@ -462,7 +464,8 @@ ix86_broadcast_from_integer_constant (machine_mode mode, rtx op) /* Don't use integer vector broadcast if we can't move from GPR to SSE register directly. */ - if (!TARGET_INTER_UNIT_MOVES_TO_VEC) + if (!TARGET_INTER_UNIT_MOVES_TO_VEC + && INTEGRAL_MODE_P (mode)) return nullptr; /* Convert CONST_VECTOR to a non-standard SSE constant integer @@ -470,12 +473,17 @@ ix86_broadcast_from_integer_constant (machine_mode mode, rtx op) if (!(TARGET_AVX2 || (TARGET_AVX && (GET_MODE_INNER (mode) == SImode - || GET_MODE_INNER (mode) == DImode))) + || GET_MODE_INNER (mode) == DImode)) + || FLOAT_MODE_P (mode)) || standard_sse_constant_p (op, mode)) return nullptr; - /* Don't broadcast from a 64-bit integer constant in 32-bit mode. */ - if (GET_MODE_INNER (mode) == DImode && !TARGET_64BIT) + /* Don't broadcast from a 64-bit integer constant in 32-bit mode. + We can still put 64-bit integer constant in memory when + avx512 embed broadcast is available. */ + if (GET_MODE_INNER (mode) == DImode && !TARGET_64BIT + && (!TARGET_AVX512F + || (GET_MODE_SIZE (mode) < 64 && !TARGET_AVX512VL))) return nullptr; if (GET_MODE_INNER (mode) == TImode) @@ -561,17 +569,29 @@ ix86_expand_vector_move (machine_mode mode, rtx operands[]) if (can_create_pseudo_p () && GET_MODE_SIZE (mode) >= 16 - && GET_MODE_CLASS (mode) == MODE_VECTOR_INT + && VECTOR_MODE_P (mode) && (MEM_P (op1) && SYMBOL_REF_P (XEXP (op1, 0)) && CONSTANT_POOL_ADDRESS_P (XEXP (op1, 0)))) { - rtx first = ix86_broadcast_from_integer_constant (mode, op1); + rtx first = ix86_broadcast_from_constant (mode, op1); if (first != nullptr) { /* Broadcast to XMM/YMM/ZMM register from an integer - constant. */ + constant or scalar mem. */ + /* Hard registers are used for 2 purposes: + 1. Prevent stack realignment when the original code + doesn't use vector registers, which is the same for + memcpy and memset. + 2. Prevent combine to convert constant broadcast to + load from constant pool. */ op1 = ix86_gen_scratch_sse_rtx (mode); + if (FLOAT_MODE_P (mode) + || (!TARGET_64BIT && GET_MODE_INNER (mode) == DImode)) + { + first = force_const_mem (GET_MODE_INNER (mode), first); + op1 = gen_reg_rtx (mode); + } bool ok = ix86_expand_vector_init_duplicate (false, mode, op1, first); gcc_assert (ok); diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c index cbd430a2ecf..d9c6652d1c9 100644 --- a/gcc/config/i386/i386-features.c +++ b/gcc/config/i386/i386-features.c @@ -2136,81 +2136,6 @@ make_pass_insert_endbr_and_patchable_area (gcc::context *ctxt) return new pass_insert_endbr_and_patchable_area (ctxt); } -/* Replace all one-value const vector that are referenced by SYMBOL_REFs in x - with embedded broadcast. i.e.transform - - vpaddq .LC0(%rip), %zmm0, %zmm0 - ret - .LC0: - .quad 3 - .quad 3 - .quad 3 - .quad 3 - .quad 3 - .quad 3 - .quad 3 - .quad 3 - - to - - vpaddq .LC0(%rip){1to8}, %zmm0, %zmm0 - ret - .LC0: - .quad 3 */ -static void -replace_constant_pool_with_broadcast (rtx_insn *insn) -{ - subrtx_ptr_iterator::array_type array; - FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL) - { - rtx *loc = *iter; - rtx x = *loc; - rtx broadcast_mem, vec_dup, constant, first; - machine_mode mode; - - /* Constant pool. */ - if (!MEM_P (x) - || !SYMBOL_REF_P (XEXP (x, 0)) - || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0))) - continue; - - /* Const vector. */ - mode = GET_MODE (x); - if (!VECTOR_MODE_P (mode)) - return; - constant = get_pool_constant (XEXP (x, 0)); - if (GET_CODE (constant) != CONST_VECTOR) - return; - - /* There could be some rtx like - (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1"))) - but with "*.LC1" refer to V2DI constant vector. */ - if (GET_MODE (constant) != mode) - { - constant = simplify_subreg (mode, constant, GET_MODE (constant), 0); - if (constant == NULL_RTX || GET_CODE (constant) != CONST_VECTOR) - return; - } - first = XVECEXP (constant, 0, 0); - - for (int i = 1; i < GET_MODE_NUNITS (mode); ++i) - { - rtx tmp = XVECEXP (constant, 0, i); - /* Vector duplicate value. */ - if (!rtx_equal_p (tmp, first)) - return; - } - - /* Replace with embedded broadcast. */ - broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first); - vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem); - validate_change (insn, loc, vec_dup, 0); - - /* At most 1 memory_operand in an insn. */ - return; - } -} - /* At entry of the nearest common dominator for basic blocks with conversions, generate a single vxorps %xmmN, %xmmN, %xmmN @@ -2249,10 +2174,6 @@ remove_partial_avx_dependency (void) if (!NONDEBUG_INSN_P (insn)) continue; - /* Handle AVX512 embedded broadcast here to save compile time. */ - if (TARGET_AVX512F) - replace_constant_pool_with_broadcast (insn); - set = single_set (insn); if (!set) continue; @@ -2384,16 +2305,6 @@ remove_partial_avx_dependency (void) return 0; } -static bool -remove_partial_avx_dependency_gate () -{ - return (TARGET_AVX - && TARGET_SSE_PARTIAL_REG_DEPENDENCY - && TARGET_SSE_MATH - && optimize - && optimize_function_for_speed_p (cfun)); -} - namespace { const pass_data pass_data_remove_partial_avx_dependency = @@ -2419,7 +2330,11 @@ public: /* opt_pass methods: */ virtual bool gate (function *) { - return remove_partial_avx_dependency_gate (); + return (TARGET_AVX + && TARGET_SSE_PARTIAL_REG_DEPENDENCY + && TARGET_SSE_MATH + && optimize + && optimize_function_for_speed_p (cfun)); } virtual unsigned int execute (function *) @@ -2436,68 +2351,6 @@ make_pass_remove_partial_avx_dependency (gcc::context *ctxt) return new pass_remove_partial_avx_dependency (ctxt); } -/* For const vector having one duplicated value, there's no need to put - whole vector in the constant pool when target supports embedded broadcast. */ -static unsigned int -constant_pool_broadcast (void) -{ - timevar_push (TV_MACH_DEP); - rtx_insn *insn; - - for (insn = get_insns (); insn; insn = NEXT_INSN (insn)) - { - if (INSN_P (insn)) - replace_constant_pool_with_broadcast (insn); - } - timevar_pop (TV_MACH_DEP); - return 0; -} - -namespace { - -const pass_data pass_data_constant_pool_broadcast = -{ - RTL_PASS, /* type */ - "cpb", /* name */ - OPTGROUP_NONE, /* optinfo_flags */ - TV_MACH_DEP, /* tv_id */ - 0, /* properties_required */ - 0, /* properties_provided */ - 0, /* properties_destroyed */ - 0, /* todo_flags_start */ - TODO_df_finish, /* todo_flags_finish */ -}; - -class pass_constant_pool_broadcast : public rtl_opt_pass -{ -public: - pass_constant_pool_broadcast (gcc::context *ctxt) - : rtl_opt_pass (pass_data_constant_pool_broadcast, ctxt) - {} - - /* opt_pass methods: */ - virtual bool gate (function *) - { - /* Return false if rpad pass gate is true. - replace_constant_pool_with_broadcast is called - from both this pass and rpad pass. */ - return (TARGET_AVX512F && !remove_partial_avx_dependency_gate ()); - } - - virtual unsigned int execute (function *) - { - return constant_pool_broadcast (); - } -}; // class pass_cpb - -} // anon namespace - -rtl_opt_pass * -make_pass_constant_pool_broadcast (gcc::context *ctxt) -{ - return new pass_constant_pool_broadcast (ctxt); -} - /* This compares the priority of target features in function DECL1 and DECL2. It returns positive value if DECL1 is higher priority, negative value if DECL2 is higher priority and 0 if they are the diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def index 44df00e94ac..29baf8acd0b 100644 --- a/gcc/config/i386/i386-passes.def +++ b/gcc/config/i386/i386-passes.def @@ -33,4 +33,3 @@ along with GCC; see the file COPYING3. If not see INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbr_and_patchable_area); INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency); - INSERT_PASS_AFTER (pass_combine, 1, pass_constant_pool_broadcast); diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h index 51376fcc454..07ac02aff69 100644 --- a/gcc/config/i386/i386-protos.h +++ b/gcc/config/i386/i386-protos.h @@ -395,4 +395,3 @@ extern rtl_opt_pass *make_pass_insert_endbr_and_patchable_area (gcc::context *); extern rtl_opt_pass *make_pass_remove_partial_avx_dependency (gcc::context *); -extern rtl_opt_pass *make_pass_constant_pool_broadcast (gcc::context *); diff --git a/gcc/testsuite/gcc.target/i386/fuse-caller-save-xmm.c b/gcc/testsuite/gcc.target/i386/fuse-caller-save-xmm.c index 4deff93c1e8..b0d3dc38a0c 100644 --- a/gcc/testsuite/gcc.target/i386/fuse-caller-save-xmm.c +++ b/gcc/testsuite/gcc.target/i386/fuse-caller-save-xmm.c @@ -6,7 +6,7 @@ typedef double v2df __attribute__((vector_size (16))); static v2df __attribute__((noinline)) bar (v2df a) { - return a + (v2df){ 3.0, 3.0 }; + return a + (v2df){ 3.0, 4.0 }; } v2df __attribute__((noinline)) -- 2.18.1 ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] [i386] Remove pass_cpb which is related to enable avx512 embedded broadcast from constant pool. 2021-07-14 12:37 ` H.J. Lu 2021-07-22 5:12 ` Hongtao Liu @ 2021-08-06 6:26 ` Hongtao Liu 2021-08-06 12:26 ` H.J. Lu 1 sibling, 1 reply; 6+ messages in thread From: Hongtao Liu @ 2021-08-06 6:26 UTC (permalink / raw) To: H.J. Lu; +Cc: Uros Bizjak, GCC Patches, liuhongt On Wed, Jul 14, 2021 at 8:38 PM H.J. Lu <hjl.tools@gmail.com> wrote: > > On Tue, Jul 13, 2021 at 9:35 PM Hongtao Liu <crazylht@gmail.com> wrote: > > > > On Wed, Jul 14, 2021 at 10:34 AM liuhongt <hongtao.liu@intel.com> wrote: > > > > > > By optimizing vector movement to broadcast in ix86_expand_vector_move > > > during pass_expand, pass_reload/LRA can automatically generate an avx512 > > > embedded broadcast, pass_cpb is not needed. > > > > > > Considering that in the absence of avx512f, broadcast from memory is > > > still slightly faster than loading the entire memory, so always enable > > > broadcast. > > > > > > benchmark: > > > https://gitlab.com/x86-benchmarks/microbenchmark/-/tree/vaddps/broadcast > > > > > > The performance diff > > > > > > strategy : cycles > > > memory : 1046611188 > > > memory : 1255420817 > > > memory : 1044720793 > > > memory : 1253414145 > > > average : 1097868397 > > > > > > broadcast : 1044430688 > > > broadcast : 1044477630 > > > broadcast : 1253554603 > > > broadcast : 1044561934 > > > average : 1096756213 > > > > > > But however broadcast has larger size. > > > > > > the size diff > > > > > > size broadcast.o > > > text data bss dec hex filename > > > 137 0 0 137 89 broadcast.o > > > > > > size memory.o > > > text data bss dec hex filename > > > 115 0 0 115 73 memory.o > > > > > > Bootstrapped and regtested on x86_64-linux-gnu{-m32,} > > > > > > gcc/ChangeLog: > > > > > > * config/i386/i386-expand.c > > > (ix86_broadcast_from_integer_constant): Rename to .. > > > (ix86_broadcast_from_constant): .. this, and extend it to > > > handle float mode. > > > (ix86_expand_vector_move): Extend to float mode. > > > * config/i386/i386-features.c > > > (replace_constant_pool_with_broadcast): Remove. > > > (remove_partial_avx_dependency_gate): Ditto. > > > (constant_pool_broadcast): Ditto. > > > (class pass_constant_pool_broadcast): Ditto. > > > (make_pass_constant_pool_broadcast): Ditto. > > > (remove_partial_avx_dependency): Adjust gate. > > > * config/i386/i386-passes.def: Remove pass_constant_pool_broadcast. > > > * config/i386/i386-protos.h > > > (make_pass_constant_pool_broadcast): Remove. > > > > > > gcc/testsuite/ChangeLog: > > > > > > * gcc.target/i386/fuse-caller-save-xmm.c: Adjust testcase. > > > --- > > > gcc/config/i386/i386-expand.c | 29 +++- > > > gcc/config/i386/i386-features.c | 157 +----------------- > > > gcc/config/i386/i386-passes.def | 1 - > > > gcc/config/i386/i386-protos.h | 1 - > > > .../gcc.target/i386/fuse-caller-save-xmm.c | 2 +- > > > 5 files changed, 26 insertions(+), 164 deletions(-) > > > > > > diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c > > > index 69ea79e6123..ba870145acd 100644 > > > --- a/gcc/config/i386/i386-expand.c > > > +++ b/gcc/config/i386/i386-expand.c > > > @@ -453,8 +453,10 @@ ix86_expand_move (machine_mode mode, rtx operands[]) > > > emit_insn (gen_rtx_SET (op0, op1)); > > > } > > > > > > +/* OP is a memref of CONST_VECTOR, return scalar constant mem > > > + if CONST_VECTOR is a vec_duplicate, else return NULL. */ > > > static rtx > > > -ix86_broadcast_from_integer_constant (machine_mode mode, rtx op) > > > +ix86_broadcast_from_constant (machine_mode mode, rtx op) > > > { > > > int nunits = GET_MODE_NUNITS (mode); > > > if (nunits < 2) > > > @@ -462,7 +464,8 @@ ix86_broadcast_from_integer_constant (machine_mode mode, rtx op) > > > > > > /* Don't use integer vector broadcast if we can't move from GPR to SSE > > > register directly. */ > > > - if (!TARGET_INTER_UNIT_MOVES_TO_VEC) > > > + if (!TARGET_INTER_UNIT_MOVES_TO_VEC > > > + && INTEGRAL_MODE_P (mode)) > > > return nullptr; > > > > > > /* Convert CONST_VECTOR to a non-standard SSE constant integer > > > @@ -470,12 +473,17 @@ ix86_broadcast_from_integer_constant (machine_mode mode, rtx op) > > > if (!(TARGET_AVX2 > > > || (TARGET_AVX > > > && (GET_MODE_INNER (mode) == SImode > > > - || GET_MODE_INNER (mode) == DImode))) > > > + || GET_MODE_INNER (mode) == DImode)) > > > + || FLOAT_MODE_P (mode)) > > > || standard_sse_constant_p (op, mode)) > > > return nullptr; > > > > > > - /* Don't broadcast from a 64-bit integer constant in 32-bit mode. */ > > > - if (GET_MODE_INNER (mode) == DImode && !TARGET_64BIT) > > > + /* Don't broadcast from a 64-bit integer constant in 32-bit mode. > > > + We can still put 64-bit integer constant in memory when > > > + avx512 embed broadcast is available. */ > > > + if (GET_MODE_INNER (mode) == DImode && !TARGET_64BIT > > > + && (!TARGET_AVX512F > > > + || (GET_MODE_SIZE (mode) < 64 && !TARGET_AVX512VL))) > > > return nullptr; > > > > > > if (GET_MODE_INNER (mode) == TImode) > > > @@ -561,17 +569,20 @@ ix86_expand_vector_move (machine_mode mode, rtx operands[]) > > > > > > if (can_create_pseudo_p () > > > && GET_MODE_SIZE (mode) >= 16 > > > - && GET_MODE_CLASS (mode) == MODE_VECTOR_INT > > > + && VECTOR_MODE_P (mode) > > > && (MEM_P (op1) > > > && SYMBOL_REF_P (XEXP (op1, 0)) > > > && CONSTANT_POOL_ADDRESS_P (XEXP (op1, 0)))) > > > { > > > - rtx first = ix86_broadcast_from_integer_constant (mode, op1); > > > + rtx first = ix86_broadcast_from_constant (mode, op1); > > > if (first != nullptr) > > > { > > > /* Broadcast to XMM/YMM/ZMM register from an integer > > > - constant. */ > > > - op1 = ix86_gen_scratch_sse_rtx (mode); > > > + constant or scalar mem. */ > > > + op1 = gen_reg_rtx (mode); > > I've changed ix86_gen_scratch_sse_rtx to gen_reg_rtx to let LRA/reload > > make the final decision for register usage, would that make sense? > > Hard registers are used for 2 purposes: > > 1. Prevent stack realignment when the original code doesn't use vector > registers, which is the same for memcpy and memset. > 2. Prevent combine to convert constant broadcast to load from constant > pool. > W/ gcc version 12.0.0 20210806 (experimental) (GCC) and replaced ix86_gen_scratch_sse_rtx with gen_reg_rtx i didn't observe the failure related to upper cases. only failure like gcc.target/i386/pr100865-10b.c scan-assembler-not vzeroupper gcc.target/i386/pr100865-4b.c scan-assembler-not vzeroupper gcc.target/i386/pr100865-6b.c scan-assembler-not vzeroupper I guess it's related to xmm31 usage. Have you check in all patches rely on this part? > Please add an argument to ix86_gen_scratch_sse_rtx to use hard registers > for the above 2 cases. You can get pseudo registers for your use cases. > > > w/o that, hard register used here will prevent LRA/reload combine (set > > op1 (vec_duplicate: mem)) and (add (op1, op2)) to (add (op1, > > vec_duplicate: mem)) > > > > > + if (FLOAT_MODE_P (mode) > > > + || (!TARGET_64BIT && GET_MODE_INNER (mode) == DImode)) > > > + first = force_const_mem (GET_MODE_INNER (mode), first); > > > bool ok = ix86_expand_vector_init_duplicate (false, mode, > > > op1, first); > > > gcc_assert (ok); > > > diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c > > > index cbd430a2ecf..d9c6652d1c9 100644 > > > --- a/gcc/config/i386/i386-features.c > > > +++ b/gcc/config/i386/i386-features.c > > > @@ -2136,81 +2136,6 @@ make_pass_insert_endbr_and_patchable_area (gcc::context *ctxt) > > > return new pass_insert_endbr_and_patchable_area (ctxt); > > > } > > > > > > -/* Replace all one-value const vector that are referenced by SYMBOL_REFs in x > > > - with embedded broadcast. i.e.transform > > > - > > > - vpaddq .LC0(%rip), %zmm0, %zmm0 > > > - ret > > > - .LC0: > > > - .quad 3 > > > - .quad 3 > > > - .quad 3 > > > - .quad 3 > > > - .quad 3 > > > - .quad 3 > > > - .quad 3 > > > - .quad 3 > > > - > > > - to > > > - > > > - vpaddq .LC0(%rip){1to8}, %zmm0, %zmm0 > > > - ret > > > - .LC0: > > > - .quad 3 */ > > > -static void > > > -replace_constant_pool_with_broadcast (rtx_insn *insn) > > > -{ > > > - subrtx_ptr_iterator::array_type array; > > > - FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL) > > > - { > > > - rtx *loc = *iter; > > > - rtx x = *loc; > > > - rtx broadcast_mem, vec_dup, constant, first; > > > - machine_mode mode; > > > - > > > - /* Constant pool. */ > > > - if (!MEM_P (x) > > > - || !SYMBOL_REF_P (XEXP (x, 0)) > > > - || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0))) > > > - continue; > > > - > > > - /* Const vector. */ > > > - mode = GET_MODE (x); > > > - if (!VECTOR_MODE_P (mode)) > > > - return; > > > - constant = get_pool_constant (XEXP (x, 0)); > > > - if (GET_CODE (constant) != CONST_VECTOR) > > > - return; > > > - > > > - /* There could be some rtx like > > > - (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1"))) > > > - but with "*.LC1" refer to V2DI constant vector. */ > > > - if (GET_MODE (constant) != mode) > > > - { > > > - constant = simplify_subreg (mode, constant, GET_MODE (constant), 0); > > > - if (constant == NULL_RTX || GET_CODE (constant) != CONST_VECTOR) > > > - return; > > > - } > > > - first = XVECEXP (constant, 0, 0); > > > - > > > - for (int i = 1; i < GET_MODE_NUNITS (mode); ++i) > > > - { > > > - rtx tmp = XVECEXP (constant, 0, i); > > > - /* Vector duplicate value. */ > > > - if (!rtx_equal_p (tmp, first)) > > > - return; > > > - } > > > - > > > - /* Replace with embedded broadcast. */ > > > - broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first); > > > - vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem); > > > - validate_change (insn, loc, vec_dup, 0); > > > - > > > - /* At most 1 memory_operand in an insn. */ > > > - return; > > > - } > > > -} > > > - > > > /* At entry of the nearest common dominator for basic blocks with > > > conversions, generate a single > > > vxorps %xmmN, %xmmN, %xmmN > > > @@ -2249,10 +2174,6 @@ remove_partial_avx_dependency (void) > > > if (!NONDEBUG_INSN_P (insn)) > > > continue; > > > > > > - /* Handle AVX512 embedded broadcast here to save compile time. */ > > > - if (TARGET_AVX512F) > > > - replace_constant_pool_with_broadcast (insn); > > > - > > > set = single_set (insn); > > > if (!set) > > > continue; > > > @@ -2384,16 +2305,6 @@ remove_partial_avx_dependency (void) > > > return 0; > > > } > > > > > > -static bool > > > -remove_partial_avx_dependency_gate () > > > -{ > > > - return (TARGET_AVX > > > - && TARGET_SSE_PARTIAL_REG_DEPENDENCY > > > - && TARGET_SSE_MATH > > > - && optimize > > > - && optimize_function_for_speed_p (cfun)); > > > -} > > > - > > > namespace { > > > > > > const pass_data pass_data_remove_partial_avx_dependency = > > > @@ -2419,7 +2330,11 @@ public: > > > /* opt_pass methods: */ > > > virtual bool gate (function *) > > > { > > > - return remove_partial_avx_dependency_gate (); > > > + return (TARGET_AVX > > > + && TARGET_SSE_PARTIAL_REG_DEPENDENCY > > > + && TARGET_SSE_MATH > > > + && optimize > > > + && optimize_function_for_speed_p (cfun)); > > > } > > > > > > virtual unsigned int execute (function *) > > > @@ -2436,68 +2351,6 @@ make_pass_remove_partial_avx_dependency (gcc::context *ctxt) > > > return new pass_remove_partial_avx_dependency (ctxt); > > > } > > > > > > -/* For const vector having one duplicated value, there's no need to put > > > - whole vector in the constant pool when target supports embedded broadcast. */ > > > -static unsigned int > > > -constant_pool_broadcast (void) > > > -{ > > > - timevar_push (TV_MACH_DEP); > > > - rtx_insn *insn; > > > - > > > - for (insn = get_insns (); insn; insn = NEXT_INSN (insn)) > > > - { > > > - if (INSN_P (insn)) > > > - replace_constant_pool_with_broadcast (insn); > > > - } > > > - timevar_pop (TV_MACH_DEP); > > > - return 0; > > > -} > > > - > > > -namespace { > > > - > > > -const pass_data pass_data_constant_pool_broadcast = > > > -{ > > > - RTL_PASS, /* type */ > > > - "cpb", /* name */ > > > - OPTGROUP_NONE, /* optinfo_flags */ > > > - TV_MACH_DEP, /* tv_id */ > > > - 0, /* properties_required */ > > > - 0, /* properties_provided */ > > > - 0, /* properties_destroyed */ > > > - 0, /* todo_flags_start */ > > > - TODO_df_finish, /* todo_flags_finish */ > > > -}; > > > - > > > -class pass_constant_pool_broadcast : public rtl_opt_pass > > > -{ > > > -public: > > > - pass_constant_pool_broadcast (gcc::context *ctxt) > > > - : rtl_opt_pass (pass_data_constant_pool_broadcast, ctxt) > > > - {} > > > - > > > - /* opt_pass methods: */ > > > - virtual bool gate (function *) > > > - { > > > - /* Return false if rpad pass gate is true. > > > - replace_constant_pool_with_broadcast is called > > > - from both this pass and rpad pass. */ > > > - return (TARGET_AVX512F && !remove_partial_avx_dependency_gate ()); > > > - } > > > - > > > - virtual unsigned int execute (function *) > > > - { > > > - return constant_pool_broadcast (); > > > - } > > > -}; // class pass_cpb > > > - > > > -} // anon namespace > > > - > > > -rtl_opt_pass * > > > -make_pass_constant_pool_broadcast (gcc::context *ctxt) > > > -{ > > > - return new pass_constant_pool_broadcast (ctxt); > > > -} > > > - > > > /* This compares the priority of target features in function DECL1 > > > and DECL2. It returns positive value if DECL1 is higher priority, > > > negative value if DECL2 is higher priority and 0 if they are the > > > diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def > > > index 44df00e94ac..29baf8acd0b 100644 > > > --- a/gcc/config/i386/i386-passes.def > > > +++ b/gcc/config/i386/i386-passes.def > > > @@ -33,4 +33,3 @@ along with GCC; see the file COPYING3. If not see > > > INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbr_and_patchable_area); > > > > > > INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency); > > > - INSERT_PASS_AFTER (pass_combine, 1, pass_constant_pool_broadcast); > > > diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h > > > index 51376fcc454..07ac02aff69 100644 > > > --- a/gcc/config/i386/i386-protos.h > > > +++ b/gcc/config/i386/i386-protos.h > > > @@ -395,4 +395,3 @@ extern rtl_opt_pass *make_pass_insert_endbr_and_patchable_area > > > (gcc::context *); > > > extern rtl_opt_pass *make_pass_remove_partial_avx_dependency > > > (gcc::context *); > > > -extern rtl_opt_pass *make_pass_constant_pool_broadcast (gcc::context *); > > > diff --git a/gcc/testsuite/gcc.target/i386/fuse-caller-save-xmm.c b/gcc/testsuite/gcc.target/i386/fuse-caller-save-xmm.c > > > index 4deff93c1e8..b0d3dc38a0c 100644 > > > --- a/gcc/testsuite/gcc.target/i386/fuse-caller-save-xmm.c > > > +++ b/gcc/testsuite/gcc.target/i386/fuse-caller-save-xmm.c > > > @@ -6,7 +6,7 @@ typedef double v2df __attribute__((vector_size (16))); > > > static v2df __attribute__((noinline)) > > > bar (v2df a) > > > { > > > - return a + (v2df){ 3.0, 3.0 }; > > > + return a + (v2df){ 3.0, 4.0 }; > > > } > > > > > > v2df __attribute__((noinline)) > > > -- > > > 2.18.1 > > > > > > > > > -- > > BR, > > Hongtao > > > > -- > H.J. -- BR, Hongtao ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] [i386] Remove pass_cpb which is related to enable avx512 embedded broadcast from constant pool. 2021-08-06 6:26 ` Hongtao Liu @ 2021-08-06 12:26 ` H.J. Lu 0 siblings, 0 replies; 6+ messages in thread From: H.J. Lu @ 2021-08-06 12:26 UTC (permalink / raw) To: Hongtao Liu; +Cc: Uros Bizjak, GCC Patches, liuhongt On Thu, Aug 5, 2021 at 11:21 PM Hongtao Liu <crazylht@gmail.com> wrote: > > On Wed, Jul 14, 2021 at 8:38 PM H.J. Lu <hjl.tools@gmail.com> wrote: > > > > On Tue, Jul 13, 2021 at 9:35 PM Hongtao Liu <crazylht@gmail.com> wrote: > > > > > > On Wed, Jul 14, 2021 at 10:34 AM liuhongt <hongtao.liu@intel.com> wrote: > > > > > > > > By optimizing vector movement to broadcast in ix86_expand_vector_move > > > > during pass_expand, pass_reload/LRA can automatically generate an avx512 > > > > embedded broadcast, pass_cpb is not needed. > > > > > > > > Considering that in the absence of avx512f, broadcast from memory is > > > > still slightly faster than loading the entire memory, so always enable > > > > broadcast. > > > > > > > > benchmark: > > > > https://gitlab.com/x86-benchmarks/microbenchmark/-/tree/vaddps/broadcast > > > > > > > > The performance diff > > > > > > > > strategy : cycles > > > > memory : 1046611188 > > > > memory : 1255420817 > > > > memory : 1044720793 > > > > memory : 1253414145 > > > > average : 1097868397 > > > > > > > > broadcast : 1044430688 > > > > broadcast : 1044477630 > > > > broadcast : 1253554603 > > > > broadcast : 1044561934 > > > > average : 1096756213 > > > > > > > > But however broadcast has larger size. > > > > > > > > the size diff > > > > > > > > size broadcast.o > > > > text data bss dec hex filename > > > > 137 0 0 137 89 broadcast.o > > > > > > > > size memory.o > > > > text data bss dec hex filename > > > > 115 0 0 115 73 memory.o > > > > > > > > Bootstrapped and regtested on x86_64-linux-gnu{-m32,} > > > > > > > > gcc/ChangeLog: > > > > > > > > * config/i386/i386-expand.c > > > > (ix86_broadcast_from_integer_constant): Rename to .. > > > > (ix86_broadcast_from_constant): .. this, and extend it to > > > > handle float mode. > > > > (ix86_expand_vector_move): Extend to float mode. > > > > * config/i386/i386-features.c > > > > (replace_constant_pool_with_broadcast): Remove. > > > > (remove_partial_avx_dependency_gate): Ditto. > > > > (constant_pool_broadcast): Ditto. > > > > (class pass_constant_pool_broadcast): Ditto. > > > > (make_pass_constant_pool_broadcast): Ditto. > > > > (remove_partial_avx_dependency): Adjust gate. > > > > * config/i386/i386-passes.def: Remove pass_constant_pool_broadcast. > > > > * config/i386/i386-protos.h > > > > (make_pass_constant_pool_broadcast): Remove. > > > > > > > > gcc/testsuite/ChangeLog: > > > > > > > > * gcc.target/i386/fuse-caller-save-xmm.c: Adjust testcase. > > > > --- > > > > gcc/config/i386/i386-expand.c | 29 +++- > > > > gcc/config/i386/i386-features.c | 157 +----------------- > > > > gcc/config/i386/i386-passes.def | 1 - > > > > gcc/config/i386/i386-protos.h | 1 - > > > > .../gcc.target/i386/fuse-caller-save-xmm.c | 2 +- > > > > 5 files changed, 26 insertions(+), 164 deletions(-) > > > > > > > > diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c > > > > index 69ea79e6123..ba870145acd 100644 > > > > --- a/gcc/config/i386/i386-expand.c > > > > +++ b/gcc/config/i386/i386-expand.c > > > > @@ -453,8 +453,10 @@ ix86_expand_move (machine_mode mode, rtx operands[]) > > > > emit_insn (gen_rtx_SET (op0, op1)); > > > > } > > > > > > > > +/* OP is a memref of CONST_VECTOR, return scalar constant mem > > > > + if CONST_VECTOR is a vec_duplicate, else return NULL. */ > > > > static rtx > > > > -ix86_broadcast_from_integer_constant (machine_mode mode, rtx op) > > > > +ix86_broadcast_from_constant (machine_mode mode, rtx op) > > > > { > > > > int nunits = GET_MODE_NUNITS (mode); > > > > if (nunits < 2) > > > > @@ -462,7 +464,8 @@ ix86_broadcast_from_integer_constant (machine_mode mode, rtx op) > > > > > > > > /* Don't use integer vector broadcast if we can't move from GPR to SSE > > > > register directly. */ > > > > - if (!TARGET_INTER_UNIT_MOVES_TO_VEC) > > > > + if (!TARGET_INTER_UNIT_MOVES_TO_VEC > > > > + && INTEGRAL_MODE_P (mode)) > > > > return nullptr; > > > > > > > > /* Convert CONST_VECTOR to a non-standard SSE constant integer > > > > @@ -470,12 +473,17 @@ ix86_broadcast_from_integer_constant (machine_mode mode, rtx op) > > > > if (!(TARGET_AVX2 > > > > || (TARGET_AVX > > > > && (GET_MODE_INNER (mode) == SImode > > > > - || GET_MODE_INNER (mode) == DImode))) > > > > + || GET_MODE_INNER (mode) == DImode)) > > > > + || FLOAT_MODE_P (mode)) > > > > || standard_sse_constant_p (op, mode)) > > > > return nullptr; > > > > > > > > - /* Don't broadcast from a 64-bit integer constant in 32-bit mode. */ > > > > - if (GET_MODE_INNER (mode) == DImode && !TARGET_64BIT) > > > > + /* Don't broadcast from a 64-bit integer constant in 32-bit mode. > > > > + We can still put 64-bit integer constant in memory when > > > > + avx512 embed broadcast is available. */ > > > > + if (GET_MODE_INNER (mode) == DImode && !TARGET_64BIT > > > > + && (!TARGET_AVX512F > > > > + || (GET_MODE_SIZE (mode) < 64 && !TARGET_AVX512VL))) > > > > return nullptr; > > > > > > > > if (GET_MODE_INNER (mode) == TImode) > > > > @@ -561,17 +569,20 @@ ix86_expand_vector_move (machine_mode mode, rtx operands[]) > > > > > > > > if (can_create_pseudo_p () > > > > && GET_MODE_SIZE (mode) >= 16 > > > > - && GET_MODE_CLASS (mode) == MODE_VECTOR_INT > > > > + && VECTOR_MODE_P (mode) > > > > && (MEM_P (op1) > > > > && SYMBOL_REF_P (XEXP (op1, 0)) > > > > && CONSTANT_POOL_ADDRESS_P (XEXP (op1, 0)))) > > > > { > > > > - rtx first = ix86_broadcast_from_integer_constant (mode, op1); > > > > + rtx first = ix86_broadcast_from_constant (mode, op1); > > > > if (first != nullptr) > > > > { > > > > /* Broadcast to XMM/YMM/ZMM register from an integer > > > > - constant. */ > > > > - op1 = ix86_gen_scratch_sse_rtx (mode); > > > > + constant or scalar mem. */ > > > > + op1 = gen_reg_rtx (mode); > > > I've changed ix86_gen_scratch_sse_rtx to gen_reg_rtx to let LRA/reload > > > make the final decision for register usage, would that make sense? > > > > Hard registers are used for 2 purposes: > > > > 1. Prevent stack realignment when the original code doesn't use vector > > registers, which is the same for memcpy and memset. > > 2. Prevent combine to convert constant broadcast to load from constant > > pool. > > > W/ gcc version 12.0.0 20210806 (experimental) (GCC) and replaced > ix86_gen_scratch_sse_rtx with gen_reg_rtx i didn't observe the failure > related to upper cases. Did you run full GCC testsuite with RUNTESTFLAGS="--target_board='unix{-m32,}'" My partial results show FAIL: gcc.target/i386/pieces-memset-3.c scan-assembler-not and[^\n\r]*%[re]sp FAIL: gcc.target/i386/pieces-memset-37.c scan-assembler-not and[^\n\r]*%[re]sp FAIL: gcc.target/i386/pieces-memset-37.c scan-assembler-not %[re]bp FAIL: gcc.target/i386/pieces-memset-39.c scan-assembler-not and[^\n\r]*%[re]sp FAIL: gcc.target/i386/pieces-memset-39.c scan-assembler-not %[re]bp FAIL: gcc.target/i386/pieces-memset-3.c scan-assembler-not and[^\n\r]*%[re]sp FAIL: gcc.target/i386/pieces-memset-3.c scan-assembler-not %[re]bp FAIL: gcc.target/i386/pieces-memset-37.c scan-assembler-not and[^\n\r]*%[re]sp FAIL: gcc.target/i386/pieces-memset-37.c scan-assembler-not %[re]bp FAIL: gcc.target/i386/pieces-memset-39.c scan-assembler-not and[^\n\r]*%[re]sp FAIL: gcc.target/i386/pieces-memset-39.c scan-assembler-not %[re]bp FAIL: gcc.target/i386/pr90773-14.c scan-assembler-times movd[\\t ]+%xmm[0-9]+, 16\\(%[^,]+\\) 1 FAIL: gcc.target/i386/pr90773-17.c scan-assembler-times vmovd[\\t ]+%xmm[0-9]+, 15\\(%[^,]+\\) 1 FAIL: gcc.target/i386/pr90773-14.c scan-assembler-times movd[\\t ]+%xmm[0-9]+, 16\\(%[^,]+\\) 1 FAIL: gcc.target/i386/pr90773-17.c scan-assembler-times vmovd[\\t ]+%xmm[0-9]+, 15\\(%[^,]+\\) 1 FAIL: gcc.target/i386/pr90773-5.c scan-assembler-times movq[\\t ]+%xmm[0-9]+, 13\\(%[^,]+\\) 1 FAIL: gcc.target/i386/pr100865-10b.c scan-assembler-not vzeroupper FAIL: gcc.target/i386/pr100865-11b.c scan-assembler-times vmovdqa64[\\t ]%xmm[0-9]+, 16 FAIL: gcc.target/i386/pr100865-12b.c scan-assembler-times vmovdqa64[\\t ]%xmm[0-9]+, 16 FAIL: gcc.target/i386/pr100865-4b.c scan-assembler-not vzeroupper FAIL: gcc.target/i386/pr100865-6b.c scan-assembler-times vmovdqu32[\\t ]%ymm[0-9]+, 8 FAIL: gcc.target/i386/pr100865-6b.c scan-assembler-not vzeroupper FAIL: gcc.target/i386/pr100865-7b.c scan-assembler-times vmovdqu64[\\t ]%ymm[0-9]+, 16 FAIL: gcc.target/i386/pr100865-7b.c scan-assembler-not vzeroupper FAIL: gcc.target/i386/pr100865-8a.c scan-assembler-times (?:vpbroadcastd|vpshufd)[\\t ]+[^\n]*, %xmm[0-9]+ 1 FAIL: gcc.target/i386/pr100865-8b.c scan-assembler-times vmovdqa64[\\t ]%xmm[0-9]+, 16 FAIL: gcc.target/i386/pr100865-8c.c scan-assembler-times vpshufd[\\t ]+[^\n]*, %xmm[0-9]+ 1 FAIL: gcc.target/i386/pr100865-9b.c scan-assembler-times vmovdqa64[\\t ]%xmm[0-9]+, 16 FAIL: gcc.target/i386/pr100865-9c.c scan-assembler-times vpshufd[\\t ]+[^\n]*, %xmm[0-9]+ 1 FAIL: gcc.target/i386/eh_return-1.c (internal compiler error) FAIL: gcc.target/i386/eh_return-1.c (test for excess errors) FAIL: gcc.target/i386/eh_return-2.c (internal compiler error) FAIL: gcc.target/i386/eh_return-2.c (test for excess errors) with diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index aea224ab235..61e7e94581c 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -23338,6 +23338,7 @@ ix86_optab_supported_p (int op, machine_mode mode1, machine_mode, rtx ix86_gen_scratch_sse_rtx (machine_mode mode) { + return gen_reg_rtx (mode); if (TARGET_SSE && !lra_in_progress) { unsigned int regno; > only failure like > gcc.target/i386/pr100865-10b.c scan-assembler-not vzeroupper > gcc.target/i386/pr100865-4b.c scan-assembler-not vzeroupper > gcc.target/i386/pr100865-6b.c scan-assembler-not vzeroupper > I guess it's related to xmm31 usage. > > Have you check in all patches rely on this part? > > -- H.J. ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2021-08-06 12:26 UTC | newest] Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-07-14 2:34 [PATCH] [i386] Remove pass_cpb which is related to enable avx512 embedded broadcast from constant pool liuhongt 2021-07-14 4:40 ` Hongtao Liu 2021-07-14 12:37 ` H.J. Lu 2021-07-22 5:12 ` Hongtao Liu 2021-08-06 6:26 ` Hongtao Liu 2021-08-06 12:26 ` H.J. Lu
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).