V2 [PATCH] i386: Add pass_remove_partial_avx

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* V2 [PATCH] i386: Add pass_remove_partial_avx_dependency
@ 2018-10-19 10:31 H.J. Lu
  2018-11-04 15:19 ` PING: " H.J. Lu
  2018-11-05 14:21 ` Jan Hubicka
  0 siblings, 2 replies; 21+ messages in thread
From: H.J. Lu @ 2018-10-19 10:31 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches, Pandey, Sunil K, Uros Bizjak

[-- Attachment #1: Type: text/plain, Size: 2523 bytes --]

On 10/18/18, Jan Hubicka <hubicka@ucw.cz> wrote:
>> we need to generate
>>
>>	vxorp[ds]	%xmmN, %xmmN, %xmmN
>>	...
>>	vcvtss2sd	f(%rip), %xmmN, %xmmX
>>	...
>>	vcvtsi2ss	i(%rip), %xmmN, %xmmY
>>
>> to avoid partial XMM register stall.  This patch adds a pass to generate
>> a single
>>
>> 	vxorps		%xmmN, %xmmN, %xmmN
>>
>> at function entry, which is shared by all SF and DF conversions, instead
>> of generating one
>>
>> 	vxorp[ds]	%xmmN, %xmmN, %xmmN
>>
>> for each SF/DF conversion.
>>
>> Performance impacts on SPEC CPU 2017 rate with 1 copy using
>>
>> -Ofast -march=native -mfpmath=sse -fno-associative-math -funroll-loops
>>
>> are
>>
>> 1. On Broadwell server:
>>
>> 500.perlbench_r (-0.82%)
>> 502.gcc_r (0.73%)
>> 505.mcf_r (-0.24%)
>> 520.omnetpp_r (-2.22%)
>> 523.xalancbmk_r (-1.47%)
>> 525.x264_r (0.31%)
>> 531.deepsjeng_r (0.27%)
>> 541.leela_r (0.85%)
>> 548.exchange2_r (-0.11%)
>> 557.xz_r (-0.34%)
>> Geomean: (-0.23%)
>>
>> 503.bwaves_r (0.00%)
>> 507.cactuBSSN_r (-1.88%)
>> 508.namd_r (0.00%)
>> 510.parest_r (-0.56%)
>> 511.povray_r (0.49%)
>> 519.lbm_r (-1.28%)
>> 521.wrf_r (-0.28%)
>> 526.blender_r (0.55%)
>> 527.cam4_r (-0.20%)
>> 538.imagick_r (2.52%)
>> 544.nab_r (-0.18%)
>> 549.fotonik3d_r (-0.51%)
>> 554.roms_r (-0.22%)
>> Geomean: (0.00%)
>
> I wonder why the patch seems to have more effect on specint that should not
> care much
> about float<->double conversions?

These are within noise range.

>> number of vxorp[ds]:
>>
>> before		after		difference
>> 14570		4515		-69%
>>
>> OK for trunk?
>
> This looks very nice though.
>

> +  if (v4sf_const0)
> +    {
> +      /* Generate a single vxorps at function entry and preform df
> +	 rescan. */
> +      bb = ENTRY_BLOCK_PTR_FOR_FN (cfun)->next_bb;
> +      insn = BB_HEAD (bb);
> +      set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
> +      set_insn = emit_insn_after (set, insn);
> +      df_insn_rescan (set_insn);
> +      df_process_deferred_rescans ();
> +    }
>
> It seems suboptimal to place the const0 at the entry of function - if the
> conversoin happens in cold region of function this will just increase
> register
> pressure.  I guess right answer would be to look for the postdominance
> frontier

Did you mean "the nearest common dominator"?

> of the set of all uses of the zero register?
>

Here is the updated patch to adds a pass to generate a single

	vxorps		%xmmN, %xmmN, %xmmN

at entry of the nearest common dominator for basic blocks with SF/DF
conversions.  OK for trunk?

Thanks.


-- 
H.J.

[-- Attachment #2: 0001-i386-Add-pass_remove_partial_avx_dependency.patch --]
[-- Type: text/x-patch, Size: 11841 bytes --]

From e2a437f48778ae9586f2038220840ecc41566f69 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Wed, 15 Aug 2018 09:58:31 -0700
Subject: [PATCH] i386: Add pass_remove_partial_avx_dependency

With -mavx, for

[hjl@gnu-cfl-1 skx-2]$ cat foo.i
extern float f;
extern double d;
extern int i;

void
foo (void)
{
  d = f;
  f = i;
}

we need to generate

	vxorp[ds]	%xmmN, %xmmN, %xmmN
	...
	vcvtss2sd	f(%rip), %xmmN, %xmmX
	...
	vcvtsi2ss	i(%rip), %xmmN, %xmmY

to avoid partial XMM register stall.  This patch adds a pass to generate
a single

	vxorps		%xmmN, %xmmN, %xmmN

at entry of the nearest common dominator for basic blocks with SF/DF
conversions, instead of generating one

	vxorp[ds]	%xmmN, %xmmN, %xmmN

for each SF/DF conversion.

Performance impacts on SPEC CPU 2017 rate with 1 copy using

-Ofast -march=native -mfpmath=sse -fno-associative-math -funroll-loops

are

1. On Broadwell server:

500.perlbench_r (-0.82%)
502.gcc_r (0.73%)
505.mcf_r (-0.24%)
520.omnetpp_r (-2.22%)
523.xalancbmk_r (-1.47%)
525.x264_r (0.31%)
531.deepsjeng_r (0.27%)
541.leela_r (0.85%)
548.exchange2_r (-0.11%)
557.xz_r (-0.34%)
Geomean: (-0.23%)

503.bwaves_r (0.00%)
507.cactuBSSN_r (-1.88%)
508.namd_r (0.00%)
510.parest_r (-0.56%)
511.povray_r (0.49%)
519.lbm_r (-1.28%)
521.wrf_r (-0.28%)
526.blender_r (0.55%)
527.cam4_r (-0.20%)
538.imagick_r (2.52%)
544.nab_r (-0.18%)
549.fotonik3d_r (-0.51%)
554.roms_r (-0.22%)
Geomean: (0.00%)

2. On Skylake client:

500.perlbench_r (-0.29%)
502.gcc_r (-0.36%)
505.mcf_r (1.77%)
520.omnetpp_r (-0.26%)
523.xalancbmk_r (-3.69%)
525.x264_r (-0.32%)
531.deepsjeng_r (0.00%)
541.leela_r (-0.46%)
548.exchange2_r (0.00%)
557.xz_r (0.00%)
Geomean: (-0.34%)

503.bwaves_r (0.00%)
507.cactuBSSN_r (-0.56%)
508.namd_r (0.87%)
510.parest_r (0.00%)
511.povray_r (-0.73%)
519.lbm_r (0.84%)
521.wrf_r (0.00%)
526.blender_r (-0.81%)
527.cam4_r (-0.43%)
538.imagick_r (2.55%)
544.nab_r (0.28%)
549.fotonik3d_r (0.00%)
554.roms_r (0.32%)
Geomean: (0.12%)

3. On Skylake server:

500.perlbench_r (-0.55%)
502.gcc_r (0.69%)
505.mcf_r (0.00%)
520.omnetpp_r (-0.33%)
523.xalancbmk_r (-0.21%)
525.x264_r (-0.27%)
531.deepsjeng_r (0.00%)
541.leela_r (0.00%)
548.exchange2_r (-0.11%)
557.xz_r (0.00%)
Geomean: (0.00%)

503.bwaves_r (0.58%)
507.cactuBSSN_r (0.00%)
508.namd_r (0.00%)
510.parest_r (0.18%)
511.povray_r (-0.58%)
519.lbm_r (0.25%)
521.wrf_r (0.40%)
526.blender_r (0.34%)
527.cam4_r (0.19%)
538.imagick_r (5.87%)
544.nab_r (0.17%)
549.fotonik3d_r (0.00%)
554.roms_r (0.00%)
Geomean: (0.62%)

On Skylake client, impacts on 538.imagick_r are

size before:

   text	   data	    bss	    dec	    hex	filename
2555577	  10876	   5576	2572029	 273efd	imagick_r.exe

size after:

   text	   data	    bss	    dec	    hex	filename
2511825	  10876	   5576	2528277	 269415	imagick_r.exe

number of vxorp[ds]:

before		after		difference
14570		4515		-69%

gcc/

2018-08-28  H.J. Lu  <hongjiu.lu@intel.com>
	    Sunil K Pandey  <sunil.k.pandey@intel.com>

	PR target/87007
	* config/i386/i386-passes.def: Add
	pass_remove_partial_avx_dependency.
	* config/i386/i386-protos.h
	(make_pass_remove_partial_avx_dependency): New.
	* config/i386/i386.c (make_pass_remove_partial_avx_dependency):
	New function.
	(pass_data_remove_partial_avx_dependency): New.
	(pass_remove_partial_avx_dependency): Likewise.
	(make_pass_remove_partial_avx_dependency): Likewise.
	* config/i386/i386.md (SF/DF conversion splitters): Disabled
	for TARGET_AVX.

gcc/testsuite/

2018-08-28  H.J. Lu  <hongjiu.lu@intel.com>
	    Sunil K Pandey  <sunil.k.pandey@intel.com>

	PR target/87007
	* gcc.target/i386/pr87007.c: New file.
---
 gcc/config/i386/i386-passes.def         |   2 +
 gcc/config/i386/i386-protos.h           |   2 +
 gcc/config/i386/i386.c                  | 179 ++++++++++++++++++++++++
 gcc/config/i386/i386.md                 |   9 +-
 gcc/testsuite/gcc.target/i386/pr87007.c |  15 ++
 5 files changed, 204 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr87007.c

diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def
index 4a6a95cc2d9..b439f3a9028 100644
--- a/gcc/config/i386/i386-passes.def
+++ b/gcc/config/i386/i386-passes.def
@@ -31,3 +31,5 @@ along with GCC; see the file COPYING3.  If not see
   INSERT_PASS_BEFORE (pass_cse2, 1, pass_stv, true /* timode_p */);
 
   INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbranch);
+
+  INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency);
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index d1d59633dc0..1626c9d0d70 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -358,3 +358,5 @@ class rtl_opt_pass;
 extern rtl_opt_pass *make_pass_insert_vzeroupper (gcc::context *);
 extern rtl_opt_pass *make_pass_stv (gcc::context *);
 extern rtl_opt_pass *make_pass_insert_endbranch (gcc::context *);
+extern rtl_opt_pass *make_pass_remove_partial_avx_dependency
+  (gcc::context *);
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index e47603eb5ff..4f0270ebfff 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2741,6 +2741,185 @@ make_pass_insert_endbranch (gcc::context *ctxt)
   return new pass_insert_endbranch (ctxt);
 }
 
+/* At entry of the nearest common dominator for basic blocks with
+   conversions, generate a single
+	vxorps %xmmN, %xmmN, %xmmN
+   for all
+	vcvtss2sd  op, %xmmN, %xmmX
+	vcvtsd2ss  op, %xmmN, %xmmX
+	vcvtsi2ss  op, %xmmN, %xmmX
+	vcvtsi2sd  op, %xmmN, %xmmX
+ */
+
+static unsigned int
+remove_partial_avx_dependency (void)
+{
+  timevar_push (TV_MACH_DEP);
+
+  calculate_dominance_info (CDI_DOMINATORS);
+  df_set_flags (DF_DEFER_INSN_RESCAN);
+  df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
+  df_md_add_problem ();
+  df_analyze ();
+
+  bitmap_obstack_initialize (NULL);
+  bitmap convert_bbs = BITMAP_ALLOC (NULL);
+
+  basic_block bb;
+  rtx_insn *insn, *set_insn;
+  rtx set;
+  rtx v4sf_const0 = NULL_RTX;
+
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      FOR_BB_INSNS (bb, insn)
+	{
+	  if (!NONDEBUG_INSN_P (insn))
+	    continue;
+
+	  set = single_set (insn);
+	  if (set)
+	    {
+	      machine_mode dest_vecmode, dest_mode;
+	      rtx src = SET_SRC (set);
+	      rtx dest, vec, zero;
+
+	      /* Check for conversions to SF or DF.  */
+	      switch (GET_CODE (src))
+		{
+		case FLOAT_TRUNCATE:
+		  /* DF -> SF.  */
+		  if (GET_MODE (XEXP (src, 0)) != DFmode)
+		    continue;
+		  /* Fall through.  */
+		case FLOAT_EXTEND:
+		  /* SF -> DF.  */
+		case FLOAT:
+		  /* SI -> SF, SI -> DF, DI -> SF, DI -> DF.  */
+		  dest = SET_DEST (set);
+		  dest_mode = GET_MODE (dest);
+		  switch (dest_mode)
+		    {
+		    case E_SFmode:
+		      dest_vecmode = V4SFmode;
+		      break;
+		    case E_DFmode:
+		      dest_vecmode = V2DFmode;
+		      break;
+		    default:
+		      continue;
+		    }
+
+		  if (!TARGET_64BIT
+		      && GET_MODE (XEXP (src, 0)) == DImode)
+		    continue;
+
+		  if (!v4sf_const0)
+		    v4sf_const0 = gen_reg_rtx (V4SFmode);
+
+		  if (dest_vecmode == V4SFmode)
+		    zero = v4sf_const0;
+		  else
+		    zero = gen_rtx_SUBREG (V2DFmode, v4sf_const0, 0);
+
+		  /* Change source to vector mode.  */
+		  src = gen_rtx_VEC_DUPLICATE (dest_vecmode, src);
+		  src = gen_rtx_VEC_MERGE (dest_vecmode, src, zero,
+					   GEN_INT (HOST_WIDE_INT_1U));
+		  /* Change destination to vector mode.  */
+		  vec = gen_reg_rtx (dest_vecmode);
+		  /* Generate a XMM vector SET.  */
+		  set = gen_rtx_SET (vec, src);
+		  set_insn = emit_insn_before (set, insn);
+		  df_insn_rescan (set_insn);
+
+		  src = gen_rtx_SUBREG (dest_mode, vec, 0);
+		  set = gen_rtx_SET (dest, src);
+
+		  /* Drop possible dead definitions.  */
+		  PATTERN (insn) = set;
+
+		  INSN_CODE (insn) = -1;
+		  recog_memoized (insn);
+		  df_insn_rescan (insn);
+		  bitmap_set_bit (convert_bbs, bb->index);
+		  break;
+
+		default:
+		  break;
+		}
+	    }
+	}
+    }
+
+  if (v4sf_const0)
+    {
+      /* Generate a single vxorps at entry of the nearest common
+	 dominator for basic blocks with conversions.  */
+      bb = nearest_common_dominator_for_set (CDI_DOMINATORS,
+					     convert_bbs);
+      insn = BB_HEAD (bb);
+      if (!NONDEBUG_INSN_P (insn))
+	insn = next_nonnote_nondebug_insn (insn);
+      set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
+      set_insn = emit_insn_before (set, insn);
+      df_insn_rescan (set_insn);
+      df_process_deferred_rescans ();
+    }
+
+  bitmap_obstack_release (NULL);
+  BITMAP_FREE (convert_bbs);
+
+  timevar_pop (TV_MACH_DEP);
+  return 0;
+}
+
+namespace {
+
+const pass_data pass_data_remove_partial_avx_dependency =
+{
+  RTL_PASS, /* type */
+  "rpad", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_MACH_DEP, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_remove_partial_avx_dependency : public rtl_opt_pass
+{
+public:
+  pass_remove_partial_avx_dependency (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_remove_partial_avx_dependency, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+    {
+      return (TARGET_AVX
+	      && TARGET_SSE_PARTIAL_REG_DEPENDENCY
+	      && TARGET_SSE_MATH
+	      && optimize
+	      && optimize_function_for_speed_p (cfun));
+    }
+
+  virtual unsigned int execute (function *)
+    {
+      return remove_partial_avx_dependency ();
+    }
+}; // class pass_rpad
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_remove_partial_avx_dependency (gcc::context *ctxt)
+{
+  return new pass_remove_partial_avx_dependency (ctxt);
+}
+
 /* Return true if a red-zone is in use.  We can't use red-zone when
    there are local indirect jumps, like "indirect_jump" or "tablejump",
    which jumps to another place in the function, since "call" in the
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 7fb2b144f47..2c7bf0c0419 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -4426,7 +4426,8 @@
   [(set (match_operand:DF 0 "sse_reg_operand")
         (float_extend:DF
           (match_operand:SF 1 "nonimmediate_operand")))]
-  "TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
+  "!TARGET_AVX
+   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
    && optimize_function_for_speed_p (cfun)
    && (!REG_P (operands[1])
        || REGNO (operands[0]) != REGNO (operands[1]))
@@ -4584,7 +4585,8 @@
   [(set (match_operand:SF 0 "sse_reg_operand")
         (float_truncate:SF
 	  (match_operand:DF 1 "nonimmediate_operand")))]
-  "TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
+  "!TARGET_AVX
+   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
    && optimize_function_for_speed_p (cfun)
    && (!REG_P (operands[1])
        || REGNO (operands[0]) != REGNO (operands[1]))
@@ -5088,7 +5090,8 @@
 (define_split
   [(set (match_operand:MODEF 0 "sse_reg_operand")
 	(float:MODEF (match_operand:SWI48 1 "nonimmediate_operand")))]
-  "TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
+  "!TARGET_AVX
+   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
    && optimize_function_for_speed_p (cfun)
    && (!EXT_REX_SSE_REG_P (operands[0])
        || TARGET_AVX512VL)"
diff --git a/gcc/testsuite/gcc.target/i386/pr87007.c b/gcc/testsuite/gcc.target/i386/pr87007.c
new file mode 100644
index 00000000000..e6f17ad6dc0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr87007.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake" } */
+
+extern float f;
+extern double d;
+extern int i;
+
+void
+foo (void)
+{
+  d = f;
+  f = i;
+}
+
+/* { dg-final { scan-assembler-times "vxorp\[ds\]\[^\n\r\]*xmm\[0-9\]" 1 } } */
-- 
2.17.2


^ permalink raw reply	[flat|nested] 21+ messages in thread

* PING: V2 [PATCH] i386: Add pass_remove_partial_avx_dependency
  2018-10-19 10:31 V2 [PATCH] i386: Add pass_remove_partial_avx_dependency H.J. Lu
@ 2018-11-04 15:19 ` H.J. Lu
  2018-11-05 14:21 ` Jan Hubicka
  1 sibling, 0 replies; 21+ messages in thread
From: H.J. Lu @ 2018-11-04 15:19 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches, Pandey, Sunil K, Uros Bizjak

On Fri, Oct 19, 2018 at 1:44 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On 10/18/18, Jan Hubicka <hubicka@ucw.cz> wrote:
> >> we need to generate
> >>
> >>      vxorp[ds]       %xmmN, %xmmN, %xmmN
> >>      ...
> >>      vcvtss2sd       f(%rip), %xmmN, %xmmX
> >>      ...
> >>      vcvtsi2ss       i(%rip), %xmmN, %xmmY
> >>
> >> to avoid partial XMM register stall.  This patch adds a pass to generate
> >> a single
> >>
> >>      vxorps          %xmmN, %xmmN, %xmmN
> >>
> >> at function entry, which is shared by all SF and DF conversions, instead
> >> of generating one
> >>
> >>      vxorp[ds]       %xmmN, %xmmN, %xmmN
> >>
> >> for each SF/DF conversion.
> >>
> >> Performance impacts on SPEC CPU 2017 rate with 1 copy using
> >>
> >> -Ofast -march=native -mfpmath=sse -fno-associative-math -funroll-loops
> >>
> >> are
> >>
> >> 1. On Broadwell server:
> >>
> >> 500.perlbench_r (-0.82%)
> >> 502.gcc_r (0.73%)
> >> 505.mcf_r (-0.24%)
> >> 520.omnetpp_r (-2.22%)
> >> 523.xalancbmk_r (-1.47%)
> >> 525.x264_r (0.31%)
> >> 531.deepsjeng_r (0.27%)
> >> 541.leela_r (0.85%)
> >> 548.exchange2_r (-0.11%)
> >> 557.xz_r (-0.34%)
> >> Geomean: (-0.23%)
> >>
> >> 503.bwaves_r (0.00%)
> >> 507.cactuBSSN_r (-1.88%)
> >> 508.namd_r (0.00%)
> >> 510.parest_r (-0.56%)
> >> 511.povray_r (0.49%)
> >> 519.lbm_r (-1.28%)
> >> 521.wrf_r (-0.28%)
> >> 526.blender_r (0.55%)
> >> 527.cam4_r (-0.20%)
> >> 538.imagick_r (2.52%)
> >> 544.nab_r (-0.18%)
> >> 549.fotonik3d_r (-0.51%)
> >> 554.roms_r (-0.22%)
> >> Geomean: (0.00%)
> >
> > I wonder why the patch seems to have more effect on specint that should not
> > care much
> > about float<->double conversions?
>
> These are within noise range.
>
> >> number of vxorp[ds]:
> >>
> >> before               after           difference
> >> 14570                4515            -69%
> >>
> >> OK for trunk?
> >
> > This looks very nice though.
> >
>
> > +  if (v4sf_const0)
> > +    {
> > +      /* Generate a single vxorps at function entry and preform df
> > +      rescan. */
> > +      bb = ENTRY_BLOCK_PTR_FOR_FN (cfun)->next_bb;
> > +      insn = BB_HEAD (bb);
> > +      set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
> > +      set_insn = emit_insn_after (set, insn);
> > +      df_insn_rescan (set_insn);
> > +      df_process_deferred_rescans ();
> > +    }
> >
> > It seems suboptimal to place the const0 at the entry of function - if the
> > conversoin happens in cold region of function this will just increase
> > register
> > pressure.  I guess right answer would be to look for the postdominance
> > frontier
>
> Did you mean "the nearest common dominator"?
>
> > of the set of all uses of the zero register?
> >
>
> Here is the updated patch to adds a pass to generate a single
>
>         vxorps          %xmmN, %xmmN, %xmmN
>
> at entry of the nearest common dominator for basic blocks with SF/DF
> conversions.  OK for trunk?
>

PING:

https://gcc.gnu.org/ml/gcc-patches/2018-10/msg01175.html


-- 
H.J.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: V2 [PATCH] i386: Add pass_remove_partial_avx_dependency
  2018-10-19 10:31 V2 [PATCH] i386: Add pass_remove_partial_avx_dependency H.J. Lu
  2018-11-04 15:19 ` PING: " H.J. Lu
@ 2018-11-05 14:21 ` Jan Hubicka
  2018-11-05 14:59   ` Jeff Law
  1 sibling, 1 reply; 21+ messages in thread
From: Jan Hubicka @ 2018-11-05 14:21 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GCC Patches, Pandey, Sunil K, Uros Bizjak, law

> 
> Did you mean "the nearest common dominator"?

If the nearest common dominator appears in the loop while all uses are
out of loops, this will result in suboptimal xor placement.
In this case you want to split edges out of the loop.

In general this is what the LCM framework will do for you if the problem
is modelled siimlar way as in mode_swtiching.  At entry function mode is
"no zero register needed" and all conversions need mode "zero register
needed".  Mode switching should then do the correct placement decisions
(reaching minimal number of executions of xor).

Jeff, whan is your optinion on the approach taken by the patch?
It seems like a special case of more general issue, but I do not see
very elegant way to solve it at least in the GCC 9 horisont, so if
the placement is correct we can probalby go either with new pass or
making this part of mode swithcing (which is anyway run by x86 backend)

Honza
> 
> > of the set of all uses of the zero register?
> >
> 
> Here is the updated patch to adds a pass to generate a single
> 
> 	vxorps		%xmmN, %xmmN, %xmmN
> 
> at entry of the nearest common dominator for basic blocks with SF/DF
> conversions.  OK for trunk?
> 
> Thanks.
> 
> 
> -- 
> H.J.

> From e2a437f48778ae9586f2038220840ecc41566f69 Mon Sep 17 00:00:00 2001
> From: "H.J. Lu" <hjl.tools@gmail.com>
> Date: Wed, 15 Aug 2018 09:58:31 -0700
> Subject: [PATCH] i386: Add pass_remove_partial_avx_dependency
> 
> With -mavx, for
> 
> [hjl@gnu-cfl-1 skx-2]$ cat foo.i
> extern float f;
> extern double d;
> extern int i;
> 
> void
> foo (void)
> {
>   d = f;
>   f = i;
> }
> 
> we need to generate
> 
> 	vxorp[ds]	%xmmN, %xmmN, %xmmN
> 	...
> 	vcvtss2sd	f(%rip), %xmmN, %xmmX
> 	...
> 	vcvtsi2ss	i(%rip), %xmmN, %xmmY
> 
> to avoid partial XMM register stall.  This patch adds a pass to generate
> a single
> 
> 	vxorps		%xmmN, %xmmN, %xmmN
> 
> at entry of the nearest common dominator for basic blocks with SF/DF
> conversions, instead of generating one
> 
> 	vxorp[ds]	%xmmN, %xmmN, %xmmN
> 
> for each SF/DF conversion.
> 
> Performance impacts on SPEC CPU 2017 rate with 1 copy using
> 
> -Ofast -march=native -mfpmath=sse -fno-associative-math -funroll-loops
> 
> are
> 
> 1. On Broadwell server:
> 
> 500.perlbench_r (-0.82%)
> 502.gcc_r (0.73%)
> 505.mcf_r (-0.24%)
> 520.omnetpp_r (-2.22%)
> 523.xalancbmk_r (-1.47%)
> 525.x264_r (0.31%)
> 531.deepsjeng_r (0.27%)
> 541.leela_r (0.85%)
> 548.exchange2_r (-0.11%)
> 557.xz_r (-0.34%)
> Geomean: (-0.23%)
> 
> 503.bwaves_r (0.00%)
> 507.cactuBSSN_r (-1.88%)
> 508.namd_r (0.00%)
> 510.parest_r (-0.56%)
> 511.povray_r (0.49%)
> 519.lbm_r (-1.28%)
> 521.wrf_r (-0.28%)
> 526.blender_r (0.55%)
> 527.cam4_r (-0.20%)
> 538.imagick_r (2.52%)
> 544.nab_r (-0.18%)
> 549.fotonik3d_r (-0.51%)
> 554.roms_r (-0.22%)
> Geomean: (0.00%)
> 
> 2. On Skylake client:
> 
> 500.perlbench_r (-0.29%)
> 502.gcc_r (-0.36%)
> 505.mcf_r (1.77%)
> 520.omnetpp_r (-0.26%)
> 523.xalancbmk_r (-3.69%)
> 525.x264_r (-0.32%)
> 531.deepsjeng_r (0.00%)
> 541.leela_r (-0.46%)
> 548.exchange2_r (0.00%)
> 557.xz_r (0.00%)
> Geomean: (-0.34%)
> 
> 503.bwaves_r (0.00%)
> 507.cactuBSSN_r (-0.56%)
> 508.namd_r (0.87%)
> 510.parest_r (0.00%)
> 511.povray_r (-0.73%)
> 519.lbm_r (0.84%)
> 521.wrf_r (0.00%)
> 526.blender_r (-0.81%)
> 527.cam4_r (-0.43%)
> 538.imagick_r (2.55%)
> 544.nab_r (0.28%)
> 549.fotonik3d_r (0.00%)
> 554.roms_r (0.32%)
> Geomean: (0.12%)
> 
> 3. On Skylake server:
> 
> 500.perlbench_r (-0.55%)
> 502.gcc_r (0.69%)
> 505.mcf_r (0.00%)
> 520.omnetpp_r (-0.33%)
> 523.xalancbmk_r (-0.21%)
> 525.x264_r (-0.27%)
> 531.deepsjeng_r (0.00%)
> 541.leela_r (0.00%)
> 548.exchange2_r (-0.11%)
> 557.xz_r (0.00%)
> Geomean: (0.00%)
> 
> 503.bwaves_r (0.58%)
> 507.cactuBSSN_r (0.00%)
> 508.namd_r (0.00%)
> 510.parest_r (0.18%)
> 511.povray_r (-0.58%)
> 519.lbm_r (0.25%)
> 521.wrf_r (0.40%)
> 526.blender_r (0.34%)
> 527.cam4_r (0.19%)
> 538.imagick_r (5.87%)
> 544.nab_r (0.17%)
> 549.fotonik3d_r (0.00%)
> 554.roms_r (0.00%)
> Geomean: (0.62%)
> 
> On Skylake client, impacts on 538.imagick_r are
> 
> size before:
> 
>    text	   data	    bss	    dec	    hex	filename
> 2555577	  10876	   5576	2572029	 273efd	imagick_r.exe
> 
> size after:
> 
>    text	   data	    bss	    dec	    hex	filename
> 2511825	  10876	   5576	2528277	 269415	imagick_r.exe
> 
> number of vxorp[ds]:
> 
> before		after		difference
> 14570		4515		-69%
> 
> gcc/
> 
> 2018-08-28  H.J. Lu  <hongjiu.lu@intel.com>
> 	    Sunil K Pandey  <sunil.k.pandey@intel.com>
> 
> 	PR target/87007
> 	* config/i386/i386-passes.def: Add
> 	pass_remove_partial_avx_dependency.
> 	* config/i386/i386-protos.h
> 	(make_pass_remove_partial_avx_dependency): New.
> 	* config/i386/i386.c (make_pass_remove_partial_avx_dependency):
> 	New function.
> 	(pass_data_remove_partial_avx_dependency): New.
> 	(pass_remove_partial_avx_dependency): Likewise.
> 	(make_pass_remove_partial_avx_dependency): Likewise.
> 	* config/i386/i386.md (SF/DF conversion splitters): Disabled
> 	for TARGET_AVX.
> 
> gcc/testsuite/
> 
> 2018-08-28  H.J. Lu  <hongjiu.lu@intel.com>
> 	    Sunil K Pandey  <sunil.k.pandey@intel.com>
> 
> 	PR target/87007
> 	* gcc.target/i386/pr87007.c: New file.
> ---
>  gcc/config/i386/i386-passes.def         |   2 +
>  gcc/config/i386/i386-protos.h           |   2 +
>  gcc/config/i386/i386.c                  | 179 ++++++++++++++++++++++++
>  gcc/config/i386/i386.md                 |   9 +-
>  gcc/testsuite/gcc.target/i386/pr87007.c |  15 ++
>  5 files changed, 204 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr87007.c
> 
> diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def
> index 4a6a95cc2d9..b439f3a9028 100644
> --- a/gcc/config/i386/i386-passes.def
> +++ b/gcc/config/i386/i386-passes.def
> @@ -31,3 +31,5 @@ along with GCC; see the file COPYING3.  If not see
>    INSERT_PASS_BEFORE (pass_cse2, 1, pass_stv, true /* timode_p */);
>  
>    INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbranch);
> +
> +  INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency);
> diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> index d1d59633dc0..1626c9d0d70 100644
> --- a/gcc/config/i386/i386-protos.h
> +++ b/gcc/config/i386/i386-protos.h
> @@ -358,3 +358,5 @@ class rtl_opt_pass;
>  extern rtl_opt_pass *make_pass_insert_vzeroupper (gcc::context *);
>  extern rtl_opt_pass *make_pass_stv (gcc::context *);
>  extern rtl_opt_pass *make_pass_insert_endbranch (gcc::context *);
> +extern rtl_opt_pass *make_pass_remove_partial_avx_dependency
> +  (gcc::context *);
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index e47603eb5ff..4f0270ebfff 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -2741,6 +2741,185 @@ make_pass_insert_endbranch (gcc::context *ctxt)
>    return new pass_insert_endbranch (ctxt);
>  }
>  
> +/* At entry of the nearest common dominator for basic blocks with
> +   conversions, generate a single
> +	vxorps %xmmN, %xmmN, %xmmN
> +   for all
> +	vcvtss2sd  op, %xmmN, %xmmX
> +	vcvtsd2ss  op, %xmmN, %xmmX
> +	vcvtsi2ss  op, %xmmN, %xmmX
> +	vcvtsi2sd  op, %xmmN, %xmmX
> + */
> +
> +static unsigned int
> +remove_partial_avx_dependency (void)
> +{
> +  timevar_push (TV_MACH_DEP);
> +
> +  calculate_dominance_info (CDI_DOMINATORS);
> +  df_set_flags (DF_DEFER_INSN_RESCAN);
> +  df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
> +  df_md_add_problem ();
> +  df_analyze ();
> +
> +  bitmap_obstack_initialize (NULL);
> +  bitmap convert_bbs = BITMAP_ALLOC (NULL);
> +
> +  basic_block bb;
> +  rtx_insn *insn, *set_insn;
> +  rtx set;
> +  rtx v4sf_const0 = NULL_RTX;
> +
> +  FOR_EACH_BB_FN (bb, cfun)
> +    {
> +      FOR_BB_INSNS (bb, insn)
> +	{
> +	  if (!NONDEBUG_INSN_P (insn))
> +	    continue;
> +
> +	  set = single_set (insn);
> +	  if (set)
> +	    {
> +	      machine_mode dest_vecmode, dest_mode;
> +	      rtx src = SET_SRC (set);
> +	      rtx dest, vec, zero;
> +
> +	      /* Check for conversions to SF or DF.  */
> +	      switch (GET_CODE (src))
> +		{
> +		case FLOAT_TRUNCATE:
> +		  /* DF -> SF.  */
> +		  if (GET_MODE (XEXP (src, 0)) != DFmode)
> +		    continue;
> +		  /* Fall through.  */
> +		case FLOAT_EXTEND:
> +		  /* SF -> DF.  */
> +		case FLOAT:
> +		  /* SI -> SF, SI -> DF, DI -> SF, DI -> DF.  */
> +		  dest = SET_DEST (set);
> +		  dest_mode = GET_MODE (dest);
> +		  switch (dest_mode)
> +		    {
> +		    case E_SFmode:
> +		      dest_vecmode = V4SFmode;
> +		      break;
> +		    case E_DFmode:
> +		      dest_vecmode = V2DFmode;
> +		      break;
> +		    default:
> +		      continue;
> +		    }
> +
> +		  if (!TARGET_64BIT
> +		      && GET_MODE (XEXP (src, 0)) == DImode)
> +		    continue;
> +
> +		  if (!v4sf_const0)
> +		    v4sf_const0 = gen_reg_rtx (V4SFmode);
> +
> +		  if (dest_vecmode == V4SFmode)
> +		    zero = v4sf_const0;
> +		  else
> +		    zero = gen_rtx_SUBREG (V2DFmode, v4sf_const0, 0);
> +
> +		  /* Change source to vector mode.  */
> +		  src = gen_rtx_VEC_DUPLICATE (dest_vecmode, src);
> +		  src = gen_rtx_VEC_MERGE (dest_vecmode, src, zero,
> +					   GEN_INT (HOST_WIDE_INT_1U));
> +		  /* Change destination to vector mode.  */
> +		  vec = gen_reg_rtx (dest_vecmode);
> +		  /* Generate a XMM vector SET.  */
> +		  set = gen_rtx_SET (vec, src);
> +		  set_insn = emit_insn_before (set, insn);
> +		  df_insn_rescan (set_insn);
> +
> +		  src = gen_rtx_SUBREG (dest_mode, vec, 0);
> +		  set = gen_rtx_SET (dest, src);
> +
> +		  /* Drop possible dead definitions.  */
> +		  PATTERN (insn) = set;
> +
> +		  INSN_CODE (insn) = -1;
> +		  recog_memoized (insn);
> +		  df_insn_rescan (insn);
> +		  bitmap_set_bit (convert_bbs, bb->index);
> +		  break;
> +
> +		default:
> +		  break;
> +		}
> +	    }
> +	}
> +    }
> +
> +  if (v4sf_const0)
> +    {
> +      /* Generate a single vxorps at entry of the nearest common
> +	 dominator for basic blocks with conversions.  */
> +      bb = nearest_common_dominator_for_set (CDI_DOMINATORS,
> +					     convert_bbs);
> +      insn = BB_HEAD (bb);
> +      if (!NONDEBUG_INSN_P (insn))
> +	insn = next_nonnote_nondebug_insn (insn);
> +      set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
> +      set_insn = emit_insn_before (set, insn);
> +      df_insn_rescan (set_insn);
> +      df_process_deferred_rescans ();
> +    }
> +
> +  bitmap_obstack_release (NULL);
> +  BITMAP_FREE (convert_bbs);
> +
> +  timevar_pop (TV_MACH_DEP);
> +  return 0;
> +}
> +
> +namespace {
> +
> +const pass_data pass_data_remove_partial_avx_dependency =
> +{
> +  RTL_PASS, /* type */
> +  "rpad", /* name */
> +  OPTGROUP_NONE, /* optinfo_flags */
> +  TV_MACH_DEP, /* tv_id */
> +  0, /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  TODO_df_finish, /* todo_flags_finish */
> +};
> +
> +class pass_remove_partial_avx_dependency : public rtl_opt_pass
> +{
> +public:
> +  pass_remove_partial_avx_dependency (gcc::context *ctxt)
> +    : rtl_opt_pass (pass_data_remove_partial_avx_dependency, ctxt)
> +  {}
> +
> +  /* opt_pass methods: */
> +  virtual bool gate (function *)
> +    {
> +      return (TARGET_AVX
> +	      && TARGET_SSE_PARTIAL_REG_DEPENDENCY
> +	      && TARGET_SSE_MATH
> +	      && optimize
> +	      && optimize_function_for_speed_p (cfun));
> +    }
> +
> +  virtual unsigned int execute (function *)
> +    {
> +      return remove_partial_avx_dependency ();
> +    }
> +}; // class pass_rpad
> +
> +} // anon namespace
> +
> +rtl_opt_pass *
> +make_pass_remove_partial_avx_dependency (gcc::context *ctxt)
> +{
> +  return new pass_remove_partial_avx_dependency (ctxt);
> +}
> +
>  /* Return true if a red-zone is in use.  We can't use red-zone when
>     there are local indirect jumps, like "indirect_jump" or "tablejump",
>     which jumps to another place in the function, since "call" in the
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index 7fb2b144f47..2c7bf0c0419 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -4426,7 +4426,8 @@
>    [(set (match_operand:DF 0 "sse_reg_operand")
>          (float_extend:DF
>            (match_operand:SF 1 "nonimmediate_operand")))]
> -  "TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
> +  "!TARGET_AVX
> +   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
>     && optimize_function_for_speed_p (cfun)
>     && (!REG_P (operands[1])
>         || REGNO (operands[0]) != REGNO (operands[1]))
> @@ -4584,7 +4585,8 @@
>    [(set (match_operand:SF 0 "sse_reg_operand")
>          (float_truncate:SF
>  	  (match_operand:DF 1 "nonimmediate_operand")))]
> -  "TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
> +  "!TARGET_AVX
> +   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
>     && optimize_function_for_speed_p (cfun)
>     && (!REG_P (operands[1])
>         || REGNO (operands[0]) != REGNO (operands[1]))
> @@ -5088,7 +5090,8 @@
>  (define_split
>    [(set (match_operand:MODEF 0 "sse_reg_operand")
>  	(float:MODEF (match_operand:SWI48 1 "nonimmediate_operand")))]
> -  "TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
> +  "!TARGET_AVX
> +   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
>     && optimize_function_for_speed_p (cfun)
>     && (!EXT_REX_SSE_REG_P (operands[0])
>         || TARGET_AVX512VL)"
> diff --git a/gcc/testsuite/gcc.target/i386/pr87007.c b/gcc/testsuite/gcc.target/i386/pr87007.c
> new file mode 100644
> index 00000000000..e6f17ad6dc0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr87007.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=skylake" } */
> +
> +extern float f;
> +extern double d;
> +extern int i;
> +
> +void
> +foo (void)
> +{
> +  d = f;
> +  f = i;
> +}
> +
> +/* { dg-final { scan-assembler-times "vxorp\[ds\]\[^\n\r\]*xmm\[0-9\]" 1 } } */
> -- 
> 2.17.2
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: V2 [PATCH] i386: Add pass_remove_partial_avx_dependency
  2018-11-05 14:21 ` Jan Hubicka
@ 2018-11-05 14:59   ` Jeff Law
  2018-11-05 15:29     ` Jan Hubicka
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Law @ 2018-11-05 14:59 UTC (permalink / raw)
  To: Jan Hubicka, H.J. Lu; +Cc: GCC Patches, Pandey, Sunil K, Uros Bizjak

On 11/5/18 7:21 AM, Jan Hubicka wrote:
>>
>> Did you mean "the nearest common dominator"?
> 
> If the nearest common dominator appears in the loop while all uses are
> out of loops, this will result in suboptimal xor placement.
> In this case you want to split edges out of the loop.
> 
> In general this is what the LCM framework will do for you if the problem
> is modelled siimlar way as in mode_swtiching.  At entry function mode is
> "no zero register needed" and all conversions need mode "zero register
> needed".  Mode switching should then do the correct placement decisions
> (reaching minimal number of executions of xor).
> 
> Jeff, whan is your optinion on the approach taken by the patch?
> It seems like a special case of more general issue, but I do not see
> very elegant way to solve it at least in the GCC 9 horisont, so if
> the placement is correct we can probalby go either with new pass or
> making this part of mode swithcing (which is anyway run by x86 backend)
So I haven't followed this discussion at all, but did touch on this
issue with some patch a month or two ago with a target patch that was
trying to avoid the partial stalls.

My assumption is that we're trying to find one or more places to
initialize the upper half of an avx register so as to avoid partial
register stall at existing sites that set the upper half.

This sounds like a classic PRE/LCM style problem (of which mode
switching is just another variant).   A common-dominator approach is
closer to a classic GCSE and is going to result is more initializations
at sub-optimal points than a PRE/LCM style.

The only advantage a common-dominator approach would have that I could
think of would be potentially further separating the initialization from
the subsequent use points which avoid store-store stalls or somesuch.  I
doubt this effect would be enough to overcome the inherent advantages of
a PRE/LCM approach.

ISTM that if we were to scan the RTL noting which instructions set the
upper part of the avx register, which instructions have potential stalls
and which instructions reset the upper half to an indeterminate state
(calls), then we have the local properties.  THen we feed that into a
traditional LCM solver and we get back the optimal points.

The only weirdness is that we don't want to move existing instructions
that set the upper bits.  Those are essentially fixed.  So maybe this is
actually better modeled by click's algorithm which has the concept of
pinned instructions.  But click's algorithm assumes SSA.  Ugh.

I'd probably have to sit down with it for a while -- it might be
possible to handle the fixed instructions using some of the ideas from
click (essentially exposing the earliest/latest results from LCM, then
picking a point on the dominator path between earliest and latest).

jeff

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: V2 [PATCH] i386: Add pass_remove_partial_avx_dependency
  2018-11-05 14:59   ` Jeff Law
@ 2018-11-05 15:29     ` Jan Hubicka
  2018-11-28 19:49       ` H.J. Lu
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Hubicka @ 2018-11-05 15:29 UTC (permalink / raw)
  To: Jeff Law; +Cc: H.J. Lu, GCC Patches, Pandey, Sunil K, Uros Bizjak

> On 11/5/18 7:21 AM, Jan Hubicka wrote:
> >>
> >> Did you mean "the nearest common dominator"?
> > 
> > If the nearest common dominator appears in the loop while all uses are
> > out of loops, this will result in suboptimal xor placement.
> > In this case you want to split edges out of the loop.
> > 
> > In general this is what the LCM framework will do for you if the problem
> > is modelled siimlar way as in mode_swtiching.  At entry function mode is
> > "no zero register needed" and all conversions need mode "zero register
> > needed".  Mode switching should then do the correct placement decisions
> > (reaching minimal number of executions of xor).
> > 
> > Jeff, whan is your optinion on the approach taken by the patch?
> > It seems like a special case of more general issue, but I do not see
> > very elegant way to solve it at least in the GCC 9 horisont, so if
> > the placement is correct we can probalby go either with new pass or
> > making this part of mode swithcing (which is anyway run by x86 backend)
> So I haven't followed this discussion at all, but did touch on this
> issue with some patch a month or two ago with a target patch that was
> trying to avoid the partial stalls.
> 
> My assumption is that we're trying to find one or more places to
> initialize the upper half of an avx register so as to avoid partial
> register stall at existing sites that set the upper half.
> 
> This sounds like a classic PRE/LCM style problem (of which mode
> switching is just another variant).   A common-dominator approach is
> closer to a classic GCSE and is going to result is more initializations
> at sub-optimal points than a PRE/LCM style.

yes, it is usual code placement problem. It is special case because the
zero register is not modified by the conversion (just we need to have
zero somewhere).  So basically we do not have kills to the zero except
for entry block.

Honza

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: V2 [PATCH] i386: Add pass_remove_partial_avx_dependency
  2018-11-05 15:29     ` Jan Hubicka
@ 2018-11-28 19:49       ` H.J. Lu
  2018-11-28 20:17         ` Jeff Law
  0 siblings, 1 reply; 21+ messages in thread
From: H.J. Lu @ 2018-11-28 19:49 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Jeffrey Law, GCC Patches, Pandey, Sunil K, Uros Bizjak

On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> > On 11/5/18 7:21 AM, Jan Hubicka wrote:
> > >>
> > >> Did you mean "the nearest common dominator"?
> > >
> > > If the nearest common dominator appears in the loop while all uses are
> > > out of loops, this will result in suboptimal xor placement.
> > > In this case you want to split edges out of the loop.
> > >
> > > In general this is what the LCM framework will do for you if the problem
> > > is modelled siimlar way as in mode_swtiching.  At entry function mode is
> > > "no zero register needed" and all conversions need mode "zero register
> > > needed".  Mode switching should then do the correct placement decisions
> > > (reaching minimal number of executions of xor).
> > >
> > > Jeff, whan is your optinion on the approach taken by the patch?
> > > It seems like a special case of more general issue, but I do not see
> > > very elegant way to solve it at least in the GCC 9 horisont, so if
> > > the placement is correct we can probalby go either with new pass or
> > > making this part of mode swithcing (which is anyway run by x86 backend)
> > So I haven't followed this discussion at all, but did touch on this
> > issue with some patch a month or two ago with a target patch that was
> > trying to avoid the partial stalls.
> >
> > My assumption is that we're trying to find one or more places to
> > initialize the upper half of an avx register so as to avoid partial
> > register stall at existing sites that set the upper half.
> >
> > This sounds like a classic PRE/LCM style problem (of which mode
> > switching is just another variant).   A common-dominator approach is
> > closer to a classic GCSE and is going to result is more initializations
> > at sub-optimal points than a PRE/LCM style.
>
> yes, it is usual code placement problem. It is special case because the
> zero register is not modified by the conversion (just we need to have
> zero somewhere).  So basically we do not have kills to the zero except
> for entry block.
>

Do you have  testcase to show thatf the nearest common dominator
in the loop, while all uses areout of loops, leads to suboptimal xor
placement?

-- 
H.J.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: V2 [PATCH] i386: Add pass_remove_partial_avx_dependency
  2018-11-28 19:49       ` H.J. Lu
@ 2018-11-28 20:17         ` Jeff Law
  2018-11-28 20:21           ` Jan Hubicka
  2018-12-30 17:10           ` H.J. Lu
  0 siblings, 2 replies; 21+ messages in thread
From: Jeff Law @ 2018-11-28 20:17 UTC (permalink / raw)
  To: H.J. Lu, Jan Hubicka; +Cc: GCC Patches, Pandey, Sunil K, Uros Bizjak

On 11/28/18 12:48 PM, H.J. Lu wrote:
> On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>>> On 11/5/18 7:21 AM, Jan Hubicka wrote:
>>>>>
>>>>> Did you mean "the nearest common dominator"?
>>>>
>>>> If the nearest common dominator appears in the loop while all uses are
>>>> out of loops, this will result in suboptimal xor placement.
>>>> In this case you want to split edges out of the loop.
>>>>
>>>> In general this is what the LCM framework will do for you if the problem
>>>> is modelled siimlar way as in mode_swtiching.  At entry function mode is
>>>> "no zero register needed" and all conversions need mode "zero register
>>>> needed".  Mode switching should then do the correct placement decisions
>>>> (reaching minimal number of executions of xor).
>>>>
>>>> Jeff, whan is your optinion on the approach taken by the patch?
>>>> It seems like a special case of more general issue, but I do not see
>>>> very elegant way to solve it at least in the GCC 9 horisont, so if
>>>> the placement is correct we can probalby go either with new pass or
>>>> making this part of mode swithcing (which is anyway run by x86 backend)
>>> So I haven't followed this discussion at all, but did touch on this
>>> issue with some patch a month or two ago with a target patch that was
>>> trying to avoid the partial stalls.
>>>
>>> My assumption is that we're trying to find one or more places to
>>> initialize the upper half of an avx register so as to avoid partial
>>> register stall at existing sites that set the upper half.
>>>
>>> This sounds like a classic PRE/LCM style problem (of which mode
>>> switching is just another variant).   A common-dominator approach is
>>> closer to a classic GCSE and is going to result is more initializations
>>> at sub-optimal points than a PRE/LCM style.
>>
>> yes, it is usual code placement problem. It is special case because the
>> zero register is not modified by the conversion (just we need to have
>> zero somewhere).  So basically we do not have kills to the zero except
>> for entry block.
>>
> 
> Do you have  testcase to show thatf the nearest common dominator
> in the loop, while all uses areout of loops, leads to suboptimal xor
> placement?
I don't have a testcase, but it's all but certain nearest common
dominator is going to be a suboptimal placement.  That's going to create
paths where you're going to emit the xor when it's not used.

The whole point of the LCM algorithms is they are optimal in terms of
expression evaluations.

jeff
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: V2 [PATCH] i386: Add pass_remove_partial_avx_dependency
  2018-11-28 20:17         ` Jeff Law
@ 2018-11-28 20:21           ` Jan Hubicka
  2018-12-30 17:33             ` H.J. Lu
  2018-12-30 17:10           ` H.J. Lu
  1 sibling, 1 reply; 21+ messages in thread
From: Jan Hubicka @ 2018-11-28 20:21 UTC (permalink / raw)
  To: Jeff Law; +Cc: H.J. Lu, GCC Patches, Pandey, Sunil K, Uros Bizjak

> On 11/28/18 12:48 PM, H.J. Lu wrote:
> > On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubicka@ucw.cz> wrote:
> >>
> >>> On 11/5/18 7:21 AM, Jan Hubicka wrote:
> >>>>>
> >>>>> Did you mean "the nearest common dominator"?
> >>>>
> >>>> If the nearest common dominator appears in the loop while all uses are
> >>>> out of loops, this will result in suboptimal xor placement.
> >>>> In this case you want to split edges out of the loop.
> >>>>
> >>>> In general this is what the LCM framework will do for you if the problem
> >>>> is modelled siimlar way as in mode_swtiching.  At entry function mode is
> >>>> "no zero register needed" and all conversions need mode "zero register
> >>>> needed".  Mode switching should then do the correct placement decisions
> >>>> (reaching minimal number of executions of xor).
> >>>>
> >>>> Jeff, whan is your optinion on the approach taken by the patch?
> >>>> It seems like a special case of more general issue, but I do not see
> >>>> very elegant way to solve it at least in the GCC 9 horisont, so if
> >>>> the placement is correct we can probalby go either with new pass or
> >>>> making this part of mode swithcing (which is anyway run by x86 backend)
> >>> So I haven't followed this discussion at all, but did touch on this
> >>> issue with some patch a month or two ago with a target patch that was
> >>> trying to avoid the partial stalls.
> >>>
> >>> My assumption is that we're trying to find one or more places to
> >>> initialize the upper half of an avx register so as to avoid partial
> >>> register stall at existing sites that set the upper half.
> >>>
> >>> This sounds like a classic PRE/LCM style problem (of which mode
> >>> switching is just another variant).   A common-dominator approach is
> >>> closer to a classic GCSE and is going to result is more initializations
> >>> at sub-optimal points than a PRE/LCM style.
> >>
> >> yes, it is usual code placement problem. It is special case because the
> >> zero register is not modified by the conversion (just we need to have
> >> zero somewhere).  So basically we do not have kills to the zero except
> >> for entry block.
> >>
> > 
> > Do you have  testcase to show thatf the nearest common dominator
> > in the loop, while all uses areout of loops, leads to suboptimal xor
> > placement?
> I don't have a testcase, but it's all but certain nearest common
> dominator is going to be a suboptimal placement.  That's going to create
> paths where you're going to emit the xor when it's not used.
> 
> The whole point of the LCM algorithms is they are optimal in terms of
> expression evaluations.

i think testcase should be something like

test()
{
  while (true)
  {
     if (cond1)
       {
       	 do_one_conversion;
	 return;
       }
     if (cond2)
       {
       	 do_other_conversion;
	 return;
       }
  }
}

Honza
> 
> jeff
> > 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: V2 [PATCH] i386: Add pass_remove_partial_avx_dependency
  2018-11-28 20:17         ` Jeff Law
  2018-11-28 20:21           ` Jan Hubicka
@ 2018-12-30 17:10           ` H.J. Lu
  2019-01-07 13:56             ` V3 " H.J. Lu
  1 sibling, 1 reply; 21+ messages in thread
From: H.J. Lu @ 2018-12-30 17:10 UTC (permalink / raw)
  To: Jeff Law
  Cc: Jan Hubicka, GCC Patches, Pandey, Sunil K, Uros Bizjak, Hongtao Liu

On Wed, Nov 28, 2018 at 12:17 PM Jeff Law <law@redhat.com> wrote:
>
> On 11/28/18 12:48 PM, H.J. Lu wrote:
> > On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubicka@ucw.cz> wrote:
> >>
> >>> On 11/5/18 7:21 AM, Jan Hubicka wrote:
> >>>>>
> >>>>> Did you mean "the nearest common dominator"?
> >>>>
> >>>> If the nearest common dominator appears in the loop while all uses are
> >>>> out of loops, this will result in suboptimal xor placement.
> >>>> In this case you want to split edges out of the loop.
> >>>>
> >>>> In general this is what the LCM framework will do for you if the problem
> >>>> is modelled siimlar way as in mode_swtiching.  At entry function mode is
> >>>> "no zero register needed" and all conversions need mode "zero register
> >>>> needed".  Mode switching should then do the correct placement decisions
> >>>> (reaching minimal number of executions of xor).
> >>>>
> >>>> Jeff, whan is your optinion on the approach taken by the patch?
> >>>> It seems like a special case of more general issue, but I do not see
> >>>> very elegant way to solve it at least in the GCC 9 horisont, so if
> >>>> the placement is correct we can probalby go either with new pass or
> >>>> making this part of mode swithcing (which is anyway run by x86 backend)
> >>> So I haven't followed this discussion at all, but did touch on this
> >>> issue with some patch a month or two ago with a target patch that was
> >>> trying to avoid the partial stalls.
> >>>
> >>> My assumption is that we're trying to find one or more places to
> >>> initialize the upper half of an avx register so as to avoid partial
> >>> register stall at existing sites that set the upper half.
> >>>
> >>> This sounds like a classic PRE/LCM style problem (of which mode
> >>> switching is just another variant).   A common-dominator approach is
> >>> closer to a classic GCSE and is going to result is more initializations
> >>> at sub-optimal points than a PRE/LCM style.
> >>
> >> yes, it is usual code placement problem. It is special case because the
> >> zero register is not modified by the conversion (just we need to have
> >> zero somewhere).  So basically we do not have kills to the zero except
> >> for entry block.
> >>
> >
> > Do you have  testcase to show thatf the nearest common dominator
> > in the loop, while all uses areout of loops, leads to suboptimal xor
> > placement?
> I don't have a testcase, but it's all but certain nearest common
> dominator is going to be a suboptimal placement.  That's going to create
> paths where you're going to emit the xor when it's not used.
>
> The whole point of the LCM algorithms is they are optimal in terms of
> expression evaluations.

We tried LCM and it didn't work well for this case.  LCM places a single
VXOR close to the location where it is needed, which can be inside a
loop.  There is nothing wrong with the LCM algorithms.   But this doesn't
solve

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007

where VXOR is executed multiple times inside of a function, instead of
just once.   We are investigating to generate a single VXOR at entry of the
nearest dominator for basic blocks with SF/DF conversions, which is in
the the fake loop that contains the whole function:

      bb = nearest_common_dominator_for_set (CDI_DOMINATORS,
                                             convert_bbs);
      while (bb->loop_father->latch
             != EXIT_BLOCK_PTR_FOR_FN (cfun))
        bb = get_immediate_dominator (CDI_DOMINATORS,
                                      bb->loop_father->header);

      insn = BB_HEAD (bb);
      if (!NONDEBUG_INSN_P (insn))
        insn = next_nonnote_nondebug_insn (insn);
      set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
      set_insn = emit_insn_before (set, insn);


-- 
H.J.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: V2 [PATCH] i386: Add pass_remove_partial_avx_dependency
  2018-11-28 20:21           ` Jan Hubicka
@ 2018-12-30 17:33             ` H.J. Lu
  2019-01-11 21:06               ` Jeff Law
  0 siblings, 1 reply; 21+ messages in thread
From: H.J. Lu @ 2018-12-30 17:33 UTC (permalink / raw)
  To: Jan Hubicka, Hongtao Liu
  Cc: Jeff Law, GCC Patches, Pandey, Sunil K, Uros Bizjak

On Wed, Nov 28, 2018 at 12:21 PM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> > On 11/28/18 12:48 PM, H.J. Lu wrote:
> > > On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubicka@ucw.cz> wrote:
> > >>
> > >>> On 11/5/18 7:21 AM, Jan Hubicka wrote:
> > >>>>>
> > >>>>> Did you mean "the nearest common dominator"?
> > >>>>
> > >>>> If the nearest common dominator appears in the loop while all uses are
> > >>>> out of loops, this will result in suboptimal xor placement.
> > >>>> In this case you want to split edges out of the loop.
> > >>>>
> > >>>> In general this is what the LCM framework will do for you if the problem
> > >>>> is modelled siimlar way as in mode_swtiching.  At entry function mode is
> > >>>> "no zero register needed" and all conversions need mode "zero register
> > >>>> needed".  Mode switching should then do the correct placement decisions
> > >>>> (reaching minimal number of executions of xor).
> > >>>>
> > >>>> Jeff, whan is your optinion on the approach taken by the patch?
> > >>>> It seems like a special case of more general issue, but I do not see
> > >>>> very elegant way to solve it at least in the GCC 9 horisont, so if
> > >>>> the placement is correct we can probalby go either with new pass or
> > >>>> making this part of mode swithcing (which is anyway run by x86 backend)
> > >>> So I haven't followed this discussion at all, but did touch on this
> > >>> issue with some patch a month or two ago with a target patch that was
> > >>> trying to avoid the partial stalls.
> > >>>
> > >>> My assumption is that we're trying to find one or more places to
> > >>> initialize the upper half of an avx register so as to avoid partial
> > >>> register stall at existing sites that set the upper half.
> > >>>
> > >>> This sounds like a classic PRE/LCM style problem (of which mode
> > >>> switching is just another variant).   A common-dominator approach is
> > >>> closer to a classic GCSE and is going to result is more initializations
> > >>> at sub-optimal points than a PRE/LCM style.
> > >>
> > >> yes, it is usual code placement problem. It is special case because the
> > >> zero register is not modified by the conversion (just we need to have
> > >> zero somewhere).  So basically we do not have kills to the zero except
> > >> for entry block.
> > >>
> > >
> > > Do you have  testcase to show thatf the nearest common dominator
> > > in the loop, while all uses areout of loops, leads to suboptimal xor
> > > placement?
> > I don't have a testcase, but it's all but certain nearest common
> > dominator is going to be a suboptimal placement.  That's going to create
> > paths where you're going to emit the xor when it's not used.
> >
> > The whole point of the LCM algorithms is they are optimal in terms of
> > expression evaluations.
>
> i think testcase should be something like
>
> test()
> {
>   while (true)
>   {
>      if (cond1)
>        {
>          do_one_conversion;
>          return;
>        }
>      if (cond2)
>        {
>          do_other_conversion;
>          return;
>        }
>   }
> }

We got

[hjl@gnu-cfl-1 pr87007]$ cat test2.i
extern float f1[],f2[];
extern int i1[],i2[];
float
foo (int k, int n[])
{
  if (k == 1)
    return 1;

  if (k == 4)
    return 5;

  for(int i = 0; i != k; i++){
    if(n[i] > 100)
      f1[i] = i1[i];
    else
      f2[i] = i2[i];
  }

  return k;
}
[hjl@gnu-cfl-1 pr87007]$ make test2.s
/export/build/gnu/tools-build/gcc-debug/build-x86_64-linux/gcc/xgcc
-B/export/build/gnu/tools-build/gcc-debug/build-x86_64-linux/gcc/ -O2
-mavx -S test2.i
[hjl@gnu-cfl-1 pr87007]$ cat test2.s
.file "test2.i"
.text
.p2align 4
.globl foo
.type foo, @function
foo:
.LFB0:
.cfi_startproc
vmovss .LC0(%rip), %xmm0
cmpl $1, %edi
je .L15
vmovss .LC1(%rip), %xmm0
cmpl $4, %edi
je .L15
vxorps %xmm0, %xmm0, %xmm0
testl %edi, %edi
je .L3
leal -1(%rdi), %ecx
xorl %eax, %eax
jmp .L6
.p2align 4,,10
.p2align 3
.L17:
vcvtsi2ss i1(,%rax,4), %xmm0, %xmm1
leaq 1(%rax), %rdx
vmovss %xmm1, f1(,%rax,4)
cmpq %rcx, %rax
je .L3
.L9:
movq %rdx, %rax
.L6:
cmpl $100, (%rsi,%rax,4)
jg .L17
vcvtsi2ss i2(,%rax,4), %xmm0, %xmm1
leaq 1(%rax), %rdx
vmovss %xmm1, f2(,%rax,4)
cmpq %rcx, %rax
jne .L9
.L3:
vcvtsi2ss %edi, %xmm0, %xmm0
.L15:
ret
.cfi_endproc
.LFE0:
.size foo, .-foo
.section .rodata.cst4,"aM",@progbits,4
.align 4
.LC0:
.long 1065353216
.align 4
.LC1:
.long 1084227584
.ident "GCC: (GNU) 9.0.0 20181230 (experimental)"
.section .note.GNU-stack,"",@progbits
[hjl@gnu-cfl-1 pr87007]$

The placement is optimal.

-- 
H.J.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* V3 [PATCH] i386: Add pass_remove_partial_avx_dependency
  2018-12-30 17:10           ` H.J. Lu
@ 2019-01-07 13:56             ` H.J. Lu
  2019-01-18 15:49               ` PING^1: " H.J. Lu
  2019-01-21 18:54               ` Jeff Law
  0 siblings, 2 replies; 21+ messages in thread
From: H.J. Lu @ 2019-01-07 13:56 UTC (permalink / raw)
  To: Jeff Law
  Cc: Jan Hubicka, GCC Patches, Pandey, Sunil K, Uros Bizjak,
	Hongtao Liu, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 4048 bytes --]

On Sun, Dec 30, 2018 at 8:40 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Wed, Nov 28, 2018 at 12:17 PM Jeff Law <law@redhat.com> wrote:
> >
> > On 11/28/18 12:48 PM, H.J. Lu wrote:
> > > On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubicka@ucw.cz> wrote:
> > >>
> > >>> On 11/5/18 7:21 AM, Jan Hubicka wrote:
> > >>>>>
> > >>>>> Did you mean "the nearest common dominator"?
> > >>>>
> > >>>> If the nearest common dominator appears in the loop while all uses are
> > >>>> out of loops, this will result in suboptimal xor placement.
> > >>>> In this case you want to split edges out of the loop.
> > >>>>
> > >>>> In general this is what the LCM framework will do for you if the problem
> > >>>> is modelled siimlar way as in mode_swtiching.  At entry function mode is
> > >>>> "no zero register needed" and all conversions need mode "zero register
> > >>>> needed".  Mode switching should then do the correct placement decisions
> > >>>> (reaching minimal number of executions of xor).
> > >>>>
> > >>>> Jeff, whan is your optinion on the approach taken by the patch?
> > >>>> It seems like a special case of more general issue, but I do not see
> > >>>> very elegant way to solve it at least in the GCC 9 horisont, so if
> > >>>> the placement is correct we can probalby go either with new pass or
> > >>>> making this part of mode swithcing (which is anyway run by x86 backend)
> > >>> So I haven't followed this discussion at all, but did touch on this
> > >>> issue with some patch a month or two ago with a target patch that was
> > >>> trying to avoid the partial stalls.
> > >>>
> > >>> My assumption is that we're trying to find one or more places to
> > >>> initialize the upper half of an avx register so as to avoid partial
> > >>> register stall at existing sites that set the upper half.
> > >>>
> > >>> This sounds like a classic PRE/LCM style problem (of which mode
> > >>> switching is just another variant).   A common-dominator approach is
> > >>> closer to a classic GCSE and is going to result is more initializations
> > >>> at sub-optimal points than a PRE/LCM style.
> > >>
> > >> yes, it is usual code placement problem. It is special case because the
> > >> zero register is not modified by the conversion (just we need to have
> > >> zero somewhere).  So basically we do not have kills to the zero except
> > >> for entry block.
> > >>
> > >
> > > Do you have  testcase to show thatf the nearest common dominator
> > > in the loop, while all uses areout of loops, leads to suboptimal xor
> > > placement?
> > I don't have a testcase, but it's all but certain nearest common
> > dominator is going to be a suboptimal placement.  That's going to create
> > paths where you're going to emit the xor when it's not used.
> >
> > The whole point of the LCM algorithms is they are optimal in terms of
> > expression evaluations.
>
> We tried LCM and it didn't work well for this case.  LCM places a single
> VXOR close to the location where it is needed, which can be inside a
> loop.  There is nothing wrong with the LCM algorithms.   But this doesn't
> solve
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007
>
> where VXOR is executed multiple times inside of a function, instead of
> just once.   We are investigating to generate a single VXOR at entry of the
> nearest dominator for basic blocks with SF/DF conversions, which is in
> the the fake loop that contains the whole function:
>
>       bb = nearest_common_dominator_for_set (CDI_DOMINATORS,
>                                              convert_bbs);
>       while (bb->loop_father->latch
>              != EXIT_BLOCK_PTR_FOR_FN (cfun))
>         bb = get_immediate_dominator (CDI_DOMINATORS,
>                                       bb->loop_father->header);
>
>       insn = BB_HEAD (bb);
>       if (!NONDEBUG_INSN_P (insn))
>         insn = next_nonnote_nondebug_insn (insn);
>       set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
>       set_insn = emit_insn_before (set, insn);
>

Here is the updated patch.  OK for trunk?

Thanks.

-- 
H.J.

[-- Attachment #2: 0001-i386-Add-pass_remove_partial_avx_dependency.patch --]
[-- Type: text/x-patch, Size: 14104 bytes --]

From 6eca7dbf282d7e2a5cde41bffeca66195d72d48e Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Mon, 7 Jan 2019 05:44:59 -0800
Subject: [PATCH] i386: Add pass_remove_partial_avx_dependency

With -mavx, for

$ cat foo.i
extern float f;
extern double d;
extern int i;

void
foo (void)
{
  d = f;
  f = i;
}

we need to generate

	vxorp[ds]	%xmmN, %xmmN, %xmmN
	...
	vcvtss2sd	f(%rip), %xmmN, %xmmX
	...
	vcvtsi2ss	i(%rip), %xmmN, %xmmY

to avoid partial XMM register stall.  This patch adds a pass to generate
a single

	vxorps		%xmmN, %xmmN, %xmmN

at entry of the nearest dominator for basic blocks with SF/DF conversions,
which is in the fake loop that contains the whole function, instead of
generating one

	vxorp[ds]	%xmmN, %xmmN, %xmmN

for each SF/DF conversion.

NB: The LCM algorithm isn't appropriate here since it may place a vxorps
inside the loop.  Simple testcase show this:

$ cat badcase.c

extern float f;
extern double d;

void
foo (int n, int k)
{
  for (int j = 0; j != n; j++)
    if (j < k)
      d = f;
}

It generates

    ...
    loop:
      if(j < k)
        vxorps  %xmm0, %xmm0, %xmm0
        vcvtss2sd  %xmm1, %xmm0, %xmm0
      ...
    loopend
    ...

This is because LCM only works when there is a certain benifit.  But for
conditional branch, LCM wouldn't move

   vxorps  %xmm0, %xmm0, %xmm0

out of loop.  SPEC CPU 2017 on Intel Xeon with AVX512 shows:

1. The nearest dominator

|RATE			|Improvement|
|500.perlbench_r	| 0.55%	|
|538.imagick_r		| 8.43%	|
|544.nab_r		| 0.71%	|

2. LCM

|RATE			|Improvement|
|500.perlbench_r	| -0.76% |
|538.imagick_r		| 7.96%  |
|544.nab_r		| -0.13% |

Performance impacts of SPEC CPU 2017 rate on Intel Xeon with AVX512
using

-Ofast -flto -march=skylake-avx512 -mfpmath=sse -funroll-loops

are:

|INT RATE		|Improvement|
|500.perlbench_r	| 0.55%	|
|502.gcc_r		| 0.14%	|
|505.mcf_r		| 0.08%	|
|523.xalancbmk_r	| 0.18%	|
|525.x264_r		|-0.49%	|
|531.deepsjeng_r	|-0.04%	|
|541.leela_r		|-0.26%	|
|548.exchange2_r	|-0.3%	|
|557.xz_r		|BuildSame|

|FP RATE		|Improvement|
|503.bwaves_r	        |-0.29% |
|507.cactuBSSN_r	| 0.04%	|
|508.namd_r		|-0.74%	|
|510.parest_r		|-0.01%	|
|511.povray_r		| 2.23%	|
|519.lbm_r		| 0.1%	|
|521.wrf_r		| 0.49%	|
|526.blender_r		| 0.13%	|
|527.cam4_r		| 0.65%	|
|538.imagick_r		| 8.43%	|
|544.nab_r		| 0.71%	|
|549.fotonik3d_r	| 0.15%	|
|554.roms_r		| 0.08%	|

On Skylake client, impacts on 538.imagick_r are

size before:

   text	   data	    bss	    dec	    hex	filename
3018743    8432    4472 3031647  2e425f imagick_r

size after:

   text	   data	    bss	    dec	    hex	filename
2992351    8432    4472 3005255  2ddb47 imagick_r

number of vxorp[ds]:

before		after		difference
10453		3816		-63.49%

gcc/

2019-01-07  H.J. Lu  <hongjiu.lu@intel.com>
	    Hongtao Liu  <hongtao.liu@intel.com>
	    Sunil K Pandey  <sunil.k.pandey@intel.com>

	PR target/87007
	* config/i386/i386-passes.def: Add
	pass_remove_partial_avx_dependency.
	* config/i386/i386-protos.h
	(make_pass_remove_partial_avx_dependency): New.
	* config/i386/i386.c (make_pass_remove_partial_avx_dependency):
	New function.
	(pass_data_remove_partial_avx_dependency): New.
	(pass_remove_partial_avx_dependency): Likewise.
	(make_pass_remove_partial_avx_dependency): Likewise.
	* config/i386/i386.md (vec_scalar_convert): New attribute.
	(*extendsfdf2): Add vec_scalar_convert.
	(truncdfsf2): Likewise.
	(*float<SWI48:mode><MODEF:mode>2): Likewise.
	(SF/DF conversion splitters): Disabled for TARGET_AVX.

gcc/testsuite/

2019-01-07  H.J. Lu  <hongjiu.lu@intel.com>
	    Hongtao Liu  <hongtao.liu@intel.com>
	    Sunil K Pandey  <sunil.k.pandey@intel.com>

	PR target/87007
	* gcc.target/i386/pr87007-1.c: New test.
	* gcc.target/i386/pr87007-2.c: Likewise.
---
 gcc/config/i386/i386-passes.def           |   2 +
 gcc/config/i386/i386-protos.h             |   2 +
 gcc/config/i386/i386.c                    | 174 ++++++++++++++++++++++
 gcc/config/i386/i386.md                   |  16 +-
 gcc/testsuite/gcc.target/i386/pr87007-1.c |  15 ++
 gcc/testsuite/gcc.target/i386/pr87007-2.c |  18 +++
 6 files changed, 224 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr87007-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr87007-2.c

diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def
index 4a6a95cc2d9..b439f3a9028 100644
--- a/gcc/config/i386/i386-passes.def
+++ b/gcc/config/i386/i386-passes.def
@@ -31,3 +31,5 @@ along with GCC; see the file COPYING3.  If not see
   INSERT_PASS_BEFORE (pass_cse2, 1, pass_stv, true /* timode_p */);
 
   INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbranch);
+
+  INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency);
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 1e802bac1ea..b8f166a0b25 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -367,3 +367,5 @@ class rtl_opt_pass;
 extern rtl_opt_pass *make_pass_insert_vzeroupper (gcc::context *);
 extern rtl_opt_pass *make_pass_stv (gcc::context *);
 extern rtl_opt_pass *make_pass_insert_endbranch (gcc::context *);
+extern rtl_opt_pass *make_pass_remove_partial_avx_dependency
+  (gcc::context *);
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 61dbc95c086..57f9666554a 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2793,6 +2793,180 @@ make_pass_insert_endbranch (gcc::context *ctxt)
   return new pass_insert_endbranch (ctxt);
 }
 
+/* At entry of the nearest common dominator for basic blocks with
+   conversions, generate a single
+	vxorps %xmmN, %xmmN, %xmmN
+   for all
+	vcvtss2sd  op, %xmmN, %xmmX
+	vcvtsd2ss  op, %xmmN, %xmmX
+	vcvtsi2ss  op, %xmmN, %xmmX
+	vcvtsi2sd  op, %xmmN, %xmmX
+
+   NB: We want to generate only a single vxorps to cover the whole
+   function.  The LCM algorithm isn't appropriate here since it may
+   place a vxorps inside the loop.  */
+
+static unsigned int
+remove_partial_avx_dependency (void)
+{
+  timevar_push (TV_MACH_DEP);
+
+  calculate_dominance_info (CDI_DOMINATORS);
+  df_set_flags (DF_DEFER_INSN_RESCAN);
+  df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
+  df_md_add_problem ();
+  df_analyze ();
+
+  bitmap_obstack_initialize (NULL);
+  bitmap convert_bbs = BITMAP_ALLOC (NULL);
+
+  basic_block bb;
+  rtx_insn *insn, *set_insn;
+  rtx set;
+  rtx v4sf_const0 = NULL_RTX;
+
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      FOR_BB_INSNS (bb, insn)
+	{
+	  if (!NONDEBUG_INSN_P (insn))
+	    continue;
+
+	  set = single_set (insn);
+	  if (!set)
+	    continue;
+
+	  if (get_attr_vec_scalar_convert (insn)
+	      != VEC_SCALAR_CONVERT_TRUE)
+	    continue;
+
+	  if (!v4sf_const0)
+	    v4sf_const0 = gen_reg_rtx (V4SFmode);
+
+	  /* Convert VEC_SCALAR_CONVERT_TRUE insns, DF -> SF, SF -> DF,
+	     SI -> SF, SI -> DF, DI -> SF, DI -> DF, to vec_dup and
+	     vec_merge with subreg.  */
+	  rtx src = SET_SRC (set);
+	  rtx dest = SET_DEST (set);
+	  machine_mode dest_mode = GET_MODE (dest);
+
+	  rtx zero;
+	  machine_mode dest_vecmode;
+	  if (dest_mode == E_SFmode)
+	    {
+	      dest_vecmode = V4SFmode;
+	      zero = v4sf_const0;
+	    }
+	  else
+	    {
+	      dest_vecmode = V2DFmode;
+	      zero = gen_rtx_SUBREG (V2DFmode, v4sf_const0, 0);
+	    }
+
+	  /* Change source to vector mode.  */
+	  src = gen_rtx_VEC_DUPLICATE (dest_vecmode, src);
+	  src = gen_rtx_VEC_MERGE (dest_vecmode, src, zero,
+				   GEN_INT (HOST_WIDE_INT_1U));
+	  /* Change destination to vector mode.  */
+	  rtx vec = gen_reg_rtx (dest_vecmode);
+	  /* Generate an XMM vector SET.  */
+	  set = gen_rtx_SET (vec, src);
+	  set_insn = emit_insn_before (set, insn);
+	  df_insn_rescan (set_insn);
+
+	  src = gen_rtx_SUBREG (dest_mode, vec, 0);
+	  set = gen_rtx_SET (dest, src);
+
+	  /* Drop possible dead definitions.  */
+	  PATTERN (insn) = set;
+
+	  INSN_CODE (insn) = -1;
+	  recog_memoized (insn);
+	  df_insn_rescan (insn);
+	  bitmap_set_bit (convert_bbs, bb->index);
+	}
+    }
+
+  if (v4sf_const0)
+    {
+      /* (Re-)discover loops so that bb->loop_father can be used in the
+	 analysis below.  */
+      loop_optimizer_init (AVOID_CFG_MODIFICATIONS);
+
+      /* Generate a vxorps at entry of the nearest dominator for basic
+	 blocks with conversions, which is in the the fake loop that
+	 contains the whole function, so that there is only a single
+	 vxorps in the whole function.   */
+      bb = nearest_common_dominator_for_set (CDI_DOMINATORS,
+					     convert_bbs);
+      while (bb->loop_father->latch
+	     != EXIT_BLOCK_PTR_FOR_FN (cfun))
+	bb = get_immediate_dominator (CDI_DOMINATORS,
+				      bb->loop_father->header);
+
+      insn = BB_HEAD (bb);
+      if (!NONDEBUG_INSN_P (insn))
+	insn = next_nonnote_nondebug_insn (insn);
+      set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
+      set_insn = emit_insn_before (set, insn);
+      df_insn_rescan (set_insn);
+      df_process_deferred_rescans ();
+      loop_optimizer_finalize ();
+    }
+
+  bitmap_obstack_release (NULL);
+  BITMAP_FREE (convert_bbs);
+
+  timevar_pop (TV_MACH_DEP);
+  return 0;
+}
+
+namespace {
+
+const pass_data pass_data_remove_partial_avx_dependency =
+{
+  RTL_PASS, /* type */
+  "rpad", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_MACH_DEP, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_remove_partial_avx_dependency : public rtl_opt_pass
+{
+public:
+  pass_remove_partial_avx_dependency (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_remove_partial_avx_dependency, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+    {
+      return (TARGET_AVX
+	      && TARGET_SSE_PARTIAL_REG_DEPENDENCY
+	      && TARGET_SSE_MATH
+	      && optimize
+	      && optimize_function_for_speed_p (cfun));
+    }
+
+  virtual unsigned int execute (function *)
+    {
+      return remove_partial_avx_dependency ();
+    }
+}; // class pass_rpad
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_remove_partial_avx_dependency (gcc::context *ctxt)
+{
+  return new pass_remove_partial_avx_dependency (ctxt);
+}
+
 /* Return true if a red-zone is in use.  We can't use red-zone when
    there are local indirect jumps, like "indirect_jump" or "tablejump",
    which jumps to another place in the function, since "call" in the
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index b2d27faa4fd..31bcb67843d 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -777,6 +777,10 @@
 (define_attr "i387_cw" "trunc,floor,ceil,uninitialized,any"
   (const_string "any"))
 
+;; Define attribute to indicate scalar ssecvt/sseicvt insns.
+(define_attr "vec_scalar_convert" "false,true"
+  (const_string "false"))
+
 ;; Define attribute to classify add/sub insns that consumes carry flag (CF)
 (define_attr "use_carry" "0,1" (const_string "0"))
 
@@ -4388,6 +4392,7 @@
     }
 }
   [(set_attr "type" "fmov,fmov,ssecvt")
+   (set_attr "vec_scalar_convert" "false,false,true")
    (set_attr "prefix" "orig,orig,maybe_vex")
    (set_attr "mode" "SF,XF,DF")
    (set (attr "enabled")
@@ -4477,7 +4482,8 @@
   [(set (match_operand:DF 0 "sse_reg_operand")
         (float_extend:DF
           (match_operand:SF 1 "nonimmediate_operand")))]
-  "TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
+  "!TARGET_AVX
+   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
    && optimize_function_for_speed_p (cfun)
    && (!REG_P (operands[1])
        || REGNO (operands[0]) != REGNO (operands[1]))
@@ -4552,6 +4558,7 @@
     }
 }
   [(set_attr "type" "fmov,fmov,ssecvt")
+   (set_attr "vec_scalar_convert" "false,false,true")
    (set_attr "mode" "SF")
    (set (attr "enabled")
      (if_then_else
@@ -4635,7 +4642,8 @@
   [(set (match_operand:SF 0 "sse_reg_operand")
         (float_truncate:SF
 	  (match_operand:DF 1 "nonimmediate_operand")))]
-  "TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
+  "!TARGET_AVX
+   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
    && optimize_function_for_speed_p (cfun)
    && (!REG_P (operands[1])
        || REGNO (operands[0]) != REGNO (operands[1]))
@@ -5011,6 +5019,7 @@
    %vcvtsi2<MODEF:ssemodesuffix><SWI48:rex64suffix>\t{%1, %d0|%d0, %1}
    %vcvtsi2<MODEF:ssemodesuffix><SWI48:rex64suffix>\t{%1, %d0|%d0, %1}"
   [(set_attr "type" "fmov,sseicvt,sseicvt")
+   (set_attr "vec_scalar_convert" "false,true,true")
    (set_attr "prefix" "orig,maybe_vex,maybe_vex")
    (set_attr "mode" "<MODEF:MODE>")
    (set (attr "prefix_rex")
@@ -5139,7 +5148,8 @@
 (define_split
   [(set (match_operand:MODEF 0 "sse_reg_operand")
 	(float:MODEF (match_operand:SWI48 1 "nonimmediate_operand")))]
-  "TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
+  "!TARGET_AVX
+   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
    && optimize_function_for_speed_p (cfun)
    && (!EXT_REX_SSE_REG_P (operands[0])
        || TARGET_AVX512VL)"
diff --git a/gcc/testsuite/gcc.target/i386/pr87007-1.c b/gcc/testsuite/gcc.target/i386/pr87007-1.c
new file mode 100644
index 00000000000..93cf1dcdfa5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr87007-1.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake" } */
+
+extern float f;
+extern double d;
+extern int i;
+
+void
+foo (void)
+{
+  d = f;
+  f = i;
+}
+
+/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr87007-2.c b/gcc/testsuite/gcc.target/i386/pr87007-2.c
new file mode 100644
index 00000000000..cca7ae7afbc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr87007-2.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake" } */
+
+extern float f;
+extern double d;
+extern int i;
+
+void
+foo (int n, int k)
+{
+  for (int i = 0; i != n; i++)
+    if(i < k)
+      d = f;
+    else
+      f = i;
+}
+
+/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */
-- 
2.19.2


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: V2 [PATCH] i386: Add pass_remove_partial_avx_dependency
  2018-12-30 17:33             ` H.J. Lu
@ 2019-01-11 21:06               ` Jeff Law
  0 siblings, 0 replies; 21+ messages in thread
From: Jeff Law @ 2019-01-11 21:06 UTC (permalink / raw)
  To: H.J. Lu, Jan Hubicka, Hongtao Liu
  Cc: GCC Patches, Pandey, Sunil K, Uros Bizjak

On 12/30/18 9:50 AM, H.J. Lu wrote:
> On Wed, Nov 28, 2018 at 12:21 PM Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>>> On 11/28/18 12:48 PM, H.J. Lu wrote:
>>>> On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubicka@ucw.cz> wrote:
>>>>>
>>>>>> On 11/5/18 7:21 AM, Jan Hubicka wrote:
>>>>>>>>
>>>>>>>> Did you mean "the nearest common dominator"?
>>>>>>>
>>>>>>> If the nearest common dominator appears in the loop while all uses are
>>>>>>> out of loops, this will result in suboptimal xor placement.
>>>>>>> In this case you want to split edges out of the loop.
>>>>>>>
>>>>>>> In general this is what the LCM framework will do for you if the problem
>>>>>>> is modelled siimlar way as in mode_swtiching.  At entry function mode is
>>>>>>> "no zero register needed" and all conversions need mode "zero register
>>>>>>> needed".  Mode switching should then do the correct placement decisions
>>>>>>> (reaching minimal number of executions of xor).
>>>>>>>
>>>>>>> Jeff, whan is your optinion on the approach taken by the patch?
>>>>>>> It seems like a special case of more general issue, but I do not see
>>>>>>> very elegant way to solve it at least in the GCC 9 horisont, so if
>>>>>>> the placement is correct we can probalby go either with new pass or
>>>>>>> making this part of mode swithcing (which is anyway run by x86 backend)
>>>>>> So I haven't followed this discussion at all, but did touch on this
>>>>>> issue with some patch a month or two ago with a target patch that was
>>>>>> trying to avoid the partial stalls.
>>>>>>
>>>>>> My assumption is that we're trying to find one or more places to
>>>>>> initialize the upper half of an avx register so as to avoid partial
>>>>>> register stall at existing sites that set the upper half.
>>>>>>
>>>>>> This sounds like a classic PRE/LCM style problem (of which mode
>>>>>> switching is just another variant).   A common-dominator approach is
>>>>>> closer to a classic GCSE and is going to result is more initializations
>>>>>> at sub-optimal points than a PRE/LCM style.
>>>>>
>>>>> yes, it is usual code placement problem. It is special case because the
>>>>> zero register is not modified by the conversion (just we need to have
>>>>> zero somewhere).  So basically we do not have kills to the zero except
>>>>> for entry block.
>>>>>
>>>>
>>>> Do you have  testcase to show thatf the nearest common dominator
>>>> in the loop, while all uses areout of loops, leads to suboptimal xor
>>>> placement?
>>> I don't have a testcase, but it's all but certain nearest common
>>> dominator is going to be a suboptimal placement.  That's going to create
>>> paths where you're going to emit the xor when it's not used.
>>>
>>> The whole point of the LCM algorithms is they are optimal in terms of
>>> expression evaluations.
>>
>> i think testcase should be something like
>>
>> test()
>> {
>>   while (true)
>>   {
>>      if (cond1)
>>        {
>>          do_one_conversion;
>>          return;
>>        }
>>      if (cond2)
>>        {
>>          do_other_conversion;
>>          return;
>>        }
>>   }
>> }
> 
> We got
> 
> [hjl@gnu-cfl-1 pr87007]$ cat test2.i
> extern float f1[],f2[];
> extern int i1[],i2[];
> float
> foo (int k, int n[])
> {
>   if (k == 1)
>     return 1;
> 
>   if (k == 4)
>     return 5;
> 
>   for(int i = 0; i != k; i++){
>     if(n[i] > 100)
>       f1[i] = i1[i];
>     else
>       f2[i] = i2[i];
>   }
> 
>   return k;
Note the if-if construct within the loop of the pseudo code will
generate a notably different cfg than the if-else in your code (which
creates a simple diamond).   The pseudo code also exits in those cases
rather than continuing the loop.

The differences in the CFGs you wind up with are critically important.

For the first chunk of pseudocode the computationally optimal placement
point is just before the conversions.  THere's precisely one execution
of the xor at runtime no matter how many loop iterations occur.

For your example code the optimal placement point is just outside the
loop.  Essentially both arms of the conditional need the xor.  So
possible insertion points are just before the loop, just inside the loop
or in both arms.  Obviously the optimal point is just before the loop.

JEff

^ permalink raw reply	[flat|nested] 21+ messages in thread

* PING^1: V3 [PATCH] i386: Add pass_remove_partial_avx_dependency
  2019-01-07 13:56             ` V3 " H.J. Lu
@ 2019-01-18 15:49               ` H.J. Lu
  2019-01-21 18:54               ` Jeff Law
  1 sibling, 0 replies; 21+ messages in thread
From: H.J. Lu @ 2019-01-18 15:49 UTC (permalink / raw)
  To: Jeff Law
  Cc: Jan Hubicka, GCC Patches, Pandey, Sunil K, Uros Bizjak,
	Hongtao Liu, Jakub Jelinek

On Mon, Jan 7, 2019 at 5:55 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Sun, Dec 30, 2018 at 8:40 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Wed, Nov 28, 2018 at 12:17 PM Jeff Law <law@redhat.com> wrote:
> > >
> > > On 11/28/18 12:48 PM, H.J. Lu wrote:
> > > > On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubicka@ucw.cz> wrote:
> > > >>
> > > >>> On 11/5/18 7:21 AM, Jan Hubicka wrote:
> > > >>>>>
> > > >>>>> Did you mean "the nearest common dominator"?
> > > >>>>
> > > >>>> If the nearest common dominator appears in the loop while all uses are
> > > >>>> out of loops, this will result in suboptimal xor placement.
> > > >>>> In this case you want to split edges out of the loop.
> > > >>>>
> > > >>>> In general this is what the LCM framework will do for you if the problem
> > > >>>> is modelled siimlar way as in mode_swtiching.  At entry function mode is
> > > >>>> "no zero register needed" and all conversions need mode "zero register
> > > >>>> needed".  Mode switching should then do the correct placement decisions
> > > >>>> (reaching minimal number of executions of xor).
> > > >>>>
> > > >>>> Jeff, whan is your optinion on the approach taken by the patch?
> > > >>>> It seems like a special case of more general issue, but I do not see
> > > >>>> very elegant way to solve it at least in the GCC 9 horisont, so if
> > > >>>> the placement is correct we can probalby go either with new pass or
> > > >>>> making this part of mode swithcing (which is anyway run by x86 backend)
> > > >>> So I haven't followed this discussion at all, but did touch on this
> > > >>> issue with some patch a month or two ago with a target patch that was
> > > >>> trying to avoid the partial stalls.
> > > >>>
> > > >>> My assumption is that we're trying to find one or more places to
> > > >>> initialize the upper half of an avx register so as to avoid partial
> > > >>> register stall at existing sites that set the upper half.
> > > >>>
> > > >>> This sounds like a classic PRE/LCM style problem (of which mode
> > > >>> switching is just another variant).   A common-dominator approach is
> > > >>> closer to a classic GCSE and is going to result is more initializations
> > > >>> at sub-optimal points than a PRE/LCM style.
> > > >>
> > > >> yes, it is usual code placement problem. It is special case because the
> > > >> zero register is not modified by the conversion (just we need to have
> > > >> zero somewhere).  So basically we do not have kills to the zero except
> > > >> for entry block.
> > > >>
> > > >
> > > > Do you have  testcase to show thatf the nearest common dominator
> > > > in the loop, while all uses areout of loops, leads to suboptimal xor
> > > > placement?
> > > I don't have a testcase, but it's all but certain nearest common
> > > dominator is going to be a suboptimal placement.  That's going to create
> > > paths where you're going to emit the xor when it's not used.
> > >
> > > The whole point of the LCM algorithms is they are optimal in terms of
> > > expression evaluations.
> >
> > We tried LCM and it didn't work well for this case.  LCM places a single
> > VXOR close to the location where it is needed, which can be inside a
> > loop.  There is nothing wrong with the LCM algorithms.   But this doesn't
> > solve
> >
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007
> >
> > where VXOR is executed multiple times inside of a function, instead of
> > just once.   We are investigating to generate a single VXOR at entry of the
> > nearest dominator for basic blocks with SF/DF conversions, which is in
> > the the fake loop that contains the whole function:
> >
> >       bb = nearest_common_dominator_for_set (CDI_DOMINATORS,
> >                                              convert_bbs);
> >       while (bb->loop_father->latch
> >              != EXIT_BLOCK_PTR_FOR_FN (cfun))
> >         bb = get_immediate_dominator (CDI_DOMINATORS,
> >                                       bb->loop_father->header);
> >
> >       insn = BB_HEAD (bb);
> >       if (!NONDEBUG_INSN_P (insn))
> >         insn = next_nonnote_nondebug_insn (insn);
> >       set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
> >       set_insn = emit_insn_before (set, insn);
> >
>
> Here is the updated patch.  OK for trunk?
>

This is a GCC 8/9 regression:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007

PING:

https://gcc.gnu.org/ml/gcc-patches/2019-01/msg00298.html

-- 
H.J.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: V3 [PATCH] i386: Add pass_remove_partial_avx_dependency
  2019-01-07 13:56             ` V3 " H.J. Lu
  2019-01-18 15:49               ` PING^1: " H.J. Lu
@ 2019-01-21 18:54               ` Jeff Law
  2019-01-21 21:27                 ` H.J. Lu
  1 sibling, 1 reply; 21+ messages in thread
From: Jeff Law @ 2019-01-21 18:54 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Jan Hubicka, GCC Patches, Pandey, Sunil K, Uros Bizjak,
	Hongtao Liu, Jakub Jelinek

On 1/7/19 6:55 AM, H.J. Lu wrote:
> On Sun, Dec 30, 2018 at 8:40 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Wed, Nov 28, 2018 at 12:17 PM Jeff Law <law@redhat.com> wrote:
>>> On 11/28/18 12:48 PM, H.J. Lu wrote:
>>>> On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubicka@ucw.cz> wrote:
>>>>>> On 11/5/18 7:21 AM, Jan Hubicka wrote:
>>>>>>>> Did you mean "the nearest common dominator"?
>>>>>>> If the nearest common dominator appears in the loop while all uses are
>>>>>>> out of loops, this will result in suboptimal xor placement.
>>>>>>> In this case you want to split edges out of the loop.
>>>>>>>
>>>>>>> In general this is what the LCM framework will do for you if the problem
>>>>>>> is modelled siimlar way as in mode_swtiching.  At entry function mode is
>>>>>>> "no zero register needed" and all conversions need mode "zero register
>>>>>>> needed".  Mode switching should then do the correct placement decisions
>>>>>>> (reaching minimal number of executions of xor).
>>>>>>>
>>>>>>> Jeff, whan is your optinion on the approach taken by the patch?
>>>>>>> It seems like a special case of more general issue, but I do not see
>>>>>>> very elegant way to solve it at least in the GCC 9 horisont, so if
>>>>>>> the placement is correct we can probalby go either with new pass or
>>>>>>> making this part of mode swithcing (which is anyway run by x86 backend)
>>>>>> So I haven't followed this discussion at all, but did touch on this
>>>>>> issue with some patch a month or two ago with a target patch that was
>>>>>> trying to avoid the partial stalls.
>>>>>>
>>>>>> My assumption is that we're trying to find one or more places to
>>>>>> initialize the upper half of an avx register so as to avoid partial
>>>>>> register stall at existing sites that set the upper half.
>>>>>>
>>>>>> This sounds like a classic PRE/LCM style problem (of which mode
>>>>>> switching is just another variant).   A common-dominator approach is
>>>>>> closer to a classic GCSE and is going to result is more initializations
>>>>>> at sub-optimal points than a PRE/LCM style.
>>>>> yes, it is usual code placement problem. It is special case because the
>>>>> zero register is not modified by the conversion (just we need to have
>>>>> zero somewhere).  So basically we do not have kills to the zero except
>>>>> for entry block.
>>>>>
>>>> Do you have  testcase to show thatf the nearest common dominator
>>>> in the loop, while all uses areout of loops, leads to suboptimal xor
>>>> placement?
>>> I don't have a testcase, but it's all but certain nearest common
>>> dominator is going to be a suboptimal placement.  That's going to create
>>> paths where you're going to emit the xor when it's not used.
>>>
>>> The whole point of the LCM algorithms is they are optimal in terms of
>>> expression evaluations.
>> We tried LCM and it didn't work well for this case.  LCM places a single
>> VXOR close to the location where it is needed, which can be inside a
>> loop.  There is nothing wrong with the LCM algorithms.   But this doesn't
>> solve
>>
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007
>>
>> where VXOR is executed multiple times inside of a function, instead of
>> just once.   We are investigating to generate a single VXOR at entry of the
>> nearest dominator for basic blocks with SF/DF conversions, which is in
>> the the fake loop that contains the whole function:
>>
>>       bb = nearest_common_dominator_for_set (CDI_DOMINATORS,
>>                                              convert_bbs);
>>       while (bb->loop_father->latch
>>              != EXIT_BLOCK_PTR_FOR_FN (cfun))
>>         bb = get_immediate_dominator (CDI_DOMINATORS,
>>                                       bb->loop_father->header);
>>
>>       insn = BB_HEAD (bb);
>>       if (!NONDEBUG_INSN_P (insn))
>>         insn = next_nonnote_nondebug_insn (insn);
>>       set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
>>       set_insn = emit_insn_before (set, insn);
>>
> Here is the updated patch.  OK for trunk?
> 
> Thanks.
> 
> -- H.J.
> 
> 
> 0001-i386-Add-pass_remove_partial_avx_dependency.patch
> 
> From 6eca7dbf282d7e2a5cde41bffeca66195d72d48e Mon Sep 17 00:00:00 2001
> From: "H.J. Lu" <hjl.tools@gmail.com>
> Date: Mon, 7 Jan 2019 05:44:59 -0800
> Subject: [PATCH] i386: Add pass_remove_partial_avx_dependency
> 
> With -mavx, for
> 
> $ cat foo.i
> extern float f;
> extern double d;
> extern int i;
> 
> void
> foo (void)
> {
>   d = f;
>   f = i;
> }
> 
> we need to generate
> 
> 	vxorp[ds]	%xmmN, %xmmN, %xmmN
> 	...
> 	vcvtss2sd	f(%rip), %xmmN, %xmmX
> 	...
> 	vcvtsi2ss	i(%rip), %xmmN, %xmmY
> 
> to avoid partial XMM register stall.  This patch adds a pass to generate
> a single
> 
> 	vxorps		%xmmN, %xmmN, %xmmN
> 
> at entry of the nearest dominator for basic blocks with SF/DF conversions,
> which is in the fake loop that contains the whole function, instead of
> generating one
> 
> 	vxorp[ds]	%xmmN, %xmmN, %xmmN
> 
> for each SF/DF conversion.
> 
> NB: The LCM algorithm isn't appropriate here since it may place a vxorps
> inside the loop.  Simple testcase show this:
> 
> $ cat badcase.c
> 
> extern float f;
> extern double d;
> 
> void
> foo (int n, int k)
> {
>   for (int j = 0; j != n; j++)
>     if (j < k)
>       d = f;
> }
> 
> It generates
> 
>     ...
>     loop:
>       if(j < k)
>         vxorps  %xmm0, %xmm0, %xmm0
>         vcvtss2sd  %xmm1, %xmm0, %xmm0
>       ...
>     loopend
>     ...
> 
> This is because LCM only works when there is a certain benifit.  But for
> conditional branch, LCM wouldn't move
> 
>    vxorps  %xmm0, %xmm0, %xmm0
It works this way for a reason.  There are obviously paths through the
loop where the conversion does not happen and thus the vxor is not
needed or desirable on those paths.

That's a fundamental property of the LCM algorithm -- it never inserts
an evaluation on a path through the CFG where it will not be used.

Your algorithm of inserting into the dominator block will introduce
runtime executions of the vxor on paths where it is not needed.

It's well known that relaxing that property of LCM can result in better
code generation in some circumstances.  Block copying and loop
restructuring are the gold standard for dealing with this kind of problem.

In this case you could split the iteration space so that you have two
loops.  one for 0..k and the other for k..n.  Note that GCC has support
for this kind of loop restructuring.  This has the advantage that the j
< k test does not happen each iteration of the loop and the vxor stuff
via LCM would be optimal.

There's many other cases where copying and restructuring results in
better common subexpression elimination (which is what you're doing).
Probably the best work I've seen in this space is Bodik's thesis.
Click's work from '95 touches on some of this as well, but isn't as
relevant to this specific instance.

Anyway, whether or not the patch should move forward is really up to Jan
(and Uros if he wants to be involved) I think.  I'm not fundamentally
opposed to HJ's approach as I'm aware of the different tradeoffs.

HJ's approach of pulling into the dominator block can result in
unnecessary evaluations.  But it can also reduce the number of
evaluations in other cases.  It really depends on the runtime behavior
of the code.   One could argue that the vxor stuff we're talking about
is most likely happening in loops, and probably not in deeply nested
control structures within those loops.  Thus pulling them out more
aggressively ala LICM may be better than LCM.

But where to throttle that aggressiveness isn't obvious.  A hybrid of
LICM + LCM would probably be better than either individually.

jeff






^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: V3 [PATCH] i386: Add pass_remove_partial_avx_dependency
  2019-01-21 18:54               ` Jeff Law
@ 2019-01-21 21:27                 ` H.J. Lu
  2019-01-22 12:08                   ` Richard Biener
  0 siblings, 1 reply; 21+ messages in thread
From: H.J. Lu @ 2019-01-21 21:27 UTC (permalink / raw)
  To: Jeff Law
  Cc: Jan Hubicka, GCC Patches, Pandey, Sunil K, Uros Bizjak,
	Hongtao Liu, Jakub Jelinek

On Mon, Jan 21, 2019 at 10:54 AM Jeff Law <law@redhat.com> wrote:
>
> On 1/7/19 6:55 AM, H.J. Lu wrote:
> > On Sun, Dec 30, 2018 at 8:40 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >> On Wed, Nov 28, 2018 at 12:17 PM Jeff Law <law@redhat.com> wrote:
> >>> On 11/28/18 12:48 PM, H.J. Lu wrote:
> >>>> On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubicka@ucw.cz> wrote:
> >>>>>> On 11/5/18 7:21 AM, Jan Hubicka wrote:
> >>>>>>>> Did you mean "the nearest common dominator"?
> >>>>>>> If the nearest common dominator appears in the loop while all uses are
> >>>>>>> out of loops, this will result in suboptimal xor placement.
> >>>>>>> In this case you want to split edges out of the loop.
> >>>>>>>
> >>>>>>> In general this is what the LCM framework will do for you if the problem
> >>>>>>> is modelled siimlar way as in mode_swtiching.  At entry function mode is
> >>>>>>> "no zero register needed" and all conversions need mode "zero register
> >>>>>>> needed".  Mode switching should then do the correct placement decisions
> >>>>>>> (reaching minimal number of executions of xor).
> >>>>>>>
> >>>>>>> Jeff, whan is your optinion on the approach taken by the patch?
> >>>>>>> It seems like a special case of more general issue, but I do not see
> >>>>>>> very elegant way to solve it at least in the GCC 9 horisont, so if
> >>>>>>> the placement is correct we can probalby go either with new pass or
> >>>>>>> making this part of mode swithcing (which is anyway run by x86 backend)
> >>>>>> So I haven't followed this discussion at all, but did touch on this
> >>>>>> issue with some patch a month or two ago with a target patch that was
> >>>>>> trying to avoid the partial stalls.
> >>>>>>
> >>>>>> My assumption is that we're trying to find one or more places to
> >>>>>> initialize the upper half of an avx register so as to avoid partial
> >>>>>> register stall at existing sites that set the upper half.
> >>>>>>
> >>>>>> This sounds like a classic PRE/LCM style problem (of which mode
> >>>>>> switching is just another variant).   A common-dominator approach is
> >>>>>> closer to a classic GCSE and is going to result is more initializations
> >>>>>> at sub-optimal points than a PRE/LCM style.
> >>>>> yes, it is usual code placement problem. It is special case because the
> >>>>> zero register is not modified by the conversion (just we need to have
> >>>>> zero somewhere).  So basically we do not have kills to the zero except
> >>>>> for entry block.
> >>>>>
> >>>> Do you have  testcase to show thatf the nearest common dominator
> >>>> in the loop, while all uses areout of loops, leads to suboptimal xor
> >>>> placement?
> >>> I don't have a testcase, but it's all but certain nearest common
> >>> dominator is going to be a suboptimal placement.  That's going to create
> >>> paths where you're going to emit the xor when it's not used.
> >>>
> >>> The whole point of the LCM algorithms is they are optimal in terms of
> >>> expression evaluations.
> >> We tried LCM and it didn't work well for this case.  LCM places a single
> >> VXOR close to the location where it is needed, which can be inside a
> >> loop.  There is nothing wrong with the LCM algorithms.   But this doesn't
> >> solve
> >>
> >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007
> >>
> >> where VXOR is executed multiple times inside of a function, instead of
> >> just once.   We are investigating to generate a single VXOR at entry of the
> >> nearest dominator for basic blocks with SF/DF conversions, which is in
> >> the the fake loop that contains the whole function:
> >>
> >>       bb = nearest_common_dominator_for_set (CDI_DOMINATORS,
> >>                                              convert_bbs);
> >>       while (bb->loop_father->latch
> >>              != EXIT_BLOCK_PTR_FOR_FN (cfun))
> >>         bb = get_immediate_dominator (CDI_DOMINATORS,
> >>                                       bb->loop_father->header);
> >>
> >>       insn = BB_HEAD (bb);
> >>       if (!NONDEBUG_INSN_P (insn))
> >>         insn = next_nonnote_nondebug_insn (insn);
> >>       set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
> >>       set_insn = emit_insn_before (set, insn);
> >>
> > Here is the updated patch.  OK for trunk?
> >
> > Thanks.
> >
> > -- H.J.
> >
> >
> > 0001-i386-Add-pass_remove_partial_avx_dependency.patch
> >
> > From 6eca7dbf282d7e2a5cde41bffeca66195d72d48e Mon Sep 17 00:00:00 2001
> > From: "H.J. Lu" <hjl.tools@gmail.com>
> > Date: Mon, 7 Jan 2019 05:44:59 -0800
> > Subject: [PATCH] i386: Add pass_remove_partial_avx_dependency
> >
> > With -mavx, for
> >
> > $ cat foo.i
> > extern float f;
> > extern double d;
> > extern int i;
> >
> > void
> > foo (void)
> > {
> >   d = f;
> >   f = i;
> > }
> >
> > we need to generate
> >
> >       vxorp[ds]       %xmmN, %xmmN, %xmmN
> >       ...
> >       vcvtss2sd       f(%rip), %xmmN, %xmmX
> >       ...
> >       vcvtsi2ss       i(%rip), %xmmN, %xmmY
> >
> > to avoid partial XMM register stall.  This patch adds a pass to generate
> > a single
> >
> >       vxorps          %xmmN, %xmmN, %xmmN
> >
> > at entry of the nearest dominator for basic blocks with SF/DF conversions,
> > which is in the fake loop that contains the whole function, instead of
> > generating one
> >
> >       vxorp[ds]       %xmmN, %xmmN, %xmmN
> >
> > for each SF/DF conversion.
> >
> > NB: The LCM algorithm isn't appropriate here since it may place a vxorps
> > inside the loop.  Simple testcase show this:
> >
> > $ cat badcase.c
> >
> > extern float f;
> > extern double d;
> >
> > void
> > foo (int n, int k)
> > {
> >   for (int j = 0; j != n; j++)
> >     if (j < k)
> >       d = f;
> > }
> >
> > It generates
> >
> >     ...
> >     loop:
> >       if(j < k)
> >         vxorps  %xmm0, %xmm0, %xmm0
> >         vcvtss2sd  %xmm1, %xmm0, %xmm0
> >       ...
> >     loopend
> >     ...
> >
> > This is because LCM only works when there is a certain benifit.  But for
> > conditional branch, LCM wouldn't move
> >
> >    vxorps  %xmm0, %xmm0, %xmm0
> It works this way for a reason.  There are obviously paths through the
> loop where the conversion does not happen and thus the vxor is not
> needed or desirable on those paths.
>
> That's a fundamental property of the LCM algorithm -- it never inserts
> an evaluation on a path through the CFG where it will not be used.
>
> Your algorithm of inserting into the dominator block will introduce
> runtime executions of the vxor on paths where it is not needed.
>
> It's well known that relaxing that property of LCM can result in better
> code generation in some circumstances.  Block copying and loop
> restructuring are the gold standard for dealing with this kind of problem.
>
> In this case you could split the iteration space so that you have two
> loops.  one for 0..k and the other for k..n.  Note that GCC has support
> for this kind of loop restructuring.  This has the advantage that the j
> < k test does not happen each iteration of the loop and the vxor stuff
> via LCM would be optimal.
>
> There's many other cases where copying and restructuring results in
> better common subexpression elimination (which is what you're doing).
> Probably the best work I've seen in this space is Bodik's thesis.
> Click's work from '95 touches on some of this as well, but isn't as
> relevant to this specific instance.
>
> Anyway, whether or not the patch should move forward is really up to Jan
> (and Uros if he wants to be involved) I think.  I'm not fundamentally
> opposed to HJ's approach as I'm aware of the different tradeoffs.
>
> HJ's approach of pulling into the dominator block can result in
> unnecessary evaluations.  But it can also reduce the number of
> evaluations in other cases.  It really depends on the runtime behavior
> of the code.   One could argue that the vxor stuff we're talking about
> is most likely happening in loops, and probably not in deeply nested
> control structures within those loops.  Thus pulling them out more
> aggressively ala LICM may be better than LCM.

True, there is a trade-off.   My approach inserts a vxorps at the last possible
position.  Yes, vxorps will always be executed even if it may not be required.
Since it is executed only once in all cases, it is a win overall.

> But where to throttle that aggressiveness isn't obvious.  A hybrid of
> LICM + LCM would probably be better than either individually.
>

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: V3 [PATCH] i386: Add pass_remove_partial_avx_dependency
  2019-01-21 21:27                 ` H.J. Lu
@ 2019-01-22 12:08                   ` Richard Biener
  2019-01-22 13:28                     ` H.J. Lu
  0 siblings, 1 reply; 21+ messages in thread
From: Richard Biener @ 2019-01-22 12:08 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Jeff Law, Jan Hubicka, GCC Patches, Pandey, Sunil K, Uros Bizjak,
	Hongtao Liu, Jakub Jelinek

On Mon, Jan 21, 2019 at 10:27 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Mon, Jan 21, 2019 at 10:54 AM Jeff Law <law@redhat.com> wrote:
> >
> > On 1/7/19 6:55 AM, H.J. Lu wrote:
> > > On Sun, Dec 30, 2018 at 8:40 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >> On Wed, Nov 28, 2018 at 12:17 PM Jeff Law <law@redhat.com> wrote:
> > >>> On 11/28/18 12:48 PM, H.J. Lu wrote:
> > >>>> On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubicka@ucw.cz> wrote:
> > >>>>>> On 11/5/18 7:21 AM, Jan Hubicka wrote:
> > >>>>>>>> Did you mean "the nearest common dominator"?
> > >>>>>>> If the nearest common dominator appears in the loop while all uses are
> > >>>>>>> out of loops, this will result in suboptimal xor placement.
> > >>>>>>> In this case you want to split edges out of the loop.
> > >>>>>>>
> > >>>>>>> In general this is what the LCM framework will do for you if the problem
> > >>>>>>> is modelled siimlar way as in mode_swtiching.  At entry function mode is
> > >>>>>>> "no zero register needed" and all conversions need mode "zero register
> > >>>>>>> needed".  Mode switching should then do the correct placement decisions
> > >>>>>>> (reaching minimal number of executions of xor).
> > >>>>>>>
> > >>>>>>> Jeff, whan is your optinion on the approach taken by the patch?
> > >>>>>>> It seems like a special case of more general issue, but I do not see
> > >>>>>>> very elegant way to solve it at least in the GCC 9 horisont, so if
> > >>>>>>> the placement is correct we can probalby go either with new pass or
> > >>>>>>> making this part of mode swithcing (which is anyway run by x86 backend)
> > >>>>>> So I haven't followed this discussion at all, but did touch on this
> > >>>>>> issue with some patch a month or two ago with a target patch that was
> > >>>>>> trying to avoid the partial stalls.
> > >>>>>>
> > >>>>>> My assumption is that we're trying to find one or more places to
> > >>>>>> initialize the upper half of an avx register so as to avoid partial
> > >>>>>> register stall at existing sites that set the upper half.
> > >>>>>>
> > >>>>>> This sounds like a classic PRE/LCM style problem (of which mode
> > >>>>>> switching is just another variant).   A common-dominator approach is
> > >>>>>> closer to a classic GCSE and is going to result is more initializations
> > >>>>>> at sub-optimal points than a PRE/LCM style.
> > >>>>> yes, it is usual code placement problem. It is special case because the
> > >>>>> zero register is not modified by the conversion (just we need to have
> > >>>>> zero somewhere).  So basically we do not have kills to the zero except
> > >>>>> for entry block.
> > >>>>>
> > >>>> Do you have  testcase to show thatf the nearest common dominator
> > >>>> in the loop, while all uses areout of loops, leads to suboptimal xor
> > >>>> placement?
> > >>> I don't have a testcase, but it's all but certain nearest common
> > >>> dominator is going to be a suboptimal placement.  That's going to create
> > >>> paths where you're going to emit the xor when it's not used.
> > >>>
> > >>> The whole point of the LCM algorithms is they are optimal in terms of
> > >>> expression evaluations.
> > >> We tried LCM and it didn't work well for this case.  LCM places a single
> > >> VXOR close to the location where it is needed, which can be inside a
> > >> loop.  There is nothing wrong with the LCM algorithms.   But this doesn't
> > >> solve
> > >>
> > >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007
> > >>
> > >> where VXOR is executed multiple times inside of a function, instead of
> > >> just once.   We are investigating to generate a single VXOR at entry of the
> > >> nearest dominator for basic blocks with SF/DF conversions, which is in
> > >> the the fake loop that contains the whole function:
> > >>
> > >>       bb = nearest_common_dominator_for_set (CDI_DOMINATORS,
> > >>                                              convert_bbs);
> > >>       while (bb->loop_father->latch
> > >>              != EXIT_BLOCK_PTR_FOR_FN (cfun))
> > >>         bb = get_immediate_dominator (CDI_DOMINATORS,
> > >>                                       bb->loop_father->header);
> > >>
> > >>       insn = BB_HEAD (bb);
> > >>       if (!NONDEBUG_INSN_P (insn))
> > >>         insn = next_nonnote_nondebug_insn (insn);
> > >>       set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
> > >>       set_insn = emit_insn_before (set, insn);
> > >>
> > > Here is the updated patch.  OK for trunk?
> > >
> > > Thanks.
> > >
> > > -- H.J.
> > >
> > >
> > > 0001-i386-Add-pass_remove_partial_avx_dependency.patch
> > >
> > > From 6eca7dbf282d7e2a5cde41bffeca66195d72d48e Mon Sep 17 00:00:00 2001
> > > From: "H.J. Lu" <hjl.tools@gmail.com>
> > > Date: Mon, 7 Jan 2019 05:44:59 -0800
> > > Subject: [PATCH] i386: Add pass_remove_partial_avx_dependency
> > >
> > > With -mavx, for
> > >
> > > $ cat foo.i
> > > extern float f;
> > > extern double d;
> > > extern int i;
> > >
> > > void
> > > foo (void)
> > > {
> > >   d = f;
> > >   f = i;
> > > }
> > >
> > > we need to generate
> > >
> > >       vxorp[ds]       %xmmN, %xmmN, %xmmN
> > >       ...
> > >       vcvtss2sd       f(%rip), %xmmN, %xmmX
> > >       ...
> > >       vcvtsi2ss       i(%rip), %xmmN, %xmmY
> > >
> > > to avoid partial XMM register stall.  This patch adds a pass to generate
> > > a single
> > >
> > >       vxorps          %xmmN, %xmmN, %xmmN
> > >
> > > at entry of the nearest dominator for basic blocks with SF/DF conversions,
> > > which is in the fake loop that contains the whole function, instead of
> > > generating one
> > >
> > >       vxorp[ds]       %xmmN, %xmmN, %xmmN
> > >
> > > for each SF/DF conversion.
> > >
> > > NB: The LCM algorithm isn't appropriate here since it may place a vxorps
> > > inside the loop.  Simple testcase show this:
> > >
> > > $ cat badcase.c
> > >
> > > extern float f;
> > > extern double d;
> > >
> > > void
> > > foo (int n, int k)
> > > {
> > >   for (int j = 0; j != n; j++)
> > >     if (j < k)
> > >       d = f;
> > > }
> > >
> > > It generates
> > >
> > >     ...
> > >     loop:
> > >       if(j < k)
> > >         vxorps  %xmm0, %xmm0, %xmm0
> > >         vcvtss2sd  %xmm1, %xmm0, %xmm0
> > >       ...
> > >     loopend
> > >     ...
> > >
> > > This is because LCM only works when there is a certain benifit.  But for
> > > conditional branch, LCM wouldn't move
> > >
> > >    vxorps  %xmm0, %xmm0, %xmm0
> > It works this way for a reason.  There are obviously paths through the
> > loop where the conversion does not happen and thus the vxor is not
> > needed or desirable on those paths.
> >
> > That's a fundamental property of the LCM algorithm -- it never inserts
> > an evaluation on a path through the CFG where it will not be used.
> >
> > Your algorithm of inserting into the dominator block will introduce
> > runtime executions of the vxor on paths where it is not needed.
> >
> > It's well known that relaxing that property of LCM can result in better
> > code generation in some circumstances.  Block copying and loop
> > restructuring are the gold standard for dealing with this kind of problem.
> >
> > In this case you could split the iteration space so that you have two
> > loops.  one for 0..k and the other for k..n.  Note that GCC has support
> > for this kind of loop restructuring.  This has the advantage that the j
> > < k test does not happen each iteration of the loop and the vxor stuff
> > via LCM would be optimal.
> >
> > There's many other cases where copying and restructuring results in
> > better common subexpression elimination (which is what you're doing).
> > Probably the best work I've seen in this space is Bodik's thesis.
> > Click's work from '95 touches on some of this as well, but isn't as
> > relevant to this specific instance.
> >
> > Anyway, whether or not the patch should move forward is really up to Jan
> > (and Uros if he wants to be involved) I think.  I'm not fundamentally
> > opposed to HJ's approach as I'm aware of the different tradeoffs.
> >
> > HJ's approach of pulling into the dominator block can result in
> > unnecessary evaluations.  But it can also reduce the number of
> > evaluations in other cases.  It really depends on the runtime behavior
> > of the code.   One could argue that the vxor stuff we're talking about
> > is most likely happening in loops, and probably not in deeply nested
> > control structures within those loops.  Thus pulling them out more
> > aggressively ala LICM may be better than LCM.
>
> True, there is a trade-off.   My approach inserts a vxorps at the last possible
> position.  Yes, vxorps will always be executed even if it may not be required.
> Since it is executed only once in all cases, it is a win overall.

Hopefully a simple vpxor won't end up powering up the other AVX512
unit if it lay dormant ...

And if we ever get to the state of having two separate ISAs in the same
function then you'd need to make sure you can execute vpxor in the
place you are inserting since it may now be executed when it wasn't
before (and I assume you already check that you do not zero the
reg if there's a value life in it if the conditional def you are instrumenting
is not executed).

> > But where to throttle that aggressiveness isn't obvious.  A hybrid of
> > LICM + LCM would probably be better than either individually.
> >
>
> Thanks.
>
> --
> H.J.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: V3 [PATCH] i386: Add pass_remove_partial_avx_dependency
  2019-01-22 12:08                   ` Richard Biener
@ 2019-01-22 13:28                     ` H.J. Lu
  2019-01-28 17:29                       ` PING ^1: " H.J. Lu
  0 siblings, 1 reply; 21+ messages in thread
From: H.J. Lu @ 2019-01-22 13:28 UTC (permalink / raw)
  To: Richard Biener
  Cc: Jeff Law, Jan Hubicka, GCC Patches, Pandey, Sunil K, Uros Bizjak,
	Hongtao Liu, Jakub Jelinek

On Tue, Jan 22, 2019 at 4:08 AM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Mon, Jan 21, 2019 at 10:27 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Mon, Jan 21, 2019 at 10:54 AM Jeff Law <law@redhat.com> wrote:
> > >
> > > On 1/7/19 6:55 AM, H.J. Lu wrote:
> > > > On Sun, Dec 30, 2018 at 8:40 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > >> On Wed, Nov 28, 2018 at 12:17 PM Jeff Law <law@redhat.com> wrote:
> > > >>> On 11/28/18 12:48 PM, H.J. Lu wrote:
> > > >>>> On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubicka@ucw.cz> wrote:
> > > >>>>>> On 11/5/18 7:21 AM, Jan Hubicka wrote:
> > > >>>>>>>> Did you mean "the nearest common dominator"?
> > > >>>>>>> If the nearest common dominator appears in the loop while all uses are
> > > >>>>>>> out of loops, this will result in suboptimal xor placement.
> > > >>>>>>> In this case you want to split edges out of the loop.
> > > >>>>>>>
> > > >>>>>>> In general this is what the LCM framework will do for you if the problem
> > > >>>>>>> is modelled siimlar way as in mode_swtiching.  At entry function mode is
> > > >>>>>>> "no zero register needed" and all conversions need mode "zero register
> > > >>>>>>> needed".  Mode switching should then do the correct placement decisions
> > > >>>>>>> (reaching minimal number of executions of xor).
> > > >>>>>>>
> > > >>>>>>> Jeff, whan is your optinion on the approach taken by the patch?
> > > >>>>>>> It seems like a special case of more general issue, but I do not see
> > > >>>>>>> very elegant way to solve it at least in the GCC 9 horisont, so if
> > > >>>>>>> the placement is correct we can probalby go either with new pass or
> > > >>>>>>> making this part of mode swithcing (which is anyway run by x86 backend)
> > > >>>>>> So I haven't followed this discussion at all, but did touch on this
> > > >>>>>> issue with some patch a month or two ago with a target patch that was
> > > >>>>>> trying to avoid the partial stalls.
> > > >>>>>>
> > > >>>>>> My assumption is that we're trying to find one or more places to
> > > >>>>>> initialize the upper half of an avx register so as to avoid partial
> > > >>>>>> register stall at existing sites that set the upper half.
> > > >>>>>>
> > > >>>>>> This sounds like a classic PRE/LCM style problem (of which mode
> > > >>>>>> switching is just another variant).   A common-dominator approach is
> > > >>>>>> closer to a classic GCSE and is going to result is more initializations
> > > >>>>>> at sub-optimal points than a PRE/LCM style.
> > > >>>>> yes, it is usual code placement problem. It is special case because the
> > > >>>>> zero register is not modified by the conversion (just we need to have
> > > >>>>> zero somewhere).  So basically we do not have kills to the zero except
> > > >>>>> for entry block.
> > > >>>>>
> > > >>>> Do you have  testcase to show thatf the nearest common dominator
> > > >>>> in the loop, while all uses areout of loops, leads to suboptimal xor
> > > >>>> placement?
> > > >>> I don't have a testcase, but it's all but certain nearest common
> > > >>> dominator is going to be a suboptimal placement.  That's going to create
> > > >>> paths where you're going to emit the xor when it's not used.
> > > >>>
> > > >>> The whole point of the LCM algorithms is they are optimal in terms of
> > > >>> expression evaluations.
> > > >> We tried LCM and it didn't work well for this case.  LCM places a single
> > > >> VXOR close to the location where it is needed, which can be inside a
> > > >> loop.  There is nothing wrong with the LCM algorithms.   But this doesn't
> > > >> solve
> > > >>
> > > >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007
> > > >>
> > > >> where VXOR is executed multiple times inside of a function, instead of
> > > >> just once.   We are investigating to generate a single VXOR at entry of the
> > > >> nearest dominator for basic blocks with SF/DF conversions, which is in
> > > >> the the fake loop that contains the whole function:
> > > >>
> > > >>       bb = nearest_common_dominator_for_set (CDI_DOMINATORS,
> > > >>                                              convert_bbs);
> > > >>       while (bb->loop_father->latch
> > > >>              != EXIT_BLOCK_PTR_FOR_FN (cfun))
> > > >>         bb = get_immediate_dominator (CDI_DOMINATORS,
> > > >>                                       bb->loop_father->header);
> > > >>
> > > >>       insn = BB_HEAD (bb);
> > > >>       if (!NONDEBUG_INSN_P (insn))
> > > >>         insn = next_nonnote_nondebug_insn (insn);
> > > >>       set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
> > > >>       set_insn = emit_insn_before (set, insn);
> > > >>
> > > > Here is the updated patch.  OK for trunk?
> > > >
> > > > Thanks.
> > > >
> > > > -- H.J.
> > > >
> > > >
> > > > 0001-i386-Add-pass_remove_partial_avx_dependency.patch
> > > >
> > > > From 6eca7dbf282d7e2a5cde41bffeca66195d72d48e Mon Sep 17 00:00:00 2001
> > > > From: "H.J. Lu" <hjl.tools@gmail.com>
> > > > Date: Mon, 7 Jan 2019 05:44:59 -0800
> > > > Subject: [PATCH] i386: Add pass_remove_partial_avx_dependency
> > > >
> > > > With -mavx, for
> > > >
> > > > $ cat foo.i
> > > > extern float f;
> > > > extern double d;
> > > > extern int i;
> > > >
> > > > void
> > > > foo (void)
> > > > {
> > > >   d = f;
> > > >   f = i;
> > > > }
> > > >
> > > > we need to generate
> > > >
> > > >       vxorp[ds]       %xmmN, %xmmN, %xmmN
> > > >       ...
> > > >       vcvtss2sd       f(%rip), %xmmN, %xmmX
> > > >       ...
> > > >       vcvtsi2ss       i(%rip), %xmmN, %xmmY
> > > >
> > > > to avoid partial XMM register stall.  This patch adds a pass to generate
> > > > a single
> > > >
> > > >       vxorps          %xmmN, %xmmN, %xmmN
> > > >
> > > > at entry of the nearest dominator for basic blocks with SF/DF conversions,
> > > > which is in the fake loop that contains the whole function, instead of
> > > > generating one
> > > >
> > > >       vxorp[ds]       %xmmN, %xmmN, %xmmN
> > > >
> > > > for each SF/DF conversion.
> > > >
> > > > NB: The LCM algorithm isn't appropriate here since it may place a vxorps
> > > > inside the loop.  Simple testcase show this:
> > > >
> > > > $ cat badcase.c
> > > >
> > > > extern float f;
> > > > extern double d;
> > > >
> > > > void
> > > > foo (int n, int k)
> > > > {
> > > >   for (int j = 0; j != n; j++)
> > > >     if (j < k)
> > > >       d = f;
> > > > }
> > > >
> > > > It generates
> > > >
> > > >     ...
> > > >     loop:
> > > >       if(j < k)
> > > >         vxorps  %xmm0, %xmm0, %xmm0
> > > >         vcvtss2sd  %xmm1, %xmm0, %xmm0
> > > >       ...
> > > >     loopend
> > > >     ...
> > > >
> > > > This is because LCM only works when there is a certain benifit.  But for
> > > > conditional branch, LCM wouldn't move
> > > >
> > > >    vxorps  %xmm0, %xmm0, %xmm0
> > > It works this way for a reason.  There are obviously paths through the
> > > loop where the conversion does not happen and thus the vxor is not
> > > needed or desirable on those paths.
> > >
> > > That's a fundamental property of the LCM algorithm -- it never inserts
> > > an evaluation on a path through the CFG where it will not be used.
> > >
> > > Your algorithm of inserting into the dominator block will introduce
> > > runtime executions of the vxor on paths where it is not needed.
> > >
> > > It's well known that relaxing that property of LCM can result in better
> > > code generation in some circumstances.  Block copying and loop
> > > restructuring are the gold standard for dealing with this kind of problem.
> > >
> > > In this case you could split the iteration space so that you have two
> > > loops.  one for 0..k and the other for k..n.  Note that GCC has support
> > > for this kind of loop restructuring.  This has the advantage that the j
> > > < k test does not happen each iteration of the loop and the vxor stuff
> > > via LCM would be optimal.
> > >
> > > There's many other cases where copying and restructuring results in
> > > better common subexpression elimination (which is what you're doing).
> > > Probably the best work I've seen in this space is Bodik's thesis.
> > > Click's work from '95 touches on some of this as well, but isn't as
> > > relevant to this specific instance.
> > >
> > > Anyway, whether or not the patch should move forward is really up to Jan
> > > (and Uros if he wants to be involved) I think.  I'm not fundamentally
> > > opposed to HJ's approach as I'm aware of the different tradeoffs.
> > >
> > > HJ's approach of pulling into the dominator block can result in
> > > unnecessary evaluations.  But it can also reduce the number of
> > > evaluations in other cases.  It really depends on the runtime behavior
> > > of the code.   One could argue that the vxor stuff we're talking about
> > > is most likely happening in loops, and probably not in deeply nested
> > > control structures within those loops.  Thus pulling them out more
> > > aggressively ala LICM may be better than LCM.
> >
> > True, there is a trade-off.   My approach inserts a vxorps at the last possible
> > position.  Yes, vxorps will always be executed even if it may not be required.
> > Since it is executed only once in all cases, it is a win overall.
>
> Hopefully a simple vpxor won't end up powering up the other AVX512
> unit if it lay dormant ...

A 128-bit AVX vpxor won't touch AVX512.

> And if we ever get to the state of having two separate ISAs in the same
> function then you'd need to make sure you can execute vpxor in the
> place you are inserting since it may now be executed when it wasn't
> before (and I assume you already check that you do not zero the
> reg if there's a value life in it if the conditional def you are instrumenting
> is not executed).

A dedicated pseudo register is allocated and zeroed for INT->FP and
FP->FP conversions.  IRA/LRA take care of the rest.


-- 
H.J.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* PING ^1: V3 [PATCH] i386: Add pass_remove_partial_avx_dependency
  2019-01-22 13:28                     ` H.J. Lu
@ 2019-01-28 17:29                       ` H.J. Lu
  2019-02-01 19:04                         ` [PATCH] [8/9 Regression] " H.J. Lu
  0 siblings, 1 reply; 21+ messages in thread
From: H.J. Lu @ 2019-01-28 17:29 UTC (permalink / raw)
  To: Richard Biener
  Cc: Jeff Law, Jan Hubicka, GCC Patches, Pandey, Sunil K, Uros Bizjak,
	Hongtao Liu, Jakub Jelinek

On Tue, Jan 22, 2019 at 5:28 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Tue, Jan 22, 2019 at 4:08 AM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Mon, Jan 21, 2019 at 10:27 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > > On Mon, Jan 21, 2019 at 10:54 AM Jeff Law <law@redhat.com> wrote:
> > > >
> > > > On 1/7/19 6:55 AM, H.J. Lu wrote:
> > > > > On Sun, Dec 30, 2018 at 8:40 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > >> On Wed, Nov 28, 2018 at 12:17 PM Jeff Law <law@redhat.com> wrote:
> > > > >>> On 11/28/18 12:48 PM, H.J. Lu wrote:
> > > > >>>> On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubicka@ucw.cz> wrote:
> > > > >>>>>> On 11/5/18 7:21 AM, Jan Hubicka wrote:
> > > > >>>>>>>> Did you mean "the nearest common dominator"?
> > > > >>>>>>> If the nearest common dominator appears in the loop while all uses are
> > > > >>>>>>> out of loops, this will result in suboptimal xor placement.
> > > > >>>>>>> In this case you want to split edges out of the loop.
> > > > >>>>>>>
> > > > >>>>>>> In general this is what the LCM framework will do for you if the problem
> > > > >>>>>>> is modelled siimlar way as in mode_swtiching.  At entry function mode is
> > > > >>>>>>> "no zero register needed" and all conversions need mode "zero register
> > > > >>>>>>> needed".  Mode switching should then do the correct placement decisions
> > > > >>>>>>> (reaching minimal number of executions of xor).
> > > > >>>>>>>
> > > > >>>>>>> Jeff, whan is your optinion on the approach taken by the patch?
> > > > >>>>>>> It seems like a special case of more general issue, but I do not see
> > > > >>>>>>> very elegant way to solve it at least in the GCC 9 horisont, so if
> > > > >>>>>>> the placement is correct we can probalby go either with new pass or
> > > > >>>>>>> making this part of mode swithcing (which is anyway run by x86 backend)
> > > > >>>>>> So I haven't followed this discussion at all, but did touch on this
> > > > >>>>>> issue with some patch a month or two ago with a target patch that was
> > > > >>>>>> trying to avoid the partial stalls.
> > > > >>>>>>
> > > > >>>>>> My assumption is that we're trying to find one or more places to
> > > > >>>>>> initialize the upper half of an avx register so as to avoid partial
> > > > >>>>>> register stall at existing sites that set the upper half.
> > > > >>>>>>
> > > > >>>>>> This sounds like a classic PRE/LCM style problem (of which mode
> > > > >>>>>> switching is just another variant).   A common-dominator approach is
> > > > >>>>>> closer to a classic GCSE and is going to result is more initializations
> > > > >>>>>> at sub-optimal points than a PRE/LCM style.
> > > > >>>>> yes, it is usual code placement problem. It is special case because the
> > > > >>>>> zero register is not modified by the conversion (just we need to have
> > > > >>>>> zero somewhere).  So basically we do not have kills to the zero except
> > > > >>>>> for entry block.
> > > > >>>>>
> > > > >>>> Do you have  testcase to show thatf the nearest common dominator
> > > > >>>> in the loop, while all uses areout of loops, leads to suboptimal xor
> > > > >>>> placement?
> > > > >>> I don't have a testcase, but it's all but certain nearest common
> > > > >>> dominator is going to be a suboptimal placement.  That's going to create
> > > > >>> paths where you're going to emit the xor when it's not used.
> > > > >>>
> > > > >>> The whole point of the LCM algorithms is they are optimal in terms of
> > > > >>> expression evaluations.
> > > > >> We tried LCM and it didn't work well for this case.  LCM places a single
> > > > >> VXOR close to the location where it is needed, which can be inside a
> > > > >> loop.  There is nothing wrong with the LCM algorithms.   But this doesn't
> > > > >> solve
> > > > >>
> > > > >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007
> > > > >>
> > > > >> where VXOR is executed multiple times inside of a function, instead of
> > > > >> just once.   We are investigating to generate a single VXOR at entry of the
> > > > >> nearest dominator for basic blocks with SF/DF conversions, which is in
> > > > >> the the fake loop that contains the whole function:
> > > > >>
> > > > >>       bb = nearest_common_dominator_for_set (CDI_DOMINATORS,
> > > > >>                                              convert_bbs);
> > > > >>       while (bb->loop_father->latch
> > > > >>              != EXIT_BLOCK_PTR_FOR_FN (cfun))
> > > > >>         bb = get_immediate_dominator (CDI_DOMINATORS,
> > > > >>                                       bb->loop_father->header);
> > > > >>
> > > > >>       insn = BB_HEAD (bb);
> > > > >>       if (!NONDEBUG_INSN_P (insn))
> > > > >>         insn = next_nonnote_nondebug_insn (insn);
> > > > >>       set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
> > > > >>       set_insn = emit_insn_before (set, insn);
> > > > >>
> > > > > Here is the updated patch.  OK for trunk?
> > > > >
> > > > > Thanks.
> > > > >
> > > > > -- H.J.
> > > > >
> > > > >
> > > > > 0001-i386-Add-pass_remove_partial_avx_dependency.patch
> > > > >
> > > > > From 6eca7dbf282d7e2a5cde41bffeca66195d72d48e Mon Sep 17 00:00:00 2001
> > > > > From: "H.J. Lu" <hjl.tools@gmail.com>
> > > > > Date: Mon, 7 Jan 2019 05:44:59 -0800
> > > > > Subject: [PATCH] i386: Add pass_remove_partial_avx_dependency
> > > > >
> > > > > With -mavx, for
> > > > >
> > > > > $ cat foo.i
> > > > > extern float f;
> > > > > extern double d;
> > > > > extern int i;
> > > > >
> > > > > void
> > > > > foo (void)
> > > > > {
> > > > >   d = f;
> > > > >   f = i;
> > > > > }
> > > > >
> > > > > we need to generate
> > > > >
> > > > >       vxorp[ds]       %xmmN, %xmmN, %xmmN
> > > > >       ...
> > > > >       vcvtss2sd       f(%rip), %xmmN, %xmmX
> > > > >       ...
> > > > >       vcvtsi2ss       i(%rip), %xmmN, %xmmY
> > > > >
> > > > > to avoid partial XMM register stall.  This patch adds a pass to generate
> > > > > a single
> > > > >
> > > > >       vxorps          %xmmN, %xmmN, %xmmN
> > > > >
> > > > > at entry of the nearest dominator for basic blocks with SF/DF conversions,
> > > > > which is in the fake loop that contains the whole function, instead of
> > > > > generating one
> > > > >
> > > > >       vxorp[ds]       %xmmN, %xmmN, %xmmN
> > > > >
> > > > > for each SF/DF conversion.
> > > > >
> > > > > NB: The LCM algorithm isn't appropriate here since it may place a vxorps
> > > > > inside the loop.  Simple testcase show this:
> > > > >
> > > > > $ cat badcase.c
> > > > >
> > > > > extern float f;
> > > > > extern double d;
> > > > >
> > > > > void
> > > > > foo (int n, int k)
> > > > > {
> > > > >   for (int j = 0; j != n; j++)
> > > > >     if (j < k)
> > > > >       d = f;
> > > > > }
> > > > >
> > > > > It generates
> > > > >
> > > > >     ...
> > > > >     loop:
> > > > >       if(j < k)
> > > > >         vxorps  %xmm0, %xmm0, %xmm0
> > > > >         vcvtss2sd  %xmm1, %xmm0, %xmm0
> > > > >       ...
> > > > >     loopend
> > > > >     ...
> > > > >
> > > > > This is because LCM only works when there is a certain benifit.  But for
> > > > > conditional branch, LCM wouldn't move
> > > > >
> > > > >    vxorps  %xmm0, %xmm0, %xmm0
> > > > It works this way for a reason.  There are obviously paths through the
> > > > loop where the conversion does not happen and thus the vxor is not
> > > > needed or desirable on those paths.
> > > >
> > > > That's a fundamental property of the LCM algorithm -- it never inserts
> > > > an evaluation on a path through the CFG where it will not be used.
> > > >
> > > > Your algorithm of inserting into the dominator block will introduce
> > > > runtime executions of the vxor on paths where it is not needed.
> > > >
> > > > It's well known that relaxing that property of LCM can result in better
> > > > code generation in some circumstances.  Block copying and loop
> > > > restructuring are the gold standard for dealing with this kind of problem.
> > > >
> > > > In this case you could split the iteration space so that you have two
> > > > loops.  one for 0..k and the other for k..n.  Note that GCC has support
> > > > for this kind of loop restructuring.  This has the advantage that the j
> > > > < k test does not happen each iteration of the loop and the vxor stuff
> > > > via LCM would be optimal.
> > > >
> > > > There's many other cases where copying and restructuring results in
> > > > better common subexpression elimination (which is what you're doing).
> > > > Probably the best work I've seen in this space is Bodik's thesis.
> > > > Click's work from '95 touches on some of this as well, but isn't as
> > > > relevant to this specific instance.
> > > >
> > > > Anyway, whether or not the patch should move forward is really up to Jan
> > > > (and Uros if he wants to be involved) I think.  I'm not fundamentally
> > > > opposed to HJ's approach as I'm aware of the different tradeoffs.
> > > >
> > > > HJ's approach of pulling into the dominator block can result in
> > > > unnecessary evaluations.  But it can also reduce the number of
> > > > evaluations in other cases.  It really depends on the runtime behavior
> > > > of the code.   One could argue that the vxor stuff we're talking about
> > > > is most likely happening in loops, and probably not in deeply nested
> > > > control structures within those loops.  Thus pulling them out more
> > > > aggressively ala LICM may be better than LCM.
> > >
> > > True, there is a trade-off.   My approach inserts a vxorps at the last possible
> > > position.  Yes, vxorps will always be executed even if it may not be required.
> > > Since it is executed only once in all cases, it is a win overall.
> >
> > Hopefully a simple vpxor won't end up powering up the other AVX512
> > unit if it lay dormant ...
>
> A 128-bit AVX vpxor won't touch AVX512.
>
> > And if we ever get to the state of having two separate ISAs in the same
> > function then you'd need to make sure you can execute vpxor in the
> > place you are inserting since it may now be executed when it wasn't
> > before (and I assume you already check that you do not zero the
> > reg if there's a value life in it if the conditional def you are instrumenting
> > is not executed).
>
> A dedicated pseudo register is allocated and zeroed for INT->FP and
> FP->FP conversions.  IRA/LRA take care of the rest.
>

PING:

https://gcc.gnu.org/ml/gcc-patches/2019-01/msg00298.html

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH] [8/9 Regression] i386: Add pass_remove_partial_avx_dependency
  2019-01-28 17:29                       ` PING ^1: " H.J. Lu
@ 2019-02-01 19:04                         ` H.J. Lu
  2019-02-21 16:18                           ` Jan Hubicka
  0 siblings, 1 reply; 21+ messages in thread
From: H.J. Lu @ 2019-02-01 19:04 UTC (permalink / raw)
  To: Richard Biener
  Cc: Jeff Law, Jan Hubicka, GCC Patches, Pandey, Sunil K, Uros Bizjak,
	Hongtao Liu, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 11004 bytes --]

On Mon, Jan 28, 2019 at 9:08 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Tue, Jan 22, 2019 at 5:28 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Tue, Jan 22, 2019 at 4:08 AM Richard Biener
> > <richard.guenther@gmail.com> wrote:
> > >
> > > On Mon, Jan 21, 2019 at 10:27 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > >
> > > > On Mon, Jan 21, 2019 at 10:54 AM Jeff Law <law@redhat.com> wrote:
> > > > >
> > > > > On 1/7/19 6:55 AM, H.J. Lu wrote:
> > > > > > On Sun, Dec 30, 2018 at 8:40 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > > >> On Wed, Nov 28, 2018 at 12:17 PM Jeff Law <law@redhat.com> wrote:
> > > > > >>> On 11/28/18 12:48 PM, H.J. Lu wrote:
> > > > > >>>> On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubicka@ucw.cz> wrote:
> > > > > >>>>>> On 11/5/18 7:21 AM, Jan Hubicka wrote:
> > > > > >>>>>>>> Did you mean "the nearest common dominator"?
> > > > > >>>>>>> If the nearest common dominator appears in the loop while all uses are
> > > > > >>>>>>> out of loops, this will result in suboptimal xor placement.
> > > > > >>>>>>> In this case you want to split edges out of the loop.
> > > > > >>>>>>>
> > > > > >>>>>>> In general this is what the LCM framework will do for you if the problem
> > > > > >>>>>>> is modelled siimlar way as in mode_swtiching.  At entry function mode is
> > > > > >>>>>>> "no zero register needed" and all conversions need mode "zero register
> > > > > >>>>>>> needed".  Mode switching should then do the correct placement decisions
> > > > > >>>>>>> (reaching minimal number of executions of xor).
> > > > > >>>>>>>
> > > > > >>>>>>> Jeff, whan is your optinion on the approach taken by the patch?
> > > > > >>>>>>> It seems like a special case of more general issue, but I do not see
> > > > > >>>>>>> very elegant way to solve it at least in the GCC 9 horisont, so if
> > > > > >>>>>>> the placement is correct we can probalby go either with new pass or
> > > > > >>>>>>> making this part of mode swithcing (which is anyway run by x86 backend)
> > > > > >>>>>> So I haven't followed this discussion at all, but did touch on this
> > > > > >>>>>> issue with some patch a month or two ago with a target patch that was
> > > > > >>>>>> trying to avoid the partial stalls.
> > > > > >>>>>>
> > > > > >>>>>> My assumption is that we're trying to find one or more places to
> > > > > >>>>>> initialize the upper half of an avx register so as to avoid partial
> > > > > >>>>>> register stall at existing sites that set the upper half.
> > > > > >>>>>>
> > > > > >>>>>> This sounds like a classic PRE/LCM style problem (of which mode
> > > > > >>>>>> switching is just another variant).   A common-dominator approach is
> > > > > >>>>>> closer to a classic GCSE and is going to result is more initializations
> > > > > >>>>>> at sub-optimal points than a PRE/LCM style.
> > > > > >>>>> yes, it is usual code placement problem. It is special case because the
> > > > > >>>>> zero register is not modified by the conversion (just we need to have
> > > > > >>>>> zero somewhere).  So basically we do not have kills to the zero except
> > > > > >>>>> for entry block.
> > > > > >>>>>
> > > > > >>>> Do you have  testcase to show thatf the nearest common dominator
> > > > > >>>> in the loop, while all uses areout of loops, leads to suboptimal xor
> > > > > >>>> placement?
> > > > > >>> I don't have a testcase, but it's all but certain nearest common
> > > > > >>> dominator is going to be a suboptimal placement.  That's going to create
> > > > > >>> paths where you're going to emit the xor when it's not used.
> > > > > >>>
> > > > > >>> The whole point of the LCM algorithms is they are optimal in terms of
> > > > > >>> expression evaluations.
> > > > > >> We tried LCM and it didn't work well for this case.  LCM places a single
> > > > > >> VXOR close to the location where it is needed, which can be inside a
> > > > > >> loop.  There is nothing wrong with the LCM algorithms.   But this doesn't
> > > > > >> solve
> > > > > >>
> > > > > >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007
> > > > > >>
> > > > > >> where VXOR is executed multiple times inside of a function, instead of
> > > > > >> just once.   We are investigating to generate a single VXOR at entry of the
> > > > > >> nearest dominator for basic blocks with SF/DF conversions, which is in
> > > > > >> the the fake loop that contains the whole function:
> > > > > >>
> > > > > >>       bb = nearest_common_dominator_for_set (CDI_DOMINATORS,
> > > > > >>                                              convert_bbs);
> > > > > >>       while (bb->loop_father->latch
> > > > > >>              != EXIT_BLOCK_PTR_FOR_FN (cfun))
> > > > > >>         bb = get_immediate_dominator (CDI_DOMINATORS,
> > > > > >>                                       bb->loop_father->header);
> > > > > >>
> > > > > >>       insn = BB_HEAD (bb);
> > > > > >>       if (!NONDEBUG_INSN_P (insn))
> > > > > >>         insn = next_nonnote_nondebug_insn (insn);
> > > > > >>       set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
> > > > > >>       set_insn = emit_insn_before (set, insn);
> > > > > >>
> > > > > > Here is the updated patch.  OK for trunk?
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > -- H.J.
> > > > > >
> > > > > >
> > > > > > 0001-i386-Add-pass_remove_partial_avx_dependency.patch
> > > > > >
> > > > > > From 6eca7dbf282d7e2a5cde41bffeca66195d72d48e Mon Sep 17 00:00:00 2001
> > > > > > From: "H.J. Lu" <hjl.tools@gmail.com>
> > > > > > Date: Mon, 7 Jan 2019 05:44:59 -0800
> > > > > > Subject: [PATCH] i386: Add pass_remove_partial_avx_dependency
> > > > > >
> > > > > > With -mavx, for
> > > > > >
> > > > > > $ cat foo.i
> > > > > > extern float f;
> > > > > > extern double d;
> > > > > > extern int i;
> > > > > >
> > > > > > void
> > > > > > foo (void)
> > > > > > {
> > > > > >   d = f;
> > > > > >   f = i;
> > > > > > }
> > > > > >
> > > > > > we need to generate
> > > > > >
> > > > > >       vxorp[ds]       %xmmN, %xmmN, %xmmN
> > > > > >       ...
> > > > > >       vcvtss2sd       f(%rip), %xmmN, %xmmX
> > > > > >       ...
> > > > > >       vcvtsi2ss       i(%rip), %xmmN, %xmmY
> > > > > >
> > > > > > to avoid partial XMM register stall.  This patch adds a pass to generate
> > > > > > a single
> > > > > >
> > > > > >       vxorps          %xmmN, %xmmN, %xmmN
> > > > > >
> > > > > > at entry of the nearest dominator for basic blocks with SF/DF conversions,
> > > > > > which is in the fake loop that contains the whole function, instead of
> > > > > > generating one
> > > > > >
> > > > > >       vxorp[ds]       %xmmN, %xmmN, %xmmN
> > > > > >
> > > > > > for each SF/DF conversion.
> > > > > >
> > > > > > NB: The LCM algorithm isn't appropriate here since it may place a vxorps
> > > > > > inside the loop.  Simple testcase show this:
> > > > > >
> > > > > > $ cat badcase.c
> > > > > >
> > > > > > extern float f;
> > > > > > extern double d;
> > > > > >
> > > > > > void
> > > > > > foo (int n, int k)
> > > > > > {
> > > > > >   for (int j = 0; j != n; j++)
> > > > > >     if (j < k)
> > > > > >       d = f;
> > > > > > }
> > > > > >
> > > > > > It generates
> > > > > >
> > > > > >     ...
> > > > > >     loop:
> > > > > >       if(j < k)
> > > > > >         vxorps  %xmm0, %xmm0, %xmm0
> > > > > >         vcvtss2sd  %xmm1, %xmm0, %xmm0
> > > > > >       ...
> > > > > >     loopend
> > > > > >     ...
> > > > > >
> > > > > > This is because LCM only works when there is a certain benifit.  But for
> > > > > > conditional branch, LCM wouldn't move
> > > > > >
> > > > > >    vxorps  %xmm0, %xmm0, %xmm0
> > > > > It works this way for a reason.  There are obviously paths through the
> > > > > loop where the conversion does not happen and thus the vxor is not
> > > > > needed or desirable on those paths.
> > > > >
> > > > > That's a fundamental property of the LCM algorithm -- it never inserts
> > > > > an evaluation on a path through the CFG where it will not be used.
> > > > >
> > > > > Your algorithm of inserting into the dominator block will introduce
> > > > > runtime executions of the vxor on paths where it is not needed.
> > > > >
> > > > > It's well known that relaxing that property of LCM can result in better
> > > > > code generation in some circumstances.  Block copying and loop
> > > > > restructuring are the gold standard for dealing with this kind of problem.
> > > > >
> > > > > In this case you could split the iteration space so that you have two
> > > > > loops.  one for 0..k and the other for k..n.  Note that GCC has support
> > > > > for this kind of loop restructuring.  This has the advantage that the j
> > > > > < k test does not happen each iteration of the loop and the vxor stuff
> > > > > via LCM would be optimal.
> > > > >
> > > > > There's many other cases where copying and restructuring results in
> > > > > better common subexpression elimination (which is what you're doing).
> > > > > Probably the best work I've seen in this space is Bodik's thesis.
> > > > > Click's work from '95 touches on some of this as well, but isn't as
> > > > > relevant to this specific instance.
> > > > >
> > > > > Anyway, whether or not the patch should move forward is really up to Jan
> > > > > (and Uros if he wants to be involved) I think.  I'm not fundamentally
> > > > > opposed to HJ's approach as I'm aware of the different tradeoffs.
> > > > >
> > > > > HJ's approach of pulling into the dominator block can result in
> > > > > unnecessary evaluations.  But it can also reduce the number of
> > > > > evaluations in other cases.  It really depends on the runtime behavior
> > > > > of the code.   One could argue that the vxor stuff we're talking about
> > > > > is most likely happening in loops, and probably not in deeply nested
> > > > > control structures within those loops.  Thus pulling them out more
> > > > > aggressively ala LICM may be better than LCM.
> > > >
> > > > True, there is a trade-off.   My approach inserts a vxorps at the last possible
> > > > position.  Yes, vxorps will always be executed even if it may not be required.
> > > > Since it is executed only once in all cases, it is a win overall.
> > >
> > > Hopefully a simple vpxor won't end up powering up the other AVX512
> > > unit if it lay dormant ...
> >
> > A 128-bit AVX vpxor won't touch AVX512.
> >
> > > And if we ever get to the state of having two separate ISAs in the same
> > > function then you'd need to make sure you can execute vpxor in the
> > > place you are inserting since it may now be executed when it wasn't
> > > before (and I assume you already check that you do not zero the
> > > reg if there's a value life in it if the conditional def you are instrumenting
> > > is not executed).
> >
> > A dedicated pseudo register is allocated and zeroed for INT->FP and
> > FP->FP conversions.  IRA/LRA take care of the rest.
> >
>
> PING:
>
> https://gcc.gnu.org/ml/gcc-patches/2019-01/msg00298.html
>

Here is the updated patch adjusted after PR target/89071 fix.

OK for trunk?

Thanks.

-- 
H.J.

[-- Attachment #2: 0001-8-9-Regression-i386-Add-pass_remove_partial_avx_depe.patch --]
[-- Type: application/x-patch, Size: 14934 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] [8/9 Regression] i386: Add pass_remove_partial_avx_dependency
  2019-02-01 19:04                         ` [PATCH] [8/9 Regression] " H.J. Lu
@ 2019-02-21 16:18                           ` Jan Hubicka
  2019-02-21 17:43                             ` H.J. Lu
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Hubicka @ 2019-02-21 16:18 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Richard Biener, Jeff Law, GCC Patches, Pandey, Sunil K,
	Uros Bizjak, Hongtao Liu, Jakub Jelinek

Hello,

2019-02-01  H.J. Lu  <hongjiu.lu@intel.com>
	    Hongtao Liu  <hongtao.liu@intel.com>
	    Sunil K Pandey  <sunil.k.pandey@intel.com>

	PR target/87007
	* config/i386/i386-passes.def: Add
	pass_remove_partial_avx_dependency.
	* config/i386/i386-protos.h
	(make_pass_remove_partial_avx_dependency): New.
	* config/i386/i386.c (make_pass_remove_partial_avx_dependency):
	New function.
	(pass_data_remove_partial_avx_dependency): New.
	(pass_remove_partial_avx_dependency): Likewise.
	(make_pass_remove_partial_avx_dependency): Likewise.
	* config/i386/i386.md (partial_xmm_update): New attribute.
	(*extendsfdf2): Add partial_xmm_update.
	(truncdfsf2): Likewise.
	(*float<SWI48:mode><MODEF:mode>2): Likewise.
	(SF/DF conversion splitters): Disabled for TARGET_AVX.

gcc/testsuite/

2019-02-01  H.J. Lu  <hongjiu.lu@intel.com>
	    Hongtao Liu  <hongtao.liu@intel.com>
	    Sunil K Pandey  <sunil.k.pandey@intel.com>

	PR target/87007
	* gcc.target/i386/pr87007-1.c: New test.
	* gcc.target/i386/pr87007-2.c: Likewise.

It seems to me that more systematic way would be to use mode switching
pass that uses the LCM framework and possibly tweak LCM to do the right
thing with respect to loops (easy solution would be to lift insertion
points to the dominators with smaller frequency even if there may be path
that does not execute the instruction needing the pxor).

Teaching LCM framework is however more intrusive than self contained
minipass and Since the patch solves a regression and is self contained I
guess we should go ahead with it for this release and look for more
systematic solutions later.

Patch is OK with the following change.

+static unsigned int
+remove_partial_avx_dependency (void)
+{
+  timevar_push (TV_MACH_DEP);
+
+  calculate_dominance_info (CDI_DOMINATORS);
+  df_set_flags (DF_DEFER_INSN_RESCAN);
+  df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
+  df_md_add_problem ();
+  df_analyze ();

Please delay the initialization after you hit first instruction that
needs processing.  The pass is run unconditionally and in many functions
it will do noting. Can you also gate the pass to run only of AVX is
enabled?

Patch is OK with this change. Please way a day for possible Uros' or RM
reactions.  Sorry for the delayed reaction.
Honza

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] [8/9 Regression] i386: Add pass_remove_partial_avx_dependency
  2019-02-21 16:18                           ` Jan Hubicka
@ 2019-02-21 17:43                             ` H.J. Lu
  0 siblings, 0 replies; 21+ messages in thread
From: H.J. Lu @ 2019-02-21 17:43 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Richard Biener, Jeff Law, GCC Patches, Pandey, Sunil K,
	Uros Bizjak, Hongtao Liu, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 4241 bytes --]

On Thu, Feb 21, 2019 at 5:58 AM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> Hello,
>
> 2019-02-01  H.J. Lu  <hongjiu.lu@intel.com>
>             Hongtao Liu  <hongtao.liu@intel.com>
>             Sunil K Pandey  <sunil.k.pandey@intel.com>
>
>         PR target/87007
>         * config/i386/i386-passes.def: Add
>         pass_remove_partial_avx_dependency.
>         * config/i386/i386-protos.h
>         (make_pass_remove_partial_avx_dependency): New.
>         * config/i386/i386.c (make_pass_remove_partial_avx_dependency):
>         New function.
>         (pass_data_remove_partial_avx_dependency): New.
>         (pass_remove_partial_avx_dependency): Likewise.
>         (make_pass_remove_partial_avx_dependency): Likewise.
>         * config/i386/i386.md (partial_xmm_update): New attribute.
>         (*extendsfdf2): Add partial_xmm_update.
>         (truncdfsf2): Likewise.
>         (*float<SWI48:mode><MODEF:mode>2): Likewise.
>         (SF/DF conversion splitters): Disabled for TARGET_AVX.
>
> gcc/testsuite/
>
> 2019-02-01  H.J. Lu  <hongjiu.lu@intel.com>
>             Hongtao Liu  <hongtao.liu@intel.com>
>             Sunil K Pandey  <sunil.k.pandey@intel.com>
>
>         PR target/87007
>         * gcc.target/i386/pr87007-1.c: New test.
>         * gcc.target/i386/pr87007-2.c: Likewise.
>
>
> It seems to me that more systematic way would be to use mode switching
> pass that uses the LCM framework and possibly tweak LCM to do the right
> thing with respect to loops (easy solution would be to lift insertion
> points to the dominators with smaller frequency even if there may be path
> that does not execute the instruction needing the pxor).
>
> Teaching LCM framework is however more intrusive than self contained
> minipass and Since the patch solves a regression and is self contained I
> guess we should go ahead with it for this release and look for more
> systematic solutions later.
>
> Patch is OK with the following change.
>
> +static unsigned int
> +remove_partial_avx_dependency (void)
> +{
> +  timevar_push (TV_MACH_DEP);
> +
> +  calculate_dominance_info (CDI_DOMINATORS);
> +  df_set_flags (DF_DEFER_INSN_RESCAN);
> +  df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
> +  df_md_add_problem ();
> +  df_analyze ();
>
> Please delay the initialization after you hit first instruction that

I changed it to:

  if (v4sf_const0)
    {
      calculate_dominance_info (CDI_DOMINATORS);
      df_set_flags (DF_DEFER_INSN_RESCAN);
      df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
      df_md_add_problem ();
      df_analyze ();

      /* (Re-)discover loops so that bb->loop_father can be used in the
         analysis below.  */
      loop_optimizer_init (AVOID_CFG_MODIFICATIONS);

      /* Generate a vxorps at entry of the nearest dominator for basic
         blocks with conversions, which is in the the fake loop that
         contains the whole function, so that there is only a single
         vxorps in the whole function.   */
      bb = nearest_common_dominator_for_set (CDI_DOMINATORS,
                                             convert_bbs);
      while (bb->loop_father->latch
             != EXIT_BLOCK_PTR_FOR_FN (cfun))
        bb = get_immediate_dominator (CDI_DOMINATORS,
                                      bb->loop_father->header);

      insn = BB_HEAD (bb);
      if (!NONDEBUG_INSN_P (insn))
        insn = next_nonnote_nondebug_insn (insn);
      set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
      set_insn = emit_insn_before (set, insn);
      df_insn_rescan (set_insn);
      df_process_deferred_rescans ();
      loop_optimizer_finalize ();
    }

> needs processing.  The pass is run unconditionally and in many functions
> it will do noting. Can you also gate the pass to run only of AVX is
> enabled?

There are

 virtual bool gate (function *)
    {
      return (TARGET_AVX
              && TARGET_SSE_PARTIAL_REG_DEPENDENCY
              && TARGET_SSE_MATH
              && optimize
              && optimize_function_for_speed_p (cfun));
    }

> Patch is OK with this change. Please way a day for possible Uros' or RM
> reactions.  Sorry for the delayed reaction.
> Honza

This is the updated patch I am going to check in tomorrow.

Thanks.

-- 
H.J.

[-- Attachment #2: 0001-8-9-Regression-i386-Add-pass_remove_partial_avx_depe.patch --]
[-- Type: text/x-patch, Size: 14989 bytes --]

From 0385aa137a1b553586cf786c331b0c167c5e2631 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Mon, 7 Jan 2019 05:44:59 -0800
Subject: [PATCH] [8/9 Regression] i386: Add pass_remove_partial_avx_dependency

With -mavx, for

$ cat foo.i
extern float f;
extern double d;
extern int i;

void
foo (void)
{
  d = f;
  f = i;
}

we need to generate

	vxorp[ds]	%xmmN, %xmmN, %xmmN
	...
	vcvtss2sd	f(%rip), %xmmN, %xmmX
	...
	vcvtsi2ss	i(%rip), %xmmN, %xmmY

to avoid partial XMM register stall.  This patch adds a pass to generate
a single

	vxorps		%xmmN, %xmmN, %xmmN

at entry of the nearest dominator for basic blocks with SF/DF conversions,
which is in the fake loop that contains the whole function, instead of
generating one

	vxorp[ds]	%xmmN, %xmmN, %xmmN

for each SF/DF conversion.

NB: The LCM algorithm isn't appropriate here since it may place a vxorps
inside the loop.  Simple testcase show this:

$ cat badcase.c

extern float f;
extern double d;

void
foo (int n, int k)
{
  for (int j = 0; j != n; j++)
    if (j < k)
      d = f;
}

It generates

    ...
    loop:
      if(j < k)
        vxorps    %xmm0, %xmm0, %xmm0
        vcvtss2sd f(%rip), %xmm0, %xmm0
      ...
    loopend
    ...

This is because LCM only works when there is a certain benifit.  But for
conditional branch, LCM wouldn't move

   vxorps  %xmm0, %xmm0, %xmm0

out of loop.  SPEC CPU 2017 on Intel Xeon with AVX512 shows:

1. The nearest dominator

|RATE			|Improvement|
|500.perlbench_r	| 0.55%	|
|538.imagick_r		| 8.43%	|
|544.nab_r		| 0.71%	|

2. LCM

|RATE			|Improvement|
|500.perlbench_r	| -0.76% |
|538.imagick_r		| 7.96%  |
|544.nab_r		| -0.13% |

Performance impacts of SPEC CPU 2017 rate on Intel Xeon with AVX512
using

-Ofast -flto -march=skylake-avx512 -funroll-loops

before

commit e739972ad6ad05e32a1dd5c29c0b950a4c4bd576
Author: uros <uros@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Thu Jan 31 20:06:42 2019 +0000

            PR target/89071
            * config/i386/i386.md (*extendsfdf2): Split out reg->reg
            alternative to avoid partial SSE register stall for TARGET_AVX.
            (truncdfsf2): Ditto.
            (sse4_1_round<mode>2): Ditto.

    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@268427 138bc75d-0d04-0410-961f-82ee72b054a4

are:

|INT RATE		|Improvement|
|500.perlbench_r	| 0.55%	|
|502.gcc_r		| 0.14%	|
|505.mcf_r		| 0.08%	|
|523.xalancbmk_r	| 0.18%	|
|525.x264_r		|-0.49%	|
|531.deepsjeng_r	|-0.04%	|
|541.leela_r		|-0.26%	|
|548.exchange2_r	|-0.3%	|
|557.xz_r		|BuildSame|

|FP RATE		|Improvement|
|503.bwaves_r	        |-0.29% |
|507.cactuBSSN_r	| 0.04%	|
|508.namd_r		|-0.74%	|
|510.parest_r		|-0.01%	|
|511.povray_r		| 2.23%	|
|519.lbm_r		| 0.1%	|
|521.wrf_r		| 0.49%	|
|526.blender_r		| 0.13%	|
|527.cam4_r		| 0.65%	|
|538.imagick_r		| 8.43%	|
|544.nab_r		| 0.71%	|
|549.fotonik3d_r	| 0.15%	|
|554.roms_r		| 0.08%	|

After commit e739972ad6ad05e32a1dd5c29c0b950a4c4bd576, on Skylake client,
impacts on 538.imagick_r with

-fno-unsafe-math-optimizations -march=native -Ofast -funroll-loops -flto

1. Size comparision:

before:

   text	   data	    bss	    dec	    hex	filename
2436377	   8352	   4528	2449257	 255f69 imagick_r

after:

   text	   data	    bss	    dec	    hex	filename
2425249	   8352	   4528	2438129	 2533f1 imagick_r

2. Number of vxorps:

before		after		difference
4948            4135            -19.66%

3. Performance improvement:

|RATE			|Improvement|
|538.imagick_r		| 5.5%  |

gcc/

2019-02-22  H.J. Lu  <hongjiu.lu@intel.com>
	    Hongtao Liu  <hongtao.liu@intel.com>
	    Sunil K Pandey  <sunil.k.pandey@intel.com>

	PR target/87007
	* config/i386/i386-passes.def: Add
	pass_remove_partial_avx_dependency.
	* config/i386/i386-protos.h
	(make_pass_remove_partial_avx_dependency): New.
	* config/i386/i386.c (make_pass_remove_partial_avx_dependency):
	New function.
	(pass_data_remove_partial_avx_dependency): New.
	(pass_remove_partial_avx_dependency): Likewise.
	(make_pass_remove_partial_avx_dependency): Likewise.
	* config/i386/i386.md (avx_partial_xmm_update): New attribute.
	(*extendsfdf2): Add avx_partial_xmm_update.
	(truncdfsf2): Likewise.
	(*float<SWI48:mode><MODEF:mode>2): Likewise.
	(SF/DF conversion splitters): Disabled for TARGET_AVX.

gcc/testsuite/

2019-02-22  H.J. Lu  <hongjiu.lu@intel.com>
	    Hongtao Liu  <hongtao.liu@intel.com>
	    Sunil K Pandey  <sunil.k.pandey@intel.com>

	PR target/87007
	* gcc.target/i386/pr87007-1.c: New test.
	* gcc.target/i386/pr87007-2.c: Likewise.
---
 gcc/config/i386/i386-passes.def           |   2 +
 gcc/config/i386/i386-protos.h             |   2 +
 gcc/config/i386/i386.c                    | 174 ++++++++++++++++++++++
 gcc/config/i386/i386.md                   |  16 +-
 gcc/testsuite/gcc.target/i386/pr87007-1.c |  15 ++
 gcc/testsuite/gcc.target/i386/pr87007-2.c |  18 +++
 6 files changed, 224 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr87007-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr87007-2.c

diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def
index 87cfd94b8f6..f4facdc65d4 100644
--- a/gcc/config/i386/i386-passes.def
+++ b/gcc/config/i386/i386-passes.def
@@ -31,3 +31,5 @@ along with GCC; see the file COPYING3.  If not see
   INSERT_PASS_BEFORE (pass_cse2, 1, pass_stv, true /* timode_p */);
 
   INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbranch);
+
+  INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency);
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 2d600173917..83645e89a81 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -369,3 +369,5 @@ class rtl_opt_pass;
 extern rtl_opt_pass *make_pass_insert_vzeroupper (gcc::context *);
 extern rtl_opt_pass *make_pass_stv (gcc::context *);
 extern rtl_opt_pass *make_pass_insert_endbranch (gcc::context *);
+extern rtl_opt_pass *make_pass_remove_partial_avx_dependency
+  (gcc::context *);
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 81dfed12837..e77653d66a4 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2793,6 +2793,180 @@ make_pass_insert_endbranch (gcc::context *ctxt)
   return new pass_insert_endbranch (ctxt);
 }
 
+/* At entry of the nearest common dominator for basic blocks with
+   conversions, generate a single
+	vxorps %xmmN, %xmmN, %xmmN
+   for all
+	vcvtss2sd  op, %xmmN, %xmmX
+	vcvtsd2ss  op, %xmmN, %xmmX
+	vcvtsi2ss  op, %xmmN, %xmmX
+	vcvtsi2sd  op, %xmmN, %xmmX
+
+   NB: We want to generate only a single vxorps to cover the whole
+   function.  The LCM algorithm isn't appropriate here since it may
+   place a vxorps inside the loop.  */
+
+static unsigned int
+remove_partial_avx_dependency (void)
+{
+  timevar_push (TV_MACH_DEP);
+
+  bitmap_obstack_initialize (NULL);
+  bitmap convert_bbs = BITMAP_ALLOC (NULL);
+
+  basic_block bb;
+  rtx_insn *insn, *set_insn;
+  rtx set;
+  rtx v4sf_const0 = NULL_RTX;
+
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      FOR_BB_INSNS (bb, insn)
+	{
+	  if (!NONDEBUG_INSN_P (insn))
+	    continue;
+
+	  set = single_set (insn);
+	  if (!set)
+	    continue;
+
+	  if (get_attr_avx_partial_xmm_update (insn)
+	      != AVX_PARTIAL_XMM_UPDATE_TRUE)
+	    continue;
+
+	  if (!v4sf_const0)
+	    v4sf_const0 = gen_reg_rtx (V4SFmode);
+
+	  /* Convert PARTIAL_XMM_UPDATE_TRUE insns, DF -> SF, SF -> DF,
+	     SI -> SF, SI -> DF, DI -> SF, DI -> DF, to vec_dup and
+	     vec_merge with subreg.  */
+	  rtx src = SET_SRC (set);
+	  rtx dest = SET_DEST (set);
+	  machine_mode dest_mode = GET_MODE (dest);
+
+	  rtx zero;
+	  machine_mode dest_vecmode;
+	  if (dest_mode == E_SFmode)
+	    {
+	      dest_vecmode = V4SFmode;
+	      zero = v4sf_const0;
+	    }
+	  else
+	    {
+	      dest_vecmode = V2DFmode;
+	      zero = gen_rtx_SUBREG (V2DFmode, v4sf_const0, 0);
+	    }
+
+	  /* Change source to vector mode.  */
+	  src = gen_rtx_VEC_DUPLICATE (dest_vecmode, src);
+	  src = gen_rtx_VEC_MERGE (dest_vecmode, src, zero,
+				   GEN_INT (HOST_WIDE_INT_1U));
+	  /* Change destination to vector mode.  */
+	  rtx vec = gen_reg_rtx (dest_vecmode);
+	  /* Generate an XMM vector SET.  */
+	  set = gen_rtx_SET (vec, src);
+	  set_insn = emit_insn_before (set, insn);
+	  df_insn_rescan (set_insn);
+
+	  src = gen_rtx_SUBREG (dest_mode, vec, 0);
+	  set = gen_rtx_SET (dest, src);
+
+	  /* Drop possible dead definitions.  */
+	  PATTERN (insn) = set;
+
+	  INSN_CODE (insn) = -1;
+	  recog_memoized (insn);
+	  df_insn_rescan (insn);
+	  bitmap_set_bit (convert_bbs, bb->index);
+	}
+    }
+
+  if (v4sf_const0)
+    {
+      calculate_dominance_info (CDI_DOMINATORS);
+      df_set_flags (DF_DEFER_INSN_RESCAN);
+      df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
+      df_md_add_problem ();
+      df_analyze ();
+
+      /* (Re-)discover loops so that bb->loop_father can be used in the
+	 analysis below.  */
+      loop_optimizer_init (AVOID_CFG_MODIFICATIONS);
+
+      /* Generate a vxorps at entry of the nearest dominator for basic
+	 blocks with conversions, which is in the the fake loop that
+	 contains the whole function, so that there is only a single
+	 vxorps in the whole function.   */
+      bb = nearest_common_dominator_for_set (CDI_DOMINATORS,
+					     convert_bbs);
+      while (bb->loop_father->latch
+	     != EXIT_BLOCK_PTR_FOR_FN (cfun))
+	bb = get_immediate_dominator (CDI_DOMINATORS,
+				      bb->loop_father->header);
+
+      insn = BB_HEAD (bb);
+      if (!NONDEBUG_INSN_P (insn))
+	insn = next_nonnote_nondebug_insn (insn);
+      set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
+      set_insn = emit_insn_before (set, insn);
+      df_insn_rescan (set_insn);
+      df_process_deferred_rescans ();
+      loop_optimizer_finalize ();
+    }
+
+  bitmap_obstack_release (NULL);
+  BITMAP_FREE (convert_bbs);
+
+  timevar_pop (TV_MACH_DEP);
+  return 0;
+}
+
+namespace {
+
+const pass_data pass_data_remove_partial_avx_dependency =
+{
+  RTL_PASS, /* type */
+  "rpad", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_MACH_DEP, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_remove_partial_avx_dependency : public rtl_opt_pass
+{
+public:
+  pass_remove_partial_avx_dependency (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_remove_partial_avx_dependency, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+    {
+      return (TARGET_AVX
+	      && TARGET_SSE_PARTIAL_REG_DEPENDENCY
+	      && TARGET_SSE_MATH
+	      && optimize
+	      && optimize_function_for_speed_p (cfun));
+    }
+
+  virtual unsigned int execute (function *)
+    {
+      return remove_partial_avx_dependency ();
+    }
+}; // class pass_rpad
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_remove_partial_avx_dependency (gcc::context *ctxt)
+{
+  return new pass_remove_partial_avx_dependency (ctxt);
+}
+
 /* Return true if a red-zone is in use.  We can't use red-zone when
    there are local indirect jumps, like "indirect_jump" or "tablejump",
    which jumps to another place in the function, since "call" in the
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index b1ae88c400f..90f16608787 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -778,6 +778,10 @@
 (define_attr "i387_cw" "trunc,floor,ceil,uninitialized,any"
   (const_string "any"))
 
+;; Define attribute to indicate AVX insns with partial XMM register update.
+(define_attr "avx_partial_xmm_update" "false,true"
+  (const_string "false"))
+
 ;; Define attribute to classify add/sub insns that consumes carry flag (CF)
 (define_attr "use_carry" "0,1" (const_string "0"))
 
@@ -4392,6 +4396,7 @@
     }
 }
   [(set_attr "type" "fmov,fmov,ssecvt,ssecvt")
+   (set_attr "avx_partial_xmm_update" "false,false,false,true")
    (set_attr "prefix" "orig,orig,maybe_vex,maybe_vex")
    (set_attr "mode" "SF,XF,DF,DF")
    (set (attr "enabled")
@@ -4481,7 +4486,8 @@
   [(set (match_operand:DF 0 "sse_reg_operand")
         (float_extend:DF
           (match_operand:SF 1 "nonimmediate_operand")))]
-  "TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
+  "!TARGET_AVX
+   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
    && optimize_function_for_speed_p (cfun)
    && (!REG_P (operands[1])
        || (!TARGET_AVX && REGNO (operands[0]) != REGNO (operands[1])))
@@ -4558,6 +4564,7 @@
     }
 }
   [(set_attr "type" "fmov,fmov,ssecvt,ssecvt")
+   (set_attr "avx_partial_xmm_update" "false,false,false,true")
    (set_attr "mode" "SF")
    (set (attr "enabled")
      (if_then_else
@@ -4641,7 +4648,8 @@
   [(set (match_operand:SF 0 "sse_reg_operand")
         (float_truncate:SF
 	  (match_operand:DF 1 "nonimmediate_operand")))]
-  "TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
+  "!TARGET_AVX
+   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
    && optimize_function_for_speed_p (cfun)
    && (!REG_P (operands[1])
        || (!TARGET_AVX && REGNO (operands[0]) != REGNO (operands[1])))
@@ -5017,6 +5025,7 @@
    %vcvtsi2<MODEF:ssemodesuffix><SWI48:rex64suffix>\t{%1, %d0|%d0, %1}
    %vcvtsi2<MODEF:ssemodesuffix><SWI48:rex64suffix>\t{%1, %d0|%d0, %1}"
   [(set_attr "type" "fmov,sseicvt,sseicvt")
+   (set_attr "avx_partial_xmm_update" "false,true,true")
    (set_attr "prefix" "orig,maybe_vex,maybe_vex")
    (set_attr "mode" "<MODEF:MODE>")
    (set (attr "prefix_rex")
@@ -5145,7 +5154,8 @@
 (define_split
   [(set (match_operand:MODEF 0 "sse_reg_operand")
 	(float:MODEF (match_operand:SWI48 1 "nonimmediate_operand")))]
-  "TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
+  "!TARGET_AVX
+   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
    && optimize_function_for_speed_p (cfun)
    && (!EXT_REX_SSE_REG_P (operands[0])
        || TARGET_AVX512VL)"
diff --git a/gcc/testsuite/gcc.target/i386/pr87007-1.c b/gcc/testsuite/gcc.target/i386/pr87007-1.c
new file mode 100644
index 00000000000..93cf1dcdfa5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr87007-1.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake" } */
+
+extern float f;
+extern double d;
+extern int i;
+
+void
+foo (void)
+{
+  d = f;
+  f = i;
+}
+
+/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr87007-2.c b/gcc/testsuite/gcc.target/i386/pr87007-2.c
new file mode 100644
index 00000000000..cca7ae7afbc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr87007-2.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake" } */
+
+extern float f;
+extern double d;
+extern int i;
+
+void
+foo (int n, int k)
+{
+  for (int i = 0; i != n; i++)
+    if(i < k)
+      d = f;
+    else
+      f = i;
+}
+
+/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */
-- 
2.20.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2019-02-21 17:34 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-19 10:31 V2 [PATCH] i386: Add pass_remove_partial_avx_dependency H.J. Lu
2018-11-04 15:19 ` PING: " H.J. Lu
2018-11-05 14:21 ` Jan Hubicka
2018-11-05 14:59   ` Jeff Law
2018-11-05 15:29     ` Jan Hubicka
2018-11-28 19:49       ` H.J. Lu
2018-11-28 20:17         ` Jeff Law
2018-11-28 20:21           ` Jan Hubicka
2018-12-30 17:33             ` H.J. Lu
2019-01-11 21:06               ` Jeff Law
2018-12-30 17:10           ` H.J. Lu
2019-01-07 13:56             ` V3 " H.J. Lu
2019-01-18 15:49               ` PING^1: " H.J. Lu
2019-01-21 18:54               ` Jeff Law
2019-01-21 21:27                 ` H.J. Lu
2019-01-22 12:08                   ` Richard Biener
2019-01-22 13:28                     ` H.J. Lu
2019-01-28 17:29                       ` PING ^1: " H.J. Lu
2019-02-01 19:04                         ` [PATCH] [8/9 Regression] " H.J. Lu
2019-02-21 16:18                           ` Jan Hubicka
2019-02-21 17:43                             ` H.J. Lu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).