public inbox for gcc-patches@gcc.gnu.org
* [PATCH 0/2] Align tight loops to solve cross cacheline issue
@ 2024-05-15  3:04 Haochen Jiang
  2024-05-15  3:04 ` [PATCH 1/2] Adjust generic loop alignment from 16:11:8 to 16 for Intel processors Haochen Jiang
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Haochen Jiang @ 2024-05-15  3:04 UTC (permalink / raw)
  To: gcc-patches; +Cc: hongtao.liu, ubizjak

Hi all,

Recently, we have encountered several random commit-to-commit performance
regressions in benchmarks. They are caused by tight loops crossing cache
line boundaries.

We are trying to solve the issue with two patches: one adjusts the loop
alignment for the generic tuning, and the other aligns tight and hot loops
more aggressively.
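
To make the problem concrete, here is a hypothetical illustration (the
numbers are made up for the example, not measured): with 64-byte cache
lines, a tight loop whose instructions add up to 24 bytes and whose first
instruction happens to start at offset 56 of a line spans two lines, while
the same loop aligned to a 32-byte boundary fits in one:

  loop size = 24 bytes
  start offset 56 (mod 64) -> occupies bytes 56..79, crosses into the
                              next cache line on every iteration
  start offset 32 (mod 64) -> occupies bytes 32..55, stays within one line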

For SPECINT, we get a 0.85% overall improvement in rates with
-O2 -march=x86-64-v3 -mtune=generic on Emerald Rapids.

BenchMarks      EMR Rates
500.perlbench_r -1.21%
502.gcc_r       0.78%
505.mcf_r       0.00%
520.omnetpp_r   0.41%
523.xalancbmk_r 1.33%
525.x264_r      2.83%
531.deepsjeng_r 1.11%
541.leela_r     0.00%
548.exchange2_r 2.36%
557.xz_r        0.98%
Geomean-int     0.85%

The side effect is a 1.40% increase in code size.

BenchMarks      EMR Codesize
500.perlbench_r 0.70%
502.gcc_r       0.67%
505.mcf_r       3.26%
520.omnetpp_r   0.31%
523.xalancbmk_r 1.15%
525.x264_r      1.11%
531.deepsjeng_r 1.40%
541.leela_r     1.31%
548.exchange2_r 3.06%
557.xz_r        1.04%
Geomean-int     1.40%

Bootstrapped and regtested on x86_64-pc-linux-gnu.

After the patches have been on trunk for a month, if nothing unexpected
happens, we plan to backport them to GCC 14.2.

Thx,
Haochen

Haochen Jiang (1):
  Adjust generic loop alignment from 16:11:8 to 16 for Intel processors

liuhongt (1):
  Align tight&hot loop without considering max skipping bytes.

 gcc/config/i386/i386.cc          | 148 ++++++++++++++++++++++++++++++-
 gcc/config/i386/i386.md          |  10 ++-
 gcc/config/i386/x86-tune-costs.h |   2 +-
 3 files changed, 154 insertions(+), 6 deletions(-)

-- 
2.31.1



* [PATCH 1/2] Adjust generic loop alignment from 16:11:8 to 16 for Intel processors
  2024-05-15  3:04 [PATCH 0/2] Align tight loops to solve cross cacheline issue Haochen Jiang
@ 2024-05-15  3:04 ` Haochen Jiang
  2024-05-15  3:04 ` [PATCH 2/2] Align tight&hot loop without considering max skipping bytes Haochen Jiang
  2024-05-15  3:30 ` [PATCH 0/2] Align tight loops to solve cross cacheline issue Jiang, Haochen
  2 siblings, 0 replies; 7+ messages in thread
From: Haochen Jiang @ 2024-05-15  3:04 UTC (permalink / raw)
  To: gcc-patches; +Cc: hongtao.liu, ubizjak

Previously, we used 16:11:8 as the loop alignment in the generic tuning
for Intel processors, which lets small loops cross a cache line boundary
and results in random commit-to-commit performance penalties in benchmarks
with small loops.

Always aligning loops to 16 bytes largely avoids the issue.
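
For reference, these strings use the same N:M:N2 form as -falign-loops
documented in the GCC manual: align to an N-byte boundary if that costs at
most M-1 bytes of padding, otherwise fall back to an N2-byte boundary.  On
x86 the two settings roughly translate to the following assembler
directives (a sketch of typical output, not literal output of this patch):

	# "16:11:8": 16-byte alignment, but only when at most 10 bytes of
	# padding are needed; otherwise settle for 8-byte alignment.
	.p2align 4,,10
	.p2align 3

	# "16": always align to 16 bytes, whatever the padding cost.
	.p2align 4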

gcc/ChangeLog:

	* config/i386/x86-tune-costs.h (generic_cost): Change from
	16:11:8 to 16.
---
 gcc/config/i386/x86-tune-costs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index 65d7d1f7e42..d3aaaa4b5cc 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -3758,7 +3758,7 @@ struct processor_costs generic_cost = {
   generic_memset,
   COSTS_N_INSNS (4),			/* cond_taken_branch_cost.  */
   COSTS_N_INSNS (2),			/* cond_not_taken_branch_cost.  */
-  "16:11:8",				/* Loop alignment.  */
+  "16",					/* Loop alignment.  */
   "16:11:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
-- 
2.31.1



* [PATCH 2/2] Align tight&hot loop without considering max skipping bytes.
  2024-05-15  3:04 [PATCH 0/2] Align tight loops to solve cross cacheline issue Haochen Jiang
  2024-05-15  3:04 ` [PATCH 1/2] Adjust generic loop alignment from 16:11:8 to 16 for Intel processors Haochen Jiang
@ 2024-05-15  3:04 ` Haochen Jiang
  2024-05-15  3:30 ` [PATCH 0/2] Align tight loops to solve cross cacheline issue Jiang, Haochen
  2 siblings, 0 replies; 7+ messages in thread
From: Haochen Jiang @ 2024-05-15  3:04 UTC (permalink / raw)
  To: gcc-patches; +Cc: hongtao.liu, ubizjak

From: liuhongt <hongtao.liu@intel.com>

When a hot loop is small enough to fit into one cache line, we should
align the loop to a 1 << ceil_log2 (loop_size) byte boundary without
considering the maximum skip bytes. This helps code prefetch.
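
As a hypothetical example of what this path emits: for a hot loop whose
non-debug insns add up to 24 bytes, ceil_log2 (24) is 5, so the loop label
gets the equivalent of

	.p2align 5	# no max-skip limit: padding is always emitted, so
			# the 24-byte body never straddles a 64-byte line

whereas the normal -falign-loops path gives up on the alignment whenever
more than the max-skip bytes of padding would be needed.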

gcc/ChangeLog:

	* config/i386/i386.cc (ix86_avoid_jump_mispredicts): Change
	gen_pad to gen_max_skip_align.
	(ix86_align_loops): New function.
	(ix86_reorg): Call ix86_align_loops.
	* config/i386/i386.md (pad): Rename to ..
	(max_skip_align): .. this, and accept 2 operands for align and
	skip.
---
 gcc/config/i386/i386.cc | 148 +++++++++++++++++++++++++++++++++++++++-
 gcc/config/i386/i386.md |  10 +--
 2 files changed, 153 insertions(+), 5 deletions(-)

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index e67e5f62533..c617091c8e1 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -23137,7 +23137,7 @@ ix86_avoid_jump_mispredicts (void)
 	  if (dump_file)
 	    fprintf (dump_file, "Padding insn %i by %i bytes!\n",
 		     INSN_UID (insn), padsize);
-          emit_insn_before (gen_pad (GEN_INT (padsize)), insn);
+	  emit_insn_before (gen_max_skip_align (GEN_INT (4), GEN_INT (padsize)), insn);
 	}
     }
 }
@@ -23410,6 +23410,150 @@ ix86_split_stlf_stall_load ()
     }
 }
 
+/* When a hot loop can fit into one cache line,
+   force-align the loop without considering the max skip.  */
+static void
+ix86_align_loops ()
+{
+  basic_block bb;
+
+  /* Don't do this when we don't know cache line size.  */
+  if (ix86_cost->prefetch_block == 0)
+    return;
+
+  loop_optimizer_init (AVOID_CFG_MODIFICATIONS);
+  profile_count count_threshold = cfun->cfg->count_max / param_align_threshold;
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      rtx_insn *label = BB_HEAD (bb);
+      bool has_fallthru = 0;
+      edge e;
+      edge_iterator ei;
+
+      if (!LABEL_P (label))
+	continue;
+
+      profile_count fallthru_count = profile_count::zero ();
+      profile_count branch_count = profile_count::zero ();
+
+      FOR_EACH_EDGE (e, ei, bb->preds)
+	{
+	  if (e->flags & EDGE_FALLTHRU)
+	    has_fallthru = 1, fallthru_count += e->count ();
+	  else
+	    branch_count += e->count ();
+	}
+
+      if (!fallthru_count.initialized_p () || !branch_count.initialized_p ())
+	continue;
+
+      if (bb->loop_father
+	  && bb->loop_father->latch != EXIT_BLOCK_PTR_FOR_FN (cfun)
+	  && (has_fallthru
+	      ? (!(single_succ_p (bb)
+		   && single_succ (bb) == EXIT_BLOCK_PTR_FOR_FN (cfun))
+		 && optimize_bb_for_speed_p (bb)
+		 && branch_count + fallthru_count > count_threshold
+		 && (branch_count > fallthru_count * param_align_loop_iterations))
+	      /* In case there's no fallthru for the loop,
+		 the inserted nops won't be executed.  */
+	      : (branch_count > count_threshold
+		 || (bb->count > bb->prev_bb->count * 10
+		     && (bb->prev_bb->count
+			 <= ENTRY_BLOCK_PTR_FOR_FN (cfun)->count / 2)))))
+	{
+	  rtx_insn* insn, *end_insn;
+	  HOST_WIDE_INT size = 0;
+	  bool padding_p = true;
+	  basic_block tbb = bb;
+	  unsigned cond_branch_num = 0;
+	  bool detect_tight_loop_p = false;
+
+	  for (unsigned int i = 0; i != bb->loop_father->num_nodes;
+	       i++, tbb = tbb->next_bb)
+	    {
+	      /* Only handle continuous cfg layout. */
+	      if (bb->loop_father != tbb->loop_father)
+		{
+		  padding_p = false;
+		  break;
+		}
+
+	      FOR_BB_INSNS (tbb, insn)
+		{
+		  if (!NONDEBUG_INSN_P (insn))
+		    continue;
+		  size += ix86_min_insn_size (insn);
+
+		  /* We don't know size of inline asm.
+		     Don't align loop for call.  */
+		  if (asm_noperands (PATTERN (insn)) >= 0
+		      || CALL_P (insn))
+		    {
+		      size = -1;
+		      break;
+		    }
+		}
+
+	      if (size == -1 || size > ix86_cost->prefetch_block)
+		{
+		  padding_p = false;
+		  break;
+		}
+
+	      FOR_EACH_EDGE (e, ei, tbb->succs)
+		{
+		  /* It could be part of the loop.  */
+		  if (e->dest == bb)
+		    {
+		      detect_tight_loop_p = true;
+		      break;
+		    }
+		}
+
+	      if (detect_tight_loop_p)
+		break;
+
+	      end_insn = BB_END (tbb);
+	      if (JUMP_P (end_insn))
+		{
+		  /* For decoded icache:
+		     1. Up to two branches are allowed per Way.
+		     2. A non-conditional branch is the last micro-op in a Way.
+		  */
+		  if (onlyjump_p (end_insn)
+		      && (any_uncondjump_p (end_insn)
+			  || single_succ_p (tbb)))
+		    {
+		      padding_p = false;
+		      break;
+		    }
+		  else if (++cond_branch_num >= 2)
+		    {
+		      padding_p = false;
+		      break;
+		    }
+		}
+
+	    }
+
+	  if (padding_p && detect_tight_loop_p)
+	    {
+	      emit_insn_before (gen_max_skip_align (GEN_INT (ceil_log2 (size)),
+						    GEN_INT (0)), label);
+	      /* End of function.  */
+	      if (!tbb || tbb == EXIT_BLOCK_PTR_FOR_FN (cfun))
+		break;
+	      /* Skip bb which already fits into one cacheline.  */
+	      bb = tbb;
+	    }
+	}
+    }
+
+  loop_optimizer_finalize ();
+  free_dominance_info (CDI_DOMINATORS);
+}
+
 /* Implement machine specific optimizations.  We implement padding of returns
    for K8 CPUs and pass to avoid 4 jumps in the single 16 byte window.  */
 static void
@@ -23433,6 +23577,8 @@ ix86_reorg (void)
 #ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
       if (TARGET_FOUR_JUMP_LIMIT)
 	ix86_avoid_jump_mispredicts ();
+
+      ix86_align_loops ();
 #endif
     }
 }
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 764bfe20ff2..686de0bf2ff 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -19150,16 +19150,18 @@
    (set_attr "length_immediate" "0")
    (set_attr "modrm" "0")])
 
-;; Pad to 16-byte boundary, max skip in op0.  Used to avoid
+;; Pad to 1 << op0 byte boundary, max skip in op1.  Used to avoid
 ;; branch prediction penalty for the third jump in a 16-byte
 ;; block on K8.
+;; It is also used to align tight loops which can fit into 1 cache line.
+;; This helps code prefetch and reduces DSB misses.
 
-(define_insn "pad"
-  [(unspec_volatile [(match_operand 0)] UNSPECV_ALIGN)]
+(define_insn "max_skip_align"
+  [(unspec_volatile [(match_operand 0) (match_operand 1)] UNSPECV_ALIGN)]
   ""
 {
 #ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
-  ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file, 4, (int)INTVAL (operands[0]));
+  ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file, (int)INTVAL (operands[0]), (int)INTVAL (operands[1]));
 #else
   /* It is tempting to use ASM_OUTPUT_ALIGN here, but we don't want to do that.
      The align insn is used to avoid 3 jump instructions in the row to improve
-- 
2.31.1



* RE: [PATCH 0/2] Align tight loops to solve cross cacheline issue
  2024-05-15  3:04 [PATCH 0/2] Align tight loops to solve cross cacheline issue Haochen Jiang
  2024-05-15  3:04 ` [PATCH 1/2] Adjust generic loop alignment from 16:11:8 to 16 for Intel processors Haochen Jiang
  2024-05-15  3:04 ` [PATCH 2/2] Align tight&hot loop without considering max skipping bytes Haochen Jiang
@ 2024-05-15  3:30 ` Jiang, Haochen
  2024-05-20  3:15   ` Hongtao Liu
  2 siblings, 1 reply; 7+ messages in thread
From: Jiang, Haochen @ 2024-05-15  3:30 UTC (permalink / raw)
  To: Jiang, Haochen, gcc-patches
  Cc: Liu, Hongtao, ubizjak, Jan Hubicka, Richard Biener

Also cc Honza and Richard since we touched generic tune.

Thx,
Haochen

> -----Original Message-----
> From: Haochen Jiang <haochen.jiang@intel.com>
> Sent: Wednesday, May 15, 2024 11:04 AM
> To: gcc-patches@gcc.gnu.org
> Cc: Liu, Hongtao <hongtao.liu@intel.com>; ubizjak@gmail.com
> Subject: [PATCH 0/2] Align tight loops to solve cross cacheline issue
> 
> Hi all,
> 
> Recently, we have encountered several random commit-to-commit performance
> regressions in benchmarks. They are caused by tight loops crossing cache
> line boundaries.
> 
> We are trying to solve the issue with two patches: one adjusts the loop
> alignment for the generic tuning, and the other aligns tight and hot loops
> more aggressively.
> 
> For SPECINT, we get a 0.85% overall improvement in rates with
> -O2 -march=x86-64-v3 -mtune=generic on Emerald Rapids.
> 
> BenchMarks      EMR Rates
> 500.perlbench_r -1.21%
> 502.gcc_r       0.78%
> 505.mcf_r       0.00%
> 520.omnetpp_r   0.41%
> 523.xalancbmk_r 1.33%
> 525.x264_r      2.83%
> 531.deepsjeng_r 1.11%
> 541.leela_r     0.00%
> 548.exchange2_r 2.36%
> 557.xz_r        0.98%
> Geomean-int     0.85%
> 
> The side effect is a 1.40% increase in code size.
> 
> BenchMarks      EMR Codesize
> 500.perlbench_r 0.70%
> 502.gcc_r       0.67%
> 505.mcf_r       3.26%
> 520.omnetpp_r   0.31%
> 523.xalancbmk_r 1.15%
> 525.x264_r      1.11%
> 531.deepsjeng_r 1.40%
> 541.leela_r     1.31%
> 548.exchange2_r 3.06%
> 557.xz_r        1.04%
> Geomean-int     1.40%
> 
> Bootstrapped and regtested on x86_64-pc-linux-gnu.
> 
> After the patches have been on trunk for a month, if nothing unexpected
> happens, we plan to backport them to GCC 14.2.
> 
> Thx,
> Haochen
> 
> Haochen Jiang (1):
>   Adjust generic loop alignment from 16:11:8 to 16 for Intel processors
> 
> liuhongt (1):
>   Align tight&hot loop without considering max skipping bytes.
> 
>  gcc/config/i386/i386.cc          | 148 ++++++++++++++++++++++++++++++-
>  gcc/config/i386/i386.md          |  10 ++-
>  gcc/config/i386/x86-tune-costs.h |   2 +-
>  3 files changed, 154 insertions(+), 6 deletions(-)
> 
> --
> 2.31.1



* Re: [PATCH 0/2] Align tight loops to solve cross cacheline issue
  2024-05-15  3:30 ` [PATCH 0/2] Align tight loops to solve cross cacheline issue Jiang, Haochen
@ 2024-05-20  3:15   ` Hongtao Liu
  2024-05-27  1:33     ` Hongtao Liu
  0 siblings, 1 reply; 7+ messages in thread
From: Hongtao Liu @ 2024-05-20  3:15 UTC (permalink / raw)
  To: Jiang, Haochen
  Cc: gcc-patches, Liu, Hongtao, ubizjak, Jan Hubicka, Richard Biener

On Wed, May 15, 2024 at 11:30 AM Jiang, Haochen <haochen.jiang@intel.com> wrote:
>
> Also cc Honza and Richard since we touched generic tune.
>
> Thx,
> Haochen
>
> > -----Original Message-----
> > From: Haochen Jiang <haochen.jiang@intel.com>
> > Sent: Wednesday, May 15, 2024 11:04 AM
> > To: gcc-patches@gcc.gnu.org
> > Cc: Liu, Hongtao <hongtao.liu@intel.com>; ubizjak@gmail.com
> > Subject: [PATCH 0/2] Align tight loops to solve cross cacheline issue
> >
> > Hi all,
> >
> > Recently, we have encountered several random commit-to-commit performance
> > regressions in benchmarks. They are caused by tight loops crossing cache
> > line boundaries.
> >
> > We are trying to solve the issue with two patches: one adjusts the loop
> > alignment for the generic tuning, and the other aligns tight and hot
> > loops more aggressively.
> >
> > For SPECINT, we get a 0.85% overall improvement in rates with
> > -O2 -march=x86-64-v3 -mtune=generic on Emerald Rapids.
> >
> > BenchMarks      EMR Rates
> > 500.perlbench_r -1.21%
> > 502.gcc_r       0.78%
> > 505.mcf_r       0.00%
> > 520.omnetpp_r   0.41%
> > 523.xalancbmk_r 1.33%
> > 525.x264_r      2.83%
> > 531.deepsjeng_r 1.11%
> > 541.leela_r     0.00%
> > 548.exchange2_r 2.36%
> > 557.xz_r        0.98%
> > Geomean-int     0.85%
> >
> > The side effect is a 1.40% increase in code size.
> >
> > BenchMarks      EMR Codesize
> > 500.perlbench_r 0.70%
> > 502.gcc_r       0.67%
> > 505.mcf_r       3.26%
> > 520.omnetpp_r   0.31%
> > 523.xalancbmk_r 1.15%
> > 525.x264_r      1.11%
> > 531.deepsjeng_r 1.40%
> > 541.leela_r     1.31%
> > 548.exchange2_r 3.06%
> > 557.xz_r        1.04%
> > Geomean-int     1.40%
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu.
> >
> > After the patches have been on trunk for a month, if nothing unexpected
> > happens, we plan to backport them to GCC 14.2.
> >
> > Thx,
> > Haochen
> >
> > Haochen Jiang (1):
> >   Adjust generic loop alignment from 16:11:8 to 16 for Intel processors
For this one, the current znver{1,2,3,4,5}_cost entries already set the
loop alignment to 16, so I think it should be fine to set it for
generic_cost as well.
> >
> > liuhongt (1):
> >   Align tight&hot loop without considering max skipping bytes.
For this one, although we have seen similar growth on AMD processors,
it would still be nice to have someone from AMD look at this to see
whether it's what they need.
> >
> >  gcc/config/i386/i386.cc          | 148 ++++++++++++++++++++++++++++++-
> >  gcc/config/i386/i386.md          |  10 ++-
> >  gcc/config/i386/x86-tune-costs.h |   2 +-
> >  3 files changed, 154 insertions(+), 6 deletions(-)
> >
> > --
> > 2.31.1
>


-- 
BR,
Hongtao


* Re: [PATCH 0/2] Align tight loops to solve cross cacheline issue
  2024-05-20  3:15   ` Hongtao Liu
@ 2024-05-27  1:33     ` Hongtao Liu
  2024-05-29  3:30       ` Jiang, Haochen
  0 siblings, 1 reply; 7+ messages in thread
From: Hongtao Liu @ 2024-05-27  1:33 UTC (permalink / raw)
  To: Jiang, Haochen
  Cc: gcc-patches, Liu, Hongtao, ubizjak, Jan Hubicka, Richard Biener

On Mon, May 20, 2024 at 11:15 AM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Wed, May 15, 2024 at 11:30 AM Jiang, Haochen <haochen.jiang@intel.com> wrote:
> >
> > Also cc Honza and Richard since we touched generic tune.
> >
> > Thx,
> > Haochen
> >
> > > -----Original Message-----
> > > From: Haochen Jiang <haochen.jiang@intel.com>
> > > Sent: Wednesday, May 15, 2024 11:04 AM
> > > To: gcc-patches@gcc.gnu.org
> > > Cc: Liu, Hongtao <hongtao.liu@intel.com>; ubizjak@gmail.com
> > > Subject: [PATCH 0/2] Align tight loops to solve cross cacheline issue
> > >
> > > Hi all,
> > >
> > > Recently, we have encountered several random commit-to-commit
> > > performance regressions in benchmarks. They are caused by tight loops
> > > crossing cache line boundaries.
> > >
> > > We are trying to solve the issue with two patches: one adjusts the
> > > loop alignment for the generic tuning, and the other aligns tight and
> > > hot loops more aggressively.
> > >
> > > For SPECINT, we get a 0.85% overall improvement in rates with
> > > -O2 -march=x86-64-v3 -mtune=generic on Emerald Rapids.
> > >
> > > BenchMarks      EMR Rates
> > > 500.perlbench_r -1.21%
> > > 502.gcc_r       0.78%
> > > 505.mcf_r       0.00%
> > > 520.omnetpp_r   0.41%
> > > 523.xalancbmk_r 1.33%
> > > 525.x264_r      2.83%
> > > 531.deepsjeng_r 1.11%
> > > 541.leela_r     0.00%
> > > 548.exchange2_r 2.36%
> > > 557.xz_r        0.98%
> > > Geomean-int     0.85%
> > >
> > > The side effect is a 1.40% increase in code size.
> > >
> > > BenchMarks      EMR Codesize
> > > 500.perlbench_r 0.70%
> > > 502.gcc_r       0.67%
> > > 505.mcf_r       3.26%
> > > 520.omnetpp_r   0.31%
> > > 523.xalancbmk_r 1.15%
> > > 525.x264_r      1.11%
> > > 531.deepsjeng_r 1.40%
> > > 541.leela_r     1.31%
> > > 548.exchange2_r 3.06%
> > > 557.xz_r        1.04%
> > > Geomean-int     1.40%
> > >
> > > Bootstrapped and regtested on x86_64-pc-linux-gnu.
Ok for this if there's no objection in 48 hours.
> > >
> > > After the patches have been on trunk for a month, if nothing
> > > unexpected happens, we plan to backport them to GCC 14.2.
> > >
> > > Thx,
> > > Haochen
> > >
> > > Haochen Jiang (1):
> > >   Adjust generic loop alignment from 16:11:8 to 16 for Intel processors
> For this one, the current znver{1,2,3,4,5}_cost entries already set the
> loop alignment to 16, so I think it should be fine to set it for
> generic_cost as well.
> > >
> > > liuhongt (1):
> > >   Align tight&hot loop without considering max skipping bytes.
> For this one, although we have seen similar growth on AMD processors,
> it would still be nice to have someone from AMD look at this to see
> whether it's what they need.
> > >
> > >  gcc/config/i386/i386.cc          | 148 ++++++++++++++++++++++++++++++-
> > >  gcc/config/i386/i386.md          |  10 ++-
> > >  gcc/config/i386/x86-tune-costs.h |   2 +-
> > >  3 files changed, 154 insertions(+), 6 deletions(-)
> > >
> > > --
> > > 2.31.1
> >
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao


* RE: [PATCH 0/2] Align tight loops to solve cross cacheline issue
  2024-05-27  1:33     ` Hongtao Liu
@ 2024-05-29  3:30       ` Jiang, Haochen
  0 siblings, 0 replies; 7+ messages in thread
From: Jiang, Haochen @ 2024-05-29  3:30 UTC (permalink / raw)
  To: Hongtao Liu
  Cc: gcc-patches, Liu, Hongtao, ubizjak, Jan Hubicka, Richard Biener

> > > > Bootstrapped and regtested on x86_64-pc-linux-gnu.
> Ok for this if there's no objection in 48 hours.
> > > >
> > > > After the patches have been on trunk for a month, if nothing
> > > > unexpected happens, we plan to backport them to GCC 14.2.

I have accidentally backported it to GCC 14.2 already, since I did not
realize that my local branch was on GCC 14, not trunk.

If something unexpected shows up on trunk, I will revert the patches on
the GCC 14 branch.

Thx,
Haochen

> > > >
> > > > Thx,
> > > > Haochen
> > > >
> > > > Haochen Jiang (1):
> > > >   Adjust generic loop alignment from 16:11:8 to 16 for Intel
> > > > processors
> > For this one, the current znver{1,2,3,4,5}_cost entries already set
> > the loop alignment to 16, so I think it should be fine to set it for
> > generic_cost as well.
> > > >
> > > > liuhongt (1):
> > > >   Align tight&hot loop without considering max skipping bytes.
> > For this one, although we have seen similar growth on AMD processors,
> > it would still be nice to have someone from AMD look at this to see
> > whether it's what they need.
> > > >
> > > >  gcc/config/i386/i386.cc          | 148 ++++++++++++++++++++++++++++++-
> > > >  gcc/config/i386/i386.md          |  10 ++-
> > > >  gcc/config/i386/x86-tune-costs.h |   2 +-
> > > >  3 files changed, 154 insertions(+), 6 deletions(-)
> > > >
> > > > --
> > > > 2.31.1


end of thread, other threads:[~2024-05-29  3:30 UTC | newest]

Thread overview: 7+ messages
-- links below jump to the message on this page --
2024-05-15  3:04 [PATCH 0/2] Align tight loops to solve cross cacheline issue Haochen Jiang
2024-05-15  3:04 ` [PATCH 1/2] Adjust generic loop alignment from 16:11:8 to 16 for Intel processors Haochen Jiang
2024-05-15  3:04 ` [PATCH 2/2] Align tight&hot loop without considering max skipping bytes Haochen Jiang
2024-05-15  3:30 ` [PATCH 0/2] Align tight loops to solve cross cacheline issue Jiang, Haochen
2024-05-20  3:15   ` Hongtao Liu
2024-05-27  1:33     ` Hongtao Liu
2024-05-29  3:30       ` Jiang, Haochen
