[PATCH 7/8] Model cache auto-prefetcher in scheduler

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH 7/8] Model cache auto-prefetcher in scheduler
@ 2014-10-21  4:09 Maxim Kuvyrkov
  2014-10-21  5:44 ` Andrew Pinski
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Maxim Kuvyrkov @ 2014-10-21  4:09 UTC (permalink / raw)
  To: GCC Patches; +Cc: Vladimir Makarov, Ramana Radhakrishnan

[-- Attachment #1: Type: text/plain, Size: 2078 bytes --]

Hi,

This patch adds auto-prefetcher modeling to GCC scheduler.  The auto-prefetcher model is currently enabled only for ARM Cortex-A15, since this is the only CPU that I know of to have the hardware auto-prefetcher unit.

The documentation on the auto-prefetcher is very sparse, and all I have are my empirical studies and a short note in Cortex-A15 manual (search for "L2 cache auto-prefether").  This patch, therefore, implements a very abstract model that makes scheduler prefer "mem_op (base+8); mem_op (base+12)" over "mem_op (base+12); mem_op (base+8)".  In other words, memory operations are tried to be issued in order of increasing memory offsets.

The auto-prefetcher model implementation is based on max_issue mutlipass lookahead scheduling, and its "guard" hook.  The guard hook examines contents of the ready list and the queue, and, if it finds instructions with lower memory offsets, marks instructions with higher memory offset as unavailable for immediate scheduling.

This patch has been in works since beginning of the year, and many of my previous scheduler cleanup patches were to prepare the infrastructure for this feature. 

Ramana, this change requires benchmarking, which I can't easily do at the moment.  I would appreciate any benchmarking results that you can share.  In particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH needs to be tuned/confirmed for Cortex-A15.

At the moment the parameter is set to "2", which means that the autopref model will look through ready list and 1-stall queue in search of relevant instructions.  Values of -1 (disable autopref), 0 (use autopref only in rank_for_schedule), 1 (look through ready list), 2 (look through ready list and 1-stall queue), and 3 (look through ready list and 2-stall queue) should be considered and benchmarked.

Bootstrapped on x86_64-linux-gnu and regtested on arm-linux-gnueaihf and aarch64-linux-gnu.  OK to apply, provided no performance or correctness regressions?

[ChangeLog is part of the git patch]

Thank you,

--
Maxim Kuvyrkov
www.linaro.org



[-- Attachment #2: 0007-Model-cache-auto-prefetcher-in-scheduler.patch --]
[-- Type: application/octet-stream, Size: 17453 bytes --]

From 629c2cc593b49b8596b00e3e3d62444493aa3514 Mon Sep 17 00:00:00 2001
From: Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org>
Date: Mon, 20 Oct 2014 23:13:23 +0100
Subject: [PATCH 7/8] Model cache auto-prefetcher in scheduler

	* config/arm/arm.c (sched-int.h): Include header.
	(arm_first_cycle_multipass_dfa_lookahead_guard,)
	(TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD): Define hook.
	Enable auto-prefetcher model for Cortex-A15.
	(arm_option_override): Set autoprefetcher parameter.
	* config/arm/t-arm (arm.o): Update.
	* haifa-sched.c (update_insn_after_change): Update.
	(rank_for_schedule): Use auto-prefetcher model, if requested.
	(autopref_multipass_init): New static function.
	(autopref_rank_for_schedule): New rank_for_schedule heuristic.
	(autopref_multipass_dfa_lookahead_guard_started_dump_p): New static
	variable for debug dumps.
	(autopref_multipass_dfa_lookahead_guard_1): New static helper function.
	(autopref_multipass_dfa_lookahead_guard): New global function that
	implements TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD hook.
	(init_h_i_d): Update.
	* params.def (PARAM_SCHED_AUTOPREF_QUEUE_DEPTH): New tuning knob.
	* sched-int.h (autopref_multipass_data_): Structure for	auto-prefetcher
	data.
	(autopref_multipass_data_def, autopref_multipass_data_t): New typedefs.
	(struct _haifa_insn_data:autopref_multipass_data): New field.
	(INSN_AUTOPREF_MULTIPASS_DATA): New access macro.
	(autopref_multipass_dfa_lookahead_guard): Declare.
---
 gcc/config/arm/arm.c |   26 ++++++
 gcc/config/arm/t-arm |    3 +-
 gcc/haifa-sched.c    |  247 ++++++++++++++++++++++++++++++++++++++++++++++++++
 gcc/params.def       |    5 +
 gcc/sched-int.h      |   26 ++++++
 5 files changed, 306 insertions(+), 1 deletion(-)

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 0f15c99..8e90fe7 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -68,6 +68,7 @@
 #include "gimple-expr.h"
 #include "builtins.h"
 #include "tm-constrs.h"
+#include "sched-int.h"
 
 /* Forward definitions of types.  */
 typedef struct minipool_node    Mnode;
@@ -247,6 +248,7 @@ static unsigned HOST_WIDE_INT arm_shift_truncation_mask (enum machine_mode);
 static bool arm_cannot_copy_insn_p (rtx_insn *);
 static int arm_issue_rate (void);
 static int arm_first_cycle_multipass_dfa_lookahead (void);
+static int arm_first_cycle_multipass_dfa_lookahead_guard (rtx, int);
 static void arm_output_dwarf_dtprel (FILE *, int, rtx) ATTRIBUTE_UNUSED;
 static bool arm_output_addr_const_extra (FILE *, rtx);
 static bool arm_allocate_stack_slots_for_args (void);
@@ -596,6 +598,10 @@ static const struct attribute_spec arm_attribute_table[] =
 #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD \
   arm_first_cycle_multipass_dfa_lookahead
 
+#undef TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD
+#define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD \
+  arm_first_cycle_multipass_dfa_lookahead_guard
+
 #undef TARGET_MANGLE_TYPE
 #define TARGET_MANGLE_TYPE arm_mangle_type
 
@@ -3108,6 +3114,12 @@ arm_option_override (void)
                          global_options.x_param_values,
                          global_options_set.x_param_values);
 
+  /* Look through ready list and 1-cycle-delay queue for instructions
+     relevant for L2 auto-prefetcher.  */
+  maybe_set_param_value (PARAM_SCHED_AUTOPREF_QUEUE_DEPTH, 2,
+                         global_options.x_param_values,
+                         global_options_set.x_param_values);
+
   /* Disable shrink-wrap when optimizing function for size, since it tends to
      generate additional returns.  */
   if (optimize_function_for_size_p (cfun) && TARGET_THUMB2)
@@ -29903,6 +29915,20 @@ arm_first_cycle_multipass_dfa_lookahead (void)
   return issue_rate > 1 ? issue_rate : 0;
 }
 
+/* Enable modeling of Cortex-A15 L2 auto-prefetcher.  */
+static int
+arm_first_cycle_multipass_dfa_lookahead_guard (rtx insn, int ready_index)
+{
+  switch (arm_tune)
+    {
+    case cortexa15:
+      return autopref_multipass_dfa_lookahead_guard (insn, ready_index);
+
+    default:
+      return 0;
+    }
+}
+
 /* A table and a function to perform ARM-specific name mangling for
    NEON vector types in order to conform to the AAPCS (see "Procedure
    Call Standard for the ARM Architecture", Appendix A).  To qualify
diff --git a/gcc/config/arm/t-arm b/gcc/config/arm/t-arm
index 99bd696..2ad7bf3 100644
--- a/gcc/config/arm/t-arm
+++ b/gcc/config/arm/t-arm
@@ -90,7 +90,8 @@ arm.o: $(srcdir)/config/arm/arm.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) \
   $(EXPR_H) $(OPTABS_H) $(RECOG_H) $(CGRAPH_H) \
   $(GGC_H) except.h $(C_PRAGMA_H) $(TM_P_H) \
   $(TARGET_H) $(TARGET_DEF_H) debug.h langhooks.h $(DF_H) \
-  intl.h libfuncs.h $(PARAMS_H) $(OPTS_H) $(srcdir)/config/arm/arm-cores.def \
+  intl.h libfuncs.h $(PARAMS_H) $(OPTS_H) sched-int.h \
+  $(srcdir)/config/arm/arm-cores.def \
   $(srcdir)/config/arm/arm-arches.def $(srcdir)/config/arm/arm-fpus.def \
   $(srcdir)/config/arm/arm_neon_builtins.def
 
diff --git a/gcc/haifa-sched.c b/gcc/haifa-sched.c
index 26d9e29..801b4a8 100644
--- a/gcc/haifa-sched.c
+++ b/gcc/haifa-sched.c
@@ -835,6 +835,7 @@ add_delay_dependencies (rtx_insn *insn)
 /* Forward declarations.  */
 
 static int priority (rtx_insn *);
+static int autopref_rank_for_schedule (const rtx_insn *, const rtx_insn *);
 static int rank_for_schedule (const void *, const void *);
 static void swap_sort (rtx_insn **, int);
 static void queue_insn (rtx_insn *, int, const char *);
@@ -1178,6 +1179,10 @@ update_insn_after_change (rtx_insn *insn)
   INSN_COST (insn) = -1;
   /* Invalidate INSN_TICK, so it'll be recalculated.  */
   INSN_TICK (insn) = INVALID_TICK;
+
+  /* Invalidate autoprefetch data entry.  */
+  INSN_AUTOPREF_MULTIPASS_DATA (insn)[0].dont_delay = -1;
+  INSN_AUTOPREF_MULTIPASS_DATA (insn)[1].dont_delay = -1;
 }
 
 
@@ -2656,6 +2661,13 @@ rank_for_schedule (const void *x, const void *y)
   if (flag_sched_critical_path_heuristic && priority_val)
     return rfs_result (RFS_PRIORITY, priority_val, tmp, tmp2);
 
+  if (PARAM_VALUE (PARAM_SCHED_AUTOPREF_QUEUE_DEPTH) >= 0)
+    {
+      int autopref = autopref_rank_for_schedule (tmp, tmp2);
+      if (autopref != 0)
+	return autopref;
+    }
+
   /* Prefer speculative insn with greater dependencies weakness.  */
   if (flag_sched_spec_insn_heuristic && spec_info)
     {
@@ -5432,6 +5444,239 @@ insn_finishes_cycle_p (rtx_insn *insn)
   return false;
 }
 
+/* Functions to model cache auto-prefetcher.
+
+   Some of the CPUs have cache auto-prefetcher, which /seems/ to initiate
+   memory prefetches if it sees instructions with consequitive memory accesses
+   in the instruction stream.  Details of such hardware units are not published,
+   so we can only guess what exactly is going on there.
+   In the scheduler, we model abstract auto-prefetcher.  If there are memory
+   insns in the ready list (or the queue) that have same memory base, but
+   different offsets, then we delay the insns with larger offsets until insns
+   with smaller offsets get scheduled.  If PARAM_SCHED_AUTOPREF_QUEUE_DEPTH
+   is "1", then we look at the ready list; if it is N>1, then we also look
+   through N-1 queue entries.
+   If the param is N>=0, then rank_for_schedule will consider auto-prefetching
+   among its heuristics.
+   Param value of "-1" disables modelling of the auto-prefetcher.  */
+
+/* Initialize autoprefetcher model data for INSN.  */
+static void
+autopref_multipass_init (const rtx_insn *insn, int write)
+{
+  autopref_multipass_data_t data = &INSN_AUTOPREF_MULTIPASS_DATA (insn)[write];
+
+  gcc_assert (data->dont_delay == -1);
+  data->base = NULL_RTX;
+  data->offset = 0;
+  /* Set insn entry initialized, but not relevant for auto-prefetcher.  */
+  data->dont_delay = -2;
+
+  rtx set = single_set (insn);
+  if (set == NULL_RTX)
+    return;
+
+  rtx mem = write ? SET_DEST (set) : SET_SRC (set);
+  if (!MEM_P (mem))
+    return;
+
+  struct address_info info;
+  decompose_mem_address (&info, mem);
+
+  if (info.base == NULL || !REG_P (*info.base)
+      || (info.disp != NULL && !CONST_INT_P (*info.disp)))
+    return;
+
+  /* This insn is relevant for auto-prefetcher.  */
+  data->base = *info.base;
+  data->offset = info.disp ? INTVAL (*info.disp) : 0;
+  data->dont_delay = 0;
+}
+
+/* Helper function for rank_for_schedule sorting.  */
+static int
+autopref_rank_for_schedule (const rtx_insn *insn1, const rtx_insn *insn2)
+{
+  for (int write = 0; write < 2; ++write)
+    {
+      autopref_multipass_data_t data1
+	= &INSN_AUTOPREF_MULTIPASS_DATA (insn1)[write];
+      autopref_multipass_data_t data2
+	= &INSN_AUTOPREF_MULTIPASS_DATA (insn2)[write];
+
+      if (data1->dont_delay == -1)
+	autopref_multipass_init (insn1, write);
+      if (data1->dont_delay == -2)
+	continue;
+
+      if (data2->dont_delay == -1)
+	autopref_multipass_init (insn2, write);
+      if (data2->dont_delay == -2)
+	continue;
+
+      if (!rtx_equal_p (data1->base, data2->base))
+	continue;
+
+      return data1->offset - data2->offset;
+    }
+
+  return 0;
+}
+
+/* True if header of debug dump was printed.  */
+static bool autopref_multipass_dfa_lookahead_guard_started_dump_p;
+
+/* Helper for autopref_multipass_dfa_lookahead_guard.
+   Return "1" if INSN1 should be delayed in favor of INSN2.  */
+static int
+autopref_multipass_dfa_lookahead_guard_1 (const rtx_insn *insn1,
+					  const rtx_insn *insn2, int write)
+{
+  autopref_multipass_data_t data1
+    = &INSN_AUTOPREF_MULTIPASS_DATA (insn1)[write];
+  autopref_multipass_data_t data2
+    = &INSN_AUTOPREF_MULTIPASS_DATA (insn2)[write];
+
+  if (data2->dont_delay == -1)
+    autopref_multipass_init (insn2, write);
+  if (data2->dont_delay == -2)
+    return 0;
+
+  if (rtx_equal_p (data1->base, data2->base)
+      && data1->offset > data2->offset)
+    {
+      if (sched_verbose >= 2)
+	{
+          if (!autopref_multipass_dfa_lookahead_guard_started_dump_p)
+	    {
+	      fprintf (sched_dump,
+		       ";;\t\tnot trying in max_issue due to autoprefetch "
+		       "model: ");
+	      autopref_multipass_dfa_lookahead_guard_started_dump_p = true;
+	    }
+
+	  fprintf (sched_dump, " %d(%d)", INSN_UID (insn1), INSN_UID (insn2));
+	}
+
+      return 1;
+    }
+
+  return 0;
+}
+
+/* General note:
+
+   We could have also hooked autoprefetcher model into
+   first_cycle_multipass_backtrack / first_cycle_multipass_issue hooks
+   to enable intelligent selection of "[r1+0]=r2; [r1+4]=r3" on the same cycle
+   (e.g., once "[r1+0]=r2" is issued in max_issue(), "[r1+4]=r3" gets
+   unblocked).  We don't bother about this yet because target of interest
+   (ARM Cortex-A15) can issue only 1 memory operation per cycle.  */
+
+/* Implementation of first_cycle_multipass_dfa_lookahead_guard hook.
+   Return "1" if INSN1 should not be considered in max_issue due to
+   auto-prefetcher considerations.  */
+int
+autopref_multipass_dfa_lookahead_guard (const rtx_insn *insn1, int ready_index)
+{
+  int r = 0;
+
+  if (PARAM_VALUE (PARAM_SCHED_AUTOPREF_QUEUE_DEPTH) <= 0)
+    return 0;
+
+  if (sched_verbose >= 2 && ready_index == 0)
+    autopref_multipass_dfa_lookahead_guard_started_dump_p = false;
+
+  for (int write = 0; write < 2; ++write)
+    {
+      autopref_multipass_data_t data1
+	= &INSN_AUTOPREF_MULTIPASS_DATA (insn1)[write];
+
+      if (data1->dont_delay == -1)
+	autopref_multipass_init (insn1, write);
+      if (data1->dont_delay == -2)
+	continue;
+
+      if (ready_index == 0 && data1->dont_delay == 1)
+	/* We allow only a single delay on priviledged instructions.
+	   Doing otherwise would cause infinite loop.  */
+	{
+	  if (sched_verbose >= 2)
+	    {
+	      if (!autopref_multipass_dfa_lookahead_guard_started_dump_p)
+		{
+		  fprintf (sched_dump,
+			   ";;\t\tnot trying in max_issue due to autoprefetch "
+			   "model: ");
+		  autopref_multipass_dfa_lookahead_guard_started_dump_p = true;
+		}
+
+	      fprintf (sched_dump, " *%d*", INSN_UID (insn1));
+	    }
+	  continue;
+	}
+
+      for (int i2 = 0; i2 < ready.n_ready; ++i2)
+	{
+	  rtx_insn *insn2 = get_ready_element (i2);
+	  if (insn1 == insn2)
+	    continue;
+	  r = autopref_multipass_dfa_lookahead_guard_1 (insn1, insn2, write);
+	  if (r)
+	    {
+	      if (ready_index == 0)
+		{
+		  r = -1;
+		  data1->dont_delay = 1;
+		}
+	      goto finish;
+	    }
+	}
+
+      if (PARAM_VALUE (PARAM_SCHED_AUTOPREF_QUEUE_DEPTH) == 1)
+	continue;
+
+      /* Everything from the current queue slot should have been moved to
+	 the ready list.  */
+      gcc_assert (insn_queue[NEXT_Q_AFTER (q_ptr, 0)] == NULL_RTX);
+
+      int n_stalls = PARAM_VALUE (PARAM_SCHED_AUTOPREF_QUEUE_DEPTH) - 1;
+      if (n_stalls > max_insn_queue_index)
+	n_stalls = max_insn_queue_index;
+
+      for (int stalls = 1; stalls <= n_stalls; ++stalls)
+	{
+	  for (rtx_insn_list *link = insn_queue[NEXT_Q_AFTER (q_ptr, stalls)];
+	       link != NULL_RTX;
+	       link = link->next ())
+	    {
+	      rtx_insn *insn2 = link->insn ();
+	      r = autopref_multipass_dfa_lookahead_guard_1 (insn1, insn2,
+							    write);
+	      if (r)
+		{
+		  /* Queue INSN1 until INSN2 can issue.  */
+		  r = -stalls;
+		  if (ready_index == 0)
+		    data1->dont_delay = 1;
+		  goto finish;
+		}
+	    }
+	}
+    }
+
+    finish:
+  if (sched_verbose >= 2
+      && autopref_multipass_dfa_lookahead_guard_started_dump_p
+      && (ready_index == ready.n_ready - 1 || r < 0))
+    /* This does not /always/ trigger.  We don't output EOL if the last
+       insn is not recognized (INSN_CODE < 0) and lookahead_guard is not
+       called.  We can live with this.  */
+    fprintf (sched_dump, "\n");
+
+  return r;
+}
+
 /* Define type for target data used in multipass scheduling.  */
 #ifndef TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DATA_T
 # define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DATA_T int
@@ -8640,6 +8885,8 @@ init_h_i_d (rtx_insn *insn)
       INSN_EXACT_TICK (insn) = INVALID_TICK;
       INTER_TICK (insn) = INVALID_TICK;
       TODO_SPEC (insn) = HARD_DEP;
+      INSN_AUTOPREF_MULTIPASS_DATA (insn)[0].dont_delay = -1;
+      INSN_AUTOPREF_MULTIPASS_DATA (insn)[1].dont_delay = -1;
     }
 }
 
diff --git a/gcc/params.def b/gcc/params.def
index beff7e6..34e5f59 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -668,6 +668,11 @@ DEFPARAM (PARAM_SCHED_MEM_TRUE_DEP_COST,
 	  "Minimal distance between possibly conflicting store and load",
 	  1, 0, 0)
 
+DEFPARAM (PARAM_SCHED_AUTOPREF_QUEUE_DEPTH,
+	  "sched-autopref-queue-depth",
+	  "Hardware autoprefetcher scheduler model control flag.  Number of lookahead cycles the model looks into; at '0' only enable instruction sorting heuristic.  Disabled by default.",
+	  -1, 0, 0)
+
 DEFPARAM(PARAM_MAX_LAST_VALUE_RTL,
 	 "max-last-value-rtl",
 	 "The maximum number of RTL nodes that can be recorded as combiner's last value",
diff --git a/gcc/sched-int.h b/gcc/sched-int.h
index 71a4b5c..3c8f107 100644
--- a/gcc/sched-int.h
+++ b/gcc/sched-int.h
@@ -794,6 +794,24 @@ struct reg_set_data
   struct reg_set_data *next_insn_set;
 };
 
+/* Data for modeling cache auto-prefetcher.  */
+struct autopref_multipass_data_
+{
+  /* Base part of memory address.  */
+  rtx base;
+  /* Memory offset.  */
+  int offset;
+  /* +1 if entry is relevant for auto-prefetcher, but insn should not be
+     delayed as that will break scheduling.
+     +0 if entry is relevant for auto-prefetcher and insn can be delayed
+     to allow another insn through.
+     -1 if entry is uninitialized.
+     -2 if entry is irrelevant for auto-prefetcher.  */
+  int dont_delay;
+};
+typedef struct autopref_multipass_data_ autopref_multipass_data_def;
+typedef autopref_multipass_data_def *autopref_multipass_data_t;
+
 struct _haifa_insn_data
 {
   /* We can't place 'struct _deps_list' into h_i_d instead of deps_list_t
@@ -891,6 +909,10 @@ struct _haifa_insn_data
 
   /* The deciding reason for INSN's place in the ready list.  */
   int last_rfs_win;
+
+  /* Two entries for cache auto-prefetcher model: one for mem reads,
+     and one for mem writes.  */
+  autopref_multipass_data_def autopref_multipass_data[2];
 };
 
 typedef struct _haifa_insn_data haifa_insn_data_def;
@@ -912,6 +934,8 @@ extern vec<haifa_insn_data_def> h_i_d;
   (HID (INSN)->reg_pressure_excess_cost_change)
 #define INSN_PRIORITY_STATUS(INSN) (HID (INSN)->priority_status)
 #define INSN_MODEL_INDEX(INSN) (HID (INSN)->model_index)
+#define INSN_AUTOPREF_MULTIPASS_DATA(INSN) \
+  (HID (INSN)->autopref_multipass_data)
 
 typedef struct _haifa_deps_insn_data haifa_deps_insn_data_def;
 typedef haifa_deps_insn_data_def *haifa_deps_insn_data_t;
@@ -1360,6 +1384,8 @@ extern int cycle_issued_insns;
 extern int issue_rate;
 extern int dfa_lookahead;
 
+extern int autopref_multipass_dfa_lookahead_guard (rtx, int);
+
 extern void ready_sort (struct ready_list *);
 extern rtx_insn *ready_element (struct ready_list *, int);
 extern rtx_insn **ready_lastpos (struct ready_list *);
-- 
1.7.9.5


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
  2014-10-21  4:09 [PATCH 7/8] Model cache auto-prefetcher in scheduler Maxim Kuvyrkov
@ 2014-10-21  5:44 ` Andrew Pinski
  2014-11-10 13:15 ` Maxim Kuvyrkov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 19+ messages in thread
From: Andrew Pinski @ 2014-10-21  5:44 UTC (permalink / raw)
  To: Maxim Kuvyrkov; +Cc: GCC Patches, Vladimir Makarov, Ramana Radhakrishnan

On Mon, Oct 20, 2014 at 9:06 PM, Maxim Kuvyrkov
<maxim.kuvyrkov@linaro.org> wrote:
> Hi,
>
> This patch adds auto-prefetcher modeling to GCC scheduler.  The auto-prefetcher model is currently enabled only for ARM Cortex-A15, since this is the only CPU that I know of to have the hardware auto-prefetcher unit.

That might be the only ARM processor but I know the PowerPC 970 and
power 4 have a hardware auto-prefetcher.  They are slightly different
in how many streams can be active.  The 970 has some streams reserved
for user streams.  The PowerPC Cell also has a similar thing.


Thanks,
Andrew

>
> The documentation on the auto-prefetcher is very sparse, and all I have are my empirical studies and a short note in Cortex-A15 manual (search for "L2 cache auto-prefether").  This patch, therefore, implements a very abstract model that makes scheduler prefer "mem_op (base+8); mem_op (base+12)" over "mem_op (base+12); mem_op (base+8)".  In other words, memory operations are tried to be issued in order of increasing memory offsets.
>
> The auto-prefetcher model implementation is based on max_issue mutlipass lookahead scheduling, and its "guard" hook.  The guard hook examines contents of the ready list and the queue, and, if it finds instructions with lower memory offsets, marks instructions with higher memory offset as unavailable for immediate scheduling.
>
> This patch has been in works since beginning of the year, and many of my previous scheduler cleanup patches were to prepare the infrastructure for this feature.
>
> Ramana, this change requires benchmarking, which I can't easily do at the moment.  I would appreciate any benchmarking results that you can share.  In particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH needs to be tuned/confirmed for Cortex-A15.
>
> At the moment the parameter is set to "2", which means that the autopref model will look through ready list and 1-stall queue in search of relevant instructions.  Values of -1 (disable autopref), 0 (use autopref only in rank_for_schedule), 1 (look through ready list), 2 (look through ready list and 1-stall queue), and 3 (look through ready list and 2-stall queue) should be considered and benchmarked.
>
> Bootstrapped on x86_64-linux-gnu and regtested on arm-linux-gnueaihf and aarch64-linux-gnu.  OK to apply, provided no performance or correctness regressions?
>
> [ChangeLog is part of the git patch]
>
> Thank you,
>
> --
> Maxim Kuvyrkov
> www.linaro.org
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
  2014-10-21  4:09 [PATCH 7/8] Model cache auto-prefetcher in scheduler Maxim Kuvyrkov
  2014-10-21  5:44 ` Andrew Pinski
@ 2014-11-10 13:15 ` Maxim Kuvyrkov
  2014-11-14  2:16 ` Vladimir Makarov
  2014-11-14  6:46 ` Jeff Law
  3 siblings, 0 replies; 19+ messages in thread
From: Maxim Kuvyrkov @ 2014-11-10 13:15 UTC (permalink / raw)
  To: Maxim Kuvyrkov; +Cc: GCC Patches, Vladimir Makarov, Ramana Radhakrishnan

On Oct 21, 2014, at 8:06 AM, Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org> wrote:

> Hi,
> 
> This patch adds auto-prefetcher modeling to GCC scheduler.  The auto-prefetcher model is currently enabled only for ARM Cortex-A15, since this is the only CPU that I know of to have the hardware auto-prefetcher unit.
> 
> The documentation on the auto-prefetcher is very sparse, and all I have are my empirical studies and a short note in Cortex-A15 manual (search for "L2 cache auto-prefether").  This patch, therefore, implements a very abstract model that makes scheduler prefer "mem_op (base+8); mem_op (base+12)" over "mem_op (base+12); mem_op (base+8)".  In other words, memory operations are tried to be issued in order of increasing memory offsets.
> 
> The auto-prefetcher model implementation is based on max_issue mutlipass lookahead scheduling, and its "guard" hook.  The guard hook examines contents of the ready list and the queue, and, if it finds instructions with lower memory offsets, marks instructions with higher memory offset as unavailable for immediate scheduling.
> 
> This patch has been in works since beginning of the year, and many of my previous scheduler cleanup patches were to prepare the infrastructure for this feature. 
> 
> Ramana, this change requires benchmarking, which I can't easily do at the moment.  I would appreciate any benchmarking results that you can share.  In particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH needs to be tuned/confirmed for Cortex-A15.
> 
> At the moment the parameter is set to "2", which means that the autopref model will look through ready list and 1-stall queue in search of relevant instructions.  Values of -1 (disable autopref), 0 (use autopref only in rank_for_schedule), 1 (look through ready list), 2 (look through ready list and 1-stall queue), and 3 (look through ready list and 2-stall queue) should be considered and benchmarked.
> 
> Bootstrapped on x86_64-linux-gnu and regtested on arm-linux-gnueaihf and aarch64-linux-gnu.  OK to apply, provided no performance or correctness regressions?
> 
> [ChangeLog is part of the git patch]

Ping?

All prerequisite patches for this one are now approved and [mostly] checked in.  This is the last outstanding item from my patch series to improve scheduling.

Thank you,

--
Maxim Kuvyrkov
www.linaro.org

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
  2014-10-21  4:09 [PATCH 7/8] Model cache auto-prefetcher in scheduler Maxim Kuvyrkov
  2014-10-21  5:44 ` Andrew Pinski
  2014-11-10 13:15 ` Maxim Kuvyrkov
@ 2014-11-14  2:16 ` Vladimir Makarov
  2014-11-14 15:10   ` Maxim Kuvyrkov
  2014-11-14  6:46 ` Jeff Law
  3 siblings, 1 reply; 19+ messages in thread
From: Vladimir Makarov @ 2014-11-14  2:16 UTC (permalink / raw)
  To: Maxim Kuvyrkov, GCC Patches; +Cc: Ramana Radhakrishnan

On 2014-10-21 12:06 AM, Maxim Kuvyrkov wrote:
> Hi,
>
> This patch adds auto-prefetcher modeling to GCC scheduler.  The auto-prefetcher model is currently enabled only for ARM Cortex-A15, since this is the only CPU that I know of to have the hardware auto-prefetcher unit.
>
> The documentation on the auto-prefetcher is very sparse, and all I have are my empirical studies and a short note in Cortex-A15 manual (search for "L2 cache auto-prefether").  This patch, therefore, implements a very abstract model that makes scheduler prefer "mem_op (base+8); mem_op (base+12)" over "mem_op (base+12); mem_op (base+8)".  In other words, memory operations are tried to be issued in order of increasing memory offsets.
>
> The auto-prefetcher model implementation is based on max_issue mutlipass lookahead scheduling, and its "guard" hook.  The guard hook examines contents of the ready list and the queue, and, if it finds instructions with lower memory offsets, marks instructions with higher memory offset as unavailable for immediate scheduling.
>
> This patch has been in works since beginning of the year, and many of my previous scheduler cleanup patches were to prepare the infrastructure for this feature.
>
> Ramana, this change requires benchmarking, which I can't easily do at the moment.  I would appreciate any benchmarking results that you can share.  In particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH needs to be tuned/confirmed for Cortex-A15.
>
> At the moment the parameter is set to "2", which means that the autopref model will look through ready list and 1-stall queue in search of relevant instructions.  Values of -1 (disable autopref), 0 (use autopref only in rank_for_schedule), 1 (look through ready list), 2 (look through ready list and 1-stall queue), and 3 (look through ready list and 2-stall queue) should be considered and benchmarked.
>
> Bootstrapped on x86_64-linux-gnu and regtested on arm-linux-gnueaihf and aarch64-linux-gnu.  OK to apply, provided no performance or correctness regressions?
>
> [ChangeLog is part of the git patch]
>
>

I'd prefer symbolic constants for dont_delay.  Also the address can 
contains other parts, e.g. index for some targets.  It is not necessary 
to change the code but a comment would be nice that right now it is 
oriented for machine with base+disp only addressing.

Although it is probably matter of taste.  So you are free to commit it 
without any change.

   Thanks, Maxim.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
  2014-10-21  4:09 [PATCH 7/8] Model cache auto-prefetcher in scheduler Maxim Kuvyrkov
                   ` (2 preceding siblings ...)
  2014-11-14  2:16 ` Vladimir Makarov
@ 2014-11-14  6:46 ` Jeff Law
  2014-11-14 15:24   ` Maxim Kuvyrkov
  3 siblings, 1 reply; 19+ messages in thread
From: Jeff Law @ 2014-11-14  6:46 UTC (permalink / raw)
  To: Maxim Kuvyrkov, GCC Patches; +Cc: Vladimir Makarov, Ramana Radhakrishnan

On 10/20/14 22:06, Maxim Kuvyrkov wrote:
> Hi,
> Ramana, this change requires benchmarking, which I can't easily do
> at
the moment. I would appreciate any benchmarking results that you can
share. In particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH
needs to be tuned/confirmed for Cortex-A15.
What were the results of that benchmarking?  IIRC I tabled reviewing 
this work waiting for those results (and I probably should have let you 
know that.  Sorry,  my bad there).

>
>
> 0007-Model-cache-auto-prefetcher-in-scheduler.patch
>
>
>  From 629c2cc593b49b8596b00e3e3d62444493aa3514 Mon Sep 17 00:00:00 2001
> From: Maxim Kuvyrkov<maxim.kuvyrkov@linaro.org>
> Date: Mon, 20 Oct 2014 23:13:23 +0100
> Subject: [PATCH 7/8] Model cache auto-prefetcher in scheduler
>
> 	* config/arm/arm.c (sched-int.h): Include header.
> 	(arm_first_cycle_multipass_dfa_lookahead_guard,)
> 	(TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD): Define hook.
> 	Enable auto-prefetcher model for Cortex-A15.
> 	(arm_option_override): Set autoprefetcher parameter.
> 	* config/arm/t-arm (arm.o): Update.
> 	* haifa-sched.c (update_insn_after_change): Update.
> 	(rank_for_schedule): Use auto-prefetcher model, if requested.
> 	(autopref_multipass_init): New static function.
> 	(autopref_rank_for_schedule): New rank_for_schedule heuristic.
> 	(autopref_multipass_dfa_lookahead_guard_started_dump_p): New static
> 	variable for debug dumps.
> 	(autopref_multipass_dfa_lookahead_guard_1): New static helper function.
> 	(autopref_multipass_dfa_lookahead_guard): New global function that
> 	implements TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD hook.
> 	(init_h_i_d): Update.
> 	* params.def (PARAM_SCHED_AUTOPREF_QUEUE_DEPTH): New tuning knob.
> 	* sched-int.h (autopref_multipass_data_): Structure for	auto-prefetcher
> 	data.
> 	(autopref_multipass_data_def, autopref_multipass_data_t): New typedefs.
> 	(struct _haifa_insn_data:autopref_multipass_data): New field.
> 	(INSN_AUTOPREF_MULTIPASS_DATA): New access macro.
> 	(autopref_multipass_dfa_lookahead_guard): Declare.
Can this be built on top of Bin's work for insn fusion?  There's a lot 
of commonality in the structure of the insns you care about.  He's 
already got a nice little priority function that I think you could 
utilize to to ensure the insns with smaller offsets fire first.


My biggest concern would be sched2 coming along and undoing that work 
since you're not going to fuse those into move-multiple types of 
instructions.

Jeff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
  2014-11-14  2:16 ` Vladimir Makarov
@ 2014-11-14 15:10   ` Maxim Kuvyrkov
  0 siblings, 0 replies; 19+ messages in thread
From: Maxim Kuvyrkov @ 2014-11-14 15:10 UTC (permalink / raw)
  To: Vladimir Makarov; +Cc: GCC Patches, Ramana Radhakrishnan

On Nov 14, 2014, at 4:57 AM, Vladimir Makarov <vmakarov@redhat.com> wrote:

> On 2014-10-21 12:06 AM, Maxim Kuvyrkov wrote:
> 
...
> I'd prefer symbolic constants for dont_delay.  Also the address can contains other parts, e.g. index for some targets.  It is not necessary to change the code but a comment would be nice that right now it is oriented for machine with base+disp only addressing.
> 
> Although it is probably matter of taste.  So you are free to commit it without any change.

I'll add an enum with symbolic constants for dont_delay and a comment about handled memory address types.

Thanks for the review!

--
Maxim Kuvyrkov
www.linaro.org

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
  2014-11-14  6:46 ` Jeff Law
@ 2014-11-14 15:24   ` Maxim Kuvyrkov
  2014-11-14 17:51     ` Jeff Law
  2014-11-19  9:40     ` Ramana Radhakrishnan
  0 siblings, 2 replies; 19+ messages in thread
From: Maxim Kuvyrkov @ 2014-11-14 15:24 UTC (permalink / raw)
  To: Jeff Law; +Cc: GCC Patches, Vladimir Makarov, Ramana Radhakrishnan

On Nov 14, 2014, at 8:38 AM, Jeff Law <law@redhat.com> wrote:

> On 10/20/14 22:06, Maxim Kuvyrkov wrote:
>> Hi,
>> Ramana, this change requires benchmarking, which I can't easily do
>> at
> the moment. I would appreciate any benchmarking results that you can
> share. In particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH
> needs to be tuned/confirmed for Cortex-A15.
> What were the results of that benchmarking?  IIRC I tabled reviewing this work waiting for those results (and I probably should have let you know that.  Sorry,  my bad there).

I don't have the benchmarking results yet, and I was hoping for ARM to help with getting the numbers.  The arm maintainers still need to OK the arm-specific portion of the patch, which, I imagine, will happen only of benchmark scores improve.

...
> Can this be built on top of Bin's work for insn fusion?  There's a lot of commonality in the structure of the insns you care about.  He's already got a nice little priority function that I think you could utilize to to ensure the insns with smaller offsets fire first.

I would argue that macro-fusion should have been implemented the way autopref_model is -- via targetm.sched.first_cycle_multipass_dfa_lookahead_guard hook.  To implement the autopref model I cleaned up and generalized existing infrastructure (max_issue and dfa_lookahead_guard hook) instead of adding yet another decision-making primitive to the scheduler.

> 
> 
> My biggest concern would be sched2 coming along and undoing that work since you're not going to fuse those into move-multiple types of instructions.

The autoprefetcher will be active only during sched2.  It is disabled during sched1 by the fact that max_issue is not used when scheduling for register pressure.

Thanks,

--
Maxim Kuvyrkov
www.linaro.org

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
  2014-11-14 15:24   ` Maxim Kuvyrkov
@ 2014-11-14 17:51     ` Jeff Law
  2014-11-19  9:40     ` Ramana Radhakrishnan
  1 sibling, 0 replies; 19+ messages in thread
From: Jeff Law @ 2014-11-14 17:51 UTC (permalink / raw)
  To: Maxim Kuvyrkov; +Cc: GCC Patches, Vladimir Makarov, Ramana Radhakrishnan

On 11/14/14 08:12, Maxim Kuvyrkov wrote:
> ...
>> Can this be built on top of Bin's work for insn fusion?  There's a
>> lot of commonality in the structure of the insns you care about.
>> He's already got a nice little priority function that I think you
>> could utilize to to ensure the insns with smaller offsets fire
>> first.
>
> I would argue that macro-fusion should have been implemented the way
> autopref_model is -- via
> targetm.sched.first_cycle_multipass_dfa_lookahead_guard hook.  To
> implement the autopref model I cleaned up and generalized existing
> infrastructure (max_issue and dfa_lookahead_guard hook) instead of
> adding yet another decision-making primitive to the scheduler.
Fair enough.  We can put unifying the two implementations on the queue 
for next stage 1.

>
>>
>>
>> My biggest concern would be sched2 coming along and undoing that
>> work since you're not going to fuse those into move-multiple types
>> of instructions.
>
> The autoprefetcher will be active only during sched2.  It is disabled
> during sched1 by the fact that max_issue is not used when scheduling
> for register pressure.
Sorry, should have been clearer, that comment was in reference to using 
the fusion stuff to handle prefetching.  It was not in reference to your 
change.

Vlad's approval and my questions crossed last night, I don't have any 
objections to this going in, though I do want us to look at unifying the 
memory pairing and prefetching stuff during the next stage1.

Jeff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
  2014-11-14 15:24   ` Maxim Kuvyrkov
  2014-11-14 17:51     ` Jeff Law
@ 2014-11-19  9:40     ` Ramana Radhakrishnan
  2014-11-19 10:14       ` Maxim Kuvyrkov
  2015-01-16 15:20       ` Maxim Kuvyrkov
  1 sibling, 2 replies; 19+ messages in thread
From: Ramana Radhakrishnan @ 2014-11-19  9:40 UTC (permalink / raw)
  To: Maxim Kuvyrkov, Jeff Law; +Cc: GCC Patches, Vladimir Makarov



On 14/11/14 15:12, Maxim Kuvyrkov wrote:
> On Nov 14, 2014, at 8:38 AM, Jeff Law <law@redhat.com> wrote:
>
>> On 10/20/14 22:06, Maxim Kuvyrkov wrote:
>>> Hi,
>>> Ramana, this change requires benchmarking, which I can't easily do
>>> at
>> the moment. I would appreciate any benchmarking results that you can
>> share. In particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH
>> needs to be tuned/confirmed for Cortex-A15.
>> What were the results of that benchmarking?  IIRC I tabled reviewing this work waiting for those results (and I probably should have let you know that.  Sorry,  my bad there).
>
> I don't have the benchmarking results yet, and I was hoping for ARM to help with getting the numbers.  The arm maintainers still need to OK the arm-specific portion of the patch, which, I imagine, will happen only of benchmark scores improve.


I tried benchmarking 78f367cfcfdc9f0a422a362cd85ecc122834d96f from the 
trees you gave me links to against the equivalent version on trunk.

The results with SPEC2k on A15 were in the noise with the default value 
for PARAM_SCHED_AUTOPREF_QUEUE_DEPTH which is 2 in the backend. I'm 
still waiting on results for values 0, 1 and 3 and hopefully something 
will come back soon for SPEC2k.


>
> @@ -29903,6 +29915,20 @@ arm_first_cycle_multipass_dfa_lookahead (void)
>    return issue_rate > 1 ? issue_rate : 0;
>  }
>
> +/* Enable modeling of Cortex-A15 L2 auto-prefetcher.  */
> +static int
> +arm_first_cycle_multipass_dfa_lookahead_guard (rtx insn, int ready_index)
> +{
> +  switch (arm_tune)
> +    {
> +    case cortexa15:
> +      return autopref_multipass_dfa_lookahead_guard (insn, ready_index);
> +
> +    default:
> +      return 0;
> +    }
> +}
> +


It would be better to have this as a flag in the tuning tables rather 
than hardcoding for a core here. The backend has been moving in that 
direction for all core centric information and it is preferable that be 
continued.

So this logic here should just be

if (current_tune->multipass_lookahead)
   return autopref_multipass_lookahead_guard (insn, ready_index);
else
   return 0;




regards
Ramana




>
> ...
>> Can this be built on top of Bin's work for insn fusion?  There's a lot of commonality in the structure of the insns you care about.  He's already got a nice little priority function that I think you could utilize to to ensure the insns with smaller offsets fire first.
>
> I would argue that macro-fusion should have been implemented the way autopref_model is -- via targetm.sched.first_cycle_multipass_dfa_lookahead_guard hook.  To implement the autopref model I cleaned up and generalized existing infrastructure (max_issue and dfa_lookahead_guard hook) instead of adding yet another decision-making primitive to the scheduler.
>
>>
>>
>> My biggest concern would be sched2 coming along and undoing that work since you're not going to fuse those into move-multiple types of instructions.
>
> The autoprefetcher will be active only during sched2.  It is disabled during sched1 by the fact that max_issue is not used when scheduling for register pressure.
>
> Thanks,
>
> --
> Maxim Kuvyrkov
> www.linaro.org
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
  2014-11-19  9:40     ` Ramana Radhakrishnan
@ 2014-11-19 10:14       ` Maxim Kuvyrkov
  2015-01-16 15:20       ` Maxim Kuvyrkov
  1 sibling, 0 replies; 19+ messages in thread
From: Maxim Kuvyrkov @ 2014-11-19 10:14 UTC (permalink / raw)
  To: Ramana Radhakrishnan; +Cc: Jeff Law, GCC Patches, Vladimir Makarov

On Nov 19, 2014, at 12:27 PM, Ramana Radhakrishnan <ramana.radhakrishnan@arm.com> wrote:

> 
> 
> On 14/11/14 15:12, Maxim Kuvyrkov wrote:
>> On Nov 14, 2014, at 8:38 AM, Jeff Law <law@redhat.com> wrote:
>> 
>>> On 10/20/14 22:06, Maxim Kuvyrkov wrote:
>>>> Hi,
>>>> Ramana, this change requires benchmarking, which I can't easily do
>>>> at
>>> the moment. I would appreciate any benchmarking results that you can
>>> share. In particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH
>>> needs to be tuned/confirmed for Cortex-A15.
>>> What were the results of that benchmarking?  IIRC I tabled reviewing this work waiting for those results (and I probably should have let you know that.  Sorry,  my bad there).
>> 
>> I don't have the benchmarking results yet, and I was hoping for ARM to help with getting the numbers.  The arm maintainers still need to OK the arm-specific portion of the patch, which, I imagine, will happen only of benchmark scores improve.
> 
> 
> I tried benchmarking 78f367cfcfdc9f0a422a362cd85ecc122834d96f from the trees you gave me links to against the equivalent version on trunk.
> 
> The results with SPEC2k on A15 were in the noise with the default value for PARAM_SCHED_AUTOPREF_QUEUE_DEPTH which is 2 in the backend. I'm still waiting on results for values 0, 1 and 3 and hopefully something will come back soon for SPEC2k.

I think there is a good chance that this optimization will not improve SPEC benchmarks, my main concern is to make sure SPECs don't regress.  If SPECs are neutral and the patch gives good improvement (2.5 times faster) on another benchmark (STREAM), is this a good enough argument for commit?

> 
> 
>> 
>> @@ -29903,6 +29915,20 @@ arm_first_cycle_multipass_dfa_lookahead (void)
>>   return issue_rate > 1 ? issue_rate : 0;
>> }
>> 
>> +/* Enable modeling of Cortex-A15 L2 auto-prefetcher.  */
>> +static int
>> +arm_first_cycle_multipass_dfa_lookahead_guard (rtx insn, int ready_index)
>> +{
>> +  switch (arm_tune)
>> +    {
>> +    case cortexa15:
>> +      return autopref_multipass_dfa_lookahead_guard (insn, ready_index);
>> +
>> +    default:
>> +      return 0;
>> +    }
>> +}
>> +
> 
> 
> It would be better to have this as a flag in the tuning tables rather than hardcoding for a core here. The backend has been moving in that direction for all core centric information and it is preferable that be continued.
> 
> So this logic here should just be
> 
> if (current_tune->multipass_lookahead)
>  return autopref_multipass_lookahead_guard (insn, ready_index);
> else
>  return 0;
> 

OK, I'll convert this to use tuning tables.

Thank you,

--
Maxim Kuvyrkov
www.linaro.org

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
  2014-11-19  9:40     ` Ramana Radhakrishnan
  2014-11-19 10:14       ` Maxim Kuvyrkov
@ 2015-01-16 15:20       ` Maxim Kuvyrkov
  2015-01-16 15:34         ` Ramana Radhakrishnan
  2015-01-19 15:15         ` Richard Earnshaw
  1 sibling, 2 replies; 19+ messages in thread
From: Maxim Kuvyrkov @ 2015-01-16 15:20 UTC (permalink / raw)
  To: Ramana Radhakrishnan; +Cc: Jeff Law, GCC Patches, Vladimir Makarov

[-- Attachment #1: Type: text/plain, Size: 3537 bytes --]

On Nov 19, 2014, at 12:27 PM, Ramana Radhakrishnan <Ramana.Radhakrishnan@arm.com> wrote:

> 

Hi Ramana,
Hi Vladimir,

I still don't have SPEC2000/SPEC2006 benchmark numbers for this patch.  Since stage3 is about to finish, I'm going to commit the target independent part of the patch now (it was approved by Vladimir a while ago).

I'll commit ARM part of the patch once Ramana is happy with it.

I've addressed all review comments and updated patch is attached.

Any objections?

> On 14/11/14 15:12, Maxim Kuvyrkov wrote:
>> On Nov 14, 2014, at 8:38 AM, Jeff Law <law@redhat.com> wrote:
>> 
>>> On 10/20/14 22:06, Maxim Kuvyrkov wrote:
>>>> Hi,
>>>> Ramana, this change requires benchmarking, which I can't easily do
>>>> at
>>> the moment. I would appreciate any benchmarking results that you can
>>> share. In particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH
>>> needs to be tuned/confirmed for Cortex-A15.
>>> What were the results of that benchmarking?  IIRC I tabled reviewing this work waiting for those results (and I probably should have let you know that.  Sorry,  my bad there).
>> 
>> I don't have the benchmarking results yet, and I was hoping for ARM to help with getting the numbers.  The arm maintainers still need to OK the arm-specific portion of the patch, which, I imagine, will happen only of benchmark scores improve.
> 
> 
> I tried benchmarking 78f367cfcfdc9f0a422a362cd85ecc122834d96f from the trees you gave me links to against the equivalent version on trunk.
> 
> The results with SPEC2k on A15 were in the noise with the default value for PARAM_SCHED_AUTOPREF_QUEUE_DEPTH which is 2 in the backend. I'm still waiting on results for values 0, 1 and 3 and hopefully something will come back soon for SPEC2k.

After going back to the original benchmark that exposed the autoprefetcher issue (STREAMS) the required value of the parameter to fix the regression turned out to be 5.

I have also discovered that Cortex-A57 also benefits from this optimization.  Running STREAMS benchmark on Juno in AArch32 mode showed increased performance by 10-15%, which is not as big as Cortex-A15's 250-300%, but still significant.

In this updated patch the parameter is set to maximum -- look through all scheduling queue of instructions in search of memory ops that are relevant to L2 autoprefetcher.  In practice the setting to maximum will not change much, since there are rarely many instructions in the queue.

> 
> 
>> 
>> @@ -29903,6 +29915,20 @@ arm_first_cycle_multipass_dfa_lookahead (void)
>>   return issue_rate > 1 ? issue_rate : 0;
>> }
>> 
>> +/* Enable modeling of Cortex-A15 L2 auto-prefetcher.  */
>> +static int
>> +arm_first_cycle_multipass_dfa_lookahead_guard (rtx insn, int ready_index)
>> +{
>> +  switch (arm_tune)
>> +    {
>> +    case cortexa15:
>> +      return autopref_multipass_dfa_lookahead_guard (insn, ready_index);
>> +
>> +    default:
>> +      return 0;
>> +    }
>> +}
>> +
> 
> 
> It would be better to have this as a flag in the tuning tables rather than hardcoding for a core here. The backend has been moving in that direction for all core centric information and it is preferable that be continued.
> 
> So this logic here should just be
> 
> if (current_tune->multipass_lookahead)
>  return autopref_multipass_lookahead_guard (insn, ready_index);
> else
>  return 0;
> 

Done.  The autoprefetcher is enabled in Cortex-A15 and Cortex-A57 tuning tables.

--
Maxim Kuvyrkov
www.linaro.org



[-- Attachment #2: 0001-Model-cache-auto-prefetcher-in-scheduler.patch --]
[-- Type: application/octet-stream, Size: 28352 bytes --]

From 46d29f2b60298986319a45c45b92fe4e376a7a12 Mon Sep 17 00:00:00 2001
From: Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org>
Date: Mon, 20 Oct 2014 23:13:23 +0100
Subject: [PATCH] Model cache auto-prefetcher in scheduler

	* config/arm/arm.c (sched-int.h): Include header.
	(arm_first_cycle_multipass_dfa_lookahead_guard,)
	(TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD): Define hook.
	Enable auto-prefetcher model for Cortex-A15.
	(arm_option_override): Set autoprefetcher parameter.
	* config/arm/t-arm (arm.o): Update.
	* haifa-sched.c (update_insn_after_change): Update.
	(rank_for_schedule): Use auto-prefetcher model, if requested.
	(autopref_multipass_init): New static function.
	(autopref_rank_for_schedule): New rank_for_schedule heuristic.
	(autopref_multipass_dfa_lookahead_guard_started_dump_p): New static
	variable for debug dumps.
	(autopref_multipass_dfa_lookahead_guard_1): New static helper function.
	(autopref_multipass_dfa_lookahead_guard): New global function that
	implements TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD hook.
	(init_h_i_d): Update.
	* params.def (PARAM_SCHED_AUTOPREF_QUEUE_DEPTH): New tuning knob.
	* sched-int.h (autopref_multipass_data_): Structure for	auto-prefetcher
	data.
	(autopref_multipass_data_def, autopref_multipass_data_t): New typedefs.
	(struct _haifa_insn_data:autopref_multipass_data): New field.
	(INSN_AUTOPREF_MULTIPASS_DATA): New access macro.
	(autopref_multipass_dfa_lookahead_guard): Declare.
---
 gcc/config/arm/arm-protos.h |    2 +
 gcc/config/arm/arm.c        |   80 ++++++++++----
 gcc/config/arm/t-arm        |    3 +-
 gcc/haifa-sched.c           |  253 +++++++++++++++++++++++++++++++++++++++++++
 gcc/params.def              |    5 +
 gcc/sched-int.h             |   34 ++++++
 6 files changed, 356 insertions(+), 21 deletions(-)

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 320215b..3db7e16 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -291,6 +291,8 @@ struct tune_params
   int max_insns_inline_memset;
   /* Bitfield encoding the fuseable pairs of instructions.  */
   unsigned int fuseable_ops;
+  /* Depth of scheduling queue to check for L2 autoprefetcher.  */
+  int sched_autopref_queue_depth;
 };
 
 extern const struct tune_params *current_tune;
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 337a69b..fddd770 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -97,6 +97,7 @@
 #include "builtins.h"
 #include "tm-constrs.h"
 #include "rtl-iter.h"
+#include "sched-int.h"
 
 /* Forward definitions of types.  */
 typedef struct minipool_node    Mnode;
@@ -269,6 +270,7 @@ static bool arm_macro_fusion_p (void);
 static bool arm_cannot_copy_insn_p (rtx_insn *);
 static int arm_issue_rate (void);
 static int arm_first_cycle_multipass_dfa_lookahead (void);
+static int arm_first_cycle_multipass_dfa_lookahead_guard (rtx_insn *, int);
 static void arm_output_dwarf_dtprel (FILE *, int, rtx) ATTRIBUTE_UNUSED;
 static bool arm_output_addr_const_extra (FILE *, rtx);
 static bool arm_allocate_stack_slots_for_args (void);
@@ -629,6 +631,10 @@ static const struct attribute_spec arm_attribute_table[] =
 #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD \
   arm_first_cycle_multipass_dfa_lookahead
 
+#undef TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD
+#define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD \
+  arm_first_cycle_multipass_dfa_lookahead_guard
+
 #undef TARGET_MANGLE_TYPE
 #define TARGET_MANGLE_TYPE arm_mangle_type
 
@@ -1690,7 +1696,8 @@ const struct tune_params arm_slowmul_tune =
   false, false,                                 /* Prefer 32-bit encodings.  */
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
-  ARM_FUSE_NOTHING				/* Fuseable pairs of instructions.  */
+  ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
+  -1						/* Sched L2 autopref depth.  */
 };
 
 const struct tune_params arm_fastmul_tune =
@@ -1710,7 +1717,8 @@ const struct tune_params arm_fastmul_tune =
   false, false,                                 /* Prefer 32-bit encodings.  */
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
-  ARM_FUSE_NOTHING				/* Fuseable pairs of instructions.  */
+  ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
+  -1						/* Sched L2 autopref depth.  */
 };
 
 /* StrongARM has early execution of branches, so a sequence that is worth
@@ -1733,7 +1741,8 @@ const struct tune_params arm_strongarm_tune =
   false, false,                                 /* Prefer 32-bit encodings.  */
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
-  ARM_FUSE_NOTHING				/* Fuseable pairs of instructions.  */
+  ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
+  -1						/* Sched L2 autopref depth.  */
 };
 
 const struct tune_params arm_xscale_tune =
@@ -1753,7 +1762,8 @@ const struct tune_params arm_xscale_tune =
   false, false,                                 /* Prefer 32-bit encodings.  */
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
-  ARM_FUSE_NOTHING				/* Fuseable pairs of instructions.  */
+  ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
+  -1						/* Sched L2 autopref depth.  */
 };
 
 const struct tune_params arm_9e_tune =
@@ -1773,7 +1783,8 @@ const struct tune_params arm_9e_tune =
   false, false,                                 /* Prefer 32-bit encodings.  */
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
-  ARM_FUSE_NOTHING				/* Fuseable pairs of instructions.  */
+  ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
+  -1						/* Sched L2 autopref depth.  */
 };
 
 const struct tune_params arm_v6t2_tune =
@@ -1793,7 +1804,8 @@ const struct tune_params arm_v6t2_tune =
   false, false,                                 /* Prefer 32-bit encodings.  */
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
-  ARM_FUSE_NOTHING				/* Fuseable pairs of instructions.  */
+  ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
+  -1						/* Sched L2 autopref depth.  */
 };
 
 /* Generic Cortex tuning.  Use more specific tunings if appropriate.  */
@@ -1814,7 +1826,8 @@ const struct tune_params arm_cortex_tune =
   false, false,                                 /* Prefer 32-bit encodings.  */
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
-  ARM_FUSE_NOTHING				/* Fuseable pairs of instructions.  */
+  ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
+  -1						/* Sched L2 autopref depth.  */
 };
 
 const struct tune_params arm_cortex_a8_tune =
@@ -1834,7 +1847,8 @@ const struct tune_params arm_cortex_a8_tune =
   false, false,                                 /* Prefer 32-bit encodings.  */
   true,						/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
-  ARM_FUSE_NOTHING				/* Fuseable pairs of instructions.  */
+  ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
+  -1						/* Sched L2 autopref depth.  */
 };
 
 const struct tune_params arm_cortex_a7_tune =
@@ -1854,7 +1868,8 @@ const struct tune_params arm_cortex_a7_tune =
   false, false,                                 /* Prefer 32-bit encodings.  */
   true,						/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
-  ARM_FUSE_NOTHING				/* Fuseable pairs of instructions.  */
+  ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
+  -1						/* Sched L2 autopref depth.  */
 };
 
 const struct tune_params arm_cortex_a15_tune =
@@ -1874,7 +1889,8 @@ const struct tune_params arm_cortex_a15_tune =
   true, true,                                   /* Prefer 32-bit encodings.  */
   true,						/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
-  ARM_FUSE_NOTHING				/* Fuseable pairs of instructions.  */
+  ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
+  max_insn_queue_index + 1			/* Sched L2 autopref depth.  */
 };
 
 const struct tune_params arm_cortex_a53_tune =
@@ -1894,7 +1910,8 @@ const struct tune_params arm_cortex_a53_tune =
   false, false,                                 /* Prefer 32-bit encodings.  */
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
-  ARM_FUSE_MOVW_MOVT				/* Fuseable pairs of instructions.  */
+  ARM_FUSE_MOVW_MOVT,				/* Fuseable pairs of instructions.  */
+  -1						/* Sched L2 autopref depth.  */
 };
 
 const struct tune_params arm_cortex_a57_tune =
@@ -1914,7 +1931,8 @@ const struct tune_params arm_cortex_a57_tune =
   true, true,                                  /* Prefer 32-bit encodings.  */
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
-  ARM_FUSE_MOVW_MOVT				/* Fuseable pairs of instructions.  */
+  ARM_FUSE_MOVW_MOVT,				/* Fuseable pairs of instructions.  */
+  max_insn_queue_index + 1			/* Sched L2 autopref depth.  */
 };
 
 const struct tune_params arm_xgene1_tune =
@@ -1934,7 +1952,8 @@ const struct tune_params arm_xgene1_tune =
   true, true,                                  /* Prefer 32-bit encodings.  */
   false,				       /* Prefer Neon for stringops.  */
   32,					       /* Maximum insns to inline memset.  */
-  ARM_FUSE_NOTHING				/* Fuseable pairs of instructions.  */
+  ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
+  -1						/* Sched L2 autopref depth.  */
 };
 
 /* Branches can be dual-issued on Cortex-A5, so conditional execution is
@@ -1957,7 +1976,8 @@ const struct tune_params arm_cortex_a5_tune =
   false, false,                                 /* Prefer 32-bit encodings.  */
   true,						/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
-  ARM_FUSE_NOTHING				/* Fuseable pairs of instructions.  */
+  ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
+  -1						/* Sched L2 autopref depth.  */
 };
 
 const struct tune_params arm_cortex_a9_tune =
@@ -1977,7 +1997,8 @@ const struct tune_params arm_cortex_a9_tune =
   false, false,                                 /* Prefer 32-bit encodings.  */
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
-  ARM_FUSE_NOTHING				/* Fuseable pairs of instructions.  */
+  ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
+  -1						/* Sched L2 autopref depth.  */
 };
 
 const struct tune_params arm_cortex_a12_tune =
@@ -1997,7 +2018,8 @@ const struct tune_params arm_cortex_a12_tune =
   true, true,                                   /* Prefer 32-bit encodings.  */
   true,						/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
-  ARM_FUSE_MOVW_MOVT				/* Fuseable pairs of instructions.  */
+  ARM_FUSE_MOVW_MOVT,				/* Fuseable pairs of instructions.  */
+  -1						/* Sched L2 autopref depth.  */
 };
 
 /* armv7m tuning.  On Cortex-M4 cores for example, MOVW/MOVT take a single
@@ -2024,7 +2046,8 @@ const struct tune_params arm_v7m_tune =
   false, false,                                 /* Prefer 32-bit encodings.  */
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
-  ARM_FUSE_NOTHING				/* Fuseable pairs of instructions.  */
+  ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
+  -1						/* Sched L2 autopref depth.  */
 };
 
 /* Cortex-M7 tuning.  */
@@ -2046,7 +2069,8 @@ const struct tune_params arm_cortex_m7_tune =
   false, false,                                 /* Prefer 32-bit encodings.  */
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
-  ARM_FUSE_NOTHING				/* Fuseable pairs of instructions.  */
+  ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
+  -1						/* Sched L2 autopref depth.  */
 };
 
 /* The arm_v6m_tune is duplicated from arm_cortex_tune, rather than
@@ -2068,7 +2092,8 @@ const struct tune_params arm_v6m_tune =
   false, false,                                 /* Prefer 32-bit encodings.  */
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
-  ARM_FUSE_NOTHING				/* Fuseable pairs of instructions.  */
+  ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
+  -1						/* Sched L2 autopref depth.  */
 };
 
 const struct tune_params arm_fa726te_tune =
@@ -2088,7 +2113,8 @@ const struct tune_params arm_fa726te_tune =
   false, false,                                 /* Prefer 32-bit encodings.  */
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
-  ARM_FUSE_NOTHING				/* Fuseable pairs of instructions.  */
+  ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
+  -1						/* Sched L2 autopref depth.  */
 };
 
 
@@ -3144,6 +3170,13 @@ arm_option_override (void)
                          global_options.x_param_values,
                          global_options_set.x_param_values);
 
+  /* Look through ready list and all of queue for instructions
+     relevant for L2 auto-prefetcher.  */
+  maybe_set_param_value (PARAM_SCHED_AUTOPREF_QUEUE_DEPTH,
+			 current_tune->sched_autopref_queue_depth,
+                         global_options.x_param_values,
+                         global_options_set.x_param_values);
+
   /* Disable shrink-wrap when optimizing function for size, since it tends to
      generate additional returns.  */
   if (optimize_function_for_size_p (cfun) && TARGET_THUMB2)
@@ -27153,6 +27186,13 @@ arm_first_cycle_multipass_dfa_lookahead (void)
   return issue_rate > 1 && !sched_fusion ? issue_rate : 0;
 }
 
+/* Enable modeling of L2 auto-prefetcher.  */
+static int
+arm_first_cycle_multipass_dfa_lookahead_guard (rtx_insn *insn, int ready_index)
+{
+  return autopref_multipass_dfa_lookahead_guard (insn, ready_index);
+}
+
 const char *
 arm_mangle_type (const_tree type)
 {
diff --git a/gcc/config/arm/t-arm b/gcc/config/arm/t-arm
index 4ef38a8..ab5b6e7 100644
--- a/gcc/config/arm/t-arm
+++ b/gcc/config/arm/t-arm
@@ -91,7 +91,8 @@ arm.o: $(srcdir)/config/arm/arm.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) \
   $(EXPR_H) $(OPTABS_H) $(RECOG_H) $(CGRAPH_H) \
   $(GGC_H) except.h $(C_PRAGMA_H) $(TM_P_H) \
   $(TARGET_H) $(TARGET_DEF_H) debug.h langhooks.h $(DF_H) \
-  intl.h libfuncs.h $(PARAMS_H) $(OPTS_H) $(srcdir)/config/arm/arm-cores.def \
+  intl.h libfuncs.h $(PARAMS_H) $(OPTS_H) sched-int.h \
+  $(srcdir)/config/arm/arm-cores.def \
   $(srcdir)/config/arm/arm-arches.def $(srcdir)/config/arm/arm-fpus.def \
   $(srcdir)/config/arm/arm-protos.h \
   $(srcdir)/config/arm/arm_neon_builtins.def
diff --git a/gcc/haifa-sched.c b/gcc/haifa-sched.c
index 98cb9e4b..795ff79 100644
--- a/gcc/haifa-sched.c
+++ b/gcc/haifa-sched.c
@@ -841,6 +841,7 @@ add_delay_dependencies (rtx_insn *insn)
 /* Forward declarations.  */
 
 static int priority (rtx_insn *);
+static int autopref_rank_for_schedule (const rtx_insn *, const rtx_insn *);
 static int rank_for_schedule (const void *, const void *);
 static void swap_sort (rtx_insn **, int);
 static void queue_insn (rtx_insn *, int, const char *);
@@ -1184,6 +1185,12 @@ update_insn_after_change (rtx_insn *insn)
   INSN_COST (insn) = -1;
   /* Invalidate INSN_TICK, so it'll be recalculated.  */
   INSN_TICK (insn) = INVALID_TICK;
+
+  /* Invalidate autoprefetch data entry.  */
+  INSN_AUTOPREF_MULTIPASS_DATA (insn)[0].status
+    = AUTOPREF_MULTIPASS_DATA_UNINITIALIZED;
+  INSN_AUTOPREF_MULTIPASS_DATA (insn)[1].status
+    = AUTOPREF_MULTIPASS_DATA_UNINITIALIZED;
 }
 
 
@@ -2724,6 +2731,13 @@ rank_for_schedule (const void *x, const void *y)
   if (flag_sched_critical_path_heuristic && priority_val)
     return rfs_result (RFS_PRIORITY, priority_val, tmp, tmp2);
 
+  if (PARAM_VALUE (PARAM_SCHED_AUTOPREF_QUEUE_DEPTH) >= 0)
+    {
+      int autopref = autopref_rank_for_schedule (tmp, tmp2);
+      if (autopref != 0)
+	return autopref;
+    }
+
   /* Prefer speculative insn with greater dependencies weakness.  */
   if (flag_sched_spec_insn_heuristic && spec_info)
     {
@@ -5500,6 +5514,241 @@ insn_finishes_cycle_p (rtx_insn *insn)
   return false;
 }
 
+/* Functions to model cache auto-prefetcher.
+
+   Some of the CPUs have cache auto-prefetcher, which /seems/ to initiate
+   memory prefetches if it sees instructions with consequitive memory accesses
+   in the instruction stream.  Details of such hardware units are not published,
+   so we can only guess what exactly is going on there.
+   In the scheduler, we model abstract auto-prefetcher.  If there are memory
+   insns in the ready list (or the queue) that have same memory base, but
+   different offsets, then we delay the insns with larger offsets until insns
+   with smaller offsets get scheduled.  If PARAM_SCHED_AUTOPREF_QUEUE_DEPTH
+   is "1", then we look at the ready list; if it is N>1, then we also look
+   through N-1 queue entries.
+   If the param is N>=0, then rank_for_schedule will consider auto-prefetching
+   among its heuristics.
+   Param value of "-1" disables modelling of the auto-prefetcher.  */
+
+/* Initialize autoprefetcher model data for INSN.  */
+static void
+autopref_multipass_init (const rtx_insn *insn, int write)
+{
+  autopref_multipass_data_t data = &INSN_AUTOPREF_MULTIPASS_DATA (insn)[write];
+
+  gcc_assert (data->status == AUTOPREF_MULTIPASS_DATA_UNINITIALIZED);
+  data->base = NULL_RTX;
+  data->offset = 0;
+  /* Set insn entry initialized, but not relevant for auto-prefetcher.  */
+  data->status = AUTOPREF_MULTIPASS_DATA_IRRELEVANT;
+
+  rtx set = single_set (insn);
+  if (set == NULL_RTX)
+    return;
+
+  rtx mem = write ? SET_DEST (set) : SET_SRC (set);
+  if (!MEM_P (mem))
+    return;
+
+  struct address_info info;
+  decompose_mem_address (&info, mem);
+
+  /* TODO: Currently only (base+const) addressing is supported.  */
+  if (info.base == NULL || !REG_P (*info.base)
+      || (info.disp != NULL && !CONST_INT_P (*info.disp)))
+    return;
+
+  /* This insn is relevant for auto-prefetcher.  */
+  data->base = *info.base;
+  data->offset = info.disp ? INTVAL (*info.disp) : 0;
+  data->status = AUTOPREF_MULTIPASS_DATA_NORMAL;
+}
+
+/* Helper function for rank_for_schedule sorting.  */
+static int
+autopref_rank_for_schedule (const rtx_insn *insn1, const rtx_insn *insn2)
+{
+  for (int write = 0; write < 2; ++write)
+    {
+      autopref_multipass_data_t data1
+	= &INSN_AUTOPREF_MULTIPASS_DATA (insn1)[write];
+      autopref_multipass_data_t data2
+	= &INSN_AUTOPREF_MULTIPASS_DATA (insn2)[write];
+
+      if (data1->status == AUTOPREF_MULTIPASS_DATA_UNINITIALIZED)
+	autopref_multipass_init (insn1, write);
+      if (data1->status == AUTOPREF_MULTIPASS_DATA_IRRELEVANT)
+	continue;
+
+      if (data2->status == AUTOPREF_MULTIPASS_DATA_UNINITIALIZED)
+	autopref_multipass_init (insn2, write);
+      if (data2->status == AUTOPREF_MULTIPASS_DATA_IRRELEVANT)
+	continue;
+
+      if (!rtx_equal_p (data1->base, data2->base))
+	continue;
+
+      return data1->offset - data2->offset;
+    }
+
+  return 0;
+}
+
+/* True if header of debug dump was printed.  */
+static bool autopref_multipass_dfa_lookahead_guard_started_dump_p;
+
+/* Helper for autopref_multipass_dfa_lookahead_guard.
+   Return "1" if INSN1 should be delayed in favor of INSN2.  */
+static int
+autopref_multipass_dfa_lookahead_guard_1 (const rtx_insn *insn1,
+					  const rtx_insn *insn2, int write)
+{
+  autopref_multipass_data_t data1
+    = &INSN_AUTOPREF_MULTIPASS_DATA (insn1)[write];
+  autopref_multipass_data_t data2
+    = &INSN_AUTOPREF_MULTIPASS_DATA (insn2)[write];
+
+  if (data2->status == AUTOPREF_MULTIPASS_DATA_UNINITIALIZED)
+    autopref_multipass_init (insn2, write);
+  if (data2->status == AUTOPREF_MULTIPASS_DATA_IRRELEVANT)
+    return 0;
+
+  if (rtx_equal_p (data1->base, data2->base)
+      && data1->offset > data2->offset)
+    {
+      if (sched_verbose >= 2)
+	{
+          if (!autopref_multipass_dfa_lookahead_guard_started_dump_p)
+	    {
+	      fprintf (sched_dump,
+		       ";;\t\tnot trying in max_issue due to autoprefetch "
+		       "model: ");
+	      autopref_multipass_dfa_lookahead_guard_started_dump_p = true;
+	    }
+
+	  fprintf (sched_dump, " %d(%d)", INSN_UID (insn1), INSN_UID (insn2));
+	}
+
+      return 1;
+    }
+
+  return 0;
+}
+
+/* General note:
+
+   We could have also hooked autoprefetcher model into
+   first_cycle_multipass_backtrack / first_cycle_multipass_issue hooks
+   to enable intelligent selection of "[r1+0]=r2; [r1+4]=r3" on the same cycle
+   (e.g., once "[r1+0]=r2" is issued in max_issue(), "[r1+4]=r3" gets
+   unblocked).  We don't bother about this yet because target of interest
+   (ARM Cortex-A15) can issue only 1 memory operation per cycle.  */
+
+/* Implementation of first_cycle_multipass_dfa_lookahead_guard hook.
+   Return "1" if INSN1 should not be considered in max_issue due to
+   auto-prefetcher considerations.  */
+int
+autopref_multipass_dfa_lookahead_guard (rtx_insn *insn1, int ready_index)
+{
+  int r = 0;
+
+  if (PARAM_VALUE (PARAM_SCHED_AUTOPREF_QUEUE_DEPTH) <= 0)
+    return 0;
+
+  if (sched_verbose >= 2 && ready_index == 0)
+    autopref_multipass_dfa_lookahead_guard_started_dump_p = false;
+
+  for (int write = 0; write < 2; ++write)
+    {
+      autopref_multipass_data_t data1
+	= &INSN_AUTOPREF_MULTIPASS_DATA (insn1)[write];
+
+      if (data1->status == AUTOPREF_MULTIPASS_DATA_UNINITIALIZED)
+	autopref_multipass_init (insn1, write);
+      if (data1->status == AUTOPREF_MULTIPASS_DATA_IRRELEVANT)
+	continue;
+
+      if (ready_index == 0
+	  && data1->status == AUTOPREF_MULTIPASS_DATA_DONT_DELAY)
+	/* We allow only a single delay on priviledged instructions.
+	   Doing otherwise would cause infinite loop.  */
+	{
+	  if (sched_verbose >= 2)
+	    {
+	      if (!autopref_multipass_dfa_lookahead_guard_started_dump_p)
+		{
+		  fprintf (sched_dump,
+			   ";;\t\tnot trying in max_issue due to autoprefetch "
+			   "model: ");
+		  autopref_multipass_dfa_lookahead_guard_started_dump_p = true;
+		}
+
+	      fprintf (sched_dump, " *%d*", INSN_UID (insn1));
+	    }
+	  continue;
+	}
+
+      for (int i2 = 0; i2 < ready.n_ready; ++i2)
+	{
+	  rtx_insn *insn2 = get_ready_element (i2);
+	  if (insn1 == insn2)
+	    continue;
+	  r = autopref_multipass_dfa_lookahead_guard_1 (insn1, insn2, write);
+	  if (r)
+	    {
+	      if (ready_index == 0)
+		{
+		  r = -1;
+		  data1->status = AUTOPREF_MULTIPASS_DATA_DONT_DELAY;
+		}
+	      goto finish;
+	    }
+	}
+
+      if (PARAM_VALUE (PARAM_SCHED_AUTOPREF_QUEUE_DEPTH) == 1)
+	continue;
+
+      /* Everything from the current queue slot should have been moved to
+	 the ready list.  */
+      gcc_assert (insn_queue[NEXT_Q_AFTER (q_ptr, 0)] == NULL_RTX);
+
+      int n_stalls = PARAM_VALUE (PARAM_SCHED_AUTOPREF_QUEUE_DEPTH) - 1;
+      if (n_stalls > max_insn_queue_index)
+	n_stalls = max_insn_queue_index;
+
+      for (int stalls = 1; stalls <= n_stalls; ++stalls)
+	{
+	  for (rtx_insn_list *link = insn_queue[NEXT_Q_AFTER (q_ptr, stalls)];
+	       link != NULL_RTX;
+	       link = link->next ())
+	    {
+	      rtx_insn *insn2 = link->insn ();
+	      r = autopref_multipass_dfa_lookahead_guard_1 (insn1, insn2,
+							    write);
+	      if (r)
+		{
+		  /* Queue INSN1 until INSN2 can issue.  */
+		  r = -stalls;
+		  if (ready_index == 0)
+		    data1->status = AUTOPREF_MULTIPASS_DATA_DONT_DELAY;
+		  goto finish;
+		}
+	    }
+	}
+    }
+
+    finish:
+  if (sched_verbose >= 2
+      && autopref_multipass_dfa_lookahead_guard_started_dump_p
+      && (ready_index == ready.n_ready - 1 || r < 0))
+    /* This does not /always/ trigger.  We don't output EOL if the last
+       insn is not recognized (INSN_CODE < 0) and lookahead_guard is not
+       called.  We can live with this.  */
+    fprintf (sched_dump, "\n");
+
+  return r;
+}
+
 /* Define type for target data used in multipass scheduling.  */
 #ifndef TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DATA_T
 # define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DATA_T int
@@ -8710,6 +8959,10 @@ init_h_i_d (rtx_insn *insn)
       INSN_EXACT_TICK (insn) = INVALID_TICK;
       INTER_TICK (insn) = INVALID_TICK;
       TODO_SPEC (insn) = HARD_DEP;
+      INSN_AUTOPREF_MULTIPASS_DATA (insn)[0].status
+	= AUTOPREF_MULTIPASS_DATA_UNINITIALIZED;
+      INSN_AUTOPREF_MULTIPASS_DATA (insn)[1].status
+	= AUTOPREF_MULTIPASS_DATA_UNINITIALIZED;
     }
 }
 
diff --git a/gcc/params.def b/gcc/params.def
index 3f69ce0..192c1e0 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -668,6 +668,11 @@ DEFPARAM (PARAM_SCHED_MEM_TRUE_DEP_COST,
 	  "Minimal distance between possibly conflicting store and load",
 	  1, 0, 0)
 
+DEFPARAM (PARAM_SCHED_AUTOPREF_QUEUE_DEPTH,
+	  "sched-autopref-queue-depth",
+	  "Hardware autoprefetcher scheduler model control flag.  Number of lookahead cycles the model looks into; at '0' only enable instruction sorting heuristic.  Disabled by default.",
+	  -1, 0, 0)
+
 DEFPARAM(PARAM_MAX_LAST_VALUE_RTL,
 	 "max-last-value-rtl",
 	 "The maximum number of RTL nodes that can be recorded as combiner's last value",
diff --git a/gcc/sched-int.h b/gcc/sched-int.h
index 9392d04..28e95ea 100644
--- a/gcc/sched-int.h
+++ b/gcc/sched-int.h
@@ -793,6 +793,32 @@ struct reg_set_data
   struct reg_set_data *next_insn_set;
 };
 
+enum autopref_multipass_data_status {
+  /* Entry is irrelevant for auto-prefetcher.  */
+  AUTOPREF_MULTIPASS_DATA_IRRELEVANT = -2,
+  /* Entry is uninitialized.  */
+  AUTOPREF_MULTIPASS_DATA_UNINITIALIZED = -1,
+  /* Entry is relevant for auto-prefetcher and insn can be delayed
+     to allow another insn through.  */
+  AUTOPREF_MULTIPASS_DATA_NORMAL = 0,
+  /* Entry is relevant for auto-prefetcher, but insn should not be
+     delayed as that will break scheduling.  */
+  AUTOPREF_MULTIPASS_DATA_DONT_DELAY = 1
+};
+
+/* Data for modeling cache auto-prefetcher.  */
+struct autopref_multipass_data_
+{
+  /* Base part of memory address.  */
+  rtx base;
+  /* Memory offset.  */
+  int offset;
+  /* Entry status.  */
+  enum autopref_multipass_data_status status;
+};
+typedef struct autopref_multipass_data_ autopref_multipass_data_def;
+typedef autopref_multipass_data_def *autopref_multipass_data_t;
+
 struct _haifa_insn_data
 {
   /* We can't place 'struct _deps_list' into h_i_d instead of deps_list_t
@@ -893,6 +919,10 @@ struct _haifa_insn_data
 
   /* The deciding reason for INSN's place in the ready list.  */
   int last_rfs_win;
+
+  /* Two entries for cache auto-prefetcher model: one for mem reads,
+     and one for mem writes.  */
+  autopref_multipass_data_def autopref_multipass_data[2];
 };
 
 typedef struct _haifa_insn_data haifa_insn_data_def;
@@ -915,6 +945,8 @@ extern vec<haifa_insn_data_def> h_i_d;
   (HID (INSN)->reg_pressure_excess_cost_change)
 #define INSN_PRIORITY_STATUS(INSN) (HID (INSN)->priority_status)
 #define INSN_MODEL_INDEX(INSN) (HID (INSN)->model_index)
+#define INSN_AUTOPREF_MULTIPASS_DATA(INSN) \
+  (HID (INSN)->autopref_multipass_data)
 
 typedef struct _haifa_deps_insn_data haifa_deps_insn_data_def;
 typedef haifa_deps_insn_data_def *haifa_deps_insn_data_t;
@@ -1363,6 +1395,8 @@ extern int cycle_issued_insns;
 extern int issue_rate;
 extern int dfa_lookahead;
 
+extern int autopref_multipass_dfa_lookahead_guard (rtx_insn *, int);
+
 extern void ready_sort (struct ready_list *);
 extern rtx_insn *ready_element (struct ready_list *, int);
 extern rtx_insn **ready_lastpos (struct ready_list *);
-- 
1.7.9.5


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
  2015-01-16 15:20       ` Maxim Kuvyrkov
@ 2015-01-16 15:34         ` Ramana Radhakrishnan
  2015-01-19 15:15         ` Richard Earnshaw
  1 sibling, 0 replies; 19+ messages in thread
From: Ramana Radhakrishnan @ 2015-01-16 15:34 UTC (permalink / raw)
  To: Maxim Kuvyrkov
  Cc: Ramana Radhakrishnan, Jeff Law, GCC Patches, Vladimir Makarov

On Fri, Jan 16, 2015 at 3:06 PM, Maxim Kuvyrkov
<maxim.kuvyrkov@linaro.org> wrote:
> On Nov 19, 2014, at 12:27 PM, Ramana Radhakrishnan <Ramana.Radhakrishnan@arm.com> wrote:
>
>>
>
> Hi Ramana,
> Hi Vladimir,
>
> I still don't have SPEC2000/SPEC2006 benchmark numbers for this patch.  Since stage3 is about to finish, I'm going to commit the target independent part of the patch now (it was approved by Vladimir a while ago).
>
> I'll commit ARM part of the patch once Ramana is happy with it.
>
> I've addressed all review comments and updated patch is attached.
>
> Any objections?
>
>> On 14/11/14 15:12, Maxim Kuvyrkov wrote:
>>> On Nov 14, 2014, at 8:38 AM, Jeff Law <law@redhat.com> wrote:
>>>
>>>> On 10/20/14 22:06, Maxim Kuvyrkov wrote:
>>>>> Hi,
>>>>> Ramana, this change requires benchmarking, which I can't easily do
>>>>> at
>>>> the moment. I would appreciate any benchmarking results that you can
>>>> share. In particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH
>>>> needs to be tuned/confirmed for Cortex-A15.
>>>> What were the results of that benchmarking?  IIRC I tabled reviewing this work waiting for those results (and I probably should have let you know that.  Sorry,  my bad there).
>>>
>>> I don't have the benchmarking results yet, and I was hoping for ARM to help with getting the numbers.  The arm maintainers still need to OK the arm-specific portion of the patch, which, I imagine, will happen only of benchmark scores improve.
>>
>>
>> I tried benchmarking 78f367cfcfdc9f0a422a362cd85ecc122834d96f from the trees you gave me links to against the equivalent version on trunk.
>>
>> The results with SPEC2k on A15 were in the noise with the default value for PARAM_SCHED_AUTOPREF_QUEUE_DEPTH which is 2 in the backend. I'm still waiting on results for values 0, 1 and 3 and hopefully something will come back soon for SPEC2k.
>
> After going back to the original benchmark that exposed the autoprefetcher issue (STREAMS) the required value of the parameter to fix the regression turned out to be 5.
>
> I have also discovered that Cortex-A57 also benefits from this optimization.  Running STREAMS benchmark on Juno in AArch32 mode showed increased performance by 10-15%, which is not as big as Cortex-A15's 250-300%, but still significant.
>
> In this updated patch the parameter is set to maximum -- look through all scheduling queue of instructions in search of memory ops that are relevant to L2 autoprefetcher.  In practice the setting to maximum will not change much, since there are rarely many instructions in the queue.
>
>>
>>
>>>
>>> @@ -29903,6 +29915,20 @@ arm_first_cycle_multipass_dfa_lookahead (void)
>>>   return issue_rate > 1 ? issue_rate : 0;
>>> }
>>>
>>> +/* Enable modeling of Cortex-A15 L2 auto-prefetcher.  */
>>> +static int
>>> +arm_first_cycle_multipass_dfa_lookahead_guard (rtx insn, int ready_index)
>>> +{
>>> +  switch (arm_tune)
>>> +    {
>>> +    case cortexa15:
>>> +      return autopref_multipass_dfa_lookahead_guard (insn, ready_index);
>>> +
>>> +    default:
>>> +      return 0;
>>> +    }
>>> +}
>>> +
>>
>>
>> It would be better to have this as a flag in the tuning tables rather than hardcoding for a core here. The backend has been moving in that direction for all core centric information and it is preferable that be continued.
>>
>> So this logic here should just be
>>
>> if (current_tune->multipass_lookahead)
>>  return autopref_multipass_lookahead_guard (insn, ready_index);
>> else
>>  return 0;
>>
>
> Done.  The autoprefetcher is enabled in Cortex-A15 and Cortex-A57 tuning tables.


The ARM changes look sane. As long as you've tested this with a
bootstrap, I'm ok with it - we probably need a similar change for
AArch64 and Cortex-A57 but I think we should do that for the next
stage1. Thanks for persevering with this.


Ramana

>
> --
> Maxim Kuvyrkov
> www.linaro.org
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
  2015-01-16 15:20       ` Maxim Kuvyrkov
  2015-01-16 15:34         ` Ramana Radhakrishnan
@ 2015-01-19 15:15         ` Richard Earnshaw
  2015-01-19 18:17           ` Maxim Kuvyrkov
  1 sibling, 1 reply; 19+ messages in thread
From: Richard Earnshaw @ 2015-01-19 15:15 UTC (permalink / raw)
  To: Maxim Kuvyrkov, Ramana Radhakrishnan
  Cc: Jeff Law, GCC Patches, Vladimir Makarov

On 16/01/15 15:06, Maxim Kuvyrkov wrote:
> @@ -1874,7 +1889,8 @@ const struct tune_params arm_cortex_a15_tune =
>    true, true,                                   /* Prefer 32-bit encodings.  */
>    true,						/* Prefer Neon for stringops.  */
>    8,						/* Maximum insns to inline memset.  */
> -  ARM_FUSE_NOTHING				/* Fuseable pairs of instructions.  */
> +  ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
> +  max_insn_queue_index + 1			/* Sched L2 autopref depth.  */
>  };

Hmm, two issues here:
1) This requires a static constructor for the tuning table entry (since
the value of max_insn_queue_index has to be looked up at run time.

2) Doesn't this mean that the depth of searching will depend on
properties of the automata rather than some machine specific values (so
that potentially adding or removing unrelated scheduler rules could
change the behaviour of the compiler)?

In general, how should someone tuning the compiler for this parameter
select a value that isn't one of (-1, m_i_q_d+1)?

R.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
  2015-01-19 15:15         ` Richard Earnshaw
@ 2015-01-19 18:17           ` Maxim Kuvyrkov
  2015-01-20  9:13             ` Ramana Radhakrishnan
  2015-01-20 10:53             ` Richard Earnshaw
  0 siblings, 2 replies; 19+ messages in thread
From: Maxim Kuvyrkov @ 2015-01-19 18:17 UTC (permalink / raw)
  To: Richard Earnshaw
  Cc: Ramana Radhakrishnan, Jeff Law, GCC Patches, Vladimir Makarov

On Jan 19, 2015, at 6:05 PM, Richard Earnshaw <rearnsha@arm.com> wrote:

> On 16/01/15 15:06, Maxim Kuvyrkov wrote:
>> @@ -1874,7 +1889,8 @@ const struct tune_params arm_cortex_a15_tune =
>>   true, true,                                   /* Prefer 32-bit encodings.  */
>>   true,						/* Prefer Neon for stringops.  */
>>   8,						/* Maximum insns to inline memset.  */
>> -  ARM_FUSE_NOTHING				/* Fuseable pairs of instructions.  */
>> +  ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
>> +  max_insn_queue_index + 1			/* Sched L2 autopref depth.  */
>> };
> 
> 
> Hmm, two issues here:
> 1) This requires a static constructor for the tuning table entry (since
> the value of max_insn_queue_index has to be looked up at run time.

Are you sure?  I didn't check the object files, but, since max_insn_queue_index is a "const int", I would expect a relocation that would be resolved at link time, not a constructor.

Is it a problem to have a static constructor for the tables?

> 
> 2) Doesn't this mean that the depth of searching will depend on
> properties of the automata rather than some machine specific values (so
> that potentially adding or removing unrelated scheduler rules could
> change the behaviour of the compiler)?

No.  The extra queue entries that will appear from extending an unrelated automaton will be empty, so the search will check them, but won't find anything.

> 
> In general, how should someone tuning the compiler for this parameter
> select a value that isn't one of (-1, m_i_q_d+1)?

From my experiments it seems there are 4 reasonable values for the parameter: (-1) autopref turned off, (0) turned on in rank_for_schedule, (m_i_q_d+1) turned on everywhere.  If there is a static constructor generated for tune tables and it is a problem to have it -- I can shrink acceptable values to these 3 and call it a day.

--
Maxim Kuvyrkov
www.linaro.org

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
  2015-01-19 18:17           ` Maxim Kuvyrkov
@ 2015-01-20  9:13             ` Ramana Radhakrishnan
  2015-01-20 10:53             ` Richard Earnshaw
  1 sibling, 0 replies; 19+ messages in thread
From: Ramana Radhakrishnan @ 2015-01-20  9:13 UTC (permalink / raw)
  To: Maxim Kuvyrkov, Richard Earnshaw; +Cc: Jeff Law, GCC Patches, Vladimir Makarov



On 19/01/15 18:14, Maxim Kuvyrkov wrote:
> On Jan 19, 2015, at 6:05 PM, Richard Earnshaw <rearnsha@arm.com> wrote:
>
>> On 16/01/15 15:06, Maxim Kuvyrkov wrote:
>>> @@ -1874,7 +1889,8 @@ const struct tune_params arm_cortex_a15_tune =
>>>    true, true,                                   /* Prefer 32-bit encodings.  */
>>>    true,						/* Prefer Neon for stringops.  */
>>>    8,						/* Maximum insns to inline memset.  */
>>> -  ARM_FUSE_NOTHING				/* Fuseable pairs of instructions.  */
>>> +  ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
>>> +  max_insn_queue_index + 1			/* Sched L2 autopref depth.  */
>>> };
>>
>>
>> Hmm, two issues here:
>> 1) This requires a static constructor for the tuning table entry (since
>> the value of max_insn_queue_index has to be looked up at run time.
>
> Are you sure?  I didn't check the object files, but, since max_insn_queue_index is a "const int", I would expect a relocation that would be resolved at link time, not a constructor.

I did miss this during the original review, sorry about that.

In theory yes, but in practice apparently not. Probably something LTO 
might be able to handle, but in the absence of defaults to LTO I don't 
see how we can guarantee something like this.

I do see a static constructor being put in for this on a cross compiler 
build.


0000000000d79338 <_Z41__static_initialization_and_destruction_0ii>:

d79338:       55                      push   %rbp
   d79339:       48 89 e5                mov    %rsp,%rbp
   d7933c:       89 7d fc                mov    %edi,-0x4(%rbp)
   d7933f:       89 75 f8                mov    %esi,-0x8(%rbp)
   d79342:       83 7d fc 01             cmpl   $0x1,-0x4(%rbp)
   d79346:       75 27                   jne    d7936f 
<_Z41__static_initialization_and_destruction_0ii+0x37>
   d79348:       81 7d f8 ff ff 00 00    cmpl   $0xffff,-0x8(%rbp)
   d7934f:       75 1e                   jne    d7936f 
<_Z41__static_initialization_and_destruction_0ii+0x37>
   d79351:       8b 05 f1 43 48 00       mov    0x4843f1(%rip),%eax 
    # 11fd748 <max_insn_queue_index>
   d79357:       83 c0 01                add    $0x1,%eax
   d7935a:       89 05 b4 27 a9 00       mov    %eax,0xa927b4(%rip) 
    # 180bb14 <_ZL19arm_cortex_a15_tune+0x54>
   d79360:       8b 05 e2 43 48 00       mov    0x4843e2(%rip),%eax 
    # 11fd748 <max_insn_queue_index>
   d79366:       83 c0 01                add    $0x1,%eax
   d79369:       89 05 05 28 a9 00       mov    %eax,0xa92805(%rip) 
    # 180bb74 <_ZL19arm_cortex_a57_tune+0x54>
   d7936f:       5d                      pop    %rbp
   d79370:       c3                      retq

>
> Is it a problem to have a static constructor for the tables?
>
>>
>> 2) Doesn't this mean that the depth of searching will depend on
>> properties of the automata rather than some machine specific values (so
>> that potentially adding or removing unrelated scheduler rules could
>> change the behaviour of the compiler)?
>
> No.  The extra queue entries that will appear from extending an unrelated automaton will be empty, so the search will check them, but won't find anything.
>
>>
>> In general, how should someone tuning the compiler for this parameter
>> select a value that isn't one of (-1, m_i_q_d+1)?
>
>  From my experiments it seems there are 4 reasonable values for the parameter: (-1) autopref turned off, (0) turned on in rank_for_schedule, (m_i_q_d+1) turned on everywhere.  If there is a static constructor generated for tune tables and it is a problem to have it -- I can shrink acceptable values to these 3 and call it a day.

There doesn't seem to be anything documented in the coding conventions 
for this specifically. In the absence of any such documentation, I'd 
rather not grow static constructors for something like this.

Can we just shrink the acceptable values to 3 and just call it a day here ?


regards
Ramana

>
> --
> Maxim Kuvyrkov
> www.linaro.org
>
>
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
  2015-01-19 18:17           ` Maxim Kuvyrkov
  2015-01-20  9:13             ` Ramana Radhakrishnan
@ 2015-01-20 10:53             ` Richard Earnshaw
  2015-01-20 13:47               ` Maxim Kuvyrkov
  1 sibling, 1 reply; 19+ messages in thread
From: Richard Earnshaw @ 2015-01-20 10:53 UTC (permalink / raw)
  To: Maxim Kuvyrkov
  Cc: Ramana Radhakrishnan, Jeff Law, GCC Patches, Vladimir Makarov

On 19/01/15 18:14, Maxim Kuvyrkov wrote:
> On Jan 19, 2015, at 6:05 PM, Richard Earnshaw <rearnsha@arm.com> wrote:
> 
>> On 16/01/15 15:06, Maxim Kuvyrkov wrote:
>>> @@ -1874,7 +1889,8 @@ const struct tune_params arm_cortex_a15_tune =
>>>   true, true,                                   /* Prefer 32-bit encodings.  */
>>>   true,						/* Prefer Neon for stringops.  */
>>>   8,						/* Maximum insns to inline memset.  */
>>> -  ARM_FUSE_NOTHING				/* Fuseable pairs of instructions.  */
>>> +  ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
>>> +  max_insn_queue_index + 1			/* Sched L2 autopref depth.  */
>>> };
>>
>>
>> Hmm, two issues here:
>> 1) This requires a static constructor for the tuning table entry (since
>> the value of max_insn_queue_index has to be looked up at run time.
> 
> Are you sure?  I didn't check the object files, but, since max_insn_queue_index is a "const int", I would expect a relocation that would be resolved at link time, not a constructor.
> 

Yes, I'm sure.  Relocations can only resolve addresses of objects, not
their contents.  LTO might eliminate the need for the reloc, but
otherwise the compiler will never see the definition and will need to
create a static constructor.

> Is it a problem to have a static constructor for the tables?

Needing constructors means that the compiler can't put the object into
read-only sections of the image.  It's not a huge problem, but if there
are ways by which they can be avoided, that's likely to be preferable;
there's a small run-time overhead to running them.

> 
>>
>> 2) Doesn't this mean that the depth of searching will depend on
>> properties of the automata rather than some machine specific values (so
>> that potentially adding or removing unrelated scheduler rules could
>> change the behaviour of the compiler)?
> 
> No.  The extra queue entries that will appear from extending an unrelated automaton will be empty, so the search will check them, but won't find anything.
> 

OK, so there's just a minor performance cost of checking values that
never hit.


>>
>> In general, how should someone tuning the compiler for this parameter
>> select a value that isn't one of (-1, m_i_q_d+1)?
> 
> From my experiments it seems there are 4 reasonable values for the parameter: (-1) autopref turned off, (0) turned on in rank_for_schedule, (m_i_q_d+1) turned on everywhere.  If there is a static constructor generated for tune tables and it is a problem to have it -- I can shrink acceptable values to these 3 and call it a day.
> 

You only mention 3 values: what was the fourth?  It might be better then
to define a set of values that represent each of these cases and only
allow the tuning parameters to select one of those.  The init code then
uses that set to select how to set up the various parameters to meet
those goals.

So something like

ARM_SCHED_AUTOPREF_OFF
ARM_SCHED_AUTOPREF_RANK
ARM_SCHED_AUTOPREF_FULL


R.
> --
> Maxim Kuvyrkov
> www.linaro.org
> 
> 
> 
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
  2015-01-20 10:53             ` Richard Earnshaw
@ 2015-01-20 13:47               ` Maxim Kuvyrkov
  2015-01-20 13:48                 ` Richard Earnshaw
  2015-02-11 13:07                 ` Jiong Wang
  0 siblings, 2 replies; 19+ messages in thread
From: Maxim Kuvyrkov @ 2015-01-20 13:47 UTC (permalink / raw)
  To: Richard Earnshaw
  Cc: Ramana Radhakrishnan, Jeff Law, GCC Patches, Vladimir Makarov

[-- Attachment #1: Type: text/plain, Size: 1124 bytes --]

On Jan 20, 2015, at 1:24 PM, Richard Earnshaw <rearnsha@arm.com> wrote:
...
>>> In general, how should someone tuning the compiler for this parameter
>>> select a value that isn't one of (-1, m_i_q_d+1)?
>> 
>> From my experiments it seems there are 4 reasonable values for the parameter: (-1) autopref turned off, (0) turned on in rank_for_schedule, (m_i_q_d+1) turned on everywhere.  If there is a static constructor generated for tune tables and it is a problem to have it -- I can shrink acceptable values to these 3 and call it a day.
>> 
> 
> You only mention 3 values: what was the fourth?  

Typo.  No fourth.

> It might be better then
> to define a set of values that represent each of these cases and only
> allow the tuning parameters to select one of those.  The init code then
> uses that set to select how to set up the various parameters to meet
> those goals.
> 
> So something like
> 
> ARM_SCHED_AUTOPREF_OFF
> ARM_SCHED_AUTOPREF_RANK
> ARM_SCHED_AUTOPREF_FULL

A patch is attached.  I bootstrapped it on arm-linux-gnueabihf.  OK to apply?

--
Maxim Kuvyrkov
www.linaro.org


[-- Attachment #2: 0001-Use-enum-for-sched_autopref-tune-settings.patch --]
[-- Type: application/octet-stream, Size: 10265 bytes --]

From 9d9ee7c33210960970d0d78ccc7a16a58b392f85 Mon Sep 17 00:00:00 2001
From: Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org>
Date: Tue, 20 Jan 2015 12:30:37 +0000
Subject: [PATCH 1/3] Use enum for sched_autopref tune settings

	* config/arm/arm-protos.h (enum arm_sched_autopref): New constants.
	(struct tune_params): Use the enum.
	* arm.c (arm_*_tune): Update.
	(arm_option_override): Update.
---
 gcc/config/arm/arm-protos.h |    9 +++++++-
 gcc/config/arm/arm.c        |   51 +++++++++++++++++++++++++------------------
 2 files changed, 38 insertions(+), 22 deletions(-)

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 3db7e16..307babb 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -257,6 +257,13 @@ struct cpu_vec_costs {
 
 struct cpu_cost_table;
 
+enum arm_sched_autopref
+  {
+    ARM_SCHED_AUTOPREF_OFF,
+    ARM_SCHED_AUTOPREF_RANK,
+    ARM_SCHED_AUTOPREF_FULL
+  };
+
 struct tune_params
 {
   bool (*rtx_costs) (rtx, RTX_CODE, RTX_CODE, int *, bool);
@@ -292,7 +299,7 @@ struct tune_params
   /* Bitfield encoding the fuseable pairs of instructions.  */
   unsigned int fuseable_ops;
   /* Depth of scheduling queue to check for L2 autoprefetcher.  */
-  int sched_autopref_queue_depth;
+  enum arm_sched_autopref sched_autopref;
 };
 
 extern const struct tune_params *current_tune;
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index fddd770..34672ce 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -1697,7 +1697,7 @@ const struct tune_params arm_slowmul_tune =
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
   ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
-  -1						/* Sched L2 autopref depth.  */
+  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
 };
 
 const struct tune_params arm_fastmul_tune =
@@ -1718,7 +1718,7 @@ const struct tune_params arm_fastmul_tune =
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
   ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
-  -1						/* Sched L2 autopref depth.  */
+  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
 };
 
 /* StrongARM has early execution of branches, so a sequence that is worth
@@ -1742,7 +1742,7 @@ const struct tune_params arm_strongarm_tune =
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
   ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
-  -1						/* Sched L2 autopref depth.  */
+  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
 };
 
 const struct tune_params arm_xscale_tune =
@@ -1763,7 +1763,7 @@ const struct tune_params arm_xscale_tune =
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
   ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
-  -1						/* Sched L2 autopref depth.  */
+  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
 };
 
 const struct tune_params arm_9e_tune =
@@ -1784,7 +1784,7 @@ const struct tune_params arm_9e_tune =
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
   ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
-  -1						/* Sched L2 autopref depth.  */
+  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
 };
 
 const struct tune_params arm_v6t2_tune =
@@ -1805,7 +1805,7 @@ const struct tune_params arm_v6t2_tune =
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
   ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
-  -1						/* Sched L2 autopref depth.  */
+  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
 };
 
 /* Generic Cortex tuning.  Use more specific tunings if appropriate.  */
@@ -1827,7 +1827,7 @@ const struct tune_params arm_cortex_tune =
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
   ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
-  -1						/* Sched L2 autopref depth.  */
+  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
 };
 
 const struct tune_params arm_cortex_a8_tune =
@@ -1848,7 +1848,7 @@ const struct tune_params arm_cortex_a8_tune =
   true,						/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
   ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
-  -1						/* Sched L2 autopref depth.  */
+  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
 };
 
 const struct tune_params arm_cortex_a7_tune =
@@ -1869,7 +1869,7 @@ const struct tune_params arm_cortex_a7_tune =
   true,						/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
   ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
-  -1						/* Sched L2 autopref depth.  */
+  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
 };
 
 const struct tune_params arm_cortex_a15_tune =
@@ -1890,7 +1890,7 @@ const struct tune_params arm_cortex_a15_tune =
   true,						/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
   ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
-  max_insn_queue_index + 1			/* Sched L2 autopref depth.  */
+  ARM_SCHED_AUTOPREF_FULL			/* Sched L2 autopref.  */
 };
 
 const struct tune_params arm_cortex_a53_tune =
@@ -1911,7 +1911,7 @@ const struct tune_params arm_cortex_a53_tune =
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
   ARM_FUSE_MOVW_MOVT,				/* Fuseable pairs of instructions.  */
-  -1						/* Sched L2 autopref depth.  */
+  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
 };
 
 const struct tune_params arm_cortex_a57_tune =
@@ -1932,7 +1932,7 @@ const struct tune_params arm_cortex_a57_tune =
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
   ARM_FUSE_MOVW_MOVT,				/* Fuseable pairs of instructions.  */
-  max_insn_queue_index + 1			/* Sched L2 autopref depth.  */
+  ARM_SCHED_AUTOPREF_FULL			/* Sched L2 autopref.  */
 };
 
 const struct tune_params arm_xgene1_tune =
@@ -1953,7 +1953,7 @@ const struct tune_params arm_xgene1_tune =
   false,				       /* Prefer Neon for stringops.  */
   32,					       /* Maximum insns to inline memset.  */
   ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
-  -1						/* Sched L2 autopref depth.  */
+  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
 };
 
 /* Branches can be dual-issued on Cortex-A5, so conditional execution is
@@ -1977,7 +1977,7 @@ const struct tune_params arm_cortex_a5_tune =
   true,						/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
   ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
-  -1						/* Sched L2 autopref depth.  */
+  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
 };
 
 const struct tune_params arm_cortex_a9_tune =
@@ -1998,7 +1998,7 @@ const struct tune_params arm_cortex_a9_tune =
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
   ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
-  -1						/* Sched L2 autopref depth.  */
+  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
 };
 
 const struct tune_params arm_cortex_a12_tune =
@@ -2019,7 +2019,7 @@ const struct tune_params arm_cortex_a12_tune =
   true,						/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
   ARM_FUSE_MOVW_MOVT,				/* Fuseable pairs of instructions.  */
-  -1						/* Sched L2 autopref depth.  */
+  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
 };
 
 /* armv7m tuning.  On Cortex-M4 cores for example, MOVW/MOVT take a single
@@ -2047,7 +2047,7 @@ const struct tune_params arm_v7m_tune =
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
   ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
-  -1						/* Sched L2 autopref depth.  */
+  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
 };
 
 /* Cortex-M7 tuning.  */
@@ -2070,7 +2070,7 @@ const struct tune_params arm_cortex_m7_tune =
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
   ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
-  -1						/* Sched L2 autopref depth.  */
+  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
 };
 
 /* The arm_v6m_tune is duplicated from arm_cortex_tune, rather than
@@ -2093,7 +2093,7 @@ const struct tune_params arm_v6m_tune =
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
   ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
-  -1						/* Sched L2 autopref depth.  */
+  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
 };
 
 const struct tune_params arm_fa726te_tune =
@@ -2114,7 +2114,7 @@ const struct tune_params arm_fa726te_tune =
   false,					/* Prefer Neon for stringops.  */
   8,						/* Maximum insns to inline memset.  */
   ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
-  -1						/* Sched L2 autopref depth.  */
+  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
 };
 
 
@@ -3172,8 +3172,17 @@ arm_option_override (void)
 
   /* Look through ready list and all of queue for instructions
      relevant for L2 auto-prefetcher.  */
+  int param_sched_autopref_queue_depth;
+  if (current_tune->sched_autopref == ARM_SCHED_AUTOPREF_OFF)
+    param_sched_autopref_queue_depth = -1;
+  else if (current_tune->sched_autopref == ARM_SCHED_AUTOPREF_RANK)
+    param_sched_autopref_queue_depth = 0;
+  else if (current_tune->sched_autopref == ARM_SCHED_AUTOPREF_FULL)
+    param_sched_autopref_queue_depth = max_insn_queue_index + 1;
+  else
+    gcc_unreachable ();
   maybe_set_param_value (PARAM_SCHED_AUTOPREF_QUEUE_DEPTH,
-			 current_tune->sched_autopref_queue_depth,
+			 param_sched_autopref_queue_depth,
                          global_options.x_param_values,
                          global_options_set.x_param_values);
 
-- 
1.7.9.5


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
  2015-01-20 13:47               ` Maxim Kuvyrkov
@ 2015-01-20 13:48                 ` Richard Earnshaw
  2015-02-11 13:07                 ` Jiong Wang
  1 sibling, 0 replies; 19+ messages in thread
From: Richard Earnshaw @ 2015-01-20 13:48 UTC (permalink / raw)
  To: Maxim Kuvyrkov
  Cc: Ramana Radhakrishnan, Jeff Law, GCC Patches, Vladimir Makarov

On 20/01/15 13:26, Maxim Kuvyrkov wrote:
> On Jan 20, 2015, at 1:24 PM, Richard Earnshaw <rearnsha@arm.com> wrote:
> ...
>>>> In general, how should someone tuning the compiler for this parameter
>>>> select a value that isn't one of (-1, m_i_q_d+1)?
>>>
>>> From my experiments it seems there are 4 reasonable values for the parameter: (-1) autopref turned off, (0) turned on in rank_for_schedule, (m_i_q_d+1) turned on everywhere.  If there is a static constructor generated for tune tables and it is a problem to have it -- I can shrink acceptable values to these 3 and call it a day.
>>>
>>
>> You only mention 3 values: what was the fourth?  
> 
> Typo.  No fourth.
> 
>> It might be better then
>> to define a set of values that represent each of these cases and only
>> allow the tuning parameters to select one of those.  The init code then
>> uses that set to select how to set up the various parameters to meet
>> those goals.
>>
>> So something like
>>
>> ARM_SCHED_AUTOPREF_OFF
>> ARM_SCHED_AUTOPREF_RANK
>> ARM_SCHED_AUTOPREF_FULL
> 
> A patch is attached.  I bootstrapped it on arm-linux-gnueabihf.  OK to apply?
> 

OK.  Thanks.

R.

> --
> Maxim Kuvyrkov
> www.linaro.org
> 
> 
> 0001-Use-enum-for-sched_autopref-tune-settings.patch
> 
> 
> From 9d9ee7c33210960970d0d78ccc7a16a58b392f85 Mon Sep 17 00:00:00 2001
> From: Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org>
> Date: Tue, 20 Jan 2015 12:30:37 +0000
> Subject: [PATCH 1/3] Use enum for sched_autopref tune settings
> 
> 	* config/arm/arm-protos.h (enum arm_sched_autopref): New constants.
> 	(struct tune_params): Use the enum.
> 	* arm.c (arm_*_tune): Update.
> 	(arm_option_override): Update.
> ---
>  gcc/config/arm/arm-protos.h |    9 +++++++-
>  gcc/config/arm/arm.c        |   51 +++++++++++++++++++++++++------------------
>  2 files changed, 38 insertions(+), 22 deletions(-)
> 
> diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
> index 3db7e16..307babb 100644
> --- a/gcc/config/arm/arm-protos.h
> +++ b/gcc/config/arm/arm-protos.h
> @@ -257,6 +257,13 @@ struct cpu_vec_costs {
>  
>  struct cpu_cost_table;
>  
> +enum arm_sched_autopref
> +  {
> +    ARM_SCHED_AUTOPREF_OFF,
> +    ARM_SCHED_AUTOPREF_RANK,
> +    ARM_SCHED_AUTOPREF_FULL
> +  };
> +
>  struct tune_params
>  {
>    bool (*rtx_costs) (rtx, RTX_CODE, RTX_CODE, int *, bool);
> @@ -292,7 +299,7 @@ struct tune_params
>    /* Bitfield encoding the fuseable pairs of instructions.  */
>    unsigned int fuseable_ops;
>    /* Depth of scheduling queue to check for L2 autoprefetcher.  */
> -  int sched_autopref_queue_depth;
> +  enum arm_sched_autopref sched_autopref;
>  };
>  
>  extern const struct tune_params *current_tune;
> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
> index fddd770..34672ce 100644
> --- a/gcc/config/arm/arm.c
> +++ b/gcc/config/arm/arm.c
> @@ -1697,7 +1697,7 @@ const struct tune_params arm_slowmul_tune =
>    false,					/* Prefer Neon for stringops.  */
>    8,						/* Maximum insns to inline memset.  */
>    ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
> -  -1						/* Sched L2 autopref depth.  */
> +  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
>  };
>  
>  const struct tune_params arm_fastmul_tune =
> @@ -1718,7 +1718,7 @@ const struct tune_params arm_fastmul_tune =
>    false,					/* Prefer Neon for stringops.  */
>    8,						/* Maximum insns to inline memset.  */
>    ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
> -  -1						/* Sched L2 autopref depth.  */
> +  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
>  };
>  
>  /* StrongARM has early execution of branches, so a sequence that is worth
> @@ -1742,7 +1742,7 @@ const struct tune_params arm_strongarm_tune =
>    false,					/* Prefer Neon for stringops.  */
>    8,						/* Maximum insns to inline memset.  */
>    ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
> -  -1						/* Sched L2 autopref depth.  */
> +  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
>  };
>  
>  const struct tune_params arm_xscale_tune =
> @@ -1763,7 +1763,7 @@ const struct tune_params arm_xscale_tune =
>    false,					/* Prefer Neon for stringops.  */
>    8,						/* Maximum insns to inline memset.  */
>    ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
> -  -1						/* Sched L2 autopref depth.  */
> +  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
>  };
>  
>  const struct tune_params arm_9e_tune =
> @@ -1784,7 +1784,7 @@ const struct tune_params arm_9e_tune =
>    false,					/* Prefer Neon for stringops.  */
>    8,						/* Maximum insns to inline memset.  */
>    ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
> -  -1						/* Sched L2 autopref depth.  */
> +  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
>  };
>  
>  const struct tune_params arm_v6t2_tune =
> @@ -1805,7 +1805,7 @@ const struct tune_params arm_v6t2_tune =
>    false,					/* Prefer Neon for stringops.  */
>    8,						/* Maximum insns to inline memset.  */
>    ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
> -  -1						/* Sched L2 autopref depth.  */
> +  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
>  };
>  
>  /* Generic Cortex tuning.  Use more specific tunings if appropriate.  */
> @@ -1827,7 +1827,7 @@ const struct tune_params arm_cortex_tune =
>    false,					/* Prefer Neon for stringops.  */
>    8,						/* Maximum insns to inline memset.  */
>    ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
> -  -1						/* Sched L2 autopref depth.  */
> +  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
>  };
>  
>  const struct tune_params arm_cortex_a8_tune =
> @@ -1848,7 +1848,7 @@ const struct tune_params arm_cortex_a8_tune =
>    true,						/* Prefer Neon for stringops.  */
>    8,						/* Maximum insns to inline memset.  */
>    ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
> -  -1						/* Sched L2 autopref depth.  */
> +  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
>  };
>  
>  const struct tune_params arm_cortex_a7_tune =
> @@ -1869,7 +1869,7 @@ const struct tune_params arm_cortex_a7_tune =
>    true,						/* Prefer Neon for stringops.  */
>    8,						/* Maximum insns to inline memset.  */
>    ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
> -  -1						/* Sched L2 autopref depth.  */
> +  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
>  };
>  
>  const struct tune_params arm_cortex_a15_tune =
> @@ -1890,7 +1890,7 @@ const struct tune_params arm_cortex_a15_tune =
>    true,						/* Prefer Neon for stringops.  */
>    8,						/* Maximum insns to inline memset.  */
>    ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
> -  max_insn_queue_index + 1			/* Sched L2 autopref depth.  */
> +  ARM_SCHED_AUTOPREF_FULL			/* Sched L2 autopref.  */
>  };
>  
>  const struct tune_params arm_cortex_a53_tune =
> @@ -1911,7 +1911,7 @@ const struct tune_params arm_cortex_a53_tune =
>    false,					/* Prefer Neon for stringops.  */
>    8,						/* Maximum insns to inline memset.  */
>    ARM_FUSE_MOVW_MOVT,				/* Fuseable pairs of instructions.  */
> -  -1						/* Sched L2 autopref depth.  */
> +  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
>  };
>  
>  const struct tune_params arm_cortex_a57_tune =
> @@ -1932,7 +1932,7 @@ const struct tune_params arm_cortex_a57_tune =
>    false,					/* Prefer Neon for stringops.  */
>    8,						/* Maximum insns to inline memset.  */
>    ARM_FUSE_MOVW_MOVT,				/* Fuseable pairs of instructions.  */
> -  max_insn_queue_index + 1			/* Sched L2 autopref depth.  */
> +  ARM_SCHED_AUTOPREF_FULL			/* Sched L2 autopref.  */
>  };
>  
>  const struct tune_params arm_xgene1_tune =
> @@ -1953,7 +1953,7 @@ const struct tune_params arm_xgene1_tune =
>    false,				       /* Prefer Neon for stringops.  */
>    32,					       /* Maximum insns to inline memset.  */
>    ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
> -  -1						/* Sched L2 autopref depth.  */
> +  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
>  };
>  
>  /* Branches can be dual-issued on Cortex-A5, so conditional execution is
> @@ -1977,7 +1977,7 @@ const struct tune_params arm_cortex_a5_tune =
>    true,						/* Prefer Neon for stringops.  */
>    8,						/* Maximum insns to inline memset.  */
>    ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
> -  -1						/* Sched L2 autopref depth.  */
> +  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
>  };
>  
>  const struct tune_params arm_cortex_a9_tune =
> @@ -1998,7 +1998,7 @@ const struct tune_params arm_cortex_a9_tune =
>    false,					/* Prefer Neon for stringops.  */
>    8,						/* Maximum insns to inline memset.  */
>    ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
> -  -1						/* Sched L2 autopref depth.  */
> +  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
>  };
>  
>  const struct tune_params arm_cortex_a12_tune =
> @@ -2019,7 +2019,7 @@ const struct tune_params arm_cortex_a12_tune =
>    true,						/* Prefer Neon for stringops.  */
>    8,						/* Maximum insns to inline memset.  */
>    ARM_FUSE_MOVW_MOVT,				/* Fuseable pairs of instructions.  */
> -  -1						/* Sched L2 autopref depth.  */
> +  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
>  };
>  
>  /* armv7m tuning.  On Cortex-M4 cores for example, MOVW/MOVT take a single
> @@ -2047,7 +2047,7 @@ const struct tune_params arm_v7m_tune =
>    false,					/* Prefer Neon for stringops.  */
>    8,						/* Maximum insns to inline memset.  */
>    ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
> -  -1						/* Sched L2 autopref depth.  */
> +  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
>  };
>  
>  /* Cortex-M7 tuning.  */
> @@ -2070,7 +2070,7 @@ const struct tune_params arm_cortex_m7_tune =
>    false,					/* Prefer Neon for stringops.  */
>    8,						/* Maximum insns to inline memset.  */
>    ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
> -  -1						/* Sched L2 autopref depth.  */
> +  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
>  };
>  
>  /* The arm_v6m_tune is duplicated from arm_cortex_tune, rather than
> @@ -2093,7 +2093,7 @@ const struct tune_params arm_v6m_tune =
>    false,					/* Prefer Neon for stringops.  */
>    8,						/* Maximum insns to inline memset.  */
>    ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
> -  -1						/* Sched L2 autopref depth.  */
> +  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
>  };
>  
>  const struct tune_params arm_fa726te_tune =
> @@ -2114,7 +2114,7 @@ const struct tune_params arm_fa726te_tune =
>    false,					/* Prefer Neon for stringops.  */
>    8,						/* Maximum insns to inline memset.  */
>    ARM_FUSE_NOTHING,				/* Fuseable pairs of instructions.  */
> -  -1						/* Sched L2 autopref depth.  */
> +  ARM_SCHED_AUTOPREF_OFF			/* Sched L2 autopref.  */
>  };
>  
>  
> @@ -3172,8 +3172,17 @@ arm_option_override (void)
>  
>    /* Look through ready list and all of queue for instructions
>       relevant for L2 auto-prefetcher.  */
> +  int param_sched_autopref_queue_depth;
> +  if (current_tune->sched_autopref == ARM_SCHED_AUTOPREF_OFF)
> +    param_sched_autopref_queue_depth = -1;
> +  else if (current_tune->sched_autopref == ARM_SCHED_AUTOPREF_RANK)
> +    param_sched_autopref_queue_depth = 0;
> +  else if (current_tune->sched_autopref == ARM_SCHED_AUTOPREF_FULL)
> +    param_sched_autopref_queue_depth = max_insn_queue_index + 1;
> +  else
> +    gcc_unreachable ();
>    maybe_set_param_value (PARAM_SCHED_AUTOPREF_QUEUE_DEPTH,
> -			 current_tune->sched_autopref_queue_depth,
> +			 param_sched_autopref_queue_depth,
>                           global_options.x_param_values,
>                           global_options_set.x_param_values);
>  
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
  2015-01-20 13:47               ` Maxim Kuvyrkov
  2015-01-20 13:48                 ` Richard Earnshaw
@ 2015-02-11 13:07                 ` Jiong Wang
  1 sibling, 0 replies; 19+ messages in thread
From: Jiong Wang @ 2015-02-11 13:07 UTC (permalink / raw)
  To: Maxim Kuvyrkov, Richard Earnshaw
  Cc: Ramana Radhakrishnan, Jeff Law, GCC Patches, Vladimir Makarov


On 20/01/15 13:26, Maxim Kuvyrkov wrote:
> On Jan 20, 2015, at 1:24 PM, Richard Earnshaw <rearnsha@arm.com> wrote:
> ...
>>>> In general, how should someone tuning the compiler for this parameter
>>>> select a value that isn't one of (-1, m_i_q_d+1)?
>>>  From my experiments it seems there are 4 reasonable values for the parameter: (-1) autopref turned off, (0) turned on in rank_for_schedule, (m_i_q_d+1) turned on everywhere.  If there is a static constructor generated for tune tables and it is a problem to have it -- I can shrink acceptable values to these 3 and call it a day.
>>>
>> You only mention 3 values: what was the fourth?
> Typo.  No fourth.
>
>> It might be better then
>> to define a set of values that represent each of these cases and only
>> allow the tuning parameters to select one of those.  The init code then
>> uses that set to select how to set up the various parameters to meet
>> those goals.
>>
>> So something like
>>
>> ARM_SCHED_AUTOPREF_OFF
>> ARM_SCHED_AUTOPREF_RANK
>> ARM_SCHED_AUTOPREF_FULL
> A patch is attached.  I bootstrapped it on arm-linux-gnueabihf.  OK to apply?
bootstrap failure on chromebook, reproduced on two chromebook.

see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65020 please.


>
> --
> Maxim Kuvyrkov
> www.linaro.org
>


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2015-02-11 13:07 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-21  4:09 [PATCH 7/8] Model cache auto-prefetcher in scheduler Maxim Kuvyrkov
2014-10-21  5:44 ` Andrew Pinski
2014-11-10 13:15 ` Maxim Kuvyrkov
2014-11-14  2:16 ` Vladimir Makarov
2014-11-14 15:10   ` Maxim Kuvyrkov
2014-11-14  6:46 ` Jeff Law
2014-11-14 15:24   ` Maxim Kuvyrkov
2014-11-14 17:51     ` Jeff Law
2014-11-19  9:40     ` Ramana Radhakrishnan
2014-11-19 10:14       ` Maxim Kuvyrkov
2015-01-16 15:20       ` Maxim Kuvyrkov
2015-01-16 15:34         ` Ramana Radhakrishnan
2015-01-19 15:15         ` Richard Earnshaw
2015-01-19 18:17           ` Maxim Kuvyrkov
2015-01-20  9:13             ` Ramana Radhakrishnan
2015-01-20 10:53             ` Richard Earnshaw
2015-01-20 13:47               ` Maxim Kuvyrkov
2015-01-20 13:48                 ` Richard Earnshaw
2015-02-11 13:07                 ` Jiong Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).