public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH 0/6] Improve -fprefetch-loop-arrays in general and for AArch64 in particular
@ 2017-01-30 11:27 Maxim Kuvyrkov
  2017-01-30 11:35 ` [PATCH 1/6] Add debug counter for loop array prefetching Maxim Kuvyrkov
                   ` (7 more replies)
  0 siblings, 8 replies; 32+ messages in thread
From: Maxim Kuvyrkov @ 2017-01-30 11:27 UTC (permalink / raw)
  To: GCC Patches; +Cc: Kyrylo Tkachov, Andrew Pinski, Richard Guenther

This patch series improves the -fprefetch-loop-arrays pass through small fixes and tweaks, and then enables it for several AArch64 cores.

My tunings were done on and for Qualcomm hardware, with results varying between +0.5%-1.9% for SPEC2006 INT and +0.25%-1.0% for SPEC2006 FP at -O3, depending on hardware revision.

This patch series enables restricted -fprefetch-loop-arrays at -O3, which also improves SPEC2006 numbers.

The biggest progressions are on 429.mcf and 437.leslie3d, with no serious regressions on other benchmarks.

I'm now investigating making -fprefetch-loop-arrays more aggressive for Qualcomm hardware, which improves performance on most benchmarks, but also causes big regressions on 454.calculix and 462.libquantum.  If I can fix these two regressions, prefetching will give another boost to AArch64.

Andrew just posted similar prefetching tunings for Cavium's cores, and the two patch series have trivial conflicts.  I'll post mine as-is, since it addresses one of the comments on Andrew's review (adding a stand-alone struct for tuning parameters).

Andrew, feel free to just copy-paste it to your patch, since it is just a mechanical change.

All patches were bootstrapped and regtested on x86_64-linux-gnu and aarch64-linux-gnu.
 
--
Maxim Kuvyrkov
www.linaro.org



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 1/6] Add debug counter for loop array prefetching.
  2017-01-30 11:27 [PATCH 0/6] Improve -fprefetch-loop-arrays in general and for AArch64 in particular Maxim Kuvyrkov
@ 2017-01-30 11:35 ` Maxim Kuvyrkov
  2017-01-30 12:33   ` Richard Biener
  2017-01-30 11:43 ` [PATCH 2/6] Improve debug output of loop data prefetching Maxim Kuvyrkov
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 32+ messages in thread
From: Maxim Kuvyrkov @ 2017-01-30 11:35 UTC (permalink / raw)
  To: GCC Patches; +Cc: Kyrylo Tkachov, Andrew Pinski, Richard Guenther

[-- Attachment #1: Type: text/plain, Size: 274 bytes --]

This patch adds a debug counter to the -fprefetch-loop-arrays pass.  It can be activated with "-fdbg-cnt=prefetch:10" to allow only the first 10 prefetches to be issued.

Bootstrapped and regtested on x86_64-linux-gnu and aarch64-linux-gnu.

--
Maxim Kuvyrkov
www.linaro.org


[-- Attachment #2: 0001-Add-debug-counter-for-loop-array-prefetching.patch --]
[-- Type: application/octet-stream, Size: 1845 bytes --]

From c9e4da0085e6b00246c1d06a6b14d96e57ae34dd Mon Sep 17 00:00:00 2001
From: Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org>
Date: Thu, 15 Dec 2016 13:00:15 +0000
Subject: [PATCH 1/6] Add debug counter for loop array prefetching.

	* dbgcnt.def (prefetch): New debug counter.
	* tree-ssa-loop-prefetch.c (dbgcnt.h): New include.
	(schedule_prefetches): Stop issuing prefetches if debug counter
	tripped.

Change-Id: Ia6160364735a889fb6d2370be21d98f87c0fe962
---
 gcc/dbgcnt.def               | 1 +
 gcc/tree-ssa-loop-prefetch.c | 5 +++++
 2 files changed, 6 insertions(+)

diff --git a/gcc/dbgcnt.def b/gcc/dbgcnt.def
index ded20e0..6c63508 100644
--- a/gcc/dbgcnt.def
+++ b/gcc/dbgcnt.def
@@ -174,6 +174,7 @@ DEBUG_COUNTER (merged_ipa_icf)
 DEBUG_COUNTER (postreload_cse)
 DEBUG_COUNTER (pre)
 DEBUG_COUNTER (pre_insn)
+DEBUG_COUNTER (prefetch)
 DEBUG_COUNTER (registered_jump_thread)
 DEBUG_COUNTER (sched2_func)
 DEBUG_COUNTER (sched_block)
diff --git a/gcc/tree-ssa-loop-prefetch.c b/gcc/tree-ssa-loop-prefetch.c
index 54cd942..69251d5 100644
--- a/gcc/tree-ssa-loop-prefetch.c
+++ b/gcc/tree-ssa-loop-prefetch.c
@@ -47,6 +47,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "langhooks.h"
 #include "tree-inline.h"
 #include "tree-data-ref.h"
+#include "dbgcnt.h"
 
 /* This pass inserts prefetch instructions to optimize cache usage during
    accesses to arrays in loops.  It processes loops sequentially and:
@@ -1057,6 +1058,10 @@ schedule_prefetches (struct mem_ref_group *groups, unsigned unroll_factor,
 	if (2 * remaining_prefetch_slots < prefetch_slots)
 	  continue;
 
+	/* Stop prefetching if debug counter is activated.  */
+	if (!dbg_cnt (prefetch))
+	  continue;
+
 	ref->issue_prefetch_p = true;
 
 	if (remaining_prefetch_slots <= prefetch_slots)
-- 
2.7.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 2/6] Improve debug output of loop data prefetching.
  2017-01-30 11:27 [PATCH 0/6] Improve -fprefetch-loop-arrays in general and for AArch64 in particular Maxim Kuvyrkov
  2017-01-30 11:35 ` [PATCH 1/6] Add debug counter for loop array prefetching Maxim Kuvyrkov
@ 2017-01-30 11:43 ` Maxim Kuvyrkov
  2017-01-30 12:37   ` Richard Biener
  2017-01-30 11:47 ` [PATCH 3/6] Fix prefetch heuristic calculation Maxim Kuvyrkov
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 32+ messages in thread
From: Maxim Kuvyrkov @ 2017-01-30 11:43 UTC (permalink / raw)
  To: GCC Patches; +Cc: Kyrylo Tkachov, Andrew Pinski, Richard Guenther

[-- Attachment #1: Type: text/plain, Size: 762 bytes --]

Current debug output from -fprefetch-loop-arrays refers to prefetching instances by their (void *) address, which makes it painful to compare dumps, e.g., when investigating how different parameter values affect prefetching decisions.

This patch adds UIDs to two main prefetching concepts: mem_ref_group and mem_ref.  [Mem_refs are combined into mem_ref_groups so that they can re-use prefetches.]  Debug output is then changed to identify prefetch opportunities as <group>:<ref> instead of 0xNNNNNNNN.  Believe me, it makes a world of difference for debugging tree-ssa-loop-prefetch.c.

There is no change in code-generation from this patch.

Bootstrapped and regtested on x86_64-linux-gnu and aarch64-linux-gnu.

--
Maxim Kuvyrkov
www.linaro.org


[-- Attachment #2: 0002-Improve-debug-output-of-loop-data-prefetching.patch --]
[-- Type: application/octet-stream, Size: 9060 bytes --]

From 090b1ec114526082516b05ea90c3b706c5432738 Mon Sep 17 00:00:00 2001
From: Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org>
Date: Thu, 15 Dec 2016 13:08:37 +0000
Subject: [PATCH 2/6] Improve debug output of loop data prefetching.

	* tree-ssa-loop-prefetch.c (struct mem_ref_group, struct mem_ref):
	New "uid" fields to hold pretty-print IDs of group and ref.
	Memory references are now identified as <group_id>:<ref_id>
	instead of using [random] addresses.
	(dump_mem_details): Simplify, no functional change.
	(dump_mem_ref): Simplify and make output more concise.
	Replace a couple of fprintf's throughout code with calls to dump_mem_ref.
	(find_or_create_group): Initialize group uid.
	(record_ref): Initialize ref uid.  Improve debug output.
	(prune_group_by_reuse, should_issue_prefetch_p,)
	(schedule_prefetches, issue_prefetch_ref,)
	(mark_nontemporal_store, determine_loop_nest_reuse):
	Improve debug output.

Change-Id: Ia594d5854de96183f3c5a669c161e5a9a73a29d7
---
 gcc/tree-ssa-loop-prefetch.c | 63 ++++++++++++++++++++++++++------------------
 1 file changed, 38 insertions(+), 25 deletions(-)

diff --git a/gcc/tree-ssa-loop-prefetch.c b/gcc/tree-ssa-loop-prefetch.c
index 69251d5..d025041 100644
--- a/gcc/tree-ssa-loop-prefetch.c
+++ b/gcc/tree-ssa-loop-prefetch.c
@@ -227,6 +227,7 @@ struct mem_ref_group
   tree step;			/* Step of the reference.  */
   struct mem_ref *refs;		/* References in the group.  */
   struct mem_ref_group *next;	/* Next group of references.  */
+  unsigned int uid;		/* Group UID, used only for debugging.  */
 };
 
 /* Assigned to PREFETCH_BEFORE when all iterations are to be prefetched.  */
@@ -269,6 +270,7 @@ struct mem_ref
   unsigned reuse_distance;	/* The amount of data accessed before the first
 				   reuse of this value.  */
   struct mem_ref *next;		/* The next reference in the group.  */
+  unsigned int uid;		/* Ref UID, used only for debugging.  */
   unsigned write_p : 1;		/* Is it a write?  */
   unsigned independent_p : 1;	/* True if the reference is independent on
 				   all other references inside the loop.  */
@@ -290,11 +292,8 @@ dump_mem_details (FILE *file, tree base, tree step,
   else
     print_generic_expr (file, step, TDF_TREE);
   fprintf (file, ")\n");
-  fprintf (file, "  delta ");
-  fprintf (file, HOST_WIDE_INT_PRINT_DEC, delta);
-  fprintf (file, "\n");
-  fprintf (file, "  %s\n", write_p ? "write" : "read");
-  fprintf (file, "\n");
+  fprintf (file, "  delta " HOST_WIDE_INT_PRINT_DEC "\n", delta);
+  fprintf (file, "  %s\n\n", write_p ? "write" : "read");
 }
 
 /* Dumps information about reference REF to FILE.  */
@@ -302,12 +301,9 @@ dump_mem_details (FILE *file, tree base, tree step,
 static void
 dump_mem_ref (FILE *file, struct mem_ref *ref)
 {
-  fprintf (file, "Reference %p:\n", (void *) ref);
-
-  fprintf (file, "  group %p ", (void *) ref->group);
-
-  dump_mem_details (file, ref->group->base, ref->group->step, ref->delta,
-                   ref->write_p);
+  fprintf (file, "reference %u:%u (", ref->group->uid, ref->uid);
+  print_generic_expr (file, ref->mem, TDF_SLIM);
+  fprintf (file, ")\n");
 }
 
 /* Finds a group with BASE and STEP in GROUPS, or creates one if it does not
@@ -316,6 +312,9 @@ dump_mem_ref (FILE *file, struct mem_ref *ref)
 static struct mem_ref_group *
 find_or_create_group (struct mem_ref_group **groups, tree base, tree step)
 {
+  /* Global count for setting struct mem_ref_group->uid.  */
+  static unsigned int last_mem_ref_group_uid = 0;
+
   struct mem_ref_group *group;
 
   for (; *groups; groups = &(*groups)->next)
@@ -335,6 +334,7 @@ find_or_create_group (struct mem_ref_group **groups, tree base, tree step)
   group->base = base;
   group->step = step;
   group->refs = NULL;
+  group->uid = ++last_mem_ref_group_uid;
   group->next = *groups;
   *groups = group;
 
@@ -348,11 +348,14 @@ static void
 record_ref (struct mem_ref_group *group, gimple *stmt, tree mem,
 	    HOST_WIDE_INT delta, bool write_p)
 {
+  unsigned int last_mem_ref_uid = 0;
   struct mem_ref **aref;
 
   /* Do not record the same address twice.  */
   for (aref = &group->refs; *aref; aref = &(*aref)->next)
     {
+      last_mem_ref_uid = (*aref)->uid;
+
       /* It does not have to be possible for write reference to reuse the read
 	 prefetch, or vice versa.  */
       if (!WRITE_CAN_USE_READ_PREFETCH
@@ -381,9 +384,16 @@ record_ref (struct mem_ref_group *group, gimple *stmt, tree mem,
   (*aref)->next = NULL;
   (*aref)->independent_p = false;
   (*aref)->storent_p = false;
+  (*aref)->uid = last_mem_ref_uid + 1;
 
   if (dump_file && (dump_flags & TDF_DETAILS))
-    dump_mem_ref (dump_file, *aref);
+    {
+      dump_mem_ref (dump_file, *aref);
+
+      fprintf (dump_file, "  group %u ", group->uid);
+      dump_mem_details (dump_file, group->base, group->step, delta,
+			write_p);
+    }
 }
 
 /* Release memory references in GROUPS.  */
@@ -938,7 +948,7 @@ prune_group_by_reuse (struct mem_ref_group *group)
 
       if (dump_file && (dump_flags & TDF_DETAILS))
 	{
-	  fprintf (dump_file, "Reference %p:", (void *) ref_pruned);
+	  dump_mem_ref (dump_file, ref_pruned);
 
 	  if (ref_pruned->prefetch_before == PREFETCH_ALL
 	      && ref_pruned->prefetch_mod == 1)
@@ -986,8 +996,8 @@ should_issue_prefetch_p (struct mem_ref *ref)
   if (ref->prefetch_before != PREFETCH_ALL)
     {
       if (dump_file && (dump_flags & TDF_DETAILS))
-        fprintf (dump_file, "Ignoring %p due to prefetch_before\n",
-		 (void *) ref);
+        fprintf (dump_file, "Ignoring reference %u:%u due to prefetch_before\n",
+		 ref->group->uid, ref->uid);
       return false;
     }
 
@@ -995,7 +1005,7 @@ should_issue_prefetch_p (struct mem_ref *ref)
   if (ref->storent_p)
     {
       if (dump_file && (dump_flags & TDF_DETAILS))
-        fprintf (dump_file, "Ignoring nontemporal store %p\n", (void *) ref);
+        fprintf (dump_file, "Ignoring nontemporal store reference %u:%u\n", ref->group->uid, ref->uid);
       return false;
     }
 
@@ -1063,6 +1073,9 @@ schedule_prefetches (struct mem_ref_group *groups, unsigned unroll_factor,
 	  continue;
 
 	ref->issue_prefetch_p = true;
+	if (dump_file && (dump_flags & TDF_DETAILS))
+	  fprintf (dump_file, "Decided to issue prefetch for reference %u:%u\n",
+		   ref->group->uid, ref->uid);
 
 	if (remaining_prefetch_slots <= prefetch_slots)
 	  return true;
@@ -1126,9 +1139,9 @@ issue_prefetch_ref (struct mem_ref *ref, unsigned unroll_factor, unsigned ahead)
   bool nontemporal = ref->reuse_distance >= L2_CACHE_SIZE_BYTES;
 
   if (dump_file && (dump_flags & TDF_DETAILS))
-    fprintf (dump_file, "Issued%s prefetch for %p.\n",
+    fprintf (dump_file, "Issued%s prefetch for reference %u:%u.\n",
 	     nontemporal ? " nontemporal" : "",
-	     (void *) ref);
+	     ref->group->uid, ref->uid);
 
   bsi = gsi_for_stmt (ref->stmt);
 
@@ -1148,8 +1161,8 @@ issue_prefetch_ref (struct mem_ref *ref, unsigned unroll_factor, unsigned ahead)
           delta = (ahead + ap * ref->prefetch_mod) *
 		   int_cst_value (ref->group->step);
           addr = fold_build_pointer_plus_hwi (addr_base, delta);
-          addr = force_gimple_operand_gsi (&bsi, unshare_expr (addr), true, NULL,
-                                           true, GSI_SAME_STMT);
+          addr = force_gimple_operand_gsi (&bsi, unshare_expr (addr), true,
+					   NULL, true, GSI_SAME_STMT);
         }
       else
         {
@@ -1233,8 +1246,8 @@ mark_nontemporal_store (struct mem_ref *ref)
     return false;
 
   if (dump_file && (dump_flags & TDF_DETAILS))
-    fprintf (dump_file, "Marked reference %p as a nontemporal store.\n",
-	     (void *) ref);
+    fprintf (dump_file, "Marked reference %u:%u as a nontemporal store.\n",
+	     ref->group->uid, ref->uid);
 
   gimple_assign_set_nontemporal_move (ref->stmt, true);
   ref->storent_p = true;
@@ -1344,7 +1357,7 @@ should_unroll_loop_p (struct loop *loop, struct tree_niter_desc *desc,
 
 /* Determine the coefficient by that unroll LOOP, from the information
    contained in the list of memory references REFS.  Description of
-   umber of iterations of LOOP is stored to DESC.  NINSNS is the number of
+   number of iterations of LOOP is stored to DESC.  NINSNS is the number of
    insns of the LOOP.  EST_NITER is the estimated number of iterations of
    the loop, or -1 if no estimate is available.  */
 
@@ -1719,8 +1732,8 @@ determine_loop_nest_reuse (struct loop *loop, struct mem_ref_group *refs,
       fprintf (dump_file, "Reuse distances:\n");
       for (gr = refs; gr; gr = gr->next)
 	for (ref = gr->refs; ref; ref = ref->next)
-	  fprintf (dump_file, " ref %p distance %u\n",
-		   (void *) ref, ref->reuse_distance);
+	  fprintf (dump_file, " reference %u:%u distance %u\n",
+		   ref->group->uid, ref->uid, ref->reuse_distance);
     }
 
   return true;
-- 
2.7.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 3/6] Fix prefetch heuristic calculation.
  2017-01-30 11:27 [PATCH 0/6] Improve -fprefetch-loop-arrays in general and for AArch64 in particular Maxim Kuvyrkov
  2017-01-30 11:35 ` [PATCH 1/6] Add debug counter for loop array prefetching Maxim Kuvyrkov
  2017-01-30 11:43 ` [PATCH 2/6] Improve debug output of loop data prefetching Maxim Kuvyrkov
@ 2017-01-30 11:47 ` Maxim Kuvyrkov
  2017-01-30 13:28   ` Richard Biener
  2017-01-30 11:53 ` [PATCH 4/6] Port prefetch configuration from aarch32 to aarch64 Maxim Kuvyrkov
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 32+ messages in thread
From: Maxim Kuvyrkov @ 2017-01-30 11:47 UTC (permalink / raw)
  To: GCC Patches; +Cc: Kyrylo Tkachov, Andrew Pinski, Richard Guenther

[-- Attachment #1: Type: text/plain, Size: 441 bytes --]

This patch fixes a heuristic in loop array prefetching to use "round-to-nearest" instead of "floor", which is what _all_ other similar heuristics in the pass use.

This subtle change allows a critical loop in 429.mcf to get prefetches without making the pass too aggressive, which would cause regressions on other benchmarks.

Bootstrapped and regtested on x86_64-linux-gnu and aarch64-linux-gnu.

--
Maxim Kuvyrkov
www.linaro.org


[-- Attachment #2: 0003-Fix-prefetch-heuristic-calculation.patch --]
[-- Type: application/octet-stream, Size: 1131 bytes --]

From fe7a7e384acac79e89e4f355a17ccea040b22f9c Mon Sep 17 00:00:00 2001
From: Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org>
Date: Thu, 15 Dec 2016 13:21:05 +0000
Subject: [PATCH 3/6] Fix prefetch heuristic calculation.

	* tree-ssa-loop-prefetch.c (mem_ref_count_reasonable_p): Round to
	nearest instead of floor.

Change-Id: I7bde101366cc3c4a11b145819fc24a5c5e1a0169
---
 gcc/tree-ssa-loop-prefetch.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/tree-ssa-loop-prefetch.c b/gcc/tree-ssa-loop-prefetch.c
index d025041..b574f15 100644
--- a/gcc/tree-ssa-loop-prefetch.c
+++ b/gcc/tree-ssa-loop-prefetch.c
@@ -1793,7 +1793,7 @@ mem_ref_count_reasonable_p (unsigned ninsns, unsigned mem_ref_count)
      TODO: Implement a more precise computation by estimating the time
      for each CPU or memory op in the loop. Time estimates for memory ops
      should account for cache misses.  */
-  insn_to_mem_ratio = ninsns / mem_ref_count;
+  insn_to_mem_ratio = (ninsns + mem_ref_count / 2) / mem_ref_count;
 
   if (insn_to_mem_ratio < PREFETCH_MIN_INSN_TO_MEM_RATIO)
     {
-- 
2.7.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 4/6] Port prefetch configuration from aarch32 to aarch64
  2017-01-30 11:27 [PATCH 0/6] Improve -fprefetch-loop-arrays in general and for AArch64 in particular Maxim Kuvyrkov
                   ` (2 preceding siblings ...)
  2017-01-30 11:47 ` [PATCH 3/6] Fix prefetch heuristic calculation Maxim Kuvyrkov
@ 2017-01-30 11:53 ` Maxim Kuvyrkov
  2017-01-30 16:39   ` Andrew Pinski
  2017-05-29 10:07   ` Maxim Kuvyrkov
  2017-01-30 12:08 ` [PATCH 5/6][AArch64] Enable -fprefetch-loop-arrays at -O3 for cores that benefit from prefetching Maxim Kuvyrkov
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 32+ messages in thread
From: Maxim Kuvyrkov @ 2017-01-30 11:53 UTC (permalink / raw)
  To: GCC Patches; +Cc: Kyrylo Tkachov, Andrew Pinski, Richard Guenther

[-- Attachment #1: Type: text/plain, Size: 529 bytes --]

This patch ports the prefetch configuration from the aarch32 backend to aarch64.  There is no code-generation change from this patch.

This patch also happens to address Kyrill's comment on Andrew's prefetching patch at https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02133.html .

This patch also fixes a minor bug in aarch64_override_options_internal(), which used "selected_cpu->tune" instead of "aarch64_tune_params".

Bootstrapped and regtested on x86_64-linux-gnu and aarch64-linux-gnu.

--
Maxim Kuvyrkov
www.linaro.org


[-- Attachment #2: 0004-Port-prefetch-configuration-from-aarch32-to-aarch64-.patch --]
[-- Type: application/octet-stream, Size: 8130 bytes --]

From 4e5dacceba2b0d427670faf2fd52b05956fbcc99 Mon Sep 17 00:00:00 2001
From: Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org>
Date: Thu, 15 Dec 2016 12:42:54 +0000
Subject: [PATCH 4/6] Port prefetch configuration from aarch32 to aarch64
 backend.

	* config/aarch64/aarch64-protos.h (struct tune_params):
	Replace cache_line_size with struct { } prefetch -- same as in
	arm-protos.h.
	* config/aarch64/aarch64.c (AARCH64_PREFETCH_NOT_BENEFICIAL,)
	(AARCH64_PREFETCH_BENEFICIAL): New macros.
	(tune_params *_tunings): Update all tunings (no functional change).
	(aarch64_override_options_internal): Set PARAM_SIMULTANEOUS_PREFETCHES,
	PARAM_L1_CACHE_SIZE, PARAM_L1_CACHE_LINE_SIZE, and PARAM_L2_CACHE_SIZE
	from tunings structures.

Change-Id: I1ddbac1863dcf078a2e5b14dd904debc76a7da94
---
 gcc/config/aarch64/aarch64-protos.h | 11 ++++++--
 gcc/config/aarch64/aarch64.c        | 56 +++++++++++++++++++++++++++----------
 2 files changed, 49 insertions(+), 18 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 602f54f..f14fe25 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -218,9 +218,14 @@ struct tune_params
   int min_div_recip_mul_df;
   /* Value for aarch64_case_values_threshold; or 0 for the default.  */
   unsigned int max_case_values;
-  /* Value for PARAM_L1_CACHE_LINE_SIZE; or 0 to use the default.  */
-  unsigned int cache_line_size;
-
+  /* Explicit prefetch data.  */
+  struct
+    {
+      int num_slots;
+      int l1_cache_size;
+      int l1_cache_line_size;
+      int l2_cache_size;
+    } prefetch;
 /* An enum specifying how to take into account CPU autoprefetch capabilities
    during instruction scheduling:
    - AUTOPREFETCHER_OFF: Do not take autoprefetch capabilities into account.
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 27f2dc9..5cec965 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -514,6 +514,15 @@ static const cpu_approx_modes xgene1_approx_modes =
   AARCH64_APPROX_ALL	/* recip_sqrt  */
 };
 
+#define AARCH64_PREFETCH_NOT_BENEFICIAL { 0, -1, -1, -1 }
+#define AARCH64_PREFETCH_BENEFICIAL(num_slots,l1_size,l1_line_size,l2_size) \
+  {									\
+    num_slots,								\
+    l1_size,								\
+    l1_line_size,							\
+    l2_size								\
+  }
+
 static const struct tune_params generic_tunings =
 {
   &cortexa57_extra_costs,
@@ -534,7 +543,7 @@ static const struct tune_params generic_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
+  AARCH64_PREFETCH_NOT_BENEFICIAL,
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
 };
@@ -560,7 +569,7 @@ static const struct tune_params cortexa35_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
+  AARCH64_PREFETCH_NOT_BENEFICIAL,
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
 };
@@ -586,7 +595,7 @@ static const struct tune_params cortexa53_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
+  AARCH64_PREFETCH_NOT_BENEFICIAL,
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
 };
@@ -612,7 +621,7 @@ static const struct tune_params cortexa57_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
+  AARCH64_PREFETCH_NOT_BENEFICIAL,
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */
 };
@@ -638,7 +647,7 @@ static const struct tune_params cortexa72_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
+  AARCH64_PREFETCH_NOT_BENEFICIAL,
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
 };
@@ -664,7 +673,7 @@ static const struct tune_params cortexa73_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
+  AARCH64_PREFETCH_NOT_BENEFICIAL,
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
 };
@@ -689,7 +698,7 @@ static const struct tune_params exynosm1_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   48,	/* max_case_values.  */
-  64,	/* cache_line_size.  */
+  AARCH64_PREFETCH_BENEFICIAL (0, -1, 64, -1),
   tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE) /* tune_flags.  */
 };
@@ -714,7 +723,7 @@ static const struct tune_params thunderx_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
+  AARCH64_PREFETCH_NOT_BENEFICIAL,
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)	/* tune_flags.  */
 };
@@ -739,7 +748,7 @@ static const struct tune_params xgene1_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
+  AARCH64_PREFETCH_NOT_BENEFICIAL,
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
 };
@@ -765,7 +774,7 @@ static const struct tune_params qdf24xx_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  64,	/* cache_line_size.  */
+  AARCH64_PREFETCH_BENEFICIAL (0, -1, 64, -1),
   tune_params::AUTOPREFETCHER_STRONG,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)		/* tune_flags.  */
 };
@@ -790,7 +799,7 @@ static const struct tune_params thunderx2t99_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  64,	/* cache_line_size.  */
+  AARCH64_PREFETCH_BENEFICIAL (0, -1, 64, -1),
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
 };
@@ -8612,11 +8621,28 @@ aarch64_override_options_internal (struct gcc_options *opts)
 			 opts->x_param_values,
 			 global_options_set.x_param_values);
 
-  /* Set the L1 cache line size.  */
-  if (selected_cpu->tune->cache_line_size != 0)
+  /* Set up parameters to be used in prefetching algorithm.  Do not
+     override the defaults unless we are tuning for a core we have
+     researched values for.  */
+  if (aarch64_tune_params.prefetch.num_slots > 0)
+    maybe_set_param_value (PARAM_SIMULTANEOUS_PREFETCHES,
+			   aarch64_tune_params.prefetch.num_slots,
+			   global_options.x_param_values,
+			   global_options_set.x_param_values);
+  if (aarch64_tune_params.prefetch.l1_cache_size >= 0)
+    maybe_set_param_value (PARAM_L1_CACHE_SIZE,
+			   aarch64_tune_params.prefetch.l1_cache_size,
+			   global_options.x_param_values,
+			   global_options_set.x_param_values);
+  if (aarch64_tune_params.prefetch.l1_cache_line_size >= 0)
     maybe_set_param_value (PARAM_L1_CACHE_LINE_SIZE,
-			   selected_cpu->tune->cache_line_size,
-			   opts->x_param_values,
+			   aarch64_tune_params.prefetch.l1_cache_line_size,
+			   global_options.x_param_values,
+			   global_options_set.x_param_values);
+  if (aarch64_tune_params.prefetch.l2_cache_size >= 0)
+    maybe_set_param_value (PARAM_L2_CACHE_SIZE,
+			   aarch64_tune_params.prefetch.l2_cache_size,
+			   global_options.x_param_values,
 			   global_options_set.x_param_values);
 
   aarch64_override_options_after_change_1 (opts);
-- 
2.7.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 5/6][AArch64] Enable -fprefetch-loop-arrays at -O3 for cores that  benefit from prefetching.
  2017-01-30 11:27 [PATCH 0/6] Improve -fprefetch-loop-arrays in general and for AArch64 in particular Maxim Kuvyrkov
                   ` (3 preceding siblings ...)
  2017-01-30 11:53 ` [PATCH 4/6] Port prefetch configuration from aarch32 to aarch64 Maxim Kuvyrkov
@ 2017-01-30 12:08 ` Maxim Kuvyrkov
  2017-01-30 12:31   ` Kyrill Tkachov
  2017-01-30 12:14 ` [PATCH 6/6][AArch64] Update prefetch tuning parameters for falkor and qdf24xx tunings Maxim Kuvyrkov
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 32+ messages in thread
From: Maxim Kuvyrkov @ 2017-01-30 12:08 UTC (permalink / raw)
  To: GCC Patches; +Cc: Kyrylo Tkachov, Andrew Pinski, Richard Guenther

[-- Attachment #1: Type: text/plain, Size: 452 bytes --]

This patch enables prefetching at -O3 for aarch64 cores that set the "simultaneous prefetches" parameter above 0.  There are currently no such settings, so this patch doesn't change default code generation.

I'm now working on improvements to the -fprefetch-loop-arrays pass to make it suitable for -O2.  I'll post this work next month.

Bootstrapped and regtested on x86_64-linux-gnu and aarch64-linux-gnu.

--
Maxim Kuvyrkov
www.linaro.org


[-- Attachment #2: 0005-Enable-fprefetch-loop-arrays-at-O3-for-cores-that-be.patch --]
[-- Type: application/octet-stream, Size: 1420 bytes --]

From 1e4d4d041f580188caf6f64e9da978dd2c29e3f9 Mon Sep 17 00:00:00 2001
From: Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org>
Date: Thu, 15 Dec 2016 12:50:07 +0000
Subject: [PATCH 5/6] Enable -fprefetch-loop-arrays at -O3 for cores that
 benefit from prefetching.

	* config/aarch64/aarch64.c (aarch64_override_options_internal):
	Set flag_prefetch_loop_arrays if tuning sets prefetching slots > 1.

Change-Id: Id41411e671e0a55dc7268e0ad0a4e8ff1421c90a
---
 gcc/config/aarch64/aarch64.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 5cec965..e1b03ec 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -8621,6 +8621,15 @@ aarch64_override_options_internal (struct gcc_options *opts)
 			 opts->x_param_values,
 			 global_options_set.x_param_values);
 
+  /* Enable sw prefetching at -O3 for CPUs that have prefetch, and we
+     have deemed it beneficial (signified by setting
+     prefetch.num_slots to 1 or more).  */
+  if (flag_prefetch_loop_arrays < 0
+      && HAVE_prefetch
+      && optimize >= 3
+      && aarch64_tune_params.prefetch.num_slots > 0)
+    flag_prefetch_loop_arrays = 1;
+
   /* Set up parameters to be used in prefetching algorithm.  Do not
      override the defaults unless we are tuning for a core we have
      researched values for.  */
-- 
2.7.4


[-- Attachment #3: Type: text/plain, Size: 2 bytes --]




^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 6/6][AArch64] Update prefetch tuning parameters for falkor and qdf24xx  tunings.
  2017-01-30 11:27 [PATCH 0/6] Improve -fprefetch-loop-arrays in general and for AArch64 in particular Maxim Kuvyrkov
                   ` (4 preceding siblings ...)
  2017-01-30 12:08 ` [PATCH 5/6][AArch64] Enable -fprefetch-loop-arrays at -O3 for cores that benefit from prefetching Maxim Kuvyrkov
@ 2017-01-30 12:14 ` Maxim Kuvyrkov
  2017-06-08 16:32   ` James Greenhalgh
  2017-01-30 16:55 ` [PATCH 0/6] Improve -fprefetch-loop-arrays in general and for AArch64 in particular Andrew Pinski
  2017-02-20 14:44 ` Kyrill Tkachov
  7 siblings, 1 reply; 32+ messages in thread
From: Maxim Kuvyrkov @ 2017-01-30 12:14 UTC (permalink / raw)
  To: GCC Patches; +Cc: Kyrylo Tkachov, Andrew Pinski, Richard Guenther

[-- Attachment #1: Type: text/plain, Size: 183 bytes --]

This patch enables software prefetching at -O3 for Qualcomm's qdf24xx cores.

Bootstrapped and regtested on x86_64-linux-gnu and aarch64-linux-gnu.

--
Maxim Kuvyrkov
www.linaro.org


[-- Attachment #2: 0006-Update-prefetch-tuning-parameters-for-falkor-and-qdf.patch --]
[-- Type: application/octet-stream, Size: 1165 bytes --]

From f3dc0ae7203ec095e2f54894a0fbf13e807d3d46 Mon Sep 17 00:00:00 2001
From: Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org>
Date: Thu, 15 Dec 2016 12:53:49 +0000
Subject: [PATCH 6/6] Update prefetch tuning parameters for falkor and qdf24xx
 tunings.

	* config/aarch64/aarch64.c (tune_params qdf24xx_tunings,)
	(tune_params falkor_tunings): Update prefetch settings, which, in turn,
	enable prefetching by default at -O3.

Change-Id: I2f0da54a8c262f6fbd0dcfde4584141bb09f1013
---
 gcc/config/aarch64/aarch64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index e1b03ec..ccee947 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -774,7 +774,7 @@ static const struct tune_params qdf24xx_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  AARCH64_PREFETCH_BENEFICIAL (0, -1, 64, -1),
+  AARCH64_PREFETCH_BENEFICIAL (4, 32, 64, 1024),
   tune_params::AUTOPREFETCHER_STRONG,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)		/* tune_flags.  */
 };
-- 
2.7.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6][AArch64] Enable -fprefetch-loop-arrays at -O3 for cores that  benefit from prefetching.
  2017-01-30 12:08 ` [PATCH 5/6][AArch64] Enable -fprefetch-loop-arrays at -O3 for cores that benefit from prefetching Maxim Kuvyrkov
@ 2017-01-30 12:31   ` Kyrill Tkachov
  2017-01-30 15:03     ` Maxim Kuvyrkov
  0 siblings, 1 reply; 32+ messages in thread
From: Kyrill Tkachov @ 2017-01-30 12:31 UTC (permalink / raw)
  To: Maxim Kuvyrkov, GCC Patches; +Cc: Andrew Pinski, Richard Guenther

Hi Maxim,

On 30/01/17 12:06, Maxim Kuvyrkov wrote:
> This patch enables prefetching at -O3 for aarch64 cores that set the "simultaneous prefetches" parameter above 0.  There are currently no such settings, so this patch doesn't change default code generation.
>
> I'm now working on improvements to -fprefetch-loop-arrays pass to make it suitable for -O2.  I'll post this work in the next month.
>
> Bootstrapped and regtested on x86_64-linux-gnu and aarch64-linux-gnu.
>

Are you aiming to get this in for GCC 8?
I have one small comment on this patch:

+  /* Enable sw prefetching at -O3 for CPUS that have prefetch, and we
+     have deemed it beneficial (signified by setting
+     prefetch.num_slots to 1 or more).  */
+  if (flag_prefetch_loop_arrays < 0
+      && HAVE_prefetch

HAVE_prefetch will always be true on aarch64.
I imagine midend code that had logic like this would need this check, but aarch64-specific code shouldn't need it.

+      && optimize >= 3
+      && aarch64_tune_params.prefetch.num_slots > 0)
+    flag_prefetch_loop_arrays = 1;
+

Cheers,
Kyrill

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/6] Add debug counter for loop array prefetching.
  2017-01-30 11:35 ` [PATCH 1/6] Add debug counter for loop array prefetching Maxim Kuvyrkov
@ 2017-01-30 12:33   ` Richard Biener
  0 siblings, 0 replies; 32+ messages in thread
From: Richard Biener @ 2017-01-30 12:33 UTC (permalink / raw)
  To: Maxim Kuvyrkov; +Cc: GCC Patches, Kyrylo Tkachov, Andrew Pinski

On Mon, Jan 30, 2017 at 12:28 PM, Maxim Kuvyrkov
<maxim.kuvyrkov@linaro.org> wrote:
> This patch adds a debug counter to the -fprefetch-loop-arrays pass.  It can be activated with "-fdbg-cnt=prefetch:10" to allow only the first 10 prefetches to be issued.
>
> Bootstrapped and regtested on x86_64-linux-gnu and aarch64-linux-gnu.

Ok.

RIchard.

> --
> Maxim Kuvyrkov
> www.linaro.org
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/6] Improve debug output of loop data prefetching.
  2017-01-30 11:43 ` [PATCH 2/6] Improve debug output of loop data prefetching Maxim Kuvyrkov
@ 2017-01-30 12:37   ` Richard Biener
  0 siblings, 0 replies; 32+ messages in thread
From: Richard Biener @ 2017-01-30 12:37 UTC (permalink / raw)
  To: Maxim Kuvyrkov; +Cc: GCC Patches, Kyrylo Tkachov, Andrew Pinski

On Mon, Jan 30, 2017 at 12:36 PM, Maxim Kuvyrkov
<maxim.kuvyrkov@linaro.org> wrote:
> Current debug output from -fprefetch-loop-arrays refers to prefetching instances by their (void *) address, which makes it painful to compare dumps, e.g., when investigating how different parameter values affect prefetching decisions.
>
> This patch adds UIDs to two main prefetching concepts: mem_ref_group and mem_ref.  [Mem_refs are combined into mem_ref_groups so that they can re-use prefetches.]  Debug output is then changed to identify prefetch opportunities as <group>:<ref> instead of 0xNNNNNNNN.  Believe me, it makes a world of difference for debugging tree-ssa-loop-prefetch.c.
>
> There is no change in code-generation from this patch.
>
> Bootstrapped and regtested on x86_64-linux-gnu and aarch64-linux-gnu.

Ok.

Richard.

> --
> Maxim Kuvyrkov
> www.linaro.org
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/6] Fix prefetch heuristic calculation.
  2017-01-30 11:47 ` [PATCH 3/6] Fix prefetch heuristic calculation Maxim Kuvyrkov
@ 2017-01-30 13:28   ` Richard Biener
  0 siblings, 0 replies; 32+ messages in thread
From: Richard Biener @ 2017-01-30 13:28 UTC (permalink / raw)
  To: Maxim Kuvyrkov; +Cc: GCC Patches, Kyrylo Tkachov, Andrew Pinski

On Mon, Jan 30, 2017 at 12:43 PM, Maxim Kuvyrkov
<maxim.kuvyrkov@linaro.org> wrote:
> This patch fixes heuristic in loop array prefetching to use "round-to-nearest" instead of "floor", which is what _all_ other similar heuristics in the pass are doing.

_all_ is a bit of an exaggeration...  In this context we are testing
the rounded value against min_insn_to_mem_ratio, so rounding
down makes sense to me.

What am I missing?

Richard.

> This subtle change allows a critical loop in 429.mcf to get prefetches without making the pass too aggressive, which causes regressions on other benchmarks.
>
> Bootstrapped and regtested on x86_64-linux-gnu and aarch64-linux-gnu.
>
> --
> Maxim Kuvyrkov
> www.linaro.org
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6][AArch64] Enable -fprefetch-loop-arrays at -O3 for cores that  benefit from prefetching.
  2017-01-30 12:31   ` Kyrill Tkachov
@ 2017-01-30 15:03     ` Maxim Kuvyrkov
  2017-02-03 11:58       ` Maxim Kuvyrkov
  0 siblings, 1 reply; 32+ messages in thread
From: Maxim Kuvyrkov @ 2017-01-30 15:03 UTC (permalink / raw)
  To: Kyrill Tkachov; +Cc: GCC Patches, Andrew Pinski, Richard Guenther

> On Jan 30, 2017, at 3:23 PM, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote:
> 
> Hi Maxim,
> 
> On 30/01/17 12:06, Maxim Kuvyrkov wrote:
>> This patch enables prefetching at -O3 for aarch64 cores that set the "simultaneous prefetches" parameter above 0.  There are currently no such settings, so this patch doesn't change default code generation.
>> 
>> I'm now working on improvements to -fprefetch-loop-arrays pass to make it suitable for -O2.  I'll post this work in the next month.
>> 
>> Bootstrapped and regtested on x86_64-linux-gnu and aarch64-linux-gnu.
>> 
> 
> Are you aiming to get this in for GCC 8?
> I have one small comment on this patch:
> 
> +  /* Enable sw prefetching at -O3 for CPUS that have prefetch, and we
> +     have deemed it beneficial (signified by setting
> +     prefetch.num_slots to 1 or more).  */
> +  if (flag_prefetch_loop_arrays < 0
> +      && HAVE_prefetch
> 
> HAVE_prefetch will always be true on aarch64.
> I imagine midend code that had logic like this would need this check, but aarch64-specific code shouldn't need it.

Agree, I'll remove HAVE_prefetch.

This pattern was copied from other backends, and HAVE_prefetch is most likely a historical artifact.

--
Maxim Kuvyrkov
www.linaro.org

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/6] Port prefetch configuration from aarch32 to aarch64
  2017-01-30 11:53 ` [PATCH 4/6] Port prefetch configuration from aarch32 to aarch64 Maxim Kuvyrkov
@ 2017-01-30 16:39   ` Andrew Pinski
  2017-02-03 11:52     ` Maxim Kuvyrkov
  2017-06-08 13:48     ` James Greenhalgh
  2017-05-29 10:07   ` Maxim Kuvyrkov
  1 sibling, 2 replies; 32+ messages in thread
From: Andrew Pinski @ 2017-01-30 16:39 UTC (permalink / raw)
  To: Maxim Kuvyrkov; +Cc: GCC Patches, Kyrylo Tkachov, Richard Guenther

On Mon, Jan 30, 2017 at 3:48 AM, Maxim Kuvyrkov
<maxim.kuvyrkov@linaro.org> wrote:
> This patch ports the prefetch configuration from the aarch32 backend to aarch64.  There is no code-generation change from this patch.
>
> This patch also happens to address Kyrill's comment on Andrew's prefetching patch at https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02133.html .
>
> This patch also fixes a minor bug in aarch64_override_options_internal(), which used "selected_cpu->tune" instead of "aarch64_tune_params".

I am not a fan of the macro at all.

Thanks,
Andrew


>
> Bootstrapped and regtested on x86_64-linux-gnu and aarch64-linux-gnu.
>
> --
> Maxim Kuvyrkov
> www.linaro.org
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/6] Improve -fprefetch-loop-arrays in general and for AArch64 in particular
  2017-01-30 11:27 [PATCH 0/6] Improve -fprefetch-loop-arrays in general and for AArch64 in particular Maxim Kuvyrkov
                   ` (5 preceding siblings ...)
  2017-01-30 12:14 ` [PATCH 6/6][AArch64] Update prefetch tuning parameters for falkor and qdf24xx tunings Maxim Kuvyrkov
@ 2017-01-30 16:55 ` Andrew Pinski
  2017-02-20 14:44 ` Kyrill Tkachov
  7 siblings, 0 replies; 32+ messages in thread
From: Andrew Pinski @ 2017-01-30 16:55 UTC (permalink / raw)
  To: Maxim Kuvyrkov; +Cc: GCC Patches, Kyrylo Tkachov, Richard Guenther

On Mon, Jan 30, 2017 at 3:24 AM, Maxim Kuvyrkov
<maxim.kuvyrkov@linaro.org> wrote:
> This patch series improves -fprefetch-loop-arrays pass through small fixes and tweaks, and then enables it for several AArch64 cores.
>
> My tunings were done on and for Qualcomm hardware, with results varying between +0.5-1.9% for SPEC2006 INT and +0.25%-1.0% for SPEC2006 FP at -O3, depending on hardware revision.
>
> This patch series enables restricted -fprefetch-loop-arrays at -O2, which also improves SPEC2006 numbers.
>
> Biggest progressions are on 429.mcf and 437.leslie3d, with no serious regressions on other benchmarks.
>
> I'm now investigating making -fprefetch-loop-arrays more aggressive for Qualcomm hardware, which improves performance on most benchmarks, but also causes big regressions on 454.calculix and 462.libquantum.  If I can fix these two regressions, prefetching will give another boost to AArch64.

I already have a patch which makes prefetching more aggressive and
which improves libquantum for CN88xx; I have not submitted it yet as I
had just restarted upstreaming my patch sets.

Thanks,
Andrew

>
> Andrew just posted similar prefetching tunings for Cavium's cores, and the two patches have trivial conflicts.  I'll post mine as-is, since it addresses one of the comments on Andrew's review (adding a stand-alone struct for tuning parameters).
>
> Andrew, feel free to just copy-paste it to your patch, since it is just a mechanical change.
>
> All patches were bootstrapped and regtested on x86_64-linux-gnu and aarch64-linux-gnu.
>
> --
> Maxim Kuvyrkov
> www.linaro.org
>
>
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/6] Port prefetch configuration from aarch32 to aarch64
  2017-01-30 16:39   ` Andrew Pinski
@ 2017-02-03 11:52     ` Maxim Kuvyrkov
  2017-06-08 13:48     ` James Greenhalgh
  1 sibling, 0 replies; 32+ messages in thread
From: Maxim Kuvyrkov @ 2017-02-03 11:52 UTC (permalink / raw)
  To: Andrew Pinski; +Cc: GCC Patches, Kyrylo Tkachov, Richard Guenther

[-- Attachment #1: Type: text/plain, Size: 1256 bytes --]

> On Jan 30, 2017, at 7:35 PM, Andrew Pinski <apinski@cavium.com> wrote:
> 
> On Mon, Jan 30, 2017 at 3:48 AM, Maxim Kuvyrkov
> <maxim.kuvyrkov@linaro.org> wrote:
>> This patch ports the prefetch configuration from the aarch32 backend to aarch64.  There is no code-generation change from this patch.
>> 
>> This patch also happens to address Kyrill's comment on Andrew's prefetching patch at https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02133.html .
>> 
>> This patch also fixes a minor bug in aarch64_override_options_internal(), which used "selected_cpu->tune" instead of "aarch64_tune_params".
> 
> I am not a fan of the macro at all.

Are you referring to the AARCH64_PREFETCH_NOT_BENEFICIAL macro?  If so, then consider how it simplifies updates to prefetch tuning parameters for targets that don't care about prefetching.  My guess is that the growing number of ARM cores was the reason for this macro in the aarch32 backend.  Similarly, the number of aarch64 cores will only increase over time, and updating every single tuning structure will become a pain.

Here is an updated patch that fixes references from global_options.foo to opts->foo.  I spotted this mistake while looking at Andrew's patch.

--
Maxim Kuvyrkov
www.linaro.org


[-- Attachment #2: 0004-Port-prefetch-configuration-from-aarch32-to-aarch64-.patch --]
[-- Type: application/octet-stream, Size: 8012 bytes --]

From 692fbaa0c7a46674b722d4454e36389eb6317a8f Mon Sep 17 00:00:00 2001
From: Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org>
Date: Thu, 15 Dec 2016 12:42:54 +0000
Subject: [PATCH 4/7] Port prefetch configuration from aarch32 to aarch64
 backend.

	* config/aarch64/aarch64-protos.h (struct tune_params):
	Replace cache_line_size with struct { } prefetch -- same as in
	arm-protos.h.
	* config/aarch64/aarch64.c (AARCH64_PREFETCH_NOT_BENEFICIAL,)
	(AARCH64_PREFETCH_BENEFICIAL): New macros.
	(tune_params *_tunings): Update all tunings (no functional change).
	(aarch64_override_options_internal): Set PARAM_SIMULTANEOUS_PREFETCHES,
	PARAM_L1_CACHE_SIZE, PARAM_L1_CACHE_LINE_SIZE, and PARAM_L2_CACHE_SIZE
	from tunings structures.

Change-Id: I1ddbac1863dcf078a2e5b14dd904debc76a7da94
---
 gcc/config/aarch64/aarch64-protos.h | 11 +++++---
 gcc/config/aarch64/aarch64.c        | 54 +++++++++++++++++++++++++++----------
 2 files changed, 48 insertions(+), 17 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 9543f8c..0adb84c 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -224,9 +224,14 @@ struct tune_params
   int min_div_recip_mul_df;
   /* Value for aarch64_case_values_threshold; or 0 for the default.  */
   unsigned int max_case_values;
-  /* Value for PARAM_L1_CACHE_LINE_SIZE; or 0 to use the default.  */
-  unsigned int cache_line_size;
-
+  /* Explicit prefetch data.  */
+  struct
+    {
+      int num_slots;
+      int l1_cache_size;
+      int l1_cache_line_size;
+      int l2_cache_size;
+    } prefetch;
 /* An enum specifying how to take into account CPU autoprefetch capabilities
    during instruction scheduling:
    - AUTOPREFETCHER_OFF: Do not take autoprefetch capabilities into account.
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index ebd17a4..41160ff 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -526,6 +526,15 @@ static const cpu_approx_modes xgene1_approx_modes =
   AARCH64_APPROX_ALL	/* recip_sqrt  */
 };
 
+#define AARCH64_PREFETCH_NOT_BENEFICIAL { 0, -1, -1, -1 }
+#define AARCH64_PREFETCH_BENEFICIAL(num_slots,l1_size,l1_line_size,l2_size) \
+  {									\
+    num_slots,								\
+    l1_size,								\
+    l1_line_size,							\
+    l2_size								\
+  }
+
 static const struct tune_params generic_tunings =
 {
   &cortexa57_extra_costs,
@@ -546,7 +555,7 @@ static const struct tune_params generic_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
+  AARCH64_PREFETCH_NOT_BENEFICIAL,
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
 };
@@ -572,7 +581,7 @@ static const struct tune_params cortexa35_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
+  AARCH64_PREFETCH_NOT_BENEFICIAL,
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
 };
@@ -598,7 +607,7 @@ static const struct tune_params cortexa53_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
+  AARCH64_PREFETCH_NOT_BENEFICIAL,
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
 };
@@ -624,7 +633,7 @@ static const struct tune_params cortexa57_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
+  AARCH64_PREFETCH_NOT_BENEFICIAL,
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */
 };
@@ -650,7 +659,7 @@ static const struct tune_params cortexa72_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
+  AARCH64_PREFETCH_NOT_BENEFICIAL,
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
 };
@@ -676,7 +685,7 @@ static const struct tune_params cortexa73_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
+  AARCH64_PREFETCH_NOT_BENEFICIAL,
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
 };
@@ -701,7 +710,7 @@ static const struct tune_params exynosm1_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   48,	/* max_case_values.  */
-  64,	/* cache_line_size.  */
+  AARCH64_PREFETCH_BENEFICIAL (0, -1, 64, -1),
   tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE) /* tune_flags.  */
 };
@@ -726,7 +735,7 @@ static const struct tune_params thunderx_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
+  AARCH64_PREFETCH_NOT_BENEFICIAL,
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)	/* tune_flags.  */
 };
@@ -751,7 +760,7 @@ static const struct tune_params xgene1_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
+  AARCH64_PREFETCH_NOT_BENEFICIAL,
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
 };
@@ -777,7 +786,7 @@ static const struct tune_params qdf24xx_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  64,	/* cache_line_size.  */
+  AARCH64_PREFETCH_BENEFICIAL (0, -1, 64, -1),
   tune_params::AUTOPREFETCHER_STRONG,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)		/* tune_flags.  */
 };
@@ -802,7 +811,7 @@ static const struct tune_params thunderx2t99_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  64,	/* cache_line_size.  */
+  AARCH64_PREFETCH_BENEFICIAL (0, -1, 64, -1),
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
 };
@@ -8682,10 +8691,27 @@ aarch64_override_options_internal (struct gcc_options *opts)
 			 opts->x_param_values,
 			 global_options_set.x_param_values);
 
-  /* Set the L1 cache line size.  */
-  if (selected_cpu->tune->cache_line_size != 0)
+  /* Set up parameters to be used in prefetching algorithm.  Do not
+     override the defaults unless we are tuning for a core we have
+     researched values for.  */
+  if (aarch64_tune_params.prefetch.num_slots > 0)
+    maybe_set_param_value (PARAM_SIMULTANEOUS_PREFETCHES,
+			   aarch64_tune_params.prefetch.num_slots,
+			   opts->x_param_values,
+			   global_options_set.x_param_values);
+  if (aarch64_tune_params.prefetch.l1_cache_size >= 0)
+    maybe_set_param_value (PARAM_L1_CACHE_SIZE,
+			   aarch64_tune_params.prefetch.l1_cache_size,
+			   opts->x_param_values,
+			   global_options_set.x_param_values);
+  if (aarch64_tune_params.prefetch.l1_cache_line_size >= 0)
     maybe_set_param_value (PARAM_L1_CACHE_LINE_SIZE,
-			   selected_cpu->tune->cache_line_size,
+			   aarch64_tune_params.prefetch.l1_cache_line_size,
+			   opts->x_param_values,
+			   global_options_set.x_param_values);
+  if (aarch64_tune_params.prefetch.l2_cache_size >= 0)
+    maybe_set_param_value (PARAM_L2_CACHE_SIZE,
+			   aarch64_tune_params.prefetch.l2_cache_size,
 			   opts->x_param_values,
 			   global_options_set.x_param_values);
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6][AArch64] Enable -fprefetch-loop-arrays at -O3 for cores that  benefit from prefetching.
  2017-01-30 15:03     ` Maxim Kuvyrkov
@ 2017-02-03 11:58       ` Maxim Kuvyrkov
  2017-05-29 10:09         ` Maxim Kuvyrkov
  2017-06-08 16:31         ` James Greenhalgh
  0 siblings, 2 replies; 32+ messages in thread
From: Maxim Kuvyrkov @ 2017-02-03 11:58 UTC (permalink / raw)
  To: Maxim Kuvyrkov
  Cc: Kyrill Tkachov, GCC Patches, Andrew Pinski, Richard Guenther

[-- Attachment #1: Type: text/plain, Size: 2123 bytes --]

> On Jan 30, 2017, at 5:50 PM, Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org> wrote:
> 
>> On Jan 30, 2017, at 3:23 PM, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote:
>> 
>> Hi Maxim,
>> 
>> On 30/01/17 12:06, Maxim Kuvyrkov wrote:
>>> This patch enables prefetching at -O3 for aarch64 cores that set the "simultaneous prefetches" parameter above 0.  There are currently no such settings, so this patch doesn't change default code generation.
>>> 
>>> I'm now working on improvements to -fprefetch-loop-arrays pass to make it suitable for -O2. I'll post this work in the next month.
>>> 
>>> Bootstrapped and regtested on x86_64-linux-gnu and aarch64-linux-gnu.
>>> 
>> 
>> Are you aiming to get this in for GCC 8?
>> I have one small comment on this patch:
>> 
>> +  /* Enable sw prefetching at -O3 for CPUS that have prefetch, and we
>> +     have deemed it beneficial (signified by setting
>> +     prefetch.num_slots to 1 or more).  */
>> +  if (flag_prefetch_loop_arrays < 0
>> +      && HAVE_prefetch
>> 
>> HAVE_prefetch will always be true on aarch64.
>> I imagine midend code that had logic like this would need this check, but aarch64-specific code shouldn't need it.
> 
> Agree, I'll remove HAVE_prefetch.
> 
> This pattern was copied from other backends, and HAVE_prefetch is most likely a historical artifact.

Andrew raised a good point in the review of his patch: it is a bad idea to use one of the prefetching parameters (simultaneous_prefetches) as an indicator of whether to enable the prefetching pass by default.  Indeed, there are cases when we want to set simultaneous_prefetches according to HW documentation (or experimental results), but not enable the prefetching pass by default.

This update to the patch addresses it.  The patch adds a new explicit field, "default_opt_level", to the prefetch tuning structure; it sets the optimization level at which prefetching should be enabled by default.  The current value enables prefetching at -O3; additionally, this parameter will come in handy for enabling prefetching at -O2 [when it is ready].

--
Maxim Kuvyrkov
www.linaro.org



[-- Attachment #2: 0005-Enable-fprefetch-loop-arrays-at-O3-for-cores-that-be.patch --]
[-- Type: application/octet-stream, Size: 3878 bytes --]

From c6559cac199eaa91bbf8b513830942ad4f816060 Mon Sep 17 00:00:00 2001
From: Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org>
Date: Thu, 15 Dec 2016 12:50:07 +0000
Subject: [PATCH 5/7] Enable -fprefetch-loop-arrays at -O3 for cores that
 benefit from prefetching.

	* config/aarch64/aarch64.c (aarch64_override_options_internal):
	Set flag_prefetch_loop_arrays if tuning sets prefetching slots > 1.

Change-Id: Id41411e671e0a55dc7268e0ad0a4e8ff1421c90a
---
 gcc/config/aarch64/aarch64-protos.h |  1 +
 gcc/config/aarch64/aarch64.c        | 25 +++++++++++++++++++------
 2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 0adb84c..7791215 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -231,6 +231,7 @@ struct tune_params
       int l1_cache_size;
       int l1_cache_line_size;
       int l2_cache_size;
+      int default_opt_level;
     } prefetch;
 /* An enum specifying how to take into account CPU autoprefetch capabilities
    during instruction scheduling:
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 41160ff..cce2265 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -526,13 +526,15 @@ static const cpu_approx_modes xgene1_approx_modes =
   AARCH64_APPROX_ALL	/* recip_sqrt  */
 };
 
-#define AARCH64_PREFETCH_NOT_BENEFICIAL { 0, -1, -1, -1 }
-#define AARCH64_PREFETCH_BENEFICIAL(num_slots,l1_size,l1_line_size,l2_size) \
+#define AARCH64_PREFETCH_NOT_BENEFICIAL { 0, -1, -1, -1, -1 }
+#define AARCH64_PREFETCH_BENEFICIAL(num_slots,l1_size,l1_line_size,l2_size, \
+				    opt_level)				\
   {									\
     num_slots,								\
     l1_size,								\
     l1_line_size,							\
-    l2_size								\
+    l2_size,								\
+    opt_level								\
   }
 
 static const struct tune_params generic_tunings =
@@ -710,7 +712,7 @@ static const struct tune_params exynosm1_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   48,	/* max_case_values.  */
-  AARCH64_PREFETCH_BENEFICIAL (0, -1, 64, -1),
+  AARCH64_PREFETCH_BENEFICIAL (0, -1, 64, -1, -1),
   tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE) /* tune_flags.  */
 };
@@ -786,7 +788,7 @@ static const struct tune_params qdf24xx_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  AARCH64_PREFETCH_BENEFICIAL (0, -1, 64, -1),
+  AARCH64_PREFETCH_BENEFICIAL (0, -1, 64, -1, -1),
   tune_params::AUTOPREFETCHER_STRONG,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)		/* tune_flags.  */
 };
@@ -811,7 +813,7 @@ static const struct tune_params thunderx2t99_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  AARCH64_PREFETCH_BENEFICIAL (0, -1, 64, -1),
+  AARCH64_PREFETCH_BENEFICIAL (0, -1, 64, -1, -1),
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
 };
@@ -8715,6 +8717,17 @@ aarch64_override_options_internal (struct gcc_options *opts)
 			   opts->x_param_values,
 			   global_options_set.x_param_values);
 
+  /* Enable sw prefetching at specified optimization level for
+     CPUS that have prefetch.  Lower optimization level threshold by 1
+     when profiling is enabled.  */
+  if (opts->x_flag_prefetch_loop_arrays < 0
+      && !opts->x_optimize_size
+      && aarch64_tune_params.default_opt_level >= 0
+      && (opts->x_optimize
+	  + (opts->x_flag_profile_use
+	     ? 1 : 0)) >= aarch64_tune_params.default_opt_level)
+    opts->x_flag_prefetch_loop_arrays = 1;
+
   aarch64_override_options_after_change_1 (opts);
 }
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/6] Improve -fprefetch-loop-arrays in general and for AArch64 in particular
  2017-01-30 11:27 [PATCH 0/6] Improve -fprefetch-loop-arrays in general and for AArch64 in particular Maxim Kuvyrkov
                   ` (6 preceding siblings ...)
  2017-01-30 16:55 ` [PATCH 0/6] Improve -fprefetch-loop-arrays in general and for AArch64 in particular Andrew Pinski
@ 2017-02-20 14:44 ` Kyrill Tkachov
  2017-02-28 10:03   ` Maxim Kuvyrkov
  7 siblings, 1 reply; 32+ messages in thread
From: Kyrill Tkachov @ 2017-02-20 14:44 UTC (permalink / raw)
  To: Maxim Kuvyrkov, GCC Patches; +Cc: Andrew Pinski, Richard Guenther

Hi Maxim,

On 30/01/17 11:24, Maxim Kuvyrkov wrote:
> This patch series improves -fprefetch-loop-arrays pass through small fixes and tweaks, and then enables it for several AArch64 cores.
>
> My tunings were done on and for Qualcomm hardware, with results varying between +0.5-1.9% for SPEC2006 INT and +0.25%-1.0% for SPEC2006 FP at -O3, depending on hardware revision.
>
> This patch series enables restricted -fprefetch-loop-arrays at -O2, which also improves SPEC2006 numbers.
>
> Biggest progressions are on 429.mcf and 437.leslie3d, with no serious regressions on other benchmarks.
>
> I'm now investigating making -fprefetch-loop-arrays more aggressive for Qualcomm hardware, which improves performance on most benchmarks, but also causes big regressions on 454.calculix and 462.libquantum.  If I can fix these two regressions, prefetching will give another boost to AArch64.
>
> Andrew just posted similar prefetching tunings for Cavium's cores, and the two patches have trivial conflicts.  I'll post mine as-is, since it addresses one of the comments on Andrew's review (adding a stand-alone struct for tuning parameters).
>
> Andrew, feel free to just copy-paste it to your patch, since it is just a mechanical change.
>
> All patches were bootstrapped and regtested on x86_64-linux-gnu and aarch64-linux-gnu.
>   

I've tried these patches out on Cortex-A72 and Cortex-A53, with the tuning structs entries appropriately
modified to enable the changes on those cores.
I'm seeing the mcf and leslie3d improvements as well on Cortex-A72 and Cortex-A53 and no noticeable regressions.
I've also verified that the improvements are due to the prefetch instructions rather than just the unrolling that
the pass does.
So I'm in favor of enabling this for the cores that benefit from it.

Do you plan to get this in for GCC 8?
Thanks,
Kyrill

> --
> Maxim Kuvyrkov
> www.linaro.org
>
>
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/6] Improve -fprefetch-loop-arrays in general and for AArch64 in particular
  2017-02-20 14:44 ` Kyrill Tkachov
@ 2017-02-28 10:03   ` Maxim Kuvyrkov
  2017-05-28  8:18     ` Andrew Pinski
  0 siblings, 1 reply; 32+ messages in thread
From: Maxim Kuvyrkov @ 2017-02-28 10:03 UTC (permalink / raw)
  To: Kyrill Tkachov; +Cc: GCC Patches, Andrew Pinski, Richard Guenther

> On Feb 20, 2017, at 5:38 PM, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote:
> 
> Hi Maxim,
> 
> On 30/01/17 11:24, Maxim Kuvyrkov wrote:
>> This patch series improves -fprefetch-loop-arrays pass through small fixes and tweaks, and then enables it for several AArch64 cores.
>> 
>> My tunings were done on and for Qualcomm hardware, with results varying between +0.5-1.9% for SPEC2006 INT and +0.25%-1.0% for SPEC2006 FP at -O3, depending on hardware revision.
>> 
>> This patch series enables restricted -fprefetch-loop-arrays at -O2, which also improves SPEC2006 numbers.
>>
>> Biggest progressions are on 429.mcf and 437.leslie3d, with no serious regressions on other benchmarks.
>> 
>> I'm now investigating making -fprefetch-loop-arrays more aggressive for Qualcomm hardware, which improves performance on most benchmarks, but also causes big regressions on 454.calculix and 462.libquantum.  If I can fix these two regressions, prefetching will give another boost to AArch64.
>> 
>> Andrew just posted similar prefetching tunings for Cavium's cores, and the two patches have trivial conflicts.  I'll post mine as-is, since it addresses one of the comments on Andrew's review (adding a stand-alone struct for tuning parameters).
>> 
>> Andrew, feel free to just copy-paste it to your patch, since it is just a mechanical change.
>> 
>> All patches were bootstrapped and regtested on x86_64-linux-gnu and aarch64-linux-gnu.
>>  
> 
> I've tried these patches out on Cortex-A72 and Cortex-A53, with the tuning structs entries appropriately
> modified to enable the changes on those cores.
> I'm seeing the mcf and leslie3d improvements as well on Cortex-A72 and Cortex-A53 and no noticeable regressions.
> I've also verified that the improvements are due to the prefetch instructions rather than just the unrolling that
> the pass does.
> So I'm in favor of enabling this for the cores that benefit from it.
> 
> Do you plan to get this in for GCC 8?

Hi Kyrill,

My hope was to push them in time for GCC 7, but it seems too late now.  I'll return to these patches at the beginning of Stage 1.

--
Maxim Kuvyrkov
www.linaro.org

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/6] Improve -fprefetch-loop-arrays in general and for AArch64 in particular
  2017-02-28 10:03   ` Maxim Kuvyrkov
@ 2017-05-28  8:18     ` Andrew Pinski
  2017-05-29 10:24       ` Maxim Kuvyrkov
  2017-06-09 16:06       ` Maxim Kuvyrkov
  0 siblings, 2 replies; 32+ messages in thread
From: Andrew Pinski @ 2017-05-28  8:18 UTC (permalink / raw)
  To: Maxim Kuvyrkov; +Cc: Kyrill Tkachov, GCC Patches, Richard Guenther

On Tue, Feb 28, 2017 at 1:53 AM, Maxim Kuvyrkov
<maxim.kuvyrkov@linaro.org> wrote:
>> On Feb 20, 2017, at 5:38 PM, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote:
>>
>> Hi Maxim,
>>
>> On 30/01/17 11:24, Maxim Kuvyrkov wrote:
>>> This patch series improves the -fprefetch-loop-arrays pass through small fixes and tweaks, and then enables it for several AArch64 cores.
>>>
>>> My tunings were done on and for Qualcomm hardware, with results varying between +0.5-1.9% for SPEC2006 INT and +0.25%-1.0% for SPEC2006 FP at -O3, depending on hardware revision.
>>>
>>> This patch series enables restricted -fprefetch-loop-arrays at -O2, which also improves SPEC2006 numbers.
>>>
>>> Biggest progressions are on 419.mcf and 437.leslie3d, with no serious regressions on other benchmarks.
>>>
>>> I'm now investigating making -fprefetch-loop-arrays more aggressive for Qualcomm hardware, which improves performance on most benchmarks, but also causes big regressions on 454.calculix and 462.libquantum.  If I can fix these two regressions, prefetching will give another boost to AArch64.
>>>
>>> Andrew just posted similar prefetching tunings for Cavium's cores, and the two patches have trivial conflicts.  I'll post mine as-is, since it addresses one of the comments on Andrew's review (adding a stand-alone struct for tuning parameters).
>>>
>>> Andrew, feel free to just copy-paste it to your patch, since it is just a mechanical change.
>>>
>>> All patches were bootstrapped and regtested on x86_64-linux-gnu and aarch64-linux-gnu.
>>>
>>
>> I've tried these patches out on Cortex-A72 and Cortex-A53, with the tuning structs entries appropriately
>> modified to enable the changes on those cores.
>> I'm seeing the mcf and leslie3d improvements as well on Cortex-A72 and Cortex-A53 and no noticeable regressions.
>> I've also verified that the improvements are due to the prefetch instructions rather than just the unrolling that
>> the pass does.
>> So I'm in favor of enabling this for the cores that benefit from it.
>>
>> Do you plan to get this in for GCC 8?
>
> Hi Kyrill,
>
>> My hope was to push them in time for GCC 7, but it seems too late now.  I'll return to these patches at the beginning of Stage 1.

Ping on this patch set as I really want to get in the prefetching side
for ThunderX 1 and 2.  Or should I resubmit my patch set?

Thanks,
Andrew

>
> --
> Maxim Kuvyrkov
> www.linaro.org
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/6] Port prefetch configuration from aarch32 to aarch64
  2017-01-30 11:53 ` [PATCH 4/6] Port prefetch configuration from aarch32 to aarch64 Maxim Kuvyrkov
  2017-01-30 16:39   ` Andrew Pinski
@ 2017-05-29 10:07   ` Maxim Kuvyrkov
  1 sibling, 0 replies; 32+ messages in thread
From: Maxim Kuvyrkov @ 2017-05-29 10:07 UTC (permalink / raw)
  To: Richard Earnshaw, James Greenhalgh
  Cc: Kyrylo Tkachov, Andrew Pinski, Richard Guenther, GCC Patches

[-- Attachment #1: Type: text/plain, Size: 759 bytes --]

> On Jan 30, 2017, at 2:48 PM, Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org> wrote:
> 
> This patch ports the prefetch configuration from the aarch32 backend to aarch64.  There is no code-generation change from this patch.
> 
> This patch also happens to address Kyrill's comment on Andrew's prefetching patch at https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02133.html .
> 
> This patch also fixes a minor bug in aarch64_override_options_internal(), which used "selected_cpu->tune" instead of "aarch64_tune_params".
> 
> Bootstrapped and regtested on x86_64-linux-gnu and aarch64-linux-gnu.


AArch64 maintainers, ping?  Here is a patch rebased against current trunk.  OK, assuming bootstrap and reg-test pass?


--
Maxim Kuvyrkov
www.linaro.org



[-- Attachment #2: 0003-Port-prefetch-configuration-from-aarch32-to-aarch64-.patch --]
[-- Type: application/octet-stream, Size: 8013 bytes --]

From 6376f11ae2ff95d992d93efee9a6b6506835477e Mon Sep 17 00:00:00 2001
From: Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org>
Date: Thu, 15 Dec 2016 12:42:54 +0000
Subject: [PATCH 3/6] Port prefetch configuration from aarch32 to aarch64
 backend.

	* config/aarch64/aarch64-protos.h (struct tune_params):
	Replace cache_line_size with struct { } prefetch -- same as in
	arm-protos.h.
	* config/aarch64/aarch64.c (AARCH64_PREFETCH_NOT_BENEFICIAL,)
	(AARCH64_PREFETCH_BENEFICIAL): New macros.
	(tune_params *_tunings): Update all tunings (no functional change).
	(aarch64_override_options_internal): Set PARAM_SIMULTANEOUS_PREFETCHES,
	PARAM_L1_CACHE_SIZE, PARAM_L1_CACHE_LINE_SIZE, and PARAM_L2_CACHE_SIZE
	from tunings structures.

Change-Id: I1ddbac1863dcf078a2e5b14dd904debc76a7da94
---
 gcc/config/aarch64/aarch64-protos.h | 11 +++++---
 gcc/config/aarch64/aarch64.c        | 54 +++++++++++++++++++++++++++----------
 2 files changed, 48 insertions(+), 17 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index ac91865..edca3e2 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -224,9 +224,14 @@ struct tune_params
   int min_div_recip_mul_df;
   /* Value for aarch64_case_values_threshold; or 0 for the default.  */
   unsigned int max_case_values;
-  /* Value for PARAM_L1_CACHE_LINE_SIZE; or 0 to use the default.  */
-  unsigned int cache_line_size;
-
+  /* Explicit prefetch data.  */
+  struct
+    {
+      int num_slots;
+      int l1_cache_size;
+      int l1_cache_line_size;
+      int l2_cache_size;
+    } prefetch;
 /* An enum specifying how to take into account CPU autoprefetch capabilities
    during instruction scheduling:
    - AUTOPREFETCHER_OFF: Do not take autoprefetch capabilities into account.
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 5e546bf..a87b216 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -526,6 +526,15 @@ static const cpu_approx_modes xgene1_approx_modes =
   AARCH64_APPROX_ALL	/* recip_sqrt  */
 };
 
+#define AARCH64_PREFETCH_NOT_BENEFICIAL { 0, -1, -1, -1 }
+#define AARCH64_PREFETCH_BENEFICIAL(num_slots,l1_size,l1_line_size,l2_size) \
+  {									\
+    num_slots,								\
+    l1_size,								\
+    l1_line_size,							\
+    l2_size								\
+  }
+
 static const struct tune_params generic_tunings =
 {
   &cortexa57_extra_costs,
@@ -546,7 +555,7 @@ static const struct tune_params generic_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
+  AARCH64_PREFETCH_NOT_BENEFICIAL,
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
 };
@@ -572,7 +581,7 @@ static const struct tune_params cortexa35_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
+  AARCH64_PREFETCH_NOT_BENEFICIAL,
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
 };
@@ -598,7 +607,7 @@ static const struct tune_params cortexa53_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
+  AARCH64_PREFETCH_NOT_BENEFICIAL,
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
 };
@@ -624,7 +633,7 @@ static const struct tune_params cortexa57_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
+  AARCH64_PREFETCH_NOT_BENEFICIAL,
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */
 };
@@ -650,7 +659,7 @@ static const struct tune_params cortexa72_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
+  AARCH64_PREFETCH_NOT_BENEFICIAL,
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
 };
@@ -676,7 +685,7 @@ static const struct tune_params cortexa73_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
+  AARCH64_PREFETCH_NOT_BENEFICIAL,
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
 };
@@ -701,7 +710,7 @@ static const struct tune_params exynosm1_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   48,	/* max_case_values.  */
-  64,	/* cache_line_size.  */
+  AARCH64_PREFETCH_BENEFICIAL (0, -1, 64, -1),
   tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE) /* tune_flags.  */
 };
@@ -726,7 +735,7 @@ static const struct tune_params thunderx_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
+  AARCH64_PREFETCH_NOT_BENEFICIAL,
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)	/* tune_flags.  */
 };
@@ -751,7 +760,7 @@ static const struct tune_params xgene1_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
+  AARCH64_PREFETCH_NOT_BENEFICIAL,
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
 };
@@ -777,7 +786,7 @@ static const struct tune_params qdf24xx_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  64,	/* cache_line_size.  */
+  AARCH64_PREFETCH_BENEFICIAL (0, -1, 64, -1),
   tune_params::AUTOPREFETCHER_STRONG,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)		/* tune_flags.  */
 };
@@ -802,7 +811,7 @@ static const struct tune_params thunderx2t99_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  64,	/* cache_line_size.  */
+  AARCH64_PREFETCH_BENEFICIAL (0, -1, 64, -1),
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
 };
@@ -8749,10 +8758,27 @@ aarch64_override_options_internal (struct gcc_options *opts)
 			 opts->x_param_values,
 			 global_options_set.x_param_values);
 
-  /* Set the L1 cache line size.  */
-  if (selected_cpu->tune->cache_line_size != 0)
+  /* Set up parameters to be used in prefetching algorithm.  Do not
+     override the defaults unless we are tuning for a core we have
+     researched values for.  */
+  if (aarch64_tune_params.prefetch.num_slots > 0)
+    maybe_set_param_value (PARAM_SIMULTANEOUS_PREFETCHES,
+			   aarch64_tune_params.prefetch.num_slots,
+			   opts->x_param_values,
+			   global_options_set.x_param_values);
+  if (aarch64_tune_params.prefetch.l1_cache_size >= 0)
+    maybe_set_param_value (PARAM_L1_CACHE_SIZE,
+			   aarch64_tune_params.prefetch.l1_cache_size,
+			   opts->x_param_values,
+			   global_options_set.x_param_values);
+  if (aarch64_tune_params.prefetch.l1_cache_line_size >= 0)
     maybe_set_param_value (PARAM_L1_CACHE_LINE_SIZE,
-			   selected_cpu->tune->cache_line_size,
+			   aarch64_tune_params.prefetch.l1_cache_line_size,
+			   opts->x_param_values,
+			   global_options_set.x_param_values);
+  if (aarch64_tune_params.prefetch.l2_cache_size >= 0)
+    maybe_set_param_value (PARAM_L2_CACHE_SIZE,
+			   aarch64_tune_params.prefetch.l2_cache_size,
 			   opts->x_param_values,
 			   global_options_set.x_param_values);
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6][AArch64] Enable -fprefetch-loop-arrays at -O3 for cores that benefit from prefetching.
  2017-02-03 11:58       ` Maxim Kuvyrkov
@ 2017-05-29 10:09         ` Maxim Kuvyrkov
  2017-06-08 16:31         ` James Greenhalgh
  1 sibling, 0 replies; 32+ messages in thread
From: Maxim Kuvyrkov @ 2017-05-29 10:09 UTC (permalink / raw)
  To: Richard Biener, James Greenhalgh
  Cc: Kyrill Tkachov, GCC Patches, Andrew Pinski

[-- Attachment #1: Type: text/plain, Size: 2409 bytes --]

On 3 February 2017 at 14:58, Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org> wrote:
>> On Jan 30, 2017, at 5:50 PM, Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org> wrote:
>>
>>> On Jan 30, 2017, at 3:23 PM, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote:
>>>
>>> Hi Maxim,
>>>
>>> On 30/01/17 12:06, Maxim Kuvyrkov wrote:
>>>> This patch enables prefetching at -O3 for aarch64 cores that set "simultaneous prefetches" parameter above 0.  There are currently no such settings, so this patch doesn't change default code generation.
>>>>
>>>> I'm now working on improvements to the -fprefetch-loop-arrays pass to make it suitable for -O2. I'll post this work in the next month.
>>>>
>>>> Bootstrapped and regtested on x86_64-linux-gnu and aarch64-linux-gnu.
>>>>
>>>
>>> Are you aiming to get this in for GCC 8?
>>> I have one small comment on this patch:
>>>
>>> +  /* Enable sw prefetching at -O3 for CPUS that have prefetch, and we
>>> +     have deemed it beneficial (signified by setting
>>> +     prefetch.num_slots to 1 or more).  */
>>> +  if (flag_prefetch_loop_arrays < 0
>>> +      && HAVE_prefetch
>>>
>>> HAVE_prefetch will always be true on aarch64.
>>> I imagine midend code that had logic like this would need this check, but aarch64-specific code shouldn't need it.
>>
>> Agree, I'll remove HAVE_prefetch.
>>
>> This pattern was copied from other backends, and HAVE_prefetch is most likely a historical artifact.
>
> Andrew raised a good point in the review of his patch: it is a bad idea to use one of the prefetching parameters (simultaneous_prefetches) as an indicator of whether to enable the prefetching pass by default.  Indeed, there are cases where we want to set simultaneous_prefetches according to HW documentation (or experimental results), but not enable the prefetching pass by default.
>
> This update to the patch addresses it.  The patch adds a new explicit field, "default_opt_level", to the prefetch tuning structure; it sets the optimization level from which prefetching should be enabled by default.  The current setting enables prefetching at -O3; additionally, this parameter will come in handy for enabling prefetching at -O2 [when it is ready].
>
> --
> Maxim Kuvyrkov
> www.linaro.org
>
>

AArch64 maintainers, ping?  Here is a rebased patch against current
trunk.  OK to commit, assuming bootstrap and reg-test pass?

-- 
Maxim Kuvyrkov
www.linaro.org

[-- Attachment #2: 0004-Enable-fprefetch-loop-arrays-at-O3-for-cores-that-be.patch --]
[-- Type: application/octet-stream, Size: 3798 bytes --]

From a882c45b74a9480c9b3f52d731f3ab0abeca8a97 Mon Sep 17 00:00:00 2001
From: Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org>
Date: Thu, 15 Dec 2016 12:50:07 +0000
Subject: [PATCH 4/6] Enable -fprefetch-loop-arrays at -O3 for cores that
 benefit from prefetching.

	* config/aarch64/aarch64.c (aarch64_override_options_internal):
	Set flag_prefetch_loop_arrays if tuning sets prefetching slots > 1.

Change-Id: Id41411e671e0a55dc7268e0ad0a4e8ff1421c90a
---
 gcc/config/aarch64/aarch64-protos.h |  1 +
 gcc/config/aarch64/aarch64.c        | 25 +++++++++++++++++++------
 2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index edca3e2..27725e2 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -231,6 +231,7 @@ struct tune_params
       int l1_cache_size;
       int l1_cache_line_size;
       int l2_cache_size;
+      int default_opt_level;
     } prefetch;
 /* An enum specifying how to take into account CPU autoprefetch capabilities
    during instruction scheduling:
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index a87b216..35db3f2 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -526,13 +526,15 @@ static const cpu_approx_modes xgene1_approx_modes =
   AARCH64_APPROX_ALL	/* recip_sqrt  */
 };
 
-#define AARCH64_PREFETCH_NOT_BENEFICIAL { 0, -1, -1, -1 }
-#define AARCH64_PREFETCH_BENEFICIAL(num_slots,l1_size,l1_line_size,l2_size) \
+#define AARCH64_PREFETCH_NOT_BENEFICIAL { 0, -1, -1, -1, -1 }
+#define AARCH64_PREFETCH_BENEFICIAL(num_slots,l1_size,l1_line_size,l2_size, \
+				    opt_level)				\
   {									\
     num_slots,								\
     l1_size,								\
     l1_line_size,							\
-    l2_size								\
+    l2_size,								\
+    opt_level								\
   }
 
 static const struct tune_params generic_tunings =
@@ -710,7 +712,7 @@ static const struct tune_params exynosm1_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   48,	/* max_case_values.  */
-  AARCH64_PREFETCH_BENEFICIAL (0, -1, 64, -1),
+  AARCH64_PREFETCH_BENEFICIAL (0, -1, 64, -1, -1),
   tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE) /* tune_flags.  */
 };
@@ -786,7 +788,7 @@ static const struct tune_params qdf24xx_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  AARCH64_PREFETCH_BENEFICIAL (0, -1, 64, -1),
+  AARCH64_PREFETCH_BENEFICIAL (0, -1, 64, -1, -1),
   tune_params::AUTOPREFETCHER_STRONG,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)		/* tune_flags.  */
 };
@@ -811,7 +813,7 @@ static const struct tune_params thunderx2t99_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  AARCH64_PREFETCH_BENEFICIAL (0, -1, 64, -1),
+  AARCH64_PREFETCH_BENEFICIAL (0, -1, 64, -1, -1),
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
 };
@@ -8782,6 +8784,17 @@ aarch64_override_options_internal (struct gcc_options *opts)
 			   opts->x_param_values,
 			   global_options_set.x_param_values);
 
+  /* Enable sw prefetching at specified optimization level for
+     CPUS that have prefetch.  Lower optimization level threshold by 1
+     when profiling is enabled.  */
+  if (opts->x_flag_prefetch_loop_arrays < 0
+      && !opts->x_optimize_size
+      && aarch64_tune_params.prefetch.default_opt_level >= 0
+      && (opts->x_optimize
+	  + (opts->x_flag_profile_use
+	     ? 1 : 0)) >= aarch64_tune_params.prefetch.default_opt_level)
+    opts->x_flag_prefetch_loop_arrays = 1;
+
   aarch64_override_options_after_change_1 (opts);
 }
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/6] Improve -fprefetch-loop-arrays in general and for AArch64 in particular
  2017-05-28  8:18     ` Andrew Pinski
@ 2017-05-29 10:24       ` Maxim Kuvyrkov
  2017-06-09 16:06       ` Maxim Kuvyrkov
  1 sibling, 0 replies; 32+ messages in thread
From: Maxim Kuvyrkov @ 2017-05-29 10:24 UTC (permalink / raw)
  To: Andrew Pinski; +Cc: Kyrill Tkachov, GCC Patches, Richard Guenther

Hi Andrew,

Thanks for pinging this.  I've re-started the submission.

On 28 May 2017 at 08:01, Andrew Pinski <apinski@cavium.com> wrote:
> On Tue, Feb 28, 2017 at 1:53 AM, Maxim Kuvyrkov
> <maxim.kuvyrkov@linaro.org> wrote:
>>> On Feb 20, 2017, at 5:38 PM, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote:
>>>
>>> Hi Maxim,
>>>
>>> On 30/01/17 11:24, Maxim Kuvyrkov wrote:
>>>> This patch series improves the -fprefetch-loop-arrays pass through small fixes and tweaks, and then enables it for several AArch64 cores.
>>>>
>>>> My tunings were done on and for Qualcomm hardware, with results varying between +0.5-1.9% for SPEC2006 INT and +0.25%-1.0% for SPEC2006 FP at -O3, depending on hardware revision.
>>>>
>>>> This patch series enables restricted -fprefetch-loop-arrays at -O2, which also improves SPEC2006 numbers.
>>>>
>>>> Biggest progressions are on 419.mcf and 437.leslie3d, with no serious regressions on other benchmarks.
>>>>
>>>> I'm now investigating making -fprefetch-loop-arrays more aggressive for Qualcomm hardware, which improves performance on most benchmarks, but also causes big regressions on 454.calculix and 462.libquantum.  If I can fix these two regressions, prefetching will give another boost to AArch64.
>>>>
>>>> Andrew just posted similar prefetching tunings for Cavium's cores, and the two patches have trivial conflicts.  I'll post mine as-is, since it addresses one of the comments on Andrew's review (adding a stand-alone struct for tuning parameters).
>>>>
>>>> Andrew, feel free to just copy-paste it to your patch, since it is just a mechanical change.
>>>>
>>>> All patches were bootstrapped and regtested on x86_64-linux-gnu and aarch64-linux-gnu.
>>>>
>>>
>>> I've tried these patches out on Cortex-A72 and Cortex-A53, with the tuning structs entries appropriately
>>> modified to enable the changes on those cores.
>>> I'm seeing the mcf and leslie3d improvements as well on Cortex-A72 and Cortex-A53 and no noticeable regressions.
>>> I've also verified that the improvements are due to the prefetch instructions rather than just the unrolling that
>>> the pass does.
>>> So I'm in favor of enabling this for the cores that benefit from it.
>>>
>>> Do you plan to get this in for GCC 8?
>>
>> Hi Kyrill,
>>
>> My hope was to push them in time for GCC 7, but it seems too late now.  I'll return to these patches at the beginning of Stage 1.
>
> Ping on this patch set as I really want to get in the prefetching side
> for ThunderX 1 and 2.  Or should I resubmit my patch set?
>
> Thanks,
> Andrew
>
>>
>> --
>> Maxim Kuvyrkov
>> www.linaro.org
>>



-- 
Maxim Kuvyrkov
www.linaro.org

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/6] Port prefetch configuration from aarch32 to aarch64
  2017-01-30 16:39   ` Andrew Pinski
  2017-02-03 11:52     ` Maxim Kuvyrkov
@ 2017-06-08 13:48     ` James Greenhalgh
  2017-06-08 15:13       ` Richard Earnshaw (lists)
  1 sibling, 1 reply; 32+ messages in thread
From: James Greenhalgh @ 2017-06-08 13:48 UTC (permalink / raw)
  To: Andrew Pinski
  Cc: Maxim Kuvyrkov, GCC Patches, Kyrylo Tkachov, Richard Guenther, nd

On Mon, Jan 30, 2017 at 08:35:00AM -0800, Andrew Pinski wrote:
> On Mon, Jan 30, 2017 at 3:48 AM, Maxim Kuvyrkov
> <maxim.kuvyrkov@linaro.org> wrote:
> > This patch ports the prefetch configuration from the aarch32 backend to aarch64.  There is no code-generation change from this patch.
> >
> > This patch also happens to address Kyrill's comment on Andrew's prefetching patch at https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02133.html .
> >
> > This patch also fixes a minor bug in aarch64_override_options_internal(), which used "selected_cpu->tune" instead of "aarch64_tune_params".
> 
> I am not a fan of the macro at all.

I'm with Andrew for this. The precedent in the AArch64 port is for
explicitly spelling this out, as we do with the branch costs, approx_modes,
vector costs etc.

I'd rather we went that route than the macro you're using. I don't have
any objections to the rest of your patch.

Thanks,
James

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/6] Port prefetch configuration from aarch32 to aarch64
  2017-06-08 13:48     ` James Greenhalgh
@ 2017-06-08 15:13       ` Richard Earnshaw (lists)
  2017-06-09  7:32         ` Maxim Kuvyrkov
  0 siblings, 1 reply; 32+ messages in thread
From: Richard Earnshaw (lists) @ 2017-06-08 15:13 UTC (permalink / raw)
  To: James Greenhalgh, Andrew Pinski
  Cc: Maxim Kuvyrkov, GCC Patches, Kyrylo Tkachov, Richard Guenther, nd

On 08/06/17 14:47, James Greenhalgh wrote:
> On Mon, Jan 30, 2017 at 08:35:00AM -0800, Andrew Pinski wrote:
>> On Mon, Jan 30, 2017 at 3:48 AM, Maxim Kuvyrkov
>> <maxim.kuvyrkov@linaro.org> wrote:
>>> This patch ports the prefetch configuration from the aarch32 backend to aarch64.  There is no code-generation change from this patch.
>>>
>>> This patch also happens to address Kyrill's comment on Andrew's prefetching patch at https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02133.html .
>>>
>>> This patch also fixes a minor bug in aarch64_override_options_internal(), which used "selected_cpu->tune" instead of "aarch64_tune_params".
>>
>> I am not a fan of the macro at all.
> 
> I'm with Andrew for this. The precedent in the AArch64 port is for
> explicitly spelling this out, as we do with the branch costs, approx_modes,
> vector costs etc.
> 
> I'd rather we went that route than the macro you're using. I don't have
> any objections to the rest of your patch.
> 
> Thanks,
> James
> 

I disagree with having to write all these things out, but I do agree
that we should be self-consistent within the port.

The purpose of introducing the macros in the ARM port was to avoid the
common problem of merging a change that adds new parameters with a
change that adds new ports.  C initializers for these tables assign
values sequentially, with zero-padding at the end, so failing to merge
such changes carefully leads to real bugs in the compiler (even if just
performance bugs).

I notice that the last entry in the current tune params table is an int,
rather than something that is type-checked (like the penultimate entry -
an enum).  Having the final entry be type checked at least ensures that
the right number of elements exist, even if the order is not strictly
checked.

My 2p.

R.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6][AArch64] Enable -fprefetch-loop-arrays at -O3 for cores that  benefit from prefetching.
  2017-02-03 11:58       ` Maxim Kuvyrkov
  2017-05-29 10:09         ` Maxim Kuvyrkov
@ 2017-06-08 16:31         ` James Greenhalgh
  2017-06-09 16:00           ` Maxim Kuvyrkov
  1 sibling, 1 reply; 32+ messages in thread
From: James Greenhalgh @ 2017-06-08 16:31 UTC (permalink / raw)
  To: Maxim Kuvyrkov
  Cc: Kyrill Tkachov, GCC Patches, Andrew Pinski, Richard Guenther, nd

On Fri, Feb 03, 2017 at 02:58:23PM +0300, Maxim Kuvyrkov wrote:
> > On Jan 30, 2017, at 5:50 PM, Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org> wrote:
> > 
> >> On Jan 30, 2017, at 3:23 PM, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote:
> >> 
> >> Hi Maxim,
> >> 
> >> On 30/01/17 12:06, Maxim Kuvyrkov wrote:
> >>> This patch enables prefetching at -O3 for aarch64 cores that set "simultaneous prefetches" parameter above 0.  There are currently no such settings, so this patch doesn't change default code generation.
> >>> 
> >>> I'm now working on improvements to the -fprefetch-loop-arrays pass to make it suitable for -O2. I'll post this work in the next month.
> >>> 
> >>> Bootstrapped and regtested on x86_64-linux-gnu and aarch64-linux-gnu.
> >>> 
> >> 
> >> Are you aiming to get this in for GCC 8?
> >> I have one small comment on this patch:
> >> 
> >> +  /* Enable sw prefetching at -O3 for CPUS that have prefetch, and we
> >> +     have deemed it beneficial (signified by setting
> >> +     prefetch.num_slots to 1 or more).  */
> >> +  if (flag_prefetch_loop_arrays < 0
> >> +      && HAVE_prefetch
> >> 
> >> HAVE_prefetch will always be true on aarch64.
> >> I imagine midend code that had logic like this would need this check, but aarch64-specific code shouldn't need it.
> > 
> > Agree, I'll remove HAVE_prefetch.
> > 
> > This pattern was copied from other backends, and HAVE_prefetch is most likely a historical artifact.
> 
> Andrew raised a good point in the review of his patch: it is a bad idea
> to use one of the prefetching parameters (simultaneous_prefetches) as an
> indicator of whether to enable the prefetching pass by default.  Indeed,
> there are cases where we want to set simultaneous_prefetches according to
> HW documentation (or experimental results), but not enable the prefetching
> pass by default.
> 
> This update to the patch addresses it.  The patch adds a new explicit
> field, "default_opt_level", to the prefetch tuning structure; it sets the
> optimization level from which prefetching should be enabled by default.
> The current setting enables prefetching at -O3; additionally, this
> parameter will come in handy for enabling prefetching at -O2 [when it is
> ready].

I really don't like the scheme of changing the optimisation threshold when
profiling data is used.

I've seen too many reports and presentations by the uninitiated who believe
that the use of profiling data has made the difference, when in reality
it is just GCC changing behaviour on which passes run. It is very
misleading!

With that line removed, and any rebasing needed over changes to the macro,
I'm happy with this patch.

Thanks,
James

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/6][AArch64] Update prefetch tuning parameters for falkor and qdf24xx  tunings.
  2017-01-30 12:14 ` [PATCH 6/6][AArch64] Update prefetch tuning parameters for falkor and qdf24xx tunings Maxim Kuvyrkov
@ 2017-06-08 16:32   ` James Greenhalgh
  2017-06-09 16:02     ` Maxim Kuvyrkov
  0 siblings, 1 reply; 32+ messages in thread
From: James Greenhalgh @ 2017-06-08 16:32 UTC (permalink / raw)
  To: Maxim Kuvyrkov
  Cc: GCC Patches, Kyrylo Tkachov, Andrew Pinski, Richard Guenther, nd

On Mon, Jan 30, 2017 at 03:08:04PM +0300, Maxim Kuvyrkov wrote:
> This patch enables software prefetching at -O3 for Qualcomm's qdf24xx cores.
> 
> Bootstrapped and regtested on x86_64-linux-gnu and aarch64-linux-gnu.

This patch is OK in whatever form it takes after rebasing for the macro
in 4/6.

Thanks,
James


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/6] Port prefetch configuration from aarch32 to aarch64
  2017-06-08 15:13       ` Richard Earnshaw (lists)
@ 2017-06-09  7:32         ` Maxim Kuvyrkov
  2017-06-09 10:04           ` James Greenhalgh
  0 siblings, 1 reply; 32+ messages in thread
From: Maxim Kuvyrkov @ 2017-06-09  7:32 UTC (permalink / raw)
  To: Richard Earnshaw (lists)
  Cc: James Greenhalgh, Andrew Pinski, GCC Patches, Kyrylo Tkachov,
	Richard Guenther, nd

> On Jun 8, 2017, at 6:13 PM, Richard Earnshaw (lists) <Richard.Earnshaw@arm.com> wrote:
> 
> On 08/06/17 14:47, James Greenhalgh wrote:
>> On Mon, Jan 30, 2017 at 08:35:00AM -0800, Andrew Pinski wrote:
>>> On Mon, Jan 30, 2017 at 3:48 AM, Maxim Kuvyrkov
>>> <maxim.kuvyrkov@linaro.org> wrote:
>>>> This patch ports the prefetch configuration from the aarch32 backend to aarch64.  There is no code-generation change from this patch.
>>>> 
>>>> This patch also happens to address Kyrill's comment on Andrew's prefetching patch at https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02133.html .
>>>> 
>>>> This patch also fixes a minor bug in aarch64_override_options_internal(), which used "selected_cpu->tune" instead of "aarch64_tune_params".
>>> 
>>> I am not a fan of the macro at all.
>> 
>> I'm with Andrew for this. The precedent in the AArch64 port is for
>> explicitly spelling this out, as we do with the branch costs, approx_modes,
>> vector costs etc.
>> 
>> I'd rather we went that route than the macro you're using. I don't have
>> any objections to the rest of your patch.
>> 
>> Thanks,
>> James
>> 
> 
> I disagree with having to write all these things out, but I do agree
> that we should be self-consistent within the port.

I'm rewriting the patch with the approach that James suggested; instead of AARCH64_PREFETCH_NOT_BENEFICIAL there will be "&generic_prefetch_tune", so there isn't much copy-pasting of parameters.

> 
> The purpose of introducing the macros in the ARM port was to avoid the
> common problem of merging adding new parameters with adding new ports.
> C initializers for these tables assign values sequentially  with
> zero-padding at the end, so failing to merge such changes carefully
> leads to real bugs in the compiler (even if just performance bugs).
> 
> I notice that the last entry in the current tune params table is an int,
> rather than something that is type-checked (like the penultimate entry -
> an enum).  Having the final entry be type checked at least ensures that
> the right number of elements exist, even if the order is not strictly
> checked.

If you prefer, I can move "&generic_prefetch_tune" to the bottom of "struct tune_params", thus enabling type checking on the struct pointer.

--
Maxim Kuvyrkov
www.linaro.org




* Re: [PATCH 4/6] Port prefetch configuration from aarch32 to aarch64
  2017-06-09  7:32         ` Maxim Kuvyrkov
@ 2017-06-09 10:04           ` James Greenhalgh
  2017-06-09 15:56             ` Maxim Kuvyrkov
  0 siblings, 1 reply; 32+ messages in thread
From: James Greenhalgh @ 2017-06-09 10:04 UTC (permalink / raw)
  To: Maxim Kuvyrkov
  Cc: Richard Earnshaw (lists),
	Andrew Pinski, GCC Patches, Kyrylo Tkachov, Richard Guenther, nd

On Fri, Jun 09, 2017 at 10:30:02AM +0300, Maxim Kuvyrkov wrote:
> > On Jun 8, 2017, at 6:13 PM, Richard Earnshaw (lists) <Richard.Earnshaw@arm.com> wrote:
> > 
> > On 08/06/17 14:47, James Greenhalgh wrote:
> >> On Mon, Jan 30, 2017 at 08:35:00AM -0800, Andrew Pinski wrote:
> >>> On Mon, Jan 30, 2017 at 3:48 AM, Maxim Kuvyrkov
> >>> <maxim.kuvyrkov@linaro.org> wrote:
> >>>> This patch port prefetch configuration from aarch32 backend to aarch64.  There is no code-generation change from this patch.
> >>>> 
> >>>> This patch also happens to address Kyrill's comment on Andrew's prefetching patch at https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02133.html .
> >>>> 
> >>>> This patch also fixes a minor bug in aarch64_override_options_internal(), which used "selected_cpu->tune" instead of "aarch64_tune_params".
> >>> 
> >>> I am not a fan of the macro at all.
> >> 
> >> I'm with Andrew for this. The precedent in the AArch64 port is for
> >> explicitly spelling this out, as we do with the branch costs, approx_modes,
> >> vector costs etc.
> >> 
> >> I'd rather we went that route than the macro you're using. I don't have
> >> any objections to the rest of your patch.
> >> 
> >> Thanks,
> >> James
> >> 
> > 
> > I disagree with having to write all these things out, but I do agree
> > that we should be self-consistent within the port.
> 
> I'm re-writing the patch with approach that James suggested, and instead of
> AARCH64_PREFETCH_NOT_BENEFICIAL there will be "&generic_prefetch_tune", so
> not much copy-paste of parameters.
> 
> > 
> > The purpose of introducing the macros in the ARM port was to avoid the
> > common problem of merging adding new parameters with adding new ports.
> > C initializers for these tables assign values sequentially  with
> > zero-padding at the end, so failing to merge such changes carefully
> > leads to real bugs in the compiler (even if just performance bugs).
> > 
> > I notice that the last entry in the current tune params table is an int,
> > rather than something that is type-checked (like the penultimate entry -
> > an enum).  Having the final entry be type checked at least ensures that
> > the right number of elements exist, even if the order is not strictly
> > checked.
> 
> If you prefer, I can move "&generic_prefetch_tune" to the bottom of "struct
> tune_params", thus enabling type checking on struct pointer.

For the zero-cost/small-benefit tradeoff it gives, I'd say "why not".

With that change in place and the obvious rebase needed for patch 6/6 in
place, both this (4/6) and patch 6/6 are OK.

Cheers,
James


* Re: [PATCH 4/6] Port prefetch configuration from aarch32 to aarch64
  2017-06-09 10:04           ` James Greenhalgh
@ 2017-06-09 15:56             ` Maxim Kuvyrkov
  0 siblings, 0 replies; 32+ messages in thread
From: Maxim Kuvyrkov @ 2017-06-09 15:56 UTC (permalink / raw)
  To: James Greenhalgh
  Cc: Richard Earnshaw (lists),
	Andrew Pinski, GCC Patches, Kyrylo Tkachov, Richard Guenther, nd

[-- Attachment #1: Type: text/plain, Size: 1631 bytes --]

> On Jun 9, 2017, at 1:04 PM, James Greenhalgh <james.greenhalgh@arm.com> wrote:
>>> 
...
>>> I disagree with having to write all these things out, but I do agree
>>> that we should be self-consistent within the port.
>> 
>> I'm re-writing the patch with approach that James suggested, and instead of
>> AARCH64_PREFETCH_NOT_BENEFICIAL there will be "&generic_prefetch_tune", so
>> not much copy-paste of parameters.
>> 
>>> 
>>> The purpose of introducing the macros in the ARM port was to avoid the
>>> common problem of merging adding new parameters with adding new ports.
>>> C initializers for these tables assign values sequentially  with
>>> zero-padding at the end, so failing to merge such changes carefully
>>> leads to real bugs in the compiler (even if just performance bugs).
>>> 
>>> I notice that the last entry in the current tune params table is an int,
>>> rather than something that is type-checked (like the penultimate entry -
>>> an enum).  Having the final entry be type checked at least ensures that
>>> the right number of elements exist, even if the order is not strictly
>>> checked.
>> 
>> If you prefer, I can move "&generic_prefetch_tune" to the bottom of "struct
>> tune_params", thus enabling type checking on struct pointer.
> 
> For the zero-cost/small-benefit tradeoff it gives, I'd say "why not".
> 
> With that change in place and the obvious rebase needed for patch 6/6 in
> place, both this (4/6) and patch 6/6 are OK.

Thanks, James.

I'm going to commit this once bootstrap / regtest finish on aarch64-linux-gnu.

--
Maxim Kuvyrkov
www.linaro.org



[-- Attachment #2: 0001-Port-prefetch-configuration-from-aarch32-to-aarch64-.patch --]
[-- Type: application/octet-stream, Size: 10506 bytes --]

From 0ce9245af57905fb13d94f41a5c226d0a1560442 Mon Sep 17 00:00:00 2001
From: Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org>
Date: Thu, 15 Dec 2016 12:42:54 +0000
Subject: [PATCH 1/4] Port prefetch configuration from aarch32 to aarch64
 backend.

	* config/aarch64/aarch64-protos.h (struct cpu_prefetch_tune):
	New tune structure.
	(struct tune_params): Use cpu_prefetch_tune instead of cache_line_size.
	[Unrelated to main purpose of the patch] Place the pointer field last
	to enable type-checking errors when tune structures are wrongly merged.
	* config/aarch64/aarch64.c (generic_prefetch_tune,)
	(exynosm1_prefetch_tune, qdf24xx_prefetch_tune,)
	(thunderx2t99_prefetch_tune): New tune constants.
	(tune_params *_tunings): Update all tunings (no functional change).
	(aarch64_override_options_internal): Set PARAM_SIMULTANEOUS_PREFETCHES,
	PARAM_L1_CACHE_SIZE, PARAM_L1_CACHE_LINE_SIZE, and PARAM_L2_CACHE_SIZE
	from tunings structures.

Change-Id: I1ddbac1863dcf078a2e5b14dd904debc76a7da94
---
 gcc/config/aarch64/aarch64-protos.h |  16 ++++--
 gcc/config/aarch64/aarch64.c        | 102 +++++++++++++++++++++++++++---------
 2 files changed, 90 insertions(+), 28 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index ac91865..bb06139 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -203,6 +203,15 @@ struct cpu_approx_modes
   const unsigned int recip_sqrt;	/* Reciprocal square root.  */
 };
 
+/* Cache prefetch settings for prefetch-loop-arrays.  */
+struct cpu_prefetch_tune
+{
+  const int num_slots;
+  const int l1_cache_size;
+  const int l1_cache_line_size;
+  const int l2_cache_size;
+};
+
 struct tune_params
 {
   const struct cpu_cost_table *insn_extra_cost;
@@ -224,9 +233,6 @@ struct tune_params
   int min_div_recip_mul_df;
   /* Value for aarch64_case_values_threshold; or 0 for the default.  */
   unsigned int max_case_values;
-  /* Value for PARAM_L1_CACHE_LINE_SIZE; or 0 to use the default.  */
-  unsigned int cache_line_size;
-
 /* An enum specifying how to take into account CPU autoprefetch capabilities
    during instruction scheduling:
    - AUTOPREFETCHER_OFF: Do not take autoprefetch capabilities into account.
@@ -244,6 +250,10 @@ struct tune_params
   } autoprefetcher_model;
 
   unsigned int extra_tuning_flags;
+
+  /* Place prefetch struct pointer at the end to enable type checking
+     errors when tune_params misses elements (e.g., from erroneous merges).  */
+  const struct cpu_prefetch_tune *prefetch;
 };
 
 #define AARCH64_FUSION_PAIR(x, name) \
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index bce490f..f5fbf36 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -526,6 +526,39 @@ static const cpu_approx_modes xgene1_approx_modes =
   AARCH64_APPROX_ALL	/* recip_sqrt  */
 };
 
+/* Generic prefetch settings (which disable prefetch).  */
+static const cpu_prefetch_tune generic_prefetch_tune =
+{
+  0,			/* num_slots  */
+  -1,			/* l1_cache_size  */
+  -1,			/* l1_cache_line_size  */
+  -1			/* l2_cache_size  */
+};
+
+static const cpu_prefetch_tune exynosm1_prefetch_tune =
+{
+  0,			/* num_slots  */
+  -1,			/* l1_cache_size  */
+  64,			/* l1_cache_line_size  */
+  -1			/* l2_cache_size  */
+};
+
+static const cpu_prefetch_tune qdf24xx_prefetch_tune =
+{
+  0,			/* num_slots  */
+  -1,			/* l1_cache_size  */
+  64,			/* l1_cache_line_size  */
+  -1			/* l2_cache_size  */
+};
+
+static const cpu_prefetch_tune thunderx2t99_prefetch_tune =
+{
+  0,			/* num_slots  */
+  -1,			/* l1_cache_size  */
+  64,			/* l1_cache_line_size  */
+  -1			/* l2_cache_size  */
+};
+
 static const struct tune_params generic_tunings =
 {
   &cortexa57_extra_costs,
@@ -546,9 +579,9 @@ static const struct tune_params generic_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  &generic_prefetch_tune
 };
 
 static const struct tune_params cortexa35_tunings =
@@ -572,9 +605,9 @@ static const struct tune_params cortexa35_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  &generic_prefetch_tune
 };
 
 static const struct tune_params cortexa53_tunings =
@@ -598,9 +631,9 @@ static const struct tune_params cortexa53_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  &generic_prefetch_tune
 };
 
 static const struct tune_params cortexa57_tunings =
@@ -624,9 +657,9 @@ static const struct tune_params cortexa57_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),	/* tune_flags.  */
+  &generic_prefetch_tune
 };
 
 static const struct tune_params cortexa72_tunings =
@@ -650,9 +683,9 @@ static const struct tune_params cortexa72_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  &generic_prefetch_tune
 };
 
 static const struct tune_params cortexa73_tunings =
@@ -676,11 +709,13 @@ static const struct tune_params cortexa73_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  &generic_prefetch_tune
 };
 
+
+
 static const struct tune_params exynosm1_tunings =
 {
   &exynosm1_extra_costs,
@@ -701,9 +736,9 @@ static const struct tune_params exynosm1_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   48,	/* max_case_values.  */
-  64,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE) /* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
+  &exynosm1_prefetch_tune
 };
 
 static const struct tune_params thunderx_tunings =
@@ -726,9 +761,9 @@ static const struct tune_params thunderx_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW),	/* tune_flags.  */
+  &generic_prefetch_tune
 };
 
 static const struct tune_params xgene1_tunings =
@@ -751,9 +786,9 @@ static const struct tune_params xgene1_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  &generic_prefetch_tune
 };
 
 static const struct tune_params qdf24xx_tunings =
@@ -777,9 +812,9 @@ static const struct tune_params qdf24xx_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  64,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_STRONG,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)		/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),		/* tune_flags.  */
+  &qdf24xx_prefetch_tune
 };
 
 static const struct tune_params thunderx2t99_tunings =
@@ -802,9 +837,9 @@ static const struct tune_params thunderx2t99_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
-  64,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  &thunderx2t99_prefetch_tune
 };
 
 /* Support for fine-grained override of the tuning structures.  */
@@ -8747,10 +8782,27 @@ aarch64_override_options_internal (struct gcc_options *opts)
 			 opts->x_param_values,
 			 global_options_set.x_param_values);
 
-  /* Set the L1 cache line size.  */
-  if (selected_cpu->tune->cache_line_size != 0)
+  /* Set up parameters to be used in prefetching algorithm.  Do not
+     override the defaults unless we are tuning for a core we have
+     researched values for.  */
+  if (aarch64_tune_params.prefetch->num_slots > 0)
+    maybe_set_param_value (PARAM_SIMULTANEOUS_PREFETCHES,
+			   aarch64_tune_params.prefetch->num_slots,
+			   opts->x_param_values,
+			   global_options_set.x_param_values);
+  if (aarch64_tune_params.prefetch->l1_cache_size >= 0)
+    maybe_set_param_value (PARAM_L1_CACHE_SIZE,
+			   aarch64_tune_params.prefetch->l1_cache_size,
+			   opts->x_param_values,
+			   global_options_set.x_param_values);
+  if (aarch64_tune_params.prefetch->l1_cache_line_size >= 0)
     maybe_set_param_value (PARAM_L1_CACHE_LINE_SIZE,
-			   selected_cpu->tune->cache_line_size,
+			   aarch64_tune_params.prefetch->l1_cache_line_size,
+			   opts->x_param_values,
+			   global_options_set.x_param_values);
+  if (aarch64_tune_params.prefetch->l2_cache_size >= 0)
+    maybe_set_param_value (PARAM_L2_CACHE_SIZE,
+			   aarch64_tune_params.prefetch->l2_cache_size,
 			   opts->x_param_values,
 			   global_options_set.x_param_values);
 
-- 
2.7.4



* Re: [PATCH 5/6][AArch64] Enable -fprefetch-loop-arrays at -O3 for cores that  benefit from prefetching.
  2017-06-08 16:31         ` James Greenhalgh
@ 2017-06-09 16:00           ` Maxim Kuvyrkov
  0 siblings, 0 replies; 32+ messages in thread
From: Maxim Kuvyrkov @ 2017-06-09 16:00 UTC (permalink / raw)
  To: James Greenhalgh
  Cc: Kyrill Tkachov, GCC Patches, Andrew Pinski, Richard Guenther, nd

[-- Attachment #1: Type: text/plain, Size: 3161 bytes --]

> On Jun 8, 2017, at 7:31 PM, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> 
> On Fri, Feb 03, 2017 at 02:58:23PM +0300, Maxim Kuvyrkov wrote:
>>> On Jan 30, 2017, at 5:50 PM, Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org> wrote:
>>> 
>>>> On Jan 30, 2017, at 3:23 PM, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote:
>>>> 
>>>> Hi Maxim,
>>>> 
>>>> On 30/01/17 12:06, Maxim Kuvyrkov wrote:
>>>>> This patch enables prefetching at -O3 for aarch64 cores that set "simultaneous prefetches" parameter above 0.  There are currently no such settings, so this patch doesn't change default code generation.
>>>>> 
>>>>> I'm now working on improvements to -fprefetch-loop-arrays pass to make it suitable for -O2. I'll post this work in the next month.
>>>>> 
>>>>> Bootstrapped and regtested on x86_64-linux-gnu and aarch64-linux-gnu.
>>>>> 
>>>> 
>>>> Are you aiming to get this in for GCC 8?
>>>> I have one small comment on this patch:
>>>> 
>>>> +  /* Enable sw prefetching at -O3 for CPUS that have prefetch, and we
>>>> +     have deemed it beneficial (signified by setting
>>>> +     prefetch.num_slots to 1 or more).  */
>>>> +  if (flag_prefetch_loop_arrays < 0
>>>> +      && HAVE_prefetch
>>>> 
>>>> HAVE_prefetch will always be true on aarch64.
>>>> I imagine midend code that had logic like this would need this check, but aarch64-specific code shouldn't need it.
>>> 
>>> Agree, I'll remove HAVE_prefetch.
>>> 
>>> This pattern was copied from other backends, and HAVE_prefetch is most likely a historical artifact.
>> 
>> Andrew raised a good point in the review of his patch that it is a bad idea
>> to use one of prefetching parameters (simultaneous_prefetches) as indicator
>> for whether to enable prefetching pass by default.  Indeed there are cases
>> when we want to set simultaneous_prefetch according to HW documentation (or
>> experimental results), but not enable prefetching pass by default.
>> 
>> This update to the patch addresses it.  The patch adds a new explicit field
>> to prefetch tuning structure "default_opt_level" that sets optimization level
>> from which prefetching should be enabled by default.  The current value is to
>> enable prefetching at -O3; additionally, this parameter will come handy for
>> enabling prefetching at -O2 [when it is ready].
> 
> I really don't like the scheme of changing the optimisation threshold when
> profiling data is used.
> 
> I've seen too many reports and presentations by the uninitiated who believe
> that the use of profiling data has made the difference, when in reality
> it is just GCC changing behaviour on which passes run. It is very
> misleading!

OK, agree.  That line came from the i386 backend.  I'll run benchmarks for enabling prefetching at -O2 at a later date, and, possibly, we will have a performance argument for reducing the prefetch threshold when profile data is available.

> 
> With that line removed, and any rebasing needed over changes to the macro,
> I'm happy with this patch.

Attached is the updated patch.  I'll commit it after bootstrap / regtest.

Thanks,

--
Maxim Kuvyrkov
www.linaro.org



[-- Attachment #2: 0002-Enable-fprefetch-loop-arrays-at-given-optimization-l.patch --]
[-- Type: application/octet-stream, Size: 3183 bytes --]

From 64d3a2c4cd9c5f3b2d7a768359376a89a61d9652 Mon Sep 17 00:00:00 2001
From: Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org>
Date: Thu, 15 Dec 2016 12:50:07 +0000
Subject: [PATCH 2/4] Enable -fprefetch-loop-arrays at given optimization
 level.

	* config/aarch64/aarch64.c (aarch64_override_options_internal):
	Set flag_prefetch_loop_arrays according to tuning data.

Change-Id: Id41411e671e0a55dc7268e0ad0a4e8ff1421c90a
---
 gcc/config/aarch64/aarch64-protos.h |  1 +
 gcc/config/aarch64/aarch64.c        | 21 +++++++++++++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index bb06139..bfe44a7 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -210,6 +210,7 @@ struct cpu_prefetch_tune
   const int l1_cache_size;
   const int l1_cache_line_size;
   const int l2_cache_size;
+  const int default_opt_level;
 };
 
 struct tune_params
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index f5fbf36..d007fa0 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -532,7 +532,8 @@ static const cpu_prefetch_tune generic_prefetch_tune =
   0,			/* num_slots  */
   -1,			/* l1_cache_size  */
   -1,			/* l1_cache_line_size  */
-  -1			/* l2_cache_size  */
+  -1,			/* l2_cache_size  */
+  -1			/* default_opt_level  */
 };
 
 static const cpu_prefetch_tune exynosm1_prefetch_tune =
@@ -540,7 +541,8 @@ static const cpu_prefetch_tune exynosm1_prefetch_tune =
   0,			/* num_slots  */
   -1,			/* l1_cache_size  */
   64,			/* l1_cache_line_size  */
-  -1			/* l2_cache_size  */
+  -1,			/* l2_cache_size  */
+  -1			/* default_opt_level  */
 };
 
 static const cpu_prefetch_tune qdf24xx_prefetch_tune =
@@ -548,7 +550,8 @@ static const cpu_prefetch_tune qdf24xx_prefetch_tune =
   0,			/* num_slots  */
   -1,			/* l1_cache_size  */
   64,			/* l1_cache_line_size  */
-  -1			/* l2_cache_size  */
+  -1,			/* l2_cache_size  */
+  -1			/* default_opt_level  */
 };
 
 static const cpu_prefetch_tune thunderx2t99_prefetch_tune =
@@ -556,7 +559,8 @@ static const cpu_prefetch_tune thunderx2t99_prefetch_tune =
   0,			/* num_slots  */
   -1,			/* l1_cache_size  */
   64,			/* l1_cache_line_size  */
-  -1			/* l2_cache_size  */
+  -1,			/* l2_cache_size  */
+  -1			/* default_opt_level  */
 };
 
 static const struct tune_params generic_tunings =
@@ -8806,6 +8810,15 @@ aarch64_override_options_internal (struct gcc_options *opts)
 			   opts->x_param_values,
 			   global_options_set.x_param_values);
 
+  /* Enable sw prefetching at specified optimization level for
+     CPUS that have prefetch.  Lower optimization level threshold by 1
+     when profiling is enabled.  */
+  if (opts->x_flag_prefetch_loop_arrays < 0
+      && !opts->x_optimize_size
+      && aarch64_tune_params.prefetch->default_opt_level >= 0
+      && opts->x_optimize >= aarch64_tune_params.prefetch->default_opt_level)
+    opts->x_flag_prefetch_loop_arrays = 1;
+
   aarch64_override_options_after_change_1 (opts);
 }
 
-- 
2.7.4



* Re: [PATCH 6/6][AArch64] Update prefetch tuning parameters for falkor and qdf24xx  tunings.
  2017-06-08 16:32   ` James Greenhalgh
@ 2017-06-09 16:02     ` Maxim Kuvyrkov
  0 siblings, 0 replies; 32+ messages in thread
From: Maxim Kuvyrkov @ 2017-06-09 16:02 UTC (permalink / raw)
  To: James Greenhalgh
  Cc: GCC Patches, Kyrylo Tkachov, Andrew Pinski, Richard Guenther, nd

[-- Attachment #1: Type: text/plain, Size: 486 bytes --]

> On Jun 8, 2017, at 7:32 PM, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> 
> On Mon, Jan 30, 2017 at 03:08:04PM +0300, Maxim Kuvyrkov wrote:
>> This patch enables software prefetching at -O3 for Qualcomm's qdf24xx cores.
>> 
>> Bootstrapped and regtested on x86_64-linux-gnu and aarch64-linux-gnu.
> 
> This patch is OK in whatever form it takes after rebasing for the macro
> in 4/6.

Here is the updated patch, for the record.

--
Maxim Kuvyrkov
www.linaro.org



[-- Attachment #2: 0003-Update-prefetch-tuning-parameters-for-qdf24xx.patch --]
[-- Type: application/octet-stream, Size: 1198 bytes --]

From 9f0bf8dfde0d41e0ea1d03da9ff8c2590cc96a85 Mon Sep 17 00:00:00 2001
From: Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org>
Date: Thu, 15 Dec 2016 12:53:49 +0000
Subject: [PATCH 3/4] Update prefetch tuning parameters for qdf24xx.

	* config/aarch64/aarch64.c (qdf24xx_prefetch_tune): Update
	prefetch settings, and enable prefetching by default at -O3.

Change-Id: I2f0da54a8c262f6fbd0dcfde4584141bb09f1013
---
 gcc/config/aarch64/aarch64.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index d007fa0..bdcdf02 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -547,11 +547,11 @@ static const cpu_prefetch_tune exynosm1_prefetch_tune =
 
 static const cpu_prefetch_tune qdf24xx_prefetch_tune =
 {
-  0,			/* num_slots  */
-  -1,			/* l1_cache_size  */
+  4,			/* num_slots  */
+  32,			/* l1_cache_size  */
   64,			/* l1_cache_line_size  */
-  -1,			/* l2_cache_size  */
-  -1			/* default_opt_level  */
+  1024,			/* l2_cache_size  */
+  3			/* default_opt_level  */
 };
 
 static const cpu_prefetch_tune thunderx2t99_prefetch_tune =
-- 
2.7.4



* Re: [PATCH 0/6] Improve -fprefetch-loop-arrays in general and for AArch64 in particular
  2017-05-28  8:18     ` Andrew Pinski
  2017-05-29 10:24       ` Maxim Kuvyrkov
@ 2017-06-09 16:06       ` Maxim Kuvyrkov
  1 sibling, 0 replies; 32+ messages in thread
From: Maxim Kuvyrkov @ 2017-06-09 16:06 UTC (permalink / raw)
  To: Andrew Pinski
  Cc: Kyrill Tkachov, GCC Patches, Richard Guenther, James Greenhalgh

[-- Attachment #1: Type: text/plain, Size: 686 bytes --]

> On May 28, 2017, at 8:01 AM, Andrew Pinski <apinski@cavium.com> wrote:
> 
> On Tue, Feb 28, 2017 at 1:53 AM, Maxim Kuvyrkov
>> 
...
>> Hi Kyrill,
>> 
>> My hope was to push them in time for GCC 7, but it seems to late now.  I'll return to these patches at the beginning of Stage 1.
> 
> Ping on this patch set as I really want to get in the prefetching side
> for ThunderX 1 and 2.  Or should I resubmit my patch set?

Hi Andrew,

James G. approved my prefetching patches, and I will be merging them today or on Monday.

I've rebased your ThunderX patch on top of my patch set.  I leave it to you to re-test and get it reviewed.

--
Maxim Kuvyrkov
www.linaro.org


[-- Attachment #2: 0004-tFrom-Maxim-Kuvyrkov-maxim.kuvyrkov-linaro.org.patch --]
[-- Type: application/octet-stream, Size: 4370 bytes --]

From 5071322839ac6200b713bebbc4152d9659a82c9c Mon Sep 17 00:00:00 2001
From: Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org>
Date: Fri, 3 Feb 2017 11:26:24 +0000
Subject: [PATCH 4/4] tFrom: Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org>

Prefetch tuning for ThunderX.
---
 gcc/config/aarch64/aarch64-cores.def |  4 +--
 gcc/config/aarch64/aarch64.c         | 52 +++++++++++++++++++++++++++++++++---
 2 files changed, 50 insertions(+), 6 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-cores.def b/gcc/config/aarch64/aarch64-cores.def
index 92b57cf..b4f6d94 100644
--- a/gcc/config/aarch64/aarch64-cores.def
+++ b/gcc/config/aarch64/aarch64-cores.def
@@ -63,8 +63,8 @@ AARCH64_CORE("qdf24xx",     qdf24xx,   cortexa57, 8A,  AARCH64_FL_FOR_ARCH8 | AA
 AARCH64_CORE("thunderx",      thunderx,      thunderx,  8A,  AARCH64_FL_FOR_ARCH8 | AARCH64_FL_CRC | AARCH64_FL_CRYPTO, thunderx,  0x43, 0x0a0, -1)
 /* Do not swap around "thunderxt88p1" and "thunderxt88",
    this order is required to handle variant correctly. */
-AARCH64_CORE("thunderxt88p1", thunderxt88p1, thunderx,  8A,  AARCH64_FL_FOR_ARCH8 | AARCH64_FL_CRC | AARCH64_FL_CRYPTO,	thunderx,  0x43, 0x0a1, 0)
-AARCH64_CORE("thunderxt88",   thunderxt88,   thunderx,  8A,  AARCH64_FL_FOR_ARCH8 | AARCH64_FL_CRC | AARCH64_FL_CRYPTO, thunderx,  0x43, 0x0a1, -1)
+AARCH64_CORE("thunderxt88p1", thunderxt88p1, thunderx,  8A,  AARCH64_FL_FOR_ARCH8 | AARCH64_FL_CRC | AARCH64_FL_CRYPTO,	thunderxt88,  0x43, 0x0a1, 0)
+AARCH64_CORE("thunderxt88",   thunderxt88,   thunderx,  8A,  AARCH64_FL_FOR_ARCH8 | AARCH64_FL_CRC | AARCH64_FL_CRYPTO, thunderxt88,  0x43, 0x0a1, -1)
 AARCH64_CORE("thunderxt81",   thunderxt81,   thunderx,  8A,  AARCH64_FL_FOR_ARCH8 | AARCH64_FL_CRC | AARCH64_FL_CRYPTO, thunderx,  0x43, 0x0a2, -1)
 AARCH64_CORE("thunderxt83",   thunderxt83,   thunderx,  8A,  AARCH64_FL_FOR_ARCH8 | AARCH64_FL_CRC | AARCH64_FL_CRYPTO, thunderx,  0x43, 0x0a3, -1)
 AARCH64_CORE("thunderx2t99",  thunderx2t99,  thunderx2t99, 8_1A,  AARCH64_FL_FOR_ARCH8_1 | AARCH64_FL_CRYPTO, thunderx2t99, 0x43, 0x0af, -1)
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index bdcdf02..00165e2 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -554,12 +554,30 @@ static const cpu_prefetch_tune qdf24xx_prefetch_tune =
   3			/* default_opt_level  */
 };
 
+static const cpu_prefetch_tune thunderx_prefetch_tune =
+{
+  8,			/* num_slots  */
+  32,			/* l1_cache_size  */
+  128,			/* l1_cache_line_size  */
+  0,			/* l2_cache_size  */
+  -1			/* default_opt_level  */
+};
+
+static const cpu_prefetch_tune thunderxt88_prefetch_tune =
+{
+  8,			/* num_slots  */
+  32,			/* l1_cache_size  */
+  128,			/* l1_cache_line_size  */
+  16*1024,		/* l2_cache_size  */
+  3			/* default_opt_level  */
+};
+
 static const cpu_prefetch_tune thunderx2t99_prefetch_tune =
 {
-  0,			/* num_slots  */
-  -1,			/* l1_cache_size  */
+  8,			/* num_slots  */
+  32,			/* l1_cache_size  */
   64,			/* l1_cache_line_size  */
-  -1,			/* l2_cache_size  */
+  256,			/* l2_cache_size  */
   -1			/* default_opt_level  */
 };
 
@@ -767,7 +785,33 @@ static const struct tune_params thunderx_tunings =
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW),	/* tune_flags.  */
-  &generic_prefetch_tune
+  &thunderx_prefetch_tune
+};
+
+/* Tunings for ThunderX CN88xx */
+static const struct tune_params thunderxt88_tunings =
+{
+  &thunderx_extra_costs,
+  &generic_addrcost_table,
+  &thunderx_regmove_cost,
+  &thunderx_vector_cost,
+  &generic_branch_cost,
+  &generic_approx_modes,
+  6, /* memmov_cost  */
+  2, /* issue_rate  */
+  AARCH64_FUSE_CMP_BRANCH, /* fusible_ops  */
+  8,    /* function_align.  */
+  8,    /* jump_align.  */
+  8,    /* loop_align.  */
+  2,    /* int_reassoc_width.  */
+  4,    /* fp_reassoc_width.  */
+  1,    /* vec_reassoc_width.  */
+  2,    /* min_div_recip_mul_sf.  */
+  2,    /* min_div_recip_mul_df.  */
+  0,    /* max_case_values.  */
+  tune_params::AUTOPREFETCHER_OFF,      /* autoprefetcher_model.  */
+  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW),      /* tune_flags.  */
+  &thunderxt88_prefetch_tune
 };
 
 static const struct tune_params xgene1_tunings =
-- 
2.7.4



end of thread, other threads:[~2017-06-09 16:06 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-30 11:27 [PATCH 0/6] Improve -fprefetch-loop-arrays in general and for AArch64 in particular Maxim Kuvyrkov
2017-01-30 11:35 ` [PATCH 1/6] Add debug counter for loop array prefetching Maxim Kuvyrkov
2017-01-30 12:33   ` Richard Biener
2017-01-30 11:43 ` [PATCH 2/6] Improve debug output of loop data prefetching Maxim Kuvyrkov
2017-01-30 12:37   ` Richard Biener
2017-01-30 11:47 ` [PATCH 3/6] Fix prefetch heuristic calculation Maxim Kuvyrkov
2017-01-30 13:28   ` Richard Biener
2017-01-30 11:53 ` [PATCH 4/6] Port prefetch configuration from aarch32 to aarch64 Maxim Kuvyrkov
2017-01-30 16:39   ` Andrew Pinski
2017-02-03 11:52     ` Maxim Kuvyrkov
2017-06-08 13:48     ` James Greenhalgh
2017-06-08 15:13       ` Richard Earnshaw (lists)
2017-06-09  7:32         ` Maxim Kuvyrkov
2017-06-09 10:04           ` James Greenhalgh
2017-06-09 15:56             ` Maxim Kuvyrkov
2017-05-29 10:07   ` Maxim Kuvyrkov
2017-01-30 12:08 ` [PATCH 5/6][AArch64] Enable -fprefetch-loop-arrays at -O3 for cores that benefit from prefetching Maxim Kuvyrkov
2017-01-30 12:31   ` Kyrill Tkachov
2017-01-30 15:03     ` Maxim Kuvyrkov
2017-02-03 11:58       ` Maxim Kuvyrkov
2017-05-29 10:09         ` Maxim Kuvyrkov
2017-06-08 16:31         ` James Greenhalgh
2017-06-09 16:00           ` Maxim Kuvyrkov
2017-01-30 12:14 ` [PATCH 6/6][AArch64] Update prefetch tuning parameters for falkor and qdf24xx tunings Maxim Kuvyrkov
2017-06-08 16:32   ` James Greenhalgh
2017-06-09 16:02     ` Maxim Kuvyrkov
2017-01-30 16:55 ` [PATCH 0/6] Improve -fprefetch-loop-arrays in general and for AArch64 in particular Andrew Pinski
2017-02-20 14:44 ` Kyrill Tkachov
2017-02-28 10:03   ` Maxim Kuvyrkov
2017-05-28  8:18     ` Andrew Pinski
2017-05-29 10:24       ` Maxim Kuvyrkov
2017-06-09 16:06       ` Maxim Kuvyrkov
