public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PR91598] Improve autoprefetcher heuristic in haifa-sched.c
@ 2019-08-29 16:43 Maxim Kuvyrkov
  2019-08-29 17:34 ` Richard Biener
  0 siblings, 1 reply; 9+ messages in thread
From: Maxim Kuvyrkov @ 2019-08-29 16:43 UTC (permalink / raw)
  To: GCC Patches; +Cc: Alexander Monakov, Wilco Dijkstra

[-- Attachment #1: Type: text/plain, Size: 1860 bytes --]

Hi,

This patch tweaks the autoprefetcher heuristic in the scheduler to better group memory loads and stores together.

From https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91598:

There are two separate changes, both related to the instruction scheduler, that cause the regression.  The first change, in r253235, is responsible for 70% of the regression.
===
    haifa-sched: fix autopref_rank_for_schedule qsort comparator
    
            * haifa-sched.c (autopref_rank_for_schedule): Order 'irrelevant' insns
            first, always call autopref_rank_data otherwise.
    
    
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@253235 138bc75d-0d04-0410-961f-82ee72b054a4
===

After this change instead of
r1 = [rb + 0]
r2 = [rb + 8]
r3 = [rb + 16]
r4 = <math with r1>
r5 = <math with r2>
r6 = <math with r3>

we get
r1 = [rb + 0]
<math with r1>
r2 = [rb + 8]
<math with r2>
r3 = [rb + 16]
<math with r3>

which, apparently, the Cortex-A53 autoprefetcher doesn't recognize.  This schedule happens because the "r2 =" load gets lower priority than the "irrelevant" <math with r1> due to the above patch.

If we think about it, the fact that "r1 = [rb + 0]" can be scheduled means that the true dependencies of all similar base+offset loads are resolved.  Therefore, for an autoprefetcher-friendly schedule we should prioritize memory reads before "irrelevant" instructions.

On the other hand, following similar logic, we want to delay memory stores as much as possible and start scheduling them only after all potential producers are scheduled.  That is, for an autoprefetcher-friendly schedule we should prioritize "irrelevant" instructions before memory writes.

An obvious patch implementing the above is attached.  It brings back 70% of the regressed performance on this testcase.

OK to commit?

Regards,

--
Maxim Kuvyrkov
www.linaro.org



[-- Attachment #2: 0001-Improve-autoprefetcher-heuristic-partly-fix-regressi.patch --]
[-- Type: application/octet-stream, Size: 1381 bytes --]

From f6b4647a72bc798f9b1be7bd487417229d5773a4 Mon Sep 17 00:00:00 2001
From: Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org>
Date: Thu, 29 Aug 2019 15:21:36 +0000
Subject: [PATCH 1/3] Improve autoprefetcher heuristic (partly fix regression
 in PR91598)

	PR rtl-optimization/91598
	* haifa-sched.c (autopref_rank_for_schedule): Prioritize "irrelevant"
	insns after memory reads and before memory writes.

Change-Id: Ie9dd3664c652760ca605fcb3fe53190429870ded
---
 gcc/haifa-sched.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/gcc/haifa-sched.c b/gcc/haifa-sched.c
index 5025aae421d6..1bb968776f40 100644
--- a/gcc/haifa-sched.c
+++ b/gcc/haifa-sched.c
@@ -5675,9 +5675,16 @@ autopref_rank_for_schedule (const rtx_insn *insn1, const rtx_insn *insn2)
       int irrel2 = data2->status == AUTOPREF_MULTIPASS_DATA_IRRELEVANT;
 
       if (!irrel1 && !irrel2)
+	/* Sort memory references from lowest offset to the largest.  */
 	r = data1->offset - data2->offset;
-      else
+      else if (write)
+	/* Schedule "irrelevant" insns before memory stores to resolve
+	   as many producer dependencies of stores as possible.  */
 	r = irrel2 - irrel1;
+      else
+	/* Schedule "irrelevant" insns after memory reads to avoid breaking
+	   memory read sequences.  */
+	r = irrel1 - irrel2;
     }
 
   return r;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PR91598] Improve autoprefetcher heuristic in haifa-sched.c
  2019-08-29 16:43 [PR91598] Improve autoprefetcher heuristic in haifa-sched.c Maxim Kuvyrkov
@ 2019-08-29 17:34 ` Richard Biener
  2019-08-29 17:36   ` Maxim Kuvyrkov
  0 siblings, 1 reply; 9+ messages in thread
From: Richard Biener @ 2019-08-29 17:34 UTC (permalink / raw)
  To: gcc-patches, Maxim Kuvyrkov, GCC Patches
  Cc: Alexander Monakov, Wilco Dijkstra

On August 29, 2019 5:40:47 PM GMT+02:00, Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org> wrote:
>Hi,
>
>This patch tweaks autoprefetcher heuristic in scheduler to better group
>memory loads and stores together.
>
From https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91598:
>
>There are two separate changes, both related to instruction scheduler,
>that cause the regression.  The first change in r253235 is responsible
>for 70% of the regression.
>===
>    haifa-sched: fix autopref_rank_for_schedule qsort comparator
>    
> * haifa-sched.c (autopref_rank_for_schedule): Order 'irrelevant' insns
>            first, always call autopref_rank_data otherwise.
>    
>    
>    
>git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@253235
>138bc75d-0d04-0410-961f-82ee72b054a4
>===
>
>After this change instead of
>r1 = [rb + 0]
>r2 = [rb + 8]
>r3 = [rb + 16]
>r4 = <math with r1>
>r5 = <math with r2>
>r6 = <math with r3>
>
>we get
>r1 = [rb + 0]
><math with r1>
>r2 = [rb + 8]
><math with r2>
>r3 = [rb + 16]
><math with r3>
>
>which, apparently, cortex-a53 autoprefetcher doesn't recognize.  This
>schedule happens because r2= load gets lower priority than the
>"irrelevant" <math with r1> due to the above patch.
>
>If we think about it, the fact that "r1 = [rb + 0]" can be scheduled
>means that true dependencies of all similar base+offset loads are
>resolved.  Therefore, for autoprefetcher-friendly schedule we should
>prioritize memory reads before "irrelevant" instructions.

But isn't there also a max number of load issues per fetch window to consider?
So interleaving arithmetic with loads might be profitable.

>On the other hand, following similar logic, we want to delay memory
>stores as much as possible to start scheduling them only after all
>potential producers are scheduled.  I.e., for autoprefetcher-friendly
>schedule we should prioritize "irrelevant" instructions before memory
>writes.
>
>Obvious patch to implement the above is attached.  It brings 70% of
>regressed performance on this testcase back.
>
>OK to commit?
>
>Regards,
>
>--
>Maxim Kuvyrkov
>www.linaro.org

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PR91598] Improve autoprefetcher heuristic in haifa-sched.c
  2019-08-29 17:34 ` Richard Biener
@ 2019-08-29 17:36   ` Maxim Kuvyrkov
  2019-08-29 18:18     ` Alexander Monakov
       [not found]     ` <VI1PR0801MB2127C0534510021E6A4BBE0583A20@VI1PR0801MB2127.eurprd08.prod.outlook.com>
  0 siblings, 2 replies; 9+ messages in thread
From: Maxim Kuvyrkov @ 2019-08-29 17:36 UTC (permalink / raw)
  To: Richard Guenther; +Cc: gcc-patches, Alexander Monakov, Wilco Dijkstra

> On Aug 29, 2019, at 7:29 PM, Richard Biener <richard.guenther@gmail.com> wrote:
> 
> On August 29, 2019 5:40:47 PM GMT+02:00, Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org> wrote:
>> Hi,
>> 
>> This patch tweaks autoprefetcher heuristic in scheduler to better group
>> memory loads and stores together.
>> 
>> From https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91598:
>> 
>> There are two separate changes, both related to instruction scheduler,
>> that cause the regression.  The first change in r253235 is responsible
>> for 70% of the regression.
>> ===
>>   haifa-sched: fix autopref_rank_for_schedule qsort comparator
>> 
>> * haifa-sched.c (autopref_rank_for_schedule): Order 'irrelevant' insns
>>           first, always call autopref_rank_data otherwise.
>> 
>> 
>> 
>> git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@253235
>> 138bc75d-0d04-0410-961f-82ee72b054a4
>> ===
>> 
>> After this change instead of
>> r1 = [rb + 0]
>> r2 = [rb + 8]
>> r3 = [rb + 16]
>> r4 = <math with r1>
>> r5 = <math with r2>
>> r6 = <math with r3>
>> 
>> we get
>> r1 = [rb + 0]
>> <math with r1>
>> r2 = [rb + 8]
>> <math with r2>
>> r3 = [rb + 16]
>> <math with r3>
>> 
>> which, apparently, cortex-a53 autoprefetcher doesn't recognize.  This
>> schedule happens because r2= load gets lower priority than the
>> "irrelevant" <math with r1> due to the above patch.
>> 
>> If we think about it, the fact that "r1 = [rb + 0]" can be scheduled
>> means that true dependencies of all similar base+offset loads are
>> resolved.  Therefore, for autoprefetcher-friendly schedule we should
>> prioritize memory reads before "irrelevant" instructions.
> 
> But isn't there also max number of load issues in a fetch window to consider? 
> So interleaving arithmetic with loads might be profitable. 

It appears that cores with autoprefetcher hardware prefer loads and stores bundled together, rather than interspersed with other instructions that occupy the rest of the CPU units.

The autoprefetching heuristic is enabled only for cores that support it, and isn't active by default.

> 
>> On the other hand, following similar logic, we want to delay memory
>> stores as much as possible to start scheduling them only after all
>> potential producers are scheduled.  I.e., for autoprefetcher-friendly
>> schedule we should prioritize "irrelevant" instructions before memory
>> writes.
>> 
>> Obvious patch to implement the above is attached.  It brings 70% of
>> regressed performance on this testcase back.
>> 
>> OK to commit?
>> 
>> Regards,
>> 
>> --
>> Maxim Kuvyrkov
>> www.linaro.org

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PR91598] Improve autoprefetcher heuristic in haifa-sched.c
       [not found]     ` <VI1PR0801MB2127C0534510021E6A4BBE0583A20@VI1PR0801MB2127.eurprd08.prod.outlook.com>
@ 2019-08-29 18:18       ` Wilco Dijkstra
  2019-09-03 16:55       ` Wilco Dijkstra
  1 sibling, 0 replies; 9+ messages in thread
From: Wilco Dijkstra @ 2019-08-29 18:18 UTC (permalink / raw)
  To: Maxim Kuvyrkov, Richard Guenther; +Cc: gcc-patches, Alexander Monakov, nd

Hi Maxim,
 
> It appears that cores with autoprefetcher hardware prefer loads and stores bundled together, not interspersed with other instructions to occupy the rest of CPU units.

I don't believe it is as simple as that - modern cores have multiple prefetchers but
won't prefer bundling loads and stores in large blocks.  That would result in terrible
performance due to dispatch and issue stalls.  Also the increased register pressure
could cause extra spilling.  If we group loads and stores, we'd definitely need to
limit them to say 4 or so at most, and then interleave ALU operations.

> Autoprefetching heuristic is enabled only for cores that support it, and isn't active for by default.

It's enabled on most cores, including the default (generic).  So we do have to be
careful that this doesn't regress any other benchmarks or do worse on modern
cores.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PR91598] Improve autoprefetcher heuristic in haifa-sched.c
  2019-08-29 17:36   ` Maxim Kuvyrkov
@ 2019-08-29 18:18     ` Alexander Monakov
  2019-08-29 20:41       ` Wilco Dijkstra
  2019-08-30  7:42       ` Richard Biener
       [not found]     ` <VI1PR0801MB2127C0534510021E6A4BBE0583A20@VI1PR0801MB2127.eurprd08.prod.outlook.com>
  1 sibling, 2 replies; 9+ messages in thread
From: Alexander Monakov @ 2019-08-29 18:18 UTC (permalink / raw)
  To: Maxim Kuvyrkov; +Cc: Richard Guenther, gcc-patches, Wilco Dijkstra

On Thu, 29 Aug 2019, Maxim Kuvyrkov wrote:

> >> r1 = [rb + 0]
> >> <math with r1>
> >> r2 = [rb + 8]
> >> <math with r2>
> >> r3 = [rb + 16]
> >> <math with r3>
> >> 
> >> which, apparently, cortex-a53 autoprefetcher doesn't recognize.  This
> >> schedule happens because r2= load gets lower priority than the
> >> "irrelevant" <math with r1> due to the above patch.
> >> 
> >> If we think about it, the fact that "r1 = [rb + 0]" can be scheduled
> >> means that true dependencies of all similar base+offset loads are
> >> resolved.  Therefore, for autoprefetcher-friendly schedule we should
> >> prioritize memory reads before "irrelevant" instructions.
> > 
> > But isn't there also max number of load issues in a fetch window to consider? 
> > So interleaving arithmetic with loads might be profitable. 
> 
> It appears that cores with autoprefetcher hardware prefer loads and stores
> bundled together, not interspersed with other instructions to occupy the rest
> of CPU units.

Let me point out that the motivating example has a bigger effect in play:

(1) r1 = [rb + 0]
(2) <math with r1>
(3) r2 = [rb + 8]
(4) <math with r2>
(5) r3 = [rb + 16]
(6) <math with r3>

here Cortex-A53, being an in-order core, cannot issue the load at (3) until
after the load at (1) has completed, because the use at (2) depends on it.
The good schedule allows the three loads to issue in a pipelined fashion.

So essentially the main issue is not a hardware peculiarity, but rather the
bad schedule being totally wrong (it could only make sense if loads had 1-cycle
latency, which they do not).

I think this highlights how implementing this autoprefetch heuristic via the
dfa_lookahead_guard interface looks questionable in the first place, but the
patch itself makes sense to me.

Alexander

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PR91598] Improve autoprefetcher heuristic in haifa-sched.c
  2019-08-29 18:18     ` Alexander Monakov
@ 2019-08-29 20:41       ` Wilco Dijkstra
  2019-08-30  7:42       ` Richard Biener
  1 sibling, 0 replies; 9+ messages in thread
From: Wilco Dijkstra @ 2019-08-29 20:41 UTC (permalink / raw)
  To: Alexander Monakov, Maxim Kuvyrkov; +Cc: Richard Guenther, gcc-patches, nd

Hi Alexander,
 
> So essentially the main issue is not a hardware peculiarity, but rather the
> bad schedule being totally wrong (it could only make sense if loads had 1-cycle
> latency, which they do not).

The scheduling is only bad because the specific intrinsics used are mapped
onto asm statements, so they are ignored by the scheduler and modelled
with zero latencies.

> I think this highlights how implementing this autoprefetch heuristic via the
> dfa_lookahead_guard interface looks questionable in the first place, but the
> patch itself makes sense to me.

Yes, I'm still not sure what this autoprefetch heuristic is trying to accomplish...
We could try disabling it and see whether it actually helps.

Wilco

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PR91598] Improve autoprefetcher heuristic in haifa-sched.c
  2019-08-29 18:18     ` Alexander Monakov
  2019-08-29 20:41       ` Wilco Dijkstra
@ 2019-08-30  7:42       ` Richard Biener
  1 sibling, 0 replies; 9+ messages in thread
From: Richard Biener @ 2019-08-30  7:42 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Maxim Kuvyrkov, GCC Patches, Wilco Dijkstra

On Thu, Aug 29, 2019 at 7:36 PM Alexander Monakov <amonakov@ispras.ru> wrote:
>
> On Thu, 29 Aug 2019, Maxim Kuvyrkov wrote:
>
> > >> r1 = [rb + 0]
> > >> <math with r1>
> > >> r2 = [rb + 8]
> > >> <math with r2>
> > >> r3 = [rb + 16]
> > >> <math with r3>
> > >>
> > >> which, apparently, cortex-a53 autoprefetcher doesn't recognize.  This
> > >> schedule happens because r2= load gets lower priority than the
> > >> "irrelevant" <math with r1> due to the above patch.
> > >>
> > >> If we think about it, the fact that "r1 = [rb + 0]" can be scheduled
> > >> means that true dependencies of all similar base+offset loads are
> > >> resolved.  Therefore, for autoprefetcher-friendly schedule we should
> > >> prioritize memory reads before "irrelevant" instructions.
> > >
> > > But isn't there also max number of load issues in a fetch window to consider?
> > > So interleaving arithmetic with loads might be profitable.
> >
> > It appears that cores with autoprefetcher hardware prefer loads and stores
> > bundled together, not interspersed with other instructions to occupy the rest
> > of CPU units.
>
> Let me point out that the motivating example has a bigger effect in play:
>
> (1) r1 = [rb + 0]
> (2) <math with r1>
> (3) r2 = [rb + 8]
> (4) <math with r2>
> (5) r3 = [rb + 16]
> (6) <math with r3>
>
> here Cortex-A53, being an in-order core, cannot issue the load at (3) until
> after the load at (1) has completed, because the use at (2) depends on it.
> The good schedule allows the three loads to issue in a pipelined fashion.

OK, so regarding dispatch/issue constraints, I was thinking of scheduling independent
work like AGU ops in between the loads.  It might be that some in-order
cores like to see two adjacent loads to fire auto-prefetching, but any such
heuristic should probably be very sub-architecture specific.

> So essentially the main issue is not a hardware peculiarity, but rather the
> bad schedule being totally wrong (it could only make sense if loads had 1-cycle
> latency, which they do not).
>
> I think this highlights how implementing this autoprefetch heuristic via the
> dfa_lookahead_guard interface looks questionable in the first place, but the
> patch itself makes sense to me.
>
> Alexander

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PR91598] Improve autoprefetcher heuristic in haifa-sched.c
       [not found]     ` <VI1PR0801MB2127C0534510021E6A4BBE0583A20@VI1PR0801MB2127.eurprd08.prod.outlook.com>
  2019-08-29 18:18       ` Wilco Dijkstra
@ 2019-09-03 16:55       ` Wilco Dijkstra
  2021-08-17  9:43         ` Maxim Kuvyrkov
  1 sibling, 1 reply; 9+ messages in thread
From: Wilco Dijkstra @ 2019-09-03 16:55 UTC (permalink / raw)
  To: Maxim Kuvyrkov, Richard Guenther; +Cc: gcc-patches, Alexander Monakov, nd

Hi Maxim,
  
> > Autoprefetching heuristic is enabled only for cores that support it, and isn't active for by default.
>
> It's enabled on most cores, including the default (generic).  So we do have to be
> careful that this doesn't regress any other benchmarks or do worse on modern
> cores.

I benchmarked your scheduler change on a few AArch64 machines, and it either has
no effect or a positive effect on SPECFP, with no major outliers (and only minor
codesize differences).  So I think your proposed patch is OK as is.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PR91598] Improve autoprefetcher heuristic in haifa-sched.c
  2019-09-03 16:55       ` Wilco Dijkstra
@ 2021-08-17  9:43         ` Maxim Kuvyrkov
  0 siblings, 0 replies; 9+ messages in thread
From: Maxim Kuvyrkov @ 2021-08-17  9:43 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Richard Guenther, gcc-patches, Alexander Monakov, nd

[-- Attachment #1: Type: text/plain, Size: 963 bytes --]

Hi All,

I forgot to commit this patch when it was approved two years ago.  It
still applies cleanly to the current mainline, and I've retested it
(bootstrap + regtest) on aarch64-linux-gnu and arm-linux-gnueabihf with no
regressions.

I'll commit this shortly.

Regards,

On Tue, 3 Sept 2019 at 19:55, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:

> Hi Maxim,
>
> >  > Autoprefetching heuristic is enabled only for cores that support it,
> and isn't active for by default.
> >
> > It's enabled on most cores, including the default (generic). So we do
> have to be
> > careful that this doesn't regress any other benchmarks or do worse on
> modern
> > cores.
>
> I benchmarked your scheduler change on a few AArch64 machines, and it
> either has
> no effect or a positive effect on SPECFP with no major outliers (and only
> minor
> codesize differences). So I think your proposed patch is OK as is.
>
> Cheers,
> Wilco
-- 
Maxim Kuvyrkov
www.linaro.org

[-- Attachment #2: 0001-Improve-autoprefetcher-heuristic-partly-fix-regressi.patch --]
[-- Type: application/octet-stream, Size: 1341 bytes --]

From 6e56e141f0fa5a16fde7bca0b5e2172428d80c83 Mon Sep 17 00:00:00 2001
From: Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org>
Date: Thu, 29 Aug 2019 15:21:36 +0000
Subject: [PATCH 1/3] Improve autoprefetcher heuristic (partly fix regression
 in PR91598)

	PR rtl-optimization/91598
	* haifa-sched.c (autopref_rank_for_schedule): Prioritize "irrelevant"
	insns after memory reads and before memory writes.

Change-Id: Ie9dd3664c652760ca605fcb3fe53190429870ded
---
 gcc/haifa-sched.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/gcc/haifa-sched.c b/gcc/haifa-sched.c
index 26d112795511..5048dd4a5a32 100644
--- a/gcc/haifa-sched.c
+++ b/gcc/haifa-sched.c
@@ -5683,9 +5683,16 @@ autopref_rank_for_schedule (const rtx_insn *insn1, const rtx_insn *insn2)
       int irrel2 = data2->status == AUTOPREF_MULTIPASS_DATA_IRRELEVANT;
 
       if (!irrel1 && !irrel2)
+	/* Sort memory references from lowest offset to the largest.  */
 	r = data1->offset - data2->offset;
-      else
+      else if (write)
+	/* Schedule "irrelevant" insns before memory stores to resolve
+	   as many producer dependencies of stores as possible.  */
 	r = irrel2 - irrel1;
+      else
+	/* Schedule "irrelevant" insns after memory reads to avoid breaking
+	   memory read sequences.  */
+	r = irrel1 - irrel2;
     }
 
   return r;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-08-17  9:43 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-29 16:43 [PR91598] Improve autoprefetcher heuristic in haifa-sched.c Maxim Kuvyrkov
2019-08-29 17:34 ` Richard Biener
2019-08-29 17:36   ` Maxim Kuvyrkov
2019-08-29 18:18     ` Alexander Monakov
2019-08-29 20:41       ` Wilco Dijkstra
2019-08-30  7:42       ` Richard Biener
     [not found]     ` <VI1PR0801MB2127C0534510021E6A4BBE0583A20@VI1PR0801MB2127.eurprd08.prod.outlook.com>
2019-08-29 18:18       ` Wilco Dijkstra
2019-09-03 16:55       ` Wilco Dijkstra
2021-08-17  9:43         ` Maxim Kuvyrkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).