Re: [patch] Improve loop array prefetch for IA-64

public inbox for gcc@gcc.gnu.org

* Re: [patch] Improve loop array prefetch for IA-64
       [not found] ` <571f6b510606021517r65edcb8fh1a6bf06370fb0a19@mail.gmail.com.suse.lists.egcs>
@ 2006-06-03  3:54   ` Andi Kleen
  0 siblings, 0 replies; 10+ messages in thread
From: Andi Kleen @ 2006-06-03  3:54 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: gcc, mark.davis

"Steven Bosscher" <stevenb.gcc@gmail.com> writes:

> On 6/2/06, Davis, Mark <mark.davis@intel.com> wrote:
> > Question: does gcc now know the difference between prefetching to cache L1 via
> > "lfetch", as opposed to prefetching only to level L2 via "lfetch.nt1"?
> 
> The ia64 backend knows the difference, see the prefetch pattern in ia64.md.
> 
> But ia64 is the only backend that supports this kind of explicit
> locality parameter. And since no-one from the ia64 community cared
> much about gcc until recently, gcc's prefetching pass (which is
> limited anyway) does not generate lfetch.nt1 or other prefetches with
> explicit locality parameters.

Actually, SSE on x86 also has prefetches with different locality hints (T0, T1, T2, NTA).

However, x86 always needs the data in L1 cache before it can do anything
with it, even for FP data, so this particular optimization might not be
very useful there.

T0 vs. NTA is useful, though, and at least AMD K8 can make use of them: when
a lot of data is streamed and not reused, NTA is a good idea.

> > For floating point data, the latter is the only interesting case because float loads only
> > access the L2.  Thus using "lfetch" for floating point arrays will unnecessarily wipe out
> > the contents of L1.  (gcc 3.2.3 only seems to generate "lfetch", which is why I ask...)
> 
> You could experiment with this for ia64 by hacking issue_prefetch_ref
> in tree-ssa-loop-prefetch.c to issue a prefetch to L2 for floating
> point types.

Perhaps it could generate different prefetches based on the size of the array
being worked on?

I guess that for walking, e.g., a 1MB array, NTA is probably a good idea (with
the 1MB threshold being a tunable).
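
The size-based choice can be pictured with a small sketch in C, written
against GCC's __builtin_prefetch (addr, rw, locality). The names
NTA_THRESHOLD_BYTES and PREFETCH_AHEAD and their values are made up for
illustration; GCC defines neither. It is also an assumption that locality 0
ends up as an NTA-style prefetch and locality 3 as a T0-style one, which is
worth checking against the backend in use. The locality argument has to be a
compile-time constant, so the selection is done with a branch.

```c
#include <stddef.h>

/* Hypothetical tunables; neither name exists in GCC.  */
#define NTA_THRESHOLD_BYTES ((size_t) 1 << 20)  /* the 1MB tunable */
#define PREFETCH_AHEAD 64                       /* elements ahead of the load */

/* Sum an array, prefetching ahead with a hint chosen from the array size.  */
static double
sum_with_prefetch (const double *a, size_t n)
{
  /* Large arrays are assumed to be streamed once and not reused.  */
  int streaming = n >= NTA_THRESHOLD_BYTES / sizeof *a;
  double sum = 0.0;
  size_t i;

  for (i = 0; i < n; i++)
    {
      if (i + PREFETCH_AHEAD < n)
        {
          /* Locality 0 means no temporal locality, 3 means keep the data
             in all cache levels.  */
          if (streaming)
            __builtin_prefetch (&a[i + PREFETCH_AHEAD], 0, 0);
          else
            __builtin_prefetch (&a[i + PREFETCH_AHEAD], 0, 3);
        }
      sum += a[i];
    }
  return sum;
}
```

The result is the same on either path; only the cache hint differs, so the
threshold can be tuned by timing alone.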

-Andi

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [patch] Improve loop array prefetch for IA-64
  2006-06-02 15:21 Davis, Mark
  2006-06-02 22:17 ` Steven Bosscher
@ 2006-06-03  0:19 ` Canqun Yang
  1 sibling, 0 replies; 10+ messages in thread
From: Canqun Yang @ 2006-06-03  0:19 UTC (permalink / raw)
  To: Davis, Mark, gcc, gcc-patches

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=gb2312, Size: 1494 bytes --]


--- "Davis, Mark" <mark.davis@intel.com>:

> Canqun,
> 
> Nice job getting this ready for the current version of gcc!
> 
> Question: does gcc now know the difference between prefetching to cache L1 via "lfetch", as
> opposed to prefetching only to level L2 via "lfetch.nt1"?  For floating point data, the latter
> is the only interesting case because float loads only access the L2.  Thus using "lfetch" for
> floating point arrays will unnecessarily wipe out the contents of L1.  (gcc 3.2.3 only seems to
> generate "lfetch", which is why I ask...)
> 

Yes, GCC does. I have tried this with both the old prefetch implementation at the RTL level and
the new one at the tree level, but saw no significant performance difference for the SPECfp2000
and NAS benchmarks. Nevertheless, it is worth taking more time to look into it.

Canqun Yang


> Thanks,
> Mark 
> 
> -----Original Message-----
> From: Canqun Yang [mailto:canqun@yahoo.com.cn] 
> Sent: Friday, June 02, 2006 5:14 AM
> To: gcc@gcc.gnu.org; gcc-patches@gcc.gnu.org
> Subject: [patch] Improve loop array prefetch for IA-64
> 
> Hi, all
> 
> This patch yields a performance increase of 4% for SPECfp2000 and 13% for the NAS benchmark
> suite on an Itanium-2 system. Further gains should be possible by tuning the parameters
> further and improving the prefetch algorithm at the tree level.
> 
> 
> Canqun Yang
> 
> 

__________________________________________________
Hurry and sign up for Yahoo's extra-large-capacity free mailbox
http://cn.mail.yahoo.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [patch] Improve loop array prefetch for IA-64
  2006-06-02 22:17 ` Steven Bosscher
@ 2006-06-02 22:32   ` Steven Bosscher
  0 siblings, 0 replies; 10+ messages in thread
From: Steven Bosscher @ 2006-06-02 22:32 UTC (permalink / raw)
  To: Davis, Mark; +Cc: Canqun Yang, gcc, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 606 bytes --]

On 6/3/06, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> > For floating point data, the latter is the only interesting case because float loads only
> > access the L2.  Thus using "lfetch" for floating point arrays will unnecessarily wipe out
> > the contents of L1.  (gcc 3.2.3 only seems to generate "lfetch", which is why I ask...)
>
> You could experiment with this for ia64 by hacking issue_prefetch_ref
> in tree-ssa-loop-prefetch.c to issue a prefetch to L2 for floating
> point types.

E.g. something like this, which is (needless to say) untested but
something you could play with.

Gr.
Steven

[-- Attachment #2: hack.diff --]
[-- Type: text/x-patch, Size: 1595 bytes --]

Index: tree-ssa-loop-prefetch.c
===================================================================
--- tree-ssa-loop-prefetch.c	(revision 114315)
+++ tree-ssa-loop-prefetch.c	(working copy)
@@ -816,7 +816,7 @@ static void
 issue_prefetch_ref (struct mem_ref *ref, unsigned unroll_factor, unsigned ahead)
 {
   HOST_WIDE_INT delta;
-  tree addr, addr_base, prefetch, params, write_p;
+  tree addr, addr_base, prefetch, params, write_p, locality;
   block_stmt_iterator bsi;
   unsigned n_prefetches, ap;
 
@@ -838,11 +838,21 @@ issue_prefetch_ref (struct mem_ref *ref,
 			  addr_base, build_int_cst (ptr_type_node, delta));
       addr = force_gimple_operand_bsi (&bsi, unshare_expr (addr), true, NULL);
 
-      /* Create the prefetch instruction.  */
+      /* Create the prefetch instruction.  Do this by building a call to
+         `void __builtin_prefetch (const void *ADDR, int RW, int LOCALITY)'.
+
+	 ??? The `locality' parameter is a shameless, untested hack to
+	 force lfetch.nt1 -- hopefully.  */
       write_p = ref->write_p ? integer_one_node : integer_zero_node;
-      params = tree_cons (NULL_TREE, addr,
-			  tree_cons (NULL_TREE, write_p, NULL_TREE));
-				 
+      locality = FLOAT_TYPE_P (TREE_TYPE (ref->mem))
+		 ? integer_one_node : build_int_cst (integer_type_node, 3);
+      params = tree_cons (NULL_TREE,
+			  addr,
+			  tree_cons (NULL_TREE,
+				     write_p,
+				     tree_cons (NULL_TREE,
+						locality,
+						NULL_TREE)));
       prefetch = build_function_call_expr (built_in_decls[BUILT_IN_PREFETCH],
 					   params);
       bsi_insert_before (&bsi, prefetch, BSI_SAME_STMT);
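
For experimenting without rebuilding the compiler, roughly the same effect as
the hack above can be had from user code, because the third argument of
__builtin_prefetch is the locality hint that the patch fills in. A minimal
sketch, assuming the ia64 backend turns locality 1 into lfetch.nt1 and the
default of 3 into plain lfetch (worth confirming in the prefetch pattern in
ia64.md). The function name daxpy_nt1 and the distance PREFETCH_AHEAD are
invented for illustration.

```c
#include <stddef.h>

#define PREFETCH_AHEAD 16  /* hypothetical distance, in elements */

/* y[i] += alpha * x[i], prefetching both floating point arrays with
   locality 1 instead of the default 3.  */
static void
daxpy_nt1 (size_t n, double alpha, const double *x, double *y)
{
  size_t i;

  for (i = 0; i < n; i++)
    {
      if (i + PREFETCH_AHEAD < n)
        {
          __builtin_prefetch (&x[i + PREFETCH_AHEAD], 0, 1);  /* read */
          __builtin_prefetch (&y[i + PREFETCH_AHEAD], 1, 1);  /* write */
        }
      y[i] += alpha * x[i];
    }
}
```

Inspecting the assembly from -S should show whether the intended lfetch
variant is actually emitted.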

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [patch] Improve loop array prefetch for IA-64
  2006-06-02 15:21 Davis, Mark
@ 2006-06-02 22:17 ` Steven Bosscher
  2006-06-02 22:32   ` Steven Bosscher
  2006-06-03  0:19 ` Canqun Yang
  1 sibling, 1 reply; 10+ messages in thread
From: Steven Bosscher @ 2006-06-02 22:17 UTC (permalink / raw)
  To: Davis, Mark; +Cc: Canqun Yang, gcc, gcc-patches

On 6/2/06, Davis, Mark <mark.davis@intel.com> wrote:
> Question: does gcc now know the difference between prefetching to cache L1 via
> "lfetch", as opposed to prefetching only to level L2 via "lfetch.nt1"?

The ia64 backend knows the difference, see the prefetch pattern in ia64.md.

But ia64 is the only backend that supports this kind of explicit
locality parameter. And since no-one from the ia64 community cared
much about gcc until recently, gcc's prefetching pass (which is
limited anyway) does not generate lfetch.nt1 or other prefetches with
explicit locality parameters.

> For floating point data, the latter is the only interesting case because float loads only
> access the L2.  Thus using "lfetch" for floating point arrays will unnecessarily wipe out
> the contents of L1.  (gcc 3.2.3 only seems to generate "lfetch", which is why I ask...)

You could experiment with this for ia64 by hacking issue_prefetch_ref
in tree-ssa-loop-prefetch.c to issue a prefetch to L2 for floating
point types.

Gr.
Steven

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [patch] Improve loop array prefetch for IA-64
@ 2006-06-02 15:21 Davis, Mark
  2006-06-02 22:17 ` Steven Bosscher
  2006-06-03  0:19 ` Canqun Yang
  0 siblings, 2 replies; 10+ messages in thread
From: Davis, Mark @ 2006-06-02 15:21 UTC (permalink / raw)
  To: Canqun Yang, gcc, gcc-patches

Canqun,

Nice job getting this ready for the current version of gcc!

Question: does gcc now know the difference between prefetching to cache L1 via "lfetch", as opposed to prefetching only to level L2 via "lfetch.nt1"?  For floating point data, the latter is the only interesting case because float loads only access the L2.  Thus using "lfetch" for floating point arrays will unnecessarily wipe out the contents of L1.  (gcc 3.2.3 only seems to generate "lfetch", which is why I ask...)

Thanks,
Mark 

-----Original Message-----
From: Canqun Yang [mailto:canqun@yahoo.com.cn] 
Sent: Friday, June 02, 2006 5:14 AM
To: gcc@gcc.gnu.org; gcc-patches@gcc.gnu.org
Subject: [patch] Improve loop array prefetch for IA-64

Hi, all

This patch yields a performance increase of 4% for SPECfp2000 and 13% for the NAS benchmark
suite on an Itanium-2 system. Further gains should be possible by tuning the parameters
further and improving the prefetch algorithm at the tree level.

Canqun Yang


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [patch] Improve loop array prefetch for IA-64
@ 2006-06-02 11:00 Canqun Yang
  0 siblings, 0 replies; 10+ messages in thread
From: Canqun Yang @ 2006-06-02 11:00 UTC (permalink / raw)
  To: gcc, gcc-patches

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=gb2312, Size: 1173 bytes --]

--- Andrey Belevantsev <abel@ispras.ru>:

> Canqun Yang wrote:
> > Hi, all
> > 
> > This patch yields a performance increase of 4% for SPECfp2000 and 13% for the NAS benchmark
> > suite on an Itanium-2 system. Further gains should be possible by tuning the parameters
> > further and improving the prefetch algorithm at the tree level.
> 
> Hi Canqun,
> 
> It's great news that you continued to work on prefetching tuning for 
> ia64!  Do you plan to port your other changes for the old RTL 
> prefetching to the tree level?
> 

Yes, but I do not have much time for it right now; I am busy with other things.

> > @@ -1985,13 +1985,18 @@
> >     ??? This number is bogus and needs to be replaced before the value is
> >     actually used in optimizations.  */
> 
> I suggest removing this comment, as it has become outdated with your
> patch.  Instead, you might explain how you chose this particular value
> (and PREFETCH_BLOCK too).  Just my 2c.
> 
> Andrey
> 
> 

Please refer to my previous mail and the attached paper.

Canqun Yang


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [patch] Improve loop array prefetch for IA-64
  2006-06-02  9:50 ` Steven Bosscher
@ 2006-06-02 10:36   ` Canqun Yang
  0 siblings, 0 replies; 10+ messages in thread
From: Canqun Yang @ 2006-06-02 10:36 UTC (permalink / raw)
  To: Steven Bosscher, Andrey Belevantsev, gcc, gcc-patches

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=gb2312, Size: 2085 bytes --]

--- Steven Bosscher <stevenb.gcc@gmail.com>:

> On 6/2/06, Canqun Yang <canqun@yahoo.com.cn> wrote:
> > This patch yields a performance increase of 4% for SPECfp2000 and 13% for the NAS benchmark
> > suite on an Itanium-2 system. Further gains should be possible by tuning the parameters
> > further and improving the prefetch algorithm at the tree level.
> 
> Bravo.
> 
> > --- ia64.h (revision 114307)
> > +++ ia64.h (working copy)
> > @@ -1985,13 +1985,18 @@
> >    ??? This number is bogus and needs to be replaced before the value is
> >    actually used in optimizations.  */
> >
> > -#define SIMULTANEOUS_PREFETCHES 6
> > +#define SIMULTANEOUS_PREFETCHES 18
> 
> Is the number still bogus as the comment suggests, or is there a
> rationale for 18?  It looks quite high.
> 

The number is still bogus, but the original value of 6 is too small. For most of the SPECfp2000
and NAS benchmarks, 12 is enough. Only the SPECfp2000 program 171.swim needs many prefetches; the
best value for 171.swim is 20. I have attached my ACSAC05 paper to this mail. It describes the
work more clearly than the version in the proceedings of the GCC Summit 2005.

> > +/* A number that should roughly correspond to the number of instructions
> > +   executed before the prefetch is completed.  */
> > +
> > +#define PREFETCH_LATENCY 400
> 
> Likewise.  Is 400 cycles the memory latency on itanium-2?
> 
> Gr.
> Steven
> 

It is not the memory latency on Itanium-2. The default value of PREFETCH_LATENCY is 200. It is
roughly the number of instructions executed before the prefetch completes. Itanium-2 is a
multi-issue architecture and may issue more than one instruction per cycle, so I roughly
estimate the average IPC (instructions per cycle) to be about 2. Doubling PREFETCH_LATENCY
ensures that the prefetches are issued early enough.

The prefetch algorithm cannot currently determine the exact number of cycles the loop takes to
execute, so 400 is still bogus.
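
As a worked version of this estimate, here is a sketch in C. It assumes that
the prefetch distance is obtained by dividing the latency, counted in
instructions, by the number of instructions in the loop body and rounding up.
The function names and the 25-instruction loop body are illustrative only and
are not taken from the patch.

```c
/* Scale a latency given for single-issue execution by the estimated
   average number of instructions issued per cycle.  */
static unsigned
latency_in_insns (unsigned base_latency, unsigned avg_ipc)
{
  return base_latency * avg_ipc;
}

/* Number of loop iterations to prefetch ahead: the latency divided by the
   loop body size, rounded up.  LOOP_INSNS must be nonzero.  */
static unsigned
prefetch_ahead (unsigned latency_insns, unsigned loop_insns)
{
  return (latency_insns + loop_insns - 1) / loop_insns;
}
```

With the default of 200 and an IPC of 2 the latency becomes 400
instructions, so a loop body of 25 instructions would be prefetched 16
iterations ahead instead of 8.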

Canqun Yang


[-- Attachment #2: 2769953213-acsac05-yang.pdf --]
[-- Type: application/pdf, Size: 130449 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [patch] Improve loop array prefetch for IA-64
  2006-06-02  9:14 Canqun Yang
  2006-06-02  9:48 ` Andrey Belevantsev
@ 2006-06-02  9:50 ` Steven Bosscher
  2006-06-02 10:36   ` Canqun Yang
  1 sibling, 1 reply; 10+ messages in thread
From: Steven Bosscher @ 2006-06-02  9:50 UTC (permalink / raw)
  To: Canqun Yang; +Cc: gcc, gcc-patches

On 6/2/06, Canqun Yang <canqun@yahoo.com.cn> wrote:
> This patch yields a performance increase of 4% for SPECfp2000 and 13% for the NAS benchmark
> suite on an Itanium-2 system. Further gains should be possible by tuning the parameters
> further and improving the prefetch algorithm at the tree level.

Bravo.

> --- ia64.h (revision 114307)
> +++ ia64.h (working copy)
> @@ -1985,13 +1985,18 @@
>    ??? This number is bogus and needs to be replaced before the value is
>    actually used in optimizations.  */
>
> -#define SIMULTANEOUS_PREFETCHES 6
> +#define SIMULTANEOUS_PREFETCHES 18

Is the number still bogus as the comment suggests, or is there a
rationale for 18?  It looks quite high.

> +/* A number that should roughly correspond to the number of instructions
> +   executed before the prefetch is completed.  */
> +
> +#define PREFETCH_LATENCY 400

Likewise.  Is 400 cycles the memory latency on itanium-2?

Gr.
Steven

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [patch] Improve loop array prefetch for IA-64
  2006-06-02  9:14 Canqun Yang
@ 2006-06-02  9:48 ` Andrey Belevantsev
  2006-06-02  9:50 ` Steven Bosscher
  1 sibling, 0 replies; 10+ messages in thread
From: Andrey Belevantsev @ 2006-06-02  9:48 UTC (permalink / raw)
  To: Canqun Yang; +Cc: gcc, gcc-patches

Canqun Yang wrote:
> Hi, all
> 
> This patch yields a performance increase of 4% for SPECfp2000 and 13% for the NAS benchmark
> suite on an Itanium-2 system. Further gains should be possible by tuning the parameters
> further and improving the prefetch algorithm at the tree level.

Hi Canqun,

It's great news that you continued to work on prefetching tuning for 
ia64!  Do you plan to port your other changes for the old RTL 
prefetching to the tree level?

> @@ -1985,13 +1985,18 @@
>     ??? This number is bogus and needs to be replaced before the value is
>     actually used in optimizations.  */

I suggest removing this comment, as it has become outdated with your
patch.  Instead, you might explain how you chose this particular value
(and PREFETCH_BLOCK too).  Just my 2c.

Andrey

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [patch] Improve loop array prefetch for IA-64
@ 2006-06-02  9:14 Canqun Yang
  2006-06-02  9:48 ` Andrey Belevantsev
  2006-06-02  9:50 ` Steven Bosscher
  0 siblings, 2 replies; 10+ messages in thread
From: Canqun Yang @ 2006-06-02  9:14 UTC (permalink / raw)
  To: gcc, gcc-patches

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=gb2312, Size: 1847 bytes --]

Hi, all

This patch yields a performance increase of 4% for SPECfp2000 and 13% for the NAS benchmark
suite on an Itanium-2 system. Further gains should be possible by tuning the parameters
further and improving the prefetch algorithm at the tree level.

Details of NAS benchmarks are listed below.


GCC options: -O3 -fprefetch-loop-arrays
Target: Itanium-2 1.6GHz; L2 Cache 256K, L3 Cache 6M
Execution times in seconds

           -this patch   +this patch
bt.W          14.43         14.17
cg.A          13.76          6.86
ep.W           7.83          7.79
ft.A          18.73         20.15
is.B          11.85         10.94
lu.W          20.55         20.27
mg.A          15.09         11.86
sp.W          37.11         35.49
geomean       15.84         13.94
speedup                     13.68%


2006-06-02  Canqun Yang  <canqun@nudt.edu.cn>

 * config/ia64/ia64.h (SIMULTANEOUS_PREFETCHES): Define to 18.
 (PREFETCH_BLOCK): Define to 128.
 (PREFETCH_LATENCY): Define to 400.

Index: ia64.h
===================================================================
--- ia64.h (revision 114307)
+++ ia64.h (working copy)
@@ -1985,13 +1985,18 @@
    ??? This number is bogus and needs to be replaced before the value is
    actually used in optimizations.  */
 
-#define SIMULTANEOUS_PREFETCHES 6
+#define SIMULTANEOUS_PREFETCHES 18
 
 /* If this architecture supports prefetch, define this to be the size of
    the cache line that is prefetched.  */
 
-#define PREFETCH_BLOCK 32
+#define PREFETCH_BLOCK 128
 
+/* A number that should roughly correspond to the number of instructions
+   executed before the prefetch is completed.  */
+
+#define PREFETCH_LATENCY 400
+
 #define HANDLE_SYSV_PRAGMA 1
 
 /* A C expression for the maximum number of instructions to execute via


Canqun Yang



^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2006-06-03  3:54 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <E11A89E888E04547A5E0158061F40A2008234B@hdsmsx412.amr.corp.intel.com.suse.lists.egcs>
     [not found] ` <571f6b510606021517r65edcb8fh1a6bf06370fb0a19@mail.gmail.com.suse.lists.egcs>
2006-06-03  3:54   ` [patch] Improve loop array prefetch for IA-64 Andi Kleen
2006-06-02 15:21 Davis, Mark
2006-06-02 22:17 ` Steven Bosscher
2006-06-02 22:32   ` Steven Bosscher
2006-06-03  0:19 ` Canqun Yang
  -- strict thread matches above, loose matches on Subject: below --
2006-06-02 11:00 Canqun Yang
2006-06-02  9:14 Canqun Yang
2006-06-02  9:48 ` Andrey Belevantsev
2006-06-02  9:50 ` Steven Bosscher
2006-06-02 10:36   ` Canqun Yang
