user-guided speculative precomputation? (my wacky ia64 idea)

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* user-guided speculative precomputation? (my wacky ia64 idea)
@ 2004-04-01  0:07 Duraid Madina
  2004-04-06  6:02 ` Jim Wilson
  0 siblings, 1 reply; 10+ messages in thread
From: Duraid Madina @ 2004-04-01  0:07 UTC (permalink / raw)
  To: gcc

Hi GCCers,

	I was bored enough to look through some IA64 emitted by GCC and noticed 
that IU utilization was in most places pretty horrid. nops everywhere! 
Oh well.

	I thought to myself, why does GCC even bother issuing to 6 ports? How 
bad would it be if I clamped it down to 4? A few hours later, I had my 
answer: _not too bad_. By my hackish estimations (I haven't gone ahead 
and bootstrapped an appropriately crippled compiler, but instead just 
performed some calculations on 'representative' (short) instruction 
traces to back-of-the-envelope accuracy ;) and it looks like SPECint2k 
will lose "only" around 20% performance.

	Then I thought, why not treat such a wide machine as a multithreaded 
one, but where one thread doesn't have any flow control instructions? ;) 
i.e. instead of sprinkling nops all over the place, just give up and 
deliberately reserve an instruction out of every bundle for another 
stream of execution.

	What to do with such a stream? Well, why not support speculative 
precomputation? (aggressively chase pointers so as to prefetch data into 
cache before the "main" thread needs it, as in "Memory Latency-Tolerance 
Approaches for Itanium Processors" by Wang et al, in proc. HPCA'02) It's 
tricky to do this automatically, but my question to the group is: would 
it be sensible to add support for something like a "#pragma 
delinquent_load" that could be inserted at the appropriate point in e.g. 
graph/tree/list traversal code etc.? A "cookie cutter" prefetching 
sequence could then be fitted in the reserved area alongside the main 
thread, silently prefetching data more intelligently than any non-insane 
hardware prefetcher could ever hope to. OoO is great, but I've yet to 
see one that can keep track of things hundreds of cycles out...

	The potential gains seem phenomenal, particularly for in-order machines 
such as the (current) Itanium line, but one wonders if it couldn't also 
help other architectures, at least in mitigating L2/L3 misses.

	Duraid

P.S. The real question, of course, is whether or not this post is an 
April fool's joke. ;)

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: user-guided speculative precomputation? (my wacky ia64 idea)
  2004-04-01  0:07 user-guided speculative precomputation? (my wacky ia64 idea) Duraid Madina
@ 2004-04-06  6:02 ` Jim Wilson
  2004-04-07  0:08   ` Duraid Madina
  0 siblings, 1 reply; 10+ messages in thread
From: Jim Wilson @ 2004-04-06  6:02 UTC (permalink / raw)
  To: Duraid Madina; +Cc: gcc

Duraid Madina wrote:
> tricky to do this automatically, but my question to the group is: would 
> it be sensible to add support for something like a "#pragma 
> delinquent_load" that could be inserted at the appropriate point in e.g. 
> graph/tree/list traversal code etc.? A "cookie cutter" prefetching 
> sequence could then be fitted in the reserved area alongside the main 
> thread

You didn't define what #pragma delinquent_load does, and I don't have a 
copy of that paper.

Probably you would have to try to implement it before you can tell 
whether you can make it work.

The limited number of bundle formats makes it difficult to reserve an 
issue slot.  You will get differing types of left-over slots depending 
on what instructions from the main thread get scheduled, and you will 
have differing types of insns that need to fit into those slots.  This 
could lead to a difficult packing problem.  And there is also the issue 
of stop bits.
-- 
Jim Wilson, GNU Tools Support, http://www.SpecifixInc.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: user-guided speculative precomputation? (my wacky ia64 idea)
  2004-04-06  6:02 ` Jim Wilson
@ 2004-04-07  0:08   ` Duraid Madina
  2004-04-07  1:54     ` James Morrison
  2004-04-07 23:46     ` Jim Wilson
  0 siblings, 2 replies; 10+ messages in thread
From: Duraid Madina @ 2004-04-07  0:08 UTC (permalink / raw)
  To: Jim Wilson; +Cc: gcc

Jim Wilson wrote:
> Duraid Madina wrote:
> 
>> tricky to do this automatically, but my question to the group is: 
>> would it be sensible to add support for something like a "#pragma 
>> delinquent_load" that could be inserted at the appropriate point in 
>> e.g. graph/tree/list traversal code etc.? A "cookie cutter" 
>> prefetching sequence could then be fitted in the reserved area 
>> alongside the main thread
>
> You didn't define what #pragma delinquent_load does, and I don't have a 
> copy of that paper.

The paper is freely available here:

http://www.hpcaconf.org/hpca8/program/papers/wang_memory_tolerance.ps

and well worth reading for anyone interested in this kind of 
optimization or just wondering what future IA64 processors might be like.

This #pragma would be added to code such as (taken from 
http://www.intel.com/technology/itj/2002/volume06issue01/art03_specprecomp/p06_xeon.htm 
)

{
	n = NodeArray[0];
	while(n && remaining)
	{
		doSomeWork();
#pragma delinquent_load
		n->i = n->next->j + n->next->k + n->next->l;
		n = n->next;
		remaining--;
	}
}

because the line following the pragma, in many cases, would be 
responsible for nearly 100% of the L2 or L3 misses of a program, and 
therefore responsible for a significant fraction of the program's 
execution time if doSomeWork() doesn't actually do a _lot_ of work, 
which is very common in things to do with graph traversal (e.g. database 
querying etc)

The pragma signals two things to the compiler:

	- The following line is responsible for a very high proportion of total 
L2/L3 cache misses (in case you're wondering, many real world programs 
seem to be well characterised in this way, having only a few delinquent 
loads despite being very large programs)

	- The LHS of the following line is probably involved in both loop 
control as well as the loop body

> Probably you would have to try to implement it before you can tell 
> whether you can make it work.

mm. Me and my big mouth..

> The limited number of bundle formats makes it difficult to reserve an 
> issue slot.  You will get differing types of left-over slots depending 
> on what instructions from the main thread get scheduled, and you will 
> have differing types of insns that need to fit into those slots.  This 
> could lead to a difficult packing problem.  And there is also the issue 
> of stop bits.

Agreed, though I trust that by "difficult packing problem" you mean a 
problem that might not be very likely to succeed, not a computationally 
difficult task. The thing that gives me hope is that the other 'thread' 
really doesn't have to do a lot at all; it just has to dereference 
pointers, and loop. Still, when at the mercy of the primary thread's 
branching it might be difficult to get such a thing running usefully 
alongside the main computation, even if a great deal of the chasing 
thread's execution state can be stored in (e.g.) some reserved global 
registers. I'm really not sure that the bundle formats will be such a 
big restriction, because if you take a look at what GCC emits for 
"pointer-mad" code, there's often quite a bit of room to move.

	Duraid

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: user-guided speculative precomputation? (my wacky ia64 idea)
  2004-04-07  0:08   ` Duraid Madina
@ 2004-04-07  1:54     ` James Morrison
  2004-04-07  4:45       ` Daniel Jacobowitz
  2004-04-07 23:46     ` Jim Wilson
  1 sibling, 1 reply; 10+ messages in thread
From: James Morrison @ 2004-04-07  1:54 UTC (permalink / raw)
  To: Duraid Madina; +Cc: Jim Wilson, gcc

Duraid Madina <duraid@octopus.com.au> writes:
> This #pragma would be added to code such as (taken from
> http://www.intel.com/technology/itj/2002/volume06issue01/art03_specprecomp/p06_xeon.htm
> )
> 
> {
> 	n = NodeArray[0];
> 	while(n && remaining)
> 	{
> 		doSomeWork();
> #pragma delinquent_load
> 		n->i = n->next->j + n->next->k + n->next->l;
> 		n = n->next;
> 		remaining--;
> 	}
> }
> 

 I'm not sure if I should be scared by this example or put it up to lazyness.
However, n is going to be compared each time we go through the loop even
though it is only supposed to be tested once.  If n is ever 0 at the top of
the loop, after executing the loop, then a segfault would have already occured.
I don't suppose it's possible to optimize this, is it?

Jim

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: user-guided speculative precomputation? (my wacky ia64 idea)
  2004-04-07  1:54     ` James Morrison
@ 2004-04-07  4:45       ` Daniel Jacobowitz
  2004-04-07  5:19         ` James Morrison
  2004-04-07  5:53         ` Duraid Madina
  0 siblings, 2 replies; 10+ messages in thread
From: Daniel Jacobowitz @ 2004-04-07  4:45 UTC (permalink / raw)
  To: James Morrison; +Cc: Duraid Madina, Jim Wilson, gcc

On Tue, Apr 06, 2004 at 09:52:51PM -0400, James Morrison wrote:
> 
> Duraid Madina <duraid@octopus.com.au> writes:
> > This #pragma would be added to code such as (taken from
> > http://www.intel.com/technology/itj/2002/volume06issue01/art03_specprecomp/p06_xeon.htm
> > )
> > 
> > {
> > 	n = NodeArray[0];
> > 	while(n && remaining)
> > 	{
> > 		doSomeWork();
> > #pragma delinquent_load
> > 		n->i = n->next->j + n->next->k + n->next->l;
> > 		n = n->next;
> > 		remaining--;
> > 	}
> > }
> > 
> 
>  I'm not sure if I should be scared by this example or put it up to lazyness.
> However, n is going to be compared each time we go through the loop even
> though it is only supposed to be tested once.  If n is ever 0 at the top of
> the loop, after executing the loop, then a segfault would have already occured.
> I don't suppose it's possible to optimize this, is it?

Eh, you're incorrect.  Note the n = n->next at the bottom of the loop.

In general, if the circumstance you're describing had actually occured
in this example - yes, I believe that at least tree-ssa or lno could
optimize this.  There would be a non-null marker on the dereferenced
pointer, which would reach a PHI at the top of the loop.

-- 
Daniel Jacobowitz
MontaVista Software                         Debian GNU/Linux Developer

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: user-guided speculative precomputation? (my wacky ia64 idea)
  2004-04-07  4:45       ` Daniel Jacobowitz
@ 2004-04-07  5:19         ` James Morrison
  2004-04-07  5:53         ` Duraid Madina
  1 sibling, 0 replies; 10+ messages in thread
From: James Morrison @ 2004-04-07  5:19 UTC (permalink / raw)
  To: Daniel Jacobowitz; +Cc: Duraid Madina, Jim Wilson, gcc


Daniel Jacobowitz <drow@false.org> writes:
> > > {
> > > 	n = NodeArray[0];
> > > 	while(n && remaining)
> > > 	{
> > > 		doSomeWork();
> > > #pragma delinquent_load
> > > 		n->i = n->next->j + n->next->k + n->next->l;
                         ^^^^^^^^ next = 0 => segfault.
> > > 		n = n->next;
> > > 		remaining--;
> > > 	}
> > > }
> > > 
> > 
> >  I'm not sure if I should be scared by this example or put it up to lazyness.
> > However, n is going to be compared each time we go through the loop even
> > though it is only supposed to be tested once.  If n is ever 0 at the top of
> > the loop, after executing the loop, then a segfault would have already occured.
> > I don't suppose it's possible to optimize this, is it?
> 
> Eh, you're incorrect.  Note the n = n->next at the bottom of the loop.
> 
> In general, if the circumstance you're describing had actually occured
> in this example - yes, I believe that at least tree-ssa or lno could
> optimize this.  There would be a non-null marker on the dereferenced
> pointer, which would reach a PHI at the top of the loop.
> 
> -- 
> Daniel Jacobowitz
> MontaVista Software                         Debian GNU/Linux Developer

 Note the line above that.  However, the lno branch from 20040321 doesn't
take the if (n == 0) out of the bottom of the loop.  I think it is valid to
have a signal handler catch a segfault which then sets n to null.

Jim

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: user-guided speculative precomputation? (my wacky ia64 idea)
  2004-04-07  4:45       ` Daniel Jacobowitz
  2004-04-07  5:19         ` James Morrison
@ 2004-04-07  5:53         ` Duraid Madina
  1 sibling, 0 replies; 10+ messages in thread
From: Duraid Madina @ 2004-04-07  5:53 UTC (permalink / raw)
  To: Daniel Jacobowitz; +Cc: James Morrison, Jim Wilson, gcc

Daniel Jacobowitz wrote:
> On Tue, Apr 06, 2004 at 09:52:51PM -0400, James Morrison wrote:
> 
>>Duraid Madina <duraid@octopus.com.au> writes:
>>
>>>This #pragma would be added to code such as (taken from
>>>http://www.intel.com/technology/itj/2002/volume06issue01/art03_specprecomp/p06_xeon.htm
>>>)
>>>
>>>{
>>>	n = NodeArray[0];
>>>	while(n && remaining)
>>>	{
>>>		doSomeWork();
>>>#pragma delinquent_load
>>>		n->i = n->next->j + n->next->k + n->next->l;
>>>		n = n->next;
>>>		remaining--;
>>>	}
>>>}
>>>
>>
>> I'm not sure if I should be scared by this example or put it up to lazyness.
>>However, n is going to be compared each time we go through the loop even
>>though it is only supposed to be tested once.  If n is ever 0 at the top of
>>the loop, after executing the loop, then a segfault would have already occured.
>>I don't suppose it's possible to optimize this, is it?
> 
> 
> Eh, you're incorrect.  Note the n = n->next at the bottom of the loop.

The code isn't clean (stick if()s in or whatever makes you happy), the 
point is just to demonstrate the kind of pointer thrashing code that 
speculative precomputation attacks which OoO does not. Which leads me to 
wonder:

> In general, if the circumstance you're describing had actually occured
> in this example - yes, I believe that at least tree-ssa or lno could
> optimize this.  There would be a non-null marker on the dereferenced
> pointer, which would reach a PHI at the top of the loop.

What exactly would tree-ssa or lno _do_? Are there opportunities to 
insert prefetches (I mean loads for the purpose of bringing stuff into 
cache ahead of time, not deliberate prefetch opcodes) far enough before 
the accesses? You'd want to be at least hundreds of cycles ahead here, 
perhaps >300 or so.

Basically, duplicating pointer computation and reads seems to be a big 
win if you can do it far enough in advance (and if it is localised 
enough that you don't blow out your instruction cache). It seems that 
such code is not uncommon in "the real world". And while it has the 
biggest impact on in-order machines such as current Itanium CPUs, there 
are times (as described in the HPCA paper) where OoO and spec. pre-comp 
are actually complementary, so it could be a worthwhile optimization to 
think about at the tree-ssa level, and applicable to a number of 
different architectures.

Any comments? I'm no GCC expert so I'd be very glad to hear from those 
who are..

	Duraid

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: user-guided speculative precomputation? (my wacky ia64 idea)
  2004-04-07  0:08   ` Duraid Madina
  2004-04-07  1:54     ` James Morrison
@ 2004-04-07 23:46     ` Jim Wilson
  2004-04-08  0:10       ` Duraid Madina
  1 sibling, 1 reply; 10+ messages in thread
From: Jim Wilson @ 2004-04-07 23:46 UTC (permalink / raw)
  To: Duraid Madina; +Cc: gcc

On Tue, 2004-04-06 at 17:08, Duraid Madina wrote:
> This #pragma would be added to code such as (taken from 
> http://www.intel.com/technology/itj/2002/volume06issue01/art03_specprecomp/p06_xeon.htm 

There is some potential for confusion over what the pragma applies to. 
In a more complex example, the following line might contain more than
one statement, or one statement might be spread over several lines, or
the next line may contain a control-flow statement like an if.  In some
of these cases, it might not be clear what to do unless this is defined
well.

I think there is no hope of implementing this in the current RTL
optimizers.  This would have to be done in the tree-ssa infrastructure
which is not yet in the mainline compiler, but hopefully coming soon.

The papers talk about using SMT systems, and running two threads on
different processors.  This allows the two threads to run independently,
and if we can get the second thread to run ahead of the first one, then
the first one runs faster than it does without the second one.

But if we are running both threads on a single IA-64 cpu, then they are
now running in lock-step and things break down.  If the second thread
stalls because of a cache miss or page fault, then the primary thread
stalls too, so now the secondary thread is slowly down the primary
thread in some cases which is undesirable.  This doesn't seem to work
unless we use prefetch instructions in the secondary thread, and now we
aren't doing speculative precomputation anymore.  We are just doing
simple prefetching.

> Agreed, though I trust that by "difficult packing problem" you mean a 
> problem that might not be very likely to succeed

Yes.  You might not be able to get rid of as many nops as you hoped
because of packing limitations.
-- 
Jim Wilson, GNU Tools Support, http://www.SpecifixInc.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: user-guided speculative precomputation? (my wacky ia64 idea)
  2004-04-07 23:46     ` Jim Wilson
@ 2004-04-08  0:10       ` Duraid Madina
  2004-04-08  0:39         ` Jim Wilson
  0 siblings, 1 reply; 10+ messages in thread
From: Duraid Madina @ 2004-04-08  0:10 UTC (permalink / raw)
  To: Jim Wilson; +Cc: gcc

Jim Wilson wrote:
> There is some potential for confusion over what the pragma applies to. 
> In a more complex example, the following line might contain more than
> one statement, or one statement might be spread over several lines, or
> the next line may contain a control-flow statement like an if.  In some
> of these cases, it might not be clear what to do unless this is defined
> well.

Agreed, a simple pragma isn't really up to the task, but it seemed 
easier than dredging up this information from profile feedback.

> I think there is no hope of implementing this in the current RTL
> optimizers.  This would have to be done in the tree-ssa infrastructure
> which is not yet in the mainline compiler, but hopefully coming soon.

This will be the entry point for a new wave of GCC hackers, I hope!

> But if we are running both threads on a single IA-64 cpu, then they are
> now running in lock-step and things break down.  If the second thread
> stalls because of a cache miss or page fault, then the primary thread
> stalls too, so now the secondary thread is slowly down the primary
> thread in some cases which is undesirable.  This doesn't seem to work
> unless we use prefetch instructions in the secondary thread, and now we
> aren't doing speculative precomputation anymore.  We are just doing
> simple prefetching.

On that note, does GCC *ever* emit ld8.a / chk.a pairs? This would be a 
very nice addition for yucky pointer code, and isn't the can of worms 
for the compiler that ld.s/chk.s is.

> Yes.  You might not be able to get rid of as many nops as you hoped
> because of packing limitations.

If 1% of those nops turn into chk.a instructions, I'll be very pleased.. 
Bring on tree-ssa! :)

	Duraid

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: user-guided speculative precomputation? (my wacky ia64 idea)
  2004-04-08  0:10       ` Duraid Madina
@ 2004-04-08  0:39         ` Jim Wilson
  0 siblings, 0 replies; 10+ messages in thread
From: Jim Wilson @ 2004-04-08  0:39 UTC (permalink / raw)
  To: Duraid Madina; +Cc: gcc

On Wed, 2004-04-07 at 17:09, Duraid Madina wrote:
> On that note, does GCC *ever* emit ld8.a / chk.a pairs? This would be a 
> very nice addition for yucky pointer code, and isn't the can of worms 
> for the compiler that ld.s/chk.s is.

No.

Cygnus did try to implement some advanced/speculative load support about
4-5 years ago, but we couldn't get a reliable performance increase.  I
think part of the problem was that the alias info in the RTL wasn't
quite good enough, and as a result we ended up generating too many
advanced loads which hurt performance.  It would be worthwhile trying
again some day, as there have been infrastructure improvements since
then.  Or maybe we didn't find the right set of heuristics to use. 
Unfortunately, there isn't anyone doing serious IA-64 gcc work at this
time.
-- 
Jim Wilson, GNU Tools Support, http://www.SpecifixInc.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2004-04-08  0:39 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-04-01  0:07 user-guided speculative precomputation? (my wacky ia64 idea) Duraid Madina
2004-04-06  6:02 ` Jim Wilson
2004-04-07  0:08   ` Duraid Madina
2004-04-07  1:54     ` James Morrison
2004-04-07  4:45       ` Daniel Jacobowitz
2004-04-07  5:19         ` James Morrison
2004-04-07  5:53         ` Duraid Madina
2004-04-07 23:46     ` Jim Wilson
2004-04-08  0:10       ` Duraid Madina
2004-04-08  0:39         ` Jim Wilson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).