* user-guided speculative precomputation? (my wacky ia64 idea)
From: Duraid Madina @ 2004-04-01 0:07 UTC (permalink / raw)
To: gcc
Hi GCCers,
I was bored enough to look through some IA64 code emitted by GCC and
noticed that IU (integer unit) utilization was in most places pretty
horrid. Nops everywhere! Oh well.
I thought to myself, why does GCC even bother issuing to 6 ports? How
bad would it be if I clamped it down to 4? A few hours later, I had my
answer: _not too bad_. By my hackish estimations (I haven't gone ahead
and bootstrapped an appropriately crippled compiler, but instead just
performed some calculations on 'representative' (short) instruction
traces to back-of-the-envelope accuracy ;) it looks like SPECint2k
would lose "only" around 20% performance.
Then I thought, why not treat such a wide machine as a multithreaded
one, but where one thread doesn't have any flow control instructions? ;)
i.e. instead of sprinkling nops all over the place, just give up and
deliberately reserve an instruction out of every bundle for another
stream of execution.
What to do with such a stream? Well, why not support speculative
precomputation? (aggressively chase pointers so as to prefetch data into
cache before the "main" thread needs it, as in "Memory Latency-Tolerance
Approaches for Itanium Processors" by Wang et al, in proc. HPCA'02) It's
tricky to do this automatically, but my question to the group is: would
it be sensible to add support for something like a "#pragma
delinquent_load" that could be inserted at the appropriate point in e.g.
graph/tree/list traversal code etc.? A "cookie cutter" prefetching
sequence could then be fitted in the reserved area alongside the main
thread, silently prefetching data more intelligently than any non-insane
hardware prefetcher could ever hope to. OoO execution is great, but I've
yet to see an implementation that can keep track of things hundreds of
cycles out...
The potential gains seem phenomenal, particularly for in-order machines
such as the (current) Itanium line, but one wonders if it couldn't also
help other architectures, at least in mitigating L2/L3 misses.
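For illustration only, here is roughly what such a hand-written "chasing"
sequence might look like when expressed in portable C with GCC's
__builtin_prefetch (the struct, the traversal, and the lookahead depth of
3 are all made-up example choices, not anything GCC emits today):

```c
#include <stddef.h>

struct node {
    struct node *next;
    long payload;
};

/* Walk a linked list, and while working on the current node,
   speculatively chase a few links ahead so those nodes are
   (hopefully) in cache by the time the main loop reaches them.
   The lookahead depth of 3 is an arbitrary illustrative choice. */
long traverse(struct node *n)
{
    long sum = 0;
    while (n) {
        struct node *ahead = n->next;
        for (int d = 0; d < 3 && ahead; d++) {
            __builtin_prefetch(ahead, 0, 0); /* read, low temporal locality */
            ahead = ahead->next;             /* the pointer chase itself */
        }
        sum += n->payload;                   /* the "main thread" work */
        n = n->next;
    }
    return sum;
}
```

In the proposal above, the chase would live in the reserved issue slots
rather than being interleaved in source like this.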
Duraid
P.S. The real question, of course, is whether or not this post is an
April fool's joke. ;)
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: user-guided speculative precomputation? (my wacky ia64 idea)
From: Jim Wilson @ 2004-04-06 6:02 UTC (permalink / raw)
To: Duraid Madina; +Cc: gcc
Duraid Madina wrote:
> tricky to do this automatically, but my question to the group is: would
> it be sensible to add support for something like a "#pragma
> delinquent_load" that could be inserted at the appropriate point in e.g.
> graph/tree/list traversal code etc.? A "cookie cutter" prefetching
> sequence could then be fitted in the reserved area alongside the main
> thread
You didn't define what #pragma delinquent_load does, and I don't have a
copy of that paper.
Probably you would have to try to implement it before you can tell
whether you can make it work.
The limited number of bundle formats makes it difficult to reserve an
issue slot. You will get differing types of left-over slots depending
on what instructions from the main thread get scheduled, and you will
have differing types of insns that need to fit into those slots. This
could lead to a difficult packing problem. And there is also the issue
of stop bits.
--
Jim Wilson, GNU Tools Support, http://www.SpecifixInc.com
* Re: user-guided speculative precomputation? (my wacky ia64 idea)
From: Duraid Madina @ 2004-04-07 0:08 UTC (permalink / raw)
To: Jim Wilson; +Cc: gcc
Jim Wilson wrote:
> Duraid Madina wrote:
>
>> tricky to do this automatically, but my question to the group is:
>> would it be sensible to add support for something like a "#pragma
>> delinquent_load" that could be inserted at the appropriate point in
>> e.g. graph/tree/list traversal code etc.? A "cookie cutter"
>> prefetching sequence could then be fitted in the reserved area
>> alongside the main thread
>
> You didn't define what #pragma delinquent_load does, and I don't have a
> copy of that paper.
The paper is freely available here:
http://www.hpcaconf.org/hpca8/program/papers/wang_memory_tolerance.ps
and well worth reading for anyone interested in this kind of
optimization or just wondering what future IA64 processors might be like.
This #pragma would be added to code such as the following (taken from
http://www.intel.com/technology/itj/2002/volume06issue01/art03_specprecomp/p06_xeon.htm):
{
    n = NodeArray[0];
    while(n && remaining)
    {
        doSomeWork();
#pragma delinquent_load
        n->i = n->next->j + n->next->k + n->next->l;
        n = n->next;
        remaining--;
    }
}
because the line following the pragma would, in many cases, be
responsible for nearly 100% of a program's L2 or L3 misses, and
therefore for a significant fraction of the program's execution time
whenever doSomeWork() doesn't actually do a _lot_ of work. That
situation is very common in graph-traversal code (e.g. database
querying).
The pragma signals two things to the compiler:
- The following line is responsible for a very high proportion of total
L2/L3 cache misses (in case you're wondering, many real world programs
seem to be well characterised in this way, having only a few delinquent
loads despite being very large programs)
- The LHS of the following line is probably involved in both loop
control as well as the loop body
> Probably you would have to try to implement it before you can tell
> whether you can make it work.
mm. Me and my big mouth..
> The limited number of bundle formats makes it difficult to reserve an
> issue slot. You will get differing types of left-over slots depending
> on what instructions from the main thread get scheduled, and you will
> have differing types of insns that need to fit into those slots. This
> could lead to a difficult packing problem. And there is also the issue
> of stop bits.
Agreed, though I trust that by "difficult packing problem" you mean a
packing that might often fail to find room, not a computationally hard
task. The thing that gives me hope is that the other 'thread'
really doesn't have to do a lot at all; it just has to dereference
pointers, and loop. Still, when at the mercy of the primary thread's
branching it might be difficult to get such a thing running usefully
alongside the main computation, even if a great deal of the chasing
thread's execution state can be stored in (e.g.) some reserved global
registers. I'm really not sure that the bundle formats will be such a
big restriction, because if you take a look at what GCC emits for
"pointer-mad" code, there's often quite a bit of room to move.
Duraid
* Re: user-guided speculative precomputation? (my wacky ia64 idea)
From: James Morrison @ 2004-04-07 1:54 UTC (permalink / raw)
To: Duraid Madina; +Cc: Jim Wilson, gcc
Duraid Madina <duraid@octopus.com.au> writes:
> This #pragma would be added to code such as the following (taken from
> http://www.intel.com/technology/itj/2002/volume06issue01/art03_specprecomp/p06_xeon.htm):
>
> {
>     n = NodeArray[0];
>     while(n && remaining)
>     {
>         doSomeWork();
> #pragma delinquent_load
>         n->i = n->next->j + n->next->k + n->next->l;
>         n = n->next;
>         remaining--;
>     }
> }
>
I'm not sure if I should be scared by this example or chalk it up to
laziness. However, n is going to be compared each time we go through the
loop even though it is only supposed to be tested once. If n were ever 0
at the top of the loop after executing the loop body, a segfault would
already have occurred. I don't suppose it's possible to optimize this,
is it?
Jim
* Re: user-guided speculative precomputation? (my wacky ia64 idea)
From: Daniel Jacobowitz @ 2004-04-07 4:45 UTC (permalink / raw)
To: James Morrison; +Cc: Duraid Madina, Jim Wilson, gcc
On Tue, Apr 06, 2004 at 09:52:51PM -0400, James Morrison wrote:
>
> Duraid Madina <duraid@octopus.com.au> writes:
> > This #pragma would be added to code such as the following (taken from
> > http://www.intel.com/technology/itj/2002/volume06issue01/art03_specprecomp/p06_xeon.htm):
> >
> > {
> >     n = NodeArray[0];
> >     while(n && remaining)
> >     {
> >         doSomeWork();
> > #pragma delinquent_load
> >         n->i = n->next->j + n->next->k + n->next->l;
> >         n = n->next;
> >         remaining--;
> >     }
> > }
> >
>
> I'm not sure if I should be scared by this example or chalk it up to laziness.
> However, n is going to be compared each time we go through the loop even
> though it is only supposed to be tested once. If n were ever 0 at the top of
> the loop after executing the loop body, a segfault would already have occurred.
> I don't suppose it's possible to optimize this, is it?
Eh, you're incorrect. Note the n = n->next at the bottom of the loop.
In general, if the circumstance you're describing had actually occurred
in this example, then yes, I believe that at least tree-ssa or lno could
optimize this. There would be a non-null marker on the dereferenced
pointer, which would reach a PHI at the top of the loop.
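A minimal sketch of why the latch test survives in the example as
written (names here are illustrative, not from the original code): n is
redefined before every test, so the non-null fact established by the
body's dereference applies to the old SSA value, not to the one reaching
the PHI at the loop top:

```c
#include <stddef.h>

struct node { struct node *next; long val; };

/* The body dereferences n, so after `s += n->val` the compiler knows
   that value of n was non-null.  But n is then redefined from n->next,
   which may be NULL, so the loop test below is NOT redundant and
   cannot be removed.  (Caller must pass a non-null head.) */
long sum_nonempty(struct node *n)
{
    long s = 0;
    do {
        s += n->val;   /* proves the *old* n was non-null */
        n = n->next;   /* new value; may be NULL */
    } while (n);       /* still needed */
    return s;
}
```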
--
Daniel Jacobowitz
MontaVista Software Debian GNU/Linux Developer
* Re: user-guided speculative precomputation? (my wacky ia64 idea)
From: James Morrison @ 2004-04-07 5:19 UTC (permalink / raw)
To: Daniel Jacobowitz; +Cc: Duraid Madina, Jim Wilson, gcc
Daniel Jacobowitz <drow@false.org> writes:
> > > {
> > >     n = NodeArray[0];
> > >     while(n && remaining)
> > >     {
> > >         doSomeWork();
> > > #pragma delinquent_load
> > >         n->i = n->next->j + n->next->k + n->next->l;
                     ^^^^^^^ next = 0 => segfault.
> > >         n = n->next;
> > >         remaining--;
> > >     }
> > > }
> > >
> >
> > I'm not sure if I should be scared by this example or chalk it up to laziness.
> > However, n is going to be compared each time we go through the loop even
> > though it is only supposed to be tested once. If n were ever 0 at the top of
> > the loop after executing the loop body, a segfault would already have occurred.
> > I don't suppose it's possible to optimize this, is it?
>
> Eh, you're incorrect. Note the n = n->next at the bottom of the loop.
>
> > In general, if the circumstance you're describing had actually occurred
> > in this example, then yes, I believe that at least tree-ssa or lno could
> optimize this. There would be a non-null marker on the dereferenced
> pointer, which would reach a PHI at the top of the loop.
>
> --
> Daniel Jacobowitz
> MontaVista Software Debian GNU/Linux Developer
Note the line above that. However, the lno branch from 20040321 doesn't
take the if (n == 0) out of the bottom of the loop. I think it is valid to
have a signal handler catch a segfault which then sets n to null.
Jim
* Re: user-guided speculative precomputation? (my wacky ia64 idea)
From: Duraid Madina @ 2004-04-07 5:53 UTC (permalink / raw)
To: Daniel Jacobowitz; +Cc: James Morrison, Jim Wilson, gcc
Daniel Jacobowitz wrote:
> On Tue, Apr 06, 2004 at 09:52:51PM -0400, James Morrison wrote:
>
>>Duraid Madina <duraid@octopus.com.au> writes:
>>
>>> This #pragma would be added to code such as the following (taken from
>>> http://www.intel.com/technology/itj/2002/volume06issue01/art03_specprecomp/p06_xeon.htm):
>>>
>>> {
>>>     n = NodeArray[0];
>>>     while(n && remaining)
>>>     {
>>>         doSomeWork();
>>> #pragma delinquent_load
>>>         n->i = n->next->j + n->next->k + n->next->l;
>>>         n = n->next;
>>>         remaining--;
>>>     }
>>> }
>>>
>>
>> I'm not sure if I should be scared by this example or chalk it up to laziness.
>> However, n is going to be compared each time we go through the loop even
>> though it is only supposed to be tested once. If n were ever 0 at the top of
>> the loop after executing the loop body, a segfault would already have occurred.
>> I don't suppose it's possible to optimize this, is it?
>
>
> Eh, you're incorrect. Note the n = n->next at the bottom of the loop.
The code isn't clean (stick if()s in, or whatever makes you happy); the
point is just to demonstrate the kind of pointer-thrashing code that
speculative precomputation attacks and OoO execution does not. Which
leads me to wonder:
> In general, if the circumstance you're describing had actually occurred
> in this example, then yes, I believe that at least tree-ssa or lno could
> optimize this. There would be a non-null marker on the dereferenced
> pointer, which would reach a PHI at the top of the loop.
What exactly would tree-ssa or lno _do_? Are there opportunities to
insert prefetches (I mean loads for the purpose of bringing stuff into
cache ahead of time, not deliberate prefetch opcodes) far enough before
the accesses? You'd want to be at least hundreds of cycles ahead here,
perhaps >300 or so.
Basically, duplicating pointer computation and reads seems to be a big
win if you can do it far enough in advance (and if it is localised
enough that you don't blow out your instruction cache). It seems that
such code is not uncommon in "the real world". And while it has the
biggest impact on in-order machines such as current Itanium CPUs, there
are times (as described in the HPCA paper) where OoO and spec. pre-comp
are actually complementary, so it could be a worthwhile optimization to
think about at the tree-ssa level, and applicable to a number of
different architectures.
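As a rough sketch of "far enough in advance" (illustrative only; the
distance of 8 nodes is a stand-in for however many iterations it takes
to cover a few hundred cycles of memory latency), the duplicated pointer
computation can simply be a second pointer that runs ahead of the
worker:

```c
#include <stddef.h>

#define DIST 8   /* illustrative run-ahead distance, in nodes */

struct node { struct node *next; long val; };

/* Duplicate the pointer chase: `lead` runs DIST nodes ahead of `n`,
   so each node has (ideally) been pulled toward the cache ~DIST
   iterations before the main computation touches it. */
long run_ahead_sum(struct node *n)
{
    struct node *lead = n;
    for (int d = 0; d < DIST && lead; d++)
        lead = lead->next;              /* prime the run-ahead pointer */

    long sum = 0;
    while (n) {
        if (lead) {
            __builtin_prefetch(lead, 0, 0);
            lead = lead->next;          /* the duplicated chase */
        }
        sum += n->val;
        n = n->next;
    }
    return sum;
}
```

The trade-off mentioned above applies directly: the duplicated chase
costs instructions, so it only pays off when the covered miss latency
exceeds that overhead.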
Any comments? I'm no GCC expert so I'd be very glad to hear from those
who are..
Duraid
* Re: user-guided speculative precomputation? (my wacky ia64 idea)
From: Jim Wilson @ 2004-04-07 23:46 UTC (permalink / raw)
To: Duraid Madina; +Cc: gcc
On Tue, 2004-04-06 at 17:08, Duraid Madina wrote:
> This #pragma would be added to code such as (taken from
> http://www.intel.com/technology/itj/2002/volume06issue01/art03_specprecomp/p06_xeon.htm
There is some potential for confusion over what the pragma applies to.
In a more complex example, the following line might contain more than
one statement, or one statement might be spread over several lines, or
the next line may contain a control-flow statement like an if. In some
of these cases, it might not be clear what to do unless this is defined
well.
I think there is no hope of implementing this in the current RTL
optimizers. This would have to be done in the tree-ssa infrastructure
which is not yet in the mainline compiler, but hopefully coming soon.
The papers talk about using SMT systems, and running two threads on
different processors. This allows the two threads to run independently,
and if we can get the second thread to run ahead of the first one, then
the first one runs faster than it does without the second one.
But if we are running both threads on a single IA-64 cpu, then they are
now running in lock-step and things break down. If the second thread
stalls because of a cache miss or page fault, then the primary thread
stalls too, so now the secondary thread is slowing down the primary
thread in some cases which is undesirable. This doesn't seem to work
unless we use prefetch instructions in the secondary thread, and now we
aren't doing speculative precomputation anymore. We are just doing
simple prefetching.
> Agreed, though I trust that by "difficult packing problem" you mean a
> problem that might not be very likely to succeed
Yes. You might not be able to get rid of as many nops as you hoped
because of packing limitations.
--
Jim Wilson, GNU Tools Support, http://www.SpecifixInc.com
* Re: user-guided speculative precomputation? (my wacky ia64 idea)
From: Duraid Madina @ 2004-04-08 0:10 UTC (permalink / raw)
To: Jim Wilson; +Cc: gcc
Jim Wilson wrote:
> There is some potential for confusion over what the pragma applies to.
> In a more complex example, the following line might contain more than
> one statement, or one statement might be spread over several lines, or
> the next line may contain a control-flow statement like an if. In some
> of these cases, it might not be clear what to do unless this is defined
> well.
Agreed, a simple pragma isn't really up to the task, but it seemed
easier than dredging up this information from profile feedback.
> I think there is no hope of implementing this in the current RTL
> optimizers. This would have to be done in the tree-ssa infrastructure
> which is not yet in the mainline compiler, but hopefully coming soon.
This will be the entry point for a new wave of GCC hackers, I hope!
> But if we are running both threads on a single IA-64 cpu, then they are
> now running in lock-step and things break down. If the second thread
> stalls because of a cache miss or page fault, then the primary thread
> stalls too, so now the secondary thread is slowly down the primary
> thread in some cases which is undesirable. This doesn't seem to work
> unless we use prefetch instructions in the secondary thread, and now we
> aren't doing speculative precomputation anymore. We are just doing
> simple prefetching.
On that note, does GCC *ever* emit ld8.a / chk.a pairs? This would be a
very nice addition for yucky pointer code, and isn't the can of worms
for the compiler that ld.s/chk.s is.
> Yes. You might not be able to get rid of as many nops as you hoped
> because of packing limitations.
If 1% of those nops turn into chk.a instructions, I'll be very pleased..
Bring on tree-ssa! :)
Duraid
* Re: user-guided speculative precomputation? (my wacky ia64 idea)
From: Jim Wilson @ 2004-04-08 0:39 UTC (permalink / raw)
To: Duraid Madina; +Cc: gcc
On Wed, 2004-04-07 at 17:09, Duraid Madina wrote:
> On that note, does GCC *ever* emit ld8.a / chk.a pairs? This would be a
> very nice addition for yucky pointer code, and isn't the can of worms
> for the compiler that ld.s/chk.s is.
No.
Cygnus did try to implement some advanced/speculative load support about
4-5 years ago, but we couldn't get a reliable performance increase. I
think part of the problem was that the alias info in the RTL wasn't
quite good enough, and as a result we ended up generating too many
advanced loads, which hurt performance. Or maybe we just didn't find the
right set of heuristics to use. It would be worthwhile trying again some
day, as there have been infrastructure improvements since then.
Unfortunately, there isn't anyone doing serious IA-64 gcc work at this
time.
--
Jim Wilson, GNU Tools Support, http://www.SpecifixInc.com