From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-return-65804-listarch-gcc=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 22885 invoked by alias); 6 Jan 2003 19:47:59 -0000
Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Archive: <http://gcc.gnu.org/ml/gcc/>
List-Post: <mailto:gcc@gcc.gnu.org>
List-Help: <http://gcc.gnu.org/ml/>
Sender: gcc-owner@gcc.gnu.org
Received: (qmail 22818 invoked from network); 6 Jan 2003 19:47:55 -0000
Received: from unknown (HELO emf.net) (205.149.0.20)
  by sources.redhat.com with SMTP; 6 Jan 2003 19:47:55 -0000
Received: (from lord@localhost) by emf.net (K/K) id EAA22286; Sun, 5 Jan 2003 04:24:22 -0800 (PST)
Date: Mon, 06 Jan 2003 19:50:00 -0000
From: Tom Lord <lord@emf.net>
Message-Id: <200301051224.EAA22286@emf.net>
To: dewar@gnat.com
CC: denisc@overta.ru, dewar@gnat.com, ja_walker@earthlink.net, gcc@gcc.gnu.org
In-reply-to: <20030105113840.BF53CF28C4@nile.gnat.com> (dewar@gnat.com)
Subject: Re: An unusual Performance approach using Synthetic registers
References:  <20030105113840.BF53CF28C4@nile.gnat.com>
X-SW-Source: 2003-01/txt/msg00285.txt.bz2


       dewar:

	This is a bit of an odd statement. In practice on a machine
	like the x86, the current stack frame will typically be
	resident in L1 cache, and that's where the register allocator
	spills to. What some of us still don't see is the difference
	in final resulting code between your "synthetic registers" and
	normal spill locations from the register allocator.


Register spills clearly don't equal synthetic registers.

Presumably, the number of locations dedicated to register spills never
exceeds (approximately) the maximum number of simultaneously live
_intermediate_ values minus the number of general purpose registers.
Any non-intermediate value (i.e., one that has a main memory
location), rather than being spilled, will be written to its location.
If that value is later re-used, it will be retrieved from memory.
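
As a rough sketch of that bound (my own illustration, not anything
taken from GCC's allocator; the function name and the example numbers
are hypothetical):

```c
#include <assert.h>

/* Hypothetical helper: approximate upper bound on the number of spill
 * slots, following the estimate above: the maximum number of
 * simultaneously live intermediate values minus the number of general
 * purpose registers, and never less than zero. */
static int spill_slots_needed(int max_live_intermediates, int num_gprs)
{
    assert(max_live_intermediates >= 0 && num_gprs >= 0);
    int excess = max_live_intermediates - num_gprs;
    return excess > 0 ? excess : 0;
}
```

So a function with 11 simultaneously live intermediates on a machine
with 6 usable registers would need about 5 spill slots, and one with 4
live intermediates would need none.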

The number of synthetic registers can be much larger than the number
of simultaneously live intermediate values.

So, with synthetic registers, some values that are not intermediates 
can be retained (in synthetic registers).  Without synthetic
registers, the next time those values are used, they have to be 
fetched from (non-special) memory.

In other words, with synthregs, the CPU can ship some value off to
memory and not care how long it takes to get there or to get back from
there -- because it also ships it off to the synthreg, which it
hypothetically has faster access to.

In practice, that means that synthregs will store some values in
memory twice: once in the location the program text says they go in;
again in the synthetic register.  If the synthetic register is indeed
cache-favored, maybe there's a performance win there -- and if so, a
register allocator is the right algorithm to decide which values to
keep duplicated in synthetic registers (so the proposed implementation
strategy is sensible).
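
To make the "stored twice" idea concrete, here is a minimal sketch in
plain C.  Everything in it is an assumption for illustration: the
synthreg array, its size, and the helper names are hypothetical, and a
real implementation would live in the compiler's generated code rather
than in library calls.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SYNTHREGS 32            /* assumed size, purely illustrative */

/* Hypothetical synthetic register file: a small, fixed block of
 * memory that is hoped to stay cache-resident. */
static long synthreg[NUM_SYNTHREGS];

/* Store a value both in its home location (where the program text
 * says it goes) and in a synthetic register. */
static void store_shadowed(long *home, size_t sr, long value)
{
    assert(sr < NUM_SYNTHREGS);
    *home = value;                  /* ordinary memory write */
    synthreg[sr] = value;           /* duplicate kept in the synthreg */
}

/* Later uses read the duplicate instead of the home location. */
static long load_shadowed(size_t sr)
{
    assert(sr < NUM_SYNTHREGS);
    return synthreg[sr];
}
```

The sketch ignores aliasing: if the home location is written through
some other pointer, the duplicate goes stale, so a compiler could only
keep such a copy where it can prove that doesn't happen.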

(Another weird interaction involves intermediate values that can be
recalculated instead of stored -- I don't know if GCC ever makes that
trade-off -- but if it does, it needs to be tuned for synthregs.)

So, does that hypothesis (that synthreg access is faster than general
memory access) hold?  Quite possibly.  For example, a re-used synthreg
inherits cache-presence (at all levels, not just L1) from the previous
uses.  Synthregs may win for some apps for more than just L1 reasons.

This brings in new alignment issues, too.  If you can, you might want
to make sure that your allocator locates its metadata where it will
cache-collide with the synthregs, to help push allocated memory out of
those locations (presuming here that allocator meta-data is relatively
infrequently accessed).  It's probably not all that hard to do this
"by accident".  Just in general: do things to protect the
cache-presence of the synthregs.
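
A sketch of how one might check whether two addresses cache-collide,
assuming a simple cache whose set index is taken from the address just
above the line offset.  The function names and the geometry in the
example (64-byte lines, 64 sets) are hypothetical; real caches may be
physically indexed, so virtual addresses would only be a guide.

```c
#include <assert.h>
#include <stdint.h>

/* Set index an address maps to, for an assumed cache with
 * `line_bytes`-byte lines and `num_sets` sets. */
static unsigned cache_set_index(uintptr_t addr, unsigned line_bytes,
                                unsigned num_sets)
{
    assert(line_bytes > 0 && num_sets > 0);
    return (unsigned)((addr / line_bytes) % num_sets);
}

/* Two addresses compete for the same cache set when their set
 * indices are equal. */
static int cache_collide(uintptr_t a, uintptr_t b, unsigned line_bytes,
                         unsigned num_sets)
{
    return cache_set_index(a, line_bytes, num_sets)
        == cache_set_index(b, line_bytes, num_sets);
}
```

With 64-byte lines and 64 sets, addresses 0x1000 and 0x2000 land in
the same set, while 0x1000 and 0x1040 do not.  An allocator could use
a test like this to decide where its metadata goes relative to the
synthregs.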

Synthregs might eventually lead to some hardware advances: give
synthregs with absolute locations cache preference.  Or, if synthregs
are on the stack, give locations near the frame pointer cache
preference (or is that done already?).

I'd therefore guess it will be a very system-specific optimization --
but that it will win often enough to be useful.  And given what I
understand about trends in architecture, the cases in which it will
win will sharply increase over time.

No?

-t

p.s.: some arch-related thinking about non-disruptive ways to improve
      gcc's revision control practices:

      http://lists.fifthvision.net/pipermail/arch-users/2003-January/001856.html

      and some of the follow-ups.   It's a pretty "noisy" list,
      though.