From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 07 Jan 2003 22:53:00 -0000
From:
To: Marcel Cox
Cc: gcc@gcc.gnu.org
Subject: Re: An unusual Performance approach using Synthetic registers
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-SW-Source: 2003-01/txt/msg00384.txt.bz2

On Tue, 7 Jan 2003, Marcel Cox wrote:

(stuff deleted for brevity) ...

> I think the speed gain is achieved for the following reasons:
>
> 1) The most active variables are kept close together in memory.  They
> only occupy one or a few cache lines.  Normally, one would expect the
> stack frame to always be in L1 cache.  However, especially when
> traversing data structures that are bigger than L1 cache, you can expect
> the cache lines holding the stack frame to be regularly replaced by data,
> and having fewer cache lines to reload for local variables will certainly
> give a speed advantage.

The stack reordering pass posted by Sanjiv Gupta can do this; it was
posted to gcc-patches a few days ago.

What you're describing is actually bad on the Pentium, and probably on
subsequent implementations as well.  The Pentium can dual-issue loads
as long as they reference separate cache ways.
So, manually sorting the stack so that contiguous accesses are localized
increases the probability of two loads hitting the same cache way, and
thus the probability of single-issuing.  You guys really should read the
processor manual instead of making incorrect assumptions about which
features would improve code quality.

> 2) Running the RA over the stack slots will cause the slots to be
> reused when the life ranges of variables do not overlap.  This further
> increases the compactness that already gives the benefit of point 1.
> Also, reducing overall stack usage will always be a small gain.

The stack slots are already reused.

> 3) The "compact" memory access pattern and the reuse of stack slots
> might increase the opportunity for the processor to use "shortcut"
> features in memory access.  For example, successive writes to the same
> memory location might be optimised to a single write, or read access to
> a memory location may be fast if there is still a pending write to the
> same location.

The first feature you're describing is called "dead write elimination"
and is already done in gcc.  The second feature you're describing is
called "write FIFO snooping" and is a hardware feature.

> 4) Finally, the frequently used stack slots are probably placed next
> to the frame pointer so that 1-byte offsets can always be used for the
> most frequent stack accesses, thus reducing code size and reducing
> pressure on the instruction cache.

Yes, many instruction sets have this limitation, including:

	o Hitachi SH
	o MIPS16
	o Thumb
	o HP PA-RISC

etc.

> However, I think that "synthetic registers" are not needed to get this
> gain.  They are just a "trick" used to make the RA optimise the stack.
> However, it is probably possible to have a separate stack optimisation
> pass do the same thing, and in a more flexible way, as the number of
> important variables can adjust dynamically and does not have to be a
> fixed value like 32.

Yes, the stack reordering pass does this.
Only the algorithm for deciding stack slot placement needs tweaking.

> Such a stack optimisation pass should do the following:
>
> - analyse the life ranges of variables and temporaries and use this
>   information to reuse stack slots, rather than always assigning an
>   individual stack slot to each individual variable or temporary
>
> - analyse the usage of local variables and use this information to
>   sort the local variables such that the most frequently used variables
>   are nearest to the frame pointer (or to the stack pointer if you work
>   without a frame pointer).  This will guarantee that 1-byte addressing
>   can be used for the most frequent variables, and it will also put the
>   most frequently used variables together, thus optimising the cache
>   access pattern.

You guys need to look at Sanjiv's stack reordering patch.  You're
introducing an unnecessary artificial construct to indirectly manipulate
the placement of items on the stack.  The capability to directly
manipulate the placement of items on the stack already exists.

Toshi