From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 07 Jan 2003 22:53:00 -0000
From:
To: Marcel Cox
Cc: gcc@gcc.gnu.org
Subject: Re: An unusual Performance approach using Synthetic registers
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-SW-Source: 2003-01/txt/msg00384.txt.bz2

On Tue, 7 Jan 2003, Marcel Cox wrote:

(stuff deleted for brevity) ...

> I think the speed gain is achieved for the following reasons:
>
> 1) The most active variables are kept close together in memory.  They
> only occupy one or a few cache lines.  Normally, one would expect the
> stack frame to always be in L1 cache.  However, especially when
> traversing data structures that are bigger than L1 cache, you can expect
> the cache lines holding the stack frame to be regularly replaced by data,
> and having fewer cache lines to reload for local variables will certainly
> give a speed advantage.

The stack reordering pass posted by Sanjiv Gupta can do this; it was
posted to gcc-patches a few days ago.

What you're describing is actually bad on the Pentium, and probably on
subsequent implementations as well.  The Pentium can dual-issue loads
as long as they reference separate cache ways.
So, manually sorting the stack so that contiguous accesses are localized
increases the probability of two loads hitting the same cache way, and
thus the probability of single-issuing.  You guys really should read the
processor manual instead of making incorrect assumptions about which
features would improve code quality.

> 2) Running the RA over the stack slots will cause the slots to be
> reused when the life ranges of variables do not overlap.  This further
> increases the compactness that already gives the benefit of point 1.
> Also, reducing overall stack usage will always be a small gain.

The stack slots are already reused.

> 3) The "compact" memory access pattern and the reuse of stack slots
> might increase the opportunity for the processor to use "shortcut"
> features in memory access.  For example, successive writes to the same
> memory location might be optimised to a single write, or read access to
> a memory location may be fast if there is still a pending write to the
> same location.

The first feature you're describing is called "dead write elimination"
and is already done in gcc.  The second feature you're describing is
called "write FIFO snooping" and is a hardware feature.

> 4) Finally, the frequently used stack slots are probably placed next
> to the frame pointer so that 1-byte offsets can always be used for the
> most frequent stack accesses, thus reducing code size and reducing
> pressure on the instruction cache.

Yes, many instruction sets have this limitation, including:

	o Hitachi SH
	o MIPS16
	o Thumb
	o HP PA-RISC

etc.

> However, I think that "synthetic registers" are not needed to get this
> gain.  They are just a "trick" used to make the RA optimise the stack.
> However, it is probably possible to have a separate stack optimisation
> pass do the same thing, and in a more flexible way, as the number of
> important variables can adjust dynamically and does not have to be a
> fixed value like 32.

Yes, the stack reordering pass does this.
Only the algorithm for deciding stack slot placement needs tweaking.

> Such a stack optimisation pass should do the following:
>
> - analyse the life ranges of variables and temporaries and use this
>   information to reuse stack slots, rather than always assigning an
>   individual stack slot to each individual variable or temporary
>
> - analyse the usage of local variables and use this information to
>   sort the local variables such that the most frequently used variables
>   are nearest to the frame pointer (or to the stack pointer if you work
>   without a frame pointer).  This will guarantee that 1-byte addressing
>   can be used for the most frequent variables, and it will also put the
>   most frequently used variables together, thus optimising the cache
>   access pattern.

You guys need to look at Sanjiv's stack reordering patch.  You're
introducing an unnecessary artificial construct to indirectly manipulate
the placement of items on the stack.  The capability to directly
manipulate the placement of items on the stack already exists.

Toshi