From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-return-64563-listarch-gcc=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 25104 invoked by alias); 13 Dec 2002 17:38:04 -0000
Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Archive: <http://gcc.gnu.org/ml/gcc/>
List-Post: <mailto:gcc@gcc.gnu.org>
List-Help: <http://gcc.gnu.org/ml/>
Sender: gcc-owner@gcc.gnu.org
Received: (qmail 25097 invoked from network); 13 Dec 2002 17:38:03 -0000
Received: from unknown (HELO mail.cdt.org) (206.112.85.61)
  by sources.redhat.com with SMTP; 13 Dec 2002 17:38:03 -0000
Received: from dberlin.org (h-69-3-5-6.MCLNVA23.covad.net [69.3.5.6])
	by mail.cdt.org (Postfix) with ESMTP id 83E33490100
	for <gcc@gcc.gnu.org>; Fri, 13 Dec 2002 12:35:29 -0500 (EST)
Received: from [192.168.1.102] (account dberlin HELO dberlin.org)
  by dberlin.org (CommuniGate Pro SMTP 4.0.2)
  with ESMTP-TLS id 1811038 for gcc@gcc.gnu.org; Fri, 13 Dec 2002 12:37:59 -0500
Date: Fri, 13 Dec 2002 09:58:00 -0000
Mime-Version: 1.0 (Apple Message framework v551)
Content-Type: text/plain; charset=US-ASCII; format=flowed
Subject: Timings for copying collection vs non-copying collection
From: Daniel Berlin <dberlin@dberlin.org>
To: gcc@gcc.gnu.org
Content-Transfer-Encoding: 7bit
Message-Id: <4B49488A-0EC1-11D7-B1DF-000393575BCC@dberlin.org>
X-SW-Source: 2002-12/txt/msg00737.txt.bz2

Okay, after Geoff's suggestion to try the pch-branch, I rewrote the
copying collector (much easier to do on the pch-branch, *thanks*
Geoff), and have some first timings.
A few notes:
1. Ignore GC times; this is a non-optimized copying collector.
2. These times are consistent to a few *tenths* (few = 2 max) of a 
second (for each pass) over multiple runs.  So pass times < 1 second 
are probably too noisy to be useful.
3. A bootstrap of another tree was running in the background during
this run, so ignore the wall clock time (the likely reason for 2, BTW).
4.  I'm just pasting one run as representative. The wall clock times 
obviously differed for each run.
5. The cc1 binaries in question are not compiled with optimization.
6. Literally the only difference between the two cc1 binaries is that
one is linked with ggc-page and one with ggc-copy (i.e. no other files
are recompiled; the exact same object files are linked in).
7. The assembler output is the same for copying collection and 
non-copying collection.
8. GCC's memory usage actually shrinks after garbage collection with
the copying collector, so it's definitely doing its job.
9. Heap size for the copying collector is fixed at 64 MB.
10. This is a P4 1.7 GHz computer with 768 MB of memory.

With ggc-page, compiling 20001221-1.c:


garbage collection    :   0.45 ( 0%) usr   0.01 ( 2%) sys   0.69 ( 0%) wall
cfg construction      :   0.31 ( 0%) usr   0.01 ( 2%) sys   0.84 ( 0%) wall
cfg cleanup           :   5.39 ( 5%) usr   0.01 ( 2%) sys  10.76 ( 6%) wall
trivially dead code   :   0.19 ( 0%) usr   0.00 ( 0%) sys   0.18 ( 0%) wall
life analysis         :   1.40 ( 1%) usr   0.01 ( 2%) sys   2.70 ( 1%) wall
life info update      :   0.61 ( 1%) usr   0.00 ( 0%) sys   1.21 ( 1%) wall
preprocessing         :   0.15 ( 0%) usr   0.11 (17%) sys   0.41 ( 0%) wall
lexical analysis      :   0.30 ( 0%) usr   0.23 (35%) sys   0.92 ( 0%) wall
parser                :   0.72 ( 1%) usr   0.13 (20%) sys   1.68 ( 1%) wall
expand                :   0.18 ( 0%) usr   0.00 ( 0%) sys   0.33 ( 0%) wall
integration           :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall
jump                  :   0.86 ( 1%) usr   0.04 ( 6%) sys   1.95 ( 1%) wall
CSE                   :   2.77 ( 3%) usr   0.00 ( 0%) sys   5.62 ( 3%) wall
global CSE            :   0.69 ( 1%) usr   0.08 (12%) sys   1.52 ( 1%) wall
loop analysis         :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall
CSE 2                 :   0.27 ( 0%) usr   0.00 ( 0%) sys   0.42 ( 0%) wall
branch prediction     :  26.96 (27%) usr   0.01 ( 2%) sys  53.45 (28%) wall
flow analysis         :   0.09 ( 0%) usr   0.00 ( 0%) sys   0.24 ( 0%) wall
combiner              :   0.14 ( 0%) usr   0.00 ( 0%) sys   0.29 ( 0%) wall
if-conversion         :  11.55 (12%) usr   0.00 ( 0%) sys  22.98 (12%) wall
regmove               :   0.07 ( 0%) usr   0.00 ( 0%) sys   0.07 ( 0%) wall
mode switching        :   0.16 ( 0%) usr   0.00 ( 0%) sys   0.31 ( 0%) wall
local alloc           :   0.22 ( 0%) usr   0.00 ( 0%) sys   0.52 ( 0%) wall
global alloc          :  19.84 (20%) usr   0.01 ( 2%) sys  37.17 (19%) wall
reload CSE regs       :   0.36 ( 0%) usr   0.00 ( 0%) sys   0.81 ( 0%) wall
flow 2                :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.27 ( 0%) wall
if-conversion 2       :   5.81 ( 6%) usr   0.00 ( 0%) sys  10.38 ( 5%) wall
peephole 2            :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall
rename registers      :   0.14 ( 0%) usr   0.00 ( 0%) sys   0.29 ( 0%) wall
scheduling 2          :  18.43 (19%) usr   0.01 ( 2%) sys  34.19 (18%) wall
reorder blocks        :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall
shorten branches      :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall
final                 :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall
rest of compilation   :   0.26 ( 0%) usr   0.00 ( 0%) sys   0.71 ( 0%) wall
TOTAL                 :  98.78             0.66           191.35

Total time: ~99 seconds
GC time: ~.5 seconds
So ~98.5 seconds excluding GC time.

With ggc-copy:

garbage collection    :   1.47 ( 2%) usr   0.05 ( 7%) sys   2.50 ( 1%) wall
cfg construction      :   0.33 ( 0%) usr   0.01 ( 1%) sys   0.50 ( 0%) wall
cfg cleanup           :   5.44 ( 6%) usr   0.02 ( 3%) sys   9.06 ( 5%) wall
trivially dead code   :   0.16 ( 0%) usr   0.00 ( 0%) sys   0.30 ( 0%) wall
life analysis         :   1.51 ( 2%) usr   0.02 ( 3%) sys   3.12 ( 2%) wall
life info update      :   0.58 ( 1%) usr   0.00 ( 0%) sys   1.03 ( 1%) wall
preprocessing         :   0.11 ( 0%) usr   0.07 ( 9%) sys   0.18 ( 0%) wall
lexical analysis      :   0.42 ( 0%) usr   0.20 (26%) sys   1.34 ( 1%) wall
parser                :   0.65 ( 1%) usr   0.10 (13%) sys   1.14 ( 1%) wall
expand                :   0.13 ( 0%) usr   0.02 ( 3%) sys   0.24 ( 0%) wall
integration           :   0.07 ( 0%) usr   0.00 ( 0%) sys   0.22 ( 0%) wall
jump                  :   0.96 ( 1%) usr   0.04 ( 5%) sys   1.71 ( 1%) wall
CSE                   :   2.40 ( 3%) usr   0.03 ( 4%) sys   4.64 ( 3%) wall
global CSE            :   0.68 ( 1%) usr   0.09 (12%) sys   1.59 ( 1%) wall
loop analysis         :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall
CSE 2                 :   0.23 ( 0%) usr   0.00 ( 0%) sys   0.53 ( 0%) wall
branch prediction     :  24.16 (26%) usr   0.05 ( 7%) sys  46.38 (27%) wall
flow analysis         :   0.09 ( 0%) usr   0.00 ( 0%) sys   0.09 ( 0%) wall
combiner              :   0.11 ( 0%) usr   0.00 ( 0%) sys   0.32 ( 0%) wall
if-conversion         :  11.68 (13%) usr   0.00 ( 0%) sys  22.72 (13%) wall
regmove               :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.21 ( 0%) wall
mode switching        :   0.15 ( 0%) usr   0.00 ( 0%) sys   0.30 ( 0%) wall
local alloc           :   0.25 ( 0%) usr   0.00 ( 0%) sys   0.55 ( 0%) wall
global alloc          :  12.65 (14%) usr   0.03 ( 4%) sys  24.31 (14%) wall
reload CSE regs       :   0.33 ( 0%) usr   0.00 ( 0%) sys   0.52 ( 0%) wall
flow 2                :   0.09 ( 0%) usr   0.00 ( 0%) sys   0.09 ( 0%) wall
if-conversion 2       :   5.85 ( 6%) usr   0.00 ( 0%) sys  10.73 ( 6%) wall
peephole 2            :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall
rename registers      :   0.10 ( 0%) usr   0.01 ( 1%) sys   0.26 ( 0%) wall
scheduling 2          :  20.56 (22%) usr   0.01 ( 1%) sys  37.06 (21%) wall
reorder blocks        :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall
shorten branches      :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall
final                 :   0.04 ( 0%) usr   0.01 ( 1%) sys   0.05 ( 0%) wall
rest of compilation   :   0.24 ( 0%) usr   0.00 ( 0%) sys   0.77 ( 0%) wall
TOTAL                 :  91.67             0.76           172.63

Total time: ~91.5 seconds
GC time: ~1.5 seconds
So ~90 seconds excluding GC time.

Roughly a 7-8% difference in overall speed (98.78s vs 91.67s usr).
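
The overall difference can be checked directly from the TOTAL and
garbage collection rows of the two reports. This is only a small
arithmetic sketch; percent_faster is a hypothetical helper name, and the
four numbers are the usr figures copied from the reports above.

```python
def percent_faster(baseline, candidate):
    """Return how much less time `candidate` takes than `baseline`, in percent."""
    return 100.0 * (baseline - candidate) / baseline

# usr TOTAL and usr garbage collection times from the two reports
page_total, page_gc = 98.78, 0.45   # ggc-page
copy_total, copy_gc = 91.67, 1.47   # ggc-copy

overall = percent_faster(page_total, copy_total)
without_gc = percent_faster(page_total - page_gc, copy_total - copy_gc)

print(f"overall: {overall:.1f}%  excluding GC: {without_gc:.1f}%")
```

Excluding GC time makes the copying collector look slightly better,
since its (non-optimized) collection phase is the slower of the two.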
Memory footprint when not doing collection is obviously smaller for the 
copying collector.

Some observations:

Global alloc takes about a third less time with the copying collector
(12.65s vs 19.84s usr). This surprised me, but it's consistent over
multiple runs.

Branch prediction is consistently 2 seconds faster (~10%).

Locality for long lived objects isn't as good as it could be, since we 
aren't generational.  This is likely to account for the scheduling 2 
time increase.
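
The per-pass differences behind these observations can be pulled out of
the two reports mechanically. The following is a minimal sketch,
assuming each report row has the "name : N.NN (P%) usr ..." shape shown
above; usr_times and deltas are hypothetical helper names, and only four
rows from each report are copied in to keep it short.

```python
import re

# Matches the pass name and its usr seconds; the TOTAL row has no
# "(P%) usr" part and is therefore skipped.
ROW = re.compile(r"^(?P<name>.+?)\s*:\s*(?P<usr>\d+\.\d+) \(\s*\d+%\) usr")

def usr_times(report):
    """Map pass name -> usr seconds for every row of a time report."""
    times = {}
    for line in report.splitlines():
        m = ROW.match(line)
        if m:
            times[m.group("name")] = float(m.group("usr"))
    return times

def deltas(page, copy):
    """Return (pass, copy - page) pairs, biggest improvement first."""
    a, b = usr_times(page), usr_times(copy)
    rows = [(name, b[name] - a[name]) for name in a if name in b]
    return sorted(rows, key=lambda row: row[1])

PAGE = """\
garbage collection    :   0.45 ( 0%) usr   0.01 ( 2%) sys   0.69 ( 0%) wall
branch prediction     :  26.96 (27%) usr   0.01 ( 2%) sys  53.45 (28%) wall
global alloc          :  19.84 (20%) usr   0.01 ( 2%) sys  37.17 (19%) wall
scheduling 2          :  18.43 (19%) usr   0.01 ( 2%) sys  34.19 (18%) wall
"""

COPY = """\
garbage collection    :   1.47 ( 2%) usr   0.05 ( 7%) sys   2.50 ( 1%) wall
branch prediction     :  24.16 (26%) usr   0.05 ( 7%) sys  46.38 (27%) wall
global alloc          :  12.65 (14%) usr   0.03 ( 4%) sys  24.31 (14%) wall
scheduling 2          :  20.56 (22%) usr   0.01 ( 1%) sys  37.06 (21%) wall
"""

for name, delta in deltas(PAGE, COPY):
    print(f"{name:20s} {delta:+6.2f}s")
```

Negative numbers are passes that got faster with ggc-copy. Given the
noise mentioned in note 2, differences of a few tenths of a second
should not be read into.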

Things that touch a lot of RTL seem to be doing better with the copying
collector.
Whatever the memory access pattern in global alloc is, it is likely
causing horrendous numbers of cache misses with ggc-page, due to
fragmentation or poor locality (no idea which). This is a guess; I'll
run the VTune beta for Linux and see if I'm right.
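
To make the locality argument concrete, here is a toy sketch of a
Cheney-style semispace copy. It is not the ggc-copy code, just an
illustration under simplifying assumptions: the heap is modelled as a
Python list, an "address" is a list index, each object is a
(payload, refs) tuple, and collect is a hypothetical name.

```python
def collect(from_space, roots):
    """Copy everything reachable from `roots` into a fresh to-space.

    Returns (to_space, new_roots). Unreachable objects are simply
    never copied, so the survivors end up packed together.
    """
    to_space = []
    forward = {}  # old address -> new address (forwarding pointers)

    def copy(addr):
        # Copy an object once; later visits reuse the forwarding entry.
        if addr not in forward:
            forward[addr] = len(to_space)
            payload, refs = from_space[addr]
            to_space.append((payload, list(refs)))  # refs are still old
        return forward[addr]

    new_roots = [copy(r) for r in roots]

    # Cheney scan: walk to-space in order, fixing up references and
    # copying whatever they point at (breadth-first).
    scan = 0
    while scan < len(to_space):
        payload, refs = to_space[scan]
        to_space[scan] = (payload, [copy(r) for r in refs])
        scan += 1
    return to_space, new_roots

# Live objects a -> b -> c -> a are scattered between dead ones.
heap = [
    ("a", [3]),
    ("dead1", []),
    ("dead2", [1]),
    ("b", [5]),
    ("dead3", []),
    ("c", [0]),
]
new_heap, new_roots = collect(heap, roots=[0])
print(new_heap, new_roots)
```

After the copy, the three live objects sit at adjacent addresses and the
dead ones are gone, which is the kind of compaction that could reduce
cache misses for passes that walk a lot of linked data. Whether that is
what actually happens in global alloc is still a guess.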

I haven't yet done C++ timings to see if it speeds up the parser/expand 
passes.

All in all it looks, at the start, like it might be worth it to go to
copying collection.
But these are just first timings, as I said.
The numbers look good enough that I'll keep implementing.

Would people like me to post the patch against the pch branch for 
copying collection so they can try it out themselves?
--Dan