Date: Fri, 13 Dec 2002 09:58:00 -0000
From: Daniel Berlin
To: gcc@gcc.gnu.org
Subject: Timings for copying collection vs non-copying collection
Message-Id: <4B49488A-0EC1-11D7-B1DF-000393575BCC@dberlin.org>

Okay, after Geoff's suggestion to try the pch-branch, I rewrote the copying collector (much easier to do it on the pch-branch, *thanks* Geoff), and have some first timings.

A few notes:

1. Ignore GC times, this is a non-optimized copying collector.
2. These times are consistent to a few *tenths* (few = 2 max) of a second (for each pass) over multiple runs. So pass times < 1 second are probably too noisy to be useful.
3. There is a bootstrap of another tree running in the background for this run, so ignore the wall clock time (the likely reason for the noise in 2, BTW).
4. I'm just pasting one run as representative. The wall clock times obviously differed for each run.
5. The cc1s in question are not compiled with optimization.
6. Literally the only difference in cc1 between the two is that one is linked with ggc-page, one with ggc-copy (i.e. no other files are recompiled; they have the exact same object files being linked in).
7. The assembler output is the same for copying collection and non-copying collection.
8. GCC's memory usage actually shrinks after garbage collection with the copying collector, so it's definitely doing its job.
9. Heap size for the copying collector is fixed at 64 meg.
10. This is a P4 1.7GHz computer with 768 meg of memory.

With ggc-page, compiling 20001221-1.c:

garbage collection  :   0.45 ( 0%) usr   0.01 ( 2%) sys   0.69 ( 0%) wall
cfg construction    :   0.31 ( 0%) usr   0.01 ( 2%) sys   0.84 ( 0%) wall
cfg cleanup         :   5.39 ( 5%) usr   0.01 ( 2%) sys  10.76 ( 6%) wall
trivially dead code :   0.19 ( 0%) usr   0.00 ( 0%) sys   0.18 ( 0%) wall
life analysis       :   1.40 ( 1%) usr   0.01 ( 2%) sys   2.70 ( 1%) wall
life info update    :   0.61 ( 1%) usr   0.00 ( 0%) sys   1.21 ( 1%) wall
preprocessing       :   0.15 ( 0%) usr   0.11 (17%) sys   0.41 ( 0%) wall
lexical analysis    :   0.30 ( 0%) usr   0.23 (35%) sys   0.92 ( 0%) wall
parser              :   0.72 ( 1%) usr   0.13 (20%) sys   1.68 ( 1%) wall
expand              :   0.18 ( 0%) usr   0.00 ( 0%) sys   0.33 ( 0%) wall
integration         :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall
jump                :   0.86 ( 1%) usr   0.04 ( 6%) sys   1.95 ( 1%) wall
CSE                 :   2.77 ( 3%) usr   0.00 ( 0%) sys   5.62 ( 3%) wall
global CSE          :   0.69 ( 1%) usr   0.08 (12%) sys   1.52 ( 1%) wall
loop analysis       :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall
CSE 2               :   0.27 ( 0%) usr   0.00 ( 0%) sys   0.42 ( 0%) wall
branch prediction   :  26.96 (27%) usr   0.01 ( 2%) sys  53.45 (28%) wall
flow analysis       :   0.09 ( 0%) usr   0.00 ( 0%) sys   0.24 ( 0%) wall
combiner            :   0.14 ( 0%) usr   0.00 ( 0%) sys   0.29 ( 0%) wall
if-conversion       :  11.55 (12%) usr   0.00 ( 0%) sys  22.98 (12%) wall
regmove             :   0.07 ( 0%) usr   0.00 ( 0%) sys   0.07 ( 0%) wall
mode switching      :   0.16 ( 0%) usr   0.00 ( 0%) sys   0.31 ( 0%) wall
local alloc         :   0.22 ( 0%) usr   0.00 ( 0%) sys   0.52 ( 0%) wall
global alloc        :  19.84 (20%) usr   0.01 ( 2%) sys  37.17 (19%) wall
reload CSE regs     :   0.36 ( 0%) usr   0.00 ( 0%) sys   0.81 ( 0%) wall
flow 2              :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.27 ( 0%) wall
if-conversion 2     :   5.81 ( 6%) usr   0.00 ( 0%) sys  10.38 ( 5%) wall
peephole 2          :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall
rename registers    :   0.14 ( 0%) usr   0.00 ( 0%) sys   0.29 ( 0%) wall
scheduling 2        :  18.43 (19%) usr   0.01 ( 2%) sys  34.19 (18%) wall
reorder blocks      :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall
shorten branches    :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall
final               :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall
rest of compilation :   0.26 ( 0%) usr   0.00 ( 0%) sys   0.71 ( 0%) wall
TOTAL               :  98.78             0.66           191.35

Total time: ~99 seconds
GC time: ~.5 seconds
So ~98.5 seconds excluding GC time.

With ggc-copy:

garbage collection  :   1.47 ( 2%) usr   0.05 ( 7%) sys   2.50 ( 1%) wall
cfg construction    :   0.33 ( 0%) usr   0.01 ( 1%) sys   0.50 ( 0%) wall
cfg cleanup         :   5.44 ( 6%) usr   0.02 ( 3%) sys   9.06 ( 5%) wall
trivially dead code :   0.16 ( 0%) usr   0.00 ( 0%) sys   0.30 ( 0%) wall
life analysis       :   1.51 ( 2%) usr   0.02 ( 3%) sys   3.12 ( 2%) wall
life info update    :   0.58 ( 1%) usr   0.00 ( 0%) sys   1.03 ( 1%) wall
preprocessing       :   0.11 ( 0%) usr   0.07 ( 9%) sys   0.18 ( 0%) wall
lexical analysis    :   0.42 ( 0%) usr   0.20 (26%) sys   1.34 ( 1%) wall
parser              :   0.65 ( 1%) usr   0.10 (13%) sys   1.14 ( 1%) wall
expand              :   0.13 ( 0%) usr   0.02 ( 3%) sys   0.24 ( 0%) wall
integration         :   0.07 ( 0%) usr   0.00 ( 0%) sys   0.22 ( 0%) wall
jump                :   0.96 ( 1%) usr   0.04 ( 5%) sys   1.71 ( 1%) wall
CSE                 :   2.40 ( 3%) usr   0.03 ( 4%) sys   4.64 ( 3%) wall
global CSE          :   0.68 ( 1%) usr   0.09 (12%) sys   1.59 ( 1%) wall
loop analysis       :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall
CSE 2               :   0.23 ( 0%) usr   0.00 ( 0%) sys   0.53 ( 0%) wall
branch prediction   :  24.16 (26%) usr   0.05 ( 7%) sys  46.38 (27%) wall
flow analysis       :   0.09 ( 0%) usr   0.00 ( 0%) sys   0.09 ( 0%) wall
combiner            :   0.11 ( 0%) usr   0.00 ( 0%) sys   0.32 ( 0%) wall
if-conversion       :  11.68 (13%) usr   0.00 ( 0%) sys  22.72 (13%) wall
regmove             :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.21 ( 0%) wall
mode switching      :   0.15 ( 0%) usr   0.00 ( 0%) sys   0.30 ( 0%) wall
local alloc         :   0.25 ( 0%) usr   0.00 ( 0%) sys   0.55 ( 0%) wall
global alloc        :  12.65 (14%) usr   0.03 ( 4%) sys  24.31 (14%) wall
reload CSE regs     :   0.33 ( 0%) usr   0.00 ( 0%) sys   0.52 ( 0%) wall
flow 2              :   0.09 ( 0%) usr   0.00 ( 0%) sys   0.09 ( 0%) wall
if-conversion 2     :   5.85 ( 6%) usr   0.00 ( 0%) sys  10.73 ( 6%) wall
peephole 2          :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall
rename registers    :   0.10 ( 0%) usr   0.01 ( 1%) sys   0.26 ( 0%) wall
scheduling 2        :  20.56 (22%) usr   0.01 ( 1%) sys  37.06 (21%) wall
reorder blocks      :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall
shorten branches    :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall
final               :   0.04 ( 0%) usr   0.01 ( 1%) sys   0.05 ( 0%) wall
rest of compilation :   0.24 ( 0%) usr   0.00 ( 0%) sys   0.77 ( 0%) wall
TOTAL               :  91.67             0.76           172.63

Total time: ~91.5 seconds
GC time: ~1.5 seconds
So ~90 seconds excluding GC time.

That is a bit under a 10% difference in overall speed (~98.5 vs ~90 seconds of user time). Memory footprint when not doing collection is obviously smaller for the copying collector.

Some observations:

Global alloc takes roughly a third less time with the copying collector (12.65 vs 19.84 seconds). This surprised me, but it's consistent over multiple runs.

Branch prediction is consistently a couple of seconds faster (~10%).

Locality for long-lived objects isn't as good as it could be, since we aren't generational. This is likely to account for the scheduling 2 time increase.

Things that touch a lot of RTL seem to be doing better with the copying collector.

Whatever the memory pattern is in global alloc, it is likely causing horrendous numbers of cache misses for ggc-page, due to fragmentation or locality (no idea which). This is a guess; I'll run the vtune beta for Linux and see if I'm right.

I haven't yet done C++ timings to see if it speeds up the parser/expand passes.

All in all it looks, at the start, like it might be worth it to go to copying collection. But these are just first timings, as I said.
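
For anyone who wants to redo the arithmetic behind the comparison above, here is a minimal sketch in Python. It uses only user-time figures copied from the two tables; the dictionary and function names are made up for illustration and are not part of any GCC tooling.

```python
# User-time seconds copied from the two timing tables in this message.
GGC_PAGE = {
    "garbage collection": 0.45,
    "branch prediction": 26.96,
    "global alloc": 19.84,
    "scheduling 2": 18.43,
    "TOTAL": 98.78,
}
GGC_COPY = {
    "garbage collection": 1.47,
    "branch prediction": 24.16,
    "global alloc": 12.65,
    "scheduling 2": 20.56,
    "TOTAL": 91.67,
}


def non_gc_seconds(report):
    """Total user time with the garbage-collection row subtracted."""
    return report["TOTAL"] - report["garbage collection"]


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent (negative = faster)."""
    return 100.0 * (after - before) / before


def main():
    page = non_gc_seconds(GGC_PAGE)
    copy = non_gc_seconds(GGC_COPY)
    print(f"non-GC user time: {page:.2f}s vs {copy:.2f}s "
          f"({percent_change(page, copy):+.1f}%)")
    for name in ("branch prediction", "global alloc", "scheduling 2"):
        change = percent_change(GGC_PAGE[name], GGC_COPY[name])
        print(f"{name:<18}: {GGC_PAGE[name]:6.2f}s -> {GGC_COPY[name]:6.2f}s "
              f"({change:+.1f}%)")


if __name__ == "__main__":
    main()
```

With these figures the non-GC difference comes out at roughly 8%, global alloc at about 36% less, branch prediction about 10% less, and scheduling 2 about 12% more. Note 2 still applies: differences of a few tenths of a second per pass are within the noise.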
The numbers look good enough that I'll keep implementing.

Would people like me to post the patch against the pch branch for copying collection so they can try it out themselves?

--Dan
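
For readers who have not looked at copying collection before, the toy below shows why it ends up packing live objects next to each other, which is the locality effect guessed at in the observations. It is a Cheney-style semispace sketch in Python, not the ggc-copy code from the patch; the heap model (a list of dicts whose "refs" are integer addresses) and every name in it are assumptions made purely for illustration.

```python
def collect(from_space, roots):
    """Copy everything reachable from `roots` into a fresh to-space.

    Returns (to_space, new_roots). Addresses are list indices.
    """
    to_space = []
    forward = {}  # old address -> new address (forwarding table)

    def copy(addr):
        # Copy an object the first time it is seen; afterwards just
        # return its forwarding address.
        if addr not in forward:
            forward[addr] = len(to_space)
            obj = from_space[addr]
            to_space.append({"name": obj["name"], "refs": list(obj["refs"])})
        return forward[addr]

    new_roots = [copy(r) for r in roots]

    # Cheney scan: objects before `scan` have had their references
    # rewritten; objects at or after it still hold from-space addresses.
    scan = 0
    while scan < len(to_space):
        obj = to_space[scan]
        obj["refs"] = [copy(r) for r in obj["refs"]]
        scan += 1
    return to_space, new_roots


def demo_heap():
    """A fragmented heap: live objects A, B, C interleaved with garbage."""
    from_space = [
        {"name": "A", "refs": [3, 5]},
        {"name": "junk1", "refs": []},
        {"name": "junk2", "refs": [1]},
        {"name": "B", "refs": [5]},
        {"name": "junk3", "refs": []},
        {"name": "C", "refs": [0]},  # cycle back to A
    ]
    return from_space, [0]


if __name__ == "__main__":
    heap, roots = demo_heap()
    new_heap, new_roots = collect(heap, roots)
    for addr, obj in enumerate(new_heap):
        print(addr, obj["name"], obj["refs"])
```

After the collection the three live objects sit at adjacent addresses and the garbage is simply never copied, so the heap shrinks and related objects end up close together. A real collector works on raw memory and has to find pointers inside objects, which this sketch sidesteps entirely.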