From: "Ian Bolton"
Subject: Understanding Scheduling
Date: Fri, 19 Mar 2010 16:12:00 -0000
Message-ID: <4D60B0700D1DB54A8C0C6E9BE69163700E08F38E@EXCHANGEVS.IceraSemi.local>

Hi folks!

I've moved on from register allocation (see the Understanding IRA thread)
and onto scheduling.  In particular, I am investigating the effectiveness
of the sched1 pass on our architecture, together with the associated
interblock-scheduling optimisation.

Let's start with sched1 ...

For our architecture at least, it seems Richard Earnshaw is right that
sched1 is generally bad when you are compiling with -Os, because it can
increase register pressure and cause extra spill/fill code when it moves
independent instructions in between dependent ones.  For example:

    LOAD c2,c1[0]
    LOAD c3,c1[1]
    ADD  c2,c2,c3    # depends on LOAD above it (might stall)
    LOAD c3,c1[2]
    ADD  c2,c2,c3    # depends on LOAD above it (might stall)
    LOAD c3,c1[3]
    ADD  c2,c2,c3    # depends on LOAD above it (might stall)
    LOAD c3,c1[4]
    ADD  c2,c2,c3    # depends on LOAD above it (might stall)

might become:

    LOAD c2,c1[0]
    LOAD c3,c1[1]
    LOAD c4,c1[2]    # independent of first two LOADs
    LOAD c5,c1[3]    # independent of first two LOADs
    ADD  c2,c2,c3    # not dependent on preceding two insns (avoids stall)
    LOAD c3,c1[4]
    ADD  c2,c2,c4    # not dependent on preceding three insns (avoids stall)
    ...

This is a nice effect if your LOAD instructions have a latency of 3, so it
should lead to performance increases, and indeed that is what I see for
some low-register-pressure Nullstone cases; turning sched1 off therefore
causes a regression on those cases.  However, this pipelining effect can
also increase register pressure to the point where caller-save registers
are needed and extra spill/fill code has to be generated.
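To make the trade-off concrete, here is roughly the shape of C source I
have in mind (an illustrative sketch of my own, not an actual Nullstone
test; the function name is made up):

    /* An unrolled sum like this gives the dependent LOAD/ADD chain shown
       above: each ADD waits on the LOAD just before it unless sched1
       hoists the loads, which hides the load latency but keeps more
       values live at once.  */
    int
    sum5 (const int *c1)        /* c1 holds the array base, as above */
    {
      int sum;
      sum  = c1[0];             /* LOAD c2,c1[0]               */
      sum += c1[1];             /* LOAD c3,c1[1]; ADD c2,c2,c3 */
      sum += c1[2];
      sum += c1[3];
      sum += c1[4];
      return sum;
    }

Scale that pattern up (more elements, or values that stay live for later
use) and the hoisted loads start costing you caller-save registers.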
That increase in register pressure is exactly what happens for some other
Nullstone cases, and so it is good to have sched1 turned off for them!
It's therefore looking like some kind of clever hybrid is required.

I mention all this because I was wondering which other architectures have
turned off sched1 for -Os?  More importantly, I was wondering if anyone
else has considered creating some kind of clever hybrid that only uses
sched1 when it will increase performance without increasing register
pressure?  Or perhaps I could make a heuristic based on the balanced-ness
of the tree?  (I see sched1 does a lot better if the tree is balanced,
since it has more options to play with.)

Now onto interblock-scheduling ...

As we all know, you can't have interblock-scheduling enabled unless you
use the sched1 pass, so if sched1 is off then interblock-scheduling is
irrelevant.  For now, let's assume we are going to make some clever
hybrid that allows sched1 when we think it will increase performance at
-Os, and that we are going to keep sched1 on for -O2 and -O3.

As I understand it, interblock-scheduling enlarges the scope of sched1,
so that independent insns from a completely different block can be
inserted in between dependent insns in this block.  As well as
potentially amortizing stalls on high-latency insns, we also get the
chance to do "meatier" work in the destination block and leave less to do
in the source block.  I don't know whether this is a deliberate effect of
interblock-scheduling or just a happy side-effect.

Anyway, the reason I mention interblock-scheduling is that I see it doing
seemingly intelligent moves, but then the later BB-reorder pass juggles
blocks around such that we end up with extra code inside hot loops!  I
assume this is because the scheduler and BB-reorder are largely ignorant
of each other, so good intentions on the part of the former can be
scuppered by the latter.  I was wondering if anyone else has witnessed
this madness on their architecture?  Maybe it is a bug in BB-reorder?  Or
maybe BB-reorder should only be enabled when profiling information (e.g.
from gcov) is available?  Or maybe it is not a high-priority thing for
anyone to think about because no one uses interblock-scheduling?

If anyone can shed some light on the above, I'd greatly appreciate it.
For now, I will continue my experiments with selective enabling of sched1
for -Os.

Best regards,
Ian
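P.S. In case it is useful to anyone wanting to repeat the experiment: from
the command line, -fno-schedule-insns turns sched1 off (sched2 is
controlled separately by -fschedule-insns2).  Inside a port, the starting
point I have in mind is something along these lines - only a sketch, the
function name is made up, and a real hybrid would want a register-pressure
estimate rather than this blanket test:

    /* Sketch only: in the backend's OPTIMIZATION_OPTIONS/override-options
       handling, clear flag_schedule_insns (sched1) when optimizing for
       size; flag_schedule_insns_after_reload (sched2) is left alone.  */
    static void
    myport_optimization_options (int level ATTRIBUTE_UNUSED, int size)
    {
      if (size)
        flag_schedule_insns = 0;   /* no sched1 at -Os */
    }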