From: "Richard Guenther"
To: "Bingfeng Mei"
Cc: gcc@gcc.gnu.org
Subject: Re: Inefficient loop unrolling.
Date: Wed, 02 Jul 2008 12:08:00 -0000
Message-ID: <84fc9c000807020459w32ff0938p21d3397c8cc90805@mail.gmail.com>
In-Reply-To: <2E073B3ABB3F664DBA1D1C4D5FB47EF40E72D8B3@NT-IRVA-0752.brcm.ad.broadcom.com>

On Wed, Jul 2, 2008 at 1:13 PM, Bingfeng Mei wrote:
> Hello,
> I am looking at GCC's loop unrolling and find it quite inefficient
> compared with a manually unrolled loop, even for a very simple loop.
> The following are a simple loop and its manually unrolled version. I
> didn't apply any tricks to the manually unrolled one; it is an exact
> replication of the original loop body. I expected that with
> -funroll-loops the first version would produce code of similar quality
> to the second one. However, compiled for the ARM target with mainline
> GCC, the two functions produce very different results.
>
> The GCC-unrolled version mainly suffers from two issues. First, the
> load/store offsets are registers, so extra ADD instructions are needed
> to increment the offsets on each iteration. In contrast, the manually
> unrolled code uses immediate offsets efficiently and needs only one
> ADD to adjust the base register at the end. Second, the alias
> (dependence) analysis is overly conservative. The LOAD instruction of
> the next unrolled iteration cannot be moved past the previous STORE
> instruction even though they clearly do not alias. I suspect the
> alias analysis failure is related to the first issue, the handling of
> base and offset addresses. The .sched2 file shows that the first loop
> body requires 57 cycles whereas the second one takes 50 cycles on arm9
> (56 cycles vs. 34 cycles on XScale). It becomes even worse for our
> VLIW port because of the longer latency of MUL and load instructions
> and the inability to fill all slots (120 cycles vs. 20 cycles).
>
> From analyzing the compilation phases, I believe that if loop
> unrolling happened at the tree level, or if we had an optimizing pass
> like "ivopts" after loop unrolling at the RTL level, GCC could produce
> far more efficient unrolled code. The "ivopts" pass really does a
> wonderful job of optimizing induction variables. Strangely, I found
> some unrolling functions at the tree level, but there is no
> independent tree-level loop unrolling pass except "cunroll", which
> does complete unrolling. What prevents such a tree-level unrolling
> pass? Or are there any suggestions for improving the existing
> RTL-level unrolling? Thanks in advance.

On the tree level only complete unrolling is done. The reason for this
was the difficulty (or our unwillingness) to properly tune this for the
target (loop unrolling is not a generally profitable optimization,
except for complete unrolling of small loops).

I would suggest also looking into doing partial unrolling on the tree
level.

Richard.
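
The two functions being compared are not reproduced in this reply, so the following is only a minimal sketch of the kind of pair under discussion, not the original test case. The kernel (a multiply by a constant), the unroll factor of 4, and all names (scale_rolled, scale_unrolled4, check_equivalent) are assumptions made for illustration. The hand-unrolled variant addresses each copy of the body at a constant offset from the current pointers and bumps the pointers once per unrolled iteration, which is the shape that may let a compiler use immediate-offset loads and stores with a single base adjustment.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled form: one load, one multiply, one store per iteration.
   Hypothetical kernel; the loop from the original message is not shown here. */
void scale_rolled(int *dst, const int *src, int k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* Hand-unrolled by 4 (an assumed factor): the four copies of the body use
   constant offsets 0..3, and each pointer is adjusted once per group. */
void scale_unrolled4(int *dst, const int *src, int k, size_t n)
{
    size_t groups = n / 4;      /* number of full groups of four elements */
    size_t rest = n % 4;        /* leftover elements handled one at a time */

    while (groups-- > 0) {
        dst[0] = src[0] * k;
        dst[1] = src[1] * k;
        dst[2] = src[2] * k;
        dst[3] = src[3] * k;
        src += 4;               /* single base adjustment per group */
        dst += 4;
    }
    while (rest-- > 0)
        *dst++ = *src++ * k;
}

/* Checks that both forms compute the same result for up to 16 elements. */
void check_equivalent(const int *src, int k, size_t n)
{
    int a[16] = {0}, b[16] = {0};

    assert(n <= 16);
    scale_rolled(a, src, k, n);
    scale_unrolled4(b, src, k, n);
    for (size_t i = 0; i < n; i++)
        assert(a[i] == b[i]);
}
```

Compiling the rolled form with -O2 -funroll-loops and the hand-unrolled form with plain -O2, then comparing the assembly (or the .sched2 dump mentioned above), is one way to see how the addressing differs on a given target. Whether the difference shows up depends on the GCC version and the target.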