From: Richard Sandiford
To: gcc-patches@gcc.gnu.org
Cc: zaks@il.ibm.com
Subject: [0/4] Make SMS schedule register moves
Date: Tue, 30 Aug 2011 12:45:00 -0000

I'm seeing several cases in which SMS's register move handling is
causing it to generate worse code than the normal schedulers on ARM
Cortex-A8.  The problem is that we first schedule the loop without
taking the moves into account, then emit the required moves immediately
before the initial definition.

A simple example is:

void
loop (unsigned char *__restrict q, unsigned char *__restrict p, int n)
{
  while (n > 0)
    {
      q[0] = (p[0] + p[1]) >> 1;
      q++;
      p += 2;
      n--;
    }
}

(taken from libav).  Compiled with:

  -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad
  -O2 -ftree-vectorize -fno-auto-inc-dec -fmodulo-sched
  -fmodulo-sched-allow-regmoves

on arm-linux-gnueabi-gcc (with current trunk), the scheduled loop has an
ii of 27, a stage count of 6, and requires 14 register moves, 12 of which
are vector moves.  Vector moves cannot be dual-issued with most other
vector instructions, and the scheduled loop only has one free "slot" for
a vector move, so even in the best case, this loop theoretically needs
27 + 12 - 1 cycles per iteration, significantly higher than the ii.
(It actually ends up much worse than that, because with so many live
vector values, we end up with lots of spills.)

The aim of the patches is to schedule the moves, and to reject the
current ii if this scheduling fails.  Revital pointed out that Mustafa
Hagog had tried the same thing, so this is effectively a reimplementation
of that idea.  For those who've seen Mustafa's patch, the main functional
differences are that:

  - Mustafa's version scheduled from low rows to high rows, with the
    high row being the one associated with the previous move.  These
    patches instead schedule high-to-low, which should leave a larger
    window for later moves.

  - The patches use a cyclic scheduling window.  E.g., for a move
    related to the instruction in column 1 of row 0, the patches first
    try row 0 (column 0 only), then row ii-1, etc.  (See the sketch
    after this list.)

  - The patches take instruction latency into account.
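To make the second point a little more concrete, here is a minimal
sketch of the row order implied by the cyclic window.  It is only an
illustration, not code from the patches; the function and variable
names are made up:

static void
visit_move_rows (int ii, int def_row)
{
  /* Hypothetical sketch only: try the rows of the cyclic scheduling
     window for a move whose definition is in row DEF_ROW of a schedule
     with initiation interval II.  Rows are tried from the definition's
     row downwards, wrapping round to II-1, i.e. DEF_ROW, DEF_ROW-1,
     ..., 0, II-1, ..., DEF_ROW+1.  */
  for (int i = 0; i < ii; i++)
    {
      int row = (def_row - i + ii) % ii;
      /* Try to place the move in ROW here, stopping at the first row
	 with a free slot that also respects the latency constraints.  */
    }
  /* If no row accepts the move, the current ii is rejected.  */
}

So for a definition in row 0, the whole of rows 0, ii-1, ii-2, ... is
considered before we give up on the current ii.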
On the loop above, we reject the ii of 27 and try again with an ii of
28.  This leads to a stage count of 3 and no register moves, which is
theoretically 10 cycles quicker than before (28 cycles per iteration
rather than the 27 + 12 - 1 = 38 estimated above).  The lack of spills
means that the real figures are much better though: on a BeagleBoard,
the new loop is 5.45 times faster than the old one.  I've seen similar
improvements in other "real" libav loops too, not all of them due to
fewer spills.

(BTW, in the ii=27 version, most of the moves are for distance-1 true
dependencies between a stage N instruction in row R and a stage N+1
instruction in row R+1.)

As well as testing on a collection of libav-derived microbenchmarks
like the one above, I tested on a commercial embedded testsuite.  It
showed a significant improvement in one test and no change for the
rest.

Bootstrapped & regression-tested on powerpc-ibm-aix5.3.0, using a
compiler that had -fmodulo-sched and -fmodulo-sched-allow-regmoves
turned on at -O and above.

Richard