From: Richard Sandiford
To: gcc-patches@gcc.gnu.org
Cc: zaks@il.ibm.com
Subject: [0/4] Make SMS schedule register moves
Date: Tue, 30 Aug 2011 12:45:00 -0000

I'm seeing several cases in which SMS's register move handling is
causing it to generate worse code than the normal schedulers on ARM
Cortex-A8.  The problem is that we first schedule the loop without
taking the moves into account, then emit the required moves immediately
before the initial definition.

A simple example is:

void
loop (unsigned char *__restrict q, unsigned char *__restrict p, int n)
{
  while (n > 0)
    {
      q[0] = (p[0] + p[1]) >> 1;
      q++;
      p += 2;
      n--;
    }
}

(taken from libav).  Compiled with:

  -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad
  -O2 -ftree-vectorize -fno-auto-inc-dec -fmodulo-sched
  -fmodulo-sched-allow-regmoves

on arm-linux-gnueabi-gcc (with current trunk), the scheduled loop has an
ii of 27, a stage count of 6, and requires 14 register moves, 12 of which
are vector moves.  Vector moves cannot be dual-issued with most other
vector instructions, and the scheduled loop only has one free "slot" for
a vector move, so even in the best case, this loop theoretically needs
27 + 12 - 1 cycles per iteration, significantly higher than the ii.
(It actually ends up much worse than that, because with so many live
vector values, we end up with lots of spills.)

The aim of the patches is to schedule the moves, and to reject the
current ii if this scheduling fails.  Revital pointed out that Mustafa
Hagog had tried the same thing, so this is effectively a reimplementation
of that idea.  For those who've seen Mustafa's patch, the main functional
differences are that:

  - Mustafa's version scheduled from low rows to high rows, with the
    high row being the one associated with the previous move.  These
    patches instead schedule high-to-low, which should leave a larger
    window for later moves.

  - The patches use a cyclic scheduling window.  E.g., for a move
    related to the instruction in column 1 of row 0, the patches first
    try row 0 (column 0 only), then row ii-1, etc.  (See the sketch
    after this list.)

  - The patches take instruction latency into account.
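To make the second point a little more concrete, here is a minimal
sketch of the row order implied by the cyclic window.  It is only an
illustration, not code from the patches; the function and variable
names are made up:

static void
visit_move_rows (int ii, int def_row)
{
  /* Hypothetical sketch only: try the rows of the cyclic scheduling
     window for a move whose definition is in row DEF_ROW of a schedule
     with initiation interval II.  Rows are tried from the definition's
     row downwards, wrapping round to II-1, i.e. DEF_ROW, DEF_ROW-1,
     ..., 0, II-1, ..., DEF_ROW+1.  */
  for (int i = 0; i < ii; i++)
    {
      int row = (def_row - i + ii) % ii;
      /* Try to place the move in ROW here, stopping at the first row
	 with a free slot that also respects the latency constraints.  */
    }
  /* If no row accepts the move, the current ii is rejected.  */
}

So for a definition in row 0, the whole of rows 0, ii-1, ii-2, ... is
considered before we give up on the current ii.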
On the loop above, we reject the ii of 27 and try again with an ii of
28.  This leads to a stage count of 3 and no register moves, which is
theoretically 10 cycles quicker than before (28 cycles per iteration
rather than the 27 + 12 - 1 = 38 estimated above).  The lack of spills
means that the real figures are much better though: on a BeagleBoard,
the new loop is 5.45 times faster than the old one.  I've seen similar
improvements in other "real" libav loops too, not all of them due to
fewer spills.

(BTW, in the ii=27 version, most of the moves are for distance-1 true
dependencies between a stage N instruction in row R and a stage N+1
instruction in row R+1.)

As well as testing on a collection of libav-derived microbenchmarks
like the one above, I tested on a commercial embedded testsuite.  It
showed a significant improvement in one test and no change for the
rest.

Bootstrapped & regression-tested on powerpc-ibm-aix5.3.0, using a
compiler that had -fmodulo-sched and -fmodulo-sched-allow-regmoves
turned on at -O and above.

Richard