Subject: Re: [PATCH, SPU] generated better code for loads and stores
From: "Ulrich Weigand"
To: Trevor_Smigiel@playstation.sony.com
Cc: gcc-patches@gcc.gnu.org (gcc-patches), andrew_pinski@playstation.sony.com,
    rguenther@suse.de
Date: Thu, 11 Sep 2008 11:50:00 -0000
In-Reply-To: <20080828021017.GO27746@playstation.sony.com>
Message-Id: <200809111142.m8BBgl7P022157@d12av02.megacenter.de.ibm.com>

Trevor Smigiel wrote:

> This patch generates better code for loads and stores on SPU.

Unfortunately, the current version of this patch introduces a number of
regressions for me:

- wrong code generation due to an exposed REGNO_POINTER_ALIGN bug
- ICE with -fnon-call-exceptions
- code size regression with -O0
- code quality regression with -march=celledp

I've seen these both on mainline and with a 4.3 backport of your patch.

1. Wrong code generation due to exposed REGNO_POINTER_ALIGN bug

On mainline (only), the following test case now fails:

FAIL: gcc.c-torture/execute/960521-1.c execution,  -O1

This is because when compiling this code:

int *b;

foo ()
{
  int i;
  for (i = 0; i < BLOCK_SIZE - 1; i++)
    b[i] = -1;
}

GCC assumes (incorrectly) that the *value* of b must be 16-byte aligned,
because it has (correctly) inferred that the *address* of b is 16-byte
aligned!

This happens because of an independent bug in computing
REGNO_POINTER_ALIGN, present in both force_reg and
set_reg_attrs_from_value:

  if (MEM_POINTER (x))
    mark_reg_pointer (reg, MEM_ALIGN (x));

This is broken because the *value* of the MEM x was just copied into reg;
MEM_ALIGN is the alignment of the memory address, not the alignment of the
pointer value that is stored there.
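To make the address/value distinction concrete, here is a minimal
stand-alone illustration (my own example, not from the test suite; the
pool array and the offset are made up):

#include <stdio.h>
#include <stdint.h>

/* The *address* of b is 16-byte aligned here, but the *value* stored
   in b can point anywhere.  Conflating the two is exactly the
   REGNO_POINTER_ALIGN bug described above.  */
int *b __attribute__ ((aligned (16)));

int main (void)
{
  static int pool[8] __attribute__ ((aligned (16)));

  b = &pool[1];               /* the value is only 4-byte aligned */
  printf ("&b mod 16 = %d\n", (int) ((uintptr_t) &b % 16));  /* 0 */
  printf (" b mod 16 = %d\n", (int) ((uintptr_t) b % 16));   /* 4 */
  return 0;
}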
(Note that in GCC 4.3, REGNO_POINTER_ALIGN is incorrect just the same, but
the problem does not show up, because the broken register is placed as the
second operand of the PLUS used for address generation -- therefore the
optimization in spu_split_load does not trigger.)

2. ICE with -fnon-call-exceptions

A couple of test cases now fail on both mainline and 4.3:

FAIL: g++.dg/eh/subreg-1.C (internal compiler error)
FAIL: g++.dg/opt/cfg5.C (internal compiler error)
FAIL: g++.dg/opt/pr34036.C (internal compiler error)
FAIL: g++.dg/opt/reg-stack2.C (internal compiler error)
FAIL: g++.dg/other/profile1.C (internal compiler error)

all with an ICE along the following lines:

/home/uweigand/fsf/gcc-4_3/gcc/testsuite/g++.dg/eh/subreg-1.C:41: error: in basic block 5:
/home/uweigand/fsf/gcc-4_3/gcc/testsuite/g++.dg/eh/subreg-1.C:41: error: flow control insn inside a basic block
(insn 141 21 142 5 /home/uweigand/fsf/gcc-4_3/gcc/testsuite/g++.dg/eh/subreg-1.C:35 (set (reg:TI 185)
        (mem/s:TI (reg/v/f:SI 138 [ sp.19 ]) [0 S16 A128])) -1 (expr_list:REG_EH_REGION (const_int 1 [0x1])
        (nil)))
/home/uweigand/fsf/gcc-4_3/gcc/testsuite/g++.dg/eh/subreg-1.C:41: internal compiler error: in rtl_verify_flow_info_1, at cfgrtl.c:1923

The reason for this is that with -fnon-call-exceptions, the memory store
gets tagged with a REG_EH_REGION note.  The split0 pass generates a memory
load and a memory store insn from it, and the splitter logic copies that
note to *both* instructions.  This is a problem because the load is now
considered a "flow control" insn as well (as it supposedly may throw an
exception), and such insns are not allowed within a basic block.

I'm not sure how to fix this.  However, in actual fact, memory accesses on
the SPU *never* trap anyway -- it's just that the may_trap_p logic is not
aware of that fact.  In the SDK compiler there is a target macro
ADDRESSES_NEVER_TRAP that is set on the SPU to make rtx_addr_can_trap_p_1
always return 0.  If we port that feature to mainline, it will fix this
ICE as well.
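For reference, a more target-independent alternative would be to teach the
splitter logic to attach the REG_EH_REGION note only to those new insns
that can actually trap.  A hypothetical sketch (the helper name is my
invention; this is not part of the patch below):

static void
distribute_eh_note (rtx note, rtx new_insn)
{
  /* Copy a REG_EH_REGION note to NEW_INSN only if that insn may
     actually trap; a never-trapping load must not be turned into a
     flow control insn.  */
  if (note && may_trap_p (PATTERN (new_insn)))
    REG_NOTES (new_insn)
      = alloc_EXPR_LIST (REG_EH_REGION, XEXP (note, 0),
			 REG_NOTES (new_insn));
}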
3. Code size regression with -O0

I'm seeing one more test suite regression, on 4.3 only:

FAIL: g++.dg/opt/longbranch1.C (test for excess errors)

This is caused by the resulting code just exceeding the local store size,
while it just fit into local store before this patch.

The reason for this code size regression with -O0 is that the new code
will always generate a rotate instruction to perform the following load:

(insn 17 16 18 4 longbranch1.ii:6 (set (reg:SI 145)
        (const_int 29936 [0x74f0])) 6 {_movsi} (nil))

(insn 18 17 19 4 longbranch1.ii:6 (set (reg:SI 146)
        (mem/c/i:SI (plus:SI (reg/f:SI 128 $vfp)
                (reg:SI 145)) [0 i+0 S4 A128])) 6 {_movsi} (nil))

even though the memory access is clearly 16-byte aligned.  The code in
spu_split_load doesn't recognize this, however, because
REGNO_POINTER_ALIGN of register 145 returns false -- this register is in
fact not a pointer, but holds a plain integral value.  Even so, the memory
access itself is marked as 16-byte aligned via the MEM_ALIGN flag;
unfortunately, spu_split_load never actually looks at that flag.  (In the
old code, spu_valid_move used to check aligned_mem_p, which did look at
MEM_ALIGN.)

It would appear that simply checking MEM_ALIGN and omitting the rotate in
spu_split_load where possible should be sufficient to fix this regression.
Unfortunately, the resulting code is still bigger, because of the
additional _spu_convert insn that is generated.  With -O0, local-alloc
does not attempt to use the same register for both operands of the
_spu_convert, and thus it becomes an explicit copy in the final output.
This still causes the test case to fail ...

Now, we don't actually *need* to split the insn into a TImode load plus a
_spu_convert.  However, as your comment before address_needs_split says,
you do this deliberately to enable more merging of accesses.  Without
optimization, that merging won't happen anyway ... so I think we should
not perform the split in this case.  With this change in place as well,
the code size regression is fixed.

4. -march=celledp code quality regression

The PowerXCell 8i processor has double-precision magnitude compare
instructions.  These used to be matched by combine against patterns like:

(define_insn "cmeq_<mode>_celledp"
  [(set (match_operand:<F2I> 0 "spu_reg_operand" "=r")
        (eq:<F2I> (abs:VDF (match_operand:VDF 1 "spu_reg_operand" "r"))
                  (abs:VDF (match_operand:VDF 2 "spu_reg_operand" "r"))))]
  "spu_arch == PROCESSOR_CELLEDP"
  "dfcmeq\t%0,%1,%2"
  [(set_attr "type" "fpd")])

However, after your patch these instructions never match, because there is
no ABS RTX any more.  Those have been split into AND by:

(define_insn_and_split "_abs<mode>2"
  [(set (match_operand:VSDF 0 "spu_reg_operand" "=r")
        (abs:VSDF (match_operand:VSDF 1 "spu_reg_operand" "r")))
   (use (match_operand:<F2I> 2 "spu_reg_operand" "r"))]
  ""
  "#"
  ""
  [(set (match_dup:<F2I> 3)
        (and:<F2I> (match_dup:<F2I> 4)
                   (match_dup:<F2I> 2)))]

which now runs already in the split0 pass (before combine).  I think the
proper fix is to simply add "split0_completed" to the condition of this
splitter -- there doesn't seem to be any gain from running it early.  (I
suspect there are some other splitters that probably should be treated
likewise, but I don't have a specific regression except the ABS case.)
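As a quick sanity check of what the magnitude compare computes: the
pattern above is simply an element-wise equality of absolute values, i.e.
in scalar C terms (my own illustration, not a test case):

#include <math.h>
#include <stdio.h>

int main (void)
{
  double a = -2.5, b = 2.5;

  /* On PowerXCell 8i, dfcmeq implements this comparison directly;
     with the ABS split running before combine, GCC instead emits the
     abs (an AND with a sign mask) and the compare separately.  */
  printf ("%d\n", fabs (a) == fabs (b));   /* prints 1 */
  return 0;
}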
The following patch contains my suggested fixes (as discussed above),
except for the REGNO_POINTER_ALIGN breakage.  Tested on mainline and 4.3
with no regressions; fixes all regressions (except the mentioned
REGNO_POINTER_ALIGN breakage) of the original patch.

What do you think?

Bye,
Ulrich

ChangeLog:

	* config/spu/spu.h (ADDRESSES_NEVER_TRAP): Define.
	* rtlanal.c (rtx_addr_can_trap_p_1): Respect ADDRESSES_NEVER_TRAP
	macro.
	* doc/tm.texi (ADDRESSES_NEVER_TRAP): Document.
	* config/spu/spu.c (spu_split_load): Trust MEM_ALIGN.  When not
	optimizing, do not split load unless necessary.
	* config/spu/spu.md ("_abs<mode>2"): Do not split in split0 pass.

diff -crNp -x .svn gcc-4_3-orig/gcc/config/spu/spu.c gcc-4_3/gcc/config/spu/spu.c
*** gcc-4_3-orig/gcc/config/spu/spu.c	2008-09-10 22:09:24.000000000 +0200
--- gcc-4_3/gcc/config/spu/spu.c	2008-09-11 00:40:35.000000000 +0200
*************** spu_split_load (rtx * ops)
*** 3596,3602 ****
  
    rot = 0;
    rot_amt = 0;
!   if (GET_CODE (addr) == PLUS)
      {
        /* 8 cases:
  	 aligned reg   + aligned reg     => lqx
--- 3596,3605 ----
  
    rot = 0;
    rot_amt = 0;
! 
!   if (MEM_ALIGN (ops[1]) >= 128)
!     /* Address is already aligned; simply perform a TImode load.  */;
!   else if (GET_CODE (addr) == PLUS)
      {
        /* 8 cases:
  	 aligned reg   + aligned reg     => lqx
*************** spu_split_load (rtx * ops)
*** 3707,3712 ****
--- 3710,3723 ----
        rot_amt = 0;
      }
  
+   /* If the source is properly aligned, we don't need to split this insn into
+      a TImode load plus a _spu_convert.  However, we want to perform the split
+      anyway when optimizing to make the MEMs look the same as those used for
+      stores so they are more easily merged.  When *not* optimizing, that will
+      not happen anyway, so we prefer to avoid generating the _spu_convert.  */
+   if (!rot && !rot_amt && !optimize)
+     return 0;
+ 
    load = gen_reg_rtx (TImode);
    mem = change_address (ops[1], TImode, copy_rtx (addr));

diff -crNp -x .svn gcc-4_3-orig/gcc/config/spu/spu.h gcc-4_3/gcc/config/spu/spu.h
*** gcc-4_3-orig/gcc/config/spu/spu.h	2008-09-10 22:09:24.000000000 +0200
--- gcc-4_3/gcc/config/spu/spu.h	2008-09-10 21:19:30.000000000 +0200
*************** extern GTY(()) rtx spu_compare_op1;
*** 640,642 ****
--- 640,644 ----
  
  #define SPLIT_BEFORE_CSE2 1
  
+ #define ADDRESSES_NEVER_TRAP 1
+ 

diff -crNp -x .svn gcc-4_3-orig/gcc/config/spu/spu.md gcc-4_3/gcc/config/spu/spu.md
*** gcc-4_3-orig/gcc/config/spu/spu.md	2008-09-10 22:09:32.000000000 +0200
--- gcc-4_3/gcc/config/spu/spu.md	2008-09-10 20:09:59.000000000 +0200
***************
*** 1246,1252 ****
     (use (match_operand:<F2I> 2 "spu_reg_operand" "r"))]
    ""
    "#"
!   ""
    [(set (match_dup:<F2I> 3)
  	(and:<F2I> (match_dup:<F2I> 4)
  		   (match_dup:<F2I> 2)))]
--- 1246,1252 ----
     (use (match_operand:<F2I> 2 "spu_reg_operand" "r"))]
    ""
    "#"
!   "split0_completed"
    [(set (match_dup:<F2I> 3)
  	(and:<F2I> (match_dup:<F2I> 4)
  		   (match_dup:<F2I> 2)))]

diff -crNp -x .svn gcc-4_3-orig/gcc/doc/tm.texi gcc-4_3/gcc/doc/tm.texi
*** gcc-4_3-orig/gcc/doc/tm.texi	2008-09-10 22:09:25.000000000 +0200
--- gcc-4_3/gcc/doc/tm.texi	2008-09-10 21:43:46.000000000 +0200
*************** optimizations before this pass work bett
*** 10384,10386 ****
--- 10384,10392 ----
  instructions, and the optimizations right after this pass (e.g., CSE and
  combine) are be able to optimize the split instructions.
  @end defmac
+ 
+ @defmac ADDRESSES_NEVER_TRAP
+ Define this macro if memory accesses will never cause a trap.
+ This is the case for example on the Cell SPU processor.
+ @end defmac
+ 

diff -crNp -x .svn gcc-4_3-orig/gcc/rtlanal.c gcc-4_3/gcc/rtlanal.c
*** gcc-4_3-orig/gcc/rtlanal.c	2008-03-05 19:44:55.000000000 +0100
--- gcc-4_3/gcc/rtlanal.c	2008-09-10 21:18:53.000000000 +0200
*************** rtx_varies_p (const_rtx x, bool for_alia
*** 265,270 ****
--- 265,274 ----
  static int
  rtx_addr_can_trap_p_1 (const_rtx x, enum machine_mode mode, bool unaligned_mems)
  {
+ #ifdef ADDRESSES_NEVER_TRAP
+   /* On some processors, like the SPU, memory accesses never trap.  */
+   return 0;
+ #else
  enum rtx_code code = GET_CODE (x);
  
    switch (code)
*************** rtx_addr_can_trap_p_1 (const_rtx x, enum
*** 344,349 ****
--- 348,354 ----
  
    /* If it isn't one of the case above, it can cause a trap.  */
    return 1;
+ #endif
  }
  
  /* Return nonzero if the use of X as an address in a MEM can cause a trap.  */

-- 
  Dr. Ulrich Weigand
  GNU Toolchain for Linux on System z and Cell BE
  Ulrich.Weigand@de.ibm.com