From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 86276 invoked by alias); 22 Jul 2019 21:47:39 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 86268 invoked by uid 89); 22 Jul 2019 21:47:39 -0000 Authentication-Results: sourceware.org; auth=none X-Spam-SWARE-Status: No, score=-14.0 required=5.0 tests=AWL,BAYES_00,GIT_PATCH_0,GIT_PATCH_1,GIT_PATCH_2,GIT_PATCH_3,RCVD_IN_DNSWL_NONE,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 spammy=auditing, precise X-HELO: cvs.linux-mips.org Received: from eddie.linux-mips.org (HELO cvs.linux-mips.org) (148.251.95.138) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Mon, 22 Jul 2019 21:47:36 +0000 Received: (from localhost user: 'macro', uid#1010) by eddie.linux-mips.org with ESMTP id S23990431AbfGVVralOmNd (ORCPT ); Mon, 22 Jul 2019 23:47:30 +0200 Date: Mon, 22 Jul 2019 21:48:00 -0000 From: "Maciej W. Rozycki" To: Fredrik Noring cc: Paul Burton , Matthew Fortune , =?UTF-8?Q?J=C3=BCrgen_Urban?= , gcc-patches@gcc.gnu.org Subject: Re: [PATCH] MIPS: Fix GCC `noreorder' for undefined R5900 short loops In-Reply-To: <20190722155506.GA2726@sx9> Message-ID: References: <20190721171726.GA47580@sx9> <20190722155506.GA2726@sx9> User-Agent: Alpine 2.21 (LFD 202 2017-01-01) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII X-IsSubscribed: yes X-SW-Source: 2019-07/txt/msg01454.txt.bz2 Hi Fredrik, > I'm glad to hear from you again! I'm not dead, just distracted. > > I think that should be a GAS warning really (similarly to macros that > > expand to multiple instructions in a delay slot) as people ought to be > > allowed to do what they wish, and then `-Werror' can be used for code > > quality enforcement (and possibly disabled on a case-by-case basis). > > How can one reasonably prevent the bug when compiling a whole Linux > distribution with thousands of packages if GAS merely warns and proceeds > in many cases? I think the tools must not set a policy. By using `.set noreorder' the user told the toolchain he knows better and asked to keep its hands away. As I say you can use `-Werror' for code auditing. And in most cases manually scheduled delay slots in handcoded assembly are a sign of a problem with the piece of code affected, regardless of the R5900 erratum. > > > [ In theory, GAS could actually insert NOPs prior to the noreorder directive, > > > to make the loop longer that six instructions, but GAS does not have that > > > kind of capability. Another option that GCC prevents is to place a NOP in > > > the delay slot. ] > > > > Well, GAS does have that capability, although of course it is not enabled > > for `noreorder' code. > > Hmm? I'm unable to make sense of that, since such NOPs are unaffected by > noreorder. This is what I meant: > > loop: addiu $5,$5,1 > addiu $4,$4,1 > lb $2,-1($5) > nop /* These two NOPs would prevent the */ > nop /* bug while preserving noreorder. */ > .set noreorder > .set nomacro > bne $2,$3,loop > sb $2,-1($4) > .set macro > .set reorder See `nops_for_insn'. However again, as I noted, `.set noreorder' tells GAS not to analyse the dependencies for code within. There is no need to schedule this delay slot manually, and if this is generated code, then the producer (GCC) should have taken care of the hazards, be it architectural or errata. If this is manually written code, then the author asked for trouble. > > For generated code I think however that usually it > > will be cheaper performance-wise if a non-trivial delay-slot instruction > > is never produced in the affected cases (i.e. a dummy NOP is always used). > > I suppose so too. One observation is that R5900 BogoMIPS went down from > 584 to 195 when switching from the generic kernel variant > > .set noreorder > 1: bnez a0,1b > addiu a0,a0,-1 > .set reorder > > that is undefined for R5900 in arch/mips/lib/delay.c, to a special variant > > beqz a0,2f > 1: addiu a0,a0,-1 > bnez a0,1b > 2: > > for the R5900 where GAS will place a NOP in the branch delay slot. With > loop unrolling BogoMIPS could be inflated again, of course. In general, > unrolling is a bit better for the R5900 than general MIPS. Perhaps GCC > could be informed about that, possibly via profile-guided optimisation. Well, BogoMIPS is just an arbitrary number. There is a data dependency here, so manual delay slot scheduling was unavoidable. I'd suggest using this as a functional equivalent for the R5900: addiu a0,a0,1 1: addiu a0,a0,-1 bnez a0,1b and avoiding the irregularity for a0==0. > > > A reasonable fix for GCC is perhaps to update gcc/config/mips/mips.md to not > > > make explicit use of the branch delay slot, as suggested by the patch below? > > > Then GCC will emit > > > > > > loop: addiu $5,$5,1 > > > addiu $4,$4,1 > > > lb $2,-1($5) > > > sb $2,-1($4) > > > bne $2,$3,loop > > > > > > that GAS will adjust in the ELF object to > > > > > > 4: 24a50001 addiu a1,a1,1 > > > 8: 24840001 addiu a0,a0,1 > > > c: 80a2ffff lb v0,-1(a1) > > > 10: a082ffff sb v0,-1(a0) > > > 14: 1443fffb bne v0,v1,4 > > > 18: 00000000 nop > > > > > > where a NOP is placed in the delay slot to avoid the bug. For longer loops, > > > this relies on GAS making appropriate use of the delay slot. I'm not sure if > > > GAS is good at that, though? > > > > I'm sort-of surprised that GCC has produced `reorder' code here, making > > it rely on GAS for delay slot scheduling. Have you used an unusual set of > > options that prevents GCC from making `noreorder' code, which as I recall > > is the usual default (not for the MIPS16 mode IIRC, as well as some > > obscure corner cases)? > > I used the suggested patch, and recompiled the kernel with it, and verified > with the machine code tool that there were no undefined loops. Wouldn't that > be enough? It would break if GAS was invoked without `-mfix-r5900' (e.g. by using an explicit `-Wa,-mno-fix-r5900' option). > > > diff --git a/gcc/config/mips/mips.md b/gcc/config/mips/mips.md > > > index e17b1d522f0..acd31a8960c 100644 > > > --- a/gcc/config/mips/mips.md > > > +++ b/gcc/config/mips/mips.md > > > @@ -1091,6 +1091,7 @@ > > > > > > (define_delay (and (eq_attr "type" "branch") > > > (not (match_test "TARGET_MIPS16")) > > > + (not (match_test "TARGET_FIX_R5900")) > > > (eq_attr "branch_likely" "yes")) > > > [(eq_attr "can_delay" "yes") > > > (nil) > > > @@ -1100,6 +1101,7 @@ > > > ;; not annul on false. > > > (define_delay (and (eq_attr "type" "branch") > > > (not (match_test "TARGET_MIPS16")) > > > + (not (match_test "TARGET_FIX_R5900")) > > > (ior (match_test "TARGET_CB_NEVER") > > > (and (eq_attr "compact_form" "maybe") > > > (not (match_test "TARGET_CB_ALWAYS"))) > > > > > > > I think you need to modify the default `can_delay' attribute definition > > instead (in the same way). > > I tried to limit the case to branch delays only, which is a rough > approximation. Call and jump delay slots are acceptable. Are you referring > to this piece? > > ;; Can the instruction be put into a delay slot? > (define_attr "can_delay" "no,yes" > (if_then_else (and (eq_attr "type" "!branch,call,jump") > (eq_attr "hazard" "none") > (match_test "get_attr_insn_count (insn) == 1")) > (const_string "yes") > (const_string "no"))) Yes. My suggestion does not preclude limiting the workaround to branches only while being precise about what the situation is, i.e. branches still require a delay slot instruction (unlike MIPS16 or uMIPS compact branches) but no real instruction is suitable. I'd prefer if GCC actually scheduled the required dummy NOP instruction explicitly. > > An improved future version might determine the > > exact conditions as noted in your proposed commit description, however I'd > > suggest making this simple change first. > > Learning the exact conditions, in a hardware sense, would be interesting > indeed, as some short loops actually do appear to work despite being labeled > as undefined. A fairly deep understanding of the R5900 implementation is > essential for such an undertaking. :) Ack, although in this case I only meant what was stated in the commit description, i.e. avoiding forcing a dummy delay slot instruction where known for sure not to be needed, e.g. a long loop, a forward branch, etc. Maciej