From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from server.nextmovesoftware.com (server.nextmovesoftware.com [162.254.253.69]) by sourceware.org (Postfix) with ESMTPS id A62CB3858D1E for ; Tue, 2 May 2023 16:20:58 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org A62CB3858D1E Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=nextmovesoftware.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=nextmovesoftware.com DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=nextmovesoftware.com; s=default; h=Content-Transfer-Encoding:Content-Type: MIME-Version:Message-ID:Date:Subject:In-Reply-To:References:Cc:To:From:Sender :Reply-To:Content-ID:Content-Description:Resent-Date:Resent-From: Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id:List-Help: List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive; bh=0jmHM7G5jc3STYJblkimVHd7RIopOc0/f7znmvodFCA=; b=GwwvNWbY2tj21B5AwULZWhv4UP KZXZfMmYh4LHzbgJ2KY1h2BLRw1ylMD5dK79j4UamnxJ3USxe0ypyvydUjRet9B3h9mMRyFOJQLU3 F4DAGf401Q0Kd8QbhAUYgYaEDM1BpFnatNDAqtMWHsi+Em7eoCJ/ZeQBoSp8Ryfn++mZJQApT0hYa 0SIobnd6VJUd9n/i9zMcie+yZxhg6Nk/g9vDjKvcvp3AuYSDPgoc4177T+dEcLbAw0nBcQNYWThCo fSTqiJs6kVmiaq3SslxC+vVy7BKfjiFwLkFqhdHW0p0qBjoHJhysGWUPpl4wJ90lxNz7Jc//LGPXD OYwwv70A==; Received: from [185.62.158.67] (port=56139 helo=Dell) by server.nextmovesoftware.com with esmtpsa (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1ptskI-0001Eb-1M; Tue, 02 May 2023 12:20:54 -0400 From: "Roger Sayle" To: "'Segher Boessenkool'" Cc: "'Paul Koning'" , "'Jeff Law'" , "'GCC Patches'" References: <009601d97c85$de708170$9b518450$@nextmovesoftware.com> <0FD01909-1F42-4E50-8D0D-2EDB80C34C0A@comcast.net> <021801d97cf8$9fc91770$df5b4650$@nextmovesoftware.com> <20230502134928.GA19790@gate.crashing.org> In-Reply-To: <20230502134928.GA19790@gate.crashing.org> Subject: RE: [committed] Convert xstormy16 to LRA Date: Tue, 2 May 2023 17:20:49 +0100 Message-ID: <008101d97d12$12068200$36138600$@nextmovesoftware.com> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Outlook 16.0 Thread-Index: AQG28Urg0PW1413oa1Hy1htPI2COKwI6TCJsATMouzMCUr310a9eMTeg Content-Language: en-gb X-AntiAbuse: This header was added to track abuse, please include it with any abuse report X-AntiAbuse: Primary Hostname - server.nextmovesoftware.com X-AntiAbuse: Original Domain - gcc.gnu.org X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12] X-AntiAbuse: Sender Address Domain - nextmovesoftware.com X-Get-Message-Sender-Via: server.nextmovesoftware.com: authenticated_id: roger@nextmovesoftware.com X-Authenticated-Sender: server.nextmovesoftware.com: roger@nextmovesoftware.com X-Source: X-Source-Args: X-Source-Dir: X-Spam-Status: No, score=-6.5 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,KAM_SHORT,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On 02 May 2023 14:49, Segher Boessenkool wrote: > On Tue, May 02, 2023 at 02:18:43PM +0100, Roger Sayle wrote: > > On 02 May 2023 13:40, Paul Koning wrote: > > > > On May 1, 2023, at 7:37 PM, Roger Sayle > > > > > > > wrote: > > > > The shiftsi.cc regression on xstormy16 is fixed by adding > > > > -fno-split-wide-types. > > > > In fact, if all the regression tests pass, I'd suggest that > > > > flag_split_wide-types = false should be the default on xstormy16 > > > > now that we've moved to LRA. And if this works for xstormy16, it > > > > might be useful to other targets for the LRA transition; it's a > > > > difference in behaviour between reload and LRA that could > > > > potentially affect multiple targets. > > > > > > Is there documentation for that flag? > > > > Yes, see the section -fsplit-wide-types in > > https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html > > > > Interestingly, there's a recent-ish blog describing how > > -fno-split-wide-types reduces executable size on AVR: > > https://ufj.ddns.net/blog/marlin/2019/01/07/reducing-marlin-binary-siz > > e.html and its interaction with (AVR) register allocation is seen in > > PR middle-end/35860. > > But what causes the problem? Is something missing somewhere, or do we get too > high register pressure? > > There also is -fsplit-wide-types-early, which might help. > > In principle it always is good to describe a machine model as close as possible to > what the machine actually does. What gets in the way here? I can describe what GCC is doing, but I don't claim to understand how the register allocators make the choices they do (nor where the fix should be). The problematic test case is: unsigned long foo(unsigned long x) { return x << 1; } which with (the old) reload (and -fno-split-wide-types) is generated as: foo: add r2,r2 adc r3,r3 ret but with (the new) lra is generated as: mov r6,r2 mov r7,r3 mov r2,r6 mov r3,r7 add r2,r6 adc r3,r7 ret where the register allocator has failed to notice that (take advantage of) the SImode addition can be performed "in-place", from the insn's constraints: ;; Addition (define_insn_and_split "addsi3" [(set (match_operand:SI 0 "register_operand" "=r") (plus:SI (match_operand:SI 1 "register_operand" "%0") (match_operand:SI 2 "nonmemory_operand" "ri"))) (clobber (reg:BI CARRY_REG))] "" "#" "reload_completed" [(pc)] { xstormy16_expand_arith (SImode, PLUS, operands[0], operands[1], operands[2]); DONE; } [(set_attr "length" "4")]) Before combine, we have simplicity itself (a two instruction function): (insn 2 4 3 2 (set (reg/v:SI 26 [ x ]) (reg:SI 2 r2 [ x ])) "../../shiftsi.c":4:41 7 {*movsi_internal} (expr_list:REG_DEAD (reg:SI 2 r2 [ x ]) (nil))) (insn 6 3 11 2 (parallel [ (set (reg:SI 28) (plus:SI (reg/v:SI 26 [ x ]) (reg/v:SI 26 [ x ]))) (clobber (reg:BI 16 carry)) ]) "../../shiftsi.c":4:52 35 {addsi3} (expr_list:REG_DEAD (reg/v:SI 26 [ x ]) (expr_list:REG_UNUSED (reg:BI 16 carry) (nil)))) Then combine inserts an additional copy: (insn 14 4 2 2 (set (reg:SI 29) (reg:SI 2 r2 [ x ])) "../../shiftsi.c":4:41 -1 (expr_list:REG_DEAD (reg:SI 2 r2 [ x ]) (nil))) (insn 2 14 3 2 (set (reg/v:SI 26 [ x ]) (reg:SI 29)) "../../shiftsi.c":4:41 7 {*movsi_internal} (expr_list:REG_DEAD (reg:SI 29) (nil))) (insn 6 3 11 2 (parallel [ (set (reg:SI 28) (plus:SI (reg/v:SI 26 [ x ]) (reg/v:SI 26 [ x ]))) (clobber (reg:BI 16 carry)) ]) "../../shiftsi.c":4:52 35 {addsi3} (expr_list:REG_DEAD (reg/v:SI 26 [ x ]) (expr_list:REG_UNUSED (reg:BI 16 carry) (nil)))) The just before reload, subreg3 "splits the wide types", to produce: (insn 15 4 16 2 (set (reg:HI 30) (reg:HI 2 r2 [ x ])) "../../shiftsi.c":4:41 6 {movhi_internal} (nil)) (insn 16 15 17 2 (set (reg:HI 31 [+2 ]) (reg:HI 3 r3 [ x+2 ])) "../../shiftsi.c":4:41 6 {movhi_internal} (nil)) (insn 17 16 18 2 (clobber (reg/v:SI 26 [ x ])) "../../shiftsi.c":4:41 -1 (nil)) (insn 18 17 19 2 (set (subreg:HI (reg/v:SI 26 [ x ]) 0) (reg:HI 30)) "../../shiftsi.c":4:41 6 {movhi_internal} (nil)) (insn 19 18 3 2 (set (subreg:HI (reg/v:SI 26 [ x ]) 2) (reg:HI 31 [+2 ])) "../../shiftsi.c":4:41 6 {movhi_internal} (nil)) (insn 6 3 11 2 (parallel [ (set (reg:SI 28) (plus:SI (reg/v:SI 26 [ x ]) (reg/v:SI 26 [ x ]))) (clobber (reg:BI 16 carry)) ]) "../../shiftsi.c":4:52 35 {addsi3} (expr_list:REG_DEAD (reg/v:SI 26 [ x ]) (expr_list:REG_UNUSED (reg:BI 16 carry) (nil)))) Given this input, with the movhi_internals and the clobber, reload was able to do something sensible, but lra fails to place pseudos 26, 28 in the same place, in pseudos 30 and 31, ideally hard regs 2 and 3. With the -fno-split-wide-types workaround, lra/reload have the easier task of assigning registers to the three insn case above. I don't think it's a problem/fault with the machine description, but how the clobber above is being interpreted by the different register allocators. Back in August 2022, I submitted a x86_64 backend patch that used a peephole2 to eliminate this type of (subreg-lower) generated clobber: https://gcc.gnu.org/pipermail/gcc-patches/2022-August/599419.html but Uros was hesitant to accept this change, without an RTL expert explaining why the clobbers were being generated in the first place. Like my x86_64 patch above, this issue could also probably be cleaned up by a peephole2 for xstormy16 that straddled/removed the clobber. My simplistic summary is that when a backend lowers a multi-word instruction with a splitter, it (typically) does so without introducing a clobber. If the clobbers generated by the middle-end's automatic lowering confuse the register allocator, then this is more efficient. Ideally, such clobbers should be avoided (if possible) and/or LRA improved to understand that they don't actually cause an allocno conflict. [but I'm in no way a register allocation expert]. Thoughts?