From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=42mZ=B4=gmail.com=jeffreyalaw@sourceware.org>
Received: from mail-pf1-x42f.google.com (mail-pf1-x42f.google.com [IPv6:2607:f8b0:4864:20::42f])
	by sourceware.org (Postfix) with ESMTPS id 7C5B13857B9B
	for <gcc-patches@gcc.gnu.org>; Thu,  8 Jun 2023 05:37:24 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 7C5B13857B9B
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com
Received: by mail-pf1-x42f.google.com with SMTP id d2e1a72fcca58-662f0feafb2so178387b3a.1
        for <gcc-patches@gcc.gnu.org>; Wed, 07 Jun 2023 22:37:24 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20221208; t=1686202643; x=1688794643;
        h=content-transfer-encoding:in-reply-to:from:references:cc:to
         :content-language:subject:user-agent:mime-version:date:message-id
         :from:to:cc:subject:date:message-id:reply-to;
        bh=z05QREQ7xSQAXYZBFReJv06Hw07ywaI3roBL8XNnaeY=;
        b=GCDs3OkumC61iQFLkEDYSxNNZRrhrb4Dxecgfca26x2XxGPSHst/Q7jiSizSd/vl2Z
         dICzWcFncrv805wndWDvDWLjSyrddz7CFqu8mR5LO5AF8SSZoxt+fvsbaF0Y84MX+hYi
         O149YUNGlYysAimevGD2bqtbB6IfyQpDcgCro8rMbiDtClImw2/YrDuVH6hEIxOzICek
         0FM7a2crPiac1W7MZJfTDUj8cxACMg6eJMcZOmyeggUy+LaDWedJt+c/8DSjl/08l74y
         4R5G7HkyZJmfjqqb/cYFCRtz3I4a9Pvc1KIfSFc9cNWejbzyXFAPf4XDTMYZ/o5RvZ/h
         f8cg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20221208; t=1686202643; x=1688794643;
        h=content-transfer-encoding:in-reply-to:from:references:cc:to
         :content-language:subject:user-agent:mime-version:date:message-id
         :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
        bh=z05QREQ7xSQAXYZBFReJv06Hw07ywaI3roBL8XNnaeY=;
        b=MJeCku88Uc8krfFWkP/6ZggggRLgvanQeOayBT6aTSZoVdhbC28ESDZlESFMKNYC7o
         ipgQOI4DazkG55S0kE80dpCS0CkmybbHPlQASY7RZ+NLmIHHNYukyBF3BBv5wnZw/Qtd
         yTES5jjt5nFiVNRbBk0CPM7SvOSmmlAf8QTeBGihiW7zbGN0wzEvQkHAjqVTVMp4opUm
         eriqz1dbf8+0rkUiipVh+CCQC418jPjC/344z+fc9C9XVVpBc0ZCwMQVeiITgXYOIdvm
         m8E+B4iO9YNVH6YB9Qrj85Qj76o6vQEKyPSpVWBGFAujp8b3VCLsX1vaiREILFmCuIBL
         j1yw==
X-Gm-Message-State: AC+VfDyRSrMnt4H/TwZBDK63ATKoJtlSyT9R7h9aGI4Cn5pOVOA1Ewe8
	V7nMCEz14dNRQVz0yk652xQ=
X-Google-Smtp-Source: ACHHUZ69R+5keyt3H1RtEP2mvf2Tsw4eJKGM+ekKBJYcKNIrAlS1TRk+ksLX2ju5VDbqL/rDgz837A==
X-Received: by 2002:a05:6a20:430b:b0:100:c3fe:a653 with SMTP id h11-20020a056a20430b00b00100c3fea653mr1273915pzk.29.1686202642514;
        Wed, 07 Jun 2023 22:37:22 -0700 (PDT)
Received: from [172.31.0.109] ([136.36.130.248])
        by smtp.gmail.com with ESMTPSA id d16-20020aa78690000000b006505bae11bcsm289026pfo.23.2023.06.07.22.37.21
        (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128);
        Wed, 07 Jun 2023 22:37:22 -0700 (PDT)
Message-ID: <91d71dae-b235-fbd0-c8f0-001b7f1e444c@gmail.com>
Date: Wed, 7 Jun 2023 23:37:20 -0600
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.9.1
Subject: Re: [PATCH 1/2] Implementation of new RISCV optimizations pass:
 fold-mem-offsets.
Content-Language: en-US
To: Manolis Tsamis <manolis.tsamis@vrull.eu>, gcc-patches@gcc.gnu.org
Cc: Richard Biener <richard.guenther@gmail.com>,
 Palmer Dabbelt <palmer@rivosinc.com>,
 Philipp Tomsich <philipp.tomsich@vrull.eu>, Kito Cheng <kito.cheng@gmail.com>
References: <20230525123550.1072506-1-manolis.tsamis@vrull.eu>
 <20230525123550.1072506-2-manolis.tsamis@vrull.eu>
From: Jeff Law <jeffreyalaw@gmail.com>
In-Reply-To: <20230525123550.1072506-2-manolis.tsamis@vrull.eu>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=-2.4 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,KAM_SHORT,NICE_REPLY_A,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>


On 5/25/23 06:35, Manolis Tsamis wrote:
> Implementation of the new RISC-V optimization pass for memory offset
> calculations, documentation and testcases.
> 
> gcc/ChangeLog:
> 
> 	* config.gcc: Add riscv-fold-mem-offsets.o to extra_objs.
> 	* config/riscv/riscv-passes.def (INSERT_PASS_AFTER): Schedule a new
> 	pass.
> 	* config/riscv/riscv-protos.h (make_pass_fold_mem_offsets): Declare.
> 	* config/riscv/riscv.opt: New options.
> 	* config/riscv/t-riscv: New build rule.
> 	* doc/invoke.texi: Document new option.
> 	* config/riscv/riscv-fold-mem-offsets.cc: New file.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.target/riscv/fold-mem-offsets-1.c: New test.
> 	* gcc.target/riscv/fold-mem-offsets-2.c: New test.
> 	* gcc.target/riscv/fold-mem-offsets-3.c: New test.
So not going into the guts of the patch yet.

 From a benchmark standpoint the only two that get out of the +-0.05% 
range are mcf and deepsjeng (from a dynamic instruction standpoint).  So 
from an evaluation standpoint we can probably focus our efforts there. 
And as we know, mcf is actually memory bound, so while improving its 
dynamic instruction count is good, the end performance improvement may 
be marginal.

As I mentioned to Philipp many months ago this reminds me a lot of a 
problem I've seen before.  Basically register elimination emits code 
that can be terrible in some circumstances.  So I went and poked at this 
again.

I think the key difference between now and what I was dealing with 
before is for the cases that really matter for rv64 we have a shNadd 
insn in the sequence.  That private port I was working on before did not 
have shNadd (don't ask, I probably can't tell).  Our target also had 
reg+reg addressing modes.  What I can't remember was if we were trying 
harder to fold the constant terms into the memory reference or if we 
were more focused on the reg+reg.  Ultimately it's probably not that 
important to remember -- the key is there are very significant 
differences in the target's capabilities which impact how we should be 
generating code in this case.  Those differences affect the code we 
generate *and* the places where we can potentially get control and do 
some address rewriting.

A  key sequence in mcf looks something like this in IRA, others have 
similar structure:

> (insn 237 234 239 26 (set (reg:DI 377)
>         (plus:DI (ashift:DI (reg:DI 200 [ _173 ])
>                 (const_int 3 [0x3]))
>             (reg/f:DI 65 frame))) "pbeampp.c":139:15 333 {*shNadd}
>      (nil))
> (insn 239 237 235 26 (set (reg/f:DI 380)
>         (plus:DI (reg:DI 513)
>             (reg:DI 377))) "pbeampp.c":139:15 5 {adddi3}
>      (expr_list:REG_DEAD (reg:DI 377)
>         (expr_list:REG_EQUAL (plus:DI (reg:DI 377)
>                 (const_int -32768 [0xffffffffffff8000]))
>             (nil))))
[ ... ]
> (insn 240 235 255 26 (set (reg/f:DI 204 [ _177 ])
>         (mem/f:DI (plus:DI (reg/f:DI 380)
>                 (const_int 280 [0x118])) [7 *_176+0 S8 A64])) "pbeampp.c":139:15 179 {*movdi_64bit}
>      (expr_list:REG_DEAD (reg/f:DI 380)
>         (nil)))


The key here is insn 237.  It's generally going to be bad to have FP 
show up in a shadd insn because its going to be eliminated into 
sp+offset.  That'll generate an input reload before insn 237 and we 
can't do any combination with the constant in insn 239.

After LRA it looks like this:

> (insn 1540 234 1541 26 (set (reg:DI 11 a1 [750])
>         (const_int 32768 [0x8000])) "pbeampp.c":139:15 179 {*movdi_64bit}
>      (nil))
> (insn 1541 1540 1611 26 (set (reg:DI 12 a2 [749])
>         (plus:DI (reg:DI 11 a1 [750])
>             (const_int -272 [0xfffffffffffffef0]))) "pbeampp.c":139:15 5 {adddi3}
>      (expr_list:REG_EQUAL (const_int 32496 [0x7ef0])
>         (nil))) 
> (insn 1611 1541 1542 26 (set (reg:DI 29 t4 [795])
>         (plus:DI (reg/f:DI 2 sp)
>             (const_int 64 [0x40]))) "pbeampp.c":139:15 5 {adddi3}
>      (nil))
> (insn 1542 1611 237 26 (set (reg:DI 12 a2 [749])
>         (plus:DI (reg:DI 12 a2 [749])
>             (reg:DI 29 t4 [795]))) "pbeampp.c":139:15 5 {adddi3}
>      (nil))
> (insn 237 1542 239 26 (set (reg:DI 12 a2 [377])
>         (plus:DI (ashift:DI (reg:DI 14 a4 [orig:200 _173 ] [200])
>                 (const_int 3 [0x3]))
>             (reg:DI 12 a2 [749]))) "pbeampp.c":139:15 333 {*shNadd}
>      (nil))
> (insn 239 237 235 26 (set (reg/f:DI 12 a2 [380])
>         (plus:DI (reg:DI 10 a0 [513])
>             (reg:DI 12 a2 [377]))) "pbeampp.c":139:15 5 {adddi3}
>      (expr_list:REG_EQUAL (plus:DI (reg:DI 12 a2 [377])
>             (const_int -32768 [0xffffffffffff8000]))
>         (nil))) 
[ ... ]
> (insn 240 235 255 26 (set (reg/f:DI 14 a4 [orig:204 _177 ] [204])
>         (mem/f:DI (plus:DI (reg/f:DI 12 a2 [380])
>                 (const_int 280 [0x118])) [7 *_176+0 S8 A64])) "pbeampp.c":139:15 179 {*movdi_64bit}
>      (nil))


Reload/LRA made an absolute mess of that code.

But before we add a new pass (target specific or generic), I think it 
may be in our best interest experiment a bit of creative rewriting to 
preserve the shadd, but without the frame pointer.  Perhaps also looking 
for a way to fold the constants, both the explicit ones and the implicit 
one from FP elimination.

This looks particularly promising:

> Trying 237, 239 -> 240:
>   237: r377:DI=r200:DI<<0x3+frame:DI
>       REG_DEAD r200:DI
>   239: r380:DI=r513:DI+r377:DI
>       REG_DEAD r377:DI
>       REG_EQUAL r377:DI-0x8000
>   240: r204:DI=[r380:DI+0x118]
>       REG_DEAD r380:DI
> Failed to match this instruction:
> (set (reg/f:DI 204 [ _177 ])
>     (mem/f:DI (plus:DI (plus:DI (plus:DI (mult:DI (reg:DI 200 [ _173 ])
>                         (const_int 8 [0x8]))
>                     (reg/f:DI 65 frame))
>                 (reg:DI 513))
>             (const_int 280 [0x118])) [7 *_176+0 S8 A64]))


We could reassociate this as

t1 = r200 * 8 + r513
t2 = frame + 280
t3 = t1 + t2
r204 = *t3

Which after elimination would be

t1 = r2000 * 8 + r513
t2 = sp + C + 280
t3 = t1 + t2
r204 = *t3

C + 280 will simplify.  And we'll probably end up in the addptr3 case 
which I think gives us a chance to write this a bit so that we end up wit
t1 = r200 * 8 + r513
t2 = sp + t1
r204 = *(t2 + 280 + C)


Or at least I *think* we might be able to get there.  Anyway, as I said, 
I think this deserves a bit of playing around before we jump straight 
into adding a new pass.

jeff