From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=xG5S=LN=arm.com=richard.sandiford@sourceware.org>
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
	by sourceware.org (Postfix) with ESMTP id 475A33858D39
	for <gcc-patches@gcc.gnu.org>; Mon,  8 Apr 2024 12:18:56 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 475A33858D39
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 475A33858D39
Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=217.140.110.172
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1712578763; cv=none;
	b=xg5uYb+JC/xPjI2yR1hFCOzy4s6aiiRPflWo6iNchHcQJ3kW8Ps7DR7ff6Bb23soSwXW3OlwIU3RH/3Jt+vuUUVp2DAjDLFTWgJZgg+taGBq6gS1G/zB8aXGAFApaphLai+UZhaYgpQS3IFP1bzh0TdG4SyIJTxGv3xMD6Qg3VM=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
	t=1712578763; c=relaxed/simple;
	bh=Sw2XUVHlSzAfeLYQxgToAUEKZy8Zq9nsrdyrCDQ7b30=;
	h=From:To:Subject:Date:Message-ID:MIME-Version; b=GSX07ZFWXBh9G15+VHV/S+cQSVrD4xUIuMeiMK8ZLkLSnmlZFEvEqdfZkHXZEvuvDZgEnDcdvoUZTfhb4pNrmUyeIjArUL/4dK3K5tmR4DGoQqDbmvhyFRK4O6H+g6VwifKlJ0WipYqZudV0ixV4po06e6/SUo3lOzD9AzBQLCg=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
	by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 63B82DA7;
	Mon,  8 Apr 2024 05:19:26 -0700 (PDT)
Received: from localhost (e121540-lin.manchester.arm.com [10.32.110.72])
	by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 219363F64C;
	Mon,  8 Apr 2024 05:18:55 -0700 (PDT)
From: Richard Sandiford <richard.sandiford@arm.com>
To: Segher Boessenkool <segher@kernel.crashing.org>
Mail-Followup-To: Segher Boessenkool <segher@kernel.crashing.org>,Richard Biener <rguenther@suse.de>,  gcc-patches@gcc.gnu.org,  Jakub Jelinek <jakub@redhat.com>,  jeffreyalaw@gmail.com, richard.sandiford@arm.com
Cc: Richard Biener <rguenther@suse.de>,  gcc-patches@gcc.gnu.org,  Jakub Jelinek <jakub@redhat.com>,  jeffreyalaw@gmail.com
Subject: Re: [PATCH] rtl-optimization/101523 - avoid re-combine after noop 2->2 combination
References: <202404031107.433B7hCA019240@gate.crashing.org>
	<20240405212754.GL19790@gate.crashing.org>
Date: Mon, 08 Apr 2024 13:18:53 +0100
In-Reply-To: <20240405212754.GL19790@gate.crashing.org> (Segher Boessenkool's
	message of "Fri, 5 Apr 2024 16:27:54 -0500")
Message-ID: <mpth6gcvyhe.fsf@arm.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
X-Spam-Status: No, score=-14.6 required=5.0 tests=BAYES_00,KAM_DMARC_NONE,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,KAM_SHORT,SPF_HELO_NONE,SPF_NONE,TXREP autolearn=no autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

Segher Boessenkool <segher@kernel.crashing.org> writes:
> Hi!
>
> On Wed, Apr 03, 2024 at 01:07:41PM +0200, Richard Biener wrote:
>> The following avoids re-walking and re-combining the instructions
>> between i2 and i3 when the pattern of i2 doesn't change.
>> 
>> Bootstrap and regtest running ontop of a reversal of 
>> r14-9692-g839bc42772ba7a.
>
> Please include that in the patch (or series, preferably).
>
>> It brings down memory use frmo 9GB to 400MB and compile-time from
>> 80s to 3.5s.  r14-9692-g839bc42772ba7a does better in both metrics
>> but has shown code generation regressions across acrchitectures.
>> 
>> OK to revert r14-9692-g839bc42772ba7a?
>
> No.
>
> The patch solved a very real problem.  How does your replacement handle
> that?  You don't say.  It looks like it only battles symptoms a bit,
> instead :-(
>
> We had this before: 3->2 combinations that leave an instruction
> identical to what was there before.  This was just a combination with
> context as well.  The only reason this wasn't a huge problem then
> already was because this is a 3->2 combination, even if it really is a
> 2->1 one it still is beneficial in all the same cases.  But in the new
> case it can iterate indefinitely -- well not quite, but some polynomial
> number of times, for a polynomial at least of degree three, possibly
> more :-(
>
> With this patch you need to show combine still is linear.  I don't think
> it is, but some deeper analysis might show it still is.
>
>   ~ - ~ - ~
>
> What should *really* be done is something that has been on the wish list
> for decades: an uncse pass.
>
> The things that combine no longer works on after my patch are actually
> 1->1 combinations (which we never do currently, although we probably
> should); or alternatively, an un-CSE followed by a 2->1 combination.
>
> We can do the latter of course, but we need to do an actual uncse first!
> Somewhere before combine, and then redo a CSE after it.  An actual CSE,
> not doing ten gazillion other things.

Can you give a specific example of a 2->2 combination that we still
want to apply after r14-9692-g839bc42772ba7a?

2->2 combinations as I understand them were added by
c4c5ad1d6d1e1e1fe7a1c2b3bb097cc269dc7306:

Author: Segher Boessenkool <segher@kernel.crashing.org>
Date:   Mon Jul 30 15:18:17 2018 +0200

    combine: Allow combining two insns to two insns

    This patch allows combine to combine two insns into two.  This helps
    in many cases, by reducing instruction path length, and also allowing
    further combinations to happen.  PR85160 is a typical example of code
    that it can improve.

    This patch does not allow such combinations if either of the original
    instructions was a simple move instruction.  In those cases combining
    the two instructions increases register pressure without improving the
    code.  With this move test register pressure does no longer increase
    noticably as far as I can tell.

    (At first I also didn't allow either of the resulting insns to be a
    move instruction.  But that is actually a very good thing to have, as
    should have been obvious).

            PR rtl-optimization/85160
            * combine.c (is_just_move): New function.
            (try_combine): Allow combining two instructions into two if neither of
            the original instructions was a move.

That patch didn't have a testcase, but one was added later in
81bdfc1e2940fc93bcd0bba4416daff47f04f3b3:

    testcase for 2-2 combine

    gcc/testsuite/
            PR rtl-optimization/85160
            * gcc.target/powerpc/combine-2-2.c: New testcase.

But this is the powerpc test that regresses with the recent patch (PR114518).

The patches reference aarch64 bug PR85160.  If I check out and build
c4c5ad1d6d1e above, I can see that it does indeed remove two mvns from
the PR85160 testcase.  The diff from c4c5ad1d6d1e~ is:

@@ -10,12 +10,10 @@
        .cfi_startproc
        ldr     w3, [x2, w3, sxtw 2]
        ldr     w2, [x2, w4, sxtw 2]
-       mvn     w3, w3
-       mvn     w2, w2
-       and     w4, w3, w1
-       and     w1, w2, w1
-       and     w3, w3, w0
-       and     w2, w2, w0
+       bic     w4, w1, w3
+       bic     w3, w0, w3
+       bic     w1, w1, w2
+       bic     w2, w0, w2
        asr     w4, w4, 9
        asr     w1, w1, 7
        orr     w3, w4, w3, asr 7

(which is great).  But if I apply 839bc42772ba on top of c4c5ad1d6d1e
then the optimisation is undone.

Is that the intention?  I.e. are we effectively removing the kind of
2->2 combinations added in c4c5ad1d6d1e1e1fe?  If so, why not simply
revert c4c5ad1d6d1e1e1fe itself?

Or is there a specific testcase that is still optimised with the
combination of c4c5ad1d6d1e1e1fe and 839bc42772ba7a that would not
be optimised without c4c5ad1d6d1e1e1fe?  If so, can you say what it is?

Thanks,
Richard