From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/112508] [14 Regression] Size regression when using -Os starting with r14-4089-gd45ddc2c04e
Date: Fri, 16 Feb 2024 08:11:05 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112508

--- Comment #3 from Richard Biener ---
Loop store-motion is a difficult thing to cost - it's a critical enabler for
many of our loop optimizations, including scalar evolution analysis.  That
said, this might not hold so much for the cases where we end up using an
extra flag to avoid store data races, and this example also shows we're doing
a bad job of unifying flags for variables stored in the same blocks (we don't
try to do this at all ...).  Value-numbering has difficulties getting from
zero flags to "same flags" - it only manages to elide one flag (but maybe
that's all we can do; I didn't analyze it in detail).

Variables that are set conditionally (conditionally within a loop, not so
much in conditionally executed subloops) are at least less likely to help
SCEV, so cost modeling store-motion of those (aka estimating register
pressure in a simplistic way, like counting the number of IVs) might be a
way to combat this.  Or, for example, disable conditional store-motion
entirely at -Os.

For targets where -Os matters, -fallow-store-data-races would likely be a
way to rescue this.  With that I get, on x86_64:

main1:
.LFB1:
        .cfi_startproc
        movb    h(%rip), %sil
        movl    d(%rip), %edx
        movl    g(%rip), %edi
        movl    e(%rip), %ecx
        movl    f(%rip), %eax
.L2:
        testb   %sil, %sil
        je      .L5
        movl    %eax, %ecx
.L6:
        movl    %ecx, %eax
        cmpl    $9, %ecx
        jg      .L9
        testl   %edx, %edx
        je      .L3
        xorl    %edi, %edi
.L3:
        incl    %ecx
        jmp     .L6
.L9:
        decl    %esi
        xorl    %ecx, %ecx
        xorl    %edx, %edx
        jmp     .L2
.L5:
        movb    $0, h(%rip)
        movl    %eax, f(%rip)
        movl    %ecx, e(%rip)
        movl    %edi, g(%rip)
        movl    %edx, d(%rip)
        ret

Actionable items:
 a) disable flag store motion for cold loops (or for stores only happening
    in cold parts of the loop)
 b) optimize flag variable allocation (try to use the same flag for multiple
    vars)
 c) some kind of register pressure estimation, possibly only for
    non-innermost loops
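
To make item b) concrete, here is a minimal C sketch (NOT the testcase from
this PR; the globals and names below are made up for illustration) of the
rough shape conditional store-motion produces when store data races must be
avoided - one guard flag per promoted variable:

    /* Hypothetical input: two globals conditionally stored inside a loop.  */
    int x, y;

    void before (int n, int c)
    {
      for (int i = 0; i < n; i++)
        if (c)
          {
            x = i;
            y = i;
          }
    }

    /* Rough shape after conditional store-motion without
       -fallow-store-data-races: each promoted variable gets a register
       copy plus a guard flag, and the write-back at the loop exit is
       guarded by that flag.  */
    void after_sketch (int n, int c)
    {
      int x_tmp = x, y_tmp = y;
      _Bool x_written = 0, y_written = 0;  /* two flags, set in the same block */
      for (int i = 0; i < n; i++)
        if (c)
          {
            x_tmp = i; x_written = 1;
            y_tmp = i; y_written = 1;
          }
      if (x_written)
        x = x_tmp;
      if (y_written)
        y = y_tmp;
    }

Since x_written and y_written are set in exactly the same blocks they could
share one flag (item b), and with -fallow-store-data-races the flags and the
guarded write-back go away entirely, which is what gives the smaller code
shown above.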