From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 32B72384AB5B; Fri,  3 May 2024 05:51:27 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 32B72384AB5B
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1714715487;
	bh=oX49vafKTh+IoitgXMqnSTuXsnIoBKKwssuhTVfbG1c=;
	h=From:To:Subject:Date:From;
	b=SpTm/u4gVFxuqAG+VX6LSXhHpBICo4FxAxftChJu+vaB0Hh61wqyYaQHP+6ihdqJ+
	 wiKWLcjqtPetG3tlGWTQdfY6Q/SznvfjBU3w4j1pEBSpV5ne58x17z/zXY4FvtsCgD
	 lVt5GPOOqYPOTse7iR07mbSBtPL4dC+DVJiBdyrY=
From: "tnfchris at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/114932] New: Improvement in CHREC can give
 large performance gains
Date: Fri, 03 May 2024 05:51:26 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: tnfchris at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status
 keywords bug_severity priority component assigned_to reporter cc
 target_milestone
Message-ID: <bug-114932-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3D114932

            Bug ID: 114932
           Summary: Improvement in CHREC can give large performance gains
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
                CC: rguenth at gcc dot gnu.org
  Target Milestone: ---

With the original fix from PR114074 applied (e.g.
g:a0b1798042d033fd2cc2c806afbb77875dd2909b) we not only saw regressions but=
 saw
big improvements.

The following testcase:

---
  module brute_force
    integer, parameter :: r=3D9
     integer  block(r, r, r)
    contains
  subroutine brute
      k =3D 1
call digits_2(k)
  end
   recursive subroutine digits_2(row)
  integer, intent(in) :: row
  logical OK
     do i1 =3D 0, 1
      do i2 =3D 1, 1
          do i3 =3D 1, 1
           do i4 =3D 0, 1
                do i5 =3D 1, select
                     do i6 =3D 0, 1
                         do i7 =3D l0, u0
                       select case(1 )
                       case(1)
                           block(:2, 7:, i7) =3D block(:2, 7:, i7) - 1
                       end select
                            do i8 =3D 1, 1
                               do i9 =3D 1, 1
                            if(row =3D=3D 5) then
                          elseif(OK)then
                                    call digits_2(row + 1)
                                end if
                                end do
                          end do
                       block(:, 1, i7) =3D   select
                    end do
                    end do
              end do
              end do
           end do
        block =3D 1
     end do
     block =3D 1
     block =3D block0 + select
  end do
 end
  end
---

compiled with: -mcpu=3Dneoverse-v1 -Ofast -fomit-frame-pointer foo.f90

gets vectorized after sra and constprop.  But the final addressing modes ar=
e so
complicated that IVopts generates a register offset mode:

  4c:   2f00041d        mvni    v29.2s, #0x0
  50:   fc666842        ldr     d2, [x2, x6]
  54:   fc656841        ldr     d1, [x2, x5]
  58:   fc646840        ldr     d0, [x2, x4]
  5c:   0ebd8442        add     v2.2s, v2.2s, v29.2s
  60:   0ebd8421        add     v1.2s, v1.2s, v29.2s
  64:   0ebd8400        add     v0.2s, v0.2s, v29.2s

which is harder for prefetchers to follow.  When the patch was applied it w=
as
able to correctly lower these to the immediate offset loads that the scalar
code was using:

  38:   2f00041d        mvni    v29.2s, #0x0
  34:   fc594002        ldur    d2, [x0, #-108]
  40:   fc5b8001        ldur    d1, [x0, #-72]
  44:   fc5dc000        ldur    d0, [x0, #-36]
  48:   0ebd8442        add     v2.2s, v2.2s, v29.2s
  4c:   0ebd8421        add     v1.2s, v1.2s, v29.2s
  50:   0ebd8400        add     v0.2s, v0.2s, v29.2s

and also removes all the additional instructions to keep x6,x5 and x4 up to
date.

This gave 10%+ improvements on various workloads.

(ps I'm looking at the __brute_force_MOD_digits_2.constprop.3.isra.0
specialization).

I will try to reduce it more, but am filing this so we can keep track and
hopefully fix.=