From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: by sourceware.org (Postfix, from userid 48)
	id 32B72384AB5B; Fri,  3 May 2024 05:51:27 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 32B72384AB5B
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1714715487;
	bh=oX49vafKTh+IoitgXMqnSTuXsnIoBKKwssuhTVfbG1c=;
	h=From:To:Subject:Date:From;
	b=SpTm/u4gVFxuqAG+VX6LSXhHpBICo4FxAxftChJu+vaB0Hh61wqyYaQHP+6ihdqJ+
	 wiKWLcjqtPetG3tlGWTQdfY6Q/SznvfjBU3w4j1pEBSpV5ne58x17z/zXY4FvtsCgD
	 lVt5GPOOqYPOTse7iR07mbSBtPL4dC+DVJiBdyrY=
From: "tnfchris at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/114932] New: Improvement in CHREC can give large performance gains
Date: Fri, 03 May 2024 05:51:26 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: tnfchris at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status keywords bug_severity priority component assigned_to reporter cc target_milestone
Message-ID:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

            Bug ID: 114932
           Summary: Improvement in CHREC can give large performance gains
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
                CC: rguenth at gcc dot gnu.org
  Target Milestone: ---

With the
original fix from PR114074 applied (e.g.
g:a0b1798042d033fd2cc2c806afbb77875dd2909b) we saw not only regressions but
also big improvements.

The following testcase:

---
module brute_force
  integer, parameter :: r=9
  integer block(r, r, r)
contains
  subroutine brute
    k = 1
    call digits_2(k)
  end
  recursive subroutine digits_2(row)
    integer, intent(in) :: row
    logical OK
    do i1 = 0, 1
      do i2 = 1, 1
        do i3 = 1, 1
          do i4 = 0, 1
            do i5 = 1, select
              do i6 = 0, 1
                do i7 = l0, u0
                  select case(1 )
                  case(1)
                    block(:2, 7:, i7) = block(:2, 7:, i7) - 1
                  end select
                  do i8 = 1, 1
                    do i9 = 1, 1
                      if(row == 5) then
                      elseif(OK)then
                        call digits_2(row + 1)
                      end if
                    end do
                  end do
                  block(:, 1, i7) = select
                end do
              end do
            end do
          end do
        end do
        block = 1
      end do
      block = 1
      block = block0 + select
    end do
  end
end
---

compiled with:

-mcpu=neoverse-v1 -Ofast -fomit-frame-pointer foo.f90

gets vectorized after sra and constprop. But the final addressing modes are so
complicated that IVopts generates a register offset mode:

  4c:   2f00041d        mvni    v29.2s, #0x0
  50:   fc666842        ldr     d2, [x2, x6]
  54:   fc656841        ldr     d1, [x2, x5]
  58:   fc646840        ldr     d0, [x2, x4]
  5c:   0ebd8442        add     v2.2s, v2.2s, v29.2s
  60:   0ebd8421        add     v1.2s, v1.2s, v29.2s
  64:   0ebd8400        add     v0.2s, v0.2s, v29.2s

which is harder for prefetchers to follow. With the patch applied, the
compiler was able to correctly lower these to the immediate offset loads that
the scalar code was using:

  38:   2f00041d        mvni    v29.2s, #0x0
  34:   fc594002        ldur    d2, [x0, #-108]
  40:   fc5b8001        ldur    d1, [x0, #-72]
  44:   fc5dc000        ldur    d0, [x0, #-36]
  48:   0ebd8442        add     v2.2s, v2.2s, v29.2s
  4c:   0ebd8421        add     v1.2s, v1.2s, v29.2s
  50:   0ebd8400        add     v0.2s, v0.2s, v29.2s

and it also removed all the additional instructions needed to keep x6, x5 and
x4 up to date.

This gave 10%+ improvements on various workloads.

(PS: I'm looking at the __brute_force_MOD_digits_2.constprop.3.isra.0
specialization.)

I will try to reduce it more, but am filing this so we can keep track of it
and hopefully fix it.
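
A note on the immediate offsets in the ldur listing: they are 36 bytes apart, which is what one would expect for the column stride of block (r = 9 elements of 4 bytes each), so each load plausibly fetches block(1:2, j, i7) for j = 7, 8, 9. The Python sketch below only reproduces that arithmetic. It assumes a 4-byte default INTEGER and, purely for illustration, that x0 points at the start of the next plane, block(1, 1, i7+1); neither assumption is confirmed by the dump, and the helper names are hypothetical.

```python
R = 9            # extent of each dimension, from `integer, parameter :: r=9`
ELEM_BYTES = 4   # assumption: default INTEGER occupies 4 bytes


def byte_offset(i, j, k):
    """Byte offset of block(i, j, k) from block(1, 1, 1).

    Fortran arrays are column-major with 1-based indices, so the first
    index varies fastest.
    """
    return ((i - 1) + (j - 1) * R + (k - 1) * R * R) * ELEM_BYTES


def offsets_from_next_plane(k):
    """Offsets of block(1, 7:9, k) relative to block(1, 1, k+1).

    The choice of base address is an assumption made for illustration.
    """
    base = byte_offset(1, 1, k + 1)
    return [byte_offset(1, j, k) - base for j in range(7, R + 1)]


if __name__ == "__main__":
    # The result does not depend on k, which is what allows a constant
    # immediate to be used in the addressing mode.
    print(offsets_from_next_plane(1))  # [-108, -72, -36]
```

Under those assumptions the offsets are loop-invariant constants, so a single pointer induction variable is enough to address all three loads, whereas the register offset form needs a separate index register per load.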