From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id D56B5385B50E; Fri,  8 Mar 2024 10:25:37 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org D56B5385B50E
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1709893537;
	bh=JHI+0FIcrBvhFBseSdAiHTpFw+AqGwbJivgbo1qXnXU=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=eSAkz/vbnMWeBzdPu1jhn+hNZ6j8/cSNq+c9pxp13POfFw0WJu5gMCwJo1OgDQ69j
	 EJ0ARn4uZsuhRyHFafFfZwDmXbkimdiZdcJLzcM62oPJwbIQQtjxOXWyx8TnRZiV5K
	 OzRRG2OKTwXgJLYfx4wZGPUyH0CiRyyGZmncu1kA=
From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/114269] [14 Regression] Multiple 3-27% exec
 time regressions of 434.zeusmp since r14-9193-ga0b1798042d033
Date: Fri, 08 Mar 2024 10:25:36 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenth at gcc dot gnu.org
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: rguenth at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 14.0
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-114269-4-PGD3Tgqi22@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-114269-4@http.gcc.gnu.org/bugzilla/>
References: <bug-114269-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114269
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
good (base) vs. bad (peak) on Zen2 with -Ofast -march=native shows

Samples: 654K of event 'cycles', Event count (approx.): 743149709374
Overhead       Samples  Command          Shared Object               Symbol
  16.71%        109793  zeusmp_peak.amd  zeusmp_peak.amd64-m64-mine  [.] hsmoc_
  14.37%         94016  zeusmp_base.amd  zeusmp_base.amd64-m64-mine  [.] hsmoc_
   8.82%         57979  zeusmp_peak.amd  zeusmp_peak.amd64-m64-mine  [.] lorentz_
   8.48%         55451  zeusmp_base.amd  zeusmp_base.amd64-m64-mine  [.] lorentz_
   4.84%         31575  zeusmp_peak.amd  zeusmp_peak.amd64-m64-mine  [.] momx3_
   4.68%         30456  zeusmp_base.amd  zeusmp_base.amd64-m64-mine  [.] momx3_
   4.08%         26675  zeusmp_peak.amd  zeusmp_peak.amd64-m64-mine  [.] tranx3_
   3.56%         23145  zeusmp_base.amd  zeusmp_base.amd64-m64-mine  [.] tranx3_

for hsmoc_ it looks like a difference in transformations done:

-hsmoc.f:826:19: optimized: loop vectorized using 32 byte vectors

(there are a lot more missed vectorizations).

       subroutine hsmoc ( emf1, emf2, emf3 )

       integer is, ie, js, je, ks, ke
       common /gridcomi/
     &   is, ie, js, je, ks, ke
       integer in, jn, kn, ijkn
       integer      i       , j       , k
       parameter(in =           128+5
     &        , jn =           128+5
     &        , kn =           128+5)
       parameter(ijkn =   128+5)
       real*8         emf1    (  in,  jn,  kn), emf2    (  in,  jn,  kn)
       real*8         vint    (ijkn), bint    (ijkn)

       do 199 j=js,je+1
         do 59 i=is,ie
          do 858 k=ks,ke+1
             vint(k)= k
             bint(k)= k
 858      continue
          do 58 k=ks,ke+1
             emf1(i,j,k) = vint(k)
             emf2(i,j,k) = bint(k)
 58       continue
 59      continue
 199   continue

       return
       end

doesn't reproduce it though.  The actual difference for the whole testcase
is of course failed data-ref analysis:

 Creating dr for (*emf2_1966(D))[_402]
-analyze_innermost: success.
-       base_address: emf2_1966(D)
-       offset from base address: (ssizetype) ((((sizetype) _1928 * 17689 + (sizetype) j_2705 * 133) + (sizetype) i_2672) * 8)
-       constant offset from base address: -142584
-       step: 141512
-       base alignment: 8
+analyze_innermost: hsmoc.f:828:72: missed:  failed: evolution of offset is not affine.
+       base_address:
+       offset from base address:
+       constant offset from base address:
+       step:
+       base alignment: 0

and then

 hsmoc.f:826:19: note:   === vect_analyze_data_ref_accesses ===
-hsmoc.f:826:19: missed:   not consecutive access (*emf1_1964(D))[_402] = _403;
-hsmoc.f:826:19: note:   using strided accesses
-hsmoc.f:826:19: missed:   not consecutive access (*emf2_1966(D))[_402] = _404;
-hsmoc.f:826:19: note:   using strided accesses

and in the bad case we use gather instead and fail because of costs.

I suspect that relying on global ranges (that could save us here) is quite
fragile when there's a lot of other code around and thus opportunity for
random transforms "trashing" them.

Using the patch from PR114151 and enabling ranger during vectorization oddly
enough doesn't help (even when wiping the SCEV cache).

The odd thing is with the testcase above we get

        Access function 0: (integer(kind=8)) {(((unsigned long) _30 * 17689 + (unsigned long) _10) + (unsigned long) _66) + 18446744073709533793, +, 17689}_4;

where you can see some of the unsigned promotion being done, but we
still succeed.
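
For reference, the constants in both dumps appear to follow directly from the
array layout.  The sketch below is plain Python, not part of the testcase, and
byte_offset is a hypothetical helper.  It assumes the usual column-major,
1-based indexing for real*8 emf2(in,jn,kn) with in = jn = 133, and reproduces
the 17689 multiplier, the 141512 step, the -142584 constant offset and the
wrapped unsigned constant from the access function:

```python
# Sketch only: assumes Fortran column-major, 1-based layout for
# real*8 emf2(in,jn,kn) with in = jn = 128+5, as in the reduced testcase.
IN = JN = 128 + 5
ELEM = 8  # bytes per real*8 element


def byte_offset(i, j, k):
    """Byte offset of emf2(i,j,k) from the array base (hypothetical helper)."""
    return ((i - 1) + (j - 1) * IN + (k - 1) * IN * JN) * ELEM


plane = IN * JN                                     # elements per k-plane
step = byte_offset(1, 1, 2) - byte_offset(1, 1, 1)  # byte step per k iteration
const = byte_offset(0, 0, 0)                        # folded "-1" index terms
wrapped = (-(plane + IN + 1)) % 2**64               # same constant, unsigned 64-bit

print(plane, step, const, wrapped)
```

So the large constant in the access function is just -(17689 + 133 + 1)
reduced modulo 2^64, i.e. the same index adjustment expressed in unsigned
arithmetic.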

As I'm lacking a smaller testcase right now it's difficult to understand why
we fail in one case but not the other.