From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/114269] [14 Regression] Multiple 3-27% exec time regressions of 434.zeusmp since r14-9193-ga0b1798042d033
Date: Fri, 08 Mar 2024 10:25:36 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenth at gcc dot gnu.org
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: rguenth at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 14.0
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114269

--- Comment #3 from Richard Biener ---
good (base) vs. bad (peak) on Zen2 with -Ofast -march=native shows

Samples: 654K of event 'cycles', Event count (approx.): 743149709374
Overhead  Samples  Command          Shared Object               Symbol
  16.71%   109793  zeusmp_peak.amd  zeusmp_peak.amd64-m64-mine  [.] hsmoc_
  14.37%    94016  zeusmp_base.amd  zeusmp_base.amd64-m64-mine  [.] hsmoc_
   8.82%    57979  zeusmp_peak.amd  zeusmp_peak.amd64-m64-mine  [.] lorentz_
   8.48%    55451  zeusmp_base.amd  zeusmp_base.amd64-m64-mine  [.] lorentz_
   4.84%    31575  zeusmp_peak.amd  zeusmp_peak.amd64-m64-mine  [.] momx3_
   4.68%    30456  zeusmp_base.amd  zeusmp_base.amd64-m64-mine  [.] momx3_
   4.08%    26675  zeusmp_peak.amd  zeusmp_peak.amd64-m64-mine  [.] tranx3_
   3.56%    23145  zeusmp_base.amd  zeusmp_base.amd64-m64-mine  [.] tranx3_

for hsmoc_ it looks like a difference in transformations done:

-hsmoc.f:826:19: optimized: loop vectorized using 32 byte vectors

(there are a lot more missed vectorizations).

      subroutine hsmoc ( emf1, emf2, emf3 )
      integer is, ie, js, je, ks, ke
      common /gridcomi/
     &   is, ie, js, je, ks, ke
      integer in, jn, kn, ijkn
      integer i , j , k
      parameter(in = 128+5
     &        , jn = 128+5
     &        , kn = 128+5)
      parameter(ijkn = 128+5)
      real*8 emf1 ( in, jn, kn), emf2 ( in, jn, kn)
      real*8 vint (ijkn), bint (ijkn)
      do 199 j=js,je+1
       do 59 i=is,ie
        do 858 k=ks,ke+1
          vint(k)= k
          bint(k)= k
858     continue
        do 58 k=ks,ke+1
          emf1(i,j,k) = vint(k)
          emf2(i,j,k) = bint(k)
58      continue
59     continue
199   continue
      return
      end

doesn't reproduce it though.  The actual difference for the whole testcase
is of course failed data-ref analysis:

 Creating dr for (*emf2_1966(D))[_402]
-analyze_innermost: success.
-       base_address: emf2_1966(D)
-       offset from base address: (ssizetype) ((((sizetype) _1928 * 17689 + (sizetype) j_2705 * 133) + (sizetype) i_2672) * 8)
-       constant offset from base address: -142584
-       step: 141512
-       base alignment: 8
+analyze_innermost: hsmoc.f:828:72: missed: failed: evolution of offset is not affine.
+       base_address:
+       offset from base address:
+       constant offset from base address:
+       step:
+       base alignment: 0

and then

 hsmoc.f:826:19: note: === vect_analyze_data_ref_accesses ===
-hsmoc.f:826:19: missed: not consecutive access (*emf1_1964(D))[_402] = _403;
-hsmoc.f:826:19: note: using strided accesses
-hsmoc.f:826:19: missed: not consecutive access (*emf2_1966(D))[_402] = _404;
-hsmoc.f:826:19: note: using strided accesses

and we use gather and fail because of costs.

I suspect that relying on global ranges (that could save us here) is quite
fragile when there's a lot of other code around and thus opportunity for
random transforms "trashing" them.

Using the patch from PR114151 and enabling ranger during vectorization
oddly enough doesn't help (even when wiping the SCEV cache).

The odd thing is with the testcase above we get

Access function 0: (integer(kind=8)) {(((unsigned long) _30 * 17689 + (unsigned long) _10) + (unsigned long) _66) + 18446744073709533793, +, 17689}_4;

where you can see some of the unsigned promotion being done, but we still
succeed.  As I'm lacking a smaller testcase right now, it's difficult to
understand why we fail in one case but not the other.
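
As a cross-check of the numbers in the dumps above, the following is a minimal C sketch (not part of GCC or of the testcase) that recomputes them from the array extents in = jn = 128+5 and the 8-byte real*8 element size. The names DIM_I, DIM_J, ELEM_SIZE and the three helper functions are hypothetical. Reading the constant offset as the adjustment for Fortran's 1-based indices is an assumption on my side; it happens to match the dump exactly. The last helper shows that the large constant in the access function is the same adjustment, counted in elements and wrapped to unsigned 64 bits, which is what the unsigned promotion would produce.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

/* Extents taken from the testcase: in = jn = 128+5, real*8 elements.
   The identifiers below are illustrative only. */
enum { DIM_I = 133, DIM_J = 133, ELEM_SIZE = 8 };

/* 17689 is the element stride of the k dimension seen in both dumps. */
static_assert(DIM_I * DIM_J == 17689, "k stride in elements");

/* Byte distance between emf(i,j,k) and emf(i,j,k+1) in column-major
   layout; this should equal the "step" of the successful analysis. */
static int64_t k_step_bytes(void)
{
    return (int64_t)DIM_I * DIM_J * ELEM_SIZE;
}

/* Byte offset of the hypothetical element (0,0,0) relative to (1,1,1),
   assuming the constant offset only accounts for 1-based indexing. */
static int64_t origin_offset_bytes(void)
{
    return -((int64_t)DIM_I * DIM_J + DIM_I + 1) * ELEM_SIZE;
}

/* The same adjustment in elements, computed in unsigned 64-bit
   arithmetic, where the negative value wraps modulo 2^64. */
static uint64_t origin_offset_elems_unsigned(void)
{
    return (uint64_t)0 - ((uint64_t)DIM_I * DIM_J + DIM_I + 1);
}
```

Under these assumptions k_step_bytes() gives 141512, origin_offset_bytes() gives -142584, and origin_offset_elems_unsigned() gives 18446744073709533793, i.e. the constants in the good data-ref dump and in the access function of the reduced testcase describe the same indexing, once in signed bytes and once in wrapped unsigned elements.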