From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/115640] [15 Regression] GCN: FAIL: gfortran.dg/vect/pr115528.f -O execution test
Date: Wed, 26 Jun 2024 11:05:56 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640

--- Comment #8 from Richard Biener ---
(In reply to Richard Biener from comment #7)
> I will have a look (and for run validation try to reproduce with gfx1036).

OK, so with gfx1036 we end up using 16-byte vectors and the testcase passes.
The difference with gfx908 is

/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: note: ==> examining statement: _14 = aa[_13];
/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: note: vect_model_load_cost: aligned.
/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: note: vect_model_load_cost: inside_cost = 2, prologue_cost = 0 .

vs.

/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: note: ==> examining statement: _14 = aa[_13];
/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: missed: unsupported vect permute { 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 }
/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: missed: unsupported load permutation
/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:19:72: missed: not vectorized: relevant stmt not supported: _14 = aa[_13];
/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: note: removing SLP instance operations starting from: REALPART_EXPR <(*hadcur_24(D))[_2]> = _86;
/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: missed: unsupported SLP instances
/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: note: re-trying with SLP disabled

so gfx1036 cannot do such permutes but gfx908 can?

On aarch64 with SVE we are using non-SLP and we're doing load-lanes in the
outer loop.  The reason seems to also be the unsupported load permutation,
but that's possibly because of VLA vectors - GCN uses fixed-size vectors
but loop masking.  So the better equivalent would have been x86-64 with
loop masking.

So looking again, I think the loop mask in the inner loop is wrong.
We have

      do i = 1,4
         do j = 1,4
            HADCUR(I)=
     $           HADCUR(I)+CMPLX(COEF1)*FORM1*AA(I,J)
         end do
      end do

and the vectorizer sees

  [local count: 214748368]:
  # i_35 = PHI
  # ivtmp_82 = PHI
  _1 = (integer(kind=8)) i_35;
  _2 = _1 + -1;
  hadcur__I_RE_lsm.15_8 = REALPART_EXPR <(*hadcur_24(D))[_2]>;
  hadcur__I_IM_lsm.16_9 = IMAGPART_EXPR <(*hadcur_24(D))[_2]>;

  [local count: 858993456]:
  # j_36 = PHI
  ...
  _10 = (integer(kind=8)) j_36;
  _11 = _10 * 4;
  _12 = _1 + _11;
  _13 = _12 + -5;
  _14 = aa[_13];
  ...
  j_26 = j_36 + 1;

  [local count: 214748368]:
  # _86 = PHI <_49(4)>
  # _85 = PHI <_50(4)>
  REALPART_EXPR <(*hadcur_24(D))[_2]> = _86;
  IMAGPART_EXPR <(*hadcur_24(D))[_2]> = _85;
  i_27 = i_35 + 1;

The loop mask { -1, -1, -1, -1, -1, -1, -1, -1, 0, .... } is OK for the
outer-loop grouped load

  vect_hadcur__I_RE_lsm.20_76 = .MASK_LOAD (vectp_hadcur.18_79, 64B, loop_mask_77);

but for the inner loop we do

  vect__14.23_71 = .MASK_LOAD (vectp_aa.21_73, 64B, loop_mask_77);

with the same mask.  That mask is not pruned for the gap, which means my
recent improvement to gap handling relies on this case not ending up in the
masked-load handling.  In fact get_group_load_store_type doesn't seem to be
prepared for outer-loop vectorization.  OTOH the inner loop isn't "unrolled"
(it has a VF of 1), and this might be a mistake in the loop-mask handling
and bad re-use.

As was said elsewhere, outer-loop vectorization with inner-loop data
references is compensating for a missed loop interchange.