From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: by sourceware.org (Postfix, from userid 48) id 69471385840E; Mon, 4 Mar 2024 14:48:38 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 69471385840E
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1709563718;
	bh=9lFtlbAdRMcTZz1tOFyIHkEY7xDyan8OQhI6275avWY=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=GFeI+BHeYOkCkvgS0ElY/9Vqskv9+7GKLeVvHljzsOsa9oXy4bEGKV0YTuzyzJGFi
	 45wkvVQA3B4q0EkHh5mq7fSQUWO/kl36egeeraHiaoup00xmGgX9Kc5d4U8DzUNVqT
	 Tdc7b2hntR6QN9Wn82huLRslrlzUmqoNlQHSyRFE=
From: "rsandifo at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
Date: Mon, 04 Mar 2024 14:48:37 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rsandifo at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 14.0
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields:
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #38 from Richard Sandiford ---
(In reply to Richard Biener from comment #37)
> Even more iteration looks bad.  I do wonder why when gather can avoid
> peeling for GAPs using load-lanes cannot?
Like you say, we don't realise that all the loads from array3[i] form a
single group.
Note that we're not using load-lanes in either case, since the group size
(8) is too big for that.  But load-lanes and load-and-permute have the same
restriction about when peeling for gaps is required.  In contrast, gather
loads only ever load data that they actually need.

> Also for the stores we seem to use elementwise stores rather than store-lanes.
What configuration are you trying?  The original report was about SVE, so I
was trying that.  There we use a scatter store.

> To me the most obvious thing to try optimizing in this testcase is DR
> analysis.  With -march=armv8.3-a I still see
>
> t.c:26:22: note: === vect_analyze_data_ref_accesses ===
> t.c:26:22: note: Detected single element interleaving array1[0][_8] step 4
> t.c:26:22: note: Detected single element interleaving array1[1][_8] step 4
> t.c:26:22: note: Detected single element interleaving array1[2][_8] step 4
> t.c:26:22: note: Detected single element interleaving array1[3][_8] step 4
> t.c:26:22: note: Detected single element interleaving array1[0][_1] step 4
> t.c:26:22: note: Detected single element interleaving array1[1][_1] step 4
> t.c:26:22: note: Detected single element interleaving array1[2][_1] step 4
> t.c:26:22: note: Detected single element interleaving array1[3][_1] step 4
> t.c:26:22: missed: not consecutive access array2[_4][_8] = _69;
> t.c:26:22: note: using strided accesses
> t.c:26:22: missed: not consecutive access array2[_4][_1] = _67;
> t.c:26:22: note: using strided accesses
>
> so we don't figure
>
> Creating dr for array1[0][_1]
>       base_address: &array1
>       offset from base address: (ssizetype) ((sizetype) (m_111 * 2) * 2)
>       constant offset from base address: 0
>       step: 4
>       base alignment: 16
>       base misalignment: 0
>       offset alignment: 4
>       step alignment: 4
>       base_object: array1
>       Access function 0: {m_111 * 2, +, 2}_4
>       Access function 1: 0
> Creating dr for array1[0][_8]
> analyze_innermost: success.
>       base_address: &array1
>       offset from base address: (ssizetype) ((sizetype) (m_111 * 2 + 1) * 2)
>       constant offset from base address: 0
>       step: 4
>       base alignment: 16
>       base misalignment: 0
>       offset alignment: 2
>       step alignment: 4
>       base_object: array1
>       Access function 0: {m_111 * 2 + 1, +, 2}_4
>       Access function 1: 0
>
> belong to the same group (but the access functions tell us it worked out).
> Above we fail to split the + 1 to the constant offset.
OK, but this is moving the question on to how we should optimise the
testcase for Advanced SIMD rather than SVE, and how we should optimise the
testcase in general, rather than simply recover what we could do before.
(SVE is only enabled for -march=armv9-a and above, in case armv8.3-a was
intended to enable SVE too.)