From mboxrd@z Thu Jan 1 00:00:00 1970
From: "rsandifo at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
Date: Mon, 04 Mar 2024 12:07:36 +0000
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #35 from Richard Sandiford ---
Maybe I've misunderstood the flow of the ticket, but it looks to me like we
do still correctly recognise the truncating scatter stores.  And, on their
own, we would be able to convert them into masked scatters.  The reason for
the epilogue is instead on the load side.
There we have a non-strided grouped load, and currently we hard-code the
assumption that it is better to use contiguous loads and permutes rather
than gather loads where possible.  So we have:

  /* As a last resort, trying using a gather load or scatter store.

     ??? Although the code can handle all group sizes correctly,
     it probably isn't a win to use separate strided accesses based
     on nearby locations.  Or, even if it's a win over scalar code,
     it might not be a win over vectorizing at a lower VF, if that
     allows us to use contiguous accesses.  */
  if (*memory_access_type == VMAT_ELEMENTWISE
      && single_element_p
      && loop_vinfo
      && vect_use_strided_gather_scatters_p (stmt_info, loop_vinfo,
                                             masked_p, gs_info))
    *memory_access_type = VMAT_GATHER_SCATTER;

only after we've tried and failed to use load lanes or load+permute.  If
instead I change the order so that the code above is tried first, then we
do use extending gather loads and truncating scatter stores as before,
with no epilogue loop.

So I suppose the question is: if we do prefer to use gathers over
load+permute for some cases, how do we decide which to use?  And can it be
done on a per-load basis, or should it instead be a per-loop decision?
E.g., if we end up with a loop that needs peeling for gaps, perhaps we
should try again and forbid peeling for gaps.  Then, if that succeeds, see
which loop gives the better overall cost.  Of course, trying more things
means more compile time…
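For anyone skimming the thread, the kind of access pattern being discussed
can be sketched with a reduced loop like the one below.  This is a
hypothetical illustration, not the PR's actual testcase: a single-element
interleaving group with a gap, where each iteration touches one element out
of every pair.  For such a loop the vectoriser has to choose between
strided gathers/scatters and contiguous loads plus permutes, and the latter
choice is what can force peeling for gaps and hence an epilogue loop.

```c
#include <assert.h>

/* Hypothetical reduced example (names are made up): a strided,
   single-element grouped access with a gap.  Each iteration loads one
   int out of every pair from SRC and stores it, truncated, into one
   short out of every pair of DST, so vectorising with contiguous
   accesses over-reads past the last element unless the loop is peeled.  */
void
f (short *restrict dst, const int *restrict src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i * 2] = (short) src[i * 2];
}
```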