From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/116583] vectorizable_slp_permutation cannot handle even/odd extract from VLA vector
Date: Fri, 20 Sep 2024 09:33:12 +0000
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 15.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583

--- Comment #4 from Richard Biener ---
So the key thing to notice here is that the regular interleaving knows there are enough vectors to perform two-vector-to-one permutes within the same group, whereas we only have a single child for the VEC_PERM_EXPR, which for the permute in question effectively means we have to take "two" VLA vectors.
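To make the "two" vectors point concrete, here is a minimal scalar sketch of an even-element extract. It is only an illustration, not vectorizer code: `extract_even`, `input_vectors_needed`, and the `vl` parameter (standing in for the runtime vector length in elements) are hypothetical names introduced for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of an even-element extract.  VL stands in
   for the runtime (VLA) vector length in elements.  */
static void
extract_even (const int32_t *in, int32_t *out, size_t vl)
{
  /* Output lane i reads input lane 2*i, so the upper half of the result
     reads from in[vl] onwards, i.e. from a second input vector.  */
  for (size_t i = 0; i < vl; i++)
    out[i] = in[2 * i];
}

/* Number of VL-lane input vectors touched when producing one full
   VL-lane output vector with the permute above.  */
static size_t
input_vectors_needed (size_t vl)
{
  assert (vl > 0);
  size_t last_index = 2 * (vl - 1);
  return last_index / vl + 1;
}
```

For any `vl` of at least 2 the second helper returns 2, so a permute node with a single child vector cannot fill a whole output vector by itself.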
The non-SLP interleaving scheme for this performs multiple VLA loads, while we'd have a contiguous load node that we'd permute later on, but we're usually not emitting multiple loads(?).

For gcc.dg/vect/slp-42.c we do end up (after re-analyzing with single-lane SLP) with store-lanes for the 4-element store, but SVE doesn't support 8-element load-lanes (we could use 4-element load-lanes with u64 elements; that is a missing feature).

I do think the VLA interleaving scheme we produce is quite inefficient (the cost modeling agrees and would choose V4SI fixed-size regs).
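One possible reading of the "4-element load-lanes with u64 elements" remark, sketched as scalar C under stated assumptions: a group of eight 32-bit fields is treated as four 64-bit elements, a 4-way de-interleave is done on those, and each result still holds adjacent pairs of fields that need an even/odd split. The names `load_lanes4_u64` and `split_pairs` are hypothetical, and the code models the data movement only, not what GCC emits.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GROUP 8 /* 32-bit fields per group (assumed group size) */
#define LANES 4 /* de-interleave width assumed to be available */

/* De-interleave N groups of eight u32 fields by copying each adjacent
   pair of fields as one opaque 64-bit element.  Row k of OUT (N
   elements, stored at out[k * n]) receives fields 2k and 2k+1 of every
   group.  */
static void
load_lanes4_u64 (const uint32_t *mem, uint64_t *out, size_t n)
{
  for (size_t g = 0; g < n; g++)
    for (size_t k = 0; k < LANES; k++)
      memcpy (&out[k * n + g], &mem[g * GROUP + 2 * k], sizeof (uint64_t));
}

/* Split one row of N 64-bit elements back into its two u32 fields.
   This is the even/odd extract that remains after the 4-way load.  */
static void
split_pairs (const uint64_t *row, uint32_t *even, uint32_t *odd, size_t n)
{
  for (size_t g = 0; g < n; g++)
    {
      uint32_t pair[2];
      memcpy (pair, &row[g], sizeof pair);
      even[g] = pair[0];
      odd[g] = pair[1];
    }
}
```

The sketch copies bytes with `memcpy` rather than doing arithmetic on the 64-bit values, so the field order within each pair follows memory order regardless of endianness.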