From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 55ED33858431; Fri, 20 Sep 2024 09:33:13 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 55ED33858431
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1726824793;
	bh=ReNqnbojWZFg1u/FMR7QvwfAcBv77FyrX3Wm2Sphm6M=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=k35x60f27/pwOwmv5RmVLfsgTv6ME0tjbVIw2cTGl+/2LVQ2qXem0dQvuess434e8
	 kFNaTQZihd4Y/qOZi/9Q5Fn6XfF9KYRWtn8QNNpwLMg6+64QjHurB0Cnf1ZzZZnN/v
	 k7nfB5bSCn3L0Bzm3jy0i4cu8uqgOmQggDBfUKaQ=
From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/116583] vectorizable_slp_permutation cannot
 handle even/odd extract from VLA vector
Date: Fri, 20 Sep 2024 09:33:12 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 15.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenth at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-116583-4-7Um2KgIgZ8@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-116583-4@http.gcc.gnu.org/bugzilla/>
References: <bug-116583-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
The key thing to notice here is that the regular interleaving scheme knows
there are enough vectors to perform two-vector-to-one permutes within the
same group, while we only have a single child for the VEC_PERM_EXPR, which
for the permute in question effectively means we have to take "two" VLA
vectors.
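
To illustrate why an even extract needs "two" input vectors, here is a
minimal scalar sketch in C.  It is not GCC code: extract_even,
source_vector and nlanes are hypothetical names, and nlanes merely stands
in for the runtime lane count of one VLA vector.

```c
#include <stddef.h>

/* Scalar model of an "even extract" permute; NOT GCC internals.
   nlanes is an assumed stand-in for the lane count of one VLA vector.  */
void extract_even(const int *in, int *out, size_t nlanes)
{
  /* Output lane i reads input lane 2*i, so filling nlanes output lanes
     reads input lanes 0 .. 2*nlanes-2, which spans two input vectors.  */
  for (size_t i = 0; i < nlanes; i++)
    out[i] = in[2 * i];
}

/* Index of the input vector (0 or 1) that output lane i reads from.  */
size_t source_vector(size_t i, size_t nlanes)
{
  return (2 * i) / nlanes;
}
```

With nlanes == 4 the first two output lanes come from input vector 0 and
the last two from input vector 1, so a single-input permute cannot fill
one full output vector.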

The non-SLP interleaving scheme for this performs multiple VLA loads while
we'd have a contiguous load node that we'd permute later on but we're usually
not emitting multiple loads(?).  For gcc.dg/vect/slp-42.c we do end up
(after re-analyzing with single-lane SLP) with store-lanes for the 4 element
store but SVE doesn't support 8 element load-lanes (we could use 4 element
load lanes with u64 elements - missing feature).

I do think the VLA interleaving scheme we produce is quite inefficient
(and the cost modeling agrees and would choose V4SI fixed-size regs).