From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 7EBE83858431; Mon,  3 Jun 2024 14:11:47 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 7EBE83858431
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1717423907;
	bh=EoEhcAGLl0cSggVGeLOf1csUE0CWB+eMLLpEi4hkTUg=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=f2KSp86WWnzYzeUo7t4sHh5QpEhMkP8k2JfnB7CWEmxO9quMObkp5yYNewHx9EA/J
	 oI7z4pBKPUb3NrNZutkdr9SzxFuFfdbdGtS/0nmiqSUJDSSvbsW75ZBQrM62TJngWg
	 f0XGz6wtqbsIiUmkujDch97yK3vyi69UPcEPWyZA=
From: "ams at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/115304] gcc.dg/vect/slp-gap-1.c FAILs
Date: Mon, 03 Jun 2024 14:11:47 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 15.0
X-Bugzilla-Keywords: testsuite-fail
X-Bugzilla-Severity: normal
X-Bugzilla-Who: ams at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 15.0
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-115304-4-lkePJs81Ea@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-115304-4@http.gcc.gnu.org/bugzilla/>
References: <bug-115304-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115304
--- Comment #11 from Andrew Stubbs <ams at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #10)
> On Mon, 3 Jun 2024, ams at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115304
> >
> > --- Comment #9 from Andrew Stubbs <ams at gcc dot gnu.org> ---
> > (In reply to Richard Biener from comment #6)
> > > The best strategy for GCN would be to gather V4QImode aka SImode into the
> > > V64QImode (or V16SImode) vector.  For pix2 we have a gap of 28 elements,
> > > doing consecutive loads isn't a good strategy here.
> >
> > I don't fully understand what you're trying to say here, so apologies if you
> > knew all this already and I missed the point.....
> >
> > In general, on GCN V4QImode is not in any way equivalent to SImode (when the
> > values are in registers). The vector registers are not one single string of
> > re-interpretable bits.
> >
> > For the same reason, you can't load a value as V64QImode and then try to
> > interpret it as V16SImode. GCN vector registers just don't work like
> > SSE/Neon/etc.
> >
> > When you load a V64QImode vector, each lane is extended to 32 bits, so what you
> > actually get in hardware is a V64SImode vector.
> >
> > Likewise, when you load a V4QImode vector the hardware representation is
> > actually V4SImode (which in itself is just V64SImode with undefined values in
> > the unused lanes).
>
> I see.  I wonder if there's not one or two latent wrong-code because of
> this and the vectorizers assumptions ;)  I suppose modes_tieable_p
> will tell us whether a VIEW_CONVERT_EXPR will do the right thing?
> Is GET_MODE_SIZE (V64QImode) == GET_MODE_SIZE (V64SImode) btw?
> And V64QImode really V64PSImode?

The mode size says how big the value will be when written to memory, so no,
they're not the same. I believe this matches the scalar QImode behaviour.
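
To illustrate the difference between memory size and register layout, here is
a rough scalar model in plain C. It is only a sketch, not GCC or GCN code: the
type and function names are invented for illustration, and the use of
zero-extension on load is an assumption (the extension kind is not stated
above).

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 64 /* lanes in a GCN vector register */

/* Memory image of a V64QImode value: 64 one-byte elements, 64 bytes. */
typedef struct { uint8_t elem[LANES]; } v64qi_mem;

/* Hypothetical register model: one 32-bit slot per lane, 256 bytes. */
typedef struct { uint32_t lane[LANES]; } v64_reg;

/* Load: each byte lands in its own 32-bit lane.
   Zero-extension is assumed here purely for illustration. */
void load_v64qi(v64_reg *r, const v64qi_mem *m)
{
    for (size_t i = 0; i < LANES; i++)
        r->lane[i] = m->elem[i];
}

/* Store: each lane is truncated back to a single byte. */
void store_v64qi(v64qi_mem *m, const v64_reg *r)
{
    for (size_t i = 0; i < LANES; i++)
        m->elem[i] = (uint8_t)r->lane[i];
}
```

In this model lane 0 holds only the first byte after a load, not the first
four bytes packed together, which is why reading the same register as
V16SImode would not give the reinterpretation you would get on SSE/Neon.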

We don't use any PSI modes. There are (some) machine instructions for V64QImode
(and V64HImode) so we don't want to lose that information.

There may well be some bugs, but we have handling for conversions in a number
of places. There are truncate and extend patterns that operate lane-wise, and
vec_extract can take a subset of a vector, IIRC.

> Still for a V64QImode load on { c[0], c[1], c[2], c[3], c[32], c[33],
> c[34], c[35], ... } it's probably best to use a single V64QImode gather
> with GCN then rather than four "consecutive" V64QImode loads and then
> element swizzling.

Fewer loads are always better, and permutations are expensive operations (and
don't work with 64-lane vectors on RDNA devices because they're actually two
32-lane vectors stuck together), so it can certainly make sense to use gather
with a vector of permuted offsets (although it can be expensive to generate
that vector in the first place).
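
As a rough sketch of what that offset vector looks like for the access pattern
quoted above ({ c[0], c[1], c[2], c[3], c[32], c[33], ... }), here is a scalar
C model. The group size of 4 and stride of 32 are read off that example; the
function names are hypothetical, and this only models the addressing, not how
the backend would actually emit the gather.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES  64 /* lanes in a V64QImode vector */
#define GROUP   4 /* consecutive bytes wanted per group */
#define STRIDE 32 /* byte distance between group starts (from the example) */

/* Build the per-lane byte offsets { 0, 1, 2, 3, 32, 33, 34, 35, ... }. */
void build_offsets(uint32_t offsets[LANES])
{
    for (size_t lane = 0; lane < LANES; lane++)
        offsets[lane] = (uint32_t)((lane / GROUP) * STRIDE + lane % GROUP);
}

/* Scalar model of a gather: lane i reads base[offsets[i]].
   The caller must ensure base covers the largest offset (483 here). */
void gather_bytes(uint8_t dst[LANES], const uint8_t *base,
                  const uint32_t offsets[LANES])
{
    for (size_t lane = 0; lane < LANES; lane++)
        dst[lane] = base[offsets[lane]];
}
```

The division and modulo per lane are the "expensive to generate" part; if the
pattern is loop-invariant the offsets would only need to be computed once and
reused across iterations.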