From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: by sourceware.org (Postfix, from userid 48)
  id 7EBE83858431; Mon, 3 Jun 2024 14:11:47 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 7EBE83858431
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
  s=default; t=1717423907;
  bh=EoEhcAGLl0cSggVGeLOf1csUE0CWB+eMLLpEi4hkTUg=;
  h=From:To:Subject:Date:In-Reply-To:References:From;
  b=f2KSp86WWnzYzeUo7t4sHh5QpEhMkP8k2JfnB7CWEmxO9quMObkp5yYNewHx9EA/J
   oI7z4pBKPUb3NrNZutkdr9SzxFuFfdbdGtS/0nmiqSUJDSSvbsW75ZBQrM62TJngWg
   f0XGz6wtqbsIiUmkujDch97yK3vyi69UPcEPWyZA=
From: "ams at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/115304] gcc.dg/vect/slp-gap-1.c FAILs
Date: Mon, 03 Jun 2024 14:11:47 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 15.0
X-Bugzilla-Keywords: testsuite-fail
X-Bugzilla-Severity: normal
X-Bugzilla-Who: ams at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 15.0
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields:
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115304

--- Comment #11 from Andrew Stubbs ---
(In reply to rguenther@suse.de from comment #10)
> On Mon, 3 Jun 2024, ams at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115304
> >
> > --- Comment #9 from Andrew Stubbs ---
> > (In reply to Richard Biener from comment #6)
> > > The best strathegy for GCN would be to gather V4QImode aka SImode into the
> > > V64QImode (or V16SImode) vector.
> > > For pix2 we have a gap of 28 elements,
> > > doing consecutive loads isn't a good strategy here.
> >
> > I don't fully understand what you're trying to say here, so apologies if you
> > knew all this already and I missed the point.....
> >
> > In general, on GCN V4QImode is not in any way equivalent to SImode (when the
> > values are in registers). The vector registers are not one single string of
> > re-interpretable bits.
> >
> > For the same reason, you can't load a value as V64QImode and then try to
> > interpret it as V16SImode. GCN vector registers just don't work like
> > SSE/Neon/etc.
> >
> > When you load a V64QImode vector, each lane is extended to 32 bits, so what you
> > actually get in hardware is a V64SImode vector.
> >
> > Likewise, when you load a V4QImode vector the hardware representation is
> > actually V4SImode (which in itself is just V64SImode with undefined values in
> > the unused lanes).
>
> I see. I wonder if there's not one or two latent wrong-code because of
> this and the vectorizers assumptions ;) I suppose modes_tieable_p
> will tell us whether a VIEW_CONVERT_EXPR will do the right thing?
> Is GET_MODE_SIZE (V64QImode) == GET_MODE_SIZE (V64SImode) btw?
> And V64QImode really V64PSImode?

The mode size says how big it will be when written to memory, so no they're not
the same. I believe this matches the scalar QImode behaviour.

We don't use any PSI modes. There are (some) machine instructions for V64QImode
(and V64HImode) so we don't want to lose that information.

There may well be some bugs, but we have handling for conversions in a number
of places. There are truncate and extend patterns that operate lane-wise, and
vec_extract can take a subset of a vector, IIRC.

> Still for a V64QImode load on { c[0], c[1], c[2], c[3], c[32], c[33],
> c[34], c[35], ...
> } it's probably best to use a single V64QImode gather
> with GCN then rather than four "consecutive" V64QImode loads and then
> element swizzling.

Fewer loads are always better, and permutations are expensive operations (and
don't work with 64-lane vectors on RDNA devices because they're actually two
32-lane vectors stuck together) so it can certainly make sense to use gather
with a vector of permuted offsets (although it can be expensive to generate
that vector in the first place).
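
To make the "vector of permuted offsets" concrete, here is a minimal Python
model of the idea. It is only a sketch, not GCC or GCN code. It assumes the
access pattern quoted above (groups of 4 bytes, one group starting every 32
bytes), and the names permuted_offsets and gather_bytes are made up for
illustration.

```python
LANES = 64    # lanes in one vector register (V64QImode has 64 elements)
GROUP = 4     # consecutive bytes wanted per group: c[0..3], c[32..35], ...
STRIDE = 32   # distance between group starts: 4 used bytes plus a gap of 28


def permuted_offsets(lanes=LANES, group=GROUP, stride=STRIDE):
    """Return the byte offset that each lane should load from."""
    return [(lane // group) * stride + (lane % group) for lane in range(lanes)]


def gather_bytes(memory, offsets):
    """Model a single gather: each lane loads one byte from its own offset.

    Each result is a plain Python int, standing in for the separate 32-bit
    lane that a loaded byte occupies in the register. This is a
    simplification and does not model signedness.
    """
    return [memory[offset] for offset in offsets]


offsets = permuted_offsets()
# Hypothetical test data: byte i holds the value i modulo 256.
memory = bytes(i % 256 for i in range(offsets[-1] + 1))
lanes = gather_bytes(memory, offsets)

print(offsets[:8])  # [0, 1, 2, 3, 32, 33, 34, 35]
```

In this model the offset vector is cheap to write down, but on real hardware
it still has to be materialised in a vector register before the gather can
use it, which is the cost mentioned above.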