From: "tnfchris at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/108583] [13 Regression] wrong code with vector division by uint16 at -O2
Date: Wed, 01 Feb 2023 16:22:27 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108583

--- Comment #20 from Tamar Christina ---
> > I don't think so for addhn, because it wouldn't truncate the top bits, it
> > truncates the bottom bits.
> >
> > The instruction does
> >   element1 = Elem[operand1, e, 2*esize];
> >   element2 = Elem[operand2, e, 2*esize];
> >
> > So it widens on input.
>
> OK, so that's an ADD_HIGHPART_EXPR then?  Though the highpart of an
> add is only a single bit, isn't it?
> For scalar you'd use the carry bit here and instructions like adc to
> consume it.  Is addhn meant to do such a thing on vectors?

Sorry, it's not, but that's because I read the docs wrong.  This
instruction doesn't widen on the addition but narrows the result.  The
esize for the instruction is determined by the output, not the input.

To unconfuse everyone, let's just explain what this is doing:

The optimization is to handle unsigned division by a mask formed by
setting all bits in half the size of the input, i.e. for a division of
a short the divisor needs to be 0xff.

This works by decomposing the unsigned division x / 0xff into
(x + ((x + 257) >> 8)) >> 8.  This is correct for all input ranges as
long as the operations are done in at least double the precision of x.

If the operations are done in the same precision as x, then the only
valid ranges for this are those where x + 257 does not cause an
overflow.  This is why the maximum range of a widened operation is
fine, because 0xFE01 + 0x101 = 0xFF02 and so does not overflow.  In
this case we can avoid the widening and do the operation in a smaller
precision.

The two vector instructions we're using do this, i.e.
addhn == (x + 257) >> 8 and uaddw == addhn_res + x.  Finally we emit a
>> 8 at the end.

This final shift later gets removed because a widening operation will
do the same sequence for the _lo and _hi pairs.  Since we are doing two
logical right shifts we can just permute the values out.

Hope this clears it up for you both.

> When writing generic vector code is combine able to synthesize addhn
> from widen, plus, pack-high?

No, those are currently unspecs.  We could recognize add, shift and
narrow as it though.

> But sure, using the actual ISA is preferable for costing and also to
> avoid "breaking" the combination by later "optimization".
> OTOH at least some basic constant folding for all such ISA IFNs is
> required to avoid regressing cases where complete unrolling later
> allows constant evaluation but first vectorizing breaks that.

So I suppose one way to open-code this in the vectorizer would be to
use the hook to tell the vectorizer to decompose the operation that's
within the valid range to (x + ((x + 257) >> 8)) >> 8 (without any
intermediate promotions), fudge the costing for it, and use combine to
recognize the sequence?  It's a 4-instruction sequence, so combine
should be able to.

I also spent the day wondering if I could come up with a sequence
that's correct in general and that I can use combine to optimize
further in case a widening operation is found.  Unfortunately, while I
can create a sequence:

uint32x4_t foo (uint32x4_t v)
{
    uint32x4_t cst = vreinterpretq_u32_u16 (vdupq_n_u16 (0x1));
    uint64x2_t l1 = vaddl_u32 (vget_low_u32 (cst), vget_low_u32 (v));
    uint64x2_t l2 = vaddl_high_u32 (cst, v);
    uint32x2_t t1 = vshrn_n_u64 (l1, 16);
    uint32x4_t t2 = vshrn_high_n_u64 (t1, l2, 16);
    uint64x2_t l3 = vaddl_u32 (vget_low_u32 (t2), vget_low_u32 (v));
    uint64x2_t l4 = vaddl_high_u32 (t2, v);
    uint32x2_t r1 = vshrn_n_u64 (l3, 16);
    uint32x4_t r2 = vshrn_high_n_u64 (r1, l4, 16);
    return r2;
}

this isn't more efficient than what was there before, though it is
easier to match in combine.  Part of the problem is that if the
addition produces 33 bits of significance then you do need a shift and
can't remove the shifts with permutes.

So the core of the optimization relies on the range of the inputs, and
it does need to be done pre-vectorization.