From: "tnfchris at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/108583] [13 Regression] wrong code with vector division by uint16 at -O2
Date: Wed, 01 Feb 2023 16:22:27 +0000	[thread overview]
Message-ID: <bug-108583-4-fWlDJzRDiV@http.gcc.gnu.org/bugzilla/> (raw)
In-Reply-To: <bug-108583-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108583

--- Comment #20 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> > I don't think so for addhn, because it wouldn't truncate the top bits, it
> > truncates the bottom bits.
> > 
> > The instruction does
> >     element1 = Elem[operand1, e, 2*esize];
> >     element2 = Elem[operand2, e, 2*esize];
> > 
> > So it widens on input. 
> 
> OK, so that's an ADD_HIGHPART_EXPR then?  Though the highpart of an
> add is only a single bit, isn't it?  For scalar you'd use the
> carry bit here and instructions like adc to consume it.  Does addhn
> do such a thing on vectors?

Sorry, it's not, but that's because I read the docs wrong.  This instruction
doesn't widen on the addition but narrows the result.  The esize for the
instruction is determined by the output, not the input.
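
(In scalar terms, per lane, addhn on 16-bit inputs is effectively
(uint8_t) ((uint16_t) (a + b) >> 8): the sum is formed at the input width,
the carry out of it is discarded, and only the high half is kept.)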


To unconfuse everyone, let's just explain what this is doing:

The optimization targets unsigned division by a mask formed by setting all
the bits in the lower half of the input width.

i.e. for a division of a short, the divisor needs to be 0xff.

This works by decomposing the unsigned division of x / 0xff into
(x + ((x + 257) >> 8)) >> 8.  This is correct for all ranges as long
as the operations are done in at least double the precision of x.

If the operations are done in the same precision as x, then the only
valid ranges for this are those where x + 257 does not cause an overflow.

This is why the maximum value of a widened operation (0xFF * 0xFF = 0xFE01)
is fine: 0xFE01 + 0x101 = 0xFF02, which does not overflow.  In that case we
can avoid the widening and do the operation in the smaller precision.
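
As a quick standalone sanity check of that (mine, not part of any patch),
the following brute-forces every 16-bit value: the double-precision form
always matches x / 0xff, and the same-precision form matches as long as
x + 257 does not wrap:

#include <assert.h>
#include <stdint.h>

int
main (void)
{
  for (uint32_t x = 0; x <= 0xFFFF; x++)
    {
      /* Double the precision of x: always equal to x / 0xff.  */
      assert (((x + ((x + 257) >> 8)) >> 8) == x / 0xff);

      /* Same precision as x: only valid while x + 257 does not wrap.  */
      if (x + 257 <= 0xFFFF)
        {
          uint16_t x16 = (uint16_t) x;
          uint16_t t = (uint16_t) (x16 + 257) >> 8;
          assert ((uint16_t) ((uint16_t) (x16 + t) >> 8) == x / 0xff);
        }
    }
  return 0;
}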

The two vector instructions we're using do exactly this, i.e.
addhn == ((x + 257) >> 8) and uaddw == addhn_res + x.  Finally we emit a >> 8
at the end.  This shift later gets removed because a widening operation
will do the same sequence for the _lo and _hi pairs; since we are doing two
logical right shifts we can just permute the values out.
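
In intrinsics terms (just a sketch with ACLE intrinsics rather than the
unspec patterns the compiler emits, and assuming every lane of x is small
enough that x + 257 cannot wrap in 16 bits, e.g. a widened u8 * u8 product
of at most 0xFE01), the per-uint16-lane sequence is:

#include <arm_neon.h>

uint16x8_t
div255_sketch (uint16x8_t x)
{
  uint8x8_t hi = vaddhn_u16 (x, vdupq_n_u16 (257)); /* addhn: (x + 257) >> 8 */
  uint16x8_t sum = vaddw_u8 (x, hi);                /* uaddw: x + addhn_res  */
  return vshrq_n_u16 (sum, 8);                      /* the final >> 8        */
}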

Hope this clears it up for you both.

> 
> When writing generic vector code is combine able to synthesize addhn
> from widen, plus, pack-high?

No, it's currently unspecs.  We could recognize an add, shift and narrow as
it though.

> 
> But sure, using the actual ISA is preferable for costing and also to
> avoid "breaking" the combination by later "optimization".  OTOH at least
> some basic constant folding for all such ISA IFNs is required to
> avoid regressing cases where complete unrolling later allows constant
> evaluation but vectorizing first breaks that.

So I suppose one way to open-code this in the vectorizer would be to use
the hook to tell the vectorizer to decompose the operation that's within
the valid range to (x + ((x + 257) >> 8)) >> 8 (without any intermediate
promotions), fudge the costing for it, and use combine to recognize the
sequence?  It's a 4-instruction sequence, so combine should be able to.

I also spent the day wondering if I could come up with a sequence that's
correct in general and that I can use combine to optimize further in case
a widening operation is found.  Unfortunately, while I can create a sequence:

#include <arm_neon.h>

/* Per-u32-lane v / 0xffff via the same decomposition,
   i.e. (v + ((v + 0x10001) >> 16)) >> 16, done in 64-bit precision.  */
uint32x4_t
foo (uint32x4_t v)
{
  /* 0x00010001 per 32-bit lane, the 0x101 analogue for 0xffff.  */
  uint32x4_t cst = vreinterpretq_u32_u16 (vdupq_n_u16 (0x1));
  /* v + 0x10001, widened to 64-bit lanes (low and high halves).  */
  uint64x2_t l1 = vaddl_u32 (vget_low_u32 (cst), vget_low_u32 (v));
  uint64x2_t l2 = vaddl_high_u32 (cst, v);
  /* Narrowing shift: (v + 0x10001) >> 16 back into 32-bit lanes.  */
  uint32x2_t t1 = vshrn_n_u64 (l1, 16);
  uint32x4_t t2 = vshrn_high_n_u64 (t1, l2, 16);
  /* Second round: v + t2, widened, then the final >> 16.  */
  uint64x2_t l3 = vaddl_u32 (vget_low_u32 (t2), vget_low_u32 (v));
  uint64x2_t l4 = vaddl_high_u32 (t2, v);
  uint32x2_t r1 = vshrn_n_u64 (l3, 16);
  uint32x4_t r2 = vshrn_high_n_u64 (r1, l4, 16);
  return r2;
}

It isn't more efficient than what was there before, though it is easier to
match in combine.  Part of the problem is that if the addition produces 33
bits of significance then you do need a shift and can't remove it with
permutes.

So the core of the optimization relies on the range of the inputs, which
means it needs to be done pre-vectorization.
