From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=zAqF=6G=arm.com=richard.sandiford@sourceware.org>
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
	by sourceware.org (Postfix) with ESMTP id A99D03858C5F
	for <gcc-patches@gcc.gnu.org>; Fri, 10 Feb 2023 16:25:18 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org A99D03858C5F
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
	by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id D040F4B3;
	Fri, 10 Feb 2023 08:26:00 -0800 (PST)
Received: from localhost (e121540-lin.manchester.arm.com [10.32.99.50])
	by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 8B2683F703;
	Fri, 10 Feb 2023 08:25:17 -0800 (PST)
From: Richard Sandiford <richard.sandiford@arm.com>
To: Tamar Christina <Tamar.Christina@arm.com>
Mail-Followup-To: Tamar Christina <Tamar.Christina@arm.com>,Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>,  nd <nd@arm.com>,  "rguenther\@suse.de" <rguenther@suse.de>,  "jlaw\@ventanamicro.com" <jlaw@ventanamicro.com>, richard.sandiford@arm.com
Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>,  nd <nd@arm.com>,  "rguenther\@suse.de" <rguenther@suse.de>,  "jlaw\@ventanamicro.com" <jlaw@ventanamicro.com>
Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
References: <patch-16909-tamar@arm.com> <mptzg9l4p7n.fsf@arm.com>
	<VI1PR08MB53257CBBF08FD81509AED8ACFFDE9@VI1PR08MB5325.eurprd08.prod.outlook.com>
	<mpth6vt345g.fsf@arm.com>
	<VI1PR08MB53258F0AB4D9FB5EF8B0D066FFDE9@VI1PR08MB5325.eurprd08.prod.outlook.com>
Date: Fri, 10 Feb 2023 16:25:16 +0000
In-Reply-To: <VI1PR08MB53258F0AB4D9FB5EF8B0D066FFDE9@VI1PR08MB5325.eurprd08.prod.outlook.com>
	(Tamar Christina's message of "Fri, 10 Feb 2023 16:09:58 +0000")
Message-ID: <mpt5yc932tv.fsf@arm.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Status: No, score=-27.9 required=5.0 tests=BAYES_00,BODY_8BITS,KAM_DMARC_NONE,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,SPF_HELO_NONE,SPF_NONE,TXREP autolearn=no autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

Tamar Christina <Tamar.Christina@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandiford@arm.com>
>> Sent: Friday, February 10, 2023 3:57 PM
>> To: Tamar Christina <Tamar.Christina@arm.com>
>> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
>> by using new optabs [PR108583]
>>=20
>> Tamar Christina <Tamar.Christina@arm.com> writes:
>> >> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
>> >> >
>> >>
>> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f33
>> >> 69
>> >> > de2afea139d6 100644
>> >> > --- a/gcc/tree-vect-patterns.cc
>> >> > +++ b/gcc/tree-vect-patterns.cc
>> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info
>> *vinfo,
>> >> >        return pattern_stmt;
>> >> >      }
>> >> >    else if ((cst =3D uniform_integer_cst_p (oprnd1))
>> >> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code,
>> >> vectype,
>> >> > -							  wi::to_wide (cst),
>> >> > -							  NULL, NULL_RTX,
>> >> > -							  NULL_RTX))
>> >> > +	   && TYPE_UNSIGNED (itype)
>> >> > +	   && rhs_code =3D=3D TRUNC_DIV_EXPR
>> >> > +	   && vectype
>> >> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
>> >> > +					      OPTIMIZE_FOR_SPEED))
>> >> >      {
>> >> > -      return NULL;
>> >> > +      /* div optimizations using narrowings
>> >> > +       we can do the division e.g. shorts by 255 faster by calcula=
ting it as
>> >> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done =
in
>> >> > +       double the precision of x.
>> >> > +
>> >> > +       If we imagine a short as being composed of two blocks of by=
tes
>> then
>> >> > +       adding 257 or 0b0000_0001_0000_0001 to the number is equiva=
lent
>> to
>> >> > +       adding 1 to each sub component:
>> >> > +
>> >> > +	    short value of 16-bits
>> >> > +       =E2=94=8C=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=
=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=
=80=E2=94=AC=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=
=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=
=94=80=E2=94=90
>> >> > +       =E2=94=82              =E2=94=82                =E2=94=82
>> >> > +       =E2=94=94=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=
=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=
=80=E2=94=B4=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=
=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=
=94=80=E2=94=98
>> >> > +	 8-bit part1 =E2=96=B2  8-bit part2   =E2=96=B2
>> >> > +		     =E2=94=82                =E2=94=82
>> >> > +		     =E2=94=82                =E2=94=82
>> >> > +		    +1               +1
>> >> > +
>> >> > +       after the first addition, we have to shift right by 8, and =
narrow the
>> >> > +       results back to a byte.  Remember that the addition must be=
 done
>> in
>> >> > +       double the precision of the input.  However if we know that
>> >> > + the
>> >> addition
>> >> > +       `x + 257` does not overflow then we can do the operation in
>> >> > + the
>> >> current
>> >> > +       precision.  In which case we don't need the pack and unpack=
s.  */
>> >> > +      auto wcst =3D wi::to_wide (cst);
>> >> > +      int pow =3D wi::exact_log2 (wcst + 1);
>> >> > +      if (pow =3D=3D (int) (element_precision (vectype) / 2))
>> >> > +	{
>> >> > +	  wide_int min,max;
>> >> > +	  /* If we're in a pattern we need to find the orginal definition=
.  */
>> >> > +	  tree op0 =3D oprnd0;
>> >> > +	  gimple *stmt =3D SSA_NAME_DEF_STMT (oprnd0);
>> >> > +	  stmt_vec_info stmt_info =3D vinfo->lookup_stmt (stmt);
>> >> > +	  if (is_pattern_stmt_p (stmt_info))
>> >> > +	    {
>> >> > +	      auto orig_stmt =3D STMT_VINFO_RELATED_STMT (stmt_info);
>> >> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
>> >> > +		op0 =3D gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
>> >> > +	    }
>> >>
>> >> If this is generally safe (I'm skipping thinking about it in the
>> >> interests of a quick review :-)), then I think it should be done in
>> >> vect_get_range_info instead.  Using gimple_get_lhs would be more
>> >> general than handling just assignments.
>> >>
>> >> > +
>> >> > +	  /* Check that no overflow will occur.  If we don't have range
>> >> > +	     information we can't perform the optimization.  */
>> >> > +	  if (vect_get_range_info (op0, &min, &max))
>> >> > +	    {
>> >> > +	      wide_int one =3D wi::to_wide (build_one_cst (itype));
>> >> > +	      wide_int adder =3D wi::add (one, wi::lshift (one, pow));
>> >> > +	      wi::overflow_type ovf;
>> >> > +	      /* We need adder and max in the same precision.  */
>> >> > +	      wide_int zadder
>> >> > +		=3D wide_int_storage::from (adder, wi::get_precision (max),
>> >> > +					  UNSIGNED);
>> >> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
>> >>
>> >> Could you explain this a bit more?  When do we have mismatched
>> >> precisions?
>> >
>> > C promotion rules will promote e.g.
>> >
>> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
>> >   for (int i =3D 0; i < n; i+=3D1)
>> >     pixel[i] =3D (pixel[i] + level) / 0xff; }
>> >
>> > And have the addition be done as a 32 bit integer.  The vectorizer
>> > will demote this down to a short, but range information is not stored
>> > for patterns.  So In the above the range will correctly be 0x1fe but
>> > the precision will be that of the original expression, so 32.  This
>> > will be a mismatch with itype which is derived from the size the vecto=
rizer
>> will perform the operation in.
>>=20
>> Gah, missed this first time round, sorry.
>>=20
>> Richi would know better than me, but I think it's dangerous to rely on t=
he
>> orig/pattern link for range information.  The end result of a pattern
>> (vect_stmt_to_vectorize) has to have the same type as the lhs of the ori=
ginal
>> statement.  But the other statements in the pattern sequence can do
>> arbitrary things.  Their range isn't predictable from the range of the o=
riginal
>> statement result.
>>=20
>> IIRC, the addition above is converted to:
>>=20
>>   a' =3D (uint16_t) pixel[i]
>>   b' =3D (uint16_t) level
>>   sum' =3D a' + b'
>>   sum =3D (int) sum'
>>=20
>> where sum is the direct replacement of "pixel[i] + level", with the same=
 type
>> and range.  The division then uses sum' instead of sum.
>>=20
>> But the fact that sum' is part of the same pattern as sum doesn't guaran=
tee
>> that sum' has the same range as sum.  E.g. the pattern statements added =
by
>> the division optimisation wouldn't have this property.
>
> So my assumption is that no pattern would replace a statement with someth=
ing
> That has higher precision than the C statement. The pattern above is demo=
ted
> By the vectorizer based on range information already. My assumption was t=
hat
> the precision can only ever be smaller, because otherwise the pattern has=
 violated
> the semantics of the C code, which would be dangerous if e.g. the express=
ion escapes?

IMO the difference in precisions was a symptom of the problem rather
than the direct cause.

The point is more that "B =3D vect_orig_stmt(A)" just says "A is used
somehow in a new calculation of B".  A might equal B (if A replaces B),
or A might be an arbitrary temporary result.  The code above is instead
using it to mean "A equals B, expressed in a different type".  That
happens to be true for sum' in the sequence above, but it isn't true of
non-final pattern statements in general.

In other words, the code hasn't proved that the path from A to
vect_stmt_to_vectorize(B) just involves conversions.

Applying the range of a pattern result to all temporary results in
the pattern could lead to wrong results even when the precisions
are all the same.

>> Is it possible to tell ranger to compute the range of expressions that h=
aven't
>> been added to the IL?  (Genuine question, haven't looked.
>> It seems pretty powerful though.)
>
> I don't know either, I guess for things it has explicit knowledge about i=
t's ok, so
> +w or *w would be fine, but with a random IFN_ it'll likely have to punt =
as varying.

Yeah.  But sum' above involves simple arithmetic and conversions,
so IFNs shouldn't be a problem.

Thanks,
Richard