From: Richard Sandiford
To: Tamar Christina
Cc: Tamar Christina via Gcc-patches, nd, rguenther@suse.de, jlaw@ventanamicro.com
Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
Date: Fri, 10 Feb 2023 15:56:43 +0000
In-Reply-To: Tamar Christina's message of "Fri, 10 Feb 2023 14:13:30 +0000"

Tamar Christina writes:
>> > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
>> > index 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f3369de2afea139d6 100644
>> > --- a/gcc/tree-vect-patterns.cc
>> > +++ b/gcc/tree-vect-patterns.cc
>> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info *vinfo,
>> >        return pattern_stmt;
>> >      }
>> >    else if ((cst = uniform_integer_cst_p (oprnd1))
>> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code, vectype,
>> > -							  wi::to_wide (cst),
>> > -							  NULL, NULL_RTX,
>> > -							  NULL_RTX))
>> > +	   && TYPE_UNSIGNED (itype)
>> > +	   && rhs_code == TRUNC_DIV_EXPR
>> > +	   && vectype
>> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
>> > +					      OPTIMIZE_FOR_SPEED))
>> >      {
>> > -      return NULL;
>> > +      /* div optimizations using narrowings
>> > +	 we can do the division e.g. shorts by 255 faster by calculating it as
>> > +	 (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
>> > +	 double the precision of x.
>> > +
>> > +	 If we imagine a short as being composed of two blocks of bytes then
>> > +	 adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
>> > +	 adding 1 to each sub component:
>> > +
>> > +	          short value of 16-bits
>> > +	 ┌──────────────┬────────────────┐
>> > +	 │              │                │
>> > +	 └──────────────┴────────────────┘
>> > +	   8-bit part1 ▲  8-bit part2   ▲
>> > +	               │                │
>> > +	               │                │
>> > +	              +1               +1
>> > +
>> > +	 after the first addition, we have to shift right by 8, and narrow the
>> > +	 results back to a byte.  Remember that the addition must be done in
>> > +	 double the precision of the input.  However if we know that the addition
>> > +	 `x + 257` does not overflow then we can do the operation in the current
>> > +	 precision.  In which case we don't need the pack and unpacks.  */
>> > +      auto wcst = wi::to_wide (cst);
>> > +      int pow = wi::exact_log2 (wcst + 1);
>> > +      if (pow == (int) (element_precision (vectype) / 2))
>> > +	{
>> > +	  wide_int min,max;
>> > +	  /* If we're in a pattern we need to find the original definition.  */
>> > +	  tree op0 = oprnd0;
>> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
>> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
>> > +	  if (is_pattern_stmt_p (stmt_info))
>> > +	    {
>> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
>> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
>> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
>> > +	    }
>>
>> If this is generally safe (I'm skipping thinking about it in the interests
>> of a quick review :-)), then I think it should be done in
>> vect_get_range_info instead.  Using gimple_get_lhs would be more general
>> than handling just assignments.
>>
>> > +
>> > +	  /* Check that no overflow will occur.  If we don't have range
>> > +	     information we can't perform the optimization.  */
>> > +	  if (vect_get_range_info (op0, &min, &max))
>> > +	    {
>> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
>> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
>> > +	      wi::overflow_type ovf;
>> > +	      /* We need adder and max in the same precision.  */
>> > +	      wide_int zadder
>> > +		= wide_int_storage::from (adder, wi::get_precision (max),
>> > +					  UNSIGNED);
>> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
>>
>> Could you explain this a bit more?  When do we have mismatched
>> precisions?
>
> C promotion rules will promote e.g.
>
> void fun2(uint8_t* restrict pixel, uint8_t level, int n)
> {
>   for (int i = 0; i < n; i+=1)
>     pixel[i] = (pixel[i] + level) / 0xff;
> }
>
> and have the addition be done as a 32-bit integer.  The vectorizer will
> demote this down to a short, but range information is not stored for
> patterns.  So in the above the range will correctly be 0x1fe but the
> precision will be that of the original expression, so 32.  This will be a
> mismatch with itype, which is derived from the size the vectorizer will
> perform the operation in.
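[As an aside: the identity the comment relies on is easy to sanity-check
with a throwaway scalar test like the one below.  This is only my
illustration of the arithmetic and of why the "x + 257 must not overflow"
condition matters for the narrow form; it isn't part of the patch.]

#include <stdint.h>
#include <stdio.h>

/* Check that for a 16-bit unsigned x, x / 0xff equals
   (x + ((x + 0x101) >> 8)) >> 8 when the arithmetic is done in 32 bits,
   and that the same computation truncated to 16 bits at every step is
   still correct provided x + 0x101 does not wrap uint16_t, which is the
   condition the range check tries to prove.  */
int
main (void)
{
  for (uint32_t x = 0; x <= 0xffff; x++)
    {
      uint32_t ref = x / 0xff;

      /* Double-precision (32-bit) form: valid for every 16-bit x.  */
      uint32_t wide = (x + ((x + 0x101) >> 8)) >> 8;
      if (wide != ref)
        {
          printf ("wide form wrong for %u\n", (unsigned) x);
          return 1;
        }

      /* Narrow form, truncated to 16 bits at every step: only checked
         where x + 0x101 still fits in uint16_t.  */
      if (x + 0x101 <= 0xffff)
        {
          uint16_t t = (uint16_t) ((uint16_t) (x + 0x101) >> 8);
          uint16_t narrow = (uint16_t) ((uint16_t) (x + t) >> 8);
          if (narrow != ref)
            {
              printf ("narrow form wrong for %u\n", (unsigned) x);
              return 1;
            }
        }
    }
  printf ("ok\n");
  return 0;
}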
Gah, missed this first time round, sorry.  Richi would know better than
me, but I think it's dangerous to rely on the orig/pattern link for range
information.  The end result of a pattern (vect_stmt_to_vectorize) has to
have the same type as the lhs of the original statement.  But the other
statements in the pattern sequence can do arbitrary things.  Their range
isn't predictable from the range of the original statement result.

IIRC, the addition above is converted to:

  a' = (uint16_t) pixel[i]
  b' = (uint16_t) level
  sum' = a' + b'
  sum = (int) sum'

where sum is the direct replacement of "pixel[i] + level", with the same
type and range.  The division then uses sum' instead of sum.

But the fact that sum' is part of the same pattern as sum doesn't
guarantee that sum' has the same range as sum.  E.g. the pattern
statements added by the division optimisation wouldn't have this
property.

Is it possible to tell ranger to compute the range of expressions that
haven't been added to the IL?  (Genuine question, haven't looked.
It seems pretty powerful though.)

Thanks,
Richard
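P.S. For concreteness, the fun2 case can be written out in scalar C along
the lines of the a'/b'/sum' conversion above, with the division by 0xff
then done entirely in 16 bits.  This is just an illustration of the
arithmetic under the 0..0x1fe range assumption, not of what the
vectorizer would actually emit:

#include <stdint.h>
#include <stdio.h>

/* Scalar model of one fun2 element: a and b stand for the a'/b'
   promotions of pixel[i] and level, sum for sum' (range [0, 0x1fe]).
   Because sum + 0x101 can be at most 0x2ff, nothing wraps uint16_t and
   the division by 0xff can stay in 16 bits.  */
int
main (void)
{
  for (unsigned pixel = 0; pixel <= 0xff; pixel++)
    for (unsigned level = 0; level <= 0xff; level++)
      {
        uint16_t a = (uint16_t) pixel;
        uint16_t b = (uint16_t) level;
        uint16_t sum = (uint16_t) (a + b);	/* at most 0x1fe */
        uint16_t res = (uint16_t) ((sum + (uint16_t) ((sum + 0x101) >> 8)) >> 8);
        if (res != (pixel + level) / 0xff)
          {
            printf ("mismatch for %u + %u\n", pixel, level);
            return 1;
          }
      }
  printf ("ok\n");
  return 0;
}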