From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=glZS=6G=suse.de=rguenther@sourceware.org>
Received: from smtp-out2.suse.de (smtp-out2.suse.de [IPv6:2001:67c:2178:6::1d])
	by sourceware.org (Postfix) with ESMTPS id 1B2713851402
	for <gcc-patches@gcc.gnu.org>; Fri, 10 Feb 2023 18:34:49 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 1B2713851402
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=suse.de
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=suse.de
Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512)
	(No client certificate requested)
	by smtp-out2.suse.de (Postfix) with ESMTPS id AADCC5D32E;
	Fri, 10 Feb 2023 18:34:47 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa;
	t=1676054087; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc:
	 mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=xx1rUkv47KSehPCIe6pDHgS46jRIeacstrHxvz1MTzk=;
	b=EIBAYkoOWUr4+8FjR+/9e3dppPAN52inSPBJRVEScyQDXTaXj/ADJLiyHmADC+ZCYkdRJt
	IkbySMoM5rAht3OC8pLs3DeygFHzv+YqCdlRXca3ROwoJKiJmT6JdgUZ4Lr6m56oOjc0kY
	rTrM8kLkCZ697BYUzVfG9hHrrphqo44=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
	s=susede2_ed25519; t=1676054087;
	h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc:
	 mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=xx1rUkv47KSehPCIe6pDHgS46jRIeacstrHxvz1MTzk=;
	b=2G+e2qUa+ZdQgvleNee1ZSzNk2XIyiouo+azc/GQNroJdz8O+pRU+IjAWcdLS+jZFMTX8P
	JWrmCgBJUxbvgiAg==
Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512)
	(No client certificate requested)
	by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id 9807213206;
	Fri, 10 Feb 2023 18:34:47 +0000 (UTC)
Received: from dovecot-director2.suse.de ([192.168.254.65])
	by imap2.suse-dmz.suse.de with ESMTPSA
	id uRoSJUeO5mNrPgAAMHmgww
	(envelope-from <rguenther@suse.de>); Fri, 10 Feb 2023 18:34:47 +0000
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
From: Richard Biener <rguenther@suse.de>
Mime-Version: 1.0 (1.0)
Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583] 
Date: Fri, 10 Feb 2023 19:34:37 +0100
Message-Id: <F6AE5834-ACA9-4FE2-9697-3E20F8A53D47@suse.de>
References: <mpt1qmx1jbc.fsf@arm.com>
Cc: Tamar Christina <tamar.christina@arm.com>,
 Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>, nd <nd@arm.com>,
 jlaw@ventanamicro.com, Andrew MacLeod <amacleod@redhat.com>
In-Reply-To: <mpt1qmx1jbc.fsf@arm.com>
To: Richard Sandiford <richard.sandiford@arm.com>
X-Mailer: iPhone Mail (20D47)
X-Spam-Status: No, score=-4.2 required=5.0 tests=BAYES_00,BODY_8BITS,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,MIME_QP_LONG_LINE,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>


> Am 10.02.2023 um 19:12 schrieb Richard Sandiford via Gcc-patches <gcc-patc=
hes@gcc.gnu.org>:
>=20
> =EF=BB=BFTamar Christina <Tamar.Christina@arm.com> writes:
>>> -----Original Message-----
>>> From: Richard Sandiford <richard.sandiford@arm.com>
>>> Sent: Friday, February 10, 2023 4:57 PM
>>> To: Tamar Christina <Tamar.Christina@arm.com>
>>> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>>> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmas=
k
>>> by using new optabs [PR108583]
>>>=20
>>> Tamar Christina <Tamar.Christina@arm.com> writes:
>>>>> -----Original Message-----
>>>>> From: Richard Sandiford <richard.sandiford@arm.com>
>>>>> Sent: Friday, February 10, 2023 4:25 PM
>>>>> To: Tamar Christina <Tamar.Christina@arm.com>
>>>>> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>>>>> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>>>>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
>>>>> div-bitmask by using new optabs [PR108583]
>>>>>=20
>>>>> Tamar Christina <Tamar.Christina@arm.com> writes:
>>>>>>> -----Original Message-----
>>>>>>> From: Richard Sandiford <richard.sandiford@arm.com>
>>>>>>> Sent: Friday, February 10, 2023 3:57 PM
>>>>>>> To: Tamar Christina <Tamar.Christina@arm.com>
>>>>>>> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>>>>>>> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>>>>>>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
>>>>>>> div-bitmask by using new optabs [PR108583]
>>>>>>>=20
>>>>>>> Tamar Christina <Tamar.Christina@arm.com> writes:
>>>>>>>>>> a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
>>>>>>>>>>=20
>>>>>>>>>=20
>>>>>>>=20
>>>>>=20
>>> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f33
>>>>>>>>> 69
>>>>>>>>>> de2afea139d6 100644
>>>>>>>>>> --- a/gcc/tree-vect-patterns.cc
>>>>>>>>>> +++ b/gcc/tree-vect-patterns.cc
>>>>>>>>>> @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern
>>> (vec_info
>>>>>>> *vinfo,
>>>>>>>>>>       return pattern_stmt;
>>>>>>>>>>     }
>>>>>>>>>>   else if ((cst =3D uniform_integer_cst_p (oprnd1))
>>>>>>>>>> -       && targetm.vectorize.can_special_div_by_const (rhs_code,
>>>>>>>>> vectype,
>>>>>>>>>> -                              wi::to_wide
>>>>> (cst),
>>>>>>>>>> -                              NULL,
>>>>> NULL_RTX,
>>>>>>>>>> -                              NULL_RTX))
>>>>>>>>>> +       && TYPE_UNSIGNED (itype)
>>>>>>>>>> +       && rhs_code =3D=3D TRUNC_DIV_EXPR
>>>>>>>>>> +       && vectype
>>>>>>>>>> +       && direct_internal_fn_supported_p (IFN_ADDH, vectype,
>>>>>>>>>> +                          OPTIMIZE_FOR_SPEED))
>>>>>>>>>>     {
>>>>>>>>>> -      return NULL;
>>>>>>>>>> +      /* div optimizations using narrowings
>>>>>>>>>> +       we can do the division e.g. shorts by 255 faster by
>>>>>>>>>> + calculating it
>>>>> as
>>>>>>>>>> +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is don=
e in
>>>>>>>>>> +       double the precision of x.
>>>>>>>>>> +
>>>>>>>>>> +       If we imagine a short as being composed of two blocks
>>>>>>>>>> + of bytes
>>>>>>> then
>>>>>>>>>> +       adding 257 or 0b0000_0001_0000_0001 to the number is
>>>>>>>>>> + equivalent
>>>>>>> to
>>>>>>>>>> +       adding 1 to each sub component:
>>>>>>>>>> +
>>>>>>>>>> +        short value of 16-bits
>>>>>>>>>> +       =E2=94=8C=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=
=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=
=E2=94=AC=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=
=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=
=94=90
>>>>>>>>>> +       =E2=94=82              =E2=94=82                =E2=94=82=

>>>>>>>>>> +       =E2=94=94=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=
=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=
=E2=94=B4=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=
=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=
=94=98
>>>>>>>>>> +     8-bit part1 =E2=96=B2  8-bit part2   =E2=96=B2
>>>>>>>>>> +             =E2=94=82                =E2=94=82
>>>>>>>>>> +             =E2=94=82                =E2=94=82
>>>>>>>>>> +            +1               +1
>>>>>>>>>> +
>>>>>>>>>> +       after the first addition, we have to shift right by
>>>>>>>>>> + 8, and narrow
>>>>> the
>>>>>>>>>> +       results back to a byte.  Remember that the addition
>>>>>>>>>> + must be done
>>>>>>> in
>>>>>>>>>> +       double the precision of the input.  However if we
>>>>>>>>>> + know that the
>>>>>>>>> addition
>>>>>>>>>> +       `x + 257` does not overflow then we can do the
>>>>>>>>>> + operation in the
>>>>>>>>> current
>>>>>>>>>> +       precision.  In which case we don't need the pack and
>>> unpacks.
>>>>> */
>>>>>>>>>> +      auto wcst =3D wi::to_wide (cst);
>>>>>>>>>> +      int pow =3D wi::exact_log2 (wcst + 1);
>>>>>>>>>> +      if (pow =3D=3D (int) (element_precision (vectype) / 2))
>>>>>>>>>> +    {
>>>>>>>>>> +      wide_int min,max;
>>>>>>>>>> +      /* If we're in a pattern we need to find the orginal
>>>>> definition.  */
>>>>>>>>>> +      tree op0 =3D oprnd0;
>>>>>>>>>> +      gimple *stmt =3D SSA_NAME_DEF_STMT (oprnd0);
>>>>>>>>>> +      stmt_vec_info stmt_info =3D vinfo->lookup_stmt (stmt);
>>>>>>>>>> +      if (is_pattern_stmt_p (stmt_info))
>>>>>>>>>> +        {
>>>>>>>>>> +          auto orig_stmt =3D STMT_VINFO_RELATED_STMT
>>>>> (stmt_info);
>>>>>>>>>> +          if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
>>>>>>>>>> +        op0 =3D gimple_assign_lhs (STMT_VINFO_STMT
>>>>> (orig_stmt));
>>>>>>>>>> +        }
>>>>>>>>>=20
>>>>>>>>> If this is generally safe (I'm skipping thinking about it in
>>>>>>>>> the interests of a quick review :-)), then I think it should be
>>>>>>>>> done in vect_get_range_info instead.  Using gimple_get_lhs
>>>>>>>>> would be more general than handling just assignments.
>>>>>>>>>=20
>>>>>>>>>> +
>>>>>>>>>> +      /* Check that no overflow will occur.  If we don't have ra=
nge
>>>>>>>>>> +         information we can't perform the optimization.  */
>>>>>>>>>> +      if (vect_get_range_info (op0, &min, &max))
>>>>>>>>>> +        {
>>>>>>>>>> +          wide_int one =3D wi::to_wide (build_one_cst (itype));
>>>>>>>>>> +          wide_int adder =3D wi::add (one, wi::lshift (one, pow)=
);
>>>>>>>>>> +          wi::overflow_type ovf;
>>>>>>>>>> +          /* We need adder and max in the same precision.  */
>>>>>>>>>> +          wide_int zadder
>>>>>>>>>> +        =3D wide_int_storage::from (adder, wi::get_precision
>>>>> (max),
>>>>>>>>>> +                      UNSIGNED);
>>>>>>>>>> +          wi::add (max, zadder, UNSIGNED, &ovf);
>>>>>>>>>=20
>>>>>>>>> Could you explain this a bit more?  When do we have mismatched
>>>>>>>>> precisions?
>>>>>>>>=20
>>>>>>>> C promotion rules will promote e.g.
>>>>>>>>=20
>>>>>>>> void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
>>>>>>>>  for (int i =3D 0; i < n; i+=3D1)
>>>>>>>>    pixel[i] =3D (pixel[i] + level) / 0xff; }
>>>>>>>>=20
>>>>>>>> And have the addition be done as a 32 bit integer.  The
>>>>>>>> vectorizer will demote this down to a short, but range
>>>>>>>> information is not stored for patterns.  So In the above the
>>>>>>>> range will correctly be 0x1fe but the precision will be that of
>>>>>>>> the original expression, so 32.  This will be a mismatch with
>>>>>>>> itype which is derived from the size the vectorizer
>>>>>>> will perform the operation in.
>>>>>>>=20
>>>>>>> Gah, missed this first time round, sorry.
>>>>>>>=20
>>>>>>> Richi would know better than me, but I think it's dangerous to
>>>>>>> rely on the orig/pattern link for range information.  The end
>>>>>>> result of a pattern
>>>>>>> (vect_stmt_to_vectorize) has to have the same type as the lhs of
>>>>>>> the original statement.  But the other statements in the pattern
>>>>>>> sequence can do arbitrary things.  Their range isn't predictable
>>>>>>> from the range of the original statement result.
>>>>>>>=20
>>>>>>> IIRC, the addition above is converted to:
>>>>>>>=20
>>>>>>>  a' =3D (uint16_t) pixel[i]
>>>>>>>  b' =3D (uint16_t) level
>>>>>>>  sum' =3D a' + b'
>>>>>>>  sum =3D (int) sum'
>>>>>>>=20
>>>>>>> where sum is the direct replacement of "pixel[i] + level", with
>>>>>>> the same type and range.  The division then uses sum' instead of sum=
.
>>>>>>>=20
>>>>>>> But the fact that sum' is part of the same pattern as sum doesn't
>>>>>>> guarantee that sum' has the same range as sum.  E.g. the pattern
>>>>>>> statements added by the division optimisation wouldn't have this
>>>>> property.
>>>>>>=20
>>>>>> So my assumption is that no pattern would replace a statement with
>>>>>> something That has higher precision than the C statement. The
>>>>>> pattern above is demoted By the vectorizer based on range information=

>>> already.
>>>>>> My assumption was that the precision can only ever be smaller,
>>>>>> because otherwise the pattern has violated the semantics of the C
>>>>>> code, which
>>>>> would be dangerous if e.g. the expression escapes?
>>>>>=20
>>>>> IMO the difference in precisions was a symptom of the problem rather
>>>>> than the direct cause.
>>>>>=20
>>>>> The point is more that "B =3D vect_orig_stmt(A)" just says "A is used
>>>>> somehow in a new calculation of B".  A might equal B (if A replaces
>>>>> B), or A might be an arbitrary temporary result.  The code above is
>>>>> instead using it to mean "A equals B, expressed in a different type".
>>>>> That happens to be true for sum' in the sequence above, but it isn't
>>>>> true of non-final pattern statements in general.
>>>>>=20
>>>>=20
>>>> Sorry for being dense, but I though that's exactly what the code does
>>>> and what I tried explain before. If B isn't a final statement than it w=
on't
>>> have an original statement.
>>>> AFAIK, the only places we set original statement is the root of the pat=
tern
>>> expression.
>>>=20
>>> Final pattern statements (those not in DEF_SEQ) always have the same typ=
e
>>> and value as the original statements.  We wouldn't see mismatched
>>> precisions if we were only looking at final pattern statements.
>>=20
>> We would because the entire problem is that pattern statement have no ran=
ges.
>> Ranger does not track them after they have been created.  This could of c=
ourse
>> Trivially be solved if we tell ranger about the demotion we did, but we d=
on't do so
>> at the moment. It will just return varying here.  This is the root cause o=
f the issue.
>>=20
>>>=20
>>> Like you say, the 16-bit addition didn't exist before vectorisation (it w=
as a 32-
>>> bit addition instead).  So to make things type-correct, the 32-bit addit=
ion:
>>>=20
>>>   A: sum =3D a + b           (STMT_VINFO_RELATED_STMT =3D=3D A2)
>>>=20
>>> is replaced with:
>>>=20
>>>   DEF_SEQ:
>>>     A1: tmp =3D a' + b'      (STMT_VINFO_RELATED_STMT =3D=3D A)
>>>   A2: sum' =3D (int) tmp     (STMT_VINFO_RELATED_STMT =3D=3D A)
>>>=20
>>> (using different notation from before, just to confuse things).
>>> Here, A2 is the final pattern statement for A and A1 is just a temporary=
 result.
>>> sum =3D=3D sum'.
>>>=20
>>> Later, we do a similar thing for the division itself.  We have:
>>>=20
>>>   B: quotient =3D sum / 0xff            (STMT_VINFO_RELATED_STMT =3D=3D B=
2)
>>>=20
>>> We realise that this can be a 16-bit division, so (IIRC) we use
>>> vect_look_through_possible_promotion on sum to find the best starting
>>> point.  This should give:
>>>=20
>>>   DEF_SEQ:
>>>     B1: tmp2 =3D tmp / (uint16_t) 0xff  (STMT_VINFO_RELATED_STMT =3D=3D B=
)
>>>   B2: quotient' =3D (int) tmp2          (STMT_VINFO_RELATED_STMT =3D=3D B=
)
>>>=20
>>> Both changes are done by vect_widened_op_tree.
>>>=20
>>> We then apply the division pattern to B1.  B1 is a nonfinal pattern stat=
ement
>>> that uses the result (tmp) of another nonfinal pattern statement (A1).
>>>=20
>>> The code does:
>>>=20
>>>      if (is_pattern_stmt_p (stmt_info))
>>>        {
>>>          auto orig_stmt =3D STMT_VINFO_RELATED_STMT (stmt_info);
>>>          if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
>>>        op0 =3D gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
>>>        }
>>>=20
>>> is_pattern_stmt_p is true for both A1 and A2, and
>>> STMT_VINFO_RELATED_STMT is A for both A1 and A2.  I would expect:
>>>=20
>>>  gcc_assert (stmt_info =3D=3D vect_stmt_to_vectorize (orig_stmt));
>>>=20
>>> (testing for a final pattern) to fail for the motivating example.
>>>=20
>>=20
>> I think we're actually saying the same thing. I believe all I'm saying is=
 that looking
>> at the original statement is a safe alternative as it conservatively will=
 overestimate
>> to VARYING or give a wider range than the pattern would have.
>=20
> Hmm, but you said "If B isn't a final statement than it won't have an
> original statement. AFAIK, the only places we set original statement
> is the root of the pattern expression."  My point was that that isn't true=
.
> All statements in the pattern have an original statement, not just the roo=
t.
> And we're specifically relying on that for the motivating example to work.=

>=20
>> I'm saying it's conservatively safe, while not overly accurate.  The alte=
rnative would be
>> to tell ranger about the demotions in vect_recog_over_widening_pattern us=
ing range::set.
>>=20
>> But for this to work the general widening pattern also have to update the=
 range information.
>>=20
>> I think where we're disagreeing is that I think looking at the original s=
calar statement is a safe
>> conservative estimate.  It will fail in some cases, but that's a missed o=
ptimization, not a miss-optimization.
>=20
> Yeah, like you say, I disagree that it's conservatively correct.
> It means that we're hoping (without proving) that the only things
> between stmt_info and the final pattern statement are conversions.
> I don't think there's any reason in principle why that must hold.
>=20
> What would be conservatively correct would be to start from the
> final pattern statement and work our way down to the value that
> is actually being used.  That seems a bit convoluted though,
> so I'd prefer not to do that...
>=20
>> In any case, if you disagree I don=E2=80=99t' really see a way forward as=
ide from making this its own pattern
>> running it before the overwidening pattern.
>=20
> I think we should look to see if ranger can be persuaded to provide the
> range of the 16-bit addition, even though the statement that produces it
> isn't part of a BB.  It shouldn't matter that the addition originally
> came from a 32-bit one: the range follows directly from the ranges of
> the operands (i.e. the fact that the operands are the results of
> widening conversions).

I think you can ask ranger on operations on names defined in the IL, so you c=
an work yourself through the sequence of operations in the pattern sequence t=
o compute ranges on their defs (and possibly even store them in the SSA info=
).  You just need to pick the correct ranger API for this=E2=80=A6. Andrew C=
Ced

Richard=20

> Thanks,
> Richard