From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp-out2.suse.de (smtp-out2.suse.de [IPv6:2001:67c:2178:6::1d]) by sourceware.org (Postfix) with ESMTPS id 1B2713851402 for ; Fri, 10 Feb 2023 18:34:49 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 1B2713851402 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=suse.de Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=suse.de Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out2.suse.de (Postfix) with ESMTPS id AADCC5D32E; Fri, 10 Feb 2023 18:34:47 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa; t=1676054087; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=xx1rUkv47KSehPCIe6pDHgS46jRIeacstrHxvz1MTzk=; b=EIBAYkoOWUr4+8FjR+/9e3dppPAN52inSPBJRVEScyQDXTaXj/ADJLiyHmADC+ZCYkdRJt IkbySMoM5rAht3OC8pLs3DeygFHzv+YqCdlRXca3ROwoJKiJmT6JdgUZ4Lr6m56oOjc0kY rTrM8kLkCZ697BYUzVfG9hHrrphqo44= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_ed25519; t=1676054087; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=xx1rUkv47KSehPCIe6pDHgS46jRIeacstrHxvz1MTzk=; b=2G+e2qUa+ZdQgvleNee1ZSzNk2XIyiouo+azc/GQNroJdz8O+pRU+IjAWcdLS+jZFMTX8P JWrmCgBJUxbvgiAg== Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id 9807213206; Fri, 10 Feb 2023 18:34:47 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id uRoSJUeO5mNrPgAAMHmgww (envelope-from ); Fri, 10 Feb 2023 18:34:47 +0000 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable From: Richard Biener Mime-Version: 1.0 (1.0) Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583] Date: Fri, 10 Feb 2023 19:34:37 +0100 Message-Id: References: Cc: Tamar Christina , Tamar Christina via Gcc-patches , nd , jlaw@ventanamicro.com, Andrew MacLeod In-Reply-To: To: Richard Sandiford X-Mailer: iPhone Mail (20D47) X-Spam-Status: No, score=-4.2 required=5.0 tests=BAYES_00,BODY_8BITS,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,MIME_QP_LONG_LINE,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: > Am 10.02.2023 um 19:12 schrieb Richard Sandiford via Gcc-patches : >=20 > =EF=BB=BFTamar Christina writes: >>> -----Original Message----- >>> From: Richard Sandiford >>> Sent: Friday, February 10, 2023 4:57 PM >>> To: Tamar Christina >>> Cc: Tamar Christina via Gcc-patches ; nd >>> ; rguenther@suse.de; jlaw@ventanamicro.com >>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmas= k >>> by using new optabs [PR108583] >>>=20 >>> Tamar Christina writes: >>>>> -----Original Message----- >>>>> From: Richard Sandiford >>>>> Sent: Friday, February 10, 2023 4:25 PM >>>>> To: Tamar Christina >>>>> Cc: Tamar Christina via Gcc-patches ; nd >>>>> ; rguenther@suse.de; jlaw@ventanamicro.com >>>>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of >>>>> div-bitmask by using new optabs [PR108583] >>>>>=20 >>>>> Tamar Christina writes: >>>>>>> -----Original Message----- >>>>>>> From: Richard Sandiford >>>>>>> Sent: Friday, February 10, 2023 3:57 PM >>>>>>> To: Tamar Christina >>>>>>> Cc: Tamar Christina via Gcc-patches ; nd >>>>>>> ; rguenther@suse.de; jlaw@ventanamicro.com >>>>>>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of >>>>>>> div-bitmask by using new optabs [PR108583] >>>>>>>=20 >>>>>>> Tamar Christina writes: >>>>>>>>>> a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index >>>>>>>>>>=20 >>>>>>>>>=20 >>>>>>>=20 >>>>>=20 >>> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f33 >>>>>>>>> 69 >>>>>>>>>> de2afea139d6 100644 >>>>>>>>>> --- a/gcc/tree-vect-patterns.cc >>>>>>>>>> +++ b/gcc/tree-vect-patterns.cc >>>>>>>>>> @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern >>> (vec_info >>>>>>> *vinfo, >>>>>>>>>> return pattern_stmt; >>>>>>>>>> } >>>>>>>>>> else if ((cst =3D uniform_integer_cst_p (oprnd1)) >>>>>>>>>> - && targetm.vectorize.can_special_div_by_const (rhs_code, >>>>>>>>> vectype, >>>>>>>>>> - wi::to_wide >>>>> (cst), >>>>>>>>>> - NULL, >>>>> NULL_RTX, >>>>>>>>>> - NULL_RTX)) >>>>>>>>>> + && TYPE_UNSIGNED (itype) >>>>>>>>>> + && rhs_code =3D=3D TRUNC_DIV_EXPR >>>>>>>>>> + && vectype >>>>>>>>>> + && direct_internal_fn_supported_p (IFN_ADDH, vectype, >>>>>>>>>> + OPTIMIZE_FOR_SPEED)) >>>>>>>>>> { >>>>>>>>>> - return NULL; >>>>>>>>>> + /* div optimizations using narrowings >>>>>>>>>> + we can do the division e.g. shorts by 255 faster by >>>>>>>>>> + calculating it >>>>> as >>>>>>>>>> + (x + ((x + 257) >> 8)) >> 8 assuming the operation is don= e in >>>>>>>>>> + double the precision of x. >>>>>>>>>> + >>>>>>>>>> + If we imagine a short as being composed of two blocks >>>>>>>>>> + of bytes >>>>>>> then >>>>>>>>>> + adding 257 or 0b0000_0001_0000_0001 to the number is >>>>>>>>>> + equivalent >>>>>>> to >>>>>>>>>> + adding 1 to each sub component: >>>>>>>>>> + >>>>>>>>>> + short value of 16-bits >>>>>>>>>> + =E2=94=8C=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=AC=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94= =80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=90 >>>>>>>>>> + =E2=94=82 =E2=94=82 =E2=94=82= >>>>>>>>>> + =E2=94=94=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=B4=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94= =80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=98 >>>>>>>>>> + 8-bit part1 =E2=96=B2 8-bit part2 =E2=96=B2 >>>>>>>>>> + =E2=94=82 =E2=94=82 >>>>>>>>>> + =E2=94=82 =E2=94=82 >>>>>>>>>> + +1 +1 >>>>>>>>>> + >>>>>>>>>> + after the first addition, we have to shift right by >>>>>>>>>> + 8, and narrow >>>>> the >>>>>>>>>> + results back to a byte. Remember that the addition >>>>>>>>>> + must be done >>>>>>> in >>>>>>>>>> + double the precision of the input. However if we >>>>>>>>>> + know that the >>>>>>>>> addition >>>>>>>>>> + `x + 257` does not overflow then we can do the >>>>>>>>>> + operation in the >>>>>>>>> current >>>>>>>>>> + precision. In which case we don't need the pack and >>> unpacks. >>>>> */ >>>>>>>>>> + auto wcst =3D wi::to_wide (cst); >>>>>>>>>> + int pow =3D wi::exact_log2 (wcst + 1); >>>>>>>>>> + if (pow =3D=3D (int) (element_precision (vectype) / 2)) >>>>>>>>>> + { >>>>>>>>>> + wide_int min,max; >>>>>>>>>> + /* If we're in a pattern we need to find the orginal >>>>> definition. */ >>>>>>>>>> + tree op0 =3D oprnd0; >>>>>>>>>> + gimple *stmt =3D SSA_NAME_DEF_STMT (oprnd0); >>>>>>>>>> + stmt_vec_info stmt_info =3D vinfo->lookup_stmt (stmt); >>>>>>>>>> + if (is_pattern_stmt_p (stmt_info)) >>>>>>>>>> + { >>>>>>>>>> + auto orig_stmt =3D STMT_VINFO_RELATED_STMT >>>>> (stmt_info); >>>>>>>>>> + if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt))) >>>>>>>>>> + op0 =3D gimple_assign_lhs (STMT_VINFO_STMT >>>>> (orig_stmt)); >>>>>>>>>> + } >>>>>>>>>=20 >>>>>>>>> If this is generally safe (I'm skipping thinking about it in >>>>>>>>> the interests of a quick review :-)), then I think it should be >>>>>>>>> done in vect_get_range_info instead. Using gimple_get_lhs >>>>>>>>> would be more general than handling just assignments. >>>>>>>>>=20 >>>>>>>>>> + >>>>>>>>>> + /* Check that no overflow will occur. If we don't have ra= nge >>>>>>>>>> + information we can't perform the optimization. */ >>>>>>>>>> + if (vect_get_range_info (op0, &min, &max)) >>>>>>>>>> + { >>>>>>>>>> + wide_int one =3D wi::to_wide (build_one_cst (itype)); >>>>>>>>>> + wide_int adder =3D wi::add (one, wi::lshift (one, pow)= ); >>>>>>>>>> + wi::overflow_type ovf; >>>>>>>>>> + /* We need adder and max in the same precision. */ >>>>>>>>>> + wide_int zadder >>>>>>>>>> + =3D wide_int_storage::from (adder, wi::get_precision >>>>> (max), >>>>>>>>>> + UNSIGNED); >>>>>>>>>> + wi::add (max, zadder, UNSIGNED, &ovf); >>>>>>>>>=20 >>>>>>>>> Could you explain this a bit more? When do we have mismatched >>>>>>>>> precisions? >>>>>>>>=20 >>>>>>>> C promotion rules will promote e.g. >>>>>>>>=20 >>>>>>>> void fun2(uint8_t* restrict pixel, uint8_t level, int n) { >>>>>>>> for (int i =3D 0; i < n; i+=3D1) >>>>>>>> pixel[i] =3D (pixel[i] + level) / 0xff; } >>>>>>>>=20 >>>>>>>> And have the addition be done as a 32 bit integer. The >>>>>>>> vectorizer will demote this down to a short, but range >>>>>>>> information is not stored for patterns. So In the above the >>>>>>>> range will correctly be 0x1fe but the precision will be that of >>>>>>>> the original expression, so 32. This will be a mismatch with >>>>>>>> itype which is derived from the size the vectorizer >>>>>>> will perform the operation in. >>>>>>>=20 >>>>>>> Gah, missed this first time round, sorry. >>>>>>>=20 >>>>>>> Richi would know better than me, but I think it's dangerous to >>>>>>> rely on the orig/pattern link for range information. The end >>>>>>> result of a pattern >>>>>>> (vect_stmt_to_vectorize) has to have the same type as the lhs of >>>>>>> the original statement. But the other statements in the pattern >>>>>>> sequence can do arbitrary things. Their range isn't predictable >>>>>>> from the range of the original statement result. >>>>>>>=20 >>>>>>> IIRC, the addition above is converted to: >>>>>>>=20 >>>>>>> a' =3D (uint16_t) pixel[i] >>>>>>> b' =3D (uint16_t) level >>>>>>> sum' =3D a' + b' >>>>>>> sum =3D (int) sum' >>>>>>>=20 >>>>>>> where sum is the direct replacement of "pixel[i] + level", with >>>>>>> the same type and range. The division then uses sum' instead of sum= . >>>>>>>=20 >>>>>>> But the fact that sum' is part of the same pattern as sum doesn't >>>>>>> guarantee that sum' has the same range as sum. E.g. the pattern >>>>>>> statements added by the division optimisation wouldn't have this >>>>> property. >>>>>>=20 >>>>>> So my assumption is that no pattern would replace a statement with >>>>>> something That has higher precision than the C statement. The >>>>>> pattern above is demoted By the vectorizer based on range information= >>> already. >>>>>> My assumption was that the precision can only ever be smaller, >>>>>> because otherwise the pattern has violated the semantics of the C >>>>>> code, which >>>>> would be dangerous if e.g. the expression escapes? >>>>>=20 >>>>> IMO the difference in precisions was a symptom of the problem rather >>>>> than the direct cause. >>>>>=20 >>>>> The point is more that "B =3D vect_orig_stmt(A)" just says "A is used >>>>> somehow in a new calculation of B". A might equal B (if A replaces >>>>> B), or A might be an arbitrary temporary result. The code above is >>>>> instead using it to mean "A equals B, expressed in a different type". >>>>> That happens to be true for sum' in the sequence above, but it isn't >>>>> true of non-final pattern statements in general. >>>>>=20 >>>>=20 >>>> Sorry for being dense, but I though that's exactly what the code does >>>> and what I tried explain before. If B isn't a final statement than it w= on't >>> have an original statement. >>>> AFAIK, the only places we set original statement is the root of the pat= tern >>> expression. >>>=20 >>> Final pattern statements (those not in DEF_SEQ) always have the same typ= e >>> and value as the original statements. We wouldn't see mismatched >>> precisions if we were only looking at final pattern statements. >>=20 >> We would because the entire problem is that pattern statement have no ran= ges. >> Ranger does not track them after they have been created. This could of c= ourse >> Trivially be solved if we tell ranger about the demotion we did, but we d= on't do so >> at the moment. It will just return varying here. This is the root cause o= f the issue. >>=20 >>>=20 >>> Like you say, the 16-bit addition didn't exist before vectorisation (it w= as a 32- >>> bit addition instead). So to make things type-correct, the 32-bit addit= ion: >>>=20 >>> A: sum =3D a + b (STMT_VINFO_RELATED_STMT =3D=3D A2) >>>=20 >>> is replaced with: >>>=20 >>> DEF_SEQ: >>> A1: tmp =3D a' + b' (STMT_VINFO_RELATED_STMT =3D=3D A) >>> A2: sum' =3D (int) tmp (STMT_VINFO_RELATED_STMT =3D=3D A) >>>=20 >>> (using different notation from before, just to confuse things). >>> Here, A2 is the final pattern statement for A and A1 is just a temporary= result. >>> sum =3D=3D sum'. >>>=20 >>> Later, we do a similar thing for the division itself. We have: >>>=20 >>> B: quotient =3D sum / 0xff (STMT_VINFO_RELATED_STMT =3D=3D B= 2) >>>=20 >>> We realise that this can be a 16-bit division, so (IIRC) we use >>> vect_look_through_possible_promotion on sum to find the best starting >>> point. This should give: >>>=20 >>> DEF_SEQ: >>> B1: tmp2 =3D tmp / (uint16_t) 0xff (STMT_VINFO_RELATED_STMT =3D=3D B= ) >>> B2: quotient' =3D (int) tmp2 (STMT_VINFO_RELATED_STMT =3D=3D B= ) >>>=20 >>> Both changes are done by vect_widened_op_tree. >>>=20 >>> We then apply the division pattern to B1. B1 is a nonfinal pattern stat= ement >>> that uses the result (tmp) of another nonfinal pattern statement (A1). >>>=20 >>> The code does: >>>=20 >>> if (is_pattern_stmt_p (stmt_info)) >>> { >>> auto orig_stmt =3D STMT_VINFO_RELATED_STMT (stmt_info); >>> if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt))) >>> op0 =3D gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt)); >>> } >>>=20 >>> is_pattern_stmt_p is true for both A1 and A2, and >>> STMT_VINFO_RELATED_STMT is A for both A1 and A2. I would expect: >>>=20 >>> gcc_assert (stmt_info =3D=3D vect_stmt_to_vectorize (orig_stmt)); >>>=20 >>> (testing for a final pattern) to fail for the motivating example. >>>=20 >>=20 >> I think we're actually saying the same thing. I believe all I'm saying is= that looking >> at the original statement is a safe alternative as it conservatively will= overestimate >> to VARYING or give a wider range than the pattern would have. >=20 > Hmm, but you said "If B isn't a final statement than it won't have an > original statement. AFAIK, the only places we set original statement > is the root of the pattern expression." My point was that that isn't true= . > All statements in the pattern have an original statement, not just the roo= t. > And we're specifically relying on that for the motivating example to work.= >=20 >> I'm saying it's conservatively safe, while not overly accurate. The alte= rnative would be >> to tell ranger about the demotions in vect_recog_over_widening_pattern us= ing range::set. >>=20 >> But for this to work the general widening pattern also have to update the= range information. >>=20 >> I think where we're disagreeing is that I think looking at the original s= calar statement is a safe >> conservative estimate. It will fail in some cases, but that's a missed o= ptimization, not a miss-optimization. >=20 > Yeah, like you say, I disagree that it's conservatively correct. > It means that we're hoping (without proving) that the only things > between stmt_info and the final pattern statement are conversions. > I don't think there's any reason in principle why that must hold. >=20 > What would be conservatively correct would be to start from the > final pattern statement and work our way down to the value that > is actually being used. That seems a bit convoluted though, > so I'd prefer not to do that... >=20 >> In any case, if you disagree I don=E2=80=99t' really see a way forward as= ide from making this its own pattern >> running it before the overwidening pattern. >=20 > I think we should look to see if ranger can be persuaded to provide the > range of the 16-bit addition, even though the statement that produces it > isn't part of a BB. It shouldn't matter that the addition originally > came from a 32-bit one: the range follows directly from the ranges of > the operands (i.e. the fact that the operands are the results of > widening conversions). I think you can ask ranger on operations on names defined in the IL, so you c= an work yourself through the sequence of operations in the pattern sequence t= o compute ranges on their defs (and possibly even store them in the SSA info= ). You just need to pick the correct ranger API for this=E2=80=A6. Andrew C= Ced Richard=20 > Thanks, > Richard