From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by sourceware.org (Postfix) with ESMTP id A99D03858C5F for ; Fri, 10 Feb 2023 16:25:18 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org A99D03858C5F Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id D040F4B3; Fri, 10 Feb 2023 08:26:00 -0800 (PST) Received: from localhost (e121540-lin.manchester.arm.com [10.32.99.50]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 8B2683F703; Fri, 10 Feb 2023 08:25:17 -0800 (PST) From: Richard Sandiford To: Tamar Christina Mail-Followup-To: Tamar Christina ,Tamar Christina via Gcc-patches , nd , "rguenther\@suse.de" , "jlaw\@ventanamicro.com" , richard.sandiford@arm.com Cc: Tamar Christina via Gcc-patches , nd , "rguenther\@suse.de" , "jlaw\@ventanamicro.com" Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583] References: Date: Fri, 10 Feb 2023 16:25:16 +0000 In-Reply-To: (Tamar Christina's message of "Fri, 10 Feb 2023 16:09:58 +0000") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-27.9 required=5.0 tests=BAYES_00,BODY_8BITS,KAM_DMARC_NONE,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,SPF_HELO_NONE,SPF_NONE,TXREP autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: Tamar Christina writes: >> -----Original Message----- >> From: Richard Sandiford >> Sent: Friday, February 10, 2023 3:57 PM >> To: Tamar Christina >> Cc: Tamar Christina via Gcc-patches ; nd >> ; rguenther@suse.de; jlaw@ventanamicro.com >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask >> by using new optabs [PR108583] >>=20 >> Tamar Christina writes: >> >> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index >> >> > >> >> >> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f33 >> >> 69 >> >> > de2afea139d6 100644 >> >> > --- a/gcc/tree-vect-patterns.cc >> >> > +++ b/gcc/tree-vect-patterns.cc >> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info >> *vinfo, >> >> > return pattern_stmt; >> >> > } >> >> > else if ((cst =3D uniform_integer_cst_p (oprnd1)) >> >> > - && targetm.vectorize.can_special_div_by_const (rhs_code, >> >> vectype, >> >> > - wi::to_wide (cst), >> >> > - NULL, NULL_RTX, >> >> > - NULL_RTX)) >> >> > + && TYPE_UNSIGNED (itype) >> >> > + && rhs_code =3D=3D TRUNC_DIV_EXPR >> >> > + && vectype >> >> > + && direct_internal_fn_supported_p (IFN_ADDH, vectype, >> >> > + OPTIMIZE_FOR_SPEED)) >> >> > { >> >> > - return NULL; >> >> > + /* div optimizations using narrowings >> >> > + we can do the division e.g. shorts by 255 faster by calcula= ting it as >> >> > + (x + ((x + 257) >> 8)) >> 8 assuming the operation is done = in >> >> > + double the precision of x. >> >> > + >> >> > + If we imagine a short as being composed of two blocks of by= tes >> then >> >> > + adding 257 or 0b0000_0001_0000_0001 to the number is equiva= lent >> to >> >> > + adding 1 to each sub component: >> >> > + >> >> > + short value of 16-bits >> >> > + =E2=94=8C=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94= =80=E2=94=AC=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=80=E2=94=90 >> >> > + =E2=94=82 =E2=94=82 =E2=94=82 >> >> > + =E2=94=94=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94= =80=E2=94=B4=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=80=E2=94=98 >> >> > + 8-bit part1 =E2=96=B2 8-bit part2 =E2=96=B2 >> >> > + =E2=94=82 =E2=94=82 >> >> > + =E2=94=82 =E2=94=82 >> >> > + +1 +1 >> >> > + >> >> > + after the first addition, we have to shift right by 8, and = narrow the >> >> > + results back to a byte. Remember that the addition must be= done >> in >> >> > + double the precision of the input. However if we know that >> >> > + the >> >> addition >> >> > + `x + 257` does not overflow then we can do the operation in >> >> > + the >> >> current >> >> > + precision. In which case we don't need the pack and unpack= s. */ >> >> > + auto wcst =3D wi::to_wide (cst); >> >> > + int pow =3D wi::exact_log2 (wcst + 1); >> >> > + if (pow =3D=3D (int) (element_precision (vectype) / 2)) >> >> > + { >> >> > + wide_int min,max; >> >> > + /* If we're in a pattern we need to find the orginal definition= . */ >> >> > + tree op0 =3D oprnd0; >> >> > + gimple *stmt =3D SSA_NAME_DEF_STMT (oprnd0); >> >> > + stmt_vec_info stmt_info =3D vinfo->lookup_stmt (stmt); >> >> > + if (is_pattern_stmt_p (stmt_info)) >> >> > + { >> >> > + auto orig_stmt =3D STMT_VINFO_RELATED_STMT (stmt_info); >> >> > + if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt))) >> >> > + op0 =3D gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt)); >> >> > + } >> >> >> >> If this is generally safe (I'm skipping thinking about it in the >> >> interests of a quick review :-)), then I think it should be done in >> >> vect_get_range_info instead. Using gimple_get_lhs would be more >> >> general than handling just assignments. >> >> >> >> > + >> >> > + /* Check that no overflow will occur. If we don't have range >> >> > + information we can't perform the optimization. */ >> >> > + if (vect_get_range_info (op0, &min, &max)) >> >> > + { >> >> > + wide_int one =3D wi::to_wide (build_one_cst (itype)); >> >> > + wide_int adder =3D wi::add (one, wi::lshift (one, pow)); >> >> > + wi::overflow_type ovf; >> >> > + /* We need adder and max in the same precision. */ >> >> > + wide_int zadder >> >> > + =3D wide_int_storage::from (adder, wi::get_precision (max), >> >> > + UNSIGNED); >> >> > + wi::add (max, zadder, UNSIGNED, &ovf); >> >> >> >> Could you explain this a bit more? When do we have mismatched >> >> precisions? >> > >> > C promotion rules will promote e.g. >> > >> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) { >> > for (int i =3D 0; i < n; i+=3D1) >> > pixel[i] =3D (pixel[i] + level) / 0xff; } >> > >> > And have the addition be done as a 32 bit integer. The vectorizer >> > will demote this down to a short, but range information is not stored >> > for patterns. So In the above the range will correctly be 0x1fe but >> > the precision will be that of the original expression, so 32. This >> > will be a mismatch with itype which is derived from the size the vecto= rizer >> will perform the operation in. >>=20 >> Gah, missed this first time round, sorry. >>=20 >> Richi would know better than me, but I think it's dangerous to rely on t= he >> orig/pattern link for range information. The end result of a pattern >> (vect_stmt_to_vectorize) has to have the same type as the lhs of the ori= ginal >> statement. But the other statements in the pattern sequence can do >> arbitrary things. Their range isn't predictable from the range of the o= riginal >> statement result. >>=20 >> IIRC, the addition above is converted to: >>=20 >> a' =3D (uint16_t) pixel[i] >> b' =3D (uint16_t) level >> sum' =3D a' + b' >> sum =3D (int) sum' >>=20 >> where sum is the direct replacement of "pixel[i] + level", with the same= type >> and range. The division then uses sum' instead of sum. >>=20 >> But the fact that sum' is part of the same pattern as sum doesn't guaran= tee >> that sum' has the same range as sum. E.g. the pattern statements added = by >> the division optimisation wouldn't have this property. > > So my assumption is that no pattern would replace a statement with someth= ing > That has higher precision than the C statement. The pattern above is demo= ted > By the vectorizer based on range information already. My assumption was t= hat > the precision can only ever be smaller, because otherwise the pattern has= violated > the semantics of the C code, which would be dangerous if e.g. the express= ion escapes? IMO the difference in precisions was a symptom of the problem rather than the direct cause. The point is more that "B =3D vect_orig_stmt(A)" just says "A is used somehow in a new calculation of B". A might equal B (if A replaces B), or A might be an arbitrary temporary result. The code above is instead using it to mean "A equals B, expressed in a different type". That happens to be true for sum' in the sequence above, but it isn't true of non-final pattern statements in general. In other words, the code hasn't proved that the path from A to vect_stmt_to_vectorize(B) just involves conversions. Applying the range of a pattern result to all temporary results in the pattern could lead to wrong results even when the precisions are all the same. >> Is it possible to tell ranger to compute the range of expressions that h= aven't >> been added to the IL? (Genuine question, haven't looked. >> It seems pretty powerful though.) > > I don't know either, I guess for things it has explicit knowledge about i= t's ok, so > +w or *w would be fine, but with a random IFN_ it'll likely have to punt = as varying. Yeah. But sum' above involves simple arithmetic and conversions, so IFNs shouldn't be a problem. Thanks, Richard