From: Richard Sandiford
To: Tamar Christina
Cc: Tamar Christina via Gcc-patches, nd, rguenther@suse.de, jlaw@ventanamicro.com
Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
Date: Fri, 10 Feb 2023 16:57:28 +0000
In-Reply-To: (Tamar Christina's message of "Fri, 10 Feb 2023 16:33:20 +0000")

Tamar Christina writes:
>> -----Original Message-----
>> From: Richard Sandiford
>> Sent: Friday, February 10, 2023 4:25 PM
>> To: Tamar Christina
>> Cc: Tamar Christina via Gcc-patches; nd; rguenther@suse.de; jlaw@ventanamicro.com
>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
>> by using new optabs [PR108583]
>>
>> Tamar Christina writes:
>> >> -----Original Message-----
>> >> From: Richard Sandiford
>> >> Sent: Friday, February 10, 2023 3:57 PM
>> >> To: Tamar Christina
>> >> Cc: Tamar Christina via Gcc-patches; nd; rguenther@suse.de; jlaw@ventanamicro.com
>> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
>> >> div-bitmask by using new optabs [PR108583]
>> >>
>> >> Tamar Christina writes:
>> >> >> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
>> >> >> > 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f3369de2afea139d6 100644
>> >> >> > --- a/gcc/tree-vect-patterns.cc
>> >> >> > +++ b/gcc/tree-vect-patterns.cc
>> >> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info *vinfo,
>> >> >> >        return pattern_stmt;
>> >> >> >      }
>> >> >> >    else if ((cst = uniform_integer_cst_p (oprnd1))
>> >> >> > -          && targetm.vectorize.can_special_div_by_const (rhs_code, vectype,
>> >> >> > -                                                         wi::to_wide (cst),
>> >> >> > -                                                         NULL, NULL_RTX,
>> >> >> > -                                                         NULL_RTX))
>> >> >> > +          && TYPE_UNSIGNED (itype)
>> >> >> > +          && rhs_code == TRUNC_DIV_EXPR
>> >> >> > +          && vectype
>> >> >> > +          && direct_internal_fn_supported_p (IFN_ADDH, vectype,
>> >> >> > +                                             OPTIMIZE_FOR_SPEED))
>> >> >> >      {
>> >> >> > -      return NULL;
>> >> >> > +      /* div optimizations using narrowings
>> >> >> > +         we can do the division e.g. shorts by 255 faster by calculating it as
>> >> >> > +         (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
>> >> >> > +         double the precision of x.
>> >> >> > +
>> >> >> > +         If we imagine a short as being composed of two blocks of bytes then
>> >> >> > +         adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
>> >> >> > +         adding 1 to each sub component:
>> >> >> > +
>> >> >> > +              short value of 16-bits
>> >> >> > +         ┌──────────────┬────────────────┐
>> >> >> > +         │              │                │
>> >> >> > +         └──────────────┴────────────────┘
>> >> >> > +           8-bit part1  ▲   8-bit part2  ▲
>> >> >> > +                        │                │
>> >> >> > +                        │                │
>> >> >> > +                       +1               +1
>> >> >> > +
>> >> >> > +         after the first addition, we have to shift right by 8, and narrow the
>> >> >> > +         results back to a byte.  Remember that the addition must be done in
>> >> >> > +         double the precision of the input.  However if we know that the addition
>> >> >> > +         `x + 257` does not overflow then we can do the operation in the current
>> >> >> > +         precision.  In which case we don't need the pack and unpacks.  */
>> >> >> > +      auto wcst = wi::to_wide (cst);
>> >> >> > +      int pow = wi::exact_log2 (wcst + 1);
>> >> >> > +      if (pow == (int) (element_precision (vectype) / 2))
>> >> >> > +        {
>> >> >> > +          wide_int min,max;
>> >> >> > +          /* If we're in a pattern we need to find the orginal definition.  */
>> >> >> > +          tree op0 = oprnd0;
>> >> >> > +          gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
>> >> >> > +          stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
>> >> >> > +          if (is_pattern_stmt_p (stmt_info))
>> >> >> > +            {
>> >> >> > +              auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
>> >> >> > +              if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
>> >> >> > +                op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
>> >> >> > +            }
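For reference, the identity the quoted comment relies on (x / 0xff == (x + ((x + 257) >> 8)) >> 8 for 16-bit x, with the intermediate sums done in wider precision) can be checked exhaustively.  A minimal standalone sketch, not part of the patch:

  #include <assert.h>
  #include <stdint.h>

  int
  main (void)
  {
    /* Check the 16-bit case: dividing by 0xff via two additions and two
       shifts, with the intermediate sum held in a 32-bit type so that
       x + 257 cannot overflow.  */
    for (uint32_t x = 0; x <= 0xffff; ++x)
      {
        uint32_t q = (x + ((x + 257) >> 8)) >> 8;
        assert (q == x / 0xff);
      }
    return 0;
  }

If the range of x is known to keep x + 257 within 16 bits, the same two additions and shifts can stay in the narrower type, which is what the overflow check further down is guarding.
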
>> >> >>
>> >> >> If this is generally safe (I'm skipping thinking about it in the
>> >> >> interests of a quick review :-)), then I think it should be done
>> >> >> in vect_get_range_info instead.  Using gimple_get_lhs would be
>> >> >> more general than handling just assignments.
>> >> >>
>> >> >> > +
>> >> >> > +          /* Check that no overflow will occur.  If we don't have range
>> >> >> > +             information we can't perform the optimization.  */
>> >> >> > +          if (vect_get_range_info (op0, &min, &max))
>> >> >> > +            {
>> >> >> > +              wide_int one = wi::to_wide (build_one_cst (itype));
>> >> >> > +              wide_int adder = wi::add (one, wi::lshift (one, pow));
>> >> >> > +              wi::overflow_type ovf;
>> >> >> > +              /* We need adder and max in the same precision.  */
>> >> >> > +              wide_int zadder
>> >> >> > +                = wide_int_storage::from (adder, wi::get_precision (max),
>> >> >> > +                                          UNSIGNED);
>> >> >> > +              wi::add (max, zadder, UNSIGNED, &ovf);
>> >> >>
>> >> >> Could you explain this a bit more?  When do we have mismatched
>> >> >> precisions?
>> >> >
>> >> > C promotion rules will promote e.g.
>> >> >
>> >> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
>> >> >   for (int i = 0; i < n; i+=1)
>> >> >     pixel[i] = (pixel[i] + level) / 0xff; }
>> >> >
>> >> > and have the addition be done as a 32-bit integer.  The vectorizer
>> >> > will demote this down to a short, but range information is not
>> >> > stored for patterns.  So in the above the range will correctly be
>> >> > 0x1fe, but the precision will be that of the original expression, so
>> >> > 32.  This will be a mismatch with itype, which is derived from the
>> >> > size the vectorizer will perform the operation in.
>> >>
>> >> Gah, missed this first time round, sorry.
>> >>
>> >> Richi would know better than me, but I think it's dangerous to rely
>> >> on the orig/pattern link for range information.  The end result of a
>> >> pattern (vect_stmt_to_vectorize) has to have the same type as the lhs
>> >> of the original statement.  But the other statements in the pattern
>> >> sequence can do arbitrary things.  Their range isn't predictable from
>> >> the range of the original statement result.
>> >>
>> >> IIRC, the addition above is converted to:
>> >>
>> >>   a' = (uint16_t) pixel[i]
>> >>   b' = (uint16_t) level
>> >>   sum' = a' + b'
>> >>   sum = (int) sum'
>> >>
>> >> where sum is the direct replacement of "pixel[i] + level", with the
>> >> same type and range.  The division then uses sum' instead of sum.
>> >>
>> >> But the fact that sum' is part of the same pattern as sum doesn't
>> >> guarantee that sum' has the same range as sum.  E.g. the pattern
>> >> statements added by the division optimisation wouldn't have this
>> >> property.
>> >
>> > So my assumption is that no pattern would replace a statement with
>> > something that has higher precision than the C statement.  The pattern
>> > above is demoted by the vectorizer based on range information already.
>> > My assumption was that the precision can only ever be smaller, because
>> > otherwise the pattern has violated the semantics of the C code, which
>> > would be dangerous if e.g. the expression escapes?
>>
>> IMO the difference in precisions was a symptom of the problem rather than
>> the direct cause.
>>
>> The point is more that "B = vect_orig_stmt(A)" just says "A is used somehow
>> in a new calculation of B".  A might equal B (if A replaces B), or A might be
>> an arbitrary temporary result.  The code above is instead using it to mean "A
>> equals B, expressed in a different type".  That happens to be true for sum'
>> in the sequence above, but it isn't true of non-final pattern statements in
>> general.
>>
>
> Sorry for being dense, but I thought that's exactly what the code does and
> what I tried to explain before.  If B isn't a final statement then it won't
> have an original statement.  AFAIK, the only place we set the original
> statement is the root of the pattern expression.

Final pattern statements (those not in DEF_SEQ) always have the same type
and value as the original statements.  We wouldn't see mismatched precisions
if we were only looking at final pattern statements.

Like you say, the 16-bit addition didn't exist before vectorisation
(it was a 32-bit addition instead).
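Roughly, the scalar computation per element of fun2 is as follows (a sketch
of the source-level semantics rather than actual dump output; the helper
name is just for illustration):

  #include <stdint.h>

  /* What fun2 computes for one element: by the C promotion rules the
     addition and the division are 32-bit operations, even though the
     operands and the result are 8-bit.  */
  uint8_t
  scale_one (uint8_t pixel, uint8_t level)
  {
    int sum = pixel + level;        /* 32-bit addition, range [0, 0x1fe].  */
    return (uint8_t) (sum / 0xff);  /* 32-bit division.  */
  }

The [0, 0x1fe] range is what lets the vectorizer demote the addition to
16 bits, and it is also the range information that the overflow check in
the patch wants to see.
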
So to make things type-correct, the 32-bit addition:

   A: sum = a + b   (STMT_VINFO_RELATED_STMT == A2)

is replaced with:

   DEF_SEQ:
     A1: tmp = a' + b'    (STMT_VINFO_RELATED_STMT == A)
   A2: sum' = (int) tmp   (STMT_VINFO_RELATED_STMT == A)

(using different notation from before, just to confuse things).
Here, A2 is the final pattern statement for A and A1 is just a
temporary result.  sum == sum'.

Later, we do a similar thing for the division itself.  We have:

   B: quotient = sum / 0xff   (STMT_VINFO_RELATED_STMT == B2)

We realise that this can be a 16-bit division, so (IIRC) we use
vect_look_through_possible_promotion on sum to find the best
starting point.  This should give:

   DEF_SEQ:
     B1: tmp2 = tmp / (uint16_t) 0xff   (STMT_VINFO_RELATED_STMT == B)
   B2: quotient' = (int) tmp2           (STMT_VINFO_RELATED_STMT == B)

Both changes are done by vect_widened_op_tree.

We then apply the division pattern to B1.  B1 is a nonfinal pattern
statement that uses the result (tmp) of another nonfinal pattern
statement (A1).  The code does:

  if (is_pattern_stmt_p (stmt_info))
    {
      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
        op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
    }

is_pattern_stmt_p is true for both A1 and A2, and STMT_VINFO_RELATED_STMT
is A for both A1 and A2.  I would expect:

  gcc_assert (stmt_info == vect_stmt_to_vectorize (orig_stmt));

(testing for a final pattern) to fail for the motivating example.

Thanks,
Richard