From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 51707 invoked by alias); 18 Aug 2017 10:39:31 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 91502 invoked by uid 89); 18 Aug 2017 10:35:38 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-6.3 required=5.0 tests=AWL,BAYES_00,FREEMAIL_FROM,GIT_PATCH_1,RCVD_IN_DNSWL_NONE,RCVD_IN_SORBS_SPAM,SPF_PASS autolearn=ham version=3.3.2 spammy= X-HELO: mail-wr0-f182.google.com Received: from mail-wr0-f182.google.com (HELO mail-wr0-f182.google.com) (209.85.128.182) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Fri, 18 Aug 2017 10:35:33 +0000 Received: by mail-wr0-f182.google.com with SMTP id y96so63966668wrc.1 for ; Fri, 18 Aug 2017 03:35:33 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc; bh=FK50Q7LSLL1ElgK9moTdXFNTRWTLciht5/wynj8fw/A=; b=W3FcVV0ParOz0Sn3Sk9ZidH7eRJwWhTWAKVtVNwWNh+WNWwWAIdVWxLWYHosyoQbGD jeKHF/W9M6MrqcORpRZBdw4BlTlC5gT6mrMTwgo3oakZpJhghQaPi40UC4WHoIxuOwEN YNld4G2d/yFi5JOynIXxPvBNpcpLp7oec7fJs4C6ZPrGspf8cTPvVoK04gZ8QEAyid+y /pMIJKKIAeKFCe0AFQ4wuUlYMLqn93PZtfuoYwsD4+bPlyz5lJJoItu1ri2kFK1+4wwg s9vvApqm6lSViMWKbLBEEJpDx2VW0NqW1xk7w4/SH5LXUt1IZKyWNvCDzR79YkoSnGWi 3r9g== X-Gm-Message-State: AHYfb5hhVTYE4E30yil9P7hlAqzqpWUIIp8QAn1/3ysvRLOL5RNjOx/k s1xu63ASHAS27jcj7nC3TJjMZtbSgQ== X-Received: by 10.80.191.4 with SMTP id f4mr4420596edk.217.1503052531109; Fri, 18 Aug 2017 03:35:31 -0700 (PDT) MIME-Version: 1.0 Received: by 10.80.180.249 with HTTP; Fri, 18 Aug 2017 03:35:30 -0700 (PDT) In-Reply-To: References: <87pobvr5t8.fsf@linaro.org> <877ey3qz90.fsf@linaro.org> <87y3qjpfl2.fsf@linaro.org> <874lt6fmvq.fsf@linaro.org> From: Richard Biener Date: Fri, 18 Aug 2017 12:18:00 -0000 Message-ID: Subject: Re: PR81635: Use chrecs to help find related data refs To: "Bin.Cheng" Cc: gcc-patches List , Richard Sandiford Content-Type: text/plain; charset="UTF-8" X-IsSubscribed: yes X-SW-Source: 2017-08/txt/msg01123.txt.bz2 On Fri, Aug 18, 2017 at 12:30 PM, Richard Biener wrote: > On Thu, Aug 17, 2017 at 2:24 PM, Bin.Cheng wrote: >> On Thu, Aug 17, 2017 at 12:35 PM, Richard Sandiford >> wrote: >>> "Bin.Cheng" writes: >>>> On Wed, Aug 16, 2017 at 6:50 PM, Richard Sandiford >>>> wrote: >>>>> "Bin.Cheng" writes: >>>>>> On Wed, Aug 16, 2017 at 5:00 PM, Richard Sandiford >>>>>> wrote: >>>>>>> "Bin.Cheng" writes: >>>>>>>> On Wed, Aug 16, 2017 at 2:38 PM, Richard Sandiford >>>>>>>> wrote: >>>>>>>>> The first loop in the testcase regressed after my recent changes to >>>>>>>>> dr_analyze_innermost. Previously we would treat "i" as an iv even >>>>>>>>> for bb analysis and end up with: >>>>>>>>> >>>>>>>>> DR_BASE_ADDRESS: p or q >>>>>>>>> DR_OFFSET: 0 >>>>>>>>> DR_INIT: 0 or 4 >>>>>>>>> DR_STEP: 16 >>>>>>>>> >>>>>>>>> We now always keep the step as 0 instead, so for an int "i" we'd have: >>>>>>>>> >>>>>>>>> DR_BASE_ADDRESS: p or q >>>>>>>>> DR_OFFSET: (intptr_t) i >>>>>>>>> DR_INIT: 0 or 4 >>>>>>>>> DR_STEP: 0 >>>>>>>>> >>>>>>>>> This is also what we'd like to have for the unsigned "i", but the >>>>>>>>> problem is that strip_constant_offset thinks that the "i + 1" in >>>>>>>>> "(intptr_t) (i + 1)" could wrap and so doesn't peel off the "+ 1". >>>>>>>>> The [i + 1] accesses therefore have a DR_OFFSET equal to the SSA >>>>>>>>> name that holds "(intptr_t) (i + 1)", meaning that the accesses no >>>>>>>>> longer seem to be related to the [i] ones. >>>>>>>> >>>>>>>> Didn't read the change in detail, so sorry if I mis-understood the issue. >>>>>>>> I made changes in scev to better fold type conversion by various sources >>>>>>>> of information, for example, vrp, niters, undefined overflow behavior etc. >>>>>>>> In theory these information should be available for other >>>>>>>> optimizers without >>>>>>>> querying scev. For the mentioned test, vrp should compute accurate range >>>>>>>> information for "i" so that we can fold (intptr_t) (i + 1) it without >>>>>>>> worrying >>>>>>>> overflow. Note we don't do it in generic folding because >>>>>>>> (intptr_t) (i) + 1 >>>>>>>> could be more expensive (especially in case of (T)(i + j)), or because the >>>>>>>> CST part is in bigger precision after conversion. >>>>>>>> But such folding is wanted in several places, e.g, IVOPTs. To provide such >>>>>>>> an interface, we changed tree-affine and made it do aggressive fold. I am >>>>>>>> curious if it's possible to use aff_tree to implement strip_constant_offset >>>>>>>> here since aggressive folding is wanted. After all, using additional chrec >>>>>>>> looks like a little heavy wrto the simple test. >>>>>>> >>>>>>> Yeah, using aff_tree does work here when the bounds are constant. >>>>>>> It doesn't look like it works for things like: >>>>>>> >>>>>>> double p[1000]; >>>>>>> double q[1000]; >>>>>>> >>>>>>> void >>>>>>> f4 (unsigned int n) >>>>>>> { >>>>>>> for (unsigned int i = 0; i < n; i += 4) >>>>>>> { >>>>>>> double a = q[i] + p[i]; >>>>>>> double b = q[i + 1] + p[i + 1]; >>>>>>> q[i] = a; >>>>>>> q[i + 1] = b; >>>>>>> } >>>>>>> } >>>>>>> >>>>>>> though, where the bounds on the global arrays guarantee that [i + 1] can't >>>>>>> overflow, even though "n" is unconstrained. The patch as posted handles >>>>>>> this case too. >>>>>> BTW is this a missed optimization in value range analysis? The range >>>>>> information for i should flow in a way like: array boundary -> niters >>>>>> -> scev/vrp. >>>>>> I think that's what niters/scev do in analysis. >>>>> >>>>> Yeah, maybe :-) It looks like the problem is that when SLP runs, >>>>> the previous VRP pass came before loop header copying, so the (single) >>>>> header has to cope with n == 0 case. Thus we get: >>>> Ah, there are several passes in between vrp and pass_ch, not sure if >>>> any such pass depends on vrp intensively. I would suggestion reorder >>>> the two passes, or standalone VRP interface updating information for >>>> loop region after header copied? This is a non-trivial issue that >>>> needs to be fixed. Niters analyzer rely on >>>> simplify_using_initial_conditions heavily to get the same information, >>>> which in my opinion should be provided by VRP. Though this won't be >>>> able to obsolete simplify_using_initial_conditions because VRP is weak >>>> in symbolic range... >>>> >>>>> >>>>> Visiting statement: >>>>> i_15 = ASSERT_EXPR ; >>>>> Intersecting >>>>> [0, n_9(D) + 4294967295] EQUIVALENCES: { i_6 } (1 elements) >>>>> and >>>>> [0, 0] >>>>> to >>>>> [0, 0] EQUIVALENCES: { i_6 } (1 elements) >>>>> Intersecting >>>>> [0, 0] EQUIVALENCES: { i_6 } (1 elements) >>>>> and >>>>> [0, 1000] >>>>> to >>>>> [0, 0] EQUIVALENCES: { i_6 } (1 elements) >>>>> Found new range for i_15: [0, 0] >>>>> >>>>> Visiting statement: >>>>> _3 = i_15 + 1; >>>>> Match-and-simplified i_15 + 1 to 1 >>>>> Intersecting >>>>> [1, 1] >>>>> and >>>>> [0, +INF] >>>>> to >>>>> [1, 1] >>>>> Found new range for _3: [1, 1] >>>>> >>>>> (where _3 is the index we care about), followed by: >>>>> >>>>> Visiting statement: >>>>> i_15 = ASSERT_EXPR ; >>>>> Intersectings/aarch64-linux/trunk-orig/debug/gcc' >>>>> [0, n_9(D) + 4294967295] EQUIVALENCES: { i_6 } (1 elements) >>>>> and >>>>> [0, 4] >>>>> to >>>>> [0, n_9(D) + 4294967295] EQUIVALENCES: { i_6 } (1 elements) >>>>> Intersecting >>>>> [0, n_9(D) + 4294967295] EQUIVALENCES: { i_6 } (1 elements) >>>>> and >>>>> [0, 1000] >>>>> to >>>>> [0, n_9(D) + 4294967295] EQUIVALENCES: { i_6 } (1 elements) >>>>> Found new range for i_15: [0, n_9(D) + 4294967295] >>>>> >>>>> Visiting statement: >>>>> _3 = i_15 + 1; >>>>> Intersecting >>>>> [0, +INF] >>>>> and >>>>> [0, +INF] >>>>> to >>>>> [0, +INF] >>>>> Found new range for _3: [0, +INF] >>>>> >>>>> I guess in this case it would be better to intersect the i_15 ranges >>>>> to [0, 1000] rather than [0, n_9(D) + 4294967295]. >>>>> >>>>> It does work if another VRP pass runs after CH. But even then, >>>>> is it a good idea to rely on the range info being kept up-to-date >>>>> all the way through to SLP? A lot happens inbetween. >>>> To some extend yes. Now I understand that SCEV uses niters >>>> information to prove no_overflow. Niters analysis does infer better >>>> information from array boundary, while VRP fails to do that. I don't >>>> worry much about gap between vrp pass and slp, it's basically the same >>>> as niters. Both information are analyzed at one point and meant to be >>>> used by following passes. It's also not common for vrp information >>>> become invalid given we are on SSA? >>> >>> I'm not worried so much about vrp information becoming invalid on >>> an SSA name that existed when VRP was run. It's more a question >>> of what happens about SSA names that get introduced after VRP, >>> e.g. by things like dom, reassoc, PRE, etc. >> For induction variables in concern, these passes shouldn't >> aggressively introduces new variables I think. >>> >>>> Now that data address is not analyzed against loop, VRP would be the >>>> only information we can use unless adding back scev analysis. IMHO, >>>> the patch is doing so in another way than before. It requires >>>> additional chrec data structure. I remember the previous patch >>>> enables more slp vectorization, is it because of "step" information >>>> from scev? >>> >>> Do you mean that: >>> >>> 2017-07-03 Richard Sandiford >>> >>> * tree-data-ref.c (dr_analyze_innermost): Replace the "nest" >>> parameter with a "loop" parameter and use it instead of the >>> loop containing DR_STMT. Don't check simple_iv when doing >>> BB analysis. Describe the two analysis modes in the comment. >>> >>> enabled more SLP vectorisation in bb-slp-pr65935.c? That was due >>> to us not doing IV analysis for BB vectorisation, and ensuring that >>> the step was always zero. >> Which means vectorizer code can handle not IV-analyzed offset, but >> can't for analyzed form? >>> >>>> In this patch, step information is simply discarded. I am >>>> wondering if possible to always analyze scev within innermost loop for >>>> slp while discards step information. >>> >>> Well, we don't calculate a step for bb analysis (i.e. it's not the case >>> the patch calculates step information and then simply discards it). >>> I don't see how that would work. For bb analysis, the DR_OFFSET + DR_INIT >>> has to give the offset for every execution of the block, not just the >>> first iteration of the containing loop. So if we get back a nonzero >>> step, we have to do something with it. >> Yeah. >>> >>> But: >>> >>> (a) the old simple_iv analysis is more expensive than simply calling >>> analyze_scev, so I don't think this is a win in terms of complexity. >>> >>> (b) for bb analysis, there's nothing particularly special about the >>> innermost loop. It makes more sense to analyse it in the innermost >>> loop for which the offset is invariant, as shown by the second >>> testcase in the patch. >>> >>> (c) The patch helps with loop vectorisation too, since analysing the >>> starting DR_OFFSET in the context of the containing loop can help >>> in a similar way as analysing the full offset does for SLP. >> >> I have to admit I am not very much into this method. It complicates >> structure as well as code. >> Mostly because now dr_init are split into two different fields and one >> of it is lazily computed. >> >> For example: >>> @@ -2974,12 +2974,12 @@ vect_vfa_segment_size (struct data_refer >>> vect_no_alias_p (struct data_reference *a, struct data_reference *b, >>> tree segment_length_a, tree segment_length_b) >>> { >>> - gcc_assert (TREE_CODE (DR_INIT (a)) == INTEGER_CST >>> - && TREE_CODE (DR_INIT (b)) == INTEGER_CST); >>> - if (tree_int_cst_equal (DR_INIT (a), DR_INIT (b))) >>> + gcc_assert (TREE_CODE (DR_CHREC_INIT (a)) == INTEGER_CST >>> + && TREE_CODE (DR_CHREC_INIT (b)) == INTEGER_CST); >>> + if (tree_int_cst_equal (DR_CHREC_INIT (a), DR_CHREC_INIT (b))) >>> return false; >>> >>> - tree seg_a_min = DR_INIT (a); >>> + tree seg_a_min = DR_CHREC_INIT (a); >>> tree seg_a_max = fold_build2 (PLUS_EXPR, TREE_TYPE (seg_a_min), >>> seg_a_min, segment_length_a); >>> /* For negative step, we need to adjust address range by TYPE_SIZE_UNIT >>> @@ -2990,10 +2990,10 @@ vect_no_alias_p (struct data_reference * >>> tree unit_size = TYPE_SIZE_UNIT (TREE_TYPE (DR_REF (a))); >>> seg_a_min = fold_build2 (PLUS_EXPR, TREE_TYPE (seg_a_max), >>> seg_a_max, unit_size); >>> - seg_a_max = fold_build2 (PLUS_EXPR, TREE_TYPE (DR_INIT (a)), >>> - DR_INIT (a), unit_size); >>> + seg_a_max = fold_build2 (PLUS_EXPR, TREE_TYPE (DR_CHREC_INIT (a)), >>> + DR_CHREC_INIT (a), unit_size); >>> } >>> - tree seg_b_min = DR_INIT (b); >>> + tree seg_b_min = DR_CHREC_INIT (b); >>> tree seg_b_max = fold_build2 (PLUS_EXPR, TREE_TYPE (seg_b_min), >>> seg_b_min, segment_length_b); >>> if (tree_int_cst_compare (DR_STEP (b), size_zero_node) < 0) >> >> Use of DR_INIT is simply replaced by DR_CHREC_INIT. Is it safe to do >> so in case of non-ZERO >> DR_INIT? It worries me that I may need to think twice before >> referring to DR_INIT because it's >> not clear when DR_OFFSET is split and DR_CHREC_INIT becomes non-ZERO. >> It may simply >> because I am too dumb to handle this. I will leave this to richi. > > I'm trying to understand this a bit (not liking it very much in its > current form). > > Can code currently using DR_OFFSET and DR_INIT simply use > DR_CHREC_INIT and CHREC_LEFT (DR_CHREC_OFFSET) (or DR_CHREC_OFFSET > if DR_CHREC_OFFSET is not a CHREC)? If yes, would there be any downside > in doing that? If not, then why? > > That said, I'm all for making DR info more powerful. But I detest > having extra fields > and adding confusion as to when to use which. Thus if we can make DR_CHREC_INIT > be DR_INIT and DR_OFFSET an inline function accessing CHREC_LEFT if > suitable plus exposing DR_OFFSET_CHREC that would make me more happy. And if we want to make it opt-in do a dr_analyze_me_harder () which will re-write DR_OFFSET/INIT into the new form. But DR_OFFSET and DR_INIT (read) accessors would maintain their semantics while DR_OFFSET_CHREC would expose the CHREC if available. Richard. > One downside might be that the scalar evolution of the offset might pull in > SSA vars into "DR_OFFSET" that are "far away" and thus less optimal for > code-generation when one re-constructs a ref by adding the components? > > Thanks, > Richard. > > >> Thanks, >> bin >>> >>> Thanks, >>> Richard >>> >>>> >>>> Thanks, >>>> bin >>>>> >>>>> FWIW, the old simple_iv check that I removed for bb data-ref analysis >>>>> relies on SCEV analysis too, so I don't think this is more expensive >>>>> than what we had before. >>>>> >>>>> Thanks, >>>>> Richard