From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ej1-x629.google.com (mail-ej1-x629.google.com [IPv6:2a00:1450:4864:20::629]) by sourceware.org (Postfix) with ESMTPS id 05D763848019 for ; Thu, 20 May 2021 09:25:09 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 05D763848019 Received: by mail-ej1-x629.google.com with SMTP id n2so24170929ejy.7 for ; Thu, 20 May 2021 02:25:08 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=gP7I1o3IZrT3myrj69eF4x5j4D1C7cu/2jtwDnDSZ0g=; b=eSDHarAKc8F1rJVG+HxM+n7JzIb92BtAdjAl5v1FKpvqkueXzKOhmeVT20LhaYfz6o gD9tOTB4EgCVEY2c8si+UFTxosYRkJX/ptev60kgRlETb8N6+WI5yfF2XBuM07aYjjFi B2yqKTR66t9u+Cc0JUUPJt0Kr3YUo9HD+pVsZlkqaKPKOaOpnkfe5ALJs8oIyFmVTXEe pNRx+rve+Avxx6Ro/T1Yy3YkAO+QubU1cRsjyKAAp2TC469aNVhRifrzdvGkmes3ZI4w iSgkoolB1PONvCNj5+7uq5HlWQiknUooSaxHH2DE2+Et4rrFcZf+0tUKJxiFrl1DYD7D dScg== X-Gm-Message-State: AOAM531Fl78jtF6fPF0DxYvyjGqH3yU7+f3LnNTrgjOQWKgbDF+unBsE Aar4CeoDKGhJ3lgBpwn0EgRzPZ5RRWN8kOsBnsM= X-Google-Smtp-Source: ABdhPJw9ZgxGnJRO0xi3W7jtbKe8OKnxpTJPNo3DFHCRLQlWrbwBFSkeyR4oMXUh1pu63w2eIs7RX8Oc2Vtn/cUz1xQ= X-Received: by 2002:a17:906:680d:: with SMTP id k13mr3761625ejr.371.1621502707943; Thu, 20 May 2021 02:25:07 -0700 (PDT) MIME-Version: 1.0 References: <20210208124716.487709-1-stefansf@linux.ibm.com> In-Reply-To: From: Richard Biener Date: Thu, 20 May 2021 11:24:57 +0200 Message-ID: Subject: Re: [RFC] ldist: Recognize rawmemchr loop patterns To: Stefan Schulze Frielinghaus Cc: GCC Patches Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-2.5 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 20 May 2021 09:25:11 -0000 On Fri, May 7, 2021 at 2:32 PM Stefan Schulze Frielinghaus wrote: > > On Wed, May 05, 2021 at 11:36:41AM +0200, Richard Biener wrote: > > On Tue, Mar 16, 2021 at 6:13 PM Stefan Schulze Frielinghaus > > wrote: > > > > > > [snip] > > > > > > Please find attached a new version of the patch. A major change compared to > > > the previous patch is that I created a separate pass which hopefully makes > > > reviewing also easier since it is almost self-contained. After realizing that > > > detecting loops which mimic the behavior of rawmemchr/strlen functions does not > > > really fit into the topic of loop distribution, I created a separate pass. > > > > It's true that these reduction-like patterns are more difficult than > > the existing > > memcpy/memset cases. > > > > > Due > > > to this I was also able to play around a bit and schedule the pass at different > > > times. Currently it is scheduled right before loop distribution where loop > > > header copying already took place which leads to the following effect. > > > > In fact I'd schedule it after loop distribution so there's the chance that loop > > distribution can expose a loop that fits the new pattern. > > > > > Running > > > this setup over > > > > > > char *t (char *p) > > > { > > > for (; *p; ++p); > > > return p; > > > } > > > > > > the new pass transforms > > > > > > char * t (char * p) > > > { > > > char _1; > > > char _7; > > > > > > [local count: 118111600]: > > > _7 = *p_3(D); > > > if (_7 != 0) > > > goto ; [89.00%] > > > else > > > goto ; [11.00%] > > > > > > [local count: 105119324]: > > > > > > [local count: 955630225]: > > > # p_8 = PHI > > > p_6 = p_8 + 1; > > > _1 = *p_6; > > > if (_1 != 0) > > > goto ; [89.00%] > > > else > > > goto ; [11.00%] > > > > > > [local count: 105119324]: > > > # p_2 = PHI > > > goto ; [100.00%] > > > > > > [local count: 850510901]: > > > goto ; [100.00%] > > > > > > [local count: 12992276]: > > > > > > [local count: 118111600]: > > > # p_9 = PHI > > > return p_9; > > > > > > } > > > > > > into > > > > > > char * t (char * p) > > > { > > > char * _5; > > > char _7; > > > > > > [local count: 118111600]: > > > _7 = *p_3(D); > > > if (_7 != 0) > > > goto ; [89.00%] > > > else > > > goto ; [11.00%] > > > > > > [local count: 105119324]: > > > _5 = p_3(D) + 1; > > > p_10 = .RAWMEMCHR (_5, 0); > > > > > > [local count: 118111600]: > > > # p_9 = PHI > > > return p_9; > > > > > > } > > > > > > which is fine so far. However, I haven't made up my mind so far whether it is > > > worthwhile to spend more time in order to also eliminate the "first unrolling" > > > of the loop. > > > > Might be a phiopt transform ;) Might apply to quite some set of > > builtins. I wonder how the strlen case looks like though. > > > > > I gave it a shot by scheduling the pass prior pass copy header > > > and ended up with: > > > > > > char * t (char * p) > > > { > > > [local count: 118111600]: > > > p_5 = .RAWMEMCHR (p_3(D), 0); > > > return p_5; > > > > > > } > > > > > > which seems optimal to me. The downside of this is that I have to initialize > > > scalar evolution analysis which might be undesired that early. > > > > > > All this brings me to the question where do you see this peace of code running? > > > If in a separate pass when would you schedule it? If in an existing pass, > > > which one would you choose? > > > > I think it still fits loop distribution. If you manage to detect it > > with your pass > > standalone then you should be able to detect it in loop distribution. > > If a loop is distributed only because one of the partitions matches a > rawmemchr/strlen-like loop pattern, then we have at least two partitions > which walk over the same memory region. Since a rawmemchr/strlen-like > loop has no body (neglecting expression-3 of a for-loop where just an > increment happens) it is governed by the memory accesses in the loop > condition. Therefore, in such a case loop distribution would result in > performance degradation. This is why I think that it does not fit > conceptually into ldist pass. However, since I make use of a couple of > helper functions from ldist pass, it may still fit technically. > > Since currently all ldist optimizations operate over loops where niters > is known and for rawmemchr/strlen-like loops this is not the case, it is > not possible that those optimizations expose a loop which is suitable > for rawmemchr/strlen optimization. True - though that seems to be an unnecessary restriction. > Therefore, what do you think about > scheduling rawmemchr/strlen optimization right between those > if-statements of function loop_distribution::execute? > > if (nb_generated_loops + nb_generated_calls > 0) > { > changed = true; > if (dump_enabled_p ()) > dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, > loc, "Loop%s %d distributed: split to %d loops " > "and %d library calls.\n", str, loop->num, > nb_generated_loops, nb_generated_calls); > > break; > } > > // rawmemchr/strlen like loops > > if (dump_file && (dump_flags & TDF_DETAILS)) > fprintf (dump_file, "Loop%s %d not distributed.\n", str, loop->num); but we won't ever arrive here because of the niters condition. But yes, doing the pattern matching in the innermost loop processing code looks good to me - for the specific case it would be /* Don't distribute loop if niters is unknown. */ tree niters = number_of_latch_executions (loop); if (niters == NULL_TREE || niters == chrec_dont_know) ---> here? continue; > > Can you > > explain what part is "easier" as standalone pass? > > Yea that term is rather misleading. It was probably easier for me to > understand the underlying problem and to play around with my code. > There are no technical reasons for a standalone pass. And sorry for the late response... Richard. > Cheers, > Stefan > > > > > > Another topic which came up is whether there exists a more elegant solution to > > > my current implementation in order to deal with stores (I'm speaking of the `if > > > (store_dr)` statement inside of function transform_loop_1). For example, > > > > > > extern char *p; > > > char *t () > > > { > > > for (; *p; ++p); > > > return p; > > > } > > > > > > ends up as > > > > > > char * t () > > > { > > > char * _1; > > > char * _2; > > > char _3; > > > char * p.1_8; > > > char _9; > > > char * p.1_10; > > > char * p.1_11; > > > > > > [local count: 118111600]: > > > p.1_8 = p; > > > _9 = *p.1_8; > > > if (_9 != 0) > > > goto ; [89.00%] > > > else > > > goto ; [11.00%] > > > > > > [local count: 105119324]: > > > > > > [local count: 955630225]: > > > # p.1_10 = PHI <_1(6), p.1_8(5)> > > > _1 = p.1_10 + 1; > > > p = _1; > > > _3 = *_1; > > > if (_3 != 0) > > > goto ; [89.00%] > > > else > > > goto ; [11.00%] > > > > > > [local count: 105119324]: > > > # _2 = PHI <_1(3)> > > > goto ; [100.00%] > > > > > > [local count: 850510901]: > > > goto ; [100.00%] > > > > > > [local count: 12992276]: > > > > > > [local count: 118111600]: > > > # p.1_11 = PHI <_2(8), p.1_8(7)> > > > return p.1_11; > > > > > > } > > > > > > where inside the loop a load and store occurs. For a rawmemchr like loop I > > > have to show that we never load from a memory location to which we write. > > > Currently I solve this by hard coding those facts which are not generic at all. > > > I gave compute_data_dependences_for_loop a try which failed to determine the > > > fact that stores only happen to p[0] and loads from p[i] where i>0. Maybe > > > there are more generic solutions to express this in contrast to my current one? > > > > So the example loop is not valid to be trasformed to rawmemchr since it's > > valid to call it with p = &p; - but sure, once you pass the first *p != 0 check > > things become fishy but GCC isn't able to turn that into a non-dependence. > > > > Why is the case of stores inside the loop important? In fact that you > > think it is > > makes a case for integrating this with loop distribution since loop distribution > > would be able to prove (if possible) that the stores can be separated into > > a different loop. > > > > And sorry for the delay in answering ... > > > > Thanks, > > Richard. > > > > > > > > Thanks again for your input so far. Really appreciated. > > > > > > Cheers, > > > Stefan