From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <richard.guenther@gmail.com>
Received: from mail-ej1-x629.google.com (mail-ej1-x629.google.com
 [IPv6:2a00:1450:4864:20::629])
 by sourceware.org (Postfix) with ESMTPS id 05D763848019
 for <gcc-patches@gcc.gnu.org>; Thu, 20 May 2021 09:25:09 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 05D763848019
Received: by mail-ej1-x629.google.com with SMTP id n2so24170929ejy.7
 for <gcc-patches@gcc.gnu.org>; Thu, 20 May 2021 02:25:08 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc;
 bh=gP7I1o3IZrT3myrj69eF4x5j4D1C7cu/2jtwDnDSZ0g=;
 b=eSDHarAKc8F1rJVG+HxM+n7JzIb92BtAdjAl5v1FKpvqkueXzKOhmeVT20LhaYfz6o
 gD9tOTB4EgCVEY2c8si+UFTxosYRkJX/ptev60kgRlETb8N6+WI5yfF2XBuM07aYjjFi
 B2yqKTR66t9u+Cc0JUUPJt0Kr3YUo9HD+pVsZlkqaKPKOaOpnkfe5ALJs8oIyFmVTXEe
 pNRx+rve+Avxx6Ro/T1Yy3YkAO+QubU1cRsjyKAAp2TC469aNVhRifrzdvGkmes3ZI4w
 iSgkoolB1PONvCNj5+7uq5HlWQiknUooSaxHH2DE2+Et4rrFcZf+0tUKJxiFrl1DYD7D
 dScg==
X-Gm-Message-State: AOAM531Fl78jtF6fPF0DxYvyjGqH3yU7+f3LnNTrgjOQWKgbDF+unBsE
 Aar4CeoDKGhJ3lgBpwn0EgRzPZ5RRWN8kOsBnsM=
X-Google-Smtp-Source: ABdhPJw9ZgxGnJRO0xi3W7jtbKe8OKnxpTJPNo3DFHCRLQlWrbwBFSkeyR4oMXUh1pu63w2eIs7RX8Oc2Vtn/cUz1xQ=
X-Received: by 2002:a17:906:680d:: with SMTP id
 k13mr3761625ejr.371.1621502707943; 
 Thu, 20 May 2021 02:25:07 -0700 (PDT)
MIME-Version: 1.0
References: <20210208124716.487709-1-stefansf@linux.ibm.com>
 <CAFiYyc12nnPUJRcPAig3nmWvNzxVD8Q9zEWQKPAMDvpwg4g_tA@mail.gmail.com>
 <YCj7GniMRQ5DkS37@localhost.localdomain>
 <CAFiYyc0udwxVePaDsWkpVNh6kbiqke0PVv-xarbsXq3VC-Bz-w@mail.gmail.com>
 <YD/EkX2TRRk1qRh5@localhost.localdomain>
 <YFDnMF3i6OsQa6L9@localhost.localdomain>
 <CAFiYyc0Q3OZqbzGBukt=rvd6VM0EcdLKbv1d2uORbGCGg-Lbzg@mail.gmail.com>
 <YJUzcnv1J2HaNro6@localhost.localdomain>
In-Reply-To: <YJUzcnv1J2HaNro6@localhost.localdomain>
From: Richard Biener <richard.guenther@gmail.com>
Date: Thu, 20 May 2021 11:24:57 +0200
Message-ID: <CAFiYyc3RR5P4GZXEUiwN2=5d96_pGaHPwPTYCOY=ax-6W3=raw@mail.gmail.com>
Subject: Re: [RFC] ldist: Recognize rawmemchr loop patterns
To: Stefan Schulze Frielinghaus <stefansf@linux.ibm.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>
Content-Type: text/plain; charset="UTF-8"
X-Spam-Status: No, score=-2.5 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, RCVD_IN_DNSWL_NONE,
 SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Thu, 20 May 2021 09:25:11 -0000

On Fri, May 7, 2021 at 2:32 PM Stefan Schulze Frielinghaus
<stefansf@linux.ibm.com> wrote:
>
> On Wed, May 05, 2021 at 11:36:41AM +0200, Richard Biener wrote:
> > On Tue, Mar 16, 2021 at 6:13 PM Stefan Schulze Frielinghaus
> > <stefansf@linux.ibm.com> wrote:
> > >
> > > [snip]
> > >
> > > Please find attached a new version of the patch.  A major change compared to
> > > the previous patch is that I created a separate pass which hopefully makes
> > > reviewing also easier since it is almost self-contained.  After realizing that
> > > detecting loops which mimic the behavior of rawmemchr/strlen functions does not
> > > really fit into the topic of loop distribution, I created a separate pass.
> >
> > It's true that these reduction-like patterns are more difficult than
> > the existing
> > memcpy/memset cases.
> >
> > >  Due
> > > to this I was also able to play around a bit and schedule the pass at different
> > > times.  Currently it is scheduled right before loop distribution where loop
> > > header copying already took place which leads to the following effect.
> >
> > In fact I'd schedule it after loop distribution so there's the chance that loop
> > distribution can expose a loop that fits the new pattern.
> >
> > >  Running
> > > this setup over
> > >
> > > char *t (char *p)
> > > {
> > >   for (; *p; ++p);
> > >   return p;
> > > }
> > >
> > > the new pass transforms
> > >
> > > char * t (char * p)
> > > {
> > >   char _1;
> > >   char _7;
> > >
> > >   <bb 2> [local count: 118111600]:
> > >   _7 = *p_3(D);
> > >   if (_7 != 0)
> > >     goto <bb 5>; [89.00%]
> > >   else
> > >     goto <bb 7>; [11.00%]
> > >
> > >   <bb 5> [local count: 105119324]:
> > >
> > >   <bb 3> [local count: 955630225]:
> > >   # p_8 = PHI <p_6(6), p_3(D)(5)>
> > >   p_6 = p_8 + 1;
> > >   _1 = *p_6;
> > >   if (_1 != 0)
> > >     goto <bb 6>; [89.00%]
> > >   else
> > >     goto <bb 8>; [11.00%]
> > >
> > >   <bb 8> [local count: 105119324]:
> > >   # p_2 = PHI <p_6(3)>
> > >   goto <bb 4>; [100.00%]
> > >
> > >   <bb 6> [local count: 850510901]:
> > >   goto <bb 3>; [100.00%]
> > >
> > >   <bb 7> [local count: 12992276]:
> > >
> > >   <bb 4> [local count: 118111600]:
> > >   # p_9 = PHI <p_2(8), p_3(D)(7)>
> > >   return p_9;
> > >
> > > }
> > >
> > > into
> > >
> > > char * t (char * p)
> > > {
> > >   char * _5;
> > >   char _7;
> > >
> > >   <bb 2> [local count: 118111600]:
> > >   _7 = *p_3(D);
> > >   if (_7 != 0)
> > >     goto <bb 5>; [89.00%]
> > >   else
> > >     goto <bb 4>; [11.00%]
> > >
> > >   <bb 5> [local count: 105119324]:
> > >   _5 = p_3(D) + 1;
> > >   p_10 = .RAWMEMCHR (_5, 0);
> > >
> > >   <bb 4> [local count: 118111600]:
> > >   # p_9 = PHI <p_10(5), p_3(D)(2)>
> > >   return p_9;
> > >
> > > }
> > >
> > > which is fine so far.  However, I haven't made up my mind so far whether it is
> > > worthwhile to spend more time in order to also eliminate the "first unrolling"
> > > of the loop.
> >
> > Might be a phiopt transform ;)  Might apply to quite some set of
> > builtins.  I wonder how the strlen case looks like though.
> >
> > > I gave it a shot by scheduling the pass prior pass copy header
> > > and ended up with:
> > >
> > > char * t (char * p)
> > > {
> > >   <bb 2> [local count: 118111600]:
> > >   p_5 = .RAWMEMCHR (p_3(D), 0);
> > >   return p_5;
> > >
> > > }
> > >
> > > which seems optimal to me.  The downside of this is that I have to initialize
> > > scalar evolution analysis which might be undesired that early.
> > >
> > > All this brings me to the question where do you see this peace of code running?
> > > If in a separate pass when would you schedule it?  If in an existing pass,
> > > which one would you choose?
> >
> > I think it still fits loop distribution.  If you manage to detect it
> > with your pass
> > standalone then you should be able to detect it in loop distribution.
>
> If a loop is distributed only because one of the partitions matches a
> rawmemchr/strlen-like loop pattern, then we have at least two partitions
> which walk over the same memory region.  Since a rawmemchr/strlen-like
> loop has no body (neglecting expression-3 of a for-loop where just an
> increment happens) it is governed by the memory accesses in the loop
> condition.  Therefore, in such a case loop distribution would result in
> performance degradation.  This is why I think that it does not fit
> conceptually into ldist pass.  However, since I make use of a couple of
> helper functions from ldist pass, it may still fit technically.
>
> Since currently all ldist optimizations operate over loops where niters
> is known and for rawmemchr/strlen-like loops this is not the case, it is
> not possible that those optimizations expose a loop which is suitable
> for rawmemchr/strlen optimization.

True - though that seems to be an unnecessary restriction.

>  Therefore, what do you think about
> scheduling rawmemchr/strlen optimization right between those
> if-statements of function loop_distribution::execute?
>
>    if (nb_generated_loops + nb_generated_calls > 0)
>      {
>        changed = true;
>        if (dump_enabled_p ())
>          dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>                           loc, "Loop%s %d distributed: split to %d loops "
>                           "and %d library calls.\n", str, loop->num,
>                           nb_generated_loops, nb_generated_calls);
>
>        break;
>      }
>
>    // rawmemchr/strlen like loops
>
>    if (dump_file && (dump_flags & TDF_DETAILS))
>      fprintf (dump_file, "Loop%s %d not distributed.\n", str, loop->num);

but we won't ever arrive here because of the niters condition.  But
yes, doing the pattern matching in the innermost loop processing code
looks good to me - for the specific case it would be

      /* Don't distribute loop if niters is unknown.  */
      tree niters = number_of_latch_executions (loop);
      if (niters == NULL_TREE || niters == chrec_dont_know)
---> here?
        continue;

> > Can you
> > explain what part is "easier" as standalone pass?
>
> Yea that term is rather misleading.  It was probably easier for me to
> understand the underlying problem and to play around with my code.
> There are no technical reasons for a standalone pass.

And sorry for the late response...

Richard.

> Cheers,
> Stefan
>
> >
> > > Another topic which came up is whether there exists a more elegant solution to
> > > my current implementation in order to deal with stores (I'm speaking of the `if
> > > (store_dr)` statement inside of function transform_loop_1).  For example,
> > >
> > > extern char *p;
> > > char *t ()
> > > {
> > >   for (; *p; ++p);
> > >   return p;
> > > }
> > >
> > > ends up as
> > >
> > > char * t ()
> > > {
> > >   char * _1;
> > >   char * _2;
> > >   char _3;
> > >   char * p.1_8;
> > >   char _9;
> > >   char * p.1_10;
> > >   char * p.1_11;
> > >
> > >   <bb 2> [local count: 118111600]:
> > >   p.1_8 = p;
> > >   _9 = *p.1_8;
> > >   if (_9 != 0)
> > >     goto <bb 5>; [89.00%]
> > >   else
> > >     goto <bb 7>; [11.00%]
> > >
> > >   <bb 5> [local count: 105119324]:
> > >
> > >   <bb 3> [local count: 955630225]:
> > >   # p.1_10 = PHI <_1(6), p.1_8(5)>
> > >   _1 = p.1_10 + 1;
> > >   p = _1;
> > >   _3 = *_1;
> > >   if (_3 != 0)
> > >     goto <bb 6>; [89.00%]
> > >   else
> > >     goto <bb 8>; [11.00%]
> > >
> > >   <bb 8> [local count: 105119324]:
> > >   # _2 = PHI <_1(3)>
> > >   goto <bb 4>; [100.00%]
> > >
> > >   <bb 6> [local count: 850510901]:
> > >   goto <bb 3>; [100.00%]
> > >
> > >   <bb 7> [local count: 12992276]:
> > >
> > >   <bb 4> [local count: 118111600]:
> > >   # p.1_11 = PHI <_2(8), p.1_8(7)>
> > >   return p.1_11;
> > >
> > > }
> > >
> > > where inside the loop a load and store occurs.  For a rawmemchr like loop I
> > > have to show that we never load from a memory location to which we write.
> > > Currently I solve this by hard coding those facts which are not generic at all.
> > > I gave compute_data_dependences_for_loop a try which failed to determine the
> > > fact that stores only happen to p[0] and loads from p[i] where i>0.  Maybe
> > > there are more generic solutions to express this in contrast to my current one?
> >
> > So the example loop is not valid to be trasformed to rawmemchr since it's
> > valid to call it with p = &p; - but sure, once you pass the first *p != 0 check
> > things become fishy but GCC isn't able to turn that into a non-dependence.
> >
> > Why is the case of stores inside the loop important?  In fact that you
> > think it is
> > makes a case for integrating this with loop distribution since loop distribution
> > would be able to prove (if possible) that the stores can be separated into
> > a different loop.
> >
> > And sorry for the delay in answering ...
> >
> > Thanks,
> > Richard.
> >
> > >
> > > Thanks again for your input so far.  Really appreciated.
> > >
> > > Cheers,
> > > Stefan