From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <stefansf@linux.ibm.com>
Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com
 [148.163.158.5])
 by sourceware.org (Postfix) with ESMTPS id 2CC28385801A
 for <gcc-patches@gcc.gnu.org>; Fri,  7 May 2021 12:32:57 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 2CC28385801A
Received: from pps.filterd (m0098419.ppops.net [127.0.0.1])
 by mx0b-001b2d01.pphosted.com (8.16.0.43/8.16.0.43) with SMTP id
 147C3hg6057432
 for <gcc-patches@gcc.gnu.org>; Fri, 7 May 2021 08:32:56 -0400
Received: from pps.reinject (localhost [127.0.0.1])
 by mx0b-001b2d01.pphosted.com with ESMTP id 38d46n2aun-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT)
 for <gcc-patches@gcc.gnu.org>; Fri, 07 May 2021 08:32:56 -0400
Received: from m0098419.ppops.net (m0098419.ppops.net [127.0.0.1])
 by pps.reinject (8.16.0.43/8.16.0.43) with SMTP id 147C5pY0067938
 for <gcc-patches@gcc.gnu.org>; Fri, 7 May 2021 08:32:56 -0400
Received: from ppma06ams.nl.ibm.com (66.31.33a9.ip4.static.sl-reverse.com
 [169.51.49.102])
 by mx0b-001b2d01.pphosted.com with ESMTP id 38d46n2atr-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Fri, 07 May 2021 08:32:55 -0400
Received: from pps.filterd (ppma06ams.nl.ibm.com [127.0.0.1])
 by ppma06ams.nl.ibm.com (8.16.0.43/8.16.0.43) with SMTP id 147CBuTB030733;
 Fri, 7 May 2021 12:32:53 GMT
Received: from b06cxnps3074.portsmouth.uk.ibm.com
 (d06relay09.portsmouth.uk.ibm.com [9.149.109.194])
 by ppma06ams.nl.ibm.com with ESMTP id 38csqa09q3-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Fri, 07 May 2021 12:32:53 +0000
Received: from d06av23.portsmouth.uk.ibm.com (d06av23.portsmouth.uk.ibm.com
 [9.149.105.59])
 by b06cxnps3074.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id
 147CWplU31850992
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
 Fri, 7 May 2021 12:32:51 GMT
Received: from d06av23.portsmouth.uk.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id 061A0A405B;
 Fri,  7 May 2021 12:32:51 +0000 (GMT)
Received: from d06av23.portsmouth.uk.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id A9F4AA4055;
 Fri,  7 May 2021 12:32:50 +0000 (GMT)
Received: from localhost.localdomain (unknown [9.145.11.82])
 by d06av23.portsmouth.uk.ibm.com (Postfix) with ESMTPS;
 Fri,  7 May 2021 12:32:50 +0000 (GMT)
Date: Fri, 7 May 2021 14:32:50 +0200
From: Stefan Schulze Frielinghaus <stefansf@linux.ibm.com>
To: Richard Biener <richard.guenther@gmail.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [RFC] ldist: Recognize rawmemchr loop patterns
Message-ID: <YJUzcnv1J2HaNro6@localhost.localdomain>
References: <20210208124716.487709-1-stefansf@linux.ibm.com>
 <CAFiYyc12nnPUJRcPAig3nmWvNzxVD8Q9zEWQKPAMDvpwg4g_tA@mail.gmail.com>
 <YCj7GniMRQ5DkS37@localhost.localdomain>
 <CAFiYyc0udwxVePaDsWkpVNh6kbiqke0PVv-xarbsXq3VC-Bz-w@mail.gmail.com>
 <YD/EkX2TRRk1qRh5@localhost.localdomain>
 <YFDnMF3i6OsQa6L9@localhost.localdomain>
 <CAFiYyc0Q3OZqbzGBukt=rvd6VM0EcdLKbv1d2uORbGCGg-Lbzg@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAFiYyc0Q3OZqbzGBukt=rvd6VM0EcdLKbv1d2uORbGCGg-Lbzg@mail.gmail.com>
X-TM-AS-GCONF: 00
X-Proofpoint-GUID: lPCxWBCfdle2joO8e55QbzEpd5n4UV02
X-Proofpoint-ORIG-GUID: bfYR5QNoaoKSzrZxVdS06IEAH7yHx_PU
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.391, 18.0.761
 definitions=2021-05-07_04:2021-05-06,
 2021-05-07 signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
 adultscore=0 impostorscore=0
 malwarescore=0 lowpriorityscore=0 phishscore=0 bulkscore=0 clxscore=1015
 suspectscore=0 priorityscore=1501 spamscore=0 mlxlogscore=999 mlxscore=0
 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2104190000
 definitions=main-2105070083
X-Spam-Status: No, score=-2.6 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_EF, RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H2, SPF_HELO_NONE,
 SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Fri, 07 May 2021 12:32:58 -0000

On Wed, May 05, 2021 at 11:36:41AM +0200, Richard Biener wrote:
> On Tue, Mar 16, 2021 at 6:13 PM Stefan Schulze Frielinghaus
> <stefansf@linux.ibm.com> wrote:
> >
> > [snip]
> >
> > Please find attached a new version of the patch.  A major change compared to
> > the previous patch is that I created a separate pass which hopefully makes
> > reviewing also easier since it is almost self-contained.  After realizing that
> > detecting loops which mimic the behavior of rawmemchr/strlen functions does not
> > really fit into the topic of loop distribution, I created a separate pass.
> 
> It's true that these reduction-like patterns are more difficult than
> the existing
> memcpy/memset cases.
> 
> >  Due
> > to this I was also able to play around a bit and schedule the pass at different
> > times.  Currently it is scheduled right before loop distribution where loop
> > header copying already took place which leads to the following effect.
> 
> In fact I'd schedule it after loop distribution so there's the chance that loop
> distribution can expose a loop that fits the new pattern.
> 
> >  Running
> > this setup over
> >
> > char *t (char *p)
> > {
> >   for (; *p; ++p);
> >   return p;
> > }
> >
> > the new pass transforms
> >
> > char * t (char * p)
> > {
> >   char _1;
> >   char _7;
> >
> >   <bb 2> [local count: 118111600]:
> >   _7 = *p_3(D);
> >   if (_7 != 0)
> >     goto <bb 5>; [89.00%]
> >   else
> >     goto <bb 7>; [11.00%]
> >
> >   <bb 5> [local count: 105119324]:
> >
> >   <bb 3> [local count: 955630225]:
> >   # p_8 = PHI <p_6(6), p_3(D)(5)>
> >   p_6 = p_8 + 1;
> >   _1 = *p_6;
> >   if (_1 != 0)
> >     goto <bb 6>; [89.00%]
> >   else
> >     goto <bb 8>; [11.00%]
> >
> >   <bb 8> [local count: 105119324]:
> >   # p_2 = PHI <p_6(3)>
> >   goto <bb 4>; [100.00%]
> >
> >   <bb 6> [local count: 850510901]:
> >   goto <bb 3>; [100.00%]
> >
> >   <bb 7> [local count: 12992276]:
> >
> >   <bb 4> [local count: 118111600]:
> >   # p_9 = PHI <p_2(8), p_3(D)(7)>
> >   return p_9;
> >
> > }
> >
> > into
> >
> > char * t (char * p)
> > {
> >   char * _5;
> >   char _7;
> >
> >   <bb 2> [local count: 118111600]:
> >   _7 = *p_3(D);
> >   if (_7 != 0)
> >     goto <bb 5>; [89.00%]
> >   else
> >     goto <bb 4>; [11.00%]
> >
> >   <bb 5> [local count: 105119324]:
> >   _5 = p_3(D) + 1;
> >   p_10 = .RAWMEMCHR (_5, 0);
> >
> >   <bb 4> [local count: 118111600]:
> >   # p_9 = PHI <p_10(5), p_3(D)(2)>
> >   return p_9;
> >
> > }
> >
> > which is fine so far.  However, I haven't made up my mind so far whether it is
> > worthwhile to spend more time in order to also eliminate the "first unrolling"
> > of the loop.
> 
> Might be a phiopt transform ;)  Might apply to quite some set of
> builtins.  I wonder how the strlen case looks like though.
> 
> > I gave it a shot by scheduling the pass prior pass copy header
> > and ended up with:
> >
> > char * t (char * p)
> > {
> >   <bb 2> [local count: 118111600]:
> >   p_5 = .RAWMEMCHR (p_3(D), 0);
> >   return p_5;
> >
> > }
> >
> > which seems optimal to me.  The downside of this is that I have to initialize
> > scalar evolution analysis which might be undesired that early.
> >
> > All this brings me to the question where do you see this peace of code running?
> > If in a separate pass when would you schedule it?  If in an existing pass,
> > which one would you choose?
> 
> I think it still fits loop distribution.  If you manage to detect it
> with your pass
> standalone then you should be able to detect it in loop distribution.

If a loop is distributed only because one of the partitions matches a
rawmemchr/strlen-like loop pattern, then we have at least two partitions
which walk over the same memory region.  Since a rawmemchr/strlen-like
loop has no body (neglecting expression-3 of a for-loop where just an
increment happens) it is governed by the memory accesses in the loop
condition.  Therefore, in such a case loop distribution would result in
performance degradation.  This is why I think that it does not fit
conceptually into ldist pass.  However, since I make use of a couple of
helper functions from ldist pass, it may still fit technically.

Since currently all ldist optimizations operate over loops where niters
is known and for rawmemchr/strlen-like loops this is not the case, it is
not possible that those optimizations expose a loop which is suitable
for rawmemchr/strlen optimization.  Therefore, what do you think about
scheduling rawmemchr/strlen optimization right between those
if-statements of function loop_distribution::execute?

   if (nb_generated_loops + nb_generated_calls > 0)
     {
       changed = true;
       if (dump_enabled_p ())
         dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
                          loc, "Loop%s %d distributed: split to %d loops "
                          "and %d library calls.\n", str, loop->num,
                          nb_generated_loops, nb_generated_calls);

       break;
     }

   // rawmemchr/strlen like loops

   if (dump_file && (dump_flags & TDF_DETAILS))
     fprintf (dump_file, "Loop%s %d not distributed.\n", str, loop->num);

> Can you
> explain what part is "easier" as standalone pass?

Yea that term is rather misleading.  It was probably easier for me to
understand the underlying problem and to play around with my code.
There are no technical reasons for a standalone pass.

Cheers,
Stefan

> 
> > Another topic which came up is whether there exists a more elegant solution to
> > my current implementation in order to deal with stores (I'm speaking of the `if
> > (store_dr)` statement inside of function transform_loop_1).  For example,
> >
> > extern char *p;
> > char *t ()
> > {
> >   for (; *p; ++p);
> >   return p;
> > }
> >
> > ends up as
> >
> > char * t ()
> > {
> >   char * _1;
> >   char * _2;
> >   char _3;
> >   char * p.1_8;
> >   char _9;
> >   char * p.1_10;
> >   char * p.1_11;
> >
> >   <bb 2> [local count: 118111600]:
> >   p.1_8 = p;
> >   _9 = *p.1_8;
> >   if (_9 != 0)
> >     goto <bb 5>; [89.00%]
> >   else
> >     goto <bb 7>; [11.00%]
> >
> >   <bb 5> [local count: 105119324]:
> >
> >   <bb 3> [local count: 955630225]:
> >   # p.1_10 = PHI <_1(6), p.1_8(5)>
> >   _1 = p.1_10 + 1;
> >   p = _1;
> >   _3 = *_1;
> >   if (_3 != 0)
> >     goto <bb 6>; [89.00%]
> >   else
> >     goto <bb 8>; [11.00%]
> >
> >   <bb 8> [local count: 105119324]:
> >   # _2 = PHI <_1(3)>
> >   goto <bb 4>; [100.00%]
> >
> >   <bb 6> [local count: 850510901]:
> >   goto <bb 3>; [100.00%]
> >
> >   <bb 7> [local count: 12992276]:
> >
> >   <bb 4> [local count: 118111600]:
> >   # p.1_11 = PHI <_2(8), p.1_8(7)>
> >   return p.1_11;
> >
> > }
> >
> > where inside the loop a load and store occurs.  For a rawmemchr like loop I
> > have to show that we never load from a memory location to which we write.
> > Currently I solve this by hard coding those facts which are not generic at all.
> > I gave compute_data_dependences_for_loop a try which failed to determine the
> > fact that stores only happen to p[0] and loads from p[i] where i>0.  Maybe
> > there are more generic solutions to express this in contrast to my current one?
> 
> So the example loop is not valid to be trasformed to rawmemchr since it's
> valid to call it with p = &p; - but sure, once you pass the first *p != 0 check
> things become fishy but GCC isn't able to turn that into a non-dependence.
> 
> Why is the case of stores inside the loop important?  In fact that you
> think it is
> makes a case for integrating this with loop distribution since loop distribution
> would be able to prove (if possible) that the stores can be separated into
> a different loop.
> 
> And sorry for the delay in answering ...
> 
> Thanks,
> Richard.
> 
> >
> > Thanks again for your input so far.  Really appreciated.
> >
> > Cheers,
> > Stefan