Date: Fri, 6 Aug 2021 16:02:09 +0200
From: Stefan Schulze Frielinghaus
To: Richard Biener
Cc: GCC Patches
Subject: Re: [RFC] ldist: Recognize rawmemchr loop patterns
ping

On Fri, Jun 25, 2021 at 12:23:32PM +0200, Stefan Schulze Frielinghaus wrote: > On Wed, Jun 16, 2021 at 04:22:35PM +0200, Richard Biener wrote: > > On Mon, Jun 14, 2021 at 7:26 PM Stefan Schulze Frielinghaus > > wrote: > > > > > > On Thu, May 20, 2021 at 08:37:24PM +0200, Stefan Schulze Frielinghaus wrote: > > > [...] > > > > > but we won't ever arrive here because of the niters condition. But > > > > > yes, doing the pattern matching in the innermost loop processing code > > > > > looks good to me - for the specific case it would be > > > > > > > > > > /* Don't distribute loop if niters is unknown. */ > > > > > tree niters = number_of_latch_executions (loop); > > > > > if (niters == NULL_TREE || niters == chrec_dont_know) > > > > > ---> here? > > > > > continue; > > > > > > > > Right, please find attached a new version of the patch where everything > > > > is included in the loop distribution pass. I will do a bootstrap and > > > > regtest on IBM Z over night. If you give me green light I will also do > > > > the same on x86_64. > > > > > > Meanwhile I gave it a shot on x86_64 where the testsuite runs fine (at > > > least the ldist-strlen testcase). If you are Ok with the patch, then I > > > would rebase and run the testsuites again and post a patch series > > > including the rawmemchr implementation for IBM Z. > > > > @@ -3257,6 +3261,464 @@ find_seed_stmts_for_distribution (class loop > > *loop, vec *work_list) > > return work_list->length () > 0; > > } > > > > +static void > > +generate_rawmemchr_builtin (loop_p loop, tree reduction_var, > > + data_reference_p store_dr, tree base, tree pattern, > > + location_t loc) > > +{ > > > > this new function needs a comment. Applies to all of the new ones, btw. > > Done. > > > + gcc_checking_assert (POINTER_TYPE_P (TREE_TYPE (base)) > > + && TREE_TYPE (TREE_TYPE (base)) == TREE_TYPE (pattern)); > > > > this looks fragile and is probably unnecessary as well. > > > > + gcc_checking_assert (TREE_TYPE (reduction_var) == TREE_TYPE (base)); > > > > in general you want types_compatible_p () checks which for pointers means > > all pointers are compatible ... > > True, I removed both asserts. > > > (skipping stuff) > > > > @@ -3321,10 +3783,20 @@ loop_distribution::execute (function *fun) > > && !optimize_loop_for_speed_p (loop))) > > continue; > > > > - /* Don't distribute loop if niters is unknown. */ > > + /* If niters is unknown don't distribute loop but rather try to transform > > + it to a call to a builtin. */ > > tree niters = number_of_latch_executions (loop); > > if (niters == NULL_TREE || niters == chrec_dont_know) > > - continue; > > + { > > + if (transform_reduction_loop (loop)) > > + { > > + changed = true; > > + loops_to_be_destroyed.safe_push (loop); > > + if (dump_file) > > + fprintf (dump_file, "Loop %d transformed into a > > builtin.\n", loop->num); > > + } > > + continue; > > + } > > > > please look at > > > > if (nb_generated_loops + nb_generated_calls > 0) > > { > > changed = true; > > if (dump_enabled_p ()) > > dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, > > loc, "Loop%s %d distributed: split to > > %d loops " > > "and %d library calls.\n", str, loop->num, > > nb_generated_loops, nb_generated_calls); > > > > and follow the use of dump_* and MSG_OPTIMIZED_LOCATIONS so the > > transforms are reported with -fopt-info-loop > > Done.
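For illustration, a minimal sketch of the kind of loop this new code path is meant to catch once the niters check fails (not part of the patch; the function name is made up, the shape and constant mirror the ldist-rawmemchr testcases further down):

  #include <stdint.h>

  /* The number of iterations is unknown at compile time, so classic loop
     distribution gives up and transform_reduction_loop gets a chance.
     Single load, constant pattern, no store, and the pointer itself is
     the only value live after the loop.  */
  uint32_t *scan32 (uint32_t *p)
  {
    while (*p != 0xabcdef15)
      ++p;
    return p;
  }

With the patch applied and -O2 -ftree-loop-distribution, on a target that provides the rawmemchr optab for SImode (the IBM Z implementation is to be posted separately, per the thread) such a loop should be replaced by a call to the new IFN_RAWMEMCHR internal function, and with -fopt-info-loop the transform should show up via the "Loop %d transformed into a builtin." message quoted above.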
> > > + > > + return transform_reduction_loop_1 (loop, load_dr, store_dr, reduction_var); > > +} > > > > what's the point in tail-calling here and visually splitting the > > function in half? > > In the first place I thought that this is more pleasant since in > transform_reduction_loop_1 it is settled that we have a single load, > store, and reduction variable. After refactoring this isn't true > anymore and I inlined the function and made this clear via a comment. > > > > > (sorry for picking random pieces now ;)) > > > > + for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi); > > + gsi_next (&bsi), ++ninsns) > > + { > > > > this counts debug insns, I guess you want gsi_next_nondebug at least. > > not sure why you are counting PHIs at all btw - for the loops you match > > you are expecting at most two, one IV and eventually one for the virtual > > operand of the store? > > Yes, I removed the counting for the phi loop and changed to > gsi_next_nondebug for both loops. > > > > > + if (gimple_has_volatile_ops (phi)) > > + return false; > > > > PHIs never have volatile ops. > > > > + if (gimple_clobber_p (phi)) > > + continue; > > > > or are clobbers. > > Removed both. > > > Btw, can you factor out a helper from find_single_drs working on a > > stmt to reduce code duplication? > > Ahh sorry for that. I've already done this in one of my first patches > but didn't copy that over. Although my changes do not require a RDG the > whole pass is based upon this data structure. Therefore, in order to > share more code I decided to temporarily build the RDG so that I can > call into find_single_drs. Since the graph is rather small I guess the > overhead is acceptable w.r.t. code sharing. > > struct graph *rdg = build_rdg (loop, NULL); > if (rdg == NULL) > { > if (dump_file && (dump_flags & TDF_DETAILS)) > fprintf (dump_file, > "Loop %d not transformed: failed to build the RDG.\n", > loop->num); > > return false; > } > auto_bitmap partition_stmts; > bitmap_set_range (partition_stmts, 0, rdg->n_vertices); > find_single_drs (loop, rdg, partition_stmts, &store_dr, &load_dr); > free_rdg (rdg); > > As a side-effect of this, now, I also have to (de)allocate the class > member datarefs_vec prior/after calling into transform_reduction_loop: > > /* If niters is unknown don't distribute loop but rather try to transform > it to a call to a builtin. */ > tree niters = number_of_latch_executions (loop); > if (niters == NULL_TREE || niters == chrec_dont_know) > { > datarefs_vec.create (20); > if (transform_reduction_loop (loop)) > { > changed = true; > loops_to_be_destroyed.safe_push (loop); > if (dump_enabled_p ()) > { > dump_user_location_t loc = find_loop_location (loop); > dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, > loc, "Loop %d transformed into a builtin.\n", > loop->num); > } > } > free_data_refs (datarefs_vec); > continue; > } > > > > > + tree reduction_var; > > + switch (gimple_code (reduction_stmt)) > > + { > > + case GIMPLE_PHI: > > + reduction_var = gimple_phi_result (reduction_stmt); > > + break; > > + case GIMPLE_ASSIGN: > > + reduction_var = gimple_assign_lhs (reduction_stmt); > > + break; > > + default: > > + /* Bail out e.g. for GIMPLE_CALL. */ > > + return false; > > > > gimple_get_lhs (reduction_stmt); would work for both PHIs > > and assigns. > > Done. > > > > > + if (reduction_var == NULL) > > + return false; > > > > it can never be NULL here. > > True, otherwise the reduction statement wouldn't have a dependence outside > the loop. => Removed. 
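To make the reduction-variable discussion concrete, here is a small sketch (essentially test_pos from the new ldist-strlen-2.c testcase further down, not additional functionality):

  /* strlen-like loop with a non-zero start index.  The only value defined
     inside the loop and used after it is the counter i, so the statement
     defining the live-out value of i is the reduction statement and
     reduction_var is its lhs; gimple_get_lhs covers both the PHI and the
     assignment form.  For the rawmemchr-like loops the reduction variable
     is the pointer induction variable instead.  */
  int test_pos (char *s)
  {
    int i;
    for (i = 42; s[i]; ++i);
    return i;
  }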
> > > > > + /* Bail out if this is a bitfield memory reference. */ > > + if (TREE_CODE (DR_REF (load_dr)) == COMPONENT_REF > > + && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (load_dr), 1))) > > + return false; > > ... > > > > I see this is again quite some code copied from find_single_drs, please > > see how to avoid this much duplication by splitting out helpers. > > Sorry again. Hope the solution above is more appropriate. > > > > > +static bool > > +transform_reduction_loop_1 (loop_p loop, > > + data_reference_p load_dr, > > + data_reference_p store_dr, > > + tree reduction_var) > > +{ > > + tree load_ref = DR_REF (load_dr); > > + tree load_type = TREE_TYPE (load_ref); > > + tree load_access_base = build_fold_addr_expr (load_ref); > > + tree load_access_size = TYPE_SIZE_UNIT (load_type); > > + affine_iv load_iv, reduction_iv; > > + tree pattern; > > + > > + /* A limitation of the current implementation is that we only support > > + constant patterns. */ > > + edge e = single_exit (loop); > > + gcond *cond_stmt = safe_dyn_cast (last_stmt (e->src)); > > + if (!cond_stmt) > > + return false; > > > > that looks like checks to be done at the start of > > transform_reduction_loop, not this late. > > Pulled this to the very beginning of transform_reduction_loop. > > > > > + if (gimple_cond_code (cond_stmt) != NE_EXPR > > + || gimple_cond_lhs (cond_stmt) != gimple_assign_lhs (DR_STMT (load_dr)) > > + || TREE_CODE (pattern) != INTEGER_CST) > > + return false; > > > > half of this as well. Btw, there's no canonicalization for > > the tests so you have to verify the false edge actually exits > > the loop and allow for EQ_EXPR in case the false edge does. > > Uh good point. I added checks for that and pulled most of it to the > beginning of transform_reduction_loop. > > > > > + /* Handle strlen like loops. */ > > + if (store_dr == NULL > > + && integer_zerop (pattern) > > + && TREE_CODE (reduction_iv.base) == INTEGER_CST > > + && TREE_CODE (reduction_iv.step) == INTEGER_CST > > + && integer_onep (reduction_iv.step) > > + && (types_compatible_p (TREE_TYPE (reduction_var), size_type_node) > > + || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)))) > > + { > > > > I wonder what goes wrong with a larger or smaller wrapping IV type? > > The iteration > > only stops when you load a NUL and the increments just wrap along (you're > > using the pointer IVs to compute the strlen result). Can't you simply truncate? > > I think truncation is enough as long as no overflow occurs in strlen or > strlen_using_rawmemchr. > > > For larger than size_type_node (actually larger than ptr_type_node would matter > > I guess), the argument is that since pointer wrapping would be undefined anyway > > the IV cannot wrap either. Now, the correct check here would IMHO be > > > > TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION > > (ptr_type_node) > > || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var)) > > > > ? > > Regarding the implementation which makes use of rawmemchr: > > We can count at most PTRDIFF_MAX many bytes without an overflow. Thus, > the maximal length we can determine of a string where each character has > size S is PTRDIFF_MAX / S without an overflow. Since an overflow for > ptrdiff type is undefined we have to make sure that if an overflow > occurs, then an overflow occurs for reduction variable, too, and that > this is undefined, too. However, I'm not sure anymore whether we want > to respect overflows in all cases. 
If TYPE_PRECISION (ptr_type_node) > equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then > this would mean that a single string consumes more than half of the > virtual addressable memory. At least for architectures where > TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is reasonable > to neglect the case where computing pointer difference may overflow. > Otherwise we are talking about strings with lengths of multiple > pebibytes. For other architectures we might have to be more precise > and make sure that reduction variable overflows first and that this is > undefined. > > Thus a conservative condition would be (I assumed that the size of any > integral type is a power of two, though I'm not sure this really holds; > IIRC the C standard requires only that the alignment is a power of two > but not necessarily the size so I might need to change this): > > /* Compute precision (reduction_var) < (precision (ptrdiff_type) - 1 - log2 (sizeof (load_type))) > or in other words return true if reduction variable overflows first > and false otherwise. */ > > static bool > reduction_var_overflows_first (tree reduction_var, tree load_type) > { > unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node); > unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE (reduction_var)); > unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT (load_type))); > return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - size_exponent); > } > > TYPE_PRECISION (ptrdiff_type_node) == 64 > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)) > && reduction_var_overflows_first (reduction_var, load_type)) > > Regarding the implementation which makes use of strlen: > > I'm not sure what it means if strlen is called for a string with a > length greater than SIZE_MAX. Therefore, similar to the implementation > using rawmemchr where we neglect the case of an overflow for 64-bit > architectures, a conservative condition would be: > > TYPE_PRECISION (size_type_node) == 64 > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)) > && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (size_type_node)) > > I still included the overflow undefined check for reduction variable in > order to rule out situations where the reduction variable is unsigned > and overflows as many times as needed until strlen(,_using_rawmemchr) overflows, > too. Maybe this is all theoretical nonsense but I'm afraid of uncommon > architectures. Anyhow, while writing this down it becomes clear that > this deserves a comment which I will add once it becomes clear which way > to go. > > > > + if (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))) > > + { > > + const char *msg = G_("assuming signed overflow does not occur " > > + "when optimizing strlen like loop"); > > + fold_overflow_warning (msg, WARN_STRICT_OVERFLOW_MISC); > > + } > > > > no, please don't add any new strict-overflow warnings ;) > > I just stumbled over code which produces such a warning and thought this > is a hard requirement :D The new patch doesn't contain it anymore. > > > > > > The generate_*_builtin routines need some factoring - if you code-generate > > into a gimple_seq you could use gimple_build () which would do the fold_stmt > > (not sure why you do that - you should see to fold the call, not necessarily > > the rest). The replacement of reduction_var and the dumping could be shared. > > There's also GET_MODE_NAME for the printing. > > I wasn't really sure which way to go.
Use a gsi, as it is done by > existing generate_* functions, or make use of gimple_seq. Since the > latter uses internally also gsi I thought it is better to stick to gsi > in the first place. Now, after changing to gimple_seq I see the beauty > of it :) > > I created two helper functions generate_strlen_builtin_1 and > generate_reduction_builtin_1 in order to reduce code duplication. > > In function generate_strlen_builtin I changed from using > builtin_decl_implicit (BUILT_IN_STRLEN) to builtin_decl_explicit > (BUILT_IN_STRLEN) since the former could return a NULL pointer. I'm not > sure whether my intuition about the difference between implicit and > explicit builtins is correct. In builtins.def there is a small example > given which I would paraphrase as "use builtin_decl_explicit if the > semantics of the builtin is defined by the C standard; otherwise use > builtin_decl_implicit" but probably my intuition is wrong? > > Beside that I'm not sure whether I really have to call > build_fold_addr_expr which looks superfluous to me since > gimple_build_call can deal with ADDR_EXPR as well as FUNCTION_DECL: > > tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN)); > gimple *fn_call = gimple_build_call (fn, 1, mem); > > However, since it is also used that way in the context of > generate_memset_builtin I didn't remove it so far. > > > I think overall the approach is sound now but the details still need work. > > Once again thank you very much for your review. Really appreciated! > > Cheers, > Stefan > commit d24639c895ce3c0f9539570bab7b6510e98b1ffa > Author: Stefan Schulze Frielinghaus > Date: Wed Mar 17 09:00:06 2021 +0100 > > foo > > diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c > index d209a52f823..633f41d9e6d 100644 > --- a/gcc/internal-fn.c > +++ b/gcc/internal-fn.c > @@ -2929,6 +2929,35 @@ expand_VEC_CONVERT (internal_fn, gcall *) > gcc_unreachable (); > } > > +/* Expand IFN_RAWMEMCHAR internal function. */ > + > +void > +expand_RAWMEMCHR (internal_fn, gcall *stmt) > +{ > + expand_operand ops[3]; > + > + tree lhs = gimple_call_lhs (stmt); > + if (!lhs) > + return; > + tree lhs_type = TREE_TYPE (lhs); > + rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE); > + create_output_operand (&ops[0], lhs_rtx, TYPE_MODE (lhs_type)); > + > + for (unsigned int i = 0; i < 2; ++i) > + { > + tree rhs = gimple_call_arg (stmt, i); > + tree rhs_type = TREE_TYPE (rhs); > + rtx rhs_rtx = expand_normal (rhs); > + create_input_operand (&ops[i + 1], rhs_rtx, TYPE_MODE (rhs_type)); > + } > + > + insn_code icode = direct_optab_handler (rawmemchr_optab, ops[2].mode); > + > + expand_insn (icode, 3, ops); > + if (!rtx_equal_p (lhs_rtx, ops[0].value)) > + emit_move_insn (lhs_rtx, ops[0].value); > +} > + > /* Expand the IFN_UNIQUE function according to its first argument. */ > > static void > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def > index daeace7a34e..95c76795648 100644 > --- a/gcc/internal-fn.def > +++ b/gcc/internal-fn.def > @@ -348,6 +348,7 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL) > DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL) > DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL) > DEF_INTERNAL_FN (VEC_CONVERT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL) > +DEF_INTERNAL_FN (RAWMEMCHR, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL) > > /* An unduplicable, uncombinable function. 
Generally used to preserve > a CFG property in the face of jump threading, tail merging or > diff --git a/gcc/optabs.def b/gcc/optabs.def > index b192a9d070b..f7c69f914ce 100644 > --- a/gcc/optabs.def > +++ b/gcc/optabs.def > @@ -267,6 +267,7 @@ OPTAB_D (cpymem_optab, "cpymem$a") > OPTAB_D (movmem_optab, "movmem$a") > OPTAB_D (setmem_optab, "setmem$a") > OPTAB_D (strlen_optab, "strlen$a") > +OPTAB_D (rawmemchr_optab, "rawmemchr$I$a") > > OPTAB_DC(fma_optab, "fma$a4", FMA) > OPTAB_D (fms_optab, "fms$a4") > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c > new file mode 100644 > index 00000000000..6db62d7644d > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c > @@ -0,0 +1,72 @@ > +/* { dg-do run { target s390x-*-* } } */ > +/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */ > +/* { dg-final { scan-tree-dump-times "generated rawmemchrQI" 2 "ldist" { target s390x-*-* } } } */ > +/* { dg-final { scan-tree-dump-times "generated rawmemchrHI" 2 "ldist" { target s390x-*-* } } } */ > +/* { dg-final { scan-tree-dump-times "generated rawmemchrSI" 2 "ldist" { target s390x-*-* } } } */ > + > +/* Rawmemchr pattern: reduction stmt but no store */ > + > +#include > +#include > + > +typedef __SIZE_TYPE__ size_t; > +extern void* malloc (size_t); > +extern void* memset (void*, int, size_t); > + > +#define test(T, pattern) \ > +__attribute__((noinline)) \ > +T *test_##T (T *p) \ > +{ \ > + while (*p != (T)pattern) \ > + ++p; \ > + return p; \ > +} > + > +test (uint8_t, 0xab) > +test (uint16_t, 0xabcd) > +test (uint32_t, 0xabcdef15) > + > +test (int8_t, 0xab) > +test (int16_t, 0xabcd) > +test (int32_t, 0xabcdef15) > + > +#define run(T, pattern, i) \ > +{ \ > +T *q = p; \ > +q[i] = (T)pattern; \ > +assert (test_##T (p) == &q[i]); \ > +q[i] = 0; \ > +} > + > +int main(void) > +{ > + void *p = malloc (1024); > + assert (p); > + memset (p, 0, 1024); > + > + run (uint8_t, 0xab, 0); > + run (uint8_t, 0xab, 1); > + run (uint8_t, 0xab, 13); > + > + run (uint16_t, 0xabcd, 0); > + run (uint16_t, 0xabcd, 1); > + run (uint16_t, 0xabcd, 13); > + > + run (uint32_t, 0xabcdef15, 0); > + run (uint32_t, 0xabcdef15, 1); > + run (uint32_t, 0xabcdef15, 13); > + > + run (int8_t, 0xab, 0); > + run (int8_t, 0xab, 1); > + run (int8_t, 0xab, 13); > + > + run (int16_t, 0xabcd, 0); > + run (int16_t, 0xabcd, 1); > + run (int16_t, 0xabcd, 13); > + > + run (int32_t, 0xabcdef15, 0); > + run (int32_t, 0xabcdef15, 1); > + run (int32_t, 0xabcdef15, 13); > + > + return 0; > +} > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c > new file mode 100644 > index 00000000000..00d6ea0f8e9 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c > @@ -0,0 +1,83 @@ > +/* { dg-do run { target s390x-*-* } } */ > +/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */ > +/* { dg-final { scan-tree-dump-times "generated rawmemchrQI" 2 "ldist" { target s390x-*-* } } } */ > +/* { dg-final { scan-tree-dump-times "generated rawmemchrHI" 2 "ldist" { target s390x-*-* } } } */ > +/* { dg-final { scan-tree-dump-times "generated rawmemchrSI" 2 "ldist" { target s390x-*-* } } } */ > + > +/* Rawmemchr pattern: reduction stmt and store */ > + > +#include > +#include > + > +typedef __SIZE_TYPE__ size_t; > +extern void* malloc (size_t); > +extern void* memset (void*, int, size_t); > + > +uint8_t *p_uint8_t; > +uint16_t *p_uint16_t; > +uint32_t *p_uint32_t; 
> + > +int8_t *p_int8_t; > +int16_t *p_int16_t; > +int32_t *p_int32_t; > + > +#define test(T, pattern) \ > +__attribute__((noinline)) \ > +T *test_##T (void) \ > +{ \ > + while (*p_##T != pattern) \ > + ++p_##T; \ > + return p_##T; \ > +} > + > +test (uint8_t, 0xab) > +test (uint16_t, 0xabcd) > +test (uint32_t, 0xabcdef15) > + > +test (int8_t, (int8_t)0xab) > +test (int16_t, (int16_t)0xabcd) > +test (int32_t, (int32_t)0xabcdef15) > + > +#define run(T, pattern, i) \ > +{ \ > +T *q = p; \ > +q[i] = pattern; \ > +p_##T = p; \ > +T *r = test_##T (); \ > +assert (r == p_##T); \ > +assert (r == &q[i]); \ > +q[i] = 0; \ > +} > + > +int main(void) > +{ > + void *p = malloc (1024); > + assert (p); > + memset (p, '\0', 1024); > + > + run (uint8_t, 0xab, 0); > + run (uint8_t, 0xab, 1); > + run (uint8_t, 0xab, 13); > + > + run (uint16_t, 0xabcd, 0); > + run (uint16_t, 0xabcd, 1); > + run (uint16_t, 0xabcd, 13); > + > + run (uint32_t, 0xabcdef15, 0); > + run (uint32_t, 0xabcdef15, 1); > + run (uint32_t, 0xabcdef15, 13); > + > + run (int8_t, (int8_t)0xab, 0); > + run (int8_t, (int8_t)0xab, 1); > + run (int8_t, (int8_t)0xab, 13); > + > + run (int16_t, (int16_t)0xabcd, 0); > + run (int16_t, (int16_t)0xabcd, 1); > + run (int16_t, (int16_t)0xabcd, 13); > + > + run (int32_t, (int32_t)0xabcdef15, 0); > + run (int32_t, (int32_t)0xabcdef15, 1); > + run (int32_t, (int32_t)0xabcdef15, 13); > + > + return 0; > +} > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c > new file mode 100644 > index 00000000000..918b60099e4 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c > @@ -0,0 +1,100 @@ > +/* { dg-do run } */ > +/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */ > +/* { dg-final { scan-tree-dump-times "generated strlenQI\n" 4 "ldist" } } */ > +/* { dg-final { scan-tree-dump-times "generated strlenHI\n" 4 "ldist" { target s390x-*-* } } } */ > +/* { dg-final { scan-tree-dump-times "generated strlenSI\n" 4 "ldist" { target s390x-*-* } } } */ > + > +#include > +#include > + > +typedef __SIZE_TYPE__ size_t; > +extern void* malloc (size_t); > +extern void* memset (void*, int, size_t); > + > +#define test(T, U) \ > +__attribute__((noinline)) \ > +U test_##T##U (T *s) \ > +{ \ > + U i; \ > + for (i=0; s[i]; ++i); \ > + return i; \ > +} > + > +test (uint8_t, size_t) > +test (uint16_t, size_t) > +test (uint32_t, size_t) > +test (uint8_t, int) > +test (uint16_t, int) > +test (uint32_t, int) > + > +test (int8_t, size_t) > +test (int16_t, size_t) > +test (int32_t, size_t) > +test (int8_t, int) > +test (int16_t, int) > +test (int32_t, int) > + > +#define run(T, U, i) \ > +{ \ > +T *q = p; \ > +q[i] = 0; \ > +assert (test_##T##U (p) == i); \ > +memset (&q[i], 0xf, sizeof (T)); \ > +} > + > +int main(void) > +{ > + void *p = malloc (1024); > + assert (p); > + memset (p, 0xf, 1024); > + > + run (uint8_t, size_t, 0); > + run (uint8_t, size_t, 1); > + run (uint8_t, size_t, 13); > + > + run (int8_t, size_t, 0); > + run (int8_t, size_t, 1); > + run (int8_t, size_t, 13); > + > + run (uint8_t, int, 0); > + run (uint8_t, int, 1); > + run (uint8_t, int, 13); > + > + run (int8_t, int, 0); > + run (int8_t, int, 1); > + run (int8_t, int, 13); > + > + run (uint16_t, size_t, 0); > + run (uint16_t, size_t, 1); > + run (uint16_t, size_t, 13); > + > + run (int16_t, size_t, 0); > + run (int16_t, size_t, 1); > + run (int16_t, size_t, 13); > + > + run (uint16_t, int, 0); > + run (uint16_t, int, 1); > + run (uint16_t, int, 13); > + > + run 
(int16_t, int, 0); > + run (int16_t, int, 1); > + run (int16_t, int, 13); > + > + run (uint32_t, size_t, 0); > + run (uint32_t, size_t, 1); > + run (uint32_t, size_t, 13); > + > + run (int32_t, size_t, 0); > + run (int32_t, size_t, 1); > + run (int32_t, size_t, 13); > + > + run (uint32_t, int, 0); > + run (uint32_t, int, 1); > + run (uint32_t, int, 13); > + > + run (int32_t, int, 0); > + run (int32_t, int, 1); > + run (int32_t, int, 13); > + > + return 0; > +} > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c > new file mode 100644 > index 00000000000..e25d6ea5b56 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c > @@ -0,0 +1,58 @@ > +/* { dg-do run } */ > +/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */ > +/* { dg-final { scan-tree-dump-times "generated strlenQI\n" 3 "ldist" } } */ > + > +#include > + > +typedef __SIZE_TYPE__ size_t; > +extern void* malloc (size_t); > +extern void* memset (void*, int, size_t); > + > +__attribute__((noinline)) > +int test_pos (char *s) > +{ > + int i; > + for (i=42; s[i]; ++i); > + return i; > +} > + > +__attribute__((noinline)) > +int test_neg (char *s) > +{ > + int i; > + for (i=-42; s[i]; ++i); > + return i; > +} > + > +__attribute__((noinline)) > +int test_including_null_char (char *s) > +{ > + int i; > + for (i=1; s[i-1]; ++i); > + return i; > +} > + > +int main(void) > +{ > + void *p = malloc (1024); > + assert (p); > + memset (p, 0xf, 1024); > + char *s = (char *)p + 100; > + > + s[42+13] = 0; > + assert (test_pos (s) == 42+13); > + s[42+13] = 0xf; > + > + s[13] = 0; > + assert (test_neg (s) == 13); > + s[13] = 0xf; > + > + s[-13] = 0; > + assert (test_neg (s) == -13); > + s[-13] = 0xf; > + > + s[13] = 0; > + assert (test_including_null_char (s) == 13+1); > + > + return 0; > +} > diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c > index 65aa1df4aba..6bd4bc8588a 100644 > --- a/gcc/tree-loop-distribution.c > +++ b/gcc/tree-loop-distribution.c > @@ -116,6 +116,10 @@ along with GCC; see the file COPYING3. If not see > #include "tree-eh.h" > #include "gimple-fold.h" > #include "tree-affine.h" > +#include "intl.h" > +#include "rtl.h" > +#include "memmodel.h" > +#include "optabs.h" > > > #define MAX_DATAREFS_NUM \ > @@ -650,6 +654,10 @@ class loop_distribution > control_dependences *cd, int *nb_calls, bool *destroy_p, > bool only_patterns_p); > > + /* Transform loops which mimic the effects of builtins rawmemchr or strlen and > + replace them accordingly. */ > + bool transform_reduction_loop (loop_p loop); > + > /* Compute topological order for basic blocks. Topological order is > needed because data dependence is computed for data references in > lexicographical order. */ > @@ -1490,14 +1498,14 @@ loop_distribution::build_rdg_partition_for_vertex (struct graph *rdg, int v) > data references. 
*/ > > static bool > -find_single_drs (class loop *loop, struct graph *rdg, partition *partition, > +find_single_drs (class loop *loop, struct graph *rdg, const bitmap &partition_stmts, > data_reference_p *dst_dr, data_reference_p *src_dr) > { > unsigned i; > data_reference_p single_ld = NULL, single_st = NULL; > bitmap_iterator bi; > > - EXECUTE_IF_SET_IN_BITMAP (partition->stmts, 0, i, bi) > + EXECUTE_IF_SET_IN_BITMAP (partition_stmts, 0, i, bi) > { > gimple *stmt = RDG_STMT (rdg, i); > data_reference_p dr; > @@ -1538,44 +1546,47 @@ find_single_drs (class loop *loop, struct graph *rdg, partition *partition, > } > } > > - if (!single_st) > - return false; > - > - /* Bail out if this is a bitfield memory reference. */ > - if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF > - && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1))) > + if (!single_ld && !single_st) > return false; > > - /* Data reference must be executed exactly once per iteration of each > - loop in the loop nest. We only need to check dominance information > - against the outermost one in a perfect loop nest because a bb can't > - dominate outermost loop's latch without dominating inner loop's. */ > - basic_block bb_st = gimple_bb (DR_STMT (single_st)); > - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st)) > - return false; > + basic_block bb_ld = NULL; > + basic_block bb_st = NULL; > > if (single_ld) > { > - gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld); > - /* Direct aggregate copy or via an SSA name temporary. */ > - if (load != store > - && gimple_assign_lhs (load) != gimple_assign_rhs1 (store)) > - return false; > - > /* Bail out if this is a bitfield memory reference. */ > if (TREE_CODE (DR_REF (single_ld)) == COMPONENT_REF > && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_ld), 1))) > return false; > > - /* Load and store must be in the same loop nest. */ > - basic_block bb_ld = gimple_bb (DR_STMT (single_ld)); > - if (bb_st->loop_father != bb_ld->loop_father) > + /* Data reference must be executed exactly once per iteration of each > + loop in the loop nest. We only need to check dominance information > + against the outermost one in a perfect loop nest because a bb can't > + dominate outermost loop's latch without dominating inner loop's. */ > + bb_ld = gimple_bb (DR_STMT (single_ld)); > + if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld)) > + return false; > + } > + > + if (single_st) > + { > + /* Bail out if this is a bitfield memory reference. */ > + if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF > + && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1))) > return false; > > /* Data reference must be executed exactly once per iteration. > - Same as single_st, we only need to check against the outermost > + Same as single_ld, we only need to check against the outermost > loop. */ > - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld)) > + bb_st = gimple_bb (DR_STMT (single_st)); > + if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st)) > + return false; > + } > + > + if (single_ld && single_st) > + { > + /* Load and store must be in the same loop nest. */ > + if (bb_st->loop_father != bb_ld->loop_father) > return false; > > edge e = single_exit (bb_st->loop_father); > @@ -1850,9 +1861,19 @@ loop_distribution::classify_partition (loop_p loop, > return has_reduction; > > /* Find single load/store data references for builtin partition. 
*/ > - if (!find_single_drs (loop, rdg, partition, &single_st, &single_ld)) > + if (!find_single_drs (loop, rdg, partition->stmts, &single_st, &single_ld) > + || !single_st) > return has_reduction; > > + if (single_ld && single_st) > + { > + gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld); > + /* Direct aggregate copy or via an SSA name temporary. */ > + if (load != store > + && gimple_assign_lhs (load) != gimple_assign_rhs1 (store)) > + return has_reduction; > + } > + > partition->loc = gimple_location (DR_STMT (single_st)); > > /* Classify the builtin kind. */ > @@ -3257,6 +3278,379 @@ find_seed_stmts_for_distribution (class loop *loop, vec *work_list) > return work_list->length () > 0; > } > > +/* A helper function for generate_{rawmemchr,strlen}_builtin functions in order > + to place new statements SEQ before LOOP and replace the old reduction > + variable with the new one. */ > + > +static void > +generate_reduction_builtin_1 (loop_p loop, gimple_seq &seq, > + tree reduction_var_old, tree reduction_var_new, > + const char *info, machine_mode load_mode) > +{ > + /* Place new statements before LOOP. */ > + gimple_stmt_iterator gsi = gsi_last_bb (loop_preheader_edge (loop)->src); > + gsi_insert_seq_after (&gsi, seq, GSI_CONTINUE_LINKING); > + > + /* Replace old reduction variable with new one. */ > + imm_use_iterator iter; > + gimple *stmt; > + use_operand_p use_p; > + FOR_EACH_IMM_USE_STMT (stmt, iter, reduction_var_old) > + { > + FOR_EACH_IMM_USE_ON_STMT (use_p, iter) > + SET_USE (use_p, reduction_var_new); > + > + update_stmt (stmt); > + } > + > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, info, GET_MODE_NAME (load_mode)); > +} > + > +/* Generate a call to rawmemchr and place it before LOOP. REDUCTION_VAR is > + replaced with a fresh SSA name representing the result of the call. */ > + > +static void > +generate_rawmemchr_builtin (loop_p loop, tree reduction_var, > + data_reference_p store_dr, tree base, tree pattern, > + location_t loc) > +{ > + gimple_seq seq = NULL; > + > + tree mem = force_gimple_operand (base, &seq, true, NULL_TREE); > + gimple *fn_call = gimple_build_call_internal (IFN_RAWMEMCHR, 2, mem, pattern); > + tree reduction_var_new = copy_ssa_name (reduction_var); > + gimple_call_set_lhs (fn_call, reduction_var_new); > + gimple_set_location (fn_call, loc); > + gimple_seq_add_stmt (&seq, fn_call); > + > + if (store_dr) > + { > + gassign *g = gimple_build_assign (DR_REF (store_dr), reduction_var_new); > + gimple_seq_add_stmt (&seq, g); > + } > + > + generate_reduction_builtin_1 (loop, seq, reduction_var, reduction_var_new, > + "generated rawmemchr%s\n", > + TYPE_MODE (TREE_TYPE (TREE_TYPE (base)))); > +} > + > +/* Helper function for generate_strlen_builtin(,_using_rawmemchr) */ > + > +static void > +generate_strlen_builtin_1 (loop_p loop, gimple_seq &seq, > + tree reduction_var_old, tree reduction_var_new, > + machine_mode mode, tree start_len) > +{ > + /* REDUCTION_VAR_NEW has either size type or ptrdiff type and must be > + converted if types of old and new reduction variable are not compatible. */ > + reduction_var_new = gimple_convert (&seq, TREE_TYPE (reduction_var_old), > + reduction_var_new); > + > + /* Loops of the form `for (i=42; s[i]; ++i);` have an additional start > + length. 
*/ > + if (!integer_zerop (start_len)) > + { > + tree lhs = make_ssa_name (TREE_TYPE (reduction_var_new)); > + gimple *g = gimple_build_assign (lhs, PLUS_EXPR, reduction_var_new, > + start_len); > + gimple_seq_add_stmt (&seq, g); > + reduction_var_new = lhs; > + } > + > + generate_reduction_builtin_1 (loop, seq, reduction_var_old, reduction_var_new, > + "generated strlen%s\n", mode); > +} > + > +/* Generate a call to strlen and place it before LOOP. REDUCTION_VAR is > + replaced with a fresh SSA name representing the result of the call. */ > + > +static void > +generate_strlen_builtin (loop_p loop, tree reduction_var, tree base, > + tree start_len, location_t loc) > +{ > + gimple_seq seq = NULL; > + > + tree reduction_var_new = make_ssa_name (size_type_node); > + > + tree mem = force_gimple_operand (base, &seq, true, NULL_TREE); > + tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN)); > + gimple *fn_call = gimple_build_call (fn, 1, mem); > + gimple_call_set_lhs (fn_call, reduction_var_new); > + gimple_set_location (fn_call, loc); > + gimple_seq_add_stmt (&seq, fn_call); > + > + generate_strlen_builtin_1 (loop, seq, reduction_var, reduction_var_new, > + QImode, start_len); > +} > + > +/* Generate code in order to mimic the behaviour of strlen but this time over > + an array of elements with mode different than QI. REDUCTION_VAR is replaced > + with a fresh SSA name representing the result, i.e., the length. */ > + > +static void > +generate_strlen_builtin_using_rawmemchr (loop_p loop, tree reduction_var, > + tree base, tree start_len, > + location_t loc) > +{ > + gimple_seq seq = NULL; > + > + tree start = force_gimple_operand (base, &seq, true, NULL_TREE); > + tree zero = build_zero_cst (TREE_TYPE (TREE_TYPE (start))); > + gimple *fn_call = gimple_build_call_internal (IFN_RAWMEMCHR, 2, start, zero); > + tree end = make_ssa_name (TREE_TYPE (base)); > + gimple_call_set_lhs (fn_call, end); > + gimple_set_location (fn_call, loc); > + gimple_seq_add_stmt (&seq, fn_call); > + > + /* Determine the number of elements between START and END by > + evaluating (END - START) / sizeof (*START). */ > + tree diff = make_ssa_name (ptrdiff_type_node); > + gimple *diff_stmt = gimple_build_assign (diff, POINTER_DIFF_EXPR, end, start); > + gimple_seq_add_stmt (&seq, diff_stmt); > + /* Let SIZE be the size of the the pointed-to type of START. */ > + tree size = gimple_convert (&seq, ptrdiff_type_node, > + TYPE_SIZE_UNIT (TREE_TYPE (TREE_TYPE (start)))); > + tree count = make_ssa_name (ptrdiff_type_node); > + gimple *count_stmt = gimple_build_assign (count, TRUNC_DIV_EXPR, diff, size); > + gimple_seq_add_stmt (&seq, count_stmt); > + > + generate_strlen_builtin_1 (loop, seq, reduction_var, count, > + TYPE_MODE (TREE_TYPE (TREE_TYPE (base))), > + start_len); > +} > + > +static bool > +reduction_var_overflows_first (tree reduction_var, tree load_type) > +{ > + unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node); > + unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE (reduction_var)); > + unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT (load_type))); > + return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - size_exponent); > +} > + > +/* Transform loops which mimic the effects of builtins rawmemchr or strlen and > + replace them accordingly. For example, a loop of the form > + > + for (; *p != 42; ++p); > + > + is replaced by > + > + p = rawmemchr (p, 42); > + > + under the assumption that rawmemchr is available for a particular MODE. 
> + Another example is > + > + int i; > + for (i = 42; s[i]; ++i); > + > + which is replaced by > + > + i = (int)strlen (&s[42]) + 42; > + > + for some character array S. In case array S is not of type character array > + we end up with > + > + i = (int)(rawmemchr (&s[42], 0) - &s[42]) + 42; > + > + assuming that rawmemchr is available for a particular MODE. */ > + > +bool > +loop_distribution::transform_reduction_loop (loop_p loop) > +{ > + gimple *reduction_stmt = NULL; > + data_reference_p load_dr = NULL, store_dr = NULL; > + > + edge e = single_exit (loop); > + gcond *cond = safe_dyn_cast (last_stmt (e->src)); > + if (!cond) > + return false; > + /* Ensure loop condition is an (in)equality test and loop is exited either if > + the inequality test fails or the equality test succeeds. */ > + if (!(e->flags & EDGE_FALSE_VALUE && gimple_cond_code (cond) == NE_EXPR) > + && !(e->flags & EDGE_TRUE_VALUE && gimple_cond_code (cond) == EQ_EXPR)) > + return false; > + /* A limitation of the current implementation is that we only support > + constant patterns in (in)equality tests. */ > + tree pattern = gimple_cond_rhs (cond); > + if (TREE_CODE (pattern) != INTEGER_CST) > + return false; > + > + basic_block *bbs = get_loop_body (loop); > + > + for (unsigned i = 0, ninsns = 0; i < loop->num_nodes; ++i) > + { > + basic_block bb = bbs[i]; > + > + for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi); > + gsi_next_nondebug (&bsi)) > + { > + gphi *phi = bsi.phi (); > + if (virtual_operand_p (gimple_phi_result (phi))) > + continue; > + if (stmt_has_scalar_dependences_outside_loop (loop, phi)) > + { > + if (reduction_stmt) > + return false; > + reduction_stmt = phi; > + } > + } > + > + for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); > + gsi_next_nondebug (&bsi), ++ninsns) > + { > + /* Bail out early for loops which are unlikely to match. */ > + if (ninsns > 16) > + return false; > + gimple *stmt = gsi_stmt (bsi); > + if (gimple_clobber_p (stmt)) > + continue; > + if (gimple_code (stmt) == GIMPLE_LABEL) > + continue; > + if (gimple_has_volatile_ops (stmt)) > + return false; > + if (stmt_has_scalar_dependences_outside_loop (loop, stmt)) > + { > + if (reduction_stmt) > + return false; > + reduction_stmt = stmt; > + } > + } > + } > + > + /* A limitation of the current implementation is that we require a reduction > + statement. Therefore, loops without a reduction statement as in the > + following are not recognized: > + int *p; > + void foo (void) { for (; *p; ++p); } */ > + if (reduction_stmt == NULL) > + return false; > + > + /* Reduction variables are guaranteed to be SSA names. */ > + tree reduction_var; > + switch (gimple_code (reduction_stmt)) > + { > + case GIMPLE_ASSIGN: > + case GIMPLE_PHI: > + reduction_var = gimple_get_lhs (reduction_stmt); > + break; > + default: > + /* Bail out e.g. for GIMPLE_CALL. */ > + return false; > + } > + > + struct graph *rdg = build_rdg (loop, NULL); > + if (rdg == NULL) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, > + "Loop %d not transformed: failed to build the RDG.\n", > + loop->num); > + > + return false; > + } > + auto_bitmap partition_stmts; > + bitmap_set_range (partition_stmts, 0, rdg->n_vertices); > + find_single_drs (loop, rdg, partition_stmts, &store_dr, &load_dr); > + free_rdg (rdg); > + > + /* Bail out if there is no single load. 
*/ > + if (load_dr == NULL) > + return false; > + > + /* Reaching this point we have a loop with a single reduction variable, > + a single load, and an optional single store. */ > + > + tree load_ref = DR_REF (load_dr); > + tree load_type = TREE_TYPE (load_ref); > + tree load_access_base = build_fold_addr_expr (load_ref); > + tree load_access_size = TYPE_SIZE_UNIT (load_type); > + affine_iv load_iv, reduction_iv; > + > + if (!INTEGRAL_TYPE_P (load_type) > + || !type_has_mode_precision_p (load_type)) > + return false; > + > + /* We already ensured that the loop condition tests for (in)equality where the > + rhs is a constant pattern. Now ensure that the lhs is the result of the > + load. */ > + if (gimple_cond_lhs (cond) != gimple_assign_lhs (DR_STMT (load_dr))) > + return false; > + > + /* Bail out if no affine induction variable with constant step can be > + determined. */ > + if (!simple_iv (loop, loop, load_access_base, &load_iv, false)) > + return false; > + > + /* Bail out if memory accesses are not consecutive or not growing. */ > + if (!operand_equal_p (load_iv.step, load_access_size, 0)) > + return false; > + > + if (!simple_iv (loop, loop, reduction_var, &reduction_iv, false)) > + return false; > + > + /* Handle rawmemchr like loops. */ > + if (operand_equal_p (load_iv.base, reduction_iv.base) > + && operand_equal_p (load_iv.step, reduction_iv.step)) > + { > + if (store_dr) > + { > + /* Ensure that we store to X and load from X+I where I>0. */ > + if (TREE_CODE (load_iv.base) != POINTER_PLUS_EXPR > + || !integer_onep (TREE_OPERAND (load_iv.base, 1))) > + return false; > + tree ptr_base = TREE_OPERAND (load_iv.base, 0); > + if (TREE_CODE (ptr_base) != SSA_NAME) > + return false; > + gimple *def = SSA_NAME_DEF_STMT (ptr_base); > + if (!gimple_assign_single_p (def) > + || gimple_assign_rhs1 (def) != DR_REF (store_dr)) > + return false; > + /* Ensure that the reduction value is stored. */ > + if (gimple_assign_rhs1 (DR_STMT (store_dr)) != reduction_var) > + return false; > + } > + /* Bail out if target does not provide rawmemchr for a certain mode. */ > + machine_mode mode = TYPE_MODE (load_type); > + if (direct_optab_handler (rawmemchr_optab, mode) == CODE_FOR_nothing) > + return false; > + location_t loc = gimple_location (DR_STMT (load_dr)); > + generate_rawmemchr_builtin (loop, reduction_var, store_dr, load_iv.base, > + pattern, loc); > + return true; > + } > + > + /* Handle strlen like loops. 
*/ > + if (store_dr == NULL > + && integer_zerop (pattern) > + && TREE_CODE (reduction_iv.base) == INTEGER_CST > + && TREE_CODE (reduction_iv.step) == INTEGER_CST > + && integer_onep (reduction_iv.step)) > + { > + location_t loc = gimple_location (DR_STMT (load_dr)); > + if (TYPE_MODE (load_type) == TYPE_MODE (char_type_node) > + && TYPE_PRECISION (load_type) == TYPE_PRECISION (char_type_node) > + && (TYPE_PRECISION (size_type_node) == 64 > + || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)) > + && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (size_type_node)))) > + generate_strlen_builtin (loop, reduction_var, load_iv.base, > + reduction_iv.base, loc); > + else if (direct_optab_handler (rawmemchr_optab, TYPE_MODE (load_type)) > + != CODE_FOR_nothing > + && (TYPE_PRECISION (ptrdiff_type_node) == 64 > + || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)) > + && reduction_var_overflows_first (reduction_var, load_type)))) > + generate_strlen_builtin_using_rawmemchr (loop, reduction_var, > + load_iv.base, > + reduction_iv.base, loc); > + else > + return false; > + return true; > + } > + > + return false; > +} > + > /* Given innermost LOOP, return the outermost enclosing loop that forms a > perfect loop nest. */ > > @@ -3321,10 +3715,27 @@ loop_distribution::execute (function *fun) > && !optimize_loop_for_speed_p (loop))) > continue; > > - /* Don't distribute loop if niters is unknown. */ > + /* If niters is unknown don't distribute loop but rather try to transform > + it to a call to a builtin. */ > tree niters = number_of_latch_executions (loop); > if (niters == NULL_TREE || niters == chrec_dont_know) > - continue; > + { > + datarefs_vec.create (20); > + if (transform_reduction_loop (loop)) > + { > + changed = true; > + loops_to_be_destroyed.safe_push (loop); > + if (dump_enabled_p ()) > + { > + dump_user_location_t loc = find_loop_location (loop); > + dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, > + loc, "Loop %d transformed into a builtin.\n", > + loop->num); > + } > + } > + free_data_refs (datarefs_vec); > + continue; > + } > > /* Get the perfect loop nest for distribution. */ > loop = prepare_perfect_loop_nest (loop);