public inbox for gcc-patches@gcc.gnu.org
From: Stefan Schulze Frielinghaus <stefansf@linux.ibm.com>
To: Richard Biener <richard.guenther@gmail.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [RFC] ldist: Recognize rawmemchr loop patterns
Date: Mon, 13 Sep 2021 16:53:11 +0200	[thread overview]
Message-ID: <YT9l12mswNfteDHb@localhost.localdomain> (raw)
In-Reply-To: <CAFiYyc3w=0QJsYhgWWEQXiN-3NErNCVm4L4fkDbfE0tLNXvtvw@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 12766 bytes --]

On Mon, Sep 06, 2021 at 11:56:21AM +0200, Richard Biener wrote:
> On Fri, Sep 3, 2021 at 10:01 AM Stefan Schulze Frielinghaus
> <stefansf@linux.ibm.com> wrote:
> >
> > On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote:
> > [...]
> > > > >
> > > > > +  /* Handle strlen like loops.  */
> > > > > +  if (store_dr == NULL
> > > > > +      && integer_zerop (pattern)
> > > > > +      && TREE_CODE (reduction_iv.base) == INTEGER_CST
> > > > > +      && TREE_CODE (reduction_iv.step) == INTEGER_CST
> > > > > +      && integer_onep (reduction_iv.step)
> > > > > +      && (types_compatible_p (TREE_TYPE (reduction_var), size_type_node)
> > > > > +         || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))))
> > > > > +    {
> > > > >
> > > > > I wonder what goes wrong with a larger or smaller wrapping IV type?
> > > > > The iteration
> > > > > only stops when you load a NUL and the increments just wrap along (you're
> > > > > using the pointer IVs to compute the strlen result).  Can't you simply truncate?
> > > >
> > > > I think truncation is enough as long as no overflow occurs in strlen or
> > > > strlen_using_rawmemchr.
> > > >
> > > > > For larger than size_type_node (actually larger than ptr_type_node would matter
> > > > > I guess), the argument is that since pointer wrapping would be undefined anyway
> > > > > the IV cannot wrap either.  Now, the correct check here would IMHO be
> > > > >
> > > > >       TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION
> > > > > (ptr_type_node)
> > > > >        || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var))
> > > > >
> > > > > ?
> > > >
> > > > Regarding the implementation which makes use of rawmemchr:
> > > >
> > > > We can count at most PTRDIFF_MAX many bytes without an overflow.  Thus,
> > > > the maximal length we can determine of a string where each character has
> > > > size S is PTRDIFF_MAX / S without an overflow.  Since an overflow of the
> > > > ptrdiff type is undefined, we have to make sure that if an overflow
> > > > occurs, then an overflow occurs for the reduction variable, too, and that
> > > > this is likewise undefined.  However, I'm not sure anymore whether we want
> > > > to respect overflows in all cases.  If TYPE_PRECISION (ptr_type_node)
> > > > equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then
> > > > this would mean that a single string consumes more than half of the
> > > > virtual addressable memory.  At least for architectures where
> > > > TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is reasonable
> > > > to neglect the case where computing pointer difference may overflow.
> > > > Otherwise we are talking about strings with lengths of multiple
> > > > pebibytes.  For other architectures we might have to be more precise
> > > > and make sure that the reduction variable overflows first and that this
> > > > is undefined.
> > > >
> > > > Thus a conservative condition would be (I assumed that the size of any
> > > > integral type is a power of two, which I'm not sure really holds;
> > > > IIRC the C standard only requires that the alignment is a power of two,
> > > > not necessarily the size, so I might need to change this):
> > > >
> > > > /* Compute precision (reduction_var) < precision (ptrdiff_type) - 1 - log2 (sizeof (load_type)),
> > > >    or in other words return true if the reduction variable overflows first
> > > >    and false otherwise.  */
> > > >
> > > > static bool
> > > > reduction_var_overflows_first (tree reduction_var, tree load_type)
> > > > {
> > > >   unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
> > > >   unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE (reduction_var));
> > > >   unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT (load_type)));
> > > >   return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - size_exponent);
> > > > }
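
As a sanity check of the arithmetic above, the proposed predicate can be modelled with plain integers outside of GCC's wide-int API; this is an illustrative standalone sketch, not part of the patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Plain-integer model of the proposed check: true iff the reduction
   variable overflows strictly before the ptrdiff-based byte count can.
   Assumes the load size is a power of two, as discussed above.  */
static bool
reduction_var_overflows_first_model (unsigned prec_reduction_var,
				     unsigned prec_ptrdiff,
				     unsigned load_size)
{
  unsigned size_exponent = 0;
  while ((1u << size_exponent) < load_size)
    ++size_exponent;
  /* Power-of-two assumption on the load size.  */
  assert (load_size == 1u << size_exponent);
  return prec_reduction_var < prec_ptrdiff - 1 - size_exponent;
}
```

For example, a 32-bit counter with a 64-bit ptrdiff_t and 4-byte loads overflows first, whereas a counter as wide as ptrdiff_t does not.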
> > > >
> > > > TYPE_PRECISION (ptrdiff_type_node) == 64
> > > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > > >     && reduction_var_overflows_first (reduction_var, load_type))
> > > >
> > > > Regarding the implementation which makes use of strlen:
> > > >
> > > > I'm not sure what it means if strlen is called for a string with a
> > > > length greater than SIZE_MAX.  Therefore, similar to the implementation
> > > > using rawmemchr where we neglect the case of an overflow for 64bit
> > > > architectures, a conservative condition would be:
> > > >
> > > > TYPE_PRECISION (size_type_node) == 64
> > > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > > >     && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (size_type_node))
> > > >
> > > > I still included the overflow-undefined check for the reduction variable
> > > > in order to rule out situations where the reduction variable is unsigned
> > > > and wraps around often enough that strlen(_using_rawmemchr) itself
> > > > overflows, too.  Maybe this is all theoretical nonsense but I'm afraid of
> > > > uncommon architectures.  Anyhow, while writing this down it became clear
> > > > that this deserves a comment, which I will add once it becomes clear
> > > > which way to go.
> > >
> > > I think all the arguments about objects bigger than half of the address
> > > space are also valid for 32-bit targets and thus for a 32-bit
> > > size_type_node (or 32-bit pointer size).
> > > I'm not actually sure what the canonical type to check against is, whether
> > > it's size_type_node (C's size_t), ptr_type_node (C's void *) or sizetype (the
> > > middle-end "offset" type used for all address computations).  For weird reasons
> > > I'd lean towards 'sizetype' (for example some embedded targets have 24-bit
> > > pointers but a 16-bit 'sizetype').
> >
> > Ok, for the strlen implementation I changed from size_type_node to
> > sizetype and assume that no overflow occurs for string objects bigger
> > than half of the address space for 32-bit targets and up:
> >
> >   (TYPE_PRECISION (sizetype) >= TYPE_PRECISION (ptr_type_node) - 1
> >    && TYPE_PRECISION (ptr_type_node) >= 32)
> >   || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> >       && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (sizetype))
> >
> > and similarly for the rawmemchr implementation:
> >
> >   (TYPE_PRECISION (ptrdiff_type_node) == TYPE_PRECISION (ptr_type_node)
> >    && TYPE_PRECISION (ptrdiff_type_node) >= 32)
> >   || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> >       && reduction_var_overflows_first (reduction_var, load_type))
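
Spelled out as plain arithmetic, the strlen-side guard behaves as follows; the precisions here are hypothetical and purely for illustration, not taken from any particular target:

```c
#include <stdbool.h>

/* Model of the strlen-side guard above.  First clause: "large" targets
   (>= 32-bit pointers, sizetype nearly as wide) where the overflow case
   is neglected; second clause: the reduction variable overflows first
   and that overflow is undefined.  */
static bool
strlen_guard_ok (unsigned prec_sizetype, unsigned prec_ptr,
		 bool overflow_undefined, unsigned prec_reduction)
{
  return (prec_sizetype >= prec_ptr - 1 && prec_ptr >= 32)
	 || (overflow_undefined && prec_reduction <= prec_sizetype);
}
```

On an LP64 target the first clause already suffices; a hypothetical target with 24-bit pointers and a wrapping 32-bit counter is rejected, but accepted once overflow of the counter is undefined and it is no wider than sizetype.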
> >
> > >
> > > > >
> > > > > +      if (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)))
> > > > > +       {
> > > > > +         const char *msg = G_("assuming signed overflow does not occur "
> > > > > +                              "when optimizing strlen like loop");
> > > > > +         fold_overflow_warning (msg, WARN_STRICT_OVERFLOW_MISC);
> > > > > +       }
> > > > >
> > > > > no, please don't add any new strict-overflow warnings ;)
> > > >
> > > > I just stumbled over code which produces such a warning and thought it
> > > > was a hard requirement :D  The new patch doesn't contain it anymore.
> > > >
> > > > >
> > > > > The generate_*_builtin routines need some factoring - if you code-generate
> > > > > into a gimple_seq you could use gimple_build () which would do the fold_stmt
> > > > > (not sure why you do that - you should see to fold the call, not necessarily
> > > > > the rest).  The replacement of reduction_var and the dumping could be shared.
> > > > > There's also GET_MODE_NAME for the printing.
> > > >
> > > > I wasn't really sure which way to go: use a gsi, as is done by the
> > > > existing generate_* functions, or make use of a gimple_seq.  Since the
> > > > latter also uses a gsi internally, I thought it better to stick to a gsi
> > > > in the first place.  Now, after changing to gimple_seq, I see the beauty
> > > > of it :)
> > > >
> > > > I created two helper functions generate_strlen_builtin_1 and
> > > > generate_reduction_builtin_1 in order to reduce code duplication.
> > > >
> > > > In function generate_strlen_builtin I changed from using
> > > > builtin_decl_implicit (BUILT_IN_STRLEN) to builtin_decl_explicit
> > > > (BUILT_IN_STRLEN) since the former could return a NULL pointer. I'm not
> > > > sure whether my intuition about the difference between implicit and
> > > > explicit builtins is correct.  In builtins.def there is a small example
> > > > given which I would paraphrase as "use builtin_decl_explicit if the
> > > > semantics of the builtin are defined by the C standard; otherwise use
> > > > builtin_decl_implicit", but probably my intuition is wrong?
> > > >
> > > > Besides that, I'm not sure whether I really have to call
> > > > build_fold_addr_expr, which looks superfluous to me since
> > > > gimple_build_call can deal with an ADDR_EXPR as well as a FUNCTION_DECL:
> > > >
> > > > tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN));
> > > > gimple *fn_call = gimple_build_call (fn, 1, mem);
> > > >
> > > > However, since it is also used that way in the context of
> > > > generate_memset_builtin I didn't remove it so far.
> > > >
> > > > > I think overall the approach is sound now but the details still need work.
> > > >
> > > > Once again thank you very much for your review.  Really appreciated!
> > >
> > > The patch lacks a changelog entry / description.  It's nice if patches sent
> > > out for review are basically the rev as git format-patch produces.
> > >
> > > The rawmemchr optab needs documenting in md.texi
> >
> > While writing the documentation in md.texi I realised that other
> > instructions expect an address to be a memory operand, which is
> > currently not the case for rawmemchr.  At the moment the address is
> > either an SSA_NAME or an ADDR_EXPR with a tree pointer type in
> > expand_RAWMEMCHR.  As a consequence, in the backend the define_expand
> > rawmemchr<mode> expects a register operand and not a memory operand.
> > Would it make sense to build a MEM_REF out of the SSA_NAME/ADDR_EXPR in
> > expand_RAWMEMCHR?  I'm not sure if MEM_REF is supposed to be the
> > canonical form here.
> 
> I suppose the expander could use code similar to what
> expand_builtin_memset_args does,
> using get_memory_rtx.  I suppose that we're using MEM operands because those
> can convey things like alias info or alignment info, something which
> REG operands cannot
> (easily).  I wouldn't build a MEM_REF and try to expand that.

The new patch contains the following changes:

- In expand_RAWMEMCHR I'm using get_memory_rtx now.  This means I had to
  change the linkage of get_memory_rtx to extern.

- In function generate_strlen_builtin_using_rawmemchr I'm no longer
  reconstructing the load type from the base pointer but rather passing
  it as a parameter from function transform_reduction_loop, where we
  also ensure that it is of integral type.  Reconstructing the load type
  was error prone since, e.g., I didn't distinguish between
  POINTER_PLUS_EXPR and ADDR_EXPR.  Thus passing the load type should be
  more solid.

Regtested on IBM Z and x86.  Ok for mainline?

Thanks,
Stefan

> 
> > >
> > > +}
> > > +
> > > +static bool
> > > +reduction_var_overflows_first (tree reduction_var, tree load_type)
> > > +{
> > > +  unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
> > >
> > > this function needs a comment.
> >
> > Done.
> >
> > >
> > > +         if (stmt_has_scalar_dependences_outside_loop (loop, phi))
> > > +           {
> > > +             if (reduction_stmt)
> > > +               return false;
> > >
> > > you leak bbs here and elsewhere where you early exit the function.
> > > In fact you fail to free it at all.
> >
> > Whoopsy. I factored the whole loop out into static function
> > determine_reduction_stmt in order to deal with all early exits.
> >
> > >
> > > Otherwise the patch looks good - thanks for all the improvements.
> > >
> > > What I do wonder is
> > >
> > > +  tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN));
> > > +  gimple *fn_call = gimple_build_call (fn, 1, mem);
> > >
> > > using builtin_decl_explicit means that in a TU where strlen is neither
> > > declared nor used we can end up emitting calls to it.  For memcpy/memmove
> > > that's usually OK since we require those to be present even in a
> > > freestanding environment.  But I'm not sure about strlen here so I'd
> > > lean towards using builtin_decl_implicit and checking that for NULL which
> > > IIRC should prevent emitting strlen when it's not declared and maybe even
> > > if it's declared but not used.  All other uses that generate STRLEN
> > > use that at least.
> >
> > Thanks for clarification.  I changed it back to builtin_decl_implicit
> > and check for null pointers.
> 
> Thanks,
> Richard.
> 
> > Thanks,
> > Stefan

[-- Attachment #2: 0001-ldist-Recognize-strlen-and-rawmemchr-like-loops.patch --]
[-- Type: text/plain, Size: 38262 bytes --]

From ad7cc80f57394178b11a4033f6a693a199732106 Mon Sep 17 00:00:00 2001
From: Stefan Schulze Frielinghaus <stefansf@linux.ibm.com>
Date: Wed, 17 Mar 2021 09:00:06 +0100
Subject: [PATCH 1/2] ldist: Recognize strlen and rawmemchr like loops

This patch adds support for recognizing loops which mimic the behaviour
of the functions strlen and rawmemchr, and replaces them with internal
function calls in case a target provides them.  In contrast to the
standard strlen and rawmemchr functions, this patch also supports
variants where the memory pointed to is interpreted as 8-, 16-, or
32-bit wide elements.

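As a sketch of what the pass is after, loops of the following shapes are
candidates; these are illustrative examples only, mirroring the attached
testcases:

```c
#include <stdint.h>

/* Pointer reduction, no store: replaceable by a rawmemchr call
   (HImode accesses in this instance).  */
uint16_t *rawmemchr_like (uint16_t *p)
{
  while (*p != 0xabcd)
    ++p;
  return p;
}

/* Counter reduction over a NUL-terminated buffer: replaceable by a
   strlen call.  */
unsigned long strlen_like (const char *s)
{
  unsigned long i;
  for (i = 0; s[i]; ++i)
    ;
  return i;
}
```
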
gcc/ChangeLog:

	* builtins.c (get_memory_rtx): Change to external linkage.
	* builtins.h (get_memory_rtx): Add function prototype.
	* doc/md.texi (rawmemchr<mode>): Document.
	* internal-fn.c (expand_RAWMEMCHR): Define.
	* internal-fn.def (RAWMEMCHR): Add.
	* optabs.def (rawmemchr_optab): Add.
	* tree-loop-distribution.c (find_single_drs): Change return code
	behaviour by also returning true if no single store was found
	but a single load.
	(loop_distribution::classify_partition): Respect the new return
	code behaviour of function find_single_drs.
	(loop_distribution::execute): Call new function
	transform_reduction_loop in order to replace rawmemchr or strlen
	like loops by calls into builtins.
	(generate_reduction_builtin_1): New function.
	(generate_rawmemchr_builtin): New function.
	(generate_strlen_builtin_1): New function.
	(generate_strlen_builtin): New function.
	(generate_strlen_builtin_using_rawmemchr): New function.
	(reduction_var_overflows_first): New function.
	(determine_reduction_stmt_1): New function.
	(determine_reduction_stmt): New function.
	(loop_distribution::transform_reduction_loop): New function.

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/ldist-rawmemchr-1.c: New test.
	* gcc.dg/tree-ssa/ldist-rawmemchr-2.c: New test.
	* gcc.dg/tree-ssa/ldist-strlen-1.c: New test.
	* gcc.dg/tree-ssa/ldist-strlen-2.c: New test.
	* gcc.dg/tree-ssa/ldist-strlen-3.c: New test.
---
 gcc/builtins.c                                |   3 +-
 gcc/builtins.h                                |   1 +
 gcc/doc/md.texi                               |   7 +
 gcc/internal-fn.c                             |  30 +
 gcc/internal-fn.def                           |   1 +
 gcc/optabs.def                                |   1 +
 .../gcc.dg/tree-ssa/ldist-rawmemchr-1.c       |  72 +++
 .../gcc.dg/tree-ssa/ldist-rawmemchr-2.c       |  83 +++
 .../gcc.dg/tree-ssa/ldist-strlen-1.c          | 100 ++++
 .../gcc.dg/tree-ssa/ldist-strlen-2.c          |  58 ++
 .../gcc.dg/tree-ssa/ldist-strlen-3.c          |  12 +
 gcc/tree-loop-distribution.c                  | 518 +++++++++++++++++-
 12 files changed, 855 insertions(+), 31 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-3.c

diff --git a/gcc/builtins.c b/gcc/builtins.c
index 3e57eb03af0..f989a869a9f 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -105,7 +105,6 @@ builtin_info_type builtin_info[(int)END_BUILTINS];
 bool force_folding_builtin_constant_p;
 
 static int target_char_cast (tree, char *);
-static rtx get_memory_rtx (tree, tree);
 static int apply_args_size (void);
 static int apply_result_size (void);
 static rtx result_vector (int, rtx);
@@ -1355,7 +1354,7 @@ expand_builtin_prefetch (tree exp)
    the maximum length of the block of memory that might be accessed or
    NULL if unknown.  */
 
-static rtx
+rtx
 get_memory_rtx (tree exp, tree len)
 {
   tree orig_exp = exp;
diff --git a/gcc/builtins.h b/gcc/builtins.h
index d330b78e591..5e4d86e9c37 100644
--- a/gcc/builtins.h
+++ b/gcc/builtins.h
@@ -146,6 +146,7 @@ extern char target_percent_s[3];
 extern char target_percent_c[3];
 extern char target_percent_s_newline[4];
 extern bool target_char_cst_p (tree t, char *p);
+extern rtx get_memory_rtx (tree exp, tree len);
 
 extern internal_fn associated_internal_fn (tree);
 extern internal_fn replacement_internal_fn (gcall *);
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 2b41cb7fb7b..10faacdca6c 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -6695,6 +6695,13 @@ operand 2 is the character to search for (normally zero),
 and operand 3 is a constant describing the known alignment
 of the beginning of the string.
 
+@cindex @code{rawmemchr@var{m}} instruction pattern
+@item @samp{rawmemchr@var{m}}
+Scan memory referred to by operand 1 for the first occurrence of operand 2.
+Operand 1 is a @code{mem} and operand 2 a @code{const_int} of mode @var{m}.
+Operand 0 is the result, i.e., a pointer to the first occurrence of operand 2
+in the memory block given by operand 1.
+
 @cindex @code{float@var{m}@var{n}2} instruction pattern
 @item @samp{float@var{m}@var{n}2}
 Convert signed integer operand 1 (valid for fixed point mode @var{m}) to
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index ada2a820ff1..5bde0864a8f 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -2934,6 +2934,36 @@ expand_VEC_CONVERT (internal_fn, gcall *)
   gcc_unreachable ();
 }
 
+/* Expand IFN_RAWMEMCHR internal function.  */
+
+void
+expand_RAWMEMCHR (internal_fn, gcall *stmt)
+{
+  expand_operand ops[3];
+
+  tree lhs = gimple_call_lhs (stmt);
+  if (!lhs)
+    return;
+  machine_mode lhs_mode = TYPE_MODE (TREE_TYPE (lhs));
+  rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  create_output_operand (&ops[0], lhs_rtx, lhs_mode);
+
+  tree mem = gimple_call_arg (stmt, 0);
+  rtx mem_rtx = get_memory_rtx (mem, NULL);
+  create_fixed_operand (&ops[1], mem_rtx);
+
+  tree pattern = gimple_call_arg (stmt, 1);
+  machine_mode mode = TYPE_MODE (TREE_TYPE (pattern));
+  rtx pattern_rtx = expand_normal (pattern);
+  create_input_operand (&ops[2], pattern_rtx, mode);
+
+  insn_code icode = direct_optab_handler (rawmemchr_optab, mode);
+
+  expand_insn (icode, 3, ops);
+  if (!rtx_equal_p (lhs_rtx, ops[0].value))
+    emit_move_insn (lhs_rtx, ops[0].value);
+}
+
 /* Expand the IFN_UNIQUE function according to its first argument.  */
 
 static void
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 88169ef4656..01f60a6cf26 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -352,6 +352,7 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
 DEF_INTERNAL_FN (VEC_CONVERT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (RAWMEMCHR, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
 
 /* An unduplicable, uncombinable function.  Generally used to preserve
    a CFG property in the face of jump threading, tail merging or
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 201b8aae1c0..f02c7b729a5 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -267,6 +267,7 @@ OPTAB_D (cpymem_optab, "cpymem$a")
 OPTAB_D (movmem_optab, "movmem$a")
 OPTAB_D (setmem_optab, "setmem$a")
 OPTAB_D (strlen_optab, "strlen$a")
+OPTAB_D (rawmemchr_optab, "rawmemchr$a")
 
 OPTAB_DC(fma_optab, "fma$a4", FMA)
 OPTAB_D (fms_optab, "fms$a4")
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c
new file mode 100644
index 00000000000..6abfd278351
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c
@@ -0,0 +1,72 @@
+/* { dg-do run { target s390x-*-* } } */
+/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrQI" 2 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrHI" 2 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrSI" 2 "ldist" { target s390x-*-* } } } */
+
+/* Rawmemchr pattern: reduction stmt and no store */
+
+#include <stdint.h>
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+#define test(T, pattern)   \
+__attribute__((noinline))  \
+T *test_##T (T *p)         \
+{                          \
+  while (*p != (T)pattern) \
+    ++p;                   \
+  return p;                \
+}
+
+test (uint8_t,  0xab)
+test (uint16_t, 0xabcd)
+test (uint32_t, 0xabcdef15)
+
+test (int8_t,  0xab)
+test (int16_t, 0xabcd)
+test (int32_t, 0xabcdef15)
+
+#define run(T, pattern, i)      \
+{                               \
+T *q = p;                       \
+q[i] = (T)pattern;              \
+assert (test_##T (p) == &q[i]); \
+q[i] = 0;                       \
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, 0, 1024);
+
+  run (uint8_t, 0xab, 0);
+  run (uint8_t, 0xab, 1);
+  run (uint8_t, 0xab, 13);
+
+  run (uint16_t, 0xabcd, 0);
+  run (uint16_t, 0xabcd, 1);
+  run (uint16_t, 0xabcd, 13);
+
+  run (uint32_t, 0xabcdef15, 0);
+  run (uint32_t, 0xabcdef15, 1);
+  run (uint32_t, 0xabcdef15, 13);
+
+  run (int8_t, 0xab, 0);
+  run (int8_t, 0xab, 1);
+  run (int8_t, 0xab, 13);
+
+  run (int16_t, 0xabcd, 0);
+  run (int16_t, 0xabcd, 1);
+  run (int16_t, 0xabcd, 13);
+
+  run (int32_t, 0xabcdef15, 0);
+  run (int32_t, 0xabcdef15, 1);
+  run (int32_t, 0xabcdef15, 13);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c
new file mode 100644
index 00000000000..00d6ea0f8e9
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c
@@ -0,0 +1,83 @@
+/* { dg-do run { target s390x-*-* } } */
+/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrQI" 2 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrHI" 2 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrSI" 2 "ldist" { target s390x-*-* } } } */
+
+/* Rawmemchr pattern: reduction stmt and store */
+
+#include <stdint.h>
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+uint8_t *p_uint8_t;
+uint16_t *p_uint16_t;
+uint32_t *p_uint32_t;
+
+int8_t *p_int8_t;
+int16_t *p_int16_t;
+int32_t *p_int32_t;
+
+#define test(T, pattern)    \
+__attribute__((noinline))   \
+T *test_##T (void)          \
+{                           \
+  while (*p_##T != pattern) \
+    ++p_##T;                \
+  return p_##T;             \
+}
+
+test (uint8_t,  0xab)
+test (uint16_t, 0xabcd)
+test (uint32_t, 0xabcdef15)
+
+test (int8_t,  (int8_t)0xab)
+test (int16_t, (int16_t)0xabcd)
+test (int32_t, (int32_t)0xabcdef15)
+
+#define run(T, pattern, i) \
+{                          \
+T *q = p;                  \
+q[i] = pattern;            \
+p_##T = p;                 \
+T *r = test_##T ();        \
+assert (r == p_##T);       \
+assert (r == &q[i]);       \
+q[i] = 0;                  \
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, '\0', 1024);
+
+  run (uint8_t, 0xab, 0);
+  run (uint8_t, 0xab, 1);
+  run (uint8_t, 0xab, 13);
+
+  run (uint16_t, 0xabcd, 0);
+  run (uint16_t, 0xabcd, 1);
+  run (uint16_t, 0xabcd, 13);
+
+  run (uint32_t, 0xabcdef15, 0);
+  run (uint32_t, 0xabcdef15, 1);
+  run (uint32_t, 0xabcdef15, 13);
+
+  run (int8_t, (int8_t)0xab, 0);
+  run (int8_t, (int8_t)0xab, 1);
+  run (int8_t, (int8_t)0xab, 13);
+
+  run (int16_t, (int16_t)0xabcd, 0);
+  run (int16_t, (int16_t)0xabcd, 1);
+  run (int16_t, (int16_t)0xabcd, 13);
+
+  run (int32_t, (int32_t)0xabcdef15, 0);
+  run (int32_t, (int32_t)0xabcdef15, 1);
+  run (int32_t, (int32_t)0xabcdef15, 13);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c
new file mode 100644
index 00000000000..918b60099e4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c
@@ -0,0 +1,100 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-final { scan-tree-dump-times "generated strlenQI\n" 4 "ldist" } } */
+/* { dg-final { scan-tree-dump-times "generated strlenHI\n" 4 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated strlenSI\n" 4 "ldist" { target s390x-*-* } } } */
+
+#include <stdint.h>
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+#define test(T, U)        \
+__attribute__((noinline)) \
+U test_##T##U (T *s)      \
+{                         \
+  U i;                    \
+  for (i=0; s[i]; ++i);   \
+  return i;               \
+}
+
+test (uint8_t,  size_t)
+test (uint16_t, size_t)
+test (uint32_t, size_t)
+test (uint8_t,  int)
+test (uint16_t, int)
+test (uint32_t, int)
+
+test (int8_t,  size_t)
+test (int16_t, size_t)
+test (int32_t, size_t)
+test (int8_t,  int)
+test (int16_t, int)
+test (int32_t, int)
+
+#define run(T, U, i)             \
+{                                \
+T *q = p;                        \
+q[i] = 0;                        \
+assert (test_##T##U (p) == i);   \
+memset (&q[i], 0xf, sizeof (T)); \
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, 0xf, 1024);
+
+  run (uint8_t, size_t, 0);
+  run (uint8_t, size_t, 1);
+  run (uint8_t, size_t, 13);
+
+  run (int8_t, size_t, 0);
+  run (int8_t, size_t, 1);
+  run (int8_t, size_t, 13);
+
+  run (uint8_t, int, 0);
+  run (uint8_t, int, 1);
+  run (uint8_t, int, 13);
+
+  run (int8_t, int, 0);
+  run (int8_t, int, 1);
+  run (int8_t, int, 13);
+
+  run (uint16_t, size_t, 0);
+  run (uint16_t, size_t, 1);
+  run (uint16_t, size_t, 13);
+
+  run (int16_t, size_t, 0);
+  run (int16_t, size_t, 1);
+  run (int16_t, size_t, 13);
+
+  run (uint16_t, int, 0);
+  run (uint16_t, int, 1);
+  run (uint16_t, int, 13);
+
+  run (int16_t, int, 0);
+  run (int16_t, int, 1);
+  run (int16_t, int, 13);
+
+  run (uint32_t, size_t, 0);
+  run (uint32_t, size_t, 1);
+  run (uint32_t, size_t, 13);
+
+  run (int32_t, size_t, 0);
+  run (int32_t, size_t, 1);
+  run (int32_t, size_t, 13);
+
+  run (uint32_t, int, 0);
+  run (uint32_t, int, 1);
+  run (uint32_t, int, 13);
+
+  run (int32_t, int, 0);
+  run (int32_t, int, 1);
+  run (int32_t, int, 13);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c
new file mode 100644
index 00000000000..e25d6ea5b56
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c
@@ -0,0 +1,58 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-final { scan-tree-dump-times "generated strlenQI\n" 3 "ldist" } } */
+
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+__attribute__((noinline))
+int test_pos (char *s)
+{
+  int i;
+  for (i=42; s[i]; ++i);
+  return i;
+}
+
+__attribute__((noinline))
+int test_neg (char *s)
+{
+  int i;
+  for (i=-42; s[i]; ++i);
+  return i;
+}
+
+__attribute__((noinline))
+int test_including_null_char (char *s)
+{
+  int i;
+  for (i=1; s[i-1]; ++i);
+  return i;
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, 0xf, 1024);
+  char *s = (char *)p + 100;
+
+  s[42+13] = 0;
+  assert (test_pos (s) == 42+13);
+  s[42+13] = 0xf;
+
+  s[13] = 0;
+  assert (test_neg (s) == 13);
+  s[13] = 0xf;
+
+  s[-13] = 0;
+  assert (test_neg (s) == -13);
+  s[-13] = 0xf;
+
+  s[13] = 0;
+  assert (test_including_null_char (s) == 13+1);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-3.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-3.c
new file mode 100644
index 00000000000..370fd5eb088
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-3.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-final { scan-tree-dump-times "generated strlenSI\n" 1 "ldist" { target s390x-*-* } } } */
+
+extern int s[];
+
+int test ()
+{
+  int i = 0;
+  for (; s[i]; ++i);
+  return i;
+}
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 2df762c8aa8..fb9250031b5 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -116,6 +116,10 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-eh.h"
 #include "gimple-fold.h"
 #include "tree-affine.h"
+#include "intl.h"
+#include "rtl.h"
+#include "memmodel.h"
+#include "optabs.h"
 
 
 #define MAX_DATAREFS_NUM \
@@ -651,6 +655,10 @@ class loop_distribution
 		       control_dependences *cd, int *nb_calls, bool *destroy_p,
 		       bool only_patterns_p);
 
+  /* Transform loops which mimic the effects of builtins rawmemchr or strlen and
+     replace them accordingly.  */
+  bool transform_reduction_loop (loop_p loop);
+
   /* Compute topological order for basic blocks.  Topological order is
      needed because data dependence is computed for data references in
      lexicographical order.  */
@@ -1492,14 +1500,14 @@ loop_distribution::build_rdg_partition_for_vertex (struct graph *rdg, int v)
    data references.  */
 
 static bool
-find_single_drs (class loop *loop, struct graph *rdg, partition *partition,
+find_single_drs (class loop *loop, struct graph *rdg, const bitmap &partition_stmts,
 		 data_reference_p *dst_dr, data_reference_p *src_dr)
 {
   unsigned i;
   data_reference_p single_ld = NULL, single_st = NULL;
   bitmap_iterator bi;
 
-  EXECUTE_IF_SET_IN_BITMAP (partition->stmts, 0, i, bi)
+  EXECUTE_IF_SET_IN_BITMAP (partition_stmts, 0, i, bi)
     {
       gimple *stmt = RDG_STMT (rdg, i);
       data_reference_p dr;
@@ -1540,44 +1548,47 @@ find_single_drs (class loop *loop, struct graph *rdg, partition *partition,
 	}
     }
 
-  if (!single_st)
-    return false;
-
-  /* Bail out if this is a bitfield memory reference.  */
-  if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
-      && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
+  if (!single_ld && !single_st)
     return false;
 
-  /* Data reference must be executed exactly once per iteration of each
-     loop in the loop nest.  We only need to check dominance information
-     against the outermost one in a perfect loop nest because a bb can't
-     dominate outermost loop's latch without dominating inner loop's.  */
-  basic_block bb_st = gimple_bb (DR_STMT (single_st));
-  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
-    return false;
+  basic_block bb_ld = NULL;
+  basic_block bb_st = NULL;
 
   if (single_ld)
     {
-      gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
-      /* Direct aggregate copy or via an SSA name temporary.  */
-      if (load != store
-	  && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
-	return false;
-
       /* Bail out if this is a bitfield memory reference.  */
       if (TREE_CODE (DR_REF (single_ld)) == COMPONENT_REF
 	  && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_ld), 1)))
 	return false;
 
-      /* Load and store must be in the same loop nest.  */
-      basic_block bb_ld = gimple_bb (DR_STMT (single_ld));
-      if (bb_st->loop_father != bb_ld->loop_father)
+      /* Data reference must be executed exactly once per iteration of each
+	 loop in the loop nest.  We only need to check dominance information
+	 against the outermost one in a perfect loop nest because a bb can't
+	 dominate outermost loop's latch without dominating inner loop's.  */
+      bb_ld = gimple_bb (DR_STMT (single_ld));
+      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
+	return false;
+    }
+
+  if (single_st)
+    {
+      /* Bail out if this is a bitfield memory reference.  */
+      if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
+	  && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
 	return false;
 
       /* Data reference must be executed exactly once per iteration.
-	 Same as single_st, we only need to check against the outermost
+	 Same as single_ld, we only need to check against the outermost
 	 loop.  */
-      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
+      bb_st = gimple_bb (DR_STMT (single_st));
+      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
+	return false;
+    }
+
+  if (single_ld && single_st)
+    {
+      /* Load and store must be in the same loop nest.  */
+      if (bb_st->loop_father != bb_ld->loop_father)
 	return false;
 
       edge e = single_exit (bb_st->loop_father);
@@ -1852,9 +1863,19 @@ loop_distribution::classify_partition (loop_p loop,
     return has_reduction;
 
   /* Find single load/store data references for builtin partition.  */
-  if (!find_single_drs (loop, rdg, partition, &single_st, &single_ld))
+  if (!find_single_drs (loop, rdg, partition->stmts, &single_st, &single_ld)
+      || !single_st)
     return has_reduction;
 
+  if (single_ld && single_st)
+    {
+      gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
+      /* Direct aggregate copy or via an SSA name temporary.  */
+      if (load != store
+	  && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
+	return has_reduction;
+    }
+
   partition->loc = gimple_location (DR_STMT (single_st));
 
   /* Classify the builtin kind.  */
@@ -3260,6 +3281,428 @@ find_seed_stmts_for_distribution (class loop *loop, vec<gimple *> *work_list)
   return work_list->length () > 0;
 }
 
+/* A helper function for the generate_{rawmemchr,strlen}_builtin functions.
+   It inserts the new statements SEQ before LOOP and replaces the old
+   reduction variable with the new one.  */
+
+static void
+generate_reduction_builtin_1 (loop_p loop, gimple_seq &seq,
+			      tree reduction_var_old, tree reduction_var_new,
+			      const char *info, machine_mode load_mode)
+{
+  /* Place new statements before LOOP.  */
+  gimple_stmt_iterator gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
+  gsi_insert_seq_after (&gsi, seq, GSI_CONTINUE_LINKING);
+
+  /* Replace old reduction variable with new one.  */
+  imm_use_iterator iter;
+  gimple *stmt;
+  use_operand_p use_p;
+  FOR_EACH_IMM_USE_STMT (stmt, iter, reduction_var_old)
+    {
+      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
+	SET_USE (use_p, reduction_var_new);
+
+      update_stmt (stmt);
+    }
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, info, GET_MODE_NAME (load_mode));
+}
+
+/* Generate a call to rawmemchr and place it before LOOP.  REDUCTION_VAR is
+   replaced with a fresh SSA name representing the result of the call.  */
+
+static void
+generate_rawmemchr_builtin (loop_p loop, tree reduction_var,
+			    data_reference_p store_dr, tree base, tree pattern,
+			    location_t loc)
+{
+  gimple_seq seq = NULL;
+
+  tree mem = force_gimple_operand (base, &seq, true, NULL_TREE);
+  gimple *fn_call = gimple_build_call_internal (IFN_RAWMEMCHR, 2, mem, pattern);
+  tree reduction_var_new = copy_ssa_name (reduction_var);
+  gimple_call_set_lhs (fn_call, reduction_var_new);
+  gimple_set_location (fn_call, loc);
+  gimple_seq_add_stmt (&seq, fn_call);
+
+  if (store_dr)
+    {
+      gassign *g = gimple_build_assign (DR_REF (store_dr), reduction_var_new);
+      gimple_seq_add_stmt (&seq, g);
+    }
+
+  generate_reduction_builtin_1 (loop, seq, reduction_var, reduction_var_new,
+				"generated rawmemchr%s\n",
+				TYPE_MODE (TREE_TYPE (TREE_TYPE (base))));
+}
+
+/* Helper function for generate_strlen_builtin{,_using_rawmemchr}.  */
+
+static void
+generate_strlen_builtin_1 (loop_p loop, gimple_seq &seq,
+			   tree reduction_var_old, tree reduction_var_new,
+			   machine_mode mode, tree start_len)
+{
+  /* REDUCTION_VAR_NEW has either size type or ptrdiff type and must be
+     converted if the types of the old and new reduction variables differ.  */
+  reduction_var_new = gimple_convert (&seq, TREE_TYPE (reduction_var_old),
+				      reduction_var_new);
+
+  /* Loops of the form `for (i=42; s[i]; ++i);` have an additional start
+     length.  */
+  if (!integer_zerop (start_len))
+    {
+      tree lhs = make_ssa_name (TREE_TYPE (reduction_var_new));
+      gimple *g = gimple_build_assign (lhs, PLUS_EXPR, reduction_var_new,
+				       start_len);
+      gimple_seq_add_stmt (&seq, g);
+      reduction_var_new = lhs;
+    }
+
+  generate_reduction_builtin_1 (loop, seq, reduction_var_old, reduction_var_new,
+				"generated strlen%s\n", mode);
+}
+
+/* Generate a call to strlen and place it before LOOP.  REDUCTION_VAR is
+   replaced with a fresh SSA name representing the result of the call.  */
+
+static void
+generate_strlen_builtin (loop_p loop, tree reduction_var, tree base,
+			 tree start_len, location_t loc)
+{
+  gimple_seq seq = NULL;
+
+  tree reduction_var_new = make_ssa_name (size_type_node);
+
+  tree mem = force_gimple_operand (base, &seq, true, NULL_TREE);
+  tree fn = build_fold_addr_expr (builtin_decl_implicit (BUILT_IN_STRLEN));
+  gimple *fn_call = gimple_build_call (fn, 1, mem);
+  gimple_call_set_lhs (fn_call, reduction_var_new);
+  gimple_set_location (fn_call, loc);
+  gimple_seq_add_stmt (&seq, fn_call);
+
+  generate_strlen_builtin_1 (loop, seq, reduction_var, reduction_var_new,
+			     QImode, start_len);
+}
+
+/* Generate code to mimic the behaviour of strlen, but this time over an
+   array of elements with a mode different from QI.  REDUCTION_VAR is replaced
+   with a fresh SSA name representing the result, i.e., the length.  */
+
+static void
+generate_strlen_builtin_using_rawmemchr (loop_p loop, tree reduction_var,
+					 tree base, tree load_type,
+					 tree start_len, location_t loc)
+{
+  gimple_seq seq = NULL;
+
+  tree start = force_gimple_operand (base, &seq, true, NULL_TREE);
+  tree zero = build_zero_cst (load_type);
+  gimple *fn_call = gimple_build_call_internal (IFN_RAWMEMCHR, 2, start, zero);
+  tree end = make_ssa_name (TREE_TYPE (base));
+  gimple_call_set_lhs (fn_call, end);
+  gimple_set_location (fn_call, loc);
+  gimple_seq_add_stmt (&seq, fn_call);
+
+  /* Determine the number of elements between START and END by
+     evaluating (END - START) / sizeof (*START).  */
+  tree diff = make_ssa_name (ptrdiff_type_node);
+  gimple *diff_stmt = gimple_build_assign (diff, POINTER_DIFF_EXPR, end, start);
+  gimple_seq_add_stmt (&seq, diff_stmt);
+  /* Let SIZE be the size of each character.  */
+  tree size = gimple_convert (&seq, ptrdiff_type_node,
+			      TYPE_SIZE_UNIT (load_type));
+  tree count = make_ssa_name (ptrdiff_type_node);
+  gimple *count_stmt = gimple_build_assign (count, TRUNC_DIV_EXPR, diff, size);
+  gimple_seq_add_stmt (&seq, count_stmt);
+
+  generate_strlen_builtin_1 (loop, seq, reduction_var, count,
+			     TYPE_MODE (load_type),
+			     start_len);
+}
+
+/* Return true if we can count at least as many characters by taking a
+   pointer difference as we can count via REDUCTION_VAR without an overflow.
+   Thus check whether 2^n < 2^(m-1) / s holds, where
+   n = TYPE_PRECISION (reduction_var), m = TYPE_PRECISION (ptrdiff_type_node),
+   and s = size of each character.  */
+static bool
+reduction_var_overflows_first (tree reduction_var, tree load_type)
+{
+  widest_int n2 = wi::lshift (1, TYPE_PRECISION (TREE_TYPE (reduction_var)));
+  widest_int m2 = wi::lshift (1, TYPE_PRECISION (ptrdiff_type_node) - 1);
+  widest_int s = wi::to_widest (TYPE_SIZE_UNIT (load_type));
+  return wi::ltu_p (n2, wi::udiv_trunc (m2, s));
+}
+
+static gimple *
+determine_reduction_stmt_1 (const loop_p loop, const basic_block *bbs)
+{
+  gimple *reduction_stmt = NULL;
+
+  for (unsigned i = 0, ninsns = 0; i < loop->num_nodes; ++i)
+    {
+      basic_block bb = bbs[i];
+
+      for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi);
+	   gsi_next_nondebug (&bsi))
+	{
+	  gphi *phi = bsi.phi ();
+	  if (virtual_operand_p (gimple_phi_result (phi)))
+	    continue;
+	  if (stmt_has_scalar_dependences_outside_loop (loop, phi))
+	    {
+	      if (reduction_stmt)
+		return NULL;
+	      reduction_stmt = phi;
+	    }
+	}
+
+      for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi);
+	   gsi_next_nondebug (&bsi), ++ninsns)
+	{
+	  /* Bail out early for loops which are unlikely to match.  */
+	  if (ninsns > 16)
+	    return NULL;
+	  gimple *stmt = gsi_stmt (bsi);
+	  if (gimple_clobber_p (stmt))
+	    continue;
+	  if (gimple_code (stmt) == GIMPLE_LABEL)
+	    continue;
+	  if (gimple_has_volatile_ops (stmt))
+	    return NULL;
+	  if (stmt_has_scalar_dependences_outside_loop (loop, stmt))
+	    {
+	      if (reduction_stmt)
+		return NULL;
+	      reduction_stmt = stmt;
+	    }
+	}
+    }
+
+  return reduction_stmt;
+}
+
+/* If LOOP has a single non-volatile reduction statement, then return a pointer
+   to it.  Otherwise return NULL.  */
+static gimple *
+determine_reduction_stmt (const loop_p loop)
+{
+  basic_block *bbs = get_loop_body (loop);
+  gimple *reduction_stmt = determine_reduction_stmt_1 (loop, bbs);
+  XDELETEVEC (bbs);
+  return reduction_stmt;
+}
+
+/* Transform loops which mimic the effects of builtins rawmemchr or strlen and
+   replace them accordingly.  For example, a loop of the form
+
+     for (; *p != 42; ++p);
+
+   is replaced by
+
+     p = rawmemchr<MODE> (p, 42);
+
+   under the assumption that rawmemchr is available for a particular MODE.
+   Another example is
+
+     int i;
+     for (i = 42; s[i]; ++i);
+
+   which is replaced by
+
+     i = (int)strlen (&s[42]) + 42;
+
+   for some character array S.  In case array S is not of type character array
+   we end up with
+
+     i = (int)(rawmemchr<MODE> (&s[42], 0) - &s[42]) + 42;
+
+   assuming that rawmemchr is available for a particular MODE.  */
+
+bool
+loop_distribution::transform_reduction_loop (loop_p loop)
+{
+  gimple *reduction_stmt;
+  data_reference_p load_dr = NULL, store_dr = NULL;
+
+  edge e = single_exit (loop);
+  gcond *cond = safe_dyn_cast <gcond *> (last_stmt (e->src));
+  if (!cond)
+    return false;
+  /* Ensure loop condition is an (in)equality test and loop is exited either if
+     the inequality test fails or the equality test succeeds.  */
+  if (!(e->flags & EDGE_FALSE_VALUE && gimple_cond_code (cond) == NE_EXPR)
+      && !(e->flags & EDGE_TRUE_VALUE && gimple_cond_code (cond) == EQ_EXPR))
+    return false;
+  /* A limitation of the current implementation is that we only support
+     constant patterns in (in)equality tests.  */
+  tree pattern = gimple_cond_rhs (cond);
+  if (TREE_CODE (pattern) != INTEGER_CST)
+    return false;
+
+  reduction_stmt = determine_reduction_stmt (loop);
+
+  /* A limitation of the current implementation is that we require a reduction
+     statement.  Therefore, loops without a reduction statement as in the
+     following are not recognized:
+     int *p;
+     void foo (void) { for (; *p; ++p); } */
+  if (reduction_stmt == NULL)
+    return false;
+
+  /* Reduction variables are guaranteed to be SSA names.  */
+  tree reduction_var;
+  switch (gimple_code (reduction_stmt))
+    {
+    case GIMPLE_ASSIGN:
+    case GIMPLE_PHI:
+      reduction_var = gimple_get_lhs (reduction_stmt);
+      break;
+    default:
+      /* Bail out e.g. for GIMPLE_CALL.  */
+      return false;
+    }
+
+  struct graph *rdg = build_rdg (loop, NULL);
+  if (rdg == NULL)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file,
+		 "Loop %d not transformed: failed to build the RDG.\n",
+		 loop->num);
+
+      return false;
+    }
+  auto_bitmap partition_stmts;
+  bitmap_set_range (partition_stmts, 0, rdg->n_vertices);
+  find_single_drs (loop, rdg, partition_stmts, &store_dr, &load_dr);
+  free_rdg (rdg);
+
+  /* Bail out if there is no single load.  */
+  if (load_dr == NULL)
+    return false;
+
+  /* Reaching this point we have a loop with a single reduction variable,
+     a single load, and an optional single store.  */
+
+  tree load_ref = DR_REF (load_dr);
+  tree load_type = TREE_TYPE (load_ref);
+  tree load_access_base = build_fold_addr_expr (load_ref);
+  tree load_access_size = TYPE_SIZE_UNIT (load_type);
+  affine_iv load_iv, reduction_iv;
+
+  if (!INTEGRAL_TYPE_P (load_type)
+      || !type_has_mode_precision_p (load_type))
+    return false;
+
+  /* We already ensured that the loop condition tests for (in)equality where
+     the rhs is a constant pattern.  Now ensure that the lhs is the result of
+     the load.  */
+  if (gimple_cond_lhs (cond) != gimple_assign_lhs (DR_STMT (load_dr)))
+    return false;
+
+  /* Bail out if no affine induction variable with constant step can be
+     determined.  */
+  if (!simple_iv (loop, loop, load_access_base, &load_iv, false))
+    return false;
+
+  /* Bail out if memory accesses are not consecutive or not growing.  */
+  if (!operand_equal_p (load_iv.step, load_access_size, 0))
+    return false;
+
+  if (!simple_iv (loop, loop, reduction_var, &reduction_iv, false))
+    return false;
+
+  /* Handle rawmemchr like loops.  */
+  if (operand_equal_p (load_iv.base, reduction_iv.base)
+      && operand_equal_p (load_iv.step, reduction_iv.step))
+    {
+      if (store_dr)
+	{
+	  /* Ensure that we store to X and load from X+I where I>0.  */
+	  if (TREE_CODE (load_iv.base) != POINTER_PLUS_EXPR
+	      || !integer_onep (TREE_OPERAND (load_iv.base, 1)))
+	    return false;
+	  tree ptr_base = TREE_OPERAND (load_iv.base, 0);
+	  if (TREE_CODE (ptr_base) != SSA_NAME)
+	    return false;
+	  gimple *def = SSA_NAME_DEF_STMT (ptr_base);
+	  if (!gimple_assign_single_p (def)
+	      || gimple_assign_rhs1 (def) != DR_REF (store_dr))
+	    return false;
+	  /* Ensure that the reduction value is stored.  */
+	  if (gimple_assign_rhs1 (DR_STMT (store_dr)) != reduction_var)
+	    return false;
+	}
+      /* Bail out if target does not provide rawmemchr for a certain mode.  */
+      machine_mode mode = TYPE_MODE (load_type);
+      if (direct_optab_handler (rawmemchr_optab, mode) == CODE_FOR_nothing)
+	return false;
+      location_t loc = gimple_location (DR_STMT (load_dr));
+      generate_rawmemchr_builtin (loop, reduction_var, store_dr, load_iv.base,
+				  pattern, loc);
+      return true;
+    }
+
+  /* Handle strlen like loops.  */
+  if (store_dr == NULL
+      && integer_zerop (pattern)
+      && TREE_CODE (reduction_iv.base) == INTEGER_CST
+      && TREE_CODE (reduction_iv.step) == INTEGER_CST
+      && integer_onep (reduction_iv.step))
+    {
+      location_t loc = gimple_location (DR_STMT (load_dr));
+      /* While determining the length of a string an overflow might occur.
+	 If an overflow only occurs in the loop implementation and not in the
+	 strlen implementation, then either the overflow is undefined or the
+	 truncated result of strlen equals the one of the loop.  Otherwise if
+	 an overflow may also occur in the strlen implementation, then
+	 replacing a loop by a call to strlen is sound whenever we ensure that
+	 if an overflow occurs in the strlen implementation, then also an
+	 overflow occurs in the loop implementation which is undefined.  It
+	 seems reasonable to relax this and assume that the strlen
+	 implementation cannot overflow in case sizetype is big enough in the
+	 sense that an overflow can only happen for string objects which are
+	 bigger than half of the address space; at least for 32-bit targets and
+	 up.
+
+	 For strlen which makes use of rawmemchr the maximal length of a string
+	 which can be determined without an overflow is PTRDIFF_MAX / S where
+	 each character has size S.  Since an overflow for ptrdiff type is
+	 undefined we have to make sure that if an overflow occurs, then an
+	 overflow occurs in the loop implementation, too, and this is
+	 undefined, too.  As before, we relax this and assume that no
+	 string object is larger than half of the address space; at least for
+	 32-bit targets and up.  */
+      if (TYPE_MODE (load_type) == TYPE_MODE (char_type_node)
+	  && TYPE_PRECISION (load_type) == TYPE_PRECISION (char_type_node)
+	  && ((TYPE_PRECISION (sizetype) >= TYPE_PRECISION (ptr_type_node) - 1
+	       && TYPE_PRECISION (ptr_type_node) >= 32)
+	      || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
+		  && TYPE_PRECISION (TREE_TYPE (reduction_var)) <= TYPE_PRECISION (sizetype)))
+	  && builtin_decl_implicit (BUILT_IN_STRLEN))
+	generate_strlen_builtin (loop, reduction_var, load_iv.base,
+				 reduction_iv.base, loc);
+      else if (direct_optab_handler (rawmemchr_optab, TYPE_MODE (load_type))
+	       != CODE_FOR_nothing
+	       && ((TYPE_PRECISION (ptrdiff_type_node) == TYPE_PRECISION (ptr_type_node)
+		    && TYPE_PRECISION (ptrdiff_type_node) >= 32)
+		   || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
+		       && reduction_var_overflows_first (reduction_var, load_type))))
+	generate_strlen_builtin_using_rawmemchr (loop, reduction_var,
+						 load_iv.base,
+						 load_type,
+						 reduction_iv.base, loc);
+      else
+	return false;
+      return true;
+    }
+
+  return false;
+}
+
 /* Given innermost LOOP, return the outermost enclosing loop that forms a
    perfect loop nest.  */
 
@@ -3324,10 +3767,27 @@ loop_distribution::execute (function *fun)
 	      && !optimize_loop_for_speed_p (loop)))
 	continue;
 
-      /* Don't distribute loop if niters is unknown.  */
+      /* If niters is unknown don't distribute loop but rather try to transform
+	 it to a call to a builtin.  */
       tree niters = number_of_latch_executions (loop);
       if (niters == NULL_TREE || niters == chrec_dont_know)
-	continue;
+	{
+	  datarefs_vec.create (20);
+	  if (transform_reduction_loop (loop))
+	    {
+	      changed = true;
+	      loops_to_be_destroyed.safe_push (loop);
+	      if (dump_enabled_p ())
+		{
+		  dump_user_location_t loc = find_loop_location (loop);
+		  dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
+				   loc, "Loop %d transformed into a builtin.\n",
+				   loop->num);
+		}
+	    }
+	  free_data_refs (datarefs_vec);
+	  continue;
+	}
 
       /* Get the perfect loop nest for distribution.  */
       loop = prepare_perfect_loop_nest (loop);
-- 
2.31.1

