* Thoughts on memcmp expansion (PR43052)
From: Bernd Schmidt @ 2016-01-15 16:58 UTC
  To: GCC Patches, Nick Clifton

[-- Attachment #1: Type: text/plain, Size: 1963 bytes --]

PR43052 is a PR complaining about how the rep cmpsb expansion that gcc 
uses for memcmp is slower than the library function. As is so often the 
case, if you investigate a bit, you can find a lot of issues with the 
current situation in the compiler.

This PR was accidentally fixed by a patch by Nick which disabled the use 
of cmpstrnsi for memcmp expansion, on the grounds that cmpstrnsi could 
stop looking after seeing a null byte, which would be invalid for 
memcmp, so only cmpmemsi should be used. This fix was for an out-of-tree 
target.

I believe the rep cmpsb sequence used by i386 would actually be valid, 
so we could duplicate the cmpstrn pattern to also match cmpmem and be 
done - but that would then again cause the performance problem described 
in the PR, so it's probably not a good idea.

One question Richard posed in the comments: why aren't we optimizing 
small constant size memcmps other than size 1 to *s == *q? The reason is 
the return value of memcmp, which implies byte-sized operation 
(incidentally, the use of SImode in the cmpmem/cmpstr patterns is really 
odd). It's possible to work around this, but expansion becomes a little 
more tricky (subtract after bswap, maybe). Still, the current code 
generation is lame.
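
To make the bswap idea concrete, here is a hand-written sketch (not
part of the patch below) of how a full 4-byte memcmp could be expanded
on a little-endian target:

  /* Hypothetical expansion of memcmp (p, q, 4): byte-swapping makes
     the native integer comparison match the lexicographic byte order
     that memcmp requires.  Needs <stdint.h> and <string.h>.  */
  static int
  memcmp4 (const void *p, const void *q)
  {
    uint32_t a, b;
    memcpy (&a, p, 4);  /* unaligned-safe loads */
    memcpy (&b, q, 4);
    a = __builtin_bswap32 (a);
    b = __builtin_bswap32 (b);
    return (a > b) - (a < b);  /* -1/0/1, and no subtraction overflow */
  }

For the equality-only case none of the byte swapping is needed, which
is what makes the __memcmp_eq variant below so much cheaper.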

So, for gcc-6, I think we shouldn't do anything. The PR is fixed, and 
there's no easy bug-fix that can be done to improve matters. Not sure 
whether to keep the PR open or create a new one for the remaining 
issues. For the next stage1, I'm attaching a proof-of-concept patch that 
does the following:
  * notice if memcmp results are only used for equality comparison
    against zero
  * if so, replace with a different builtin __memcmp_eq
  * Expand __memcmp_eq for small constant sizes with loads and
    comparison, fall back to a memcmp call.

The whole thing could be extended to work for sizes larger than an int, 
along the lines of memcpy expansion controlled by move ratio etc. Thoughts?
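
For illustration, on a 64-bit target an equality-only compare of 16
bytes could then be expanded roughly like this (a sketch under that
assumption, not code from the attached patch):

  /* Hypothetical expansion of memcmp (p, q, 16) == 0 using two
     word-sized loads per operand.  Needs <stdint.h> and <string.h>.  */
  static int
  memcmp16_eq (const void *p, const void *q)
  {
    const char *s = (const char *) p, *t = (const char *) q;
    uint64_t a0, b0, a1, b1;
    memcpy (&a0, s, 8);
    memcpy (&b0, t, 8);
    memcpy (&a1, s + 8, 8);
    memcpy (&b1, t + 8, 8);
    /* A difference in either word makes the result false.  */
    return ((a0 ^ b0) | (a1 ^ b1)) == 0;
  }

No bswap or sign fixup is needed since only equality is tested.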


Bernd

[-- Attachment #2: memcmp-eq.diff --]
[-- Type: text/x-patch, Size: 7989 bytes --]

Index: gcc/builtins.c
===================================================================
--- gcc/builtins.c	(revision 232359)
+++ gcc/builtins.c	(working copy)
@@ -3699,25 +3699,22 @@ expand_cmpstrn_or_cmpmem (insn_code icod
 
 /* Expand expression EXP, which is a call to the memcmp built-in function.
    Return NULL_RTX if we failed and the caller should emit a normal call,
-   otherwise try to get the result in TARGET, if convenient.  */
+   otherwise try to get the result in TARGET, if convenient.
+   RESULT_EQ is true if we can relax the returned value to be either zero
+   or nonzero, without caring about the sign.  */
 
 static rtx
-expand_builtin_memcmp (tree exp, rtx target)
+expand_builtin_memcmp (tree exp, rtx target, bool result_eq)
 {
   if (!validate_arglist (exp,
  			 POINTER_TYPE, POINTER_TYPE, INTEGER_TYPE, VOID_TYPE))
     return NULL_RTX;
 
-  /* Note: The cmpstrnsi pattern, if it exists, is not suitable for
-     implementing memcmp because it will stop if it encounters two
-     zero bytes.  */
-  insn_code icode = direct_optab_handler (cmpmem_optab, SImode);
-  if (icode == CODE_FOR_nothing)
-    return NULL_RTX;
-
   tree arg1 = CALL_EXPR_ARG (exp, 0);
   tree arg2 = CALL_EXPR_ARG (exp, 1);
   tree len = CALL_EXPR_ARG (exp, 2);
+  machine_mode mode = TYPE_MODE (TREE_TYPE (exp));
+  location_t loc = EXPR_LOCATION (exp);
 
   unsigned int arg1_align = get_pointer_alignment (arg1) / BITS_PER_UNIT;
   unsigned int arg2_align = get_pointer_alignment (arg2) / BITS_PER_UNIT;
@@ -3725,12 +3722,27 @@ expand_builtin_memcmp (tree exp, rtx tar
   /* If we don't have POINTER_TYPE, call the function.  */
   if (arg1_align == 0 || arg2_align == 0)
     return NULL_RTX;
+  unsigned int min_align = MIN (arg1_align, arg2_align);
+
+  /* Note: The cmpstrnsi pattern, if it exists, is not suitable for
+     implementing memcmp because it will stop if it encounters two
+     zero bytes.  */
+  insn_code icode = direct_optab_handler (cmpmem_optab, SImode);
+
+  rtx arg3_rtx = expand_normal (fold_convert_loc (loc, sizetype, len));
+  machine_mode direct_mode = VOIDmode;
+  if (CONST_INT_P (arg3_rtx))
+    direct_mode = mode_for_size (INTVAL (arg3_rtx) * BITS_PER_UNIT,
+				 MODE_INT, 1);
+  if (icode == CODE_FOR_nothing
+      && (!result_eq
+	  || direct_mode == VOIDmode
+	  || direct_mode == BLKmode
+	  || GET_MODE_ALIGNMENT (direct_mode) / BITS_PER_UNIT > min_align))
+    return NULL_RTX;
 
-  machine_mode mode = TYPE_MODE (TREE_TYPE (exp));
-  location_t loc = EXPR_LOCATION (exp);
   rtx arg1_rtx = get_memory_rtx (arg1, len);
   rtx arg2_rtx = get_memory_rtx (arg2, len);
-  rtx arg3_rtx = expand_normal (fold_convert_loc (loc, sizetype, len));
 
   /* Set MEM_SIZE as appropriate.  */
   if (CONST_INT_P (arg3_rtx))
@@ -3739,6 +3751,27 @@ expand_builtin_memcmp (tree exp, rtx tar
       set_mem_size (arg2_rtx, INTVAL (arg3_rtx));
     }
 
+  if (icode == CODE_FOR_nothing)
+    {
+      arg1_rtx = change_address (arg1_rtx, direct_mode, XEXP (arg1_rtx, 0));
+      arg2_rtx = change_address (arg2_rtx, direct_mode, XEXP (arg2_rtx, 0));
+      start_sequence ();
+      rtx op0 = force_reg (direct_mode, arg1_rtx);
+      rtx op1 = force_reg (direct_mode, arg2_rtx);
+      rtx tem = emit_store_flag (target, NE, op0, op1,
+				 direct_mode, true, false);
+      rtx seq = get_insns ();
+      end_sequence ();
+      if (tem != 0)
+	{
+	  emit_insn (seq);
+	  return target;
+	}
+    }
+
+  if (icode == CODE_FOR_nothing)
+    return NULL_RTX;
+
   rtx result = expand_cmpstrn_or_cmpmem (icode, target, arg1_rtx, arg2_rtx,
 					 TREE_TYPE (len), arg3_rtx,
 					 MIN (arg1_align, arg2_align));
@@ -5997,9 +6030,16 @@ expand_builtin (tree exp, rtx target, rt
 
     case BUILT_IN_BCMP:
     case BUILT_IN_MEMCMP:
-      target = expand_builtin_memcmp (exp, target);
+    case BUILT_IN_MEMCMP_EQ:
+      target = expand_builtin_memcmp (exp, target, fcode == BUILT_IN_MEMCMP_EQ);
       if (target)
 	return target;
+      if (fcode == BUILT_IN_MEMCMP_EQ)
+	{
+	  exp = copy_node (exp);
+	  tree newdecl = builtin_decl_explicit (BUILT_IN_MEMCMP);
+	  TREE_OPERAND (exp, 1) = build_fold_addr_expr (newdecl);
+	}
       break;
 
     case BUILT_IN_SETJMP:
Index: gcc/builtins.def
===================================================================
--- gcc/builtins.def	(revision 232359)
+++ gcc/builtins.def	(working copy)
@@ -617,6 +617,7 @@ DEF_EXT_LIB_BUILTIN    (BUILT_IN_BZERO,
 DEF_EXT_LIB_BUILTIN    (BUILT_IN_INDEX, "index", BT_FN_STRING_CONST_STRING_INT, ATTR_PURE_NOTHROW_NONNULL_LEAF)
 DEF_LIB_BUILTIN        (BUILT_IN_MEMCHR, "memchr", BT_FN_PTR_CONST_PTR_INT_SIZE, ATTR_PURE_NOTHROW_NONNULL_LEAF)
 DEF_LIB_BUILTIN        (BUILT_IN_MEMCMP, "memcmp", BT_FN_INT_CONST_PTR_CONST_PTR_SIZE, ATTR_PURE_NOTHROW_NONNULL_LEAF)
+DEF_GCC_BUILTIN        (BUILT_IN_MEMCMP_EQ, "__memcmp_eq", BT_FN_INT_CONST_PTR_CONST_PTR_SIZE, ATTR_PURE_NOTHROW_NONNULL_LEAF)
 DEF_LIB_BUILTIN_CHKP   (BUILT_IN_MEMCPY, "memcpy", BT_FN_PTR_PTR_CONST_PTR_SIZE, ATTR_RET1_NOTHROW_NONNULL_LEAF)
 DEF_LIB_BUILTIN_CHKP   (BUILT_IN_MEMMOVE, "memmove", BT_FN_PTR_PTR_CONST_PTR_SIZE, ATTR_RET1_NOTHROW_NONNULL_LEAF)
 DEF_EXT_LIB_BUILTIN_CHKP (BUILT_IN_MEMPCPY, "mempcpy", BT_FN_PTR_PTR_CONST_PTR_SIZE, ATTR_NOTHROW_NONNULL_LEAF)
Index: gcc/gimple-fold.c
===================================================================
--- gcc/gimple-fold.c	(revision 232359)
+++ gcc/gimple-fold.c	(working copy)
@@ -2539,13 +2539,52 @@ gimple_fold_builtin_fprintf (gimple_stmt
   return false;
 }
 
+/* Fold a call to the memcmp builtin.  ARG0, ARG1 and ARG2 are the
+   arguments to the call.
+
+   Return false if no simplification was possible, otherwise return
+   true and replace the call with a simplified form.  */
+
+static bool
+gimple_fold_builtin_memcmp (gimple_stmt_iterator *gsi, tree arg0, tree arg1, tree arg2)
+{
+  gcall *stmt = as_a <gcall *> (gsi_stmt (*gsi));
+
+  /* If the return value is used, don't do the transformation.  */
+  tree res = gimple_call_lhs (stmt);
+  if (res == NULL_TREE || TREE_CODE (res) != SSA_NAME)
+    return false;
+
+  imm_use_iterator iter;
+  use_operand_p use_p;
+  FOR_EACH_IMM_USE_FAST (use_p, iter, res)
+    {
+      gimple *ustmt = USE_STMT (use_p);
+
+      if (gimple_code (ustmt) != GIMPLE_ASSIGN)
+	return false;
+      gassign *asgn = as_a <gassign *> (ustmt);
+      tree_code code = gimple_assign_rhs_code (asgn);
+      if ((code != EQ_EXPR && code != NE_EXPR)
+	  || !integer_zerop (gimple_assign_rhs2 (asgn)))
+	return false;
+    }
+
+  tree newfn = builtin_decl_explicit (BUILT_IN_MEMCMP_EQ);
+  gcall *repl = gimple_build_call (newfn, 3, arg0, arg1, arg2);
+  gimple_call_set_lhs (repl, res);
+  gimple_set_vuse (repl, gimple_vuse (stmt));
+  replace_call_with_call_and_fold (gsi, repl);
+  return true;
+}
+
 /* Fold a call to the {,v}printf{,_unlocked} and __{,v}printf_chk builtins.
    FMT and ARG are the arguments to the call; we don't fold cases with
    more than 2 arguments, and ARG may be null if this is a 1-argument case.
+   FCODE is the BUILT_IN_* code of the function to be simplified.
 
-   Return NULL_TREE if no simplification was possible, otherwise return the
-   simplified form of the call as a tree.  FCODE is the BUILT_IN_*
-   code of the function to be simplified.  */
+   Return false if no simplification was possible, otherwise return
+   true and replace the call with a simplified form.    */
 
 static bool
 gimple_fold_builtin_printf (gimple_stmt_iterator *gsi, tree fmt,
@@ -2796,6 +2835,10 @@ gimple_fold_builtin (gimple_stmt_iterato
     case BUILT_IN_MEMMOVE:
       return gimple_fold_builtin_memory_op (gsi, gimple_call_arg (stmt, 0),
 					    gimple_call_arg (stmt, 1), 3);
+    case BUILT_IN_MEMCMP:
+      return gimple_fold_builtin_memcmp (gsi, gimple_call_arg (stmt, 0),
+					 gimple_call_arg (stmt, 1),
+					 gimple_call_arg (stmt, 2));
     case BUILT_IN_SPRINTF_CHK:
     case BUILT_IN_VSPRINTF_CHK:
       return gimple_fold_builtin_sprintf_chk (gsi, fcode);


* Re: Thoughts on memcmp expansion (PR43052)
From: Richard Biener @ 2016-01-18  9:22 UTC
  To: Bernd Schmidt; +Cc: GCC Patches, Nick Clifton

On Fri, Jan 15, 2016 at 5:58 PM, Bernd Schmidt <bschmidt@redhat.com> wrote:
> PR43052 is a PR complaining about how the rep cmpsb expansion that gcc uses
> for memcmp is slower than the library function. As is so often the case, if
> you investigate a bit, you can find a lot of issues with the current
> situation in the compiler.
>
> This PR was accidentally fixed by a patch by Nick which disabled the use of
> cmpstrnsi for memcmp expansion, on the grounds that cmpstrnsi could stop
> looking after seeing a null byte, which would be invalid for memcmp, so only
> cmpmemsi should be used. This fix was for an out-of-tree target.
>
> I believe the rep cmpsb sequence used by i386 would actually be valid, so we
> could duplicate the cmpstrn pattern to also match cmpmem and be done - but
> that would then again cause the performance problem described in the PR, so
> it's probably not a good idea.
>
> One question Richard posed in the comments: why aren't we optimizing small
> constant size memcmps other than size 1 to *s == *q? The reason is the
> return value of memcmp, which implies byte-sized operation (incidentally,
> the use of SImode in the cmpmem/cmpstr patterns is really odd). It's
> possible to work around this, but expansion becomes a little more tricky
> (subtract after bswap, maybe). Still, the current code generation is lame.
>
> So, for gcc-6, I think we shouldn't do anything. The PR is fixed, and
> there's no easy bug-fix that can be done to improve matters. Not sure
> whether to keep the PR open or create a new one for the remaining issues.
> For the next stage1, I'm attaching a proof-of-concept patch that does the
> following:
>  * notice if memcmp results are only used for equality comparison
>    against zero
>  * if so, replace with a different builtin __memcmp_eq
>  * Expand __memcmp_eq for small constant sizes with loads and
>    comparison, fall back to a memcmp call.
>
> The whole thing could be extended to work for sizes larger than an int,
> along the lines of memcpy expansion controlled by move ratio etc. Thoughts?

See also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52171 - the
inline expansion for small sizes and equality compares should be done
on GIMPLE.  Today the strlen pass might be an appropriate place to do
this, given its superior knowledge about string lengths.

The idea of turning eq feeding memcmp into a special memcmp_eq is good but
you have to avoid doing that too early - otherwise you'd lose on

  res = memcmp (p, q, sz);
  if (memcmp (p, q, sz) == 0)
   ...

that is, you have to make sure CSE got the chance to common the two calls.
This is why I think this kind of transform needs to happen in specific places
(like during strlen opt) rather than in generic folding.
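
(Spelling that out with a hypothetical continuation of the example: if
the second call were rewritten too early, the IL would contain

  res = memcmp (p, q, sz);
  if (__memcmp_eq (p, q, sz) == 0)
   ...

and CSE could no longer common the two calls, so both would be
executed.)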

Richard.

>
> Bernd


* Re: Thoughts on memcmp expansion (PR43052)
From: Nick Clifton @ 2016-01-18 12:22 UTC
  To: Bernd Schmidt, GCC Patches

Hi Bernd,

+      rtx op0 = force_reg (direct_mode, arg1_rtx);
+      rtx op1 = force_reg (direct_mode, arg2_rtx);
+      rtx tem = emit_store_flag (target, NE, op0, op1,
+				 direct_mode, true, false);

This is me being ignorant here... wouldn't it be easier to have a new
cmpmem_eq pattern (and resulting optab) than to generate this code
sequence directly?  That way backends can choose to support this
optimization, and if they do, they can also choose to support longer
lengths of comparison.


  DEF_LIB_BUILTIN        (BUILT_IN_MEMCMP, "memcmp", BT_FN_INT_CONST_PTR_CONST_PTR_SIZE, ATTR_PURE_NOTHROW_NONNULL_LEAF)
+DEF_GCC_BUILTIN        (BUILT_IN_MEMCMP_EQ, "__memcmp_eq", BT_FN_INT_CONST_PTR_CONST_PTR_SIZE, ATTR_PURE_NOTHROW_NONNULL_LEAF)
  DEF_LIB_BUILTIN_CHKP   (BUILT_IN_MEMCPY, "memcpy", BT_FN_PTR_PTR_CONST_PTR_SIZE, ATTR_RET1_NOTHROW_NONNULL_LEAF)

Presumably you would also document this new builtin in doc/extend.texi?
Plus maybe add a testcase for it as well?


+  /* If the return value is used, don't do the transformation.  */

This comment struck me as wrong.  Surely if the return value is not
used, then the entire memcmp can be transformed into nothing.  Plus if
the return value is used, but only for an equality comparison with
zero, then the transformation can take place.


Cheers
   Nick


* Re: Thoughts on memcmp expansion (PR43052)
From: Jeff Law @ 2016-01-19 21:36 UTC
  To: Bernd Schmidt, GCC Patches, Nick Clifton

On 01/15/2016 09:58 AM, Bernd Schmidt wrote:
> PR43052 is a PR complaining about how the rep cmpsb expansion that gcc
> uses for memcmp is slower than the library function. As is so often the
> case, if you investigate a bit, you can find a lot of issues with the
> current situation in the compiler.
>
> This PR was accidentally fixed by a patch by Nick which disabled the use
> of cmpstrnsi for memcmp expansion, on the grounds that cmpstrnsi could
> stop looking after seeing a null byte, which would be invalid for
> memcmp, so only cmpmemsi should be used. This fix was for an out-of-tree
> target.
>
> I believe the rep cmpsb sequence used by i386 would actually be valid,
> so we could duplicate the cmpstrn pattern to also match cmpmem and be
> done - but that would then again cause the performance problem described
> in the PR, so it's probably not a good idea.
>
> One question Richard posed in the comments: why aren't we optimizing
> small constant size memcmps other than size 1 to *s == *q? The reason is
> the return value of memcmp, which implies byte-sized operation
> (incidentally, the use of SImode in the cmpmem/cmpstr patterns is really
> odd). It's possible to work around this, but expansion becomes a little
> more tricky (subtract after bswap, maybe). Still, the current code
> generation is lame.
>
> So, for gcc-6, I think we shouldn't do anything. The PR is fixed, and
> there's no easy bug-fix that can be done to improve matters. Not sure
> whether to keep the PR open or create a new one for the remaining
> issues. For the next stage1, I'm attaching a proof-of-concept patch that
> does the following:
>   * notice if memcmp results are only used for equality comparison
>     against zero
>   * if so, replace with a different builtin __memcmp_eq
>   * Expand __memcmp_eq for small constant sizes with loads and
>     comparison, fall back to a memcmp call.
>
> The whole thing could be extended to work for sizes larger than an int,
> along the lines of memcpy expansion controlled by move ratio etc. Thoughts?
Seems generally reasonable to me.  I've looked at that BZ more than once 
and thought "ewww, I don't want to touch this mess", so I'm more than 
happy to have you taking the lead on it.

Jeff


* Re: Thoughts on memcmp expansion (PR43052)
From: Bernd Schmidt @ 2016-04-28 18:36 UTC
  To: Richard Biener; +Cc: GCC Patches, Nick Clifton

[-- Attachment #1: Type: text/plain, Size: 1912 bytes --]

On 01/18/2016 10:22 AM, Richard Biener wrote:
> See also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52171 - the
> inline expansion
> for small sizes and equality compares should be done on GIMPLE.  Today the
> strlen pass might be an appropriate place to do this given its
> superior knowledge
> about string lengths.
>
> The idea of turning eq feeding memcmp into a special memcmp_eq is good but
> you have to avoid doing that too early - otherwise you'd lose on
>
>    res = memcmp (p, q, sz);
>    if (memcmp (p, q, sz) == 0)
>     ...
>
> that is, you have to make sure CSE got the chance to common the two calls.
> This is why I think this kind of transform needs to happen in specific places
> (like during strlen opt) rather than in generic folding.

Ok, here's an update. I kept pieces of your patch from that PR, but the
patch below also translates memcmps larger than a single operation into
memcmp_eq, as in my previous patch.

Then, I added by_pieces infrastructure for memcmp expansion. To avoid 
any more code duplication in this area, I abstracted the existing code 
and converted it to C++ classes since that seemed to fit pretty well.

There are a few possible ways I could go with this, which is why I'm
posting it more as an RFD at this point.
  - should store_by_pieces be eliminated in favour of doing just
    move_by_pieces with constfns?
  - the C++ification could be extended, making move_by_pieces_d and
    compare_by_pieces_d classes inheriting from a common base. This
    would get rid of the callbacks, replacing them with virtuals,
    and also make some of the current struct members private
    (a rough sketch follows after this list).
  - could move all of the by_pieces stuff out into a separate file?
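
A rough sketch of the direction the second item suggests (an assumed
shape, not code from the attached patch; it reuses the pieces_addr and
insn_gen_fn types introduced there):

  /* Hypothetical refactoring: a common base class with a virtual
     per-piece generator replacing the extra_data callbacks.  */
  class op_by_pieces_d
  {
   protected:
    pieces_addr m_to, m_from;
    unsigned HOST_WIDE_INT m_len;
    /* Called for each piece; subclasses emit a move or a compare.  */
    virtual void generate (rtx op0, rtx op1, machine_mode mode) = 0;
   public:
    void run ();  /* the mode-iteration loop currently in iterate ()  */
  };

  class move_by_pieces_d : public op_by_pieces_d
  {
    insn_gen_fn m_gen_fun;
    virtual void generate (rtx op0, rtx op1, machine_mode mode);
  };

  class compare_by_pieces_d : public op_by_pieces_d
  {
    rtx_code_label *m_fail_label;
    rtx m_accumulator;
    int m_count, m_batch;
    virtual void generate (rtx op0, rtx op1, machine_mode mode);
  };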

Later, I think we'll also want to extend this to allow vector mode 
operations, but I think that's a separate patch.
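
For instance (a sketch of the vector case, not part of this patch), on
x86 a 16-byte equality compare could use SSE2:

  /* Hypothetical vector expansion of memcmp (p, q, 16) == 0.
     Needs <emmintrin.h>.  */
  static int
  memcmp16_eq_vec (const void *p, const void *q)
  {
    __m128i a = _mm_loadu_si128 ((const __m128i *) p);
    __m128i b = _mm_loadu_si128 ((const __m128i *) q);
    __m128i e = _mm_cmpeq_epi8 (a, b);        /* bytewise equality mask */
    return _mm_movemask_epi8 (e) == 0xffff;   /* all 16 bytes equal */
  }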

So, opinions on what I should be doing with this patch? FWIW it
bootstraps and tests OK on x86_64-linux.


Bernd

[-- Attachment #2: memcmp-eq-0428.diff --]
[-- Type: text/x-patch, Size: 48099 bytes --]

Index: gcc/builtins.c
===================================================================
--- gcc/builtins.c	(revision 235474)
+++ gcc/builtins.c	(working copy)
@@ -3671,53 +3671,24 @@ expand_cmpstr (insn_code icode, rtx targ
   return NULL_RTX;
 }
 
-/* Try to expand cmpstrn or cmpmem operation ICODE with the given operands.
-   ARG3_TYPE is the type of ARG3_RTX.  Return the result rtx on success,
-   otherwise return null.  */
-
-static rtx
-expand_cmpstrn_or_cmpmem (insn_code icode, rtx target, rtx arg1_rtx,
-			  rtx arg2_rtx, tree arg3_type, rtx arg3_rtx,
-			  HOST_WIDE_INT align)
-{
-  machine_mode insn_mode = insn_data[icode].operand[0].mode;
-
-  if (target && (!REG_P (target) || HARD_REGISTER_P (target)))
-    target = NULL_RTX;
-
-  struct expand_operand ops[5];
-  create_output_operand (&ops[0], target, insn_mode);
-  create_fixed_operand (&ops[1], arg1_rtx);
-  create_fixed_operand (&ops[2], arg2_rtx);
-  create_convert_operand_from (&ops[3], arg3_rtx, TYPE_MODE (arg3_type),
-			       TYPE_UNSIGNED (arg3_type));
-  create_integer_operand (&ops[4], align);
-  if (maybe_expand_insn (icode, 5, ops))
-    return ops[0].value;
-  return NULL_RTX;
-}
-
 /* Expand expression EXP, which is a call to the memcmp built-in function.
    Return NULL_RTX if we failed and the caller should emit a normal call,
-   otherwise try to get the result in TARGET, if convenient.  */
+   otherwise try to get the result in TARGET, if convenient.
+   RESULT_EQ is true if we can relax the returned value to be either zero
+   or nonzero, without caring about the sign.  */
 
 static rtx
-expand_builtin_memcmp (tree exp, rtx target)
+expand_builtin_memcmp (tree exp, rtx target, bool result_eq)
 {
   if (!validate_arglist (exp,
  			 POINTER_TYPE, POINTER_TYPE, INTEGER_TYPE, VOID_TYPE))
     return NULL_RTX;
 
-  /* Note: The cmpstrnsi pattern, if it exists, is not suitable for
-     implementing memcmp because it will stop if it encounters two
-     zero bytes.  */
-  insn_code icode = direct_optab_handler (cmpmem_optab, SImode);
-  if (icode == CODE_FOR_nothing)
-    return NULL_RTX;
-
   tree arg1 = CALL_EXPR_ARG (exp, 0);
   tree arg2 = CALL_EXPR_ARG (exp, 1);
   tree len = CALL_EXPR_ARG (exp, 2);
+  machine_mode mode = TYPE_MODE (TREE_TYPE (exp));
+  location_t loc = EXPR_LOCATION (exp);
 
   unsigned int arg1_align = get_pointer_alignment (arg1) / BITS_PER_UNIT;
   unsigned int arg2_align = get_pointer_alignment (arg2) / BITS_PER_UNIT;
@@ -3726,22 +3697,39 @@ expand_builtin_memcmp (tree exp, rtx tar
   if (arg1_align == 0 || arg2_align == 0)
     return NULL_RTX;
 
-  machine_mode mode = TYPE_MODE (TREE_TYPE (exp));
-  location_t loc = EXPR_LOCATION (exp);
   rtx arg1_rtx = get_memory_rtx (arg1, len);
   rtx arg2_rtx = get_memory_rtx (arg2, len);
-  rtx arg3_rtx = expand_normal (fold_convert_loc (loc, sizetype, len));
+  rtx len_rtx = expand_normal (fold_convert_loc (loc, sizetype, len));
 
   /* Set MEM_SIZE as appropriate.  */
-  if (CONST_INT_P (arg3_rtx))
+  if (CONST_INT_P (len_rtx))
     {
-      set_mem_size (arg1_rtx, INTVAL (arg3_rtx));
-      set_mem_size (arg2_rtx, INTVAL (arg3_rtx));
+      set_mem_size (arg1_rtx, INTVAL (len_rtx));
+      set_mem_size (arg2_rtx, INTVAL (len_rtx));
     }
 
-  rtx result = expand_cmpstrn_or_cmpmem (icode, target, arg1_rtx, arg2_rtx,
-					 TREE_TYPE (len), arg3_rtx,
-					 MIN (arg1_align, arg2_align));
+  by_pieces_constfn constfn = NULL;
+
+  const char *src_str = c_getstr (arg1);
+  if (src_str == NULL)
+    src_str = c_getstr (arg2);
+  else
+    std::swap (arg1_rtx, arg2_rtx);
+
+  /* If SRC is a string constant and the block comparison would be
+     done by pieces, we can avoid loading the string from memory
+     and instead use the computed constants.  */
+  if (src_str
+      && CONST_INT_P (len_rtx)
+      && (unsigned HOST_WIDE_INT) INTVAL (len_rtx) <= strlen (src_str) + 1)
+    constfn = builtin_memcpy_read_str;
+
+  rtx result = emit_block_cmp_hints (arg1_rtx, arg2_rtx, len_rtx,
+				     TREE_TYPE (len), target,
+				     result_eq, result_eq, constfn,
+				     CONST_CAST (char *, src_str));
+
+
   if (result)
     {
       /* Return the value in the proper mode for this function.  */
@@ -3757,19 +3745,6 @@ expand_builtin_memcmp (tree exp, rtx tar
       return convert_to_mode (mode, result, 0);
     }
 
-  result = target;
-  if (! (result != 0
-	 && REG_P (result) && GET_MODE (result) == mode
-	 && REGNO (result) >= FIRST_PSEUDO_REGISTER))
-    result = gen_reg_rtx (mode);
-
-  emit_library_call_value (memcmp_libfunc, result, LCT_PURE,
-			   TYPE_MODE (integer_type_node), 3,
-			   XEXP (arg1_rtx, 0), Pmode,
-			   XEXP (arg2_rtx, 0), Pmode,
-			   convert_to_mode (TYPE_MODE (sizetype), arg3_rtx,
-					    TYPE_UNSIGNED (sizetype)),
-			   TYPE_MODE (sizetype));
   return result;
 }
 
@@ -5997,9 +5972,16 @@ expand_builtin (tree exp, rtx target, rt
 
     case BUILT_IN_BCMP:
     case BUILT_IN_MEMCMP:
-      target = expand_builtin_memcmp (exp, target);
+    case BUILT_IN_MEMCMP_EQ:
+      target = expand_builtin_memcmp (exp, target, fcode == BUILT_IN_MEMCMP_EQ);
       if (target)
 	return target;
+      if (fcode == BUILT_IN_MEMCMP_EQ)
+	{
+	  exp = copy_node (exp);
+	  tree newdecl = builtin_decl_explicit (BUILT_IN_MEMCMP);
+	  TREE_OPERAND (exp, 1) = build_fold_addr_expr (newdecl);
+	}
       break;
 
     case BUILT_IN_SETJMP:
Index: gcc/builtins.def
===================================================================
--- gcc/builtins.def	(revision 235474)
+++ gcc/builtins.def	(working copy)
@@ -864,6 +864,10 @@ DEF_BUILTIN_STUB (BUILT_IN_STACK_SAVE, "
 DEF_BUILTIN_STUB (BUILT_IN_STACK_RESTORE, "__builtin_stack_restore")
 DEF_BUILTIN_STUB (BUILT_IN_ALLOCA_WITH_ALIGN, "__builtin_alloca_with_align")
 
+/* An internal version of memcmp, used when the result is only tested for
+   equality with zero.  */
+DEF_BUILTIN_STUB (BUILT_IN_MEMCMP_EQ, "__builtin_memcmp_eq")
+
 /* Object size checking builtins.  */
 DEF_GCC_BUILTIN	       (BUILT_IN_OBJECT_SIZE, "object_size", BT_FN_SIZE_CONST_PTR_INT, ATTR_PURE_NOTHROW_LEAF_LIST)
 DEF_EXT_LIB_BUILTIN_CHKP (BUILT_IN_MEMCPY_CHK, "__memcpy_chk", BT_FN_PTR_PTR_CONST_PTR_SIZE_SIZE, ATTR_RET1_NOTHROW_NONNULL_LEAF)
Index: gcc/defaults.h
===================================================================
--- gcc/defaults.h	(revision 235474)
+++ gcc/defaults.h	(working copy)
@@ -1039,6 +1039,11 @@ see the files COPYING3 and COPYING.RUNTI
 #define STORE_MAX_PIECES  MIN (MOVE_MAX_PIECES, 2 * sizeof (HOST_WIDE_INT))
 #endif
 
+/* Likewise for block comparisons.  */
+#ifndef COMPARE_MAX_PIECES
+#define COMPARE_MAX_PIECES  MOVE_MAX_PIECES
+#endif
+
 #ifndef MAX_MOVE_MAX
 #define MAX_MOVE_MAX MOVE_MAX
 #endif
Index: gcc/doc/tm.texi
===================================================================
--- gcc/doc/tm.texi	(revision 235484)
+++ gcc/doc/tm.texi	(working copy)
@@ -6311,8 +6311,9 @@ Both @var{size} and @var{alignment} are
 units.
 
 The parameter @var{op} is one of: @code{CLEAR_BY_PIECES},
-@code{MOVE_BY_PIECES}, @code{SET_BY_PIECES}, @code{STORE_BY_PIECES}.
-These describe the type of memory operation under consideration.
+@code{MOVE_BY_PIECES}, @code{SET_BY_PIECES}, @code{STORE_BY_PIECES} or
+@code{COMPARE_BY_PIECES}.  These describe the type of memory operation
+under consideration.
 
 The parameter @var{speed_p} is true if the code is currently being
 optimized for speed rather than size.
@@ -6329,11 +6330,31 @@ in code size, for example where the numb
 move would be greater than that of a library call.
 @end deftypefn
 
+@deftypefn {Target Hook} int TARGET_COMPARE_BY_PIECES_BRANCH_RATIO (machine_mode @var{mode})
+When expanding a block comparison in MODE, gcc can try to reduce the
+number of branches at the expense of more memory operations.  This hook
+allows the target to override the default choice.  It should return the
+factor by which branches should be reduced over the plain expansion with
+one comparison per @var{mode}-sized piece.
+@end deftypefn
+
 @defmac MOVE_MAX_PIECES
 A C expression used by @code{move_by_pieces} to determine the largest unit
 a load or store used to copy memory is.  Defaults to @code{MOVE_MAX}.
 @end defmac
 
+@defmac STORE_MAX_PIECES
+A C expression used by @code{store_by_pieces} to determine the largest unit
+a store used to memory is.  Defaults to @code{MOVE_MAX_PIECES}, or two times
+the size of @code{HOST_WIDE_INT}, whichever is smaller.
+@end defmac
+
+@defmac COMPARE_MAX_PIECES
+A C expression used by @code{compare_by_pieces} to determine the largest unit
+a load or store used to compare memory is.  Defaults to
+@code{MOVE_MAX_PIECES}.
+@end defmac
+
 @defmac CLEAR_RATIO (@var{speed})
 The threshold of number of scalar move insns, @emph{below} which a sequence
 of insns should be generated to clear memory instead of a string clear insn
Index: gcc/doc/tm.texi.in
===================================================================
--- gcc/doc/tm.texi.in	(revision 235484)
+++ gcc/doc/tm.texi.in	(working copy)
@@ -4653,11 +4653,25 @@ If you don't define this, a reasonable d
 
 @hook TARGET_USE_BY_PIECES_INFRASTRUCTURE_P
 
+@hook TARGET_COMPARE_BY_PIECES_BRANCH_RATIO
+
 @defmac MOVE_MAX_PIECES
 A C expression used by @code{move_by_pieces} to determine the largest unit
 a load or store used to copy memory is.  Defaults to @code{MOVE_MAX}.
 @end defmac
 
+@defmac STORE_MAX_PIECES
+A C expression used by @code{store_by_pieces} to determine the largest unit
+a store used to memory is.  Defaults to @code{MOVE_MAX_PIECES}, or two times
+the size of @code{HOST_WIDE_INT}, whichever is smaller.
+@end defmac
+
+@defmac COMPARE_MAX_PIECES
+A C expression used by @code{compare_by_pieces} to determine the largest unit
+a load or store used to compare memory is.  Defaults to
+@code{MOVE_MAX_PIECES}.
+@end defmac
+
 @defmac CLEAR_RATIO (@var{speed})
 The threshold of number of scalar move insns, @emph{below} which a sequence
 of insns should be generated to clear memory instead of a string clear insn
Index: gcc/expr.c
===================================================================
--- gcc/expr.c	(revision 235474)
+++ gcc/expr.c	(working copy)
@@ -70,23 +70,6 @@ along with GCC; see the file COPYING3.
    the same indirect address eventually.  */
 int cse_not_expected;
 
-/* This structure is used by move_by_pieces to describe the move to
-   be performed.  */
-struct move_by_pieces_d
-{
-  rtx to;
-  rtx to_addr;
-  int autinc_to;
-  int explicit_inc_to;
-  rtx from;
-  rtx from_addr;
-  int autinc_from;
-  int explicit_inc_from;
-  unsigned HOST_WIDE_INT len;
-  HOST_WIDE_INT offset;
-  int reverse;
-};
-
 /* This structure is used by store_by_pieces to describe the clear to
    be performed.  */
 
@@ -103,8 +86,6 @@ struct store_by_pieces_d
   int reverse;
 };
 
-static void move_by_pieces_1 (insn_gen_fn, machine_mode,
-			      struct move_by_pieces_d *);
 static bool block_move_libcall_safe_for_call_parm (void);
 static bool emit_block_move_via_movmem (rtx, rtx, rtx, unsigned, unsigned, HOST_WIDE_INT,
 					unsigned HOST_WIDE_INT, unsigned HOST_WIDE_INT,
@@ -769,177 +750,342 @@ widest_int_mode_for_size (unsigned int s
   return mode;
 }
 
+/* Determine whether an operation OP on LEN bytes with alignment ALIGN can
+   and should be performed piecewise.  */
+
+static bool
+can_do_by_pieces (unsigned HOST_WIDE_INT len, unsigned int align,
+		  enum by_pieces_operation op)
+{
+  return targetm.use_by_pieces_infrastructure_p (len, align, op,
+						 optimize_insn_for_speed_p ());
+}
+
 /* Determine whether the LEN bytes can be moved by using several move
    instructions.  Return nonzero if a call to move_by_pieces should
    succeed.  */
 
-int
-can_move_by_pieces (unsigned HOST_WIDE_INT len,
-		    unsigned int align)
+bool
+can_move_by_pieces (unsigned HOST_WIDE_INT len, unsigned int align)
 {
-  return targetm.use_by_pieces_infrastructure_p (len, align, MOVE_BY_PIECES,
-						 optimize_insn_for_speed_p ());
+  return can_do_by_pieces (len, align, MOVE_BY_PIECES);
 }
 
-/* Generate several move instructions to copy LEN bytes from block FROM to
-   block TO.  (These are MEM rtx's with BLKmode).
+/* Used when performing piecewise block operations, holds information
+   about one of the memory objects involved.  The member functions
+   can be used to generate code for loading from the object and
+   updating the address when iterating.  */
 
-   If PUSH_ROUNDING is defined and TO is NULL, emit_single_push_insn is
-   used to push FROM to the stack.
+struct pieces_addr
+{
+  /* The object being referenced, a MEM.  Can be NULL_RTX to indicate
+     stack pushes.  */
+  rtx m_obj;
+  /* The address of the object.  Can differ from that seen in the
+     MEM rtx if we copied the address to a register.  */
+  rtx m_addr;
+  /* Nonzero if the address on the object has an autoincrement already,
+     signifies whether that was an increment or decrement.  */
+  signed char m_addr_inc;
+  /* Nonzero if we intend to use autoinc without the address already
+     having autoinc form.  We will insert add insns around each memory
+     reference, expecting later passes to form autoinc addressing modes.
+     The only supported options are predecrement and postincrement.  */
+  signed char m_explicit_inc;
+  /* True if we have either of the two possible cases of using
+     autoincrement.  */
+  bool m_auto;
+  /* True if this is an address to be used for load operations rather
+     than stores.  */
+  bool m_is_load;
 
-   ALIGN is maximum stack alignment we can assume.
+  /* Optionally, a function to obtain constants for any given offset into
+     the objects, and data associated with it.  */
+  by_pieces_constfn m_constfn;
+  void *m_cfndata;
+public:
+  pieces_addr (rtx, bool, by_pieces_constfn, void *);
+  rtx adjust (machine_mode, HOST_WIDE_INT);
+  void increment_address (HOST_WIDE_INT);
+  void maybe_predec (HOST_WIDE_INT);
+  void maybe_postinc (HOST_WIDE_INT);
+  void decide_autoinc (machine_mode, bool, HOST_WIDE_INT);
+  int get_addr_inc ()
+  {
+    return m_addr_inc;
+  }
+};
 
-   If ENDP is 0 return to, if ENDP is 1 return memory at the end ala
-   mempcpy, and if ENDP is 2 return memory the end minus one byte ala
-   stpcpy.  */
+/* Initialize a pieces_addr structure from an object OBJ.  If this is
+   NULL, we assume that it is intended to describe a stack push.
+   IS_LOAD is true if the operation to be performed on this object is
+   a load rather than a store.  If it is, then the optional CONSTFN
+   and its associated CFNDATA can be used in place of the memory
+   load.  */
 
-rtx
-move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
-		unsigned int align, int endp)
+pieces_addr::pieces_addr (rtx obj, bool is_load, by_pieces_constfn constfn,
+			  void *cfndata)
+  : m_obj (obj), m_is_load (is_load), m_constfn (constfn), m_cfndata (cfndata)
 {
-  struct move_by_pieces_d data;
-  machine_mode to_addr_mode;
-  machine_mode from_addr_mode = get_address_mode (from);
-  rtx to_addr, from_addr = XEXP (from, 0);
-  unsigned int max_size = MOVE_MAX_PIECES + 1;
-  enum insn_code icode;
-
-  align = MIN (to ? MEM_ALIGN (to) : align, MEM_ALIGN (from));
-
-  data.offset = 0;
-  data.from_addr = from_addr;
-  if (to)
+  if (obj)
     {
-      to_addr_mode = get_address_mode (to);
-      to_addr = XEXP (to, 0);
-      data.to = to;
-      data.autinc_to
-	= (GET_CODE (to_addr) == PRE_INC || GET_CODE (to_addr) == PRE_DEC
-	   || GET_CODE (to_addr) == POST_INC || GET_CODE (to_addr) == POST_DEC);
-      data.reverse
-	= (GET_CODE (to_addr) == PRE_DEC || GET_CODE (to_addr) == POST_DEC);
+      rtx addr = XEXP (obj, 0);
+      rtx_code code = GET_CODE (addr);
+      m_addr = addr;
+      bool dec = code == PRE_DEC || code == POST_DEC;
+      bool inc = code == PRE_INC || code == POST_INC;
+      m_auto = inc || dec;
+      if (m_auto)
+	m_addr_inc = dec ? -1 : 1;
+      else
+	m_addr_inc = 0;
+      /* While we have always looked for these codes here, the code
+	 implementing the memory operation has never handled them.
+	 Support could be added later if necessary or beneficial.  */
+      gcc_assert (code != PRE_INC && code != POST_DEC);
     }
   else
     {
-      to_addr_mode = VOIDmode;
-      to_addr = NULL_RTX;
-      data.to = NULL_RTX;
-      data.autinc_to = 1;
+      m_addr = NULL_RTX;
+      m_auto = true;
       if (STACK_GROWS_DOWNWARD)
-	data.reverse = 1;
+	m_addr_inc = -1;
       else
-	data.reverse = 0;
+	m_addr_inc = 1;
     }
-  data.to_addr = to_addr;
-  data.from = from;
-  data.autinc_from
-    = (GET_CODE (from_addr) == PRE_INC || GET_CODE (from_addr) == PRE_DEC
-       || GET_CODE (from_addr) == POST_INC
-       || GET_CODE (from_addr) == POST_DEC);
+  m_explicit_inc = 0;
+  if (constfn)
+    gcc_assert (is_load);
+}
 
-  data.explicit_inc_from = 0;
-  data.explicit_inc_to = 0;
-  if (data.reverse) data.offset = len;
-  data.len = len;
+/* Decide whether to use autoinc for an address involved in a memory op.
+   MODE is the mode of the accesses, REVERSE is true if we've decided to
+   perform the operation starting from the end, and LEN is the length of
+   the operation.  Don't override an earlier decision to set m_auto.  */
+
+void
+pieces_addr::decide_autoinc (machine_mode ARG_UNUSED (mode), bool reverse,
+			     HOST_WIDE_INT len)
+{
+  if (m_auto)
+    return;
+
+  bool use_predec = (m_is_load
+		     ? USE_LOAD_PRE_DECREMENT (mode)
+		     : USE_STORE_PRE_DECREMENT (mode));
+  bool use_postinc = (m_is_load
+		      ? USE_LOAD_POST_INCREMENT (mode)
+		      : USE_STORE_POST_INCREMENT (mode));
+  machine_mode addr_mode = get_address_mode (m_obj);
+
+  if (use_predec && reverse)
+    {
+      m_addr = copy_to_mode_reg (addr_mode,
+				 plus_constant (addr_mode,
+						m_addr, len));
+      m_auto = true;
+      m_explicit_inc = -1;
+    }
+  else if (use_postinc && !reverse)
+    {
+      m_addr = copy_to_mode_reg (addr_mode, m_addr);
+      m_auto = true;
+      m_explicit_inc = 1;
+    }
+  else if (CONSTANT_P (m_addr))
+    m_addr = copy_to_mode_reg (addr_mode, m_addr);
+}
+
+/* Adjust the address to refer to the data at OFFSET in MODE.  If we
+   are using autoincrement for this address, we don't add the offset,
+   but we still modify the MEM's properties.  */
+
+rtx
+pieces_addr::adjust (machine_mode mode, HOST_WIDE_INT offset)
+{
+  if (m_obj == NULL_RTX)
+    return NULL_RTX;
+  if (m_constfn)
+    return m_constfn (m_cfndata, offset, mode);
+  if (m_auto)
+    return adjust_automodify_address (m_obj, mode, m_addr, offset);
+  else
+    return adjust_address (m_obj, mode, offset);
+}
+
+/* Emit an add instruction to increment the address by SIZE.  */
+
+void
+pieces_addr::increment_address (HOST_WIDE_INT size)
+{
+  rtx amount = gen_int_mode (size, GET_MODE (m_addr));
+  emit_insn (gen_add2_insn (m_addr, amount));
+}
+
+/* If we are supposed to decrement the address after each access, emit code
+   to do so now.  */
+
+void
+pieces_addr::maybe_predec (HOST_WIDE_INT size)
+{
+  if (m_explicit_inc >= 0)
+    return;
+  gcc_assert (HAVE_PRE_DECREMENT);
+  increment_address (size);
+}
+
+/* If we are supposed to increment the address after each access, emit code
+   to do so now.  */
+
+void
+pieces_addr::maybe_postinc (HOST_WIDE_INT size)
+{
+  if (m_explicit_inc <= 0)
+    return;
+  gcc_assert (HAVE_POST_INCREMENT);
+  increment_address (size);
+}
+
+/* This structure is used by do_op_by_pieces to describe the operation
+   to be performed.  */
+
+struct op_by_pieces_d
+{
+  pieces_addr m_to, m_from;
+  unsigned HOST_WIDE_INT m_len;
+  HOST_WIDE_INT m_offset;
+  unsigned int m_align;
+  unsigned int m_max_size;
+  bool m_reverse;
+
+  op_by_pieces_d (rtx, bool, rtx, bool, by_pieces_constfn, void *,
+		  unsigned HOST_WIDE_INT, unsigned int);
+  void iterate (void (*)(rtx, rtx, machine_mode, void *),
+		     machine_mode, void *);
+};
+
+/* The constructor for an op_by_pieces_d structure.  We require two
+   objects named TO and FROM, which are identified as loads or stores
+   by TO_LOAD and FROM_LOAD.  If FROM is a load, the optional FROM_CFN
+   and its associated FROM_CFN_DATA can be used to replace loads with
+   constant values.  LEN describes the length of the operation.  */
+op_by_pieces_d::op_by_pieces_d (rtx to, bool to_load,
+				rtx from, bool from_load,
+				by_pieces_constfn from_cfn,
+				void *from_cfn_data,
+				unsigned HOST_WIDE_INT len,
+				unsigned int align)
+  : m_to (to, to_load, NULL, NULL),
+    m_from (from, from_load, from_cfn, from_cfn_data),
+    m_len (len), m_max_size (MOVE_MAX_PIECES + 1)
+{
+  int toi = m_to.get_addr_inc ();
+  int fromi = m_from.get_addr_inc ();
+  if (toi >= 0 && fromi >= 0)
+    m_reverse = false;
+  else if (toi <= 0 && fromi <= 0)
+    m_reverse = true;
+  else
+    gcc_unreachable ();
+
+  m_offset = m_reverse ? len : 0;
+  align = MIN (to ? MEM_ALIGN (to) : align, MEM_ALIGN (from));
 
   /* If copying requires more than two move insns,
      copy addresses to registers (to make displacements shorter)
      and use post-increment if available.  */
-  if (!(data.autinc_from && data.autinc_to)
-      && move_by_pieces_ninsns (len, align, max_size) > 2)
+  if (move_by_pieces_ninsns (len, align, m_max_size) > 2)
     {
-      /* Find the mode of the largest move...
-	 MODE might not be used depending on the definitions of the
-	 USE_* macros below.  */
-      machine_mode mode ATTRIBUTE_UNUSED
-	= widest_int_mode_for_size (max_size);
+      /* Find the mode of the largest comparison.  */
+      machine_mode mode = widest_int_mode_for_size (m_max_size);
 
-      if (USE_LOAD_PRE_DECREMENT (mode) && data.reverse && ! data.autinc_from)
-	{
-	  data.from_addr = copy_to_mode_reg (from_addr_mode,
-					     plus_constant (from_addr_mode,
-							    from_addr, len));
-	  data.autinc_from = 1;
-	  data.explicit_inc_from = -1;
-	}
-      if (USE_LOAD_POST_INCREMENT (mode) && ! data.autinc_from)
-	{
-	  data.from_addr = copy_to_mode_reg (from_addr_mode, from_addr);
-	  data.autinc_from = 1;
-	  data.explicit_inc_from = 1;
-	}
-      if (!data.autinc_from && CONSTANT_P (from_addr))
-	data.from_addr = copy_to_mode_reg (from_addr_mode, from_addr);
-      if (USE_STORE_PRE_DECREMENT (mode) && data.reverse && ! data.autinc_to)
-	{
-	  data.to_addr = copy_to_mode_reg (to_addr_mode,
-					   plus_constant (to_addr_mode,
-							  to_addr, len));
-	  data.autinc_to = 1;
-	  data.explicit_inc_to = -1;
-	}
-      if (USE_STORE_POST_INCREMENT (mode) && ! data.reverse && ! data.autinc_to)
-	{
-	  data.to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
-	  data.autinc_to = 1;
-	  data.explicit_inc_to = 1;
-	}
-      if (!data.autinc_to && CONSTANT_P (to_addr))
-	data.to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
+      m_from.decide_autoinc (mode, m_reverse, len);
+      m_to.decide_autoinc (mode, m_reverse, len);
     }
 
   align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
+  m_align = align;
+}
+
+/* A callback used when iterating for a move_by_pieces operation.
+   OP0 and OP1 are the destination and source, to be copied in MODE.
+   If OP0 is NULL, this means we should generate a push; otherwise
+   EXTRA_DATA holds a pointer to the insn gen function that should be
+   used to generate the move.  */
+
+static void
+move_by_pieces_genfn (rtx op0, rtx op1, machine_mode mode, void *extra_data)
+{
+  insn_gen_fn genfun = *(insn_gen_fn *)extra_data;
+#ifdef PUSH_ROUNDING
+  if (op0 == NULL_RTX)
+    {
+      emit_single_push_insn (mode, op1, NULL);
+      return;
+    }
+#endif
+  emit_insn (genfun (op0, op1));
+}
+
+/* Generate several move instructions to copy LEN bytes from block FROM to
+   block TO.  (These are MEM rtx's with BLKmode).
+
+   If PUSH_ROUNDING is defined and TO is NULL, emit_single_push_insn is
+   used to push FROM to the stack.
+
+   ALIGN is maximum stack alignment we can assume.
+
+   If ENDP is 0 return to, if ENDP is 1 return memory at the end ala
+   mempcpy, and if ENDP is 2 return memory the end minus one byte ala
+   stpcpy.  */
+
+rtx
+move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
+		unsigned int align, int endp)
+{
+  enum insn_code icode;
+
+#ifndef PUSH_ROUNDING
+  gcc_unreachable ();
+#endif
+
+  op_by_pieces_d data (to, false, from, true, NULL, NULL, len, align);
 
   /* First move what we can in the largest integer mode, then go to
      successively smaller modes.  */
 
-  while (max_size > 1 && data.len > 0)
+  while (data.m_max_size > 1 && data.m_len > 0)
     {
-      machine_mode mode = widest_int_mode_for_size (max_size);
+      machine_mode mode = widest_int_mode_for_size (data.m_max_size);
 
       if (mode == VOIDmode)
 	break;
 
       icode = optab_handler (mov_optab, mode);
-      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
-	move_by_pieces_1 (GEN_FCN (icode), mode, &data);
+      insn_gen_fn fn = GEN_FCN (icode);
+      if (icode != CODE_FOR_nothing
+	  && data.m_align >= GET_MODE_ALIGNMENT (mode))
+	data.iterate (move_by_pieces_genfn, mode, (void *)&fn);
 
-      max_size = GET_MODE_SIZE (mode);
+      data.m_max_size = GET_MODE_SIZE (mode);
     }
 
   /* The code above should have handled everything.  */
-  gcc_assert (!data.len);
+  gcc_assert (!data.m_len);
 
   if (endp)
     {
-      rtx to1;
-
-      gcc_assert (!data.reverse);
-      if (data.autinc_to)
-	{
-	  if (endp == 2)
-	    {
-	      if (HAVE_POST_INCREMENT && data.explicit_inc_to > 0)
-		emit_insn (gen_add2_insn (data.to_addr, constm1_rtx));
-	      else
-		data.to_addr = copy_to_mode_reg (to_addr_mode,
-						 plus_constant (to_addr_mode,
-								data.to_addr,
-								-1));
-	    }
-	  to1 = adjust_automodify_address (data.to, QImode, data.to_addr,
-					   data.offset);
-	}
-      else
+      gcc_assert (!data.m_reverse);
+      if (endp == 2)
 	{
-	  if (endp == 2)
-	    --data.offset;
-	  to1 = adjust_address (data.to, QImode, data.offset);
+	  if (data.m_to.m_auto)
+	    data.m_to.increment_address (-1);
+	  --data.m_offset;
 	}
-      return to1;
+      return data.m_to.adjust (QImode, data.m_offset);
     }
   else
-    return data.to;
+    return to;
 }
 
 /* Return number of insns required to move L bytes by pieces.
@@ -974,71 +1120,148 @@ move_by_pieces_ninsns (unsigned HOST_WID
   return n_insns;
 }
 
-/* Subroutine of move_by_pieces.  Move as many bytes as appropriate
-   with move instructions for mode MODE.  GENFUN is the gen_... function
-   to make a move insn for that mode.  DATA has all the other info.  */
+/* Perform one step of a piecewise memory operation, using accesses of MODE
+   until the remaining size is smaller than this mode.  For every iteration,
+   call GENFUN with the two operands and the EXTRA_DATA.  */
 
-static void
-move_by_pieces_1 (insn_gen_fn genfun, machine_mode mode,
-		  struct move_by_pieces_d *data)
+void
+op_by_pieces_d::iterate (void (*genfun)(rtx, rtx, machine_mode, void *),
+			 machine_mode mode, void *extra_data)
 {
   unsigned int size = GET_MODE_SIZE (mode);
   rtx to1 = NULL_RTX, from1;
 
-  while (data->len >= size)
+  while (m_len >= size)
     {
-      if (data->reverse)
-	data->offset -= size;
+      if (m_reverse)
+	m_offset -= size;
 
-      if (data->to)
-	{
-	  if (data->autinc_to)
-	    to1 = adjust_automodify_address (data->to, mode, data->to_addr,
-					     data->offset);
-	  else
-	    to1 = adjust_address (data->to, mode, data->offset);
-	}
+      to1 = m_to.adjust (mode, m_offset);
+      from1 = m_from.adjust (mode, m_offset);
 
-      if (data->autinc_from)
-	from1 = adjust_automodify_address (data->from, mode, data->from_addr,
-					   data->offset);
-      else
-	from1 = adjust_address (data->from, mode, data->offset);
+      m_to.maybe_predec (-(HOST_WIDE_INT)size);
+      m_from.maybe_predec (-(HOST_WIDE_INT)size);
 
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
-	emit_insn (gen_add2_insn (data->to_addr,
-				  gen_int_mode (-(HOST_WIDE_INT) size,
-						GET_MODE (data->to_addr))));
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_from < 0)
-	emit_insn (gen_add2_insn (data->from_addr,
-				  gen_int_mode (-(HOST_WIDE_INT) size,
-						GET_MODE (data->from_addr))));
+      (*genfun) (to1, from1, mode, extra_data);
 
-      if (data->to)
-	emit_insn ((*genfun) (to1, from1));
-      else
-	{
-#ifdef PUSH_ROUNDING
-	  emit_single_push_insn (mode, from1, NULL);
-#else
-	  gcc_unreachable ();
-#endif
-	}
+      m_to.maybe_postinc (size);
+      m_from.maybe_postinc (size);
 
-      if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
-	emit_insn (gen_add2_insn (data->to_addr,
-				  gen_int_mode (size,
-						GET_MODE (data->to_addr))));
-      if (HAVE_POST_INCREMENT && data->explicit_inc_from > 0)
-	emit_insn (gen_add2_insn (data->from_addr,
-				  gen_int_mode (size,
-						GET_MODE (data->from_addr))));
+      if (!m_reverse)
+	m_offset += size;
 
-      if (! data->reverse)
-	data->offset += size;
+      m_len -= size;
+    }
+}
 
-      data->len -= size;
+/* Context used by compare_by_pieces_genfn.  It stores the fail label
+   to jump to in case of miscomparison, and for branch ratios greater than 1,
+   it stores an accumulator and the current and maximum counts before
+   emitting another branch.  */
+
+struct compare_by_pieces_data
+{
+  rtx_code_label *fail_label;
+  rtx accumulator;
+  int count, batch;
+};
+
+/* A callback used when iterating for a compare_by_pieces_operation.
+   OP0 and OP1 are the values that have been loaded and should be
+   compared in MODE.  DATA holds a pointer to the compare_by_pieces_data
+   context structure.  */
+
+static void
+compare_by_pieces_genfn (rtx op0, rtx op1, machine_mode mode, void *data)
+{
+  compare_by_pieces_data *extra = (compare_by_pieces_data *)data;
+  if (extra->batch > 1)
+    {
+      rtx temp = expand_binop (mode, sub_optab, op0, op1, NULL_RTX,
+			       true, OPTAB_LIB_WIDEN);
+      if (extra->count != 0)
+	temp = expand_binop (mode, ior_optab, extra->accumulator, temp, temp,
+			     true, OPTAB_LIB_WIDEN);
+      extra->accumulator = temp;
+
+      if (++extra->count < extra->batch)
+	return;
+
+      extra->count = 0;
+      op0 = extra->accumulator;
+      op1 = const0_rtx;
+      extra->accumulator = NULL_RTX;
     }
+  do_compare_rtx_and_jump (op0, op1, NE, true, mode, NULL_RTX, NULL,
+			   extra->fail_label, -1);
+}
+
+/* Generate several move instructions to compare LEN bytes from blocks
+   ARG0 and ARG1.  (These are MEM rtx's with BLKmode).
+
+   ALIGN is the maximum alignment we can assume they have.
+
+   Optionally, the caller can pass a constfn and associated data in A1_CFN
+   and A1_CFN_DATA, describing that the second operand being compared is a
+   known constant and how to obtain its data.  */
+
+static rtx
+compare_by_pieces (rtx arg0, rtx arg1, unsigned HOST_WIDE_INT len,
+		   rtx target, unsigned int align,
+		   by_pieces_constfn a1_cfn, void *a1_cfn_data)
+{
+  enum insn_code icode;
+  rtx_code_label *fail_label = gen_label_rtx ();
+  rtx_code_label *end_label = gen_label_rtx ();
+
+  compare_by_pieces_data extra;
+  extra.fail_label = fail_label;
+  if (target == NULL_RTX
+      || !REG_P (target) || REGNO (target) < FIRST_PSEUDO_REGISTER)
+    target = gen_reg_rtx (TYPE_MODE (integer_type_node));
+
+  op_by_pieces_d data (arg0, true, arg1, true, a1_cfn, a1_cfn_data, len,
+		       align);
+
+  /* First move what we can in the largest integer mode, then go to
+     successively smaller modes.  */
+
+  while (data.m_max_size > 1 && data.m_len > 0)
+    {
+      machine_mode mode = widest_int_mode_for_size (data.m_max_size);
+
+      if (mode == VOIDmode)
+	break;
+
+      icode = optab_handler (mov_optab, mode);
+      if (icode != CODE_FOR_nothing
+	  && data.m_align >= GET_MODE_ALIGNMENT (mode))
+	{
+	  extra.batch = targetm.compare_by_pieces_branch_ratio (mode);
+	  extra.accumulator = NULL_RTX;
+	  extra.count = 0;
+	  data.iterate (compare_by_pieces_genfn, mode, (void *)&extra);
+	  if (extra.accumulator != NULL_RTX)
+	    do_compare_rtx_and_jump (extra.accumulator, const0_rtx, NE, true,
+				     mode, NULL_RTX, NULL, extra.fail_label,
+				     -1);
+	}
+
+      data.m_max_size = GET_MODE_SIZE (mode);
+    }
+  emit_move_insn (target, const0_rtx);
+  emit_jump (end_label);
+  emit_barrier ();
+  emit_label (fail_label);
+  emit_move_insn (target, const1_rtx);
+  emit_label (end_label);
+
+  /* The code above should have handled everything.  */
+  gcc_assert (!data.m_len);
+  return target;
 }
 \f
 /* Emit code to move a block Y to a block X.  This may be done with
@@ -1068,8 +1291,7 @@ emit_block_move_hints (rtx x, rtx y, rtx
   unsigned int align;
 
   gcc_assert (size);
-  if (CONST_INT_P (size)
-      && INTVAL (size) == 0)
+  if (CONST_INT_P (size) && INTVAL (size) == 0)
     return 0;
 
   switch (method)
@@ -1463,6 +1685,116 @@ emit_block_move_via_loop (rtx x, rtx y,
   emit_cmp_and_jump_insns (iter, size, LT, NULL_RTX, iter_mode,
 			   true, top_label, REG_BR_PROB_BASE * 90 / 100);
 }
+
+/* Try to expand cmpstrn or cmpmem operation ICODE with the given operands.
+   ARG3_TYPE is the type of ARG3_RTX.  Return the result rtx on success,
+   otherwise return null.  */
+
+rtx
+expand_cmpstrn_or_cmpmem (insn_code icode, rtx target, rtx arg1_rtx,
+			  rtx arg2_rtx, tree arg3_type, rtx arg3_rtx,
+			  HOST_WIDE_INT align)
+{
+  machine_mode insn_mode = insn_data[icode].operand[0].mode;
+
+  if (target && (!REG_P (target) || HARD_REGISTER_P (target)))
+    target = NULL_RTX;
+
+  struct expand_operand ops[5];
+  create_output_operand (&ops[0], target, insn_mode);
+  create_fixed_operand (&ops[1], arg1_rtx);
+  create_fixed_operand (&ops[2], arg2_rtx);
+  create_convert_operand_from (&ops[3], arg3_rtx, TYPE_MODE (arg3_type),
+			       TYPE_UNSIGNED (arg3_type));
+  create_integer_operand (&ops[4], align);
+  if (maybe_expand_insn (icode, 5, ops))
+    return ops[0].value;
+  return NULL_RTX;
+}
+
+static rtx
+emit_block_cmp_via_cmpmem (rtx x, rtx y, rtx len, tree len_type, rtx target,
+			   unsigned align)
+{
+  /* Note: The cmpstrnsi pattern, if it exists, is not suitable for
+     implementing memcmp because it will stop if it encounters two
+     zero bytes.  */
+  insn_code icode = direct_optab_handler (cmpmem_optab, SImode);
+
+  if (icode == CODE_FOR_nothing)
+    return NULL_RTX;
+
+  return expand_cmpstrn_or_cmpmem (icode, target, x, y,
+				   len_type, len, align);
+}
+
+/* Emit code to compare a block Y to a block X.  This may be done with
+   string-compare instructions, with multiple scalar instructions,
+   or with a library call.
+
+   Both X and Y must be MEM rtx's.  LEN is an rtx that says how long
+   they are.  LEN_TYPE is the type of the expression that was used to
+   calculate it.
+
+   If EQUALITY_ONLY is true, it means we don't have to return the tri-state
+   value of a normal memcmp call, instead we can just compare for equality.
+   If FORCE_LIBCALL is true, we should emit a call to memcmp rather than
+   returning NULL_RTX.
+
+   Optionally, the caller can pass a constfn and associated data in Y_CFN
+   and Y_CFN_DATA, describing that the second operand being compared is a
+   known constant and how to obtain its data.
+   Return the result of the comparison, or NULL_RTX if we failed to
+   perform the operation.  */
+
+rtx
+emit_block_cmp_hints (rtx x, rtx y, rtx len, tree len_type, rtx target,
+		      bool equality_only, bool force_libcall,
+		      by_pieces_constfn y_cfn, void *y_cfndata)
+{
+  rtx result = 0;
+
+  if (CONST_INT_P (len) && INTVAL (len) == 0)
+    return const0_rtx;
+
+  gcc_assert (MEM_P (x) && MEM_P (y));
+  unsigned int align = MIN (MEM_ALIGN (x), MEM_ALIGN (y));
+  gcc_assert (align >= BITS_PER_UNIT);
+
+  x = adjust_address (x, BLKmode, 0);
+  y = adjust_address (y, BLKmode, 0);
+
+  if (equality_only
+      && CONST_INT_P (len)
+      && can_do_by_pieces (INTVAL (len), align, COMPARE_BY_PIECES))
+    result = compare_by_pieces (x, y, INTVAL (len), target, align,
+				y_cfn, y_cfndata);
+  else if ((result = emit_block_cmp_via_cmpmem (x, y, len, len_type,
+						target, align)) != NULL_RTX)
+    ;
+  else if (force_libcall)
+    {
+      gcc_assert (ADDR_SPACE_GENERIC_P (MEM_ADDR_SPACE (x)));
+      gcc_assert (ADDR_SPACE_GENERIC_P (MEM_ADDR_SPACE (y)));
+
+      result = target;
+      machine_mode result_mode = TYPE_MODE (integer_type_node);
+      if (result == NULL_RTX
+	  || !REG_P (result)
+	  || GET_MODE (result) != result_mode
+	  || REGNO (result) < FIRST_PSEUDO_REGISTER)
+	result = gen_reg_rtx (result_mode);
+
+      machine_mode size_mode = TYPE_MODE (sizetype);
+      len = convert_to_mode (size_mode, len, TYPE_UNSIGNED (sizetype));
+      emit_library_call_value (memcmp_libfunc, result, LCT_PURE,
+			       result_mode, 3,
+			       XEXP (x, 0), Pmode, XEXP (y, 0), Pmode,
+			       len, size_mode);
+    }
+
+  return result;
+}
 \f
 /* Copy all or part of a value X into registers starting at REGNO.
    The number of registers to be filled is NREGS.  */
Index: gcc/expr.h
===================================================================
--- gcc/expr.h	(revision 235474)
+++ gcc/expr.h	(working copy)
@@ -82,6 +82,8 @@ enum block_op_methods
   BLOCK_OP_TAILCALL
 };
 
+typedef rtx (*by_pieces_constfn) (void *, HOST_WIDE_INT, machine_mode);
+
 extern GTY(()) tree block_clear_fn;
 extern void init_block_move_fn (const char *);
 extern void init_block_clear_fn (const char *);
@@ -93,6 +95,8 @@ extern rtx emit_block_move_hints (rtx, r
 				  unsigned HOST_WIDE_INT,
 				  unsigned HOST_WIDE_INT,
 				  unsigned HOST_WIDE_INT);
+extern rtx emit_block_cmp_hints (rtx, rtx, rtx, tree, rtx, bool, bool,
+				 by_pieces_constfn, void *);
 extern bool emit_storent_insn (rtx to, rtx from);
 
 /* Copy all or part of a value X into registers starting at REGNO.
@@ -157,6 +161,11 @@ extern void use_regs (rtx *, int, int);
 /* Mark a PARALLEL as holding a parameter for the next CALL_INSN.  */
 extern void use_group_regs (rtx *, rtx);
 
+#ifdef GCC_INSN_CODES_H
+extern rtx expand_cmpstrn_or_cmpmem (insn_code, rtx, rtx, rtx, tree, rtx,
+				     HOST_WIDE_INT);
+#endif
+
 /* Write zeros through the storage of OBJECT.
    If OBJECT has BLKmode, SIZE is its length in bytes.  */
 extern rtx clear_storage (rtx, rtx, enum block_op_methods);
@@ -187,8 +196,7 @@ extern unsigned HOST_WIDE_INT move_by_pi
    MEMSETP is true if this is a real memset/bzero, not a copy
    of a const string.  */
 extern int can_store_by_pieces (unsigned HOST_WIDE_INT,
-				rtx (*) (void *, HOST_WIDE_INT,
-					 machine_mode),
+				by_pieces_constfn,
 				void *, unsigned int, bool);
 
 /* Generate several move instructions to store LEN bytes generated by
@@ -197,8 +205,7 @@ extern int can_store_by_pieces (unsigned
    ALIGN is maximum alignment we can assume.
    MEMSETP is true if this is a real memset/bzero, not a copy.
    Returns TO + LEN.  */
-extern rtx store_by_pieces (rtx, unsigned HOST_WIDE_INT,
-			    rtx (*) (void *, HOST_WIDE_INT, machine_mode),
+extern rtx store_by_pieces (rtx, unsigned HOST_WIDE_INT, by_pieces_constfn,
 			    void *, unsigned int, bool, int);
 
 /* Emit insns to set X from Y.  */
@@ -279,7 +286,7 @@ rtx get_personality_function (tree);
 /* Determine whether the LEN bytes can be moved by using several move
    instructions.  Return nonzero if a call to move_by_pieces should
    succeed.  */
-extern int can_move_by_pieces (unsigned HOST_WIDE_INT, unsigned int);
+extern bool can_move_by_pieces (unsigned HOST_WIDE_INT, unsigned int);
 
 extern unsigned HOST_WIDE_INT highest_pow2_factor (const_tree);
 
Index: gcc/target.def
===================================================================
--- gcc/target.def	(revision 235474)
+++ gcc/target.def	(working copy)
@@ -3397,8 +3397,9 @@ Both @var{size} and @var{alignment} are
 units.\n\
 \n\
 The parameter @var{op} is one of: @code{CLEAR_BY_PIECES},\n\
-@code{MOVE_BY_PIECES}, @code{SET_BY_PIECES}, @code{STORE_BY_PIECES}.\n\
-These describe the type of memory operation under consideration.\n\
+@code{MOVE_BY_PIECES}, @code{SET_BY_PIECES}, @code{STORE_BY_PIECES} or\n\
+@code{COMPARE_BY_PIECES}.  These describe the type of memory operation\n\
+under consideration.\n\
 \n\
 The parameter @var{speed_p} is true if the code is currently being\n\
 optimized for speed rather than size.\n\
@@ -3418,6 +3419,16 @@ move would be greater than that of a lib
  default_use_by_pieces_infrastructure_p)
 
 DEFHOOK
+(compare_by_pieces_branch_ratio,
+ "When expanding a block comparison in MODE, gcc can try to reduce the\n\
+number of branches at the expense of more memory operations.  This hook\n\
+allows the target to override the default choice.  It should return the\n\
+factor by which branches should be reduced over the plain expansion with\n\
+one comparison per @var{mode}-sized piece.",
+ int, (machine_mode mode),
+ default_compare_by_pieces_branch_ratio)
+
+DEFHOOK
 (optab_supported_p,
  "Return true if the optimizers should use optab @var{op} with\n\
 modes @var{mode1} and @var{mode2} for optimization type @var{opt_type}.\n\
Index: gcc/target.h
===================================================================
--- gcc/target.h	(revision 235474)
+++ gcc/target.h	(working copy)
@@ -79,14 +79,18 @@ enum print_switch_type
 };
 
 /* Types of memory operation understood by the "by_pieces" infrastructure.
-   Used by the TARGET_USE_BY_PIECES_INFRASTRUCTURE_P target hook.  */
+   Used by the TARGET_USE_BY_PIECES_INFRASTRUCTURE_P target hook and
+   internally by the functions in expr.c.  */
 
 enum by_pieces_operation
 {
   CLEAR_BY_PIECES,
   MOVE_BY_PIECES,
   SET_BY_PIECES,
-  STORE_BY_PIECES
+  STORE_BY_PIECES,
+  COMPARE_BY_PIECES,
+  /* Used internally only, never passed to the target hook.  */
+  PUSH_BY_PIECES
 };
 
 typedef int (* print_switch_fn_type) (print_switch_type, const char *);
Index: gcc/targhooks.c
===================================================================
--- gcc/targhooks.c	(revision 235474)
+++ gcc/targhooks.c	(working copy)
@@ -1482,27 +1482,43 @@ default_use_by_pieces_infrastructure_p (
 
   switch (op)
     {
-      case CLEAR_BY_PIECES:
-	max_size = STORE_MAX_PIECES;
-	ratio = CLEAR_RATIO (speed_p);
-	break;
-      case MOVE_BY_PIECES:
-	max_size = MOVE_MAX_PIECES;
-	ratio = get_move_ratio (speed_p);
-	break;
-      case SET_BY_PIECES:
-	max_size = STORE_MAX_PIECES;
-	ratio = SET_RATIO (speed_p);
-	break;
-      case STORE_BY_PIECES:
-	max_size = STORE_MAX_PIECES;
-	ratio = get_move_ratio (speed_p);
-	break;
+    case CLEAR_BY_PIECES:
+      max_size = STORE_MAX_PIECES;
+      ratio = CLEAR_RATIO (speed_p);
+      break;
+    case MOVE_BY_PIECES:
+      max_size = MOVE_MAX_PIECES;
+      ratio = get_move_ratio (speed_p);
+      break;
+    case SET_BY_PIECES:
+      max_size = STORE_MAX_PIECES;
+      ratio = SET_RATIO (speed_p);
+      break;
+    case STORE_BY_PIECES:
+      max_size = STORE_MAX_PIECES;
+      ratio = get_move_ratio (speed_p);
+      break;
+    case COMPARE_BY_PIECES:
+      max_size = COMPARE_MAX_PIECES;
+      ratio = get_move_ratio (speed_p);
+      break;
+    case PUSH_BY_PIECES:
+      gcc_unreachable ();
     }
 
   return move_by_pieces_ninsns (size, alignment, max_size + 1) < ratio;
 }
 
+/* This hook controls code generation for expanding a memcmp operation by
+   pieces.  Return 1 for the normal pattern of compare/jump after each pair
+   of loads, or a higher number to reduce the number of branches.  */
+
+int
+default_compare_by_pieces_branch_ratio (machine_mode)
+{
+  return 2;
+}
+
 bool
 default_profile_before_prologue (void)
 {
Index: gcc/targhooks.h
===================================================================
--- gcc/targhooks.h	(revision 235474)
+++ gcc/targhooks.h	(working copy)
@@ -199,6 +199,7 @@ extern bool default_use_by_pieces_infras
 						    unsigned int,
 						    enum by_pieces_operation,
 						    bool);
+extern int default_compare_by_pieces_branch_ratio (machine_mode);
 
 extern bool default_profile_before_prologue (void);
 extern reg_class_t default_preferred_reload_class (rtx, reg_class_t);
Index: gcc/testsuite/gcc.dg/pr43052-1.c
===================================================================
--- gcc/testsuite/gcc.dg/pr43052-1.c	(revision 0)
+++ gcc/testsuite/gcc.dg/pr43052-1.c	(working copy)
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-O" } */
+/* { dg-final { scan-assembler-not "memcmp" } }
+ */
+#include <string.h>
+struct A { int x; } a, b;
+
+extern char s[], t[];
+
+int foo ()
+{
+  return memcmp (&a, &b, sizeof (struct A)) == 0;
+}
+
+int bar ()
+{
+  return strcmp (s, t);
+}
+
+int baz ()
+{
+  return strncmp (s, t, 32);
+}
Index: gcc/tree-ssa-forwprop.c
===================================================================
--- gcc/tree-ssa-forwprop.c	(revision 235474)
+++ gcc/tree-ssa-forwprop.c	(working copy)
@@ -1435,6 +1435,76 @@ simplify_builtin_call (gimple_stmt_itera
 	    }
 	}
       break;
+
+    case BUILT_IN_MEMCMP:
+      {
+	tree res = gimple_call_lhs (stmt2);
+	tree arg1 = gimple_call_arg (stmt2, 0);
+	tree arg2 = gimple_call_arg (stmt2, 1);
+	tree len = gimple_call_arg (stmt2, 2);
+	unsigned HOST_WIDE_INT leni;
+	unsigned align1, align2;
+	use_operand_p use_p;
+	imm_use_iterator iter;
+
+	if (!res)
+	  return false;
+
+	FOR_EACH_IMM_USE_FAST (use_p, iter, res)
+	  {
+	    gimple *ustmt = USE_STMT (use_p);
+
+	    if (gimple_code (ustmt) == GIMPLE_ASSIGN)
+	      {
+		gassign *asgn = as_a <gassign *> (ustmt);
+		tree_code code = gimple_assign_rhs_code (asgn);
+		if ((code != EQ_EXPR && code != NE_EXPR)
+		    || !integer_zerop (gimple_assign_rhs2 (asgn)))
+		  return false;
+	      }
+	    else if (gimple_code (ustmt) == GIMPLE_COND)
+	      {
+		tree_code code = gimple_cond_code (ustmt);
+		if ((code != EQ_EXPR && code != NE_EXPR)
+		    || !integer_zerop (gimple_cond_rhs (ustmt)))
+		  return false;
+	      }
+	    else
+	      return false;
+	  }
+
+	if (tree_fits_uhwi_p (len)
+	    && (leni = tree_to_uhwi (len)) <= GET_MODE_SIZE (word_mode)
+	    && exact_log2 (leni) != -1
+	    && (align1 = get_pointer_alignment (arg1)) >= leni * BITS_PER_UNIT
+	    && (align2 = get_pointer_alignment (arg2)) >= leni * BITS_PER_UNIT)
+	  {
+	    location_t loc = gimple_location (stmt2);
+	    tree type, off;
+	    type = build_nonstandard_integer_type (leni * BITS_PER_UNIT, 1);
+	    gcc_assert (GET_MODE_SIZE (TYPE_MODE (type)) == leni);
+	    off = build_int_cst
+	     (build_pointer_type_for_mode (char_type_node, ptr_mode, true), 0);
+	    arg1 = build2_loc (loc, MEM_REF, type, arg1, off);
+	    arg2 = build2_loc (loc, MEM_REF, type, arg2, off);
+	    res = fold_convert_loc (loc, TREE_TYPE (res),
+				    fold_build2_loc (loc, NE_EXPR,
+						     boolean_type_node,
+						     arg1, arg2));
+	    gimplify_and_update_call_from_tree (gsi_p, res);
+	    return true;
+	  }
+
+	tree newfn = builtin_decl_explicit (BUILT_IN_MEMCMP_EQ);
+	gcall *repl = gimple_build_call (newfn, 3, arg1, arg2, len);
+	gimple_call_set_lhs (repl, res);
+	gimple_set_location (repl, gimple_location (stmt2));
+	gimple_set_vuse (repl, gimple_vuse (stmt2));
+	gcc_assert (!gimple_vdef (stmt2));
+	gsi_replace (gsi_p, repl, false);
+	return true;
+      }
+
     default:
       break;
     }
Index: gcc/tree.c
===================================================================
--- gcc/tree.c	(revision 235474)
+++ gcc/tree.c	(working copy)
@@ -10576,6 +10576,13 @@ build_common_builtin_nodes (void)
 			BUILT_IN_STACK_RESTORE,
 			"__builtin_stack_restore", ECF_NOTHROW | ECF_LEAF);
 
+  ftype = build_function_type_list (integer_type_node, const_ptr_type_node,
+				    const_ptr_type_node, size_type_node,
+				    NULL_TREE);
+  local_define_builtin ("__builtin_memcmp_eq", ftype, BUILT_IN_MEMCMP_EQ,
+			"__builtin_memcmp_eq",
+			ECF_PURE | ECF_NOTHROW | ECF_LEAF);
+
   /* If there's a possibility that we might use the ARM EABI, build the
     alternate __cxa_end_cleanup node used to resume from C++ and Java.  */
   if (targetm.arm_eabi_unwinder)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Thoughts on memcmp expansion (PR43052)
  2016-04-28 18:36   ` Bernd Schmidt
@ 2016-05-02 12:52     ` Richard Biener
  2016-05-02 12:57       ` Bernd Schmidt
  0 siblings, 1 reply; 19+ messages in thread
From: Richard Biener @ 2016-05-02 12:52 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches, Nick Clifton

On Thu, Apr 28, 2016 at 8:36 PM, Bernd Schmidt <bschmidt@redhat.com> wrote:
> On 01/18/2016 10:22 AM, Richard Biener wrote:
>>
>> See also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52171 - the
>> inline expansion
>> for small sizes and equality compares should be done on GIMPLE.  Today the
>> strlen pass might be an appropriate place to do this given its
>> superior knowledge
>> about string lengths.
>>
>> The idea of turning eq feeding memcmp into a special memcmp_eq is good but
>> you have to avoid doing that too early - otherwise you'd lose on
>>
>>    res = memcmp (p, q, sz);
>>    if (memcmp (p, q, sz) == 0)
>>     ...
>>
>> that is, you have to make sure CSE got the chance to common the two calls.
>> This is why I think this kind of transform needs to happen in specific
>> places
>> (like during strlen opt) rather than in generic folding.
>
>
> Ok, here's an update. I kept pieces of your patch from that PR, but also
> translated memcmps larger than a single operation into memcmp_eq as in my
> previous patch.
>
> Then, I added by_pieces infrastructure for memcmp expansion. To avoid any
> more code duplication in this area, I abstracted the existing code and
> converted it to C++ classes since that seemed to fit pretty well.
>
> There are a few possible ways I could go with this, which is why I'm posting
> it more as an RFD at this point.
>  - should store_by_pieces be eliminated in favour of doing just
>    move_by_pieces with constfns?
>  - the C++ification could be extended, making move_by_pieces_d and
>    compare_by_pieces_d classes inheriting from a common base. This
>    would get rid of the callbacks, replacing them with virtuals,
>    and also make some of the current struct members private.
>  - could move all of the by_pieces stuff out into a separate file?
>
> Later, I think we'll also want to extend this to allow vector mode
> operations, but I think that's a separate patch.
>
> So, opinions what I should be doing with this patch? FWIW it bootstraps and
> tests OK on x86_64-linux.

+struct pieces_addr
+{
...
+  void *m_cfndata;
+public:
+  pieces_addr (rtx, bool, by_pieces_constfn, void *);

unless you stick private: somewhere, the public: is redundant
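
A minimal illustration of the point, with made-up names rather than the
patch's own:

  struct S
  {
    int m_x;        /* public by default in a struct */
  public:           /* redundant: nothing above is private */
    S (int x) : m_x (x) {}
  };

With 'class S' instead, m_x would be private by default and the public:
label would earn its keep.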

Index: gcc/tree-ssa-forwprop.c
===================================================================
--- gcc/tree-ssa-forwprop.c     (revision 235474)
+++ gcc/tree-ssa-forwprop.c     (working copy)
@@ -1435,6 +1435,76 @@ simplify_builtin_call (gimple_stmt_itera
            }
        }
       break;
+
+    case BUILT_IN_MEMCMP:
+      {

I think this doesn't belong in forwprop.  If we want to stick it into
a pass rather than
folding it should be in tree-ssa-strlen.c.

+       if (tree_fits_uhwi_p (len)
+           && (leni = tree_to_uhwi (len)) <= GET_MODE_SIZE (word_mode)
+           && exact_log2 (leni) != -1
+           && (align1 = get_pointer_alignment (arg1)) >= leni * BITS_PER_UNIT
+           && (align2 = get_pointer_alignment (arg2)) >= leni * BITS_PER_UNIT)
+         {
+           location_t loc = gimple_location (stmt2);
+           tree type, off;
+           type = build_nonstandard_integer_type (leni * BITS_PER_UNIT, 1);
+           gcc_assert (GET_MODE_SIZE (TYPE_MODE (type)) == leni);
+           off = build_int_cst
+            (build_pointer_type_for_mode (char_type_node, ptr_mode, true), 0);
+           arg1 = build2_loc (loc, MEM_REF, type, arg1, off);
+           arg2 = build2_loc (loc, MEM_REF, type, arg2, off);
+           res = fold_convert_loc (loc, TREE_TYPE (res),
+                                   fold_build2_loc (loc, NE_EXPR,
+                                                    boolean_type_node,
+                                                    arg1, arg2));
+           gimplify_and_update_call_from_tree (gsi_p, res);
+           return true;
+         }

for this part see gimple_fold_builtin_memory_op handling of

      /* If we can perform the copy efficiently with first doing all loads
         and then all stores inline it that way.  Currently efficiently
         means that we can load all the memory into a single integer
         register which is what MOVE_MAX gives us.  */

we might want to share a helper yielding the type of the load/store
or NULL if not possible.

Note that we can handle size-1 memcmp even for ordered compares.
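A minimal sketch of why (not from the patch): for length 1 the sign of
the result is just the sign of the difference of the two bytes read as
unsigned char, so even an ordered use can be open-coded as

  int
  ordered_cmp1 (const void *p, const void *q)
  {
    /* Same sign as memcmp (p, q, 1), which is all the standard
       guarantees to callers.  */
    return (int) *(const unsigned char *) p
           - (int) *(const unsigned char *) q;
  }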

Jakub, where do you think this fits best?  Note that gimple-fold.c may
not use immediate uses but would have to match this from the
comparison (I still have to find a way to handle this in match.pd where
the result expression contains virtual operands in the not toplevel stmt).

Richard.

>
> Bernd

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Thoughts on memcmp expansion (PR43052)
  2016-05-02 12:52     ` Richard Biener
@ 2016-05-02 12:57       ` Bernd Schmidt
  2016-05-02 13:14         ` Richard Biener
  0 siblings, 1 reply; 19+ messages in thread
From: Bernd Schmidt @ 2016-05-02 12:57 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches, Nick Clifton

On 05/02/2016 02:52 PM, Richard Biener wrote:
> +struct pieces_addr
> +{
> ...
> +  void *m_cfndata;
> +public:
> +  pieces_addr (rtx, bool, by_pieces_constfn, void *);
>
> unless you stick private: somewhere, the public: is redundant

Yeah, ideally I want to turn these into classes rather than structs. 
Maybe that particular one can already be done, but I'm kind of wondering 
how far to take the C++ification of the other one.

> Index: gcc/tree-ssa-forwprop.c
> ===================================================================
> --- gcc/tree-ssa-forwprop.c     (revision 235474)
> +++ gcc/tree-ssa-forwprop.c     (working copy)
> @@ -1435,6 +1435,76 @@ simplify_builtin_call (gimple_stmt_itera
>              }
>          }
>         break;
> +
> +    case BUILT_IN_MEMCMP:
> +      {
>
> I think this doesn't belong in forwprop.  If we want to stick it into
> a pass rather than
> folding it should be in tree-ssa-strlen.c.

This part (and the other one you quoted) was essentially your prototype 
patch from PR52171. I can put it wherever you like, really.

> Note that we can handle size-1 memcmp even for ordered compares.

One would hope this doesn't occur very often...

> Jakub, where do you think this fits best?  Note that gimple-fold.c may
> not use immediate uses but would have to match this from the
> comparison (I still have to find a way to handle this in match.pd where
> the result expression contains virtual operands in the not toplevel stmt).


Bernd

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Thoughts on memcmp expansion (PR43052)
  2016-05-02 12:57       ` Bernd Schmidt
@ 2016-05-02 13:14         ` Richard Biener
  2016-05-12 17:14           ` Bernd Schmidt
  0 siblings, 1 reply; 19+ messages in thread
From: Richard Biener @ 2016-05-02 13:14 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches, Nick Clifton

On Mon, May 2, 2016 at 2:57 PM, Bernd Schmidt <bschmidt@redhat.com> wrote:
> On 05/02/2016 02:52 PM, Richard Biener wrote:
>>
>> +struct pieces_addr
>> +{
>> ...
>> +  void *m_cfndata;
>> +public:
>> +  pieces_addr (rtx, bool, by_pieces_constfn, void *);
>>
>> unless you stick private: somewhere, the public: is redundant
>
>
> Yeah, ideally I want to turn these into classes rather than structs. Maybe
> that particular one can already be done, but I'm kind of wondering how far
> to take the C++ification of the other one.
>
>> Index: gcc/tree-ssa-forwprop.c
>> ===================================================================
>> --- gcc/tree-ssa-forwprop.c     (revision 235474)
>> +++ gcc/tree-ssa-forwprop.c     (working copy)
>> @@ -1435,6 +1435,76 @@ simplify_builtin_call (gimple_stmt_itera
>>              }
>>          }
>>         break;
>> +
>> +    case BUILT_IN_MEMCMP:
>> +      {
>>
>> I think this doesn't belong in forwprop.  If we want to stick it into
>> a pass rather than
>> folding it should be in tree-ssa-strlen.c.
>
>
> This part (and the other one you quoted) was essentially your prototype
> patch from PR52171. I can put it wherever you like, really.

I think it fits best in tree-ssa-strlen.c:strlen_optimize_stmt for the moment.

>> Note that we can handle size-1 memcmp even for ordered compares.
>
>
> One would hope this doesn't occur very often...

C++ templates .... but yes.
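
A hypothetical example of how those arise (not from any PR): generic
code that compares objects bytewise collapses to a constant-length-1
memcmp once instantiated with a one-byte type:

  #include <cstring>

  template <typename T>
  int byte_compare (const T &a, const T &b)
  {
    /* Length is the constant sizeof (T); 1 for T = char.  */
    return std::memcmp (&a, &b, sizeof (T));
  }

  template int byte_compare<char> (const char &, const char &);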

Richard.

>
>> Jakub, where do you think this fits best?  Note that gimple-fold.c may
>> not use immediate uses but would have to match this from the
>> comparison (I still have to find a way to handle this in match.pd where
>> the result expression contains virtual operands in the not toplevel stmt).
>
>
>
> Bernd

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Thoughts on memcmp expansion (PR43052)
  2016-05-02 13:14         ` Richard Biener
@ 2016-05-12 17:14           ` Bernd Schmidt
  2016-05-13 10:20             ` Richard Biener
  0 siblings, 1 reply; 19+ messages in thread
From: Bernd Schmidt @ 2016-05-12 17:14 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches, Nick Clifton

[-- Attachment #1: Type: text/plain, Size: 1816 bytes --]

On 05/02/2016 03:14 PM, Richard Biener wrote:

> I think it fits best in tree-ssa-strlen.c:strlen_optimize_stmt for the moment.

I've done this in this version. Note that this apparently means it won't 
be run at -O, unlike the previous version.

The code moved to tree-ssa-strlen.c is nearly identical to the previous 
version, with the exception of a new line that copies the contents of 
gimple_call_use_set to the newly constructed call. Without this, I found 
a testcase where we can miss a DSE opportunity since we think memcmp_eq 
can access anything.

In the by_pieces_code, I've gone ahead and progressed the C++ conversion 
a little further by making subclasses of op_by_pieces_d. Now the 
{move,store,compare}_by_pieces functions generally just make an object 
of the appropriate type and call its run method. I've also converted 
store_by_pieces to this infrastructure to eliminate more duplication.
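
To make the shape of that clearer without wading through the diff,
here's a minimal sketch of the pattern (simplified names, not the patch
itself): a base class owns the largest-piece-first loop and defers the
per-piece work to virtuals that each operation overrides:

  #include <cstddef>

  class op_by_pieces_base
  {
  protected:
    std::size_t m_len;

    /* Can pieces of this size be used, and do any setup for them.  */
    virtual bool prepare_size (std::size_t size) = 0;
    /* Emit the move/store/compare for one piece at OFFSET.  */
    virtual void generate (std::size_t offset, std::size_t size) = 0;

  public:
    op_by_pieces_base (std::size_t len) : m_len (len) {}
    virtual ~op_by_pieces_base () {}

    /* Largest power-of-two piece first, then successively smaller.  */
    void run ()
    {
      std::size_t offset = 0;
      for (std::size_t size = sizeof (long); size > 0; size /= 2)
        if (prepare_size (size))
          while (m_len >= size)
            {
              generate (offset, size);
              offset += size;
              m_len -= size;
            }
    }
  };

In the real patch the subclasses override prepare_mode/generate, and
move_by_pieces, store_by_pieces and compare_by_pieces just construct
the right object and call its run method.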

I've verified that if you disable the strlenopt code, you get identical 
code generation before and after the patch on x86_64-linux and arm-eabi. 
With the strlenopt enabled, it finds memcmp optimization opportunities 
on both targets (more on x86 due to !SLOW_UNALIGNED_ACCESS). On arm, 
there is a question mark over whether the autoinc support in the 
by_pieces code is really a good idea, but that's unchanged from before 
and can be addressed later.
Microbenchmarks on x86 suggest that the new by-pieces expansion is a 
whole lot faster than calling memcmp (about a factor of 3 in my tests).

Bootstrapped and tested on x86_64-linux. The testcase failed at -O (now 
changed to -O2), and there was a failure in vrp47.c which also seems to 
appear in gcc-testresults, so I think it's unrelated. Still, I'll make 
another run against a new baseline. Ok for trunk if that all comes back 
clean?


Bernd

[-- Attachment #2: memcmp-eq-0512c.diff --]
[-- Type: text/x-patch, Size: 72889 bytes --]

	PR tree-optimization/52171
	* builtins.c (expand_cmpstrn_or_cmpmem): Delete, moved elsewhere.
	(expand_builtin_memcmp): New arg RESULT_EQ.  All callers changed.
	Look for constant strings.  Move some code to emit_block_cmp_hints
	and use it.
	* builtins.def (BUILT_IN_MEMCMP_EQ): New.
	* defaults.h (COMPARE_MAX_PIECES): New macro.
	* expr.c (move_by_pieces_d, store_by_pieces_d): Remove old structs.
	(move_by_pieces_1, store_by_pieces_1, store_by_pieces_2): Remove.
	(clear_by_pieces_1): Don't declare.  Move definition before use.
	(can_do_by_pieces): New static function.
	(can_move_by_pieces): Use it.  Return bool.
	(by_pieces_ninsns): Renamed from move_by_pieces_ninsns.  New arg
	OP.  All callers changed.  Handle COMPARE_BY_PIECES.
	(class pieces_addr): New.
	(pieces_addr::pieces_addr, pieces_addr::decide_autoinc,
	pieces_addr::adjust, pieces_addr::increment_address,
	pieces_addr::maybe_predec, pieces_addr::maybe_postinc): New member
	functions for it.
	(class op_by_pieces_d): New.
	(op_by_pieces_d::op_by_pieces_d, op_by_pieces_d::run): New member
	functions for it.
	(class move_by_pieces_d, class compare_by_pieces_d,
	class store_by_pieces_d): New subclasses of op_by_pieces_d.
	(move_by_pieces_d::prepare_mode, move_by_pieces_d::generate,
	move_by_pieces_d::finish_endp, store_by_pieces_d::prepare_mode,
	store_by_pieces_d::generate, store_by_pieces_d::finish_endp,
	compare_by_pieces_d::generate, compare_by_pieces_d::prepare_mode,
	compare_by_pieces_d::finish_mode): New member functions.
	(compare_by_pieces, emit_block_cmp_via_cmpmem): New static
	functions.
	(expand_cmpstrn_or_cmpmem): Moved here from builtins.c.
	(emit_block_cmp_hints): New function.
	(move_by_pieces, store_by_pieces, clear_by_pieces): Rewrite to just
	use the newly defined classes.
	* expr.h (by_pieces_constfn): New typedef.
	(can_store_by_pieces, store_by_pieces): Use it in arg declarations.
	(emit_block_cmp_hints, expand_cmpstrn_or_cmpmem): Declare.
	(move_by_pieces_ninsns): Don't declare.
	(can_move_by_pieces): Change return value to bool.
	* target.def (TARGET_USE_BY_PIECES_INFRASTRUCTURE_P): Update docs.
	(compare_by_pieces_branch_ratio): New hook.
	* target.h (enum by_pieces_operation): Add COMPARE_BY_PIECES.
	(by_pieces_ninsns): Declare.
	* targhooks.c (default_use_by_pieces_infrastructure_p): Handle
	COMPARE_BY_PIECES.
	(default_compare_by_pieces_branch_ratio): New function.
	* targhooks.h (default_compare_by_pieces_branch_ratio): Declare.
	* doc/tm.texi.in (STORE_MAX_PIECES, COMPARE_MAX_PIECES): Document.
	* doc/tm.texi: Regenerate.
	* tree-ssa-strlen.c: Include "builtins.h".
	(handle_builtin_memcmp): New static function.
	(strlen_optimize_stmt): Call it for BUILT_IN_MEMCMP.
	* tree.c (build_common_builtin_nodes): Create __builtin_memcmp_eq.

testsuite/
	PR tree-optimization/52171
	* gcc.dg/pr52171.c: New test.
	* gcc.target/i386/pr52171.c: New test.

Index: gcc/builtins.c
===================================================================
--- gcc/builtins.c	(revision 236113)
+++ gcc/builtins.c	(working copy)
@@ -3671,53 +3671,24 @@ expand_cmpstr (insn_code icode, rtx targ
   return NULL_RTX;
 }
 
-/* Try to expand cmpstrn or cmpmem operation ICODE with the given operands.
-   ARG3_TYPE is the type of ARG3_RTX.  Return the result rtx on success,
-   otherwise return null.  */
-
-static rtx
-expand_cmpstrn_or_cmpmem (insn_code icode, rtx target, rtx arg1_rtx,
-			  rtx arg2_rtx, tree arg3_type, rtx arg3_rtx,
-			  HOST_WIDE_INT align)
-{
-  machine_mode insn_mode = insn_data[icode].operand[0].mode;
-
-  if (target && (!REG_P (target) || HARD_REGISTER_P (target)))
-    target = NULL_RTX;
-
-  struct expand_operand ops[5];
-  create_output_operand (&ops[0], target, insn_mode);
-  create_fixed_operand (&ops[1], arg1_rtx);
-  create_fixed_operand (&ops[2], arg2_rtx);
-  create_convert_operand_from (&ops[3], arg3_rtx, TYPE_MODE (arg3_type),
-			       TYPE_UNSIGNED (arg3_type));
-  create_integer_operand (&ops[4], align);
-  if (maybe_expand_insn (icode, 5, ops))
-    return ops[0].value;
-  return NULL_RTX;
-}
-
 /* Expand expression EXP, which is a call to the memcmp built-in function.
    Return NULL_RTX if we failed and the caller should emit a normal call,
-   otherwise try to get the result in TARGET, if convenient.  */
+   otherwise try to get the result in TARGET, if convenient.
+   RESULT_EQ is true if we can relax the returned value to be either zero
+   or nonzero, without caring about the sign.  */
 
 static rtx
-expand_builtin_memcmp (tree exp, rtx target)
+expand_builtin_memcmp (tree exp, rtx target, bool result_eq)
 {
   if (!validate_arglist (exp,
  			 POINTER_TYPE, POINTER_TYPE, INTEGER_TYPE, VOID_TYPE))
     return NULL_RTX;
 
-  /* Note: The cmpstrnsi pattern, if it exists, is not suitable for
-     implementing memcmp because it will stop if it encounters two
-     zero bytes.  */
-  insn_code icode = direct_optab_handler (cmpmem_optab, SImode);
-  if (icode == CODE_FOR_nothing)
-    return NULL_RTX;
-
   tree arg1 = CALL_EXPR_ARG (exp, 0);
   tree arg2 = CALL_EXPR_ARG (exp, 1);
   tree len = CALL_EXPR_ARG (exp, 2);
+  machine_mode mode = TYPE_MODE (TREE_TYPE (exp));
+  location_t loc = EXPR_LOCATION (exp);
 
   unsigned int arg1_align = get_pointer_alignment (arg1) / BITS_PER_UNIT;
   unsigned int arg2_align = get_pointer_alignment (arg2) / BITS_PER_UNIT;
@@ -3726,22 +3697,39 @@ expand_builtin_memcmp (tree exp, rtx tar
   if (arg1_align == 0 || arg2_align == 0)
     return NULL_RTX;
 
-  machine_mode mode = TYPE_MODE (TREE_TYPE (exp));
-  location_t loc = EXPR_LOCATION (exp);
   rtx arg1_rtx = get_memory_rtx (arg1, len);
   rtx arg2_rtx = get_memory_rtx (arg2, len);
-  rtx arg3_rtx = expand_normal (fold_convert_loc (loc, sizetype, len));
+  rtx len_rtx = expand_normal (fold_convert_loc (loc, sizetype, len));
 
   /* Set MEM_SIZE as appropriate.  */
-  if (CONST_INT_P (arg3_rtx))
+  if (CONST_INT_P (len_rtx))
     {
-      set_mem_size (arg1_rtx, INTVAL (arg3_rtx));
-      set_mem_size (arg2_rtx, INTVAL (arg3_rtx));
+      set_mem_size (arg1_rtx, INTVAL (len_rtx));
+      set_mem_size (arg2_rtx, INTVAL (len_rtx));
     }
 
-  rtx result = expand_cmpstrn_or_cmpmem (icode, target, arg1_rtx, arg2_rtx,
-					 TREE_TYPE (len), arg3_rtx,
-					 MIN (arg1_align, arg2_align));
+  by_pieces_constfn constfn = NULL;
+
+  const char *src_str = c_getstr (arg1);
+  if (src_str == NULL)
+    src_str = c_getstr (arg2);
+  else
+    std::swap (arg1_rtx, arg2_rtx);
+
+  /* If SRC is a string constant and the block comparison would be done
+     by pieces, we can avoid loading the string from memory
+     and use only the computed constants.  */
+  if (src_str
+      && CONST_INT_P (len_rtx)
+      && (unsigned HOST_WIDE_INT) INTVAL (len_rtx) <= strlen (src_str) + 1)
+    constfn = builtin_memcpy_read_str;
+
+  rtx result = emit_block_cmp_hints (arg1_rtx, arg2_rtx, len_rtx,
+				     TREE_TYPE (len), target,
+				     result_eq, result_eq, constfn,
+				     CONST_CAST (char *, src_str));
+
+
   if (result)
     {
       /* Return the value in the proper mode for this function.  */
@@ -3757,19 +3745,6 @@ expand_builtin_memcmp (tree exp, rtx tar
       return convert_to_mode (mode, result, 0);
     }
 
-  result = target;
-  if (! (result != 0
-	 && REG_P (result) && GET_MODE (result) == mode
-	 && REGNO (result) >= FIRST_PSEUDO_REGISTER))
-    result = gen_reg_rtx (mode);
-
-  emit_library_call_value (memcmp_libfunc, result, LCT_PURE,
-			   TYPE_MODE (integer_type_node), 3,
-			   XEXP (arg1_rtx, 0), Pmode,
-			   XEXP (arg2_rtx, 0), Pmode,
-			   convert_to_mode (TYPE_MODE (sizetype), arg3_rtx,
-					    TYPE_UNSIGNED (sizetype)),
-			   TYPE_MODE (sizetype));
   return result;
 }
 
@@ -6081,9 +6056,16 @@ expand_builtin (tree exp, rtx target, rt
 
     case BUILT_IN_BCMP:
     case BUILT_IN_MEMCMP:
-      target = expand_builtin_memcmp (exp, target);
+    case BUILT_IN_MEMCMP_EQ:
+      target = expand_builtin_memcmp (exp, target, fcode == BUILT_IN_MEMCMP_EQ);
       if (target)
 	return target;
+      if (fcode == BUILT_IN_MEMCMP_EQ)
+	{
+	  exp = copy_node (exp);
+	  tree newdecl = builtin_decl_explicit (BUILT_IN_MEMCMP);
+	  TREE_OPERAND (exp, 1) = build_fold_addr_expr (newdecl);
+	}
       break;
 
     case BUILT_IN_SETJMP:
Index: gcc/builtins.def
===================================================================
--- gcc/builtins.def	(revision 236113)
+++ gcc/builtins.def	(working copy)
@@ -864,6 +864,10 @@ DEF_BUILTIN_STUB (BUILT_IN_STACK_SAVE, "
 DEF_BUILTIN_STUB (BUILT_IN_STACK_RESTORE, "__builtin_stack_restore")
 DEF_BUILTIN_STUB (BUILT_IN_ALLOCA_WITH_ALIGN, "__builtin_alloca_with_align")
 
+/* An internal version of memcmp, used when the result is only tested for
+   equality with zero.  */
+DEF_BUILTIN_STUB (BUILT_IN_MEMCMP_EQ, "__builtin_memcmp_eq")
+
 /* Object size checking builtins.  */
 DEF_GCC_BUILTIN	       (BUILT_IN_OBJECT_SIZE, "object_size", BT_FN_SIZE_CONST_PTR_INT, ATTR_PURE_NOTHROW_LEAF_LIST)
 DEF_EXT_LIB_BUILTIN_CHKP (BUILT_IN_MEMCPY_CHK, "__memcpy_chk", BT_FN_PTR_PTR_CONST_PTR_SIZE_SIZE, ATTR_RET1_NOTHROW_NONNULL_LEAF)
Index: gcc/defaults.h
===================================================================
--- gcc/defaults.h	(revision 236113)
+++ gcc/defaults.h	(working copy)
@@ -1039,6 +1039,11 @@ see the files COPYING3 and COPYING.RUNTI
 #define STORE_MAX_PIECES  MIN (MOVE_MAX_PIECES, 2 * sizeof (HOST_WIDE_INT))
 #endif
 
+/* Likewise for block comparisons.  */
+#ifndef COMPARE_MAX_PIECES
+#define COMPARE_MAX_PIECES  MOVE_MAX_PIECES
+#endif
+
 #ifndef MAX_MOVE_MAX
 #define MAX_MOVE_MAX MOVE_MAX
 #endif
Index: gcc/doc/tm.texi
===================================================================
--- gcc/doc/tm.texi	(revision 236113)
+++ gcc/doc/tm.texi	(working copy)
@@ -6311,8 +6311,9 @@ Both @var{size} and @var{alignment} are
 units.
 
 The parameter @var{op} is one of: @code{CLEAR_BY_PIECES},
-@code{MOVE_BY_PIECES}, @code{SET_BY_PIECES}, @code{STORE_BY_PIECES}.
-These describe the type of memory operation under consideration.
+@code{MOVE_BY_PIECES}, @code{SET_BY_PIECES}, @code{STORE_BY_PIECES} or
+@code{COMPARE_BY_PIECES}.  These describe the type of memory operation
+under consideration.
 
 The parameter @var{speed_p} is true if the code is currently being
 optimized for speed rather than size.
@@ -6329,11 +6330,33 @@ in code size, for example where the numb
 move would be greater than that of a library call.
 @end deftypefn
 
+@deftypefn {Target Hook} int TARGET_COMPARE_BY_PIECES_BRANCH_RATIO (machine_mode @var{mode})
+When expanding a block comparison in @var{mode}, GCC can try to reduce the
+number of branches at the expense of more memory operations.  This hook
+allows the target to override the default choice.  It should return the
+factor by which branches should be reduced over the plain expansion with
+one comparison per @var{mode}-sized piece.  A port can also prevent a
+particular mode from being used for block comparisons by returning a
+negative number from this hook.
+@end deftypefn
+
 @defmac MOVE_MAX_PIECES
 A C expression used by @code{move_by_pieces} to determine the largest unit
 a load or store used to copy memory is.  Defaults to @code{MOVE_MAX}.
 @end defmac
 
+@defmac STORE_MAX_PIECES
+A C expression used by @code{store_by_pieces} to determine the largest unit
+used by a store to memory.  Defaults to @code{MOVE_MAX_PIECES}, or two times
+the size of @code{HOST_WIDE_INT}, whichever is smaller.
+@end defmac
+
+@defmac COMPARE_MAX_PIECES
+A C expression used by @code{compare_by_pieces} to determine the largest unit
+used by a load or store to compare memory.  Defaults to
+@code{MOVE_MAX_PIECES}.
+@end defmac
+
 @defmac CLEAR_RATIO (@var{speed})
 The threshold of number of scalar move insns, @emph{below} which a sequence
 of insns should be generated to clear memory instead of a string clear insn
Index: gcc/doc/tm.texi.in
===================================================================
--- gcc/doc/tm.texi.in	(revision 236113)
+++ gcc/doc/tm.texi.in	(working copy)
@@ -4653,11 +4653,25 @@ If you don't define this, a reasonable d
 
 @hook TARGET_USE_BY_PIECES_INFRASTRUCTURE_P
 
+@hook TARGET_COMPARE_BY_PIECES_BRANCH_RATIO
+
 @defmac MOVE_MAX_PIECES
 A C expression used by @code{move_by_pieces} to determine the largest unit
 a load or store used to copy memory is.  Defaults to @code{MOVE_MAX}.
 @end defmac
 
+@defmac STORE_MAX_PIECES
+A C expression used by @code{store_by_pieces} to determine the largest unit
+used by a store to memory.  Defaults to @code{MOVE_MAX_PIECES}, or two times
+the size of @code{HOST_WIDE_INT}, whichever is smaller.
+@end defmac
+
+@defmac COMPARE_MAX_PIECES
+A C expression used by @code{compare_by_pieces} to determine the largest unit
+used by a load or store to compare memory.  Defaults to
+@code{MOVE_MAX_PIECES}.
+@end defmac
+
 @defmac CLEAR_RATIO (@var{speed})
 The threshold of number of scalar move insns, @emph{below} which a sequence
 of insns should be generated to clear memory instead of a string clear insn
Index: gcc/expr.c
===================================================================
--- gcc/expr.c	(revision 236113)
+++ gcc/expr.c	(working copy)
@@ -70,52 +70,13 @@ along with GCC; see the file COPYING3.
    the same indirect address eventually.  */
 int cse_not_expected;
 
-/* This structure is used by move_by_pieces to describe the move to
-   be performed.  */
-struct move_by_pieces_d
-{
-  rtx to;
-  rtx to_addr;
-  int autinc_to;
-  int explicit_inc_to;
-  rtx from;
-  rtx from_addr;
-  int autinc_from;
-  int explicit_inc_from;
-  unsigned HOST_WIDE_INT len;
-  HOST_WIDE_INT offset;
-  int reverse;
-};
-
-/* This structure is used by store_by_pieces to describe the clear to
-   be performed.  */
-
-struct store_by_pieces_d
-{
-  rtx to;
-  rtx to_addr;
-  int autinc_to;
-  int explicit_inc_to;
-  unsigned HOST_WIDE_INT len;
-  HOST_WIDE_INT offset;
-  rtx (*constfun) (void *, HOST_WIDE_INT, machine_mode);
-  void *constfundata;
-  int reverse;
-};
-
-static void move_by_pieces_1 (insn_gen_fn, machine_mode,
-			      struct move_by_pieces_d *);
 static bool block_move_libcall_safe_for_call_parm (void);
 static bool emit_block_move_via_movmem (rtx, rtx, rtx, unsigned, unsigned, HOST_WIDE_INT,
 					unsigned HOST_WIDE_INT, unsigned HOST_WIDE_INT,
 					unsigned HOST_WIDE_INT);
 static tree emit_block_move_libcall_fn (int);
 static void emit_block_move_via_loop (rtx, rtx, rtx, unsigned);
-static rtx clear_by_pieces_1 (void *, HOST_WIDE_INT, machine_mode);
 static void clear_by_pieces (rtx, unsigned HOST_WIDE_INT, unsigned int);
-static void store_by_pieces_1 (struct store_by_pieces_d *, unsigned int);
-static void store_by_pieces_2 (insn_gen_fn, machine_mode,
-			       struct store_by_pieces_d *);
 static tree clear_storage_libcall_fn (int);
 static rtx_insn *compress_float_constant (rtx, rtx);
 static rtx get_subtarget (rtx);
@@ -769,276 +730,799 @@ widest_int_mode_for_size (unsigned int s
   return mode;
 }
 
+/* Determine whether an operation OP on LEN bytes with alignment ALIGN can
+   and should be performed piecewise.  */
+
+static bool
+can_do_by_pieces (unsigned HOST_WIDE_INT len, unsigned int align,
+		  enum by_pieces_operation op)
+{
+  return targetm.use_by_pieces_infrastructure_p (len, align, op,
+						 optimize_insn_for_speed_p ());
+}
+
 /* Determine whether the LEN bytes can be moved by using several move
    instructions.  Return nonzero if a call to move_by_pieces should
    succeed.  */
 
-int
-can_move_by_pieces (unsigned HOST_WIDE_INT len,
-		    unsigned int align)
+bool
+can_move_by_pieces (unsigned HOST_WIDE_INT len, unsigned int align)
 {
-  return targetm.use_by_pieces_infrastructure_p (len, align, MOVE_BY_PIECES,
-						 optimize_insn_for_speed_p ());
+  return can_do_by_pieces (len, align, MOVE_BY_PIECES);
 }
 
-/* Generate several move instructions to copy LEN bytes from block FROM to
-   block TO.  (These are MEM rtx's with BLKmode).
+/* Return number of insns required to perform operation OP by pieces
+   for L bytes.  ALIGN (in bits) is maximum alignment we can assume.  */
 
-   If PUSH_ROUNDING is defined and TO is NULL, emit_single_push_insn is
-   used to push FROM to the stack.
+unsigned HOST_WIDE_INT
+by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align,
+		  unsigned int max_size, by_pieces_operation op)
+{
+  unsigned HOST_WIDE_INT n_insns = 0;
 
-   ALIGN is maximum stack alignment we can assume.
+  align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
 
-   If ENDP is 0 return to, if ENDP is 1 return memory at the end ala
-   mempcpy, and if ENDP is 2 return memory the end minus one byte ala
-   stpcpy.  */
+  while (max_size > 1 && l > 0)
+    {
+      machine_mode mode;
+      enum insn_code icode;
 
-rtx
-move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
-		unsigned int align, int endp)
+      mode = widest_int_mode_for_size (max_size);
+
+      if (mode == VOIDmode)
+	break;
+      unsigned int modesize = GET_MODE_SIZE (mode);
+
+      icode = optab_handler (mov_optab, mode);
+      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
+	{
+	  unsigned HOST_WIDE_INT n_pieces = l / modesize;
+	  l %= modesize;
+	  switch (op)
+	    {
+	    default:
+	      n_insns += n_pieces;
+	      break;
+
+	    case COMPARE_BY_PIECES:
+	      int batch = targetm.compare_by_pieces_branch_ratio (mode);
+	      int batch_ops = 4 * batch - 1;
+	      int full = n_pieces / batch;
+	      n_insns += full * batch_ops;
+	      if (n_pieces % batch != 0)
+		n_insns++;
+	      break;
+
+	    }
+	}
+      max_size = modesize;
+    }
+
+  gcc_assert (!l);
+  return n_insns;
+}
+
+/* Used when performing piecewise block operations, holds information
+   about one of the memory objects involved.  The member functions
+   can be used to generate code for loading from the object and
+   updating the address when iterating.  */
+
+class pieces_addr
 {
-  struct move_by_pieces_d data;
-  machine_mode to_addr_mode;
-  machine_mode from_addr_mode = get_address_mode (from);
-  rtx to_addr, from_addr = XEXP (from, 0);
-  unsigned int max_size = MOVE_MAX_PIECES + 1;
-  enum insn_code icode;
+  /* The object being referenced, a MEM.  Can be NULL_RTX to indicate
+     stack pushes.  */
+  rtx m_obj;
+  /* The address of the object.  Can differ from that seen in the
+     MEM rtx if we copied the address to a register.  */
+  rtx m_addr;
+  /* Nonzero if the address of the object already has an autoincrement;
+     the sign indicates whether that is an increment or a decrement.  */
+  signed char m_addr_inc;
+  /* Nonzero if we intend to use autoinc without the address already
+     having autoinc form.  We will insert add insns around each memory
+     reference, expecting later passes to form autoinc addressing modes.
+     The only supported options are predecrement and postincrement.  */
+  signed char m_explicit_inc;
+  /* True if we have either of the two possible cases of using
+     autoincrement.  */
+  bool m_auto;
+  /* True if this is an address to be used for load operations rather
+     than stores.  */
+  bool m_is_load;
 
-  align = MIN (to ? MEM_ALIGN (to) : align, MEM_ALIGN (from));
+  /* Optionally, a function to obtain constants for any given offset into
+     the objects, and data associated with it.  */
+  by_pieces_constfn m_constfn;
+  void *m_cfndata;
+public:
+  pieces_addr (rtx, bool, by_pieces_constfn, void *);
+  rtx adjust (machine_mode, HOST_WIDE_INT);
+  void increment_address (HOST_WIDE_INT);
+  void maybe_predec (HOST_WIDE_INT);
+  void maybe_postinc (HOST_WIDE_INT);
+  void decide_autoinc (machine_mode, bool, HOST_WIDE_INT);
+  int get_addr_inc ()
+  {
+    return m_addr_inc;
+  }
+};
 
-  data.offset = 0;
-  data.from_addr = from_addr;
-  if (to)
+/* Initialize a pieces_addr structure from an object OBJ.  IS_LOAD is
+   true if the operation to be performed on this object is a load
+   rather than a store.  For stores, OBJ can be NULL, in which case we
+   assume the operation is a stack push.  For loads, the optional
+   CONSTFN and its associated CFNDATA can be used in place of the
+   memory load.  */
+
+pieces_addr::pieces_addr (rtx obj, bool is_load, by_pieces_constfn constfn,
+			  void *cfndata)
+  : m_obj (obj), m_is_load (is_load), m_constfn (constfn), m_cfndata (cfndata)
+{
+  m_addr_inc = 0;
+  m_auto = false;
+  if (obj)
     {
-      to_addr_mode = get_address_mode (to);
-      to_addr = XEXP (to, 0);
-      data.to = to;
-      data.autinc_to
-	= (GET_CODE (to_addr) == PRE_INC || GET_CODE (to_addr) == PRE_DEC
-	   || GET_CODE (to_addr) == POST_INC || GET_CODE (to_addr) == POST_DEC);
-      data.reverse
-	= (GET_CODE (to_addr) == PRE_DEC || GET_CODE (to_addr) == POST_DEC);
+      rtx addr = XEXP (obj, 0);
+      rtx_code code = GET_CODE (addr);
+      m_addr = addr;
+      bool dec = code == PRE_DEC || code == POST_DEC;
+      bool inc = code == PRE_INC || code == POST_INC;
+      m_auto = inc || dec;
+      if (m_auto)
+	m_addr_inc = dec ? -1 : 1;
+
+      /* While we have always looked for these codes here, the code
+	 implementing the memory operation has never handled them.
+	 Support could be added later if necessary or beneficial.  */
+      gcc_assert (code != PRE_INC && code != POST_DEC);
     }
   else
     {
-      to_addr_mode = VOIDmode;
-      to_addr = NULL_RTX;
-      data.to = NULL_RTX;
-      data.autinc_to = 1;
-      if (STACK_GROWS_DOWNWARD)
-	data.reverse = 1;
+      m_addr = NULL_RTX;
+      if (!is_load)
+	{
+	  m_auto = true;
+	  if (STACK_GROWS_DOWNWARD)
+	    m_addr_inc = -1;
+	  else
+	    m_addr_inc = 1;
+	}
       else
-	data.reverse = 0;
+	gcc_assert (constfn != NULL);
     }
-  data.to_addr = to_addr;
-  data.from = from;
-  data.autinc_from
-    = (GET_CODE (from_addr) == PRE_INC || GET_CODE (from_addr) == PRE_DEC
-       || GET_CODE (from_addr) == POST_INC
-       || GET_CODE (from_addr) == POST_DEC);
+  m_explicit_inc = 0;
+  if (constfn)
+    gcc_assert (is_load);
+}
 
-  data.explicit_inc_from = 0;
-  data.explicit_inc_to = 0;
-  if (data.reverse) data.offset = len;
-  data.len = len;
+/* Decide whether to use autoinc for an address involved in a memory op.
+   MODE is the mode of the accesses, REVERSE is true if we've decided to
+   perform the operation starting from the end, and LEN is the length of
+   the operation.  Don't override an earlier decision to set m_auto.  */
+
+void
+pieces_addr::decide_autoinc (machine_mode ARG_UNUSED (mode), bool reverse,
+			     HOST_WIDE_INT len)
+{
+  if (m_auto || m_obj == NULL_RTX)
+    return;
+
+  bool use_predec = (m_is_load
+		     ? USE_LOAD_PRE_DECREMENT (mode)
+		     : USE_STORE_PRE_DECREMENT (mode));
+  bool use_postinc = (m_is_load
+		      ? USE_LOAD_POST_INCREMENT (mode)
+		      : USE_STORE_POST_INCREMENT (mode));
+  machine_mode addr_mode = get_address_mode (m_obj);
+
+  if (use_predec && reverse)
+    {
+      m_addr = copy_to_mode_reg (addr_mode,
+				 plus_constant (addr_mode,
+						m_addr, len));
+      m_auto = true;
+      m_explicit_inc = -1;
+    }
+  else if (use_postinc && !reverse)
+    {
+      m_addr = copy_to_mode_reg (addr_mode, m_addr);
+      m_auto = true;
+      m_explicit_inc = 1;
+    }
+  else if (CONSTANT_P (m_addr))
+    m_addr = copy_to_mode_reg (addr_mode, m_addr);
+}
+
+/* Adjust the address to refer to the data at OFFSET in MODE.  If we
+   are using autoincrement for this address, we don't add the offset,
+   but we still modify the MEM's properties.  */
+
+rtx
+pieces_addr::adjust (machine_mode mode, HOST_WIDE_INT offset)
+{
+  if (m_constfn)
+    return m_constfn (m_cfndata, offset, mode);
+  if (m_obj == NULL_RTX)
+    return NULL_RTX;
+  if (m_auto)
+    return adjust_automodify_address (m_obj, mode, m_addr, offset);
+  else
+    return adjust_address (m_obj, mode, offset);
+}
+
+/* Emit an add instruction to increment the address by SIZE.  */
+
+void
+pieces_addr::increment_address (HOST_WIDE_INT size)
+{
+  rtx amount = gen_int_mode (size, GET_MODE (m_addr));
+  emit_insn (gen_add2_insn (m_addr, amount));
+}
+
+/* If we are supposed to decrement the address after each access, emit code
+   to do so now.  Increment by SIZE (which should have the correct sign
+   already).  */
+
+void
+pieces_addr::maybe_predec (HOST_WIDE_INT size)
+{
+  if (m_explicit_inc >= 0)
+    return;
+  gcc_assert (HAVE_PRE_DECREMENT);
+  increment_address (size);
+}
+
+/* If we are supposed to increment the address after each access, emit code
+   to do so now.  Increment by SIZE.  */
+
+void
+pieces_addr::maybe_postinc (HOST_WIDE_INT size)
+{
+  if (m_explicit_inc <= 0)
+    return;
+  gcc_assert (HAVE_POST_INCREMENT);
+  increment_address (size);
+}
+
+/* This class describes a block operation to be performed in pieces;
+   derived classes implement the specifics of each operation.  */
+
+class op_by_pieces_d
+{
+ protected:
+  pieces_addr m_to, m_from;
+  unsigned HOST_WIDE_INT m_len;
+  HOST_WIDE_INT m_offset;
+  unsigned int m_align;
+  unsigned int m_max_size;
+  bool m_reverse;
+
+  /* Virtual functions, overridden by derived classes for the specific
+     operation.  */
+  virtual void generate (rtx, rtx, machine_mode) = 0;
+  virtual bool prepare_mode (machine_mode, unsigned int) = 0;
+  virtual void finish_mode (machine_mode)
+  {
+  }
+
+ public:
+  op_by_pieces_d (rtx, bool, rtx, bool, by_pieces_constfn, void *,
+		  unsigned HOST_WIDE_INT, unsigned int);
+  void run ();
+};
+
+/* The constructor for an op_by_pieces_d structure.  We require two
+   objects named TO and FROM, which are identified as loads or stores
+   by TO_LOAD and FROM_LOAD.  If FROM is a load, the optional FROM_CFN
+   and its associated FROM_CFN_DATA can be used to replace loads with
+   constant values.  LEN describes the length of the operation.  */
+
+op_by_pieces_d::op_by_pieces_d (rtx to, bool to_load,
+				rtx from, bool from_load,
+				by_pieces_constfn from_cfn,
+				void *from_cfn_data,
+				unsigned HOST_WIDE_INT len,
+				unsigned int align)
+  : m_to (to, to_load, NULL, NULL),
+    m_from (from, from_load, from_cfn, from_cfn_data),
+    m_len (len), m_max_size (MOVE_MAX_PIECES + 1)
+{
+  int toi = m_to.get_addr_inc ();
+  int fromi = m_from.get_addr_inc ();
+  if (toi >= 0 && fromi >= 0)
+    m_reverse = false;
+  else if (toi <= 0 && fromi <= 0)
+    m_reverse = true;
+  else
+    gcc_unreachable ();
+
+  m_offset = m_reverse ? len : 0;
+  align = MIN (to ? MEM_ALIGN (to) : align,
+	       from ? MEM_ALIGN (from) : align);
 
   /* If copying requires more than two move insns,
      copy addresses to registers (to make displacements shorter)
      and use post-increment if available.  */
-  if (!(data.autinc_from && data.autinc_to)
-      && move_by_pieces_ninsns (len, align, max_size) > 2)
+  if (by_pieces_ninsns (len, align, m_max_size, MOVE_BY_PIECES) > 2)
     {
-      /* Find the mode of the largest move...
-	 MODE might not be used depending on the definitions of the
-	 USE_* macros below.  */
-      machine_mode mode ATTRIBUTE_UNUSED
-	= widest_int_mode_for_size (max_size);
+      /* Find the mode of the largest access.  */
+      machine_mode mode = widest_int_mode_for_size (m_max_size);
 
-      if (USE_LOAD_PRE_DECREMENT (mode) && data.reverse && ! data.autinc_from)
-	{
-	  data.from_addr = copy_to_mode_reg (from_addr_mode,
-					     plus_constant (from_addr_mode,
-							    from_addr, len));
-	  data.autinc_from = 1;
-	  data.explicit_inc_from = -1;
-	}
-      if (USE_LOAD_POST_INCREMENT (mode) && ! data.autinc_from)
-	{
-	  data.from_addr = copy_to_mode_reg (from_addr_mode, from_addr);
-	  data.autinc_from = 1;
-	  data.explicit_inc_from = 1;
-	}
-      if (!data.autinc_from && CONSTANT_P (from_addr))
-	data.from_addr = copy_to_mode_reg (from_addr_mode, from_addr);
-      if (USE_STORE_PRE_DECREMENT (mode) && data.reverse && ! data.autinc_to)
-	{
-	  data.to_addr = copy_to_mode_reg (to_addr_mode,
-					   plus_constant (to_addr_mode,
-							  to_addr, len));
-	  data.autinc_to = 1;
-	  data.explicit_inc_to = -1;
-	}
-      if (USE_STORE_POST_INCREMENT (mode) && ! data.reverse && ! data.autinc_to)
-	{
-	  data.to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
-	  data.autinc_to = 1;
-	  data.explicit_inc_to = 1;
-	}
-      if (!data.autinc_to && CONSTANT_P (to_addr))
-	data.to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
+      m_from.decide_autoinc (mode, m_reverse, len);
+      m_to.decide_autoinc (mode, m_reverse, len);
     }
 
   align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
+  m_align = align;
+}
 
-  /* First move what we can in the largest integer mode, then go to
-     successively smaller modes.  */
+/* This function contains the main loop used for expanding a block
+   operation.  First perform what we can in the largest integer mode,
+   then go to successively smaller modes.  For every access, call
+   the virtual function generate with the two operands and the mode.  */
 
-  while (max_size > 1 && data.len > 0)
+void
+op_by_pieces_d::run ()
+{
+  while (m_max_size > 1 && m_len > 0)
     {
-      machine_mode mode = widest_int_mode_for_size (max_size);
+      machine_mode mode = widest_int_mode_for_size (m_max_size);
 
       if (mode == VOIDmode)
 	break;
 
-      icode = optab_handler (mov_optab, mode);
-      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
-	move_by_pieces_1 (GEN_FCN (icode), mode, &data);
+      if (prepare_mode (mode, m_align))
+	{
+	  unsigned int size = GET_MODE_SIZE (mode);
+	  rtx to1 = NULL_RTX, from1;
 
-      max_size = GET_MODE_SIZE (mode);
+	  while (m_len >= size)
+	    {
+	      if (m_reverse)
+		m_offset -= size;
+
+	      to1 = m_to.adjust (mode, m_offset);
+	      from1 = m_from.adjust (mode, m_offset);
+
+	      m_to.maybe_predec (-(HOST_WIDE_INT)size);
+	      m_from.maybe_predec (-(HOST_WIDE_INT)size);
+
+	      generate (to1, from1, mode);
+
+	      m_to.maybe_postinc (size);
+	      m_from.maybe_postinc (size);
+
+	      if (!m_reverse)
+		m_offset += size;
+
+	      m_len -= size;
+	    }
+
+	  finish_mode (mode);
+	}
+
+      m_max_size = GET_MODE_SIZE (mode);
     }
 
   /* The code above should have handled everything.  */
-  gcc_assert (!data.len);
+  gcc_assert (!m_len);
+}
+
+/* Derived class from op_by_pieces_d, providing support for block move
+   operations.  */
+
+class move_by_pieces_d : public op_by_pieces_d
+{
+  insn_gen_fn m_gen_fun;
+  void generate (rtx, rtx, machine_mode);
+  bool prepare_mode (machine_mode, unsigned int);
+
+ public:
+  move_by_pieces_d (rtx to, rtx from, unsigned HOST_WIDE_INT len,
+		    unsigned int align)
+    : op_by_pieces_d (to, false, from, true, NULL, NULL, len, align)
+  {
+  }
+  rtx finish_endp (int);
+};
+
+/* Return true if MODE can be used for a set of copies, given an
+   alignment ALIGN.  Prepare whatever data is necessary for later
+   calls to generate.  */
+
+bool
+move_by_pieces_d::prepare_mode (machine_mode mode, unsigned int align)
+{
+  insn_code icode = optab_handler (mov_optab, mode);
+  m_gen_fun = GEN_FCN (icode);
+  return icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode);
+}
+
+/* A callback used when iterating for a move_by_pieces operation.
+   OP0 and OP1 are the destination and source for the move in MODE.
+   If OP0 is NULL, this means we should generate a push; otherwise
+   emit the move using the insn gen function that prepare_mode
+   recorded for MODE.  */
+
+void
+move_by_pieces_d::generate (rtx op0, rtx op1, machine_mode mode)
+{
+#ifdef PUSH_ROUNDING
+  if (op0 == NULL_RTX)
+    {
+      emit_single_push_insn (mode, op1, NULL);
+      return;
+    }
+#endif
+  emit_insn (m_gen_fun (op0, op1));
+}
+
+/* Perform the final adjustment at the end of a string to obtain the
+   correct return value for the block operation.  If ENDP is 1 return
+   memory at the end ala mempcpy, and if ENDP is 2 return memory at the
+   end minus one byte ala stpcpy.  */
+
+rtx
+move_by_pieces_d::finish_endp (int endp)
+{
+  gcc_assert (!m_reverse);
+  if (endp == 2)
+    {
+      m_to.maybe_postinc (-1);
+      --m_offset;
+    }
+  return m_to.adjust (QImode, m_offset);
+}
+
+/* Generate several move instructions to copy LEN bytes from block FROM to
+   block TO.  (These are MEM rtx's with BLKmode).
+
+   If PUSH_ROUNDING is defined and TO is NULL, emit_single_push_insn is
+   used to push FROM to the stack.
+
+   ALIGN is maximum stack alignment we can assume.
+
+   If ENDP is 0 return to, if ENDP is 1 return memory at the end ala
+   mempcpy, and if ENDP is 2 return memory at the end minus one byte ala
+   stpcpy.  */
+
+rtx
+move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
+		unsigned int align, int endp)
+{
+#ifndef PUSH_ROUNDING
+  if (to == NULL)
+    gcc_unreachable ();
+#endif
+
+  move_by_pieces_d data (to, from, len, align);
+
+  data.run ();
 
   if (endp)
+    return data.finish_endp (endp);
+  else
+    return to;
+}
+
+/* Derived class from op_by_pieces_d, providing support for block store
+   operations.  */
+
+class store_by_pieces_d : public op_by_pieces_d
+{
+  insn_gen_fn m_gen_fun;
+  void generate (rtx, rtx, machine_mode);
+  bool prepare_mode (machine_mode, unsigned int);
+
+ public:
+  store_by_pieces_d (rtx to, by_pieces_constfn cfn, void *cfn_data,
+		     unsigned HOST_WIDE_INT len, unsigned int align)
+    : op_by_pieces_d (to, false, NULL_RTX, true, cfn, cfn_data, len, align)
+  {
+  }
+  rtx finish_endp (int);
+};
+
+/* Return true if MODE can be used for a set of stores, given an
+   alignment ALIGN.  Prepare whatever data is necessary for later
+   calls to generate.  */
+
+bool
+store_by_pieces_d::prepare_mode (machine_mode mode, unsigned int align)
+{
+  insn_code icode = optab_handler (mov_optab, mode);
+  m_gen_fun = GEN_FCN (icode);
+  return icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode);
+}
+
+/* A callback used when iterating for a store_by_pieces_operation.
+   OP0 is the destination piece and OP1 is the constant value to store
+   in MODE; emit the store using the insn gen function prepared for
+   this mode by prepare_mode.  */
+
+void
+store_by_pieces_d::generate (rtx op0, rtx op1, machine_mode)
+{
+  emit_insn (m_gen_fun (op0, op1));
+}
+
+/* Perform the final adjustment at the end of a string to obtain the
+   correct return value for the block operation.  If ENDP is 1 return
+   memory at the end ala mempcpy, and if ENDP is 2 return memory at
+   the end minus one byte ala stpcpy.  */
+
+rtx
+store_by_pieces_d::finish_endp (int endp)
+{
+  gcc_assert (!m_reverse);
+  if (endp == 2)
     {
-      rtx to1;
+      m_to.maybe_postinc (-1);
+      --m_offset;
+    }
+  return m_to.adjust (QImode, m_offset);
+}
 
-      gcc_assert (!data.reverse);
-      if (data.autinc_to)
+/* Determine whether the LEN bytes generated by CONSTFUN can be
+   stored to memory using several move instructions.  CONSTFUNDATA is
+   a pointer which will be passed as argument in every CONSTFUN call.
+   ALIGN is maximum alignment we can assume.  MEMSETP is true if this is
+   a memset operation and false if it's a copy of a constant string.
+   Return nonzero if a call to store_by_pieces should succeed.  */
+
+int
+can_store_by_pieces (unsigned HOST_WIDE_INT len,
+		     rtx (*constfun) (void *, HOST_WIDE_INT, machine_mode),
+		     void *constfundata, unsigned int align, bool memsetp)
+{
+  unsigned HOST_WIDE_INT l;
+  unsigned int max_size;
+  HOST_WIDE_INT offset = 0;
+  machine_mode mode;
+  enum insn_code icode;
+  int reverse;
+  /* cst is set but not used if LEGITIMATE_CONSTANT doesn't use it.  */
+  rtx cst ATTRIBUTE_UNUSED;
+
+  if (len == 0)
+    return 1;
+
+  if (!targetm.use_by_pieces_infrastructure_p (len, align,
+					       memsetp
+						 ? SET_BY_PIECES
+						 : STORE_BY_PIECES,
+					       optimize_insn_for_speed_p ()))
+    return 0;
+
+  align = alignment_for_piecewise_move (STORE_MAX_PIECES, align);
+
+  /* We would first store what we can in the largest integer mode, then go to
+     successively smaller modes.  */
+
+  for (reverse = 0;
+       reverse <= (HAVE_PRE_DECREMENT || HAVE_POST_DECREMENT);
+       reverse++)
+    {
+      l = len;
+      max_size = STORE_MAX_PIECES + 1;
+      while (max_size > 1 && l > 0)
 	{
-	  if (endp == 2)
+	  mode = widest_int_mode_for_size (max_size);
+
+	  if (mode == VOIDmode)
+	    break;
+
+	  icode = optab_handler (mov_optab, mode);
+	  if (icode != CODE_FOR_nothing
+	      && align >= GET_MODE_ALIGNMENT (mode))
 	    {
-	      if (HAVE_POST_INCREMENT && data.explicit_inc_to > 0)
-		emit_insn (gen_add2_insn (data.to_addr, constm1_rtx));
-	      else
-		data.to_addr = copy_to_mode_reg (to_addr_mode,
-						 plus_constant (to_addr_mode,
-								data.to_addr,
-								-1));
+	      unsigned int size = GET_MODE_SIZE (mode);
+
+	      while (l >= size)
+		{
+		  if (reverse)
+		    offset -= size;
+
+		  cst = (*constfun) (constfundata, offset, mode);
+		  if (!targetm.legitimate_constant_p (mode, cst))
+		    return 0;
+
+		  if (!reverse)
+		    offset += size;
+
+		  l -= size;
+		}
 	    }
-	  to1 = adjust_automodify_address (data.to, QImode, data.to_addr,
-					   data.offset);
-	}
-      else
-	{
-	  if (endp == 2)
-	    --data.offset;
-	  to1 = adjust_address (data.to, QImode, data.offset);
+
+	  max_size = GET_MODE_SIZE (mode);
 	}
-      return to1;
+
+      /* The code above should have handled everything.  */
+      gcc_assert (!l);
+    }
+
+  return 1;
+}
+
+/* Generate several move instructions to store LEN bytes generated by
+   CONSTFUN to block TO.  (A MEM rtx with BLKmode).  CONSTFUNDATA is a
+   pointer which will be passed as argument in every CONSTFUN call.
+   ALIGN is maximum alignment we can assume.  MEMSETP is true if this is
+   a memset operation and false if it's a copy of a constant string.
+   If ENDP is 0 return TO, if ENDP is 1 return memory at the end ala
+   mempcpy, and if ENDP is 2 return memory at the end minus one byte
+   ala stpcpy.  */
+
+rtx
+store_by_pieces (rtx to, unsigned HOST_WIDE_INT len,
+		 rtx (*constfun) (void *, HOST_WIDE_INT, machine_mode),
+		 void *constfundata, unsigned int align, bool memsetp, int endp)
+{
+  if (len == 0)
+    {
+      gcc_assert (endp != 2);
+      return to;
     }
+
+  gcc_assert (targetm.use_by_pieces_infrastructure_p
+		(len, align,
+		 memsetp ? SET_BY_PIECES : STORE_BY_PIECES,
+		 optimize_insn_for_speed_p ()));
+
+  store_by_pieces_d data (to, constfun, constfundata, len, align);
+  data.run ();
+
+  if (endp)
+    return data.finish_endp (endp);
   else
-    return data.to;
+    return to;
 }
 
-/* Return number of insns required to move L bytes by pieces.
-   ALIGN (in bits) is maximum alignment we can assume.  */
+/* Callback routine for clear_by_pieces.
+   Return const0_rtx unconditionally.  */
 
-unsigned HOST_WIDE_INT
-move_by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align,
-		       unsigned int max_size)
+static rtx
+clear_by_pieces_1 (void *, HOST_WIDE_INT, machine_mode)
 {
-  unsigned HOST_WIDE_INT n_insns = 0;
+  return const0_rtx;
+}
 
-  align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
+/* Generate several move instructions to clear LEN bytes of block TO.  (A MEM
+   rtx with BLKmode).  ALIGN is maximum alignment we can assume.  */
 
-  while (max_size > 1 && l > 0)
-    {
-      machine_mode mode;
-      enum insn_code icode;
+static void
+clear_by_pieces (rtx to, unsigned HOST_WIDE_INT len, unsigned int align)
+{
+  if (len == 0)
+    return;
 
-      mode = widest_int_mode_for_size (max_size);
+  store_by_pieces_d data (to, clear_by_pieces_1, NULL, len, align);
+  data.run ();
+}
 
-      if (mode == VOIDmode)
-	break;
+/* Derived class from op_by_pieces_d, providing support for block
+   compare operations.  It stores the fail label to jump to in case of
+   miscomparison, and for branch ratios greater than 1, an accumulator
+   and the current and maximum counts before emitting another branch.  */
 
-      icode = optab_handler (mov_optab, mode);
-      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
-	n_insns += l / GET_MODE_SIZE (mode), l %= GET_MODE_SIZE (mode);
+class compare_by_pieces_d : public op_by_pieces_d
+{
+  rtx_code_label *m_fail_label;
+  rtx m_accumulator;
+  int m_count, m_batch;
 
-      max_size = GET_MODE_SIZE (mode);
+  void generate (rtx, rtx, machine_mode);
+  bool prepare_mode (machine_mode, unsigned int);
+  void finish_mode (machine_mode);
+ public:
+  compare_by_pieces_d (rtx op0, rtx op1, by_pieces_constfn op1_cfn,
+		       void *op1_cfn_data, HOST_WIDE_INT len, int align,
+		       rtx_code_label *fail_label)
+    : op_by_pieces_d (op0, true, op1, true, op1_cfn, op1_cfn_data, len, align)
+  {
+    m_fail_label = fail_label;
+  }
+};
+
+/* A callback used when iterating for a compare_by_pieces_operation.
+   OP0 and OP1 are the values that have been loaded and should be
+   compared in MODE.  */
+
+void
+compare_by_pieces_d::generate (rtx op0, rtx op1, machine_mode mode)
+{
+  if (m_batch > 1)
+    {
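+      /* Accumulate the differences of up to M_BATCH pieces with
+         subtract/IOR, so that only one branch is emitted per batch.  */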
+      rtx temp = expand_binop (mode, sub_optab, op0, op1, NULL_RTX,
+			       true, OPTAB_LIB_WIDEN);
+      if (m_count != 0)
+	temp = expand_binop (mode, ior_optab, m_accumulator, temp, temp,
+			     true, OPTAB_LIB_WIDEN);
+      m_accumulator = temp;
+
+      if (++m_count < m_batch)
+	return;
+
+      m_count = 0;
+      op0 = m_accumulator;
+      op1 = const0_rtx;
+      m_accumulator = NULL_RTX;
     }
+  do_compare_rtx_and_jump (op0, op1, NE, true, mode, NULL_RTX, NULL,
+			   m_fail_label, -1);
+}
 
-  gcc_assert (!l);
-  return n_insns;
+/* Return true if MODE can be used for a set of moves and comparisons,
+   given an alignment ALIGN.  Prepare whatever data is necessary for
+   later calls to generate.  */
+
+bool
+compare_by_pieces_d::prepare_mode (machine_mode mode, unsigned int align)
+{
+  insn_code icode = optab_handler (mov_optab, mode);
+  if (icode == CODE_FOR_nothing
+      || align < GET_MODE_ALIGNMENT (mode)
+      || !can_compare_p (EQ, mode, ccp_jump))
+    return false;
+  m_batch = targetm.compare_by_pieces_branch_ratio (mode);
+  if (m_batch < 0)
+    return false;
+  m_accumulator = NULL_RTX;
+  m_count = 0;
+  return true;
 }
 
-/* Subroutine of move_by_pieces.  Move as many bytes as appropriate
-   with move instructions for mode MODE.  GENFUN is the gen_... function
-   to make a move insn for that mode.  DATA has all the other info.  */
+/* Called after expanding a series of comparisons in MODE.  If we have
+   accumulated results for which we haven't emitted a branch yet, do
+   so now.  */
 
-static void
-move_by_pieces_1 (insn_gen_fn genfun, machine_mode mode,
-		  struct move_by_pieces_d *data)
+void
+compare_by_pieces_d::finish_mode (machine_mode mode)
 {
-  unsigned int size = GET_MODE_SIZE (mode);
-  rtx to1 = NULL_RTX, from1;
+  if (m_accumulator != NULL_RTX)
+    do_compare_rtx_and_jump (m_accumulator, const0_rtx, NE, true, mode,
+			     NULL_RTX, NULL, m_fail_label, -1);
+}
 
-  while (data->len >= size)
-    {
-      if (data->reverse)
-	data->offset -= size;
+/* Generate several move instructions to compare LEN bytes from blocks
+   ARG0 and ARG1.  (These are MEM rtx's with BLKmode).
 
-      if (data->to)
-	{
-	  if (data->autinc_to)
-	    to1 = adjust_automodify_address (data->to, mode, data->to_addr,
-					     data->offset);
-	  else
-	    to1 = adjust_address (data->to, mode, data->offset);
-	}
+   Store the comparison result (zero if the blocks compare equal,
+   nonzero otherwise) in TARGET if convenient, and return the rtx
+   holding the result.
 
-      if (data->autinc_from)
-	from1 = adjust_automodify_address (data->from, mode, data->from_addr,
-					   data->offset);
-      else
-	from1 = adjust_address (data->from, mode, data->offset);
+   ALIGN is the maximum alignment we can assume for both blocks.
 
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
-	emit_insn (gen_add2_insn (data->to_addr,
-				  gen_int_mode (-(HOST_WIDE_INT) size,
-						GET_MODE (data->to_addr))));
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_from < 0)
-	emit_insn (gen_add2_insn (data->from_addr,
-				  gen_int_mode (-(HOST_WIDE_INT) size,
-						GET_MODE (data->from_addr))));
+   Optionally, the caller can pass a constfn and associated data in A1_CFN
+   and A1_CFN_DATA, describing that the second operand being compared is a
+   known constant and how to obtain its data.  */
 
-      if (data->to)
-	emit_insn ((*genfun) (to1, from1));
-      else
-	{
-#ifdef PUSH_ROUNDING
-	  emit_single_push_insn (mode, from1, NULL);
-#else
-	  gcc_unreachable ();
-#endif
-	}
+static rtx
+compare_by_pieces (rtx arg0, rtx arg1, unsigned HOST_WIDE_INT len,
+		   rtx target, unsigned int align,
+		   by_pieces_constfn a1_cfn, void *a1_cfn_data)
+{
+  rtx_code_label *fail_label = gen_label_rtx ();
+  rtx_code_label *end_label = gen_label_rtx ();
 
-      if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
-	emit_insn (gen_add2_insn (data->to_addr,
-				  gen_int_mode (size,
-						GET_MODE (data->to_addr))));
-      if (HAVE_POST_INCREMENT && data->explicit_inc_from > 0)
-	emit_insn (gen_add2_insn (data->from_addr,
-				  gen_int_mode (size,
-						GET_MODE (data->from_addr))));
+  if (target == NULL_RTX
+      || !REG_P (target) || REGNO (target) < FIRST_PSEUDO_REGISTER)
+    target = gen_reg_rtx (TYPE_MODE (integer_type_node));
 
-      if (! data->reverse)
-	data->offset += size;
+  compare_by_pieces_d data (arg0, arg1, a1_cfn, a1_cfn_data, len, align,
+			    fail_label);
 
-      data->len -= size;
-    }
+  data.run ();
+
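+  /* If we fall through here, all pieces compared equal; set the result
+     to zero.  Any miscomparison jumped to FAIL_LABEL, where the result
+     is set to one.  */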
+  emit_move_insn (target, const0_rtx);
+  emit_jump (end_label);
+  emit_barrier ();
+  emit_label (fail_label);
+  emit_move_insn (target, const1_rtx);
+  emit_label (end_label);
+
+  return target;
 }
 \f
 /* Emit code to move a block Y to a block X.  This may be done with
@@ -1068,8 +1552,7 @@ emit_block_move_hints (rtx x, rtx y, rtx
   unsigned int align;
 
   gcc_assert (size);
-  if (CONST_INT_P (size)
-      && INTVAL (size) == 0)
+  if (CONST_INT_P (size) && INTVAL (size) == 0)
     return 0;
 
   switch (method)
@@ -1463,6 +1946,116 @@ emit_block_move_via_loop (rtx x, rtx y,
   emit_cmp_and_jump_insns (iter, size, LT, NULL_RTX, iter_mode,
 			   true, top_label, REG_BR_PROB_BASE * 90 / 100);
 }
+
+/* Try to expand cmpstrn or cmpmem operation ICODE with the given operands.
+   ARG3_TYPE is the type of ARG3_RTX.  Return the result rtx on success,
+   otherwise return null.  */
+
+rtx
+expand_cmpstrn_or_cmpmem (insn_code icode, rtx target, rtx arg1_rtx,
+			  rtx arg2_rtx, tree arg3_type, rtx arg3_rtx,
+			  HOST_WIDE_INT align)
+{
+  machine_mode insn_mode = insn_data[icode].operand[0].mode;
+
+  if (target && (!REG_P (target) || HARD_REGISTER_P (target)))
+    target = NULL_RTX;
+
+  struct expand_operand ops[5];
+  create_output_operand (&ops[0], target, insn_mode);
+  create_fixed_operand (&ops[1], arg1_rtx);
+  create_fixed_operand (&ops[2], arg2_rtx);
+  create_convert_operand_from (&ops[3], arg3_rtx, TYPE_MODE (arg3_type),
+			       TYPE_UNSIGNED (arg3_type));
+  create_integer_operand (&ops[4], align);
+  if (maybe_expand_insn (icode, 5, ops))
+    return ops[0].value;
+  return NULL_RTX;
+}
+
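+/* Try to emit a block comparison of X and Y via a cmpmem insn pattern.
+   LEN (an rtx of type LEN_TYPE) gives the number of bytes and ALIGN
+   the known common alignment.  Return the result rtx on success,
+   otherwise NULL_RTX.  */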
+static rtx
+emit_block_cmp_via_cmpmem (rtx x, rtx y, rtx len, tree len_type, rtx target,
+			   unsigned align)
+{
+  /* Note: The cmpstrnsi pattern, if it exists, is not suitable for
+     implementing memcmp because it will stop if it encounters two
+     zero bytes.  */
+  insn_code icode = direct_optab_handler (cmpmem_optab, SImode);
+
+  if (icode == CODE_FOR_nothing)
+    return NULL_RTX;
+
+  return expand_cmpstrn_or_cmpmem (icode, target, x, y,
+				   len_type, len, align);
+}
+
+/* Emit code to compare a block Y to a block X.  This may be done with
+   string-compare instructions, with multiple scalar instructions,
+   or with a library call.
+
+   Both X and Y must be MEM rtx's.  LEN is an rtx that says how long
+   they are.  LEN_TYPE is the type of the expression that was used to
+   calculate it.
+
+   If EQUALITY_ONLY is true, it means we don't have to return the tri-state
+   value of a normal memcmp call; instead we can just compare for equality.
+   If FORCE_LIBCALL is true, we should emit a call to memcmp rather than
+   returning NULL_RTX.
+
+   Optionally, the caller can pass a constfn and associated data in Y_CFN
+   and Y_CFN_DATA, describing that the second operand being compared is a
+   known constant and how to obtain its data.
+   Return the result of the comparison, or NULL_RTX if we failed to
+   perform the operation.  */
+
+rtx
+emit_block_cmp_hints (rtx x, rtx y, rtx len, tree len_type, rtx target,
+		      bool equality_only, bool force_libcall,
+		      by_pieces_constfn y_cfn, void *y_cfndata)
+{
+  rtx result = 0;
+
+  if (CONST_INT_P (len) && INTVAL (len) == 0)
+    return const0_rtx;
+
+  gcc_assert (MEM_P (x) && MEM_P (y));
+  unsigned int align = MIN (MEM_ALIGN (x), MEM_ALIGN (y));
+  gcc_assert (align >= BITS_PER_UNIT);
+
+  x = adjust_address (x, BLKmode, 0);
+  y = adjust_address (y, BLKmode, 0);
+
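+  /* Try by-pieces expansion for small constant equality-only
+     comparisons first, then a cmpmem insn pattern, and finally (if
+     FORCE_LIBCALL) a call to the memcmp library function.  */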
+  if (equality_only
+      && CONST_INT_P (len)
+      && can_do_by_pieces (INTVAL (len), align, COMPARE_BY_PIECES))
+    result = compare_by_pieces (x, y, INTVAL (len), target, align,
+				y_cfn, y_cfndata);
+  else if ((result = emit_block_cmp_via_cmpmem (x, y, len, len_type,
+						target, align)) != NULL_RTX)
+    ;
+  else if (force_libcall)
+    {
+      gcc_assert (ADDR_SPACE_GENERIC_P (MEM_ADDR_SPACE (x)));
+      gcc_assert (ADDR_SPACE_GENERIC_P (MEM_ADDR_SPACE (y)));
+
+      result = target;
+      machine_mode result_mode = TYPE_MODE (integer_type_node);
+      if (result == NULL_RTX
+	  || !REG_P (result)
+	  || GET_MODE (result) != result_mode
+	  || REGNO (result) < FIRST_PSEUDO_REGISTER)
+	result = gen_reg_rtx (result_mode);
+
+      machine_mode size_mode = TYPE_MODE (sizetype);
+      len = convert_to_mode (size_mode, len, TYPE_UNSIGNED (sizetype));
+      emit_library_call_value (memcmp_libfunc, result, LCT_PURE,
+			       result_mode, 3,
+			       XEXP (x, 0), Pmode, XEXP (y, 0), Pmode,
+			       len, size_mode);
+    }
+
+  return result;
+}
 \f
 /* Copy all or part of a value X into registers starting at REGNO.
    The number of registers to be filled is NREGS.  */
@@ -2399,308 +2992,6 @@ get_def_for_expr_class (tree name, enum
   return def_stmt;
 }
 \f
-
-/* Determine whether the LEN bytes generated by CONSTFUN can be
-   stored to memory using several move instructions.  CONSTFUNDATA is
-   a pointer which will be passed as argument in every CONSTFUN call.
-   ALIGN is maximum alignment we can assume.  MEMSETP is true if this is
-   a memset operation and false if it's a copy of a constant string.
-   Return nonzero if a call to store_by_pieces should succeed.  */
-
-int
-can_store_by_pieces (unsigned HOST_WIDE_INT len,
-		     rtx (*constfun) (void *, HOST_WIDE_INT, machine_mode),
-		     void *constfundata, unsigned int align, bool memsetp)
-{
-  unsigned HOST_WIDE_INT l;
-  unsigned int max_size;
-  HOST_WIDE_INT offset = 0;
-  machine_mode mode;
-  enum insn_code icode;
-  int reverse;
-  /* cst is set but not used if LEGITIMATE_CONSTANT doesn't use it.  */
-  rtx cst ATTRIBUTE_UNUSED;
-
-  if (len == 0)
-    return 1;
-
-  if (!targetm.use_by_pieces_infrastructure_p (len, align,
-					       memsetp
-						 ? SET_BY_PIECES
-						 : STORE_BY_PIECES,
-					       optimize_insn_for_speed_p ()))
-    return 0;
-
-  align = alignment_for_piecewise_move (STORE_MAX_PIECES, align);
-
-  /* We would first store what we can in the largest integer mode, then go to
-     successively smaller modes.  */
-
-  for (reverse = 0;
-       reverse <= (HAVE_PRE_DECREMENT || HAVE_POST_DECREMENT);
-       reverse++)
-    {
-      l = len;
-      max_size = STORE_MAX_PIECES + 1;
-      while (max_size > 1 && l > 0)
-	{
-	  mode = widest_int_mode_for_size (max_size);
-
-	  if (mode == VOIDmode)
-	    break;
-
-	  icode = optab_handler (mov_optab, mode);
-	  if (icode != CODE_FOR_nothing
-	      && align >= GET_MODE_ALIGNMENT (mode))
-	    {
-	      unsigned int size = GET_MODE_SIZE (mode);
-
-	      while (l >= size)
-		{
-		  if (reverse)
-		    offset -= size;
-
-		  cst = (*constfun) (constfundata, offset, mode);
-		  if (!targetm.legitimate_constant_p (mode, cst))
-		    return 0;
-
-		  if (!reverse)
-		    offset += size;
-
-		  l -= size;
-		}
-	    }
-
-	  max_size = GET_MODE_SIZE (mode);
-	}
-
-      /* The code above should have handled everything.  */
-      gcc_assert (!l);
-    }
-
-  return 1;
-}
-
-/* Generate several move instructions to store LEN bytes generated by
-   CONSTFUN to block TO.  (A MEM rtx with BLKmode).  CONSTFUNDATA is a
-   pointer which will be passed as argument in every CONSTFUN call.
-   ALIGN is maximum alignment we can assume.  MEMSETP is true if this is
-   a memset operation and false if it's a copy of a constant string.
-   If ENDP is 0 return to, if ENDP is 1 return memory at the end ala
-   mempcpy, and if ENDP is 2 return memory the end minus one byte ala
-   stpcpy.  */
-
-rtx
-store_by_pieces (rtx to, unsigned HOST_WIDE_INT len,
-		 rtx (*constfun) (void *, HOST_WIDE_INT, machine_mode),
-		 void *constfundata, unsigned int align, bool memsetp, int endp)
-{
-  machine_mode to_addr_mode = get_address_mode (to);
-  struct store_by_pieces_d data;
-
-  if (len == 0)
-    {
-      gcc_assert (endp != 2);
-      return to;
-    }
-
-  gcc_assert (targetm.use_by_pieces_infrastructure_p
-		(len, align,
-		 memsetp
-		   ? SET_BY_PIECES
-		   : STORE_BY_PIECES,
-		 optimize_insn_for_speed_p ()));
-
-  data.constfun = constfun;
-  data.constfundata = constfundata;
-  data.len = len;
-  data.to = to;
-  store_by_pieces_1 (&data, align);
-  if (endp)
-    {
-      rtx to1;
-
-      gcc_assert (!data.reverse);
-      if (data.autinc_to)
-	{
-	  if (endp == 2)
-	    {
-	      if (HAVE_POST_INCREMENT && data.explicit_inc_to > 0)
-		emit_insn (gen_add2_insn (data.to_addr, constm1_rtx));
-	      else
-		data.to_addr = copy_to_mode_reg (to_addr_mode,
-						 plus_constant (to_addr_mode,
-								data.to_addr,
-								-1));
-	    }
-	  to1 = adjust_automodify_address (data.to, QImode, data.to_addr,
-					   data.offset);
-	}
-      else
-	{
-	  if (endp == 2)
-	    --data.offset;
-	  to1 = adjust_address (data.to, QImode, data.offset);
-	}
-      return to1;
-    }
-  else
-    return data.to;
-}
-
-/* Generate several move instructions to clear LEN bytes of block TO.  (A MEM
-   rtx with BLKmode).  ALIGN is maximum alignment we can assume.  */
-
-static void
-clear_by_pieces (rtx to, unsigned HOST_WIDE_INT len, unsigned int align)
-{
-  struct store_by_pieces_d data;
-
-  if (len == 0)
-    return;
-
-  data.constfun = clear_by_pieces_1;
-  data.constfundata = NULL;
-  data.len = len;
-  data.to = to;
-  store_by_pieces_1 (&data, align);
-}
-
-/* Callback routine for clear_by_pieces.
-   Return const0_rtx unconditionally.  */
-
-static rtx
-clear_by_pieces_1 (void *data ATTRIBUTE_UNUSED,
-		   HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
-		   machine_mode mode ATTRIBUTE_UNUSED)
-{
-  return const0_rtx;
-}
-
-/* Subroutine of clear_by_pieces and store_by_pieces.
-   Generate several move instructions to store LEN bytes of block TO.  (A MEM
-   rtx with BLKmode).  ALIGN is maximum alignment we can assume.  */
-
-static void
-store_by_pieces_1 (struct store_by_pieces_d *data ATTRIBUTE_UNUSED,
-		   unsigned int align ATTRIBUTE_UNUSED)
-{
-  machine_mode to_addr_mode = get_address_mode (data->to);
-  rtx to_addr = XEXP (data->to, 0);
-  unsigned int max_size = STORE_MAX_PIECES + 1;
-  enum insn_code icode;
-
-  data->offset = 0;
-  data->to_addr = to_addr;
-  data->autinc_to
-    = (GET_CODE (to_addr) == PRE_INC || GET_CODE (to_addr) == PRE_DEC
-       || GET_CODE (to_addr) == POST_INC || GET_CODE (to_addr) == POST_DEC);
-
-  data->explicit_inc_to = 0;
-  data->reverse
-    = (GET_CODE (to_addr) == PRE_DEC || GET_CODE (to_addr) == POST_DEC);
-  if (data->reverse)
-    data->offset = data->len;
-
-  /* If storing requires more than two move insns,
-     copy addresses to registers (to make displacements shorter)
-     and use post-increment if available.  */
-  if (!data->autinc_to
-      && move_by_pieces_ninsns (data->len, align, max_size) > 2)
-    {
-      /* Determine the main mode we'll be using.
-	 MODE might not be used depending on the definitions of the
-	 USE_* macros below.  */
-      machine_mode mode ATTRIBUTE_UNUSED
-	= widest_int_mode_for_size (max_size);
-
-      if (USE_STORE_PRE_DECREMENT (mode) && data->reverse && ! data->autinc_to)
-	{
-	  data->to_addr = copy_to_mode_reg (to_addr_mode,
-					    plus_constant (to_addr_mode,
-							   to_addr,
-							   data->len));
-	  data->autinc_to = 1;
-	  data->explicit_inc_to = -1;
-	}
-
-      if (USE_STORE_POST_INCREMENT (mode) && ! data->reverse
-	  && ! data->autinc_to)
-	{
-	  data->to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
-	  data->autinc_to = 1;
-	  data->explicit_inc_to = 1;
-	}
-
-      if ( !data->autinc_to && CONSTANT_P (to_addr))
-	data->to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
-    }
-
-  align = alignment_for_piecewise_move (STORE_MAX_PIECES, align);
-
-  /* First store what we can in the largest integer mode, then go to
-     successively smaller modes.  */
-
-  while (max_size > 1 && data->len > 0)
-    {
-      machine_mode mode = widest_int_mode_for_size (max_size);
-
-      if (mode == VOIDmode)
-	break;
-
-      icode = optab_handler (mov_optab, mode);
-      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
-	store_by_pieces_2 (GEN_FCN (icode), mode, data);
-
-      max_size = GET_MODE_SIZE (mode);
-    }
-
-  /* The code above should have handled everything.  */
-  gcc_assert (!data->len);
-}
-
-/* Subroutine of store_by_pieces_1.  Store as many bytes as appropriate
-   with move instructions for mode MODE.  GENFUN is the gen_... function
-   to make a move insn for that mode.  DATA has all the other info.  */
-
-static void
-store_by_pieces_2 (insn_gen_fn genfun, machine_mode mode,
-		   struct store_by_pieces_d *data)
-{
-  unsigned int size = GET_MODE_SIZE (mode);
-  rtx to1, cst;
-
-  while (data->len >= size)
-    {
-      if (data->reverse)
-	data->offset -= size;
-
-      if (data->autinc_to)
-	to1 = adjust_automodify_address (data->to, mode, data->to_addr,
-					 data->offset);
-      else
-	to1 = adjust_address (data->to, mode, data->offset);
-
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
-	emit_insn (gen_add2_insn (data->to_addr,
-				  gen_int_mode (-(HOST_WIDE_INT) size,
-						GET_MODE (data->to_addr))));
-
-      cst = (*data->constfun) (data->constfundata, data->offset, mode);
-      emit_insn ((*genfun) (to1, cst));
-
-      if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
-	emit_insn (gen_add2_insn (data->to_addr,
-				  gen_int_mode (size,
-						GET_MODE (data->to_addr))));
-
-      if (! data->reverse)
-	data->offset += size;
-
-      data->len -= size;
-    }
-}
-\f
 /* Write zeros through the storage of OBJECT.  If OBJECT has BLKmode, SIZE is
    its length in bytes.  */
 
Index: gcc/expr.h
===================================================================
--- gcc/expr.h	(revision 236113)
+++ gcc/expr.h	(working copy)
@@ -82,6 +82,8 @@ enum block_op_methods
   BLOCK_OP_TAILCALL
 };
 
+typedef rtx (*by_pieces_constfn) (void *, HOST_WIDE_INT, machine_mode);
+
 extern GTY(()) tree block_clear_fn;
 extern void init_block_move_fn (const char *);
 extern void init_block_clear_fn (const char *);
@@ -93,6 +95,8 @@ extern rtx emit_block_move_hints (rtx, r
 				  unsigned HOST_WIDE_INT,
 				  unsigned HOST_WIDE_INT,
 				  unsigned HOST_WIDE_INT);
+extern rtx emit_block_cmp_hints (rtx, rtx, rtx, tree, rtx, bool, bool,
+				 by_pieces_constfn, void *);
 extern bool emit_storent_insn (rtx to, rtx from);
 
 /* Copy all or part of a value X into registers starting at REGNO.
@@ -157,6 +161,11 @@ extern void use_regs (rtx *, int, int);
 /* Mark a PARALLEL as holding a parameter for the next CALL_INSN.  */
 extern void use_group_regs (rtx *, rtx);
 
+#ifdef GCC_INSN_CODES_H
+extern rtx expand_cmpstrn_or_cmpmem (insn_code, rtx, rtx, rtx, tree, rtx,
+				     HOST_WIDE_INT);
+#endif
+
 /* Write zeros through the storage of OBJECT.
    If OBJECT has BLKmode, SIZE is its length in bytes.  */
 extern rtx clear_storage (rtx, rtx, enum block_op_methods);
@@ -175,10 +184,6 @@ extern bool set_storage_via_setmem (rtx,
 				    unsigned HOST_WIDE_INT,
 				    unsigned HOST_WIDE_INT);
 
-extern unsigned HOST_WIDE_INT move_by_pieces_ninsns (unsigned HOST_WIDE_INT,
-						     unsigned int,
-						     unsigned int);
-
 /* Return nonzero if it is desirable to store LEN bytes generated by
    CONSTFUN with several move instructions by store_by_pieces
    function.  CONSTFUNDATA is a pointer which will be passed as argument
@@ -187,8 +192,7 @@ extern unsigned HOST_WIDE_INT move_by_pi
    MEMSETP is true if this is a real memset/bzero, not a copy
    of a const string.  */
 extern int can_store_by_pieces (unsigned HOST_WIDE_INT,
-				rtx (*) (void *, HOST_WIDE_INT,
-					 machine_mode),
+				by_pieces_constfn,
 				void *, unsigned int, bool);
 
 /* Generate several move instructions to store LEN bytes generated by
@@ -197,8 +201,7 @@ extern int can_store_by_pieces (unsigned
    ALIGN is maximum alignment we can assume.
    MEMSETP is true if this is a real memset/bzero, not a copy.
    Returns TO + LEN.  */
-extern rtx store_by_pieces (rtx, unsigned HOST_WIDE_INT,
-			    rtx (*) (void *, HOST_WIDE_INT, machine_mode),
+extern rtx store_by_pieces (rtx, unsigned HOST_WIDE_INT, by_pieces_constfn,
 			    void *, unsigned int, bool, int);
 
 /* Emit insns to set X from Y.  */
@@ -279,7 +282,7 @@ rtx get_personality_function (tree);
 /* Determine whether the LEN bytes can be moved by using several move
    instructions.  Return nonzero if a call to move_by_pieces should
    succeed.  */
-extern int can_move_by_pieces (unsigned HOST_WIDE_INT, unsigned int);
+extern bool can_move_by_pieces (unsigned HOST_WIDE_INT, unsigned int);
 
 extern unsigned HOST_WIDE_INT highest_pow2_factor (const_tree);
 
Index: gcc/target.def
===================================================================
--- gcc/target.def	(revision 236113)
+++ gcc/target.def	(working copy)
@@ -3397,8 +3397,9 @@ Both @var{size} and @var{alignment} are
 units.\n\
 \n\
 The parameter @var{op} is one of: @code{CLEAR_BY_PIECES},\n\
-@code{MOVE_BY_PIECES}, @code{SET_BY_PIECES}, @code{STORE_BY_PIECES}.\n\
-These describe the type of memory operation under consideration.\n\
+@code{MOVE_BY_PIECES}, @code{SET_BY_PIECES}, @code{STORE_BY_PIECES} or\n\
+@code{COMPARE_BY_PIECES}.  These describe the type of memory operation\n\
+under consideration.\n\
 \n\
 The parameter @var{speed_p} is true if the code is currently being\n\
 optimized for speed rather than size.\n\
@@ -3418,6 +3419,18 @@ move would be greater than that of a lib
  default_use_by_pieces_infrastructure_p)
 
 DEFHOOK
+(compare_by_pieces_branch_ratio,
+ "When expanding a block comparison in @var{mode}, GCC can try to reduce the\n\
+number of branches at the expense of more memory operations.  This hook\n\
+allows the target to override the default choice.  It should return the\n\
+factor by which branches should be reduced over the plain expansion with\n\
+one comparison per @var{mode}-sized piece.  A port can also prevent a\n\
+particular mode from being used for block comparisons by returning a\n\
+negative number from this hook.",
+ int, (machine_mode mode),
+ default_compare_by_pieces_branch_ratio)
+
+DEFHOOK
 (optab_supported_p,
  "Return true if the optimizers should use optab @var{op} with\n\
 modes @var{mode1} and @var{mode2} for optimization type @var{opt_type}.\n\
Index: gcc/target.h
===================================================================
--- gcc/target.h	(revision 236113)
+++ gcc/target.h	(working copy)
@@ -79,16 +79,23 @@ enum print_switch_type
 };
 
 /* Types of memory operation understood by the "by_pieces" infrastructure.
-   Used by the TARGET_USE_BY_PIECES_INFRASTRUCTURE_P target hook.  */
+   Used by the TARGET_USE_BY_PIECES_INFRASTRUCTURE_P target hook and
+   internally by the functions in expr.c.  */
 
 enum by_pieces_operation
 {
   CLEAR_BY_PIECES,
   MOVE_BY_PIECES,
   SET_BY_PIECES,
-  STORE_BY_PIECES
+  STORE_BY_PIECES,
+  COMPARE_BY_PIECES
 };
 
+extern unsigned HOST_WIDE_INT by_pieces_ninsns (unsigned HOST_WIDE_INT,
+						unsigned int,
+						unsigned int,
+						by_pieces_operation);
+
 typedef int (* print_switch_fn_type) (print_switch_type, const char *);
 
 /* An example implementation for ELF targets.  Defined in varasm.c  */
Index: gcc/targhooks.c
===================================================================
--- gcc/targhooks.c	(revision 236113)
+++ gcc/targhooks.c	(working copy)
@@ -1482,25 +1482,40 @@ default_use_by_pieces_infrastructure_p (
 
   switch (op)
     {
-      case CLEAR_BY_PIECES:
-	max_size = STORE_MAX_PIECES;
-	ratio = CLEAR_RATIO (speed_p);
-	break;
-      case MOVE_BY_PIECES:
-	max_size = MOVE_MAX_PIECES;
-	ratio = get_move_ratio (speed_p);
-	break;
-      case SET_BY_PIECES:
-	max_size = STORE_MAX_PIECES;
-	ratio = SET_RATIO (speed_p);
-	break;
-      case STORE_BY_PIECES:
-	max_size = STORE_MAX_PIECES;
-	ratio = get_move_ratio (speed_p);
-	break;
+    case CLEAR_BY_PIECES:
+      max_size = STORE_MAX_PIECES;
+      ratio = CLEAR_RATIO (speed_p);
+      break;
+    case MOVE_BY_PIECES:
+      max_size = MOVE_MAX_PIECES;
+      ratio = get_move_ratio (speed_p);
+      break;
+    case SET_BY_PIECES:
+      max_size = STORE_MAX_PIECES;
+      ratio = SET_RATIO (speed_p);
+      break;
+    case STORE_BY_PIECES:
+      max_size = STORE_MAX_PIECES;
+      ratio = get_move_ratio (speed_p);
+      break;
+    case COMPARE_BY_PIECES:
+      max_size = COMPARE_MAX_PIECES;
+      /* Pick a likely default, just as in get_move_ratio.  */
+      ratio = speed_p ? 15 : 3;
+      break;
     }
 
-  return move_by_pieces_ninsns (size, alignment, max_size + 1) < ratio;
+  return by_pieces_ninsns (size, alignment, max_size + 1, op) < ratio;
+}
+
+/* This hook controls code generation for expanding a memcmp operation by
+   pieces.  Return 1 for the normal pattern of compare/jump after each pair
+   of loads, or a higher number to reduce the number of branches.  */
+
+int
+default_compare_by_pieces_branch_ratio (machine_mode)
+{
+  return 1;
 }
 
 bool
Index: gcc/targhooks.h
===================================================================
--- gcc/targhooks.h	(revision 236113)
+++ gcc/targhooks.h	(working copy)
@@ -199,6 +199,7 @@ extern bool default_use_by_pieces_infras
 						    unsigned int,
 						    enum by_pieces_operation,
 						    bool);
+extern int default_compare_by_pieces_branch_ratio (machine_mode);
 
 extern bool default_profile_before_prologue (void);
 extern reg_class_t default_preferred_reload_class (rtx, reg_class_t);
Index: gcc/testsuite/gcc.dg/pr52171.c
===================================================================
--- gcc/testsuite/gcc.dg/pr52171.c	(revision 0)
+++ gcc/testsuite/gcc.dg/pr52171.c	(working copy)
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+/* { dg-final { scan-assembler-not "memcmp" } } */
+#include <string.h>
+struct A { int x; } a, b;
+
+extern char s[], t[];
+
+int foo ()
+{
+  return memcmp (&a, &b, sizeof (struct A)) == 0;
+}
Index: gcc/testsuite/gcc.target/i386/pr52171.c
===================================================================
--- gcc/testsuite/gcc.target/i386/pr52171.c	(revision 0)
+++ gcc/testsuite/gcc.target/i386/pr52171.c	(working copy)
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+/* { dg-final { scan-assembler-not "memcmp" } } */
+/* { dg-final { scan-assembler "1752394086" } } */
+
+/* This should turn into four compare/jump pairs with -m32, within the
+   limit of what the tuning considers acceptable for -O2.  */
+int cmp (char *p, char *q)
+{
+  char *pa = __builtin_assume_aligned (p, 4);
+  char *qa = __builtin_assume_aligned (q, 4);
+  if (__builtin_memcmp (pa, qa, 16) != 0)
+    return 1;
+  return 0;
+}
+/* Since we have fast unaligned access, we should make a single
+   constant comparison.  The constant becomes 1752394086.  */
+int cmp2 (char *p)
+{
+  if (__builtin_memcmp (p, "fish", 4) != 0)
+    return 1;
+  return 0;
+}
Index: gcc/tree-ssa-strlen.c
===================================================================
--- gcc/tree-ssa-strlen.c	(revision 236113)
+++ gcc/tree-ssa-strlen.c	(working copy)
@@ -44,6 +44,7 @@ along with GCC; see the file COPYING3.
 #include "params.h"
 #include "ipa-chkp.h"
 #include "tree-hash-traits.h"
+#include "builtins.h"
 
 /* A vector indexed by SSA_NAME_VERSION.  0 means unknown, positive value
    is an index into strinfo vector, negative value stands for
@@ -1843,6 +1844,83 @@ handle_builtin_memset (gimple_stmt_itera
   return false;
 }
 
+/* Handle a call to memcmp.  We try to handle small comparisons by
+   converting them to load and compare, and replacing the call to memcmp
+   with a __builtin_memcmp_eq call where possible.  */
+
+static bool
+handle_builtin_memcmp (gimple_stmt_iterator *gsi)
+{
+  gcall *stmt2 = as_a <gcall *> (gsi_stmt (*gsi));
+  tree res = gimple_call_lhs (stmt2);
+  tree arg1 = gimple_call_arg (stmt2, 0);
+  tree arg2 = gimple_call_arg (stmt2, 1);
+  tree len = gimple_call_arg (stmt2, 2);
+  unsigned HOST_WIDE_INT leni;
+  unsigned align1, align2;
+  use_operand_p use_p;
+  imm_use_iterator iter;
+
+  if (!res)
+    return true;
+
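+  /* The conversion is only valid if every use of the result is an
+     [in]equality comparison against zero; otherwise keep the ordinary
+     tri-state memcmp.  */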
+  FOR_EACH_IMM_USE_FAST (use_p, iter, res)
+    {
+      gimple *ustmt = USE_STMT (use_p);
+
+      if (gimple_code (ustmt) == GIMPLE_ASSIGN)
+	{
+	  gassign *asgn = as_a <gassign *> (ustmt);
+	  tree_code code = gimple_assign_rhs_code (asgn);
+	  if ((code != EQ_EXPR && code != NE_EXPR)
+	      || !integer_zerop (gimple_assign_rhs2 (asgn)))
+	    return true;
+	}
+      else if (gimple_code (ustmt) == GIMPLE_COND)
+	{
+	  tree_code code = gimple_cond_code (ustmt);
+	  if ((code != EQ_EXPR && code != NE_EXPR)
+	      || !integer_zerop (gimple_cond_rhs (ustmt)))
+	    return true;
+	}
+      else
+	return true;
+    }
+
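+  /* For small power-of-two lengths with known sufficient alignment,
+     expand the comparison inline as a load/load/compare sequence.  */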
+  if (tree_fits_uhwi_p (len)
+      && (leni = tree_to_uhwi (len)) <= GET_MODE_SIZE (word_mode)
+      && exact_log2 (leni) != -1
+      && (align1 = get_pointer_alignment (arg1)) >= leni * BITS_PER_UNIT
+      && (align2 = get_pointer_alignment (arg2)) >= leni * BITS_PER_UNIT)
+    {
+      location_t loc = gimple_location (stmt2);
+      tree type, off;
+      type = build_nonstandard_integer_type (leni * BITS_PER_UNIT, 1);
+      gcc_assert (GET_MODE_SIZE (TYPE_MODE (type)) == leni);
+      tree ptrtype = build_pointer_type_for_mode (char_type_node, ptr_mode,
+						  true);
+      off = build_int_cst (ptrtype, 0);
+      arg1 = build2_loc (loc, MEM_REF, type, arg1, off);
+      arg2 = build2_loc (loc, MEM_REF, type, arg2, off);
+      res = fold_convert_loc (loc, TREE_TYPE (res),
+			      fold_build2_loc (loc, NE_EXPR,
+					       boolean_type_node,
+					       arg1, arg2));
+      gimplify_and_update_call_from_tree (gsi, res);
+      return false;
+    }
+
+  tree newfn = builtin_decl_explicit (BUILT_IN_MEMCMP_EQ);
+  gcall *repl = gimple_build_call (newfn, 3, arg1, arg2, len);
+  gimple_call_set_lhs (repl, res);
+  gimple_set_location (repl, gimple_location (stmt2));
+  gimple_set_vuse (repl, gimple_vuse (stmt2));
+  *gimple_call_use_set (repl) = *gimple_call_use_set (stmt2);
+  gcc_assert (!gimple_vdef (stmt2));
+  gsi_replace (gsi, repl, false);
+  return false;
+}
+
 /* Handle a POINTER_PLUS_EXPR statement.
    For p = "abcd" + 2; compute associated length, or if
    p = q + off is pointing to a '\0' character of a string, call
@@ -2100,6 +2178,10 @@ strlen_optimize_stmt (gimple_stmt_iterat
 	    if (!handle_builtin_memset (gsi))
 	      return false;
 	    break;
+	  case BUILT_IN_MEMCMP:
+	    if (!handle_builtin_memcmp (gsi))
+	      return false;
+	    break;
 	  default:
 	    break;
 	  }
Index: gcc/tree.c
===================================================================
--- gcc/tree.c	(revision 236113)
+++ gcc/tree.c	(working copy)
@@ -10594,6 +10594,13 @@ build_common_builtin_nodes (void)
 			BUILT_IN_STACK_RESTORE,
 			"__builtin_stack_restore", ECF_NOTHROW | ECF_LEAF);
 
+  ftype = build_function_type_list (integer_type_node, const_ptr_type_node,
+				    const_ptr_type_node, size_type_node,
+				    NULL_TREE);
+  local_define_builtin ("__builtin_memcmp_eq", ftype, BUILT_IN_MEMCMP_EQ,
+			"__builtin_memcmp_eq",
+			ECF_PURE | ECF_NOTHROW | ECF_LEAF);
+
   /* If there's a possibility that we might use the ARM EABI, build the
     alternate __cxa_end_cleanup node used to resume from C++ and Java.  */
   if (targetm.arm_eabi_unwinder)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Thoughts on memcmp expansion (PR43052)
  2016-05-12 17:14           ` Bernd Schmidt
@ 2016-05-13 10:20             ` Richard Biener
  2016-05-13 13:05               ` Bernd Schmidt
  0 siblings, 1 reply; 19+ messages in thread
From: Richard Biener @ 2016-05-13 10:20 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches, Nick Clifton

On Thu, May 12, 2016 at 7:14 PM, Bernd Schmidt <bschmidt@redhat.com> wrote:
> On 05/02/2016 03:14 PM, Richard Biener wrote:
>
>> I think it fits best in tree-ssa-strlen.c:strlen_optimize_stmt for the
>> moment.
>
>
> I've done this in this version. Note that this apparently means it won't be
> run at -O, unlike the previous version.
>
> The code moved to tree-ssa-strlen.c is nearly identical to the previous
> version, with the exception of a new line that copies the contents of
> gimple_call_use_set to the newly constructed call. Without this, I found a
> testcase where we can miss a DSE opportunity since we think memcmp_eq can
> access anything.
>
> In the by_pieces_code, I've gone ahead and progressed the C++ conversion a
> little further by making subclasses of op_by_pieces_d. Now the
> {move,store,compare}_by_pieces functions generally just make an object of
> the appropriate type and call its run method. I've also converted
> store_by_pieces to this infrastructure to eliminate more duplication.
>
> I've verified that if you disable the strlenopt code, you get identical code
> generation before and after the patch on x86_64-linux and arm-eabi. With the
> strlenopt enabled, it finds memcmp optimization opportunities on both
> targets (more on x86 due to !SLOW_UNALIGNED_ACCESS). On arm, there is a
> question mark over whether the autoinc support in the by_pieces code is
> really a good idea, but that's unchanged from before and can be addressed
> later.
> Microbenchmarks on x86 suggest that the new by-pieces expansion is a whole
> lot faster than calling memcmp (about a factor of 3 in my tests).
>
> Bootstrapped and tested on x86_64-linux. The testcase failed at -O (now
> changed to -O2), and there was a failure in vrp47.c which also seems to
> appear in gcc-testresults, so I think it's unrelated. Still, I'll make
> another run against a new baseline. Ok for trunk if that all comes back
> clean?

@@ -6081,9 +6056,16 @@ expand_builtin (tree exp, rtx target, rt

     case BUILT_IN_BCMP:
     case BUILT_IN_MEMCMP:
-      target = expand_builtin_memcmp (exp, target);
+    case BUILT_IN_MEMCMP_EQ:
+      target = expand_builtin_memcmp (exp, target, fcode == BUILT_IN_MEMCMP_EQ);
       if (target)
        return target;
+      if (fcode == BUILT_IN_MEMCMP_EQ)
+       {
+         exp = copy_node (exp);
+         tree newdecl = builtin_decl_explicit (BUILT_IN_MEMCMP);
+         TREE_OPERAND (exp, 1) = build_fold_addr_expr (newdecl);
+       }

you can modify 'exp' in-place (yeah, we still build a GENERIC call from GIMPLE
just because nobody "fixed" downstream routines to accept a GIMPLE call).

I'm not much of a fan of C++-ification (in this case it makes review
harder) but well ...

+  if (tree_fits_uhwi_p (len)
+      && (leni = tree_to_uhwi (len)) <= GET_MODE_SIZE (word_mode)
+      && exact_log2 (leni) != -1
+      && (align1 = get_pointer_alignment (arg1)) >= leni * BITS_PER_UNIT
+      && (align2 = get_pointer_alignment (arg2)) >= leni * BITS_PER_UNIT)

I think * BITS_PER_UNIT has to be * 8 here as the C standard defines
it that way.

I think to increase coverage you want || !SLOW_UNALIGNED_ACCESS (TYPE_MODE (), align1)
here and build properly mis-aligned access types for the MEM_REF like

                  if (src_align < GET_MODE_ALIGNMENT (TYPE_MODE (type)))
                    srctype = build_aligned_type (type, src_align);

(copied from gimple_fold_builtin_memory_op).

+      type = build_nonstandard_integer_type (leni * BITS_PER_UNIT, 1);
+      gcc_assert (GET_MODE_SIZE (TYPE_MODE (type)) == leni);

Likewise leni * 8 and the assert would have to assert
GET_MODE_SIZE () * BITS_PER_UNIT == leni * 8 (or using GET_MODE_BITSIZE).

+      arg1 = build2_loc (loc, MEM_REF, type, arg1, off);
+      arg2 = build2_loc (loc, MEM_REF, type, arg2, off);

you want to add

        tree tem = fold_const_aggregate_ref (arg1);
        if (tem)
          arg1 = tem;

and same for arg2.  That avoids leaving unfolded memory refs in the IL
after the transform.

+  tree newfn = builtin_decl_explicit (BUILT_IN_MEMCMP_EQ);
+  gcall *repl = gimple_build_call (newfn, 3, arg1, arg2, len);
+  gimple_call_set_lhs (repl, res);
+  gimple_set_location (repl, gimple_location (stmt2));
+  gimple_set_vuse (repl, gimple_vuse (stmt2));
+  *gimple_call_use_set (repl) = *gimple_call_use_set (stmt2);
+  gcc_assert (!gimple_vdef (stmt2));
+  gsi_replace (gsi, repl, false);
+  return false;

I think you can avoid all this by just doing

   gimple_call_set_fndecl (stmt2, builtin_decl_explicit (BUILT_IN_MEMCMP_EQ));

as MEMCMP and MEMCMP_EQ have equal signatures.

Ok with those changes.

Thanks,
Richard.


>
> Bernd

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Thoughts on memcmp expansion (PR43052)
  2016-05-13 10:20             ` Richard Biener
@ 2016-05-13 13:05               ` Bernd Schmidt
  2016-05-13 13:07                 ` Richard Biener
  0 siblings, 1 reply; 19+ messages in thread
From: Bernd Schmidt @ 2016-05-13 13:05 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches, Nick Clifton

On 05/13/2016 12:20 PM, Richard Biener wrote:
> I'm not much of a fan of C++-ification (in this case it makes review
> harder) but well ...

I felt it was a pretty natural way to structure the code to avoid 
duplicating the same logic across more functions, and we might as well 
use the language for such purposes given that we've bothered to switch.

> +  if (tree_fits_uhwi_p (len)
> +      && (leni = tree_to_uhwi (len)) <= GET_MODE_SIZE (word_mode)
> +      && exact_log2 (leni) != -1
> +      && (align1 = get_pointer_alignment (arg1)) >= leni * BITS_PER_UNIT
> +      && (align2 = get_pointer_alignment (arg2)) >= leni * BITS_PER_UNIT)
>
> I think * BITS_PER_UNIT has to be * 8 here as the C standard defines
> it that way.

Huh? Can you elaborate?

[...]
> Ok with those changes.

Thanks. I won't be reading email for the next two weeks, so I'll be 
checking it in afterwards.


Bernd

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Thoughts on memcmp expansion (PR43052)
  2016-05-13 13:05               ` Bernd Schmidt
@ 2016-05-13 13:07                 ` Richard Biener
  2016-05-13 13:13                   ` Bernd Schmidt
  0 siblings, 1 reply; 19+ messages in thread
From: Richard Biener @ 2016-05-13 13:07 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches, Nick Clifton

On Fri, May 13, 2016 at 3:05 PM, Bernd Schmidt <bschmidt@redhat.com> wrote:
> On 05/13/2016 12:20 PM, Richard Biener wrote:
>>
>> I'm not much of a fan of C++-ification (in this case it makes review
>> harder) but well ...
>
>
> I felt it was a pretty natural way to structure the code to avoid
> duplicating the same logic across more functions, and we might as well use
> the language for such purposes given that we've bothered to switch.
>
>> +  if (tree_fits_uhwi_p (len)
>> +      && (leni = tree_to_uhwi (len)) <= GET_MODE_SIZE (word_mode)
>> +      && exact_log2 (leni) != -1
>> +      && (align1 = get_pointer_alignment (arg1)) >= leni * BITS_PER_UNIT
>> +      && (align2 = get_pointer_alignment (arg2)) >= leni * BITS_PER_UNIT)
>>
>> I think * BITS_PER_UNIT has to be * 8 here as the C standard defines
>> it that way.
>
>
> Huh? Can you elaborate?

When you have a builtin taking a size in bytes then a byte is 8 bits,
not BITS_PER_UNIT bits.

Richard.

> [...]
>>
>> Ok with those changes.
>
>
> Thanks. I won't be reading email for the next two weeks, so I'll be checking
> it in afterwards.
>
>
> Bernd

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Thoughts on memcmp expansion (PR43052)
  2016-05-13 13:07                 ` Richard Biener
@ 2016-05-13 13:13                   ` Bernd Schmidt
  2016-05-13 13:53                     ` Joseph Myers
  0 siblings, 1 reply; 19+ messages in thread
From: Bernd Schmidt @ 2016-05-13 13:13 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches, Nick Clifton, Joseph Myers

On 05/13/2016 03:07 PM, Richard Biener wrote:
> On Fri, May 13, 2016 at 3:05 PM, Bernd Schmidt <bschmidt@redhat.com> wrote:
>> Huh? Can you elaborate?
>
> When you have a builtin taking a size in bytes then a byte is 8 bits,
> not BITS_PER_UNIT bits.

That makes no sense to me. I think the definition of a byte depends on 
the machine (hence the term "octet" was coined to be unambiguous). Also, 
such a definition would seem to imply that machines with 10-bit bytes 
cannot implement memcpy or memcmp.

Joseph, can you clarify the standard's meaning here?


Bernd

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Thoughts on memcmp expansion (PR43052)
  2016-05-13 13:13                   ` Bernd Schmidt
@ 2016-05-13 13:53                     ` Joseph Myers
  2016-05-13 14:00                       ` Bernd Schmidt
  0 siblings, 1 reply; 19+ messages in thread
From: Joseph Myers @ 2016-05-13 13:53 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Richard Biener, GCC Patches, Nick Clifton

On Fri, 13 May 2016, Bernd Schmidt wrote:

> On 05/13/2016 03:07 PM, Richard Biener wrote:
> > On Fri, May 13, 2016 at 3:05 PM, Bernd Schmidt <bschmidt@redhat.com> wrote:
> > > Huh? Can you elaborate?
> > 
> > When you have a builtin taking a size in bytes then a byte is 8 bits,
> > not BITS_PER_UNIT bits.
> 
> That makes no sense to me. I think the definition of a byte depends on the
> machine (hence the term "octet" was coined to be unambiguous). Also, such a
> definition would seem to imply that machines with 10-bit bytes cannot
> implement memcpy or memcmp.
> 
> Joseph, can you clarify the standard's meaning here?

* In C: a byte is the minimal addressable unit; an N-byte object is made 
up of N "unsigned char" objects, with successive addresses in terms of 
incrementing an "unsigned char *" pointer.  A byte is at least 8 bits.

* In GCC, at the level of GNU C APIs on the target, which generally 
includes built-in functions: a byte (on the target) is made of 
CHAR_TYPE_SIZE bits.  In theory this could be more than BITS_PER_UNIT, or 
that could be more than 8, though support for either of those cases would 
be very bit-rotten (and I'm not sure there ever have been targets with 
CHAR_TYPE_SIZE > BITS_PER_UNIT).  Sizes passed to memcpy and memcmp are 
sizes in units of CHAR_TYPE_SIZE bits.

* In GCC, at the RTL level: a byte (on the target) is a QImode object, 
which is made of BITS_PER_UNIT bits.  (HImode is always two bytes, SImode 
four, etc., if those modes exist.)  Support for BITS_PER_UNIT being more 
than 8 is very bit-rotten.

* In GCC, on the host: GCC only supports hosts (and $build) where bytes 
are 8-bit (though writing it as CHAR_BIT makes it clear that this 8 means 
the number of bits in a host byte).

Internal interfaces e.g. representing the contents of strings or other 
memory on the target may not currently be well-defined except when 
BITS_PER_UNIT is 8.  Cf. e.g. 
<https://gcc.gnu.org/ml/gcc/2003-06/msg01159.html>.  But the above should 
at least give guidance as to whether BITS_PER_UNIT, CHAR_TYPE_SIZE (or 
TYPE_PRECISION (char_type_node), preferred where possible to minimize 
usage of target macros) or CHAR_BIT is logically right in a particular 
place.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Thoughts on memcmp expansion (PR43052)
  2016-05-13 13:53                     ` Joseph Myers
@ 2016-05-13 14:00                       ` Bernd Schmidt
  2016-05-13 20:41                         ` Joseph Myers
  0 siblings, 1 reply; 19+ messages in thread
From: Bernd Schmidt @ 2016-05-13 14:00 UTC (permalink / raw)
  To: Joseph Myers; +Cc: Richard Biener, GCC Patches, Nick Clifton

On 05/13/2016 03:53 PM, Joseph Myers wrote:
> * In C: a byte is the minimal addressable unit; an N-byte object is made
> up of N "unsigned char" objects, with successive addresses in terms of
> incrementing an "unsigned char *" pointer.  A byte is at least 8 bits.
>
> * In GCC, at the level of GNU C APIs on the target, which generally
> includes built-in functions: a byte (on the target) is made of
> CHAR_TYPE_SIZE bits.  In theory this could be more than BITS_PER_UNIT, or
> that could be more than 8, though support for either of those cases would
> be very bit-rotten (and I'm not sure there ever have been targets with
> CHAR_TYPE_SIZE > BITS_PER_UNIT).  Sizes passed to memcpy and memcmp are
> sizes in units of CHAR_TYPE_SIZE bits.
>
> * In GCC, at the RTL level: a byte (on the target) is a QImode object,
> which is made of BITS_PER_UNIT bits.  (HImode is always two bytes, SImode
> four, etc., if those modes exist.)  Support for BITS_PER_UNIT being more
> than 8 is very bit-rotten.
>
> * In GCC, on the host: GCC only supports hosts (and $build) where bytes
> are 8-bit (though writing it as CHAR_BIT makes it clear that this 8 means
> the number of bits in a host byte).
>
> Internal interfaces e.g. representing the contents of strings or other
> memory on the target may not currently be well-defined except when
> BITS_PER_UNIT is 8.  Cf. e.g.
> <https://gcc.gnu.org/ml/gcc/2003-06/msg01159.html>.  But the above should
> at least give guidance as to whether BITS_PER_UNIT, CHAR_TYPE_SIZE (or
> TYPE_PRECISION (char_type_node), preferred where possible to minimize
> usage of target macros) or CHAR_BIT is logically right in a particular
> place.

Thanks. So, this would seem to suggest that BITS_PER_UNIT in 
memcmp/memcpy expansion is more accurate than a plain 8, although 
pedantically we might want to use CHAR_TYPE_SIZE? Should I adjust my 
patch to use the latter or leave these parts as-is?


Bernd

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Thoughts on memcmp expansion (PR43052)
  2016-05-13 14:00                       ` Bernd Schmidt
@ 2016-05-13 20:41                         ` Joseph Myers
  2016-05-31 23:50                           ` Bernd Schmidt
  0 siblings, 1 reply; 19+ messages in thread
From: Joseph Myers @ 2016-05-13 20:41 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Richard Biener, GCC Patches, Nick Clifton

On Fri, 13 May 2016, Bernd Schmidt wrote:

> Thanks. So, this would seem to suggest that BITS_PER_UNIT in memcmp/memcpy
> expansion is more accurate than a plain 8, although pedantically we might want
> to use CHAR_TYPE_SIZE? Should I adjust my patch to use the latter or leave
> these parts as-is?

I'd say use CHAR_TYPE_SIZE.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Thoughts on memcmp expansion (PR43052)
  2016-01-15 16:58 Thoughts on memcmp expansion (PR43052) Bernd Schmidt
                   ` (2 preceding siblings ...)
  2016-01-19 21:36 ` Jeff Law
@ 2016-05-30 11:29 ` Florian Weimer
  2016-05-31 15:05   ` Bernd Schmidt
  3 siblings, 1 reply; 19+ messages in thread
From: Florian Weimer @ 2016-05-30 11:29 UTC (permalink / raw)
  To: Bernd Schmidt, GCC Patches, Nick Clifton

On 01/15/2016 05:58 PM, Bernd Schmidt wrote:

> One question Richard posed in the comments: why aren't we optimizing
> small constant size memcmps other than size 1 to *s == *q? The reason is
> the return value of memcmp, which implies byte-sized operation
> (incidentally, the use of SImode in the cmpmem/cmpstr patterns is really
> odd). It's possible to work around this, but expansion becomes a little
> more tricky (subtract after bswap, maybe).

When I did this (big-endian conversion, wide subtract, sign) to the 
tail difference check in glibc's x86_64 memcmp, it was actually a bit 
faster than isolating the differing byte and returning its difference, 
even for non-random data such as encountered during qsort.
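
For illustration, a minimal sketch of that tail check, assuming 8-byte 
little-endian loads and GCC's __builtin_bswap64 (the actual glibc code 
differs):

  /* Tri-state compare of two 8-byte blocks without isolating the
     differing byte: byte-swap so the first differing byte becomes
     the most significant one, then take the sign of the wide
     comparison.  */
  static inline int
  tail_cmp8 (const unsigned char *p, const unsigned char *q)
  {
    unsigned long long a, b;
    __builtin_memcpy (&a, p, 8);
    __builtin_memcpy (&b, q, 8);
    a = __builtin_bswap64 (a);
    b = __builtin_bswap64 (b);
    return (a > b) - (a < b);
  }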

>  * Expand __memcmp_eq for small constant sizes with loads and
>    comparison, fall back to a memcmp call.

Should we export such a function from glibc?  I expect it's fairly 
common.  Computing the tail difference costs a few cycles.

It may also make sense to call a streamlined implementation if you have 
interesting alignment information (for x86_64, that would be at least 16 
on one or both inputs, so it's perhaps not easy to come by).

Florian

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Thoughts on memcmp expansion (PR43052)
  2016-05-30 11:29 ` Florian Weimer
@ 2016-05-31 15:05   ` Bernd Schmidt
  0 siblings, 0 replies; 19+ messages in thread
From: Bernd Schmidt @ 2016-05-31 15:05 UTC (permalink / raw)
  To: Florian Weimer, GCC Patches, Nick Clifton

On 05/30/2016 11:37 AM, Florian Weimer wrote:

>>  * Expand __memcmp_eq for small constant sizes with loads and
>>    comparison, fall back to a memcmp call.
>
> Should we export such a function from glibc?  I expect it's fairly
> common.  Computing the tail difference costs a few cycles.

At the moment the patch ensures that no actual memcmp_eq function is 
called; it's either expanded inline or we fall back to memcmp. But 
there's no reason why it couldn't be added under some appropriate name.


Bernd

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Thoughts on memcmp expansion (PR43052)
  2016-05-13 20:41                         ` Joseph Myers
@ 2016-05-31 23:50                           ` Bernd Schmidt
  0 siblings, 0 replies; 19+ messages in thread
From: Bernd Schmidt @ 2016-05-31 23:50 UTC (permalink / raw)
  To: Joseph Myers; +Cc: Richard Biener, GCC Patches, Nick Clifton

[-- Attachment #1: Type: text/plain, Size: 622 bytes --]

On 05/13/2016 10:40 PM, Joseph Myers wrote:
> On Fri, 13 May 2016, Bernd Schmidt wrote:
>
>> Thanks. So, this would seem to suggest that BITS_PER_UNIT in memcmp/memcpy
>> expansion is more accurate than a plain 8, although pedantically we might want
>> to use CHAR_TYPE_SIZE? Should I adjust my patch to use the latter or leave
>> these parts as-is?
>
> I'd say use CHAR_TYPE_SIZE.

Here's the current patch, retested as before. Since I had to make some 
adjustments after the approval to cope with changes such as the removal 
of memcmp_libfunc, I'll leave it for a few days for comments before 
checking it in.


Bernd


[-- Attachment #2: memcmp-eq0531.diff --]
[-- Type: text/x-patch, Size: 68682 bytes --]

Index: gcc/builtins.c
===================================================================
--- gcc/builtins.c	(revision 236909)
+++ gcc/builtins.c	(working copy)
@@ -3671,53 +3671,24 @@ expand_cmpstr (insn_code icode, rtx targ
   return NULL_RTX;
 }
 
-/* Try to expand cmpstrn or cmpmem operation ICODE with the given operands.
-   ARG3_TYPE is the type of ARG3_RTX.  Return the result rtx on success,
-   otherwise return null.  */
-
-static rtx
-expand_cmpstrn_or_cmpmem (insn_code icode, rtx target, rtx arg1_rtx,
-			  rtx arg2_rtx, tree arg3_type, rtx arg3_rtx,
-			  HOST_WIDE_INT align)
-{
-  machine_mode insn_mode = insn_data[icode].operand[0].mode;
-
-  if (target && (!REG_P (target) || HARD_REGISTER_P (target)))
-    target = NULL_RTX;
-
-  struct expand_operand ops[5];
-  create_output_operand (&ops[0], target, insn_mode);
-  create_fixed_operand (&ops[1], arg1_rtx);
-  create_fixed_operand (&ops[2], arg2_rtx);
-  create_convert_operand_from (&ops[3], arg3_rtx, TYPE_MODE (arg3_type),
-			       TYPE_UNSIGNED (arg3_type));
-  create_integer_operand (&ops[4], align);
-  if (maybe_expand_insn (icode, 5, ops))
-    return ops[0].value;
-  return NULL_RTX;
-}
-
 /* Expand expression EXP, which is a call to the memcmp built-in function.
    Return NULL_RTX if we failed and the caller should emit a normal call,
-   otherwise try to get the result in TARGET, if convenient.  */
+   otherwise try to get the result in TARGET, if convenient.
+   RESULT_EQ is true if we can relax the returned value to be either zero
+   or nonzero, without caring about the sign.  */
 
 static rtx
-expand_builtin_memcmp (tree exp, rtx target)
+expand_builtin_memcmp (tree exp, rtx target, bool result_eq)
 {
   if (!validate_arglist (exp,
  			 POINTER_TYPE, POINTER_TYPE, INTEGER_TYPE, VOID_TYPE))
     return NULL_RTX;
 
-  /* Note: The cmpstrnsi pattern, if it exists, is not suitable for
-     implementing memcmp because it will stop if it encounters two
-     zero bytes.  */
-  insn_code icode = direct_optab_handler (cmpmem_optab, SImode);
-  if (icode == CODE_FOR_nothing)
-    return NULL_RTX;
-
   tree arg1 = CALL_EXPR_ARG (exp, 0);
   tree arg2 = CALL_EXPR_ARG (exp, 1);
   tree len = CALL_EXPR_ARG (exp, 2);
+  machine_mode mode = TYPE_MODE (TREE_TYPE (exp));
+  location_t loc = EXPR_LOCATION (exp);
 
   unsigned int arg1_align = get_pointer_alignment (arg1) / BITS_PER_UNIT;
   unsigned int arg2_align = get_pointer_alignment (arg2) / BITS_PER_UNIT;
@@ -3726,22 +3697,38 @@ expand_builtin_memcmp (tree exp, rtx tar
   if (arg1_align == 0 || arg2_align == 0)
     return NULL_RTX;
 
-  machine_mode mode = TYPE_MODE (TREE_TYPE (exp));
-  location_t loc = EXPR_LOCATION (exp);
   rtx arg1_rtx = get_memory_rtx (arg1, len);
   rtx arg2_rtx = get_memory_rtx (arg2, len);
-  rtx arg3_rtx = expand_normal (fold_convert_loc (loc, sizetype, len));
+  rtx len_rtx = expand_normal (fold_convert_loc (loc, sizetype, len));
 
   /* Set MEM_SIZE as appropriate.  */
-  if (CONST_INT_P (arg3_rtx))
+  if (CONST_INT_P (len_rtx))
     {
-      set_mem_size (arg1_rtx, INTVAL (arg3_rtx));
-      set_mem_size (arg2_rtx, INTVAL (arg3_rtx));
+      set_mem_size (arg1_rtx, INTVAL (len_rtx));
+      set_mem_size (arg2_rtx, INTVAL (len_rtx));
     }
 
-  rtx result = expand_cmpstrn_or_cmpmem (icode, target, arg1_rtx, arg2_rtx,
-					 TREE_TYPE (len), arg3_rtx,
-					 MIN (arg1_align, arg2_align));
+  by_pieces_constfn constfn = NULL;
+
+  const char *src_str = c_getstr (arg1);
+  if (src_str == NULL)
+    src_str = c_getstr (arg2);
+  else
+    std::swap (arg1_rtx, arg2_rtx);
+
+  /* If SRC is a string constant and the block comparison would be done
+     by pieces, we can avoid loading the string from memory and instead
+     use the computed constants directly.  */
+  if (src_str
+      && CONST_INT_P (len_rtx)
+      && (unsigned HOST_WIDE_INT) INTVAL (len_rtx) <= strlen (src_str) + 1)
+    constfn = builtin_memcpy_read_str;
+
+  rtx result = emit_block_cmp_hints (arg1_rtx, arg2_rtx, len_rtx,
+				     TREE_TYPE (len), target,
+				     result_eq, constfn,
+				     CONST_CAST (char *, src_str));
+
   if (result)
     {
       /* Return the value in the proper mode for this function.  */
@@ -6073,9 +6060,15 @@ expand_builtin (tree exp, rtx target, rt
 
     case BUILT_IN_BCMP:
     case BUILT_IN_MEMCMP:
-      target = expand_builtin_memcmp (exp, target);
+    case BUILT_IN_MEMCMP_EQ:
+      target = expand_builtin_memcmp (exp, target, fcode == BUILT_IN_MEMCMP_EQ);
       if (target)
 	return target;
+      if (fcode == BUILT_IN_MEMCMP_EQ)
+	{
+	  tree newdecl = builtin_decl_explicit (BUILT_IN_MEMCMP);
+	  TREE_OPERAND (exp, 1) = build_fold_addr_expr (newdecl);
+	}
       break;
 
     case BUILT_IN_SETJMP:
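
The builtin_memcpy_read_str callback passed to emit_block_cmp_hints
above boils down to the sketch below; the real function in builtins.c
additionally asserts that OFFSET stays within the string, and
const_piece_read_str is a made-up name for illustration:

static rtx
const_piece_read_str (void *data, HOST_WIDE_INT offset, machine_mode mode)
{
  const char *str = (const char *) data;
  /* Materialize the string bytes at OFFSET as an immediate of mode
     MODE, so the constant operand is never loaded from memory.  */
  return c_readstr (str + offset, mode);
}
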
Index: gcc/builtins.def
===================================================================
--- gcc/builtins.def	(revision 236909)
+++ gcc/builtins.def	(working copy)
@@ -864,6 +864,10 @@ DEF_BUILTIN_STUB (BUILT_IN_STACK_SAVE, "
 DEF_BUILTIN_STUB (BUILT_IN_STACK_RESTORE, "__builtin_stack_restore")
 DEF_BUILTIN_STUB (BUILT_IN_ALLOCA_WITH_ALIGN, "__builtin_alloca_with_align")
 
+/* An internal version of memcmp, used when the result is only tested for
+   equality with zero.  */
+DEF_BUILTIN_STUB (BUILT_IN_MEMCMP_EQ, "__builtin_memcmp_eq")
+
 /* Object size checking builtins.  */
 DEF_GCC_BUILTIN	       (BUILT_IN_OBJECT_SIZE, "object_size", BT_FN_SIZE_CONST_PTR_INT, ATTR_PURE_NOTHROW_LEAF_LIST)
 DEF_EXT_LIB_BUILTIN_CHKP (BUILT_IN_MEMCPY_CHK, "__memcpy_chk", BT_FN_PTR_PTR_CONST_PTR_SIZE_SIZE, ATTR_RET1_NOTHROW_NONNULL_LEAF)
Index: gcc/defaults.h
===================================================================
--- gcc/defaults.h	(revision 236909)
+++ gcc/defaults.h	(working copy)
@@ -1039,6 +1039,11 @@ see the files COPYING3 and COPYING.RUNTI
 #define STORE_MAX_PIECES  MIN (MOVE_MAX_PIECES, 2 * sizeof (HOST_WIDE_INT))
 #endif
 
+/* Likewise for block comparisons.  */
+#ifndef COMPARE_MAX_PIECES
+#define COMPARE_MAX_PIECES  MOVE_MAX_PIECES
+#endif
+
 #ifndef MAX_MOVE_MAX
 #define MAX_MOVE_MAX MOVE_MAX
 #endif
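
A port that wants block comparisons to use a different piece size than
block moves can override the new macro in its target header, along
these lines (hypothetical value, not taken from any real port):

#undef COMPARE_MAX_PIECES
#define COMPARE_MAX_PIECES 16   /* allow 16-byte compare pieces */
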
Index: gcc/doc/tm.texi
===================================================================
--- gcc/doc/tm.texi	(revision 236909)
+++ gcc/doc/tm.texi	(working copy)
@@ -6311,8 +6311,9 @@ Both @var{size} and @var{alignment} are
 units.
 
 The parameter @var{op} is one of: @code{CLEAR_BY_PIECES},
-@code{MOVE_BY_PIECES}, @code{SET_BY_PIECES}, @code{STORE_BY_PIECES}.
-These describe the type of memory operation under consideration.
+@code{MOVE_BY_PIECES}, @code{SET_BY_PIECES}, @code{STORE_BY_PIECES} or
+@code{COMPARE_BY_PIECES}.  These describe the type of memory operation
+under consideration.
 
 The parameter @var{speed_p} is true if the code is currently being
 optimized for speed rather than size.
@@ -6329,11 +6330,33 @@ in code size, for example where the numb
 move would be greater than that of a library call.
 @end deftypefn
 
+@deftypefn {Target Hook} int TARGET_COMPARE_BY_PIECES_BRANCH_RATIO (machine_mode @var{mode})
+When expanding a block comparison in @var{mode}, GCC can try to reduce the
+number of branches at the expense of more memory operations.  This hook
+allows the target to override the default choice.  It should return the
+factor by which branches should be reduced over the plain expansion with
+one comparison per @var{mode}-sized piece.  A port can also prevent a
+particular mode from being used for block comparisons by returning a
+negative number from this hook.
+@end deftypefn
+
 @defmac MOVE_MAX_PIECES
 A C expression used by @code{move_by_pieces} to determine the largest unit
 a load or store used to copy memory is.  Defaults to @code{MOVE_MAX}.
 @end defmac
 
+@defmac STORE_MAX_PIECES
+A C expression used by @code{store_by_pieces} to determine the largest unit
+a store used to write to memory is.  Defaults to @code{MOVE_MAX_PIECES}, or
+two times the size of @code{HOST_WIDE_INT}, whichever is smaller.
+the size of @code{HOST_WIDE_INT}, whichever is smaller.
+@end defmac
+
+@defmac COMPARE_MAX_PIECES
+A C expression used by @code{compare_by_pieces} to determine the largest unit
+a load or store used to compare memory is.  Defaults to
+@code{MOVE_MAX_PIECES}.
+@end defmac
+
 @defmac CLEAR_RATIO (@var{speed})
 The threshold of number of scalar move insns, @emph{below} which a sequence
 of insns should be generated to clear memory instead of a string clear insn
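
To make the branch ratio concrete, here is roughly what a ratio of 2
means for a 16-byte equality check done in 8-byte pieces, sketched in
C rather than RTL (eq16_batched is a made-up name):

#include <stdint.h>
#include <string.h>

static int
eq16_batched (const void *a, const void *b)
{
  uint64_t a0, b0, a1, b1;
  memcpy (&a0, a, 8);
  memcpy (&b0, b, 8);
  memcpy (&a1, (const char *) a + 8, 8);
  memcpy (&b1, (const char *) b + 8, 8);
  uint64_t d0 = a0 - b0;      /* piece 0: sub_optab */
  uint64_t d1 = a1 - b1;      /* piece 1: sub_optab */
  return (d0 | d1) != 0;      /* ior_optab, then one branch for both */
}

With a ratio of 1 there would instead be one compare-and-branch per
piece.
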
Index: gcc/doc/tm.texi.in
===================================================================
--- gcc/doc/tm.texi.in	(revision 236909)
+++ gcc/doc/tm.texi.in	(working copy)
@@ -4653,11 +4653,25 @@ If you don't define this, a reasonable d
 
 @hook TARGET_USE_BY_PIECES_INFRASTRUCTURE_P
 
+@hook TARGET_COMPARE_BY_PIECES_BRANCH_RATIO
+
 @defmac MOVE_MAX_PIECES
 A C expression used by @code{move_by_pieces} to determine the largest unit
 a load or store used to copy memory is.  Defaults to @code{MOVE_MAX}.
 @end defmac
 
+@defmac STORE_MAX_PIECES
+A C expression used by @code{store_by_pieces} to determine the largest unit
+a store used to write to memory is.  Defaults to @code{MOVE_MAX_PIECES}, or
+two times the size of @code{HOST_WIDE_INT}, whichever is smaller.
+the size of @code{HOST_WIDE_INT}, whichever is smaller.
+@end defmac
+
+@defmac COMPARE_MAX_PIECES
+A C expression used by @code{compare_by_pieces} to determine the largest unit
+a load or store used to compare memory is.  Defaults to
+@code{MOVE_MAX_PIECES}.
+@end defmac
+
 @defmac CLEAR_RATIO (@var{speed})
 The threshold of number of scalar move insns, @emph{below} which a sequence
 of insns should be generated to clear memory instead of a string clear insn
Index: gcc/expr.c
===================================================================
--- gcc/expr.c	(revision 236909)
+++ gcc/expr.c	(working copy)
@@ -70,51 +70,12 @@ along with GCC; see the file COPYING3.
    the same indirect address eventually.  */
 int cse_not_expected;
 
-/* This structure is used by move_by_pieces to describe the move to
-   be performed.  */
-struct move_by_pieces_d
-{
-  rtx to;
-  rtx to_addr;
-  int autinc_to;
-  int explicit_inc_to;
-  rtx from;
-  rtx from_addr;
-  int autinc_from;
-  int explicit_inc_from;
-  unsigned HOST_WIDE_INT len;
-  HOST_WIDE_INT offset;
-  int reverse;
-};
-
-/* This structure is used by store_by_pieces to describe the clear to
-   be performed.  */
-
-struct store_by_pieces_d
-{
-  rtx to;
-  rtx to_addr;
-  int autinc_to;
-  int explicit_inc_to;
-  unsigned HOST_WIDE_INT len;
-  HOST_WIDE_INT offset;
-  rtx (*constfun) (void *, HOST_WIDE_INT, machine_mode);
-  void *constfundata;
-  int reverse;
-};
-
-static void move_by_pieces_1 (insn_gen_fn, machine_mode,
-			      struct move_by_pieces_d *);
 static bool block_move_libcall_safe_for_call_parm (void);
 static bool emit_block_move_via_movmem (rtx, rtx, rtx, unsigned, unsigned, HOST_WIDE_INT,
 					unsigned HOST_WIDE_INT, unsigned HOST_WIDE_INT,
 					unsigned HOST_WIDE_INT);
 static void emit_block_move_via_loop (rtx, rtx, rtx, unsigned);
-static rtx clear_by_pieces_1 (void *, HOST_WIDE_INT, machine_mode);
 static void clear_by_pieces (rtx, unsigned HOST_WIDE_INT, unsigned int);
-static void store_by_pieces_1 (struct store_by_pieces_d *, unsigned int);
-static void store_by_pieces_2 (insn_gen_fn, machine_mode,
-			       struct store_by_pieces_d *);
 static rtx_insn *compress_float_constant (rtx, rtx);
 static rtx get_subtarget (rtx);
 static void store_constructor_field (rtx, unsigned HOST_WIDE_INT,
@@ -767,276 +728,799 @@ widest_int_mode_for_size (unsigned int s
   return mode;
 }
 
+/* Determine whether an operation OP on LEN bytes with alignment ALIGN can
+   and should be performed piecewise.  */
+
+static bool
+can_do_by_pieces (unsigned HOST_WIDE_INT len, unsigned int align,
+		  enum by_pieces_operation op)
+{
+  return targetm.use_by_pieces_infrastructure_p (len, align, op,
+						 optimize_insn_for_speed_p ());
+}
+
 /* Determine whether the LEN bytes can be moved by using several move
    instructions.  Return nonzero if a call to move_by_pieces should
    succeed.  */
 
-int
-can_move_by_pieces (unsigned HOST_WIDE_INT len,
-		    unsigned int align)
+bool
+can_move_by_pieces (unsigned HOST_WIDE_INT len, unsigned int align)
 {
-  return targetm.use_by_pieces_infrastructure_p (len, align, MOVE_BY_PIECES,
-						 optimize_insn_for_speed_p ());
+  return can_do_by_pieces (len, align, MOVE_BY_PIECES);
 }
 
-/* Generate several move instructions to copy LEN bytes from block FROM to
-   block TO.  (These are MEM rtx's with BLKmode).
+/* Return number of insns required to perform operation OP by pieces
+   for L bytes.  ALIGN (in bits) is maximum alignment we can assume.  */
 
-   If PUSH_ROUNDING is defined and TO is NULL, emit_single_push_insn is
-   used to push FROM to the stack.
+unsigned HOST_WIDE_INT
+by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align,
+		  unsigned int max_size, by_pieces_operation op)
+{
+  unsigned HOST_WIDE_INT n_insns = 0;
 
-   ALIGN is maximum stack alignment we can assume.
+  align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
 
-   If ENDP is 0 return to, if ENDP is 1 return memory at the end ala
-   mempcpy, and if ENDP is 2 return memory the end minus one byte ala
-   stpcpy.  */
+  while (max_size > 1 && l > 0)
+    {
+      machine_mode mode;
+      enum insn_code icode;
 
-rtx
-move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
-		unsigned int align, int endp)
+      mode = widest_int_mode_for_size (max_size);
+
+      if (mode == VOIDmode)
+	break;
+      unsigned int modesize = GET_MODE_SIZE (mode);
+
+      icode = optab_handler (mov_optab, mode);
+      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
+	{
+	  unsigned HOST_WIDE_INT n_pieces = l / modesize;
+	  l %= modesize;
+	  switch (op)
+	    {
+	    default:
+	      n_insns += n_pieces;
+	      break;
+
+	    case COMPARE_BY_PIECES:
+	      int batch = targetm.compare_by_pieces_branch_ratio (mode);
+	      int batch_ops = 4 * batch - 1;
+	      int full = n_pieces / batch;
+	      n_insns += full * batch_ops;
+	      if (n_pieces % batch != 0)
+		n_insns++;
+	      break;
+
+	    }
+	}
+      max_size = modesize;
+    }
+
+  gcc_assert (!l);
+  return n_insns;
+}
+
+/* Used when performing piecewise block operations, holds information
+   about one of the memory objects involved.  The member functions
+   can be used to generate code for loading from the object and
+   updating the address when iterating.  */
+
+class pieces_addr
 {
-  struct move_by_pieces_d data;
-  machine_mode to_addr_mode;
-  machine_mode from_addr_mode = get_address_mode (from);
-  rtx to_addr, from_addr = XEXP (from, 0);
-  unsigned int max_size = MOVE_MAX_PIECES + 1;
-  enum insn_code icode;
+  /* The object being referenced, a MEM.  Can be NULL_RTX to indicate
+     stack pushes.  */
+  rtx m_obj;
+  /* The address of the object.  Can differ from that seen in the
+     MEM rtx if we copied the address to a register.  */
+  rtx m_addr;
+  /* Nonzero if the address of the object already has an autoincrement;
+     the sign signifies whether that is an increment or a decrement.  */
+  signed char m_addr_inc;
+  /* Nonzero if we intend to use autoinc without the address already
+     having autoinc form.  We will insert add insns around each memory
+     reference, expecting later passes to form autoinc addressing modes.
+     The only supported options are predecrement and postincrement.  */
+  signed char m_explicit_inc;
+  /* True if we have either of the two possible cases of using
+     autoincrement.  */
+  bool m_auto;
+  /* True if this is an address to be used for load operations rather
+     than stores.  */
+  bool m_is_load;
 
-  align = MIN (to ? MEM_ALIGN (to) : align, MEM_ALIGN (from));
+  /* Optionally, a function to obtain constants for any given offset into
+     the objects, and data associated with it.  */
+  by_pieces_constfn m_constfn;
+  void *m_cfndata;
+public:
+  pieces_addr (rtx, bool, by_pieces_constfn, void *);
+  rtx adjust (machine_mode, HOST_WIDE_INT);
+  void increment_address (HOST_WIDE_INT);
+  void maybe_predec (HOST_WIDE_INT);
+  void maybe_postinc (HOST_WIDE_INT);
+  void decide_autoinc (machine_mode, bool, HOST_WIDE_INT);
+  int get_addr_inc ()
+  {
+    return m_addr_inc;
+  }
+};
 
-  data.offset = 0;
-  data.from_addr = from_addr;
-  if (to)
+/* Initialize a pieces_addr structure from an object OBJ.  IS_LOAD is
+   true if the operation to be performed on this object is a load
+   rather than a store.  For stores, OBJ can be NULL, in which case we
+   assume the operation is a stack push.  For loads, the optional
+   CONSTFN and its associated CFNDATA can be used in place of the
+   memory load.  */
+
+pieces_addr::pieces_addr (rtx obj, bool is_load, by_pieces_constfn constfn,
+			  void *cfndata)
+  : m_obj (obj), m_is_load (is_load), m_constfn (constfn), m_cfndata (cfndata)
+{
+  m_addr_inc = 0;
+  m_auto = false;
+  if (obj)
     {
-      to_addr_mode = get_address_mode (to);
-      to_addr = XEXP (to, 0);
-      data.to = to;
-      data.autinc_to
-	= (GET_CODE (to_addr) == PRE_INC || GET_CODE (to_addr) == PRE_DEC
-	   || GET_CODE (to_addr) == POST_INC || GET_CODE (to_addr) == POST_DEC);
-      data.reverse
-	= (GET_CODE (to_addr) == PRE_DEC || GET_CODE (to_addr) == POST_DEC);
+      rtx addr = XEXP (obj, 0);
+      rtx_code code = GET_CODE (addr);
+      m_addr = addr;
+      bool dec = code == PRE_DEC || code == POST_DEC;
+      bool inc = code == PRE_INC || code == POST_INC;
+      m_auto = inc || dec;
+      if (m_auto)
+	m_addr_inc = dec ? -1 : 1;
+
+      /* While we have always looked for these codes here, the code
+	 implementing the memory operation has never handled them.
+	 Support could be added later if necessary or beneficial.  */
+      gcc_assert (code != PRE_INC && code != POST_DEC);
     }
   else
     {
-      to_addr_mode = VOIDmode;
-      to_addr = NULL_RTX;
-      data.to = NULL_RTX;
-      data.autinc_to = 1;
-      if (STACK_GROWS_DOWNWARD)
-	data.reverse = 1;
+      m_addr = NULL_RTX;
+      if (!is_load)
+	{
+	  m_auto = true;
+	  if (STACK_GROWS_DOWNWARD)
+	    m_addr_inc = -1;
+	  else
+	    m_addr_inc = 1;
+	}
       else
-	data.reverse = 0;
+	gcc_assert (constfn != NULL);
     }
-  data.to_addr = to_addr;
-  data.from = from;
-  data.autinc_from
-    = (GET_CODE (from_addr) == PRE_INC || GET_CODE (from_addr) == PRE_DEC
-       || GET_CODE (from_addr) == POST_INC
-       || GET_CODE (from_addr) == POST_DEC);
+  m_explicit_inc = 0;
+  if (constfn)
+    gcc_assert (is_load);
+}
 
-  data.explicit_inc_from = 0;
-  data.explicit_inc_to = 0;
-  if (data.reverse) data.offset = len;
-  data.len = len;
+/* Decide whether to use autoinc for an address involved in a memory op.
+   MODE is the mode of the accesses, REVERSE is true if we've decided to
+   perform the operation starting from the end, and LEN is the length of
+   the operation.  Don't override an earlier decision to set m_auto.  */
+
+void
+pieces_addr::decide_autoinc (machine_mode ARG_UNUSED (mode), bool reverse,
+			     HOST_WIDE_INT len)
+{
+  if (m_auto || m_obj == NULL_RTX)
+    return;
+
+  bool use_predec = (m_is_load
+		     ? USE_LOAD_PRE_DECREMENT (mode)
+		     : USE_STORE_PRE_DECREMENT (mode));
+  bool use_postinc = (m_is_load
+		      ? USE_LOAD_POST_INCREMENT (mode)
+		      : USE_STORE_POST_INCREMENT (mode));
+  machine_mode addr_mode = get_address_mode (m_obj);
+
+  if (use_predec && reverse)
+    {
+      m_addr = copy_to_mode_reg (addr_mode,
+				 plus_constant (addr_mode,
+						m_addr, len));
+      m_auto = true;
+      m_explicit_inc = -1;
+    }
+  else if (use_postinc && !reverse)
+    {
+      m_addr = copy_to_mode_reg (addr_mode, m_addr);
+      m_auto = true;
+      m_explicit_inc = 1;
+    }
+  else if (CONSTANT_P (m_addr))
+    m_addr = copy_to_mode_reg (addr_mode, m_addr);
+}
+
+/* Adjust the address to refer to the data at OFFSET in MODE.  If we
+   are using autoincrement for this address, we don't add the offset,
+   but we still modify the MEM's properties.  */
+
+rtx
+pieces_addr::adjust (machine_mode mode, HOST_WIDE_INT offset)
+{
+  if (m_constfn)
+    return m_constfn (m_cfndata, offset, mode);
+  if (m_obj == NULL_RTX)
+    return NULL_RTX;
+  if (m_auto)
+    return adjust_automodify_address (m_obj, mode, m_addr, offset);
+  else
+    return adjust_address (m_obj, mode, offset);
+}
+
+/* Emit an add instruction to increment the address by SIZE.  */
+
+void
+pieces_addr::increment_address (HOST_WIDE_INT size)
+{
+  rtx amount = gen_int_mode (size, GET_MODE (m_addr));
+  emit_insn (gen_add2_insn (m_addr, amount));
+}
+
+/* If we are supposed to decrement the address after each access, emit code
+   to do so now.  Increment by SIZE (which should have the correct sign
+   already).  */
+
+void
+pieces_addr::maybe_predec (HOST_WIDE_INT size)
+{
+  if (m_explicit_inc >= 0)
+    return;
+  gcc_assert (HAVE_PRE_DECREMENT);
+  increment_address (size);
+}
+
+/* If we are supposed to increment the address after each access, emit code
+   to do so now.  Increment by SIZE.  */
+
+void
+pieces_addr::maybe_postinc (HOST_WIDE_INT size)
+{
+  if (m_explicit_inc <= 0)
+    return;
+  gcc_assert (HAVE_POST_INCREMENT);
+  increment_address (size);
+}
+
+/* An abstract base class, used by the by-pieces infrastructure to
+   describe the operation to be performed.  Derived classes implement
+   the particular operation.  */
+
+class op_by_pieces_d
+{
+ protected:
+  pieces_addr m_to, m_from;
+  unsigned HOST_WIDE_INT m_len;
+  HOST_WIDE_INT m_offset;
+  unsigned int m_align;
+  unsigned int m_max_size;
+  bool m_reverse;
+
+  /* Virtual functions, overridden by derived classes for the specific
+     operation.  */
+  virtual void generate (rtx, rtx, machine_mode) = 0;
+  virtual bool prepare_mode (machine_mode, unsigned int) = 0;
+  virtual void finish_mode (machine_mode)
+  {
+  }
+
+ public:
+  op_by_pieces_d (rtx, bool, rtx, bool, by_pieces_constfn, void *,
+		  unsigned HOST_WIDE_INT, unsigned int);
+  void run ();
+};
+
+/* The constructor for an op_by_pieces_d structure.  We require two
+   objects named TO and FROM, which are identified as loads or stores
+   by TO_LOAD and FROM_LOAD.  If FROM is a load, the optional FROM_CFN
+   and its associated FROM_CFN_DATA can be used to replace loads with
+   constant values.  LEN describes the length of the operation.  */
+
+op_by_pieces_d::op_by_pieces_d (rtx to, bool to_load,
+				rtx from, bool from_load,
+				by_pieces_constfn from_cfn,
+				void *from_cfn_data,
+				unsigned HOST_WIDE_INT len,
+				unsigned int align)
+  : m_to (to, to_load, NULL, NULL),
+    m_from (from, from_load, from_cfn, from_cfn_data),
+    m_len (len), m_max_size (MOVE_MAX_PIECES + 1)
+{
+  int toi = m_to.get_addr_inc ();
+  int fromi = m_from.get_addr_inc ();
+  if (toi >= 0 && fromi >= 0)
+    m_reverse = false;
+  else if (toi <= 0 && fromi <= 0)
+    m_reverse = true;
+  else
+    gcc_unreachable ();
+
+  m_offset = m_reverse ? len : 0;
+  align = MIN (to ? MEM_ALIGN (to) : align,
+	       from ? MEM_ALIGN (from) : align);
 
   /* If copying requires more than two move insns,
      copy addresses to registers (to make displacements shorter)
      and use post-increment if available.  */
-  if (!(data.autinc_from && data.autinc_to)
-      && move_by_pieces_ninsns (len, align, max_size) > 2)
+  if (by_pieces_ninsns (len, align, m_max_size, MOVE_BY_PIECES) > 2)
     {
-      /* Find the mode of the largest move...
-	 MODE might not be used depending on the definitions of the
-	 USE_* macros below.  */
-      machine_mode mode ATTRIBUTE_UNUSED
-	= widest_int_mode_for_size (max_size);
+      /* Find the mode of the widest access we will make.  */
+      machine_mode mode = widest_int_mode_for_size (m_max_size);
 
-      if (USE_LOAD_PRE_DECREMENT (mode) && data.reverse && ! data.autinc_from)
-	{
-	  data.from_addr = copy_to_mode_reg (from_addr_mode,
-					     plus_constant (from_addr_mode,
-							    from_addr, len));
-	  data.autinc_from = 1;
-	  data.explicit_inc_from = -1;
-	}
-      if (USE_LOAD_POST_INCREMENT (mode) && ! data.autinc_from)
-	{
-	  data.from_addr = copy_to_mode_reg (from_addr_mode, from_addr);
-	  data.autinc_from = 1;
-	  data.explicit_inc_from = 1;
-	}
-      if (!data.autinc_from && CONSTANT_P (from_addr))
-	data.from_addr = copy_to_mode_reg (from_addr_mode, from_addr);
-      if (USE_STORE_PRE_DECREMENT (mode) && data.reverse && ! data.autinc_to)
-	{
-	  data.to_addr = copy_to_mode_reg (to_addr_mode,
-					   plus_constant (to_addr_mode,
-							  to_addr, len));
-	  data.autinc_to = 1;
-	  data.explicit_inc_to = -1;
-	}
-      if (USE_STORE_POST_INCREMENT (mode) && ! data.reverse && ! data.autinc_to)
-	{
-	  data.to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
-	  data.autinc_to = 1;
-	  data.explicit_inc_to = 1;
-	}
-      if (!data.autinc_to && CONSTANT_P (to_addr))
-	data.to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
+      m_from.decide_autoinc (mode, m_reverse, len);
+      m_to.decide_autoinc (mode, m_reverse, len);
     }
 
   align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
+  m_align = align;
+}
 
-  /* First move what we can in the largest integer mode, then go to
-     successively smaller modes.  */
+/* This function contains the main loop used for expanding a block
+   operation.  First move what we can in the largest integer mode,
+   then go to successively smaller modes.  For every access, call
+   the derived class's generate function on the two operands.  */
 
-  while (max_size > 1 && data.len > 0)
+void
+op_by_pieces_d::run ()
+{
+  while (m_max_size > 1 && m_len > 0)
     {
-      machine_mode mode = widest_int_mode_for_size (max_size);
+      machine_mode mode = widest_int_mode_for_size (m_max_size);
 
       if (mode == VOIDmode)
 	break;
 
-      icode = optab_handler (mov_optab, mode);
-      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
-	move_by_pieces_1 (GEN_FCN (icode), mode, &data);
+      if (prepare_mode (mode, m_align))
+	{
+	  unsigned int size = GET_MODE_SIZE (mode);
+	  rtx to1 = NULL_RTX, from1;
 
-      max_size = GET_MODE_SIZE (mode);
+	  while (m_len >= size)
+	    {
+	      if (m_reverse)
+		m_offset -= size;
+
+	      to1 = m_to.adjust (mode, m_offset);
+	      from1 = m_from.adjust (mode, m_offset);
+
+	      m_to.maybe_predec (-(HOST_WIDE_INT)size);
+	      m_from.maybe_predec (-(HOST_WIDE_INT)size);
+
+	      generate (to1, from1, mode);
+
+	      m_to.maybe_postinc (size);
+	      m_from.maybe_postinc (size);
+
+	      if (!m_reverse)
+		m_offset += size;
+
+	      m_len -= size;
+	    }
+
+	  finish_mode (mode);
+	}
+
+      m_max_size = GET_MODE_SIZE (mode);
     }
 
   /* The code above should have handled everything.  */
-  gcc_assert (!data.len);
+  gcc_assert (!m_len);
+}
+
+/* Derived class from op_by_pieces_d, providing support for block move
+   operations.  */
+
+class move_by_pieces_d : public op_by_pieces_d
+{
+  insn_gen_fn m_gen_fun;
+  void generate (rtx, rtx, machine_mode);
+  bool prepare_mode (machine_mode, unsigned int);
+
+ public:
+  move_by_pieces_d (rtx to, rtx from, unsigned HOST_WIDE_INT len,
+		    unsigned int align)
+    : op_by_pieces_d (to, false, from, true, NULL, NULL, len, align)
+  {
+  }
+  rtx finish_endp (int);
+};
+
+/* Return true if MODE can be used for a set of copies, given an
+   alignment ALIGN.  Prepare whatever data is necessary for later
+   calls to generate.  */
+
+bool
+move_by_pieces_d::prepare_mode (machine_mode mode, unsigned int align)
+{
+  insn_code icode = optab_handler (mov_optab, mode);
+  m_gen_fun = GEN_FCN (icode);
+  return icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode);
+}
+
+/* A callback used when iterating for a move_by_pieces operation.
+   OP0 is the destination and OP1 the source value to be moved in
+   MODE.  If OP0 is NULL, this means we should generate a push.  */
+
+void
+move_by_pieces_d::generate (rtx op0, rtx op1, machine_mode mode)
+{
+#ifdef PUSH_ROUNDING
+  if (op0 == NULL_RTX)
+    {
+      emit_single_push_insn (mode, op1, NULL);
+      return;
+    }
+#endif
+  emit_insn (m_gen_fun (op0, op1));
+}
+
+/* Perform the final adjustment at the end of a string to obtain the
+   correct return value for the block operation.  If ENDP is 1 return
+   memory at the end ala mempcpy, and if ENDP is 2 return memory at
+   the end minus one byte ala stpcpy.  */
+
+rtx
+move_by_pieces_d::finish_endp (int endp)
+{
+  gcc_assert (!m_reverse);
+  if (endp == 2)
+    {
+      m_to.maybe_postinc (-1);
+      --m_offset;
+    }
+  return m_to.adjust (QImode, m_offset);
+}
+
+/* Generate several move instructions to copy LEN bytes from block FROM to
+   block TO.  (These are MEM rtx's with BLKmode).
+
+   If PUSH_ROUNDING is defined and TO is NULL, emit_single_push_insn is
+   used to push FROM to the stack.
+
+   ALIGN is maximum stack alignment we can assume.
+
+   If ENDP is 0 return TO, if ENDP is 1 return memory at the end ala
+   mempcpy, and if ENDP is 2 return memory at the end minus one byte
+   ala stpcpy.  */
+
+rtx
+move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
+		unsigned int align, int endp)
+{
+#ifndef PUSH_ROUNDING
+  if (to == NULL)
+    gcc_unreachable ();
+#endif
+
+  move_by_pieces_d data (to, from, len, align);
+
+  data.run ();
 
   if (endp)
+    return data.finish_endp (endp);
+  else
+    return to;
+}
+
+/* Derived class from op_by_pieces_d, providing support for block store
+   operations.  */
+
+class store_by_pieces_d : public op_by_pieces_d
+{
+  insn_gen_fn m_gen_fun;
+  void generate (rtx, rtx, machine_mode);
+  bool prepare_mode (machine_mode, unsigned int);
+
+ public:
+  store_by_pieces_d (rtx to, by_pieces_constfn cfn, void *cfn_data,
+		     unsigned HOST_WIDE_INT len, unsigned int align)
+    : op_by_pieces_d (to, false, NULL_RTX, true, cfn, cfn_data, len, align)
+  {
+  }
+  rtx finish_endp (int);
+};
+
+/* Return true if MODE can be used for a set of stores, given an
+   alignment ALIGN.  Prepare whatever data is necessary for later
+   calls to generate.  */
+
+bool
+store_by_pieces_d::prepare_mode (machine_mode mode, unsigned int align)
+{
+  insn_code icode = optab_handler (mov_optab, mode);
+  m_gen_fun = GEN_FCN (icode);
+  return icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode);
+}
+
+/* A callback used when iterating for a store_by_pieces operation.
+   OP0 is the destination and OP1 the constant value to be stored
+   in MODE.  */
+
+void
+store_by_pieces_d::generate (rtx op0, rtx op1, machine_mode)
+{
+  emit_insn (m_gen_fun (op0, op1));
+}
+
+/* Perform the final adjustment at the end of a string to obtain the
+   correct return value for the block operation.  If ENDP is 1 return
+   memory at the end ala mempcpy, and if ENDP is 2 return memory at
+   the end minus one byte ala stpcpy.  */
+
+rtx
+store_by_pieces_d::finish_endp (int endp)
+{
+  gcc_assert (!m_reverse);
+  if (endp == 2)
     {
-      rtx to1;
+      m_to.maybe_postinc (-1);
+      --m_offset;
+    }
+  return m_to.adjust (QImode, m_offset);
+}
 
-      gcc_assert (!data.reverse);
-      if (data.autinc_to)
+/* Determine whether the LEN bytes generated by CONSTFUN can be
+   stored to memory using several move instructions.  CONSTFUNDATA is
+   a pointer which will be passed as argument in every CONSTFUN call.
+   ALIGN is maximum alignment we can assume.  MEMSETP is true if this is
+   a memset operation and false if it's a copy of a constant string.
+   Return nonzero if a call to store_by_pieces should succeed.  */
+
+int
+can_store_by_pieces (unsigned HOST_WIDE_INT len,
+		     rtx (*constfun) (void *, HOST_WIDE_INT, machine_mode),
+		     void *constfundata, unsigned int align, bool memsetp)
+{
+  unsigned HOST_WIDE_INT l;
+  unsigned int max_size;
+  HOST_WIDE_INT offset = 0;
+  machine_mode mode;
+  enum insn_code icode;
+  int reverse;
+  /* cst is set but not used if LEGITIMATE_CONSTANT doesn't use it.  */
+  rtx cst ATTRIBUTE_UNUSED;
+
+  if (len == 0)
+    return 1;
+
+  if (!targetm.use_by_pieces_infrastructure_p (len, align,
+					       memsetp
+						 ? SET_BY_PIECES
+						 : STORE_BY_PIECES,
+					       optimize_insn_for_speed_p ()))
+    return 0;
+
+  align = alignment_for_piecewise_move (STORE_MAX_PIECES, align);
+
+  /* We would first store what we can in the largest integer mode, then go to
+     successively smaller modes.  */
+
+  for (reverse = 0;
+       reverse <= (HAVE_PRE_DECREMENT || HAVE_POST_DECREMENT);
+       reverse++)
+    {
+      l = len;
+      max_size = STORE_MAX_PIECES + 1;
+      while (max_size > 1 && l > 0)
 	{
-	  if (endp == 2)
+	  mode = widest_int_mode_for_size (max_size);
+
+	  if (mode == VOIDmode)
+	    break;
+
+	  icode = optab_handler (mov_optab, mode);
+	  if (icode != CODE_FOR_nothing
+	      && align >= GET_MODE_ALIGNMENT (mode))
 	    {
-	      if (HAVE_POST_INCREMENT && data.explicit_inc_to > 0)
-		emit_insn (gen_add2_insn (data.to_addr, constm1_rtx));
-	      else
-		data.to_addr = copy_to_mode_reg (to_addr_mode,
-						 plus_constant (to_addr_mode,
-								data.to_addr,
-								-1));
+	      unsigned int size = GET_MODE_SIZE (mode);
+
+	      while (l >= size)
+		{
+		  if (reverse)
+		    offset -= size;
+
+		  cst = (*constfun) (constfundata, offset, mode);
+		  if (!targetm.legitimate_constant_p (mode, cst))
+		    return 0;
+
+		  if (!reverse)
+		    offset += size;
+
+		  l -= size;
+		}
 	    }
-	  to1 = adjust_automodify_address (data.to, QImode, data.to_addr,
-					   data.offset);
-	}
-      else
-	{
-	  if (endp == 2)
-	    --data.offset;
-	  to1 = adjust_address (data.to, QImode, data.offset);
+
+	  max_size = GET_MODE_SIZE (mode);
 	}
-      return to1;
+
+      /* The code above should have handled everything.  */
+      gcc_assert (!l);
+    }
+
+  return 1;
+}
+
+/* Generate several move instructions to store LEN bytes generated by
+   CONSTFUN to block TO.  (A MEM rtx with BLKmode).  CONSTFUNDATA is a
+   pointer which will be passed as argument in every CONSTFUN call.
+   ALIGN is maximum alignment we can assume.  MEMSETP is true if this is
+   a memset operation and false if it's a copy of a constant string.
+   If ENDP is 0 return TO, if ENDP is 1 return memory at the end ala
+   mempcpy, and if ENDP is 2 return memory at the end minus one byte
+   ala stpcpy.  */
+
+rtx
+store_by_pieces (rtx to, unsigned HOST_WIDE_INT len,
+		 rtx (*constfun) (void *, HOST_WIDE_INT, machine_mode),
+		 void *constfundata, unsigned int align, bool memsetp, int endp)
+{
+  if (len == 0)
+    {
+      gcc_assert (endp != 2);
+      return to;
     }
+
+  gcc_assert (targetm.use_by_pieces_infrastructure_p
+		(len, align,
+		 memsetp ? SET_BY_PIECES : STORE_BY_PIECES,
+		 optimize_insn_for_speed_p ()));
+
+  store_by_pieces_d data (to, constfun, constfundata, len, align);
+  data.run ();
+
+  if (endp)
+    return data.finish_endp (endp);
   else
-    return data.to;
+    return to;
 }
 
-/* Return number of insns required to move L bytes by pieces.
-   ALIGN (in bits) is maximum alignment we can assume.  */
+/* Callback routine for clear_by_pieces.
+   Return const0_rtx unconditionally.  */
 
-unsigned HOST_WIDE_INT
-move_by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align,
-		       unsigned int max_size)
+static rtx
+clear_by_pieces_1 (void *, HOST_WIDE_INT, machine_mode)
 {
-  unsigned HOST_WIDE_INT n_insns = 0;
+  return const0_rtx;
+}
 
-  align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
+/* Generate several move instructions to clear LEN bytes of block TO.  (A MEM
+   rtx with BLKmode).  ALIGN is maximum alignment we can assume.  */
 
-  while (max_size > 1 && l > 0)
-    {
-      machine_mode mode;
-      enum insn_code icode;
+static void
+clear_by_pieces (rtx to, unsigned HOST_WIDE_INT len, unsigned int align)
+{
+  if (len == 0)
+    return;
 
-      mode = widest_int_mode_for_size (max_size);
+  store_by_pieces_d data (to, clear_by_pieces_1, NULL, len, align);
+  data.run ();
+}
 
-      if (mode == VOIDmode)
-	break;
+/* Derived class from op_by_pieces_d, providing support for block compare
+   operations.  It stores the fail label to jump to in case of
+   miscomparison, and, for branch ratios greater than 1, an accumulator
+   and the current and maximum counts before emitting another branch.  */
 
-      icode = optab_handler (mov_optab, mode);
-      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
-	n_insns += l / GET_MODE_SIZE (mode), l %= GET_MODE_SIZE (mode);
+class compare_by_pieces_d : public op_by_pieces_d
+{
+  rtx_code_label *m_fail_label;
+  rtx m_accumulator;
+  int m_count, m_batch;
 
-      max_size = GET_MODE_SIZE (mode);
+  void generate (rtx, rtx, machine_mode);
+  bool prepare_mode (machine_mode, unsigned int);
+  void finish_mode (machine_mode);
+ public:
+  compare_by_pieces_d (rtx op0, rtx op1, by_pieces_constfn op1_cfn,
+		       void *op1_cfn_data, HOST_WIDE_INT len, int align,
+		       rtx_code_label *fail_label)
+    : op_by_pieces_d (op0, true, op1, true, op1_cfn, op1_cfn_data, len, align)
+  {
+    m_fail_label = fail_label;
+  }
+};
+
+/* A callback used when iterating for a compare_by_pieces operation.
+   OP0 and OP1 are the values that have been loaded and should be
+   compared in MODE.  With a branch ratio greater than 1, the piece
+   differences are accumulated until a branch is due.  */
+
+void
+compare_by_pieces_d::generate (rtx op0, rtx op1, machine_mode mode)
+{
+  if (m_batch > 1)
+    {
+      rtx temp = expand_binop (mode, sub_optab, op0, op1, NULL_RTX,
+			       true, OPTAB_LIB_WIDEN);
+      if (m_count != 0)
+	temp = expand_binop (mode, ior_optab, m_accumulator, temp, temp,
+			     true, OPTAB_LIB_WIDEN);
+      m_accumulator = temp;
+
+      if (++m_count < m_batch)
+	return;
+
+      m_count = 0;
+      op0 = m_accumulator;
+      op1 = const0_rtx;
+      m_accumulator = NULL_RTX;
     }
+  do_compare_rtx_and_jump (op0, op1, NE, true, mode, NULL_RTX, NULL,
+			   m_fail_label, -1);
+}
 
-  gcc_assert (!l);
-  return n_insns;
+/* Return true if MODE can be used for a set of moves and comparisons,
+   given an alignment ALIGN.  Prepare whatever data is necessary for
+   later calls to generate.  */
+
+bool
+compare_by_pieces_d::prepare_mode (machine_mode mode, unsigned int align)
+{
+  insn_code icode = optab_handler (mov_optab, mode);
+  if (icode == CODE_FOR_nothing
+      || align < GET_MODE_ALIGNMENT (mode)
+      || !can_compare_p (EQ, mode, ccp_jump))
+    return false;
+  m_batch = targetm.compare_by_pieces_branch_ratio (mode);
+  if (m_batch < 0)
+    return false;
+  m_accumulator = NULL_RTX;
+  m_count = 0;
+  return true;
 }
 
-/* Subroutine of move_by_pieces.  Move as many bytes as appropriate
-   with move instructions for mode MODE.  GENFUN is the gen_... function
-   to make a move insn for that mode.  DATA has all the other info.  */
+/* Called after expanding a series of comparisons in MODE.  If we have
+   accumulated results for which we haven't emitted a branch yet, do
+   so now.  */
 
-static void
-move_by_pieces_1 (insn_gen_fn genfun, machine_mode mode,
-		  struct move_by_pieces_d *data)
+void
+compare_by_pieces_d::finish_mode (machine_mode mode)
 {
-  unsigned int size = GET_MODE_SIZE (mode);
-  rtx to1 = NULL_RTX, from1;
+  if (m_accumulator != NULL_RTX)
+    do_compare_rtx_and_jump (m_accumulator, const0_rtx, NE, true, mode,
+			     NULL_RTX, NULL, m_fail_label, -1);
+}
 
-  while (data->len >= size)
-    {
-      if (data->reverse)
-	data->offset -= size;
+/* Generate several move instructions to compare LEN bytes from blocks
+   ARG0 and ARG1.  (These are MEM rtx's with BLKmode).
 
-      if (data->to)
-	{
-	  if (data->autinc_to)
-	    to1 = adjust_automodify_address (data->to, mode, data->to_addr,
-					     data->offset);
-	  else
-	    to1 = adjust_address (data->to, mode, data->offset);
-	}
+   The result is placed in TARGET if that is a suitable pseudo register,
+   otherwise in a newly created pseudo: zero if the blocks compare equal,
+   one otherwise.
 
-      if (data->autinc_from)
-	from1 = adjust_automodify_address (data->from, mode, data->from_addr,
-					   data->offset);
-      else
-	from1 = adjust_address (data->from, mode, data->offset);
+   ALIGN is the maximum alignment we can assume for both blocks.
 
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
-	emit_insn (gen_add2_insn (data->to_addr,
-				  gen_int_mode (-(HOST_WIDE_INT) size,
-						GET_MODE (data->to_addr))));
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_from < 0)
-	emit_insn (gen_add2_insn (data->from_addr,
-				  gen_int_mode (-(HOST_WIDE_INT) size,
-						GET_MODE (data->from_addr))));
+   Optionally, the caller can pass a constfn and associated data in A1_CFN
+   and A1_CFN_DATA, describing that the second operand being compared is a
+   known constant and how to obtain its data.  */
 
-      if (data->to)
-	emit_insn ((*genfun) (to1, from1));
-      else
-	{
-#ifdef PUSH_ROUNDING
-	  emit_single_push_insn (mode, from1, NULL);
-#else
-	  gcc_unreachable ();
-#endif
-	}
+static rtx
+compare_by_pieces (rtx arg0, rtx arg1, unsigned HOST_WIDE_INT len,
+		   rtx target, unsigned int align,
+		   by_pieces_constfn a1_cfn, void *a1_cfn_data)
+{
+  rtx_code_label *fail_label = gen_label_rtx ();
+  rtx_code_label *end_label = gen_label_rtx ();
 
-      if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
-	emit_insn (gen_add2_insn (data->to_addr,
-				  gen_int_mode (size,
-						GET_MODE (data->to_addr))));
-      if (HAVE_POST_INCREMENT && data->explicit_inc_from > 0)
-	emit_insn (gen_add2_insn (data->from_addr,
-				  gen_int_mode (size,
-						GET_MODE (data->from_addr))));
+  if (target == NULL_RTX
+      || !REG_P (target) || REGNO (target) < FIRST_PSEUDO_REGISTER)
+    target = gen_reg_rtx (TYPE_MODE (integer_type_node));
 
-      if (! data->reverse)
-	data->offset += size;
+  compare_by_pieces_d data (arg0, arg1, a1_cfn, a1_cfn_data, len, align,
+			    fail_label);
 
-      data->len -= size;
-    }
+  data.run ();
+
+  emit_move_insn (target, const0_rtx);
+  emit_jump (end_label);
+  emit_barrier ();
+  emit_label (fail_label);
+  emit_move_insn (target, const1_rtx);
+  emit_label (end_label);
+
+  return target;
 }
 \f
 /* Emit code to move a block Y to a block X.  This may be done with
@@ -1066,8 +1550,7 @@ emit_block_move_hints (rtx x, rtx y, rtx
   unsigned int align;
 
   gcc_assert (size);
-  if (CONST_INT_P (size)
-      && INTVAL (size) == 0)
+  if (CONST_INT_P (size) && INTVAL (size) == 0)
     return 0;
 
   switch (method)
@@ -1394,6 +1877,99 @@ emit_block_op_via_libcall (enum built_in
 
   return expand_call (call_expr, NULL_RTX, false);
 }
+
+/* Try to expand cmpstrn or cmpmem operation ICODE with the given operands.
+   ARG3_TYPE is the type of ARG3_RTX.  Return the result rtx on success,
+   otherwise return null.  */
+
+rtx
+expand_cmpstrn_or_cmpmem (insn_code icode, rtx target, rtx arg1_rtx,
+			  rtx arg2_rtx, tree arg3_type, rtx arg3_rtx,
+			  HOST_WIDE_INT align)
+{
+  machine_mode insn_mode = insn_data[icode].operand[0].mode;
+
+  if (target && (!REG_P (target) || HARD_REGISTER_P (target)))
+    target = NULL_RTX;
+
+  struct expand_operand ops[5];
+  create_output_operand (&ops[0], target, insn_mode);
+  create_fixed_operand (&ops[1], arg1_rtx);
+  create_fixed_operand (&ops[2], arg2_rtx);
+  create_convert_operand_from (&ops[3], arg3_rtx, TYPE_MODE (arg3_type),
+			       TYPE_UNSIGNED (arg3_type));
+  create_integer_operand (&ops[4], align);
+  if (maybe_expand_insn (icode, 5, ops))
+    return ops[0].value;
+  return NULL_RTX;
+}
+
+/* Expand a block compare between X and Y with length LEN using the
+   cmpmem optab, placing the result in TARGET.  LEN_TYPE is the type
+   of the expression that was used to calculate the length.  ALIGN
+   gives the known minimum common alignment.  */
+
+static rtx
+emit_block_cmp_via_cmpmem (rtx x, rtx y, rtx len, tree len_type, rtx target,
+			   unsigned align)
+{
+  /* Note: The cmpstrnsi pattern, if it exists, is not suitable for
+     implementing memcmp because it will stop if it encounters two
+     zero bytes.  */
+  insn_code icode = direct_optab_handler (cmpmem_optab, SImode);
+
+  if (icode == CODE_FOR_nothing)
+    return NULL_RTX;
+
+  return expand_cmpstrn_or_cmpmem (icode, target, x, y, len_type, len, align);
+}
+
+/* Emit code to compare a block Y to a block X.  This may be done with
+   string-compare instructions, with multiple scalar instructions,
+   or with a library call.
+
+   Both X and Y must be MEM rtx's.  LEN is an rtx that says how long
+   they are.  LEN_TYPE is the type of the expression that was used to
+   calculate it.
+
+   If EQUALITY_ONLY is true, it means we don't have to return the tri-state
+   value of a normal memcmp call; instead we can just compare for equality.
+
+   Optionally, the caller can pass a constfn and associated data in Y_CFN
+   and Y_CFN_DATA, describing that the second operand being compared is a
+   known constant and how to obtain its data.
+   Return the result of the comparison, or NULL_RTX if we failed to
+   perform the operation.  */
+
+rtx
+emit_block_cmp_hints (rtx x, rtx y, rtx len, tree len_type, rtx target,
+		      bool equality_only, by_pieces_constfn y_cfn,
+		      void *y_cfndata)
+{
+  rtx result = 0;
+
+  if (CONST_INT_P (len) && INTVAL (len) == 0)
+    return const0_rtx;
+
+  gcc_assert (MEM_P (x) && MEM_P (y));
+  unsigned int align = MIN (MEM_ALIGN (x), MEM_ALIGN (y));
+  gcc_assert (align >= BITS_PER_UNIT);
+
+  x = adjust_address (x, BLKmode, 0);
+  y = adjust_address (y, BLKmode, 0);
+
+  if (equality_only
+      && CONST_INT_P (len)
+      && can_do_by_pieces (INTVAL (len), align, COMPARE_BY_PIECES))
+    result = compare_by_pieces (x, y, INTVAL (len), target, align,
+				y_cfn, y_cfndata);
+  else
+    result = emit_block_cmp_via_cmpmem (x, y, len, len_type, target, align);
+
+  return result;
+}
 \f
 /* Copy all or part of a value X into registers starting at REGNO.
    The number of registers to be filled is NREGS.  */
@@ -2330,308 +2906,6 @@ get_def_for_expr_class (tree name, enum
   return def_stmt;
 }
 \f
-
-/* Determine whether the LEN bytes generated by CONSTFUN can be
-   stored to memory using several move instructions.  CONSTFUNDATA is
-   a pointer which will be passed as argument in every CONSTFUN call.
-   ALIGN is maximum alignment we can assume.  MEMSETP is true if this is
-   a memset operation and false if it's a copy of a constant string.
-   Return nonzero if a call to store_by_pieces should succeed.  */
-
-int
-can_store_by_pieces (unsigned HOST_WIDE_INT len,
-		     rtx (*constfun) (void *, HOST_WIDE_INT, machine_mode),
-		     void *constfundata, unsigned int align, bool memsetp)
-{
-  unsigned HOST_WIDE_INT l;
-  unsigned int max_size;
-  HOST_WIDE_INT offset = 0;
-  machine_mode mode;
-  enum insn_code icode;
-  int reverse;
-  /* cst is set but not used if LEGITIMATE_CONSTANT doesn't use it.  */
-  rtx cst ATTRIBUTE_UNUSED;
-
-  if (len == 0)
-    return 1;
-
-  if (!targetm.use_by_pieces_infrastructure_p (len, align,
-					       memsetp
-						 ? SET_BY_PIECES
-						 : STORE_BY_PIECES,
-					       optimize_insn_for_speed_p ()))
-    return 0;
-
-  align = alignment_for_piecewise_move (STORE_MAX_PIECES, align);
-
-  /* We would first store what we can in the largest integer mode, then go to
-     successively smaller modes.  */
-
-  for (reverse = 0;
-       reverse <= (HAVE_PRE_DECREMENT || HAVE_POST_DECREMENT);
-       reverse++)
-    {
-      l = len;
-      max_size = STORE_MAX_PIECES + 1;
-      while (max_size > 1 && l > 0)
-	{
-	  mode = widest_int_mode_for_size (max_size);
-
-	  if (mode == VOIDmode)
-	    break;
-
-	  icode = optab_handler (mov_optab, mode);
-	  if (icode != CODE_FOR_nothing
-	      && align >= GET_MODE_ALIGNMENT (mode))
-	    {
-	      unsigned int size = GET_MODE_SIZE (mode);
-
-	      while (l >= size)
-		{
-		  if (reverse)
-		    offset -= size;
-
-		  cst = (*constfun) (constfundata, offset, mode);
-		  if (!targetm.legitimate_constant_p (mode, cst))
-		    return 0;
-
-		  if (!reverse)
-		    offset += size;
-
-		  l -= size;
-		}
-	    }
-
-	  max_size = GET_MODE_SIZE (mode);
-	}
-
-      /* The code above should have handled everything.  */
-      gcc_assert (!l);
-    }
-
-  return 1;
-}
-
-/* Generate several move instructions to store LEN bytes generated by
-   CONSTFUN to block TO.  (A MEM rtx with BLKmode).  CONSTFUNDATA is a
-   pointer which will be passed as argument in every CONSTFUN call.
-   ALIGN is maximum alignment we can assume.  MEMSETP is true if this is
-   a memset operation and false if it's a copy of a constant string.
-   If ENDP is 0 return to, if ENDP is 1 return memory at the end ala
-   mempcpy, and if ENDP is 2 return memory the end minus one byte ala
-   stpcpy.  */
-
-rtx
-store_by_pieces (rtx to, unsigned HOST_WIDE_INT len,
-		 rtx (*constfun) (void *, HOST_WIDE_INT, machine_mode),
-		 void *constfundata, unsigned int align, bool memsetp, int endp)
-{
-  machine_mode to_addr_mode = get_address_mode (to);
-  struct store_by_pieces_d data;
-
-  if (len == 0)
-    {
-      gcc_assert (endp != 2);
-      return to;
-    }
-
-  gcc_assert (targetm.use_by_pieces_infrastructure_p
-		(len, align,
-		 memsetp
-		   ? SET_BY_PIECES
-		   : STORE_BY_PIECES,
-		 optimize_insn_for_speed_p ()));
-
-  data.constfun = constfun;
-  data.constfundata = constfundata;
-  data.len = len;
-  data.to = to;
-  store_by_pieces_1 (&data, align);
-  if (endp)
-    {
-      rtx to1;
-
-      gcc_assert (!data.reverse);
-      if (data.autinc_to)
-	{
-	  if (endp == 2)
-	    {
-	      if (HAVE_POST_INCREMENT && data.explicit_inc_to > 0)
-		emit_insn (gen_add2_insn (data.to_addr, constm1_rtx));
-	      else
-		data.to_addr = copy_to_mode_reg (to_addr_mode,
-						 plus_constant (to_addr_mode,
-								data.to_addr,
-								-1));
-	    }
-	  to1 = adjust_automodify_address (data.to, QImode, data.to_addr,
-					   data.offset);
-	}
-      else
-	{
-	  if (endp == 2)
-	    --data.offset;
-	  to1 = adjust_address (data.to, QImode, data.offset);
-	}
-      return to1;
-    }
-  else
-    return data.to;
-}
-
-/* Generate several move instructions to clear LEN bytes of block TO.  (A MEM
-   rtx with BLKmode).  ALIGN is maximum alignment we can assume.  */
-
-static void
-clear_by_pieces (rtx to, unsigned HOST_WIDE_INT len, unsigned int align)
-{
-  struct store_by_pieces_d data;
-
-  if (len == 0)
-    return;
-
-  data.constfun = clear_by_pieces_1;
-  data.constfundata = NULL;
-  data.len = len;
-  data.to = to;
-  store_by_pieces_1 (&data, align);
-}
-
-/* Callback routine for clear_by_pieces.
-   Return const0_rtx unconditionally.  */
-
-static rtx
-clear_by_pieces_1 (void *data ATTRIBUTE_UNUSED,
-		   HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
-		   machine_mode mode ATTRIBUTE_UNUSED)
-{
-  return const0_rtx;
-}
-
-/* Subroutine of clear_by_pieces and store_by_pieces.
-   Generate several move instructions to store LEN bytes of block TO.  (A MEM
-   rtx with BLKmode).  ALIGN is maximum alignment we can assume.  */
-
-static void
-store_by_pieces_1 (struct store_by_pieces_d *data ATTRIBUTE_UNUSED,
-		   unsigned int align ATTRIBUTE_UNUSED)
-{
-  machine_mode to_addr_mode = get_address_mode (data->to);
-  rtx to_addr = XEXP (data->to, 0);
-  unsigned int max_size = STORE_MAX_PIECES + 1;
-  enum insn_code icode;
-
-  data->offset = 0;
-  data->to_addr = to_addr;
-  data->autinc_to
-    = (GET_CODE (to_addr) == PRE_INC || GET_CODE (to_addr) == PRE_DEC
-       || GET_CODE (to_addr) == POST_INC || GET_CODE (to_addr) == POST_DEC);
-
-  data->explicit_inc_to = 0;
-  data->reverse
-    = (GET_CODE (to_addr) == PRE_DEC || GET_CODE (to_addr) == POST_DEC);
-  if (data->reverse)
-    data->offset = data->len;
-
-  /* If storing requires more than two move insns,
-     copy addresses to registers (to make displacements shorter)
-     and use post-increment if available.  */
-  if (!data->autinc_to
-      && move_by_pieces_ninsns (data->len, align, max_size) > 2)
-    {
-      /* Determine the main mode we'll be using.
-	 MODE might not be used depending on the definitions of the
-	 USE_* macros below.  */
-      machine_mode mode ATTRIBUTE_UNUSED
-	= widest_int_mode_for_size (max_size);
-
-      if (USE_STORE_PRE_DECREMENT (mode) && data->reverse && ! data->autinc_to)
-	{
-	  data->to_addr = copy_to_mode_reg (to_addr_mode,
-					    plus_constant (to_addr_mode,
-							   to_addr,
-							   data->len));
-	  data->autinc_to = 1;
-	  data->explicit_inc_to = -1;
-	}
-
-      if (USE_STORE_POST_INCREMENT (mode) && ! data->reverse
-	  && ! data->autinc_to)
-	{
-	  data->to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
-	  data->autinc_to = 1;
-	  data->explicit_inc_to = 1;
-	}
-
-      if ( !data->autinc_to && CONSTANT_P (to_addr))
-	data->to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
-    }
-
-  align = alignment_for_piecewise_move (STORE_MAX_PIECES, align);
-
-  /* First store what we can in the largest integer mode, then go to
-     successively smaller modes.  */
-
-  while (max_size > 1 && data->len > 0)
-    {
-      machine_mode mode = widest_int_mode_for_size (max_size);
-
-      if (mode == VOIDmode)
-	break;
-
-      icode = optab_handler (mov_optab, mode);
-      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
-	store_by_pieces_2 (GEN_FCN (icode), mode, data);
-
-      max_size = GET_MODE_SIZE (mode);
-    }
-
-  /* The code above should have handled everything.  */
-  gcc_assert (!data->len);
-}
-
-/* Subroutine of store_by_pieces_1.  Store as many bytes as appropriate
-   with move instructions for mode MODE.  GENFUN is the gen_... function
-   to make a move insn for that mode.  DATA has all the other info.  */
-
-static void
-store_by_pieces_2 (insn_gen_fn genfun, machine_mode mode,
-		   struct store_by_pieces_d *data)
-{
-  unsigned int size = GET_MODE_SIZE (mode);
-  rtx to1, cst;
-
-  while (data->len >= size)
-    {
-      if (data->reverse)
-	data->offset -= size;
-
-      if (data->autinc_to)
-	to1 = adjust_automodify_address (data->to, mode, data->to_addr,
-					 data->offset);
-      else
-	to1 = adjust_address (data->to, mode, data->offset);
-
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
-	emit_insn (gen_add2_insn (data->to_addr,
-				  gen_int_mode (-(HOST_WIDE_INT) size,
-						GET_MODE (data->to_addr))));
-
-      cst = (*data->constfun) (data->constfundata, data->offset, mode);
-      emit_insn ((*genfun) (to1, cst));
-
-      if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
-	emit_insn (gen_add2_insn (data->to_addr,
-				  gen_int_mode (size,
-						GET_MODE (data->to_addr))));
-
-      if (! data->reverse)
-	data->offset += size;
-
-      data->len -= size;
-    }
-}
-\f
 /* Write zeros through the storage of OBJECT.  If OBJECT has BLKmode, SIZE is
    its length in bytes.  */
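
A hypothetical port override of the hook consulted by prepare_mode
above might look like this (example thresholds only, not from any
real port, which would also have to define
TARGET_COMPARE_BY_PIECES_BRANCH_RATIO accordingly):

static int
example_compare_by_pieces_branch_ratio (machine_mode mode)
{
  if (GET_MODE_SIZE (mode) > UNITS_PER_WORD)
    return -1;                /* forbid wider-than-word pieces */
  if (GET_MODE_SIZE (mode) == UNITS_PER_WORD)
    return 2;                 /* merge two piece compares per branch */
  return 1;                   /* default: one branch per piece */
}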
 
Index: gcc/expr.h
===================================================================
--- gcc/expr.h	(revision 236909)
+++ gcc/expr.h	(working copy)
@@ -103,12 +103,16 @@ enum block_op_methods
   BLOCK_OP_TAILCALL
 };
 
+typedef rtx (*by_pieces_constfn) (void *, HOST_WIDE_INT, machine_mode);
+
 extern rtx emit_block_move (rtx, rtx, rtx, enum block_op_methods);
 extern rtx emit_block_move_hints (rtx, rtx, rtx, enum block_op_methods,
 			          unsigned int, HOST_WIDE_INT,
 				  unsigned HOST_WIDE_INT,
 				  unsigned HOST_WIDE_INT,
 				  unsigned HOST_WIDE_INT);
+extern rtx emit_block_cmp_hints (rtx, rtx, rtx, tree, rtx, bool,
+				 by_pieces_constfn, void *);
 extern bool emit_storent_insn (rtx to, rtx from);
 
 /* Copy all or part of a value X into registers starting at REGNO.
@@ -173,6 +177,11 @@ extern void use_regs (rtx *, int, int);
 /* Mark a PARALLEL as holding a parameter for the next CALL_INSN.  */
 extern void use_group_regs (rtx *, rtx);
 
+#ifdef GCC_INSN_CODES_H
+extern rtx expand_cmpstrn_or_cmpmem (insn_code, rtx, rtx, rtx, tree, rtx,
+				     HOST_WIDE_INT);
+#endif
+
 /* Write zeros through the storage of OBJECT.
    If OBJECT has BLKmode, SIZE is its length in bytes.  */
 extern rtx clear_storage (rtx, rtx, enum block_op_methods);
@@ -191,10 +200,6 @@ extern bool set_storage_via_setmem (rtx,
 				    unsigned HOST_WIDE_INT,
 				    unsigned HOST_WIDE_INT);
 
-extern unsigned HOST_WIDE_INT move_by_pieces_ninsns (unsigned HOST_WIDE_INT,
-						     unsigned int,
-						     unsigned int);
-
 /* Return nonzero if it is desirable to store LEN bytes generated by
    CONSTFUN with several move instructions by store_by_pieces
    function.  CONSTFUNDATA is a pointer which will be passed as argument
@@ -203,8 +208,7 @@ extern unsigned HOST_WIDE_INT move_by_pi
    MEMSETP is true if this is a real memset/bzero, not a copy
    of a const string.  */
 extern int can_store_by_pieces (unsigned HOST_WIDE_INT,
-				rtx (*) (void *, HOST_WIDE_INT,
-					 machine_mode),
+				by_pieces_constfn,
 				void *, unsigned int, bool);
 
 /* Generate several move instructions to store LEN bytes generated by
@@ -213,8 +217,7 @@ extern int can_store_by_pieces (unsigned
    ALIGN is maximum alignment we can assume.
    MEMSETP is true if this is a real memset/bzero, not a copy.
    Returns TO + LEN.  */
-extern rtx store_by_pieces (rtx, unsigned HOST_WIDE_INT,
-			    rtx (*) (void *, HOST_WIDE_INT, machine_mode),
+extern rtx store_by_pieces (rtx, unsigned HOST_WIDE_INT, by_pieces_constfn,
 			    void *, unsigned int, bool, int);
 
 /* Emit insns to set X from Y.  */
@@ -295,7 +298,7 @@ rtx get_personality_function (tree);
 /* Determine whether the LEN bytes can be moved by using several move
    instructions.  Return nonzero if a call to move_by_pieces should
    succeed.  */
-extern int can_move_by_pieces (unsigned HOST_WIDE_INT, unsigned int);
+extern bool can_move_by_pieces (unsigned HOST_WIDE_INT, unsigned int);
 
 extern unsigned HOST_WIDE_INT highest_pow2_factor (const_tree);
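
For illustration, a by_pieces_constfn as declared above is just a
callback that produces the constant data for one piece, given an offset
and a mode.  A minimal sketch, modeled on the existing read-str helpers
in builtins.c (the function name here is made up):

  static rtx
  sample_read_str (void *data, HOST_WIDE_INT offset, machine_mode mode)
  {
    const char *str = (const char *) data;
    /* c_readstr builds a MODE-sized immediate from the bytes at
       STR + OFFSET.  */
    return c_readstr (str + offset, mode);
  }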
 
Index: gcc/target.def
===================================================================
--- gcc/target.def	(revision 236909)
+++ gcc/target.def	(working copy)
@@ -3397,8 +3397,9 @@ Both @var{size} and @var{alignment} are
 units.\n\
 \n\
 The parameter @var{op} is one of: @code{CLEAR_BY_PIECES},\n\
-@code{MOVE_BY_PIECES}, @code{SET_BY_PIECES}, @code{STORE_BY_PIECES}.\n\
-These describe the type of memory operation under consideration.\n\
+@code{MOVE_BY_PIECES}, @code{SET_BY_PIECES}, @code{STORE_BY_PIECES} or\n\
+@code{COMPARE_BY_PIECES}.  These describe the type of memory operation\n\
+under consideration.\n\
 \n\
 The parameter @var{speed_p} is true if the code is currently being\n\
 optimized for speed rather than size.\n\
@@ -3418,6 +3419,18 @@ move would be greater than that of a lib
  default_use_by_pieces_infrastructure_p)
 
 DEFHOOK
+(compare_by_pieces_branch_ratio,
+ "When expanding a block comparison in MODE, gcc can try to reduce the\n\
+number of branches at the expense of more memory operations.  This hook\n\
+allows the target to override the default choice.  It should return the\n\
+factor by which branches should be reduced over the plain expansion with\n\
+one comparison per @var{mode}-sized piece.  A port can also prevent a\n\
+particular mode from being used for block comparisons by returning a\n\
+negative number from this hook.",
+ int, (machine_mode mode),
+ default_compare_by_pieces_branch_ratio)
+
+DEFHOOK
 (optab_supported_p,
  "Return true if the optimizers should use optab @var{op} with\n\
 modes @var{mode1} and @var{mode2} for optimization type @var{opt_type}.\n\
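
To make the intended use concrete, a port that wants fewer branches for
word-sized pieces could define the new hook roughly as follows (a sketch
for a hypothetical target; all other modes keep the default):

  static int
  example_compare_by_pieces_branch_ratio (machine_mode mode)
  {
    /* Combine four word_mode pieces per branch; returning a negative
       number instead would exclude the mode from block comparisons.  */
    if (mode == word_mode)
      return 4;
    return default_compare_by_pieces_branch_ratio (mode);
  }

  #undef TARGET_COMPARE_BY_PIECES_BRANCH_RATIO
  #define TARGET_COMPARE_BY_PIECES_BRANCH_RATIO \
    example_compare_by_pieces_branch_ratio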
Index: gcc/target.h
===================================================================
--- gcc/target.h	(revision 236909)
+++ gcc/target.h	(working copy)
@@ -79,16 +79,23 @@ enum print_switch_type
 };
 
 /* Types of memory operation understood by the "by_pieces" infrastructure.
-   Used by the TARGET_USE_BY_PIECES_INFRASTRUCTURE_P target hook.  */
+   Used by the TARGET_USE_BY_PIECES_INFRASTRUCTURE_P target hook and
+   internally by the functions in expr.c.  */
 
 enum by_pieces_operation
 {
   CLEAR_BY_PIECES,
   MOVE_BY_PIECES,
   SET_BY_PIECES,
-  STORE_BY_PIECES
+  STORE_BY_PIECES,
+  COMPARE_BY_PIECES
 };
 
+extern unsigned HOST_WIDE_INT by_pieces_ninsns (unsigned HOST_WIDE_INT,
+						unsigned int,
+						unsigned int,
+						by_pieces_operation);
+
 typedef int (* print_switch_fn_type) (print_switch_type, const char *);
 
 /* An example implementation for ELF targets.  Defined in varasm.c  */
Index: gcc/targhooks.c
===================================================================
--- gcc/targhooks.c	(revision 236909)
+++ gcc/targhooks.c	(working copy)
@@ -1482,25 +1482,40 @@ default_use_by_pieces_infrastructure_p (
 
   switch (op)
     {
-      case CLEAR_BY_PIECES:
-	max_size = STORE_MAX_PIECES;
-	ratio = CLEAR_RATIO (speed_p);
-	break;
-      case MOVE_BY_PIECES:
-	max_size = MOVE_MAX_PIECES;
-	ratio = get_move_ratio (speed_p);
-	break;
-      case SET_BY_PIECES:
-	max_size = STORE_MAX_PIECES;
-	ratio = SET_RATIO (speed_p);
-	break;
-      case STORE_BY_PIECES:
-	max_size = STORE_MAX_PIECES;
-	ratio = get_move_ratio (speed_p);
-	break;
+    case CLEAR_BY_PIECES:
+      max_size = STORE_MAX_PIECES;
+      ratio = CLEAR_RATIO (speed_p);
+      break;
+    case MOVE_BY_PIECES:
+      max_size = MOVE_MAX_PIECES;
+      ratio = get_move_ratio (speed_p);
+      break;
+    case SET_BY_PIECES:
+      max_size = STORE_MAX_PIECES;
+      ratio = SET_RATIO (speed_p);
+      break;
+    case STORE_BY_PIECES:
+      max_size = STORE_MAX_PIECES;
+      ratio = get_move_ratio (speed_p);
+      break;
+    case COMPARE_BY_PIECES:
+      max_size = COMPARE_MAX_PIECES;
+      /* Pick a likely default, just as in get_move_ratio.  */
+      ratio = speed_p ? 15 : 3;
+      break;
     }
 
-  return move_by_pieces_ninsns (size, alignment, max_size + 1) < ratio;
+  return by_pieces_ninsns (size, alignment, max_size + 1, op) < ratio;
+}
+
+/* This hook controls code generation for expanding a memcmp operation by
+   pieces.  Return 1 for the normal pattern of compare/jump after each pair
+   of loads, or a higher number to reduce the number of branches.  */
+
+int
+default_compare_by_pieces_branch_ratio (machine_mode)
+{
+  return 1;
 }
 
 bool
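
The effect of a branch ratio above 1 is easiest to see at the source
level.  With a ratio of 2 and word-sized pieces, the expander can
combine two piece differences before testing, roughly along these lines
(a hand-written sketch of the intended code shape, not the RTL the
expander actually emits):

  /* Equality-only comparison of two word_mode pieces, one branch.  */
  unsigned long d0 = ((const unsigned long *) p)[0]
		     ^ ((const unsigned long *) q)[0];
  unsigned long d1 = ((const unsigned long *) p)[1]
		     ^ ((const unsigned long *) q)[1];
  if ((d0 | d1) != 0)
    goto differ;

This trades extra loads and logical operations for fewer branches,
which is why it is exposed as a per-target tuning knob.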
Index: gcc/targhooks.h
===================================================================
--- gcc/targhooks.h	(revision 236909)
+++ gcc/targhooks.h	(working copy)
@@ -199,6 +199,7 @@ extern bool default_use_by_pieces_infras
 						    unsigned int,
 						    enum by_pieces_operation,
 						    bool);
+extern int default_compare_by_pieces_branch_ratio (machine_mode);
 
 extern bool default_profile_before_prologue (void);
 extern reg_class_t default_preferred_reload_class (rtx, reg_class_t);
Index: gcc/testsuite/gcc.dg/pr52171.c
===================================================================
--- gcc/testsuite/gcc.dg/pr52171.c	(revision 0)
+++ gcc/testsuite/gcc.dg/pr52171.c	(working copy)
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+/* { dg-final { scan-assembler-not "memcmp" } } */
+#include <string.h>
+struct A { int x; } a, b;
+
+extern char s[], t[];
+
+int foo ()
+{
+  return memcmp (&a, &b, sizeof (struct A)) == 0;
+}
Index: gcc/testsuite/gcc.target/i386/pr52171.c
===================================================================
--- gcc/testsuite/gcc.target/i386/pr52171.c	(revision 0)
+++ gcc/testsuite/gcc.target/i386/pr52171.c	(working copy)
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+/* { dg-final { scan-assembler-not "memcmp" } } */
+/* { dg-final { scan-assembler "1752394086" } } */
+
+/* This should turn into four compare/jump pairs with -m32, within the
+   limit of what the tuning considers acceptable for -O2.  */
+int cmp (char *p, char *q)
+{
+  char *pa = __builtin_assume_aligned (p, 4);
+  char *qa = __builtin_assume_aligned (q, 4);
+  if (__builtin_memcmp (pa, qa, 16) != 0)
+    return 1;
+  return 0;
+}
+/* Since we have fast unaligned access, we should make a single
+   constant comparison.  The constant becomes 1752394086.  */
+int cmp2 (char *p)
+{
+  if (__builtin_memcmp (p, "fish", 4) != 0)
+    return 1;
+  return 0;
+}
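
For reference, the magic constant in the scan-assembler directive is
just "fish" read as a little-endian 32-bit integer:

  'f' = 0x66, 'i' = 0x69, 's' = 0x73, 'h' = 0x68
  0x68736966 = 0x68*2^24 + 0x73*2^16 + 0x69*2^8 + 0x66 = 1752394086

so cmp2 should compile down to a single compare against that immediate
on targets with fast unaligned access.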
Index: gcc/tree-ssa-strlen.c
===================================================================
--- gcc/tree-ssa-strlen.c	(revision 236909)
+++ gcc/tree-ssa-strlen.c	(working copy)
@@ -44,6 +44,7 @@ along with GCC; see the file COPYING3.
 #include "params.h"
 #include "ipa-chkp.h"
 #include "tree-hash-traits.h"
+#include "builtins.h"
 
 /* A vector indexed by SSA_NAME_VERSION.  0 means unknown, positive value
    is an index into strinfo vector, negative value stands for
@@ -1843,6 +1844,88 @@ handle_builtin_memset (gimple_stmt_itera
   return false;
 }
 
+/* Handle a call to memcmp.  Convert small power-of-two sizes directly
+   to a load and compare where alignment permits; otherwise, if the
+   result is only tested against zero, substitute __builtin_memcmp_eq.  */
+
+static bool
+handle_builtin_memcmp (gimple_stmt_iterator *gsi)
+{
+  gcall *stmt2 = as_a <gcall *> (gsi_stmt (*gsi));
+  tree res = gimple_call_lhs (stmt2);
+  tree arg1 = gimple_call_arg (stmt2, 0);
+  tree arg2 = gimple_call_arg (stmt2, 1);
+  tree len = gimple_call_arg (stmt2, 2);
+  unsigned HOST_WIDE_INT leni;
+  use_operand_p use_p;
+  imm_use_iterator iter;
+
+  if (!res)
+    return true;
+
+  FOR_EACH_IMM_USE_FAST (use_p, iter, res)
+    {
+      gimple *ustmt = USE_STMT (use_p);
+
+      if (gimple_code (ustmt) == GIMPLE_ASSIGN)
+	{
+	  gassign *asgn = as_a <gassign *> (ustmt);
+	  tree_code code = gimple_assign_rhs_code (asgn);
+	  if ((code != EQ_EXPR && code != NE_EXPR)
+	      || !integer_zerop (gimple_assign_rhs2 (asgn)))
+	    return true;
+	}
+      else if (gimple_code (ustmt) == GIMPLE_COND)
+	{
+	  tree_code code = gimple_cond_code (ustmt);
+	  if ((code != EQ_EXPR && code != NE_EXPR)
+	      || !integer_zerop (gimple_cond_rhs (ustmt)))
+	    return true;
+	}
+      else
+	return true;
+    }
+
+  if (tree_fits_uhwi_p (len)
+      && (leni = tree_to_uhwi (len)) <= GET_MODE_SIZE (word_mode)
+      && exact_log2 (leni) != -1)
+    {
+      leni *= CHAR_TYPE_SIZE;
+      unsigned align1 = get_pointer_alignment (arg1);
+      unsigned align2 = get_pointer_alignment (arg2);
+      unsigned align = MIN (align1, align2);
+      machine_mode mode = mode_for_size (leni, MODE_INT, 1);
+      if (mode != BLKmode
+	  && (align >= leni || !SLOW_UNALIGNED_ACCESS (mode, align)))
+	{
+	  location_t loc = gimple_location (stmt2);
+	  tree type, off;
+	  type = build_nonstandard_integer_type (leni, 1);
+	  gcc_assert (GET_MODE_BITSIZE (TYPE_MODE (type)) == leni);
+	  tree ptrtype = build_pointer_type_for_mode (char_type_node,
+						      ptr_mode, true);
+	  off = build_int_cst (ptrtype, 0);
+	  arg1 = build2_loc (loc, MEM_REF, type, arg1, off);
+	  arg2 = build2_loc (loc, MEM_REF, type, arg2, off);
+	  tree tem1 = fold_const_aggregate_ref (arg1);
+	  if (tem1)
+	    arg1 = tem1;
+	  tree tem2 = fold_const_aggregate_ref (arg2);
+	  if (tem2)
+	    arg2 = tem2;
+	  res = fold_convert_loc (loc, TREE_TYPE (res),
+				  fold_build2_loc (loc, NE_EXPR,
+						   boolean_type_node,
+						   arg1, arg2));
+	  gimplify_and_update_call_from_tree (gsi, res);
+	  return false;
+	}
+    }
+
+  gimple_call_set_fndecl (stmt2, builtin_decl_explicit (BUILT_IN_MEMCMP_EQ));
+  return false;
+}
+
 /* Handle a POINTER_PLUS_EXPR statement.
    For p = "abcd" + 2; compute associated length, or if
    p = q + off is pointing to a '\0' character of a string, call
@@ -2100,6 +2183,10 @@ strlen_optimize_stmt (gimple_stmt_iterat
 	    if (!handle_builtin_memset (gsi))
 	      return false;
 	    break;
+	  case BUILT_IN_MEMCMP:
+	    if (!handle_builtin_memcmp (gsi))
+	      return false;
+	    break;
 	  default:
 	    break;
 	  }
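
At the source level, the transformation handle_builtin_memcmp performs
amounts to the following (a sketch; uint32_t stands in for the
nonstandard integer type the code actually builds):

  /* Small power-of-two size, sufficient alignment or cheap unaligned
     access: load and compare directly.  */
  memcmp (p, q, 4) != 0    becomes    *(uint32_t *) p != *(uint32_t *) q

  /* Otherwise, when every use only tests the result against zero:  */
  memcmp (p, q, n) == 0    becomes    __builtin_memcmp_eq (p, q, n) == 0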
Index: gcc/tree.c
===================================================================
--- gcc/tree.c	(revision 236909)
+++ gcc/tree.c	(working copy)
@@ -10603,6 +10603,13 @@ build_common_builtin_nodes (void)
 			BUILT_IN_STACK_RESTORE,
 			"__builtin_stack_restore", ECF_NOTHROW | ECF_LEAF);
 
+  ftype = build_function_type_list (integer_type_node, const_ptr_type_node,
+				    const_ptr_type_node, size_type_node,
+				    NULL_TREE);
+  local_define_builtin ("__builtin_memcmp_eq", ftype, BUILT_IN_MEMCMP_EQ,
+			"__builtin_memcmp_eq",
+			ECF_PURE | ECF_NOTHROW | ECF_LEAF);
+
   /* If there's a possibility that we might use the ARM EABI, build the
     alternate __cxa_end_cleanup node used to resume from C++ and Java.  */
   if (targetm.arm_eabi_unwinder)

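
The node built here gives __builtin_memcmp_eq the same signature as
memcmp itself, effectively

  int __builtin_memcmp_eq (const void *, const void *, size_t);

marked pure/nothrow/leaf.  It differs from memcmp only in its contract:
the strlen pass has already verified that every caller tests the result
against zero, so the expander is free to return any nonzero value on
inequality.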