public inbox for gcc-patches@gcc.gnu.org
* [PATCH ARM] Improve ARM memset inlining
@ 2014-04-30  5:56 bin.cheng
  2014-05-02 13:59 ` Richard Earnshaw
  0 siblings, 1 reply; 14+ messages in thread
From: bin.cheng @ 2014-04-30  5:56 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 3037 bytes --]

Hi,
This patch expands small memset calls into direct memory set instructions by
introducing a "setmemsi" pattern.  For processors without NEON support, it
expands memset using general store instructions, for example, strd for
4-byte aligned addresses.  For processors with NEON support, it expands
memset using NEON instructions such as vstr and the miscellaneous vst1.*
instructions, for both aligned and unaligned cases.
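
For example, with this patch a small constant-length call like the one
below can be expanded inline (a made-up example; whether it is actually
inlined depends on the target options, the known alignment and the
64-byte length limit):

  #include <string.h>

  char buf[16];

  void
  clear_buf (void)
  {
    /* Expanded by the new "setmemsi" pattern instead of calling memset.  */
    memset (buf, 0, sizeof (buf));
  }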

This patch depends on
http://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html; otherwise vst1.64
will be generated for 32-bit aligned memory units.

There is also one piece of leftover work for this patch: since vst1.*
instructions only support the post-increment addressing mode, the inlined
memset for unaligned NEON cases should look like:
  vmov.i32   q8, #...
  vst1.8     {q8}, [r3]!
  vst1.8     {q8}, [r3]!
  vst1.8     {q8}, [r3]!
  vst1.8     {q8}, [r3]
But for now, GCC can't do this and the code below is generated instead:
  vmov.i32   q8, #...
  vst1.8     {q8}, [r3]
  add        r2,   r3,  #16
  add        r3,   r2,  #16
  vst1.8     {q8}, [r2]
  vst1.8     {q8}, [r3]
  add        r2,   r3,  #16
  vst1.8     {q8}, [r2]

I investigated this issue.  The root cause lies in the rtx costs returned
by the ARM backend.  Anyway, I think this is a separate issue and should be
fixed in a separate patch.

Bootstrapped and regression-tested on cortex-a15, with and without NEON
support.  Is it OK?

Thanks,
bin


2014-04-29  Bin Cheng  <bin.cheng@arm.com>

	PR target/55701
	* config/arm/arm.md (setmem): New pattern.
	* config/arm/arm-protos.h (struct tune_params): New field.
	(arm_gen_setmem): New prototype.
	* config/arm/arm.c (arm_slowmul_tune): Initialize new field.
	(arm_fastmul_tune, arm_strongarm_tune, arm_xscale_tune): Ditto.
	(arm_9e_tune, arm_v6t2_tune, arm_cortex_tune): Ditto.
	(arm_cortex_a8_tune, arm_cortex_a7_tune): Ditto.
	(arm_cortex_a15_tune, arm_cortex_a53_tune): Ditto.
	(arm_cortex_a57_tune, arm_cortex_a5_tune): Ditto.
	(arm_cortex_a9_tune, arm_cortex_a12_tune): Ditto.
	(arm_v7m_tune, arm_v6m_tune, arm_fa726te_tune): Ditto.
	(arm_const_inline_cost): New function.
	(arm_block_set_max_insns): New function.
	(arm_block_set_straight_profit_p): New function.
	(arm_block_set_vect_profit_p): New function.
	(arm_block_set_unaligned_vect): New function.
	(arm_block_set_aligned_vect): New function.
	(arm_block_set_unaligned_straight): New function.
	(arm_block_set_aligned_straight): New function.
	(arm_block_set_vect, arm_gen_setmem): New functions.

gcc/testsuite/ChangeLog
2014-04-29  Bin Cheng  <bin.cheng@arm.com>

	PR target/55701
	* gcc.target/arm/memset-inline-1.c: New test.
	* gcc.target/arm/memset-inline-2.c: New test.
	* gcc.target/arm/memset-inline-3.c: New test.
	* gcc.target/arm/memset-inline-4.c: New test.
	* gcc.target/arm/memset-inline-5.c: New test.
	* gcc.target/arm/memset-inline-6.c: New test.
	* gcc.target/arm/memset-inline-7.c: New test.
	* gcc.target/arm/memset-inline-8.c: New test.
	* gcc.target/arm/memset-inline-9.c: New test.

[-- Attachment #2: j1328-20140429.txt --]
[-- Type: text/plain, Size: 50199 bytes --]

Index: gcc/config/arm/arm.c
===================================================================
--- gcc/config/arm/arm.c	(revision 209852)
+++ gcc/config/arm/arm.c	(working copy)
@@ -1585,10 +1585,11 @@ const struct tune_params arm_slowmul_tune =
   true,						/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 const struct tune_params arm_fastmul_tune =
@@ -1602,10 +1603,11 @@ const struct tune_params arm_fastmul_tune =
   true,						/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 /* StrongARM has early execution of branches, so a sequence that is worth
@@ -1622,10 +1624,11 @@ const struct tune_params arm_strongarm_tune =
   true,						/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 const struct tune_params arm_xscale_tune =
@@ -1639,10 +1642,11 @@ const struct tune_params arm_xscale_tune =
   true,						/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 const struct tune_params arm_9e_tune =
@@ -1656,10 +1660,11 @@ const struct tune_params arm_9e_tune =
   true,						/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 const struct tune_params arm_v6t2_tune =
@@ -1673,10 +1678,11 @@ const struct tune_params arm_v6t2_tune =
   false,					/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 /* Generic Cortex tuning.  Use more specific tunings if appropriate.  */
@@ -1691,10 +1697,11 @@ const struct tune_params arm_cortex_tune =
   false,					/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 const struct tune_params arm_cortex_a8_tune =
@@ -1708,10 +1715,11 @@ const struct tune_params arm_cortex_a8_tune =
   false,					/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 const struct tune_params arm_cortex_a7_tune =
@@ -1725,10 +1733,11 @@ const struct tune_params arm_cortex_a7_tune =
   false,					/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,			/* Vectorizer costs.  */
-  false,					/* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  true                                  /* Prefer Neon for stringops.  */
 };
 
 const struct tune_params arm_cortex_a15_tune =
@@ -1742,10 +1751,11 @@ const struct tune_params arm_cortex_a15_tune =
   false,					/* Prefer constant pool.  */
   arm_default_branch_cost,
   true,						/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  true, true                                    /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  true, true,                           /* Prefer 32-bit encodings.  */
+  true                                  /* Prefer Neon for stringops.  */
 };
 
 const struct tune_params arm_cortex_a53_tune =
@@ -1759,10 +1769,11 @@ const struct tune_params arm_cortex_a53_tune =
   false,					/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,			/* Vectorizer costs.  */
-  false,					/* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 const struct tune_params arm_cortex_a57_tune =
@@ -1775,11 +1786,12 @@ const struct tune_params arm_cortex_a57_tune =
   ARM_PREFETCH_NOT_BENEFICIAL,
   false,                                       /* Prefer constant pool.  */
   arm_default_branch_cost,
-  true,                                       /* Prefer LDRD/STRD.  */
-  {true, true},                                /* Prefer non short circuit.  */
-  &arm_default_vec_cost,                       /* Vectorizer costs.  */
-  false,                                       /* Prefer Neon for 64-bits bitops.  */
-  true, true                                   /* Prefer 32-bit encodings.  */
+  true,                                        /* Prefer LDRD/STRD.  */
+  {true, true},                         /* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  true, true,                           /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 /* Branches can be dual-issued on Cortex-A5, so conditional execution is
@@ -1796,10 +1808,11 @@ const struct tune_params arm_cortex_a5_tune =
   false,					/* Prefer constant pool.  */
   arm_cortex_a5_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {false, false},				/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {false, false},			/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 const struct tune_params arm_cortex_a9_tune =
@@ -1813,10 +1826,11 @@ const struct tune_params arm_cortex_a9_tune =
   false,					/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 const struct tune_params arm_cortex_a12_tune =
@@ -1830,10 +1844,11 @@ const struct tune_params arm_cortex_a12_tune =
   false,					/* Prefer constant pool.  */
   arm_default_branch_cost,
   true,						/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  true                                  /* Prefer Neon for stringops.  */
 };
 
 /* armv7m tuning.  On Cortex-M4 cores for example, MOVW/MOVT take a single
@@ -1854,10 +1869,11 @@ const struct tune_params arm_v7m_tune =
   true,						/* Prefer constant pool.  */
   arm_cortex_m_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {false, false},				/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {false, false},			/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 /* The arm_v6m_tune is duplicated from arm_cortex_tune, rather than
@@ -1873,10 +1889,11 @@ const struct tune_params arm_v6m_tune =
   false,					/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {false, false},				/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {false, false},			/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 const struct tune_params arm_fa726te_tune =
@@ -1890,10 +1907,11 @@ const struct tune_params arm_fa726te_tune =
   true,						/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 
@@ -16788,6 +16806,14 @@ arm_const_double_inline_cost (rtx val)
 			      NULL_RTX, NULL_RTX, 0, 0));
 }
 
+/* Cost of loading a SImode constant.  */
+static inline int
+arm_const_inline_cost (rtx val)
+{
+  return arm_gen_constant (SET, SImode, NULL_RTX, INTVAL (val),
+                           NULL_RTX, NULL_RTX, 0, 0);
+}
+
 /* Return true if it is worthwhile to split a 64-bit constant into two
    32-bit operations.  This is the case if optimizing for size, or
    if we have load delay slots, or if one 32-bit part can be done with
@@ -31350,6 +31383,504 @@ arm_validize_comparison (rtx *comparison, rtx * op
 
 }
 
+/* Maximum number of instructions to set block of memory.  */
+static int
+arm_block_set_max_insns (void)
+{
+  return (optimize_function_for_size_p (cfun) ? 4 : 8);
+}
+
+/* Return TRUE if it's profitable to set block of memory for straight
+   case.  */
+static bool
+arm_block_set_straight_profit_p (rtx val,
+				 unsigned HOST_WIDE_INT length,
+				 unsigned HOST_WIDE_INT align,
+				 bool unaligned_p, bool use_strd_p)
+{
+  int num = 0;
+  /* For leftovers in bytes of 0-7, we can set the memory block using
+     strb/strh/str with minimum instruction number.  */
+  int leftover[8] = {0, 1, 1, 2, 1, 2, 2, 3};
+
+  if (unaligned_p)
+    {
+      num = arm_const_inline_cost (val);
+      num += length / align + length % align;
+    }
+  else if (use_strd_p)
+    {
+      num = arm_const_double_inline_cost (val);
+      num += (length >> 3) + leftover[length & 7];
+    }
+  else
+    {
+      num = arm_const_inline_cost (val);
+      num += (length >> 2) + leftover[length & 3];
+    }
+
+  /* We may be able to combine last pair STRH/STRB into a single STR
+     by shifting one byte back.  */
+  if (unaligned_access && length > 3 && (length & 3) == 3)
+    num--;
+
+  return (num <= arm_block_set_max_insns ());
+}
+
+/* Return TRUE if it's profitable to set block of memory for vector case.  */
+static bool
+arm_block_set_vect_profit_p (unsigned HOST_WIDE_INT length,
+			     unsigned HOST_WIDE_INT align ATTRIBUTE_UNUSED,
+			     bool unaligned_p, enum machine_mode mode)
+{
+  int num;
+  unsigned int nelt = GET_MODE_NUNITS (mode);
+
+  /* Num of instruction loading constant value.  */
+  num = 1;
+  /* Num of store instructions.  */
+  num += (length + nelt - 1) / nelt;
+  /* Num of address adjusting instructions.  */
+  if (unaligned_p)
+    /* For unaligned case, it's one less than the store instructions.  */
+    num += (length + nelt - 1) / nelt - 1;
+  else if ((length & 3) != 0)
+    /* For aligned case, it's one if bytes leftover can only be stored
+       by mis-aligned store instruction.  */
+    num++;
+
+  /* Store the first 16 bytes using vst1:v16qi for the aligned case.  */
+  if (!unaligned_p && mode == V16QImode)
+    num--;
+
+  return (num <= arm_block_set_max_insns ());
+}
+
+/* Set a block of memory using vectorization instructions for the
+   unaligned case.  We fill the first LENGTH bytes of the memory
+   area starting from DSTBASE with byte constant VALUE.  ALIGN is
+   the alignment requirement of memory.  */
+static bool
+arm_block_set_unaligned_vect (rtx dstbase,
+			      unsigned HOST_WIDE_INT length,
+			      unsigned HOST_WIDE_INT value,
+			      unsigned HOST_WIDE_INT align)
+{
+  unsigned int i = 0, j = 0, nelt_v16, nelt_v8, nelt_mode;
+  rtx dst, mem;
+  rtx val_elt, val_vec, reg;
+  rtx rval[MAX_VECT_LEN];
+  rtx (*gen_func) (rtx, rtx);
+  enum machine_mode mode;
+  unsigned HOST_WIDE_INT v = value;
+
+  gcc_assert ((align & 0x3) != 0);
+  nelt_v8 = GET_MODE_NUNITS (V8QImode);
+  nelt_v16 = GET_MODE_NUNITS (V16QImode);
+  if (length >= nelt_v16)
+    {
+      mode = V16QImode;
+      gen_func = gen_movmisalignv16qi;
+    }
+  else
+    {
+      mode = V8QImode;
+      gen_func = gen_movmisalignv8qi;
+    }
+  nelt_mode = GET_MODE_NUNITS (mode);
+  gcc_assert (length >= nelt_mode);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_vect_profit_p (length, align, true, mode))
+    return false;
+
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+  mem = adjust_automodify_address (dstbase, mode, dst, 0);
+
+  v = sext_hwi (v, BITS_PER_WORD);
+  val_elt = GEN_INT (v);
+  for (; j < nelt_mode; j++)
+    rval[j] = val_elt;
+
+  reg = gen_reg_rtx (mode);
+  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval));
+  /* Emit instruction loading the constant value.  */
+  emit_move_insn (reg, val_vec);
+
+  /* Handle nelt_mode bytes in a vector.  */
+  for (; (i + nelt_mode <= length); i += nelt_mode)
+    {
+      emit_insn ((*gen_func) (mem, reg));
+      if (i + 2 * nelt_mode <= length)
+	emit_insn (gen_add2_insn (dst, GEN_INT (nelt_mode)));
+    }
+
+  if (i + nelt_v8 <= length)
+    gcc_assert (mode == V16QImode);
+
+  /* Handle (8, 16) bytes leftover.  */
+  if (i + nelt_v8 < length)
+    {
+      emit_insn (gen_add2_insn (dst, GEN_INT (length - i)));
+      /* We are shifting bytes back, set the alignment accordingly.  */
+      if ((length & 1) != 0 && align >= 2)
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      emit_insn (gen_movmisalignv16qi (mem, reg));
+    }
+  /* Handle (0, 8] bytes leftover.  */
+  else if (i < length && i + nelt_v8 >= length)
+    {
+      if (mode == V16QImode)
+	{
+	  reg = gen_lowpart (V8QImode, reg);
+	  mem = adjust_automodify_address (dstbase, V8QImode, dst, 0);
+	}
+      emit_insn (gen_add2_insn (dst, GEN_INT ((length - i)
+					      + (nelt_mode - nelt_v8))));
+      /* We are shifting bytes back, set the alignment accordingly.  */
+      if ((length & 1) != 0 && align >= 2)
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      emit_insn (gen_movmisalignv8qi (mem, reg));
+    }
+
+  return true;
+}
+
+/* Set a block of memory using vectorization instructions for the
+   aligned case.  We fill the first LENGTH bytes of the memory area
+   starting from DSTBASE with byte constant VALUE.  ALIGN is the
+   alignment requirement of memory.  */
+static bool
+arm_block_set_aligned_vect (rtx dstbase,
+			    unsigned HOST_WIDE_INT length,
+			    unsigned HOST_WIDE_INT value,
+			    unsigned HOST_WIDE_INT align)
+{
+  unsigned int i = 0, j = 0, nelt_v8, nelt_v16, nelt_mode;
+  rtx dst, addr, mem;
+  rtx val_elt, val_vec, reg;
+  rtx rval[MAX_VECT_LEN];
+  enum machine_mode mode;
+  unsigned HOST_WIDE_INT v = value;
+
+  gcc_assert ((align & 0x3) == 0);
+  nelt_v8 = GET_MODE_NUNITS (V8QImode);
+  nelt_v16 = GET_MODE_NUNITS (V16QImode);
+  if (length >= nelt_v16 && unaligned_access && !BYTES_BIG_ENDIAN)
+    mode = V16QImode;
+  else
+    mode = V8QImode;
+
+  nelt_mode = GET_MODE_NUNITS (mode);
+  gcc_assert (length >= nelt_mode);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_vect_profit_p (length, align, false, mode))
+    return false;
+
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+
+  v = sext_hwi (v, BITS_PER_WORD);
+  val_elt = GEN_INT (v);
+  for (; j < nelt_mode; j++)
+    rval[j] = val_elt;
+
+  reg = gen_reg_rtx (mode);
+  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval));
+  /* Emit instruction loading the constant value.  */
+  emit_move_insn (reg, val_vec);
+
+  /* Handle first 16 bytes specially using vst1:v16qi instruction.  */
+  if (mode == V16QImode)
+    {
+      mem = adjust_automodify_address (dstbase, mode, dst, 0);
+      emit_insn (gen_movmisalignv16qi (mem, reg));
+      i += nelt_mode;
+      /* Handle (8, 16) bytes leftover using vst1:v16qi again.  */
+      if (i + nelt_v8 < length && i + nelt_v16 > length)
+	{
+	  emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
+	  mem = adjust_automodify_address (dstbase, mode, dst, 0);
+	  /* We are shifting bytes back, set the alignment accordingly.  */
+	  if ((length & 0x3) == 0)
+	    set_mem_align (mem, BITS_PER_UNIT * 4);
+	  else if ((length & 0x1) == 0)
+	    set_mem_align (mem, BITS_PER_UNIT * 2);
+	  else
+	    set_mem_align (mem, BITS_PER_UNIT);
+
+	  emit_insn (gen_movmisalignv16qi (mem, reg));
+	  return true;
+	}
+      /* Fall through for bytes leftover.  */
+      mode = V8QImode;
+      nelt_mode = GET_MODE_NUNITS (mode);
+      reg = gen_lowpart (V8QImode, reg);
+    }
+
+  /* Handle 8 bytes in a vector.  */
+  for (; (i + nelt_mode <= length); i += nelt_mode)
+    {
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, mode, addr, i);
+      emit_move_insn (mem, reg);
+    }
+
+  /* Handle single word leftover by shifting 4 bytes back.  We can
+     use aligned access for this case.  */
+  if (i + UNITS_PER_WORD == length)
+    {
+      addr = plus_constant (Pmode, dst, i - UNITS_PER_WORD);
+      mem = adjust_automodify_address (dstbase, mode,
+				       addr, i - UNITS_PER_WORD);
+      /* We are shifting 4 bytes back, set the alignment accordingly.  */
+      if (align > UNITS_PER_WORD)
+	set_mem_align (mem, BITS_PER_UNIT * UNITS_PER_WORD);
+
+      emit_move_insn (mem, reg);
+    }
+  /* Handle (0, 4), (4, 8) bytes leftover by shifting bytes back.
+     We have to use unaligned access for this case.  */
+  else if (i < length)
+    {
+      emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
+      mem = adjust_automodify_address (dstbase, mode, dst, 0);
+      /* We are shifting bytes back, set the alignment accordingly.  */
+      if ((length & 1) == 0)
+	set_mem_align (mem, BITS_PER_UNIT * 2);
+      else
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      emit_insn (gen_movmisalignv8qi (mem, reg));
+    }
+
+  return true;
+}
+
+/* Set a block of memory using plain strh/strb instructions, only
+   using instructions allowed by ALIGN on processor.  We fill the
+   first LENGTH bytes of the memory area starting from DSTBASE
+   with byte constant VALUE.  ALIGN is the alignment requirement
+   of memory.  */
+static bool
+arm_block_set_unaligned_straight (rtx dstbase,
+				  unsigned HOST_WIDE_INT length,
+				  unsigned HOST_WIDE_INT value,
+				  unsigned HOST_WIDE_INT align)
+{
+  unsigned int i;
+  rtx dst, addr, mem;
+  rtx val_exp, val_reg, reg;
+  enum machine_mode mode;
+  HOST_WIDE_INT v = value;
+
+  gcc_assert (align == 1 || align == 2);
+
+  if (align == 2)
+    v |= (value << BITS_PER_UNIT);
+
+  v = sext_hwi (v, BITS_PER_WORD);
+  val_exp = GEN_INT (v);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_straight_profit_p (val_exp, length,
+					align, true, false))
+    return false;
+
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+  mode = (align == 2 ? HImode : QImode);
+  val_reg = force_reg (SImode, val_exp);
+  reg = gen_lowpart (mode, val_reg);
+
+  for (i = 0; (i + GET_MODE_SIZE (mode) <= length); i += GET_MODE_SIZE (mode))
+    {
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, mode, addr, i);
+      emit_move_insn (mem, reg);
+    }
+
+  /* Handle single byte leftover.  */
+  if (i + 1 == length)
+    {
+      reg = gen_lowpart (QImode, val_reg);
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, QImode, addr, i);
+      emit_move_insn (mem, reg);
+      i++;
+    }
+
+  gcc_assert (i == length);
+  return true;
+}
+
+/* Set a block of memory using plain strd/str/strh/strb instructions,
+   to permit unaligned copies on processors which support unaligned
+   semantics for those instructions.  We fill the first LENGTH bytes
+   of the memory area starting from DSTBASE with byte constant VALUE.
+   ALIGN is the alignment requirement of memory.  */
+static bool
+arm_block_set_aligned_straight (rtx dstbase,
+				unsigned HOST_WIDE_INT length,
+				unsigned HOST_WIDE_INT value,
+				unsigned HOST_WIDE_INT align)
+{
+  unsigned int i = 0;
+  rtx dst, addr, mem;
+  rtx val_exp, val_reg, reg;
+  unsigned HOST_WIDE_INT v;
+  bool use_strd_p;
+
+  use_strd_p = (length >= 2 * UNITS_PER_WORD && (align & 3) == 0
+		&& TARGET_LDRD && current_tune->prefer_ldrd_strd);
+
+  v = (value | (value << 8) | (value << 16) | (value << 24));
+  if (length < UNITS_PER_WORD)
+    v &= (0xFFFFFFFF >> (UNITS_PER_WORD - length) * BITS_PER_UNIT);
+
+  if (use_strd_p)
+    v |= (v << BITS_PER_WORD);
+  else
+    v = sext_hwi (v, BITS_PER_WORD);
+
+  val_exp = GEN_INT (v);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_straight_profit_p (val_exp, length,
+					align, false, use_strd_p))
+    {
+      /* Try without strd.  */
+      v = (v >> BITS_PER_WORD);
+      v = sext_hwi (v, BITS_PER_WORD);
+      val_exp = GEN_INT (v);
+      use_strd_p = false;
+      if (!arm_block_set_straight_profit_p (val_exp, length,
+					    align, false, use_strd_p))
+	return false;
+    }
+
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+  /* Handle double words using strd if possible.  */
+  if (use_strd_p)
+    {
+      val_reg = force_reg (DImode, val_exp);
+      reg = val_reg;
+      for (; (i + 8 <= length); i += 8)
+	{
+	  addr = plus_constant (Pmode, dst, i);
+	  mem = adjust_automodify_address (dstbase, DImode, addr, i);
+	  emit_move_insn (mem, reg);
+	}
+    }
+  else
+    val_reg = force_reg (SImode, val_exp);
+
+  /* Handle words.  */
+  reg = (use_strd_p ? gen_lowpart (SImode, val_reg) : val_reg);
+  for (; (i + 4 <= length); i += 4)
+    {
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, SImode, addr, i);
+      if ((align & 3) == 0)
+	emit_move_insn (mem, reg);
+      else
+	emit_insn (gen_unaligned_storesi (mem, reg));
+    }
+
+  /* Merge last pair of STRH and STRB into a STR if possible.  */
+  if (unaligned_access && i > 0 && (i + 3) == length)
+    {
+      addr = plus_constant (Pmode, dst, i - 1);
+      mem = adjust_automodify_address (dstbase, SImode, addr, i - 1);
+      /* We are shifting one byte back, set the alignment accordingly.  */
+      if ((align & 1) == 0)
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      /* Most likely this is an unaligned access, and we can't tell at
+	 compilation time.  */
+      emit_insn (gen_unaligned_storesi (mem, reg));
+      return true;
+    }
+
+  /* Handle half word leftover.  */
+  if (i + 2 <= length)
+    {
+      reg = gen_lowpart (HImode, val_reg);
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, HImode, addr, i);
+      if ((align & 1) == 0)
+	emit_move_insn (mem, reg);
+      else
+	emit_insn (gen_unaligned_storehi (mem, reg));
+
+      i += 2;
+    }
+
+  /* Handle single byte leftover.  */
+  if (i + 1 == length)
+    {
+      reg = gen_lowpart (QImode, val_reg);
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, QImode, addr, i);
+      emit_move_insn (mem, reg);
+    }
+
+  return true;
+}
+
+/* Set a block of memory using vectorization instructions for both
+   aligned and unaligned cases.  We fill the first LENGTH bytes of
+   the memory area starting from DSTBASE with byte constant VALUE.
+   ALIGN is the alignment requirement of memory.  */
+static bool
+arm_block_set_vect (rtx dstbase,
+		    unsigned HOST_WIDE_INT length,
+		    unsigned HOST_WIDE_INT value,
+		    unsigned HOST_WIDE_INT align)
+{
+  /* Check whether we need to use unaligned store instruction.  */
+  if (((align & 3) != 0 || (length & 3) != 0)
+      /* Check whether unaligned store instruction is available.  */
+      && (!unaligned_access || BYTES_BIG_ENDIAN))
+    return false;
+
+  if ((align & 3) == 0)
+    return arm_block_set_aligned_vect (dstbase, length, value, align);
+  else
+    return arm_block_set_unaligned_vect (dstbase, length, value, align);
+}
+
+/* Expand string store operation.  Firstly we try to do that by using
+   vectorization instructions, then try with ARM unaligned access and
+   double-word store if profitable.  OPERANDS[0] is the destination,
+   OPERANDS[1] is the number of bytes, operands[2] is the value to
+   initialize the memory, OPERANDS[3] is the known alignment of the
+   destination.  */
+bool
+arm_gen_setmem (rtx *operands)
+{
+  rtx dstbase = operands[0];
+  unsigned HOST_WIDE_INT length;
+  unsigned HOST_WIDE_INT value;
+  unsigned HOST_WIDE_INT align;
+
+  if (!CONST_INT_P (operands[2]) || !CONST_INT_P (operands[1]))
+    return false;
+
+  length = UINTVAL (operands[1]);
+  if (length > 64)
+    return false;
+
+  value = (UINTVAL (operands[2]) & 0xFF);
+  align = UINTVAL (operands[3]);
+  if (TARGET_NEON && length >= 8
+      && current_tune->string_ops_prefer_neon
+      && arm_block_set_vect (dstbase, length, value, align))
+    return true;
+
+  if (!unaligned_access && (align & 3) != 0)
+    return arm_block_set_unaligned_straight (dstbase, length, value, align);
+
+  return arm_block_set_aligned_straight (dstbase, length, value, align);
+}
+
 /* Implement the TARGET_ASAN_SHADOW_OFFSET hook.  */
 
 static unsigned HOST_WIDE_INT
Index: gcc/config/arm/arm-protos.h
===================================================================
--- gcc/config/arm/arm-protos.h	(revision 209852)
+++ gcc/config/arm/arm-protos.h	(working copy)
@@ -277,6 +277,8 @@ struct tune_params
   /* Prefer 32-bit encoding instead of 16-bit encoding where subset of flags
      would be set.  */
   bool disparage_partial_flag_setting_t16_encodings;
+  /* Prefer to inline string operations like memset by using Neon.  */
+  bool string_ops_prefer_neon;
 };
 
 extern const struct tune_params *current_tune;
@@ -289,6 +291,7 @@ extern void arm_emit_coreregs_64bit_shift (enum rt
 extern bool arm_validize_comparison (rtx *, rtx *, rtx *);
 #endif /* RTX_CODE */
 
+extern bool arm_gen_setmem (rtx *);
 extern void arm_expand_vec_perm (rtx target, rtx op0, rtx op1, rtx sel);
 extern bool arm_expand_vec_perm_const (rtx target, rtx op0, rtx op1, rtx sel);
 
Index: gcc/config/arm/arm.md
===================================================================
--- gcc/config/arm/arm.md	(revision 209852)
+++ gcc/config/arm/arm.md	(working copy)
@@ -7555,6 +7555,20 @@
 })
 
 
+(define_expand "setmemsi"
+  [(match_operand:BLK 0 "general_operand" "")
+   (match_operand:SI 1 "const_int_operand" "")
+   (match_operand:SI 2 "const_int_operand" "")
+   (match_operand:SI 3 "const_int_operand" "")]
+  "TARGET_32BIT"
+{
+  if (arm_gen_setmem (operands))
+    DONE;
+
+  FAIL;
+})
+
+
 ;; Move a block of memory if it is word aligned and MORE than 2 words long.
 ;; We could let this apply for blocks of less than this, but it clobbers so
 ;; many registers that there is then probably a better way.
Index: gcc/testsuite/gcc.target/arm/memset-inline-6.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-6.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-6.c	(revision 0)
@@ -0,0 +1,68 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+int a[LEN];
+int b[LEN];
+int c[LEN];
+void
+foo1 (void)
+{
+    memset (a, -1, 20);
+    return;
+}
+
+void
+foo2 (void)
+{
+  memset (b, 1, 24);
+  return;
+}
+
+void
+foo3 (void)
+{
+  memset (c, -1, 32);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo1 ();
+  check ((signed char *)a, 20, sizeof (a), -1);
+
+  foo2 ();
+  check ((signed char *)b, 24, sizeof (b), 1);
+
+  foo3 ();
+  check ((signed char *)c, 32, sizeof (c), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]+memset" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-times "vst1" 3 { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-times "vstr" 4 { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
+
+
Index: gcc/testsuite/gcc.target/arm/memset-inline-7.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-7.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-7.c	(revision 0)
@@ -0,0 +1,171 @@
+/* { dg-do run } */
+/* { dg-options "-O2" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+int b[LEN];
+
+void
+init (signed char *arr, int len)
+{
+  int i;
+  for (i = 0; i < len; i++)
+    arr[i] = 0;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+#define TEST(a,l,v)			\
+	init ((signed char*)(a), sizeof (a));		\
+	memset ((a), (v), (l));				\
+	check ((signed char *)(a), (l), sizeof (a), (v));
+int
+main(void)
+{
+  TEST (a, 1, -1);
+  TEST (a, 2, -1);
+  TEST (a, 3, -1);
+  TEST (a, 4, -1);
+  TEST (a, 5, -1);
+  TEST (a, 6, -1);
+  TEST (a, 7, -1);
+  TEST (a, 8, -1);
+  TEST (a, 9, 1);
+  TEST (a, 10, -1);
+  TEST (a, 11, 1);
+  TEST (a, 12, -1);
+  TEST (a, 13, 1);
+  TEST (a, 14, -1);
+  TEST (a, 15, 1);
+  TEST (a, 16, -1);
+  TEST (a, 17, 1);
+  TEST (a, 18, -1);
+  TEST (a, 19, 1);
+  TEST (a, 20, -1);
+  TEST (a, 21, 1);
+  TEST (a, 22, -1);
+  TEST (a, 23, 1);
+  TEST (a, 24, -1);
+  TEST (a, 25, 1);
+  TEST (a, 26, -1);
+  TEST (a, 27, 1);
+  TEST (a, 28, -1);
+  TEST (a, 29, 1);
+  TEST (a, 30, -1);
+  TEST (a, 31, 1);
+  TEST (a, 32, -1);
+  TEST (a, 33, 1);
+  TEST (a, 34, -1);
+  TEST (a, 35, 1);
+  TEST (a, 36, -1);
+  TEST (a, 37, 1);
+  TEST (a, 38, -1);
+  TEST (a, 39, 1);
+  TEST (a, 40, -1);
+  TEST (a, 41, 1);
+  TEST (a, 42, -1);
+  TEST (a, 43, 1);
+  TEST (a, 44, -1);
+  TEST (a, 45, 1);
+  TEST (a, 46, -1);
+  TEST (a, 47, 1);
+  TEST (a, 48, -1);
+  TEST (a, 49, 1);
+  TEST (a, 50, -1);
+  TEST (a, 51, 1);
+  TEST (a, 52, -1);
+  TEST (a, 53, 1);
+  TEST (a, 54, -1);
+  TEST (a, 55, 1);
+  TEST (a, 56, -1);
+  TEST (a, 57, 1);
+  TEST (a, 58, -1);
+  TEST (a, 59, 1);
+  TEST (a, 60, -1);
+  TEST (a, 61, 1);
+  TEST (a, 62, -1);
+  TEST (a, 63, 1);
+  TEST (a, 64, -1);
+
+  TEST (b, 1, -1);
+  TEST (b, 2, -1);
+  TEST (b, 3, -1);
+  TEST (b, 4, -1);
+  TEST (b, 5, -1);
+  TEST (b, 6, -1);
+  TEST (b, 7, -1);
+  TEST (b, 8, -1);
+  TEST (b, 9, 1);
+  TEST (b, 10, -1);
+  TEST (b, 11, 1);
+  TEST (b, 12, -1);
+  TEST (b, 13, 1);
+  TEST (b, 14, -1);
+  TEST (b, 15, 1);
+  TEST (b, 16, -1);
+  TEST (b, 17, 1);
+  TEST (b, 18, -1);
+  TEST (b, 19, 1);
+  TEST (b, 20, -1);
+  TEST (b, 21, 1);
+  TEST (b, 22, -1);
+  TEST (b, 23, 1);
+  TEST (b, 24, -1);
+  TEST (b, 25, 1);
+  TEST (b, 26, -1);
+  TEST (b, 27, 1);
+  TEST (b, 28, -1);
+  TEST (b, 29, 1);
+  TEST (b, 30, -1);
+  TEST (b, 31, 1);
+  TEST (b, 32, -1);
+  TEST (b, 33, 1);
+  TEST (b, 34, -1);
+  TEST (b, 35, 1);
+  TEST (b, 36, -1);
+  TEST (b, 37, 1);
+  TEST (b, 38, -1);
+  TEST (b, 39, 1);
+  TEST (b, 40, -1);
+  TEST (b, 41, 1);
+  TEST (b, 42, -1);
+  TEST (b, 43, 1);
+  TEST (b, 44, -1);
+  TEST (b, 45, 1);
+  TEST (b, 46, -1);
+  TEST (b, 47, 1);
+  TEST (b, 48, -1);
+  TEST (b, 49, 1);
+  TEST (b, 50, -1);
+  TEST (b, 51, 1);
+  TEST (b, 52, -1);
+  TEST (b, 53, 1);
+  TEST (b, 54, -1);
+  TEST (b, 55, 1);
+  TEST (b, 56, -1);
+  TEST (b, 57, 1);
+  TEST (b, 58, -1);
+  TEST (b, 59, 1);
+  TEST (b, 60, -1);
+  TEST (b, 61, 1);
+  TEST (b, 62, -1);
+  TEST (b, 63, 1);
+  TEST (b, 64, -1);
+
+  return 0;
+}
+
Index: gcc/testsuite/gcc.target/arm/memset-inline-8.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-8.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-8.c	(revision 0)
@@ -0,0 +1,44 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline"  } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+    memset (a, -1, 14);
+    return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { scan-assembler "vst1" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-not "vstr" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-1.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-1.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-1.c	(revision 0)
@@ -0,0 +1,39 @@
+/* { dg-do run } */
+/* { dg-options "-save-temps -O2 -fno-inline"  } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+    memset (a, -1, 14);
+    return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-9.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-9.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-9.c	(revision 0)
@@ -0,0 +1,42 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -Os -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+  memset (a, -1, 14);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler "vst1" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-2.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-2.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-2.c	(revision 0)
@@ -0,0 +1,38 @@
+/* { dg-do run } */
+/* { dg-options "-save-temps -Os -fno-inline" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+  memset (a, -1, 14);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+/* { dg-final { scan-assembler "bl?\[ \t\]*memset" { target { ! arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-3.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-3.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-3.c	(revision 0)
@@ -0,0 +1,40 @@
+/* { dg-do run } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+    memset (a, -1, 7);
+    return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 7, sizeof (a), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { scan-assembler-not "strh" { target { ! arm_thumb1 } } } } */
+/* { dg-final { scan-assembler-not "strb" { target { ! arm_thumb1 } } } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-4.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-4.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-4.c	(revision 0)
@@ -0,0 +1,68 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+int a[LEN];
+int b[LEN];
+int c[LEN];
+void
+foo1 (void)
+{
+    memset (a, -1, 8);
+    return;
+}
+
+void
+foo2 (void)
+{
+  memset (b, 1, 12);
+  return;
+}
+
+void
+foo3 (void)
+{
+  memset (c, 1, 13);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  int i;
+
+  foo1 ();
+  check ((signed char *)a, 8, sizeof (a), -1);
+
+  foo2 ();
+  check ((signed char *)b, 12, sizeof (b), 1);
+
+  foo3 ();
+  check ((signed char *)c, 13, sizeof (c), 1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]+memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { scan-assembler-times "vst1\.8" 1 { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler "vstr" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-5.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-5.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-5.c	(revision 0)
@@ -0,0 +1,78 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+int a[LEN];
+int b[LEN];
+int c[LEN];
+int d[LEN];
+void
+foo1 (void)
+{
+    memset (a, -1, 16);
+    return;
+}
+
+void
+foo2 (void)
+{
+  memset (b, 1, 25);
+  return;
+}
+
+void
+foo3 (void)
+{
+  memset (c, -1, 19);
+  return;
+}
+
+void
+foo4 (void)
+{
+  memset (d, 1, 23);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo1 ();
+  check ((signed char *)a, 16, sizeof (a), -1);
+
+  foo2 ();
+  check ((signed char *)b, 25, sizeof (b), 1);
+
+  foo3 ();
+  check ((signed char *)c, 19, sizeof (c), -1);
+
+  foo4 ();
+  check ((signed char *)d, 23, sizeof (d), 1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]+memset" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler "vst1" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-not "vstr"  { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
+


* Re: [PATCH ARM] Improve ARM memset inlining
  2014-04-30  5:56 [PATCH ARM] Improve ARM memset inlining bin.cheng
@ 2014-05-02 13:59 ` Richard Earnshaw
  2014-05-05  7:21   ` bin.cheng
  0 siblings, 1 reply; 14+ messages in thread
From: Richard Earnshaw @ 2014-05-02 13:59 UTC (permalink / raw)
  To: bin.cheng; +Cc: gcc-patches

On 30/04/14 03:52, bin.cheng wrote:
> Hi,
> This patch expands small memset calls into direct memory set instructions by
> introducing a "setmemsi" pattern.  For processors without NEON support, it
> expands memset using general store instructions, for example, strd for
> 4-byte aligned addresses.  For processors with NEON support, it expands
> memset using NEON instructions such as vstr and the miscellaneous vst1.*
> instructions, for both aligned and unaligned cases.
> 
> This patch depends on
> http://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html; otherwise vst1.64
> will be generated for 32-bit aligned memory units.
> 
> There is also one piece of leftover work for this patch: since vst1.*
> instructions only support the post-increment addressing mode, the inlined
> memset for unaligned NEON cases should look like:
>   vmov.i32   q8, #...
>   vst1.8     {q8}, [r3]!
>   vst1.8     {q8}, [r3]!
>   vst1.8     {q8}, [r3]!
>   vst1.8     {q8}, [r3]

Other than for zero, I'd expect the vmov to be vmov.i8 to move an
arbitrary byte value into all lanes in a vector.  After that, if the
alignment is known to be more than 8-bit, I'd expect the vst1
instructions (with the exception of the last store if the length is not
a multiple of the alignment) to use

	vst1.<align> {reg}, [addr-reg :<align>]!

Hence, for 16-bit aligned data, we want

	vst1.16	{q8}, [r3:16]!
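
So, for a 16-bit aligned destination, the whole expansion would then be
expected to look roughly like this (an illustrative sketch using the
notation above; the immediate and the number of stores depend on the
value and the length):

	vmov.i8	q8, #0x01
	vst1.16	{q8}, [r3:16]!
	vst1.16	{q8}, [r3:16]!
	vst1.16	{q8}, [r3:16]!
	vst1.16	{q8}, [r3:16]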

> But for now, GCC can't do this and the code below is generated instead:
>   vmov.i32   q8, #...
>   vst1.8     {q8}, [r3]
>   add        r2,   r3,  #16
>   add        r3,   r2,  #16
>   vst1.8     {q8}, [r2]
>   vst1.8     {q8}, [r3]
>   add        r2,   r3,  #16
>   vst1.8     {q8}, [r2]
> 
> I investigated this issue.  The root cause lies in the rtx costs returned
> by the ARM backend.  Anyway, I think this is a separate issue and should be
> fixed in a separate patch.
> 
> Bootstrapped and regression-tested on cortex-a15, with and without NEON
> support.  Is it OK?
> 

Some more comments inline.

> Thanks,
> bin
> 
> 
> 2014-04-29  Bin Cheng  <bin.cheng@arm.com>
> 
> 	PR target/55701
> 	* config/arm/arm.md (setmem): New pattern.
> 	* config/arm/arm-protos.h (struct tune_params): New field.
> 	(arm_gen_setmem): New prototype.
> 	* config/arm/arm.c (arm_slowmul_tune): Initialize new field.
> 	(arm_fastmul_tune, arm_strongarm_tune, arm_xscale_tune): Ditto.
> 	(arm_9e_tune, arm_v6t2_tune, arm_cortex_tune): Ditto.
> 	(arm_cortex_a8_tune, arm_cortex_a7_tune): Ditto.
> 	(arm_cortex_a15_tune, arm_cortex_a53_tune): Ditto.
> 	(arm_cortex_a57_tune, arm_cortex_a5_tune): Ditto.
> 	(arm_cortex_a9_tune, arm_cortex_a12_tune): Ditto.
> 	(arm_v7m_tune, arm_v6m_tune, arm_fa726te_tune): Ditto.
> 	(arm_const_inline_cost): New function.
> 	(arm_block_set_max_insns): New function.
> 	(arm_block_set_straight_profit_p): New function.
> 	(arm_block_set_vect_profit_p): New function.
> 	(arm_block_set_unaligned_vect): New function.
> 	(arm_block_set_aligned_vect): New function.
> 	(arm_block_set_unaligned_straight): New function.
> 	(arm_block_set_aligned_straight): New function.
> 	(arm_block_set_vect, arm_gen_setmem): New functions.
> 
> gcc/testsuite/ChangeLog
> 2014-04-29  Bin Cheng  <bin.cheng@arm.com>
> 
> 	PR target/55701
> 	* gcc.target/arm/memset-inline-1.c: New test.
> 	* gcc.target/arm/memset-inline-2.c: New test.
> 	* gcc.target/arm/memset-inline-3.c: New test.
> 	* gcc.target/arm/memset-inline-4.c: New test.
> 	* gcc.target/arm/memset-inline-5.c: New test.
> 	* gcc.target/arm/memset-inline-6.c: New test.
> 	* gcc.target/arm/memset-inline-7.c: New test.
> 	* gcc.target/arm/memset-inline-8.c: New test.
> 	* gcc.target/arm/memset-inline-9.c: New test.
> 
> 
> j1328-20140429.txt
> 
> 
> Index: gcc/config/arm/arm.c
> ===================================================================
> --- gcc/config/arm/arm.c	(revision 209852)
> +++ gcc/config/arm/arm.c	(working copy)
> @@ -1585,10 +1585,11 @@ const struct tune_params arm_slowmul_tune =
>    true,						/* Prefer constant pool.  */
>    arm_default_branch_cost,
>    false,					/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,                /* Vectorizer costs.  */
> +  false,                                /* Prefer Neon for 64-bits bitops.  */
> +  false, false,                         /* Prefer 32-bit encodings.  */
> +  false                                 /* Prefer Neon for stringops.  */
>  };
>  

Please make sure that all the white space before the comments is using
TAB, not spaces.  Similarly for the other tables.

> @@ -16788,6 +16806,14 @@ arm_const_double_inline_cost (rtx val)
>  			      NULL_RTX, NULL_RTX, 0, 0));
>  }
>  
> +/* Cost of loading a SImode constant.  */
> +static inline int
> +arm_const_inline_cost (rtx val)
> +{
> +  return arm_gen_constant (SET, SImode, NULL_RTX, INTVAL (val),
> +                           NULL_RTX, NULL_RTX, 0, 0);
> +}
> +

This could be used more widely if you passed the SET in as a parameter
(there are cases in arm_new_rtx_cost that could use it, for example).
Also, you want to enable sub-targets (that is only unsafe once you can no
longer create new pseudos), so the penultimate argument in the call to
arm_gen_constant should be 1.
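
Something along these lines is what I have in mind (just a sketch, not
tested; adjust the parameter name as you see fit):

        /* Cost of loading a SImode constant, generated by CODE.  */
        static inline int
        arm_const_inline_cost (enum rtx_code code, rtx val)
        {
          return arm_gen_constant (code, SImode, NULL_RTX, INTVAL (val),
                                   NULL_RTX, NULL_RTX, 1, 0);
        }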

>  /* Return true if it is worthwhile to split a 64-bit constant into two
>     32-bit operations.  This is the case if optimizing for size, or
>     if we have load delay slots, or if one 32-bit part can be done with
> @@ -31350,6 +31383,504 @@ arm_validize_comparison (rtx *comparison, rtx * op
>  
>  }
>  
> +/* Maximum number of instructions to set block of memory.  */
> +static int
> +arm_block_set_max_insns (void)
> +{
> +  return (optimize_function_for_size_p (cfun) ? 4 : 8);
> +}

I think the non-size_p alternative should really be a parameter in the
per-cpu costs table.
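
For example (again only a sketch; the field name here is just a suggestion):

        /* New field in struct tune_params.  */
        int max_insns_inline_memset;

        static int
        arm_block_set_max_insns (void)
        {
          if (optimize_function_for_size_p (cfun))
            return 4;
          else
            return current_tune->max_insns_inline_memset;
        }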

> +
> +/* Return TRUE if it's profitable to set block of memory for straight
> +   case.  */

"Straight" is confusing here.  Do you mean non-vectorized?  If so, then
non_vect might be clearer.

The arguments should really be documented (see comment below about
align, for example).

> +static bool
> +arm_block_set_straight_profit_p (rtx val,
> +				 unsigned HOST_WIDE_INT length,
> +				 unsigned HOST_WIDE_INT align,
> +				 bool unaligned_p, bool use_strd_p)
> +{
> +  int num = 0;
> +  /* For leftovers in bytes of 0-7, we can set the memory block using
> +     strb/strh/str with minimum instruction number.  */
> +  int leftover[8] = {0, 1, 1, 2, 1, 2, 2, 3};

This should be marked const.

> +
> +  if (unaligned_p)
> +    {
> +      num = arm_const_inline_cost (val);
> +      num += length / align + length % align;

Isn't align in bits here, when you really want it in bytes?

What if align > 4 bytes?

> +    }
> +  else if (use_strd_p)
> +    {
> +      num = arm_const_double_inline_cost (val);
> +      num += (length >> 3) + leftover[length & 7];
> +    }
> +  else
> +    {
> +      num = arm_const_inline_cost (val);
> +      num += (length >> 2) + leftover[length & 3];
> +    }
> +
> +  /* We may be able to combine last pair STRH/STRB into a single STR
> +     by shifting one byte back.  */
> +  if (unaligned_access && length > 3 && (length & 3) == 3)
> +    num--;
> +
> +  return (num <= arm_block_set_max_insns ());
> +}
> +
> +/* Return TRUE if it's profitable to set block of memory for vector case.  */
> +static bool
> +arm_block_set_vect_profit_p (unsigned HOST_WIDE_INT length,
> +			     unsigned HOST_WIDE_INT align ATTRIBUTE_UNUSED,
> +			     bool unaligned_p, enum machine_mode mode)

I'm not sure what you mean by unaligned here.  Again, documenting the
arguments might help.

> +{
> +  int num;
> +  unsigned int nelt = GET_MODE_NUNITS (mode);
> +
> +  /* Num of instruction loading constant value.  */

Use either "Number" or, in this case, simply drop that bit and write:
  /* Instruction loading constant value.  */

> +  num = 1;
> +  /* Num of store instructions.  */

Likewise.

> +  num += (length + nelt - 1) / nelt;
> +  /* Num of address adjusting instructions.  */

Can't we work on the premise that the address adjusting instructions
will be merged into the stores?  I know you said that they currently do
not, but that's not a problem that this bit of code should have to worry
about.

> +  if (unaligned_p)
> +    /* For unaligned case, it's one less than the store instructions.  */
> +    num += (length + nelt - 1) / nelt - 1;
> +  else if ((length & 3) != 0)
> +    /* For aligned case, it's one if bytes leftover can only be stored
> +       by mis-aligned store instruction.  */
> +    num++;
> +
> +  /* Store the first 16 bytes using vst1:v16qi for the aligned case.  */
> +  if (!unaligned_p && mode == V16QImode)
> +    num--;
> +
> +  return (num <= arm_block_set_max_insns ());
> +}
> +
> +/* Set a block of memory using vectorization instructions for the
> +   unaligned case.  We fill the first LENGTH bytes of the memory
> +   area starting from DSTBASE with byte constant VALUE.  ALIGN is
> +   the alignment requirement of memory.  */

What's the return value mean?

> +static bool
> +arm_block_set_unaligned_vect (rtx dstbase,
> +			      unsigned HOST_WIDE_INT length,
> +			      unsigned HOST_WIDE_INT value,
> +			      unsigned HOST_WIDE_INT align)
> +{
> +  unsigned int i = 0, j = 0, nelt_v16, nelt_v8, nelt_mode;

Don't mix initialized declarations with uninitialized ones on the same
line.  You don't appear to use either I or J until their first use in
the loop control below, so why initialize them here?

> +  rtx dst, mem;
> +  rtx val_elt, val_vec, reg;
> +  rtx rval[MAX_VECT_LEN];
> +  rtx (*gen_func) (rtx, rtx);
> +  enum machine_mode mode;
> +  unsigned HOST_WIDE_INT v = value;
> +
> +  gcc_assert ((align & 0x3) != 0);
> +  nelt_v8 = GET_MODE_NUNITS (V8QImode);
> +  nelt_v16 = GET_MODE_NUNITS (V16QImode);
> +  if (length >= nelt_v16)
> +    {
> +      mode = V16QImode;
> +      gen_func = gen_movmisalignv16qi;
> +    }
> +  else
> +    {
> +      mode = V8QImode;
> +      gen_func = gen_movmisalignv8qi;
> +    }
> +  nelt_mode = GET_MODE_NUNITS (mode);
> +  gcc_assert (length >= nelt_mode);
> +  /* Skip if it isn't profitable.  */
> +  if (!arm_block_set_vect_profit_p (length, align, true, mode))
> +    return false;
> +
> +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
> +  mem = adjust_automodify_address (dstbase, mode, dst, 0);
> +
> +  v = sext_hwi (v, BITS_PER_WORD);
> +  val_elt = GEN_INT (v);
> +  for (; j < nelt_mode; j++)
> +    rval[j] = val_elt;

Is this the first use of J?  If so, initialize it here.

> +
> +  reg = gen_reg_rtx (mode);
> +  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval));
> +  /* Emit instruction loading the constant value.  */
> +  emit_move_insn (reg, val_vec);
> +
> +  /* Handle nelt_mode bytes in a vector.  */
> +  for (; (i + nelt_mode <= length); i += nelt_mode)

Similarly for I.

> +    {
> +      emit_insn ((*gen_func) (mem, reg));
> +      if (i + 2 * nelt_mode <= length)
> +	emit_insn (gen_add2_insn (dst, GEN_INT (nelt_mode)));
> +    }
> +
> +  if (i + nelt_v8 <= length)
> +    gcc_assert (mode == V16QImode);

Why not drop the if and write:

     gcc_assert ((i + nelt_v8) > length || mode == V16QImode);

> +
> +  /* Handle (8, 16) bytes leftover.  */
> +  if (i + nelt_v8 < length)

Your assertion above checked <=, but here you use <.  Is that correct?

> +    {
> +      emit_insn (gen_add2_insn (dst, GEN_INT (length - i)));
> +      /* We are shifting bytes back, set the alignment accordingly.  */
> +      if ((length & 1) != 0 && align >= 2)
> +	set_mem_align (mem, BITS_PER_UNIT);
> +
> +      emit_insn (gen_movmisalignv16qi (mem, reg));
> +    }
> +  /* Handle (0, 8] bytes leftover.  */
> +  else if (i < length && i + nelt_v8 >= length)
> +    {
> +      if (mode == V16QImode)
> +	{
> +	  reg = gen_lowpart (V8QImode, reg);
> +	  mem = adjust_automodify_address (dstbase, V8QImode, dst, 0);
> +	}
> +      emit_insn (gen_add2_insn (dst, GEN_INT ((length - i)
> +					      + (nelt_mode - nelt_v8))));
> +      /* We are shifting bytes back, set the alignment accordingly.  */
> +      if ((length & 1) != 0 && align >= 2)
> +	set_mem_align (mem, BITS_PER_UNIT);
> +
> +      emit_insn (gen_movmisalignv8qi (mem, reg));
> +    }
> +
> +  return true;
> +}
> +
> +/* Set a block of memory using vectorization instructions for the
> +   aligned case.  We fill the first LENGTH bytes of the memory area
> +   starting from DSTBASE with byte constant VALUE.  ALIGN is the
> +   alignment requirement of memory.  */

See all the comments above for the unaligned case.

> +static bool
> +arm_block_set_aligned_vect (rtx dstbase,
> +			    unsigned HOST_WIDE_INT length,
> +			    unsigned HOST_WIDE_INT value,
> +			    unsigned HOST_WIDE_INT align)
> +{
> +  unsigned int i = 0, j = 0, nelt_v8, nelt_v16, nelt_mode;
> +  rtx dst, addr, mem;
> +  rtx val_elt, val_vec, reg;
> +  rtx rval[MAX_VECT_LEN];
> +  enum machine_mode mode;
> +  unsigned HOST_WIDE_INT v = value;
> +
> +  gcc_assert ((align & 0x3) == 0);
> +  nelt_v8 = GET_MODE_NUNITS (V8QImode);
> +  nelt_v16 = GET_MODE_NUNITS (V16QImode);
> +  if (length >= nelt_v16 && unaligned_access && !BYTES_BIG_ENDIAN)
> +    mode = V16QImode;
> +  else
> +    mode = V8QImode;
> +
> +  nelt_mode = GET_MODE_NUNITS (mode);
> +  gcc_assert (length >= nelt_mode);
> +  /* Skip if it isn't profitable.  */
> +  if (!arm_block_set_vect_profit_p (length, align, false, mode))
> +    return false;
> +
> +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
> +
> +  v = sext_hwi (v, BITS_PER_WORD);
> +  val_elt = GEN_INT (v);
> +  for (; j < nelt_mode; j++)
> +    rval[j] = val_elt;
> +
> +  reg = gen_reg_rtx (mode);
> +  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval));
> +  /* Emit instruction loading the constant value.  */
> +  emit_move_insn (reg, val_vec);
> +
> +  /* Handle first 16 bytes specially using vst1:v16qi instruction.  */
> +  if (mode == V16QImode)
> +    {
> +      mem = adjust_automodify_address (dstbase, mode, dst, 0);
> +      emit_insn (gen_movmisalignv16qi (mem, reg));
> +      i += nelt_mode;
> +      /* Handle (8, 16) bytes leftover using vst1:v16qi again.  */
> +      if (i + nelt_v8 < length && i + nelt_v16 > length)
> +	{
> +	  emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
> +	  mem = adjust_automodify_address (dstbase, mode, dst, 0);
> +	  /* We are shifting bytes back, set the alignment accordingly.  */
> +	  if ((length & 0x3) == 0)
> +	    set_mem_align (mem, BITS_PER_UNIT * 4);
> +	  else if ((length & 0x1) == 0)
> +	    set_mem_align (mem, BITS_PER_UNIT * 2);
> +	  else
> +	    set_mem_align (mem, BITS_PER_UNIT);
> +
> +	  emit_insn (gen_movmisalignv16qi (mem, reg));
> +	  return true;
> +	}
> +      /* Fall through for bytes leftover.  */
> +      mode = V8QImode;
> +      nelt_mode = GET_MODE_NUNITS (mode);
> +      reg = gen_lowpart (V8QImode, reg);
> +    }
> +
> +  /* Handle 8 bytes in a vector.  */
> +  for (; (i + nelt_mode <= length); i += nelt_mode)
> +    {
> +      addr = plus_constant (Pmode, dst, i);
> +      mem = adjust_automodify_address (dstbase, mode, addr, i);
> +      emit_move_insn (mem, reg);
> +    }
> +
> +  /* Handle single word leftover by shifting 4 bytes back.  We can
> +     use aligned access for this case.  */
> +  if (i + UNITS_PER_WORD == length)
> +    {
> +      addr = plus_constant (Pmode, dst, i - UNITS_PER_WORD);
> +      mem = adjust_automodify_address (dstbase, mode,
> +				       addr, i - UNITS_PER_WORD);
> +      /* We are shifting 4 bytes back, set the alignment accordingly.  */
> +      if (align > UNITS_PER_WORD)
> +	set_mem_align (mem, BITS_PER_UNIT * UNITS_PER_WORD);
> +
> +      emit_move_insn (mem, reg);
> +    }
> +  /* Handle (0, 4), (4, 8) bytes leftover by shifting bytes back.
> +     We have to use unaligned access for this case.  */
> +  else if (i < length)
> +    {
> +      emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
> +      mem = adjust_automodify_address (dstbase, mode, dst, 0);
> +      /* We are shifting bytes back, set the alignment accordingly.  */
> +      if ((length & 1) == 0)
> +	set_mem_align (mem, BITS_PER_UNIT * 2);
> +      else
> +	set_mem_align (mem, BITS_PER_UNIT);
> +
> +      emit_insn (gen_movmisalignv8qi (mem, reg));
> +    }
> +
> +  return true;
> +}
> +
> +/* Set a block of memory using plain strh/strb instructions, only
> +   using instructions allowed by ALIGN on processor.  We fill the
> +   first LENGTH bytes of the memory area starting from DSTBASE
> +   with byte constant VALUE.  ALIGN is the alignment requirement
> +   of memory.  */
> +static bool
> +arm_block_set_unaligned_straight (rtx dstbase,
> +				  unsigned HOST_WIDE_INT length,
> +				  unsigned HOST_WIDE_INT value,
> +				  unsigned HOST_WIDE_INT align)
> +{
> +  unsigned int i;
> +  rtx dst, addr, mem;
> +  rtx val_exp, val_reg, reg;
> +  enum machine_mode mode;
> +  HOST_WIDE_INT v = value;
> +
> +  gcc_assert (align == 1 || align == 2);
> +
> +  if (align == 2)
> +    v |= (value << BITS_PER_UNIT);
> +
> +  v = sext_hwi (v, BITS_PER_WORD);
> +  val_exp = GEN_INT (v);
> +  /* Skip if it isn't profitable.  */
> +  if (!arm_block_set_straight_profit_p (val_exp, length,
> +					align, true, false))
> +    return false;
> +
> +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
> +  mode = (align == 2 ? HImode : QImode);
> +  val_reg = force_reg (SImode, val_exp);
> +  reg = gen_lowpart (mode, val_reg);
> +
> +  for (i = 0; (i + GET_MODE_SIZE (mode) <= length); i += GET_MODE_SIZE (mode))
> +    {
> +      addr = plus_constant (Pmode, dst, i);
> +      mem = adjust_automodify_address (dstbase, mode, addr, i);
> +      emit_move_insn (mem, reg);
> +    }
> +
> +  /* Handle single byte leftover.  */
> +  if (i + 1 == length)
> +    {
> +      reg = gen_lowpart (QImode, val_reg);
> +      addr = plus_constant (Pmode, dst, i);
> +      mem = adjust_automodify_address (dstbase, QImode, addr, i);
> +      emit_move_insn (mem, reg);
> +      i++;
> +    }
> +
> +  gcc_assert (i == length);
> +  return true;
> +}
> +
> +/* Set a block of memory using plain strd/str/strh/strb instructions,
> +   to permit unaligned copies on processors which support unaligned
> +   semantics for those instructions.  We fill the first LENGTH bytes
> +   of the memory area starting from DSTBASE with byte constant VALUE.
> +   ALIGN is the alignment requirement of memory.  */
> +static bool
> +arm_block_set_aligned_straight (rtx dstbase,
> +				unsigned HOST_WIDE_INT length,
> +				unsigned HOST_WIDE_INT value,
> +				unsigned HOST_WIDE_INT align)
> +{
> +  unsigned int i = 0;
> +  rtx dst, addr, mem;
> +  rtx val_exp, val_reg, reg;
> +  unsigned HOST_WIDE_INT v;
> +  bool use_strd_p;
> +
> +  use_strd_p = (length >= 2 * UNITS_PER_WORD && (align & 3) == 0
> +		&& TARGET_LDRD && current_tune->prefer_ldrd_strd);
> +
> +  v = (value | (value << 8) | (value << 16) | (value << 24));
> +  if (length < UNITS_PER_WORD)
> +    v &= (0xFFFFFFFF >> (UNITS_PER_WORD - length) * BITS_PER_UNIT);
> +
> +  if (use_strd_p)
> +    v |= (v << BITS_PER_WORD);
> +  else
> +    v = sext_hwi (v, BITS_PER_WORD);
> +
> +  val_exp = GEN_INT (v);
> +  /* Skip if it isn't profitable.  */
> +  if (!arm_block_set_straight_profit_p (val_exp, length,
> +					align, false, use_strd_p))
> +    {
> +      /* Try without strd.  */
> +      v = (v >> BITS_PER_WORD);
> +      v = sext_hwi (v, BITS_PER_WORD);
> +      val_exp = GEN_INT (v);
> +      use_strd_p = false;
> +      if (!arm_block_set_straight_profit_p (val_exp, length,
> +					    align, false, use_strd_p))
> +	return false;
> +    }
> +
> +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
> +  /* Handle double words using strd if possible.  */
> +  if (use_strd_p)
> +    {
> +      val_reg = force_reg (DImode, val_exp);
> +      reg = val_reg;
> +      for (; (i + 8 <= length); i += 8)
> +	{
> +	  addr = plus_constant (Pmode, dst, i);
> +	  mem = adjust_automodify_address (dstbase, DImode, addr, i);
> +	  emit_move_insn (mem, reg);
> +	}
> +    }
> +  else
> +    val_reg = force_reg (SImode, val_exp);
> +
> +  /* Handle words.  */
> +  reg = (use_strd_p ? gen_lowpart (SImode, val_reg) : val_reg);
> +  for (; (i + 4 <= length); i += 4)
> +    {
> +      addr = plus_constant (Pmode, dst, i);
> +      mem = adjust_automodify_address (dstbase, SImode, addr, i);
> +      if ((align & 3) == 0)
> +	emit_move_insn (mem, reg);
> +      else
> +	emit_insn (gen_unaligned_storesi (mem, reg));
> +    }
> +
> +  /* Merge last pair of STRH and STRB into a STR if possible.  */
> +  if (unaligned_access && i > 0 && (i + 3) == length)
> +    {
> +      addr = plus_constant (Pmode, dst, i - 1);
> +      mem = adjust_automodify_address (dstbase, SImode, addr, i - 1);
> +      /* We are shifting one byte back, set the alignment accordingly.  */
> +      if ((align & 1) == 0)
> +	set_mem_align (mem, BITS_PER_UNIT);
> +
> +      /* Most likely this is an unaligned access, and we can't tell at
> +	 compilation time.  */
> +      emit_insn (gen_unaligned_storesi (mem, reg));
> +      return true;
> +    }
> +
> +  /* Handle half word leftover.  */
> +  if (i + 2 <= length)
> +    {
> +      reg = gen_lowpart (HImode, val_reg);
> +      addr = plus_constant (Pmode, dst, i);
> +      mem = adjust_automodify_address (dstbase, HImode, addr, i);
> +      if ((align & 1) == 0)
> +	emit_move_insn (mem, reg);
> +      else
> +	emit_insn (gen_unaligned_storehi (mem, reg));
> +
> +      i += 2;
> +    }
> +
> +  /* Handle single byte leftover.  */
> +  if (i + 1 == length)
> +    {
> +      reg = gen_lowpart (QImode, val_reg);
> +      addr = plus_constant (Pmode, dst, i);
> +      mem = adjust_automodify_address (dstbase, QImode, addr, i);
> +      emit_move_insn (mem, reg);
> +    }
> +
> +  return true;
> +}
> +
> +/* Set a block of memory using vectorization instructions for both
> +   aligned and unaligned cases.  We fill the first LENGTH bytes of
> +   the memory area starting from DSTBASE with byte constant VALUE.
> +   ALIGN is the alignment requirement of memory.  */
> +static bool
> +arm_block_set_vect (rtx dstbase,
> +		    unsigned HOST_WIDE_INT length,
> +		    unsigned HOST_WIDE_INT value,
> +		    unsigned HOST_WIDE_INT align)
> +{
> +  /* Check whether we need to use unaligned store instruction.  */
> +  if (((align & 3) != 0 || (length & 3) != 0)
> +      /* Check whether unaligned store instruction is available.  */
> +      && (!unaligned_access || BYTES_BIG_ENDIAN))
> +    return false;

Huh!  vst1.8 can work for unaligned accesses even when hw alignment
checking is strict.

> +
> +  if ((align & 3) == 0)
> +    return arm_block_set_aligned_vect (dstbase, length, value, align);
> +  else
> +    return arm_block_set_unaligned_vect (dstbase, length, value, align);
> +}
> +
> +/* Expand string store operation.  Firstly we try to do that by using
> +   vectorization instructions, then try with ARM unaligned access and
> +   double-word store if profitable.  OPERANDS[0] is the destination,
> +   OPERANDS[1] is the number of bytes, operands[2] is the value to
> +   initialize the memory, OPERANDS[3] is the known alignment of the
> +   destination.  */
> +bool
> +arm_gen_setmem (rtx *operands)
> +{
> +  rtx dstbase = operands[0];
> +  unsigned HOST_WIDE_INT length;
> +  unsigned HOST_WIDE_INT value;
> +  unsigned HOST_WIDE_INT align;
> +
> +  if (!CONST_INT_P (operands[2]) || !CONST_INT_P (operands[1]))
> +    return false;
> +
> +  length = UINTVAL (operands[1]);
> +  if (length > 64)
> +    return false;
> +
> +  value = (UINTVAL (operands[2]) & 0xFF);
> +  align = UINTVAL (operands[3]);
> +  if (TARGET_NEON && length >= 8
> +      && current_tune->string_ops_prefer_neon
> +      && arm_block_set_vect (dstbase, length, value, align))
> +    return true;
> +
> +  if (!unaligned_access && (align & 3) != 0)
> +    return arm_block_set_unaligned_straight (dstbase, length, value, align);
> +
> +  return arm_block_set_aligned_straight (dstbase, length, value, align);
> +}
> +
>  /* Implement the TARGET_ASAN_SHADOW_OFFSET hook.  */
>  
>  static unsigned HOST_WIDE_INT
> Index: gcc/config/arm/arm-protos.h
> ===================================================================
> --- gcc/config/arm/arm-protos.h	(revision 209852)
> +++ gcc/config/arm/arm-protos.h	(working copy)
> @@ -277,6 +277,8 @@ struct tune_params
>    /* Prefer 32-bit encoding instead of 16-bit encoding where subset of flags
>       would be set.  */
>    bool disparage_partial_flag_setting_t16_encodings;
> +  /* Prefer to inline string operations like memset by using Neon.  */
> +  bool string_ops_prefer_neon;
>  };
>  
>  extern const struct tune_params *current_tune;
> @@ -289,6 +291,7 @@ extern void arm_emit_coreregs_64bit_shift (enum rt
>  extern bool arm_validize_comparison (rtx *, rtx *, rtx *);
>  #endif /* RTX_CODE */
>  
> +extern bool arm_gen_setmem (rtx *);
>  extern void arm_expand_vec_perm (rtx target, rtx op0, rtx op1, rtx sel);
>  extern bool arm_expand_vec_perm_const (rtx target, rtx op0, rtx op1, rtx sel);
>  
> Index: gcc/config/arm/arm.md
> ===================================================================
> --- gcc/config/arm/arm.md	(revision 209852)
> +++ gcc/config/arm/arm.md	(working copy)
> @@ -7555,6 +7555,20 @@
>  })
>  
>  
> +(define_expand "setmemsi"
> +  [(match_operand:BLK 0 "general_operand" "")
> +   (match_operand:SI 1 "const_int_operand" "")
> +   (match_operand:SI 2 "const_int_operand" "")
> +   (match_operand:SI 3 "const_int_operand" "")]
> +  "TARGET_32BIT"
> +{
> +  if (arm_gen_setmem (operands))
> +    DONE;
> +
> +  FAIL;
> +})
> +
> +
>  ;; Move a block of memory if it is word aligned and MORE than 2 words long.
>  ;; We could let this apply for blocks of less than this, but it clobbers so
>  ;; many registers that there is then probably a better way.
> Index: gcc/testsuite/gcc.target/arm/memset-inline-6.c
> ===================================================================
> --- gcc/testsuite/gcc.target/arm/memset-inline-6.c	(revision 0)
> +++ gcc/testsuite/gcc.target/arm/memset-inline-6.c	(revision 0)

Have you tested these when the compiler was configured with
"--with-cpu=cortex-a9"?


^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH ARM] Improve ARM memset inlining
  2014-05-02 13:59 ` Richard Earnshaw
@ 2014-05-05  7:21   ` bin.cheng
  2014-05-06  5:00     ` bin.cheng
  0 siblings, 1 reply; 14+ messages in thread
From: bin.cheng @ 2014-05-05  7:21 UTC (permalink / raw)
  To: Richard Earnshaw; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 30671 bytes --]

Hi Richard,  Thanks for reviewing.  I embedded answers to your comments below
and also updated the patch.

> -----Original Message-----
> From: Richard Earnshaw
> Sent: Friday, May 02, 2014 10:00 PM
> To: Bin Cheng
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH ARM] Improve ARM memset inlining
> 
> On 30/04/14 03:52, bin.cheng wrote:
> > Hi,
> > This patch expands small memset calls into direct memory set
> > instructions by introducing "setmemsi" pattern.  For processors
> > without NEON support, it expands memset using general store
> > instruction.  For example, strd for 4-bytes aligned addresses.  For
> > processors with NEON support, it expands memset using neon
> > instructions like vstr and miscellaneous vst1.* instructions for both
aligned
> and unaligned cases.
> >
> > This patch depends on
> > http://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html otherwise
> > vst1.64 will be generated for 32-bit aligned memory unit.
> >
> > There is also one leftover work of this patch:  Since vst1.*
> > instructions only support post-increment addressing mode, the inlined
> > memset for unaligned neon cases should be like:
> >   vmov.i32   q8, #...
> >   vst1.8     {q8}, [r3]!
> >   vst1.8     {q8}, [r3]!
> >   vst1.8     {q8}, [r3]!
> >   vst1.8     {q8}, [r3]
> 
> Other than for zero, I'd expect the vmov to be vmov.i8 to move an
arbitrary
I just used vmov.i32 as an example.  The element size is actually calculated
by the function neon_valid_immediate, which I think works as expected.

> byte value into all lanes in a vector.  After that, if the alignment is
known to
> be more than 8-bit, I'd expect the vst1 instructions (with the exception
of the
> last store if the length is not a multiple of the alignment) to use
> 
> 	vst1.<align> {reg}, [addr-reg :<align>]!
> 
> Hence, for 16-bit aligned data, we want
> 
> 	vst1.16	{q8}, [r3:16]!
Did I miss something important?  It seems to me that the only explicit
alignment hints supported are 64/128/256.  So what do you mean by 16-bit
alignment here?

> 
> > But for now, gcc can't do this and below code is generated:
> >   vmov.i32   q8, #...
> >   vst1.8     {q8}, [r3]
> >   add        r2,   r3,  #16
> >   add        r3,   r2,  #16
> >   vst1.8     {q8}, [r2]
> >   vst1.8     {q8}, [r3]
> >   add        r2,   r3,  #16
> >   vst1.8     {q8}, [r2]
> >
> > I investigated this issue.  The root cause lies in rtx cost returned
> > by ARM backend.  Anyway, I think this is another issue and should be
> > fixed in separated patch.
> >
> > Bootstrap and reg-test on cortex-a15, with or without neon support.
> > Is it OK?
> >
> 
> Some more comments inline.
> 
> > Thanks,
> > bin
> >
> >
> > 2014-04-29  Bin Cheng  <bin.cheng@arm.com>
> >
> > 	PR target/55701
> > 	* config/arm/arm.md (setmem): New pattern.
> > 	* config/arm/arm-protos.h (struct tune_params): New field.
> > 	(arm_gen_setmem): New prototype.
> > 	* config/arm/arm.c (arm_slowmul_tune): Initialize new field.
> > 	(arm_fastmul_tune, arm_strongarm_tune, arm_xscale_tune): Ditto.
> > 	(arm_9e_tune, arm_v6t2_tune, arm_cortex_tune): Ditto.
> > 	(arm_cortex_a8_tune, arm_cortex_a7_tune): Ditto.
> > 	(arm_cortex_a15_tune, arm_cortex_a53_tune): Ditto.
> > 	(arm_cortex_a57_tune, arm_cortex_a5_tune): Ditto.
> > 	(arm_cortex_a9_tune, arm_cortex_a12_tune): Ditto.
> > 	(arm_v7m_tune, arm_v6m_tune, arm_fa726te_tune): Ditto.
> > 	(arm_const_inline_cost): New function.
> > 	(arm_block_set_max_insns): New function.
> > 	(arm_block_set_straight_profit_p): New function.
> > 	(arm_block_set_vect_profit_p): New function.
> > 	(arm_block_set_unaligned_vect): New function.
> > 	(arm_block_set_aligned_vect): New function.
> > 	(arm_block_set_unaligned_straight): New function.
> > 	(arm_block_set_aligned_straight): New function.
> > 	(arm_block_set_vect, arm_gen_setmem): New functions.
> >
> > gcc/testsuite/ChangeLog
> > 2014-04-29  Bin Cheng  <bin.cheng@arm.com>
> >
> > 	PR target/55701
> > 	* gcc.target/arm/memset-inline-1.c: New test.
> > 	* gcc.target/arm/memset-inline-2.c: New test.
> > 	* gcc.target/arm/memset-inline-3.c: New test.
> > 	* gcc.target/arm/memset-inline-4.c: New test.
> > 	* gcc.target/arm/memset-inline-5.c: New test.
> > 	* gcc.target/arm/memset-inline-6.c: New test.
> > 	* gcc.target/arm/memset-inline-7.c: New test.
> > 	* gcc.target/arm/memset-inline-8.c: New test.
> > 	* gcc.target/arm/memset-inline-9.c: New test.
> >
> >
> > j1328-20140429.txt
> >
> >
> > Index: gcc/config/arm/arm.c
> >
> ==========================================================
> =========
> > --- gcc/config/arm/arm.c	(revision 209852)
> > +++ gcc/config/arm/arm.c	(working copy)
> > @@ -1585,10 +1585,11 @@ const struct tune_params arm_slowmul_tune =
> >    true,						/* Prefer constant
> pool.  */
> >    arm_default_branch_cost,
> >    false,					/* Prefer LDRD/STRD.  */
> > -  {true, true},					/* Prefer non short
> circuit.  */
> > -  &arm_default_vec_cost,                        /* Vectorizer costs.
*/
> > -  false,                                        /* Prefer Neon for
64-bits bitops.  */
> > -  false, false                                  /* Prefer 32-bit
encodings.  */
> > +  {true, true},				/* Prefer non short circuit.
*/
> > +  &arm_default_vec_cost,                /* Vectorizer costs.  */
> > +  false,                                /* Prefer Neon for 64-bits
bitops.  */
> > +  false, false,                         /* Prefer 32-bit encodings.  */
> > +  false                                 /* Prefer Neon for stringops.
*/
> >  };
> >
> 
> Please make sure that all the white space before the comments is using
TAB,
> not spaces.  Similarly for the other tables.
Fixed.

> 
> > @@ -16788,6 +16806,14 @@ arm_const_double_inline_cost (rtx val)
> >  			      NULL_RTX, NULL_RTX, 0, 0));
> >  }
> >
> > +/* Cost of loading a SImode constant.  */ static inline int
> > +arm_const_inline_cost (rtx val) {
> > +  return arm_gen_constant (SET, SImode, NULL_RTX, INTVAL (val),
> > +                           NULL_RTX, NULL_RTX, 0, 0); }
> > +
> 
> This could be used more widely if you passed the SET in as a parameter
> (there are cases in arm_new_rtx_cost that could use it, for example).
> Also, you want to enable sub-targets (that is only unsafe once you can
> no longer create new pseudos), so the penultimate argument in the call to
> arm_gen_constant should be 1.
Fixed.

> 
> >  /* Return true if it is worthwhile to split a 64-bit constant into two
> >     32-bit operations.  This is the case if optimizing for size, or
> >     if we have load delay slots, or if one 32-bit part can be done
> > with @@ -31350,6 +31383,504 @@ arm_validize_comparison (rtx
> > *comparison, rtx * op
> >
> >  }
> >
> > +/* Maximum number of instructions to set block of memory.  */ static
> > +int arm_block_set_max_insns (void) {
> > +  return (optimize_function_for_size_p (cfun) ? 4 : 8); }
> 
> I think the non-size_p alternative should really be a parameter in the
per-cpu
> costs table.
Fixed.

> 
> > +
> > +/* Return TRUE if it's profitable to set block of memory for straight
> > +   case.  */
> 
> "Straight" is confusing here.  Do you mean non-vectorized?  If so, then
> non_vect might be clearer.
Fixed.

> 
> The arguments should really be documented (see comment below about
> align, for example).
Fixed.

> 
> > +static bool
> > +arm_block_set_straight_profit_p (rtx val,
> > +				 unsigned HOST_WIDE_INT length,
> > +				 unsigned HOST_WIDE_INT align,
> > +				 bool unaligned_p, bool use_strd_p) {
> > +  int num = 0;
> > +  /* For leftovers in bytes of 0-7, we can set the memory block using
> > +     strb/strh/str with minimum instruction number.  */
> > +  int leftover[8] = {0, 1, 1, 2, 1, 2, 2, 3};
> 
> This should be marked const.
Fixed.

> 
> > +
> > +  if (unaligned_p)
> > +    {
> > +      num = arm_const_inline_cost (val);
> > +      num += length / align + length % align;
> 
> Isn't align in bits here, when you really want it in bytes?
All alignments are in bytes, starting from the "setmem" pattern.

> 
> What if align > 4 bytes?
Then it's the "!unaligned_p" case, which is handled by the other arms of this
if statement.

> 
> > +    }
> > +  else if (use_strd_p)
> > +    {
> > +      num = arm_const_double_inline_cost (val);
> > +      num += (length >> 3) + leftover[length & 7];
> > +    }
> > +  else
> > +    {
> > +      num = arm_const_inline_cost (val);
> > +      num += (length >> 2) + leftover[length & 3];
> > +    }
> > +
> > +  /* We may be able to combine last pair STRH/STRB into a single STR
> > +     by shifting one byte back.  */
> > +  if (unaligned_access && length > 3 && (length & 3) == 3)
> > +    num--;
> > +
> > +  return (num <= arm_block_set_max_insns ()); }
> > +
> > +/* Return TRUE if it's profitable to set block of memory for vector
> > +case.  */ static bool arm_block_set_vect_profit_p (unsigned
> > +HOST_WIDE_INT length,
> > +			     unsigned HOST_WIDE_INT align
> ATTRIBUTE_UNUSED,
> > +			     bool unaligned_p, enum machine_mode mode)
> 
> I'm not sure what you mean by unaligned here.  Again, documenting the
> arguments might help.
Fixed.

> 
> > +{
> > +  int num;
> > +  unsigned int nelt = GET_MODE_NUNITS (mode);
> > +
> > +  /* Num of instruction loading constant value.  */
> 
> Use either "Number" or, in this case, simply drop that bit and write:
>   /* Instruction loading constant value.  */
Fixed.

> 
> > +  num = 1;
> > +  /* Num of store instructions.  */
> 
> Likewise.
> 
> > +  num += (length + nelt - 1) / nelt;
> > +  /* Num of address adjusting instructions.  */
> 
> Can't we work on the premise that the address adjusting instructions will
be
> merged into the stores?  I know you said that they currently do not, but
> that's not a problem that this bit of code should have to worry about.
Fixed.

> 
> > +  if (unaligned_p)
> > +    /* For unaligned case, it's one less than the store instructions.
*/
> > +    num += (length + nelt - 1) / nelt - 1;  else if ((length & 3) !=
> > + 0)
> > +    /* For aligned case, it's one if bytes leftover can only be stored
> > +       by mis-aligned store instruction.  */
> > +    num++;
> > +
> > +  /* Store the first 16 bytes using vst1:v16qi for the aligned case.
> > + */  if (!unaligned_p && mode == V16QImode)
> > +    num--;
> > +
> > +  return (num <= arm_block_set_max_insns ()); }
> > +
> > +/* Set a block of memory using vectorization instructions for the
> > +   unaligned case.  We fill the first LENGTH bytes of the memory
> > +   area starting from DSTBASE with byte constant VALUE.  ALIGN is
> > +   the alignment requirement of memory.  */
> 
> What's the return value mean?
Documented.

> 
> > +static bool
> > +arm_block_set_unaligned_vect (rtx dstbase,
> > +			      unsigned HOST_WIDE_INT length,
> > +			      unsigned HOST_WIDE_INT value,
> > +			      unsigned HOST_WIDE_INT align) {
> > +  unsigned int i = 0, j = 0, nelt_v16, nelt_v8, nelt_mode;
> 
> Don't mix initialized declarations with uninitialized ones on the same line.
You
> don't appear to use either I or J until their first use in the loop
control below,
> so why initialize them here?
Fixed.

> 
> > +  rtx dst, mem;
> > +  rtx val_elt, val_vec, reg;
> > +  rtx rval[MAX_VECT_LEN];
> > +  rtx (*gen_func) (rtx, rtx);
> > +  enum machine_mode mode;
> > +  unsigned HOST_WIDE_INT v = value;
> > +
> > +  gcc_assert ((align & 0x3) != 0);
> > +  nelt_v8 = GET_MODE_NUNITS (V8QImode);
> > +  nelt_v16 = GET_MODE_NUNITS (V16QImode);  if (length >= nelt_v16)
> > +    {
> > +      mode = V16QImode;
> > +      gen_func = gen_movmisalignv16qi;
> > +    }
> > +  else
> > +    {
> > +      mode = V8QImode;
> > +      gen_func = gen_movmisalignv8qi;
> > +    }
> > +  nelt_mode = GET_MODE_NUNITS (mode);  gcc_assert (length >=
> > + nelt_mode);
> > +  /* Skip if it isn't profitable.  */  if
> > + (!arm_block_set_vect_profit_p (length, align, true, mode))
> > +    return false;
> > +
> > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));  mem =
> > + adjust_automodify_address (dstbase, mode, dst, 0);
> > +
> > +  v = sext_hwi (v, BITS_PER_WORD);
> > +  val_elt = GEN_INT (v);
> > +  for (; j < nelt_mode; j++)
> > +    rval[j] = val_elt;
> 
> Is this the first use of J?  If so, initialize it here.
> 
> > +
> > +  reg = gen_reg_rtx (mode);
> > +  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode,
> > + rval));
> > +  /* Emit instruction loading the constant value.  */  emit_move_insn
> > + (reg, val_vec);
> > +
> > +  /* Handle nelt_mode bytes in a vector.  */  for (; (i + nelt_mode
> > + <= length); i += nelt_mode)
> 
> Similarly for I.
> 
> > +    {
> > +      emit_insn ((*gen_func) (mem, reg));
> > +      if (i + 2 * nelt_mode <= length)
> > +	emit_insn (gen_add2_insn (dst, GEN_INT (nelt_mode)));
> > +    }
> > +
> > +  if (i + nelt_v8 <= length)
> > +    gcc_assert (mode == V16QImode);
> 
> Why not drop the if and write:
> 
>      gcc_assert ((i + nelt_v8) > length || mode == V16QImode);
Fixed.

> 
> > +
> > +  /* Handle (8, 16) bytes leftover.  */  if (i + nelt_v8 < length)
> 
> Your assertion above checked <=, but here you use <.  Is that correct?
Yes, it is.  For the "==" case we have exactly nelt_v8 bytes leftover, which
will be handled by the last branch of the if statement.
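
To illustrate with concrete numbers: say length == 24 and mode == V16QImode,
so nelt_mode == 16 and nelt_v8 == 8.  The main loop stores bytes 0-15 and
stops with i == 16; then i + nelt_v8 == length, so the "(0, 8]" branch emits
a single v8qi store for the last 8 bytes rather than the "(8, 16)" branch.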
 
> 
> > +    {
> > +      emit_insn (gen_add2_insn (dst, GEN_INT (length - i)));
> > +      /* We are shifting bytes back, set the alignment accordingly.  */
> > +      if ((length & 1) != 0 && align >= 2)
> > +	set_mem_align (mem, BITS_PER_UNIT);
> > +
> > +      emit_insn (gen_movmisalignv16qi (mem, reg));
> > +    }
> > +  /* Handle (0, 8] bytes leftover.  */
> > +  else if (i < length && i + nelt_v8 >= length)
> > +    {
> > +      if (mode == V16QImode)
> > +	{
> > +	  reg = gen_lowpart (V8QImode, reg);
> > +	  mem = adjust_automodify_address (dstbase, V8QImode, dst, 0);
> > +	}
> > +      emit_insn (gen_add2_insn (dst, GEN_INT ((length - i)
> > +					      + (nelt_mode - nelt_v8))));
> > +      /* We are shifting bytes back, set the alignment accordingly.  */
> > +      if ((length & 1) != 0 && align >= 2)
> > +	set_mem_align (mem, BITS_PER_UNIT);
> > +
> > +      emit_insn (gen_movmisalignv8qi (mem, reg));
> > +    }
> > +
> > +  return true;
> > +}
> > +
> > +/* Set a block of memory using vectorization instructions for the
> > +   aligned case.  We fill the first LENGTH bytes of the memory area
> > +   starting from DSTBASE with byte constant VALUE.  ALIGN is the
> > +   alignment requirement of memory.  */
> 
> See all the comments above for the unaligned case.
Fixed accordingly.

> 
> > +static bool
> > +arm_block_set_aligned_vect (rtx dstbase,
> > +			    unsigned HOST_WIDE_INT length,
> > +			    unsigned HOST_WIDE_INT value,
> > +			    unsigned HOST_WIDE_INT align)
> > +{
> > +  unsigned int i = 0, j = 0, nelt_v8, nelt_v16, nelt_mode;
> > +  rtx dst, addr, mem;
> > +  rtx val_elt, val_vec, reg;
> > +  rtx rval[MAX_VECT_LEN];
> > +  enum machine_mode mode;
> > +  unsigned HOST_WIDE_INT v = value;
> > +
> > +  gcc_assert ((align & 0x3) == 0);
> > +  nelt_v8 = GET_MODE_NUNITS (V8QImode);
> > +  nelt_v16 = GET_MODE_NUNITS (V16QImode);  if (length >= nelt_v16 &&
> > + unaligned_access && !BYTES_BIG_ENDIAN)
> > +    mode = V16QImode;
> > +  else
> > +    mode = V8QImode;
> > +
> > +  nelt_mode = GET_MODE_NUNITS (mode);  gcc_assert (length >=
> > + nelt_mode);
> > +  /* Skip if it isn't profitable.  */  if
> > + (!arm_block_set_vect_profit_p (length, align, false, mode))
> > +    return false;
> > +
> > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
> > +
> > +  v = sext_hwi (v, BITS_PER_WORD);
> > +  val_elt = GEN_INT (v);
> > +  for (; j < nelt_mode; j++)
> > +    rval[j] = val_elt;
> > +
> > +  reg = gen_reg_rtx (mode);
> > +  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode,
> > + rval));
> > +  /* Emit instruction loading the constant value.  */  emit_move_insn
> > + (reg, val_vec);
> > +
> > +  /* Handle first 16 bytes specially using vst1:v16qi instruction.
> > +*/
> > +  if (mode == V16QImode)
> > +    {
> > +      mem = adjust_automodify_address (dstbase, mode, dst, 0);
> > +      emit_insn (gen_movmisalignv16qi (mem, reg));
> > +      i += nelt_mode;
> > +      /* Handle (8, 16) bytes leftover using vst1:v16qi again.  */
> > +      if (i + nelt_v8 < length && i + nelt_v16 > length)
> > +	{
> > +	  emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
> > +	  mem = adjust_automodify_address (dstbase, mode, dst, 0);
> > +	  /* We are shifting bytes back, set the alignment accordingly.  */
> > +	  if ((length & 0x3) == 0)
> > +	    set_mem_align (mem, BITS_PER_UNIT * 4);
> > +	  else if ((length & 0x1) == 0)
> > +	    set_mem_align (mem, BITS_PER_UNIT * 2);
> > +	  else
> > +	    set_mem_align (mem, BITS_PER_UNIT);
> > +
> > +	  emit_insn (gen_movmisalignv16qi (mem, reg));
> > +	  return true;
> > +	}
> > +      /* Fall through for bytes leftover.  */
> > +      mode = V8QImode;
> > +      nelt_mode = GET_MODE_NUNITS (mode);
> > +      reg = gen_lowpart (V8QImode, reg);
> > +    }
> > +
> > +  /* Handle 8 bytes in a vector.  */
> > +  for (; (i + nelt_mode <= length); i += nelt_mode)
> > +    {
> > +      addr = plus_constant (Pmode, dst, i);
> > +      mem = adjust_automodify_address (dstbase, mode, addr, i);
> > +      emit_move_insn (mem, reg);
> > +    }
> > +
> > +  /* Handle single word leftover by shifting 4 bytes back.  We can
> > +     use aligned access for this case.  */
> > +  if (i + UNITS_PER_WORD == length)
> > +    {
> > +      addr = plus_constant (Pmode, dst, i - UNITS_PER_WORD);
> > +      mem = adjust_automodify_address (dstbase, mode,
> > +				       addr, i - UNITS_PER_WORD);
> > +      /* We are shifting 4 bytes back, set the alignment accordingly.
*/
> > +      if (align > UNITS_PER_WORD)
> > +	set_mem_align (mem, BITS_PER_UNIT * UNITS_PER_WORD);
> > +
> > +      emit_move_insn (mem, reg);
> > +    }
> > +  /* Handle (0, 4), (4, 8) bytes leftover by shifting bytes back.
> > +     We have to use unaligned access for this case.  */
> > +  else if (i < length)
> > +    {
> > +      emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
> > +      mem = adjust_automodify_address (dstbase, mode, dst, 0);
> > +      /* We are shifting bytes back, set the alignment accordingly.  */
> > +      if ((length & 1) == 0)
> > +	set_mem_align (mem, BITS_PER_UNIT * 2);
> > +      else
> > +	set_mem_align (mem, BITS_PER_UNIT);
> > +
> > +      emit_insn (gen_movmisalignv8qi (mem, reg));
> > +    }
> > +
> > +  return true;
> > +}
> > +
> > +/* Set a block of memory using plain strh/strb instructions, only
> > +   using instructions allowed by ALIGN on processor.  We fill the
> > +   first LENGTH bytes of the memory area starting from DSTBASE
> > +   with byte constant VALUE.  ALIGN is the alignment requirement
> > +   of memory.  */
> > +static bool
> > +arm_block_set_unaligned_straight (rtx dstbase,
> > +				  unsigned HOST_WIDE_INT length,
> > +				  unsigned HOST_WIDE_INT value,
> > +				  unsigned HOST_WIDE_INT align)
> > +{
> > +  unsigned int i;
> > +  rtx dst, addr, mem;
> > +  rtx val_exp, val_reg, reg;
> > +  enum machine_mode mode;
> > +  HOST_WIDE_INT v = value;
> > +
> > +  gcc_assert (align == 1 || align == 2);
> > +
> > +  if (align == 2)
> > +    v |= (value << BITS_PER_UNIT);
> > +
> > +  v = sext_hwi (v, BITS_PER_WORD);
> > +  val_exp = GEN_INT (v);
> > +  /* Skip if it isn't profitable.  */
> > +  if (!arm_block_set_straight_profit_p (val_exp, length,
> > +					align, true, false))
> > +    return false;
> > +
> > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));  mode = (align == 2 ?
> > + HImode : QImode);  val_reg = force_reg (SImode, val_exp);  reg =
> > + gen_lowpart (mode, val_reg);
> > +
> > +  for (i = 0; (i + GET_MODE_SIZE (mode) <= length); i += GET_MODE_SIZE
> (mode))
> > +    {
> > +      addr = plus_constant (Pmode, dst, i);
> > +      mem = adjust_automodify_address (dstbase, mode, addr, i);
> > +      emit_move_insn (mem, reg);
> > +    }
> > +
> > +  /* Handle single byte leftover.  */  if (i + 1 == length)
> > +    {
> > +      reg = gen_lowpart (QImode, val_reg);
> > +      addr = plus_constant (Pmode, dst, i);
> > +      mem = adjust_automodify_address (dstbase, QImode, addr, i);
> > +      emit_move_insn (mem, reg);
> > +      i++;
> > +    }
> > +
> > +  gcc_assert (i == length);
> > +  return true;
> > +}
> > +
> > +/* Set a block of memory using plain strd/str/strh/strb instructions,
> > +   to permit unaligned copies on processors which support unaligned
> > +   semantics for those instructions.  We fill the first LENGTH bytes
> > +   of the memory area starting from DSTBASE with byte constant VALUE.
> > +   ALIGN is the alignment requirement of memory.  */ static bool
> > +arm_block_set_aligned_straight (rtx dstbase,
> > +				unsigned HOST_WIDE_INT length,
> > +				unsigned HOST_WIDE_INT value,
> > +				unsigned HOST_WIDE_INT align)
> > +{
> > +  unsigned int i = 0;
> > +  rtx dst, addr, mem;
> > +  rtx val_exp, val_reg, reg;
> > +  unsigned HOST_WIDE_INT v;
> > +  bool use_strd_p;
> > +
> > +  use_strd_p = (length >= 2 * UNITS_PER_WORD && (align & 3) == 0
> > +		&& TARGET_LDRD && current_tune->prefer_ldrd_strd);
> > +
> > +  v = (value | (value << 8) | (value << 16) | (value << 24));  if
> > + (length < UNITS_PER_WORD)
> > +    v &= (0xFFFFFFFF >> (UNITS_PER_WORD - length) * BITS_PER_UNIT);
> > +
> > +  if (use_strd_p)
> > +    v |= (v << BITS_PER_WORD);
> > +  else
> > +    v = sext_hwi (v, BITS_PER_WORD);
> > +
> > +  val_exp = GEN_INT (v);
> > +  /* Skip if it isn't profitable.  */
> > +  if (!arm_block_set_straight_profit_p (val_exp, length,
> > +					align, false, use_strd_p))
> > +    {
> > +      /* Try without strd.  */
> > +      v = (v >> BITS_PER_WORD);
> > +      v = sext_hwi (v, BITS_PER_WORD);
> > +      val_exp = GEN_INT (v);
> > +      use_strd_p = false;
> > +      if (!arm_block_set_straight_profit_p (val_exp, length,
> > +					    align, false, use_strd_p))
> > +	return false;
> > +    }
> > +
> > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
> > +  /* Handle double words using strd if possible.  */
> > +  if (use_strd_p)
> > +    {
> > +      val_reg = force_reg (DImode, val_exp);
> > +      reg = val_reg;
> > +      for (; (i + 8 <= length); i += 8)
> > +	{
> > +	  addr = plus_constant (Pmode, dst, i);
> > +	  mem = adjust_automodify_address (dstbase, DImode, addr, i);
> > +	  emit_move_insn (mem, reg);
> > +	}
> > +    }
> > +  else
> > +    val_reg = force_reg (SImode, val_exp);
> > +
> > +  /* Handle words.  */
> > +  reg = (use_strd_p ? gen_lowpart (SImode, val_reg) : val_reg);
> > +  for (; (i + 4 <= length); i += 4)
> > +    {
> > +      addr = plus_constant (Pmode, dst, i);
> > +      mem = adjust_automodify_address (dstbase, SImode, addr, i);
> > +      if ((align & 3) == 0)
> > +	emit_move_insn (mem, reg);
> > +      else
> > +	emit_insn (gen_unaligned_storesi (mem, reg));
> > +    }
> > +
> > +  /* Merge last pair of STRH and STRB into a STR if possible.  */
> > +  if (unaligned_access && i > 0 && (i + 3) == length)
> > +    {
> > +      addr = plus_constant (Pmode, dst, i - 1);
> > +      mem = adjust_automodify_address (dstbase, SImode, addr, i - 1);
> > +      /* We are shifting one byte back, set the alignment accordingly.
*/
> > +      if ((align & 1) == 0)
> > +	set_mem_align (mem, BITS_PER_UNIT);
> > +
> > +      /* Most likely this is an unaligned access, and we can't tell at
> > +	 compilation time.  */
> > +      emit_insn (gen_unaligned_storesi (mem, reg));
> > +      return true;
> > +    }
> > +
> > +  /* Handle half word leftover.  */
> > +  if (i + 2 <= length)
> > +    {
> > +      reg = gen_lowpart (HImode, val_reg);
> > +      addr = plus_constant (Pmode, dst, i);
> > +      mem = adjust_automodify_address (dstbase, HImode, addr, i);
> > +      if ((align & 1) == 0)
> > +	emit_move_insn (mem, reg);
> > +      else
> > +	emit_insn (gen_unaligned_storehi (mem, reg));
> > +
> > +      i += 2;
> > +    }
> > +
> > +  /* Handle single byte leftover.  */  if (i + 1 == length)
> > +    {
> > +      reg = gen_lowpart (QImode, val_reg);
> > +      addr = plus_constant (Pmode, dst, i);
> > +      mem = adjust_automodify_address (dstbase, QImode, addr, i);
> > +      emit_move_insn (mem, reg);
> > +    }
> > +
> > +  return true;
> > +}
> > +
> > +/* Set a block of memory using vectorization instructions for both
> > +   aligned and unaligned cases.  We fill the first LENGTH bytes of
> > +   the memory area starting from DSTBASE with byte constant VALUE.
> > +   ALIGN is the alignment requirement of memory.  */ static bool
> > +arm_block_set_vect (rtx dstbase,
> > +		    unsigned HOST_WIDE_INT length,
> > +		    unsigned HOST_WIDE_INT value,
> > +		    unsigned HOST_WIDE_INT align)
> > +{
> > +  /* Check whether we need to use unaligned store instruction.  */
> > +  if (((align & 3) != 0 || (length & 3) != 0)
> > +      /* Check whether unaligned store instruction is available.  */
> > +      && (!unaligned_access || BYTES_BIG_ENDIAN))
> > +    return false;
> 
> Huh!  vst1.8 can work for unaligned accesses even when hw alignment
> checking is strict.
Hmm, all movmisalign patterns are guarded by "!BYTES_BIG_ENDIAN &&
unaligned_access", so vst1.8 instructions can't be used this way at the
moment.  I agree that it's too strict, but I think that's another problem.
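
For reference, the expander in neon.md looks roughly like this (paraphrasing
from memory; predicates and body omitted):

        (define_expand "movmisalign<mode>"
          [...]
          "TARGET_NEON && !BYTES_BIG_ENDIAN && unaligned_access"
          ...)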

> 
> > +
> > +  if ((align & 3) == 0)
> > +    return arm_block_set_aligned_vect (dstbase, length, value,
> > +align);
> > +  else
> > +    return arm_block_set_unaligned_vect (dstbase, length, value,
> > +align); }
> > +
> > +/* Expand string store operation.  Firstly we try to do that by using
> > +   vectorization instructions, then try with ARM unaligned access and
> > +   double-word store if profitable.  OPERANDS[0] is the destination,
> > +   OPERANDS[1] is the number of bytes, operands[2] is the value to
> > +   initialize the memory, OPERANDS[3] is the known alignment of the
> > +   destination.  */
> > +bool
> > +arm_gen_setmem (rtx *operands)
> > +{
> > +  rtx dstbase = operands[0];
> > +  unsigned HOST_WIDE_INT length;
> > +  unsigned HOST_WIDE_INT value;
> > +  unsigned HOST_WIDE_INT align;
> > +
> > +  if (!CONST_INT_P (operands[2]) || !CONST_INT_P (operands[1]))
> > +    return false;
> > +
> > +  length = UINTVAL (operands[1]);
> > +  if (length > 64)
> > +    return false;
> > +
> > +  value = (UINTVAL (operands[2]) & 0xFF);  align = UINTVAL
> > + (operands[3]);  if (TARGET_NEON && length >= 8
> > +      && current_tune->string_ops_prefer_neon
> > +      && arm_block_set_vect (dstbase, length, value, align))
> > +    return true;
> > +
> > +  if (!unaligned_access && (align & 3) != 0)
> > +    return arm_block_set_unaligned_straight (dstbase, length, value,
> > + align);
> > +
> > +  return arm_block_set_aligned_straight (dstbase, length, value,
> > +align); }
> > +
> >  /* Implement the TARGET_ASAN_SHADOW_OFFSET hook.  */
> >
> >  static unsigned HOST_WIDE_INT
> > Index: gcc/config/arm/arm-protos.h
> >
> ==========================================================
> =========
> > --- gcc/config/arm/arm-protos.h	(revision 209852)
> > +++ gcc/config/arm/arm-protos.h	(working copy)
> > @@ -277,6 +277,8 @@ struct tune_params
> >    /* Prefer 32-bit encoding instead of 16-bit encoding where subset of
flags
> >       would be set.  */
> >    bool disparage_partial_flag_setting_t16_encodings;
> > +  /* Prefer to inline string operations like memset by using Neon.
> > + */  bool string_ops_prefer_neon;
> >  };
> >
> >  extern const struct tune_params *current_tune; @@ -289,6 +291,7 @@
> > extern void arm_emit_coreregs_64bit_shift (enum rt  extern bool
> > arm_validize_comparison (rtx *, rtx *, rtx *);  #endif /* RTX_CODE */
> >
> > +extern bool arm_gen_setmem (rtx *);
> >  extern void arm_expand_vec_perm (rtx target, rtx op0, rtx op1, rtx
> > sel);  extern bool arm_expand_vec_perm_const (rtx target, rtx op0, rtx
> > op1, rtx sel);
> >
> > Index: gcc/config/arm/arm.md
> >
> ==========================================================
> =========
> > --- gcc/config/arm/arm.md	(revision 209852)
> > +++ gcc/config/arm/arm.md	(working copy)
> > @@ -7555,6 +7555,20 @@
> >  })
> >
> >
> > +(define_expand "setmemsi"
> > +  [(match_operand:BLK 0 "general_operand" "")
> > +   (match_operand:SI 1 "const_int_operand" "")
> > +   (match_operand:SI 2 "const_int_operand" "")
> > +   (match_operand:SI 3 "const_int_operand" "")]
> > +  "TARGET_32BIT"
> > +{
> > +  if (arm_gen_setmem (operands))
> > +    DONE;
> > +
> > +  FAIL;
> > +})
> > +
> > +
> >  ;; Move a block of memory if it is word aligned and MORE than 2 words
> long.
> >  ;; We could let this apply for blocks of less than this, but it
> > clobbers so  ;; many registers that there is then probably a better way.
> > Index: gcc/testsuite/gcc.target/arm/memset-inline-6.c
> >
> ==========================================================
> =========
> > --- gcc/testsuite/gcc.target/arm/memset-inline-6.c	(revision 0)
> > +++ gcc/testsuite/gcc.target/arm/memset-inline-6.c	(revision 0)
> 
> Have you tested these when the compiler was configured with "--with-
> cpu=cortex-a9"?
Here is the tricky part.
For a compiler configured with "--with-tune=cortex-a9", the neon-related cases
(4/5/6/8/9) would fail because we have no way here to determine that we are
compiling with the cortex-a9 tuning.
For a compiler configured with "--with-cpu=cortex-a9", the test cases would
pass, but I think this is a mistake.  It reveals an issue: GCC doesn't pass
"-mcpu=cortex-a9" to cc1, so the cortex-a8 tuning gets selected instead, which
makes no sense.
Given these issues, I didn't change the tests for now.

During the review process, I spotted and fixed a latent bug in some rare
cases.  I also set the string_ops_prefer_neon tune parameter to true for
cortex-a8 and cortex-a5.
The second version of the patch is attached.  Is it OK?

Thanks,
bin



[-- Attachment #2: j1328-20140505.txt --]
[-- Type: text/plain, Size: 56384 bytes --]

Index: gcc/testsuite/gcc.target/arm/memset-inline-7.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-7.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-7.c	(revision 0)
@@ -0,0 +1,171 @@
+/* { dg-do run } */
+/* { dg-options "-O2" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+int b[LEN];
+
+void
+init (signed char *arr, int len)
+{
+  int i;
+  for (i = 0; i < len; i++)
+    arr[i] = 0;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+#define TEST(a,l,v)			\
+	init ((signed char*)(a), sizeof (a));		\
+	memset ((a), (v), (l));				\
+	check ((signed char *)(a), (l), sizeof (a), (v));
+int
+main(void)
+{
+  TEST (a, 1, -1);
+  TEST (a, 2, -1);
+  TEST (a, 3, -1);
+  TEST (a, 4, -1);
+  TEST (a, 5, -1);
+  TEST (a, 6, -1);
+  TEST (a, 7, -1);
+  TEST (a, 8, -1);
+  TEST (a, 9, 1);
+  TEST (a, 10, -1);
+  TEST (a, 11, 1);
+  TEST (a, 12, -1);
+  TEST (a, 13, 1);
+  TEST (a, 14, -1);
+  TEST (a, 15, 1);
+  TEST (a, 16, -1);
+  TEST (a, 17, 1);
+  TEST (a, 18, -1);
+  TEST (a, 19, 1);
+  TEST (a, 20, -1);
+  TEST (a, 21, 1);
+  TEST (a, 22, -1);
+  TEST (a, 23, 1);
+  TEST (a, 24, -1);
+  TEST (a, 25, 1);
+  TEST (a, 26, -1);
+  TEST (a, 27, 1);
+  TEST (a, 28, -1);
+  TEST (a, 29, 1);
+  TEST (a, 30, -1);
+  TEST (a, 31, 1);
+  TEST (a, 32, -1);
+  TEST (a, 33, 1);
+  TEST (a, 34, -1);
+  TEST (a, 35, 1);
+  TEST (a, 36, -1);
+  TEST (a, 37, 1);
+  TEST (a, 38, -1);
+  TEST (a, 39, 1);
+  TEST (a, 40, -1);
+  TEST (a, 41, 1);
+  TEST (a, 42, -1);
+  TEST (a, 43, 1);
+  TEST (a, 44, -1);
+  TEST (a, 45, 1);
+  TEST (a, 46, -1);
+  TEST (a, 47, 1);
+  TEST (a, 48, -1);
+  TEST (a, 49, 1);
+  TEST (a, 50, -1);
+  TEST (a, 51, 1);
+  TEST (a, 52, -1);
+  TEST (a, 53, 1);
+  TEST (a, 54, -1);
+  TEST (a, 55, 1);
+  TEST (a, 56, -1);
+  TEST (a, 57, 1);
+  TEST (a, 58, -1);
+  TEST (a, 59, 1);
+  TEST (a, 60, -1);
+  TEST (a, 61, 1);
+  TEST (a, 62, -1);
+  TEST (a, 63, 1);
+  TEST (a, 64, -1);
+
+  TEST (b, 1, -1);
+  TEST (b, 2, -1);
+  TEST (b, 3, -1);
+  TEST (b, 4, -1);
+  TEST (b, 5, -1);
+  TEST (b, 6, -1);
+  TEST (b, 7, -1);
+  TEST (b, 8, -1);
+  TEST (b, 9, 1);
+  TEST (b, 10, -1);
+  TEST (b, 11, 1);
+  TEST (b, 12, -1);
+  TEST (b, 13, 1);
+  TEST (b, 14, -1);
+  TEST (b, 15, 1);
+  TEST (b, 16, -1);
+  TEST (b, 17, 1);
+  TEST (b, 18, -1);
+  TEST (b, 19, 1);
+  TEST (b, 20, -1);
+  TEST (b, 21, 1);
+  TEST (b, 22, -1);
+  TEST (b, 23, 1);
+  TEST (b, 24, -1);
+  TEST (b, 25, 1);
+  TEST (b, 26, -1);
+  TEST (b, 27, 1);
+  TEST (b, 28, -1);
+  TEST (b, 29, 1);
+  TEST (b, 30, -1);
+  TEST (b, 31, 1);
+  TEST (b, 32, -1);
+  TEST (b, 33, 1);
+  TEST (b, 34, -1);
+  TEST (b, 35, 1);
+  TEST (b, 36, -1);
+  TEST (b, 37, 1);
+  TEST (b, 38, -1);
+  TEST (b, 39, 1);
+  TEST (b, 40, -1);
+  TEST (b, 41, 1);
+  TEST (b, 42, -1);
+  TEST (b, 43, 1);
+  TEST (b, 44, -1);
+  TEST (b, 45, 1);
+  TEST (b, 46, -1);
+  TEST (b, 47, 1);
+  TEST (b, 48, -1);
+  TEST (b, 49, 1);
+  TEST (b, 50, -1);
+  TEST (b, 51, 1);
+  TEST (b, 52, -1);
+  TEST (b, 53, 1);
+  TEST (b, 54, -1);
+  TEST (b, 55, 1);
+  TEST (b, 56, -1);
+  TEST (b, 57, 1);
+  TEST (b, 58, -1);
+  TEST (b, 59, 1);
+  TEST (b, 60, -1);
+  TEST (b, 61, 1);
+  TEST (b, 62, -1);
+  TEST (b, 63, 1);
+  TEST (b, 64, -1);
+
+  return 0;
+}
+
Index: gcc/testsuite/gcc.target/arm/memset-inline-8.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-8.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-8.c	(revision 0)
@@ -0,0 +1,44 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline"  } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+    memset (a, -1, 14);
+    return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { scan-assembler "vst1" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-not "vstr" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-1.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-1.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-1.c	(revision 0)
@@ -0,0 +1,39 @@
+/* { dg-do run } */
+/* { dg-options "-save-temps -O2 -fno-inline"  } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+    memset (a, -1, 14);
+    return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-9.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-9.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-9.c	(revision 0)
@@ -0,0 +1,42 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -Os -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+  memset (a, -1, 14);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler "vst1" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-2.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-2.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-2.c	(revision 0)
@@ -0,0 +1,38 @@
+/* { dg-do run } */
+/* { dg-options "-save-temps -Os -fno-inline" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+  memset (a, -1, 14);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+/* { dg-final { scan-assembler "bl?\[ \t\]*memset" { target { ! arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-3.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-3.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-3.c	(revision 0)
@@ -0,0 +1,40 @@
+/* { dg-do run } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+    memset (a, -1, 7);
+    return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 7, sizeof (a), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { scan-assembler-not "strh" { target { ! arm_thumb1 } } } } */
+/* { dg-final { scan-assembler-not "strb" { target { ! arm_thumb1 } } } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-4.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-4.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-4.c	(revision 0)
@@ -0,0 +1,68 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+int a[LEN];
+int b[LEN];
+int c[LEN];
+void
+foo1 (void)
+{
+    memset (a, -1, 8);
+    return;
+}
+
+void
+foo2 (void)
+{
+  memset (b, 1, 12);
+  return;
+}
+
+void
+foo3 (void)
+{
+  memset (c, 1, 13);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  int i;
+
+  foo1 ();
+  check ((signed char *)a, 8, sizeof (a), -1);
+
+  foo2 ();
+  check ((signed char *)b, 12, sizeof (b), 1);
+
+  foo3 ();
+  check ((signed char *)c, 13, sizeof (c), 1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]+memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { scan-assembler-times "vst1\.8" 1 { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler "vstr" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-5.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-5.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-5.c	(revision 0)
@@ -0,0 +1,78 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+int a[LEN];
+int b[LEN];
+int c[LEN];
+int d[LEN];
+void
+foo1 (void)
+{
+    memset (a, -1, 16);
+    return;
+}
+
+void
+foo2 (void)
+{
+  memset (b, 1, 25);
+  return;
+}
+
+void
+foo3 (void)
+{
+  memset (c, -1, 19);
+  return;
+}
+
+void
+foo4 (void)
+{
+  memset (d, 1, 23);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo1 ();
+  check ((signed char *)a, 16, sizeof (a), -1);
+
+  foo2 ();
+  check ((signed char *)b, 25, sizeof (b), 1);
+
+  foo3 ();
+  check ((signed char *)c, 19, sizeof (c), -1);
+
+  foo4 ();
+  check ((signed char *)d, 23, sizeof (d), 1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]+memset" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler "vst1" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-not "vstr"  { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
+
Index: gcc/testsuite/gcc.target/arm/memset-inline-6.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-6.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-6.c	(revision 0)
@@ -0,0 +1,68 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+int a[LEN];
+int b[LEN];
+int c[LEN];
+void
+foo1 (void)
+{
+    memset (a, -1, 20);
+    return;
+}
+
+void
+foo2 (void)
+{
+  memset (b, 1, 24);
+  return;
+}
+
+void
+foo3 (void)
+{
+  memset (c, -1, 32);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo1 ();
+  check ((signed char *)a, 20, sizeof (a), -1);
+
+  foo2 ();
+  check ((signed char *)b, 24, sizeof (b), 1);
+
+  foo3 ();
+  check ((signed char *)c, 32, sizeof (c), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]+memset" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-times "vst1" 3 { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-times "vstr" 4 { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
+
+
Index: gcc/config/arm/arm-protos.h
===================================================================
--- gcc/config/arm/arm-protos.h	(revision 209852)
+++ gcc/config/arm/arm-protos.h	(working copy)
@@ -277,6 +277,10 @@ struct tune_params
   /* Prefer 32-bit encoding instead of 16-bit encoding where subset of flags
      would be set.  */
   bool disparage_partial_flag_setting_t16_encodings;
+  /* Prefer to inline string operations like memset by using Neon.  */
+  bool string_ops_prefer_neon;
+  /* Maximum number of instructions to inline calls to memset.  */
+  int max_insns_inline_memset;
 };
 
 extern const struct tune_params *current_tune;
@@ -289,6 +293,7 @@ extern void arm_emit_coreregs_64bit_shift (enum rt
 extern bool arm_validize_comparison (rtx *, rtx *, rtx *);
 #endif /* RTX_CODE */
 
+extern bool arm_gen_setmem (rtx *);
 extern void arm_expand_vec_perm (rtx target, rtx op0, rtx op1, rtx sel);
 extern bool arm_expand_vec_perm_const (rtx target, rtx op0, rtx op1, rtx sel);
 
Index: gcc/config/arm/arm.c
===================================================================
--- gcc/config/arm/arm.c	(revision 209852)
+++ gcc/config/arm/arm.c	(working copy)
@@ -1578,34 +1578,38 @@ const struct tune_params arm_slowmul_tune =
 {
   arm_slowmul_rtx_costs,
   NULL,
-  NULL,						/* Sched adj cost.  */
-  3,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  3,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  true,						/* Prefer constant pool.  */
+  true,					/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_fastmul_tune =
 {
   arm_fastmul_rtx_costs,
   NULL,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  true,						/* Prefer constant pool.  */
+  true,					/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 /* StrongARM has early execution of branches, so a sequence that is worth
@@ -1615,17 +1619,19 @@ const struct tune_params arm_strongarm_tune =
 {
   arm_fastmul_rtx_costs,
   NULL,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  3,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  3,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  true,						/* Prefer constant pool.  */
+  true,					/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_xscale_tune =
@@ -1633,50 +1639,56 @@ const struct tune_params arm_xscale_tune =
   arm_xscale_rtx_costs,
   NULL,
   xscale_sched_adjust_cost,
-  2,						/* Constant limit.  */
-  3,						/* Max cond insns.  */
+  2,					/* Constant limit.  */
+  3,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  true,						/* Prefer constant pool.  */
+  true,					/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_9e_tune =
 {
   arm_9e_rtx_costs,
   NULL,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  true,						/* Prefer constant pool.  */
+  true,					/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_v6t2_tune =
 {
   arm_9e_rtx_costs,
   NULL,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 /* Generic Cortex tuning.  Use more specific tunings if appropriate.  */
@@ -1684,34 +1696,38 @@ const struct tune_params arm_cortex_tune =
 {
   arm_9e_rtx_costs,
   &generic_extra_costs,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a8_tune =
 {
   arm_9e_rtx_costs,
   &cortexa8_extra_costs,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  true,					/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a7_tune =
@@ -1719,67 +1735,75 @@ const struct tune_params arm_cortex_a7_tune =
   arm_9e_rtx_costs,
   &cortexa7_extra_costs,
   NULL,
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,			/* Vectorizer costs.  */
-  false,					/* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  true,					/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a15_tune =
 {
   arm_9e_rtx_costs,
   &cortexa15_extra_costs,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  2,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  2,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  true,						/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  true, true                                    /* Prefer 32-bit encodings.  */
+  true,					/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  true, true,				/* Prefer 32-bit encodings.  */
+  true,					/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a53_tune =
 {
   arm_9e_rtx_costs,
   &cortexa53_extra_costs,
-  NULL,						/* Scheduler cost adjustment.  */
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  NULL,					/* Scheduler cost adjustment.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,			/* Vectorizer costs.  */
-  false,					/* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a57_tune =
 {
   arm_9e_rtx_costs,
   &cortexa57_extra_costs,
-  NULL,                                         /* Scheduler cost adjustment.  */
-  1,                                           /* Constant limit.  */
-  2,                                           /* Max cond insns.  */
+  NULL,					/* Scheduler cost adjustment.  */
+  1,					/* Constant limit.  */
+  2,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  false,                                       /* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  true,                                       /* Prefer LDRD/STRD.  */
-  {true, true},                                /* Prefer non short circuit.  */
-  &arm_default_vec_cost,                       /* Vectorizer costs.  */
-  false,                                       /* Prefer Neon for 64-bits bitops.  */
-  true, true                                   /* Prefer 32-bit encodings.  */
+  true,					/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  true, true,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 /* Branches can be dual-issued on Cortex-A5, so conditional execution is
@@ -1789,17 +1813,19 @@ const struct tune_params arm_cortex_a5_tune =
 {
   arm_9e_rtx_costs,
   NULL,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  1,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  1,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_cortex_a5_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {false, false},				/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {false, false},			/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  true,					/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a9_tune =
@@ -1807,16 +1833,18 @@ const struct tune_params arm_cortex_a9_tune =
   arm_9e_rtx_costs,
   &cortexa9_extra_costs,
   cortex_a9_sched_adjust_cost,
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_BENEFICIAL(4,32,32),
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a12_tune =
@@ -1824,16 +1852,18 @@ const struct tune_params arm_cortex_a12_tune =
   arm_9e_rtx_costs,
   &cortexa12_extra_costs,
   NULL,
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_BENEFICIAL(4,32,32),
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  true,						/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  true,					/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  true,					/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 /* armv7m tuning.  On Cortex-M4 cores for example, MOVW/MOVT take a single
@@ -1847,17 +1877,19 @@ const struct tune_params arm_v7m_tune =
 {
   arm_9e_rtx_costs,
   &v7m_extra_costs,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  2,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  2,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  true,						/* Prefer constant pool.  */
+  true,					/* Prefer constant pool.  */
   arm_cortex_m_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {false, false},				/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {false, false},			/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 /* The arm_v6m_tune is duplicated from arm_cortex_tune, rather than
@@ -1866,17 +1898,19 @@ const struct tune_params arm_v6m_tune =
 {
   arm_9e_rtx_costs,
   NULL,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {false, false},				/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {false, false},			/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_fa726te_tune =
@@ -1884,16 +1918,18 @@ const struct tune_params arm_fa726te_tune =
   arm_9e_rtx_costs,
   NULL,
   fa726te_sched_adjust_cost,
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  true,						/* Prefer constant pool.  */
+  true,					/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 
@@ -16788,6 +16824,14 @@ arm_const_double_inline_cost (rtx val)
 			      NULL_RTX, NULL_RTX, 0, 0));
 }
 
+/* Cost of loading a SImode constant.  */
+static inline int
+arm_const_inline_cost (enum rtx_code code, rtx val)
+{
+  return arm_gen_constant (code, SImode, NULL_RTX, INTVAL (val),
+                           NULL_RTX, NULL_RTX, 1, 0);
+}
+
 /* Return true if it is worthwhile to split a 64-bit constant into two
    32-bit operations.  This is the case if optimizing for size, or
    if we have load delay slots, or if one 32-bit part can be done with
@@ -31350,6 +31401,519 @@ arm_validize_comparison (rtx *comparison, rtx * op
 
 }
 
+/* Maximum number of instructions to set a block of memory.  */
+static int
+arm_block_set_max_insns (void)
+{
+  if (optimize_function_for_size_p (cfun))
+    return 4;
+  else
+    return current_tune->max_insns_inline_memset;
+}
+
+/* Return TRUE if it's profitable to set a block of memory in the
+   non-vectorized case.  VAL is the value to set the memory
+   with.  LENGTH is the number of bytes to set.  ALIGN is the
+   alignment of the destination memory in bytes.  UNALIGNED_P
+   is TRUE if we can only set the memory with instructions
+   meeting alignment requirements.  USE_STRD_P is TRUE if we
+   can use strd to set the memory.  */
+static bool
+arm_block_set_non_vect_profit_p (rtx val,
+				 unsigned HOST_WIDE_INT length,
+				 unsigned HOST_WIDE_INT align,
+				 bool unaligned_p, bool use_strd_p)
+{
+  int num = 0;
+  /* For a leftover of 0-7 bytes, we can set the memory block using
+     strb/strh/str with the minimum number of instructions.  */
+  const int leftover[8] = {0, 1, 1, 2, 1, 2, 2, 3};
+
+  if (unaligned_p)
+    {
+      num = arm_const_inline_cost (SET, val);
+      num += length / align + length % align;
+    }
+  else if (use_strd_p)
+    {
+      num = arm_const_double_inline_cost (val);
+      num += (length >> 3) + leftover[length & 7];
+    }
+  else
+    {
+      num = arm_const_inline_cost (SET, val);
+      num += (length >> 2) + leftover[length & 3];
+    }
+
+  /* We may be able to combine last pair STRH/STRB into a single STR
+     by shifting one byte back.  */
+  if (unaligned_access && length > 3 && (length & 3) == 3)
+    num--;
+
+  return (num <= arm_block_set_max_insns ());
+}
+
+/* Return TRUE if it's profitable to set a block of memory in the
+   vectorized case.  LENGTH is the number of bytes to set.
+   ALIGN is the alignment of the destination memory in bytes.
+   MODE is the vector mode used to set the memory.  */
+static bool
+arm_block_set_vect_profit_p (unsigned HOST_WIDE_INT length,
+			     unsigned HOST_WIDE_INT align,
+			     enum machine_mode mode)
+{
+  int num;
+  bool unaligned_p = ((align & 3) != 0);
+  unsigned int nelt = GET_MODE_NUNITS (mode);
+
+  /* One instruction to load the constant value.  */
+  num = 1;
+  /* Instructions storing to the memory.  */
+  num += (length + nelt - 1) / nelt;
+  /* Instructions adjusting the address expression.  We only need to
+     adjust the address expression if it's 4-byte aligned and the
+     leftover bytes can only be stored by a misaligned store.  */
+  if (!unaligned_p && (length & 3) != 0)
+    num++;
+
+  /* Store the first 16 bytes using vst1:v16qi for the aligned case.  */
+  if (!unaligned_p && mode == V16QImode)
+    num--;
+
+  return (num <= arm_block_set_max_insns ());
+}
+
+/* Set a block of memory using vectorization instructions for the
+   unaligned case.  We fill the first LENGTH bytes of the memory
+   area starting from DSTBASE with byte constant VALUE.  ALIGN is
+   the alignment requirement of memory.  Return TRUE if succeeded.  */
+static bool
+arm_block_set_unaligned_vect (rtx dstbase,
+			      unsigned HOST_WIDE_INT length,
+			      unsigned HOST_WIDE_INT value,
+			      unsigned HOST_WIDE_INT align)
+{
+  unsigned int i, j, nelt_v16, nelt_v8, nelt_mode;
+  rtx dst, mem;
+  rtx val_elt, val_vec, reg;
+  rtx rval[MAX_VECT_LEN];
+  rtx (*gen_func) (rtx, rtx);
+  enum machine_mode mode;
+  unsigned HOST_WIDE_INT v = value;
+
+  gcc_assert ((align & 0x3) != 0);
+  nelt_v8 = GET_MODE_NUNITS (V8QImode);
+  nelt_v16 = GET_MODE_NUNITS (V16QImode);
+  if (length >= nelt_v16)
+    {
+      mode = V16QImode;
+      gen_func = gen_movmisalignv16qi;
+    }
+  else
+    {
+      mode = V8QImode;
+      gen_func = gen_movmisalignv8qi;
+    }
+  nelt_mode = GET_MODE_NUNITS (mode);
+  gcc_assert (length >= nelt_mode);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_vect_profit_p (length, align, mode))
+    return false;
+
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+  mem = adjust_automodify_address (dstbase, mode, dst, 0);
+
+  v = sext_hwi (v, BITS_PER_WORD);
+  val_elt = GEN_INT (v);
+  for (j = 0; j < nelt_mode; j++)
+    rval[j] = val_elt;
+
+  reg = gen_reg_rtx (mode);
+  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval));
+  /* Emit instruction loading the constant value.  */
+  emit_move_insn (reg, val_vec);
+
+  /* Handle nelt_mode bytes in a vector.  */
+  for (i = 0; (i + nelt_mode <= length); i += nelt_mode)
+    {
+      emit_insn ((*gen_func) (mem, reg));
+      if (i + 2 * nelt_mode <= length)
+	emit_insn (gen_add2_insn (dst, GEN_INT (nelt_mode)));
+    }
+
+  /* If at least nelt_v8 bytes are left over, we must be in
+     V16QImode.  */
+  gcc_assert ((i + nelt_v8) > length || mode == V16QImode);
+
+  /* Handle (8, 16) bytes leftover.  */
+  if (i + nelt_v8 < length)
+    {
+      emit_insn (gen_add2_insn (dst, GEN_INT (length - i)));
+      /* We are shifting bytes back, set the alignment accordingly.  */
+      if ((length & 1) != 0 && align >= 2)
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      emit_insn (gen_movmisalignv16qi (mem, reg));
+    }
+  /* Handle (0, 8] bytes leftover.  */
+  else if (i < length && i + nelt_v8 >= length)
+    {
+      if (mode == V16QImode)
+	{
+	  reg = gen_lowpart (V8QImode, reg);
+	  mem = adjust_automodify_address (dstbase, V8QImode, dst, 0);
+	}
+      emit_insn (gen_add2_insn (dst, GEN_INT ((length - i)
+					      + (nelt_mode - nelt_v8))));
+      /* We are shifting bytes back, set the alignment accordingly.  */
+      if ((length & 1) != 0 && align >= 2)
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      emit_insn (gen_movmisalignv8qi (mem, reg));
+    }
+
+  return true;
+}
+
+/* Set a block of memory using vectorization instructions for the
+   aligned case.  We fill the first LENGTH bytes of the memory area
+   starting from DSTBASE with byte constant VALUE.  ALIGN is the
+   alignment requirement of memory.  Return TRUE if succeeded.  */
+static bool
+arm_block_set_aligned_vect (rtx dstbase,
+			    unsigned HOST_WIDE_INT length,
+			    unsigned HOST_WIDE_INT value,
+			    unsigned HOST_WIDE_INT align)
+{
+  unsigned int i, j, nelt_v8, nelt_v16, nelt_mode;
+  rtx dst, addr, mem;
+  rtx val_elt, val_vec, reg;
+  rtx rval[MAX_VECT_LEN];
+  enum machine_mode mode;
+  unsigned HOST_WIDE_INT v = value;
+
+  gcc_assert ((align & 0x3) == 0);
+  nelt_v8 = GET_MODE_NUNITS (V8QImode);
+  nelt_v16 = GET_MODE_NUNITS (V16QImode);
+  if (length >= nelt_v16 && unaligned_access && !BYTES_BIG_ENDIAN)
+    mode = V16QImode;
+  else
+    mode = V8QImode;
+
+  nelt_mode = GET_MODE_NUNITS (mode);
+  gcc_assert (length >= nelt_mode);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_vect_profit_p (length, align, mode))
+    return false;
+
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+
+  v = sext_hwi (v, BITS_PER_WORD);
+  val_elt = GEN_INT (v);
+  for (j = 0; j < nelt_mode; j++)
+    rval[j] = val_elt;
+
+  reg = gen_reg_rtx (mode);
+  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval));
+  /* Emit instruction loading the constant value.  */
+  emit_move_insn (reg, val_vec);
+
+  i = 0;
+  /* Handle first 16 bytes specially using vst1:v16qi instruction.  */
+  if (mode == V16QImode)
+    {
+      mem = adjust_automodify_address (dstbase, mode, dst, 0);
+      emit_insn (gen_movmisalignv16qi (mem, reg));
+      i += nelt_mode;
+      /* Handle (8, 16) bytes leftover using vst1:v16qi again.  */
+      if (i + nelt_v8 < length && i + nelt_v16 > length)
+	{
+	  emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
+	  mem = adjust_automodify_address (dstbase, mode, dst, 0);
+	  /* We are shifting bytes back, set the alignment accordingly.  */
+	  if ((length & 0x3) == 0)
+	    set_mem_align (mem, BITS_PER_UNIT * 4);
+	  else if ((length & 0x1) == 0)
+	    set_mem_align (mem, BITS_PER_UNIT * 2);
+	  else
+	    set_mem_align (mem, BITS_PER_UNIT);
+
+	  emit_insn (gen_movmisalignv16qi (mem, reg));
+	  return true;
+	}
+      /* Fall through for bytes leftover.  */
+      mode = V8QImode;
+      nelt_mode = GET_MODE_NUNITS (mode);
+      reg = gen_lowpart (V8QImode, reg);
+    }
+
+  /* Handle 8 bytes in a vector.  */
+  for (; (i + nelt_mode <= length); i += nelt_mode)
+    {
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, mode, addr, i);
+      emit_move_insn (mem, reg);
+    }
+
+  /* Handle single word leftover by shifting 4 bytes back.  We can
+     use aligned access for this case.  */
+  if (i + UNITS_PER_WORD == length)
+    {
+      addr = plus_constant (Pmode, dst, i - UNITS_PER_WORD);
+      mem = adjust_automodify_address (dstbase, mode,
+				       addr, i - UNITS_PER_WORD);
+      /* We are shifting 4 bytes back, set the alignment accordingly.  */
+      if (align > UNITS_PER_WORD)
+	set_mem_align (mem, BITS_PER_UNIT * UNITS_PER_WORD);
+
+      emit_move_insn (mem, reg);
+    }
+  /* Handle (0, 4), (4, 8) bytes leftover by shifting bytes back.
+     We have to use unaligned access for this case.  */
+  else if (i < length)
+    {
+      emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
+      mem = adjust_automodify_address (dstbase, mode, dst, 0);
+      /* We are shifting bytes back, set the alignment accordingly.  */
+      if ((length & 1) == 0)
+	set_mem_align (mem, BITS_PER_UNIT * 2);
+      else
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      emit_insn (gen_movmisalignv8qi (mem, reg));
+    }
+
+  return true;
+}
+
+/* Set a block of memory using plain strh/strb instructions, only
+   using instructions allowed by ALIGN on the processor.  We fill the
+   first LENGTH bytes of the memory area starting from DSTBASE
+   with byte constant VALUE.  ALIGN is the alignment requirement
+   of memory.  */
+static bool
+arm_block_set_unaligned_non_vect (rtx dstbase,
+				  unsigned HOST_WIDE_INT length,
+				  unsigned HOST_WIDE_INT value,
+				  unsigned HOST_WIDE_INT align)
+{
+  unsigned int i;
+  rtx dst, addr, mem;
+  rtx val_exp, val_reg, reg;
+  enum machine_mode mode;
+  HOST_WIDE_INT v = value;
+
+  gcc_assert (align == 1 || align == 2);
+
+  if (align == 2)
+    v |= (value << BITS_PER_UNIT);
+
+  v = sext_hwi (v, BITS_PER_WORD);
+  val_exp = GEN_INT (v);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_non_vect_profit_p (val_exp, length,
+					align, true, false))
+    return false;
+
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+  mode = (align == 2 ? HImode : QImode);
+  val_reg = force_reg (SImode, val_exp);
+  reg = gen_lowpart (mode, val_reg);
+
+  for (i = 0; (i + GET_MODE_SIZE (mode) <= length); i += GET_MODE_SIZE (mode))
+    {
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, mode, addr, i);
+      emit_move_insn (mem, reg);
+    }
+
+  /* Handle single byte leftover.  */
+  if (i + 1 == length)
+    {
+      reg = gen_lowpart (QImode, val_reg);
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, QImode, addr, i);
+      emit_move_insn (mem, reg);
+      i++;
+    }
+
+  gcc_assert (i == length);
+  return true;
+}
+
+/* Set a block of memory using plain strd/str/strh/strb instructions,
+   to permit unaligned copies on processors which support unaligned
+   semantics for those instructions.  We fill the first LENGTH bytes
+   of the memory area starting from DSTBASE with byte constant VALUE.
+   ALIGN is the alignment requirement of memory.  */
+static bool
+arm_block_set_aligned_non_vect (rtx dstbase,
+				unsigned HOST_WIDE_INT length,
+				unsigned HOST_WIDE_INT value,
+				unsigned HOST_WIDE_INT align)
+{
+  unsigned int i;
+  rtx dst, addr, mem;
+  rtx val_exp, val_reg, reg;
+  unsigned HOST_WIDE_INT v;
+  bool use_strd_p;
+
+  use_strd_p = (length >= 2 * UNITS_PER_WORD && (align & 3) == 0
+		&& TARGET_LDRD && current_tune->prefer_ldrd_strd);
+
+  v = (value | (value << 8) | (value << 16) | (value << 24));
+  if (length < UNITS_PER_WORD)
+    v &= (0xFFFFFFFF >> (UNITS_PER_WORD - length) * BITS_PER_UNIT);
+
+  if (use_strd_p)
+    v |= (v << BITS_PER_WORD);
+  else
+    v = sext_hwi (v, BITS_PER_WORD);
+
+  val_exp = GEN_INT (v);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_non_vect_profit_p (val_exp, length,
+					align, false, use_strd_p))
+    {
+      if (!use_strd_p)
+	return false;
+
+      /* Try without strd.  */
+      v = (v >> BITS_PER_WORD);
+      v = sext_hwi (v, BITS_PER_WORD);
+      val_exp = GEN_INT (v);
+      use_strd_p = false;
+      if (!arm_block_set_non_vect_profit_p (val_exp, length,
+					    align, false, use_strd_p))
+	return false;
+    }
+
+  i = 0;
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+  /* Handle double words using strd if possible.  */
+  if (use_strd_p)
+    {
+      val_reg = force_reg (DImode, val_exp);
+      reg = val_reg;
+      for (; (i + 8 <= length); i += 8)
+	{
+	  addr = plus_constant (Pmode, dst, i);
+	  mem = adjust_automodify_address (dstbase, DImode, addr, i);
+	  emit_move_insn (mem, reg);
+	}
+    }
+  else
+    val_reg = force_reg (SImode, val_exp);
+
+  /* Handle words.  */
+  reg = (use_strd_p ? gen_lowpart (SImode, val_reg) : val_reg);
+  for (; (i + 4 <= length); i += 4)
+    {
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, SImode, addr, i);
+      if ((align & 3) == 0)
+	emit_move_insn (mem, reg);
+      else
+	emit_insn (gen_unaligned_storesi (mem, reg));
+    }
+
+  /* Merge last pair of STRH and STRB into a STR if possible.  */
+  if (unaligned_access && i > 0 && (i + 3) == length)
+    {
+      addr = plus_constant (Pmode, dst, i - 1);
+      mem = adjust_automodify_address (dstbase, SImode, addr, i - 1);
+      /* We are shifting one byte back, set the alignment accordingly.  */
+      if ((align & 1) == 0)
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      /* Most likely this is an unaligned access, and we can't tell at
+	 compilation time.  */
+      emit_insn (gen_unaligned_storesi (mem, reg));
+      return true;
+    }
+
+  /* Handle half word leftover.  */
+  if (i + 2 <= length)
+    {
+      reg = gen_lowpart (HImode, val_reg);
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, HImode, addr, i);
+      if ((align & 1) == 0)
+	emit_move_insn (mem, reg);
+      else
+	emit_insn (gen_unaligned_storehi (mem, reg));
+
+      i += 2;
+    }
+
+  /* Handle single byte leftover.  */
+  if (i + 1 == length)
+    {
+      reg = gen_lowpart (QImode, val_reg);
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, QImode, addr, i);
+      emit_move_insn (mem, reg);
+    }
+
+  return true;
+}
+
+/* Set a block of memory using vectorization instructions for both
+   aligned and unaligned cases.  We fill the first LENGTH bytes of
+   the memory area starting from DSTBASE with byte constant VALUE.
+   ALIGN is the alignment requirement of memory.  */
+static bool
+arm_block_set_vect (rtx dstbase,
+		    unsigned HOST_WIDE_INT length,
+		    unsigned HOST_WIDE_INT value,
+		    unsigned HOST_WIDE_INT align)
+{
+  /* Check whether we need to use unaligned store instruction.  */
+  if (((align & 3) != 0 || (length & 3) != 0)
+      /* Check whether unaligned store instruction is available.  */
+      && (!unaligned_access || BYTES_BIG_ENDIAN))
+    return false;
+
+  if ((align & 3) == 0)
+    return arm_block_set_aligned_vect (dstbase, length, value, align);
+  else
+    return arm_block_set_unaligned_vect (dstbase, length, value, align);
+}
+
+/* Expand a string store operation.  First we try to do it using
+   vectorization instructions, then with ARM unaligned access and
+   double-word stores if profitable.  OPERANDS[0] is the destination,
+   OPERANDS[1] is the number of bytes, OPERANDS[2] is the value to
+   initialize the memory with, OPERANDS[3] is the known alignment of
+   the destination.  */
+bool
+arm_gen_setmem (rtx *operands)
+{
+  rtx dstbase = operands[0];
+  unsigned HOST_WIDE_INT length;
+  unsigned HOST_WIDE_INT value;
+  unsigned HOST_WIDE_INT align;
+
+  if (!CONST_INT_P (operands[2]) || !CONST_INT_P (operands[1]))
+    return false;
+
+  length = UINTVAL (operands[1]);
+  if (length > 64)
+    return false;
+
+  value = (UINTVAL (operands[2]) & 0xFF);
+  align = UINTVAL (operands[3]);
+  if (TARGET_NEON && length >= 8
+      && current_tune->string_ops_prefer_neon
+      && arm_block_set_vect (dstbase, length, value, align))
+    return true;
+
+  if (!unaligned_access && (align & 3) != 0)
+    return arm_block_set_unaligned_non_vect (dstbase, length, value, align);
+
+  return arm_block_set_aligned_non_vect (dstbase, length, value, align);
+}
+
 /* Implement the TARGET_ASAN_SHADOW_OFFSET hook.  */
 
 static unsigned HOST_WIDE_INT
Index: gcc/config/arm/arm.md
===================================================================
--- gcc/config/arm/arm.md	(revision 209852)
+++ gcc/config/arm/arm.md	(working copy)
@@ -7555,6 +7555,20 @@
 })
 
 
+(define_expand "setmemsi"
+  [(match_operand:BLK 0 "general_operand" "")
+   (match_operand:SI 1 "const_int_operand" "")
+   (match_operand:SI 2 "const_int_operand" "")
+   (match_operand:SI 3 "const_int_operand" "")]
+  "TARGET_32BIT"
+{
+  if (arm_gen_setmem (operands))
+    DONE;
+
+  FAIL;
+})
+
+
 ;; Move a block of memory if it is word aligned and MORE than 2 words long.
 ;; We could let this apply for blocks of less than this, but it clobbers so
 ;; many registers that there is then probably a better way.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH ARM] Improve ARM memset inlining
  2014-05-05  7:21   ` bin.cheng
@ 2014-05-06  5:00     ` bin.cheng
  2014-05-12  3:17       ` Bin.Cheng
  2014-06-27  8:21       ` Ramana Radhakrishnan
  0 siblings, 2 replies; 14+ messages in thread
From: bin.cheng @ 2014-05-06  5:00 UTC (permalink / raw)
  To: Bin Cheng, Richard Earnshaw; +Cc: gcc-patches



> -----Original Message-----
> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
> owner@gcc.gnu.org] On Behalf Of bin.cheng
> Sent: Monday, May 05, 2014 3:21 PM
> To: Richard Earnshaw
> Cc: gcc-patches@gcc.gnu.org
> Subject: RE: [PATCH ARM] Improve ARM memset inlining
> 
> Hi Richard,  Thanks for reviewing.  I embedded answers to your comments,
> also updated the patch.
> 
> > -----Original Message-----
> > From: Richard Earnshaw
> > Sent: Friday, May 02, 2014 10:00 PM
> > To: Bin Cheng
> > Cc: gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH ARM] Improve ARM memset inlining
> >
> > On 30/04/14 03:52, bin.cheng wrote:
> > > Hi,
> > > This patch expands small memset calls into direct memory set
> > > instructions by introducing "setmemsi" pattern.  For processors
> > > without NEON support, it expands memset using general store
> > > instruction.  For example, strd for 4-bytes aligned addresses.  For
> > > processors with NEON support, it expands memset using neon
> > > instructions like vstr and miscellaneous vst1.* instructions for
> > > both aligned and unaligned cases.
> > >
> > > This patch depends on
> > > http://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html otherwise
> > > vst1.64 will be generated for 32-bit aligned memory unit.
> > >
> > > There is also one leftover work of this patch:  Since vst1.*
> > > instructions only support post-increment addressing mode, the
> > > inlined memset for unaligned neon cases should be like:
> > >   vmov.i32   q8, #...
> > >   vst1.8     {q8}, [r3]!
> > >   vst1.8     {q8}, [r3]!
> > >   vst1.8     {q8}, [r3]!
> > >   vst1.8     {q8}, [r3]
> >
> > Other than for zero, I'd expect the vmov to be vmov.i8 to move an
> > arbitrary
> I just used vmov.i32 as an example.  The element size is actually
> calculated by function neon_valid_immediate which works as expected I think.
> 
> > byte value into all lanes in a vector.  After that, if the alignment
> > is known to be more than 8-bit, I'd expect the vst1 instructions
> > (with the exception of the
> > last store if the length is not a multiple of the alignment) to use
> >
> > 	vst1.<align> {reg}, [addr-reg :<align>]!
> >
> > Hence, for 16-bit aligned data, we want
> >
> > 	vst1.16	{q8}, [r3:16]!
> Did I miss something important?  It seems to me the explicit alignment
> notes supported are 64/128/256.  So what do you mean by 16-bit alignment
> here?
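
To make that concrete, the only address-alignment qualifiers I can see
being accepted there are the 64/128/256-bit ones, so I would have expected
something like

  vst1.16    {q8}, [r3:64]!

where the ".16" element size is independent of the ":64" alignment hint;
I don't see how a 16-bit alignment could be written in that position.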
> 
> >
> > > But for now, gcc can't do this and below code is generated:
> > >   vmov.i32   q8, #...
> > >   vst1.8     {q8}, [r3]
> > >   add        r2,   r3,  #16
> > >   add        r3,   r2,  #16
> > >   vst1.8     {q8}, [r2]
> > >   vst1.8     {q8}, [r3]
> > >   add        r2,   r3,  #16
> > >   vst1.8     {q8}, [r2]
> > >
> > > I investigated this issue.  The root cause lies in rtx cost returned
> > > by ARM backend.  Anyway, I think this is another issue and should be
> > > fixed in separated patch.
> > >
> > > Bootstrap and reg-test on cortex-a15, with or without neon support.
> > > Is it OK?
> > >
> >
> > Some more comments inline.
> >
> > > Thanks,
> > > bin
> > >
> > >
> > > 2014-04-29  Bin Cheng  <bin.cheng@arm.com>
> > >
> > > 	PR target/55701
> > > 	* config/arm/arm.md (setmem): New pattern.
> > > 	* config/arm/arm-protos.h (struct tune_params): New field.
> > > 	(arm_gen_setmem): New prototype.
> > > 	* config/arm/arm.c (arm_slowmul_tune): Initialize new field.
> > > 	(arm_fastmul_tune, arm_strongarm_tune, arm_xscale_tune): Ditto.
> > > 	(arm_9e_tune, arm_v6t2_tune, arm_cortex_tune): Ditto.
> > > 	(arm_cortex_a8_tune, arm_cortex_a7_tune): Ditto.
> > > 	(arm_cortex_a15_tune, arm_cortex_a53_tune): Ditto.
> > > 	(arm_cortex_a57_tune, arm_cortex_a5_tune): Ditto.
> > > 	(arm_cortex_a9_tune, arm_cortex_a12_tune): Ditto.
> > > 	(arm_v7m_tune, arm_v6m_tune, arm_fa726te_tune): Ditto.
> > > 	(arm_const_inline_cost): New function.
> > > 	(arm_block_set_max_insns): New function.
> > > 	(arm_block_set_straight_profit_p): New function.
> > > 	(arm_block_set_vect_profit_p): New function.
> > > 	(arm_block_set_unaligned_vect): New function.
> > > 	(arm_block_set_aligned_vect): New function.
> > > 	(arm_block_set_unaligned_straight): New function.
> > > 	(arm_block_set_aligned_straight): New function.
> > > 	(arm_block_set_vect, arm_gen_setmem): New functions.
> > >
> > > gcc/testsuite/ChangeLog
> > > 2014-04-29  Bin Cheng  <bin.cheng@arm.com>
> > >
> > > 	PR target/55701
> > > 	* gcc.target/arm/memset-inline-1.c: New test.
> > > 	* gcc.target/arm/memset-inline-2.c: New test.
> > > 	* gcc.target/arm/memset-inline-3.c: New test.
> > > 	* gcc.target/arm/memset-inline-4.c: New test.
> > > 	* gcc.target/arm/memset-inline-5.c: New test.
> > > 	* gcc.target/arm/memset-inline-6.c: New test.
> > > 	* gcc.target/arm/memset-inline-7.c: New test.
> > > 	* gcc.target/arm/memset-inline-8.c: New test.
> > > 	* gcc.target/arm/memset-inline-9.c: New test.
> > >
> > >
> > > j1328-20140429.txt
> > >
> > >
> > > Index: gcc/config/arm/arm.c
> > > ===================================================================
> > > --- gcc/config/arm/arm.c	(revision 209852)
> > > +++ gcc/config/arm/arm.c	(working copy)
> > > @@ -1585,10 +1585,11 @@ const struct tune_params arm_slowmul_tune =
> > >    true,						/* Prefer constant pool.  */
> > >    arm_default_branch_cost,
> > >    false,					/* Prefer LDRD/STRD.  */
> > > -  {true, true},					/* Prefer non short circuit.  */
> > > -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> > > -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> > > -  false, false                                  /* Prefer 32-bit encodings.  */
> > > +  {true, true},				/* Prefer non short circuit.  */
> > > +  &arm_default_vec_cost,                /* Vectorizer costs.  */
> > > +  false,                                /* Prefer Neon for 64-bits bitops.  */
> > > +  false, false,                         /* Prefer 32-bit encodings.  */
> > > +  false                                 /* Prefer Neon for stringops.  */
> > >  };
> > >
> >
> > Please make sure that all the white space before the comments is using
> > TAB,
> > not spaces.  Similarly for the other tables.
> Fixed.
> 
> >
> > > @@ -16788,6 +16806,14 @@ arm_const_double_inline_cost (rtx val)
> > >  			      NULL_RTX, NULL_RTX, 0, 0));
> > >  }
> > >
> > > +/* Cost of loading a SImode constant.  */
> > > +static inline int
> > > +arm_const_inline_cost (rtx val)
> > > +{
> > > +  return arm_gen_constant (SET, SImode, NULL_RTX, INTVAL (val),
> > > +                           NULL_RTX, NULL_RTX, 0, 0);
> > > +}
> > > +
> >
> > This could be used more widely if you passed the SET in as a parameter
> > (there are cases in arm_new_rtx_cost that could use it, for example).
> > Also, you want to enable sub-targets (only once you can't create new
> > pseudos is that not safe), so the penultimate argument in the call to
> > arm_gen_constant should be 1.
> Fixed.
> 
> >
> > >  /* Return true if it is worthwhile to split a 64-bit constant into
two
> > >     32-bit operations.  This is the case if optimizing for size, or
> > >     if we have load delay slots, or if one 32-bit part can be done
> > > with @@ -31350,6 +31383,504 @@ arm_validize_comparison (rtx
> > > *comparison, rtx * op
> > >
> > >  }
> > >
> > > +/* Maximum number of instructions to set block of memory.  */
> > > +static int arm_block_set_max_insns (void) {
> > > +  return (optimize_function_for_size_p (cfun) ? 4 : 8); }
> >
> > I think the non-size_p alternative should really be a parameter in the
> per-cpu
> > costs table.
> Fixed.
> 
> >
> > > +
> > > +/* Return TRUE if it's profitable to set block of memory for straight
> > > +   case.  */
> >
> > "Straight" is confusing here.  Do you mean non-vectorized?  If so,
> > then non_vect might be clearer.
> Fixed.
> 
> >
> > The arguments should really be documented (see comment below about
> > align, for example).
> Fixed.
> 
> >
> > > +static bool
> > > +arm_block_set_straight_profit_p (rtx val,
> > > +				 unsigned HOST_WIDE_INT length,
> > > +				 unsigned HOST_WIDE_INT align,
> > > +				 bool unaligned_p, bool use_strd_p) {
> > > +  int num = 0;
> > > +  /* For leftovers in bytes of 0-7, we can set the memory block using
> > > +     strb/strh/str with minimum instruction number.  */
> > > +  int leftover[8] = {0, 1, 1, 2, 1, 2, 2, 3};
> >
> > This should be marked const.
> Fixed.
> 
> >
> > > +
> > > +  if (unaligned_p)
> > > +    {
> > > +      num = arm_const_inline_cost (val);
> > > +      num += length / align + length % align;
> >
> > Isn't align in bits here, when you really want it in bytes?
> All alignments are in bytes starting from pattern "setmem".
> 
> >
> > What if align > 4 bytes?
> Then it's the "!unaligned_p" case and handled by other arms of this if
> statement.
> 
> >
> > > +    }
> > > +  else if (use_strd_p)
> > > +    {
> > > +      num = arm_const_double_inline_cost (val);
> > > +      num += (length >> 3) + leftover[length & 7];
> > > +    }
> > > +  else
> > > +    {
> > > +      num = arm_const_inline_cost (val);
> > > +      num += (length >> 2) + leftover[length & 3];
> > > +    }
> > > +
> > > +  /* We may be able to combine last pair STRH/STRB into a single STR
> > > +     by shifting one byte back.  */  if (unaligned_access && length
> > > + > 3 && (length & 3) == 3)
> > > +    num--;
> > > +
> > > +  return (num <= arm_block_set_max_insns ()); }
> > > +
> > > +/* Return TRUE if it's profitable to set block of memory for vector
> > > +case.  */ static bool arm_block_set_vect_profit_p (unsigned
> > > +HOST_WIDE_INT length,
> > > +			     unsigned HOST_WIDE_INT align
> > ATTRIBUTE_UNUSED,
> > > +			     bool unaligned_p, enum machine_mode mode)
> >
> > I'm not sure what you mean by unaligned here.  Again, documenting the
> > arguments might help.
> Fixed.
> 
> >
> > > +{
> > > +  int num;
> > > +  unsigned int nelt = GET_MODE_NUNITS (mode);
> > > +
> > > +  /* Num of instruction loading constant value.  */
> >
> > Use either "Number" or, in this case, simply drop that bit and write:
> >   /* Instruction loading constant value.  */
> Fixed.
> 
> >
> > > +  num = 1;
> > > +  /* Num of store instructions.  */
> >
> > Likewise.
> >
> > > +  num += (length + nelt - 1) / nelt;
> > > +  /* Num of address adjusting instructions.  */
> >
> > Can't we work on the premise that the address adjusting instructions
> > will
> be
> > merged into the stores?  I know you said that they currently do not,
> > but that's not a problem that this bit of code should have to worry
about.
> Fixed.
> 
> >
> > > +  if (unaligned_p)
> > > +    /* For unaligned case, it's one less than the store instructions.
> */
> > > +    num += (length + nelt - 1) / nelt - 1;  else if ((length & 3)
> > > + !=
> > > + 0)
> > > +    /* For aligned case, it's one if bytes leftover can only be
stored
> > > +       by mis-aligned store instruction.  */
> > > +    num++;
> > > +
> > > +  /* Store the first 16 bytes using vst1:v16qi for the aligned case.
> > > + */  if (!unaligned_p && mode == V16QImode)
> > > +    num--;
> > > +
> > > +  return (num <= arm_block_set_max_insns ()); }
> > > +
> > > +/* Set a block of memory using vectorization instructions for the
> > > +   unaligned case.  We fill the first LENGTH bytes of the memory
> > > +   area starting from DSTBASE with byte constant VALUE.  ALIGN is
> > > +   the alignment requirement of memory.  */
> >
> > What's the return value mean?
> Documented.
> 
> >
> > > +static bool
> > > +arm_block_set_unaligned_vect (rtx dstbase,
> > > +			      unsigned HOST_WIDE_INT length,
> > > +			      unsigned HOST_WIDE_INT value,
> > > +			      unsigned HOST_WIDE_INT align) {
> > > +  unsigned int i = 0, j = 0, nelt_v16, nelt_v8, nelt_mode;
> >
> > Don't mix initialized declarations with unitialized ones on the same
line.
> You
> > don't appear to use either I or J until their first use in the loop
> control below,
> > so why initialize them here?
> Fixed.
> 
> >
> > > +  rtx dst, mem;
> > > +  rtx val_elt, val_vec, reg;
> > > +  rtx rval[MAX_VECT_LEN];
> > > +  rtx (*gen_func) (rtx, rtx);
> > > +  enum machine_mode mode;
> > > +  unsigned HOST_WIDE_INT v = value;
> > > +
> > > +  gcc_assert ((align & 0x3) != 0);
> > > +  nelt_v8 = GET_MODE_NUNITS (V8QImode);
> > > +  nelt_v16 = GET_MODE_NUNITS (V16QImode);  if (length >= nelt_v16)
> > > +    {
> > > +      mode = V16QImode;
> > > +      gen_func = gen_movmisalignv16qi;
> > > +    }
> > > +  else
> > > +    {
> > > +      mode = V8QImode;
> > > +      gen_func = gen_movmisalignv8qi;
> > > +    }
> > > +  nelt_mode = GET_MODE_NUNITS (mode);  gcc_assert (length >=
> > > + nelt_mode);
> > > +  /* Skip if it isn't profitable.  */  if
> > > + (!arm_block_set_vect_profit_p (length, align, true, mode))
> > > +    return false;
> > > +
> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));  mem =
> > > + adjust_automodify_address (dstbase, mode, dst, 0);
> > > +
> > > +  v = sext_hwi (v, BITS_PER_WORD);
> > > +  val_elt = GEN_INT (v);
> > > +  for (; j < nelt_mode; j++)
> > > +    rval[j] = val_elt;
> >
> > Is this the first use of J?  If so, initialize it here.
> >
> > > +
> > > +  reg = gen_reg_rtx (mode);
> > > +  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode,
> > > + rval));
> > > +  /* Emit instruction loading the constant value.  */
> > > + emit_move_insn (reg, val_vec);
> > > +
> > > +  /* Handle nelt_mode bytes in a vector.  */  for (; (i + nelt_mode
> > > + <= length); i += nelt_mode)
> >
> > Similarly for I.
> >
> > > +    {
> > > +      emit_insn ((*gen_func) (mem, reg));
> > > +      if (i + 2 * nelt_mode <= length)
> > > +	emit_insn (gen_add2_insn (dst, GEN_INT (nelt_mode)));
> > > +    }
> > > +
> > > +  if (i + nelt_v8 <= length)
> > > +    gcc_assert (mode == V16QImode);
> >
> > Why not drop the if and write:
> >
> >      gcc_assert ((i + nelt_v8) > length || mode == V16QImode);
> Fixed.
> 
> >
> > > +
> > > +  /* Handle (8, 16) bytes leftover.  */  if (i + nelt_v8 < length)
> >
> > Your assertion above checked <=, but here you use <.  Is that correct?
> Yes, it is.  For the "==" case, it means we have nelt_v8 bytes leftover,
> which will be handled by the last branch of the if statement.
> 
> >
> > > +    {
> > > +      emit_insn (gen_add2_insn (dst, GEN_INT (length - i)));
> > > +      /* We are shifting bytes back, set the alignment accordingly.
*/
> > > +      if ((length & 1) != 0 && align >= 2)
> > > +	set_mem_align (mem, BITS_PER_UNIT);
> > > +
> > > +      emit_insn (gen_movmisalignv16qi (mem, reg));
> > > +    }
> > > +  /* Handle (0, 8] bytes leftover.  */
> > > +  else if (i < length && i + nelt_v8 >= length)
> > > +    {
> > > +      if (mode == V16QImode)
> > > +	{
> > > +	  reg = gen_lowpart (V8QImode, reg);
> > > +	  mem = adjust_automodify_address (dstbase, V8QImode, dst, 0);
> > > +	}
> > > +      emit_insn (gen_add2_insn (dst, GEN_INT ((length - i)
> > > +					      + (nelt_mode - nelt_v8))));
> > > +      /* We are shifting bytes back, set the alignment accordingly.
*/
> > > +      if ((length & 1) != 0 && align >= 2)
> > > +	set_mem_align (mem, BITS_PER_UNIT);
> > > +
> > > +      emit_insn (gen_movmisalignv8qi (mem, reg));
> > > +    }
> > > +
> > > +  return true;
> > > +}
> > > +
> > > +/* Set a block of memory using vectorization instructions for the
> > > +   aligned case.  We fill the first LENGTH bytes of the memory area
> > > +   starting from DSTBASE with byte constant VALUE.  ALIGN is the
> > > +   alignment requirement of memory.  */
> >
> > See all the comments above for the unaligend case.
> Fixed accordingly.
> 
> >
> > > +static bool
> > > +arm_block_set_aligned_vect (rtx dstbase,
> > > +			    unsigned HOST_WIDE_INT length,
> > > +			    unsigned HOST_WIDE_INT value,
> > > +			    unsigned HOST_WIDE_INT align) {
> > > +  unsigned int i = 0, j = 0, nelt_v8, nelt_v16, nelt_mode;
> > > +  rtx dst, addr, mem;
> > > +  rtx val_elt, val_vec, reg;
> > > +  rtx rval[MAX_VECT_LEN];
> > > +  enum machine_mode mode;
> > > +  unsigned HOST_WIDE_INT v = value;
> > > +
> > > +  gcc_assert ((align & 0x3) == 0);
> > > +  nelt_v8 = GET_MODE_NUNITS (V8QImode);
> > > +  nelt_v16 = GET_MODE_NUNITS (V16QImode);  if (length >= nelt_v16
> > > + && unaligned_access && !BYTES_BIG_ENDIAN)
> > > +    mode = V16QImode;
> > > +  else
> > > +    mode = V8QImode;
> > > +
> > > +  nelt_mode = GET_MODE_NUNITS (mode);  gcc_assert (length >=
> > > + nelt_mode);
> > > +  /* Skip if it isn't profitable.  */  if
> > > + (!arm_block_set_vect_profit_p (length, align, false, mode))
> > > +    return false;
> > > +
> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
> > > +
> > > +  v = sext_hwi (v, BITS_PER_WORD);
> > > +  val_elt = GEN_INT (v);
> > > +  for (; j < nelt_mode; j++)
> > > +    rval[j] = val_elt;
> > > +
> > > +  reg = gen_reg_rtx (mode);
> > > +  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode,
> > > + rval));
> > > +  /* Emit instruction loading the constant value.  */
> > > + emit_move_insn (reg, val_vec);
> > > +
> > > +  /* Handle first 16 bytes specially using vst1:v16qi instruction.
> > > +*/
> > > +  if (mode == V16QImode)
> > > +    {
> > > +      mem = adjust_automodify_address (dstbase, mode, dst, 0);
> > > +      emit_insn (gen_movmisalignv16qi (mem, reg));
> > > +      i += nelt_mode;
> > > +      /* Handle (8, 16) bytes leftover using vst1:v16qi again.  */
> > > +      if (i + nelt_v8 < length && i + nelt_v16 > length)
> > > +	{
> > > +	  emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
> > > +	  mem = adjust_automodify_address (dstbase, mode, dst, 0);
> > > +	  /* We are shifting bytes back, set the alignment accordingly.  */
> > > +	  if ((length & 0x3) == 0)
> > > +	    set_mem_align (mem, BITS_PER_UNIT * 4);
> > > +	  else if ((length & 0x1) == 0)
> > > +	    set_mem_align (mem, BITS_PER_UNIT * 2);
> > > +	  else
> > > +	    set_mem_align (mem, BITS_PER_UNIT);
> > > +
> > > +	  emit_insn (gen_movmisalignv16qi (mem, reg));
> > > +	  return true;
> > > +	}
> > > +      /* Fall through for bytes leftover.  */
> > > +      mode = V8QImode;
> > > +      nelt_mode = GET_MODE_NUNITS (mode);
> > > +      reg = gen_lowpart (V8QImode, reg);
> > > +    }
> > > +
> > > +  /* Handle 8 bytes in a vector.  */  for (; (i + nelt_mode <=
> > > + length); i += nelt_mode)
> > > +    {
> > > +      addr = plus_constant (Pmode, dst, i);
> > > +      mem = adjust_automodify_address (dstbase, mode, addr, i);
> > > +      emit_move_insn (mem, reg);
> > > +    }
> > > +
> > > +  /* Handle single word leftover by shifting 4 bytes back.  We can
> > > +     use aligned access for this case.  */
> > > +  if (i + UNITS_PER_WORD == length)
> > > +    {
> > > +      addr = plus_constant (Pmode, dst, i - UNITS_PER_WORD);
> > > +      mem = adjust_automodify_address (dstbase, mode,
> > > +				       addr, i - UNITS_PER_WORD);
> > > +      /* We are shifting 4 bytes back, set the alignment accordingly.
> */
> > > +      if (align > UNITS_PER_WORD)
> > > +	set_mem_align (mem, BITS_PER_UNIT * UNITS_PER_WORD);
> > > +
> > > +      emit_move_insn (mem, reg);
> > > +    }
> > > +  /* Handle (0, 4), (4, 8) bytes leftover by shifting bytes back.
> > > +     We have to use unaligned access for this case.  */
> > > +  else if (i < length)
> > > +    {
> > > +      emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
> > > +      mem = adjust_automodify_address (dstbase, mode, dst, 0);
> > > +      /* We are shifting bytes back, set the alignment accordingly.
*/
> > > +      if ((length & 1) == 0)
> > > +	set_mem_align (mem, BITS_PER_UNIT * 2);
> > > +      else
> > > +	set_mem_align (mem, BITS_PER_UNIT);
> > > +
> > > +      emit_insn (gen_movmisalignv8qi (mem, reg));
> > > +    }
> > > +
> > > +  return true;
> > > +}
> > > +
> > > +/* Set a block of memory using plain strh/strb instructions, only
> > > +   using instructions allowed by ALIGN on processor.  We fill the
> > > +   first LENGTH bytes of the memory area starting from DSTBASE
> > > +   with byte constant VALUE.  ALIGN is the alignment requirement
> > > +   of memory.  */
> > > +static bool
> > > +arm_block_set_unaligned_straight (rtx dstbase,
> > > +				  unsigned HOST_WIDE_INT length,
> > > +				  unsigned HOST_WIDE_INT value,
> > > +				  unsigned HOST_WIDE_INT align) {
> > > +  unsigned int i;
> > > +  rtx dst, addr, mem;
> > > +  rtx val_exp, val_reg, reg;
> > > +  enum machine_mode mode;
> > > +  HOST_WIDE_INT v = value;
> > > +
> > > +  gcc_assert (align == 1 || align == 2);
> > > +
> > > +  if (align == 2)
> > > +    v |= (value << BITS_PER_UNIT);
> > > +
> > > +  v = sext_hwi (v, BITS_PER_WORD);
> > > +  val_exp = GEN_INT (v);
> > > +  /* Skip if it isn't profitable.  */
> > > +  if (!arm_block_set_straight_profit_p (val_exp, length,
> > > +					align, true, false))
> > > +    return false;
> > > +
> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));  mode = (align == 2 ?
> > > + HImode : QImode);  val_reg = force_reg (SImode, val_exp);  reg =
> > > + gen_lowpart (mode, val_reg);
> > > +
> > > +  for (i = 0; (i + GET_MODE_SIZE (mode) <= length); i +=
> > > + GET_MODE_SIZE
> > (mode))
> > > +    {
> > > +      addr = plus_constant (Pmode, dst, i);
> > > +      mem = adjust_automodify_address (dstbase, mode, addr, i);
> > > +      emit_move_insn (mem, reg);
> > > +    }
> > > +
> > > +  /* Handle single byte leftover.  */  if (i + 1 == length)
> > > +    {
> > > +      reg = gen_lowpart (QImode, val_reg);
> > > +      addr = plus_constant (Pmode, dst, i);
> > > +      mem = adjust_automodify_address (dstbase, QImode, addr, i);
> > > +      emit_move_insn (mem, reg);
> > > +      i++;
> > > +    }
> > > +
> > > +  gcc_assert (i == length);
> > > +  return true;
> > > +}
> > > +
> > > +/* Set a block of memory using plain strd/str/strh/strb instructions,
> > > +   to permit unaligned copies on processors which support unaligned
> > > +   semantics for those instructions.  We fill the first LENGTH bytes
> > > +   of the memory area starting from DSTBASE with byte constant VALUE.
> > > +   ALIGN is the alignment requirement of memory.  */ static bool
> > > +arm_block_set_aligned_straight (rtx dstbase,
> > > +				unsigned HOST_WIDE_INT length,
> > > +				unsigned HOST_WIDE_INT value,
> > > +				unsigned HOST_WIDE_INT align)
> > > +{
> > > +  unsigned int i = 0;
> > > +  rtx dst, addr, mem;
> > > +  rtx val_exp, val_reg, reg;
> > > +  unsigned HOST_WIDE_INT v;
> > > +  bool use_strd_p;
> > > +
> > > +  use_strd_p = (length >= 2 * UNITS_PER_WORD && (align & 3) == 0
> > > +		&& TARGET_LDRD && current_tune->prefer_ldrd_strd);
> > > +
> > > +  v = (value | (value << 8) | (value << 16) | (value << 24));  if
> > > + (length < UNITS_PER_WORD)
> > > +    v &= (0xFFFFFFFF >> (UNITS_PER_WORD - length) * BITS_PER_UNIT);
> > > +
> > > +  if (use_strd_p)
> > > +    v |= (v << BITS_PER_WORD);
> > > +  else
> > > +    v = sext_hwi (v, BITS_PER_WORD);
> > > +
> > > +  val_exp = GEN_INT (v);
> > > +  /* Skip if it isn't profitable.  */
> > > +  if (!arm_block_set_straight_profit_p (val_exp, length,
> > > +					align, false, use_strd_p))
> > > +    {
> > > +      /* Try without strd.  */
> > > +      v = (v >> BITS_PER_WORD);
> > > +      v = sext_hwi (v, BITS_PER_WORD);
> > > +      val_exp = GEN_INT (v);
> > > +      use_strd_p = false;
> > > +      if (!arm_block_set_straight_profit_p (val_exp, length,
> > > +					    align, false, use_strd_p))
> > > +	return false;
> > > +    }
> > > +
> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
> > > +  /* Handle double words using strd if possible.  */
> > > +  if (use_strd_p)
> > > +    {
> > > +      val_reg = force_reg (DImode, val_exp);
> > > +      reg = val_reg;
> > > +      for (; (i + 8 <= length); i += 8)
> > > +	{
> > > +	  addr = plus_constant (Pmode, dst, i);
> > > +	  mem = adjust_automodify_address (dstbase, DImode, addr, i);
> > > +	  emit_move_insn (mem, reg);
> > > +	}
> > > +    }
> > > +  else
> > > +    val_reg = force_reg (SImode, val_exp);
> > > +
> > > +  /* Handle words.  */
> > > +  reg = (use_strd_p ? gen_lowpart (SImode, val_reg) : val_reg);
> > > +  for (; (i + 4 <= length); i += 4)
> > > +    {
> > > +      addr = plus_constant (Pmode, dst, i);
> > > +      mem = adjust_automodify_address (dstbase, SImode, addr, i);
> > > +      if ((align & 3) == 0)
> > > +	emit_move_insn (mem, reg);
> > > +      else
> > > +	emit_insn (gen_unaligned_storesi (mem, reg));
> > > +    }
> > > +
> > > +  /* Merge last pair of STRH and STRB into a STR if possible.  */
> > > + if (unaligned_access && i > 0 && (i + 3) == length)
> > > +    {
> > > +      addr = plus_constant (Pmode, dst, i - 1);
> > > +      mem = adjust_automodify_address (dstbase, SImode, addr, i - 1);
> > > +      /* We are shifting one byte back, set the alignment
accordingly.
> */
> > > +      if ((align & 1) == 0)
> > > +	set_mem_align (mem, BITS_PER_UNIT);
> > > +
> > > +      /* Most likely this is an unaligned access, and we can't tell
at
> > > +	 compilation time.  */
> > > +      emit_insn (gen_unaligned_storesi (mem, reg));
> > > +      return true;
> > > +    }
> > > +
> > > +  /* Handle half word leftover.  */
> > > +  if (i + 2 <= length)
> > > +    {
> > > +      reg = gen_lowpart (HImode, val_reg);
> > > +      addr = plus_constant (Pmode, dst, i);
> > > +      mem = adjust_automodify_address (dstbase, HImode, addr, i);
> > > +      if ((align & 1) == 0)
> > > +	emit_move_insn (mem, reg);
> > > +      else
> > > +	emit_insn (gen_unaligned_storehi (mem, reg));
> > > +
> > > +      i += 2;
> > > +    }
> > > +
> > > +  /* Handle single byte leftover.  */  if (i + 1 == length)
> > > +    {
> > > +      reg = gen_lowpart (QImode, val_reg);
> > > +      addr = plus_constant (Pmode, dst, i);
> > > +      mem = adjust_automodify_address (dstbase, QImode, addr, i);
> > > +      emit_move_insn (mem, reg);
> > > +    }
> > > +
> > > +  return true;
> > > +}
> > > +
> > > +/* Set a block of memory using vectorization instructions for both
> > > +   aligned and unaligned cases.  We fill the first LENGTH bytes of
> > > +   the memory area starting from DSTBASE with byte constant VALUE.
> > > +   ALIGN is the alignment requirement of memory.  */ static bool
> > > +arm_block_set_vect (rtx dstbase,
> > > +		    unsigned HOST_WIDE_INT length,
> > > +		    unsigned HOST_WIDE_INT value,
> > > +		    unsigned HOST_WIDE_INT align) {
> > > +  /* Check whether we need to use unaligned store instruction.  */
> > > +  if (((align & 3) != 0 || (length & 3) != 0)
> > > +      /* Check whether unaligned store instruction is available.  */
> > > +      && (!unaligned_access || BYTES_BIG_ENDIAN))
> > > +    return false;
> >
> > Huh!  vst1.8 can work for unaligned accesses even when hw alignment
> > checking is strict.
> Emm, all movmisalign patterns are guarded by "!BYTES_BIG_ENDIAN &&
> unaligned_access", so vst1.8 instructions can't be recognized this way for
> now.
> I agree that it's too strict, but that's another problem, I think.
> 
> >
> > > +
> > > +  if ((align & 3) == 0)
> > > +    return arm_block_set_aligned_vect (dstbase, length, value,
> > > +align);
> > > +  else
> > > +    return arm_block_set_unaligned_vect (dstbase, length, value,
> > > +align); }
> > > +
> > > +/* Expand string store operation.  Firstly we try to do that by using
> > > +   vectorization instructions, then try with ARM unaligned access and
> > > +   double-word store if profitable.  OPERANDS[0] is the destination,
> > > +   OPERANDS[1] is the number of bytes, operands[2] is the value to
> > > +   initialize the memory, OPERANDS[3] is the known alignment of the
> > > +   destination.  */
> > > +bool
> > > +arm_gen_setmem (rtx *operands)
> > > +{
> > > +  rtx dstbase = operands[0];
> > > +  unsigned HOST_WIDE_INT length;
> > > +  unsigned HOST_WIDE_INT value;
> > > +  unsigned HOST_WIDE_INT align;
> > > +
> > > +  if (!CONST_INT_P (operands[2]) || !CONST_INT_P (operands[1]))
> > > +    return false;
> > > +
> > > +  length = UINTVAL (operands[1]);
> > > +  if (length > 64)
> > > +    return false;
> > > +
> > > +  value = (UINTVAL (operands[2]) & 0xFF);  align = UINTVAL
> > > + (operands[3]);  if (TARGET_NEON && length >= 8
> > > +      && current_tune->string_ops_prefer_neon
> > > +      && arm_block_set_vect (dstbase, length, value, align))
> > > +    return true;
> > > +
> > > +  if (!unaligned_access && (align & 3) != 0)
> > > +    return arm_block_set_unaligned_straight (dstbase, length,
> > > + value, align);
> > > +
> > > +  return arm_block_set_aligned_straight (dstbase, length, value,
> > > +align); }
> > > +
> > >  /* Implement the TARGET_ASAN_SHADOW_OFFSET hook.  */
> > >
> > >  static unsigned HOST_WIDE_INT
> > > Index: gcc/config/arm/arm-protos.h
> > >
> >
> ==========================================================
> > =========
> > > --- gcc/config/arm/arm-protos.h	(revision 209852)
> > > +++ gcc/config/arm/arm-protos.h	(working copy)
> > > @@ -277,6 +277,8 @@ struct tune_params
> > >    /* Prefer 32-bit encoding instead of 16-bit encoding where subset
> > > of
> flags
> > >       would be set.  */
> > >    bool disparage_partial_flag_setting_t16_encodings;
> > > +  /* Prefer to inline string operations like memset by using Neon.
> > > + */  bool string_ops_prefer_neon;
> > >  };
> > >
> > >  extern const struct tune_params *current_tune; @@ -289,6 +291,7 @@
> > > extern void arm_emit_coreregs_64bit_shift (enum rt  extern bool
> > > arm_validize_comparison (rtx *, rtx *, rtx *);  #endif /* RTX_CODE
> > > */
> > >
> > > +extern bool arm_gen_setmem (rtx *);
> > >  extern void arm_expand_vec_perm (rtx target, rtx op0, rtx op1, rtx
> > > sel);  extern bool arm_expand_vec_perm_const (rtx target, rtx op0,
> > > rtx op1, rtx sel);
> > >
> > > Index: gcc/config/arm/arm.md
> > >
> >
> ==========================================================
> > =========
> > > --- gcc/config/arm/arm.md	(revision 209852)
> > > +++ gcc/config/arm/arm.md	(working copy)
> > > @@ -7555,6 +7555,20 @@
> > >  })
> > >
> > >
> > > +(define_expand "setmemsi"
> > > +  [(match_operand:BLK 0 "general_operand" "")
> > > +   (match_operand:SI 1 "const_int_operand" "")
> > > +   (match_operand:SI 2 "const_int_operand" "")
> > > +   (match_operand:SI 3 "const_int_operand" "")]
> > > +  "TARGET_32BIT"
> > > +{
> > > +  if (arm_gen_setmem (operands))
> > > +    DONE;
> > > +
> > > +  FAIL;
> > > +})
> > > +
> > > +
> > >  ;; Move a block of memory if it is word aligned and MORE than 2
> > > words
> > long.
> > >  ;; We could let this apply for blocks of less than this, but it
> > > clobbers so  ;; many registers that there is then probably a better
way.
> > > Index: gcc/testsuite/gcc.target/arm/memset-inline-6.c
> > >
> >
> ==========================================================
> > =========
> > > --- gcc/testsuite/gcc.target/arm/memset-inline-6.c	(revision 0)
> > > +++ gcc/testsuite/gcc.target/arm/memset-inline-6.c	(revision 0)
> >
> > Have you tested these when the compiler was configured with "--with-
> > cpu=cortex-a9"?
> Here is the tricky part.
> For a compiler configured with "--with-tune=cortex-a9", the NEON-related
> cases (4/5/6/8/9) would fail because we have no way to determine here that
> we are compiling with the cortex-a9 tuning.
> For a compiler configured with "--with-cpu=cortex-a9", the test cases would
> pass, but I think this is a mistake.  It reveals an issue that GCC won't
> pass "-mcpu=cortex-a9" to cc1, resulting in the cortex-a8 tuning being
> selected.  That just makes no sense.
> Because of these issues, I didn't change the tests for now.
To be precise, I configured GCC with the options "--with-arch=armv7-a
--with-cpu|--with-tune=cortex-a9".
I read the GCC documentation and realized that "-mcpu" is ignored when
"-march" is specified.  I don't know why GCC acts in this manner, but it
leads to inconsistent behavior between the configure options and the command
line.
If we configure GCC with "--with-arch=armv7-a --with-cpu=cortex-a9", then
only "-march=armv7-a" is passed to cc1.
If we compile with "-march=armv7-a -mcpu=cortex-a9", then GCC works fine and
passes "-march=armv7-a -mcpu=cortex-a9" to cc1.

Even more oddly, cc1 warns that "switch -mcpu=cortex-m4 conflicts with
-march=armv7-m switch".

Thanks,
bin




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH ARM] Improve ARM memset inlining
  2014-05-06  5:00     ` bin.cheng
@ 2014-05-12  3:17       ` Bin.Cheng
  2014-05-19  6:40         ` Bin.Cheng
  2014-06-27  8:21       ` Ramana Radhakrishnan
  1 sibling, 1 reply; 14+ messages in thread
From: Bin.Cheng @ 2014-05-12  3:17 UTC (permalink / raw)
  To: bin.cheng; +Cc: Richard Earnshaw, gcc-patches List

Ping.

Thanks,
bin

On Tue, May 6, 2014 at 12:59 PM, bin.cheng <bin.cheng@arm.com> wrote:
>
>
>> -----Original Message-----
>> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
>> owner@gcc.gnu.org] On Behalf Of bin.cheng
>> Sent: Monday, May 05, 2014 3:21 PM
>> To: Richard Earnshaw
>> Cc: gcc-patches@gcc.gnu.org
>> Subject: RE: [PATCH ARM] Improve ARM memset inlining
>>
>> Hi Richard,  Thanks for reviewing.  I embedded answers to your comments,
>> also updated the patch.
>>
>> > -----Original Message-----
>> > From: Richard Earnshaw
>> > Sent: Friday, May 02, 2014 10:00 PM
>> > To: Bin Cheng
>> > Cc: gcc-patches@gcc.gnu.org
>> > Subject: Re: [PATCH ARM] Improve ARM memset inlining
>> >
>> > On 30/04/14 03:52, bin.cheng wrote:
>> > > Hi,
>> > > This patch expands small memset calls into direct memory set
>> > > instructions by introducing "setmemsi" pattern.  For processors
>> > > without NEON support, it expands memset using general store
>> > > instruction.  For example, strd for 4-bytes aligned addresses.  For
>> > > processors with NEON support, it expands memset using neon
>> > > instructions like vstr and miscellaneous vst1.* instructions for
>> > > both
>> aligned
>> > and unaligned cases.
>> > >
>> > > This patch depends on
>> > > http://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html otherwise
>> > > vst1.64 will be generated for 32-bit aligned memory unit.
>> > >
>> > > There is also one leftover work of this patch:  Since vst1.*
>> > > instructions only support post-increment addressing mode, the
>> > > inlined memset for unaligned neon cases should be like:
>> > >   vmov.i32   q8, #...
>> > >   vst1.8     {q8}, [r3]!
>> > >   vst1.8     {q8}, [r3]!
>> > >   vst1.8     {q8}, [r3]!
>> > >   vst1.8     {q8}, [r3]
>> >
>> > Other than for zero, I'd expect the vmov to be vmov.i8 to move an
>> arbitrary
>> I just used vmov.i32 as an example.  The element size is actually
>> calculated by the function neon_valid_immediate, which works as expected,
>> I think.
>>
>> > byte value into all lanes in a vector.  After that, if the alignment
>> > is
>> known to
>> > be more than 8-bit, I'd expect the vst1 instructions (with the
>> > exception
>> of the
>> > last store if the length is not a multiple of the alignment) to use
>> >
>> >     vst1.<align> {reg}, [addr-reg :<align>]!
>> >
>> > Hence, for 16-bit aligned data, we want
>> >
>> >     vst1.16 {q8}, [r3:16]!
>> Did I miss something important?  It seems to me the explicit alignment
>> notes supported are 64/128/256.  So what do you mean by 16-bit alignment
>> here?
>>
>> >
>> > > But for now, gcc can't do this and below code is generated:
>> > >   vmov.i32   q8, #...
>> > >   vst1.8     {q8}, [r3]
>> > >   add        r2,   r3,  #16
>> > >   add        r3,   r2,  #16
>> > >   vst1.8     {q8}, [r2]
>> > >   vst1.8     {q8}, [r3]
>> > >   add        r2,   r3,  #16
>> > >   vst1.8     {q8}, [r2]
>> > >
>> > > I investigated this issue.  The root cause lies in rtx cost returned
>> > > by ARM backend.  Anyway, I think this is another issue and should be
>> > > fixed in separated patch.
>> > >
>> > > Bootstrap and reg-test on cortex-a15, with or without neon support.
>> > > Is it OK?
>> > >
>> >
>> > Some more comments inline.
>> >
>> > > Thanks,
>> > > bin
>> > >
>> > >
>> > > 2014-04-29  Bin Cheng  <bin.cheng@arm.com>
>> > >
>> > >   PR target/55701
>> > >   * config/arm/arm.md (setmem): New pattern.
>> > >   * config/arm/arm-protos.h (struct tune_params): New field.
>> > >   (arm_gen_setmem): New prototype.
>> > >   * config/arm/arm.c (arm_slowmul_tune): Initialize new field.
>> > >   (arm_fastmul_tune, arm_strongarm_tune, arm_xscale_tune): Ditto.
>> > >   (arm_9e_tune, arm_v6t2_tune, arm_cortex_tune): Ditto.
>> > >   (arm_cortex_a8_tune, arm_cortex_a7_tune): Ditto.
>> > >   (arm_cortex_a15_tune, arm_cortex_a53_tune): Ditto.
>> > >   (arm_cortex_a57_tune, arm_cortex_a5_tune): Ditto.
>> > >   (arm_cortex_a9_tune, arm_cortex_a12_tune): Ditto.
>> > >   (arm_v7m_tune, arm_v6m_tune, arm_fa726te_tune): Ditto.
>> > >   (arm_const_inline_cost): New function.
>> > >   (arm_block_set_max_insns): New function.
>> > >   (arm_block_set_straight_profit_p): New function.
>> > >   (arm_block_set_vect_profit_p): New function.
>> > >   (arm_block_set_unaligned_vect): New function.
>> > >   (arm_block_set_aligned_vect): New function.
>> > >   (arm_block_set_unaligned_straight): New function.
>> > >   (arm_block_set_aligned_straight): New function.
>> > >   (arm_block_set_vect, arm_gen_setmem): New functions.
>> > >
>> > > gcc/testsuite/ChangeLog
>> > > 2014-04-29  Bin Cheng  <bin.cheng@arm.com>
>> > >
>> > >   PR target/55701
>> > >   * gcc.target/arm/memset-inline-1.c: New test.
>> > >   * gcc.target/arm/memset-inline-2.c: New test.
>> > >   * gcc.target/arm/memset-inline-3.c: New test.
>> > >   * gcc.target/arm/memset-inline-4.c: New test.
>> > >   * gcc.target/arm/memset-inline-5.c: New test.
>> > >   * gcc.target/arm/memset-inline-6.c: New test.
>> > >   * gcc.target/arm/memset-inline-7.c: New test.
>> > >   * gcc.target/arm/memset-inline-8.c: New test.
>> > >   * gcc.target/arm/memset-inline-9.c: New test.
>> > >
>> > >
>> > > j1328-20140429.txt
>> > >
>> > >
>> > > Index: gcc/config/arm/arm.c
>> > >
>> >
>> ==========================================================
>> > =========
>> > > --- gcc/config/arm/arm.c  (revision 209852)
>> > > +++ gcc/config/arm/arm.c  (working copy)
>> > > @@ -1585,10 +1585,11 @@ const struct tune_params arm_slowmul_tune
>> =
>> > >    true,                                          /* Prefer constant
>> > pool.  */
>> > >    arm_default_branch_cost,
>> > >    false,                                 /* Prefer LDRD/STRD.  */
>> > > -  {true, true},                                  /* Prefer non short
>> > circuit.  */
>> > > -  &arm_default_vec_cost,                        /* Vectorizer costs.
>> */
>> > > -  false,                                        /* Prefer Neon for
>> 64-bits bitops.  */
>> > > -  false, false                                  /* Prefer 32-bit
>> encodings.  */
>> > > +  {true, true},                          /* Prefer non short circuit.
>> */
>> > > +  &arm_default_vec_cost,                /* Vectorizer costs.  */
>> > > +  false,                                /* Prefer Neon for 64-bits
>> bitops.  */
>> > > +  false, false,                         /* Prefer 32-bit encodings.
> */
>> > > +  false                                 /* Prefer Neon for stringops.
>> */
>> > >  };
>> > >
>> >
>> > Please make sure that all the white space before the comments is using
>> TAB,
>> > not spaces.  Similarly for the other tables.
>> Fixed.
>>
>> >
>> > > @@ -16788,6 +16806,14 @@ arm_const_double_inline_cost (rtx val)
>> > >                         NULL_RTX, NULL_RTX, 0, 0));  }
>> > >
>> > > +/* Cost of loading a SImode constant.  */ static inline int
>> > > +arm_const_inline_cost (rtx val) {
>> > > +  return arm_gen_constant (SET, SImode, NULL_RTX, INTVAL (val),
>> > > +                           NULL_RTX, NULL_RTX, 0, 0); }
>> > > +
>> >
>> > This could be used more widely if you passed the SET in as a parameter
>> > (there are cases in arm_new_rtx_cost that could use it, for example).
>> > Also, you want to enable sub-targets (only once you can't create new
>> > pseudos is that not safe), so the penultimate argument in the call to
>> > arm_gen_constant should be 1.
>> Fixed.
>>
>> >
>> > >  /* Return true if it is worthwhile to split a 64-bit constant into
> two
>> > >     32-bit operations.  This is the case if optimizing for size, or
>> > >     if we have load delay slots, or if one 32-bit part can be done
>> > > with @@ -31350,6 +31383,504 @@ arm_validize_comparison (rtx
>> > > *comparison, rtx * op
>> > >
>> > >  }
>> > >
>> > > +/* Maximum number of instructions to set block of memory.  */
>> > > +static int arm_block_set_max_insns (void) {
>> > > +  return (optimize_function_for_size_p (cfun) ? 4 : 8); }
>> >
>> > I think the non-size_p alternative should really be a parameter in the
>> per-cpu
>> > costs table.
>> Fixed.
>>
>> >
>> > > +
>> > > +/* Return TRUE if it's profitable to set block of memory for straight
>> > > +   case.  */
>> >
>> > "Straight" is confusing here.  Do you mean non-vectorized?  If so,
>> > then non_vect might be clearer.
>> Fixed.
>>
>> >
>> > The arguments should really be documented (see comment below about
>> > align, for example).
>> Fixed.
>>
>> >
>> > > +static bool
>> > > +arm_block_set_straight_profit_p (rtx val,
>> > > +                          unsigned HOST_WIDE_INT length,
>> > > +                          unsigned HOST_WIDE_INT align,
>> > > +                          bool unaligned_p, bool use_strd_p) {
>> > > +  int num = 0;
>> > > +  /* For leftovers in bytes of 0-7, we can set the memory block using
>> > > +     strb/strh/str with minimum instruction number.  */
>> > > +  int leftover[8] = {0, 1, 1, 2, 1, 2, 2, 3};
>> >
>> > This should be marked const.
>> Fixed.
>>
>> >
>> > > +
>> > > +  if (unaligned_p)
>> > > +    {
>> > > +      num = arm_const_inline_cost (val);
>> > > +      num += length / align + length % align;
>> >
>> > Isn't align in bits here, when you really want it in bytes?
>> All alignments are in bytes starting from pattern "setmem".
>>
>> >
>> > What if align > 4 bytes?
>> Then it's the "!unaligned_p" case and handled by other arms of this if
>> statement.
>>
>> >
>> > > +    }
>> > > +  else if (use_strd_p)
>> > > +    {
>> > > +      num = arm_const_double_inline_cost (val);
>> > > +      num += (length >> 3) + leftover[length & 7];
>> > > +    }
>> > > +  else
>> > > +    {
>> > > +      num = arm_const_inline_cost (val);
>> > > +      num += (length >> 2) + leftover[length & 3];
>> > > +    }
>> > > +
>> > > +  /* We may be able to combine last pair STRH/STRB into a single STR
>> > > +     by shifting one byte back.  */  if (unaligned_access && length
>> > > + > 3 && (length & 3) == 3)
>> > > +    num--;
>> > > +
>> > > +  return (num <= arm_block_set_max_insns ()); }
>> > > +
>> > > +/* Return TRUE if it's profitable to set block of memory for vector
>> > > +case.  */ static bool arm_block_set_vect_profit_p (unsigned
>> > > +HOST_WIDE_INT length,
>> > > +                      unsigned HOST_WIDE_INT align
>> > ATTRIBUTE_UNUSED,
>> > > +                      bool unaligned_p, enum machine_mode mode)
>> >
>> > I'm not sure what you mean by unaligned here.  Again, documenting the
>> > arguments might help.
>> Fixed.
>>
>> >
>> > > +{
>> > > +  int num;
>> > > +  unsigned int nelt = GET_MODE_NUNITS (mode);
>> > > +
>> > > +  /* Num of instruction loading constant value.  */
>> >
>> > Use either "Number" or, in this case, simply drop that bit and write:
>> >   /* Instruction loading constant value.  */
>> Fixed.
>>
>> >
>> > > +  num = 1;
>> > > +  /* Num of store instructions.  */
>> >
>> > Likewise.
>> >
>> > > +  num += (length + nelt - 1) / nelt;
>> > > +  /* Num of address adjusting instructions.  */
>> >
>> > Can't we work on the premise that the address adjusting instructions
>> > will
>> be
>> > merged into the stores?  I know you said that they currently do not,
>> > but that's not a problem that this bit of code should have to worry
> about.
>> Fixed.
>>
>> >
>> > > +  if (unaligned_p)
>> > > +    /* For unaligned case, it's one less than the store instructions.
>> */
>> > > +    num += (length + nelt - 1) / nelt - 1;  else if ((length & 3)
>> > > + !=
>> > > + 0)
>> > > +    /* For aligned case, it's one if bytes leftover can only be
> stored
>> > > +       by mis-aligned store instruction.  */
>> > > +    num++;
>> > > +
>> > > +  /* Store the first 16 bytes using vst1:v16qi for the aligned case.
>> > > + */  if (!unaligned_p && mode == V16QImode)
>> > > +    num--;
>> > > +
>> > > +  return (num <= arm_block_set_max_insns ()); }
>> > > +
>> > > +/* Set a block of memory using vectorization instructions for the
>> > > +   unaligned case.  We fill the first LENGTH bytes of the memory
>> > > +   area starting from DSTBASE with byte constant VALUE.  ALIGN is
>> > > +   the alignment requirement of memory.  */
>> >
>> > What's the return value mean?
>> Documented.
>>
>> >
>> > > +static bool
>> > > +arm_block_set_unaligned_vect (rtx dstbase,
>> > > +                       unsigned HOST_WIDE_INT length,
>> > > +                       unsigned HOST_WIDE_INT value,
>> > > +                       unsigned HOST_WIDE_INT align) {
>> > > +  unsigned int i = 0, j = 0, nelt_v16, nelt_v8, nelt_mode;
>> >
>> > Don't mix initialized declarations with unitialized ones on the same
> line.
>> You
>> > don't appear to use either I or J until their first use in the loop
>> control below,
>> > so why initialize them here?
>> Fixed.
>>
>> >
>> > > +  rtx dst, mem;
>> > > +  rtx val_elt, val_vec, reg;
>> > > +  rtx rval[MAX_VECT_LEN];
>> > > +  rtx (*gen_func) (rtx, rtx);
>> > > +  enum machine_mode mode;
>> > > +  unsigned HOST_WIDE_INT v = value;
>> > > +
>> > > +  gcc_assert ((align & 0x3) != 0);
>> > > +  nelt_v8 = GET_MODE_NUNITS (V8QImode);
>> > > +  nelt_v16 = GET_MODE_NUNITS (V16QImode);  if (length >= nelt_v16)
>> > > +    {
>> > > +      mode = V16QImode;
>> > > +      gen_func = gen_movmisalignv16qi;
>> > > +    }
>> > > +  else
>> > > +    {
>> > > +      mode = V8QImode;
>> > > +      gen_func = gen_movmisalignv8qi;
>> > > +    }
>> > > +  nelt_mode = GET_MODE_NUNITS (mode);  gcc_assert (length >=
>> > > + nelt_mode);
>> > > +  /* Skip if it isn't profitable.  */  if
>> > > + (!arm_block_set_vect_profit_p (length, align, true, mode))
>> > > +    return false;
>> > > +
>> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));  mem =
>> > > + adjust_automodify_address (dstbase, mode, dst, 0);
>> > > +
>> > > +  v = sext_hwi (v, BITS_PER_WORD);
>> > > +  val_elt = GEN_INT (v);
>> > > +  for (; j < nelt_mode; j++)
>> > > +    rval[j] = val_elt;
>> >
>> > Is this the first use of J?  If so, initialize it here.
>> >
>> > > +
>> > > +  reg = gen_reg_rtx (mode);
>> > > +  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode,
>> > > + rval));
>> > > +  /* Emit instruction loading the constant value.  */
>> > > + emit_move_insn (reg, val_vec);
>> > > +
>> > > +  /* Handle nelt_mode bytes in a vector.  */  for (; (i + nelt_mode
>> > > + <= length); i += nelt_mode)
>> >
>> > Similarly for I.
>> >
>> > > +    {
>> > > +      emit_insn ((*gen_func) (mem, reg));
>> > > +      if (i + 2 * nelt_mode <= length)
>> > > + emit_insn (gen_add2_insn (dst, GEN_INT (nelt_mode)));
>> > > +    }
>> > > +
>> > > +  if (i + nelt_v8 <= length)
>> > > +    gcc_assert (mode == V16QImode);
>> >
>> > Why not drop the if and write:
>> >
>> >      gcc_assert ((i + nelt_v8) > length || mode == V16QImode);
>> Fixed.
>>
>> >
>> > > +
>> > > +  /* Handle (8, 16) bytes leftover.  */  if (i + nelt_v8 < length)
>> >
>> > Your assertion above checked <=, but here you use <.  Is that correct?
>> Yes, it is.  For the "==" case, it means we have nelt_v8 bytes leftover,
>> which will be handled by the last branch of the if statement.
>>
>> >
>> > > +    {
>> > > +      emit_insn (gen_add2_insn (dst, GEN_INT (length - i)));
>> > > +      /* We are shifting bytes back, set the alignment accordingly.
> */
>> > > +      if ((length & 1) != 0 && align >= 2)
>> > > + set_mem_align (mem, BITS_PER_UNIT);
>> > > +
>> > > +      emit_insn (gen_movmisalignv16qi (mem, reg));
>> > > +    }
>> > > +  /* Handle (0, 8] bytes leftover.  */
>> > > +  else if (i < length && i + nelt_v8 >= length)
>> > > +    {
>> > > +      if (mode == V16QImode)
>> > > + {
>> > > +   reg = gen_lowpart (V8QImode, reg);
>> > > +   mem = adjust_automodify_address (dstbase, V8QImode, dst, 0);
>> > > + }
>> > > +      emit_insn (gen_add2_insn (dst, GEN_INT ((length - i)
>> > > +                                       + (nelt_mode - nelt_v8))));
>> > > +      /* We are shifting bytes back, set the alignment accordingly.
> */
>> > > +      if ((length & 1) != 0 && align >= 2)
>> > > + set_mem_align (mem, BITS_PER_UNIT);
>> > > +
>> > > +      emit_insn (gen_movmisalignv8qi (mem, reg));
>> > > +    }
>> > > +
>> > > +  return true;
>> > > +}
>> > > +
>> > > +/* Set a block of memory using vectorization instructions for the
>> > > +   aligned case.  We fill the first LENGTH bytes of the memory area
>> > > +   starting from DSTBASE with byte constant VALUE.  ALIGN is the
>> > > +   alignment requirement of memory.  */
>> >
>> > See all the comments above for the unaligend case.
>> Fixed accordingly.
>>
>> >
>> > > +static bool
>> > > +arm_block_set_aligned_vect (rtx dstbase,
>> > > +                     unsigned HOST_WIDE_INT length,
>> > > +                     unsigned HOST_WIDE_INT value,
>> > > +                     unsigned HOST_WIDE_INT align) {
>> > > +  unsigned int i = 0, j = 0, nelt_v8, nelt_v16, nelt_mode;
>> > > +  rtx dst, addr, mem;
>> > > +  rtx val_elt, val_vec, reg;
>> > > +  rtx rval[MAX_VECT_LEN];
>> > > +  enum machine_mode mode;
>> > > +  unsigned HOST_WIDE_INT v = value;
>> > > +
>> > > +  gcc_assert ((align & 0x3) == 0);
>> > > +  nelt_v8 = GET_MODE_NUNITS (V8QImode);
>> > > +  nelt_v16 = GET_MODE_NUNITS (V16QImode);  if (length >= nelt_v16
>> > > + && unaligned_access && !BYTES_BIG_ENDIAN)
>> > > +    mode = V16QImode;
>> > > +  else
>> > > +    mode = V8QImode;
>> > > +
>> > > +  nelt_mode = GET_MODE_NUNITS (mode);  gcc_assert (length >=
>> > > + nelt_mode);
>> > > +  /* Skip if it isn't profitable.  */  if
>> > > + (!arm_block_set_vect_profit_p (length, align, false, mode))
>> > > +    return false;
>> > > +
>> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
>> > > +
>> > > +  v = sext_hwi (v, BITS_PER_WORD);
>> > > +  val_elt = GEN_INT (v);
>> > > +  for (; j < nelt_mode; j++)
>> > > +    rval[j] = val_elt;
>> > > +
>> > > +  reg = gen_reg_rtx (mode);
>> > > +  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode,
>> > > + rval));
>> > > +  /* Emit instruction loading the constant value.  */
>> > > + emit_move_insn (reg, val_vec);
>> > > +
>> > > +  /* Handle first 16 bytes specially using vst1:v16qi instruction.
>> > > +*/
>> > > +  if (mode == V16QImode)
>> > > +    {
>> > > +      mem = adjust_automodify_address (dstbase, mode, dst, 0);
>> > > +      emit_insn (gen_movmisalignv16qi (mem, reg));
>> > > +      i += nelt_mode;
>> > > +      /* Handle (8, 16) bytes leftover using vst1:v16qi again.  */
>> > > +      if (i + nelt_v8 < length && i + nelt_v16 > length)
>> > > + {
>> > > +   emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
>> > > +   mem = adjust_automodify_address (dstbase, mode, dst, 0);
>> > > +   /* We are shifting bytes back, set the alignment accordingly.  */
>> > > +   if ((length & 0x3) == 0)
>> > > +     set_mem_align (mem, BITS_PER_UNIT * 4);
>> > > +   else if ((length & 0x1) == 0)
>> > > +     set_mem_align (mem, BITS_PER_UNIT * 2);
>> > > +   else
>> > > +     set_mem_align (mem, BITS_PER_UNIT);
>> > > +
>> > > +   emit_insn (gen_movmisalignv16qi (mem, reg));
>> > > +   return true;
>> > > + }
>> > > +      /* Fall through for bytes leftover.  */
>> > > +      mode = V8QImode;
>> > > +      nelt_mode = GET_MODE_NUNITS (mode);
>> > > +      reg = gen_lowpart (V8QImode, reg);
>> > > +    }
>> > > +
>> > > +  /* Handle 8 bytes in a vector.  */  for (; (i + nelt_mode <=
>> > > + length); i += nelt_mode)
>> > > +    {
>> > > +      addr = plus_constant (Pmode, dst, i);
>> > > +      mem = adjust_automodify_address (dstbase, mode, addr, i);
>> > > +      emit_move_insn (mem, reg);
>> > > +    }
>> > > +
>> > > +  /* Handle single word leftover by shifting 4 bytes back.  We can
>> > > +     use aligned access for this case.  */
>> > > +  if (i + UNITS_PER_WORD == length)
>> > > +    {
>> > > +      addr = plus_constant (Pmode, dst, i - UNITS_PER_WORD);
>> > > +      mem = adjust_automodify_address (dstbase, mode,
>> > > +                                addr, i - UNITS_PER_WORD);
>> > > +      /* We are shifting 4 bytes back, set the alignment accordingly.
>> */
>> > > +      if (align > UNITS_PER_WORD)
>> > > + set_mem_align (mem, BITS_PER_UNIT * UNITS_PER_WORD);
>> > > +
>> > > +      emit_move_insn (mem, reg);
>> > > +    }
>> > > +  /* Handle (0, 4), (4, 8) bytes leftover by shifting bytes back.
>> > > +     We have to use unaligned access for this case.  */
>> > > +  else if (i < length)
>> > > +    {
>> > > +      emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
>> > > +      mem = adjust_automodify_address (dstbase, mode, dst, 0);
>> > > +      /* We are shifting bytes back, set the alignment accordingly.
> */
>> > > +      if ((length & 1) == 0)
>> > > + set_mem_align (mem, BITS_PER_UNIT * 2);
>> > > +      else
>> > > + set_mem_align (mem, BITS_PER_UNIT);
>> > > +
>> > > +      emit_insn (gen_movmisalignv8qi (mem, reg));
>> > > +    }
>> > > +
>> > > +  return true;
>> > > +}
>> > > +
>> > > +/* Set a block of memory using plain strh/strb instructions, only
>> > > +   using instructions allowed by ALIGN on processor.  We fill the
>> > > +   first LENGTH bytes of the memory area starting from DSTBASE
>> > > +   with byte constant VALUE.  ALIGN is the alignment requirement
>> > > +   of memory.  */
>> > > +static bool
>> > > +arm_block_set_unaligned_straight (rtx dstbase,
>> > > +                           unsigned HOST_WIDE_INT length,
>> > > +                           unsigned HOST_WIDE_INT value,
>> > > +                           unsigned HOST_WIDE_INT align) {
>> > > +  unsigned int i;
>> > > +  rtx dst, addr, mem;
>> > > +  rtx val_exp, val_reg, reg;
>> > > +  enum machine_mode mode;
>> > > +  HOST_WIDE_INT v = value;
>> > > +
>> > > +  gcc_assert (align == 1 || align == 2);
>> > > +
>> > > +  if (align == 2)
>> > > +    v |= (value << BITS_PER_UNIT);
>> > > +
>> > > +  v = sext_hwi (v, BITS_PER_WORD);
>> > > +  val_exp = GEN_INT (v);
>> > > +  /* Skip if it isn't profitable.  */
>> > > +  if (!arm_block_set_straight_profit_p (val_exp, length,
>> > > +                                 align, true, false))
>> > > +    return false;
>> > > +
>> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));  mode = (align == 2 ?
>> > > + HImode : QImode);  val_reg = force_reg (SImode, val_exp);  reg =
>> > > + gen_lowpart (mode, val_reg);
>> > > +
>> > > +  for (i = 0; (i + GET_MODE_SIZE (mode) <= length); i +=
>> > > + GET_MODE_SIZE
>> > (mode))
>> > > +    {
>> > > +      addr = plus_constant (Pmode, dst, i);
>> > > +      mem = adjust_automodify_address (dstbase, mode, addr, i);
>> > > +      emit_move_insn (mem, reg);
>> > > +    }
>> > > +
>> > > +  /* Handle single byte leftover.  */  if (i + 1 == length)
>> > > +    {
>> > > +      reg = gen_lowpart (QImode, val_reg);
>> > > +      addr = plus_constant (Pmode, dst, i);
>> > > +      mem = adjust_automodify_address (dstbase, QImode, addr, i);
>> > > +      emit_move_insn (mem, reg);
>> > > +      i++;
>> > > +    }
>> > > +
>> > > +  gcc_assert (i == length);
>> > > +  return true;
>> > > +}
>> > > +
>> > > +/* Set a block of memory using plain strd/str/strh/strb instructions,
>> > > +   to permit unaligned copies on processors which support unaligned
>> > > +   semantics for those instructions.  We fill the first LENGTH bytes
>> > > +   of the memory area starting from DSTBASE with byte constant VALUE.
>> > > +   ALIGN is the alignment requirement of memory.  */ static bool
>> > > +arm_block_set_aligned_straight (rtx dstbase,
>> > > +                         unsigned HOST_WIDE_INT length,
>> > > +                         unsigned HOST_WIDE_INT value,
>> > > +                         unsigned HOST_WIDE_INT align)
>> > > +{
>> > > +  unsigned int i = 0;
>> > > +  rtx dst, addr, mem;
>> > > +  rtx val_exp, val_reg, reg;
>> > > +  unsigned HOST_WIDE_INT v;
>> > > +  bool use_strd_p;
>> > > +
>> > > +  use_strd_p = (length >= 2 * UNITS_PER_WORD && (align & 3) == 0
>> > > +         && TARGET_LDRD && current_tune->prefer_ldrd_strd);
>> > > +
>> > > +  v = (value | (value << 8) | (value << 16) | (value << 24));  if
>> > > + (length < UNITS_PER_WORD)
>> > > +    v &= (0xFFFFFFFF >> (UNITS_PER_WORD - length) * BITS_PER_UNIT);
>> > > +
>> > > +  if (use_strd_p)
>> > > +    v |= (v << BITS_PER_WORD);
>> > > +  else
>> > > +    v = sext_hwi (v, BITS_PER_WORD);
>> > > +
>> > > +  val_exp = GEN_INT (v);
>> > > +  /* Skip if it isn't profitable.  */
>> > > +  if (!arm_block_set_straight_profit_p (val_exp, length,
>> > > +                                 align, false, use_strd_p))
>> > > +    {
>> > > +      /* Try without strd.  */
>> > > +      v = (v >> BITS_PER_WORD);
>> > > +      v = sext_hwi (v, BITS_PER_WORD);
>> > > +      val_exp = GEN_INT (v);
>> > > +      use_strd_p = false;
>> > > +      if (!arm_block_set_straight_profit_p (val_exp, length,
>> > > +                                     align, false, use_strd_p))
>> > > + return false;
>> > > +    }
>> > > +
>> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
>> > > +  /* Handle double words using strd if possible.  */
>> > > +  if (use_strd_p)
>> > > +    {
>> > > +      val_reg = force_reg (DImode, val_exp);
>> > > +      reg = val_reg;
>> > > +      for (; (i + 8 <= length); i += 8)
>> > > + {
>> > > +   addr = plus_constant (Pmode, dst, i);
>> > > +   mem = adjust_automodify_address (dstbase, DImode, addr, i);
>> > > +   emit_move_insn (mem, reg);
>> > > + }
>> > > +    }
>> > > +  else
>> > > +    val_reg = force_reg (SImode, val_exp);
>> > > +
>> > > +  /* Handle words.  */
>> > > +  reg = (use_strd_p ? gen_lowpart (SImode, val_reg) : val_reg);
>> > > +  for (; (i + 4 <= length); i += 4)
>> > > +    {
>> > > +      addr = plus_constant (Pmode, dst, i);
>> > > +      mem = adjust_automodify_address (dstbase, SImode, addr, i);
>> > > +      if ((align & 3) == 0)
>> > > + emit_move_insn (mem, reg);
>> > > +      else
>> > > + emit_insn (gen_unaligned_storesi (mem, reg));
>> > > +    }
>> > > +
>> > > +  /* Merge last pair of STRH and STRB into a STR if possible.  */
>> > > + if (unaligned_access && i > 0 && (i + 3) == length)
>> > > +    {
>> > > +      addr = plus_constant (Pmode, dst, i - 1);
>> > > +      mem = adjust_automodify_address (dstbase, SImode, addr, i - 1);
>> > > +      /* We are shifting one byte back, set the alignment
> accordingly.
>> */
>> > > +      if ((align & 1) == 0)
>> > > + set_mem_align (mem, BITS_PER_UNIT);
>> > > +
>> > > +      /* Most likely this is an unaligned access, and we can't tell
> at
>> > > +  compilation time.  */
>> > > +      emit_insn (gen_unaligned_storesi (mem, reg));
>> > > +      return true;
>> > > +    }
>> > > +
>> > > +  /* Handle half word leftover.  */
>> > > +  if (i + 2 <= length)
>> > > +    {
>> > > +      reg = gen_lowpart (HImode, val_reg);
>> > > +      addr = plus_constant (Pmode, dst, i);
>> > > +      mem = adjust_automodify_address (dstbase, HImode, addr, i);
>> > > +      if ((align & 1) == 0)
>> > > + emit_move_insn (mem, reg);
>> > > +      else
>> > > + emit_insn (gen_unaligned_storehi (mem, reg));
>> > > +
>> > > +      i += 2;
>> > > +    }
>> > > +
>> > > +  /* Handle single byte leftover.  */  if (i + 1 == length)
>> > > +    {
>> > > +      reg = gen_lowpart (QImode, val_reg);
>> > > +      addr = plus_constant (Pmode, dst, i);
>> > > +      mem = adjust_automodify_address (dstbase, QImode, addr, i);
>> > > +      emit_move_insn (mem, reg);
>> > > +    }
>> > > +
>> > > +  return true;
>> > > +}
>> > > +
>> > > +/* Set a block of memory using vectorization instructions for both
>> > > +   aligned and unaligned cases.  We fill the first LENGTH bytes of
>> > > +   the memory area starting from DSTBASE with byte constant VALUE.
>> > > +   ALIGN is the alignment requirement of memory.  */ static bool
>> > > +arm_block_set_vect (rtx dstbase,
>> > > +             unsigned HOST_WIDE_INT length,
>> > > +             unsigned HOST_WIDE_INT value,
>> > > +             unsigned HOST_WIDE_INT align) {
>> > > +  /* Check whether we need to use unaligned store instruction.  */
>> > > +  if (((align & 3) != 0 || (length & 3) != 0)
>> > > +      /* Check whether unaligned store instruction is available.  */
>> > > +      && (!unaligned_access || BYTES_BIG_ENDIAN))
>> > > +    return false;
>> >
>> > Huh!  vst1.8 can work for unaligned accesses even when hw alignment
>> > checking is strict.
>> Emm, all movmisalign patterns are guarded by "!BYTES_BIG_ENDIAN &&
>> unaligned_access", so vst1.8 instructions can't be recognized this way for
>> now.
>> I agree that it's too strict, but that's another problem, I think.
>>
>> >
>> > > +
>> > > +  if ((align & 3) == 0)
>> > > +    return arm_block_set_aligned_vect (dstbase, length, value,
>> > > +align);
>> > > +  else
>> > > +    return arm_block_set_unaligned_vect (dstbase, length, value,
>> > > +align); }
>> > > +
>> > > +/* Expand string store operation.  Firstly we try to do that by using
>> > > +   vectorization instructions, then try with ARM unaligned access and
>> > > +   double-word store if profitable.  OPERANDS[0] is the destination,
>> > > +   OPERANDS[1] is the number of bytes, operands[2] is the value to
>> > > +   initialize the memory, OPERANDS[3] is the known alignment of the
>> > > +   destination.  */
>> > > +bool
>> > > +arm_gen_setmem (rtx *operands)
>> > > +{
>> > > +  rtx dstbase = operands[0];
>> > > +  unsigned HOST_WIDE_INT length;
>> > > +  unsigned HOST_WIDE_INT value;
>> > > +  unsigned HOST_WIDE_INT align;
>> > > +
>> > > +  if (!CONST_INT_P (operands[2]) || !CONST_INT_P (operands[1]))
>> > > +    return false;
>> > > +
>> > > +  length = UINTVAL (operands[1]);
>> > > +  if (length > 64)
>> > > +    return false;
>> > > +
>> > > +  value = (UINTVAL (operands[2]) & 0xFF);  align = UINTVAL
>> > > + (operands[3]);  if (TARGET_NEON && length >= 8
>> > > +      && current_tune->string_ops_prefer_neon
>> > > +      && arm_block_set_vect (dstbase, length, value, align))
>> > > +    return true;
>> > > +
>> > > +  if (!unaligned_access && (align & 3) != 0)
>> > > +    return arm_block_set_unaligned_straight (dstbase, length,
>> > > + value, align);
>> > > +
>> > > +  return arm_block_set_aligned_straight (dstbase, length, value,
>> > > +align); }
>> > > +
>> > >  /* Implement the TARGET_ASAN_SHADOW_OFFSET hook.  */
>> > >
>> > >  static unsigned HOST_WIDE_INT
>> > > Index: gcc/config/arm/arm-protos.h
>> > >
>> >
>> ==========================================================
>> > =========
>> > > --- gcc/config/arm/arm-protos.h   (revision 209852)
>> > > +++ gcc/config/arm/arm-protos.h   (working copy)
>> > > @@ -277,6 +277,8 @@ struct tune_params
>> > >    /* Prefer 32-bit encoding instead of 16-bit encoding where subset
>> > > of
>> flags
>> > >       would be set.  */
>> > >    bool disparage_partial_flag_setting_t16_encodings;
>> > > +  /* Prefer to inline string operations like memset by using Neon.
>> > > + */  bool string_ops_prefer_neon;
>> > >  };
>> > >
>> > >  extern const struct tune_params *current_tune; @@ -289,6 +291,7 @@
>> > > extern void arm_emit_coreregs_64bit_shift (enum rt  extern bool
>> > > arm_validize_comparison (rtx *, rtx *, rtx *);  #endif /* RTX_CODE
>> > > */
>> > >
>> > > +extern bool arm_gen_setmem (rtx *);
>> > >  extern void arm_expand_vec_perm (rtx target, rtx op0, rtx op1, rtx
>> > > sel);  extern bool arm_expand_vec_perm_const (rtx target, rtx op0,
>> > > rtx op1, rtx sel);
>> > >
>> > > Index: gcc/config/arm/arm.md
>> > >
>> >
>> ==========================================================
>> > =========
>> > > --- gcc/config/arm/arm.md (revision 209852)
>> > > +++ gcc/config/arm/arm.md (working copy)
>> > > @@ -7555,6 +7555,20 @@
>> > >  })
>> > >
>> > >
>> > > +(define_expand "setmemsi"
>> > > +  [(match_operand:BLK 0 "general_operand" "")
>> > > +   (match_operand:SI 1 "const_int_operand" "")
>> > > +   (match_operand:SI 2 "const_int_operand" "")
>> > > +   (match_operand:SI 3 "const_int_operand" "")]
>> > > +  "TARGET_32BIT"
>> > > +{
>> > > +  if (arm_gen_setmem (operands))
>> > > +    DONE;
>> > > +
>> > > +  FAIL;
>> > > +})
>> > > +
>> > > +
>> > >  ;; Move a block of memory if it is word aligned and MORE than 2
>> > > words
>> > long.
>> > >  ;; We could let this apply for blocks of less than this, but it
>> > > clobbers so  ;; many registers that there is then probably a better
> way.
>> > > Index: gcc/testsuite/gcc.target/arm/memset-inline-6.c
>> > >
>> >
>> ==========================================================
>> > =========
>> > > --- gcc/testsuite/gcc.target/arm/memset-inline-6.c        (revision 0)
>> > > +++ gcc/testsuite/gcc.target/arm/memset-inline-6.c        (revision 0)
>> >
>> > Have you tested these when the compiler was configured with "--with-
>> > cpu=cortex-a9"?
>> Here is the tricky part.
>> For compiler configured with "--with-tune=cortex-a9", the neon related
>> cases
>> (4/5/6/8/9) would fail because we have no way to determine that we are
>> compiling with cortex-a9 tune here.
>> For compiler configured with "--with-cpu=cortex-a9", the test cases would
>> pass but I think this is a mistake.  It reveals an issue that GCC won't
> pass "-
>> mcpu=cortex-a9" to cc1, resulting in cortex-a8 tune is selected.  It just
> makes
>> no sense.
>> With these issues, I didn't change the tests for now.
> Precisely, I configured gcc with options "--with-arch=armv7-a
> --with-cpu|--with-tune=cortex-a9".
> I read gcc documents and realized that "-mcpu" is ignored when "-march" is
> specified.  I don't know why gcc acts in this manner, but it leads to
> inconsistent configuration/command line behavior.
> If we configure GCC with "--with-arch=armv7-a --with-cpu=cortex-a9", then
> only "-march=armv7-a" is passed to cc1.
> If we compile with "-march=armv7-a -mcpu=cortex-a9", then gcc works fine and
> passes "-march=armv7-a -mcpu=cortex-a9" to cc1.
>
> Even more weird cc1 warns that "switch -mcpu=cortex-m4 conflicts with
> -march=armv7-m switch".
>
> Thanks,
> bin
>
>
>
>



-- 
Best Regards.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH ARM] Improve ARM memset inlining
  2014-05-12  3:17       ` Bin.Cheng
@ 2014-05-19  6:40         ` Bin.Cheng
  2014-05-28  8:53           ` bin.cheng
  0 siblings, 1 reply; 14+ messages in thread
From: Bin.Cheng @ 2014-05-19  6:40 UTC (permalink / raw)
  To: bin.cheng; +Cc: Richard Earnshaw, gcc-patches List

Ping^2

Thanks,
bin

On Mon, May 12, 2014 at 11:17 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> Ping.
>
> Thanks,
> bin
>
> On Tue, May 6, 2014 at 12:59 PM, bin.cheng <bin.cheng@arm.com> wrote:
>>
>>

>> Precisely, I configured gcc with options "--with-arch=armv7-a
>> --with-cpu|--with-tune=cortex-a9".
>> I read gcc documents and realized that "-mcpu" is ignored when "-march" is
>> specified.  I don't know why gcc acts in this manner, but it leads to
>> inconsistent configuration/command line behavior.
>> If we configure GCC with "--with-arch=armv7-a --with-cpu=cortex-a9", then
>> only "-march=armv7-a" is passed to cc1.
>> If we compile with "-march=armv7-a -mcpu=cortex-a9", then gcc works fine and
>> passes "-march=armv7-a -mcpu=cortex-a9" to cc1.
>>
>> Even more weird cc1 warns that "switch -mcpu=cortex-m4 conflicts with
>> -march=armv7-m switch".
>>
>> Thanks,
>> bin
>>
>>
>>
>>
>
>
>
> --
> Best Regards.



-- 
Best Regards.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH ARM] Improve ARM memset inlining
  2014-05-19  6:40         ` Bin.Cheng
@ 2014-05-28  8:53           ` bin.cheng
  2014-06-04  9:11             ` bin.cheng
  0 siblings, 1 reply; 14+ messages in thread
From: bin.cheng @ 2014-05-28  8:53 UTC (permalink / raw)
  To: Richard Earnshaw; +Cc: gcc-patches List

Ping^3

> -----Original Message-----
> From: Bin.Cheng [mailto:amker.cheng@gmail.com]
> Sent: Monday, May 19, 2014 2:40 PM
> To: Bin Cheng
> Cc: Richard Earnshaw; gcc-patches List
> Subject: Re: [PATCH ARM] Improve ARM memset inlining
> 
> Ping^2
> 
> Thanks,
> bin
> 
> On Mon, May 12, 2014 at 11:17 AM, Bin.Cheng <amker.cheng@gmail.com>
> wrote:
> > Ping.
> >
> > Thanks,
> > bin
> >
> > On Tue, May 6, 2014 at 12:59 PM, bin.cheng <bin.cheng@arm.com> wrote:
> >>
> >>
> 
> >> Precisely, I configured gcc with options "--with-arch=armv7-a
> >> --with-cpu|--with-tune=cortex-a9".
> >> I read gcc documents and realized that "-mcpu" is ignored when
> >> "-march" is specified.  I don't know why gcc acts in this manner, but
> >> it leads to inconsistent configuration/command line behavior.
> >> If we configure GCC with "--with-arch=armv7-a --with-cpu=cortex-a9",
> >> then only "-march=armv7-a" is passed to cc1.
> >> If we compile with "-march=armv7-a -mcpu=cortex-a9", then gcc works
> >> fine and passes "-march=armv7-a -mcpu=cortex-a9" to cc1.
> >>
> >> Even more weird cc1 warns that "switch -mcpu=cortex-m4 conflicts with
> >> -march=armv7-m switch".
> >>
> >> Thanks,
> >> bin
> >>
> >>
> >>
> >>
> >
> >
> >
> > --
> > Best Regards.
> 
> 
> 
> --
> Best Regards.




^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH ARM] Improve ARM memset inlining
  2014-05-28  8:53           ` bin.cheng
@ 2014-06-04  9:11             ` bin.cheng
  0 siblings, 0 replies; 14+ messages in thread
From: bin.cheng @ 2014-06-04  9:11 UTC (permalink / raw)
  To: Richard Earnshaw, Ramana Radhakrishnan; +Cc: gcc-patches List

Ping^4.

The original thread is
https://gcc.gnu.org/ml/gcc-patches/2014-05/msg00182.html; there is also some
further information at https://gcc.gnu.org/ml/gcc-patches/2014-05/msg00182.html
in the same thread.

Thanks,
bin

> -----Original Message-----
> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
> owner@gcc.gnu.org] On Behalf Of bin.cheng
> Sent: Wednesday, May 28, 2014 4:53 PM
> To: Richard Earnshaw
> Cc: gcc-patches List
> Subject: RE: [PATCH ARM] Improve ARM memset inlining
> 
> Ping^3
> 
> > -----Original Message-----
> > From: Bin.Cheng [mailto:amker.cheng@gmail.com]
> > Sent: Monday, May 19, 2014 2:40 PM
> > To: Bin Cheng
> > Cc: Richard Earnshaw; gcc-patches List
> > Subject: Re: [PATCH ARM] Improve ARM memset inlining
> >
> > Ping^2
> >
> > Thanks,
> > bin
> >
> > On Mon, May 12, 2014 at 11:17 AM, Bin.Cheng <amker.cheng@gmail.com>
> > wrote:
> > > Ping.
> > >
> > > Thanks,
> > > bin
> > >
> > > On Tue, May 6, 2014 at 12:59 PM, bin.cheng <bin.cheng@arm.com>
> wrote:
> > >>
> > >>
> >
> > >> Precisely, I configured gcc with options "--with-arch=armv7-a
> > >> --with-cpu|--with-tune=cortex-a9".
> > >> I read gcc documents and realized that "-mcpu" is ignored when
> > >> "-march" is specified.  I don't know why gcc acts in this manner,
> > >> but it leads to inconsistent configuration/command line behavior.
> > >> If we configure GCC with "--with-arch=armv7-a
> > >> --with-cpu=cortex-a9", then only "-march=armv7-a" is passed to cc1.
> > >> If we compile with "-march=armv7-a -mcpu=cortex-a9", then gcc works
> > >> fine and passes "-march=armv7-a -mcpu=cortex-a9" to cc1.
> > >>
> > >> Even more weird cc1 warns that "switch -mcpu=cortex-m4 conflicts
> > >> with -march=armv7-m switch".
> > >>
> > >> Thanks,
> > >> bin
> > >>
> > >>
> > >>
> > >>
> > >
> > >
> > >
> > > --
> > > Best Regards.
> >
> >
> >
> > --
> > Best Regards.
> 
> 
> 
> 




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH ARM] Improve ARM memset inlining
  2014-05-06  5:00     ` bin.cheng
  2014-05-12  3:17       ` Bin.Cheng
@ 2014-06-27  8:21       ` Ramana Radhakrishnan
  2014-07-04 12:18         ` Bin Cheng
  1 sibling, 1 reply; 14+ messages in thread
From: Ramana Radhakrishnan @ 2014-06-27  8:21 UTC (permalink / raw)
  To: bin.cheng; +Cc: Richard Earnshaw, gcc-patches

On Tue, May 6, 2014 at 5:59 AM, bin.cheng <bin.cheng@arm.com> wrote:
>
>
>> -----Original Message-----
>> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
>> owner@gcc.gnu.org] On Behalf Of bin.cheng
>> Sent: Monday, May 05, 2014 3:21 PM
>> To: Richard Earnshaw
>> Cc: gcc-patches@gcc.gnu.org
>> Subject: RE: [PATCH ARM] Improve ARM memset inlining
>>
>> Hi Richard,  Thanks for reviewing.  I embedded answers to your comments,
>> also updated the patch.
>>
>> > -----Original Message-----
>> > From: Richard Earnshaw
>> > Sent: Friday, May 02, 2014 10:00 PM
>> > To: Bin Cheng
>> > Cc: gcc-patches@gcc.gnu.org
>> > Subject: Re: [PATCH ARM] Improve ARM memset inlining
>> >
>> > On 30/04/14 03:52, bin.cheng wrote:
>> > > Hi,
>> > > This patch expands small memset calls into direct memory set
>> > > instructions by introducing "setmemsi" pattern.  For processors
>> > > without NEON support, it expands memset using general store
>> > > instruction.  For example, strd for 4-bytes aligned addresses.  For
>> > > processors with NEON support, it expands memset using neon
>> > > instructions like vstr and miscellaneous vst1.* instructions for
>> > > both
>> aligned
>> > and unaligned cases.
>> > >
>> > > This patch depends on
>> > > http://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html otherwise
>> > > vst1.64 will be generated for 32-bit aligned memory unit.
>> > >
>> > > There is also one leftover work of this patch:  Since vst1.*
>> > > instructions only support post-increment addressing mode, the
>> > > inlined memset for unaligned neon cases should be like:
>> > >   vmov.i32   q8, #...
>> > >   vst1.8     {q8}, [r3]!
>> > >   vst1.8     {q8}, [r3]!
>> > >   vst1.8     {q8}, [r3]!
>> > >   vst1.8     {q8}, [r3]
>> >
>> > Other than for zero, I'd expect the vmov to be vmov.i8 to move an
>> arbitrary
>> I just used vmov.i32 as an example.  The element size is actually
> calculated by
>> function neon_valid_immediate which works as expected I think.
>>
>> > byte value into all lanes in a vector.  After that, if the alignment
>> > is
>> known to
>> > be more than 8-bit, I'd expect the vst1 instructions (with the
>> > exception
>> of the
>> > last store if the length is not a multiple of the alignment) to use
>> >
>> >     vst1.<align> {reg}, [addr-reg :<align>]!
>> >
>> > Hence, for 16-bit aligned data, we want
>> >
>> >     vst1.16 {q8}, [r3:16]!
>> Did I miss something important?  It seems to me the explicit alignment
> notes
>> supported are 64/128/256.  So what do you mean by 16 bits alignment here?
>>
>> >
>> > > But for now, gcc can't do this and below code is generated:
>> > >   vmov.i32   q8, #...
>> > >   vst1.8     {q8}, [r3]
>> > >   add        r2,   r3,  #16
>> > >   add        r3,   r2,  #16
>> > >   vst1.8     {q8}, [r2]
>> > >   vst1.8     {q8}, [r3]
>> > >   add        r2,   r3,  #16
>> > >   vst1.8     {q8}, [r2]
>> > >
>> > > I investigated this issue.  The root cause lies in rtx cost returned
>> > > by ARM backend.  Anyway, I think this is another issue and should be
>> > > fixed in separated patch.

Ok, it looks like Charles B from Linaro has run into the same thing and
has some fixes to suggest in the costs.

>> > >
>> > > Bootstrap and reg-test on cortex-a15, with or without neon support.
>> > > Is it OK?
>> > >
>> >
>> > Some more comments inline.
>> >
>> > > Thanks,
>> > > bin
>> > >
>> > >
>> > > 2014-04-29  Bin Cheng  <bin.cheng@arm.com>
>> > >
>> > >   PR target/55701
>> > >   * config/arm/arm.md (setmem): New pattern.
>> > >   * config/arm/arm-protos.h (struct tune_params): New field.
>> > >   (arm_gen_setmem): New prototype.
>> > >   * config/arm/arm.c (arm_slowmul_tune): Initialize new field.
>> > >   (arm_fastmul_tune, arm_strongarm_tune, arm_xscale_tune): Ditto.
>> > >   (arm_9e_tune, arm_v6t2_tune, arm_cortex_tune): Ditto.
>> > >   (arm_cortex_a8_tune, arm_cortex_a7_tune): Ditto.
>> > >   (arm_cortex_a15_tune, arm_cortex_a53_tune): Ditto.
>> > >   (arm_cortex_a57_tune, arm_cortex_a5_tune): Ditto.
>> > >   (arm_cortex_a9_tune, arm_cortex_a12_tune): Ditto.
>> > >   (arm_v7m_tune, arm_v6m_tune, arm_fa726te_tune): Ditto.
>> > >   (arm_const_inline_cost): New function.
>> > >   (arm_block_set_max_insns): New function.
>> > >   (arm_block_set_straight_profit_p): New function.
>> > >   (arm_block_set_vect_profit_p): New function.
>> > >   (arm_block_set_unaligned_vect): New function.
>> > >   (arm_block_set_aligned_vect): New function.
>> > >   (arm_block_set_unaligned_straight): New function.
>> > >   (arm_block_set_aligned_straight): New function.
>> > >   (arm_block_set_vect, arm_gen_setmem): New functions.
>> > >
>> > > gcc/testsuite/ChangeLog
>> > > 2014-04-29  Bin Cheng  <bin.cheng@arm.com>
>> > >
>> > >   PR target/55701
>> > >   * gcc.target/arm/memset-inline-1.c: New test.
>> > >   * gcc.target/arm/memset-inline-2.c: New test.
>> > >   * gcc.target/arm/memset-inline-3.c: New test.
>> > >   * gcc.target/arm/memset-inline-4.c: New test.
>> > >   * gcc.target/arm/memset-inline-5.c: New test.
>> > >   * gcc.target/arm/memset-inline-6.c: New test.
>> > >   * gcc.target/arm/memset-inline-7.c: New test.
>> > >   * gcc.target/arm/memset-inline-8.c: New test.
>> > >   * gcc.target/arm/memset-inline-9.c: New test.
>> > >
>> > >
>> > > j1328-20140429.txt
>> > >
>> > >
>> > > Index: gcc/config/arm/arm.c
>> > >
>> >
>> ==========================================================
>> > =========
>> > > --- gcc/config/arm/arm.c  (revision 209852)
>> > > +++ gcc/config/arm/arm.c  (working copy)
>> > > @@ -1585,10 +1585,11 @@ const struct tune_params arm_slowmul_tune
>> =
>> > >    true,                                          /* Prefer constant
>> > pool.  */
>> > >    arm_default_branch_cost,
>> > >    false,                                 /* Prefer LDRD/STRD.  */
>> > > -  {true, true},                                  /* Prefer non short
>> > circuit.  */
>> > > -  &arm_default_vec_cost,                        /* Vectorizer costs.
>> */
>> > > -  false,                                        /* Prefer Neon for
>> 64-bits bitops.  */
>> > > -  false, false                                  /* Prefer 32-bit
>> encodings.  */
>> > > +  {true, true},                          /* Prefer non short circuit.
>> */
>> > > +  &arm_default_vec_cost,                /* Vectorizer costs.  */
>> > > +  false,                                /* Prefer Neon for 64-bits
>> bitops.  */
>> > > +  false, false,                         /* Prefer 32-bit encodings.
> */
>> > > +  false                                 /* Prefer Neon for stringops.
>> */
>> > >  };
>> > >
>> >
>> > Please make sure that all the white space before the comments is using
>> TAB,
>> > not spaces.  Similarly for the other tables.
>> Fixed.
>>
>> >
>> > > @@ -16788,6 +16806,14 @@ arm_const_double_inline_cost (rtx val)
>> > >                         NULL_RTX, NULL_RTX, 0, 0));  }
>> > >
>> > > +/* Cost of loading a SImode constant.  */ static inline int
>> > > +arm_const_inline_cost (rtx val) {
>> > > +  return arm_gen_constant (SET, SImode, NULL_RTX, INTVAL (val),
>> > > +                           NULL_RTX, NULL_RTX, 0, 0); }
>> > > +
>> >
>> > This could be used more widely if you passed the SET in as a parameter
>> > (there are cases in arm_new_rtx_cost that could use it, for example).
>> > Also, you want to enable sub-targets (only once you can't create new
>> > pseudos is that not safe), so the penultimate argument in the call to
>> > arm_gen_constant should be 1.
>> Fixed.



>>
>> >
>> > >  /* Return true if it is worthwhile to split a 64-bit constant into
> two
>> > >     32-bit operations.  This is the case if optimizing for size, or
>> > >     if we have load delay slots, or if one 32-bit part can be done
>> > > with @@ -31350,6 +31383,504 @@ arm_validize_comparison (rtx
>> > > *comparison, rtx * op
>> > >
>> > >  }
>> > >
>> > > +/* Maximum number of instructions to set block of memory.  */
>> > > +static int arm_block_set_max_insns (void) {
>> > > +  return (optimize_function_for_size_p (cfun) ? 4 : 8); }
>> >
>> > I think the non-size_p alternative should really be a parameter in the
>> per-cpu
>> > costs table.
>> Fixed.
>>
>> >
>> > > +
>> > > +/* Return TRUE if it's profitable to set block of memory for straight
>> > > +   case.  */
>> >
>> > "Straight" is confusing here.  Do you mean non-vectorized?  If so,
>> > then non_vect might be clearer.
>> Fixed.
>>
>> >
>> > The arguments should really be documented (see comment below about
>> > align, for example).
>> Fixed.
>>
>> >
>> > > +static bool
>> > > +arm_block_set_straight_profit_p (rtx val,
>> > > +                          unsigned HOST_WIDE_INT length,
>> > > +                          unsigned HOST_WIDE_INT align,
>> > > +                          bool unaligned_p, bool use_strd_p) {
>> > > +  int num = 0;
>> > > +  /* For leftovers in bytes of 0-7, we can set the memory block using
>> > > +     strb/strh/str with minimum instruction number.  */
>> > > +  int leftover[8] = {0, 1, 1, 2, 1, 2, 2, 3};
>> >
>> > This should be marked const.
>> Fixed.
>>
>> >
>> > > +
>> > > +  if (unaligned_p)
>> > > +    {
>> > > +      num = arm_const_inline_cost (val);
>> > > +      num += length / align + length % align;
>> >
>> > Isn't align in bits here, when you really want it in bytes?
>> All alignments are in bytes starting from pattern "setmem".
>>
>> >
>> > What if align > 4 bytes?
>> Then it's the "!unaligned_p" case and handled by other arms of this if
>> statement.
>>
>> >
>> > > +    }
>> > > +  else if (use_strd_p)
>> > > +    {
>> > > +      num = arm_const_double_inline_cost (val);
>> > > +      num += (length >> 3) + leftover[length & 7];
>> > > +    }
>> > > +  else
>> > > +    {
>> > > +      num = arm_const_inline_cost (val);
>> > > +      num += (length >> 2) + leftover[length & 3];
>> > > +    }
>> > > +
>> > > +  /* We may be able to combine last pair STRH/STRB into a single STR
>> > > +     by shifting one byte back.  */  if (unaligned_access && length
>> > > + > 3 && (length & 3) == 3)
>> > > +    num--;
>> > > +
>> > > +  return (num <= arm_block_set_max_insns ()); }
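As a side note, a minimal standalone sketch of the counting above for the
non-vector path (the names are illustrative, not from the patch, and the cost
of materialising the constant is omitted): the {0, 1, 1, 2, 1, 2, 2, 3} table
is just the number of str/strh/strb stores needed for 0-7 trailing bytes, and
the unaligned case pays one store per ALIGN-sized chunk plus one per remaining
byte.

  int
  straight_store_count (unsigned length, unsigned align, int unaligned_p)
  {
    /* Minimum number of str/strh/strb stores for 0-7 leftover bytes.  */
    static const int leftover[8] = {0, 1, 1, 2, 1, 2, 2, 3};

    if (unaligned_p)
      /* ALIGN is 1 or 2 here: one store per chunk, one per odd byte.  */
      return length / align + length % align;

    /* Word stores plus the tail; the strd variant would instead use
       (length >> 3) + leftover[length & 7].  */
    return (length >> 2) + leftover[length & 3];
  }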
>> > > +
>> > > +/* Return TRUE if it's profitable to set block of memory for vector
>> > > +case.  */ static bool arm_block_set_vect_profit_p (unsigned
>> > > +HOST_WIDE_INT length,
>> > > +                      unsigned HOST_WIDE_INT align
>> > ATTRIBUTE_UNUSED,
>> > > +                      bool unaligned_p, enum machine_mode mode)
>> >
>> > I'm not sure what you mean by unaligned here.  Again, documenting the
>> > arguments might help.
>> Fixed.
>>
>> >
>> > > +{
>> > > +  int num;
>> > > +  unsigned int nelt = GET_MODE_NUNITS (mode);
>> > > +
>> > > +  /* Num of instruction loading constant value.  */
>> >
>> > Use either "Number" or, in this case, simply drop that bit and write:
>> >   /* Instruction loading constant value.  */
>> Fixed.
>>
>> >
>> > > +  num = 1;
>> > > +  /* Num of store instructions.  */
>> >
>> > Likewise.
>> >
>> > > +  num += (length + nelt - 1) / nelt;
>> > > +  /* Num of address adjusting instructions.  */
>> >
>> > Can't we work on the premise that the address adjusting instructions
>> > will
>> be
>> > merged into the stores?  I know you said that they currently do not,
>> > but that's not a problem that this bit of code should have to worry
> about.
>> Fixed.
>>
>> >
>> > > +  if (unaligned_p)
>> > > +    /* For unaligned case, it's one less than the store instructions.
>> */
>> > > +    num += (length + nelt - 1) / nelt - 1;  else if ((length & 3)
>> > > + !=
>> > > + 0)
>> > > +    /* For aligned case, it's one if bytes leftover can only be
> stored
>> > > +       by mis-aligned store instruction.  */
>> > > +    num++;
>> > > +
>> > > +  /* Store the first 16 bytes using vst1:v16qi for the aligned case.
>> > > + */  if (!unaligned_p && mode == V16QImode)
>> > > +    num--;
>> > > +
>> > > +  return (num <= arm_block_set_max_insns ()); }
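Again purely as an illustration (not from the patch), the vector-case budget
check above reduces to the following once the address updates are assumed to
be folded into post-increment vst1 stores, as Richard suggests; the special
cases for the leftover store and for the first v16qi store are left out for
brevity.

  int
  vect_set_profitable_p (unsigned length, unsigned nelt, int max_insns)
  {
    int num = 1;                         /* vmov of the constant vector.  */
    num += (length + nelt - 1) / nelt;   /* One vst1 per NELT-byte chunk.  */
    /* MAX_INSNS was 4 at -Os and 8 otherwise in this version.  */
    return num <= max_insns;
  }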
>> > > +
>> > > +/* Set a block of memory using vectorization instructions for the
>> > > +   unaligned case.  We fill the first LENGTH bytes of the memory
>> > > +   area starting from DSTBASE with byte constant VALUE.  ALIGN is
>> > > +   the alignment requirement of memory.  */
>> >
>> > What's the return value mean?
>> Documented.
>>
>> >
>> > > +static bool
>> > > +arm_block_set_unaligned_vect (rtx dstbase,
>> > > +                       unsigned HOST_WIDE_INT length,
>> > > +                       unsigned HOST_WIDE_INT value,
>> > > +                       unsigned HOST_WIDE_INT align) {
>> > > +  unsigned int i = 0, j = 0, nelt_v16, nelt_v8, nelt_mode;
>> >
>> > Don't mix initialized declarations with unitialized ones on the same
> line.
>> You
>> > don't appear to use either I or J until their first use in the loop
>> control below,
>> > so why initialize them here?
>> Fixed.
>>
>> >
>> > > +  rtx dst, mem;
>> > > +  rtx val_elt, val_vec, reg;
>> > > +  rtx rval[MAX_VECT_LEN];
>> > > +  rtx (*gen_func) (rtx, rtx);
>> > > +  enum machine_mode mode;
>> > > +  unsigned HOST_WIDE_INT v = value;
>> > > +
>> > > +  gcc_assert ((align & 0x3) != 0);
>> > > +  nelt_v8 = GET_MODE_NUNITS (V8QImode);
>> > > +  nelt_v16 = GET_MODE_NUNITS (V16QImode);  if (length >= nelt_v16)
>> > > +    {
>> > > +      mode = V16QImode;
>> > > +      gen_func = gen_movmisalignv16qi;
>> > > +    }
>> > > +  else
>> > > +    {
>> > > +      mode = V8QImode;
>> > > +      gen_func = gen_movmisalignv8qi;
>> > > +    }
>> > > +  nelt_mode = GET_MODE_NUNITS (mode);  gcc_assert (length >=
>> > > + nelt_mode);
>> > > +  /* Skip if it isn't profitable.  */  if
>> > > + (!arm_block_set_vect_profit_p (length, align, true, mode))
>> > > +    return false;
>> > > +
>> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));  mem =
>> > > + adjust_automodify_address (dstbase, mode, dst, 0);
>> > > +
>> > > +  v = sext_hwi (v, BITS_PER_WORD);
>> > > +  val_elt = GEN_INT (v);
>> > > +  for (; j < nelt_mode; j++)
>> > > +    rval[j] = val_elt;
>> >
>> > Is this the first use of J?  If so, initialize it here.
>> >
>> > > +
>> > > +  reg = gen_reg_rtx (mode);
>> > > +  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode,
>> > > + rval));
>> > > +  /* Emit instruction loading the constant value.  */
>> > > + emit_move_insn (reg, val_vec);
>> > > +
>> > > +  /* Handle nelt_mode bytes in a vector.  */  for (; (i + nelt_mode
>> > > + <= length); i += nelt_mode)
>> >
>> > Similarly for I.
>> >
>> > > +    {
>> > > +      emit_insn ((*gen_func) (mem, reg));
>> > > +      if (i + 2 * nelt_mode <= length)
>> > > + emit_insn (gen_add2_insn (dst, GEN_INT (nelt_mode)));
>> > > +    }
>> > > +
>> > > +  if (i + nelt_v8 <= length)
>> > > +    gcc_assert (mode == V16QImode);
>> >
>> > Why not drop the if and write:
>> >
>> >      gcc_assert ((i + nelt_v8) > length || mode == V16QImode);
>> Fixed.
>>
>> >
>> > > +
>> > > +  /* Handle (8, 16) bytes leftover.  */  if (i + nelt_v8 < length)
>> >
>> > Your assertion above checked <=, but here you use <.  Is that correct?
>> Yes, it is.  For case "==", it means we have nelt_v8 bytes leftover, which
> will
>> be handled by the last branch of if statement.
>>
>> >
>> > > +    {
>> > > +      emit_insn (gen_add2_insn (dst, GEN_INT (length - i)));
>> > > +      /* We are shifting bytes back, set the alignment accordingly.
> */
>> > > +      if ((length & 1) != 0 && align >= 2)
>> > > + set_mem_align (mem, BITS_PER_UNIT);
>> > > +
>> > > +      emit_insn (gen_movmisalignv16qi (mem, reg));
>> > > +    }
>> > > +  /* Handle (0, 8] bytes leftover.  */
>> > > +  else if (i < length && i + nelt_v8 >= length)
>> > > +    {
>> > > +      if (mode == V16QImode)
>> > > + {
>> > > +   reg = gen_lowpart (V8QImode, reg);
>> > > +   mem = adjust_automodify_address (dstbase, V8QImode, dst, 0);
>> > > + }
>> > > +      emit_insn (gen_add2_insn (dst, GEN_INT ((length - i)
>> > > +                                       + (nelt_mode - nelt_v8))));
>> > > +      /* We are shifting bytes back, set the alignment accordingly.
> */
>> > > +      if ((length & 1) != 0 && align >= 2)
>> > > + set_mem_align (mem, BITS_PER_UNIT);
>> > > +
>> > > +      emit_insn (gen_movmisalignv8qi (mem, reg));
>> > > +    }
>> > > +
>> > > +  return true;
>> > > +}
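The "shifting bytes back" trick used for the leftover above can be pictured
with this standalone sketch (not from the patch; it assumes length >= 16 and
models each vst1 of a q register as a 16-byte copy from a pre-filled buffer):
the final store ends exactly at the end of the block and simply overlaps bytes
that were already written, so no scalar tail stores are needed.

  #include <string.h>
  #include <stddef.h>

  void
  memset_tail_overlap (unsigned char *dst, size_t length, unsigned char value)
  {
    unsigned char chunk[16];
    size_t i;

    memset (chunk, value, sizeof chunk);      /* Models vmov.i8 qN, #value.  */
    for (i = 0; i + 16 <= length; i += 16)    /* Full 16-byte stores.  */
      memcpy (dst + i, chunk, 16);
    if (i < length)                           /* Overlapping tail store.  */
      memcpy (dst + length - 16, chunk, 16);
  }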
>> > > +
>> > > +/* Set a block of memory using vectorization instructions for the
>> > > +   aligned case.  We fill the first LENGTH bytes of the memory area
>> > > +   starting from DSTBASE with byte constant VALUE.  ALIGN is the
>> > > +   alignment requirement of memory.  */
>> >
>> > See all the comments above for the unaligend case.
>> Fixed accordingly.
>>
>> >
>> > > +static bool
>> > > +arm_block_set_aligned_vect (rtx dstbase,
>> > > +                     unsigned HOST_WIDE_INT length,
>> > > +                     unsigned HOST_WIDE_INT value,
>> > > +                     unsigned HOST_WIDE_INT align) {
>> > > +  unsigned int i = 0, j = 0, nelt_v8, nelt_v16, nelt_mode;
>> > > +  rtx dst, addr, mem;
>> > > +  rtx val_elt, val_vec, reg;
>> > > +  rtx rval[MAX_VECT_LEN];
>> > > +  enum machine_mode mode;
>> > > +  unsigned HOST_WIDE_INT v = value;
>> > > +
>> > > +  gcc_assert ((align & 0x3) == 0);
>> > > +  nelt_v8 = GET_MODE_NUNITS (V8QImode);
>> > > +  nelt_v16 = GET_MODE_NUNITS (V16QImode);  if (length >= nelt_v16
>> > > + && unaligned_access && !BYTES_BIG_ENDIAN)
>> > > +    mode = V16QImode;
>> > > +  else
>> > > +    mode = V8QImode;
>> > > +
>> > > +  nelt_mode = GET_MODE_NUNITS (mode);  gcc_assert (length >=
>> > > + nelt_mode);
>> > > +  /* Skip if it isn't profitable.  */  if
>> > > + (!arm_block_set_vect_profit_p (length, align, false, mode))
>> > > +    return false;
>> > > +
>> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
>> > > +
>> > > +  v = sext_hwi (v, BITS_PER_WORD);
>> > > +  val_elt = GEN_INT (v);
>> > > +  for (; j < nelt_mode; j++)
>> > > +    rval[j] = val_elt;
>> > > +
>> > > +  reg = gen_reg_rtx (mode);
>> > > +  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode,
>> > > + rval));
>> > > +  /* Emit instruction loading the constant value.  */
>> > > + emit_move_insn (reg, val_vec);
>> > > +
>> > > +  /* Handle first 16 bytes specially using vst1:v16qi instruction.
>> > > +*/
>> > > +  if (mode == V16QImode)
>> > > +    {
>> > > +      mem = adjust_automodify_address (dstbase, mode, dst, 0);
>> > > +      emit_insn (gen_movmisalignv16qi (mem, reg));
>> > > +      i += nelt_mode;
>> > > +      /* Handle (8, 16) bytes leftover using vst1:v16qi again.  */
>> > > +      if (i + nelt_v8 < length && i + nelt_v16 > length)
>> > > + {
>> > > +   emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
>> > > +   mem = adjust_automodify_address (dstbase, mode, dst, 0);
>> > > +   /* We are shifting bytes back, set the alignment accordingly.  */
>> > > +   if ((length & 0x3) == 0)
>> > > +     set_mem_align (mem, BITS_PER_UNIT * 4);
>> > > +   else if ((length & 0x1) == 0)
>> > > +     set_mem_align (mem, BITS_PER_UNIT * 2);
>> > > +   else
>> > > +     set_mem_align (mem, BITS_PER_UNIT);
>> > > +
>> > > +   emit_insn (gen_movmisalignv16qi (mem, reg));
>> > > +   return true;
>> > > + }
>> > > +      /* Fall through for bytes leftover.  */
>> > > +      mode = V8QImode;
>> > > +      nelt_mode = GET_MODE_NUNITS (mode);
>> > > +      reg = gen_lowpart (V8QImode, reg);
>> > > +    }
>> > > +
>> > > +  /* Handle 8 bytes in a vector.  */  for (; (i + nelt_mode <=
>> > > + length); i += nelt_mode)
>> > > +    {
>> > > +      addr = plus_constant (Pmode, dst, i);
>> > > +      mem = adjust_automodify_address (dstbase, mode, addr, i);
>> > > +      emit_move_insn (mem, reg);
>> > > +    }
>> > > +
>> > > +  /* Handle single word leftover by shifting 4 bytes back.  We can
>> > > +     use aligned access for this case.  */
>> > > +  if (i + UNITS_PER_WORD == length)
>> > > +    {
>> > > +      addr = plus_constant (Pmode, dst, i - UNITS_PER_WORD);
>> > > +      mem = adjust_automodify_address (dstbase, mode,
>> > > +                                addr, i - UNITS_PER_WORD);
>> > > +      /* We are shifting 4 bytes back, set the alignment accordingly.
>> */
>> > > +      if (align > UNITS_PER_WORD)
>> > > + set_mem_align (mem, BITS_PER_UNIT * UNITS_PER_WORD);
>> > > +
>> > > +      emit_move_insn (mem, reg);
>> > > +    }
>> > > +  /* Handle (0, 4), (4, 8) bytes leftover by shifting bytes back.
>> > > +     We have to use unaligned access for this case.  */
>> > > +  else if (i < length)
>> > > +    {
>> > > +      emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
>> > > +      mem = adjust_automodify_address (dstbase, mode, dst, 0);
>> > > +      /* We are shifting bytes back, set the alignment accordingly.
> */
>> > > +      if ((length & 1) == 0)
>> > > + set_mem_align (mem, BITS_PER_UNIT * 2);
>> > > +      else
>> > > + set_mem_align (mem, BITS_PER_UNIT);
>> > > +
>> > > +      emit_insn (gen_movmisalignv8qi (mem, reg));
>> > > +    }
>> > > +
>> > > +  return true;
>> > > +}
>> > > +
>> > > +/* Set a block of memory using plain strh/strb instructions, only
>> > > +   using instructions allowed by ALIGN on processor.  We fill the
>> > > +   first LENGTH bytes of the memory area starting from DSTBASE
>> > > +   with byte constant VALUE.  ALIGN is the alignment requirement
>> > > +   of memory.  */
>> > > +static bool
>> > > +arm_block_set_unaligned_straight (rtx dstbase,
>> > > +                           unsigned HOST_WIDE_INT length,
>> > > +                           unsigned HOST_WIDE_INT value,
>> > > +                           unsigned HOST_WIDE_INT align) {
>> > > +  unsigned int i;
>> > > +  rtx dst, addr, mem;
>> > > +  rtx val_exp, val_reg, reg;
>> > > +  enum machine_mode mode;
>> > > +  HOST_WIDE_INT v = value;
>> > > +
>> > > +  gcc_assert (align == 1 || align == 2);
>> > > +
>> > > +  if (align == 2)
>> > > +    v |= (value << BITS_PER_UNIT);
>> > > +
>> > > +  v = sext_hwi (v, BITS_PER_WORD);
>> > > +  val_exp = GEN_INT (v);
>> > > +  /* Skip if it isn't profitable.  */
>> > > +  if (!arm_block_set_straight_profit_p (val_exp, length,
>> > > +                                 align, true, false))
>> > > +    return false;
>> > > +
>> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));  mode = (align == 2 ?
>> > > + HImode : QImode);  val_reg = force_reg (SImode, val_exp);  reg =
>> > > + gen_lowpart (mode, val_reg);
>> > > +
>> > > +  for (i = 0; (i + GET_MODE_SIZE (mode) <= length); i +=
>> > > + GET_MODE_SIZE
>> > (mode))
>> > > +    {
>> > > +      addr = plus_constant (Pmode, dst, i);
>> > > +      mem = adjust_automodify_address (dstbase, mode, addr, i);
>> > > +      emit_move_insn (mem, reg);
>> > > +    }
>> > > +
>> > > +  /* Handle single byte leftover.  */  if (i + 1 == length)
>> > > +    {
>> > > +      reg = gen_lowpart (QImode, val_reg);
>> > > +      addr = plus_constant (Pmode, dst, i);
>> > > +      mem = adjust_automodify_address (dstbase, QImode, addr, i);
>> > > +      emit_move_insn (mem, reg);
>> > > +      i++;
>> > > +    }
>> > > +
>> > > +  gcc_assert (i == length);
>> > > +  return true;
>> > > +}
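The byte replication used by the function above, and by the aligned variant
that follows, can be summarised by this standalone sketch (names are
illustrative, not from the patch): the single byte value is smeared across the
widest store unit the alignment allows before the store loop runs, two bytes
for strh, four for str, eight for strd.

  #include <stdint.h>

  uint64_t
  replicate_byte (unsigned char value, unsigned store_bytes)
  {
    uint64_t v = value;
    unsigned i;

    /* STORE_BYTES is 2 for strh, 4 for str, 8 for strd.  */
    for (i = 1; i < store_bytes; i++)
      v |= (uint64_t) value << (8 * i);
    return v;
  }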
>> > > +
>> > > +/* Set a block of memory using plain strd/str/strh/strb instructions,
>> > > +   to permit unaligned copies on processors which support unaligned
>> > > +   semantics for those instructions.  We fill the first LENGTH bytes
>> > > +   of the memory area starting from DSTBASE with byte constant VALUE.
>> > > +   ALIGN is the alignment requirement of memory.  */ static bool
>> > > +arm_block_set_aligned_straight (rtx dstbase,
>> > > +                         unsigned HOST_WIDE_INT length,
>> > > +                         unsigned HOST_WIDE_INT value,
>> > > +                         unsigned HOST_WIDE_INT align)
>> > > +{
>> > > +  unsigned int i = 0;
>> > > +  rtx dst, addr, mem;
>> > > +  rtx val_exp, val_reg, reg;
>> > > +  unsigned HOST_WIDE_INT v;
>> > > +  bool use_strd_p;
>> > > +
>> > > +  use_strd_p = (length >= 2 * UNITS_PER_WORD && (align & 3) == 0
>> > > +         && TARGET_LDRD && current_tune->prefer_ldrd_strd);
>> > > +
>> > > +  v = (value | (value << 8) | (value << 16) | (value << 24));  if
>> > > + (length < UNITS_PER_WORD)
>> > > +    v &= (0xFFFFFFFF >> (UNITS_PER_WORD - length) * BITS_PER_UNIT);
>> > > +
>> > > +  if (use_strd_p)
>> > > +    v |= (v << BITS_PER_WORD);
>> > > +  else
>> > > +    v = sext_hwi (v, BITS_PER_WORD);
>> > > +
>> > > +  val_exp = GEN_INT (v);
>> > > +  /* Skip if it isn't profitable.  */
>> > > +  if (!arm_block_set_straight_profit_p (val_exp, length,
>> > > +                                 align, false, use_strd_p))
>> > > +    {
>> > > +      /* Try without strd.  */
>> > > +      v = (v >> BITS_PER_WORD);
>> > > +      v = sext_hwi (v, BITS_PER_WORD);
>> > > +      val_exp = GEN_INT (v);
>> > > +      use_strd_p = false;
>> > > +      if (!arm_block_set_straight_profit_p (val_exp, length,
>> > > +                                     align, false, use_strd_p))
>> > > + return false;
>> > > +    }
>> > > +
>> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
>> > > +  /* Handle double words using strd if possible.  */
>> > > +  if (use_strd_p)
>> > > +    {
>> > > +      val_reg = force_reg (DImode, val_exp);
>> > > +      reg = val_reg;
>> > > +      for (; (i + 8 <= length); i += 8)
>> > > + {
>> > > +   addr = plus_constant (Pmode, dst, i);
>> > > +   mem = adjust_automodify_address (dstbase, DImode, addr, i);
>> > > +   emit_move_insn (mem, reg);
>> > > + }
>> > > +    }
>> > > +  else
>> > > +    val_reg = force_reg (SImode, val_exp);
>> > > +
>> > > +  /* Handle words.  */
>> > > +  reg = (use_strd_p ? gen_lowpart (SImode, val_reg) : val_reg);
>> > > +  for (; (i + 4 <= length); i += 4)
>> > > +    {
>> > > +      addr = plus_constant (Pmode, dst, i);
>> > > +      mem = adjust_automodify_address (dstbase, SImode, addr, i);
>> > > +      if ((align & 3) == 0)
>> > > + emit_move_insn (mem, reg);
>> > > +      else
>> > > + emit_insn (gen_unaligned_storesi (mem, reg));
>> > > +    }
>> > > +
>> > > +  /* Merge last pair of STRH and STRB into a STR if possible.  */
>> > > + if (unaligned_access && i > 0 && (i + 3) == length)
>> > > +    {
>> > > +      addr = plus_constant (Pmode, dst, i - 1);
>> > > +      mem = adjust_automodify_address (dstbase, SImode, addr, i - 1);
>> > > +      /* We are shifting one byte back, set the alignment
> accordingly.
>> */
>> > > +      if ((align & 1) == 0)
>> > > + set_mem_align (mem, BITS_PER_UNIT);
>> > > +
>> > > +      /* Most likely this is an unaligned access, and we can't tell
> at
>> > > +  compilation time.  */
>> > > +      emit_insn (gen_unaligned_storesi (mem, reg));
>> > > +      return true;
>> > > +    }
>> > > +
>> > > +  /* Handle half word leftover.  */
>> > > +  if (i + 2 <= length)
>> > > +    {
>> > > +      reg = gen_lowpart (HImode, val_reg);
>> > > +      addr = plus_constant (Pmode, dst, i);
>> > > +      mem = adjust_automodify_address (dstbase, HImode, addr, i);
>> > > +      if ((align & 1) == 0)
>> > > + emit_move_insn (mem, reg);
>> > > +      else
>> > > + emit_insn (gen_unaligned_storehi (mem, reg));
>> > > +
>> > > +      i += 2;
>> > > +    }
>> > > +
>> > > +  /* Handle single byte leftover.  */  if (i + 1 == length)
>> > > +    {
>> > > +      reg = gen_lowpart (QImode, val_reg);
>> > > +      addr = plus_constant (Pmode, dst, i);
>> > > +      mem = adjust_automodify_address (dstbase, QImode, addr, i);
>> > > +      emit_move_insn (mem, reg);
>> > > +    }
>> > > +
>> > > +  return true;
>> > > +}
>> > > +
>> > > +/* Set a block of memory using vectorization instructions for both
>> > > +   aligned and unaligned cases.  We fill the first LENGTH bytes of
>> > > +   the memory area starting from DSTBASE with byte constant VALUE.
>> > > +   ALIGN is the alignment requirement of memory.  */ static bool
>> > > +arm_block_set_vect (rtx dstbase,
>> > > +             unsigned HOST_WIDE_INT length,
>> > > +             unsigned HOST_WIDE_INT value,
>> > > +             unsigned HOST_WIDE_INT align) {
>> > > +  /* Check whether we need to use unaligned store instruction.  */
>> > > +  if (((align & 3) != 0 || (length & 3) != 0)
>> > > +      /* Check whether unaligned store instruction is available.  */
>> > > +      && (!unaligned_access || BYTES_BIG_ENDIAN))
>> > > +    return false;
>> >
>> > Huh!  vst1.8 can work for unaligned accesses even when hw alignment
>> > checking is strict.
>> Emm, All movmisalign patterns are guarded by " !BYTES_BIG_ENDIAN &&
>> unaligned_access", vst1.8 instructions  can't be recognized now in this
> way.
>> I agree that it's too strict, but that's another problem I think.

That was introduced to "fix up" the issue with another test IIRC.
That's probably not related to this particular patch. It's the other
thread that's been ongoing with Maciej, so let's continue it there.

>>
>> >
>> > > +
>> > > +  if ((align & 3) == 0)
>> > > +    return arm_block_set_aligned_vect (dstbase, length, value,
>> > > +align);
>> > > +  else
>> > > +    return arm_block_set_unaligned_vect (dstbase, length, value,
>> > > +align); }
>> > > +
>> > > +/* Expand string store operation.  Firstly we try to do that by using
>> > > +   vectorization instructions, then try with ARM unaligned access and
>> > > +   double-word store if profitable.  OPERANDS[0] is the destination,
>> > > +   OPERANDS[1] is the number of bytes, operands[2] is the value to
>> > > +   initialize the memory, OPERANDS[3] is the known alignment of the
>> > > +   destination.  */
>> > > +bool
>> > > +arm_gen_setmem (rtx *operands)
>> > > +{
>> > > +  rtx dstbase = operands[0];
>> > > +  unsigned HOST_WIDE_INT length;
>> > > +  unsigned HOST_WIDE_INT value;
>> > > +  unsigned HOST_WIDE_INT align;
>> > > +
>> > > +  if (!CONST_INT_P (operands[2]) || !CONST_INT_P (operands[1]))
>> > > +    return false;
>> > > +
>> > > +  length = UINTVAL (operands[1]);
>> > > +  if (length > 64)
>> > > +    return false;
>> > > +
>> > > +  value = (UINTVAL (operands[2]) & 0xFF);  align = UINTVAL
>> > > + (operands[3]);  if (TARGET_NEON && length >= 8
>> > > +      && current_tune->string_ops_prefer_neon
>> > > +      && arm_block_set_vect (dstbase, length, value, align))
>> > > +    return true;
>> > > +
>> > > +  if (!unaligned_access && (align & 3) != 0)
>> > > +    return arm_block_set_unaligned_straight (dstbase, length,
>> > > + value, align);
>> > > +
>> > > +  return arm_block_set_aligned_straight (dstbase, length, value,
>> > > +align); }
>> > > +
>> > >  /* Implement the TARGET_ASAN_SHADOW_OFFSET hook.  */
>> > >
>> > >  static unsigned HOST_WIDE_INT
>> > > Index: gcc/config/arm/arm-protos.h
>> > >
>> >
>> ==========================================================
>> > =========
>> > > --- gcc/config/arm/arm-protos.h   (revision 209852)
>> > > +++ gcc/config/arm/arm-protos.h   (working copy)
>> > > @@ -277,6 +277,8 @@ struct tune_params
>> > >    /* Prefer 32-bit encoding instead of 16-bit encoding where subset
>> > > of
>> flags
>> > >       would be set.  */
>> > >    bool disparage_partial_flag_setting_t16_encodings;
>> > > +  /* Prefer to inline string operations like memset by using Neon.
>> > > + */  bool string_ops_prefer_neon;
>> > >  };
>> > >
>> > >  extern const struct tune_params *current_tune; @@ -289,6 +291,7 @@
>> > > extern void arm_emit_coreregs_64bit_shift (enum rt  extern bool
>> > > arm_validize_comparison (rtx *, rtx *, rtx *);  #endif /* RTX_CODE
>> > > */
>> > >
>> > > +extern bool arm_gen_setmem (rtx *);
>> > >  extern void arm_expand_vec_perm (rtx target, rtx op0, rtx op1, rtx
>> > > sel);  extern bool arm_expand_vec_perm_const (rtx target, rtx op0,
>> > > rtx op1, rtx sel);
>> > >
>> > > Index: gcc/config/arm/arm.md
>> > >
>> >
>> ==========================================================
>> > =========
>> > > --- gcc/config/arm/arm.md (revision 209852)
>> > > +++ gcc/config/arm/arm.md (working copy)
>> > > @@ -7555,6 +7555,20 @@
>> > >  })
>> > >
>> > >
>> > > +(define_expand "setmemsi"
>> > > +  [(match_operand:BLK 0 "general_operand" "")
>> > > +   (match_operand:SI 1 "const_int_operand" "")
>> > > +   (match_operand:SI 2 "const_int_operand" "")
>> > > +   (match_operand:SI 3 "const_int_operand" "")]
>> > > +  "TARGET_32BIT"
>> > > +{
>> > > +  if (arm_gen_setmem (operands))
>> > > +    DONE;
>> > > +
>> > > +  FAIL;
>> > > +})
>> > > +
>> > > +
>> > >  ;; Move a block of memory if it is word aligned and MORE than 2
>> > > words
>> > long.
>> > >  ;; We could let this apply for blocks of less than this, but it
>> > > clobbers so  ;; many registers that there is then probably a better
> way.
>> > > Index: gcc/testsuite/gcc.target/arm/memset-inline-6.c
>> > >
>> >
>> ==========================================================
>> > =========
>> > > --- gcc/testsuite/gcc.target/arm/memset-inline-6.c        (revision 0)
>> > > +++ gcc/testsuite/gcc.target/arm/memset-inline-6.c        (revision 0)
>> >
>> > Have you tested these when the compiler was configured with "--with-
>> > cpu=cortex-a9"?
>> Here is the tricky part.
>> For compiler configured with "--with-tune=cortex-a9", the neon related
>> cases
>> (4/5/6/8/9) would fail because we have no way to determine that we are
>> compiling with cortex-a9 tune here.

There is no easy way of fixing this.

>> For compiler configured with "--with-cpu=cortex-a9", the test cases would
>> pass but I think this is a mistake.  It reveals an issue that GCC won't
> pass "-
>> mcpu=cortex-a9" to cc1, resulting in cortex-a8 tune is selected.  It just
> makes
>> no sense.
>> With these issues, I didn't change the tests for now.

I think we'll just have to take the hit on noise in other configurations.

> Precisely, I configured gcc with options "--with-arch=armv7-a
> --with-cpu|--with-tune=cortex-a9".
> I read gcc documents and realized that "-mcpu" is ignored when "-march" is
> specified.  I don't know why gcc acts in this manner, but it leads to
> inconsistent configuration/command line behavior.
> If we configure GCC with "--with-arch=armv7-a --with-cpu=cortex-a9", then
> only "-march=armv7-a" is passed to cc1.

That kind of configuration triggers a warning today, but no one pays
attention to it. See James's proposal to promote this to an error.

> If we compile with "-march=armv7-a -mcpu=cortex-a9", then gcc works fine and
> passes "-march=armv7-a -mcpu=cortex-a9" to cc1.

That should be fine.

>
> Even more weird cc1 warns that "switch -mcpu=cortex-m4 conflicts with
> -march=armv7-m switch".


This is OK unless there are objections in the next 24 hours.

Please watch out for any fallout - certainly rebase and retest before
applying, and please post the rebased version for archival purposes.


regards
Ramana
>
> Thanks,
> bin
>
>
>
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH ARM] Improve ARM memset inlining
  2014-06-27  8:21       ` Ramana Radhakrishnan
@ 2014-07-04 12:18         ` Bin Cheng
  2014-07-08  8:32           ` Bin.Cheng
  2014-07-08  8:56           ` Ramana Radhakrishnan
  0 siblings, 2 replies; 14+ messages in thread
From: Bin Cheng @ 2014-07-04 12:18 UTC (permalink / raw)
  To: Ramana Radhakrishnan; +Cc: Richard Earnshaw, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 37974 bytes --]



> -----Original Message-----
> From: Ramana Radhakrishnan [mailto:ramana.gcc@googlemail.com]
> Sent: Friday, June 27, 2014 9:22 AM
> To: Bin Cheng
> Cc: Richard Earnshaw; gcc-patches
> Subject: Re: [PATCH ARM] Improve ARM memset inlining
> 
> On Tue, May 6, 2014 at 5:59 AM, bin.cheng <bin.cheng@arm.com> wrote:
> >
> >
> >> -----Original Message-----
> >> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
> >> owner@gcc.gnu.org] On Behalf Of bin.cheng
> >> Sent: Monday, May 05, 2014 3:21 PM
> >> To: Richard Earnshaw
> >> Cc: gcc-patches@gcc.gnu.org
> >> Subject: RE: [PATCH ARM] Improve ARM memset inlining
> >>
> >> Hi Richard,  Thanks for reviewing.  I embedded answers to your
> >> comments, also updated the patch.
> >>
> >> > -----Original Message-----
> >> > From: Richard Earnshaw
> >> > Sent: Friday, May 02, 2014 10:00 PM
> >> > To: Bin Cheng
> >> > Cc: gcc-patches@gcc.gnu.org
> >> > Subject: Re: [PATCH ARM] Improve ARM memset inlining
> >> >
> >> > On 30/04/14 03:52, bin.cheng wrote:
> >> > > Hi,
> >> > > This patch expands small memset calls into direct memory set
> >> > > instructions by introducing "setmemsi" pattern.  For processors
> >> > > without NEON support, it expands memset using general store
> >> > > instruction.  For example, strd for 4-bytes aligned addresses.
> >> > > For processors with NEON support, it expands memset using neon
> >> > > instructions like vstr and miscellaneous vst1.* instructions for
> >> > > both
> >> aligned
> >> > and unaligned cases.
> >> > >
> >> > > This patch depends on
> >> > > http://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html otherwise
> >> > > vst1.64 will be generated for 32-bit aligned memory unit.
> >> > >
> >> > > There is also one leftover work of this patch:  Since vst1.*
> >> > > instructions only support post-increment addressing mode, the
> >> > > inlined memset for unaligned neon cases should be like:
> >> > >   vmov.i32   q8, #...
> >> > >   vst1.8     {q8}, [r3]!
> >> > >   vst1.8     {q8}, [r3]!
> >> > >   vst1.8     {q8}, [r3]!
> >> > >   vst1.8     {q8}, [r3]
> >> >
> >> > Other than for zero, I'd expect the vmov to be vmov.i8 to move an
> >> arbitrary
> >> I just used vmov.i32 as an example.  The element size is actually
> > calculated by
> >> function neon_valid_immediate which works as expected I think.
> >>
> >> > byte value into all lanes in a vector.  After that, if the
> >> > alignment is
> >> known to
> >> > be more than 8-bit, I'd expect the vst1 instructions (with the
> >> > exception
> >> of the
> >> > last store if the length is not a multiple of the alignment) to use
> >> >
> >> >     vst1.<align> {reg}, [addr-reg :<align>]!
> >> >
> >> > Hence, for 16-bit aligned data, we want
> >> >
> >> >     vst1.16 {q8}, [r3:16]!
> >> Did I miss something important?  It seems to me the explicit
> >> alignment
> > notes
> >> supported are 64/128/256.  So what do you mean by 16 bits alignment
> here?
> >>
> >> >
> >> > > But for now, gcc can't do this and below code is generated:
> >> > >   vmov.i32   q8, #...
> >> > >   vst1.8     {q8}, [r3]
> >> > >   add        r2,   r3,  #16
> >> > >   add        r3,   r2,  #16
> >> > >   vst1.8     {q8}, [r2]
> >> > >   vst1.8     {q8}, [r3]
> >> > >   add        r2,   r3,  #16
> >> > >   vst1.8     {q8}, [r2]
> >> > >
> >> > > I investigated this issue.  The root cause lies in rtx cost
> >> > > returned by ARM backend.  Anyway, I think this is another issue
> >> > > and should be fixed in separated patch.
> 
> Ok looks like Charles B from Linaro has run into the same thing and has some
> fixes to suggest in costs.
> 
> >> > >
> >> > > Bootstrap and reg-test on cortex-a15, with or without neon support.
> >> > > Is it OK?
> >> > >
> >> >
> >> > Some more comments inline.
> >> >
> >> > > Thanks,
> >> > > bin
> >> > >
> >> > >
> >> > > 2014-04-29  Bin Cheng  <bin.cheng@arm.com>
> >> > >
> >> > >   PR target/55701
> >> > >   * config/arm/arm.md (setmem): New pattern.
> >> > >   * config/arm/arm-protos.h (struct tune_params): New field.
> >> > >   (arm_gen_setmem): New prototype.
> >> > >   * config/arm/arm.c (arm_slowmul_tune): Initialize new field.
> >> > >   (arm_fastmul_tune, arm_strongarm_tune, arm_xscale_tune): Ditto.
> >> > >   (arm_9e_tune, arm_v6t2_tune, arm_cortex_tune): Ditto.
> >> > >   (arm_cortex_a8_tune, arm_cortex_a7_tune): Ditto.
> >> > >   (arm_cortex_a15_tune, arm_cortex_a53_tune): Ditto.
> >> > >   (arm_cortex_a57_tune, arm_cortex_a5_tune): Ditto.
> >> > >   (arm_cortex_a9_tune, arm_cortex_a12_tune): Ditto.
> >> > >   (arm_v7m_tune, arm_v6m_tune, arm_fa726te_tune): Ditto.
> >> > >   (arm_const_inline_cost): New function.
> >> > >   (arm_block_set_max_insns): New function.
> >> > >   (arm_block_set_straight_profit_p): New function.
> >> > >   (arm_block_set_vect_profit_p): New function.
> >> > >   (arm_block_set_unaligned_vect): New function.
> >> > >   (arm_block_set_aligned_vect): New function.
> >> > >   (arm_block_set_unaligned_straight): New function.
> >> > >   (arm_block_set_aligned_straight): New function.
> >> > >   (arm_block_set_vect, arm_gen_setmem): New functions.
> >> > >
> >> > > gcc/testsuite/ChangeLog
> >> > > 2014-04-29  Bin Cheng  <bin.cheng@arm.com>
> >> > >
> >> > >   PR target/55701
> >> > >   * gcc.target/arm/memset-inline-1.c: New test.
> >> > >   * gcc.target/arm/memset-inline-2.c: New test.
> >> > >   * gcc.target/arm/memset-inline-3.c: New test.
> >> > >   * gcc.target/arm/memset-inline-4.c: New test.
> >> > >   * gcc.target/arm/memset-inline-5.c: New test.
> >> > >   * gcc.target/arm/memset-inline-6.c: New test.
> >> > >   * gcc.target/arm/memset-inline-7.c: New test.
> >> > >   * gcc.target/arm/memset-inline-8.c: New test.
> >> > >   * gcc.target/arm/memset-inline-9.c: New test.
> >> > >
> >> > >
> >> > > j1328-20140429.txt
> >> > >
> >> > >
> >> > > Index: gcc/config/arm/arm.c
> >> > >
> >> >
> >>
> ==========================================================
> >> > =========
> >> > > --- gcc/config/arm/arm.c  (revision 209852)
> >> > > +++ gcc/config/arm/arm.c  (working copy)
> >> > > @@ -1585,10 +1585,11 @@ const struct tune_params
> arm_slowmul_tune
> >> =
> >> > >    true,                                          /* Prefer constant
> >> > pool.  */
> >> > >    arm_default_branch_cost,
> >> > >    false,                                 /* Prefer LDRD/STRD.  */
> >> > > -  {true, true},                                  /* Prefer non short
> >> > circuit.  */
> >> > > -  &arm_default_vec_cost,                        /* Vectorizer costs.
> >> */
> >> > > -  false,                                        /* Prefer Neon for
> >> 64-bits bitops.  */
> >> > > -  false, false                                  /* Prefer 32-bit
> >> encodings.  */
> >> > > +  {true, true},                          /* Prefer non short circuit.
> >> */
> >> > > +  &arm_default_vec_cost,                /* Vectorizer costs.  */
> >> > > +  false,                                /* Prefer Neon for 64-bits
> >> bitops.  */
> >> > > +  false, false,                         /* Prefer 32-bit encodings.
> > */
> >> > > +  false                                 /* Prefer Neon for stringops.
> >> */
> >> > >  };
> >> > >
> >> >
> >> > Please make sure that all the white space before the comments is
> >> > using
> >> TAB,
> >> > not spaces.  Similarly for the other tables.
> >> Fixed.
> >>
> >> >
> >> > > @@ -16788,6 +16806,14 @@ arm_const_double_inline_cost (rtx val)
> >> > >                         NULL_RTX, NULL_RTX, 0, 0));  }
> >> > >
> >> > > +/* Cost of loading a SImode constant.  */ static inline int
> >> > > +arm_const_inline_cost (rtx val) {
> >> > > +  return arm_gen_constant (SET, SImode, NULL_RTX, INTVAL (val),
> >> > > +                           NULL_RTX, NULL_RTX, 0, 0); }
> >> > > +
> >> >
> >> > This could be used more widely if you passed the SET in as a
> >> > parameter (there are cases in arm_new_rtx_cost that could use it, for
> example).
> >> > Also, you want to enable sub-targets (only once you can't create
> >> > new pseudos is that not safe), so the penultimate argument in the
> >> > call to arm_gen_constant should be 1.
> >> Fixed.
> 
> 
> 
> >>
> >> >
> >> > >  /* Return true if it is worthwhile to split a 64-bit constant
> >> > > into
> > two
> >> > >     32-bit operations.  This is the case if optimizing for size, or
> >> > >     if we have load delay slots, or if one 32-bit part can be
> >> > > done with @@ -31350,6 +31383,504 @@ arm_validize_comparison (rtx
> >> > > *comparison, rtx * op
> >> > >
> >> > >  }
> >> > >
> >> > > +/* Maximum number of instructions to set block of memory.  */
> >> > > +static int arm_block_set_max_insns (void) {
> >> > > +  return (optimize_function_for_size_p (cfun) ? 4 : 8); }
> >> >
> >> > I think the non-size_p alternative should really be a parameter in
> >> > the
> >> per-cpu
> >> > costs table.
> >> Fixed.
> >>
> >> >
> >> > > +
> >> > > +/* Return TRUE if it's profitable to set block of memory for straight
> >> > > +   case.  */
> >> >
> >> > "Straight" is confusing here.  Do you mean non-vectorized?  If so,
> >> > then non_vect might be clearer.
> >> Fixed.
> >>
> >> >
> >> > The arguments should really be documented (see comment below
> about
> >> > align, for example).
> >> Fixed.
> >>
> >> >
> >> > > +static bool
> >> > > +arm_block_set_straight_profit_p (rtx val,
> >> > > +                          unsigned HOST_WIDE_INT length,
> >> > > +                          unsigned HOST_WIDE_INT align,
> >> > > +                          bool unaligned_p, bool use_strd_p) {
> >> > > +  int num = 0;
> >> > > +  /* For leftovers in bytes of 0-7, we can set the memory block using
> >> > > +     strb/strh/str with minimum instruction number.  */
> >> > > +  int leftover[8] = {0, 1, 1, 2, 1, 2, 2, 3};
> >> >
> >> > This should be marked const.
> >> Fixed.
> >>
> >> >
> >> > > +
> >> > > +  if (unaligned_p)
> >> > > +    {
> >> > > +      num = arm_const_inline_cost (val);
> >> > > +      num += length / align + length % align;
> >> >
> >> > Isn't align in bits here, when you really want it in bytes?
> >> All alignments are in bytes starting from pattern "setmem".
> >>
> >> >
> >> > What if align > 4 bytes?
> >> Then it's the "!unaligned_p" case and handled by other arms of this
> >> if statement.
> >>
> >> >
> >> > > +    }
> >> > > +  else if (use_strd_p)
> >> > > +    {
> >> > > +      num = arm_const_double_inline_cost (val);
> >> > > +      num += (length >> 3) + leftover[length & 7];
> >> > > +    }
> >> > > +  else
> >> > > +    {
> >> > > +      num = arm_const_inline_cost (val);
> >> > > +      num += (length >> 2) + leftover[length & 3];
> >> > > +    }
> >> > > +
> >> > > +  /* We may be able to combine last pair STRH/STRB into a single STR
> >> > > +     by shifting one byte back.  */  if (unaligned_access &&
> >> > > + length
> >> > > + > 3 && (length & 3) == 3)
> >> > > +    num--;
> >> > > +
> >> > > +  return (num <= arm_block_set_max_insns ()); }
> >> > > +
> >> > > +/* Return TRUE if it's profitable to set block of memory for
> >> > > +vector case.  */ static bool arm_block_set_vect_profit_p
> >> > > +(unsigned HOST_WIDE_INT length,
> >> > > +                      unsigned HOST_WIDE_INT align
> >> > ATTRIBUTE_UNUSED,
> >> > > +                      bool unaligned_p, enum machine_mode mode)
> >> >
> >> > I'm not sure what you mean by unaligned here.  Again, documenting
> >> > the arguments might help.
> >> Fixed.
> >>
> >> >
> >> > > +{
> >> > > +  int num;
> >> > > +  unsigned int nelt = GET_MODE_NUNITS (mode);
> >> > > +
> >> > > +  /* Num of instruction loading constant value.  */
> >> >
> >> > Use either "Number" or, in this case, simply drop that bit and write:
> >> >   /* Instruction loading constant value.  */
> >> Fixed.
> >>
> >> >
> >> > > +  num = 1;
> >> > > +  /* Num of store instructions.  */
> >> >
> >> > Likewise.
> >> >
> >> > > +  num += (length + nelt - 1) / nelt;
> >> > > +  /* Num of address adjusting instructions.  */
> >> >
> >> > Can't we work on the premise that the address adjusting
> >> > instructions will
> >> be
> >> > merged into the stores?  I know you said that they currently do
> >> > not, but that's not a problem that this bit of code should have to
> >> > worry
> > about.
> >> Fixed.
> >>
> >> >
> >> > > +  if (unaligned_p)
> >> > > +    /* For unaligned case, it's one less than the store instructions.
> >> */
> >> > > +    num += (length + nelt - 1) / nelt - 1;  else if ((length &
> >> > > + 3) !=
> >> > > + 0)
> >> > > +    /* For aligned case, it's one if bytes leftover can only be
> > stored
> >> > > +       by mis-aligned store instruction.  */
> >> > > +    num++;
> >> > > +
> >> > > +  /* Store the first 16 bytes using vst1:v16qi for the aligned case.
> >> > > + */  if (!unaligned_p && mode == V16QImode)
> >> > > +    num--;
> >> > > +
> >> > > +  return (num <= arm_block_set_max_insns ()); }
> >> > > +
> >> > > +/* Set a block of memory using vectorization instructions for the
> >> > > +   unaligned case.  We fill the first LENGTH bytes of the memory
> >> > > +   area starting from DSTBASE with byte constant VALUE.  ALIGN is
> >> > > +   the alignment requirement of memory.  */
> >> >
> >> > What's the return value mean?
> >> Documented.
> >>
> >> >
> >> > > +static bool
> >> > > +arm_block_set_unaligned_vect (rtx dstbase,
> >> > > +                       unsigned HOST_WIDE_INT length,
> >> > > +                       unsigned HOST_WIDE_INT value,
> >> > > +                       unsigned HOST_WIDE_INT align) {
> >> > > +  unsigned int i = 0, j = 0, nelt_v16, nelt_v8, nelt_mode;
> >> >
> >> > Don't mix initialized declarations with unitialized ones on the
> >> > same
> > line.
> >> You
> >> > don't appear to use either I or J until their first use in the loop
> >> control below,
> >> > so why initialize them here?
> >> Fixed.
> >>
> >> >
> >> > > +  rtx dst, mem;
> >> > > +  rtx val_elt, val_vec, reg;
> >> > > +  rtx rval[MAX_VECT_LEN];
> >> > > +  rtx (*gen_func) (rtx, rtx);
> >> > > +  enum machine_mode mode;
> >> > > +  unsigned HOST_WIDE_INT v = value;
> >> > > +
> >> > > +  gcc_assert ((align & 0x3) != 0);
> >> > > +  nelt_v8 = GET_MODE_NUNITS (V8QImode);
> >> > > +  nelt_v16 = GET_MODE_NUNITS (V16QImode);  if (length >=
> nelt_v16)
> >> > > +    {
> >> > > +      mode = V16QImode;
> >> > > +      gen_func = gen_movmisalignv16qi;
> >> > > +    }
> >> > > +  else
> >> > > +    {
> >> > > +      mode = V8QImode;
> >> > > +      gen_func = gen_movmisalignv8qi;
> >> > > +    }
> >> > > +  nelt_mode = GET_MODE_NUNITS (mode);  gcc_assert (length >=
> >> > > + nelt_mode);
> >> > > +  /* Skip if it isn't profitable.  */  if
> >> > > + (!arm_block_set_vect_profit_p (length, align, true, mode))
> >> > > +    return false;
> >> > > +
> >> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));  mem =
> >> > > + adjust_automodify_address (dstbase, mode, dst, 0);
> >> > > +
> >> > > +  v = sext_hwi (v, BITS_PER_WORD);  val_elt = GEN_INT (v);  for
> >> > > + (; j < nelt_mode; j++)
> >> > > +    rval[j] = val_elt;
> >> >
> >> > Is this the first use of J?  If so, initialize it here.
> >> >
> >> > > +
> >> > > +  reg = gen_reg_rtx (mode);
> >> > > +  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v
> (nelt_mode,
> >> > > + rval));
> >> > > +  /* Emit instruction loading the constant value.  */
> >> > > + emit_move_insn (reg, val_vec);
> >> > > +
> >> > > +  /* Handle nelt_mode bytes in a vector.  */  for (; (i +
> >> > > + nelt_mode <= length); i += nelt_mode)
> >> >
> >> > Similarly for I.
> >> >
> >> > > +    {
> >> > > +      emit_insn ((*gen_func) (mem, reg));
> >> > > +      if (i + 2 * nelt_mode <= length) emit_insn (gen_add2_insn
> >> > > + (dst, GEN_INT (nelt_mode)));
> >> > > +    }
> >> > > +
> >> > > +  if (i + nelt_v8 <= length)
> >> > > +    gcc_assert (mode == V16QImode);
> >> >
> >> > Why not drop the if and write:
> >> >
> >> >      gcc_assert ((i + nelt_v8) > length || mode == V16QImode);
> >> Fixed.
> >>
> >> >
> >> > > +
> >> > > +  /* Handle (8, 16) bytes leftover.  */  if (i + nelt_v8 <
> >> > > + length)
> >> >
> >> > Your assertion above checked <=, but here you use <.  Is that correct?
> >> Yes, it is.  For the "==" case, we have nelt_v8 bytes leftover, which
> >> will be handled by the last branch of the if statement.
> >>
> >> >
> >> > > +    {
> >> > > +      emit_insn (gen_add2_insn (dst, GEN_INT (length - i)));
> >> > > +      /* We are shifting bytes back, set the alignment accordingly.
> > */
> >> > > +      if ((length & 1) != 0 && align >= 2) set_mem_align (mem,
> >> > > + BITS_PER_UNIT);
> >> > > +
> >> > > +      emit_insn (gen_movmisalignv16qi (mem, reg));
> >> > > +    }
> >> > > +  /* Handle (0, 8] bytes leftover.  */  else if (i < length && i
> >> > > + + nelt_v8 >= length)
> >> > > +    {
> >> > > +      if (mode == V16QImode)
> >> > > + {
> >> > > +   reg = gen_lowpart (V8QImode, reg);
> >> > > +   mem = adjust_automodify_address (dstbase, V8QImode, dst, 0);
> >> > > + }
> >> > > +      emit_insn (gen_add2_insn (dst, GEN_INT ((length - i)
> >> > > +                                       + (nelt_mode - nelt_v8))));
> >> > > +      /* We are shifting bytes back, set the alignment accordingly.
> > */
> >> > > +      if ((length & 1) != 0 && align >= 2) set_mem_align (mem,
> >> > > + BITS_PER_UNIT);
> >> > > +
> >> > > +      emit_insn (gen_movmisalignv8qi (mem, reg));
> >> > > +    }
> >> > > +
> >> > > +  return true;
> >> > > +}
> >> > > +
> >> > > +/* Set a block of memory using vectorization instructions for the
> >> > > +   aligned case.  We fill the first LENGTH bytes of the memory area
> >> > > +   starting from DSTBASE with byte constant VALUE.  ALIGN is the
> >> > > +   alignment requirement of memory.  */
> >> >
> >> > See all the comments above for the unaligend case.
> >> Fixed accordingly.
> >>
> >> >
> >> > > +static bool
> >> > > +arm_block_set_aligned_vect (rtx dstbase,
> >> > > +                     unsigned HOST_WIDE_INT length,
> >> > > +                     unsigned HOST_WIDE_INT value,
> >> > > +                     unsigned HOST_WIDE_INT align) {
> >> > > +  unsigned int i = 0, j = 0, nelt_v8, nelt_v16, nelt_mode;
> >> > > +  rtx dst, addr, mem;
> >> > > +  rtx val_elt, val_vec, reg;
> >> > > +  rtx rval[MAX_VECT_LEN];
> >> > > +  enum machine_mode mode;
> >> > > +  unsigned HOST_WIDE_INT v = value;
> >> > > +
> >> > > +  gcc_assert ((align & 0x3) == 0);
> >> > > +  nelt_v8 = GET_MODE_NUNITS (V8QImode);
> >> > > +  nelt_v16 = GET_MODE_NUNITS (V16QImode);  if (length >=
> >> > > + nelt_v16 && unaligned_access && !BYTES_BIG_ENDIAN)
> >> > > +    mode = V16QImode;
> >> > > +  else
> >> > > +    mode = V8QImode;
> >> > > +
> >> > > +  nelt_mode = GET_MODE_NUNITS (mode);  gcc_assert (length >=
> >> > > + nelt_mode);
> >> > > +  /* Skip if it isn't profitable.  */  if
> >> > > + (!arm_block_set_vect_profit_p (length, align, false, mode))
> >> > > +    return false;
> >> > > +
> >> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
> >> > > +
> >> > > +  v = sext_hwi (v, BITS_PER_WORD);  val_elt = GEN_INT (v);  for
> >> > > + (; j < nelt_mode; j++)
> >> > > +    rval[j] = val_elt;
> >> > > +
> >> > > +  reg = gen_reg_rtx (mode);
> >> > > +  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v
> (nelt_mode,
> >> > > + rval));
> >> > > +  /* Emit instruction loading the constant value.  */
> >> > > + emit_move_insn (reg, val_vec);
> >> > > +
> >> > > +  /* Handle first 16 bytes specially using vst1:v16qi instruction.
> >> > > +*/
> >> > > +  if (mode == V16QImode)
> >> > > +    {
> >> > > +      mem = adjust_automodify_address (dstbase, mode, dst, 0);
> >> > > +      emit_insn (gen_movmisalignv16qi (mem, reg));
> >> > > +      i += nelt_mode;
> >> > > +      /* Handle (8, 16) bytes leftover using vst1:v16qi again.  */
> >> > > +      if (i + nelt_v8 < length && i + nelt_v16 > length)  {
> >> > > +   emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
> >> > > +   mem = adjust_automodify_address (dstbase, mode, dst, 0);
> >> > > +   /* We are shifting bytes back, set the alignment accordingly.  */
> >> > > +   if ((length & 0x3) == 0)
> >> > > +     set_mem_align (mem, BITS_PER_UNIT * 4);
> >> > > +   else if ((length & 0x1) == 0)
> >> > > +     set_mem_align (mem, BITS_PER_UNIT * 2);
> >> > > +   else
> >> > > +     set_mem_align (mem, BITS_PER_UNIT);
> >> > > +
> >> > > +   emit_insn (gen_movmisalignv16qi (mem, reg));
> >> > > +   return true;
> >> > > + }
> >> > > +      /* Fall through for bytes leftover.  */
> >> > > +      mode = V8QImode;
> >> > > +      nelt_mode = GET_MODE_NUNITS (mode);
> >> > > +      reg = gen_lowpart (V8QImode, reg);
> >> > > +    }
> >> > > +
> >> > > +  /* Handle 8 bytes in a vector.  */  for (; (i + nelt_mode <=
> >> > > + length); i += nelt_mode)
> >> > > +    {
> >> > > +      addr = plus_constant (Pmode, dst, i);
> >> > > +      mem = adjust_automodify_address (dstbase, mode, addr, i);
> >> > > +      emit_move_insn (mem, reg);
> >> > > +    }
> >> > > +
> >> > > +  /* Handle single word leftover by shifting 4 bytes back.  We can
> >> > > +     use aligned access for this case.  */  if (i +
> >> > > + UNITS_PER_WORD == length)
> >> > > +    {
> >> > > +      addr = plus_constant (Pmode, dst, i - UNITS_PER_WORD);
> >> > > +      mem = adjust_automodify_address (dstbase, mode,
> >> > > +                                addr, i - UNITS_PER_WORD);
> >> > > +      /* We are shifting 4 bytes back, set the alignment accordingly.
> >> */
> >> > > +      if (align > UNITS_PER_WORD) set_mem_align (mem,
> >> > > + BITS_PER_UNIT * UNITS_PER_WORD);
> >> > > +
> >> > > +      emit_move_insn (mem, reg);
> >> > > +    }
> >> > > +  /* Handle (0, 4), (4, 8) bytes leftover by shifting bytes back.
> >> > > +     We have to use unaligned access for this case.  */  else if
> >> > > + (i < length)
> >> > > +    {
> >> > > +      emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
> >> > > +      mem = adjust_automodify_address (dstbase, mode, dst, 0);
> >> > > +      /* We are shifting bytes back, set the alignment accordingly.
> > */
> >> > > +      if ((length & 1) == 0)
> >> > > + set_mem_align (mem, BITS_PER_UNIT * 2);
> >> > > +      else
> >> > > + set_mem_align (mem, BITS_PER_UNIT);
> >> > > +
> >> > > +      emit_insn (gen_movmisalignv8qi (mem, reg));
> >> > > +    }
> >> > > +
> >> > > +  return true;
> >> > > +}
> >> > > +
> >> > > +/* Set a block of memory using plain strh/strb instructions, only
> >> > > +   using instructions allowed by ALIGN on processor.  We fill the
> >> > > +   first LENGTH bytes of the memory area starting from DSTBASE
> >> > > +   with byte constant VALUE.  ALIGN is the alignment requirement
> >> > > +   of memory.  */
> >> > > +static bool
> >> > > +arm_block_set_unaligned_straight (rtx dstbase,
> >> > > +                           unsigned HOST_WIDE_INT length,
> >> > > +                           unsigned HOST_WIDE_INT value,
> >> > > +                           unsigned HOST_WIDE_INT align) {
> >> > > +  unsigned int i;
> >> > > +  rtx dst, addr, mem;
> >> > > +  rtx val_exp, val_reg, reg;
> >> > > +  enum machine_mode mode;
> >> > > +  HOST_WIDE_INT v = value;
> >> > > +
> >> > > +  gcc_assert (align == 1 || align == 2);
> >> > > +
> >> > > +  if (align == 2)
> >> > > +    v |= (value << BITS_PER_UNIT);
> >> > > +
> >> > > +  v = sext_hwi (v, BITS_PER_WORD);  val_exp = GEN_INT (v);
> >> > > +  /* Skip if it isn't profitable.  */  if
> >> > > + (!arm_block_set_straight_profit_p (val_exp, length,
> >> > > +                                 align, true, false))
> >> > > +    return false;
> >> > > +
> >> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));  mode = (align == 2 ?
> >> > > + HImode : QImode);  val_reg = force_reg (SImode, val_exp);  reg
> >> > > + = gen_lowpart (mode, val_reg);
> >> > > +
> >> > > +  for (i = 0; (i + GET_MODE_SIZE (mode) <= length); i +=
> >> > > + GET_MODE_SIZE
> >> > (mode))
> >> > > +    {
> >> > > +      addr = plus_constant (Pmode, dst, i);
> >> > > +      mem = adjust_automodify_address (dstbase, mode, addr, i);
> >> > > +      emit_move_insn (mem, reg);
> >> > > +    }
> >> > > +
> >> > > +  /* Handle single byte leftover.  */  if (i + 1 == length)
> >> > > +    {
> >> > > +      reg = gen_lowpart (QImode, val_reg);
> >> > > +      addr = plus_constant (Pmode, dst, i);
> >> > > +      mem = adjust_automodify_address (dstbase, QImode, addr, i);
> >> > > +      emit_move_insn (mem, reg);
> >> > > +      i++;
> >> > > +    }
> >> > > +
> >> > > +  gcc_assert (i == length);
> >> > > +  return true;
> >> > > +}
> >> > > +
> >> > > +/* Set a block of memory using plain strd/str/strh/strb instructions,
> >> > > +   to permit unaligned copies on processors which support unaligned
> >> > > +   semantics for those instructions.  We fill the first LENGTH bytes
> >> > > +   of the memory area starting from DSTBASE with byte constant
> VALUE.
> >> > > +   ALIGN is the alignment requirement of memory.  */ static bool
> >> > > +arm_block_set_aligned_straight (rtx dstbase,
> >> > > +                         unsigned HOST_WIDE_INT length,
> >> > > +                         unsigned HOST_WIDE_INT value,
> >> > > +                         unsigned HOST_WIDE_INT align) {
> >> > > +  unsigned int i = 0;
> >> > > +  rtx dst, addr, mem;
> >> > > +  rtx val_exp, val_reg, reg;
> >> > > +  unsigned HOST_WIDE_INT v;
> >> > > +  bool use_strd_p;
> >> > > +
> >> > > +  use_strd_p = (length >= 2 * UNITS_PER_WORD && (align & 3) == 0
> >> > > +         && TARGET_LDRD && current_tune->prefer_ldrd_strd);
> >> > > +
> >> > > +  v = (value | (value << 8) | (value << 16) | (value << 24));
> >> > > + if (length < UNITS_PER_WORD)
> >> > > +    v &= (0xFFFFFFFF >> (UNITS_PER_WORD - length) *
> >> > > + BITS_PER_UNIT);
> >> > > +
> >> > > +  if (use_strd_p)
> >> > > +    v |= (v << BITS_PER_WORD);
> >> > > +  else
> >> > > +    v = sext_hwi (v, BITS_PER_WORD);
> >> > > +
> >> > > +  val_exp = GEN_INT (v);
> >> > > +  /* Skip if it isn't profitable.  */  if
> >> > > + (!arm_block_set_straight_profit_p (val_exp, length,
> >> > > +                                 align, false, use_strd_p))
> >> > > +    {
> >> > > +      /* Try without strd.  */
> >> > > +      v = (v >> BITS_PER_WORD);
> >> > > +      v = sext_hwi (v, BITS_PER_WORD);
> >> > > +      val_exp = GEN_INT (v);
> >> > > +      use_strd_p = false;
> >> > > +      if (!arm_block_set_straight_profit_p (val_exp, length,
> >> > > +                                     align, false, use_strd_p))
> >> > > + return false;
> >> > > +    }
> >> > > +
> >> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
> >> > > +  /* Handle double words using strd if possible.  */  if
> >> > > + (use_strd_p)
> >> > > +    {
> >> > > +      val_reg = force_reg (DImode, val_exp);
> >> > > +      reg = val_reg;
> >> > > +      for (; (i + 8 <= length); i += 8) {
> >> > > +   addr = plus_constant (Pmode, dst, i);
> >> > > +   mem = adjust_automodify_address (dstbase, DImode, addr, i);
> >> > > +   emit_move_insn (mem, reg);
> >> > > + }
> >> > > +    }
> >> > > +  else
> >> > > +    val_reg = force_reg (SImode, val_exp);
> >> > > +
> >> > > +  /* Handle words.  */
> >> > > +  reg = (use_strd_p ? gen_lowpart (SImode, val_reg) : val_reg);
> >> > > + for (; (i + 4 <= length); i += 4)
> >> > > +    {
> >> > > +      addr = plus_constant (Pmode, dst, i);
> >> > > +      mem = adjust_automodify_address (dstbase, SImode, addr, i);
> >> > > +      if ((align & 3) == 0)
> >> > > + emit_move_insn (mem, reg);
> >> > > +      else
> >> > > + emit_insn (gen_unaligned_storesi (mem, reg));
> >> > > +    }
> >> > > +
> >> > > +  /* Merge last pair of STRH and STRB into a STR if possible.
> >> > > + */ if (unaligned_access && i > 0 && (i + 3) == length)
> >> > > +    {
> >> > > +      addr = plus_constant (Pmode, dst, i - 1);
> >> > > +      mem = adjust_automodify_address (dstbase, SImode, addr, i - 1);
> >> > > +      /* We are shifting one byte back, set the alignment
> > accordingly.
> >> */
> >> > > +      if ((align & 1) == 0)
> >> > > + set_mem_align (mem, BITS_PER_UNIT);
> >> > > +
> >> > > +      /* Most likely this is an unaligned access, and we can't
> >> > > + tell
> > at
> >> > > +  compilation time.  */
> >> > > +      emit_insn (gen_unaligned_storesi (mem, reg));
> >> > > +      return true;
> >> > > +    }
> >> > > +
> >> > > +  /* Handle half word leftover.  */  if (i + 2 <= length)
> >> > > +    {
> >> > > +      reg = gen_lowpart (HImode, val_reg);
> >> > > +      addr = plus_constant (Pmode, dst, i);
> >> > > +      mem = adjust_automodify_address (dstbase, HImode, addr, i);
> >> > > +      if ((align & 1) == 0)
> >> > > + emit_move_insn (mem, reg);
> >> > > +      else
> >> > > + emit_insn (gen_unaligned_storehi (mem, reg));
> >> > > +
> >> > > +      i += 2;
> >> > > +    }
> >> > > +
> >> > > +  /* Handle single byte leftover.  */  if (i + 1 == length)
> >> > > +    {
> >> > > +      reg = gen_lowpart (QImode, val_reg);
> >> > > +      addr = plus_constant (Pmode, dst, i);
> >> > > +      mem = adjust_automodify_address (dstbase, QImode, addr, i);
> >> > > +      emit_move_insn (mem, reg);
> >> > > +    }
> >> > > +
> >> > > +  return true;
> >> > > +}
> >> > > +
> >> > > +/* Set a block of memory using vectorization instructions for both
> >> > > +   aligned and unaligned cases.  We fill the first LENGTH bytes of
> >> > > +   the memory area starting from DSTBASE with byte constant VALUE.
> >> > > +   ALIGN is the alignment requirement of memory.  */ static bool
> >> > > +arm_block_set_vect (rtx dstbase,
> >> > > +             unsigned HOST_WIDE_INT length,
> >> > > +             unsigned HOST_WIDE_INT value,
> >> > > +             unsigned HOST_WIDE_INT align) {
> >> > > +  /* Check whether we need to use unaligned store instruction.
> >> > > +*/
> >> > > +  if (((align & 3) != 0 || (length & 3) != 0)
> >> > > +      /* Check whether unaligned store instruction is available.  */
> >> > > +      && (!unaligned_access || BYTES_BIG_ENDIAN))
> >> > > +    return false;
> >> >
> >> > Huh!  vst1.8 can work for unaligned accesses even when hw alignment
> >> > checking is strict.
> >> Emm, all movmisalign patterns are guarded by "!BYTES_BIG_ENDIAN &&
> >> unaligned_access", so vst1.8 instructions can't be recognized this
> >> way for now.
> >> I agree that it's too strict, but that's another problem, I think.
> 
> That was introduced to "fix up" the issue with another test IIRC.
> That's probably not related to this particular patch. It's the other thread that's
> been ongoing with Maciej, so let's continue it there.
> 
> >>
> >> >
> >> > > +
> >> > > +  if ((align & 3) == 0)
> >> > > +    return arm_block_set_aligned_vect (dstbase, length, value,
> >> > > +align);
> >> > > +  else
> >> > > +    return arm_block_set_unaligned_vect (dstbase, length, value,
> >> > > +align); }
> >> > > +
> >> > > +/* Expand string store operation.  Firstly we try to do that by using
> >> > > +   vectorization instructions, then try with ARM unaligned access and
> >> > > +   double-word store if profitable.  OPERANDS[0] is the destination,
> >> > > +   OPERANDS[1] is the number of bytes, operands[2] is the value to
> >> > > +   initialize the memory, OPERANDS[3] is the known alignment of the
> >> > > +   destination.  */
> >> > > +bool
> >> > > +arm_gen_setmem (rtx *operands)
> >> > > +{
> >> > > +  rtx dstbase = operands[0];
> >> > > +  unsigned HOST_WIDE_INT length;
> >> > > +  unsigned HOST_WIDE_INT value;
> >> > > +  unsigned HOST_WIDE_INT align;
> >> > > +
> >> > > +  if (!CONST_INT_P (operands[2]) || !CONST_INT_P (operands[1]))
> >> > > +    return false;
> >> > > +
> >> > > +  length = UINTVAL (operands[1]);  if (length > 64)
> >> > > +    return false;
> >> > > +
> >> > > +  value = (UINTVAL (operands[2]) & 0xFF);  align = UINTVAL
> >> > > + (operands[3]);  if (TARGET_NEON && length >= 8
> >> > > +      && current_tune->string_ops_prefer_neon
> >> > > +      && arm_block_set_vect (dstbase, length, value, align))
> >> > > +    return true;
> >> > > +
> >> > > +  if (!unaligned_access && (align & 3) != 0)
> >> > > +    return arm_block_set_unaligned_straight (dstbase, length,
> >> > > + value, align);
> >> > > +
> >> > > +  return arm_block_set_aligned_straight (dstbase, length, value,
> >> > > +align); }
> >> > > +
> >> > >  /* Implement the TARGET_ASAN_SHADOW_OFFSET hook.  */
> >> > >
> >> > >  static unsigned HOST_WIDE_INT
> >> > > Index: gcc/config/arm/arm-protos.h
> >> > >
> >> >
> >>
> ==========================================================
> >> > =========
> >> > > --- gcc/config/arm/arm-protos.h   (revision 209852)
> >> > > +++ gcc/config/arm/arm-protos.h   (working copy)
> >> > > @@ -277,6 +277,8 @@ struct tune_params
> >> > >    /* Prefer 32-bit encoding instead of 16-bit encoding where
> >> > > subset of
> >> flags
> >> > >       would be set.  */
> >> > >    bool disparage_partial_flag_setting_t16_encodings;
> >> > > +  /* Prefer to inline string operations like memset by using Neon.
> >> > > + */  bool string_ops_prefer_neon;
> >> > >  };
> >> > >
> >> > >  extern const struct tune_params *current_tune; @@ -289,6 +291,7
> >> > > @@ extern void arm_emit_coreregs_64bit_shift (enum rt  extern
> >> > > bool arm_validize_comparison (rtx *, rtx *, rtx *);  #endif /*
> >> > > RTX_CODE */
> >> > >
> >> > > +extern bool arm_gen_setmem (rtx *);
> >> > >  extern void arm_expand_vec_perm (rtx target, rtx op0, rtx op1,
> >> > > rtx sel);  extern bool arm_expand_vec_perm_const (rtx target, rtx
> >> > > op0, rtx op1, rtx sel);
> >> > >
> >> > > Index: gcc/config/arm/arm.md
> >> > >
> >> >
> >>
> ==========================================================
> >> > =========
> >> > > --- gcc/config/arm/arm.md (revision 209852)
> >> > > +++ gcc/config/arm/arm.md (working copy)
> >> > > @@ -7555,6 +7555,20 @@
> >> > >  })
> >> > >
> >> > >
> >> > > +(define_expand "setmemsi"
> >> > > +  [(match_operand:BLK 0 "general_operand" "")
> >> > > +   (match_operand:SI 1 "const_int_operand" "")
> >> > > +   (match_operand:SI 2 "const_int_operand" "")
> >> > > +   (match_operand:SI 3 "const_int_operand" "")]
> >> > > +  "TARGET_32BIT"
> >> > > +{
> >> > > +  if (arm_gen_setmem (operands))
> >> > > +    DONE;
> >> > > +
> >> > > +  FAIL;
> >> > > +})
> >> > > +
> >> > > +
> >> > >  ;; Move a block of memory if it is word aligned and MORE than 2
> >> > > words
> >> > long.
> >> > >  ;; We could let this apply for blocks of less than this, but it
> >> > > clobbers so  ;; many registers that there is then probably a
> >> > > better
> > way.
> >> > > Index: gcc/testsuite/gcc.target/arm/memset-inline-6.c
> >> > >
> >> >
> >>
> ==========================================================
> >> > =========
> >> > > --- gcc/testsuite/gcc.target/arm/memset-inline-6.c        (revision 0)
> >> > > +++ gcc/testsuite/gcc.target/arm/memset-inline-6.c        (revision 0)
> >> >
> >> > Have you tested these when the compiler was configured with
> >> > "--with- cpu=cortex-a9"?
> >> Here is the tricky part.
> >> For a compiler configured with "--with-tune=cortex-a9", the
> >> neon-related cases (4/5/6/8/9) would fail because we have no way to
> >> determine here that we are compiling with the cortex-a9 tune.
> 
> There is no easy way of fixing this.
> 
> >> For a compiler configured with "--with-cpu=cortex-a9", the test cases
> >> would pass, but I think this is a mistake.  It reveals an issue:
> >> GCC won't pass "-mcpu=cortex-a9" to cc1, so the cortex-a8 tune is
> >> selected.  It just makes no sense.
> >> With these issues, I didn't change the tests for now.
> 
> I think we'll just have to take the hit on noise in other configurations.
> 
> > Precisely, I configured gcc with options "--with-arch=armv7-a
> > --with-cpu|--with-tune=cortex-a9".
> > I read gcc documents and realized that "-mcpu" is ignored when
> > "-march" is specified.  I don't know why gcc acts in this manner, but
> > it leads to inconsistent configuration/command line behavior.
> > If we configure GCC with "--with-arch=armv7-a --with-cpu=cortex-a9",
> > then only "-march=armv7-a" is passed to cc1.
> 
> That kind of configuration is warned today but no one pays attention to that.
> See James's proposal to promote this to an error.
> 
> > If we compile with "-march=armv7-a -mcpu=cortex-a9", then gcc works
> > fine and passes "-march=armv7-a -mcpu=cortex-a9" to cc1.
> 
> That should be fine.
> 
> >
> > Even weirder, cc1 warns that "switch -mcpu=cortex-m4 conflicts with
> > -march=armv7-m switch".
> 
> 
> This is OK unless there are objections in the next 24 hours.
> 
> Please watch out for any fallout - certainly rebase and retest before
> applying, and please post the rebased version for archival purposes.
> 

Hi Ramana,
This is the rebased patch; there is no conflict against the latest trunk.  I am still running some tests.  Is it OK to commit if the tests pass?
Also, it depends on the patch at https://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html; I will update that patch too.
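
For reference, here is a minimal example of the kind of call the new
setmemsi expander is meant to catch (illustrative only; the buffer name
and size are made up, but the length and value have to be compile-time
constants and the length is capped at 64 bytes):

  /* Small, constant-size memset that arm_gen_setmem can expand inline.  */
  char buf[32];

  void
  clear_buf (void)
  {
    __builtin_memset (buf, 0, sizeof (buf));
  }

With -O2 and a tune that sets string_ops_prefer_neon (for example
-mcpu=cortex-a15 -mfpu=neon), this should now be expanded inline through
the setmemsi pattern instead of ending up as a call to memset; the exact
instruction sequence depends on the alignment and the tuning parameters.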

Thanks,
bin

[-- Attachment #2: j1328-20140704.txt --]
[-- Type: text/plain, Size: 56384 bytes --]

Index: gcc/config/arm/arm-protos.h
===================================================================
--- gcc/config/arm/arm-protos.h	(revision 212295)
+++ gcc/config/arm/arm-protos.h	(working copy)
@@ -277,6 +277,10 @@ struct tune_params
   /* Prefer 32-bit encoding instead of 16-bit encoding where subset of flags
      would be set.  */
   bool disparage_partial_flag_setting_t16_encodings;
+  /* Prefer to inline string operations like memset by using Neon.  */
+  bool string_ops_prefer_neon;
+  /* Maximum number of instructions to inline calls to memset.  */
+  int max_insns_inline_memset;
 };
 
 extern const struct tune_params *current_tune;
@@ -289,6 +293,7 @@ extern void arm_emit_coreregs_64bit_shift (enum rt
 extern bool arm_validize_comparison (rtx *, rtx *, rtx *);
 #endif /* RTX_CODE */
 
+extern bool arm_gen_setmem (rtx *);
 extern void arm_expand_vec_perm (rtx target, rtx op0, rtx op1, rtx sel);
 extern bool arm_expand_vec_perm_const (rtx target, rtx op0, rtx op1, rtx sel);
 
Index: gcc/config/arm/arm.c
===================================================================
--- gcc/config/arm/arm.c	(revision 212295)
+++ gcc/config/arm/arm.c	(working copy)
@@ -1588,34 +1588,38 @@ const struct tune_params arm_slowmul_tune =
 {
   arm_slowmul_rtx_costs,
   NULL,
-  NULL,						/* Sched adj cost.  */
-  3,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  3,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  true,						/* Prefer constant pool.  */
+  true,					/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_fastmul_tune =
 {
   arm_fastmul_rtx_costs,
   NULL,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  true,						/* Prefer constant pool.  */
+  true,					/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 /* StrongARM has early execution of branches, so a sequence that is worth
@@ -1625,17 +1629,19 @@ const struct tune_params arm_strongarm_tune =
 {
   arm_fastmul_rtx_costs,
   NULL,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  3,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  3,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  true,						/* Prefer constant pool.  */
+  true,					/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_xscale_tune =
@@ -1643,50 +1649,56 @@ const struct tune_params arm_xscale_tune =
   arm_xscale_rtx_costs,
   NULL,
   xscale_sched_adjust_cost,
-  2,						/* Constant limit.  */
-  3,						/* Max cond insns.  */
+  2,					/* Constant limit.  */
+  3,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  true,						/* Prefer constant pool.  */
+  true,					/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_9e_tune =
 {
   arm_9e_rtx_costs,
   NULL,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  true,						/* Prefer constant pool.  */
+  true,					/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_v6t2_tune =
 {
   arm_9e_rtx_costs,
   NULL,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 /* Generic Cortex tuning.  Use more specific tunings if appropriate.  */
@@ -1694,34 +1706,38 @@ const struct tune_params arm_cortex_tune =
 {
   arm_9e_rtx_costs,
   &generic_extra_costs,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a8_tune =
 {
   arm_9e_rtx_costs,
   &cortexa8_extra_costs,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  true,					/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a7_tune =
@@ -1729,67 +1745,75 @@ const struct tune_params arm_cortex_a7_tune =
   arm_9e_rtx_costs,
   &cortexa7_extra_costs,
   NULL,
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,			/* Vectorizer costs.  */
-  false,					/* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  true,					/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a15_tune =
 {
   arm_9e_rtx_costs,
   &cortexa15_extra_costs,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  2,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  2,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  true,						/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  true, true                                    /* Prefer 32-bit encodings.  */
+  true,					/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  true, true,				/* Prefer 32-bit encodings.  */
+  true,					/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a53_tune =
 {
   arm_9e_rtx_costs,
   &cortexa53_extra_costs,
-  NULL,						/* Scheduler cost adjustment.  */
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  NULL,					/* Scheduler cost adjustment.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,			/* Vectorizer costs.  */
-  false,					/* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a57_tune =
 {
   arm_9e_rtx_costs,
   &cortexa57_extra_costs,
-  NULL,                                         /* Scheduler cost adjustment.  */
-  1,                                           /* Constant limit.  */
-  2,                                           /* Max cond insns.  */
+  NULL,					/* Scheduler cost adjustment.  */
+  1,					/* Constant limit.  */
+  2,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  false,                                       /* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  true,                                       /* Prefer LDRD/STRD.  */
-  {true, true},                                /* Prefer non short circuit.  */
-  &arm_default_vec_cost,                       /* Vectorizer costs.  */
-  false,                                       /* Prefer Neon for 64-bits bitops.  */
-  true, true                                   /* Prefer 32-bit encodings.  */
+  true,					/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  true, true,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 /* Branches can be dual-issued on Cortex-A5, so conditional execution is
@@ -1799,17 +1823,19 @@ const struct tune_params arm_cortex_a5_tune =
 {
   arm_9e_rtx_costs,
   NULL,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  1,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  1,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_cortex_a5_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {false, false},				/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {false, false},			/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  true,					/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a9_tune =
@@ -1817,16 +1843,18 @@ const struct tune_params arm_cortex_a9_tune =
   arm_9e_rtx_costs,
   &cortexa9_extra_costs,
   cortex_a9_sched_adjust_cost,
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_BENEFICIAL(4,32,32),
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a12_tune =
@@ -1834,16 +1862,18 @@ const struct tune_params arm_cortex_a12_tune =
   arm_9e_rtx_costs,
   &cortexa12_extra_costs,
   NULL,
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_BENEFICIAL(4,32,32),
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  true,						/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  true,					/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  true,					/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 /* armv7m tuning.  On Cortex-M4 cores for example, MOVW/MOVT take a single
@@ -1857,17 +1887,19 @@ const struct tune_params arm_v7m_tune =
 {
   arm_9e_rtx_costs,
   &v7m_extra_costs,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  2,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  2,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  true,						/* Prefer constant pool.  */
+  true,					/* Prefer constant pool.  */
   arm_cortex_m_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {false, false},				/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {false, false},			/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 /* The arm_v6m_tune is duplicated from arm_cortex_tune, rather than
@@ -1876,17 +1908,19 @@ const struct tune_params arm_v6m_tune =
 {
   arm_9e_rtx_costs,
   NULL,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {false, false},				/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {false, false},			/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_fa726te_tune =
@@ -1894,16 +1928,18 @@ const struct tune_params arm_fa726te_tune =
   arm_9e_rtx_costs,
   NULL,
   fa726te_sched_adjust_cost,
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  true,						/* Prefer constant pool.  */
+  true,					/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 
@@ -16802,6 +16838,14 @@ arm_const_double_inline_cost (rtx val)
 			      NULL_RTX, NULL_RTX, 0, 0));
 }
 
+/* Cost of loading a SImode constant.  */
+static inline int
+arm_const_inline_cost (enum rtx_code code, rtx val)
+{
+  return arm_gen_constant (code, SImode, NULL_RTX, INTVAL (val),
+                           NULL_RTX, NULL_RTX, 1, 0);
+}
+
 /* Return true if it is worthwhile to split a 64-bit constant into two
    32-bit operations.  This is the case if optimizing for size, or
    if we have load delay slots, or if one 32-bit part can be done with
@@ -31417,6 +31468,519 @@ arm_validize_comparison (rtx *comparison, rtx * op
 
 }
 
+/* Maximum number of instructions to set block of memory.  */
+static int
+arm_block_set_max_insns (void)
+{
+  if (optimize_function_for_size_p (cfun))
+    return 4;
+  else
+    return current_tune->max_insns_inline_memset;
+}
+
+/* Return TRUE if it's profitable to set block of memory for
+   non-vectorized case.  VAL is the value to set the memory
+   with.  LENGTH is the number of bytes to set.  ALIGN is the
+   alignment of the destination memory in bytes.  UNALIGNED_P
+   is TRUE if we can only set the memory with instructions
+   meeting alignment requirements.  USE_STRD_P is TRUE if we
+   can use strd to set the memory.  */
+static bool
+arm_block_set_non_vect_profit_p (rtx val,
+				 unsigned HOST_WIDE_INT length,
+				 unsigned HOST_WIDE_INT align,
+				 bool unaligned_p, bool use_strd_p)
+{
+  int num = 0;
+  /* For leftovers in bytes of 0-7, we can set the memory block using
+     strb/strh/str with minimum instruction number.  */
+  const int leftover[8] = {0, 1, 1, 2, 1, 2, 2, 3};
+
+  if (unaligned_p)
+    {
+      num = arm_const_inline_cost (SET, val);
+      num += length / align + length % align;
+    }
+  else if (use_strd_p)
+    {
+      num = arm_const_double_inline_cost (val);
+      num += (length >> 3) + leftover[length & 7];
+    }
+  else
+    {
+      num = arm_const_inline_cost (SET, val);
+      num += (length >> 2) + leftover[length & 3];
+    }
+
+  /* We may be able to combine last pair STRH/STRB into a single STR
+     by shifting one byte back.  */
+  if (unaligned_access && length > 3 && (length & 3) == 3)
+    num--;
+
+  return (num <= arm_block_set_max_insns ());
+}
+
+/* Return TRUE if it's profitable to set block of memory for
+   vectorized case.  LENGTH is the number of bytes to set.
+   ALIGN is the alignment of destination memory in bytes.
+   MODE is the vector mode used to set the memory.  */
+static bool
+arm_block_set_vect_profit_p (unsigned HOST_WIDE_INT length,
+			     unsigned HOST_WIDE_INT align,
+			     enum machine_mode mode)
+{
+  int num;
+  bool unaligned_p = ((align & 3) != 0);
+  unsigned int nelt = GET_MODE_NUNITS (mode);
+
+  /* Instruction loading constant value.  */
+  num = 1;
+  /* Instructions storing the memory.  */
+  num += (length + nelt - 1) / nelt;
+  /* Instructions adjusting the address expression.  Only need to
+     adjust address expression if it's 4 bytes aligned and bytes
+     leftover can only be stored by mis-aligned store instruction.  */
+  if (!unaligned_p && (length & 3) != 0)
+    num++;
+
+  /* Store the first 16 bytes using vst1:v16qi for the aligned case.  */
+  if (!unaligned_p && mode == V16QImode)
+    num--;
+
+  return (num <= arm_block_set_max_insns ());
+}
+
+/* Set a block of memory using vectorization instructions for the
+   unaligned case.  We fill the first LENGTH bytes of the memory
+   area starting from DSTBASE with byte constant VALUE.  ALIGN is
+   the alignment requirement of memory.  Return TRUE if succeeded.  */
+static bool
+arm_block_set_unaligned_vect (rtx dstbase,
+			      unsigned HOST_WIDE_INT length,
+			      unsigned HOST_WIDE_INT value,
+			      unsigned HOST_WIDE_INT align)
+{
+  unsigned int i, j, nelt_v16, nelt_v8, nelt_mode;
+  rtx dst, mem;
+  rtx val_elt, val_vec, reg;
+  rtx rval[MAX_VECT_LEN];
+  rtx (*gen_func) (rtx, rtx);
+  enum machine_mode mode;
+  unsigned HOST_WIDE_INT v = value;
+
+  gcc_assert ((align & 0x3) != 0);
+  nelt_v8 = GET_MODE_NUNITS (V8QImode);
+  nelt_v16 = GET_MODE_NUNITS (V16QImode);
+  if (length >= nelt_v16)
+    {
+      mode = V16QImode;
+      gen_func = gen_movmisalignv16qi;
+    }
+  else
+    {
+      mode = V8QImode;
+      gen_func = gen_movmisalignv8qi;
+    }
+  nelt_mode = GET_MODE_NUNITS (mode);
+  gcc_assert (length >= nelt_mode);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_vect_profit_p (length, align, mode))
+    return false;
+
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+  mem = adjust_automodify_address (dstbase, mode, dst, 0);
+
+  v = sext_hwi (v, BITS_PER_WORD);
+  val_elt = GEN_INT (v);
+  for (j = 0; j < nelt_mode; j++)
+    rval[j] = val_elt;
+
+  reg = gen_reg_rtx (mode);
+  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval));
+  /* Emit instruction loading the constant value.  */
+  emit_move_insn (reg, val_vec);
+
+  /* Handle nelt_mode bytes in a vector.  */
+  for (i = 0; (i + nelt_mode <= length); i += nelt_mode)
+    {
+      emit_insn ((*gen_func) (mem, reg));
+      if (i + 2 * nelt_mode <= length)
+	emit_insn (gen_add2_insn (dst, GEN_INT (nelt_mode)));
+    }
+
+  /* If there are not less than nelt_v8 bytes leftover, we must be in
+     V16QI mode.  */
+  gcc_assert ((i + nelt_v8) > length || mode == V16QImode);
+
+  /* Handle (8, 16) bytes leftover.  */
+  if (i + nelt_v8 < length)
+    {
+      emit_insn (gen_add2_insn (dst, GEN_INT (length - i)));
+      /* We are shifting bytes back, set the alignment accordingly.  */
+      if ((length & 1) != 0 && align >= 2)
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      emit_insn (gen_movmisalignv16qi (mem, reg));
+    }
+  /* Handle (0, 8] bytes leftover.  */
+  else if (i < length && i + nelt_v8 >= length)
+    {
+      if (mode == V16QImode)
+	{
+	  reg = gen_lowpart (V8QImode, reg);
+	  mem = adjust_automodify_address (dstbase, V8QImode, dst, 0);
+	}
+      emit_insn (gen_add2_insn (dst, GEN_INT ((length - i)
+					      + (nelt_mode - nelt_v8))));
+      /* We are shifting bytes back, set the alignment accordingly.  */
+      if ((length & 1) != 0 && align >= 2)
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      emit_insn (gen_movmisalignv8qi (mem, reg));
+    }
+
+  return true;
+}
+
+/* Set a block of memory using vectorization instructions for the
+   aligned case.  We fill the first LENGTH bytes of the memory area
+   starting from DSTBASE with byte constant VALUE.  ALIGN is the
+   alignment requirement of memory.  Return TRUE if succeeded.  */
+static bool
+arm_block_set_aligned_vect (rtx dstbase,
+			    unsigned HOST_WIDE_INT length,
+			    unsigned HOST_WIDE_INT value,
+			    unsigned HOST_WIDE_INT align)
+{
+  unsigned int i, j, nelt_v8, nelt_v16, nelt_mode;
+  rtx dst, addr, mem;
+  rtx val_elt, val_vec, reg;
+  rtx rval[MAX_VECT_LEN];
+  enum machine_mode mode;
+  unsigned HOST_WIDE_INT v = value;
+
+  gcc_assert ((align & 0x3) == 0);
+  nelt_v8 = GET_MODE_NUNITS (V8QImode);
+  nelt_v16 = GET_MODE_NUNITS (V16QImode);
+  if (length >= nelt_v16 && unaligned_access && !BYTES_BIG_ENDIAN)
+    mode = V16QImode;
+  else
+    mode = V8QImode;
+
+  nelt_mode = GET_MODE_NUNITS (mode);
+  gcc_assert (length >= nelt_mode);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_vect_profit_p (length, align, mode))
+    return false;
+
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+
+  v = sext_hwi (v, BITS_PER_WORD);
+  val_elt = GEN_INT (v);
+  for (j = 0; j < nelt_mode; j++)
+    rval[j] = val_elt;
+
+  reg = gen_reg_rtx (mode);
+  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval));
+  /* Emit instruction loading the constant value.  */
+  emit_move_insn (reg, val_vec);
+
+  i = 0;
+  /* Handle first 16 bytes specially using vst1:v16qi instruction.  */
+  if (mode == V16QImode)
+    {
+      mem = adjust_automodify_address (dstbase, mode, dst, 0);
+      emit_insn (gen_movmisalignv16qi (mem, reg));
+      i += nelt_mode;
+      /* Handle (8, 16) bytes leftover using vst1:v16qi again.  */
+      if (i + nelt_v8 < length && i + nelt_v16 > length)
+	{
+	  emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
+	  mem = adjust_automodify_address (dstbase, mode, dst, 0);
+	  /* We are shifting bytes back, set the alignment accordingly.  */
+	  if ((length & 0x3) == 0)
+	    set_mem_align (mem, BITS_PER_UNIT * 4);
+	  else if ((length & 0x1) == 0)
+	    set_mem_align (mem, BITS_PER_UNIT * 2);
+	  else
+	    set_mem_align (mem, BITS_PER_UNIT);
+
+	  emit_insn (gen_movmisalignv16qi (mem, reg));
+	  return true;
+	}
+      /* Fall through for bytes leftover.  */
+      mode = V8QImode;
+      nelt_mode = GET_MODE_NUNITS (mode);
+      reg = gen_lowpart (V8QImode, reg);
+    }
+
+  /* Handle 8 bytes in a vector.  */
+  for (; (i + nelt_mode <= length); i += nelt_mode)
+    {
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, mode, addr, i);
+      emit_move_insn (mem, reg);
+    }
+
+  /* Handle single word leftover by shifting 4 bytes back.  We can
+     use aligned access for this case.  */
+  if (i + UNITS_PER_WORD == length)
+    {
+      addr = plus_constant (Pmode, dst, i - UNITS_PER_WORD);
+      mem = adjust_automodify_address (dstbase, mode,
+				       addr, i - UNITS_PER_WORD);
+      /* We are shifting 4 bytes back, set the alignment accordingly.  */
+      if (align > UNITS_PER_WORD)
+	set_mem_align (mem, BITS_PER_UNIT * UNITS_PER_WORD);
+
+      emit_move_insn (mem, reg);
+    }
+  /* Handle (0, 4), (4, 8) bytes leftover by shifting bytes back.
+     We have to use unaligned access for this case.  */
+  else if (i < length)
+    {
+      emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
+      mem = adjust_automodify_address (dstbase, mode, dst, 0);
+      /* We are shifting bytes back, set the alignment accordingly.  */
+      if ((length & 1) == 0)
+	set_mem_align (mem, BITS_PER_UNIT * 2);
+      else
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      emit_insn (gen_movmisalignv8qi (mem, reg));
+    }
+
+  return true;
+}
+
+/* Set a block of memory using plain strh/strb instructions, only
+   using instructions allowed by ALIGN on processor.  We fill the
+   first LENGTH bytes of the memory area starting from DSTBASE
+   with byte constant VALUE.  ALIGN is the alignment requirement
+   of memory.  */
+static bool
+arm_block_set_unaligned_non_vect (rtx dstbase,
+				  unsigned HOST_WIDE_INT length,
+				  unsigned HOST_WIDE_INT value,
+				  unsigned HOST_WIDE_INT align)
+{
+  unsigned int i;
+  rtx dst, addr, mem;
+  rtx val_exp, val_reg, reg;
+  enum machine_mode mode;
+  HOST_WIDE_INT v = value;
+
+  gcc_assert (align == 1 || align == 2);
+
+  if (align == 2)
+    v |= (value << BITS_PER_UNIT);
+
+  v = sext_hwi (v, BITS_PER_WORD);
+  val_exp = GEN_INT (v);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_non_vect_profit_p (val_exp, length,
+					align, true, false))
+    return false;
+
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+  mode = (align == 2 ? HImode : QImode);
+  val_reg = force_reg (SImode, val_exp);
+  reg = gen_lowpart (mode, val_reg);
+
+  for (i = 0; (i + GET_MODE_SIZE (mode) <= length); i += GET_MODE_SIZE (mode))
+    {
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, mode, addr, i);
+      emit_move_insn (mem, reg);
+    }
+
+  /* Handle single byte leftover.  */
+  if (i + 1 == length)
+    {
+      reg = gen_lowpart (QImode, val_reg);
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, QImode, addr, i);
+      emit_move_insn (mem, reg);
+      i++;
+    }
+
+  gcc_assert (i == length);
+  return true;
+}
+
+/* Set a block of memory using plain strd/str/strh/strb instructions,
+   to permit unaligned copies on processors which support unaligned
+   semantics for those instructions.  We fill the first LENGTH bytes
+   of the memory area starting from DSTBASE with byte constant VALUE.
+   ALIGN is the alignment requirement of memory.  */
+static bool
+arm_block_set_aligned_non_vect (rtx dstbase,
+				unsigned HOST_WIDE_INT length,
+				unsigned HOST_WIDE_INT value,
+				unsigned HOST_WIDE_INT align)
+{
+  unsigned int i;
+  rtx dst, addr, mem;
+  rtx val_exp, val_reg, reg;
+  unsigned HOST_WIDE_INT v;
+  bool use_strd_p;
+
+  use_strd_p = (length >= 2 * UNITS_PER_WORD && (align & 3) == 0
+		&& TARGET_LDRD && current_tune->prefer_ldrd_strd);
+
+  v = (value | (value << 8) | (value << 16) | (value << 24));
+  if (length < UNITS_PER_WORD)
+    v &= (0xFFFFFFFF >> (UNITS_PER_WORD - length) * BITS_PER_UNIT);
+
+  if (use_strd_p)
+    v |= (v << BITS_PER_WORD);
+  else
+    v = sext_hwi (v, BITS_PER_WORD);
+
+  val_exp = GEN_INT (v);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_non_vect_profit_p (val_exp, length,
+					align, false, use_strd_p))
+    {
+      if (!use_strd_p)
+	return false;
+
+      /* Try without strd.  */
+      v = (v >> BITS_PER_WORD);
+      v = sext_hwi (v, BITS_PER_WORD);
+      val_exp = GEN_INT (v);
+      use_strd_p = false;
+      if (!arm_block_set_non_vect_profit_p (val_exp, length,
+					    align, false, use_strd_p))
+	return false;
+    }
+
+  i = 0;
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+  /* Handle double words using strd if possible.  */
+  if (use_strd_p)
+    {
+      val_reg = force_reg (DImode, val_exp);
+      reg = val_reg;
+      for (; (i + 8 <= length); i += 8)
+	{
+	  addr = plus_constant (Pmode, dst, i);
+	  mem = adjust_automodify_address (dstbase, DImode, addr, i);
+	  emit_move_insn (mem, reg);
+	}
+    }
+  else
+    val_reg = force_reg (SImode, val_exp);
+
+  /* Handle words.  */
+  reg = (use_strd_p ? gen_lowpart (SImode, val_reg) : val_reg);
+  for (; (i + 4 <= length); i += 4)
+    {
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, SImode, addr, i);
+      if ((align & 3) == 0)
+	emit_move_insn (mem, reg);
+      else
+	emit_insn (gen_unaligned_storesi (mem, reg));
+    }
+
+  /* Merge last pair of STRH and STRB into a STR if possible.  */
+  if (unaligned_access && i > 0 && (i + 3) == length)
+    {
+      addr = plus_constant (Pmode, dst, i - 1);
+      mem = adjust_automodify_address (dstbase, SImode, addr, i - 1);
+      /* We are shifting one byte back, set the alignment accordingly.  */
+      if ((align & 1) == 0)
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      /* Most likely this is an unaligned access, and we can't tell at
+	 compilation time.  */
+      emit_insn (gen_unaligned_storesi (mem, reg));
+      return true;
+    }
+
+  /* Handle half word leftover.  */
+  if (i + 2 <= length)
+    {
+      reg = gen_lowpart (HImode, val_reg);
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, HImode, addr, i);
+      if ((align & 1) == 0)
+	emit_move_insn (mem, reg);
+      else
+	emit_insn (gen_unaligned_storehi (mem, reg));
+
+      i += 2;
+    }
+
+  /* Handle single byte leftover.  */
+  if (i + 1 == length)
+    {
+      reg = gen_lowpart (QImode, val_reg);
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, QImode, addr, i);
+      emit_move_insn (mem, reg);
+    }
+
+  return true;
+}
+
+/* Set a block of memory using vectorization instructions for both
+   aligned and unaligned cases.  We fill the first LENGTH bytes of
+   the memory area starting from DSTBASE with byte constant VALUE.
+   ALIGN is the alignment requirement of memory.  */
+static bool
+arm_block_set_vect (rtx dstbase,
+		    unsigned HOST_WIDE_INT length,
+		    unsigned HOST_WIDE_INT value,
+		    unsigned HOST_WIDE_INT align)
+{
+  /* Check whether we need to use unaligned store instruction.  */
+  if (((align & 3) != 0 || (length & 3) != 0)
+      /* Check whether unaligned store instruction is available.  */
+      && (!unaligned_access || BYTES_BIG_ENDIAN))
+    return false;
+
+  if ((align & 3) == 0)
+    return arm_block_set_aligned_vect (dstbase, length, value, align);
+  else
+    return arm_block_set_unaligned_vect (dstbase, length, value, align);
+}
+
+/* Expand string store operation.  Firstly we try to do that by using
+   vectorization instructions, then try with ARM unaligned access and
+   double-word store if profitable.  OPERANDS[0] is the destination,
+   OPERANDS[1] is the number of bytes, OPERANDS[2] is the value to
+   initialize the memory, OPERANDS[3] is the known alignment of the
+   destination.  */
+bool
+arm_gen_setmem (rtx *operands)
+{
+  rtx dstbase = operands[0];
+  unsigned HOST_WIDE_INT length;
+  unsigned HOST_WIDE_INT value;
+  unsigned HOST_WIDE_INT align;
+
+  if (!CONST_INT_P (operands[2]) || !CONST_INT_P (operands[1]))
+    return false;
+
+  length = UINTVAL (operands[1]);
+  if (length > 64)
+    return false;
+
+  value = (UINTVAL (operands[2]) & 0xFF);
+  align = UINTVAL (operands[3]);
+  if (TARGET_NEON && length >= 8
+      && current_tune->string_ops_prefer_neon
+      && arm_block_set_vect (dstbase, length, value, align))
+    return true;
+
+  if (!unaligned_access && (align & 3) != 0)
+    return arm_block_set_unaligned_non_vect (dstbase, length, value, align);
+
+  return arm_block_set_aligned_non_vect (dstbase, length, value, align);
+}
+
 /* Implement the TARGET_ASAN_SHADOW_OFFSET hook.  */
 
 static unsigned HOST_WIDE_INT
Index: gcc/config/arm/arm.md
===================================================================
--- gcc/config/arm/arm.md	(revision 212295)
+++ gcc/config/arm/arm.md	(working copy)
@@ -6726,6 +6726,20 @@
 })
 
 
+(define_expand "setmemsi"
+  [(match_operand:BLK 0 "general_operand" "")
+   (match_operand:SI 1 "const_int_operand" "")
+   (match_operand:SI 2 "const_int_operand" "")
+   (match_operand:SI 3 "const_int_operand" "")]
+  "TARGET_32BIT"
+{
+  if (arm_gen_setmem (operands))
+    DONE;
+
+  FAIL;
+})
+
+
 ;; Move a block of memory if it is word aligned and MORE than 2 words long.
 ;; We could let this apply for blocks of less than this, but it clobbers so
 ;; many registers that there is then probably a better way.
Index: gcc/testsuite/gcc.target/arm/memset-inline-7.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-7.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-7.c	(revision 0)
@@ -0,0 +1,171 @@
+/* { dg-do run } */
+/* { dg-options "-O2" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+int b[LEN];
+
+void
+init (signed char *arr, int len)
+{
+  int i;
+  for (i = 0; i < len; i++)
+    arr[i] = 0;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+#define TEST(a,l,v)			\
+	init ((signed char*)(a), sizeof (a));		\
+	memset ((a), (v), (l));				\
+	check ((signed char *)(a), (l), sizeof (a), (v));
+int
+main(void)
+{
+  TEST (a, 1, -1);
+  TEST (a, 2, -1);
+  TEST (a, 3, -1);
+  TEST (a, 4, -1);
+  TEST (a, 5, -1);
+  TEST (a, 6, -1);
+  TEST (a, 7, -1);
+  TEST (a, 8, -1);
+  TEST (a, 9, 1);
+  TEST (a, 10, -1);
+  TEST (a, 11, 1);
+  TEST (a, 12, -1);
+  TEST (a, 13, 1);
+  TEST (a, 14, -1);
+  TEST (a, 15, 1);
+  TEST (a, 16, -1);
+  TEST (a, 17, 1);
+  TEST (a, 18, -1);
+  TEST (a, 19, 1);
+  TEST (a, 20, -1);
+  TEST (a, 21, 1);
+  TEST (a, 22, -1);
+  TEST (a, 23, 1);
+  TEST (a, 24, -1);
+  TEST (a, 25, 1);
+  TEST (a, 26, -1);
+  TEST (a, 27, 1);
+  TEST (a, 28, -1);
+  TEST (a, 29, 1);
+  TEST (a, 30, -1);
+  TEST (a, 31, 1);
+  TEST (a, 32, -1);
+  TEST (a, 33, 1);
+  TEST (a, 34, -1);
+  TEST (a, 35, 1);
+  TEST (a, 36, -1);
+  TEST (a, 37, 1);
+  TEST (a, 38, -1);
+  TEST (a, 39, 1);
+  TEST (a, 40, -1);
+  TEST (a, 41, 1);
+  TEST (a, 42, -1);
+  TEST (a, 43, 1);
+  TEST (a, 44, -1);
+  TEST (a, 45, 1);
+  TEST (a, 46, -1);
+  TEST (a, 47, 1);
+  TEST (a, 48, -1);
+  TEST (a, 49, 1);
+  TEST (a, 50, -1);
+  TEST (a, 51, 1);
+  TEST (a, 52, -1);
+  TEST (a, 53, 1);
+  TEST (a, 54, -1);
+  TEST (a, 55, 1);
+  TEST (a, 56, -1);
+  TEST (a, 57, 1);
+  TEST (a, 58, -1);
+  TEST (a, 59, 1);
+  TEST (a, 60, -1);
+  TEST (a, 61, 1);
+  TEST (a, 62, -1);
+  TEST (a, 63, 1);
+  TEST (a, 64, -1);
+
+  TEST (b, 1, -1);
+  TEST (b, 2, -1);
+  TEST (b, 3, -1);
+  TEST (b, 4, -1);
+  TEST (b, 5, -1);
+  TEST (b, 6, -1);
+  TEST (b, 7, -1);
+  TEST (b, 8, -1);
+  TEST (b, 9, 1);
+  TEST (b, 10, -1);
+  TEST (b, 11, 1);
+  TEST (b, 12, -1);
+  TEST (b, 13, 1);
+  TEST (b, 14, -1);
+  TEST (b, 15, 1);
+  TEST (b, 16, -1);
+  TEST (b, 17, 1);
+  TEST (b, 18, -1);
+  TEST (b, 19, 1);
+  TEST (b, 20, -1);
+  TEST (b, 21, 1);
+  TEST (b, 22, -1);
+  TEST (b, 23, 1);
+  TEST (b, 24, -1);
+  TEST (b, 25, 1);
+  TEST (b, 26, -1);
+  TEST (b, 27, 1);
+  TEST (b, 28, -1);
+  TEST (b, 29, 1);
+  TEST (b, 30, -1);
+  TEST (b, 31, 1);
+  TEST (b, 32, -1);
+  TEST (b, 33, 1);
+  TEST (b, 34, -1);
+  TEST (b, 35, 1);
+  TEST (b, 36, -1);
+  TEST (b, 37, 1);
+  TEST (b, 38, -1);
+  TEST (b, 39, 1);
+  TEST (b, 40, -1);
+  TEST (b, 41, 1);
+  TEST (b, 42, -1);
+  TEST (b, 43, 1);
+  TEST (b, 44, -1);
+  TEST (b, 45, 1);
+  TEST (b, 46, -1);
+  TEST (b, 47, 1);
+  TEST (b, 48, -1);
+  TEST (b, 49, 1);
+  TEST (b, 50, -1);
+  TEST (b, 51, 1);
+  TEST (b, 52, -1);
+  TEST (b, 53, 1);
+  TEST (b, 54, -1);
+  TEST (b, 55, 1);
+  TEST (b, 56, -1);
+  TEST (b, 57, 1);
+  TEST (b, 58, -1);
+  TEST (b, 59, 1);
+  TEST (b, 60, -1);
+  TEST (b, 61, 1);
+  TEST (b, 62, -1);
+  TEST (b, 63, 1);
+  TEST (b, 64, -1);
+
+  return 0;
+}
+
Index: gcc/testsuite/gcc.target/arm/memset-inline-8.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-8.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-8.c	(revision 0)
@@ -0,0 +1,44 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline"  } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+    memset (a, -1, 14);
+    return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { scan-assembler "vst1" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-not "vstr" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-1.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-1.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-1.c	(revision 0)
@@ -0,0 +1,39 @@
+/* { dg-do run } */
+/* { dg-options "-save-temps -O2 -fno-inline"  } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+    memset (a, -1, 14);
+    return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-9.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-9.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-9.c	(revision 0)
@@ -0,0 +1,42 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -Os -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+  memset (a, -1, 14);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler "vst1" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-2.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-2.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-2.c	(revision 0)
@@ -0,0 +1,38 @@
+/* { dg-do run } */
+/* { dg-options "-save-temps -Os -fno-inline" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+  memset (a, -1, 14);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+/* { dg-final { scan-assembler "bl?\[ \t\]*memset" { target { ! arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-3.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-3.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-3.c	(revision 0)
@@ -0,0 +1,40 @@
+/* { dg-do run } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+    memset (a, -1, 7);
+    return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 7, sizeof (a), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { scan-assembler-not "strh" { target { ! arm_thumb1 } } } } */
+/* { dg-final { scan-assembler-not "strb" { target { ! arm_thumb1 } } } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-4.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-4.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-4.c	(revision 0)
@@ -0,0 +1,68 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+int a[LEN];
+int b[LEN];
+int c[LEN];
+void
+foo1 (void)
+{
+    memset (a, -1, 8);
+    return;
+}
+
+void
+foo2 (void)
+{
+  memset (b, 1, 12);
+  return;
+}
+
+void
+foo3 (void)
+{
+  memset (c, 1, 13);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  int i;
+
+  foo1 ();
+  check ((signed char *)a, 8, sizeof (a), -1);
+
+  foo2 ();
+  check ((signed char *)b, 12, sizeof (b), 1);
+
+  foo3 ();
+  check ((signed char *)c, 13, sizeof (c), 1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]+memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { scan-assembler-times "vst1\.8" 1 { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler "vstr" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-5.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-5.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-5.c	(revision 0)
@@ -0,0 +1,78 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+int a[LEN];
+int b[LEN];
+int c[LEN];
+int d[LEN];
+void
+foo1 (void)
+{
+    memset (a, -1, 16);
+    return;
+}
+
+void
+foo2 (void)
+{
+  memset (b, 1, 25);
+  return;
+}
+
+void
+foo3 (void)
+{
+  memset (c, -1, 19);
+  return;
+}
+
+void
+foo4 (void)
+{
+  memset (d, 1, 23);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo1 ();
+  check ((signed char *)a, 16, sizeof (a), -1);
+
+  foo2 ();
+  check ((signed char *)b, 25, sizeof (b), 1);
+
+  foo3 ();
+  check ((signed char *)c, 19, sizeof (c), -1);
+
+  foo4 ();
+  check ((signed char *)d, 23, sizeof (d), 1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]+memset" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler "vst1" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-not "vstr"  { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
+
Index: gcc/testsuite/gcc.target/arm/memset-inline-6.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-6.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-6.c	(revision 0)
@@ -0,0 +1,68 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+int a[LEN];
+int b[LEN];
+int c[LEN];
+void
+foo1 (void)
+{
+    memset (a, -1, 20);
+    return;
+}
+
+void
+foo2 (void)
+{
+  memset (b, 1, 24);
+  return;
+}
+
+void
+foo3 (void)
+{
+  memset (c, -1, 32);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo1 ();
+  check ((signed char *)a, 20, sizeof (a), -1);
+
+  foo2 ();
+  check ((signed char *)b, 24, sizeof (b), 1);
+
+  foo3 ();
+  check ((signed char *)c, 32, sizeof (c), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]+memset" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-times "vst1" 3 { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-times "vstr" 4 { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
+
+

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH ARM] Improve ARM memset inlining
  2014-07-04 12:18         ` Bin Cheng
@ 2014-07-08  8:32           ` Bin.Cheng
  2014-07-08  8:56           ` Ramana Radhakrishnan
  1 sibling, 0 replies; 14+ messages in thread
From: Bin.Cheng @ 2014-07-08  8:32 UTC (permalink / raw)
  To: Ramana Radhakrishnan; +Cc: Richard Earnshaw, gcc-patches

On Fri, Jul 4, 2014 at 1:17 PM, Bin Cheng <bin.cheng@arm.com> wrote:
>
>

>
> Hi Ramana,
> This is the rebased patch; there is no conflict against the latest trunk.  I am still doing some tests.  Is it OK if the tests pass?
> Also, it depends on the patch at https://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html; I will update that patch too.

Hi Ramana,

Bootstrap and tests for this patch are done.  Is it ok for me to submit?

Thanks,
bin

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH ARM] Improve ARM memset inlining
  2014-07-04 12:18         ` Bin Cheng
  2014-07-08  8:32           ` Bin.Cheng
@ 2014-07-08  8:56           ` Ramana Radhakrishnan
  2014-07-08  9:57             ` Bin.Cheng
  1 sibling, 1 reply; 14+ messages in thread
From: Ramana Radhakrishnan @ 2014-07-08  8:56 UTC (permalink / raw)
  To: Bin Cheng; +Cc: Richard Earnshaw, gcc-patches

>
> Hi Ramana,
> This is the rebased patch; there is no conflict against the latest trunk.  I am still doing some tests.  Is it OK if the tests pass?
> Also, it depends on the patch at https://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html; I will update that patch too.
>
> Thanks,
> bin

> Index: gcc/config/arm/arm.c
> ===================================================================
> --- gcc/config/arm/arm.c	(revision 212295)
> +++ gcc/config/arm/arm.c	(working copy)
> @@ -1588,34 +1588,38 @@ const struct tune_params arm_slowmul_tune =
>  {
>    arm_slowmul_rtx_costs,
>    NULL,
> -  NULL,						/* Sched adj cost.  */
> -  3,						/* Constant limit.  */
> -  5,						/* Max cond insns.  */
> +  NULL,					/* Sched adj cost.  */
> +  3,					/* Constant limit.  */
> +  5,					/* Max cond insns.  */

Please make sure the comment alignment is maintained as it is today.  I'm
not sure why I see the following diffs in your patch, since you really
shouldn't be touching those lines; that applies to all the cost tables.  I
haven't called out in detail all the places where you appear to have
unrelated formatting changes, but have done so for one cost table.

Please re-create a patch that doesn't have these hunks.
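
For concreteness, a hunk of the shape being asked for would look roughly
like the sketch below: only the two new fields are appended, the existing
lines and their comment column are left untouched, and the preceding
element merely gains the trailing comma that C requires once further
initializers follow it.  This is an illustration of the requested style,
not the actual revised hunk:

   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false, false,                                 /* Prefer 32-bit encodings.  */
+  false,                                        /* Prefer Neon for stringops.  */
+  8                                             /* Maximum insns to inline memset.  */
 };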

>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  true,						/* Prefer constant pool.  */
> +  true,					/* Prefer constant pool.  */

Likewise.

>    arm_default_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */

Likewise.

> +  false,				/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  const struct tune_params arm_fastmul_tune =
>  {
>    arm_fastmul_rtx_costs,
>    NULL,
> -  NULL,						/* Sched adj cost.  */
> -  1,						/* Constant limit.  */
> -  5,						/* Max cond insns.  */
> +  NULL,					/* Sched adj cost.  */
> +  1,					/* Constant limit.  */
> +  5,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  true,						/* Prefer constant pool.  */
> +  true,					/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  false,				/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  /* StrongARM has early execution of branches, so a sequence that is worth
> @@ -1625,17 +1629,19 @@ const struct tune_params arm_strongarm_tune =
>  {
>    arm_fastmul_rtx_costs,
>    NULL,
> -  NULL,						/* Sched adj cost.  */
> -  1,						/* Constant limit.  */
> -  3,						/* Max cond insns.  */
> +  NULL,					/* Sched adj cost.  */
> +  1,					/* Constant limit.  */
> +  3,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  true,						/* Prefer constant pool.  */
> +  true,					/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  false,				/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  const struct tune_params arm_xscale_tune =
> @@ -1643,50 +1649,56 @@ const struct tune_params arm_xscale_tune =
>    arm_xscale_rtx_costs,
>    NULL,
>    xscale_sched_adjust_cost,
> -  2,						/* Constant limit.  */
> -  3,						/* Max cond insns.  */
> +  2,					/* Constant limit.  */
> +  3,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  true,						/* Prefer constant pool.  */
> +  true,					/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  false,				/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  const struct tune_params arm_9e_tune =
>  {
>    arm_9e_rtx_costs,
>    NULL,
> -  NULL,						/* Sched adj cost.  */
> -  1,						/* Constant limit.  */
> -  5,						/* Max cond insns.  */
> +  NULL,					/* Sched adj cost.  */
> +  1,					/* Constant limit.  */
> +  5,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  true,						/* Prefer constant pool.  */
> +  true,					/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  false,				/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  const struct tune_params arm_v6t2_tune =
>  {
>    arm_9e_rtx_costs,
>    NULL,
> -  NULL,						/* Sched adj cost.  */
> -  1,						/* Constant limit.  */
> -  5,						/* Max cond insns.  */
> +  NULL,					/* Sched adj cost.  */
> +  1,					/* Constant limit.  */
> +  5,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  false,					/* Prefer constant pool.  */
> +  false,				/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  false,				/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  /* Generic Cortex tuning.  Use more specific tunings if appropriate.  */
> @@ -1694,34 +1706,38 @@ const struct tune_params arm_cortex_tune =
>  {
>    arm_9e_rtx_costs,
>    &generic_extra_costs,
> -  NULL,						/* Sched adj cost.  */
> -  1,						/* Constant limit.  */
> -  5,						/* Max cond insns.  */
> +  NULL,					/* Sched adj cost.  */
> +  1,					/* Constant limit.  */
> +  5,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  false,					/* Prefer constant pool.  */
> +  false,				/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  false,				/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  const struct tune_params arm_cortex_a8_tune =
>  {
>    arm_9e_rtx_costs,
>    &cortexa8_extra_costs,
> -  NULL,						/* Sched adj cost.  */
> -  1,						/* Constant limit.  */
> -  5,						/* Max cond insns.  */
> +  NULL,					/* Sched adj cost.  */
> +  1,					/* Constant limit.  */
> +  5,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  false,					/* Prefer constant pool.  */
> +  false,				/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  true,					/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  const struct tune_params arm_cortex_a7_tune =
> @@ -1729,67 +1745,75 @@ const struct tune_params arm_cortex_a7_tune =
>    arm_9e_rtx_costs,
>    &cortexa7_extra_costs,
>    NULL,
> -  1,						/* Constant limit.  */
> -  5,						/* Max cond insns.  */
> +  1,					/* Constant limit.  */
> +  5,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  false,					/* Prefer constant pool.  */
> +  false,				/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,			/* Vectorizer costs.  */
> -  false,					/* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  true,					/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  const struct tune_params arm_cortex_a15_tune =
>  {
>    arm_9e_rtx_costs,
>    &cortexa15_extra_costs,
> -  NULL,						/* Sched adj cost.  */
> -  1,						/* Constant limit.  */
> -  2,						/* Max cond insns.  */
> +  NULL,					/* Sched adj cost.  */
> +  1,					/* Constant limit.  */
> +  2,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  false,					/* Prefer constant pool.  */
> +  false,				/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  true,						/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  true, true                                    /* Prefer 32-bit encodings.  */
> +  true,					/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  true, true,				/* Prefer 32-bit encodings.  */
> +  true,					/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  const struct tune_params arm_cortex_a53_tune =
>  {
>    arm_9e_rtx_costs,
>    &cortexa53_extra_costs,
> -  NULL,						/* Scheduler cost adjustment.  */
> -  1,						/* Constant limit.  */
> -  5,						/* Max cond insns.  */
> +  NULL,					/* Scheduler cost adjustment.  */
> +  1,					/* Constant limit.  */
> +  5,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  false,					/* Prefer constant pool.  */
> +  false,				/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,			/* Vectorizer costs.  */
> -  false,					/* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  false,				/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  const struct tune_params arm_cortex_a57_tune =
>  {
>    arm_9e_rtx_costs,
>    &cortexa57_extra_costs,
> -  NULL,                                         /* Scheduler cost adjustment.  */
> -  1,                                           /* Constant limit.  */
> -  2,                                           /* Max cond insns.  */
> +  NULL,					/* Scheduler cost adjustment.  */
> +  1,					/* Constant limit.  */
> +  2,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  false,                                       /* Prefer constant pool.  */
> +  false,				/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  true,                                       /* Prefer LDRD/STRD.  */
> -  {true, true},                                /* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                       /* Vectorizer costs.  */
> -  false,                                       /* Prefer Neon for 64-bits bitops.  */
> -  true, true                                   /* Prefer 32-bit encodings.  */
> +  true,					/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  true, true,				/* Prefer 32-bit encodings.  */
> +  false,				/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  /* Branches can be dual-issued on Cortex-A5, so conditional execution is
> @@ -1799,17 +1823,19 @@ const struct tune_params arm_cortex_a5_tune =
>  {
>    arm_9e_rtx_costs,
>    NULL,
> -  NULL,						/* Sched adj cost.  */
> -  1,						/* Constant limit.  */
> -  1,						/* Max cond insns.  */
> +  NULL,					/* Sched adj cost.  */
> +  1,					/* Constant limit.  */
> +  1,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  false,					/* Prefer constant pool.  */
> +  false,				/* Prefer constant pool.  */
>    arm_cortex_a5_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {false, false},				/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {false, false},			/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  true,					/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  const struct tune_params arm_cortex_a9_tune =
> @@ -1817,16 +1843,18 @@ const struct tune_params arm_cortex_a9_tune =
>    arm_9e_rtx_costs,
>    &cortexa9_extra_costs,
>    cortex_a9_sched_adjust_cost,
> -  1,						/* Constant limit.  */
> -  5,						/* Max cond insns.  */
> +  1,					/* Constant limit.  */
> +  5,					/* Max cond insns.  */
>    ARM_PREFETCH_BENEFICIAL(4,32,32),
> -  false,					/* Prefer constant pool.  */
> +  false,				/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  false,				/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  const struct tune_params arm_cortex_a12_tune =
> @@ -1834,16 +1862,18 @@ const struct tune_params arm_cortex_a12_tune =
>    arm_9e_rtx_costs,
>    &cortexa12_extra_costs,
>    NULL,
> -  1,						/* Constant limit.  */
> -  5,						/* Max cond insns.  */
> +  1,					/* Constant limit.  */
> +  5,					/* Max cond insns.  */
>    ARM_PREFETCH_BENEFICIAL(4,32,32),
> -  false,					/* Prefer constant pool.  */
> +  false,				/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  true,						/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  true,					/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  true,					/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  /* armv7m tuning.  On Cortex-M4 cores for example, MOVW/MOVT take a single
> @@ -1857,17 +1887,19 @@ const struct tune_params arm_v7m_tune =
>  {
>    arm_9e_rtx_costs,
>    &v7m_extra_costs,
> -  NULL,						/* Sched adj cost.  */
> -  1,						/* Constant limit.  */
> -  2,						/* Max cond insns.  */
> +  NULL,					/* Sched adj cost.  */
> +  1,					/* Constant limit.  */
> +  2,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  true,						/* Prefer constant pool.  */
> +  true,					/* Prefer constant pool.  */
>    arm_cortex_m_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {false, false},				/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {false, false},			/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  false,				/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  /* The arm_v6m_tune is duplicated from arm_cortex_tune, rather than
> @@ -1876,17 +1908,19 @@ const struct tune_params arm_v6m_tune =
>  {
>    arm_9e_rtx_costs,
>    NULL,
> -  NULL,						/* Sched adj cost.  */
> -  1,						/* Constant limit.  */
> -  5,						/* Max cond insns.  */
> +  NULL,					/* Sched adj cost.  */
> +  1,					/* Constant limit.  */
> +  5,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  false,					/* Prefer constant pool.  */
> +  false,				/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {false, false},				/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {false, false},			/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  false,				/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  const struct tune_params arm_fa726te_tune =
> @@ -1894,16 +1928,18 @@ const struct tune_params arm_fa726te_tune =
>    arm_9e_rtx_costs,
>    NULL,
>    fa726te_sched_adjust_cost,
> -  1,						/* Constant limit.  */
> -  5,						/* Max cond insns.  */
> +  1,					/* Constant limit.  */
> +  5,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  true,						/* Prefer constant pool.  */
> +  true,					/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  false,				/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>




> Thanks,
> bin
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH ARM] Improve ARM memset inlining
  2014-07-08  8:56           ` Ramana Radhakrishnan
@ 2014-07-08  9:57             ` Bin.Cheng
  2014-07-21 12:45               ` Bin.Cheng
  0 siblings, 1 reply; 14+ messages in thread
From: Bin.Cheng @ 2014-07-08  9:57 UTC (permalink / raw)
  To: Ramana Radhakrishnan; +Cc: Bin Cheng, Richard Earnshaw, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 3356 bytes --]

On Tue, Jul 8, 2014 at 9:56 AM, Ramana Radhakrishnan
<ramana.radhakrishnan@arm.com> wrote:
>>
>> Hi Ramana,
>> This is the rebased patch; there is no conflict against the latest
>> trunk.  I am still doing some tests.  Is it OK if the tests pass?
>> Also, it depends on the patch at
>> https://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html; I will update
>> that patch too.
>>
>> Thanks,
>> bin
>
>
>> Index: gcc/config/arm/arm.c
>> ===================================================================
>> --- gcc/config/arm/arm.c        (revision 212295)
>> +++ gcc/config/arm/arm.c        (working copy)
>> @@ -1588,34 +1588,38 @@ const struct tune_params arm_slowmul_tune =
>>  {
>>    arm_slowmul_rtx_costs,
>>    NULL,
>> -  NULL,                                                /* Sched adj cost.
>> */
>> -  3,                                           /* Constant limit.  */
>> -  5,                                           /* Max cond insns.  */
>> +  NULL,                                        /* Sched adj cost.  */
>> +  3,                                   /* Constant limit.  */
>> +  5,                                   /* Max cond insns.  */
>
>
> Please make sure the comment alignment is maintained as it is today.  I'm
> not sure why I see the following diffs in your patch, since you really
> shouldn't be touching those lines; that applies to all the cost tables.  I
> haven't called out in detail all the places where you appear to have
> unrelated formatting changes, but have done so for one cost table.
>
> Please re-create a patch that doesn't have these hunks.
>
Here is the updated patch; I followed the existing format and left the
alignment as it is.

Thanks,
bin


2014-07-08  Bin Cheng  <bin.cheng@arm.com>

    PR target/55701
    * config/arm/arm.md (setmem): New pattern.
    * config/arm/arm-protos.h (struct tune_params): New fields.
    (arm_gen_setmem): New prototype.
    * config/arm/arm.c (arm_slowmul_tune): Initialize new fields.
    (arm_fastmul_tune, arm_strongarm_tune, arm_xscale_tune): Ditto.
    (arm_9e_tune, arm_v6t2_tune, arm_cortex_tune): Ditto.
    (arm_cortex_a8_tune, arm_cortex_a7_tune): Ditto.
    (arm_cortex_a15_tune, arm_cortex_a53_tune): Ditto.
    (arm_cortex_a57_tune, arm_cortex_a5_tune): Ditto.
    (arm_cortex_a9_tune, arm_cortex_a12_tune): Ditto.
    (arm_v7m_tune, arm_v6m_tune, arm_fa726te_tune): Ditto.
    (arm_const_inline_cost): New function.
    (arm_block_set_max_insns): New function.
    (arm_block_set_non_vect_profit_p): New function.
    (arm_block_set_vect_profit_p): New function.
    (arm_block_set_unaligned_vect): New function.
    (arm_block_set_aligned_vect): New function.
    (arm_block_set_unaligned_non_vect): New function.
    (arm_block_set_aligned_non_vect): New function.
    (arm_block_set_vect, arm_gen_setmem): New functions.

gcc/testsuite/ChangeLog
2014-07-08  Bin Cheng  <bin.cheng@arm.com>

    PR target/55701
    * gcc.target/arm/memset-inline-1.c: New test.
    * gcc.target/arm/memset-inline-2.c: New test.
    * gcc.target/arm/memset-inline-3.c: New test.
    * gcc.target/arm/memset-inline-4.c: New test.
    * gcc.target/arm/memset-inline-5.c: New test.
    * gcc.target/arm/memset-inline-6.c: New test.
    * gcc.target/arm/memset-inline-7.c: New test.
    * gcc.target/arm/memset-inline-8.c: New test.
    * gcc.target/arm/memset-inline-9.c: New test.

[-- Attachment #2: j1328-20140708.txt --]
[-- Type: text/plain, Size: 41927 bytes --]

Index: gcc/config/arm/arm.c
===================================================================
--- gcc/config/arm/arm.c	(revision 212351)
+++ gcc/config/arm/arm.c	(working copy)
@@ -1598,6 +1598,8 @@ const struct tune_params arm_slowmul_tune =
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
   false, false                                  /* Prefer 32-bit encodings.  */
+  false,					/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_fastmul_tune =
@@ -1615,6 +1617,8 @@ const struct tune_params arm_fastmul_tune =
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
   false, false                                  /* Prefer 32-bit encodings.  */
+  false,					/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 /* StrongARM has early execution of branches, so a sequence that is worth
@@ -1635,6 +1639,8 @@ const struct tune_params arm_strongarm_tune =
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
   false, false                                  /* Prefer 32-bit encodings.  */
+  false,					/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_xscale_tune =
@@ -1652,6 +1658,8 @@ const struct tune_params arm_xscale_tune =
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
   false, false                                  /* Prefer 32-bit encodings.  */
+  false,					/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_9e_tune =
@@ -1669,6 +1677,8 @@ const struct tune_params arm_9e_tune =
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
   false, false                                  /* Prefer 32-bit encodings.  */
+  false,					/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_v6t2_tune =
@@ -1686,6 +1696,8 @@ const struct tune_params arm_v6t2_tune =
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
   false, false                                  /* Prefer 32-bit encodings.  */
+  false,					/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 /* Generic Cortex tuning.  Use more specific tunings if appropriate.  */
@@ -1704,6 +1716,8 @@ const struct tune_params arm_cortex_tune =
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
   false, false                                  /* Prefer 32-bit encodings.  */
+  false,					/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a8_tune =
@@ -1721,6 +1735,8 @@ const struct tune_params arm_cortex_a8_tune =
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
   false, false                                  /* Prefer 32-bit encodings.  */
+  true,						/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a7_tune =
@@ -1738,6 +1754,8 @@ const struct tune_params arm_cortex_a7_tune =
   &arm_default_vec_cost,			/* Vectorizer costs.  */
   false,					/* Prefer Neon for 64-bits bitops.  */
   false, false                                  /* Prefer 32-bit encodings.  */
+  true,						/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a15_tune =
@@ -1755,6 +1773,8 @@ const struct tune_params arm_cortex_a15_tune =
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
   true, true                                    /* Prefer 32-bit encodings.  */
+  true,						/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a53_tune =
@@ -1772,6 +1792,8 @@ const struct tune_params arm_cortex_a53_tune =
   &arm_default_vec_cost,			/* Vectorizer costs.  */
   false,					/* Prefer Neon for 64-bits bitops.  */
   false, false                                  /* Prefer 32-bit encodings.  */
+  false,					/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a57_tune =
@@ -1789,6 +1811,8 @@ const struct tune_params arm_cortex_a57_tune =
   &arm_default_vec_cost,                       /* Vectorizer costs.  */
   false,                                       /* Prefer Neon for 64-bits bitops.  */
   true, true                                   /* Prefer 32-bit encodings.  */
+  false,					/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 /* Branches can be dual-issued on Cortex-A5, so conditional execution is
@@ -1809,6 +1833,8 @@ const struct tune_params arm_cortex_a5_tune =
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
   false, false                                  /* Prefer 32-bit encodings.  */
+  true,						/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a9_tune =
@@ -1826,6 +1852,8 @@ const struct tune_params arm_cortex_a9_tune =
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
   false, false                                  /* Prefer 32-bit encodings.  */
+  false,					/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a12_tune =
@@ -1843,6 +1871,8 @@ const struct tune_params arm_cortex_a12_tune =
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
   false, false                                  /* Prefer 32-bit encodings.  */
+  true,						/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 /* armv7m tuning.  On Cortex-M4 cores for example, MOVW/MOVT take a single
@@ -1867,6 +1897,8 @@ const struct tune_params arm_v7m_tune =
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
   false, false                                  /* Prefer 32-bit encodings.  */
+  false,					/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 /* The arm_v6m_tune is duplicated from arm_cortex_tune, rather than
@@ -1886,6 +1918,8 @@ const struct tune_params arm_v6m_tune =
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
   false, false                                  /* Prefer 32-bit encodings.  */
+  false,					/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_fa726te_tune =
@@ -1903,6 +1937,8 @@ const struct tune_params arm_fa726te_tune =
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
   false, false                                  /* Prefer 32-bit encodings.  */
+  false,					/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 
@@ -16797,6 +16833,14 @@ arm_const_double_inline_cost (rtx val)
 			      NULL_RTX, NULL_RTX, 0, 0));
 }
 
+/* Cost of loading a SImode constant.  */
+static inline int
+arm_const_inline_cost (enum rtx_code code, rtx val)
+{
+  return arm_gen_constant (code, SImode, NULL_RTX, INTVAL (val),
+                           NULL_RTX, NULL_RTX, 1, 0);
+}
+
 /* Return true if it is worthwhile to split a 64-bit constant into two
    32-bit operations.  This is the case if optimizing for size, or
    if we have load delay slots, or if one 32-bit part can be done with
@@ -31412,6 +31456,519 @@ arm_validize_comparison (rtx *comparison, rtx * op
 
 }
 
+/* Maximum number of instructions to set a block of memory.  */
+static int
+arm_block_set_max_insns (void)
+{
+  if (optimize_function_for_size_p (cfun))
+    return 4;
+  else
+    return current_tune->max_insns_inline_memset;
+}
+
+/* Return TRUE if it's profitable to set a block of memory in the
+   non-vectorized case.  VAL is the value to set the memory with,
+   LENGTH is the number of bytes to set, and ALIGN is the alignment
+   of the destination memory in bytes.  UNALIGNED_P is TRUE if we
+   can only set the memory with instructions that meet the alignment
+   requirements.  USE_STRD_P is TRUE if we can use strd to set the
+   memory.  */
+static bool
+arm_block_set_non_vect_profit_p (rtx val,
+				 unsigned HOST_WIDE_INT length,
+				 unsigned HOST_WIDE_INT align,
+				 bool unaligned_p, bool use_strd_p)
+{
+  int num = 0;
+  /* Minimum number of strb/strh/str instructions needed to store a
+     leftover of 0-7 bytes.  */
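+  /* For example, a 7-byte leftover takes str + strh + strb, hence
+     leftover[7] is 3.  */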
+  const int leftover[8] = {0, 1, 1, 2, 1, 2, 2, 3};
+
+  if (unaligned_p)
+    {
+      num = arm_const_inline_cost (SET, val);
+      num += length / align + length % align;
+    }
+  else if (use_strd_p)
+    {
+      num = arm_const_double_inline_cost (val);
+      num += (length >> 3) + leftover[length & 7];
+    }
+  else
+    {
+      num = arm_const_inline_cost (SET, val);
+      num += (length >> 2) + leftover[length & 3];
+    }
+
+  /* We may be able to combine the last STRH/STRB pair into a single
+     STR by shifting one byte back.  */
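+  /* E.g. a 7-byte block can then use one str plus an unaligned str at
+     offset 3 instead of str + strh + strb.  */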
+  if (unaligned_access && length > 3 && (length & 3) == 3)
+    num--;
+
+  return (num <= arm_block_set_max_insns ());
+}
+
+/* Return TRUE if it's profitable to set a block of memory in the
+   vectorized case.  LENGTH is the number of bytes to set.
+   ALIGN is the alignment of the destination memory in bytes.
+   MODE is the vector mode used to set the memory.  */
+static bool
+arm_block_set_vect_profit_p (unsigned HOST_WIDE_INT length,
+			     unsigned HOST_WIDE_INT align,
+			     enum machine_mode mode)
+{
+  int num;
+  bool unaligned_p = ((align & 3) != 0);
+  unsigned int nelt = GET_MODE_NUNITS (mode);
+
+  /* Instruction loading constant value.  */
+  num = 1;
+  /* Instructions storing the memory.  */
+  num += (length + nelt - 1) / nelt;
+  /* Instructions adjusting the address expression.  We only need to
+     adjust the address when the block is word-aligned and the
+     leftover bytes have to be stored with a misaligned store.  */
+  if (!unaligned_p && (length & 3) != 0)
+    num++;
+
+  /* Store the first 16 bytes using vst1:v16qi for the aligned case.  */
+  if (!unaligned_p && mode == V16QImode)
+    num--;
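+  /* For example, a 25-byte word-aligned block in V16QImode costs
+     1 + 2 + 1 - 1 = 3 instructions.  */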
+
+  return (num <= arm_block_set_max_insns ());
+}
+
+/* Set a block of memory using vectorization instructions for the
+   unaligned case.  We fill the first LENGTH bytes of the memory
+   area starting from DSTBASE with byte constant VALUE.  ALIGN is
+   the alignment requirement of memory.  Return TRUE if succeeded.  */
+static bool
+arm_block_set_unaligned_vect (rtx dstbase,
+			      unsigned HOST_WIDE_INT length,
+			      unsigned HOST_WIDE_INT value,
+			      unsigned HOST_WIDE_INT align)
+{
+  unsigned int i, j, nelt_v16, nelt_v8, nelt_mode;
+  rtx dst, mem;
+  rtx val_elt, val_vec, reg;
+  rtx rval[MAX_VECT_LEN];
+  rtx (*gen_func) (rtx, rtx);
+  enum machine_mode mode;
+  unsigned HOST_WIDE_INT v = value;
+
+  gcc_assert ((align & 0x3) != 0);
+  nelt_v8 = GET_MODE_NUNITS (V8QImode);
+  nelt_v16 = GET_MODE_NUNITS (V16QImode);
+  if (length >= nelt_v16)
+    {
+      mode = V16QImode;
+      gen_func = gen_movmisalignv16qi;
+    }
+  else
+    {
+      mode = V8QImode;
+      gen_func = gen_movmisalignv8qi;
+    }
+  nelt_mode = GET_MODE_NUNITS (mode);
+  gcc_assert (length >= nelt_mode);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_vect_profit_p (length, align, mode))
+    return false;
+
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+  mem = adjust_automodify_address (dstbase, mode, dst, 0);
+
+  v = sext_hwi (v, BITS_PER_WORD);
+  val_elt = GEN_INT (v);
+  for (j = 0; j < nelt_mode; j++)
+    rval[j] = val_elt;
+
+  reg = gen_reg_rtx (mode);
+  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval));
+  /* Emit instruction loading the constant value.  */
+  emit_move_insn (reg, val_vec);
+
+  /* Handle nelt_mode bytes in a vector.  */
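+  /* Advance DST only when another full-vector store will follow; the
+     leftover code below does its own address adjustment.  */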
+  for (i = 0; (i + nelt_mode <= length); i += nelt_mode)
+    {
+      emit_insn ((*gen_func) (mem, reg));
+      if (i + 2 * nelt_mode <= length)
+	emit_insn (gen_add2_insn (dst, GEN_INT (nelt_mode)));
+    }
+
+  /* If nelt_v8 or more bytes are left over, we must be in
+     V16QImode.  */
+  gcc_assert ((i + nelt_v8) > length || mode == V16QImode);
+
+  /* Handle (8, 16) bytes leftover.  */
+  if (i + nelt_v8 < length)
+    {
+      emit_insn (gen_add2_insn (dst, GEN_INT (length - i)));
+      /* We are shifting bytes back, set the alignment accordingly.  */
+      if ((length & 1) != 0 && align >= 2)
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      emit_insn (gen_movmisalignv16qi (mem, reg));
+    }
+  /* Handle (0, 8] bytes leftover.  */
+  else if (i < length && i + nelt_v8 >= length)
+    {
+      if (mode == V16QImode)
+	{
+	  reg = gen_lowpart (V8QImode, reg);
+	  mem = adjust_automodify_address (dstbase, V8QImode, dst, 0);
+	}
+      emit_insn (gen_add2_insn (dst, GEN_INT ((length - i)
+					      + (nelt_mode - nelt_v8))));
+      /* We are shifting bytes back, set the alignment accordingly.  */
+      if ((length & 1) != 0 && align >= 2)
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      emit_insn (gen_movmisalignv8qi (mem, reg));
+    }
+
+  return true;
+}
+
+/* Set a block of memory using vectorization instructions for the
+   aligned case.  We fill the first LENGTH bytes of the memory area
+   starting from DSTBASE with byte constant VALUE.  ALIGN is the
+   alignment requirement of memory.  Return TRUE if succeeded.  */
+static bool
+arm_block_set_aligned_vect (rtx dstbase,
+			    unsigned HOST_WIDE_INT length,
+			    unsigned HOST_WIDE_INT value,
+			    unsigned HOST_WIDE_INT align)
+{
+  unsigned int i, j, nelt_v8, nelt_v16, nelt_mode;
+  rtx dst, addr, mem;
+  rtx val_elt, val_vec, reg;
+  rtx rval[MAX_VECT_LEN];
+  enum machine_mode mode;
+  unsigned HOST_WIDE_INT v = value;
+
+  gcc_assert ((align & 0x3) == 0);
+  nelt_v8 = GET_MODE_NUNITS (V8QImode);
+  nelt_v16 = GET_MODE_NUNITS (V16QImode);
+  if (length >= nelt_v16 && unaligned_access && !BYTES_BIG_ENDIAN)
+    mode = V16QImode;
+  else
+    mode = V8QImode;
+
+  nelt_mode = GET_MODE_NUNITS (mode);
+  gcc_assert (length >= nelt_mode);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_vect_profit_p (length, align, mode))
+    return false;
+
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+
+  v = sext_hwi (v, BITS_PER_WORD);
+  val_elt = GEN_INT (v);
+  for (j = 0; j < nelt_mode; j++)
+    rval[j] = val_elt;
+
+  reg = gen_reg_rtx (mode);
+  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval));
+  /* Emit instruction loading the constant value.  */
+  emit_move_insn (reg, val_vec);
+
+  i = 0;
+  /* Handle first 16 bytes specially using vst1:v16qi instruction.  */
+  if (mode == V16QImode)
+    {
+      mem = adjust_automodify_address (dstbase, mode, dst, 0);
+      emit_insn (gen_movmisalignv16qi (mem, reg));
+      i += nelt_mode;
+      /* Handle (8, 16) bytes leftover using vst1:v16qi again.  */
+      if (i + nelt_v8 < length && i + nelt_v16 > length)
+	{
+	  emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
+	  mem = adjust_automodify_address (dstbase, mode, dst, 0);
+	  /* We are shifting bytes back, set the alignment accordingly.  */
+	  if ((length & 0x3) == 0)
+	    set_mem_align (mem, BITS_PER_UNIT * 4);
+	  else if ((length & 0x1) == 0)
+	    set_mem_align (mem, BITS_PER_UNIT * 2);
+	  else
+	    set_mem_align (mem, BITS_PER_UNIT);
+
+	  emit_insn (gen_movmisalignv16qi (mem, reg));
+	  return true;
+	}
+      /* Fall through for bytes leftover.  */
+      mode = V8QImode;
+      nelt_mode = GET_MODE_NUNITS (mode);
+      reg = gen_lowpart (V8QImode, reg);
+    }
+
+  /* Handle 8 bytes in a vector.  */
+  for (; (i + nelt_mode <= length); i += nelt_mode)
+    {
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, mode, addr, i);
+      emit_move_insn (mem, reg);
+    }
+
+  /* Handle single word leftover by shifting 4 bytes back.  We can
+     use aligned access for this case.  */
+  if (i + UNITS_PER_WORD == length)
+    {
+      addr = plus_constant (Pmode, dst, i - UNITS_PER_WORD);
+      mem = adjust_automodify_address (dstbase, mode,
+				       addr, i - UNITS_PER_WORD);
+      /* We are shifting 4 bytes back, set the alignment accordingly.  */
+      if (align > UNITS_PER_WORD)
+	set_mem_align (mem, BITS_PER_UNIT * UNITS_PER_WORD);
+
+      emit_move_insn (mem, reg);
+    }
+  /* Handle (0, 4), (4, 8) bytes leftover by shifting bytes back.
+     We have to use unaligned access for this case.  */
+  else if (i < length)
+    {
+      emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
+      mem = adjust_automodify_address (dstbase, mode, dst, 0);
+      /* We are shifting bytes back, set the alignment accordingly.  */
+      if ((length & 1) == 0)
+	set_mem_align (mem, BITS_PER_UNIT * 2);
+      else
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      emit_insn (gen_movmisalignv8qi (mem, reg));
+    }
+
+  return true;
+}
+
+/* Set a block of memory using plain strh/strb instructions, using
+   only instructions allowed by ALIGN on the processor.  We fill the
+   first LENGTH bytes of the memory area starting from DSTBASE
+   with byte constant VALUE.  ALIGN is the alignment requirement
+   of memory.  */
+static bool
+arm_block_set_unaligned_non_vect (rtx dstbase,
+				  unsigned HOST_WIDE_INT length,
+				  unsigned HOST_WIDE_INT value,
+				  unsigned HOST_WIDE_INT align)
+{
+  unsigned int i;
+  rtx dst, addr, mem;
+  rtx val_exp, val_reg, reg;
+  enum machine_mode mode;
+  HOST_WIDE_INT v = value;
+
+  gcc_assert (align == 1 || align == 2);
+
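+  /* With 2-byte alignment we can use strh, so duplicate the byte into
+     both bytes of a halfword.  */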
+  if (align == 2)
+    v |= (value << BITS_PER_UNIT);
+
+  v = sext_hwi (v, BITS_PER_WORD);
+  val_exp = GEN_INT (v);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_non_vect_profit_p (val_exp, length,
+					align, true, false))
+    return false;
+
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+  mode = (align == 2 ? HImode : QImode);
+  val_reg = force_reg (SImode, val_exp);
+  reg = gen_lowpart (mode, val_reg);
+
+  for (i = 0; (i + GET_MODE_SIZE (mode) <= length); i += GET_MODE_SIZE (mode))
+    {
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, mode, addr, i);
+      emit_move_insn (mem, reg);
+    }
+
+  /* Handle single byte leftover.  */
+  if (i + 1 == length)
+    {
+      reg = gen_lowpart (QImode, val_reg);
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, QImode, addr, i);
+      emit_move_insn (mem, reg);
+      i++;
+    }
+
+  gcc_assert (i == length);
+  return true;
+}
+
+/* Set a block of memory using plain strd/str/strh/strb instructions,
+   to permit unaligned stores on processors which support unaligned
+   semantics for those instructions.  We fill the first LENGTH bytes
+   of the memory area starting from DSTBASE with byte constant VALUE.
+   ALIGN is the alignment requirement of memory.  */
+static bool
+arm_block_set_aligned_non_vect (rtx dstbase,
+				unsigned HOST_WIDE_INT length,
+				unsigned HOST_WIDE_INT value,
+				unsigned HOST_WIDE_INT align)
+{
+  unsigned int i;
+  rtx dst, addr, mem;
+  rtx val_exp, val_reg, reg;
+  unsigned HOST_WIDE_INT v;
+  bool use_strd_p;
+
+  use_strd_p = (length >= 2 * UNITS_PER_WORD && (align & 3) == 0
+		&& TARGET_LDRD && current_tune->prefer_ldrd_strd);
+
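+  /* Replicate the byte value into every byte of a word; if the block
+     is shorter than a word, keep only the low LENGTH bytes.  */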
+  v = (value | (value << 8) | (value << 16) | (value << 24));
+  if (length < UNITS_PER_WORD)
+    v &= (0xFFFFFFFF >> (UNITS_PER_WORD - length) * BITS_PER_UNIT);
+
+  if (use_strd_p)
+    v |= (v << BITS_PER_WORD);
+  else
+    v = sext_hwi (v, BITS_PER_WORD);
+
+  val_exp = GEN_INT (v);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_non_vect_profit_p (val_exp, length,
+					align, false, use_strd_p))
+    {
+      if (!use_strd_p)
+	return false;
+
+      /* Try without strd.  */
+      v = (v >> BITS_PER_WORD);
+      v = sext_hwi (v, BITS_PER_WORD);
+      val_exp = GEN_INT (v);
+      use_strd_p = false;
+      if (!arm_block_set_non_vect_profit_p (val_exp, length,
+					    align, false, use_strd_p))
+	return false;
+    }
+
+  i = 0;
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+  /* Handle double words using strd if possible.  */
+  if (use_strd_p)
+    {
+      val_reg = force_reg (DImode, val_exp);
+      reg = val_reg;
+      for (; (i + 8 <= length); i += 8)
+	{
+	  addr = plus_constant (Pmode, dst, i);
+	  mem = adjust_automodify_address (dstbase, DImode, addr, i);
+	  emit_move_insn (mem, reg);
+	}
+    }
+  else
+    val_reg = force_reg (SImode, val_exp);
+
+  /* Handle words.  */
+  reg = (use_strd_p ? gen_lowpart (SImode, val_reg) : val_reg);
+  for (; (i + 4 <= length); i += 4)
+    {
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, SImode, addr, i);
+      if ((align & 3) == 0)
+	emit_move_insn (mem, reg);
+      else
+	emit_insn (gen_unaligned_storesi (mem, reg));
+    }
+
+  /* Merge last pair of STRH and STRB into a STR if possible.  */
+  if (unaligned_access && i > 0 && (i + 3) == length)
+    {
+      addr = plus_constant (Pmode, dst, i - 1);
+      mem = adjust_automodify_address (dstbase, SImode, addr, i - 1);
+      /* We are shifting one byte back, set the alignment accordingly.  */
+      if ((align & 1) == 0)
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      /* Most likely this is an unaligned access, and we can't tell at
+	 compilation time.  */
+      emit_insn (gen_unaligned_storesi (mem, reg));
+      return true;
+    }
+
+  /* Handle half word leftover.  */
+  if (i + 2 <= length)
+    {
+      reg = gen_lowpart (HImode, val_reg);
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, HImode, addr, i);
+      if ((align & 1) == 0)
+	emit_move_insn (mem, reg);
+      else
+	emit_insn (gen_unaligned_storehi (mem, reg));
+
+      i += 2;
+    }
+
+  /* Handle single byte leftover.  */
+  if (i + 1 == length)
+    {
+      reg = gen_lowpart (QImode, val_reg);
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, QImode, addr, i);
+      emit_move_insn (mem, reg);
+    }
+
+  return true;
+}
+
+/* Set a block of memory using vectorization instructions for both
+   aligned and unaligned cases.  We fill the first LENGTH bytes of
+   the memory area starting from DSTBASE with byte constant VALUE.
+   ALIGN is the alignment requirement of memory.  */
+static bool
+arm_block_set_vect (rtx dstbase,
+		    unsigned HOST_WIDE_INT length,
+		    unsigned HOST_WIDE_INT value,
+		    unsigned HOST_WIDE_INT align)
+{
+  /* Check whether we need to use an unaligned store instruction.  */
+  if (((align & 3) != 0 || (length & 3) != 0)
+      /* Check whether unaligned store instructions are available.  */
+      && (!unaligned_access || BYTES_BIG_ENDIAN))
+    return false;
+
+  if ((align & 3) == 0)
+    return arm_block_set_aligned_vect (dstbase, length, value, align);
+  else
+    return arm_block_set_unaligned_vect (dstbase, length, value, align);
+}
+
+/* Expand a string store (memset) operation.  First we try to expand
+   it using vectorization instructions, then with ARM unaligned access
+   and double-word stores if profitable.  OPERANDS[0] is the
+   destination, OPERANDS[1] is the number of bytes, OPERANDS[2] is the
+   value to initialize the memory with, and OPERANDS[3] is the known
+   alignment of the destination.  */
+bool
+arm_gen_setmem (rtx *operands)
+{
+  rtx dstbase = operands[0];
+  unsigned HOST_WIDE_INT length;
+  unsigned HOST_WIDE_INT value;
+  unsigned HOST_WIDE_INT align;
+
+  if (!CONST_INT_P (operands[2]) || !CONST_INT_P (operands[1]))
+    return false;
+
+  length = UINTVAL (operands[1]);
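+  /* Only blocks of up to 64 bytes are expanded inline; larger ones
+     are left to the library memset.  */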
+  if (length > 64)
+    return false;
+
+  value = (UINTVAL (operands[2]) & 0xFF);
+  align = UINTVAL (operands[3]);
+  if (TARGET_NEON && length >= 8
+      && current_tune->string_ops_prefer_neon
+      && arm_block_set_vect (dstbase, length, value, align))
+    return true;
+
+  if (!unaligned_access && (align & 3) != 0)
+    return arm_block_set_unaligned_non_vect (dstbase, length, value, align);
+
+  return arm_block_set_aligned_non_vect (dstbase, length, value, align);
+}
+
 /* Implement the TARGET_ASAN_SHADOW_OFFSET hook.  */
 
 static unsigned HOST_WIDE_INT
Index: gcc/config/arm/arm.md
===================================================================
--- gcc/config/arm/arm.md	(revision 212351)
+++ gcc/config/arm/arm.md	(working copy)
@@ -6716,6 +6716,20 @@
 })
 
 
+(define_expand "setmemsi"
+  [(match_operand:BLK 0 "general_operand" "")
+   (match_operand:SI 1 "const_int_operand" "")
+   (match_operand:SI 2 "const_int_operand" "")
+   (match_operand:SI 3 "const_int_operand" "")]
+  "TARGET_32BIT"
+{
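+  /* Operand 0 is the destination memory block, operand 1 the number of
+     bytes to set, operand 2 the value and operand 3 the known alignment
+     of the destination.  */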
+  if (arm_gen_setmem (operands))
+    DONE;
+
+  FAIL;
+})
+
+
 ;; Move a block of memory if it is word aligned and MORE than 2 words long.
 ;; We could let this apply for blocks of less than this, but it clobbers so
 ;; many registers that there is then probably a better way.
Index: gcc/config/arm/arm-protos.h
===================================================================
--- gcc/config/arm/arm-protos.h	(revision 212351)
+++ gcc/config/arm/arm-protos.h	(working copy)
@@ -278,6 +278,10 @@ struct tune_params
   /* Prefer 32-bit encoding instead of 16-bit encoding where subset of flags
      would be set.  */
   bool disparage_partial_flag_setting_t16_encodings;
+  /* Prefer to inline string operations like memset by using Neon.  */
+  bool string_ops_prefer_neon;
+  /* Maximum number of instructions to inline calls to memset.  */
+  int max_insns_inline_memset;
 };
 
 extern const struct tune_params *current_tune;
@@ -290,6 +294,7 @@ extern void arm_emit_coreregs_64bit_shift (enum rt
 extern bool arm_validize_comparison (rtx *, rtx *, rtx *);
 #endif /* RTX_CODE */
 
+extern bool arm_gen_setmem (rtx *);
 extern void arm_expand_vec_perm (rtx target, rtx op0, rtx op1, rtx sel);
 extern bool arm_expand_vec_perm_const (rtx target, rtx op0, rtx op1, rtx sel);
 
Index: gcc/testsuite/gcc.target/arm/memset-inline-2.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-2.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-2.c	(revision 0)
@@ -0,0 +1,38 @@
+/* { dg-do run } */
+/* { dg-options "-save-temps -Os -fno-inline" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+  memset (a, -1, 14);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+/* { dg-final { scan-assembler "bl?\[ \t\]*memset" { target { ! arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-3.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-3.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-3.c	(revision 0)
@@ -0,0 +1,40 @@
+/* { dg-do run } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+    memset (a, -1, 7);
+    return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 7, sizeof (a), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { scan-assembler-not "strh" { target { ! arm_thumb1 } } } } */
+/* { dg-final { scan-assembler-not "strb" { target { ! arm_thumb1 } } } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-4.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-4.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-4.c	(revision 0)
@@ -0,0 +1,68 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+int a[LEN];
+int b[LEN];
+int c[LEN];
+void
+foo1 (void)
+{
+    memset (a, -1, 8);
+    return;
+}
+
+void
+foo2 (void)
+{
+  memset (b, 1, 12);
+  return;
+}
+
+void
+foo3 (void)
+{
+  memset (c, 1, 13);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  int i;
+
+  foo1 ();
+  check ((signed char *)a, 8, sizeof (a), -1);
+
+  foo2 ();
+  check ((signed char *)b, 12, sizeof (b), 1);
+
+  foo3 ();
+  check ((signed char *)c, 13, sizeof (c), 1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]+memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { scan-assembler-times "vst1\.8" 1 { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler "vstr" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-5.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-5.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-5.c	(revision 0)
@@ -0,0 +1,78 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+int a[LEN];
+int b[LEN];
+int c[LEN];
+int d[LEN];
+void
+foo1 (void)
+{
+    memset (a, -1, 16);
+    return;
+}
+
+void
+foo2 (void)
+{
+  memset (b, 1, 25);
+  return;
+}
+
+void
+foo3 (void)
+{
+  memset (c, -1, 19);
+  return;
+}
+
+void
+foo4 (void)
+{
+  memset (d, 1, 23);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo1 ();
+  check ((signed char *)a, 16, sizeof (a), -1);
+
+  foo2 ();
+  check ((signed char *)b, 25, sizeof (b), 1);
+
+  foo3 ();
+  check ((signed char *)c, 19, sizeof (c), -1);
+
+  foo4 ();
+  check ((signed char *)d, 23, sizeof (d), 1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]+memset" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler "vst1" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-not "vstr"  { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
+
Index: gcc/testsuite/gcc.target/arm/memset-inline-6.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-6.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-6.c	(revision 0)
@@ -0,0 +1,68 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+int a[LEN];
+int b[LEN];
+int c[LEN];
+void
+foo1 (void)
+{
+    memset (a, -1, 20);
+    return;
+}
+
+void
+foo2 (void)
+{
+  memset (b, 1, 24);
+  return;
+}
+
+void
+foo3 (void)
+{
+  memset (c, -1, 32);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo1 ();
+  check ((signed char *)a, 20, sizeof (a), -1);
+
+  foo2 ();
+  check ((signed char *)b, 24, sizeof (b), 1);
+
+  foo3 ();
+  check ((signed char *)c, 32, sizeof (c), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]+memset" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-times "vst1" 3 { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-times "vstr" 4 { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
+
+
Index: gcc/testsuite/gcc.target/arm/memset-inline-7.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-7.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-7.c	(revision 0)
@@ -0,0 +1,171 @@
+/* { dg-do run } */
+/* { dg-options "-O2" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+int b[LEN];
+
+void
+init (signed char *arr, int len)
+{
+  int i;
+  for (i = 0; i < len; i++)
+    arr[i] = 0;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+#define TEST(a,l,v)			\
+	init ((signed char*)(a), sizeof (a));		\
+	memset ((a), (v), (l));				\
+	check ((signed char *)(a), (l), sizeof (a), (v));
+int
+main(void)
+{
+  TEST (a, 1, -1);
+  TEST (a, 2, -1);
+  TEST (a, 3, -1);
+  TEST (a, 4, -1);
+  TEST (a, 5, -1);
+  TEST (a, 6, -1);
+  TEST (a, 7, -1);
+  TEST (a, 8, -1);
+  TEST (a, 9, 1);
+  TEST (a, 10, -1);
+  TEST (a, 11, 1);
+  TEST (a, 12, -1);
+  TEST (a, 13, 1);
+  TEST (a, 14, -1);
+  TEST (a, 15, 1);
+  TEST (a, 16, -1);
+  TEST (a, 17, 1);
+  TEST (a, 18, -1);
+  TEST (a, 19, 1);
+  TEST (a, 20, -1);
+  TEST (a, 21, 1);
+  TEST (a, 22, -1);
+  TEST (a, 23, 1);
+  TEST (a, 24, -1);
+  TEST (a, 25, 1);
+  TEST (a, 26, -1);
+  TEST (a, 27, 1);
+  TEST (a, 28, -1);
+  TEST (a, 29, 1);
+  TEST (a, 30, -1);
+  TEST (a, 31, 1);
+  TEST (a, 32, -1);
+  TEST (a, 33, 1);
+  TEST (a, 34, -1);
+  TEST (a, 35, 1);
+  TEST (a, 36, -1);
+  TEST (a, 37, 1);
+  TEST (a, 38, -1);
+  TEST (a, 39, 1);
+  TEST (a, 40, -1);
+  TEST (a, 41, 1);
+  TEST (a, 42, -1);
+  TEST (a, 43, 1);
+  TEST (a, 44, -1);
+  TEST (a, 45, 1);
+  TEST (a, 46, -1);
+  TEST (a, 47, 1);
+  TEST (a, 48, -1);
+  TEST (a, 49, 1);
+  TEST (a, 50, -1);
+  TEST (a, 51, 1);
+  TEST (a, 52, -1);
+  TEST (a, 53, 1);
+  TEST (a, 54, -1);
+  TEST (a, 55, 1);
+  TEST (a, 56, -1);
+  TEST (a, 57, 1);
+  TEST (a, 58, -1);
+  TEST (a, 59, 1);
+  TEST (a, 60, -1);
+  TEST (a, 61, 1);
+  TEST (a, 62, -1);
+  TEST (a, 63, 1);
+  TEST (a, 64, -1);
+
+  TEST (b, 1, -1);
+  TEST (b, 2, -1);
+  TEST (b, 3, -1);
+  TEST (b, 4, -1);
+  TEST (b, 5, -1);
+  TEST (b, 6, -1);
+  TEST (b, 7, -1);
+  TEST (b, 8, -1);
+  TEST (b, 9, 1);
+  TEST (b, 10, -1);
+  TEST (b, 11, 1);
+  TEST (b, 12, -1);
+  TEST (b, 13, 1);
+  TEST (b, 14, -1);
+  TEST (b, 15, 1);
+  TEST (b, 16, -1);
+  TEST (b, 17, 1);
+  TEST (b, 18, -1);
+  TEST (b, 19, 1);
+  TEST (b, 20, -1);
+  TEST (b, 21, 1);
+  TEST (b, 22, -1);
+  TEST (b, 23, 1);
+  TEST (b, 24, -1);
+  TEST (b, 25, 1);
+  TEST (b, 26, -1);
+  TEST (b, 27, 1);
+  TEST (b, 28, -1);
+  TEST (b, 29, 1);
+  TEST (b, 30, -1);
+  TEST (b, 31, 1);
+  TEST (b, 32, -1);
+  TEST (b, 33, 1);
+  TEST (b, 34, -1);
+  TEST (b, 35, 1);
+  TEST (b, 36, -1);
+  TEST (b, 37, 1);
+  TEST (b, 38, -1);
+  TEST (b, 39, 1);
+  TEST (b, 40, -1);
+  TEST (b, 41, 1);
+  TEST (b, 42, -1);
+  TEST (b, 43, 1);
+  TEST (b, 44, -1);
+  TEST (b, 45, 1);
+  TEST (b, 46, -1);
+  TEST (b, 47, 1);
+  TEST (b, 48, -1);
+  TEST (b, 49, 1);
+  TEST (b, 50, -1);
+  TEST (b, 51, 1);
+  TEST (b, 52, -1);
+  TEST (b, 53, 1);
+  TEST (b, 54, -1);
+  TEST (b, 55, 1);
+  TEST (b, 56, -1);
+  TEST (b, 57, 1);
+  TEST (b, 58, -1);
+  TEST (b, 59, 1);
+  TEST (b, 60, -1);
+  TEST (b, 61, 1);
+  TEST (b, 62, -1);
+  TEST (b, 63, 1);
+  TEST (b, 64, -1);
+
+  return 0;
+}
+
Index: gcc/testsuite/gcc.target/arm/memset-inline-8.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-8.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-8.c	(revision 0)
@@ -0,0 +1,44 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline"  } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+    memset (a, -1, 14);
+    return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { scan-assembler "vst1" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-not "vstr" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-1.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-1.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-1.c	(revision 0)
@@ -0,0 +1,39 @@
+/* { dg-do run } */
+/* { dg-options "-save-temps -O2 -fno-inline"  } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+    memset (a, -1, 14);
+    return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-9.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-9.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-9.c	(revision 0)
@@ -0,0 +1,42 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -Os -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+  memset (a, -1, 14);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler "vst1" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH ARM] Improve ARM memset inlining
  2014-07-08  9:57             ` Bin.Cheng
@ 2014-07-21 12:45               ` Bin.Cheng
  0 siblings, 0 replies; 14+ messages in thread
From: Bin.Cheng @ 2014-07-21 12:45 UTC (permalink / raw)
  To: Ramana Radhakrishnan; +Cc: Bin Cheng, Richard Earnshaw, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1963 bytes --]

On Tue, Jul 8, 2014 at 10:57 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Tue, Jul 8, 2014 at 9:56 AM, Ramana Radhakrishnan
> <ramana.radhakrishnan@arm.com> wrote:

>
> 2014-07-08  Bin Cheng  <bin.cheng@arm.com>
>
>     PR target/55701
>     * config/arm/arm.md (setmem): New pattern.
>     * config/arm/arm-protos.h (struct tune_params): New fields.
>     (arm_gen_setmem): New prototype.
>     * config/arm/arm.c (arm_slowmul_tune): Initialize new fields.
>     (arm_fastmul_tune, arm_strongarm_tune, arm_xscale_tune): Ditto.
>     (arm_9e_tune, arm_v6t2_tune, arm_cortex_tune): Ditto.
>     (arm_cortex_a8_tune, arm_cortex_a7_tune): Ditto.
>     (arm_cortex_a15_tune, arm_cortex_a53_tune): Ditto.
>     (arm_cortex_a57_tune, arm_cortex_a5_tune): Ditto.
>     (arm_cortex_a9_tune, arm_cortex_a12_tune): Ditto.
>     (arm_v7m_tune, arm_v6m_tune, arm_fa726te_tune): Ditto.
>     (arm_const_inline_cost): New function.
>     (arm_block_set_max_insns): New function.
>     (arm_block_set_non_vect_profit_p): New function.
>     (arm_block_set_vect_profit_p): New function.
>     (arm_block_set_unaligned_vect): New function.
>     (arm_block_set_aligned_vect): New function.
>     (arm_block_set_unaligned_non_vect): New function.
>     (arm_block_set_aligned_non_vect): New function.
>     (arm_block_set_vect, arm_gen_setmem): New functions.
>
> gcc/testsuite/ChangeLog
> 2014-07-08  Bin Cheng  <bin.cheng@arm.com>
>
>     PR target/55701
>     * gcc.target/arm/memset-inline-1.c: New test.
>     * gcc.target/arm/memset-inline-2.c: New test.
>     * gcc.target/arm/memset-inline-3.c: New test.
>     * gcc.target/arm/memset-inline-4.c: New test.
>     * gcc.target/arm/memset-inline-5.c: New test.
>     * gcc.target/arm/memset-inline-6.c: New test.
>     * gcc.target/arm/memset-inline-7.c: New test.
>     * gcc.target/arm/memset-inline-8.c: New test.
>     * gcc.target/arm/memset-inline-9.c: New test.

Committed attached patch as r212893.

Thanks,
bin

[-- Attachment #2: j1328-20140718.txt --]
[-- Type: text/plain, Size: 44386 bytes --]

Index: gcc/testsuite/gcc.target/arm/memset-inline-5.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-5.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-5.c	(revision 0)
@@ -0,0 +1,78 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+int a[LEN];
+int b[LEN];
+int c[LEN];
+int d[LEN];
+void
+foo1 (void)
+{
+    memset (a, -1, 16);
+    return;
+}
+
+void
+foo2 (void)
+{
+  memset (b, 1, 25);
+  return;
+}
+
+void
+foo3 (void)
+{
+  memset (c, -1, 19);
+  return;
+}
+
+void
+foo4 (void)
+{
+  memset (d, 1, 23);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo1 ();
+  check ((signed char *)a, 16, sizeof (a), -1);
+
+  foo2 ();
+  check ((signed char *)b, 25, sizeof (b), 1);
+
+  foo3 ();
+  check ((signed char *)c, 19, sizeof (c), -1);
+
+  foo4 ();
+  check ((signed char *)d, 23, sizeof (d), 1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]+memset" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler "vst1" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-not "vstr"  { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
+
Index: gcc/testsuite/gcc.target/arm/memset-inline-6.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-6.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-6.c	(revision 0)
@@ -0,0 +1,68 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+int a[LEN];
+int b[LEN];
+int c[LEN];
+void
+foo1 (void)
+{
+    memset (a, -1, 20);
+    return;
+}
+
+void
+foo2 (void)
+{
+  memset (b, 1, 24);
+  return;
+}
+
+void
+foo3 (void)
+{
+  memset (c, -1, 32);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo1 ();
+  check ((signed char *)a, 20, sizeof (a), -1);
+
+  foo2 ();
+  check ((signed char *)b, 24, sizeof (b), 1);
+
+  foo3 ();
+  check ((signed char *)c, 32, sizeof (c), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]+memset" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-times "vst1" 3 { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-times "vstr" 4 { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
+
+
Index: gcc/testsuite/gcc.target/arm/memset-inline-7.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-7.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-7.c	(revision 0)
@@ -0,0 +1,171 @@
+/* { dg-do run } */
+/* { dg-options "-O2" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+int b[LEN];
+
+void
+init (signed char *arr, int len)
+{
+  int i;
+  for (i = 0; i < len; i++)
+    arr[i] = 0;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+#define TEST(a,l,v)			\
+	init ((signed char*)(a), sizeof (a));		\
+	memset ((a), (v), (l));				\
+	check ((signed char *)(a), (l), sizeof (a), (v));
+int
+main(void)
+{
+  TEST (a, 1, -1);
+  TEST (a, 2, -1);
+  TEST (a, 3, -1);
+  TEST (a, 4, -1);
+  TEST (a, 5, -1);
+  TEST (a, 6, -1);
+  TEST (a, 7, -1);
+  TEST (a, 8, -1);
+  TEST (a, 9, 1);
+  TEST (a, 10, -1);
+  TEST (a, 11, 1);
+  TEST (a, 12, -1);
+  TEST (a, 13, 1);
+  TEST (a, 14, -1);
+  TEST (a, 15, 1);
+  TEST (a, 16, -1);
+  TEST (a, 17, 1);
+  TEST (a, 18, -1);
+  TEST (a, 19, 1);
+  TEST (a, 20, -1);
+  TEST (a, 21, 1);
+  TEST (a, 22, -1);
+  TEST (a, 23, 1);
+  TEST (a, 24, -1);
+  TEST (a, 25, 1);
+  TEST (a, 26, -1);
+  TEST (a, 27, 1);
+  TEST (a, 28, -1);
+  TEST (a, 29, 1);
+  TEST (a, 30, -1);
+  TEST (a, 31, 1);
+  TEST (a, 32, -1);
+  TEST (a, 33, 1);
+  TEST (a, 34, -1);
+  TEST (a, 35, 1);
+  TEST (a, 36, -1);
+  TEST (a, 37, 1);
+  TEST (a, 38, -1);
+  TEST (a, 39, 1);
+  TEST (a, 40, -1);
+  TEST (a, 41, 1);
+  TEST (a, 42, -1);
+  TEST (a, 43, 1);
+  TEST (a, 44, -1);
+  TEST (a, 45, 1);
+  TEST (a, 46, -1);
+  TEST (a, 47, 1);
+  TEST (a, 48, -1);
+  TEST (a, 49, 1);
+  TEST (a, 50, -1);
+  TEST (a, 51, 1);
+  TEST (a, 52, -1);
+  TEST (a, 53, 1);
+  TEST (a, 54, -1);
+  TEST (a, 55, 1);
+  TEST (a, 56, -1);
+  TEST (a, 57, 1);
+  TEST (a, 58, -1);
+  TEST (a, 59, 1);
+  TEST (a, 60, -1);
+  TEST (a, 61, 1);
+  TEST (a, 62, -1);
+  TEST (a, 63, 1);
+  TEST (a, 64, -1);
+
+  TEST (b, 1, -1);
+  TEST (b, 2, -1);
+  TEST (b, 3, -1);
+  TEST (b, 4, -1);
+  TEST (b, 5, -1);
+  TEST (b, 6, -1);
+  TEST (b, 7, -1);
+  TEST (b, 8, -1);
+  TEST (b, 9, 1);
+  TEST (b, 10, -1);
+  TEST (b, 11, 1);
+  TEST (b, 12, -1);
+  TEST (b, 13, 1);
+  TEST (b, 14, -1);
+  TEST (b, 15, 1);
+  TEST (b, 16, -1);
+  TEST (b, 17, 1);
+  TEST (b, 18, -1);
+  TEST (b, 19, 1);
+  TEST (b, 20, -1);
+  TEST (b, 21, 1);
+  TEST (b, 22, -1);
+  TEST (b, 23, 1);
+  TEST (b, 24, -1);
+  TEST (b, 25, 1);
+  TEST (b, 26, -1);
+  TEST (b, 27, 1);
+  TEST (b, 28, -1);
+  TEST (b, 29, 1);
+  TEST (b, 30, -1);
+  TEST (b, 31, 1);
+  TEST (b, 32, -1);
+  TEST (b, 33, 1);
+  TEST (b, 34, -1);
+  TEST (b, 35, 1);
+  TEST (b, 36, -1);
+  TEST (b, 37, 1);
+  TEST (b, 38, -1);
+  TEST (b, 39, 1);
+  TEST (b, 40, -1);
+  TEST (b, 41, 1);
+  TEST (b, 42, -1);
+  TEST (b, 43, 1);
+  TEST (b, 44, -1);
+  TEST (b, 45, 1);
+  TEST (b, 46, -1);
+  TEST (b, 47, 1);
+  TEST (b, 48, -1);
+  TEST (b, 49, 1);
+  TEST (b, 50, -1);
+  TEST (b, 51, 1);
+  TEST (b, 52, -1);
+  TEST (b, 53, 1);
+  TEST (b, 54, -1);
+  TEST (b, 55, 1);
+  TEST (b, 56, -1);
+  TEST (b, 57, 1);
+  TEST (b, 58, -1);
+  TEST (b, 59, 1);
+  TEST (b, 60, -1);
+  TEST (b, 61, 1);
+  TEST (b, 62, -1);
+  TEST (b, 63, 1);
+  TEST (b, 64, -1);
+
+  return 0;
+}
+
Index: gcc/testsuite/gcc.target/arm/memset-inline-8.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-8.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-8.c	(revision 0)
@@ -0,0 +1,44 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline"  } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+    memset (a, -1, 14);
+    return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { scan-assembler "vst1" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-not "vstr" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-1.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-1.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-1.c	(revision 0)
@@ -0,0 +1,39 @@
+/* { dg-do run } */
+/* { dg-options "-save-temps -O2 -fno-inline"  } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+    memset (a, -1, 14);
+    return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-9.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-9.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-9.c	(revision 0)
@@ -0,0 +1,42 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -Os -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+  memset (a, -1, 14);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler "vst1" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-2.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-2.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-2.c	(revision 0)
@@ -0,0 +1,38 @@
+/* { dg-do run } */
+/* { dg-options "-save-temps -Os -fno-inline" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+  memset (a, -1, 14);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+/* { dg-final { scan-assembler "bl?\[ \t\]*memset" { target { ! arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-3.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-3.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-3.c	(revision 0)
@@ -0,0 +1,40 @@
+/* { dg-do run } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+    memset (a, -1, 7);
+    return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 7, sizeof (a), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { scan-assembler-not "strh" { target { ! arm_thumb1 } } } } */
+/* { dg-final { scan-assembler-not "strb" { target { ! arm_thumb1 } } } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-4.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-4.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-4.c	(revision 0)
@@ -0,0 +1,68 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+int a[LEN];
+int b[LEN];
+int c[LEN];
+void
+foo1 (void)
+{
+    memset (a, -1, 8);
+    return;
+}
+
+void
+foo2 (void)
+{
+  memset (b, 1, 12);
+  return;
+}
+
+void
+foo3 (void)
+{
+  memset (c, 1, 13);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  int i;
+
+  foo1 ();
+  check ((signed char *)a, 8, sizeof (a), -1);
+
+  foo2 ();
+  check ((signed char *)b, 12, sizeof (b), 1);
+
+  foo3 ();
+  check ((signed char *)c, 13, sizeof (c), 1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]+memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { scan-assembler-times "vst1\.8" 1 { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler "vstr" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/config/arm/arm.c
===================================================================
--- gcc/config/arm/arm.c	(revision 212750)
+++ gcc/config/arm/arm.c	(working copy)
@@ -1698,7 +1698,9 @@ const struct tune_params arm_slowmul_tune =
   {true, true},					/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false, false,                                 /* Prefer 32-bit encodings.  */
+  false,					/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_fastmul_tune =
@@ -1715,7 +1717,9 @@ const struct tune_params arm_fastmul_tune =
   {true, true},					/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false, false,                                 /* Prefer 32-bit encodings.  */
+  false,					/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 /* StrongARM has early execution of branches, so a sequence that is worth
@@ -1735,7 +1739,9 @@ const struct tune_params arm_strongarm_tune =
   {true, true},					/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false, false,                                 /* Prefer 32-bit encodings.  */
+  false,					/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_xscale_tune =
@@ -1752,7 +1758,9 @@ const struct tune_params arm_xscale_tune =
   {true, true},					/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false, false,                                 /* Prefer 32-bit encodings.  */
+  false,					/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_9e_tune =
@@ -1769,7 +1777,9 @@ const struct tune_params arm_9e_tune =
   {true, true},					/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false, false,                                 /* Prefer 32-bit encodings.  */
+  false,					/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_v6t2_tune =
@@ -1786,7 +1796,9 @@ const struct tune_params arm_v6t2_tune =
   {true, true},					/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false, false,                                 /* Prefer 32-bit encodings.  */
+  false,					/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 /* Generic Cortex tuning.  Use more specific tunings if appropriate.  */
@@ -1804,7 +1816,9 @@ const struct tune_params arm_cortex_tune =
   {true, true},					/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false, false,                                 /* Prefer 32-bit encodings.  */
+  false,					/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a8_tune =
@@ -1821,7 +1835,9 @@ const struct tune_params arm_cortex_a8_tune =
   {true, true},					/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false, false,                                 /* Prefer 32-bit encodings.  */
+  true,						/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a7_tune =
@@ -1838,7 +1854,9 @@ const struct tune_params arm_cortex_a7_tune =
   {true, true},					/* Prefer non short circuit.  */
   &arm_default_vec_cost,			/* Vectorizer costs.  */
   false,					/* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false, false,                                 /* Prefer 32-bit encodings.  */
+  true,						/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a15_tune =
@@ -1855,7 +1873,9 @@ const struct tune_params arm_cortex_a15_tune =
   {true, true},					/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
-  true, true                                    /* Prefer 32-bit encodings.  */
+  true, true,                                   /* Prefer 32-bit encodings.  */
+  true,						/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a53_tune =
@@ -1872,7 +1892,9 @@ const struct tune_params arm_cortex_a53_tune =
   {true, true},					/* Prefer non short circuit.  */
   &arm_default_vec_cost,			/* Vectorizer costs.  */
   false,					/* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false, false,                                 /* Prefer 32-bit encodings.  */
+  false,					/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a57_tune =
@@ -1889,7 +1911,9 @@ const struct tune_params arm_cortex_a57_tune =
   {true, true},                                /* Prefer non short circuit.  */
   &arm_default_vec_cost,                       /* Vectorizer costs.  */
   false,                                       /* Prefer Neon for 64-bits bitops.  */
-  true, true                                   /* Prefer 32-bit encodings.  */
+  true, true,                                  /* Prefer 32-bit encodings.  */
+  false,					/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 /* Branches can be dual-issued on Cortex-A5, so conditional execution is
@@ -1909,7 +1933,9 @@ const struct tune_params arm_cortex_a5_tune =
   {false, false},				/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false, false,                                 /* Prefer 32-bit encodings.  */
+  true,						/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a9_tune =
@@ -1926,7 +1952,9 @@ const struct tune_params arm_cortex_a9_tune =
   {true, true},					/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false, false,                                 /* Prefer 32-bit encodings.  */
+  false,					/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a12_tune =
@@ -1943,7 +1971,9 @@ const struct tune_params arm_cortex_a12_tune =
   {true, true},					/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false, false,                                 /* Prefer 32-bit encodings.  */
+  true,						/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 /* armv7m tuning.  On Cortex-M4 cores for example, MOVW/MOVT take a single
@@ -1967,7 +1997,9 @@ const struct tune_params arm_v7m_tune =
   {false, false},				/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false, false,                                 /* Prefer 32-bit encodings.  */
+  false,					/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 /* The arm_v6m_tune is duplicated from arm_cortex_tune, rather than
@@ -1986,7 +2018,9 @@ const struct tune_params arm_v6m_tune =
   {false, false},				/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false, false,                                 /* Prefer 32-bit encodings.  */
+  false,					/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_fa726te_tune =
@@ -2003,7 +2037,9 @@ const struct tune_params arm_fa726te_tune =
   {true, true},					/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
   false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false, false,                                 /* Prefer 32-bit encodings.  */
+  false,					/* Prefer Neon for stringops.  */
+  8						/* Maximum insns to inline memset.  */
 };
 
 
@@ -16899,6 +16935,14 @@ arm_const_double_inline_cost (rtx val)
 			      NULL_RTX, NULL_RTX, 0, 0));
 }
 
+/* Cost of loading a SImode constant.  */
+static inline int
+arm_const_inline_cost (enum rtx_code code, rtx val)
+{
+  return arm_gen_constant (code, SImode, NULL_RTX, INTVAL (val),
+                           NULL_RTX, NULL_RTX, 1, 0);
+}
+
 /* Return true if it is worthwhile to split a 64-bit constant into two
    32-bit operations.  This is the case if optimizing for size, or
    if we have load delay slots, or if one 32-bit part can be done with
@@ -31514,6 +31565,519 @@ arm_validize_comparison (rtx *comparison, rtx * op
 
 }
 
+/* Maximum number of instructions to set a block of memory.  */
+static int
+arm_block_set_max_insns (void)
+{
+  if (optimize_function_for_size_p (cfun))
+    return 4;
+  else
+    return current_tune->max_insns_inline_memset;
+}
+
+/* Return TRUE if it's profitable to set a block of memory in the
+   non-vectorized case.  VAL is the value to set the memory with.
+   LENGTH is the number of bytes to set.  ALIGN is the alignment of
+   the destination memory in bytes.  UNALIGNED_P is TRUE if the
+   memory can only be set with instructions that meet the
+   destination's alignment requirements.  USE_STRD_P is TRUE if we
+   can use strd to set the memory.  */
+static bool
+arm_block_set_non_vect_profit_p (rtx val,
+				 unsigned HOST_WIDE_INT length,
+				 unsigned HOST_WIDE_INT align,
+				 bool unaligned_p, bool use_strd_p)
+{
+  int num = 0;
+  /* Minimum number of strb/strh/str instructions needed to store a
+     leftover of 0-7 bytes.  */
+  const int leftover[8] = {0, 1, 1, 2, 1, 2, 2, 3};
+
+  if (unaligned_p)
+    {
+      num = arm_const_inline_cost (SET, val);
+      num += length / align + length % align;
+    }
+  else if (use_strd_p)
+    {
+      num = arm_const_double_inline_cost (val);
+      num += (length >> 3) + leftover[length & 7];
+    }
+  else
+    {
+      num = arm_const_inline_cost (SET, val);
+      num += (length >> 2) + leftover[length & 3];
+    }
+
+  /* We may be able to combine last pair STRH/STRB into a single STR
+     by shifting one byte back.  */
+  if (unaligned_access && length > 3 && (length & 3) == 3)
+    num--;
+
+  return (num <= arm_block_set_max_insns ());
+}
+
+/* Return TRUE if it's profitable to set a block of memory in the
+   vectorized case.  LENGTH is the number of bytes to set.  ALIGN
+   is the alignment of the destination memory in bytes.  MODE is
+   the vector mode used to set the memory.  */
+static bool
+arm_block_set_vect_profit_p (unsigned HOST_WIDE_INT length,
+			     unsigned HOST_WIDE_INT align,
+			     enum machine_mode mode)
+{
+  int num;
+  bool unaligned_p = ((align & 3) != 0);
+  unsigned int nelt = GET_MODE_NUNITS (mode);
+
+  /* Instruction loading constant value.  */
+  num = 1;
+  /* Instructions storing the memory.  */
+  num += (length + nelt - 1) / nelt;
+  /* Instructions adjusting the address expression.  The address only
+     needs adjusting when the destination is 4-byte aligned and the
+     leftover bytes can only be stored by a misaligned store instruction.  */
+  if (!unaligned_p && (length & 3) != 0)
+    num++;
+
+  /* Store the first 16 bytes using vst1:v16qi for the aligned case.  */
+  if (!unaligned_p && mode == V16QImode)
+    num--;
+
+  return (num <= arm_block_set_max_insns ());
+}
+
+/* Set a block of memory using vectorization instructions for the
+   unaligned case.  We fill the first LENGTH bytes of the memory
+   area starting from DSTBASE with byte constant VALUE.  ALIGN is
+   the alignment requirement of memory.  Return TRUE if succeeded.  */
+static bool
+arm_block_set_unaligned_vect (rtx dstbase,
+			      unsigned HOST_WIDE_INT length,
+			      unsigned HOST_WIDE_INT value,
+			      unsigned HOST_WIDE_INT align)
+{
+  unsigned int i, j, nelt_v16, nelt_v8, nelt_mode;
+  rtx dst, mem;
+  rtx val_elt, val_vec, reg;
+  rtx rval[MAX_VECT_LEN];
+  rtx (*gen_func) (rtx, rtx);
+  enum machine_mode mode;
+  unsigned HOST_WIDE_INT v = value;
+
+  gcc_assert ((align & 0x3) != 0);
+  nelt_v8 = GET_MODE_NUNITS (V8QImode);
+  nelt_v16 = GET_MODE_NUNITS (V16QImode);
+  if (length >= nelt_v16)
+    {
+      mode = V16QImode;
+      gen_func = gen_movmisalignv16qi;
+    }
+  else
+    {
+      mode = V8QImode;
+      gen_func = gen_movmisalignv8qi;
+    }
+  nelt_mode = GET_MODE_NUNITS (mode);
+  gcc_assert (length >= nelt_mode);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_vect_profit_p (length, align, mode))
+    return false;
+
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+  mem = adjust_automodify_address (dstbase, mode, dst, 0);
+
+  v = sext_hwi (v, BITS_PER_WORD);
+  val_elt = GEN_INT (v);
+  for (j = 0; j < nelt_mode; j++)
+    rval[j] = val_elt;
+
+  reg = gen_reg_rtx (mode);
+  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval));
+  /* Emit instruction loading the constant value.  */
+  emit_move_insn (reg, val_vec);
+
+  /* Handle nelt_mode bytes in a vector.  */
+  for (i = 0; (i + nelt_mode <= length); i += nelt_mode)
+    {
+      emit_insn ((*gen_func) (mem, reg));
+      if (i + 2 * nelt_mode <= length)
+	emit_insn (gen_add2_insn (dst, GEN_INT (nelt_mode)));
+    }
+
+  /* If there are not less than nelt_v8 bytes leftover, we must be in
+     V16QI mode.  */
+  gcc_assert ((i + nelt_v8) > length || mode == V16QImode);
+
+  /* Handle (8, 16) bytes leftover.  */
+  if (i + nelt_v8 < length)
+    {
+      emit_insn (gen_add2_insn (dst, GEN_INT (length - i)));
+      /* We are shifting bytes back, set the alignment accordingly.  */
+      if ((length & 1) != 0 && align >= 2)
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      emit_insn (gen_movmisalignv16qi (mem, reg));
+    }
+  /* Handle (0, 8] bytes leftover.  */
+  else if (i < length && i + nelt_v8 >= length)
+    {
+      if (mode == V16QImode)
+	{
+	  reg = gen_lowpart (V8QImode, reg);
+	  mem = adjust_automodify_address (dstbase, V8QImode, dst, 0);
+	}
+      emit_insn (gen_add2_insn (dst, GEN_INT ((length - i)
+					      + (nelt_mode - nelt_v8))));
+      /* We are shifting bytes back, set the alignment accordingly.  */
+      if ((length & 1) != 0 && align >= 2)
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      emit_insn (gen_movmisalignv8qi (mem, reg));
+    }
+
+  return true;
+}
+
+/* Set a block of memory using vectorization instructions for the
+   aligned case.  We fill the first LENGTH bytes of the memory area
+   starting from DSTBASE with byte constant VALUE.  ALIGN is the
+   alignment requirement of memory.  Return TRUE if succeeded.  */
+static bool
+arm_block_set_aligned_vect (rtx dstbase,
+			    unsigned HOST_WIDE_INT length,
+			    unsigned HOST_WIDE_INT value,
+			    unsigned HOST_WIDE_INT align)
+{
+  unsigned int i, j, nelt_v8, nelt_v16, nelt_mode;
+  rtx dst, addr, mem;
+  rtx val_elt, val_vec, reg;
+  rtx rval[MAX_VECT_LEN];
+  enum machine_mode mode;
+  unsigned HOST_WIDE_INT v = value;
+
+  gcc_assert ((align & 0x3) == 0);
+  nelt_v8 = GET_MODE_NUNITS (V8QImode);
+  nelt_v16 = GET_MODE_NUNITS (V16QImode);
+  if (length >= nelt_v16 && unaligned_access && !BYTES_BIG_ENDIAN)
+    mode = V16QImode;
+  else
+    mode = V8QImode;
+
+  nelt_mode = GET_MODE_NUNITS (mode);
+  gcc_assert (length >= nelt_mode);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_vect_profit_p (length, align, mode))
+    return false;
+
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+
+  v = sext_hwi (v, BITS_PER_WORD);
+  val_elt = GEN_INT (v);
+  for (j = 0; j < nelt_mode; j++)
+    rval[j] = val_elt;
+
+  reg = gen_reg_rtx (mode);
+  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval));
+  /* Emit instruction loading the constant value.  */
+  emit_move_insn (reg, val_vec);
+
+  i = 0;
+  /* Handle first 16 bytes specially using vst1:v16qi instruction.  */
+  if (mode == V16QImode)
+    {
+      mem = adjust_automodify_address (dstbase, mode, dst, 0);
+      emit_insn (gen_movmisalignv16qi (mem, reg));
+      i += nelt_mode;
+      /* Handle (8, 16) bytes leftover using vst1:v16qi again.  */
+      if (i + nelt_v8 < length && i + nelt_v16 > length)
+	{
+	  emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
+	  mem = adjust_automodify_address (dstbase, mode, dst, 0);
+	  /* We are shifting bytes back, set the alignment accordingly.  */
+	  if ((length & 0x3) == 0)
+	    set_mem_align (mem, BITS_PER_UNIT * 4);
+	  else if ((length & 0x1) == 0)
+	    set_mem_align (mem, BITS_PER_UNIT * 2);
+	  else
+	    set_mem_align (mem, BITS_PER_UNIT);
+
+	  emit_insn (gen_movmisalignv16qi (mem, reg));
+	  return true;
+	}
+      /* Fall through for bytes leftover.  */
+      mode = V8QImode;
+      nelt_mode = GET_MODE_NUNITS (mode);
+      reg = gen_lowpart (V8QImode, reg);
+    }
+
+  /* Handle 8 bytes in a vector.  */
+  for (; (i + nelt_mode <= length); i += nelt_mode)
+    {
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, mode, addr, i);
+      emit_move_insn (mem, reg);
+    }
+
+  /* Handle single word leftover by shifting 4 bytes back.  We can
+     use aligned access for this case.  */
+  if (i + UNITS_PER_WORD == length)
+    {
+      addr = plus_constant (Pmode, dst, i - UNITS_PER_WORD);
+      mem = adjust_automodify_address (dstbase, mode,
+				       addr, i - UNITS_PER_WORD);
+      /* We are shifting 4 bytes back, set the alignment accordingly.  */
+      if (align > UNITS_PER_WORD)
+	set_mem_align (mem, BITS_PER_UNIT * UNITS_PER_WORD);
+
+      emit_move_insn (mem, reg);
+    }
+  /* Handle (0, 4), (4, 8) bytes leftover by shifting bytes back.
+     We have to use unaligned access for this case.  */
+  else if (i < length)
+    {
+      emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
+      mem = adjust_automodify_address (dstbase, mode, dst, 0);
+      /* We are shifting bytes back, set the alignment accordingly.  */
+      if ((length & 1) == 0)
+	set_mem_align (mem, BITS_PER_UNIT * 2);
+      else
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      emit_insn (gen_movmisalignv8qi (mem, reg));
+    }
+
+  return true;
+}
+
+/* Set a block of memory using plain strh/strb instructions, using
+   only instructions whose alignment requirement is satisfied by
+   ALIGN.  We fill the first LENGTH bytes of the memory area
+   starting from DSTBASE with byte constant VALUE.  ALIGN is the
+   alignment requirement of memory.  Return TRUE if succeeded.  */
+static bool
+arm_block_set_unaligned_non_vect (rtx dstbase,
+				  unsigned HOST_WIDE_INT length,
+				  unsigned HOST_WIDE_INT value,
+				  unsigned HOST_WIDE_INT align)
+{
+  unsigned int i;
+  rtx dst, addr, mem;
+  rtx val_exp, val_reg, reg;
+  enum machine_mode mode;
+  HOST_WIDE_INT v = value;
+
+  gcc_assert (align == 1 || align == 2);
+
+  if (align == 2)
+    v |= (value << BITS_PER_UNIT);
+
+  v = sext_hwi (v, BITS_PER_WORD);
+  val_exp = GEN_INT (v);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_non_vect_profit_p (val_exp, length,
+					align, true, false))
+    return false;
+
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+  mode = (align == 2 ? HImode : QImode);
+  val_reg = force_reg (SImode, val_exp);
+  reg = gen_lowpart (mode, val_reg);
+
+  for (i = 0; (i + GET_MODE_SIZE (mode) <= length); i += GET_MODE_SIZE (mode))
+    {
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, mode, addr, i);
+      emit_move_insn (mem, reg);
+    }
+
+  /* Handle single byte leftover.  */
+  if (i + 1 == length)
+    {
+      reg = gen_lowpart (QImode, val_reg);
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, QImode, addr, i);
+      emit_move_insn (mem, reg);
+      i++;
+    }
+
+  gcc_assert (i == length);
+  return true;
+}
+
+/* Set a block of memory using plain strd/str/strh/strb instructions,
+   using unaligned stores where the processor supports unaligned
+   semantics for those instructions.  We fill the first LENGTH bytes
+   of the memory area starting from DSTBASE with byte constant VALUE.
+   ALIGN is the alignment requirement of memory.  */
+static bool
+arm_block_set_aligned_non_vect (rtx dstbase,
+				unsigned HOST_WIDE_INT length,
+				unsigned HOST_WIDE_INT value,
+				unsigned HOST_WIDE_INT align)
+{
+  unsigned int i;
+  rtx dst, addr, mem;
+  rtx val_exp, val_reg, reg;
+  unsigned HOST_WIDE_INT v;
+  bool use_strd_p;
+
+  use_strd_p = (length >= 2 * UNITS_PER_WORD && (align & 3) == 0
+		&& TARGET_LDRD && current_tune->prefer_ldrd_strd);
+
+  v = (value | (value << 8) | (value << 16) | (value << 24));
+  if (length < UNITS_PER_WORD)
+    v &= (0xFFFFFFFF >> (UNITS_PER_WORD - length) * BITS_PER_UNIT);
+
+  if (use_strd_p)
+    v |= (v << BITS_PER_WORD);
+  else
+    v = sext_hwi (v, BITS_PER_WORD);
+
+  val_exp = GEN_INT (v);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_non_vect_profit_p (val_exp, length,
+					align, false, use_strd_p))
+    {
+      if (!use_strd_p)
+	return false;
+
+      /* Try without strd.  */
+      v = (v >> BITS_PER_WORD);
+      v = sext_hwi (v, BITS_PER_WORD);
+      val_exp = GEN_INT (v);
+      use_strd_p = false;
+      if (!arm_block_set_non_vect_profit_p (val_exp, length,
+					    align, false, use_strd_p))
+	return false;
+    }
+
+  i = 0;
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+  /* Handle double words using strd if possible.  */
+  if (use_strd_p)
+    {
+      val_reg = force_reg (DImode, val_exp);
+      reg = val_reg;
+      for (; (i + 8 <= length); i += 8)
+	{
+	  addr = plus_constant (Pmode, dst, i);
+	  mem = adjust_automodify_address (dstbase, DImode, addr, i);
+	  emit_move_insn (mem, reg);
+	}
+    }
+  else
+    val_reg = force_reg (SImode, val_exp);
+
+  /* Handle words.  */
+  reg = (use_strd_p ? gen_lowpart (SImode, val_reg) : val_reg);
+  for (; (i + 4 <= length); i += 4)
+    {
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, SImode, addr, i);
+      if ((align & 3) == 0)
+	emit_move_insn (mem, reg);
+      else
+	emit_insn (gen_unaligned_storesi (mem, reg));
+    }
+
+  /* Merge last pair of STRH and STRB into a STR if possible.  */
+  if (unaligned_access && i > 0 && (i + 3) == length)
+    {
+      addr = plus_constant (Pmode, dst, i - 1);
+      mem = adjust_automodify_address (dstbase, SImode, addr, i - 1);
+      /* We are shifting one byte back, set the alignment accordingly.  */
+      if ((align & 1) == 0)
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      /* Most likely this is an unaligned access, and we can't tell at
+	 compilation time.  */
+      emit_insn (gen_unaligned_storesi (mem, reg));
+      return true;
+    }
+
+  /* Handle half word leftover.  */
+  if (i + 2 <= length)
+    {
+      reg = gen_lowpart (HImode, val_reg);
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, HImode, addr, i);
+      if ((align & 1) == 0)
+	emit_move_insn (mem, reg);
+      else
+	emit_insn (gen_unaligned_storehi (mem, reg));
+
+      i += 2;
+    }
+
+  /* Handle single byte leftover.  */
+  if (i + 1 == length)
+    {
+      reg = gen_lowpart (QImode, val_reg);
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, QImode, addr, i);
+      emit_move_insn (mem, reg);
+    }
+
+  return true;
+}
+
+/* Set a block of memory using vectorization instructions for both
+   aligned and unaligned cases.  We fill the first LENGTH bytes of
+   the memory area starting from DSTBASE with byte constant VALUE.
+   ALIGN is the alignment requirement of memory.  */
+static bool
+arm_block_set_vect (rtx dstbase,
+		    unsigned HOST_WIDE_INT length,
+		    unsigned HOST_WIDE_INT value,
+		    unsigned HOST_WIDE_INT align)
+{
+  /* Check whether we need to use unaligned store instruction.  */
+  if (((align & 3) != 0 || (length & 3) != 0)
+      /* Check whether unaligned store instruction is available.  */
+      && (!unaligned_access || BYTES_BIG_ENDIAN))
+    return false;
+
+  if ((align & 3) == 0)
+    return arm_block_set_aligned_vect (dstbase, length, value, align);
+  else
+    return arm_block_set_unaligned_vect (dstbase, length, value, align);
+}
+
+/* Expand a string store operation.  First try to do it using
+   vectorization instructions, then fall back to ARM unaligned
+   access and double-word stores if profitable.  OPERANDS[0] is the
+   destination, OPERANDS[1] is the number of bytes, OPERANDS[2] is
+   the value to initialize the memory with, OPERANDS[3] is the known
+   alignment of the destination.  */
+bool
+arm_gen_setmem (rtx *operands)
+{
+  rtx dstbase = operands[0];
+  unsigned HOST_WIDE_INT length;
+  unsigned HOST_WIDE_INT value;
+  unsigned HOST_WIDE_INT align;
+
+  if (!CONST_INT_P (operands[2]) || !CONST_INT_P (operands[1]))
+    return false;
+
+  length = UINTVAL (operands[1]);
+  if (length > 64)
+    return false;
+
+  value = (UINTVAL (operands[2]) & 0xFF);
+  align = UINTVAL (operands[3]);
+  if (TARGET_NEON && length >= 8
+      && current_tune->string_ops_prefer_neon
+      && arm_block_set_vect (dstbase, length, value, align))
+    return true;
+
+  if (!unaligned_access && (align & 3) != 0)
+    return arm_block_set_unaligned_non_vect (dstbase, length, value, align);
+
+  return arm_block_set_aligned_non_vect (dstbase, length, value, align);
+}
+
 /* Implement the TARGET_ASAN_SHADOW_OFFSET hook.  */
 
 static unsigned HOST_WIDE_INT
Index: gcc/config/arm/arm.md
===================================================================
--- gcc/config/arm/arm.md	(revision 212750)
+++ gcc/config/arm/arm.md	(working copy)
@@ -6716,6 +6716,20 @@
 })
 
 
+(define_expand "setmemsi"
+  [(match_operand:BLK 0 "general_operand" "")
+   (match_operand:SI 1 "const_int_operand" "")
+   (match_operand:SI 2 "const_int_operand" "")
+   (match_operand:SI 3 "const_int_operand" "")]
+  "TARGET_32BIT"
+{
+  if (arm_gen_setmem (operands))
+    DONE;
+
+  FAIL;
+})
+
+
 ;; Move a block of memory if it is word aligned and MORE than 2 words long.
 ;; We could let this apply for blocks of less than this, but it clobbers so
 ;; many registers that there is then probably a better way.
Index: gcc/config/arm/arm-protos.h
===================================================================
--- gcc/config/arm/arm-protos.h	(revision 212750)
+++ gcc/config/arm/arm-protos.h	(working copy)
@@ -278,6 +278,10 @@ struct tune_params
   /* Prefer 32-bit encoding instead of 16-bit encoding where subset of flags
      would be set.  */
   bool disparage_partial_flag_setting_t16_encodings;
+  /* Prefer to inline string operations like memset by using Neon.  */
+  bool string_ops_prefer_neon;
+  /* Maximum number of instructions to inline calls to memset.  */
+  int max_insns_inline_memset;
 };
 
 extern const struct tune_params *current_tune;
@@ -290,6 +294,7 @@ extern void arm_emit_coreregs_64bit_shift (enum rt
 extern bool arm_validize_comparison (rtx *, rtx *, rtx *);
 #endif /* RTX_CODE */
 
+extern bool arm_gen_setmem (rtx *);
 extern void arm_expand_vec_perm (rtx target, rtx op0, rtx op1, rtx sel);
 extern bool arm_expand_vec_perm_const (rtx target, rtx op0, rtx op1, rtx sel);
 



Thread overview: 14+ messages
2014-04-30  5:56 [PATCH ARM] Improve ARM memset inlining bin.cheng
2014-05-02 13:59 ` Richard Earnshaw
2014-05-05  7:21   ` bin.cheng
2014-05-06  5:00     ` bin.cheng
2014-05-12  3:17       ` Bin.Cheng
2014-05-19  6:40         ` Bin.Cheng
2014-05-28  8:53           ` bin.cheng
2014-06-04  9:11             ` bin.cheng
2014-06-27  8:21       ` Ramana Radhakrishnan
2014-07-04 12:18         ` Bin Cheng
2014-07-08  8:32           ` Bin.Cheng
2014-07-08  8:56           ` Ramana Radhakrishnan
2014-07-08  9:57             ` Bin.Cheng
2014-07-21 12:45               ` Bin.Cheng
