* [PATCH ARM] Improve ARM memset inlining
From: bin.cheng @ 2014-04-30  5:56 UTC
  To: gcc-patches


Hi,
This patch expands small memset calls into direct memory-set instructions
by introducing the "setmemsi" pattern.  For processors without NEON
support, it expands memset using general store instructions, for example
strd for 4-byte aligned addresses.  For processors with NEON support, it
expands memset using NEON instructions such as vstr and the various
vst1.* instructions, for both aligned and unaligned cases.
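
For instance (an illustrative example, not taken from the patch's
testsuite), a call such as memset (p, 0, 8) with p known to be 4-byte
aligned could, on a core that prefers LDRD/STRD, expand to something
like:
  mov        r2, #0
  mov        r3, #0
  strd       r2, r3, [r0]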

This patch depends on
http://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html; otherwise vst1.64
will be generated for 32-bit aligned memory units.

There is also one piece of leftover work in this patch: since vst1.*
instructions only support the post-increment addressing mode, the inlined
memset for unaligned NEON cases should look like:
  vmov.i32   q8, #...
  vst1.8     {q8}, [r3]!
  vst1.8     {q8}, [r3]!
  vst1.8     {q8}, [r3]!
  vst1.8     {q8}, [r3]
But for now GCC can't do this, and the code below is generated instead:
  vmov.i32   q8, #...
  vst1.8     {q8}, [r3]
  add        r2,   r3,  #16
  add        r3,   r2,  #16
  vst1.8     {q8}, [r2]
  vst1.8     {q8}, [r3]
  add        r2,   r3,  #16
  vst1.8     {q8}, [r2]

I investigated this issue; the root cause lies in the rtx costs returned
by the ARM backend.  That said, I think this is a separate issue and
should be fixed in a separate patch.
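
For completeness, the aligned NEON case (not shown above) uses vstr for
the 8-byte stores; e.g., a hypothetical memset (p, 0xff, 8) with p
8-byte aligned would expand along the lines of:
  vmov.i8    d16, #0xff
  vstr       d16, [r0]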

Bootstrapped and regression-tested on cortex-a15, with and without NEON
support.  Is it OK?

Thanks,
bin


gcc/ChangeLog
2014-04-29  Bin Cheng  <bin.cheng@arm.com>

	PR target/55701
	* config/arm/arm.md (setmemsi): New pattern.
	* config/arm/arm-protos.h (struct tune_params): New field.
	(arm_gen_setmem): New prototype.
	* config/arm/arm.c (arm_slowmul_tune): Initialize new field.
	(arm_fastmul_tune, arm_strongarm_tune, arm_xscale_tune): Ditto.
	(arm_9e_tune, arm_v6t2_tune, arm_cortex_tune): Ditto.
	(arm_cortex_a8_tune, arm_cortex_a7_tune): Ditto.
	(arm_cortex_a15_tune, arm_cortex_a53_tune): Ditto.
	(arm_cortex_a57_tune, arm_cortex_a5_tune): Ditto.
	(arm_cortex_a9_tune, arm_cortex_a12_tune): Ditto.
	(arm_v7m_tune, arm_v6m_tune, arm_fa726te_tune): Ditto.
	(arm_const_inline_cost): New function.
	(arm_block_set_max_insns): New function.
	(arm_block_set_straight_profit_p): New function.
	(arm_block_set_vect_profit_p): New function.
	(arm_block_set_unaligned_vect): New function.
	(arm_block_set_aligned_vect): New function.
	(arm_block_set_unaligned_straight): New function.
	(arm_block_set_aligned_straight): New function.
	(arm_block_set_vect, arm_gen_setmem): New functions.

gcc/testsuite/ChangeLog
2014-04-29  Bin Cheng  <bin.cheng@arm.com>

	PR target/55701
	* gcc.target/arm/memset-inline-1.c: New test.
	* gcc.target/arm/memset-inline-2.c: New test.
	* gcc.target/arm/memset-inline-3.c: New test.
	* gcc.target/arm/memset-inline-4.c: New test.
	* gcc.target/arm/memset-inline-5.c: New test.
	* gcc.target/arm/memset-inline-6.c: New test.
	* gcc.target/arm/memset-inline-7.c: New test.
	* gcc.target/arm/memset-inline-8.c: New test.
	* gcc.target/arm/memset-inline-9.c: New test.

[-- Attachment #2: j1328-20140429.txt --]

Index: gcc/config/arm/arm.c
===================================================================
--- gcc/config/arm/arm.c	(revision 209852)
+++ gcc/config/arm/arm.c	(working copy)
@@ -1585,10 +1585,11 @@ const struct tune_params arm_slowmul_tune =
   true,						/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 const struct tune_params arm_fastmul_tune =
@@ -1602,10 +1603,11 @@ const struct tune_params arm_fastmul_tune =
   true,						/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 /* StrongARM has early execution of branches, so a sequence that is worth
@@ -1622,10 +1624,11 @@ const struct tune_params arm_strongarm_tune =
   true,						/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 const struct tune_params arm_xscale_tune =
@@ -1639,10 +1642,11 @@ const struct tune_params arm_xscale_tune =
   true,						/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 const struct tune_params arm_9e_tune =
@@ -1656,10 +1660,11 @@ const struct tune_params arm_9e_tune =
   true,						/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 const struct tune_params arm_v6t2_tune =
@@ -1673,10 +1678,11 @@ const struct tune_params arm_v6t2_tune =
   false,					/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 /* Generic Cortex tuning.  Use more specific tunings if appropriate.  */
@@ -1691,10 +1697,11 @@ const struct tune_params arm_cortex_tune =
   false,					/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 const struct tune_params arm_cortex_a8_tune =
@@ -1708,10 +1715,11 @@ const struct tune_params arm_cortex_a8_tune =
   false,					/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 const struct tune_params arm_cortex_a7_tune =
@@ -1725,10 +1733,11 @@ const struct tune_params arm_cortex_a7_tune =
   false,					/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,			/* Vectorizer costs.  */
-  false,					/* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  true                                  /* Prefer Neon for stringops.  */
 };
 
 const struct tune_params arm_cortex_a15_tune =
@@ -1742,10 +1751,11 @@ const struct tune_params arm_cortex_a15_tune =
   false,					/* Prefer constant pool.  */
   arm_default_branch_cost,
   true,						/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  true, true                                    /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  true, true,                           /* Prefer 32-bit encodings.  */
+  true                                  /* Prefer Neon for stringops.  */
 };
 
 const struct tune_params arm_cortex_a53_tune =
@@ -1759,10 +1769,11 @@ const struct tune_params arm_cortex_a53_tune =
   false,					/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,			/* Vectorizer costs.  */
-  false,					/* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 const struct tune_params arm_cortex_a57_tune =
@@ -1775,11 +1786,12 @@ const struct tune_params arm_cortex_a57_tune =
   ARM_PREFETCH_NOT_BENEFICIAL,
   false,                                       /* Prefer constant pool.  */
   arm_default_branch_cost,
-  true,                                       /* Prefer LDRD/STRD.  */
-  {true, true},                                /* Prefer non short circuit.  */
-  &arm_default_vec_cost,                       /* Vectorizer costs.  */
-  false,                                       /* Prefer Neon for 64-bits bitops.  */
-  true, true                                   /* Prefer 32-bit encodings.  */
+  true,                                        /* Prefer LDRD/STRD.  */
+  {true, true},                         /* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  true, true,                           /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 /* Branches can be dual-issued on Cortex-A5, so conditional execution is
@@ -1796,10 +1808,11 @@ const struct tune_params arm_cortex_a5_tune =
   false,					/* Prefer constant pool.  */
   arm_cortex_a5_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {false, false},				/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {false, false},			/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 const struct tune_params arm_cortex_a9_tune =
@@ -1813,10 +1826,11 @@ const struct tune_params arm_cortex_a9_tune =
   false,					/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 const struct tune_params arm_cortex_a12_tune =
@@ -1830,10 +1844,11 @@ const struct tune_params arm_cortex_a12_tune =
   false,					/* Prefer constant pool.  */
   arm_default_branch_cost,
   true,						/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  true                                  /* Prefer Neon for stringops.  */
 };
 
 /* armv7m tuning.  On Cortex-M4 cores for example, MOVW/MOVT take a single
@@ -1854,10 +1869,11 @@ const struct tune_params arm_v7m_tune =
   true,						/* Prefer constant pool.  */
   arm_cortex_m_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {false, false},				/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {false, false},			/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 /* The arm_v6m_tune is duplicated from arm_cortex_tune, rather than
@@ -1873,10 +1889,11 @@ const struct tune_params arm_v6m_tune =
   false,					/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {false, false},				/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {false, false},			/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 const struct tune_params arm_fa726te_tune =
@@ -1890,10 +1907,11 @@ const struct tune_params arm_fa726te_tune =
   true,						/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,                /* Vectorizer costs.  */
+  false,                                /* Prefer Neon for 64-bits bitops.  */
+  false, false,                         /* Prefer 32-bit encodings.  */
+  false                                 /* Prefer Neon for stringops.  */
 };
 
 
@@ -16788,6 +16806,14 @@ arm_const_double_inline_cost (rtx val)
 			      NULL_RTX, NULL_RTX, 0, 0));
 }
 
+/* Cost of loading a SImode constant.  */
+static inline int
+arm_const_inline_cost (rtx val)
+{
+  return arm_gen_constant (SET, SImode, NULL_RTX, INTVAL (val),
+                           NULL_RTX, NULL_RTX, 0, 0);
+}
+
 /* Return true if it is worthwhile to split a 64-bit constant into two
    32-bit operations.  This is the case if optimizing for size, or
    if we have load delay slots, or if one 32-bit part can be done with
@@ -31350,6 +31383,504 @@ arm_validize_comparison (rtx *comparison, rtx * op
 
 }
 
+/* Maximum number of instructions to set block of memory.  */
+static int
+arm_block_set_max_insns (void)
+{
+  return (optimize_function_for_size_p (cfun) ? 4 : 8);
+}
+
+/* Return TRUE if it's profitable to set a block of memory for the
+   straight (non-vectorized) case.  */
+static bool
+arm_block_set_straight_profit_p (rtx val,
+				 unsigned HOST_WIDE_INT length,
+				 unsigned HOST_WIDE_INT align,
+				 bool unaligned_p, bool use_strd_p)
+{
+  int num = 0;
+  /* For a leftover of 0-7 bytes, we can set the memory block using
+     strb/strh/str with the minimum number of instructions.  */
+  int leftover[8] = {0, 1, 1, 2, 1, 2, 2, 3};
+
+  if (unaligned_p)
+    {
+      num = arm_const_inline_cost (val);
+      num += length / align + length % align;
+    }
+  else if (use_strd_p)
+    {
+      num = arm_const_double_inline_cost (val);
+      num += (length >> 3) + leftover[length & 7];
+    }
+  else
+    {
+      num = arm_const_inline_cost (val);
+      num += (length >> 2) + leftover[length & 3];
+    }
+
+  /* We may be able to combine the last STRH/STRB pair into a single STR
+     by shifting one byte back.  */
+  if (unaligned_access && length > 3 && (length & 3) == 3)
+    num--;
+
+  return (num <= arm_block_set_max_insns ());
+}
+
+/* Return TRUE if it's profitable to set a block of memory for the
+   vector case.  */
+static bool
+arm_block_set_vect_profit_p (unsigned HOST_WIDE_INT length,
+			     unsigned HOST_WIDE_INT align ATTRIBUTE_UNUSED,
+			     bool unaligned_p, enum machine_mode mode)
+{
+  int num;
+  unsigned int nelt = GET_MODE_NUNITS (mode);
+
+  /* Number of instructions for loading the constant value.  */
+  num = 1;
+  /* Number of store instructions.  */
+  num += (length + nelt - 1) / nelt;
+  /* Number of address-adjusting instructions.  */
+  if (unaligned_p)
+    /* For the unaligned case, it's one less than the number of stores.  */
+    num += (length + nelt - 1) / nelt - 1;
+  else if ((length & 3) != 0)
+    /* For the aligned case, add one if the leftover bytes can only be
+       stored with a misaligned store instruction.  */
+    num++;
+
+  /* Store the first 16 bytes using vst1:v16qi for the aligned case.  */
+  if (!unaligned_p && mode == V16QImode)
+    num--;
+
+  return (num <= arm_block_set_max_insns ());
+}
+
+/* Set a block of memory using vectorization instructions for the
+   unaligned case.  We fill the first LENGTH bytes of the memory
+   area starting from DSTBASE with byte constant VALUE.  ALIGN is
+   the alignment requirement of memory.  */
+static bool
+arm_block_set_unaligned_vect (rtx dstbase,
+			      unsigned HOST_WIDE_INT length,
+			      unsigned HOST_WIDE_INT value,
+			      unsigned HOST_WIDE_INT align)
+{
+  unsigned int i = 0, j = 0, nelt_v16, nelt_v8, nelt_mode;
+  rtx dst, mem;
+  rtx val_elt, val_vec, reg;
+  rtx rval[MAX_VECT_LEN];
+  rtx (*gen_func) (rtx, rtx);
+  enum machine_mode mode;
+  unsigned HOST_WIDE_INT v = value;
+
+  gcc_assert ((align & 0x3) != 0);
+  nelt_v8 = GET_MODE_NUNITS (V8QImode);
+  nelt_v16 = GET_MODE_NUNITS (V16QImode);
+  if (length >= nelt_v16)
+    {
+      mode = V16QImode;
+      gen_func = gen_movmisalignv16qi;
+    }
+  else
+    {
+      mode = V8QImode;
+      gen_func = gen_movmisalignv8qi;
+    }
+  nelt_mode = GET_MODE_NUNITS (mode);
+  gcc_assert (length >= nelt_mode);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_vect_profit_p (length, align, true, mode))
+    return false;
+
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+  mem = adjust_automodify_address (dstbase, mode, dst, 0);
+
+  v = sext_hwi (v, BITS_PER_WORD);
+  val_elt = GEN_INT (v);
+  for (; j < nelt_mode; j++)
+    rval[j] = val_elt;
+
+  reg = gen_reg_rtx (mode);
+  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval));
+  /* Emit instruction loading the constant value.  */
+  emit_move_insn (reg, val_vec);
+
+  /* Handle nelt_mode bytes in a vector.  */
+  for (; (i + nelt_mode <= length); i += nelt_mode)
+    {
+      emit_insn ((*gen_func) (mem, reg));
+      if (i + 2 * nelt_mode <= length)
+	emit_insn (gen_add2_insn (dst, GEN_INT (nelt_mode)));
+    }
+
+  if (i + nelt_v8 <= length)
+    gcc_assert (mode == V16QImode);
+
+  /* Handle (8, 16) bytes leftover.  */
+  if (i + nelt_v8 < length)
+    {
+      emit_insn (gen_add2_insn (dst, GEN_INT (length - i)));
+      /* We are shifting bytes back, set the alignment accordingly.  */
+      if ((length & 1) != 0 && align >= 2)
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      emit_insn (gen_movmisalignv16qi (mem, reg));
+    }
+  /* Handle (0, 8] bytes leftover.  */
+  else if (i < length && i + nelt_v8 >= length)
+    {
+      if (mode == V16QImode)
+	{
+	  reg = gen_lowpart (V8QImode, reg);
+	  mem = adjust_automodify_address (dstbase, V8QImode, dst, 0);
+	}
+      emit_insn (gen_add2_insn (dst, GEN_INT ((length - i)
+					      + (nelt_mode - nelt_v8))));
+      /* We are shifting bytes back, set the alignment accordingly.  */
+      if ((length & 1) != 0 && align >= 2)
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      emit_insn (gen_movmisalignv8qi (mem, reg));
+    }
+
+  return true;
+}
+
+/* Set a block of memory using vectorization instructions for the
+   aligned case.  We fill the first LENGTH bytes of the memory area
+   starting from DSTBASE with byte constant VALUE.  ALIGN is the
+   alignment requirement of memory.  */
+static bool
+arm_block_set_aligned_vect (rtx dstbase,
+			    unsigned HOST_WIDE_INT length,
+			    unsigned HOST_WIDE_INT value,
+			    unsigned HOST_WIDE_INT align)
+{
+  unsigned int i = 0, j = 0, nelt_v8, nelt_v16, nelt_mode;
+  rtx dst, addr, mem;
+  rtx val_elt, val_vec, reg;
+  rtx rval[MAX_VECT_LEN];
+  enum machine_mode mode;
+  unsigned HOST_WIDE_INT v = value;
+
+  gcc_assert ((align & 0x3) == 0);
+  nelt_v8 = GET_MODE_NUNITS (V8QImode);
+  nelt_v16 = GET_MODE_NUNITS (V16QImode);
+  if (length >= nelt_v16 && unaligned_access && !BYTES_BIG_ENDIAN)
+    mode = V16QImode;
+  else
+    mode = V8QImode;
+
+  nelt_mode = GET_MODE_NUNITS (mode);
+  gcc_assert (length >= nelt_mode);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_vect_profit_p (length, align, false, mode))
+    return false;
+
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+
+  v = sext_hwi (v, BITS_PER_WORD);
+  val_elt = GEN_INT (v);
+  for (; j < nelt_mode; j++)
+    rval[j] = val_elt;
+
+  reg = gen_reg_rtx (mode);
+  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval));
+  /* Emit instruction loading the constant value.  */
+  emit_move_insn (reg, val_vec);
+
+  /* Handle first 16 bytes specially using vst1:v16qi instruction.  */
+  if (mode == V16QImode)
+    {
+      mem = adjust_automodify_address (dstbase, mode, dst, 0);
+      emit_insn (gen_movmisalignv16qi (mem, reg));
+      i += nelt_mode;
+      /* Handle (8, 16) bytes leftover using vst1:v16qi again.  */
+      if (i + nelt_v8 < length && i + nelt_v16 > length)
+	{
+	  emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
+	  mem = adjust_automodify_address (dstbase, mode, dst, 0);
+	  /* We are shifting bytes back, set the alignment accordingly.  */
+	  if ((length & 0x3) == 0)
+	    set_mem_align (mem, BITS_PER_UNIT * 4);
+	  else if ((length & 0x1) == 0)
+	    set_mem_align (mem, BITS_PER_UNIT * 2);
+	  else
+	    set_mem_align (mem, BITS_PER_UNIT);
+
+	  emit_insn (gen_movmisalignv16qi (mem, reg));
+	  return true;
+	}
+      /* Fall through for bytes leftover.  */
+      mode = V8QImode;
+      nelt_mode = GET_MODE_NUNITS (mode);
+      reg = gen_lowpart (V8QImode, reg);
+    }
+
+  /* Handle 8 bytes in a vector.  */
+  for (; (i + nelt_mode <= length); i += nelt_mode)
+    {
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, mode, addr, i);
+      emit_move_insn (mem, reg);
+    }
+
+  /* Handle single word leftover by shifting 4 bytes back.  We can
+     use aligned access for this case.  */
+  if (i + UNITS_PER_WORD == length)
+    {
+      addr = plus_constant (Pmode, dst, i - UNITS_PER_WORD);
+      mem = adjust_automodify_address (dstbase, mode,
+				       addr, i - UNITS_PER_WORD);
+      /* We are shifting 4 bytes back, set the alignment accordingly.  */
+      if (align > UNITS_PER_WORD)
+	set_mem_align (mem, BITS_PER_UNIT * UNITS_PER_WORD);
+
+      emit_move_insn (mem, reg);
+    }
+  /* Handle (0, 4), (4, 8) bytes leftover by shifting bytes back.
+     We have to use unaligned access for this case.  */
+  else if (i < length)
+    {
+      emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
+      mem = adjust_automodify_address (dstbase, mode, dst, 0);
+      /* We are shifting bytes back, set the alignment accordingly.  */
+      if ((length & 1) == 0)
+	set_mem_align (mem, BITS_PER_UNIT * 2);
+      else
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      emit_insn (gen_movmisalignv8qi (mem, reg));
+    }
+
+  return true;
+}
+
+/* Set a block of memory using plain strh/strb instructions, only
+   using instructions allowed by ALIGN on the processor.  We fill the
+   first LENGTH bytes of the memory area starting from DSTBASE
+   with byte constant VALUE.  ALIGN is the alignment requirement
+   of memory.  */
+static bool
+arm_block_set_unaligned_straight (rtx dstbase,
+				  unsigned HOST_WIDE_INT length,
+				  unsigned HOST_WIDE_INT value,
+				  unsigned HOST_WIDE_INT align)
+{
+  unsigned int i;
+  rtx dst, addr, mem;
+  rtx val_exp, val_reg, reg;
+  enum machine_mode mode;
+  HOST_WIDE_INT v = value;
+
+  gcc_assert (align == 1 || align == 2);
+
+  if (align == 2)
+    v |= (value << BITS_PER_UNIT);
+
+  v = sext_hwi (v, BITS_PER_WORD);
+  val_exp = GEN_INT (v);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_straight_profit_p (val_exp, length,
+					align, true, false))
+    return false;
+
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+  mode = (align == 2 ? HImode : QImode);
+  val_reg = force_reg (SImode, val_exp);
+  reg = gen_lowpart (mode, val_reg);
+
+  for (i = 0; (i + GET_MODE_SIZE (mode) <= length); i += GET_MODE_SIZE (mode))
+    {
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, mode, addr, i);
+      emit_move_insn (mem, reg);
+    }
+
+  /* Handle single byte leftover.  */
+  if (i + 1 == length)
+    {
+      reg = gen_lowpart (QImode, val_reg);
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, QImode, addr, i);
+      emit_move_insn (mem, reg);
+      i++;
+    }
+
+  gcc_assert (i == length);
+  return true;
+}
+
+/* Set a block of memory using plain strd/str/strh/strb instructions,
+   to permit unaligned copies on processors which support unaligned
+   semantics for those instructions.  We fill the first LENGTH bytes
+   of the memory area starting from DSTBASE with byte constant VALUE.
+   ALIGN is the alignment requirement of memory.  */
+static bool
+arm_block_set_aligned_straight (rtx dstbase,
+				unsigned HOST_WIDE_INT length,
+				unsigned HOST_WIDE_INT value,
+				unsigned HOST_WIDE_INT align)
+{
+  unsigned int i = 0;
+  rtx dst, addr, mem;
+  rtx val_exp, val_reg, reg;
+  unsigned HOST_WIDE_INT v;
+  bool use_strd_p;
+
+  use_strd_p = (length >= 2 * UNITS_PER_WORD && (align & 3) == 0
+		&& TARGET_LDRD && current_tune->prefer_ldrd_strd);
+
+  v = (value | (value << 8) | (value << 16) | (value << 24));
+  if (length < UNITS_PER_WORD)
+    v &= (0xFFFFFFFF >> (UNITS_PER_WORD - length) * BITS_PER_UNIT);
+
+  if (use_strd_p)
+    v |= (v << BITS_PER_WORD);
+  else
+    v = sext_hwi (v, BITS_PER_WORD);
+
+  val_exp = GEN_INT (v);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_straight_profit_p (val_exp, length,
+					align, false, use_strd_p))
+    {
+      /* Try without strd.  */
+      v = (v >> BITS_PER_WORD);
+      v = sext_hwi (v, BITS_PER_WORD);
+      val_exp = GEN_INT (v);
+      use_strd_p = false;
+      if (!arm_block_set_straight_profit_p (val_exp, length,
+					    align, false, use_strd_p))
+	return false;
+    }
+
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+  /* Handle double words using strd if possible.  */
+  if (use_strd_p)
+    {
+      val_reg = force_reg (DImode, val_exp);
+      reg = val_reg;
+      for (; (i + 8 <= length); i += 8)
+	{
+	  addr = plus_constant (Pmode, dst, i);
+	  mem = adjust_automodify_address (dstbase, DImode, addr, i);
+	  emit_move_insn (mem, reg);
+	}
+    }
+  else
+    val_reg = force_reg (SImode, val_exp);
+
+  /* Handle words.  */
+  reg = (use_strd_p ? gen_lowpart (SImode, val_reg) : val_reg);
+  for (; (i + 4 <= length); i += 4)
+    {
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, SImode, addr, i);
+      if ((align & 3) == 0)
+	emit_move_insn (mem, reg);
+      else
+	emit_insn (gen_unaligned_storesi (mem, reg));
+    }
+
+  /* Merge last pair of STRH and STRB into a STR if possible.  */
+  if (unaligned_access && i > 0 && (i + 3) == length)
+    {
+      addr = plus_constant (Pmode, dst, i - 1);
+      mem = adjust_automodify_address (dstbase, SImode, addr, i - 1);
+      /* We are shifting one byte back, set the alignment accordingly.  */
+      if ((align & 1) == 0)
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      /* Most likely this is an unaligned access, and we can't tell at
+	 compilation time.  */
+      emit_insn (gen_unaligned_storesi (mem, reg));
+      return true;
+    }
+
+  /* Handle half word leftover.  */
+  if (i + 2 <= length)
+    {
+      reg = gen_lowpart (HImode, val_reg);
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, HImode, addr, i);
+      if ((align & 1) == 0)
+	emit_move_insn (mem, reg);
+      else
+	emit_insn (gen_unaligned_storehi (mem, reg));
+
+      i += 2;
+    }
+
+  /* Handle single byte leftover.  */
+  if (i + 1 == length)
+    {
+      reg = gen_lowpart (QImode, val_reg);
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, QImode, addr, i);
+      emit_move_insn (mem, reg);
+    }
+
+  return true;
+}
+
+/* Set a block of memory using vectorization instructions for both
+   aligned and unaligned cases.  We fill the first LENGTH bytes of
+   the memory area starting from DSTBASE with byte constant VALUE.
+   ALIGN is the alignment requirement of memory.  */
+static bool
+arm_block_set_vect (rtx dstbase,
+		    unsigned HOST_WIDE_INT length,
+		    unsigned HOST_WIDE_INT value,
+		    unsigned HOST_WIDE_INT align)
+{
+  /* Check whether we need to use unaligned store instruction.  */
+  if (((align & 3) != 0 || (length & 3) != 0)
+      /* Check whether unaligned store instruction is available.  */
+      && (!unaligned_access || BYTES_BIG_ENDIAN))
+    return false;
+
+  if ((align & 3) == 0)
+    return arm_block_set_aligned_vect (dstbase, length, value, align);
+  else
+    return arm_block_set_unaligned_vect (dstbase, length, value, align);
+}
+
+/* Expand a string store operation.  First we try to do it using
+   vectorization instructions, then with ARM unaligned access and
+   double-word stores if profitable.  OPERANDS[0] is the destination,
+   OPERANDS[1] is the number of bytes, OPERANDS[2] is the value used to
+   initialize the memory, OPERANDS[3] is the known alignment of the
+   destination.  */
+bool
+arm_gen_setmem (rtx *operands)
+{
+  rtx dstbase = operands[0];
+  unsigned HOST_WIDE_INT length;
+  unsigned HOST_WIDE_INT value;
+  unsigned HOST_WIDE_INT align;
+
+  if (!CONST_INT_P (operands[2]) || !CONST_INT_P (operands[1]))
+    return false;
+
+  length = UINTVAL (operands[1]);
+  if (length > 64)
+    return false;
+
+  value = (UINTVAL (operands[2]) & 0xFF);
+  align = UINTVAL (operands[3]);
+  if (TARGET_NEON && length >= 8
+      && current_tune->string_ops_prefer_neon
+      && arm_block_set_vect (dstbase, length, value, align))
+    return true;
+
+  if (!unaligned_access && (align & 3) != 0)
+    return arm_block_set_unaligned_straight (dstbase, length, value, align);
+
+  return arm_block_set_aligned_straight (dstbase, length, value, align);
+}
+
 /* Implement the TARGET_ASAN_SHADOW_OFFSET hook.  */
 
 static unsigned HOST_WIDE_INT
Index: gcc/config/arm/arm-protos.h
===================================================================
--- gcc/config/arm/arm-protos.h	(revision 209852)
+++ gcc/config/arm/arm-protos.h	(working copy)
@@ -277,6 +277,8 @@ struct tune_params
   /* Prefer 32-bit encoding instead of 16-bit encoding where subset of flags
      would be set.  */
   bool disparage_partial_flag_setting_t16_encodings;
+  /* Prefer to inline string operations like memset by using Neon.  */
+  bool string_ops_prefer_neon;
 };
 
 extern const struct tune_params *current_tune;
@@ -289,6 +291,7 @@ extern void arm_emit_coreregs_64bit_shift (enum rt
 extern bool arm_validize_comparison (rtx *, rtx *, rtx *);
 #endif /* RTX_CODE */
 
+extern bool arm_gen_setmem (rtx *);
 extern void arm_expand_vec_perm (rtx target, rtx op0, rtx op1, rtx sel);
 extern bool arm_expand_vec_perm_const (rtx target, rtx op0, rtx op1, rtx sel);
 
Index: gcc/config/arm/arm.md
===================================================================
--- gcc/config/arm/arm.md	(revision 209852)
+++ gcc/config/arm/arm.md	(working copy)
@@ -7555,6 +7555,20 @@
 })
 
 
+(define_expand "setmemsi"
+  [(match_operand:BLK 0 "general_operand" "")
+   (match_operand:SI 1 "const_int_operand" "")
+   (match_operand:SI 2 "const_int_operand" "")
+   (match_operand:SI 3 "const_int_operand" "")]
+  "TARGET_32BIT"
+{
+  if (arm_gen_setmem (operands))
+    DONE;
+
+  FAIL;
+})
+
+
 ;; Move a block of memory if it is word aligned and MORE than 2 words long.
 ;; We could let this apply for blocks of less than this, but it clobbers so
 ;; many registers that there is then probably a better way.
Index: gcc/testsuite/gcc.target/arm/memset-inline-6.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-6.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-6.c	(revision 0)
@@ -0,0 +1,68 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+int a[LEN];
+int b[LEN];
+int c[LEN];
+void
+foo1 (void)
+{
+    memset (a, -1, 20);
+    return;
+}
+
+void
+foo2 (void)
+{
+  memset (b, 1, 24);
+  return;
+}
+
+void
+foo3 (void)
+{
+  memset (c, -1, 32);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo1 ();
+  check ((signed char *)a, 20, sizeof (a), -1);
+
+  foo2 ();
+  check ((signed char *)b, 24, sizeof (b), 1);
+
+  foo3 ();
+  check ((signed char *)c, 32, sizeof (c), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]+memset" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-times "vst1" 3 { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-times "vstr" 4 { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
+
+
Index: gcc/testsuite/gcc.target/arm/memset-inline-7.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-7.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-7.c	(revision 0)
@@ -0,0 +1,171 @@
+/* { dg-do run } */
+/* { dg-options "-O2" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+int b[LEN];
+
+void
+init (signed char *arr, int len)
+{
+  int i;
+  for (i = 0; i < len; i++)
+    arr[i] = 0;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+#define TEST(a,l,v)			\
+	init ((signed char*)(a), sizeof (a));		\
+	memset ((a), (v), (l));				\
+	check ((signed char *)(a), (l), sizeof (a), (v));
+int
+main(void)
+{
+  TEST (a, 1, -1);
+  TEST (a, 2, -1);
+  TEST (a, 3, -1);
+  TEST (a, 4, -1);
+  TEST (a, 5, -1);
+  TEST (a, 6, -1);
+  TEST (a, 7, -1);
+  TEST (a, 8, -1);
+  TEST (a, 9, 1);
+  TEST (a, 10, -1);
+  TEST (a, 11, 1);
+  TEST (a, 12, -1);
+  TEST (a, 13, 1);
+  TEST (a, 14, -1);
+  TEST (a, 15, 1);
+  TEST (a, 16, -1);
+  TEST (a, 17, 1);
+  TEST (a, 18, -1);
+  TEST (a, 19, 1);
+  TEST (a, 20, -1);
+  TEST (a, 21, 1);
+  TEST (a, 22, -1);
+  TEST (a, 23, 1);
+  TEST (a, 24, -1);
+  TEST (a, 25, 1);
+  TEST (a, 26, -1);
+  TEST (a, 27, 1);
+  TEST (a, 28, -1);
+  TEST (a, 29, 1);
+  TEST (a, 30, -1);
+  TEST (a, 31, 1);
+  TEST (a, 32, -1);
+  TEST (a, 33, 1);
+  TEST (a, 34, -1);
+  TEST (a, 35, 1);
+  TEST (a, 36, -1);
+  TEST (a, 37, 1);
+  TEST (a, 38, -1);
+  TEST (a, 39, 1);
+  TEST (a, 40, -1);
+  TEST (a, 41, 1);
+  TEST (a, 42, -1);
+  TEST (a, 43, 1);
+  TEST (a, 44, -1);
+  TEST (a, 45, 1);
+  TEST (a, 46, -1);
+  TEST (a, 47, 1);
+  TEST (a, 48, -1);
+  TEST (a, 49, 1);
+  TEST (a, 50, -1);
+  TEST (a, 51, 1);
+  TEST (a, 52, -1);
+  TEST (a, 53, 1);
+  TEST (a, 54, -1);
+  TEST (a, 55, 1);
+  TEST (a, 56, -1);
+  TEST (a, 57, 1);
+  TEST (a, 58, -1);
+  TEST (a, 59, 1);
+  TEST (a, 60, -1);
+  TEST (a, 61, 1);
+  TEST (a, 62, -1);
+  TEST (a, 63, 1);
+  TEST (a, 64, -1);
+
+  TEST (b, 1, -1);
+  TEST (b, 2, -1);
+  TEST (b, 3, -1);
+  TEST (b, 4, -1);
+  TEST (b, 5, -1);
+  TEST (b, 6, -1);
+  TEST (b, 7, -1);
+  TEST (b, 8, -1);
+  TEST (b, 9, 1);
+  TEST (b, 10, -1);
+  TEST (b, 11, 1);
+  TEST (b, 12, -1);
+  TEST (b, 13, 1);
+  TEST (b, 14, -1);
+  TEST (b, 15, 1);
+  TEST (b, 16, -1);
+  TEST (b, 17, 1);
+  TEST (b, 18, -1);
+  TEST (b, 19, 1);
+  TEST (b, 20, -1);
+  TEST (b, 21, 1);
+  TEST (b, 22, -1);
+  TEST (b, 23, 1);
+  TEST (b, 24, -1);
+  TEST (b, 25, 1);
+  TEST (b, 26, -1);
+  TEST (b, 27, 1);
+  TEST (b, 28, -1);
+  TEST (b, 29, 1);
+  TEST (b, 30, -1);
+  TEST (b, 31, 1);
+  TEST (b, 32, -1);
+  TEST (b, 33, 1);
+  TEST (b, 34, -1);
+  TEST (b, 35, 1);
+  TEST (b, 36, -1);
+  TEST (b, 37, 1);
+  TEST (b, 38, -1);
+  TEST (b, 39, 1);
+  TEST (b, 40, -1);
+  TEST (b, 41, 1);
+  TEST (b, 42, -1);
+  TEST (b, 43, 1);
+  TEST (b, 44, -1);
+  TEST (b, 45, 1);
+  TEST (b, 46, -1);
+  TEST (b, 47, 1);
+  TEST (b, 48, -1);
+  TEST (b, 49, 1);
+  TEST (b, 50, -1);
+  TEST (b, 51, 1);
+  TEST (b, 52, -1);
+  TEST (b, 53, 1);
+  TEST (b, 54, -1);
+  TEST (b, 55, 1);
+  TEST (b, 56, -1);
+  TEST (b, 57, 1);
+  TEST (b, 58, -1);
+  TEST (b, 59, 1);
+  TEST (b, 60, -1);
+  TEST (b, 61, 1);
+  TEST (b, 62, -1);
+  TEST (b, 63, 1);
+  TEST (b, 64, -1);
+
+  return 0;
+}
+
Index: gcc/testsuite/gcc.target/arm/memset-inline-8.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-8.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-8.c	(revision 0)
@@ -0,0 +1,44 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline"  } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+    memset (a, -1, 14);
+    return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { scan-assembler "vst1" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-not "vstr" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-1.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-1.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-1.c	(revision 0)
@@ -0,0 +1,39 @@
+/* { dg-do run } */
+/* { dg-options "-save-temps -O2 -fno-inline"  } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+    memset (a, -1, 14);
+    return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-9.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-9.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-9.c	(revision 0)
@@ -0,0 +1,42 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -Os -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+  memset (a, -1, 14);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler "vst1" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-2.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-2.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-2.c	(revision 0)
@@ -0,0 +1,38 @@
+/* { dg-do run } */
+/* { dg-options "-save-temps -Os -fno-inline" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+  memset (a, -1, 14);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+/* { dg-final { scan-assembler "bl?\[ \t\]*memset" { target { ! arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-3.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-3.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-3.c	(revision 0)
@@ -0,0 +1,40 @@
+/* { dg-do run } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+    memset (a, -1, 7);
+    return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 7, sizeof (a), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { scan-assembler-not "strh" { target { ! arm_thumb1 } } } } */
+/* { dg-final { scan-assembler-not "strb" { target { ! arm_thumb1 } } } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-4.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-4.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-4.c	(revision 0)
@@ -0,0 +1,68 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+int a[LEN];
+int b[LEN];
+int c[LEN];
+void
+foo1 (void)
+{
+    memset (a, -1, 8);
+    return;
+}
+
+void
+foo2 (void)
+{
+  memset (b, 1, 12);
+  return;
+}
+
+void
+foo3 (void)
+{
+  memset (c, 1, 13);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  int i;
+
+  foo1 ();
+  check ((signed char *)a, 8, sizeof (a), -1);
+
+  foo2 ();
+  check ((signed char *)b, 12, sizeof (b), 1);
+
+  foo3 ();
+  check ((signed char *)c, 13, sizeof (c), 1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]+memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { scan-assembler-times "vst1\.8" 1 { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler "vstr" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-5.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-5.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-5.c	(revision 0)
@@ -0,0 +1,78 @@
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+int a[LEN];
+int b[LEN];
+int c[LEN];
+int d[LEN];
+void
+foo1 (void)
+{
+    memset (a, -1, 16);
+    return;
+}
+
+void
+foo2 (void)
+{
+  memset (b, 1, 25);
+  return;
+}
+
+void
+foo3 (void)
+{
+  memset (c, -1, 19);
+  return;
+}
+
+void
+foo4 (void)
+{
+  memset (d, 1, 23);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo1 ();
+  check ((signed char *)a, 16, sizeof (a), -1);
+
+  foo2 ();
+  check ((signed char *)b, 25, sizeof (b), 1);
+
+  foo3 ();
+  check ((signed char *)c, 19, sizeof (c), -1);
+
+  foo4 ();
+  check ((signed char *)d, 23, sizeof (d), 1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]+memset" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler "vst1" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-not "vstr"  { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
+
