From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: (qmail 19967 invoked by alias); 6 May 2014 05:00:06 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id:
List-Archive:
List-Post:
List-Help:
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 19935 invoked by uid 89); 6 May 2014 05:00:04 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-2.2 required=5.0
	tests=AWL,BAYES_00,RCVD_IN_DNSWL_LOW,SPF_PASS autolearn=ham version=3.3.2
X-HELO: service87.mimecast.com
Received: from service87.mimecast.com (HELO service87.mimecast.com) (91.220.42.44)
	by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP;
	Tue, 06 May 2014 04:59:59 +0000
Received: from cam-owa2.Emea.Arm.com (fw-tnat.cambridge.arm.com [217.140.96.21])
	by service87.mimecast.com; Tue, 06 May 2014 05:59:54 +0100
Received: from SHAWIN162 ([10.1.255.212]) by cam-owa2.Emea.Arm.com with
	Microsoft SMTPSVC(6.0.3790.3959); Tue, 6 May 2014 06:00:02 +0100
From: "bin.cheng"
To: "Bin Cheng", "Richard Earnshaw"
Cc:
References: <003301cf641f$36733f40$a359bdc0$@arm.com>
	<5363A4CF.9050904@arm.com> <003f01cf6832$9cf921a0$d6eb64e0$@arm.com>
In-Reply-To: <003f01cf6832$9cf921a0$d6eb64e0$@arm.com>
Subject: RE: [PATCH ARM] Improve ARM memset inlining
Date: Tue, 06 May 2014 05:00:00 -0000
Message-ID: <004401cf68e7$fff1bb40$ffd531c0$@arm.com>
MIME-Version: 1.0
X-MC-Unique: 114050605595400201
Content-Type: text/plain; charset=WINDOWS-1252
Content-Transfer-Encoding: quoted-printable
X-IsSubscribed: yes
X-SW-Source: 2014-05/txt/msg00270.txt.bz2

> -----Original Message-----
> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-owner@gcc.gnu.org]
> On Behalf Of bin.cheng
> Sent: Monday, May 05, 2014 3:21 PM
> To: Richard Earnshaw
> Cc: gcc-patches@gcc.gnu.org
> Subject: RE: [PATCH ARM] Improve ARM memset inlining
>
> Hi Richard,  Thanks for reviewing.  I embedded answers to your comments,
> and also updated the patch.
>
> > -----Original Message-----
> > From: Richard Earnshaw
> > Sent: Friday, May 02, 2014 10:00 PM
> > To: Bin Cheng
> > Cc: gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH ARM] Improve ARM memset inlining
> >
> > On 30/04/14 03:52, bin.cheng wrote:
> > > Hi,
> > > This patch expands small memset calls into direct memory set
> > > instructions by introducing a "setmemsi" pattern.  For processors
> > > without NEON support, it expands memset using general store
> > > instructions, for example, strd for 4-byte aligned addresses.  For
> > > processors with NEON support, it expands memset using NEON
> > > instructions like vstr and miscellaneous vst1.* instructions, for
> > > both aligned and unaligned cases.
> > >
> > > This patch depends on
> > > http://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html, otherwise
> > > vst1.64 will be generated for 32-bit aligned memory units.
> > >
> > > There is also one piece of leftover work for this patch: since vst1.*
> > > instructions only support the post-increment addressing mode, the
> > > inlined memset for unaligned NEON cases should look like:
> > >     vmov.i32  q8, #...
> > >     vst1.8    {q8}, [r3]!
> > >     vst1.8    {q8}, [r3]!
> > >     vst1.8    {q8}, [r3]!
> > >     vst1.8    {q8}, [r3]
> >
> > Other than for zero, I'd expect the vmov to be vmov.i8 to move an
> > arbitrary
> I just used vmov.i32 as an example.  The element size is actually
> calculated by the function neon_valid_immediate, which works as expected,
> I think.
>
> > byte value into all lanes in a vector.  After that, if the alignment
> > is known to be more than 8-bit, I'd expect the vst1 instructions (with
> > the exception of the last store if the length is not a multiple of the
> > alignment) to use
> >
> >     vst1.<size> {reg}, [addr-reg :<align>]!
> >
> > Hence, for 16-bit aligned data, we want
> >
> >     vst1.16 {q8}, [r3:16]!
> Did I miss something important?  It seems to me the explicit alignment
> notes supported are 64/128/256.  So what do you mean by 16-bit alignment
> here?
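
For context, the source behind the sequences in this thread is just a
small constant-length memset.  A minimal illustrative example (not taken
from the patch or its testsuite; the exact instructions chosen depend on
tuning and alignment):

    #include <string.h>

    void
    f (char *p)
    {
      /* 64 bytes through a possibly unaligned pointer: with this patch
         and NEON enabled, expected to expand to one vmov plus four
         16-byte vst1.8 stores instead of a call to memset.  */
      memset (p, 0xAB, 64);
    }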
>
> > > But for now, gcc can't do this, and the code below is generated:
> > >     vmov.i32  q8, #...
> > >     vst1.8    {q8}, [r3]
> > >     add       r2, r3, #16
> > >     add       r3, r2, #16
> > >     vst1.8    {q8}, [r2]
> > >     vst1.8    {q8}, [r3]
> > >     add       r2, r3, #16
> > >     vst1.8    {q8}, [r2]
> > >
> > > I investigated this issue.  The root cause lies in the rtx costs
> > > returned by the ARM backend.  Anyway, I think this is another issue
> > > and should be fixed in a separate patch.
> > >
> > > Bootstrapped and regression-tested on cortex-a15, with and without
> > > NEON support.  Is it OK?
> > >
> >
> > Some more comments inline.
> >
> > > Thanks,
> > > bin
> > >
> > >
> > > 2014-04-29  Bin Cheng
> > >
> > >	PR target/55701
> > >	* config/arm/arm.md (setmem): New pattern.
> > >	* config/arm/arm-protos.h (struct tune_params): New field.
> > >	(arm_gen_setmem): New prototype.
> > >	* config/arm/arm.c (arm_slowmul_tune): Initialize new field.
> > >	(arm_fastmul_tune, arm_strongarm_tune, arm_xscale_tune): Ditto.
> > >	(arm_9e_tune, arm_v6t2_tune, arm_cortex_tune): Ditto.
> > >	(arm_cortex_a8_tune, arm_cortex_a7_tune): Ditto.
> > >	(arm_cortex_a15_tune, arm_cortex_a53_tune): Ditto.
> > >	(arm_cortex_a57_tune, arm_cortex_a5_tune): Ditto.
> > >	(arm_cortex_a9_tune, arm_cortex_a12_tune): Ditto.
> > >	(arm_v7m_tune, arm_v6m_tune, arm_fa726te_tune): Ditto.
> > >	(arm_const_inline_cost): New function.
> > >	(arm_block_set_max_insns): New function.
> > >	(arm_block_set_straight_profit_p): New function.
> > >	(arm_block_set_vect_profit_p): New function.
> > >	(arm_block_set_unaligned_vect): New function.
> > >	(arm_block_set_aligned_vect): New function.
> > >	(arm_block_set_unaligned_straight): New function.
> > >	(arm_block_set_aligned_straight): New function.
> > >	(arm_block_set_vect, arm_gen_setmem): New functions.
> > >
> > > gcc/testsuite/ChangeLog
> > > 2014-04-29  Bin Cheng
> > >
> > >	PR target/55701
> > >	* gcc.target/arm/memset-inline-1.c: New test.
> > >	* gcc.target/arm/memset-inline-2.c: New test.
> > >	* gcc.target/arm/memset-inline-3.c: New test.
> > >	* gcc.target/arm/memset-inline-4.c: New test.
> > >	* gcc.target/arm/memset-inline-5.c: New test.
> > >	* gcc.target/arm/memset-inline-6.c: New test.
> > >	* gcc.target/arm/memset-inline-7.c: New test.
> > >	* gcc.target/arm/memset-inline-8.c: New test.
> > >	* gcc.target/arm/memset-inline-9.c: New test.
> > >
> > >
> > > j1328-20140429.txt
> > >
> > >
> > > Index: gcc/config/arm/arm.c
> > > ===================================================================
> > > --- gcc/config/arm/arm.c	(revision 209852)
> > > +++ gcc/config/arm/arm.c	(working copy)
> > > @@ -1585,10 +1585,11 @@ const struct tune_params arm_slowmul_tune =
> > >    true,					/* Prefer constant pool.  */
> > >    arm_default_branch_cost,
> > >    false,					/* Prefer LDRD/STRD.  */
> > > -  {true, true},				/* Prefer non short circuit.  */
> > > -  &arm_default_vec_cost,		/* Vectorizer costs.  */
> > > -  false,					/* Prefer Neon for 64-bits bitops.  */
> > > -  false, false				/* Prefer 32-bit encodings.  */
> > > +  {true, true},				/* Prefer non short circuit.  */
> > > +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> > > +  false,					/* Prefer Neon for 64-bits bitops.  */
> > > +  false, false,				/* Prefer 32-bit encodings.  */
> > > +  false					/* Prefer Neon for stringops.  */
> > >  };
> > >
> >
> > Please make sure that all the white space before the comments is using
> > TAB, not spaces.  Similarly for the other tables.
> Fixed.
>
> > > @@ -16788,6 +16806,14 @@ arm_const_double_inline_cost (rtx val)
> > >  			      NULL_RTX, NULL_RTX, 0, 0));
> > >  }
> > >
> > > +/* Cost of loading a SImode constant.  */
> > > +static inline int
> > > +arm_const_inline_cost (rtx val)
> > > +{
> > > +  return arm_gen_constant (SET, SImode, NULL_RTX, INTVAL (val),
> > > +			   NULL_RTX, NULL_RTX, 0, 0);
> > > +}
> > > +
> >
> > This could be used more widely if you passed the SET in as a parameter
> > (there are cases in arm_new_rtx_cost that could use it, for example).
> > Also, you want to enable sub-targets (it is only unsafe to do so once
> > you can no longer create new pseudos), so the penultimate argument in
> > the call to arm_gen_constant should be 1.
> Fixed.
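
The generalization Richard asks for would look roughly like this (a
sketch based on the arm_gen_constant call quoted above, not the actual
follow-up code):

    /* Cost of loading a SImode constant for rtx operation CODE.
       Sub-targets enabled: the penultimate argument is 1, as the
       review requests.  */
    static inline int
    arm_const_inline_cost (enum rtx_code code, rtx val)
    {
      return arm_gen_constant (code, SImode, NULL_RTX, INTVAL (val),
                               NULL_RTX, NULL_RTX, 1, 0);
    }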
>
> > > /* Return true if it is worthwhile to split a 64-bit constant into two
> > >    32-bit operations.  This is the case if optimizing for size, or
> > >    if we have load delay slots, or if one 32-bit part can be done with
> > > @@ -31350,6 +31383,504 @@ arm_validize_comparison (rtx *comparison, rtx * op
> > >
> > >  }
> > >
> > > +/* Maximum number of instructions to set a block of memory.  */
> > > +static int
> > > +arm_block_set_max_insns (void)
> > > +{
> > > +  return (optimize_function_for_size_p (cfun) ? 4 : 8);
> > > +}
> >
> > I think the non-size_p alternative should really be a parameter in the
> > per-cpu costs table.
> Fixed.
>
> > > +
> > > +/* Return TRUE if it's profitable to set a block of memory for the
> > > +   straight case.  */
> >
> > "Straight" is confusing here.  Do you mean non-vectorized?  If so,
> > then non_vect might be clearer.
> Fixed.
>
> > The arguments should really be documented (see comment below about
> > align, for example).
> Fixed.
>
> > > +static bool
> > > +arm_block_set_straight_profit_p (rtx val,
> > > +				 unsigned HOST_WIDE_INT length,
> > > +				 unsigned HOST_WIDE_INT align,
> > > +				 bool unaligned_p, bool use_strd_p)
> > > +{
> > > +  int num = 0;
> > > +  /* For leftovers in bytes of 0-7, we can set the memory block using
> > > +     strb/strh/str with minimum instruction number.  */
> > > +  int leftover[8] = {0, 1, 1, 2, 1, 2, 2, 3};
> >
> > This should be marked const.
> Fixed.
>
> > > +
> > > +  if (unaligned_p)
> > > +    {
> > > +      num = arm_const_inline_cost (val);
> > > +      num += length / align + length % align;
> >
> > Isn't align in bits here, when you really want it in bytes?
> All alignments are in bytes, starting from the "setmem" pattern.
>
> > What if align > 4 bytes?
> Then it's the "!unaligned_p" case, handled by the other arms of this if
> statement.
>
> > > +    }
> > > +  else if (use_strd_p)
> > > +    {
> > > +      num = arm_const_double_inline_cost (val);
> > > +      num += (length >> 3) + leftover[length & 7];
> > > +    }
> > > +  else
> > > +    {
> > > +      num = arm_const_inline_cost (val);
> > > +      num += (length >> 2) + leftover[length & 3];
> > > +    }
> > > +
> > > +  /* We may be able to combine the last pair STRH/STRB into a single
> > > +     STR by shifting one byte back.  */
> > > +  if (unaligned_access && length > 3 && (length & 3) == 3)
> > > +    num--;
> > > +
> > > +  return (num <= arm_block_set_max_insns ());
> > > +}
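
To make the counting above concrete, a worked instance (the arithmetic
is illustrative, not from the thread):

    /* length = 15:
         strd path: (15 >> 3) = 1 strd, leftover[15 & 7] = leftover[7] = 3
                    (str + strh + strb), plus the constant load;
         str path:  (15 >> 2) = 3 str, leftover[15 & 3] = leftover[3] = 2
                    (strh + strb), plus the constant load.
       In both cases length > 3 and (length & 3) == 3, so with unaligned
       access the trailing STRH/STRB pair is counted as a single
       overlapping STR and num is decremented once.  */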
> > > +
> > > +/* Return TRUE if it's profitable to set a block of memory for the
> > > +   vector case.  */
> > > +static bool
> > > +arm_block_set_vect_profit_p (unsigned HOST_WIDE_INT length,
> > > +			     unsigned HOST_WIDE_INT align ATTRIBUTE_UNUSED,
> > > +			     bool unaligned_p, enum machine_mode mode)
> >
> > I'm not sure what you mean by unaligned here.  Again, documenting the
> > arguments might help.
> Fixed.
>
> > > +{
> > > +  int num;
> > > +  unsigned int nelt = GET_MODE_NUNITS (mode);
> > > +
> > > +  /* Num of instruction loading constant value.  */
> >
> > Use either "Number" or, in this case, simply drop that bit and write:
> >   /* Instruction loading constant value.  */
> Fixed.
>
> > > +  num = 1;
> > > +  /* Num of store instructions.  */
> >
> > Likewise.
> >
> > > +  num += (length + nelt - 1) / nelt;
> > > +  /* Num of address adjusting instructions.  */
> >
> > Can't we work on the premise that the address adjusting instructions
> > will be merged into the stores?  I know you said that they currently
> > do not, but that's not a problem that this bit of code should have to
> > worry about.
> Fixed.
>
> > > +  if (unaligned_p)
> > > +    /* For unaligned case, it's one less than the store instructions.  */
> > > +    num += (length + nelt - 1) / nelt - 1;
> > > +  else if ((length & 3) != 0)
> > > +    /* For aligned case, it's one if bytes leftover can only be stored
> > > +       by mis-aligned store instruction.  */
> > > +    num++;
> > > +
> > > +  /* Store the first 16 bytes using vst1:v16qi for the aligned case.  */
> > > +  if (!unaligned_p && mode == V16QImode)
> > > +    num--;
> > > +
> > > +  return (num <= arm_block_set_max_insns ());
> > > +}
> > > +
> > > +/* Set a block of memory using vectorization instructions for the
> > > +   unaligned case.  We fill the first LENGTH bytes of the memory
> > > +   area starting from DSTBASE with byte constant VALUE.  ALIGN is
> > > +   the alignment requirement of memory.  */
> >
> > What's the return value mean?
> Documented.
>
> > > +static bool
> > > +arm_block_set_unaligned_vect (rtx dstbase,
> > > +			      unsigned HOST_WIDE_INT length,
> > > +			      unsigned HOST_WIDE_INT value,
> > > +			      unsigned HOST_WIDE_INT align)
> > > +{
> > > +  unsigned int i = 0, j = 0, nelt_v16, nelt_v8, nelt_mode;
> >
> > Don't mix initialized declarations with uninitialized ones on the same
> > line.  You don't appear to use either I or J until their first use in
> > the loop control below, so why initialize them here?
> Fixed.
>
> > > +  rtx dst, mem;
> > > +  rtx val_elt, val_vec, reg;
> > > +  rtx rval[MAX_VECT_LEN];
> > > +  rtx (*gen_func) (rtx, rtx);
> > > +  enum machine_mode mode;
> > > +  unsigned HOST_WIDE_INT v = value;
> > > +
> > > +  gcc_assert ((align & 0x3) != 0);
> > > +  nelt_v8 = GET_MODE_NUNITS (V8QImode);
> > > +  nelt_v16 = GET_MODE_NUNITS (V16QImode);
> > > +  if (length >= nelt_v16)
> > > +    {
> > > +      mode = V16QImode;
> > > +      gen_func = gen_movmisalignv16qi;
> > > +    }
> > > +  else
> > > +    {
> > > +      mode = V8QImode;
> > > +      gen_func = gen_movmisalignv8qi;
> > > +    }
> > > +  nelt_mode = GET_MODE_NUNITS (mode);
> > > +  gcc_assert (length >= nelt_mode);
> > > +  /* Skip if it isn't profitable.  */
> > > +  if (!arm_block_set_vect_profit_p (length, align, true, mode))
> > > +    return false;
> > > +
> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
> > > +  mem = adjust_automodify_address (dstbase, mode, dst, 0);
> > > +
> > > +  v = sext_hwi (v, BITS_PER_WORD);
> > > +  val_elt = GEN_INT (v);
> > > +  for (; j < nelt_mode; j++)
> > > +    rval[j] = val_elt;
> >
> > Is this the first use of J?  If so, initialize it here.
> >
> > > +
> > > +  reg = gen_reg_rtx (mode);
> > > +  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval));
> > > +  /* Emit instruction loading the constant value.  */
> > > +  emit_move_insn (reg, val_vec);
> > > +
> > > +  /* Handle nelt_mode bytes in a vector.  */
> > > +  for (; (i + nelt_mode <= length); i += nelt_mode)
> >
> > Similarly for I.
> >
> > > +    {
> > > +      emit_insn ((*gen_func) (mem, reg));
> > > +      if (i + 2 * nelt_mode <= length)
> > > +	emit_insn (gen_add2_insn (dst, GEN_INT (nelt_mode)));
> > > +    }
> > > +
> > > +  if (i + nelt_v8 <= length)
> > > +    gcc_assert (mode == V16QImode);
> >
> > Why not drop the if and write:
> >
> >   gcc_assert ((i + nelt_v8) > length || mode == V16QImode);
> Fixed.
>
> > > +
> > > +  /* Handle (8, 16) bytes leftover.  */
> > > +  if (i + nelt_v8 < length)
> >
> > Your assertion above checked <=, but here you use <.  Is that correct?
> Yes, it is.  For the "==" case, it means we have nelt_v8 bytes leftover,
> which will be handled by the last branch of the if statement.
>
> > > +    {
> > > +      emit_insn (gen_add2_insn (dst, GEN_INT (length - i)));
> > > +      /* We are shifting bytes back, set the alignment accordingly.  */
> > > +      if ((length & 1) != 0 && align >= 2)
> > > +	set_mem_align (mem, BITS_PER_UNIT);
> > > +
> > > +      emit_insn (gen_movmisalignv16qi (mem, reg));
> > > +    }
> > > +  /* Handle (0, 8] bytes leftover.  */
> > > +  else if (i < length && i + nelt_v8 >= length)
> > > +    {
> > > +      if (mode == V16QImode)
> > > +	{
> > > +	  reg = gen_lowpart (V8QImode, reg);
> > > +	  mem = adjust_automodify_address (dstbase, V8QImode, dst, 0);
> > > +	}
> > > +      emit_insn (gen_add2_insn (dst, GEN_INT ((length - i)
> > > +					      + (nelt_mode - nelt_v8))));
> > > +      /* We are shifting bytes back, set the alignment accordingly.  */
> > > +      if ((length & 1) != 0 && align >= 2)
> > > +	set_mem_align (mem, BITS_PER_UNIT);
> > > +
> > > +      emit_insn (gen_movmisalignv8qi (mem, reg));
> > > +    }
> > > +
> > > +  return true;
> > > +}
> > > +
> > > +/* Set a block of memory using vectorization instructions for the
> > > +   aligned case.  We fill the first LENGTH bytes of the memory area
> > > +   starting from DSTBASE with byte constant VALUE.  ALIGN is the
> > > +   alignment requirement of memory.  */
> >
> > See all the comments above for the unaligned case.
> Fixed accordingly.
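
The "shifting bytes back" idea used for the leftovers can be restated in
plain C; a sketch assuming a 16-byte vector store (the helper name is
made up, this is not the patch's code):

    #include <string.h>

    /* Two overlapping 16-byte stores cover any n in [16, 32], so the
       tail never degenerates into byte-by-byte stores.  */
    void
    set_overlap (unsigned char *p, unsigned char v, unsigned int n)
    {
      unsigned char q[16];
      memset (q, v, sizeof q);     /* stands in for vmov.i8 q8, #v */
      memcpy (p, q, 16);           /* vst1.8 {q8}, [p] */
      memcpy (p + n - 16, q, 16);  /* overlapping vst1.8 at p + n - 16 */
    }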
>
> > > +static bool
> > > +arm_block_set_aligned_vect (rtx dstbase,
> > > +			    unsigned HOST_WIDE_INT length,
> > > +			    unsigned HOST_WIDE_INT value,
> > > +			    unsigned HOST_WIDE_INT align)
> > > +{
> > > +  unsigned int i = 0, j = 0, nelt_v8, nelt_v16, nelt_mode;
> > > +  rtx dst, addr, mem;
> > > +  rtx val_elt, val_vec, reg;
> > > +  rtx rval[MAX_VECT_LEN];
> > > +  enum machine_mode mode;
> > > +  unsigned HOST_WIDE_INT v = value;
> > > +
> > > +  gcc_assert ((align & 0x3) == 0);
> > > +  nelt_v8 = GET_MODE_NUNITS (V8QImode);
> > > +  nelt_v16 = GET_MODE_NUNITS (V16QImode);
> > > +  if (length >= nelt_v16 && unaligned_access && !BYTES_BIG_ENDIAN)
> > > +    mode = V16QImode;
> > > +  else
> > > +    mode = V8QImode;
> > > +
> > > +  nelt_mode = GET_MODE_NUNITS (mode);
> > > +  gcc_assert (length >= nelt_mode);
> > > +  /* Skip if it isn't profitable.  */
> > > +  if (!arm_block_set_vect_profit_p (length, align, false, mode))
> > > +    return false;
> > > +
> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
> > > +
> > > +  v = sext_hwi (v, BITS_PER_WORD);
> > > +  val_elt = GEN_INT (v);
> > > +  for (; j < nelt_mode; j++)
> > > +    rval[j] = val_elt;
> > > +
> > > +  reg = gen_reg_rtx (mode);
> > > +  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval));
> > > +  /* Emit instruction loading the constant value.  */
> > > +  emit_move_insn (reg, val_vec);
> > > +
> > > +  /* Handle first 16 bytes specially using vst1:v16qi instruction.  */
> > > +  if (mode == V16QImode)
> > > +    {
> > > +      mem = adjust_automodify_address (dstbase, mode, dst, 0);
> > > +      emit_insn (gen_movmisalignv16qi (mem, reg));
> > > +      i += nelt_mode;
> > > +      /* Handle (8, 16) bytes leftover using vst1:v16qi again.  */
> > > +      if (i + nelt_v8 < length && i + nelt_v16 > length)
> > > +	{
> > > +	  emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
> > > +	  mem = adjust_automodify_address (dstbase, mode, dst, 0);
> > > +	  /* We are shifting bytes back, set the alignment accordingly.  */
> > > +	  if ((length & 0x3) == 0)
> > > +	    set_mem_align (mem, BITS_PER_UNIT * 4);
> > > +	  else if ((length & 0x1) == 0)
> > > +	    set_mem_align (mem, BITS_PER_UNIT * 2);
> > > +	  else
> > > +	    set_mem_align (mem, BITS_PER_UNIT);
> > > +
> > > +	  emit_insn (gen_movmisalignv16qi (mem, reg));
> > > +	  return true;
> > > +	}
> > > +      /* Fall through for bytes leftover.  */
> > > +      mode = V8QImode;
> > > +      nelt_mode = GET_MODE_NUNITS (mode);
> > > +      reg = gen_lowpart (V8QImode, reg);
> > > +    }
> > > +
> > > +  /* Handle 8 bytes in a vector.  */
> > > +  for (; (i + nelt_mode <= length); i += nelt_mode)
> > > +    {
> > > +      addr = plus_constant (Pmode, dst, i);
> > > +      mem = adjust_automodify_address (dstbase, mode, addr, i);
> > > +      emit_move_insn (mem, reg);
> > > +    }
> > > +
> > > +  /* Handle single word leftover by shifting 4 bytes back.  We can
> > > +     use aligned access for this case.  */
> > > +  if (i + UNITS_PER_WORD == length)
> > > +    {
> > > +      addr = plus_constant (Pmode, dst, i - UNITS_PER_WORD);
> > > +      mem = adjust_automodify_address (dstbase, mode,
> > > +				       addr, i - UNITS_PER_WORD);
> > > +      /* We are shifting 4 bytes back, set the alignment accordingly.  */
> > > +      if (align > UNITS_PER_WORD)
> > > +	set_mem_align (mem, BITS_PER_UNIT * UNITS_PER_WORD);
> > > +
> > > +      emit_move_insn (mem, reg);
> > > +    }
> > > +  /* Handle (0, 4), (4, 8) bytes leftover by shifting bytes back.
> > > +     We have to use unaligned access for this case.  */
> > > +  else if (i < length)
> > > +    {
> > > +      emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
> > > +      mem = adjust_automodify_address (dstbase, mode, dst, 0);
> > > +      /* We are shifting bytes back, set the alignment accordingly.  */
> > > +      if ((length & 1) == 0)
> > > +	set_mem_align (mem, BITS_PER_UNIT * 2);
> > > +      else
> > > +	set_mem_align (mem, BITS_PER_UNIT);
> > > +
> > > +      emit_insn (gen_movmisalignv8qi (mem, reg));
> > > +    }
> > > +
> > > +  return true;
> > > +}
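
A short trace of this path (my reading of the code above; the mnemonics
follow the sequences quoted earlier in the thread, so take them as an
assumption rather than verified output):

    /* length = 20, align = 4, unaligned access available, little-endian:
         i = 0 : 16-byte misaligned-capable store   vst1.8 {q8}, [r0]
         i = 16: i + 4 == length, so the last word is written by an
                 8-byte store shifted 4 bytes back  vstr d16, [r0, #12]
       Two stores plus the constant load; the 4-byte overlap at offsets
       12-15 avoids any strh/strb tail.  */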
> > > +
> > > +/* Set a block of memory using plain strh/strb instructions, only
> > > +   using instructions allowed by ALIGN on the processor.  We fill the
> > > +   first LENGTH bytes of the memory area starting from DSTBASE
> > > +   with byte constant VALUE.  ALIGN is the alignment requirement
> > > +   of memory.  */
> > > +static bool
> > > +arm_block_set_unaligned_straight (rtx dstbase,
> > > +				  unsigned HOST_WIDE_INT length,
> > > +				  unsigned HOST_WIDE_INT value,
> > > +				  unsigned HOST_WIDE_INT align)
> > > +{
> > > +  unsigned int i;
> > > +  rtx dst, addr, mem;
> > > +  rtx val_exp, val_reg, reg;
> > > +  enum machine_mode mode;
> > > +  HOST_WIDE_INT v = value;
> > > +
> > > +  gcc_assert (align == 1 || align == 2);
> > > +
> > > +  if (align == 2)
> > > +    v |= (value << BITS_PER_UNIT);
> > > +
> > > +  v = sext_hwi (v, BITS_PER_WORD);
> > > +  val_exp = GEN_INT (v);
> > > +  /* Skip if it isn't profitable.  */
> > > +  if (!arm_block_set_straight_profit_p (val_exp, length,
> > > +					align, true, false))
> > > +    return false;
> > > +
> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
> > > +  mode = (align == 2 ? HImode : QImode);
> > > +  val_reg = force_reg (SImode, val_exp);
> > > +  reg = gen_lowpart (mode, val_reg);
> > > +
> > > +  for (i = 0; (i + GET_MODE_SIZE (mode) <= length);
> > > +       i += GET_MODE_SIZE (mode))
> > > +    {
> > > +      addr = plus_constant (Pmode, dst, i);
> > > +      mem = adjust_automodify_address (dstbase, mode, addr, i);
> > > +      emit_move_insn (mem, reg);
> > > +    }
> > > +
> > > +  /* Handle single byte leftover.  */
> > > +  if (i + 1 == length)
> > > +    {
> > > +      reg = gen_lowpart (QImode, val_reg);
> > > +      addr = plus_constant (Pmode, dst, i);
> > > +      mem = adjust_automodify_address (dstbase, QImode, addr, i);
> > > +      emit_move_insn (mem, reg);
> > > +      i++;
> > > +    }
> > > +
> > > +  gcc_assert (i == length);
> > > +  return true;
> > > +}
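
Concretely, an illustrative trace of this function (my arithmetic, not
from the thread):

    /* length = 7, align = 2, value = 0xAB:
         v = 0xAB | (0xAB << 8) = 0xABAB, mode = HImode,
         loop: strh at offsets 0, 2 and 4,
         tail: strb of 0xAB at offset 6,
       i.e. four stores plus the constant load, within the non-size
       budget of 8 from arm_block_set_max_insns.  */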
> > > +
> > > +/* Set a block of memory using plain strd/str/strh/strb instructions,
> > > +   to permit unaligned copies on processors which support unaligned
> > > +   semantics for those instructions.  We fill the first LENGTH bytes
> > > +   of the memory area starting from DSTBASE with byte constant VALUE.
> > > +   ALIGN is the alignment requirement of memory.  */
> > > +static bool
> > > +arm_block_set_aligned_straight (rtx dstbase,
> > > +				unsigned HOST_WIDE_INT length,
> > > +				unsigned HOST_WIDE_INT value,
> > > +				unsigned HOST_WIDE_INT align)
> > > +{
> > > +  unsigned int i = 0;
> > > +  rtx dst, addr, mem;
> > > +  rtx val_exp, val_reg, reg;
> > > +  unsigned HOST_WIDE_INT v;
> > > +  bool use_strd_p;
> > > +
> > > +  use_strd_p = (length >= 2 * UNITS_PER_WORD && (align & 3) == 0
> > > +		&& TARGET_LDRD && current_tune->prefer_ldrd_strd);
> > > +
> > > +  v = (value | (value << 8) | (value << 16) | (value << 24));
> > > +  if (length < UNITS_PER_WORD)
> > > +    v &= (0xFFFFFFFF >> (UNITS_PER_WORD - length) * BITS_PER_UNIT);
> > > +
> > > +  if (use_strd_p)
> > > +    v |= (v << BITS_PER_WORD);
> > > +  else
> > > +    v = sext_hwi (v, BITS_PER_WORD);
> > > +
> > > +  val_exp = GEN_INT (v);
> > > +  /* Skip if it isn't profitable.  */
> > > +  if (!arm_block_set_straight_profit_p (val_exp, length,
> > > +					align, false, use_strd_p))
> > > +    {
> > > +      /* Try without strd.  */
> > > +      v = (v >> BITS_PER_WORD);
> > > +      v = sext_hwi (v, BITS_PER_WORD);
> > > +      val_exp = GEN_INT (v);
> > > +      use_strd_p = false;
> > > +      if (!arm_block_set_straight_profit_p (val_exp, length,
> > > +					    align, false, use_strd_p))
> > > +	return false;
> > > +    }
> > > +
> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
> > > +  /* Handle double words using strd if possible.  */
> > > +  if (use_strd_p)
> > > +    {
> > > +      val_reg = force_reg (DImode, val_exp);
> > > +      reg = val_reg;
> > > +      for (; (i + 8 <= length); i += 8)
> > > +	{
> > > +	  addr = plus_constant (Pmode, dst, i);
> > > +	  mem = adjust_automodify_address (dstbase, DImode, addr, i);
> > > +	  emit_move_insn (mem, reg);
> > > +	}
> > > +    }
> > > +  else
> > > +    val_reg = force_reg (SImode, val_exp);
> > > +
> > > +  /* Handle words.  */
> > > +  reg = (use_strd_p ? gen_lowpart (SImode, val_reg) : val_reg);
> > > +  for (; (i + 4 <= length); i += 4)
> > > +    {
> > > +      addr = plus_constant (Pmode, dst, i);
> > > +      mem = adjust_automodify_address (dstbase, SImode, addr, i);
> > > +      if ((align & 3) == 0)
> > > +	emit_move_insn (mem, reg);
> > > +      else
> > > +	emit_insn (gen_unaligned_storesi (mem, reg));
> > > +    }
> > > +
> > > +  /* Merge the last pair of STRH and STRB into a STR if possible.  */
> > > +  if (unaligned_access && i > 0 && (i + 3) == length)
> > > +    {
> > > +      addr = plus_constant (Pmode, dst, i - 1);
> > > +      mem = adjust_automodify_address (dstbase, SImode, addr, i - 1);
> > > +      /* We are shifting one byte back, set the alignment accordingly.  */
> > > +      if ((align & 1) == 0)
> > > +	set_mem_align (mem, BITS_PER_UNIT);
> > > +
> > > +      /* Most likely this is an unaligned access, and we can't tell at
> > > +	 compilation time.  */
> > > +      emit_insn (gen_unaligned_storesi (mem, reg));
> > > +      return true;
> > > +    }
> > > +
> > > +  /* Handle half word leftover.  */
> > > +  if (i + 2 <= length)
> > > +    {
> > > +      reg = gen_lowpart (HImode, val_reg);
> > > +      addr = plus_constant (Pmode, dst, i);
> > > +      mem = adjust_automodify_address (dstbase, HImode, addr, i);
> > > +      if ((align & 1) == 0)
> > > +	emit_move_insn (mem, reg);
> > > +      else
> > > +	emit_insn (gen_unaligned_storehi (mem, reg));
> > > +
> > > +      i += 2;
> > > +    }
> > > +
> > > +  /* Handle single byte leftover.  */
> > > +  if (i + 1 == length)
> > > +    {
> > > +      reg = gen_lowpart (QImode, val_reg);
> > > +      addr = plus_constant (Pmode, dst, i);
> > > +      mem = adjust_automodify_address (dstbase, QImode, addr, i);
> > > +      emit_move_insn (mem, reg);
> > > +    }
> > > +
> > > +  return true;
> > > +}
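
The STRH/STRB merge relies on the same overlapping-store trick; in C
terms (a hypothetical helper, not part of the patch):

    #include <string.h>

    /* With three trailing bytes and at least one word already written,
       one unaligned word store at length - 4 rewrites a byte that is
       already set and covers the remaining three.  */
    void
    store_tail3 (unsigned char *p, unsigned int vword, unsigned int length)
    {
      memcpy (p + length - 4, &vword, 4);  /* stands in for unaligned str */
    }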
> > > +
> > > +/* Set a block of memory using vectorization instructions for both
> > > +   aligned and unaligned cases.  We fill the first LENGTH bytes of
> > > +   the memory area starting from DSTBASE with byte constant VALUE.
> > > +   ALIGN is the alignment requirement of memory.  */
> > > +static bool
> > > +arm_block_set_vect (rtx dstbase,
> > > +		    unsigned HOST_WIDE_INT length,
> > > +		    unsigned HOST_WIDE_INT value,
> > > +		    unsigned HOST_WIDE_INT align)
> > > +{
> > > +  /* Check whether we need to use unaligned store instruction.  */
> > > +  if (((align & 3) != 0 || (length & 3) != 0)
> > > +      /* Check whether unaligned store instruction is available.  */
> > > +      && (!unaligned_access || BYTES_BIG_ENDIAN))
> > > +    return false;
> >
> > Huh!  vst1.8 can work for unaligned accesses even when hw alignment
> > checking is strict.
> Emm, all movmisalign patterns are guarded by "!BYTES_BIG_ENDIAN &&
> unaligned_access", so vst1.8 instructions can't be recognized this way
> for now.  I agree that it's too strict, but that's another problem, I
> think.
>
> > > +
> > > +  if ((align & 3) == 0)
> > > +    return arm_block_set_aligned_vect (dstbase, length, value, align);
> > > +  else
> > > +    return arm_block_set_unaligned_vect (dstbase, length, value, align);
> > > +}
> > > +
> > > +/* Expand a string store operation.  Firstly we try to do that by using
> > > +   vectorization instructions, then try with ARM unaligned access and
> > > +   double-word store if profitable.  OPERANDS[0] is the destination,
> > > +   OPERANDS[1] is the number of bytes, OPERANDS[2] is the value to
> > > +   initialize the memory, OPERANDS[3] is the known alignment of the
> > > +   destination.  */
> > > +bool
> > > +arm_gen_setmem (rtx *operands)
> > > +{
> > > +  rtx dstbase = operands[0];
> > > +  unsigned HOST_WIDE_INT length;
> > > +  unsigned HOST_WIDE_INT value;
> > > +  unsigned HOST_WIDE_INT align;
> > > +
> > > +  if (!CONST_INT_P (operands[2]) || !CONST_INT_P (operands[1]))
> > > +    return false;
> > > +
> > > +  length = UINTVAL (operands[1]);
> > > +  if (length > 64)
> > > +    return false;
> > > +
> > > +  value = (UINTVAL (operands[2]) & 0xFF);
> > > +  align = UINTVAL (operands[3]);
> > > +  if (TARGET_NEON && length >= 8
> > > +      && current_tune->string_ops_prefer_neon
> > > +      && arm_block_set_vect (dstbase, length, value, align))
> > > +    return true;
> > > +
> > > +  if (!unaligned_access && (align & 3) != 0)
> > > +    return arm_block_set_unaligned_straight (dstbase, length,
> > > +					     value, align);
> > > +
> > > +  return arm_block_set_aligned_straight (dstbase, length, value, align);
> > > +}
> > > +
> > >  /* Implement the TARGET_ASAN_SHADOW_OFFSET hook.  */
> > >
> > >  static unsigned HOST_WIDE_INT
> > > Index: gcc/config/arm/arm-protos.h
> > > ===================================================================
> > > --- gcc/config/arm/arm-protos.h	(revision 209852)
> > > +++ gcc/config/arm/arm-protos.h	(working copy)
> > > @@ -277,6 +277,8 @@ struct tune_params
> > >    /* Prefer 32-bit encoding instead of 16-bit encoding where subset
> > >       of flags would be set.  */
> > >    bool disparage_partial_flag_setting_t16_encodings;
> > > +  /* Prefer to inline string operations like memset by using Neon.  */
> > > +  bool string_ops_prefer_neon;
> > >  };
> > >
> > >  extern const struct tune_params *current_tune;
> > > @@ -289,6 +291,7 @@ extern void arm_emit_coreregs_64bit_shift (enum rt
> > >  extern bool arm_validize_comparison (rtx *, rtx *, rtx *);
> > >  #endif /* RTX_CODE */
> > >
> > > +extern bool arm_gen_setmem (rtx *);
> > >  extern void arm_expand_vec_perm (rtx target, rtx op0, rtx op1, rtx sel);
> > >  extern bool arm_expand_vec_perm_const (rtx target, rtx op0, rtx op1, rtx sel);
> > >
> > > Index: gcc/config/arm/arm.md
> > > ===================================================================
> > > --- gcc/config/arm/arm.md	(revision 209852)
> > > +++ gcc/config/arm/arm.md	(working copy)
> > > @@ -7555,6 +7555,20 @@
> > >  })
> > >
> > >
> > > +(define_expand "setmemsi"
> > > +  [(match_operand:BLK 0 "general_operand" "")
> > > +   (match_operand:SI 1 "const_int_operand" "")
> > > +   (match_operand:SI 2 "const_int_operand" "")
> > > +   (match_operand:SI 3 "const_int_operand" "")]
> > > +  "TARGET_32BIT"
> > > +{
> > > +  if (arm_gen_setmem (operands))
> > > +    DONE;
> > > +
> > > +  FAIL;
> > > +})
> > > +
> > > +
> > >  ;; Move a block of memory if it is word aligned and MORE than 2 words long.
> > >  ;; We could let this apply for blocks of less than this, but it clobbers so
> > >  ;; many registers that there is then probably a better way.
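
The quoted testsuite hunk below shows only the new file's header.  A
plausible shape for such a test, with the dg directives and scan pattern
guessed rather than copied from the actual files:

    /* { dg-do compile } */
    /* { dg-options "-O2" } */

    void
    f (char *p)
    {
      __builtin_memset (p, 0, 16);
    }

    /* { dg-final { scan-assembler-not "bl\tmemset" } } */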
> > > Index: gcc/testsuite/gcc.target/arm/memset-inline-6.c
> > > ===================================================================
> > > --- gcc/testsuite/gcc.target/arm/memset-inline-6.c	(revision 0)
> > > +++ gcc/testsuite/gcc.target/arm/memset-inline-6.c	(revision 0)
> >
> > Have you tested these when the compiler was configured with
> > "--with-cpu=cortex-a9"?
> Here is the tricky part.
> For a compiler configured with "--with-tune=cortex-a9", the NEON-related
> cases (4/5/6/8/9) would fail, because we have no way to determine that we
> are compiling with cortex-a9 tuning here.
> For a compiler configured with "--with-cpu=cortex-a9", the test cases
> would pass, but I think this is a mistake.  It reveals an issue: GCC
> won't pass "-mcpu=cortex-a9" to cc1, resulting in the cortex-a8 tuning
> being selected.  It just makes no sense.
> With these issues, I didn't change the tests for now.

Precisely, I configured gcc with the options "--with-arch=armv7-a
--with-cpu|--with-tune=cortex-a9".  I read the gcc documentation and
realized that "-mcpu" is ignored when "-march" is specified.  I don't know
why gcc acts in this manner, but it leads to inconsistent
configuration/command-line behavior.
If we configure GCC with "--with-arch=armv7-a --with-cpu=cortex-a9", then
only "-march=armv7-a" is passed to cc1.
If we compile with "-march=armv7-a -mcpu=cortex-a9", then gcc works fine
and passes "-march=armv7-a -mcpu=cortex-a9" to cc1.  Even more weirdly,
cc1 warns that "switch -mcpu=cortex-m4 conflicts with -march=armv7-m
switch".

Thanks,
bin