From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: (qmail 19967 invoked by alias); 6 May 2014 05:00:06 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id:
List-Archive:
List-Post:
List-Help:
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 19935 invoked by uid 89); 6 May 2014 05:00:04 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-2.2 required=5.0
	tests=AWL,BAYES_00,RCVD_IN_DNSWL_LOW,SPF_PASS autolearn=ham version=3.3.2
X-HELO: service87.mimecast.com
Received: from service87.mimecast.com (HELO service87.mimecast.com) (91.220.42.44)
	by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP;
	Tue, 06 May 2014 04:59:59 +0000
Received: from cam-owa2.Emea.Arm.com (fw-tnat.cambridge.arm.com [217.140.96.21])
	by service87.mimecast.com; Tue, 06 May 2014 05:59:54 +0100
Received: from SHAWIN162 ([10.1.255.212]) by cam-owa2.Emea.Arm.com with
	Microsoft SMTPSVC(6.0.3790.3959); Tue, 6 May 2014 06:00:02 +0100
From: "bin.cheng"
To: "Bin Cheng", "Richard Earnshaw"
Cc:
References: <003301cf641f$36733f40$a359bdc0$@arm.com>
	<5363A4CF.9050904@arm.com> <003f01cf6832$9cf921a0$d6eb64e0$@arm.com>
In-Reply-To: <003f01cf6832$9cf921a0$d6eb64e0$@arm.com>
Subject: RE: [PATCH ARM] Improve ARM memset inlining
Date: Tue, 06 May 2014 05:00:00 -0000
Message-ID: <004401cf68e7$fff1bb40$ffd531c0$@arm.com>
MIME-Version: 1.0
X-MC-Unique: 114050605595400201
Content-Type: text/plain; charset=WINDOWS-1252
Content-Transfer-Encoding: quoted-printable
X-IsSubscribed: yes
X-SW-Source: 2014-05/txt/msg00270.txt.bz2

> -----Original Message-----
> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-owner@gcc.gnu.org]
> On Behalf Of bin.cheng
> Sent: Monday, May 05, 2014 3:21 PM
> To: Richard Earnshaw
> Cc: gcc-patches@gcc.gnu.org
> Subject: RE: [PATCH ARM] Improve ARM memset inlining
>
> Hi Richard,  Thanks for reviewing.  I embedded answers to your comments,
> and also updated the patch.
>
> > -----Original Message-----
> > From: Richard Earnshaw
> > Sent: Friday, May 02, 2014 10:00 PM
> > To: Bin Cheng
> > Cc: gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH ARM] Improve ARM memset inlining
> >
> > On 30/04/14 03:52, bin.cheng wrote:
> > > Hi,
> > > This patch expands small memset calls into direct memory set
> > > instructions by introducing a "setmemsi" pattern.  For processors
> > > without NEON support, it expands memset using general store
> > > instructions, for example, strd for 4-byte aligned addresses.  For
> > > processors with NEON support, it expands memset using NEON
> > > instructions like vstr and miscellaneous vst1.* instructions, for
> > > both aligned and unaligned cases.
> > >
> > > This patch depends on
> > > http://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html, otherwise
> > > vst1.64 will be generated for 32-bit aligned memory units.
> > >
> > > There is also one piece of leftover work for this patch: since vst1.*
> > > instructions only support the post-increment addressing mode, the
> > > inlined memset for unaligned NEON cases should look like:
> > >     vmov.i32  q8, #...
> > >     vst1.8    {q8}, [r3]!
> > >     vst1.8    {q8}, [r3]!
> > >     vst1.8    {q8}, [r3]!
> > >     vst1.8    {q8}, [r3]
> >
> > Other than for zero, I'd expect the vmov to be vmov.i8 to move an
> > arbitrary
> I just used vmov.i32 as an example.  The element size is actually
> calculated by the function neon_valid_immediate, which works as expected,
> I think.
>
> > byte value into all lanes in a vector.  After that, if the alignment
> > is known to be more than 8-bit, I'd expect the vst1 instructions (with
> > the exception of the last store if the length is not a multiple of the
> > alignment) to use
> >
> >     vst1.<size> {reg}, [addr-reg :<align>]!
> >
> > Hence, for 16-bit aligned data, we want
> >
> >     vst1.16 {q8}, [r3:16]!
> Did I miss something important?  It seems to me the explicit alignment
> notes supported are 64/128/256.  So what do you mean by 16-bit alignment
> here?
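
For context, the source behind the sequences in this thread is just a
small constant-length memset.  A minimal illustrative example (not taken
from the patch or its testsuite; the exact instructions chosen depend on
tuning and alignment):

    #include <string.h>

    void
    f (char *p)
    {
      /* 64 bytes through a possibly unaligned pointer: with this patch
         and NEON enabled, expected to expand to one vmov plus four
         16-byte vst1.8 stores instead of a call to memset.  */
      memset (p, 0xAB, 64);
    }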
>
> > > But for now, gcc can't do this, and the code below is generated:
> > >     vmov.i32  q8, #...
> > >     vst1.8    {q8}, [r3]
> > >     add       r2, r3, #16
> > >     add       r3, r2, #16
> > >     vst1.8    {q8}, [r2]
> > >     vst1.8    {q8}, [r3]
> > >     add       r2, r3, #16
> > >     vst1.8    {q8}, [r2]
> > >
> > > I investigated this issue.  The root cause lies in the rtx costs
> > > returned by the ARM backend.  Anyway, I think this is another issue
> > > and should be fixed in a separate patch.
> > >
> > > Bootstrapped and regression-tested on cortex-a15, with and without
> > > NEON support.  Is it OK?
> > >
> >
> > Some more comments inline.
> >
> > > Thanks,
> > > bin
> > >
> > >
> > > 2014-04-29  Bin Cheng
> > >
> > >	PR target/55701
> > >	* config/arm/arm.md (setmem): New pattern.
> > >	* config/arm/arm-protos.h (struct tune_params): New field.
> > >	(arm_gen_setmem): New prototype.
> > >	* config/arm/arm.c (arm_slowmul_tune): Initialize new field.
> > >	(arm_fastmul_tune, arm_strongarm_tune, arm_xscale_tune): Ditto.
> > >	(arm_9e_tune, arm_v6t2_tune, arm_cortex_tune): Ditto.
> > >	(arm_cortex_a8_tune, arm_cortex_a7_tune): Ditto.
> > >	(arm_cortex_a15_tune, arm_cortex_a53_tune): Ditto.
> > >	(arm_cortex_a57_tune, arm_cortex_a5_tune): Ditto.
> > >	(arm_cortex_a9_tune, arm_cortex_a12_tune): Ditto.
> > >	(arm_v7m_tune, arm_v6m_tune, arm_fa726te_tune): Ditto.
> > >	(arm_const_inline_cost): New function.
> > >	(arm_block_set_max_insns): New function.
> > >	(arm_block_set_straight_profit_p): New function.
> > >	(arm_block_set_vect_profit_p): New function.
> > >	(arm_block_set_unaligned_vect): New function.
> > >	(arm_block_set_aligned_vect): New function.
> > >	(arm_block_set_unaligned_straight): New function.
> > >	(arm_block_set_aligned_straight): New function.
> > >	(arm_block_set_vect, arm_gen_setmem): New functions.
> > >
> > > gcc/testsuite/ChangeLog
> > > 2014-04-29  Bin Cheng
> > >
> > >	PR target/55701
> > >	* gcc.target/arm/memset-inline-1.c: New test.
> > >	* gcc.target/arm/memset-inline-2.c: New test.
> > >	* gcc.target/arm/memset-inline-3.c: New test.
> > >	* gcc.target/arm/memset-inline-4.c: New test.
> > >	* gcc.target/arm/memset-inline-5.c: New test.
> > >	* gcc.target/arm/memset-inline-6.c: New test.
> > >	* gcc.target/arm/memset-inline-7.c: New test.
> > >	* gcc.target/arm/memset-inline-8.c: New test.
> > >	* gcc.target/arm/memset-inline-9.c: New test.
> > >
> > >
> > > j1328-20140429.txt
> > >
> > >
> > > Index: gcc/config/arm/arm.c
> > > ===================================================================
> > > --- gcc/config/arm/arm.c	(revision 209852)
> > > +++ gcc/config/arm/arm.c	(working copy)
> > > @@ -1585,10 +1585,11 @@ const struct tune_params arm_slowmul_tune =
> > >    true,					/* Prefer constant pool.  */
> > >    arm_default_branch_cost,
> > >    false,					/* Prefer LDRD/STRD.  */
> > > -  {true, true},				/* Prefer non short circuit.  */
> > > -  &arm_default_vec_cost,		/* Vectorizer costs.  */
> > > -  false,					/* Prefer Neon for 64-bits bitops.  */
> > > -  false, false				/* Prefer 32-bit encodings.  */
> > > +  {true, true},				/* Prefer non short circuit.  */
> > > +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> > > +  false,					/* Prefer Neon for 64-bits bitops.  */
> > > +  false, false,				/* Prefer 32-bit encodings.  */
> > > +  false					/* Prefer Neon for stringops.  */
> > >  };
> > >
> >
> > Please make sure that all the white space before the comments is using
> > TAB, not spaces.  Similarly for the other tables.
> Fixed.
>
> > > @@ -16788,6 +16806,14 @@ arm_const_double_inline_cost (rtx val)
> > >  			      NULL_RTX, NULL_RTX, 0, 0));
> > >  }
> > >
> > > +/* Cost of loading a SImode constant.  */
> > > +static inline int
> > > +arm_const_inline_cost (rtx val)
> > > +{
> > > +  return arm_gen_constant (SET, SImode, NULL_RTX, INTVAL (val),
> > > +			   NULL_RTX, NULL_RTX, 0, 0);
> > > +}
> > > +
> >
> > This could be used more widely if you passed the SET in as a parameter
> > (there are cases in arm_new_rtx_cost that could use it, for example).
> > Also, you want to enable sub-targets (it is only unsafe to do so once
> > you can no longer create new pseudos), so the penultimate argument in
> > the call to arm_gen_constant should be 1.
> Fixed.
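
The generalization Richard asks for would look roughly like this (a
sketch based on the arm_gen_constant call quoted above, not the actual
follow-up code):

    /* Cost of loading a SImode constant for rtx operation CODE.
       Sub-targets enabled: the penultimate argument is 1, as the
       review requests.  */
    static inline int
    arm_const_inline_cost (enum rtx_code code, rtx val)
    {
      return arm_gen_constant (code, SImode, NULL_RTX, INTVAL (val),
                               NULL_RTX, NULL_RTX, 1, 0);
    }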
>
> > > /* Return true if it is worthwhile to split a 64-bit constant into two
> > >    32-bit operations.  This is the case if optimizing for size, or
> > >    if we have load delay slots, or if one 32-bit part can be done with
> > > @@ -31350,6 +31383,504 @@ arm_validize_comparison (rtx *comparison, rtx * op
> > >
> > >  }
> > >
> > > +/* Maximum number of instructions to set a block of memory.  */
> > > +static int
> > > +arm_block_set_max_insns (void)
> > > +{
> > > +  return (optimize_function_for_size_p (cfun) ? 4 : 8);
> > > +}
> >
> > I think the non-size_p alternative should really be a parameter in the
> > per-cpu costs table.
> Fixed.
>
> > > +
> > > +/* Return TRUE if it's profitable to set a block of memory for the
> > > +   straight case.  */
> >
> > "Straight" is confusing here.  Do you mean non-vectorized?  If so,
> > then non_vect might be clearer.
> Fixed.
>
> > The arguments should really be documented (see comment below about
> > align, for example).
> Fixed.
>
> > > +static bool
> > > +arm_block_set_straight_profit_p (rtx val,
> > > +				 unsigned HOST_WIDE_INT length,
> > > +				 unsigned HOST_WIDE_INT align,
> > > +				 bool unaligned_p, bool use_strd_p)
> > > +{
> > > +  int num = 0;
> > > +  /* For leftovers in bytes of 0-7, we can set the memory block using
> > > +     strb/strh/str with minimum instruction number.  */
> > > +  int leftover[8] = {0, 1, 1, 2, 1, 2, 2, 3};
> >
> > This should be marked const.
> Fixed.
>
> > > +
> > > +  if (unaligned_p)
> > > +    {
> > > +      num = arm_const_inline_cost (val);
> > > +      num += length / align + length % align;
> >
> > Isn't align in bits here, when you really want it in bytes?
> All alignments are in bytes, starting from the "setmem" pattern.
>
> > What if align > 4 bytes?
> Then it's the "!unaligned_p" case, handled by the other arms of this if
> statement.
>
> > > +    }
> > > +  else if (use_strd_p)
> > > +    {
> > > +      num = arm_const_double_inline_cost (val);
> > > +      num += (length >> 3) + leftover[length & 7];
> > > +    }
> > > +  else
> > > +    {
> > > +      num = arm_const_inline_cost (val);
> > > +      num += (length >> 2) + leftover[length & 3];
> > > +    }
> > > +
> > > +  /* We may be able to combine the last pair STRH/STRB into a single
> > > +     STR by shifting one byte back.  */
> > > +  if (unaligned_access && length > 3 && (length & 3) == 3)
> > > +    num--;
> > > +
> > > +  return (num <= arm_block_set_max_insns ());
> > > +}
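
To make the counting above concrete, a worked instance (the arithmetic
is illustrative, not from the thread):

    /* length = 15:
         strd path: (15 >> 3) = 1 strd, leftover[15 & 7] = leftover[7] = 3
                    (str + strh + strb), plus the constant load;
         str path:  (15 >> 2) = 3 str, leftover[15 & 3] = leftover[3] = 2
                    (strh + strb), plus the constant load.
       In both cases length > 3 and (length & 3) == 3, so with unaligned
       access the trailing STRH/STRB pair is counted as a single
       overlapping STR and num is decremented once.  */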
> > > +
> > > +/* Return TRUE if it's profitable to set a block of memory for the
> > > +   vector case.  */
> > > +static bool
> > > +arm_block_set_vect_profit_p (unsigned HOST_WIDE_INT length,
> > > +			     unsigned HOST_WIDE_INT align ATTRIBUTE_UNUSED,
> > > +			     bool unaligned_p, enum machine_mode mode)
> >
> > I'm not sure what you mean by unaligned here.  Again, documenting the
> > arguments might help.
> Fixed.
>
> > > +{
> > > +  int num;
> > > +  unsigned int nelt = GET_MODE_NUNITS (mode);
> > > +
> > > +  /* Num of instruction loading constant value.  */
> >
> > Use either "Number" or, in this case, simply drop that bit and write:
> >   /* Instruction loading constant value.  */
> Fixed.
>
> > > +  num = 1;
> > > +  /* Num of store instructions.  */
> >
> > Likewise.
> >
> > > +  num += (length + nelt - 1) / nelt;
> > > +  /* Num of address adjusting instructions.  */
> >
> > Can't we work on the premise that the address adjusting instructions
> > will be merged into the stores?  I know you said that they currently
> > do not, but that's not a problem that this bit of code should have to
> > worry about.
> Fixed.
>
> > > +  if (unaligned_p)
> > > +    /* For unaligned case, it's one less than the store instructions.  */
> > > +    num += (length + nelt - 1) / nelt - 1;
> > > +  else if ((length & 3) != 0)
> > > +    /* For aligned case, it's one if bytes leftover can only be stored
> > > +       by mis-aligned store instruction.  */
> > > +    num++;
> > > +
> > > +  /* Store the first 16 bytes using vst1:v16qi for the aligned case.  */
> > > +  if (!unaligned_p && mode == V16QImode)
> > > +    num--;
> > > +
> > > +  return (num <= arm_block_set_max_insns ());
> > > +}
> > > +
> > > +/* Set a block of memory using vectorization instructions for the
> > > +   unaligned case.  We fill the first LENGTH bytes of the memory
> > > +   area starting from DSTBASE with byte constant VALUE.  ALIGN is
> > > +   the alignment requirement of memory.  */
> >
> > What's the return value mean?
> Documented.
>
> > > +static bool
> > > +arm_block_set_unaligned_vect (rtx dstbase,
> > > +			      unsigned HOST_WIDE_INT length,
> > > +			      unsigned HOST_WIDE_INT value,
> > > +			      unsigned HOST_WIDE_INT align)
> > > +{
> > > +  unsigned int i = 0, j = 0, nelt_v16, nelt_v8, nelt_mode;
> >
> > Don't mix initialized declarations with uninitialized ones on the same
> > line.  You don't appear to use either I or J until their first use in
> > the loop control below, so why initialize them here?
> Fixed.
>
> > > +  rtx dst, mem;
> > > +  rtx val_elt, val_vec, reg;
> > > +  rtx rval[MAX_VECT_LEN];
> > > +  rtx (*gen_func) (rtx, rtx);
> > > +  enum machine_mode mode;
> > > +  unsigned HOST_WIDE_INT v = value;
> > > +
> > > +  gcc_assert ((align & 0x3) != 0);
> > > +  nelt_v8 = GET_MODE_NUNITS (V8QImode);
> > > +  nelt_v16 = GET_MODE_NUNITS (V16QImode);
> > > +  if (length >= nelt_v16)
> > > +    {
> > > +      mode = V16QImode;
> > > +      gen_func = gen_movmisalignv16qi;
> > > +    }
> > > +  else
> > > +    {
> > > +      mode = V8QImode;
> > > +      gen_func = gen_movmisalignv8qi;
> > > +    }
> > > +  nelt_mode = GET_MODE_NUNITS (mode);
> > > +  gcc_assert (length >= nelt_mode);
> > > +  /* Skip if it isn't profitable.  */
> > > +  if (!arm_block_set_vect_profit_p (length, align, true, mode))
> > > +    return false;
> > > +
> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
> > > +  mem = adjust_automodify_address (dstbase, mode, dst, 0);
> > > +
> > > +  v = sext_hwi (v, BITS_PER_WORD);
> > > +  val_elt = GEN_INT (v);
> > > +  for (; j < nelt_mode; j++)
> > > +    rval[j] = val_elt;
> >
> > Is this the first use of J?  If so, initialize it here.
> >
> > > +
> > > +  reg = gen_reg_rtx (mode);
> > > +  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval));
> > > +  /* Emit instruction loading the constant value.  */
> > > +  emit_move_insn (reg, val_vec);
> > > +
> > > +  /* Handle nelt_mode bytes in a vector.  */
> > > +  for (; (i + nelt_mode <= length); i += nelt_mode)
> >
> > Similarly for I.
> >
> > > +    {
> > > +      emit_insn ((*gen_func) (mem, reg));
> > > +      if (i + 2 * nelt_mode <= length)
> > > +	emit_insn (gen_add2_insn (dst, GEN_INT (nelt_mode)));
> > > +    }
> > > +
> > > +  if (i + nelt_v8 <= length)
> > > +    gcc_assert (mode == V16QImode);
> >
> > Why not drop the if and write:
> >
> >   gcc_assert ((i + nelt_v8) > length || mode == V16QImode);
> Fixed.
>
> > > +
> > > +  /* Handle (8, 16) bytes leftover.  */
> > > +  if (i + nelt_v8 < length)
> >
> > Your assertion above checked <=, but here you use <.  Is that correct?
> Yes, it is.  For the "==" case, it means we have nelt_v8 bytes leftover,
> which will be handled by the last branch of the if statement.
>
> > > +    {
> > > +      emit_insn (gen_add2_insn (dst, GEN_INT (length - i)));
> > > +      /* We are shifting bytes back, set the alignment accordingly.  */
> > > +      if ((length & 1) != 0 && align >= 2)
> > > +	set_mem_align (mem, BITS_PER_UNIT);
> > > +
> > > +      emit_insn (gen_movmisalignv16qi (mem, reg));
> > > +    }
> > > +  /* Handle (0, 8] bytes leftover.  */
> > > +  else if (i < length && i + nelt_v8 >= length)
> > > +    {
> > > +      if (mode == V16QImode)
> > > +	{
> > > +	  reg = gen_lowpart (V8QImode, reg);
> > > +	  mem = adjust_automodify_address (dstbase, V8QImode, dst, 0);
> > > +	}
> > > +      emit_insn (gen_add2_insn (dst, GEN_INT ((length - i)
> > > +					      + (nelt_mode - nelt_v8))));
> > > +      /* We are shifting bytes back, set the alignment accordingly.  */
> > > +      if ((length & 1) != 0 && align >= 2)
> > > +	set_mem_align (mem, BITS_PER_UNIT);
> > > +
> > > +      emit_insn (gen_movmisalignv8qi (mem, reg));
> > > +    }
> > > +
> > > +  return true;
> > > +}
> > > +
> > > +/* Set a block of memory using vectorization instructions for the
> > > +   aligned case.  We fill the first LENGTH bytes of the memory area
> > > +   starting from DSTBASE with byte constant VALUE.  ALIGN is the
> > > +   alignment requirement of memory.  */
> >
> > See all the comments above for the unaligned case.
> Fixed accordingly.
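
The "shifting bytes back" idea used for the leftovers can be restated in
plain C; a sketch assuming a 16-byte vector store (the helper name is
made up, this is not the patch's code):

    #include <string.h>

    /* Two overlapping 16-byte stores cover any n in [16, 32], so the
       tail never degenerates into byte-by-byte stores.  */
    void
    set_overlap (unsigned char *p, unsigned char v, unsigned int n)
    {
      unsigned char q[16];
      memset (q, v, sizeof q);     /* stands in for vmov.i8 q8, #v */
      memcpy (p, q, 16);           /* vst1.8 {q8}, [p] */
      memcpy (p + n - 16, q, 16);  /* overlapping vst1.8 at p + n - 16 */
    }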
>
> > > +static bool
> > > +arm_block_set_aligned_vect (rtx dstbase,
> > > +			    unsigned HOST_WIDE_INT length,
> > > +			    unsigned HOST_WIDE_INT value,
> > > +			    unsigned HOST_WIDE_INT align)
> > > +{
> > > +  unsigned int i = 0, j = 0, nelt_v8, nelt_v16, nelt_mode;
> > > +  rtx dst, addr, mem;
> > > +  rtx val_elt, val_vec, reg;
> > > +  rtx rval[MAX_VECT_LEN];
> > > +  enum machine_mode mode;
> > > +  unsigned HOST_WIDE_INT v = value;
> > > +
> > > +  gcc_assert ((align & 0x3) == 0);
> > > +  nelt_v8 = GET_MODE_NUNITS (V8QImode);
> > > +  nelt_v16 = GET_MODE_NUNITS (V16QImode);
> > > +  if (length >= nelt_v16 && unaligned_access && !BYTES_BIG_ENDIAN)
> > > +    mode = V16QImode;
> > > +  else
> > > +    mode = V8QImode;
> > > +
> > > +  nelt_mode = GET_MODE_NUNITS (mode);
> > > +  gcc_assert (length >= nelt_mode);
> > > +  /* Skip if it isn't profitable.  */
> > > +  if (!arm_block_set_vect_profit_p (length, align, false, mode))
> > > +    return false;
> > > +
> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
> > > +
> > > +  v = sext_hwi (v, BITS_PER_WORD);
> > > +  val_elt = GEN_INT (v);
> > > +  for (; j < nelt_mode; j++)
> > > +    rval[j] = val_elt;
> > > +
> > > +  reg = gen_reg_rtx (mode);
> > > +  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval));
> > > +  /* Emit instruction loading the constant value.  */
> > > +  emit_move_insn (reg, val_vec);
> > > +
> > > +  /* Handle first 16 bytes specially using vst1:v16qi instruction.  */
> > > +  if (mode == V16QImode)
> > > +    {
> > > +      mem = adjust_automodify_address (dstbase, mode, dst, 0);
> > > +      emit_insn (gen_movmisalignv16qi (mem, reg));
> > > +      i += nelt_mode;
> > > +      /* Handle (8, 16) bytes leftover using vst1:v16qi again.  */
> > > +      if (i + nelt_v8 < length && i + nelt_v16 > length)
> > > +	{
> > > +	  emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
> > > +	  mem = adjust_automodify_address (dstbase, mode, dst, 0);
> > > +	  /* We are shifting bytes back, set the alignment accordingly.  */
> > > +	  if ((length & 0x3) == 0)
> > > +	    set_mem_align (mem, BITS_PER_UNIT * 4);
> > > +	  else if ((length & 0x1) == 0)
> > > +	    set_mem_align (mem, BITS_PER_UNIT * 2);
> > > +	  else
> > > +	    set_mem_align (mem, BITS_PER_UNIT);
> > > +
> > > +	  emit_insn (gen_movmisalignv16qi (mem, reg));
> > > +	  return true;
> > > +	}
> > > +      /* Fall through for bytes leftover.  */
> > > +      mode = V8QImode;
> > > +      nelt_mode = GET_MODE_NUNITS (mode);
> > > +      reg = gen_lowpart (V8QImode, reg);
> > > +    }
> > > +
> > > +  /* Handle 8 bytes in a vector.  */
> > > +  for (; (i + nelt_mode <= length); i += nelt_mode)
> > > +    {
> > > +      addr = plus_constant (Pmode, dst, i);
> > > +      mem = adjust_automodify_address (dstbase, mode, addr, i);
> > > +      emit_move_insn (mem, reg);
> > > +    }
> > > +
> > > +  /* Handle single word leftover by shifting 4 bytes back.  We can
> > > +     use aligned access for this case.  */
> > > +  if (i + UNITS_PER_WORD == length)
> > > +    {
> > > +      addr = plus_constant (Pmode, dst, i - UNITS_PER_WORD);
> > > +      mem = adjust_automodify_address (dstbase, mode,
> > > +				       addr, i - UNITS_PER_WORD);
> > > +      /* We are shifting 4 bytes back, set the alignment accordingly.  */
> > > +      if (align > UNITS_PER_WORD)
> > > +	set_mem_align (mem, BITS_PER_UNIT * UNITS_PER_WORD);
> > > +
> > > +      emit_move_insn (mem, reg);
> > > +    }
> > > +  /* Handle (0, 4), (4, 8) bytes leftover by shifting bytes back.
> > > +     We have to use unaligned access for this case.  */
> > > +  else if (i < length)
> > > +    {
> > > +      emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
> > > +      mem = adjust_automodify_address (dstbase, mode, dst, 0);
> > > +      /* We are shifting bytes back, set the alignment accordingly.  */
> > > +      if ((length & 1) == 0)
> > > +	set_mem_align (mem, BITS_PER_UNIT * 2);
> > > +      else
> > > +	set_mem_align (mem, BITS_PER_UNIT);
> > > +
> > > +      emit_insn (gen_movmisalignv8qi (mem, reg));
> > > +    }
> > > +
> > > +  return true;
> > > +}
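
A short trace of this path (my reading of the code above; the mnemonics
follow the sequences quoted earlier in the thread, so take them as an
assumption rather than verified output):

    /* length = 20, align = 4, unaligned access available, little-endian:
         i = 0 : 16-byte misaligned-capable store   vst1.8 {q8}, [r0]
         i = 16: i + 4 == length, so the last word is written by an
                 8-byte store shifted 4 bytes back  vstr d16, [r0, #12]
       Two stores plus the constant load; the 4-byte overlap at offsets
       12-15 avoids any strh/strb tail.  */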
> > > +
> > > +/* Set a block of memory using plain strh/strb instructions, only
> > > +   using instructions allowed by ALIGN on the processor.  We fill the
> > > +   first LENGTH bytes of the memory area starting from DSTBASE
> > > +   with byte constant VALUE.  ALIGN is the alignment requirement
> > > +   of memory.  */
> > > +static bool
> > > +arm_block_set_unaligned_straight (rtx dstbase,
> > > +				  unsigned HOST_WIDE_INT length,
> > > +				  unsigned HOST_WIDE_INT value,
> > > +				  unsigned HOST_WIDE_INT align)
> > > +{
> > > +  unsigned int i;
> > > +  rtx dst, addr, mem;
> > > +  rtx val_exp, val_reg, reg;
> > > +  enum machine_mode mode;
> > > +  HOST_WIDE_INT v = value;
> > > +
> > > +  gcc_assert (align == 1 || align == 2);
> > > +
> > > +  if (align == 2)
> > > +    v |= (value << BITS_PER_UNIT);
> > > +
> > > +  v = sext_hwi (v, BITS_PER_WORD);
> > > +  val_exp = GEN_INT (v);
> > > +  /* Skip if it isn't profitable.  */
> > > +  if (!arm_block_set_straight_profit_p (val_exp, length,
> > > +					align, true, false))
> > > +    return false;
> > > +
> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
> > > +  mode = (align == 2 ? HImode : QImode);
> > > +  val_reg = force_reg (SImode, val_exp);
> > > +  reg = gen_lowpart (mode, val_reg);
> > > +
> > > +  for (i = 0; (i + GET_MODE_SIZE (mode) <= length);
> > > +       i += GET_MODE_SIZE (mode))
> > > +    {
> > > +      addr = plus_constant (Pmode, dst, i);
> > > +      mem = adjust_automodify_address (dstbase, mode, addr, i);
> > > +      emit_move_insn (mem, reg);
> > > +    }
> > > +
> > > +  /* Handle single byte leftover.  */
> > > +  if (i + 1 == length)
> > > +    {
> > > +      reg = gen_lowpart (QImode, val_reg);
> > > +      addr = plus_constant (Pmode, dst, i);
> > > +      mem = adjust_automodify_address (dstbase, QImode, addr, i);
> > > +      emit_move_insn (mem, reg);
> > > +      i++;
> > > +    }
> > > +
> > > +  gcc_assert (i == length);
> > > +  return true;
> > > +}
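
Concretely, an illustrative trace of this function (my arithmetic, not
from the thread):

    /* length = 7, align = 2, value = 0xAB:
         v = 0xAB | (0xAB << 8) = 0xABAB, mode = HImode,
         loop: strh at offsets 0, 2 and 4,
         tail: strb of 0xAB at offset 6,
       i.e. four stores plus the constant load, within the non-size
       budget of 8 from arm_block_set_max_insns.  */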
> > > +
> > > +/* Set a block of memory using plain strd/str/strh/strb instructions,
> > > +   to permit unaligned copies on processors which support unaligned
> > > +   semantics for those instructions.  We fill the first LENGTH bytes
> > > +   of the memory area starting from DSTBASE with byte constant VALUE.
> > > +   ALIGN is the alignment requirement of memory.  */
> > > +static bool
> > > +arm_block_set_aligned_straight (rtx dstbase,
> > > +				unsigned HOST_WIDE_INT length,
> > > +				unsigned HOST_WIDE_INT value,
> > > +				unsigned HOST_WIDE_INT align)
> > > +{
> > > +  unsigned int i = 0;
> > > +  rtx dst, addr, mem;
> > > +  rtx val_exp, val_reg, reg;
> > > +  unsigned HOST_WIDE_INT v;
> > > +  bool use_strd_p;
> > > +
> > > +  use_strd_p = (length >= 2 * UNITS_PER_WORD && (align & 3) == 0
> > > +		&& TARGET_LDRD && current_tune->prefer_ldrd_strd);
> > > +
> > > +  v = (value | (value << 8) | (value << 16) | (value << 24));
> > > +  if (length < UNITS_PER_WORD)
> > > +    v &= (0xFFFFFFFF >> (UNITS_PER_WORD - length) * BITS_PER_UNIT);
> > > +
> > > +  if (use_strd_p)
> > > +    v |= (v << BITS_PER_WORD);
> > > +  else
> > > +    v = sext_hwi (v, BITS_PER_WORD);
> > > +
> > > +  val_exp = GEN_INT (v);
> > > +  /* Skip if it isn't profitable.  */
> > > +  if (!arm_block_set_straight_profit_p (val_exp, length,
> > > +					align, false, use_strd_p))
> > > +    {
> > > +      /* Try without strd.  */
> > > +      v = (v >> BITS_PER_WORD);
> > > +      v = sext_hwi (v, BITS_PER_WORD);
> > > +      val_exp = GEN_INT (v);
> > > +      use_strd_p = false;
> > > +      if (!arm_block_set_straight_profit_p (val_exp, length,
> > > +					    align, false, use_strd_p))
> > > +	return false;
> > > +    }
> > > +
> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
> > > +  /* Handle double words using strd if possible.  */
> > > +  if (use_strd_p)
> > > +    {
> > > +      val_reg = force_reg (DImode, val_exp);
> > > +      reg = val_reg;
> > > +      for (; (i + 8 <= length); i += 8)
> > > +	{
> > > +	  addr = plus_constant (Pmode, dst, i);
> > > +	  mem = adjust_automodify_address (dstbase, DImode, addr, i);
> > > +	  emit_move_insn (mem, reg);
> > > +	}
> > > +    }
> > > +  else
> > > +    val_reg = force_reg (SImode, val_exp);
> > > +
> > > +  /* Handle words.  */
> > > +  reg = (use_strd_p ? gen_lowpart (SImode, val_reg) : val_reg);
> > > +  for (; (i + 4 <= length); i += 4)
> > > +    {
> > > +      addr = plus_constant (Pmode, dst, i);
> > > +      mem = adjust_automodify_address (dstbase, SImode, addr, i);
> > > +      if ((align & 3) == 0)
> > > +	emit_move_insn (mem, reg);
> > > +      else
> > > +	emit_insn (gen_unaligned_storesi (mem, reg));
> > > +    }
> > > +
> > > +  /* Merge the last pair of STRH and STRB into a STR if possible.  */
> > > +  if (unaligned_access && i > 0 && (i + 3) == length)
> > > +    {
> > > +      addr = plus_constant (Pmode, dst, i - 1);
> > > +      mem = adjust_automodify_address (dstbase, SImode, addr, i - 1);
> > > +      /* We are shifting one byte back, set the alignment accordingly.  */
> > > +      if ((align & 1) == 0)
> > > +	set_mem_align (mem, BITS_PER_UNIT);
> > > +
> > > +      /* Most likely this is an unaligned access, and we can't tell at
> > > +	 compilation time.  */
> > > +      emit_insn (gen_unaligned_storesi (mem, reg));
> > > +      return true;
> > > +    }
> > > +
> > > +  /* Handle half word leftover.  */
> > > +  if (i + 2 <= length)
> > > +    {
> > > +      reg = gen_lowpart (HImode, val_reg);
> > > +      addr = plus_constant (Pmode, dst, i);
> > > +      mem = adjust_automodify_address (dstbase, HImode, addr, i);
> > > +      if ((align & 1) == 0)
> > > +	emit_move_insn (mem, reg);
> > > +      else
> > > +	emit_insn (gen_unaligned_storehi (mem, reg));
> > > +
> > > +      i += 2;
> > > +    }
> > > +
> > > +  /* Handle single byte leftover.  */
> > > +  if (i + 1 == length)
> > > +    {
> > > +      reg = gen_lowpart (QImode, val_reg);
> > > +      addr = plus_constant (Pmode, dst, i);
> > > +      mem = adjust_automodify_address (dstbase, QImode, addr, i);
> > > +      emit_move_insn (mem, reg);
> > > +    }
> > > +
> > > +  return true;
> > > +}
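
The STRH/STRB merge relies on the same overlapping-store trick; in C
terms (a hypothetical helper, not part of the patch):

    #include <string.h>

    /* With three trailing bytes and at least one word already written,
       one unaligned word store at length - 4 rewrites a byte that is
       already set and covers the remaining three.  */
    void
    store_tail3 (unsigned char *p, unsigned int vword, unsigned int length)
    {
      memcpy (p + length - 4, &vword, 4);  /* stands in for unaligned str */
    }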
> > > +
> > > +/* Set a block of memory using vectorization instructions for both
> > > +   aligned and unaligned cases.  We fill the first LENGTH bytes of
> > > +   the memory area starting from DSTBASE with byte constant VALUE.
> > > +   ALIGN is the alignment requirement of memory.  */
> > > +static bool
> > > +arm_block_set_vect (rtx dstbase,
> > > +		    unsigned HOST_WIDE_INT length,
> > > +		    unsigned HOST_WIDE_INT value,
> > > +		    unsigned HOST_WIDE_INT align)
> > > +{
> > > +  /* Check whether we need to use unaligned store instruction.  */
> > > +  if (((align & 3) != 0 || (length & 3) != 0)
> > > +      /* Check whether unaligned store instruction is available.  */
> > > +      && (!unaligned_access || BYTES_BIG_ENDIAN))
> > > +    return false;
> >
> > Huh!  vst1.8 can work for unaligned accesses even when hw alignment
> > checking is strict.
> Emm, all movmisalign patterns are guarded by "!BYTES_BIG_ENDIAN &&
> unaligned_access", so vst1.8 instructions can't be recognized this way
> for now.  I agree that it's too strict, but that's another problem, I
> think.
>
> > > +
> > > +  if ((align & 3) == 0)
> > > +    return arm_block_set_aligned_vect (dstbase, length, value, align);
> > > +  else
> > > +    return arm_block_set_unaligned_vect (dstbase, length, value, align);
> > > +}
> > > +
> > > +/* Expand a string store operation.  Firstly we try to do that by using
> > > +   vectorization instructions, then try with ARM unaligned access and
> > > +   double-word store if profitable.  OPERANDS[0] is the destination,
> > > +   OPERANDS[1] is the number of bytes, OPERANDS[2] is the value to
> > > +   initialize the memory, OPERANDS[3] is the known alignment of the
> > > +   destination.  */
> > > +bool
> > > +arm_gen_setmem (rtx *operands)
> > > +{
> > > +  rtx dstbase = operands[0];
> > > +  unsigned HOST_WIDE_INT length;
> > > +  unsigned HOST_WIDE_INT value;
> > > +  unsigned HOST_WIDE_INT align;
> > > +
> > > +  if (!CONST_INT_P (operands[2]) || !CONST_INT_P (operands[1]))
> > > +    return false;
> > > +
> > > +  length = UINTVAL (operands[1]);
> > > +  if (length > 64)
> > > +    return false;
> > > +
> > > +  value = (UINTVAL (operands[2]) & 0xFF);
> > > +  align = UINTVAL (operands[3]);
> > > +  if (TARGET_NEON && length >= 8
> > > +      && current_tune->string_ops_prefer_neon
> > > +      && arm_block_set_vect (dstbase, length, value, align))
> > > +    return true;
> > > +
> > > +  if (!unaligned_access && (align & 3) != 0)
> > > +    return arm_block_set_unaligned_straight (dstbase, length,
> > > +					     value, align);
> > > +
> > > +  return arm_block_set_aligned_straight (dstbase, length, value, align);
> > > +}
> > > +
> > >  /* Implement the TARGET_ASAN_SHADOW_OFFSET hook.  */
> > >
> > >  static unsigned HOST_WIDE_INT
> > > Index: gcc/config/arm/arm-protos.h
> > > ===================================================================
> > > --- gcc/config/arm/arm-protos.h	(revision 209852)
> > > +++ gcc/config/arm/arm-protos.h	(working copy)
> > > @@ -277,6 +277,8 @@ struct tune_params
> > >    /* Prefer 32-bit encoding instead of 16-bit encoding where subset
> > >       of flags would be set.  */
> > >    bool disparage_partial_flag_setting_t16_encodings;
> > > +  /* Prefer to inline string operations like memset by using Neon.  */
> > > +  bool string_ops_prefer_neon;
> > >  };
> > >
> > >  extern const struct tune_params *current_tune;
> > > @@ -289,6 +291,7 @@ extern void arm_emit_coreregs_64bit_shift (enum rt
> > >  extern bool arm_validize_comparison (rtx *, rtx *, rtx *);
> > >  #endif /* RTX_CODE */
> > >
> > > +extern bool arm_gen_setmem (rtx *);
> > >  extern void arm_expand_vec_perm (rtx target, rtx op0, rtx op1, rtx sel);
> > >  extern bool arm_expand_vec_perm_const (rtx target, rtx op0, rtx op1, rtx sel);
> > >
> > > Index: gcc/config/arm/arm.md
> > > ===================================================================
> > > --- gcc/config/arm/arm.md	(revision 209852)
> > > +++ gcc/config/arm/arm.md	(working copy)
> > > @@ -7555,6 +7555,20 @@
> > >  })
> > >
> > >
> > > +(define_expand "setmemsi"
> > > +  [(match_operand:BLK 0 "general_operand" "")
> > > +   (match_operand:SI 1 "const_int_operand" "")
> > > +   (match_operand:SI 2 "const_int_operand" "")
> > > +   (match_operand:SI 3 "const_int_operand" "")]
> > > +  "TARGET_32BIT"
> > > +{
> > > +  if (arm_gen_setmem (operands))
> > > +    DONE;
> > > +
> > > +  FAIL;
> > > +})
> > > +
> > > +
> > >  ;; Move a block of memory if it is word aligned and MORE than 2 words long.
> > >  ;; We could let this apply for blocks of less than this, but it clobbers so
> > >  ;; many registers that there is then probably a better way.
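
The quoted testsuite hunk below shows only the new file's header.  A
plausible shape for such a test, with the dg directives and scan pattern
guessed rather than copied from the actual files:

    /* { dg-do compile } */
    /* { dg-options "-O2" } */

    void
    f (char *p)
    {
      __builtin_memset (p, 0, 16);
    }

    /* { dg-final { scan-assembler-not "bl\tmemset" } } */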
> > > Index: gcc/testsuite/gcc.target/arm/memset-inline-6.c
> > > ===================================================================
> > > --- gcc/testsuite/gcc.target/arm/memset-inline-6.c	(revision 0)
> > > +++ gcc/testsuite/gcc.target/arm/memset-inline-6.c	(revision 0)
> >
> > Have you tested these when the compiler was configured with
> > "--with-cpu=cortex-a9"?
> Here is the tricky part.
> For a compiler configured with "--with-tune=cortex-a9", the NEON-related
> cases (4/5/6/8/9) would fail, because we have no way to determine that we
> are compiling with cortex-a9 tuning here.
> For a compiler configured with "--with-cpu=cortex-a9", the test cases
> would pass, but I think this is a mistake.  It reveals an issue: GCC
> won't pass "-mcpu=cortex-a9" to cc1, resulting in the cortex-a8 tuning
> being selected.  It just makes no sense.
> With these issues, I didn't change the tests for now.

Precisely, I configured gcc with the options "--with-arch=armv7-a
--with-cpu|--with-tune=cortex-a9".  I read the gcc documentation and
realized that "-mcpu" is ignored when "-march" is specified.  I don't know
why gcc acts in this manner, but it leads to inconsistent
configuration/command-line behavior.
If we configure GCC with "--with-arch=armv7-a --with-cpu=cortex-a9", then
only "-march=armv7-a" is passed to cc1.
If we compile with "-march=armv7-a -mcpu=cortex-a9", then gcc works fine
and passes "-march=armv7-a -mcpu=cortex-a9" to cc1.  Even more weirdly,
cc1 warns that "switch -mcpu=cortex-m4 conflicts with -march=armv7-m
switch".

Thanks,
bin