From: Richard Earnshaw <rearnsha@arm.com>
To: "bin.cheng" <bin.cheng@arm.com>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH ARM] Improve ARM memset inlining
Date: Fri, 02 May 2014 13:59:00 -0000
Message-ID: <5363A4CF.9050904@arm.com>
In-Reply-To: <003301cf641f$36733f40$a359bdc0$@arm.com>
On 30/04/14 03:52, bin.cheng wrote:
> Hi,
> This patch expands small memset calls into direct memory set instructions by
> introducing a "setmemsi" pattern.  For processors without NEON support, it
> expands memset using general store instructions, for example strd for
> 4-byte aligned addresses.  For processors with NEON support, it expands
> memset using NEON instructions like vstr and the miscellaneous vst1.*
> instructions, for both aligned and unaligned cases.
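(To illustrate the intent with a concrete example of my own -- not
taken from the patch -- a call such as

__builtin_memset (p, 0x55, 8);

with P known to be 4-byte aligned could then become a single strd of
a register pair holding 0x55555555, instead of a call to memset.)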
>
> This patch depends on
> http://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html otherwise vst1.64
> will be generated for 32-bit aligned memory unit.
>
> There is also one piece of leftover work for this patch: since vst1.*
> instructions only support the post-increment addressing mode, the inlined
> memset for unaligned NEON cases should look like:
> vmov.i32 q8, #...
> vst1.8 {q8}, [r3]!
> vst1.8 {q8}, [r3]!
> vst1.8 {q8}, [r3]!
> vst1.8 {q8}, [r3]
Other than for zero, I'd expect the vmov to be vmov.i8 to move an
arbitrary byte value into all lanes in a vector. After that, if the
alignment is known to be more than 8-bit, I'd expect the vst1
instructions (with the exception of the last store if the length is not
a multiple of the alignment) to use
vst1.<align> {reg}, [addr-reg :<align>]!
Hence, for 16-bit aligned data, we want
vst1.16 {q8}, [r3:16]!
> But for now, GCC can't do this, and the code below is generated instead:
> vmov.i32 q8, #...
> vst1.8 {q8}, [r3]
> add r2, r3, #16
> add r3, r2, #16
> vst1.8 {q8}, [r2]
> vst1.8 {q8}, [r3]
> add r2, r3, #16
> vst1.8 {q8}, [r2]
>
> I investigated this issue.  The root cause lies in the rtx costs returned
> by the ARM backend.  Anyway, I think this is a separate issue and should
> be fixed in a separate patch.
>
> Bootstrapped and regression-tested on cortex-a15, with and without NEON
> support.  Is it OK?
>
Some more comments inline.
> Thanks,
> bin
>
>
> 2014-04-29 Bin Cheng <bin.cheng@arm.com>
>
> PR target/55701
> * config/arm/arm.md (setmem): New pattern.
> * config/arm/arm-protos.h (struct tune_params): New field.
> (arm_gen_setmem): New prototype.
> * config/arm/arm.c (arm_slowmul_tune): Initialize new field.
> (arm_fastmul_tune, arm_strongarm_tune, arm_xscale_tune): Ditto.
> (arm_9e_tune, arm_v6t2_tune, arm_cortex_tune): Ditto.
> (arm_cortex_a8_tune, arm_cortex_a7_tune): Ditto.
> (arm_cortex_a15_tune, arm_cortex_a53_tune): Ditto.
> (arm_cortex_a57_tune, arm_cortex_a5_tune): Ditto.
> (arm_cortex_a9_tune, arm_cortex_a12_tune): Ditto.
> (arm_v7m_tune, arm_v6m_tune, arm_fa726te_tune): Ditto.
> (arm_const_inline_cost): New function.
> (arm_block_set_max_insns): New function.
> (arm_block_set_straight_profit_p): New function.
> (arm_block_set_vect_profit_p): New function.
> (arm_block_set_unaligned_vect): New function.
> (arm_block_set_aligned_vect): New function.
> (arm_block_set_unaligned_straight): New function.
> (arm_block_set_aligned_straight): New function.
> (arm_block_set_vect, arm_gen_setmem): New functions.
>
> gcc/testsuite/ChangeLog
> 2014-04-29 Bin Cheng <bin.cheng@arm.com>
>
> PR target/55701
> * gcc.target/arm/memset-inline-1.c: New test.
> * gcc.target/arm/memset-inline-2.c: New test.
> * gcc.target/arm/memset-inline-3.c: New test.
> * gcc.target/arm/memset-inline-4.c: New test.
> * gcc.target/arm/memset-inline-5.c: New test.
> * gcc.target/arm/memset-inline-6.c: New test.
> * gcc.target/arm/memset-inline-7.c: New test.
> * gcc.target/arm/memset-inline-8.c: New test.
> * gcc.target/arm/memset-inline-9.c: New test.
>
>
> j1328-20140429.txt
>
>
> Index: gcc/config/arm/arm.c
> ===================================================================
> --- gcc/config/arm/arm.c (revision 209852)
> +++ gcc/config/arm/arm.c (working copy)
> @@ -1585,10 +1585,11 @@ const struct tune_params arm_slowmul_tune =
> true, /* Prefer constant pool. */
> arm_default_branch_cost,
> false, /* Prefer LDRD/STRD. */
> - {true, true}, /* Prefer non short circuit. */
> - &arm_default_vec_cost, /* Vectorizer costs. */
> - false, /* Prefer Neon for 64-bits bitops. */
> - false, false /* Prefer 32-bit encodings. */
> + {true, true}, /* Prefer non short circuit. */
> + &arm_default_vec_cost, /* Vectorizer costs. */
> + false, /* Prefer Neon for 64-bits bitops. */
> + false, false, /* Prefer 32-bit encodings. */
> + false /* Prefer Neon for stringops. */
> };
>
Please make sure that all the whitespace before the comments uses TAB,
not spaces.  Similarly for the other tables.
> @@ -16788,6 +16806,14 @@ arm_const_double_inline_cost (rtx val)
> NULL_RTX, NULL_RTX, 0, 0));
> }
>
> +/* Cost of loading a SImode constant. */
> +static inline int
> +arm_const_inline_cost (rtx val)
> +{
> + return arm_gen_constant (SET, SImode, NULL_RTX, INTVAL (val),
> + NULL_RTX, NULL_RTX, 0, 0);
> +}
> +
This could be used more widely if you passed the SET in as a parameter
(there are cases in arm_new_rtx_cost that could use it, for example).
Also, you want to enable subtargets (that's only unsafe once you can no
longer create new pseudos), so the penultimate argument in the call to
arm_gen_constant should be 1.
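Something like this is what I have in mind (a sketch only; adjust the
signature as you see fit):

/* Cost of loading a SImode constant, used with outer operation CODE.  */
static inline int
arm_const_inline_cost (enum rtx_code code, rtx val)
{
  return arm_gen_constant (code, SImode, NULL_RTX, INTVAL (val),
			   NULL_RTX, NULL_RTX, 1, 0);
}

Callers would then pass SET (or whatever code is relevant) explicitly.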
> /* Return true if it is worthwhile to split a 64-bit constant into two
> 32-bit operations. This is the case if optimizing for size, or
> if we have load delay slots, or if one 32-bit part can be done with
> @@ -31350,6 +31383,504 @@ arm_validize_comparison (rtx *comparison, rtx * op
>
> }
>
> +/* Maximum number of instructions to set block of memory. */
> +static int
> +arm_block_set_max_insns (void)
> +{
> + return (optimize_function_for_size_p (cfun) ? 4 : 8);
> +}
I think the non-size_p alternative should really be a parameter in the
per-cpu costs table.
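That is, something like (a sketch; the field name is mine):

/* New field in struct tune_params:  */
int max_insns_inline_memset;

static int
arm_block_set_max_insns (void)
{
  return (optimize_function_for_size_p (cfun)
	  ? 4 : current_tune->max_insns_inline_memset);
}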
> +
> +/* Return TRUE if it's profitable to set block of memory for straight
> + case. */
"Straight" is confusing here. Do you mean non-vectorized? If so, then
non_vect might be clearer.
The arguments should really be documented (see comment below about
align, for example).
> +static bool
> +arm_block_set_straight_profit_p (rtx val,
> + unsigned HOST_WIDE_INT length,
> + unsigned HOST_WIDE_INT align,
> + bool unaligned_p, bool use_strd_p)
> +{
> + int num = 0;
> + /* For leftovers in bytes of 0-7, we can set the memory block using
> + strb/strh/str with minimum instruction number. */
> + int leftover[8] = {0, 1, 1, 2, 1, 2, 2, 3};
This should be marked const.
> +
> + if (unaligned_p)
> + {
> + num = arm_const_inline_cost (val);
> + num += length / align + length % align;
Isn't align in bits here, when you really want it in bytes?
What if align > 4 bytes?
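If ALIGN can exceed the word size, the chunk size also needs capping;
assuming ALIGN is (or has been converted to) bytes, something like:

unsigned int chunk = MIN (align, UNITS_PER_WORD);
num += length / chunk + length % chunk;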
> + }
> + else if (use_strd_p)
> + {
> + num = arm_const_double_inline_cost (val);
> + num += (length >> 3) + leftover[length & 7];
> + }
> + else
> + {
> + num = arm_const_inline_cost (val);
> + num += (length >> 2) + leftover[length & 3];
> + }
> +
> + /* We may be able to combine last pair STRH/STRB into a single STR
> + by shifting one byte back. */
> + if (unaligned_access && length > 3 && (length & 3) == 3)
> + num--;
> +
> + return (num <= arm_block_set_max_insns ());
> +}
> +
> +/* Return TRUE if it's profitable to set block of memory for vector case. */
> +static bool
> +arm_block_set_vect_profit_p (unsigned HOST_WIDE_INT length,
> + unsigned HOST_WIDE_INT align ATTRIBUTE_UNUSED,
> + bool unaligned_p, enum machine_mode mode)
I'm not sure what you mean by unaligned here. Again, documenting the
arguments might help.
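For instance (my wording; adjust as appropriate):

/* Return TRUE if it's profitable to set a block of LENGTH bytes,
   with alignment ALIGN, using vector mode MODE.  UNALIGNED_P is
   TRUE if the block will be written with misaligned vector
   stores.  */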
> +{
> + int num;
> + unsigned int nelt = GET_MODE_NUNITS (mode);
> +
> + /* Num of instruction loading constant value. */
Use either "Number" or, in this case, simply drop that bit and write:
/* Instruction loading constant value. */
> + num = 1;
> + /* Num of store instructions. */
Likewise.
> + num += (length + nelt - 1) / nelt;
> + /* Num of address adjusting instructions. */
Can't we work on the premise that the address adjusting instructions
will be merged into the stores? I know you said that they currently do
not, but that's not a problem that this bit of code should have to worry
about.
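On that premise the unaligned count collapses to the constant load
plus the stores, i.e. (sketch):

num = 1 + (length + nelt - 1) / nelt;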
> + if (unaligned_p)
> + /* For unaligned case, it's one less than the store instructions. */
> + num += (length + nelt - 1) / nelt - 1;
> + else if ((length & 3) != 0)
> + /* For aligned case, it's one if bytes leftover can only be stored
> + by mis-aligned store instruction. */
> + num++;
> +
> + /* Store the first 16 bytes using vst1:v16qi for the aligned case. */
> + if (!unaligned_p && mode == V16QImode)
> + num--;
> +
> + return (num <= arm_block_set_max_insns ());
> +}
> +
> +/* Set a block of memory using vectorization instructions for the
> + unaligned case. We fill the first LENGTH bytes of the memory
> + area starting from DSTBASE with byte constant VALUE. ALIGN is
> + the alignment requirement of memory. */
What does the return value mean?
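From the code I'd guess it wants to say something like:

/* ...  Return TRUE if the operation was expanded, FALSE if it was
   abandoned as unprofitable.  */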
> +static bool
> +arm_block_set_unaligned_vect (rtx dstbase,
> + unsigned HOST_WIDE_INT length,
> + unsigned HOST_WIDE_INT value,
> + unsigned HOST_WIDE_INT align)
> +{
> + unsigned int i = 0, j = 0, nelt_v16, nelt_v8, nelt_mode;
Don't mix initialized declarations with uninitialized ones on the same
line.  Neither I nor J appears to be used before the loop controls
below, so why initialize them here?
> + rtx dst, mem;
> + rtx val_elt, val_vec, reg;
> + rtx rval[MAX_VECT_LEN];
> + rtx (*gen_func) (rtx, rtx);
> + enum machine_mode mode;
> + unsigned HOST_WIDE_INT v = value;
> +
> + gcc_assert ((align & 0x3) != 0);
> + nelt_v8 = GET_MODE_NUNITS (V8QImode);
> + nelt_v16 = GET_MODE_NUNITS (V16QImode);
> + if (length >= nelt_v16)
> + {
> + mode = V16QImode;
> + gen_func = gen_movmisalignv16qi;
> + }
> + else
> + {
> + mode = V8QImode;
> + gen_func = gen_movmisalignv8qi;
> + }
> + nelt_mode = GET_MODE_NUNITS (mode);
> + gcc_assert (length >= nelt_mode);
> + /* Skip if it isn't profitable. */
> + if (!arm_block_set_vect_profit_p (length, align, true, mode))
> + return false;
> +
> + dst = copy_addr_to_reg (XEXP (dstbase, 0));
> + mem = adjust_automodify_address (dstbase, mode, dst, 0);
> +
> + v = sext_hwi (v, BITS_PER_WORD);
> + val_elt = GEN_INT (v);
> + for (; j < nelt_mode; j++)
> + rval[j] = val_elt;
Is this the first use of J? If so, initialize it here.
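I.e.:

for (j = 0; j < nelt_mode; j++)
  rval[j] = val_elt;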
> +
> + reg = gen_reg_rtx (mode);
> + val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval));
> + /* Emit instruction loading the constant value. */
> + emit_move_insn (reg, val_vec);
> +
> + /* Handle nelt_mode bytes in a vector. */
> + for (; (i + nelt_mode <= length); i += nelt_mode)
Similarly for I.
> + {
> + emit_insn ((*gen_func) (mem, reg));
> + if (i + 2 * nelt_mode <= length)
> + emit_insn (gen_add2_insn (dst, GEN_INT (nelt_mode)));
> + }
> +
> + if (i + nelt_v8 <= length)
> + gcc_assert (mode == V16QImode);
Why not drop the if and write:
gcc_assert ((i + nelt_v8) > length || mode == V16QImode);
> +
> + /* Handle (8, 16) bytes leftover. */
> + if (i + nelt_v8 < length)
Your assertion above checked <=, but here you use <. Is that correct?
> + {
> + emit_insn (gen_add2_insn (dst, GEN_INT (length - i)));
> + /* We are shifting bytes back, set the alignment accordingly. */
> + if ((length & 1) != 0 && align >= 2)
> + set_mem_align (mem, BITS_PER_UNIT);
> +
> + emit_insn (gen_movmisalignv16qi (mem, reg));
> + }
> + /* Handle (0, 8] bytes leftover. */
> + else if (i < length && i + nelt_v8 >= length)
> + {
> + if (mode == V16QImode)
> + {
> + reg = gen_lowpart (V8QImode, reg);
> + mem = adjust_automodify_address (dstbase, V8QImode, dst, 0);
> + }
> + emit_insn (gen_add2_insn (dst, GEN_INT ((length - i)
> + + (nelt_mode - nelt_v8))));
> + /* We are shifting bytes back, set the alignment accordingly. */
> + if ((length & 1) != 0 && align >= 2)
> + set_mem_align (mem, BITS_PER_UNIT);
> +
> + emit_insn (gen_movmisalignv8qi (mem, reg));
> + }
> +
> + return true;
> +}
> +
> +/* Set a block of memory using vectorization instructions for the
> + aligned case. We fill the first LENGTH bytes of the memory area
> + starting from DSTBASE with byte constant VALUE. ALIGN is the
> + alignment requirement of memory. */
See all the comments above for the unaligned case.
> +static bool
> +arm_block_set_aligned_vect (rtx dstbase,
> + unsigned HOST_WIDE_INT length,
> + unsigned HOST_WIDE_INT value,
> + unsigned HOST_WIDE_INT align)
> +{
> + unsigned int i = 0, j = 0, nelt_v8, nelt_v16, nelt_mode;
> + rtx dst, addr, mem;
> + rtx val_elt, val_vec, reg;
> + rtx rval[MAX_VECT_LEN];
> + enum machine_mode mode;
> + unsigned HOST_WIDE_INT v = value;
> +
> + gcc_assert ((align & 0x3) == 0);
> + nelt_v8 = GET_MODE_NUNITS (V8QImode);
> + nelt_v16 = GET_MODE_NUNITS (V16QImode);
> + if (length >= nelt_v16 && unaligned_access && !BYTES_BIG_ENDIAN)
> + mode = V16QImode;
> + else
> + mode = V8QImode;
> +
> + nelt_mode = GET_MODE_NUNITS (mode);
> + gcc_assert (length >= nelt_mode);
> + /* Skip if it isn't profitable. */
> + if (!arm_block_set_vect_profit_p (length, align, false, mode))
> + return false;
> +
> + dst = copy_addr_to_reg (XEXP (dstbase, 0));
> +
> + v = sext_hwi (v, BITS_PER_WORD);
> + val_elt = GEN_INT (v);
> + for (; j < nelt_mode; j++)
> + rval[j] = val_elt;
> +
> + reg = gen_reg_rtx (mode);
> + val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval));
> + /* Emit instruction loading the constant value. */
> + emit_move_insn (reg, val_vec);
> +
> + /* Handle first 16 bytes specially using vst1:v16qi instruction. */
> + if (mode == V16QImode)
> + {
> + mem = adjust_automodify_address (dstbase, mode, dst, 0);
> + emit_insn (gen_movmisalignv16qi (mem, reg));
> + i += nelt_mode;
> + /* Handle (8, 16) bytes leftover using vst1:v16qi again. */
> + if (i + nelt_v8 < length && i + nelt_v16 > length)
> + {
> + emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
> + mem = adjust_automodify_address (dstbase, mode, dst, 0);
> + /* We are shifting bytes back, set the alignment accordingly. */
> + if ((length & 0x3) == 0)
> + set_mem_align (mem, BITS_PER_UNIT * 4);
> + else if ((length & 0x1) == 0)
> + set_mem_align (mem, BITS_PER_UNIT * 2);
> + else
> + set_mem_align (mem, BITS_PER_UNIT);
> +
> + emit_insn (gen_movmisalignv16qi (mem, reg));
> + return true;
> + }
> + /* Fall through for bytes leftover. */
> + mode = V8QImode;
> + nelt_mode = GET_MODE_NUNITS (mode);
> + reg = gen_lowpart (V8QImode, reg);
> + }
> +
> + /* Handle 8 bytes in a vector. */
> + for (; (i + nelt_mode <= length); i += nelt_mode)
> + {
> + addr = plus_constant (Pmode, dst, i);
> + mem = adjust_automodify_address (dstbase, mode, addr, i);
> + emit_move_insn (mem, reg);
> + }
> +
> + /* Handle single word leftover by shifting 4 bytes back. We can
> + use aligned access for this case. */
> + if (i + UNITS_PER_WORD == length)
> + {
> + addr = plus_constant (Pmode, dst, i - UNITS_PER_WORD);
> + mem = adjust_automodify_address (dstbase, mode,
> + addr, i - UNITS_PER_WORD);
> + /* We are shifting 4 bytes back, set the alignment accordingly. */
> + if (align > UNITS_PER_WORD)
> + set_mem_align (mem, BITS_PER_UNIT * UNITS_PER_WORD);
> +
> + emit_move_insn (mem, reg);
> + }
> + /* Handle (0, 4), (4, 8) bytes leftover by shifting bytes back.
> + We have to use unaligned access for this case. */
> + else if (i < length)
> + {
> + emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
> + mem = adjust_automodify_address (dstbase, mode, dst, 0);
> + /* We are shifting bytes back, set the alignment accordingly. */
> + if ((length & 1) == 0)
> + set_mem_align (mem, BITS_PER_UNIT * 2);
> + else
> + set_mem_align (mem, BITS_PER_UNIT);
> +
> + emit_insn (gen_movmisalignv8qi (mem, reg));
> + }
> +
> + return true;
> +}
> +
> +/* Set a block of memory using plain strh/strb instructions, only
> + using instructions allowed by ALIGN on processor. We fill the
> + first LENGTH bytes of the memory area starting from DSTBASE
> + with byte constant VALUE. ALIGN is the alignment requirement
> + of memory. */
> +static bool
> +arm_block_set_unaligned_straight (rtx dstbase,
> + unsigned HOST_WIDE_INT length,
> + unsigned HOST_WIDE_INT value,
> + unsigned HOST_WIDE_INT align)
> +{
> + unsigned int i;
> + rtx dst, addr, mem;
> + rtx val_exp, val_reg, reg;
> + enum machine_mode mode;
> + HOST_WIDE_INT v = value;
> +
> + gcc_assert (align == 1 || align == 2);
> +
> + if (align == 2)
> + v |= (value << BITS_PER_UNIT);
> +
> + v = sext_hwi (v, BITS_PER_WORD);
> + val_exp = GEN_INT (v);
> + /* Skip if it isn't profitable. */
> + if (!arm_block_set_straight_profit_p (val_exp, length,
> + align, true, false))
> + return false;
> +
> + dst = copy_addr_to_reg (XEXP (dstbase, 0));
> + mode = (align == 2 ? HImode : QImode);
> + val_reg = force_reg (SImode, val_exp);
> + reg = gen_lowpart (mode, val_reg);
> +
> + for (i = 0; (i + GET_MODE_SIZE (mode) <= length); i += GET_MODE_SIZE (mode))
> + {
> + addr = plus_constant (Pmode, dst, i);
> + mem = adjust_automodify_address (dstbase, mode, addr, i);
> + emit_move_insn (mem, reg);
> + }
> +
> + /* Handle single byte leftover. */
> + if (i + 1 == length)
> + {
> + reg = gen_lowpart (QImode, val_reg);
> + addr = plus_constant (Pmode, dst, i);
> + mem = adjust_automodify_address (dstbase, QImode, addr, i);
> + emit_move_insn (mem, reg);
> + i++;
> + }
> +
> + gcc_assert (i == length);
> + return true;
> +}
> +
> +/* Set a block of memory using plain strd/str/strh/strb instructions,
> + to permit unaligned copies on processors which support unaligned
> + semantics for those instructions. We fill the first LENGTH bytes
> + of the memory area starting from DSTBASE with byte constant VALUE.
> + ALIGN is the alignment requirement of memory. */
> +static bool
> +arm_block_set_aligned_straight (rtx dstbase,
> + unsigned HOST_WIDE_INT length,
> + unsigned HOST_WIDE_INT value,
> + unsigned HOST_WIDE_INT align)
> +{
> + unsigned int i = 0;
> + rtx dst, addr, mem;
> + rtx val_exp, val_reg, reg;
> + unsigned HOST_WIDE_INT v;
> + bool use_strd_p;
> +
> + use_strd_p = (length >= 2 * UNITS_PER_WORD && (align & 3) == 0
> + && TARGET_LDRD && current_tune->prefer_ldrd_strd);
> +
> + v = (value | (value << 8) | (value << 16) | (value << 24));
> + if (length < UNITS_PER_WORD)
> + v &= (0xFFFFFFFF >> (UNITS_PER_WORD - length) * BITS_PER_UNIT);
> +
> + if (use_strd_p)
> + v |= (v << BITS_PER_WORD);
> + else
> + v = sext_hwi (v, BITS_PER_WORD);
> +
> + val_exp = GEN_INT (v);
> + /* Skip if it isn't profitable. */
> + if (!arm_block_set_straight_profit_p (val_exp, length,
> + align, false, use_strd_p))
> + {
> + /* Try without strd. */
> + v = (v >> BITS_PER_WORD);
> + v = sext_hwi (v, BITS_PER_WORD);
> + val_exp = GEN_INT (v);
> + use_strd_p = false;
> + if (!arm_block_set_straight_profit_p (val_exp, length,
> + align, false, use_strd_p))
> + return false;
> + }
> +
> + dst = copy_addr_to_reg (XEXP (dstbase, 0));
> + /* Handle double words using strd if possible. */
> + if (use_strd_p)
> + {
> + val_reg = force_reg (DImode, val_exp);
> + reg = val_reg;
> + for (; (i + 8 <= length); i += 8)
> + {
> + addr = plus_constant (Pmode, dst, i);
> + mem = adjust_automodify_address (dstbase, DImode, addr, i);
> + emit_move_insn (mem, reg);
> + }
> + }
> + else
> + val_reg = force_reg (SImode, val_exp);
> +
> + /* Handle words. */
> + reg = (use_strd_p ? gen_lowpart (SImode, val_reg) : val_reg);
> + for (; (i + 4 <= length); i += 4)
> + {
> + addr = plus_constant (Pmode, dst, i);
> + mem = adjust_automodify_address (dstbase, SImode, addr, i);
> + if ((align & 3) == 0)
> + emit_move_insn (mem, reg);
> + else
> + emit_insn (gen_unaligned_storesi (mem, reg));
> + }
> +
> + /* Merge last pair of STRH and STRB into a STR if possible. */
> + if (unaligned_access && i > 0 && (i + 3) == length)
> + {
> + addr = plus_constant (Pmode, dst, i - 1);
> + mem = adjust_automodify_address (dstbase, SImode, addr, i - 1);
> + /* We are shifting one byte back, set the alignment accordingly. */
> + if ((align & 1) == 0)
> + set_mem_align (mem, BITS_PER_UNIT);
> +
> + /* Most likely this is an unaligned access, and we can't tell at
> + compilation time. */
> + emit_insn (gen_unaligned_storesi (mem, reg));
> + return true;
> + }
> +
> + /* Handle half word leftover. */
> + if (i + 2 <= length)
> + {
> + reg = gen_lowpart (HImode, val_reg);
> + addr = plus_constant (Pmode, dst, i);
> + mem = adjust_automodify_address (dstbase, HImode, addr, i);
> + if ((align & 1) == 0)
> + emit_move_insn (mem, reg);
> + else
> + emit_insn (gen_unaligned_storehi (mem, reg));
> +
> + i += 2;
> + }
> +
> + /* Handle single byte leftover. */
> + if (i + 1 == length)
> + {
> + reg = gen_lowpart (QImode, val_reg);
> + addr = plus_constant (Pmode, dst, i);
> + mem = adjust_automodify_address (dstbase, QImode, addr, i);
> + emit_move_insn (mem, reg);
> + }
> +
> + return true;
> +}
> +
> +/* Set a block of memory using vectorization instructions for both
> + aligned and unaligned cases. We fill the first LENGTH bytes of
> + the memory area starting from DSTBASE with byte constant VALUE.
> + ALIGN is the alignment requirement of memory. */
> +static bool
> +arm_block_set_vect (rtx dstbase,
> + unsigned HOST_WIDE_INT length,
> + unsigned HOST_WIDE_INT value,
> + unsigned HOST_WIDE_INT align)
> +{
> + /* Check whether we need to use unaligned store instruction. */
> + if (((align & 3) != 0 || (length & 3) != 0)
> + /* Check whether unaligned store instruction is available. */
> + && (!unaligned_access || BYTES_BIG_ENDIAN))
> + return false;
Huh! vst1.8 can work for unaligned accesses even when hw alignment
checking is strict.
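So the early-out could presumably be reduced to the big-endian check
alone, i.e. (a sketch; the movmisalign expanders' own guards would
need relaxing to match):

if (((align & 3) != 0 || (length & 3) != 0)
    && BYTES_BIG_ENDIAN)
  return false;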
> +
> + if ((align & 3) == 0)
> + return arm_block_set_aligned_vect (dstbase, length, value, align);
> + else
> + return arm_block_set_unaligned_vect (dstbase, length, value, align);
> +}
> +
> +/* Expand string store operation. Firstly we try to do that by using
> + vectorization instructions, then try with ARM unaligned access and
> + double-word store if profitable. OPERANDS[0] is the destination,
> + OPERANDS[1] is the number of bytes, operands[2] is the value to
> + initialize the memory, OPERANDS[3] is the known alignment of the
> + destination. */
> +bool
> +arm_gen_setmem (rtx *operands)
> +{
> + rtx dstbase = operands[0];
> + unsigned HOST_WIDE_INT length;
> + unsigned HOST_WIDE_INT value;
> + unsigned HOST_WIDE_INT align;
> +
> + if (!CONST_INT_P (operands[2]) || !CONST_INT_P (operands[1]))
> + return false;
> +
> + length = UINTVAL (operands[1]);
> + if (length > 64)
> + return false;
> +
> + value = (UINTVAL (operands[2]) & 0xFF);
> + align = UINTVAL (operands[3]);
> + if (TARGET_NEON && length >= 8
> + && current_tune->string_ops_prefer_neon
> + && arm_block_set_vect (dstbase, length, value, align))
> + return true;
> +
> + if (!unaligned_access && (align & 3) != 0)
> + return arm_block_set_unaligned_straight (dstbase, length, value, align);
> +
> + return arm_block_set_aligned_straight (dstbase, length, value, align);
> +}
> +
> /* Implement the TARGET_ASAN_SHADOW_OFFSET hook. */
>
> static unsigned HOST_WIDE_INT
> Index: gcc/config/arm/arm-protos.h
> ===================================================================
> --- gcc/config/arm/arm-protos.h (revision 209852)
> +++ gcc/config/arm/arm-protos.h (working copy)
> @@ -277,6 +277,8 @@ struct tune_params
> /* Prefer 32-bit encoding instead of 16-bit encoding where subset of flags
> would be set. */
> bool disparage_partial_flag_setting_t16_encodings;
> + /* Prefer to inline string operations like memset by using Neon. */
> + bool string_ops_prefer_neon;
> };
>
> extern const struct tune_params *current_tune;
> @@ -289,6 +291,7 @@ extern void arm_emit_coreregs_64bit_shift (enum rt
> extern bool arm_validize_comparison (rtx *, rtx *, rtx *);
> #endif /* RTX_CODE */
>
> +extern bool arm_gen_setmem (rtx *);
> extern void arm_expand_vec_perm (rtx target, rtx op0, rtx op1, rtx sel);
> extern bool arm_expand_vec_perm_const (rtx target, rtx op0, rtx op1, rtx sel);
>
> Index: gcc/config/arm/arm.md
> ===================================================================
> --- gcc/config/arm/arm.md (revision 209852)
> +++ gcc/config/arm/arm.md (working copy)
> @@ -7555,6 +7555,20 @@
> })
>
>
> +(define_expand "setmemsi"
> + [(match_operand:BLK 0 "general_operand" "")
> + (match_operand:SI 1 "const_int_operand" "")
> + (match_operand:SI 2 "const_int_operand" "")
> + (match_operand:SI 3 "const_int_operand" "")]
> + "TARGET_32BIT"
> +{
> + if (arm_gen_setmem (operands))
> + DONE;
> +
> + FAIL;
> +})
> +
> +
> ;; Move a block of memory if it is word aligned and MORE than 2 words long.
> ;; We could let this apply for blocks of less than this, but it clobbers so
> ;; many registers that there is then probably a better way.
> Index: gcc/testsuite/gcc.target/arm/memset-inline-6.c
> ===================================================================
> --- gcc/testsuite/gcc.target/arm/memset-inline-6.c (revision 0)
> +++ gcc/testsuite/gcc.target/arm/memset-inline-6.c (revision 0)
Have you tested these when the compiler was configured with
"--with-cpu=cortex-a9"?