public inbox for gcc-patches@gcc.gnu.org
* Re: Use of vector instructions in memmov/memset expanding
       [not found] <CANtU07-DAOMe9Nk4oYj3FJnkZqgkHvSnobsugeSfcRUzDChrrg@mail.gmail.com>
@ 2011-07-11 21:03 ` Michael Zolotukhin
  2011-07-11 21:09 ` Michael Zolotukhin
  1 sibling, 0 replies; 52+ messages in thread
From: Michael Zolotukhin @ 2011-07-11 21:03 UTC (permalink / raw)
  To: gcc-patches, Richard Guenther, H.J. Lu

Resending in plain text:

On 11 July 2011 23:50, Michael Zolotukhin
<michael.v.zolotukhin@gmail.com> wrote:
>
> The attached patch enables the use of vector instructions in memmove/memset expansion.
>
> A new move-mode selection algorithm is implemented for move_by_pieces and store_by_pieces.
> The x86-specific ix86_expand_movmem and ix86_expand_setmem are changed in a similar way, and the x86 cost-model parameters are slightly adjusted to support this. The implementation checks whether the array's alignment is known at compile time and chooses the expansion algorithm and move mode accordingly.
>
> Bootstrapped; the two new failures are due to incorrect tests (see http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49503). The new implementation gives a significant performance gain on memset/memcpy in some cases.
>
> A set of new tests is added to verify the implementation.
>
> Is it ok for trunk?
>
> ChangeLog:
>
> 2011-07-11  Zolotukhin Michael  <michael.v.zolotukhin@intel.com>
>
>     * config/i386/i386.h (processor_costs): Add second dimension to
>     stringop_algs array.
>     (clear_ratio): Tune value to improve performance.
>     * config/i386/i386.c (cost models): Initialize second dimension of
>     stringop_algs arrays.  Tune cost model in atom_cost, generic32_cost
>     and generic64_cost.
>     (ix86_expand_move): Add support for vector moves that use half of a
>     vector register.
>     (expand_set_or_movmem_via_loop_with_iter): New function.
>     (expand_set_or_movmem_via_loop): Enable reuse of the same iterator in
>     different loops produced by this function.
>     (emit_strset): New function.
>     (promote_duplicated_reg): Add support for vector modes, add
>     declaration.
>     (promote_duplicated_reg_to_size): Likewise.
>     (expand_movmem_epilogue): Add epilogue generation for bigger sizes.
>     (expand_setmem_epilogue): Likewise.
>     (expand_movmem_prologue): Likewise for prologue.
>     (expand_setmem_prologue): Likewise.
>     (expand_constant_movmem_prologue): Likewise.
>     (expand_constant_setmem_prologue): Likewise.
>     (decide_alg): Add new argument align_unknown.  Fix algorithm of
>     strategy selection if TARGET_INLINE_ALL_STRINGOPS is set.
>     (decide_alignment): Update desired alignment according to chosen move
>     mode.
>     (ix86_expand_movmem): Change unrolled_loop strategy to use SSE-moves.
>     (ix86_expand_setmem): Likewise.
>     (ix86_slow_unaligned_access): Implementation of new hook
>     slow_unaligned_access.
>     (ix86_promote_rtx_for_memset): Implementation of new hook
>     promote_rtx_for_memset.
>     * config/i386/sse.md (sse2_loadq): Add expand for sse2_loadq.
>     (vec_dupv4si): Add expand for vec_dupv4si.
>     (vec_dupv2di): Add expand for vec_dupv2di.
>     * emit-rtl.c (adjust_address_1): Improve algorithm for determining
>     alignment of address+offset.
>     (get_mem_align_offset): Add handling of MEM_REFs.
>     * expr.c (compute_align_by_offset): New function.
>     (move_by_pieces_insn): New function.
>     (widest_mode_for_unaligned_mov): New function.
>     (widest_mode_for_aligned_mov): New function.
>     (widest_int_mode_for_size): Change type of size from int to
>     HOST_WIDE_INT.
>     (set_by_pieces_1): New function (new algorithm for memset expansion).
>     (set_by_pieces_2): New function.
>     (generate_move_with_mode): New function for set_by_pieces.
>     (alignment_for_piecewise_move): Use hook slow_unaligned_access instead
>     of macro SLOW_UNALIGNED_ACCESS.
>     (emit_group_load_1): Likewise.
>     (emit_group_store): Likewise.
>     (emit_push_insn): Likewise.
>     (store_field): Likewise.
>     (expand_expr_real_1): Likewise.
>     (compute_aligned_cost): New function.
>     (compute_unaligned_cost): New function.
>     (vector_mode_for_mode): New function.
>     (vector_extensions_used_for_mode): New function.
>     (move_by_pieces): New algorithm for memmove expansion.
>     (move_by_pieces_ninsns): Update according to changes in
>     move_by_pieces.
>     (move_by_pieces_1): Remove as unused.
>     (store_by_pieces): New algorithm for memset expansion.
>     (clear_by_pieces): Likewise.
>     (store_by_pieces_1): Remove incorrect parameter attributes.
>     * expr.h (compute_align_by_offset): Add declaration.
>     * rtl.h (vector_extensions_used_for_mode): Add declaration.
>     * builtins.c (expand_builtin_memset_args): Update according to changes
>     in set_by_pieces.
>     * target.def (DEFHOOK): Add hook slow_unaligned_access and
>     promote_rtx_for_memset.
>     * targhooks.c (default_slow_unaligned_access): Add default hook
>     implementation.
>     (default_promote_rtx_for_memset): Likewise.
>     * targhooks.h (default_slow_unaligned_access): Add prototype.
>     (default_promote_rtx_for_memset): Likewise.
>     * cse.c (cse_insn): Stop forward propagation of vector constants.
>     * fwprop.c (forward_propagate_and_simplify): Likewise.
>     * doc/tm.texi (SLOW_UNALIGNED_ACCESS): Remove documentation for deleted
>     macro SLOW_UNALIGNED_ACCESS.
>     (TARGET_SLOW_UNALIGNED_ACCESS): Add documentation on new hook.
>     (TARGET_PROMOTE_RTX_FOR_MEMSET): Likewise.
>     * doc/tm.texi.in (SLOW_UNALIGNED_ACCESS): Likewise.
>     (TARGET_SLOW_UNALIGNED_ACCESS): Likewise.
>     (TARGET_PROMOTE_RTX_FOR_MEMSET): Likewise.
>
> 2011-07-11  Zolotukhin Michael  <michael.v.zolotukhin@intel.com>
>
>     * testsuite/gcc.target/i386/memset-s64-a0-1.c: New testcase.
>     * testsuite/gcc.target/i386/memset-s64-a0-2.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s768-a0-1.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s768-a0-2.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s16-a1-1.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s16-a1-1.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a0-3.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s64-a0-1.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a1-1.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s64-a1-1.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-au-1.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s64-au-1.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s512-a0-1.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s512-a0-1.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s512-a1-1.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s512-a1-1.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s512-au-1.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s512-au-1.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s3072-a1-1.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s3072-a1-1.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s3072-au-1.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s3072-au-1.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a0-4.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a0-5.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s768-a0-3.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s768-a0-4.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s16-a1-2.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s16-a1-2.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a0-6.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s64-a0-2.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a1-2.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s64-a1-2.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-au-2.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s64-au-2.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s512-a0-2.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s512-a0-2.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s512-a1-2.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s512-a1-2.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s512-au-2.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s512-au-2.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s3072-a1-2.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s3072-a1-2.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s3072-au-2.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s3072-au-2.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a0-7.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a0-8.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s768-a0-5.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s768-a0-6.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s16-a1-3.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s16-a1-3.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a0-9.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s64-a0-3.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a1-3.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s64-a1-3.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-au-3.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s64-au-3.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s512-a0-3.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s512-a0-3.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s512-a1-3.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s512-a1-3.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s512-au-3.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s512-au-3.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a0-10.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a0-11.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s768-a0-7.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s768-a0-8.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s16-a1-4.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s16-a1-4.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a0-12.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s64-a0-4.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a1-4.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s64-a1-4.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-au-4.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s64-au-4.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s512-a0-4.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s512-a0-4.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s512-a1-4.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s512-a1-4.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s512-au-4.c: Ditto.
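
The new tests themselves are not included in this message; judging by the
file names, they cover various sizes (-s16- through -s3072-) and alignment
cases, with -au- apparently marking the unknown-alignment variants.  As a
rough, purely illustrative sketch of the kind of known-alignment case such a
test might cover - the function name, buffer and DejaGnu directives below are
assumptions, not the actual test contents:

/* Hypothetical sketch, not one of the new tests.  */
/* { dg-do compile } */
/* { dg-options "-O2" } */

extern char dst[64] __attribute__ ((aligned (16)));

void
set64 (void)
{
  /* Size and alignment are known at compile time, so the expander can pick
     a wide (SSE) move mode instead of a byte loop or a library call.  */
  __builtin_memset (dst, 0, 64);
}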


* Re: Use of vector instructions in memmov/memset expanding
       [not found] <CANtU07-DAOMe9Nk4oYj3FJnkZqgkHvSnobsugeSfcRUzDChrrg@mail.gmail.com>
  2011-07-11 21:03 ` Use of vector instructions in memmov/memset expanding Michael Zolotukhin
@ 2011-07-11 21:09 ` Michael Zolotukhin
  2011-07-12  5:11   ` H.J. Lu
  2011-07-16  2:51   ` Jan Hubicka
  1 sibling, 2 replies; 52+ messages in thread
From: Michael Zolotukhin @ 2011-07-11 21:09 UTC (permalink / raw)
  To: gcc-patches, Richard Guenther, H.J. Lu

[-- Attachment #1: Type: text/plain, Size: 10407 bytes --]

Sorry for sending this once again - I forgot to attach the patch.

On 11 July 2011 23:50, Michael Zolotukhin
<michael.v.zolotukhin@gmail.com> wrote:
> The attached patch enables the use of vector instructions in memmove/memset
> expansion.
>
> A new move-mode selection algorithm is implemented for move_by_pieces
> and store_by_pieces.
> The x86-specific ix86_expand_movmem and ix86_expand_setmem are changed
> in a similar way, and the x86 cost-model parameters are slightly
> adjusted to support this. The implementation checks whether the array's
> alignment is known at compile time and chooses the expansion algorithm
> and move mode accordingly.
>
> Bootstrapped; the two new failures are due to incorrect tests (see
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49503). The new
> implementation gives a significant performance gain on memset/memcpy in
> some cases.
>
> A set of new tests is added to verify the implementation.
>
> Is it ok for trunk?
>
> ChangeLog:
>
> 2011-07-11  Zolotukhin Michael  <michael.v.zolotukhin@intel.com>
>
>     * config/i386/i386.h (processor_costs): Add second dimension to
>     stringop_algs array.
>     (clear_ratio): Tune value to improve performance.
>     * config/i386/i386.c (cost models): Initialize second dimension of
>     stringop_algs arrays.  Tune cost model in atom_cost, generic32_cost
>     and generic64_cost.
>     (ix86_expand_move): Add support for vector moves that use half of a
>     vector register.
>     (expand_set_or_movmem_via_loop_with_iter): New function.
>     (expand_set_or_movmem_via_loop): Enable reuse of the same iterator in
>     different loops produced by this function.
>     (emit_strset): New function.
>     (promote_duplicated_reg): Add support for vector modes, add
>     declaration.
>     (promote_duplicated_reg_to_size): Likewise.
>     (expand_movmem_epilogue): Add epilogue generation for bigger sizes.
>     (expand_setmem_epilogue): Likewise.
>     (expand_movmem_prologue): Likewise for prologue.
>     (expand_setmem_prologue): Likewise.
>     (expand_constant_movmem_prologue): Likewise.
>     (expand_constant_setmem_prologue): Likewise.
>     (decide_alg): Add new argument align_unknown.  Fix algorithm of
>     strategy selection if TARGET_INLINE_ALL_STRINGOPS is set.
>     (decide_alignment): Update desired alignment according to chosen move
>     mode.
>     (ix86_expand_movmem): Change unrolled_loop strategy to use SSE-moves.
>     (ix86_expand_setmem): Likewise.
>     (ix86_slow_unaligned_access): Implementation of new hook
>     slow_unaligned_access.
>     (ix86_promote_rtx_for_memset): Implementation of new hook
>     promote_rtx_for_memset.
>     * config/i386/sse.md (sse2_loadq): Add expand for sse2_loadq.
>     (vec_dupv4si): Add expand for vec_dupv4si.
>     (vec_dupv2di): Add expand for vec_dupv2di.
>     * emit-rtl.c (adjust_address_1): Improve algorithm for determining
>     alignment of address+offset.
>     (get_mem_align_offset): Add handling of MEM_REFs.
>     * expr.c (compute_align_by_offset): New function.
>     (move_by_pieces_insn): New function.
>     (widest_mode_for_unaligned_mov): New function.
>     (widest_mode_for_aligned_mov): New function.
>     (widest_int_mode_for_size): Change type of size from int to
>     HOST_WIDE_INT.
>     (set_by_pieces_1): New function (new algorithm for memset expansion).
>     (set_by_pieces_2): New function.
>     (generate_move_with_mode): New function for set_by_pieces.
>     (alignment_for_piecewise_move): Use hook slow_unaligned_access instead
>     of macro SLOW_UNALIGNED_ACCESS.
>     (emit_group_load_1): Likewise.
>     (emit_group_store): Likewise.
>     (emit_push_insn): Likewise.
>     (store_field): Likewise.
>     (expand_expr_real_1): Likewise.
>     (compute_aligned_cost): New function.
>     (compute_unaligned_cost): New function.
>     (vector_mode_for_mode): New function.
>     (vector_extensions_used_for_mode): New function.
>     (move_by_pieces): New algorithm for memmove expansion.
>     (move_by_pieces_ninsns): Update according to changes in
>     move_by_pieces.
>     (move_by_pieces_1): Remove as unused.
>     (store_by_pieces): New algorithm for memset expansion.
>     (clear_by_pieces): Likewise.
>     (store_by_pieces_1): Remove incorrect parameter attributes.
>     * expr.h (compute_align_by_offset): Add declaration.
>     * rtl.h (vector_extensions_used_for_mode): Add declaration.
>     * builtins.c (expand_builtin_memset_args): Update according to changes
>     in set_by_pieces.
>     * target.def (DEFHOOK): Add hook slow_unaligned_access and
>     promote_rtx_for_memset.
>     * targhooks.c (default_slow_unaligned_access): Add default hook
>     implementation.
>     (default_promote_rtx_for_memset): Likewise.
>     * targhooks.h (default_slow_unaligned_access): Add prototype.
>     (default_promote_rtx_for_memset): Likewise.
>     * cse.c (cse_insn): Stop forward propagation of vector constants.
>     * fwprop.c (forward_propagate_and_simplify): Likewise.
>     * doc/tm.texi (SLOW_UNALIGNED_ACCESS): Remove documentation for deleted
>     macro SLOW_UNALIGNED_ACCESS.
>     (TARGET_SLOW_UNALIGNED_ACCESS): Add documentation on new hook.
>     (TARGET_PROMOTE_RTX_FOR_MEMSET): Likewise.
>     * doc/tm.texi.in (SLOW_UNALIGNED_ACCESS): Likewise.
>     (TARGET_SLOW_UNALIGNED_ACCESS): Likewise.
>     (TARGET_PROMOTE_RTX_FOR_MEMSET): Likewise.
>
> 2011-07-11  Zolotukhin Michael  <michael.v.zolotukhin@intel.com>
>
>     * testsuite/gcc.target/i386/memset-s64-a0-1.c: New testcase.
>     * testsuite/gcc.target/i386/memset-s64-a0-2.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s768-a0-1.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s768-a0-2.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s16-a1-1.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s16-a1-1.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a0-3.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s64-a0-1.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a1-1.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s64-a1-1.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-au-1.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s64-au-1.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s512-a0-1.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s512-a0-1.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s512-a1-1.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s512-a1-1.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s512-au-1.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s512-au-1.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s3072-a1-1.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s3072-a1-1.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s3072-au-1.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s3072-au-1.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a0-4.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a0-5.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s768-a0-3.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s768-a0-4.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s16-a1-2.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s16-a1-2.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a0-6.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s64-a0-2.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a1-2.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s64-a1-2.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-au-2.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s64-au-2.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s512-a0-2.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s512-a0-2.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s512-a1-2.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s512-a1-2.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s512-au-2.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s512-au-2.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s3072-a1-2.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s3072-a1-2.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s3072-au-2.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s3072-au-2.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a0-7.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a0-8.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s768-a0-5.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s768-a0-6.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s16-a1-3.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s16-a1-3.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a0-9.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s64-a0-3.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a1-3.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s64-a1-3.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-au-3.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s64-au-3.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s512-a0-3.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s512-a0-3.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s512-a1-3.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s512-a1-3.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s512-au-3.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s512-au-3.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a0-10.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a0-11.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s768-a0-7.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s768-a0-8.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s16-a1-4.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s16-a1-4.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a0-12.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s64-a0-4.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-a1-4.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s64-a1-4.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s64-au-4.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s64-au-4.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s512-a0-4.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s512-a0-4.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s512-a1-4.c: Ditto.
>     * testsuite/gcc.target/i386/memcpy-s512-a1-4.c: Ditto.
>     * testsuite/gcc.target/i386/memset-s512-au-4.c: Ditto.
>
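
The attached patch adds a second dimension to the stringop_algs cost tables
so that a different strategy can be chosen depending on whether the alignment
is known at compile time (see the "Known alignment." / "Unknown alignment."
comments added to atom_cost below).  A minimal sketch of the two situations
this distinguishes - the function and variable names are illustrative only,
not taken from the patch or its tests:

extern char a[512] __attribute__ ((aligned (16)));
extern char b[512] __attribute__ ((aligned (16)));

/* The alignment of both operands is known at compile time, so the
   known-alignment entries apply and, on a target tuned for it, an SSE-based
   unrolled loop can be used up to the tuned size threshold.  */
void
copy_aligned (void)
{
  __builtin_memcpy (a, b, 512);
}

/* Only the size is known here; the pointers' alignment is not, so the
   unknown-alignment entries apply and a more conservative strategy
   (e.g. a library call) may be chosen instead.  */
void
copy_unaligned (char *dst, const char *src)
{
  __builtin_memcpy (dst, src, 512);
}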

[-- Attachment #2: memfunc.patch --]
[-- Type: application/octet-stream, Size: 154941 bytes --]

diff --git a/gcc/builtins.c b/gcc/builtins.c
index 1ee8cf8..40d6baa 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -3564,7 +3564,8 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
 				  builtin_memset_read_str, &c, dest_align,
 				  true))
 	store_by_pieces (dest_mem, tree_low_cst (len, 1),
-			 builtin_memset_read_str, &c, dest_align, true, 0);
+			 builtin_memset_read_str, gen_int_mode (c, val_mode),
+			 dest_align, true, 0);
       else if (!set_storage_via_setmem (dest_mem, len_rtx,
 					gen_int_mode (c, val_mode),
 					dest_align, expected_align,
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index a46101b..a4043f5 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
   COSTS_N_BYTES (2),			/* cost of FABS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FCHS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -632,10 +636,14 @@ struct processor_costs i386_cost = {	/* 386 specific costs */
   COSTS_N_INSNS (22),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (24),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (122),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -704,10 +712,14 @@ struct processor_costs i486_cost = {	/* 486 specific costs */
   COSTS_N_INSNS (3),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (3),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (83),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -774,10 +786,14 @@ struct processor_costs pentium_cost = {
   COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (70),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{-1, rep_prefix_4_byte}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -849,12 +865,18 @@ struct processor_costs pentiumpro_cost = {
      noticeable win, for bigger blocks either rep movsl or rep movsb is
      way to go.  Rep movsb has apparently more expensive startup time in CPU,
      but after 4K the difference is down in the noise.  */
-  {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
+  {{{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
 			{8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_4_byte, {{1024, unrolled_loop},
-  			{8192, rep_prefix_4_byte}, {-1, libcall}}},
+   {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_4_byte, {{1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_4_byte, {{1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -922,10 +944,14 @@ struct processor_costs geode_cost = {
   COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (54),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -995,10 +1021,14 @@ struct processor_costs k6_cost = {
   COSTS_N_INSNS (2),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (2),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (56),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1068,10 +1098,14 @@ struct processor_costs athlon_cost = {
   /* For some reason, Athlon deals better with REP prefix (relative to loops)
      compared to K8. Alignment becomes important after 8 bytes for memcpy and
      128 bytes for memset.  */
-  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1146,11 +1180,16 @@ struct processor_costs k8_cost = {
   /* K8 has optimized REP instruction for medium sized blocks, but for very
      small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1233,11 +1272,16 @@ struct processor_costs amdfam10_cost = {
   /* AMDFAM10 has optimized REP instruction for medium sized blocks, but for
      very small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1320,11 +1364,16 @@ struct processor_costs bdver1_cost = {
   /*  BDVER1 has optimized REP instruction for medium sized blocks, but for
       very small blocks it is better to use loop. For large blocks, libcall
       can do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   6,					/* scalar_stmt_cost.  */
   4,					/* scalar load_cost.  */
   4,					/* scalar_store_cost.  */
@@ -1402,11 +1451,16 @@ struct processor_costs btver1_cost = {
   /* BTVER1 has optimized REP instruction for medium sized blocks, but for
      very small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1473,11 +1527,18 @@ struct processor_costs pentium4_cost = {
   COSTS_N_INSNS (2),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (2),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (43),			/* cost of FSQRT instruction.  */
-  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+
+  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+
+  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
    {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1544,13 +1605,22 @@ struct processor_costs nocona_cost = {
   COSTS_N_INSNS (3),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (3),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (44),			/* cost of FSQRT instruction.  */
-  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+
+  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
    {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
 	      {100000, unrolled_loop}, {-1, libcall}}}},
-  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
+	      {100000, unrolled_loop}, {-1, libcall}}}}},
+
+  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
    {-1, libcall}}},
    {libcall, {{24, loop}, {64, unrolled_loop},
 	      {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {-1, libcall}}},
+   {libcall, {{24, loop}, {64, unrolled_loop},
+	      {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1617,13 +1687,20 @@ struct processor_costs atom_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{11, loop}, {-1, rep_prefix_4_byte}}},
-   {libcall, {{32, loop}, {64, rep_prefix_4_byte},
-	  {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {15, unrolled_loop},
-	  {2048, rep_prefix_4_byte}, {-1, libcall}}},
-   {libcall, {{24, loop}, {32, unrolled_loop},
-	  {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+
+  /* stringop_algs for memcpy.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
+    {libcall, {{4096, unrolled_loop}, {-1, libcall}}}},
+   {{libcall, {{-1, libcall}}},			       /* Unknown alignment.  */
+    {libcall, {{-1, libcall}}}}},
+
+  /* stringop_algs for memset.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
+    {libcall, {{4096, unrolled_loop}, {-1, libcall}}}},
+   {{libcall, {{1024, unrolled_loop},		       /* Unknown alignment.  */
+	       {-1, libcall}}},
+    {libcall, {{2048, unrolled_loop},
+	       {-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1697,10 +1774,16 @@ struct processor_costs generic64_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {DUMMY_STRINGOP_ALGS,
-   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {DUMMY_STRINGOP_ALGS,
-   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+
+  {{DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}},
+   {DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}}},
+
+  {{DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}},
+   {DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1769,10 +1852,16 @@ struct processor_costs generic32_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+  /* stringop_algs for memcpy.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  /* stringop_algs for memset.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -2451,6 +2540,7 @@ static void ix86_set_current_function (tree);
 static unsigned int ix86_minimum_incoming_stack_boundary (bool);
 
 static enum calling_abi ix86_function_abi (const_tree);
+static rtx promote_duplicated_reg (enum machine_mode, rtx);
 
 \f
 #ifndef SUBTARGET32_DEFAULT_CPU
@@ -14952,6 +15042,28 @@ ix86_expand_move (enum machine_mode mode, rtx operands[])
     }
   else
     {
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE2
+	  && MEM_P (op0)
+	  && MEM_P (op1)
+	  && !push_operand (op0, mode)
+	  && can_create_pseudo_p ())
+	{
+	  rtx temp = gen_reg_rtx (V2DImode);
+	  emit_insn (gen_sse2_loadq (temp, op1));
+	  emit_insn (gen_sse_storeq (op0, temp));
+	  return;
+	}
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE
+	  && !MEM_P (op1)
+	  && GET_MODE (op1) == V2DImode)
+	{
+	  emit_insn (gen_sse_storeq (op0, op1));
+	  return;
+	}
       if (MEM_P (op0)
 	  && (PUSH_ROUNDING (GET_MODE_SIZE (mode)) != GET_MODE_SIZE (mode)
 	      || !push_operand (op0, mode))
@@ -19470,22 +19582,17 @@ counter_mode (rtx count_exp)
   return SImode;
 }
 
-/* When SRCPTR is non-NULL, output simple loop to move memory
-   pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
-   overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
-   equivalent loop to set memory by VALUE (supposed to be in MODE).
-
-   The size is rounded down to whole number of chunk size moved at once.
-   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
-
-
-static void
-expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
-			       rtx destptr, rtx srcptr, rtx value,
-			       rtx count, enum machine_mode mode, int unroll,
-			       int expected_size)
+/* Helper function for expand_set_or_movmem_via_loop.
+   This function can reuse the iter rtx from another loop and does not
+   generate code for updating the addresses.  */
+static rtx
+expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
+					 rtx destptr, rtx srcptr, rtx value,
+					 rtx count, rtx iter,
+					 enum machine_mode mode, int unroll,
+					 int expected_size, bool change_ptrs)
 {
-  rtx out_label, top_label, iter, tmp;
+  rtx out_label, top_label, tmp;
   enum machine_mode iter_mode = counter_mode (count);
   rtx piece_size = GEN_INT (GET_MODE_SIZE (mode) * unroll);
   rtx piece_size_mask = GEN_INT (~((GET_MODE_SIZE (mode) * unroll) - 1));
@@ -19493,10 +19600,12 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
   rtx x_addr;
   rtx y_addr;
   int i;
+  bool reuse_iter = (iter != NULL_RTX);
 
   top_label = gen_label_rtx ();
   out_label = gen_label_rtx ();
-  iter = gen_reg_rtx (iter_mode);
+  if (!reuse_iter)
+    iter = gen_reg_rtx (iter_mode);
 
   size = expand_simple_binop (iter_mode, AND, count, piece_size_mask,
 			      NULL, 1, OPTAB_DIRECT);
@@ -19507,7 +19616,8 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
 			       true, out_label);
       predict_jump (REG_BR_PROB_BASE * 10 / 100);
     }
-  emit_move_insn (iter, const0_rtx);
+  if (!reuse_iter)
+    emit_move_insn (iter, const0_rtx);
 
   emit_label (top_label);
 
@@ -19590,19 +19700,43 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
     }
   else
     predict_jump (REG_BR_PROB_BASE * 80 / 100);
-  iter = ix86_zero_extend_to_Pmode (iter);
-  tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
-			     true, OPTAB_LIB_WIDEN);
-  if (tmp != destptr)
-    emit_move_insn (destptr, tmp);
-  if (srcptr)
+  if (change_ptrs)
     {
-      tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
+      iter = ix86_zero_extend_to_Pmode (iter);
+      tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
 				 true, OPTAB_LIB_WIDEN);
-      if (tmp != srcptr)
-	emit_move_insn (srcptr, tmp);
+      if (tmp != destptr)
+	emit_move_insn (destptr, tmp);
+      if (srcptr)
+	{
+	  tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
+				     true, OPTAB_LIB_WIDEN);
+	  if (tmp != srcptr)
+	    emit_move_insn (srcptr, tmp);
+	}
     }
   emit_label (out_label);
+  return iter;
+}
+
+/* When SRCPTR is non-NULL, output simple loop to move memory
+   pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
+   overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
+   equivalent loop to set memory by VALUE (supposed to be in MODE).
+
+   The size is rounded down to whole number of chunk size moved at once.
+   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
+
+static void
+expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
+			       rtx destptr, rtx srcptr, rtx value,
+			       rtx count, enum machine_mode mode, int unroll,
+			       int expected_size)
+{
+  expand_set_or_movmem_via_loop_with_iter (destmem, srcmem,
+				 destptr, srcptr, value,
+				 count, NULL_RTX, mode, unroll,
+				 expected_size, true);
 }
 
 /* Output "rep; mov" instruction.
@@ -19704,7 +19838,27 @@ emit_strmov (rtx destmem, rtx srcmem,
   emit_insn (gen_strmov (destptr, dest, srcptr, src));
 }
 
-/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST.  */
+/* Emit strset instruction.  If RHS is constant and a vector mode will be
+   used, move this constant to a vector register before emitting strset.  */
+static void
+emit_strset (rtx destmem, rtx value,
+	     rtx destptr, enum machine_mode mode, int offset)
+{
+  rtx dest = adjust_automodify_address_nv (destmem, mode, destptr, offset);
+  rtx vec_reg;
+  if (vector_extensions_used_for_mode (mode) && CONSTANT_P (value))
+    {
+      if (mode == DImode)
+	mode = TARGET_64BIT ? V2DImode : V4SImode;
+      vec_reg = gen_reg_rtx (mode);
+      emit_move_insn (vec_reg, value);
+      emit_insn (gen_strset (destptr, dest, vec_reg));
+    }
+  else
+    emit_insn (gen_strset (destptr, dest, value));
+}
+
+/* Output code to copy (count % max_size) bytes from SRC to DEST.  */
 static void
 expand_movmem_epilogue (rtx destmem, rtx srcmem,
 			rtx destptr, rtx srcptr, rtx count, int max_size)
@@ -19715,43 +19869,55 @@ expand_movmem_epilogue (rtx destmem, rtx srcmem,
       HOST_WIDE_INT countval = INTVAL (count);
       int offset = 0;
 
-      if ((countval & 0x10) && max_size > 16)
+      int remainder_size = countval % max_size;
+      enum machine_mode move_mode = Pmode;
+
+      /* Firstly, try to move data with the widest possible mode.
+	 Remaining part we'll move using Pmode and narrower modes.  */
+      if (TARGET_SSE)
 	{
-	  if (TARGET_64BIT)
-	    {
-	      emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
-	      emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset + 8);
-	    }
-	  else
-	    gcc_unreachable ();
-	  offset += 16;
+	  if (max_size >= GET_MODE_SIZE (V4SImode))
+	    move_mode = V4SImode;
+	  else if (max_size >= GET_MODE_SIZE (DImode))
+	    move_mode = DImode;
 	}
-      if ((countval & 0x08) && max_size > 8)
+
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-	  if (TARGET_64BIT)
-	    emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
-	  else
-	    {
-	      emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
-	      emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset + 4);
-	    }
-	  offset += 8;
+	  emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
 	}
-      if ((countval & 0x04) && max_size > 4)
+
+      /* Move the remaining part of epilogue - its size might be
+	 a size of the widest mode.  */
+      move_mode = Pmode;
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
+	}
+
+      if (remainder_size >= 4)
+	{
+	  emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
 	  offset += 4;
+	  remainder_size -= 4;
 	}
-      if ((countval & 0x02) && max_size > 2)
+      if (remainder_size >= 2)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
 	  offset += 2;
+	  remainder_size -= 2;
 	}
-      if ((countval & 0x01) && max_size > 1)
+      if (remainder_size >= 1)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
 	  offset += 1;
+	  remainder_size -= 1;
 	}
+      gcc_assert (remainder_size == 0);
       return;
     }
   if (max_size > 8)
@@ -19857,87 +20023,122 @@ expand_setmem_epilogue_via_loop (rtx destmem, rtx destptr, rtx value,
 				 1, max_size / 2);
 }
 
-/* Output code to set at most count & (max_size - 1) bytes starting by DEST.  */
+/* Output code to set at most count & (max_size - 1) bytes starting at
+   DESTMEM.  */
 static void
-expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_size)
+expand_setmem_epilogue (rtx destmem, rtx destptr, rtx promoted_to_vector_value,
+			rtx value, rtx count, int max_size)
 {
-  rtx dest;
-
   if (CONST_INT_P (count))
     {
       HOST_WIDE_INT countval = INTVAL (count);
       int offset = 0;
 
-      if ((countval & 0x10) && max_size > 16)
+      int remainder_size = countval % max_size;
+      enum machine_mode move_mode = Pmode;
+      enum machine_mode sse_mode = TARGET_64BIT ? V2DImode : V4SImode;
+      rtx promoted_value = NULL_RTX;
+
+      /* Firstly, try to move data with the widest possible mode.
+	 Remaining part we'll move using Pmode and narrower modes.  */
+      if (TARGET_SSE)
 	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset + 8);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    gcc_unreachable ();
-	  offset += 16;
+	  if (max_size >= GET_MODE_SIZE (sse_mode))
+	    move_mode = sse_mode;
+	  else if (max_size >= GET_MODE_SIZE (DImode))
+	    move_mode = DImode;
+	  if (!VECTOR_MODE_P (GET_MODE (promoted_to_vector_value)))
+	    promoted_to_vector_value = NULL_RTX;
 	}
-      if ((countval & 0x08) && max_size > 8)
+
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    {
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset + 4);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  offset += 8;
+	  if (GET_MODE (destmem) != move_mode)
+	    destmem = change_address (destmem, move_mode, destptr);
+	  if (!promoted_to_vector_value)
+	    promoted_to_vector_value =
+	      targetm.promote_rtx_for_memset (move_mode, value);
+	  emit_strset (destmem, promoted_to_vector_value, destptr,
+		       move_mode, offset);
+
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
 	}
-      if ((countval & 0x04) && max_size > 4)
+
+      /* Move the remaining part of epilogue - its size might be
+	 a size of the widest mode.  */
+      move_mode = Pmode;
+      promoted_value = NULL_RTX;
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
+	{
+	  if (!promoted_value)
+	    promoted_value = promote_duplicated_reg (move_mode, value);
+	  emit_strset (destmem, promoted_value, destptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
+	}
+
+      if (!promoted_value)
+	promoted_value = promote_duplicated_reg (move_mode, value);
+      if (remainder_size >= 4)
 	{
-	  dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
+	  emit_strset (destmem, gen_lowpart (SImode, promoted_value), destptr,
+		       SImode, offset);
 	  offset += 4;
+	  remainder_size -= 4;
 	}
-      if ((countval & 0x02) && max_size > 2)
+      if (remainder_size >= 2)
 	{
-	  dest = adjust_automodify_address_nv (destmem, HImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
-	  offset += 2;
+	  emit_strset (destmem, gen_lowpart (HImode, promoted_value), destptr,
+		       HImode, offset);
+	  offset +=2;
+	  remainder_size -= 2;
 	}
-      if ((countval & 0x01) && max_size > 1)
+      if (remainder_size >= 1)
 	{
-	  dest = adjust_automodify_address_nv (destmem, QImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
+	  emit_strset (destmem, gen_lowpart (QImode, promoted_value), destptr,
+		       QImode, offset);
 	  offset += 1;
+	  remainder_size -= 1;
 	}
+      gcc_assert (remainder_size == 0);
       return;
     }
+
+  /* count isn't const.  */
   if (max_size > 32)
     {
-      expand_setmem_epilogue_via_loop (destmem, destptr, value, count, max_size);
+      expand_setmem_epilogue_via_loop (destmem, destptr, value, count,
+				       max_size);
       return;
     }
+  /* If it turned out that we promoted the value to a non-vector register,
+     we can reuse it.  */
+  if (!VECTOR_MODE_P (GET_MODE (promoted_to_vector_value)))
+    value = promoted_to_vector_value;
+
   if (max_size > 16)
     {
       rtx label = ix86_expand_aligntest (count, 16, true);
       if (TARGET_64BIT)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
@@ -19947,14 +20148,17 @@ expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_
       rtx label = ix86_expand_aligntest (count, 8, true);
       if (TARGET_64BIT)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
@@ -19962,24 +20166,24 @@ expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_
   if (max_size > 4)
     {
       rtx label = ix86_expand_aligntest (count, 4, true);
-      dest = change_address (destmem, SImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
+      destmem = change_address (destmem, SImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
   if (max_size > 2)
     {
       rtx label = ix86_expand_aligntest (count, 2, true);
-      dest = change_address (destmem, HImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
+      destmem = change_address (destmem, HImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (HImode, value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
   if (max_size > 1)
     {
       rtx label = ix86_expand_aligntest (count, 1, true);
-      dest = change_address (destmem, QImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
+      destmem = change_address (destmem, QImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (QImode, value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
@@ -20022,7 +20226,27 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
-  gcc_assert (desired_alignment <= 8);
+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  srcmem = change_address (srcmem, DImode, srcptr);
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	}
+      else
+	{
+	  srcmem = change_address (srcmem, SImode, srcptr);
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	}
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);
 }
 
 /* Copy enough from DST to SRC to align DST known to DESIRED_ALIGN.
@@ -20078,6 +20302,37 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
       off = 4;
       emit_insn (gen_strmov (destreg, dst, srcreg, src));
     }
+  if (align_bytes & 8)
+    {
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, DImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      else
+	{
+	  dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, SImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
+	set_mem_align (dst, 8 * BITS_PER_UNIT);
+      if (src_align_bytes >= 0)
+	{
+	  unsigned int src_align = 0;
+	  if ((src_align_bytes & 7) == (align_bytes & 7))
+	    src_align = 8;
+	  else if ((src_align_bytes & 3) == (align_bytes & 3))
+	    src_align = 4;
+	  else if ((src_align_bytes & 1) == (align_bytes & 1))
+	    src_align = 2;
+	  if (MEM_ALIGN (src) < src_align * BITS_PER_UNIT)
+	    set_mem_align (src, src_align * BITS_PER_UNIT);
+	}
+      off = 8;
+    }
   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
   src = adjust_automodify_address_nv (src, BLKmode, srcreg, off);
   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
@@ -20137,7 +20392,17 @@ expand_setmem_prologue (rtx destmem, rtx destptr, rtx value, rtx count,
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
-  gcc_assert (desired_alignment <= 8);
+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      destmem = change_address (destmem, SImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);
 }
 
 /* Set enough from DST to align DST known to by aligned by ALIGN to
@@ -20173,6 +20438,19 @@ expand_constant_setmem_prologue (rtx dst, rtx destreg, rtx value,
       emit_insn (gen_strset (destreg, dst,
 			     gen_lowpart (SImode, value)));
     }
+  if (align_bytes & 8)
+    {
+      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+      emit_insn (gen_strset (destreg, dst,
+	    gen_lowpart (SImode, value)));
+      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+      emit_insn (gen_strset (destreg, dst,
+	    gen_lowpart (SImode, value)));
+
+      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
+	set_mem_align (dst, 8 * BITS_PER_UNIT);
+      off = 8;
+    }
   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
     set_mem_align (dst, desired_align * BITS_PER_UNIT);
@@ -20184,7 +20462,7 @@ expand_constant_setmem_prologue (rtx dst, rtx destreg, rtx value,
 /* Given COUNT and EXPECTED_SIZE, decide on codegen of string operation.  */
 static enum stringop_alg
 decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
-	    int *dynamic_check)
+	    int *dynamic_check, bool align_unknown)
 {
   const struct stringop_algs * algs;
   bool optimize_for_speed;
@@ -20193,7 +20471,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
      consider such algorithms if the user has appropriated those
      registers for their own purposes.	*/
   bool rep_prefix_usable = !(fixed_regs[CX_REG] || fixed_regs[DI_REG]
-                             || (memset
+			     || (memset
 				 ? fixed_regs[AX_REG] : fixed_regs[SI_REG]));
 
 #define ALG_USABLE_P(alg) (rep_prefix_usable			\
@@ -20206,7 +20484,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
      of time processing large blocks.  */
   if (optimize_function_for_size_p (cfun)
       || (optimize_insn_for_size_p ()
-          && expected_size != -1 && expected_size < 256))
+	  && expected_size != -1 && expected_size < 256))
     optimize_for_speed = false;
   else
     optimize_for_speed = true;
@@ -20215,9 +20493,9 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
 
   *dynamic_check = -1;
   if (memset)
-    algs = &cost->memset[TARGET_64BIT != 0];
+    algs = &cost->memset[align_unknown][TARGET_64BIT != 0];
   else
-    algs = &cost->memcpy[TARGET_64BIT != 0];
+    algs = &cost->memcpy[align_unknown][TARGET_64BIT != 0];
   if (ix86_stringop_alg != no_stringop && ALG_USABLE_P (ix86_stringop_alg))
     return ix86_stringop_alg;
   /* rep; movq or rep; movl is the smallest variant.  */
@@ -20281,29 +20559,33 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
       enum stringop_alg alg;
       int i;
       bool any_alg_usable_p = true;
+      bool only_libcall_fits = true;
 
       for (i = 0; i < MAX_STRINGOP_ALGS; i++)
-        {
-          enum stringop_alg candidate = algs->size[i].alg;
-          any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
+	{
+	  enum stringop_alg candidate = algs->size[i].alg;
+	  any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
 
-          if (candidate != libcall && candidate
-              && ALG_USABLE_P (candidate))
-              max = algs->size[i].max;
-        }
+	  if (candidate != libcall && candidate
+	      && ALG_USABLE_P (candidate))
+	    {
+	      max = algs->size[i].max;
+	      only_libcall_fits = false;
+	    }
+	}
       /* If there aren't any usable algorithms, then recursing on
-         smaller sizes isn't going to find anything.  Just return the
-         simple byte-at-a-time copy loop.  */
-      if (!any_alg_usable_p)
-        {
-          /* Pick something reasonable.  */
-          if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
-            *dynamic_check = 128;
-          return loop_1_byte;
-        }
+	 smaller sizes isn't going to find anything.  Just return the
+	 simple byte-at-a-time copy loop.  */
+      if (!any_alg_usable_p || only_libcall_fits)
+	{
+	  /* Pick something reasonable.  */
+	  if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
+	    *dynamic_check = 128;
+	  return loop_1_byte;
+	}
       if (max == -1)
 	max = 4096;
-      alg = decide_alg (count, max / 2, memset, dynamic_check);
+      alg = decide_alg (count, max / 2, memset, dynamic_check, align_unknown);
       gcc_assert (*dynamic_check == -1);
       gcc_assert (alg != libcall);
       if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
@@ -20327,9 +20609,11 @@ decide_alignment (int align,
       case no_stringop:
 	gcc_unreachable ();
       case loop:
-      case unrolled_loop:
 	desired_align = GET_MODE_SIZE (Pmode);
 	break;
+      case unrolled_loop:
+	desired_align = GET_MODE_SIZE (TARGET_SSE ? V4SImode : Pmode);
+	break;
       case rep_prefix_8_byte:
 	desired_align = 8;
 	break;
@@ -20417,6 +20701,11 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
   enum stringop_alg alg;
   int dynamic_check;
   bool need_zero_guard = false;
+  bool align_unknown;
+  int unroll_factor;
+  enum machine_mode move_mode;
+  rtx loop_iter = NULL_RTX;
+  int dst_offset, src_offset;
 
   if (CONST_INT_P (align_exp))
     align = INTVAL (align_exp);
@@ -20440,9 +20729,17 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 
   /* Step 0: Decide on preferred algorithm, desired alignment and
      size of chunks to be copied by main loop.  */
-
-  alg = decide_alg (count, expected_size, false, &dynamic_check);
+  dst_offset = get_mem_align_offset (dst, MOVE_MAX*BITS_PER_UNIT);
+  src_offset = get_mem_align_offset (src, MOVE_MAX*BITS_PER_UNIT);
+  align_unknown = (dst_offset < 0
+		   || src_offset < 0
+		   || src_offset != dst_offset);
+  alg = decide_alg (count, expected_size, false, &dynamic_check, align_unknown);
   desired_align = decide_alignment (align, alg, expected_size);
+  if (align_unknown)
+    desired_align = align;
+  unroll_factor = 1;
+  move_mode = Pmode;
 
   if (!TARGET_ALIGN_STRINGOPS)
     align = desired_align;
@@ -20461,11 +20758,16 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
       gcc_unreachable ();
     case loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode);
+      move_mode = Pmode;
+      unroll_factor = 1;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * (TARGET_64BIT ? 4 : 2);
+      /* Use SSE instructions, if possible.  */
+      move_mode = TARGET_SSE ? (align_unknown ? DImode : V4SImode) : Pmode;
+      unroll_factor = 4;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
       size_needed = 8;
@@ -20634,11 +20936,14 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 				     count_exp, Pmode, 1, expected_size);
       break;
     case unrolled_loop:
-      /* Unroll only by factor of 2 in 32bit mode, since we don't have enough
-	 registers for 4 temporaries anyway.  */
-      expand_set_or_movmem_via_loop (dst, src, destreg, srcreg, NULL,
-				     count_exp, Pmode, TARGET_64BIT ? 4 : 2,
-				     expected_size);
+      /* In some cases we want to use the same iterator in several adjacent
+	 loops, so here we save loop iterator rtx and don't update addresses.  */
+      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
+							   srcreg, NULL,
+							   count_exp, NULL_RTX,
+							   move_mode,
+							   unroll_factor,
+							   expected_size, false);
       break;
     case rep_prefix_8_byte:
       expand_movmem_via_rep_mov (dst, src, destreg, srcreg, count_exp,
@@ -20689,9 +20994,50 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
       LABEL_NUSES (label) = 1;
     }
 
+  /* We haven't updated the addresses yet, so do it now.
+     Also, if the epilogue is likely to be large, generate a (non-unrolled)
+     loop in it.  We do this only when the alignment is unknown, because then
+     the epilogue would have to move the data byte by byte, which is very
+     slow.  */
+  if (alg == unrolled_loop)
+    {
+      rtx tmp;
+      if (align_unknown && unroll_factor > 1)
+	{
+	  /* Reduce the epilogue size by emitting a non-unrolled loop here;
+	     otherwise, when the alignment is statically unknown, the epilogue
+	     would be done byte by byte, which may be very slow.  */
+	  rtx epilogue_loop_jump_around = gen_label_rtx ();
+	  rtx tmp = plus_constant (loop_iter, GET_MODE_SIZE (move_mode));
+	  emit_cmp_and_jump_insns (count_exp, tmp, LT, NULL_RTX,
+				   counter_mode (count_exp), true,
+				   epilogue_loop_jump_around);
+	  predict_jump (REG_BR_PROB_BASE * 10 / 100);
+	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
+	      srcreg, NULL, count_exp,
+	      loop_iter, move_mode, 1,
+	      expected_size, false);
+	  emit_label (epilogue_loop_jump_around);
+	  src = change_address (src, BLKmode, srcreg);
+	  dst = change_address (dst, BLKmode, destreg);
+	  epilogue_size_needed = GET_MODE_SIZE (move_mode);
+	}
+      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != destreg)
+	emit_move_insn (destreg, tmp);
+
+      tmp = expand_simple_binop (Pmode, PLUS, srcreg, loop_iter, srcreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != srcreg)
+	emit_move_insn (srcreg, tmp);
+    }
   if (count_exp != const0_rtx && epilogue_size_needed > 1)
-    expand_movmem_epilogue (dst, src, destreg, srcreg, count_exp,
-			    epilogue_size_needed);
+    {
+      expand_movmem_epilogue (dst, src, destreg, srcreg, count_exp,
+			      epilogue_size_needed);
+    }
+
   if (jump_around_label)
     emit_label (jump_around_label);
   return true;
@@ -20709,7 +21055,37 @@ promote_duplicated_reg (enum machine_mode mode, rtx val)
   rtx tmp;
   int nops = mode == DImode ? 3 : 2;
 
+  if (VECTOR_MODE_P (mode))
+    {
+      enum machine_mode inner = GET_MODE_INNER (mode);
+      rtx promoted_val, vec_reg;
+      if (CONST_INT_P (val))
+	return ix86_build_const_vector (mode, true, val);
+
+      promoted_val = promote_duplicated_reg (inner, val);
+      vec_reg = gen_reg_rtx (mode);
+      switch (mode)
+	{
+	case V2DImode:
+	  emit_insn (gen_vec_dupv2di (vec_reg, promoted_val));
+	  break;
+	case V4SImode:
+	  emit_insn (gen_vec_dupv4si (vec_reg, promoted_val));
+	  break;
+	default:
+	  gcc_unreachable ();
+	  break;
+	}
+
+      return vec_reg;
+    }
   gcc_assert (mode == SImode || mode == DImode);
+  if (mode == DImode && !TARGET_64BIT)
+    {
+      rtx vec_reg = promote_duplicated_reg (V4SImode, val);
+      vec_reg = convert_to_mode (V2DImode, vec_reg, 1);
+      return vec_reg;
+    }
   if (val == const0_rtx)
     return copy_to_mode_reg (mode, const0_rtx);
   if (CONST_INT_P (val))
@@ -20775,11 +21151,21 @@ promote_duplicated_reg (enum machine_mode mode, rtx val)
 static rtx
 promote_duplicated_reg_to_size (rtx val, int size_needed, int desired_align, int align)
 {
-  rtx promoted_val;
+  rtx promoted_val = NULL_RTX;
 
-  if (TARGET_64BIT
-      && (size_needed > 4 || (desired_align > align && desired_align > 4)))
-    promoted_val = promote_duplicated_reg (DImode, val);
+  if (size_needed > 8 || (desired_align > align && desired_align > 8))
+    {
+      gcc_assert (TARGET_SSE);
+      if (TARGET_64BIT)
+        promoted_val = promote_duplicated_reg (V2DImode, val);
+      else
+        promoted_val = promote_duplicated_reg (V4SImode, val);
+    }
+  else if (size_needed > 4 || (desired_align > align && desired_align > 4))
+    {
+      gcc_assert (TARGET_64BIT || TARGET_SSE);
+      promoted_val = promote_duplicated_reg (DImode, val);
+    }
   else if (size_needed > 2 || (desired_align > align && desired_align > 2))
     promoted_val = promote_duplicated_reg (SImode, val);
   else if (size_needed > 1 || (desired_align > align && desired_align > 1))
@@ -20805,12 +21191,17 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
   unsigned HOST_WIDE_INT count = 0;
   HOST_WIDE_INT expected_size = -1;
   int size_needed = 0, epilogue_size_needed;
+  int promote_size_needed = 0;
   int desired_align = 0, align_bytes = 0;
   enum stringop_alg alg;
   rtx promoted_val = NULL;
-  bool force_loopy_epilogue = false;
+  rtx vec_promoted_val = NULL;
   int dynamic_check;
   bool need_zero_guard = false;
+  bool align_unknown;
+  unsigned int unroll_factor;
+  enum machine_mode move_mode;
+  rtx loop_iter = NULL_RTX;
 
   if (CONST_INT_P (align_exp))
     align = INTVAL (align_exp);
@@ -20830,8 +21221,11 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
   /* Step 0: Decide on preferred algorithm, desired alignment and
      size of chunks to be copied by main loop.  */
 
-  alg = decide_alg (count, expected_size, true, &dynamic_check);
+  align_unknown = get_mem_align_offset (dst, BITS_PER_UNIT) < 0;
+  alg = decide_alg (count, expected_size, true, &dynamic_check, align_unknown);
   desired_align = decide_alignment (align, alg, expected_size);
+  unroll_factor = 1;
+  move_mode = Pmode;
 
   if (!TARGET_ALIGN_STRINGOPS)
     align = desired_align;
@@ -20849,11 +21243,21 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       gcc_unreachable ();
     case loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode);
+      move_mode = Pmode;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * 4;
+      /* Use SSE instructions, if possible.  */
+      move_mode = TARGET_SSE
+		  ? (TARGET_64BIT ? V2DImode : V4SImode)
+		  : Pmode;
+      unroll_factor = 1;
+      /* Select the maximal available unroll factor: 1, 2 or 4.  */
+      while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
+	     && unroll_factor < 4)
+	unroll_factor *= 2;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
       size_needed = 8;
@@ -20870,6 +21274,7 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       break;
     }
   epilogue_size_needed = size_needed;
+  promote_size_needed = GET_MODE_SIZE (Pmode);
 
   /* Step 1: Prologue guard.  */
 
@@ -20898,8 +21303,10 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
      main loop and epilogue (ie one load of the big constant in the
      front of all code.  */
   if (CONST_INT_P (val_exp))
-    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
-						   desired_align, align);
+    promoted_val = promote_duplicated_reg_to_size (val_exp,
+						   promote_size_needed,
+						   promote_size_needed,
+						   align);
   /* Ensure that alignment prologue won't copy past end of block.  */
   if (size_needed > 1 || (desired_align > 1 && desired_align > align))
     {
@@ -20908,12 +21315,6 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 	 Make sure it is power of 2.  */
       epilogue_size_needed = smallest_pow2_greater_than (epilogue_size_needed);
 
-      /* To improve performance of small blocks, we jump around the VAL
-	 promoting mode.  This mean that if the promoted VAL is not constant,
-	 we might not use it in the epilogue and have to use byte
-	 loop variant.  */
-      if (epilogue_size_needed > 2 && !promoted_val)
-        force_loopy_epilogue = true;
       if (count)
 	{
 	  if (count < (unsigned HOST_WIDE_INT)epilogue_size_needed)
@@ -20954,8 +21355,10 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 
   /* Do the expensive promotion once we branched off the small blocks.  */
   if (!promoted_val)
-    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
-						   desired_align, align);
+    promoted_val = promote_duplicated_reg_to_size (val_exp,
+						   promote_size_needed,
+						   promote_size_needed,
+						   align);
   gcc_assert (desired_align >= 1 && align >= 1);
 
   if (desired_align > align)
@@ -20978,6 +21381,8 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 						 desired_align, align_bytes);
 	  count_exp = plus_constant (count_exp, -align_bytes);
 	  count -= align_bytes;
+	  if (count < (unsigned HOST_WIDE_INT) size_needed)
+	    goto epilogue;
 	}
       if (need_zero_guard
 	  && (count < (unsigned HOST_WIDE_INT) size_needed
@@ -21019,7 +21424,7 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
     case no_stringop:
       gcc_unreachable ();
     case loop_1_byte:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
+      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, val_exp,
 				     count_exp, QImode, 1, expected_size);
       break;
     case loop:
@@ -21027,8 +21432,14 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 				     count_exp, Pmode, 1, expected_size);
       break;
     case unrolled_loop:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
-				     count_exp, Pmode, 4, expected_size);
+      vec_promoted_val =
+	promote_duplicated_reg_to_size (promoted_val,
+					GET_MODE_SIZE (move_mode),
+					desired_align, align);
+      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+				     NULL, vec_promoted_val, count_exp,
+				     NULL_RTX, move_mode, unroll_factor,
+				     expected_size, false);
       break;
     case rep_prefix_8_byte:
       expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
@@ -21072,15 +21483,36 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       LABEL_NUSES (label) = 1;
     }
  epilogue:
-  if (count_exp != const0_rtx && epilogue_size_needed > 1)
+  if (alg == unrolled_loop)
     {
-      if (force_loopy_epilogue)
-	expand_setmem_epilogue_via_loop (dst, destreg, val_exp, count_exp,
-					 epilogue_size_needed);
-      else
-	expand_setmem_epilogue (dst, destreg, promoted_val, count_exp,
-				epilogue_size_needed);
+      rtx tmp;
+      if (align_unknown && unroll_factor > 1)
+	{
+	  /* Reduce the epilogue size by emitting a non-unrolled loop here;
+	     otherwise, when the alignment is statically unknown, the epilogue
+	     would be done byte by byte, which may be very slow.  */
+	  rtx epilogue_loop_jump_around = gen_label_rtx ();
+	  rtx tmp = plus_constant (loop_iter, GET_MODE_SIZE (move_mode));
+	  emit_cmp_and_jump_insns (count_exp, tmp, LT, NULL_RTX,
+				   counter_mode (count_exp), true,
+				   epilogue_loop_jump_around);
+	  predict_jump (REG_BR_PROB_BASE * 10 / 100);
+	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+	      NULL, vec_promoted_val, count_exp,
+	      loop_iter, move_mode, 1,
+	      expected_size, false);
+	  emit_label (epilogue_loop_jump_around);
+	  dst = change_address (dst, BLKmode, destreg);
+	  epilogue_size_needed = GET_MODE_SIZE (move_mode);
+	}
+      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != destreg)
+	emit_move_insn (destreg, tmp);
     }
+  if (count_exp != const0_rtx && epilogue_size_needed > 1)
+    expand_setmem_epilogue (dst, destreg, promoted_val, val_exp, count_exp,
+			    epilogue_size_needed);
   if (jump_around_label)
     emit_label (jump_around_label);
   return true;
@@ -34598,6 +35030,87 @@ ix86_autovectorize_vector_sizes (void)
   return (TARGET_AVX && !TARGET_PREFER_AVX128) ? 32 | 16 : 0;
 }
 
+/* Target hook.  Prevent unaligned access to data in vector modes.  */
+
+static bool
+ix86_slow_unaligned_access (enum machine_mode mode,
+			    unsigned int align)
+{
+  if (TARGET_AVX)
+    {
+      if (GET_MODE_SIZE (mode) == 32)
+	{
+	  if (align <= 16)
+	    return (TARGET_AVX256_SPLIT_UNALIGNED_LOAD ||
+		    TARGET_AVX256_SPLIT_UNALIGNED_STORE);
+	  else
+	    return false;
+	}
+    }
+
+  if (GET_MODE_SIZE (mode) > 8)
+    {
+      return (! TARGET_SSE_UNALIGNED_LOAD_OPTIMAL &&
+	      ! TARGET_SSE_UNALIGNED_STORE_OPTIMAL);
+    }
+
+  return false;
+}
+
+/* Target hook.  Return an rtx of mode MODE containing the value VAL, which
+   is supposed to represent one byte, duplicated to fill the mode.  MODE can
+   be a vector mode.  Examples:
+   1) VAL = const_int (0xAB), mode = SImode:
+   the result is const_int (0xABABABAB).
+   2) if VAL is not constant, the result is VAL multiplied by
+   const_int (0x01010101) (for SImode).  */
+
+static rtx
+ix86_promote_rtx_for_memset (enum machine_mode mode  ATTRIBUTE_UNUSED,
+			      rtx val)
+{
+  enum machine_mode val_mode = GET_MODE (val);
+  gcc_assert (VALID_INT_MODE_P (val_mode) || val_mode == VOIDmode);
+
+  if (vector_extensions_used_for_mode (mode) && TARGET_SSE)
+    {
+      rtx promoted_val, vec_reg;
+      enum machine_mode vec_mode = TARGET_64BIT ? V2DImode : V4SImode;
+      if (CONST_INT_P (val))
+	{
+	  rtx const_vec;
+	  HOST_WIDE_INT int_val = (UINTVAL (val) & 0xFF)
+				   * (TARGET_64BIT
+				      ? 0x0101010101010101
+				      : 0x01010101);
+	  val = gen_int_mode (int_val, Pmode);
+	  vec_reg = gen_reg_rtx (vec_mode);
+	  const_vec = ix86_build_const_vector (vec_mode, true, val);
+	  if (mode != vec_mode)
+	    const_vec = convert_to_mode (vec_mode, const_vec, 1);
+	  emit_move_insn (vec_reg, const_vec);
+	  return vec_reg;
+	}
+      /* Else: val isn't const.  */
+      promoted_val = promote_duplicated_reg (Pmode, val);
+      vec_reg = gen_reg_rtx (vec_mode);
+      switch (vec_mode)
+	{
+	case V2DImode:
+	  emit_insn (gen_vec_dupv2di (vec_reg, promoted_val));
+	  break;
+	case V4SImode:
+	  emit_insn (gen_vec_dupv4si (vec_reg, promoted_val));
+	  break;
+	default:
+	  gcc_unreachable ();
+	  break;
+	}
+      return vec_reg;
+    }
+  return NULL_RTX;
+}
+
 /* Initialize the GCC target structure.  */
 #undef TARGET_RETURN_IN_MEMORY
 #define TARGET_RETURN_IN_MEMORY ix86_return_in_memory
@@ -34899,6 +35412,12 @@ ix86_autovectorize_vector_sizes (void)
 #undef TARGET_CONDITIONAL_REGISTER_USAGE
 #define TARGET_CONDITIONAL_REGISTER_USAGE ix86_conditional_register_usage
 
+#undef TARGET_SLOW_UNALIGNED_ACCESS
+#define TARGET_SLOW_UNALIGNED_ACCESS ix86_slow_unaligned_access
+
+#undef TARGET_PROMOTE_RTX_FOR_MEMSET
+#define TARGET_PROMOTE_RTX_FOR_MEMSET ix86_promote_rtx_for_memset
+
 #if TARGET_MACHO
 #undef TARGET_INIT_LIBFUNCS
 #define TARGET_INIT_LIBFUNCS darwin_rename_builtins
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 8cef4e7..cf6d092 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -156,8 +156,12 @@ struct processor_costs {
   const int fchs;		/* cost of FCHS instruction.  */
   const int fsqrt;		/* cost of FSQRT instruction.  */
 				/* Specify what algorithm
-				   to use for stringops on unknown size.  */
-  struct stringop_algs memcpy[2], memset[2];
+				   to use for stringops on unknown size.
+				   First index is used to specify whether
+				   alignment is known or not.
+				   Second - to specify whether 32 or 64 bits
+				   are used.  */
+  struct stringop_algs memcpy[2][2], memset[2][2];
   const int scalar_stmt_cost;   /* Cost of any scalar operation, excluding
 				   load and store.  */
   const int scalar_load_cost;   /* Cost of scalar load.  */
@@ -1712,7 +1716,7 @@ typedef struct ix86_args {
 /* If a clear memory operation would take CLEAR_RATIO or more simple
    move-instruction sequences, we will do a clrmem or libcall instead.  */
 
-#define CLEAR_RATIO(speed) ((speed) ? MIN (6, ix86_cost->move_ratio) : 2)
+#define CLEAR_RATIO(speed) ((speed) ? ix86_cost->move_ratio : 2)
 
 /* Define if shifts truncate the shift count which implies one can
    omit a sign-extension or zero-extension of a shift count.
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 7abee33..c2c8ef6 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -6345,6 +6345,13 @@
    (set_attr "prefix" "maybe_vex,maybe_vex,orig,orig,vex")
    (set_attr "mode" "TI,TI,V4SF,SF,SF")])
 
+(define_expand "sse2_loadq"
+ [(set (match_operand:V2DI 0 "register_operand")
+       (vec_concat:V2DI
+	 (match_operand:DI 1 "memory_operand")
+	 (const_int 0)))]
+  "!TARGET_64BIT && TARGET_SSE2")
+
 (define_insn_and_split "sse2_stored"
   [(set (match_operand:SI 0 "nonimmediate_operand" "=xm,r")
 	(vec_select:SI
@@ -6456,6 +6463,16 @@
    (set_attr "prefix" "maybe_vex,orig,vex,maybe_vex,orig,orig")
    (set_attr "mode" "V2SF,TI,TI,TI,V4SF,V2SF")])
 
+(define_expand "vec_dupv4si"
+  [(set (match_operand:V4SI 0 "register_operand" "")
+	(vec_duplicate:V4SI
+	  (match_operand:SI 1 "nonimmediate_operand" "")))]
+  "TARGET_SSE"
+{
+  if (!TARGET_AVX)
+    operands[1] = force_reg (V4SImode, operands[1]);
+})
+
 (define_insn "*vec_dupv4si_avx"
   [(set (match_operand:V4SI 0 "register_operand"     "=x,x")
 	(vec_duplicate:V4SI
@@ -6496,6 +6513,16 @@
    (set_attr "prefix" "orig,vex,maybe_vex")
    (set_attr "mode" "TI,TI,DF")])
 
+(define_expand "vec_dupv2di"
+  [(set (match_operand:V2DI 0 "register_operand" "")
+	(vec_duplicate:V2DI
+	  (match_operand:DI 1 "nonimmediate_operand" "")))]
+  "TARGET_SSE"
+{
+  if (!TARGET_AVX)
+    operands[1] = force_reg (V2DImode, operands[1]);
+})
+
 (define_insn "*vec_dupv2di"
   [(set (match_operand:V2DI 0 "register_operand" "=Y2,x")
 	(vec_duplicate:V2DI
diff --git a/gcc/cse.c b/gcc/cse.c
index a078329..9cf70ce 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -4614,7 +4614,10 @@ cse_insn (rtx insn)
 		 to fold switch statements when an ADDR_DIFF_VEC is used.  */
 	      || (GET_CODE (src_folded) == MINUS
 		  && GET_CODE (XEXP (src_folded, 0)) == LABEL_REF
-		  && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF)))
+		  && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF))
+	      /* Don't propagate vector constants, as for now no architecture
+		 supports vector immediates.  */
+	  && !vector_extensions_used_for_mode (mode))
 	src_const = src_folded, src_const_elt = elt;
       else if (src_const == 0 && src_eqv_here && CONSTANT_P (src_eqv_here))
 	src_const = src_eqv_here, src_const_elt = src_eqv_elt;
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index c0648a5..44e9947 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5747,6 +5747,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum machine_mode @var{mode}, unsigned int @var{align})
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options require building faster code, the hook can
+be used to prevent access to unaligned data in some set of modes even if the
+processor can perform the access without trapping.
+
+By default the hook returns the value of the @code{SLOW_UNALIGNED_ACCESS}
+macro if it is defined, and @code{STRICT_ALIGNMENT} otherwise.
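+
+As an illustration (the function name below is hypothetical), a port on which
+only under-aligned vector accesses are expensive might define the hook along
+these lines:
+
+@smallexample
+/* Hypothetical hook: only under-aligned vector accesses are slow.  */
+static bool
+example_slow_unaligned_access (enum machine_mode mode, unsigned int align)
+@{
+  return VECTOR_MODE_P (mode) && align < GET_MODE_ALIGNMENT (mode);
+@}
+@end smallexample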
+@end deftypefn
+
+@deftypefn {Target Hook} rtx TARGET_PROMOTE_RTX_FOR_MEMSET (enum machine_mode @var{mode}, rtx @var{val})
+This hook returns an rtx of mode @var{mode} containing the promoted value
+@var{val}, or @code{NULL} on failure.  The hook emits the instructions
+needed to promote @var{val} to mode @var{mode}.  If those instructions
+cannot be generated, the hook returns @code{NULL}.
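+
+As an illustration (the variable names are hypothetical), a caller in the
+middle end might use the hook like this:
+
+@smallexample
+/* Ask the target to replicate the byte in VAL across a V4SImode register.  */
+rtx promoted = targetm.promote_rtx_for_memset (V4SImode, val);
+if (promoted == NULL_RTX)
+  @{
+    /* The target could not promote VAL; fall back to scalar code.  */
+  @}
+@end smallexample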
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6219,23 +6245,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 3660d36..0e41fb4 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -5690,6 +5690,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@hook TARGET_SLOW_UNALIGNED_ACCESS
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options require building faster code, the hook can
+be used to prevent access to unaligned data in some set of modes even if the
+processor can perform the access without trapping.
+
+By default the hook returns the value of the @code{SLOW_UNALIGNED_ACCESS}
+macro if it is defined, and @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
+@hook TARGET_PROMOTE_RTX_FOR_MEMSET
+This hook returns an rtx of mode @var{mode} containing the promoted value
+@var{val}, or @code{NULL} on failure.  The hook emits the instructions
+needed to promote @var{val} to mode @var{mode}.  If those instructions
+cannot be generated, the hook returns @code{NULL}.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6162,23 +6188,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c
index c641b7e..18e1a8c 100644
--- a/gcc/emit-rtl.c
+++ b/gcc/emit-rtl.c
@@ -1504,6 +1504,11 @@ get_mem_align_offset (rtx mem, unsigned int align)
       if (TYPE_ALIGN (TREE_TYPE (expr)) < (unsigned int) align)
 	return -1;
     }
+  else if (TREE_CODE (expr) == MEM_REF)
+    {
+      if (MEM_ALIGN (mem) < (unsigned int) align)
+	return -1;
+    }
   else if (TREE_CODE (expr) == COMPONENT_REF)
     {
       while (1)
@@ -2059,9 +2064,14 @@ adjust_address_1 (rtx memref, enum machine_mode mode, HOST_WIDE_INT offset,
      lowest-order set bit in OFFSET, but don't change the alignment if OFFSET
      if zero.  */
   if (offset != 0)
-    memalign
-      = MIN (memalign,
-	     (unsigned HOST_WIDE_INT) (offset & -offset) * BITS_PER_UNIT);
+    {
+      int old_offset = get_mem_align_offset (memref, MOVE_MAX*BITS_PER_UNIT);
+      if (old_offset >= 0)
+	memalign = compute_align_by_offset (old_offset + offset);
+      else
+	memalign = MIN (memalign,
+	      (unsigned HOST_WIDE_INT) (offset & -offset) * BITS_PER_UNIT);
+    }
 
   /* We can compute the size in a number of ways.  */
   if (GET_MODE (new_rtx) != BLKmode)
diff --git a/gcc/expr.c b/gcc/expr.c
index fb4379f..410779a 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -125,15 +125,18 @@ struct store_by_pieces_d
 static unsigned HOST_WIDE_INT move_by_pieces_ninsns (unsigned HOST_WIDE_INT,
 						     unsigned int,
 						     unsigned int);
-static void move_by_pieces_1 (rtx (*) (rtx, ...), enum machine_mode,
-			      struct move_by_pieces_d *);
+static void move_by_pieces_insn (rtx (*) (rtx, ...), enum machine_mode,
+		  struct move_by_pieces_d *);
 static bool block_move_libcall_safe_for_call_parm (void);
 static bool emit_block_move_via_movmem (rtx, rtx, rtx, unsigned, unsigned, HOST_WIDE_INT);
 static tree emit_block_move_libcall_fn (int);
 static void emit_block_move_via_loop (rtx, rtx, rtx, unsigned);
 static rtx clear_by_pieces_1 (void *, HOST_WIDE_INT, enum machine_mode);
 static void clear_by_pieces (rtx, unsigned HOST_WIDE_INT, unsigned int);
+static void set_by_pieces_1 (struct store_by_pieces_d *, unsigned int);
 static void store_by_pieces_1 (struct store_by_pieces_d *, unsigned int);
+static void set_by_pieces_2 (rtx (*) (rtx, ...), enum machine_mode,
+			       struct store_by_pieces_d *, rtx);
 static void store_by_pieces_2 (rtx (*) (rtx, ...), enum machine_mode,
 			       struct store_by_pieces_d *);
 static tree clear_storage_libcall_fn (int);
@@ -160,6 +163,12 @@ static void do_tablejump (rtx, enum machine_mode, rtx, rtx, rtx);
 static rtx const_vector_from_tree (tree);
 static void write_complex_part (rtx, rtx, bool);
 
+static enum machine_mode widest_mode_for_unaligned_mov (unsigned HOST_WIDE_INT);
+static enum machine_mode widest_mode_for_aligned_mov (unsigned HOST_WIDE_INT,
+						      unsigned int);
+static enum machine_mode generate_move_with_mode (struct store_by_pieces_d *,
+					   enum machine_mode, rtx *, rtx *);
+
 /* This macro is used to determine whether move_by_pieces should be called
    to perform a structure copy.  */
 #ifndef MOVE_BY_PIECES_P
@@ -808,7 +817,7 @@ alignment_for_piecewise_move (unsigned int max_pieces, unsigned int align)
 	   tmode != VOIDmode;
 	   xmode = tmode, tmode = GET_MODE_WIDER_MODE (tmode))
 	if (GET_MODE_SIZE (tmode) > max_pieces
-	    || SLOW_UNALIGNED_ACCESS (tmode, align))
+	    || targetm.slow_unaligned_access (tmode, align))
 	  break;
 
       align = MAX (align, GET_MODE_ALIGNMENT (xmode));
@@ -817,11 +826,66 @@ alignment_for_piecewise_move (unsigned int max_pieces, unsigned int align)
   return align;
 }
 
+/* Given an offset from an alignment boundary, compute the maximal
+   alignment of the data located at that offset.  */
+unsigned int
+compute_align_by_offset (int offset)
+{
+  return (offset == 0
+	  ? MOVE_MAX * BITS_PER_UNIT
+	  : MIN (MOVE_MAX, (offset & -offset)) * BITS_PER_UNIT);
+}
+
+/* Estimate the cost of a move of the given size and offset.  The offset is
+   used to determine the maximal alignment.  */
+static int
+compute_aligned_cost (unsigned HOST_WIDE_INT size, int offset)
+{
+  unsigned HOST_WIDE_INT cost = 0;
+  int cur_off = offset;
+
+  while (size > 0)
+    {
+      enum machine_mode mode = widest_mode_for_aligned_mov (size,
+	  compute_align_by_offset (cur_off));
+      int cur_mode_cost;
+      enum vect_cost_for_stmt type_of_cost = vector_load;
+      if (GET_MODE_SIZE (mode) <= UNITS_PER_WORD
+	  && (SCALAR_INT_MODE_P (mode) || SCALAR_FLOAT_MODE_P (mode)))
+	type_of_cost = scalar_load;
+      cur_mode_cost =
+	targetm.vectorize.builtin_vectorization_cost (type_of_cost, NULL, 0);
+      size -= GET_MODE_SIZE (mode);
+      cur_off += GET_MODE_SIZE (mode);
+      cost += cur_mode_cost;
+    }
+  return cost;
+}
+
+/* Estimate the cost of a move of the given size.  The alignment is assumed
+   to be unknown, so unaligned moves have to be used.  */
+static int
+compute_unaligned_cost (unsigned HOST_WIDE_INT size)
+{
+  unsigned HOST_WIDE_INT cost = 0;
+  while (size > 0)
+    {
+      enum machine_mode mode = widest_mode_for_unaligned_mov (size);
+      unsigned HOST_WIDE_INT n_insns = size/GET_MODE_SIZE (mode);
+      int cur_mode_cost =
+	targetm.vectorize.builtin_vectorization_cost (unaligned_load, NULL, 0);
+
+      cost += n_insns*cur_mode_cost;
+      size %= GET_MODE_SIZE (mode);
+    }
+  return cost;
+}
+
 /* Return the widest integer mode no wider than SIZE.  If no such mode
    can be found, return VOIDmode.  */
 
 static enum machine_mode
-widest_int_mode_for_size (unsigned int size)
+widest_int_mode_for_size (unsigned HOST_WIDE_INT size)
 {
   enum machine_mode tmode, mode = VOIDmode;
 
@@ -833,6 +897,170 @@ widest_int_mode_for_size (unsigned int size)
   return mode;
 }
 
+/* If MODE is a scalar mode, find the corresponding preferred vector mode.
+   If no such mode can be found, return the vector mode corresponding to
+   Pmode (a kind of default vector mode).
+   For vector modes, return the mode itself.  */
+
+static enum machine_mode
+vector_mode_for_mode (enum machine_mode mode)
+{
+  enum machine_mode xmode;
+  if (VECTOR_MODE_P (mode))
+    return mode;
+  xmode = targetm.vectorize.preferred_simd_mode (mode);
+  if (VECTOR_MODE_P (xmode))
+    return xmode;
+
+  return targetm.vectorize.preferred_simd_mode (Pmode);
+}
+
+/* Check whether vector instructions are required for operating on the
+   specified mode.
+   For vector modes, check whether the corresponding vector extension is
+   supported.
+   Operations on a scalar mode use vector extensions if that scalar mode
+   is wider than the default scalar mode (Pmode) and the vector extension
+   for the parent vector mode is available.  */
+
+bool vector_extensions_used_for_mode (enum machine_mode mode)
+{
+  enum machine_mode vector_mode = vector_mode_for_mode (mode);
+
+  if (VECTOR_MODE_P (mode))
+    return targetm.vector_mode_supported_p (mode);
+
+  /* mode is a scalar mode.  */
+  if (VECTOR_MODE_P (vector_mode)
+     && targetm.vector_mode_supported_p (vector_mode)
+     && (GET_MODE_SIZE (mode) > GET_MODE_SIZE (Pmode)))
+    return true;
+
+  return false;
+}
+
+/* Find the widest move mode for the given size if alignment is unknown.  */
+static enum machine_mode
+widest_mode_for_unaligned_mov (unsigned HOST_WIDE_INT size)
+{
+  enum machine_mode mode;
+  enum machine_mode tmode, xmode;
+  enum machine_mode best_simd_mode = targetm.vectorize.preferred_simd_mode (
+      mode_for_size (UNITS_PER_WORD*BITS_PER_UNIT, MODE_INT, 0));
+
+  /* Find the widest integer mode.  Here we can find modes wider than Pmode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_INT), xmode = VOIDmode;
+       tmode != VOIDmode;
+       tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size
+	  || targetm.slow_unaligned_access (tmode, BITS_PER_UNIT))
+	break;
+      if (optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.scalar_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+  mode = xmode;
+
+  /* Find the widest vector mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_VECTOR_INT), xmode = VOIDmode;
+       tmode != VOIDmode;
+       tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size
+	  || targetm.slow_unaligned_access (tmode, BITS_PER_UNIT))
+	break;
+      if (GET_MODE_SIZE (GET_MODE_INNER (tmode)) == UNITS_PER_WORD
+	  && optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.vector_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+
+  /* Choose between integer and vector modes.  */
+  if (xmode != VOIDmode && GET_MODE_SIZE (xmode) > GET_MODE_SIZE (mode))
+    mode = xmode;
+
+  /* If the vector and scalar modes found have the same size, and the vector
+     mode is best_simd_mode, prefer the vector mode to the scalar one.  */
+  if (xmode != VOIDmode
+      && GET_MODE_SIZE (xmode) == GET_MODE_SIZE (mode)
+      && xmode == best_simd_mode)
+    mode = xmode;
+
+  /* If we failed to find a mode that might use vector extensions, try to
+     find widest ordinary integer mode.  */
+  if (mode == VOIDmode)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  /* If the mode found won't use vector extensions, there is no need to use a
+     mode wider than Pmode.  */
+  if (!vector_extensions_used_for_mode (mode)
+      && GET_MODE_SIZE (mode) > MOVE_MAX_PIECES)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  return mode;
+}
+
+/* Find the widest move mode for the given size and alignment.  */
+static enum machine_mode
+widest_mode_for_aligned_mov (unsigned HOST_WIDE_INT size, unsigned int align)
+{
+  enum machine_mode mode;
+  enum machine_mode tmode, xmode;
+  enum machine_mode best_simd_mode = targetm.vectorize.preferred_simd_mode (
+      mode_for_size (UNITS_PER_WORD*BITS_PER_UNIT, MODE_INT, 0));
+
+  /* Find the widest integer mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_INT), xmode = VOIDmode;
+      tmode != VOIDmode;
+      tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size || GET_MODE_ALIGNMENT (tmode) > align)
+	break;
+      if (optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.scalar_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+  mode = xmode;
+
+  /* Find the widest vector mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_VECTOR_INT), xmode = VOIDmode;
+      tmode != VOIDmode;
+      tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size || GET_MODE_ALIGNMENT (tmode) > align)
+	break;
+      if (GET_MODE_SIZE (GET_MODE_INNER (tmode)) == UNITS_PER_WORD &&
+	  optab_handler (mov_optab, tmode) != CODE_FOR_nothing     &&
+	  targetm.vector_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+
+  /* Choose between integer and vector modes.  */
+  if (xmode != VOIDmode && GET_MODE_SIZE (xmode) > GET_MODE_SIZE (mode))
+    mode = xmode;
+
+  /* If the vector and scalar modes found have the same size, and the vector
+     mode is best_simd_mode, prefer the vector mode to the scalar one.  */
+  if (xmode != VOIDmode
+      && GET_MODE_SIZE (xmode) == GET_MODE_SIZE (mode)
+      && xmode == best_simd_mode)
+    mode = xmode;
+
+  /* If we failed to find a mode that might use vector extensions, try to
+     find widest ordinary integer mode.  */
+  if (mode == VOIDmode)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  /* If the mode found won't use vector extensions, there is no need to use a
+     mode wider than Pmode.  */
+  if (!vector_extensions_used_for_mode (mode)
+      && GET_MODE_SIZE (mode) > MOVE_MAX_PIECES)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  return mode;
+}
+
 /* STORE_MAX_PIECES is the number of bytes at a time that we can
    store efficiently.  Due to internal GCC limitations, this is
    MOVE_MAX_PIECES limited by the number of bytes GCC can represent
@@ -873,6 +1101,7 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
   rtx to_addr, from_addr = XEXP (from, 0);
   unsigned int max_size = MOVE_MAX_PIECES + 1;
   enum insn_code icode;
+  int dst_offset, src_offset;
 
   align = MIN (to ? MEM_ALIGN (to) : align, MEM_ALIGN (from));
 
@@ -957,23 +1186,37 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
 	data.to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
     }
 
-  align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
-
-  /* First move what we can in the largest integer mode, then go to
-     successively smaller modes.  */
-
-  while (max_size > 1)
+  src_offset = get_mem_align_offset (from, MOVE_MAX*BITS_PER_UNIT);
+  dst_offset = get_mem_align_offset (to, MOVE_MAX*BITS_PER_UNIT);
+  if (src_offset < 0
+      || dst_offset < 0
+      || src_offset != dst_offset
+      || compute_aligned_cost (data.len, src_offset) >=
+	 compute_unaligned_cost (data.len))
     {
-      enum machine_mode mode = widest_int_mode_for_size (max_size);
-
-      if (mode == VOIDmode)
-	break;
+      while (data.len > 0)
+	{
+	  enum machine_mode mode = widest_mode_for_unaligned_mov (data.len);
 
-      icode = optab_handler (mov_optab, mode);
-      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
-	move_by_pieces_1 (GEN_FCN (icode), mode, &data);
+	  icode = optab_handler (mov_optab, mode);
+	  gcc_assert (icode != CODE_FOR_nothing);
+	  move_by_pieces_insn (GEN_FCN (icode), mode, &data);
+	}
+    }
+  else
+    {
+      while (data.len > 0)
+	{
+	  enum machine_mode mode;
+	  mode = widest_mode_for_aligned_mov (data.len,
+	      compute_align_by_offset (src_offset));
 
-      max_size = GET_MODE_SIZE (mode);
+	  icode = optab_handler (mov_optab, mode);
+	  gcc_assert (icode != CODE_FOR_nothing &&
+	      compute_align_by_offset (src_offset) >= GET_MODE_ALIGNMENT (mode));
+	  move_by_pieces_insn (GEN_FCN (icode), mode, &data);
+	  src_offset += GET_MODE_SIZE (mode);
+	}
     }
 
   /* The code above should have handled everything.  */
@@ -1011,35 +1254,47 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
 }
 
 /* Return number of insns required to move L bytes by pieces.
-   ALIGN (in bits) is maximum alignment we can assume.  */
+   ALIGN (in bits) is maximum alignment we can assume.
+   This is only an estimate, so the actual number of instructions might
+   differ from it (there are several ways of expanding memmove).  */
 
 static unsigned HOST_WIDE_INT
 move_by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align,
-		       unsigned int max_size)
+		       unsigned int max_size ATTRIBUTE_UNUSED)
 {
   unsigned HOST_WIDE_INT n_insns = 0;
-
-  align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
-
-  while (max_size > 1)
+  unsigned HOST_WIDE_INT n_insns_u = 0;
+  enum machine_mode mode;
+  unsigned HOST_WIDE_INT len = l;
+  while (len > 0)
     {
-      enum machine_mode mode;
-      enum insn_code icode;
-
-      mode = widest_int_mode_for_size (max_size);
-
-      if (mode == VOIDmode)
-	break;
+      mode = widest_mode_for_aligned_mov (len, align);
+      if (GET_MODE_SIZE (mode) < MOVE_MAX)
+	{
+	  align += GET_MODE_ALIGNMENT (mode);
+	  len -= GET_MODE_SIZE (mode);
+	  n_insns ++;
+	}
+      else
+	{
+	  /* We are using the widest mode.  */
+	  n_insns += len/GET_MODE_SIZE (mode);
+	  len = len%GET_MODE_SIZE (mode);
+	}
+    }
+  gcc_assert (!len);
 
-      icode = optab_handler (mov_optab, mode);
-      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
-	n_insns += l / GET_MODE_SIZE (mode), l %= GET_MODE_SIZE (mode);
+  len = l;
+  while (len > 0)
+    {
+      mode = widest_mode_for_unaligned_mov (len);
+      n_insns_u += len/GET_MODE_SIZE (mode);
+      len = len%GET_MODE_SIZE (mode);
 
-      max_size = GET_MODE_SIZE (mode);
     }
 
-  gcc_assert (!l);
-  return n_insns;
+  gcc_assert (!len);
+  return MIN (n_insns, n_insns_u);
 }
 
 /* Subroutine of move_by_pieces.  Move as many bytes as appropriate
@@ -1047,60 +1302,57 @@ move_by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align,
    to make a move insn for that mode.  DATA has all the other info.  */
 
 static void
-move_by_pieces_1 (rtx (*genfun) (rtx, ...), enum machine_mode mode,
+move_by_pieces_insn (rtx (*genfun) (rtx, ...), enum machine_mode mode,
 		  struct move_by_pieces_d *data)
 {
   unsigned int size = GET_MODE_SIZE (mode);
   rtx to1 = NULL_RTX, from1;
 
-  while (data->len >= size)
-    {
-      if (data->reverse)
-	data->offset -= size;
-
-      if (data->to)
-	{
-	  if (data->autinc_to)
-	    to1 = adjust_automodify_address (data->to, mode, data->to_addr,
-					     data->offset);
-	  else
-	    to1 = adjust_address (data->to, mode, data->offset);
-	}
+  if (data->reverse)
+    data->offset -= size;
 
-      if (data->autinc_from)
-	from1 = adjust_automodify_address (data->from, mode, data->from_addr,
-					   data->offset);
+  if (data->to)
+    {
+      if (data->autinc_to)
+	to1 = adjust_automodify_address (data->to, mode, data->to_addr,
+					 data->offset);
       else
-	from1 = adjust_address (data->from, mode, data->offset);
+	to1 = adjust_address (data->to, mode, data->offset);
+    }
 
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
-	emit_insn (gen_add2_insn (data->to_addr,
-				  GEN_INT (-(HOST_WIDE_INT)size)));
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_from < 0)
-	emit_insn (gen_add2_insn (data->from_addr,
-				  GEN_INT (-(HOST_WIDE_INT)size)));
+  if (data->autinc_from)
+    from1 = adjust_automodify_address (data->from, mode, data->from_addr,
+				       data->offset);
+  else
+    from1 = adjust_address (data->from, mode, data->offset);
 
-      if (data->to)
-	emit_insn ((*genfun) (to1, from1));
-      else
-	{
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
+    emit_insn (gen_add2_insn (data->to_addr,
+			      GEN_INT (-(HOST_WIDE_INT)size)));
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_from < 0)
+    emit_insn (gen_add2_insn (data->from_addr,
+			      GEN_INT (-(HOST_WIDE_INT)size)));
+
+  if (data->to)
+    emit_insn ((*genfun) (to1, from1));
+  else
+    {
 #ifdef PUSH_ROUNDING
-	  emit_single_push_insn (mode, from1, NULL);
+      emit_single_push_insn (mode, from1, NULL);
 #else
-	  gcc_unreachable ();
+      gcc_unreachable ();
 #endif
-	}
+    }
 
-      if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
-	emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
-      if (HAVE_POST_INCREMENT && data->explicit_inc_from > 0)
-	emit_insn (gen_add2_insn (data->from_addr, GEN_INT (size)));
+  if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
+    emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
+  if (HAVE_POST_INCREMENT && data->explicit_inc_from > 0)
+    emit_insn (gen_add2_insn (data->from_addr, GEN_INT (size)));
 
-      if (! data->reverse)
-	data->offset += size;
+  if (! data->reverse)
+    data->offset += size;
 
-      data->len -= size;
-    }
+  data->len -= size;
 }
 \f
 /* Emit code to move a block Y to a block X.  This may be done with
@@ -1677,7 +1929,7 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src, tree type, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (src)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (src))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (src))
 	      || MEM_ALIGN (src) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -2067,7 +2319,7 @@ emit_group_store (rtx orig_dst, rtx src, tree type ATTRIBUTE_UNUSED, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (dest)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (dest))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (dest))
 	      || MEM_ALIGN (dest) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -2356,7 +2608,10 @@ store_by_pieces (rtx to, unsigned HOST_WIDE_INT len,
   data.constfundata = constfundata;
   data.len = len;
   data.to = to;
-  store_by_pieces_1 (&data, align);
+  if (memsetp)
+    set_by_pieces_1 (&data, align);
+  else
+    store_by_pieces_1 (&data, align);
   if (endp)
     {
       rtx to1;
@@ -2400,10 +2655,10 @@ clear_by_pieces (rtx to, unsigned HOST_WIDE_INT len, unsigned int align)
     return;
 
   data.constfun = clear_by_pieces_1;
-  data.constfundata = NULL;
+  data.constfundata = CONST0_RTX (QImode);
   data.len = len;
   data.to = to;
-  store_by_pieces_1 (&data, align);
+  set_by_pieces_1 (&data, align);
 }
 
 /* Callback routine for clear_by_pieces.
@@ -2417,13 +2672,121 @@ clear_by_pieces_1 (void *data ATTRIBUTE_UNUSED,
   return const0_rtx;
 }
 
-/* Subroutine of clear_by_pieces and store_by_pieces.
+/* Helper function for set by pieces - generate a move with the given mode.
+   Return the mode actually used for the generated move (it can differ from
+   the requested one if that mode isn't supported).  */
+static enum machine_mode
+generate_move_with_mode (struct store_by_pieces_d *data,
+			 enum machine_mode mode,
+			 rtx *promoted_to_vector_value_ptr,
+			 rtx *promoted_value_ptr)
+{
+  enum insn_code icode;
+  rtx rhs = NULL_RTX;
+
+  gcc_assert (promoted_to_vector_value_ptr && promoted_value_ptr);
+
+  if (vector_extensions_used_for_mode (mode))
+    {
+      enum machine_mode vec_mode = vector_mode_for_mode (mode);
+      if (!(*promoted_to_vector_value_ptr))
+	*promoted_to_vector_value_ptr
+	  = targetm.promote_rtx_for_memset (vec_mode, (rtx)data->constfundata);
+
+      rhs = convert_to_mode (vec_mode, *promoted_to_vector_value_ptr, 1);
+    }
+  else
+    {
+      if (CONST_INT_P ((rtx)data->constfundata))
+	{
+	  /* We don't need to load the constant to a register, if it could be
+	     encoded as an immediate operand.  */
+	  rtx imm_const;
+	  switch (mode)
+	    {
+	    case DImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x0101010101010101, DImode);
+	      break;
+	    case SImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x01010101, SImode);
+	      break;
+	    case HImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x00000101, HImode);
+	      break;
+	    case QImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x00000001, QImode);
+	      break;
+	    default:
+	      gcc_unreachable ();
+	      break;
+	    }
+	  rhs = imm_const;
+	}
+      else /* data->constfundata isn't const.  */
+	{
+	  if (!(*promoted_value_ptr))
+	    {
+	      rtx coeff;
+	      enum machine_mode promoted_value_mode;
+	      /* Choose a mode for the promoted value.  It shouldn't be
+		 narrower than Pmode.  */
+	      if (GET_MODE_SIZE (mode) > GET_MODE_SIZE (Pmode))
+		promoted_value_mode = mode;
+	      else
+		promoted_value_mode = Pmode;
+
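+	      /* COEFF is a repeating 0x01 byte pattern; multiplying the
+		 zero-extended byte value by it replicates that byte into
+		 every byte of the promoted value.  */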
+	      switch (promoted_value_mode)
+		{
+		case DImode:
+		  coeff = gen_int_mode (0x0101010101010101, DImode);
+		  break;
+		case SImode:
+		  coeff = gen_int_mode (0x01010101, SImode);
+		  break;
+		default:
+		  gcc_unreachable ();
+		  break;
+		}
+	      *promoted_value_ptr = convert_to_mode (promoted_value_mode,
+						     (rtx)data->constfundata,
+						     1);
+	      *promoted_value_ptr = expand_mult (promoted_value_mode,
+						 *promoted_value_ptr, coeff,
+						 NULL_RTX, 1);
+	    }
+	  rhs = convert_to_mode (mode, *promoted_value_ptr, 1);
+	}
+    }
+  /* If RHS is null, then the requested mode isn't supported and can't be used.
+     Use Pmode instead.  */
+  if (!rhs)
+    {
+      generate_move_with_mode (data, Pmode, promoted_to_vector_value_ptr,
+			       promoted_value_ptr);
+      return Pmode;
+    }
+
+  gcc_assert (rhs);
+  icode = optab_handler (mov_optab, mode);
+  gcc_assert (icode != CODE_FOR_nothing);
+  set_by_pieces_2 (GEN_FCN (icode), mode, data, rhs);
+  return mode;
+}
+
+/* Subroutine of store_by_pieces.
    Generate several move instructions to store LEN bytes of block TO.  (A MEM
    rtx with BLKmode).  ALIGN is maximum alignment we can assume.  */
 
 static void
-store_by_pieces_1 (struct store_by_pieces_d *data ATTRIBUTE_UNUSED,
-		   unsigned int align ATTRIBUTE_UNUSED)
+store_by_pieces_1 (struct store_by_pieces_d *data, unsigned int align)
 {
   enum machine_mode to_addr_mode
     = targetm.addr_space.address_mode (MEM_ADDR_SPACE (data->to));
@@ -2498,6 +2861,134 @@ store_by_pieces_1 (struct store_by_pieces_d *data ATTRIBUTE_UNUSED,
   gcc_assert (!data->len);
 }
 
+/* Subroutine of clear_by_pieces and store_by_pieces.
+   Generate several move instructions to store LEN bytes of block TO.  (A MEM
+   rtx with BLKmode).  ALIGN is maximum alignment we can assume.
+   As opposed to store_by_pieces_1, this routine always generates code for
+   memset.  (store_by_pieces_1 is sometimes used to generate code for memcpy
+   rather than for memset).  */
+
+static void
+set_by_pieces_1 (struct store_by_pieces_d *data, unsigned int align)
+{
+  enum machine_mode to_addr_mode
+    = targetm.addr_space.address_mode (MEM_ADDR_SPACE (data->to));
+  rtx to_addr = XEXP (data->to, 0);
+  unsigned int max_size = STORE_MAX_PIECES + 1;
+  int dst_offset;
+  rtx promoted_to_vector_value = NULL_RTX;
+  rtx promoted_value = NULL_RTX;
+
+  data->offset = 0;
+  data->to_addr = to_addr;
+  data->autinc_to
+    = (GET_CODE (to_addr) == PRE_INC || GET_CODE (to_addr) == PRE_DEC
+       || GET_CODE (to_addr) == POST_INC || GET_CODE (to_addr) == POST_DEC);
+
+  data->explicit_inc_to = 0;
+  data->reverse
+    = (GET_CODE (to_addr) == PRE_DEC || GET_CODE (to_addr) == POST_DEC);
+  if (data->reverse)
+    data->offset = data->len;
+
+  /* If storing requires more than two move insns,
+     copy addresses to registers (to make displacements shorter)
+     and use post-increment if available.  */
+  if (!data->autinc_to
+      && move_by_pieces_ninsns (data->len, align, max_size) > 2)
+    {
+      /* Determine the main mode we'll be using.
+	 MODE might not be used depending on the definitions of the
+	 USE_* macros below.  */
+      enum machine_mode mode ATTRIBUTE_UNUSED
+	= widest_int_mode_for_size (max_size);
+
+      if (USE_STORE_PRE_DECREMENT (mode) && data->reverse && ! data->autinc_to)
+	{
+	  data->to_addr = copy_to_mode_reg (to_addr_mode,
+					    plus_constant (to_addr, data->len));
+	  data->autinc_to = 1;
+	  data->explicit_inc_to = -1;
+	}
+
+      if (USE_STORE_POST_INCREMENT (mode) && ! data->reverse
+	  && ! data->autinc_to)
+	{
+	  data->to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
+	  data->autinc_to = 1;
+	  data->explicit_inc_to = 1;
+	}
+
+      if ( !data->autinc_to && CONSTANT_P (to_addr))
+	data->to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
+    }
+
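+  /* If the destination's offset from the MOVE_MAX alignment boundary is
+     unknown, or aligned moves would not be cheaper than unaligned ones,
+     use the widest unaligned moves; otherwise use moves of the widest
+     mode allowed by the current alignment.  */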
+  dst_offset = get_mem_align_offset (data->to, MOVE_MAX * BITS_PER_UNIT);
+  if (dst_offset < 0
+      || (compute_aligned_cost (data->len, dst_offset)
+	  >= compute_unaligned_cost (data->len)))
+    {
+      while (data->len > 0)
+	{
+	  enum machine_mode mode = widest_mode_for_unaligned_mov (data->len);
+	  generate_move_with_mode (data, mode, &promoted_to_vector_value,
+				   &promoted_value);
+	}
+    }
+  else
+    {
+      while (data->len > 0)
+	{
+	  enum machine_mode mode;
+	  mode = widest_mode_for_aligned_mov (data->len,
+					      compute_align_by_offset (dst_offset));
+	  mode = generate_move_with_mode (data, mode, &promoted_to_vector_value,
+					  &promoted_value);
+	  dst_offset += GET_MODE_SIZE (mode);
+	}
+    }
+
+  /* The code above should have handled everything.  */
+  gcc_assert (!data->len);
+}
+
+/* Subroutine of set_by_pieces_1.  Emit a move instruction with mode MODE.
+   DATA has info about the destination, RHS is the source, GENFUN is the
+   gen_... function to make a move insn for that mode.  */
+
+static void
+set_by_pieces_2 (rtx (*genfun) (rtx, ...), enum machine_mode mode,
+		 struct store_by_pieces_d *data, rtx rhs)
+{
+  unsigned int size = GET_MODE_SIZE (mode);
+  rtx to1;
+
+  if (data->reverse)
+    data->offset -= size;
+
+  if (data->autinc_to)
+    to1 = adjust_automodify_address (data->to, mode, data->to_addr,
+				     data->offset);
+  else
+    to1 = adjust_address (data->to, mode, data->offset);
+
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
+    emit_insn (gen_add2_insn (data->to_addr,
+			      GEN_INT (-(HOST_WIDE_INT) size)));
+
+  gcc_assert (rhs);
+
+  emit_insn ((*genfun) (to1, rhs));
+
+  if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
+    emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
+
+  if (! data->reverse)
+    data->offset += size;
+
+  data->len -= size;
+}
+
 /* Subroutine of store_by_pieces_1.  Store as many bytes as appropriate
    with move instructions for mode MODE.  GENFUN is the gen_... function
    to make a move insn for that mode.  DATA has all the other info.  */
@@ -3714,7 +4205,7 @@ emit_push_insn (rtx x, enum machine_mode mode, tree type, rtx size,
 	  /* Here we avoid the case of a structure whose weak alignment
 	     forces many pushes of a small amount of data,
 	     and such small pushes do rounding that causes trouble.  */
-	  && ((! SLOW_UNALIGNED_ACCESS (word_mode, align))
+	  && ((! targetm.slow_unaligned_access (word_mode, align))
 	      || align >= BIGGEST_ALIGNMENT
 	      || (PUSH_ROUNDING (align / BITS_PER_UNIT)
 		  == (align / BITS_PER_UNIT)))
@@ -5839,7 +6330,7 @@ store_field (rtx target, HOST_WIDE_INT bitsize, HOST_WIDE_INT bitpos,
       || (mode != BLKmode
 	  && ((((MEM_ALIGN (target) < GET_MODE_ALIGNMENT (mode))
 		|| bitpos % GET_MODE_ALIGNMENT (mode))
-	       && SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (target)))
+	       && targetm.slow_unaligned_access (mode, MEM_ALIGN (target)))
 	      || (bitpos % BITS_PER_UNIT != 0)))
       /* If the RHS and field are a constant size and the size of the
 	 RHS isn't the same size as the bitfield, we must use bitfield
@@ -9195,7 +9686,7 @@ expand_expr_real_1 (tree exp, rtx target, enum machine_mode tmode,
 		     && ((modifier == EXPAND_CONST_ADDRESS
 			  || modifier == EXPAND_INITIALIZER)
 			 ? STRICT_ALIGNMENT
-			 : SLOW_UNALIGNED_ACCESS (mode1, MEM_ALIGN (op0))))
+			 : targetm.slow_unaligned_access (mode1, MEM_ALIGN (op0))))
 		    || (bitpos % BITS_PER_UNIT != 0)))
 	    /* If the type and the field are a constant size and the
 	       size of the type isn't the same size as the bitfield,
diff --git a/gcc/expr.h b/gcc/expr.h
index cb4050d..b9ec9c2 100644
--- a/gcc/expr.h
+++ b/gcc/expr.h
@@ -693,4 +693,8 @@ extern tree build_libfunc_function (const char *);
 /* Get the personality libfunc for a function decl.  */
 rtx get_personality_function (tree);
 
+/* Given the offset from the maximum alignment boundary, compute the maximum
+   alignment that can be assumed.  */
+unsigned int compute_align_by_offset (int);
+
 #endif /* GCC_EXPR_H */
diff --git a/gcc/fwprop.c b/gcc/fwprop.c
index 5db9ed8..f15f957 100644
--- a/gcc/fwprop.c
+++ b/gcc/fwprop.c
@@ -1270,6 +1270,10 @@ forward_propagate_and_simplify (df_ref use, rtx def_insn, rtx def_set)
       return false;
     }
 
+  /* Don't propagate vector constants.  */
+  if (vector_extensions_used_for_mode (GET_MODE (reg)) && CONSTANT_P (src))
+    return false;
+
   if (asm_use >= 0)
     return forward_propagate_asm (use, def_insn, def_set, reg);
 
diff --git a/gcc/rtl.h b/gcc/rtl.h
index e3ceecd..4584206 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -2428,6 +2428,10 @@ extern void emit_jump (rtx);
 /* In expr.c */
 extern rtx move_by_pieces (rtx, rtx, unsigned HOST_WIDE_INT,
 			   unsigned int, int);
+/* Check whether vector instructions are required for operating with the
+   specified mode.  */
+extern bool vector_extensions_used_for_mode (enum machine_mode);
+
 
 /* In cfgrtl.c */
 extern void print_rtl_with_bb (FILE *, const_rtx);
diff --git a/gcc/target.def b/gcc/target.def
index c67f0ba..378e6e5 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1479,6 +1479,22 @@ DEFHOOK
  bool, (struct ao_ref_s *ref),
  default_ref_may_alias_errno)
 
+/* True if accessing unaligned data in the given mode is too slow or
+   prohibited.  */
+DEFHOOK
+(slow_unaligned_access,
+ "",
+ bool, (enum machine_mode mode, unsigned int align),
+ default_slow_unaligned_access)
+
+/* Target hook.  Returns an rtx of mode MODE with the promoted value VAL,
+   or NULL.  VAL is supposed to represent one byte.  */
+DEFHOOK
+(promote_rtx_for_memset,
+ "",
+ rtx, (enum machine_mode mode, rtx val),
+ default_promote_rtx_for_memset)
+
 /* Support for named address spaces.  */
 #undef HOOK_PREFIX
 #define HOOK_PREFIX "TARGET_ADDR_SPACE_"
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index bcb8a12..4c4e4bd 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -1441,4 +1441,24 @@ default_pch_valid_p (const void *data_p, size_t len)
   return NULL;
 }
 
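+/* Default implementation of the slow_unaligned_access target hook: use the
+   SLOW_UNALIGNED_ACCESS macro if the target defines it, otherwise fall back
+   to STRICT_ALIGNMENT.  */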
+bool
+default_slow_unaligned_access (enum machine_mode mode ATTRIBUTE_UNUSED,
+			       unsigned int align ATTRIBUTE_UNUSED)
+{
+#ifdef SLOW_UNALIGNED_ACCESS
+  return SLOW_UNALIGNED_ACCESS (mode, align);
+#else
+  return STRICT_ALIGNMENT;
+#endif
+}
+
+/* Default implementation of the promote_rtx_for_memset target hook.
+   VAL is supposed to represent one byte.  Returning NULL_RTX means that
+   the target has no special way of promoting it to MODE.  */
+rtx
+default_promote_rtx_for_memset (enum machine_mode mode ATTRIBUTE_UNUSED,
+				 rtx val ATTRIBUTE_UNUSED)
+{
+  return NULL_RTX;
+}
+
 #include "gt-targhooks.h"
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index ce89d32..27f2f4d 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -174,3 +174,6 @@ extern enum machine_mode default_get_reg_raw_mode(int);
 
 extern void *default_get_pch_validity (size_t *);
 extern const char *default_pch_valid_p (const void *, size_t);
+extern bool default_slow_unaligned_access (enum machine_mode mode,
+					   unsigned int align);
+extern rtx default_promote_rtx_for_memset (enum machine_mode mode, rtx val);
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-1.c
new file mode 100644
index 0000000..c4d9fa3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-2.c
new file mode 100644
index 0000000..d25f297
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-3.c
new file mode 100644
index 0000000..0846e7c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-4.c
new file mode 100644
index 0000000..38140a1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-1.c
new file mode 100644
index 0000000..132b1e7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-2.c
new file mode 100644
index 0000000..4cfdc23
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-1.c
new file mode 100644
index 0000000..01c1324
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemcpy" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-2.c
new file mode 100644
index 0000000..fad066e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemcpy" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-1.c
new file mode 100644
index 0000000..1d1c9a8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-2.c
new file mode 100644
index 0000000..538fa73
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-3.c
new file mode 100644
index 0000000..7918557
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-4.c
new file mode 100644
index 0000000..8cdf50c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-1.c
new file mode 100644
index 0000000..ddebd95
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-2.c
new file mode 100644
index 0000000..b775354
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-3.c
new file mode 100644
index 0000000..5666b62
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-4.c
new file mode 100644
index 0000000..ed5d937
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-au-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-1.c
new file mode 100644
index 0000000..b2f3e41
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemcpy" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-au-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-2.c
new file mode 100644
index 0000000..4bc9412
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-au-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-3.c
new file mode 100644
index 0000000..b6f1479
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-1.c
new file mode 100644
index 0000000..15e0b12
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-2.c
new file mode 100644
index 0000000..a99c4ba
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-3.c
new file mode 100644
index 0000000..caa6199
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-4.c
new file mode 100644
index 0000000..40d7691
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-1.c
new file mode 100644
index 0000000..f543626
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-2.c
new file mode 100644
index 0000000..b858610
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-3.c
new file mode 100644
index 0000000..617471c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-4.c
new file mode 100644
index 0000000..eb4bf9b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-1.c
new file mode 100644
index 0000000..36223c7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-2.c
new file mode 100644
index 0000000..c05e509
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-3.c
new file mode 100644
index 0000000..08b7591
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-4.c
new file mode 100644
index 0000000..45bf2e9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-1.c
new file mode 100644
index 0000000..6416e97
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-2.c
new file mode 100644
index 0000000..481eb2e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-3.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-3.c
new file mode 100644
index 0000000..55934fd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-4.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-4.c
new file mode 100644
index 0000000..681d994
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-1.c
new file mode 100644
index 0000000..aca1224
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-2.c
new file mode 100644
index 0000000..dccdef3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-au-1.c b/gcc/testsuite/gcc.target/i386/memset-s3072-au-1.c
new file mode 100644
index 0000000..0a718ca
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemset" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-au-2.c b/gcc/testsuite/gcc.target/i386/memset-s3072-au-2.c
new file mode 100644
index 0000000..2e52789
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemset" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-1.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-1.c
new file mode 100644
index 0000000..e182d93
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-2.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-2.c
new file mode 100644
index 0000000..18c9b37
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-3.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-3.c
new file mode 100644
index 0000000..137a658
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-4.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-4.c
new file mode 100644
index 0000000..878acca
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-1.c
new file mode 100644
index 0000000..5c73cbd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-2.c
new file mode 100644
index 0000000..72bdd06e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-3.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-3.c
new file mode 100644
index 0000000..dc4c5aa
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-4.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-4.c
new file mode 100644
index 0000000..d14bce8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-1.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-1.c
new file mode 100644
index 0000000..b1ccc53
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-2.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-2.c
new file mode 100644
index 0000000..39eba30
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-3.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-3.c
new file mode 100644
index 0000000..472a12c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-4.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-4.c
new file mode 100644
index 0000000..bf6f9a1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-1.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-1.c
new file mode 100644
index 0000000..1c0c3d0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-10.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-10.c
new file mode 100644
index 0000000..1a73d2a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-10.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-11.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-11.c
new file mode 100644
index 0000000..4744f6d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-11.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-12.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-12.c
new file mode 100644
index 0000000..145ea52
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-12.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-2.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-2.c
new file mode 100644
index 0000000..93ff487
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-3.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-3.c
new file mode 100644
index 0000000..da01948
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-4.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-4.c
new file mode 100644
index 0000000..af707c9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-5.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-5.c
new file mode 100644
index 0000000..9e880da
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-5.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-6.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-6.c
new file mode 100644
index 0000000..02c5356
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-6.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-7.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-7.c
new file mode 100644
index 0000000..9230120
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-7.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-8.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-8.c
new file mode 100644
index 0000000..57a98fe
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-8.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-9.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-9.c
new file mode 100644
index 0000000..eee218f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-9.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-1.c
new file mode 100644
index 0000000..93649e6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-2.c
new file mode 100644
index 0000000..5078782
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-3.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-3.c
new file mode 100644
index 0000000..cdadae8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-4.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-4.c
new file mode 100644
index 0000000..25a9d20
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-1.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-1.c
new file mode 100644
index 0000000..c506844
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-2.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-2.c
new file mode 100644
index 0000000..f7cf5bf
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-3.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-3.c
new file mode 100644
index 0000000..0b1930e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-4.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-4.c
new file mode 100644
index 0000000..ef013b0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-1.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-1.c
new file mode 100644
index 0000000..d1331b1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-2.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-2.c
new file mode 100644
index 0000000..4f3e7b7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-3.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-3.c
new file mode 100644
index 0000000..ccbe129
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-4.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-4.c
new file mode 100644
index 0000000..3a45c4f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-5.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-5.c
new file mode 100644
index 0000000..1737703
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-5.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-6.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-6.c
new file mode 100644
index 0000000..6098a60
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-6.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-7.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-7.c
new file mode 100644
index 0000000..bfa44c7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-7.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-8.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-8.c
new file mode 100644
index 0000000..2f2cd5a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-8.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-07-11 21:09 ` Michael Zolotukhin
@ 2011-07-12  5:11   ` H.J. Lu
  2011-07-12 20:35     ` Michael Zolotukhin
  2011-07-16  2:51   ` Jan Hubicka
  1 sibling, 1 reply; 52+ messages in thread
From: H.J. Lu @ 2011-07-12  5:11 UTC (permalink / raw)
  To: Michael Zolotukhin; +Cc: gcc-patches, Richard Guenther

On Mon, Jul 11, 2011 at 1:57 PM, Michael Zolotukhin
<michael.v.zolotukhin@gmail.com> wrote:
> Sorry for sending once again - forgot to attach the patch.
>
> On 11 July 2011 23:50, Michael Zolotukhin
> <michael.v.zolotukhin@gmail.com> wrote:
>> The attached patch enables use of vector instructions in memmov/memset
>> expanding.
>>
>> New algorithm for move-mode selection is implemented for move_by_pieces,
>> store_by_pieces.
>> x86-specific ix86_expand_movmem and ix86_expand_setmem are also changed in
>> similar way, x86 cost-models parameters are slightly changed to support
>> this. This implementation checks if array's alignment is known at compile
>> time and chooses expanding algorithm and move-mode according to it.
>>
>> Bootstrapped, two new fails due to incorrect tests (see
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49503). New implementation gives
>> quite big performance gain on memset/memcpy in some cases.
>>
>> A bunch of new tests are added to verify the implementation.
>>
>> Is it ok for trunk?
>>
>> Changelog:
>>
>> 2011-07-11  Zolotukhin Michael  <michael.v.zolotukhin@intel.com>
>>
>>     * config/i386/i386.h (processor_costs): Add second dimension to
>>     stringop_algs array.
>>     (clear_ratio): Tune value to improve performance.
>>     * config/i386/i386.c (cost models): Initialize second dimension of
>>     stringop_algs arrays.  Tune cost model in atom_cost, generic32_cost
>>     and generic64_cost.
>>     (ix86_expand_move): Add support for vector moves, that use half of
>>     vector register.
>>     (expand_set_or_movmem_via_loop_with_iter): New function.
>>     (expand_set_or_movmem_via_loop): Enable reuse of the same iters in
>>     different loops, produced by this function.
>>     (emit_strset): New function.
>>     (promote_duplicated_reg): Add support for vector modes, add
>>     declaration.
>>     (promote_duplicated_reg_to_size): Likewise.
>>     (expand_movmem_epilogue): Add epilogue generation for bigger sizes.
>>     (expand_setmem_epilogue): Likewise.
>>     (expand_movmem_prologue): Likewise for prologue.
>>     (expand_setmem_prologue): Likewise.
>>     (expand_constant_movmem_prologue): Likewise.
>>     (expand_constant_setmem_prologue): Likewise.
>>     (decide_alg): Add new argument align_unknown.  Fix algorithm of
>>     strategy selection if TARGET_INLINE_ALL_STRINGOPS is set.
>>     (decide_alignment): Update desired alignment according to chosen move
>>     mode.
>>     (ix86_expand_movmem): Change unrolled_loop strategy to use SSE-moves.
>>     (ix86_expand_setmem): Likewise.
>>     (ix86_slow_unaligned_access): Implementation of new hook
>>     slow_unaligned_access.
>>     (ix86_promote_rtx_for_memset): Implementation of new hook
>>     promote_rtx_for_memset.
>>     * config/i386/sse.md (sse2_loadq): Add expand for sse2_loadq.
>>     (vec_dupv4si): Add expand for vec_dupv4si.
>>     (vec_dupv2di): Add expand for vec_dupv2di.
>>     * emit-rtl.c (adjust_address_1): Improve algorithm for determining
>>     alignment of address+offset.
>>     (get_mem_align_offset): Add handling of MEM_REFs.
>>     * expr.c (compute_align_by_offset): New function.
>>     (move_by_pieces_insn): New function.
>>     (widest_mode_for_unaligned_mov): New function.
>>     (widest_mode_for_aligned_mov): New function.
>>     (widest_int_mode_for_size): Change type of size from int to
>>     HOST_WIDE_INT.
>>     (set_by_pieces_1): New function (new algorithm of memset expanding).
>>     (set_by_pieces_2): New function.
>>     (generate_move_with_mode): New function for set_by_pieces.
>>     (alignment_for_piecewise_move): Use hook slow_unaligned_access instead
>>     of macros SLOW_UNALIGNED_ACCESS.
>>     (emit_group_load_1): Likewise.
>>     (emit_group_store): Likewise.
>>     (emit_push_insn): Likewise.
>>     (store_field): Likewise.
>>     (expand_expr_real_1): Likewise.
>>     (compute_aligned_cost): New function.
>>     (compute_unaligned_cost): New function.
>>     (vector_mode_for_mode): New function.
>>     (vector_extensions_used_for_mode): New function.
>>     (move_by_pieces): New algorithm of memmove expanding.
>>     (move_by_pieces_ninsns): Update according to changes in
>>     move_by_pieces.
>>     (move_by_pieces_1): Remove as unused.
>>     (store_by_pieces): New algorithm for memset expanding.
>>     (clear_by_pieces): Likewise.
>>     (store_by_pieces_1): Remove incorrect parameters' attributes.
>>     * expr.h (compute_align_by_offset): Add declaration.
>>     * rtl.h (vector_extensions_used_for_mode): Add declaration.
>>     * builtins.c (expand_builtin_memset_args): Update according to changes
>>     in set_by_pieces.
>>     * target.def (DEFHOOK): Add hook slow_unaligned_access and
>>     promote_rtx_for_memset.
>>     * targhooks.c (default_slow_unaligned_access): Add default hook
>>     implementation.
>>     (default_promote_rtx_for_memset): Likewise.
>>     * targhooks.h (default_slow_unaligned_access): Add prototype.
>>     (default_promote_rtx_for_memset): Likewise.
>>     * cse.c (cse_insn): Stop forward propagation of vector constants.
>>     * fwprop.c (forward_propagate_and_simplify): Likewise.
>>     * doc/tm.texi (SLOW_UNALIGNED_ACCESS): Remove documentation for deleted
>>     macro SLOW_UNALIGNED_ACCESS.
>>     (TARGET_SLOW_UNALIGNED_ACCESS): Add documentation on new hook.
>>     (TARGET_PROMOTE_RTX_FOR_MEMSET): Likewise.
>>     * doc/tm.texi.in (SLOW_UNALIGNED_ACCESS): Likewise.
>>     (TARGET_SLOW_UNALIGNED_ACCESS): Likewise.
>>     (TARGET_PROMOTE_RTX_FOR_MEMSET): Likewise.
>>
>> 2011-07-11  Zolotukhin Michael  <michael.v.zolotukhin@intel.com>
>>
>>     * testsuite/gcc.target/i386/memset-s64-a0-1.c: New testcase.
>>     * testsuite/gcc.target/i386/memset-s64-a0-2.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s768-a0-1.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s768-a0-2.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s16-a1-1.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s16-a1-1.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s64-a0-3.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s64-a0-1.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s64-a1-1.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s64-a1-1.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s64-au-1.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s64-au-1.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s512-a0-1.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s512-a0-1.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s512-a1-1.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s512-a1-1.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s512-au-1.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s512-au-1.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s3072-a1-1.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s3072-a1-1.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s3072-au-1.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s3072-au-1.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s64-a0-4.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s64-a0-5.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s768-a0-3.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s768-a0-4.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s16-a1-2.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s16-a1-2.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s64-a0-6.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s64-a0-2.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s64-a1-2.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s64-a1-2.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s64-au-2.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s64-au-2.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s512-a0-2.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s512-a0-2.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s512-a1-2.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s512-a1-2.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s512-au-2.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s512-au-2.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s3072-a1-2.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s3072-a1-2.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s3072-au-2.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s3072-au-2.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s64-a0-7.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s64-a0-8.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s768-a0-5.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s768-a0-6.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s16-a1-3.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s16-a1-3.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s64-a0-9.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s64-a0-3.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s64-a1-3.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s64-a1-3.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s64-au-3.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s64-au-3.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s512-a0-3.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s512-a0-3.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s512-a1-3.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s512-a1-3.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s512-au-3.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s512-au-3.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s64-a0-10.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s64-a0-11.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s768-a0-7.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s768-a0-8.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s16-a1-4.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s16-a1-4.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s64-a0-12.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s64-a0-4.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s64-a1-4.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s64-a1-4.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s64-au-4.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s64-au-4.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s512-a0-4.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s512-a0-4.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s512-a1-4.c: Ditto.
>>     * testsuite/gcc.target/i386/memcpy-s512-a1-4.c: Ditto.
>>     * testsuite/gcc.target/i386/memset-s512-au-4.c: Ditto.
>>
>

Please don't use -m32/-m64 in testcases directly.
You should use

/* { dg-do compile { target { ! ia32 } } } */

for 64bit insns and

/* { dg-do compile { target { ia32 } } } */

for 32bit insns.
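
For instance, memset-s64-a1-4.c above could then drop -m64 and read roughly
like this (just a sketch of the directive change; the body stays the same):

/* Ensure that we use SSE-moves for memset where it's needed.  */
/* { dg-do compile { target { ! ia32 } } } */
/* { dg-options "-O2 -march=corei7 -mtune=corei7 -dp" } */
extern void *memset (void *, int, __SIZE_TYPE__);
#define SIZE	64
#define OFFSET	1
char dst[SIZE + OFFSET] = {};

void
do_set ()
{
  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
}

/* { dg-final { scan-assembler "%xmm" } } */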


-- 
H.J.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-07-12  5:11   ` H.J. Lu
@ 2011-07-12 20:35     ` Michael Zolotukhin
  0 siblings, 0 replies; 52+ messages in thread
From: Michael Zolotukhin @ 2011-07-12 20:35 UTC (permalink / raw)
  To: H.J. Lu; +Cc: gcc-patches, Richard Guenther

[-- Attachment #1: Type: text/plain, Size: 11292 bytes --]

> Please don't use -m32/-m64 in testcases directly.
Fixed in the attached patch.

On 12 July 2011 07:49, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Mon, Jul 11, 2011 at 1:57 PM, Michael Zolotukhin
> <michael.v.zolotukhin@gmail.com> wrote:
>> Sorry for sending once again - forgot to attach the patch.
>>
>> On 11 July 2011 23:50, Michael Zolotukhin
>> <michael.v.zolotukhin@gmail.com> wrote:
>>> The attached patch enables use of vector instructions in memmov/memset
>>> expanding.
>>>
>>> New algorithm for move-mode selection is implemented for move_by_pieces,
>>> store_by_pieces.
>>> x86-specific ix86_expand_movmem and ix86_expand_setmem are also changed in
>>> similar way, x86 cost-models parameters are slightly changed to support
>>> this. This implementation checks if array's alignment is known at compile
>>> time and chooses expanding algorithm and move-mode according to it.
>>>
>>> Bootstrapped, two new fails due to incorrect tests (see
>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49503). New implementation gives
>>> quite big performance gain on memset/memcpy in some cases.
>>>
>>> A bunch of new tests are added to verify the implementation.
>>>
>>> Is it ok for trunk?
>>>
>>> Changelog:
>>>
>>> 2011-07-11  Zolotukhin Michael  <michael.v.zolotukhin@intel.com>
>>>
>>>     * config/i386/i386.h (processor_costs): Add second dimension to
>>>     stringop_algs array.
>>>     (clear_ratio): Tune value to improve performance.
>>>     * config/i386/i386.c (cost models): Initialize second dimension of
>>>     stringop_algs arrays.  Tune cost model in atom_cost, generic32_cost
>>>     and generic64_cost.
>>>     (ix86_expand_move): Add support for vector moves, that use half of
>>>     vector register.
>>>     (expand_set_or_movmem_via_loop_with_iter): New function.
>>>     (expand_set_or_movmem_via_loop): Enable reuse of the same iters in
>>>     different loops, produced by this function.
>>>     (emit_strset): New function.
>>>     (promote_duplicated_reg): Add support for vector modes, add
>>>     declaration.
>>>     (promote_duplicated_reg_to_size): Likewise.
>>>     (expand_movmem_epilogue): Add epilogue generation for bigger sizes.
>>>     (expand_setmem_epilogue): Likewise.
>>>     (expand_movmem_prologue): Likewise for prologue.
>>>     (expand_setmem_prologue): Likewise.
>>>     (expand_constant_movmem_prologue): Likewise.
>>>     (expand_constant_setmem_prologue): Likewise.
>>>     (decide_alg): Add new argument align_unknown.  Fix algorithm of
>>>     strategy selection if TARGET_INLINE_ALL_STRINGOPS is set.
>>>     (decide_alignment): Update desired alignment according to chosen move
>>>     mode.
>>>     (ix86_expand_movmem): Change unrolled_loop strategy to use SSE-moves.
>>>     (ix86_expand_setmem): Likewise.
>>>     (ix86_slow_unaligned_access): Implementation of new hook
>>>     slow_unaligned_access.
>>>     (ix86_promote_rtx_for_memset): Implementation of new hook
>>>     promote_rtx_for_memset.
>>>     * config/i386/sse.md (sse2_loadq): Add expand for sse2_loadq.
>>>     (vec_dupv4si): Add expand for vec_dupv4si.
>>>     (vec_dupv2di): Add expand for vec_dupv2di.
>>>     * emit-rtl.c (adjust_address_1): Improve algorithm for determining
>>>     alignment of address+offset.
>>>     (get_mem_align_offset): Add handling of MEM_REFs.
>>>     * expr.c (compute_align_by_offset): New function.
>>>     (move_by_pieces_insn): New function.
>>>     (widest_mode_for_unaligned_mov): New function.
>>>     (widest_mode_for_aligned_mov): New function.
>>>     (widest_int_mode_for_size): Change type of size from int to
>>>     HOST_WIDE_INT.
>>>     (set_by_pieces_1): New function (new algorithm of memset expanding).
>>>     (set_by_pieces_2): New function.
>>>     (generate_move_with_mode): New function for set_by_pieces.
>>>     (alignment_for_piecewise_move): Use hook slow_unaligned_access instead
>>>     of macros SLOW_UNALIGNED_ACCESS.
>>>     (emit_group_load_1): Likewise.
>>>     (emit_group_store): Likewise.
>>>     (emit_push_insn): Likewise.
>>>     (store_field): Likewise.
>>>     (expand_expr_real_1): Likewise.
>>>     (compute_aligned_cost): New function.
>>>     (compute_unaligned_cost): New function.
>>>     (vector_mode_for_mode): New function.
>>>     (vector_extensions_used_for_mode): New function.
>>>     (move_by_pieces): New algorithm of memmove expanding.
>>>     (move_by_pieces_ninsns): Update according to changes in
>>>     move_by_pieces.
>>>     (move_by_pieces_1): Remove as unused.
>>>     (store_by_pieces): New algorithm for memset expanding.
>>>     (clear_by_pieces): Likewise.
>>>     (store_by_pieces_1): Remove incorrect parameters' attributes.
>>>     * expr.h (compute_align_by_offset): Add declaration.
>>>     * rtl.h (vector_extensions_used_for_mode): Add declaration.
>>>     * builtins.c (expand_builtin_memset_args): Update according to changes
>>>     in set_by_pieces.
>>>     * target.def (DEFHOOK): Add hook slow_unaligned_access and
>>>     promote_rtx_for_memset.
>>>     * targhooks.c (default_slow_unaligned_access): Add default hook
>>>     implementation.
>>>     (default_promote_rtx_for_memset): Likewise.
>>>     * targhooks.h (default_slow_unaligned_access): Add prototype.
>>>     (default_promote_rtx_for_memset): Likewise.
>>>     * cse.c (cse_insn): Stop forward propagation of vector constants.
>>>     * fwprop.c (forward_propagate_and_simplify): Likewise.
>>>     * doc/tm.texi (SLOW_UNALIGNED_ACCESS): Remove documentation for deleted
>>>     macro SLOW_UNALIGNED_ACCESS.
>>>     (TARGET_SLOW_UNALIGNED_ACCESS): Add documentation on new hook.
>>>     (TARGET_PROMOTE_RTX_FOR_MEMSET): Likewise.
>>>     * doc/tm.texi.in (SLOW_UNALIGNED_ACCESS): Likewise.
>>>     (TARGET_SLOW_UNALIGNED_ACCESS): Likewise.
>>>     (TARGET_PROMOTE_RTX_FOR_MEMSET): Likewise.
>>>
>>> 2011-07-11  Zolotukhin Michael  <michael.v.zolotukhin@intel.com>
>>>
>>>     * testsuite/gcc.target/i386/memset-s64-a0-1.c: New testcase.
>>>     * testsuite/gcc.target/i386/memset-s64-a0-2.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s768-a0-1.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s768-a0-2.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s16-a1-1.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s16-a1-1.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s64-a0-3.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s64-a0-1.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s64-a1-1.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s64-a1-1.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s64-au-1.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s64-au-1.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s512-a0-1.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s512-a0-1.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s512-a1-1.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s512-a1-1.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s512-au-1.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s512-au-1.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s3072-a1-1.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s3072-a1-1.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s3072-au-1.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s3072-au-1.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s64-a0-4.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s64-a0-5.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s768-a0-3.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s768-a0-4.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s16-a1-2.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s16-a1-2.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s64-a0-6.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s64-a0-2.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s64-a1-2.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s64-a1-2.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s64-au-2.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s64-au-2.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s512-a0-2.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s512-a0-2.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s512-a1-2.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s512-a1-2.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s512-au-2.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s512-au-2.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s3072-a1-2.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s3072-a1-2.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s3072-au-2.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s3072-au-2.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s64-a0-7.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s64-a0-8.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s768-a0-5.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s768-a0-6.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s16-a1-3.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s16-a1-3.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s64-a0-9.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s64-a0-3.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s64-a1-3.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s64-a1-3.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s64-au-3.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s64-au-3.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s512-a0-3.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s512-a0-3.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s512-a1-3.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s512-a1-3.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s512-au-3.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s512-au-3.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s64-a0-10.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s64-a0-11.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s768-a0-7.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s768-a0-8.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s16-a1-4.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s16-a1-4.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s64-a0-12.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s64-a0-4.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s64-a1-4.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s64-a1-4.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s64-au-4.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s64-au-4.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s512-a0-4.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s512-a0-4.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s512-a1-4.c: Ditto.
>>>     * testsuite/gcc.target/i386/memcpy-s512-a1-4.c: Ditto.
>>>     * testsuite/gcc.target/i386/memset-s512-au-4.c: Ditto.
>>>
>>
>
> Please don't use -m32/-m64 in testcases directly.
> You should use
>
> /* { dg-do compile { target { ! ia32 } } } */
>
> for 64bit insns and
>
> /* { dg-do compile { target { ia32 } } } */
>
> for 32bit insns.
>
>
> --
> H.J.
>

[-- Attachment #2: memfunc.patch --]
[-- Type: application/octet-stream, Size: 156203 bytes --]

diff --git a/gcc/builtins.c b/gcc/builtins.c
index 1ee8cf8..40d6baa 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -3564,7 +3564,8 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
 				  builtin_memset_read_str, &c, dest_align,
 				  true))
 	store_by_pieces (dest_mem, tree_low_cst (len, 1),
-			 builtin_memset_read_str, &c, dest_align, true, 0);
+			 builtin_memset_read_str, gen_int_mode (c, val_mode),
+			 dest_align, true, 0);
       else if (!set_storage_via_setmem (dest_mem, len_rtx,
 					gen_int_mode (c, val_mode),
 					dest_align, expected_align,
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index a46101b..a4043f5 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
   COSTS_N_BYTES (2),			/* cost of FABS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FCHS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -632,10 +636,14 @@ struct processor_costs i386_cost = {	/* 386 specific costs */
   COSTS_N_INSNS (22),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (24),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (122),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -704,10 +712,14 @@ struct processor_costs i486_cost = {	/* 486 specific costs */
   COSTS_N_INSNS (3),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (3),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (83),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -774,10 +786,14 @@ struct processor_costs pentium_cost = {
   COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (70),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{-1, rep_prefix_4_byte}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -849,12 +865,18 @@ struct processor_costs pentiumpro_cost = {
      noticeable win, for bigger blocks either rep movsl or rep movsb is
      way to go.  Rep movsb has apparently more expensive startup time in CPU,
      but after 4K the difference is down in the noise.  */
-  {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
+  {{{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
 			{8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_4_byte, {{1024, unrolled_loop},
-  			{8192, rep_prefix_4_byte}, {-1, libcall}}},
+   {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_4_byte, {{1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_4_byte, {{1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -922,10 +944,14 @@ struct processor_costs geode_cost = {
   COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (54),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -995,10 +1021,14 @@ struct processor_costs k6_cost = {
   COSTS_N_INSNS (2),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (2),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (56),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1068,10 +1098,14 @@ struct processor_costs athlon_cost = {
   /* For some reason, Athlon deals better with REP prefix (relative to loops)
      compared to K8. Alignment becomes important after 8 bytes for memcpy and
      128 bytes for memset.  */
-  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1146,11 +1180,16 @@ struct processor_costs k8_cost = {
   /* K8 has optimized REP instruction for medium sized blocks, but for very
      small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1233,11 +1272,16 @@ struct processor_costs amdfam10_cost = {
   /* AMDFAM10 has optimized REP instruction for medium sized blocks, but for
      very small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1320,11 +1364,16 @@ struct processor_costs bdver1_cost = {
   /*  BDVER1 has optimized REP instruction for medium sized blocks, but for
       very small blocks it is better to use loop. For large blocks, libcall
       can do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   6,					/* scalar_stmt_cost.  */
   4,					/* scalar load_cost.  */
   4,					/* scalar_store_cost.  */
@@ -1402,11 +1451,16 @@ struct processor_costs btver1_cost = {
   /* BTVER1 has optimized REP instruction for medium sized blocks, but for
      very small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1473,11 +1527,18 @@ struct processor_costs pentium4_cost = {
   COSTS_N_INSNS (2),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (2),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (43),			/* cost of FSQRT instruction.  */
-  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+
+  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+
+  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
    {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1544,13 +1605,22 @@ struct processor_costs nocona_cost = {
   COSTS_N_INSNS (3),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (3),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (44),			/* cost of FSQRT instruction.  */
-  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+
+  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
    {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
 	      {100000, unrolled_loop}, {-1, libcall}}}},
-  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
+	      {100000, unrolled_loop}, {-1, libcall}}}}},
+
+  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
    {-1, libcall}}},
    {libcall, {{24, loop}, {64, unrolled_loop},
 	      {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {-1, libcall}}},
+   {libcall, {{24, loop}, {64, unrolled_loop},
+	      {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1617,13 +1687,20 @@ struct processor_costs atom_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{11, loop}, {-1, rep_prefix_4_byte}}},
-   {libcall, {{32, loop}, {64, rep_prefix_4_byte},
-	  {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {15, unrolled_loop},
-	  {2048, rep_prefix_4_byte}, {-1, libcall}}},
-   {libcall, {{24, loop}, {32, unrolled_loop},
-	  {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+
+  /* stringop_algs for memcpy.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
+    {libcall, {{4096, unrolled_loop}, {-1, libcall}}}},
+   {{libcall, {{-1, libcall}}},			       /* Unknown alignment.  */
+    {libcall, {{-1, libcall}}}}},
+
+  /* stringop_algs for memset.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
+    {libcall, {{4096, unrolled_loop}, {-1, libcall}}}},
+   {{libcall, {{1024, unrolled_loop},		       /* Unknown alignment.  */
+	       {-1, libcall}}},
+    {libcall, {{2048, unrolled_loop},
+	       {-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1697,10 +1774,16 @@ struct processor_costs generic64_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {DUMMY_STRINGOP_ALGS,
-   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {DUMMY_STRINGOP_ALGS,
-   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+
+  {{DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}},
+   {DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}}},
+
+  {{DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}},
+   {DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1769,10 +1852,16 @@ struct processor_costs generic32_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+  /* stringop_algs for memcpy.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  /* stringop_algs for memset.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -2451,6 +2540,7 @@ static void ix86_set_current_function (tree);
 static unsigned int ix86_minimum_incoming_stack_boundary (bool);
 
 static enum calling_abi ix86_function_abi (const_tree);
+static rtx promote_duplicated_reg (enum machine_mode, rtx);
 
 \f
 #ifndef SUBTARGET32_DEFAULT_CPU
@@ -14952,6 +15042,28 @@ ix86_expand_move (enum machine_mode mode, rtx operands[])
     }
   else
     {
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE2
+	  && MEM_P (op0)
+	  && MEM_P (op1)
+	  && !push_operand (op0, mode)
+	  && can_create_pseudo_p ())
+	{
+	  rtx temp = gen_reg_rtx (V2DImode);
+	  emit_insn (gen_sse2_loadq (temp, op1));
+	  emit_insn (gen_sse_storeq (op0, temp));
+	  return;
+	}
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE
+	  && !MEM_P (op1)
+	  && GET_MODE (op1) == V2DImode)
+	{
+	  emit_insn (gen_sse_storeq (op0, op1));
+	  return;
+	}
       if (MEM_P (op0)
 	  && (PUSH_ROUNDING (GET_MODE_SIZE (mode)) != GET_MODE_SIZE (mode)
 	      || !push_operand (op0, mode))
@@ -19470,22 +19582,17 @@ counter_mode (rtx count_exp)
   return SImode;
 }
 
-/* When SRCPTR is non-NULL, output simple loop to move memory
-   pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
-   overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
-   equivalent loop to set memory by VALUE (supposed to be in MODE).
-
-   The size is rounded down to whole number of chunk size moved at once.
-   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
-
-
-static void
-expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
-			       rtx destptr, rtx srcptr, rtx value,
-			       rtx count, enum machine_mode mode, int unroll,
-			       int expected_size)
+/* Helper function for expand_set_or_movmem_via_loop.
+   This function can reuse the iter rtx from another loop and can skip
+   generating code for updating the addresses.  */
+static rtx
+expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
+					 rtx destptr, rtx srcptr, rtx value,
+					 rtx count, rtx iter,
+					 enum machine_mode mode, int unroll,
+					 int expected_size, bool change_ptrs)
 {
-  rtx out_label, top_label, iter, tmp;
+  rtx out_label, top_label, tmp;
   enum machine_mode iter_mode = counter_mode (count);
   rtx piece_size = GEN_INT (GET_MODE_SIZE (mode) * unroll);
   rtx piece_size_mask = GEN_INT (~((GET_MODE_SIZE (mode) * unroll) - 1));
@@ -19493,10 +19600,12 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
   rtx x_addr;
   rtx y_addr;
   int i;
+  bool reuse_iter = (iter != NULL_RTX);
 
   top_label = gen_label_rtx ();
   out_label = gen_label_rtx ();
-  iter = gen_reg_rtx (iter_mode);
+  if (!reuse_iter)
+    iter = gen_reg_rtx (iter_mode);
 
   size = expand_simple_binop (iter_mode, AND, count, piece_size_mask,
 			      NULL, 1, OPTAB_DIRECT);
@@ -19507,7 +19616,8 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
 			       true, out_label);
       predict_jump (REG_BR_PROB_BASE * 10 / 100);
     }
-  emit_move_insn (iter, const0_rtx);
+  if (!reuse_iter)
+    emit_move_insn (iter, const0_rtx);
 
   emit_label (top_label);
 
@@ -19590,19 +19700,43 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
     }
   else
     predict_jump (REG_BR_PROB_BASE * 80 / 100);
-  iter = ix86_zero_extend_to_Pmode (iter);
-  tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
-			     true, OPTAB_LIB_WIDEN);
-  if (tmp != destptr)
-    emit_move_insn (destptr, tmp);
-  if (srcptr)
+  if (change_ptrs)
     {
-      tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
+      iter = ix86_zero_extend_to_Pmode (iter);
+      tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
 				 true, OPTAB_LIB_WIDEN);
-      if (tmp != srcptr)
-	emit_move_insn (srcptr, tmp);
+      if (tmp != destptr)
+	emit_move_insn (destptr, tmp);
+      if (srcptr)
+	{
+	  tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
+				     true, OPTAB_LIB_WIDEN);
+	  if (tmp != srcptr)
+	    emit_move_insn (srcptr, tmp);
+	}
     }
   emit_label (out_label);
+  return iter;
+}
+
+/* When SRCPTR is non-NULL, output simple loop to move memory
+   pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
+   overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
+   equivalent loop to set memory by VALUE (supposed to be in MODE).
+
+   The size is rounded down to whole number of chunk size moved at once.
+   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
+
+static void
+expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
+			       rtx destptr, rtx srcptr, rtx value,
+			       rtx count, enum machine_mode mode, int unroll,
+			       int expected_size)
+{
+  expand_set_or_movmem_via_loop_with_iter (destmem, srcmem,
+				 destptr, srcptr, value,
+				 count, NULL_RTX, mode, unroll,
+				 expected_size, true);
 }
 
 /* Output "rep; mov" instruction.
@@ -19704,7 +19838,27 @@ emit_strmov (rtx destmem, rtx srcmem,
   emit_insn (gen_strmov (destptr, dest, srcptr, src));
 }
 
-/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST.  */
+/* Emit strset instruction.  If RHS is constant and a vector mode is used,
+   then move this constant to a vector register before emitting strset.  */
+static void
+emit_strset (rtx destmem, rtx value,
+	     rtx destptr, enum machine_mode mode, int offset)
+{
+  rtx dest = adjust_automodify_address_nv (destmem, mode, destptr, offset);
+  rtx vec_reg;
+  if (vector_extensions_used_for_mode (mode) && CONSTANT_P (value))
+    {
+      if (mode == DImode)
+	mode = TARGET_64BIT ? V2DImode : V4SImode;
+      vec_reg = gen_reg_rtx (mode);
+      emit_move_insn (vec_reg, value);
+      emit_insn (gen_strset (destptr, dest, vec_reg));
+    }
+  else
+    emit_insn (gen_strset (destptr, dest, value));
+}
+
+/* Output code to copy (count % max_size) bytes from SRC to DEST.  */
 static void
 expand_movmem_epilogue (rtx destmem, rtx srcmem,
 			rtx destptr, rtx srcptr, rtx count, int max_size)
@@ -19715,43 +19869,55 @@ expand_movmem_epilogue (rtx destmem, rtx srcmem,
       HOST_WIDE_INT countval = INTVAL (count);
       int offset = 0;
 
-      if ((countval & 0x10) && max_size > 16)
+      int remainder_size = countval % max_size;
+      enum machine_mode move_mode = Pmode;
+
+      /* Firstly, try to move data with the widest possible mode.
+	 Remaining part we'll move using Pmode and narrower modes.  */
+      if (TARGET_SSE)
 	{
-	  if (TARGET_64BIT)
-	    {
-	      emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
-	      emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset + 8);
-	    }
-	  else
-	    gcc_unreachable ();
-	  offset += 16;
+	  if (max_size >= GET_MODE_SIZE (V4SImode))
+	    move_mode = V4SImode;
+	  else if (max_size >= GET_MODE_SIZE (DImode))
+	    move_mode = DImode;
 	}
-      if ((countval & 0x08) && max_size > 8)
+
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-	  if (TARGET_64BIT)
-	    emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
-	  else
-	    {
-	      emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
-	      emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset + 4);
-	    }
-	  offset += 8;
+	  emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
 	}
-      if ((countval & 0x04) && max_size > 4)
+
+      /* Move the remaining part of the epilogue - its size might be
+	 almost as large as the widest mode.  */
+      move_mode = Pmode;
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
+	}
+
+      if (remainder_size >= 4)
+	{
+	  emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
 	  offset += 4;
+	  remainder_size -= 4;
 	}
-      if ((countval & 0x02) && max_size > 2)
+      if (remainder_size >= 2)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
 	  offset += 2;
+	  remainder_size -= 2;
 	}
-      if ((countval & 0x01) && max_size > 1)
+      if (remainder_size >= 1)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
 	  offset += 1;
+	  remainder_size -= 1;
 	}
+      gcc_assert (remainder_size == 0);
       return;
     }
   if (max_size > 8)
@@ -19857,87 +20023,122 @@ expand_setmem_epilogue_via_loop (rtx destmem, rtx destptr, rtx value,
 				 1, max_size / 2);
 }
 
-/* Output code to set at most count & (max_size - 1) bytes starting by DEST.  */
+/* Output code to set at most count & (max_size - 1) bytes starting by
+   DESTMEM.  */
 static void
-expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_size)
+expand_setmem_epilogue (rtx destmem, rtx destptr, rtx promoted_to_vector_value,
+			rtx value, rtx count, int max_size)
 {
-  rtx dest;
-
   if (CONST_INT_P (count))
     {
       HOST_WIDE_INT countval = INTVAL (count);
       int offset = 0;
 
-      if ((countval & 0x10) && max_size > 16)
+      int remainder_size = countval % max_size;
+      enum machine_mode move_mode = Pmode;
+      enum machine_mode sse_mode = TARGET_64BIT ? V2DImode : V4SImode;
+      rtx promoted_value = NULL_RTX;
+
+      /* Firstly, try to move data with the widest possible mode.
+	 Remaining part we'll move using Pmode and narrower modes.  */
+      if (TARGET_SSE)
 	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset + 8);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    gcc_unreachable ();
-	  offset += 16;
+	  if (max_size >= GET_MODE_SIZE (sse_mode))
+	    move_mode = sse_mode;
+	  else if (max_size >= GET_MODE_SIZE (DImode))
+	    move_mode = DImode;
+	  if (!VECTOR_MODE_P (GET_MODE (promoted_to_vector_value)))
+	    promoted_to_vector_value = NULL_RTX;
 	}
-      if ((countval & 0x08) && max_size > 8)
+
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    {
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset + 4);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  offset += 8;
+	  if (GET_MODE (destmem) != move_mode)
+	    destmem = change_address (destmem, move_mode, destptr);
+	  if (!promoted_to_vector_value)
+	    promoted_to_vector_value =
+	      targetm.promote_rtx_for_memset (move_mode, value);
+	  emit_strset (destmem, promoted_to_vector_value, destptr,
+		       move_mode, offset);
+
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
 	}
-      if ((countval & 0x04) && max_size > 4)
+
+      /* Move the remaining part of the epilogue - its size might be
+	 up to the size of the widest mode.  */
+      move_mode = Pmode;
+      promoted_value = NULL_RTX;
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
+	{
+	  if (!promoted_value)
+	    promoted_value = promote_duplicated_reg (move_mode, value);
+	  emit_strset (destmem, promoted_value, destptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
+	}
+
+      if (!promoted_value)
+	promoted_value = promote_duplicated_reg (move_mode, value);
+      if (remainder_size >= 4)
 	{
-	  dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
+	  emit_strset (destmem, gen_lowpart (SImode, promoted_value), destptr,
+		       SImode, offset);
 	  offset += 4;
+	  remainder_size -= 4;
 	}
-      if ((countval & 0x02) && max_size > 2)
+      if (remainder_size >= 2)
 	{
-	  dest = adjust_automodify_address_nv (destmem, HImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
-	  offset += 2;
+	  emit_strset (destmem, gen_lowpart (HImode, promoted_value), destptr,
+		       HImode, offset);
+	  offset += 2;
+	  remainder_size -= 2;
 	}
-      if ((countval & 0x01) && max_size > 1)
+      if (remainder_size >= 1)
 	{
-	  dest = adjust_automodify_address_nv (destmem, QImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
+	  emit_strset (destmem, gen_lowpart (QImode, promoted_value), destptr,
+		       QImode, offset);
 	  offset += 1;
+	  remainder_size -= 1;
 	}
+      gcc_assert (remainder_size == 0);
       return;
     }
+
+  /* COUNT isn't a constant.  */
   if (max_size > 32)
     {
-      expand_setmem_epilogue_via_loop (destmem, destptr, value, count, max_size);
+      expand_setmem_epilogue_via_loop (destmem, destptr, value, count,
+				       max_size);
       return;
     }
+  /* If it turned out that we promoted the value to a non-vector register, we
+     can reuse it.  */
+  if (!VECTOR_MODE_P (GET_MODE (promoted_to_vector_value)))
+    value = promoted_to_vector_value;
+
   if (max_size > 16)
     {
       rtx label = ix86_expand_aligntest (count, 16, true);
       if (TARGET_64BIT)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
@@ -19947,14 +20148,17 @@ expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_
       rtx label = ix86_expand_aligntest (count, 8, true);
       if (TARGET_64BIT)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
@@ -19962,24 +20166,24 @@ expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_
   if (max_size > 4)
     {
       rtx label = ix86_expand_aligntest (count, 4, true);
-      dest = change_address (destmem, SImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
+      destmem = change_address (destmem, SImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
   if (max_size > 2)
     {
       rtx label = ix86_expand_aligntest (count, 2, true);
-      dest = change_address (destmem, HImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
+      destmem = change_address (destmem, HImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (HImode, value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
   if (max_size > 1)
     {
       rtx label = ix86_expand_aligntest (count, 1, true);
-      dest = change_address (destmem, QImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
+      destmem = change_address (destmem, QImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (QImode, value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
@@ -20022,7 +20226,27 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
-  gcc_assert (desired_alignment <= 8);
+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  srcmem = change_address (srcmem, DImode, srcptr);
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	}
+      else
+	{
+	  srcmem = change_address (srcmem, SImode, srcptr);
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	}
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);
 }
 
 /* Copy enough from DST to SRC to align DST known to DESIRED_ALIGN.
@@ -20078,6 +20302,37 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
       off = 4;
       emit_insn (gen_strmov (destreg, dst, srcreg, src));
     }
+  if (align_bytes & 8)
+    {
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, DImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      else
+	{
+	  dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, SImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
+	set_mem_align (dst, 8 * BITS_PER_UNIT);
+      if (src_align_bytes >= 0)
+	{
+	  unsigned int src_align = 0;
+	  if ((src_align_bytes & 7) == (align_bytes & 7))
+	    src_align = 8;
+	  else if ((src_align_bytes & 3) == (align_bytes & 3))
+	    src_align = 4;
+	  else if ((src_align_bytes & 1) == (align_bytes & 1))
+	    src_align = 2;
+	  if (MEM_ALIGN (src) < src_align * BITS_PER_UNIT)
+	    set_mem_align (src, src_align * BITS_PER_UNIT);
+	}
+      off = 8;
+    }
   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
   src = adjust_automodify_address_nv (src, BLKmode, srcreg, off);
   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
@@ -20137,7 +20392,17 @@ expand_setmem_prologue (rtx destmem, rtx destptr, rtx value, rtx count,
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
-  gcc_assert (desired_alignment <= 8);
+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      destmem = change_address (destmem, SImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);
 }
 
 /* Set enough from DST to align DST known to by aligned by ALIGN to
@@ -20173,6 +20438,19 @@ expand_constant_setmem_prologue (rtx dst, rtx destreg, rtx value,
       emit_insn (gen_strset (destreg, dst,
 			     gen_lowpart (SImode, value)));
     }
+  if (align_bytes & 8)
+    {
+      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+      emit_insn (gen_strset (destreg, dst,
+	    gen_lowpart (SImode, value)));
+      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+      emit_insn (gen_strset (destreg, dst,
+	    gen_lowpart (SImode, value)));
+
+      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
+	set_mem_align (dst, 8 * BITS_PER_UNIT);
+      off = 8;
+    }
   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
     set_mem_align (dst, desired_align * BITS_PER_UNIT);
@@ -20184,7 +20462,7 @@ expand_constant_setmem_prologue (rtx dst, rtx destreg, rtx value,
 /* Given COUNT and EXPECTED_SIZE, decide on codegen of string operation.  */
 static enum stringop_alg
 decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
-	    int *dynamic_check)
+	    int *dynamic_check, bool align_unknown)
 {
   const struct stringop_algs * algs;
   bool optimize_for_speed;
@@ -20193,7 +20471,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
      consider such algorithms if the user has appropriated those
      registers for their own purposes.	*/
   bool rep_prefix_usable = !(fixed_regs[CX_REG] || fixed_regs[DI_REG]
-                             || (memset
+			     || (memset
 				 ? fixed_regs[AX_REG] : fixed_regs[SI_REG]));
 
 #define ALG_USABLE_P(alg) (rep_prefix_usable			\
@@ -20206,7 +20484,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
      of time processing large blocks.  */
   if (optimize_function_for_size_p (cfun)
       || (optimize_insn_for_size_p ()
-          && expected_size != -1 && expected_size < 256))
+	  && expected_size != -1 && expected_size < 256))
     optimize_for_speed = false;
   else
     optimize_for_speed = true;
@@ -20215,9 +20493,9 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
 
   *dynamic_check = -1;
   if (memset)
-    algs = &cost->memset[TARGET_64BIT != 0];
+    algs = &cost->memset[align_unknown][TARGET_64BIT != 0];
   else
-    algs = &cost->memcpy[TARGET_64BIT != 0];
+    algs = &cost->memcpy[align_unknown][TARGET_64BIT != 0];
   if (ix86_stringop_alg != no_stringop && ALG_USABLE_P (ix86_stringop_alg))
     return ix86_stringop_alg;
   /* rep; movq or rep; movl is the smallest variant.  */
@@ -20281,29 +20559,33 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
       enum stringop_alg alg;
       int i;
       bool any_alg_usable_p = true;
+      bool only_libcall_fits = true;
 
       for (i = 0; i < MAX_STRINGOP_ALGS; i++)
-        {
-          enum stringop_alg candidate = algs->size[i].alg;
-          any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
+	{
+	  enum stringop_alg candidate = algs->size[i].alg;
+	  any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
 
-          if (candidate != libcall && candidate
-              && ALG_USABLE_P (candidate))
-              max = algs->size[i].max;
-        }
+	  if (candidate != libcall && candidate
+	      && ALG_USABLE_P (candidate))
+	    {
+	      max = algs->size[i].max;
+	      only_libcall_fits = false;
+	    }
+	}
       /* If there aren't any usable algorithms, then recursing on
-         smaller sizes isn't going to find anything.  Just return the
-         simple byte-at-a-time copy loop.  */
-      if (!any_alg_usable_p)
-        {
-          /* Pick something reasonable.  */
-          if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
-            *dynamic_check = 128;
-          return loop_1_byte;
-        }
+	 smaller sizes isn't going to find anything.  Just return the
+	 simple byte-at-a-time copy loop.  */
+      if (!any_alg_usable_p || only_libcall_fits)
+	{
+	  /* Pick something reasonable.  */
+	  if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
+	    *dynamic_check = 128;
+	  return loop_1_byte;
+	}
       if (max == -1)
 	max = 4096;
-      alg = decide_alg (count, max / 2, memset, dynamic_check);
+      alg = decide_alg (count, max / 2, memset, dynamic_check, align_unknown);
       gcc_assert (*dynamic_check == -1);
       gcc_assert (alg != libcall);
       if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
@@ -20327,9 +20609,11 @@ decide_alignment (int align,
       case no_stringop:
 	gcc_unreachable ();
       case loop:
-      case unrolled_loop:
 	desired_align = GET_MODE_SIZE (Pmode);
 	break;
+      case unrolled_loop:
+	desired_align = GET_MODE_SIZE (TARGET_SSE ? V4SImode : Pmode);
+	break;
       case rep_prefix_8_byte:
 	desired_align = 8;
 	break;
@@ -20417,6 +20701,11 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
   enum stringop_alg alg;
   int dynamic_check;
   bool need_zero_guard = false;
+  bool align_unknown;
+  int unroll_factor;
+  enum machine_mode move_mode;
+  rtx loop_iter = NULL_RTX;
+  int dst_offset, src_offset;
 
   if (CONST_INT_P (align_exp))
     align = INTVAL (align_exp);
@@ -20440,9 +20729,17 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 
   /* Step 0: Decide on preferred algorithm, desired alignment and
      size of chunks to be copied by main loop.  */
-
-  alg = decide_alg (count, expected_size, false, &dynamic_check);
+  dst_offset = get_mem_align_offset (dst, MOVE_MAX * BITS_PER_UNIT);
+  src_offset = get_mem_align_offset (src, MOVE_MAX * BITS_PER_UNIT);
+  align_unknown = (dst_offset < 0
+		   || src_offset < 0
+		   || src_offset != dst_offset);
+  alg = decide_alg (count, expected_size, false, &dynamic_check, align_unknown);
   desired_align = decide_alignment (align, alg, expected_size);
+  if (align_unknown)
+    desired_align = align;
+  unroll_factor = 1;
+  move_mode = Pmode;
 
   if (!TARGET_ALIGN_STRINGOPS)
     align = desired_align;
@@ -20461,11 +20758,16 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
       gcc_unreachable ();
     case loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode);
+      move_mode = Pmode;
+      unroll_factor = 1;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * (TARGET_64BIT ? 4 : 2);
+      /* Use SSE instructions, if possible.  */
+      move_mode = TARGET_SSE ? (align_unknown ? DImode : V4SImode) : Pmode;
+      unroll_factor = 4;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
       size_needed = 8;
@@ -20634,11 +20936,14 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 				     count_exp, Pmode, 1, expected_size);
       break;
     case unrolled_loop:
-      /* Unroll only by factor of 2 in 32bit mode, since we don't have enough
-	 registers for 4 temporaries anyway.  */
-      expand_set_or_movmem_via_loop (dst, src, destreg, srcreg, NULL,
-				     count_exp, Pmode, TARGET_64BIT ? 4 : 2,
-				     expected_size);
+      /* In some cases we want to use the same iterator in several adjacent
+	 loops, so here we save loop iterator rtx and don't update addresses.  */
+      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
+							   srcreg, NULL,
+							   count_exp, NULL_RTX,
+							   move_mode,
+							   unroll_factor,
+							   expected_size, false);
       break;
     case rep_prefix_8_byte:
       expand_movmem_via_rep_mov (dst, src, destreg, srcreg, count_exp,
@@ -20689,9 +20994,50 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
       LABEL_NUSES (label) = 1;
     }
 
+  /* We haven't updated the addresses, so we'll do it now.
+     Also, if the epilogue seems to be big, we'll generate a loop (not
+     unrolled) in it.  We do this only if alignment is unknown, because in
+     that case the epilogue would have to move the data byte by byte, which
+     is very slow.  */
+  if (alg == unrolled_loop)
+    {
+      rtx tmp;
+      if (align_unknown && unroll_factor > 1)
+	{
+	  /* Reduce the epilogue's size by creating a non-unrolled loop.  If
+	     we don't do this, we can have a very big epilogue - when alignment
+	     is statically unknown it goes byte by byte, which may be very slow.  */
+	  rtx epilogue_loop_jump_around = gen_label_rtx ();
+	  rtx tmp = plus_constant (loop_iter, GET_MODE_SIZE (move_mode));
+	  emit_cmp_and_jump_insns (count_exp, tmp, LT, NULL_RTX,
+				   counter_mode (count_exp), true,
+				   epilogue_loop_jump_around);
+	  predict_jump (REG_BR_PROB_BASE * 10 / 100);
+	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
+	      srcreg, NULL, count_exp,
+	      loop_iter, move_mode, 1,
+	      expected_size, false);
+	  emit_label (epilogue_loop_jump_around);
+	  src = change_address (src, BLKmode, srcreg);
+	  dst = change_address (dst, BLKmode, destreg);
+	  epilogue_size_needed = GET_MODE_SIZE (move_mode);
+	}
+      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != destreg)
+	emit_move_insn (destreg, tmp);
+
+      tmp = expand_simple_binop (Pmode, PLUS, srcreg, loop_iter, srcreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != srcreg)
+	emit_move_insn (srcreg, tmp);
+    }
   if (count_exp != const0_rtx && epilogue_size_needed > 1)
-    expand_movmem_epilogue (dst, src, destreg, srcreg, count_exp,
-			    epilogue_size_needed);
+    {
+      expand_movmem_epilogue (dst, src, destreg, srcreg, count_exp,
+			      epilogue_size_needed);
+    }
+
   if (jump_around_label)
     emit_label (jump_around_label);
   return true;
@@ -20709,7 +21055,37 @@ promote_duplicated_reg (enum machine_mode mode, rtx val)
   rtx tmp;
   int nops = mode == DImode ? 3 : 2;
 
+  if (VECTOR_MODE_P (mode))
+    {
+      enum machine_mode inner = GET_MODE_INNER (mode);
+      rtx promoted_val, vec_reg;
+      if (CONST_INT_P (val))
+	return ix86_build_const_vector (mode, true, val);
+
+      promoted_val = promote_duplicated_reg (inner, val);
+      vec_reg = gen_reg_rtx (mode);
+      switch (mode)
+	{
+	case V2DImode:
+	  emit_insn (gen_vec_dupv2di (vec_reg, promoted_val));
+	  break;
+	case V4SImode:
+	  emit_insn (gen_vec_dupv4si (vec_reg, promoted_val));
+	  break;
+	default:
+	  gcc_unreachable ();
+	  break;
+	}
+
+      return vec_reg;
+    }
   gcc_assert (mode == SImode || mode == DImode);
+  if (mode == DImode && !TARGET_64BIT)
+    {
+      rtx vec_reg = promote_duplicated_reg (V4SImode, val);
+      vec_reg = convert_to_mode (V2DImode, vec_reg, 1);
+      return vec_reg;
+    }
   if (val == const0_rtx)
     return copy_to_mode_reg (mode, const0_rtx);
   if (CONST_INT_P (val))
@@ -20775,11 +21151,21 @@ promote_duplicated_reg (enum machine_mode mode, rtx val)
 static rtx
 promote_duplicated_reg_to_size (rtx val, int size_needed, int desired_align, int align)
 {
-  rtx promoted_val;
+  rtx promoted_val = NULL_RTX;
 
-  if (TARGET_64BIT
-      && (size_needed > 4 || (desired_align > align && desired_align > 4)))
-    promoted_val = promote_duplicated_reg (DImode, val);
+  if (size_needed > 8 || (desired_align > align && desired_align > 8))
+    {
+      gcc_assert (TARGET_SSE);
+      if (TARGET_64BIT)
+        promoted_val = promote_duplicated_reg (V2DImode, val);
+      else
+        promoted_val = promote_duplicated_reg (V4SImode, val);
+    }
+  else if (size_needed > 4 || (desired_align > align && desired_align > 4))
+    {
+      gcc_assert (TARGET_64BIT || TARGET_SSE);
+      promoted_val = promote_duplicated_reg (DImode, val);
+    }
   else if (size_needed > 2 || (desired_align > align && desired_align > 2))
     promoted_val = promote_duplicated_reg (SImode, val);
   else if (size_needed > 1 || (desired_align > align && desired_align > 1))
@@ -20805,12 +21191,17 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
   unsigned HOST_WIDE_INT count = 0;
   HOST_WIDE_INT expected_size = -1;
   int size_needed = 0, epilogue_size_needed;
+  int promote_size_needed = 0;
   int desired_align = 0, align_bytes = 0;
   enum stringop_alg alg;
   rtx promoted_val = NULL;
-  bool force_loopy_epilogue = false;
+  rtx vec_promoted_val = NULL;
   int dynamic_check;
   bool need_zero_guard = false;
+  bool align_unknown;
+  unsigned int unroll_factor;
+  enum machine_mode move_mode;
+  rtx loop_iter = NULL_RTX;
 
   if (CONST_INT_P (align_exp))
     align = INTVAL (align_exp);
@@ -20830,8 +21221,11 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
   /* Step 0: Decide on preferred algorithm, desired alignment and
      size of chunks to be copied by main loop.  */
 
-  alg = decide_alg (count, expected_size, true, &dynamic_check);
+  align_unknown = get_mem_align_offset (dst, BITS_PER_UNIT) < 0;
+  alg = decide_alg (count, expected_size, true, &dynamic_check, align_unknown);
   desired_align = decide_alignment (align, alg, expected_size);
+  unroll_factor = 1;
+  move_mode = Pmode;
 
   if (!TARGET_ALIGN_STRINGOPS)
     align = desired_align;
@@ -20849,11 +21243,21 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       gcc_unreachable ();
     case loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode);
+      move_mode = Pmode;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * 4;
+      /* Use SSE instructions, if possible.  */
+      move_mode = TARGET_SSE
+		  ? (TARGET_64BIT ? V2DImode : V4SImode)
+		  : Pmode;
+      unroll_factor = 1;
+      /* Select maximal available 1,2 or 4 unroll factor.  */
+      while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
+	     && unroll_factor < 4)
+	unroll_factor *= 2;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
       size_needed = 8;
@@ -20870,6 +21274,7 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       break;
     }
   epilogue_size_needed = size_needed;
+  promote_size_needed = GET_MODE_SIZE (Pmode);
 
   /* Step 1: Prologue guard.  */
 
@@ -20898,8 +21303,10 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
      main loop and epilogue (ie one load of the big constant in the
      front of all code.  */
   if (CONST_INT_P (val_exp))
-    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
-						   desired_align, align);
+    promoted_val = promote_duplicated_reg_to_size (val_exp,
+						   promote_size_needed,
+						   promote_size_needed,
+						   align);
   /* Ensure that alignment prologue won't copy past end of block.  */
   if (size_needed > 1 || (desired_align > 1 && desired_align > align))
     {
@@ -20908,12 +21315,6 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 	 Make sure it is power of 2.  */
       epilogue_size_needed = smallest_pow2_greater_than (epilogue_size_needed);
 
-      /* To improve performance of small blocks, we jump around the VAL
-	 promoting mode.  This mean that if the promoted VAL is not constant,
-	 we might not use it in the epilogue and have to use byte
-	 loop variant.  */
-      if (epilogue_size_needed > 2 && !promoted_val)
-        force_loopy_epilogue = true;
       if (count)
 	{
 	  if (count < (unsigned HOST_WIDE_INT)epilogue_size_needed)
@@ -20954,8 +21355,10 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 
   /* Do the expensive promotion once we branched off the small blocks.  */
   if (!promoted_val)
-    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
-						   desired_align, align);
+    promoted_val = promote_duplicated_reg_to_size (val_exp,
+						   promote_size_needed,
+						   promote_size_needed,
+						   align);
   gcc_assert (desired_align >= 1 && align >= 1);
 
   if (desired_align > align)
@@ -20978,6 +21381,8 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 						 desired_align, align_bytes);
 	  count_exp = plus_constant (count_exp, -align_bytes);
 	  count -= align_bytes;
+	  if (count < (unsigned HOST_WIDE_INT) size_needed)
+	    goto epilogue;
 	}
       if (need_zero_guard
 	  && (count < (unsigned HOST_WIDE_INT) size_needed
@@ -21019,7 +21424,7 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
     case no_stringop:
       gcc_unreachable ();
     case loop_1_byte:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
+      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, val_exp,
 				     count_exp, QImode, 1, expected_size);
       break;
     case loop:
@@ -21027,8 +21432,14 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 				     count_exp, Pmode, 1, expected_size);
       break;
     case unrolled_loop:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
-				     count_exp, Pmode, 4, expected_size);
+      vec_promoted_val =
+	promote_duplicated_reg_to_size (promoted_val,
+					GET_MODE_SIZE (move_mode),
+					desired_align, align);
+      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+				     NULL, vec_promoted_val, count_exp,
+				     NULL_RTX, move_mode, unroll_factor,
+				     expected_size, false);
       break;
     case rep_prefix_8_byte:
       expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
@@ -21072,15 +21483,36 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       LABEL_NUSES (label) = 1;
     }
  epilogue:
-  if (count_exp != const0_rtx && epilogue_size_needed > 1)
+  if (alg == unrolled_loop)
     {
-      if (force_loopy_epilogue)
-	expand_setmem_epilogue_via_loop (dst, destreg, val_exp, count_exp,
-					 epilogue_size_needed);
-      else
-	expand_setmem_epilogue (dst, destreg, promoted_val, count_exp,
-				epilogue_size_needed);
+      rtx tmp;
+      if (align_unknown && unroll_factor > 1)
+	{
+	  /* Reduce the epilogue's size by creating a non-unrolled loop.  If
+	     we don't do this, we can have a very big epilogue - when alignment
+	     is statically unknown it goes byte by byte, which may be very slow.  */
+	  rtx epilogue_loop_jump_around = gen_label_rtx ();
+	  rtx tmp = plus_constant (loop_iter, GET_MODE_SIZE (move_mode));
+	  emit_cmp_and_jump_insns (count_exp, tmp, LT, NULL_RTX,
+				   counter_mode (count_exp), true,
+				   epilogue_loop_jump_around);
+	  predict_jump (REG_BR_PROB_BASE * 10 / 100);
+	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+	      NULL, vec_promoted_val, count_exp,
+	      loop_iter, move_mode, 1,
+	      expected_size, false);
+	  emit_label (epilogue_loop_jump_around);
+	  dst = change_address (dst, BLKmode, destreg);
+	  epilogue_size_needed = GET_MODE_SIZE (move_mode);
+	}
+      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != destreg)
+	emit_move_insn (destreg, tmp);
     }
+  if (count_exp != const0_rtx && epilogue_size_needed > 1)
+    expand_setmem_epilogue (dst, destreg, promoted_val, val_exp, count_exp,
+			    epilogue_size_needed);
   if (jump_around_label)
     emit_label (jump_around_label);
   return true;
@@ -34598,6 +35030,87 @@ ix86_autovectorize_vector_sizes (void)
   return (TARGET_AVX && !TARGET_PREFER_AVX128) ? 32 | 16 : 0;
 }
 
+/* Target hook.  Prevent unaligned access to data in vector modes.  */
+
+static bool
+ix86_slow_unaligned_access (enum machine_mode mode,
+			    unsigned int align)
+{
+  if (TARGET_AVX)
+    {
+      if (GET_MODE_SIZE (mode) == 32)
+	{
+	  if (align <= 16)
+	    return (TARGET_AVX256_SPLIT_UNALIGNED_LOAD
+		    || TARGET_AVX256_SPLIT_UNALIGNED_STORE);
+	  else
+	    return false;
+	}
+    }
+
+  if (GET_MODE_SIZE (mode) > 8)
+    {
+      return (! TARGET_SSE_UNALIGNED_LOAD_OPTIMAL
+	      && ! TARGET_SSE_UNALIGNED_STORE_OPTIMAL);
+    }
+
+  return false;
+}
+
+/* Target hook.  Returns an rtx of mode MODE with the promoted value VAL,
+   which is supposed to represent one byte.  MODE could be a vector mode.
+   Example:
+   1) VAL = const_int (0xAB), mode = SImode,
+   the result is const_int (0xABABABAB).
+   2) if VAL isn't const, then the result will be the result of a MUL
+   instruction of VAL and const_int (0x01010101) (for SImode).  */
+
+static rtx
+ix86_promote_rtx_for_memset (enum machine_mode mode  ATTRIBUTE_UNUSED,
+			      rtx val)
+{
+  enum machine_mode val_mode = GET_MODE (val);
+  gcc_assert (VALID_INT_MODE_P (val_mode) || val_mode == VOIDmode);
+
+  if (vector_extensions_used_for_mode (mode) && TARGET_SSE)
+    {
+      rtx promoted_val, vec_reg;
+      enum machine_mode vec_mode = TARGET_64BIT ? V2DImode : V4SImode;
+      if (CONST_INT_P (val))
+	{
+	  rtx const_vec;
+	  HOST_WIDE_INT int_val = (UINTVAL (val) & 0xFF)
+				   * (TARGET_64BIT
+				      ? 0x0101010101010101
+				      : 0x01010101);
+	  val = gen_int_mode (int_val, Pmode);
+	  vec_reg = gen_reg_rtx (vec_mode);
+	  const_vec = ix86_build_const_vector (vec_mode, true, val);
+	  if (mode != vec_mode)
+	    const_vec = convert_to_mode (vec_mode, const_vec, 1);
+	  emit_move_insn (vec_reg, const_vec);
+	  return vec_reg;
+	}
+      /* Else: val isn't const.  */
+      promoted_val = promote_duplicated_reg (Pmode, val);
+      vec_reg = gen_reg_rtx (vec_mode);
+      switch (vec_mode)
+	{
+	case V2DImode:
+	  emit_insn (gen_vec_dupv2di (vec_reg, promoted_val));
+	  break;
+	case V4SImode:
+	  emit_insn (gen_vec_dupv4si (vec_reg, promoted_val));
+	  break;
+	default:
+	  gcc_unreachable ();
+	  break;
+	}
+      return vec_reg;
+    }
+  return NULL_RTX;
+}
+
 /* Initialize the GCC target structure.  */
 #undef TARGET_RETURN_IN_MEMORY
 #define TARGET_RETURN_IN_MEMORY ix86_return_in_memory
@@ -34899,6 +35412,12 @@ ix86_autovectorize_vector_sizes (void)
 #undef TARGET_CONDITIONAL_REGISTER_USAGE
 #define TARGET_CONDITIONAL_REGISTER_USAGE ix86_conditional_register_usage
 
+#undef TARGET_SLOW_UNALIGNED_ACCESS
+#define TARGET_SLOW_UNALIGNED_ACCESS ix86_slow_unaligned_access
+
+#undef TARGET_PROMOTE_RTX_FOR_MEMSET
+#define TARGET_PROMOTE_RTX_FOR_MEMSET ix86_promote_rtx_for_memset
+
 #if TARGET_MACHO
 #undef TARGET_INIT_LIBFUNCS
 #define TARGET_INIT_LIBFUNCS darwin_rename_builtins
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 8cef4e7..cf6d092 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -156,8 +156,12 @@ struct processor_costs {
   const int fchs;		/* cost of FCHS instruction.  */
   const int fsqrt;		/* cost of FSQRT instruction.  */
 				/* Specify what algorithm
-				   to use for stringops on unknown size.  */
-  struct stringop_algs memcpy[2], memset[2];
+				   to use for stringops on unknown size.
+				   First index is used to specify whether
+				   alignment is known or not.
+				   Second - to specify whether 32 or 64 bits
+				   are used.  */
+  struct stringop_algs memcpy[2][2], memset[2][2];
   const int scalar_stmt_cost;   /* Cost of any scalar operation, excluding
 				   load and store.  */
   const int scalar_load_cost;   /* Cost of scalar load.  */
@@ -1712,7 +1716,7 @@ typedef struct ix86_args {
 /* If a clear memory operation would take CLEAR_RATIO or more simple
    move-instruction sequences, we will do a clrmem or libcall instead.  */
 
-#define CLEAR_RATIO(speed) ((speed) ? MIN (6, ix86_cost->move_ratio) : 2)
+#define CLEAR_RATIO(speed) ((speed) ? ix86_cost->move_ratio : 2)
 
 /* Define if shifts truncate the shift count which implies one can
    omit a sign-extension or zero-extension of a shift count.
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 7abee33..c2c8ef6 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -6345,6 +6345,13 @@
    (set_attr "prefix" "maybe_vex,maybe_vex,orig,orig,vex")
    (set_attr "mode" "TI,TI,V4SF,SF,SF")])
 
+(define_expand "sse2_loadq"
+ [(set (match_operand:V2DI 0 "register_operand")
+       (vec_concat:V2DI
+	 (match_operand:DI 1 "memory_operand")
+	 (const_int 0)))]
+  "!TARGET_64BIT && TARGET_SSE2")
+
 (define_insn_and_split "sse2_stored"
   [(set (match_operand:SI 0 "nonimmediate_operand" "=xm,r")
 	(vec_select:SI
@@ -6456,6 +6463,16 @@
    (set_attr "prefix" "maybe_vex,orig,vex,maybe_vex,orig,orig")
    (set_attr "mode" "V2SF,TI,TI,TI,V4SF,V2SF")])
 
+(define_expand "vec_dupv4si"
+  [(set (match_operand:V4SI 0 "register_operand" "")
+	(vec_duplicate:V4SI
+	  (match_operand:SI 1 "nonimmediate_operand" "")))]
+  "TARGET_SSE"
+{
+  if (!TARGET_AVX)
+    operands[1] = force_reg (V4SImode, operands[1]);
+})
+
 (define_insn "*vec_dupv4si_avx"
   [(set (match_operand:V4SI 0 "register_operand"     "=x,x")
 	(vec_duplicate:V4SI
@@ -6496,6 +6513,16 @@
    (set_attr "prefix" "orig,vex,maybe_vex")
    (set_attr "mode" "TI,TI,DF")])
 
+(define_expand "vec_dupv2di"
+  [(set (match_operand:V2DI 0 "register_operand" "")
+	(vec_duplicate:V2DI
+	  (match_operand:DI 1 "nonimmediate_operand" "")))]
+  "TARGET_SSE"
+{
+  if (!TARGET_AVX)
+    operands[1] = force_reg (V2DImode, operands[1]);
+})
+
 (define_insn "*vec_dupv2di"
   [(set (match_operand:V2DI 0 "register_operand" "=Y2,x")
 	(vec_duplicate:V2DI
diff --git a/gcc/cse.c b/gcc/cse.c
index a078329..9cf70ce 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -4614,7 +4614,10 @@ cse_insn (rtx insn)
 		 to fold switch statements when an ADDR_DIFF_VEC is used.  */
 	      || (GET_CODE (src_folded) == MINUS
 		  && GET_CODE (XEXP (src_folded, 0)) == LABEL_REF
-		  && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF)))
+		  && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF))
+	      /* Don't propagate vector constants, as for now no architecture
+		 supports vector immediates.  */
+	  && !vector_extensions_used_for_mode (mode))
 	src_const = src_folded, src_const_elt = elt;
       else if (src_const == 0 && src_eqv_here && CONSTANT_P (src_eqv_here))
 	src_const = src_eqv_here, src_const_elt = src_eqv_elt;
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index c0648a5..44e9947 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5747,6 +5747,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum machine_mode @var{mode}, unsigned int @var{align})
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options require building faster code, the hook can
+be used to prevent accesses to unaligned data in some set of modes even if
+the processor can do the access without a trap.
+
+By default the hook returns the value of the @code{SLOW_UNALIGNED_ACCESS}
+macro if it is defined, and @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
+@deftypefn {Target Hook} rtx TARGET_PROMOTE_RTX_FOR_MEMSET (enum machine_mode @var{mode}, rtx @var{val})
+This hook returns an rtx of mode @var{mode} holding the promoted value
+@var{val}, or NULL.  The hook generates the instructions that are needed to
+promote @var{val} to mode @var{mode}.
+If generation of the promotion instructions fails, the hook returns NULL.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6219,23 +6245,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 3660d36..0e41fb4 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -5690,6 +5690,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@hook TARGET_SLOW_UNALIGNED_ACCESS
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options require building faster code, the hook can
+be used to prevent accesses to unaligned data in some set of modes even if
+the processor can do the access without a trap.
+
+By default the hook returns the value of the @code{SLOW_UNALIGNED_ACCESS}
+macro if it is defined, and @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
+@hook TARGET_PROMOTE_RTX_FOR_MEMSET
+This hook returns an rtx of mode @var{mode} holding the promoted value
+@var{val}, or NULL.  The hook generates the instructions that are needed to
+promote @var{val} to mode @var{mode}.
+If generation of the promotion instructions fails, the hook returns NULL.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6162,23 +6188,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c
index c641b7e..18e1a8c 100644
--- a/gcc/emit-rtl.c
+++ b/gcc/emit-rtl.c
@@ -1504,6 +1504,11 @@ get_mem_align_offset (rtx mem, unsigned int align)
       if (TYPE_ALIGN (TREE_TYPE (expr)) < (unsigned int) align)
 	return -1;
     }
+  else if (TREE_CODE (expr) == MEM_REF)
+    {
+      if (MEM_ALIGN (mem) < (unsigned int) align)
+	return -1;
+    }
   else if (TREE_CODE (expr) == COMPONENT_REF)
     {
       while (1)
@@ -2059,9 +2064,14 @@ adjust_address_1 (rtx memref, enum machine_mode mode, HOST_WIDE_INT offset,
      lowest-order set bit in OFFSET, but don't change the alignment if OFFSET
      if zero.  */
   if (offset != 0)
-    memalign
-      = MIN (memalign,
-	     (unsigned HOST_WIDE_INT) (offset & -offset) * BITS_PER_UNIT);
+    {
+      int old_offset = get_mem_align_offset (memref, MOVE_MAX*BITS_PER_UNIT);
+      if (old_offset >= 0)
+	memalign = compute_align_by_offset (old_offset + offset);
+      else
+	memalign = MIN (memalign,
+	      (unsigned HOST_WIDE_INT) (offset & -offset) * BITS_PER_UNIT);
+    }
 
   /* We can compute the size in a number of ways.  */
   if (GET_MODE (new_rtx) != BLKmode)
diff --git a/gcc/expr.c b/gcc/expr.c
index fb4379f..410779a 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -125,15 +125,18 @@ struct store_by_pieces_d
 static unsigned HOST_WIDE_INT move_by_pieces_ninsns (unsigned HOST_WIDE_INT,
 						     unsigned int,
 						     unsigned int);
-static void move_by_pieces_1 (rtx (*) (rtx, ...), enum machine_mode,
-			      struct move_by_pieces_d *);
+static void move_by_pieces_insn (rtx (*) (rtx, ...), enum machine_mode,
+		  struct move_by_pieces_d *);
 static bool block_move_libcall_safe_for_call_parm (void);
 static bool emit_block_move_via_movmem (rtx, rtx, rtx, unsigned, unsigned, HOST_WIDE_INT);
 static tree emit_block_move_libcall_fn (int);
 static void emit_block_move_via_loop (rtx, rtx, rtx, unsigned);
 static rtx clear_by_pieces_1 (void *, HOST_WIDE_INT, enum machine_mode);
 static void clear_by_pieces (rtx, unsigned HOST_WIDE_INT, unsigned int);
+static void set_by_pieces_1 (struct store_by_pieces_d *, unsigned int);
 static void store_by_pieces_1 (struct store_by_pieces_d *, unsigned int);
+static void set_by_pieces_2 (rtx (*) (rtx, ...), enum machine_mode,
+			       struct store_by_pieces_d *, rtx);
 static void store_by_pieces_2 (rtx (*) (rtx, ...), enum machine_mode,
 			       struct store_by_pieces_d *);
 static tree clear_storage_libcall_fn (int);
@@ -160,6 +163,12 @@ static void do_tablejump (rtx, enum machine_mode, rtx, rtx, rtx);
 static rtx const_vector_from_tree (tree);
 static void write_complex_part (rtx, rtx, bool);
 
+static enum machine_mode widest_mode_for_unaligned_mov (unsigned HOST_WIDE_INT);
+static enum machine_mode widest_mode_for_aligned_mov (unsigned HOST_WIDE_INT,
+						      unsigned int);
+static enum machine_mode generate_move_with_mode (struct store_by_pieces_d *,
+					   enum machine_mode, rtx *, rtx *);
+
 /* This macro is used to determine whether move_by_pieces should be called
    to perform a structure copy.  */
 #ifndef MOVE_BY_PIECES_P
@@ -808,7 +817,7 @@ alignment_for_piecewise_move (unsigned int max_pieces, unsigned int align)
 	   tmode != VOIDmode;
 	   xmode = tmode, tmode = GET_MODE_WIDER_MODE (tmode))
 	if (GET_MODE_SIZE (tmode) > max_pieces
-	    || SLOW_UNALIGNED_ACCESS (tmode, align))
+	    || targetm.slow_unaligned_access (tmode, align))
 	  break;
 
       align = MAX (align, GET_MODE_ALIGNMENT (xmode));
@@ -817,11 +826,66 @@ alignment_for_piecewise_move (unsigned int max_pieces, unsigned int align)
   return align;
 }
 
+/* Given an offset from an alignment border,
+   compute the maximal alignment of the offsetted data.  */
+unsigned int
+compute_align_by_offset (int offset)
+{
+  return (offset == 0)
+	 ? MOVE_MAX * BITS_PER_UNIT
+	 : MIN (MOVE_MAX, (offset & -offset)) * BITS_PER_UNIT;
+}
+
+/* Estimate the cost of a move for the given size and offset.  The offset
+   is used for determining the max alignment.  */
+static int
+compute_aligned_cost (unsigned HOST_WIDE_INT size, int offset)
+{
+  unsigned HOST_WIDE_INT cost = 0;
+  int cur_off = offset;
+
+  while (size > 0)
+    {
+      enum machine_mode mode = widest_mode_for_aligned_mov (size,
+	  compute_align_by_offset (cur_off));
+      int cur_mode_cost;
+      enum vect_cost_for_stmt type_of_cost = vector_load;
+      if (GET_MODE_SIZE (mode) <= UNITS_PER_WORD
+	  && (SCALAR_INT_MODE_P (mode) || SCALAR_FLOAT_MODE_P (mode)))
+	type_of_cost = scalar_load;
+      cur_mode_cost =
+	targetm.vectorize.builtin_vectorization_cost (type_of_cost, NULL, 0);
+      size -= GET_MODE_SIZE (mode);
+      cur_off += GET_MODE_SIZE (mode);
+      cost += cur_mode_cost;
+    }
+  return cost;
+}
+
+/* Estimate the cost of a move for the given size.  It is assumed that
+   the alignment is unknown, so we need to use unaligned moves.  */
+static int
+compute_unaligned_cost (unsigned HOST_WIDE_INT size)
+{
+  unsigned HOST_WIDE_INT cost = 0;
+  while (size > 0)
+    {
+      enum machine_mode mode = widest_mode_for_unaligned_mov (size);
+      unsigned HOST_WIDE_INT n_insns = size / GET_MODE_SIZE (mode);
+      int cur_mode_cost =
+	targetm.vectorize.builtin_vectorization_cost (unaligned_load, NULL, 0);
+
+      cost += n_insns * cur_mode_cost;
+      size %= GET_MODE_SIZE (mode);
+    }
+  return cost;
+}
+
 /* Return the widest integer mode no wider than SIZE.  If no such mode
    can be found, return VOIDmode.  */
 
 static enum machine_mode
-widest_int_mode_for_size (unsigned int size)
+widest_int_mode_for_size (unsigned HOST_WIDE_INT size)
 {
   enum machine_mode tmode, mode = VOIDmode;
 
@@ -833,6 +897,170 @@ widest_int_mode_for_size (unsigned int size)
   return mode;
 }
 
+/* If MODE is a scalar mode, find the corresponding preferred vector mode.
+   If no such mode can be found, return the vector mode corresponding to Pmode
+   (a kind of default vector mode).
+   For vector modes, return the mode itself.  */
+
+static enum machine_mode
+vector_mode_for_mode (enum machine_mode mode)
+{
+  enum machine_mode xmode;
+  if (VECTOR_MODE_P (mode))
+    return mode;
+  xmode = targetm.vectorize.preferred_simd_mode (mode);
+  if (VECTOR_MODE_P (xmode))
+    return xmode;
+
+  return targetm.vectorize.preferred_simd_mode (Pmode);
+}
+
+/* The routine checks whether vector instructions are required for operating
+   on the specified mode.
+   For vector modes it checks whether the corresponding vector extension is
+   supported.
+   Operations on a scalar mode will use vector extensions if this scalar
+   mode is wider than the default scalar mode (Pmode) and a vector extension
+   for the parent vector mode is available.  */
+
+bool vector_extensions_used_for_mode (enum machine_mode mode)
+{
+  enum machine_mode vector_mode = vector_mode_for_mode (mode);
+
+  if (VECTOR_MODE_P (mode))
+    return targetm.vector_mode_supported_p (mode);
+
+  /* mode is a scalar mode.  */
+  if (VECTOR_MODE_P (vector_mode)
+     && targetm.vector_mode_supported_p (vector_mode)
+     && (GET_MODE_SIZE (mode) > GET_MODE_SIZE (Pmode)))
+    return true;
+
+  return false;
+}
+
+/* Find the widest move mode for the given size if alignment is unknown.  */
+static enum machine_mode
+widest_mode_for_unaligned_mov (unsigned HOST_WIDE_INT size)
+{
+  enum machine_mode mode;
+  enum machine_mode tmode, xmode;
+  enum machine_mode best_simd_mode = targetm.vectorize.preferred_simd_mode (
+      mode_for_size (UNITS_PER_WORD * BITS_PER_UNIT, MODE_INT, 0));
+
+  /* Find the widest integer mode.  Here we can find modes wider than Pmode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_INT), xmode = VOIDmode;
+       tmode != VOIDmode;
+       tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size
+	  || targetm.slow_unaligned_access (tmode, BITS_PER_UNIT))
+	break;
+      if (optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.scalar_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+  mode = xmode;
+
+  /* Find the widest vector mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_VECTOR_INT), xmode = VOIDmode;
+       tmode != VOIDmode;
+       tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size
+	  || targetm.slow_unaligned_access (tmode, BITS_PER_UNIT))
+	break;
+      if (GET_MODE_SIZE (GET_MODE_INNER (tmode)) == UNITS_PER_WORD
+	  && optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.vector_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+
+  /* Choose between integer and vector modes.  */
+  if (xmode != VOIDmode && GET_MODE_SIZE (xmode) > GET_MODE_SIZE (mode))
+    mode = xmode;
+
+  /* If the vector and scalar modes found have the same size, and the vector
+     mode is best_simd_mode, prefer the vector mode to the scalar mode.  */
+  if (xmode != VOIDmode
+      && GET_MODE_SIZE (xmode) == GET_MODE_SIZE (mode)
+      && xmode == best_simd_mode)
+    mode = xmode;
+
+  /* If we failed to find a mode that might use vector extensions, try to
+     find widest ordinary integer mode.  */
+  if (mode == VOIDmode)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  /* If the mode found won't use vector extensions, then there is no need to
+     use a mode wider than Pmode.  */
+  if (!vector_extensions_used_for_mode (mode)
+      && GET_MODE_SIZE (mode) > MOVE_MAX_PIECES)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  return mode;
+}
+
+/* Find the widest move mode for the given size and alignment.  */
+static enum machine_mode
+widest_mode_for_aligned_mov (unsigned HOST_WIDE_INT size, unsigned int align)
+{
+  enum machine_mode mode;
+  enum machine_mode tmode, xmode;
+  enum machine_mode best_simd_mode = targetm.vectorize.preferred_simd_mode (
+      mode_for_size (UNITS_PER_WORD * BITS_PER_UNIT, MODE_INT, 0));
+
+  /* Find the widest integer mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_INT), xmode = VOIDmode;
+      tmode != VOIDmode;
+      tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size || GET_MODE_ALIGNMENT (tmode) > align)
+	break;
+      if (optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.scalar_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+  mode = xmode;
+
+  /* Find the widest vector mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_VECTOR_INT), xmode = VOIDmode;
+      tmode != VOIDmode;
+      tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size || GET_MODE_ALIGNMENT (tmode) > align)
+	break;
+      if (GET_MODE_SIZE (GET_MODE_INNER (tmode)) == UNITS_PER_WORD
+	  && optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.vector_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+
+  /* Choose between integer and vector modes.  */
+  if (xmode != VOIDmode && GET_MODE_SIZE (xmode) > GET_MODE_SIZE (mode))
+    mode = xmode;
+
+  /* If the vector and scalar modes found have the same size, and the vector
+     mode is best_simd_mode, prefer the vector mode to the scalar mode.  */
+  if (xmode != VOIDmode
+      && GET_MODE_SIZE (xmode) == GET_MODE_SIZE (mode)
+      && xmode == best_simd_mode)
+    mode = xmode;
+
+  /* If we failed to find a mode that might use vector extensions, try to
+     find widest ordinary integer mode.  */
+  if (mode == VOIDmode)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  /* If the mode found won't use vector extensions, then there is no need to
+     use a mode wider than Pmode.  */
+  if (!vector_extensions_used_for_mode (mode)
+      && GET_MODE_SIZE (mode) > MOVE_MAX_PIECES)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  return mode;
+}
+
 /* STORE_MAX_PIECES is the number of bytes at a time that we can
    store efficiently.  Due to internal GCC limitations, this is
    MOVE_MAX_PIECES limited by the number of bytes GCC can represent
@@ -873,6 +1101,7 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
   rtx to_addr, from_addr = XEXP (from, 0);
   unsigned int max_size = MOVE_MAX_PIECES + 1;
   enum insn_code icode;
+  int dst_offset, src_offset;
 
   align = MIN (to ? MEM_ALIGN (to) : align, MEM_ALIGN (from));
 
@@ -957,23 +1186,37 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
 	data.to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
     }
 
-  align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
-
-  /* First move what we can in the largest integer mode, then go to
-     successively smaller modes.  */
-
-  while (max_size > 1)
+  src_offset = get_mem_align_offset (from, MOVE_MAX * BITS_PER_UNIT);
+  dst_offset = get_mem_align_offset (to, MOVE_MAX * BITS_PER_UNIT);
+  if (src_offset < 0
+      || dst_offset < 0
+      || src_offset != dst_offset
+      || compute_aligned_cost (data.len, src_offset) >=
+	 compute_unaligned_cost (data.len))
     {
-      enum machine_mode mode = widest_int_mode_for_size (max_size);
-
-      if (mode == VOIDmode)
-	break;
+      while (data.len > 0)
+	{
+	  enum machine_mode mode = widest_mode_for_unaligned_mov (data.len);
 
-      icode = optab_handler (mov_optab, mode);
-      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
-	move_by_pieces_1 (GEN_FCN (icode), mode, &data);
+	  icode = optab_handler (mov_optab, mode);
+	  gcc_assert (icode != CODE_FOR_nothing);
+	  move_by_pieces_insn (GEN_FCN (icode), mode, &data);
+	}
+    }
+  else
+    {
+      while (data.len > 0)
+	{
+	  enum machine_mode mode;
+	  mode = widest_mode_for_aligned_mov (data.len,
+	      compute_align_by_offset (src_offset));
 
-      max_size = GET_MODE_SIZE (mode);
+	  icode = optab_handler (mov_optab, mode);
+	  gcc_assert (icode != CODE_FOR_nothing &&
+	      compute_align_by_offset (src_offset) >= GET_MODE_ALIGNMENT (mode));
+	  move_by_pieces_insn (GEN_FCN (icode), mode, &data);
+	  src_offset += GET_MODE_SIZE (mode);
+	}
     }
 
   /* The code above should have handled everything.  */
@@ -1011,35 +1254,47 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
 }
 
 /* Return number of insns required to move L bytes by pieces.
-   ALIGN (in bits) is maximum alignment we can assume.  */
+   ALIGN (in bits) is maximum alignment we can assume.
+   This is just an estimate, so the actual number of instructions might
+   differ from it (there are several ways of expanding memmove).  */
 
 static unsigned HOST_WIDE_INT
 move_by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align,
-		       unsigned int max_size)
+		       unsigned int max_size ATTRIBUTE_UNUSED)
 {
   unsigned HOST_WIDE_INT n_insns = 0;
-
-  align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
-
-  while (max_size > 1)
+  unsigned HOST_WIDE_INT n_insns_u = 0;
+  enum machine_mode mode;
+  unsigned HOST_WIDE_INT len = l;
+  while (len > 0)
     {
-      enum machine_mode mode;
-      enum insn_code icode;
-
-      mode = widest_int_mode_for_size (max_size);
-
-      if (mode == VOIDmode)
-	break;
+      mode = widest_mode_for_aligned_mov (len, align);
+      if (GET_MODE_SIZE (mode) < MOVE_MAX)
+	{
+	  align += GET_MODE_ALIGNMENT (mode);
+	  len -= GET_MODE_SIZE (mode);
+	  n_insns++;
+	}
+      else
+	{
+	  /* We are using the widest mode.  */
+	  n_insns += len / GET_MODE_SIZE (mode);
+	  len = len % GET_MODE_SIZE (mode);
+	}
+    }
+  gcc_assert (!len);
 
-      icode = optab_handler (mov_optab, mode);
-      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
-	n_insns += l / GET_MODE_SIZE (mode), l %= GET_MODE_SIZE (mode);
+  len = l;
+  while (len > 0)
+    {
+      mode = widest_mode_for_unaligned_mov (len);
+      n_insns_u += len / GET_MODE_SIZE (mode);
+      len = len % GET_MODE_SIZE (mode);
 
-      max_size = GET_MODE_SIZE (mode);
     }
 
-  gcc_assert (!l);
-  return n_insns;
+  gcc_assert (!len);
+  return MIN (n_insns, n_insns_u);
 }
 
 /* Subroutine of move_by_pieces.  Move as many bytes as appropriate
@@ -1047,60 +1302,57 @@ move_by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align,
    to make a move insn for that mode.  DATA has all the other info.  */
 
 static void
-move_by_pieces_1 (rtx (*genfun) (rtx, ...), enum machine_mode mode,
+move_by_pieces_insn (rtx (*genfun) (rtx, ...), enum machine_mode mode,
 		  struct move_by_pieces_d *data)
 {
   unsigned int size = GET_MODE_SIZE (mode);
   rtx to1 = NULL_RTX, from1;
 
-  while (data->len >= size)
-    {
-      if (data->reverse)
-	data->offset -= size;
-
-      if (data->to)
-	{
-	  if (data->autinc_to)
-	    to1 = adjust_automodify_address (data->to, mode, data->to_addr,
-					     data->offset);
-	  else
-	    to1 = adjust_address (data->to, mode, data->offset);
-	}
+  if (data->reverse)
+    data->offset -= size;
 
-      if (data->autinc_from)
-	from1 = adjust_automodify_address (data->from, mode, data->from_addr,
-					   data->offset);
+  if (data->to)
+    {
+      if (data->autinc_to)
+	to1 = adjust_automodify_address (data->to, mode, data->to_addr,
+					 data->offset);
       else
-	from1 = adjust_address (data->from, mode, data->offset);
+	to1 = adjust_address (data->to, mode, data->offset);
+    }
 
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
-	emit_insn (gen_add2_insn (data->to_addr,
-				  GEN_INT (-(HOST_WIDE_INT)size)));
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_from < 0)
-	emit_insn (gen_add2_insn (data->from_addr,
-				  GEN_INT (-(HOST_WIDE_INT)size)));
+  if (data->autinc_from)
+    from1 = adjust_automodify_address (data->from, mode, data->from_addr,
+				       data->offset);
+  else
+    from1 = adjust_address (data->from, mode, data->offset);
 
-      if (data->to)
-	emit_insn ((*genfun) (to1, from1));
-      else
-	{
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
+    emit_insn (gen_add2_insn (data->to_addr,
+			      GEN_INT (-(HOST_WIDE_INT)size)));
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_from < 0)
+    emit_insn (gen_add2_insn (data->from_addr,
+			      GEN_INT (-(HOST_WIDE_INT)size)));
+
+  if (data->to)
+    emit_insn ((*genfun) (to1, from1));
+  else
+    {
 #ifdef PUSH_ROUNDING
-	  emit_single_push_insn (mode, from1, NULL);
+      emit_single_push_insn (mode, from1, NULL);
 #else
-	  gcc_unreachable ();
+      gcc_unreachable ();
 #endif
-	}
+    }
 
-      if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
-	emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
-      if (HAVE_POST_INCREMENT && data->explicit_inc_from > 0)
-	emit_insn (gen_add2_insn (data->from_addr, GEN_INT (size)));
+  if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
+    emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
+  if (HAVE_POST_INCREMENT && data->explicit_inc_from > 0)
+    emit_insn (gen_add2_insn (data->from_addr, GEN_INT (size)));
 
-      if (! data->reverse)
-	data->offset += size;
+  if (! data->reverse)
+    data->offset += size;
 
-      data->len -= size;
-    }
+  data->len -= size;
 }
 \f
 /* Emit code to move a block Y to a block X.  This may be done with
@@ -1677,7 +1929,7 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src, tree type, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (src)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (src))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (src))
 	      || MEM_ALIGN (src) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -2067,7 +2319,7 @@ emit_group_store (rtx orig_dst, rtx src, tree type ATTRIBUTE_UNUSED, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (dest)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (dest))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (dest))
 	      || MEM_ALIGN (dest) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -2356,7 +2608,10 @@ store_by_pieces (rtx to, unsigned HOST_WIDE_INT len,
   data.constfundata = constfundata;
   data.len = len;
   data.to = to;
-  store_by_pieces_1 (&data, align);
+  if (memsetp)
+    set_by_pieces_1 (&data, align);
+  else
+    store_by_pieces_1 (&data, align);
   if (endp)
     {
       rtx to1;
@@ -2400,10 +2655,10 @@ clear_by_pieces (rtx to, unsigned HOST_WIDE_INT len, unsigned int align)
     return;
 
   data.constfun = clear_by_pieces_1;
-  data.constfundata = NULL;
+  data.constfundata = CONST0_RTX (QImode);
   data.len = len;
   data.to = to;
-  store_by_pieces_1 (&data, align);
+  set_by_pieces_1 (&data, align);
 }
 
 /* Callback routine for clear_by_pieces.
@@ -2417,13 +2672,121 @@ clear_by_pieces_1 (void *data ATTRIBUTE_UNUSED,
   return const0_rtx;
 }
 
-/* Subroutine of clear_by_pieces and store_by_pieces.
+/* Helper function for set_by_pieces_1: generates a move with the given mode.
+   Returns the mode actually used for the generated move (it can differ from
+   the requested one if that mode isn't supported).  */
+static enum machine_mode
+generate_move_with_mode (struct store_by_pieces_d *data,
+			 enum machine_mode mode,
+			 rtx *promoted_to_vector_value_ptr,
+			 rtx *promoted_value_ptr)
+{
+  enum insn_code icode;
+  rtx rhs = NULL_RTX;
+
+  gcc_assert (promoted_to_vector_value_ptr && promoted_value_ptr);
+
+  if (vector_extensions_used_for_mode (mode))
+    {
+      enum machine_mode vec_mode = vector_mode_for_mode (mode);
+      if (!(*promoted_to_vector_value_ptr))
+	*promoted_to_vector_value_ptr
+	  = targetm.promote_rtx_for_memset (vec_mode, (rtx)data->constfundata);
+
+      rhs = convert_to_mode (vec_mode, *promoted_to_vector_value_ptr, 1);
+    }
+  else
+    {
+      if (CONST_INT_P ((rtx)data->constfundata))
+	{
+	  /* We don't need to load the constant into a register if it can be
+	     encoded as an immediate operand.  */
+	  rtx imm_const;
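+	  /* Replicate the low byte of the constant across the whole mode by
+	     multiplying it by 0x01...01.  */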
+	  switch (mode)
+	    {
+	    case DImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x0101010101010101, DImode);
+	      break;
+	    case SImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x01010101, SImode);
+	      break;
+	    case HImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x00000101, HImode);
+	      break;
+	    case QImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x00000001, QImode);
+	      break;
+	    default:
+	      gcc_unreachable ();
+	      break;
+	    }
+	  rhs = imm_const;
+	}
+      else /* data->constfundata isn't const.  */
+	{
+	  if (!(*promoted_value_ptr))
+	    {
+	      rtx coeff;
+	      enum machine_mode promoted_value_mode;
+	      /* Choose a mode for the promoted value.  It shouldn't be
+		 narrower than Pmode.  */
+	      if (GET_MODE_SIZE (mode) > GET_MODE_SIZE (Pmode))
+		promoted_value_mode = mode;
+	      else
+		promoted_value_mode = Pmode;
+
+	      switch (promoted_value_mode)
+		{
+		case DImode:
+		  coeff = gen_int_mode (0x0101010101010101, DImode);
+		  break;
+		case SImode:
+		  coeff = gen_int_mode (0x01010101, SImode);
+		  break;
+		default:
+		  gcc_unreachable ();
+		  break;
+		}
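+	      /* Zero-extend the byte to PROMOTED_VALUE_MODE and multiply it
+		 by 0x01...01 so that every byte of the result equals the
+		 original byte.  */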
+	      *promoted_value_ptr = convert_to_mode (promoted_value_mode,
+						     (rtx)data->constfundata,
+						     1);
+	      *promoted_value_ptr = expand_mult (promoted_value_mode,
+						 *promoted_value_ptr, coeff,
+						 NULL_RTX, 1);
+	    }
+	  rhs = convert_to_mode (mode, *promoted_value_ptr, 1);
+	}
+    }
+  /* If RHS is null, then the requested mode isn't supported and can't be used.
+     Use Pmode instead.  */
+  if (!rhs)
+    {
+      generate_move_with_mode (data, Pmode, promoted_to_vector_value_ptr,
+			       promoted_value_ptr);
+      return Pmode;
+    }
+
+  gcc_assert (rhs);
+  icode = optab_handler (mov_optab, mode);
+  gcc_assert (icode != CODE_FOR_nothing);
+  set_by_pieces_2 (GEN_FCN (icode), mode, data, rhs);
+  return mode;
+}
+
+/* Subroutine of store_by_pieces.
    Generate several move instructions to store LEN bytes of block TO.  (A MEM
    rtx with BLKmode).  ALIGN is maximum alignment we can assume.  */
 
 static void
-store_by_pieces_1 (struct store_by_pieces_d *data ATTRIBUTE_UNUSED,
-		   unsigned int align ATTRIBUTE_UNUSED)
+store_by_pieces_1 (struct store_by_pieces_d *data, unsigned int align)
 {
   enum machine_mode to_addr_mode
     = targetm.addr_space.address_mode (MEM_ADDR_SPACE (data->to));
@@ -2498,6 +2861,134 @@ store_by_pieces_1 (struct store_by_pieces_d *data ATTRIBUTE_UNUSED,
   gcc_assert (!data->len);
 }
 
+/* Subroutine of clear_by_pieces and store_by_pieces.
+   Generate several move instructions to store LEN bytes of block TO.  (A MEM
+   rtx with BLKmode).  ALIGN is maximum alignment we can assume.
+   As opposed to store_by_pieces_1, this routine always generates code for
+   memset.  (store_by_pieces_1 is sometimes used to generate code for memcpy
+   rather than for memset).  */
+
+static void
+set_by_pieces_1 (struct store_by_pieces_d *data, unsigned int align)
+{
+  enum machine_mode to_addr_mode
+    = targetm.addr_space.address_mode (MEM_ADDR_SPACE (data->to));
+  rtx to_addr = XEXP (data->to, 0);
+  unsigned int max_size = STORE_MAX_PIECES + 1;
+  int dst_offset;
+  rtx promoted_to_vector_value = NULL_RTX;
+  rtx promoted_value = NULL_RTX;
+
+  data->offset = 0;
+  data->to_addr = to_addr;
+  data->autinc_to
+    = (GET_CODE (to_addr) == PRE_INC || GET_CODE (to_addr) == PRE_DEC
+       || GET_CODE (to_addr) == POST_INC || GET_CODE (to_addr) == POST_DEC);
+
+  data->explicit_inc_to = 0;
+  data->reverse
+    = (GET_CODE (to_addr) == PRE_DEC || GET_CODE (to_addr) == POST_DEC);
+  if (data->reverse)
+    data->offset = data->len;
+
+  /* If storing requires more than two move insns,
+     copy addresses to registers (to make displacements shorter)
+     and use post-increment if available.  */
+  if (!data->autinc_to
+      && move_by_pieces_ninsns (data->len, align, max_size) > 2)
+    {
+      /* Determine the main mode we'll be using.
+	 MODE might not be used depending on the definitions of the
+	 USE_* macros below.  */
+      enum machine_mode mode ATTRIBUTE_UNUSED
+	= widest_int_mode_for_size (max_size);
+
+      if (USE_STORE_PRE_DECREMENT (mode) && data->reverse && ! data->autinc_to)
+	{
+	  data->to_addr = copy_to_mode_reg (to_addr_mode,
+					    plus_constant (to_addr, data->len));
+	  data->autinc_to = 1;
+	  data->explicit_inc_to = -1;
+	}
+
+      if (USE_STORE_POST_INCREMENT (mode) && ! data->reverse
+	  && ! data->autinc_to)
+	{
+	  data->to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
+	  data->autinc_to = 1;
+	  data->explicit_inc_to = 1;
+	}
+
+      if ( !data->autinc_to && CONSTANT_P (to_addr))
+	data->to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
+    }
+
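+  /* As in move_by_pieces, use the unaligned expansion path unless the
+     destination alignment is known and the aligned cost estimate is
+     strictly lower.  */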
+  dst_offset = get_mem_align_offset (data->to, MOVE_MAX*BITS_PER_UNIT);
+  if (dst_offset < 0
+      || compute_aligned_cost (data->len, dst_offset) >=
+	 compute_unaligned_cost (data->len))
+    {
+      while (data->len > 0)
+	{
+	  enum machine_mode mode = widest_mode_for_unaligned_mov (data->len);
+	  generate_move_with_mode (data, mode, &promoted_to_vector_value,
+				   &promoted_value);
+	}
+    }
+  else
+    {
+      while (data->len > 0)
+	{
+	  enum machine_mode mode;
+	  mode = widest_mode_for_aligned_mov (data->len,
+	      compute_align_by_offset (dst_offset));
+	  mode = generate_move_with_mode (data, mode, &promoted_to_vector_value,
+				   &promoted_value);
+	  dst_offset += GET_MODE_SIZE (mode);
+	}
+    }
+
+  /* The code above should have handled everything.  */
+  gcc_assert (!data->len);
+}
+
+/* Subroutine of set_by_pieces_1.  Emit a move instruction with mode MODE.
+   DATA has info about the destination, RHS is the source, and GENFUN is the
+   gen_... function to make a move insn for that mode.  */
+
+static void
+set_by_pieces_2 (rtx (*genfun) (rtx, ...), enum machine_mode mode,
+		   struct store_by_pieces_d *data, rtx rhs)
+{
+  unsigned int size = GET_MODE_SIZE (mode);
+  rtx to1;
+
+  if (data->reverse)
+    data->offset -= size;
+
+  if (data->autinc_to)
+    to1 = adjust_automodify_address (data->to, mode, data->to_addr,
+	data->offset);
+  else
+    to1 = adjust_address (data->to, mode, data->offset);
+
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
+    emit_insn (gen_add2_insn (data->to_addr,
+	  GEN_INT (-(HOST_WIDE_INT) size)));
+
+  gcc_assert (rhs);
+
+  emit_insn ((*genfun) (to1, rhs));
+
+  if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
+    emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
+
+  if (! data->reverse)
+    data->offset += size;
+
+  data->len -= size;
+}
+
 /* Subroutine of store_by_pieces_1.  Store as many bytes as appropriate
    with move instructions for mode MODE.  GENFUN is the gen_... function
    to make a move insn for that mode.  DATA has all the other info.  */
@@ -3714,7 +4205,7 @@ emit_push_insn (rtx x, enum machine_mode mode, tree type, rtx size,
 	  /* Here we avoid the case of a structure whose weak alignment
 	     forces many pushes of a small amount of data,
 	     and such small pushes do rounding that causes trouble.  */
-	  && ((! SLOW_UNALIGNED_ACCESS (word_mode, align))
+	  && ((! targetm.slow_unaligned_access (word_mode, align))
 	      || align >= BIGGEST_ALIGNMENT
 	      || (PUSH_ROUNDING (align / BITS_PER_UNIT)
 		  == (align / BITS_PER_UNIT)))
@@ -5839,7 +6330,7 @@ store_field (rtx target, HOST_WIDE_INT bitsize, HOST_WIDE_INT bitpos,
       || (mode != BLKmode
 	  && ((((MEM_ALIGN (target) < GET_MODE_ALIGNMENT (mode))
 		|| bitpos % GET_MODE_ALIGNMENT (mode))
-	       && SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (target)))
+	       && targetm.slow_unaligned_access (mode, MEM_ALIGN (target)))
 	      || (bitpos % BITS_PER_UNIT != 0)))
       /* If the RHS and field are a constant size and the size of the
 	 RHS isn't the same size as the bitfield, we must use bitfield
@@ -9195,7 +9686,7 @@ expand_expr_real_1 (tree exp, rtx target, enum machine_mode tmode,
 		     && ((modifier == EXPAND_CONST_ADDRESS
 			  || modifier == EXPAND_INITIALIZER)
 			 ? STRICT_ALIGNMENT
-			 : SLOW_UNALIGNED_ACCESS (mode1, MEM_ALIGN (op0))))
+			 : targetm.slow_unaligned_access (mode1, MEM_ALIGN (op0))))
 		    || (bitpos % BITS_PER_UNIT != 0)))
 	    /* If the type and the field are a constant size and the
 	       size of the type isn't the same size as the bitfield,
diff --git a/gcc/expr.h b/gcc/expr.h
index cb4050d..b9ec9c2 100644
--- a/gcc/expr.h
+++ b/gcc/expr.h
@@ -693,4 +693,8 @@ extern tree build_libfunc_function (const char *);
 /* Get the personality libfunc for a function decl.  */
 rtx get_personality_function (tree);
 
+/* Given the offset from a maximum-alignment boundary, compute the maximum
+   alignment that can be assumed.  */
+unsigned int compute_align_by_offset (int);
+
 #endif /* GCC_EXPR_H */
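
Side note, not part of the patch: as a rough sketch of the mapping that
compute_align_by_offset is meant to provide, a helper of the following shape
could translate an offset from a MOVE_MAX-byte boundary into the alignment
(in bits) that is still guaranteed at that offset; the name is purely
illustrative, not the patch's actual implementation.

  static unsigned int
  example_align_by_offset (int offset)
  {
    if (offset == 0)
      return MOVE_MAX * BITS_PER_UNIT;
    /* Otherwise the guaranteed alignment follows from the lowest set bit.  */
    return (offset & -offset) * BITS_PER_UNIT;
  }
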
diff --git a/gcc/fwprop.c b/gcc/fwprop.c
index 5db9ed8..f15f957 100644
--- a/gcc/fwprop.c
+++ b/gcc/fwprop.c
@@ -1270,6 +1270,10 @@ forward_propagate_and_simplify (df_ref use, rtx def_insn, rtx def_set)
       return false;
     }
 
+  /* Don't propagate vector constants.  */
+  if (vector_extensions_used_for_mode (GET_MODE (reg)) && CONSTANT_P (src))
+    return false;
+
   if (asm_use >= 0)
     return forward_propagate_asm (use, def_insn, def_set, reg);
 
diff --git a/gcc/rtl.h b/gcc/rtl.h
index e3ceecd..4584206 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -2428,6 +2428,10 @@ extern void emit_jump (rtx);
 /* In expr.c */
 extern rtx move_by_pieces (rtx, rtx, unsigned HOST_WIDE_INT,
 			   unsigned int, int);
+/* Check whether vector instructions are required to operate on the
+   specified mode.  */
+bool vector_extensions_used_for_mode (enum machine_mode);
+
 
 /* In cfgrtl.c */
 extern void print_rtl_with_bb (FILE *, const_rtx);
diff --git a/gcc/target.def b/gcc/target.def
index c67f0ba..378e6e5 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1479,6 +1479,22 @@ DEFHOOK
  bool, (struct ao_ref_s *ref),
  default_ref_may_alias_errno)
 
+/* True if access to unaligned data in the given mode is too slow or
+   prohibited.  */
+DEFHOOK
+(slow_unaligned_access,
+ "",
+ bool, (enum machine_mode mode, unsigned int align),
+ default_slow_unaligned_access)
+
+/* Target hook.  Returns an rtx of mode MODE with the value VAL promoted to
+   fill that mode, or NULL.  VAL is supposed to represent one byte.  */
+DEFHOOK
+(promote_rtx_for_memset,
+ "",
+ rtx, (enum machine_mode mode, rtx val),
+ default_promote_rtx_for_memset)
+
 /* Support for named address spaces.  */
 #undef HOOK_PREFIX
 #define HOOK_PREFIX "TARGET_ADDR_SPACE_"
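
Side note, not part of the patch: a target that wants non-default behavior
overrides these hooks in its backend in the usual way.  A rough sketch with
purely illustrative names (the hook macro name assumes the default TARGET_
prefix used in this part of target.def):

  static bool
  example_slow_unaligned_access (enum machine_mode mode, unsigned int align)
  {
    /* Treat only under-aligned vector accesses as slow.  */
    return VECTOR_MODE_P (mode) && align < GET_MODE_ALIGNMENT (mode);
  }

  #undef TARGET_SLOW_UNALIGNED_ACCESS
  #define TARGET_SLOW_UNALIGNED_ACCESS example_slow_unaligned_access
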
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index bcb8a12..4c4e4bd 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -1441,4 +1441,24 @@ default_pch_valid_p (const void *data_p, size_t len)
   return NULL;
 }
 
+bool
+default_slow_unaligned_access (enum machine_mode mode ATTRIBUTE_UNUSED,
+			       unsigned int align ATTRIBUTE_UNUSED)
+{
+#ifdef SLOW_UNALIGNED_ACCESS
+  return SLOW_UNALIGNED_ACCESS (mode, align);
+#else
+  return STRICT_ALIGNMENT;
+#endif
+}
+
+/* Target hook.  Returns an rtx of mode MODE with the value VAL promoted to
+   fill that mode, or NULL.  VAL is supposed to represent one byte.  */
+rtx
+default_promote_rtx_for_memset (enum machine_mode mode ATTRIBUTE_UNUSED,
+				 rtx val ATTRIBUTE_UNUSED)
+{
+  return NULL_RTX;
+}
+
 #include "gt-targhooks.h"
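
Side note, not part of the patch: a minimal non-default implementation of
the promote_rtx_for_memset hook could recognize just the all-zero case, for
which CONST0_RTX of the vector mode is already the fully promoted value.  A
rough sketch with a purely illustrative name; a real target hook would
normally handle arbitrary byte values as well.

  static rtx
  example_promote_rtx_for_memset (enum machine_mode mode, rtx val)
  {
    if (VECTOR_MODE_P (mode) && CONST_INT_P (val) && INTVAL (val) == 0)
      return CONST0_RTX (mode);
    /* NULL means no promoted value is available.  */
    return NULL_RTX;
  }
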
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index ce89d32..27f2f4d 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -174,3 +174,6 @@ extern enum machine_mode default_get_reg_raw_mode(int);
 
 extern void *default_get_pch_validity (size_t *);
 extern const char *default_pch_valid_p (const void *, size_t);
+extern bool default_slow_unaligned_access (enum machine_mode mode,
+					   unsigned int align);
+extern rtx default_promote_rtx_for_memset (enum machine_mode mode, rtx val);
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-1.c
new file mode 100644
index 0000000..205f0b4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-2.c
new file mode 100644
index 0000000..f703cd3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-3.c
new file mode 100644
index 0000000..5a1e9c8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-4.c
new file mode 100644
index 0000000..e3139de
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-1.c
new file mode 100644
index 0000000..57f3946
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-2.c
new file mode 100644
index 0000000..4fc85a7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-1.c
new file mode 100644
index 0000000..90a7654
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemcpy" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-2.c
new file mode 100644
index 0000000..e95d696
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemcpy" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-1.c
new file mode 100644
index 0000000..eae12c3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-2.c
new file mode 100644
index 0000000..fe359fe
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-3.c
new file mode 100644
index 0000000..5b11dda
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-4.c
new file mode 100644
index 0000000..0c94a13
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-1.c
new file mode 100644
index 0000000..d7e9ce4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-2.c
new file mode 100644
index 0000000..4592e77
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-3.c
new file mode 100644
index 0000000..37b8840
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-4.c
new file mode 100644
index 0000000..5f60b60
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-au-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-1.c
new file mode 100644
index 0000000..6cbb09e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemcpy" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-au-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-2.c
new file mode 100644
index 0000000..e5d6bf6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-au-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-3.c
new file mode 100644
index 0000000..c5019f7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-1.c
new file mode 100644
index 0000000..d8de2d5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-2.c
new file mode 100644
index 0000000..9e8d76c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-3.c
new file mode 100644
index 0000000..0da49fe
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-4.c
new file mode 100644
index 0000000..9f1008d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-1.c
new file mode 100644
index 0000000..f1c737c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-2.c
new file mode 100644
index 0000000..57eeebc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-3.c
new file mode 100644
index 0000000..f6c5447
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-4.c
new file mode 100644
index 0000000..96420ab
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-1.c
new file mode 100644
index 0000000..461ca11
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-2.c
new file mode 100644
index 0000000..edb857a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-3.c
new file mode 100644
index 0000000..7d4cf5c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-4.c
new file mode 100644
index 0000000..93b2eac
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-1.c
new file mode 100644
index 0000000..60f8b44
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-2.c
new file mode 100644
index 0000000..f7d509c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-3.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-3.c
new file mode 100644
index 0000000..dadba3c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-4.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-4.c
new file mode 100644
index 0000000..7263ce8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-1.c
new file mode 100644
index 0000000..60c2947
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-2.c
new file mode 100644
index 0000000..6340dfb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-au-1.c b/gcc/testsuite/gcc.target/i386/memset-s3072-au-1.c
new file mode 100644
index 0000000..349b4ab
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemset" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-au-2.c b/gcc/testsuite/gcc.target/i386/memset-s3072-au-2.c
new file mode 100644
index 0000000..e6898d5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemset" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-1.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-1.c
new file mode 100644
index 0000000..7aad20f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-2.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-2.c
new file mode 100644
index 0000000..b554662
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-3.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-3.c
new file mode 100644
index 0000000..0e7b408
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-4.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-4.c
new file mode 100644
index 0000000..3e177f1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-1.c
new file mode 100644
index 0000000..7b82652
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-2.c
new file mode 100644
index 0000000..bf1e527
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-3.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-3.c
new file mode 100644
index 0000000..25fd0a4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-4.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-4.c
new file mode 100644
index 0000000..9807471
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-1.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-1.c
new file mode 100644
index 0000000..63ec791
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-2.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-2.c
new file mode 100644
index 0000000..68f335b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-3.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-3.c
new file mode 100644
index 0000000..ae8fcdf
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-4.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-4.c
new file mode 100644
index 0000000..0a551a9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-1.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-1.c
new file mode 100644
index 0000000..c35cfa3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-10.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-10.c
new file mode 100644
index 0000000..f6d1f00
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-10.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-11.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-11.c
new file mode 100644
index 0000000..12d85fe
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-11.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-12.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-12.c
new file mode 100644
index 0000000..322dc94
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-12.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-2.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-2.c
new file mode 100644
index 0000000..12c0f88
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-3.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-3.c
new file mode 100644
index 0000000..919920e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-4.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-4.c
new file mode 100644
index 0000000..772256f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-5.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-5.c
new file mode 100644
index 0000000..11b7176
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-5.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-6.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-6.c
new file mode 100644
index 0000000..a3ee5a6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-6.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-7.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-7.c
new file mode 100644
index 0000000..e9dffb1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-7.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-8.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-8.c
new file mode 100644
index 0000000..50cd32c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-8.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-9.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-9.c
new file mode 100644
index 0000000..bdd118e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-9.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-1.c
new file mode 100644
index 0000000..62642ec
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-2.c
new file mode 100644
index 0000000..200e018
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-3.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-3.c
new file mode 100644
index 0000000..05dd067
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-4.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-4.c
new file mode 100644
index 0000000..b818d63
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-1.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-1.c
new file mode 100644
index 0000000..6978faa
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-2.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-2.c
new file mode 100644
index 0000000..9a58ab2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-3.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-3.c
new file mode 100644
index 0000000..1c67b91
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-4.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-4.c
new file mode 100644
index 0000000..b4f0e1a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-1.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-1.c
new file mode 100644
index 0000000..2f6faf2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-2.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-2.c
new file mode 100644
index 0000000..bb89f1c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-3.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-3.c
new file mode 100644
index 0000000..1ecca60
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-4.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-4.c
new file mode 100644
index 0000000..836b56a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-5.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-5.c
new file mode 100644
index 0000000..98de8d8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-5.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-6.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-6.c
new file mode 100644
index 0000000..b2051be
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-6.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-7.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-7.c
new file mode 100644
index 0000000..4c9a267
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-7.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-8.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-8.c
new file mode 100644
index 0000000..2387509
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-8.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-07-11 21:09 ` Michael Zolotukhin
  2011-07-12  5:11   ` H.J. Lu
@ 2011-07-16  2:51   ` Jan Hubicka
  2011-07-18 11:25     ` Michael Zolotukhin
  1 sibling, 1 reply; 52+ messages in thread
From: Jan Hubicka @ 2011-07-16  2:51 UTC (permalink / raw)
  To: Michael Zolotukhin; +Cc: gcc-patches, Richard Guenther, H.J. Lu

> > New algorithm for move-mode selection is implemented for move_by_pieces,
> > store_by_pieces.
> > x86-specific ix86_expand_movmem and ix86_expand_setmem are also changed in
> > similar way, x86 cost-models parameters are slightly changed to support
> > this. This implementation checks if array's alignment is known at compile
> > time and chooses expanding algorithm and move-mode according to it.

Can you give some summary of the changes you made?  It would make it a lot easier
to review if it was broken up into the generic changes (with the rationale for why
they are needed) and the i386 backend changes, which I could then review.

From a first pass through the patch I don't quite see the need for e.g. adding
new move patterns when we can already output all kinds of SSE moves.  Will look
more into the patch to see if I can come up with useful comments.

Honza

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-07-16  2:51   ` Jan Hubicka
@ 2011-07-18 11:25     ` Michael Zolotukhin
  2011-07-26 15:46       ` Michael Zolotukhin
                         ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Michael Zolotukhin @ 2011-07-18 11:25 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, Richard Guenther, H.J. Lu, izamyatin

Here is a summary - it probably doesn't cover every single piece of
the patch, but I tried to describe the major changes. I hope this
helps a bit - and of course I'll answer any further questions as they
come up.

The changes can be logically divided into two parts (though the two
parts have something in common).
The first part changes the target-independent code, in the functions
move_by_pieces() and store_by_pieces() - mostly located in expr.c.
The second part touches ix86_expand_movmem() and ix86_expand_setmem()
- mostly located in config/i386/i386.c.

Changes in i386.c (target-dependent part):
1) Strategies for the cases of known and unknown alignment are now
separated from each other.
When alignment is known at compile time, we can generate optimized
code without libcalls.
When it is unknown, we can sometimes emit runtime checks to reach the
desired alignment, but not always.
Strategies for atom, generic_32 and generic_64 were chosen according
to a set of experiments; strategies in the other cost models are
unchanged (their strategies for unknown alignment are copied from the
existing ones).
2) The unrolled_loop algorithm was modified - it now uses SSE move
modes when they are available (see the sketch after this list).
3) As the amount of data moved in one iteration has greatly increased,
epilogues became bigger, so some changes were needed in epilogue
generation. In some cases a special (not unrolled) loop is generated
in the epilogue to avoid slow byte-by-byte copying (the changes in
expand_set_or_movmem_via_loop() and the introduction of
expand_set_or_movmem_via_loop_with_iter() serve these cases).
4) As a bigger alignment might now be needed than before, prologue
generation was also modified.
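
To make 2) and 3) a bit more concrete, here is a rough sketch, in
intrinsics, of the shape of code an SSE unrolled_loop expansion with a
non-unrolled epilogue loop amounts to (only an illustration, not the
RTL the patch emits; the function name is made up):

#include <emmintrin.h>
#include <stddef.h>

void
copy_unrolled_sse (char *dst, const char *src, size_t n)
{
  size_t i = 0;
  /* Main unrolled loop: 64 bytes (four SSE moves) per iteration.  */
  for (; i + 64 <= n; i += 64)
    {
      __m128i a = _mm_loadu_si128 ((const __m128i *) (src + i));
      __m128i b = _mm_loadu_si128 ((const __m128i *) (src + i + 16));
      __m128i c = _mm_loadu_si128 ((const __m128i *) (src + i + 32));
      __m128i d = _mm_loadu_si128 ((const __m128i *) (src + i + 48));
      _mm_storeu_si128 ((__m128i *) (dst + i), a);
      _mm_storeu_si128 ((__m128i *) (dst + i + 16), b);
      _mm_storeu_si128 ((__m128i *) (dst + i + 32), c);
      _mm_storeu_si128 ((__m128i *) (dst + i + 48), d);
    }
  /* Epilogue: a plain (not unrolled) loop for most of the tail,
     instead of copying the whole tail byte by byte.  */
  for (; i + 16 <= n; i += 16)
    {
      __m128i a = _mm_loadu_si128 ((const __m128i *) (src + i));
      _mm_storeu_si128 ((__m128i *) (dst + i), a);
    }
  for (; i < n; i++)
    dst[i] = src[i];
}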

Changes in expr.c (target-independent part):
There are two possible strategies now: using aligned or unaligned
moves. A cost model was implemented for each of them, and the choice
is made according to the cost of each option. The move-mode choice is
made by the functions widest_mode_for_unaligned_mov() and
widest_mode_for_aligned_mov().
Cost estimation is implemented in the functions compute_aligned_cost()
and compute_unaligned_cost().
The choice between these two strategies and the generation of the
moves themselves are in the function move_by_pieces().
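
Just as an illustration (all names and cost numbers below are made up,
this is not the actual GCC code), the decision boils down to something
like:

/* Toy model of the aligned-vs-unaligned choice: aligned vector moves
   are cheaper per move but may need a prologue of byte moves to reach
   the required alignment; unaligned moves need no prologue but cost
   more per move.  The cheaper estimate wins.  */
enum piece_strategy { USE_ALIGNED_MOVES, USE_UNALIGNED_MOVES };

static enum piece_strategy
choose_strategy (unsigned nbytes, unsigned prologue_bytes,
                 unsigned aligned_move_cost, unsigned unaligned_move_cost)
{
  unsigned move_size = 16;   /* one SSE move */
  unsigned aligned_cost = prologue_bytes
    + (nbytes - prologue_bytes) / move_size * aligned_move_cost;
  unsigned unaligned_cost = nbytes / move_size * unaligned_move_cost;
  return aligned_cost <= unaligned_cost
         ? USE_ALIGNED_MOVES : USE_UNALIGNED_MOVES;
}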

Function store_by_pieces() calls set_by_pieces_1() instead of
store_by_pieces_1() in the memset case (I needed to introduce
set_by_pieces_1() to separate the memset case from the others -
store_by_pieces_1() is sometimes called for strcpy and some other
functions, not only for memset).

set_by_pieces_1() estimates the costs of the aligned and unaligned
strategies (as in move_by_pieces()) and generates the moves for
memset. A single move is generated via generate_move_with_mode().
The first time it is called, a promoted value (a register filled with
the one-byte memset argument) is generated - later calls reuse this
value.
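
A minimal sketch of what that promotion means (the helper names here
are made up; the real code builds the value with the
promote_duplicated_reg* changes described below):

#include <emmintrin.h>

/* Replicate the one-byte memset value C across a wider register once,
   so every subsequent store can reuse it.  */
static unsigned long long
byte_to_di (unsigned char c)
{
  return c * 0x0101010101010101ULL;   /* C in all 8 bytes of a DImode reg */
}

static __m128i
byte_to_v16qi (unsigned char c)
{
  return _mm_set1_epi8 ((char) c);    /* C in all 16 bytes of an xmm reg */
}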

Changes in MD-files:
For the generation of promoted values, I made some changes in
promote_duplicated_reg() and promote_duplicated_reg_to_size(). Expands
for vec_dupv4si and vec_dupv2di were introduced for this too (these
expands differ from the corresponding define_insns - the existing
define_insns work only with registers, while the new expands can
process a memory operand as well).

Some code was added to allow generation of MOVQ (with SSE registers)
- such moves are not the usual ones, because they use only half of an
xmm register.
There was a need to generate such moves explicitly, so I added a
simple expand to sse.md.
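
In intrinsics terms the move in question corresponds to roughly the
following (only to picture the instruction; the patch of course emits
it at the RTL level):

#include <emmintrin.h>

/* An 8-byte copy done through the low half of an xmm register -
   a MOVQ load followed by a MOVQ store.  */
void
copy8_via_xmm (void *dst, const void *src)
{
  __m128i t = _mm_loadl_epi64 ((const __m128i *) src);  /* movq mem -> xmm */
  _mm_storel_epi64 ((__m128i *) dst, t);                /* movq xmm -> mem */
}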


On 16 July 2011 03:24, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > New algorithm for move-mode selection is implemented for move_by_pieces,
>> > store_by_pieces.
>> > x86-specific ix86_expand_movmem and ix86_expand_setmem are also changed in
>> > similar way, x86 cost-models parameters are slightly changed to support
>> > this. This implementation checks if array's alignment is known at compile
>> > time and chooses expanding algorithm and move-mode according to it.
>
> Can you give some sumary of changes you made?  It would make it a lot easier to
> review if it was broken up int the generic changes (with rationaly why they are
> needed) and i386 backend changes that I could review then.
>
> From first pass through the patch I don't quite see the need for i.e. adding
> new move patterns when we can output all kinds of SSE moves already.  Will look
> more into the patch to see if I can come up with useful comments.
>
> Honza
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-07-18 11:25     ` Michael Zolotukhin
@ 2011-07-26 15:46       ` Michael Zolotukhin
  2011-08-22  9:52       ` Michael Zolotukhin
       [not found]       ` <CANtU07-eCpAZ=VgvkdBCORq8bR0UZCgryofBXU_4FcRDJ7hWoQ@mail.gmail.com>
  2 siblings, 0 replies; 52+ messages in thread
From: Michael Zolotukhin @ 2011-07-26 15:46 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, Richard Guenther, H.J. Lu, izamyatin

Any updates/questions on this?

On 18 July 2011 15:00, Michael Zolotukhin
<michael.v.zolotukhin@gmail.com> wrote:
> Here is a summary - probably, it doesn't cover every single piece in
> the patch, but I tried to describe the major changes. I hope this will
> help you a bit - and of course I'll answer your further questions if
> they appear.
>
> The changes could be logically divided into two parts (though, these
> parts have something in common).
> The first part is changes in target-independent part, in functions
> move_by_pieces() and store_by_pieces() - mostly located in expr.c.
> The second part touches ix86_expand_movmem() and ix86_expand_setmem()
> - mostly located in config/i386/i386.c.
>
> Changes in i386.c (target-dependent part):
> 1) Strategies for cases with known and unknown alignment are separated
> from each other.
> When alignment is known at compile time, we could generate optimized
> code without libcalls.
> When it's unknown, we sometimes could create runtime-checks to reach
> desired alignment, but not always.
> Strategies for atom and generic_32, generic_64 were chosen according
> to set of experiments, strategies in other
> cost models are unchanged (strategies for unknown alignment are copied
> from existing strategies).
> 2) unrolled_loop algorithm was modified - now it uses SSE move-modes,
> if they're available.
> 3) As size of data, moved in one iteration, greatly increased, and
> epilogues became bigger - so some changes were needed in epilogue
> generation. In some cases a special loop (not unrolled) is generated
> in epilogue to avoid slow copying by bytes (changes in
> expand_set_or_movmem_via_loop() and introducing of
> expand_set_or_movmem_via_loop_with_iter() is made for these cases).
> 4) As bigger alignment might be needed than previously, prologue
> generation was also modified.
>
> Changes in expr.c (target-independent part):
> There are two possible strategies now: use of aligned and unaligned
> moves. For each of them a cost model was implemented and the choice is
> made according to the cost of each option. Move-mode choice is made by
> functions widest_mode_for_unaligned_mov() and
> widest_mode_for_aligned_mov().
> Cost estimation is implemented in functions compute_aligned_cost() and
> compute_unaligned_cost().
> Choice between these two strategies and the generation of moves
> themselves are in function move_by_pieces().
>
> Function store_by_pieces() calls set_by_pieces_1() instead of
> store_by_pieces_1(), if this is memset-case (I needed to introduce
> set_by_pieces_1 to separate memset-case from others -
> store_by_pieces_1 is sometimes called for strcpy and some other
> functions, not only for memset).
>
> Set_by_pieces_1() estimates costs of aligned and unaligned strategies
> (as in move_by_pieces() ) and generates moves for memset. Single move
> is generated via
> generate_move_with_mode(). If it's called first time, a promoted value
> (register, filled with one-byte value of memset argument) is generated
> - later calls reuse this value.
>
> Changes in MD-files:
> For generation of promoted values, I made some changes in
> promote_duplicated_reg() and promote_duplicated_reg_to_size(). Expands
> for vec_dup4si and vec_dupv2di were introduced for this too (these
> expands differ from corresponding define_insns - existing define_insn
> work only with registers, while new expands could process memory
> operand as well).
>
> Some code were added to allow generation of MOVQ (with SSE-registers)
> - such moves aren't usual ones, because they use only half of
> xmm-register.
> There was a need to generate such moves explicitly, so I added a
> simple expand to sse.md.
>
>
> On 16 July 2011 03:24, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> > New algorithm for move-mode selection is implemented for move_by_pieces,
>>> > store_by_pieces.
>>> > x86-specific ix86_expand_movmem and ix86_expand_setmem are also changed in
>>> > similar way, x86 cost-models parameters are slightly changed to support
>>> > this. This implementation checks if array's alignment is known at compile
>>> > time and chooses expanding algorithm and move-mode according to it.
>>
>> Can you give some sumary of changes you made?  It would make it a lot easier to
>> review if it was broken up int the generic changes (with rationaly why they are
>> needed) and i386 backend changes that I could review then.
>>
>> From first pass through the patch I don't quite see the need for i.e. adding
>> new move patterns when we can output all kinds of SSE moves already.  Will look
>> more into the patch to see if I can come up with useful comments.
>>
>> Honza
>>
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-07-18 11:25     ` Michael Zolotukhin
  2011-07-26 15:46       ` Michael Zolotukhin
@ 2011-08-22  9:52       ` Michael Zolotukhin
       [not found]       ` <CANtU07-eCpAZ=VgvkdBCORq8bR0UZCgryofBXU_4FcRDJ7hWoQ@mail.gmail.com>
  2 siblings, 0 replies; 52+ messages in thread
From: Michael Zolotukhin @ 2011-08-22  9:52 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, Richard Guenther, H.J. Lu, izamyatin

Ping.

On 18 July 2011 15:00, Michael Zolotukhin
<michael.v.zolotukhin@gmail.com> wrote:
> Here is a summary - probably, it doesn't cover every single piece in
> the patch, but I tried to describe the major changes. I hope this will
> help you a bit - and of course I'll answer your further questions if
> they appear.
>
> The changes could be logically divided into two parts (though, these
> parts have something in common).
> The first part is changes in target-independent part, in functions
> move_by_pieces() and store_by_pieces() - mostly located in expr.c.
> The second part touches ix86_expand_movmem() and ix86_expand_setmem()
> - mostly located in config/i386/i386.c.
>
> Changes in i386.c (target-dependent part):
> 1) Strategies for cases with known and unknown alignment are separated
> from each other.
> When alignment is known at compile time, we could generate optimized
> code without libcalls.
> When it's unknown, we sometimes could create runtime-checks to reach
> desired alignment, but not always.
> Strategies for atom and generic_32, generic_64 were chosen according
> to set of experiments, strategies in other
> cost models are unchanged (strategies for unknown alignment are copied
> from existing strategies).
> 2) unrolled_loop algorithm was modified - now it uses SSE move-modes,
> if they're available.
> 3) As size of data, moved in one iteration, greatly increased, and
> epilogues became bigger - so some changes were needed in epilogue
> generation. In some cases a special loop (not unrolled) is generated
> in epilogue to avoid slow copying by bytes (changes in
> expand_set_or_movmem_via_loop() and introducing of
> expand_set_or_movmem_via_loop_with_iter() is made for these cases).
> 4) As bigger alignment might be needed than previously, prologue
> generation was also modified.
>
> Changes in expr.c (target-independent part):
> There are two possible strategies now: use of aligned and unaligned
> moves. For each of them a cost model was implemented and the choice is
> made according to the cost of each option. Move-mode choice is made by
> functions widest_mode_for_unaligned_mov() and
> widest_mode_for_aligned_mov().
> Cost estimation is implemented in functions compute_aligned_cost() and
> compute_unaligned_cost().
> Choice between these two strategies and the generation of moves
> themselves are in function move_by_pieces().
>
> Function store_by_pieces() calls set_by_pieces_1() instead of
> store_by_pieces_1(), if this is memset-case (I needed to introduce
> set_by_pieces_1 to separate memset-case from others -
> store_by_pieces_1 is sometimes called for strcpy and some other
> functions, not only for memset).
>
> Set_by_pieces_1() estimates costs of aligned and unaligned strategies
> (as in move_by_pieces() ) and generates moves for memset. Single move
> is generated via
> generate_move_with_mode(). If it's called first time, a promoted value
> (register, filled with one-byte value of memset argument) is generated
> - later calls reuse this value.
>
> Changes in MD-files:
> For generation of promoted values, I made some changes in
> promote_duplicated_reg() and promote_duplicated_reg_to_size(). Expands
> for vec_dup4si and vec_dupv2di were introduced for this too (these
> expands differ from corresponding define_insns - existing define_insn
> work only with registers, while new expands could process memory
> operand as well).
>
> Some code were added to allow generation of MOVQ (with SSE-registers)
> - such moves aren't usual ones, because they use only half of
> xmm-register.
> There was a need to generate such moves explicitly, so I added a
> simple expand to sse.md.
>
>
> On 16 July 2011 03:24, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> > New algorithm for move-mode selection is implemented for move_by_pieces,
>>> > store_by_pieces.
>>> > x86-specific ix86_expand_movmem and ix86_expand_setmem are also changed in
>>> > similar way, x86 cost-models parameters are slightly changed to support
>>> > this. This implementation checks if array's alignment is known at compile
>>> > time and chooses expanding algorithm and move-mode according to it.
>>
>> Can you give some sumary of changes you made?  It would make it a lot easier to
>> review if it was broken up int the generic changes (with rationaly why they are
>> needed) and i386 backend changes that I could review then.
>>
>> From first pass through the patch I don't quite see the need for i.e. adding
>> new move patterns when we can output all kinds of SSE moves already.  Will look
>> more into the patch to see if I can come up with useful comments.
>>
>> Honza
>>
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
       [not found]       ` <CANtU07-eCpAZ=VgvkdBCORq8bR0UZCgryofBXU_4FcRDJ7hWoQ@mail.gmail.com>
@ 2011-09-28 12:29         ` Michael Zolotukhin
  2011-09-28 12:36           ` Michael Zolotukhin
  2011-09-28 14:15           ` Jack Howarth
  0 siblings, 2 replies; 52+ messages in thread
From: Michael Zolotukhin @ 2011-09-28 12:29 UTC (permalink / raw)
  To: gcc-patches
  Cc: Jan Hubicka, Richard Guenther, H.J. Lu, izamyatin, areg.melikadamyan

Attached is part 1 of the patch that enables the use of vector
instructions in memset and memcpy expanding (the middle-end part).
The main part of the changes is in the functions
move_by_pieces/set_by_pieces. In the new version the algorithm of
move-mode selection was changed - it now checks whether alignment is
known at compile time and uses cost models to choose between aligned
and unaligned, vector and non-vector move modes.

Build and 'make check' were tested - 'make check' shows one failure,
which will be cured once the complete patch is applied.

On 27 September 2011 18:44, Michael Zolotukhin
<michael.v.zolotukhin@gmail.com> wrote:
> I divided the patch into three smaller ones:
>
> 1) Patch with target-independent changes (see attached file memfunc-mid.patch).
> The main part of the changes is in functions
> move_by_pieces/set_by_pieces. In new version algorithm of move-mode
> selection was changed – now it checks if alignment is known at compile
> time and uses cost-models to choose between aligned and unaligned
> vector or not-vector move-modes.
>
> 2) Patch with target-dependent changes (memfunc-be.patch).
> The main part of the changes is in functions
> ix86_expand_setmem/ix86_expand_movmem. The other changes are only
> needed to support it.
> The changes mostly touched unrolled_loop strategy – now vector move
> modes could be used here. That resulted in large epilogues and
> prologues, so their generation also was modified.
> This patch contains some changes in middle-end (to make build
> possible) - but all these changes are present in the first patch, so
> there is no need to review them here.
>
> 3) Patch with all new tests (memfunc-tests.patch).
> This patch contains a lot of small tests for different memset and memcopy cases.
>
> Separately from each other, these patches won't give performance gain.
> The positive effect will be noticeable only if they are applied
> together (I attach the complete patch also - see file
> memfunc-complete.patch).
>
>
> If you have any questions regarding these changes, please don't
> hesitate to ask them.
>
>
> On 18 July 2011 15:00, Michael Zolotukhin
> <michael.v.zolotukhin@gmail.com> wrote:
>> Here is a summary - probably, it doesn't cover every single piece in
>> the patch, but I tried to describe the major changes. I hope this will
>> help you a bit - and of course I'll answer your further questions if
>> they appear.
>>
>> The changes could be logically divided into two parts (though, these
>> parts have something in common).
>> The first part is changes in target-independent part, in functions
>> move_by_pieces() and store_by_pieces() - mostly located in expr.c.
>> The second part touches ix86_expand_movmem() and ix86_expand_setmem()
>> - mostly located in config/i386/i386.c.
>>
>> Changes in i386.c (target-dependent part):
>> 1) Strategies for cases with known and unknown alignment are separated
>> from each other.
>> When alignment is known at compile time, we could generate optimized
>> code without libcalls.
>> When it's unknown, we sometimes could create runtime-checks to reach
>> desired alignment, but not always.
>> Strategies for atom and generic_32, generic_64 were chosen according
>> to set of experiments, strategies in other
>> cost models are unchanged (strategies for unknown alignment are copied
>> from existing strategies).
>> 2) unrolled_loop algorithm was modified - now it uses SSE move-modes,
>> if they're available.
>> 3) As size of data, moved in one iteration, greatly increased, and
>> epilogues became bigger - so some changes were needed in epilogue
>> generation. In some cases a special loop (not unrolled) is generated
>> in epilogue to avoid slow copying by bytes (changes in
>> expand_set_or_movmem_via_loop() and introducing of
>> expand_set_or_movmem_via_loop_with_iter() is made for these cases).
>> 4) As bigger alignment might be needed than previously, prologue
>> generation was also modified.
>>
>> Changes in expr.c (target-independent part):
>> There are two possible strategies now: use of aligned and unaligned
>> moves. For each of them a cost model was implemented and the choice is
>> made according to the cost of each option. Move-mode choice is made by
>> functions widest_mode_for_unaligned_mov() and
>> widest_mode_for_aligned_mov().
>> Cost estimation is implemented in functions compute_aligned_cost() and
>> compute_unaligned_cost().
>> Choice between these two strategies and the generation of moves
>> themselves are in function move_by_pieces().
>>
>> Function store_by_pieces() calls set_by_pieces_1() instead of
>> store_by_pieces_1(), if this is memset-case (I needed to introduce
>> set_by_pieces_1 to separate memset-case from others -
>> store_by_pieces_1 is sometimes called for strcpy and some other
>> functions, not only for memset).
>>
>> Set_by_pieces_1() estimates costs of aligned and unaligned strategies
>> (as in move_by_pieces() ) and generates moves for memset. Single move
>> is generated via
>> generate_move_with_mode(). If it's called first time, a promoted value
>> (register, filled with one-byte value of memset argument) is generated
>> - later calls reuse this value.
>>
>> Changes in MD-files:
>> For generation of promoted values, I made some changes in
>> promote_duplicated_reg() and promote_duplicated_reg_to_size(). Expands
>> for vec_dup4si and vec_dupv2di were introduced for this too (these
>> expands differ from corresponding define_insns - existing define_insn
>> work only with registers, while new expands could process memory
>> operand as well).
>>
>> Some code were added to allow generation of MOVQ (with SSE-registers)
>> - such moves aren't usual ones, because they use only half of
>> xmm-register.
>> There was a need to generate such moves explicitly, so I added a
>> simple expand to sse.md.
>>
>>
>> On 16 July 2011 03:24, Jan Hubicka <hubicka@ucw.cz> wrote:
>>>> > New algorithm for move-mode selection is implemented for move_by_pieces,
>>>> > store_by_pieces.
>>>> > x86-specific ix86_expand_movmem and ix86_expand_setmem are also changed in
>>>> > similar way, x86 cost-models parameters are slightly changed to support
>>>> > this. This implementation checks if array's alignment is known at compile
>>>> > time and chooses expanding algorithm and move-mode according to it.
>>>
>>> Can you give some sumary of changes you made?  It would make it a lot easier to
>>> review if it was broken up int the generic changes (with rationaly why they are
>>> needed) and i386 backend changes that I could review then.
>>>
>>> From first pass through the patch I don't quite see the need for i.e. adding
>>> new move patterns when we can output all kinds of SSE moves already.  Will look
>>> more into the patch to see if I can come up with useful comments.
>>>
>>> Honza
>>>
>>
>
> --
> ---
> Best regards,
> Michael V. Zolotukhin,
> Software Engineer
> Intel Corporation.
>



-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-09-28 12:29         ` Michael Zolotukhin
@ 2011-09-28 12:36           ` Michael Zolotukhin
  2011-09-28 12:38             ` Michael Zolotukhin
  2011-09-28 12:49             ` Andi Kleen
  2011-09-28 14:15           ` Jack Howarth
  1 sibling, 2 replies; 52+ messages in thread
From: Michael Zolotukhin @ 2011-09-28 12:36 UTC (permalink / raw)
  To: gcc-patches
  Cc: Jan Hubicka, Richard Guenther, H.J. Lu, izamyatin, areg.melikadamyan

[-- Attachment #1: Type: text/plain, Size: 8111 bytes --]

Attached is part 2 of the patch that enables the use of vector
instructions in memset and memcpy expanding (the back-end part).

The main part of the changes is in the functions
ix86_expand_setmem/ix86_expand_movmem. The other changes are only
needed to support them.
The changes mostly touch the unrolled_loop strategy - vector move
modes can now be used there. That results in larger epilogues and
prologues, so their generation was also modified.
This patch contains some changes in the middle end (to make the build
possible) - but all of these changes are present in the first part of
the patch, so there is no need to review them here.

Build and 'make check' were tested.


On 28 September 2011 14:56, Michael Zolotukhin
<michael.v.zolotukhin@gmail.com> wrote:
> Attached is a part 1 of patch that enables use of vector-instructions
> in memset and memcopy (middle-end part).
> The main part of the changes is in functions
> move_by_pieces/set_by_pieces. In new version algorithm of move-mode
> selection was changed – now it checks if alignment is known at compile
> time and uses cost-models to choose between aligned and unaligned
> vector or not-vector move-modes.
>
> Build and 'make check' was tested - in 'make check' there is a fail,
> that would be cured when complete patch is applied.
>
> On 27 September 2011 18:44, Michael Zolotukhin
> <michael.v.zolotukhin@gmail.com> wrote:
>> I divided the patch into three smaller ones:
>>
>> 1) Patch with target-independent changes (see attached file memfunc-mid.patch).
>> The main part of the changes is in functions
>> move_by_pieces/set_by_pieces. In new version algorithm of move-mode
>> selection was changed – now it checks if alignment is known at compile
>> time and uses cost-models to choose between aligned and unaligned
>> vector or not-vector move-modes.
>>
>> 2) Patch with target-dependent changes (memfunc-be.patch).
>> The main part of the changes is in functions
>> ix86_expand_setmem/ix86_expand_movmem. The other changes are only
>> needed to support it.
>> The changes mostly touched unrolled_loop strategy – now vector move
>> modes could be used here. That resulted in large epilogues and
>> prologues, so their generation also was modified.
>> This patch contains some changes in middle-end (to make build
>> possible) - but all these changes are present in the first patch, so
>> there is no need to review them here.
>>
>> 3) Patch with all new tests (memfunc-tests.patch).
>> This patch contains a lot of small tests for different memset and memcopy cases.
>>
>> Separately from each other, these patches won't give performance gain.
>> The positive effect will be noticeable only if they are applied
>> together (I attach the complete patch also - see file
>> memfunc-complete.patch).
>>
>>
>> If you have any questions regarding these changes, please don't
>> hesitate to ask them.
>>
>>
>> On 18 July 2011 15:00, Michael Zolotukhin
>> <michael.v.zolotukhin@gmail.com> wrote:
>>> Here is a summary - probably, it doesn't cover every single piece in
>>> the patch, but I tried to describe the major changes. I hope this will
>>> help you a bit - and of course I'll answer your further questions if
>>> they appear.
>>>
>>> The changes could be logically divided into two parts (though, these
>>> parts have something in common).
>>> The first part is changes in target-independent part, in functions
>>> move_by_pieces() and store_by_pieces() - mostly located in expr.c.
>>> The second part touches ix86_expand_movmem() and ix86_expand_setmem()
>>> - mostly located in config/i386/i386.c.
>>>
>>> Changes in i386.c (target-dependent part):
>>> 1) Strategies for cases with known and unknown alignment are separated
>>> from each other.
>>> When alignment is known at compile time, we could generate optimized
>>> code without libcalls.
>>> When it's unknown, we sometimes could create runtime-checks to reach
>>> desired alignment, but not always.
>>> Strategies for atom and generic_32, generic_64 were chosen according
>>> to set of experiments, strategies in other
>>> cost models are unchanged (strategies for unknown alignment are copied
>>> from existing strategies).
>>> 2) unrolled_loop algorithm was modified - now it uses SSE move-modes,
>>> if they're available.
>>> 3) As size of data, moved in one iteration, greatly increased, and
>>> epilogues became bigger - so some changes were needed in epilogue
>>> generation. In some cases a special loop (not unrolled) is generated
>>> in epilogue to avoid slow copying by bytes (changes in
>>> expand_set_or_movmem_via_loop() and introducing of
>>> expand_set_or_movmem_via_loop_with_iter() is made for these cases).
>>> 4) As bigger alignment might be needed than previously, prologue
>>> generation was also modified.
>>>
>>> Changes in expr.c (target-independent part):
>>> There are two possible strategies now: use of aligned and unaligned
>>> moves. For each of them a cost model was implemented and the choice is
>>> made according to the cost of each option. Move-mode choice is made by
>>> functions widest_mode_for_unaligned_mov() and
>>> widest_mode_for_aligned_mov().
>>> Cost estimation is implemented in functions compute_aligned_cost() and
>>> compute_unaligned_cost().
>>> Choice between these two strategies and the generation of moves
>>> themselves are in function move_by_pieces().
>>>
>>> Function store_by_pieces() calls set_by_pieces_1() instead of
>>> store_by_pieces_1(), if this is memset-case (I needed to introduce
>>> set_by_pieces_1 to separate memset-case from others -
>>> store_by_pieces_1 is sometimes called for strcpy and some other
>>> functions, not only for memset).
>>>
>>> Set_by_pieces_1() estimates costs of aligned and unaligned strategies
>>> (as in move_by_pieces() ) and generates moves for memset. Single move
>>> is generated via
>>> generate_move_with_mode(). If it's called first time, a promoted value
>>> (register, filled with one-byte value of memset argument) is generated
>>> - later calls reuse this value.
>>>
>>> Changes in MD-files:
>>> For generation of promoted values, I made some changes in
>>> promote_duplicated_reg() and promote_duplicated_reg_to_size(). Expands
>>> for vec_dup4si and vec_dupv2di were introduced for this too (these
>>> expands differ from corresponding define_insns - existing define_insn
>>> work only with registers, while new expands could process memory
>>> operand as well).
>>>
>>> Some code were added to allow generation of MOVQ (with SSE-registers)
>>> - such moves aren't usual ones, because they use only half of
>>> xmm-register.
>>> There was a need to generate such moves explicitly, so I added a
>>> simple expand to sse.md.
>>>
>>>
>>> On 16 July 2011 03:24, Jan Hubicka <hubicka@ucw.cz> wrote:
>>>>> > New algorithm for move-mode selection is implemented for move_by_pieces,
>>>>> > store_by_pieces.
>>>>> > x86-specific ix86_expand_movmem and ix86_expand_setmem are also changed in
>>>>> > similar way, x86 cost-models parameters are slightly changed to support
>>>>> > this. This implementation checks if array's alignment is known at compile
>>>>> > time and chooses expanding algorithm and move-mode according to it.
>>>>
>>>> Can you give some sumary of changes you made?  It would make it a lot easier to
>>>> review if it was broken up int the generic changes (with rationaly why they are
>>>> needed) and i386 backend changes that I could review then.
>>>>
>>>> From first pass through the patch I don't quite see the need for i.e. adding
>>>> new move patterns when we can output all kinds of SSE moves already.  Will look
>>>> more into the patch to see if I can come up with useful comments.
>>>>
>>>> Honza
>>>>
>>>
>>
>> --
>> ---
>> Best regards,
>> Michael V. Zolotukhin,
>> Software Engineer
>> Intel Corporation.
>>
>
>
>
> --
> ---
> Best regards,
> Michael V. Zolotukhin,
> Software Engineer
> Intel Corporation.
>



-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.

[-- Attachment #2: memfunc-be.patch --]
[-- Type: application/octet-stream, Size: 74731 bytes --]

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index f952d2e..416d74d 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
   COSTS_N_BYTES (2),			/* cost of FABS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FCHS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -632,10 +636,14 @@ struct processor_costs i386_cost = {	/* 386 specific costs */
   COSTS_N_INSNS (22),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (24),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (122),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -704,10 +712,14 @@ struct processor_costs i486_cost = {	/* 486 specific costs */
   COSTS_N_INSNS (3),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (3),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (83),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -774,10 +786,14 @@ struct processor_costs pentium_cost = {
   COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (70),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{-1, rep_prefix_4_byte}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -849,12 +865,18 @@ struct processor_costs pentiumpro_cost = {
      noticeable win, for bigger blocks either rep movsl or rep movsb is
      way to go.  Rep movsb has apparently more expensive startup time in CPU,
      but after 4K the difference is down in the noise.  */
-  {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
+  {{{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
 			{8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_4_byte, {{1024, unrolled_loop},
-  			{8192, rep_prefix_4_byte}, {-1, libcall}}},
+   {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_4_byte, {{1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_4_byte, {{1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -922,10 +944,14 @@ struct processor_costs geode_cost = {
   COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (54),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -995,10 +1021,14 @@ struct processor_costs k6_cost = {
   COSTS_N_INSNS (2),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (2),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (56),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1068,10 +1098,14 @@ struct processor_costs athlon_cost = {
   /* For some reason, Athlon deals better with REP prefix (relative to loops)
      compared to K8. Alignment becomes important after 8 bytes for memcpy and
      128 bytes for memset.  */
-  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1146,11 +1180,16 @@ struct processor_costs k8_cost = {
   /* K8 has optimized REP instruction for medium sized blocks, but for very
      small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1233,11 +1272,16 @@ struct processor_costs amdfam10_cost = {
   /* AMDFAM10 has optimized REP instruction for medium sized blocks, but for
      very small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1320,11 +1364,16 @@ struct processor_costs bdver1_cost = {
   /*  BDVER1 has optimized REP instruction for medium sized blocks, but for
       very small blocks it is better to use loop. For large blocks, libcall
       can do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   6,					/* scalar_stmt_cost.  */
   4,					/* scalar load_cost.  */
   4,					/* scalar_store_cost.  */
@@ -1489,11 +1538,16 @@ struct processor_costs btver1_cost = {
   /* BTVER1 has optimized REP instruction for medium sized blocks, but for
      very small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1560,11 +1614,18 @@ struct processor_costs pentium4_cost = {
   COSTS_N_INSNS (2),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (2),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (43),			/* cost of FSQRT instruction.  */
-  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+
+  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+
+  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
    {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1631,13 +1692,22 @@ struct processor_costs nocona_cost = {
   COSTS_N_INSNS (3),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (3),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (44),			/* cost of FSQRT instruction.  */
-  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+
+  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
    {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
 	      {100000, unrolled_loop}, {-1, libcall}}}},
-  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
+	      {100000, unrolled_loop}, {-1, libcall}}}}},
+
+  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
    {-1, libcall}}},
    {libcall, {{24, loop}, {64, unrolled_loop},
 	      {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {-1, libcall}}},
+   {libcall, {{24, loop}, {64, unrolled_loop},
+	      {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1704,13 +1774,20 @@ struct processor_costs atom_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{11, loop}, {-1, rep_prefix_4_byte}}},
-   {libcall, {{32, loop}, {64, rep_prefix_4_byte},
-	  {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {15, unrolled_loop},
-	  {2048, rep_prefix_4_byte}, {-1, libcall}}},
-   {libcall, {{24, loop}, {32, unrolled_loop},
-	  {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+
+  /* stringop_algs for memcpy.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
+    {libcall, {{4096, unrolled_loop}, {-1, libcall}}}},
+   {{libcall, {{-1, libcall}}},			       /* Unknown alignment.  */
+    {libcall, {{-1, libcall}}}}},
+
+  /* stringop_algs for memset.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
+    {libcall, {{4096, unrolled_loop}, {-1, libcall}}}},
+   {{libcall, {{1024, unrolled_loop},		       /* Unknown alignment.  */
+	       {-1, libcall}}},
+    {libcall, {{2048, unrolled_loop},
+	       {-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1784,10 +1861,16 @@ struct processor_costs generic64_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {DUMMY_STRINGOP_ALGS,
-   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {DUMMY_STRINGOP_ALGS,
-   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+
+  {{DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}},
+   {DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}}},
+
+  {{DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}},
+   {DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1856,10 +1939,16 @@ struct processor_costs generic32_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+  /* stringop_algs for memcpy.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  /* stringop_algs for memset.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -2537,6 +2626,7 @@ static void ix86_set_current_function (tree);
 static unsigned int ix86_minimum_incoming_stack_boundary (bool);
 
 static enum calling_abi ix86_function_abi (const_tree);
+static rtx promote_duplicated_reg (enum machine_mode, rtx);
 
 \f
 #ifndef SUBTARGET32_DEFAULT_CPU
@@ -15190,6 +15280,28 @@ ix86_expand_move (enum machine_mode mode, rtx operands[])
     }
   else
     {
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE2
+	  && MEM_P (op0)
+	  && MEM_P (op1)
+	  && !push_operand (op0, mode)
+	  && can_create_pseudo_p ())
+	{
+	  rtx temp = gen_reg_rtx (V2DImode);
+	  emit_insn (gen_sse2_loadq (temp, op1));
+	  emit_insn (gen_sse_storeq (op0, temp));
+	  return;
+	}
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE
+	  && !MEM_P (op1)
+	  && GET_MODE (op1) == V2DImode)
+	{
+	  emit_insn (gen_sse_storeq (op0, op1));
+	  return;
+	}
       if (MEM_P (op0)
 	  && (PUSH_ROUNDING (GET_MODE_SIZE (mode)) != GET_MODE_SIZE (mode)
 	      || !push_operand (op0, mode))
@@ -20201,22 +20313,17 @@ counter_mode (rtx count_exp)
   return SImode;
 }
 
-/* When SRCPTR is non-NULL, output simple loop to move memory
-   pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
-   overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
-   equivalent loop to set memory by VALUE (supposed to be in MODE).
-
-   The size is rounded down to whole number of chunk size moved at once.
-   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
-
-
-static void
-expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
-			       rtx destptr, rtx srcptr, rtx value,
-			       rtx count, enum machine_mode mode, int unroll,
-			       int expected_size)
+/* Helper function for expand_set_or_movmem_via_loop.
+   This function can reuse the iter rtx from another loop and doesn't generate
+   code for updating the addresses.  */
+static rtx
+expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
+					 rtx destptr, rtx srcptr, rtx value,
+					 rtx count, rtx iter,
+					 enum machine_mode mode, int unroll,
+					 int expected_size, bool change_ptrs)
 {
-  rtx out_label, top_label, iter, tmp;
+  rtx out_label, top_label, tmp;
   enum machine_mode iter_mode = counter_mode (count);
   rtx piece_size = GEN_INT (GET_MODE_SIZE (mode) * unroll);
   rtx piece_size_mask = GEN_INT (~((GET_MODE_SIZE (mode) * unroll) - 1));
@@ -20224,10 +20331,12 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
   rtx x_addr;
   rtx y_addr;
   int i;
+  bool reuse_iter = (iter != NULL_RTX);
 
   top_label = gen_label_rtx ();
   out_label = gen_label_rtx ();
-  iter = gen_reg_rtx (iter_mode);
+  if (!reuse_iter)
+    iter = gen_reg_rtx (iter_mode);
 
   size = expand_simple_binop (iter_mode, AND, count, piece_size_mask,
 			      NULL, 1, OPTAB_DIRECT);
@@ -20238,7 +20347,8 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
 			       true, out_label);
       predict_jump (REG_BR_PROB_BASE * 10 / 100);
     }
-  emit_move_insn (iter, const0_rtx);
+  if (!reuse_iter)
+    emit_move_insn (iter, const0_rtx);
 
   emit_label (top_label);
 
@@ -20321,19 +20431,43 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
     }
   else
     predict_jump (REG_BR_PROB_BASE * 80 / 100);
-  iter = ix86_zero_extend_to_Pmode (iter);
-  tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
-			     true, OPTAB_LIB_WIDEN);
-  if (tmp != destptr)
-    emit_move_insn (destptr, tmp);
-  if (srcptr)
+  if (change_ptrs)
     {
-      tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
+      iter = ix86_zero_extend_to_Pmode (iter);
+      tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
 				 true, OPTAB_LIB_WIDEN);
-      if (tmp != srcptr)
-	emit_move_insn (srcptr, tmp);
+      if (tmp != destptr)
+	emit_move_insn (destptr, tmp);
+      if (srcptr)
+	{
+	  tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
+				     true, OPTAB_LIB_WIDEN);
+	  if (tmp != srcptr)
+	    emit_move_insn (srcptr, tmp);
+	}
     }
   emit_label (out_label);
+  return iter;
+}
+
+/* When SRCPTR is non-NULL, output simple loop to move memory
+   pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
+   overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
+   equivalent loop to set memory by VALUE (supposed to be in MODE).
+
+   The size is rounded down to whole number of chunk size moved at once.
+   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
+
+static void
+expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
+			       rtx destptr, rtx srcptr, rtx value,
+			       rtx count, enum machine_mode mode, int unroll,
+			       int expected_size)
+{
+  expand_set_or_movmem_via_loop_with_iter (destmem, srcmem,
+				 destptr, srcptr, value,
+				 count, NULL_RTX, mode, unroll,
+				 expected_size, true);
 }
 
 /* Output "rep; mov" instruction.
@@ -20437,7 +20571,27 @@ emit_strmov (rtx destmem, rtx srcmem,
   emit_insn (gen_strmov (destptr, dest, srcptr, src));
 }
 
-/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST.  */
+/* Emit strset instruction.  If RHS is constant and a vector mode will be used,
+   then move this constant to a vector register before emitting strset.  */
+static void
+emit_strset (rtx destmem, rtx value,
+	     rtx destptr, enum machine_mode mode, int offset)
+{
+  rtx dest = adjust_automodify_address_nv (destmem, mode, destptr, offset);
+  rtx vec_reg;
+  if (vector_extensions_used_for_mode (mode) && CONSTANT_P (value))
+    {
+      if (mode == DImode)
+	mode = TARGET_64BIT ? V2DImode : V4SImode;
+      vec_reg = gen_reg_rtx (mode);
+      emit_move_insn (vec_reg, value);
+      emit_insn (gen_strset (destptr, dest, vec_reg));
+    }
+  else
+    emit_insn (gen_strset (destptr, dest, value));
+}
+
+/* Output code to copy (count % max_size) bytes from SRC to DEST.  */
 static void
 expand_movmem_epilogue (rtx destmem, rtx srcmem,
 			rtx destptr, rtx srcptr, rtx count, int max_size)
@@ -20448,43 +20602,55 @@ expand_movmem_epilogue (rtx destmem, rtx srcmem,
       HOST_WIDE_INT countval = INTVAL (count);
       int offset = 0;
 
-      if ((countval & 0x10) && max_size > 16)
+      int remainder_size = countval % max_size;
+      enum machine_mode move_mode = Pmode;
+
+      /* First, try to move data with the widest possible mode.
+	 The remaining part will be moved using Pmode and narrower modes.  */
+      if (TARGET_SSE)
 	{
-	  if (TARGET_64BIT)
-	    {
-	      emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
-	      emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset + 8);
-	    }
-	  else
-	    gcc_unreachable ();
-	  offset += 16;
+	  if (max_size >= GET_MODE_SIZE (V4SImode))
+	    move_mode = V4SImode;
+	  else if (max_size >= GET_MODE_SIZE (DImode))
+	    move_mode = DImode;
 	}
-      if ((countval & 0x08) && max_size > 8)
+
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-	  if (TARGET_64BIT)
-	    emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
-	  else
-	    {
-	      emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
-	      emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset + 4);
-	    }
-	  offset += 8;
+	  emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
+	}
+
+      /* Move the remaining part of the epilogue - its size can be up to
+	 the size of the widest mode used above.  */
+      move_mode = Pmode;
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
+	{
+	  emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
 	}
-      if ((countval & 0x04) && max_size > 4)
+
+      if (remainder_size >= 4)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
 	  offset += 4;
+	  remainder_size -= 4;
 	}
-      if ((countval & 0x02) && max_size > 2)
+      if (remainder_size >= 2)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
 	  offset += 2;
+	  remainder_size -= 2;
 	}
-      if ((countval & 0x01) && max_size > 1)
+      if (remainder_size >= 1)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
 	  offset += 1;
+	  remainder_size -= 1;
 	}
+      gcc_assert (remainder_size == 0);
       return;
     }
   if (max_size > 8)
@@ -20590,87 +20756,122 @@ expand_setmem_epilogue_via_loop (rtx destmem, rtx destptr, rtx value,
 				 1, max_size / 2);
 }
 
-/* Output code to set at most count & (max_size - 1) bytes starting by DEST.  */
+/* Output code to set at most count & (max_size - 1) bytes starting by
+   DESTMEM.  */
 static void
-expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_size)
+expand_setmem_epilogue (rtx destmem, rtx destptr, rtx promoted_to_vector_value,
+			rtx value, rtx count, int max_size)
 {
-  rtx dest;
-
   if (CONST_INT_P (count))
     {
       HOST_WIDE_INT countval = INTVAL (count);
       int offset = 0;
 
-      if ((countval & 0x10) && max_size > 16)
+      int remainder_size = countval % max_size;
+      enum machine_mode move_mode = Pmode;
+      enum machine_mode sse_mode = TARGET_64BIT ? V2DImode : V4SImode;
+      rtx promoted_value = NULL_RTX;
+
+      /* First, try to move data with the widest possible mode.
+	 The remaining part will be moved using Pmode and narrower modes.  */
+      if (TARGET_SSE)
 	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset + 8);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    gcc_unreachable ();
-	  offset += 16;
+	  if (max_size >= GET_MODE_SIZE (sse_mode))
+	    move_mode = sse_mode;
+	  else if (max_size >= GET_MODE_SIZE (DImode))
+	    move_mode = DImode;
+	  if (!VECTOR_MODE_P (GET_MODE (promoted_to_vector_value)))
+	    promoted_to_vector_value = NULL_RTX;
 	}
-      if ((countval & 0x08) && max_size > 8)
+
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    {
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset + 4);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  offset += 8;
+	  if (GET_MODE (destmem) != move_mode)
+	    destmem = change_address (destmem, move_mode, destptr);
+	  if (!promoted_to_vector_value)
+	    promoted_to_vector_value =
+	      targetm.promote_rtx_for_memset (move_mode, value);
+	  emit_strset (destmem, promoted_to_vector_value, destptr,
+		       move_mode, offset);
+
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
+	}
+
+      /* Move the remaining part of the epilogue - its size can be up to
+	 the size of the widest mode used above.  */
+      move_mode = Pmode;
+      promoted_value = NULL_RTX;
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
+	{
+	  if (!promoted_value)
+	    promoted_value = promote_duplicated_reg (move_mode, value);
+	  emit_strset (destmem, promoted_value, destptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
 	}
-      if ((countval & 0x04) && max_size > 4)
+
+      if (!promoted_value)
+	promoted_value = promote_duplicated_reg (move_mode, value);
+      if (remainder_size >= 4)
 	{
-	  dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
+	  emit_strset (destmem, gen_lowpart (SImode, promoted_value), destptr,
+		       SImode, offset);
 	  offset += 4;
+	  remainder_size -= 4;
 	}
-      if ((countval & 0x02) && max_size > 2)
+      if (remainder_size >= 2)
 	{
-	  dest = adjust_automodify_address_nv (destmem, HImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
-	  offset += 2;
+	  emit_strset (destmem, gen_lowpart (HImode, promoted_value), destptr,
+		       HImode, offset);
+	  offset += 2;
+	  remainder_size -= 2;
 	}
-      if ((countval & 0x01) && max_size > 1)
+      if (remainder_size >= 1)
 	{
-	  dest = adjust_automodify_address_nv (destmem, QImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
+	  emit_strset (destmem, gen_lowpart (QImode, promoted_value), destptr,
+		       QImode, offset);
 	  offset += 1;
+	  remainder_size -= 1;
 	}
+      gcc_assert (remainder_size == 0);
       return;
     }
+
+  /* COUNT isn't constant.  */
   if (max_size > 32)
     {
-      expand_setmem_epilogue_via_loop (destmem, destptr, value, count, max_size);
+      expand_setmem_epilogue_via_loop (destmem, destptr, value, count,
+				       max_size);
       return;
     }
+  /* If it turns out that we promoted the value to a non-vector register,
+     we can reuse it.  */
+  if (!VECTOR_MODE_P (GET_MODE (promoted_to_vector_value)))
+    value = promoted_to_vector_value;
+
   if (max_size > 16)
     {
       rtx label = ix86_expand_aligntest (count, 16, true);
       if (TARGET_64BIT)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
@@ -20680,14 +20881,17 @@ expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_
       rtx label = ix86_expand_aligntest (count, 8, true);
       if (TARGET_64BIT)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
@@ -20695,24 +20899,24 @@ expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_
   if (max_size > 4)
     {
       rtx label = ix86_expand_aligntest (count, 4, true);
-      dest = change_address (destmem, SImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
+      destmem = change_address (destmem, SImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
   if (max_size > 2)
     {
       rtx label = ix86_expand_aligntest (count, 2, true);
-      dest = change_address (destmem, HImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
+      destmem = change_address (destmem, HImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (HImode, value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
   if (max_size > 1)
     {
       rtx label = ix86_expand_aligntest (count, 1, true);
-      dest = change_address (destmem, QImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
+      destmem = change_address (destmem, QImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (QImode, value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
@@ -20755,7 +20959,27 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
-  gcc_assert (desired_alignment <= 8);
+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  srcmem = change_address (srcmem, DImode, srcptr);
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	}
+      else
+	{
+	  srcmem = change_address (srcmem, SImode, srcptr);
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	}
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);
 }
 
 /* Copy enough from DST to SRC to align DST known to DESIRED_ALIGN.
@@ -20810,6 +21034,37 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
       off = 4;
       emit_insn (gen_strmov (destreg, dst, srcreg, src));
     }
+  if (align_bytes & 8)
+    {
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, DImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      else
+	{
+	  dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, SImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
+	set_mem_align (dst, 8 * BITS_PER_UNIT);
+      if (src_align_bytes >= 0)
+	{
+	  unsigned int src_align = 0;
+	  if ((src_align_bytes & 7) == (align_bytes & 7))
+	    src_align = 8;
+	  else if ((src_align_bytes & 3) == (align_bytes & 3))
+	    src_align = 4;
+	  else if ((src_align_bytes & 1) == (align_bytes & 1))
+	    src_align = 2;
+	  if (MEM_ALIGN (src) < src_align * BITS_PER_UNIT)
+	    set_mem_align (src, src_align * BITS_PER_UNIT);
+	}
+      off = 8;
+    }
   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
   src = adjust_automodify_address_nv (src, BLKmode, srcreg, off);
   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
@@ -20869,7 +21124,17 @@ expand_setmem_prologue (rtx destmem, rtx destptr, rtx value, rtx count,
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
-  gcc_assert (desired_alignment <= 8);
+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      destmem = change_address (destmem, SImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);
 }
 
 /* Set enough from DST to align DST known to by aligned by ALIGN to
@@ -20905,6 +21170,19 @@ expand_constant_setmem_prologue (rtx dst, rtx destreg, rtx value,
       emit_insn (gen_strset (destreg, dst,
 			     gen_lowpart (SImode, value)));
     }
+  if (align_bytes & 8)
+    {
+      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+      emit_insn (gen_strset (destreg, dst,
+	    gen_lowpart (SImode, value)));
+      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+      emit_insn (gen_strset (destreg, dst,
+	    gen_lowpart (SImode, value)));
+
+      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
+	set_mem_align (dst, 8 * BITS_PER_UNIT);
+      off = 8;
+    }
   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
     set_mem_align (dst, desired_align * BITS_PER_UNIT);
@@ -20916,7 +21194,7 @@ expand_constant_setmem_prologue (rtx dst, rtx destreg, rtx value,
 /* Given COUNT and EXPECTED_SIZE, decide on codegen of string operation.  */
 static enum stringop_alg
 decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
-	    int *dynamic_check)
+	    int *dynamic_check, bool align_unknown)
 {
   const struct stringop_algs * algs;
   bool optimize_for_speed;
@@ -20925,7 +21203,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
      consider such algorithms if the user has appropriated those
      registers for their own purposes.	*/
   bool rep_prefix_usable = !(fixed_regs[CX_REG] || fixed_regs[DI_REG]
-                             || (memset
+			     || (memset
 				 ? fixed_regs[AX_REG] : fixed_regs[SI_REG]));
 
 #define ALG_USABLE_P(alg) (rep_prefix_usable			\
@@ -20938,7 +21216,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
      of time processing large blocks.  */
   if (optimize_function_for_size_p (cfun)
       || (optimize_insn_for_size_p ()
-          && expected_size != -1 && expected_size < 256))
+	  && expected_size != -1 && expected_size < 256))
     optimize_for_speed = false;
   else
     optimize_for_speed = true;
@@ -20947,9 +21225,9 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
 
   *dynamic_check = -1;
   if (memset)
-    algs = &cost->memset[TARGET_64BIT != 0];
+    algs = &cost->memset[align_unknown][TARGET_64BIT != 0];
   else
-    algs = &cost->memcpy[TARGET_64BIT != 0];
+    algs = &cost->memcpy[align_unknown][TARGET_64BIT != 0];
   if (ix86_stringop_alg != no_stringop && ALG_USABLE_P (ix86_stringop_alg))
     return ix86_stringop_alg;
   /* rep; movq or rep; movl is the smallest variant.  */
@@ -21013,29 +21291,33 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
       enum stringop_alg alg;
       int i;
       bool any_alg_usable_p = true;
+      bool only_libcall_fits = true;
 
       for (i = 0; i < MAX_STRINGOP_ALGS; i++)
-        {
-          enum stringop_alg candidate = algs->size[i].alg;
-          any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
+	{
+	  enum stringop_alg candidate = algs->size[i].alg;
+	  any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
 
-          if (candidate != libcall && candidate
-              && ALG_USABLE_P (candidate))
-              max = algs->size[i].max;
-        }
+	  if (candidate != libcall && candidate
+	      && ALG_USABLE_P (candidate))
+	    {
+	      max = algs->size[i].max;
+	      only_libcall_fits = false;
+	    }
+	}
       /* If there aren't any usable algorithms, then recursing on
-         smaller sizes isn't going to find anything.  Just return the
-         simple byte-at-a-time copy loop.  */
-      if (!any_alg_usable_p)
-        {
-          /* Pick something reasonable.  */
-          if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
-            *dynamic_check = 128;
-          return loop_1_byte;
-        }
+	 smaller sizes isn't going to find anything.  Just return the
+	 simple byte-at-a-time copy loop.  */
+      if (!any_alg_usable_p || only_libcall_fits)
+	{
+	  /* Pick something reasonable.  */
+	  if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
+	    *dynamic_check = 128;
+	  return loop_1_byte;
+	}
       if (max == -1)
 	max = 4096;
-      alg = decide_alg (count, max / 2, memset, dynamic_check);
+      alg = decide_alg (count, max / 2, memset, dynamic_check, align_unknown);
       gcc_assert (*dynamic_check == -1);
       gcc_assert (alg != libcall);
       if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
@@ -21059,9 +21341,11 @@ decide_alignment (int align,
       case no_stringop:
 	gcc_unreachable ();
       case loop:
-      case unrolled_loop:
 	desired_align = GET_MODE_SIZE (Pmode);
 	break;
+      case unrolled_loop:
+	desired_align = GET_MODE_SIZE (TARGET_SSE ? V4SImode : Pmode);
+	break;
       case rep_prefix_8_byte:
 	desired_align = 8;
 	break;
@@ -21149,6 +21433,11 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
   enum stringop_alg alg;
   int dynamic_check;
   bool need_zero_guard = false;
+  bool align_unknown;
+  int unroll_factor;
+  enum machine_mode move_mode;
+  rtx loop_iter = NULL_RTX;
+  int dst_offset, src_offset;
 
   if (CONST_INT_P (align_exp))
     align = INTVAL (align_exp);
@@ -21172,9 +21461,17 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 
   /* Step 0: Decide on preferred algorithm, desired alignment and
      size of chunks to be copied by main loop.  */
-
-  alg = decide_alg (count, expected_size, false, &dynamic_check);
+  dst_offset = get_mem_align_offset (dst, MOVE_MAX * BITS_PER_UNIT);
+  src_offset = get_mem_align_offset (src, MOVE_MAX * BITS_PER_UNIT);
+  align_unknown = (dst_offset < 0
+		   || src_offset < 0
+		   || src_offset != dst_offset);
+  alg = decide_alg (count, expected_size, false, &dynamic_check, align_unknown);
   desired_align = decide_alignment (align, alg, expected_size);
+  if (align_unknown)
+    desired_align = align;
+  unroll_factor = 1;
+  move_mode = Pmode;
 
   if (!TARGET_ALIGN_STRINGOPS)
     align = desired_align;
@@ -21193,11 +21490,16 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
       gcc_unreachable ();
     case loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode);
+      move_mode = Pmode;
+      unroll_factor = 1;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * (TARGET_64BIT ? 4 : 2);
+      /* Use SSE instructions, if possible.  */
+      move_mode = TARGET_SSE ? (align_unknown ? DImode : V4SImode) : Pmode;
+      unroll_factor = 4;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
       size_needed = 8;
@@ -21318,6 +21620,8 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 						 desired_align, align_bytes);
 	  count_exp = plus_constant (count_exp, -align_bytes);
 	  count -= align_bytes;
+	  if (count < (unsigned HOST_WIDE_INT) size_needed)
+	    goto epilogue;
 	}
       if (need_zero_guard
 	  && (count < (unsigned HOST_WIDE_INT) size_needed
@@ -21366,11 +21670,14 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 				     count_exp, Pmode, 1, expected_size);
       break;
     case unrolled_loop:
-      /* Unroll only by factor of 2 in 32bit mode, since we don't have enough
-	 registers for 4 temporaries anyway.  */
-      expand_set_or_movmem_via_loop (dst, src, destreg, srcreg, NULL,
-				     count_exp, Pmode, TARGET_64BIT ? 4 : 2,
-				     expected_size);
+      /* In some cases we want to use the same iterator in several adjacent
+	 loops, so here we save loop iterator rtx and don't update addresses.  */
+      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
+							   srcreg, NULL,
+							   count_exp, NULL_RTX,
+							   move_mode,
+							   unroll_factor,
+							   expected_size, false);
       break;
     case rep_prefix_8_byte:
       expand_movmem_via_rep_mov (dst, src, destreg, srcreg, count_exp,
@@ -21421,9 +21728,50 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
       LABEL_NUSES (label) = 1;
     }
 
+  /* We haven't updated the addresses, so we'll do that now.
+     Also, if the epilogue seems to be big, we'll generate a (non-unrolled)
+     loop in it.  We do that only if the alignment is unknown, because in
+     that case the epilogue has to copy byte by byte, which is very slow.  */
+  if (alg == unrolled_loop)
+    {
+      rtx tmp;
+      if (align_unknown && unroll_factor > 1)
+	{
+	  /* Reduce the epilogue's size by creating a non-unrolled loop.  If
+	     we don't do this, we can end up with a very big epilogue - when
+	     the alignment is statically unknown the epilogue would be done
+	     byte by byte, which may be very slow.  */
+	  rtx epilogue_loop_jump_around = gen_label_rtx ();
+	  rtx tmp = plus_constant (loop_iter, GET_MODE_SIZE (move_mode));
+	  emit_cmp_and_jump_insns (count_exp, tmp, LT, NULL_RTX,
+				   counter_mode (count_exp), true,
+				   epilogue_loop_jump_around);
+	  predict_jump (REG_BR_PROB_BASE * 10 / 100);
+	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
+	      srcreg, NULL, count_exp,
+	      loop_iter, move_mode, 1,
+	      expected_size, false);
+	  emit_label (epilogue_loop_jump_around);
+	  src = change_address (src, BLKmode, srcreg);
+	  dst = change_address (dst, BLKmode, destreg);
+	  epilogue_size_needed = GET_MODE_SIZE (move_mode);
+	}
+      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != destreg)
+	emit_move_insn (destreg, tmp);
+
+      tmp = expand_simple_binop (Pmode, PLUS, srcreg, loop_iter, srcreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != srcreg)
+	emit_move_insn (srcreg, tmp);
+    }
   if (count_exp != const0_rtx && epilogue_size_needed > 1)
-    expand_movmem_epilogue (dst, src, destreg, srcreg, count_exp,
-			    epilogue_size_needed);
+    {
+      expand_movmem_epilogue (dst, src, destreg, srcreg, count_exp,
+			      epilogue_size_needed);
+    }
+
   if (jump_around_label)
     emit_label (jump_around_label);
   return true;
@@ -21441,7 +21789,37 @@ promote_duplicated_reg (enum machine_mode mode, rtx val)
   rtx tmp;
   int nops = mode == DImode ? 3 : 2;
 
+  if (VECTOR_MODE_P (mode))
+    {
+      enum machine_mode inner = GET_MODE_INNER (mode);
+      rtx promoted_val, vec_reg;
+      if (CONST_INT_P (val))
+	return ix86_build_const_vector (mode, true, val);
+
+      promoted_val = promote_duplicated_reg (inner, val);
+      vec_reg = gen_reg_rtx (mode);
+      switch (mode)
+	{
+	case V2DImode:
+	  emit_insn (gen_vec_dupv2di (vec_reg, promoted_val));
+	  break;
+	case V4SImode:
+	  emit_insn (gen_vec_dupv4si (vec_reg, promoted_val));
+	  break;
+	default:
+	  gcc_unreachable ();
+	  break;
+	}
+
+      return vec_reg;
+    }
   gcc_assert (mode == SImode || mode == DImode);
+  if (mode == DImode && !TARGET_64BIT)
+    {
+      rtx vec_reg = promote_duplicated_reg (V4SImode, val);
+      vec_reg = convert_to_mode (V2DImode, vec_reg, 1);
+      return vec_reg;
+    }
   if (val == const0_rtx)
     return copy_to_mode_reg (mode, const0_rtx);
   if (CONST_INT_P (val))
@@ -21507,11 +21885,21 @@ promote_duplicated_reg (enum machine_mode mode, rtx val)
 static rtx
 promote_duplicated_reg_to_size (rtx val, int size_needed, int desired_align, int align)
 {
-  rtx promoted_val;
+  rtx promoted_val = NULL_RTX;
 
-  if (TARGET_64BIT
-      && (size_needed > 4 || (desired_align > align && desired_align > 4)))
-    promoted_val = promote_duplicated_reg (DImode, val);
+  if (size_needed > 8 || (desired_align > align && desired_align > 8))
+    {
+      gcc_assert (TARGET_SSE);
+      if (TARGET_64BIT)
+	promoted_val = promote_duplicated_reg (V2DImode, val);
+      else
+	promoted_val = promote_duplicated_reg (V4SImode, val);
+    }
+  else if (size_needed > 4 || (desired_align > align && desired_align > 4))
+    {
+      gcc_assert (TARGET_64BIT || TARGET_SSE);
+      promoted_val = promote_duplicated_reg (DImode, val);
+    }
   else if (size_needed > 2 || (desired_align > align && desired_align > 2))
     promoted_val = promote_duplicated_reg (SImode, val);
   else if (size_needed > 1 || (desired_align > align && desired_align > 1))
@@ -21537,12 +21925,17 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
   unsigned HOST_WIDE_INT count = 0;
   HOST_WIDE_INT expected_size = -1;
   int size_needed = 0, epilogue_size_needed;
+  int promote_size_needed = 0;
   int desired_align = 0, align_bytes = 0;
   enum stringop_alg alg;
   rtx promoted_val = NULL;
-  bool force_loopy_epilogue = false;
+  rtx vec_promoted_val = NULL;
   int dynamic_check;
   bool need_zero_guard = false;
+  bool align_unknown;
+  unsigned int unroll_factor;
+  enum machine_mode move_mode;
+  rtx loop_iter = NULL_RTX;
 
   if (CONST_INT_P (align_exp))
     align = INTVAL (align_exp);
@@ -21562,8 +21955,11 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
   /* Step 0: Decide on preferred algorithm, desired alignment and
      size of chunks to be copied by main loop.  */
 
-  alg = decide_alg (count, expected_size, true, &dynamic_check);
+  align_unknown = get_mem_align_offset (dst, BITS_PER_UNIT) < 0;
+  alg = decide_alg (count, expected_size, true, &dynamic_check, align_unknown);
   desired_align = decide_alignment (align, alg, expected_size);
+  unroll_factor = 1;
+  move_mode = Pmode;
 
   if (!TARGET_ALIGN_STRINGOPS)
     align = desired_align;
@@ -21581,11 +21977,21 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       gcc_unreachable ();
     case loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode);
+      move_mode = Pmode;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * 4;
+      /* Use SSE instructions, if possible.  */
+      move_mode = TARGET_SSE
+		  ? (TARGET_64BIT ? V2DImode : V4SImode)
+		  : Pmode;
+      unroll_factor = 1;
+      /* Select the maximal available unroll factor: 1, 2 or 4.  */
+      while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
+	     && unroll_factor < 4)
+	unroll_factor *= 2;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
       size_needed = 8;
@@ -21602,6 +22008,7 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       break;
     }
   epilogue_size_needed = size_needed;
+  promote_size_needed = GET_MODE_SIZE (Pmode);
 
   /* Step 1: Prologue guard.  */
 
@@ -21630,8 +22037,10 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
      main loop and epilogue (ie one load of the big constant in the
      front of all code.  */
   if (CONST_INT_P (val_exp))
-    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
-						   desired_align, align);
+    promoted_val = promote_duplicated_reg_to_size (val_exp,
+						   promote_size_needed,
+						   promote_size_needed,
+						   align);
   /* Ensure that alignment prologue won't copy past end of block.  */
   if (size_needed > 1 || (desired_align > 1 && desired_align > align))
     {
@@ -21640,12 +22049,6 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 	 Make sure it is power of 2.  */
       epilogue_size_needed = smallest_pow2_greater_than (epilogue_size_needed);
 
-      /* To improve performance of small blocks, we jump around the VAL
-	 promoting mode.  This mean that if the promoted VAL is not constant,
-	 we might not use it in the epilogue and have to use byte
-	 loop variant.  */
-      if (epilogue_size_needed > 2 && !promoted_val)
-        force_loopy_epilogue = true;
       if (count)
 	{
 	  if (count < (unsigned HOST_WIDE_INT)epilogue_size_needed)
@@ -21686,8 +22089,10 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 
   /* Do the expensive promotion once we branched off the small blocks.  */
   if (!promoted_val)
-    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
-						   desired_align, align);
+    promoted_val = promote_duplicated_reg_to_size (val_exp,
+						   promote_size_needed,
+						   promote_size_needed,
+						   align);
   gcc_assert (desired_align >= 1 && align >= 1);
 
   if (desired_align > align)
@@ -21751,7 +22156,7 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
     case no_stringop:
       gcc_unreachable ();
     case loop_1_byte:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
+      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, val_exp,
 				     count_exp, QImode, 1, expected_size);
       break;
     case loop:
@@ -21759,8 +22164,14 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 				     count_exp, Pmode, 1, expected_size);
       break;
     case unrolled_loop:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
-				     count_exp, Pmode, 4, expected_size);
+      vec_promoted_val =
+	promote_duplicated_reg_to_size (promoted_val,
+					GET_MODE_SIZE (move_mode),
+					desired_align, align);
+      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+				     NULL, vec_promoted_val, count_exp,
+				     NULL_RTX, move_mode, unroll_factor,
+				     expected_size, false);
       break;
     case rep_prefix_8_byte:
       expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
@@ -21804,15 +22215,36 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       LABEL_NUSES (label) = 1;
     }
  epilogue:
-  if (count_exp != const0_rtx && epilogue_size_needed > 1)
+  if (alg == unrolled_loop)
     {
-      if (force_loopy_epilogue)
-	expand_setmem_epilogue_via_loop (dst, destreg, val_exp, count_exp,
-					 epilogue_size_needed);
-      else
-	expand_setmem_epilogue (dst, destreg, promoted_val, count_exp,
-				epilogue_size_needed);
+      rtx tmp;
+      if (align_unknown && unroll_factor > 1)
+	{
+	  /* Reduce the epilogue's size by creating a non-unrolled loop.  If
+	     we don't do this, we can end up with a very big epilogue - when
+	     the alignment is statically unknown the epilogue would be done
+	     byte by byte, which may be very slow.  */
+	  rtx epilogue_loop_jump_around = gen_label_rtx ();
+	  rtx tmp = plus_constant (loop_iter, GET_MODE_SIZE (move_mode));
+	  emit_cmp_and_jump_insns (count_exp, tmp, LT, NULL_RTX,
+				   counter_mode (count_exp), true,
+				   epilogue_loop_jump_around);
+	  predict_jump (REG_BR_PROB_BASE * 10 / 100);
+	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+	      NULL, vec_promoted_val, count_exp,
+	      loop_iter, move_mode, 1,
+	      expected_size, false);
+	  emit_label (epilogue_loop_jump_around);
+	  dst = change_address (dst, BLKmode, destreg);
+	  epilogue_size_needed = GET_MODE_SIZE (move_mode);
+	}
+      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != destreg)
+	emit_move_insn (destreg, tmp);
     }
+  if (count_exp != const0_rtx && epilogue_size_needed > 1)
+    expand_setmem_epilogue (dst, destreg, promoted_val, val_exp, count_exp,
+			    epilogue_size_needed);
   if (jump_around_label)
     emit_label (jump_around_label);
   return true;
@@ -36374,6 +36806,87 @@ ix86_autovectorize_vector_sizes (void)
   return (TARGET_AVX && !TARGET_PREFER_AVX128) ? 32 | 16 : 0;
 }
 
+/* Target hook.  Return true if an unaligned access in MODE at alignment
+   ALIGN is slow and should be avoided.  */
+
+static bool
+ix86_slow_unaligned_access (enum machine_mode mode,
+			    unsigned int align)
+{
+  if (TARGET_AVX)
+    {
+      if (GET_MODE_SIZE (mode) == 32)
+	{
+	  if (align <= 16)
+	    return (TARGET_AVX256_SPLIT_UNALIGNED_LOAD
+		    || TARGET_AVX256_SPLIT_UNALIGNED_STORE);
+	  else
+	    return false;
+	}
+    }
+
+  if (GET_MODE_SIZE (mode) > 8)
+    {
+      return (! TARGET_SSE_UNALIGNED_LOAD_OPTIMAL
+	      && ! TARGET_SSE_UNALIGNED_STORE_OPTIMAL);
+    }
+
+  return false;
+}
+
+/* Target hook.  Return an rtx of mode MODE containing the value VAL,
+   which is supposed to represent one byte, promoted to fill the whole
+   mode.  MODE can be a vector mode.
+   Examples:
+   1) VAL = const_int (0xAB), MODE = SImode:
+   the result is const_int (0xABABABAB).
+   2) If VAL isn't constant, the result is the result of multiplying VAL
+   by const_int (0x01010101) (for SImode).  */
+
+static rtx
+ix86_promote_rtx_for_memset (enum machine_mode mode  ATTRIBUTE_UNUSED,
+			      rtx val)
+{
+  enum machine_mode val_mode = GET_MODE (val);
+  gcc_assert (VALID_INT_MODE_P (val_mode) || val_mode == VOIDmode);
+
+  if (vector_extensions_used_for_mode (mode) && TARGET_SSE)
+    {
+      rtx promoted_val, vec_reg;
+      enum machine_mode vec_mode = TARGET_64BIT ? V2DImode : V4SImode;
+      if (CONST_INT_P (val))
+	{
+	  rtx const_vec;
+	  HOST_WIDE_INT int_val = (UINTVAL (val) & 0xFF)
+				   * (TARGET_64BIT
+				      ? 0x0101010101010101
+				      : 0x01010101);
+	  val = gen_int_mode (int_val, Pmode);
+	  vec_reg = gen_reg_rtx (vec_mode);
+	  const_vec = ix86_build_const_vector (vec_mode, true, val);
+	  if (mode != vec_mode)
+	    const_vec = convert_to_mode (vec_mode, const_vec, 1);
+	  emit_move_insn (vec_reg, const_vec);
+	  return vec_reg;
+	}
+      /* Else: val isn't const.  */
+      promoted_val = promote_duplicated_reg (Pmode, val);
+      vec_reg = gen_reg_rtx (vec_mode);
+      switch (vec_mode)
+	{
+	case V2DImode:
+	  emit_insn (gen_vec_dupv2di (vec_reg, promoted_val));
+	  break;
+	case V4SImode:
+	  emit_insn (gen_vec_dupv4si (vec_reg, promoted_val));
+	  break;
+	default:
+	  gcc_unreachable ();
+	  break;
+	}
+      return vec_reg;
+    }
+  return NULL_RTX;
+}
+
 /* Initialize the GCC target structure.  */
 #undef TARGET_RETURN_IN_MEMORY
 #define TARGET_RETURN_IN_MEMORY ix86_return_in_memory
@@ -36681,6 +37194,12 @@ ix86_autovectorize_vector_sizes (void)
 #undef TARGET_CONDITIONAL_REGISTER_USAGE
 #define TARGET_CONDITIONAL_REGISTER_USAGE ix86_conditional_register_usage
 
+#undef TARGET_SLOW_UNALIGNED_ACCESS
+#define TARGET_SLOW_UNALIGNED_ACCESS ix86_slow_unaligned_access
+
+#undef TARGET_PROMOTE_RTX_FOR_MEMSET
+#define TARGET_PROMOTE_RTX_FOR_MEMSET ix86_promote_rtx_for_memset
+
 #if TARGET_MACHO
 #undef TARGET_INIT_LIBFUNCS
 #define TARGET_INIT_LIBFUNCS darwin_rename_builtins
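
For reference, the constant path of ix86_promote_rtx_for_memset above boils
down to multiplying the byte by a repeated-ones pattern.  A standalone sketch
of that arithmetic (illustration only, not part of the patch; the real code
builds the value through ix86_build_const_vector):

/* Replicate the low byte of C across a word of WIDTH bytes (4 or 8).
   For c = 0xAB and width = 4 this yields 0xABABABAB, matching the
   example in the hook's comment.  */
static unsigned long long
replicate_byte (unsigned long long c, int width)
{
  unsigned long long ones = (width == 8
                             ? 0x0101010101010101ULL
                             : 0x01010101ULL);
  return (c & 0xFF) * ones;
}
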
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 7d6e058..1336f9f 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -159,8 +159,12 @@ struct processor_costs {
   const int fchs;		/* cost of FCHS instruction.  */
   const int fsqrt;		/* cost of FSQRT instruction.  */
 				/* Specify what algorithm
-				   to use for stringops on unknown size.  */
-  struct stringop_algs memcpy[2], memset[2];
+				   to use for stringops on unknown size.
+				   First index is used to specify whether
+				   alignment is known or not.
+				   Second - to specify whether 32 or 64 bits
+				   are used.  */
+  struct stringop_algs memcpy[2][2], memset[2][2];
   const int scalar_stmt_cost;   /* Cost of any scalar operation, excluding
 				   load and store.  */
   const int scalar_load_cost;   /* Cost of scalar load.  */
@@ -1712,7 +1716,7 @@ typedef struct ix86_args {
 /* If a clear memory operation would take CLEAR_RATIO or more simple
    move-instruction sequences, we will do a clrmem or libcall instead.  */
 
-#define CLEAR_RATIO(speed) ((speed) ? MIN (6, ix86_cost->move_ratio) : 2)
+#define CLEAR_RATIO(speed) ((speed) ? ix86_cost->move_ratio : 2)
 
 /* Define if shifts truncate the shift count which implies one can
    omit a sign-extension or zero-extension of a shift count.
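
To make the intended indexing of the widened cost tables explicit, algorithm
selection now looks roughly like the sketch below (an excerpt-style sketch of
what decide_alg does; `cost' is the active processor cost table and
align_unknown is derived from get_mem_align_offset):

const struct stringop_algs *algs;

/* First index: whether the alignment is statically unknown;
   second index: whether we compile for 64 bits.  */
if (memset)
  algs = &cost->memset[align_unknown][TARGET_64BIT != 0];
else
  algs = &cost->memcpy[align_unknown][TARGET_64BIT != 0];
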
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 6c20ddb..3e363f4 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -7244,6 +7244,13 @@
    (set_attr "prefix" "maybe_vex,maybe_vex,orig,orig,vex")
    (set_attr "mode" "TI,TI,V4SF,SF,SF")])
 
+(define_expand "sse2_loadq"
+ [(set (match_operand:V2DI 0 "register_operand")
+       (vec_concat:V2DI
+	 (match_operand:DI 1 "memory_operand")
+	 (const_int 0)))]
+  "!TARGET_64BIT && TARGET_SSE2")
+
 (define_insn_and_split "sse2_stored"
   [(set (match_operand:SI 0 "nonimmediate_operand" "=xm,r")
 	(vec_select:SI
@@ -7355,6 +7362,16 @@
    (set_attr "prefix" "maybe_vex,orig,vex,maybe_vex,orig,orig")
    (set_attr "mode" "V2SF,TI,TI,TI,V4SF,V2SF")])
 
+(define_expand "vec_dupv4si"
+  [(set (match_operand:V4SI 0 "register_operand" "")
+	(vec_duplicate:V4SI
+	  (match_operand:SI 1 "nonimmediate_operand" "")))]
+  "TARGET_SSE"
+{
+  if (!TARGET_AVX)
+    operands[1] = force_reg (V4SImode, operands[1]);
+})
+
 (define_insn "*vec_dupv4si_avx"
   [(set (match_operand:V4SI 0 "register_operand"     "=x,x")
 	(vec_duplicate:V4SI
@@ -7396,6 +7413,16 @@
    (set_attr "prefix" "orig,vex,maybe_vex")
    (set_attr "mode" "TI,TI,DF")])
 
+(define_expand "vec_dupv2di"
+  [(set (match_operand:V2DI 0 "register_operand" "")
+	(vec_duplicate:V2DI
+	  (match_operand:DI 1 "nonimmediate_operand" "")))]
+  "TARGET_SSE"
+{
+  if (!TARGET_AVX)
+    operands[1] = force_reg (V2DImode, operands[1]);
+})
+
 (define_insn "*vec_dupv2di"
   [(set (match_operand:V2DI 0 "register_operand" "=x,x")
 	(vec_duplicate:V2DI
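
The new vec_dup expanders above are what the i386.c changes call when
broadcasting a promoted scalar into a vector register; roughly (illustrative,
mirrors the V4SImode case of promote_duplicated_reg, where promoted_val holds
the byte value already replicated into an SImode register):

rtx vec_reg = gen_reg_rtx (V4SImode);
/* Duplicate the promoted SImode value into all four vector lanes.  */
emit_insn (gen_vec_dupv4si (vec_reg, promoted_val));
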
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 335c1d1..479d534 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5780,6 +5780,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum machine_mode @var{mode}, unsigned int @var{align})
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options require building faster code, the hook
+can be used to prevent unaligned accesses to data in some set of modes even
+if the processor can perform such accesses without trapping.
+
+By default the hook returns the value of the @code{SLOW_UNALIGNED_ACCESS}
+macro if it is defined, and the value of @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
+@deftypefn {Target Hook} rtx TARGET_PROMOTE_RTX_FOR_MEMSET (enum machine_mode @var{mode}, rtx @var{val})
+This hook returns an rtx of mode @var{mode} containing the value @var{val}
+promoted to fill the whole mode, or @code{NULL} on failure.
+The hook emits the instructions that are needed to perform the promotion of
+@var{val} to mode @var{mode}.
+If the instructions for the promotion cannot be generated, the hook returns
+@code{NULL}.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6252,23 +6278,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
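
On the use side, target-independent code now queries the hook instead of the
@code{SLOW_UNALIGNED_ACCESS} macro; the expr.c change below is of this shape
(shown only to make the before/after explicit):

/* Before: macro form.  */
if (SLOW_UNALIGNED_ACCESS (tmode, align))
  break;

/* After: hook form, as in alignment_for_piecewise_move.  */
if (targetm.slow_unaligned_access (tmode, align))
  break;
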
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 6783826..9073d9e 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -5718,6 +5718,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@hook TARGET_SLOW_UNALIGNED_ACCESS
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options require building faster code, the hook
+can be used to prevent unaligned accesses to data in some set of modes even
+if the processor can perform such accesses without trapping.
+
+By default the hook returns the value of the @code{SLOW_UNALIGNED_ACCESS}
+macro if it is defined, and the value of @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
+@hook TARGET_PROMOTE_RTX_FOR_MEMSET
+This hook returns an rtx of mode @var{mode} containing the value @var{val}
+promoted to fill the whole mode, or @code{NULL} on failure.
+The hook emits the instructions that are needed to perform the promotion of
+@var{val} to mode @var{mode}.
+If the instructions for the promotion cannot be generated, the hook returns
+@code{NULL}.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6190,23 +6216,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/expr.c b/gcc/expr.c
index 29bf68b..bf9ed3f 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -811,7 +811,7 @@ alignment_for_piecewise_move (unsigned int max_pieces, unsigned int align)
 	   tmode != VOIDmode;
 	   xmode = tmode, tmode = GET_MODE_WIDER_MODE (tmode))
 	if (GET_MODE_SIZE (tmode) > max_pieces
-	    || SLOW_UNALIGNED_ACCESS (tmode, align))
+	    || targetm.slow_unaligned_access (tmode, align))
 	  break;
 
       align = MAX (align, GET_MODE_ALIGNMENT (xmode));
@@ -836,6 +836,48 @@ widest_int_mode_for_size (unsigned int size)
   return mode;
 }
 
+/* If MODE is a scalar mode, find the corresponding preferred vector mode.
+   If no such mode exists, return the vector mode corresponding to Pmode
+   (a kind of default vector mode).
+   For vector modes, return the mode itself.  */
+
+static enum machine_mode
+vector_mode_for_mode (enum machine_mode mode)
+{
+  enum machine_mode xmode;
+  if (VECTOR_MODE_P (mode))
+    return mode;
+  xmode = targetm.vectorize.preferred_simd_mode (mode);
+  if (VECTOR_MODE_P (xmode))
+    return xmode;
+
+  return targetm.vectorize.preferred_simd_mode (Pmode);
+}
+
+/* Check whether vector instructions are required for operating with the
+   specified mode.
+   For vector modes, check whether the corresponding vector extension is
+   supported.
+   Operations on a scalar mode will use vector extensions if the scalar
+   mode is wider than the default scalar mode (Pmode) and the vector
+   extension for the parent vector mode is available.  */
+
+bool vector_extensions_used_for_mode (enum machine_mode mode)
+{
+  enum machine_mode vector_mode = vector_mode_for_mode (mode);
+
+  if (VECTOR_MODE_P (mode))
+    return targetm.vector_mode_supported_p (mode);
+
+  /* mode is a scalar mode.  */
+  if (VECTOR_MODE_P (vector_mode)
+     && targetm.vector_mode_supported_p (vector_mode)
+     && (GET_MODE_SIZE (mode) > GET_MODE_SIZE (Pmode)))
+    return true;
+
+  return false;
+}
+
 /* STORE_MAX_PIECES is the number of bytes at a time that we can
    store efficiently.  Due to internal GCC limitations, this is
    MOVE_MAX_PIECES limited by the number of bytes GCC can represent
@@ -1680,7 +1722,7 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src, tree type, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (src)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (src))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (src))
 	      || MEM_ALIGN (src) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -2070,7 +2112,7 @@ emit_group_store (rtx orig_dst, rtx src, tree type ATTRIBUTE_UNUSED, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (dest)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (dest))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (dest))
 	      || MEM_ALIGN (dest) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -3928,7 +3970,7 @@ emit_push_insn (rtx x, enum machine_mode mode, tree type, rtx size,
 	  /* Here we avoid the case of a structure whose weak alignment
 	     forces many pushes of a small amount of data,
 	     and such small pushes do rounding that causes trouble.  */
-	  && ((! SLOW_UNALIGNED_ACCESS (word_mode, align))
+	  && ((! targetm.slow_unaligned_access (word_mode, align))
 	      || align >= BIGGEST_ALIGNMENT
 	      || (PUSH_ROUNDING (align / BITS_PER_UNIT)
 		  == (align / BITS_PER_UNIT)))
@@ -6214,7 +6256,7 @@ store_field (rtx target, HOST_WIDE_INT bitsize, HOST_WIDE_INT bitpos,
       || (mode != BLKmode
 	  && ((((MEM_ALIGN (target) < GET_MODE_ALIGNMENT (mode))
 		|| bitpos % GET_MODE_ALIGNMENT (mode))
-	       && SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (target)))
+	       && targetm.slow_unaligned_access (mode, MEM_ALIGN (target)))
 	      || (bitpos % BITS_PER_UNIT != 0)))
       /* If the RHS and field are a constant size and the size of the
 	 RHS isn't the same size as the bitfield, we must use bitfield
@@ -9617,7 +9659,7 @@ expand_expr_real_1 (tree exp, rtx target, enum machine_mode tmode,
 		     && ((modifier == EXPAND_CONST_ADDRESS
 			  || modifier == EXPAND_INITIALIZER)
 			 ? STRICT_ALIGNMENT
-			 : SLOW_UNALIGNED_ACCESS (mode1, MEM_ALIGN (op0))))
+			 : targetm.slow_unaligned_access (mode1, MEM_ALIGN (op0))))
 		    || (bitpos % BITS_PER_UNIT != 0)))
 	    /* If the type and the field are a constant size and the
 	       size of the type isn't the same size as the bitfield,
diff --git a/gcc/target.def b/gcc/target.def
index 1e09ba7..082ed99 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1498,6 +1498,22 @@ DEFHOOK
  bool, (struct ao_ref_s *ref),
  default_ref_may_alias_errno)
 
+/* True if access to unaligned data in given mode is too slow or
+   prohibited.  */
+DEFHOOK
+(slow_unaligned_access,
+ "",
+ bool, (enum machine_mode mode, unsigned int align),
+ default_slow_unaligned_access)
+
+/* Target hook.  Returns rtx of mode MODE with promoted value VAL or NULL.
+   VAL is supposed to represent one byte.  */
+DEFHOOK
+(promote_rtx_for_memset,
+ "",
+ rtx, (enum machine_mode mode, rtx val),
+ default_promote_rtx_for_memset)
+
 /* Support for named address spaces.  */
 #undef HOOK_PREFIX
 #define HOOK_PREFIX "TARGET_ADDR_SPACE_"
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index 8ad517f..617d8a3 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -1457,4 +1457,24 @@ default_pch_valid_p (const void *data_p, size_t len)
   return NULL;
 }
 
+bool
+default_slow_unaligned_access (enum machine_mode mode ATTRIBUTE_UNUSED,
+			       unsigned int align ATTRIBUTE_UNUSED)
+{
+#ifdef SLOW_UNALIGNED_ACCESS
+  return SLOW_UNALIGNED_ACCESS (mode, align);
+#else
+  return STRICT_ALIGNMENT;
+#endif
+}
+
+/* Target hook.  Returns rtx of mode MODE with promoted value VAL or NULL.
+   VAL is supposed to represent one byte.  */
+rtx
+default_promote_rtx_for_memset (enum machine_mode mode ATTRIBUTE_UNUSED,
+				 rtx val ATTRIBUTE_UNUSED)
+{
+  return NULL_RTX;
+}
+
 #include "gt-targhooks.h"
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index 552407b..08511ab 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -177,3 +177,6 @@ extern enum machine_mode default_get_reg_raw_mode(int);
 
 extern void *default_get_pch_validity (size_t *);
 extern const char *default_pch_valid_p (const void *, size_t);
+extern bool default_slow_unaligned_access (enum machine_mode mode,
+					   unsigned int align);
+extern rtx default_promote_rtx_for_memset (enum machine_mode mode, rtx val);

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-09-28 12:36           ` Michael Zolotukhin
@ 2011-09-28 12:38             ` Michael Zolotukhin
  2011-09-28 12:49             ` Andi Kleen
  1 sibling, 0 replies; 52+ messages in thread
From: Michael Zolotukhin @ 2011-09-28 12:38 UTC (permalink / raw)
  To: gcc-patches
  Cc: Jan Hubicka, Richard Guenther, H.J. Lu, izamyatin, areg.melikadamyan

[-- Attachment #1: Type: text/plain, Size: 8665 bytes --]

Attached is part 3 of the patch that enables the use of vector
instructions in memset and memcpy. This part contains only tests for
different memset/memcpy cases.

On 28 September 2011 14:57, Michael Zolotukhin
<michael.v.zolotukhin@gmail.com> wrote:
> Attached is part 2 of the patch that enables the use of vector
> instructions in memset and memcpy (back-end part).
>
> The main part of the changes is in the functions
> ix86_expand_setmem/ix86_expand_movmem. The other changes are only
> needed to support them.
> The changes mostly touch the unrolled_loop strategy – vector move
> modes can now be used there. That resulted in larger epilogues and
> prologues, so their generation was also modified.
> This patch contains some changes in middle-end (to make build
> possible) - but all these changes are present in the first part of
> patch, so there is no need to review them here.
>
> Build and 'make check' was tested.
>
>
> On 28 September 2011 14:56, Michael Zolotukhin
> <michael.v.zolotukhin@gmail.com> wrote:
>> Attached is part 1 of the patch that enables the use of vector
>> instructions in memset and memcpy (middle-end part).
>> The main part of the changes is in the functions
>> move_by_pieces/set_by_pieces. The algorithm for move-mode selection
>> was changed – it now checks whether the alignment is known at compile
>> time and uses cost models to choose between aligned and unaligned
>> vector or non-vector move modes.
>>
>> Build and 'make check' were tested - in 'make check' there is one
>> failure, which is cured when the complete patch is applied.
>>
>> On 27 September 2011 18:44, Michael Zolotukhin
>> <michael.v.zolotukhin@gmail.com> wrote:
>>> I divided the patch into three smaller ones:
>>>
>>> 1) Patch with target-independent changes (see attached file memfunc-mid.patch).
>>> The main part of the changes is in the functions
>>> move_by_pieces/set_by_pieces. The algorithm for move-mode selection
>>> was changed – it now checks whether the alignment is known at compile
>>> time and uses cost models to choose between aligned and unaligned
>>> vector or non-vector move modes.
>>>
>>> 2) Patch with target-dependent changes (memfunc-be.patch).
>>> The main part of the changes is in the functions
>>> ix86_expand_setmem/ix86_expand_movmem. The other changes are only
>>> needed to support them.
>>> The changes mostly touch the unrolled_loop strategy – vector move
>>> modes can now be used there. That resulted in larger epilogues and
>>> prologues, so their generation was also modified.
>>> This patch contains some changes in middle-end (to make build
>>> possible) - but all these changes are present in the first patch, so
>>> there is no need to review them here.
>>>
>>> 3) Patch with all new tests (memfunc-tests.patch).
>>> This patch contains a lot of small tests for different memset and memcopy cases.
>>>
>>> Separately from each other, these patches won't give a performance
>>> gain. The positive effect will be noticeable only if they are applied
>>> together (I also attach the complete patch - see file
>>> memfunc-complete.patch).
>>>
>>>
>>> If you have any questions regarding these changes, please don't
>>> hesitate to ask them.
>>>
>>>
>>> On 18 July 2011 15:00, Michael Zolotukhin
>>> <michael.v.zolotukhin@gmail.com> wrote:
>>>> Here is a summary - it probably doesn't cover every single piece of
>>>> the patch, but I tried to describe the major changes. I hope this will
>>>> help a bit - and of course I'll answer any further questions that
>>>> come up.
>>>>
>>>> The changes can be logically divided into two parts (though these
>>>> parts have something in common).
>>>> The first part is the changes to the target-independent code, in the
>>>> functions move_by_pieces() and store_by_pieces() - mostly in expr.c.
>>>> The second part touches ix86_expand_movmem() and ix86_expand_setmem()
>>>> - mostly located in config/i386/i386.c.
>>>>
>>>> Changes in i386.c (target-dependent part):
>>>> 1) Strategies for cases with known and unknown alignment are separated
>>>> from each other.
>>>> When alignment is known at compile time, we can generate optimized
>>>> code without libcalls.
>>>> When it is unknown, we can sometimes emit runtime checks to reach the
>>>> desired alignment, but not always.
>>>> Strategies for atom, generic_32 and generic_64 were chosen according
>>>> to a set of experiments; strategies in the other
>>>> cost models are unchanged (strategies for unknown alignment are copied
>>>> from the existing strategies).
>>>> 2) The unrolled_loop algorithm was modified - it now uses SSE move
>>>> modes if they are available (see the sketch right after this list).
>>>> 3) As the amount of data moved in one iteration greatly increased, the
>>>> epilogues became bigger, so some changes were needed in epilogue
>>>> generation. In some cases a special (non-unrolled) loop is generated
>>>> in the epilogue to avoid slow byte-by-byte copying (the changes in
>>>> expand_set_or_movmem_via_loop() and the introduction of
>>>> expand_set_or_movmem_via_loop_with_iter() handle these cases).
>>>> 4) As a bigger alignment might now be needed than before, prologue
>>>> generation was also modified.
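>>>>
>>>> To give a rough C-level picture of point 2 (the function name and the
>>>> chunk size here are made up for illustration; the real code is emitted
>>>> as RTL by ix86_expand_movmem, which picks aligned or unaligned vector
>>>> moves depending on what is known about alignment):
>>>>
>>>>   /* Conceptual model of the main unrolled SSE copy loop.  */
>>>>   typedef char v16qi __attribute__ ((vector_size (16)));
>>>>
>>>>   static void
>>>>   copy_main_loop (char *dst, const char *src, long n)
>>>>   {
>>>>     long i = 0;
>>>>     for (; i + 32 <= n; i += 32)   /* two 16-byte moves per iteration */
>>>>       {
>>>>         *(v16qi *) (dst + i)      = *(const v16qi *) (src + i);
>>>>         *(v16qi *) (dst + i + 16) = *(const v16qi *) (src + i + 16);
>>>>       }
>>>>     /* The remaining n - i bytes are handled by the epilogue, possibly
>>>>        with a non-unrolled loop - see point 3.  */
>>>>   }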
>>>>
>>>> Changes in expr.c (target-independent part):
>>>> There are two possible strategies now: use of aligned and unaligned
>>>> moves. For each of them a cost model was implemented and the choice is
>>>> made according to the cost of each option. Move-mode choice is made by
>>>> functions widest_mode_for_unaligned_mov() and
>>>> widest_mode_for_aligned_mov().
>>>> Cost estimation is implemented in functions compute_aligned_cost() and
>>>> compute_unaligned_cost().
>>>> Choice between these two strategies and the generation of moves
>>>> themselves are in function move_by_pieces().
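>>>>
>>>> A toy model of that choice (illustration only - the real
>>>> compute_aligned_cost()/compute_unaligned_cost() work on the actual
>>>> move sequences and the target cost tables, and this helper name is
>>>> made up):
>>>>
>>>>   /* The aligned strategy pays a one-time prologue cost to reach the
>>>>      desired alignment and then uses cheaper aligned moves; the
>>>>      unaligned strategy skips the prologue but each move costs more.  */
>>>>   static int
>>>>   use_unaligned_strategy (unsigned n_moves, unsigned aligned_move_cost,
>>>>                           unsigned unaligned_move_cost,
>>>>                           unsigned prologue_cost)
>>>>   {
>>>>     unsigned aligned_total = prologue_cost + n_moves * aligned_move_cost;
>>>>     unsigned unaligned_total = n_moves * unaligned_move_cost;
>>>>     return unaligned_total <= aligned_total;
>>>>   }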
>>>>
>>>> Function store_by_pieces() calls set_by_pieces_1() instead of
>>>> store_by_pieces_1() in the memset case (I needed to introduce
>>>> set_by_pieces_1 to separate the memset case from the others -
>>>> store_by_pieces_1 is sometimes called for strcpy and some other
>>>> functions, not only for memset).
>>>>
>>>> Set_by_pieces_1() estimates the costs of the aligned and unaligned
>>>> strategies (as in move_by_pieces()) and generates the moves for memset.
>>>> A single move is generated via generate_move_with_mode(). The first
>>>> time it is called, a promoted value (a register filled with the
>>>> one-byte value of the memset argument) is generated - later calls
>>>> reuse this value.
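>>>>
>>>> The "promoted value" is just the memset byte replicated across the
>>>> whole move mode; at the scalar level the idea is (sketch only, with a
>>>> made-up helper name - the real promote_duplicated_reg() works on RTL
>>>> and also handles the vector modes):
>>>>
>>>>   /* Replicate one byte into every byte position of a 64-bit value;
>>>>      for SSE the result is then broadcast into an xmm register.  */
>>>>   static unsigned long long
>>>>   promote_byte (unsigned char c)
>>>>   {
>>>>     return (unsigned long long) c * 0x0101010101010101ULL;
>>>>   }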
>>>>
>>>> Changes in MD-files:
>>>> For generation of promoted values, I made some changes in
>>>> promote_duplicated_reg() and promote_duplicated_reg_to_size(). Expands
>>>> for vec_dupv4si and vec_dupv2di were introduced for this as well
>>>> (these expands differ from the corresponding define_insns - the
>>>> existing define_insns work only with registers, while the new expands
>>>> can handle a memory operand as well).
>>>>
>>>> Some code was added to allow generation of MOVQ (with SSE registers)
>>>> - such moves are not the usual ones, because they use only half of an
>>>> xmm register.
>>>> There was a need to generate such moves explicitly, so I added a
>>>> simple expand to sse.md.
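>>>>
>>>> (For reference only: at the intrinsics level such a half-register MOVQ
>>>> load corresponds to what _mm_loadl_epi64 does; the patch itself works
>>>> at the RTL/expand level, and this wrapper is just an illustration.)
>>>>
>>>>   #include <emmintrin.h>
>>>>
>>>>   /* Load the low 8 bytes of memory into an xmm register: movq.  */
>>>>   static __m128i
>>>>   load_low_qword (const void *p)
>>>>   {
>>>>     return _mm_loadl_epi64 ((const __m128i *) p);
>>>>   }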
>>>>
>>>>
>>>> On 16 July 2011 03:24, Jan Hubicka <hubicka@ucw.cz> wrote:
>>>>>> > New algorithm for move-mode selection is implemented for move_by_pieces,
>>>>>> > store_by_pieces.
>>>>>> > x86-specific ix86_expand_movmem and ix86_expand_setmem are also changed in
>>>>>> > similar way, x86 cost-models parameters are slightly changed to support
>>>>>> > this. This implementation checks if array's alignment is known at compile
>>>>>> > time and chooses expanding algorithm and move-mode according to it.
>>>>>
>>>>> Can you give some sumary of changes you made?  It would make it a lot easier to
>>>>> review if it was broken up int the generic changes (with rationaly why they are
>>>>> needed) and i386 backend changes that I could review then.
>>>>>
>>>>> From first pass through the patch I don't quite see the need for i.e. adding
>>>>> new move patterns when we can output all kinds of SSE moves already.  Will look
>>>>> more into the patch to see if I can come up with useful comments.
>>>>>
>>>>> Honza
>>>>>
>>>>
>>>
>>> --
>>> ---
>>> Best regards,
>>> Michael V. Zolotukhin,
>>> Software Engineer
>>> Intel Corporation.
>>>
>>
>>
>>
>> --
>> ---
>> Best regards,
>> Michael V. Zolotukhin,
>> Software Engineer
>> Intel Corporation.
>>
>
>
>
> --
> ---
> Best regards,
> Michael V. Zolotukhin,
> Software Engineer
> Intel Corporation.
>



-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.

[-- Attachment #2: memfunc-tests.patch --]
[-- Type: application/octet-stream, Size: 56377 bytes --]

diff --git a/gcc/ChangeLog.memfunc.tests b/gcc/ChangeLog.memfunc.tests
new file mode 100644
index 0000000..2e58eab
--- /dev/null
+++ b/gcc/ChangeLog.memfunc.tests
@@ -0,0 +1,81 @@
+2011-07-11  Zolotukhin Michael  <michael.v.zolotukhin@intel.com>
+
+	* testsuite/gcc.target/i386/memset-s64-a0-1.c: New testcase.
+	* testsuite/gcc.target/i386/memset-s64-a0-2.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s768-a0-1.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s768-a0-2.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s16-a1-1.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s16-a1-1.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s64-a0-3.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s64-a0-1.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s64-a1-1.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s64-a1-1.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s64-au-1.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s64-au-1.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s512-a0-1.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s512-a0-1.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s512-a1-1.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s512-a1-1.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s512-au-1.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s512-au-1.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s3072-a1-1.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s3072-a1-1.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s3072-au-1.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s3072-au-1.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s64-a0-4.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s64-a0-5.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s768-a0-3.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s768-a0-4.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s16-a1-2.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s16-a1-2.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s64-a0-6.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s64-a0-2.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s64-a1-2.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s64-a1-2.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s64-au-2.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s64-au-2.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s512-a0-2.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s512-a0-2.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s512-a1-2.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s512-a1-2.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s512-au-2.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s512-au-2.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s3072-a1-2.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s3072-a1-2.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s3072-au-2.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s3072-au-2.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s64-a0-7.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s64-a0-8.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s768-a0-5.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s768-a0-6.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s16-a1-3.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s16-a1-3.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s64-a0-9.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s64-a0-3.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s64-a1-3.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s64-a1-3.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s64-au-3.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s64-au-3.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s512-a0-3.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s512-a0-3.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s512-a1-3.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s512-a1-3.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s512-au-3.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s512-au-3.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s64-a0-10.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s64-a0-11.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s768-a0-7.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s768-a0-8.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s16-a1-4.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s16-a1-4.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s64-a0-12.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s64-a0-4.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s64-a1-4.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s64-a1-4.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s64-au-4.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s64-au-4.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s512-a0-4.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s512-a0-4.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s512-a1-4.c: Ditto.
+	* testsuite/gcc.target/i386/memcpy-s512-a1-4.c: Ditto.
+	* testsuite/gcc.target/i386/memset-s512-au-4.c: Ditto.
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-1.c
new file mode 100644
index 0000000..c4d9fa3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-2.c
new file mode 100644
index 0000000..d25f297
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-3.c
new file mode 100644
index 0000000..0846e7c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-4.c
new file mode 100644
index 0000000..38140a1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-1.c
new file mode 100644
index 0000000..132b1e7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-2.c
new file mode 100644
index 0000000..4cfdc23
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-1.c
new file mode 100644
index 0000000..01c1324
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemcpy" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-2.c
new file mode 100644
index 0000000..fad066e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemcpy" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-1.c
new file mode 100644
index 0000000..1d1c9a8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-2.c
new file mode 100644
index 0000000..538fa73
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-3.c
new file mode 100644
index 0000000..7918557
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-4.c
new file mode 100644
index 0000000..8cdf50c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-1.c
new file mode 100644
index 0000000..ddebd95
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-2.c
new file mode 100644
index 0000000..b775354
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-3.c
new file mode 100644
index 0000000..5666b62
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-4.c
new file mode 100644
index 0000000..ed5d937
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-au-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-1.c
new file mode 100644
index 0000000..b2f3e41
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemcpy" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-au-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-2.c
new file mode 100644
index 0000000..4bc9412
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-au-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-3.c
new file mode 100644
index 0000000..b6f1479
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-1.c
new file mode 100644
index 0000000..15e0b12
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-2.c
new file mode 100644
index 0000000..a99c4ba
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-3.c
new file mode 100644
index 0000000..caa6199
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-4.c
new file mode 100644
index 0000000..40d7691
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-1.c
new file mode 100644
index 0000000..f543626
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-2.c
new file mode 100644
index 0000000..b858610
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-3.c
new file mode 100644
index 0000000..617471c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-4.c
new file mode 100644
index 0000000..eb4bf9b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-1.c
new file mode 100644
index 0000000..36223c7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-2.c
new file mode 100644
index 0000000..c05e509
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-3.c
new file mode 100644
index 0000000..08b7591
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-4.c
new file mode 100644
index 0000000..45bf2e9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-1.c
new file mode 100644
index 0000000..6416e97
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-2.c
new file mode 100644
index 0000000..481eb2e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-3.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-3.c
new file mode 100644
index 0000000..55934fd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-4.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-4.c
new file mode 100644
index 0000000..681d994
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-1.c
new file mode 100644
index 0000000..aca1224
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-2.c
new file mode 100644
index 0000000..dccdef3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-au-1.c b/gcc/testsuite/gcc.target/i386/memset-s3072-au-1.c
new file mode 100644
index 0000000..0a718ca
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemset" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-au-2.c b/gcc/testsuite/gcc.target/i386/memset-s3072-au-2.c
new file mode 100644
index 0000000..2e52789
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemset" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-1.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-1.c
new file mode 100644
index 0000000..e182d93
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-2.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-2.c
new file mode 100644
index 0000000..18c9b37
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-3.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-3.c
new file mode 100644
index 0000000..137a658
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-4.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-4.c
new file mode 100644
index 0000000..878acca
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-1.c
new file mode 100644
index 0000000..5c73cbd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-2.c
new file mode 100644
index 0000000..72bdd06e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-3.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-3.c
new file mode 100644
index 0000000..dc4c5aa
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-4.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-4.c
new file mode 100644
index 0000000..d14bce8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-1.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-1.c
new file mode 100644
index 0000000..b1ccc53
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-2.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-2.c
new file mode 100644
index 0000000..39eba30
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-3.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-3.c
new file mode 100644
index 0000000..472a12c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-4.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-4.c
new file mode 100644
index 0000000..bf6f9a1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-1.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-1.c
new file mode 100644
index 0000000..1c0c3d0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-10.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-10.c
new file mode 100644
index 0000000..1a73d2a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-10.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-11.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-11.c
new file mode 100644
index 0000000..4744f6d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-11.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-12.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-12.c
new file mode 100644
index 0000000..145ea52
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-12.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-2.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-2.c
new file mode 100644
index 0000000..93ff487
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-3.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-3.c
new file mode 100644
index 0000000..da01948
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-4.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-4.c
new file mode 100644
index 0000000..af707c9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-5.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-5.c
new file mode 100644
index 0000000..9e880da
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-5.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-6.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-6.c
new file mode 100644
index 0000000..02c5356
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-6.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-7.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-7.c
new file mode 100644
index 0000000..9230120
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-7.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-8.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-8.c
new file mode 100644
index 0000000..57a98fe
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-8.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-9.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-9.c
new file mode 100644
index 0000000..eee218f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-9.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-1.c
new file mode 100644
index 0000000..93649e6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-2.c
new file mode 100644
index 0000000..5078782
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-3.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-3.c
new file mode 100644
index 0000000..cdadae8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-4.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-4.c
new file mode 100644
index 0000000..25a9d20
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-1.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-1.c
new file mode 100644
index 0000000..c506844
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-2.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-2.c
new file mode 100644
index 0000000..f7cf5bf
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-3.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-3.c
new file mode 100644
index 0000000..0b1930e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-4.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-4.c
new file mode 100644
index 0000000..ef013b0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-1.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-1.c
new file mode 100644
index 0000000..d1331b1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-2.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-2.c
new file mode 100644
index 0000000..4f3e7b7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-3.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-3.c
new file mode 100644
index 0000000..ccbe129
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-4.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-4.c
new file mode 100644
index 0000000..3a45c4f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-5.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-5.c
new file mode 100644
index 0000000..1737703
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-5.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-6.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-6.c
new file mode 100644
index 0000000..6098a60
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-6.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-7.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-7.c
new file mode 100644
index 0000000..bfa44c7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-7.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-8.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-8.c
new file mode 100644
index 0000000..2f2cd5a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-8.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-09-28 12:36           ` Michael Zolotukhin
  2011-09-28 12:38             ` Michael Zolotukhin
@ 2011-09-28 12:49             ` Andi Kleen
  2011-09-28 12:49               ` Jakub Jelinek
  1 sibling, 1 reply; 52+ messages in thread
From: Andi Kleen @ 2011-09-28 12:49 UTC (permalink / raw)
  To: Michael Zolotukhin
  Cc: gcc-patches, Jan Hubicka, Richard Guenther, H.J. Lu, izamyatin,
	areg.melikadamyan

Michael Zolotukhin <michael.v.zolotukhin@gmail.com> writes:
>
> Build and 'make check' was tested.

Could you expand a bit on the performance benefits?  Where does it help?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-09-28 12:49             ` Andi Kleen
@ 2011-09-28 12:49               ` Jakub Jelinek
  2011-09-28 12:51                 ` Jan Hubicka
  2011-09-28 12:54                 ` Michael Zolotukhin
  0 siblings, 2 replies; 52+ messages in thread
From: Jakub Jelinek @ 2011-09-28 12:49 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Michael Zolotukhin, gcc-patches, Jan Hubicka, Richard Guenther,
	H.J. Lu, izamyatin, areg.melikadamyan

On Wed, Sep 28, 2011 at 04:41:47AM -0700, Andi Kleen wrote:
> Michael Zolotukhin <michael.v.zolotukhin@gmail.com> writes:
> >
> > Build and 'make check' was tested.
> 
> Could you expand a bit on the performance benefits?  Where does it help?

Especially when glibc these days has very well optimized implementations
tuned for various CPUs, and it is very unlikely to be beneficial to inline
memcpy/memset if they aren't really short or have an unknown number of
iterations.

	Jakub

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-09-28 12:49               ` Jakub Jelinek
@ 2011-09-28 12:51                 ` Jan Hubicka
  2011-09-28 13:31                   ` Michael Zolotukhin
  2011-09-28 12:54                 ` Michael Zolotukhin
  1 sibling, 1 reply; 52+ messages in thread
From: Jan Hubicka @ 2011-09-28 12:51 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Andi Kleen, Michael Zolotukhin, gcc-patches, Jan Hubicka,
	Richard Guenther, H.J. Lu, izamyatin, areg.melikadamyan

> On Wed, Sep 28, 2011 at 04:41:47AM -0700, Andi Kleen wrote:
> > Michael Zolotukhin <michael.v.zolotukhin@gmail.com> writes:
> > >
> > > Build and 'make check' was tested.
> > 
> > Could you expand a bit on the performance benefits?  Where does it help?
> 
> Especially when glibc these days has very well optimized implementations
> tuned for various CPUs, and it is very unlikely to be beneficial to inline
> memcpy/memset if they aren't really short or have an unknown number of
> iterations.

I guess we should update the expansion tables so we produce function calls more often.
I will look at how things behave on my setup.  Do you know glibc version numbers when
the optimized string functions were introduced?

Concerning inline SSE, I think it makes a lot of sense when we know size &
alignment so we can output just a few SSE moves instead of more integer moves.
We definitely need some numbers for the loop variants.
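
A minimal sketch of the kind of case meant here (illustrative only, not
taken from the patch): both the size and the 16-byte alignment are known
at compile time, so the clear below could be covered by a couple of
aligned 16-byte stores instead of four or eight integer stores.

/* Hypothetical example; the struct and function names are made up.  */
struct aligned_buf
{
  char b[32];
} __attribute__ ((aligned (16)));

void
clear_buf (struct aligned_buf *p)
{
  __builtin_memset (p->b, 0, sizeof (p->b));
}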

Honza

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-09-28 12:49               ` Jakub Jelinek
  2011-09-28 12:51                 ` Jan Hubicka
@ 2011-09-28 12:54                 ` Michael Zolotukhin
  1 sibling, 0 replies; 52+ messages in thread
From: Michael Zolotukhin @ 2011-09-28 12:54 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Andi Kleen, gcc-patches, Jan Hubicka, Richard Guenther, H.J. Lu,
	izamyatin, areg.melikadamyan

This expanding only works on relatively small sizes (up to 4k), where the
overhead of a library call can be quite significant. In some cases the new
implementation gives a 5x acceleration (especially on small sizes - less
than ~256 bytes). On almost all sizes from 16 to 4096 bytes there is some
gain; on average it's 20-30% on 64-bit and 40-50% on 32-bit (on Atom).
This inlining implementation isn't intended to replace glibc; it's
intended to replace the old implementation, which is sometimes quite slow.

If glibc calls turn out to be faster than this expanding, a libcall is
generated (special experiments were carried out to find the threshold
values in the cost models).

If the size is unknown at compile time, this inlining isn't done at all (i.e. glibc is called).
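
To make that concrete, here is a minimal sketch (illustrative only, not
part of the patch): the first call has a small compile-time-constant size
and is a candidate for inline expansion with SSE moves, while the second
has a size known only at run time and is expected to stay a glibc call.

#include <string.h>

char buf[4096];

void
set_known_size (void)
{
  /* Size is a small compile-time constant: candidate for inlining.  */
  memset (buf, 0, 256);
}

void
set_unknown_size (size_t n)
{
  /* Size is only known at run time: expected to remain a libcall.  */
  memset (buf, 0, n);
}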

On 28 September 2011 15:55, Jakub Jelinek <jakub@redhat.com> wrote:
> On Wed, Sep 28, 2011 at 04:41:47AM -0700, Andi Kleen wrote:
>> Michael Zolotukhin <michael.v.zolotukhin@gmail.com> writes:
>> >
>> > Build and 'make check' was tested.
>>
>> Could you expand a bit on the performance benefits?  Where does it help?
>
> Especially when glibc these days has very well optimized implementations
> tuned for various CPUs, and it is very unlikely to be beneficial to inline
> memcpy/memset if they aren't really short or have an unknown number of
> iterations.
>
>        Jakub
>



-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-09-28 12:51                 ` Jan Hubicka
@ 2011-09-28 13:31                   ` Michael Zolotukhin
  2011-09-28 13:33                     ` Jan Hubicka
  0 siblings, 1 reply; 52+ messages in thread
From: Michael Zolotukhin @ 2011-09-28 13:31 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Jakub Jelinek, Andi Kleen, gcc-patches, Richard Guenther,
	H.J. Lu, izamyatin, areg.melikadamyan

> Do you know glibc version numbers when
> the optimized string functions were introduced?

Afaik, it's 2.13.
I also compared my implementation to 2.13.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-09-28 13:31                   ` Michael Zolotukhin
@ 2011-09-28 13:33                     ` Jan Hubicka
  2011-09-28 13:51                       ` Michael Zolotukhin
  2011-09-28 16:21                       ` Andi Kleen
  0 siblings, 2 replies; 52+ messages in thread
From: Jan Hubicka @ 2011-09-28 13:33 UTC (permalink / raw)
  To: Michael Zolotukhin
  Cc: Jan Hubicka, Jakub Jelinek, Andi Kleen, gcc-patches,
	Richard Guenther, H.J. Lu, izamyatin, areg.melikadamyan

> > Do you know glibc version numbers when
> > the optimized string functions were introduced?
> 
> Afaik, it's 2.13.
> I also compared my implementation to 2.13.

I wonder if we can assume that most of GCC 4.7 based systems will be glibc 2.13
based, too.  I would tend to say that yes and thus would suggest to tame down
inlining that is no longer profitable on newer glibcs with a note in
changes.html...

(I worry about the tables in i386.c deciding what strategy to use for a block of
a given size. This is more or less unrelated to the actual patch)

Honza

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-09-28 13:33                     ` Jan Hubicka
@ 2011-09-28 13:51                       ` Michael Zolotukhin
  2011-09-28 16:21                       ` Andi Kleen
  1 sibling, 0 replies; 52+ messages in thread
From: Michael Zolotukhin @ 2011-09-28 13:51 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Jakub Jelinek, Andi Kleen, gcc-patches, Richard Guenther,
	H.J. Lu, izamyatin, areg.melikadamyan

> (I worry about the tables in i386.c deciding what strategy to use for a block of
> a given size. This is more or less unrelated to the actual patch)
Yep, the threshold values I mentioned above are the values in these
tables. Even with a fast glibc there are some cases where inlining is
profitable (e.g. if alignment is known at compile time).
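
The new tests in the patch exercise exactly this split; a condensed
sketch of the two situations (illustrative only):

#include <string.h>

char global_dst[64];  /* Alignment of a global array is known at compile
                         time, so aligned SSE stores can be used.  */
char *ptr_dst;        /* Alignment behind an arbitrary pointer is unknown,
                         so unaligned moves (or a libcall) are needed.  */

void
set_aligned (void)
{
  memset (global_dst, 0, sizeof (global_dst));
}

void
set_unaligned (void)
{
  memset (ptr_dst, 0, 64);
}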

On 28 September 2011 16:54, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > Do you know glibc version numbers when
>> > the optimized string functions were introduced?
>>
>> Afaik, it's 2.13.
>> I also compared my implementation to 2.13.
>
> I wonder if we can assume that most of GCC 4.7 based systems will be glibc 2.13
> based, too.  I would tend to say that yes and thus would suggest to tame down
> inlining that is no longer profitable on newer glibcs with a note in
> changes.html...
>
> (I worry about the tables in i386.c deciding what strategy to use for a block of
> a given size. This is more or less unrelated to the actual patch)
>
> Honza
>



-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-09-28 12:29         ` Michael Zolotukhin
  2011-09-28 12:36           ` Michael Zolotukhin
@ 2011-09-28 14:15           ` Jack Howarth
  2011-09-28 14:28             ` Michael Zolotukhin
  1 sibling, 1 reply; 52+ messages in thread
From: Jack Howarth @ 2011-09-28 14:15 UTC (permalink / raw)
  To: Michael Zolotukhin
  Cc: gcc-patches, Jan Hubicka, Richard Guenther, H.J. Lu, izamyatin,
	areg.melikadamyan

On Wed, Sep 28, 2011 at 02:56:30PM +0400, Michael Zolotukhin wrote:
> Attached is part 1 of the patch that enables the use of vector instructions
> in memset and memcpy (middle-end part).
> The main part of the changes is in the functions
> move_by_pieces/set_by_pieces. In the new version the algorithm for move-mode
> selection was changed: now it checks whether alignment is known at compile
> time and uses cost models to choose between aligned and unaligned,
> vector or non-vector move modes.
> 

Michael,
   It appears that part 1 of the patch wasn't really attached.
                   Jack

> 
> 
> 
> -- 
> ---
> Best regards,
> Michael V. Zolotukhin,
> Software Engineer
> Intel Corporation.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-09-28 14:15           ` Jack Howarth
@ 2011-09-28 14:28             ` Michael Zolotukhin
  2011-09-28 22:52               ` Jack Howarth
  0 siblings, 1 reply; 52+ messages in thread
From: Michael Zolotukhin @ 2011-09-28 14:28 UTC (permalink / raw)
  To: Jack Howarth
  Cc: gcc-patches, Jan Hubicka, Richard Guenther, H.J. Lu, izamyatin,
	areg.melikadamyan

[-- Attachment #1: Type: text/plain, Size: 85 bytes --]

>   It appears that part 1 of the patch wasn't really attached.
Thanks, resending.

[-- Attachment #2: memfunc-mid.patch --]
[-- Type: application/octet-stream, Size: 40780 bytes --]

diff --git a/gcc/builtins.c b/gcc/builtins.c
index b79ce6f..5c95577 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -3568,7 +3568,8 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
 				  builtin_memset_read_str, &c, dest_align,
 				  true))
 	store_by_pieces (dest_mem, tree_low_cst (len, 1),
-			 builtin_memset_read_str, &c, dest_align, true, 0);
+			 builtin_memset_read_str, gen_int_mode (c, val_mode),
+			 dest_align, true, 0);
       else if (!set_storage_via_setmem (dest_mem, len_rtx,
 					gen_int_mode (c, val_mode),
 					dest_align, expected_align,
diff --git a/gcc/cse.c b/gcc/cse.c
index ae67685..3b6471d 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -4616,7 +4616,10 @@ cse_insn (rtx insn)
 		 to fold switch statements when an ADDR_DIFF_VEC is used.  */
 	      || (GET_CODE (src_folded) == MINUS
 		  && GET_CODE (XEXP (src_folded, 0)) == LABEL_REF
-		  && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF)))
+		  && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF))
+	      /* Don't propagate vector-constants, as for now no architecture
+		 supports vector immediates.  */
+	  && !vector_extensions_used_for_mode (mode))
 	src_const = src_folded, src_const_elt = elt;
       else if (src_const == 0 && src_eqv_here && CONSTANT_P (src_eqv_here))
 	src_const = src_eqv_here, src_const_elt = src_eqv_elt;
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 335c1d1..479d534 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5780,6 +5780,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum machine_mode @var{mode}, unsigned int @var{align})
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options require building faster code, the hook can
+be used to prevent access to unaligned data in some set of modes even if the
+processor can do the access without a trap.
+
+By default the hook returns the value of the @code{SLOW_UNALIGNED_ACCESS}
+macro if it is defined, and @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
+@deftypefn {Target Hook} rtx TARGET_PROMOTE_RTX_FOR_MEMSET (enum machine_mode @var{mode}, rtx @var{val})
+This hook returns rtx of mode MODE with promoted value VAL or NULL.
+The hook generates instruction, that are needed for performing promotion of
+@var{val} to mode @var{mode}.
+If generation of instructions for promotion failed, the hook returns NULL.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6252,23 +6278,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 6783826..9073d9e 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -5718,6 +5718,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@hook TARGET_SLOW_UNALIGNED_ACCESS
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options require building faster code, the hook can
+be used to prevent access to unaligned data in some set of modes even if the
+processor can do the access without a trap.
+
+By default the hook returns the value of the @code{SLOW_UNALIGNED_ACCESS}
+macro if it is defined, and @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
+@hook TARGET_PROMOTE_RTX_FOR_MEMSET
+This hook returns an rtx of mode @var{mode} with the promoted value @var{val}, or NULL.
+The hook generates the instructions that are needed for performing the
+promotion of @var{val} to mode @var{mode}.
+If generation of the promotion instructions fails, the hook returns NULL.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6190,23 +6216,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c
index ee38d3c..e2e4fda 100644
--- a/gcc/emit-rtl.c
+++ b/gcc/emit-rtl.c
@@ -1494,6 +1494,11 @@ get_mem_align_offset (rtx mem, unsigned int align)
       if (TYPE_ALIGN (TREE_TYPE (expr)) < (unsigned int) align)
 	return -1;
     }
+  else if (TREE_CODE (expr) == MEM_REF)
+    {
+      if (MEM_ALIGN (mem) < (unsigned int) align)
+	return -1;
+    }
   else if (TREE_CODE (expr) == COMPONENT_REF)
     {
       while (1)
@@ -2108,8 +2113,12 @@ adjust_address_1 (rtx memref, enum machine_mode mode, HOST_WIDE_INT offset,
      if zero.  */
   if (offset != 0)
     {
-      max_align = (offset & -offset) * BITS_PER_UNIT;
-      attrs.align = MIN (attrs.align, max_align);
+      int old_offset = get_mem_align_offset (memref, MOVE_MAX*BITS_PER_UNIT);
+      if (old_offset >= 0)
+	attrs.align = compute_align_by_offset (old_offset + offset);
+      else
+	attrs.align = MIN (attrs.align,
+	      (unsigned HOST_WIDE_INT) (offset & -offset) * BITS_PER_UNIT);
     }
 
   /* We can compute the size in a number of ways.  */
diff --git a/gcc/expr.c b/gcc/expr.c
index 29bf68b..8f87944 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -126,15 +126,18 @@ struct store_by_pieces_d
 static unsigned HOST_WIDE_INT move_by_pieces_ninsns (unsigned HOST_WIDE_INT,
 						     unsigned int,
 						     unsigned int);
-static void move_by_pieces_1 (rtx (*) (rtx, ...), enum machine_mode,
-			      struct move_by_pieces_d *);
+static void move_by_pieces_insn (rtx (*) (rtx, ...), enum machine_mode,
+		  struct move_by_pieces_d *);
 static bool block_move_libcall_safe_for_call_parm (void);
 static bool emit_block_move_via_movmem (rtx, rtx, rtx, unsigned, unsigned, HOST_WIDE_INT);
 static tree emit_block_move_libcall_fn (int);
 static void emit_block_move_via_loop (rtx, rtx, rtx, unsigned);
 static rtx clear_by_pieces_1 (void *, HOST_WIDE_INT, enum machine_mode);
 static void clear_by_pieces (rtx, unsigned HOST_WIDE_INT, unsigned int);
+static void set_by_pieces_1 (struct store_by_pieces_d *, unsigned int);
 static void store_by_pieces_1 (struct store_by_pieces_d *, unsigned int);
+static void set_by_pieces_2 (rtx (*) (rtx, ...), enum machine_mode,
+			       struct store_by_pieces_d *, rtx);
 static void store_by_pieces_2 (rtx (*) (rtx, ...), enum machine_mode,
 			       struct store_by_pieces_d *);
 static tree clear_storage_libcall_fn (int);
@@ -163,6 +166,12 @@ static void do_tablejump (rtx, enum machine_mode, rtx, rtx, rtx);
 static rtx const_vector_from_tree (tree);
 static void write_complex_part (rtx, rtx, bool);
 
+static enum machine_mode widest_mode_for_unaligned_mov (unsigned HOST_WIDE_INT);
+static enum machine_mode widest_mode_for_aligned_mov (unsigned HOST_WIDE_INT,
+						      unsigned int);
+static enum machine_mode generate_move_with_mode (struct store_by_pieces_d *,
+					   enum machine_mode, rtx *, rtx *);
+
 /* This macro is used to determine whether move_by_pieces should be called
    to perform a structure copy.  */
 #ifndef MOVE_BY_PIECES_P
@@ -811,7 +820,7 @@ alignment_for_piecewise_move (unsigned int max_pieces, unsigned int align)
 	   tmode != VOIDmode;
 	   xmode = tmode, tmode = GET_MODE_WIDER_MODE (tmode))
 	if (GET_MODE_SIZE (tmode) > max_pieces
-	    || SLOW_UNALIGNED_ACCESS (tmode, align))
+	    || targetm.slow_unaligned_access (tmode, align))
 	  break;
 
       align = MAX (align, GET_MODE_ALIGNMENT (xmode));
@@ -820,11 +829,66 @@ alignment_for_piecewise_move (unsigned int max_pieces, unsigned int align)
   return align;
 }
 
+/* Given an offset from an alignment boundary,
+   compute the maximal alignment of the data at that offset.  */
+unsigned int
+compute_align_by_offset (int offset)
+{
+    return (offset==0) ?
+	    MOVE_MAX * BITS_PER_UNIT :
+	    MIN (MOVE_MAX, (offset & -offset)) * BITS_PER_UNIT;
+}
+
+/* Estimate cost of move for given size and offset.  Offset is used for
+   determining max alignment.  */
+static int
+compute_aligned_cost (unsigned HOST_WIDE_INT size, int offset)
+{
+  unsigned HOST_WIDE_INT cost = 0;
+  int cur_off = offset;
+
+  while (size > 0)
+    {
+      enum machine_mode mode = widest_mode_for_aligned_mov (size,
+	  compute_align_by_offset (cur_off));
+      int cur_mode_cost;
+      enum vect_cost_for_stmt type_of_cost = vector_load;
+      if (GET_MODE_SIZE (mode) <= UNITS_PER_WORD
+	  && (SCALAR_INT_MODE_P (mode) || SCALAR_FLOAT_MODE_P (mode)))
+	type_of_cost = scalar_load;
+      cur_mode_cost =
+	targetm.vectorize.builtin_vectorization_cost (type_of_cost, NULL, 0);
+      size -= GET_MODE_SIZE (mode);
+      cur_off += GET_MODE_SIZE (mode);
+      cost += cur_mode_cost;
+    }
+  return cost;
+}
+
+/* Estimate the cost of a move for a given size.  It's assumed that the
+   alignment is unknown, so we need to use unaligned moves.  */
+static int
+compute_unaligned_cost (unsigned HOST_WIDE_INT size)
+{
+  unsigned HOST_WIDE_INT cost = 0;
+  while (size > 0)
+    {
+      enum machine_mode mode = widest_mode_for_unaligned_mov (size);
+      unsigned HOST_WIDE_INT n_insns = size/GET_MODE_SIZE (mode);
+      int cur_mode_cost =
+	targetm.vectorize.builtin_vectorization_cost (unaligned_load, NULL, 0);
+
+      cost += n_insns*cur_mode_cost;
+      size %= GET_MODE_SIZE (mode);
+    }
+  return cost;
+}
+
 /* Return the widest integer mode no wider than SIZE.  If no such mode
    can be found, return VOIDmode.  */
 
 static enum machine_mode
-widest_int_mode_for_size (unsigned int size)
+widest_int_mode_for_size (unsigned HOST_WIDE_INT size)
 {
   enum machine_mode tmode, mode = VOIDmode;
 
@@ -836,6 +900,170 @@ widest_int_mode_for_size (unsigned int size)
   return mode;
 }
 
+/* If MODE is a scalar mode, find the corresponding preferred vector mode.
+   If no such mode can be found, return the vector mode corresponding to Pmode
+   (a kind of default vector mode).
+   For vector modes return the mode itself.  */
+
+static enum machine_mode
+vector_mode_for_mode (enum machine_mode mode)
+{
+  enum machine_mode xmode;
+  if (VECTOR_MODE_P (mode))
+    return mode;
+  xmode = targetm.vectorize.preferred_simd_mode (mode);
+  if (VECTOR_MODE_P (xmode))
+    return xmode;
+
+  return targetm.vectorize.preferred_simd_mode (Pmode);
+}
+
+/* The routine checks if vector instructions are required for operating
+   with the specified mode.
+   For vector modes it checks whether the corresponding vector extension is
+   supported.
+   Operations with a scalar mode will use vector extensions if this scalar
+   mode is wider than the default scalar mode (Pmode) and a vector extension
+   for the parent vector mode is available.  */
+
+bool vector_extensions_used_for_mode (enum machine_mode mode)
+{
+  enum machine_mode vector_mode = vector_mode_for_mode (mode);
+
+  if (VECTOR_MODE_P (mode))
+    return targetm.vector_mode_supported_p (mode);
+
+  /* mode is a scalar mode.  */
+  if (VECTOR_MODE_P (vector_mode)
+     && targetm.vector_mode_supported_p (vector_mode)
+     && (GET_MODE_SIZE (mode) > GET_MODE_SIZE (Pmode)))
+    return true;
+
+  return false;
+}
+
+/* Find the widest move mode for the given size if alignment is unknown.  */
+static enum machine_mode
+widest_mode_for_unaligned_mov (unsigned HOST_WIDE_INT size)
+{
+  enum machine_mode mode;
+  enum machine_mode tmode, xmode;
+  enum machine_mode best_simd_mode = targetm.vectorize.preferred_simd_mode (
+      mode_for_size (UNITS_PER_WORD*BITS_PER_UNIT, MODE_INT, 0));
+
+  /* Find the widest integer mode.  Here we can find modes wider than Pmode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_INT), xmode = VOIDmode;
+       tmode != VOIDmode;
+       tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size
+	  || targetm.slow_unaligned_access (tmode, BITS_PER_UNIT))
+	break;
+      if (optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.scalar_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+  mode = xmode;
+
+  /* Find the widest vector mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_VECTOR_INT), xmode = VOIDmode;
+       tmode != VOIDmode;
+       tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size
+	  || targetm.slow_unaligned_access (tmode, BITS_PER_UNIT))
+	break;
+      if (GET_MODE_SIZE (GET_MODE_INNER (tmode)) == UNITS_PER_WORD
+	  && optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.vector_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+
+  /* Choose between integer and vector modes.  */
+  if (xmode != VOIDmode && GET_MODE_SIZE (xmode) > GET_MODE_SIZE (mode))
+    mode = xmode;
+
+  /* If found vector and scalar modes have the same sizes, and vector mode is
+     best_simd_mode, then prefer vector mode to scalar mode.  */
+  if (xmode != VOIDmode
+      && GET_MODE_SIZE (xmode) == GET_MODE_SIZE (mode)
+      && xmode == best_simd_mode)
+    mode = xmode;
+
+  /* If we failed to find a mode that might use vector extensions, try to
+     find widest ordinary integer mode.  */
+  if (mode == VOIDmode)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  /* If the mode found won't use vector extensions, then there is no need to
+     use a mode wider than Pmode.  */
+  if (!vector_extensions_used_for_mode (mode)
+      && GET_MODE_SIZE (mode) > MOVE_MAX_PIECES)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  return mode;
+}
+
+/* Find the widest move mode for the given size and alignment.  */
+static enum machine_mode
+widest_mode_for_aligned_mov (unsigned HOST_WIDE_INT size, unsigned int align)
+{
+  enum machine_mode mode;
+  enum machine_mode tmode, xmode;
+  enum machine_mode best_simd_mode = targetm.vectorize.preferred_simd_mode (
+      mode_for_size (UNITS_PER_WORD*BITS_PER_UNIT, MODE_INT, 0));
+
+  /* Find the widest integer mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_INT), xmode = VOIDmode;
+      tmode != VOIDmode;
+      tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size || GET_MODE_ALIGNMENT (tmode) > align)
+	break;
+      if (optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.scalar_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+  mode = xmode;
+
+  /* Find the widest vector mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_VECTOR_INT), xmode = VOIDmode;
+      tmode != VOIDmode;
+      tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size || GET_MODE_ALIGNMENT (tmode) > align)
+	break;
+      if (GET_MODE_SIZE (GET_MODE_INNER (tmode)) == UNITS_PER_WORD &&
+	  optab_handler (mov_optab, tmode) != CODE_FOR_nothing     &&
+	  targetm.vector_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+
+  /* Choose between integer and vector modes.  */
+  if (xmode != VOIDmode && GET_MODE_SIZE (xmode) > GET_MODE_SIZE (mode))
+    mode = xmode;
+
+  /* If found vector and scalar modes have the same sizes, and vector mode is
+     best_simd_mode, then prefer vector mode to scalar mode.  */
+  if (xmode != VOIDmode
+      && GET_MODE_SIZE (xmode) == GET_MODE_SIZE (mode)
+      && xmode == best_simd_mode)
+    mode = xmode;
+
+  /* If we failed to find a mode that might use vector extensions, try to
+     find widest ordinary integer mode.  */
+  if (mode == VOIDmode)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  /* If the mode found won't use vector extensions, then there is no need to
+     use a mode wider than Pmode.  */
+  if (!vector_extensions_used_for_mode (mode)
+      && GET_MODE_SIZE (mode) > MOVE_MAX_PIECES)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  return mode;
+}
+
 /* STORE_MAX_PIECES is the number of bytes at a time that we can
    store efficiently.  Due to internal GCC limitations, this is
    MOVE_MAX_PIECES limited by the number of bytes GCC can represent
@@ -876,6 +1104,7 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
   rtx to_addr, from_addr = XEXP (from, 0);
   unsigned int max_size = MOVE_MAX_PIECES + 1;
   enum insn_code icode;
+  int dst_offset, src_offset;
 
   align = MIN (to ? MEM_ALIGN (to) : align, MEM_ALIGN (from));
 
@@ -960,23 +1189,37 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
 	data.to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
     }
 
-  align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
-
-  /* First move what we can in the largest integer mode, then go to
-     successively smaller modes.  */
-
-  while (max_size > 1)
+  src_offset = get_mem_align_offset (from, MOVE_MAX*BITS_PER_UNIT);
+  dst_offset = get_mem_align_offset (to, MOVE_MAX*BITS_PER_UNIT);
+  if (src_offset < 0
+      || dst_offset < 0
+      || src_offset != dst_offset
+      || compute_aligned_cost (data.len, src_offset) >=
+	 compute_unaligned_cost (data.len))
     {
-      enum machine_mode mode = widest_int_mode_for_size (max_size);
-
-      if (mode == VOIDmode)
-	break;
+      while (data.len > 0)
+	{
+	  enum machine_mode mode = widest_mode_for_unaligned_mov (data.len);
 
-      icode = optab_handler (mov_optab, mode);
-      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
-	move_by_pieces_1 (GEN_FCN (icode), mode, &data);
+	  icode = optab_handler (mov_optab, mode);
+	  gcc_assert (icode != CODE_FOR_nothing);
+	  move_by_pieces_insn (GEN_FCN (icode), mode, &data);
+	}
+    }
+  else
+    {
+      while (data.len > 0)
+	{
+	  enum machine_mode mode;
+	  mode = widest_mode_for_aligned_mov (data.len,
+	      compute_align_by_offset (src_offset));
 
-      max_size = GET_MODE_SIZE (mode);
+	  icode = optab_handler (mov_optab, mode);
+	  gcc_assert (icode != CODE_FOR_nothing &&
+	      compute_align_by_offset (src_offset) >= GET_MODE_ALIGNMENT (mode));
+	  move_by_pieces_insn (GEN_FCN (icode), mode, &data);
+	  src_offset += GET_MODE_SIZE (mode);
+	}
     }
 
   /* The code above should have handled everything.  */
@@ -1014,35 +1257,47 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
 }
 
 /* Return number of insns required to move L bytes by pieces.
-   ALIGN (in bits) is maximum alignment we can assume.  */
+   ALIGN (in bits) is maximum alignment we can assume.
+   This is just an estimate, so the actual number of instructions might
+   differ from it (there are several ways of expanding memmove).  */
 
 static unsigned HOST_WIDE_INT
 move_by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align,
-		       unsigned int max_size)
+		       unsigned int max_size ATTRIBUTE_UNUSED)
 {
   unsigned HOST_WIDE_INT n_insns = 0;
-
-  align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
-
-  while (max_size > 1)
+  unsigned HOST_WIDE_INT n_insns_u = 0;
+  enum machine_mode mode;
+  unsigned HOST_WIDE_INT len = l;
+  while (len > 0)
     {
-      enum machine_mode mode;
-      enum insn_code icode;
-
-      mode = widest_int_mode_for_size (max_size);
-
-      if (mode == VOIDmode)
-	break;
+      mode = widest_mode_for_aligned_mov (len, align);
+      if (GET_MODE_SIZE (mode) < MOVE_MAX)
+	{
+	  align += GET_MODE_ALIGNMENT (mode);
+	  len -= GET_MODE_SIZE (mode);
+	  n_insns ++;
+	}
+      else
+	{
+	  /* We are using the widest mode.  */
+	  n_insns += len/GET_MODE_SIZE (mode);
+	  len = len%GET_MODE_SIZE (mode);
+	}
+    }
+  gcc_assert (!len);
 
-      icode = optab_handler (mov_optab, mode);
-      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
-	n_insns += l / GET_MODE_SIZE (mode), l %= GET_MODE_SIZE (mode);
+  len = l;
+  while (len > 0)
+    {
+      mode = widest_mode_for_unaligned_mov (len);
+      n_insns_u += len/GET_MODE_SIZE (mode);
+      len = len%GET_MODE_SIZE (mode);
 
-      max_size = GET_MODE_SIZE (mode);
     }
 
-  gcc_assert (!l);
-  return n_insns;
+  gcc_assert (!len);
+  return MIN (n_insns, n_insns_u);
 }
 
 /* Subroutine of move_by_pieces.  Move as many bytes as appropriate
@@ -1050,60 +1305,57 @@ move_by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align,
    to make a move insn for that mode.  DATA has all the other info.  */
 
 static void
-move_by_pieces_1 (rtx (*genfun) (rtx, ...), enum machine_mode mode,
+move_by_pieces_insn (rtx (*genfun) (rtx, ...), enum machine_mode mode,
 		  struct move_by_pieces_d *data)
 {
   unsigned int size = GET_MODE_SIZE (mode);
   rtx to1 = NULL_RTX, from1;
 
-  while (data->len >= size)
-    {
-      if (data->reverse)
-	data->offset -= size;
-
-      if (data->to)
-	{
-	  if (data->autinc_to)
-	    to1 = adjust_automodify_address (data->to, mode, data->to_addr,
-					     data->offset);
-	  else
-	    to1 = adjust_address (data->to, mode, data->offset);
-	}
+  if (data->reverse)
+    data->offset -= size;
 
-      if (data->autinc_from)
-	from1 = adjust_automodify_address (data->from, mode, data->from_addr,
-					   data->offset);
+  if (data->to)
+    {
+      if (data->autinc_to)
+	to1 = adjust_automodify_address (data->to, mode, data->to_addr,
+					 data->offset);
       else
-	from1 = adjust_address (data->from, mode, data->offset);
+	to1 = adjust_address (data->to, mode, data->offset);
+    }
 
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
-	emit_insn (gen_add2_insn (data->to_addr,
-				  GEN_INT (-(HOST_WIDE_INT)size)));
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_from < 0)
-	emit_insn (gen_add2_insn (data->from_addr,
-				  GEN_INT (-(HOST_WIDE_INT)size)));
+  if (data->autinc_from)
+    from1 = adjust_automodify_address (data->from, mode, data->from_addr,
+				       data->offset);
+  else
+    from1 = adjust_address (data->from, mode, data->offset);
 
-      if (data->to)
-	emit_insn ((*genfun) (to1, from1));
-      else
-	{
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
+    emit_insn (gen_add2_insn (data->to_addr,
+			      GEN_INT (-(HOST_WIDE_INT)size)));
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_from < 0)
+    emit_insn (gen_add2_insn (data->from_addr,
+			      GEN_INT (-(HOST_WIDE_INT)size)));
+
+  if (data->to)
+    emit_insn ((*genfun) (to1, from1));
+  else
+    {
 #ifdef PUSH_ROUNDING
-	  emit_single_push_insn (mode, from1, NULL);
+      emit_single_push_insn (mode, from1, NULL);
 #else
-	  gcc_unreachable ();
+      gcc_unreachable ();
 #endif
-	}
+    }
 
-      if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
-	emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
-      if (HAVE_POST_INCREMENT && data->explicit_inc_from > 0)
-	emit_insn (gen_add2_insn (data->from_addr, GEN_INT (size)));
+  if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
+    emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
+  if (HAVE_POST_INCREMENT && data->explicit_inc_from > 0)
+    emit_insn (gen_add2_insn (data->from_addr, GEN_INT (size)));
 
-      if (! data->reverse)
-	data->offset += size;
+  if (! data->reverse)
+    data->offset += size;
 
-      data->len -= size;
-    }
+  data->len -= size;
 }
 \f
 /* Emit code to move a block Y to a block X.  This may be done with
@@ -1680,7 +1932,7 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src, tree type, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (src)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (src))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (src))
 	      || MEM_ALIGN (src) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -2070,7 +2322,7 @@ emit_group_store (rtx orig_dst, rtx src, tree type ATTRIBUTE_UNUSED, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (dest)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (dest))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (dest))
 	      || MEM_ALIGN (dest) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -2358,7 +2610,10 @@ store_by_pieces (rtx to, unsigned HOST_WIDE_INT len,
   data.constfundata = constfundata;
   data.len = len;
   data.to = to;
-  store_by_pieces_1 (&data, align);
+  if (memsetp)
+    set_by_pieces_1 (&data, align);
+  else
+    store_by_pieces_1 (&data, align);
   if (endp)
     {
       rtx to1;
@@ -2402,10 +2657,10 @@ clear_by_pieces (rtx to, unsigned HOST_WIDE_INT len, unsigned int align)
     return;
 
   data.constfun = clear_by_pieces_1;
-  data.constfundata = NULL;
+  data.constfundata = CONST0_RTX (QImode);
   data.len = len;
   data.to = to;
-  store_by_pieces_1 (&data, align);
+  set_by_pieces_1 (&data, align);
 }
 
 /* Callback routine for clear_by_pieces.
@@ -2419,13 +2674,121 @@ clear_by_pieces_1 (void *data ATTRIBUTE_UNUSED,
   return const0_rtx;
 }
 
-/* Subroutine of clear_by_pieces and store_by_pieces.
+/* Helper function for set by pieces - generates a move with the given mode.
+   Returns the mode used in the generated move (it could differ from the
+   requested one if the requested mode isn't supported).  */
+static enum machine_mode generate_move_with_mode (
+			      struct store_by_pieces_d *data,
+			      enum machine_mode mode,
+			      rtx *promoted_to_vector_value_ptr,
+			      rtx *promoted_value_ptr)
+{
+  enum insn_code icode;
+  rtx rhs = NULL_RTX;
+
+  gcc_assert (promoted_to_vector_value_ptr && promoted_value_ptr);
+
+  if (vector_extensions_used_for_mode (mode))
+    {
+      enum machine_mode vec_mode = vector_mode_for_mode (mode);
+      if (!(*promoted_to_vector_value_ptr))
+	*promoted_to_vector_value_ptr
+	  = targetm.promote_rtx_for_memset (vec_mode, (rtx)data->constfundata);
+      if (*promoted_to_vector_value_ptr)
+	rhs = convert_to_mode (vec_mode, *promoted_to_vector_value_ptr, 1);
+    }
+  else
+    {
+      if (CONST_INT_P ((rtx)data->constfundata))
+	{
+	  /* We don't need to load the constant to a register, if it could be
+	     encoded as an immediate operand.  */
+	  rtx imm_const;
+	  switch (mode)
+	    {
+	    case DImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x0101010101010101, DImode);
+	      break;
+	    case SImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x01010101, SImode);
+	      break;
+	    case HImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x00000101, HImode);
+	      break;
+	    case QImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x00000001, QImode);
+	      break;
+	    default:
+	      gcc_unreachable ();
+	      break;
+	    }
+	  rhs = imm_const;
+	}
+      else /* data->constfundata isn't const.  */
+	{
+	  if (!(*promoted_value_ptr))
+	    {
+	      rtx coeff;
+	      enum machine_mode promoted_value_mode;
+	      /* Choose mode for promoted value.  It shouldn't be narrower, than
+		 Pmode.  */
+	      if (GET_MODE_SIZE (mode) > GET_MODE_SIZE (Pmode))
+		promoted_value_mode = mode;
+	      else
+		promoted_value_mode = Pmode;
+
+	      switch (promoted_value_mode)
+		{
+		case DImode:
+		  coeff = gen_int_mode (0x0101010101010101, DImode);
+		  break;
+		case SImode:
+		  coeff = gen_int_mode (0x01010101, SImode);
+		  break;
+		default:
+		  gcc_unreachable ();
+		  break;
+		}
+	      *promoted_value_ptr = convert_to_mode (promoted_value_mode,
+						     (rtx)data->constfundata,
+						     1);
+	      *promoted_value_ptr = expand_mult (promoted_value_mode,
+						 *promoted_value_ptr, coeff,
+						 NULL_RTX, 1);
+	    }
+	  rhs = convert_to_mode (mode, *promoted_value_ptr, 1);
+	}
+    }
+  /* If RHS is null, then the requested mode isn't supported and can't be used.
+     Use Pmode instead.  */
+  if (!rhs)
+    {
+      generate_move_with_mode (data, Pmode, promoted_to_vector_value_ptr,
+			       promoted_value_ptr);
+      return Pmode;
+    }
+
+  gcc_assert (rhs);
+  icode = optab_handler (mov_optab, mode);
+  gcc_assert (icode != CODE_FOR_nothing);
+  set_by_pieces_2 (GEN_FCN (icode), mode, data, rhs);
+  return mode;
+}
+
+/* Subroutine of store_by_pieces.
    Generate several move instructions to store LEN bytes of block TO.  (A MEM
    rtx with BLKmode).  ALIGN is maximum alignment we can assume.  */
 
 static void
-store_by_pieces_1 (struct store_by_pieces_d *data ATTRIBUTE_UNUSED,
-		   unsigned int align ATTRIBUTE_UNUSED)
+store_by_pieces_1 (struct store_by_pieces_d *data, unsigned int align)
 {
   enum machine_mode to_addr_mode
     = targetm.addr_space.address_mode (MEM_ADDR_SPACE (data->to));
@@ -2500,6 +2863,134 @@ store_by_pieces_1 (struct store_by_pieces_d *data ATTRIBUTE_UNUSED,
   gcc_assert (!data->len);
 }
 
+/* Subroutine of clear_by_pieces and store_by_pieces.
+   Generate several move instructions to store LEN bytes of block TO.  (A MEM
+   rtx with BLKmode).  ALIGN is maximum alignment we can assume.
+   As opposed to store_by_pieces_1, this routine always generates code for
+   memset.  (store_by_pieces_1 is sometimes used to generate code for memcpy
+   rather than for memset).  */
+
+static void
+set_by_pieces_1 (struct store_by_pieces_d *data, unsigned int align)
+{
+  enum machine_mode to_addr_mode
+    = targetm.addr_space.address_mode (MEM_ADDR_SPACE (data->to));
+  rtx to_addr = XEXP (data->to, 0);
+  unsigned int max_size = STORE_MAX_PIECES + 1;
+  int dst_offset;
+  rtx promoted_to_vector_value = NULL_RTX;
+  rtx promoted_value = NULL_RTX;
+
+  data->offset = 0;
+  data->to_addr = to_addr;
+  data->autinc_to
+    = (GET_CODE (to_addr) == PRE_INC || GET_CODE (to_addr) == PRE_DEC
+       || GET_CODE (to_addr) == POST_INC || GET_CODE (to_addr) == POST_DEC);
+
+  data->explicit_inc_to = 0;
+  data->reverse
+    = (GET_CODE (to_addr) == PRE_DEC || GET_CODE (to_addr) == POST_DEC);
+  if (data->reverse)
+    data->offset = data->len;
+
+  /* If storing requires more than two move insns,
+     copy addresses to registers (to make displacements shorter)
+     and use post-increment if available.  */
+  if (!data->autinc_to
+      && move_by_pieces_ninsns (data->len, align, max_size) > 2)
+    {
+      /* Determine the main mode we'll be using.
+	 MODE might not be used depending on the definitions of the
+	 USE_* macros below.  */
+      enum machine_mode mode ATTRIBUTE_UNUSED
+	= widest_int_mode_for_size (max_size);
+
+      if (USE_STORE_PRE_DECREMENT (mode) && data->reverse && ! data->autinc_to)
+	{
+	  data->to_addr = copy_to_mode_reg (to_addr_mode,
+					    plus_constant (to_addr, data->len));
+	  data->autinc_to = 1;
+	  data->explicit_inc_to = -1;
+	}
+
+      if (USE_STORE_POST_INCREMENT (mode) && ! data->reverse
+	  && ! data->autinc_to)
+	{
+	  data->to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
+	  data->autinc_to = 1;
+	  data->explicit_inc_to = 1;
+	}
+
+      if ( !data->autinc_to && CONSTANT_P (to_addr))
+	data->to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
+    }
+
+  dst_offset = get_mem_align_offset (data->to, MOVE_MAX*BITS_PER_UNIT);
+  if (dst_offset < 0
+      || compute_aligned_cost (data->len, dst_offset) >=
+	 compute_unaligned_cost (data->len))
+    {
+      while (data->len > 0)
+	{
+	  enum machine_mode mode = widest_mode_for_unaligned_mov (data->len);
+	  generate_move_with_mode (data, mode, &promoted_to_vector_value,
+				   &promoted_value);
+	}
+    }
+  else
+    {
+      while (data->len > 0)
+	{
+	  enum machine_mode mode;
+	  mode = widest_mode_for_aligned_mov (data->len,
+	      compute_align_by_offset (dst_offset));
+	  mode = generate_move_with_mode (data, mode, &promoted_to_vector_value,
+				   &promoted_value);
+	  dst_offset += GET_MODE_SIZE (mode);
+	}
+    }
+
+  /* The code above should have handled everything.  */
+  gcc_assert (!data->len);
+}
+
+/* Subroutine of set_by_pieces_1.  Emit move instruction with mode MODE.
+   DATA has info about destination, RHS is source, GENFUN is the gen_...
+   function to make a move insn for that mode.  */
+
+static void
+set_by_pieces_2 (rtx (*genfun) (rtx, ...), enum machine_mode mode,
+		   struct store_by_pieces_d *data, rtx rhs)
+{
+  unsigned int size = GET_MODE_SIZE (mode);
+  rtx to1;
+
+  if (data->reverse)
+    data->offset -= size;
+
+  if (data->autinc_to)
+    to1 = adjust_automodify_address (data->to, mode, data->to_addr,
+	data->offset);
+  else
+    to1 = adjust_address (data->to, mode, data->offset);
+
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
+    emit_insn (gen_add2_insn (data->to_addr,
+	  GEN_INT (-(HOST_WIDE_INT) size)));
+
+  gcc_assert (rhs);
+
+  emit_insn ((*genfun) (to1, rhs));
+
+  if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
+    emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
+
+  if (! data->reverse)
+    data->offset += size;
+
+  data->len -= size;
+}
+
 /* Subroutine of store_by_pieces_1.  Store as many bytes as appropriate
    with move instructions for mode MODE.  GENFUN is the gen_... function
    to make a move insn for that mode.  DATA has all the other info.  */
@@ -3928,7 +4419,7 @@ emit_push_insn (rtx x, enum machine_mode mode, tree type, rtx size,
 	  /* Here we avoid the case of a structure whose weak alignment
 	     forces many pushes of a small amount of data,
 	     and such small pushes do rounding that causes trouble.  */
-	  && ((! SLOW_UNALIGNED_ACCESS (word_mode, align))
+	  && ((! targetm.slow_unaligned_access (word_mode, align))
 	      || align >= BIGGEST_ALIGNMENT
 	      || (PUSH_ROUNDING (align / BITS_PER_UNIT)
 		  == (align / BITS_PER_UNIT)))
@@ -6214,7 +6705,7 @@ store_field (rtx target, HOST_WIDE_INT bitsize, HOST_WIDE_INT bitpos,
       || (mode != BLKmode
 	  && ((((MEM_ALIGN (target) < GET_MODE_ALIGNMENT (mode))
 		|| bitpos % GET_MODE_ALIGNMENT (mode))
-	       && SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (target)))
+	       && targetm.slow_unaligned_access (mode, MEM_ALIGN (target)))
 	      || (bitpos % BITS_PER_UNIT != 0)))
       /* If the RHS and field are a constant size and the size of the
 	 RHS isn't the same size as the bitfield, we must use bitfield
@@ -9617,7 +10108,7 @@ expand_expr_real_1 (tree exp, rtx target, enum machine_mode tmode,
 		     && ((modifier == EXPAND_CONST_ADDRESS
 			  || modifier == EXPAND_INITIALIZER)
 			 ? STRICT_ALIGNMENT
-			 : SLOW_UNALIGNED_ACCESS (mode1, MEM_ALIGN (op0))))
+			 : targetm.slow_unaligned_access (mode1, MEM_ALIGN (op0))))
 		    || (bitpos % BITS_PER_UNIT != 0)))
 	    /* If the type and the field are a constant size and the
 	       size of the type isn't the same size as the bitfield,
diff --git a/gcc/expr.h b/gcc/expr.h
index 1652186..67541ab 100644
--- a/gcc/expr.h
+++ b/gcc/expr.h
@@ -704,4 +704,8 @@ extern tree build_libfunc_function (const char *);
 /* Get the personality libfunc for a function decl.  */
 rtx get_personality_function (tree);
 
+/* Given an offset from the maximum alignment boundary, compute the maximum
+   alignment that can be assumed.  */
+unsigned int compute_align_by_offset (int);
+
 #endif /* GCC_EXPR_H */
diff --git a/gcc/fwprop.c b/gcc/fwprop.c
index 236dda2..c6a8c3d 100644
--- a/gcc/fwprop.c
+++ b/gcc/fwprop.c
@@ -1270,6 +1270,10 @@ forward_propagate_and_simplify (df_ref use, rtx def_insn, rtx def_set)
       return false;
     }
 
+  /* Don't propagate vector constants.  */
+  if (vector_extensions_used_for_mode (GET_MODE (reg)) && CONSTANT_P (src))
+      return false;
+
   if (asm_use >= 0)
     return forward_propagate_asm (use, def_insn, def_set, reg);
 
diff --git a/gcc/rtl.h b/gcc/rtl.h
index 860f6c4..c2e6920 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -2511,6 +2511,9 @@ extern void emit_jump (rtx);
 /* In expr.c */
 extern rtx move_by_pieces (rtx, rtx, unsigned HOST_WIDE_INT,
 			   unsigned int, int);
+/* Check if vector instructions are required for operating on the
+   specified mode.  */
+bool vector_extensions_used_for_mode (enum machine_mode);
 extern HOST_WIDE_INT find_args_size_adjust (rtx);
 extern int fixup_args_size_notes (rtx, rtx, int);
 
diff --git a/gcc/target.def b/gcc/target.def
index 1e09ba7..082ed99 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1498,6 +1498,22 @@ DEFHOOK
  bool, (struct ao_ref_s *ref),
  default_ref_may_alias_errno)
 
+/* True if access to unaligned data in the given mode is too slow or
+   prohibited.  */
+DEFHOOK
+(slow_unaligned_access,
+ "",
+ bool, (enum machine_mode mode, unsigned int align),
+ default_slow_unaligned_access)
+
+/* Target hook.  Returns an rtx of mode MODE with the promoted value VAL,
+   or NULL_RTX.  VAL is supposed to represent one byte.  */
+DEFHOOK
+(promote_rtx_for_memset,
+ "",
+ rtx, (enum machine_mode mode, rtx val),
+ default_promote_rtx_for_memset)
+
 /* Support for named address spaces.  */
 #undef HOOK_PREFIX
 #define HOOK_PREFIX "TARGET_ADDR_SPACE_"
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index 8ad517f..617d8a3 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -1457,4 +1457,24 @@ default_pch_valid_p (const void *data_p, size_t len)
   return NULL;
 }
 
+bool
+default_slow_unaligned_access (enum machine_mode mode ATTRIBUTE_UNUSED,
+			       unsigned int align ATTRIBUTE_UNUSED)
+{
+#ifdef SLOW_UNALIGNED_ACCESS
+  return SLOW_UNALIGNED_ACCESS (mode, align);
+#else
+  return STRICT_ALIGNMENT;
+#endif
+}
+
+/* Target hook.  Returns an rtx of mode MODE with the promoted value VAL,
+   or NULL_RTX.  VAL is supposed to represent one byte.  */
+rtx
+default_promote_rtx_for_memset (enum machine_mode mode ATTRIBUTE_UNUSED,
+				 rtx val ATTRIBUTE_UNUSED)
+{
+  return NULL_RTX;
+}
+
 #include "gt-targhooks.h"
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index 552407b..08511ab 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -177,3 +177,6 @@ extern enum machine_mode default_get_reg_raw_mode(int);
 
 extern void *default_get_pch_validity (size_t *);
 extern const char *default_pch_valid_p (const void *, size_t);
+extern bool default_slow_unaligned_access (enum machine_mode mode,
+					   unsigned int align);
+extern rtx default_promote_rtx_for_memset (enum machine_mode mode, rtx val);

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-09-28 13:33                     ` Jan Hubicka
  2011-09-28 13:51                       ` Michael Zolotukhin
@ 2011-09-28 16:21                       ` Andi Kleen
  2011-09-28 16:33                         ` Michael Zolotukhin
  1 sibling, 1 reply; 52+ messages in thread
From: Andi Kleen @ 2011-09-28 16:21 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Michael Zolotukhin, Jakub Jelinek, Andi Kleen, gcc-patches,
	Richard Guenther, H.J. Lu, izamyatin, areg.melikadamyan

On Wed, Sep 28, 2011 at 02:54:34PM +0200, Jan Hubicka wrote:
> > > Do you know glibc version numbers when
> > > the optimized string functions were introduced?
> > 
> > Afaik, it's 2.13.
> > I also compared my implementation to 2.13.
> 
> I wonder if we can assume that most of GCC 4.7 based systems will be glibc 2.13
> based, too.  I would tend to say yes, and thus would suggest taming down
> inlining that is no longer profitable on newer glibcs, with a note in
> changes.html...

You could add a check to configure and generate based on that?

BTW, I know that the tables need tuning for Nehalem and Sandy Bridge too.
Michael, are you planning to do that?

-Andi

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-09-28 16:21                       ` Andi Kleen
@ 2011-09-28 16:33                         ` Michael Zolotukhin
  2011-09-28 18:29                           ` Andi Kleen
  0 siblings, 1 reply; 52+ messages in thread
From: Michael Zolotukhin @ 2011-09-28 16:33 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Jan Hubicka, Jakub Jelinek, gcc-patches, Richard Guenther,
	H.J. Lu, izamyatin, areg.melikadamyan

> You could add a check to configure and generate based on that?
Do you mean a check for whether glibc is newer than 2.13?
I think that when a new glibc version is released, the tables should be
re-examined anyway - we shouldn't just stop inlining or stop
generating libcalls.
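
For illustration only - this is just a sketch, not part of the patch, and
GCC's actual configure machinery would look different - a configure-time
probe could try to compile a small test program like the one below and
prefer the library-call-friendly thresholds only when it succeeds:

/* Hypothetical probe: compiles only when the target glibc is 2.13+.  */
#include <features.h>

#if defined (__GLIBC_PREREQ)
# if __GLIBC_PREREQ (2, 13)
#  define GLIBC_AT_LEAST_2_13 1
# endif
#endif

#ifndef GLIBC_AT_LEAST_2_13
# error "glibc 2.13 or newer not detected"
#endif

int
main (void)
{
  return 0;
}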

> BTW I know that the tables need tuning for Nehalem and Sandy Bridge too.
> Michael are you planning to do that?
There is no separate cost-table for Nehalem or SandyBridge - however,
I tuned the generic32 and generic64 tables, which should improve
performance on modern processors. In the old version REP-MOV was used - it
turned out to be slower than SSE moves or libcalls (in my version the
fastest of these options is used).
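
To illustrate the new table shape (a simplified sketch, not the exact code
from the patch - the real types are in i386.h): each cost table now carries
a 2x2 set of stringop_algs entries, where the new outer index distinguishes
known vs. unknown alignment and, if I read the patch right, the inner one
keeps the pre-existing 32-bit/64-bit split.  The thresholds below mirror
the atom_cost memset entry from the patch:

/* Simplified types; each entry gives the algorithm for unknown sizes,
   then {max byte count, algorithm} pairs, with -1 meaning "anything
   larger".  */
enum stringop_alg
  { libcall, unrolled_loop /* more variants exist in the real enum */ };

struct stringop_strategy { int max; enum stringop_alg alg; };

struct stringop_algs
{
  enum stringop_alg unknown_size;
  struct stringop_strategy size[4];
};

static const struct stringop_algs atom_memset[2][2] = {
  /* Known alignment: 32-bit variant, then 64-bit variant.  */
  {{libcall, {{4096, unrolled_loop}, {-1, libcall}}},
   {libcall, {{4096, unrolled_loop}, {-1, libcall}}}},
  /* Unknown alignment: 32-bit variant, then 64-bit variant.  */
  {{libcall, {{1024, unrolled_loop}, {-1, libcall}}},
   {libcall, {{2048, unrolled_loop}, {-1, libcall}}}}
};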

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-09-28 16:33                         ` Michael Zolotukhin
@ 2011-09-28 18:29                           ` Andi Kleen
  2011-09-28 18:36                             ` Andi Kleen
  0 siblings, 1 reply; 52+ messages in thread
From: Andi Kleen @ 2011-09-28 18:29 UTC (permalink / raw)
  To: Michael Zolotukhin
  Cc: Andi Kleen, Jan Hubicka, Jakub Jelinek, gcc-patches,
	Richard Guenther, H.J. Lu, izamyatin, areg.melikadamyan

> There is no separate cost-table for Nehalem or SandyBridge - however,
> I tuned the generic32 and generic64 tables, which should improve
> performance on modern processors. In the old version REP-MOV was used - it

The recommended heuristics have changed in Nehalem and Sandy-Bridge
over earlier Intel CPUs.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-09-28 18:29                           ` Andi Kleen
@ 2011-09-28 18:36                             ` Andi Kleen
  2011-09-29  8:25                               ` Michael Zolotukhin
  0 siblings, 1 reply; 52+ messages in thread
From: Andi Kleen @ 2011-09-28 18:36 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Michael Zolotukhin, Jan Hubicka, Jakub Jelinek, gcc-patches,
	Richard Guenther, H.J. Lu, izamyatin, areg.melikadamyan

On Wed, Sep 28, 2011 at 06:27:11PM +0200, Andi Kleen wrote:
> > There is no separate cost-table for Nehalem or SandyBridge - however,
> > I tuned the generic32 and generic64 tables, which should improve
> > performance on modern processors. In the old version REP-MOV was used - it
> 
> The recommended heuristics have changed in Nehalem and Sandy-Bridge
> over earlier Intel CPUs.

Sorry what I meant is that it would be bad if -mtune=corei7(-avx)? was
slower than generic.

Adding new tables shouldn't be very difficult, even if they are the same
as generic.

-Andi

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-09-28 14:28             ` Michael Zolotukhin
@ 2011-09-28 22:52               ` Jack Howarth
  2011-09-29  8:13                 ` Michael Zolotukhin
  0 siblings, 1 reply; 52+ messages in thread
From: Jack Howarth @ 2011-09-28 22:52 UTC (permalink / raw)
  To: Michael Zolotukhin
  Cc: gcc-patches, Jan Hubicka, Richard Guenther, H.J. Lu, izamyatin,
	areg.melikadamyan

On Wed, Sep 28, 2011 at 05:33:23PM +0400, Michael Zolotukhin wrote:
> >   It appears that part 1 of the patch wasn't really attached.
> Thanks, resending.

Michael,
    Did you bootstrap with --enable-checking=yes? I am seeing the bootstrap
failure...

/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/./prev-gcc/g++ -B/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/./prev-gcc/ -B/sw/lib/gcc4.7/x86_64-apple-darwin11.2.0/bin/ -nostdinc++ -B/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/src/.libs -B/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/libsupc++/.libs -I/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/include/x86_64-apple-darwin11.2.0 -I/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/include -I/sw/src/fink.build/gcc47-4.7.0-1/gcc-4.7-20110927/libstdc++-v3/libsupc++ -L/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/src/.libs -L/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/libsupc++/.libs -c   -g -O2 -mdynamic-no-pic -gtoggle -DIN_GCC   -W -Wall -Wwrite-strings -Wcast-qual -Wmissing-format-attribute -pedantic -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings -Werror -fno-common  -DHAVE_CONFIG_H -I. -I. -I../../gcc-4.7-20110927/gcc -I../../gcc-4.7-20110927/gcc/. -I../../gcc-4.7-20110927/gcc/../include -I../../gcc-4.7-20110927/gcc/../libcpp/include -I/sw/include -I/sw/include  -I../../gcc-4.7-20110927/gcc/../libdecnumber -I../../gcc-4.7-20110927/gcc/../libdecnumber/dpd -I../libdecnumber -I/sw/include  -I/sw/include -DCLOOG_INT_GMP -DCLOOG_ORG -I/sw/include ../../gcc-4.7-20110927/gcc/emit-rtl.c -o emit-rtl.o
../../gcc-4.7-20110927/gcc/emit-rtl.c: In function ‘rtx_def* adjust_address_1(rtx, machine_mode, long int, int, int)’:
../../gcc-4.7-20110927/gcc/emit-rtl.c:2060:26: error: unused variable ‘max_align’ [-Werror=unused-variable]
cc1plus: all warnings being treated as errors

on x86_64-apple-darwin11 with your patches.
          Jack
ps There also seem to be common sections in the memfunc-mid.patch and memfunc-be.patch patches.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-09-28 22:52               ` Jack Howarth
@ 2011-09-29  8:13                 ` Michael Zolotukhin
  2011-09-29 12:09                   ` Michael Zolotukhin
  0 siblings, 1 reply; 52+ messages in thread
From: Michael Zolotukhin @ 2011-09-29  8:13 UTC (permalink / raw)
  To: Jack Howarth
  Cc: gcc-patches, Jan Hubicka, Richard Guenther, H.J. Lu, izamyatin,
	areg.melikadamyan

> Michael,
>    Did you bootstrap with --enable-checking=yes? I am seeing the bootstrap
> failure...
I checked bootstrap, specs and 'make check' with the complete patch.
Separate patches for ME and BE were only tested for build (no
bootstrap) and 'make check'. I think it's better to apply the complete
patch, but review the separate patches (to make it easier).

> ps There also seem to be common sections in the memfunc-mid.patch and memfunc-be.patch patches.
That's true: some new routines from the middle end are used in the back-end
changes - I couldn't separate the patches any other way without
significant changes to them.


On 29 September 2011 01:51, Jack Howarth <howarth@bromo.med.uc.edu> wrote:
> On Wed, Sep 28, 2011 at 05:33:23PM +0400, Michael Zolotukhin wrote:
>> >   It appears that part 1 of the patch wasn't really attached.
>> Thanks, resending.
>
> Michael,
>    Did you bootstrap with --enable-checking=yes? I am seeing the bootstrap
> failure...
>
> /sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/./prev-gcc/g++ -B/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/./prev-gcc/ -B/sw/lib/gcc4.7/x86_64-apple-darwin11.2.0/bin/ -nostdinc++ -B/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/src/.libs -B/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/libsupc++/.libs -I/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/include/x86_64-apple-darwin11.2.0 -I/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/include -I/sw/src/fink.build/gcc47-4.7.0-1/gcc-4.7-20110927/libstdc++-v3/libsupc++ -L/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/src/.libs -L/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/libsupc++/.libs -c   -g -O2 -mdynamic-no-pic -gtoggle -DIN_GCC   -W -Wall -Wwrite-strings -Wcast-qual -Wmissing-format-attribute -pedantic -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings -Werror -fno-common  -DHAVE_CONFIG_H -I. -I. -I../../gcc-4.7-20110927/gcc -I../../gcc-4.7-20110927/gcc/. -I../../gcc-4.7-20110927/gcc/../include -I../../gcc-4.7-20110927/gcc/../libcpp/include -I/sw/include -I/sw/include  -I../../gcc-4.7-20110927/gcc/../libdecnumber -I../../gcc-4.7-20110927/gcc/../libdecnumber/dpd -I../libdecnumber -I/sw/include  -I/sw/include -DCLOOG_INT_GMP -DCLOOG_ORG -I/sw/include ../../gcc-4.7-20110927/gcc/emit-rtl.c -o emit-rtl.o
> ../../gcc-4.7-20110927/gcc/emit-rtl.c: In function ‘rtx_def* adjust_address_1(rtx, machine_mode, long int, int, int)’:
> ../../gcc-4.7-20110927/gcc/emit-rtl.c:2060:26: error: unused variable ‘max_align’ [-Werror=unused-variable]
> cc1plus: all warnings being treated as errors
>
> on x86_64-apple-darwin11 with your patches.
>          Jack
> ps There also seem to be common sections in the memfunc-mid.patch and memfunc-be.patch patches.
>



-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-09-28 18:36                             ` Andi Kleen
@ 2011-09-29  8:25                               ` Michael Zolotukhin
  0 siblings, 0 replies; 52+ messages in thread
From: Michael Zolotukhin @ 2011-09-29  8:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Jan Hubicka, Jakub Jelinek, gcc-patches, Richard Guenther,
	H.J. Lu, izamyatin, areg.melikadamyan

> Sorry what I meant is that it would be bad if -mtune=corei7(-avx)? was
> slower than generic.
For now, -mtune=corei7 triggers use of the generic cost-table (I'm
not sure about corei7-avx, but I assume the same) - so it won't be
slower.

> Adding new tables shouldn't be very difficult, even if they are the same
> as generic.
Yes, that should be quite easy, but it isn't a 'must-have' for the
current patch (which is already very big). This could be done
independently of the work on mem-function inlining.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-09-29  8:13                 ` Michael Zolotukhin
@ 2011-09-29 12:09                   ` Michael Zolotukhin
  2011-09-29 12:12                     ` Michael Zolotukhin
  2011-09-29 13:02                     ` Jakub Jelinek
  0 siblings, 2 replies; 52+ messages in thread
From: Michael Zolotukhin @ 2011-09-29 12:09 UTC (permalink / raw)
  To: Jack Howarth
  Cc: gcc-patches, Jan Hubicka, Richard Guenther, H.J. Lu, izamyatin,
	areg.melikadamyan

[-- Attachment #1: Type: text/plain, Size: 3882 bytes --]

>> Michael,
>>    Did you bootstrap with --enable-checking=yes? I am seeing the bootstrap
>> failure...
> I checked bootstrap, specs and 'make check' with the complete patch.
> Separate patches for ME and BE were only tested for build (no
> bootstrap) and 'make check'. I think it's better to apply the complete
> patch, but review the separate patches (to make it easier).

I rechecked bootstrap, and it failed. It seems something went wrong
when I updated my branches, but I've already fixed it.

Here is the fixed version of the complete patch.

On 29 September 2011 09:39, Michael Zolotukhin
<michael.v.zolotukhin@gmail.com> wrote:
>> Michael,
>>    Did you bootstrap with --enable-checking=yes? I am seeing the bootstrap
>> failure...
> I checked bootstrap, specs and 'make check' with the complete patch.
> Separate patches for ME and BE were only tested for build (no
> bootstrap) and 'make check'. I think it's better to apply the complete
> patch, but review the separate patches (to make it easier).
>
>> ps There also seem to be common sections in the memfunc-mid.patch and memfunc-be.patch patches.
> That's true: some new routines from the middle end are used in the back-end
> changes - I couldn't separate the patches any other way without
> significant changes to them.
>
>
> On 29 September 2011 01:51, Jack Howarth <howarth@bromo.med.uc.edu> wrote:
>> On Wed, Sep 28, 2011 at 05:33:23PM +0400, Michael Zolotukhin wrote:
>>> >   It appears that part 1 of the patch wasn't really attached.
>>> Thanks, resending.
>>
>> Michael,
>>    Did you bootstrap with --enable-checking=yes? I am seeing the bootstrap
>> failure...
>>
>> /sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/./prev-gcc/g++ -B/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/./prev-gcc/ -B/sw/lib/gcc4.7/x86_64-apple-darwin11.2.0/bin/ -nostdinc++ -B/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/src/.libs -B/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/libsupc++/.libs -I/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/include/x86_64-apple-darwin11.2.0 -I/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/include -I/sw/src/fink.build/gcc47-4.7.0-1/gcc-4.7-20110927/libstdc++-v3/libsupc++ -L/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/src/.libs -L/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/libsupc++/.libs -c   -g -O2 -mdynamic-no-pic -gtoggle -DIN_GCC   -W -Wall -Wwrite-strings -Wcast-qual -Wmissing-format-attribute -pedantic -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings -Werror -fno-common  -DHAVE_CONFIG_H -I. -I. -I../../gcc-4.7-20110927/gcc -I../../gcc-4.7-20110927/gcc/. -I../../gcc-4.7-20110927/gcc/../include -I../../gcc-4.7-20110927/gcc/../libcpp/include -I/sw/include -I/sw/include  -I../../gcc-4.7-20110927/gcc/../libdecnumber -I../../gcc-4.7-20110927/gcc/../libdecnumber/dpd -I../libdecnumber -I/sw/include  -I/sw/include -DCLOOG_INT_GMP -DCLOOG_ORG -I/sw/include ../../gcc-4.7-20110927/gcc/emit-rtl.c -o emit-rtl.o
>> ../../gcc-4.7-20110927/gcc/emit-rtl.c: In function ‘rtx_def* adjust_address_1(rtx, machine_mode, long int, int, int)’:
>> ../../gcc-4.7-20110927/gcc/emit-rtl.c:2060:26: error: unused variable ‘max_align’ [-Werror=unused-variable]
>> cc1plus: all warnings being treated as errors
>>
>> on x86_64-apple-darwin11 with your patches.
>>          Jack
>> ps There also seem to be common sections in the memfunc-mid.patch and memfunc-be.patch patches.
>>
>
>
>
> --
> ---
> Best regards,
> Michael V. Zolotukhin,
> Software Engineer
> Intel Corporation.
>



-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.

[-- Attachment #2: memfunc-complete-2.patch --]
[-- Type: application/octet-stream, Size: 155649 bytes --]

diff --git a/gcc/builtins.c b/gcc/builtins.c
index b79ce6f..5c95577 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -3568,7 +3568,8 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
 				  builtin_memset_read_str, &c, dest_align,
 				  true))
 	store_by_pieces (dest_mem, tree_low_cst (len, 1),
-			 builtin_memset_read_str, &c, dest_align, true, 0);
+			 builtin_memset_read_str, gen_int_mode (c, val_mode),
+			 dest_align, true, 0);
       else if (!set_storage_via_setmem (dest_mem, len_rtx,
 					gen_int_mode (c, val_mode),
 					dest_align, expected_align,
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index f952d2e..b912040 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
   COSTS_N_BYTES (2),			/* cost of FABS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FCHS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -632,10 +636,14 @@ struct processor_costs i386_cost = {	/* 386 specific costs */
   COSTS_N_INSNS (22),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (24),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (122),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -704,10 +712,14 @@ struct processor_costs i486_cost = {	/* 486 specific costs */
   COSTS_N_INSNS (3),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (3),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (83),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -774,10 +786,14 @@ struct processor_costs pentium_cost = {
   COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (70),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{-1, rep_prefix_4_byte}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -849,12 +865,18 @@ struct processor_costs pentiumpro_cost = {
      noticeable win, for bigger blocks either rep movsl or rep movsb is
      way to go.  Rep movsb has apparently more expensive startup time in CPU,
      but after 4K the difference is down in the noise.  */
-  {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
+  {{{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
 			{8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_4_byte, {{1024, unrolled_loop},
-  			{8192, rep_prefix_4_byte}, {-1, libcall}}},
+   {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_4_byte, {{1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_4_byte, {{1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -922,10 +944,14 @@ struct processor_costs geode_cost = {
   COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (54),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -995,10 +1021,14 @@ struct processor_costs k6_cost = {
   COSTS_N_INSNS (2),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (2),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (56),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1068,10 +1098,14 @@ struct processor_costs athlon_cost = {
   /* For some reason, Athlon deals better with REP prefix (relative to loops)
      compared to K8. Alignment becomes important after 8 bytes for memcpy and
      128 bytes for memset.  */
-  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1146,11 +1180,16 @@ struct processor_costs k8_cost = {
   /* K8 has optimized REP instruction for medium sized blocks, but for very
      small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1233,11 +1272,16 @@ struct processor_costs amdfam10_cost = {
   /* AMDFAM10 has optimized REP instruction for medium sized blocks, but for
      very small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1320,11 +1364,16 @@ struct processor_costs bdver1_cost = {
   /*  BDVER1 has optimized REP instruction for medium sized blocks, but for
       very small blocks it is better to use loop. For large blocks, libcall
       can do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   6,					/* scalar_stmt_cost.  */
   4,					/* scalar load_cost.  */
   4,					/* scalar_store_cost.  */
@@ -1407,11 +1456,16 @@ struct processor_costs bdver2_cost = {
   /*  BDVER2 has optimized REP instruction for medium sized blocks, but for
       very small blocks it is better to use loop. For large blocks, libcall
       can do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+  {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   6,					/* scalar_stmt_cost.  */
   4,					/* scalar load_cost.  */
   4,					/* scalar_store_cost.  */
@@ -1489,11 +1543,16 @@ struct processor_costs btver1_cost = {
   /* BTVER1 has optimized REP instruction for medium sized blocks, but for
      very small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1560,11 +1619,18 @@ struct processor_costs pentium4_cost = {
   COSTS_N_INSNS (2),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (2),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (43),			/* cost of FSQRT instruction.  */
-  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+
+  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+
+  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
    {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1631,13 +1697,22 @@ struct processor_costs nocona_cost = {
   COSTS_N_INSNS (3),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (3),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (44),			/* cost of FSQRT instruction.  */
-  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+
+  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
    {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
 	      {100000, unrolled_loop}, {-1, libcall}}}},
-  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
+	      {100000, unrolled_loop}, {-1, libcall}}}}},
+
+  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
    {-1, libcall}}},
    {libcall, {{24, loop}, {64, unrolled_loop},
 	      {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {-1, libcall}}},
+   {libcall, {{24, loop}, {64, unrolled_loop},
+	      {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1704,13 +1779,21 @@ struct processor_costs atom_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{11, loop}, {-1, rep_prefix_4_byte}}},
-   {libcall, {{32, loop}, {64, rep_prefix_4_byte},
-	  {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {15, unrolled_loop},
-	  {2048, rep_prefix_4_byte}, {-1, libcall}}},
-   {libcall, {{24, loop}, {32, unrolled_loop},
-	  {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+
+  /* stringop_algs for memcpy.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
+    {libcall, {{4096, unrolled_loop}, {-1, libcall}}}},
+   {{libcall, {{-1, libcall}}},			       /* Unknown alignment.  */
+    {libcall, {{2048, unrolled_loop},
+	       {-1, libcall}}}}},
+
+  /* stringop_algs for memset.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
+    {libcall, {{4096, unrolled_loop}, {-1, libcall}}}},
+   {{libcall, {{1024, unrolled_loop},		       /* Unknown alignment.  */
+	       {-1, libcall}}},
+    {libcall, {{2048, unrolled_loop},
+	       {-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1784,10 +1867,16 @@ struct processor_costs generic64_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {DUMMY_STRINGOP_ALGS,
-   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {DUMMY_STRINGOP_ALGS,
-   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+
+  {{DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}},
+   {DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}}},
+
+  {{DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}},
+   {DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1856,10 +1945,16 @@ struct processor_costs generic32_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+  /* stringop_algs for memcpy.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  /* stringop_algs for memset.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -2537,6 +2632,7 @@ static void ix86_set_current_function (tree);
 static unsigned int ix86_minimum_incoming_stack_boundary (bool);
 
 static enum calling_abi ix86_function_abi (const_tree);
+static rtx promote_duplicated_reg (enum machine_mode, rtx);
 
 \f
 #ifndef SUBTARGET32_DEFAULT_CPU
@@ -15190,6 +15286,28 @@ ix86_expand_move (enum machine_mode mode, rtx operands[])
     }
   else
     {
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE2
+	  && MEM_P (op0)
+	  && MEM_P (op1)
+	  && !push_operand (op0, mode)
+	  && can_create_pseudo_p ())
+	{
+	  rtx temp = gen_reg_rtx (V2DImode);
+	  emit_insn (gen_sse2_loadq (temp, op1));
+	  emit_insn (gen_sse_storeq (op0, temp));
+	  return;
+	}
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE
+	  && !MEM_P (op1)
+	  && GET_MODE (op1) == V2DImode)
+	{
+	  emit_insn (gen_sse_storeq (op0, op1));
+	  return;
+	}
       if (MEM_P (op0)
 	  && (PUSH_ROUNDING (GET_MODE_SIZE (mode)) != GET_MODE_SIZE (mode)
 	      || !push_operand (op0, mode))
@@ -20201,22 +20319,17 @@ counter_mode (rtx count_exp)
   return SImode;
 }
 
-/* When SRCPTR is non-NULL, output simple loop to move memory
-   pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
-   overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
-   equivalent loop to set memory by VALUE (supposed to be in MODE).
-
-   The size is rounded down to whole number of chunk size moved at once.
-   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
-
-
-static void
-expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
-			       rtx destptr, rtx srcptr, rtx value,
-			       rtx count, enum machine_mode mode, int unroll,
-			       int expected_size)
+/* Helper function for expand_set_or_movmem_via_loop.
+   This function can reuse the iter rtx from another loop and skip
+   generating code for updating the addresses.  */
+static rtx
+expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
+					 rtx destptr, rtx srcptr, rtx value,
+					 rtx count, rtx iter,
+					 enum machine_mode mode, int unroll,
+					 int expected_size, bool change_ptrs)
 {
-  rtx out_label, top_label, iter, tmp;
+  rtx out_label, top_label, tmp;
   enum machine_mode iter_mode = counter_mode (count);
   rtx piece_size = GEN_INT (GET_MODE_SIZE (mode) * unroll);
   rtx piece_size_mask = GEN_INT (~((GET_MODE_SIZE (mode) * unroll) - 1));
@@ -20224,10 +20337,12 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
   rtx x_addr;
   rtx y_addr;
   int i;
+  bool reuse_iter = (iter != NULL_RTX);
 
   top_label = gen_label_rtx ();
   out_label = gen_label_rtx ();
-  iter = gen_reg_rtx (iter_mode);
+  if (!reuse_iter)
+    iter = gen_reg_rtx (iter_mode);
 
   size = expand_simple_binop (iter_mode, AND, count, piece_size_mask,
 			      NULL, 1, OPTAB_DIRECT);
@@ -20238,7 +20353,8 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
 			       true, out_label);
       predict_jump (REG_BR_PROB_BASE * 10 / 100);
     }
-  emit_move_insn (iter, const0_rtx);
+  if (!reuse_iter)
+    emit_move_insn (iter, const0_rtx);
 
   emit_label (top_label);
 
@@ -20321,19 +20437,43 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
     }
   else
     predict_jump (REG_BR_PROB_BASE * 80 / 100);
-  iter = ix86_zero_extend_to_Pmode (iter);
-  tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
-			     true, OPTAB_LIB_WIDEN);
-  if (tmp != destptr)
-    emit_move_insn (destptr, tmp);
-  if (srcptr)
+  if (change_ptrs)
     {
-      tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
+      iter = ix86_zero_extend_to_Pmode (iter);
+      tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
 				 true, OPTAB_LIB_WIDEN);
-      if (tmp != srcptr)
-	emit_move_insn (srcptr, tmp);
+      if (tmp != destptr)
+	emit_move_insn (destptr, tmp);
+      if (srcptr)
+	{
+	  tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
+				     true, OPTAB_LIB_WIDEN);
+	  if (tmp != srcptr)
+	    emit_move_insn (srcptr, tmp);
+	}
     }
   emit_label (out_label);
+  return iter;
+}
+
+/* When SRCPTR is non-NULL, output simple loop to move memory
+   pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
+   overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
+   equivalent loop to set memory by VALUE (supposed to be in MODE).
+
+   The size is rounded down to whole number of chunk size moved at once.
+   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
+
+static void
+expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
+			       rtx destptr, rtx srcptr, rtx value,
+			       rtx count, enum machine_mode mode, int unroll,
+			       int expected_size)
+{
+  expand_set_or_movmem_via_loop_with_iter (destmem, srcmem,
+				 destptr, srcptr, value,
+				 count, NULL_RTX, mode, unroll,
+				 expected_size, true);
 }
 
 /* Output "rep; mov" instruction.
@@ -20437,7 +20577,27 @@ emit_strmov (rtx destmem, rtx srcmem,
   emit_insn (gen_strmov (destptr, dest, srcptr, src));
 }
 
-/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST.  */
+/* Emit a strset instruction.  If RHS is constant and a vector mode will be
+   used, move this constant to a vector register before emitting strset.  */
+static void
+emit_strset (rtx destmem, rtx value,
+	     rtx destptr, enum machine_mode mode, int offset)
+{
+  rtx dest = adjust_automodify_address_nv (destmem, mode, destptr, offset);
+  rtx vec_reg;
+  if (vector_extensions_used_for_mode (mode) && CONSTANT_P (value))
+    {
+      if (mode == DImode)
+	mode = TARGET_64BIT ? V2DImode : V4SImode;
+      vec_reg = gen_reg_rtx (mode);
+      emit_move_insn (vec_reg, value);
+      emit_insn (gen_strset (destptr, dest, vec_reg));
+    }
+  else
+    emit_insn (gen_strset (destptr, dest, value));
+}
+
+/* Output code to copy (count % max_size) bytes from SRC to DEST.  */
 static void
 expand_movmem_epilogue (rtx destmem, rtx srcmem,
 			rtx destptr, rtx srcptr, rtx count, int max_size)
@@ -20448,43 +20608,55 @@ expand_movmem_epilogue (rtx destmem, rtx srcmem,
       HOST_WIDE_INT countval = INTVAL (count);
       int offset = 0;
 
-      if ((countval & 0x10) && max_size > 16)
+      int remainder_size = countval % max_size;
+      enum machine_mode move_mode = Pmode;
+
+      /* First, try to move data with the widest possible mode.
+	 The remaining part will be moved using Pmode and narrower modes.  */
+      if (TARGET_SSE)
 	{
-	  if (TARGET_64BIT)
-	    {
-	      emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
-	      emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset + 8);
-	    }
-	  else
-	    gcc_unreachable ();
-	  offset += 16;
+	  if (max_size >= GET_MODE_SIZE (V4SImode))
+	    move_mode = V4SImode;
+	  else if (max_size >= GET_MODE_SIZE (DImode))
+	    move_mode = DImode;
 	}
-      if ((countval & 0x08) && max_size > 8)
+
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-	  if (TARGET_64BIT)
-	    emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
-	  else
-	    {
-	      emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
-	      emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset + 4);
-	    }
-	  offset += 8;
+	  emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
 	}
-      if ((countval & 0x04) && max_size > 4)
+
+      /* Move the remaining part of the epilogue - its size might be
+	 as large as the widest mode.  */
+      move_mode = Pmode;
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
+	{
+	  emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
+	}
+
+      if (remainder_size >= 4)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
 	  offset += 4;
+	  remainder_size -= 4;
 	}
-      if ((countval & 0x02) && max_size > 2)
+      if (remainder_size >= 2)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
 	  offset += 2;
+	  remainder_size -= 2;
 	}
-      if ((countval & 0x01) && max_size > 1)
+      if (remainder_size >= 1)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
 	  offset += 1;
+	  remainder_size -= 1;
 	}
+      gcc_assert (remainder_size == 0);
       return;
     }
   if (max_size > 8)
@@ -20590,87 +20762,122 @@ expand_setmem_epilogue_via_loop (rtx destmem, rtx destptr, rtx value,
 				 1, max_size / 2);
 }
 
-/* Output code to set at most count & (max_size - 1) bytes starting by DEST.  */
+/* Output code to set at most count & (max_size - 1) bytes starting at
+   DESTMEM.  */
 static void
-expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_size)
+expand_setmem_epilogue (rtx destmem, rtx destptr, rtx promoted_to_vector_value,
+			rtx value, rtx count, int max_size)
 {
-  rtx dest;
-
   if (CONST_INT_P (count))
     {
       HOST_WIDE_INT countval = INTVAL (count);
       int offset = 0;
 
-      if ((countval & 0x10) && max_size > 16)
+      int remainder_size = countval % max_size;
+      enum machine_mode move_mode = Pmode;
+      enum machine_mode sse_mode = TARGET_64BIT ? V2DImode : V4SImode;
+      rtx promoted_value = NULL_RTX;
+
+      /* First, try to move data with the widest possible mode.
+	 The remaining part will be moved using Pmode and narrower modes.  */
+      if (TARGET_SSE)
 	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset + 8);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    gcc_unreachable ();
-	  offset += 16;
+	  if (max_size >= GET_MODE_SIZE (sse_mode))
+	    move_mode = sse_mode;
+	  else if (max_size >= GET_MODE_SIZE (DImode))
+	    move_mode = DImode;
+	  if (!VECTOR_MODE_P (GET_MODE (promoted_to_vector_value)))
+	    promoted_to_vector_value = NULL_RTX;
 	}
-      if ((countval & 0x08) && max_size > 8)
+
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    {
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset + 4);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  offset += 8;
+	  if (GET_MODE (destmem) != move_mode)
+	    destmem = change_address (destmem, move_mode, destptr);
+	  if (!promoted_to_vector_value)
+	    promoted_to_vector_value =
+	      targetm.promote_rtx_for_memset (move_mode, value);
+	  emit_strset (destmem, promoted_to_vector_value, destptr,
+		       move_mode, offset);
+
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
 	}
-      if ((countval & 0x04) && max_size > 4)
+
+      /* Move the remaining part of the epilogue - its size might be
+	 as large as the widest mode.  */
+      move_mode = Pmode;
+      promoted_value = NULL_RTX;
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
+	{
+	  if (!promoted_value)
+	    promoted_value = promote_duplicated_reg (move_mode, value);
+	  emit_strset (destmem, promoted_value, destptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
+	}
+
+      if (!promoted_value)
+	promoted_value = promote_duplicated_reg (move_mode, value);
+      if (remainder_size >= 4)
 	{
-	  dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
+	  emit_strset (destmem, gen_lowpart (SImode, promoted_value), destptr,
+		       SImode, offset);
 	  offset += 4;
+	  remainder_size -= 4;
 	}
-      if ((countval & 0x02) && max_size > 2)
+      if (remainder_size >= 2)
 	{
-	  dest = adjust_automodify_address_nv (destmem, HImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
-	  offset += 2;
+	  emit_strset (destmem, gen_lowpart (HImode, promoted_value), destptr,
+		       HImode, offset);
+	  offset += 2;
+	  remainder_size -= 2;
 	}
-      if ((countval & 0x01) && max_size > 1)
+      if (remainder_size >= 1)
 	{
-	  dest = adjust_automodify_address_nv (destmem, QImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
+	  emit_strset (destmem, gen_lowpart (QImode, promoted_value), destptr,
+		       QImode, offset);
 	  offset += 1;
+	  remainder_size -= 1;
 	}
+      gcc_assert (remainder_size == 0);
       return;
     }
+
+  /* COUNT isn't a compile-time constant.  */
   if (max_size > 32)
     {
-      expand_setmem_epilogue_via_loop (destmem, destptr, value, count, max_size);
+      expand_setmem_epilogue_via_loop (destmem, destptr, value, count,
+				       max_size);
       return;
     }
+  /* If it turned out that we promoted the value to a non-vector register,
+     we can reuse it.  */
+  if (!VECTOR_MODE_P (GET_MODE (promoted_to_vector_value)))
+    value = promoted_to_vector_value;
+
   if (max_size > 16)
     {
       rtx label = ix86_expand_aligntest (count, 16, true);
       if (TARGET_64BIT)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
@@ -20680,14 +20887,17 @@ expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_
       rtx label = ix86_expand_aligntest (count, 8, true);
       if (TARGET_64BIT)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
@@ -20695,24 +20905,24 @@ expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_
   if (max_size > 4)
     {
       rtx label = ix86_expand_aligntest (count, 4, true);
-      dest = change_address (destmem, SImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
+      destmem = change_address (destmem, SImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
   if (max_size > 2)
     {
       rtx label = ix86_expand_aligntest (count, 2, true);
-      dest = change_address (destmem, HImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
+      destmem = change_address (destmem, HImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (HImode, value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
   if (max_size > 1)
     {
       rtx label = ix86_expand_aligntest (count, 1, true);
-      dest = change_address (destmem, QImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
+      destmem = change_address (destmem, QImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (QImode, value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
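
The constant-count branch above simply peels the remainder left after the main loop with progressively narrower stores.  A minimal stand-alone sketch of that decomposition (plain C, illustrative only and not part of the patch; it assumes max_size is 16 and SSE stores are available, so the widest step is 16 bytes):

/* Illustrative sketch: peel a constant remainder the way the epilogue
   does, widest store first.  Not part of the patch.  */
#include <stdio.h>

int
main (void)
{
  int countval = 29, max_size = 16;
  int remainder = countval % max_size;   /* 13 bytes left for the epilogue.  */
  int steps[] = { 16, 8, 4, 2, 1 };      /* SSE, DImode, SImode, HImode, QImode.  */

  for (int i = 0; i < 5; i++)
    while (remainder >= steps[i])
      {
        printf ("store %d byte(s)\n", steps[i]);
        remainder -= steps[i];
      }
  return remainder;                      /* 0, matching the gcc_assert above.  */
}

For 29 bytes and a 16-byte main-loop step this prints an 8-, a 4- and a 1-byte store on a 64-bit target, which is the sequence the gen_strset calls above would emit.
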
@@ -20755,7 +20965,27 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
-  gcc_assert (desired_alignment <= 8);
+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  srcmem = change_address (srcmem, DImode, srcptr);
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	}
+      else
+	{
+	  srcmem = change_address (srcmem, SImode, srcptr);
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	}
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);
 }
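
With the new 8-byte step the prologue above may now have to peel up to 15 bytes before the destination becomes 16-byte aligned.  A small illustration of how many bytes that is for a few sample addresses (plain C, not part of the patch; the addresses are made up):

/* Illustrative only: bytes the alignment prologue copies before DST
   reaches a 16-byte boundary.  */
#include <stdio.h>
#include <stdint.h>

int
main (void)
{
  unsigned desired_align = 16;
  uintptr_t samples[] = { 0x1001, 0x1004, 0x1007, 0x1010 };

  for (int i = 0; i < 4; i++)
    {
      unsigned prologue_bytes = (unsigned) (-samples[i] & (desired_align - 1));
      printf ("dst = %#lx -> %u prologue bytes\n",
	      (unsigned long) samples[i], prologue_bytes);
    }
  return 0;
}
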
 
 /* Copy enough from DST to SRC to align DST known to DESIRED_ALIGN.
@@ -20810,6 +21040,37 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
       off = 4;
       emit_insn (gen_strmov (destreg, dst, srcreg, src));
     }
+  if (align_bytes & 8)
+    {
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, DImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      else
+	{
+	  dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, SImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
+	set_mem_align (dst, 8 * BITS_PER_UNIT);
+      if (src_align_bytes >= 0)
+	{
+	  unsigned int src_align = 0;
+	  if ((src_align_bytes & 7) == (align_bytes & 7))
+	    src_align = 8;
+	  else if ((src_align_bytes & 3) == (align_bytes & 3))
+	    src_align = 4;
+	  else if ((src_align_bytes & 1) == (align_bytes & 1))
+	    src_align = 2;
+	  if (MEM_ALIGN (src) < src_align * BITS_PER_UNIT)
+	    set_mem_align (src, src_align * BITS_PER_UNIT);
+	}
+      off = 8;
+    }
   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
   src = adjust_automodify_address_nv (src, BLKmode, srcreg, off);
   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
@@ -20869,7 +21130,17 @@ expand_setmem_prologue (rtx destmem, rtx destptr, rtx value, rtx count,
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
-  gcc_assert (desired_alignment <= 8);
+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      destmem = change_address (destmem, SImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);
 }
 
 /* Set enough from DST to align DST known to by aligned by ALIGN to
@@ -20905,6 +21176,19 @@ expand_constant_setmem_prologue (rtx dst, rtx destreg, rtx value,
       emit_insn (gen_strset (destreg, dst,
 			     gen_lowpart (SImode, value)));
     }
+  if (align_bytes & 8)
+    {
+      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+      emit_insn (gen_strset (destreg, dst,
+	    gen_lowpart (SImode, value)));
+      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+      emit_insn (gen_strset (destreg, dst,
+	    gen_lowpart (SImode, value)));
+
+      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
+	set_mem_align (dst, 8 * BITS_PER_UNIT);
+      off = 8;
+    }
   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
     set_mem_align (dst, desired_align * BITS_PER_UNIT);
@@ -20916,7 +21200,7 @@ expand_constant_setmem_prologue (rtx dst, rtx destreg, rtx value,
 /* Given COUNT and EXPECTED_SIZE, decide on codegen of string operation.  */
 static enum stringop_alg
 decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
-	    int *dynamic_check)
+	    int *dynamic_check, bool align_unknown)
 {
   const struct stringop_algs * algs;
   bool optimize_for_speed;
@@ -20925,7 +21209,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
      consider such algorithms if the user has appropriated those
      registers for their own purposes.	*/
   bool rep_prefix_usable = !(fixed_regs[CX_REG] || fixed_regs[DI_REG]
-                             || (memset
+			     || (memset
 				 ? fixed_regs[AX_REG] : fixed_regs[SI_REG]));
 
 #define ALG_USABLE_P(alg) (rep_prefix_usable			\
@@ -20938,7 +21222,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
      of time processing large blocks.  */
   if (optimize_function_for_size_p (cfun)
       || (optimize_insn_for_size_p ()
-          && expected_size != -1 && expected_size < 256))
+	  && expected_size != -1 && expected_size < 256))
     optimize_for_speed = false;
   else
     optimize_for_speed = true;
@@ -20947,9 +21231,9 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
 
   *dynamic_check = -1;
   if (memset)
-    algs = &cost->memset[TARGET_64BIT != 0];
+    algs = &cost->memset[align_unknown][TARGET_64BIT != 0];
   else
-    algs = &cost->memcpy[TARGET_64BIT != 0];
+    algs = &cost->memcpy[align_unknown][TARGET_64BIT != 0];
   if (ix86_stringop_alg != no_stringop && ALG_USABLE_P (ix86_stringop_alg))
     return ix86_stringop_alg;
   /* rep; movq or rep; movl is the smallest variant.  */
@@ -21013,29 +21297,33 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
       enum stringop_alg alg;
       int i;
       bool any_alg_usable_p = true;
+      bool only_libcall_fits = true;
 
       for (i = 0; i < MAX_STRINGOP_ALGS; i++)
-        {
-          enum stringop_alg candidate = algs->size[i].alg;
-          any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
+	{
+	  enum stringop_alg candidate = algs->size[i].alg;
+	  any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
 
-          if (candidate != libcall && candidate
-              && ALG_USABLE_P (candidate))
-              max = algs->size[i].max;
-        }
+	  if (candidate != libcall && candidate
+	      && ALG_USABLE_P (candidate))
+	    {
+	      max = algs->size[i].max;
+	      only_libcall_fits = false;
+	    }
+	}
       /* If there aren't any usable algorithms, then recursing on
-         smaller sizes isn't going to find anything.  Just return the
-         simple byte-at-a-time copy loop.  */
-      if (!any_alg_usable_p)
-        {
-          /* Pick something reasonable.  */
-          if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
-            *dynamic_check = 128;
-          return loop_1_byte;
-        }
+	 smaller sizes isn't going to find anything.  Just return the
+	 simple byte-at-a-time copy loop.  */
+      if (!any_alg_usable_p || only_libcall_fits)
+	{
+	  /* Pick something reasonable.  */
+	  if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
+	    *dynamic_check = 128;
+	  return loop_1_byte;
+	}
       if (max == -1)
 	max = 4096;
-      alg = decide_alg (count, max / 2, memset, dynamic_check);
+      alg = decide_alg (count, max / 2, memset, dynamic_check, align_unknown);
       gcc_assert (*dynamic_check == -1);
       gcc_assert (alg != libcall);
       if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
@@ -21059,9 +21347,11 @@ decide_alignment (int align,
       case no_stringop:
 	gcc_unreachable ();
       case loop:
-      case unrolled_loop:
 	desired_align = GET_MODE_SIZE (Pmode);
 	break;
+      case unrolled_loop:
+	desired_align = GET_MODE_SIZE (TARGET_SSE ? V4SImode : Pmode);
+	break;
       case rep_prefix_8_byte:
 	desired_align = 8;
 	break;
@@ -21149,6 +21439,11 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
   enum stringop_alg alg;
   int dynamic_check;
   bool need_zero_guard = false;
+  bool align_unknown;
+  int unroll_factor;
+  enum machine_mode move_mode;
+  rtx loop_iter = NULL_RTX;
+  int dst_offset, src_offset;
 
   if (CONST_INT_P (align_exp))
     align = INTVAL (align_exp);
@@ -21172,9 +21467,17 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 
   /* Step 0: Decide on preferred algorithm, desired alignment and
      size of chunks to be copied by main loop.  */
-
-  alg = decide_alg (count, expected_size, false, &dynamic_check);
+  dst_offset = get_mem_align_offset (dst, MOVE_MAX * BITS_PER_UNIT);
+  src_offset = get_mem_align_offset (src, MOVE_MAX * BITS_PER_UNIT);
+  align_unknown = (dst_offset < 0
+		   || src_offset < 0
+		   || src_offset != dst_offset);
+  alg = decide_alg (count, expected_size, false, &dynamic_check, align_unknown);
   desired_align = decide_alignment (align, alg, expected_size);
+  if (align_unknown)
+    desired_align = align;
+  unroll_factor = 1;
+  move_mode = Pmode;
 
   if (!TARGET_ALIGN_STRINGOPS)
     align = desired_align;
@@ -21193,11 +21496,16 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
       gcc_unreachable ();
     case loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode);
+      move_mode = Pmode;
+      unroll_factor = 1;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * (TARGET_64BIT ? 4 : 2);
+      /* Use SSE instructions, if possible.  */
+      move_mode = TARGET_SSE ? (align_unknown ? DImode : V4SImode) : Pmode;
+      unroll_factor = 4;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
       size_needed = 8;
@@ -21366,11 +21674,14 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 				     count_exp, Pmode, 1, expected_size);
       break;
     case unrolled_loop:
-      /* Unroll only by factor of 2 in 32bit mode, since we don't have enough
-	 registers for 4 temporaries anyway.  */
-      expand_set_or_movmem_via_loop (dst, src, destreg, srcreg, NULL,
-				     count_exp, Pmode, TARGET_64BIT ? 4 : 2,
-				     expected_size);
+      /* In some cases we want to use the same iterator in several adjacent
+	 loops, so save the loop iterator rtx and don't update the addresses.  */
+      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
+							   srcreg, NULL,
+							   count_exp, NULL_RTX,
+							   move_mode,
+							   unroll_factor,
+							   expected_size, false);
       break;
     case rep_prefix_8_byte:
       expand_movmem_via_rep_mov (dst, src, destreg, srcreg, count_exp,
@@ -21421,9 +21732,43 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
       LABEL_NUSES (label) = 1;
     }
 
+  /* We haven't updated the addresses, so do it now.
+     Also, if the epilogue seems to be big, generate a loop (not
+     unrolled) in it.  Do this only if the alignment is unknown, because
+     otherwise the epilogue would have to copy byte by byte, which is
+     very slow.  */
+  if (alg == unrolled_loop)
+    {
+      rtx tmp;
+      if (align_unknown && unroll_factor > 1)
+	{
+	  /* Reduce the epilogue's size by emitting a non-unrolled loop.  If we
+	     don't, the epilogue can become very big - when the alignment is
+	     statically unknown it copies byte by byte, which may be very slow.  */
+	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
+	      srcreg, NULL, count_exp,
+	      loop_iter, move_mode, 1,
+	      expected_size, false);
+	  src = change_address (src, BLKmode, srcreg);
+	  dst = change_address (dst, BLKmode, destreg);
+	  epilogue_size_needed = GET_MODE_SIZE (move_mode);
+	}
+      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != destreg)
+	emit_move_insn (destreg, tmp);
+
+      tmp = expand_simple_binop (Pmode, PLUS, srcreg, loop_iter, srcreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != srcreg)
+	emit_move_insn (srcreg, tmp);
+    }
   if (count_exp != const0_rtx && epilogue_size_needed > 1)
-    expand_movmem_epilogue (dst, src, destreg, srcreg, count_exp,
-			    epilogue_size_needed);
+    {
+      expand_movmem_epilogue (dst, src, destreg, srcreg, count_exp,
+			      epilogue_size_needed);
+    }
+
   if (jump_around_label)
     emit_label (jump_around_label);
   return true;
@@ -21441,7 +21786,37 @@ promote_duplicated_reg (enum machine_mode mode, rtx val)
   rtx tmp;
   int nops = mode == DImode ? 3 : 2;
 
+  if (VECTOR_MODE_P (mode))
+    {
+      enum machine_mode inner = GET_MODE_INNER (mode);
+      rtx promoted_val, vec_reg;
+      if (CONST_INT_P (val))
+	return ix86_build_const_vector (mode, true, val);
+
+      promoted_val = promote_duplicated_reg (inner, val);
+      vec_reg = gen_reg_rtx (mode);
+      switch (mode)
+	{
+	case V2DImode:
+	  emit_insn (gen_vec_dupv2di (vec_reg, promoted_val));
+	  break;
+	case V4SImode:
+	  emit_insn (gen_vec_dupv4si (vec_reg, promoted_val));
+	  break;
+	default:
+	  gcc_unreachable ();
+	  break;
+	}
+
+      return vec_reg;
+    }
   gcc_assert (mode == SImode || mode == DImode);
+  if (mode == DImode && !TARGET_64BIT)
+    {
+      rtx vec_reg = promote_duplicated_reg (V4SImode, val);
+      vec_reg = convert_to_mode (V2DImode, vec_reg, 1);
+      return vec_reg;
+    }
   if (val == const0_rtx)
     return copy_to_mode_reg (mode, const0_rtx);
   if (CONST_INT_P (val))
@@ -21507,11 +21882,21 @@ promote_duplicated_reg (enum machine_mode mode, rtx val)
 static rtx
 promote_duplicated_reg_to_size (rtx val, int size_needed, int desired_align, int align)
 {
-  rtx promoted_val;
+  rtx promoted_val = NULL_RTX;
 
-  if (TARGET_64BIT
-      && (size_needed > 4 || (desired_align > align && desired_align > 4)))
-    promoted_val = promote_duplicated_reg (DImode, val);
+  if (size_needed > 8 || (desired_align > align && desired_align > 8))
+    {
+      gcc_assert (TARGET_SSE);
+      if (TARGET_64BIT)
+	promoted_val = promote_duplicated_reg (V2DImode, val);
+      else
+	promoted_val = promote_duplicated_reg (V4SImode, val);
+    }
+  else if (size_needed > 4 || (desired_align > align && desired_align > 4))
+    {
+      gcc_assert (TARGET_64BIT || TARGET_SSE);
+      promoted_val = promote_duplicated_reg (DImode, val);
+    }
   else if (size_needed > 2 || (desired_align > align && desired_align > 2))
     promoted_val = promote_duplicated_reg (SImode, val);
   else if (size_needed > 1 || (desired_align > align && desired_align > 1))
@@ -21537,12 +21922,17 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
   unsigned HOST_WIDE_INT count = 0;
   HOST_WIDE_INT expected_size = -1;
   int size_needed = 0, epilogue_size_needed;
+  int promote_size_needed = 0;
   int desired_align = 0, align_bytes = 0;
   enum stringop_alg alg;
   rtx promoted_val = NULL;
-  bool force_loopy_epilogue = false;
+  rtx vec_promoted_val = NULL;
   int dynamic_check;
   bool need_zero_guard = false;
+  bool align_unknown;
+  unsigned int unroll_factor;
+  enum machine_mode move_mode;
+  rtx loop_iter = NULL_RTX;
 
   if (CONST_INT_P (align_exp))
     align = INTVAL (align_exp);
@@ -21562,8 +21952,11 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
   /* Step 0: Decide on preferred algorithm, desired alignment and
      size of chunks to be copied by main loop.  */
 
-  alg = decide_alg (count, expected_size, true, &dynamic_check);
+  align_unknown = get_mem_align_offset (dst, BITS_PER_UNIT) < 0;
+  alg = decide_alg (count, expected_size, true, &dynamic_check, align_unknown);
   desired_align = decide_alignment (align, alg, expected_size);
+  unroll_factor = 1;
+  move_mode = Pmode;
 
   if (!TARGET_ALIGN_STRINGOPS)
     align = desired_align;
@@ -21581,11 +21974,21 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       gcc_unreachable ();
     case loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode);
+      move_mode = Pmode;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * 4;
+      /* Use SSE instructions, if possible.  */
+      move_mode = TARGET_SSE
+		  ? (TARGET_64BIT ? V2DImode : V4SImode)
+		  : Pmode;
+      unroll_factor = 1;
+      /* Select the largest available unroll factor: 1, 2 or 4.  */
+      while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
+	     && unroll_factor < 4)
+	unroll_factor *= 2;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
       size_needed = 8;
@@ -21602,6 +22005,7 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       break;
     }
   epilogue_size_needed = size_needed;
+  promote_size_needed = GET_MODE_SIZE (Pmode);
 
   /* Step 1: Prologue guard.  */
 
@@ -21630,8 +22034,10 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
      main loop and epilogue (ie one load of the big constant in the
      front of all code.  */
   if (CONST_INT_P (val_exp))
-    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
-						   desired_align, align);
+    promoted_val = promote_duplicated_reg_to_size (val_exp,
+						   promote_size_needed,
+						   promote_size_needed,
+						   align);
   /* Ensure that alignment prologue won't copy past end of block.  */
   if (size_needed > 1 || (desired_align > 1 && desired_align > align))
     {
@@ -21640,12 +22046,6 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 	 Make sure it is power of 2.  */
       epilogue_size_needed = smallest_pow2_greater_than (epilogue_size_needed);
 
-      /* To improve performance of small blocks, we jump around the VAL
-	 promoting mode.  This mean that if the promoted VAL is not constant,
-	 we might not use it in the epilogue and have to use byte
-	 loop variant.  */
-      if (epilogue_size_needed > 2 && !promoted_val)
-        force_loopy_epilogue = true;
       if (count)
 	{
 	  if (count < (unsigned HOST_WIDE_INT)epilogue_size_needed)
@@ -21686,8 +22086,10 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 
   /* Do the expensive promotion once we branched off the small blocks.  */
   if (!promoted_val)
-    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
-						   desired_align, align);
+    promoted_val = promote_duplicated_reg_to_size (val_exp,
+						   promote_size_needed,
+						   promote_size_needed,
+						   align);
   gcc_assert (desired_align >= 1 && align >= 1);
 
   if (desired_align > align)
@@ -21710,6 +22112,8 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 						 desired_align, align_bytes);
 	  count_exp = plus_constant (count_exp, -align_bytes);
 	  count -= align_bytes;
+	  if (count < (unsigned HOST_WIDE_INT) size_needed)
+	    goto epilogue;
 	}
       if (need_zero_guard
 	  && (count < (unsigned HOST_WIDE_INT) size_needed
@@ -21751,7 +22155,7 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
     case no_stringop:
       gcc_unreachable ();
     case loop_1_byte:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
+      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, val_exp,
 				     count_exp, QImode, 1, expected_size);
       break;
     case loop:
@@ -21759,8 +22163,14 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 				     count_exp, Pmode, 1, expected_size);
       break;
     case unrolled_loop:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
-				     count_exp, Pmode, 4, expected_size);
+      vec_promoted_val =
+	promote_duplicated_reg_to_size (promoted_val,
+					GET_MODE_SIZE (move_mode),
+					desired_align, align);
+      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+				     NULL, vec_promoted_val, count_exp,
+				     NULL_RTX, move_mode, unroll_factor,
+				     expected_size, false);
       break;
     case rep_prefix_8_byte:
       expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
@@ -21804,15 +22214,29 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       LABEL_NUSES (label) = 1;
     }
  epilogue:
-  if (count_exp != const0_rtx && epilogue_size_needed > 1)
+  if (alg == unrolled_loop)
     {
-      if (force_loopy_epilogue)
-	expand_setmem_epilogue_via_loop (dst, destreg, val_exp, count_exp,
-					 epilogue_size_needed);
-      else
-	expand_setmem_epilogue (dst, destreg, promoted_val, count_exp,
-				epilogue_size_needed);
+      rtx tmp;
+      if (align_unknown && unroll_factor > 1)
+	{
+	  /* Reduce the epilogue's size by emitting a non-unrolled loop.  If we
+	     don't, the epilogue can become very big - when the alignment is
+	     statically unknown it copies byte by byte, which may be very slow.  */
+	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+	      NULL, vec_promoted_val, count_exp,
+	      loop_iter, move_mode, 1,
+	      expected_size, false);
+	  dst = change_address (dst, BLKmode, destreg);
+	  epilogue_size_needed = GET_MODE_SIZE (move_mode);
+	}
+      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != destreg)
+	emit_move_insn (destreg, tmp);
     }
+  if (count_exp != const0_rtx && epilogue_size_needed > 1)
+    expand_setmem_epilogue (dst, destreg, promoted_val, val_exp, count_exp,
+			    epilogue_size_needed);
   if (jump_around_label)
     emit_label (jump_around_label);
   return true;
@@ -36374,6 +36798,87 @@ ix86_autovectorize_vector_sizes (void)
   return (TARGET_AVX && !TARGET_PREFER_AVX128) ? 32 | 16 : 0;
 }
 
+/* Target hook.  Return true if unaligned accesses in MODE to data aligned
+   to ALIGN bits are slow.  */
+
+static bool
+ix86_slow_unaligned_access (enum machine_mode mode,
+			    unsigned int align)
+{
+  if (TARGET_AVX)
+    {
+      if (GET_MODE_SIZE (mode) == 32)
+	{
+	  if (align <= 16)
+	    return (TARGET_AVX256_SPLIT_UNALIGNED_LOAD
+		    || TARGET_AVX256_SPLIT_UNALIGNED_STORE);
+	  else
+	    return false;
+	}
+    }
+
+  if (GET_MODE_SIZE (mode) > 8)
+    {
+      return (! TARGET_SSE_UNALIGNED_LOAD_OPTIMAL
+	      && ! TARGET_SSE_UNALIGNED_STORE_OPTIMAL);
+    }
+
+  return false;
+}
+
+/* Target hook.  Return an rtx of mode MODE containing the promoted value
+   VAL, which is supposed to represent one byte.  MODE could be a vector mode.
+   Example:
+   1) VAL = const_int (0xAB), MODE = SImode,
+   the result is const_int (0xABABABAB).
+   2) if VAL isn't constant, the result is the result of multiplying VAL by
+   const_int (0x01010101) (for SImode).  */
+
+static rtx
+ix86_promote_rtx_for_memset (enum machine_mode mode,
+			     rtx val)
+{
+  enum machine_mode val_mode = GET_MODE (val);
+  gcc_assert (VALID_INT_MODE_P (val_mode) || val_mode == VOIDmode);
+
+  if (vector_extensions_used_for_mode (mode) && TARGET_SSE)
+    {
+      rtx promoted_val, vec_reg;
+      enum machine_mode vec_mode = TARGET_64BIT ? V2DImode : V4SImode;
+      if (CONST_INT_P (val))
+	{
+	  rtx const_vec;
+	  HOST_WIDE_INT int_val = (UINTVAL (val) & 0xFF)
+				   * (TARGET_64BIT
+				      ? 0x0101010101010101
+				      : 0x01010101);
+	  val = gen_int_mode (int_val, Pmode);
+	  vec_reg = gen_reg_rtx (vec_mode);
+	  const_vec = ix86_build_const_vector (vec_mode, true, val);
+	  if (mode != vec_mode)
+	    const_vec = convert_to_mode (vec_mode, const_vec, 1);
+	  emit_move_insn (vec_reg, const_vec);
+	  return vec_reg;
+	}
+      /* Else: val isn't const.  */
+      promoted_val = promote_duplicated_reg (Pmode, val);
+      vec_reg = gen_reg_rtx (vec_mode);
+      switch (vec_mode)
+	{
+	case V2DImode:
+	  emit_insn (gen_vec_dupv2di (vec_reg, promoted_val));
+	  break;
+	case V4SImode:
+	  emit_insn (gen_vec_dupv4si (vec_reg, promoted_val));
+	  break;
+	default:
+	  gcc_unreachable ();
+	  break;
+	}
+      return vec_reg;
+    }
+  return NULL_RTX;
+}
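
The constant path of this hook relies on the usual byte-broadcast multiplication described in the comment above.  A tiny stand-alone check of that arithmetic (plain C, illustrative only, not part of the patch):

/* Illustrative only: broadcasting one byte across a word with a
   0x01...01 multiplier, as the hook does for constant VAL.  */
#include <stdio.h>
#include <stdint.h>

int
main (void)
{
  uint64_t byte = 0xAB;

  printf ("%016llx\n",
	  (unsigned long long) (byte * 0x0101010101010101ULL)); /* abababababababab */
  printf ("%08x\n",
	  (unsigned) (byte * 0x01010101U));                     /* abababab */
  return 0;
}
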
+
 /* Initialize the GCC target structure.  */
 #undef TARGET_RETURN_IN_MEMORY
 #define TARGET_RETURN_IN_MEMORY ix86_return_in_memory
@@ -36681,6 +37186,12 @@ ix86_autovectorize_vector_sizes (void)
 #undef TARGET_CONDITIONAL_REGISTER_USAGE
 #define TARGET_CONDITIONAL_REGISTER_USAGE ix86_conditional_register_usage
 
+#undef TARGET_SLOW_UNALIGNED_ACCESS
+#define TARGET_SLOW_UNALIGNED_ACCESS ix86_slow_unaligned_access
+
+#undef TARGET_PROMOTE_RTX_FOR_MEMSET
+#define TARGET_PROMOTE_RTX_FOR_MEMSET ix86_promote_rtx_for_memset
+
 #if TARGET_MACHO
 #undef TARGET_INIT_LIBFUNCS
 #define TARGET_INIT_LIBFUNCS darwin_rename_builtins
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 7d6e058..1336f9f 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -159,8 +159,12 @@ struct processor_costs {
   const int fchs;		/* cost of FCHS instruction.  */
   const int fsqrt;		/* cost of FSQRT instruction.  */
 				/* Specify what algorithm
-				   to use for stringops on unknown size.  */
-  struct stringop_algs memcpy[2], memset[2];
+				   to use for stringops on unknown size.
+				   The first index specifies whether the
+				   alignment is known or not.
+				   The second selects between 32-bit and
+				   64-bit code.  */
+  struct stringop_algs memcpy[2][2], memset[2][2];
   const int scalar_stmt_cost;   /* Cost of any scalar operation, excluding
 				   load and store.  */
   const int scalar_load_cost;   /* Cost of scalar load.  */
@@ -1712,7 +1716,7 @@ typedef struct ix86_args {
 /* If a clear memory operation would take CLEAR_RATIO or more simple
    move-instruction sequences, we will do a clrmem or libcall instead.  */
 
-#define CLEAR_RATIO(speed) ((speed) ? MIN (6, ix86_cost->move_ratio) : 2)
+#define CLEAR_RATIO(speed) ((speed) ? ix86_cost->move_ratio : 2)
 
 /* Define if shifts truncate the shift count which implies one can
    omit a sign-extension or zero-extension of a shift count.
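
The new second dimension of the cost tables is indexed first by whether the alignment is unknown and then by whether 64-bit code is generated, exactly as decide_alg now does with cost->memcpy[align_unknown][TARGET_64BIT != 0].  A toy sketch of the lookup (plain C with simplified, made-up types; not part of the patch):

/* Illustrative only: how the two-dimensional stringop tables are indexed.  */
struct toy_algs { const char *descr; };

static const struct toy_algs toy_memcpy[2][2] = {
  { { "32-bit, alignment known" },   { "64-bit, alignment known" } },
  { { "32-bit, alignment unknown" }, { "64-bit, alignment unknown" } },
};

const char *
toy_decide_alg (int align_unknown, int target_64bit)
{
  /* Mirrors: algs = &cost->memcpy[align_unknown][TARGET_64BIT != 0];  */
  return toy_memcpy[align_unknown != 0][target_64bit != 0].descr;
}
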
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 6c20ddb..3e363f4 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -7244,6 +7244,13 @@
    (set_attr "prefix" "maybe_vex,maybe_vex,orig,orig,vex")
    (set_attr "mode" "TI,TI,V4SF,SF,SF")])
 
+(define_expand "sse2_loadq"
+ [(set (match_operand:V2DI 0 "register_operand")
+       (vec_concat:V2DI
+	 (match_operand:DI 1 "memory_operand")
+	 (const_int 0)))]
+  "!TARGET_64BIT && TARGET_SSE2")
+
 (define_insn_and_split "sse2_stored"
   [(set (match_operand:SI 0 "nonimmediate_operand" "=xm,r")
 	(vec_select:SI
@@ -7355,6 +7362,16 @@
    (set_attr "prefix" "maybe_vex,orig,vex,maybe_vex,orig,orig")
    (set_attr "mode" "V2SF,TI,TI,TI,V4SF,V2SF")])
 
+(define_expand "vec_dupv4si"
+  [(set (match_operand:V4SI 0 "register_operand" "")
+	(vec_duplicate:V4SI
+	  (match_operand:SI 1 "nonimmediate_operand" "")))]
+  "TARGET_SSE"
+{
+  if (!TARGET_AVX)
+    operands[1] = force_reg (V4SImode, operands[1]);
+})
+
 (define_insn "*vec_dupv4si_avx"
   [(set (match_operand:V4SI 0 "register_operand"     "=x,x")
 	(vec_duplicate:V4SI
@@ -7396,6 +7413,16 @@
    (set_attr "prefix" "orig,vex,maybe_vex")
    (set_attr "mode" "TI,TI,DF")])
 
+(define_expand "vec_dupv2di"
+  [(set (match_operand:V2DI 0 "register_operand" "")
+	(vec_duplicate:V2DI
+	  (match_operand:DI 1 "nonimmediate_operand" "")))]
+  "TARGET_SSE"
+{
+  if (!TARGET_AVX)
+    operands[1] = force_reg (V2DImode, operands[1]);
+})
+
 (define_insn "*vec_dupv2di"
   [(set (match_operand:V2DI 0 "register_operand" "=x,x")
 	(vec_duplicate:V2DI
diff --git a/gcc/cse.c b/gcc/cse.c
index ae67685..3b6471d 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -4616,7 +4616,10 @@ cse_insn (rtx insn)
 		 to fold switch statements when an ADDR_DIFF_VEC is used.  */
 	      || (GET_CODE (src_folded) == MINUS
 		  && GET_CODE (XEXP (src_folded, 0)) == LABEL_REF
-		  && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF)))
+		  && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF))
+	      /* Don't propagate vector constants, since for now no
+		 architecture supports vector immediates.  */
+	  && !vector_extensions_used_for_mode (mode))
 	src_const = src_folded, src_const_elt = elt;
       else if (src_const == 0 && src_eqv_here && CONSTANT_P (src_eqv_here))
 	src_const = src_eqv_here, src_const_elt = src_eqv_elt;
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 335c1d1..479d534 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5780,6 +5780,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum machine_mode @var{mode}, unsigned int @var{align})
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options require building faster code, the hook
+can be used to prevent unaligned accesses in some set of modes even if the
+processor can perform them without trapping.
+
+By default the hook returns the value of the @code{SLOW_UNALIGNED_ACCESS}
+macro if it is defined, and @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
+@deftypefn {Target Hook} rtx TARGET_PROMOTE_RTX_FOR_MEMSET (enum machine_mode @var{mode}, rtx @var{val})
+This hook returns an rtx of mode @var{mode} containing the promoted value
+@var{val}, or @code{NULL_RTX}.  The hook emits the instructions needed to
+promote @var{val} to mode @var{mode}.  If the instructions for the
+promotion cannot be generated, the hook returns @code{NULL_RTX}.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6252,23 +6278,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 6783826..9073d9e 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -5718,6 +5718,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@hook TARGET_SLOW_UNALIGNED_ACCESS
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options require building faster code, the hook
+can be used to prevent unaligned accesses in some set of modes even if the
+processor can perform them without trapping.
+
+By default the hook returns the value of the @code{SLOW_UNALIGNED_ACCESS}
+macro if it is defined, and @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
+@hook TARGET_PROMOTE_RTX_FOR_MEMSET
+This hook returns an rtx of mode @var{mode} containing the promoted value
+@var{val}, or @code{NULL_RTX}.  The hook emits the instructions needed to
+promote @var{val} to mode @var{mode}.  If the instructions for the
+promotion cannot be generated, the hook returns @code{NULL_RTX}.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6190,23 +6216,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c
index ee38d3c..dc7d052 100644
--- a/gcc/emit-rtl.c
+++ b/gcc/emit-rtl.c
@@ -1494,6 +1494,11 @@ get_mem_align_offset (rtx mem, unsigned int align)
       if (TYPE_ALIGN (TREE_TYPE (expr)) < (unsigned int) align)
 	return -1;
     }
+  else if (TREE_CODE (expr) == MEM_REF)
+    {
+      if (MEM_ALIGN (mem) < (unsigned int) align)
+	return -1;
+    }
   else if (TREE_CODE (expr) == COMPONENT_REF)
     {
       while (1)
@@ -2051,7 +2056,6 @@ adjust_address_1 (rtx memref, enum machine_mode mode, HOST_WIDE_INT offset,
   enum machine_mode address_mode;
   int pbits;
   struct mem_attrs attrs, *defattrs;
-  unsigned HOST_WIDE_INT max_align;
 
   attrs = *get_mem_attrs (memref);
 
@@ -2108,8 +2112,12 @@ adjust_address_1 (rtx memref, enum machine_mode mode, HOST_WIDE_INT offset,
      if zero.  */
   if (offset != 0)
     {
-      max_align = (offset & -offset) * BITS_PER_UNIT;
-      attrs.align = MIN (attrs.align, max_align);
+      int old_offset = get_mem_align_offset (memref, MOVE_MAX*BITS_PER_UNIT);
+      if (old_offset >= 0)
+	attrs.align = compute_align_by_offset (old_offset + offset);
+      else
+	attrs.align = MIN (attrs.align,
+	      (unsigned HOST_WIDE_INT) (offset & -offset) * BITS_PER_UNIT);
     }
 
   /* We can compute the size in a number of ways.  */
diff --git a/gcc/expr.c b/gcc/expr.c
index 29bf68b..6f9667b 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -126,15 +126,18 @@ struct store_by_pieces_d
 static unsigned HOST_WIDE_INT move_by_pieces_ninsns (unsigned HOST_WIDE_INT,
 						     unsigned int,
 						     unsigned int);
-static void move_by_pieces_1 (rtx (*) (rtx, ...), enum machine_mode,
-			      struct move_by_pieces_d *);
+static void move_by_pieces_insn (rtx (*) (rtx, ...), enum machine_mode,
+				 struct move_by_pieces_d *);
 static bool block_move_libcall_safe_for_call_parm (void);
 static bool emit_block_move_via_movmem (rtx, rtx, rtx, unsigned, unsigned, HOST_WIDE_INT);
 static tree emit_block_move_libcall_fn (int);
 static void emit_block_move_via_loop (rtx, rtx, rtx, unsigned);
 static rtx clear_by_pieces_1 (void *, HOST_WIDE_INT, enum machine_mode);
 static void clear_by_pieces (rtx, unsigned HOST_WIDE_INT, unsigned int);
+static void set_by_pieces_1 (struct store_by_pieces_d *, unsigned int);
 static void store_by_pieces_1 (struct store_by_pieces_d *, unsigned int);
+static void set_by_pieces_2 (rtx (*) (rtx, ...), enum machine_mode,
+			       struct store_by_pieces_d *, rtx);
 static void store_by_pieces_2 (rtx (*) (rtx, ...), enum machine_mode,
 			       struct store_by_pieces_d *);
 static tree clear_storage_libcall_fn (int);
@@ -163,6 +166,12 @@ static void do_tablejump (rtx, enum machine_mode, rtx, rtx, rtx);
 static rtx const_vector_from_tree (tree);
 static void write_complex_part (rtx, rtx, bool);
 
+static enum machine_mode widest_mode_for_unaligned_mov (unsigned HOST_WIDE_INT);
+static enum machine_mode widest_mode_for_aligned_mov (unsigned HOST_WIDE_INT,
+						      unsigned int);
+static enum machine_mode generate_move_with_mode (struct store_by_pieces_d *,
+					   enum machine_mode, rtx *, rtx *);
+
 /* This macro is used to determine whether move_by_pieces should be called
    to perform a structure copy.  */
 #ifndef MOVE_BY_PIECES_P
@@ -811,7 +820,7 @@ alignment_for_piecewise_move (unsigned int max_pieces, unsigned int align)
 	   tmode != VOIDmode;
 	   xmode = tmode, tmode = GET_MODE_WIDER_MODE (tmode))
 	if (GET_MODE_SIZE (tmode) > max_pieces
-	    || SLOW_UNALIGNED_ACCESS (tmode, align))
+	    || targetm.slow_unaligned_access (tmode, align))
 	  break;
 
       align = MAX (align, GET_MODE_ALIGNMENT (xmode));
@@ -820,11 +829,66 @@ alignment_for_piecewise_move (unsigned int max_pieces, unsigned int align)
   return align;
 }
 
+/* Given an offset from an alignment boundary, compute the maximal
+   alignment of the data at that offset.  */
+unsigned int
+compute_align_by_offset (int offset)
+{
+  return (offset == 0
+	  ? MOVE_MAX * BITS_PER_UNIT
+	  : MIN (MOVE_MAX, (offset & -offset)) * BITS_PER_UNIT);
+}
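
The offset & -offset idiom above isolates the lowest set bit of the offset, which bounds the alignment that can still be assumed at that offset.  A short check of the arithmetic (plain C, not part of the patch; MOVE_MAX is assumed to be 16 bytes here):

/* Illustrative only: alignment in bytes that survives at a given offset
   from a 16-byte-aligned base.  */
#include <stdio.h>

static unsigned
align_at_offset (int offset, unsigned move_max)
{
  if (offset == 0)
    return move_max;
  unsigned low_bit = offset & -offset;
  return low_bit < move_max ? low_bit : move_max;
}

int
main (void)
{
  for (int off = 0; off <= 8; off++)
    printf ("offset %d -> %u-byte alignment\n", off, align_at_offset (off, 16));
  return 0;
}
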
+
+/* Estimate cost of move for given size and offset.  Offset is used for
+   determining max alignment.  */
+static int
+compute_aligned_cost (unsigned HOST_WIDE_INT size, int offset)
+{
+  unsigned HOST_WIDE_INT cost = 0;
+  int cur_off = offset;
+
+  while (size > 0)
+    {
+      enum machine_mode mode = widest_mode_for_aligned_mov (size,
+	  compute_align_by_offset (cur_off));
+      int cur_mode_cost;
+      enum vect_cost_for_stmt type_of_cost = vector_load;
+      if (GET_MODE_SIZE (mode) <= UNITS_PER_WORD
+	  && (SCALAR_INT_MODE_P (mode) || SCALAR_FLOAT_MODE_P (mode)))
+	type_of_cost = scalar_load;
+      cur_mode_cost =
+	targetm.vectorize.builtin_vectorization_cost (type_of_cost, NULL, 0);
+      size -= GET_MODE_SIZE (mode);
+      cur_off += GET_MODE_SIZE (mode);
+      cost += cur_mode_cost;
+    }
+  return cost;
+}
+
+/* Estimate cost of move for given size.  It is assumed that the
+   alignment is unknown, so we need to use unaligned moves.  */
+static int
+compute_unaligned_cost (unsigned HOST_WIDE_INT size)
+{
+  unsigned HOST_WIDE_INT cost = 0;
+  while (size > 0)
+    {
+      enum machine_mode mode = widest_mode_for_unaligned_mov (size);
+      unsigned HOST_WIDE_INT n_insns = size / GET_MODE_SIZE (mode);
+      int cur_mode_cost =
+	targetm.vectorize.builtin_vectorization_cost (unaligned_load, NULL, 0);
+
+      cost += n_insns * cur_mode_cost;
+      size %= GET_MODE_SIZE (mode);
+    }
+  return cost;
+}
+
 /* Return the widest integer mode no wider than SIZE.  If no such mode
    can be found, return VOIDmode.  */
 
 static enum machine_mode
-widest_int_mode_for_size (unsigned int size)
+widest_int_mode_for_size (unsigned HOST_WIDE_INT size)
 {
   enum machine_mode tmode, mode = VOIDmode;
 
@@ -836,6 +900,170 @@ widest_int_mode_for_size (unsigned int size)
   return mode;
 }
 
+/* If MODE is a scalar mode, find the corresponding preferred vector mode.
+   If no such mode can be found, return the vector mode corresponding to
+   Pmode (a kind of default vector mode).
+   For vector modes return the mode itself.  */
+
+static enum machine_mode
+vector_mode_for_mode (enum machine_mode mode)
+{
+  enum machine_mode xmode;
+  if (VECTOR_MODE_P (mode))
+    return mode;
+  xmode = targetm.vectorize.preferred_simd_mode (mode);
+  if (VECTOR_MODE_P (xmode))
+    return xmode;
+
+  return targetm.vectorize.preferred_simd_mode (Pmode);
+}
+
+/* Check whether vector instructions are required for operating on the
+   specified mode.
+   For vector modes, check whether the corresponding vector extension is
+   supported.
+   Operations on a scalar mode will use vector extensions if this scalar
+   mode is wider than the default scalar mode (Pmode) and the vector
+   extension for the parent vector mode is available.  */
+
+bool
+vector_extensions_used_for_mode (enum machine_mode mode)
+{
+  enum machine_mode vector_mode = vector_mode_for_mode (mode);
+
+  if (VECTOR_MODE_P (mode))
+    return targetm.vector_mode_supported_p (mode);
+
+  /* mode is a scalar mode.  */
+  if (VECTOR_MODE_P (vector_mode)
+     && targetm.vector_mode_supported_p (vector_mode)
+     && (GET_MODE_SIZE (mode) > GET_MODE_SIZE (Pmode)))
+    return true;
+
+  return false;
+}
+
+/* Find the widest move mode for the given size if alignment is unknown.  */
+static enum machine_mode
+widest_mode_for_unaligned_mov (unsigned HOST_WIDE_INT size)
+{
+  enum machine_mode mode;
+  enum machine_mode tmode, xmode;
+  enum machine_mode best_simd_mode = targetm.vectorize.preferred_simd_mode (
+      mode_for_size (UNITS_PER_WORD * BITS_PER_UNIT, MODE_INT, 0));
+
+  /* Find the widest integer mode.  Here we can find modes wider than Pmode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_INT), xmode = VOIDmode;
+       tmode != VOIDmode;
+       tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size
+	  || targetm.slow_unaligned_access (tmode, BITS_PER_UNIT))
+	break;
+      if (optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.scalar_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+  mode = xmode;
+
+  /* Find the widest vector mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_VECTOR_INT), xmode = VOIDmode;
+       tmode != VOIDmode;
+       tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size
+	  || targetm.slow_unaligned_access (tmode, BITS_PER_UNIT))
+	break;
+      if (GET_MODE_SIZE (GET_MODE_INNER (tmode)) == UNITS_PER_WORD
+	  && optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.vector_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+
+  /* Choose between integer and vector modes.  */
+  if (xmode != VOIDmode && GET_MODE_SIZE (xmode) > GET_MODE_SIZE (mode))
+    mode = xmode;
+
+  /* If the vector and scalar modes found have the same size, and the vector
+     mode is best_simd_mode, then prefer the vector mode.  */
+  if (xmode != VOIDmode
+      && GET_MODE_SIZE (xmode) == GET_MODE_SIZE (mode)
+      && xmode == best_simd_mode)
+    mode = xmode;
+
+  /* If we failed to find a mode that might use vector extensions, try to
+     find the widest ordinary integer mode.  */
+  if (mode == VOIDmode)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  /* If the mode found won't use vector extensions, then there is no need
+     to use a mode wider than Pmode.  */
+  if (!vector_extensions_used_for_mode (mode)
+      && GET_MODE_SIZE (mode) > MOVE_MAX_PIECES)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  return mode;
+}
+
+/* Find the widest move mode for the given size and alignment.  */
+static enum machine_mode
+widest_mode_for_aligned_mov (unsigned HOST_WIDE_INT size, unsigned int align)
+{
+  enum machine_mode mode;
+  enum machine_mode tmode, xmode;
+  enum machine_mode best_simd_mode = targetm.vectorize.preferred_simd_mode (
+      mode_for_size (UNITS_PER_WORD * BITS_PER_UNIT, MODE_INT, 0));
+
+  /* Find the widest integer mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_INT), xmode = VOIDmode;
+      tmode != VOIDmode;
+      tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size || GET_MODE_ALIGNMENT (tmode) > align)
+	break;
+      if (optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.scalar_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+  mode = xmode;
+
+  /* Find the widest vector mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_VECTOR_INT), xmode = VOIDmode;
+      tmode != VOIDmode;
+      tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size || GET_MODE_ALIGNMENT (tmode) > align)
+	break;
+      if (GET_MODE_SIZE (GET_MODE_INNER (tmode)) == UNITS_PER_WORD
+	  && optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.vector_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+
+  /* Choose between integer and vector modes.  */
+  if (xmode != VOIDmode && GET_MODE_SIZE (xmode) > GET_MODE_SIZE (mode))
+    mode = xmode;
+
+  /* If the vector and scalar modes found have the same size, and the vector
+     mode is best_simd_mode, then prefer the vector mode.  */
+  if (xmode != VOIDmode
+      && GET_MODE_SIZE (xmode) == GET_MODE_SIZE (mode)
+      && xmode == best_simd_mode)
+    mode = xmode;
+
+  /* If we failed to find a mode that might use vector extensions, try to
+     find the widest ordinary integer mode.  */
+  if (mode == VOIDmode)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  /* If the mode found won't use vector extensions, then there is no need
+     to use a mode wider than Pmode.  */
+  if (!vector_extensions_used_for_mode (mode)
+      && GET_MODE_SIZE (mode) > MOVE_MAX_PIECES)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  return mode;
+}
+
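
Taken together, the two helpers above drive the by-pieces loops further down: the copy loop keeps asking for the widest mode that still fits the remaining length and the alignment implied by the current offset.  A rough model of that behaviour (plain C, illustrative only; the mode sizes and the 16-byte upper bound are assumptions for an SSE-capable 64-bit target, not values taken from the patch):

/* Illustrative only: model of the aligned by-pieces loop.  widest_size
   stands in for widest_mode_for_aligned_mov.  */
#include <stdio.h>

static int
widest_size (long len, int align)
{
  static const int sizes[] = { 16, 8, 4, 2, 1 };
  for (int i = 0; i < 5; i++)
    if (sizes[i] <= len && sizes[i] <= align)
      return sizes[i];
  return 1;
}

int
main (void)
{
  long len = 27;
  int offset = 4;                     /* distance from a 16-byte boundary  */

  while (len > 0)
    {
      int align = offset == 0 ? 16 : (offset & -offset);
      int size = widest_size (len, align);
      printf ("copy %2d bytes at offset %d\n", size, offset);
      offset += size;
      len -= size;
    }
  return 0;
}
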
 /* STORE_MAX_PIECES is the number of bytes at a time that we can
    store efficiently.  Due to internal GCC limitations, this is
    MOVE_MAX_PIECES limited by the number of bytes GCC can represent
@@ -876,6 +1104,7 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
   rtx to_addr, from_addr = XEXP (from, 0);
   unsigned int max_size = MOVE_MAX_PIECES + 1;
   enum insn_code icode;
+  int dst_offset, src_offset;
 
   align = MIN (to ? MEM_ALIGN (to) : align, MEM_ALIGN (from));
 
@@ -960,23 +1189,37 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
 	data.to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
     }
 
-  align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
-
-  /* First move what we can in the largest integer mode, then go to
-     successively smaller modes.  */
-
-  while (max_size > 1)
+  src_offset = get_mem_align_offset (from, MOVE_MAX * BITS_PER_UNIT);
+  dst_offset = get_mem_align_offset (to, MOVE_MAX * BITS_PER_UNIT);
+  if (src_offset < 0
+      || dst_offset < 0
+      || src_offset != dst_offset
+      || (compute_aligned_cost (data.len, src_offset)
+	  >= compute_unaligned_cost (data.len)))
     {
-      enum machine_mode mode = widest_int_mode_for_size (max_size);
-
-      if (mode == VOIDmode)
-	break;
+      while (data.len > 0)
+	{
+	  enum machine_mode mode = widest_mode_for_unaligned_mov (data.len);
 
-      icode = optab_handler (mov_optab, mode);
-      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
-	move_by_pieces_1 (GEN_FCN (icode), mode, &data);
+	  icode = optab_handler (mov_optab, mode);
+	  gcc_assert (icode != CODE_FOR_nothing);
+	  move_by_pieces_insn (GEN_FCN (icode), mode, &data);
+	}
+    }
+  else
+    {
+      while (data.len > 0)
+	{
+	  enum machine_mode mode;
+	  mode = widest_mode_for_aligned_mov (data.len,
+	      compute_align_by_offset (src_offset));
 
-      max_size = GET_MODE_SIZE (mode);
+	  icode = optab_handler (mov_optab, mode);
+	  gcc_assert (icode != CODE_FOR_nothing
+		      && (compute_align_by_offset (src_offset)
+			  >= GET_MODE_ALIGNMENT (mode)));
+	  move_by_pieces_insn (GEN_FCN (icode), mode, &data);
+	  src_offset += GET_MODE_SIZE (mode);
+	}
     }
 
   /* The code above should have handled everything.  */
@@ -1014,35 +1257,47 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
 }
 
 /* Return number of insns required to move L bytes by pieces.
-   ALIGN (in bits) is maximum alignment we can assume.  */
+   ALIGN (in bits) is maximum alignment we can assume.
+   This is just an estimate, so the actual number of instructions might
+   differ from it (there are several ways of expanding memmove).  */
 
 static unsigned HOST_WIDE_INT
 move_by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align,
-		       unsigned int max_size)
+		       unsigned int max_size ATTRIBUTE_UNUSED)
 {
   unsigned HOST_WIDE_INT n_insns = 0;
-
-  align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
-
-  while (max_size > 1)
+  unsigned HOST_WIDE_INT n_insns_u = 0;
+  enum machine_mode mode;
+  unsigned HOST_WIDE_INT len = l;
+  while (len > 0)
     {
-      enum machine_mode mode;
-      enum insn_code icode;
-
-      mode = widest_int_mode_for_size (max_size);
-
-      if (mode == VOIDmode)
-	break;
+      mode = widest_mode_for_aligned_mov (len, align);
+      if (GET_MODE_SIZE (mode) < MOVE_MAX)
+	{
+	  align += GET_MODE_ALIGNMENT (mode);
+	  len -= GET_MODE_SIZE (mode);
+	  n_insns++;
+	}
+      else
+	{
+	  /* We are using the widest mode.  */
+	  n_insns += len / GET_MODE_SIZE (mode);
+	  len %= GET_MODE_SIZE (mode);
+	}
+    }
+  gcc_assert (!len);
 
-      icode = optab_handler (mov_optab, mode);
-      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
-	n_insns += l / GET_MODE_SIZE (mode), l %= GET_MODE_SIZE (mode);
+  len = l;
+  while (len > 0)
+    {
+      mode = widest_mode_for_unaligned_mov (len);
+      n_insns_u += len / GET_MODE_SIZE (mode);
+      len %= GET_MODE_SIZE (mode);
 
-      max_size = GET_MODE_SIZE (mode);
     }
 
-  gcc_assert (!l);
-  return n_insns;
+  gcc_assert (!len);
+  return MIN (n_insns, n_insns_u);
 }
 
 /* Subroutine of move_by_pieces.  Move as many bytes as appropriate
@@ -1050,60 +1305,57 @@ move_by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align,
    to make a move insn for that mode.  DATA has all the other info.  */
 
 static void
-move_by_pieces_1 (rtx (*genfun) (rtx, ...), enum machine_mode mode,
+move_by_pieces_insn (rtx (*genfun) (rtx, ...), enum machine_mode mode,
 		  struct move_by_pieces_d *data)
 {
   unsigned int size = GET_MODE_SIZE (mode);
   rtx to1 = NULL_RTX, from1;
 
-  while (data->len >= size)
-    {
-      if (data->reverse)
-	data->offset -= size;
-
-      if (data->to)
-	{
-	  if (data->autinc_to)
-	    to1 = adjust_automodify_address (data->to, mode, data->to_addr,
-					     data->offset);
-	  else
-	    to1 = adjust_address (data->to, mode, data->offset);
-	}
+  if (data->reverse)
+    data->offset -= size;
 
-      if (data->autinc_from)
-	from1 = adjust_automodify_address (data->from, mode, data->from_addr,
-					   data->offset);
+  if (data->to)
+    {
+      if (data->autinc_to)
+	to1 = adjust_automodify_address (data->to, mode, data->to_addr,
+					 data->offset);
       else
-	from1 = adjust_address (data->from, mode, data->offset);
+	to1 = adjust_address (data->to, mode, data->offset);
+    }
 
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
-	emit_insn (gen_add2_insn (data->to_addr,
-				  GEN_INT (-(HOST_WIDE_INT)size)));
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_from < 0)
-	emit_insn (gen_add2_insn (data->from_addr,
-				  GEN_INT (-(HOST_WIDE_INT)size)));
+  if (data->autinc_from)
+    from1 = adjust_automodify_address (data->from, mode, data->from_addr,
+				       data->offset);
+  else
+    from1 = adjust_address (data->from, mode, data->offset);
 
-      if (data->to)
-	emit_insn ((*genfun) (to1, from1));
-      else
-	{
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
+    emit_insn (gen_add2_insn (data->to_addr,
+			      GEN_INT (-(HOST_WIDE_INT)size)));
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_from < 0)
+    emit_insn (gen_add2_insn (data->from_addr,
+			      GEN_INT (-(HOST_WIDE_INT)size)));
+
+  if (data->to)
+    emit_insn ((*genfun) (to1, from1));
+  else
+    {
 #ifdef PUSH_ROUNDING
-	  emit_single_push_insn (mode, from1, NULL);
+      emit_single_push_insn (mode, from1, NULL);
 #else
-	  gcc_unreachable ();
+      gcc_unreachable ();
 #endif
-	}
+    }
 
-      if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
-	emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
-      if (HAVE_POST_INCREMENT && data->explicit_inc_from > 0)
-	emit_insn (gen_add2_insn (data->from_addr, GEN_INT (size)));
+  if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
+    emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
+  if (HAVE_POST_INCREMENT && data->explicit_inc_from > 0)
+    emit_insn (gen_add2_insn (data->from_addr, GEN_INT (size)));
 
-      if (! data->reverse)
-	data->offset += size;
+  if (! data->reverse)
+    data->offset += size;
 
-      data->len -= size;
-    }
+  data->len -= size;
 }
 \f
 /* Emit code to move a block Y to a block X.  This may be done with
@@ -1680,7 +1932,7 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src, tree type, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (src)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (src))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (src))
 	      || MEM_ALIGN (src) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -2070,7 +2322,7 @@ emit_group_store (rtx orig_dst, rtx src, tree type ATTRIBUTE_UNUSED, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (dest)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (dest))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (dest))
 	      || MEM_ALIGN (dest) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -2358,7 +2610,10 @@ store_by_pieces (rtx to, unsigned HOST_WIDE_INT len,
   data.constfundata = constfundata;
   data.len = len;
   data.to = to;
-  store_by_pieces_1 (&data, align);
+  if (memsetp)
+    set_by_pieces_1 (&data, align);
+  else
+    store_by_pieces_1 (&data, align);
   if (endp)
     {
       rtx to1;
@@ -2402,10 +2657,10 @@ clear_by_pieces (rtx to, unsigned HOST_WIDE_INT len, unsigned int align)
     return;
 
   data.constfun = clear_by_pieces_1;
-  data.constfundata = NULL;
+  data.constfundata = CONST0_RTX (QImode);
   data.len = len;
   data.to = to;
-  store_by_pieces_1 (&data, align);
+  set_by_pieces_1 (&data, align);
 }
 
 /* Callback routine for clear_by_pieces.
@@ -2419,13 +2674,122 @@ clear_by_pieces_1 (void *data ATTRIBUTE_UNUSED,
   return const0_rtx;
 }
 
-/* Subroutine of clear_by_pieces and store_by_pieces.
+/* Helper function for set by pieces - generates a move with the given mode.
+   Returns the mode used in the generated move (it may differ from the
+   requested one if that mode isn't supported).  */
+static enum machine_mode
+generate_move_with_mode (struct store_by_pieces_d *data,
+			 enum machine_mode mode,
+			 rtx *promoted_to_vector_value_ptr,
+			 rtx *promoted_value_ptr)
+{
+  enum insn_code icode;
+  rtx rhs = NULL_RTX;
+
+  gcc_assert (promoted_to_vector_value_ptr && promoted_value_ptr);
+
+  if (vector_extensions_used_for_mode (mode))
+    {
+      enum machine_mode vec_mode = vector_mode_for_mode (mode);
+      if (!(*promoted_to_vector_value_ptr))
+	*promoted_to_vector_value_ptr
+	  = targetm.promote_rtx_for_memset (vec_mode, (rtx)data->constfundata);
+
+      if (*promoted_to_vector_value_ptr)
+	rhs = convert_to_mode (vec_mode, *promoted_to_vector_value_ptr, 1);
+    }
+  else
+    {
+      if (CONST_INT_P ((rtx)data->constfundata))
+	{
+	  /* We don't need to load the constant to a register, if it could be
+	     encoded as an immediate operand.  */
+	  rtx imm_const;
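+	  /* Replicate the low byte of the constant across the whole mode by
+	     multiplying it by a 0x01...01 pattern.  */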
+	  switch (mode)
+	    {
+	    case DImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x0101010101010101, DImode);
+	      break;
+	    case SImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x01010101, SImode);
+	      break;
+	    case HImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x00000101, HImode);
+	      break;
+	    case QImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x00000001, QImode);
+	      break;
+	    default:
+	      gcc_unreachable ();
+	      break;
+	    }
+	  rhs = imm_const;
+	}
+      else /* data->constfundata isn't const.  */
+	{
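+	  /* Broadcast the byte value into a register once and cache it in
+	     *PROMOTED_VALUE_PTR so that subsequent moves can reuse it.  */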
+	  if (!(*promoted_value_ptr))
+	    {
+	      rtx coeff;
+	      enum machine_mode promoted_value_mode;
+	      /* Choose the mode for the promoted value.  It shouldn't be
+		 narrower than Pmode.  */
+	      if (GET_MODE_SIZE (mode) > GET_MODE_SIZE (Pmode))
+		promoted_value_mode = mode;
+	      else
+		promoted_value_mode = Pmode;
+
+	      switch (promoted_value_mode)
+		{
+		case DImode:
+		  coeff = gen_int_mode (0x0101010101010101, DImode);
+		  break;
+		case SImode:
+		  coeff = gen_int_mode (0x01010101, SImode);
+		  break;
+		default:
+		  gcc_unreachable ();
+		  break;
+		}
+	      *promoted_value_ptr = convert_to_mode (promoted_value_mode,
+						     (rtx)data->constfundata,
+						     1);
+	      *promoted_value_ptr = expand_mult (promoted_value_mode,
+						 *promoted_value_ptr, coeff,
+						 NULL_RTX, 1);
+	    }
+	  rhs = convert_to_mode (mode, *promoted_value_ptr, 1);
+	}
+    }
+  /* If RHS is null, then the requested mode isn't supported and can't be used.
+     Use Pmode instead.  */
+  if (!rhs)
+    {
+      generate_move_with_mode (data, Pmode, promoted_to_vector_value_ptr,
+			       promoted_value_ptr);
+      return Pmode;
+    }
+
+  gcc_assert (rhs);
+  icode = optab_handler (mov_optab, mode);
+  gcc_assert (icode != CODE_FOR_nothing);
+  set_by_pieces_2 (GEN_FCN (icode), mode, data, rhs);
+  return mode;
+}
+
+/* Subroutine of store_by_pieces.
    Generate several move instructions to store LEN bytes of block TO.  (A MEM
    rtx with BLKmode).  ALIGN is maximum alignment we can assume.  */
 
 static void
-store_by_pieces_1 (struct store_by_pieces_d *data ATTRIBUTE_UNUSED,
-		   unsigned int align ATTRIBUTE_UNUSED)
+store_by_pieces_1 (struct store_by_pieces_d *data, unsigned int align)
 {
   enum machine_mode to_addr_mode
     = targetm.addr_space.address_mode (MEM_ADDR_SPACE (data->to));
@@ -2500,6 +2864,134 @@ store_by_pieces_1 (struct store_by_pieces_d *data ATTRIBUTE_UNUSED,
   gcc_assert (!data->len);
 }
 
+/* Subroutine of clear_by_pieces and store_by_pieces.
+   Generate several move instructions to store LEN bytes of block TO.  (A MEM
+   rtx with BLKmode).  ALIGN is maximum alignment we can assume.
+   Unlike store_by_pieces_1, this routine always generates code for memset
+   (store_by_pieces_1 is sometimes used to generate code for memcpy rather
+   than for memset).  */
+
+static void
+set_by_pieces_1 (struct store_by_pieces_d *data, unsigned int align)
+{
+  enum machine_mode to_addr_mode
+    = targetm.addr_space.address_mode (MEM_ADDR_SPACE (data->to));
+  rtx to_addr = XEXP (data->to, 0);
+  unsigned int max_size = STORE_MAX_PIECES + 1;
+  int dst_offset;
+  rtx promoted_to_vector_value = NULL_RTX;
+  rtx promoted_value = NULL_RTX;
+
+  data->offset = 0;
+  data->to_addr = to_addr;
+  data->autinc_to
+    = (GET_CODE (to_addr) == PRE_INC || GET_CODE (to_addr) == PRE_DEC
+       || GET_CODE (to_addr) == POST_INC || GET_CODE (to_addr) == POST_DEC);
+
+  data->explicit_inc_to = 0;
+  data->reverse
+    = (GET_CODE (to_addr) == PRE_DEC || GET_CODE (to_addr) == POST_DEC);
+  if (data->reverse)
+    data->offset = data->len;
+
+  /* If storing requires more than two move insns,
+     copy addresses to registers (to make displacements shorter)
+     and use post-increment if available.  */
+  if (!data->autinc_to
+      && move_by_pieces_ninsns (data->len, align, max_size) > 2)
+    {
+      /* Determine the main mode we'll be using.
+	 MODE might not be used depending on the definitions of the
+	 USE_* macros below.  */
+      enum machine_mode mode ATTRIBUTE_UNUSED
+	= widest_int_mode_for_size (max_size);
+
+      if (USE_STORE_PRE_DECREMENT (mode) && data->reverse && ! data->autinc_to)
+	{
+	  data->to_addr = copy_to_mode_reg (to_addr_mode,
+					    plus_constant (to_addr, data->len));
+	  data->autinc_to = 1;
+	  data->explicit_inc_to = -1;
+	}
+
+      if (USE_STORE_POST_INCREMENT (mode) && ! data->reverse
+	  && ! data->autinc_to)
+	{
+	  data->to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
+	  data->autinc_to = 1;
+	  data->explicit_inc_to = 1;
+	}
+
+      if (!data->autinc_to && CONSTANT_P (to_addr))
+	data->to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
+    }
+
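+  /* Compare the estimated cost of moves aligned to the known destination
+     offset with the cost of unaligned moves; a negative DST_OFFSET means
+     the alignment isn't known at compile time.  */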
+  dst_offset = get_mem_align_offset (data->to, MOVE_MAX * BITS_PER_UNIT);
+  if (dst_offset < 0
+      || compute_aligned_cost (data->len, dst_offset)
+	 >= compute_unaligned_cost (data->len))
+    {
+      while (data->len > 0)
+	{
+	  enum machine_mode mode = widest_mode_for_unaligned_mov (data->len);
+	  generate_move_with_mode (data, mode, &promoted_to_vector_value,
+				   &promoted_value);
+	}
+    }
+  else
+    {
+      while (data->len > 0)
+	{
+	  enum machine_mode mode;
+	  mode = widest_mode_for_aligned_mov (data->len,
+	      compute_align_by_offset (dst_offset));
+	  mode = generate_move_with_mode (data, mode, &promoted_to_vector_value,
+					  &promoted_value);
+	  dst_offset += GET_MODE_SIZE (mode);
+	}
+    }
+
+  /* The code above should have handled everything.  */
+  gcc_assert (!data->len);
+}
+
+/* Subroutine of set_by_pieces_1.  Emit move instruction with mode MODE.
+   DATA has info about destination, RHS is source, GENFUN is the gen_...
+   function to make a move insn for that mode.  */
+
+static void
+set_by_pieces_2 (rtx (*genfun) (rtx, ...), enum machine_mode mode,
+		   struct store_by_pieces_d *data, rtx rhs)
+{
+  unsigned int size = GET_MODE_SIZE (mode);
+  rtx to1;
+
+  if (data->reverse)
+    data->offset -= size;
+
+  if (data->autinc_to)
+    to1 = adjust_automodify_address (data->to, mode, data->to_addr,
+	data->offset);
+  else
+    to1 = adjust_address (data->to, mode, data->offset);
+
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
+    emit_insn (gen_add2_insn (data->to_addr,
+	  GEN_INT (-(HOST_WIDE_INT) size)));
+
+  gcc_assert (rhs);
+
+  emit_insn ((*genfun) (to1, rhs));
+
+  if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
+    emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
+
+  if (! data->reverse)
+    data->offset += size;
+
+  data->len -= size;
+}
+
 /* Subroutine of store_by_pieces_1.  Store as many bytes as appropriate
    with move instructions for mode MODE.  GENFUN is the gen_... function
    to make a move insn for that mode.  DATA has all the other info.  */
@@ -3928,7 +4420,7 @@ emit_push_insn (rtx x, enum machine_mode mode, tree type, rtx size,
 	  /* Here we avoid the case of a structure whose weak alignment
 	     forces many pushes of a small amount of data,
 	     and such small pushes do rounding that causes trouble.  */
-	  && ((! SLOW_UNALIGNED_ACCESS (word_mode, align))
+	  && ((! targetm.slow_unaligned_access (word_mode, align))
 	      || align >= BIGGEST_ALIGNMENT
 	      || (PUSH_ROUNDING (align / BITS_PER_UNIT)
 		  == (align / BITS_PER_UNIT)))
@@ -6214,7 +6706,7 @@ store_field (rtx target, HOST_WIDE_INT bitsize, HOST_WIDE_INT bitpos,
       || (mode != BLKmode
 	  && ((((MEM_ALIGN (target) < GET_MODE_ALIGNMENT (mode))
 		|| bitpos % GET_MODE_ALIGNMENT (mode))
-	       && SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (target)))
+	       && targetm.slow_unaligned_access (mode, MEM_ALIGN (target)))
 	      || (bitpos % BITS_PER_UNIT != 0)))
       /* If the RHS and field are a constant size and the size of the
 	 RHS isn't the same size as the bitfield, we must use bitfield
@@ -9617,7 +10109,7 @@ expand_expr_real_1 (tree exp, rtx target, enum machine_mode tmode,
 		     && ((modifier == EXPAND_CONST_ADDRESS
 			  || modifier == EXPAND_INITIALIZER)
 			 ? STRICT_ALIGNMENT
-			 : SLOW_UNALIGNED_ACCESS (mode1, MEM_ALIGN (op0))))
+			 : targetm.slow_unaligned_access (mode1, MEM_ALIGN (op0))))
 		    || (bitpos % BITS_PER_UNIT != 0)))
 	    /* If the type and the field are a constant size and the
 	       size of the type isn't the same size as the bitfield,
diff --git a/gcc/expr.h b/gcc/expr.h
index 1652186..67541ab 100644
--- a/gcc/expr.h
+++ b/gcc/expr.h
@@ -704,4 +704,8 @@ extern tree build_libfunc_function (const char *);
 /* Get the personality libfunc for a function decl.  */
 rtx get_personality_function (tree);
 
+/* Given the offset from the maximum alignment boundary, compute the maximum
+   alignment that can be assumed.  */
+unsigned int compute_align_by_offset (int);
+
 #endif /* GCC_EXPR_H */
diff --git a/gcc/fwprop.c b/gcc/fwprop.c
index 236dda2..c6a8c3d 100644
--- a/gcc/fwprop.c
+++ b/gcc/fwprop.c
@@ -1270,6 +1270,10 @@ forward_propagate_and_simplify (df_ref use, rtx def_insn, rtx def_set)
       return false;
     }
 
+  /* Don't propagate vector constants.  */
+  if (vector_extensions_used_for_mode (GET_MODE (reg)) && CONSTANT_P (src))
+    return false;
+
   if (asm_use >= 0)
     return forward_propagate_asm (use, def_insn, def_set, reg);
 
diff --git a/gcc/rtl.h b/gcc/rtl.h
index 860f6c4..c2e6920 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -2511,6 +2511,9 @@ extern void emit_jump (rtx);
 /* In expr.c */
 extern rtx move_by_pieces (rtx, rtx, unsigned HOST_WIDE_INT,
 			   unsigned int, int);
+/* Return true if vector instructions are required for operating on the
+   specified mode.  */
+extern bool vector_extensions_used_for_mode (enum machine_mode);
 extern HOST_WIDE_INT find_args_size_adjust (rtx);
 extern int fixup_args_size_notes (rtx, rtx, int);
 
diff --git a/gcc/target.def b/gcc/target.def
index 1e09ba7..082ed99 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1498,6 +1498,22 @@ DEFHOOK
  bool, (struct ao_ref_s *ref),
  default_ref_may_alias_errno)
 
+/* True if access to unaligned data in the given mode is too slow or
+   prohibited.  */
+DEFHOOK
+(slow_unaligned_access,
+ "",
+ bool, (enum machine_mode mode, unsigned int align),
+ default_slow_unaligned_access)
+
+/* Target hook.  Returns an rtx of mode MODE containing the promoted value
+   VAL, or NULL.  VAL is supposed to represent one byte.  */
+DEFHOOK
+(promote_rtx_for_memset,
+ "",
+ rtx, (enum machine_mode mode, rtx val),
+ default_promote_rtx_for_memset)
+
 /* Support for named address spaces.  */
 #undef HOOK_PREFIX
 #define HOOK_PREFIX "TARGET_ADDR_SPACE_"
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index 8ad517f..617d8a3 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -1457,4 +1457,24 @@ default_pch_valid_p (const void *data_p, size_t len)
   return NULL;
 }
 
+/* Default version of the slow_unaligned_access target hook.  Use the
+   SLOW_UNALIGNED_ACCESS macro if the target defines it; otherwise fall back
+   to STRICT_ALIGNMENT.  */
+bool
+default_slow_unaligned_access (enum machine_mode mode ATTRIBUTE_UNUSED,
+			       unsigned int align ATTRIBUTE_UNUSED)
+{
+#ifdef SLOW_UNALIGNED_ACCESS
+  return SLOW_UNALIGNED_ACCESS (mode, align);
+#else
+  return STRICT_ALIGNMENT;
+#endif
+}
+
+/* Default version of the promote_rtx_for_memset target hook.  Returns an
+   rtx of mode MODE containing the promoted value VAL, or NULL.  VAL is
+   supposed to represent one byte.  */
+rtx
+default_promote_rtx_for_memset (enum machine_mode mode ATTRIBUTE_UNUSED,
+				 rtx val ATTRIBUTE_UNUSED)
+{
+  return NULL_RTX;
+}
+
 #include "gt-targhooks.h"
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index 552407b..08511ab 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -177,3 +177,6 @@ extern enum machine_mode default_get_reg_raw_mode(int);
 
 extern void *default_get_pch_validity (size_t *);
 extern const char *default_pch_valid_p (const void *, size_t);
+extern bool default_slow_unaligned_access (enum machine_mode mode,
+					   unsigned int align);
+extern rtx default_promote_rtx_for_memset (enum machine_mode mode, rtx val);
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-1.c
new file mode 100644
index 0000000..c4d9fa3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-2.c
new file mode 100644
index 0000000..d25f297
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-3.c
new file mode 100644
index 0000000..0846e7c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-4.c
new file mode 100644
index 0000000..38140a1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-1.c
new file mode 100644
index 0000000..132b1e7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-2.c
new file mode 100644
index 0000000..4cfdc23
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-1.c
new file mode 100644
index 0000000..01c1324
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemcpy" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-2.c
new file mode 100644
index 0000000..fad066e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemcpy" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-1.c
new file mode 100644
index 0000000..1d1c9a8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-2.c
new file mode 100644
index 0000000..538fa73
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-3.c
new file mode 100644
index 0000000..7918557
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-4.c
new file mode 100644
index 0000000..8cdf50c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-1.c
new file mode 100644
index 0000000..ddebd95
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-2.c
new file mode 100644
index 0000000..b775354
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-3.c
new file mode 100644
index 0000000..5666b62
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-4.c
new file mode 100644
index 0000000..ed5d937
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-au-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-1.c
new file mode 100644
index 0000000..b2f3e41
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemcpy" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-au-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-2.c
new file mode 100644
index 0000000..4bc9412
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-au-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-3.c
new file mode 100644
index 0000000..b6f1479
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-1.c
new file mode 100644
index 0000000..15e0b12
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-2.c
new file mode 100644
index 0000000..a99c4ba
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-3.c
new file mode 100644
index 0000000..caa6199
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-4.c
new file mode 100644
index 0000000..40d7691
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-1.c
new file mode 100644
index 0000000..f543626
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-2.c
new file mode 100644
index 0000000..b858610
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-3.c
new file mode 100644
index 0000000..617471c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-4.c
new file mode 100644
index 0000000..eb4bf9b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-1.c
new file mode 100644
index 0000000..36223c7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-2.c
new file mode 100644
index 0000000..c05e509
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-3.c
new file mode 100644
index 0000000..08b7591
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-4.c
new file mode 100644
index 0000000..45bf2e9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-1.c
new file mode 100644
index 0000000..6416e97
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-2.c
new file mode 100644
index 0000000..481eb2e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-3.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-3.c
new file mode 100644
index 0000000..55934fd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-4.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-4.c
new file mode 100644
index 0000000..681d994
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-1.c
new file mode 100644
index 0000000..aca1224
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-2.c
new file mode 100644
index 0000000..dccdef3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-au-1.c b/gcc/testsuite/gcc.target/i386/memset-s3072-au-1.c
new file mode 100644
index 0000000..0a718ca
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemset" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-au-2.c b/gcc/testsuite/gcc.target/i386/memset-s3072-au-2.c
new file mode 100644
index 0000000..2e52789
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemset" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-1.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-1.c
new file mode 100644
index 0000000..e182d93
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-2.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-2.c
new file mode 100644
index 0000000..18c9b37
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-3.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-3.c
new file mode 100644
index 0000000..137a658
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-4.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-4.c
new file mode 100644
index 0000000..878acca
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-1.c
new file mode 100644
index 0000000..5c73cbd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-2.c
new file mode 100644
index 0000000..72bdd06e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-3.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-3.c
new file mode 100644
index 0000000..dc4c5aa
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-4.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-4.c
new file mode 100644
index 0000000..d14bce8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-1.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-1.c
new file mode 100644
index 0000000..b1ccc53
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-2.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-2.c
new file mode 100644
index 0000000..39eba30
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-3.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-3.c
new file mode 100644
index 0000000..472a12c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-4.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-4.c
new file mode 100644
index 0000000..bf6f9a1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-1.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-1.c
new file mode 100644
index 0000000..1c0c3d0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-10.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-10.c
new file mode 100644
index 0000000..1a73d2a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-10.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-11.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-11.c
new file mode 100644
index 0000000..4744f6d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-11.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-12.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-12.c
new file mode 100644
index 0000000..145ea52
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-12.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-2.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-2.c
new file mode 100644
index 0000000..93ff487
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-3.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-3.c
new file mode 100644
index 0000000..da01948
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-4.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-4.c
new file mode 100644
index 0000000..af707c9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-5.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-5.c
new file mode 100644
index 0000000..9e880da
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-5.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-6.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-6.c
new file mode 100644
index 0000000..02c5356
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-6.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-7.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-7.c
new file mode 100644
index 0000000..9230120
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-7.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-8.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-8.c
new file mode 100644
index 0000000..57a98fe
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-8.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-9.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-9.c
new file mode 100644
index 0000000..eee218f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-9.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-1.c
new file mode 100644
index 0000000..93649e6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-2.c
new file mode 100644
index 0000000..5078782
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-3.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-3.c
new file mode 100644
index 0000000..cdadae8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-4.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-4.c
new file mode 100644
index 0000000..25a9d20
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-1.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-1.c
new file mode 100644
index 0000000..c506844
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-2.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-2.c
new file mode 100644
index 0000000..f7cf5bf
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-3.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-3.c
new file mode 100644
index 0000000..0b1930e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-4.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-4.c
new file mode 100644
index 0000000..ef013b0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-1.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-1.c
new file mode 100644
index 0000000..d1331b1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-2.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-2.c
new file mode 100644
index 0000000..4f3e7b7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m32 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-3.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-3.c
new file mode 100644
index 0000000..ccbe129
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-4.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-4.c
new file mode 100644
index 0000000..3a45c4f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-5.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-5.c
new file mode 100644
index 0000000..1737703
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-5.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-6.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-6.c
new file mode 100644
index 0000000..6098a60
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-6.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m32 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-7.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-7.c
new file mode 100644
index 0000000..bfa44c7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-7.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-8.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-8.c
new file mode 100644
index 0000000..2f2cd5a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-8.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=corei7 -mtune=corei7 -m64 -mstringop-strategy=unrolled_loop -dp" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
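
(For reference: the tests above only check the generated assembly via
scan-assembler.  A runtime companion along the following lines could be used
to verify correctness of the expanded memset as well; this sketch is not part
of the posted patch, and its file name and options are hypothetical.)

/* Hypothetical runtime check, e.g. memset-s64-a1-run.c.  */
/* { dg-do run } */
/* { dg-options "-O2" } */
extern void *memset (void *, int, __SIZE_TYPE__);
extern void abort (void);
#define SIZE	64
#define OFFSET	1
char dst[SIZE + OFFSET];

int
main ()
{
  int i;
  memset (dst + OFFSET, 5, sizeof (dst[0]) * SIZE);
  for (i = 0; i < SIZE; i++)
    if (dst[i + OFFSET] != 5)
      abort ();
  return 0;
}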

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-09-29 12:09                   ` Michael Zolotukhin
@ 2011-09-29 12:12                     ` Michael Zolotukhin
  2011-09-29 12:23                       ` Michael Zolotukhin
  2011-09-29 13:02                     ` Jakub Jelinek
  1 sibling, 1 reply; 52+ messages in thread
From: Michael Zolotukhin @ 2011-09-29 12:12 UTC (permalink / raw)
  To: Jack Howarth
  Cc: gcc-patches, Jan Hubicka, Richard Guenther, H.J. Lu, izamyatin,
	areg.melikadamyan

[-- Attachment #1: Type: text/plain, Size: 4192 bytes --]

Here is a fixed version of the middle-end patch.

On 29 September 2011 15:14, Michael Zolotukhin
<michael.v.zolotukhin@gmail.com> wrote:
>>> Michael,
>>>    Did you bootstrap with --enable-checking=yes? I am seeing the bootstrap
>>> failure...
>> I checked bootstrap, specs and 'make check' with the complete patch.
>> Separate patches for ME and BE were only tested for build (no
>> bootstrap) and 'make check'. I think it's better to apply the complete
>> patch, but review the separate patches (to make it easier).
>
> I rechecked bootstrap, and it failed.. Seemingly, something went wrong
> when I updated my branches, but I've already fixed it.
>
> Here is fixed version of complete patch.
>
> On 29 September 2011 09:39, Michael Zolotukhin
> <michael.v.zolotukhin@gmail.com> wrote:
>>> Michael,
>>>    Did you bootstrap with --enable-checking=yes? I am seeing the bootstrap
>>> failure...
>> I checked bootstrap, specs and 'make check' with the complete patch.
>> Separate patches for ME and BE were only tested for build (no
>> bootstrap) and 'make check'. I think it's better to apply the complete
>> patch, but review the separate patches (to make it easier).
>>
>>> ps There also seems to be common sections in the memfunc-mid.patch and memfunc-be.patch patches.
>> That's true, some new routines from middle-end are used in back-end
>> changes - I couldn't separate the patches in other way without
>> significant changes in them.
>>
>>
>> On 29 September 2011 01:51, Jack Howarth <howarth@bromo.med.uc.edu> wrote:
>>> On Wed, Sep 28, 2011 at 05:33:23PM +0400, Michael Zolotukhin wrote:
>>>> >   It appears that part 1 of the patch wasn't really attached.
>>>> Thanks, resending.
>>>
>>> Michael,
>>>    Did you bootstrap with --enable-checking=yes? I am seeing the bootstrap
>>> failure...
>>>
>>> /sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/./prev-gcc/g++ -B/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/./prev-gcc/ -B/sw/lib/gcc4.7/x86_64-apple-darwin11.2.0/bin/ -nostdinc++ -B/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/src/.libs -B/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/libsupc++/.libs -I/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/include/x86_64-apple-darwin11.2.0 -I/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/include -I/sw/src/fink.build/gcc47-4.7.0-1/gcc-4.7-20110927/libstdc++-v3/libsupc++ -L/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/src/.libs -L/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/libsupc++/.libs -c   -g -O2 -mdynamic-no-pic -gtoggle -DIN_GCC   -W -Wall -Wwrite-strings -Wcast-qual -Wmissing-format-attribute -pedantic -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings -Werror -fno-common  -DHAVE_CONFIG_H -I. -I. -I../../gcc-4.7-20110927/gcc -I../../gcc-4.7-20110927/gcc/. -I../../gcc-4.7-20110927/gcc/../include -I../../gcc-4.7-20110927/gcc/../libcpp/include -I/sw/include -I/sw/include  -I../../gcc-4.7-20110927/gcc/../libdecnumber -I../../gcc-4.7-20110927/gcc/../libdecnumber/dpd -I../libdecnumber -I/sw/include  -I/sw/include -DCLOOG_INT_GMP -DCLOOG_ORG -I/sw/include ../../gcc-4.7-20110927/gcc/emit-rtl.c -o emit-rtl.o
>>> ../../gcc-4.7-20110927/gcc/emit-rtl.c: In function ‘rtx_def* adjust_address_1(rtx, machine_mode, long int, int, int)’:
>>> ../../gcc-4.7-20110927/gcc/emit-rtl.c:2060:26: error: unused variable ‘max_align’ [-Werror=unused-variable]
>>> cc1plus: all warnings being treated as errors
>>>
>>> on x86_64-apple-darwin11 with your patches.
>>>          Jack
>>> ps There also seems to be common sections in the memfunc-mid.patch and memfunc-be.patch patches.
>>>
>>
>>
>>
>> --
>> ---
>> Best regards,
>> Michael V. Zolotukhin,
>> Software Engineer
>> Intel Corporation.
>>
>
>
>
> --
> ---
> Best regards,
> Michael V. Zolotukhin,
> Software Engineer
> Intel Corporation.
>



-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.

[-- Attachment #2: memfunc-mid-2.patch --]
[-- Type: application/octet-stream, Size: 41042 bytes --]

diff --git a/gcc/builtins.c b/gcc/builtins.c
index b79ce6f..5c95577 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -3568,7 +3568,8 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
 				  builtin_memset_read_str, &c, dest_align,
 				  true))
 	store_by_pieces (dest_mem, tree_low_cst (len, 1),
-			 builtin_memset_read_str, &c, dest_align, true, 0);
+			 builtin_memset_read_str, gen_int_mode (c, val_mode),
+			 dest_align, true, 0);
       else if (!set_storage_via_setmem (dest_mem, len_rtx,
 					gen_int_mode (c, val_mode),
 					dest_align, expected_align,
diff --git a/gcc/cse.c b/gcc/cse.c
index ae67685..3b6471d 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -4616,7 +4616,10 @@ cse_insn (rtx insn)
 		 to fold switch statements when an ADDR_DIFF_VEC is used.  */
 	      || (GET_CODE (src_folded) == MINUS
 		  && GET_CODE (XEXP (src_folded, 0)) == LABEL_REF
-		  && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF)))
+		  && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF))
+	      /* Don't propagate vector-constants, as for now no architecture
+		 supports vector immediates.  */
+	  && !vector_extensions_used_for_mode (mode))
 	src_const = src_folded, src_const_elt = elt;
       else if (src_const == 0 && src_eqv_here && CONSTANT_P (src_eqv_here))
 	src_const = src_eqv_here, src_const_elt = src_eqv_elt;
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 335c1d1..479d534 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5780,6 +5780,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum machine_mode @var{mode}, unsigned int @var{align})
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If current compilation options require building faster code, the hook can
+be used to prevent access to unaligned data in some set of modes even if
+the processor can do the access without a trap.
+
+By default the hook returns the value of @code{SLOW_UNALIGNED_ACCESS} if
+it is defined and @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
+@deftypefn {Target Hook} rtx TARGET_PROMOTE_RTX_FOR_MEMSET (enum machine_mode @var{mode}, rtx @var{val})
+This hook returns an rtx of mode @var{mode} with the promoted value @var{val},
+or NULL.  The hook generates the instructions that are needed to promote
+@var{val} to mode @var{mode}.
+If generation of the promotion instructions fails, the hook returns NULL.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6252,23 +6278,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 6783826..9073d9e 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -5718,6 +5718,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@hook TARGET_SLOW_UNALIGNED_ACCESS
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If current compilation options require building faster code, the hook can
+be used to prevent access to unaligned data in some set of modes even if
+the processor can do the access without a trap.
+
+By default the hook returns the value of @code{SLOW_UNALIGNED_ACCESS} if
+it is defined and @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
+@hook TARGET_PROMOTE_RTX_FOR_MEMSET
+This hook returns an rtx of mode @var{mode} with the promoted value @var{val},
+or NULL.  The hook generates the instructions that are needed to promote
+@var{val} to mode @var{mode}.
+If generation of the promotion instructions fails, the hook returns NULL.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6190,23 +6216,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c
index ee38d3c..dc7d052 100644
--- a/gcc/emit-rtl.c
+++ b/gcc/emit-rtl.c
@@ -1494,6 +1494,11 @@ get_mem_align_offset (rtx mem, unsigned int align)
       if (TYPE_ALIGN (TREE_TYPE (expr)) < (unsigned int) align)
 	return -1;
     }
+  else if (TREE_CODE (expr) == MEM_REF)
+    {
+      if (MEM_ALIGN (mem) < (unsigned int) align)
+	return -1;
+    }
   else if (TREE_CODE (expr) == COMPONENT_REF)
     {
       while (1)
@@ -2051,7 +2056,6 @@ adjust_address_1 (rtx memref, enum machine_mode mode, HOST_WIDE_INT offset,
   enum machine_mode address_mode;
   int pbits;
   struct mem_attrs attrs, *defattrs;
-  unsigned HOST_WIDE_INT max_align;
 
   attrs = *get_mem_attrs (memref);
 
@@ -2108,8 +2112,12 @@ adjust_address_1 (rtx memref, enum machine_mode mode, HOST_WIDE_INT offset,
      if zero.  */
   if (offset != 0)
     {
-      max_align = (offset & -offset) * BITS_PER_UNIT;
-      attrs.align = MIN (attrs.align, max_align);
+      int old_offset = get_mem_align_offset (memref, MOVE_MAX*BITS_PER_UNIT);
+      if (old_offset >= 0)
+	attrs.align = compute_align_by_offset (old_offset + offset);
+      else
+	attrs.align = MIN (attrs.align,
+	      (unsigned HOST_WIDE_INT) (offset & -offset) * BITS_PER_UNIT);
     }
 
   /* We can compute the size in a number of ways.  */
diff --git a/gcc/expr.c b/gcc/expr.c
index 29bf68b..8f87944 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -126,15 +126,18 @@ struct store_by_pieces_d
 static unsigned HOST_WIDE_INT move_by_pieces_ninsns (unsigned HOST_WIDE_INT,
 						     unsigned int,
 						     unsigned int);
-static void move_by_pieces_1 (rtx (*) (rtx, ...), enum machine_mode,
-			      struct move_by_pieces_d *);
+static void move_by_pieces_insn (rtx (*) (rtx, ...), enum machine_mode,
+		  struct move_by_pieces_d *);
 static bool block_move_libcall_safe_for_call_parm (void);
 static bool emit_block_move_via_movmem (rtx, rtx, rtx, unsigned, unsigned, HOST_WIDE_INT);
 static tree emit_block_move_libcall_fn (int);
 static void emit_block_move_via_loop (rtx, rtx, rtx, unsigned);
 static rtx clear_by_pieces_1 (void *, HOST_WIDE_INT, enum machine_mode);
 static void clear_by_pieces (rtx, unsigned HOST_WIDE_INT, unsigned int);
+static void set_by_pieces_1 (struct store_by_pieces_d *, unsigned int);
 static void store_by_pieces_1 (struct store_by_pieces_d *, unsigned int);
+static void set_by_pieces_2 (rtx (*) (rtx, ...), enum machine_mode,
+			       struct store_by_pieces_d *, rtx);
 static void store_by_pieces_2 (rtx (*) (rtx, ...), enum machine_mode,
 			       struct store_by_pieces_d *);
 static tree clear_storage_libcall_fn (int);
@@ -163,6 +166,12 @@ static void do_tablejump (rtx, enum machine_mode, rtx, rtx, rtx);
 static rtx const_vector_from_tree (tree);
 static void write_complex_part (rtx, rtx, bool);
 
+static enum machine_mode widest_mode_for_unaligned_mov (unsigned HOST_WIDE_INT);
+static enum machine_mode widest_mode_for_aligned_mov (unsigned HOST_WIDE_INT,
+						      unsigned int);
+static enum machine_mode generate_move_with_mode (struct store_by_pieces_d *,
+					   enum machine_mode, rtx *, rtx *);
+
 /* This macro is used to determine whether move_by_pieces should be called
    to perform a structure copy.  */
 #ifndef MOVE_BY_PIECES_P
@@ -811,7 +820,7 @@ alignment_for_piecewise_move (unsigned int max_pieces, unsigned int align)
 	   tmode != VOIDmode;
 	   xmode = tmode, tmode = GET_MODE_WIDER_MODE (tmode))
 	if (GET_MODE_SIZE (tmode) > max_pieces
-	    || SLOW_UNALIGNED_ACCESS (tmode, align))
+	    || targetm.slow_unaligned_access (tmode, align))
 	  break;
 
       align = MAX (align, GET_MODE_ALIGNMENT (xmode));
@@ -820,11 +829,66 @@ alignment_for_piecewise_move (unsigned int max_pieces, unsigned int align)
   return align;
 }
 
+/* Given an offset from an alignment boundary,
+   compute the maximal alignment of the data at that offset.  */
+unsigned int
+compute_align_by_offset (int offset)
+{
+    return (offset==0) ?
+	    MOVE_MAX * BITS_PER_UNIT :
+	    MIN (MOVE_MAX, (offset & -offset)) * BITS_PER_UNIT;
+}
+
+/* Estimate the cost of a move for the given size and offset.  The offset is
+   used to determine the maximum alignment.  */
+static int
+compute_aligned_cost (unsigned HOST_WIDE_INT size, int offset)
+{
+  unsigned HOST_WIDE_INT cost = 0;
+  int cur_off = offset;
+
+  while (size > 0)
+    {
+      enum machine_mode mode = widest_mode_for_aligned_mov (size,
+	  compute_align_by_offset (cur_off));
+      int cur_mode_cost;
+      enum vect_cost_for_stmt type_of_cost = vector_load;
+      if (GET_MODE_SIZE (mode) <= UNITS_PER_WORD
+	  && (SCALAR_INT_MODE_P (mode) || SCALAR_FLOAT_MODE_P (mode)))
+	type_of_cost = scalar_load;
+      cur_mode_cost =
+	targetm.vectorize.builtin_vectorization_cost (type_of_cost, NULL, 0);
+      size -= GET_MODE_SIZE (mode);
+      cur_off += GET_MODE_SIZE (mode);
+      cost += cur_mode_cost;
+    }
+  return cost;
+}
+
+/* Estimate the cost of a move for the given size.  It's assumed that the
+   alignment is unknown, so we need to use unaligned moves.  */
+static int
+compute_unaligned_cost (unsigned HOST_WIDE_INT size)
+{
+  unsigned HOST_WIDE_INT cost = 0;
+  while (size > 0)
+    {
+      enum machine_mode mode = widest_mode_for_unaligned_mov (size);
+      unsigned HOST_WIDE_INT n_insns = size/GET_MODE_SIZE (mode);
+      int cur_mode_cost =
+	targetm.vectorize.builtin_vectorization_cost (unaligned_load, NULL, 0);
+
+      cost += n_insns*cur_mode_cost;
+      size %= GET_MODE_SIZE (mode);
+    }
+  return cost;
+}
+
 /* Return the widest integer mode no wider than SIZE.  If no such mode
    can be found, return VOIDmode.  */
 
 static enum machine_mode
-widest_int_mode_for_size (unsigned int size)
+widest_int_mode_for_size (unsigned HOST_WIDE_INT size)
 {
   enum machine_mode tmode, mode = VOIDmode;
 
@@ -836,6 +900,170 @@ widest_int_mode_for_size (unsigned int size)
   return mode;
 }
 
+/* If MODE is a scalar mode, find the corresponding preferred vector mode.
+   If no such mode can be found, return the vector mode corresponding to Pmode
+   (a kind of default vector mode).
+   For vector modes return the mode itself.  */
+
+static enum machine_mode
+vector_mode_for_mode (enum machine_mode mode)
+{
+  enum machine_mode xmode;
+  if (VECTOR_MODE_P (mode))
+    return mode;
+  xmode = targetm.vectorize.preferred_simd_mode (mode);
+  if (VECTOR_MODE_P (xmode))
+    return xmode;
+
+  return targetm.vectorize.preferred_simd_mode (Pmode);
+}
+
+/* The routine checks whether vector instructions are required for operating
+   with the mode specified.
+   For vector modes it checks whether the corresponding vector extension is
+   supported.
+   Operations with a scalar mode will use vector extensions if this scalar
+   mode is wider than the default scalar mode (Pmode) and a vector extension
+   for the parent vector mode is available.  */
+
+bool vector_extensions_used_for_mode (enum machine_mode mode)
+{
+  enum machine_mode vector_mode = vector_mode_for_mode (mode);
+
+  if (VECTOR_MODE_P (mode))
+    return targetm.vector_mode_supported_p (mode);
+
+  /* mode is a scalar mode.  */
+  if (VECTOR_MODE_P (vector_mode)
+     && targetm.vector_mode_supported_p (vector_mode)
+     && (GET_MODE_SIZE (mode) > GET_MODE_SIZE (Pmode)))
+    return true;
+
+  return false;
+}
+
+/* Find the widest move mode for the given size if alignment is unknown.  */
+static enum machine_mode
+widest_mode_for_unaligned_mov (unsigned HOST_WIDE_INT size)
+{
+  enum machine_mode mode;
+  enum machine_mode tmode, xmode;
+  enum machine_mode best_simd_mode = targetm.vectorize.preferred_simd_mode (
+      mode_for_size (UNITS_PER_WORD*BITS_PER_UNIT, MODE_INT, 0));
+
+  /* Find the widest integer mode.  Here we can find modes wider than Pmode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_INT), xmode = VOIDmode;
+       tmode != VOIDmode;
+       tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size
+	  || targetm.slow_unaligned_access (tmode, BITS_PER_UNIT))
+	break;
+      if (optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.scalar_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+  mode = xmode;
+
+  /* Find the widest vector mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_VECTOR_INT), xmode = VOIDmode;
+       tmode != VOIDmode;
+       tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size
+	  || targetm.slow_unaligned_access (tmode, BITS_PER_UNIT))
+	break;
+      if (GET_MODE_SIZE (GET_MODE_INNER (tmode)) == UNITS_PER_WORD
+	  && optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.vector_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+
+  /* Choose between integer and vector modes.  */
+  if (xmode != VOIDmode && GET_MODE_SIZE (xmode) > GET_MODE_SIZE (mode))
+    mode = xmode;
+
+  /* If found vector and scalar modes have the same sizes, and vector mode is
+     best_simd_mode, then prefer vector mode to scalar mode.  */
+  if (xmode != VOIDmode
+      && GET_MODE_SIZE (xmode) == GET_MODE_SIZE (mode)
+      && xmode == best_simd_mode)
+    mode = xmode;
+
+  /* If we failed to find a mode that might use vector extensions, try to
+     find widest ordinary integer mode.  */
+  if (mode == VOIDmode)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  /* If the mode found won't use vector extensions, then there is no need to
+     use a mode wider than Pmode.  */
+  if (!vector_extensions_used_for_mode (mode)
+      && GET_MODE_SIZE (mode) > MOVE_MAX_PIECES)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  return mode;
+}
+
+/* Find the widest move mode for the given size and alignment.  */
+static enum machine_mode
+widest_mode_for_aligned_mov (unsigned HOST_WIDE_INT size, unsigned int align)
+{
+  enum machine_mode mode;
+  enum machine_mode tmode, xmode;
+  enum machine_mode best_simd_mode = targetm.vectorize.preferred_simd_mode (
+      mode_for_size (UNITS_PER_WORD*BITS_PER_UNIT, MODE_INT, 0));
+
+  /* Find the widest integer mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_INT), xmode = VOIDmode;
+      tmode != VOIDmode;
+      tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size || GET_MODE_ALIGNMENT (tmode) > align)
+	break;
+      if (optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.scalar_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+  mode = xmode;
+
+  /* Find the widest vector mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_VECTOR_INT), xmode = VOIDmode;
+      tmode != VOIDmode;
+      tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size || GET_MODE_ALIGNMENT (tmode) > align)
+	break;
+      if (GET_MODE_SIZE (GET_MODE_INNER (tmode)) == UNITS_PER_WORD &&
+	  optab_handler (mov_optab, tmode) != CODE_FOR_nothing     &&
+	  targetm.vector_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+
+  /* Choose between integer and vector modes.  */
+  if (xmode != VOIDmode && GET_MODE_SIZE (xmode) > GET_MODE_SIZE (mode))
+    mode = xmode;
+
+  /* If found vector and scalar modes have the same sizes, and vector mode is
+     best_simd_mode, then prefer vector mode to scalar mode.  */
+  if (xmode != VOIDmode
+      && GET_MODE_SIZE (xmode) == GET_MODE_SIZE (mode)
+      && xmode == best_simd_mode)
+    mode = xmode;
+
+  /* If we failed to find a mode that might use vector extensions, try to
+     find widest ordinary integer mode.  */
+  if (mode == VOIDmode)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  /* If the mode found won't use vector extensions, then there is no need to
+     use a mode wider than Pmode.  */
+  if (!vector_extensions_used_for_mode (mode)
+      && GET_MODE_SIZE (mode) > MOVE_MAX_PIECES)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  return mode;
+}
+
 /* STORE_MAX_PIECES is the number of bytes at a time that we can
    store efficiently.  Due to internal GCC limitations, this is
    MOVE_MAX_PIECES limited by the number of bytes GCC can represent
@@ -876,6 +1104,7 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
   rtx to_addr, from_addr = XEXP (from, 0);
   unsigned int max_size = MOVE_MAX_PIECES + 1;
   enum insn_code icode;
+  int dst_offset, src_offset;
 
   align = MIN (to ? MEM_ALIGN (to) : align, MEM_ALIGN (from));
 
@@ -960,23 +1189,37 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
 	data.to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
     }
 
-  align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
-
-  /* First move what we can in the largest integer mode, then go to
-     successively smaller modes.  */
-
-  while (max_size > 1)
+  src_offset = get_mem_align_offset (from, MOVE_MAX*BITS_PER_UNIT);
+  dst_offset = get_mem_align_offset (to, MOVE_MAX*BITS_PER_UNIT);
+  if (src_offset < 0
+      || dst_offset < 0
+      || src_offset != dst_offset
+      || compute_aligned_cost (data.len, src_offset) >=
+	 compute_unaligned_cost (data.len))
     {
-      enum machine_mode mode = widest_int_mode_for_size (max_size);
-
-      if (mode == VOIDmode)
-	break;
+      while (data.len > 0)
+	{
+	  enum machine_mode mode = widest_mode_for_unaligned_mov (data.len);
 
-      icode = optab_handler (mov_optab, mode);
-      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
-	move_by_pieces_1 (GEN_FCN (icode), mode, &data);
+	  icode = optab_handler (mov_optab, mode);
+	  gcc_assert (icode != CODE_FOR_nothing);
+	  move_by_pieces_insn (GEN_FCN (icode), mode, &data);
+	}
+    }
+  else
+    {
+      while (data.len > 0)
+	{
+	  enum machine_mode mode;
+	  mode = widest_mode_for_aligned_mov (data.len,
+	      compute_align_by_offset (src_offset));
 
-      max_size = GET_MODE_SIZE (mode);
+	  icode = optab_handler (mov_optab, mode);
+	  gcc_assert (icode != CODE_FOR_nothing &&
+	      compute_align_by_offset (src_offset) >= GET_MODE_ALIGNMENT (mode));
+	  move_by_pieces_insn (GEN_FCN (icode), mode, &data);
+	  src_offset += GET_MODE_SIZE (mode);
+	}
     }
 
   /* The code above should have handled everything.  */
@@ -1014,35 +1257,47 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
 }
 
 /* Return number of insns required to move L bytes by pieces.
-   ALIGN (in bits) is maximum alignment we can assume.  */
+   ALIGN (in bits) is maximum alignment we can assume.
+   This is just an estimate, so the actual number of instructions might
+   differ from it (there are several ways of expanding memmove).  */
 
 static unsigned HOST_WIDE_INT
 move_by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align,
-		       unsigned int max_size)
+		       unsigned int max_size ATTRIBUTE_UNUSED)
 {
   unsigned HOST_WIDE_INT n_insns = 0;
-
-  align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
-
-  while (max_size > 1)
+  unsigned HOST_WIDE_INT n_insns_u = 0;
+  enum machine_mode mode;
+  unsigned HOST_WIDE_INT len = l;
+  while (len > 0)
     {
-      enum machine_mode mode;
-      enum insn_code icode;
-
-      mode = widest_int_mode_for_size (max_size);
-
-      if (mode == VOIDmode)
-	break;
+      mode = widest_mode_for_aligned_mov (len, align);
+      if (GET_MODE_SIZE (mode) < MOVE_MAX)
+	{
+	  align += GET_MODE_ALIGNMENT (mode);
+	  len -= GET_MODE_SIZE (mode);
+	  n_insns ++;
+	}
+      else
+	{
+	  /* We are using the widest mode.  */
+	  n_insns += len/GET_MODE_SIZE (mode);
+	  len = len%GET_MODE_SIZE (mode);
+	}
+    }
+  gcc_assert (!len);
 
-      icode = optab_handler (mov_optab, mode);
-      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
-	n_insns += l / GET_MODE_SIZE (mode), l %= GET_MODE_SIZE (mode);
+  len = l;
+  while (len > 0)
+    {
+      mode = widest_mode_for_unaligned_mov (len);
+      n_insns_u += len/GET_MODE_SIZE (mode);
+      len = len%GET_MODE_SIZE (mode);
 
-      max_size = GET_MODE_SIZE (mode);
     }
 
-  gcc_assert (!l);
-  return n_insns;
+  gcc_assert (!len);
+  return MIN (n_insns, n_insns_u);
 }
 
 /* Subroutine of move_by_pieces.  Move as many bytes as appropriate
@@ -1050,60 +1305,57 @@ move_by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align,
    to make a move insn for that mode.  DATA has all the other info.  */
 
 static void
-move_by_pieces_1 (rtx (*genfun) (rtx, ...), enum machine_mode mode,
+move_by_pieces_insn (rtx (*genfun) (rtx, ...), enum machine_mode mode,
 		  struct move_by_pieces_d *data)
 {
   unsigned int size = GET_MODE_SIZE (mode);
   rtx to1 = NULL_RTX, from1;
 
-  while (data->len >= size)
-    {
-      if (data->reverse)
-	data->offset -= size;
-
-      if (data->to)
-	{
-	  if (data->autinc_to)
-	    to1 = adjust_automodify_address (data->to, mode, data->to_addr,
-					     data->offset);
-	  else
-	    to1 = adjust_address (data->to, mode, data->offset);
-	}
+  if (data->reverse)
+    data->offset -= size;
 
-      if (data->autinc_from)
-	from1 = adjust_automodify_address (data->from, mode, data->from_addr,
-					   data->offset);
+  if (data->to)
+    {
+      if (data->autinc_to)
+	to1 = adjust_automodify_address (data->to, mode, data->to_addr,
+					 data->offset);
       else
-	from1 = adjust_address (data->from, mode, data->offset);
+	to1 = adjust_address (data->to, mode, data->offset);
+    }
 
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
-	emit_insn (gen_add2_insn (data->to_addr,
-				  GEN_INT (-(HOST_WIDE_INT)size)));
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_from < 0)
-	emit_insn (gen_add2_insn (data->from_addr,
-				  GEN_INT (-(HOST_WIDE_INT)size)));
+  if (data->autinc_from)
+    from1 = adjust_automodify_address (data->from, mode, data->from_addr,
+				       data->offset);
+  else
+    from1 = adjust_address (data->from, mode, data->offset);
 
-      if (data->to)
-	emit_insn ((*genfun) (to1, from1));
-      else
-	{
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
+    emit_insn (gen_add2_insn (data->to_addr,
+			      GEN_INT (-(HOST_WIDE_INT)size)));
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_from < 0)
+    emit_insn (gen_add2_insn (data->from_addr,
+			      GEN_INT (-(HOST_WIDE_INT)size)));
+
+  if (data->to)
+    emit_insn ((*genfun) (to1, from1));
+  else
+    {
 #ifdef PUSH_ROUNDING
-	  emit_single_push_insn (mode, from1, NULL);
+      emit_single_push_insn (mode, from1, NULL);
 #else
-	  gcc_unreachable ();
+      gcc_unreachable ();
 #endif
-	}
+    }
 
-      if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
-	emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
-      if (HAVE_POST_INCREMENT && data->explicit_inc_from > 0)
-	emit_insn (gen_add2_insn (data->from_addr, GEN_INT (size)));
+  if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
+    emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
+  if (HAVE_POST_INCREMENT && data->explicit_inc_from > 0)
+    emit_insn (gen_add2_insn (data->from_addr, GEN_INT (size)));
 
-      if (! data->reverse)
-	data->offset += size;
+  if (! data->reverse)
+    data->offset += size;
 
-      data->len -= size;
-    }
+  data->len -= size;
 }
 \f
 /* Emit code to move a block Y to a block X.  This may be done with
@@ -1680,7 +1932,7 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src, tree type, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (src)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (src))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (src))
 	      || MEM_ALIGN (src) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -2070,7 +2322,7 @@ emit_group_store (rtx orig_dst, rtx src, tree type ATTRIBUTE_UNUSED, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (dest)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (dest))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (dest))
 	      || MEM_ALIGN (dest) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -2358,7 +2610,10 @@ store_by_pieces (rtx to, unsigned HOST_WIDE_INT len,
   data.constfundata = constfundata;
   data.len = len;
   data.to = to;
-  store_by_pieces_1 (&data, align);
+  if (memsetp)
+    set_by_pieces_1 (&data, align);
+  else
+    store_by_pieces_1 (&data, align);
   if (endp)
     {
       rtx to1;
@@ -2402,10 +2657,10 @@ clear_by_pieces (rtx to, unsigned HOST_WIDE_INT len, unsigned int align)
     return;
 
   data.constfun = clear_by_pieces_1;
-  data.constfundata = NULL;
+  data.constfundata = CONST0_RTX (QImode);
   data.len = len;
   data.to = to;
-  store_by_pieces_1 (&data, align);
+  set_by_pieces_1 (&data, align);
 }
 
 /* Callback routine for clear_by_pieces.
@@ -2419,13 +2674,121 @@ clear_by_pieces_1 (void *data ATTRIBUTE_UNUSED,
   return const0_rtx;
 }
 
-/* Subroutine of clear_by_pieces and store_by_pieces.
+/* Helper function for set by pieces - generates a move with the given mode.
+   Returns the mode used in the generated move (it could differ from the
+   requested one if the requested mode isn't supported).  */
+static enum machine_mode generate_move_with_mode (
+			      struct store_by_pieces_d *data,
+			      enum machine_mode mode,
+			      rtx *promoted_to_vector_value_ptr,
+			      rtx *promoted_value_ptr)
+{
+  enum insn_code icode;
+  rtx rhs = NULL_RTX;
+
+  gcc_assert (promoted_to_vector_value_ptr && promoted_value_ptr);
+
+  if (vector_extensions_used_for_mode (mode))
+    {
+      enum machine_mode vec_mode = vector_mode_for_mode (mode);
+      if (!(*promoted_to_vector_value_ptr))
+	*promoted_to_vector_value_ptr
+	  = targetm.promote_rtx_for_memset (vec_mode, (rtx)data->constfundata);
+      if (*promoted_to_vector_value_ptr)
+	rhs = convert_to_mode (vec_mode, *promoted_to_vector_value_ptr, 1);
+    }
+  else
+    {
+      if (CONST_INT_P ((rtx)data->constfundata))
+	{
+	  /* We don't need to load the constant to a register, if it could be
+	     encoded as an immediate operand.  */
+	  rtx imm_const;
+	  switch (mode)
+	    {
+	    case DImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x0101010101010101, DImode);
+	      break;
+	    case SImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x01010101, SImode);
+	      break;
+	    case HImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x00000101, HImode);
+	      break;
+	    case QImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x00000001, QImode);
+	      break;
+	    default:
+	      gcc_unreachable ();
+	      break;
+	    }
+	  rhs = imm_const;
+	}
+      else /* data->constfundata isn't const.  */
+	{
+	  if (!(*promoted_value_ptr))
+	    {
+	      rtx coeff;
+	      enum machine_mode promoted_value_mode;
+	      /* Choose mode for promoted value.  It shouldn't be narrower, than
+		 Pmode.  */
+	      if (GET_MODE_SIZE (mode) > GET_MODE_SIZE (Pmode))
+		promoted_value_mode = mode;
+	      else
+		promoted_value_mode = Pmode;
+
+	      switch (promoted_value_mode)
+		{
+		case DImode:
+		  coeff = gen_int_mode (0x0101010101010101, DImode);
+		  break;
+		case SImode:
+		  coeff = gen_int_mode (0x01010101, SImode);
+		  break;
+		default:
+		  gcc_unreachable ();
+		  break;
+		}
+	      *promoted_value_ptr = convert_to_mode (promoted_value_mode,
+						     (rtx)data->constfundata,
+						     1);
+	      *promoted_value_ptr = expand_mult (promoted_value_mode,
+						 *promoted_value_ptr, coeff,
+						 NULL_RTX, 1);
+	    }
+	  rhs = convert_to_mode (mode, *promoted_value_ptr, 1);
+	}
+    }
+  /* If RHS is null, then the requested mode isn't supported and can't be used.
+     Use Pmode instead.  */
+  if (!rhs)
+    {
+      generate_move_with_mode (data, Pmode, promoted_to_vector_value_ptr,
+			       promoted_value_ptr);
+      return Pmode;
+    }
+
+  gcc_assert (rhs);
+  icode = optab_handler (mov_optab, mode);
+  gcc_assert (icode != CODE_FOR_nothing);
+  set_by_pieces_2 (GEN_FCN (icode), mode, data, rhs);
+  return mode;
+}
+
+/* Subroutine of store_by_pieces.
    Generate several move instructions to store LEN bytes of block TO.  (A MEM
    rtx with BLKmode).  ALIGN is maximum alignment we can assume.  */
 
 static void
-store_by_pieces_1 (struct store_by_pieces_d *data ATTRIBUTE_UNUSED,
-		   unsigned int align ATTRIBUTE_UNUSED)
+store_by_pieces_1 (struct store_by_pieces_d *data, unsigned int align)
 {
   enum machine_mode to_addr_mode
     = targetm.addr_space.address_mode (MEM_ADDR_SPACE (data->to));
@@ -2500,6 +2863,134 @@ store_by_pieces_1 (struct store_by_pieces_d *data ATTRIBUTE_UNUSED,
   gcc_assert (!data->len);
 }
 
+/* Subroutine of clear_by_pieces and store_by_pieces.
+   Generate several move instructions to store LEN bytes of block TO.  (A MEM
+   rtx with BLKmode).  ALIGN is maximum alignment we can assume.
+   As opposed to store_by_pieces_1, this routine always generates code for
+   memset.  (store_by_pieces_1 is sometimes used to generate code for memcpy
+   rather than for memset).  */
+
+static void
+set_by_pieces_1 (struct store_by_pieces_d *data, unsigned int align)
+{
+  enum machine_mode to_addr_mode
+    = targetm.addr_space.address_mode (MEM_ADDR_SPACE (data->to));
+  rtx to_addr = XEXP (data->to, 0);
+  unsigned int max_size = STORE_MAX_PIECES + 1;
+  int dst_offset;
+  rtx promoted_to_vector_value = NULL_RTX;
+  rtx promoted_value = NULL_RTX;
+
+  data->offset = 0;
+  data->to_addr = to_addr;
+  data->autinc_to
+    = (GET_CODE (to_addr) == PRE_INC || GET_CODE (to_addr) == PRE_DEC
+       || GET_CODE (to_addr) == POST_INC || GET_CODE (to_addr) == POST_DEC);
+
+  data->explicit_inc_to = 0;
+  data->reverse
+    = (GET_CODE (to_addr) == PRE_DEC || GET_CODE (to_addr) == POST_DEC);
+  if (data->reverse)
+    data->offset = data->len;
+
+  /* If storing requires more than two move insns,
+     copy addresses to registers (to make displacements shorter)
+     and use post-increment if available.  */
+  if (!data->autinc_to
+      && move_by_pieces_ninsns (data->len, align, max_size) > 2)
+    {
+      /* Determine the main mode we'll be using.
+	 MODE might not be used depending on the definitions of the
+	 USE_* macros below.  */
+      enum machine_mode mode ATTRIBUTE_UNUSED
+	= widest_int_mode_for_size (max_size);
+
+      if (USE_STORE_PRE_DECREMENT (mode) && data->reverse && ! data->autinc_to)
+	{
+	  data->to_addr = copy_to_mode_reg (to_addr_mode,
+					    plus_constant (to_addr, data->len));
+	  data->autinc_to = 1;
+	  data->explicit_inc_to = -1;
+	}
+
+      if (USE_STORE_POST_INCREMENT (mode) && ! data->reverse
+	  && ! data->autinc_to)
+	{
+	  data->to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
+	  data->autinc_to = 1;
+	  data->explicit_inc_to = 1;
+	}
+
+      if ( !data->autinc_to && CONSTANT_P (to_addr))
+	data->to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
+    }
+
+  dst_offset = get_mem_align_offset (data->to, MOVE_MAX*BITS_PER_UNIT);
+  if (dst_offset < 0
+      || compute_aligned_cost (data->len, dst_offset) >=
+	 compute_unaligned_cost (data->len))
+    {
+      while (data->len > 0)
+	{
+	  enum machine_mode mode = widest_mode_for_unaligned_mov (data->len);
+	  generate_move_with_mode (data, mode, &promoted_to_vector_value,
+				   &promoted_value);
+	}
+    }
+  else
+    {
+      while (data->len > 0)
+	{
+	  enum machine_mode mode;
+	  mode = widest_mode_for_aligned_mov (data->len,
+	      compute_align_by_offset (dst_offset));
+	  mode = generate_move_with_mode (data, mode, &promoted_to_vector_value,
+				   &promoted_value);
+	  dst_offset += GET_MODE_SIZE (mode);
+	}
+    }
+
+  /* The code above should have handled everything.  */
+  gcc_assert (!data->len);
+}
+
+/* Subroutine of set_by_pieces_1.  Emit move instruction with mode MODE.
+   DATA has info about destination, RHS is source, GENFUN is the gen_...
+   function to make a move insn for that mode.  */
+
+static void
+set_by_pieces_2 (rtx (*genfun) (rtx, ...), enum machine_mode mode,
+		   struct store_by_pieces_d *data, rtx rhs)
+{
+  unsigned int size = GET_MODE_SIZE (mode);
+  rtx to1;
+
+  if (data->reverse)
+    data->offset -= size;
+
+  if (data->autinc_to)
+    to1 = adjust_automodify_address (data->to, mode, data->to_addr,
+	data->offset);
+  else
+    to1 = adjust_address (data->to, mode, data->offset);
+
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
+    emit_insn (gen_add2_insn (data->to_addr,
+	  GEN_INT (-(HOST_WIDE_INT) size)));
+
+  gcc_assert (rhs);
+
+  emit_insn ((*genfun) (to1, rhs));
+
+  if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
+    emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
+
+  if (! data->reverse)
+    data->offset += size;
+
+  data->len -= size;
+}
+
 /* Subroutine of store_by_pieces_1.  Store as many bytes as appropriate
    with move instructions for mode MODE.  GENFUN is the gen_... function
    to make a move insn for that mode.  DATA has all the other info.  */
@@ -3928,7 +4419,7 @@ emit_push_insn (rtx x, enum machine_mode mode, tree type, rtx size,
 	  /* Here we avoid the case of a structure whose weak alignment
 	     forces many pushes of a small amount of data,
 	     and such small pushes do rounding that causes trouble.  */
-	  && ((! SLOW_UNALIGNED_ACCESS (word_mode, align))
+	  && ((! targetm.slow_unaligned_access (word_mode, align))
 	      || align >= BIGGEST_ALIGNMENT
 	      || (PUSH_ROUNDING (align / BITS_PER_UNIT)
 		  == (align / BITS_PER_UNIT)))
@@ -6214,7 +6705,7 @@ store_field (rtx target, HOST_WIDE_INT bitsize, HOST_WIDE_INT bitpos,
       || (mode != BLKmode
 	  && ((((MEM_ALIGN (target) < GET_MODE_ALIGNMENT (mode))
 		|| bitpos % GET_MODE_ALIGNMENT (mode))
-	       && SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (target)))
+	       && targetm.slow_unaligned_access (mode, MEM_ALIGN (target)))
 	      || (bitpos % BITS_PER_UNIT != 0)))
       /* If the RHS and field are a constant size and the size of the
 	 RHS isn't the same size as the bitfield, we must use bitfield
@@ -9617,7 +10108,7 @@ expand_expr_real_1 (tree exp, rtx target, enum machine_mode tmode,
 		     && ((modifier == EXPAND_CONST_ADDRESS
 			  || modifier == EXPAND_INITIALIZER)
 			 ? STRICT_ALIGNMENT
-			 : SLOW_UNALIGNED_ACCESS (mode1, MEM_ALIGN (op0))))
+			 : targetm.slow_unaligned_access (mode1, MEM_ALIGN (op0))))
 		    || (bitpos % BITS_PER_UNIT != 0)))
 	    /* If the type and the field are a constant size and the
 	       size of the type isn't the same size as the bitfield,
diff --git a/gcc/expr.h b/gcc/expr.h
index 1652186..67541ab 100644
--- a/gcc/expr.h
+++ b/gcc/expr.h
@@ -704,4 +704,8 @@ extern tree build_libfunc_function (const char *);
 /* Get the personality libfunc for a function decl.  */
 rtx get_personality_function (tree);
 
+/* Given offset from maximum alignment boundary, compute maximum alignment,
+   that can be assumed.  */
+unsigned int compute_align_by_offset (int);
+
 #endif /* GCC_EXPR_H */
diff --git a/gcc/fwprop.c b/gcc/fwprop.c
index 236dda2..c6a8c3d 100644
--- a/gcc/fwprop.c
+++ b/gcc/fwprop.c
@@ -1270,6 +1270,10 @@ forward_propagate_and_simplify (df_ref use, rtx def_insn, rtx def_set)
       return false;
     }
 
+  /* Don't propagate vector-constants.  */
+  if (vector_extensions_used_for_mode (GET_MODE (reg)) && CONSTANT_P (src))
+      return false;
+
   if (asm_use >= 0)
     return forward_propagate_asm (use, def_insn, def_set, reg);
 
diff --git a/gcc/rtl.h b/gcc/rtl.h
index 860f6c4..c2e6920 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -2511,6 +2511,9 @@ extern void emit_jump (rtx);
 /* In expr.c */
 extern rtx move_by_pieces (rtx, rtx, unsigned HOST_WIDE_INT,
 			   unsigned int, int);
+/* Check if vector instructions are required for operating with mode
+   specified.  */
+bool vector_extensions_used_for_mode (enum machine_mode);
 extern HOST_WIDE_INT find_args_size_adjust (rtx);
 extern int fixup_args_size_notes (rtx, rtx, int);
 
diff --git a/gcc/target.def b/gcc/target.def
index 1e09ba7..082ed99 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1498,6 +1498,22 @@ DEFHOOK
  bool, (struct ao_ref_s *ref),
  default_ref_may_alias_errno)
 
+/* True if access to unaligned data in given mode is too slow or
+   prohibited.  */
+DEFHOOK
+(slow_unaligned_access,
+ "",
+ bool, (enum machine_mode mode, unsigned int align),
+ default_slow_unaligned_access)
+
+/* Target hook.  Returns rtx of mode MODE with promoted value VAL or NULL.
+   VAL is supposed to represent one byte.  */
+DEFHOOK
+(promote_rtx_for_memset,
+ "",
+ rtx, (enum machine_mode mode, rtx val),
+ default_promote_rtx_for_memset)
+
 /* Support for named address spaces.  */
 #undef HOOK_PREFIX
 #define HOOK_PREFIX "TARGET_ADDR_SPACE_"
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index 8ad517f..617d8a3 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -1457,4 +1457,24 @@ default_pch_valid_p (const void *data_p, size_t len)
   return NULL;
 }
 
+bool
+default_slow_unaligned_access (enum machine_mode mode ATTRIBUTE_UNUSED,
+			       unsigned int align ATTRIBUTE_UNUSED)
+{
+#ifdef SLOW_UNALIGNED_ACCESS
+  return SLOW_UNALIGNED_ACCESS (mode, align);
+#else
+  return STRICT_ALIGNMENT;
+#endif
+}
+
+/* Target hook.  Returns an rtx of mode MODE holding the promoted value VAL,
+   or NULL_RTX.  VAL is supposed to represent one byte.  */
+rtx
+default_promote_rtx_for_memset (enum machine_mode mode ATTRIBUTE_UNUSED,
+				 rtx val ATTRIBUTE_UNUSED)
+{
+  return NULL_RTX;
+}
+
 #include "gt-targhooks.h"
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index 552407b..08511ab 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -177,3 +177,6 @@ extern enum machine_mode default_get_reg_raw_mode(int);
 
 extern void *default_get_pch_validity (size_t *);
 extern const char *default_pch_valid_p (const void *, size_t);
+extern bool default_slow_unaligned_access (enum machine_mode mode,
+					   unsigned int align);
+extern rtx default_promote_rtx_for_memset (enum machine_mode mode, rtx val);

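For reference, the sketch below is purely illustrative (it is not part of the
patch) and shows how a caller in the middle end might consult the two new
hooks; `mem', `mode' and `val' are hypothetical placeholders, not names taken
from the patch:

  /* Illustrative sketch only.  MEM, MODE and VAL are hypothetical
     placeholders for a memory reference, its mode and a one-byte value.  */
  if (!targetm.slow_unaligned_access (mode, MEM_ALIGN (mem)))
    {
      /* Ask the target to broadcast the byte VAL into MODE; a NULL_RTX
         result means the target has no preferred way of doing so and the
         caller must fall back to the ordinary promotion path.  */
      rtx promoted = targetm.promote_rtx_for_memset (mode, val);
      if (promoted != NULL_RTX)
        emit_move_insn (adjust_address (mem, mode, 0), promoted);
    }
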
^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-09-29 12:12                     ` Michael Zolotukhin
@ 2011-09-29 12:23                       ` Michael Zolotukhin
  0 siblings, 0 replies; 52+ messages in thread
From: Michael Zolotukhin @ 2011-09-29 12:23 UTC (permalink / raw)
  To: Jack Howarth
  Cc: gcc-patches, Jan Hubicka, Richard Guenther, H.J. Lu, izamyatin,
	areg.melikadamyan

[-- Attachment #1: Type: text/plain, Size: 4516 bytes --]

And here is a fixed version of the back-end patch.

On 29 September 2011 15:15, Michael Zolotukhin
<michael.v.zolotukhin@gmail.com> wrote:
> Here is a fixed version of the middle-end patch.
>
> On 29 September 2011 15:14, Michael Zolotukhin
> <michael.v.zolotukhin@gmail.com> wrote:
>>>> Michael,
>>>>    Did you bootstrap with --enable-checking=yes? I am seeing the bootstrap
>>>> failure...
>>> I checked bootstrap, specs and 'make check' with the complete patch.
>>> Separate patches for ME and BE were only tested for build (no
>>> bootstrap) and 'make check'. I think it's better to apply the complete
>>> patch, but review the separate patches (to make it easier).
>>
>> I rechecked bootstrap, and it failed.  Seemingly, something went wrong
>> when I updated my branches, but I've already fixed it.
>>
>> Here is a fixed version of the complete patch.
>>
>> On 29 September 2011 09:39, Michael Zolotukhin
>> <michael.v.zolotukhin@gmail.com> wrote:
>>>> Michael,
>>>>    Did you bootstrap with --enable-checking=yes? I am seeing the bootstrap
>>>> failure...
>>> I checked bootstrap, specs and 'make check' with the complete patch.
>>> Separate patches for ME and BE were only tested for build (no
>>> bootstrap) and 'make check'. I think it's better to apply the complete
>>> patch, but review the separate patches (to make it easier).
>>>
>>>> ps There also seems to be common sections in the memfunc-mid.patch and memfunc-be.patch patches.
>>> That's true, some new routines from the middle-end are used in the
>>> back-end changes - I couldn't separate the patches in any other way
>>> without significant changes in them.
>>>
>>>
>>> On 29 September 2011 01:51, Jack Howarth <howarth@bromo.med.uc.edu> wrote:
>>>> On Wed, Sep 28, 2011 at 05:33:23PM +0400, Michael Zolotukhin wrote:
>>>>> >   It appears that part 1 of the patch wasn't really attached.
>>>>> Thanks, resending.
>>>>
>>>> Michael,
>>>>    Did you bootstrap with --enable-checking=yes? I am seeing the bootstrap
>>>> failure...
>>>>
>>>> /sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/./prev-gcc/g++ -B/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/./prev-gcc/ -B/sw/lib/gcc4.7/x86_64-apple-darwin11.2.0/bin/ -nostdinc++ -B/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/src/.libs -B/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/libsupc++/.libs -I/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/include/x86_64-apple-darwin11.2.0 -I/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/include -I/sw/src/fink.build/gcc47-4.7.0-1/gcc-4.7-20110927/libstdc++-v3/libsupc++ -L/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/src/.libs -L/sw/src/fink.build/gcc47-4.7.0-1/darwin_objdir/prev-x86_64-apple-darwin11.2.0/libstdc++-v3/libsupc++/.libs -c   -g -O2 -mdynamic-no-pic -gtoggle -DIN_GCC   -W -Wall -Wwrite-strings -Wcast-qual -Wmissing-format-attribute -pedantic -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings -Werror -fno-common  -DHAVE_CONFIG_H -I. -I. -I../../gcc-4.7-20110927/gcc -I../../gcc-4.7-20110927/gcc/. -I../../gcc-4.7-20110927/gcc/../include -I../../gcc-4.7-20110927/gcc/../libcpp/include -I/sw/include -I/sw/include  -I../../gcc-4.7-20110927/gcc/../libdecnumber -I../../gcc-4.7-20110927/gcc/../libdecnumber/dpd -I../libdecnumber -I/sw/include  -I/sw/include -DCLOOG_INT_GMP -DCLOOG_ORG -I/sw/include ../../gcc-4.7-20110927/gcc/emit-rtl.c -o emit-rtl.o
>>>> ../../gcc-4.7-20110927/gcc/emit-rtl.c: In function ‘rtx_def* adjust_address_1(rtx, machine_mode, long int, int, int)’:
>>>> ../../gcc-4.7-20110927/gcc/emit-rtl.c:2060:26: error: unused variable ‘max_align’ [-Werror=unused-variable]
>>>> cc1plus: all warnings being treated as errors
>>>>
>>>> on x86_64-apple-darwin11 with your patches.
>>>>          Jack
>>>> ps There also seems to be common sections in the memfunc-mid.patch and memfunc-be.patch patches.
>>>>
>>>
>>>
>>>
>>> --
>>> ---
>>> Best regards,
>>> Michael V. Zolotukhin,
>>> Software Engineer
>>> Intel Corporation.
>>>
>>
>>
>>
>> --
>> ---
>> Best regards,
>> Michael V. Zolotukhin,
>> Software Engineer
>> Intel Corporation.
>>
>
>
>
> --
> ---
> Best regards,
> Michael V. Zolotukhin,
> Software Engineer
> Intel Corporation.
>



-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.

[-- Attachment #2: memfunc-be-2.patch --]
[-- Type: application/octet-stream, Size: 76421 bytes --]

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index f952d2e..36ea2af 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
   COSTS_N_BYTES (2),			/* cost of FABS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FCHS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -632,10 +636,14 @@ struct processor_costs i386_cost = {	/* 386 specific costs */
   COSTS_N_INSNS (22),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (24),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (122),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -704,10 +712,14 @@ struct processor_costs i486_cost = {	/* 486 specific costs */
   COSTS_N_INSNS (3),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (3),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (83),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -774,10 +786,14 @@ struct processor_costs pentium_cost = {
   COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (70),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{-1, rep_prefix_4_byte}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -849,12 +865,18 @@ struct processor_costs pentiumpro_cost = {
      noticeable win, for bigger blocks either rep movsl or rep movsb is
      way to go.  Rep movsb has apparently more expensive startup time in CPU,
      but after 4K the difference is down in the noise.  */
-  {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
+  {{{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
 			{8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_4_byte, {{1024, unrolled_loop},
-  			{8192, rep_prefix_4_byte}, {-1, libcall}}},
+   {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_4_byte, {{1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_4_byte, {{1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -922,10 +944,14 @@ struct processor_costs geode_cost = {
   COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (54),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -995,10 +1021,14 @@ struct processor_costs k6_cost = {
   COSTS_N_INSNS (2),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (2),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (56),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1068,10 +1098,14 @@ struct processor_costs athlon_cost = {
   /* For some reason, Athlon deals better with REP prefix (relative to loops)
      compared to K8. Alignment becomes important after 8 bytes for memcpy and
      128 bytes for memset.  */
-  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1146,11 +1180,16 @@ struct processor_costs k8_cost = {
   /* K8 has optimized REP instruction for medium sized blocks, but for very
      small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1233,11 +1272,16 @@ struct processor_costs amdfam10_cost = {
   /* AMDFAM10 has optimized REP instruction for medium sized blocks, but for
      very small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1320,11 +1364,16 @@ struct processor_costs bdver1_cost = {
   /*  BDVER1 has optimized REP instruction for medium sized blocks, but for
       very small blocks it is better to use loop. For large blocks, libcall
       can do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   6,					/* scalar_stmt_cost.  */
   4,					/* scalar load_cost.  */
   4,					/* scalar_store_cost.  */
@@ -1407,11 +1456,16 @@ struct processor_costs bdver2_cost = {
   /*  BDVER2 has optimized REP instruction for medium sized blocks, but for
       very small blocks it is better to use loop. For large blocks, libcall
       can do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   6,					/* scalar_stmt_cost.  */
   4,					/* scalar load_cost.  */
   4,					/* scalar_store_cost.  */
@@ -1489,11 +1543,16 @@ struct processor_costs btver1_cost = {
   /* BTVER1 has optimized REP instruction for medium sized blocks, but for
      very small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1560,11 +1619,18 @@ struct processor_costs pentium4_cost = {
   COSTS_N_INSNS (2),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (2),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (43),			/* cost of FSQRT instruction.  */
-  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+
+  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+
+  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
    {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1631,13 +1697,22 @@ struct processor_costs nocona_cost = {
   COSTS_N_INSNS (3),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (3),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (44),			/* cost of FSQRT instruction.  */
-  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+
+  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
    {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
 	      {100000, unrolled_loop}, {-1, libcall}}}},
-  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
+	      {100000, unrolled_loop}, {-1, libcall}}}}},
+
+  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
    {-1, libcall}}},
    {libcall, {{24, loop}, {64, unrolled_loop},
 	      {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {-1, libcall}}},
+   {libcall, {{24, loop}, {64, unrolled_loop},
+	      {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1704,13 +1779,20 @@ struct processor_costs atom_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{11, loop}, {-1, rep_prefix_4_byte}}},
-   {libcall, {{32, loop}, {64, rep_prefix_4_byte},
-	  {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {15, unrolled_loop},
-	  {2048, rep_prefix_4_byte}, {-1, libcall}}},
-   {libcall, {{24, loop}, {32, unrolled_loop},
-	  {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+
+  /* stringop_algs for memcpy.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
+    {libcall, {{4096, unrolled_loop}, {-1, libcall}}}},
+   {{libcall, {{-1, libcall}}},			       /* Unknown alignment.  */
+    {libcall, {{-1, libcall}}}}},
+
+  /* stringop_algs for memset.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
+    {libcall, {{4096, unrolled_loop}, {-1, libcall}}}},
+   {{libcall, {{1024, unrolled_loop},		       /* Unknown alignment.  */
+	       {-1, libcall}}},
+    {libcall, {{2048, unrolled_loop},
+	       {-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1784,10 +1866,16 @@ struct processor_costs generic64_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {DUMMY_STRINGOP_ALGS,
-   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {DUMMY_STRINGOP_ALGS,
-   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+
+  {{DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}},
+   {DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}}},
+
+  {{DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}},
+   {DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1856,10 +1944,16 @@ struct processor_costs generic32_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+  /* stringop_algs for memcpy.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  /* stringop_algs for memset.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -2537,6 +2631,7 @@ static void ix86_set_current_function (tree);
 static unsigned int ix86_minimum_incoming_stack_boundary (bool);
 
 static enum calling_abi ix86_function_abi (const_tree);
+static rtx promote_duplicated_reg (enum machine_mode, rtx);
 
 \f
 #ifndef SUBTARGET32_DEFAULT_CPU
@@ -15190,6 +15285,28 @@ ix86_expand_move (enum machine_mode mode, rtx operands[])
     }
   else
     {
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE2
+	  && MEM_P (op0)
+	  && MEM_P (op1)
+	  && !push_operand (op0, mode)
+	  && can_create_pseudo_p ())
+	{
+	  rtx temp = gen_reg_rtx (V2DImode);
+	  emit_insn (gen_sse2_loadq (temp, op1));
+	  emit_insn (gen_sse_storeq (op0, temp));
+	  return;
+	}
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE
+	  && !MEM_P (op1)
+	  && GET_MODE (op1) == V2DImode)
+	{
+	  emit_insn (gen_sse_storeq (op0, op1));
+	  return;
+	}
       if (MEM_P (op0)
 	  && (PUSH_ROUNDING (GET_MODE_SIZE (mode)) != GET_MODE_SIZE (mode)
 	      || !push_operand (op0, mode))
@@ -20201,22 +20318,17 @@ counter_mode (rtx count_exp)
   return SImode;
 }
 
-/* When SRCPTR is non-NULL, output simple loop to move memory
-   pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
-   overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
-   equivalent loop to set memory by VALUE (supposed to be in MODE).
-
-   The size is rounded down to whole number of chunk size moved at once.
-   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
-
-
-static void
-expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
-			       rtx destptr, rtx srcptr, rtx value,
-			       rtx count, enum machine_mode mode, int unroll,
-			       int expected_size)
+/* Helper function for expand_set_or_movmem_via_loop.
+   This function can reuse the iter rtx from another loop and does not
+   generate code for updating the addresses.  */
+static rtx
+expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
+					 rtx destptr, rtx srcptr, rtx value,
+					 rtx count, rtx iter,
+					 enum machine_mode mode, int unroll,
+					 int expected_size, bool change_ptrs)
 {
-  rtx out_label, top_label, iter, tmp;
+  rtx out_label, top_label, tmp;
   enum machine_mode iter_mode = counter_mode (count);
   rtx piece_size = GEN_INT (GET_MODE_SIZE (mode) * unroll);
   rtx piece_size_mask = GEN_INT (~((GET_MODE_SIZE (mode) * unroll) - 1));
@@ -20224,10 +20336,12 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
   rtx x_addr;
   rtx y_addr;
   int i;
+  bool reuse_iter = (iter != NULL_RTX);
 
   top_label = gen_label_rtx ();
   out_label = gen_label_rtx ();
-  iter = gen_reg_rtx (iter_mode);
+  if (!reuse_iter)
+    iter = gen_reg_rtx (iter_mode);
 
   size = expand_simple_binop (iter_mode, AND, count, piece_size_mask,
 			      NULL, 1, OPTAB_DIRECT);
@@ -20238,7 +20352,8 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
 			       true, out_label);
       predict_jump (REG_BR_PROB_BASE * 10 / 100);
     }
-  emit_move_insn (iter, const0_rtx);
+  if (!reuse_iter)
+    emit_move_insn (iter, const0_rtx);
 
   emit_label (top_label);
 
@@ -20321,19 +20436,43 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
     }
   else
     predict_jump (REG_BR_PROB_BASE * 80 / 100);
-  iter = ix86_zero_extend_to_Pmode (iter);
-  tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
-			     true, OPTAB_LIB_WIDEN);
-  if (tmp != destptr)
-    emit_move_insn (destptr, tmp);
-  if (srcptr)
+  if (change_ptrs)
     {
-      tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
+      iter = ix86_zero_extend_to_Pmode (iter);
+      tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
 				 true, OPTAB_LIB_WIDEN);
-      if (tmp != srcptr)
-	emit_move_insn (srcptr, tmp);
+      if (tmp != destptr)
+	emit_move_insn (destptr, tmp);
+      if (srcptr)
+	{
+	  tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
+				     true, OPTAB_LIB_WIDEN);
+	  if (tmp != srcptr)
+	    emit_move_insn (srcptr, tmp);
+	}
     }
   emit_label (out_label);
+  return iter;
+}
+
+/* When SRCPTR is non-NULL, output simple loop to move memory
+   pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
+   overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
+   equivalent loop to set memory by VALUE (supposed to be in MODE).
+
+   The size is rounded down to whole number of chunk size moved at once.
+   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
+
+static void
+expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
+			       rtx destptr, rtx srcptr, rtx value,
+			       rtx count, enum machine_mode mode, int unroll,
+			       int expected_size)
+{
+  expand_set_or_movmem_via_loop_with_iter (destmem, srcmem,
+				 destptr, srcptr, value,
+				 count, NULL_RTX, mode, unroll,
+				 expected_size, true);
 }
 
 /* Output "rep; mov" instruction.
@@ -20437,7 +20576,27 @@ emit_strmov (rtx destmem, rtx srcmem,
   emit_insn (gen_strmov (destptr, dest, srcptr, src));
 }
 
-/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST.  */
+/* Emit strset instruction.  If RHS is constant and a vector mode will be
+   used, then move this constant to a vector register before emitting strset.  */
+static void
+emit_strset (rtx destmem, rtx value,
+	     rtx destptr, enum machine_mode mode, int offset)
+{
+  rtx dest = adjust_automodify_address_nv (destmem, mode, destptr, offset);
+  rtx vec_reg;
+  if (vector_extensions_used_for_mode (mode) && CONSTANT_P (value))
+    {
+      if (mode == DImode)
+	mode = TARGET_64BIT ? V2DImode : V4SImode;
+      vec_reg = gen_reg_rtx (mode);
+      emit_move_insn (vec_reg, value);
+      emit_insn (gen_strset (destptr, dest, vec_reg));
+    }
+  else
+    emit_insn (gen_strset (destptr, dest, value));
+}
+
+/* Output code to copy (count % max_size) bytes from SRC to DEST.  */
 static void
 expand_movmem_epilogue (rtx destmem, rtx srcmem,
 			rtx destptr, rtx srcptr, rtx count, int max_size)
@@ -20448,43 +20607,55 @@ expand_movmem_epilogue (rtx destmem, rtx srcmem,
       HOST_WIDE_INT countval = INTVAL (count);
       int offset = 0;
 
-      if ((countval & 0x10) && max_size > 16)
+      int remainder_size = countval % max_size;
+      enum machine_mode move_mode = Pmode;
+
+      /* First, try to move data with the widest possible mode.
+	 The remaining part will be moved using Pmode and narrower modes.  */
+      if (TARGET_SSE)
 	{
-	  if (TARGET_64BIT)
-	    {
-	      emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
-	      emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset + 8);
-	    }
-	  else
-	    gcc_unreachable ();
-	  offset += 16;
+	  if (max_size >= GET_MODE_SIZE (V4SImode))
+	    move_mode = V4SImode;
+	  else if (max_size >= GET_MODE_SIZE (DImode))
+	    move_mode = DImode;
 	}
-      if ((countval & 0x08) && max_size > 8)
+
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-	  if (TARGET_64BIT)
-	    emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
-	  else
-	    {
-	      emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
-	      emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset + 4);
-	    }
-	  offset += 8;
+	  emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
 	}
-      if ((countval & 0x04) && max_size > 4)
+
+      /* Move the remaining part of the epilogue - its size might be up to
+	 the size of the widest mode.  */
+      move_mode = Pmode;
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
+	{
+	  emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
+	}
+
+      if (remainder_size >= 4)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
 	  offset += 4;
+	  remainder_size -= 4;
 	}
-      if ((countval & 0x02) && max_size > 2)
+      if (remainder_size >= 2)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
 	  offset += 2;
+	  remainder_size -= 2;
 	}
-      if ((countval & 0x01) && max_size > 1)
+      if (remainder_size >= 1)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
 	  offset += 1;
+	  remainder_size -= 1;
 	}
+      gcc_assert (remainder_size == 0);
       return;
     }
   if (max_size > 8)
@@ -20590,87 +20761,122 @@ expand_setmem_epilogue_via_loop (rtx destmem, rtx destptr, rtx value,
 				 1, max_size / 2);
 }
 
-/* Output code to set at most count & (max_size - 1) bytes starting by DEST.  */
+/* Output code to set at most count & (max_size - 1) bytes starting at
+   DESTMEM.  */
 static void
-expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_size)
+expand_setmem_epilogue (rtx destmem, rtx destptr, rtx promoted_to_vector_value,
+			rtx value, rtx count, int max_size)
 {
-  rtx dest;
-
   if (CONST_INT_P (count))
     {
       HOST_WIDE_INT countval = INTVAL (count);
       int offset = 0;
 
-      if ((countval & 0x10) && max_size > 16)
+      int remainder_size = countval % max_size;
+      enum machine_mode move_mode = Pmode;
+      enum machine_mode sse_mode = TARGET_64BIT ? V2DImode : V4SImode;
+      rtx promoted_value = NULL_RTX;
+
+      /* First, try to move data with the widest possible mode.
+	 The remaining part will be moved using Pmode and narrower modes.  */
+      if (TARGET_SSE)
 	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset + 8);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    gcc_unreachable ();
-	  offset += 16;
+	  if (max_size >= GET_MODE_SIZE (sse_mode))
+	    move_mode = sse_mode;
+	  else if (max_size >= GET_MODE_SIZE (DImode))
+	    move_mode = DImode;
+	  if (!VECTOR_MODE_P (GET_MODE (promoted_to_vector_value)))
+	    promoted_to_vector_value = NULL_RTX;
 	}
-      if ((countval & 0x08) && max_size > 8)
+
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    {
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset + 4);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  offset += 8;
+	  if (GET_MODE (destmem) != move_mode)
+	    destmem = change_address (destmem, move_mode, destptr);
+	  if (!promoted_to_vector_value)
+	    promoted_to_vector_value =
+	      targetm.promote_rtx_for_memset (move_mode, value);
+	  emit_strset (destmem, promoted_to_vector_value, destptr,
+		       move_mode, offset);
+
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
 	}
-      if ((countval & 0x04) && max_size > 4)
+
+      /* Move the remaining part of the epilogue - its size might be up to
+	 the size of the widest mode.  */
+      move_mode = Pmode;
+      promoted_value = NULL_RTX;
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
+	{
+	  if (!promoted_value)
+	    promoted_value = promote_duplicated_reg (move_mode, value);
+	  emit_strset (destmem, promoted_value, destptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
+	}
+
+      if (!promoted_value)
+	promoted_value = promote_duplicated_reg (move_mode, value);
+      if (remainder_size >= 4)
 	{
-	  dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
+	  emit_strset (destmem, gen_lowpart (SImode, promoted_value), destptr,
+		       SImode, offset);
 	  offset += 4;
+	  remainder_size -= 4;
 	}
-      if ((countval & 0x02) && max_size > 2)
+      if (remainder_size >= 2)
 	{
-	  dest = adjust_automodify_address_nv (destmem, HImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
-	  offset += 2;
+	  emit_strset (destmem, gen_lowpart (HImode, promoted_value), destptr,
+		       HImode, offset);
+	  offset += 2;
+	  remainder_size -= 2;
 	}
-      if ((countval & 0x01) && max_size > 1)
+      if (remainder_size >= 1)
 	{
-	  dest = adjust_automodify_address_nv (destmem, QImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
+	  emit_strset (destmem, gen_lowpart (QImode, promoted_value), destptr,
+		       QImode, offset);
 	  offset += 1;
+	  remainder_size -= 1;
 	}
+      gcc_assert (remainder_size == 0);
       return;
     }
+
+  /* COUNT isn't constant.  */
   if (max_size > 32)
     {
-      expand_setmem_epilogue_via_loop (destmem, destptr, value, count, max_size);
+      expand_setmem_epilogue_via_loop (destmem, destptr, value, count,
+				       max_size);
       return;
     }
+  /* If it turned out that we promoted the value to a non-vector register,
+     we can reuse it.  */
+  if (!VECTOR_MODE_P (GET_MODE (promoted_to_vector_value)))
+    value = promoted_to_vector_value;
+
   if (max_size > 16)
     {
       rtx label = ix86_expand_aligntest (count, 16, true);
       if (TARGET_64BIT)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
@@ -20680,14 +20886,17 @@ expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_
       rtx label = ix86_expand_aligntest (count, 8, true);
       if (TARGET_64BIT)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
@@ -20695,24 +20904,24 @@ expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_
   if (max_size > 4)
     {
       rtx label = ix86_expand_aligntest (count, 4, true);
-      dest = change_address (destmem, SImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
+      destmem = change_address (destmem, SImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
   if (max_size > 2)
     {
       rtx label = ix86_expand_aligntest (count, 2, true);
-      dest = change_address (destmem, HImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
+      destmem = change_address (destmem, HImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (HImode, value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
   if (max_size > 1)
     {
       rtx label = ix86_expand_aligntest (count, 1, true);
-      dest = change_address (destmem, QImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
+      destmem = change_address (destmem, QImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (QImode, value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
@@ -20755,7 +20964,27 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
-  gcc_assert (desired_alignment <= 8);
+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  srcmem = change_address (srcmem, DImode, srcptr);
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	}
+      else
+	{
+	  srcmem = change_address (srcmem, SImode, srcptr);
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	}
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);
 }
 
 /* Copy enough from DST to SRC to align DST known to DESIRED_ALIGN.
@@ -20810,6 +21039,37 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
       off = 4;
       emit_insn (gen_strmov (destreg, dst, srcreg, src));
     }
+  if (align_bytes & 8)
+    {
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, DImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      else
+	{
+	  dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, SImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
+	set_mem_align (dst, 8 * BITS_PER_UNIT);
+      if (src_align_bytes >= 0)
+	{
+	  unsigned int src_align = 0;
+	  if ((src_align_bytes & 7) == (align_bytes & 7))
+	    src_align = 8;
+	  else if ((src_align_bytes & 3) == (align_bytes & 3))
+	    src_align = 4;
+	  else if ((src_align_bytes & 1) == (align_bytes & 1))
+	    src_align = 2;
+	  if (MEM_ALIGN (src) < src_align * BITS_PER_UNIT)
+	    set_mem_align (src, src_align * BITS_PER_UNIT);
+	}
+      off = 8;
+    }
   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
   src = adjust_automodify_address_nv (src, BLKmode, srcreg, off);
   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
@@ -20869,7 +21129,17 @@ expand_setmem_prologue (rtx destmem, rtx destptr, rtx value, rtx count,
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
-  gcc_assert (desired_alignment <= 8);
+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      destmem = change_address (destmem, SImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);
 }
 
 /* Set enough from DST to align DST known to by aligned by ALIGN to
@@ -20905,6 +21175,19 @@ expand_constant_setmem_prologue (rtx dst, rtx destreg, rtx value,
       emit_insn (gen_strset (destreg, dst,
 			     gen_lowpart (SImode, value)));
     }
+  if (align_bytes & 8)
+    {
+      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+      emit_insn (gen_strset (destreg, dst,
+	    gen_lowpart (SImode, value)));
+      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+      emit_insn (gen_strset (destreg, dst,
+	    gen_lowpart (SImode, value)));
+
+      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
+	set_mem_align (dst, 8 * BITS_PER_UNIT);
+      off = 8;
+    }
   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
     set_mem_align (dst, desired_align * BITS_PER_UNIT);
@@ -20916,7 +21199,7 @@ expand_constant_setmem_prologue (rtx dst, rtx destreg, rtx value,
 /* Given COUNT and EXPECTED_SIZE, decide on codegen of string operation.  */
 static enum stringop_alg
 decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
-	    int *dynamic_check)
+	    int *dynamic_check, bool align_unknown)
 {
   const struct stringop_algs * algs;
   bool optimize_for_speed;
@@ -20925,7 +21208,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
      consider such algorithms if the user has appropriated those
      registers for their own purposes.	*/
   bool rep_prefix_usable = !(fixed_regs[CX_REG] || fixed_regs[DI_REG]
-                             || (memset
+			     || (memset
 				 ? fixed_regs[AX_REG] : fixed_regs[SI_REG]));
 
 #define ALG_USABLE_P(alg) (rep_prefix_usable			\
@@ -20938,7 +21221,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
      of time processing large blocks.  */
   if (optimize_function_for_size_p (cfun)
       || (optimize_insn_for_size_p ()
-          && expected_size != -1 && expected_size < 256))
+	  && expected_size != -1 && expected_size < 256))
     optimize_for_speed = false;
   else
     optimize_for_speed = true;
@@ -20947,9 +21230,9 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
 
   *dynamic_check = -1;
   if (memset)
-    algs = &cost->memset[TARGET_64BIT != 0];
+    algs = &cost->memset[align_unknown][TARGET_64BIT != 0];
   else
-    algs = &cost->memcpy[TARGET_64BIT != 0];
+    algs = &cost->memcpy[align_unknown][TARGET_64BIT != 0];
   if (ix86_stringop_alg != no_stringop && ALG_USABLE_P (ix86_stringop_alg))
     return ix86_stringop_alg;
   /* rep; movq or rep; movl is the smallest variant.  */
@@ -21013,29 +21296,33 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
       enum stringop_alg alg;
       int i;
       bool any_alg_usable_p = true;
+      bool only_libcall_fits = true;
 
       for (i = 0; i < MAX_STRINGOP_ALGS; i++)
-        {
-          enum stringop_alg candidate = algs->size[i].alg;
-          any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
+	{
+	  enum stringop_alg candidate = algs->size[i].alg;
+	  any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
 
-          if (candidate != libcall && candidate
-              && ALG_USABLE_P (candidate))
-              max = algs->size[i].max;
-        }
+	  if (candidate != libcall && candidate
+	      && ALG_USABLE_P (candidate))
+	    {
+	      max = algs->size[i].max;
+	      only_libcall_fits = false;
+	    }
+	}
       /* If there aren't any usable algorithms, then recursing on
-         smaller sizes isn't going to find anything.  Just return the
-         simple byte-at-a-time copy loop.  */
-      if (!any_alg_usable_p)
-        {
-          /* Pick something reasonable.  */
-          if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
-            *dynamic_check = 128;
-          return loop_1_byte;
-        }
+	 smaller sizes isn't going to find anything.  Just return the
+	 simple byte-at-a-time copy loop.  */
+      if (!any_alg_usable_p || only_libcall_fits)
+	{
+	  /* Pick something reasonable.  */
+	  if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
+	    *dynamic_check = 128;
+	  return loop_1_byte;
+	}
       if (max == -1)
 	max = 4096;
-      alg = decide_alg (count, max / 2, memset, dynamic_check);
+      alg = decide_alg (count, max / 2, memset, dynamic_check, align_unknown);
       gcc_assert (*dynamic_check == -1);
       gcc_assert (alg != libcall);
       if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
@@ -21059,9 +21346,11 @@ decide_alignment (int align,
       case no_stringop:
 	gcc_unreachable ();
       case loop:
-      case unrolled_loop:
 	desired_align = GET_MODE_SIZE (Pmode);
 	break;
+      case unrolled_loop:
+	desired_align = GET_MODE_SIZE (TARGET_SSE ? V4SImode : Pmode);
+	break;
       case rep_prefix_8_byte:
 	desired_align = 8;
 	break;
@@ -21149,6 +21438,11 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
   enum stringop_alg alg;
   int dynamic_check;
   bool need_zero_guard = false;
+  bool align_unknown;
+  int unroll_factor;
+  enum machine_mode move_mode;
+  rtx loop_iter = NULL_RTX;
+  int dst_offset, src_offset;
 
   if (CONST_INT_P (align_exp))
     align = INTVAL (align_exp);
@@ -21172,9 +21466,17 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 
   /* Step 0: Decide on preferred algorithm, desired alignment and
      size of chunks to be copied by main loop.  */
-
-  alg = decide_alg (count, expected_size, false, &dynamic_check);
+  dst_offset = get_mem_align_offset (dst, MOVE_MAX*BITS_PER_UNIT);
+  src_offset = get_mem_align_offset (src, MOVE_MAX*BITS_PER_UNIT);
+  align_unknown = (dst_offset < 0
+		   || src_offset < 0
+		   || src_offset != dst_offset);
+  alg = decide_alg (count, expected_size, false, &dynamic_check, align_unknown);
   desired_align = decide_alignment (align, alg, expected_size);
+  if (align_unknown)
+    desired_align = align;
+  unroll_factor = 1;
+  move_mode = Pmode;
 
   if (!TARGET_ALIGN_STRINGOPS)
     align = desired_align;
@@ -21193,11 +21495,16 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
       gcc_unreachable ();
     case loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode);
+      move_mode = Pmode;
+      unroll_factor = 1;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * (TARGET_64BIT ? 4 : 2);
+      /* Use SSE instructions, if possible.  */
+      move_mode = TARGET_SSE ? (align_unknown ? DImode : V4SImode) : Pmode;
+      unroll_factor = 4;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
       size_needed = 8;
@@ -21318,6 +21625,8 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 						 desired_align, align_bytes);
 	  count_exp = plus_constant (count_exp, -align_bytes);
 	  count -= align_bytes;
+	  if (count < (unsigned HOST_WIDE_INT) size_needed)
+	    goto epilogue;
 	}
       if (need_zero_guard
 	  && (count < (unsigned HOST_WIDE_INT) size_needed
@@ -21366,11 +21675,14 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 				     count_exp, Pmode, 1, expected_size);
       break;
     case unrolled_loop:
-      /* Unroll only by factor of 2 in 32bit mode, since we don't have enough
-	 registers for 4 temporaries anyway.  */
-      expand_set_or_movmem_via_loop (dst, src, destreg, srcreg, NULL,
-				     count_exp, Pmode, TARGET_64BIT ? 4 : 2,
-				     expected_size);
+      /* In some cases we want to use the same iterator in several adjacent
+	 loops, so we save the loop iterator rtx and don't update the addresses.  */
+      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
+							   srcreg, NULL,
+							   count_exp, NULL_RTX,
+							   move_mode,
+							   unroll_factor,
+							   expected_size, false);
       break;
     case rep_prefix_8_byte:
       expand_movmem_via_rep_mov (dst, src, destreg, srcreg, count_exp,
@@ -21421,9 +21733,50 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
       LABEL_NUSES (label) = 1;
     }
 
+  /* We haven't updated the addresses, so we'll do it now.
+     Also, if the epilogue seems to be big, we'll generate a (non-unrolled)
+     loop in it.  We do this only if the alignment is unknown, because in
+     that case the epilogue would have to move memory byte by byte, which
+     is very slow.  */
+  if (alg == unrolled_loop)
+    {
+      rtx tmp;
+      if (align_unknown && unroll_factor > 1)
+	{
+	  /* Reduce the epilogue's size by creating a non-unrolled loop.  If we
+	     don't do this, we can end up with a very big epilogue - when the
+	     alignment is statically unknown it proceeds byte by byte, which may be very slow.  */
+	  rtx epilogue_loop_jump_around = gen_label_rtx ();
+	  rtx tmp = plus_constant (loop_iter, GET_MODE_SIZE (move_mode));
+	  emit_cmp_and_jump_insns (count_exp, tmp, LT, NULL_RTX,
+				   counter_mode (count_exp), true,
+				   epilogue_loop_jump_around);
+	  predict_jump (REG_BR_PROB_BASE * 10 / 100);
+	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
+	      srcreg, NULL, count_exp,
+	      loop_iter, move_mode, 1,
+	      expected_size, false);
+	  emit_label (epilogue_loop_jump_around);
+	  src = change_address (src, BLKmode, srcreg);
+	  dst = change_address (dst, BLKmode, destreg);
+	  epilogue_size_needed = GET_MODE_SIZE (move_mode);
+	}
+      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != destreg)
+	emit_move_insn (destreg, tmp);
+
+      tmp = expand_simple_binop (Pmode, PLUS, srcreg, loop_iter, srcreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != srcreg)
+	emit_move_insn (srcreg, tmp);
+    }
   if (count_exp != const0_rtx && epilogue_size_needed > 1)
-    expand_movmem_epilogue (dst, src, destreg, srcreg, count_exp,
-			    epilogue_size_needed);
+    {
+      expand_movmem_epilogue (dst, src, destreg, srcreg, count_exp,
+			      epilogue_size_needed);
+    }
+
   if (jump_around_label)
     emit_label (jump_around_label);
   return true;
@@ -21441,7 +21794,37 @@ promote_duplicated_reg (enum machine_mode mode, rtx val)
   rtx tmp;
   int nops = mode == DImode ? 3 : 2;
 
+  if (VECTOR_MODE_P (mode))
+    {
+      enum machine_mode inner = GET_MODE_INNER (mode);
+      rtx promoted_val, vec_reg;
+      if (CONST_INT_P (val))
+	return ix86_build_const_vector (mode, true, val);
+
+      promoted_val = promote_duplicated_reg (inner, val);
+      vec_reg = gen_reg_rtx (mode);
+      switch (mode)
+	{
+	case V2DImode:
+	  emit_insn (gen_vec_dupv2di (vec_reg, promoted_val));
+	  break;
+	case V4SImode:
+	  emit_insn (gen_vec_dupv4si (vec_reg, promoted_val));
+	  break;
+	default:
+	  gcc_unreachable ();
+	  break;
+	}
+
+      return vec_reg;
+    }
   gcc_assert (mode == SImode || mode == DImode);
+  if (mode == DImode && !TARGET_64BIT)
+    {
+      rtx vec_reg = promote_duplicated_reg (V4SImode, val);
+      vec_reg = convert_to_mode (V2DImode, vec_reg, 1);
+      return vec_reg;
+    }
   if (val == const0_rtx)
     return copy_to_mode_reg (mode, const0_rtx);
   if (CONST_INT_P (val))
@@ -21507,11 +21890,21 @@ promote_duplicated_reg (enum machine_mode mode, rtx val)
 static rtx
 promote_duplicated_reg_to_size (rtx val, int size_needed, int desired_align, int align)
 {
-  rtx promoted_val;
+  rtx promoted_val = NULL_RTX;
 
-  if (TARGET_64BIT
-      && (size_needed > 4 || (desired_align > align && desired_align > 4)))
-    promoted_val = promote_duplicated_reg (DImode, val);
+  if (size_needed > 8 || (desired_align > align && desired_align > 8))
+    {
+      gcc_assert (TARGET_SSE);
+      if (TARGET_64BIT)
+        promoted_val = promote_duplicated_reg (V2DImode, val);
+      else
+        promoted_val = promote_duplicated_reg (V4SImode, val);
+    }
+  else if (size_needed > 4 || (desired_align > align && desired_align > 4))
+    {
+      gcc_assert (TARGET_64BIT || TARGET_SSE);
+      promoted_val = promote_duplicated_reg (DImode, val);
+    }
   else if (size_needed > 2 || (desired_align > align && desired_align > 2))
     promoted_val = promote_duplicated_reg (SImode, val);
   else if (size_needed > 1 || (desired_align > align && desired_align > 1))
@@ -21537,12 +21930,17 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
   unsigned HOST_WIDE_INT count = 0;
   HOST_WIDE_INT expected_size = -1;
   int size_needed = 0, epilogue_size_needed;
+  int promote_size_needed = 0;
   int desired_align = 0, align_bytes = 0;
   enum stringop_alg alg;
   rtx promoted_val = NULL;
-  bool force_loopy_epilogue = false;
+  rtx vec_promoted_val = NULL;
   int dynamic_check;
   bool need_zero_guard = false;
+  bool align_unknown;
+  unsigned int unroll_factor;
+  enum machine_mode move_mode;
+  rtx loop_iter = NULL_RTX;
 
   if (CONST_INT_P (align_exp))
     align = INTVAL (align_exp);
@@ -21562,8 +21960,11 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
   /* Step 0: Decide on preferred algorithm, desired alignment and
      size of chunks to be copied by main loop.  */
 
-  alg = decide_alg (count, expected_size, true, &dynamic_check);
+  align_unknown = get_mem_align_offset (dst, BITS_PER_UNIT) < 0;
+  alg = decide_alg (count, expected_size, true, &dynamic_check, align_unknown);
   desired_align = decide_alignment (align, alg, expected_size);
+  unroll_factor = 1;
+  move_mode = Pmode;
 
   if (!TARGET_ALIGN_STRINGOPS)
     align = desired_align;
@@ -21581,11 +21982,21 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       gcc_unreachable ();
     case loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode);
+      move_mode = Pmode;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * 4;
+      /* Use SSE instructions, if possible.  */
+      move_mode = TARGET_SSE
+		  ? (TARGET_64BIT ? V2DImode : V4SImode)
+		  : Pmode;
+      unroll_factor = 1;
+      /* Select the maximal available unroll factor: 1, 2 or 4.  */
+      while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
+	     && unroll_factor < 4)
+	unroll_factor *= 2;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
       size_needed = 8;
@@ -21602,6 +22013,7 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       break;
     }
   epilogue_size_needed = size_needed;
+  promote_size_needed = GET_MODE_SIZE (Pmode);
 
   /* Step 1: Prologue guard.  */
 
@@ -21630,8 +22042,10 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
      main loop and epilogue (ie one load of the big constant in the
      front of all code.  */
   if (CONST_INT_P (val_exp))
-    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
-						   desired_align, align);
+    promoted_val = promote_duplicated_reg_to_size (val_exp,
+						   promote_size_needed,
+						   promote_size_needed,
+						   align);
   /* Ensure that alignment prologue won't copy past end of block.  */
   if (size_needed > 1 || (desired_align > 1 && desired_align > align))
     {
@@ -21640,12 +22054,6 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 	 Make sure it is power of 2.  */
       epilogue_size_needed = smallest_pow2_greater_than (epilogue_size_needed);
 
-      /* To improve performance of small blocks, we jump around the VAL
-	 promoting mode.  This mean that if the promoted VAL is not constant,
-	 we might not use it in the epilogue and have to use byte
-	 loop variant.  */
-      if (epilogue_size_needed > 2 && !promoted_val)
-        force_loopy_epilogue = true;
       if (count)
 	{
 	  if (count < (unsigned HOST_WIDE_INT)epilogue_size_needed)
@@ -21686,8 +22094,10 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 
   /* Do the expensive promotion once we branched off the small blocks.  */
   if (!promoted_val)
-    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
-						   desired_align, align);
+    promoted_val = promote_duplicated_reg_to_size (val_exp,
+						   promote_size_needed,
+						   promote_size_needed,
+						   align);
   gcc_assert (desired_align >= 1 && align >= 1);
 
   if (desired_align > align)
@@ -21751,7 +22161,7 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
     case no_stringop:
       gcc_unreachable ();
     case loop_1_byte:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
+      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, val_exp,
 				     count_exp, QImode, 1, expected_size);
       break;
     case loop:
@@ -21759,8 +22169,14 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 				     count_exp, Pmode, 1, expected_size);
       break;
     case unrolled_loop:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
-				     count_exp, Pmode, 4, expected_size);
+      vec_promoted_val =
+	promote_duplicated_reg_to_size (promoted_val,
+					GET_MODE_SIZE (move_mode),
+					desired_align, align);
+      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+				     NULL, vec_promoted_val, count_exp,
+				     NULL_RTX, move_mode, unroll_factor,
+				     expected_size, false);
       break;
     case rep_prefix_8_byte:
       expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
@@ -21804,15 +22220,36 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       LABEL_NUSES (label) = 1;
     }
  epilogue:
-  if (count_exp != const0_rtx && epilogue_size_needed > 1)
+  if (alg == unrolled_loop)
     {
-      if (force_loopy_epilogue)
-	expand_setmem_epilogue_via_loop (dst, destreg, val_exp, count_exp,
-					 epilogue_size_needed);
-      else
-	expand_setmem_epilogue (dst, destreg, promoted_val, count_exp,
-				epilogue_size_needed);
+      rtx tmp;
+      if (align_unknown && unroll_factor > 1)
+	{
+	  /* Reduce the epilogue's size by creating a not-unrolled loop.  If we
+	     don't do this, we can get a very big epilogue - when alignment is
+	     statically unknown, it is emitted byte by byte, which may be very slow.  */
+	  rtx epilogue_loop_jump_around = gen_label_rtx ();
+	  rtx tmp = plus_constant (loop_iter, GET_MODE_SIZE (move_mode));
+	  emit_cmp_and_jump_insns (count_exp, tmp, LT, NULL_RTX,
+				   counter_mode (count_exp), true,
+				   epilogue_loop_jump_around);
+	  predict_jump (REG_BR_PROB_BASE * 10 / 100);
+	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+	      NULL, vec_promoted_val, count_exp,
+	      loop_iter, move_mode, 1,
+	      expected_size, false);
+	  emit_label (epilogue_loop_jump_around);
+	  dst = change_address (dst, BLKmode, destreg);
+	  epilogue_size_needed = GET_MODE_SIZE (move_mode);
+	}
+      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != destreg)
+	emit_move_insn (destreg, tmp);
     }
+  if (count_exp != const0_rtx && epilogue_size_needed > 1)
+    expand_setmem_epilogue (dst, destreg, promoted_val, val_exp, count_exp,
+			    epilogue_size_needed);
   if (jump_around_label)
     emit_label (jump_around_label);
   return true;
@@ -36374,6 +36811,87 @@ ix86_autovectorize_vector_sizes (void)
   return (TARGET_AVX && !TARGET_PREFER_AVX128) ? 32 | 16 : 0;
 }
 
+/* Target hook.  Prevent unaligned access to data in vector modes.  */
+
+static bool
+ix86_slow_unaligned_access (enum machine_mode mode,
+			    unsigned int align)
+{
+  if (TARGET_AVX)
+    {
+      if (GET_MODE_SIZE (mode) == 32)
+	{
+	  if (align <= 16)
+	    return (TARGET_AVX256_SPLIT_UNALIGNED_LOAD ||
+		    TARGET_AVX256_SPLIT_UNALIGNED_STORE);
+	  else
+	    return false;
+	}
+    }
+
+  if (GET_MODE_SIZE (mode) > 8)
+    {
+      return (! TARGET_SSE_UNALIGNED_LOAD_OPTIMAL &&
+	      ! TARGET_SSE_UNALIGNED_STORE_OPTIMAL);
+    }
+
+  return false;
+}
+
+/* Target hook.  Returns an rtx of mode MODE with the value VAL, which is
+   supposed to represent one byte, promoted to fill the mode.  MODE could
+   be a vector mode.  Example:
+   1) VAL = const_int (0xAB), mode = SImode,
+   the result is const_int (0xABABABAB).
+   2) if VAL isn't constant, then the result is the result of multiplying
+   VAL by const_int (0x01010101) (for SImode).  */
+
+static rtx
+ix86_promote_rtx_for_memset (enum machine_mode mode  ATTRIBUTE_UNUSED,
+			      rtx val)
+{
+  enum machine_mode val_mode = GET_MODE (val);
+  gcc_assert (VALID_INT_MODE_P (val_mode) || val_mode == VOIDmode);
+
+  if (vector_extensions_used_for_mode (mode) && TARGET_SSE)
+    {
+      rtx promoted_val, vec_reg;
+      enum machine_mode vec_mode = TARGET_64BIT ? V2DImode : V4SImode;
+      if (CONST_INT_P (val))
+	{
+	  rtx const_vec;
+	  HOST_WIDE_INT int_val = (UINTVAL (val) & 0xFF)
+				   * (TARGET_64BIT
+				      ? 0x0101010101010101
+				      : 0x01010101);
+	  val = gen_int_mode (int_val, Pmode);
+	  vec_reg = gen_reg_rtx (vec_mode);
+	  const_vec = ix86_build_const_vector (vec_mode, true, val);
+	  if (mode != vec_mode)
+	    const_vec = convert_to_mode (vec_mode, const_vec, 1);
+	  emit_move_insn (vec_reg, const_vec);
+	  return vec_reg;
+	}
+      /* Else: val isn't const.  */
+      promoted_val = promote_duplicated_reg (Pmode, val);
+      vec_reg = gen_reg_rtx (vec_mode);
+      switch (vec_mode)
+	{
+	case V2DImode:
+	  emit_insn (gen_vec_dupv2di (vec_reg, promoted_val));
+	  break;
+	case V4SImode:
+	  emit_insn (gen_vec_dupv4si (vec_reg, promoted_val));
+	  break;
+	default:
+	  gcc_unreachable ();
+	  break;
+	}
+      return vec_reg;
+    }
+  return NULL_RTX;
+}
+
 /* Initialize the GCC target structure.  */
 #undef TARGET_RETURN_IN_MEMORY
 #define TARGET_RETURN_IN_MEMORY ix86_return_in_memory
@@ -36681,6 +37199,12 @@ ix86_autovectorize_vector_sizes (void)
 #undef TARGET_CONDITIONAL_REGISTER_USAGE
 #define TARGET_CONDITIONAL_REGISTER_USAGE ix86_conditional_register_usage
 
+#undef TARGET_SLOW_UNALIGNED_ACCESS
+#define TARGET_SLOW_UNALIGNED_ACCESS ix86_slow_unaligned_access
+
+#undef TARGET_PROMOTE_RTX_FOR_MEMSET
+#define TARGET_PROMOTE_RTX_FOR_MEMSET ix86_promote_rtx_for_memset
+
 #if TARGET_MACHO
 #undef TARGET_INIT_LIBFUNCS
 #define TARGET_INIT_LIBFUNCS darwin_rename_builtins
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 7d6e058..1336f9f 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -159,8 +159,12 @@ struct processor_costs {
   const int fchs;		/* cost of FCHS instruction.  */
   const int fsqrt;		/* cost of FSQRT instruction.  */
 				/* Specify what algorithm
-				   to use for stringops on unknown size.  */
-  struct stringop_algs memcpy[2], memset[2];
+				   to use for stringops on unknown size.
+				   First index is used to specify whether
+				   alignment is known or not.
+				   Second - to specify whether 32 or 64 bits
+				   are used.  */
+  struct stringop_algs memcpy[2][2], memset[2][2];
   const int scalar_stmt_cost;   /* Cost of any scalar operation, excluding
 				   load and store.  */
   const int scalar_load_cost;   /* Cost of scalar load.  */
@@ -1712,7 +1716,7 @@ typedef struct ix86_args {
 /* If a clear memory operation would take CLEAR_RATIO or more simple
    move-instruction sequences, we will do a clrmem or libcall instead.  */
 
-#define CLEAR_RATIO(speed) ((speed) ? MIN (6, ix86_cost->move_ratio) : 2)
+#define CLEAR_RATIO(speed) ((speed) ? ix86_cost->move_ratio : 2)
 
 /* Define if shifts truncate the shift count which implies one can
    omit a sign-extension or zero-extension of a shift count.
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 6c20ddb..3e363f4 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -7244,6 +7244,13 @@
    (set_attr "prefix" "maybe_vex,maybe_vex,orig,orig,vex")
    (set_attr "mode" "TI,TI,V4SF,SF,SF")])
 
+(define_expand "sse2_loadq"
+ [(set (match_operand:V2DI 0 "register_operand")
+       (vec_concat:V2DI
+	 (match_operand:DI 1 "memory_operand")
+	 (const_int 0)))]
+  "!TARGET_64BIT && TARGET_SSE2")
+
 (define_insn_and_split "sse2_stored"
   [(set (match_operand:SI 0 "nonimmediate_operand" "=xm,r")
 	(vec_select:SI
@@ -7355,6 +7362,16 @@
    (set_attr "prefix" "maybe_vex,orig,vex,maybe_vex,orig,orig")
    (set_attr "mode" "V2SF,TI,TI,TI,V4SF,V2SF")])
 
+(define_expand "vec_dupv4si"
+  [(set (match_operand:V4SI 0 "register_operand" "")
+	(vec_duplicate:V4SI
+	  (match_operand:SI 1 "nonimmediate_operand" "")))]
+  "TARGET_SSE"
+{
+  if (!TARGET_AVX)
+    operands[1] = force_reg (V4SImode, operands[1]);
+})
+
 (define_insn "*vec_dupv4si_avx"
   [(set (match_operand:V4SI 0 "register_operand"     "=x,x")
 	(vec_duplicate:V4SI
@@ -7396,6 +7413,16 @@
    (set_attr "prefix" "orig,vex,maybe_vex")
    (set_attr "mode" "TI,TI,DF")])
 
+(define_expand "vec_dupv2di"
+  [(set (match_operand:V2DI 0 "register_operand" "")
+	(vec_duplicate:V2DI
+	  (match_operand:DI 1 "nonimmediate_operand" "")))]
+  "TARGET_SSE"
+{
+  if (!TARGET_AVX)
+    operands[1] = force_reg (V2DImode, operands[1]);
+})
+
 (define_insn "*vec_dupv2di"
   [(set (match_operand:V2DI 0 "register_operand" "=x,x")
 	(vec_duplicate:V2DI
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 335c1d1..479d534 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5780,6 +5780,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum machine_mode @var{mode}, unsigned int @var{align})
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options require building faster code, the hook can
+be used to prevent access to unaligned data in some set of modes even if the
+processor can do the access without trapping.
+
+By default the hook returns the value of the @code{SLOW_UNALIGNED_ACCESS}
+macro if it is defined, and @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
+@deftypefn {Target Hook} rtx TARGET_PROMOTE_RTX_FOR_MEMSET (enum machine_mode @var{mode}, rtx @var{val})
+This hook returns an rtx of mode @var{mode} containing the promoted value
+@var{val}, or NULL.  The hook generates the instructions that are needed to
+promote @var{val} to mode @var{mode}.
+If the promotion instructions cannot be generated, the hook returns NULL.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6252,23 +6278,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 6783826..9073d9e 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -5718,6 +5718,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@hook TARGET_SLOW_UNALIGNED_ACCESS
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options require building faster code, the hook can
+be used to prevent access to unaligned data in some set of modes even if the
+processor can do the access without trapping.
+
+By default the hook returns the value of the @code{SLOW_UNALIGNED_ACCESS}
+macro if it is defined, and @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
+@hook TARGET_PROMOTE_RTX_FOR_MEMSET
+This hook returns an rtx of mode @var{mode} containing the promoted value
+@var{val}, or NULL.  The hook generates the instructions that are needed to
+promote @var{val} to mode @var{mode}.
+If the promotion instructions cannot be generated, the hook returns NULL.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6190,23 +6216,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/expr.c b/gcc/expr.c
index 29bf68b..bf9ed3f 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -811,7 +811,7 @@ alignment_for_piecewise_move (unsigned int max_pieces, unsigned int align)
 	   tmode != VOIDmode;
 	   xmode = tmode, tmode = GET_MODE_WIDER_MODE (tmode))
 	if (GET_MODE_SIZE (tmode) > max_pieces
-	    || SLOW_UNALIGNED_ACCESS (tmode, align))
+	    || targetm.slow_unaligned_access (tmode, align))
 	  break;
 
       align = MAX (align, GET_MODE_ALIGNMENT (xmode));
@@ -836,6 +836,48 @@ widest_int_mode_for_size (unsigned int size)
   return mode;
 }
 
+/* If MODE is a scalar mode, find the corresponding preferred vector mode.
+   If such a mode can't be found, return the vector mode corresponding to
+   Pmode (a kind of default vector mode).
+   For vector modes, return the mode itself.  */
+
+static enum machine_mode
+vector_mode_for_mode (enum machine_mode mode)
+{
+  enum machine_mode xmode;
+  if (VECTOR_MODE_P (mode))
+    return mode;
+  xmode = targetm.vectorize.preferred_simd_mode (mode);
+  if (VECTOR_MODE_P (xmode))
+    return xmode;
+
+  return targetm.vectorize.preferred_simd_mode (Pmode);
+}
+
+/* Check whether vector instructions are required for operating on the
+   specified mode.
+   For vector modes, check whether the corresponding vector extension is
+   supported.
+   Operations in a scalar mode will use vector extensions if this scalar
+   mode is wider than the default scalar mode (Pmode) and a vector extension
+   for the parent vector mode is available.  */
+
+bool vector_extensions_used_for_mode (enum machine_mode mode)
+{
+  enum machine_mode vector_mode = vector_mode_for_mode (mode);
+
+  if (VECTOR_MODE_P (mode))
+    return targetm.vector_mode_supported_p (mode);
+
+  /* mode is a scalar mode.  */
+  if (VECTOR_MODE_P (vector_mode)
+     && targetm.vector_mode_supported_p (vector_mode)
+     && (GET_MODE_SIZE (mode) > GET_MODE_SIZE (Pmode)))
+    return true;
+
+  return false;
+}
+
 /* STORE_MAX_PIECES is the number of bytes at a time that we can
    store efficiently.  Due to internal GCC limitations, this is
    MOVE_MAX_PIECES limited by the number of bytes GCC can represent
@@ -1680,7 +1722,7 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src, tree type, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (src)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (src))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (src))
 	      || MEM_ALIGN (src) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -2070,7 +2112,7 @@ emit_group_store (rtx orig_dst, rtx src, tree type ATTRIBUTE_UNUSED, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (dest)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (dest))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (dest))
 	      || MEM_ALIGN (dest) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -3928,7 +3970,7 @@ emit_push_insn (rtx x, enum machine_mode mode, tree type, rtx size,
 	  /* Here we avoid the case of a structure whose weak alignment
 	     forces many pushes of a small amount of data,
 	     and such small pushes do rounding that causes trouble.  */
-	  && ((! SLOW_UNALIGNED_ACCESS (word_mode, align))
+	  && ((! targetm.slow_unaligned_access (word_mode, align))
 	      || align >= BIGGEST_ALIGNMENT
 	      || (PUSH_ROUNDING (align / BITS_PER_UNIT)
 		  == (align / BITS_PER_UNIT)))
@@ -6214,7 +6256,7 @@ store_field (rtx target, HOST_WIDE_INT bitsize, HOST_WIDE_INT bitpos,
       || (mode != BLKmode
 	  && ((((MEM_ALIGN (target) < GET_MODE_ALIGNMENT (mode))
 		|| bitpos % GET_MODE_ALIGNMENT (mode))
-	       && SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (target)))
+	       && targetm.slow_unaligned_access (mode, MEM_ALIGN (target)))
 	      || (bitpos % BITS_PER_UNIT != 0)))
       /* If the RHS and field are a constant size and the size of the
 	 RHS isn't the same size as the bitfield, we must use bitfield
@@ -9617,7 +9659,7 @@ expand_expr_real_1 (tree exp, rtx target, enum machine_mode tmode,
 		     && ((modifier == EXPAND_CONST_ADDRESS
 			  || modifier == EXPAND_INITIALIZER)
 			 ? STRICT_ALIGNMENT
-			 : SLOW_UNALIGNED_ACCESS (mode1, MEM_ALIGN (op0))))
+			 : targetm.slow_unaligned_access (mode1, MEM_ALIGN (op0))))
 		    || (bitpos % BITS_PER_UNIT != 0)))
 	    /* If the type and the field are a constant size and the
 	       size of the type isn't the same size as the bitfield,
diff --git a/gcc/rtl.h b/gcc/rtl.h
index 860f6c4..c2e6920 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -2511,6 +2511,9 @@ extern void emit_jump (rtx);
 /* In expr.c */
 extern rtx move_by_pieces (rtx, rtx, unsigned HOST_WIDE_INT,
 			   unsigned int, int);
+/* Check whether vector instructions are required for operating on the
+   specified mode.  */
+bool vector_extensions_used_for_mode (enum machine_mode);
 extern HOST_WIDE_INT find_args_size_adjust (rtx);
 extern int fixup_args_size_notes (rtx, rtx, int);
 
diff --git a/gcc/target.def b/gcc/target.def
index 1e09ba7..082ed99 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1498,6 +1498,22 @@ DEFHOOK
  bool, (struct ao_ref_s *ref),
  default_ref_may_alias_errno)
 
+/* True if access to unaligned data in given mode is too slow or
+   prohibited.  */
+DEFHOOK
+(slow_unaligned_access,
+ "",
+ bool, (enum machine_mode mode, unsigned int align),
+ default_slow_unaligned_access)
+
+/* Target hook.  Returns rtx of mode MODE with promoted value VAL or NULL.
+   VAL is supposed to represent one byte.  */
+DEFHOOK
+(promote_rtx_for_memset,
+ "",
+ rtx, (enum machine_mode mode, rtx val),
+ default_promote_rtx_for_memset)
+
 /* Support for named address spaces.  */
 #undef HOOK_PREFIX
 #define HOOK_PREFIX "TARGET_ADDR_SPACE_"
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index 8ad517f..617d8a3 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -1457,4 +1457,24 @@ default_pch_valid_p (const void *data_p, size_t len)
   return NULL;
 }
 
+bool
+default_slow_unaligned_access (enum machine_mode mode ATTRIBUTE_UNUSED,
+			       unsigned int align ATTRIBUTE_UNUSED)
+{
+#ifdef SLOW_UNALIGNED_ACCESS
+  return SLOW_UNALIGNED_ACCESS (mode, align);
+#else
+  return STRICT_ALIGNMENT;
+#endif
+}
+
+/* Target hook.  Returns rtx of mode MODE with promoted value VAL or NULL.
+   VAL is supposed to represent one byte.  */
+rtx
+default_promote_rtx_for_memset (enum machine_mode mode ATTRIBUTE_UNUSED,
+				 rtx val ATTRIBUTE_UNUSED)
+{
+  return NULL_RTX;
+}
+
 #include "gt-targhooks.h"
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index 552407b..08511ab 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -177,3 +177,6 @@ extern enum machine_mode default_get_reg_raw_mode(int);
 
 extern void *default_get_pch_validity (size_t *);
 extern const char *default_pch_valid_p (const void *, size_t);
+extern bool default_slow_unaligned_access (enum machine_mode mode,
+					   unsigned int align);
+extern rtx default_promote_rtx_for_memset (enum machine_mode mode, rtx val);

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-09-29 12:09                   ` Michael Zolotukhin
  2011-09-29 12:12                     ` Michael Zolotukhin
@ 2011-09-29 13:02                     ` Jakub Jelinek
  2011-10-20  8:39                       ` Michael Zolotukhin
  1 sibling, 1 reply; 52+ messages in thread
From: Jakub Jelinek @ 2011-09-29 13:02 UTC (permalink / raw)
  To: Michael Zolotukhin
  Cc: Jack Howarth, gcc-patches, Jan Hubicka, Richard Guenther,
	H.J. Lu, izamyatin, areg.melikadamyan

Hi!

On Thu, Sep 29, 2011 at 03:14:40PM +0400, Michael Zolotukhin wrote:
+/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */                                                                                       

The testcases are wrong, -m64 or -m32 should never appear in dg-options,
instead if the testcase is specific to -m64, it should be guarded with
/* { dg-do compile { target lp64 } } */
resp. ia32 (or ilp32, depending on what exactly should be done for -mx32),
if you have the same testcase for -m32 and -m64, but just want different
scan-assembler for the two cases, then just guard the scan-assembler
with lp64 resp. ia32/ilp32 target and add second one for the other target.
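
For example, a testcase restricted to 64-bit could be guarded roughly like
this (a minimal sketch only - the function body and the "movdqa" scan
pattern are placeholders for illustration, not taken from the actual
testcases):

/* { dg-do compile { target lp64 } } */
/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */

void
foo (char *p)
{
  __builtin_memset (p, 0, 64);	/* hypothetical body, for illustration */
}

/* { dg-final { scan-assembler "movdqa" } } */

or, if the same source should be compiled for both -m32 and -m64, keep a
single testcase and guard only the scans:

/* { dg-final { scan-assembler "some-64-bit-pattern" { target lp64 } } } */
/* { dg-final { scan-assembler "some-32-bit-pattern" { target ia32 } } } */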

	Jakub

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-09-29 13:02                     ` Jakub Jelinek
@ 2011-10-20  8:39                       ` Michael Zolotukhin
  2011-10-20  8:46                         ` Michael Zolotukhin
  0 siblings, 1 reply; 52+ messages in thread
From: Michael Zolotukhin @ 2011-10-20  8:39 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Jack Howarth, gcc-patches, Jan Hubicka, Richard Guenther,
	H.J. Lu, izamyatin, areg.melikadamyan

[-- Attachment #1: Type: text/plain, Size: 1049 bytes --]

I fixed the tests as well as updated my branch and fixed the bugs
introduced during this process.
Here is the fixed complete patch (other parts will be sent in subsequent letters).

The changes passed bootstrap and make check.

On 29 September 2011 15:21, Jakub Jelinek <jakub@redhat.com> wrote:
> Hi!
>
> On Thu, Sep 29, 2011 at 03:14:40PM +0400, Michael Zolotukhin wrote:
> +/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
>
> The testcases are wrong, -m64 or -m32 should never appear in dg-options,
> instead if the testcase is specific to -m64, it should be guarded with
> /* { dg-do compile { target lp64 } } */
> resp. ia32 (or ilp32, depending on what exactly should be done for -mx32),
> if you have the same testcase for -m32 and -m64, but just want different
> scan-assembler for the two cases, then just guard the scan-assembler
> with lp64 resp. ia32/ilp32 target and add second one for the other target.
>
>        Jakub

-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.

[-- Attachment #2: memfunc-complete-3.patch --]
[-- Type: application/octet-stream, Size: 161580 bytes --]

diff --git a/gcc/builtins.c b/gcc/builtins.c
index 296c5b7..3e41695 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -3567,7 +3567,8 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
 				  builtin_memset_read_str, &c, dest_align,
 				  true))
 	store_by_pieces (dest_mem, tree_low_cst (len, 1),
-			 builtin_memset_read_str, &c, dest_align, true, 0);
+			 builtin_memset_read_str, gen_int_mode (c, val_mode),
+			 dest_align, true, 0);
       else if (!set_storage_via_setmem (dest_mem, len_rtx,
 					gen_int_mode (c, val_mode),
 					dest_align, expected_align,
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 2c53423..d7c4330 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
   COSTS_N_BYTES (2),			/* cost of FABS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FCHS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -632,10 +636,14 @@ struct processor_costs i386_cost = {	/* 386 specific costs */
   COSTS_N_INSNS (22),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (24),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (122),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -704,10 +712,14 @@ struct processor_costs i486_cost = {	/* 486 specific costs */
   COSTS_N_INSNS (3),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (3),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (83),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -774,10 +786,14 @@ struct processor_costs pentium_cost = {
   COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (70),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{-1, rep_prefix_4_byte}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -849,12 +865,18 @@ struct processor_costs pentiumpro_cost = {
      noticeable win, for bigger blocks either rep movsl or rep movsb is
      way to go.  Rep movsb has apparently more expensive startup time in CPU,
      but after 4K the difference is down in the noise.  */
-  {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
+  {{{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
 			{8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_4_byte, {{1024, unrolled_loop},
-  			{8192, rep_prefix_4_byte}, {-1, libcall}}},
+   {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_4_byte, {{1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_4_byte, {{1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -922,10 +944,14 @@ struct processor_costs geode_cost = {
   COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (54),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -995,10 +1021,14 @@ struct processor_costs k6_cost = {
   COSTS_N_INSNS (2),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (2),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (56),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1068,10 +1098,14 @@ struct processor_costs athlon_cost = {
   /* For some reason, Athlon deals better with REP prefix (relative to loops)
      compared to K8. Alignment becomes important after 8 bytes for memcpy and
      128 bytes for memset.  */
-  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1146,11 +1180,16 @@ struct processor_costs k8_cost = {
   /* K8 has optimized REP instruction for medium sized blocks, but for very
      small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1233,11 +1272,16 @@ struct processor_costs amdfam10_cost = {
   /* AMDFAM10 has optimized REP instruction for medium sized blocks, but for
      very small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1320,11 +1364,16 @@ struct processor_costs bdver1_cost = {
   /*  BDVER1 has optimized REP instruction for medium sized blocks, but for
       very small blocks it is better to use loop. For large blocks, libcall
       can do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   6,					/* scalar_stmt_cost.  */
   4,					/* scalar load_cost.  */
   4,					/* scalar_store_cost.  */
@@ -1407,11 +1456,16 @@ struct processor_costs bdver2_cost = {
   /*  BDVER2 has optimized REP instruction for medium sized blocks, but for
       very small blocks it is better to use loop. For large blocks, libcall
       can do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+  {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   6,					/* scalar_stmt_cost.  */
   4,					/* scalar load_cost.  */
   4,					/* scalar_store_cost.  */
@@ -1489,11 +1543,16 @@ struct processor_costs btver1_cost = {
   /* BTVER1 has optimized REP instruction for medium sized blocks, but for
      very small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1560,11 +1619,18 @@ struct processor_costs pentium4_cost = {
   COSTS_N_INSNS (2),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (2),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (43),			/* cost of FSQRT instruction.  */
-  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+
+  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+
+  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
    {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1631,13 +1697,22 @@ struct processor_costs nocona_cost = {
   COSTS_N_INSNS (3),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (3),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (44),			/* cost of FSQRT instruction.  */
-  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+
+  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
    {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
 	      {100000, unrolled_loop}, {-1, libcall}}}},
-  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
+	      {100000, unrolled_loop}, {-1, libcall}}}}},
+
+  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
    {-1, libcall}}},
    {libcall, {{24, loop}, {64, unrolled_loop},
 	      {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {-1, libcall}}},
+   {libcall, {{24, loop}, {64, unrolled_loop},
+	      {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1704,13 +1779,21 @@ struct processor_costs atom_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{11, loop}, {-1, rep_prefix_4_byte}}},
-   {libcall, {{32, loop}, {64, rep_prefix_4_byte},
-	  {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {15, unrolled_loop},
-	  {2048, rep_prefix_4_byte}, {-1, libcall}}},
-   {libcall, {{24, loop}, {32, unrolled_loop},
-	  {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+
+  /* stringop_algs for memcpy.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
+    {libcall, {{4096, unrolled_loop}, {-1, libcall}}}},
+   {{libcall, {{-1, libcall}}},			       /* Unknown alignment.  */
+    {libcall, {{2048, unrolled_loop},
+	       {-1, libcall}}}}},
+
+  /* stringop_algs for memset.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
+    {libcall, {{4096, unrolled_loop}, {-1, libcall}}}},
+   {{libcall, {{1024, unrolled_loop},		       /* Unknown alignment.  */
+	       {-1, libcall}}},
+    {libcall, {{2048, unrolled_loop},
+	       {-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1784,10 +1867,16 @@ struct processor_costs generic64_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {DUMMY_STRINGOP_ALGS,
-   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {DUMMY_STRINGOP_ALGS,
-   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+
+  {{DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}},
+   {DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}}},
+
+  {{DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}},
+   {DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1856,10 +1945,16 @@ struct processor_costs generic32_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+  /* stringop_algs for memcpy.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  /* stringop_algs for memset.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -2537,6 +2632,7 @@ static void ix86_set_current_function (tree);
 static unsigned int ix86_minimum_incoming_stack_boundary (bool);
 
 static enum calling_abi ix86_function_abi (const_tree);
+static rtx promote_duplicated_reg (enum machine_mode, rtx);
 
 \f
 #ifndef SUBTARGET32_DEFAULT_CPU
@@ -15266,6 +15362,38 @@ ix86_expand_move (enum machine_mode mode, rtx operands[])
     }
   else
     {
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE2
+	  && MEM_P (op0)
+	  && MEM_P (op1)
+	  && !push_operand (op0, mode)
+	  && can_create_pseudo_p ())
+	{
+	  rtx temp = gen_reg_rtx (V2DImode);
+	  emit_insn (gen_sse2_loadq (temp, op1));
+	  emit_insn (gen_sse_storeq (op0, temp));
+	  return;
+	}
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE
+	  && !MEM_P (op1)
+	  && GET_MODE (op1) == V2DImode)
+	{
+	  emit_insn (gen_sse_storeq (op0, op1));
+	  return;
+	}
+      if (mode == TImode
+	  && TARGET_AVX2
+	  && MEM_P (op0)
+	  && !MEM_P (op1)
+	  && GET_MODE (op1) == V4DImode)
+	{
+	  op0 = convert_to_mode (V2DImode, op0, 1);
+	  emit_insn (gen_vec_extract_lo_v4di (op0, op1));
+	  return;
+	}
       if (MEM_P (op0)
 	  && (PUSH_ROUNDING (GET_MODE_SIZE (mode)) != GET_MODE_SIZE (mode)
 	      || !push_operand (op0, mode))
@@ -20677,22 +20805,17 @@ counter_mode (rtx count_exp)
   return SImode;
 }
 
-/* When SRCPTR is non-NULL, output simple loop to move memory
-   pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
-   overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
-   equivalent loop to set memory by VALUE (supposed to be in MODE).
-
-   The size is rounded down to whole number of chunk size moved at once.
-   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
-
-
-static void
-expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
-			       rtx destptr, rtx srcptr, rtx value,
-			       rtx count, enum machine_mode mode, int unroll,
-			       int expected_size)
+/* Helper function for expand_set_or_movmem_via_loop.
+   This function can reuse the iter rtx from another loop and doesn't generate
+   code for updating the addresses.  */
+static rtx
+expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
+					 rtx destptr, rtx srcptr, rtx value,
+					 rtx count, rtx iter,
+					 enum machine_mode mode, int unroll,
+					 int expected_size, bool change_ptrs)
 {
-  rtx out_label, top_label, iter, tmp;
+  rtx out_label, top_label, tmp;
   enum machine_mode iter_mode = counter_mode (count);
   rtx piece_size = GEN_INT (GET_MODE_SIZE (mode) * unroll);
   rtx piece_size_mask = GEN_INT (~((GET_MODE_SIZE (mode) * unroll) - 1));
@@ -20700,10 +20823,12 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
   rtx x_addr;
   rtx y_addr;
   int i;
+  bool reuse_iter = (iter != NULL_RTX);
 
   top_label = gen_label_rtx ();
   out_label = gen_label_rtx ();
-  iter = gen_reg_rtx (iter_mode);
+  if (!reuse_iter)
+    iter = gen_reg_rtx (iter_mode);
 
   size = expand_simple_binop (iter_mode, AND, count, piece_size_mask,
 			      NULL, 1, OPTAB_DIRECT);
@@ -20714,18 +20839,21 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
 			       true, out_label);
       predict_jump (REG_BR_PROB_BASE * 10 / 100);
     }
-  emit_move_insn (iter, const0_rtx);
+  if (!reuse_iter)
+    emit_move_insn (iter, const0_rtx);
 
   emit_label (top_label);
 
   tmp = convert_modes (Pmode, iter_mode, iter, true);
   x_addr = gen_rtx_PLUS (Pmode, destptr, tmp);
-  destmem = change_address (destmem, mode, x_addr);
+  destmem =
+    adjust_automodify_address_1 (copy_rtx (destmem), mode, x_addr, 0, 1);
 
   if (srcmem)
     {
       y_addr = gen_rtx_PLUS (Pmode, srcptr, copy_rtx (tmp));
-      srcmem = change_address (srcmem, mode, y_addr);
+      srcmem =
+	adjust_automodify_address_1 (copy_rtx (srcmem), mode, y_addr, 0, 1);
 
       /* When unrolling for chips that reorder memory reads and writes,
 	 we can save registers by using single temporary.
@@ -20797,19 +20925,43 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
     }
   else
     predict_jump (REG_BR_PROB_BASE * 80 / 100);
-  iter = ix86_zero_extend_to_Pmode (iter);
-  tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
-			     true, OPTAB_LIB_WIDEN);
-  if (tmp != destptr)
-    emit_move_insn (destptr, tmp);
-  if (srcptr)
+  if (change_ptrs)
     {
-      tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
+      iter = ix86_zero_extend_to_Pmode (iter);
+      tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
 				 true, OPTAB_LIB_WIDEN);
-      if (tmp != srcptr)
-	emit_move_insn (srcptr, tmp);
+      if (tmp != destptr)
+	emit_move_insn (destptr, tmp);
+      if (srcptr)
+	{
+	  tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
+				     true, OPTAB_LIB_WIDEN);
+	  if (tmp != srcptr)
+	    emit_move_insn (srcptr, tmp);
+	}
     }
   emit_label (out_label);
+  return iter;
+}
+
+/* When SRCPTR is non-NULL, output simple loop to move memory
+   pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
+   overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
+   equivalent loop to set memory by VALUE (supposed to be in MODE).
+
+   The size is rounded down to whole number of chunk size moved at once.
+   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
+
+static void
+expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
+			       rtx destptr, rtx srcptr, rtx value,
+			       rtx count, enum machine_mode mode, int unroll,
+			       int expected_size)
+{
+  expand_set_or_movmem_via_loop_with_iter (destmem, srcmem,
+				 destptr, srcptr, value,
+				 count, NULL_RTX, mode, unroll,
+				 expected_size, true);
 }
 
 /* Output "rep; mov" instruction.
@@ -20913,7 +21065,27 @@ emit_strmov (rtx destmem, rtx srcmem,
   emit_insn (gen_strmov (destptr, dest, srcptr, src));
 }
 
-/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST.  */
+/* Emit strset instruction.  If RHS is constant and a vector mode will be used,
+   then move this constant to a vector register before emitting strset.  */
+static void
+emit_strset (rtx destmem, rtx value,
+	     rtx destptr, enum machine_mode mode, int offset)
+{
+  rtx dest = adjust_automodify_address_nv (destmem, mode, destptr, offset);
+  rtx vec_reg;
+  if (vector_extensions_used_for_mode (mode) && CONSTANT_P (value))
+    {
+      if (mode == DImode)
+	mode = TARGET_64BIT ? V2DImode : V4SImode;
+      vec_reg = gen_reg_rtx (mode);
+      emit_move_insn (vec_reg, value);
+      emit_insn (gen_strset (destptr, dest, vec_reg));
+    }
+  else
+    emit_insn (gen_strset (destptr, dest, value));
+}
+
+/* Output code to copy (count % max_size) bytes from SRC to DEST.  */
 static void
 expand_movmem_epilogue (rtx destmem, rtx srcmem,
 			rtx destptr, rtx srcptr, rtx count, int max_size)
@@ -20924,43 +21096,55 @@ expand_movmem_epilogue (rtx destmem, rtx srcmem,
       HOST_WIDE_INT countval = INTVAL (count);
       int offset = 0;
 
-      if ((countval & 0x10) && max_size > 16)
+      int remainder_size = countval % max_size;
+      enum machine_mode move_mode = Pmode;
+
+      /* Firstly, try to move data with the widest possible mode.
+	 Remaining part we'll move using Pmode and narrower modes.  */
+      if (TARGET_SSE)
 	{
-	  if (TARGET_64BIT)
-	    {
-	      emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
-	      emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset + 8);
-	    }
-	  else
-	    gcc_unreachable ();
-	  offset += 16;
+	  if (max_size >= GET_MODE_SIZE (V4SImode))
+	    move_mode = V4SImode;
+	  else if (max_size >= GET_MODE_SIZE (DImode))
+	    move_mode = DImode;
 	}
-      if ((countval & 0x08) && max_size > 8)
+
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-	  if (TARGET_64BIT)
-	    emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
-	  else
-	    {
-	      emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
-	      emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset + 4);
-	    }
-	  offset += 8;
+	  emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
 	}
-      if ((countval & 0x04) && max_size > 4)
+
+      /* Move the remaining part of the epilogue - its size might be
+	 up to the size of the widest mode.  */
+      move_mode = Pmode;
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
+	}
+
+      if (remainder_size >= 4)
+	{
+	  emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
 	  offset += 4;
+	  remainder_size -= 4;
 	}
-      if ((countval & 0x02) && max_size > 2)
+      if (remainder_size >= 2)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
 	  offset += 2;
+	  remainder_size -= 2;
 	}
-      if ((countval & 0x01) && max_size > 1)
+      if (remainder_size >= 1)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
 	  offset += 1;
+	  remainder_size -= 1;
 	}
+      gcc_assert (remainder_size == 0);
       return;
     }
   if (max_size > 8)
@@ -21066,87 +21250,122 @@ expand_setmem_epilogue_via_loop (rtx destmem, rtx destptr, rtx value,
 				 1, max_size / 2);
 }
 
-/* Output code to set at most count & (max_size - 1) bytes starting by DEST.  */
+/* Output code to set at most count & (max_size - 1) bytes starting by
+   DESTMEM.  */
 static void
-expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_size)
+expand_setmem_epilogue (rtx destmem, rtx destptr, rtx promoted_to_vector_value,
+			rtx value, rtx count, int max_size)
 {
-  rtx dest;
-
   if (CONST_INT_P (count))
     {
       HOST_WIDE_INT countval = INTVAL (count);
       int offset = 0;
 
-      if ((countval & 0x10) && max_size > 16)
+      int remainder_size = countval % max_size;
+      enum machine_mode move_mode = Pmode;
+      enum machine_mode sse_mode = TARGET_64BIT ? V2DImode : V4SImode;
+      rtx promoted_value = NULL_RTX;
+
+      /* First, try to store data with the widest possible mode.  The
+	 remaining part will be stored using Pmode and narrower modes.  */
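+      /* For example, with SSE on a 64-bit target a remainder of 21 bytes
+	 would be filled as 16 (V2DImode) + 4 (SImode) + 1 (QImode) bytes.  */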
+      if (TARGET_SSE)
 	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset + 8);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    gcc_unreachable ();
-	  offset += 16;
+	  if (max_size >= GET_MODE_SIZE (sse_mode))
+	    move_mode = sse_mode;
+	  else if (max_size >= GET_MODE_SIZE (DImode))
+	    move_mode = DImode;
+	  if (!VECTOR_MODE_P (GET_MODE (promoted_to_vector_value)))
+	    promoted_to_vector_value = NULL_RTX;
 	}
-      if ((countval & 0x08) && max_size > 8)
+
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    {
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset + 4);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  offset += 8;
+	  if (GET_MODE (destmem) != move_mode)
+	    destmem = change_address (destmem, move_mode, destptr);
+	  if (!promoted_to_vector_value)
+	    promoted_to_vector_value =
+	      targetm.promote_rtx_for_memset (move_mode, value);
+	  emit_strset (destmem, promoted_to_vector_value, destptr,
+		       move_mode, offset);
+
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
+	}
+
+      /* Move the remaining part of the epilogue; its size may be almost
+	 as large as the widest mode used above.  */
+      move_mode = Pmode;
+      promoted_value = NULL_RTX;
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
+	{
+	  if (!promoted_value)
+	    promoted_value = promote_duplicated_reg (move_mode, value);
+	  emit_strset (destmem, promoted_value, destptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
 	}
-      if ((countval & 0x04) && max_size > 4)
+
+      if (!promoted_value)
+	promoted_value = promote_duplicated_reg (move_mode, value);
+      if (remainder_size >= 4)
 	{
-	  dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
+	  emit_strset (destmem, gen_lowpart (SImode, promoted_value), destptr,
+		       SImode, offset);
 	  offset += 4;
+	  remainder_size -= 4;
 	}
-      if ((countval & 0x02) && max_size > 2)
+      if (remainder_size >= 2)
 	{
-	  dest = adjust_automodify_address_nv (destmem, HImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
-	  offset += 2;
+	  emit_strset (destmem, gen_lowpart (HImode, promoted_value), destptr,
+		       HImode, offset);
+	  offset += 2;
+	  remainder_size -= 2;
 	}
-      if ((countval & 0x01) && max_size > 1)
+      if (remainder_size >= 1)
 	{
-	  dest = adjust_automodify_address_nv (destmem, QImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
+	  emit_strset (destmem, gen_lowpart (QImode, promoted_value), destptr,
+		       QImode, offset);
 	  offset += 1;
+	  remainder_size -= 1;
 	}
+      gcc_assert (remainder_size == 0);
       return;
     }
+
+  /* COUNT isn't constant.  */
   if (max_size > 32)
     {
-      expand_setmem_epilogue_via_loop (destmem, destptr, value, count, max_size);
+      expand_setmem_epilogue_via_loop (destmem, destptr, value, count,
+				       max_size);
       return;
     }
+  /* If it turned out that the value was promoted to a non-vector register,
+     we can reuse it.  */
+  if (!VECTOR_MODE_P (GET_MODE (promoted_to_vector_value)))
+    value = promoted_to_vector_value;
+
   if (max_size > 16)
     {
       rtx label = ix86_expand_aligntest (count, 16, true);
       if (TARGET_64BIT)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
@@ -21156,14 +21375,17 @@ expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_
       rtx label = ix86_expand_aligntest (count, 8, true);
       if (TARGET_64BIT)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
@@ -21171,24 +21393,24 @@ expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_
   if (max_size > 4)
     {
       rtx label = ix86_expand_aligntest (count, 4, true);
-      dest = change_address (destmem, SImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
+      destmem = change_address (destmem, SImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
   if (max_size > 2)
     {
       rtx label = ix86_expand_aligntest (count, 2, true);
-      dest = change_address (destmem, HImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
+      destmem = change_address (destmem, HImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (HImode, value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
   if (max_size > 1)
     {
       rtx label = ix86_expand_aligntest (count, 1, true);
-      dest = change_address (destmem, QImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
+      destmem = change_address (destmem, QImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (QImode, value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
@@ -21204,8 +21426,8 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
   if (align <= 1 && desired_alignment > 1)
     {
       rtx label = ix86_expand_aligntest (destptr, 1, false);
-      srcmem = change_address (srcmem, QImode, srcptr);
-      destmem = change_address (destmem, QImode, destptr);
+      srcmem = adjust_automodify_address_1 (srcmem, QImode, srcptr, 0, 1);
+      destmem = adjust_automodify_address_1 (destmem, QImode, destptr, 0, 1);
       emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
       ix86_adjust_counter (count, 1);
       emit_label (label);
@@ -21214,8 +21436,8 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
   if (align <= 2 && desired_alignment > 2)
     {
       rtx label = ix86_expand_aligntest (destptr, 2, false);
-      srcmem = change_address (srcmem, HImode, srcptr);
-      destmem = change_address (destmem, HImode, destptr);
+      srcmem = adjust_automodify_address_1 (srcmem, HImode, srcptr, 0, 1);
+      destmem = adjust_automodify_address_1 (destmem, HImode, destptr, 0, 1);
       emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
       ix86_adjust_counter (count, 2);
       emit_label (label);
@@ -21224,14 +21446,34 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
   if (align <= 4 && desired_alignment > 4)
     {
       rtx label = ix86_expand_aligntest (destptr, 4, false);
-      srcmem = change_address (srcmem, SImode, srcptr);
-      destmem = change_address (destmem, SImode, destptr);
+      srcmem = adjust_automodify_address_1 (srcmem, SImode, srcptr, 0, 1);
+      destmem = adjust_automodify_address_1 (destmem, SImode, destptr, 0, 1);
       emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
       ix86_adjust_counter (count, 4);
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
-  gcc_assert (desired_alignment <= 8);
+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  srcmem = adjust_automodify_address_1 (srcmem, DImode, srcptr, 0, 1);
+	  destmem = adjust_automodify_address_1 (destmem, DImode, destptr, 0, 1);
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	}
+      else
+	{
+	  srcmem = adjust_automodify_address_1 (srcmem, SImode, srcptr, 0, 1);
+	  destmem = adjust_automodify_address_1 (destmem, SImode, destptr, 0, 1);
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	}
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);
 }
 
 /* Copy enough from DST to SRC to align DST known to DESIRED_ALIGN.
@@ -21286,6 +21528,37 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
       off = 4;
       emit_insn (gen_strmov (destreg, dst, srcreg, src));
     }
+  if (align_bytes & 8)
+    {
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, DImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      else
+	{
+	  dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, SImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
+	set_mem_align (dst, 8 * BITS_PER_UNIT);
+      if (src_align_bytes >= 0)
+	{
+	  unsigned int src_align = 0;
+	  if ((src_align_bytes & 7) == (align_bytes & 7))
+	    src_align = 8;
+	  else if ((src_align_bytes & 3) == (align_bytes & 3))
+	    src_align = 4;
+	  else if ((src_align_bytes & 1) == (align_bytes & 1))
+	    src_align = 2;
+	  if (MEM_ALIGN (src) < src_align * BITS_PER_UNIT)
+	    set_mem_align (src, src_align * BITS_PER_UNIT);
+	}
+      off = 8;
+    }
   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
   src = adjust_automodify_address_nv (src, BLKmode, srcreg, off);
   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
@@ -21293,7 +21566,9 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
   if (src_align_bytes >= 0)
     {
       unsigned int src_align = 0;
-      if ((src_align_bytes & 7) == (align_bytes & 7))
+      if ((src_align_bytes & 15) == (align_bytes & 15))
+	src_align = 16;
+      else if ((src_align_bytes & 7) == (align_bytes & 7))
 	src_align = 8;
       else if ((src_align_bytes & 3) == (align_bytes & 3))
 	src_align = 4;
@@ -21321,7 +21596,7 @@ expand_setmem_prologue (rtx destmem, rtx destptr, rtx value, rtx count,
   if (align <= 1 && desired_alignment > 1)
     {
       rtx label = ix86_expand_aligntest (destptr, 1, false);
-      destmem = change_address (destmem, QImode, destptr);
+      destmem = adjust_automodify_address_1 (destmem, QImode, destptr, 0, 1);
       emit_insn (gen_strset (destptr, destmem, gen_lowpart (QImode, value)));
       ix86_adjust_counter (count, 1);
       emit_label (label);
@@ -21330,7 +21605,7 @@ expand_setmem_prologue (rtx destmem, rtx destptr, rtx value, rtx count,
   if (align <= 2 && desired_alignment > 2)
     {
       rtx label = ix86_expand_aligntest (destptr, 2, false);
-      destmem = change_address (destmem, HImode, destptr);
+      destmem = adjust_automodify_address_1 (destmem, HImode, destptr, 0, 1);
       emit_insn (gen_strset (destptr, destmem, gen_lowpart (HImode, value)));
       ix86_adjust_counter (count, 2);
       emit_label (label);
@@ -21339,13 +21614,23 @@ expand_setmem_prologue (rtx destmem, rtx destptr, rtx value, rtx count,
   if (align <= 4 && desired_alignment > 4)
     {
       rtx label = ix86_expand_aligntest (destptr, 4, false);
-      destmem = change_address (destmem, SImode, destptr);
+      destmem = adjust_automodify_address_1 (destmem, SImode, destptr, 0, 1);
       emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
       ix86_adjust_counter (count, 4);
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
-  gcc_assert (desired_alignment <= 8);
+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      destmem = adjust_automodify_address_1 (destmem, SImode, destptr, 0, 1);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);
 }
 
 /* Set enough from DST to align DST known to by aligned by ALIGN to
@@ -21381,6 +21666,19 @@ expand_constant_setmem_prologue (rtx dst, rtx destreg, rtx value,
       emit_insn (gen_strset (destreg, dst,
 			     gen_lowpart (SImode, value)));
     }
+  if (align_bytes & 8)
+    {
+      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+      emit_insn (gen_strset (destreg, dst,
+	    gen_lowpart (SImode, value)));
+      off = 4;
+      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+      emit_insn (gen_strset (destreg, dst,
+	    gen_lowpart (SImode, value)));
+      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
+	set_mem_align (dst, 8 * BITS_PER_UNIT);
+      off = 4;
+    }
   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
     set_mem_align (dst, desired_align * BITS_PER_UNIT);
@@ -21392,7 +21690,7 @@ expand_constant_setmem_prologue (rtx dst, rtx destreg, rtx value,
 /* Given COUNT and EXPECTED_SIZE, decide on codegen of string operation.  */
 static enum stringop_alg
 decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
-	    int *dynamic_check)
+	    int *dynamic_check, bool align_unknown)
 {
   const struct stringop_algs * algs;
   bool optimize_for_speed;
@@ -21401,7 +21699,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
      consider such algorithms if the user has appropriated those
      registers for their own purposes.	*/
   bool rep_prefix_usable = !(fixed_regs[CX_REG] || fixed_regs[DI_REG]
-                             || (memset
+			     || (memset
 				 ? fixed_regs[AX_REG] : fixed_regs[SI_REG]));
 
 #define ALG_USABLE_P(alg) (rep_prefix_usable			\
@@ -21414,7 +21712,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
      of time processing large blocks.  */
   if (optimize_function_for_size_p (cfun)
       || (optimize_insn_for_size_p ()
-          && expected_size != -1 && expected_size < 256))
+	  && expected_size != -1 && expected_size < 256))
     optimize_for_speed = false;
   else
     optimize_for_speed = true;
@@ -21423,9 +21721,9 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
 
   *dynamic_check = -1;
   if (memset)
-    algs = &cost->memset[TARGET_64BIT != 0];
+    algs = &cost->memset[align_unknown][TARGET_64BIT != 0];
   else
-    algs = &cost->memcpy[TARGET_64BIT != 0];
+    algs = &cost->memcpy[align_unknown][TARGET_64BIT != 0];
   if (ix86_stringop_alg != no_stringop && ALG_USABLE_P (ix86_stringop_alg))
     return ix86_stringop_alg;
   /* rep; movq or rep; movl is the smallest variant.  */
@@ -21489,29 +21787,33 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
       enum stringop_alg alg;
       int i;
       bool any_alg_usable_p = true;
+      bool only_libcall_fits = true;
 
       for (i = 0; i < MAX_STRINGOP_ALGS; i++)
-        {
-          enum stringop_alg candidate = algs->size[i].alg;
-          any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
+	{
+	  enum stringop_alg candidate = algs->size[i].alg;
+	  any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
 
-          if (candidate != libcall && candidate
-              && ALG_USABLE_P (candidate))
-              max = algs->size[i].max;
-        }
+	  if (candidate != libcall && candidate
+	      && ALG_USABLE_P (candidate))
+	    {
+	      max = algs->size[i].max;
+	      only_libcall_fits = false;
+	    }
+	}
       /* If there aren't any usable algorithms, then recursing on
-         smaller sizes isn't going to find anything.  Just return the
-         simple byte-at-a-time copy loop.  */
-      if (!any_alg_usable_p)
-        {
-          /* Pick something reasonable.  */
-          if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
-            *dynamic_check = 128;
-          return loop_1_byte;
-        }
+	 smaller sizes isn't going to find anything.  Just return the
+	 simple byte-at-a-time copy loop.  */
+      if (!any_alg_usable_p || only_libcall_fits)
+	{
+	  /* Pick something reasonable.  */
+	  if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
+	    *dynamic_check = 128;
+	  return loop_1_byte;
+	}
       if (max == -1)
 	max = 4096;
-      alg = decide_alg (count, max / 2, memset, dynamic_check);
+      alg = decide_alg (count, max / 2, memset, dynamic_check, align_unknown);
       gcc_assert (*dynamic_check == -1);
       gcc_assert (alg != libcall);
       if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
@@ -21535,9 +21837,11 @@ decide_alignment (int align,
       case no_stringop:
 	gcc_unreachable ();
       case loop:
-      case unrolled_loop:
 	desired_align = GET_MODE_SIZE (Pmode);
 	break;
+      case unrolled_loop:
+	desired_align = GET_MODE_SIZE (TARGET_SSE ? V4SImode : Pmode);
+	break;
       case rep_prefix_8_byte:
 	desired_align = 8;
 	break;
@@ -21625,6 +21929,11 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
   enum stringop_alg alg;
   int dynamic_check;
   bool need_zero_guard = false;
+  bool align_unknown;
+  int unroll_factor;
+  enum machine_mode move_mode;
+  rtx loop_iter = NULL_RTX;
+  int dst_offset, src_offset;
 
   if (CONST_INT_P (align_exp))
     align = INTVAL (align_exp);
@@ -21648,9 +21957,17 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 
   /* Step 0: Decide on preferred algorithm, desired alignment and
      size of chunks to be copied by main loop.  */
-
-  alg = decide_alg (count, expected_size, false, &dynamic_check);
+  dst_offset = get_mem_align_offset (dst, MOVE_MAX * BITS_PER_UNIT);
+  src_offset = get_mem_align_offset (src, MOVE_MAX * BITS_PER_UNIT);
+  align_unknown = (dst_offset < 0
+		   || src_offset < 0
+		   || src_offset != dst_offset);
+  alg = decide_alg (count, expected_size, false, &dynamic_check, align_unknown);
   desired_align = decide_alignment (align, alg, expected_size);
+  if (align_unknown)
+    desired_align = align;
+  unroll_factor = 1;
+  move_mode = Pmode;
 
   if (!TARGET_ALIGN_STRINGOPS)
     align = desired_align;
@@ -21669,11 +21986,16 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
       gcc_unreachable ();
     case loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode);
+      move_mode = Pmode;
+      unroll_factor = 1;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * (TARGET_64BIT ? 4 : 2);
+      /* Use SSE instructions if possible.  */
+      move_mode = TARGET_SSE ? (align_unknown ? DImode : V4SImode) : Pmode;
+      unroll_factor = 4;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
       size_needed = 8;
@@ -21785,6 +22107,8 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 	  dst = change_address (dst, BLKmode, destreg);
 	  expand_movmem_prologue (dst, src, destreg, srcreg, count_exp, align,
 				  desired_align);
+	  set_mem_align (src, desired_align * BITS_PER_UNIT);
+	  set_mem_align (dst, desired_align * BITS_PER_UNIT);
 	}
       else
 	{
@@ -21842,11 +22166,14 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 				     count_exp, Pmode, 1, expected_size);
       break;
     case unrolled_loop:
-      /* Unroll only by factor of 2 in 32bit mode, since we don't have enough
-	 registers for 4 temporaries anyway.  */
-      expand_set_or_movmem_via_loop (dst, src, destreg, srcreg, NULL,
-				     count_exp, Pmode, TARGET_64BIT ? 4 : 2,
-				     expected_size);
+      /* In some cases we want to use the same iterator in several adjacent
+	 loops, so we save the loop iterator rtx here and do not update the
+	 addresses.  */
+      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
+							   srcreg, NULL,
+							   count_exp, NULL_RTX,
+							   move_mode,
+							   unroll_factor,
+							   expected_size, false);
       break;
     case rep_prefix_8_byte:
       expand_movmem_via_rep_mov (dst, src, destreg, srcreg, count_exp,
@@ -21897,9 +22224,43 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
       LABEL_NUSES (label) = 1;
     }
 
+  /* We haven't updated the addresses, so do that now.
+     Also, if the epilogue seems to be big, generate a (non-unrolled) loop
+     in it.  We do that only if the alignment is unknown, because in that
+     case the epilogue would have to copy the data byte by byte, which is
+     very slow.  */
+  if (alg == unrolled_loop)
+    {
+      rtx tmp;
+      if (align_unknown && unroll_factor > 1)
+	{
+	  /* Reduce the epilogue's size by emitting a non-unrolled loop.  If
+	     we don't do this, the epilogue can become very big - when the
+	     alignment is statically unknown the epilogue copies byte by byte,
+	     which can be very slow.  */
+	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
+	      srcreg, NULL, count_exp,
+	      loop_iter, move_mode, 1,
+	      expected_size, false);
+	  src = change_address (src, BLKmode, srcreg);
+	  dst = change_address (dst, BLKmode, destreg);
+	  epilogue_size_needed = GET_MODE_SIZE (move_mode);
+	}
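+      /* The main loop did not advance DESTREG and SRCREG, so add the final
+	 iterator value to them here before expanding the epilogue.  */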
+      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != destreg)
+	emit_move_insn (destreg, tmp);
+
+      tmp = expand_simple_binop (Pmode, PLUS, srcreg, loop_iter, srcreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != srcreg)
+	emit_move_insn (srcreg, tmp);
+    }
   if (count_exp != const0_rtx && epilogue_size_needed > 1)
-    expand_movmem_epilogue (dst, src, destreg, srcreg, count_exp,
-			    epilogue_size_needed);
+    {
+      expand_movmem_epilogue (dst, src, destreg, srcreg, count_exp,
+			      epilogue_size_needed);
+    }
+
   if (jump_around_label)
     emit_label (jump_around_label);
   return true;
@@ -21917,7 +22278,37 @@ promote_duplicated_reg (enum machine_mode mode, rtx val)
   rtx tmp;
   int nops = mode == DImode ? 3 : 2;
 
+  if (VECTOR_MODE_P (mode))
+    {
+      enum machine_mode inner = GET_MODE_INNER (mode);
+      rtx promoted_val, vec_reg;
+      if (CONST_INT_P (val))
+	return ix86_build_const_vector (mode, true, val);
+
+      promoted_val = promote_duplicated_reg (inner, val);
+      vec_reg = gen_reg_rtx (mode);
+      switch (mode)
+	{
+	case V2DImode:
+	  emit_insn (gen_vec_dupv2di (vec_reg, promoted_val));
+	  break;
+	case V4SImode:
+	  emit_insn (gen_vec_dupv4si (vec_reg, promoted_val));
+	  break;
+	default:
+	  gcc_unreachable ();
+	  break;
+	}
+
+      return vec_reg;
+    }
   gcc_assert (mode == SImode || mode == DImode);
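+  /* On 32-bit targets a DImode value does not fit into a general-purpose
+     register, so duplicate it into a V4SImode register and convert the
+     result to V2DImode.  */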
+  if (mode == DImode && !TARGET_64BIT)
+    {
+      rtx vec_reg = promote_duplicated_reg (V4SImode, val);
+      vec_reg = convert_to_mode (V2DImode, vec_reg, 1);
+      return vec_reg;
+    }
   if (val == const0_rtx)
     return copy_to_mode_reg (mode, const0_rtx);
   if (CONST_INT_P (val))
@@ -21983,11 +22374,21 @@ promote_duplicated_reg (enum machine_mode mode, rtx val)
 static rtx
 promote_duplicated_reg_to_size (rtx val, int size_needed, int desired_align, int align)
 {
-  rtx promoted_val;
+  rtx promoted_val = NULL_RTX;
 
-  if (TARGET_64BIT
-      && (size_needed > 4 || (desired_align > align && desired_align > 4)))
-    promoted_val = promote_duplicated_reg (DImode, val);
+  if (size_needed > 8 || (desired_align > align && desired_align > 8))
+    {
+      gcc_assert (TARGET_SSE);
+      if (TARGET_64BIT)
+	promoted_val = promote_duplicated_reg (V2DImode, val);
+      else
+	promoted_val = promote_duplicated_reg (V4SImode, val);
+    }
+  else if (size_needed > 4 || (desired_align > align && desired_align > 4))
+    {
+      gcc_assert (TARGET_64BIT || TARGET_SSE);
+      promoted_val = promote_duplicated_reg (DImode, val);
+    }
   else if (size_needed > 2 || (desired_align > align && desired_align > 2))
     promoted_val = promote_duplicated_reg (SImode, val);
   else if (size_needed > 1 || (desired_align > align && desired_align > 1))
@@ -22013,12 +22414,17 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
   unsigned HOST_WIDE_INT count = 0;
   HOST_WIDE_INT expected_size = -1;
   int size_needed = 0, epilogue_size_needed;
+  int promote_size_needed = 0;
   int desired_align = 0, align_bytes = 0;
   enum stringop_alg alg;
   rtx promoted_val = NULL;
-  bool force_loopy_epilogue = false;
+  rtx vec_promoted_val = NULL;
   int dynamic_check;
   bool need_zero_guard = false;
+  bool align_unknown;
+  unsigned int unroll_factor;
+  enum machine_mode move_mode;
+  rtx loop_iter = NULL_RTX;
 
   if (CONST_INT_P (align_exp))
     align = INTVAL (align_exp);
@@ -22038,8 +22444,11 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
   /* Step 0: Decide on preferred algorithm, desired alignment and
      size of chunks to be copied by main loop.  */
 
-  alg = decide_alg (count, expected_size, true, &dynamic_check);
+  align_unknown = get_mem_align_offset (dst, BITS_PER_UNIT) < 0;
+  alg = decide_alg (count, expected_size, true, &dynamic_check, align_unknown);
   desired_align = decide_alignment (align, alg, expected_size);
+  unroll_factor = 1;
+  move_mode = Pmode;
 
   if (!TARGET_ALIGN_STRINGOPS)
     align = desired_align;
@@ -22057,11 +22466,21 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       gcc_unreachable ();
     case loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode);
+      move_mode = Pmode;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * 4;
+      /* Use SSE instructions if possible.  */
+      move_mode = TARGET_SSE
+		  ? (TARGET_64BIT ? V2DImode : V4SImode)
+		  : Pmode;
+      unroll_factor = 1;
+      /* Select the maximal available unroll factor: 1, 2 or 4.  */
+      while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
+	     && unroll_factor < 4)
+	unroll_factor *= 2;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
       size_needed = 8;
@@ -22078,6 +22497,7 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       break;
     }
   epilogue_size_needed = size_needed;
+  promote_size_needed = GET_MODE_SIZE (Pmode);
 
   /* Step 1: Prologue guard.  */
 
@@ -22106,8 +22526,10 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
      main loop and epilogue (ie one load of the big constant in the
      front of all code.  */
   if (CONST_INT_P (val_exp))
-    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
-						   desired_align, align);
+    promoted_val = promote_duplicated_reg_to_size (val_exp,
+						   promote_size_needed,
+						   promote_size_needed,
+						   align);
   /* Ensure that alignment prologue won't copy past end of block.  */
   if (size_needed > 1 || (desired_align > 1 && desired_align > align))
     {
@@ -22116,12 +22538,6 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 	 Make sure it is power of 2.  */
       epilogue_size_needed = smallest_pow2_greater_than (epilogue_size_needed);
 
-      /* To improve performance of small blocks, we jump around the VAL
-	 promoting mode.  This mean that if the promoted VAL is not constant,
-	 we might not use it in the epilogue and have to use byte
-	 loop variant.  */
-      if (epilogue_size_needed > 2 && !promoted_val)
-        force_loopy_epilogue = true;
       if (count)
 	{
 	  if (count < (unsigned HOST_WIDE_INT)epilogue_size_needed)
@@ -22162,8 +22578,10 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 
   /* Do the expensive promotion once we branched off the small blocks.  */
   if (!promoted_val)
-    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
-						   desired_align, align);
+    promoted_val = promote_duplicated_reg_to_size (val_exp,
+						   promote_size_needed,
+						   promote_size_needed,
+						   align);
   gcc_assert (desired_align >= 1 && align >= 1);
 
   if (desired_align > align)
@@ -22177,6 +22595,7 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 	  dst = change_address (dst, BLKmode, destreg);
 	  expand_setmem_prologue (dst, destreg, promoted_val, count_exp, align,
 				  desired_align);
+	  set_mem_align (dst, desired_align * BITS_PER_UNIT);
 	}
       else
 	{
@@ -22186,6 +22605,8 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 						 desired_align, align_bytes);
 	  count_exp = plus_constant (count_exp, -align_bytes);
 	  count -= align_bytes;
+	  if (count < (unsigned HOST_WIDE_INT) size_needed)
+	    goto epilogue;
 	}
       if (need_zero_guard
 	  && (count < (unsigned HOST_WIDE_INT) size_needed
@@ -22227,7 +22648,7 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
     case no_stringop:
       gcc_unreachable ();
     case loop_1_byte:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
+      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, val_exp,
 				     count_exp, QImode, 1, expected_size);
       break;
     case loop:
@@ -22235,8 +22656,14 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 				     count_exp, Pmode, 1, expected_size);
       break;
     case unrolled_loop:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
-				     count_exp, Pmode, 4, expected_size);
+      vec_promoted_val =
+	promote_duplicated_reg_to_size (promoted_val,
+					GET_MODE_SIZE (move_mode),
+					desired_align, align);
+      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+				     NULL, vec_promoted_val, count_exp,
+				     NULL_RTX, move_mode, unroll_factor,
+				     expected_size, false);
       break;
     case rep_prefix_8_byte:
       expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
@@ -22280,15 +22707,29 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       LABEL_NUSES (label) = 1;
     }
  epilogue:
-  if (count_exp != const0_rtx && epilogue_size_needed > 1)
+  if (alg == unrolled_loop)
     {
-      if (force_loopy_epilogue)
-	expand_setmem_epilogue_via_loop (dst, destreg, val_exp, count_exp,
-					 epilogue_size_needed);
-      else
-	expand_setmem_epilogue (dst, destreg, promoted_val, count_exp,
-				epilogue_size_needed);
+      rtx tmp;
+      if (align_unknown && unroll_factor > 1)
+	{
+	  /* Reduce the epilogue's size by emitting a non-unrolled loop.  If
+	     we don't do this, the epilogue can become very big - when the
+	     alignment is statically unknown the epilogue stores byte by byte,
+	     which can be very slow.  */
+	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+	      NULL, vec_promoted_val, count_exp,
+	      loop_iter, move_mode, 1,
+	      expected_size, false);
+	  dst = change_address (dst, BLKmode, destreg);
+	  epilogue_size_needed = GET_MODE_SIZE (move_mode);
+	}
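+      /* The main loop did not advance DESTREG, so add the final iterator
+	 value to it here before expanding the epilogue.  */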
+      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != destreg)
+	emit_move_insn (destreg, tmp);
     }
+  if (count_exp != const0_rtx && epilogue_size_needed > 1)
+    expand_setmem_epilogue (dst, destreg, promoted_val, val_exp, count_exp,
+			    epilogue_size_needed);
   if (jump_around_label)
     emit_label (jump_around_label);
   return true;
@@ -37436,6 +37877,92 @@ ix86_autovectorize_vector_sizes (void)
   return (TARGET_AVX && !TARGET_PREFER_AVX128) ? 32 | 16 : 0;
 }
 
+/* Target hook.  Prevent unaligned access to data in vector modes.  */
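+/* Note: for 32-byte AVX modes this reports a slow access only when the
+   target splits unaligned 256-bit loads or stores; for other modes wider
+   than 8 bytes it reports a slow access only when neither unaligned SSE
+   loads nor unaligned SSE stores are optimal on the target.  */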
+
+static bool
+ix86_slow_unaligned_access (enum machine_mode mode,
+			    unsigned int align)
+{
+  if (TARGET_AVX)
+    {
+      if (GET_MODE_SIZE (mode) == 32)
+	{
+	  if (align <= 16)
+	    return (TARGET_AVX256_SPLIT_UNALIGNED_LOAD
+		    || TARGET_AVX256_SPLIT_UNALIGNED_STORE);
+	  else
+	    return false;
+	}
+    }
+
+  if (GET_MODE_SIZE (mode) > 8)
+    {
+      return (!TARGET_SSE_UNALIGNED_LOAD_OPTIMAL
+	      && !TARGET_SSE_UNALIGNED_STORE_OPTIMAL);
+    }
+
+  return false;
+}
+
+/* Target hook.  Returns an rtx of mode MODE with promoted value VAL, which
+   is supposed to represent one byte.  MODE could be a vector mode.
+   Examples:
+   1) If VAL = const_int (0xAB) and MODE = SImode,
+   the result is const_int (0xABABABAB).
+   2) If VAL isn't constant, the result is the result of multiplying VAL
+   by const_int (0x01010101) (for SImode).  */
+
+static rtx
+ix86_promote_rtx_for_memset (enum machine_mode mode  ATTRIBUTE_UNUSED,
+			      rtx val)
+{
+  enum machine_mode val_mode = GET_MODE (val);
+  gcc_assert (VALID_INT_MODE_P (val_mode) || val_mode == VOIDmode);
+
+  if (vector_extensions_used_for_mode (mode) && TARGET_SSE)
+    {
+      rtx promoted_val, vec_reg;
+      enum machine_mode vec_mode = VOIDmode;
+      if (TARGET_AVX2)
+	vec_mode = TARGET_64BIT ? V4DImode : V8SImode;
+      else
+	vec_mode = TARGET_64BIT ? V2DImode : V4SImode;
+      gcc_assert (vec_mode != VOIDmode);
+      if (CONST_INT_P (val))
+	{
+	  rtx const_vec;
+	  HOST_WIDE_INT int_val = (UINTVAL (val) & 0xFF)
+				   * (TARGET_64BIT
+				      ? 0x0101010101010101
+				      : 0x01010101);
+	  val = gen_int_mode (int_val, Pmode);
+	  vec_reg = gen_reg_rtx (vec_mode);
+	  const_vec = ix86_build_const_vector (vec_mode, true, val);
+	  if (mode != vec_mode)
+	    const_vec = convert_to_mode (vec_mode, const_vec, 1);
+	  emit_move_insn (vec_reg, const_vec);
+	  return vec_reg;
+	}
+      /* Else: val isn't const.  */
+      promoted_val = promote_duplicated_reg (Pmode, val);
+      vec_reg = gen_reg_rtx (vec_mode);
+      switch (vec_mode)
+	{
+	case V2DImode:
+	  emit_insn (gen_vec_dupv2di (vec_reg, promoted_val));
+	  break;
+	case V4SImode:
+	  emit_insn (gen_vec_dupv4si (vec_reg, promoted_val));
+	  break;
+	default:
+	  gcc_unreachable ();
+	  break;
+	}
+      return vec_reg;
+    }
+  return NULL_RTX;
+}
+
 /* Initialize the GCC target structure.  */
 #undef TARGET_RETURN_IN_MEMORY
 #define TARGET_RETURN_IN_MEMORY ix86_return_in_memory
@@ -37743,6 +38270,12 @@ ix86_autovectorize_vector_sizes (void)
 #undef TARGET_CONDITIONAL_REGISTER_USAGE
 #define TARGET_CONDITIONAL_REGISTER_USAGE ix86_conditional_register_usage
 
+#undef TARGET_SLOW_UNALIGNED_ACCESS
+#define TARGET_SLOW_UNALIGNED_ACCESS ix86_slow_unaligned_access
+
+#undef TARGET_PROMOTE_RTX_FOR_MEMSET
+#define TARGET_PROMOTE_RTX_FOR_MEMSET ix86_promote_rtx_for_memset
+
 #if TARGET_MACHO
 #undef TARGET_INIT_LIBFUNCS
 #define TARGET_INIT_LIBFUNCS darwin_rename_builtins
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index bd69ec2..550b2ab 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -159,8 +159,12 @@ struct processor_costs {
   const int fchs;		/* cost of FCHS instruction.  */
   const int fsqrt;		/* cost of FSQRT instruction.  */
 				/* Specify what algorithm
-				   to use for stringops on unknown size.  */
-  struct stringop_algs memcpy[2], memset[2];
+				   to use for stringops on unknown size.
+				   The first index specifies whether the
+				   alignment is known or not; the second
+				   specifies whether the target is 32-bit
+				   or 64-bit.  */
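+				/* For example, the algorithms for a 64-bit
+				   memcpy with unknown alignment are taken
+				   from memcpy[1][1].  */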
+  struct stringop_algs memcpy[2][2], memset[2][2];
   const int scalar_stmt_cost;   /* Cost of any scalar operation, excluding
 				   load and store.  */
   const int scalar_load_cost;   /* Cost of scalar load.  */
@@ -1712,7 +1716,7 @@ typedef struct ix86_args {
 /* If a clear memory operation would take CLEAR_RATIO or more simple
    move-instruction sequences, we will do a clrmem or libcall instead.  */
 
-#define CLEAR_RATIO(speed) ((speed) ? MIN (6, ix86_cost->move_ratio) : 2)
+#define CLEAR_RATIO(speed) ((speed) ? ix86_cost->move_ratio : 2)
 
 /* Define if shifts truncate the shift count which implies one can
    omit a sign-extension or zero-extension of a shift count.
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index ff77003..b8ecc59 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -7426,6 +7426,13 @@
    (set_attr "prefix" "maybe_vex,maybe_vex,orig,orig,vex")
    (set_attr "mode" "TI,TI,V4SF,SF,SF")])
 
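+;; Expander to make gen_sse2_loadq available on 32-bit targets: it loads a
+;; DImode value into the low half of an SSE register and zeroes the upper
+;; half.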
+(define_expand "sse2_loadq"
+ [(set (match_operand:V2DI 0 "register_operand")
+       (vec_concat:V2DI
+	 (match_operand:DI 1 "memory_operand")
+	 (const_int 0)))]
+  "!TARGET_64BIT && TARGET_SSE2")
+
 (define_insn_and_split "sse2_stored"
   [(set (match_operand:SI 0 "nonimmediate_operand" "=xm,r")
 	(vec_select:SI
@@ -7537,6 +7544,16 @@
    (set_attr "prefix" "maybe_vex,orig,vex,maybe_vex,orig,orig")
    (set_attr "mode" "V2SF,TI,TI,TI,V4SF,V2SF")])
 
+(define_expand "vec_dupv4si"
+  [(set (match_operand:V4SI 0 "register_operand" "")
+	(vec_duplicate:V4SI
+	  (match_operand:SI 1 "nonimmediate_operand" "")))]
+  "TARGET_SSE"
+{
+  if (!TARGET_AVX)
+    operands[1] = force_reg (V4SImode, operands[1]);
+})
+
 (define_insn "*vec_dupv4si_avx"
   [(set (match_operand:V4SI 0 "register_operand"     "=x,x")
 	(vec_duplicate:V4SI
@@ -7578,6 +7595,16 @@
    (set_attr "prefix" "orig,vex,maybe_vex")
    (set_attr "mode" "TI,TI,DF")])
 
+(define_expand "vec_dupv2di"
+  [(set (match_operand:V2DI 0 "register_operand" "")
+	(vec_duplicate:V2DI
+	  (match_operand:DI 1 "nonimmediate_operand" "")))]
+  "TARGET_SSE"
+{
+  if (!TARGET_AVX)
+    operands[1] = force_reg (V2DImode, operands[1]);
+})
+
 (define_insn "*vec_dupv2di"
   [(set (match_operand:V2DI 0 "register_operand" "=x,x")
 	(vec_duplicate:V2DI
diff --git a/gcc/cse.c b/gcc/cse.c
index ae67685..3b6471d 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -4616,7 +4616,10 @@ cse_insn (rtx insn)
 		 to fold switch statements when an ADDR_DIFF_VEC is used.  */
 	      || (GET_CODE (src_folded) == MINUS
 		  && GET_CODE (XEXP (src_folded, 0)) == LABEL_REF
-		  && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF)))
+		  && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF))
+	      /* Don't propagate vector constants, since no architecture
+		 currently supports vector immediates.  */
+	  && !vector_extensions_used_for_mode (mode))
 	src_const = src_folded, src_const_elt = elt;
       else if (src_const == 0 && src_eqv_here && CONSTANT_P (src_eqv_here))
 	src_const = src_eqv_here, src_const_elt = src_eqv_elt;
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 90cef1c..4b7d67b 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5780,6 +5780,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum machine_mode @var{mode}, unsigned int @var{align})
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options require building faster code, the hook can
+be used to prevent accesses to unaligned data in some set of modes even if
+the processor can perform the access without a trap.
+
+By default the hook returns the value of the @code{SLOW_UNALIGNED_ACCESS}
+macro if it is defined, and @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
+@deftypefn {Target Hook} rtx TARGET_PROMOTE_RTX_FOR_MEMSET (enum machine_mode @var{mode}, rtx @var{val})
+This hook returns an rtx of mode @var{mode} containing the promoted value
+@var{val}, or @code{NULL} on failure.
+The hook generates the instructions that are needed to promote @var{val} to
+mode @var{mode}.
+If those instructions cannot be generated, the hook returns @code{NULL}.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6252,23 +6278,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 187122e..c7e2457 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -5718,6 +5718,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@hook TARGET_SLOW_UNALIGNED_ACCESS
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options require building faster code, the hook can
+be used to prevent accesses to unaligned data in some set of modes even if
+the processor can perform the access without a trap.
+
+By default the hook returns the value of the @code{SLOW_UNALIGNED_ACCESS}
+macro if it is defined, and @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
+@hook TARGET_PROMOTE_RTX_FOR_MEMSET
+This hook returns an rtx of mode @var{mode} containing the promoted value
+@var{val}, or @code{NULL} on failure.
+The hook generates the instructions that are needed to promote @var{val} to
+mode @var{mode}.
+If those instructions cannot be generated, the hook returns @code{NULL}.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6190,23 +6216,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c
index 8465237..ff568b1 100644
--- a/gcc/emit-rtl.c
+++ b/gcc/emit-rtl.c
@@ -1495,6 +1495,12 @@ get_mem_align_offset (rtx mem, unsigned int align)
       if (TYPE_ALIGN (TREE_TYPE (expr)) < (unsigned int) align)
 	return -1;
     }
+  else if (TREE_CODE (expr) == MEM_REF)
+    {
+      if (get_object_alignment_1 (expr, &offset) < align)
+	return -1;
+      offset /= BITS_PER_UNIT;
+    }
   else if (TREE_CODE (expr) == COMPONENT_REF)
     {
       while (1)
@@ -2058,7 +2064,6 @@ adjust_address_1 (rtx memref, enum machine_mode mode, HOST_WIDE_INT offset,
   enum machine_mode address_mode;
   int pbits;
   struct mem_attrs attrs, *defattrs;
-  unsigned HOST_WIDE_INT max_align;
 
   attrs = *get_mem_attrs (memref);
 
@@ -2115,8 +2120,12 @@ adjust_address_1 (rtx memref, enum machine_mode mode, HOST_WIDE_INT offset,
      if zero.  */
   if (offset != 0)
     {
-      max_align = (offset & -offset) * BITS_PER_UNIT;
-      attrs.align = MIN (attrs.align, max_align);
+      int old_offset = get_mem_align_offset (memref, MOVE_MAX * BITS_PER_UNIT);
+      if (old_offset >= 0)
+	attrs.align = compute_align_by_offset (old_offset + attrs.offset);
+      else
+	attrs.align = MIN (attrs.align,
+	      (unsigned HOST_WIDE_INT) (offset & -offset) * BITS_PER_UNIT);
     }
 
   /* We can compute the size in a number of ways.  */
diff --git a/gcc/expr.c b/gcc/expr.c
index b020978..83bc789 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -126,15 +126,18 @@ struct store_by_pieces_d
 static unsigned HOST_WIDE_INT move_by_pieces_ninsns (unsigned HOST_WIDE_INT,
 						     unsigned int,
 						     unsigned int);
-static void move_by_pieces_1 (rtx (*) (rtx, ...), enum machine_mode,
-			      struct move_by_pieces_d *);
+static void move_by_pieces_insn (rtx (*) (rtx, ...), enum machine_mode,
+				 struct move_by_pieces_d *);
 static bool block_move_libcall_safe_for_call_parm (void);
 static bool emit_block_move_via_movmem (rtx, rtx, rtx, unsigned, unsigned, HOST_WIDE_INT);
 static tree emit_block_move_libcall_fn (int);
 static void emit_block_move_via_loop (rtx, rtx, rtx, unsigned);
 static rtx clear_by_pieces_1 (void *, HOST_WIDE_INT, enum machine_mode);
 static void clear_by_pieces (rtx, unsigned HOST_WIDE_INT, unsigned int);
+static void set_by_pieces_1 (struct store_by_pieces_d *, unsigned int);
 static void store_by_pieces_1 (struct store_by_pieces_d *, unsigned int);
+static void set_by_pieces_2 (rtx (*) (rtx, ...), enum machine_mode,
+			       struct store_by_pieces_d *, rtx);
 static void store_by_pieces_2 (rtx (*) (rtx, ...), enum machine_mode,
 			       struct store_by_pieces_d *);
 static tree clear_storage_libcall_fn (int);
@@ -163,6 +166,12 @@ static void do_tablejump (rtx, enum machine_mode, rtx, rtx, rtx);
 static rtx const_vector_from_tree (tree);
 static void write_complex_part (rtx, rtx, bool);
 
+static enum machine_mode widest_mode_for_unaligned_mov (unsigned HOST_WIDE_INT);
+static enum machine_mode widest_mode_for_aligned_mov (unsigned HOST_WIDE_INT,
+						      unsigned int);
+static enum machine_mode generate_move_with_mode (struct store_by_pieces_d *,
+					   enum machine_mode, rtx *, rtx *);
+
 /* This macro is used to determine whether move_by_pieces should be called
    to perform a structure copy.  */
 #ifndef MOVE_BY_PIECES_P
@@ -811,7 +820,7 @@ alignment_for_piecewise_move (unsigned int max_pieces, unsigned int align)
 	   tmode != VOIDmode;
 	   xmode = tmode, tmode = GET_MODE_WIDER_MODE (tmode))
 	if (GET_MODE_SIZE (tmode) > max_pieces
-	    || SLOW_UNALIGNED_ACCESS (tmode, align))
+	    || targetm.slow_unaligned_access (tmode, align))
 	  break;
 
       align = MAX (align, GET_MODE_ALIGNMENT (xmode));
@@ -820,11 +829,66 @@ alignment_for_piecewise_move (unsigned int max_pieces, unsigned int align)
   return align;
 }
 
+/* Given an offset from an alignment border, compute the maximal alignment
+   of the data at that offset.  */
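+/* For example (with 8-bit units), compute_align_by_offset (6) yields 16-bit
+   alignment and compute_align_by_offset (4) yields 32-bit alignment.  */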
+unsigned int
+compute_align_by_offset (int offset)
+{
+  return (offset == 0
+	  ? BIGGEST_ALIGNMENT
+	  : MIN (BIGGEST_ALIGNMENT, (offset & -offset) * BITS_PER_UNIT));
+}
+
+/* Estimate the cost of a move of the given size and offset.  The offset
+   is used to determine the maximal alignment.  */
+static int
+compute_aligned_cost (unsigned HOST_WIDE_INT size, int offset)
+{
+  unsigned HOST_WIDE_INT cost = 0;
+  int cur_off = offset;
+
+  while (size > 0)
+    {
+      enum machine_mode mode = widest_mode_for_aligned_mov (size,
+	  compute_align_by_offset (cur_off));
+      int cur_mode_cost;
+      enum vect_cost_for_stmt type_of_cost = vector_load;
+      if (GET_MODE_SIZE (mode) <= UNITS_PER_WORD
+	  && (SCALAR_INT_MODE_P (mode) || SCALAR_FLOAT_MODE_P (mode)))
+	type_of_cost = scalar_load;
+      cur_mode_cost =
+	targetm.vectorize.builtin_vectorization_cost (type_of_cost, NULL, 0);
+      size -= GET_MODE_SIZE (mode);
+      cur_off += GET_MODE_SIZE (mode);
+      cost += cur_mode_cost;
+    }
+  return cost;
+}
+
+/* Estimate the cost of a move of the given size.  The alignment is assumed
+   to be unknown, so unaligned moves have to be used.  */
+static int
+compute_unaligned_cost (unsigned HOST_WIDE_INT size)
+{
+  unsigned HOST_WIDE_INT cost = 0;
+  while (size > 0)
+    {
+      enum machine_mode mode = widest_mode_for_unaligned_mov (size);
+      unsigned HOST_WIDE_INT n_insns = size / GET_MODE_SIZE (mode);
+      int cur_mode_cost =
+	targetm.vectorize.builtin_vectorization_cost (unaligned_load, NULL, 0);
+
+      cost += n_insns * cur_mode_cost;
+      size %= GET_MODE_SIZE (mode);
+    }
+  return cost;
+}
+
 /* Return the widest integer mode no wider than SIZE.  If no such mode
    can be found, return VOIDmode.  */
 
 static enum machine_mode
-widest_int_mode_for_size (unsigned int size)
+widest_int_mode_for_size (unsigned HOST_WIDE_INT size)
 {
   enum machine_mode tmode, mode = VOIDmode;
 
@@ -836,6 +900,170 @@ widest_int_mode_for_size (unsigned int size)
   return mode;
 }
 
+/* If MODE is a scalar mode, find the corresponding preferred vector mode.
+   If no such mode can be found, return the vector mode corresponding to Pmode
+   (a kind of default vector mode).
+   For vector modes return the mode itself.  */
+
+static enum machine_mode
+vector_mode_for_mode (enum machine_mode mode)
+{
+  enum machine_mode xmode;
+  if (VECTOR_MODE_P (mode))
+    return mode;
+  xmode = targetm.vectorize.preferred_simd_mode (mode);
+  if (VECTOR_MODE_P (xmode))
+    return xmode;
+
+  return targetm.vectorize.preferred_simd_mode (Pmode);
+}
+
+/* Check whether vector instructions are required for operating with the
+   specified mode.
+   For vector modes this checks whether the corresponding vector extension
+   is supported.
+   Operations with a scalar mode will use vector extensions if this scalar
+   mode is wider than the default scalar mode (Pmode) and the vector
+   extension for the parent vector mode is available.  */
+
+bool
+vector_extensions_used_for_mode (enum machine_mode mode)
+{
+  enum machine_mode vector_mode = vector_mode_for_mode (mode);
+
+  if (VECTOR_MODE_P (mode))
+    return targetm.vector_mode_supported_p (mode);
+
+  /* mode is a scalar mode.  */
+  if (VECTOR_MODE_P (vector_mode)
+     && targetm.vector_mode_supported_p (vector_mode)
+     && (GET_MODE_SIZE (mode) > GET_MODE_SIZE (Pmode)))
+    return true;
+
+  return false;
+}
+
+/* Find the widest move mode for the given size if alignment is unknown.  */
+static enum machine_mode
+widest_mode_for_unaligned_mov (unsigned HOST_WIDE_INT size)
+{
+  enum machine_mode mode;
+  enum machine_mode tmode, xmode;
+  enum machine_mode best_simd_mode = targetm.vectorize.preferred_simd_mode (
+      mode_for_size (UNITS_PER_WORD * BITS_PER_UNIT, MODE_INT, 0));
+
+  /* Find the widest integer mode.  Here we can find modes wider than Pmode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_INT), xmode = VOIDmode;
+       tmode != VOIDmode;
+       tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size
+	  || targetm.slow_unaligned_access (tmode, BITS_PER_UNIT))
+	break;
+      if (optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.scalar_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+  mode = xmode;
+
+  /* Find the widest vector mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_VECTOR_INT), xmode = VOIDmode;
+       tmode != VOIDmode;
+       tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size
+	  || targetm.slow_unaligned_access (tmode, BITS_PER_UNIT))
+	break;
+      if (GET_MODE_SIZE (GET_MODE_INNER (tmode)) == UNITS_PER_WORD
+	  && optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.vector_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+
+  /* Choose between integer and vector modes.  */
+  if (xmode != VOIDmode && GET_MODE_SIZE (xmode) > GET_MODE_SIZE (mode))
+    mode = xmode;
+
+  /* If the vector and scalar modes found have the same size, and the
+     vector mode is best_simd_mode, prefer the vector mode.  */
+  if (xmode != VOIDmode
+      && GET_MODE_SIZE (xmode) == GET_MODE_SIZE (mode)
+      && xmode == best_simd_mode)
+    mode = xmode;
+
+  /* If we failed to find a mode that might use vector extensions, try to
+     find the widest ordinary integer mode.  */
+  if (mode == VOIDmode)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  /* If the mode found won't use vector extensions, then there is no need to
+     use a mode wider than Pmode.  */
+  if (!vector_extensions_used_for_mode (mode)
+      && GET_MODE_SIZE (mode) > MOVE_MAX_PIECES)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  return mode;
+}
+
+/* Find the widest move mode for the given size and alignment.  */
+static enum machine_mode
+widest_mode_for_aligned_mov (unsigned HOST_WIDE_INT size, unsigned int align)
+{
+  enum machine_mode mode;
+  enum machine_mode tmode, xmode;
+  enum machine_mode best_simd_mode = targetm.vectorize.preferred_simd_mode (
+      mode_for_size (UNITS_PER_WORD * BITS_PER_UNIT, MODE_INT, 0));
+
+  /* Find the widest integer mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_INT), xmode = VOIDmode;
+      tmode != VOIDmode;
+      tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size || GET_MODE_ALIGNMENT (tmode) > align)
+	break;
+      if (optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.scalar_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+  mode = xmode;
+
+  /* Find the widest vector mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_VECTOR_INT), xmode = VOIDmode;
+      tmode != VOIDmode;
+      tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size || GET_MODE_ALIGNMENT (tmode) > align)
+	break;
+      if (GET_MODE_SIZE (GET_MODE_INNER (tmode)) == UNITS_PER_WORD
+	  && optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.vector_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+
+  /* Choose between integer and vector modes.  */
+  if (xmode != VOIDmode && GET_MODE_SIZE (xmode) > GET_MODE_SIZE (mode))
+    mode = xmode;
+
+  /* If the vector and scalar modes we found have the same size, and the
+     vector mode is the preferred SIMD mode, prefer the vector mode.  */
+  if (xmode != VOIDmode
+      && GET_MODE_SIZE (xmode) == GET_MODE_SIZE (mode)
+      && xmode == best_simd_mode)
+    mode = xmode;
+
+  /* If we failed to find a mode that might use vector extensions, fall back
+     to the widest ordinary integer mode.  */
+  if (mode == VOIDmode)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  /* If the mode we found will not use vector extensions, there is no need
+     to use a mode wider than Pmode.  */
+  if (!vector_extensions_used_for_mode (mode)
+      && GET_MODE_SIZE (mode) > MOVE_MAX_PIECES)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  return mode;
+}
+
 /* STORE_MAX_PIECES is the number of bytes at a time that we can
    store efficiently.  Due to internal GCC limitations, this is
    MOVE_MAX_PIECES limited by the number of bytes GCC can represent
@@ -876,6 +1104,7 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
   rtx to_addr, from_addr = XEXP (from, 0);
   unsigned int max_size = MOVE_MAX_PIECES + 1;
   enum insn_code icode;
+  int dst_offset, src_offset;
 
   align = MIN (to ? MEM_ALIGN (to) : align, MEM_ALIGN (from));
 
@@ -960,23 +1189,37 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
 	data.to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
     }
 
-  align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
-
-  /* First move what we can in the largest integer mode, then go to
-     successively smaller modes.  */
-
-  while (max_size > 1)
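+  /* Use the alignment-aware expansion only when both the source and
+     destination offsets from a MOVE_MAX boundary are known, they are equal,
+     and it is estimated to be cheaper than expanding with unaligned
+     moves.  */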
+  src_offset = get_mem_align_offset (from, MOVE_MAX * BITS_PER_UNIT);
+  dst_offset = get_mem_align_offset (to, MOVE_MAX * BITS_PER_UNIT);
+  if (src_offset < 0
+      || dst_offset < 0
+      || src_offset != dst_offset
+      || (compute_aligned_cost (data.len, src_offset)
+	  >= compute_unaligned_cost (data.len)))
     {
-      enum machine_mode mode = widest_int_mode_for_size (max_size);
-
-      if (mode == VOIDmode)
-	break;
+      while (data.len > 0)
+	{
+	  enum machine_mode mode = widest_mode_for_unaligned_mov (data.len);
 
-      icode = optab_handler (mov_optab, mode);
-      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
-	move_by_pieces_1 (GEN_FCN (icode), mode, &data);
+	  icode = optab_handler (mov_optab, mode);
+	  gcc_assert (icode != CODE_FOR_nothing);
+	  move_by_pieces_insn (GEN_FCN (icode), mode, &data);
+	}
+    }
+  else
+    {
+      while (data.len > 0)
+	{
+	  enum machine_mode mode;
+	  mode = widest_mode_for_aligned_mov (data.len,
+	      compute_align_by_offset (src_offset));
 
-      max_size = GET_MODE_SIZE (mode);
+	  icode = optab_handler (mov_optab, mode);
+	  gcc_assert (icode != CODE_FOR_nothing
+		      && compute_align_by_offset (src_offset)
+			 >= GET_MODE_ALIGNMENT (mode));
+	  move_by_pieces_insn (GEN_FCN (icode), mode, &data);
+	  src_offset += GET_MODE_SIZE (mode);
+	}
     }
 
   /* The code above should have handled everything.  */
@@ -1014,35 +1257,47 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
 }
 
 /* Return number of insns required to move L bytes by pieces.
-   ALIGN (in bits) is maximum alignment we can assume.  */
+   ALIGN (in bits) is maximum alignment we can assume.
+   This is only an estimate, so the actual number of instructions may
+   differ from it (there are several ways of expanding memmove).  */
 
 static unsigned HOST_WIDE_INT
 move_by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align,
-		       unsigned int max_size)
+		       unsigned int max_size ATTRIBUTE_UNUSED)
 {
   unsigned HOST_WIDE_INT n_insns = 0;
-
-  align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
-
-  while (max_size > 1)
+  unsigned HOST_WIDE_INT n_insns_u = 0;
+  enum machine_mode mode;
+  unsigned HOST_WIDE_INT len = l;
+  while (len > 0)
     {
-      enum machine_mode mode;
-      enum insn_code icode;
-
-      mode = widest_int_mode_for_size (max_size);
-
-      if (mode == VOIDmode)
-	break;
+      mode = widest_mode_for_aligned_mov (len, align);
+      if (GET_MODE_SIZE (mode) < MOVE_MAX)
+	{
+	  align += GET_MODE_ALIGNMENT (mode);
+	  len -= GET_MODE_SIZE (mode);
+	  n_insns++;
+	}
+      else
+	{
+	  /* We are using the widest mode.  */
+	  n_insns += len / GET_MODE_SIZE (mode);
+	  len = len % GET_MODE_SIZE (mode);
+	}
+    }
+  gcc_assert (!len);
 
-      icode = optab_handler (mov_optab, mode);
-      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
-	n_insns += l / GET_MODE_SIZE (mode), l %= GET_MODE_SIZE (mode);
+  len = l;
+  while (len > 0)
+    {
+      mode = widest_mode_for_unaligned_mov (len);
+      n_insns_u += len / GET_MODE_SIZE (mode);
+      len = len % GET_MODE_SIZE (mode);
 
-      max_size = GET_MODE_SIZE (mode);
     }
 
-  gcc_assert (!l);
-  return n_insns;
+  gcc_assert (!len);
+  return MIN (n_insns, n_insns_u);
 }
 
 /* Subroutine of move_by_pieces.  Move as many bytes as appropriate
@@ -1050,60 +1305,57 @@ move_by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align,
    to make a move insn for that mode.  DATA has all the other info.  */
 
 static void
-move_by_pieces_1 (rtx (*genfun) (rtx, ...), enum machine_mode mode,
+move_by_pieces_insn (rtx (*genfun) (rtx, ...), enum machine_mode mode,
 		  struct move_by_pieces_d *data)
 {
   unsigned int size = GET_MODE_SIZE (mode);
   rtx to1 = NULL_RTX, from1;
 
-  while (data->len >= size)
-    {
-      if (data->reverse)
-	data->offset -= size;
-
-      if (data->to)
-	{
-	  if (data->autinc_to)
-	    to1 = adjust_automodify_address (data->to, mode, data->to_addr,
-					     data->offset);
-	  else
-	    to1 = adjust_address (data->to, mode, data->offset);
-	}
+  if (data->reverse)
+    data->offset -= size;
 
-      if (data->autinc_from)
-	from1 = adjust_automodify_address (data->from, mode, data->from_addr,
-					   data->offset);
+  if (data->to)
+    {
+      if (data->autinc_to)
+	to1 = adjust_automodify_address (data->to, mode, data->to_addr,
+					 data->offset);
       else
-	from1 = adjust_address (data->from, mode, data->offset);
+	to1 = adjust_address (data->to, mode, data->offset);
+    }
 
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
-	emit_insn (gen_add2_insn (data->to_addr,
-				  GEN_INT (-(HOST_WIDE_INT)size)));
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_from < 0)
-	emit_insn (gen_add2_insn (data->from_addr,
-				  GEN_INT (-(HOST_WIDE_INT)size)));
+  if (data->autinc_from)
+    from1 = adjust_automodify_address (data->from, mode, data->from_addr,
+				       data->offset);
+  else
+    from1 = adjust_address (data->from, mode, data->offset);
 
-      if (data->to)
-	emit_insn ((*genfun) (to1, from1));
-      else
-	{
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
+    emit_insn (gen_add2_insn (data->to_addr,
+			      GEN_INT (-(HOST_WIDE_INT)size)));
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_from < 0)
+    emit_insn (gen_add2_insn (data->from_addr,
+			      GEN_INT (-(HOST_WIDE_INT)size)));
+
+  if (data->to)
+    emit_insn ((*genfun) (to1, from1));
+  else
+    {
 #ifdef PUSH_ROUNDING
-	  emit_single_push_insn (mode, from1, NULL);
+      emit_single_push_insn (mode, from1, NULL);
 #else
-	  gcc_unreachable ();
+      gcc_unreachable ();
 #endif
-	}
+    }
 
-      if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
-	emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
-      if (HAVE_POST_INCREMENT && data->explicit_inc_from > 0)
-	emit_insn (gen_add2_insn (data->from_addr, GEN_INT (size)));
+  if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
+    emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
+  if (HAVE_POST_INCREMENT && data->explicit_inc_from > 0)
+    emit_insn (gen_add2_insn (data->from_addr, GEN_INT (size)));
 
-      if (! data->reverse)
-	data->offset += size;
+  if (! data->reverse)
+    data->offset += size;
 
-      data->len -= size;
-    }
+  data->len -= size;
 }
 \f
 /* Emit code to move a block Y to a block X.  This may be done with
@@ -1680,7 +1932,7 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src, tree type, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (src)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (src))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (src))
 	      || MEM_ALIGN (src) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -2070,7 +2322,7 @@ emit_group_store (rtx orig_dst, rtx src, tree type ATTRIBUTE_UNUSED, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (dest)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (dest))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (dest))
 	      || MEM_ALIGN (dest) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -2464,7 +2716,10 @@ store_by_pieces (rtx to, unsigned HOST_WIDE_INT len,
   data.constfundata = constfundata;
   data.len = len;
   data.to = to;
-  store_by_pieces_1 (&data, align);
+  if (memsetp)
+    set_by_pieces_1 (&data, align);
+  else
+    store_by_pieces_1 (&data, align);
   if (endp)
     {
       rtx to1;
@@ -2508,10 +2763,10 @@ clear_by_pieces (rtx to, unsigned HOST_WIDE_INT len, unsigned int align)
     return;
 
   data.constfun = clear_by_pieces_1;
-  data.constfundata = NULL;
+  data.constfundata = CONST0_RTX (QImode);
   data.len = len;
   data.to = to;
-  store_by_pieces_1 (&data, align);
+  set_by_pieces_1 (&data, align);
 }
 
 /* Callback routine for clear_by_pieces.
@@ -2525,13 +2780,126 @@ clear_by_pieces_1 (void *data ATTRIBUTE_UNUSED,
   return const0_rtx;
 }
 
-/* Subroutine of clear_by_pieces and store_by_pieces.
+/* Helper function for set_by_pieces_1: generate a move with the given mode.
+   Return the mode actually used for the generated move (it may differ from
+   the requested one if that mode is not supported).  */
+static enum machine_mode
+generate_move_with_mode (struct store_by_pieces_d *data,
+			 enum machine_mode mode,
+			 rtx *promoted_to_vector_value_ptr,
+			 rtx *promoted_value_ptr)
+{
+  enum insn_code icode;
+  rtx rhs = NULL_RTX;
+
+  gcc_assert (promoted_to_vector_value_ptr && promoted_value_ptr);
+
+  if (vector_extensions_used_for_mode (mode))
+    {
+      enum machine_mode vec_mode = vector_mode_for_mode (mode);
+      if (!(*promoted_to_vector_value_ptr))
+	*promoted_to_vector_value_ptr
+	  = targetm.promote_rtx_for_memset (vec_mode, (rtx)data->constfundata);
+
+      if (*promoted_to_vector_value_ptr)
+	{
+	  enum machine_mode promoted_mode = GET_MODE (*promoted_to_vector_value_ptr);
+	  if (GET_MODE_SIZE (promoted_mode) < GET_MODE_SIZE (mode))
+	    return generate_move_with_mode (data, promoted_mode,
+				    promoted_to_vector_value_ptr,
+				    promoted_value_ptr);
+	  rhs = convert_to_mode (vec_mode, *promoted_to_vector_value_ptr, 1);
+	}
+    }
+  else
+    {
+      if (CONST_INT_P ((rtx)data->constfundata))
+	{
+	  /* We don't need to load the constant to a register, if it could be
+	     encoded as an immediate operand.  */
+	  rtx imm_const;
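+	  /* Replicate the low byte of the constant into every byte of the
+	     immediate by multiplying it by 0x01...01 of the right width.  */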
+	  switch (mode)
+	    {
+	    case DImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x0101010101010101, DImode);
+	      break;
+	    case SImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x01010101, SImode);
+	      break;
+	    case HImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x00000101, HImode);
+	      break;
+	    case QImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x00000001, QImode);
+	      break;
+	    default:
+	      gcc_unreachable ();
+	      break;
+	    }
+	  rhs = imm_const;
+	}
+      else /* data->constfundata isn't const.  */
+	{
+	  if (!(*promoted_value_ptr))
+	    {
+	      rtx coeff;
+	      enum machine_mode promoted_value_mode;
+	      /* Choose the mode for the promoted value.  It shouldn't be
+		 narrower than Pmode.  */
+	      if (GET_MODE_SIZE (mode) > GET_MODE_SIZE (Pmode))
+		promoted_value_mode = mode;
+	      else
+		promoted_value_mode = Pmode;
+
+	      switch (promoted_value_mode)
+		{
+		case DImode:
+		  coeff = gen_int_mode (0x0101010101010101, DImode);
+		  break;
+		case SImode:
+		  coeff = gen_int_mode (0x01010101, SImode);
+		  break;
+		default:
+		  gcc_unreachable ();
+		  break;
+		}
+	      *promoted_value_ptr = convert_to_mode (promoted_value_mode,
+						     (rtx)data->constfundata,
+						     1);
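+	      /* Multiplying the zero-extended byte by 0x01...01 replicates
+		 it into every byte of the promoted value.  */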
+	      *promoted_value_ptr = expand_mult (promoted_value_mode,
+						 *promoted_value_ptr, coeff,
+						 NULL_RTX, 1);
+	    }
+	  rhs = convert_to_mode (mode, *promoted_value_ptr, 1);
+	}
+    }
+  /* If RHS is null, then the requested mode isn't supported and can't be used.
+     Use Pmode instead.  */
+  if (!rhs)
+    return generate_move_with_mode (data, Pmode, promoted_to_vector_value_ptr,
+			       promoted_value_ptr);
+
+  gcc_assert (rhs);
+  icode = optab_handler (mov_optab, mode);
+  gcc_assert (icode != CODE_FOR_nothing);
+  set_by_pieces_2 (GEN_FCN (icode), mode, data, rhs);
+  return mode;
+}
+
+/* Subroutine of store_by_pieces.
    Generate several move instructions to store LEN bytes of block TO.  (A MEM
    rtx with BLKmode).  ALIGN is maximum alignment we can assume.  */
 
 static void
-store_by_pieces_1 (struct store_by_pieces_d *data ATTRIBUTE_UNUSED,
-		   unsigned int align ATTRIBUTE_UNUSED)
+store_by_pieces_1 (struct store_by_pieces_d *data, unsigned int align)
 {
   enum machine_mode to_addr_mode
     = targetm.addr_space.address_mode (MEM_ADDR_SPACE (data->to));
@@ -2606,6 +2974,134 @@ store_by_pieces_1 (struct store_by_pieces_d *data ATTRIBUTE_UNUSED,
   gcc_assert (!data->len);
 }
 
+/* Subroutine of clear_by_pieces and store_by_pieces.
+   Generate several move instructions to store LEN bytes of block TO.  (A MEM
+   rtx with BLKmode).  ALIGN is maximum alignment we can assume.
+   As opposed to store_by_pieces_1, this routine always generates code for
+   memset.  (store_by_pieces_1 is sometimes used to generate code for memcpy
+   rather than for memset).  */
+
+static void
+set_by_pieces_1 (struct store_by_pieces_d *data, unsigned int align)
+{
+  enum machine_mode to_addr_mode
+    = targetm.addr_space.address_mode (MEM_ADDR_SPACE (data->to));
+  rtx to_addr = XEXP (data->to, 0);
+  unsigned int max_size = STORE_MAX_PIECES + 1;
+  int dst_offset;
+  rtx promoted_to_vector_value = NULL_RTX;
+  rtx promoted_value = NULL_RTX;
+
+  data->offset = 0;
+  data->to_addr = to_addr;
+  data->autinc_to
+    = (GET_CODE (to_addr) == PRE_INC || GET_CODE (to_addr) == PRE_DEC
+       || GET_CODE (to_addr) == POST_INC || GET_CODE (to_addr) == POST_DEC);
+
+  data->explicit_inc_to = 0;
+  data->reverse
+    = (GET_CODE (to_addr) == PRE_DEC || GET_CODE (to_addr) == POST_DEC);
+  if (data->reverse)
+    data->offset = data->len;
+
+  /* If storing requires more than two move insns,
+     copy addresses to registers (to make displacements shorter)
+     and use post-increment if available.  */
+  if (!data->autinc_to
+      && move_by_pieces_ninsns (data->len, align, max_size) > 2)
+    {
+      /* Determine the main mode we'll be using.
+	 MODE might not be used depending on the definitions of the
+	 USE_* macros below.  */
+      enum machine_mode mode ATTRIBUTE_UNUSED
+	= widest_int_mode_for_size (max_size);
+
+      if (USE_STORE_PRE_DECREMENT (mode) && data->reverse && ! data->autinc_to)
+	{
+	  data->to_addr = copy_to_mode_reg (to_addr_mode,
+					    plus_constant (to_addr, data->len));
+	  data->autinc_to = 1;
+	  data->explicit_inc_to = -1;
+	}
+
+      if (USE_STORE_POST_INCREMENT (mode) && ! data->reverse
+	  && ! data->autinc_to)
+	{
+	  data->to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
+	  data->autinc_to = 1;
+	  data->explicit_inc_to = 1;
+	}
+
+      if (!data->autinc_to && CONSTANT_P (to_addr))
+	data->to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
+    }
+
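+  /* As in move_by_pieces, use the alignment-aware expansion only when the
+     destination offset from a MOVE_MAX boundary is known and it is
+     estimated to be cheaper than expanding with unaligned moves.  */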
+  dst_offset = get_mem_align_offset (data->to, MOVE_MAX * BITS_PER_UNIT);
+  if (dst_offset < 0
+      || (compute_aligned_cost (data->len, dst_offset)
+	  >= compute_unaligned_cost (data->len)))
+    {
+      while (data->len > 0)
+	{
+	  enum machine_mode mode = widest_mode_for_unaligned_mov (data->len);
+	  generate_move_with_mode (data, mode, &promoted_to_vector_value,
+				   &promoted_value);
+	}
+    }
+  else
+    {
+      while (data->len > 0)
+	{
+	  enum machine_mode mode;
+	  mode = widest_mode_for_aligned_mov (data->len,
+	      compute_align_by_offset (dst_offset));
+	  mode = generate_move_with_mode (data, mode, &promoted_to_vector_value,
+				   &promoted_value);
+	  dst_offset += GET_MODE_SIZE (mode);
+	}
+    }
+
+  /* The code above should have handled everything.  */
+  gcc_assert (!data->len);
+}
+
+/* Subroutine of set_by_pieces_1.  Emit move instruction with mode MODE.
+   DATA has info about destination, RHS is source, GENFUN is the gen_...
+   function to make a move insn for that mode.  */
+
+static void
+set_by_pieces_2 (rtx (*genfun) (rtx, ...), enum machine_mode mode,
+		   struct store_by_pieces_d *data, rtx rhs)
+{
+  unsigned int size = GET_MODE_SIZE (mode);
+  rtx to1;
+
+  if (data->reverse)
+    data->offset -= size;
+
+  if (data->autinc_to)
+    to1 = adjust_automodify_address (data->to, mode, data->to_addr,
+	data->offset);
+  else
+    to1 = adjust_address (data->to, mode, data->offset);
+
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
+    emit_insn (gen_add2_insn (data->to_addr,
+	  GEN_INT (-(HOST_WIDE_INT) size)));
+
+  gcc_assert (rhs);
+
+  emit_insn ((*genfun) (to1, rhs));
+
+  if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
+    emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
+
+  if (! data->reverse)
+    data->offset += size;
+
+  data->len -= size;
+}
+
 /* Subroutine of store_by_pieces_1.  Store as many bytes as appropriate
    with move instructions for mode MODE.  GENFUN is the gen_... function
    to make a move insn for that mode.  DATA has all the other info.  */
@@ -4034,7 +4530,7 @@ emit_push_insn (rtx x, enum machine_mode mode, tree type, rtx size,
 	  /* Here we avoid the case of a structure whose weak alignment
 	     forces many pushes of a small amount of data,
 	     and such small pushes do rounding that causes trouble.  */
-	  && ((! SLOW_UNALIGNED_ACCESS (word_mode, align))
+	  && ((! targetm.slow_unaligned_access (word_mode, align))
 	      || align >= BIGGEST_ALIGNMENT
 	      || (PUSH_ROUNDING (align / BITS_PER_UNIT)
 		  == (align / BITS_PER_UNIT)))
@@ -6325,7 +6821,7 @@ store_field (rtx target, HOST_WIDE_INT bitsize, HOST_WIDE_INT bitpos,
       || (mode != BLKmode
 	  && ((((MEM_ALIGN (target) < GET_MODE_ALIGNMENT (mode))
 		|| bitpos % GET_MODE_ALIGNMENT (mode))
-	       && SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (target)))
+	       && targetm.slow_unaligned_access (mode, MEM_ALIGN (target)))
 	      || (bitpos % BITS_PER_UNIT != 0)))
       /* If the RHS and field are a constant size and the size of the
 	 RHS isn't the same size as the bitfield, we must use bitfield
@@ -9738,7 +10234,7 @@ expand_expr_real_1 (tree exp, rtx target, enum machine_mode tmode,
 		     && ((modifier == EXPAND_CONST_ADDRESS
 			  || modifier == EXPAND_INITIALIZER)
 			 ? STRICT_ALIGNMENT
-			 : SLOW_UNALIGNED_ACCESS (mode1, MEM_ALIGN (op0))))
+			 : targetm.slow_unaligned_access (mode1, MEM_ALIGN (op0))))
 		    || (bitpos % BITS_PER_UNIT != 0)))
 	    /* If the type and the field are a constant size and the
 	       size of the type isn't the same size as the bitfield,
diff --git a/gcc/expr.h b/gcc/expr.h
index 1bf1369..6f697d7 100644
--- a/gcc/expr.h
+++ b/gcc/expr.h
@@ -706,4 +706,8 @@ extern tree build_libfunc_function (const char *);
 /* Get the personality libfunc for a function decl.  */
 rtx get_personality_function (tree);
 
+/* Given the offset from a maximum alignment boundary, compute the maximum
+   alignment that can be assumed.  */
+unsigned int compute_align_by_offset (int);
+
 #endif /* GCC_EXPR_H */
diff --git a/gcc/fwprop.c b/gcc/fwprop.c
index 5368d18..cbbb75a 100644
--- a/gcc/fwprop.c
+++ b/gcc/fwprop.c
@@ -1273,6 +1273,10 @@ forward_propagate_and_simplify (df_ref use, rtx def_insn, rtx def_set)
       return false;
     }
 
+  /* Do not propagate vector constants.  */
+  if (vector_extensions_used_for_mode (GET_MODE (reg)) && CONSTANT_P (src))
+    return false;
+
   if (asm_use >= 0)
     return forward_propagate_asm (use, def_insn, def_set, reg);
 
diff --git a/gcc/rtl.h b/gcc/rtl.h
index f13485e..4ec67c7 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -2513,6 +2513,9 @@ extern void emit_jump (rtx);
 /* In expr.c */
 extern rtx move_by_pieces (rtx, rtx, unsigned HOST_WIDE_INT,
 			   unsigned int, int);
+/* Return true if vector instructions are used for operations in the
+   given mode.  */
+extern bool vector_extensions_used_for_mode (enum machine_mode);
 extern HOST_WIDE_INT find_args_size_adjust (rtx);
 extern int fixup_args_size_notes (rtx, rtx, int);
 
diff --git a/gcc/target.def b/gcc/target.def
index c3bec0e..a74bb7b 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1498,6 +1498,22 @@ DEFHOOK
  bool, (struct ao_ref_s *ref),
  default_ref_may_alias_errno)
 
+/* Return true if access to unaligned data in the given mode is too slow or
+   prohibited.  */
+DEFHOOK
+(slow_unaligned_access,
+ "",
+ bool, (enum machine_mode mode, unsigned int align),
+ default_slow_unaligned_access)
+
+/* Target hook.  Return an rtx of mode MODE with the byte value VAL
+   replicated to fill the whole mode, or NULL if that is not possible.
+   VAL represents a single byte.  */
+DEFHOOK
+(promote_rtx_for_memset,
+ "",
+ rtx, (enum machine_mode mode, rtx val),
+ default_promote_rtx_for_memset)
+
 /* Support for named address spaces.  */
 #undef HOOK_PREFIX
 #define HOOK_PREFIX "TARGET_ADDR_SPACE_"
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index 81fd12f..f02a9e8 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -1442,4 +1442,24 @@ default_pch_valid_p (const void *data_p, size_t len)
   return NULL;
 }
 
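+/* Default implementation of the slow_unaligned_access target hook: use the
+   SLOW_UNALIGNED_ACCESS macro if the target defines it, otherwise fall back
+   to STRICT_ALIGNMENT.  */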
+bool
+default_slow_unaligned_access (enum machine_mode mode ATTRIBUTE_UNUSED,
+			       unsigned int align ATTRIBUTE_UNUSED)
+{
+#ifdef SLOW_UNALIGNED_ACCESS
+  return SLOW_UNALIGNED_ACCESS (mode, align);
+#else
+  return STRICT_ALIGNMENT;
+#endif
+}
+
+/* Target hook.  Return an rtx of mode MODE with the byte value VAL
+   replicated to fill the whole mode, or NULL.  VAL represents a single
+   byte.  The default implementation returns NULL_RTX.  */
+rtx
+default_promote_rtx_for_memset (enum machine_mode mode ATTRIBUTE_UNUSED,
+				 rtx val ATTRIBUTE_UNUSED)
+{
+  return NULL_RTX;
+}
+
 #include "gt-targhooks.h"
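
(Editorial illustration, not part of the patch: the promotion the hook is
expected to perform replicates the fill byte into every byte of the wider
value.  The function name and the 64-bit width below are assumptions made
only for this sketch; for a scalar 64-bit mode the computation amounts to

    /* Replicate byte C into all eight bytes of a 64-bit value.  */
    unsigned long long
    broadcast_byte (unsigned char c)
    {
      return (unsigned long long) c * 0x0101010101010101ULL;
    }

and a vector-mode implementation would perform the analogous promotion into
a vector register.)
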
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index f19fb50..8d23747 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -175,3 +175,6 @@ extern enum machine_mode default_get_reg_raw_mode(int);
 
 extern void *default_get_pch_validity (size_t *);
 extern const char *default_pch_valid_p (const void *, size_t);
+extern bool default_slow_unaligned_access (enum machine_mode mode,
+					   unsigned int align);
+extern rtx default_promote_rtx_for_memset (enum machine_mode mode, rtx val);
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-1.c
new file mode 100644
index 0000000..39c8ef0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-2.c
new file mode 100644
index 0000000..439694b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-3.c
new file mode 100644
index 0000000..51f4c3b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-4.c
new file mode 100644
index 0000000..bca8680
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-1.c
new file mode 100644
index 0000000..5bc8e74
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-2.c
new file mode 100644
index 0000000..b7dff27
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-1.c
new file mode 100644
index 0000000..bee85fe
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemcpy" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-2.c
new file mode 100644
index 0000000..1160beb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemcpy" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-1.c
new file mode 100644
index 0000000..b1c78ec
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-2.c
new file mode 100644
index 0000000..a15a0f7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-3.c
new file mode 100644
index 0000000..2789660
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-4.c
new file mode 100644
index 0000000..17e0342
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-1.c
new file mode 100644
index 0000000..e437378
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-2.c
new file mode 100644
index 0000000..ba716df
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-3.c
new file mode 100644
index 0000000..1845e95
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-4.c
new file mode 100644
index 0000000..2b23751
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-au-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-1.c
new file mode 100644
index 0000000..e751192
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemcpy" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-au-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-2.c
new file mode 100644
index 0000000..7defe7e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-au-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-3.c
new file mode 100644
index 0000000..ea27378
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-1.c
new file mode 100644
index 0000000..de2a557
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-2.c
new file mode 100644
index 0000000..1f82258
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-3.c
new file mode 100644
index 0000000..7f60806
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-4.c
new file mode 100644
index 0000000..94f0864
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-1.c
new file mode 100644
index 0000000..20545c8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-2.c
new file mode 100644
index 0000000..52dab8e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-3.c
new file mode 100644
index 0000000..c662480
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-4.c
new file mode 100644
index 0000000..9e8e152
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-1.c
new file mode 100644
index 0000000..662fc20
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-2.c
new file mode 100644
index 0000000..c90e852
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-3.c
new file mode 100644
index 0000000..5a41f82
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-4.c
new file mode 100644
index 0000000..ec2dfff
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-1.c
new file mode 100644
index 0000000..d6b2cd5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-2.c
new file mode 100644
index 0000000..9cd89e9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-3.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-3.c
new file mode 100644
index 0000000..ddf25fd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-4.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-4.c
new file mode 100644
index 0000000..fde4f5d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-1.c
new file mode 100644
index 0000000..4fe2d36
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-2.c
new file mode 100644
index 0000000..2209563
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-au-1.c b/gcc/testsuite/gcc.target/i386/memset-s3072-au-1.c
new file mode 100644
index 0000000..8d99dde
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemset" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-au-2.c b/gcc/testsuite/gcc.target/i386/memset-s3072-au-2.c
new file mode 100644
index 0000000..e0ad04a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemset" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-1.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-1.c
new file mode 100644
index 0000000..404d04e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-2.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-2.c
new file mode 100644
index 0000000..1df9db0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-3.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-3.c
new file mode 100644
index 0000000..beb005c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-4.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-4.c
new file mode 100644
index 0000000..29f5ea3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-1.c
new file mode 100644
index 0000000..2504333
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-2.c
new file mode 100644
index 0000000..b0aaada
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-3.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-3.c
new file mode 100644
index 0000000..3e250d0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-4.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-4.c
new file mode 100644
index 0000000..c13edd7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-1.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-1.c
new file mode 100644
index 0000000..17d9525
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-2.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-2.c
new file mode 100644
index 0000000..8125e9d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-3.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-3.c
new file mode 100644
index 0000000..ff74811
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-4.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-4.c
new file mode 100644
index 0000000..d7e0c3d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-1.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-1.c
new file mode 100644
index 0000000..ea7b439
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-10.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-10.c
new file mode 100644
index 0000000..5ef250d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-10.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-11.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-11.c
new file mode 100644
index 0000000..846a807
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-11.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-12.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-12.c
new file mode 100644
index 0000000..a8f7c3b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-12.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-2.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-2.c
new file mode 100644
index 0000000..ae05e93
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-3.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-3.c
new file mode 100644
index 0000000..96462bd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-4.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-4.c
new file mode 100644
index 0000000..6aee01e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-5.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-5.c
new file mode 100644
index 0000000..bbad9b9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-5.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-6.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-6.c
new file mode 100644
index 0000000..8e90d72
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-6.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-7.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-7.c
new file mode 100644
index 0000000..26d0b42
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-7.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-8.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-8.c
new file mode 100644
index 0000000..84ec749
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-8.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-9.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-9.c
new file mode 100644
index 0000000..ef15265
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-9.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-1.c
new file mode 100644
index 0000000..444a8de
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-2.c
new file mode 100644
index 0000000..9154fb9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-3.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-3.c
new file mode 100644
index 0000000..9b7dac1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-4.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-4.c
new file mode 100644
index 0000000..713c8a8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-1.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-1.c
new file mode 100644
index 0000000..8c700c0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-2.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-2.c
new file mode 100644
index 0000000..c344fd0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-3.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-3.c
new file mode 100644
index 0000000..125de2f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-4.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-4.c
new file mode 100644
index 0000000..b50de1b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-1.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-1.c
new file mode 100644
index 0000000..c6fd271
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-2.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-2.c
new file mode 100644
index 0000000..32972e6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-3.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-3.c
new file mode 100644
index 0000000..ac615e8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-4.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-4.c
new file mode 100644
index 0000000..8458cfd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-5.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-5.c
new file mode 100644
index 0000000..210946d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-5.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-6.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-6.c
new file mode 100644
index 0000000..e63feae
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-6.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-7.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-7.c
new file mode 100644
index 0000000..72b2ba0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-7.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-8.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-8.c
new file mode 100644
index 0000000..cb5dc85
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-8.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/sw-1.c b/gcc/testsuite/gcc.target/i386/sw-1.c
index 483d117..e3d3b91 100644
--- a/gcc/testsuite/gcc.target/i386/sw-1.c
+++ b/gcc/testsuite/gcc.target/i386/sw-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fshrink-wrap -fdump-rtl-pro_and_epilogue" } */
+/* { dg-options "-O2 -fshrink-wrap -fdump-rtl-pro_and_epilogue -mstringop-strategy=rep_byte" } */
 
 #include <string.h>
 


* Re: Use of vector instructions in memmov/memset expanding
  2011-10-20  8:46                         ` Michael Zolotukhin
@ 2011-10-20  8:46                           ` Michael Zolotukhin
  2011-10-20  8:51                             ` Michael Zolotukhin
  2011-10-27 16:10                             ` Jan Hubicka
  0 siblings, 2 replies; 52+ messages in thread
From: Michael Zolotukhin @ 2011-10-20  8:46 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Jack Howarth, gcc-patches, Jan Hubicka, Richard Guenther,
	H.J. Lu, izamyatin, areg.melikadamyan

[-- Attachment #1: Type: text/plain, Size: 1598 bytes --]

Back-end part of the patch is attached here.

On 20 October 2011 12:35, Michael Zolotukhin
<michael.v.zolotukhin@gmail.com> wrote:
> Middle-end part of the patch is attached.
>
> On 20 October 2011 12:34, Michael Zolotukhin
> <michael.v.zolotukhin@gmail.com> wrote:
>> I fixed the tests, updated my branch, and fixed the bugs introduced
>> during this process.
>> Here is the fixed complete patch (other parts will be sent in subsequent letters).
>>
>> The changes passed bootstrap and make check.
>>
>> On 29 September 2011 15:21, Jakub Jelinek <jakub@redhat.com> wrote:
>>> Hi!
>>>
>>> On Thu, Sep 29, 2011 at 03:14:40PM +0400, Michael Zolotukhin wrote:
>>> +/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
>>>
>>> The testcases are wrong: -m64 or -m32 should never appear in dg-options;
>>> instead if the testcase is specific to -m64, it should be guarded with
>>> /* { dg-do compile { target lp64 } } */
>>> resp. ia32 (or ilp32, depending on what exactly should be done for -mx32),
>>> if you have the same testcase for -m32 and -m64, but just want different
>>> scan-assembler for the two cases, then just guard the scan-assembler
>>> with lp64 resp. ia32/ilp32 target and add a second one for the other target.
>>>
>>>        Jakub
>>
>> --
>> ---
>> Best regards,
>> Michael V. Zolotukhin,
>> Software Engineer
>> Intel Corporation.
>>
>
>
>
> --
> ---
> Best regards,
> Michael V. Zolotukhin,
> Software Engineer
> Intel Corporation.
>



-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.
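
For reference, a minimal sketch of the dg-directive style Jakub describes above
(the option set and scan patterns below are placeholders for illustration, not
lines taken from the attached patch): drop -m64/-m32 from dg-options and guard
the scan-assembler directives with lp64/ia32 targets instead, so one testcase
serves both word sizes:

/* { dg-do compile } */
/* { dg-options "-O2 -march=atom -mtune=atom" } */
extern void *memset (void *, int, __SIZE_TYPE__);
char dst[64];

void
do_set (void)
{
  memset (dst, 0, 64);
}

/* { dg-final { scan-assembler "%xmm" { target lp64 } } } */
/* { dg-final { scan-assembler "movq" { target ia32 } } } */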

[-- Attachment #2: memfunc-be-3.patch --]
[-- Type: application/octet-stream, Size: 81450 bytes --]

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 2c53423..d7c4330 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
   COSTS_N_BYTES (2),			/* cost of FABS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FCHS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -632,10 +636,14 @@ struct processor_costs i386_cost = {	/* 386 specific costs */
   COSTS_N_INSNS (22),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (24),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (122),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -704,10 +712,14 @@ struct processor_costs i486_cost = {	/* 486 specific costs */
   COSTS_N_INSNS (3),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (3),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (83),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -774,10 +786,14 @@ struct processor_costs pentium_cost = {
   COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (70),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{-1, rep_prefix_4_byte}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -849,12 +865,18 @@ struct processor_costs pentiumpro_cost = {
      noticeable win, for bigger blocks either rep movsl or rep movsb is
      way to go.  Rep movsb has apparently more expensive startup time in CPU,
      but after 4K the difference is down in the noise.  */
-  {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
+  {{{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
 			{8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_4_byte, {{1024, unrolled_loop},
-  			{8192, rep_prefix_4_byte}, {-1, libcall}}},
+   {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_4_byte, {{1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_4_byte, {{1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -922,10 +944,14 @@ struct processor_costs geode_cost = {
   COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (54),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -995,10 +1021,14 @@ struct processor_costs k6_cost = {
   COSTS_N_INSNS (2),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (2),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (56),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1068,10 +1098,14 @@ struct processor_costs athlon_cost = {
   /* For some reason, Athlon deals better with REP prefix (relative to loops)
      compared to K8. Alignment becomes important after 8 bytes for memcpy and
      128 bytes for memset.  */
-  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1146,11 +1180,16 @@ struct processor_costs k8_cost = {
   /* K8 has optimized REP instruction for medium sized blocks, but for very
      small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1233,11 +1272,16 @@ struct processor_costs amdfam10_cost = {
   /* AMDFAM10 has optimized REP instruction for medium sized blocks, but for
      very small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1320,11 +1364,16 @@ struct processor_costs bdver1_cost = {
   /*  BDVER1 has optimized REP instruction for medium sized blocks, but for
       very small blocks it is better to use loop. For large blocks, libcall
       can do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   6,					/* scalar_stmt_cost.  */
   4,					/* scalar load_cost.  */
   4,					/* scalar_store_cost.  */
@@ -1407,11 +1456,16 @@ struct processor_costs bdver2_cost = {
   /*  BDVER2 has optimized REP instruction for medium sized blocks, but for
       very small blocks it is better to use loop. For large blocks, libcall
       can do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+  {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   6,					/* scalar_stmt_cost.  */
   4,					/* scalar load_cost.  */
   4,					/* scalar_store_cost.  */
@@ -1489,11 +1543,16 @@ struct processor_costs btver1_cost = {
   /* BTVER1 has optimized REP instruction for medium sized blocks, but for
      very small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1560,11 +1619,18 @@ struct processor_costs pentium4_cost = {
   COSTS_N_INSNS (2),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (2),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (43),			/* cost of FSQRT instruction.  */
-  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+
+  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+
+  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
    {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1631,13 +1697,22 @@ struct processor_costs nocona_cost = {
   COSTS_N_INSNS (3),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (3),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (44),			/* cost of FSQRT instruction.  */
-  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+
+  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
    {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
 	      {100000, unrolled_loop}, {-1, libcall}}}},
-  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
+	      {100000, unrolled_loop}, {-1, libcall}}}}},
+
+  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
    {-1, libcall}}},
    {libcall, {{24, loop}, {64, unrolled_loop},
 	      {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {-1, libcall}}},
+   {libcall, {{24, loop}, {64, unrolled_loop},
+	      {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1704,13 +1779,21 @@ struct processor_costs atom_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{11, loop}, {-1, rep_prefix_4_byte}}},
-   {libcall, {{32, loop}, {64, rep_prefix_4_byte},
-	  {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {15, unrolled_loop},
-	  {2048, rep_prefix_4_byte}, {-1, libcall}}},
-   {libcall, {{24, loop}, {32, unrolled_loop},
-	  {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+
+  /* stringop_algs for memcpy.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
+    {libcall, {{4096, unrolled_loop}, {-1, libcall}}}},
+   {{libcall, {{-1, libcall}}},			       /* Unknown alignment.  */
+    {libcall, {{2048, unrolled_loop},
+	       {-1, libcall}}}}},
+
+  /* stringop_algs for memset.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
+    {libcall, {{4096, unrolled_loop}, {-1, libcall}}}},
+   {{libcall, {{1024, unrolled_loop},		       /* Unknown alignment.  */
+	       {-1, libcall}}},
+    {libcall, {{2048, unrolled_loop},
+	       {-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1784,10 +1867,16 @@ struct processor_costs generic64_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {DUMMY_STRINGOP_ALGS,
-   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {DUMMY_STRINGOP_ALGS,
-   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+
+  {{DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}},
+   {DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}}},
+
+  {{DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}},
+   {DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1856,10 +1945,16 @@ struct processor_costs generic32_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+  /* stringop_algs for memcpy.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  /* stringop_algs for memset.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -2537,6 +2632,7 @@ static void ix86_set_current_function (tree);
 static unsigned int ix86_minimum_incoming_stack_boundary (bool);
 
 static enum calling_abi ix86_function_abi (const_tree);
+static rtx promote_duplicated_reg (enum machine_mode, rtx);
 
 \f
 #ifndef SUBTARGET32_DEFAULT_CPU
@@ -15266,6 +15362,38 @@ ix86_expand_move (enum machine_mode mode, rtx operands[])
     }
   else
     {
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE2
+	  && MEM_P (op0)
+	  && MEM_P (op1)
+	  && !push_operand (op0, mode)
+	  && can_create_pseudo_p ())
+	{
+	  rtx temp = gen_reg_rtx (V2DImode);
+	  emit_insn (gen_sse2_loadq (temp, op1));
+	  emit_insn (gen_sse_storeq (op0, temp));
+	  return;
+	}
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE
+	  && !MEM_P (op1)
+	  && GET_MODE (op1) == V2DImode)
+	{
+	  emit_insn (gen_sse_storeq (op0, op1));
+	  return;
+	}
+      if (mode == TImode
+	  && TARGET_AVX2
+	  && MEM_P (op0)
+	  && !MEM_P (op1)
+	  && GET_MODE (op1) == V4DImode)
+	{
+	  op0 = convert_to_mode (V2DImode, op0, 1);
+	  emit_insn (gen_vec_extract_lo_v4di (op0, op1));
+	  return;
+	}
       if (MEM_P (op0)
 	  && (PUSH_ROUNDING (GET_MODE_SIZE (mode)) != GET_MODE_SIZE (mode)
 	      || !push_operand (op0, mode))
@@ -20677,22 +20805,17 @@ counter_mode (rtx count_exp)
   return SImode;
 }
 
-/* When SRCPTR is non-NULL, output simple loop to move memory
-   pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
-   overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
-   equivalent loop to set memory by VALUE (supposed to be in MODE).
-
-   The size is rounded down to whole number of chunk size moved at once.
-   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
-
-
-static void
-expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
-			       rtx destptr, rtx srcptr, rtx value,
-			       rtx count, enum machine_mode mode, int unroll,
-			       int expected_size)
+/* Helper function for expand_set_or_movmem_via_loop.
+   This function can reuse the iter rtx from another loop and does not
+   generate code for updating the addresses.  */
+static rtx
+expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
+					 rtx destptr, rtx srcptr, rtx value,
+					 rtx count, rtx iter,
+					 enum machine_mode mode, int unroll,
+					 int expected_size, bool change_ptrs)
 {
-  rtx out_label, top_label, iter, tmp;
+  rtx out_label, top_label, tmp;
   enum machine_mode iter_mode = counter_mode (count);
   rtx piece_size = GEN_INT (GET_MODE_SIZE (mode) * unroll);
   rtx piece_size_mask = GEN_INT (~((GET_MODE_SIZE (mode) * unroll) - 1));
@@ -20700,10 +20823,12 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
   rtx x_addr;
   rtx y_addr;
   int i;
+  bool reuse_iter = (iter != NULL_RTX);
 
   top_label = gen_label_rtx ();
   out_label = gen_label_rtx ();
-  iter = gen_reg_rtx (iter_mode);
+  if (!reuse_iter)
+    iter = gen_reg_rtx (iter_mode);
 
   size = expand_simple_binop (iter_mode, AND, count, piece_size_mask,
 			      NULL, 1, OPTAB_DIRECT);
@@ -20714,18 +20839,21 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
 			       true, out_label);
       predict_jump (REG_BR_PROB_BASE * 10 / 100);
     }
-  emit_move_insn (iter, const0_rtx);
+  if (!reuse_iter)
+    emit_move_insn (iter, const0_rtx);
 
   emit_label (top_label);
 
   tmp = convert_modes (Pmode, iter_mode, iter, true);
   x_addr = gen_rtx_PLUS (Pmode, destptr, tmp);
-  destmem = change_address (destmem, mode, x_addr);
+  destmem =
+    adjust_automodify_address_1 (copy_rtx (destmem), mode, x_addr, 0, 1);
 
   if (srcmem)
     {
       y_addr = gen_rtx_PLUS (Pmode, srcptr, copy_rtx (tmp));
-      srcmem = change_address (srcmem, mode, y_addr);
+      srcmem =
+	adjust_automodify_address_1 (copy_rtx (srcmem), mode, y_addr, 0, 1);
 
       /* When unrolling for chips that reorder memory reads and writes,
 	 we can save registers by using single temporary.
@@ -20797,19 +20925,43 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
     }
   else
     predict_jump (REG_BR_PROB_BASE * 80 / 100);
-  iter = ix86_zero_extend_to_Pmode (iter);
-  tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
-			     true, OPTAB_LIB_WIDEN);
-  if (tmp != destptr)
-    emit_move_insn (destptr, tmp);
-  if (srcptr)
+  if (change_ptrs)
     {
-      tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
+      iter = ix86_zero_extend_to_Pmode (iter);
+      tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
 				 true, OPTAB_LIB_WIDEN);
-      if (tmp != srcptr)
-	emit_move_insn (srcptr, tmp);
+      if (tmp != destptr)
+	emit_move_insn (destptr, tmp);
+      if (srcptr)
+	{
+	  tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
+				     true, OPTAB_LIB_WIDEN);
+	  if (tmp != srcptr)
+	    emit_move_insn (srcptr, tmp);
+	}
     }
   emit_label (out_label);
+  return iter;
+}
+
+/* When SRCPTR is non-NULL, output simple loop to move memory
+   pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
+   overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
+   equivalent loop to set memory by VALUE (supposed to be in MODE).
+
+   The size is rounded down to whole number of chunk size moved at once.
+   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
+
+static void
+expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
+			       rtx destptr, rtx srcptr, rtx value,
+			       rtx count, enum machine_mode mode, int unroll,
+			       int expected_size)
+{
+  expand_set_or_movmem_via_loop_with_iter (destmem, srcmem,
+				 destptr, srcptr, value,
+				 count, NULL_RTX, mode, unroll,
+				 expected_size, true);
 }
 
 /* Output "rep; mov" instruction.
@@ -20913,7 +21065,27 @@ emit_strmov (rtx destmem, rtx srcmem,
   emit_insn (gen_strmov (destptr, dest, srcptr, src));
 }
 
-/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST.  */
+/* Emit strset instruction.  If RHS is a constant and a vector mode will be
+   used, move this constant to a vector register before emitting strset.  */
+static void
+emit_strset (rtx destmem, rtx value,
+	     rtx destptr, enum machine_mode mode, int offset)
+{
+  rtx dest = adjust_automodify_address_nv (destmem, mode, destptr, offset);
+  rtx vec_reg;
+  if (vector_extensions_used_for_mode (mode) && CONSTANT_P (value))
+    {
+      if (mode == DImode)
+	mode = TARGET_64BIT ? V2DImode : V4SImode;
+      vec_reg = gen_reg_rtx (mode);
+      emit_move_insn (vec_reg, value);
+      emit_insn (gen_strset (destptr, dest, vec_reg));
+    }
+  else
+    emit_insn (gen_strset (destptr, dest, value));
+}
+
+/* Output code to copy (count % max_size) bytes from SRC to DEST.  */
 static void
 expand_movmem_epilogue (rtx destmem, rtx srcmem,
 			rtx destptr, rtx srcptr, rtx count, int max_size)
@@ -20924,43 +21096,55 @@ expand_movmem_epilogue (rtx destmem, rtx srcmem,
       HOST_WIDE_INT countval = INTVAL (count);
       int offset = 0;
 
-      if ((countval & 0x10) && max_size > 16)
+      int remainder_size = countval % max_size;
+      enum machine_mode move_mode = Pmode;
+
+      /* First, try to move data with the widest possible mode.
+	 The remaining part will be moved using Pmode and narrower modes.  */
+      if (TARGET_SSE)
 	{
-	  if (TARGET_64BIT)
-	    {
-	      emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
-	      emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset + 8);
-	    }
-	  else
-	    gcc_unreachable ();
-	  offset += 16;
+	  if (max_size >= GET_MODE_SIZE (V4SImode))
+	    move_mode = V4SImode;
+	  else if (max_size >= GET_MODE_SIZE (DImode))
+	    move_mode = DImode;
 	}
-      if ((countval & 0x08) && max_size > 8)
+
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-	  if (TARGET_64BIT)
-	    emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
-	  else
-	    {
-	      emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
-	      emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset + 4);
-	    }
-	  offset += 8;
+	  emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
 	}
-      if ((countval & 0x04) && max_size > 4)
+
+      /* Move the remaining part of the epilogue - its size might be up to
+	 the size of the widest mode.  */
+      move_mode = Pmode;
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
+	}
+
+      if (remainder_size >= 4)
+	{
+	  emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
 	  offset += 4;
+	  remainder_size -= 4;
 	}
-      if ((countval & 0x02) && max_size > 2)
+      if (remainder_size >= 2)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
 	  offset += 2;
+	  remainder_size -= 2;
 	}
-      if ((countval & 0x01) && max_size > 1)
+      if (remainder_size >= 1)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
 	  offset += 1;
+	  remainder_size -= 1;
 	}
+      gcc_assert (remainder_size == 0);
       return;
     }
   if (max_size > 8)
@@ -21066,87 +21250,122 @@ expand_setmem_epilogue_via_loop (rtx destmem, rtx destptr, rtx value,
 				 1, max_size / 2);
 }
 
-/* Output code to set at most count & (max_size - 1) bytes starting by DEST.  */
+/* Output code to set at most count & (max_size - 1) bytes starting at
+   DESTMEM.  */
 static void
-expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_size)
+expand_setmem_epilogue (rtx destmem, rtx destptr, rtx promoted_to_vector_value,
+			rtx value, rtx count, int max_size)
 {
-  rtx dest;
-
   if (CONST_INT_P (count))
     {
       HOST_WIDE_INT countval = INTVAL (count);
       int offset = 0;
 
-      if ((countval & 0x10) && max_size > 16)
+      int remainder_size = countval % max_size;
+      enum machine_mode move_mode = Pmode;
+      enum machine_mode sse_mode = TARGET_64BIT ? V2DImode : V4SImode;
+      rtx promoted_value = NULL_RTX;
+
+      /* First, try to move data with the widest possible mode.
+	 The remaining part will be moved using Pmode and narrower modes.  */
+      if (TARGET_SSE)
 	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset + 8);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    gcc_unreachable ();
-	  offset += 16;
+	  if (max_size >= GET_MODE_SIZE (sse_mode))
+	    move_mode = sse_mode;
+	  else if (max_size >= GET_MODE_SIZE (DImode))
+	    move_mode = DImode;
+	  if (!VECTOR_MODE_P (GET_MODE (promoted_to_vector_value)))
+	    promoted_to_vector_value = NULL_RTX;
 	}
-      if ((countval & 0x08) && max_size > 8)
+
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    {
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset + 4);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  offset += 8;
+	  if (GET_MODE (destmem) != move_mode)
+	    destmem = change_address (destmem, move_mode, destptr);
+	  if (!promoted_to_vector_value)
+	    promoted_to_vector_value =
+	      targetm.promote_rtx_for_memset (move_mode, value);
+	  emit_strset (destmem, promoted_to_vector_value, destptr,
+		       move_mode, offset);
+
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
+	}
+
+      /* Move the remaining part of the epilogue - its size might be up to
+	 the size of the widest mode.  */
+      move_mode = Pmode;
+      promoted_value = NULL_RTX;
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
+	{
+	  if (!promoted_value)
+	    promoted_value = promote_duplicated_reg (move_mode, value);
+	  emit_strset (destmem, promoted_value, destptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
 	}
-      if ((countval & 0x04) && max_size > 4)
+
+      if (!promoted_value)
+	promoted_value = promote_duplicated_reg (move_mode, value);
+      if (remainder_size >= 4)
 	{
-	  dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
+	  emit_strset (destmem, gen_lowpart (SImode, promoted_value), destptr,
+		       SImode, offset);
 	  offset += 4;
+	  remainder_size -= 4;
 	}
-      if ((countval & 0x02) && max_size > 2)
+      if (remainder_size >= 2)
 	{
-	  dest = adjust_automodify_address_nv (destmem, HImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
-	  offset += 2;
+	  emit_strset (destmem, gen_lowpart (HImode, promoted_value), destptr,
+		       HImode, offset);
+	  offset += 2;
+	  remainder_size -= 2;
 	}
-      if ((countval & 0x01) && max_size > 1)
+      if (remainder_size >= 1)
 	{
-	  dest = adjust_automodify_address_nv (destmem, QImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
+	  emit_strset (destmem, gen_lowpart (QImode, promoted_value), destptr,
+		       QImode, offset);
 	  offset += 1;
+	  remainder_size -= 1;
 	}
+      gcc_assert (remainder_size == 0);
       return;
     }
+
+  /* count isn't const.  */
   if (max_size > 32)
     {
-      expand_setmem_epilogue_via_loop (destmem, destptr, value, count, max_size);
+      expand_setmem_epilogue_via_loop (destmem, destptr, value, count,
+				       max_size);
       return;
     }
+  /* If it turned out that the value was promoted to a non-vector register,
+     we can reuse it.  */
+  if (!VECTOR_MODE_P (GET_MODE (promoted_to_vector_value)))
+    value = promoted_to_vector_value;
+
   if (max_size > 16)
     {
       rtx label = ix86_expand_aligntest (count, 16, true);
       if (TARGET_64BIT)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
@@ -21156,14 +21375,17 @@ expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_
       rtx label = ix86_expand_aligntest (count, 8, true);
       if (TARGET_64BIT)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
@@ -21171,24 +21393,24 @@ expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_
   if (max_size > 4)
     {
       rtx label = ix86_expand_aligntest (count, 4, true);
-      dest = change_address (destmem, SImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
+      destmem = change_address (destmem, SImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
   if (max_size > 2)
     {
       rtx label = ix86_expand_aligntest (count, 2, true);
-      dest = change_address (destmem, HImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
+      destmem = change_address (destmem, HImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (HImode, value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
   if (max_size > 1)
     {
       rtx label = ix86_expand_aligntest (count, 1, true);
-      dest = change_address (destmem, QImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
+      destmem = change_address (destmem, QImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (QImode, value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
@@ -21204,8 +21426,8 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
   if (align <= 1 && desired_alignment > 1)
     {
       rtx label = ix86_expand_aligntest (destptr, 1, false);
-      srcmem = change_address (srcmem, QImode, srcptr);
-      destmem = change_address (destmem, QImode, destptr);
+      srcmem = adjust_automodify_address_1 (srcmem, QImode, srcptr, 0, 1);
+      destmem = adjust_automodify_address_1 (destmem, QImode, destptr, 0, 1);
       emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
       ix86_adjust_counter (count, 1);
       emit_label (label);
@@ -21214,8 +21436,8 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
   if (align <= 2 && desired_alignment > 2)
     {
       rtx label = ix86_expand_aligntest (destptr, 2, false);
-      srcmem = change_address (srcmem, HImode, srcptr);
-      destmem = change_address (destmem, HImode, destptr);
+      srcmem = adjust_automodify_address_1 (srcmem, HImode, srcptr, 0, 1);
+      destmem = adjust_automodify_address_1 (destmem, HImode, destptr, 0, 1);
       emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
       ix86_adjust_counter (count, 2);
       emit_label (label);
@@ -21224,14 +21446,34 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
   if (align <= 4 && desired_alignment > 4)
     {
       rtx label = ix86_expand_aligntest (destptr, 4, false);
-      srcmem = change_address (srcmem, SImode, srcptr);
-      destmem = change_address (destmem, SImode, destptr);
+      srcmem = adjust_automodify_address_1 (srcmem, SImode, srcptr, 0, 1);
+      destmem = adjust_automodify_address_1 (destmem, SImode, destptr, 0, 1);
       emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
       ix86_adjust_counter (count, 4);
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
-  gcc_assert (desired_alignment <= 8);
+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  srcmem = adjust_automodify_address_1 (srcmem, DImode, srcptr, 0, 1);
+	  destmem = adjust_automodify_address_1 (destmem, DImode, destptr, 0, 1);
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	}
+      else
+	{
+	  srcmem = adjust_automodify_address_1 (srcmem, SImode, srcptr, 0, 1);
+	  destmem = adjust_automodify_address_1 (destmem, SImode, destptr, 0, 1);
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	}
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);
 }
 
 /* Copy enough from DST to SRC to align DST known to DESIRED_ALIGN.
@@ -21286,6 +21528,37 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
       off = 4;
       emit_insn (gen_strmov (destreg, dst, srcreg, src));
     }
+  if (align_bytes & 8)
+    {
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, DImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      else
+	{
+	  dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, SImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
+	set_mem_align (dst, 8 * BITS_PER_UNIT);
+      if (src_align_bytes >= 0)
+	{
+	  unsigned int src_align = 0;
+	  if ((src_align_bytes & 7) == (align_bytes & 7))
+	    src_align = 8;
+	  else if ((src_align_bytes & 3) == (align_bytes & 3))
+	    src_align = 4;
+	  else if ((src_align_bytes & 1) == (align_bytes & 1))
+	    src_align = 2;
+	  if (MEM_ALIGN (src) < src_align * BITS_PER_UNIT)
+	    set_mem_align (src, src_align * BITS_PER_UNIT);
+	}
+      off = 8;
+    }
   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
   src = adjust_automodify_address_nv (src, BLKmode, srcreg, off);
   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
@@ -21293,7 +21566,9 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
   if (src_align_bytes >= 0)
     {
       unsigned int src_align = 0;
-      if ((src_align_bytes & 7) == (align_bytes & 7))
+      if ((src_align_bytes & 15) == (align_bytes & 15))
+	src_align = 16;
+      else if ((src_align_bytes & 7) == (align_bytes & 7))
 	src_align = 8;
       else if ((src_align_bytes & 3) == (align_bytes & 3))
 	src_align = 4;
@@ -21321,7 +21596,7 @@ expand_setmem_prologue (rtx destmem, rtx destptr, rtx value, rtx count,
   if (align <= 1 && desired_alignment > 1)
     {
       rtx label = ix86_expand_aligntest (destptr, 1, false);
-      destmem = change_address (destmem, QImode, destptr);
+      destmem = adjust_automodify_address_1 (destmem, QImode, destptr, 0, 1);
       emit_insn (gen_strset (destptr, destmem, gen_lowpart (QImode, value)));
       ix86_adjust_counter (count, 1);
       emit_label (label);
@@ -21330,7 +21605,7 @@ expand_setmem_prologue (rtx destmem, rtx destptr, rtx value, rtx count,
   if (align <= 2 && desired_alignment > 2)
     {
       rtx label = ix86_expand_aligntest (destptr, 2, false);
-      destmem = change_address (destmem, HImode, destptr);
+      destmem = adjust_automodify_address_1 (destmem, HImode, destptr, 0, 1);
       emit_insn (gen_strset (destptr, destmem, gen_lowpart (HImode, value)));
       ix86_adjust_counter (count, 2);
       emit_label (label);
@@ -21339,13 +21614,23 @@ expand_setmem_prologue (rtx destmem, rtx destptr, rtx value, rtx count,
   if (align <= 4 && desired_alignment > 4)
     {
       rtx label = ix86_expand_aligntest (destptr, 4, false);
-      destmem = change_address (destmem, SImode, destptr);
+      destmem = adjust_automodify_address_1 (destmem, SImode, destptr, 0, 1);
       emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
       ix86_adjust_counter (count, 4);
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
-  gcc_assert (desired_alignment <= 8);
+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      destmem = adjust_automodify_address_1 (destmem, SImode, destptr, 0, 1);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);
 }
 
 /* Set enough from DST to align DST known to by aligned by ALIGN to
@@ -21381,6 +21666,19 @@ expand_constant_setmem_prologue (rtx dst, rtx destreg, rtx value,
       emit_insn (gen_strset (destreg, dst,
 			     gen_lowpart (SImode, value)));
     }
+  if (align_bytes & 8)
+    {
+      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+      emit_insn (gen_strset (destreg, dst,
+	    gen_lowpart (SImode, value)));
+      off = 4;
+      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+      emit_insn (gen_strset (destreg, dst,
+	    gen_lowpart (SImode, value)));
+      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
+	set_mem_align (dst, 8 * BITS_PER_UNIT);
+      off = 4;
+    }
   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
     set_mem_align (dst, desired_align * BITS_PER_UNIT);
@@ -21392,7 +21690,7 @@ expand_constant_setmem_prologue (rtx dst, rtx destreg, rtx value,
 /* Given COUNT and EXPECTED_SIZE, decide on codegen of string operation.  */
 static enum stringop_alg
 decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
-	    int *dynamic_check)
+	    int *dynamic_check, bool align_unknown)
 {
   const struct stringop_algs * algs;
   bool optimize_for_speed;
@@ -21401,7 +21699,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
      consider such algorithms if the user has appropriated those
      registers for their own purposes.	*/
   bool rep_prefix_usable = !(fixed_regs[CX_REG] || fixed_regs[DI_REG]
-                             || (memset
+			     || (memset
 				 ? fixed_regs[AX_REG] : fixed_regs[SI_REG]));
 
 #define ALG_USABLE_P(alg) (rep_prefix_usable			\
@@ -21414,7 +21712,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
      of time processing large blocks.  */
   if (optimize_function_for_size_p (cfun)
       || (optimize_insn_for_size_p ()
-          && expected_size != -1 && expected_size < 256))
+	  && expected_size != -1 && expected_size < 256))
     optimize_for_speed = false;
   else
     optimize_for_speed = true;
@@ -21423,9 +21721,9 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
 
   *dynamic_check = -1;
   if (memset)
-    algs = &cost->memset[TARGET_64BIT != 0];
+    algs = &cost->memset[align_unknown][TARGET_64BIT != 0];
   else
-    algs = &cost->memcpy[TARGET_64BIT != 0];
+    algs = &cost->memcpy[align_unknown][TARGET_64BIT != 0];
   if (ix86_stringop_alg != no_stringop && ALG_USABLE_P (ix86_stringop_alg))
     return ix86_stringop_alg;
   /* rep; movq or rep; movl is the smallest variant.  */
@@ -21489,29 +21787,33 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
       enum stringop_alg alg;
       int i;
       bool any_alg_usable_p = true;
+      bool only_libcall_fits = true;
 
       for (i = 0; i < MAX_STRINGOP_ALGS; i++)
-        {
-          enum stringop_alg candidate = algs->size[i].alg;
-          any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
+	{
+	  enum stringop_alg candidate = algs->size[i].alg;
+	  any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
 
-          if (candidate != libcall && candidate
-              && ALG_USABLE_P (candidate))
-              max = algs->size[i].max;
-        }
+	  if (candidate != libcall && candidate
+	      && ALG_USABLE_P (candidate))
+	    {
+	      max = algs->size[i].max;
+	      only_libcall_fits = false;
+	    }
+	}
       /* If there aren't any usable algorithms, then recursing on
-         smaller sizes isn't going to find anything.  Just return the
-         simple byte-at-a-time copy loop.  */
-      if (!any_alg_usable_p)
-        {
-          /* Pick something reasonable.  */
-          if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
-            *dynamic_check = 128;
-          return loop_1_byte;
-        }
+	 smaller sizes isn't going to find anything.  Just return the
+	 simple byte-at-a-time copy loop.  */
+      if (!any_alg_usable_p || only_libcall_fits)
+	{
+	  /* Pick something reasonable.  */
+	  if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
+	    *dynamic_check = 128;
+	  return loop_1_byte;
+	}
       if (max == -1)
 	max = 4096;
-      alg = decide_alg (count, max / 2, memset, dynamic_check);
+      alg = decide_alg (count, max / 2, memset, dynamic_check, align_unknown);
       gcc_assert (*dynamic_check == -1);
       gcc_assert (alg != libcall);
       if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
@@ -21535,9 +21837,11 @@ decide_alignment (int align,
       case no_stringop:
 	gcc_unreachable ();
       case loop:
-      case unrolled_loop:
 	desired_align = GET_MODE_SIZE (Pmode);
 	break;
+      case unrolled_loop:
+	desired_align = GET_MODE_SIZE (TARGET_SSE ? V4SImode : Pmode);
+	break;
       case rep_prefix_8_byte:
 	desired_align = 8;
 	break;
@@ -21625,6 +21929,11 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
   enum stringop_alg alg;
   int dynamic_check;
   bool need_zero_guard = false;
+  bool align_unknown;
+  int unroll_factor;
+  enum machine_mode move_mode;
+  rtx loop_iter = NULL_RTX;
+  int dst_offset, src_offset;
 
   if (CONST_INT_P (align_exp))
     align = INTVAL (align_exp);
@@ -21648,9 +21957,17 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 
   /* Step 0: Decide on preferred algorithm, desired alignment and
      size of chunks to be copied by main loop.  */
-
-  alg = decide_alg (count, expected_size, false, &dynamic_check);
+  dst_offset = get_mem_align_offset (dst, MOVE_MAX*BITS_PER_UNIT);
+  src_offset = get_mem_align_offset (src, MOVE_MAX*BITS_PER_UNIT);
+  align_unknown = (dst_offset < 0
+		   || src_offset < 0
+		   || src_offset != dst_offset);
+  alg = decide_alg (count, expected_size, false, &dynamic_check, align_unknown);
   desired_align = decide_alignment (align, alg, expected_size);
+  if (align_unknown)
+    desired_align = align;
+  unroll_factor = 1;
+  move_mode = Pmode;
 
   if (!TARGET_ALIGN_STRINGOPS)
     align = desired_align;
@@ -21669,11 +21986,16 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
       gcc_unreachable ();
     case loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode);
+      move_mode = Pmode;
+      unroll_factor = 1;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * (TARGET_64BIT ? 4 : 2);
+      /* Use SSE instructions, if possible.  */
+      move_mode = TARGET_SSE ? (align_unknown ? DImode : V4SImode) : Pmode;
+      unroll_factor = 4;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
       size_needed = 8;
@@ -21785,6 +22107,8 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 	  dst = change_address (dst, BLKmode, destreg);
 	  expand_movmem_prologue (dst, src, destreg, srcreg, count_exp, align,
 				  desired_align);
+	  set_mem_align (src, desired_align*BITS_PER_UNIT);
+	  set_mem_align (dst, desired_align*BITS_PER_UNIT);
 	}
       else
 	{
@@ -21842,11 +22166,14 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 				     count_exp, Pmode, 1, expected_size);
       break;
     case unrolled_loop:
-      /* Unroll only by factor of 2 in 32bit mode, since we don't have enough
-	 registers for 4 temporaries anyway.  */
-      expand_set_or_movmem_via_loop (dst, src, destreg, srcreg, NULL,
-				     count_exp, Pmode, TARGET_64BIT ? 4 : 2,
-				     expected_size);
+      /* In some cases we want to use the same iterator in several adjacent
+	 loops, so here we save loop iterator rtx and don't update addresses.  */
+      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
+							   srcreg, NULL,
+							   count_exp, NULL_RTX,
+							   move_mode,
+							   unroll_factor,
+							   expected_size, false);
       break;
     case rep_prefix_8_byte:
       expand_movmem_via_rep_mov (dst, src, destreg, srcreg, count_exp,
@@ -21897,9 +22224,43 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
       LABEL_NUSES (label) = 1;
     }
 
+  /* We have not updated the addresses yet, so do it now.
+     Also, if the epilogue seems to be big, generate a (non-unrolled) loop
+     in it.  We do that only if the alignment is unknown, because in that
+     case the epilogue has to copy the tail byte by byte, which is very
+     slow.  */
+  if (alg == unrolled_loop)
+    {
+      rtx tmp;
+      if (align_unknown && unroll_factor > 1)
+	{
+	  /* Reduce the epilogue's size by emitting a non-unrolled loop.  Without
+	     this the epilogue can get very big: when the alignment is statically
+	     unknown the epilogue has to go byte by byte, which may be very slow.  */
+	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
+	      srcreg, NULL, count_exp,
+	      loop_iter, move_mode, 1,
+	      expected_size, false);
+	  src = change_address (src, BLKmode, srcreg);
+	  dst = change_address (dst, BLKmode, destreg);
+	  epilogue_size_needed = GET_MODE_SIZE (move_mode);
+	}
+      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != destreg)
+	emit_move_insn (destreg, tmp);
+
+      tmp = expand_simple_binop (Pmode, PLUS, srcreg, loop_iter, srcreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != srcreg)
+	emit_move_insn (srcreg, tmp);
+    }
   if (count_exp != const0_rtx && epilogue_size_needed > 1)
-    expand_movmem_epilogue (dst, src, destreg, srcreg, count_exp,
-			    epilogue_size_needed);
+    {
+      expand_movmem_epilogue (dst, src, destreg, srcreg, count_exp,
+			      epilogue_size_needed);
+    }
+
   if (jump_around_label)
     emit_label (jump_around_label);
   return true;
@@ -21917,7 +22278,37 @@ promote_duplicated_reg (enum machine_mode mode, rtx val)
   rtx tmp;
   int nops = mode == DImode ? 3 : 2;
 
+  if (VECTOR_MODE_P (mode))
+    {
+      enum machine_mode inner = GET_MODE_INNER (mode);
+      rtx promoted_val, vec_reg;
+      if (CONST_INT_P (val))
+	return ix86_build_const_vector (mode, true, val);
+
+      promoted_val = promote_duplicated_reg (inner, val);
+      vec_reg = gen_reg_rtx (mode);
+      switch (mode)
+	{
+	case V2DImode:
+	  emit_insn (gen_vec_dupv2di (vec_reg, promoted_val));
+	  break;
+	case V4SImode:
+	  emit_insn (gen_vec_dupv4si (vec_reg, promoted_val));
+	  break;
+	default:
+	  gcc_unreachable ();
+	  break;
+	}
+
+      return vec_reg;
+    }
   gcc_assert (mode == SImode || mode == DImode);
+  if (mode == DImode && !TARGET_64BIT)
+    {
+      rtx vec_reg = promote_duplicated_reg (V4SImode, val);
+      vec_reg = convert_to_mode (V2DImode, vec_reg, 1);
+      return vec_reg;
+    }
   if (val == const0_rtx)
     return copy_to_mode_reg (mode, const0_rtx);
   if (CONST_INT_P (val))
@@ -21983,11 +22374,21 @@ promote_duplicated_reg (enum machine_mode mode, rtx val)
 static rtx
 promote_duplicated_reg_to_size (rtx val, int size_needed, int desired_align, int align)
 {
-  rtx promoted_val;
+  rtx promoted_val = NULL_RTX;
 
-  if (TARGET_64BIT
-      && (size_needed > 4 || (desired_align > align && desired_align > 4)))
-    promoted_val = promote_duplicated_reg (DImode, val);
+  if (size_needed > 8 || (desired_align > align && desired_align > 8))
+    {
+      gcc_assert (TARGET_SSE);
+      if (TARGET_64BIT)
+	promoted_val = promote_duplicated_reg (V2DImode, val);
+      else
+	promoted_val = promote_duplicated_reg (V4SImode, val);
+    }
+  else if (size_needed > 4 || (desired_align > align && desired_align > 4))
+    {
+      gcc_assert (TARGET_64BIT || TARGET_SSE);
+      promoted_val = promote_duplicated_reg (DImode, val);
+    }
   else if (size_needed > 2 || (desired_align > align && desired_align > 2))
     promoted_val = promote_duplicated_reg (SImode, val);
   else if (size_needed > 1 || (desired_align > align && desired_align > 1))
@@ -22013,12 +22414,17 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
   unsigned HOST_WIDE_INT count = 0;
   HOST_WIDE_INT expected_size = -1;
   int size_needed = 0, epilogue_size_needed;
+  int promote_size_needed = 0;
   int desired_align = 0, align_bytes = 0;
   enum stringop_alg alg;
   rtx promoted_val = NULL;
-  bool force_loopy_epilogue = false;
+  rtx vec_promoted_val = NULL;
   int dynamic_check;
   bool need_zero_guard = false;
+  bool align_unknown;
+  unsigned int unroll_factor;
+  enum machine_mode move_mode;
+  rtx loop_iter = NULL_RTX;
 
   if (CONST_INT_P (align_exp))
     align = INTVAL (align_exp);
@@ -22038,8 +22444,11 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
   /* Step 0: Decide on preferred algorithm, desired alignment and
      size of chunks to be copied by main loop.  */
 
-  alg = decide_alg (count, expected_size, true, &dynamic_check);
+  align_unknown = get_mem_align_offset (dst, BITS_PER_UNIT) < 0;
+  alg = decide_alg (count, expected_size, true, &dynamic_check, align_unknown);
   desired_align = decide_alignment (align, alg, expected_size);
+  unroll_factor = 1;
+  move_mode = Pmode;
 
   if (!TARGET_ALIGN_STRINGOPS)
     align = desired_align;
@@ -22057,11 +22466,21 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       gcc_unreachable ();
     case loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode);
+      move_mode = Pmode;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * 4;
+      /* Use SSE instructions, if possible.  */
+      move_mode = TARGET_SSE
+		  ? (TARGET_64BIT ? V2DImode : V4SImode)
+		  : Pmode;
+      unroll_factor = 1;
+      /* Select maximal available 1,2 or 4 unroll factor.  */
+      while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
+	     && unroll_factor < 4)
+	unroll_factor *= 2;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
       size_needed = 8;
@@ -22078,6 +22497,7 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       break;
     }
   epilogue_size_needed = size_needed;
+  promote_size_needed = GET_MODE_SIZE (Pmode);
 
   /* Step 1: Prologue guard.  */
 
@@ -22106,8 +22526,10 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
      main loop and epilogue (ie one load of the big constant in the
      front of all code.  */
   if (CONST_INT_P (val_exp))
-    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
-						   desired_align, align);
+    promoted_val = promote_duplicated_reg_to_size (val_exp,
+						   promote_size_needed,
+						   promote_size_needed,
+						   align);
   /* Ensure that alignment prologue won't copy past end of block.  */
   if (size_needed > 1 || (desired_align > 1 && desired_align > align))
     {
@@ -22116,12 +22538,6 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 	 Make sure it is power of 2.  */
       epilogue_size_needed = smallest_pow2_greater_than (epilogue_size_needed);
 
-      /* To improve performance of small blocks, we jump around the VAL
-	 promoting mode.  This mean that if the promoted VAL is not constant,
-	 we might not use it in the epilogue and have to use byte
-	 loop variant.  */
-      if (epilogue_size_needed > 2 && !promoted_val)
-        force_loopy_epilogue = true;
       if (count)
 	{
 	  if (count < (unsigned HOST_WIDE_INT)epilogue_size_needed)
@@ -22162,8 +22578,10 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 
   /* Do the expensive promotion once we branched off the small blocks.  */
   if (!promoted_val)
-    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
-						   desired_align, align);
+    promoted_val = promote_duplicated_reg_to_size (val_exp,
+						   promote_size_needed,
+						   promote_size_needed,
+						   align);
   gcc_assert (desired_align >= 1 && align >= 1);
 
   if (desired_align > align)
@@ -22177,6 +22595,7 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 	  dst = change_address (dst, BLKmode, destreg);
 	  expand_setmem_prologue (dst, destreg, promoted_val, count_exp, align,
 				  desired_align);
+	  set_mem_align (dst, desired_align*BITS_PER_UNIT);
 	}
       else
 	{
@@ -22186,6 +22605,8 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 						 desired_align, align_bytes);
 	  count_exp = plus_constant (count_exp, -align_bytes);
 	  count -= align_bytes;
+	  if (count < (unsigned HOST_WIDE_INT) size_needed)
+	    goto epilogue;
 	}
       if (need_zero_guard
 	  && (count < (unsigned HOST_WIDE_INT) size_needed
@@ -22227,7 +22648,7 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
     case no_stringop:
       gcc_unreachable ();
     case loop_1_byte:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
+      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, val_exp,
 				     count_exp, QImode, 1, expected_size);
       break;
     case loop:
@@ -22235,8 +22656,14 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 				     count_exp, Pmode, 1, expected_size);
       break;
     case unrolled_loop:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
-				     count_exp, Pmode, 4, expected_size);
+      vec_promoted_val =
+	promote_duplicated_reg_to_size (promoted_val,
+					GET_MODE_SIZE (move_mode),
+					desired_align, align);
+      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+				     NULL, vec_promoted_val, count_exp,
+				     NULL_RTX, move_mode, unroll_factor,
+				     expected_size, false);
       break;
     case rep_prefix_8_byte:
       expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
@@ -22280,15 +22707,29 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       LABEL_NUSES (label) = 1;
     }
  epilogue:
-  if (count_exp != const0_rtx && epilogue_size_needed > 1)
+  if (alg == unrolled_loop)
     {
-      if (force_loopy_epilogue)
-	expand_setmem_epilogue_via_loop (dst, destreg, val_exp, count_exp,
-					 epilogue_size_needed);
-      else
-	expand_setmem_epilogue (dst, destreg, promoted_val, count_exp,
-				epilogue_size_needed);
+      rtx tmp;
+      if (align_unknown && unroll_factor > 1)
+	{
+	  /* Reduce the epilogue's size by emitting a non-unrolled loop.  Without
+	     this the epilogue can get very big: when the alignment is statically
+	     unknown the epilogue has to go byte by byte, which may be very slow.  */
+	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+	      NULL, vec_promoted_val, count_exp,
+	      loop_iter, move_mode, 1,
+	      expected_size, false);
+	  dst = change_address (dst, BLKmode, destreg);
+	  epilogue_size_needed = GET_MODE_SIZE (move_mode);
+	}
+      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != destreg)
+	emit_move_insn (destreg, tmp);
     }
+  if (count_exp != const0_rtx && epilogue_size_needed > 1)
+    expand_setmem_epilogue (dst, destreg, promoted_val, val_exp, count_exp,
+			    epilogue_size_needed);
   if (jump_around_label)
     emit_label (jump_around_label);
   return true;
@@ -37436,6 +37877,92 @@ ix86_autovectorize_vector_sizes (void)
   return (TARGET_AVX && !TARGET_PREFER_AVX128) ? 32 | 16 : 0;
 }
 
+/* Target hook.  Prevent unaligned access to data in vector modes.  */
+
+static bool
+ix86_slow_unaligned_access (enum machine_mode mode,
+			    unsigned int align)
+{
+  if (TARGET_AVX)
+    {
+      if (GET_MODE_SIZE (mode) == 32)
+	{
+	  if (align <= 16)
+	    return (TARGET_AVX256_SPLIT_UNALIGNED_LOAD ||
+		    TARGET_AVX256_SPLIT_UNALIGNED_STORE);
+	  else
+	    return false;
+	}
+    }
+
+  if (GET_MODE_SIZE (mode) > 8)
+    {
+      return (! TARGET_SSE_UNALIGNED_LOAD_OPTIMAL &&
+	      ! TARGET_SSE_UNALIGNED_STORE_OPTIMAL);
+    }
+
+  return false;
+}
+
+/* Target hook.  Return an rtx of mode MODE holding the value VAL promoted
+   to fill the mode; VAL is supposed to represent one byte.  MODE may be a
+   vector mode.  Examples:
+   1) VAL = const_int (0xAB), MODE = SImode:
+      the result is const_int (0xABABABAB).
+   2) If VAL is not constant, the result is VAL multiplied by
+      const_int (0x01010101) (for SImode).  */
+
+static rtx
+ix86_promote_rtx_for_memset (enum machine_mode mode  ATTRIBUTE_UNUSED,
+			      rtx val)
+{
+  enum machine_mode val_mode = GET_MODE (val);
+  gcc_assert (VALID_INT_MODE_P (val_mode) || val_mode == VOIDmode);
+
+  if (vector_extensions_used_for_mode (mode) && TARGET_SSE)
+    {
+      rtx promoted_val, vec_reg;
+      enum machine_mode vec_mode = VOIDmode;
+      if (TARGET_AVX2)
+	vec_mode = TARGET_64BIT ? V4DImode : V8SImode;
+      else
+	vec_mode = TARGET_64BIT ? V2DImode : V4SImode;
+      gcc_assert (vec_mode != VOIDmode);
+      if (CONST_INT_P (val))
+	{
+	  rtx const_vec;
+	  HOST_WIDE_INT int_val = (UINTVAL (val) & 0xFF)
+				   * (TARGET_64BIT
+				      ? 0x0101010101010101
+				      : 0x01010101);
+	  val = gen_int_mode (int_val, Pmode);
+	  vec_reg = gen_reg_rtx (vec_mode);
+	  const_vec = ix86_build_const_vector (vec_mode, true, val);
+	  if (mode != vec_mode)
+	    const_vec = convert_to_mode (vec_mode, const_vec, 1);
+	  emit_move_insn (vec_reg, const_vec);
+	  return vec_reg;
+	}
+      /* Else: val isn't const.  */
+      promoted_val = promote_duplicated_reg (Pmode, val);
+      vec_reg = gen_reg_rtx (vec_mode);
+      switch (vec_mode)
+	{
+	case V2DImode:
+	  emit_insn (gen_vec_dupv2di (vec_reg, promoted_val));
+	  break;
+	case V4SImode:
+	  emit_insn (gen_vec_dupv4si (vec_reg, promoted_val));
+	  break;
+	default:
+	  gcc_unreachable ();
+	  break;
+	}
+      return vec_reg;
+    }
+  return NULL_RTX;
+}
+
 /* Initialize the GCC target structure.  */
 #undef TARGET_RETURN_IN_MEMORY
 #define TARGET_RETURN_IN_MEMORY ix86_return_in_memory
@@ -37743,6 +38270,12 @@ ix86_autovectorize_vector_sizes (void)
 #undef TARGET_CONDITIONAL_REGISTER_USAGE
 #define TARGET_CONDITIONAL_REGISTER_USAGE ix86_conditional_register_usage
 
+#undef TARGET_SLOW_UNALIGNED_ACCESS
+#define TARGET_SLOW_UNALIGNED_ACCESS ix86_slow_unaligned_access
+
+#undef TARGET_PROMOTE_RTX_FOR_MEMSET
+#define TARGET_PROMOTE_RTX_FOR_MEMSET ix86_promote_rtx_for_memset
+
 #if TARGET_MACHO
 #undef TARGET_INIT_LIBFUNCS
 #define TARGET_INIT_LIBFUNCS darwin_rename_builtins
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index bd69ec2..550b2ab 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -159,8 +159,12 @@ struct processor_costs {
   const int fchs;		/* cost of FCHS instruction.  */
   const int fsqrt;		/* cost of FSQRT instruction.  */
 				/* Specify what algorithm
-				   to use for stringops on unknown size.  */
-  struct stringop_algs memcpy[2], memset[2];
+				   to use for stringops on unknown size.
+				   First index is used to specify whether
+				   alignment is known or not.
+				   Second - to specify whether 32 or 64 bits
+				   are used.  */
+  struct stringop_algs memcpy[2][2], memset[2][2];
   const int scalar_stmt_cost;   /* Cost of any scalar operation, excluding
 				   load and store.  */
   const int scalar_load_cost;   /* Cost of scalar load.  */
@@ -1712,7 +1716,7 @@ typedef struct ix86_args {
 /* If a clear memory operation would take CLEAR_RATIO or more simple
    move-instruction sequences, we will do a clrmem or libcall instead.  */
 
-#define CLEAR_RATIO(speed) ((speed) ? MIN (6, ix86_cost->move_ratio) : 2)
+#define CLEAR_RATIO(speed) ((speed) ? ix86_cost->move_ratio : 2)
 
 /* Define if shifts truncate the shift count which implies one can
    omit a sign-extension or zero-extension of a shift count.
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index ff77003..b8ecc59 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -7426,6 +7426,13 @@
    (set_attr "prefix" "maybe_vex,maybe_vex,orig,orig,vex")
    (set_attr "mode" "TI,TI,V4SF,SF,SF")])
 
+(define_expand "sse2_loadq"
+ [(set (match_operand:V2DI 0 "register_operand")
+       (vec_concat:V2DI
+	 (match_operand:DI 1 "memory_operand")
+	 (const_int 0)))]
+  "!TARGET_64BIT && TARGET_SSE2")
+
 (define_insn_and_split "sse2_stored"
   [(set (match_operand:SI 0 "nonimmediate_operand" "=xm,r")
 	(vec_select:SI
@@ -7537,6 +7544,16 @@
    (set_attr "prefix" "maybe_vex,orig,vex,maybe_vex,orig,orig")
    (set_attr "mode" "V2SF,TI,TI,TI,V4SF,V2SF")])
 
+(define_expand "vec_dupv4si"
+  [(set (match_operand:V4SI 0 "register_operand" "")
+	(vec_duplicate:V4SI
+	  (match_operand:SI 1 "nonimmediate_operand" "")))]
+  "TARGET_SSE"
+{
+  if (!TARGET_AVX)
+    operands[1] = force_reg (V4SImode, operands[1]);
+})
+
 (define_insn "*vec_dupv4si_avx"
   [(set (match_operand:V4SI 0 "register_operand"     "=x,x")
 	(vec_duplicate:V4SI
@@ -7578,6 +7595,16 @@
    (set_attr "prefix" "orig,vex,maybe_vex")
    (set_attr "mode" "TI,TI,DF")])
 
+(define_expand "vec_dupv2di"
+  [(set (match_operand:V2DI 0 "register_operand" "")
+	(vec_duplicate:V2DI
+	  (match_operand:DI 1 "nonimmediate_operand" "")))]
+  "TARGET_SSE"
+{
+  if (!TARGET_AVX)
+    operands[1] = force_reg (V2DImode, operands[1]);
+})
+
 (define_insn "*vec_dupv2di"
   [(set (match_operand:V2DI 0 "register_operand" "=x,x")
 	(vec_duplicate:V2DI
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 90cef1c..4b7d67b 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5780,6 +5780,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum machine_mode @var{mode}, unsigned int @var{align})
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options ask for faster code, the hook can also be
+used to prevent unaligned accesses in some set of modes, even if the
+processor can perform them without trapping.
+
+By default the hook returns the value of the @code{SLOW_UNALIGNED_ACCESS}
+macro if it is defined, and @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
+@deftypefn {Target Hook} rtx TARGET_PROMOTE_RTX_FOR_MEMSET (enum machine_mode @var{mode}, rtx @var{val})
+This hook returns an rtx of mode @var{mode} holding the promoted value
+@var{val}, or @code{NULL} on failure.  The hook generates the instructions
+that are needed to promote @var{val} to mode @var{mode}.
+If those instructions cannot be generated, the hook returns @code{NULL}.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6252,23 +6278,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 187122e..c7e2457 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -5718,6 +5718,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@hook TARGET_SLOW_UNALIGNED_ACCESS
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options ask for faster code, the hook can also be
+used to prevent unaligned accesses in some set of modes, even if the
+processor can perform them without trapping.
+
+By default the hook returns the value of the @code{SLOW_UNALIGNED_ACCESS}
+macro if it is defined, and @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
+@hook TARGET_PROMOTE_RTX_FOR_MEMSET
+This hook returns an rtx of mode @var{mode} holding the promoted value
+@var{val}, or @code{NULL} on failure.  The hook generates the instructions
+that are needed to promote @var{val} to mode @var{mode}.
+If those instructions cannot be generated, the hook returns @code{NULL}.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6190,23 +6216,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/expr.c b/gcc/expr.c
index b020978..5c5002c 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -811,7 +811,7 @@ alignment_for_piecewise_move (unsigned int max_pieces, unsigned int align)
 	   tmode != VOIDmode;
 	   xmode = tmode, tmode = GET_MODE_WIDER_MODE (tmode))
 	if (GET_MODE_SIZE (tmode) > max_pieces
-	    || SLOW_UNALIGNED_ACCESS (tmode, align))
+	    || targetm.slow_unaligned_access (tmode, align))
 	  break;
 
       align = MAX (align, GET_MODE_ALIGNMENT (xmode));
@@ -836,6 +836,48 @@ widest_int_mode_for_size (unsigned int size)
   return mode;
 }
 
+/* If MODE is a scalar mode, find the corresponding preferred vector mode.
+   If no such mode can be found, return the vector mode corresponding to
+   Pmode (a kind of default vector mode).
+   For vector modes return the mode itself.  */
+
+static enum machine_mode
+vector_mode_for_mode (enum machine_mode mode)
+{
+  enum machine_mode xmode;
+  if (VECTOR_MODE_P (mode))
+    return mode;
+  xmode = targetm.vectorize.preferred_simd_mode (mode);
+  if (VECTOR_MODE_P (xmode))
+    return xmode;
+
+  return targetm.vectorize.preferred_simd_mode (Pmode);
+}
+
+/* Check whether vector instructions are required for operating on the
+   specified mode.
+   For vector modes this checks whether the corresponding vector extension
+   is supported.
+   Operations on a scalar mode will use vector extensions if this scalar
+   mode is wider than the default scalar mode (Pmode) and a vector
+   extension for the parent vector mode is available.  */
+
+bool vector_extensions_used_for_mode (enum machine_mode mode)
+{
+  enum machine_mode vector_mode = vector_mode_for_mode (mode);
+
+  if (VECTOR_MODE_P (mode))
+    return targetm.vector_mode_supported_p (mode);
+
+  /* mode is a scalar mode.  */
+  if (VECTOR_MODE_P (vector_mode)
+     && targetm.vector_mode_supported_p (vector_mode)
+     && (GET_MODE_SIZE (mode) > GET_MODE_SIZE (Pmode)))
+    return true;
+
+  return false;
+}
+
 /* STORE_MAX_PIECES is the number of bytes at a time that we can
    store efficiently.  Due to internal GCC limitations, this is
    MOVE_MAX_PIECES limited by the number of bytes GCC can represent
@@ -1680,7 +1722,7 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src, tree type, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (src)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (src))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (src))
 	      || MEM_ALIGN (src) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -2070,7 +2112,7 @@ emit_group_store (rtx orig_dst, rtx src, tree type ATTRIBUTE_UNUSED, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (dest)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (dest))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (dest))
 	      || MEM_ALIGN (dest) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -4034,7 +4076,7 @@ emit_push_insn (rtx x, enum machine_mode mode, tree type, rtx size,
 	  /* Here we avoid the case of a structure whose weak alignment
 	     forces many pushes of a small amount of data,
 	     and such small pushes do rounding that causes trouble.  */
-	  && ((! SLOW_UNALIGNED_ACCESS (word_mode, align))
+	  && ((! targetm.slow_unaligned_access (word_mode, align))
 	      || align >= BIGGEST_ALIGNMENT
 	      || (PUSH_ROUNDING (align / BITS_PER_UNIT)
 		  == (align / BITS_PER_UNIT)))
@@ -6325,7 +6367,7 @@ store_field (rtx target, HOST_WIDE_INT bitsize, HOST_WIDE_INT bitpos,
       || (mode != BLKmode
 	  && ((((MEM_ALIGN (target) < GET_MODE_ALIGNMENT (mode))
 		|| bitpos % GET_MODE_ALIGNMENT (mode))
-	       && SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (target)))
+	       && targetm.slow_unaligned_access (mode, MEM_ALIGN (target)))
 	      || (bitpos % BITS_PER_UNIT != 0)))
       /* If the RHS and field are a constant size and the size of the
 	 RHS isn't the same size as the bitfield, we must use bitfield
@@ -9738,7 +9780,7 @@ expand_expr_real_1 (tree exp, rtx target, enum machine_mode tmode,
 		     && ((modifier == EXPAND_CONST_ADDRESS
 			  || modifier == EXPAND_INITIALIZER)
 			 ? STRICT_ALIGNMENT
-			 : SLOW_UNALIGNED_ACCESS (mode1, MEM_ALIGN (op0))))
+			 : targetm.slow_unaligned_access (mode1, MEM_ALIGN (op0))))
 		    || (bitpos % BITS_PER_UNIT != 0)))
 	    /* If the type and the field are a constant size and the
 	       size of the type isn't the same size as the bitfield,
diff --git a/gcc/rtl.h b/gcc/rtl.h
index f13485e..4ec67c7 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -2513,6 +2513,9 @@ extern void emit_jump (rtx);
 /* In expr.c */
 extern rtx move_by_pieces (rtx, rtx, unsigned HOST_WIDE_INT,
 			   unsigned int, int);
+/* Check whether vector instructions are required for operating on the
+   specified mode.  */
+bool vector_extensions_used_for_mode (enum machine_mode);
 extern HOST_WIDE_INT find_args_size_adjust (rtx);
 extern int fixup_args_size_notes (rtx, rtx, int);
 
diff --git a/gcc/target.def b/gcc/target.def
index c3bec0e..a74bb7b 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1498,6 +1498,22 @@ DEFHOOK
  bool, (struct ao_ref_s *ref),
  default_ref_may_alias_errno)
 
+/* True if access to unaligned data in given mode is too slow or
+   prohibited.  */
+DEFHOOK
+(slow_unaligned_access,
+ "",
+ bool, (enum machine_mode mode, unsigned int align),
+ default_slow_unaligned_access)
+
+/* Target hook.  Returns rtx of mode MODE with promoted value VAL or NULL.
+   VAL is supposed to represent one byte.  */
+DEFHOOK
+(promote_rtx_for_memset,
+ "",
+ rtx, (enum machine_mode mode, rtx val),
+ default_promote_rtx_for_memset)
+
 /* Support for named address spaces.  */
 #undef HOOK_PREFIX
 #define HOOK_PREFIX "TARGET_ADDR_SPACE_"
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index 81fd12f..f02a9e8 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -1442,4 +1442,24 @@ default_pch_valid_p (const void *data_p, size_t len)
   return NULL;
 }
 
+bool
+default_slow_unaligned_access (enum machine_mode mode ATTRIBUTE_UNUSED,
+			       unsigned int align ATTRIBUTE_UNUSED)
+{
+#ifdef SLOW_UNALIGNED_ACCESS
+  return SLOW_UNALIGNED_ACCESS (mode, align);
+#else
+  return STRICT_ALIGNMENT;
+#endif
+}
+
+/* Target hook.  Returns rtx of mode MODE with promoted value VAL or NULL.
+   VAL is supposed to represent one byte.  */
+rtx
+default_promote_rtx_for_memset (enum machine_mode mode ATTRIBUTE_UNUSED,
+				 rtx val ATTRIBUTE_UNUSED)
+{
+  return NULL_RTX;
+}
+
 #include "gt-targhooks.h"
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index f19fb50..8d23747 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -175,3 +175,6 @@ extern enum machine_mode default_get_reg_raw_mode(int);
 
 extern void *default_get_pch_validity (size_t *);
 extern const char *default_pch_valid_p (const void *, size_t);
+extern bool default_slow_unaligned_access (enum machine_mode mode,
+					   unsigned int align);
+extern rtx default_promote_rtx_for_memset (enum machine_mode mode, rtx val);
diff --git a/gcc/testsuite/gcc.target/i386/sw-1.c b/gcc/testsuite/gcc.target/i386/sw-1.c
index 483d117..e3d3b91 100644
--- a/gcc/testsuite/gcc.target/i386/sw-1.c
+++ b/gcc/testsuite/gcc.target/i386/sw-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fshrink-wrap -fdump-rtl-pro_and_epilogue" } */
+/* { dg-options "-O2 -fshrink-wrap -fdump-rtl-pro_and_epilogue -mstringop-strategy=rep_byte" } */
 
 #include <string.h>
 

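A small aside for anyone skimming the patch: here is a minimal example (my own
illustration, not part of the patch or of the new tests) of the kind of code
the reworked unrolled_loop strategy targets.  With something like -O2 -msse2
on x86, the calls below should now be expanded with 16-byte vector moves when
the alignment is known at compile time, and with narrower or unaligned moves
otherwise:

#include <string.h>

static char dst[256] __attribute__ ((aligned (16)));
static char src[256] __attribute__ ((aligned (16)));

void
copy_and_fill (void)
{
  /* Candidate for the SSE-based memcpy expansion.  */
  memcpy (dst, src, sizeof (dst));
  /* Candidate for vector stores of the promoted byte value.  */
  memset (dst, 0xab, sizeof (dst));
}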
^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-10-20  8:39                       ` Michael Zolotukhin
@ 2011-10-20  8:46                         ` Michael Zolotukhin
  2011-10-20  8:46                           ` Michael Zolotukhin
  0 siblings, 1 reply; 52+ messages in thread
From: Michael Zolotukhin @ 2011-10-20  8:46 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Jack Howarth, gcc-patches, Jan Hubicka, Richard Guenther,
	H.J. Lu, izamyatin, areg.melikadamyan

[-- Attachment #1: Type: text/plain, Size: 1316 bytes --]

Middle-end part of the patch is attached.
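As a quick note on the TARGET_PROMOTE_RTX_FOR_MEMSET plumbing documented in
this part: the hook is expected to replicate a single byte across a wider
value.  A rough sketch of that replication on plain integers (illustration
only, not code from the patch; the helper name is made up):

#include <stdint.h>

/* BYTES is 4 for an SImode-sized value, 8 for a DImode-sized one.  */
static uint64_t
promote_byte (uint8_t c, unsigned int bytes)
{
  return (uint64_t) c * (bytes == 8 ? 0x0101010101010101ULL : 0x01010101ULL);
}

So 0xAB promoted to a 4-byte value becomes 0xABABABAB, which matches the
example in the tm.texi documentation.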

On 20 October 2011 12:34, Michael Zolotukhin
<michael.v.zolotukhin@gmail.com> wrote:
> I fixed the tests, updated my branch, and fixed the bugs that were
> introduced in the process.
> Here is the fixed complete patch (other parts will be sent in subsequent letters).
>
> The changes passed bootstrap and make check.
>
> On 29 September 2011 15:21, Jakub Jelinek <jakub@redhat.com> wrote:
>> Hi!
>>
>> On Thu, Sep 29, 2011 at 03:14:40PM +0400, Michael Zolotukhin wrote:
>> +/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
>>
>> The testcases are wrong, -m64 or -m32 should never appear in dg-options,
>> instead if the testcase is specific to -m64, it should be guarded with
>> /* { dg-do compile { target lp64 } } */
>> resp. ia32 (or ilp32, depending on what exactly should be done for -mx32),
>> if you have the same testcase for -m32 and -m64, but just want different
>> scan-assembler for the two cases, then just guard the scan-assembler
>> with lp64 resp. ia32/ilp32 target and add second one for the other target.
>>
>>        Jakub
>
> --
> ---
> Best regards,
> Michael V. Zolotukhin,
> Software Engineer
> Intel Corporation.
>
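(To make Jakub's point above concrete: the directives below are only an
illustration, not lines taken from the updated tests.  A testcase that
previously passed -m64 in dg-options would instead start with

/* { dg-do compile { target lp64 } } */
/* { dg-options "-O2 -march=atom -mtune=atom -dp" } */

and the 32-bit variant of any scan-assembler check would be guarded with
ia32 or ilp32 instead.)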



-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.

[-- Attachment #2: memfunc-mid-3.patch --]
[-- Type: application/octet-stream, Size: 41342 bytes --]

diff --git a/gcc/builtins.c b/gcc/builtins.c
index 296c5b7..3e41695 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -3567,7 +3567,8 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
 				  builtin_memset_read_str, &c, dest_align,
 				  true))
 	store_by_pieces (dest_mem, tree_low_cst (len, 1),
-			 builtin_memset_read_str, &c, dest_align, true, 0);
+			 builtin_memset_read_str, gen_int_mode (c, val_mode),
+			 dest_align, true, 0);
       else if (!set_storage_via_setmem (dest_mem, len_rtx,
 					gen_int_mode (c, val_mode),
 					dest_align, expected_align,
diff --git a/gcc/cse.c b/gcc/cse.c
index ae67685..3b6471d 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -4616,7 +4616,10 @@ cse_insn (rtx insn)
 		 to fold switch statements when an ADDR_DIFF_VEC is used.  */
 	      || (GET_CODE (src_folded) == MINUS
 		  && GET_CODE (XEXP (src_folded, 0)) == LABEL_REF
-		  && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF)))
+		  && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF))
+	      /* Don't propagate vector-constants, as for now no architecture
+		 supports vector immediates.  */
+	  && !vector_extensions_used_for_mode (mode))
 	src_const = src_folded, src_const_elt = elt;
       else if (src_const == 0 && src_eqv_here && CONSTANT_P (src_eqv_here))
 	src_const = src_eqv_here, src_const_elt = src_eqv_elt;
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 90cef1c..4b7d67b 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5780,6 +5780,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum machine_mode @var{mode}, unsigned int @var{align})
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options ask for faster code, the hook can also be
+used to prevent unaligned accesses in some set of modes, even if the
+processor can perform them without trapping.
+
+By default the hook returns the value of the @code{SLOW_UNALIGNED_ACCESS}
+macro if it is defined, and @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
+@deftypefn {Target Hook} rtx TARGET_PROMOTE_RTX_FOR_MEMSET (enum machine_mode @var{mode}, rtx @var{val})
+This hook returns an rtx of mode @var{mode} holding the promoted value
+@var{val}, or @code{NULL} on failure.  The hook generates the instructions
+that are needed to promote @var{val} to mode @var{mode}.
+If those instructions cannot be generated, the hook returns @code{NULL}.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6252,23 +6278,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 187122e..c7e2457 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -5718,6 +5718,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@hook TARGET_SLOW_UNALIGNED_ACCESS
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options ask for faster code, the hook can also be
+used to prevent unaligned accesses in some set of modes, even if the
+processor can perform them without trapping.
+
+By default the hook returns the value of the @code{SLOW_UNALIGNED_ACCESS}
+macro if it is defined, and @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
+@hook TARGET_PROMOTE_RTX_FOR_MEMSET
+This hook returns an rtx of mode @var{mode} holding the promoted value
+@var{val}, or @code{NULL} on failure.  The hook generates the instructions
+that are needed to promote @var{val} to mode @var{mode}.
+If those instructions cannot be generated, the hook returns @code{NULL}.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6190,23 +6216,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c
index 8465237..ff568b1 100644
--- a/gcc/emit-rtl.c
+++ b/gcc/emit-rtl.c
@@ -1495,6 +1495,12 @@ get_mem_align_offset (rtx mem, unsigned int align)
       if (TYPE_ALIGN (TREE_TYPE (expr)) < (unsigned int) align)
 	return -1;
     }
+  else if (TREE_CODE (expr) == MEM_REF)
+    {
+      if (get_object_alignment_1 (expr, &offset) < align)
+	return -1;
+      offset /= BITS_PER_UNIT;
+    }
   else if (TREE_CODE (expr) == COMPONENT_REF)
     {
       while (1)
@@ -2058,7 +2064,6 @@ adjust_address_1 (rtx memref, enum machine_mode mode, HOST_WIDE_INT offset,
   enum machine_mode address_mode;
   int pbits;
   struct mem_attrs attrs, *defattrs;
-  unsigned HOST_WIDE_INT max_align;
 
   attrs = *get_mem_attrs (memref);
 
@@ -2115,8 +2120,12 @@ adjust_address_1 (rtx memref, enum machine_mode mode, HOST_WIDE_INT offset,
      if zero.  */
   if (offset != 0)
     {
-      max_align = (offset & -offset) * BITS_PER_UNIT;
-      attrs.align = MIN (attrs.align, max_align);
+      int old_offset = get_mem_align_offset (memref, MOVE_MAX*BITS_PER_UNIT);
+      if (old_offset >= 0)
+	attrs.align = compute_align_by_offset (old_offset + attrs.offset);
+      else
+	attrs.align = MIN (attrs.align,
+	      (unsigned HOST_WIDE_INT) (offset & -offset) * BITS_PER_UNIT);
     }
 
   /* We can compute the size in a number of ways.  */
diff --git a/gcc/expr.c b/gcc/expr.c
index b020978..83bc789 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -126,15 +126,18 @@ struct store_by_pieces_d
 static unsigned HOST_WIDE_INT move_by_pieces_ninsns (unsigned HOST_WIDE_INT,
 						     unsigned int,
 						     unsigned int);
-static void move_by_pieces_1 (rtx (*) (rtx, ...), enum machine_mode,
-			      struct move_by_pieces_d *);
+static void move_by_pieces_insn (rtx (*) (rtx, ...), enum machine_mode,
+		  struct move_by_pieces_d *);
 static bool block_move_libcall_safe_for_call_parm (void);
 static bool emit_block_move_via_movmem (rtx, rtx, rtx, unsigned, unsigned, HOST_WIDE_INT);
 static tree emit_block_move_libcall_fn (int);
 static void emit_block_move_via_loop (rtx, rtx, rtx, unsigned);
 static rtx clear_by_pieces_1 (void *, HOST_WIDE_INT, enum machine_mode);
 static void clear_by_pieces (rtx, unsigned HOST_WIDE_INT, unsigned int);
+static void set_by_pieces_1 (struct store_by_pieces_d *, unsigned int);
 static void store_by_pieces_1 (struct store_by_pieces_d *, unsigned int);
+static void set_by_pieces_2 (rtx (*) (rtx, ...), enum machine_mode,
+			       struct store_by_pieces_d *, rtx);
 static void store_by_pieces_2 (rtx (*) (rtx, ...), enum machine_mode,
 			       struct store_by_pieces_d *);
 static tree clear_storage_libcall_fn (int);
@@ -163,6 +166,12 @@ static void do_tablejump (rtx, enum machine_mode, rtx, rtx, rtx);
 static rtx const_vector_from_tree (tree);
 static void write_complex_part (rtx, rtx, bool);
 
+static enum machine_mode widest_mode_for_unaligned_mov (unsigned HOST_WIDE_INT);
+static enum machine_mode widest_mode_for_aligned_mov (unsigned HOST_WIDE_INT,
+						      unsigned int);
+static enum machine_mode generate_move_with_mode (struct store_by_pieces_d *,
+					   enum machine_mode, rtx *, rtx *);
+
 /* This macro is used to determine whether move_by_pieces should be called
    to perform a structure copy.  */
 #ifndef MOVE_BY_PIECES_P
@@ -811,7 +820,7 @@ alignment_for_piecewise_move (unsigned int max_pieces, unsigned int align)
 	   tmode != VOIDmode;
 	   xmode = tmode, tmode = GET_MODE_WIDER_MODE (tmode))
 	if (GET_MODE_SIZE (tmode) > max_pieces
-	    || SLOW_UNALIGNED_ACCESS (tmode, align))
+	    || targetm.slow_unaligned_access (tmode, align))
 	  break;
 
       align = MAX (align, GET_MODE_ALIGNMENT (xmode));
@@ -820,11 +829,66 @@ alignment_for_piecewise_move (unsigned int max_pieces, unsigned int align)
   return align;
 }
 
+/* Given an offset from an alignment boundary, compute the maximum alignment
+   that can be assumed for the data at that offset.  */
+unsigned int
+compute_align_by_offset (int offset)
+{
+  return (offset == 0)
+	 ? BIGGEST_ALIGNMENT
+	 : MIN (BIGGEST_ALIGNMENT, (offset & -offset) * BITS_PER_UNIT);
+}
+
+/* Estimate the cost of a move of the given size and offset.  The offset is
+   used to determine the maximum alignment that can be assumed.  */
+static int
+compute_aligned_cost (unsigned HOST_WIDE_INT size, int offset)
+{
+  unsigned HOST_WIDE_INT cost = 0;
+  int cur_off = offset;
+
+  while (size > 0)
+    {
+      enum machine_mode mode = widest_mode_for_aligned_mov (size,
+	  compute_align_by_offset (cur_off));
+      int cur_mode_cost;
+      enum vect_cost_for_stmt type_of_cost = vector_load;
+      if (GET_MODE_SIZE (mode) <= UNITS_PER_WORD
+	  && (SCALAR_INT_MODE_P (mode) || SCALAR_FLOAT_MODE_P (mode)))
+	type_of_cost = scalar_load;
+      cur_mode_cost =
+	targetm.vectorize.builtin_vectorization_cost (type_of_cost, NULL, 0);
+      size -= GET_MODE_SIZE (mode);
+      cur_off += GET_MODE_SIZE (mode);
+      cost += cur_mode_cost;
+    }
+  return cost;
+}
+
+/* Estimate the cost of a move of the given size.  The alignment is assumed
+   to be unknown, so unaligned moves have to be used.  */
+static int
+compute_unaligned_cost (unsigned HOST_WIDE_INT size)
+{
+  unsigned HOST_WIDE_INT cost = 0;
+  while (size > 0)
+    {
+      enum machine_mode mode = widest_mode_for_unaligned_mov (size);
+      unsigned HOST_WIDE_INT n_insns = size / GET_MODE_SIZE (mode);
+      int cur_mode_cost =
+	targetm.vectorize.builtin_vectorization_cost (unaligned_load, NULL, 0);
+
+      cost += n_insns * cur_mode_cost;
+      size %= GET_MODE_SIZE (mode);
+    }
+  return cost;
+}
+
 /* Return the widest integer mode no wider than SIZE.  If no such mode
    can be found, return VOIDmode.  */
 
 static enum machine_mode
-widest_int_mode_for_size (unsigned int size)
+widest_int_mode_for_size (unsigned HOST_WIDE_INT size)
 {
   enum machine_mode tmode, mode = VOIDmode;
 
@@ -836,6 +900,170 @@ widest_int_mode_for_size (unsigned int size)
   return mode;
 }
 
+/* If MODE is a scalar mode, find the corresponding preferred vector mode.
+   If no such mode can be found, return the vector mode corresponding to
+   Pmode (a kind of default vector mode).
+   For vector modes return the mode itself.  */
+
+static enum machine_mode
+vector_mode_for_mode (enum machine_mode mode)
+{
+  enum machine_mode xmode;
+  if (VECTOR_MODE_P (mode))
+    return mode;
+  xmode = targetm.vectorize.preferred_simd_mode (mode);
+  if (VECTOR_MODE_P (xmode))
+    return xmode;
+
+  return targetm.vectorize.preferred_simd_mode (Pmode);
+}
+
+/* Check whether vector instructions are required for operating on the
+   specified mode.
+   For vector modes, check whether the corresponding vector extension is
+   supported.
+   Operations on a scalar mode will use vector extensions if the scalar mode
+   is wider than the default scalar mode (Pmode) and a vector extension for
+   the parent vector mode is available.  */
+
+bool
+vector_extensions_used_for_mode (enum machine_mode mode)
+{
+  enum machine_mode vector_mode = vector_mode_for_mode (mode);
+
+  if (VECTOR_MODE_P (mode))
+    return targetm.vector_mode_supported_p (mode);
+
+  /* mode is a scalar mode.  */
+  if (VECTOR_MODE_P (vector_mode)
+     && targetm.vector_mode_supported_p (vector_mode)
+     && (GET_MODE_SIZE (mode) > GET_MODE_SIZE (Pmode)))
+    return true;
+
+  return false;
+}
+
+/* Find the widest move mode for the given size if alignment is unknown.  */
+static enum machine_mode
+widest_mode_for_unaligned_mov (unsigned HOST_WIDE_INT size)
+{
+  enum machine_mode mode;
+  enum machine_mode tmode, xmode;
+  enum machine_mode best_simd_mode = targetm.vectorize.preferred_simd_mode (
+      mode_for_size (UNITS_PER_WORD*BITS_PER_UNIT, MODE_INT, 0));
+
+  /* Find the widest integer mode.  Here we can find modes wider than Pmode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_INT), xmode = VOIDmode;
+       tmode != VOIDmode;
+       tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size
+	  || targetm.slow_unaligned_access (tmode, BITS_PER_UNIT))
+	break;
+      if (optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.scalar_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+  mode = xmode;
+
+  /* Find the widest vector mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_VECTOR_INT), xmode = VOIDmode;
+       tmode != VOIDmode;
+       tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size
+	  || targetm.slow_unaligned_access (tmode, BITS_PER_UNIT))
+	break;
+      if (GET_MODE_SIZE (GET_MODE_INNER (tmode)) == UNITS_PER_WORD
+	  && optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.vector_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+
+  /* Choose between integer and vector modes.  */
+  if (xmode != VOIDmode && GET_MODE_SIZE (xmode) > GET_MODE_SIZE (mode))
+    mode = xmode;
+
+  /* If the vector and scalar modes found have the same size and the vector
+     mode is best_simd_mode, prefer the vector mode.  */
+  if (xmode != VOIDmode
+      && GET_MODE_SIZE (xmode) == GET_MODE_SIZE (mode)
+      && xmode == best_simd_mode)
+    mode = xmode;
+
+  /* If we failed to find a mode that might use vector extensions, try to
+     find the widest ordinary integer mode.  */
+  if (mode == VOIDmode)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  /* If the mode found will not use vector extensions, there is no need to
+     use a mode wider than Pmode.  */
+  if (!vector_extensions_used_for_mode (mode)
+      && GET_MODE_SIZE (mode) > MOVE_MAX_PIECES)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  return mode;
+}
+
+/* Find the widest move mode for the given size and alignment.  */
+static enum machine_mode
+widest_mode_for_aligned_mov (unsigned HOST_WIDE_INT size, unsigned int align)
+{
+  enum machine_mode mode;
+  enum machine_mode tmode, xmode;
+  enum machine_mode best_simd_mode = targetm.vectorize.preferred_simd_mode (
+      mode_for_size (UNITS_PER_WORD*BITS_PER_UNIT, MODE_INT, 0));
+
+  /* Find the widest integer mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_INT), xmode = VOIDmode;
+      tmode != VOIDmode;
+      tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size || GET_MODE_ALIGNMENT (tmode) > align)
+	break;
+      if (optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.scalar_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+  mode = xmode;
+
+  /* Find the widest vector mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_VECTOR_INT), xmode = VOIDmode;
+      tmode != VOIDmode;
+      tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size || GET_MODE_ALIGNMENT (tmode) > align)
+	break;
+      if (GET_MODE_SIZE (GET_MODE_INNER (tmode)) == UNITS_PER_WORD
+	  && optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.vector_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+
+  /* Choose between integer and vector modes.  */
+  if (xmode != VOIDmode && GET_MODE_SIZE (xmode) > GET_MODE_SIZE (mode))
+    mode = xmode;
+
+  /* If the vector and scalar modes found have the same size and the vector
+     mode is best_simd_mode, prefer the vector mode.  */
+  if (xmode != VOIDmode
+      && GET_MODE_SIZE (xmode) == GET_MODE_SIZE (mode)
+      && xmode == best_simd_mode)
+    mode = xmode;
+
+  /* If we failed to find a mode that might use vector extensions, try to
+     find the widest ordinary integer mode.  */
+  if (mode == VOIDmode)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  /* If the mode found will not use vector extensions, there is no need to
+     use a mode wider than Pmode.  */
+  if (!vector_extensions_used_for_mode (mode)
+      && GET_MODE_SIZE (mode) > MOVE_MAX_PIECES)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  return mode;
+}
+
 /* STORE_MAX_PIECES is the number of bytes at a time that we can
    store efficiently.  Due to internal GCC limitations, this is
    MOVE_MAX_PIECES limited by the number of bytes GCC can represent
@@ -876,6 +1104,7 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
   rtx to_addr, from_addr = XEXP (from, 0);
   unsigned int max_size = MOVE_MAX_PIECES + 1;
   enum insn_code icode;
+  int dst_offset, src_offset;
 
   align = MIN (to ? MEM_ALIGN (to) : align, MEM_ALIGN (from));
 
@@ -960,23 +1189,37 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
 	data.to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
     }
 
-  align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
-
-  /* First move what we can in the largest integer mode, then go to
-     successively smaller modes.  */
-
-  while (max_size > 1)
+  src_offset = get_mem_align_offset (from, MOVE_MAX*BITS_PER_UNIT);
+  dst_offset = get_mem_align_offset (to, MOVE_MAX*BITS_PER_UNIT);
+  if (src_offset < 0
+      || dst_offset < 0
+      || src_offset != dst_offset
+      || compute_aligned_cost (data.len, src_offset) >=
+	 compute_unaligned_cost (data.len))
     {
-      enum machine_mode mode = widest_int_mode_for_size (max_size);
-
-      if (mode == VOIDmode)
-	break;
+      while (data.len > 0)
+	{
+	  enum machine_mode mode = widest_mode_for_unaligned_mov (data.len);
 
-      icode = optab_handler (mov_optab, mode);
-      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
-	move_by_pieces_1 (GEN_FCN (icode), mode, &data);
+	  icode = optab_handler (mov_optab, mode);
+	  gcc_assert (icode != CODE_FOR_nothing);
+	  move_by_pieces_insn (GEN_FCN (icode), mode, &data);
+	}
+    }
+  else
+    {
+      while (data.len > 0)
+	{
+	  enum machine_mode mode;
+	  mode = widest_mode_for_aligned_mov (data.len,
+	      compute_align_by_offset (src_offset));
 
-      max_size = GET_MODE_SIZE (mode);
+	  icode = optab_handler (mov_optab, mode);
+	  gcc_assert (icode != CODE_FOR_nothing &&
+	      compute_align_by_offset (src_offset) >= GET_MODE_ALIGNMENT (mode));
+	  move_by_pieces_insn (GEN_FCN (icode), mode, &data);
+	  src_offset += GET_MODE_SIZE (mode);
+	}
     }
 
   /* The code above should have handled everything.  */
@@ -1014,35 +1257,47 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
 }
 
 /* Return number of insns required to move L bytes by pieces.
-   ALIGN (in bits) is maximum alignment we can assume.  */
+   ALIGN (in bits) is maximum alignment we can assume.
+   This is just an estimate, so the actual number of instructions might
+   differ (there are several ways of expanding memmove).  */
 
 static unsigned HOST_WIDE_INT
 move_by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align,
-		       unsigned int max_size)
+		       unsigned int max_size ATTRIBUTE_UNUSED)
 {
   unsigned HOST_WIDE_INT n_insns = 0;
-
-  align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
-
-  while (max_size > 1)
+  unsigned HOST_WIDE_INT n_insns_u = 0;
+  enum machine_mode mode;
+  unsigned HOST_WIDE_INT len = l;
+  while (len > 0)
     {
-      enum machine_mode mode;
-      enum insn_code icode;
-
-      mode = widest_int_mode_for_size (max_size);
-
-      if (mode == VOIDmode)
-	break;
+      mode = widest_mode_for_aligned_mov (len, align);
+      if (GET_MODE_SIZE (mode) < MOVE_MAX)
+	{
+	  align += GET_MODE_ALIGNMENT (mode);
+	  len -= GET_MODE_SIZE (mode);
+	  n_insns++;
+	}
+      else
+	{
+	  /* We are using the widest mode.  */
+	  n_insns += len / GET_MODE_SIZE (mode);
+	  len = len % GET_MODE_SIZE (mode);
+	}
+    }
+  gcc_assert (!len);
 
-      icode = optab_handler (mov_optab, mode);
-      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
-	n_insns += l / GET_MODE_SIZE (mode), l %= GET_MODE_SIZE (mode);
+  len = l;
+  while (len > 0)
+    {
+      mode = widest_mode_for_unaligned_mov (len);
+      n_insns_u += len / GET_MODE_SIZE (mode);
+      len = len % GET_MODE_SIZE (mode);
 
-      max_size = GET_MODE_SIZE (mode);
     }
 
-  gcc_assert (!l);
-  return n_insns;
+  gcc_assert (!len);
+  return MIN (n_insns, n_insns_u);
 }
 
 /* Subroutine of move_by_pieces.  Move as many bytes as appropriate
@@ -1050,60 +1305,57 @@ move_by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align,
    to make a move insn for that mode.  DATA has all the other info.  */
 
 static void
-move_by_pieces_1 (rtx (*genfun) (rtx, ...), enum machine_mode mode,
+move_by_pieces_insn (rtx (*genfun) (rtx, ...), enum machine_mode mode,
 		  struct move_by_pieces_d *data)
 {
   unsigned int size = GET_MODE_SIZE (mode);
   rtx to1 = NULL_RTX, from1;
 
-  while (data->len >= size)
-    {
-      if (data->reverse)
-	data->offset -= size;
-
-      if (data->to)
-	{
-	  if (data->autinc_to)
-	    to1 = adjust_automodify_address (data->to, mode, data->to_addr,
-					     data->offset);
-	  else
-	    to1 = adjust_address (data->to, mode, data->offset);
-	}
+  if (data->reverse)
+    data->offset -= size;
 
-      if (data->autinc_from)
-	from1 = adjust_automodify_address (data->from, mode, data->from_addr,
-					   data->offset);
+  if (data->to)
+    {
+      if (data->autinc_to)
+	to1 = adjust_automodify_address (data->to, mode, data->to_addr,
+					 data->offset);
       else
-	from1 = adjust_address (data->from, mode, data->offset);
+	to1 = adjust_address (data->to, mode, data->offset);
+    }
 
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
-	emit_insn (gen_add2_insn (data->to_addr,
-				  GEN_INT (-(HOST_WIDE_INT)size)));
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_from < 0)
-	emit_insn (gen_add2_insn (data->from_addr,
-				  GEN_INT (-(HOST_WIDE_INT)size)));
+  if (data->autinc_from)
+    from1 = adjust_automodify_address (data->from, mode, data->from_addr,
+				       data->offset);
+  else
+    from1 = adjust_address (data->from, mode, data->offset);
 
-      if (data->to)
-	emit_insn ((*genfun) (to1, from1));
-      else
-	{
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
+    emit_insn (gen_add2_insn (data->to_addr,
+			      GEN_INT (-(HOST_WIDE_INT)size)));
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_from < 0)
+    emit_insn (gen_add2_insn (data->from_addr,
+			      GEN_INT (-(HOST_WIDE_INT)size)));
+
+  if (data->to)
+    emit_insn ((*genfun) (to1, from1));
+  else
+    {
 #ifdef PUSH_ROUNDING
-	  emit_single_push_insn (mode, from1, NULL);
+      emit_single_push_insn (mode, from1, NULL);
 #else
-	  gcc_unreachable ();
+      gcc_unreachable ();
 #endif
-	}
+    }
 
-      if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
-	emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
-      if (HAVE_POST_INCREMENT && data->explicit_inc_from > 0)
-	emit_insn (gen_add2_insn (data->from_addr, GEN_INT (size)));
+  if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
+    emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
+  if (HAVE_POST_INCREMENT && data->explicit_inc_from > 0)
+    emit_insn (gen_add2_insn (data->from_addr, GEN_INT (size)));
 
-      if (! data->reverse)
-	data->offset += size;
+  if (! data->reverse)
+    data->offset += size;
 
-      data->len -= size;
-    }
+  data->len -= size;
 }
 \f
 /* Emit code to move a block Y to a block X.  This may be done with
@@ -1680,7 +1932,7 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src, tree type, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (src)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (src))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (src))
 	      || MEM_ALIGN (src) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -2070,7 +2322,7 @@ emit_group_store (rtx orig_dst, rtx src, tree type ATTRIBUTE_UNUSED, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (dest)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (dest))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (dest))
 	      || MEM_ALIGN (dest) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -2464,7 +2716,10 @@ store_by_pieces (rtx to, unsigned HOST_WIDE_INT len,
   data.constfundata = constfundata;
   data.len = len;
   data.to = to;
-  store_by_pieces_1 (&data, align);
+  if (memsetp)
+    set_by_pieces_1 (&data, align);
+  else
+    store_by_pieces_1 (&data, align);
   if (endp)
     {
       rtx to1;
@@ -2508,10 +2763,10 @@ clear_by_pieces (rtx to, unsigned HOST_WIDE_INT len, unsigned int align)
     return;
 
   data.constfun = clear_by_pieces_1;
-  data.constfundata = NULL;
+  data.constfundata = CONST0_RTX (QImode);
   data.len = len;
   data.to = to;
-  store_by_pieces_1 (&data, align);
+  set_by_pieces_1 (&data, align);
 }
 
 /* Callback routine for clear_by_pieces.
@@ -2525,13 +2780,126 @@ clear_by_pieces_1 (void *data ATTRIBUTE_UNUSED,
   return const0_rtx;
 }
 
-/* Subroutine of clear_by_pieces and store_by_pieces.
+/* Helper function for set_by_pieces_1: generate a move with the given mode.
+   Return the mode actually used for the generated move (it can differ from
+   the requested one if the requested mode is not supported).  */
+static enum machine_mode
+generate_move_with_mode (struct store_by_pieces_d *data,
+			 enum machine_mode mode,
+			 rtx *promoted_to_vector_value_ptr,
+			 rtx *promoted_value_ptr)
+{
+  enum insn_code icode;
+  rtx rhs = NULL_RTX;
+
+  gcc_assert (promoted_to_vector_value_ptr && promoted_value_ptr);
+
+  if (vector_extensions_used_for_mode (mode))
+    {
+      enum machine_mode vec_mode = vector_mode_for_mode (mode);
+      if (!(*promoted_to_vector_value_ptr))
+	*promoted_to_vector_value_ptr
+	  = targetm.promote_rtx_for_memset (vec_mode, (rtx)data->constfundata);
+
+      if (*promoted_to_vector_value_ptr)
+	{
+	  enum machine_mode promoted_mode = GET_MODE (*promoted_to_vector_value_ptr);
+	  if (GET_MODE_SIZE (promoted_mode) < GET_MODE_SIZE (mode))
+	    return generate_move_with_mode (data, promoted_mode,
+				    promoted_to_vector_value_ptr,
+				    promoted_value_ptr);
+	  rhs = convert_to_mode (vec_mode, *promoted_to_vector_value_ptr, 1);
+	}
+    }
+  else
+    {
+      if (CONST_INT_P ((rtx)data->constfundata))
+	{
+	  /* We don't need to load the constant to a register, if it could be
+	     encoded as an immediate operand.  */
+	  rtx imm_const;
+	  switch (mode)
+	    {
+	    case DImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x0101010101010101, DImode);
+	      break;
+	    case SImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x01010101, SImode);
+	      break;
+	    case HImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x00000101, HImode);
+	      break;
+	    case QImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x00000001, QImode);
+	      break;
+	    default:
+	      gcc_unreachable ();
+	      break;
+	    }
+	  rhs = imm_const;
+	}
+      else /* data->constfundata isn't const.  */
+	{
+	  if (!(*promoted_value_ptr))
+	    {
+	      rtx coeff;
+	      enum machine_mode promoted_value_mode;
+	      /* Choose a mode for the promoted value.  It should not be
+		 narrower than Pmode.  */
+	      if (GET_MODE_SIZE (mode) > GET_MODE_SIZE (Pmode))
+		promoted_value_mode = mode;
+	      else
+		promoted_value_mode = Pmode;
+
+	      switch (promoted_value_mode)
+		{
+		case DImode:
+		  coeff = gen_int_mode (0x0101010101010101, DImode);
+		  break;
+		case SImode:
+		  coeff = gen_int_mode (0x01010101, SImode);
+		  break;
+		default:
+		  gcc_unreachable ();
+		  break;
+		}
+	      *promoted_value_ptr = convert_to_mode (promoted_value_mode,
+						     (rtx)data->constfundata,
+						     1);
+	      *promoted_value_ptr = expand_mult (promoted_value_mode,
+						 *promoted_value_ptr, coeff,
+						 NULL_RTX, 1);
+	    }
+	  rhs = convert_to_mode (mode, *promoted_value_ptr, 1);
+	}
+    }
+  /* If RHS is null, then the requested mode isn't supported and can't be used.
+     Use Pmode instead.  */
+  if (!rhs)
+    return generate_move_with_mode (data, Pmode, promoted_to_vector_value_ptr,
+			       promoted_value_ptr);
+
+  gcc_assert (rhs);
+  icode = optab_handler (mov_optab, mode);
+  gcc_assert (icode != CODE_FOR_nothing);
+  set_by_pieces_2 (GEN_FCN (icode), mode, data, rhs);
+  return mode;
+}
+
+/* Subroutine of store_by_pieces.
    Generate several move instructions to store LEN bytes of block TO.  (A MEM
    rtx with BLKmode).  ALIGN is maximum alignment we can assume.  */
 
 static void
-store_by_pieces_1 (struct store_by_pieces_d *data ATTRIBUTE_UNUSED,
-		   unsigned int align ATTRIBUTE_UNUSED)
+store_by_pieces_1 (struct store_by_pieces_d *data, unsigned int align)
 {
   enum machine_mode to_addr_mode
     = targetm.addr_space.address_mode (MEM_ADDR_SPACE (data->to));
@@ -2606,6 +2974,134 @@ store_by_pieces_1 (struct store_by_pieces_d *data ATTRIBUTE_UNUSED,
   gcc_assert (!data->len);
 }
 
+/* Subroutine of clear_by_pieces and store_by_pieces.
+   Generate several move instructions to store LEN bytes of block TO.  (A MEM
+   rtx with BLKmode).  ALIGN is maximum alignment we can assume.
+   As opposed to store_by_pieces_1, this routine always generates code for
+   memset.  (store_by_pieces_1 is sometimes used to generate code for memcpy
+   rather than for memset).  */
+
+static void
+set_by_pieces_1 (struct store_by_pieces_d *data, unsigned int align)
+{
+  enum machine_mode to_addr_mode
+    = targetm.addr_space.address_mode (MEM_ADDR_SPACE (data->to));
+  rtx to_addr = XEXP (data->to, 0);
+  unsigned int max_size = STORE_MAX_PIECES + 1;
+  int dst_offset;
+  rtx promoted_to_vector_value = NULL_RTX;
+  rtx promoted_value = NULL_RTX;
+
+  data->offset = 0;
+  data->to_addr = to_addr;
+  data->autinc_to
+    = (GET_CODE (to_addr) == PRE_INC || GET_CODE (to_addr) == PRE_DEC
+       || GET_CODE (to_addr) == POST_INC || GET_CODE (to_addr) == POST_DEC);
+
+  data->explicit_inc_to = 0;
+  data->reverse
+    = (GET_CODE (to_addr) == PRE_DEC || GET_CODE (to_addr) == POST_DEC);
+  if (data->reverse)
+    data->offset = data->len;
+
+  /* If storing requires more than two move insns,
+     copy addresses to registers (to make displacements shorter)
+     and use post-increment if available.  */
+  if (!data->autinc_to
+      && move_by_pieces_ninsns (data->len, align, max_size) > 2)
+    {
+      /* Determine the main mode we'll be using.
+	 MODE might not be used depending on the definitions of the
+	 USE_* macros below.  */
+      enum machine_mode mode ATTRIBUTE_UNUSED
+	= widest_int_mode_for_size (max_size);
+
+      if (USE_STORE_PRE_DECREMENT (mode) && data->reverse && ! data->autinc_to)
+	{
+	  data->to_addr = copy_to_mode_reg (to_addr_mode,
+					    plus_constant (to_addr, data->len));
+	  data->autinc_to = 1;
+	  data->explicit_inc_to = -1;
+	}
+
+      if (USE_STORE_POST_INCREMENT (mode) && ! data->reverse
+	  && ! data->autinc_to)
+	{
+	  data->to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
+	  data->autinc_to = 1;
+	  data->explicit_inc_to = 1;
+	}
+
+      if ( !data->autinc_to && CONSTANT_P (to_addr))
+	data->to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
+    }
+
+  dst_offset = get_mem_align_offset (data->to, MOVE_MAX*BITS_PER_UNIT);
+  if (dst_offset < 0
+      || compute_aligned_cost (data->len, dst_offset) >=
+	 compute_unaligned_cost (data->len))
+    {
+      while (data->len > 0)
+	{
+	  enum machine_mode mode = widest_mode_for_unaligned_mov (data->len);
+	  generate_move_with_mode (data, mode, &promoted_to_vector_value,
+				   &promoted_value);
+	}
+    }
+  else
+    {
+      while (data->len > 0)
+	{
+	  enum machine_mode mode;
+	  mode = widest_mode_for_aligned_mov (data->len,
+	      compute_align_by_offset (dst_offset));
+	  mode = generate_move_with_mode (data, mode, &promoted_to_vector_value,
+				   &promoted_value);
+	  dst_offset += GET_MODE_SIZE (mode);
+	}
+    }
+
+  /* The code above should have handled everything.  */
+  gcc_assert (!data->len);
+}
+
+/* Subroutine of set_by_pieces_1.  Emit move instruction with mode MODE.
+   DATA has info about destination, RHS is source, GENFUN is the gen_...
+   function to make a move insn for that mode.  */
+
+static void
+set_by_pieces_2 (rtx (*genfun) (rtx, ...), enum machine_mode mode,
+		   struct store_by_pieces_d *data, rtx rhs)
+{
+  unsigned int size = GET_MODE_SIZE (mode);
+  rtx to1;
+
+  if (data->reverse)
+    data->offset -= size;
+
+  if (data->autinc_to)
+    to1 = adjust_automodify_address (data->to, mode, data->to_addr,
+	data->offset);
+  else
+    to1 = adjust_address (data->to, mode, data->offset);
+
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
+    emit_insn (gen_add2_insn (data->to_addr,
+	  GEN_INT (-(HOST_WIDE_INT) size)));
+
+  gcc_assert (rhs);
+
+  emit_insn ((*genfun) (to1, rhs));
+
+  if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
+    emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
+
+  if (! data->reverse)
+    data->offset += size;
+
+  data->len -= size;
+}
+
 /* Subroutine of store_by_pieces_1.  Store as many bytes as appropriate
    with move instructions for mode MODE.  GENFUN is the gen_... function
    to make a move insn for that mode.  DATA has all the other info.  */
@@ -4034,7 +4530,7 @@ emit_push_insn (rtx x, enum machine_mode mode, tree type, rtx size,
 	  /* Here we avoid the case of a structure whose weak alignment
 	     forces many pushes of a small amount of data,
 	     and such small pushes do rounding that causes trouble.  */
-	  && ((! SLOW_UNALIGNED_ACCESS (word_mode, align))
+	  && ((! targetm.slow_unaligned_access (word_mode, align))
 	      || align >= BIGGEST_ALIGNMENT
 	      || (PUSH_ROUNDING (align / BITS_PER_UNIT)
 		  == (align / BITS_PER_UNIT)))
@@ -6325,7 +6821,7 @@ store_field (rtx target, HOST_WIDE_INT bitsize, HOST_WIDE_INT bitpos,
       || (mode != BLKmode
 	  && ((((MEM_ALIGN (target) < GET_MODE_ALIGNMENT (mode))
 		|| bitpos % GET_MODE_ALIGNMENT (mode))
-	       && SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (target)))
+	       && targetm.slow_unaligned_access (mode, MEM_ALIGN (target)))
 	      || (bitpos % BITS_PER_UNIT != 0)))
       /* If the RHS and field are a constant size and the size of the
 	 RHS isn't the same size as the bitfield, we must use bitfield
@@ -9738,7 +10234,7 @@ expand_expr_real_1 (tree exp, rtx target, enum machine_mode tmode,
 		     && ((modifier == EXPAND_CONST_ADDRESS
 			  || modifier == EXPAND_INITIALIZER)
 			 ? STRICT_ALIGNMENT
-			 : SLOW_UNALIGNED_ACCESS (mode1, MEM_ALIGN (op0))))
+			 : targetm.slow_unaligned_access (mode1, MEM_ALIGN (op0))))
 		    || (bitpos % BITS_PER_UNIT != 0)))
 	    /* If the type and the field are a constant size and the
 	       size of the type isn't the same size as the bitfield,
diff --git a/gcc/expr.h b/gcc/expr.h
index 1bf1369..6f697d7 100644
--- a/gcc/expr.h
+++ b/gcc/expr.h
@@ -706,4 +706,8 @@ extern tree build_libfunc_function (const char *);
 /* Get the personality libfunc for a function decl.  */
 rtx get_personality_function (tree);
 
+/* Given an offset from a maximum-alignment boundary, compute the maximum
+   alignment that can be assumed.  */
+unsigned int compute_align_by_offset (int);
+
 #endif /* GCC_EXPR_H */
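
For reference, here is a small standalone sketch (not part of the patch) that
mirrors what compute_align_by_offset does.  The concrete values of
BITS_PER_UNIT (8) and BIGGEST_ALIGNMENT (128 bits) are assumptions made only
for this illustration; in GCC they come from the target configuration.

/* Standalone illustration of the offset-to-alignment computation.
   Assumes an 8-bit byte and a 128-bit maximum alignment, roughly matching
   an SSE-capable x86 configuration.  */
#include <stdio.h>

#define BITS_PER_UNIT 8
#define BIGGEST_ALIGNMENT 128

static unsigned int
align_by_offset (int offset)
{
  if (offset == 0)
    return BIGGEST_ALIGNMENT;
  /* The lowest set bit of the offset bounds the alignment: data 4 bytes
     past a 16-byte boundary can only be assumed to be 4-byte aligned.  */
  {
    unsigned int bits = (offset & -offset) * BITS_PER_UNIT;
    return bits < BIGGEST_ALIGNMENT ? bits : BIGGEST_ALIGNMENT;
  }
}

int
main (void)
{
  int offsets[] = { 0, 1, 2, 4, 6, 8 };
  unsigned int i;
  for (i = 0; i < sizeof offsets / sizeof offsets[0]; i++)
    printf ("offset %d -> %u bits\n", offsets[i], align_by_offset (offsets[i]));
  return 0;
}

With these assumed values the program prints 128 bits for offset 0, 8 for
offset 1, 16 for offset 2, 32 for offset 4 and so on, which is exactly the
MIN (BIGGEST_ALIGNMENT, (offset & -offset) * BITS_PER_UNIT) expression used
in expr.c.
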
diff --git a/gcc/fwprop.c b/gcc/fwprop.c
index 5368d18..cbbb75a 100644
--- a/gcc/fwprop.c
+++ b/gcc/fwprop.c
@@ -1273,6 +1273,10 @@ forward_propagate_and_simplify (df_ref use, rtx def_insn, rtx def_set)
       return false;
     }
 
+  /* Don't propagate vector constants.  */
+  if (vector_extensions_used_for_mode (GET_MODE (reg)) && CONSTANT_P (src))
+    return false;
+
   if (asm_use >= 0)
     return forward_propagate_asm (use, def_insn, def_set, reg);
 
diff --git a/gcc/rtl.h b/gcc/rtl.h
index f13485e..4ec67c7 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -2513,6 +2513,9 @@ extern void emit_jump (rtx);
 /* In expr.c */
 extern rtx move_by_pieces (rtx, rtx, unsigned HOST_WIDE_INT,
 			   unsigned int, int);
+/* Check whether vector instructions are required for operating on the
+   specified mode.  */
+extern bool vector_extensions_used_for_mode (enum machine_mode);
 extern HOST_WIDE_INT find_args_size_adjust (rtx);
 extern int fixup_args_size_notes (rtx, rtx, int);
 
diff --git a/gcc/target.def b/gcc/target.def
index c3bec0e..a74bb7b 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1498,6 +1498,22 @@ DEFHOOK
  bool, (struct ao_ref_s *ref),
  default_ref_may_alias_errno)
 
+/* True if access to unaligned data in the given mode is too slow or
+   prohibited.  */
+DEFHOOK
+(slow_unaligned_access,
+ "",
+ bool, (enum machine_mode mode, unsigned int align),
+ default_slow_unaligned_access)
+
+/* Target hook.  Returns an rtx of mode MODE with the promoted value VAL, or
+   NULL.  VAL is supposed to represent one byte.  */
+DEFHOOK
+(promote_rtx_for_memset,
+ "",
+ rtx, (enum machine_mode mode, rtx val),
+ default_promote_rtx_for_memset)
+
 /* Support for named address spaces.  */
 #undef HOOK_PREFIX
 #define HOOK_PREFIX "TARGET_ADDR_SPACE_"
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index 81fd12f..f02a9e8 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -1442,4 +1442,24 @@ default_pch_valid_p (const void *data_p, size_t len)
   return NULL;
 }
 
+bool
+default_slow_unaligned_access (enum machine_mode mode ATTRIBUTE_UNUSED,
+			       unsigned int align ATTRIBUTE_UNUSED)
+{
+#ifdef SLOW_UNALIGNED_ACCESS
+  return SLOW_UNALIGNED_ACCESS (mode, align);
+#else
+  return STRICT_ALIGNMENT;
+#endif
+}
+
+/* Target hook.  Returns an rtx of mode MODE with the promoted value VAL, or
+   NULL.  VAL is supposed to represent one byte.  */
+rtx
+default_promote_rtx_for_memset (enum machine_mode mode ATTRIBUTE_UNUSED,
+				 rtx val ATTRIBUTE_UNUSED)
+{
+  return NULL_RTX;
+}
+
 #include "gt-targhooks.h"
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index f19fb50..8d23747 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -175,3 +175,6 @@ extern enum machine_mode default_get_reg_raw_mode(int);
 
 extern void *default_get_pch_validity (size_t *);
 extern const char *default_pch_valid_p (const void *, size_t);
+extern bool default_slow_unaligned_access (enum machine_mode mode,
+					   unsigned int align);
+extern rtx default_promote_rtx_for_memset (enum machine_mode mode, rtx val);
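
To illustrate how a backend would eventually override the two new hooks, here
is a hypothetical sketch only, not part of this patch: the example_* function
names are made up, the bodies are placeholders, and it assumes the usual
backend includes (coretypes.h, tm.h, rtl.h, target.h) and the standard
TARGET_SLOW_UNALIGNED_ACCESS / TARGET_PROMOTE_RTX_FOR_MEMSET macro names
derived from the DEFHOOK entries above.

/* Hypothetical backend sketch: override the new hooks in a target's
   <target>.c file, before the `struct gcc_target targetm = TARGET_INITIALIZER;'
   definition picks up the macros.  */

static bool
example_slow_unaligned_access (enum machine_mode mode, unsigned int align)
{
  /* Pretend unaligned accesses are cheap up to word size and expensive for
     wider (e.g. vector) modes that are not sufficiently aligned.  */
  return GET_MODE_SIZE (mode) > UNITS_PER_WORD
	 && align < GET_MODE_ALIGNMENT (mode);
}

static rtx
example_promote_rtx_for_memset (enum machine_mode mode ATTRIBUTE_UNUSED,
				rtx val ATTRIBUTE_UNUSED)
{
  /* A real target would broadcast the byte VAL into a register of MODE and
     return it, or return NULL_RTX if that cannot be done (the caller then
     falls back to scalar code).  This stub always declines.  */
  return NULL_RTX;
}

#undef TARGET_SLOW_UNALIGNED_ACCESS
#define TARGET_SLOW_UNALIGNED_ACCESS example_slow_unaligned_access
#undef TARGET_PROMOTE_RTX_FOR_MEMSET
#define TARGET_PROMOTE_RTX_FOR_MEMSET example_promote_rtx_for_memset
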

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-10-20  8:46                           ` Michael Zolotukhin
@ 2011-10-20  8:51                             ` Michael Zolotukhin
  2011-10-26 20:36                               ` Michael Zolotukhin
  2011-10-27 16:10                             ` Jan Hubicka
  1 sibling, 1 reply; 52+ messages in thread
From: Michael Zolotukhin @ 2011-10-20  8:51 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Jack Howarth, gcc-patches, Jan Hubicka, Richard Guenther,
	H.J. Lu, izamyatin, areg.melikadamyan

[-- Attachment #1: Type: text/plain, Size: 1884 bytes --]

And, finally, the part with the tests.

On 20 October 2011 12:36, Michael Zolotukhin
<michael.v.zolotukhin@gmail.com> wrote:
> Back-end part of the patch is attached here.
>
> On 20 October 2011 12:35, Michael Zolotukhin
> <michael.v.zolotukhin@gmail.com> wrote:
>> Middle-end part of the patch is attached.
>>
>> On 20 October 2011 12:34, Michael Zolotukhin
>> <michael.v.zolotukhin@gmail.com> wrote:
>>> I fixed the tests, updated my branch, and fixed the bugs that were
>>> introduced during this process.
>>> Here is the fixed complete patch (other parts will be sent in subsequent letters).
>>>
>>> The changes passed bootstrap and make check.
>>>
>>> On 29 September 2011 15:21, Jakub Jelinek <jakub@redhat.com> wrote:
>>>> Hi!
>>>>
>>>> On Thu, Sep 29, 2011 at 03:14:40PM +0400, Michael Zolotukhin wrote:
>>>> +/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
>>>>
>>>> The testcases are wrong, -m64 or -m32 should never appear in dg-options,
>>>> instead if the testcase is specific to -m64, it should be guarded with
>>>> /* { dg-do compile { target lp64 } } */
>>>> resp. ia32 (or ilp32, depending on what exactly should be done for -mx32),
>>>> if you have the same testcase for -m32 and -m64, but just want different
>>>> scan-assembler for the two cases, then just guard the scan-assembler
>>>> with lp64 resp. ia32/ilp32 target and add second one for the other target.
>>>>
>>>>        Jakub
>>>
>>> --
>>> ---
>>> Best regards,
>>> Michael V. Zolotukhin,
>>> Software Engineer
>>> Intel Corporation.
>>>
>>
>>
>>
>> --
>> ---
>> Best regards,
>> Michael V. Zolotukhin,
>> Software Engineer
>> Intel Corporation.
>>
>
>
>
> --
> ---
> Best regards,
> Michael V. Zolotukhin,
> Software Engineer
> Intel Corporation.
>



-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.

[-- Attachment #2: memfunc-tests-3.patch --]
[-- Type: application/octet-stream, Size: 51576 bytes --]

diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-1.c
new file mode 100644
index 0000000..39c8ef0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-2.c
new file mode 100644
index 0000000..439694b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-3.c
new file mode 100644
index 0000000..51f4c3b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-4.c
new file mode 100644
index 0000000..bca8680
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-1.c
new file mode 100644
index 0000000..5bc8e74
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-2.c
new file mode 100644
index 0000000..b7dff27
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-1.c
new file mode 100644
index 0000000..bee85fe
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemcpy" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-2.c
new file mode 100644
index 0000000..1160beb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemcpy" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-1.c
new file mode 100644
index 0000000..b1c78ec
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-2.c
new file mode 100644
index 0000000..a15a0f7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-3.c
new file mode 100644
index 0000000..2789660
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-4.c
new file mode 100644
index 0000000..17e0342
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-1.c
new file mode 100644
index 0000000..e437378
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-2.c
new file mode 100644
index 0000000..ba716df
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-3.c
new file mode 100644
index 0000000..1845e95
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-4.c
new file mode 100644
index 0000000..2b23751
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-au-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-1.c
new file mode 100644
index 0000000..e751192
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemcpy" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-au-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-2.c
new file mode 100644
index 0000000..7defe7e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-au-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-3.c
new file mode 100644
index 0000000..ea27378
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-1.c
new file mode 100644
index 0000000..de2a557
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-2.c
new file mode 100644
index 0000000..1f82258
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-3.c
new file mode 100644
index 0000000..7f60806
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-4.c
new file mode 100644
index 0000000..94f0864
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-1.c
new file mode 100644
index 0000000..20545c8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-2.c
new file mode 100644
index 0000000..52dab8e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-3.c
new file mode 100644
index 0000000..c662480
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-4.c
new file mode 100644
index 0000000..9e8e152
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-1.c
new file mode 100644
index 0000000..662fc20
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-2.c
new file mode 100644
index 0000000..c90e852
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-3.c
new file mode 100644
index 0000000..5a41f82
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-4.c
new file mode 100644
index 0000000..ec2dfff
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-1.c
new file mode 100644
index 0000000..d6b2cd5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-2.c
new file mode 100644
index 0000000..9cd89e9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-3.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-3.c
new file mode 100644
index 0000000..ddf25fd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-4.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-4.c
new file mode 100644
index 0000000..fde4f5d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-1.c
new file mode 100644
index 0000000..4fe2d36
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-2.c
new file mode 100644
index 0000000..2209563
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-au-1.c b/gcc/testsuite/gcc.target/i386/memset-s3072-au-1.c
new file mode 100644
index 0000000..8d99dde
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemset" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-au-2.c b/gcc/testsuite/gcc.target/i386/memset-s3072-au-2.c
new file mode 100644
index 0000000..e0ad04a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemset" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-1.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-1.c
new file mode 100644
index 0000000..404d04e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-2.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-2.c
new file mode 100644
index 0000000..1df9db0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-3.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-3.c
new file mode 100644
index 0000000..beb005c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-4.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-4.c
new file mode 100644
index 0000000..29f5ea3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-1.c
new file mode 100644
index 0000000..2504333
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-2.c
new file mode 100644
index 0000000..b0aaada
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-3.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-3.c
new file mode 100644
index 0000000..3e250d0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-4.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-4.c
new file mode 100644
index 0000000..c13edd7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-1.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-1.c
new file mode 100644
index 0000000..17d9525
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-2.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-2.c
new file mode 100644
index 0000000..8125e9d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-3.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-3.c
new file mode 100644
index 0000000..ff74811
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-4.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-4.c
new file mode 100644
index 0000000..d7e0c3d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-1.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-1.c
new file mode 100644
index 0000000..ea7b439
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-10.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-10.c
new file mode 100644
index 0000000..5ef250d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-10.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-11.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-11.c
new file mode 100644
index 0000000..846a807
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-11.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-12.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-12.c
new file mode 100644
index 0000000..a8f7c3b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-12.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-2.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-2.c
new file mode 100644
index 0000000..ae05e93
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-3.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-3.c
new file mode 100644
index 0000000..96462bd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-4.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-4.c
new file mode 100644
index 0000000..6aee01e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-5.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-5.c
new file mode 100644
index 0000000..bbad9b9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-5.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-6.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-6.c
new file mode 100644
index 0000000..8e90d72
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-6.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-7.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-7.c
new file mode 100644
index 0000000..26d0b42
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-7.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-8.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-8.c
new file mode 100644
index 0000000..84ec749
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-8.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-9.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-9.c
new file mode 100644
index 0000000..ef15265
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-9.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-1.c
new file mode 100644
index 0000000..444a8de
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-2.c
new file mode 100644
index 0000000..9154fb9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-3.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-3.c
new file mode 100644
index 0000000..9b7dac1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-4.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-4.c
new file mode 100644
index 0000000..713c8a8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-1.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-1.c
new file mode 100644
index 0000000..8c700c0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-2.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-2.c
new file mode 100644
index 0000000..c344fd0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-3.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-3.c
new file mode 100644
index 0000000..125de2f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-4.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-4.c
new file mode 100644
index 0000000..b50de1b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-1.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-1.c
new file mode 100644
index 0000000..c6fd271
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-2.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-2.c
new file mode 100644
index 0000000..32972e6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-3.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-3.c
new file mode 100644
index 0000000..ac615e8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-4.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-4.c
new file mode 100644
index 0000000..8458cfd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-5.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-5.c
new file mode 100644
index 0000000..210946d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-5.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-6.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-6.c
new file mode 100644
index 0000000..e63feae
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-6.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-7.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-7.c
new file mode 100644
index 0000000..72b2ba0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-7.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-8.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-8.c
new file mode 100644
index 0000000..cb5dc85
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-8.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-10-20  8:51                             ` Michael Zolotukhin
@ 2011-10-26 20:36                               ` Michael Zolotukhin
  2011-11-07  2:48                                 ` Jan Hubicka
  0 siblings, 1 reply; 52+ messages in thread
From: Michael Zolotukhin @ 2011-10-26 20:36 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Jack Howarth, gcc-patches, Jan Hubicka, Richard Guenther,
	H.J. Lu, izamyatin, areg.melikadamyan

Any questions on these patches? Are they ok for the trunk?

On 20 October 2011 12:37, Michael Zolotukhin
<michael.v.zolotukhin@gmail.com> wrote:
> And, finally, part with the tests.
>
> On 20 October 2011 12:36, Michael Zolotukhin
> <michael.v.zolotukhin@gmail.com> wrote:
>> Back-end part of the patch is attached here.
>>
>> On 20 October 2011 12:35, Michael Zolotukhin
>> <michael.v.zolotukhin@gmail.com> wrote:
>>> Middle-end part of the patch is attached.
>>>
>>> On 20 October 2011 12:34, Michael Zolotukhin
>>> <michael.v.zolotukhin@gmail.com> wrote:
>>>> I fixed the tests as well as updated my branch and fixed introduced
>>>> during this process bugs.
>>>> Here is fixed complete patch (other parts will be sent in consequent letters).
>>>>
>>>> The changes passed bootstrap and make check.
>>>>
>>>> On 29 September 2011 15:21, Jakub Jelinek <jakub@redhat.com> wrote:
>>>>> Hi!
>>>>>
>>>>> On Thu, Sep 29, 2011 at 03:14:40PM +0400, Michael Zolotukhin wrote:
>>>>> +/* { dg-options "-O2 -march=atom -mtune=atom -m64 -dp" } */
>>>>>
>>>>> The testcases are wrong, -m64 or -m32 should never appear in dg-options,
>>>>> instead if the testcase is specific to -m64, it should be guarded with
>>>>> /* { dg-do compile { target lp64 } } */
>>>>> resp. ia32 (or ilp32, depending on what exactly should be done for -mx32),
>>>>> if you have the same testcase for -m32 and -m64, but just want different
>>>>> scan-assembler for the two cases, then just guard the scan-assembler
>>>>> with lp64 resp. ia32/ilp32 target and add second one for the other target.
>>>>>
>>>>>        Jakub
>>>>
>>>> --
>>>> ---
>>>> Best regards,
>>>> Michael V. Zolotukhin,
>>>> Software Engineer
>>>> Intel Corporation.
>>>>
>>>
>>>
>>>
>>> --
>>> ---
>>> Best regards,
>>> Michael V. Zolotukhin,
>>> Software Engineer
>>> Intel Corporation.
>>>
>>
>>
>>
>> --
>> ---
>> Best regards,
>> Michael V. Zolotukhin,
>> Software Engineer
>> Intel Corporation.
>>
>
>
>
> --
> ---
> Best regards,
> Michael V. Zolotukhin,
> Software Engineer
> Intel Corporation.
>



-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-10-20  8:46                           ` Michael Zolotukhin
  2011-10-20  8:51                             ` Michael Zolotukhin
@ 2011-10-27 16:10                             ` Jan Hubicka
  2011-10-28 13:14                               ` Michael Zolotukhin
  1 sibling, 1 reply; 52+ messages in thread
From: Jan Hubicka @ 2011-10-27 16:10 UTC (permalink / raw)
  To: Michael Zolotukhin
  Cc: Jakub Jelinek, Jack Howarth, gcc-patches, Jan Hubicka,
	Richard Guenther, H.J. Lu, izamyatin, areg.melikadamyan

Hi,
sorry for the delay with the review. This is my first pass through the backend part; hopefully
someone else will do the middle end bits.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 2c53423..d7c4330 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
   COSTS_N_BYTES (2),			/* cost of FABS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FCHS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},

I am a bit concerned about the explosion of variants, but adding aligned variants probably makes
sense.  I guess we should specify what alignment needs to be known.  I.e. is an alignment of 2 enough,
or should the alignment match the size of the loads/stores produced?

@@ -15266,6 +15362,38 @@ ix86_expand_move (enum machine_mode mode, rtx operands[])
     }
   else
     {
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE2
+	  && MEM_P (op0)
+	  && MEM_P (op1)
+	  && !push_operand (op0, mode)
+	  && can_create_pseudo_p ())
+	{
+	  rtx temp = gen_reg_rtx (V2DImode);
+	  emit_insn (gen_sse2_loadq (temp, op1));
+	  emit_insn (gen_sse_storeq (op0, temp));
+	  return;
+	}
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE
+	  && !MEM_P (op1)
+	  && GET_MODE (op1) == V2DImode)
+	{
+	  emit_insn (gen_sse_storeq (op0, op1));
+	  return;
+	}
+      if (mode == TImode
+	  && TARGET_AVX2
+	  && MEM_P (op0)
+	  && !MEM_P (op1)
+	  && GET_MODE (op1) == V4DImode)
+	{
+	  op0 = convert_to_mode (V2DImode, op0, 1);
+	  emit_insn (gen_vec_extract_lo_v4di (op0, op1));
+	  return;
+	}

This hunk seems dangerous in that emitting the explicit loadq/storeq pairs (and similar) will
prevent the use of integer registers for 64-bit/128-bit arithmetic.

I guess we could play such tricks for memory-memory moves & constant stores. With gimple optimizations
we already know pretty well that the moves will stay as they are.  That might be enough for you?

If you go this way, please make it a separate patch so it can be benchmarked.  While the moves are faster,
there are always problems with size mismatches in load/store buffers.
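To illustrate the size-mismatch concern, a sketch (not code from the patch):

  typedef char v16 __attribute__ ((vector_size (16)));

  long long
  f (char *p, v16 v)
  {
    *(v16 *) p = v;                 /* 16-byte vector store */
    return *(long long *) (p + 4);  /* narrower load hitting the same bytes;
                                       store-to-load forwarding can fail here
                                       on some chips */
  }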

I think for string operations we should output the SSE/AVX instruction variants
by hand + we could try to instruct IRA to actually prefer the use of SSE/AVX when
feasible?  This has traditionally been a problem with the old RA, because it did not
see that, since a pseudo is eventually used for arithmetic, it cannot go into an
SSE register.  So it was not possible to make MMX/SSE/AVX the preferred
variants for 64-bit/128-bit manipulations w/o hurting the performance of code that
does arithmetic on long long and __int128.  Perhaps IRA can solve this now.
Vladimir, do you have any idea?

+/* Helper function for expand_set_or_movmem_via_loop.
+   This function can reuse iter rtx from another loop and don't generate
+   code for updating the addresses.  */
+static rtx
+expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
+					 rtx destptr, rtx srcptr, rtx value,
+					 rtx count, rtx iter,
+					 enum machine_mode mode, int unroll,
+					 int expected_size, bool change_ptrs)

I wrote the original function, but it is not really clear to me what the function
does now, i.e. what the code for updating addresses is and what reusing iter means.
I guess reusing iter means that we won't start the loop from 0.  Could you
expand the comments a bit more?

I know I did not document them originally, but all the parameters ought to be
explicitly documented in a function comment.
@@ -20913,7 +21065,27 @@ emit_strmov (rtx destmem, rtx srcmem,
   emit_insn (gen_strmov (destptr, dest, srcptr, src));
 }
 
-/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST.  */
+/* Emit strset instuction.  If RHS is constant, and vector mode will be used,
+   then move this consatnt to a vector register before emitting strset.  */
+static void
+emit_strset (rtx destmem, rtx value,
+	     rtx destptr, enum machine_mode mode, int offset)

This seems to belong more naturally in the gen_strset expander?
 	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    {
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset + 4);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  offset += 8;
+	  if (GET_MODE (destmem) != move_mode)
+	    destmem = change_address (destmem, move_mode, destptr);
AFAIK change_address is not equivalent to adjust_automodify_address_nv in the way
it copies memory aliasing attributes, and it is needed to zap them here since stringops
behave funnily WRT aliasing.
   if (max_size > 16)
     {
       rtx label = ix86_expand_aligntest (count, 16, true);
       if (TARGET_64BIT)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));

No use for 128bit moves here?
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));

And here?
@@ -21204,8 +21426,8 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
   if (align <= 1 && desired_alignment > 1)
     {
       rtx label = ix86_expand_aligntest (destptr, 1, false);
-      srcmem = change_address (srcmem, QImode, srcptr);
-      destmem = change_address (destmem, QImode, destptr);
+      srcmem = adjust_automodify_address_1 (srcmem, QImode, srcptr, 0, 1);
+      destmem = adjust_automodify_address_1 (destmem, QImode, destptr, 0, 1);

You want to always use adjust_automodify_address or adjust_automodify_address_nv;
adjust_automodify_address_1 is not intended for general use.
@@ -21286,6 +21528,37 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
       off = 4;
       emit_insn (gen_strmov (destreg, dst, srcreg, src));
     }
+  if (align_bytes & 8)
+    {
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, DImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      else
+	{
+	  dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, SImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));

again, no use for vector moves?
+/* Target hook.  Returns rtx of mode MODE with promoted value VAL, that is
+   supposed to represent one byte.  MODE could be a vector mode.
+   Example:
+   1) VAL = const_int (0xAB), mode = SImode,
+   the result is const_int (0xABABABAB).

This can be handled in a machine-independent way, right?

+   2) if VAL isn't const, then the result will be the result of MUL-instruction
+   of VAL and const_int (0x01010101) (for SImode).  */

This would probably go better as a named expansion pattern, like we do for other
machine description interfaces.
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 90cef1c..4b7d67b 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5780,6 +5780,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum machine_mode @var{mode}, unsigned int @var{align})
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.

New hooks should go into the machine-independent part of the patch.

Honza

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-10-27 16:10                             ` Jan Hubicka
@ 2011-10-28 13:14                               ` Michael Zolotukhin
  2011-10-28 16:38                                 ` Richard Henderson
  0 siblings, 1 reply; 52+ messages in thread
From: Michael Zolotukhin @ 2011-10-28 13:14 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Jakub Jelinek, Jack Howarth, gcc-patches, Richard Guenther,
	H.J. Lu, izamyatin, areg.melikadamyan

Hi Jan!
Thanks for the review; you can find my answers to some of your
remarks below.  I'll send a corrected patch soon with answers to the
rest of your remarks.

> -  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
>    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> -  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
> +  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
>    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> +   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
>
> I am a bit concerned about the explosion of variants, but adding aligned variants probably makes
> sense.  I guess we should specify what alignment needs to be known.  I.e. is an alignment of 2 enough,
> or should the alignment match the size of the loads/stores produced?

Yes, the alignment should match the size of the loads/stores, and the offset
from the alignment boundary should be known.  Otherwise, the strategies for
unknown alignment are chosen.
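The new tests encode exactly this split; as a sketch of the two situations:

  extern void *memset (void *, int, __SIZE_TYPE__);

  char buf[65];            /* object alignment and the +1 offset are
                              known at compile time                  */
  void f1 (void) { memset (buf + 1, 0, 64); }

  extern char *p;          /* nothing is known about p's alignment   */
  void f2 (void) { memset (p + 1, 0, 64); }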


> This hunk seems dangerous in that emitting the explicit loadq/storeq pairs (and similar) will
> prevent the use of integer registers for 64-bit/128-bit arithmetic.
>
> I guess we could play such tricks for memory-memory moves & constant stores. With gimple optimizations
> we already know pretty well that the moves will stay as they are.  That might be enough for you?

Yes, theoretically it could harm 64/128-bit arithmetic, but what else
could we do if we have a DImode mem-to-mem move and our target is
32-bit?  Ideally, the RA should be able to make decisions on how to perform
such moves, but currently it doesn't generate SSE moves - when it is
able to do so, I think we could remove this part and rely on the RA.
And one more point: this is quite a special case - here we want to
perform the move via half of a vector register.  This is the main reason why
these particular cases are handled in a special, not common, way.
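To make the special case concrete, this is the kind of code the hunk is
aimed at (a sketch; compiled with -m32 -msse2):

  long long a, b;

  void
  copy (void)
  {
    a = b;   /* DImode memory-to-memory move: with the hunk it goes through
                the low half of an %xmm register (one load plus one store)
                instead of a pair of 32-bit GPR loads and stores.  */
  }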


> I wrote the original function, but it is not really clear to me what the function
> does now, i.e. what the code for updating addresses is and what reusing iter means.
> I guess reusing iter means that we won't start the loop from 0.  Could you
> expand the comments a bit more?
>
> I know I did not document them originally, but all the parameters ought to be
> explicitly documented in a function comment.

Yep, you're right - we just don't start the loop from 0. I'll send a
version with the comments soon.
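Roughly, the generated code corresponds to the following sketch (the unroll
factor and move size are only for illustration):

  typedef char v16 __attribute__ ((vector_size (16)));

  static void
  copy_sketch (char *dst, const char *src, unsigned long count)
  {
    unsigned long i = 0;
    for (; i + 64 <= count; i += 64)   /* main loop, unrolled by four */
      {
        *(v16 *) (dst + i)      = *(const v16 *) (src + i);
        *(v16 *) (dst + i + 16) = *(const v16 *) (src + i + 16);
        *(v16 *) (dst + i + 32) = *(const v16 *) (src + i + 32);
        *(v16 *) (dst + i + 48) = *(const v16 *) (src + i + 48);
      }
    for (; i + 16 <= count; i += 16)   /* second loop reuses the same i
                                          instead of restarting from 0 */
      *(v16 *) (dst + i) = *(const v16 *) (src + i);
    /* a scalar epilogue handles the remaining count - i bytes */
  }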


> -/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST.  */
> +/* Emit strset instuction.  If RHS is constant, and vector mode will be used,
> +   then move this consatnt to a vector register before emitting strset.  */
> +static void
> +emit_strset (rtx destmem, rtx value,
> +            rtx destptr, enum machine_mode mode, int offset)
>
> This seems to belong more naturally in the gen_strset expander?

I don't think it matters here, but to make emit_strset look similar to
emit_strmov, most of the emit_strset body really could be moved to
gen_strset.


>   if (max_size > 16)
>     {
>       rtx label = ix86_expand_aligntest (count, 16, true);
>       if (TARGET_64BIT)
>        {
> -         dest = change_address (destmem, DImode, destptr);
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> +         destmem = change_address (destmem, DImode, destptr);
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
> +                                                               value)));
>
> No use for 128bit moves here?
>        }
>       else
>        {
> -         dest = change_address (destmem, SImode, destptr);
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> +         destmem = change_address (destmem, SImode, destptr);
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
>
> And here?

For memset prologues/epilogues I avoid using vector moves, as they could
require expensive initialization (we need to create a vector filled
with the given value).  For the other cases, I'll re-check whether using
vector moves is possible.
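Conceptually, the scalar prologue amounts to the following sketch (the real
code branches on the individual alignment bits instead of looping, and the
16-byte alignment is just an example):

  static void
  setmem_prologue_sketch (char *dst, int c, unsigned long n)
  {
    /* Scalar stores until dst reaches the desired alignment.  */
    while (n && ((__UINTPTR_TYPE__) dst & 15))
      {
        *dst++ = c;
        n--;
      }
    /* ...the 16-byte vector stores take over from here.  */
  }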


> @@ -21286,6 +21528,37 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
>       off = 4;
>       emit_insn (gen_strmov (destreg, dst, srcreg, src));
>     }
> +  if (align_bytes & 8)
> +    {
> +      if (TARGET_64BIT || TARGET_SSE)
> +       {
> +         dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
> +         src = adjust_automodify_address_nv (src, DImode, srcreg, off);
> +         emit_insn (gen_strmov (destreg, dst, srcreg, src));
> +       }
> +      else
> +       {
> +         dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
> +         src = adjust_automodify_address_nv (src, SImode, srcreg, off);
> +         emit_insn (gen_strmov (destreg, dst, srcreg, src));
> +         emit_insn (gen_strmov (destreg, dst, srcreg, src));
>
> again, no use for vector moves?

Actually, here vector moves are used if they are available - if 32-bit
mode is used (so we can't do the move via a GPR) but SSE is available,
then an SSE move is generated.


> +/* Target hook.  Returns rtx of mode MODE with promoted value VAL, that is
> +   supposed to represent one byte.  MODE could be a vector mode.
> +   Example:
> +   1) VAL = const_int (0xAB), mode = SImode,
> +   the result is const_int (0xABABABAB).
>
> This can be handled in a machine-independent way, right?
>
> +   2) if VAL isn't const, then the result will be the result of MUL-instruction
> +   of VAL and const_int (0x01010101) (for SImode).  */
>
> This would probably go better as a named expansion pattern, like we do for other
> machine description interfaces.

I don't think it could be done in a machine-independent way - e.g. if
AVX is available, we can use broadcast instructions; if not, we need
to use multiply instructions; and on other architectures there are probably
other, more efficient ways to duplicate a byte value across the entire
vector register.  So IMO it's a good place to have a hook.
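For reference, for SImode the multiply-based promotion boils down to
(a sketch, with an illustrative function name):

  unsigned int
  promote_byte_si (unsigned char val)
  {
    return val * 0x01010101u;   /* val = 0xAB  ->  0xABABABAB */
  }

For vector modes the promoted word is then duplicated across the whole
register (with a broadcast instruction when AVX is available).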


> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 90cef1c..4b7d67b 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -5780,6 +5780,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
>  The default is zero which means to not iterate over other vector sizes.
>  @end deftypefn
>
> +@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum machine_mode @var{mode}, unsigned int @var{align})
> +This hook should return true if memory accesses in mode @var{mode} to data
> +aligned by @var{align} bits have a cost many times greater than aligned
> +accesses, for example if they are emulated in a trap handler.
>
> New hooks should go into the machine-independent part of the patch.
>

These changes are present in the middle-end part too (of course, in the full
patch they are not duplicated).  I left them in this patch as well to avoid
possible build problems.


-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.


On 27 October 2011 19:09, Jan Hubicka <hubicka@ucw.cz> wrote:
> Hi,
> sorry for delay with the review. This is my first pass through the backend part, hopefully
> someone else will do the middle end bits.
>
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 2c53423..d7c4330 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
>   COSTS_N_BYTES (2),                   /* cost of FABS instruction.  */
>   COSTS_N_BYTES (2),                   /* cost of FCHS instruction.  */
>   COSTS_N_BYTES (2),                   /* cost of FSQRT instruction.  */
> -  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
>    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> -  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
> +  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
>    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> +   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
>
> I am bit concerned about explossion of variants, but adding aligned variants probably makes
> sense.  I guess we should specify what alignment needs to be known. I.e. is alignment of 2 enough
> or shall the alignment be matching the size of loads/stores produced?
>
> @@ -15266,6 +15362,38 @@ ix86_expand_move (enum machine_mode mode, rtx operands[])
>     }
>   else
>     {
> +      if (mode == DImode
> +         && !TARGET_64BIT
> +         && TARGET_SSE2
> +         && MEM_P (op0)
> +         && MEM_P (op1)
> +         && !push_operand (op0, mode)
> +         && can_create_pseudo_p ())
> +       {
> +         rtx temp = gen_reg_rtx (V2DImode);
> +         emit_insn (gen_sse2_loadq (temp, op1));
> +         emit_insn (gen_sse_storeq (op0, temp));
> +         return;
> +       }
> +      if (mode == DImode
> +         && !TARGET_64BIT
> +         && TARGET_SSE
> +         && !MEM_P (op1)
> +         && GET_MODE (op1) == V2DImode)
> +       {
> +         emit_insn (gen_sse_storeq (op0, op1));
> +         return;
> +       }
> +      if (mode == TImode
> +         && TARGET_AVX2
> +         && MEM_P (op0)
> +         && !MEM_P (op1)
> +         && GET_MODE (op1) == V4DImode)
> +       {
> +         op0 = convert_to_mode (V2DImode, op0, 1);
> +         emit_insn (gen_vec_extract_lo_v4di (op0, op1));
> +         return;
> +       }
>
> This hunk seems dangerous in a way that by emitting the explicit loadq/storeq pairs (and similar) will
> prevent use of integer registers for 64bit/128bit arithmetic.
>
> I guess we could play such tricks for memory-memory moves & constant stores. With gimple optimizations
> we already know pretty well that the moves will stay as they are.  That might be enough for you?
>
> If you go this way, please make separate patch so it can be benchmarked. While the moves are faster
> there are always problem with size mismatches in load/store buffers.
>
> I think for string operations we should output the SSE/AVX instruction variants
> by hand + we could try to instruct IRA to actually preffer use of SSE/AVX when
> feasible?  This has been traditionally problem with old RA because it did not
> see that because pseudo is eventually used for arithmetics, it can not go into
> SSE register. So it was not possible to make MMX/SSE/AVX to be preferred
> variants for 64bit/128bit manipulations w/o hurting performance of code that
> does arithmetic on long long and __int128.  Perhaps IRA can solve this now.
> Vladimir, do you have any ida?
>
> +/* Helper function for expand_set_or_movmem_via_loop.
> +   This function can reuse iter rtx from another loop and don't generate
> +   code for updating the addresses.  */
> +static rtx
> +expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
> +                                        rtx destptr, rtx srcptr, rtx value,
> +                                        rtx count, rtx iter,
> +                                        enum machine_mode mode, int unroll,
> +                                        int expected_size, bool change_ptrs)
>
> I wrote the original function, but it is not really clear for me what the function
> does now. I.e. what is code for updating addresses and what means reusing iter.
> I guess reusing iter means that we won't start the loop from 0.  Could you
> expand comments a bit more?
>
> I know I did not documented them originally, but all the parameters ought to be
> explicitely documented in a function comment.
> @@ -20913,7 +21065,27 @@ emit_strmov (rtx destmem, rtx srcmem,
>   emit_insn (gen_strmov (destptr, dest, srcptr, src));
>  }
>
> -/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST.  */
> +/* Emit strset instuction.  If RHS is constant, and vector mode will be used,
> +   then move this consatnt to a vector register before emitting strset.  */
> +static void
> +emit_strset (rtx destmem, rtx value,
> +            rtx destptr, enum machine_mode mode, int offset)
>
> This seems to more naturally belong into gen_strset expander?
>        {
> -         if (TARGET_64BIT)
> -           {
> -             dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
> -             emit_insn (gen_strset (destptr, dest, value));
> -           }
> -         else
> -           {
> -             dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
> -             emit_insn (gen_strset (destptr, dest, value));
> -             dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset + 4);
> -             emit_insn (gen_strset (destptr, dest, value));
> -           }
> -         offset += 8;
> +         if (GET_MODE (destmem) != move_mode)
> +           destmem = change_address (destmem, move_mode, destptr);
> AFAIK change_address is not equivalent to adjust_automodify_address_nv in that
> it copies the memory aliasing attributes, and it is needed to zap them here since stringops
> behave funnily WRT aliasing.
>   if (max_size > 16)
>     {
>       rtx label = ix86_expand_aligntest (count, 16, true);
>       if (TARGET_64BIT)
>        {
> -         dest = change_address (destmem, DImode, destptr);
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> +         destmem = change_address (destmem, DImode, destptr);
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
> +                                                               value)));
>
> No use for 128bit moves here?
>        }
>       else
>        {
> -         dest = change_address (destmem, SImode, destptr);
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> +         destmem = change_address (destmem, SImode, destptr);
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
>
> And here?
> @@ -21204,8 +21426,8 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
>   if (align <= 1 && desired_alignment > 1)
>     {
>       rtx label = ix86_expand_aligntest (destptr, 1, false);
> -      srcmem = change_address (srcmem, QImode, srcptr);
> -      destmem = change_address (destmem, QImode, destptr);
> +      srcmem = adjust_automodify_address_1 (srcmem, QImode, srcptr, 0, 1);
> +      destmem = adjust_automodify_address_1 (destmem, QImode, destptr, 0, 1);
>
> You want to always use adjust_automodify_address or adjust_automodify_address_nv;
> adjust_automodify_address_1 is not intended for general use.
> @@ -21286,6 +21528,37 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
>       off = 4;
>       emit_insn (gen_strmov (destreg, dst, srcreg, src));
>     }
> +  if (align_bytes & 8)
> +    {
> +      if (TARGET_64BIT || TARGET_SSE)
> +       {
> +         dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
> +         src = adjust_automodify_address_nv (src, DImode, srcreg, off);
> +         emit_insn (gen_strmov (destreg, dst, srcreg, src));
> +       }
> +      else
> +       {
> +         dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
> +         src = adjust_automodify_address_nv (src, SImode, srcreg, off);
> +         emit_insn (gen_strmov (destreg, dst, srcreg, src));
> +         emit_insn (gen_strmov (destreg, dst, srcreg, src));
>
> again, no use for vector moves?
> +/* Target hook.  Returns rtx of mode MODE with promoted value VAL, that is
> +   supposed to represent one byte.  MODE could be a vector mode.
> +   Example:
> +   1) VAL = const_int (0xAB), mode = SImode,
> +   the result is const_int (0xABABABAB).
>
> This can be handled in machine independent way, right?
>
> +   2) if VAL isn't const, then the result will be the result of MUL-instruction
> +   of VAL and const_int (0x01010101) (for SImode).  */
>
> This would probably go better as named expansion pattern, like we do for other
> machine description interfaces.
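
For the constant case the promotion really is plain integer arithmetic, so it
needs nothing target-specific.  A minimal machine-independent sketch of the
computation (illustrative only, not code from the patch):

  #include <stdint.h>

  /* Replicate the byte VAL across an N_BYTES-wide integer:
     replicate_byte (0xAB, 4) == 0xABABABAB, i.e. 0xAB * 0x01010101.  */
  static uint64_t
  replicate_byte (uint8_t val, unsigned int n_bytes)
  {
    uint64_t ones = 0;
    unsigned int i;
    for (i = 0; i < n_bytes; i++)
      ones = (ones << 8) | 1;   /* Builds 0x0101...01 with N_BYTES bytes.  */
    return (uint64_t) val * ones;
  }
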
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 90cef1c..4b7d67b 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -5780,6 +5780,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
>  The default is zero which means to not iterate over other vector sizes.
>  @end deftypefn
>
> +@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum machine_mode @var{mode}, unsigned int @var{align})
> +This hook should return true if memory accesses in mode @var{mode} to data
> +aligned by @var{align} bits have a cost many times greater than aligned
> +accesses, for example if they are emulated in a trap handler.
>
> New hooks should go to the machine-independent part of the patch.
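
For illustration, a target implementation of such a hook might look roughly
like this (a hypothetical sketch, not the patch's actual implementation; the
fallback mirrors the spirit of the existing SLOW_UNALIGNED_ACCESS macro):

  /* Return true if an unaligned access in MODE at an alignment of ALIGN bits
     is expected to be much slower than an aligned one.  */
  static bool
  example_slow_unaligned_access (enum machine_mode mode, unsigned int align)
  {
    /* Hypothetical policy: word-sized and smaller accesses are assumed cheap;
       wider (vector) accesses below their natural alignment count as slow.  */
    if (GET_MODE_SIZE (mode) <= UNITS_PER_WORD)
      return false;
    return align < GET_MODE_ALIGNMENT (mode);
  }
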
>
> Honza
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-10-28 13:14                               ` Michael Zolotukhin
@ 2011-10-28 16:38                                 ` Richard Henderson
  2011-11-01 17:36                                   ` Michael Zolotukhin
  0 siblings, 1 reply; 52+ messages in thread
From: Richard Henderson @ 2011-10-28 16:38 UTC (permalink / raw)
  To: Michael Zolotukhin
  Cc: Jan Hubicka, Jakub Jelinek, Jack Howarth, gcc-patches,
	Richard Guenther, H.J. Lu, izamyatin, areg.melikadamyan

On 10/28/2011 05:41 AM, Michael Zolotukhin wrote:
>> > +/* Target hook.  Returns rtx of mode MODE with promoted value VAL, that is
>> > +   supposed to represent one byte.  MODE could be a vector mode.
>> > +   Example:
>> > +   1) VAL = const_int (0xAB), mode = SImode,
>> > +   the result is const_int (0xABABABAB).
>> >
>> > This can be handled in machine independent way, right?
>> >
>> > +   2) if VAL isn't const, then the result will be the result of MUL-instruction
>> > +   of VAL and const_int (0x01010101) (for SImode).  */
>> >
>> > This would probably go better as named expansion pattern, like we do for other
>> > machine description interfaces.
> I don't think it could be done in a machine-independent way - e.g. if
> AVX is available, we could use broadcast instructions; if not, we
> need to use multiply instructions, and on other architectures there are
> probably other, more efficient ways to duplicate a byte value across
> the entire vector register. So IMO it's a good place to have a hook.
> 
> 

Certainly it can be done machine-independently.
See expand_vector_broadcast in optabs.c for a start.


r~

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-10-28 16:38                                 ` Richard Henderson
@ 2011-11-01 17:36                                   ` Michael Zolotukhin
  2011-11-01 17:48                                     ` Michael Zolotukhin
                                                       ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Michael Zolotukhin @ 2011-11-01 17:36 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Richard Henderson, Jakub Jelinek, Jack Howarth, gcc-patches,
	Richard Guenther, H.J. Lu, izamyatin, areg.melikadamyan

[-- Attachment #1: Type: text/plain, Size: 7006 bytes --]

Thanks for the answers!

I tried to take into account all the remarks and updated the patch in
accordance with them. Its full version is attached to this letter;
separate middle-end and back-end parts will follow in subsequent letters.

What about the rest of the patch? Jan, could you please review it too?


Below are responses to the remarks you made.
> +/* Helper function for expand_set_or_movmem_via_loop.
> +   This function can reuse iter rtx from another loop and don't generate
> +   code for updating the addresses.  */
> +static rtx
> +expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
> +                                        rtx destptr, rtx srcptr, rtx value,
> +                                        rtx count, rtx iter,
> +                                        enum machine_mode mode, int unroll,
> +                                        int expected_size, bool change_ptrs)
>
> I wrote the original function, but it is not really clear to me what the function
> does now. I.e. what is the code for updating addresses, and what does reusing iter mean?
> I guess reusing iter means that we won't start the loop from 0.  Could you
> expand the comments a bit more?
I added some comments - please see the updated version of the patch.
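
To illustrate what the reuse buys us, the generated code is roughly shaped
like the following C sketch (illustrative only; the expander of course emits
RTL, and the chunk sizes depend on the chosen move mode and unroll factor):

  #include <string.h>

  static void
  copy_like_two_loops (char *dst, const char *src, size_t count)
  {
    size_t iter = 0;
    /* First loop: wide chunks, iterator starts at 0.  */
    for (; iter + 16 <= count; iter += 16)
      memcpy (dst + iter, src + iter, 16);
    /* Second loop reuses the same iterator, so it continues where the first
       one stopped instead of restarting from 0.  With CHANGE_PTRS the
       pointers would be advanced by the final iterator value only once,
       after the last loop.  */
    for (; iter + 4 <= count; iter += 4)
      memcpy (dst + iter, src + iter, 4);
  }
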


> -/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST.  */
> +/* Emit strset instuction.  If RHS is constant, and vector mode will be used,
> +   then move this consatnt to a vector register before emitting strset.  */
> +static void
> +emit_strset (rtx destmem, rtx value,
> +            rtx destptr, enum machine_mode mode, int offset)
>
> This seems to belong more naturally in the gen_strset expander?
Corrected.


>        {
> -         if (TARGET_64BIT)
> -           {
> -             dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
> -             emit_insn (gen_strset (destptr, dest, value));
> -           }
> -         else
> -           {
> -             dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
> -             emit_insn (gen_strset (destptr, dest, value));
> -             dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset + 4);
> -             emit_insn (gen_strset (destptr, dest, value));
> -           }
> -         offset += 8;
> +         if (GET_MODE (destmem) != move_mode)
> +           destmem = change_address (destmem, move_mode, destptr);
> AFAIK change_address is not equivalent to adjust_automodify_address_nv in that
> it copies the memory aliasing attributes, and it is needed to zap them here since stringops
> behave funnily WRT aliasing.
Fixed.


>   if (max_size > 16)
>     {
>       rtx label = ix86_expand_aligntest (count, 16, true);
>       if (TARGET_64BIT)
>        {
> -         dest = change_address (destmem, DImode, destptr);
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> +         destmem = change_address (destmem, DImode, destptr);
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
> +                                                               value)));
>
> No use for 128bit moves here?
>        }
>       else
>        {
> -         dest = change_address (destmem, SImode, destptr);
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> +         destmem = change_address (destmem, SImode, destptr);
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
>
> And here?
Fixed.


> @@ -21204,8 +21426,8 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
>   if (align <= 1 && desired_alignment > 1)
>     {
>       rtx label = ix86_expand_aligntest (destptr, 1, false);
> -      srcmem = change_address (srcmem, QImode, srcptr);
> -      destmem = change_address (destmem, QImode, destptr);
> +      srcmem = adjust_automodify_address_1 (srcmem, QImode, srcptr, 0, 1);
> +      destmem = adjust_automodify_address_1 (destmem, QImode, destptr, 0, 1);
>
> You want to always use adjust_automodify_address or adjust_automodify_address_nv;
> adjust_automodify_address_1 is not intended for general use.
Fixed.


> +/* Target hook.  Returns rtx of mode MODE with promoted value VAL, that is
> +   supposed to represent one byte.  MODE could be a vector mode.
> +   Example:
> +   1) VAL = const_int (0xAB), mode = SImode,
> +   the result is const_int (0xABABABAB).
>
> This can be handled in machine independent way, right?

> Certainly it can be done machine-independently.
> See expand_vector_broadcast in optabs.c for a start.
Thanks, there is indeed no need for a new hook. Fixed.
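
For the non-constant case, the machine-independent entry point Richard pointed
at can be used; a sketch of a caller (illustrative only - the attached patch
introduces a related helper, expand_vector_broadcast_of_byte_value):

  /* VAL holds the byte to store.  Ask the middle end to build a vector with
     that byte duplicated into every QImode element; expand_vector_broadcast
     may return NULL_RTX if the target has no usable vec_duplicate/vec_init
     support, in which case we fall back to GPR promotion.  */
  rtx vec_value = expand_vector_broadcast (V16QImode, val);
  if (vec_value == NULL_RTX)
    vec_value = promote_duplicated_reg (word_mode, val);
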

Responses to other questions/remarks were in the previous letter.

On 28 October 2011 19:59, Richard Henderson <rth@redhat.com> wrote:
> On 10/28/2011 05:41 AM, Michael Zolotukhin wrote:
>>> > +/* Target hook.  Returns rtx of mode MODE with promoted value VAL, that is
>>> > +   supposed to represent one byte.  MODE could be a vector mode.
>>> > +   Example:
>>> > +   1) VAL = const_int (0xAB), mode = SImode,
>>> > +   the result is const_int (0xABABABAB).
>>> >
>>> > This can be handled in machine independent way, right?
>>> >
>>> > +   2) if VAL isn't const, then the result will be the result of MUL-instruction
>>> > +   of VAL and const_int (0x01010101) (for SImode).  */
>>> >
>>> > This would probably go better as named expansion pattern, like we do for other
>>> > machine description interfaces.
>> I don't think it could be done in a machine-independent way - e.g. if
>> AVX is available, we could use broadcast instructions; if not, we
>> need to use multiply instructions, and on other architectures there are
>> probably other, more efficient ways to duplicate a byte value across
>> the entire vector register. So IMO it's a good place to have a hook.
>>
>>
>
> Certainly it can be done machine-independently.
> See expand_vector_broadcast in optabs.c for a start.
>
>
> r~
>



-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.

[-- Attachment #2: memfunc-complete-4.patch --]
[-- Type: application/octet-stream, Size: 164401 bytes --]

diff --git a/gcc/builtins.c b/gcc/builtins.c
index 296c5b7..3e41695 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -3567,7 +3567,8 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
 				  builtin_memset_read_str, &c, dest_align,
 				  true))
 	store_by_pieces (dest_mem, tree_low_cst (len, 1),
-			 builtin_memset_read_str, &c, dest_align, true, 0);
+			 builtin_memset_read_str, gen_int_mode (c, val_mode),
+			 dest_align, true, 0);
       else if (!set_storage_via_setmem (dest_mem, len_rtx,
 					gen_int_mode (c, val_mode),
 					dest_align, expected_align,
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 2c53423..6ce240a 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
   COSTS_N_BYTES (2),			/* cost of FABS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FCHS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -632,10 +636,14 @@ struct processor_costs i386_cost = {	/* 386 specific costs */
   COSTS_N_INSNS (22),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (24),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (122),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -704,10 +712,14 @@ struct processor_costs i486_cost = {	/* 486 specific costs */
   COSTS_N_INSNS (3),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (3),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (83),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -774,10 +786,14 @@ struct processor_costs pentium_cost = {
   COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (70),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{-1, rep_prefix_4_byte}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -849,12 +865,18 @@ struct processor_costs pentiumpro_cost = {
      noticeable win, for bigger blocks either rep movsl or rep movsb is
      way to go.  Rep movsb has apparently more expensive startup time in CPU,
      but after 4K the difference is down in the noise.  */
-  {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
+  {{{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
 			{8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_4_byte, {{1024, unrolled_loop},
-  			{8192, rep_prefix_4_byte}, {-1, libcall}}},
+   {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_4_byte, {{1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_4_byte, {{1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -922,10 +944,14 @@ struct processor_costs geode_cost = {
   COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (54),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -995,10 +1021,14 @@ struct processor_costs k6_cost = {
   COSTS_N_INSNS (2),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (2),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (56),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1068,10 +1098,14 @@ struct processor_costs athlon_cost = {
   /* For some reason, Athlon deals better with REP prefix (relative to loops)
      compared to K8. Alignment becomes important after 8 bytes for memcpy and
      128 bytes for memset.  */
-  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1146,11 +1180,16 @@ struct processor_costs k8_cost = {
   /* K8 has optimized REP instruction for medium sized blocks, but for very
      small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1233,11 +1272,16 @@ struct processor_costs amdfam10_cost = {
   /* AMDFAM10 has optimized REP instruction for medium sized blocks, but for
      very small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1320,11 +1364,16 @@ struct processor_costs bdver1_cost = {
   /*  BDVER1 has optimized REP instruction for medium sized blocks, but for
       very small blocks it is better to use loop. For large blocks, libcall
       can do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   6,					/* scalar_stmt_cost.  */
   4,					/* scalar load_cost.  */
   4,					/* scalar_store_cost.  */
@@ -1407,11 +1456,16 @@ struct processor_costs bdver2_cost = {
   /*  BDVER2 has optimized REP instruction for medium sized blocks, but for
       very small blocks it is better to use loop. For large blocks, libcall
       can do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+  {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   6,					/* scalar_stmt_cost.  */
   4,					/* scalar load_cost.  */
   4,					/* scalar_store_cost.  */
@@ -1489,11 +1543,16 @@ struct processor_costs btver1_cost = {
   /* BTVER1 has optimized REP instruction for medium sized blocks, but for
      very small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1560,11 +1619,18 @@ struct processor_costs pentium4_cost = {
   COSTS_N_INSNS (2),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (2),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (43),			/* cost of FSQRT instruction.  */
-  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+
+  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+
+  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
    {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1631,13 +1697,22 @@ struct processor_costs nocona_cost = {
   COSTS_N_INSNS (3),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (3),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (44),			/* cost of FSQRT instruction.  */
-  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+
+  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
    {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
 	      {100000, unrolled_loop}, {-1, libcall}}}},
-  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
+	      {100000, unrolled_loop}, {-1, libcall}}}}},
+
+  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
    {-1, libcall}}},
    {libcall, {{24, loop}, {64, unrolled_loop},
 	      {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {-1, libcall}}},
+   {libcall, {{24, loop}, {64, unrolled_loop},
+	      {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1704,13 +1779,21 @@ struct processor_costs atom_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{11, loop}, {-1, rep_prefix_4_byte}}},
-   {libcall, {{32, loop}, {64, rep_prefix_4_byte},
-	  {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {15, unrolled_loop},
-	  {2048, rep_prefix_4_byte}, {-1, libcall}}},
-   {libcall, {{24, loop}, {32, unrolled_loop},
-	  {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+
+  /* stringop_algs for memcpy.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
+    {libcall, {{4096, unrolled_loop}, {-1, libcall}}}},
+   {{libcall, {{-1, libcall}}},			       /* Unknown alignment.  */
+    {libcall, {{2048, unrolled_loop},
+	       {-1, libcall}}}}},
+
+  /* stringop_algs for memset.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
+    {libcall, {{4096, unrolled_loop}, {-1, libcall}}}},
+   {{libcall, {{1024, unrolled_loop},		       /* Unknown alignment.  */
+	       {-1, libcall}}},
+    {libcall, {{2048, unrolled_loop},
+	       {-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1784,10 +1867,16 @@ struct processor_costs generic64_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {DUMMY_STRINGOP_ALGS,
-   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {DUMMY_STRINGOP_ALGS,
-   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+
+  {{DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}},
+   {DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}}},
+
+  {{DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}},
+   {DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1856,10 +1945,16 @@ struct processor_costs generic32_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+  /* stringop_algs for memcpy.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  /* stringop_algs for memset.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -2537,6 +2632,8 @@ static void ix86_set_current_function (tree);
 static unsigned int ix86_minimum_incoming_stack_boundary (bool);
 
 static enum calling_abi ix86_function_abi (const_tree);
+static rtx promote_duplicated_reg (enum machine_mode, rtx);
+static rtx promote_duplicated_reg_to_size (rtx, int, int, int);
 
 \f
 #ifndef SUBTARGET32_DEFAULT_CPU
@@ -15266,6 +15363,38 @@ ix86_expand_move (enum machine_mode mode, rtx operands[])
     }
   else
     {
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE2
+	  && MEM_P (op0)
+	  && MEM_P (op1)
+	  && !push_operand (op0, mode)
+	  && can_create_pseudo_p ())
+	{
+	  rtx temp = gen_reg_rtx (V2DImode);
+	  emit_insn (gen_sse2_loadq (temp, op1));
+	  emit_insn (gen_sse_storeq (op0, temp));
+	  return;
+	}
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE
+	  && !MEM_P (op1)
+	  && GET_MODE (op1) == V2DImode)
+	{
+	  emit_insn (gen_sse_storeq (op0, op1));
+	  return;
+	}
+      if (mode == TImode
+	  && TARGET_AVX2
+	  && MEM_P (op0)
+	  && !MEM_P (op1)
+	  && GET_MODE (op1) == V4DImode)
+	{
+	  op0 = convert_to_mode (V2DImode, op0, 1);
+	  emit_insn (gen_vec_extract_lo_v4di (op0, op1));
+	  return;
+	}
       if (MEM_P (op0)
 	  && (PUSH_ROUNDING (GET_MODE_SIZE (mode)) != GET_MODE_SIZE (mode)
 	      || !push_operand (op0, mode))
@@ -20677,22 +20806,37 @@ counter_mode (rtx count_exp)
   return SImode;
 }
 
-/* When SRCPTR is non-NULL, output simple loop to move memory
+/* Helper function for expand_set_or_movmem_via_loop.
+
+   When SRCPTR is non-NULL, output simple loop to move memory
    pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
    overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
    equivalent loop to set memory by VALUE (supposed to be in MODE).
 
    The size is rounded down to whole number of chunk size moved at once.
-   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
+   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.
 
+   If ITER isn't NULL, then it'll be used in the generated loop without
+   initialization (that allows generating several consecutive loops using the
+   same iterator).
+   If CHANGE_PTRS is specified, DESTPTR and SRCPTR will be increased by the
+   iterator value at the end of the function (as if they iterated in the loop).
+   Otherwise, their values will stay unchanged.
 
-static void
-expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
-			       rtx destptr, rtx srcptr, rtx value,
-			       rtx count, enum machine_mode mode, int unroll,
-			       int expected_size)
+   If EXPECTED_SIZE isn't -1, then it's used to compute branch probabilities on
+   the loop backedge.  When expected size is unknown (it's -1), the probability
+   is set to 80%.
+
+   Return value is the rtx of the iterator used in the loop - it can be reused in
+   subsequent calls of this function.  */
+static rtx
+expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
+					 rtx destptr, rtx srcptr, rtx value,
+					 rtx count, rtx iter,
+					 enum machine_mode mode, int unroll,
+					 int expected_size, bool change_ptrs)
 {
-  rtx out_label, top_label, iter, tmp;
+  rtx out_label, top_label, tmp;
   enum machine_mode iter_mode = counter_mode (count);
   rtx piece_size = GEN_INT (GET_MODE_SIZE (mode) * unroll);
   rtx piece_size_mask = GEN_INT (~((GET_MODE_SIZE (mode) * unroll) - 1));
@@ -20700,10 +20844,12 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
   rtx x_addr;
   rtx y_addr;
   int i;
+  bool reuse_iter = (iter != NULL_RTX);
 
   top_label = gen_label_rtx ();
   out_label = gen_label_rtx ();
-  iter = gen_reg_rtx (iter_mode);
+  if (!reuse_iter)
+    iter = gen_reg_rtx (iter_mode);
 
   size = expand_simple_binop (iter_mode, AND, count, piece_size_mask,
 			      NULL, 1, OPTAB_DIRECT);
@@ -20714,18 +20860,21 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
 			       true, out_label);
       predict_jump (REG_BR_PROB_BASE * 10 / 100);
     }
-  emit_move_insn (iter, const0_rtx);
+  if (!reuse_iter)
+    emit_move_insn (iter, const0_rtx);
 
   emit_label (top_label);
 
   tmp = convert_modes (Pmode, iter_mode, iter, true);
   x_addr = gen_rtx_PLUS (Pmode, destptr, tmp);
-  destmem = change_address (destmem, mode, x_addr);
+  destmem =
+    adjust_automodify_address_nv (copy_rtx (destmem), mode, x_addr, 0);
 
   if (srcmem)
     {
       y_addr = gen_rtx_PLUS (Pmode, srcptr, copy_rtx (tmp));
-      srcmem = change_address (srcmem, mode, y_addr);
+      srcmem =
+	adjust_automodify_address_nv (copy_rtx (srcmem), mode, y_addr, 0);
 
       /* When unrolling for chips that reorder memory reads and writes,
 	 we can save registers by using single temporary.
@@ -20797,19 +20946,43 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
     }
   else
     predict_jump (REG_BR_PROB_BASE * 80 / 100);
-  iter = ix86_zero_extend_to_Pmode (iter);
-  tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
-			     true, OPTAB_LIB_WIDEN);
-  if (tmp != destptr)
-    emit_move_insn (destptr, tmp);
-  if (srcptr)
+  if (change_ptrs)
     {
-      tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
+      iter = ix86_zero_extend_to_Pmode (iter);
+      tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
 				 true, OPTAB_LIB_WIDEN);
-      if (tmp != srcptr)
-	emit_move_insn (srcptr, tmp);
+      if (tmp != destptr)
+	emit_move_insn (destptr, tmp);
+      if (srcptr)
+	{
+	  tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
+				     true, OPTAB_LIB_WIDEN);
+	  if (tmp != srcptr)
+	    emit_move_insn (srcptr, tmp);
+	}
     }
   emit_label (out_label);
+  return iter;
+}
+
+/* When SRCPTR is non-NULL, output simple loop to move memory
+   pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
+   overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
+   equivalent loop to set memory by VALUE (supposed to be in MODE).
+
+   The size is rounded down to whole number of chunk size moved at once.
+   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
+
+static void
+expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
+			       rtx destptr, rtx srcptr, rtx value,
+			       rtx count, enum machine_mode mode, int unroll,
+			       int expected_size)
+{
+  expand_set_or_movmem_via_loop_with_iter (destmem, srcmem,
+				 destptr, srcptr, value,
+				 count, NULL_RTX, mode, unroll,
+				 expected_size, true);
 }
 
 /* Output "rep; mov" instruction.
@@ -20913,7 +21086,18 @@ emit_strmov (rtx destmem, rtx srcmem,
   emit_insn (gen_strmov (destptr, dest, srcptr, src));
 }
 
-/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST.  */
+/* Emit strset instruction.  If RHS is constant, and a vector mode will be used,
+   then move this constant to a vector register before emitting strset.  */
+static void
+emit_strset (rtx destmem, rtx value,
+	     rtx destptr, enum machine_mode mode, int offset)
+{
+  rtx dest = adjust_automodify_address_nv (destmem, mode, destptr, offset);
+  emit_insn (gen_strset (destptr, dest, value));
+}
+
+/* Output code to copy (COUNT % MAX_SIZE) bytes from SRCPTR to DESTPTR.
+   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
 static void
 expand_movmem_epilogue (rtx destmem, rtx srcmem,
 			rtx destptr, rtx srcptr, rtx count, int max_size)
@@ -20924,43 +21108,55 @@ expand_movmem_epilogue (rtx destmem, rtx srcmem,
       HOST_WIDE_INT countval = INTVAL (count);
       int offset = 0;
 
-      if ((countval & 0x10) && max_size > 16)
+      int remainder_size = countval % max_size;
+      enum machine_mode move_mode = Pmode;
+
+      /* Firstly, try to move data with the widest possible mode.
+	 Remaining part we'll move using Pmode and narrower modes.  */
+      if (TARGET_SSE)
 	{
-	  if (TARGET_64BIT)
-	    {
-	      emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
-	      emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset + 8);
-	    }
-	  else
-	    gcc_unreachable ();
-	  offset += 16;
+	  if (max_size >= GET_MODE_SIZE (V4SImode))
+	    move_mode = V4SImode;
+	  else if (max_size >= GET_MODE_SIZE (DImode))
+	    move_mode = DImode;
 	}
-      if ((countval & 0x08) && max_size > 8)
+
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-	  if (TARGET_64BIT)
-	    emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
-	  else
-	    {
-	      emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
-	      emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset + 4);
-	    }
-	  offset += 8;
+	  emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
 	}
-      if ((countval & 0x04) && max_size > 4)
+
+      /* Move the remaining part of epilogue - its size might be
+	 a size of the widest mode.  */
+      move_mode = Pmode;
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
+	}
+
+      if (remainder_size >= 4)
+	{
+	  emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
 	  offset += 4;
+	  remainder_size -= 4;
 	}
-      if ((countval & 0x02) && max_size > 2)
+      if (remainder_size >= 2)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
 	  offset += 2;
+	  remainder_size -= 2;
 	}
-      if ((countval & 0x01) && max_size > 1)
+      if (remainder_size >= 1)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
 	  offset += 1;
+	  remainder_size -= 1;
 	}
+      gcc_assert (remainder_size == 0);
       return;
     }
   if (max_size > 8)
@@ -21066,87 +21262,134 @@ expand_setmem_epilogue_via_loop (rtx destmem, rtx destptr, rtx value,
 				 1, max_size / 2);
 }
 
-/* Output code to set at most count & (max_size - 1) bytes starting by DEST.  */
+/* Output code to set with VALUE at most (COUNT % MAX_SIZE) bytes starting from
+   DESTPTR.
+   DESTMEM provides MEMrtx to feed proper aliasing info.
+   PROMOTED_TO_GPR_VALUE is rtx representing a GPR containing broadcasted VALUE.
+   PROMOTED_TO_VECTOR_VALUE is rtx representing a vector register containing
+   broadcasted VALUE.
+   PROMOTED_TO_GPR_VALUE and PROMOTED_TO_VECTOR_VALUE could be NULL if the
+   promotion hasn't been generated before.  */
 static void
-expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_size)
+expand_setmem_epilogue (rtx destmem, rtx destptr, rtx promoted_to_vector_value,
+			rtx promoted_to_gpr_value, rtx value, rtx count,
+			int max_size)
 {
-  rtx dest;
-
   if (CONST_INT_P (count))
     {
       HOST_WIDE_INT countval = INTVAL (count);
       int offset = 0;
 
-      if ((countval & 0x10) && max_size > 16)
+      int remainder_size = countval % max_size;
+      enum machine_mode move_mode = Pmode;
+      enum machine_mode sse_mode = TARGET_64BIT ? V2DImode : V4SImode;
+      rtx promoted_value = NULL_RTX;
+
+      /* Firstly, try to move data with the widest possible mode.
+	 Remaining part we'll move using Pmode and narrower modes.  */
+      if (TARGET_SSE)
 	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset + 8);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    gcc_unreachable ();
-	  offset += 16;
+	  if (max_size >= GET_MODE_SIZE (sse_mode))
+	    move_mode = sse_mode;
+	  else if (max_size >= GET_MODE_SIZE (DImode))
+	    move_mode = DImode;
+	  if (!promoted_to_vector_value
+	      || !VECTOR_MODE_P (GET_MODE (promoted_to_vector_value)))
+	    promoted_to_vector_value = NULL_RTX;
 	}
-      if ((countval & 0x08) && max_size > 8)
+
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    {
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset + 4);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  offset += 8;
+	  if (GET_MODE (destmem) != move_mode)
+	    destmem = adjust_automodify_address_nv (destmem, move_mode,
+						    destptr, offset);
+	  if (!promoted_to_vector_value)
+	    promoted_to_vector_value =
+	      expand_vector_broadcast_of_byte_value (move_mode, value);
+	  emit_strset (destmem, promoted_to_vector_value, destptr,
+		       move_mode, offset);
+
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
+	}
+
+      /* Move the remaining part of epilogue - its size might be
+	 a size of the widest mode.  */
+      move_mode = Pmode;
+      promoted_value = NULL_RTX;
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
+	{
+	  if (!promoted_value)
+	    promoted_value = promote_duplicated_reg (move_mode, value);
+	  emit_strset (destmem, promoted_value, destptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
 	}
-      if ((countval & 0x04) && max_size > 4)
+
+      if (!promoted_value)
+	promoted_value = promote_duplicated_reg (move_mode, value);
+      if (remainder_size >= 4)
 	{
-	  dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
+	  emit_strset (destmem, gen_lowpart (SImode, promoted_value), destptr,
+		       SImode, offset);
 	  offset += 4;
+	  remainder_size -= 4;
 	}
-      if ((countval & 0x02) && max_size > 2)
+      if (remainder_size >= 2)
 	{
-	  dest = adjust_automodify_address_nv (destmem, HImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
-	  offset += 2;
+	  emit_strset (destmem, gen_lowpart (HImode, promoted_value), destptr,
+		       HImode, offset);
+	  offset +=2;
+	  remainder_size -= 2;
 	}
-      if ((countval & 0x01) && max_size > 1)
+      if (remainder_size >= 1)
 	{
-	  dest = adjust_automodify_address_nv (destmem, QImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
+	  emit_strset (destmem, gen_lowpart (QImode, promoted_value), destptr,
+		       QImode, offset);
 	  offset += 1;
+	  remainder_size -= 1;
 	}
+      gcc_assert (remainder_size == 0);
       return;
     }
+
+  /* count isn't const.  */
   if (max_size > 32)
     {
-      expand_setmem_epilogue_via_loop (destmem, destptr, value, count, max_size);
+      expand_setmem_epilogue_via_loop (destmem, destptr, value, count,
+				       max_size);
       return;
     }
+
+  if (!promoted_to_gpr_value)
+    promoted_to_gpr_value = promote_duplicated_reg_to_size (value,
+						   GET_MODE_SIZE (Pmode),
+						   GET_MODE_SIZE (Pmode),
+						   GET_MODE_SIZE (Pmode));
+
   if (max_size > 16)
     {
       rtx label = ix86_expand_aligntest (count, 16, true);
-      if (TARGET_64BIT)
+      if (TARGET_SSE && promoted_to_vector_value)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem,
+				    GET_MODE (promoted_to_vector_value),
+				    destptr);
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_vector_value));
+	}
+      else if (TARGET_64BIT)
+	{
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
@@ -21154,16 +21397,22 @@ expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_
   if (max_size > 8)
     {
       rtx label = ix86_expand_aligntest (count, 8, true);
-      if (TARGET_64BIT)
+      if (TARGET_SSE && promoted_to_vector_value)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem,
+				 gen_lowpart (DImode, promoted_to_vector_value)));
+	}
+      else if (TARGET_64BIT)
+	{
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
@@ -21171,24 +21420,27 @@ expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_
   if (max_size > 4)
     {
       rtx label = ix86_expand_aligntest (count, 4, true);
-      dest = change_address (destmem, SImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
+      destmem = change_address (destmem, SImode, destptr);
+      emit_insn (gen_strset (destptr, destmem,
+			     gen_lowpart (SImode, promoted_to_gpr_value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
   if (max_size > 2)
     {
       rtx label = ix86_expand_aligntest (count, 2, true);
-      dest = change_address (destmem, HImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
+      destmem = change_address (destmem, HImode, destptr);
+      emit_insn (gen_strset (destptr, destmem,
+			     gen_lowpart (HImode, promoted_to_gpr_value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
   if (max_size > 1)
     {
       rtx label = ix86_expand_aligntest (count, 1, true);
-      dest = change_address (destmem, QImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
+      destmem = change_address (destmem, QImode, destptr);
+      emit_insn (gen_strset (destptr, destmem,
+			     gen_lowpart (QImode, promoted_to_gpr_value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
@@ -21204,8 +21456,8 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
   if (align <= 1 && desired_alignment > 1)
     {
       rtx label = ix86_expand_aligntest (destptr, 1, false);
-      srcmem = change_address (srcmem, QImode, srcptr);
-      destmem = change_address (destmem, QImode, destptr);
+      srcmem = adjust_automodify_address_nv (srcmem, QImode, srcptr, 0);
+      destmem = adjust_automodify_address_nv (destmem, QImode, destptr, 0);
       emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
       ix86_adjust_counter (count, 1);
       emit_label (label);
@@ -21214,8 +21466,8 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
   if (align <= 2 && desired_alignment > 2)
     {
       rtx label = ix86_expand_aligntest (destptr, 2, false);
-      srcmem = change_address (srcmem, HImode, srcptr);
-      destmem = change_address (destmem, HImode, destptr);
+      srcmem = adjust_automodify_address_nv (srcmem, HImode, srcptr, 0);
+      destmem = adjust_automodify_address_nv (destmem, HImode, destptr, 0);
       emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
       ix86_adjust_counter (count, 2);
       emit_label (label);
@@ -21224,14 +21476,34 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
   if (align <= 4 && desired_alignment > 4)
     {
       rtx label = ix86_expand_aligntest (destptr, 4, false);
-      srcmem = change_address (srcmem, SImode, srcptr);
-      destmem = change_address (destmem, SImode, destptr);
+      srcmem = adjust_automodify_address_nv (srcmem, SImode, srcptr, 0);
+      destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
       emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
       ix86_adjust_counter (count, 4);
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
-  gcc_assert (desired_alignment <= 8);
+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  srcmem = adjust_automodify_address_nv (srcmem, DImode, srcptr, 0);
+	  destmem = adjust_automodify_address_nv (destmem, DImode, destptr, 0);
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	}
+      else
+	{
+	  srcmem = adjust_automodify_address_nv (srcmem, SImode, srcptr, 0);
+	  destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	}
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);
 }
 
 /* Copy enough from DST to SRC to align DST known to DESIRED_ALIGN.
@@ -21286,6 +21558,37 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
       off = 4;
       emit_insn (gen_strmov (destreg, dst, srcreg, src));
     }
+  if (align_bytes & 8)
+    {
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, DImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      else
+	{
+	  dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, SImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
+	set_mem_align (dst, 8 * BITS_PER_UNIT);
+      if (src_align_bytes >= 0)
+	{
+	  unsigned int src_align = 0;
+	  if ((src_align_bytes & 7) == (align_bytes & 7))
+	    src_align = 8;
+	  else if ((src_align_bytes & 3) == (align_bytes & 3))
+	    src_align = 4;
+	  else if ((src_align_bytes & 1) == (align_bytes & 1))
+	    src_align = 2;
+	  if (MEM_ALIGN (src) < src_align * BITS_PER_UNIT)
+	    set_mem_align (src, src_align * BITS_PER_UNIT);
+	}
+      off = 8;
+    }
   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
   src = adjust_automodify_address_nv (src, BLKmode, srcreg, off);
   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
@@ -21293,7 +21596,9 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
   if (src_align_bytes >= 0)
     {
       unsigned int src_align = 0;
-      if ((src_align_bytes & 7) == (align_bytes & 7))
+      if ((src_align_bytes & 15) == (align_bytes & 15))
+	src_align = 16;
+      else if ((src_align_bytes & 7) == (align_bytes & 7))
 	src_align = 8;
       else if ((src_align_bytes & 3) == (align_bytes & 3))
 	src_align = 4;
@@ -21321,7 +21626,7 @@ expand_setmem_prologue (rtx destmem, rtx destptr, rtx value, rtx count,
   if (align <= 1 && desired_alignment > 1)
     {
       rtx label = ix86_expand_aligntest (destptr, 1, false);
-      destmem = change_address (destmem, QImode, destptr);
+      destmem = adjust_automodify_address_nv (destmem, QImode, destptr, 0);
       emit_insn (gen_strset (destptr, destmem, gen_lowpart (QImode, value)));
       ix86_adjust_counter (count, 1);
       emit_label (label);
@@ -21330,7 +21635,7 @@ expand_setmem_prologue (rtx destmem, rtx destptr, rtx value, rtx count,
   if (align <= 2 && desired_alignment > 2)
     {
       rtx label = ix86_expand_aligntest (destptr, 2, false);
-      destmem = change_address (destmem, HImode, destptr);
+      destmem = adjust_automodify_address_nv (destmem, HImode, destptr, 0);
       emit_insn (gen_strset (destptr, destmem, gen_lowpart (HImode, value)));
       ix86_adjust_counter (count, 2);
       emit_label (label);
@@ -21339,13 +21644,23 @@ expand_setmem_prologue (rtx destmem, rtx destptr, rtx value, rtx count,
   if (align <= 4 && desired_alignment > 4)
     {
       rtx label = ix86_expand_aligntest (destptr, 4, false);
-      destmem = change_address (destmem, SImode, destptr);
+      destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
       emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
       ix86_adjust_counter (count, 4);
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
-  gcc_assert (desired_alignment <= 8);
+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);
 }
 
 /* Set enough from DST to align DST known to by aligned by ALIGN to
@@ -21381,6 +21696,19 @@ expand_constant_setmem_prologue (rtx dst, rtx destreg, rtx value,
       emit_insn (gen_strset (destreg, dst,
 			     gen_lowpart (SImode, value)));
     }
+  if (align_bytes & 8)
+    {
+      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+      emit_insn (gen_strset (destreg, dst,
+	    gen_lowpart (SImode, value)));
+      off = 4;
+      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+      emit_insn (gen_strset (destreg, dst,
+	    gen_lowpart (SImode, value)));
+      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
+	set_mem_align (dst, 8 * BITS_PER_UNIT);
+      off = 4;
+    }
   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
     set_mem_align (dst, desired_align * BITS_PER_UNIT);
@@ -21392,7 +21720,7 @@ expand_constant_setmem_prologue (rtx dst, rtx destreg, rtx value,
 /* Given COUNT and EXPECTED_SIZE, decide on codegen of string operation.  */
 static enum stringop_alg
 decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
-	    int *dynamic_check)
+	    int *dynamic_check, bool align_unknown)
 {
   const struct stringop_algs * algs;
   bool optimize_for_speed;
@@ -21401,7 +21729,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
      consider such algorithms if the user has appropriated those
      registers for their own purposes.	*/
   bool rep_prefix_usable = !(fixed_regs[CX_REG] || fixed_regs[DI_REG]
-                             || (memset
+			     || (memset
 				 ? fixed_regs[AX_REG] : fixed_regs[SI_REG]));
 
 #define ALG_USABLE_P(alg) (rep_prefix_usable			\
@@ -21414,7 +21742,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
      of time processing large blocks.  */
   if (optimize_function_for_size_p (cfun)
       || (optimize_insn_for_size_p ()
-          && expected_size != -1 && expected_size < 256))
+	  && expected_size != -1 && expected_size < 256))
     optimize_for_speed = false;
   else
     optimize_for_speed = true;
@@ -21423,9 +21751,9 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
 
   *dynamic_check = -1;
   if (memset)
-    algs = &cost->memset[TARGET_64BIT != 0];
+    algs = &cost->memset[align_unknown][TARGET_64BIT != 0];
   else
-    algs = &cost->memcpy[TARGET_64BIT != 0];
+    algs = &cost->memcpy[align_unknown][TARGET_64BIT != 0];
   if (ix86_stringop_alg != no_stringop && ALG_USABLE_P (ix86_stringop_alg))
     return ix86_stringop_alg;
   /* rep; movq or rep; movl is the smallest variant.  */
@@ -21489,29 +21817,33 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
       enum stringop_alg alg;
       int i;
       bool any_alg_usable_p = true;
+      bool only_libcall_fits = true;
 
       for (i = 0; i < MAX_STRINGOP_ALGS; i++)
-        {
-          enum stringop_alg candidate = algs->size[i].alg;
-          any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
+	{
+	  enum stringop_alg candidate = algs->size[i].alg;
+	  any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
 
-          if (candidate != libcall && candidate
-              && ALG_USABLE_P (candidate))
-              max = algs->size[i].max;
-        }
+	  if (candidate != libcall && candidate
+	      && ALG_USABLE_P (candidate))
+	    {
+	      max = algs->size[i].max;
+	      only_libcall_fits = false;
+	    }
+	}
       /* If there aren't any usable algorithms, then recursing on
-         smaller sizes isn't going to find anything.  Just return the
-         simple byte-at-a-time copy loop.  */
-      if (!any_alg_usable_p)
-        {
-          /* Pick something reasonable.  */
-          if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
-            *dynamic_check = 128;
-          return loop_1_byte;
-        }
+	 smaller sizes isn't going to find anything.  Just return the
+	 simple byte-at-a-time copy loop.  */
+      if (!any_alg_usable_p || only_libcall_fits)
+	{
+	  /* Pick something reasonable.  */
+	  if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
+	    *dynamic_check = 128;
+	  return loop_1_byte;
+	}
       if (max == -1)
 	max = 4096;
-      alg = decide_alg (count, max / 2, memset, dynamic_check);
+      alg = decide_alg (count, max / 2, memset, dynamic_check, align_unknown);
       gcc_assert (*dynamic_check == -1);
       gcc_assert (alg != libcall);
       if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
@@ -21535,9 +21867,11 @@ decide_alignment (int align,
       case no_stringop:
 	gcc_unreachable ();
       case loop:
-      case unrolled_loop:
 	desired_align = GET_MODE_SIZE (Pmode);
 	break;
+      case unrolled_loop:
+	desired_align = GET_MODE_SIZE (TARGET_SSE ? V4SImode : Pmode);
+	break;
       case rep_prefix_8_byte:
 	desired_align = 8;
 	break;
@@ -21625,6 +21959,11 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
   enum stringop_alg alg;
   int dynamic_check;
   bool need_zero_guard = false;
+  bool align_unknown;
+  int unroll_factor;
+  enum machine_mode move_mode;
+  rtx loop_iter = NULL_RTX;
+  int dst_offset, src_offset;
 
   if (CONST_INT_P (align_exp))
     align = INTVAL (align_exp);
@@ -21648,9 +21987,17 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 
   /* Step 0: Decide on preferred algorithm, desired alignment and
      size of chunks to be copied by main loop.  */
-
-  alg = decide_alg (count, expected_size, false, &dynamic_check);
+  dst_offset = get_mem_align_offset (dst, MOVE_MAX * BITS_PER_UNIT);
+  src_offset = get_mem_align_offset (src, MOVE_MAX * BITS_PER_UNIT);
+  align_unknown = (dst_offset < 0
+		   || src_offset < 0
+		   || src_offset != dst_offset);
+  alg = decide_alg (count, expected_size, false, &dynamic_check, align_unknown);
   desired_align = decide_alignment (align, alg, expected_size);
+  if (align_unknown)
+    desired_align = align;
+  unroll_factor = 1;
+  move_mode = Pmode;
 
   if (!TARGET_ALIGN_STRINGOPS)
     align = desired_align;
@@ -21669,11 +22016,16 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
       gcc_unreachable ();
     case loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode);
+      move_mode = Pmode;
+      unroll_factor = 1;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * (TARGET_64BIT ? 4 : 2);
+      /* Use SSE instructions, if possible.  */
+      move_mode = TARGET_SSE ? (align_unknown ? DImode : V4SImode) : Pmode;
+      unroll_factor = 4;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
       size_needed = 8;
@@ -21785,6 +22137,8 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 	  dst = change_address (dst, BLKmode, destreg);
 	  expand_movmem_prologue (dst, src, destreg, srcreg, count_exp, align,
 				  desired_align);
+	  set_mem_align (src, desired_align * BITS_PER_UNIT);
+	  set_mem_align (dst, desired_align * BITS_PER_UNIT);
 	}
       else
 	{
@@ -21842,11 +22196,14 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 				     count_exp, Pmode, 1, expected_size);
       break;
     case unrolled_loop:
-      /* Unroll only by factor of 2 in 32bit mode, since we don't have enough
-	 registers for 4 temporaries anyway.  */
-      expand_set_or_movmem_via_loop (dst, src, destreg, srcreg, NULL,
-				     count_exp, Pmode, TARGET_64BIT ? 4 : 2,
-				     expected_size);
+      /* In some cases we want to use the same iterator in several adjacent
+	 loops, so here we save loop iterator rtx and don't update addresses.  */
+      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
+							   srcreg, NULL,
+							   count_exp, NULL_RTX,
+							   move_mode,
+							   unroll_factor,
+							   expected_size, false);
       break;
     case rep_prefix_8_byte:
       expand_movmem_via_rep_mov (dst, src, destreg, srcreg, count_exp,
@@ -21897,9 +22254,41 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
       LABEL_NUSES (label) = 1;
     }
 
+  /* We haven't updated addresses, so we'll do that now.
+     Also, if the epilogue appears to be big, we generate a (non-unrolled)
+     loop in it.  We do this only when the alignment is unknown, because in
+     that case the epilogue would have to copy byte by byte, which is very
+     slow.  */
+  if (alg == unrolled_loop)
+    {
+      rtx tmp;
+      if (align_unknown && unroll_factor > 1)
+	{
+	  /* Reduce the epilogue's size by emitting a non-unrolled loop here.
+	     Without this the epilogue can get very big: when the alignment is
+	     statically unknown it would copy byte by byte, which is very slow.  */
+	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
+	      srcreg, NULL, count_exp,
+	      loop_iter, move_mode, 1,
+	      expected_size, false);
+	  src = change_address (src, BLKmode, srcreg);
+	  dst = change_address (dst, BLKmode, destreg);
+	  epilogue_size_needed = GET_MODE_SIZE (move_mode);
+	}
+      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != destreg)
+	emit_move_insn (destreg, tmp);
+
+      tmp = expand_simple_binop (Pmode, PLUS, srcreg, loop_iter, srcreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != srcreg)
+	emit_move_insn (srcreg, tmp);
+    }
   if (count_exp != const0_rtx && epilogue_size_needed > 1)
     expand_movmem_epilogue (dst, src, destreg, srcreg, count_exp,
 			    epilogue_size_needed);
+
   if (jump_around_label)
     emit_label (jump_around_label);
   return true;
@@ -21917,7 +22306,37 @@ promote_duplicated_reg (enum machine_mode mode, rtx val)
   rtx tmp;
   int nops = mode == DImode ? 3 : 2;
 
+  if (VECTOR_MODE_P (mode))
+    {
+      enum machine_mode inner = GET_MODE_INNER (mode);
+      rtx promoted_val, vec_reg;
+      if (CONST_INT_P (val))
+	return ix86_build_const_vector (mode, true, val);
+
+      promoted_val = promote_duplicated_reg (inner, val);
+      vec_reg = gen_reg_rtx (mode);
+      switch (mode)
+	{
+	case V2DImode:
+	  emit_insn (gen_vec_dupv2di (vec_reg, promoted_val));
+	  break;
+	case V4SImode:
+	  emit_insn (gen_vec_dupv4si (vec_reg, promoted_val));
+	  break;
+	default:
+	  gcc_unreachable ();
+	  break;
+	}
+
+      return vec_reg;
+    }
   gcc_assert (mode == SImode || mode == DImode);
+  if (mode == DImode && !TARGET_64BIT)
+    {
+      rtx vec_reg = promote_duplicated_reg (V4SImode, val);
+      vec_reg = convert_to_mode (V2DImode, vec_reg, 1);
+      return vec_reg;
+    }
   if (val == const0_rtx)
     return copy_to_mode_reg (mode, const0_rtx);
   if (CONST_INT_P (val))
@@ -21983,11 +22402,21 @@ promote_duplicated_reg (enum machine_mode mode, rtx val)
 static rtx
 promote_duplicated_reg_to_size (rtx val, int size_needed, int desired_align, int align)
 {
-  rtx promoted_val;
+  rtx promoted_val = NULL_RTX;
 
-  if (TARGET_64BIT
-      && (size_needed > 4 || (desired_align > align && desired_align > 4)))
-    promoted_val = promote_duplicated_reg (DImode, val);
+  if (size_needed > 8 || (desired_align > align && desired_align > 8))
+    {
+      gcc_assert (TARGET_SSE);
+      if (TARGET_64BIT)
+	promoted_val = promote_duplicated_reg (V2DImode, val);
+      else
+	promoted_val = promote_duplicated_reg (V4SImode, val);
+    }
+  else if (size_needed > 4 || (desired_align > align && desired_align > 4))
+    {
+      gcc_assert (TARGET_64BIT || TARGET_SSE);
+      promoted_val = promote_duplicated_reg (DImode, val);
+    }
   else if (size_needed > 2 || (desired_align > align && desired_align > 2))
     promoted_val = promote_duplicated_reg (SImode, val);
   else if (size_needed > 1 || (desired_align > align && desired_align > 1))
@@ -22015,10 +22444,14 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
   int size_needed = 0, epilogue_size_needed;
   int desired_align = 0, align_bytes = 0;
   enum stringop_alg alg;
-  rtx promoted_val = NULL;
-  bool force_loopy_epilogue = false;
+  rtx gpr_promoted_val = NULL;
+  rtx vec_promoted_val = NULL;
   int dynamic_check;
   bool need_zero_guard = false;
+  bool align_unknown;
+  unsigned int unroll_factor;
+  enum machine_mode move_mode;
+  rtx loop_iter = NULL_RTX;
 
   if (CONST_INT_P (align_exp))
     align = INTVAL (align_exp);
@@ -22038,8 +22471,11 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
   /* Step 0: Decide on preferred algorithm, desired alignment and
      size of chunks to be copied by main loop.  */
 
-  alg = decide_alg (count, expected_size, true, &dynamic_check);
+  align_unknown = get_mem_align_offset (dst, BITS_PER_UNIT) < 0;
+  alg = decide_alg (count, expected_size, true, &dynamic_check, align_unknown);
   desired_align = decide_alignment (align, alg, expected_size);
+  unroll_factor = 1;
+  move_mode = Pmode;
 
   if (!TARGET_ALIGN_STRINGOPS)
     align = desired_align;
@@ -22057,11 +22493,21 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       gcc_unreachable ();
     case loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode);
+      move_mode = Pmode;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * 4;
+      /* Use SSE instructions, if possible.  */
+      move_mode = TARGET_SSE
+		  ? (TARGET_64BIT ? V2DImode : V4SImode)
+		  : Pmode;
+      unroll_factor = 1;
+      /* Select the maximal available unroll factor: 1, 2 or 4.  */
+      while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
+	     && unroll_factor < 4)
+	unroll_factor *= 2;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
       size_needed = 8;
@@ -22106,8 +22552,10 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
      main loop and epilogue (ie one load of the big constant in the
      front of all code.  */
   if (CONST_INT_P (val_exp))
-    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
-						   desired_align, align);
+    gpr_promoted_val = promote_duplicated_reg_to_size (val_exp,
+						   GET_MODE_SIZE (Pmode),
+						   GET_MODE_SIZE (Pmode),
+						   align);
   /* Ensure that alignment prologue won't copy past end of block.  */
   if (size_needed > 1 || (desired_align > 1 && desired_align > align))
     {
@@ -22116,12 +22564,6 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 	 Make sure it is power of 2.  */
       epilogue_size_needed = smallest_pow2_greater_than (epilogue_size_needed);
 
-      /* To improve performance of small blocks, we jump around the VAL
-	 promoting mode.  This mean that if the promoted VAL is not constant,
-	 we might not use it in the epilogue and have to use byte
-	 loop variant.  */
-      if (epilogue_size_needed > 2 && !promoted_val)
-        force_loopy_epilogue = true;
       if (count)
 	{
 	  if (count < (unsigned HOST_WIDE_INT)epilogue_size_needed)
@@ -22161,9 +22603,11 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
   /* Step 2: Alignment prologue.  */
 
   /* Do the expensive promotion once we branched off the small blocks.  */
-  if (!promoted_val)
-    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
-						   desired_align, align);
+  if (!gpr_promoted_val)
+    gpr_promoted_val = promote_duplicated_reg_to_size (val_exp,
+						   GET_MODE_SIZE (Pmode),
+						   GET_MODE_SIZE (Pmode),
+						   align);
   gcc_assert (desired_align >= 1 && align >= 1);
 
   if (desired_align > align)
@@ -22175,17 +22619,20 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 	     the pain to maintain it for the first move, so throw away
 	     the info early.  */
 	  dst = change_address (dst, BLKmode, destreg);
-	  expand_setmem_prologue (dst, destreg, promoted_val, count_exp, align,
+	  expand_setmem_prologue (dst, destreg, gpr_promoted_val, count_exp, align,
 				  desired_align);
+	  set_mem_align (dst, desired_align * BITS_PER_UNIT);
 	}
       else
 	{
 	  /* If we know how many bytes need to be stored before dst is
 	     sufficiently aligned, maintain aliasing info accurately.  */
-	  dst = expand_constant_setmem_prologue (dst, destreg, promoted_val,
+	  dst = expand_constant_setmem_prologue (dst, destreg, gpr_promoted_val,
 						 desired_align, align_bytes);
 	  count_exp = plus_constant (count_exp, -align_bytes);
 	  count -= align_bytes;
+	  if (count < (unsigned HOST_WIDE_INT) size_needed)
+	    goto epilogue;
 	}
       if (need_zero_guard
 	  && (count < (unsigned HOST_WIDE_INT) size_needed
@@ -22213,7 +22660,7 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       emit_label (label);
       LABEL_NUSES (label) = 1;
       label = NULL;
-      promoted_val = val_exp;
+      gpr_promoted_val = val_exp;
       epilogue_size_needed = 1;
     }
   else if (label == NULL_RTX)
@@ -22227,27 +22674,34 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
     case no_stringop:
       gcc_unreachable ();
     case loop_1_byte:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
+      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, val_exp,
 				     count_exp, QImode, 1, expected_size);
       break;
     case loop:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
+      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, gpr_promoted_val,
 				     count_exp, Pmode, 1, expected_size);
       break;
     case unrolled_loop:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
-				     count_exp, Pmode, 4, expected_size);
+      vec_promoted_val
+	= promote_duplicated_reg_to_size (gpr_promoted_val,
+					  GET_MODE_SIZE (move_mode),
+					  desired_align, align);
+      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+				     NULL, vec_promoted_val, count_exp,
+				     NULL_RTX, move_mode, unroll_factor,
+				     expected_size, false);
       break;
     case rep_prefix_8_byte:
-      expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
+      gcc_assert (TARGET_64BIT);
+      expand_setmem_via_rep_stos (dst, destreg, gpr_promoted_val, count_exp,
 				  DImode, val_exp);
       break;
     case rep_prefix_4_byte:
-      expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
+      expand_setmem_via_rep_stos (dst, destreg, gpr_promoted_val, count_exp,
 				  SImode, val_exp);
       break;
     case rep_prefix_1_byte:
-      expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
+      expand_setmem_via_rep_stos (dst, destreg, gpr_promoted_val, count_exp,
 				  QImode, val_exp);
       break;
     }
@@ -22280,15 +22734,29 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       LABEL_NUSES (label) = 1;
     }
  epilogue:
-  if (count_exp != const0_rtx && epilogue_size_needed > 1)
+  if (alg == unrolled_loop)
     {
-      if (force_loopy_epilogue)
-	expand_setmem_epilogue_via_loop (dst, destreg, val_exp, count_exp,
-					 epilogue_size_needed);
-      else
-	expand_setmem_epilogue (dst, destreg, promoted_val, count_exp,
-				epilogue_size_needed);
+      rtx tmp;
+      if (align_unknown && unroll_factor > 1)
+	{
+	  /* Reduce the epilogue's size by emitting a non-unrolled loop here.
+	     Without this the epilogue can get very big: when the alignment is
+	     statically unknown it would store byte by byte, which is very slow.  */
+	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+	      NULL, vec_promoted_val, count_exp,
+	      loop_iter, move_mode, 1,
+	      expected_size, false);
+	  dst = change_address (dst, BLKmode, destreg);
+	  epilogue_size_needed = GET_MODE_SIZE (move_mode);
+	}
+      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != destreg)
+	emit_move_insn (destreg, tmp);
     }
+  if (count_exp != const0_rtx && epilogue_size_needed > 1)
+    expand_setmem_epilogue (dst, destreg, vec_promoted_val, gpr_promoted_val,
+			    val_exp, count_exp, epilogue_size_needed);
   if (jump_around_label)
     emit_label (jump_around_label);
   return true;
@@ -37436,6 +37904,33 @@ ix86_autovectorize_vector_sizes (void)
   return (TARGET_AVX && !TARGET_PREFER_AVX128) ? 32 | 16 : 0;
 }
 
+/* Target hook.  Prevent unaligned access to data in vector modes.  */
+
+static bool
+ix86_slow_unaligned_access (enum machine_mode mode,
+			    unsigned int align)
+{
+  if (TARGET_AVX)
+    {
+      if (GET_MODE_SIZE (mode) == 32)
+	{
+	  if (align <= 16)
+	    return (TARGET_AVX256_SPLIT_UNALIGNED_LOAD
+		    || TARGET_AVX256_SPLIT_UNALIGNED_STORE);
+	  else
+	    return false;
+	}
+    }
+
+  if (GET_MODE_SIZE (mode) > 8)
+    {
+      return (! TARGET_SSE_UNALIGNED_LOAD_OPTIMAL
+	      && ! TARGET_SSE_UNALIGNED_STORE_OPTIMAL);
+    }
+
+  return false;
+}
+
 /* Initialize the GCC target structure.  */
 #undef TARGET_RETURN_IN_MEMORY
 #define TARGET_RETURN_IN_MEMORY ix86_return_in_memory
@@ -37743,6 +38238,9 @@ ix86_autovectorize_vector_sizes (void)
 #undef TARGET_CONDITIONAL_REGISTER_USAGE
 #define TARGET_CONDITIONAL_REGISTER_USAGE ix86_conditional_register_usage
 
+#undef TARGET_SLOW_UNALIGNED_ACCESS
+#define TARGET_SLOW_UNALIGNED_ACCESS ix86_slow_unaligned_access
+
 #if TARGET_MACHO
 #undef TARGET_INIT_LIBFUNCS
 #define TARGET_INIT_LIBFUNCS darwin_rename_builtins
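
To summarize the central decision in the ix86_expand_movmem changes above: the
unrolled_loop strategy now derives its chunk size from a vector move mode
rather than from Pmode.  A condensed sketch, using the names from the patch
and assuming TARGET_SSE (illustration only, not additional patch text):

    /* With SSE, copy 16 bytes per unrolled step when the alignment is known
       at compile time; fall back to DImode chunks when it is not.  */
    move_mode = TARGET_SSE ? (align_unknown ? DImode : V4SImode) : Pmode;
    unroll_factor = 4;
    size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;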
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index bd69ec2..550b2ab 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -159,8 +159,12 @@ struct processor_costs {
   const int fchs;		/* cost of FCHS instruction.  */
   const int fsqrt;		/* cost of FSQRT instruction.  */
 				/* Specify what algorithm
-				   to use for stringops on unknown size.  */
-  struct stringop_algs memcpy[2], memset[2];
+				   to use for stringops on unknown size.
+				   First index is used to specify whether
+				   alignment is known or not.
+				   Second - to specify whether 32 or 64 bits
+				   are used.  */
+  struct stringop_algs memcpy[2][2], memset[2][2];
   const int scalar_stmt_cost;   /* Cost of any scalar operation, excluding
 				   load and store.  */
   const int scalar_load_cost;   /* Cost of scalar load.  */
@@ -1712,7 +1716,7 @@ typedef struct ix86_args {
 /* If a clear memory operation would take CLEAR_RATIO or more simple
    move-instruction sequences, we will do a clrmem or libcall instead.  */
 
-#define CLEAR_RATIO(speed) ((speed) ? MIN (6, ix86_cost->move_ratio) : 2)
+#define CLEAR_RATIO(speed) ((speed) ? ix86_cost->move_ratio : 2)
 
 /* Define if shifts truncate the shift count which implies one can
    omit a sign-extension or zero-extension of a shift count.
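
With the extra dimension, each cost table now carries separate algorithm
choices for the aligned and unaligned cases.  A minimal indexing sketch,
mirroring the decide_alg change earlier in this patch (the memset flag is the
existing decide_alg parameter):

    /* First index: alignment unknown at compile time.
       Second index: 64-bit target.  */
    if (memset)
      algs = &cost->memset[align_unknown][TARGET_64BIT != 0];
    else
      algs = &cost->memcpy[align_unknown][TARGET_64BIT != 0];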
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 9c9508d..bd38e48 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -15905,6 +15905,17 @@
 	      (clobber (reg:CC FLAGS_REG))])]
   ""
 {
+  rtx vec_reg;
+  enum machine_mode mode = GET_MODE (operands[2]);
+  if (vector_extensions_used_for_mode (mode)
+      && CONSTANT_P (operands[2]))
+    {
+      if (mode == DImode)
+	mode = TARGET_64BIT ? V2DImode : V4SImode;
+      vec_reg = gen_reg_rtx (mode);
+      emit_move_insn (vec_reg, operands[2]);
+      operands[2] = vec_reg;
+    }
   if (GET_MODE (operands[1]) != GET_MODE (operands[2]))
     operands[1] = adjust_address_nv (operands[1], GET_MODE (operands[2]), 0);
 
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index ff77003..b8ecc59 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -7426,6 +7426,13 @@
    (set_attr "prefix" "maybe_vex,maybe_vex,orig,orig,vex")
    (set_attr "mode" "TI,TI,V4SF,SF,SF")])
 
+(define_expand "sse2_loadq"
+ [(set (match_operand:V2DI 0 "register_operand")
+       (vec_concat:V2DI
+	 (match_operand:DI 1 "memory_operand")
+	 (const_int 0)))]
+  "!TARGET_64BIT && TARGET_SSE2")
+
 (define_insn_and_split "sse2_stored"
   [(set (match_operand:SI 0 "nonimmediate_operand" "=xm,r")
 	(vec_select:SI
@@ -7537,6 +7544,16 @@
    (set_attr "prefix" "maybe_vex,orig,vex,maybe_vex,orig,orig")
    (set_attr "mode" "V2SF,TI,TI,TI,V4SF,V2SF")])
 
+(define_expand "vec_dupv4si"
+  [(set (match_operand:V4SI 0 "register_operand" "")
+	(vec_duplicate:V4SI
+	  (match_operand:SI 1 "nonimmediate_operand" "")))]
+  "TARGET_SSE"
+{
+  if (!TARGET_AVX)
+    operands[1] = force_reg (V4SImode, operands[1]);
+})
+
 (define_insn "*vec_dupv4si_avx"
   [(set (match_operand:V4SI 0 "register_operand"     "=x,x")
 	(vec_duplicate:V4SI
@@ -7578,6 +7595,16 @@
    (set_attr "prefix" "orig,vex,maybe_vex")
    (set_attr "mode" "TI,TI,DF")])
 
+(define_expand "vec_dupv2di"
+  [(set (match_operand:V2DI 0 "register_operand" "")
+	(vec_duplicate:V2DI
+	  (match_operand:DI 1 "nonimmediate_operand" "")))]
+  "TARGET_SSE"
+{
+  if (!TARGET_AVX)
+    operands[1] = force_reg (V2DImode, operands[1]);
+})
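
These expanders exist so that promote_duplicated_reg can splat the promoted
memset value into a vector register.  A minimal usage sketch, assuming SSE2
and an already promoted SImode value (illustration only):

    /* Broadcast PROMOTED_VAL into all four SImode lanes, as
       promote_duplicated_reg does in i386.c above.  */
    rtx vec_reg = gen_reg_rtx (V4SImode);
    emit_insn (gen_vec_dupv4si (vec_reg, promoted_val));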
+
 (define_insn "*vec_dupv2di"
   [(set (match_operand:V2DI 0 "register_operand" "=x,x")
 	(vec_duplicate:V2DI
diff --git a/gcc/cse.c b/gcc/cse.c
index ae67685..3b6471d 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -4616,7 +4616,10 @@ cse_insn (rtx insn)
 		 to fold switch statements when an ADDR_DIFF_VEC is used.  */
 	      || (GET_CODE (src_folded) == MINUS
 		  && GET_CODE (XEXP (src_folded, 0)) == LABEL_REF
-		  && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF)))
+		  && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF))
+	      /* Don't propagate vector-constants, as for now no architecture
+		 supports vector immediates.  */
+	  && !vector_extensions_used_for_mode (mode))
 	src_const = src_folded, src_const_elt = elt;
       else if (src_const == 0 && src_eqv_here && CONSTANT_P (src_eqv_here))
 	src_const = src_eqv_here, src_const_elt = src_eqv_elt;
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 90cef1c..844ed17 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5780,6 +5780,25 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum machine_mode @var{mode}, unsigned int @var{align})
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options require building faster code, the hook
+can be used to prevent access to unaligned data in some set of modes even
+if the processor can do the access without trapping.
+
+By default the hook returns the value of the @code{SLOW_UNALIGNED_ACCESS}
+macro if it is defined, and the value of @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
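
For illustration only, a minimal target implementation of this hook could
have the shape sketched below; the ix86_slow_unaligned_access definition
earlier in this patch adds AVX- and SSE-specific checks on top of it.  The
function name here is hypothetical:

    static bool
    example_slow_unaligned_access (enum machine_mode mode, unsigned int align)
    {
      /* Treat accesses less aligned than MODE requires as slow whenever
	 the target insists on strict alignment.  */
      return STRICT_ALIGNMENT && align < GET_MODE_ALIGNMENT (mode);
    }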
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6252,23 +6271,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 187122e..c32e745 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -5718,6 +5718,25 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@hook TARGET_SLOW_UNALIGNED_ACCESS
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options require building faster code, the hook
+can be used to prevent access to unaligned data in some set of modes even
+if the processor can do the access without trapping.
+
+By default the hook returns the value of the @code{SLOW_UNALIGNED_ACCESS}
+macro if it is defined, and the value of @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6190,23 +6209,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c
index 8465237..ff568b1 100644
--- a/gcc/emit-rtl.c
+++ b/gcc/emit-rtl.c
@@ -1495,6 +1495,12 @@ get_mem_align_offset (rtx mem, unsigned int align)
       if (TYPE_ALIGN (TREE_TYPE (expr)) < (unsigned int) align)
 	return -1;
     }
+  else if (TREE_CODE (expr) == MEM_REF)
+    {
+      if (get_object_alignment_1 (expr, &offset) < align)
+	return -1;
+      offset /= BITS_PER_UNIT;
+    }
   else if (TREE_CODE (expr) == COMPONENT_REF)
     {
       while (1)
@@ -2058,7 +2064,6 @@ adjust_address_1 (rtx memref, enum machine_mode mode, HOST_WIDE_INT offset,
   enum machine_mode address_mode;
   int pbits;
   struct mem_attrs attrs, *defattrs;
-  unsigned HOST_WIDE_INT max_align;
 
   attrs = *get_mem_attrs (memref);
 
@@ -2115,8 +2120,12 @@ adjust_address_1 (rtx memref, enum machine_mode mode, HOST_WIDE_INT offset,
      if zero.  */
   if (offset != 0)
     {
-      max_align = (offset & -offset) * BITS_PER_UNIT;
-      attrs.align = MIN (attrs.align, max_align);
+      int old_offset = get_mem_align_offset (memref, MOVE_MAX * BITS_PER_UNIT);
+      if (old_offset >= 0)
+	attrs.align = compute_align_by_offset (old_offset + attrs.offset);
+      else
+	attrs.align = MIN (attrs.align,
+	      (unsigned HOST_WIDE_INT) (offset & -offset) * BITS_PER_UNIT);
     }
 
   /* We can compute the size in a number of ways.  */
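
A worked example of the fallback path in adjust_address_1 above: when the
original offset within the alignment unit is unknown, the alignment is capped
by the lowest set bit of the new offset.  For instance, assuming an offset of
12 bytes (sketch, not patch text):

    /* 12 & -12 == 4, so the adjusted reference is assumed to be only
       4-byte aligned; compute_align_by_offset in expr.c below relies on
       the same identity.  */
    unsigned int max_align_bits = (12 & -12) * BITS_PER_UNIT;   /* 32.  */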
diff --git a/gcc/expr.c b/gcc/expr.c
index b020978..4ffd0b7 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -126,15 +126,18 @@ struct store_by_pieces_d
 static unsigned HOST_WIDE_INT move_by_pieces_ninsns (unsigned HOST_WIDE_INT,
 						     unsigned int,
 						     unsigned int);
-static void move_by_pieces_1 (rtx (*) (rtx, ...), enum machine_mode,
-			      struct move_by_pieces_d *);
+static void move_by_pieces_insn (rtx (*) (rtx, ...), enum machine_mode,
+				 struct move_by_pieces_d *);
 static bool block_move_libcall_safe_for_call_parm (void);
 static bool emit_block_move_via_movmem (rtx, rtx, rtx, unsigned, unsigned, HOST_WIDE_INT);
 static tree emit_block_move_libcall_fn (int);
 static void emit_block_move_via_loop (rtx, rtx, rtx, unsigned);
 static rtx clear_by_pieces_1 (void *, HOST_WIDE_INT, enum machine_mode);
 static void clear_by_pieces (rtx, unsigned HOST_WIDE_INT, unsigned int);
+static void set_by_pieces_1 (struct store_by_pieces_d *, unsigned int);
 static void store_by_pieces_1 (struct store_by_pieces_d *, unsigned int);
+static void set_by_pieces_2 (rtx (*) (rtx, ...), enum machine_mode,
+			       struct store_by_pieces_d *, rtx);
 static void store_by_pieces_2 (rtx (*) (rtx, ...), enum machine_mode,
 			       struct store_by_pieces_d *);
 static tree clear_storage_libcall_fn (int);
@@ -163,6 +166,12 @@ static void do_tablejump (rtx, enum machine_mode, rtx, rtx, rtx);
 static rtx const_vector_from_tree (tree);
 static void write_complex_part (rtx, rtx, bool);
 
+static enum machine_mode widest_mode_for_unaligned_mov (unsigned HOST_WIDE_INT);
+static enum machine_mode widest_mode_for_aligned_mov (unsigned HOST_WIDE_INT,
+						      unsigned int);
+static enum machine_mode generate_move_with_mode (struct store_by_pieces_d *,
+					   enum machine_mode, rtx *, rtx *);
+
 /* This macro is used to determine whether move_by_pieces should be called
    to perform a structure copy.  */
 #ifndef MOVE_BY_PIECES_P
@@ -811,7 +820,7 @@ alignment_for_piecewise_move (unsigned int max_pieces, unsigned int align)
 	   tmode != VOIDmode;
 	   xmode = tmode, tmode = GET_MODE_WIDER_MODE (tmode))
 	if (GET_MODE_SIZE (tmode) > max_pieces
-	    || SLOW_UNALIGNED_ACCESS (tmode, align))
+	    || targetm.slow_unaligned_access (tmode, align))
 	  break;
 
       align = MAX (align, GET_MODE_ALIGNMENT (xmode));
@@ -820,11 +829,66 @@ alignment_for_piecewise_move (unsigned int max_pieces, unsigned int align)
   return align;
 }
 
+/* Given an offset from an alignment boundary, compute the maximal
+   alignment of the data located at that offset.  */
+unsigned int
+compute_align_by_offset (int offset)
+{
+  return (offset == 0
+	  ? BIGGEST_ALIGNMENT
+	  : MIN (BIGGEST_ALIGNMENT, (offset & -offset) * BITS_PER_UNIT));
+}
+
+/* Estimate the cost of a move of the given size and offset.  The offset
+   is used to determine the maximal usable alignment.  */
+static int
+compute_aligned_cost (unsigned HOST_WIDE_INT size, int offset)
+{
+  unsigned HOST_WIDE_INT cost = 0;
+  int cur_off = offset;
+
+  while (size > 0)
+    {
+      enum machine_mode mode
+	= widest_mode_for_aligned_mov (size, compute_align_by_offset (cur_off));
+      int cur_mode_cost;
+      enum vect_cost_for_stmt type_of_cost = vector_load;
+      if (GET_MODE_SIZE (mode) <= UNITS_PER_WORD
+	  && (SCALAR_INT_MODE_P (mode) || SCALAR_FLOAT_MODE_P (mode)))
+	type_of_cost = scalar_load;
+      cur_mode_cost
+	= targetm.vectorize.builtin_vectorization_cost (type_of_cost, NULL, 0);
+      size -= GET_MODE_SIZE (mode);
+      cur_off += GET_MODE_SIZE (mode);
+      cost += cur_mode_cost;
+    }
+  return cost;
+}
+
+/* Estimate the cost of a move of the given size.  The alignment is assumed
+   to be unknown, so unaligned moves have to be used.  */
+static int
+compute_unaligned_cost (unsigned HOST_WIDE_INT size)
+{
+  unsigned HOST_WIDE_INT cost = 0;
+  while (size > 0)
+    {
+      enum machine_mode mode = widest_mode_for_unaligned_mov (size);
+      unsigned HOST_WIDE_INT n_insns = size / GET_MODE_SIZE (mode);
+      int cur_mode_cost
+	= targetm.vectorize.builtin_vectorization_cost (unaligned_load, NULL, 0);
+
+      cost += n_insns * cur_mode_cost;
+      size %= GET_MODE_SIZE (mode);
+    }
+  return cost;
+}
+
 /* Return the widest integer mode no wider than SIZE.  If no such mode
    can be found, return VOIDmode.  */
 
 static enum machine_mode
-widest_int_mode_for_size (unsigned int size)
+widest_int_mode_for_size (unsigned HOST_WIDE_INT size)
 {
   enum machine_mode tmode, mode = VOIDmode;
 
@@ -836,6 +900,170 @@ widest_int_mode_for_size (unsigned int size)
   return mode;
 }
 
+/* If MODE is a scalar mode, find the corresponding preferred vector mode.
+   If no such mode can be found, return the vector mode corresponding to
+   Pmode (a kind of default vector mode).
+   For vector modes, return the mode itself.  */
+
+static enum machine_mode
+vector_mode_for_mode (enum machine_mode mode)
+{
+  enum machine_mode xmode;
+  if (VECTOR_MODE_P (mode))
+    return mode;
+  xmode = targetm.vectorize.preferred_simd_mode (mode);
+  if (VECTOR_MODE_P (xmode))
+    return xmode;
+
+  return targetm.vectorize.preferred_simd_mode (Pmode);
+}
+
+/* Check whether vector instructions are required for operating on the
+   specified mode.
+   For vector modes, check whether the corresponding vector extension is
+   supported.
+   Operations on a scalar mode use vector extensions if that mode is wider
+   than the default scalar mode (Pmode) and the vector extension for the
+   parent vector mode is available.  */
+
+bool vector_extensions_used_for_mode (enum machine_mode mode)
+{
+  enum machine_mode vector_mode = vector_mode_for_mode (mode);
+
+  if (VECTOR_MODE_P (mode))
+    return targetm.vector_mode_supported_p (mode);
+
+  /* mode is a scalar mode.  */
+  if (VECTOR_MODE_P (vector_mode)
+     && targetm.vector_mode_supported_p (vector_mode)
+     && (GET_MODE_SIZE (mode) > GET_MODE_SIZE (Pmode)))
+    return true;
+
+  return false;
+}
+
+/* Find the widest move mode for the given size if alignment is unknown.  */
+static enum machine_mode
+widest_mode_for_unaligned_mov (unsigned HOST_WIDE_INT size)
+{
+  enum machine_mode mode;
+  enum machine_mode tmode, xmode;
+  enum machine_mode best_simd_mode = targetm.vectorize.preferred_simd_mode (
+      mode_for_size (UNITS_PER_WORD * BITS_PER_UNIT, MODE_INT, 0));
+
+  /* Find the widest integer mode.  Here we can find modes wider than Pmode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_INT), xmode = VOIDmode;
+       tmode != VOIDmode;
+       tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size
+	  || targetm.slow_unaligned_access (tmode, BITS_PER_UNIT))
+	break;
+      if (optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.scalar_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+  mode = xmode;
+
+  /* Find the widest vector mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_VECTOR_INT), xmode = VOIDmode;
+       tmode != VOIDmode;
+       tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size
+	  || targetm.slow_unaligned_access (tmode, BITS_PER_UNIT))
+	break;
+      if (GET_MODE_SIZE (GET_MODE_INNER (tmode)) == UNITS_PER_WORD
+	  && optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.vector_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+
+  /* Choose between integer and vector modes.  */
+  if (xmode != VOIDmode && GET_MODE_SIZE (xmode) > GET_MODE_SIZE (mode))
+    mode = xmode;
+
+  /* If the vector and scalar modes found have the same size, and the vector
+     mode is BEST_SIMD_MODE, prefer the vector mode to the scalar mode.  */
+  if (xmode != VOIDmode
+      && GET_MODE_SIZE (xmode) == GET_MODE_SIZE (mode)
+      && xmode == best_simd_mode)
+    mode = xmode;
+
+  /* If we failed to find a mode that might use vector extensions, try to
+     find the widest ordinary integer mode.  */
+  if (mode == VOIDmode)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  /* If the mode found won't use vector extensions, there is no need to use
+     a mode wider than Pmode.  */
+  if (!vector_extensions_used_for_mode (mode)
+      && GET_MODE_SIZE (mode) > MOVE_MAX_PIECES)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  return mode;
+}
+
+/* Find the widest move mode for the given size and alignment.  */
+static enum machine_mode
+widest_mode_for_aligned_mov (unsigned HOST_WIDE_INT size, unsigned int align)
+{
+  enum machine_mode mode;
+  enum machine_mode tmode, xmode;
+  enum machine_mode best_simd_mode = targetm.vectorize.preferred_simd_mode (
+      mode_for_size (UNITS_PER_WORD * BITS_PER_UNIT, MODE_INT, 0));
+
+  /* Find the widest integer mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_INT), xmode = VOIDmode;
+      tmode != VOIDmode;
+      tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size || GET_MODE_ALIGNMENT (tmode) > align)
+	break;
+      if (optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.scalar_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+  mode = xmode;
+
+  /* Find the widest vector mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_VECTOR_INT), xmode = VOIDmode;
+      tmode != VOIDmode;
+      tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size || GET_MODE_ALIGNMENT (tmode) > align)
+	break;
+      if (GET_MODE_SIZE (GET_MODE_INNER (tmode)) == UNITS_PER_WORD
+	  && optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.vector_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+
+  /* Choose between integer and vector modes.  */
+  if (xmode != VOIDmode && GET_MODE_SIZE (xmode) > GET_MODE_SIZE (mode))
+    mode = xmode;
+
+  /* If the vector and scalar modes found have the same size, and the vector
+     mode is BEST_SIMD_MODE, prefer the vector mode to the scalar mode.  */
+  if (xmode != VOIDmode
+      && GET_MODE_SIZE (xmode) == GET_MODE_SIZE (mode)
+      && xmode == best_simd_mode)
+    mode = xmode;
+
+  /* If we failed to find a mode that might use vector extensions, try to
+     find the widest ordinary integer mode.  */
+  if (mode == VOIDmode)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  /* If the mode found won't use vector extensions, there is no need to use
+     a mode wider than Pmode.  */
+  if (!vector_extensions_used_for_mode (mode)
+      && GET_MODE_SIZE (mode) > MOVE_MAX_PIECES)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  return mode;
+}
+
 /* STORE_MAX_PIECES is the number of bytes at a time that we can
    store efficiently.  Due to internal GCC limitations, this is
    MOVE_MAX_PIECES limited by the number of bytes GCC can represent
@@ -876,6 +1104,7 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
   rtx to_addr, from_addr = XEXP (from, 0);
   unsigned int max_size = MOVE_MAX_PIECES + 1;
   enum insn_code icode;
+  int dst_offset, src_offset;
 
   align = MIN (to ? MEM_ALIGN (to) : align, MEM_ALIGN (from));
 
@@ -960,23 +1189,37 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
 	data.to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
     }
 
-  align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
-
-  /* First move what we can in the largest integer mode, then go to
-     successively smaller modes.  */
-
-  while (max_size > 1)
+  src_offset = get_mem_align_offset (from, MOVE_MAX * BITS_PER_UNIT);
+  dst_offset = get_mem_align_offset (to, MOVE_MAX * BITS_PER_UNIT);
+  if (src_offset < 0
+      || dst_offset < 0
+      || src_offset != dst_offset
+      || compute_aligned_cost (data.len, src_offset)
+	 >= compute_unaligned_cost (data.len))
     {
-      enum machine_mode mode = widest_int_mode_for_size (max_size);
-
-      if (mode == VOIDmode)
-	break;
+      while (data.len > 0)
+	{
+	  enum machine_mode mode = widest_mode_for_unaligned_mov (data.len);
 
-      icode = optab_handler (mov_optab, mode);
-      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
-	move_by_pieces_1 (GEN_FCN (icode), mode, &data);
+	  icode = optab_handler (mov_optab, mode);
+	  gcc_assert (icode != CODE_FOR_nothing);
+	  move_by_pieces_insn (GEN_FCN (icode), mode, &data);
+	}
+    }
+  else
+    {
+      while (data.len > 0)
+	{
+	  enum machine_mode mode;
+	  mode = widest_mode_for_aligned_mov (data.len,
+	      compute_align_by_offset (src_offset));
 
-      max_size = GET_MODE_SIZE (mode);
+	  icode = optab_handler (mov_optab, mode);
+	  gcc_assert (icode != CODE_FOR_nothing
+		      && compute_align_by_offset (src_offset) >= GET_MODE_ALIGNMENT (mode));
+	  move_by_pieces_insn (GEN_FCN (icode), mode, &data);
+	  src_offset += GET_MODE_SIZE (mode);
+	}
     }
 
   /* The code above should have handled everything.  */
@@ -1014,35 +1257,47 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
 }
 
 /* Return number of insns required to move L bytes by pieces.
-   ALIGN (in bits) is maximum alignment we can assume.  */
+   ALIGN (in bits) is maximum alignment we can assume.
+   This is just an estimation, so the actual number of instructions might
+   differ from it (there are several options of expanding memmove).  */
 
 static unsigned HOST_WIDE_INT
 move_by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align,
-		       unsigned int max_size)
+		       unsigned int max_size ATTRIBUTE_UNUSED)
 {
   unsigned HOST_WIDE_INT n_insns = 0;
-
-  align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
-
-  while (max_size > 1)
+  unsigned HOST_WIDE_INT n_insns_u = 0;
+  enum machine_mode mode;
+  unsigned HOST_WIDE_INT len = l;
+  while (len > 0)
     {
-      enum machine_mode mode;
-      enum insn_code icode;
-
-      mode = widest_int_mode_for_size (max_size);
-
-      if (mode == VOIDmode)
-	break;
+      mode = widest_mode_for_aligned_mov (len, align);
+      if (GET_MODE_SIZE (mode) < MOVE_MAX)
+	{
+	  align += GET_MODE_ALIGNMENT (mode);
+	  len -= GET_MODE_SIZE (mode);
+	  n_insns++;
+	}
+      else
+	{
+	  /* We are using the widest mode.  */
+	  n_insns += len / GET_MODE_SIZE (mode);
+	  len = len % GET_MODE_SIZE (mode);
+	}
+    }
+  gcc_assert (!len);
 
-      icode = optab_handler (mov_optab, mode);
-      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
-	n_insns += l / GET_MODE_SIZE (mode), l %= GET_MODE_SIZE (mode);
+  len = l;
+  while (len > 0)
+    {
+      mode = widest_mode_for_unaligned_mov (len);
+      n_insns_u += len / GET_MODE_SIZE (mode);
+      len = len % GET_MODE_SIZE (mode);
 
-      max_size = GET_MODE_SIZE (mode);
     }
 
-  gcc_assert (!l);
-  return n_insns;
+  gcc_assert (!len);
+  return MIN (n_insns, n_insns_u);
 }
 
 /* Subroutine of move_by_pieces.  Move as many bytes as appropriate
@@ -1050,60 +1305,57 @@ move_by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align,
    to make a move insn for that mode.  DATA has all the other info.  */
 
 static void
-move_by_pieces_1 (rtx (*genfun) (rtx, ...), enum machine_mode mode,
+move_by_pieces_insn (rtx (*genfun) (rtx, ...), enum machine_mode mode,
 		  struct move_by_pieces_d *data)
 {
   unsigned int size = GET_MODE_SIZE (mode);
   rtx to1 = NULL_RTX, from1;
 
-  while (data->len >= size)
-    {
-      if (data->reverse)
-	data->offset -= size;
-
-      if (data->to)
-	{
-	  if (data->autinc_to)
-	    to1 = adjust_automodify_address (data->to, mode, data->to_addr,
-					     data->offset);
-	  else
-	    to1 = adjust_address (data->to, mode, data->offset);
-	}
+  if (data->reverse)
+    data->offset -= size;
 
-      if (data->autinc_from)
-	from1 = adjust_automodify_address (data->from, mode, data->from_addr,
-					   data->offset);
+  if (data->to)
+    {
+      if (data->autinc_to)
+	to1 = adjust_automodify_address (data->to, mode, data->to_addr,
+					 data->offset);
       else
-	from1 = adjust_address (data->from, mode, data->offset);
+	to1 = adjust_address (data->to, mode, data->offset);
+    }
 
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
-	emit_insn (gen_add2_insn (data->to_addr,
-				  GEN_INT (-(HOST_WIDE_INT)size)));
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_from < 0)
-	emit_insn (gen_add2_insn (data->from_addr,
-				  GEN_INT (-(HOST_WIDE_INT)size)));
+  if (data->autinc_from)
+    from1 = adjust_automodify_address (data->from, mode, data->from_addr,
+				       data->offset);
+  else
+    from1 = adjust_address (data->from, mode, data->offset);
 
-      if (data->to)
-	emit_insn ((*genfun) (to1, from1));
-      else
-	{
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
+    emit_insn (gen_add2_insn (data->to_addr,
+			      GEN_INT (-(HOST_WIDE_INT)size)));
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_from < 0)
+    emit_insn (gen_add2_insn (data->from_addr,
+			      GEN_INT (-(HOST_WIDE_INT)size)));
+
+  if (data->to)
+    emit_insn ((*genfun) (to1, from1));
+  else
+    {
 #ifdef PUSH_ROUNDING
-	  emit_single_push_insn (mode, from1, NULL);
+      emit_single_push_insn (mode, from1, NULL);
 #else
-	  gcc_unreachable ();
+      gcc_unreachable ();
 #endif
-	}
+    }
 
-      if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
-	emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
-      if (HAVE_POST_INCREMENT && data->explicit_inc_from > 0)
-	emit_insn (gen_add2_insn (data->from_addr, GEN_INT (size)));
+  if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
+    emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
+  if (HAVE_POST_INCREMENT && data->explicit_inc_from > 0)
+    emit_insn (gen_add2_insn (data->from_addr, GEN_INT (size)));
 
-      if (! data->reverse)
-	data->offset += size;
+  if (! data->reverse)
+    data->offset += size;
 
-      data->len -= size;
-    }
+  data->len -= size;
 }
 \f
 /* Emit code to move a block Y to a block X.  This may be done with
@@ -1680,7 +1932,7 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src, tree type, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (src)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (src))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (src))
 	      || MEM_ALIGN (src) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -2070,7 +2322,7 @@ emit_group_store (rtx orig_dst, rtx src, tree type ATTRIBUTE_UNUSED, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (dest)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (dest))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (dest))
 	      || MEM_ALIGN (dest) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -2464,7 +2716,10 @@ store_by_pieces (rtx to, unsigned HOST_WIDE_INT len,
   data.constfundata = constfundata;
   data.len = len;
   data.to = to;
-  store_by_pieces_1 (&data, align);
+  if (memsetp)
+    set_by_pieces_1 (&data, align);
+  else
+    store_by_pieces_1 (&data, align);
   if (endp)
     {
       rtx to1;
@@ -2508,10 +2763,10 @@ clear_by_pieces (rtx to, unsigned HOST_WIDE_INT len, unsigned int align)
     return;
 
   data.constfun = clear_by_pieces_1;
-  data.constfundata = NULL;
+  data.constfundata = CONST0_RTX (QImode);
   data.len = len;
   data.to = to;
-  store_by_pieces_1 (&data, align);
+  set_by_pieces_1 (&data, align);
 }
 
 /* Callback routine for clear_by_pieces.
@@ -2525,13 +2780,148 @@ clear_by_pieces_1 (void *data ATTRIBUTE_UNUSED,
   return const0_rtx;
 }
 
-/* Subroutine of clear_by_pieces and store_by_pieces.
+/* Helper function for set_by_pieces: generate a move with the given mode.
+   Return the mode actually used for the generated move (it can differ from
+   the requested one if that mode isn't supported).  */
+static enum machine_mode
+generate_move_with_mode (struct store_by_pieces_d *data,
+			 enum machine_mode mode,
+			 rtx *promoted_to_vector_value_ptr,
+			 rtx *promoted_value_ptr)
+{
+  enum insn_code icode;
+  rtx rhs = NULL_RTX;
+  enum machine_mode promoted_value_mode;
+
+  gcc_assert (promoted_to_vector_value_ptr && promoted_value_ptr);
+
+  if (vector_extensions_used_for_mode (mode))
+    {
+      promoted_value_mode = vector_mode_for_mode (mode);
+      if (!(*promoted_to_vector_value_ptr))
+	*promoted_to_vector_value_ptr
+	  = expand_vector_broadcast_of_byte_value (promoted_value_mode,
+						   (rtx)data->constfundata);
+
+      /* *PROMOTED_TO_VECTOR_VALUE_PTR could be NULL if we failed to promote it.
+	 It's also not guaranteed that it will have mode PROMOTED_VALUE_MODE.  */
+      if (*promoted_to_vector_value_ptr)
+	{
+	  promoted_value_mode = GET_MODE (*promoted_to_vector_value_ptr);
+
+	  if (GET_MODE_SIZE (promoted_value_mode) < GET_MODE_SIZE (mode))
+	    return generate_move_with_mode (data, promoted_value_mode,
+				    promoted_to_vector_value_ptr,
+				    promoted_value_ptr);
+
+	  /* If promoted value mode is wider than the requested mode, we need to
+	     extract a part from the vector register.  */
+	  if (GET_MODE_SIZE (promoted_value_mode) > GET_MODE_SIZE (mode))
+	    {
+	      enum machine_mode part_mode;
+	      enum machine_mode inner_mode = GET_MODE_INNER (promoted_value_mode);
+	      int n_elem = GET_MODE_SIZE (mode) / GET_MODE_SIZE (inner_mode);
+	      gcc_assert (n_elem > 0);
+	      /* PART_MODE should have the same size as MODE, and the same inner
+		 mode as PROMOTED_VALUE_MODE.  */
+	      part_mode = mode_for_vector (inner_mode, n_elem);
+	      rhs = convert_to_mode (mode,
+				     gen_lowpart (part_mode,
+						  *promoted_to_vector_value_ptr),
+				     1);
+	    }
+	  else
+	    rhs = convert_to_mode (mode, *promoted_to_vector_value_ptr, 1);
+	}
+    }
+  else
+    {
+      if (CONST_INT_P ((rtx)data->constfundata))
+	{
+	  /* We don't need to load the constant to a register, if it could be
+	     encoded as an immediate operand.  */
+	  rtx imm_const;
+	  switch (mode)
+	    {
+	    case DImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x0101010101010101, DImode);
+	      break;
+	    case SImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x01010101, SImode);
+	      break;
+	    case HImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x00000101, HImode);
+	      break;
+	    case QImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x00000001, QImode);
+	      break;
+	    default:
+	      gcc_unreachable ();
+	      break;
+	    }
+	  rhs = imm_const;
+	}
+      else /* data->constfundata isn't const.  */
+	{
+	  if (!(*promoted_value_ptr))
+	    {
+	      rtx coeff;
+	      /* Choose the mode for the promoted value.  It shouldn't be
+		 narrower than Pmode.  */
+	      if (GET_MODE_SIZE (mode) > GET_MODE_SIZE (Pmode))
+		promoted_value_mode = mode;
+	      else
+		promoted_value_mode = Pmode;
+
+	      switch (promoted_value_mode)
+		{
+		case DImode:
+		  coeff = gen_int_mode (0x0101010101010101, DImode);
+		  break;
+		case SImode:
+		  coeff = gen_int_mode (0x01010101, SImode);
+		  break;
+		default:
+		  gcc_unreachable ();
+		  break;
+		}
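+	      /* Replicate the byte value over the whole promoted mode by
+		 multiplying it by the 0x01...01 coefficient above.  */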
+	      *promoted_value_ptr = convert_to_mode (promoted_value_mode,
+						     (rtx)data->constfundata,
+						     1);
+	      *promoted_value_ptr = expand_mult (promoted_value_mode,
+						 *promoted_value_ptr, coeff,
+						 NULL_RTX, 1);
+	    }
+	  rhs = convert_to_mode (mode, *promoted_value_ptr, 1);
+	}
+    }
+  /* If RHS is still NULL, we failed to produce the value in the requested
+     mode; fall back to Pmode.  */
+  if (!rhs)
+    return generate_move_with_mode (data, Pmode, promoted_to_vector_value_ptr,
+			       promoted_value_ptr);
+
+  gcc_assert (rhs);
+  icode = optab_handler (mov_optab, mode);
+  gcc_assert (icode != CODE_FOR_nothing);
+  set_by_pieces_2 (GEN_FCN (icode), mode, data, rhs);
+  return mode;
+}
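+
+/* Note (illustrative example, not tied to a particular target): for a
+   64-byte memset of a non-constant byte value on a target with 16-byte
+   vector stores, the first call to generate_move_with_mode broadcasts the
+   value into a vector register and subsequent calls reuse
+   *PROMOTED_TO_VECTOR_VALUE_PTR, so the broadcast is emitted only once.  */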
+
+/* Subroutine of store_by_pieces.
    Generate several move instructions to store LEN bytes of block TO.  (A MEM
    rtx with BLKmode).  ALIGN is maximum alignment we can assume.  */
 
 static void
-store_by_pieces_1 (struct store_by_pieces_d *data ATTRIBUTE_UNUSED,
-		   unsigned int align ATTRIBUTE_UNUSED)
+store_by_pieces_1 (struct store_by_pieces_d *data, unsigned int align)
 {
   enum machine_mode to_addr_mode
     = targetm.addr_space.address_mode (MEM_ADDR_SPACE (data->to));
@@ -2606,6 +2996,134 @@ store_by_pieces_1 (struct store_by_pieces_d *data ATTRIBUTE_UNUSED,
   gcc_assert (!data->len);
 }
 
+/* Subroutine of clear_by_pieces and store_by_pieces.
+   Generate several move instructions to store LEN bytes of block TO (a MEM
+   rtx with BLKmode).  ALIGN is the maximum alignment we can assume.
+   Unlike store_by_pieces_1, this routine always generates code for memset
+   (store_by_pieces_1 is sometimes also used to generate code for memcpy
+   rather than for memset).  */
+
+static void
+set_by_pieces_1 (struct store_by_pieces_d *data, unsigned int align)
+{
+  enum machine_mode to_addr_mode
+    = targetm.addr_space.address_mode (MEM_ADDR_SPACE (data->to));
+  rtx to_addr = XEXP (data->to, 0);
+  unsigned int max_size = STORE_MAX_PIECES + 1;
+  int dst_offset;
+  rtx promoted_to_vector_value = NULL_RTX;
+  rtx promoted_value = NULL_RTX;
+
+  data->offset = 0;
+  data->to_addr = to_addr;
+  data->autinc_to
+    = (GET_CODE (to_addr) == PRE_INC || GET_CODE (to_addr) == PRE_DEC
+       || GET_CODE (to_addr) == POST_INC || GET_CODE (to_addr) == POST_DEC);
+
+  data->explicit_inc_to = 0;
+  data->reverse
+    = (GET_CODE (to_addr) == PRE_DEC || GET_CODE (to_addr) == POST_DEC);
+  if (data->reverse)
+    data->offset = data->len;
+
+  /* If storing requires more than two move insns,
+     copy addresses to registers (to make displacements shorter)
+     and use post-increment if available.  */
+  if (!data->autinc_to
+      && move_by_pieces_ninsns (data->len, align, max_size) > 2)
+    {
+      /* Determine the main mode we'll be using.
+	 MODE might not be used depending on the definitions of the
+	 USE_* macros below.  */
+      enum machine_mode mode ATTRIBUTE_UNUSED
+	= widest_int_mode_for_size (max_size);
+
+      if (USE_STORE_PRE_DECREMENT (mode) && data->reverse && ! data->autinc_to)
+	{
+	  data->to_addr = copy_to_mode_reg (to_addr_mode,
+					    plus_constant (to_addr, data->len));
+	  data->autinc_to = 1;
+	  data->explicit_inc_to = -1;
+	}
+
+      if (USE_STORE_POST_INCREMENT (mode) && ! data->reverse
+	  && ! data->autinc_to)
+	{
+	  data->to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
+	  data->autinc_to = 1;
+	  data->explicit_inc_to = 1;
+	}
+
+      if ( !data->autinc_to && CONSTANT_P (to_addr))
+	data->to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
+    }
+
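+  /* Choose between the aligned and the unaligned store sequences.  As an
+     illustration, with DST_OFFSET == 1 the aligned path has to start with
+     narrow stores until wider alignment is reached, so it is used only
+     when compute_aligned_cost reports it as cheaper.  */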
+  dst_offset = get_mem_align_offset (data->to, MOVE_MAX * BITS_PER_UNIT);
+  if (dst_offset < 0
+      || compute_aligned_cost (data->len, dst_offset)
+	 >= compute_unaligned_cost (data->len))
+    {
+      while (data->len > 0)
+	{
+	  enum machine_mode mode = widest_mode_for_unaligned_mov (data->len);
+	  generate_move_with_mode (data, mode, &promoted_to_vector_value,
+				   &promoted_value);
+	}
+    }
+  else
+    {
+      while (data->len > 0)
+	{
+	  enum machine_mode mode;
+	  mode = widest_mode_for_aligned_mov (data->len,
+	      compute_align_by_offset (dst_offset));
+	  mode = generate_move_with_mode (data, mode, &promoted_to_vector_value,
+				   &promoted_value);
+	  dst_offset += GET_MODE_SIZE (mode);
+	}
+    }
+
+  /* The code above should have handled everything.  */
+  gcc_assert (!data->len);
+}
+
+/* Subroutine of set_by_pieces_1.  Emit a move instruction in mode MODE.
+   DATA has info about the destination, RHS is the source, GENFUN is the
+   gen_... function to make a move insn for that mode.  */
+
+static void
+set_by_pieces_2 (rtx (*genfun) (rtx, ...), enum machine_mode mode,
+		   struct store_by_pieces_d *data, rtx rhs)
+{
+  unsigned int size = GET_MODE_SIZE (mode);
+  rtx to1;
+
+  if (data->reverse)
+    data->offset -= size;
+
+  if (data->autinc_to)
+    to1 = adjust_automodify_address (data->to, mode, data->to_addr,
+	data->offset);
+  else
+    to1 = adjust_address (data->to, mode, data->offset);
+
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
+    emit_insn (gen_add2_insn (data->to_addr,
+	  GEN_INT (-(HOST_WIDE_INT) size)));
+
+  gcc_assert (rhs);
+
+  emit_insn ((*genfun) (to1, rhs));
+
+  if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
+    emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
+
+  if (! data->reverse)
+    data->offset += size;
+
+  data->len -= size;
+}
+
 /* Subroutine of store_by_pieces_1.  Store as many bytes as appropriate
    with move instructions for mode MODE.  GENFUN is the gen_... function
    to make a move insn for that mode.  DATA has all the other info.  */
@@ -4034,7 +4552,7 @@ emit_push_insn (rtx x, enum machine_mode mode, tree type, rtx size,
 	  /* Here we avoid the case of a structure whose weak alignment
 	     forces many pushes of a small amount of data,
 	     and such small pushes do rounding that causes trouble.  */
-	  && ((! SLOW_UNALIGNED_ACCESS (word_mode, align))
+	  && ((! targetm.slow_unaligned_access (word_mode, align))
 	      || align >= BIGGEST_ALIGNMENT
 	      || (PUSH_ROUNDING (align / BITS_PER_UNIT)
 		  == (align / BITS_PER_UNIT)))
@@ -6325,7 +6843,7 @@ store_field (rtx target, HOST_WIDE_INT bitsize, HOST_WIDE_INT bitpos,
       || (mode != BLKmode
 	  && ((((MEM_ALIGN (target) < GET_MODE_ALIGNMENT (mode))
 		|| bitpos % GET_MODE_ALIGNMENT (mode))
-	       && SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (target)))
+	       && targetm.slow_unaligned_access (mode, MEM_ALIGN (target)))
 	      || (bitpos % BITS_PER_UNIT != 0)))
       /* If the RHS and field are a constant size and the size of the
 	 RHS isn't the same size as the bitfield, we must use bitfield
@@ -9738,7 +10256,7 @@ expand_expr_real_1 (tree exp, rtx target, enum machine_mode tmode,
 		     && ((modifier == EXPAND_CONST_ADDRESS
 			  || modifier == EXPAND_INITIALIZER)
 			 ? STRICT_ALIGNMENT
-			 : SLOW_UNALIGNED_ACCESS (mode1, MEM_ALIGN (op0))))
+			 : targetm.slow_unaligned_access (mode1, MEM_ALIGN (op0))))
 		    || (bitpos % BITS_PER_UNIT != 0)))
 	    /* If the type and the field are a constant size and the
 	       size of the type isn't the same size as the bitfield,
diff --git a/gcc/expr.h b/gcc/expr.h
index 1bf1369..9fc83c9 100644
--- a/gcc/expr.h
+++ b/gcc/expr.h
@@ -706,4 +706,7 @@ extern tree build_libfunc_function (const char *);
 /* Get the personality libfunc for a function decl.  */
 rtx get_personality_function (tree);
 
+/* Given the offset from the maximum alignment boundary, compute the maximum
+   alignment that can be assumed.  */
+unsigned int compute_align_by_offset (int);
 #endif /* GCC_EXPR_H */
diff --git a/gcc/fwprop.c b/gcc/fwprop.c
index 5368d18..cbbb75a 100644
--- a/gcc/fwprop.c
+++ b/gcc/fwprop.c
@@ -1273,6 +1273,10 @@ forward_propagate_and_simplify (df_ref use, rtx def_insn, rtx def_set)
       return false;
     }
 
+  /* Don't propagate vector constants.  */
+  if (vector_extensions_used_for_mode (GET_MODE (reg)) && CONSTANT_P (src))
+    return false;
+
   if (asm_use >= 0)
     return forward_propagate_asm (use, def_insn, def_set, reg);
 
diff --git a/gcc/optabs.c b/gcc/optabs.c
index a373d7a..f42bd9e 100644
--- a/gcc/optabs.c
+++ b/gcc/optabs.c
@@ -770,6 +770,47 @@ expand_vector_broadcast (enum machine_mode vmode, rtx op)
   return ret;
 }
 
+/* Create a new vector value in VMODE with all bytes set to VAL.  VAL must
+   either be a constant or have mode QImode.
+   If VAL is a constant, the return value will also be a constant.  */
+
+rtx
+expand_vector_broadcast_of_byte_value (enum machine_mode vmode, rtx val)
+{
+  enum insn_code icode;
+  rtvec vec;
+  rtx ret;
+  int i, n;
+
+  enum machine_mode byte_vmode;
+
+  gcc_checking_assert (VECTOR_MODE_P (vmode));
+  gcc_assert (CONSTANT_P (val) || GET_MODE (val) == QImode);
+  byte_vmode = mode_for_vector (QImode, GET_MODE_SIZE (vmode));
+
+  n = GET_MODE_NUNITS (byte_vmode);
+  vec = rtvec_alloc (n);
+  for (i = 0; i < n; ++i)
+    RTVEC_ELT (vec, i) = val;
+
+  if (CONSTANT_P (val))
+    ret = gen_rtx_CONST_VECTOR (byte_vmode, vec);
+  else
+    {
+      icode = optab_handler (vec_init_optab, byte_vmode);
+      if (icode == CODE_FOR_nothing)
+	return NULL;
+
+      ret = gen_reg_rtx (byte_vmode);
+      emit_insn (GEN_FCN (icode) (ret, gen_rtx_PARALLEL (byte_vmode, vec)));
+    }
+
+  if (vmode != byte_vmode)
+    ret = convert_to_mode (vmode, ret, 1);
+
+  return ret;
+}
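+
+/* For example, expand_vector_broadcast_of_byte_value (V4SImode,
+   GEN_INT (0xab)) builds a V16QImode CONST_VECTOR of 0xab bytes and
+   converts it to V4SImode.  With a non-constant QImode VAL the target's
+   vec_init pattern is used instead, and NULL is returned if no such
+   pattern exists.  */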
+
 /* This subroutine of expand_doubleword_shift handles the cases in which
    the effective shift value is >= BITS_PER_WORD.  The arguments and return
    value are the same as for the parent routine, except that SUPERWORD_OP1
diff --git a/gcc/optabs.h b/gcc/optabs.h
index 926d21f..dca1742 100644
--- a/gcc/optabs.h
+++ b/gcc/optabs.h
@@ -1147,4 +1147,5 @@ extern void expand_jump_insn (enum insn_code icode, unsigned int nops,
 extern rtx prepare_operand (enum insn_code, rtx, int, enum machine_mode,
 			    enum machine_mode, int);
 
+extern rtx expand_vector_broadcast_of_byte_value (enum machine_mode, rtx);
 #endif /* GCC_OPTABS_H */
diff --git a/gcc/rtl.h b/gcc/rtl.h
index f13485e..4ec67c7 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -2513,6 +2513,9 @@ extern void emit_jump (rtx);
 /* In expr.c */
 extern rtx move_by_pieces (rtx, rtx, unsigned HOST_WIDE_INT,
 			   unsigned int, int);
+/* Return true if vector instructions are required to operate on the
+   specified mode.  */
+extern bool vector_extensions_used_for_mode (enum machine_mode);
 extern HOST_WIDE_INT find_args_size_adjust (rtx);
 extern int fixup_args_size_notes (rtx, rtx, int);
 
diff --git a/gcc/target.def b/gcc/target.def
index c3bec0e..76cf291 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1498,6 +1498,14 @@ DEFHOOK
  bool, (struct ao_ref_s *ref),
  default_ref_may_alias_errno)
 
+/* True if accessing unaligned data in the given mode is too slow or
+   prohibited.  */
+DEFHOOK
+(slow_unaligned_access,
+ "",
+ bool, (enum machine_mode mode, unsigned int align),
+ default_slow_unaligned_access)
+
 /* Support for named address spaces.  */
 #undef HOOK_PREFIX
 #define HOOK_PREFIX "TARGET_ADDR_SPACE_"
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index 81fd12f..e70ecba 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -1442,4 +1442,15 @@ default_pch_valid_p (const void *data_p, size_t len)
   return NULL;
 }
 
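+/* Default implementation of the slow_unaligned_access hook: defer to the
+   SLOW_UNALIGNED_ACCESS macro when the target defines it, otherwise fall
+   back to STRICT_ALIGNMENT.  */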
+bool
+default_slow_unaligned_access (enum machine_mode mode ATTRIBUTE_UNUSED,
+			       unsigned int align ATTRIBUTE_UNUSED)
+{
+#ifdef SLOW_UNALIGNED_ACCESS
+  return SLOW_UNALIGNED_ACCESS (mode, align);
+#else
+  return STRICT_ALIGNMENT;
+#endif
+}
+
 #include "gt-targhooks.h"
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index f19fb50..ace8686 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -175,3 +175,5 @@ extern enum machine_mode default_get_reg_raw_mode(int);
 
 extern void *default_get_pch_validity (size_t *);
 extern const char *default_pch_valid_p (const void *, size_t);
+extern bool default_slow_unaligned_access (enum machine_mode mode,
+					   unsigned int align);
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-1.c
new file mode 100644
index 0000000..39c8ef0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-2.c
new file mode 100644
index 0000000..439694b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-3.c
new file mode 100644
index 0000000..51f4c3b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-4.c
new file mode 100644
index 0000000..bca8680
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-1.c
new file mode 100644
index 0000000..5bc8e74
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-2.c
new file mode 100644
index 0000000..b7dff27
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-1.c
new file mode 100644
index 0000000..bee85fe
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemcpy" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-2.c
new file mode 100644
index 0000000..1160beb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemcpy" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-1.c
new file mode 100644
index 0000000..b1c78ec
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-2.c
new file mode 100644
index 0000000..a15a0f7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-3.c
new file mode 100644
index 0000000..2789660
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-4.c
new file mode 100644
index 0000000..17e0342
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-1.c
new file mode 100644
index 0000000..e437378
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-2.c
new file mode 100644
index 0000000..ba716df
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-3.c
new file mode 100644
index 0000000..1845e95
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-4.c
new file mode 100644
index 0000000..2b23751
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-au-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-1.c
new file mode 100644
index 0000000..e751192
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemcpy" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-au-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-2.c
new file mode 100644
index 0000000..7defe7e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-au-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-3.c
new file mode 100644
index 0000000..ea27378
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-1.c
new file mode 100644
index 0000000..de2a557
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-2.c
new file mode 100644
index 0000000..1f82258
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-3.c
new file mode 100644
index 0000000..7f60806
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-4.c
new file mode 100644
index 0000000..94f0864
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-1.c
new file mode 100644
index 0000000..20545c8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-2.c
new file mode 100644
index 0000000..52dab8e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-3.c
new file mode 100644
index 0000000..c662480
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-4.c
new file mode 100644
index 0000000..9e8e152
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-1.c
new file mode 100644
index 0000000..662fc20
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-2.c
new file mode 100644
index 0000000..c90e852
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-3.c
new file mode 100644
index 0000000..5a41f82
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-4.c
new file mode 100644
index 0000000..ec2dfff
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-1.c
new file mode 100644
index 0000000..d6b2cd5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-2.c
new file mode 100644
index 0000000..9cd89e9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-3.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-3.c
new file mode 100644
index 0000000..ddf25fd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-4.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-4.c
new file mode 100644
index 0000000..fde4f5d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-1.c
new file mode 100644
index 0000000..4fe2d36
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-2.c
new file mode 100644
index 0000000..2209563
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-au-1.c b/gcc/testsuite/gcc.target/i386/memset-s3072-au-1.c
new file mode 100644
index 0000000..8d99dde
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemset" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-au-2.c b/gcc/testsuite/gcc.target/i386/memset-s3072-au-2.c
new file mode 100644
index 0000000..e0ad04a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemset" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-1.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-1.c
new file mode 100644
index 0000000..404d04e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-2.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-2.c
new file mode 100644
index 0000000..1df9db0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-3.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-3.c
new file mode 100644
index 0000000..beb005c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-4.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-4.c
new file mode 100644
index 0000000..29f5ea3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-1.c
new file mode 100644
index 0000000..2504333
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-2.c
new file mode 100644
index 0000000..b0aaada
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-3.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-3.c
new file mode 100644
index 0000000..3e250d0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-4.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-4.c
new file mode 100644
index 0000000..c13edd7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-1.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-1.c
new file mode 100644
index 0000000..17d9525
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-2.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-2.c
new file mode 100644
index 0000000..8125e9d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-3.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-3.c
new file mode 100644
index 0000000..ff74811
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-4.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-4.c
new file mode 100644
index 0000000..d7e0c3d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-1.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-1.c
new file mode 100644
index 0000000..ea7b439
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-10.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-10.c
new file mode 100644
index 0000000..5ef250d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-10.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-11.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-11.c
new file mode 100644
index 0000000..846a807
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-11.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-12.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-12.c
new file mode 100644
index 0000000..a8f7c3b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-12.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-2.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-2.c
new file mode 100644
index 0000000..ae05e93
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-3.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-3.c
new file mode 100644
index 0000000..96462bd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-4.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-4.c
new file mode 100644
index 0000000..6aee01e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-5.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-5.c
new file mode 100644
index 0000000..bbad9b9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-5.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-6.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-6.c
new file mode 100644
index 0000000..8e90d72
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-6.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-7.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-7.c
new file mode 100644
index 0000000..26d0b42
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-7.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-8.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-8.c
new file mode 100644
index 0000000..84ec749
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-8.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-9.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-9.c
new file mode 100644
index 0000000..ef15265
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-9.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-1.c
new file mode 100644
index 0000000..444a8de
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-2.c
new file mode 100644
index 0000000..9154fb9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-3.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-3.c
new file mode 100644
index 0000000..9b7dac1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-4.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-4.c
new file mode 100644
index 0000000..713c8a8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-1.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-1.c
new file mode 100644
index 0000000..8c700c0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-2.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-2.c
new file mode 100644
index 0000000..c344fd0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-3.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-3.c
new file mode 100644
index 0000000..125de2f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-4.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-4.c
new file mode 100644
index 0000000..b50de1b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-1.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-1.c
new file mode 100644
index 0000000..c6fd271
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-2.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-2.c
new file mode 100644
index 0000000..32972e6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-3.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-3.c
new file mode 100644
index 0000000..ac615e8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-4.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-4.c
new file mode 100644
index 0000000..8458cfd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-5.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-5.c
new file mode 100644
index 0000000..210946d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-5.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-6.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-6.c
new file mode 100644
index 0000000..e63feae
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-6.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-7.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-7.c
new file mode 100644
index 0000000..72b2ba0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-7.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-8.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-8.c
new file mode 100644
index 0000000..cb5dc85
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-8.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/sw-1.c b/gcc/testsuite/gcc.target/i386/sw-1.c
index 483d117..e3d3b91 100644
--- a/gcc/testsuite/gcc.target/i386/sw-1.c
+++ b/gcc/testsuite/gcc.target/i386/sw-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fshrink-wrap -fdump-rtl-pro_and_epilogue" } */
+/* { dg-options "-O2 -fshrink-wrap -fdump-rtl-pro_and_epilogue -mstringop-strategy=rep_byte" } */
 
 #include <string.h>
 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-11-01 17:36                                   ` Michael Zolotukhin
@ 2011-11-01 17:48                                     ` Michael Zolotukhin
  2011-11-02 19:12                                     ` Jan Hubicka
  2011-11-02 19:55                                     ` Jan Hubicka
  2 siblings, 0 replies; 52+ messages in thread
From: Michael Zolotukhin @ 2011-11-01 17:48 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Richard Henderson, Jakub Jelinek, Jack Howarth, gcc-patches,
	Richard Guenther, H.J. Lu, izamyatin, areg.melikadamyan

[-- Attachment #1: Type: text/plain, Size: 7448 bytes --]

Here are the separate middle-end and back-end parts of the updated patch.

On 1 November 2011 21:34, Michael Zolotukhin
<michael.v.zolotukhin@gmail.com> wrote:
> Thanks for the answers!
>
> I tried to take into account all the remarks and updated the patch in
> accordance with them. Its full version is attached to this letter; the
> separate middle-end and back-end parts will follow in subsequent letters.
>
> What about the rest of the patch? Jan, could you please review it too?
>
>
> Below are responses to the remarks you made.
>> +/* Helper function for expand_set_or_movmem_via_loop.
>> +   This function can reuse iter rtx from another loop and don't generate
>> +   code for updating the addresses.  */
>> +static rtx
>> +expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
>> +                                        rtx destptr, rtx srcptr, rtx value,
>> +                                        rtx count, rtx iter,
>> +                                        enum machine_mode mode, int unroll,
>> +                                        int expected_size, bool change_ptrs)
>>
>> I wrote the original function, but it is not really clear to me what the function
>> does now, i.e. what the code for updating addresses is and what reusing iter means.
>> I guess reusing iter means that we won't start the loop from 0.  Could you
>> expand comments a bit more?
> I added some comments - please see the updated version of the patch.
>
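To make the iterator-reuse idea concrete for readers of the archive, here is a
minimal plain-C sketch (only an illustration, not the RTL the expander emits;
the helper name and the 16-byte chunk size are invented for the example): a
main loop moves wide chunks, and an epilogue loop picks up from the same
induction variable instead of recomputing the addresses.

#include <stddef.h>
#include <string.h>

/* Main loop moves 16-byte chunks; the epilogue loop continues from the
   same iterator I instead of recomputing the addresses.  */
static void
copy_with_shared_iter (char *dst, const char *src, size_t len)
{
  size_t i = 0;

  for (; i + 16 <= len; i += 16)
    memcpy (dst + i, src + i, 16);

  for (; i < len; i++)
    dst[i] = src[i];
}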
>
>> -/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST.  */
>> +/* Emit strset instruction.  If RHS is constant and a vector mode will be used,
>> +   then move this constant to a vector register before emitting strset.  */
>> +static void
>> +emit_strset (rtx destmem, rtx value,
>> +            rtx destptr, enum machine_mode mode, int offset)
>>
>> This seems to belong more naturally in the gen_strset expander?
> Corrected.
>
>
>>        {
>> -         if (TARGET_64BIT)
>> -           {
>> -             dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
>> -             emit_insn (gen_strset (destptr, dest, value));
>> -           }
>> -         else
>> -           {
>> -             dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
>> -             emit_insn (gen_strset (destptr, dest, value));
>> -             dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset + 4);
>> -             emit_insn (gen_strset (destptr, dest, value));
>> -           }
>> -         offset += 8;
>> +         if (GET_MODE (destmem) != move_mode)
>> +           destmem = change_address (destmem, move_mode, destptr);
>> AFAIK change_address is not equivalent to adjust_automodify_address_nv in the way
>> it copies memory aliasing attributes, and it is needed to zap them here since stringops
>> behave funnily WRT aliasing.
> Fixed.
>
>
>>   if (max_size > 16)
>>     {
>>       rtx label = ix86_expand_aligntest (count, 16, true);
>>       if (TARGET_64BIT)
>>        {
>> -         dest = change_address (destmem, DImode, destptr);
>> -         emit_insn (gen_strset (destptr, dest, value));
>> -         emit_insn (gen_strset (destptr, dest, value));
>> +         destmem = change_address (destmem, DImode, destptr);
>> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
>> +                                                               value)));
>> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
>> +                                                               value)));
>>
>> No use for 128bit moves here?
>>        }
>>       else
>>        {
>> -         dest = change_address (destmem, SImode, destptr);
>> -         emit_insn (gen_strset (destptr, dest, value));
>> -         emit_insn (gen_strset (destptr, dest, value));
>> -         emit_insn (gen_strset (destptr, dest, value));
>> -         emit_insn (gen_strset (destptr, dest, value));
>> +         destmem = change_address (destmem, SImode, destptr);
>> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
>> +                                                               value)));
>> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
>> +                                                               value)));
>> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
>> +                                                               value)));
>> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
>> +                                                               value)));
>>
>> And here?
> Fixed.
>
>
>> @@ -21204,8 +21426,8 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
>>   if (align <= 1 && desired_alignment > 1)
>>     {
>>       rtx label = ix86_expand_aligntest (destptr, 1, false);
>> -      srcmem = change_address (srcmem, QImode, srcptr);
>> -      destmem = change_address (destmem, QImode, destptr);
>> +      srcmem = adjust_automodify_address_1 (srcmem, QImode, srcptr, 0, 1);
>> +      destmem = adjust_automodify_address_1 (destmem, QImode, destptr, 0, 1);
>>
>> You want to always use adjust_automodify_address or adjust_automodify_address_nv;
>> adjust_automodify_address_1 is not intended for general use.
> Fixed.
>
>
>> +/* Target hook.  Returns rtx of mode MODE with promoted value VAL, that is
>> +   supposed to represent one byte.  MODE could be a vector mode.
>> +   Example:
>> +   1) VAL = const_int (0xAB), mode = SImode,
>> +   the result is const_int (0xABABABAB).
>>
>> This can be handled in machine independent way, right?
>
>> Certainly it can be done machine-independently.
>> See expand_vector_broadcast in optabs.c for a start.
> Thanks, there is no need for a new hook, indeed. Fixed.
>
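As a side note for readers of the archive, the byte promotion described in the
hook example is plain arithmetic and easy to check in C. The sketch below is
only an illustration (the function name is invented); GCC itself does the
equivalent on RTL, via expand_vector_broadcast or a multiply.

#include <stdint.h>
#include <stdio.h>

/* Replicate VAL into every byte: 0xAB * 0x0101010101010101 gives
   0xABABABABABABABAB, the DImode analogue of the SImode example above
   (0xAB * 0x01010101 == 0xABABABAB).  */
static uint64_t
broadcast_byte (uint8_t val)
{
  return (uint64_t) val * 0x0101010101010101ULL;
}

int
main (void)
{
  printf ("%016llx\n", (unsigned long long) broadcast_byte (0xAB));
  return 0;
}
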
> Responses to other questions/remarks were in the previous letter.
>
> On 28 October 2011 19:59, Richard Henderson <rth@redhat.com> wrote:
>> On 10/28/2011 05:41 AM, Michael Zolotukhin wrote:
>>>> > +/* Target hook.  Returns rtx of mode MODE with promoted value VAL, that is
>>>> > +   supposed to represent one byte.  MODE could be a vector mode.
>>>> > +   Example:
>>>> > +   1) VAL = const_int (0xAB), mode = SImode,
>>>> > +   the result is const_int (0xABABABAB).
>>>> >
>>>> > This can be handled in machine independent way, right?
>>>> >
>>>> > +   2) if VAL isn't const, then the result will be the result of MUL-instruction
>>>> > +   of VAL and const_int (0x01010101) (for SImode).  */
>>>> >
>>>> > This would probably go better as named expansion pattern, like we do for other
>>>> > machine description interfaces.
>>> I don't think it could be done in a machine-independent way - e.g. if
>>> AVX is available, we could use broadcast instructions; if not, we
>>> need to use multiply instructions.  On other architectures there are
>>> probably other, more efficient ways to duplicate a byte value across
>>> the entire vector register. So IMO it's a good place to have a hook.
>>>
>>>
>>
>> Certainly it can be done machine-independently.
>> See expand_vector_broadcast in optabs.c for a start.
>>
>>
>> r~
>>
>
>
>
> --
> ---
> Best regards,
> Michael V. Zolotukhin,
> Software Engineer
> Intel Corporation.
>



-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.

[-- Attachment #2: memfunc-mid-4.patch --]
[-- Type: application/octet-stream, Size: 42953 bytes --]

diff --git a/gcc/builtins.c b/gcc/builtins.c
index 296c5b7..3e41695 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -3567,7 +3567,8 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
 				  builtin_memset_read_str, &c, dest_align,
 				  true))
 	store_by_pieces (dest_mem, tree_low_cst (len, 1),
-			 builtin_memset_read_str, &c, dest_align, true, 0);
+			 builtin_memset_read_str, gen_int_mode (c, val_mode),
+			 dest_align, true, 0);
       else if (!set_storage_via_setmem (dest_mem, len_rtx,
 					gen_int_mode (c, val_mode),
 					dest_align, expected_align,
diff --git a/gcc/cse.c b/gcc/cse.c
index ae67685..3b6471d 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -4616,7 +4616,10 @@ cse_insn (rtx insn)
 		 to fold switch statements when an ADDR_DIFF_VEC is used.  */
 	      || (GET_CODE (src_folded) == MINUS
 		  && GET_CODE (XEXP (src_folded, 0)) == LABEL_REF
-		  && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF)))
+		  && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF))
+	      /* Don't propagate vector-constants, as for now no architecture
+		 supports vector immediates.  */
+	  && !vector_extensions_used_for_mode (mode))
 	src_const = src_folded, src_const_elt = elt;
       else if (src_const == 0 && src_eqv_here && CONSTANT_P (src_eqv_here))
 	src_const = src_eqv_here, src_const_elt = src_eqv_elt;
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 90cef1c..844ed17 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5780,6 +5780,25 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum machine_mode @var{mode}, unsigned int @var{align})
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options require building faster code, the hook can
+be used to prevent access to unaligned data in some set of modes even if the
+processor can do the access without a trap.
+
+By default the hook returns the value of @code{SLOW_UNALIGNED_ACCESS} if
+it is defined and @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6252,23 +6271,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 187122e..c32e745 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -5718,6 +5718,25 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@hook TARGET_SLOW_UNALIGNED_ACCESS
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options require building faster code, the hook can
+be used to prevent access to unaligned data in some set of modes even if the
+processor can do the access without a trap.
+
+By default the hook returns the value of @code{SLOW_UNALIGNED_ACCESS} if
+it is defined and @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6190,23 +6209,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c
index 8465237..ff568b1 100644
--- a/gcc/emit-rtl.c
+++ b/gcc/emit-rtl.c
@@ -1495,6 +1495,12 @@ get_mem_align_offset (rtx mem, unsigned int align)
       if (TYPE_ALIGN (TREE_TYPE (expr)) < (unsigned int) align)
 	return -1;
     }
+  else if (TREE_CODE (expr) == MEM_REF)
+    {
+      if (get_object_alignment_1 (expr, &offset) < align)
+	return -1;
+      offset /= BITS_PER_UNIT;
+    }
   else if (TREE_CODE (expr) == COMPONENT_REF)
     {
       while (1)
@@ -2058,7 +2064,6 @@ adjust_address_1 (rtx memref, enum machine_mode mode, HOST_WIDE_INT offset,
   enum machine_mode address_mode;
   int pbits;
   struct mem_attrs attrs, *defattrs;
-  unsigned HOST_WIDE_INT max_align;
 
   attrs = *get_mem_attrs (memref);
 
@@ -2115,8 +2120,12 @@ adjust_address_1 (rtx memref, enum machine_mode mode, HOST_WIDE_INT offset,
      if zero.  */
   if (offset != 0)
     {
-      max_align = (offset & -offset) * BITS_PER_UNIT;
-      attrs.align = MIN (attrs.align, max_align);
+      int old_offset = get_mem_align_offset (memref, MOVE_MAX*BITS_PER_UNIT);
+      if (old_offset >= 0)
+	attrs.align = compute_align_by_offset (old_offset + attrs.offset);
+      else
+	attrs.align = MIN (attrs.align,
+	      (unsigned HOST_WIDE_INT) (offset & -offset) * BITS_PER_UNIT);
     }
 
   /* We can compute the size in a number of ways.  */
diff --git a/gcc/expr.c b/gcc/expr.c
index b020978..4ffd0b7 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -126,15 +126,18 @@ struct store_by_pieces_d
 static unsigned HOST_WIDE_INT move_by_pieces_ninsns (unsigned HOST_WIDE_INT,
 						     unsigned int,
 						     unsigned int);
-static void move_by_pieces_1 (rtx (*) (rtx, ...), enum machine_mode,
-			      struct move_by_pieces_d *);
+static void move_by_pieces_insn (rtx (*) (rtx, ...), enum machine_mode,
+		  struct move_by_pieces_d *);
 static bool block_move_libcall_safe_for_call_parm (void);
 static bool emit_block_move_via_movmem (rtx, rtx, rtx, unsigned, unsigned, HOST_WIDE_INT);
 static tree emit_block_move_libcall_fn (int);
 static void emit_block_move_via_loop (rtx, rtx, rtx, unsigned);
 static rtx clear_by_pieces_1 (void *, HOST_WIDE_INT, enum machine_mode);
 static void clear_by_pieces (rtx, unsigned HOST_WIDE_INT, unsigned int);
+static void set_by_pieces_1 (struct store_by_pieces_d *, unsigned int);
 static void store_by_pieces_1 (struct store_by_pieces_d *, unsigned int);
+static void set_by_pieces_2 (rtx (*) (rtx, ...), enum machine_mode,
+			       struct store_by_pieces_d *, rtx);
 static void store_by_pieces_2 (rtx (*) (rtx, ...), enum machine_mode,
 			       struct store_by_pieces_d *);
 static tree clear_storage_libcall_fn (int);
@@ -163,6 +166,12 @@ static void do_tablejump (rtx, enum machine_mode, rtx, rtx, rtx);
 static rtx const_vector_from_tree (tree);
 static void write_complex_part (rtx, rtx, bool);
 
+static enum machine_mode widest_mode_for_unaligned_mov (unsigned HOST_WIDE_INT);
+static enum machine_mode widest_mode_for_aligned_mov (unsigned HOST_WIDE_INT,
+						      unsigned int);
+static enum machine_mode generate_move_with_mode (struct store_by_pieces_d *,
+					   enum machine_mode, rtx *, rtx *);
+
 /* This macro is used to determine whether move_by_pieces should be called
    to perform a structure copy.  */
 #ifndef MOVE_BY_PIECES_P
@@ -811,7 +820,7 @@ alignment_for_piecewise_move (unsigned int max_pieces, unsigned int align)
 	   tmode != VOIDmode;
 	   xmode = tmode, tmode = GET_MODE_WIDER_MODE (tmode))
 	if (GET_MODE_SIZE (tmode) > max_pieces
-	    || SLOW_UNALIGNED_ACCESS (tmode, align))
+	    || targetm.slow_unaligned_access (tmode, align))
 	  break;
 
       align = MAX (align, GET_MODE_ALIGNMENT (xmode));
@@ -820,11 +829,66 @@ alignment_for_piecewise_move (unsigned int max_pieces, unsigned int align)
   return align;
 }
 
+/* Given an offset from align border,
+   compute the maximal alignment of offsetted data.  */
+unsigned int
+compute_align_by_offset (int offset)
+{
+    return (offset==0) ?
+	    BIGGEST_ALIGNMENT :
+	    MIN (BIGGEST_ALIGNMENT, (offset & -offset) * BITS_PER_UNIT);
+}
+
+/* Estimate cost of move for given size and offset.  Offset is used for
+   determining max alignment.  */
+static int
+compute_aligned_cost (unsigned HOST_WIDE_INT size, int offset)
+{
+  unsigned HOST_WIDE_INT cost = 0;
+  int cur_off = offset;
+
+  while (size > 0)
+    {
+      enum machine_mode mode = widest_mode_for_aligned_mov (size,
+	  compute_align_by_offset (cur_off));
+      int cur_mode_cost;
+      enum vect_cost_for_stmt type_of_cost = vector_load;
+      if (GET_MODE_SIZE (mode) <= UNITS_PER_WORD
+	  && (SCALAR_INT_MODE_P (mode) || SCALAR_FLOAT_MODE_P (mode)))
+	type_of_cost = scalar_load;
+      cur_mode_cost =
+	targetm.vectorize.builtin_vectorization_cost (type_of_cost, NULL, 0);
+      size -= GET_MODE_SIZE (mode);
+      cur_off += GET_MODE_SIZE (mode);
+      cost += cur_mode_cost;
+    }
+  return cost;
+}
+
+/* Estimate cost of move for given size.  It is assumed that the
+   alignment is unknown, so we need to use unaligned moves.  */
+static int
+compute_unaligned_cost (unsigned HOST_WIDE_INT size)
+{
+  unsigned HOST_WIDE_INT cost = 0;
+  while (size > 0)
+    {
+      enum machine_mode mode = widest_mode_for_unaligned_mov (size);
+      unsigned HOST_WIDE_INT n_insns = size/GET_MODE_SIZE (mode);
+      int cur_mode_cost =
+	targetm.vectorize.builtin_vectorization_cost (unaligned_load, NULL, 0);
+
+      cost += n_insns*cur_mode_cost;
+      size %= GET_MODE_SIZE (mode);
+    }
+  return cost;
+}
+
 /* Return the widest integer mode no wider than SIZE.  If no such mode
    can be found, return VOIDmode.  */
 
 static enum machine_mode
-widest_int_mode_for_size (unsigned int size)
+widest_int_mode_for_size (unsigned HOST_WIDE_INT size)
 {
   enum machine_mode tmode, mode = VOIDmode;
 
@@ -836,6 +900,170 @@ widest_int_mode_for_size (unsigned int size)
   return mode;
 }
 
+/* If mode is a scalar mode, find corresponding preferred vector mode.
+   If such mode can't be found, return vector mode, corresponding to Pmode
+   (a kind of default vector mode).
+   For vector modes return the mode itself.  */
+
+static enum machine_mode
+vector_mode_for_mode (enum machine_mode mode)
+{
+  enum machine_mode xmode;
+  if (VECTOR_MODE_P (mode))
+    return mode;
+  xmode = targetm.vectorize.preferred_simd_mode (mode);
+  if (VECTOR_MODE_P (xmode))
+    return xmode;
+
+  return targetm.vectorize.preferred_simd_mode (Pmode);
+}
+
+/* The routine checks if vector instructions are required for operating
+   with mode specified.
+   For vector modes it checks, if the corresponding vector extension is
+   supported.
+   Operations with scalar mode will use vector extensions if this scalar
+   mode is wider than default scalar mode (Pmode) and vector extension
+   for parent vector mode is available.  */
+
+bool vector_extensions_used_for_mode (enum machine_mode mode)
+{
+  enum machine_mode vector_mode = vector_mode_for_mode (mode);
+
+  if (VECTOR_MODE_P (mode))
+    return targetm.vector_mode_supported_p (mode);
+
+  /* mode is a scalar mode.  */
+  if (VECTOR_MODE_P (vector_mode)
+     && targetm.vector_mode_supported_p (vector_mode)
+     && (GET_MODE_SIZE (mode) > GET_MODE_SIZE (Pmode)))
+    return true;
+
+  return false;
+}
+
+/* Find the widest move mode for the given size if alignment is unknown.  */
+static enum machine_mode
+widest_mode_for_unaligned_mov (unsigned HOST_WIDE_INT size)
+{
+  enum machine_mode mode;
+  enum machine_mode tmode, xmode;
+  enum machine_mode best_simd_mode = targetm.vectorize.preferred_simd_mode (
+      mode_for_size (UNITS_PER_WORD*BITS_PER_UNIT, MODE_INT, 0));
+
+  /* Find the widest integer mode.  Here we can find modes wider than Pmode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_INT), xmode = VOIDmode;
+       tmode != VOIDmode;
+       tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size
+	  || targetm.slow_unaligned_access (tmode, BITS_PER_UNIT))
+	break;
+      if (optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.scalar_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+  mode = xmode;
+
+  /* Find the widest vector mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_VECTOR_INT), xmode = VOIDmode;
+       tmode != VOIDmode;
+       tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size
+	  || targetm.slow_unaligned_access (tmode, BITS_PER_UNIT))
+	break;
+      if (GET_MODE_SIZE (GET_MODE_INNER (tmode)) == UNITS_PER_WORD
+	  && optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.vector_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+
+  /* Choose between integer and vector modes.  */
+  if (xmode != VOIDmode && GET_MODE_SIZE (xmode) > GET_MODE_SIZE (mode))
+    mode = xmode;
+
+  /* If found vector and scalar modes have the same sizes, and vector mode is
+     best_simd_mode, then prefer vector mode to scalar mode.  */
+  if (xmode != VOIDmode
+      && GET_MODE_SIZE (xmode) == GET_MODE_SIZE (mode)
+      && xmode == best_simd_mode)
+    mode = xmode;
+
+  /* If we failed to find a mode that might use vector extensions, try to
+     find widest ordinary integer mode.  */
+  if (mode == VOIDmode)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  /* If found mode won't use vector extensions, then there is no need to use
+     a mode wider than Pmode.  */
+  if (!vector_extensions_used_for_mode (mode)
+      && GET_MODE_SIZE (mode) > MOVE_MAX_PIECES)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  return mode;
+}
+
+/* Find the widest move mode for the given size and alignment.  */
+static enum machine_mode
+widest_mode_for_aligned_mov (unsigned HOST_WIDE_INT size, unsigned int align)
+{
+  enum machine_mode mode;
+  enum machine_mode tmode, xmode;
+  enum machine_mode best_simd_mode = targetm.vectorize.preferred_simd_mode (
+      mode_for_size (UNITS_PER_WORD*BITS_PER_UNIT, MODE_INT, 0));
+
+  /* Find the widest integer mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_INT), xmode = VOIDmode;
+      tmode != VOIDmode;
+      tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size || GET_MODE_ALIGNMENT (tmode) > align)
+	break;
+      if (optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.scalar_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+  mode = xmode;
+
+  /* Find the widest vector mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_VECTOR_INT), xmode = VOIDmode;
+      tmode != VOIDmode;
+      tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size || GET_MODE_ALIGNMENT (tmode) > align)
+	break;
+      if (GET_MODE_SIZE (GET_MODE_INNER (tmode)) == UNITS_PER_WORD &&
+	  optab_handler (mov_optab, tmode) != CODE_FOR_nothing     &&
+	  targetm.vector_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+
+  /* Choose between integer and vector modes.  */
+  if (xmode != VOIDmode && GET_MODE_SIZE (xmode) > GET_MODE_SIZE (mode))
+    mode = xmode;
+
+  /* If found vector and scalar modes have the same sizes, and vector mode is
+     best_simd_mode, then prefer vector mode to scalar mode.  */
+  if (xmode != VOIDmode
+      && GET_MODE_SIZE (xmode) == GET_MODE_SIZE (mode)
+      && xmode == best_simd_mode)
+    mode = xmode;
+
+  /* If we failed to find a mode that might use vector extensions, try to
+     find widest ordinary integer mode.  */
+  if (mode == VOIDmode)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  /* If found mode won't use vector extensions, then there is no need to use
+     a mode wider than Pmode.  */
+  if (!vector_extensions_used_for_mode (mode)
+      && GET_MODE_SIZE (mode) > MOVE_MAX_PIECES)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  return mode;
+}
+
 /* STORE_MAX_PIECES is the number of bytes at a time that we can
    store efficiently.  Due to internal GCC limitations, this is
    MOVE_MAX_PIECES limited by the number of bytes GCC can represent
@@ -876,6 +1104,7 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
   rtx to_addr, from_addr = XEXP (from, 0);
   unsigned int max_size = MOVE_MAX_PIECES + 1;
   enum insn_code icode;
+  int dst_offset, src_offset;
 
   align = MIN (to ? MEM_ALIGN (to) : align, MEM_ALIGN (from));
 
@@ -960,23 +1189,37 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
 	data.to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
     }
 
-  align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
-
-  /* First move what we can in the largest integer mode, then go to
-     successively smaller modes.  */
-
-  while (max_size > 1)
+  src_offset = get_mem_align_offset (from, MOVE_MAX*BITS_PER_UNIT);
+  dst_offset = get_mem_align_offset (to, MOVE_MAX*BITS_PER_UNIT);
+  if (src_offset < 0
+      || dst_offset < 0
+      || src_offset != dst_offset
+      || compute_aligned_cost (data.len, src_offset) >=
+	 compute_unaligned_cost (data.len))
     {
-      enum machine_mode mode = widest_int_mode_for_size (max_size);
-
-      if (mode == VOIDmode)
-	break;
+      while (data.len > 0)
+	{
+	  enum machine_mode mode = widest_mode_for_unaligned_mov (data.len);
 
-      icode = optab_handler (mov_optab, mode);
-      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
-	move_by_pieces_1 (GEN_FCN (icode), mode, &data);
+	  icode = optab_handler (mov_optab, mode);
+	  gcc_assert (icode != CODE_FOR_nothing);
+	  move_by_pieces_insn (GEN_FCN (icode), mode, &data);
+	}
+    }
+  else
+    {
+      while (data.len > 0)
+	{
+	  enum machine_mode mode;
+	  mode = widest_mode_for_aligned_mov (data.len,
+	      compute_align_by_offset (src_offset));
 
-      max_size = GET_MODE_SIZE (mode);
+	  icode = optab_handler (mov_optab, mode);
+	  gcc_assert (icode != CODE_FOR_nothing &&
+	      compute_align_by_offset (src_offset) >= GET_MODE_ALIGNMENT (mode));
+	  move_by_pieces_insn (GEN_FCN (icode), mode, &data);
+	  src_offset += GET_MODE_SIZE (mode);
+	}
     }
 
   /* The code above should have handled everything.  */
@@ -1014,35 +1257,47 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
 }
 
 /* Return number of insns required to move L bytes by pieces.
-   ALIGN (in bits) is maximum alignment we can assume.  */
+   ALIGN (in bits) is maximum alignment we can assume.
+   This is just an estimate, so the actual number of instructions might
+   differ from it (there are several ways of expanding memmove).  */
 
 static unsigned HOST_WIDE_INT
 move_by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align,
-		       unsigned int max_size)
+		       unsigned int max_size ATTRIBUTE_UNUSED)
 {
   unsigned HOST_WIDE_INT n_insns = 0;
-
-  align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
-
-  while (max_size > 1)
+  unsigned HOST_WIDE_INT n_insns_u = 0;
+  enum machine_mode mode;
+  unsigned HOST_WIDE_INT len = l;
+  while (len > 0)
     {
-      enum machine_mode mode;
-      enum insn_code icode;
-
-      mode = widest_int_mode_for_size (max_size);
-
-      if (mode == VOIDmode)
-	break;
+      mode = widest_mode_for_aligned_mov (len, align);
+      if (GET_MODE_SIZE (mode) < MOVE_MAX)
+	{
+	  align += GET_MODE_ALIGNMENT (mode);
+	  len -= GET_MODE_SIZE (mode);
+	  n_insns ++;
+	}
+      else
+	{
+	  /* We are using the widest mode.  */
+	  n_insns += len/GET_MODE_SIZE (mode);
+	  len = len%GET_MODE_SIZE (mode);
+	}
+    }
+  gcc_assert (!len);
 
-      icode = optab_handler (mov_optab, mode);
-      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
-	n_insns += l / GET_MODE_SIZE (mode), l %= GET_MODE_SIZE (mode);
+  len = l;
+  while (len > 0)
+    {
+      mode = widest_mode_for_unaligned_mov (len);
+      n_insns_u += len/GET_MODE_SIZE (mode);
+      len = len%GET_MODE_SIZE (mode);
 
-      max_size = GET_MODE_SIZE (mode);
     }
 
-  gcc_assert (!l);
-  return n_insns;
+  gcc_assert (!len);
+  return MIN (n_insns, n_insns_u);
 }
 
 /* Subroutine of move_by_pieces.  Move as many bytes as appropriate
@@ -1050,60 +1305,57 @@ move_by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align,
    to make a move insn for that mode.  DATA has all the other info.  */
 
 static void
-move_by_pieces_1 (rtx (*genfun) (rtx, ...), enum machine_mode mode,
+move_by_pieces_insn (rtx (*genfun) (rtx, ...), enum machine_mode mode,
 		  struct move_by_pieces_d *data)
 {
   unsigned int size = GET_MODE_SIZE (mode);
   rtx to1 = NULL_RTX, from1;
 
-  while (data->len >= size)
-    {
-      if (data->reverse)
-	data->offset -= size;
-
-      if (data->to)
-	{
-	  if (data->autinc_to)
-	    to1 = adjust_automodify_address (data->to, mode, data->to_addr,
-					     data->offset);
-	  else
-	    to1 = adjust_address (data->to, mode, data->offset);
-	}
+  if (data->reverse)
+    data->offset -= size;
 
-      if (data->autinc_from)
-	from1 = adjust_automodify_address (data->from, mode, data->from_addr,
-					   data->offset);
+  if (data->to)
+    {
+      if (data->autinc_to)
+	to1 = adjust_automodify_address (data->to, mode, data->to_addr,
+					 data->offset);
       else
-	from1 = adjust_address (data->from, mode, data->offset);
+	to1 = adjust_address (data->to, mode, data->offset);
+    }
 
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
-	emit_insn (gen_add2_insn (data->to_addr,
-				  GEN_INT (-(HOST_WIDE_INT)size)));
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_from < 0)
-	emit_insn (gen_add2_insn (data->from_addr,
-				  GEN_INT (-(HOST_WIDE_INT)size)));
+  if (data->autinc_from)
+    from1 = adjust_automodify_address (data->from, mode, data->from_addr,
+				       data->offset);
+  else
+    from1 = adjust_address (data->from, mode, data->offset);
 
-      if (data->to)
-	emit_insn ((*genfun) (to1, from1));
-      else
-	{
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
+    emit_insn (gen_add2_insn (data->to_addr,
+			      GEN_INT (-(HOST_WIDE_INT)size)));
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_from < 0)
+    emit_insn (gen_add2_insn (data->from_addr,
+			      GEN_INT (-(HOST_WIDE_INT)size)));
+
+  if (data->to)
+    emit_insn ((*genfun) (to1, from1));
+  else
+    {
 #ifdef PUSH_ROUNDING
-	  emit_single_push_insn (mode, from1, NULL);
+      emit_single_push_insn (mode, from1, NULL);
 #else
-	  gcc_unreachable ();
+      gcc_unreachable ();
 #endif
-	}
+    }
 
-      if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
-	emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
-      if (HAVE_POST_INCREMENT && data->explicit_inc_from > 0)
-	emit_insn (gen_add2_insn (data->from_addr, GEN_INT (size)));
+  if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
+    emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
+  if (HAVE_POST_INCREMENT && data->explicit_inc_from > 0)
+    emit_insn (gen_add2_insn (data->from_addr, GEN_INT (size)));
 
-      if (! data->reverse)
-	data->offset += size;
+  if (! data->reverse)
+    data->offset += size;
 
-      data->len -= size;
-    }
+  data->len -= size;
 }
 \f
 /* Emit code to move a block Y to a block X.  This may be done with
@@ -1680,7 +1932,7 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src, tree type, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (src)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (src))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (src))
 	      || MEM_ALIGN (src) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -2070,7 +2322,7 @@ emit_group_store (rtx orig_dst, rtx src, tree type ATTRIBUTE_UNUSED, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (dest)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (dest))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (dest))
 	      || MEM_ALIGN (dest) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -2464,7 +2716,10 @@ store_by_pieces (rtx to, unsigned HOST_WIDE_INT len,
   data.constfundata = constfundata;
   data.len = len;
   data.to = to;
-  store_by_pieces_1 (&data, align);
+  if (memsetp)
+    set_by_pieces_1 (&data, align);
+  else
+    store_by_pieces_1 (&data, align);
   if (endp)
     {
       rtx to1;
@@ -2508,10 +2763,10 @@ clear_by_pieces (rtx to, unsigned HOST_WIDE_INT len, unsigned int align)
     return;
 
   data.constfun = clear_by_pieces_1;
-  data.constfundata = NULL;
+  data.constfundata = CONST0_RTX (QImode);
   data.len = len;
   data.to = to;
-  store_by_pieces_1 (&data, align);
+  set_by_pieces_1 (&data, align);
 }
 
 /* Callback routine for clear_by_pieces.
@@ -2525,13 +2780,148 @@ clear_by_pieces_1 (void *data ATTRIBUTE_UNUSED,
   return const0_rtx;
 }
 
-/* Subroutine of clear_by_pieces and store_by_pieces.
+/* Helper function for set_by_pieces - generates a move with the given mode.
+   Returns the mode used in the generated move (it could differ from the
+   requested one, if the requested mode isn't supported).  */
+static enum machine_mode generate_move_with_mode (
+			      struct store_by_pieces_d *data,
+			      enum machine_mode mode,
+			      rtx *promoted_to_vector_value_ptr,
+			      rtx *promoted_value_ptr)
+{
+  enum insn_code icode;
+  rtx rhs = NULL_RTX;
+  enum machine_mode promoted_value_mode;
+
+  gcc_assert (promoted_to_vector_value_ptr && promoted_value_ptr);
+
+  if (vector_extensions_used_for_mode (mode))
+    {
+      promoted_value_mode = vector_mode_for_mode (mode);
+      if (!(*promoted_to_vector_value_ptr))
+	*promoted_to_vector_value_ptr
+	  = expand_vector_broadcast_of_byte_value (promoted_value_mode,
+						   (rtx)data->constfundata);
+
+      /* *PROMOTED_TO_VECTOR_VALUE_PTR could be NULL if we failed to promote it.
+	 It's also not guaranteed, that it will have mode PROMOTED_VALUE_MODE.  */
+      if (*promoted_to_vector_value_ptr)
+	{
+	  promoted_value_mode = GET_MODE (*promoted_to_vector_value_ptr);
+
+	  if (GET_MODE_SIZE (promoted_value_mode) < GET_MODE_SIZE (mode))
+	    return generate_move_with_mode (data, promoted_value_mode,
+				    promoted_to_vector_value_ptr,
+				    promoted_value_ptr);
+
+	  /* If promoted value mode is wider than the requested mode, we need to
+	     extract a part from the vector register.  */
+	  if (GET_MODE_SIZE (promoted_value_mode) > GET_MODE_SIZE (mode))
+	    {
+	      enum machine_mode part_mode;
+	      enum machine_mode inner_mode = GET_MODE_INNER (promoted_value_mode);
+	      int n_elem = GET_MODE_SIZE (mode) / GET_MODE_SIZE (inner_mode);
+	      gcc_assert (n_elem > 0);
+	      /* PART_MODE should have the same size as MODE, and the same inner
+		 mode as PROMOTED_VALUE_MODE.  */
+	      part_mode = mode_for_vector (inner_mode, n_elem);
+	      rhs = convert_to_mode (mode,
+				     gen_lowpart (part_mode,
+						  *promoted_to_vector_value_ptr),
+				     1);
+	    }
+	  else
+	    rhs = convert_to_mode (mode, *promoted_to_vector_value_ptr, 1);
+	}
+    }
+  else
+    {
+      if (CONST_INT_P ((rtx)data->constfundata))
+	{
+	  /* We don't need to load the constant to a register, if it could be
+	     encoded as an immediate operand.  */
+	  rtx imm_const;
+	  switch (mode)
+	    {
+	    case DImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x0101010101010101, DImode);
+	      break;
+	    case SImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x01010101, SImode);
+	      break;
+	    case HImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x00000101, HImode);
+	      break;
+	    case QImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x00000001, QImode);
+	      break;
+	    default:
+	      gcc_unreachable ();
+	      break;
+	    }
+	  rhs = imm_const;
+	}
+      else /* data->constfundata isn't const.  */
+	{
+	  if (!(*promoted_value_ptr))
+	    {
+	      rtx coeff;
+	      /* Choose mode for promoted value.  It shouldn't be narrower, than
+		 Pmode.  */
+	      if (GET_MODE_SIZE (mode) > GET_MODE_SIZE (Pmode))
+		promoted_value_mode = mode;
+	      else
+		promoted_value_mode = Pmode;
+
+	      switch (promoted_value_mode)
+		{
+		case DImode:
+		  coeff = gen_int_mode (0x0101010101010101, DImode);
+		  break;
+		case SImode:
+		  coeff = gen_int_mode (0x01010101, SImode);
+		  break;
+		default:
+		  gcc_unreachable ();
+		  break;
+		}
+	      *promoted_value_ptr = convert_to_mode (promoted_value_mode,
+						     (rtx)data->constfundata,
+						     1);
+	      *promoted_value_ptr = expand_mult (promoted_value_mode,
+						 *promoted_value_ptr, coeff,
+						 NULL_RTX, 1);
+	    }
+	  rhs = convert_to_mode (mode, *promoted_value_ptr, 1);
+	}
+    }
+  /* If RHS is null, then the requested mode isn't supported and can't be used.
+     Use Pmode instead.  */
+  if (!rhs)
+    return generate_move_with_mode (data, Pmode, promoted_to_vector_value_ptr,
+			       promoted_value_ptr);
+
+  gcc_assert (rhs);
+  icode = optab_handler (mov_optab, mode);
+  gcc_assert (icode != CODE_FOR_nothing);
+  set_by_pieces_2 (GEN_FCN (icode), mode, data, rhs);
+  return mode;
+}
+
+/* Subroutine of store_by_pieces.
    Generate several move instructions to store LEN bytes of block TO.  (A MEM
    rtx with BLKmode).  ALIGN is maximum alignment we can assume.  */
 
 static void
-store_by_pieces_1 (struct store_by_pieces_d *data ATTRIBUTE_UNUSED,
-		   unsigned int align ATTRIBUTE_UNUSED)
+store_by_pieces_1 (struct store_by_pieces_d *data, unsigned int align)
 {
   enum machine_mode to_addr_mode
     = targetm.addr_space.address_mode (MEM_ADDR_SPACE (data->to));
@@ -2606,6 +2996,134 @@ store_by_pieces_1 (struct store_by_pieces_d *data ATTRIBUTE_UNUSED,
   gcc_assert (!data->len);
 }
 
+/* Subroutine of clear_by_pieces and store_by_pieces.
+   Generate several move instructions to store LEN bytes of block TO.  (A MEM
+   rtx with BLKmode).  ALIGN is maximum alignment we can assume.
+   As opposed to store_by_pieces_1, this routine always generates code for
+   memset.  (store_by_pieces_1 is sometimes used to generate code for memcpy
+   rather than for memset).  */
+
+static void
+set_by_pieces_1 (struct store_by_pieces_d *data, unsigned int align)
+{
+  enum machine_mode to_addr_mode
+    = targetm.addr_space.address_mode (MEM_ADDR_SPACE (data->to));
+  rtx to_addr = XEXP (data->to, 0);
+  unsigned int max_size = STORE_MAX_PIECES + 1;
+  int dst_offset;
+  rtx promoted_to_vector_value = NULL_RTX;
+  rtx promoted_value = NULL_RTX;
+
+  data->offset = 0;
+  data->to_addr = to_addr;
+  data->autinc_to
+    = (GET_CODE (to_addr) == PRE_INC || GET_CODE (to_addr) == PRE_DEC
+       || GET_CODE (to_addr) == POST_INC || GET_CODE (to_addr) == POST_DEC);
+
+  data->explicit_inc_to = 0;
+  data->reverse
+    = (GET_CODE (to_addr) == PRE_DEC || GET_CODE (to_addr) == POST_DEC);
+  if (data->reverse)
+    data->offset = data->len;
+
+  /* If storing requires more than two move insns,
+     copy addresses to registers (to make displacements shorter)
+     and use post-increment if available.  */
+  if (!data->autinc_to
+      && move_by_pieces_ninsns (data->len, align, max_size) > 2)
+    {
+      /* Determine the main mode we'll be using.
+	 MODE might not be used depending on the definitions of the
+	 USE_* macros below.  */
+      enum machine_mode mode ATTRIBUTE_UNUSED
+	= widest_int_mode_for_size (max_size);
+
+      if (USE_STORE_PRE_DECREMENT (mode) && data->reverse && ! data->autinc_to)
+	{
+	  data->to_addr = copy_to_mode_reg (to_addr_mode,
+					    plus_constant (to_addr, data->len));
+	  data->autinc_to = 1;
+	  data->explicit_inc_to = -1;
+	}
+
+      if (USE_STORE_POST_INCREMENT (mode) && ! data->reverse
+	  && ! data->autinc_to)
+	{
+	  data->to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
+	  data->autinc_to = 1;
+	  data->explicit_inc_to = 1;
+	}
+
+      if ( !data->autinc_to && CONSTANT_P (to_addr))
+	data->to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
+    }
+
+  dst_offset = get_mem_align_offset (data->to, MOVE_MAX*BITS_PER_UNIT);
+  if (dst_offset < 0
+      || compute_aligned_cost (data->len, dst_offset) >=
+	 compute_unaligned_cost (data->len))
+    {
+      while (data->len > 0)
+	{
+	  enum machine_mode mode = widest_mode_for_unaligned_mov (data->len);
+	  generate_move_with_mode (data, mode, &promoted_to_vector_value,
+				   &promoted_value);
+	}
+    }
+  else
+    {
+      while (data->len > 0)
+	{
+	  enum machine_mode mode;
+	  mode = widest_mode_for_aligned_mov (data->len,
+	      compute_align_by_offset (dst_offset));
+	  mode = generate_move_with_mode (data, mode, &promoted_to_vector_value,
+				   &promoted_value);
+	  dst_offset += GET_MODE_SIZE (mode);
+	}
+    }
+
+  /* The code above should have handled everything.  */
+  gcc_assert (!data->len);
+}
+
+/* Subroutine of set_by_pieces_1.  Emit move instruction with mode MODE.
+   DATA has info about destination, RHS is source, GENFUN is the gen_...
+   function to make a move insn for that mode.  */
+
+static void
+set_by_pieces_2 (rtx (*genfun) (rtx, ...), enum machine_mode mode,
+		   struct store_by_pieces_d *data, rtx rhs)
+{
+  unsigned int size = GET_MODE_SIZE (mode);
+  rtx to1;
+
+  if (data->reverse)
+    data->offset -= size;
+
+  if (data->autinc_to)
+    to1 = adjust_automodify_address (data->to, mode, data->to_addr,
+	data->offset);
+  else
+    to1 = adjust_address (data->to, mode, data->offset);
+
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
+    emit_insn (gen_add2_insn (data->to_addr,
+	  GEN_INT (-(HOST_WIDE_INT) size)));
+
+  gcc_assert (rhs);
+
+  emit_insn ((*genfun) (to1, rhs));
+
+  if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
+    emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
+
+  if (! data->reverse)
+    data->offset += size;
+
+  data->len -= size;
+}
+
 /* Subroutine of store_by_pieces_1.  Store as many bytes as appropriate
    with move instructions for mode MODE.  GENFUN is the gen_... function
    to make a move insn for that mode.  DATA has all the other info.  */
@@ -4034,7 +4552,7 @@ emit_push_insn (rtx x, enum machine_mode mode, tree type, rtx size,
 	  /* Here we avoid the case of a structure whose weak alignment
 	     forces many pushes of a small amount of data,
 	     and such small pushes do rounding that causes trouble.  */
-	  && ((! SLOW_UNALIGNED_ACCESS (word_mode, align))
+	  && ((! targetm.slow_unaligned_access (word_mode, align))
 	      || align >= BIGGEST_ALIGNMENT
 	      || (PUSH_ROUNDING (align / BITS_PER_UNIT)
 		  == (align / BITS_PER_UNIT)))
@@ -6325,7 +6843,7 @@ store_field (rtx target, HOST_WIDE_INT bitsize, HOST_WIDE_INT bitpos,
       || (mode != BLKmode
 	  && ((((MEM_ALIGN (target) < GET_MODE_ALIGNMENT (mode))
 		|| bitpos % GET_MODE_ALIGNMENT (mode))
-	       && SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (target)))
+	       && targetm.slow_unaligned_access (mode, MEM_ALIGN (target)))
 	      || (bitpos % BITS_PER_UNIT != 0)))
       /* If the RHS and field are a constant size and the size of the
 	 RHS isn't the same size as the bitfield, we must use bitfield
@@ -9738,7 +10256,7 @@ expand_expr_real_1 (tree exp, rtx target, enum machine_mode tmode,
 		     && ((modifier == EXPAND_CONST_ADDRESS
 			  || modifier == EXPAND_INITIALIZER)
 			 ? STRICT_ALIGNMENT
-			 : SLOW_UNALIGNED_ACCESS (mode1, MEM_ALIGN (op0))))
+			 : targetm.slow_unaligned_access (mode1, MEM_ALIGN (op0))))
 		    || (bitpos % BITS_PER_UNIT != 0)))
 	    /* If the type and the field are a constant size and the
 	       size of the type isn't the same size as the bitfield,
diff --git a/gcc/expr.h b/gcc/expr.h
index 1bf1369..9fc83c9 100644
--- a/gcc/expr.h
+++ b/gcc/expr.h
@@ -706,4 +706,7 @@ extern tree build_libfunc_function (const char *);
 /* Get the personality libfunc for a function decl.  */
 rtx get_personality_function (tree);
 
+/* Given offset from maximum alignment boundary, compute maximum alignment,
+   that can be assumed.  */
+unsigned int compute_align_by_offset (int);
 #endif /* GCC_EXPR_H */
diff --git a/gcc/fwprop.c b/gcc/fwprop.c
index 5368d18..cbbb75a 100644
--- a/gcc/fwprop.c
+++ b/gcc/fwprop.c
@@ -1273,6 +1273,10 @@ forward_propagate_and_simplify (df_ref use, rtx def_insn, rtx def_set)
       return false;
     }
 
+  /* Don't propagate vector-constants.  */
+  if (vector_extensions_used_for_mode (GET_MODE (reg)) && CONSTANT_P (src))
+      return false;
+
   if (asm_use >= 0)
     return forward_propagate_asm (use, def_insn, def_set, reg);
 
diff --git a/gcc/optabs.c b/gcc/optabs.c
index a373d7a..f42bd9e 100644
--- a/gcc/optabs.c
+++ b/gcc/optabs.c
@@ -770,6 +770,47 @@ expand_vector_broadcast (enum machine_mode vmode, rtx op)
   return ret;
 }
 
+/* Create a new vector value in VMODE with all bytes set to VAL.  The
+   mode of VAL must be QImode or it should be a constant.
+   If VAL is a constant, then the return value will be a constant.  */
+
+extern rtx
+expand_vector_broadcast_of_byte_value (enum machine_mode vmode, rtx val)
+{
+  enum insn_code icode;
+  rtvec vec;
+  rtx ret;
+  int i, n;
+
+  enum machine_mode byte_vmode;
+
+  gcc_checking_assert (VECTOR_MODE_P (vmode));
+  gcc_assert (CONSTANT_P (val) || GET_MODE (val) == QImode);
+  byte_vmode = mode_for_vector (QImode, GET_MODE_SIZE (vmode));
+
+  n = GET_MODE_NUNITS (byte_vmode);
+  vec = rtvec_alloc (n);
+  for (i = 0; i < n; ++i)
+    RTVEC_ELT (vec, i) = val;
+
+  if (CONSTANT_P (val))
+    ret = gen_rtx_CONST_VECTOR (byte_vmode, vec);
+  else
+    {
+      icode = optab_handler (vec_init_optab, byte_vmode);
+      if (icode == CODE_FOR_nothing)
+	return NULL;
+
+      ret = gen_reg_rtx (byte_vmode);
+      emit_insn (GEN_FCN (icode) (ret, gen_rtx_PARALLEL (byte_vmode, vec)));
+    }
+
+  if (vmode != byte_vmode)
+    ret = convert_to_mode (vmode, ret, 1);
+
+  return ret;
+}
+
 /* This subroutine of expand_doubleword_shift handles the cases in which
    the effective shift value is >= BITS_PER_WORD.  The arguments and return
    value are the same as for the parent routine, except that SUPERWORD_OP1
diff --git a/gcc/optabs.h b/gcc/optabs.h
index 926d21f..dca1742 100644
--- a/gcc/optabs.h
+++ b/gcc/optabs.h
@@ -1147,4 +1147,5 @@ extern void expand_jump_insn (enum insn_code icode, unsigned int nops,
 extern rtx prepare_operand (enum insn_code, rtx, int, enum machine_mode,
 			    enum machine_mode, int);
 
+extern rtx expand_vector_broadcast_of_byte_value (enum machine_mode, rtx);
 #endif /* GCC_OPTABS_H */
diff --git a/gcc/rtl.h b/gcc/rtl.h
index f13485e..4ec67c7 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -2513,6 +2513,9 @@ extern void emit_jump (rtx);
 /* In expr.c */
 extern rtx move_by_pieces (rtx, rtx, unsigned HOST_WIDE_INT,
 			   unsigned int, int);
+/* Return true if vector instructions are required to operate on the
+   specified mode.  */
+extern bool vector_extensions_used_for_mode (enum machine_mode);
 extern HOST_WIDE_INT find_args_size_adjust (rtx);
 extern int fixup_args_size_notes (rtx, rtx, int);
 
diff --git a/gcc/target.def b/gcc/target.def
index c3bec0e..76cf291 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1498,6 +1498,14 @@ DEFHOOK
  bool, (struct ao_ref_s *ref),
  default_ref_may_alias_errno)
 
+/* True if access to unaligned data in given mode is too slow or
+   prohibited.  */
+DEFHOOK
+(slow_unaligned_access,
+ "",
+ bool, (enum machine_mode mode, unsigned int align),
+ default_slow_unaligned_access)
+
 /* Support for named address spaces.  */
 #undef HOOK_PREFIX
 #define HOOK_PREFIX "TARGET_ADDR_SPACE_"
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index 81fd12f..e70ecba 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -1442,4 +1442,15 @@ default_pch_valid_p (const void *data_p, size_t len)
   return NULL;
 }
 
+bool
+default_slow_unaligned_access (enum machine_mode mode ATTRIBUTE_UNUSED,
+			       unsigned int align ATTRIBUTE_UNUSED)
+{
+#ifdef SLOW_UNALIGNED_ACCESS
+  return SLOW_UNALIGNED_ACCESS (mode, align);
+#else
+  return STRICT_ALIGNMENT;
+#endif
+}
+
 #include "gt-targhooks.h"
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index f19fb50..ace8686 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -175,3 +175,5 @@ extern enum machine_mode default_get_reg_raw_mode(int);
 
 extern void *default_get_pch_validity (size_t *);
 extern const char *default_pch_valid_p (const void *, size_t);
+extern bool default_slow_unaligned_access (enum machine_mode mode,
+					   unsigned int align);

[-- Attachment #3: memfunc-be-4.patch --]
[-- Type: application/octet-stream, Size: 83361 bytes --]

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 2c53423..6ce240a 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
   COSTS_N_BYTES (2),			/* cost of FABS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FCHS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -632,10 +636,14 @@ struct processor_costs i386_cost = {	/* 386 specific costs */
   COSTS_N_INSNS (22),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (24),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (122),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -704,10 +712,14 @@ struct processor_costs i486_cost = {	/* 486 specific costs */
   COSTS_N_INSNS (3),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (3),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (83),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -774,10 +786,14 @@ struct processor_costs pentium_cost = {
   COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (70),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{-1, rep_prefix_4_byte}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -849,12 +865,18 @@ struct processor_costs pentiumpro_cost = {
      noticeable win, for bigger blocks either rep movsl or rep movsb is
      way to go.  Rep movsb has apparently more expensive startup time in CPU,
      but after 4K the difference is down in the noise.  */
-  {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
+  {{{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
 			{8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_4_byte, {{1024, unrolled_loop},
-  			{8192, rep_prefix_4_byte}, {-1, libcall}}},
+   {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_4_byte, {{1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_4_byte, {{1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -922,10 +944,14 @@ struct processor_costs geode_cost = {
   COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (54),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -995,10 +1021,14 @@ struct processor_costs k6_cost = {
   COSTS_N_INSNS (2),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (2),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (56),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1068,10 +1098,14 @@ struct processor_costs athlon_cost = {
   /* For some reason, Athlon deals better with REP prefix (relative to loops)
      compared to K8. Alignment becomes important after 8 bytes for memcpy and
      128 bytes for memset.  */
-  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1146,11 +1180,16 @@ struct processor_costs k8_cost = {
   /* K8 has optimized REP instruction for medium sized blocks, but for very
      small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1233,11 +1272,16 @@ struct processor_costs amdfam10_cost = {
   /* AMDFAM10 has optimized REP instruction for medium sized blocks, but for
      very small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1320,11 +1364,16 @@ struct processor_costs bdver1_cost = {
   /*  BDVER1 has optimized REP instruction for medium sized blocks, but for
       very small blocks it is better to use loop. For large blocks, libcall
       can do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   6,					/* scalar_stmt_cost.  */
   4,					/* scalar load_cost.  */
   4,					/* scalar_store_cost.  */
@@ -1407,11 +1456,16 @@ struct processor_costs bdver2_cost = {
   /*  BDVER2 has optimized REP instruction for medium sized blocks, but for
       very small blocks it is better to use loop. For large blocks, libcall
       can do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   6,					/* scalar_stmt_cost.  */
   4,					/* scalar load_cost.  */
   4,					/* scalar_store_cost.  */
@@ -1489,11 +1543,16 @@ struct processor_costs btver1_cost = {
   /* BTVER1 has optimized REP instruction for medium sized blocks, but for
      very small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1560,11 +1619,18 @@ struct processor_costs pentium4_cost = {
   COSTS_N_INSNS (2),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (2),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (43),			/* cost of FSQRT instruction.  */
-  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+
+  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+
+  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
    {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1631,13 +1697,22 @@ struct processor_costs nocona_cost = {
   COSTS_N_INSNS (3),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (3),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (44),			/* cost of FSQRT instruction.  */
-  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+
+  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
    {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
 	      {100000, unrolled_loop}, {-1, libcall}}}},
-  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
+	      {100000, unrolled_loop}, {-1, libcall}}}}},
+
+  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
    {-1, libcall}}},
    {libcall, {{24, loop}, {64, unrolled_loop},
 	      {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {-1, libcall}}},
+   {libcall, {{24, loop}, {64, unrolled_loop},
+	      {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1704,13 +1779,21 @@ struct processor_costs atom_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{11, loop}, {-1, rep_prefix_4_byte}}},
-   {libcall, {{32, loop}, {64, rep_prefix_4_byte},
-	  {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {15, unrolled_loop},
-	  {2048, rep_prefix_4_byte}, {-1, libcall}}},
-   {libcall, {{24, loop}, {32, unrolled_loop},
-	  {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+
+  /* stringop_algs for memcpy.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
+    {libcall, {{4096, unrolled_loop}, {-1, libcall}}}},
+   {{libcall, {{-1, libcall}}},			       /* Unknown alignment.  */
+    {libcall, {{2048, unrolled_loop},
+	       {-1, libcall}}}}},
+
+  /* stringop_algs for memset.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
+    {libcall, {{4096, unrolled_loop}, {-1, libcall}}}},
+   {{libcall, {{1024, unrolled_loop},		       /* Unknown alignment.  */
+	       {-1, libcall}}},
+    {libcall, {{2048, unrolled_loop},
+	       {-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1784,10 +1867,16 @@ struct processor_costs generic64_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {DUMMY_STRINGOP_ALGS,
-   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {DUMMY_STRINGOP_ALGS,
-   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+
+  {{DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}},
+   {DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}}},
+
+  {{DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}},
+   {DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1856,10 +1945,16 @@ struct processor_costs generic32_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+  /* stringop_algs for memcpy.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  /* stringop_algs for memset.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -2537,6 +2632,8 @@ static void ix86_set_current_function (tree);
 static unsigned int ix86_minimum_incoming_stack_boundary (bool);
 
 static enum calling_abi ix86_function_abi (const_tree);
+static rtx promote_duplicated_reg (enum machine_mode, rtx);
+static rtx promote_duplicated_reg_to_size (rtx, int, int, int);
 
 \f
 #ifndef SUBTARGET32_DEFAULT_CPU
@@ -15266,6 +15363,38 @@ ix86_expand_move (enum machine_mode mode, rtx operands[])
     }
   else
     {
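+      /* On 32-bit targets with SSE2, copy DImode values between two memory
+	 locations through an XMM register, using one 64-bit load and one
+	 64-bit store instead of a pair of 32-bit moves.  */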
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE2
+	  && MEM_P (op0)
+	  && MEM_P (op1)
+	  && !push_operand (op0, mode)
+	  && can_create_pseudo_p ())
+	{
+	  rtx temp = gen_reg_rtx (V2DImode);
+	  emit_insn (gen_sse2_loadq (temp, op1));
+	  emit_insn (gen_sse_storeq (op0, temp));
+	  return;
+	}
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE
+	  && !MEM_P (op1)
+	  && GET_MODE (op1) == V2DImode)
+	{
+	  emit_insn (gen_sse_storeq (op0, op1));
+	  return;
+	}
+      if (mode == TImode
+	  && TARGET_AVX2
+	  && MEM_P (op0)
+	  && !MEM_P (op1)
+	  && GET_MODE (op1) == V4DImode)
+	{
+	  op0 = convert_to_mode (V2DImode, op0, 1);
+	  emit_insn (gen_vec_extract_lo_v4di (op0, op1));
+	  return;
+	}
       if (MEM_P (op0)
 	  && (PUSH_ROUNDING (GET_MODE_SIZE (mode)) != GET_MODE_SIZE (mode)
 	      || !push_operand (op0, mode))
@@ -20677,22 +20806,37 @@ counter_mode (rtx count_exp)
   return SImode;
 }
 
-/* When SRCPTR is non-NULL, output simple loop to move memory
+/* Helper function for expand_set_or_movmem_via_loop.
+
+   When SRCPTR is non-NULL, output simple loop to move memory
    pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
    overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
    equivalent loop to set memory by VALUE (supposed to be in MODE).
 
    The size is rounded down to whole number of chunk size moved at once.
-   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
+   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.
 
+   If ITER isn't NULL, then it is used in the generated loop without
+   initialization (which allows generating several consecutive loops with
+   the same iterator).
+   If CHANGE_PTRS is true, DESTPTR and SRCPTR are increased by the iterator
+   value at the end of the function (as if they were advanced in the loop).
+   Otherwise, their values stay unchanged.
 
-static void
-expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
-			       rtx destptr, rtx srcptr, rtx value,
-			       rtx count, enum machine_mode mode, int unroll,
-			       int expected_size)
+   If EXPECTED_SIZE isn't -1, it is used to compute branch probabilities on
+   the loop backedge.  When the expected size is unknown (i.e. -1), the
+   probability is set to 80%.
+
+   The return value is the rtx of the iterator used in the loop; it can be
+   reused in subsequent calls to this function.  */
+static rtx
+expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
+					 rtx destptr, rtx srcptr, rtx value,
+					 rtx count, rtx iter,
+					 enum machine_mode mode, int unroll,
+					 int expected_size, bool change_ptrs)
 {
-  rtx out_label, top_label, iter, tmp;
+  rtx out_label, top_label, tmp;
   enum machine_mode iter_mode = counter_mode (count);
   rtx piece_size = GEN_INT (GET_MODE_SIZE (mode) * unroll);
   rtx piece_size_mask = GEN_INT (~((GET_MODE_SIZE (mode) * unroll) - 1));
@@ -20700,10 +20844,12 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
   rtx x_addr;
   rtx y_addr;
   int i;
+  bool reuse_iter = (iter != NULL_RTX);
 
   top_label = gen_label_rtx ();
   out_label = gen_label_rtx ();
-  iter = gen_reg_rtx (iter_mode);
+  if (!reuse_iter)
+    iter = gen_reg_rtx (iter_mode);
 
   size = expand_simple_binop (iter_mode, AND, count, piece_size_mask,
 			      NULL, 1, OPTAB_DIRECT);
@@ -20714,18 +20860,21 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
 			       true, out_label);
       predict_jump (REG_BR_PROB_BASE * 10 / 100);
     }
-  emit_move_insn (iter, const0_rtx);
+  if (!reuse_iter)
+    emit_move_insn (iter, const0_rtx);
 
   emit_label (top_label);
 
   tmp = convert_modes (Pmode, iter_mode, iter, true);
   x_addr = gen_rtx_PLUS (Pmode, destptr, tmp);
-  destmem = change_address (destmem, mode, x_addr);
+  destmem =
+    adjust_automodify_address_nv (copy_rtx (destmem), mode, x_addr, 0);
 
   if (srcmem)
     {
       y_addr = gen_rtx_PLUS (Pmode, srcptr, copy_rtx (tmp));
-      srcmem = change_address (srcmem, mode, y_addr);
+      srcmem =
+	adjust_automodify_address_nv (copy_rtx (srcmem), mode, y_addr, 0);
 
       /* When unrolling for chips that reorder memory reads and writes,
 	 we can save registers by using single temporary.
@@ -20797,19 +20946,43 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
     }
   else
     predict_jump (REG_BR_PROB_BASE * 80 / 100);
-  iter = ix86_zero_extend_to_Pmode (iter);
-  tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
-			     true, OPTAB_LIB_WIDEN);
-  if (tmp != destptr)
-    emit_move_insn (destptr, tmp);
-  if (srcptr)
+  if (change_ptrs)
     {
-      tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
+      iter = ix86_zero_extend_to_Pmode (iter);
+      tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
 				 true, OPTAB_LIB_WIDEN);
-      if (tmp != srcptr)
-	emit_move_insn (srcptr, tmp);
+      if (tmp != destptr)
+	emit_move_insn (destptr, tmp);
+      if (srcptr)
+	{
+	  tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
+				     true, OPTAB_LIB_WIDEN);
+	  if (tmp != srcptr)
+	    emit_move_insn (srcptr, tmp);
+	}
     }
   emit_label (out_label);
+  return iter;
+}
+
+/* When SRCPTR is non-NULL, output simple loop to move memory
+   pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
+   overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
+   equivalent loop to set memory by VALUE (supposed to be in MODE).
+
+   The size is rounded down to whole number of chunk size moved at once.
+   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
+
+static void
+expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
+			       rtx destptr, rtx srcptr, rtx value,
+			       rtx count, enum machine_mode mode, int unroll,
+			       int expected_size)
+{
+  expand_set_or_movmem_via_loop_with_iter (destmem, srcmem,
+				 destptr, srcptr, value,
+				 count, NULL_RTX, mode, unroll,
+				 expected_size, true);
 }
 
 /* Output "rep; mov" instruction.
@@ -20913,7 +21086,18 @@ emit_strmov (rtx destmem, rtx srcmem,
   emit_insn (gen_strmov (destptr, dest, srcptr, src));
 }
 
-/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST.  */
+/* Emit strset instruction.  If RHS is a constant and a vector mode will be
+   used, move the constant to a vector register before emitting strset.  */
+static void
+emit_strset (rtx destmem, rtx value,
+	     rtx destptr, enum machine_mode mode, int offset)
+{
+  rtx dest = adjust_automodify_address_nv (destmem, mode, destptr, offset);
+  emit_insn (gen_strset (destptr, dest, value));
+}
+
+/* Output code to copy (COUNT % MAX_SIZE) bytes from SRCPTR to DESTPTR.
+   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
 static void
 expand_movmem_epilogue (rtx destmem, rtx srcmem,
 			rtx destptr, rtx srcptr, rtx count, int max_size)
@@ -20924,43 +21108,55 @@ expand_movmem_epilogue (rtx destmem, rtx srcmem,
       HOST_WIDE_INT countval = INTVAL (count);
       int offset = 0;
 
-      if ((countval & 0x10) && max_size > 16)
+      int remainder_size = countval % max_size;
+      enum machine_mode move_mode = Pmode;
+
+      /* First, try to move data with the widest possible mode.  The
+	 remaining part is moved using Pmode and narrower modes.  */
+      if (TARGET_SSE)
 	{
-	  if (TARGET_64BIT)
-	    {
-	      emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
-	      emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset + 8);
-	    }
-	  else
-	    gcc_unreachable ();
-	  offset += 16;
+	  if (max_size >= GET_MODE_SIZE (V4SImode))
+	    move_mode = V4SImode;
+	  else if (max_size >= GET_MODE_SIZE (DImode))
+	    move_mode = DImode;
 	}
-      if ((countval & 0x08) && max_size > 8)
+
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-	  if (TARGET_64BIT)
-	    emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
-	  else
-	    {
-	      emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
-	      emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset + 4);
-	    }
-	  offset += 8;
+	  emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
 	}
-      if ((countval & 0x04) && max_size > 4)
+
+      /* Move the remaining part of the epilogue - it can still be almost as
+	 large as the widest mode used above.  */
+      move_mode = Pmode;
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
+	}
+
+      if (remainder_size >= 4)
+	{
+	  emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
 	  offset += 4;
+	  remainder_size -= 4;
 	}
-      if ((countval & 0x02) && max_size > 2)
+      if (remainder_size >= 2)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
 	  offset += 2;
+	  remainder_size -= 2;
 	}
-      if ((countval & 0x01) && max_size > 1)
+      if (remainder_size >= 1)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
 	  offset += 1;
+	  remainder_size -= 1;
 	}
+      gcc_assert (remainder_size == 0);
       return;
     }
   if (max_size > 8)
@@ -21066,87 +21262,134 @@ expand_setmem_epilogue_via_loop (rtx destmem, rtx destptr, rtx value,
 				 1, max_size / 2);
 }
 
-/* Output code to set at most count & (max_size - 1) bytes starting by DEST.  */
+/* Output code to set at most (COUNT % MAX_SIZE) bytes starting at DESTPTR
+   to VALUE.
+   DESTMEM provides MEMrtx to feed proper aliasing info.
+   PROMOTED_TO_GPR_VALUE is an rtx for a GPR containing the broadcast VALUE.
+   PROMOTED_TO_VECTOR_VALUE is an rtx for a vector register containing the
+   broadcast VALUE.
+   Either of them may be NULL if the promotion hasn't been generated yet.  */
 static void
-expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_size)
+expand_setmem_epilogue (rtx destmem, rtx destptr, rtx promoted_to_vector_value,
+			rtx promoted_to_gpr_value, rtx value, rtx count,
+			int max_size)
 {
-  rtx dest;
-
   if (CONST_INT_P (count))
     {
       HOST_WIDE_INT countval = INTVAL (count);
       int offset = 0;
 
-      if ((countval & 0x10) && max_size > 16)
+      int remainder_size = countval % max_size;
+      enum machine_mode move_mode = Pmode;
+      enum machine_mode sse_mode = TARGET_64BIT ? V2DImode : V4SImode;
+      rtx promoted_value = NULL_RTX;
+
+      /* First, try to move data with the widest possible mode.  The
+	 remaining part is moved using Pmode and narrower modes.  */
+      if (TARGET_SSE)
 	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset + 8);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    gcc_unreachable ();
-	  offset += 16;
+	  if (max_size >= GET_MODE_SIZE (sse_mode))
+	    move_mode = sse_mode;
+	  else if (max_size >= GET_MODE_SIZE (DImode))
+	    move_mode = DImode;
+	  if (!promoted_to_vector_value
+	      || !VECTOR_MODE_P (GET_MODE (promoted_to_vector_value)))
+	    promoted_to_vector_value = NULL_RTX;
 	}
-      if ((countval & 0x08) && max_size > 8)
+
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    {
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset + 4);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  offset += 8;
+	  if (GET_MODE (destmem) != move_mode)
+	    destmem = adjust_automodify_address_nv (destmem, move_mode,
+						    destptr, offset);
+	  if (!promoted_to_vector_value)
+	    promoted_to_vector_value =
+	      expand_vector_broadcast_of_byte_value (move_mode, value);
+	  emit_strset (destmem, promoted_to_vector_value, destptr,
+		       move_mode, offset);
+
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
+	}
+
+      /* Move the remaining part of the epilogue - it can still be almost as
+	 large as the widest mode used above.  */
+      move_mode = Pmode;
+      promoted_value = NULL_RTX;
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
+	{
+	  if (!promoted_value)
+	    promoted_value = promote_duplicated_reg (move_mode, value);
+	  emit_strset (destmem, promoted_value, destptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
 	}
-      if ((countval & 0x04) && max_size > 4)
+
+      if (!promoted_value)
+	promoted_value = promote_duplicated_reg (move_mode, value);
+      if (remainder_size >= 4)
 	{
-	  dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
+	  emit_strset (destmem, gen_lowpart (SImode, promoted_value), destptr,
+		       SImode, offset);
 	  offset += 4;
+	  remainder_size -= 4;
 	}
-      if ((countval & 0x02) && max_size > 2)
+      if (remainder_size >= 2)
 	{
-	  dest = adjust_automodify_address_nv (destmem, HImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
-	  offset += 2;
+	  emit_strset (destmem, gen_lowpart (HImode, promoted_value), destptr,
+		       HImode, offset);
+	  offset += 2;
+	  remainder_size -= 2;
 	}
-      if ((countval & 0x01) && max_size > 1)
+      if (remainder_size >= 1)
 	{
-	  dest = adjust_automodify_address_nv (destmem, QImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
+	  emit_strset (destmem, gen_lowpart (QImode, promoted_value), destptr,
+		       QImode, offset);
 	  offset += 1;
+	  remainder_size -= 1;
 	}
+      gcc_assert (remainder_size == 0);
       return;
     }
+
+  /* COUNT isn't a compile-time constant.  */
   if (max_size > 32)
     {
-      expand_setmem_epilogue_via_loop (destmem, destptr, value, count, max_size);
+      expand_setmem_epilogue_via_loop (destmem, destptr, value, count,
+				       max_size);
       return;
     }
+
+  if (!promoted_to_gpr_value)
+    promoted_to_gpr_value = promote_duplicated_reg_to_size (value,
+						   GET_MODE_SIZE (Pmode),
+						   GET_MODE_SIZE (Pmode),
+						   GET_MODE_SIZE (Pmode));
+
   if (max_size > 16)
     {
       rtx label = ix86_expand_aligntest (count, 16, true);
-      if (TARGET_64BIT)
+      if (TARGET_SSE && promoted_to_vector_value)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem,
+				    GET_MODE (promoted_to_vector_value),
+				    destptr);
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_vector_value));
+	}
+      else if (TARGET_64BIT)
+	{
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
@@ -21154,16 +21397,22 @@ expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_
   if (max_size > 8)
     {
       rtx label = ix86_expand_aligntest (count, 8, true);
-      if (TARGET_64BIT)
+      if (TARGET_SSE && promoted_to_vector_value)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem,
+				 gen_lowpart (DImode, promoted_to_vector_value)));
+	}
+      else if (TARGET_64BIT)
+	{
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
@@ -21171,24 +21420,27 @@ expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_
   if (max_size > 4)
     {
       rtx label = ix86_expand_aligntest (count, 4, true);
-      dest = change_address (destmem, SImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
+      destmem = change_address (destmem, SImode, destptr);
+      emit_insn (gen_strset (destptr, destmem,
+			     gen_lowpart (SImode, promoted_to_gpr_value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
   if (max_size > 2)
     {
       rtx label = ix86_expand_aligntest (count, 2, true);
-      dest = change_address (destmem, HImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
+      destmem = change_address (destmem, HImode, destptr);
+      emit_insn (gen_strset (destptr, destmem,
+			     gen_lowpart (HImode, promoted_to_gpr_value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
   if (max_size > 1)
     {
       rtx label = ix86_expand_aligntest (count, 1, true);
-      dest = change_address (destmem, QImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
+      destmem = change_address (destmem, QImode, destptr);
+      emit_insn (gen_strset (destptr, destmem,
+			     gen_lowpart (QImode, promoted_to_gpr_value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
@@ -21204,8 +21456,8 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
   if (align <= 1 && desired_alignment > 1)
     {
       rtx label = ix86_expand_aligntest (destptr, 1, false);
-      srcmem = change_address (srcmem, QImode, srcptr);
-      destmem = change_address (destmem, QImode, destptr);
+      srcmem = adjust_automodify_address_nv (srcmem, QImode, srcptr, 0);
+      destmem = adjust_automodify_address_nv (destmem, QImode, destptr, 0);
       emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
       ix86_adjust_counter (count, 1);
       emit_label (label);
@@ -21214,8 +21466,8 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
   if (align <= 2 && desired_alignment > 2)
     {
       rtx label = ix86_expand_aligntest (destptr, 2, false);
-      srcmem = change_address (srcmem, HImode, srcptr);
-      destmem = change_address (destmem, HImode, destptr);
+      srcmem = adjust_automodify_address_nv (srcmem, HImode, srcptr, 0);
+      destmem = adjust_automodify_address_nv (destmem, HImode, destptr, 0);
       emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
       ix86_adjust_counter (count, 2);
       emit_label (label);
@@ -21224,14 +21476,34 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
   if (align <= 4 && desired_alignment > 4)
     {
       rtx label = ix86_expand_aligntest (destptr, 4, false);
-      srcmem = change_address (srcmem, SImode, srcptr);
-      destmem = change_address (destmem, SImode, destptr);
+      srcmem = adjust_automodify_address_nv (srcmem, SImode, srcptr, 0);
+      destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
       emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
       ix86_adjust_counter (count, 4);
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
-  gcc_assert (desired_alignment <= 8);
+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  srcmem = adjust_automodify_address_nv (srcmem, DImode, srcptr, 0);
+	  destmem = adjust_automodify_address_nv (destmem, DImode, destptr, 0);
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	}
+      else
+	{
+	  srcmem = adjust_automodify_address_nv (srcmem, SImode, srcptr, 0);
+	  destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	}
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);
 }
 
 /* Copy enough from DST to SRC to align DST known to DESIRED_ALIGN.
@@ -21286,6 +21558,37 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
       off = 4;
       emit_insn (gen_strmov (destreg, dst, srcreg, src));
     }
+  if (align_bytes & 8)
+    {
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, DImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      else
+	{
+	  dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, SImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
+	set_mem_align (dst, 8 * BITS_PER_UNIT);
+      if (src_align_bytes >= 0)
+	{
+	  unsigned int src_align = 0;
+	  if ((src_align_bytes & 7) == (align_bytes & 7))
+	    src_align = 8;
+	  else if ((src_align_bytes & 3) == (align_bytes & 3))
+	    src_align = 4;
+	  else if ((src_align_bytes & 1) == (align_bytes & 1))
+	    src_align = 2;
+	  if (MEM_ALIGN (src) < src_align * BITS_PER_UNIT)
+	    set_mem_align (src, src_align * BITS_PER_UNIT);
+	}
+      off = 8;
+    }
   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
   src = adjust_automodify_address_nv (src, BLKmode, srcreg, off);
   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
@@ -21293,7 +21596,9 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
   if (src_align_bytes >= 0)
     {
       unsigned int src_align = 0;
-      if ((src_align_bytes & 7) == (align_bytes & 7))
+      if ((src_align_bytes & 15) == (align_bytes & 15))
+	src_align = 16;
+      else if ((src_align_bytes & 7) == (align_bytes & 7))
 	src_align = 8;
       else if ((src_align_bytes & 3) == (align_bytes & 3))
 	src_align = 4;
@@ -21321,7 +21626,7 @@ expand_setmem_prologue (rtx destmem, rtx destptr, rtx value, rtx count,
   if (align <= 1 && desired_alignment > 1)
     {
       rtx label = ix86_expand_aligntest (destptr, 1, false);
-      destmem = change_address (destmem, QImode, destptr);
+      destmem = adjust_automodify_address_nv (destmem, QImode, destptr, 0);
       emit_insn (gen_strset (destptr, destmem, gen_lowpart (QImode, value)));
       ix86_adjust_counter (count, 1);
       emit_label (label);
@@ -21330,7 +21635,7 @@ expand_setmem_prologue (rtx destmem, rtx destptr, rtx value, rtx count,
   if (align <= 2 && desired_alignment > 2)
     {
       rtx label = ix86_expand_aligntest (destptr, 2, false);
-      destmem = change_address (destmem, HImode, destptr);
+      destmem = adjust_automodify_address_nv (destmem, HImode, destptr, 0);
       emit_insn (gen_strset (destptr, destmem, gen_lowpart (HImode, value)));
       ix86_adjust_counter (count, 2);
       emit_label (label);
@@ -21339,13 +21644,23 @@ expand_setmem_prologue (rtx destmem, rtx destptr, rtx value, rtx count,
   if (align <= 4 && desired_alignment > 4)
     {
       rtx label = ix86_expand_aligntest (destptr, 4, false);
-      destmem = change_address (destmem, SImode, destptr);
+      destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
       emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
       ix86_adjust_counter (count, 4);
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
-  gcc_assert (desired_alignment <= 8);
+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);
 }
 
 /* Set enough from DST to align DST known to by aligned by ALIGN to
@@ -21381,6 +21696,19 @@ expand_constant_setmem_prologue (rtx dst, rtx destreg, rtx value,
       emit_insn (gen_strset (destreg, dst,
 			     gen_lowpart (SImode, value)));
     }
+  if (align_bytes & 8)
+    {
+      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+      emit_insn (gen_strset (destreg, dst,
+	    gen_lowpart (SImode, value)));
+      off = 4;
+      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+      emit_insn (gen_strset (destreg, dst,
+	    gen_lowpart (SImode, value)));
+      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
+	set_mem_align (dst, 8 * BITS_PER_UNIT);
+      off = 4;
+    }
   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
     set_mem_align (dst, desired_align * BITS_PER_UNIT);
@@ -21392,7 +21720,7 @@ expand_constant_setmem_prologue (rtx dst, rtx destreg, rtx value,
 /* Given COUNT and EXPECTED_SIZE, decide on codegen of string operation.  */
 static enum stringop_alg
 decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
-	    int *dynamic_check)
+	    int *dynamic_check, bool align_unknown)
 {
   const struct stringop_algs * algs;
   bool optimize_for_speed;
@@ -21401,7 +21729,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
      consider such algorithms if the user has appropriated those
      registers for their own purposes.	*/
   bool rep_prefix_usable = !(fixed_regs[CX_REG] || fixed_regs[DI_REG]
-                             || (memset
+			     || (memset
 				 ? fixed_regs[AX_REG] : fixed_regs[SI_REG]));
 
 #define ALG_USABLE_P(alg) (rep_prefix_usable			\
@@ -21414,7 +21742,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
      of time processing large blocks.  */
   if (optimize_function_for_size_p (cfun)
       || (optimize_insn_for_size_p ()
-          && expected_size != -1 && expected_size < 256))
+	  && expected_size != -1 && expected_size < 256))
     optimize_for_speed = false;
   else
     optimize_for_speed = true;
@@ -21423,9 +21751,9 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
 
   *dynamic_check = -1;
   if (memset)
-    algs = &cost->memset[TARGET_64BIT != 0];
+    algs = &cost->memset[align_unknown][TARGET_64BIT != 0];
   else
-    algs = &cost->memcpy[TARGET_64BIT != 0];
+    algs = &cost->memcpy[align_unknown][TARGET_64BIT != 0];
   if (ix86_stringop_alg != no_stringop && ALG_USABLE_P (ix86_stringop_alg))
     return ix86_stringop_alg;
   /* rep; movq or rep; movl is the smallest variant.  */
@@ -21489,29 +21817,33 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
       enum stringop_alg alg;
       int i;
       bool any_alg_usable_p = true;
+      bool only_libcall_fits = true;
 
       for (i = 0; i < MAX_STRINGOP_ALGS; i++)
-        {
-          enum stringop_alg candidate = algs->size[i].alg;
-          any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
+	{
+	  enum stringop_alg candidate = algs->size[i].alg;
+	  any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
 
-          if (candidate != libcall && candidate
-              && ALG_USABLE_P (candidate))
-              max = algs->size[i].max;
-        }
+	  if (candidate != libcall && candidate
+	      && ALG_USABLE_P (candidate))
+	    {
+	      max = algs->size[i].max;
+	      only_libcall_fits = false;
+	    }
+	}
       /* If there aren't any usable algorithms, then recursing on
-         smaller sizes isn't going to find anything.  Just return the
-         simple byte-at-a-time copy loop.  */
-      if (!any_alg_usable_p)
-        {
-          /* Pick something reasonable.  */
-          if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
-            *dynamic_check = 128;
-          return loop_1_byte;
-        }
+	 smaller sizes isn't going to find anything.  Just return the
+	 simple byte-at-a-time copy loop.  */
+      if (!any_alg_usable_p || only_libcall_fits)
+	{
+	  /* Pick something reasonable.  */
+	  if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
+	    *dynamic_check = 128;
+	  return loop_1_byte;
+	}
       if (max == -1)
 	max = 4096;
-      alg = decide_alg (count, max / 2, memset, dynamic_check);
+      alg = decide_alg (count, max / 2, memset, dynamic_check, align_unknown);
       gcc_assert (*dynamic_check == -1);
       gcc_assert (alg != libcall);
       if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
@@ -21535,9 +21867,11 @@ decide_alignment (int align,
       case no_stringop:
 	gcc_unreachable ();
       case loop:
-      case unrolled_loop:
 	desired_align = GET_MODE_SIZE (Pmode);
 	break;
+      case unrolled_loop:
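+	/* The unrolled loop may use vector moves, so align the destination
+	   to the vector size when SSE is available.  */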
+	desired_align = GET_MODE_SIZE (TARGET_SSE ? V4SImode : Pmode);
+	break;
       case rep_prefix_8_byte:
 	desired_align = 8;
 	break;
@@ -21625,6 +21959,11 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
   enum stringop_alg alg;
   int dynamic_check;
   bool need_zero_guard = false;
+  bool align_unknown;
+  int unroll_factor;
+  enum machine_mode move_mode;
+  rtx loop_iter = NULL_RTX;
+  int dst_offset, src_offset;
 
   if (CONST_INT_P (align_exp))
     align = INTVAL (align_exp);
@@ -21648,9 +21987,17 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 
   /* Step 0: Decide on preferred algorithm, desired alignment and
      size of chunks to be copied by main loop.  */
-
-  alg = decide_alg (count, expected_size, false, &dynamic_check);
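+  /* The alignment is treated as unknown if the offset of either pointer
+     from the MOVE_MAX boundary cannot be determined at compile time, or if
+     the source and destination are misaligned relative to each other.  */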
+  dst_offset = get_mem_align_offset (dst, MOVE_MAX * BITS_PER_UNIT);
+  src_offset = get_mem_align_offset (src, MOVE_MAX * BITS_PER_UNIT);
+  align_unknown = (dst_offset < 0
+		   || src_offset < 0
+		   || src_offset != dst_offset);
+  alg = decide_alg (count, expected_size, false, &dynamic_check, align_unknown);
   desired_align = decide_alignment (align, alg, expected_size);
+  if (align_unknown)
+    desired_align = align;
+  unroll_factor = 1;
+  move_mode = Pmode;
 
   if (!TARGET_ALIGN_STRINGOPS)
     align = desired_align;
@@ -21669,11 +22016,16 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
       gcc_unreachable ();
     case loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode);
+      move_mode = Pmode;
+      unroll_factor = 1;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * (TARGET_64BIT ? 4 : 2);
+      /* Use SSE instructions, if possible.  */
+      move_mode = TARGET_SSE ? (align_unknown ? DImode : V4SImode) : Pmode;
+      unroll_factor = 4;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
       size_needed = 8;
@@ -21785,6 +22137,8 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 	  dst = change_address (dst, BLKmode, destreg);
 	  expand_movmem_prologue (dst, src, destreg, srcreg, count_exp, align,
 				  desired_align);
+	  set_mem_align (src, desired_align * BITS_PER_UNIT);
+	  set_mem_align (dst, desired_align * BITS_PER_UNIT);
 	}
       else
 	{
@@ -21842,11 +22196,14 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 				     count_exp, Pmode, 1, expected_size);
       break;
     case unrolled_loop:
-      /* Unroll only by factor of 2 in 32bit mode, since we don't have enough
-	 registers for 4 temporaries anyway.  */
-      expand_set_or_movmem_via_loop (dst, src, destreg, srcreg, NULL,
-				     count_exp, Pmode, TARGET_64BIT ? 4 : 2,
-				     expected_size);
+      /* In some cases we want to use the same iterator in several adjacent
+	 loops, so save the loop iterator rtx here and don't update the
+	 addresses yet.  */
+      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
+							   srcreg, NULL,
+							   count_exp, NULL_RTX,
+							   move_mode,
+							   unroll_factor,
+							   expected_size, false);
       break;
     case rep_prefix_8_byte:
       expand_movmem_via_rep_mov (dst, src, destreg, srcreg, count_exp,
@@ -21897,9 +22254,41 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
       LABEL_NUSES (label) = 1;
     }
 
+  /* We haven't updated the addresses yet, so do it now.
+     Also, if the epilogue seems to be big, generate a (non-unrolled) loop in
+     it.  We do that only when the alignment is unknown, because in that case
+     the epilogue would otherwise have to copy byte by byte, which is very
+     slow.  */
+  if (alg == unrolled_loop)
+    {
+      rtx tmp;
+      if (align_unknown && unroll_factor > 1)
+	{
+	  /* Reduce the epilogue's size by emitting a non-unrolled loop.
+	     Without it the epilogue could get very big - when the alignment
+	     is statically unknown, it would copy byte by byte, which may be
+	     very slow.  */
+	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
+	      srcreg, NULL, count_exp,
+	      loop_iter, move_mode, 1,
+	      expected_size, false);
+	  src = change_address (src, BLKmode, srcreg);
+	  dst = change_address (dst, BLKmode, destreg);
+	  epilogue_size_needed = GET_MODE_SIZE (move_mode);
+	}
+      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != destreg)
+	emit_move_insn (destreg, tmp);
+
+      tmp = expand_simple_binop (Pmode, PLUS, srcreg, loop_iter, srcreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != srcreg)
+	emit_move_insn (srcreg, tmp);
+    }
   if (count_exp != const0_rtx && epilogue_size_needed > 1)
     expand_movmem_epilogue (dst, src, destreg, srcreg, count_exp,
 			    epilogue_size_needed);
+
   if (jump_around_label)
     emit_label (jump_around_label);
   return true;
@@ -21917,7 +22306,37 @@ promote_duplicated_reg (enum machine_mode mode, rtx val)
   rtx tmp;
   int nops = mode == DImode ? 3 : 2;
 
+  if (VECTOR_MODE_P (mode))
+    {
+      enum machine_mode inner = GET_MODE_INNER (mode);
+      rtx promoted_val, vec_reg;
+      if (CONST_INT_P (val))
+	return ix86_build_const_vector (mode, true, val);
+
+      promoted_val = promote_duplicated_reg (inner, val);
+      vec_reg = gen_reg_rtx (mode);
+      switch (mode)
+	{
+	case V2DImode:
+	  emit_insn (gen_vec_dupv2di (vec_reg, promoted_val));
+	  break;
+	case V4SImode:
+	  emit_insn (gen_vec_dupv4si (vec_reg, promoted_val));
+	  break;
+	default:
+	  gcc_unreachable ();
+	  break;
+	}
+
+      return vec_reg;
+    }
   gcc_assert (mode == SImode || mode == DImode);
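+  /* Without 64-bit GPRs, build the DImode broadcast as V4SImode and convert
+     the result to V2DImode.  */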
+  if (mode == DImode && !TARGET_64BIT)
+    {
+      rtx vec_reg = promote_duplicated_reg (V4SImode, val);
+      vec_reg = convert_to_mode (V2DImode, vec_reg, 1);
+      return vec_reg;
+    }
   if (val == const0_rtx)
     return copy_to_mode_reg (mode, const0_rtx);
   if (CONST_INT_P (val))
@@ -21983,11 +22402,21 @@ promote_duplicated_reg (enum machine_mode mode, rtx val)
 static rtx
 promote_duplicated_reg_to_size (rtx val, int size_needed, int desired_align, int align)
 {
-  rtx promoted_val;
+  rtx promoted_val = NULL_RTX;
 
-  if (TARGET_64BIT
-      && (size_needed > 4 || (desired_align > align && desired_align > 4)))
-    promoted_val = promote_duplicated_reg (DImode, val);
+  if (size_needed > 8 || (desired_align > align && desired_align > 8))
+    {
+      gcc_assert (TARGET_SSE);
+      if (TARGET_64BIT)
+	promoted_val = promote_duplicated_reg (V2DImode, val);
+      else
+	promoted_val = promote_duplicated_reg (V4SImode, val);
+    }
+  else if (size_needed > 4 || (desired_align > align && desired_align > 4))
+    {
+      gcc_assert (TARGET_64BIT || TARGET_SSE);
+      promoted_val = promote_duplicated_reg (DImode, val);
+    }
   else if (size_needed > 2 || (desired_align > align && desired_align > 2))
     promoted_val = promote_duplicated_reg (SImode, val);
   else if (size_needed > 1 || (desired_align > align && desired_align > 1))
@@ -22015,10 +22444,14 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
   int size_needed = 0, epilogue_size_needed;
   int desired_align = 0, align_bytes = 0;
   enum stringop_alg alg;
-  rtx promoted_val = NULL;
-  bool force_loopy_epilogue = false;
+  rtx gpr_promoted_val = NULL;
+  rtx vec_promoted_val = NULL;
   int dynamic_check;
   bool need_zero_guard = false;
+  bool align_unknown;
+  unsigned int unroll_factor;
+  enum machine_mode move_mode;
+  rtx loop_iter = NULL_RTX;
 
   if (CONST_INT_P (align_exp))
     align = INTVAL (align_exp);
@@ -22038,8 +22471,11 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
   /* Step 0: Decide on preferred algorithm, desired alignment and
      size of chunks to be copied by main loop.  */
 
-  alg = decide_alg (count, expected_size, true, &dynamic_check);
+  align_unknown = get_mem_align_offset (dst, BITS_PER_UNIT) < 0;
+  alg = decide_alg (count, expected_size, true, &dynamic_check, align_unknown);
   desired_align = decide_alignment (align, alg, expected_size);
+  unroll_factor = 1;
+  move_mode = Pmode;
 
   if (!TARGET_ALIGN_STRINGOPS)
     align = desired_align;
@@ -22057,11 +22493,21 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       gcc_unreachable ();
     case loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode);
+      move_mode = Pmode;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * 4;
+      /* Use SSE instructions, if possible.  */
+      move_mode = TARGET_SSE
+		  ? (TARGET_64BIT ? V2DImode : V4SImode)
+		  : Pmode;
+      unroll_factor = 1;
+      /* Select maximal available 1,2 or 4 unroll factor.  */
+      while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
+	     && unroll_factor < 4)
+	unroll_factor *= 2;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
       size_needed = 8;
@@ -22106,8 +22552,10 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
      main loop and epilogue (ie one load of the big constant in the
      front of all code.  */
   if (CONST_INT_P (val_exp))
-    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
-						   desired_align, align);
+    gpr_promoted_val = promote_duplicated_reg_to_size (val_exp,
+						   GET_MODE_SIZE (Pmode),
+						   GET_MODE_SIZE (Pmode),
+						   align);
   /* Ensure that alignment prologue won't copy past end of block.  */
   if (size_needed > 1 || (desired_align > 1 && desired_align > align))
     {
@@ -22116,12 +22564,6 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 	 Make sure it is power of 2.  */
       epilogue_size_needed = smallest_pow2_greater_than (epilogue_size_needed);
 
-      /* To improve performance of small blocks, we jump around the VAL
-	 promoting mode.  This mean that if the promoted VAL is not constant,
-	 we might not use it in the epilogue and have to use byte
-	 loop variant.  */
-      if (epilogue_size_needed > 2 && !promoted_val)
-        force_loopy_epilogue = true;
       if (count)
 	{
 	  if (count < (unsigned HOST_WIDE_INT)epilogue_size_needed)
@@ -22161,9 +22603,11 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
   /* Step 2: Alignment prologue.  */
 
   /* Do the expensive promotion once we branched off the small blocks.  */
-  if (!promoted_val)
-    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
-						   desired_align, align);
+  if (!gpr_promoted_val)
+    gpr_promoted_val = promote_duplicated_reg_to_size (val_exp,
+						   GET_MODE_SIZE (Pmode),
+						   GET_MODE_SIZE (Pmode),
+						   align);
   gcc_assert (desired_align >= 1 && align >= 1);
 
   if (desired_align > align)
@@ -22175,17 +22619,20 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 	     the pain to maintain it for the first move, so throw away
 	     the info early.  */
 	  dst = change_address (dst, BLKmode, destreg);
-	  expand_setmem_prologue (dst, destreg, promoted_val, count_exp, align,
+	  expand_setmem_prologue (dst, destreg, gpr_promoted_val, count_exp, align,
 				  desired_align);
+	  set_mem_align (dst, desired_align*BITS_PER_UNIT);
 	}
       else
 	{
 	  /* If we know how many bytes need to be stored before dst is
 	     sufficiently aligned, maintain aliasing info accurately.  */
-	  dst = expand_constant_setmem_prologue (dst, destreg, promoted_val,
+	  dst = expand_constant_setmem_prologue (dst, destreg, gpr_promoted_val,
 						 desired_align, align_bytes);
 	  count_exp = plus_constant (count_exp, -align_bytes);
 	  count -= align_bytes;
+	  if (count < (unsigned HOST_WIDE_INT) size_needed)
+	    goto epilogue;
 	}
       if (need_zero_guard
 	  && (count < (unsigned HOST_WIDE_INT) size_needed
@@ -22213,7 +22660,7 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       emit_label (label);
       LABEL_NUSES (label) = 1;
       label = NULL;
-      promoted_val = val_exp;
+      gpr_promoted_val = val_exp;
       epilogue_size_needed = 1;
     }
   else if (label == NULL_RTX)
@@ -22227,27 +22674,34 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
     case no_stringop:
       gcc_unreachable ();
     case loop_1_byte:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
+      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, val_exp,
 				     count_exp, QImode, 1, expected_size);
       break;
     case loop:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
+      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, gpr_promoted_val,
 				     count_exp, Pmode, 1, expected_size);
       break;
     case unrolled_loop:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
-				     count_exp, Pmode, 4, expected_size);
+      vec_promoted_val =
+	promote_duplicated_reg_to_size (gpr_promoted_val,
+					GET_MODE_SIZE (move_mode),
+					desired_align, align);
+      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+				     NULL, vec_promoted_val, count_exp,
+				     NULL_RTX, move_mode, unroll_factor,
+				     expected_size, false);
       break;
     case rep_prefix_8_byte:
-      expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
+      gcc_assert (TARGET_64BIT);
+      expand_setmem_via_rep_stos (dst, destreg, gpr_promoted_val, count_exp,
 				  DImode, val_exp);
       break;
     case rep_prefix_4_byte:
-      expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
+      expand_setmem_via_rep_stos (dst, destreg, gpr_promoted_val, count_exp,
 				  SImode, val_exp);
       break;
     case rep_prefix_1_byte:
-      expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
+      expand_setmem_via_rep_stos (dst, destreg, gpr_promoted_val, count_exp,
 				  QImode, val_exp);
       break;
     }
@@ -22280,15 +22734,29 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       LABEL_NUSES (label) = 1;
     }
  epilogue:
-  if (count_exp != const0_rtx && epilogue_size_needed > 1)
+  if (alg == unrolled_loop)
     {
-      if (force_loopy_epilogue)
-	expand_setmem_epilogue_via_loop (dst, destreg, val_exp, count_exp,
-					 epilogue_size_needed);
-      else
-	expand_setmem_epilogue (dst, destreg, promoted_val, count_exp,
-				epilogue_size_needed);
+      rtx tmp;
+      if (align_unknown && unroll_factor > 1)
+	{
+	  /* Reduce the epilogue's size by emitting a non-unrolled loop.  If we
+	     don't, the epilogue could be very big: with statically unknown
+	     alignment it would copy byte by byte, which may be very slow.  */
+	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+	      NULL, vec_promoted_val, count_exp,
+	      loop_iter, move_mode, 1,
+	      expected_size, false);
+	  dst = change_address (dst, BLKmode, destreg);
+	  epilogue_size_needed = GET_MODE_SIZE (move_mode);
+	}
+      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != destreg)
+	emit_move_insn (destreg, tmp);
     }
+  if (count_exp != const0_rtx && epilogue_size_needed > 1)
+    expand_setmem_epilogue (dst, destreg, vec_promoted_val, gpr_promoted_val,
+			    val_exp, count_exp, epilogue_size_needed);
   if (jump_around_label)
     emit_label (jump_around_label);
   return true;
@@ -37436,6 +37904,33 @@ ix86_autovectorize_vector_sizes (void)
   return (TARGET_AVX && !TARGET_PREFER_AVX128) ? 32 | 16 : 0;
 }
 
+/* Target hook.  Prevent unaligned access to data in vector modes.  */
+
+static bool
+ix86_slow_unaligned_access (enum machine_mode mode,
+			    unsigned int align)
+{
+  if (TARGET_AVX)
+    {
+      if (GET_MODE_SIZE (mode) == 32)
+	{
+	  if (align <= 16)
+	    return (TARGET_AVX256_SPLIT_UNALIGNED_LOAD
+		    || TARGET_AVX256_SPLIT_UNALIGNED_STORE);
+	  else
+	    return false;
+	}
+    }
+
+  if (GET_MODE_SIZE (mode) > 8)
+    {
+      return (! TARGET_SSE_UNALIGNED_LOAD_OPTIMAL
+	      && ! TARGET_SSE_UNALIGNED_STORE_OPTIMAL);
+    }
+
+  return false;
+}
+
 /* Initialize the GCC target structure.  */
 #undef TARGET_RETURN_IN_MEMORY
 #define TARGET_RETURN_IN_MEMORY ix86_return_in_memory
@@ -37743,6 +38238,9 @@ ix86_autovectorize_vector_sizes (void)
 #undef TARGET_CONDITIONAL_REGISTER_USAGE
 #define TARGET_CONDITIONAL_REGISTER_USAGE ix86_conditional_register_usage
 
+#undef TARGET_SLOW_UNALIGNED_ACCESS
+#define TARGET_SLOW_UNALIGNED_ACCESS ix86_slow_unaligned_access
+
 #if TARGET_MACHO
 #undef TARGET_INIT_LIBFUNCS
 #define TARGET_INIT_LIBFUNCS darwin_rename_builtins
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index bd69ec2..550b2ab 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -159,8 +159,12 @@ struct processor_costs {
   const int fchs;		/* cost of FCHS instruction.  */
   const int fsqrt;		/* cost of FSQRT instruction.  */
 				/* Specify what algorithm
-				   to use for stringops on unknown size.  */
-  struct stringop_algs memcpy[2], memset[2];
+				   to use for stringops on unknown size.
+				   First index is used to specify whether
+				   alignment is known or not.
+				   Second - to specify whether 32 or 64 bits
+				   are used.  */
+  struct stringop_algs memcpy[2][2], memset[2][2];
   const int scalar_stmt_cost;   /* Cost of any scalar operation, excluding
 				   load and store.  */
   const int scalar_load_cost;   /* Cost of scalar load.  */
@@ -1712,7 +1716,7 @@ typedef struct ix86_args {
 /* If a clear memory operation would take CLEAR_RATIO or more simple
    move-instruction sequences, we will do a clrmem or libcall instead.  */
 
-#define CLEAR_RATIO(speed) ((speed) ? MIN (6, ix86_cost->move_ratio) : 2)
+#define CLEAR_RATIO(speed) ((speed) ? ix86_cost->move_ratio : 2)
 
 /* Define if shifts truncate the shift count which implies one can
    omit a sign-extension or zero-extension of a shift count.
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 9c9508d..bd38e48 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -15905,6 +15905,17 @@
 	      (clobber (reg:CC FLAGS_REG))])]
   ""
 {
+  rtx vec_reg;
+  enum machine_mode mode = GET_MODE (operands[2]);
+  if (vector_extensions_used_for_mode (mode)
+      && CONSTANT_P (operands[2]))
+    {
+      if (mode == DImode)
+	mode = TARGET_64BIT ? V2DImode : V4SImode;
+      vec_reg = gen_reg_rtx (mode);
+      emit_move_insn (vec_reg, operands[2]);
+      operands[2] = vec_reg;
+    }
   if (GET_MODE (operands[1]) != GET_MODE (operands[2]))
     operands[1] = adjust_address_nv (operands[1], GET_MODE (operands[2]), 0);
 
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index ff77003..b8ecc59 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -7426,6 +7426,13 @@
    (set_attr "prefix" "maybe_vex,maybe_vex,orig,orig,vex")
    (set_attr "mode" "TI,TI,V4SF,SF,SF")])
 
+(define_expand "sse2_loadq"
+ [(set (match_operand:V2DI 0 "register_operand")
+       (vec_concat:V2DI
+	 (match_operand:DI 1 "memory_operand")
+	 (const_int 0)))]
+  "!TARGET_64BIT && TARGET_SSE2")
+
 (define_insn_and_split "sse2_stored"
   [(set (match_operand:SI 0 "nonimmediate_operand" "=xm,r")
 	(vec_select:SI
@@ -7537,6 +7544,16 @@
    (set_attr "prefix" "maybe_vex,orig,vex,maybe_vex,orig,orig")
    (set_attr "mode" "V2SF,TI,TI,TI,V4SF,V2SF")])
 
+(define_expand "vec_dupv4si"
+  [(set (match_operand:V4SI 0 "register_operand" "")
+	(vec_duplicate:V4SI
+	  (match_operand:SI 1 "nonimmediate_operand" "")))]
+  "TARGET_SSE"
+{
+  if (!TARGET_AVX)
+    operands[1] = force_reg (V4SImode, operands[1]);
+})
+
 (define_insn "*vec_dupv4si_avx"
   [(set (match_operand:V4SI 0 "register_operand"     "=x,x")
 	(vec_duplicate:V4SI
@@ -7578,6 +7595,16 @@
    (set_attr "prefix" "orig,vex,maybe_vex")
    (set_attr "mode" "TI,TI,DF")])
 
+(define_expand "vec_dupv2di"
+  [(set (match_operand:V2DI 0 "register_operand" "")
+	(vec_duplicate:V2DI
+	  (match_operand:DI 1 "nonimmediate_operand" "")))]
+  "TARGET_SSE"
+{
+  if (!TARGET_AVX)
+    operands[1] = force_reg (V2DImode, operands[1]);
+})
+
 (define_insn "*vec_dupv2di"
   [(set (match_operand:V2DI 0 "register_operand" "=x,x")
 	(vec_duplicate:V2DI
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 90cef1c..844ed17 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5780,6 +5780,25 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum machine_mode @var{mode}, unsigned int @var{align})
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options require building faster code, the hook can
+be used to prevent access to unaligned data in some set of modes even if the
+processor can do the access without trapping.
+
+By default the hook returns the value of the @code{SLOW_UNALIGNED_ACCESS}
+macro if it is defined, and @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6252,23 +6271,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 187122e..c32e745 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -5718,6 +5718,25 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@hook TARGET_SLOW_UNALIGNED_ACCESS
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options require building faster code, the hook can
+be used to prevent access to unaligned data in some set of modes even if the
+processor can do the access without trapping.
+
+By default the hook returns the value of the @code{SLOW_UNALIGNED_ACCESS}
+macro if it is defined, and @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6190,23 +6209,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/expr.c b/gcc/expr.c
index b020978..5c5002c 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -811,7 +811,7 @@ alignment_for_piecewise_move (unsigned int max_pieces, unsigned int align)
 	   tmode != VOIDmode;
 	   xmode = tmode, tmode = GET_MODE_WIDER_MODE (tmode))
 	if (GET_MODE_SIZE (tmode) > max_pieces
-	    || SLOW_UNALIGNED_ACCESS (tmode, align))
+	    || targetm.slow_unaligned_access (tmode, align))
 	  break;
 
       align = MAX (align, GET_MODE_ALIGNMENT (xmode));
@@ -836,6 +836,48 @@ widest_int_mode_for_size (unsigned int size)
   return mode;
 }
 
+/* If MODE is a scalar mode, find the corresponding preferred vector mode.
+   If no such mode can be found, return the vector mode corresponding to Pmode
+   (a kind of default vector mode).
+   For vector modes return the mode itself.  */
+
+static enum machine_mode
+vector_mode_for_mode (enum machine_mode mode)
+{
+  enum machine_mode xmode;
+  if (VECTOR_MODE_P (mode))
+    return mode;
+  xmode = targetm.vectorize.preferred_simd_mode (mode);
+  if (VECTOR_MODE_P (xmode))
+    return xmode;
+
+  return targetm.vectorize.preferred_simd_mode (Pmode);
+}
+
+/* Check whether vector instructions are required for operating on the
+   specified mode.
+   For vector modes, check whether the corresponding vector extension is
+   supported.
+   Operations on a scalar mode will use vector extensions if this scalar
+   mode is wider than the default scalar mode (Pmode) and the vector
+   extension for the parent vector mode is available.  */
+
+bool vector_extensions_used_for_mode (enum machine_mode mode)
+{
+  enum machine_mode vector_mode = vector_mode_for_mode (mode);
+
+  if (VECTOR_MODE_P (mode))
+    return targetm.vector_mode_supported_p (mode);
+
+  /* mode is a scalar mode.  */
+  if (VECTOR_MODE_P (vector_mode)
+     && targetm.vector_mode_supported_p (vector_mode)
+     && (GET_MODE_SIZE (mode) > GET_MODE_SIZE (Pmode)))
+    return true;
+
+  return false;
+}
+
 /* STORE_MAX_PIECES is the number of bytes at a time that we can
    store efficiently.  Due to internal GCC limitations, this is
    MOVE_MAX_PIECES limited by the number of bytes GCC can represent
@@ -1680,7 +1722,7 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src, tree type, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (src)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (src))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (src))
 	      || MEM_ALIGN (src) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -2070,7 +2112,7 @@ emit_group_store (rtx orig_dst, rtx src, tree type ATTRIBUTE_UNUSED, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (dest)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (dest))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (dest))
 	      || MEM_ALIGN (dest) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -4034,7 +4076,7 @@ emit_push_insn (rtx x, enum machine_mode mode, tree type, rtx size,
 	  /* Here we avoid the case of a structure whose weak alignment
 	     forces many pushes of a small amount of data,
 	     and such small pushes do rounding that causes trouble.  */
-	  && ((! SLOW_UNALIGNED_ACCESS (word_mode, align))
+	  && ((! targetm.slow_unaligned_access (word_mode, align))
 	      || align >= BIGGEST_ALIGNMENT
 	      || (PUSH_ROUNDING (align / BITS_PER_UNIT)
 		  == (align / BITS_PER_UNIT)))
@@ -6325,7 +6367,7 @@ store_field (rtx target, HOST_WIDE_INT bitsize, HOST_WIDE_INT bitpos,
       || (mode != BLKmode
 	  && ((((MEM_ALIGN (target) < GET_MODE_ALIGNMENT (mode))
 		|| bitpos % GET_MODE_ALIGNMENT (mode))
-	       && SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (target)))
+	       && targetm.slow_unaligned_access (mode, MEM_ALIGN (target)))
 	      || (bitpos % BITS_PER_UNIT != 0)))
       /* If the RHS and field are a constant size and the size of the
 	 RHS isn't the same size as the bitfield, we must use bitfield
@@ -9738,7 +9780,7 @@ expand_expr_real_1 (tree exp, rtx target, enum machine_mode tmode,
 		     && ((modifier == EXPAND_CONST_ADDRESS
 			  || modifier == EXPAND_INITIALIZER)
 			 ? STRICT_ALIGNMENT
-			 : SLOW_UNALIGNED_ACCESS (mode1, MEM_ALIGN (op0))))
+			 : targetm.slow_unaligned_access (mode1, MEM_ALIGN (op0))))
 		    || (bitpos % BITS_PER_UNIT != 0)))
 	    /* If the type and the field are a constant size and the
 	       size of the type isn't the same size as the bitfield,
diff --git a/gcc/optabs.c b/gcc/optabs.c
index a373d7a..f42bd9e 100644
--- a/gcc/optabs.c
+++ b/gcc/optabs.c
@@ -770,6 +770,47 @@ expand_vector_broadcast (enum machine_mode vmode, rtx op)
   return ret;
 }
 
+/* Create a new vector value in VMODE with all bytes set to VAL.  The
+   mode of VAL must be QImode or it should be a constant.
+   If VAL is a constant, then the return value will be a constant.  */
+
+extern rtx
+expand_vector_broadcast_of_byte_value (enum machine_mode vmode, rtx val)
+{
+  enum insn_code icode;
+  rtvec vec;
+  rtx ret;
+  int i, n;
+
+  enum machine_mode byte_vmode;
+
+  gcc_checking_assert (VECTOR_MODE_P (vmode));
+  gcc_assert (CONSTANT_P (val) || GET_MODE (val) == QImode);
+  byte_vmode = mode_for_vector (QImode, GET_MODE_SIZE (vmode));
+
+  n = GET_MODE_NUNITS (byte_vmode);
+  vec = rtvec_alloc (n);
+  for (i = 0; i < n; ++i)
+    RTVEC_ELT (vec, i) = val;
+
+  if (CONSTANT_P (val))
+    ret = gen_rtx_CONST_VECTOR (byte_vmode, vec);
+  else
+    {
+      icode = optab_handler (vec_init_optab, byte_vmode);
+      if (icode == CODE_FOR_nothing)
+	return NULL;
+
+      ret = gen_reg_rtx (byte_vmode);
+      emit_insn (GEN_FCN (icode) (ret, gen_rtx_PARALLEL (byte_vmode, vec)));
+    }
+
+  if (vmode != byte_vmode)
+    ret = convert_to_mode (vmode, ret, 1);
+
+  return ret;
+}
+
 /* This subroutine of expand_doubleword_shift handles the cases in which
    the effective shift value is >= BITS_PER_WORD.  The arguments and return
    value are the same as for the parent routine, except that SUPERWORD_OP1
diff --git a/gcc/optabs.h b/gcc/optabs.h
index 926d21f..dca1742 100644
--- a/gcc/optabs.h
+++ b/gcc/optabs.h
@@ -1147,4 +1147,5 @@ extern void expand_jump_insn (enum insn_code icode, unsigned int nops,
 extern rtx prepare_operand (enum insn_code, rtx, int, enum machine_mode,
 			    enum machine_mode, int);
 
+extern rtx expand_vector_broadcast_of_byte_value (enum machine_mode, rtx);
 #endif /* GCC_OPTABS_H */
diff --git a/gcc/rtl.h b/gcc/rtl.h
index f13485e..4ec67c7 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -2513,6 +2513,9 @@ extern void emit_jump (rtx);
 /* In expr.c */
 extern rtx move_by_pieces (rtx, rtx, unsigned HOST_WIDE_INT,
 			   unsigned int, int);
+/* Check if vector instructions are required for operating with mode
+   specified.  */
+bool vector_extensions_used_for_mode (enum machine_mode);
 extern HOST_WIDE_INT find_args_size_adjust (rtx);
 extern int fixup_args_size_notes (rtx, rtx, int);
 
diff --git a/gcc/target.def b/gcc/target.def
index c3bec0e..76cf291 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1498,6 +1498,14 @@ DEFHOOK
  bool, (struct ao_ref_s *ref),
  default_ref_may_alias_errno)
 
+/* True if access to unaligned data in given mode is too slow or
+   prohibited.  */
+DEFHOOK
+(slow_unaligned_access,
+ "",
+ bool, (enum machine_mode mode, unsigned int align),
+ default_slow_unaligned_access)
+
 /* Support for named address spaces.  */
 #undef HOOK_PREFIX
 #define HOOK_PREFIX "TARGET_ADDR_SPACE_"
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index 81fd12f..e70ecba 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -1442,4 +1442,15 @@ default_pch_valid_p (const void *data_p, size_t len)
   return NULL;
 }
 
+bool
+default_slow_unaligned_access (enum machine_mode mode ATTRIBUTE_UNUSED,
+			       unsigned int align ATTRIBUTE_UNUSED)
+{
+#ifdef SLOW_UNALIGNED_ACCESS
+  return SLOW_UNALIGNED_ACCESS (mode, align);
+#else
+  return STRICT_ALIGNMENT;
+#endif
+}
+
 #include "gt-targhooks.h"
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index f19fb50..ace8686 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -175,3 +175,5 @@ extern enum machine_mode default_get_reg_raw_mode(int);
 
 extern void *default_get_pch_validity (size_t *);
 extern const char *default_pch_valid_p (const void *, size_t);
+extern bool default_slow_unaligned_access (enum machine_mode mode,
+					   unsigned int align);
diff --git a/gcc/testsuite/gcc.target/i386/sw-1.c b/gcc/testsuite/gcc.target/i386/sw-1.c
index 483d117..e3d3b91 100644
--- a/gcc/testsuite/gcc.target/i386/sw-1.c
+++ b/gcc/testsuite/gcc.target/i386/sw-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fshrink-wrap -fdump-rtl-pro_and_epilogue" } */
+/* { dg-options "-O2 -fshrink-wrap -fdump-rtl-pro_and_epilogue -mstringop-strategy=rep_byte" } */
 
 #include <string.h>
 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-11-01 17:36                                   ` Michael Zolotukhin
  2011-11-01 17:48                                     ` Michael Zolotukhin
@ 2011-11-02 19:12                                     ` Jan Hubicka
  2011-11-02 19:37                                       ` Michael Zolotukhin
  2011-11-02 19:55                                     ` Jan Hubicka
  2 siblings, 1 reply; 52+ messages in thread
From: Jan Hubicka @ 2011-11-02 19:12 UTC (permalink / raw)
  To: Michael Zolotukhin
  Cc: Jan Hubicka, Richard Henderson, Jakub Jelinek, Jack Howarth,
	gcc-patches, Richard Guenther, H.J. Lu, izamyatin,
	areg.melikadamyan, vmakarov

Hi,
I am going to benchmark the following hunk separately tonight.  It is an
independent change.

Rth, Vladimir: there are obviously several options for how to make GCC use SSE
for 64bit loads/stores in 32bit codegen (and 128bit loads/stores in 64bit
codegen).  What do you think is the best variant here?

(An alternative would be to make the move patterns prefer the SSE variant in
this case, or to change the RA order to iterate through SSE registers first,
but at least with pre-IRA this used to lead to bad decisions, making the RA
place a value in SSE despite the fact that it is used in arithmetic that can't
be done with SSE.)
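
For context, the case this hunk targets is a DImode (64-bit) memory-to-memory
copy in 32-bit code; a minimal example (illustrative only - the function name
and flags are mine, not part of the patch):

  void
  copy64 (unsigned long long *dst, const unsigned long long *src)
  {
    /* A DImode move; with -m32 -msse2 the hunk below can expand it as an
       SSE movq load/store pair instead of two 32-bit integer moves.  */
    *dst = *src;
  }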

Honza

@@ -15266,6 +15363,38 @@ ix86_expand_move (enum machine_mode mode, rtx operands[])
     }
   else
     {
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE2
+	  && MEM_P (op0)
+	  && MEM_P (op1)
+	  && !push_operand (op0, mode)
+	  && can_create_pseudo_p ())
+	{
+	  rtx temp = gen_reg_rtx (V2DImode);
+	  emit_insn (gen_sse2_loadq (temp, op1));
+	  emit_insn (gen_sse_storeq (op0, temp));
+	  return;
+	}
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE
+	  && !MEM_P (op1)
+	  && GET_MODE (op1) == V2DImode)
+	{
+	  emit_insn (gen_sse_storeq (op0, op1));
+	  return;
+	}
+      if (mode == TImode
+	  && TARGET_AVX2
+	  && MEM_P (op0)
+	  && !MEM_P (op1)
+	  && GET_MODE (op1) == V4DImode)
+	{
+	  op0 = convert_to_mode (V2DImode, op0, 1);
+	  emit_insn (gen_vec_extract_lo_v4di (op0, op1));
+	  return;
+	}
       if (MEM_P (op0)
 	  && (PUSH_ROUNDING (GET_MODE_SIZE (mode)) != GET_MODE_SIZE (mode)
 	      || !push_operand (op0, mode))

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-11-02 19:12                                     ` Jan Hubicka
@ 2011-11-02 19:37                                       ` Michael Zolotukhin
  0 siblings, 0 replies; 52+ messages in thread
From: Michael Zolotukhin @ 2011-11-02 19:37 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Richard Henderson, Jakub Jelinek, Jack Howarth, gcc-patches,
	Richard Guenther, H.J. Lu, izamyatin, areg.melikadamyan,
	vmakarov

> I am going to benchmark the following hunk separately tonight. It is
> independent change.

You would probably need some changes from sse.md (for gen_sse2_loadq).

Michael

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-11-01 17:36                                   ` Michael Zolotukhin
  2011-11-01 17:48                                     ` Michael Zolotukhin
  2011-11-02 19:12                                     ` Jan Hubicka
@ 2011-11-02 19:55                                     ` Jan Hubicka
  2011-11-03 12:56                                       ` Michael Zolotukhin
  2 siblings, 1 reply; 52+ messages in thread
From: Jan Hubicka @ 2011-11-02 19:55 UTC (permalink / raw)
  To: Michael Zolotukhin
  Cc: Jan Hubicka, Richard Henderson, Jakub Jelinek, Jack Howarth,
	gcc-patches, Richard Guenther, H.J. Lu, izamyatin,
	areg.melikadamyan

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 2c53423..6ce240a 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
   COSTS_N_BYTES (2),			/* cost of FABS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FCHS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},

I do think we will want sse_unrolled_loop (and perhaps sse_loop, I don't know)
algorithms added to enable the SSE codegen: there are chips that do support SSE
but internally split the wide moves into word-sized chunks, so the benefits
are minimal and the setup costs won't pay back.  So we do not want to default to
SSE codegen whenever possible, just when it is supposed to pay back.

I wonder what we will want to do then with -mno-sse (e.g. for the Linux kernel,
where one cannot implicitly use SSE).  Probably making the
sse_unrolled_loop->unrolled_loop transition is easiest for this quite rare case,
even if some other algorithm may turn out to be superior.
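
For illustration, a minimal sketch of that fallback, assuming a hypothetical
sse_unrolled_loop enumerator in enum stringop_alg (which this patch does not
add); it would sit near the end of the algorithm-selection code in decide_alg:

  /* Hypothetical fallback: the cost tables asked for SSE moves, but SSE is
     disabled (e.g. -mno-sse for kernel code), so degrade to the plain
     integer unrolled loop.  */
  if (alg == sse_unrolled_loop && !TARGET_SSE)
    alg = unrolled_loop;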

+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);

No vector move because the promoted vector value is not computed yet?  (It would
make sense to bypass it to keep the hot path for small blocks SSE-free.)
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * (TARGET_64BIT ? 4 : 2);
+      /* Use SSE instructions, if possible.  */
+      move_mode = TARGET_SSE ? (align_unknown ? DImode : V4SImode) : Pmode;
+      unroll_factor = 4;

Where did the logic dropping the unroll factor to 2 for 32bit integer loops go?
This is important; otherwise we starve the RA.
@@ -21897,9 +22254,41 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
       LABEL_NUSES (label) = 1;
     }
 
+  /* We haven't updated addresses, so we'll do it now.
+     Also, if the epilogue seems to be big, we'll generate a loop (not
+     unrolled) in it.  We'll do it only if alignment is unknown, because in
+     this case in epilogue we have to perform memmove by bytes, which is very
+     slow.  */

The unrolled epilogue does at most one byte-wide move, while the rolled one does
at most 4*16.  Given that the odds are that the blocks are small, are you sure
this is not causing performance problems?

@@ -21983,11 +22402,21 @@ promote_duplicated_reg (enum machine_mode mode, rtx val)
 static rtx
 promote_duplicated_reg_to_size (rtx val, int size_needed, int desired_align, int align)
 {
-  rtx promoted_val;
+  rtx promoted_val = NULL_RTX;
 
-  if (TARGET_64BIT
-      && (size_needed > 4 || (desired_align > align && desired_align > 4)))
-    promoted_val = promote_duplicated_reg (DImode, val);
+  if (size_needed > 8 || (desired_align > align && desired_align > 8))
+    {
+      gcc_assert (TARGET_SSE);
+      if (TARGET_64BIT)
+	promoted_val = promote_duplicated_reg (V2DImode, val);
+      else
+	promoted_val = promote_duplicated_reg (V4SImode, val);

Hmm, it would seem more natural to turn this into V8QImode since it is really a
vector of 4 duplicated bytes.  This will avoid some of the TARGET_64BIT tests
below.  Also, AMD chips are very slow on integer->SSE moves.  How does the final
promotion sequence look there?
diff --git a/gcc/cse.c b/gcc/cse.c
index ae67685..3b6471d 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -4616,7 +4616,10 @@ cse_insn (rtx insn)
 		 to fold switch statements when an ADDR_DIFF_VEC is used.  */
 	      || (GET_CODE (src_folded) == MINUS
 		  && GET_CODE (XEXP (src_folded, 0)) == LABEL_REF
-		  && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF)))
+		  && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF))
+	      /* Don't propagate vector-constants, as for now no architecture
+		 supports vector immediates.  */
+	  && !vector_extensions_used_for_mode (mode))

This seems quite dubious.  The instruction pattern representing the move should
refuse the constants via its condition or predicates.

The i386-specific bits seem quite good to me now and ready for mainline with
the above comments addressed.  It may make your life easier if you tried to
make a version that does not need any of the middle-end changes - I think it is
possible; the middle-end changes are mostly about expanding memcpy/memset
without loops at all.  This can be handled incrementally after your current
patch gets to mainline.

Honza

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-11-02 19:55                                     ` Jan Hubicka
@ 2011-11-03 12:56                                       ` Michael Zolotukhin
  2011-11-06 14:28                                         ` Jan Hubicka
  2011-11-07 15:52                                         ` Jan Hubicka
  0 siblings, 2 replies; 52+ messages in thread
From: Michael Zolotukhin @ 2011-11-03 12:56 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Richard Henderson, Jakub Jelinek, Jack Howarth, gcc-patches,
	Richard Guenther, H.J. Lu, izamyatin, areg.melikadamyan

[-- Attachment #1: Type: text/plain, Size: 14598 bytes --]

Hi,

Please check the updated patch - it's separated from most of the
middle-end changes.  I tried to take your remarks into account;
you can find my comments below.

The patch was bootstrapped and make-checked, and showed performance gains on
memset/memcpy for sizes ~256-4k (measured on Atom).

Is it ok for trunk?

ChangeLog:
2011-11-03  Zolotukhin Michael  <michael.v.zolotukhin@gmail.com>

        * config/i386/i386.h (processor_costs): Add second dimension to
        stringop_algs array.
        (clear_ratio): Tune value to improve performance.
        * config/i386/i386.c (cost models): Initialize second dimension of
        stringop_algs arrays.  Tune cost model in atom_cost, generic32_cost
        and generic64_cost.
        (promote_duplicated_reg): Add support for vector modes, add
        declaration.
        (promote_duplicated_reg_to_size): Likewise.
        (ix86_expand_move): Add support for vector moves, that use half of
        vector register.
        (expand_set_or_movmem_via_loop_with_iter): New function.
        (expand_set_or_movmem_via_loop): Enable reuse of the same iters in
        different loops, produced by this function.
        (emit_strset): New function.
        (expand_movmem_epilogue): Add epilogue generation for bigger sizes,
        use SSE-moves where possible.
        (expand_setmem_epilogue): Likewise.
        (expand_movmem_prologue): Likewise for prologue.
        (expand_setmem_prologue): Likewise.
        (expand_constant_movmem_prologue): Likewise.
        (expand_constant_setmem_prologue): Likewise.
        (decide_alg): Add new argument align_unknown.  Fix algorithm of
        strategy selection if TARGET_INLINE_ALL_STRINGOPS is set.
        (decide_alignment): Update desired alignment according to chosen move
        mode.
        (ix86_expand_movmem): Change unrolled_loop strategy to use SSE-moves.
        (ix86_expand_setmem): Likewise.
        (ix86_slow_unaligned_access): Implementation of new hook
        slow_unaligned_access.
        * config/i386/i386.md (strset): Enable half-SSE moves.
        * config/i386/sse.md (sse2_loadq): Add expand for sse2_loadq.
        (vec_dupv4si): Add expand for vec_dupv4si.
        (vec_dupv2di): Add expand for vec_dupv2di.
        * cse.c (cse_insn): Stop forward propagation of vector constants.
        * fwprop.c (forward_propagate_and_simplify): Likewise.
        * doc/tm.texi (SLOW_UNALIGNED_ACCESS): Remove documentation for deleted
        macro SLOW_UNALIGNED_ACCESS.
        (TARGET_SLOW_UNALIGNED_ACCESS): Add documentation on new hook.
        * doc/tm.texi.in (SLOW_UNALIGNED_ACCESS): Likewise.
        (TARGET_SLOW_UNALIGNED_ACCESS): Likewise.
        * emit-rtl.c (get_mem_align_offset): Add handling of MEM_REFs.
        (adjust_address_1): Improve algorithm for determining alignment of
        address+offset.
        * expr.c (compute_align_by_offset): New function.
        (vector_mode_for_mode): New function.
        (vector_extensions_used_for_mode): New function.
        (alignment_for_piecewise_move): Use hook slow_unaligned_access instead
        of macros SLOW_UNALIGNED_ACCESS.
        (emit_group_load_1): Likewise.
        (emit_group_store): Likewise.
        (emit_push_insn): Likewise.
        (store_field): Likewise.
        (expand_expr_real_1): Likewise.
        * expr.h (compute_align_by_offset): Add declaration.
        * optabs.c (expand_vector_broadcast_of_byte_value): New function.
        * optabs.h (expand_vector_broadcast_of_byte_value): Add declaration.
        * rtl.h (vector_extensions_used_for_mode): Add declaration.
        * target.def (DEFHOOK): Add hook slow_unaligned_access.
        * targhooks.c (default_slow_unaligned_access): Add default hook
        implementation.
        * targhooks.h (default_slow_unaligned_access): Add prototype.
        * testsuite/gcc.target/i386/sw-1.c: Fix compile options to preserve
        correct behaviour of the test.


> I do think we will want sse_unrolled_loop (and perhaps sse_loop, I don't know)
> algorithms added to enable the SSE codegen: there are chips that do support SSE
> but internally split the wide moves into word sized chunks and thus benefits
> are minimal and setup costs won't pay back.  So we do not want to default to
> SSE codegen whenever possible, just when it is supposed to pay back.
>
> I wonder what we will want to do then with -mno-sse (and i.e. for linux kernel where
> one can not implicitely use SSE).  Probably making sse_unrolled_loop->unrolled_loop
> transition is easiest for this quite rare case even if some of other algorithm may
> turn out to be superrior.

There is a rolled-loop algorithm that doesn't use SSE modes - such
architectures could use it instead of unrolled_loop.  I think
performance wouldn't suffer much from that.
On most modern processors, SSE moves are faster than several
word-sized moves, so this change in the unrolled_loop implementation seems
reasonable to me, but, of course, if you think introducing
sse_unrolled_loop is worth doing, it could be done.


> No vector move because promoted vector value is not computed, yet?  (it would
> make sense to bypass it to keep hot path for small blocks SSE free).

Yes, that's the reason.
Actually, such a path does exist - it's used if the block size is so
small that the prologue and epilogue do all the work without the main
loop.


> Where did go the logic dropping unroll factor to 2 for 32bit integer loops?  This is
> important otherwise se starve the RA.

Fixed.
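(For illustration only - the exact code is in the attached patch - the cap
could look roughly like the following, reusing the unroll-factor selection
loop from ix86_expand_movmem; max_unroll is an illustrative name here, not
necessarily what the patch uses:)

  /* Keep the unroll factor at 2 for 32-bit integer loops so the unrolled
     body does not starve the register allocator.  */
  unsigned int max_unroll = (TARGET_64BIT || move_mode != Pmode) ? 4 : 2;
  while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
         && unroll_factor < max_unroll)
    unroll_factor *= 2;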


> The unrolled epilogue does at most one byte wide move, while the rolled does at most
> 4*16. Given that the odds are that the blocks are small, are you sure this is not
> causing performance problems?

It didn't hurt performance - quite the contrary, it was done to avoid
performance problems.
The point is the following.  If we generate an unrolled loop with
SSE moves and a prologue with alignment checks, then we don't know
how many bytes will be left after the main loop.  So in the epilogue we
generate a loop whose trip count is unknown at compile time.  In the previous
implementation such a loop simply used byte moves, so in the worst case we'd
have (UnrollFactor*VectorWidth-1) byte moves.  Now, before such a
byte loop, we generate a rolled loop with SSE moves.  This loop would
iterate at most (UnrollFactor-1) times, but it still greatly
improves performance.

Finally, we would have something like this:
main_loop:
   sse_move
   sse_move
   sse_move
   sse_move
   iter += 4*vector_size
   if (iter < count ) goto main_loop
epilogue_sse_loop:
   sse_move
   iter += vector_size
   if (iter < count ) goto epilogue_sse_loop
epilogue_byte_loop:
   byte_move
   iter ++
   if (iter < count ) goto epilogue_byte_loop


> +      gcc_assert (TARGET_SSE);
> +      if (TARGET_64BIT)
> +       promoted_val = promote_duplicated_reg (V2DImode, val);
> +      else
> +       promoted_val = promote_duplicated_reg (V4SImode, val);
> Hmm, it would seem more natural to turn this into V8QImode since it is really
> vector of 4 duplicated bytes.  This will avoid some TARGET_64BIT tess bellow.
> Also AMD chips are very slow on integer->SSE moves. How the final promotion
> sequence looks there?

Here we expect that we already have a value promoted to a GPR.  In
64-bit it has DImode, in 32-bit SImode - that's why we need this
test here.  I added some comments and an assert to make it a bit
clearer.
The final promotion sequence looks like this (the example is for 32-bit;
for 64-bit it looks quite similar):
    SImode_register = ByteValue * 0x01010101
        // generated somewhere before promote_duplicated_reg
    V4SImode_register = vec_dupv4si (SImode_register)
        // generated in promote_duplicated_reg

When we don't have a value promoted to a GPR, we use
expand_vector_broadcast_of_byte_value instead of
promote_duplicated_reg, which just generates an rtx PARALLEL.  Later it's
transformed into the same vec_dupv4si or similar, but that functionality
isn't in the current patch - it's already implemented.
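
(Just to make the data flow concrete, here is a rough C-level equivalent of the
two promotion steps above - a sketch only, the real expansion works on RTL:)

  /* Sketch: what the promoted values contain for a given byte value c.  */
  static void
  promoted_values_for_byte (unsigned char c)
  {
    unsigned int gpr = c * 0x01010101u;            /* SImode value       */
    unsigned int vec[4] = { gpr, gpr, gpr, gpr };  /* V4SImode contents  */
    (void) vec;
  }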


> +             /* Don't propagate vector-constants, as for now no architecture
> +                supports vector immediates.  */
> +         && !vector_extensions_used_for_mode (mode))
>
> This seems quite dubious.  The instruction pattern representing the move should
> refuse the constants via its condition or predicates.

It does, but it can't generate efficient code if such constants are
propagated.  Take a look at the following example.
Suppose we'd generate something like this:
    v4si_reg = (0 | 0 | 0 | 0)
    sse_mov [mem], v4si_reg
    sse_mov [mem+16], v4si_reg

After constant propagation we'd have:
    v4si_reg = (0 | 0 | 0 | 0)
    sse_mov [mem], (0 | 0 | 0 | 0)
    sse_mov [mem+16], (0 | 0 | 0 | 0)

Initially we had ordinary stores that could be generated without any
problems, but after constant propagation we need to generate a store of
an immediate to memory.  As we no longer know that there is a specially
prepared register containing this value, we either need to initialize
a vector register again, or split the store into several word-sized
stores.  The first option is too expensive, so the second one takes
place.  It's correct, but it no longer uses SSE moves, so it's slower
than it would be without constant propagation.


>

-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.

[-- Attachment #2: memfunc-x86_specific.patch --]
[-- Type: application/octet-stream, Size: 87203 bytes --]

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 2c53423..8b35a73 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
   COSTS_N_BYTES (2),			/* cost of FABS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FCHS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -632,10 +636,14 @@ struct processor_costs i386_cost = {	/* 386 specific costs */
   COSTS_N_INSNS (22),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (24),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (122),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -704,10 +712,14 @@ struct processor_costs i486_cost = {	/* 486 specific costs */
   COSTS_N_INSNS (3),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (3),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (83),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -774,10 +786,14 @@ struct processor_costs pentium_cost = {
   COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (70),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{-1, rep_prefix_4_byte}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -849,12 +865,18 @@ struct processor_costs pentiumpro_cost = {
      noticeable win, for bigger blocks either rep movsl or rep movsb is
      way to go.  Rep movsb has apparently more expensive startup time in CPU,
      but after 4K the difference is down in the noise.  */
-  {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
+  {{{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
 			{8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_4_byte, {{1024, unrolled_loop},
-  			{8192, rep_prefix_4_byte}, {-1, libcall}}},
+   {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_4_byte, {{1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_4_byte, {{1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -922,10 +944,14 @@ struct processor_costs geode_cost = {
   COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (54),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -995,10 +1021,14 @@ struct processor_costs k6_cost = {
   COSTS_N_INSNS (2),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (2),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (56),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1068,10 +1098,14 @@ struct processor_costs athlon_cost = {
   /* For some reason, Athlon deals better with REP prefix (relative to loops)
      compared to K8. Alignment becomes important after 8 bytes for memcpy and
      128 bytes for memset.  */
-  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1146,11 +1180,16 @@ struct processor_costs k8_cost = {
   /* K8 has optimized REP instruction for medium sized blocks, but for very
      small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1233,11 +1272,16 @@ struct processor_costs amdfam10_cost = {
   /* AMDFAM10 has optimized REP instruction for medium sized blocks, but for
      very small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1320,11 +1364,16 @@ struct processor_costs bdver1_cost = {
   /*  BDVER1 has optimized REP instruction for medium sized blocks, but for
       very small blocks it is better to use loop. For large blocks, libcall
       can do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   6,					/* scalar_stmt_cost.  */
   4,					/* scalar load_cost.  */
   4,					/* scalar_store_cost.  */
@@ -1407,11 +1456,16 @@ struct processor_costs bdver2_cost = {
   /*  BDVER2 has optimized REP instruction for medium sized blocks, but for
       very small blocks it is better to use loop. For large blocks, libcall
       can do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   6,					/* scalar_stmt_cost.  */
   4,					/* scalar load_cost.  */
   4,					/* scalar_store_cost.  */
@@ -1489,11 +1543,16 @@ struct processor_costs btver1_cost = {
   /* BTVER1 has optimized REP instruction for medium sized blocks, but for
      very small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1560,11 +1619,18 @@ struct processor_costs pentium4_cost = {
   COSTS_N_INSNS (2),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (2),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (43),			/* cost of FSQRT instruction.  */
-  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+
+  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+
+  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
    {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1631,13 +1697,22 @@ struct processor_costs nocona_cost = {
   COSTS_N_INSNS (3),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (3),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (44),			/* cost of FSQRT instruction.  */
-  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+
+  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
    {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
 	      {100000, unrolled_loop}, {-1, libcall}}}},
-  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
+	      {100000, unrolled_loop}, {-1, libcall}}}}},
+
+  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
    {-1, libcall}}},
    {libcall, {{24, loop}, {64, unrolled_loop},
 	      {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {-1, libcall}}},
+   {libcall, {{24, loop}, {64, unrolled_loop},
+	      {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1704,13 +1779,21 @@ struct processor_costs atom_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{11, loop}, {-1, rep_prefix_4_byte}}},
-   {libcall, {{32, loop}, {64, rep_prefix_4_byte},
-	  {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {15, unrolled_loop},
-	  {2048, rep_prefix_4_byte}, {-1, libcall}}},
-   {libcall, {{24, loop}, {32, unrolled_loop},
-	  {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+
+  /* stringop_algs for memcpy.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
+    {libcall, {{4096, unrolled_loop}, {-1, libcall}}}},
+   {{libcall, {{-1, libcall}}},			       /* Unknown alignment.  */
+    {libcall, {{2048, unrolled_loop},
+	       {-1, libcall}}}}},
+
+  /* stringop_algs for memset.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
+    {libcall, {{4096, unrolled_loop}, {-1, libcall}}}},
+   {{libcall, {{1024, unrolled_loop},		       /* Unknown alignment.  */
+	       {-1, libcall}}},
+    {libcall, {{2048, unrolled_loop},
+	       {-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1784,10 +1867,16 @@ struct processor_costs generic64_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {DUMMY_STRINGOP_ALGS,
-   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {DUMMY_STRINGOP_ALGS,
-   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+
+  {{DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}},
+   {DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}}},
+
+  {{DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}},
+   {DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1856,10 +1945,16 @@ struct processor_costs generic32_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+  /* stringop_algs for memcpy.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  /* stringop_algs for memset.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -2537,6 +2632,8 @@ static void ix86_set_current_function (tree);
 static unsigned int ix86_minimum_incoming_stack_boundary (bool);
 
 static enum calling_abi ix86_function_abi (const_tree);
+static rtx promote_duplicated_reg (enum machine_mode, rtx);
+static rtx promote_duplicated_reg_to_size (rtx, int, int, int);
 
 \f
 #ifndef SUBTARGET32_DEFAULT_CPU
@@ -15266,6 +15363,38 @@ ix86_expand_move (enum machine_mode mode, rtx operands[])
     }
   else
     {
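+      /* The following special cases route moves through SSE registers:
+	 a DImode memory-to-memory copy via an SSE register on 32-bit SSE2
+	 targets, a store of a DImode value already promoted to V2DImode,
+	 and a TImode store of the low half of a V4DImode register on AVX2
+	 targets.  */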
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE2
+	  && MEM_P (op0)
+	  && MEM_P (op1)
+	  && !push_operand (op0, mode)
+	  && can_create_pseudo_p ())
+	{
+	  rtx temp = gen_reg_rtx (V2DImode);
+	  emit_insn (gen_sse2_loadq (temp, op1));
+	  emit_insn (gen_sse_storeq (op0, temp));
+	  return;
+	}
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE
+	  && !MEM_P (op1)
+	  && GET_MODE (op1) == V2DImode)
+	{
+	  emit_insn (gen_sse_storeq (op0, op1));
+	  return;
+	}
+      if (mode == TImode
+	  && TARGET_AVX2
+	  && MEM_P (op0)
+	  && !MEM_P (op1)
+	  && GET_MODE (op1) == V4DImode)
+	{
+	  op0 = convert_to_mode (V2DImode, op0, 1);
+	  emit_insn (gen_vec_extract_lo_v4di (op0, op1));
+	  return;
+	}
       if (MEM_P (op0)
 	  && (PUSH_ROUNDING (GET_MODE_SIZE (mode)) != GET_MODE_SIZE (mode)
 	      || !push_operand (op0, mode))
@@ -20677,22 +20806,37 @@ counter_mode (rtx count_exp)
   return SImode;
 }
 
-/* When SRCPTR is non-NULL, output simple loop to move memory
+/* Helper function for expand_set_or_movmem_via_loop.
+
+   When SRCPTR is non-NULL, output simple loop to move memory
    pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
    overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
    equivalent loop to set memory by VALUE (supposed to be in MODE).
 
    The size is rounded down to whole number of chunk size moved at once.
-   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
+   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.
 
+   If ITER isn't NULL, then it'll be used in the generated loop without
+   initialization (that allows generating several consecutive loops using the
+   same iterator).
+   If CHANGE_PTRS is specified, DESTPTR and SRCPTR will be increased by the
+   iterator value at the end of the function (as if they iterated in the loop).
+   Otherwise, their values stay unchanged.
 
-static void
-expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
-			       rtx destptr, rtx srcptr, rtx value,
-			       rtx count, enum machine_mode mode, int unroll,
-			       int expected_size)
+   If EXPECTED_SIZE isn't -1, then it's used to compute branch probabilities on
+   the loop backedge.  When the expected size is unknown (it's -1), the
+   probability is set to 80%.
+
+   The return value is the rtx of the iterator used in the loop - it can be
+   reused in subsequent calls of this function.  */
+static rtx
+expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
+					 rtx destptr, rtx srcptr, rtx value,
+					 rtx count, rtx iter,
+					 enum machine_mode mode, int unroll,
+					 int expected_size, bool change_ptrs)
 {
-  rtx out_label, top_label, iter, tmp;
+  rtx out_label, top_label, tmp;
   enum machine_mode iter_mode = counter_mode (count);
   rtx piece_size = GEN_INT (GET_MODE_SIZE (mode) * unroll);
   rtx piece_size_mask = GEN_INT (~((GET_MODE_SIZE (mode) * unroll) - 1));
@@ -20700,10 +20844,12 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
   rtx x_addr;
   rtx y_addr;
   int i;
+  bool reuse_iter = (iter != NULL_RTX);
 
   top_label = gen_label_rtx ();
   out_label = gen_label_rtx ();
-  iter = gen_reg_rtx (iter_mode);
+  if (!reuse_iter)
+    iter = gen_reg_rtx (iter_mode);
 
   size = expand_simple_binop (iter_mode, AND, count, piece_size_mask,
 			      NULL, 1, OPTAB_DIRECT);
@@ -20714,18 +20860,21 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
 			       true, out_label);
       predict_jump (REG_BR_PROB_BASE * 10 / 100);
     }
-  emit_move_insn (iter, const0_rtx);
+  if (!reuse_iter)
+    emit_move_insn (iter, const0_rtx);
 
   emit_label (top_label);
 
   tmp = convert_modes (Pmode, iter_mode, iter, true);
   x_addr = gen_rtx_PLUS (Pmode, destptr, tmp);
-  destmem = change_address (destmem, mode, x_addr);
+  destmem =
+    adjust_automodify_address_nv (copy_rtx (destmem), mode, x_addr, 0);
 
   if (srcmem)
     {
       y_addr = gen_rtx_PLUS (Pmode, srcptr, copy_rtx (tmp));
-      srcmem = change_address (srcmem, mode, y_addr);
+      srcmem =
+	adjust_automodify_address_nv (copy_rtx (srcmem), mode, y_addr, 0);
 
       /* When unrolling for chips that reorder memory reads and writes,
 	 we can save registers by using single temporary.
@@ -20797,19 +20946,43 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
     }
   else
     predict_jump (REG_BR_PROB_BASE * 80 / 100);
-  iter = ix86_zero_extend_to_Pmode (iter);
-  tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
-			     true, OPTAB_LIB_WIDEN);
-  if (tmp != destptr)
-    emit_move_insn (destptr, tmp);
-  if (srcptr)
+  if (change_ptrs)
     {
-      tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
+      iter = ix86_zero_extend_to_Pmode (iter);
+      tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
 				 true, OPTAB_LIB_WIDEN);
-      if (tmp != srcptr)
-	emit_move_insn (srcptr, tmp);
+      if (tmp != destptr)
+	emit_move_insn (destptr, tmp);
+      if (srcptr)
+	{
+	  tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
+				     true, OPTAB_LIB_WIDEN);
+	  if (tmp != srcptr)
+	    emit_move_insn (srcptr, tmp);
+	}
     }
   emit_label (out_label);
+  return iter;
+}
+
+/* When SRCPTR is non-NULL, output simple loop to move memory
+   pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
+   overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
+   equivalent loop to set memory by VALUE (supposed to be in MODE).
+
+   The size is rounded down to whole number of chunk size moved at once.
+   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
+
+static void
+expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
+			       rtx destptr, rtx srcptr, rtx value,
+			       rtx count, enum machine_mode mode, int unroll,
+			       int expected_size)
+{
+  expand_set_or_movmem_via_loop_with_iter (destmem, srcmem,
+				 destptr, srcptr, value,
+				 count, NULL_RTX, mode, unroll,
+				 expected_size, true);
 }
 
 /* Output "rep; mov" instruction.
@@ -20913,7 +21086,18 @@ emit_strmov (rtx destmem, rtx srcmem,
   emit_insn (gen_strmov (destptr, dest, srcptr, src));
 }
 
-/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST.  */
+/* Emit strset instruction.  If RHS is constant and a vector mode will be used,
+   then move this constant to a vector register before emitting strset.  */
+static void
+emit_strset (rtx destmem, rtx value,
+	     rtx destptr, enum machine_mode mode, int offset)
+{
+  rtx dest = adjust_automodify_address_nv (destmem, mode, destptr, offset);
+  emit_insn (gen_strset (destptr, dest, value));
+}
+
+/* Output code to copy (COUNT % MAX_SIZE) bytes from SRCPTR to DESTPTR.
+   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
 static void
 expand_movmem_epilogue (rtx destmem, rtx srcmem,
 			rtx destptr, rtx srcptr, rtx count, int max_size)
@@ -20924,43 +21108,55 @@ expand_movmem_epilogue (rtx destmem, rtx srcmem,
       HOST_WIDE_INT countval = INTVAL (count);
       int offset = 0;
 
-      if ((countval & 0x10) && max_size > 16)
+      int remainder_size = countval % max_size;
+      enum machine_mode move_mode = Pmode;
+
+      /* First, try to move data with the widest possible mode.
+	 The remaining part will be moved using Pmode and narrower modes.  */
+      if (TARGET_SSE)
 	{
-	  if (TARGET_64BIT)
-	    {
-	      emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
-	      emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset + 8);
-	    }
-	  else
-	    gcc_unreachable ();
-	  offset += 16;
+	  if (max_size >= GET_MODE_SIZE (V4SImode))
+	    move_mode = V4SImode;
+	  else if (max_size >= GET_MODE_SIZE (DImode))
+	    move_mode = DImode;
 	}
-      if ((countval & 0x08) && max_size > 8)
+
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-	  if (TARGET_64BIT)
-	    emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
-	  else
-	    {
-	      emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
-	      emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset + 4);
-	    }
-	  offset += 8;
+	  emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
 	}
-      if ((countval & 0x04) && max_size > 4)
+
+      /* Move the remaining part of the epilogue - its size may be up to
+	 the size of the widest mode.  */
+      move_mode = Pmode;
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
+	}
+
+      if (remainder_size >= 4)
+	{
+	  emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
 	  offset += 4;
+	  remainder_size -= 4;
 	}
-      if ((countval & 0x02) && max_size > 2)
+      if (remainder_size >= 2)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
 	  offset += 2;
+	  remainder_size -= 2;
 	}
-      if ((countval & 0x01) && max_size > 1)
+      if (remainder_size >= 1)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
 	  offset += 1;
+	  remainder_size -= 1;
 	}
+      gcc_assert (remainder_size == 0);
       return;
     }
   if (max_size > 8)
@@ -21066,87 +21262,134 @@ expand_setmem_epilogue_via_loop (rtx destmem, rtx destptr, rtx value,
 				 1, max_size / 2);
 }
 
-/* Output code to set at most count & (max_size - 1) bytes starting by DEST.  */
+/* Output code to set at most (COUNT % MAX_SIZE) bytes starting from DESTPTR
+   to VALUE.
+   DESTMEM provides the MEM rtx to feed proper aliasing info.
+   PROMOTED_TO_GPR_VALUE is an rtx for a GPR containing the broadcast VALUE.
+   PROMOTED_TO_VECTOR_VALUE is an rtx for a vector register containing the
+   broadcast VALUE.
+   PROMOTED_TO_GPR_VALUE and PROMOTED_TO_VECTOR_VALUE may be NULL if the
+   promotion hasn't been generated yet.  */
 static void
-expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_size)
+expand_setmem_epilogue (rtx destmem, rtx destptr, rtx promoted_to_vector_value,
+			rtx promoted_to_gpr_value, rtx value, rtx count,
+			int max_size)
 {
-  rtx dest;
-
   if (CONST_INT_P (count))
     {
       HOST_WIDE_INT countval = INTVAL (count);
       int offset = 0;
 
-      if ((countval & 0x10) && max_size > 16)
+      int remainder_size = countval % max_size;
+      enum machine_mode move_mode = Pmode;
+      enum machine_mode sse_mode = TARGET_64BIT ? V2DImode : V4SImode;
+      rtx promoted_value = NULL_RTX;
+
+      /* First, try to move data with the widest possible mode.
+	 The remaining part will be moved using Pmode and narrower modes.  */
+      if (TARGET_SSE)
 	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset + 8);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    gcc_unreachable ();
-	  offset += 16;
+	  if (max_size >= GET_MODE_SIZE (sse_mode))
+	    move_mode = sse_mode;
+	  else if (max_size >= GET_MODE_SIZE (DImode))
+	    move_mode = DImode;
+	  if (!promoted_to_vector_value
+	      || !VECTOR_MODE_P (GET_MODE (promoted_to_vector_value)))
+	    promoted_to_vector_value = NULL_RTX;
 	}
-      if ((countval & 0x08) && max_size > 8)
+
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    {
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset + 4);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  offset += 8;
+	  if (GET_MODE (destmem) != move_mode)
+	    destmem = adjust_automodify_address_nv (destmem, move_mode,
+						    destptr, offset);
+	  if (!promoted_to_vector_value)
+	    promoted_to_vector_value =
+	      expand_vector_broadcast_of_byte_value (move_mode, value);
+	  emit_strset (destmem, promoted_to_vector_value, destptr,
+		       move_mode, offset);
+
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
 	}
-      if ((countval & 0x04) && max_size > 4)
+
+      /* Move the remaining part of the epilogue - its size may be up to
+	 the size of the widest mode.  */
+      move_mode = Pmode;
+      promoted_value = NULL_RTX;
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
+	{
+	  if (!promoted_value)
+	    promoted_value = promote_duplicated_reg (move_mode, value);
+	  emit_strset (destmem, promoted_value, destptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
+	}
+
+      if (!promoted_value)
+	promoted_value = promote_duplicated_reg (move_mode, value);
+      if (remainder_size >= 4)
 	{
-	  dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
+	  emit_strset (destmem, gen_lowpart (SImode, promoted_value), destptr,
+		       SImode, offset);
 	  offset += 4;
+	  remainder_size -= 4;
 	}
-      if ((countval & 0x02) && max_size > 2)
+      if (remainder_size >= 2)
 	{
-	  dest = adjust_automodify_address_nv (destmem, HImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
-	  offset += 2;
+	  emit_strset (destmem, gen_lowpart (HImode, promoted_value), destptr,
+		       HImode, offset);
+	  offset += 2;
+	  remainder_size -= 2;
 	}
-      if ((countval & 0x01) && max_size > 1)
+      if (remainder_size >= 1)
 	{
-	  dest = adjust_automodify_address_nv (destmem, QImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
+	  emit_strset (destmem, gen_lowpart (QImode, promoted_value), destptr,
+		       QImode, offset);
 	  offset += 1;
+	  remainder_size -= 1;
 	}
+      gcc_assert (remainder_size == 0);
       return;
     }
+
+  /* COUNT isn't constant.  */
   if (max_size > 32)
     {
-      expand_setmem_epilogue_via_loop (destmem, destptr, value, count, max_size);
+      expand_setmem_epilogue_via_loop (destmem, destptr, value, count,
+				       max_size);
       return;
     }
+
+  if (!promoted_to_gpr_value)
+    promoted_to_gpr_value = promote_duplicated_reg_to_size (value,
+						   GET_MODE_SIZE (Pmode),
+						   GET_MODE_SIZE (Pmode),
+						   GET_MODE_SIZE (Pmode));
+
   if (max_size > 16)
     {
       rtx label = ix86_expand_aligntest (count, 16, true);
-      if (TARGET_64BIT)
+      if (TARGET_SSE && promoted_to_vector_value)
+	{
+	  destmem = change_address (destmem,
+				    GET_MODE (promoted_to_vector_value),
+				    destptr);
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_vector_value));
+	}
+      else if (TARGET_64BIT)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
@@ -21154,16 +21397,22 @@ expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_
   if (max_size > 8)
     {
       rtx label = ix86_expand_aligntest (count, 8, true);
-      if (TARGET_64BIT)
+      if (TARGET_SSE && promoted_to_vector_value)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem,
+				 gen_lowpart (DImode, promoted_to_vector_value)));
+	}
+      else if (TARGET_64BIT)
+	{
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
@@ -21171,24 +21420,27 @@ expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_
   if (max_size > 4)
     {
       rtx label = ix86_expand_aligntest (count, 4, true);
-      dest = change_address (destmem, SImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
+      destmem = change_address (destmem, SImode, destptr);
+      emit_insn (gen_strset (destptr, destmem,
+			     gen_lowpart (SImode, promoted_to_gpr_value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
   if (max_size > 2)
     {
       rtx label = ix86_expand_aligntest (count, 2, true);
-      dest = change_address (destmem, HImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
+      destmem = change_address (destmem, HImode, destptr);
+      emit_insn (gen_strset (destptr, destmem,
+			     gen_lowpart (HImode, promoted_to_gpr_value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
   if (max_size > 1)
     {
       rtx label = ix86_expand_aligntest (count, 1, true);
-      dest = change_address (destmem, QImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
+      destmem = change_address (destmem, QImode, destptr);
+      emit_insn (gen_strset (destptr, destmem,
+			     gen_lowpart (QImode, promoted_to_gpr_value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
@@ -21204,8 +21456,8 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
   if (align <= 1 && desired_alignment > 1)
     {
       rtx label = ix86_expand_aligntest (destptr, 1, false);
-      srcmem = change_address (srcmem, QImode, srcptr);
-      destmem = change_address (destmem, QImode, destptr);
+      srcmem = adjust_automodify_address_nv (srcmem, QImode, srcptr, 0);
+      destmem = adjust_automodify_address_nv (destmem, QImode, destptr, 0);
       emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
       ix86_adjust_counter (count, 1);
       emit_label (label);
@@ -21214,8 +21466,8 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
   if (align <= 2 && desired_alignment > 2)
     {
       rtx label = ix86_expand_aligntest (destptr, 2, false);
-      srcmem = change_address (srcmem, HImode, srcptr);
-      destmem = change_address (destmem, HImode, destptr);
+      srcmem = adjust_automodify_address_nv (srcmem, HImode, srcptr, 0);
+      destmem = adjust_automodify_address_nv (destmem, HImode, destptr, 0);
       emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
       ix86_adjust_counter (count, 2);
       emit_label (label);
@@ -21224,14 +21476,34 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
   if (align <= 4 && desired_alignment > 4)
     {
       rtx label = ix86_expand_aligntest (destptr, 4, false);
-      srcmem = change_address (srcmem, SImode, srcptr);
-      destmem = change_address (destmem, SImode, destptr);
+      srcmem = adjust_automodify_address_nv (srcmem, SImode, srcptr, 0);
+      destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
       emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
       ix86_adjust_counter (count, 4);
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
-  gcc_assert (desired_alignment <= 8);
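+  /* With SSE move modes the desired alignment may now be 16 bytes, so an
+     additional 8-byte alignment step can be required.  */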
+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  srcmem = adjust_automodify_address_nv (srcmem, DImode, srcptr, 0);
+	  destmem = adjust_automodify_address_nv (destmem, DImode, destptr, 0);
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	}
+      else
+	{
+	  srcmem = adjust_automodify_address_nv (srcmem, SImode, srcptr, 0);
+	  destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	}
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);
 }
 
 /* Copy enough from DST to SRC to align DST known to DESIRED_ALIGN.
@@ -21286,6 +21558,37 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
       off = 4;
       emit_insn (gen_strmov (destreg, dst, srcreg, src));
     }
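+  /* With SSE move modes the prologue may need to copy 8 more bytes to reach
+     the desired (up to 16-byte) alignment.  */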
+  if (align_bytes & 8)
+    {
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, DImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      else
+	{
+	  dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, SImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
+	set_mem_align (dst, 8 * BITS_PER_UNIT);
+      if (src_align_bytes >= 0)
+	{
+	  unsigned int src_align = 0;
+	  if ((src_align_bytes & 7) == (align_bytes & 7))
+	    src_align = 8;
+	  else if ((src_align_bytes & 3) == (align_bytes & 3))
+	    src_align = 4;
+	  else if ((src_align_bytes & 1) == (align_bytes & 1))
+	    src_align = 2;
+	  if (MEM_ALIGN (src) < src_align * BITS_PER_UNIT)
+	    set_mem_align (src, src_align * BITS_PER_UNIT);
+	}
+      off = 8;
+    }
   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
   src = adjust_automodify_address_nv (src, BLKmode, srcreg, off);
   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
@@ -21293,7 +21596,9 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
   if (src_align_bytes >= 0)
     {
       unsigned int src_align = 0;
-      if ((src_align_bytes & 7) == (align_bytes & 7))
+      if ((src_align_bytes & 15) == (align_bytes & 15))
+	src_align = 16;
+      else if ((src_align_bytes & 7) == (align_bytes & 7))
 	src_align = 8;
       else if ((src_align_bytes & 3) == (align_bytes & 3))
 	src_align = 4;
@@ -21321,7 +21626,7 @@ expand_setmem_prologue (rtx destmem, rtx destptr, rtx value, rtx count,
   if (align <= 1 && desired_alignment > 1)
     {
       rtx label = ix86_expand_aligntest (destptr, 1, false);
-      destmem = change_address (destmem, QImode, destptr);
+      destmem = adjust_automodify_address_nv (destmem, QImode, destptr, 0);
       emit_insn (gen_strset (destptr, destmem, gen_lowpart (QImode, value)));
       ix86_adjust_counter (count, 1);
       emit_label (label);
@@ -21330,7 +21635,7 @@ expand_setmem_prologue (rtx destmem, rtx destptr, rtx value, rtx count,
   if (align <= 2 && desired_alignment > 2)
     {
       rtx label = ix86_expand_aligntest (destptr, 2, false);
-      destmem = change_address (destmem, HImode, destptr);
+      destmem = adjust_automodify_address_nv (destmem, HImode, destptr, 0);
       emit_insn (gen_strset (destptr, destmem, gen_lowpart (HImode, value)));
       ix86_adjust_counter (count, 2);
       emit_label (label);
@@ -21339,13 +21644,23 @@ expand_setmem_prologue (rtx destmem, rtx destptr, rtx value, rtx count,
   if (align <= 4 && desired_alignment > 4)
     {
       rtx label = ix86_expand_aligntest (destptr, 4, false);
-      destmem = change_address (destmem, SImode, destptr);
+      destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
       emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
       ix86_adjust_counter (count, 4);
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
-  gcc_assert (desired_alignment <= 8);
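+  /* With SSE move modes the desired alignment may now be 16 bytes, so an
+     additional 8-byte alignment step can be required.  */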
+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);
 }
 
 /* Set enough from DST to align DST known to by aligned by ALIGN to
@@ -21381,6 +21696,19 @@ expand_constant_setmem_prologue (rtx dst, rtx destreg, rtx value,
       emit_insn (gen_strset (destreg, dst,
 			     gen_lowpart (SImode, value)));
     }
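+  /* With SSE move modes the prologue may need to store 8 more bytes to reach
+     the desired (up to 16-byte) alignment.  */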
+  if (align_bytes & 8)
+    {
+      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+      emit_insn (gen_strset (destreg, dst,
+			     gen_lowpart (SImode, value)));
+      off = 4;
+      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+      emit_insn (gen_strset (destreg, dst,
+			     gen_lowpart (SImode, value)));
+      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
+	set_mem_align (dst, 8 * BITS_PER_UNIT);
+      off = 4;
+    }
   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
     set_mem_align (dst, desired_align * BITS_PER_UNIT);
@@ -21392,7 +21720,7 @@ expand_constant_setmem_prologue (rtx dst, rtx destreg, rtx value,
 /* Given COUNT and EXPECTED_SIZE, decide on codegen of string operation.  */
 static enum stringop_alg
 decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
-	    int *dynamic_check)
+	    int *dynamic_check, bool align_unknown)
 {
   const struct stringop_algs * algs;
   bool optimize_for_speed;
@@ -21401,7 +21729,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
      consider such algorithms if the user has appropriated those
      registers for their own purposes.	*/
   bool rep_prefix_usable = !(fixed_regs[CX_REG] || fixed_regs[DI_REG]
-                             || (memset
+			     || (memset
 				 ? fixed_regs[AX_REG] : fixed_regs[SI_REG]));
 
 #define ALG_USABLE_P(alg) (rep_prefix_usable			\
@@ -21414,7 +21742,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
      of time processing large blocks.  */
   if (optimize_function_for_size_p (cfun)
       || (optimize_insn_for_size_p ()
-          && expected_size != -1 && expected_size < 256))
+	  && expected_size != -1 && expected_size < 256))
     optimize_for_speed = false;
   else
     optimize_for_speed = true;
@@ -21423,9 +21751,9 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
 
   *dynamic_check = -1;
   if (memset)
-    algs = &cost->memset[TARGET_64BIT != 0];
+    algs = &cost->memset[align_unknown][TARGET_64BIT != 0];
   else
-    algs = &cost->memcpy[TARGET_64BIT != 0];
+    algs = &cost->memcpy[align_unknown][TARGET_64BIT != 0];
   if (ix86_stringop_alg != no_stringop && ALG_USABLE_P (ix86_stringop_alg))
     return ix86_stringop_alg;
   /* rep; movq or rep; movl is the smallest variant.  */
@@ -21489,29 +21817,33 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
       enum stringop_alg alg;
       int i;
       bool any_alg_usable_p = true;
+      bool only_libcall_fits = true;
 
       for (i = 0; i < MAX_STRINGOP_ALGS; i++)
-        {
-          enum stringop_alg candidate = algs->size[i].alg;
-          any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
+	{
+	  enum stringop_alg candidate = algs->size[i].alg;
+	  any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
 
-          if (candidate != libcall && candidate
-              && ALG_USABLE_P (candidate))
-              max = algs->size[i].max;
-        }
+	  if (candidate != libcall && candidate
+	      && ALG_USABLE_P (candidate))
+	    {
+	      max = algs->size[i].max;
+	      only_libcall_fits = false;
+	    }
+	}
       /* If there aren't any usable algorithms, then recursing on
-         smaller sizes isn't going to find anything.  Just return the
-         simple byte-at-a-time copy loop.  */
-      if (!any_alg_usable_p)
-        {
-          /* Pick something reasonable.  */
-          if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
-            *dynamic_check = 128;
-          return loop_1_byte;
-        }
+	 smaller sizes isn't going to find anything.  Just return the
+	 simple byte-at-a-time copy loop.  */
+      if (!any_alg_usable_p || only_libcall_fits)
+	{
+	  /* Pick something reasonable.  */
+	  if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
+	    *dynamic_check = 128;
+	  return loop_1_byte;
+	}
       if (max == -1)
 	max = 4096;
-      alg = decide_alg (count, max / 2, memset, dynamic_check);
+      alg = decide_alg (count, max / 2, memset, dynamic_check, align_unknown);
       gcc_assert (*dynamic_check == -1);
       gcc_assert (alg != libcall);
       if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
@@ -21535,9 +21867,11 @@ decide_alignment (int align,
       case no_stringop:
 	gcc_unreachable ();
       case loop:
-      case unrolled_loop:
 	desired_align = GET_MODE_SIZE (Pmode);
 	break;
+      case unrolled_loop:
+	desired_align = GET_MODE_SIZE (TARGET_SSE ? V4SImode : Pmode);
+	break;
       case rep_prefix_8_byte:
 	desired_align = 8;
 	break;
@@ -21625,6 +21959,11 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
   enum stringop_alg alg;
   int dynamic_check;
   bool need_zero_guard = false;
+  bool align_unknown;
+  int unroll_factor;
+  enum machine_mode move_mode;
+  rtx loop_iter = NULL_RTX;
+  int dst_offset, src_offset;
 
   if (CONST_INT_P (align_exp))
     align = INTVAL (align_exp);
@@ -21648,9 +21987,17 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 
   /* Step 0: Decide on preferred algorithm, desired alignment and
      size of chunks to be copied by main loop.  */
-
-  alg = decide_alg (count, expected_size, false, &dynamic_check);
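+  /* Treat the alignment as unknown unless both the source and destination
+     offsets from a MOVE_MAX boundary are known at compile time and equal.  */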
+  dst_offset = get_mem_align_offset (dst, MOVE_MAX * BITS_PER_UNIT);
+  src_offset = get_mem_align_offset (src, MOVE_MAX * BITS_PER_UNIT);
+  align_unknown = (dst_offset < 0
+		   || src_offset < 0
+		   || src_offset != dst_offset);
+  alg = decide_alg (count, expected_size, false, &dynamic_check, align_unknown);
   desired_align = decide_alignment (align, alg, expected_size);
+  if (align_unknown)
+    desired_align = align;
+  unroll_factor = 1;
+  move_mode = Pmode;
 
   if (!TARGET_ALIGN_STRINGOPS)
     align = desired_align;
@@ -21669,11 +22016,16 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
       gcc_unreachable ();
     case loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode);
+      move_mode = Pmode;
+      unroll_factor = 1;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * (TARGET_64BIT ? 4 : 2);
+      /* Use SSE instructions, if possible.  */
+      move_mode = TARGET_SSE ? (align_unknown ? DImode : V4SImode) : Pmode;
+      unroll_factor = TARGET_64BIT ? 4 : 2;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
       size_needed = 8;
@@ -21785,6 +22137,8 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 	  dst = change_address (dst, BLKmode, destreg);
 	  expand_movmem_prologue (dst, src, destreg, srcreg, count_exp, align,
 				  desired_align);
+	  set_mem_align (src, desired_align * BITS_PER_UNIT);
+	  set_mem_align (dst, desired_align * BITS_PER_UNIT);
 	}
       else
 	{
@@ -21842,11 +22196,14 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 				     count_exp, Pmode, 1, expected_size);
       break;
     case unrolled_loop:
-      /* Unroll only by factor of 2 in 32bit mode, since we don't have enough
-	 registers for 4 temporaries anyway.  */
-      expand_set_or_movmem_via_loop (dst, src, destreg, srcreg, NULL,
-				     count_exp, Pmode, TARGET_64BIT ? 4 : 2,
-				     expected_size);
+      /* In some cases we want to use the same iterator in several adjacent
+	 loops, so save the loop iterator rtx here and don't update addresses.  */
+      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
+							   srcreg, NULL,
+							   count_exp, NULL_RTX,
+							   move_mode,
+							   unroll_factor,
+							   expected_size, false);
       break;
     case rep_prefix_8_byte:
       expand_movmem_via_rep_mov (dst, src, destreg, srcreg, count_exp,
@@ -21897,9 +22254,41 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
       LABEL_NUSES (label) = 1;
     }
 
+  /* We haven't updated the addresses yet, so do it now.
+     Also, if the epilogue is expected to be big, generate a (non-unrolled)
+     loop for it.  Do this only when the alignment is unknown, because in
+     that case the epilogue would otherwise have to copy byte by byte, which
+     is very slow.  */
+  if (alg == unrolled_loop)
+    {
+      rtx tmp;
+      if (align_unknown && unroll_factor > 1)
+	{
+	  /* Reduce the epilogue's size by emitting a non-unrolled loop.  If we
+	     didn't, the epilogue could get very big - when the alignment is
+	     statically unknown it would copy byte by byte, which may be very slow.  */
+	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
+	      srcreg, NULL, count_exp,
+	      loop_iter, move_mode, 1,
+	      expected_size, false);
+	  src = change_address (src, BLKmode, srcreg);
+	  dst = change_address (dst, BLKmode, destreg);
+	  epilogue_size_needed = GET_MODE_SIZE (move_mode);
+	}
+      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != destreg)
+	emit_move_insn (destreg, tmp);
+
+      tmp = expand_simple_binop (Pmode, PLUS, srcreg, loop_iter, srcreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != srcreg)
+	emit_move_insn (srcreg, tmp);
+    }
   if (count_exp != const0_rtx && epilogue_size_needed > 1)
     expand_movmem_epilogue (dst, src, destreg, srcreg, count_exp,
 			    epilogue_size_needed);
+
   if (jump_around_label)
     emit_label (jump_around_label);
   return true;
@@ -21917,7 +22306,37 @@ promote_duplicated_reg (enum machine_mode mode, rtx val)
   rtx tmp;
   int nops = mode == DImode ? 3 : 2;
 
+  if (VECTOR_MODE_P (mode))
+    {
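+      /* Broadcast VAL into a vector register: a constant is turned into a
+	 constant vector directly, otherwise the scalar is promoted first and
+	 then duplicated into all elements.  */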
+      enum machine_mode inner = GET_MODE_INNER (mode);
+      rtx promoted_val, vec_reg;
+      if (CONST_INT_P (val))
+	return ix86_build_const_vector (mode, true, val);
+
+      promoted_val = promote_duplicated_reg (inner, val);
+      vec_reg = gen_reg_rtx (mode);
+      switch (mode)
+	{
+	case V2DImode:
+	  emit_insn (gen_vec_dupv2di (vec_reg, promoted_val));
+	  break;
+	case V4SImode:
+	  emit_insn (gen_vec_dupv4si (vec_reg, promoted_val));
+	  break;
+	default:
+	  gcc_unreachable ();
+	  break;
+	}
+
+      return vec_reg;
+    }
   gcc_assert (mode == SImode || mode == DImode);
+  if (mode == DImode && !TARGET_64BIT)
+    {
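+      /* On 32-bit targets a DImode value can't be duplicated within a GPR,
+	 so broadcast it through V4SImode and reinterpret as V2DImode.  */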
+      rtx vec_reg = promote_duplicated_reg (V4SImode, val);
+      vec_reg = convert_to_mode (V2DImode, vec_reg, 1);
+      return vec_reg;
+    }
   if (val == const0_rtx)
     return copy_to_mode_reg (mode, const0_rtx);
   if (CONST_INT_P (val))
@@ -21983,11 +22402,27 @@ promote_duplicated_reg (enum machine_mode mode, rtx val)
 static rtx
 promote_duplicated_reg_to_size (rtx val, int size_needed, int desired_align, int align)
 {
-  rtx promoted_val;
+  rtx promoted_val = NULL_RTX;
 
-  if (TARGET_64BIT
-      && (size_needed > 4 || (desired_align > align && desired_align > 4)))
-    promoted_val = promote_duplicated_reg (DImode, val);
+  if (size_needed > 8 || (desired_align > align && desired_align > 8))
+    {
+      /* We want to promote to vector register, so we expect that at least SSE
+	 is available.  */
+      gcc_assert (TARGET_SSE);
+
+      /* In case of promotion to a vector register, we expect VAL to be a
+	 constant or a value already promoted to a GPR.  */
+      gcc_assert (GET_MODE (val) == Pmode || CONSTANT_P (val));
+      if (TARGET_64BIT)
+	promoted_val = promote_duplicated_reg (V2DImode, val);
+      else
+	promoted_val = promote_duplicated_reg (V4SImode, val);
+    }
+  else if (size_needed > 4 || (desired_align > align && desired_align > 4))
+    {
+      gcc_assert (TARGET_64BIT);
+      promoted_val = promote_duplicated_reg (DImode, val);
+    }
   else if (size_needed > 2 || (desired_align > align && desired_align > 2))
     promoted_val = promote_duplicated_reg (SImode, val);
   else if (size_needed > 1 || (desired_align > align && desired_align > 1))
@@ -22015,10 +22450,14 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
   int size_needed = 0, epilogue_size_needed;
   int desired_align = 0, align_bytes = 0;
   enum stringop_alg alg;
-  rtx promoted_val = NULL;
-  bool force_loopy_epilogue = false;
+  rtx gpr_promoted_val = NULL;
+  rtx vec_promoted_val = NULL;
   int dynamic_check;
   bool need_zero_guard = false;
+  bool align_unknown;
+  unsigned int unroll_factor;
+  enum machine_mode move_mode;
+  rtx loop_iter = NULL_RTX;
 
   if (CONST_INT_P (align_exp))
     align = INTVAL (align_exp);
@@ -22038,8 +22477,11 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
   /* Step 0: Decide on preferred algorithm, desired alignment and
      size of chunks to be copied by main loop.  */
 
-  alg = decide_alg (count, expected_size, true, &dynamic_check);
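+  /* For memset only the destination alignment matters; treat it as unknown
+     when it cannot be determined at compile time.  */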
+  align_unknown = get_mem_align_offset (dst, BITS_PER_UNIT) < 0;
+  alg = decide_alg (count, expected_size, true, &dynamic_check, align_unknown);
   desired_align = decide_alignment (align, alg, expected_size);
+  unroll_factor = 1;
+  move_mode = Pmode;
 
   if (!TARGET_ALIGN_STRINGOPS)
     align = desired_align;
@@ -22057,11 +22499,21 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       gcc_unreachable ();
     case loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode);
+      move_mode = Pmode;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * 4;
+      /* Use SSE instructions, if possible.  */
+      move_mode = TARGET_SSE
+		  ? (TARGET_64BIT ? V2DImode : V4SImode)
+		  : Pmode;
+      unroll_factor = 1;
+      /* Select the maximal available unroll factor: 1, 2 or 4.  */
+      while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
+	     && unroll_factor < 4)
+	unroll_factor *= 2;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
       size_needed = 8;
@@ -22106,8 +22558,10 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
      main loop and epilogue (ie one load of the big constant in the
      front of all code.  */
   if (CONST_INT_P (val_exp))
-    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
-						   desired_align, align);
+    gpr_promoted_val = promote_duplicated_reg_to_size (val_exp,
+						   GET_MODE_SIZE (Pmode),
+						   GET_MODE_SIZE (Pmode),
+						   align);
   /* Ensure that alignment prologue won't copy past end of block.  */
   if (size_needed > 1 || (desired_align > 1 && desired_align > align))
     {
@@ -22116,12 +22570,6 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 	 Make sure it is power of 2.  */
       epilogue_size_needed = smallest_pow2_greater_than (epilogue_size_needed);
 
-      /* To improve performance of small blocks, we jump around the VAL
-	 promoting mode.  This mean that if the promoted VAL is not constant,
-	 we might not use it in the epilogue and have to use byte
-	 loop variant.  */
-      if (epilogue_size_needed > 2 && !promoted_val)
-        force_loopy_epilogue = true;
       if (count)
 	{
 	  if (count < (unsigned HOST_WIDE_INT)epilogue_size_needed)
@@ -22161,9 +22609,11 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
   /* Step 2: Alignment prologue.  */
 
   /* Do the expensive promotion once we branched off the small blocks.  */
-  if (!promoted_val)
-    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
-						   desired_align, align);
+  if (!gpr_promoted_val)
+    gpr_promoted_val = promote_duplicated_reg_to_size (val_exp,
+						   GET_MODE_SIZE (Pmode),
+						   GET_MODE_SIZE (Pmode),
+						   align);
   gcc_assert (desired_align >= 1 && align >= 1);
 
   if (desired_align > align)
@@ -22175,17 +22625,20 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 	     the pain to maintain it for the first move, so throw away
 	     the info early.  */
 	  dst = change_address (dst, BLKmode, destreg);
-	  expand_setmem_prologue (dst, destreg, promoted_val, count_exp, align,
+	  expand_setmem_prologue (dst, destreg, gpr_promoted_val, count_exp, align,
 				  desired_align);
+	  set_mem_align (dst, desired_align * BITS_PER_UNIT);
 	}
       else
 	{
 	  /* If we know how many bytes need to be stored before dst is
 	     sufficiently aligned, maintain aliasing info accurately.  */
-	  dst = expand_constant_setmem_prologue (dst, destreg, promoted_val,
+	  dst = expand_constant_setmem_prologue (dst, destreg, gpr_promoted_val,
 						 desired_align, align_bytes);
 	  count_exp = plus_constant (count_exp, -align_bytes);
 	  count -= align_bytes;
+	  if (count < (unsigned HOST_WIDE_INT) size_needed)
+	    goto epilogue;
 	}
       if (need_zero_guard
 	  && (count < (unsigned HOST_WIDE_INT) size_needed
@@ -22213,7 +22666,7 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       emit_label (label);
       LABEL_NUSES (label) = 1;
       label = NULL;
-      promoted_val = val_exp;
+      gpr_promoted_val = val_exp;
       epilogue_size_needed = 1;
     }
   else if (label == NULL_RTX)
@@ -22227,27 +22680,34 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
     case no_stringop:
       gcc_unreachable ();
     case loop_1_byte:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
+      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, val_exp,
 				     count_exp, QImode, 1, expected_size);
       break;
     case loop:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
+      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, gpr_promoted_val,
 				     count_exp, Pmode, 1, expected_size);
       break;
     case unrolled_loop:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
-				     count_exp, Pmode, 4, expected_size);
+      vec_promoted_val =
+	promote_duplicated_reg_to_size (gpr_promoted_val,
+					GET_MODE_SIZE (move_mode),
+					desired_align, align);
+      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+				     NULL, vec_promoted_val, count_exp,
+				     NULL_RTX, move_mode, unroll_factor,
+				     expected_size, false);
       break;
     case rep_prefix_8_byte:
-      expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
+      gcc_assert (TARGET_64BIT);
+      expand_setmem_via_rep_stos (dst, destreg, gpr_promoted_val, count_exp,
 				  DImode, val_exp);
       break;
     case rep_prefix_4_byte:
-      expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
+      expand_setmem_via_rep_stos (dst, destreg, gpr_promoted_val, count_exp,
 				  SImode, val_exp);
       break;
     case rep_prefix_1_byte:
-      expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
+      expand_setmem_via_rep_stos (dst, destreg, gpr_promoted_val, count_exp,
 				  QImode, val_exp);
       break;
     }
@@ -22280,15 +22740,29 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       LABEL_NUSES (label) = 1;
     }
  epilogue:
-  if (count_exp != const0_rtx && epilogue_size_needed > 1)
+  if (alg == unrolled_loop)
     {
-      if (force_loopy_epilogue)
-	expand_setmem_epilogue_via_loop (dst, destreg, val_exp, count_exp,
-					 epilogue_size_needed);
-      else
-	expand_setmem_epilogue (dst, destreg, promoted_val, count_exp,
-				epilogue_size_needed);
+      rtx tmp;
+      if (align_unknown && unroll_factor > 1)
+	{
+	  /* Reduce the epilogue's size by creating a not-unrolled loop.  If we don't
+	     do this, we can have a very big epilogue - when alignment is statically
+	     unknown we'll emit the epilogue byte by byte, which may be very slow.  */
+	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+	      NULL, vec_promoted_val, count_exp,
+	      loop_iter, move_mode, 1,
+	      expected_size, false);
+	  dst = change_address (dst, BLKmode, destreg);
+	  epilogue_size_needed = GET_MODE_SIZE (move_mode);
+	}
+      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != destreg)
+	emit_move_insn (destreg, tmp);
     }
+  if (count_exp != const0_rtx && epilogue_size_needed > 1)
+    expand_setmem_epilogue (dst, destreg, vec_promoted_val, gpr_promoted_val,
+			    val_exp, count_exp, epilogue_size_needed);
   if (jump_around_label)
     emit_label (jump_around_label);
   return true;
@@ -37436,6 +37910,33 @@ ix86_autovectorize_vector_sizes (void)
   return (TARGET_AVX && !TARGET_PREFER_AVX128) ? 32 | 16 : 0;
 }
 
+/* Target hook.  Prevent unaligned access to data in vector modes.  */
+
+static bool
+ix86_slow_unaligned_access (enum machine_mode mode,
+			    unsigned int align)
+{
+  if (TARGET_AVX)
+    {
+      if (GET_MODE_SIZE (mode) == 32)
+	{
+	  if (align <= 16)
+	    return (TARGET_AVX256_SPLIT_UNALIGNED_LOAD ||
+		    TARGET_AVX256_SPLIT_UNALIGNED_STORE);
+	  else
+	    return false;
+	}
+    }
+
+  if (GET_MODE_SIZE (mode) > 8)
+    {
+      return (! TARGET_SSE_UNALIGNED_LOAD_OPTIMAL &&
+	      ! TARGET_SSE_UNALIGNED_STORE_OPTIMAL);
+    }
+
+  return false;
+}
+
 /* Initialize the GCC target structure.  */
 #undef TARGET_RETURN_IN_MEMORY
 #define TARGET_RETURN_IN_MEMORY ix86_return_in_memory
@@ -37743,6 +38244,9 @@ ix86_autovectorize_vector_sizes (void)
 #undef TARGET_CONDITIONAL_REGISTER_USAGE
 #define TARGET_CONDITIONAL_REGISTER_USAGE ix86_conditional_register_usage
 
+#undef TARGET_SLOW_UNALIGNED_ACCESS
+#define TARGET_SLOW_UNALIGNED_ACCESS ix86_slow_unaligned_access
+
 #if TARGET_MACHO
 #undef TARGET_INIT_LIBFUNCS
 #define TARGET_INIT_LIBFUNCS darwin_rename_builtins
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index bd69ec2..550b2ab 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -159,8 +159,12 @@ struct processor_costs {
   const int fchs;		/* cost of FCHS instruction.  */
   const int fsqrt;		/* cost of FSQRT instruction.  */
 				/* Specify what algorithm
-				   to use for stringops on unknown size.  */
-  struct stringop_algs memcpy[2], memset[2];
+				   to use for stringops on unknown size.
+				   First index is used to specify whether
+				   alignment is known or not.
+				   Second - to specify whether 32 or 64 bits
+				   are used.  */
+  struct stringop_algs memcpy[2][2], memset[2][2];
   const int scalar_stmt_cost;   /* Cost of any scalar operation, excluding
 				   load and store.  */
   const int scalar_load_cost;   /* Cost of scalar load.  */
@@ -1712,7 +1716,7 @@ typedef struct ix86_args {
 /* If a clear memory operation would take CLEAR_RATIO or more simple
    move-instruction sequences, we will do a clrmem or libcall instead.  */
 
-#define CLEAR_RATIO(speed) ((speed) ? MIN (6, ix86_cost->move_ratio) : 2)
+#define CLEAR_RATIO(speed) ((speed) ? ix86_cost->move_ratio : 2)
 
 /* Define if shifts truncate the shift count which implies one can
    omit a sign-extension or zero-extension of a shift count.
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 9c9508d..bd38e48 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -15905,6 +15905,17 @@
 	      (clobber (reg:CC FLAGS_REG))])]
   ""
 {
+  rtx vec_reg;
+  enum machine_mode mode = GET_MODE (operands[2]);
+  if (vector_extensions_used_for_mode (mode)
+      && CONSTANT_P (operands[2]))
+    {
+      if (mode == DImode)
+	mode = TARGET_64BIT ? V2DImode : V4SImode;
+      vec_reg = gen_reg_rtx (mode);
+      emit_move_insn (vec_reg, operands[2]);
+      operands[2] = vec_reg;
+    }
   if (GET_MODE (operands[1]) != GET_MODE (operands[2]))
     operands[1] = adjust_address_nv (operands[1], GET_MODE (operands[2]), 0);
 
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index ff77003..b8ecc59 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -7426,6 +7426,13 @@
    (set_attr "prefix" "maybe_vex,maybe_vex,orig,orig,vex")
    (set_attr "mode" "TI,TI,V4SF,SF,SF")])
 
+(define_expand "sse2_loadq"
+ [(set (match_operand:V2DI 0 "register_operand")
+       (vec_concat:V2DI
+	 (match_operand:DI 1 "memory_operand")
+	 (const_int 0)))]
+  "!TARGET_64BIT && TARGET_SSE2")
+
 (define_insn_and_split "sse2_stored"
   [(set (match_operand:SI 0 "nonimmediate_operand" "=xm,r")
 	(vec_select:SI
@@ -7537,6 +7544,16 @@
    (set_attr "prefix" "maybe_vex,orig,vex,maybe_vex,orig,orig")
    (set_attr "mode" "V2SF,TI,TI,TI,V4SF,V2SF")])
 
+(define_expand "vec_dupv4si"
+  [(set (match_operand:V4SI 0 "register_operand" "")
+	(vec_duplicate:V4SI
+	  (match_operand:SI 1 "nonimmediate_operand" "")))]
+  "TARGET_SSE"
+{
+  if (!TARGET_AVX)
+    operands[1] = force_reg (V4SImode, operands[1]);
+})
+
 (define_insn "*vec_dupv4si_avx"
   [(set (match_operand:V4SI 0 "register_operand"     "=x,x")
 	(vec_duplicate:V4SI
@@ -7578,6 +7595,16 @@
    (set_attr "prefix" "orig,vex,maybe_vex")
    (set_attr "mode" "TI,TI,DF")])
 
+(define_expand "vec_dupv2di"
+  [(set (match_operand:V2DI 0 "register_operand" "")
+	(vec_duplicate:V2DI
+	  (match_operand:DI 1 "nonimmediate_operand" "")))]
+  "TARGET_SSE"
+{
+  if (!TARGET_AVX)
+    operands[1] = force_reg (V2DImode, operands[1]);
+})
+
 (define_insn "*vec_dupv2di"
   [(set (match_operand:V2DI 0 "register_operand" "=x,x")
 	(vec_duplicate:V2DI
diff --git a/gcc/cse.c b/gcc/cse.c
index ae67685..3b6471d 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -4616,7 +4616,10 @@ cse_insn (rtx insn)
 		 to fold switch statements when an ADDR_DIFF_VEC is used.  */
 	      || (GET_CODE (src_folded) == MINUS
 		  && GET_CODE (XEXP (src_folded, 0)) == LABEL_REF
-		  && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF)))
+		  && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF))
+	      /* Don't propagate vector-constants, as for now no architecture
+		 supports vector immediates.  */
+	  && !vector_extensions_used_for_mode (mode))
 	src_const = src_folded, src_const_elt = elt;
       else if (src_const == 0 && src_eqv_here && CONSTANT_P (src_eqv_here))
 	src_const = src_eqv_here, src_const_elt = src_eqv_elt;
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 90cef1c..844ed17 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5780,6 +5780,25 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum machine_mode @var{mode}, unsigned int @var{align})
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options require building faster code, the hook can
+be used to prevent access to unaligned data in some set of modes even if the
+processor can do the access without a trap.
+
+By default the hook returns the value of @code{SLOW_UNALIGNED_ACCESS} if
+that macro is defined, and @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6252,23 +6271,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 187122e..c32e745 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -5718,6 +5718,25 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@hook TARGET_SLOW_UNALIGNED_ACCESS
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options require building faster code, the hook can
+be used to prevent access to unaligned data in some set of modes even if the
+processor can do the access without a trap.
+
+By default the hook returns the value of @code{SLOW_UNALIGNED_ACCESS} if
+that macro is defined, and @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6190,23 +6209,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c
index 8465237..ff568b1 100644
--- a/gcc/emit-rtl.c
+++ b/gcc/emit-rtl.c
@@ -1495,6 +1495,12 @@ get_mem_align_offset (rtx mem, unsigned int align)
       if (TYPE_ALIGN (TREE_TYPE (expr)) < (unsigned int) align)
 	return -1;
     }
+  else if (TREE_CODE (expr) == MEM_REF)
+    {
+      if (get_object_alignment_1 (expr, &offset) < align)
+	return -1;
+      offset /= BITS_PER_UNIT;
+    }
   else if (TREE_CODE (expr) == COMPONENT_REF)
     {
       while (1)
@@ -2058,7 +2064,6 @@ adjust_address_1 (rtx memref, enum machine_mode mode, HOST_WIDE_INT offset,
   enum machine_mode address_mode;
   int pbits;
   struct mem_attrs attrs, *defattrs;
-  unsigned HOST_WIDE_INT max_align;
 
   attrs = *get_mem_attrs (memref);
 
@@ -2115,8 +2120,12 @@ adjust_address_1 (rtx memref, enum machine_mode mode, HOST_WIDE_INT offset,
      if zero.  */
   if (offset != 0)
     {
-      max_align = (offset & -offset) * BITS_PER_UNIT;
-      attrs.align = MIN (attrs.align, max_align);
+      int old_offset = get_mem_align_offset (memref, MOVE_MAX*BITS_PER_UNIT);
+      if (old_offset >= 0)
+	attrs.align = compute_align_by_offset (old_offset + attrs.offset);
+      else
+	attrs.align = MIN (attrs.align,
+	      (unsigned HOST_WIDE_INT) (offset & -offset) * BITS_PER_UNIT);
     }
 
   /* We can compute the size in a number of ways.  */
diff --git a/gcc/expr.c b/gcc/expr.c
index b020978..76d131d 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -811,7 +811,7 @@ alignment_for_piecewise_move (unsigned int max_pieces, unsigned int align)
 	   tmode != VOIDmode;
 	   xmode = tmode, tmode = GET_MODE_WIDER_MODE (tmode))
 	if (GET_MODE_SIZE (tmode) > max_pieces
-	    || SLOW_UNALIGNED_ACCESS (tmode, align))
+	    || targetm.slow_unaligned_access (tmode, align))
 	  break;
 
       align = MAX (align, GET_MODE_ALIGNMENT (xmode));
@@ -820,6 +820,16 @@ alignment_for_piecewise_move (unsigned int max_pieces, unsigned int align)
   return align;
 }
 
+/* Given an offset from an alignment boundary,
+   compute the maximal alignment of the offsetted data.  */
+unsigned int
+compute_align_by_offset (int offset)
+{
+    return (offset == 0) ?
+	    BIGGEST_ALIGNMENT :
+	    MIN (BIGGEST_ALIGNMENT, (offset & -offset) * BITS_PER_UNIT);
+}
+
 /* Return the widest integer mode no wider than SIZE.  If no such mode
    can be found, return VOIDmode.  */
 
@@ -836,6 +846,48 @@ widest_int_mode_for_size (unsigned int size)
   return mode;
 }
 
+/* If MODE is a scalar mode, find the corresponding preferred vector mode.
+   If no such mode can be found, return the vector mode corresponding to Pmode
+   (a kind of default vector mode).
+   For vector modes return the mode itself.  */
+
+static enum machine_mode
+vector_mode_for_mode (enum machine_mode mode)
+{
+  enum machine_mode xmode;
+  if (VECTOR_MODE_P (mode))
+    return mode;
+  xmode = targetm.vectorize.preferred_simd_mode (mode);
+  if (VECTOR_MODE_P (xmode))
+    return xmode;
+
+  return targetm.vectorize.preferred_simd_mode (Pmode);
+}
+
+/* The routine checks whether vector instructions are required for operating
+   with the specified mode.
+   For vector modes it checks whether the corresponding vector extension is
+   supported.
+   Operations with a scalar mode will use vector extensions if this scalar
+   mode is wider than the default scalar mode (Pmode) and the vector extension
+   for the parent vector mode is available.  */
+
+bool vector_extensions_used_for_mode (enum machine_mode mode)
+{
+  enum machine_mode vector_mode = vector_mode_for_mode (mode);
+
+  if (VECTOR_MODE_P (mode))
+    return targetm.vector_mode_supported_p (mode);
+
+  /* mode is a scalar mode.  */
+  if (VECTOR_MODE_P (vector_mode)
+     && targetm.vector_mode_supported_p (vector_mode)
+     && (GET_MODE_SIZE (mode) > GET_MODE_SIZE (Pmode)))
+    return true;
+
+  return false;
+}
+
 /* STORE_MAX_PIECES is the number of bytes at a time that we can
    store efficiently.  Due to internal GCC limitations, this is
    MOVE_MAX_PIECES limited by the number of bytes GCC can represent
@@ -1680,7 +1732,7 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src, tree type, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (src)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (src))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (src))
 	      || MEM_ALIGN (src) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -2070,7 +2122,7 @@ emit_group_store (rtx orig_dst, rtx src, tree type ATTRIBUTE_UNUSED, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (dest)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (dest))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (dest))
 	      || MEM_ALIGN (dest) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -4034,7 +4086,7 @@ emit_push_insn (rtx x, enum machine_mode mode, tree type, rtx size,
 	  /* Here we avoid the case of a structure whose weak alignment
 	     forces many pushes of a small amount of data,
 	     and such small pushes do rounding that causes trouble.  */
-	  && ((! SLOW_UNALIGNED_ACCESS (word_mode, align))
+	  && ((! targetm.slow_unaligned_access (word_mode, align))
 	      || align >= BIGGEST_ALIGNMENT
 	      || (PUSH_ROUNDING (align / BITS_PER_UNIT)
 		  == (align / BITS_PER_UNIT)))
@@ -6325,7 +6377,7 @@ store_field (rtx target, HOST_WIDE_INT bitsize, HOST_WIDE_INT bitpos,
       || (mode != BLKmode
 	  && ((((MEM_ALIGN (target) < GET_MODE_ALIGNMENT (mode))
 		|| bitpos % GET_MODE_ALIGNMENT (mode))
-	       && SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (target)))
+	       && targetm.slow_unaligned_access (mode, MEM_ALIGN (target)))
 	      || (bitpos % BITS_PER_UNIT != 0)))
       /* If the RHS and field are a constant size and the size of the
 	 RHS isn't the same size as the bitfield, we must use bitfield
@@ -9738,7 +9790,7 @@ expand_expr_real_1 (tree exp, rtx target, enum machine_mode tmode,
 		     && ((modifier == EXPAND_CONST_ADDRESS
 			  || modifier == EXPAND_INITIALIZER)
 			 ? STRICT_ALIGNMENT
-			 : SLOW_UNALIGNED_ACCESS (mode1, MEM_ALIGN (op0))))
+			 : targetm.slow_unaligned_access (mode1, MEM_ALIGN (op0))))
 		    || (bitpos % BITS_PER_UNIT != 0)))
 	    /* If the type and the field are a constant size and the
 	       size of the type isn't the same size as the bitfield,
diff --git a/gcc/expr.h b/gcc/expr.h
index 1bf1369..6f697d7 100644
--- a/gcc/expr.h
+++ b/gcc/expr.h
@@ -706,4 +706,8 @@ extern tree build_libfunc_function (const char *);
 /* Get the personality libfunc for a function decl.  */
 rtx get_personality_function (tree);
 
+/* Given an offset from the maximum alignment boundary, compute the maximum
+   alignment that can be assumed.  */
+unsigned int compute_align_by_offset (int);
+
 #endif /* GCC_EXPR_H */
diff --git a/gcc/fwprop.c b/gcc/fwprop.c
index 5368d18..cbbb75a 100644
--- a/gcc/fwprop.c
+++ b/gcc/fwprop.c
@@ -1273,6 +1273,10 @@ forward_propagate_and_simplify (df_ref use, rtx def_insn, rtx def_set)
       return false;
     }
 
+  /* Don't propagate vector-constants.  */
+  if (vector_extensions_used_for_mode (GET_MODE (reg)) && CONSTANT_P (src))
+      return false;
+
   if (asm_use >= 0)
     return forward_propagate_asm (use, def_insn, def_set, reg);
 
diff --git a/gcc/optabs.c b/gcc/optabs.c
index a373d7a..f42bd9e 100644
--- a/gcc/optabs.c
+++ b/gcc/optabs.c
@@ -770,6 +770,47 @@ expand_vector_broadcast (enum machine_mode vmode, rtx op)
   return ret;
 }
 
+/* Create a new vector value in VMODE with all bytes set to VAL.  The
+   mode of VAL must be QImode or it should be a constant.
+   If VAL is a constant, then the return value will be a constant.  */
+
+extern rtx
+expand_vector_broadcast_of_byte_value (enum machine_mode vmode, rtx val)
+{
+  enum insn_code icode;
+  rtvec vec;
+  rtx ret;
+  int i, n;
+
+  enum machine_mode byte_vmode;
+
+  gcc_checking_assert (VECTOR_MODE_P (vmode));
+  gcc_assert (CONSTANT_P (val) || GET_MODE (val) == QImode);
+  byte_vmode = mode_for_vector (QImode, GET_MODE_SIZE (vmode));
+
+  n = GET_MODE_NUNITS (byte_vmode);
+  vec = rtvec_alloc (n);
+  for (i = 0; i < n; ++i)
+    RTVEC_ELT (vec, i) = val;
+
+  if (CONSTANT_P (val))
+    ret = gen_rtx_CONST_VECTOR (byte_vmode, vec);
+  else
+    {
+      icode = optab_handler (vec_init_optab, byte_vmode);
+      if (icode == CODE_FOR_nothing)
+	return NULL;
+
+      ret = gen_reg_rtx (byte_vmode);
+      emit_insn (GEN_FCN (icode) (ret, gen_rtx_PARALLEL (byte_vmode, vec)));
+    }
+
+  if (vmode != byte_vmode)
+    ret = convert_to_mode (vmode, ret, 1);
+
+  return ret;
+}
+
 /* This subroutine of expand_doubleword_shift handles the cases in which
    the effective shift value is >= BITS_PER_WORD.  The arguments and return
    value are the same as for the parent routine, except that SUPERWORD_OP1
diff --git a/gcc/optabs.h b/gcc/optabs.h
index 926d21f..dca1742 100644
--- a/gcc/optabs.h
+++ b/gcc/optabs.h
@@ -1147,4 +1147,5 @@ extern void expand_jump_insn (enum insn_code icode, unsigned int nops,
 extern rtx prepare_operand (enum insn_code, rtx, int, enum machine_mode,
 			    enum machine_mode, int);
 
+extern rtx expand_vector_broadcast_of_byte_value (enum machine_mode, rtx);
 #endif /* GCC_OPTABS_H */
diff --git a/gcc/rtl.h b/gcc/rtl.h
index f13485e..4ec67c7 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -2513,6 +2513,9 @@ extern void emit_jump (rtx);
 /* In expr.c */
 extern rtx move_by_pieces (rtx, rtx, unsigned HOST_WIDE_INT,
 			   unsigned int, int);
+/* Check if vector instructions are required for operating with the
+   specified mode.  */
+bool vector_extensions_used_for_mode (enum machine_mode);
 extern HOST_WIDE_INT find_args_size_adjust (rtx);
 extern int fixup_args_size_notes (rtx, rtx, int);
 
diff --git a/gcc/target.def b/gcc/target.def
index c3bec0e..76cf291 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1498,6 +1498,14 @@ DEFHOOK
  bool, (struct ao_ref_s *ref),
  default_ref_may_alias_errno)
 
+/* True if access to unaligned data in given mode is too slow or
+   prohibited.  */
+DEFHOOK
+(slow_unaligned_access,
+ "",
+ bool, (enum machine_mode mode, unsigned int align),
+ default_slow_unaligned_access)
+
 /* Support for named address spaces.  */
 #undef HOOK_PREFIX
 #define HOOK_PREFIX "TARGET_ADDR_SPACE_"
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index 81fd12f..e70ecba 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -1442,4 +1442,15 @@ default_pch_valid_p (const void *data_p, size_t len)
   return NULL;
 }
 
+bool
+default_slow_unaligned_access (enum machine_mode mode ATTRIBUTE_UNUSED,
+			       unsigned int align ATTRIBUTE_UNUSED)
+{
+#ifdef SLOW_UNALIGNED_ACCESS
+  return SLOW_UNALIGNED_ACCESS (mode, align);
+#else
+  return STRICT_ALIGNMENT;
+#endif
+}
+
 #include "gt-targhooks.h"
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index f19fb50..ace8686 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -175,3 +175,5 @@ extern enum machine_mode default_get_reg_raw_mode(int);
 
 extern void *default_get_pch_validity (size_t *);
 extern const char *default_pch_valid_p (const void *, size_t);
+extern bool default_slow_unaligned_access (enum machine_mode mode,
+					   unsigned int align);
diff --git a/gcc/testsuite/gcc.target/i386/sw-1.c b/gcc/testsuite/gcc.target/i386/sw-1.c
index 483d117..e3d3b91 100644
--- a/gcc/testsuite/gcc.target/i386/sw-1.c
+++ b/gcc/testsuite/gcc.target/i386/sw-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fshrink-wrap -fdump-rtl-pro_and_epilogue" } */
+/* { dg-options "-O2 -fshrink-wrap -fdump-rtl-pro_and_epilogue -mstringop-strategy=rep_byte" } */
 
 #include <string.h>
 


* Re: Use of vector instructions in memmov/memset expanding
  2011-11-03 12:56                                       ` Michael Zolotukhin
@ 2011-11-06 14:28                                         ` Jan Hubicka
  2011-11-07 15:52                                         ` Jan Hubicka
  1 sibling, 0 replies; 52+ messages in thread
From: Jan Hubicka @ 2011-11-06 14:28 UTC (permalink / raw)
  To: Michael Zolotukhin
  Cc: Jan Hubicka, Richard Henderson, Jakub Jelinek, Jack Howarth,
	gcc-patches, Richard Guenther, H.J. Lu, izamyatin,
	areg.melikadamyan

> 
> There is a rolled loop algorithm that doesn't use SSE modes - such
> architectures could use it instead of unrolled_loop. I think the
> performance wouldn't suffer much from that.
> For most modern processors, SSE moves are faster than several
> word-sized moves, so this change in the unrolled_loop implementation seems
> reasonable to me, but, of course, if you think introducing
> sse_unrolled_move is worth doing, it could be done.

This doesn't seem to be quite true for AMD chips.  With your change I get on
amdfam10 hardware:
memset
                   libcall   rep1   noalg    rep4   noalg    rep8   noalg    loop   noalg    unrl   noalg    byte profiled dynamic
block size 8192000 0:00.62 0:00.55 0:00.54 0:00.54 0:00.60 0:00.54 0:00.57 0:00.54 0:00.57 0:00.54 0:00.60 0:03.63 0:00.62 0:00.62 best: 0:00.54 loop
block size  819200 0:00.65 0:00.64 0:00.64 0:00.64 0:00.63 0:00.64 0:00.62 0:00.64 0:00.66 0:00.66 0:00.69 0:04.10 0:00.66 0:00.66 best: 0:00.62 rep8noalign
block size   81920 0:00.21 0:00.21 0:00.21 0:00.21 0:00.27 0:00.21 0:00.21 0:00.20 0:00.27 0:00.20 0:00.25 0:04.18 0:00.21 0:00.21 best: 0:00.20 loop
block size   20480 0:00.18 0:00.18 0:00.18 0:00.18 0:00.24 0:00.18 0:00.20 0:00.20 0:00.28 0:00.17 0:00.23 0:04.29 0:00.18 0:00.18 best: 0:00.17 unrl
block size    8192 0:00.15 0:00.15 0:00.15 0:00.15 0:00.21 0:00.15 0:00.17 0:00.25 0:00.28 0:00.14 0:00.20 0:01.26 0:00.14 0:00.15 best: 0:00.14 unrl
block size    4096 0:00.15 0:00.15 0:00.16 0:00.15 0:00.21 0:00.14 0:00.16 0:00.25 0:00.23 0:00.14 0:00.19 0:01.24 0:00.15 0:00.15 best: 0:00.14 rep8
block size    2048 0:00.16 0:00.18 0:00.18 0:00.17 0:00.23 0:00.16 0:00.18 0:00.26 0:00.21 0:00.17 0:00.21 0:01.25 0:00.16 0:00.16 best: 0:00.16 libcall
block size    1024 0:00.19 0:00.24 0:00.24 0:00.20 0:00.25 0:00.17 0:00.19 0:00.28 0:00.23 0:00.21 0:00.26 0:01.26 0:00.17 0:00.16 best: 0:00.17 rep8
block size     512 0:00.23 0:00.34 0:00.33 0:00.23 0:00.26 0:00.20 0:00.22 0:00.31 0:00.27 0:00.27 0:00.29 0:01.29 0:00.20 0:00.19 best: 0:00.20 rep8
block size     256 0:00.29 0:00.51 0:00.51 0:00.28 0:00.30 0:00.25 0:00.26 0:00.38 0:00.35 0:00.39 0:00.38 0:01.33 0:00.24 0:00.25 best: 0:00.25 rep8
block size     128 0:00.39 0:00.76 0:00.76 0:00.40 0:00.42 0:00.38 0:00.40 0:00.52 0:00.51 0:00.55 0:00.52 0:01.41 0:00.37 0:00.38 best: 0:00.38 rep8
block size      64 0:00.72 0:00.95 0:00.95 0:00.70 0:00.73 0:00.65 0:00.72 0:00.75 0:00.75 0:00.74 0:00.76 0:01.48 0:00.64 0:00.65 best: 0:00.65 rep8
block size      48 0:00.89 0:00.98 0:01.12 0:00.86 0:00.72 0:00.83 0:00.88 0:00.94 0:00.92 0:00.92 0:00.91 0:01.71 0:00.93 0:00.67 best: 0:00.72 rep4noalign
block size      32 0:01.18 0:01.30 0:01.30 0:01.11 0:01.13 0:01.11 0:01.13 0:01.20 0:01.19 0:01.13 0:01.19 0:01.79 0:01.15 0:01.11 best: 0:01.11 rep4
block size      24 0:01.57 0:01.71 0:01.71 0:01.52 0:01.51 0:01.52 0:01.52 0:01.57 0:01.56 0:01.49 0:01.52 0:02.30 0:01.46 0:01.53 best: 0:01.49 unrl
block size      16 0:02.53 0:02.61 0:02.61 0:02.63 0:02.52 0:02.64 0:02.61 0:02.56 0:02.52 0:02.25 0:02.50 0:03.08 0:02.26 0:02.63 best: 0:02.25 unrl
block size      14 0:02.73 0:02.77 0:02.77 0:02.62 0:02.58 0:02.64 0:02.59 0:02.60 0:02.61 Command terminated by signal 11 0:00.00 Command terminated by signal 11 0:00.00 0:03.58 0:02.48 0:02.67 best: 0:02.58 rep4noalign
block size      12 0:03.29 0:03.09 0:03.08 0:03.02 0:02.98 0:03.06 0:02.96 0:02.89 0:02.96 Command terminated by signal 11 0:00.00 Command terminated by signal 11 0:00.00 0:03.89 0:02.70 0:03.05 best: 0:02.89 loop
block size      10 0:03.58 0:03.64 0:03.60 0:03.58 0:03.31 0:03.52 0:03.38 0:03.36 0:03.38 Command terminated by signal 11 0:00.00 Command terminated by signal 11 0:00.00 0:04.42 0:03.10 0:03.43 best: 0:03.31 rep4noalign
block size       8 0:04.19 0:03.76 0:03.75 0:03.98 0:03.83 0:03.82 0:03.70 0:03.70 0:03.80 Command terminated by signal 11 0:00.00 Command terminated by signal 11 0:00.00 0:04.68 Command terminated by signal 11 0:00.00 0:03.87 best: 0:03.70 loop
block size       6 0:06.20 0:05.66 0:05.67 0:05.69 0:05.60 0:05.73 0:05.53 0:05.56 0:05.46 Command terminated by signal 11 0:00.00 Command terminated by signal 11 0:00.00 0:06.23 0:05.60 0:05.65 best: 0:05.46 loopnoalign
block size       4 0:09.58 0:06.93 0:06.94 0:07.13 0:07.30 0:07.05 0:06.94 0:07.05 0:07.28 Command terminated by signal 11 0:00.00 Command terminated by signal 11 0:00.00 0:07.37 Command terminated by signal 11 0:00.00 0:07.46 best: 0:06.93 rep1
block size       1 0:38.46 0:17.27 0:17.27 0:15.14 0:15.11 0:16.34 0:15.10 0:16.38 0:15.11 Command terminated by signal 11 0:00.00 0:15.22 0:16.33 0:14.87 0:16.31 best: 0:15.10 rep8noalign

The ICEs for the SSE loop with blocks < 16 bytes need to be solved.
memset
                   libcall   rep1   noalg    rep4   noalg    rep8   noalg    loop   noalg    unrl   noalg    byte profiled dynamic
block size 8192000 0:00.62 0:00.55 0:00.55 0:00.55 0:00.60 0:00.55 0:00.57 0:00.54 0:00.59 0:00.55 0:00.57 0:03.60 0:00.61 0:00.61 best: 0:00.54 loop
block size  819200 0:00.62 0:00.61 0:00.62 0:00.61 0:00.40 0:00.63 0:00.40 0:00.65 0:00.50 0:00.65 0:00.40 0:03.84 0:00.35 0:00.65 best: 0:00.40 rep4noalign
block size   81920 0:00.21 0:00.21 0:00.21 0:00.21 0:00.27 0:00.21 0:00.22 0:00.27 0:00.40 0:00.21 0:00.22 0:04.47 0:00.21 0:00.21 best: 0:00.21 libcall
block size   20480 0:00.18 0:00.18 0:00.18 0:00.18 0:00.24 0:00.18 0:00.20 0:00.28 0:00.30 0:00.17 0:00.20 0:04.07 0:00.18 0:00.18 best: 0:00.17 unrl
block size    8192 0:00.15 0:00.15 0:00.15 0:00.15 0:00.21 0:00.15 0:00.16 0:00.17 0:00.18 0:00.14 0:00.17 0:01.27 0:00.15 0:00.15 best: 0:00.14 unrl
block size    4096 0:00.15 0:00.16 0:00.16 0:00.15 0:00.21 0:00.15 0:00.17 0:00.18 0:00.17 0:00.14 0:00.17 0:01.24 0:00.15 0:00.15 best: 0:00.14 unrl
block size    2048 0:00.16 0:00.18 0:00.19 0:00.17 0:00.23 0:00.16 0:00.18 0:00.19 0:00.19 0:00.16 0:00.19 0:01.25 0:00.16 0:00.16 best: 0:00.16 libcall
block size    1024 0:00.19 0:00.24 0:00.24 0:00.21 0:00.26 0:00.17 0:00.19 0:00.22 0:00.21 0:00.21 0:00.22 0:01.27 0:00.17 0:00.17 best: 0:00.17 rep8
block size     512 0:00.22 0:00.34 0:00.33 0:00.23 0:00.26 0:00.20 0:00.22 0:00.25 0:00.25 0:00.28 0:00.28 0:01.29 0:00.20 0:00.21 best: 0:00.20 rep8
block size     256 0:00.29 0:00.51 0:00.51 0:00.29 0:00.31 0:00.25 0:00.27 0:00.35 0:00.33 0:00.39 0:00.37 0:01.33 0:00.25 0:00.25 best: 0:00.25 rep8
block size     128 0:00.39 0:00.76 0:00.76 0:00.40 0:00.42 0:00.38 0:00.41 0:00.50 0:00.50 0:00.60 0:00.54 0:01.41 0:00.38 0:00.33 best: 0:00.38 rep8
block size      64 0:00.54 0:00.86 0:00.86 0:00.55 0:00.57 0:00.53 0:00.58 0:00.73 0:00.71 0:00.95 0:00.84 0:01.47 0:00.54 0:00.53 best: 0:00.53 rep8
block size      48 0:00.71 0:00.98 0:00.98 0:00.69 0:00.72 0:00.69 0:00.75 0:00.91 0:00.87 0:01.28 0:01.07 0:01.69 0:01.11 0:00.69 best: 0:00.69 rep4
block size      32 0:00.97 0:01.14 0:01.14 0:00.89 0:00.91 0:00.87 0:00.93 0:01.08 0:01.10 0:01.47 0:01.37 0:01.70 0:01.36 0:00.87 best: 0:00.87 rep8
block size      24 0:01.55 0:01.41 0:01.72 0:01.52 0:01.53 0:01.56 0:01.27 0:01.65 0:01.60 0:02.13 0:02.05 0:02.29 0:01.98 0:01.56 best: 0:01.27 rep8noalign
block size      16 0:02.51 0:02.60 0:02.59 0:02.63 0:02.34 0:02.61 0:02.73 0:02.63 0:02.55 0:02.86 0:03.33 0:03.09 0:03.26 0:02.65 best: 0:02.34 rep4noalign
block size      14 0:02.18 0:02.24 0:02.24 0:02.09 0:02.08 0:02.19 0:02.16 0:02.15 0:02.17 0:02.99 0:03.06 0:02.96 0:02.74 0:02.76 best: 0:02.08 rep4noalign
block size      12 0:03.28 0:03.08 0:03.07 0:02.92 0:02.93 0:03.12 0:03.10 0:02.92 0:02.98 0:03.72 0:03.60 0:03.77 0:03.64 0:02.92 best: 0:02.92 loop
block size      10 0:03.49 0:03.56 0:03.56 0:03.60 0:03.39 0:03.80 0:03.73 0:03.51 0:03.65 0:04.59 0:04.50 0:04.53 0:04.55 0:03.72 best: 0:03.39 rep4noalign
block size       8 0:04.31 0:04.26 0:04.26 0:04.31 0:04.46 0:04.15 0:04.14 0:04.23 0:04.13 0:05.00 0:04.59 0:05.07 0:04.99 0:04.18 best: 0:04.13 loopnoalign
block size       6 0:06.69 0:06.06 0:06.05 0:06.26 0:05.83 0:06.09 0:05.93 0:05.91 0:05.75 0:06.42 0:06.27 0:06.39 0:06.75 0:05.94 best: 0:05.75 loopnoalign
block size       4 0:10.18 0:07.51 0:07.51 0:07.69 0:07.51 0:07.39 0:07.41 0:07.35 0:07.41 0:07.38 0:07.39 0:07.39 0:07.34 0:07.37 best: 0:07.35 loop
block size       1 0:37.06 0:15.82 0:15.82 0:14.65 0:11.15 0:14.57 0:14.48 0:14.48 0:14.46 0:14.35 0:14.31 0:15.17 0:14.19 0:14.66 best: 0:11.15 rep4noalign

> It didn't hurt performance - quite the contrary, it was done to avoid
> performance problems.
> The point of this is the following. If we generated an unrolled loop
> with SSE moves and a prologue with alignment checks, then we wouldn't
> know how many bytes would be left after the main loop. So in the
> epilogue we'd generate a loop with a trip count unknown at compile
> time. In the previous implementation such a loop simply used byte
> moves, so in the worst case we'd have (UnrollFactor*VectorWidth-1)
> byte moves. Now, before such a byte loop, we generate a rolled loop
> with SSE moves. This loop would iterate at most (UnrollFactor-1)
> times, but that still would greatly improve performance.
> 
> Finally, we would have something like this:
> main_loop:
>    sse_move
>    sse_move
>    sse_move
>    sse_move
>    iter += 4*vector_size
>    if (iter < count ) goto main_loop
> epilogue_sse_loop:
>    sse_move
>    iter += vector_size
>    if (iter < count ) goto epilogue_sse_loop
> epilogue_byte_loop:
>    byte_move
>    iter ++
>    if (iter < count ) goto epilogue_byte_loop

Hmm, OK.  I misremembered how we generate the epilogue now - it is quite a while since
I looked into this last time.
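
To make the epilogue structure quoted above concrete, here is a minimal C
model of the generated code (illustrative only; VEC, UNROLL and the helper
name are assumptions, not the actual expanded RTL):

  #include <stddef.h>

  #define VEC 16     /* bytes written by one SSE store (assumed)  */
  #define UNROLL 4   /* unroll factor of the main loop (assumed)  */

  static void
  model_memset (unsigned char *dst, unsigned char val, size_t count)
  {
    size_t iter = 0, i;

    /* main_loop: UNROLL vector-sized stores per iteration.  */
    for (; iter + UNROLL * VEC <= count; iter += UNROLL * VEC)
      for (i = 0; i < UNROLL * VEC; i++)
        dst[iter + i] = val;

    /* epilogue_sse_loop: runs at most UNROLL - 1 times.  */
    for (; iter + VEC <= count; iter += VEC)
      for (i = 0; i < VEC; i++)
        dst[iter + i] = val;

    /* epilogue_byte_loop: now bounded by VEC - 1 iterations.  */
    for (; iter < count; iter++)
      dst[iter] = val;
  }

So with UNROLL == 4 and VEC == 16 the byte loop handles at most 15 trailing
bytes instead of up to 63.
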
> > This seems quite dubious.  The instruction pattern representing the move should
> > refuse the constants via its condition or predicates.
> 
> It does, but it can't generate efficient code if such constants are
> propagated. Take a look at the next example.
> Suppose we'd generate something like this:
>     v4si_reg = (0 | 0 | 0 | 0)
>     sse_mov [mem], v4si_reg
>     sse_mov [mem+16], v4si_reg
> 
> After constant propagation we'd have:
>     v4si_reg = (0 | 0 | 0 | 0)
>     sse_mov [mem], (0 | 0 | 0 | 0)
>     sse_mov [mem+16], (0 | 0 | 0 | 0)

Well, if the instruction predicate of sse_mov rejected the immediate operand
for a memory destination, the propagation would not happen.
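
As a rough sketch of that idea (a hypothetical helper, not the actual i386
predicate or pattern; it assumes GCC's rtl.h for rtx, MEM_P and CONSTANT_P),
the check would refuse a constant source whenever the destination is memory:

  /* Reject "store [mem], (0|0|0|0)" so propagation cannot pull the
     constant out of the register and into every SSE store.  */
  static bool
  sse_store_operands_ok (rtx dest, rtx src)
  {
    if (MEM_P (dest) && CONSTANT_P (src))
      return false;
    return true;
  }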

Let's work out what breaks with the small blocks and I will look into the patterns
to prevent the propagation.

Honza


* Re: Use of vector instructions in memmov/memset expanding
  2011-10-26 20:36                               ` Michael Zolotukhin
@ 2011-11-07  2:48                                 ` Jan Hubicka
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Hubicka @ 2011-11-07  2:48 UTC (permalink / raw)
  To: Michael Zolotukhin
  Cc: Jakub Jelinek, Jack Howarth, gcc-patches, Jan Hubicka,
	Richard Guenther, H.J. Lu, izamyatin, areg.melikadamyan

Hi,
I looked into the patch more over the weekend and I am attaching the current version.
The patches disabling CSE and forwprop on constants are apparently papering over a
problem where subregs of vector registers used in the epilogue make IRA think
that it can't put the value into an SSE register (resulting in NO_REGS class), making
reload output a load of 0 inside the internal loop.

An easy way around this is to avoid those subregs, even if IRA probably could be
improved here.

I also plugged some code paths - the pain here is that the stringops have many
variants - different algorithms, different alignment, constant/variable
counts. These increase the testing matrix, and some of the code paths were wrong with
the new SSE code.

I also added sse_loop as a separate SSE algorithm - I find it confusing as unrolled_loop.
The loop produced has only one store/copy operation, so it is not really unrolled,
and on older AMD chips we still want to produce an unrolled loop w/o SSE.
Also, having sse_loop as a separate variant allows specifying both SSE and non-SSE
code generation strategies in the stringop_algs description.
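
For illustration (made-up thresholds, not values from the attached patch), a
cost-table entry could then name both strategies explicitly:

  /* Unknown alignment: use the SSE loop up to 4k bytes, then a libcall.  */
  static const struct stringop_algs example_memcpy_unknown
    = {libcall, {{4096, sse_loop}, {-1, libcall}}};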

The code now generates faster stringops on the core2/bulldozer machines I can test, but
it will still need a bit of cleanup tomorrow. I think it would be better to use V8QI mode
for the promoted value (since that is what it really is), avoiding the need for changes in
expand_move and the loadq pattern.  We may still want to produce SSE moves for 64bit
operations in 32bit codegen, but this is an independent problem, and the patch as it is
seems to produce a slight regression on crafty.

Honza

Index: config/i386/i386.h
===================================================================
--- config/i386/i386.h	(revision 181033)
+++ config/i386/i386.h	(working copy)
@@ -159,8 +159,12 @@ struct processor_costs {
   const int fchs;		/* cost of FCHS instruction.  */
   const int fsqrt;		/* cost of FSQRT instruction.  */
 				/* Specify what algorithm
-				   to use for stringops on unknown size.  */
-  struct stringop_algs memcpy[2], memset[2];
+				   to use for stringops on unknown size.
+				   First index is used to specify whether
+				   alignment is known or not.
+				   Second - to specify whether 32 or 64 bits
+				   are used.  */
+  struct stringop_algs memcpy[2][2], memset[2][2];
   const int scalar_stmt_cost;   /* Cost of any scalar operation, excluding
 				   load and store.  */
   const int scalar_load_cost;   /* Cost of scalar load.  */
@@ -1712,7 +1716,7 @@ typedef struct ix86_args {
 /* If a clear memory operation would take CLEAR_RATIO or more simple
    move-instruction sequences, we will do a clrmem or libcall instead.  */
 
-#define CLEAR_RATIO(speed) ((speed) ? MIN (6, ix86_cost->move_ratio) : 2)
+#define CLEAR_RATIO(speed) ((speed) ? ix86_cost->move_ratio : 2)
 
 /* Define if shifts truncate the shift count which implies one can
    omit a sign-extension or zero-extension of a shift count.
Index: config/i386/i386.md
===================================================================
--- config/i386/i386.md	(revision 181033)
+++ config/i386/i386.md	(working copy)
@@ -15937,6 +15937,17 @@
 	      (clobber (reg:CC FLAGS_REG))])]
   ""
 {
+  rtx vec_reg;
+  enum machine_mode mode = GET_MODE (operands[2]);
+  if (vector_extensions_used_for_mode (mode)
+      && CONSTANT_P (operands[2]))
+    {
+      if (mode == DImode)
+	mode = TARGET_64BIT ? V2DImode : V4SImode;
+      vec_reg = gen_reg_rtx (mode);
+      emit_move_insn (vec_reg, operands[2]);
+      operands[2] = vec_reg;
+    }
   if (GET_MODE (operands[1]) != GET_MODE (operands[2]))
     operands[1] = adjust_address_nv (operands[1], GET_MODE (operands[2]), 0);
 
Index: config/i386/i386-opts.h
===================================================================
--- config/i386/i386-opts.h	(revision 181033)
+++ config/i386/i386-opts.h	(working copy)
@@ -37,7 +37,8 @@ enum stringop_alg
    rep_prefix_8_byte,
    loop_1_byte,
    loop,
-   unrolled_loop
+   unrolled_loop,
+   sse_loop
 };
 
 /* Available call abi.  */
Index: config/i386/sse.md
===================================================================
--- config/i386/sse.md	(revision 181033)
+++ config/i386/sse.md	(working copy)
@@ -7398,6 +7398,13 @@
    (set_attr "prefix" "maybe_vex,maybe_vex,orig,orig,vex")
    (set_attr "mode" "TI,TI,V4SF,SF,SF")])
 
+(define_expand "sse2_loadq"
+ [(set (match_operand:V2DI 0 "register_operand")
+       (vec_concat:V2DI
+	 (match_operand:DI 1 "memory_operand")
+	 (const_int 0)))]
+  "!TARGET_64BIT && TARGET_SSE2")
+
 (define_insn_and_split "sse2_stored"
   [(set (match_operand:SI 0 "nonimmediate_operand" "=xm,r")
 	(vec_select:SI
@@ -7509,6 +7516,16 @@
    (set_attr "prefix" "maybe_vex,orig,vex,maybe_vex,orig,orig")
    (set_attr "mode" "V2SF,TI,TI,TI,V4SF,V2SF")])
 
+(define_expand "vec_dupv4si"
+  [(set (match_operand:V4SI 0 "register_operand" "")
+	(vec_duplicate:V4SI
+	  (match_operand:SI 1 "nonimmediate_operand" "")))]
+  "TARGET_SSE"
+{
+  if (!TARGET_AVX)
+    operands[1] = force_reg (V4SImode, operands[1]);
+})
+
 (define_insn "*vec_dupv4si"
   [(set (match_operand:V4SI 0 "register_operand"     "=x,x,x")
 	(vec_duplicate:V4SI
@@ -7525,6 +7542,16 @@
    (set_attr "prefix" "maybe_vex,vex,orig")
    (set_attr "mode" "TI,V4SF,V4SF")])
 
+(define_expand "vec_dupv2di"
+  [(set (match_operand:V2DI 0 "register_operand" "")
+	(vec_duplicate:V2DI
+	  (match_operand:DI 1 "nonimmediate_operand" "")))]
+  "TARGET_SSE"
+{
+  if (!TARGET_AVX)
+    operands[1] = force_reg (V2DImode, operands[1]);
+})
+
 (define_insn "*vec_dupv2di"
   [(set (match_operand:V2DI 0 "register_operand"     "=x,x,x,x")
 	(vec_duplicate:V2DI
Index: config/i386/i386.opt
===================================================================
--- config/i386/i386.opt	(revision 181033)
+++ config/i386/i386.opt	(working copy)
@@ -324,6 +324,9 @@ Enum(stringop_alg) String(loop) Value(lo
 EnumValue
 Enum(stringop_alg) String(unrolled_loop) Value(unrolled_loop)
 
+EnumValue
+Enum(stringop_alg) String(sse_loop) Value(sse_loop)
+
 mtls-dialect=
 Target RejectNegative Joined Var(ix86_tls_dialect) Enum(tls_dialect) Init(TLS_DIALECT_GNU)
 Use given thread-local storage dialect
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 181033)
+++ config/i386/i386.c	(working copy)
@@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost =
   COSTS_N_BYTES (2),			/* cost of FABS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FCHS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -632,10 +636,14 @@ struct processor_costs i386_cost = {	/*
   COSTS_N_INSNS (22),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (24),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (122),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -704,10 +712,14 @@ struct processor_costs i486_cost = {	/*
   COSTS_N_INSNS (3),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (3),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (83),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -774,10 +786,14 @@ struct processor_costs pentium_cost = {
   COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (70),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{-1, rep_prefix_4_byte}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -849,12 +865,18 @@ struct processor_costs pentiumpro_cost =
      noticeable win, for bigger blocks either rep movsl or rep movsb is
      way to go.  Rep movsb has apparently more expensive startup time in CPU,
      but after 4K the difference is down in the noise.  */
-  {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
+  {{{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
 			{8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_4_byte, {{1024, unrolled_loop},
-  			{8192, rep_prefix_4_byte}, {-1, libcall}}},
+   {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_4_byte, {{1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_4_byte, {{1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -922,10 +944,14 @@ struct processor_costs geode_cost = {
   COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (54),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -995,10 +1021,14 @@ struct processor_costs k6_cost = {
   COSTS_N_INSNS (2),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (2),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (56),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1068,10 +1098,14 @@ struct processor_costs athlon_cost = {
   /* For some reason, Athlon deals better with REP prefix (relative to loops)
      compared to K8. Alignment becomes important after 8 bytes for memcpy and
      128 bytes for memset.  */
-  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1146,11 +1180,16 @@ struct processor_costs k8_cost = {
   /* K8 has optimized REP instruction for medium sized blocks, but for very
      small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1233,11 +1272,16 @@ struct processor_costs amdfam10_cost = {
   /* AMDFAM10 has optimized REP instruction for medium sized blocks, but for
      very small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1320,11 +1364,16 @@ struct processor_costs bdver1_cost = {
   /*  BDVER1 has optimized REP instruction for medium sized blocks, but for
       very small blocks it is better to use loop. For large blocks, libcall
       can do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   6,					/* scalar_stmt_cost.  */
   4,					/* scalar load_cost.  */
   4,					/* scalar_store_cost.  */
@@ -1407,11 +1456,16 @@ struct processor_costs bdver2_cost = {
   /*  BDVER2 has optimized REP instruction for medium sized blocks, but for
       very small blocks it is better to use loop. For large blocks, libcall
       can do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+  {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   6,					/* scalar_stmt_cost.  */
   4,					/* scalar load_cost.  */
   4,					/* scalar_store_cost.  */
@@ -1489,11 +1543,16 @@ struct processor_costs btver1_cost = {
   /* BTVER1 has optimized REP instruction for medium sized blocks, but for
      very small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1560,11 +1619,18 @@ struct processor_costs pentium4_cost = {
   COSTS_N_INSNS (2),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (2),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (43),			/* cost of FSQRT instruction.  */
-  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+
+  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+
+  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
    {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1631,13 +1697,22 @@ struct processor_costs nocona_cost = {
   COSTS_N_INSNS (3),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (3),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (44),			/* cost of FSQRT instruction.  */
-  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+
+  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
    {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
 	      {100000, unrolled_loop}, {-1, libcall}}}},
-  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
+	      {100000, unrolled_loop}, {-1, libcall}}}}},
+
+  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
    {-1, libcall}}},
    {libcall, {{24, loop}, {64, unrolled_loop},
 	      {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {-1, libcall}}},
+   {libcall, {{24, loop}, {64, unrolled_loop},
+	      {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1704,13 +1779,21 @@ struct processor_costs atom_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{11, loop}, {-1, rep_prefix_4_byte}}},
-   {libcall, {{32, loop}, {64, rep_prefix_4_byte},
-	  {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {15, unrolled_loop},
-	  {2048, rep_prefix_4_byte}, {-1, libcall}}},
-   {libcall, {{24, loop}, {32, unrolled_loop},
-	  {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+
+  /* stringop_algs for memcpy.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
+    {libcall, {{4096, unrolled_loop}, {-1, libcall}}}},
+   {{libcall, {{-1, libcall}}},			       /* Unknown alignment.  */
+    {libcall, {{2048, unrolled_loop},
+	       {-1, libcall}}}}},
+
+  /* stringop_algs for memset.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
+    {libcall, {{4096, unrolled_loop}, {-1, libcall}}}},
+   {{libcall, {{1024, unrolled_loop},		       /* Unknown alignment.  */
+	       {-1, libcall}}},
+    {libcall, {{2048, unrolled_loop},
+	       {-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1784,10 +1867,16 @@ struct processor_costs generic64_cost =
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {DUMMY_STRINGOP_ALGS,
-   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {DUMMY_STRINGOP_ALGS,
-   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+
+  {{DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}},
+   {DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}}},
+
+  {{DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}},
+   {DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1856,10 +1945,16 @@ struct processor_costs generic32_cost =
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+  /* stringop_algs for memcpy.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  /* stringop_algs for memset.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -2536,6 +2631,8 @@ static void ix86_set_current_function (t
 static unsigned int ix86_minimum_incoming_stack_boundary (bool);
 
 static enum calling_abi ix86_function_abi (const_tree);
+static rtx promote_duplicated_reg (enum machine_mode, rtx);
+static rtx promote_duplicated_reg_to_size (rtx, int, int, int);
 
 \f
 #ifndef SUBTARGET32_DEFAULT_CPU
@@ -15300,6 +15397,38 @@ ix86_expand_move (enum machine_mode mode
     }
   else
     {
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE2
+	  && MEM_P (op0)
+	  && MEM_P (op1)
+	  && !push_operand (op0, mode)
+	  && can_create_pseudo_p ())
+	{
+	  rtx temp = gen_reg_rtx (V2DImode);
+	  emit_insn (gen_sse2_loadq (temp, op1));
+	  emit_insn (gen_sse_storeq (op0, temp));
+	  return;
+	}
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE
+	  && !MEM_P (op1)
+	  && GET_MODE (op1) == V2DImode)
+	{
+	  emit_insn (gen_sse_storeq (op0, op1));
+	  return;
+	}
+      if (mode == TImode
+	  && TARGET_AVX2
+	  && MEM_P (op0)
+	  && !MEM_P (op1)
+	  && GET_MODE (op1) == V4DImode)
+	{
+	  op0 = convert_to_mode (V2DImode, op0, 1);
+	  emit_insn (gen_vec_extract_lo_v4di (op0, op1));
+	  return;
+	}
       if (MEM_P (op0)
 	  && (PUSH_ROUNDING (GET_MODE_SIZE (mode)) != GET_MODE_SIZE (mode)
 	      || !push_operand (op0, mode))
@@ -20800,22 +20929,37 @@ counter_mode (rtx count_exp)
   return SImode;
 }
 
-/* When SRCPTR is non-NULL, output simple loop to move memory
+/* Helper function for expand_set_or_movmem_via_loop.
+
+   When SRCPTR is non-NULL, output simple loop to move memory
    pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
    overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
    equivalent loop to set memory by VALUE (supposed to be in MODE).
 
    The size is rounded down to whole number of chunk size moved at once.
-   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
+   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.
 
+   If ITER isn't NULL, then it'll be used in the generated loop without
+   initialization (which allows generating several consecutive loops using
+   the same iterator).
+   If CHANGE_PTRS is specified, DESTPTR and SRCPTR will be increased by the
+   iterator value at the end of the function (as if they iterated in the
+   loop).  Otherwise, their values stay unchanged.
+
+   If EXPECTED_SIZE isn't -1, then it's used to compute branch probabilities
+   on the loop backedge.  When the expected size is unknown (it's -1), the
+   probability is set to 80%.
 
-static void
-expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
-			       rtx destptr, rtx srcptr, rtx value,
-			       rtx count, enum machine_mode mode, int unroll,
-			       int expected_size)
+   The return value is the rtx of the iterator used in the loop; it can be
+   reused in subsequent calls of this function.  */
+static rtx
+expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
+					 rtx destptr, rtx srcptr, rtx value,
+					 rtx count, rtx iter,
+					 enum machine_mode mode, int unroll,
+					 int expected_size, bool change_ptrs)
 {
-  rtx out_label, top_label, iter, tmp;
+  rtx out_label, top_label, tmp;
   enum machine_mode iter_mode = counter_mode (count);
   rtx piece_size = GEN_INT (GET_MODE_SIZE (mode) * unroll);
   rtx piece_size_mask = GEN_INT (~((GET_MODE_SIZE (mode) * unroll) - 1));
@@ -20823,10 +20967,12 @@ expand_set_or_movmem_via_loop (rtx destm
   rtx x_addr;
   rtx y_addr;
   int i;
+  bool reuse_iter = (iter != NULL_RTX);
 
   top_label = gen_label_rtx ();
   out_label = gen_label_rtx ();
-  iter = gen_reg_rtx (iter_mode);
+  if (!reuse_iter)
+    iter = gen_reg_rtx (iter_mode);
 
   size = expand_simple_binop (iter_mode, AND, count, piece_size_mask,
 			      NULL, 1, OPTAB_DIRECT);
@@ -20837,18 +20983,21 @@ expand_set_or_movmem_via_loop (rtx destm
 			       true, out_label);
       predict_jump (REG_BR_PROB_BASE * 10 / 100);
     }
-  emit_move_insn (iter, const0_rtx);
+  if (!reuse_iter)
+    emit_move_insn (iter, const0_rtx);
 
   emit_label (top_label);
 
   tmp = convert_modes (Pmode, iter_mode, iter, true);
   x_addr = gen_rtx_PLUS (Pmode, destptr, tmp);
-  destmem = change_address (destmem, mode, x_addr);
+  destmem =
+    adjust_automodify_address_nv (copy_rtx (destmem), mode, x_addr, 0);
 
   if (srcmem)
     {
       y_addr = gen_rtx_PLUS (Pmode, srcptr, copy_rtx (tmp));
-      srcmem = change_address (srcmem, mode, y_addr);
+      srcmem =
+	adjust_automodify_address_nv (copy_rtx (srcmem), mode, y_addr, 0);
 
       /* When unrolling for chips that reorder memory reads and writes,
 	 we can save registers by using single temporary.
@@ -20920,19 +21069,43 @@ expand_set_or_movmem_via_loop (rtx destm
     }
   else
     predict_jump (REG_BR_PROB_BASE * 80 / 100);
-  iter = ix86_zero_extend_to_Pmode (iter);
-  tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
-			     true, OPTAB_LIB_WIDEN);
-  if (tmp != destptr)
-    emit_move_insn (destptr, tmp);
-  if (srcptr)
+  if (change_ptrs)
     {
-      tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
+      iter = ix86_zero_extend_to_Pmode (iter);
+      tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
 				 true, OPTAB_LIB_WIDEN);
-      if (tmp != srcptr)
-	emit_move_insn (srcptr, tmp);
+      if (tmp != destptr)
+	emit_move_insn (destptr, tmp);
+      if (srcptr)
+	{
+	  tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
+				     true, OPTAB_LIB_WIDEN);
+	  if (tmp != srcptr)
+	    emit_move_insn (srcptr, tmp);
+	}
     }
   emit_label (out_label);
+  return iter;
+}
+
+/* When SRCPTR is non-NULL, output simple loop to move memory
+   pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
+   overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
+   equivalent loop to set memory by VALUE (supposed to be in MODE).
+
+   The size is rounded down to whole number of chunk size moved at once.
+   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
+
+static void
+expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
+			       rtx destptr, rtx srcptr, rtx value,
+			       rtx count, enum machine_mode mode, int unroll,
+			       int expected_size)
+{
+  expand_set_or_movmem_via_loop_with_iter (destmem, srcmem,
+				 destptr, srcptr, value,
+				 count, NULL_RTX, mode, unroll,
+				 expected_size, true);
 }
 
 /* Output "rep; mov" instruction.
@@ -21036,7 +21209,18 @@ emit_strmov (rtx destmem, rtx srcmem,
   emit_insn (gen_strmov (destptr, dest, srcptr, src));
 }
 
-/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST.  */
+/* Emit strset instruction.  If RHS is constant and a vector mode will be used,
+   then move this constant to a vector register before emitting strset.  */
+static void
+emit_strset (rtx destmem, rtx value,
+	     rtx destptr, enum machine_mode mode, int offset)
+{
+  rtx dest = adjust_automodify_address_nv (destmem, mode, destptr, offset);
+  emit_insn (gen_strset (destptr, dest, value));
+}
+
+/* Output code to copy (COUNT % MAX_SIZE) bytes from SRCPTR to DESTPTR.
+   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
 static void
 expand_movmem_epilogue (rtx destmem, rtx srcmem,
 			rtx destptr, rtx srcptr, rtx count, int max_size)
@@ -21047,43 +21231,55 @@ expand_movmem_epilogue (rtx destmem, rtx
       HOST_WIDE_INT countval = INTVAL (count);
       int offset = 0;
 
-      if ((countval & 0x10) && max_size > 16)
+      int remainder_size = countval % max_size;
+      enum machine_mode move_mode = Pmode;
+
+      /* First, try to move data using the widest possible mode.
+	 The remaining part is moved using Pmode and narrower modes.  */
+      if (TARGET_SSE)
 	{
-	  if (TARGET_64BIT)
-	    {
-	      emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
-	      emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset + 8);
-	    }
-	  else
-	    gcc_unreachable ();
-	  offset += 16;
+	  if (max_size >= GET_MODE_SIZE (V4SImode))
+	    move_mode = V4SImode;
+	  else if (max_size >= GET_MODE_SIZE (DImode))
+	    move_mode = DImode;
 	}
-      if ((countval & 0x08) && max_size > 8)
+
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-	  if (TARGET_64BIT)
-	    emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
-	  else
-	    {
-	      emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
-	      emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset + 4);
-	    }
-	  offset += 8;
+	  emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
+	}
+
+      /* Move the remaining part of the epilogue - its size can still be
+	 nearly as large as the widest mode.  */
+      move_mode = Pmode;
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
+	{
+	  emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
 	}
-      if ((countval & 0x04) && max_size > 4)
+
+      if (remainder_size >= 4)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
 	  offset += 4;
+	  remainder_size -= 4;
 	}
-      if ((countval & 0x02) && max_size > 2)
+      if (remainder_size >= 2)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
 	  offset += 2;
+	  remainder_size -= 2;
 	}
-      if ((countval & 0x01) && max_size > 1)
+      if (remainder_size >= 1)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
 	  offset += 1;
+	  remainder_size -= 1;
 	}
+      gcc_assert (remainder_size == 0);
       return;
     }
   if (max_size > 8)
@@ -21189,87 +21385,121 @@ expand_setmem_epilogue_via_loop (rtx des
 				 1, max_size / 2);
 }
 
-/* Output code to set at most count & (max_size - 1) bytes starting by DEST.  */
-static void
-expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_size)
+/* Output code to set at most (COUNT % MAX_SIZE) bytes starting from DESTPTR
+   to VALUE.  DESTMEM provides MEMrtx to feed proper aliasing info.
+   PROMOTED_TO_GPR_VALUE is an rtx representing a GPR containing the
+   broadcasted VALUE.
+   PROMOTED_TO_VECTOR_VALUE is an rtx representing a vector register
+   containing the broadcasted VALUE.
+   PROMOTED_TO_GPR_VALUE and PROMOTED_TO_VECTOR_VALUE may be NULL if the
+   promotion hasn't been generated before.  */
+static void
+expand_setmem_epilogue (rtx destmem, rtx destptr, rtx promoted_to_vector_value,
+			rtx promoted_to_gpr_value, rtx value, rtx count,
+			int max_size)
 {
-  rtx dest;
-
   if (CONST_INT_P (count))
     {
       HOST_WIDE_INT countval = INTVAL (count);
       int offset = 0;
 
-      if ((countval & 0x10) && max_size > 16)
-	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset + 8);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    gcc_unreachable ();
-	  offset += 16;
-	}
-      if ((countval & 0x08) && max_size > 8)
-	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    {
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset + 4);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  offset += 8;
-	}
-      if ((countval & 0x04) && max_size > 4)
+      int remainder_size = countval % max_size;
+      enum machine_mode move_mode = Pmode;
+
+      /* First, try to move data using the widest possible mode.
+	 The remaining part is moved using Pmode and narrower modes.  */
+
+      if (promoted_to_vector_value)
+	while (remainder_size >= 16)
+	  {
+	    if (GET_MODE (destmem) != move_mode)
+	      destmem = adjust_automodify_address_nv (destmem, move_mode,
+						      destptr, offset);
+	    emit_strset (destmem, promoted_to_vector_value, destptr,
+			 move_mode, offset);
+
+	    offset += 16;
+	    remainder_size -= 16;
+	  }
+
+      /* Move the remaining part of the epilogue - its size can still be
+	 nearly as large as the widest mode.  */
+      while (remainder_size >= GET_MODE_SIZE (Pmode))
+	{
+	  if (!promoted_to_gpr_value)
+	    promoted_to_gpr_value = promote_duplicated_reg (Pmode, value);
+	  emit_strset (destmem, promoted_to_gpr_value, destptr, Pmode, offset);
+	  offset += GET_MODE_SIZE (Pmode);
+	  remainder_size -= GET_MODE_SIZE (Pmode);
+	}
+
+      if (!promoted_to_gpr_value && remainder_size > 1)
+	promoted_to_gpr_value = promote_duplicated_reg (remainder_size >= 4
+							? SImode : HImode, value);
+      if (remainder_size >= 4)
 	{
-	  dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
+	  emit_strset (destmem, gen_lowpart (SImode, promoted_to_gpr_value), destptr,
+		       SImode, offset);
 	  offset += 4;
+	  remainder_size -= 4;
 	}
-      if ((countval & 0x02) && max_size > 2)
+      if (remainder_size >= 2)
 	{
-	  dest = adjust_automodify_address_nv (destmem, HImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
-	  offset += 2;
+	  emit_strset (destmem, gen_lowpart (HImode, promoted_to_gpr_value), destptr,
+		       HImode, offset);
+	  offset += 2;
+	  remainder_size -= 2;
 	}
-      if ((countval & 0x01) && max_size > 1)
+      if (remainder_size >= 1)
 	{
-	  dest = adjust_automodify_address_nv (destmem, QImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
+	  emit_strset (destmem,
+		       promoted_to_gpr_value ? gen_lowpart (QImode, promoted_to_gpr_value) : 1,
+		        destptr,
+		       QImode, offset);
 	  offset += 1;
+	  remainder_size -= 1;
 	}
+      gcc_assert (remainder_size == 0);
       return;
     }
+
+  /* COUNT isn't a compile-time constant.  */
   if (max_size > 32)
     {
-      expand_setmem_epilogue_via_loop (destmem, destptr, value, count, max_size);
+      expand_setmem_epilogue_via_loop (destmem, destptr, value, count,
+				       max_size);
       return;
     }
+
+  if (!promoted_to_gpr_value)
+    promoted_to_gpr_value = promote_duplicated_reg_to_size (value,
+						   GET_MODE_SIZE (Pmode),
+						   GET_MODE_SIZE (Pmode),
+						   GET_MODE_SIZE (Pmode));
+
   if (max_size > 16)
     {
       rtx label = ix86_expand_aligntest (count, 16, true);
-      if (TARGET_64BIT)
+      if (TARGET_SSE && promoted_to_vector_value)
+	{
+	  destmem = change_address (destmem,
+				    GET_MODE (promoted_to_vector_value),
+				    destptr);
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_vector_value));
+	}
+      else if (TARGET_64BIT)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
@@ -21279,14 +21509,22 @@ expand_setmem_epilogue (rtx destmem, rtx
       rtx label = ix86_expand_aligntest (count, 8, true);
       if (TARGET_64BIT)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
+	}
+      /* FIXME: When this hunk is output, IRA classifies promoted_to_vector_value
+         as NO_REGS.  */
+      else if (TARGET_SSE && promoted_to_vector_value && 0)
+	{
+	  destmem = change_address (destmem, V2SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem,
+				 gen_lowpart (V2SImode, promoted_to_vector_value)));
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
@@ -21294,24 +21532,27 @@ expand_setmem_epilogue (rtx destmem, rtx
   if (max_size > 4)
     {
       rtx label = ix86_expand_aligntest (count, 4, true);
-      dest = change_address (destmem, SImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
+      destmem = change_address (destmem, SImode, destptr);
+      emit_insn (gen_strset (destptr, destmem,
+			     gen_lowpart (SImode, promoted_to_gpr_value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
   if (max_size > 2)
     {
       rtx label = ix86_expand_aligntest (count, 2, true);
-      dest = change_address (destmem, HImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
+      destmem = change_address (destmem, HImode, destptr);
+      emit_insn (gen_strset (destptr, destmem,
+			     gen_lowpart (HImode, promoted_to_gpr_value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
   if (max_size > 1)
     {
       rtx label = ix86_expand_aligntest (count, 1, true);
-      dest = change_address (destmem, QImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
+      destmem = change_address (destmem, QImode, destptr);
+      emit_insn (gen_strset (destptr, destmem,
+			     gen_lowpart (QImode, promoted_to_gpr_value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
@@ -21327,8 +21568,8 @@ expand_movmem_prologue (rtx destmem, rtx
   if (align <= 1 && desired_alignment > 1)
     {
       rtx label = ix86_expand_aligntest (destptr, 1, false);
-      srcmem = change_address (srcmem, QImode, srcptr);
-      destmem = change_address (destmem, QImode, destptr);
+      srcmem = adjust_automodify_address_nv (srcmem, QImode, srcptr, 0);
+      destmem = adjust_automodify_address_nv (destmem, QImode, destptr, 0);
       emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
       ix86_adjust_counter (count, 1);
       emit_label (label);
@@ -21337,8 +21578,8 @@ expand_movmem_prologue (rtx destmem, rtx
   if (align <= 2 && desired_alignment > 2)
     {
       rtx label = ix86_expand_aligntest (destptr, 2, false);
-      srcmem = change_address (srcmem, HImode, srcptr);
-      destmem = change_address (destmem, HImode, destptr);
+      srcmem = adjust_automodify_address_nv (srcmem, HImode, srcptr, 0);
+      destmem = adjust_automodify_address_nv (destmem, HImode, destptr, 0);
       emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
       ix86_adjust_counter (count, 2);
       emit_label (label);
@@ -21347,14 +21588,34 @@ expand_movmem_prologue (rtx destmem, rtx
   if (align <= 4 && desired_alignment > 4)
     {
       rtx label = ix86_expand_aligntest (destptr, 4, false);
-      srcmem = change_address (srcmem, SImode, srcptr);
-      destmem = change_address (destmem, SImode, destptr);
+      srcmem = adjust_automodify_address_nv (srcmem, SImode, srcptr, 0);
+      destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
       emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
       ix86_adjust_counter (count, 4);
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
-  gcc_assert (desired_alignment <= 8);
+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  srcmem = adjust_automodify_address_nv (srcmem, DImode, srcptr, 0);
+	  destmem = adjust_automodify_address_nv (destmem, DImode, destptr, 0);
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	}
+      else
+	{
+	  srcmem = adjust_automodify_address_nv (srcmem, SImode, srcptr, 0);
+	  destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	}
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);
 }
 
 /* Copy enough from DST to SRC to align DST known to DESIRED_ALIGN.
@@ -21409,6 +21670,37 @@ expand_constant_movmem_prologue (rtx dst
       off = 4;
       emit_insn (gen_strmov (destreg, dst, srcreg, src));
     }
+  if (align_bytes & 8)
+    {
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, DImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      else
+	{
+	  dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, SImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
+	set_mem_align (dst, 8 * BITS_PER_UNIT);
+      if (src_align_bytes >= 0)
+	{
+	  unsigned int src_align = 0;
+	  if ((src_align_bytes & 7) == (align_bytes & 7))
+	    src_align = 8;
+	  else if ((src_align_bytes & 3) == (align_bytes & 3))
+	    src_align = 4;
+	  else if ((src_align_bytes & 1) == (align_bytes & 1))
+	    src_align = 2;
+	  if (MEM_ALIGN (src) < src_align * BITS_PER_UNIT)
+	    set_mem_align (src, src_align * BITS_PER_UNIT);
+	}
+      off = 8;
+    }
   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
   src = adjust_automodify_address_nv (src, BLKmode, srcreg, off);
   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
@@ -21416,7 +21708,9 @@ expand_constant_movmem_prologue (rtx dst
   if (src_align_bytes >= 0)
     {
       unsigned int src_align = 0;
-      if ((src_align_bytes & 7) == (align_bytes & 7))
+      if ((src_align_bytes & 15) == (align_bytes & 15))
+	src_align = 16;
+      else if ((src_align_bytes & 7) == (align_bytes & 7))
 	src_align = 8;
       else if ((src_align_bytes & 3) == (align_bytes & 3))
 	src_align = 4;
@@ -21444,7 +21738,7 @@ expand_setmem_prologue (rtx destmem, rtx
   if (align <= 1 && desired_alignment > 1)
     {
       rtx label = ix86_expand_aligntest (destptr, 1, false);
-      destmem = change_address (destmem, QImode, destptr);
+      destmem = adjust_automodify_address_nv (destmem, QImode, destptr, 0);
       emit_insn (gen_strset (destptr, destmem, gen_lowpart (QImode, value)));
       ix86_adjust_counter (count, 1);
       emit_label (label);
@@ -21453,7 +21747,7 @@ expand_setmem_prologue (rtx destmem, rtx
   if (align <= 2 && desired_alignment > 2)
     {
       rtx label = ix86_expand_aligntest (destptr, 2, false);
-      destmem = change_address (destmem, HImode, destptr);
+      destmem = adjust_automodify_address_nv (destmem, HImode, destptr, 0);
       emit_insn (gen_strset (destptr, destmem, gen_lowpart (HImode, value)));
       ix86_adjust_counter (count, 2);
       emit_label (label);
@@ -21462,13 +21756,23 @@ expand_setmem_prologue (rtx destmem, rtx
   if (align <= 4 && desired_alignment > 4)
     {
       rtx label = ix86_expand_aligntest (destptr, 4, false);
-      destmem = change_address (destmem, SImode, destptr);
+      destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
       emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
       ix86_adjust_counter (count, 4);
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
-  gcc_assert (desired_alignment <= 8);
+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);
 }
 
 /* Set enough from DST to align DST known to by aligned by ALIGN to
@@ -21504,6 +21808,19 @@ expand_constant_setmem_prologue (rtx dst
       emit_insn (gen_strset (destreg, dst,
 			     gen_lowpart (SImode, value)));
     }
+  if (align_bytes & 8)
+    {
+      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+      emit_insn (gen_strset (destreg, dst,
+	    gen_lowpart (SImode, value)));
+      off = 4;
+      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+      emit_insn (gen_strset (destreg, dst,
+	    gen_lowpart (SImode, value)));
+      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
+	set_mem_align (dst, 8 * BITS_PER_UNIT);
+      off = 4;
+    }
   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
     set_mem_align (dst, desired_align * BITS_PER_UNIT);
@@ -21515,7 +21832,7 @@ expand_constant_setmem_prologue (rtx dst
 /* Given COUNT and EXPECTED_SIZE, decide on codegen of string operation.  */
 static enum stringop_alg
 decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
-	    int *dynamic_check)
+	    int *dynamic_check, bool align_unknown)
 {
   const struct stringop_algs * algs;
   bool optimize_for_speed;
@@ -21524,7 +21841,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WI
      consider such algorithms if the user has appropriated those
      registers for their own purposes.	*/
   bool rep_prefix_usable = !(fixed_regs[CX_REG] || fixed_regs[DI_REG]
-                             || (memset
+			     || (memset
 				 ? fixed_regs[AX_REG] : fixed_regs[SI_REG]));
 
 #define ALG_USABLE_P(alg) (rep_prefix_usable			\
@@ -21537,7 +21854,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WI
      of time processing large blocks.  */
   if (optimize_function_for_size_p (cfun)
       || (optimize_insn_for_size_p ()
-          && expected_size != -1 && expected_size < 256))
+	  && expected_size != -1 && expected_size < 256))
     optimize_for_speed = false;
   else
     optimize_for_speed = true;
@@ -21546,9 +21863,9 @@ decide_alg (HOST_WIDE_INT count, HOST_WI
 
   *dynamic_check = -1;
   if (memset)
-    algs = &cost->memset[TARGET_64BIT != 0];
+    algs = &cost->memset[align_unknown][TARGET_64BIT != 0];
   else
-    algs = &cost->memcpy[TARGET_64BIT != 0];
+    algs = &cost->memcpy[align_unknown][TARGET_64BIT != 0];
   if (ix86_stringop_alg != no_stringop && ALG_USABLE_P (ix86_stringop_alg))
     return ix86_stringop_alg;
   /* rep; movq or rep; movl is the smallest variant.  */
@@ -21612,29 +21929,33 @@ decide_alg (HOST_WIDE_INT count, HOST_WI
       enum stringop_alg alg;
       int i;
       bool any_alg_usable_p = true;
+      bool only_libcall_fits = true;
 
       for (i = 0; i < MAX_STRINGOP_ALGS; i++)
-        {
-          enum stringop_alg candidate = algs->size[i].alg;
-          any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
+	{
+	  enum stringop_alg candidate = algs->size[i].alg;
+	  any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
 
-          if (candidate != libcall && candidate
-              && ALG_USABLE_P (candidate))
-              max = algs->size[i].max;
-        }
+	  if (candidate != libcall && candidate
+	      && ALG_USABLE_P (candidate))
+	    {
+	      max = algs->size[i].max;
+	      only_libcall_fits = false;
+	    }
+	}
       /* If there aren't any usable algorithms, then recursing on
-         smaller sizes isn't going to find anything.  Just return the
-         simple byte-at-a-time copy loop.  */
-      if (!any_alg_usable_p)
-        {
-          /* Pick something reasonable.  */
-          if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
-            *dynamic_check = 128;
-          return loop_1_byte;
-        }
+	 smaller sizes isn't going to find anything.  Just return the
+	 simple byte-at-a-time copy loop.  */
+      if (!any_alg_usable_p || only_libcall_fits)
+	{
+	  /* Pick something reasonable.  */
+	  if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
+	    *dynamic_check = 128;
+	  return loop_1_byte;
+	}
       if (max == -1)
 	max = 4096;
-      alg = decide_alg (count, max / 2, memset, dynamic_check);
+      alg = decide_alg (count, max / 2, memset, dynamic_check, align_unknown);
       gcc_assert (*dynamic_check == -1);
       gcc_assert (alg != libcall);
       if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
@@ -21658,9 +21979,14 @@ decide_alignment (int align,
       case no_stringop:
 	gcc_unreachable ();
       case loop:
+	desired_align = GET_MODE_SIZE (Pmode);
+	break;
       case unrolled_loop:
 	desired_align = GET_MODE_SIZE (Pmode);
 	break;
+      case sse_loop:
+	desired_align = 16;
+	break;
       case rep_prefix_8_byte:
 	desired_align = 8;
 	break;
@@ -21748,6 +22074,11 @@ ix86_expand_movmem (rtx dst, rtx src, rt
   enum stringop_alg alg;
   int dynamic_check;
   bool need_zero_guard = false;
+  bool align_unknown;
+  int unroll_factor;
+  enum machine_mode move_mode;
+  rtx loop_iter = NULL_RTX;
+  int dst_offset, src_offset;
 
   if (CONST_INT_P (align_exp))
     align = INTVAL (align_exp);
@@ -21771,9 +22102,17 @@ ix86_expand_movmem (rtx dst, rtx src, rt
 
   /* Step 0: Decide on preferred algorithm, desired alignment and
      size of chunks to be copied by main loop.  */
-
-  alg = decide_alg (count, expected_size, false, &dynamic_check);
+  dst_offset = get_mem_align_offset (dst, MOVE_MAX*BITS_PER_UNIT);
+  src_offset = get_mem_align_offset (src, MOVE_MAX*BITS_PER_UNIT);
+  align_unknown = (dst_offset < 0
+		   || src_offset < 0
+		   || src_offset != dst_offset);
+  alg = decide_alg (count, expected_size, false, &dynamic_check, align_unknown);
   desired_align = decide_alignment (align, alg, expected_size);
+  if (align_unknown)
+    desired_align = align;
+  unroll_factor = 1;
+  move_mode = Pmode;
 
   if (!TARGET_ALIGN_STRINGOPS)
     align = desired_align;
@@ -21792,11 +22131,22 @@ ix86_expand_movmem (rtx dst, rtx src, rt
       gcc_unreachable ();
     case loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode);
+      move_mode = Pmode;
+      unroll_factor = 1;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * (TARGET_64BIT ? 4 : 2);
+      move_mode = Pmode;
+      unroll_factor = TARGET_64BIT ? 4 : 2;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
+      break;
+    case sse_loop:
+      need_zero_guard = true;
+      /* Use SSE instructions, if possible.  */
+      move_mode = align_unknown ? DImode : V4SImode;
+      unroll_factor = TARGET_64BIT ? 4 : 2;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
       size_needed = 8;
@@ -21857,6 +22207,12 @@ ix86_expand_movmem (rtx dst, rtx src, rt
 	}
       else
 	{
+	  /* The SSE loop reuses the iteration counter in the epilogue.  */
+	  if (alg == sse_loop)
+	    {
+	      loop_iter = gen_reg_rtx (counter_mode (count_exp));
+              emit_move_insn (loop_iter, const0_rtx);
+	    }
 	  label = gen_label_rtx ();
 	  emit_cmp_and_jump_insns (count_exp,
 				   GEN_INT (epilogue_size_needed),
@@ -21908,6 +22264,8 @@ ix86_expand_movmem (rtx dst, rtx src, rt
 	  dst = change_address (dst, BLKmode, destreg);
 	  expand_movmem_prologue (dst, src, destreg, srcreg, count_exp, align,
 				  desired_align);
+	  set_mem_align (src, desired_align*BITS_PER_UNIT);
+	  set_mem_align (dst, desired_align*BITS_PER_UNIT);
 	}
       else
 	{
@@ -21964,12 +22322,23 @@ ix86_expand_movmem (rtx dst, rtx src, rt
       expand_set_or_movmem_via_loop (dst, src, destreg, srcreg, NULL,
 				     count_exp, Pmode, 1, expected_size);
       break;
+    case sse_loop:
+      /* In some cases we want to use the same iterator in several adjacent
+	 loops, so we save the loop iterator rtx here and don't update
+	 addresses.  */
+      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
+							   srcreg, NULL,
+							   count_exp, loop_iter,
+							   move_mode,
+							   unroll_factor,
+							   expected_size, false);
+      break;
     case unrolled_loop:
-      /* Unroll only by factor of 2 in 32bit mode, since we don't have enough
-	 registers for 4 temporaries anyway.  */
-      expand_set_or_movmem_via_loop (dst, src, destreg, srcreg, NULL,
-				     count_exp, Pmode, TARGET_64BIT ? 4 : 2,
-				     expected_size);
+      expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
+					       srcreg, NULL,
+					       count_exp, NULL,
+					       move_mode,
+					       unroll_factor,
+					       expected_size, false);
       break;
     case rep_prefix_8_byte:
       expand_movmem_via_rep_mov (dst, src, destreg, srcreg, count_exp,
@@ -22020,9 +22389,41 @@ ix86_expand_movmem (rtx dst, rtx src, rt
       LABEL_NUSES (label) = 1;
     }
 
+  /* We haven't updated the addresses, so do that now.
+     Also, if the epilogue seems to be big, generate a (non-unrolled) loop in
+     it.  Do this only when the alignment is unknown, because in that case
+     the epilogue would otherwise have to copy byte by byte, which is very
+     slow.  */
+  if (alg == sse_loop)
+    {
+      rtx tmp;
+      if (align_unknown && unroll_factor > 1)
+	{
+	  /* Reduce the epilogue's size by creating a non-unrolled loop.  If we
+	     don't, the epilogue can be very big - when the alignment is
+	     statically unknown it would proceed byte by byte, which may be
+	     very slow.  */
+	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
+	      srcreg, NULL, count_exp,
+	      loop_iter, move_mode, 1,
+	      expected_size, false);
+	  src = change_address (src, BLKmode, srcreg);
+	  dst = change_address (dst, BLKmode, destreg);
+	  epilogue_size_needed = GET_MODE_SIZE (move_mode);
+	}
+      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != destreg)
+	emit_move_insn (destreg, tmp);
+
+      tmp = expand_simple_binop (Pmode, PLUS, srcreg, loop_iter, srcreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != srcreg)
+	emit_move_insn (srcreg, tmp);
+    }
   if (count_exp != const0_rtx && epilogue_size_needed > 1)
     expand_movmem_epilogue (dst, src, destreg, srcreg, count_exp,
 			    epilogue_size_needed);
+
   if (jump_around_label)
     emit_label (jump_around_label);
   return true;
@@ -22040,7 +22441,37 @@ promote_duplicated_reg (enum machine_mod
   rtx tmp;
   int nops = mode == DImode ? 3 : 2;
 
+  if (VECTOR_MODE_P (mode))
+    {
+      enum machine_mode inner = GET_MODE_INNER (mode);
+      rtx promoted_val, vec_reg;
+      if (CONST_INT_P (val))
+	return ix86_build_const_vector (mode, true, val);
+
+      promoted_val = promote_duplicated_reg (inner, val);
+      vec_reg = gen_reg_rtx (mode);
+      switch (mode)
+	{
+	case V2DImode:
+	  emit_insn (gen_vec_dupv2di (vec_reg, promoted_val));
+	  break;
+	case V4SImode:
+	  emit_insn (gen_vec_dupv4si (vec_reg, promoted_val));
+	  break;
+	default:
+	  gcc_unreachable ();
+	  break;
+	}
+
+      return vec_reg;
+    }
   gcc_assert (mode == SImode || mode == DImode);
+  if (mode == DImode && !TARGET_64BIT)
+    {
+      rtx vec_reg = promote_duplicated_reg (V4SImode, val);
+      vec_reg = convert_to_mode (V2DImode, vec_reg, 1);
+      return vec_reg;
+    }
   if (val == const0_rtx)
     return copy_to_mode_reg (mode, const0_rtx);
   if (CONST_INT_P (val))
@@ -22106,11 +22537,27 @@ promote_duplicated_reg (enum machine_mod
 static rtx
 promote_duplicated_reg_to_size (rtx val, int size_needed, int desired_align, int align)
 {
-  rtx promoted_val;
+  rtx promoted_val = NULL_RTX;
 
-  if (TARGET_64BIT
-      && (size_needed > 4 || (desired_align > align && desired_align > 4)))
-    promoted_val = promote_duplicated_reg (DImode, val);
+  if (size_needed > 8 || (desired_align > align && desired_align > 8))
+    {
+      /* We want to promote to a vector register, so we expect that at least
+	 SSE is available.  */
+      gcc_assert (TARGET_SSE);
+
+      /* In case of promotion to a vector register, we expect that VAL is a
+	 constant or has already been promoted to a GPR value.  */
+      gcc_assert (GET_MODE (val) == Pmode || CONSTANT_P (val));
+      if (TARGET_64BIT)
+	promoted_val = promote_duplicated_reg (V2DImode, val);
+      else
+	promoted_val = promote_duplicated_reg (V4SImode, val);
+    }
+  else if (size_needed > 4 || (desired_align > align && desired_align > 4))
+    {
+      gcc_assert (TARGET_64BIT);
+      promoted_val = promote_duplicated_reg (DImode, val);
+    }
   else if (size_needed > 2 || (desired_align > align && desired_align > 2))
     promoted_val = promote_duplicated_reg (SImode, val);
   else if (size_needed > 1 || (desired_align > align && desired_align > 1))
@@ -22138,10 +22585,14 @@ ix86_expand_setmem (rtx dst, rtx count_e
   int size_needed = 0, epilogue_size_needed;
   int desired_align = 0, align_bytes = 0;
   enum stringop_alg alg;
-  rtx promoted_val = NULL;
-  bool force_loopy_epilogue = false;
+  rtx gpr_promoted_val = NULL;
+  rtx vec_promoted_val = NULL;
   int dynamic_check;
   bool need_zero_guard = false;
+  bool align_unknown;
+  unsigned int unroll_factor;
+  enum machine_mode move_mode;
+  rtx loop_iter = NULL_RTX;
 
   if (CONST_INT_P (align_exp))
     align = INTVAL (align_exp);
@@ -22161,8 +22612,11 @@ ix86_expand_setmem (rtx dst, rtx count_e
   /* Step 0: Decide on preferred algorithm, desired alignment and
      size of chunks to be copied by main loop.  */
 
-  alg = decide_alg (count, expected_size, true, &dynamic_check);
+  align_unknown = get_mem_align_offset (dst, BITS_PER_UNIT) < 0;
+  alg = decide_alg (count, expected_size, true, &dynamic_check, align_unknown);
   desired_align = decide_alignment (align, alg, expected_size);
+  unroll_factor = 1;
+  move_mode = Pmode;
 
   if (!TARGET_ALIGN_STRINGOPS)
     align = desired_align;
@@ -22180,11 +22634,28 @@ ix86_expand_setmem (rtx dst, rtx count_e
       gcc_unreachable ();
     case loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode);
+      move_mode = Pmode;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * 4;
+      move_mode = Pmode;
+      unroll_factor = 1;
+      /* Select the maximal available unroll factor: 1, 2 or 4.  */
+      while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
+	     && unroll_factor < 4)
+	unroll_factor *= 2;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
+      break;
+    case sse_loop:
+      need_zero_guard = true;
+      move_mode = TARGET_64BIT ? V2DImode : V4SImode;
+      unroll_factor = 1;
+      /* Select the maximal available unroll factor: 1, 2 or 4.  */
+      while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
+	     && unroll_factor < 4)
+	unroll_factor *= 2;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
       size_needed = 8;
@@ -22229,8 +22700,10 @@ ix86_expand_setmem (rtx dst, rtx count_e
      main loop and epilogue (ie one load of the big constant in the
      front of all code.  */
   if (CONST_INT_P (val_exp))
-    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
-						   desired_align, align);
+    gpr_promoted_val = promote_duplicated_reg_to_size (val_exp,
+						   GET_MODE_SIZE (Pmode),
+						   GET_MODE_SIZE (Pmode),
+						   align);
   /* Ensure that alignment prologue won't copy past end of block.  */
   if (size_needed > 1 || (desired_align > 1 && desired_align > align))
     {
@@ -22239,12 +22712,6 @@ ix86_expand_setmem (rtx dst, rtx count_e
 	 Make sure it is power of 2.  */
       epilogue_size_needed = smallest_pow2_greater_than (epilogue_size_needed);
 
-      /* To improve performance of small blocks, we jump around the VAL
-	 promoting mode.  This mean that if the promoted VAL is not constant,
-	 we might not use it in the epilogue and have to use byte
-	 loop variant.  */
-      if (epilogue_size_needed > 2 && !promoted_val)
-        force_loopy_epilogue = true;
       if (count)
 	{
 	  if (count < (unsigned HOST_WIDE_INT)epilogue_size_needed)
@@ -22259,6 +22726,12 @@ ix86_expand_setmem (rtx dst, rtx count_e
 	}
       else
 	{
+	  /* The SSE loop reuses the iteration counter in the epilogue.  */
+	  if (alg == sse_loop)
+	    {
+	      loop_iter = gen_reg_rtx (counter_mode (count_exp));
+              emit_move_insn (loop_iter, const0_rtx);
+	    }
 	  label = gen_label_rtx ();
 	  emit_cmp_and_jump_insns (count_exp,
 				   GEN_INT (epilogue_size_needed),
@@ -22284,9 +22757,11 @@ ix86_expand_setmem (rtx dst, rtx count_e
   /* Step 2: Alignment prologue.  */
 
   /* Do the expensive promotion once we branched off the small blocks.  */
-  if (!promoted_val)
-    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
-						   desired_align, align);
+  if (!gpr_promoted_val)
+    gpr_promoted_val = promote_duplicated_reg_to_size (val_exp,
+						   GET_MODE_SIZE (Pmode),
+						   GET_MODE_SIZE (Pmode),
+						   align);
   gcc_assert (desired_align >= 1 && align >= 1);
 
   if (desired_align > align)
@@ -22298,17 +22773,20 @@ ix86_expand_setmem (rtx dst, rtx count_e
 	     the pain to maintain it for the first move, so throw away
 	     the info early.  */
 	  dst = change_address (dst, BLKmode, destreg);
-	  expand_setmem_prologue (dst, destreg, promoted_val, count_exp, align,
+	  expand_setmem_prologue (dst, destreg, gpr_promoted_val, count_exp, align,
 				  desired_align);
+	  set_mem_align (dst, desired_align*BITS_PER_UNIT);
 	}
       else
 	{
 	  /* If we know how many bytes need to be stored before dst is
 	     sufficiently aligned, maintain aliasing info accurately.  */
-	  dst = expand_constant_setmem_prologue (dst, destreg, promoted_val,
+	  dst = expand_constant_setmem_prologue (dst, destreg, gpr_promoted_val,
 						 desired_align, align_bytes);
 	  count_exp = plus_constant (count_exp, -align_bytes);
 	  count -= align_bytes;
+	  if (count < (unsigned HOST_WIDE_INT) size_needed)
+	    goto epilogue;
 	}
       if (need_zero_guard
 	  && (count < (unsigned HOST_WIDE_INT) size_needed
@@ -22336,7 +22814,7 @@ ix86_expand_setmem (rtx dst, rtx count_e
       emit_label (label);
       LABEL_NUSES (label) = 1;
       label = NULL;
-      promoted_val = val_exp;
+      gpr_promoted_val = val_exp;
       epilogue_size_needed = 1;
     }
   else if (label == NULL_RTX)
@@ -22350,27 +22828,40 @@ ix86_expand_setmem (rtx dst, rtx count_e
     case no_stringop:
       gcc_unreachable ();
     case loop_1_byte:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
+      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, val_exp,
 				     count_exp, QImode, 1, expected_size);
       break;
     case loop:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
+      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, gpr_promoted_val,
 				     count_exp, Pmode, 1, expected_size);
       break;
     case unrolled_loop:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
-				     count_exp, Pmode, 4, expected_size);
+      expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+				     NULL, gpr_promoted_val, count_exp,
+				     NULL, move_mode, unroll_factor,
+				     expected_size, false);
+      break;
+    case sse_loop:
+      vec_promoted_val =
+	promote_duplicated_reg_to_size (gpr_promoted_val,
+					GET_MODE_SIZE (move_mode),
+					desired_align, align);
+      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+				     NULL, vec_promoted_val, count_exp,
+				     loop_iter, move_mode, unroll_factor,
+				     expected_size, false);
       break;
     case rep_prefix_8_byte:
-      expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
+      gcc_assert (TARGET_64BIT);
+      expand_setmem_via_rep_stos (dst, destreg, gpr_promoted_val, count_exp,
 				  DImode, val_exp);
       break;
     case rep_prefix_4_byte:
-      expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
+      expand_setmem_via_rep_stos (dst, destreg, gpr_promoted_val, count_exp,
 				  SImode, val_exp);
       break;
     case rep_prefix_1_byte:
-      expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
+      expand_setmem_via_rep_stos (dst, destreg, gpr_promoted_val, count_exp,
 				  QImode, val_exp);
       break;
     }
@@ -22401,17 +22892,33 @@ ix86_expand_setmem (rtx dst, rtx count_e
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
+      /* We cannot rely on the fact that the promoted value is known.  */
+      vec_promoted_val = 0;
     }
  epilogue:
-  if (count_exp != const0_rtx && epilogue_size_needed > 1)
+  if (alg == sse_loop)
     {
-      if (force_loopy_epilogue)
-	expand_setmem_epilogue_via_loop (dst, destreg, val_exp, count_exp,
-					 epilogue_size_needed);
-      else
-	expand_setmem_epilogue (dst, destreg, promoted_val, count_exp,
-				epilogue_size_needed);
+      rtx tmp;
+      if (align_unknown && unroll_factor > 1)
+	{
+	  /* Reduce the epilogue's size by creating a non-unrolled loop.  If we
+	     don't, the epilogue can be very big - when the alignment is
+	     statically unknown it would proceed byte by byte, which may be
+	     very slow.  */
+	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+	      NULL, vec_promoted_val, count_exp,
+	      loop_iter, move_mode, 1,
+	      expected_size, false);
+	  dst = change_address (dst, BLKmode, destreg);
+	  epilogue_size_needed = GET_MODE_SIZE (move_mode);
+	}
+      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != destreg)
+	emit_move_insn (destreg, tmp);
     }
+  if (count_exp != const0_rtx && epilogue_size_needed > 1)
+    expand_setmem_epilogue (dst, destreg, vec_promoted_val, gpr_promoted_val,
+			    val_exp, count_exp, epilogue_size_needed);
   if (jump_around_label)
     emit_label (jump_around_label);
   return true;
@@ -37676,6 +38183,33 @@ ix86_autovectorize_vector_sizes (void)
   return (TARGET_AVX && !TARGET_PREFER_AVX128) ? 32 | 16 : 0;
 }
 
+/* Target hook.  Return true if unaligned access to data in MODE with
+   alignment ALIGN is slow, so such accesses should be avoided.  */
+
+static bool
+ix86_slow_unaligned_access (enum machine_mode mode,
+			    unsigned int align)
+{
+  if (TARGET_AVX)
+    {
+      if (GET_MODE_SIZE (mode) == 32)
+	{
+	  if (align <= 16)
+	    return (TARGET_AVX256_SPLIT_UNALIGNED_LOAD ||
+		    TARGET_AVX256_SPLIT_UNALIGNED_STORE);
+	  else
+	    return false;
+	}
+    }
+
+  if (GET_MODE_SIZE (mode) > 8)
+    {
+      return (! TARGET_SSE_UNALIGNED_LOAD_OPTIMAL &&
+	      ! TARGET_SSE_UNALIGNED_STORE_OPTIMAL);
+    }
+
+  return false;
+}
+
 /* Initialize the GCC target structure.  */
 #undef TARGET_RETURN_IN_MEMORY
 #define TARGET_RETURN_IN_MEMORY ix86_return_in_memory
@@ -37977,6 +38511,9 @@ ix86_autovectorize_vector_sizes (void)
 #undef TARGET_CONDITIONAL_REGISTER_USAGE
 #define TARGET_CONDITIONAL_REGISTER_USAGE ix86_conditional_register_usage
 
+#undef TARGET_SLOW_UNALIGNED_ACCESS
+#define TARGET_SLOW_UNALIGNED_ACCESS ix86_slow_unaligned_access
+
 #if TARGET_MACHO
 #undef TARGET_INIT_LIBFUNCS
 #define TARGET_INIT_LIBFUNCS darwin_rename_builtins

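For readers skimming the patch, a short illustration of the epilogue rework in
expand_movmem_epilogue: when the byte count is a compile-time constant, the
tail is now copied with the widest chunk that still fits, falling back to
progressively narrower chunks.  The following is only a minimal standalone
sketch in plain C, not GCC internals; the chunk sizes and the memcpy stand-in
are illustrative assumptions, while the real code emits strmov patterns in
V4SImode/DImode/SImode/HImode/QImode.

#include <stdio.h>
#include <string.h>

/* Copy REMAINDER_SIZE trailing bytes using the widest chunk first, then
   progressively narrower ones, mirroring the new epilogue structure.  */
static void
copy_tail (char *dst, const char *src, int remainder_size)
{
  static const int chunk_sizes[] = { 16, 8, 4, 2, 1 };
  int offset = 0;
  unsigned int i;

  for (i = 0; i < sizeof (chunk_sizes) / sizeof (chunk_sizes[0]); i++)
    while (remainder_size >= chunk_sizes[i])
      {
	/* In the real patch this is an emit_strmov of the chosen mode.  */
	memcpy (dst + offset, src + offset, chunk_sizes[i]);
	offset += chunk_sizes[i];
	remainder_size -= chunk_sizes[i];
      }
}

int
main (void)
{
  char src[32] = "abcdefghijklmnopqrstuvwxyz01234";
  char dst[32] = { 0 };
  copy_tail (dst, src, 31);	/* copies 16 + 8 + 4 + 2 + 1 bytes */
  printf ("%s\n", dst);
  return 0;
}
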
^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-11-03 12:56                                       ` Michael Zolotukhin
  2011-11-06 14:28                                         ` Jan Hubicka
@ 2011-11-07 15:52                                         ` Jan Hubicka
  2011-11-07 16:24                                           ` Michael Zolotukhin
  1 sibling, 1 reply; 52+ messages in thread
From: Jan Hubicka @ 2011-11-07 15:52 UTC (permalink / raw)
  To: Michael Zolotukhin
  Cc: Jan Hubicka, Richard Henderson, Jakub Jelinek, Jack Howarth,
	gcc-patches, Richard Guenther, H.J. Lu, izamyatin,
	areg.melikadamyan


Hi,
this is the variant of the patch that I hope to commit today after some
further testing.  I removed most of the code changing interfaces outside the
i386 backend and also reverted the changes to the generic cost models, since
those are the result of a discussion between AMD and Intel and any changes
here need to be discussed by both sides.

I noticed that the core model still uses generic costs, which is quite bogus.
It also seems bogus to have two costs for the 32-bit and 64-bit modes of core;
the only reason why there are 32-bit and 64-bit generic models is that 32-bit
generic still takes into account 32-bit chips (Centrino and Athlon).  I think
we may drop those and remove the optimizations targeted at those chips.

Bootstrapped/regtested on x86_64-linux; I intend to commit it today after
some further testing.

Honza

2011-11-03  Zolotukhin Michael  <michael.v.zolotukhin@gmail.com>
	    Jan Hubicka  <jh@suse.cz>

	* config/i386/i386.h (processor_costs): Add second dimension to
	stringop_algs array.
	* config/i386/i386.c (cost models): Initialize second dimension of
	stringop_algs arrays.
	(core_cost): New costs based on generic64 costs with updated stringop
	values.
	(promote_duplicated_reg): Add support for vector modes, add
	declaration.
	(promote_duplicated_reg_to_size): Likewise.
	(processor_target): Set core costs for core variants.
	(expand_set_or_movmem_via_loop_with_iter): New function.
	(expand_set_or_movmem_via_loop): Enable reuse of the same iters in
	different loops, produced by this function.
	(emit_strset): New function.
	(expand_movmem_epilogue): Add epilogue generation for bigger sizes,
	use SSE-moves where possible.
	(expand_setmem_epilogue): Likewise.
	(expand_movmem_prologue): Likewise for prologue.
	(expand_setmem_prologue): Likewise.
	(expand_constant_movmem_prologue): Likewise.
	(expand_constant_setmem_prologue): Likewise.
	(decide_alg): Add new argument align_unknown.  Fix algorithm of
	strategy selection if TARGET_INLINE_ALL_STRINGOPS is set; skip sse_loop.
	(decide_alignment): Update desired alignment according to chosen move
	mode.
	(ix86_expand_movmem): Change unrolled_loop strategy to use SSE-moves.
	(ix86_expand_setmem): Likewise.
	(ix86_slow_unaligned_access): Implementation of new hook
	slow_unaligned_access.
	* config/i386/i386.md (strset): Enable half-SSE moves.
	* config/i386/sse.md (vec_dupv4si): Add expand for vec_dupv4si.
	(vec_dupv2di): Add expand for vec_dupv2di.

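To illustrate the new table layout before diving into the patch: the
stringop_algs arrays gain a second dimension, and decide_alg now indexes them
as cost->memcpy[align_unknown][TARGET_64BIT != 0].  Below is a self-contained
sketch of that lookup in plain C; it does not use the actual GCC structures,
and the table contents are made-up placeholders rather than the tuned values
from the patch.

#include <stdio.h>
#include <stdbool.h>

enum alg { alg_libcall, alg_unrolled_loop, alg_sse_loop };

/* memcpy_algs[align_unknown][is_64bit], mirroring the new indexing.  */
static const enum alg memcpy_algs[2][2] = {
  { alg_sse_loop,      alg_sse_loop },	/* alignment known at compile time */
  { alg_unrolled_loop, alg_libcall }	/* alignment unknown */
};

int
main (void)
{
  bool align_unknown = true, is_64bit = false;
  printf ("chosen alg id: %d\n", (int) memcpy_algs[align_unknown][is_64bit]);
  return 0;
}
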
Index: i386.h
===================================================================
--- i386.h	(revision 181033)
+++ i386.h	(working copy)
@@ -159,8 +159,12 @@ struct processor_costs {
   const int fchs;		/* cost of FCHS instruction.  */
   const int fsqrt;		/* cost of FSQRT instruction.  */
 				/* Specify what algorithm
-				   to use for stringops on unknown size.  */
-  struct stringop_algs memcpy[2], memset[2];
+				   to use for stringops on unknown size.
+				   First index is used to specify whether
+				   alignment is known or not.
+				   Second - to specify whether 32 or 64 bits
+				   are used.  */
+  struct stringop_algs memcpy[2][2], memset[2][2];
   const int scalar_stmt_cost;   /* Cost of any scalar operation, excluding
 				   load and store.  */
   const int scalar_load_cost;   /* Cost of scalar load.  */
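
For readers of the rest of the diff: the first index of the widened
stringop_algs arrays selects between the "alignment known" and "alignment
unknown" tables, the second between 32-bit and 64-bit mode; decide_alg then
walks the {max, alg} entries of the chosen table.  Below is a minimal
stand-alone sketch of that lookup, with simplified types and without the
rep-prefix usability and dynamic-check logic of the real decide_alg:

/* Simplified sketch of the new stringop algorithm selection; the names and
   the MAX_ENTRIES value are placeholders, not copies of the GCC types.  */
#include <stdbool.h>

enum stringop_alg { libcall, loop, unrolled_loop, sse_loop };

struct stringop_strategy { long max; enum stringop_alg alg; };

#define MAX_ENTRIES 4
struct stringop_algs {
  enum stringop_alg unknown_size;          /* used when the size is unknown */
  struct stringop_strategy size[MAX_ENTRIES];
};

/* First index: alignment unknown?  Second index: 64-bit mode?  */
static enum stringop_alg
pick_alg (const struct stringop_algs table[2][2],
          long count, bool align_unknown, bool is_64bit)
{
  const struct stringop_algs *algs = &table[align_unknown][is_64bit];

  if (count < 0)                           /* size not known at compile time */
    return algs->unknown_size;
  for (int i = 0; i < MAX_ENTRIES; i++)
    if (algs->size[i].max == -1 || count <= algs->size[i].max)
      return algs->size[i].alg;            /* first entry whose max covers count */
  return libcall;
}

Listing the same threshold twice, as the Atom tables further down do with
{4096, sse_loop}, {4096, unrolled_loop}, is presumably how the "fall back to
the non-SSE unrolled loop" behaviour is expressed: the real decide_alg also
checks whether each candidate algorithm is usable before accepting it.
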
Index: i386.md
===================================================================
--- i386.md	(revision 181033)
+++ i386.md	(working copy)
@@ -15937,6 +15937,17 @@
 	      (clobber (reg:CC FLAGS_REG))])]
   ""
 {
+  rtx vec_reg;
+  enum machine_mode mode = GET_MODE (operands[2]);
+  if (vector_extensions_used_for_mode (mode)
+      && CONSTANT_P (operands[2]))
+    {
+      if (mode == DImode)
+	mode = TARGET_64BIT ? V2DImode : V4SImode;
+      vec_reg = gen_reg_rtx (mode);
+      emit_move_insn (vec_reg, operands[2]);
+      operands[2] = vec_reg;
+    }
   if (GET_MODE (operands[1]) != GET_MODE (operands[2]))
     operands[1] = adjust_address_nv (operands[1], GET_MODE (operands[2]), 0);
 
Index: i386-opts.h
===================================================================
--- i386-opts.h	(revision 181033)
+++ i386-opts.h	(working copy)
@@ -37,7 +37,8 @@ enum stringop_alg
    rep_prefix_8_byte,
    loop_1_byte,
    loop,
-   unrolled_loop
+   unrolled_loop,
+   sse_loop
 };
 
 /* Available call abi.  */
Index: sse.md
===================================================================
--- sse.md	(revision 181033)
+++ sse.md	(working copy)
@@ -7509,6 +7509,16 @@
    (set_attr "prefix" "maybe_vex,orig,vex,maybe_vex,orig,orig")
    (set_attr "mode" "V2SF,TI,TI,TI,V4SF,V2SF")])
 
+(define_expand "vec_dupv4si"
+  [(set (match_operand:V4SI 0 "register_operand" "")
+	(vec_duplicate:V4SI
+	  (match_operand:SI 1 "nonimmediate_operand" "")))]
+  "TARGET_SSE"
+{
+  if (!TARGET_AVX)
+    operands[1] = force_reg (V4SImode, operands[1]);
+})
+
 (define_insn "*vec_dupv4si"
   [(set (match_operand:V4SI 0 "register_operand"     "=x,x,x")
 	(vec_duplicate:V4SI
@@ -7525,6 +7535,16 @@
    (set_attr "prefix" "maybe_vex,vex,orig")
    (set_attr "mode" "TI,V4SF,V4SF")])
 
+(define_expand "vec_dupv2di"
+  [(set (match_operand:V2DI 0 "register_operand" "")
+	(vec_duplicate:V2DI
+	  (match_operand:DI 1 "nonimmediate_operand" "")))]
+  "TARGET_SSE"
+{
+  if (!TARGET_AVX)
+    operands[1] = force_reg (V2DImode, operands[1]);
+})
+
 (define_insn "*vec_dupv2di"
   [(set (match_operand:V2DI 0 "register_operand"     "=x,x,x,x")
 	(vec_duplicate:V2DI
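
The vec_dupv4si/vec_dupv2di expanders are presumably what the promoted-value
path of the memset expansion uses to broadcast the replicated fill value into
a full SSE register (the caller lives in i386.c and is not visible in this
hunk).  As a plain-C illustration of the underlying idea - not GCC code, and
broadcast_byte/set_by_words are made-up names - the generated
prologue/main-loop/epilogue shape corresponds roughly to:

#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Replicate a byte across a 64-bit word; the scalar analogue of what
   promote_duplicated_reg and the vec_dup patterns do for wider registers.  */
static uint64_t broadcast_byte (uint8_t v)
{
  return (uint64_t) v * UINT64_C (0x0101010101010101);
}

/* memset-like loop with wide stores plus a byte-wise epilogue; the inline
   expansion has the same main-loop/epilogue shape, just with SSE-sized
   chunks when the alignment allows it.  */
static void set_by_words (uint8_t *dst, uint8_t v, size_t n)
{
  uint64_t word = broadcast_byte (v);
  size_t i = 0;

  for (; i + 8 <= n; i += 8)
    memcpy (dst + i, &word, sizeof word);  /* 8-byte store, alignment-safe */
  for (; i < n; i++)                       /* epilogue: trailing bytes */
    dst[i] = v;
}
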
Index: i386.opt
===================================================================
--- i386.opt	(revision 181033)
+++ i386.opt	(working copy)
@@ -324,6 +324,9 @@ Enum(stringop_alg) String(loop) Value(lo
 EnumValue
 Enum(stringop_alg) String(unrolled_loop) Value(unrolled_loop)
 
+EnumValue
+Enum(stringop_alg) String(sse_loop) Value(sse_loop)
+
 mtls-dialect=
 Target RejectNegative Joined Var(ix86_tls_dialect) Enum(tls_dialect) Init(TLS_DIALECT_GNU)
 Use given thread-local storage dialect
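
The new EnumValue also makes sse_loop selectable by hand, presumably via the
existing -mstringop-strategy= option that this Enum(stringop_alg) backs (the
option record itself is outside this hunk).  A small test case for inspecting
the resulting expansion might look like the following; the file name and the
exact flags are only a suggestion:

/* test-sse-memset.c: compile with e.g.
     gcc -O2 -m32 -mtune=atom -mstringop-strategy=sse_loop -S test-sse-memset.c
   and look at the assembly for the inline SSE loop.  */
#include <string.h>

void fill (char *buf)
{
  /* The size is a compile-time constant, but the alignment of BUF is
     unknown, so this exercises the "unknown alignment" row of the new
     tables.  */
  memset (buf, 0, 200);
}
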
Index: i386.c
===================================================================
--- i386.c	(revision 181033)
+++ i386.c	(working copy)
@@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost =
   COSTS_N_BYTES (2),			/* cost of FABS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FCHS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -632,10 +636,14 @@ struct processor_costs i386_cost = {	/*
   COSTS_N_INSNS (22),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (24),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (122),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -704,10 +712,14 @@ struct processor_costs i486_cost = {	/*
   COSTS_N_INSNS (3),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (3),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (83),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -774,10 +786,14 @@ struct processor_costs pentium_cost = {
   COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (70),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{-1, rep_prefix_4_byte}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -849,12 +865,18 @@ struct processor_costs pentiumpro_cost =
      noticeable win, for bigger blocks either rep movsl or rep movsb is
      way to go.  Rep movsb has apparently more expensive startup time in CPU,
      but after 4K the difference is down in the noise.  */
-  {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
+  {{{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
 			{8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_4_byte, {{1024, unrolled_loop},
-  			{8192, rep_prefix_4_byte}, {-1, libcall}}},
+   {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_4_byte, {{1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_4_byte, {{1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -922,10 +944,14 @@ struct processor_costs geode_cost = {
   COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (54),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -995,10 +1021,14 @@ struct processor_costs k6_cost = {
   COSTS_N_INSNS (2),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (2),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (56),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1068,10 +1098,14 @@ struct processor_costs athlon_cost = {
   /* For some reason, Athlon deals better with REP prefix (relative to loops)
      compared to K8. Alignment becomes important after 8 bytes for memcpy and
      128 bytes for memset.  */
-  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1146,11 +1180,16 @@ struct processor_costs k8_cost = {
   /* K8 has optimized REP instruction for medium sized blocks, but for very
      small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1233,11 +1272,16 @@ struct processor_costs amdfam10_cost = {
   /* AMDFAM10 has optimized REP instruction for medium sized blocks, but for
      very small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
-   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {512, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {512, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1320,11 +1364,16 @@ struct processor_costs bdver1_cost = {
   /*  BDVER1 has optimized REP instruction for medium sized blocks, but for
       very small blocks it is better to use loop. For large blocks, libcall
       can do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   6,					/* scalar_stmt_cost.  */
   4,					/* scalar load_cost.  */
   4,					/* scalar_store_cost.  */
@@ -1407,11 +1456,16 @@ struct processor_costs bdver2_cost = {
   /*  BDVER2 has optimized REP instruction for medium sized blocks, but for
       very small blocks it is better to use loop. For large blocks, libcall
       can do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+  {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   6,					/* scalar_stmt_cost.  */
   4,					/* scalar load_cost.  */
   4,					/* scalar_store_cost.  */
@@ -1489,11 +1543,16 @@ struct processor_costs btver1_cost = {
   /* BTVER1 has optimized REP instruction for medium sized blocks, but for
      very small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1560,11 +1619,18 @@ struct processor_costs pentium4_cost = {
   COSTS_N_INSNS (2),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (2),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (43),			/* cost of FSQRT instruction.  */
-  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+
+  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+
+  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
    {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1631,13 +1697,22 @@ struct processor_costs nocona_cost = {
   COSTS_N_INSNS (3),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (3),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (44),			/* cost of FSQRT instruction.  */
-  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+
+  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
    {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
 	      {100000, unrolled_loop}, {-1, libcall}}}},
-  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
+	      {100000, unrolled_loop}, {-1, libcall}}}}},
+
+  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
    {-1, libcall}}},
    {libcall, {{24, loop}, {64, unrolled_loop},
 	      {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {-1, libcall}}},
+   {libcall, {{24, loop}, {64, unrolled_loop},
+	      {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1704,13 +1779,108 @@ struct processor_costs atom_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{11, loop}, {-1, rep_prefix_4_byte}}},
-   {libcall, {{32, loop}, {64, rep_prefix_4_byte},
-	  {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {15, unrolled_loop},
-	  {2048, rep_prefix_4_byte}, {-1, libcall}}},
-   {libcall, {{24, loop}, {32, unrolled_loop},
-	  {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+
+  /* stringop_algs for memcpy.
+     An SSE loop works best on Atom, but fall back to the non-SSE unrolled
+     loop variant if that fails.  */
+  {{{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
+    {libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}},
+   {{libcall, {{-1, libcall}}},			       /* Unknown alignment.  */
+    {libcall, {{2048, sse_loop}, {2048, unrolled_loop},
+	       {-1, libcall}}}}},
+
+  /* stringop_algs for memset.  */
+  {{{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
+    {libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}},
+   {{libcall, {{1024, sse_loop}, {1024, unrolled_loop},	 /* Unknown alignment.  */
+	       {-1, libcall}}},
+    {libcall, {{2048, sse_loop}, {2048, unrolled_loop},
+	       {-1, libcall}}}}},
+  1,					/* scalar_stmt_cost.  */
+  1,					/* scalar load_cost.  */
+  1,					/* scalar_store_cost.  */
+  1,					/* vec_stmt_cost.  */
+  1,					/* vec_to_scalar_cost.  */
+  1,					/* scalar_to_vec_cost.  */
+  1,					/* vec_align_load_cost.  */
+  2,					/* vec_unalign_load_cost.  */
+  1,					/* vec_store_cost.  */
+  3,					/* cond_taken_branch_cost.  */
+  1,					/* cond_not_taken_branch_cost.  */
+};
+
+/* Core should produce code tuned for core variants.  */
+static const
+struct processor_costs core_cost = {
+  COSTS_N_INSNS (1),			/* cost of an add instruction */
+  /* On all chips taken into consideration lea is 2 cycles and more.  With
+     this cost however our current implementation of synth_mult results in
+     use of unnecessary temporary registers causing regression on several
+     SPECfp benchmarks.  */
+  COSTS_N_INSNS (1) + 1,		/* cost of a lea instruction */
+  COSTS_N_INSNS (1),			/* variable shift costs */
+  COSTS_N_INSNS (1),			/* constant shift costs */
+  {COSTS_N_INSNS (3),			/* cost of starting multiply for QI */
+   COSTS_N_INSNS (4),			/*				 HI */
+   COSTS_N_INSNS (3),			/*				 SI */
+   COSTS_N_INSNS (4),			/*				 DI */
+   COSTS_N_INSNS (2)},			/*			      other */
+  0,					/* cost of multiply per each bit set */
+  {COSTS_N_INSNS (18),			/* cost of a divide/mod for QI */
+   COSTS_N_INSNS (26),			/*			    HI */
+   COSTS_N_INSNS (42),			/*			    SI */
+   COSTS_N_INSNS (74),			/*			    DI */
+   COSTS_N_INSNS (74)},			/*			    other */
+  COSTS_N_INSNS (1),			/* cost of movsx */
+  COSTS_N_INSNS (1),			/* cost of movzx */
+  8,					/* "large" insn */
+  17,					/* MOVE_RATIO */
+  4,				     /* cost for loading QImode using movzbl */
+  {4, 4, 4},				/* cost of loading integer registers
+					   in QImode, HImode and SImode.
+					   Relative to reg-reg move (2).  */
+  {4, 4, 4},				/* cost of storing integer registers */
+  4,					/* cost of reg,reg fld/fst */
+  {12, 12, 12},				/* cost of loading fp registers
+					   in SFmode, DFmode and XFmode */
+  {6, 6, 8},				/* cost of storing fp registers
+					   in SFmode, DFmode and XFmode */
+  2,					/* cost of moving MMX register */
+  {8, 8},				/* cost of loading MMX registers
+					   in SImode and DImode */
+  {8, 8},				/* cost of storing MMX registers
+					   in SImode and DImode */
+  2,					/* cost of moving SSE register */
+  {8, 8, 8},				/* cost of loading SSE registers
+					   in SImode, DImode and TImode */
+  {8, 8, 8},				/* cost of storing SSE registers
+					   in SImode, DImode and TImode */
+  5,					/* MMX or SSE register to integer */
+  32,					/* size of l1 cache.  */
+  512,					/* size of l2 cache.  */
+  64,					/* size of prefetch block */
+  6,					/* number of parallel prefetches */
+  /* Benchmarks shows large regressions on K8 sixtrack benchmark when this
+     value is increased to perhaps more appropriate value of 5.  */
+  3,					/* Branch cost */
+  COSTS_N_INSNS (8),			/* cost of FADD and FSUB insns.  */
+  COSTS_N_INSNS (8),			/* cost of FMUL instruction.  */
+  COSTS_N_INSNS (20),			/* cost of FDIV instruction.  */
+  COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
+  COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
+  COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
+
+  /* stringop_algs for memcpy.  */
+  {{{libcall, {{16, loop}, {24, unrolled_loop}, {1024, rep_prefix_4_byte}, {-1, libcall}}}, /* Known alignment.  */
+    {libcall, {{16, loop}, {24, unrolled_loop}, {1024, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{16, loop}, {24, unrolled_loop}, {1024, rep_prefix_4_byte}, {-1, libcall}}}, /* Unknown alignment.  */
+    {libcall, {{16, loop}, {24, unrolled_loop}, {1024, rep_prefix_8_byte}, {-1, libcall}}}}},
+
+  /* stringop_algs for memset.  */
+  {{{libcall, {{256, rep_prefix_4_byte}}}, /* Known alignment.  */
+    {libcall, {{256, rep_prefix_8_byte}}}},
+   {{libcall, {{256, rep_prefix_4_byte}}}, /* Unknown alignment.  */
+    {libcall, {{256, rep_prefix_8_byte}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1724,7 +1894,7 @@ struct processor_costs atom_cost = {
   1,					/* cond_not_taken_branch_cost.  */
 };
 
-/* Generic64 should produce code tuned for Nocona and K8.  */
+/* Generic64 should produce code tuned for Nocona, Core, K8, Amdfam10 and Bulldozer.  */
 static const
 struct processor_costs generic64_cost = {
   COSTS_N_INSNS (1),			/* cost of an add instruction */
@@ -1784,10 +1954,16 @@ struct processor_costs generic64_cost =
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {DUMMY_STRINGOP_ALGS,
-   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {DUMMY_STRINGOP_ALGS,
-   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+
+  {{DUMMY_STRINGOP_ALGS,
+    {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {DUMMY_STRINGOP_ALGS,
+    {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+
+  {{DUMMY_STRINGOP_ALGS,
+    {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {DUMMY_STRINGOP_ALGS,
+    {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1801,8 +1977,8 @@ struct processor_costs generic64_cost =
   1,					/* cond_not_taken_branch_cost.  */
 };
 
-/* Generic32 should produce code tuned for PPro, Pentium4, Nocona,
-   Athlon and K8.  */
+/* Generic32 should produce code tuned for PPro, Pentium4, Nocona, Core,
+   Athlon, K8, Amdfam10 and Bulldozer.  */
 static const
 struct processor_costs generic32_cost = {
   COSTS_N_INSNS (1),			/* cost of an add instruction */
@@ -1856,10 +2032,16 @@ struct processor_costs generic32_cost =
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+  /* stringop_algs for memcpy.  */
+  {{{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  /* stringop_algs for memset.  */
+  {{{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -2536,6 +2718,8 @@ static void ix86_set_current_function (t
 static unsigned int ix86_minimum_incoming_stack_boundary (bool);
 
 static enum calling_abi ix86_function_abi (const_tree);
+static rtx promote_duplicated_reg (enum machine_mode, rtx);
+static rtx promote_duplicated_reg_to_size (rtx, int, int, int);
 
 \f
 #ifndef SUBTARGET32_DEFAULT_CPU
@@ -2582,13 +2766,13 @@ static const struct ptt processor_target
   {&k8_cost, 16, 7, 16, 7, 16},
   {&nocona_cost, 0, 0, 0, 0, 0},
   /* Core 2 32-bit.  */
-  {&generic32_cost, 16, 10, 16, 10, 16},
+  {&core_cost, 16, 10, 16, 10, 16},
   /* Core 2 64-bit.  */
-  {&generic64_cost, 16, 10, 16, 10, 16},
+  {&core_cost, 16, 10, 16, 10, 16},
   /* Core i7 32-bit.  */
-  {&generic32_cost, 16, 10, 16, 10, 16},
+  {&core_cost, 16, 10, 16, 10, 16},
   /* Core i7 64-bit.  */
-  {&generic64_cost, 16, 10, 16, 10, 16},
+  {&core_cost, 16, 10, 16, 10, 16},
   {&generic32_cost, 16, 7, 16, 7, 16},
   {&generic64_cost, 16, 10, 16, 10, 16},
   {&amdfam10_cost, 32, 24, 32, 7, 32},
@@ -20800,22 +20984,37 @@ counter_mode (rtx count_exp)
   return SImode;
 }
 
-/* When SRCPTR is non-NULL, output simple loop to move memory
+/* Helper function for expand_set_or_movmem_via_loop.
+
+   When SRCPTR is non-NULL, output simple loop to move memory
    pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
    overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
    equivalent loop to set memory by VALUE (supposed to be in MODE).
 
    The size is rounded down to whole number of chunk size moved at once.
-   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
+   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.
 
+   If ITER isn't NULL, then it is used in the generated loop without
+   initialization (that allows generating several consecutive loops using
+   the same iterator).
+   If CHANGE_PTRS is specified, DESTPTR and SRCPTR are increased by the
+   iterator value at the end of the function (as if they had iterated in
+   the loop).  Otherwise, their values stay unchanged.
+
+   If EXPECTED_SIZE isn't -1, it is used to compute branch probabilities on
+   the loop backedge.  When the expected size is unknown (it is -1), the
+   probability is set to 80%.
 
-static void
-expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
-			       rtx destptr, rtx srcptr, rtx value,
-			       rtx count, enum machine_mode mode, int unroll,
-			       int expected_size)
+   The return value is the rtx of the iterator used in the loop; it can be
+   reused in subsequent calls to this function.  */
+static rtx
+expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
+					 rtx destptr, rtx srcptr, rtx value,
+					 rtx count, rtx iter,
+					 enum machine_mode mode, int unroll,
+					 int expected_size, bool change_ptrs)
 {
-  rtx out_label, top_label, iter, tmp;
+  rtx out_label, top_label, tmp;
   enum machine_mode iter_mode = counter_mode (count);
   rtx piece_size = GEN_INT (GET_MODE_SIZE (mode) * unroll);
   rtx piece_size_mask = GEN_INT (~((GET_MODE_SIZE (mode) * unroll) - 1));
@@ -20823,10 +21022,12 @@ expand_set_or_movmem_via_loop (rtx destm
   rtx x_addr;
   rtx y_addr;
   int i;
+  bool reuse_iter = (iter != NULL_RTX);
 
   top_label = gen_label_rtx ();
   out_label = gen_label_rtx ();
-  iter = gen_reg_rtx (iter_mode);
+  if (!reuse_iter)
+    iter = gen_reg_rtx (iter_mode);
 
   size = expand_simple_binop (iter_mode, AND, count, piece_size_mask,
 			      NULL, 1, OPTAB_DIRECT);
@@ -20837,18 +21038,21 @@ expand_set_or_movmem_via_loop (rtx destm
 			       true, out_label);
       predict_jump (REG_BR_PROB_BASE * 10 / 100);
     }
-  emit_move_insn (iter, const0_rtx);
+  if (!reuse_iter)
+    emit_move_insn (iter, const0_rtx);
 
   emit_label (top_label);
 
   tmp = convert_modes (Pmode, iter_mode, iter, true);
   x_addr = gen_rtx_PLUS (Pmode, destptr, tmp);
-  destmem = change_address (destmem, mode, x_addr);
+  destmem =
+    adjust_automodify_address_nv (copy_rtx (destmem), mode, x_addr, 0);
 
   if (srcmem)
     {
       y_addr = gen_rtx_PLUS (Pmode, srcptr, copy_rtx (tmp));
-      srcmem = change_address (srcmem, mode, y_addr);
+      srcmem =
+	adjust_automodify_address_nv (copy_rtx (srcmem), mode, y_addr, 0);
 
       /* When unrolling for chips that reorder memory reads and writes,
 	 we can save registers by using single temporary.
@@ -20920,19 +21124,43 @@ expand_set_or_movmem_via_loop (rtx destm
     }
   else
     predict_jump (REG_BR_PROB_BASE * 80 / 100);
-  iter = ix86_zero_extend_to_Pmode (iter);
-  tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
-			     true, OPTAB_LIB_WIDEN);
-  if (tmp != destptr)
-    emit_move_insn (destptr, tmp);
-  if (srcptr)
+  if (change_ptrs)
     {
-      tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
+      iter = ix86_zero_extend_to_Pmode (iter);
+      tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
 				 true, OPTAB_LIB_WIDEN);
-      if (tmp != srcptr)
-	emit_move_insn (srcptr, tmp);
+      if (tmp != destptr)
+	emit_move_insn (destptr, tmp);
+      if (srcptr)
+	{
+	  tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
+				     true, OPTAB_LIB_WIDEN);
+	  if (tmp != srcptr)
+	    emit_move_insn (srcptr, tmp);
+	}
     }
   emit_label (out_label);
+  return iter;
+}
+
+/* When SRCPTR is non-NULL, output simple loop to move memory
+   pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
+   overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
+   equivalent loop to set memory by VALUE (supposed to be in MODE).
+
+   The size is rounded down to whole number of chunk size moved at once.
+   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
+
+static void
+expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
+			       rtx destptr, rtx srcptr, rtx value,
+			       rtx count, enum machine_mode mode, int unroll,
+			       int expected_size)
+{
+  expand_set_or_movmem_via_loop_with_iter (destmem, srcmem,
+				 destptr, srcptr, value,
+				 count, NULL_RTX, mode, unroll,
+				 expected_size, true);
 }
 
 /* Output "rep; mov" instruction.
@@ -21036,7 +21264,18 @@ emit_strmov (rtx destmem, rtx srcmem,
   emit_insn (gen_strmov (destptr, dest, srcptr, src));
 }
 
-/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST.  */
+/* Emit strset instruction.  If RHS is constant and a vector mode will be used,
+   then move this constant to a vector register before emitting strset.  */
+static void
+emit_strset (rtx destmem, rtx value,
+	     rtx destptr, enum machine_mode mode, int offset)
+{
+  rtx dest = adjust_automodify_address_nv (destmem, mode, destptr, offset);
+  emit_insn (gen_strset (destptr, dest, value));
+}
+
+/* Output code to copy (COUNT % MAX_SIZE) bytes from SRCPTR to DESTPTR.
+   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
 static void
 expand_movmem_epilogue (rtx destmem, rtx srcmem,
 			rtx destptr, rtx srcptr, rtx count, int max_size)
@@ -21047,43 +21286,55 @@ expand_movmem_epilogue (rtx destmem, rtx
       HOST_WIDE_INT countval = INTVAL (count);
       int offset = 0;
 
-      if ((countval & 0x10) && max_size > 16)
+      int remainder_size = countval % max_size;
+      enum machine_mode move_mode = Pmode;
+
+      /* First, try to move data with the widest possible mode.
+	 The remaining part is moved using Pmode and narrower modes.  */
+      if (TARGET_SSE)
 	{
-	  if (TARGET_64BIT)
-	    {
-	      emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
-	      emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset + 8);
-	    }
-	  else
-	    gcc_unreachable ();
-	  offset += 16;
+	  if (max_size >= GET_MODE_SIZE (V4SImode))
+	    move_mode = V4SImode;
+	  else if (max_size >= GET_MODE_SIZE (DImode))
+	    move_mode = DImode;
 	}
-      if ((countval & 0x08) && max_size > 8)
+
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-	  if (TARGET_64BIT)
-	    emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
-	  else
-	    {
-	      emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
-	      emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset + 4);
-	    }
-	  offset += 8;
+	  emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
+	}
+
+      /* Move the remaining part of the epilogue - its size might still be
+	 as large as the widest scalar mode.  */
+      move_mode = Pmode;
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
+	{
+	  emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
 	}
-      if ((countval & 0x04) && max_size > 4)
+
+      if (remainder_size >= 4)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
 	  offset += 4;
+	  remainder_size -= 4;
 	}
-      if ((countval & 0x02) && max_size > 2)
+      if (remainder_size >= 2)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
 	  offset += 2;
+	  remainder_size -= 2;
 	}
-      if ((countval & 0x01) && max_size > 1)
+      if (remainder_size >= 1)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
 	  offset += 1;
+	  remainder_size -= 1;
 	}
+      gcc_assert (remainder_size == 0);
       return;
     }
   if (max_size > 8)
@@ -21189,87 +21440,121 @@ expand_setmem_epilogue_via_loop (rtx des
 				 1, max_size / 2);
 }
 
-/* Output code to set at most count & (max_size - 1) bytes starting by DEST.  */
+/* Output code to set at most (COUNT % MAX_SIZE) bytes starting from DESTPTR
+   to VALUE.
+   DESTMEM provides MEMrtx to feed proper aliasing info.
+   PROMOTED_TO_GPR_VALUE is an rtx representing a GPR containing the
+   broadcast VALUE.
+   PROMOTED_TO_VECTOR_VALUE is an rtx representing a vector register
+   containing the broadcast VALUE.  Both may be NULL if the promotion
+   hasn't been generated before.  */
 static void
-expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_size)
+expand_setmem_epilogue (rtx destmem, rtx destptr, rtx promoted_to_vector_value,
+			rtx promoted_to_gpr_value, rtx value, rtx count,
+			int max_size)
 {
-  rtx dest;
-
   if (CONST_INT_P (count))
     {
       HOST_WIDE_INT countval = INTVAL (count);
       int offset = 0;
 
-      if ((countval & 0x10) && max_size > 16)
-	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset + 8);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    gcc_unreachable ();
-	  offset += 16;
-	}
-      if ((countval & 0x08) && max_size > 8)
-	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    {
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset + 4);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  offset += 8;
-	}
-      if ((countval & 0x04) && max_size > 4)
+      int remainder_size = countval % max_size;
+      enum machine_mode move_mode = Pmode;
+
+      /* First, try to move data with the widest possible mode.
+	 The remaining part is moved using Pmode and narrower modes.  */
+
+      if (promoted_to_vector_value)
+	while (remainder_size >= 16)
+	  {
+	    if (GET_MODE (destmem) != move_mode)
+	      destmem = adjust_automodify_address_nv (destmem, move_mode,
+						      destptr, offset);
+	    emit_strset (destmem, promoted_to_vector_value, destptr,
+			 move_mode, offset);
+
+	    offset += 16;
+	    remainder_size -= 16;
+	  }
+
+      /* Move the remaining part of the epilogue - its size might still be
+	 as large as the widest scalar mode.  */
+      while (remainder_size >= GET_MODE_SIZE (Pmode))
+	{
+	  if (!promoted_to_gpr_value)
+	    promoted_to_gpr_value = promote_duplicated_reg (Pmode, value);
+	  emit_strset (destmem, promoted_to_gpr_value, destptr, Pmode, offset);
+	  offset += GET_MODE_SIZE (Pmode);
+	  remainder_size -= GET_MODE_SIZE (Pmode);
+	}
+
+      if (!promoted_to_gpr_value && remainder_size > 1)
+	promoted_to_gpr_value = promote_duplicated_reg (remainder_size >= 4
+							? SImode : HImode, value);
+      if (remainder_size >= 4)
 	{
-	  dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
+	  emit_strset (destmem, gen_lowpart (SImode, promoted_to_gpr_value), destptr,
+		       SImode, offset);
 	  offset += 4;
+	  remainder_size -= 4;
 	}
-      if ((countval & 0x02) && max_size > 2)
+      if (remainder_size >= 2)
 	{
-	  dest = adjust_automodify_address_nv (destmem, HImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
-	  offset += 2;
+	  emit_strset (destmem, gen_lowpart (HImode, promoted_to_gpr_value), destptr,
+		       HImode, offset);
+	  offset += 2;
+	  remainder_size -= 2;
 	}
-      if ((countval & 0x01) && max_size > 1)
+      if (remainder_size >= 1)
 	{
-	  dest = adjust_automodify_address_nv (destmem, QImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
+	  emit_strset (destmem,
+		       promoted_to_gpr_value ? gen_lowpart (QImode, promoted_to_gpr_value) : value,
+		        destptr,
+		       QImode, offset);
 	  offset += 1;
+	  remainder_size -= 1;
 	}
+      gcc_assert (remainder_size == 0);
       return;
     }
+
+  /* count isn't const.  */
   if (max_size > 32)
     {
-      expand_setmem_epilogue_via_loop (destmem, destptr, value, count, max_size);
+      expand_setmem_epilogue_via_loop (destmem, destptr, value, count,
+				       max_size);
       return;
     }
+
+  if (!promoted_to_gpr_value)
+    promoted_to_gpr_value = promote_duplicated_reg_to_size (value,
+						   GET_MODE_SIZE (Pmode),
+						   GET_MODE_SIZE (Pmode),
+						   GET_MODE_SIZE (Pmode));
+
   if (max_size > 16)
     {
       rtx label = ix86_expand_aligntest (count, 16, true);
-      if (TARGET_64BIT)
+      if (TARGET_SSE && promoted_to_vector_value)
+	{
+	  destmem = change_address (destmem,
+				    GET_MODE (promoted_to_vector_value),
+				    destptr);
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_vector_value));
+	}
+      else if (TARGET_64BIT)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
@@ -21279,14 +21564,22 @@ expand_setmem_epilogue (rtx destmem, rtx
       rtx label = ix86_expand_aligntest (count, 8, true);
       if (TARGET_64BIT)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
+	}
+      /* FIXME: When this hunk is output, IRA classifies promoted_to_vector_value
+         as NO_REGS.  */
+      else if (TARGET_SSE && promoted_to_vector_value && 0)
+	{
+	  destmem = change_address (destmem, V2SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem,
+				 gen_lowpart (V2SImode, promoted_to_vector_value)));
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
+	  emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
@@ -21294,24 +21587,27 @@ expand_setmem_epilogue (rtx destmem, rtx
   if (max_size > 4)
     {
       rtx label = ix86_expand_aligntest (count, 4, true);
-      dest = change_address (destmem, SImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
+      destmem = change_address (destmem, SImode, destptr);
+      emit_insn (gen_strset (destptr, destmem,
+			     gen_lowpart (SImode, promoted_to_gpr_value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
   if (max_size > 2)
     {
       rtx label = ix86_expand_aligntest (count, 2, true);
-      dest = change_address (destmem, HImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
+      destmem = change_address (destmem, HImode, destptr);
+      emit_insn (gen_strset (destptr, destmem,
+			     gen_lowpart (HImode, promoted_to_gpr_value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
   if (max_size > 1)
     {
       rtx label = ix86_expand_aligntest (count, 1, true);
-      dest = change_address (destmem, QImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
+      destmem = change_address (destmem, QImode, destptr);
+      emit_insn (gen_strset (destptr, destmem,
+			     gen_lowpart (QImode, promoted_to_gpr_value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
@@ -21327,8 +21623,8 @@ expand_movmem_prologue (rtx destmem, rtx
   if (align <= 1 && desired_alignment > 1)
     {
       rtx label = ix86_expand_aligntest (destptr, 1, false);
-      srcmem = change_address (srcmem, QImode, srcptr);
-      destmem = change_address (destmem, QImode, destptr);
+      srcmem = adjust_automodify_address_nv (srcmem, QImode, srcptr, 0);
+      destmem = adjust_automodify_address_nv (destmem, QImode, destptr, 0);
       emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
       ix86_adjust_counter (count, 1);
       emit_label (label);
@@ -21337,8 +21633,8 @@ expand_movmem_prologue (rtx destmem, rtx
   if (align <= 2 && desired_alignment > 2)
     {
       rtx label = ix86_expand_aligntest (destptr, 2, false);
-      srcmem = change_address (srcmem, HImode, srcptr);
-      destmem = change_address (destmem, HImode, destptr);
+      srcmem = adjust_automodify_address_nv (srcmem, HImode, srcptr, 0);
+      destmem = adjust_automodify_address_nv (destmem, HImode, destptr, 0);
       emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
       ix86_adjust_counter (count, 2);
       emit_label (label);
@@ -21347,14 +21643,34 @@ expand_movmem_prologue (rtx destmem, rtx
   if (align <= 4 && desired_alignment > 4)
     {
       rtx label = ix86_expand_aligntest (destptr, 4, false);
-      srcmem = change_address (srcmem, SImode, srcptr);
-      destmem = change_address (destmem, SImode, destptr);
+      srcmem = adjust_automodify_address_nv (srcmem, SImode, srcptr, 0);
+      destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
       emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
       ix86_adjust_counter (count, 4);
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
-  gcc_assert (desired_alignment <= 8);
+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  srcmem = adjust_automodify_address_nv (srcmem, DImode, srcptr, 0);
+	  destmem = adjust_automodify_address_nv (destmem, DImode, destptr, 0);
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	}
+      else
+	{
+	  srcmem = adjust_automodify_address_nv (srcmem, SImode, srcptr, 0);
+	  destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	}
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);
 }
 
 /* Copy enough from DST to SRC to align DST known to DESIRED_ALIGN.
@@ -21409,6 +21725,37 @@ expand_constant_movmem_prologue (rtx dst
       off = 4;
       emit_insn (gen_strmov (destreg, dst, srcreg, src));
     }
+  if (align_bytes & 8)
+    {
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, DImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      else
+	{
+	  dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, SImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
+	set_mem_align (dst, 8 * BITS_PER_UNIT);
+      if (src_align_bytes >= 0)
+	{
+	  unsigned int src_align = 0;
+	  if ((src_align_bytes & 7) == (align_bytes & 7))
+	    src_align = 8;
+	  else if ((src_align_bytes & 3) == (align_bytes & 3))
+	    src_align = 4;
+	  else if ((src_align_bytes & 1) == (align_bytes & 1))
+	    src_align = 2;
+	  if (MEM_ALIGN (src) < src_align * BITS_PER_UNIT)
+	    set_mem_align (src, src_align * BITS_PER_UNIT);
+	}
+      off = 8;
+    }
   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
   src = adjust_automodify_address_nv (src, BLKmode, srcreg, off);
   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
@@ -21416,7 +21763,9 @@ expand_constant_movmem_prologue (rtx dst
   if (src_align_bytes >= 0)
     {
       unsigned int src_align = 0;
-      if ((src_align_bytes & 7) == (align_bytes & 7))
+      if ((src_align_bytes & 15) == (align_bytes & 15))
+	src_align = 16;
+      else if ((src_align_bytes & 7) == (align_bytes & 7))
 	src_align = 8;
       else if ((src_align_bytes & 3) == (align_bytes & 3))
 	src_align = 4;
@@ -21444,7 +21793,7 @@ expand_setmem_prologue (rtx destmem, rtx
   if (align <= 1 && desired_alignment > 1)
     {
       rtx label = ix86_expand_aligntest (destptr, 1, false);
-      destmem = change_address (destmem, QImode, destptr);
+      destmem = adjust_automodify_address_nv (destmem, QImode, destptr, 0);
       emit_insn (gen_strset (destptr, destmem, gen_lowpart (QImode, value)));
       ix86_adjust_counter (count, 1);
       emit_label (label);
@@ -21453,7 +21802,7 @@ expand_setmem_prologue (rtx destmem, rtx
   if (align <= 2 && desired_alignment > 2)
     {
       rtx label = ix86_expand_aligntest (destptr, 2, false);
-      destmem = change_address (destmem, HImode, destptr);
+      destmem = adjust_automodify_address_nv (destmem, HImode, destptr, 0);
       emit_insn (gen_strset (destptr, destmem, gen_lowpart (HImode, value)));
       ix86_adjust_counter (count, 2);
       emit_label (label);
@@ -21462,13 +21811,23 @@ expand_setmem_prologue (rtx destmem, rtx
   if (align <= 4 && desired_alignment > 4)
     {
       rtx label = ix86_expand_aligntest (destptr, 4, false);
-      destmem = change_address (destmem, SImode, destptr);
+      destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
       emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
       ix86_adjust_counter (count, 4);
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
-  gcc_assert (desired_alignment <= 8);
+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);
 }
 
 /* Set enough from DST to align DST known to by aligned by ALIGN to
@@ -21504,6 +21863,19 @@ expand_constant_setmem_prologue (rtx dst
       emit_insn (gen_strset (destreg, dst,
 			     gen_lowpart (SImode, value)));
     }
+  if (align_bytes & 8)
+    {
+      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+      emit_insn (gen_strset (destreg, dst,
+	    gen_lowpart (SImode, value)));
+      off = 4;
+      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+      emit_insn (gen_strset (destreg, dst,
+	    gen_lowpart (SImode, value)));
+      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
+	set_mem_align (dst, 8 * BITS_PER_UNIT);
+      off = 4;
+    }
   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
     set_mem_align (dst, desired_align * BITS_PER_UNIT);
@@ -21515,7 +21887,7 @@ expand_constant_setmem_prologue (rtx dst
 /* Given COUNT and EXPECTED_SIZE, decide on codegen of string operation.  */
 static enum stringop_alg
 decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
-	    int *dynamic_check)
+	    int *dynamic_check, bool align_unknown)
 {
   const struct stringop_algs * algs;
   bool optimize_for_speed;
@@ -21524,7 +21896,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WI
      consider such algorithms if the user has appropriated those
      registers for their own purposes.	*/
   bool rep_prefix_usable = !(fixed_regs[CX_REG] || fixed_regs[DI_REG]
-                             || (memset
+			     || (memset
 				 ? fixed_regs[AX_REG] : fixed_regs[SI_REG]));
 
 #define ALG_USABLE_P(alg) (rep_prefix_usable			\
@@ -21537,7 +21909,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WI
      of time processing large blocks.  */
   if (optimize_function_for_size_p (cfun)
       || (optimize_insn_for_size_p ()
-          && expected_size != -1 && expected_size < 256))
+	  && expected_size != -1 && expected_size < 256))
     optimize_for_speed = false;
   else
     optimize_for_speed = true;
@@ -21546,9 +21918,9 @@ decide_alg (HOST_WIDE_INT count, HOST_WI
 
   *dynamic_check = -1;
   if (memset)
-    algs = &cost->memset[TARGET_64BIT != 0];
+    algs = &cost->memset[align_unknown][TARGET_64BIT != 0];
   else
-    algs = &cost->memcpy[TARGET_64BIT != 0];
+    algs = &cost->memcpy[align_unknown][TARGET_64BIT != 0];
   if (ix86_stringop_alg != no_stringop && ALG_USABLE_P (ix86_stringop_alg))
     return ix86_stringop_alg;
   /* rep; movq or rep; movl is the smallest variant.  */
@@ -21612,29 +21984,33 @@ decide_alg (HOST_WIDE_INT count, HOST_WI
       enum stringop_alg alg;
       int i;
       bool any_alg_usable_p = true;
+      bool only_libcall_fits = true;
 
       for (i = 0; i < MAX_STRINGOP_ALGS; i++)
-        {
-          enum stringop_alg candidate = algs->size[i].alg;
-          any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
+	{
+	  enum stringop_alg candidate = algs->size[i].alg;
+	  any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
 
-          if (candidate != libcall && candidate
-              && ALG_USABLE_P (candidate))
-              max = algs->size[i].max;
-        }
+	  if (candidate != libcall && candidate
+	      && ALG_USABLE_P (candidate))
+	    {
+	      max = algs->size[i].max;
+	      only_libcall_fits = false;
+	    }
+	}
       /* If there aren't any usable algorithms, then recursing on
-         smaller sizes isn't going to find anything.  Just return the
-         simple byte-at-a-time copy loop.  */
-      if (!any_alg_usable_p)
-        {
-          /* Pick something reasonable.  */
-          if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
-            *dynamic_check = 128;
-          return loop_1_byte;
-        }
+	 smaller sizes isn't going to find anything.  Just return the
+	 simple byte-at-a-time copy loop.  */
+      if (!any_alg_usable_p || only_libcall_fits)
+	{
+	  /* Pick something reasonable.  */
+	  if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
+	    *dynamic_check = 128;
+	  return loop_1_byte;
+	}
       if (max == -1)
 	max = 4096;
-      alg = decide_alg (count, max / 2, memset, dynamic_check);
+      alg = decide_alg (count, max / 2, memset, dynamic_check, align_unknown);
       gcc_assert (*dynamic_check == -1);
       gcc_assert (alg != libcall);
       if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
@@ -21658,9 +22034,14 @@ decide_alignment (int align,
       case no_stringop:
 	gcc_unreachable ();
       case loop:
+	desired_align = GET_MODE_SIZE (Pmode);
+	break;
       case unrolled_loop:
 	desired_align = GET_MODE_SIZE (Pmode);
 	break;
+      case sse_loop:
+	desired_align = 16;
+	break;
       case rep_prefix_8_byte:
 	desired_align = 8;
 	break;
@@ -21748,6 +22129,11 @@ ix86_expand_movmem (rtx dst, rtx src, rt
   enum stringop_alg alg;
   int dynamic_check;
   bool need_zero_guard = false;
+  bool align_unknown;
+  int unroll_factor;
+  enum machine_mode move_mode;
+  rtx loop_iter = NULL_RTX;
+  int dst_offset, src_offset;
 
   if (CONST_INT_P (align_exp))
     align = INTVAL (align_exp);
@@ -21771,9 +22157,17 @@ ix86_expand_movmem (rtx dst, rtx src, rt
 
   /* Step 0: Decide on preferred algorithm, desired alignment and
      size of chunks to be copied by main loop.  */
-
-  alg = decide_alg (count, expected_size, false, &dynamic_check);
+  dst_offset = get_mem_align_offset (dst, MOVE_MAX*BITS_PER_UNIT);
+  src_offset = get_mem_align_offset (src, MOVE_MAX*BITS_PER_UNIT);
+  align_unknown = (dst_offset < 0
+		   || src_offset < 0
+		   || src_offset != dst_offset);
+  alg = decide_alg (count, expected_size, false, &dynamic_check, align_unknown);
   desired_align = decide_alignment (align, alg, expected_size);
+  if (align_unknown)
+    desired_align = align;
+  unroll_factor = 1;
+  move_mode = Pmode;
 
   if (!TARGET_ALIGN_STRINGOPS)
     align = desired_align;
@@ -21792,11 +22186,22 @@ ix86_expand_movmem (rtx dst, rtx src, rt
       gcc_unreachable ();
     case loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode);
+      move_mode = Pmode;
+      unroll_factor = 1;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * (TARGET_64BIT ? 4 : 2);
+      move_mode = Pmode;
+      unroll_factor = TARGET_64BIT ? 4 : 2;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
+      break;
+    case sse_loop:
+      need_zero_guard = true;
+      /* Use SSE instructions, if possible.  */
+      move_mode = align_unknown ? DImode : V4SImode;
+      unroll_factor = TARGET_64BIT ? 4 : 2;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
       size_needed = 8;
@@ -21857,6 +22262,12 @@ ix86_expand_movmem (rtx dst, rtx src, rt
 	}
       else
 	{
+	  /* SSE and unrolled algs re-use iteration counter in the epilogue.  */
+	  if (alg == sse_loop || alg == unrolled_loop)
+	    {
+	      loop_iter = gen_reg_rtx (counter_mode (count_exp));
+              emit_move_insn (loop_iter, const0_rtx);
+	    }
 	  label = gen_label_rtx ();
 	  emit_cmp_and_jump_insns (count_exp,
 				   GEN_INT (epilogue_size_needed),
@@ -21908,6 +22319,8 @@ ix86_expand_movmem (rtx dst, rtx src, rt
 	  dst = change_address (dst, BLKmode, destreg);
 	  expand_movmem_prologue (dst, src, destreg, srcreg, count_exp, align,
 				  desired_align);
+	  set_mem_align (src, desired_align*BITS_PER_UNIT);
+	  set_mem_align (dst, desired_align*BITS_PER_UNIT);
 	}
       else
 	{
@@ -21964,12 +22377,16 @@ ix86_expand_movmem (rtx dst, rtx src, rt
       expand_set_or_movmem_via_loop (dst, src, destreg, srcreg, NULL,
 				     count_exp, Pmode, 1, expected_size);
       break;
+    case sse_loop:
     case unrolled_loop:
-      /* Unroll only by factor of 2 in 32bit mode, since we don't have enough
-	 registers for 4 temporaries anyway.  */
-      expand_set_or_movmem_via_loop (dst, src, destreg, srcreg, NULL,
-				     count_exp, Pmode, TARGET_64BIT ? 4 : 2,
-				     expected_size);
+      /* In some cases we want to use the same iterator in several adjacent
+	 loops, so here we save loop iterator rtx and don't update addresses.  */
+      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
+							   srcreg, NULL,
+							   count_exp, loop_iter,
+							   move_mode,
+							   unroll_factor,
+							   expected_size, false);
       break;
     case rep_prefix_8_byte:
       expand_movmem_via_rep_mov (dst, src, destreg, srcreg, count_exp,
@@ -22020,9 +22437,41 @@ ix86_expand_movmem (rtx dst, rtx src, rt
       LABEL_NUSES (label) = 1;
     }
 
+  /* We haven't updated the addresses, so we'll do it now.
+     Also, if the epilogue seems to be big, we'll generate a loop (not
+     unrolled) in it.  We do this only if the alignment is unknown, because
+     in that case the epilogue has to perform the memmove byte by byte, which
+     is very slow.  */
+  if (alg == sse_loop || alg == unrolled_loop)
+    {
+      rtx tmp;
+      if (align_unknown && unroll_factor > 1)
+	{
+	  /* Reduce the epilogue's size by creating a non-unrolled loop.  If we don't
+	     do this, we can get a very big epilogue - when the alignment is statically
+	     unknown, the epilogue goes byte by byte, which may be very slow.  */
+	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
+	      srcreg, NULL, count_exp,
+	      loop_iter, move_mode, 1,
+	      expected_size, false);
+	  src = change_address (src, BLKmode, srcreg);
+	  dst = change_address (dst, BLKmode, destreg);
+	  epilogue_size_needed = GET_MODE_SIZE (move_mode);
+	}
+      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != destreg)
+	emit_move_insn (destreg, tmp);
+
+      tmp = expand_simple_binop (Pmode, PLUS, srcreg, loop_iter, srcreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != srcreg)
+	emit_move_insn (srcreg, tmp);
+    }
   if (count_exp != const0_rtx && epilogue_size_needed > 1)
     expand_movmem_epilogue (dst, src, destreg, srcreg, count_exp,
 			    epilogue_size_needed);
+
   if (jump_around_label)
     emit_label (jump_around_label);
   return true;
@@ -22040,7 +22489,37 @@ promote_duplicated_reg (enum machine_mod
   rtx tmp;
   int nops = mode == DImode ? 3 : 2;
 
+  if (VECTOR_MODE_P (mode))
+    {
+      enum machine_mode inner = GET_MODE_INNER (mode);
+      rtx promoted_val, vec_reg;
+      if (CONST_INT_P (val))
+	return ix86_build_const_vector (mode, true, val);
+
+      promoted_val = promote_duplicated_reg (inner, val);
+      vec_reg = gen_reg_rtx (mode);
+      switch (mode)
+	{
+	case V2DImode:
+	  emit_insn (gen_vec_dupv2di (vec_reg, promoted_val));
+	  break;
+	case V4SImode:
+	  emit_insn (gen_vec_dupv4si (vec_reg, promoted_val));
+	  break;
+	default:
+	  gcc_unreachable ();
+	  break;
+	}
+
+      return vec_reg;
+    }
   gcc_assert (mode == SImode || mode == DImode);
+  if (mode == DImode && !TARGET_64BIT)
+    {
+      rtx vec_reg = promote_duplicated_reg (V4SImode, val);
+      vec_reg = convert_to_mode (V2DImode, vec_reg, 1);
+      return vec_reg;
+    }
   if (val == const0_rtx)
     return copy_to_mode_reg (mode, const0_rtx);
   if (CONST_INT_P (val))
@@ -22106,11 +22585,27 @@ promote_duplicated_reg (enum machine_mod
 static rtx
 promote_duplicated_reg_to_size (rtx val, int size_needed, int desired_align, int align)
 {
-  rtx promoted_val;
+  rtx promoted_val = NULL_RTX;
 
-  if (TARGET_64BIT
-      && (size_needed > 4 || (desired_align > align && desired_align > 4)))
-    promoted_val = promote_duplicated_reg (DImode, val);
+  if (size_needed > 8 || (desired_align > align && desired_align > 8))
+    {
+      /* We want to promote to vector register, so we expect that at least SSE
+	 is available.  */
+      gcc_assert (TARGET_SSE);
+
+      /* In case of promotion to vector register, we expect that val is a
+	 constant or already promoted to GPR value.  */
+      gcc_assert (GET_MODE (val) == Pmode || CONSTANT_P (val));
+      if (TARGET_64BIT)
+	promoted_val = promote_duplicated_reg (V2DImode, val);
+      else
+	promoted_val = promote_duplicated_reg (V4SImode, val);
+    }
+  else if (size_needed > 4 || (desired_align > align && desired_align > 4))
+    {
+      gcc_assert (TARGET_64BIT);
+      promoted_val = promote_duplicated_reg (DImode, val);
+    }
   else if (size_needed > 2 || (desired_align > align && desired_align > 2))
     promoted_val = promote_duplicated_reg (SImode, val);
   else if (size_needed > 1 || (desired_align > align && desired_align > 1))
@@ -22138,10 +22633,14 @@ ix86_expand_setmem (rtx dst, rtx count_e
   int size_needed = 0, epilogue_size_needed;
   int desired_align = 0, align_bytes = 0;
   enum stringop_alg alg;
-  rtx promoted_val = NULL;
-  bool force_loopy_epilogue = false;
+  rtx gpr_promoted_val = NULL;
+  rtx vec_promoted_val = NULL;
   int dynamic_check;
   bool need_zero_guard = false;
+  bool align_unknown;
+  unsigned int unroll_factor;
+  enum machine_mode move_mode;
+  rtx loop_iter = NULL_RTX;
 
   if (CONST_INT_P (align_exp))
     align = INTVAL (align_exp);
@@ -22161,8 +22660,11 @@ ix86_expand_setmem (rtx dst, rtx count_e
   /* Step 0: Decide on preferred algorithm, desired alignment and
      size of chunks to be copied by main loop.  */
 
-  alg = decide_alg (count, expected_size, true, &dynamic_check);
+  align_unknown = get_mem_align_offset (dst, BITS_PER_UNIT) < 0;
+  alg = decide_alg (count, expected_size, true, &dynamic_check, align_unknown);
   desired_align = decide_alignment (align, alg, expected_size);
+  unroll_factor = 1;
+  move_mode = Pmode;
 
   if (!TARGET_ALIGN_STRINGOPS)
     align = desired_align;
@@ -22180,11 +22682,28 @@ ix86_expand_setmem (rtx dst, rtx count_e
       gcc_unreachable ();
     case loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode);
+      move_mode = Pmode;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * 4;
+      move_mode = Pmode;
+      unroll_factor = 1;
+      /* Select maximal available 1,2 or 4 unroll factor.  */
+      while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
+	     && unroll_factor < 4)
+	unroll_factor *= 2;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
+      break;
+    case sse_loop:
+      need_zero_guard = true;
+      move_mode = TARGET_64BIT ? V2DImode : V4SImode;
+      unroll_factor = 1;
+      /* Select maximal available 1,2 or 4 unroll factor.  */
+      while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
+	     && unroll_factor < 4)
+	unroll_factor *= 2;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
       size_needed = 8;
@@ -22229,8 +22748,10 @@ ix86_expand_setmem (rtx dst, rtx count_e
      main loop and epilogue (ie one load of the big constant in the
      front of all code.  */
   if (CONST_INT_P (val_exp))
-    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
-						   desired_align, align);
+    gpr_promoted_val = promote_duplicated_reg_to_size (val_exp,
+						   GET_MODE_SIZE (Pmode),
+						   GET_MODE_SIZE (Pmode),
+						   align);
   /* Ensure that alignment prologue won't copy past end of block.  */
   if (size_needed > 1 || (desired_align > 1 && desired_align > align))
     {
@@ -22239,12 +22760,6 @@ ix86_expand_setmem (rtx dst, rtx count_e
 	 Make sure it is power of 2.  */
       epilogue_size_needed = smallest_pow2_greater_than (epilogue_size_needed);
 
-      /* To improve performance of small blocks, we jump around the VAL
-	 promoting mode.  This mean that if the promoted VAL is not constant,
-	 we might not use it in the epilogue and have to use byte
-	 loop variant.  */
-      if (epilogue_size_needed > 2 && !promoted_val)
-        force_loopy_epilogue = true;
       if (count)
 	{
 	  if (count < (unsigned HOST_WIDE_INT)epilogue_size_needed)
@@ -22259,6 +22774,12 @@ ix86_expand_setmem (rtx dst, rtx count_e
 	}
       else
 	{
+	  /* SSE and unrolled_loop algs re-use iteration counter in the epilogue.  */
+	  if (alg == sse_loop || alg == unrolled_loop)
+	    {
+	      loop_iter = gen_reg_rtx (counter_mode (count_exp));
+              emit_move_insn (loop_iter, const0_rtx);
+	    }
 	  label = gen_label_rtx ();
 	  emit_cmp_and_jump_insns (count_exp,
 				   GEN_INT (epilogue_size_needed),
@@ -22284,9 +22805,11 @@ ix86_expand_setmem (rtx dst, rtx count_e
   /* Step 2: Alignment prologue.  */
 
   /* Do the expensive promotion once we branched off the small blocks.  */
-  if (!promoted_val)
-    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
-						   desired_align, align);
+  if (!gpr_promoted_val)
+    gpr_promoted_val = promote_duplicated_reg_to_size (val_exp,
+						   GET_MODE_SIZE (Pmode),
+						   GET_MODE_SIZE (Pmode),
+						   align);
   gcc_assert (desired_align >= 1 && align >= 1);
 
   if (desired_align > align)
@@ -22298,17 +22821,20 @@ ix86_expand_setmem (rtx dst, rtx count_e
 	     the pain to maintain it for the first move, so throw away
 	     the info early.  */
 	  dst = change_address (dst, BLKmode, destreg);
-	  expand_setmem_prologue (dst, destreg, promoted_val, count_exp, align,
+	  expand_setmem_prologue (dst, destreg, gpr_promoted_val, count_exp, align,
 				  desired_align);
+	  set_mem_align (dst, desired_align*BITS_PER_UNIT);
 	}
       else
 	{
 	  /* If we know how many bytes need to be stored before dst is
 	     sufficiently aligned, maintain aliasing info accurately.  */
-	  dst = expand_constant_setmem_prologue (dst, destreg, promoted_val,
+	  dst = expand_constant_setmem_prologue (dst, destreg, gpr_promoted_val,
 						 desired_align, align_bytes);
 	  count_exp = plus_constant (count_exp, -align_bytes);
 	  count -= align_bytes;
+	  if (count < (unsigned HOST_WIDE_INT) size_needed)
+	    goto epilogue;
 	}
       if (need_zero_guard
 	  && (count < (unsigned HOST_WIDE_INT) size_needed
@@ -22336,7 +22862,7 @@ ix86_expand_setmem (rtx dst, rtx count_e
       emit_label (label);
       LABEL_NUSES (label) = 1;
       label = NULL;
-      promoted_val = val_exp;
+      gpr_promoted_val = val_exp;
       epilogue_size_needed = 1;
     }
   else if (label == NULL_RTX)
@@ -22350,27 +22876,40 @@ ix86_expand_setmem (rtx dst, rtx count_e
     case no_stringop:
       gcc_unreachable ();
     case loop_1_byte:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
+      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, val_exp,
 				     count_exp, QImode, 1, expected_size);
       break;
     case loop:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
+      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, gpr_promoted_val,
 				     count_exp, Pmode, 1, expected_size);
       break;
     case unrolled_loop:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
-				     count_exp, Pmode, 4, expected_size);
+      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+				     NULL, gpr_promoted_val, count_exp,
+				     loop_iter, move_mode, unroll_factor,
+				     expected_size, false);
+      break;
+    case sse_loop:
+      vec_promoted_val =
+	promote_duplicated_reg_to_size (gpr_promoted_val,
+					GET_MODE_SIZE (move_mode),
+					desired_align, align);
+      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+				     NULL, vec_promoted_val, count_exp,
+				     loop_iter, move_mode, unroll_factor,
+				     expected_size, false);
       break;
     case rep_prefix_8_byte:
-      expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
+      gcc_assert (TARGET_64BIT);
+      expand_setmem_via_rep_stos (dst, destreg, gpr_promoted_val, count_exp,
 				  DImode, val_exp);
       break;
     case rep_prefix_4_byte:
-      expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
+      expand_setmem_via_rep_stos (dst, destreg, gpr_promoted_val, count_exp,
 				  SImode, val_exp);
       break;
     case rep_prefix_1_byte:
-      expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
+      expand_setmem_via_rep_stos (dst, destreg, gpr_promoted_val, count_exp,
 				  QImode, val_exp);
       break;
     }
@@ -22401,17 +22940,33 @@ ix86_expand_setmem (rtx dst, rtx count_e
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
+      /* We cannot rely on the fact that the promoted value is known.  */
+      vec_promoted_val = 0;
     }
  epilogue:
-  if (count_exp != const0_rtx && epilogue_size_needed > 1)
+  if (alg == sse_loop || alg == unrolled_loop)
     {
-      if (force_loopy_epilogue)
-	expand_setmem_epilogue_via_loop (dst, destreg, val_exp, count_exp,
-					 epilogue_size_needed);
-      else
-	expand_setmem_epilogue (dst, destreg, promoted_val, count_exp,
-				epilogue_size_needed);
+      rtx tmp;
+      if (align_unknown && unroll_factor > 1)
+	{
+	  /* Reduce the epilogue's size by creating a non-unrolled loop.  If we don't
+	     do this, we can get a very big epilogue - when the alignment is statically
+	     unknown, the epilogue goes byte by byte, which may be very slow.  */
+	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+	      NULL, vec_promoted_val, count_exp,
+	      loop_iter, move_mode, 1,
+	      expected_size, false);
+	  dst = change_address (dst, BLKmode, destreg);
+	  epilogue_size_needed = GET_MODE_SIZE (move_mode);
+	}
+      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != destreg)
+	emit_move_insn (destreg, tmp);
     }
+  if (count_exp != const0_rtx && epilogue_size_needed > 1)
+    expand_setmem_epilogue (dst, destreg, vec_promoted_val, gpr_promoted_val,
+			    val_exp, count_exp, epilogue_size_needed);
   if (jump_around_label)
     emit_label (jump_around_label);
   return true;

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-11-07 15:52                                         ` Jan Hubicka
@ 2011-11-07 16:24                                           ` Michael Zolotukhin
  2011-11-07 16:59                                             ` Jan Hubicka
  0 siblings, 1 reply; 52+ messages in thread
From: Michael Zolotukhin @ 2011-11-07 16:24 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Richard Henderson, Jakub Jelinek, Jack Howarth, gcc-patches,
	Richard Guenther, H.J. Lu, izamyatin, areg.melikadamyan

Hi, Jan!
I was just preparing my version of the patch, but it seems a bit late
now. Please see my comments on this and your previous letter below.

By the way, would it be possible to commit the other part of the patch
(the middle-end part) - probably also in small pieces - and some other
tuning after stage 1 closes?


> The patches disabling CSE and forwprop on constants are apparently papering over
> the problem that subregs of vector registers used in the epilogue make IRA think
> that it can't put the value into an SSE register (resulting in NO_REGS class), making
> reload output a load of 0 in the internal loop.

The problem here isn't about subregs - there is simply no way to emit
a store whose destination is a 128-bit memory operand and whose source
is a 128-bit immediate. We would somehow have to discover that we
previously initialized a vector register for use in this particular
store - either IRA should traverse the code looking for such an
initialization (i.e. IRA would have to undo FWP's work), or we just
shouldn't let such situations arise by disabling FWP for 128-bit
immediates. I think the second option is much easier, both to
implement and to understand.
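
To make this concrete, here is a rough sketch in C with SSE2
intrinsics (illustration only, not the patch's RTL; the function name
is made up): the 128-bit value has to sit in a register set up once
before the loop, precisely because no x86 store takes a 128-bit
immediate.

#include <emmintrin.h>
#include <stddef.h>

static void
set16_sketch (char *dst, size_t n)
{
  /* Hoisted once: the analogue of promoting the value before the loop.  */
  __m128i v = _mm_set1_epi8 (0);

  for (size_t i = 0; i + 16 <= n; i += 16)
    /* If forwprop replaced the register by the constant here, the zero
       would have to be re-materialized inside the loop.  */
    _mm_storeu_si128 ((__m128i *) (dst + i), v);
}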


> I also plugged some code paths - the pain here is that the stringops have many
> variants - different algorithms, different alignment, constant/variable
> counts. These increase the testing matrix, and some of the code paths were wrong
> with the new SSE code.

Yep, I also saw such failures, thanks for the fixes. Though I see
another problem here: the main reason for these failures is that when
the size is small we can skip the main loop and thus reach the
epilogue with an uninitialized loop iterator and/or promoted value. To
make the algorithm absolutely correct, we should either perform the
needed initializations at the very beginning (before the zero test) or
use a byte loop in the epilogue. The second way could greatly hurt
performance, so I think we should just initialize everything before
the main loop, on the assumption that the size is big enough for it to
be used there.
Moreover, this algorithm wasn't initially intended for small sizes -
memcpy/memset for small sizes should be expanded earlier, in
move_by_pieces or set_by_pieces (that was in the middle-end part of
the patch). So the assumption about the size should hold.
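
For reference, a rough C sketch of the ordering I mean (hypothetical
names, not the expander's actual output): both the iterator and the
promoted value are defined before the size guard, so the epilogue is
correct even when the main loop is skipped.

#include <emmintrin.h>
#include <stddef.h>

static void
setmem_sketch (char *dst, int val, size_t count)
{
  size_t iter = 0;                               /* defined before any branch */
  __m128i promoted = _mm_set1_epi8 ((char) val); /* likewise the promoted value */

  if (count >= 16)                               /* zero/size guard */
    for (; iter + 16 <= count; iter += 16)       /* main (possibly unrolled) loop */
      _mm_storeu_si128 ((__m128i *) (dst + iter), promoted);

  for (; iter < count; iter++)                   /* epilogue reuses ITER */
    dst[iter] = (char) val;
}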


> We still may want to produce SSE moves for 64bit
> operations in 32bit codegen, but this is an independent problem, plus the patch as
> it is seems to produce a slight regression on crafty.

Actually, such 8-byte moves aren't critical for this part of the patch
- here they can only be used in prologues/epilogues and don't affect
performance much (assuming the size isn't very small, so a small
performance loss in the prologue/epilogue doesn't hurt overall
performance).
But for memcpy/memset of small sizes, which are expanded in the
middle-end part, this could be quite crucial. For example, for copying
24 bytes with unknown alignment on Atom, three 8-byte SSE moves could
be much faster than six 4-byte moves via a GPR. So it's definitely
good to have the opportunity to generate such moves.
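
As a purely illustrative sketch (made-up function names, C intrinsics
rather than the expander's output), the two shapes being compared for
a 24-byte copy are:

#include <emmintrin.h>

static void
copy24_sse8 (char *dst, const char *src)   /* three 8-byte movq-style moves */
{
  for (int i = 0; i < 24; i += 8)
    _mm_storel_epi64 ((__m128i *) (dst + i),
                      _mm_loadl_epi64 ((const __m128i *) (src + i)));
}

static void
copy24_gpr4 (char *dst, const char *src)   /* six 4-byte moves via a GPR */
{
  for (int i = 0; i < 24; i += 4)
    {
      unsigned int t;
      __builtin_memcpy (&t, src + i, 4);   /* 4-byte load into a GPR */
      __builtin_memcpy (dst + i, &t, 4);   /* 4-byte store */
    }
}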


> I think it would be better to use V8QI mode
> for the promoted value (since it is what it really is) avoiding the need for changes in
> expand_move and the loadq pattern.

Actually, we rarely operate in byte mode - usually we move/store at
least in Pmode (when we use a GPR). So V4SI or V2DI also looks
reasonable to me here. Also, when promoting a value from a GPR to an
SSE register, we surely need the value in DImode/SImode, not in QImode.
We could do everything in QI/V16QI/V8QI modes, but that could lead to
the generation of converts in many places (like in the promotion to vector).
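
To show what I mean by promoting through SImode (a sketch under my own
assumptions, not the patch's RTL): the byte is duplicated within a
GPR-sized value first and only then broadcast into a V4SI register, so
no QImode vector operations are needed.

#include <emmintrin.h>

static __m128i
promote_byte_to_v4si (unsigned char c)
{
  unsigned int word = c * 0x01010101u;   /* duplicate the byte within SImode */
  return _mm_set1_epi32 ((int) word);    /* then broadcast SImode into V4SI */
}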

> I noticed that the core model still uses generic costs, which is quite bogus.
Yes, I agree. It's better to have separate cost models for them.


> I also reverted changes to generic cost models since those are results of discussion
> in between AMD and Intel and any changes here needs to be discussed on both
> sides.

Sure, I totally agree.


On 7 November 2011 19:41, Jan Hubicka <hubicka@ucw.cz> wrote:
>
> Hi,
> this is the variant of the patch I hope to commit today after some further testing.
> I removed most of the code changing interfaces outside the i386 backend and also
> reverted the changes to the generic cost models, since those are the result of a
> discussion between AMD and Intel and any changes there need to be discussed on both
> sides.
>
> I noticed that the core model still uses generic costs, which is quite bogus. It also
> seems bogus to have two costs for the 32-bit and 64-bit modes of Core - the only
> reason why there are 32-bit and 64-bit generic models is that 32-bit generic
> still takes into account 32-bit chips (Centrino and Athlon).  I think we may drop
> those and remove optimizations targeted to help those chips.
>
> Bootstrapped/regtested x86_64-linux, intend to commit it today after some
> further testing.
>
> Honza
>
> 2011-11-03  Zolotukhin Michael  <michael.v.zolotukhin@gmail.com>
>            Jan Hubicka  <jh@suse.cz>
>
>        * config/i386/i386.h (processor_costs): Add second dimension to
>        stringop_algs array.
>        * config/i386/i386.c (cost models): Initialize second dimension of
>        stringop_algs arrays.
>        (core_cost): New costs based on generic64 costs with updated stringop
>        values.
>        (promote_duplicated_reg): Add support for vector modes, add
>        declaration.
>        (promote_duplicated_reg_to_size): Likewise.
>        (processor_target): Set core costs for core variants.
>        (expand_set_or_movmem_via_loop_with_iter): New function.
>        (expand_set_or_movmem_via_loop): Enable reuse of the same iters in
>        different loops, produced by this function.
>        (emit_strset): New function.
>        (expand_movmem_epilogue): Add epilogue generation for bigger sizes,
>        use SSE-moves where possible.
>        (expand_setmem_epilogue): Likewise.
>        (expand_movmem_prologue): Likewise for prologue.
>        (expand_setmem_prologue): Likewise.
>        (expand_constant_movmem_prologue): Likewise.
>        (expand_constant_setmem_prologue): Likewise.
>        (decide_alg): Add new argument align_unknown.  Fix algorithm of
>        strategy selection if TARGET_INLINE_ALL_STRINGOPS is set; Skip sse_loop
>        (decide_alignment): Update desired alignment according to chosen move
>        mode.
>        (ix86_expand_movmem): Change unrolled_loop strategy to use SSE-moves.
>        (ix86_expand_setmem): Likewise.
>        (ix86_slow_unaligned_access): Implementation of new hook
>        slow_unaligned_access.
>        * config/i386/i386.md (strset): Enable half-SSE moves.
>        * config/i386/sse.md (vec_dupv4si): Add expand for vec_dupv4si.
>        (vec_dupv2di): Add expand for vec_dupv2di.
>
> Index: i386.h
> ===================================================================
> --- i386.h      (revision 181033)
> +++ i386.h      (working copy)
> @@ -159,8 +159,12 @@ struct processor_costs {
>   const int fchs;              /* cost of FCHS instruction.  */
>   const int fsqrt;             /* cost of FSQRT instruction.  */
>                                /* Specify what algorithm
> -                                  to use for stringops on unknown size.  */
> -  struct stringop_algs memcpy[2], memset[2];
> +                                  to use for stringops on unknown size.
> +                                  First index is used to specify whether
> +                                  alignment is known or not.
> +                                  Second - to specify whether 32 or 64 bits
> +                                  are used.  */
> +  struct stringop_algs memcpy[2][2], memset[2][2];
>   const int scalar_stmt_cost;   /* Cost of any scalar operation, excluding
>                                   load and store.  */
>   const int scalar_load_cost;   /* Cost of scalar load.  */
> Index: i386.md
> ===================================================================
> --- i386.md     (revision 181033)
> +++ i386.md     (working copy)
> @@ -15937,6 +15937,17 @@
>              (clobber (reg:CC FLAGS_REG))])]
>   ""
>  {
> +  rtx vec_reg;
> +  enum machine_mode mode = GET_MODE (operands[2]);
> +  if (vector_extensions_used_for_mode (mode)
> +      && CONSTANT_P (operands[2]))
> +    {
> +      if (mode == DImode)
> +       mode = TARGET_64BIT ? V2DImode : V4SImode;
> +      vec_reg = gen_reg_rtx (mode);
> +      emit_move_insn (vec_reg, operands[2]);
> +      operands[2] = vec_reg;
> +    }
>   if (GET_MODE (operands[1]) != GET_MODE (operands[2]))
>     operands[1] = adjust_address_nv (operands[1], GET_MODE (operands[2]), 0);
>
> Index: i386-opts.h
> ===================================================================
> --- i386-opts.h (revision 181033)
> +++ i386-opts.h (working copy)
> @@ -37,7 +37,8 @@ enum stringop_alg
>    rep_prefix_8_byte,
>    loop_1_byte,
>    loop,
> -   unrolled_loop
> +   unrolled_loop,
> +   sse_loop
>  };
>
>  /* Available call abi.  */
> Index: sse.md
> ===================================================================
> --- sse.md      (revision 181033)
> +++ sse.md      (working copy)
> @@ -7509,6 +7509,16 @@
>    (set_attr "prefix" "maybe_vex,orig,vex,maybe_vex,orig,orig")
>    (set_attr "mode" "V2SF,TI,TI,TI,V4SF,V2SF")])
>
> +(define_expand "vec_dupv4si"
> +  [(set (match_operand:V4SI 0 "register_operand" "")
> +       (vec_duplicate:V4SI
> +         (match_operand:SI 1 "nonimmediate_operand" "")))]
> +  "TARGET_SSE"
> +{
> +  if (!TARGET_AVX)
> +    operands[1] = force_reg (V4SImode, operands[1]);
> +})
> +
>  (define_insn "*vec_dupv4si"
>   [(set (match_operand:V4SI 0 "register_operand"     "=x,x,x")
>        (vec_duplicate:V4SI
> @@ -7525,6 +7535,16 @@
>    (set_attr "prefix" "maybe_vex,vex,orig")
>    (set_attr "mode" "TI,V4SF,V4SF")])
>
> +(define_expand "vec_dupv2di"
> +  [(set (match_operand:V2DI 0 "register_operand" "")
> +       (vec_duplicate:V2DI
> +         (match_operand:DI 1 "nonimmediate_operand" "")))]
> +  "TARGET_SSE"
> +{
> +  if (!TARGET_AVX)
> +    operands[1] = force_reg (V2DImode, operands[1]);
> +})
> +
>  (define_insn "*vec_dupv2di"
>   [(set (match_operand:V2DI 0 "register_operand"     "=x,x,x,x")
>        (vec_duplicate:V2DI
> Index: i386.opt
> ===================================================================
> --- i386.opt    (revision 181033)
> +++ i386.opt    (working copy)
> @@ -324,6 +324,9 @@ Enum(stringop_alg) String(loop) Value(lo
>  EnumValue
>  Enum(stringop_alg) String(unrolled_loop) Value(unrolled_loop)
>
> +EnumValue
> +Enum(stringop_alg) String(sse_loop) Value(sse_loop)
> +
>  mtls-dialect=
>  Target RejectNegative Joined Var(ix86_tls_dialect) Enum(tls_dialect) Init(TLS_DIALECT_GNU)
>  Use given thread-local storage dialect
> Index: i386.c
> ===================================================================
> --- i386.c      (revision 181033)
> +++ i386.c      (working copy)
> @@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost =
>   COSTS_N_BYTES (2),                   /* cost of FABS instruction.  */
>   COSTS_N_BYTES (2),                   /* cost of FCHS instruction.  */
>   COSTS_N_BYTES (2),                   /* cost of FSQRT instruction.  */
> -  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
>    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> -  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
> +  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
>    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> +   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
>   1,                                   /* scalar_stmt_cost.  */
>   1,                                   /* scalar load_cost.  */
>   1,                                   /* scalar_store_cost.  */
> @@ -632,10 +636,14 @@ struct processor_costs i386_cost = {      /*
>   COSTS_N_INSNS (22),                  /* cost of FABS instruction.  */
>   COSTS_N_INSNS (24),                  /* cost of FCHS instruction.  */
>   COSTS_N_INSNS (122),                 /* cost of FSQRT instruction.  */
> -  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
>    DUMMY_STRINGOP_ALGS},
> -  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +   DUMMY_STRINGOP_ALGS}},
> +  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
>    DUMMY_STRINGOP_ALGS},
> +   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +   DUMMY_STRINGOP_ALGS}},
>   1,                                   /* scalar_stmt_cost.  */
>   1,                                   /* scalar load_cost.  */
>   1,                                   /* scalar_store_cost.  */
> @@ -704,10 +712,14 @@ struct processor_costs i486_cost = {      /*
>   COSTS_N_INSNS (3),                   /* cost of FABS instruction.  */
>   COSTS_N_INSNS (3),                   /* cost of FCHS instruction.  */
>   COSTS_N_INSNS (83),                  /* cost of FSQRT instruction.  */
> -  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
> +  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
>    DUMMY_STRINGOP_ALGS},
> -  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
> +   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
> +   DUMMY_STRINGOP_ALGS}},
> +  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
>    DUMMY_STRINGOP_ALGS},
> +   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
> +   DUMMY_STRINGOP_ALGS}},
>   1,                                   /* scalar_stmt_cost.  */
>   1,                                   /* scalar load_cost.  */
>   1,                                   /* scalar_store_cost.  */
> @@ -774,10 +786,14 @@ struct processor_costs pentium_cost = {
>   COSTS_N_INSNS (1),                   /* cost of FABS instruction.  */
>   COSTS_N_INSNS (1),                   /* cost of FCHS instruction.  */
>   COSTS_N_INSNS (70),                  /* cost of FSQRT instruction.  */
> -  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
> +  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
>    DUMMY_STRINGOP_ALGS},
> -  {{libcall, {{-1, rep_prefix_4_byte}}},
> +   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
> +   DUMMY_STRINGOP_ALGS}},
> +  {{{libcall, {{-1, rep_prefix_4_byte}}},
>    DUMMY_STRINGOP_ALGS},
> +   {{libcall, {{-1, rep_prefix_4_byte}}},
> +   DUMMY_STRINGOP_ALGS}},
>   1,                                   /* scalar_stmt_cost.  */
>   1,                                   /* scalar load_cost.  */
>   1,                                   /* scalar_store_cost.  */
> @@ -849,12 +865,18 @@ struct processor_costs pentiumpro_cost =
>      noticeable win, for bigger blocks either rep movsl or rep movsb is
>      way to go.  Rep movsb has apparently more expensive startup time in CPU,
>      but after 4K the difference is down in the noise.  */
> -  {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
> +  {{{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
>                        {8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
>    DUMMY_STRINGOP_ALGS},
> -  {{rep_prefix_4_byte, {{1024, unrolled_loop},
> -                       {8192, rep_prefix_4_byte}, {-1, libcall}}},
> +   {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
> +                       {8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
> +   DUMMY_STRINGOP_ALGS}},
> +  {{{rep_prefix_4_byte, {{1024, unrolled_loop},
> +                       {8192, rep_prefix_4_byte}, {-1, libcall}}},
>    DUMMY_STRINGOP_ALGS},
> +   {{rep_prefix_4_byte, {{1024, unrolled_loop},
> +                       {8192, rep_prefix_4_byte}, {-1, libcall}}},
> +   DUMMY_STRINGOP_ALGS}},
>   1,                                   /* scalar_stmt_cost.  */
>   1,                                   /* scalar load_cost.  */
>   1,                                   /* scalar_store_cost.  */
> @@ -922,10 +944,14 @@ struct processor_costs geode_cost = {
>   COSTS_N_INSNS (1),                   /* cost of FABS instruction.  */
>   COSTS_N_INSNS (1),                   /* cost of FCHS instruction.  */
>   COSTS_N_INSNS (54),                  /* cost of FSQRT instruction.  */
> -  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
> +  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
>    DUMMY_STRINGOP_ALGS},
> -  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
> +   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
> +   DUMMY_STRINGOP_ALGS}},
> +  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
>    DUMMY_STRINGOP_ALGS},
> +   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
> +   DUMMY_STRINGOP_ALGS}},
>   1,                                   /* scalar_stmt_cost.  */
>   1,                                   /* scalar load_cost.  */
>   1,                                   /* scalar_store_cost.  */
> @@ -995,10 +1021,14 @@ struct processor_costs k6_cost = {
>   COSTS_N_INSNS (2),                   /* cost of FABS instruction.  */
>   COSTS_N_INSNS (2),                   /* cost of FCHS instruction.  */
>   COSTS_N_INSNS (56),                  /* cost of FSQRT instruction.  */
> -  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
> +  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
>    DUMMY_STRINGOP_ALGS},
> -  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
> +   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
> +   DUMMY_STRINGOP_ALGS}},
> +  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
>    DUMMY_STRINGOP_ALGS},
> +   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
> +   DUMMY_STRINGOP_ALGS}},
>   1,                                   /* scalar_stmt_cost.  */
>   1,                                   /* scalar load_cost.  */
>   1,                                   /* scalar_store_cost.  */
> @@ -1068,10 +1098,14 @@ struct processor_costs athlon_cost = {
>   /* For some reason, Athlon deals better with REP prefix (relative to loops)
>      compared to K8. Alignment becomes important after 8 bytes for memcpy and
>      128 bytes for memset.  */
> -  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
> +  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
>    DUMMY_STRINGOP_ALGS},
> -  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
> +   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
> +   DUMMY_STRINGOP_ALGS}},
> +  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
>    DUMMY_STRINGOP_ALGS},
> +   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
> +   DUMMY_STRINGOP_ALGS}},
>   1,                                   /* scalar_stmt_cost.  */
>   1,                                   /* scalar load_cost.  */
>   1,                                   /* scalar_store_cost.  */
> @@ -1146,11 +1180,16 @@ struct processor_costs k8_cost = {
>   /* K8 has optimized REP instruction for medium sized blocks, but for very
>      small blocks it is better to use loop. For large blocks, libcall can
>      do nontemporary accesses and beat inline considerably.  */
> -  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
> +  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
>    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> -  {{libcall, {{8, loop}, {24, unrolled_loop},
> +   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
> +   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
> +  {{{libcall, {{8, loop}, {24, unrolled_loop},
>              {2048, rep_prefix_4_byte}, {-1, libcall}}},
>    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> +   {{libcall, {{8, loop}, {24, unrolled_loop},
> +             {2048, rep_prefix_4_byte}, {-1, libcall}}},
> +   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
>   4,                                   /* scalar_stmt_cost.  */
>   2,                                   /* scalar load_cost.  */
>   2,                                   /* scalar_store_cost.  */
> @@ -1233,11 +1272,16 @@ struct processor_costs amdfam10_cost = {
>   /* AMDFAM10 has optimized REP instruction for medium sized blocks, but for
>      very small blocks it is better to use loop. For large blocks, libcall can
>      do nontemporary accesses and beat inline considerably.  */
> -  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
> -   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> -  {{libcall, {{8, loop}, {24, unrolled_loop},
> +  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
> +   {libcall, {{16, loop}, {512, rep_prefix_8_byte}, {-1, libcall}}}},
> +   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
> +   {libcall, {{16, loop}, {512, rep_prefix_8_byte}, {-1, libcall}}}}},
> +  {{{libcall, {{8, loop}, {24, unrolled_loop},
>              {2048, rep_prefix_4_byte}, {-1, libcall}}},
>    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> +   {{libcall, {{8, loop}, {24, unrolled_loop},
> +             {2048, rep_prefix_4_byte}, {-1, libcall}}},
> +   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
>   4,                                   /* scalar_stmt_cost.  */
>   2,                                   /* scalar load_cost.  */
>   2,                                   /* scalar_store_cost.  */
> @@ -1320,11 +1364,16 @@ struct processor_costs bdver1_cost = {
>   /*  BDVER1 has optimized REP instruction for medium sized blocks, but for
>       very small blocks it is better to use loop. For large blocks, libcall
>       can do nontemporary accesses and beat inline considerably.  */
> -  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
> +  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
>    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> -  {{libcall, {{8, loop}, {24, unrolled_loop},
> +   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
> +   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
> +  {{{libcall, {{8, loop}, {24, unrolled_loop},
>              {2048, rep_prefix_4_byte}, {-1, libcall}}},
>    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> +   {{libcall, {{8, loop}, {24, unrolled_loop},
> +             {2048, rep_prefix_4_byte}, {-1, libcall}}},
> +   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
>   6,                                   /* scalar_stmt_cost.  */
>   4,                                   /* scalar load_cost.  */
>   4,                                   /* scalar_store_cost.  */
> @@ -1407,11 +1456,16 @@ struct processor_costs bdver2_cost = {
>   /*  BDVER2 has optimized REP instruction for medium sized blocks, but for
>       very small blocks it is better to use loop. For large blocks, libcall
>       can do nontemporary accesses and beat inline considerably.  */
> -  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
> +  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
>    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> -  {{libcall, {{8, loop}, {24, unrolled_loop},
> +  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
> +   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
> +  {{{libcall, {{8, loop}, {24, unrolled_loop},
>              {2048, rep_prefix_4_byte}, {-1, libcall}}},
>    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> +  {{libcall, {{8, loop}, {24, unrolled_loop},
> +             {2048, rep_prefix_4_byte}, {-1, libcall}}},
> +   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
>   6,                                   /* scalar_stmt_cost.  */
>   4,                                   /* scalar load_cost.  */
>   4,                                   /* scalar_store_cost.  */
> @@ -1489,11 +1543,16 @@ struct processor_costs btver1_cost = {
>   /* BTVER1 has optimized REP instruction for medium sized blocks, but for
>      very small blocks it is better to use loop. For large blocks, libcall can
>      do nontemporary accesses and beat inline considerably.  */
> -  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
> +  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
>    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> -  {{libcall, {{8, loop}, {24, unrolled_loop},
> +   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
> +   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
> +  {{{libcall, {{8, loop}, {24, unrolled_loop},
>              {2048, rep_prefix_4_byte}, {-1, libcall}}},
>    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> +   {{libcall, {{8, loop}, {24, unrolled_loop},
> +             {2048, rep_prefix_4_byte}, {-1, libcall}}},
> +   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
>   4,                                   /* scalar_stmt_cost.  */
>   2,                                   /* scalar load_cost.  */
>   2,                                   /* scalar_store_cost.  */
> @@ -1560,11 +1619,18 @@ struct processor_costs pentium4_cost = {
>   COSTS_N_INSNS (2),                   /* cost of FABS instruction.  */
>   COSTS_N_INSNS (2),                   /* cost of FCHS instruction.  */
>   COSTS_N_INSNS (43),                  /* cost of FSQRT instruction.  */
> -  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
> +
> +  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
>    DUMMY_STRINGOP_ALGS},
> -  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
> +   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
> +   DUMMY_STRINGOP_ALGS}},
> +
> +  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
>    {-1, libcall}}},
>    DUMMY_STRINGOP_ALGS},
> +   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
> +   {-1, libcall}}},
> +   DUMMY_STRINGOP_ALGS}},
>   1,                                   /* scalar_stmt_cost.  */
>   1,                                   /* scalar load_cost.  */
>   1,                                   /* scalar_store_cost.  */
> @@ -1631,13 +1697,22 @@ struct processor_costs nocona_cost = {
>   COSTS_N_INSNS (3),                   /* cost of FABS instruction.  */
>   COSTS_N_INSNS (3),                   /* cost of FCHS instruction.  */
>   COSTS_N_INSNS (44),                  /* cost of FSQRT instruction.  */
> -  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
> +
> +  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
>    {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
>              {100000, unrolled_loop}, {-1, libcall}}}},
> -  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
> +   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
> +   {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
> +             {100000, unrolled_loop}, {-1, libcall}}}}},
> +
> +  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
>    {-1, libcall}}},
>    {libcall, {{24, loop}, {64, unrolled_loop},
>              {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> +   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
> +   {-1, libcall}}},
> +   {libcall, {{24, loop}, {64, unrolled_loop},
> +             {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
>   1,                                   /* scalar_stmt_cost.  */
>   1,                                   /* scalar load_cost.  */
>   1,                                   /* scalar_store_cost.  */
> @@ -1704,13 +1779,108 @@ struct processor_costs atom_cost = {
>   COSTS_N_INSNS (8),                   /* cost of FABS instruction.  */
>   COSTS_N_INSNS (8),                   /* cost of FCHS instruction.  */
>   COSTS_N_INSNS (40),                  /* cost of FSQRT instruction.  */
> -  {{libcall, {{11, loop}, {-1, rep_prefix_4_byte}}},
> -   {libcall, {{32, loop}, {64, rep_prefix_4_byte},
> -         {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> -  {{libcall, {{8, loop}, {15, unrolled_loop},
> -         {2048, rep_prefix_4_byte}, {-1, libcall}}},
> -   {libcall, {{24, loop}, {32, unrolled_loop},
> -         {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> +
> +  /* stringop_algs for memcpy.
> +     SSE loops work best on Atom, but fall back to the non-SSE unrolled loop variant
> +     if that fails.  */
> +  {{{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
> +    {libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}},
> +   {{libcall, {{-1, libcall}}},                               /* Unknown alignment.  */
> +    {libcall, {{2048, sse_loop}, {2048, unrolled_loop},
> +              {-1, libcall}}}}},
> +
> +  /* stringop_algs for memset.  */
> +  {{{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
> +    {libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}},
> +   {{libcall, {{1024, sse_loop}, {1024, unrolled_loop},         /* Unknown alignment.  */
> +              {-1, libcall}}},
> +    {libcall, {{2048, sse_loop}, {2048, unrolled_loop},
> +              {-1, libcall}}}}},
> +  1,                                   /* scalar_stmt_cost.  */
> +  1,                                   /* scalar load_cost.  */
> +  1,                                   /* scalar_store_cost.  */
> +  1,                                   /* vec_stmt_cost.  */
> +  1,                                   /* vec_to_scalar_cost.  */
> +  1,                                   /* scalar_to_vec_cost.  */
> +  1,                                   /* vec_align_load_cost.  */
> +  2,                                   /* vec_unalign_load_cost.  */
> +  1,                                   /* vec_store_cost.  */
> +  3,                                   /* cond_taken_branch_cost.  */
> +  1,                                   /* cond_not_taken_branch_cost.  */
> +};
> +
> +/* Core should produce code tuned for core variants.  */
> +static const
> +struct processor_costs core_cost = {
> +  COSTS_N_INSNS (1),                   /* cost of an add instruction */
> +  /* On all chips taken into consideration lea is 2 cycles and more.  With
> +     this cost however our current implementation of synth_mult results in
> +     use of unnecessary temporary registers causing regression on several
> +     SPECfp benchmarks.  */
> +  COSTS_N_INSNS (1) + 1,               /* cost of a lea instruction */
> +  COSTS_N_INSNS (1),                   /* variable shift costs */
> +  COSTS_N_INSNS (1),                   /* constant shift costs */
> +  {COSTS_N_INSNS (3),                  /* cost of starting multiply for QI */
> +   COSTS_N_INSNS (4),                  /*                               HI */
> +   COSTS_N_INSNS (3),                  /*                               SI */
> +   COSTS_N_INSNS (4),                  /*                               DI */
> +   COSTS_N_INSNS (2)},                 /*                            other */
> +  0,                                   /* cost of multiply per each bit set */
> +  {COSTS_N_INSNS (18),                 /* cost of a divide/mod for QI */
> +   COSTS_N_INSNS (26),                 /*                          HI */
> +   COSTS_N_INSNS (42),                 /*                          SI */
> +   COSTS_N_INSNS (74),                 /*                          DI */
> +   COSTS_N_INSNS (74)},                        /*                          other */
> +  COSTS_N_INSNS (1),                   /* cost of movsx */
> +  COSTS_N_INSNS (1),                   /* cost of movzx */
> +  8,                                   /* "large" insn */
> +  17,                                  /* MOVE_RATIO */
> +  4,                                /* cost for loading QImode using movzbl */
> +  {4, 4, 4},                           /* cost of loading integer registers
> +                                          in QImode, HImode and SImode.
> +                                          Relative to reg-reg move (2).  */
> +  {4, 4, 4},                           /* cost of storing integer registers */
> +  4,                                   /* cost of reg,reg fld/fst */
> +  {12, 12, 12},                                /* cost of loading fp registers
> +                                          in SFmode, DFmode and XFmode */
> +  {6, 6, 8},                           /* cost of storing fp registers
> +                                          in SFmode, DFmode and XFmode */
> +  2,                                   /* cost of moving MMX register */
> +  {8, 8},                              /* cost of loading MMX registers
> +                                          in SImode and DImode */
> +  {8, 8},                              /* cost of storing MMX registers
> +                                          in SImode and DImode */
> +  2,                                   /* cost of moving SSE register */
> +  {8, 8, 8},                           /* cost of loading SSE registers
> +                                          in SImode, DImode and TImode */
> +  {8, 8, 8},                           /* cost of storing SSE registers
> +                                          in SImode, DImode and TImode */
> +  5,                                   /* MMX or SSE register to integer */
> +  32,                                  /* size of l1 cache.  */
> +  512,                                 /* size of l2 cache.  */
> +  64,                                  /* size of prefetch block */
> +  6,                                   /* number of parallel prefetches */
> +  /* Benchmarks shows large regressions on K8 sixtrack benchmark when this
> +     value is increased to perhaps more appropriate value of 5.  */
> +  3,                                   /* Branch cost */
> +  COSTS_N_INSNS (8),                   /* cost of FADD and FSUB insns.  */
> +  COSTS_N_INSNS (8),                   /* cost of FMUL instruction.  */
> +  COSTS_N_INSNS (20),                  /* cost of FDIV instruction.  */
> +  COSTS_N_INSNS (8),                   /* cost of FABS instruction.  */
> +  COSTS_N_INSNS (8),                   /* cost of FCHS instruction.  */
> +  COSTS_N_INSNS (40),                  /* cost of FSQRT instruction.  */
> +
> +  /* stringop_algs for memcpy.  */
> +  {{{libcall, {{16, loop}, {24, unrolled_loop}, {1024, rep_prefix_4_byte}, {-1, libcall}}}, /* Known alignment.  */
> +    {libcall, {{16, loop}, {24, unrolled_loop}, {1024, rep_prefix_8_byte}, {-1, libcall}}}},
> +   {{libcall, {{16, loop}, {24, unrolled_loop}, {1024, rep_prefix_4_byte}, {-1, libcall}}}, /* Unknown alignment.  */
> +    {libcall, {{16, loop}, {24, unrolled_loop}, {1024, rep_prefix_8_byte}, {-1, libcall}}}}},
> +
> +  /* stringop_algs for memset.  */
> +  {{{libcall, {{256, rep_prefix_4_byte}}}, /* Known alignment.  */
> +    {libcall, {{256, rep_prefix_8_byte}}}},
> +   {{libcall, {{256, rep_prefix_4_byte}}}, /* Unknown alignment.  */
> +    {libcall, {{256, rep_prefix_8_byte}}}}},
>   1,                                   /* scalar_stmt_cost.  */
>   1,                                   /* scalar load_cost.  */
>   1,                                   /* scalar_store_cost.  */
> @@ -1724,7 +1894,7 @@ struct processor_costs atom_cost = {
>   1,                                   /* cond_not_taken_branch_cost.  */
>  };
>
> -/* Generic64 should produce code tuned for Nocona and K8.  */
> +/* Generic64 should produce code tuned for Nocona, Core, K8, Amdfam10 and Bulldozer.  */
>  static const
>  struct processor_costs generic64_cost = {
>   COSTS_N_INSNS (1),                   /* cost of an add instruction */
> @@ -1784,10 +1954,16 @@ struct processor_costs generic64_cost =
>   COSTS_N_INSNS (8),                   /* cost of FABS instruction.  */
>   COSTS_N_INSNS (8),                   /* cost of FCHS instruction.  */
>   COSTS_N_INSNS (40),                  /* cost of FSQRT instruction.  */
> -  {DUMMY_STRINGOP_ALGS,
> -   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> -  {DUMMY_STRINGOP_ALGS,
> -   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> +
> +  {{DUMMY_STRINGOP_ALGS,
> +    {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> +   {DUMMY_STRINGOP_ALGS,
> +    {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
> +
> +  {{DUMMY_STRINGOP_ALGS,
> +    {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> +   {DUMMY_STRINGOP_ALGS,
> +    {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
>   1,                                   /* scalar_stmt_cost.  */
>   1,                                   /* scalar load_cost.  */
>   1,                                   /* scalar_store_cost.  */
> @@ -1801,8 +1977,8 @@ struct processor_costs generic64_cost =
>   1,                                   /* cond_not_taken_branch_cost.  */
>  };
>
> -/* Generic32 should produce code tuned for PPro, Pentium4, Nocona,
> -   Athlon and K8.  */
> +/* Generic32 should produce code tuned for PPro, Pentium4, Nocona, Core,
> +   Athlon, K8, Amdfam10 and Bulldozer.  */
>  static const
>  struct processor_costs generic32_cost = {
>   COSTS_N_INSNS (1),                   /* cost of an add instruction */
> @@ -1856,10 +2032,16 @@ struct processor_costs generic32_cost =
>   COSTS_N_INSNS (8),                   /* cost of FABS instruction.  */
>   COSTS_N_INSNS (8),                   /* cost of FCHS instruction.  */
>   COSTS_N_INSNS (40),                  /* cost of FSQRT instruction.  */
> -  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
> +  /* stringop_algs for memcpy.  */
> +  {{{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
>    DUMMY_STRINGOP_ALGS},
> -  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
> +   {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
> +   DUMMY_STRINGOP_ALGS}},
> +  /* stringop_algs for memset.  */
> +  {{{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
>    DUMMY_STRINGOP_ALGS},
> +   {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
> +   DUMMY_STRINGOP_ALGS}},
>   1,                                   /* scalar_stmt_cost.  */
>   1,                                   /* scalar load_cost.  */
>   1,                                   /* scalar_store_cost.  */
> @@ -2536,6 +2718,8 @@ static void ix86_set_current_function (t
>  static unsigned int ix86_minimum_incoming_stack_boundary (bool);
>
>  static enum calling_abi ix86_function_abi (const_tree);
> +static rtx promote_duplicated_reg (enum machine_mode, rtx);
> +static rtx promote_duplicated_reg_to_size (rtx, int, int, int);
>
>
>  #ifndef SUBTARGET32_DEFAULT_CPU
> @@ -2582,13 +2766,13 @@ static const struct ptt processor_target
>   {&k8_cost, 16, 7, 16, 7, 16},
>   {&nocona_cost, 0, 0, 0, 0, 0},
>   /* Core 2 32-bit.  */
> -  {&generic32_cost, 16, 10, 16, 10, 16},
> +  {&core_cost, 16, 10, 16, 10, 16},
>   /* Core 2 64-bit.  */
> -  {&generic64_cost, 16, 10, 16, 10, 16},
> +  {&core_cost, 16, 10, 16, 10, 16},
>   /* Core i7 32-bit.  */
> -  {&generic32_cost, 16, 10, 16, 10, 16},
> +  {&core_cost, 16, 10, 16, 10, 16},
>   /* Core i7 64-bit.  */
> -  {&generic64_cost, 16, 10, 16, 10, 16},
> +  {&core_cost, 16, 10, 16, 10, 16},
>   {&generic32_cost, 16, 7, 16, 7, 16},
>   {&generic64_cost, 16, 10, 16, 10, 16},
>   {&amdfam10_cost, 32, 24, 32, 7, 32},
> @@ -20800,22 +20984,37 @@ counter_mode (rtx count_exp)
>   return SImode;
>  }
>
> -/* When SRCPTR is non-NULL, output simple loop to move memory
> +/* Helper function for expand_set_or_movmem_via_loop.
> +
> +   When SRCPTR is non-NULL, output simple loop to move memory
>    pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
>    overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
>    equivalent loop to set memory by VALUE (supposed to be in MODE).
>
>    The size is rounded down to whole number of chunk size moved at once.
> -   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
> +   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.
>
> +   If ITER isn't NULL, then it'll be used in the generated loop without
> +   initialization (that allows generating several consecutive loops using the
> +   same iterator).
> +   If CHANGE_PTRS is specified, DESTPTR and SRCPTR will be increased by the
> +   iterator value at the end of the function (as if they iterated in the loop).
> +   Otherwise, their values will stay unchanged.
> +
> +   If EXPECTED_SIZE isn't -1, then it's used to compute branch probabilities on
> +   the loop backedge.  When the expected size is unknown (it's -1), the
> +   probability is set to 80%.
>
> -static void
> -expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
> -                              rtx destptr, rtx srcptr, rtx value,
> -                              rtx count, enum machine_mode mode, int unroll,
> -                              int expected_size)
> +   The return value is the rtx of the iterator used in the loop - it can be
> +   reused in subsequent calls of this function.  */
> +static rtx
> +expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
> +                                        rtx destptr, rtx srcptr, rtx value,
> +                                        rtx count, rtx iter,
> +                                        enum machine_mode mode, int unroll,
> +                                        int expected_size, bool change_ptrs)
>  {
> -  rtx out_label, top_label, iter, tmp;
> +  rtx out_label, top_label, tmp;
>   enum machine_mode iter_mode = counter_mode (count);
>   rtx piece_size = GEN_INT (GET_MODE_SIZE (mode) * unroll);
>   rtx piece_size_mask = GEN_INT (~((GET_MODE_SIZE (mode) * unroll) - 1));
> @@ -20823,10 +21022,12 @@ expand_set_or_movmem_via_loop (rtx destm
>   rtx x_addr;
>   rtx y_addr;
>   int i;
> +  bool reuse_iter = (iter != NULL_RTX);
>
>   top_label = gen_label_rtx ();
>   out_label = gen_label_rtx ();
> -  iter = gen_reg_rtx (iter_mode);
> +  if (!reuse_iter)
> +    iter = gen_reg_rtx (iter_mode);
>
>   size = expand_simple_binop (iter_mode, AND, count, piece_size_mask,
>                              NULL, 1, OPTAB_DIRECT);
> @@ -20837,18 +21038,21 @@ expand_set_or_movmem_via_loop (rtx destm
>                               true, out_label);
>       predict_jump (REG_BR_PROB_BASE * 10 / 100);
>     }
> -  emit_move_insn (iter, const0_rtx);
> +  if (!reuse_iter)
> +    emit_move_insn (iter, const0_rtx);
>
>   emit_label (top_label);
>
>   tmp = convert_modes (Pmode, iter_mode, iter, true);
>   x_addr = gen_rtx_PLUS (Pmode, destptr, tmp);
> -  destmem = change_address (destmem, mode, x_addr);
> +  destmem =
> +    adjust_automodify_address_nv (copy_rtx (destmem), mode, x_addr, 0);
>
>   if (srcmem)
>     {
>       y_addr = gen_rtx_PLUS (Pmode, srcptr, copy_rtx (tmp));
> -      srcmem = change_address (srcmem, mode, y_addr);
> +      srcmem =
> +       adjust_automodify_address_nv (copy_rtx (srcmem), mode, y_addr, 0);
>
>       /* When unrolling for chips that reorder memory reads and writes,
>         we can save registers by using single temporary.
> @@ -20920,19 +21124,43 @@ expand_set_or_movmem_via_loop (rtx destm
>     }
>   else
>     predict_jump (REG_BR_PROB_BASE * 80 / 100);
> -  iter = ix86_zero_extend_to_Pmode (iter);
> -  tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
> -                            true, OPTAB_LIB_WIDEN);
> -  if (tmp != destptr)
> -    emit_move_insn (destptr, tmp);
> -  if (srcptr)
> +  if (change_ptrs)
>     {
> -      tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
> +      iter = ix86_zero_extend_to_Pmode (iter);
> +      tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
>                                 true, OPTAB_LIB_WIDEN);
> -      if (tmp != srcptr)
> -       emit_move_insn (srcptr, tmp);
> +      if (tmp != destptr)
> +       emit_move_insn (destptr, tmp);
> +      if (srcptr)
> +       {
> +         tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
> +                                    true, OPTAB_LIB_WIDEN);
> +         if (tmp != srcptr)
> +           emit_move_insn (srcptr, tmp);
> +       }
>     }
>   emit_label (out_label);
> +  return iter;
> +}
> +
> +/* When SRCPTR is non-NULL, output simple loop to move memory
> +   pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
> +   overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
> +   equivalent loop to set memory by VALUE (supposed to be in MODE).
> +
> +   The size is rounded down to whole number of chunk size moved at once.
> +   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
> +
> +static void
> +expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
> +                              rtx destptr, rtx srcptr, rtx value,
> +                              rtx count, enum machine_mode mode, int unroll,
> +                              int expected_size)
> +{
> +  expand_set_or_movmem_via_loop_with_iter (destmem, srcmem,
> +                                destptr, srcptr, value,
> +                                count, NULL_RTX, mode, unroll,
> +                                expected_size, true);
>  }
>
>  /* Output "rep; mov" instruction.
> @@ -21036,7 +21264,18 @@ emit_strmov (rtx destmem, rtx srcmem,
>   emit_insn (gen_strmov (destptr, dest, srcptr, src));
>  }
>
> -/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST.  */
> +/* Emit a strset instruction.  If RHS is constant and a vector mode will be
> +   used, then move this constant to a vector register before emitting strset.  */
> +static void
> +emit_strset (rtx destmem, rtx value,
> +            rtx destptr, enum machine_mode mode, int offset)
> +{
> +  rtx dest = adjust_automodify_address_nv (destmem, mode, destptr, offset);
> +  emit_insn (gen_strset (destptr, dest, value));
> +}
> +
> +/* Output code to copy (COUNT % MAX_SIZE) bytes from SRCPTR to DESTPTR.
> +   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
>  static void
>  expand_movmem_epilogue (rtx destmem, rtx srcmem,
>                        rtx destptr, rtx srcptr, rtx count, int max_size)
> @@ -21047,43 +21286,55 @@ expand_movmem_epilogue (rtx destmem, rtx
>       HOST_WIDE_INT countval = INTVAL (count);
>       int offset = 0;
>
> -      if ((countval & 0x10) && max_size > 16)
> +      int remainder_size = countval % max_size;
> +      enum machine_mode move_mode = Pmode;
> +
> +      /* Firstly, try to move data with the widest possible mode.
> +        Remaining part we'll move using Pmode and narrower modes.  */
> +      if (TARGET_SSE)
>        {
> -         if (TARGET_64BIT)
> -           {
> -             emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
> -             emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset + 8);
> -           }
> -         else
> -           gcc_unreachable ();
> -         offset += 16;
> +         if (max_size >= GET_MODE_SIZE (V4SImode))
> +           move_mode = V4SImode;
> +         else if (max_size >= GET_MODE_SIZE (DImode))
> +           move_mode = DImode;
>        }
> -      if ((countval & 0x08) && max_size > 8)
> +
> +      while (remainder_size >= GET_MODE_SIZE (move_mode))
>        {
> -         if (TARGET_64BIT)
> -           emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
> -         else
> -           {
> -             emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
> -             emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset + 4);
> -           }
> -         offset += 8;
> +         emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
> +         offset += GET_MODE_SIZE (move_mode);
> +         remainder_size -= GET_MODE_SIZE (move_mode);
> +       }
> +
> +      /* Move the remaining part of epilogue - its size might be
> +        a size of the widest mode.  */
> +      move_mode = Pmode;
> +      while (remainder_size >= GET_MODE_SIZE (move_mode))
> +       {
> +         emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
> +         offset += GET_MODE_SIZE (move_mode);
> +         remainder_size -= GET_MODE_SIZE (move_mode);
>        }
> -      if ((countval & 0x04) && max_size > 4)
> +
> +      if (remainder_size >= 4)
>        {
> -          emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
> +         emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
>          offset += 4;
> +         remainder_size -= 4;
>        }
> -      if ((countval & 0x02) && max_size > 2)
> +      if (remainder_size >= 2)
>        {
> -          emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
> +         emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
>          offset += 2;
> +         remainder_size -= 2;
>        }
> -      if ((countval & 0x01) && max_size > 1)
> +      if (remainder_size >= 1)
>        {
> -          emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
> +         emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
>          offset += 1;
> +         remainder_size -= 1;
>        }
> +      gcc_assert (remainder_size == 0);
>       return;
>     }
>   if (max_size > 8)
> @@ -21189,87 +21440,121 @@ expand_setmem_epilogue_via_loop (rtx des
>                                 1, max_size / 2);
>  }
>
> -/* Output code to set at most count & (max_size - 1) bytes starting by DEST.  */
> +/* Output code to set with VALUE at most (COUNT % MAX_SIZE) bytes starting from
> +   DESTPTR.
> +   DESTMEM provides MEMrtx to feed proper aliasing info.
> +   PROMOTED_TO_GPR_VALUE is rtx representing a GPR containing broadcasted VALUE.
> +   PROMOTED_TO_VECTOR_VALUE is rtx representing a vector register containing
> +   broadcasted VALUE.
> +   PROMOTED_TO_GPR_VALUE and PROMOTED_TO_VECTOR_VALUE could be NULL if the
> +   promotion hasn't been generated before.  */
>  static void
> -expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_size)
> +expand_setmem_epilogue (rtx destmem, rtx destptr, rtx promoted_to_vector_value,
> +                       rtx promoted_to_gpr_value, rtx value, rtx count,
> +                       int max_size)
>  {
> -  rtx dest;
> -
>   if (CONST_INT_P (count))
>     {
>       HOST_WIDE_INT countval = INTVAL (count);
>       int offset = 0;
>
> -      if ((countval & 0x10) && max_size > 16)
> -       {
> -         if (TARGET_64BIT)
> -           {
> -             dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
> -             emit_insn (gen_strset (destptr, dest, value));
> -             dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset + 8);
> -             emit_insn (gen_strset (destptr, dest, value));
> -           }
> -         else
> -           gcc_unreachable ();
> -         offset += 16;
> -       }
> -      if ((countval & 0x08) && max_size > 8)
> -       {
> -         if (TARGET_64BIT)
> -           {
> -             dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
> -             emit_insn (gen_strset (destptr, dest, value));
> -           }
> -         else
> -           {
> -             dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
> -             emit_insn (gen_strset (destptr, dest, value));
> -             dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset + 4);
> -             emit_insn (gen_strset (destptr, dest, value));
> -           }
> -         offset += 8;
> -       }
> -      if ((countval & 0x04) && max_size > 4)
> +      int remainder_size = countval % max_size;
> +      enum machine_mode move_mode = Pmode;
> +
> +      /* Firstly, try to move data with the widest possible mode.
> +        Remaining part we'll move using Pmode and narrower modes.  */
> +
> +      if (promoted_to_vector_value)
> +       while (remainder_size >= 16)
> +         {
> +           if (GET_MODE (destmem) != move_mode)
> +             destmem = adjust_automodify_address_nv (destmem, move_mode,
> +                                                     destptr, offset);
> +           emit_strset (destmem, promoted_to_vector_value, destptr,
> +                        move_mode, offset);
> +
> +           offset += 16;
> +           remainder_size -= 16;
> +         }
> +
> +      /* Move the remaining part of epilogue - its size might be
> +        a size of the widest mode.  */
> +      while (remainder_size >= GET_MODE_SIZE (Pmode))
> +       {
> +         if (!promoted_to_gpr_value)
> +           promoted_to_gpr_value = promote_duplicated_reg (Pmode, value);
> +         emit_strset (destmem, promoted_to_gpr_value, destptr, Pmode, offset);
> +         offset += GET_MODE_SIZE (Pmode);
> +         remainder_size -= GET_MODE_SIZE (Pmode);
> +       }
> +
> +      if (!promoted_to_gpr_value && remainder_size > 1)
> +       promoted_to_gpr_value = promote_duplicated_reg (remainder_size >= 4
> +                                                       ? SImode : HImode, value);
> +      if (remainder_size >= 4)
>        {
> -         dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
> -         emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
> +         emit_strset (destmem, gen_lowpart (SImode, promoted_to_gpr_value), destptr,
> +                      SImode, offset);
>          offset += 4;
> +         remainder_size -= 4;
>        }
> -      if ((countval & 0x02) && max_size > 2)
> +      if (remainder_size >= 2)
>        {
> -         dest = adjust_automodify_address_nv (destmem, HImode, destptr, offset);
> -         emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
> -         offset += 2;
> +         emit_strset (destmem, gen_lowpart (HImode, promoted_to_gpr_value), destptr,
> +                      HImode, offset);
> +         offset +=2;
> +         remainder_size -= 2;
>        }
> -      if ((countval & 0x01) && max_size > 1)
> +      if (remainder_size >= 1)
>        {
> -         dest = adjust_automodify_address_nv (destmem, QImode, destptr, offset);
> -         emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
> +         emit_strset (destmem,
> +                      promoted_to_gpr_value ? gen_lowpart (QImode, promoted_to_gpr_value) : value,
> +                       destptr,
> +                      QImode, offset);
>          offset += 1;
> +         remainder_size -= 1;
>        }
> +      gcc_assert (remainder_size == 0);
>       return;
>     }
> +
> +  /* count isn't const.  */
>   if (max_size > 32)
>     {
> -      expand_setmem_epilogue_via_loop (destmem, destptr, value, count, max_size);
> +      expand_setmem_epilogue_via_loop (destmem, destptr, value, count,
> +                                      max_size);
>       return;
>     }
> +
> +  if (!promoted_to_gpr_value)
> +    promoted_to_gpr_value = promote_duplicated_reg_to_size (value,
> +                                                  GET_MODE_SIZE (Pmode),
> +                                                  GET_MODE_SIZE (Pmode),
> +                                                  GET_MODE_SIZE (Pmode));
> +
>   if (max_size > 16)
>     {
>       rtx label = ix86_expand_aligntest (count, 16, true);
> -      if (TARGET_64BIT)
> +      if (TARGET_SSE && promoted_to_vector_value)
> +       {
> +         destmem = change_address (destmem,
> +                                   GET_MODE (promoted_to_vector_value),
> +                                   destptr);
> +         emit_insn (gen_strset (destptr, destmem, promoted_to_vector_value));
> +       }
> +      else if (TARGET_64BIT)
>        {
> -         dest = change_address (destmem, DImode, destptr);
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> +         destmem = change_address (destmem, DImode, destptr);
> +         emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
> +         emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
>        }
>       else
>        {
> -         dest = change_address (destmem, SImode, destptr);
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> +         destmem = change_address (destmem, SImode, destptr);
> +         emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
> +         emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
> +         emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
> +         emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
>        }
>       emit_label (label);
>       LABEL_NUSES (label) = 1;
> @@ -21279,14 +21564,22 @@ expand_setmem_epilogue (rtx destmem, rtx
>       rtx label = ix86_expand_aligntest (count, 8, true);
>       if (TARGET_64BIT)
>        {
> -         dest = change_address (destmem, DImode, destptr);
> -         emit_insn (gen_strset (destptr, dest, value));
> +         destmem = change_address (destmem, DImode, destptr);
> +         emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
> +       }
> +      /* FIXME: When this hunk it output, IRA classifies promoted_to_vector_value
> +         as NO_REGS.  */
> +      else if (TARGET_SSE && promoted_to_vector_value && 0)
> +       {
> +         destmem = change_address (destmem, V2SImode, destptr);
> +         emit_insn (gen_strset (destptr, destmem,
> +                                gen_lowpart (V2SImode, promoted_to_vector_value)));
>        }
>       else
>        {
> -         dest = change_address (destmem, SImode, destptr);
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> +         destmem = change_address (destmem, SImode, destptr);
> +         emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
> +         emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
>        }
>       emit_label (label);
>       LABEL_NUSES (label) = 1;
> @@ -21294,24 +21587,27 @@ expand_setmem_epilogue (rtx destmem, rtx
>   if (max_size > 4)
>     {
>       rtx label = ix86_expand_aligntest (count, 4, true);
> -      dest = change_address (destmem, SImode, destptr);
> -      emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
> +      destmem = change_address (destmem, SImode, destptr);
> +      emit_insn (gen_strset (destptr, destmem,
> +                            gen_lowpart (SImode, promoted_to_gpr_value)));
>       emit_label (label);
>       LABEL_NUSES (label) = 1;
>     }
>   if (max_size > 2)
>     {
>       rtx label = ix86_expand_aligntest (count, 2, true);
> -      dest = change_address (destmem, HImode, destptr);
> -      emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
> +      destmem = change_address (destmem, HImode, destptr);
> +      emit_insn (gen_strset (destptr, destmem,
> +                            gen_lowpart (HImode, promoted_to_gpr_value)));
>       emit_label (label);
>       LABEL_NUSES (label) = 1;
>     }
>   if (max_size > 1)
>     {
>       rtx label = ix86_expand_aligntest (count, 1, true);
> -      dest = change_address (destmem, QImode, destptr);
> -      emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
> +      destmem = change_address (destmem, QImode, destptr);
> +      emit_insn (gen_strset (destptr, destmem,
> +                            gen_lowpart (QImode, promoted_to_gpr_value)));
>       emit_label (label);
>       LABEL_NUSES (label) = 1;
>     }
> @@ -21327,8 +21623,8 @@ expand_movmem_prologue (rtx destmem, rtx
>   if (align <= 1 && desired_alignment > 1)
>     {
>       rtx label = ix86_expand_aligntest (destptr, 1, false);
> -      srcmem = change_address (srcmem, QImode, srcptr);
> -      destmem = change_address (destmem, QImode, destptr);
> +      srcmem = adjust_automodify_address_nv (srcmem, QImode, srcptr, 0);
> +      destmem = adjust_automodify_address_nv (destmem, QImode, destptr, 0);
>       emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
>       ix86_adjust_counter (count, 1);
>       emit_label (label);
> @@ -21337,8 +21633,8 @@ expand_movmem_prologue (rtx destmem, rtx
>   if (align <= 2 && desired_alignment > 2)
>     {
>       rtx label = ix86_expand_aligntest (destptr, 2, false);
> -      srcmem = change_address (srcmem, HImode, srcptr);
> -      destmem = change_address (destmem, HImode, destptr);
> +      srcmem = adjust_automodify_address_nv (srcmem, HImode, srcptr, 0);
> +      destmem = adjust_automodify_address_nv (destmem, HImode, destptr, 0);
>       emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
>       ix86_adjust_counter (count, 2);
>       emit_label (label);
> @@ -21347,14 +21643,34 @@ expand_movmem_prologue (rtx destmem, rtx
>   if (align <= 4 && desired_alignment > 4)
>     {
>       rtx label = ix86_expand_aligntest (destptr, 4, false);
> -      srcmem = change_address (srcmem, SImode, srcptr);
> -      destmem = change_address (destmem, SImode, destptr);
> +      srcmem = adjust_automodify_address_nv (srcmem, SImode, srcptr, 0);
> +      destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
>       emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
>       ix86_adjust_counter (count, 4);
>       emit_label (label);
>       LABEL_NUSES (label) = 1;
>     }
> -  gcc_assert (desired_alignment <= 8);
> +  if (align <= 8 && desired_alignment > 8)
> +    {
> +      rtx label = ix86_expand_aligntest (destptr, 8, false);
> +      if (TARGET_64BIT || TARGET_SSE)
> +       {
> +         srcmem = adjust_automodify_address_nv (srcmem, DImode, srcptr, 0);
> +         destmem = adjust_automodify_address_nv (destmem, DImode, destptr, 0);
> +         emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
> +       }
> +      else
> +       {
> +         srcmem = adjust_automodify_address_nv (srcmem, SImode, srcptr, 0);
> +         destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
> +         emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
> +         emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
> +       }
> +      ix86_adjust_counter (count, 8);
> +      emit_label (label);
> +      LABEL_NUSES (label) = 1;
> +    }
> +  gcc_assert (desired_alignment <= 16);
>  }
>
>  /* Copy enough from DST to SRC to align DST known to DESIRED_ALIGN.
> @@ -21409,6 +21725,37 @@ expand_constant_movmem_prologue (rtx dst
>       off = 4;
>       emit_insn (gen_strmov (destreg, dst, srcreg, src));
>     }
> +  if (align_bytes & 8)
> +    {
> +      if (TARGET_64BIT || TARGET_SSE)
> +       {
> +         dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
> +         src = adjust_automodify_address_nv (src, DImode, srcreg, off);
> +         emit_insn (gen_strmov (destreg, dst, srcreg, src));
> +       }
> +      else
> +       {
> +         dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
> +         src = adjust_automodify_address_nv (src, SImode, srcreg, off);
> +         emit_insn (gen_strmov (destreg, dst, srcreg, src));
> +         emit_insn (gen_strmov (destreg, dst, srcreg, src));
> +       }
> +      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
> +       set_mem_align (dst, 8 * BITS_PER_UNIT);
> +      if (src_align_bytes >= 0)
> +       {
> +         unsigned int src_align = 0;
> +         if ((src_align_bytes & 7) == (align_bytes & 7))
> +           src_align = 8;
> +         else if ((src_align_bytes & 3) == (align_bytes & 3))
> +           src_align = 4;
> +         else if ((src_align_bytes & 1) == (align_bytes & 1))
> +           src_align = 2;
> +         if (MEM_ALIGN (src) < src_align * BITS_PER_UNIT)
> +           set_mem_align (src, src_align * BITS_PER_UNIT);
> +       }
> +      off = 8;
> +    }
>   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
>   src = adjust_automodify_address_nv (src, BLKmode, srcreg, off);
>   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
> @@ -21416,7 +21763,9 @@ expand_constant_movmem_prologue (rtx dst
>   if (src_align_bytes >= 0)
>     {
>       unsigned int src_align = 0;
> -      if ((src_align_bytes & 7) == (align_bytes & 7))
> +      if ((src_align_bytes & 15) == (align_bytes & 15))
> +       src_align = 16;
> +      else if ((src_align_bytes & 7) == (align_bytes & 7))
>        src_align = 8;
>       else if ((src_align_bytes & 3) == (align_bytes & 3))
>        src_align = 4;
> @@ -21444,7 +21793,7 @@ expand_setmem_prologue (rtx destmem, rtx
>   if (align <= 1 && desired_alignment > 1)
>     {
>       rtx label = ix86_expand_aligntest (destptr, 1, false);
> -      destmem = change_address (destmem, QImode, destptr);
> +      destmem = adjust_automodify_address_nv (destmem, QImode, destptr, 0);
>       emit_insn (gen_strset (destptr, destmem, gen_lowpart (QImode, value)));
>       ix86_adjust_counter (count, 1);
>       emit_label (label);
> @@ -21453,7 +21802,7 @@ expand_setmem_prologue (rtx destmem, rtx
>   if (align <= 2 && desired_alignment > 2)
>     {
>       rtx label = ix86_expand_aligntest (destptr, 2, false);
> -      destmem = change_address (destmem, HImode, destptr);
> +      destmem = adjust_automodify_address_nv (destmem, HImode, destptr, 0);
>       emit_insn (gen_strset (destptr, destmem, gen_lowpart (HImode, value)));
>       ix86_adjust_counter (count, 2);
>       emit_label (label);
> @@ -21462,13 +21811,23 @@ expand_setmem_prologue (rtx destmem, rtx
>   if (align <= 4 && desired_alignment > 4)
>     {
>       rtx label = ix86_expand_aligntest (destptr, 4, false);
> -      destmem = change_address (destmem, SImode, destptr);
> +      destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
>       emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
>       ix86_adjust_counter (count, 4);
>       emit_label (label);
>       LABEL_NUSES (label) = 1;
>     }
> -  gcc_assert (desired_alignment <= 8);
> +  if (align <= 8 && desired_alignment > 8)
> +    {
> +      rtx label = ix86_expand_aligntest (destptr, 8, false);
> +      destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
> +      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
> +      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
> +      ix86_adjust_counter (count, 8);
> +      emit_label (label);
> +      LABEL_NUSES (label) = 1;
> +    }
> +  gcc_assert (desired_alignment <= 16);
>  }
>
>  /* Set enough from DST to align DST known to by aligned by ALIGN to
> @@ -21504,6 +21863,19 @@ expand_constant_setmem_prologue (rtx dst
>       emit_insn (gen_strset (destreg, dst,
>                             gen_lowpart (SImode, value)));
>     }
> +  if (align_bytes & 8)
> +    {
> +      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
> +      emit_insn (gen_strset (destreg, dst,
> +           gen_lowpart (SImode, value)));
> +      off = 4;
> +      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
> +      emit_insn (gen_strset (destreg, dst,
> +           gen_lowpart (SImode, value)));
> +      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
> +       set_mem_align (dst, 8 * BITS_PER_UNIT);
> +      off = 4;
> +    }
>   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
>   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
>     set_mem_align (dst, desired_align * BITS_PER_UNIT);
> @@ -21515,7 +21887,7 @@ expand_constant_setmem_prologue (rtx dst
>  /* Given COUNT and EXPECTED_SIZE, decide on codegen of string operation.  */
>  static enum stringop_alg
>  decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
> -           int *dynamic_check)
> +           int *dynamic_check, bool align_unknown)
>  {
>   const struct stringop_algs * algs;
>   bool optimize_for_speed;
> @@ -21524,7 +21896,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WI
>      consider such algorithms if the user has appropriated those
>      registers for their own purposes. */
>   bool rep_prefix_usable = !(fixed_regs[CX_REG] || fixed_regs[DI_REG]
> -                             || (memset
> +                            || (memset
>                                 ? fixed_regs[AX_REG] : fixed_regs[SI_REG]));
>
>  #define ALG_USABLE_P(alg) (rep_prefix_usable                   \
> @@ -21537,7 +21909,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WI
>      of time processing large blocks.  */
>   if (optimize_function_for_size_p (cfun)
>       || (optimize_insn_for_size_p ()
> -          && expected_size != -1 && expected_size < 256))
> +         && expected_size != -1 && expected_size < 256))
>     optimize_for_speed = false;
>   else
>     optimize_for_speed = true;
> @@ -21546,9 +21918,9 @@ decide_alg (HOST_WIDE_INT count, HOST_WI
>
>   *dynamic_check = -1;
>   if (memset)
> -    algs = &cost->memset[TARGET_64BIT != 0];
> +    algs = &cost->memset[align_unknown][TARGET_64BIT != 0];
>   else
> -    algs = &cost->memcpy[TARGET_64BIT != 0];
> +    algs = &cost->memcpy[align_unknown][TARGET_64BIT != 0];
>   if (ix86_stringop_alg != no_stringop && ALG_USABLE_P (ix86_stringop_alg))
>     return ix86_stringop_alg;
>   /* rep; movq or rep; movl is the smallest variant.  */
> @@ -21612,29 +21984,33 @@ decide_alg (HOST_WIDE_INT count, HOST_WI
>       enum stringop_alg alg;
>       int i;
>       bool any_alg_usable_p = true;
> +      bool only_libcall_fits = true;
>
>       for (i = 0; i < MAX_STRINGOP_ALGS; i++)
> -        {
> -          enum stringop_alg candidate = algs->size[i].alg;
> -          any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
> +       {
> +         enum stringop_alg candidate = algs->size[i].alg;
> +         any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
>
> -          if (candidate != libcall && candidate
> -              && ALG_USABLE_P (candidate))
> -              max = algs->size[i].max;
> -        }
> +         if (candidate != libcall && candidate
> +             && ALG_USABLE_P (candidate))
> +           {
> +             max = algs->size[i].max;
> +             only_libcall_fits = false;
> +           }
> +       }
>       /* If there aren't any usable algorithms, then recursing on
> -         smaller sizes isn't going to find anything.  Just return the
> -         simple byte-at-a-time copy loop.  */
> -      if (!any_alg_usable_p)
> -        {
> -          /* Pick something reasonable.  */
> -          if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
> -            *dynamic_check = 128;
> -          return loop_1_byte;
> -        }
> +        smaller sizes isn't going to find anything.  Just return the
> +        simple byte-at-a-time copy loop.  */
> +      if (!any_alg_usable_p || only_libcall_fits)
> +       {
> +         /* Pick something reasonable.  */
> +         if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
> +           *dynamic_check = 128;
> +         return loop_1_byte;
> +       }
>       if (max == -1)
>        max = 4096;
> -      alg = decide_alg (count, max / 2, memset, dynamic_check);
> +      alg = decide_alg (count, max / 2, memset, dynamic_check, align_unknown);
>       gcc_assert (*dynamic_check == -1);
>       gcc_assert (alg != libcall);
>       if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
> @@ -21658,9 +22034,14 @@ decide_alignment (int align,
>       case no_stringop:
>        gcc_unreachable ();
>       case loop:
> +       desired_align = GET_MODE_SIZE (Pmode);
> +       break;
>       case unrolled_loop:
>        desired_align = GET_MODE_SIZE (Pmode);
>        break;
> +      case sse_loop:
> +       desired_align = 16;
> +       break;
>       case rep_prefix_8_byte:
>        desired_align = 8;
>        break;
> @@ -21748,6 +22129,11 @@ ix86_expand_movmem (rtx dst, rtx src, rt
>   enum stringop_alg alg;
>   int dynamic_check;
>   bool need_zero_guard = false;
> +  bool align_unknown;
> +  int unroll_factor;
> +  enum machine_mode move_mode;
> +  rtx loop_iter = NULL_RTX;
> +  int dst_offset, src_offset;
>
>   if (CONST_INT_P (align_exp))
>     align = INTVAL (align_exp);
> @@ -21771,9 +22157,17 @@ ix86_expand_movmem (rtx dst, rtx src, rt
>
>   /* Step 0: Decide on preferred algorithm, desired alignment and
>      size of chunks to be copied by main loop.  */
> -
> -  alg = decide_alg (count, expected_size, false, &dynamic_check);
> +  dst_offset = get_mem_align_offset (dst, MOVE_MAX*BITS_PER_UNIT);
> +  src_offset = get_mem_align_offset (src, MOVE_MAX*BITS_PER_UNIT);
> +  align_unknown = (dst_offset < 0
> +                  || src_offset < 0
> +                  || src_offset != dst_offset);
> +  alg = decide_alg (count, expected_size, false, &dynamic_check, align_unknown);
>   desired_align = decide_alignment (align, alg, expected_size);
> +  if (align_unknown)
> +    desired_align = align;
> +  unroll_factor = 1;
> +  move_mode = Pmode;
>
>   if (!TARGET_ALIGN_STRINGOPS)
>     align = desired_align;
> @@ -21792,11 +22186,22 @@ ix86_expand_movmem (rtx dst, rtx src, rt
>       gcc_unreachable ();
>     case loop:
>       need_zero_guard = true;
> -      size_needed = GET_MODE_SIZE (Pmode);
> +      move_mode = Pmode;
> +      unroll_factor = 1;
> +      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
>       break;
>     case unrolled_loop:
>       need_zero_guard = true;
> -      size_needed = GET_MODE_SIZE (Pmode) * (TARGET_64BIT ? 4 : 2);
> +      move_mode = Pmode;
> +      unroll_factor = TARGET_64BIT ? 4 : 2;
> +      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
> +      break;
> +    case sse_loop:
> +      need_zero_guard = true;
> +      /* Use SSE instructions, if possible.  */
> +      move_mode = align_unknown ? DImode : V4SImode;
> +      unroll_factor = TARGET_64BIT ? 4 : 2;
> +      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
>       break;
>     case rep_prefix_8_byte:
>       size_needed = 8;
> @@ -21857,6 +22262,12 @@ ix86_expand_movmem (rtx dst, rtx src, rt
>        }
>       else
>        {
> +         /* SSE and unrolled algs re-use iteration counter in the epilogue.  */
> +         if (alg == sse_loop || alg == unrolled_loop)
> +           {
> +             loop_iter = gen_reg_rtx (counter_mode (count_exp));
> +              emit_move_insn (loop_iter, const0_rtx);
> +           }
>          label = gen_label_rtx ();
>          emit_cmp_and_jump_insns (count_exp,
>                                   GEN_INT (epilogue_size_needed),
> @@ -21908,6 +22319,8 @@ ix86_expand_movmem (rtx dst, rtx src, rt
>          dst = change_address (dst, BLKmode, destreg);
>          expand_movmem_prologue (dst, src, destreg, srcreg, count_exp, align,
>                                  desired_align);
> +         set_mem_align (src, desired_align*BITS_PER_UNIT);
> +         set_mem_align (dst, desired_align*BITS_PER_UNIT);
>        }
>       else
>        {
> @@ -21964,12 +22377,16 @@ ix86_expand_movmem (rtx dst, rtx src, rt
>       expand_set_or_movmem_via_loop (dst, src, destreg, srcreg, NULL,
>                                     count_exp, Pmode, 1, expected_size);
>       break;
> +    case sse_loop:
>     case unrolled_loop:
> -      /* Unroll only by factor of 2 in 32bit mode, since we don't have enough
> -        registers for 4 temporaries anyway.  */
> -      expand_set_or_movmem_via_loop (dst, src, destreg, srcreg, NULL,
> -                                    count_exp, Pmode, TARGET_64BIT ? 4 : 2,
> -                                    expected_size);
> +      /* In some cases we want to use the same iterator in several adjacent
> +        loops, so here we save loop iterator rtx and don't update addresses.  */
> +      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
> +                                                          srcreg, NULL,
> +                                                          count_exp, loop_iter,
> +                                                          move_mode,
> +                                                          unroll_factor,
> +                                                          expected_size, false);
>       break;
>     case rep_prefix_8_byte:
>       expand_movmem_via_rep_mov (dst, src, destreg, srcreg, count_exp,
> @@ -22020,9 +22437,41 @@ ix86_expand_movmem (rtx dst, rtx src, rt
>       LABEL_NUSES (label) = 1;
>     }
>
> +  /* We haven't updated addresses, so we'll do it now.
> +     Also, if the epilogue seems to be big, we'll generate a loop (not
> +     unrolled) in it.  We'll do it only if alignment is unknown, because in
> +     this case in epilogue we have to perform memmove by bytes, which is very
> +     slow.  */
> +  if (alg == sse_loop || alg == unrolled_loop)
> +    {
> +      rtx tmp;
> +      if (align_unknown && unroll_factor > 1)
> +       {
> +         /* Reduce the epilogue's size by creating a non-unrolled loop.  If we
> +            don't do this, we can have a very big epilogue - when alignment is
> +            statically unknown, the epilogue goes byte by byte, which may be very slow.  */
> +         loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
> +             srcreg, NULL, count_exp,
> +             loop_iter, move_mode, 1,
> +             expected_size, false);
> +         src = change_address (src, BLKmode, srcreg);
> +         dst = change_address (dst, BLKmode, destreg);
> +         epilogue_size_needed = GET_MODE_SIZE (move_mode);
> +       }
> +      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
> +                              true, OPTAB_LIB_WIDEN);
> +      if (tmp != destreg)
> +       emit_move_insn (destreg, tmp);
> +
> +      tmp = expand_simple_binop (Pmode, PLUS, srcreg, loop_iter, srcreg,
> +                              true, OPTAB_LIB_WIDEN);
> +      if (tmp != srcreg)
> +       emit_move_insn (srcreg, tmp);
> +    }
>   if (count_exp != const0_rtx && epilogue_size_needed > 1)
>     expand_movmem_epilogue (dst, src, destreg, srcreg, count_exp,
>                            epilogue_size_needed);
> +
>   if (jump_around_label)
>     emit_label (jump_around_label);
>   return true;
> @@ -22040,7 +22489,37 @@ promote_duplicated_reg (enum machine_mod
>   rtx tmp;
>   int nops = mode == DImode ? 3 : 2;
>
> +  if (VECTOR_MODE_P (mode))
> +    {
> +      enum machine_mode inner = GET_MODE_INNER (mode);
> +      rtx promoted_val, vec_reg;
> +      if (CONST_INT_P (val))
> +       return ix86_build_const_vector (mode, true, val);
> +
> +      promoted_val = promote_duplicated_reg (inner, val);
> +      vec_reg = gen_reg_rtx (mode);
> +      switch (mode)
> +       {
> +       case V2DImode:
> +         emit_insn (gen_vec_dupv2di (vec_reg, promoted_val));
> +         break;
> +       case V4SImode:
> +         emit_insn (gen_vec_dupv4si (vec_reg, promoted_val));
> +         break;
> +       default:
> +         gcc_unreachable ();
> +         break;
> +       }
> +
> +      return vec_reg;
> +    }
>   gcc_assert (mode == SImode || mode == DImode);
> +  if (mode == DImode && !TARGET_64BIT)
> +    {
> +      rtx vec_reg = promote_duplicated_reg (V4SImode, val);
> +      vec_reg = convert_to_mode (V2DImode, vec_reg, 1);
> +      return vec_reg;
> +    }
>   if (val == const0_rtx)
>     return copy_to_mode_reg (mode, const0_rtx);
>   if (CONST_INT_P (val))
> @@ -22106,11 +22585,27 @@ promote_duplicated_reg (enum machine_mod
>  static rtx
>  promote_duplicated_reg_to_size (rtx val, int size_needed, int desired_align, int align)
>  {
> -  rtx promoted_val;
> +  rtx promoted_val = NULL_RTX;
>
> -  if (TARGET_64BIT
> -      && (size_needed > 4 || (desired_align > align && desired_align > 4)))
> -    promoted_val = promote_duplicated_reg (DImode, val);
> +  if (size_needed > 8 || (desired_align > align && desired_align > 8))
> +    {
> +      /* We want to promote to vector register, so we expect that at least SSE
> +        is available.  */
> +      gcc_assert (TARGET_SSE);
> +
> +      /* In case of promotion to vector register, we expect that val is a
> +        constant or already promoted to GPR value.  */
> +      gcc_assert (GET_MODE (val) == Pmode || CONSTANT_P (val));
> +      if (TARGET_64BIT)
> +       promoted_val = promote_duplicated_reg (V2DImode, val);
> +      else
> +       promoted_val = promote_duplicated_reg (V4SImode, val);
> +    }
> +  else if (size_needed > 4 || (desired_align > align && desired_align > 4))
> +    {
> +      gcc_assert (TARGET_64BIT);
> +      promoted_val = promote_duplicated_reg (DImode, val);
> +    }
>   else if (size_needed > 2 || (desired_align > align && desired_align > 2))
>     promoted_val = promote_duplicated_reg (SImode, val);
>   else if (size_needed > 1 || (desired_align > align && desired_align > 1))
> @@ -22138,10 +22633,14 @@ ix86_expand_setmem (rtx dst, rtx count_e
>   int size_needed = 0, epilogue_size_needed;
>   int desired_align = 0, align_bytes = 0;
>   enum stringop_alg alg;
> -  rtx promoted_val = NULL;
> -  bool force_loopy_epilogue = false;
> +  rtx gpr_promoted_val = NULL;
> +  rtx vec_promoted_val = NULL;
>   int dynamic_check;
>   bool need_zero_guard = false;
> +  bool align_unknown;
> +  unsigned int unroll_factor;
> +  enum machine_mode move_mode;
> +  rtx loop_iter = NULL_RTX;
>
>   if (CONST_INT_P (align_exp))
>     align = INTVAL (align_exp);
> @@ -22161,8 +22660,11 @@ ix86_expand_setmem (rtx dst, rtx count_e
>   /* Step 0: Decide on preferred algorithm, desired alignment and
>      size of chunks to be copied by main loop.  */
>
> -  alg = decide_alg (count, expected_size, true, &dynamic_check);
> +  align_unknown = get_mem_align_offset (dst, BITS_PER_UNIT) < 0;
> +  alg = decide_alg (count, expected_size, true, &dynamic_check, align_unknown);
>   desired_align = decide_alignment (align, alg, expected_size);
> +  unroll_factor = 1;
> +  move_mode = Pmode;
>
>   if (!TARGET_ALIGN_STRINGOPS)
>     align = desired_align;
> @@ -22180,11 +22682,28 @@ ix86_expand_setmem (rtx dst, rtx count_e
>       gcc_unreachable ();
>     case loop:
>       need_zero_guard = true;
> -      size_needed = GET_MODE_SIZE (Pmode);
> +      move_mode = Pmode;
> +      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
>       break;
>     case unrolled_loop:
>       need_zero_guard = true;
> -      size_needed = GET_MODE_SIZE (Pmode) * 4;
> +      move_mode = Pmode;
> +      unroll_factor = 1;
> +      /* Select maximal available 1,2 or 4 unroll factor.  */
> +      while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
> +            && unroll_factor < 4)
> +       unroll_factor *= 2;
> +      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
> +      break;
> +    case sse_loop:
> +      need_zero_guard = true;
> +      move_mode = TARGET_64BIT ? V2DImode : V4SImode;
> +      unroll_factor = 1;
> +      /* Select maximal available 1,2 or 4 unroll factor.  */
> +      while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
> +            && unroll_factor < 4)
> +       unroll_factor *= 2;
> +      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
>       break;
>     case rep_prefix_8_byte:
>       size_needed = 8;
> @@ -22229,8 +22748,10 @@ ix86_expand_setmem (rtx dst, rtx count_e
>      main loop and epilogue (ie one load of the big constant in the
>      front of all code.  */
>   if (CONST_INT_P (val_exp))
> -    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
> -                                                  desired_align, align);
> +    gpr_promoted_val = promote_duplicated_reg_to_size (val_exp,
> +                                                  GET_MODE_SIZE (Pmode),
> +                                                  GET_MODE_SIZE (Pmode),
> +                                                  align);
>   /* Ensure that alignment prologue won't copy past end of block.  */
>   if (size_needed > 1 || (desired_align > 1 && desired_align > align))
>     {
> @@ -22239,12 +22760,6 @@ ix86_expand_setmem (rtx dst, rtx count_e
>         Make sure it is power of 2.  */
>       epilogue_size_needed = smallest_pow2_greater_than (epilogue_size_needed);
>
> -      /* To improve performance of small blocks, we jump around the VAL
> -        promoting mode.  This mean that if the promoted VAL is not constant,
> -        we might not use it in the epilogue and have to use byte
> -        loop variant.  */
> -      if (epilogue_size_needed > 2 && !promoted_val)
> -        force_loopy_epilogue = true;
>       if (count)
>        {
>          if (count < (unsigned HOST_WIDE_INT)epilogue_size_needed)
> @@ -22259,6 +22774,12 @@ ix86_expand_setmem (rtx dst, rtx count_e
>        }
>       else
>        {
> +         /* SSE and unrolled_loop algs re-use the iteration counter in the epilogue.  */
> +         if (alg == sse_loop || alg == unrolled_loop)
> +           {
> +             loop_iter = gen_reg_rtx (counter_mode (count_exp));
> +              emit_move_insn (loop_iter, const0_rtx);
> +           }
>          label = gen_label_rtx ();
>          emit_cmp_and_jump_insns (count_exp,
>                                   GEN_INT (epilogue_size_needed),
> @@ -22284,9 +22805,11 @@ ix86_expand_setmem (rtx dst, rtx count_e
>   /* Step 2: Alignment prologue.  */
>
>   /* Do the expensive promotion once we branched off the small blocks.  */
> -  if (!promoted_val)
> -    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
> -                                                  desired_align, align);
> +  if (!gpr_promoted_val)
> +    gpr_promoted_val = promote_duplicated_reg_to_size (val_exp,
> +                                                  GET_MODE_SIZE (Pmode),
> +                                                  GET_MODE_SIZE (Pmode),
> +                                                  align);
>   gcc_assert (desired_align >= 1 && align >= 1);
>
>   if (desired_align > align)
> @@ -22298,17 +22821,20 @@ ix86_expand_setmem (rtx dst, rtx count_e
>             the pain to maintain it for the first move, so throw away
>             the info early.  */
>          dst = change_address (dst, BLKmode, destreg);
> -         expand_setmem_prologue (dst, destreg, promoted_val, count_exp, align,
> +         expand_setmem_prologue (dst, destreg, gpr_promoted_val, count_exp, align,
>                                  desired_align);
> +         set_mem_align (dst, desired_align*BITS_PER_UNIT);
>        }
>       else
>        {
>          /* If we know how many bytes need to be stored before dst is
>             sufficiently aligned, maintain aliasing info accurately.  */
> -         dst = expand_constant_setmem_prologue (dst, destreg, promoted_val,
> +         dst = expand_constant_setmem_prologue (dst, destreg, gpr_promoted_val,
>                                                 desired_align, align_bytes);
>          count_exp = plus_constant (count_exp, -align_bytes);
>          count -= align_bytes;
> +         if (count < (unsigned HOST_WIDE_INT) size_needed)
> +           goto epilogue;
>        }
>       if (need_zero_guard
>          && (count < (unsigned HOST_WIDE_INT) size_needed
> @@ -22336,7 +22862,7 @@ ix86_expand_setmem (rtx dst, rtx count_e
>       emit_label (label);
>       LABEL_NUSES (label) = 1;
>       label = NULL;
> -      promoted_val = val_exp;
> +      gpr_promoted_val = val_exp;
>       epilogue_size_needed = 1;
>     }
>   else if (label == NULL_RTX)
> @@ -22350,27 +22876,40 @@ ix86_expand_setmem (rtx dst, rtx count_e
>     case no_stringop:
>       gcc_unreachable ();
>     case loop_1_byte:
> -      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
> +      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, val_exp,
>                                     count_exp, QImode, 1, expected_size);
>       break;
>     case loop:
> -      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
> +      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, gpr_promoted_val,
>                                     count_exp, Pmode, 1, expected_size);
>       break;
>     case unrolled_loop:
> -      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
> -                                    count_exp, Pmode, 4, expected_size);
> +      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
> +                                    NULL, gpr_promoted_val, count_exp,
> +                                    loop_iter, move_mode, unroll_factor,
> +                                    expected_size, false);
> +      break;
> +    case sse_loop:
> +      vec_promoted_val =
> +       promote_duplicated_reg_to_size (gpr_promoted_val,
> +                                       GET_MODE_SIZE (move_mode),
> +                                       desired_align, align);
> +      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
> +                                    NULL, vec_promoted_val, count_exp,
> +                                    loop_iter, move_mode, unroll_factor,
> +                                    expected_size, false);
>       break;
>     case rep_prefix_8_byte:
> -      expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
> +      gcc_assert (TARGET_64BIT);
> +      expand_setmem_via_rep_stos (dst, destreg, gpr_promoted_val, count_exp,
>                                  DImode, val_exp);
>       break;
>     case rep_prefix_4_byte:
> -      expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
> +      expand_setmem_via_rep_stos (dst, destreg, gpr_promoted_val, count_exp,
>                                  SImode, val_exp);
>       break;
>     case rep_prefix_1_byte:
> -      expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
> +      expand_setmem_via_rep_stos (dst, destreg, gpr_promoted_val, count_exp,
>                                  QImode, val_exp);
>       break;
>     }
> @@ -22401,17 +22940,33 @@ ix86_expand_setmem (rtx dst, rtx count_e
>        }
>       emit_label (label);
>       LABEL_NUSES (label) = 1;
> +      /* We cannot rely on the fact that the promoted value is known.  */
> +      vec_promoted_val = 0;
>     }
>  epilogue:
> -  if (count_exp != const0_rtx && epilogue_size_needed > 1)
> +  if (alg == sse_loop || alg == unrolled_loop)
>     {
> -      if (force_loopy_epilogue)
> -       expand_setmem_epilogue_via_loop (dst, destreg, val_exp, count_exp,
> -                                        epilogue_size_needed);
> -      else
> -       expand_setmem_epilogue (dst, destreg, promoted_val, count_exp,
> -                               epilogue_size_needed);
> +      rtx tmp;
> +      if (align_unknown && unroll_factor > 1)
> +       {
> +         /* Reduce the epilogue's size by creating a non-unrolled loop.  If we
> +            don't do this, we can have a very big epilogue - when alignment is
> +            statically unknown, the epilogue goes byte by byte, which may be very slow.  */
> +         loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
> +             NULL, vec_promoted_val, count_exp,
> +             loop_iter, move_mode, 1,
> +             expected_size, false);
> +         dst = change_address (dst, BLKmode, destreg);
> +         epilogue_size_needed = GET_MODE_SIZE (move_mode);
> +       }
> +      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
> +                              true, OPTAB_LIB_WIDEN);
> +      if (tmp != destreg)
> +       emit_move_insn (destreg, tmp);
>     }
> +  if (count_exp != const0_rtx && epilogue_size_needed > 1)
> +    expand_setmem_epilogue (dst, destreg, vec_promoted_val, gpr_promoted_val,
> +                           val_exp, count_exp, epilogue_size_needed);
>   if (jump_around_label)
>     emit_label (jump_around_label);
>   return true;
>
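
A hypothetical usage sketch (not part of the patch; function name and
size are made up): a memset call that the expanders above could inline
with a vector strategy on an SSE target instead of calling the library:

#include <string.h>

void
fill_block (char *p)
{
  /* Constant size, large enough to reach the main loop of the
     inline expansion rather than only the epilogue.  */
  memset (p, 0x5a, 512);
}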



-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-11-07 16:24                                           ` Michael Zolotukhin
@ 2011-11-07 16:59                                             ` Jan Hubicka
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Hubicka @ 2011-11-07 16:59 UTC (permalink / raw)
  To: Michael Zolotukhin
  Cc: Jan Hubicka, Richard Henderson, Jakub Jelinek, Jack Howarth,
	gcc-patches, Richard Guenther, H.J. Lu, izamyatin,
	areg.melikadamyan

> Hi, Jan!
> I was just preparing my version of the patch, but it seems a bit late
> now. Please see my comments on this and your previous letter below.
> 
> By the way, would it be possible to commit the other part of the patch
> (the middle-end part) - probably also in small pieces - and some other
> tuning after stage 1 closes?

I can't review the middle-end parts, so you need to ping some of the middle-end
maintainers.  Usually patches submitted before the end of stage 1 tend to be acceptable.
> 
> 
> > The patches disabling CSE and forwprop on constants are apparently papering over the
> > problem that subregs of vector registers used in the epilogue make IRA think
> > that it can't put the value into an SSE register (resulting in the NO_REGS class), making
> > reload output a load of 0 into the internal loop.
> 
> The problem here isn't about subregs - there is just no way to emit a
> store when the destination is a 128-bit memory operand and the source is a
> 128-bit immediate. We should somehow find that previously we initialized a

Yes, I know.  But that is what instruction predicates are for.
When the instruction pattern refuses the immediate operand, forwprop/CSE will
do nothing.
> 
> > I also plugged some code paths - the pain here is that the stringops have many
> > variants - different algorithms, different alignment, constant/variable
> > counts. These increase the testing matrix, and some of the code paths were wrong with
> > the new SSE code.
> 
> Yep, I also saw such failures, thanks for the fixes. Though, I see
> another problem here: the main reason for these failures is that when the size
> is small we could skip the main loop and thus reach the epilogue with an
> uninitialized loop iterator and/or promoted value. To make the
> algorithm absolutely correct, we should either perform the needed
> initializations at the very beginning (before the zero test) or use a
> byte loop in the epilogue. The second way could greatly hurt
> performance, so I think we should just initialize everything before
> the main loop on the assumption that the size is big enough and they'll
> be used in the main loop.
> Moreover, this algorithm wasn't initially intended for small sizes -
> memcpy/memset for small sizes should be expanded earlier, in
> move_by_pieces or set_by_pieces (that was in the middle-end part of the
> patch). So the assumption about the size should be correct.

The problem here is memset/memcpy of a small but variable size. This is very common
and it is critical to handle it right; even GCC has this in its ggc allocation
code, for example.
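
For concreteness, a hedged sketch of that pattern (names invented, not
taken from GCC itself): the size is variable and often tiny, so the
inline expansion may skip the main loop entirely, and the epilogue must
not depend on anything that is only initialized inside that loop:

#include <string.h>
#include <stddef.h>

void
clear_small (char *buf, size_t n)
{
  /* N is typically in the 1..32 range at run time, but the compiler
     only sees a variable count here.  */
  memset (buf, 0, n);
}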

Yes, there is still a bug with the promotion.  The purpose of the force_loopy_epilogue
logic you removed was precisely to get this right.  I plugged the bug with
vector mode moves but not with integer modes.  I will look into it tonight.
Perhaps in a first cut we could revert to the previous code for memset with a
variable byte and keep your faster epilogues for memcpy and for memsets of
constants, where broadcasting is cheap.  I will have to think more.
> 
> 
> > We still may want to produce SSE moves for 64-bit
> > operations in 32-bit codegen, but this is an independent problem, plus the patch as it is
> > seems to produce a slight regression on crafty.
> 
> Actually, such 8-byte moves aren't critical for this part of the patch
> - here such moves could only be used in prologues/epilogues and
> don't affect performance much (assuming that the size isn't very small
> and a small performance loss in the prologue/epilogue doesn't affect overall
> performance).
> But for memcpy/memset of small sizes, which are expanded in the
> middle-end part, that could be quite crucial. For example, for copying
> 24 bytes with unknown alignment on Atom, three 8-byte SSE moves
> could be much faster than six 4-byte moves via GPRs. So, it's
> definitely good to have an opportunity to generate such moves.

Yep, but we need to handle move_by_pieces/clear_by_pieces independently, since
that is a middle-end change and is not really related to the actual
memset/memcpy expansion in the backend.  So please prepare a separate patch
for that.
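
(For reference, the 24-byte case mentioned above boils down to roughly the
following two alternatives - illustrative C only, not what the expander
emits; the helper names are made up:)

  #include <string.h>
  #include <stdint.h>

  /* Three 8-byte chunks, as an SSE-capable expansion could use via movq
     even in 32-bit code.  */
  static void
  copy24_by_8 (char *dst, const char *src)
  {
    uint64_t tmp;
    for (int i = 0; i < 24; i += 8)
      {
        memcpy (&tmp, src + i, 8);
        memcpy (dst + i, &tmp, 8);
      }
  }

  /* Six 4-byte moves via a GPR.  */
  static void
  copy24_by_4 (char *dst, const char *src)
  {
    uint32_t tmp;
    for (int i = 0; i < 24; i += 4)
      {
        memcpy (&tmp, src + i, 4);
        memcpy (dst + i, &tmp, 4);
      }
  }
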
> > I think it would be better to use V8QI mode
> > for the promoted value (since that is what it really is), avoiding the need
> > for changes in expand_move and the loadq pattern.
> 
> Actually, we rarely operate in byte mode - usually we move/store at
> least with Pmode (when we use a GPR). So V4SI or V2DI also looks
> reasonable to me here. Also, when promoting a value from a GPR to an
> SSE register, we surely need the value in DImode/SImode, not in QImode.
> We could do everything in QI/V16QI/V8QI modes, but that could lead to the
> generation of converts in many places (like in the promotion to a vector).

I meant V16QI, but anyway, it is not a byte but a vector of bytes, and that is
what we really do.  The code produced for V2DI or V16QI is actually the same,
but V16QI better describes the fact that the value is a byte broadcast to 16
copies.  Also that will unify the 32-bit/64-bit codegen, which currently uses
V4SI or V2DI for no obvious reason.
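
As an illustration (just a sketch using the GNU vector extension, not code
from the patch), the promoted value is conceptually a 16-byte broadcast,
whatever mode name we attach to the register:

  #include <stdint.h>

  typedef uint8_t v16qi __attribute__ ((vector_size (16)));

  /* Broadcast one byte into all 16 lanes; the resulting 128-bit pattern is
     identical whether the register is later viewed as V16QI, V4SI or V2DI.  */
  static v16qi
  broadcast_byte (uint8_t c)
  {
    v16qi v = { c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c };
    return v;
  }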

Honza

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Use of vector instructions in memmov/memset expanding
  2011-07-13  8:53 Uros Bizjak
@ 2011-07-13 12:16 ` Michael Zolotukhin
  0 siblings, 0 replies; 52+ messages in thread
From: Michael Zolotukhin @ 2011-07-13 12:16 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: gcc-patches, H.J. Lu, Richard Guenther

[-- Attachment #1: Type: text/plain, Size: 659 bytes --]

Thanks for the remarks! Corrected patch is attached.

On 13 July 2011 12:48, Uros Bizjak <ubizjak@gmail.com> wrote:
> Hello!
>
>> Please don't use -m32/-m64 in testcases directly.
>> You should use
>>
>> /* { dg-do compile { target { ! ia32 } } } */
>>
>> for 64bit insns and
>>
>> /* { dg-do compile { target { ia32 } } } */
>>
>> for 32bit insns.
>
> Also, there is no need to add -mtune if -march is already specified.
> -mtune will follow -march.
> To scan for the %xmm register, you don't have to add -dp to compile
> flags. -dp will also dump the pattern name to the output file, so unless
> you are looking for a specific pattern name, you should omit -dp.
>
> Uros.
>

[-- Attachment #2: memfunc.patch --]
[-- Type: application/octet-stream, Size: 154869 bytes --]

diff --git a/gcc/builtins.c b/gcc/builtins.c
index 1ee8cf8..40d6baa 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -3564,7 +3564,8 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
 				  builtin_memset_read_str, &c, dest_align,
 				  true))
 	store_by_pieces (dest_mem, tree_low_cst (len, 1),
-			 builtin_memset_read_str, &c, dest_align, true, 0);
+			 builtin_memset_read_str, gen_int_mode (c, val_mode),
+			 dest_align, true, 0);
       else if (!set_storage_via_setmem (dest_mem, len_rtx,
 					gen_int_mode (c, val_mode),
 					dest_align, expected_align,
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index a46101b..a4043f5 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
   COSTS_N_BYTES (2),			/* cost of FABS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FCHS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -632,10 +636,14 @@ struct processor_costs i386_cost = {	/* 386 specific costs */
   COSTS_N_INSNS (22),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (24),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (122),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -704,10 +712,14 @@ struct processor_costs i486_cost = {	/* 486 specific costs */
   COSTS_N_INSNS (3),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (3),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (83),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -774,10 +786,14 @@ struct processor_costs pentium_cost = {
   COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (70),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{-1, rep_prefix_4_byte}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -849,12 +865,18 @@ struct processor_costs pentiumpro_cost = {
      noticeable win, for bigger blocks either rep movsl or rep movsb is
      way to go.  Rep movsb has apparently more expensive startup time in CPU,
      but after 4K the difference is down in the noise.  */
-  {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
+  {{{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
 			{8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{rep_prefix_4_byte, {{1024, unrolled_loop},
-  			{8192, rep_prefix_4_byte}, {-1, libcall}}},
+   {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{rep_prefix_4_byte, {{1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{rep_prefix_4_byte, {{1024, unrolled_loop},
+			{8192, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -922,10 +944,14 @@ struct processor_costs geode_cost = {
   COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (54),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -995,10 +1021,14 @@ struct processor_costs k6_cost = {
   COSTS_N_INSNS (2),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (2),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (56),			/* cost of FSQRT instruction.  */
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1068,10 +1098,14 @@ struct processor_costs athlon_cost = {
   /* For some reason, Athlon deals better with REP prefix (relative to loops)
      compared to K8. Alignment becomes important after 8 bytes for memcpy and
      128 bytes for memset.  */
-  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  {{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1146,11 +1180,16 @@ struct processor_costs k8_cost = {
   /* K8 has optimized REP instruction for medium sized blocks, but for very
      small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1233,11 +1272,16 @@ struct processor_costs amdfam10_cost = {
   /* AMDFAM10 has optimized REP instruction for medium sized blocks, but for
      very small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1320,11 +1364,16 @@ struct processor_costs bdver1_cost = {
   /*  BDVER1 has optimized REP instruction for medium sized blocks, but for
       very small blocks it is better to use loop. For large blocks, libcall
       can do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   6,					/* scalar_stmt_cost.  */
   4,					/* scalar load_cost.  */
   4,					/* scalar_store_cost.  */
@@ -1402,11 +1451,16 @@ struct processor_costs btver1_cost = {
   /* BTVER1 has optimized REP instruction for medium sized blocks, but for
      very small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
-  {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+  {{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
    {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {24, unrolled_loop},
+   {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
+  {{{libcall, {{8, loop}, {24, unrolled_loop},
 	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
    {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{8, loop}, {24, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   4,					/* scalar_stmt_cost.  */
   2,					/* scalar load_cost.  */
   2,					/* scalar_store_cost.  */
@@ -1473,11 +1527,18 @@ struct processor_costs pentium4_cost = {
   COSTS_N_INSNS (2),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (2),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (43),			/* cost of FSQRT instruction.  */
-  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+
+  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+   DUMMY_STRINGOP_ALGS}},
+
+  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
    {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1544,13 +1605,22 @@ struct processor_costs nocona_cost = {
   COSTS_N_INSNS (3),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (3),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (44),			/* cost of FSQRT instruction.  */
-  {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+
+  {{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
    {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
 	      {100000, unrolled_loop}, {-1, libcall}}}},
-  {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
+	      {100000, unrolled_loop}, {-1, libcall}}}}},
+
+  {{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
    {-1, libcall}}},
    {libcall, {{24, loop}, {64, unrolled_loop},
 	      {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+   {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
+   {-1, libcall}}},
+   {libcall, {{24, loop}, {64, unrolled_loop},
+	      {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1617,13 +1687,20 @@ struct processor_costs atom_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{11, loop}, {-1, rep_prefix_4_byte}}},
-   {libcall, {{32, loop}, {64, rep_prefix_4_byte},
-	  {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {{libcall, {{8, loop}, {15, unrolled_loop},
-	  {2048, rep_prefix_4_byte}, {-1, libcall}}},
-   {libcall, {{24, loop}, {32, unrolled_loop},
-	  {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+
+  /* stringop_algs for memcpy.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
+    {libcall, {{4096, unrolled_loop}, {-1, libcall}}}},
+   {{libcall, {{-1, libcall}}},			       /* Unknown alignment.  */
+    {libcall, {{-1, libcall}}}}},
+
+  /* stringop_algs for memset.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
+    {libcall, {{4096, unrolled_loop}, {-1, libcall}}}},
+   {{libcall, {{1024, unrolled_loop},		       /* Unknown alignment.  */
+	       {-1, libcall}}},
+    {libcall, {{2048, unrolled_loop},
+	       {-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1697,10 +1774,16 @@ struct processor_costs generic64_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {DUMMY_STRINGOP_ALGS,
-   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
-  {DUMMY_STRINGOP_ALGS,
-   {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+
+  {{DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}},
+   {DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}}},
+
+  {{DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}},
+   {DUMMY_STRINGOP_ALGS,
+   {libcall, {{-1, libcall}}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -1769,10 +1852,16 @@ struct processor_costs generic32_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+  /* stringop_algs for memcpy.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
-  {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
+   {{libcall, {{-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
+  /* stringop_algs for memset.  */
+  {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}},
    DUMMY_STRINGOP_ALGS},
+   {{libcall, {{-1, libcall}}},
+   DUMMY_STRINGOP_ALGS}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
@@ -2451,6 +2540,7 @@ static void ix86_set_current_function (tree);
 static unsigned int ix86_minimum_incoming_stack_boundary (bool);
 
 static enum calling_abi ix86_function_abi (const_tree);
+static rtx promote_duplicated_reg (enum machine_mode, rtx);
 
 \f
 #ifndef SUBTARGET32_DEFAULT_CPU
@@ -14952,6 +15042,28 @@ ix86_expand_move (enum machine_mode mode, rtx operands[])
     }
   else
     {
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE2
+	  && MEM_P (op0)
+	  && MEM_P (op1)
+	  && !push_operand (op0, mode)
+	  && can_create_pseudo_p ())
+	{
+	  rtx temp = gen_reg_rtx (V2DImode);
+	  emit_insn (gen_sse2_loadq (temp, op1));
+	  emit_insn (gen_sse_storeq (op0, temp));
+	  return;
+	}
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE
+	  && !MEM_P (op1)
+	  && GET_MODE (op1) == V2DImode)
+	{
+	  emit_insn (gen_sse_storeq (op0, op1));
+	  return;
+	}
       if (MEM_P (op0)
 	  && (PUSH_ROUNDING (GET_MODE_SIZE (mode)) != GET_MODE_SIZE (mode)
 	      || !push_operand (op0, mode))
@@ -19470,22 +19582,17 @@ counter_mode (rtx count_exp)
   return SImode;
 }
 
-/* When SRCPTR is non-NULL, output simple loop to move memory
-   pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
-   overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
-   equivalent loop to set memory by VALUE (supposed to be in MODE).
-
-   The size is rounded down to whole number of chunk size moved at once.
-   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
-
-
-static void
-expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
-			       rtx destptr, rtx srcptr, rtx value,
-			       rtx count, enum machine_mode mode, int unroll,
-			       int expected_size)
+/* Helper function for expand_set_or_movmem_via_loop.
+   This function can reuse iter rtx from another loop and don't generate
+   code for updating the addresses.  */
+static rtx
+expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
+					 rtx destptr, rtx srcptr, rtx value,
+					 rtx count, rtx iter,
+					 enum machine_mode mode, int unroll,
+					 int expected_size, bool change_ptrs)
 {
-  rtx out_label, top_label, iter, tmp;
+  rtx out_label, top_label, tmp;
   enum machine_mode iter_mode = counter_mode (count);
   rtx piece_size = GEN_INT (GET_MODE_SIZE (mode) * unroll);
   rtx piece_size_mask = GEN_INT (~((GET_MODE_SIZE (mode) * unroll) - 1));
@@ -19493,10 +19600,12 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
   rtx x_addr;
   rtx y_addr;
   int i;
+  bool reuse_iter = (iter != NULL_RTX);
 
   top_label = gen_label_rtx ();
   out_label = gen_label_rtx ();
-  iter = gen_reg_rtx (iter_mode);
+  if (!reuse_iter)
+    iter = gen_reg_rtx (iter_mode);
 
   size = expand_simple_binop (iter_mode, AND, count, piece_size_mask,
 			      NULL, 1, OPTAB_DIRECT);
@@ -19507,7 +19616,8 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
 			       true, out_label);
       predict_jump (REG_BR_PROB_BASE * 10 / 100);
     }
-  emit_move_insn (iter, const0_rtx);
+  if (!reuse_iter)
+    emit_move_insn (iter, const0_rtx);
 
   emit_label (top_label);
 
@@ -19590,19 +19700,43 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
     }
   else
     predict_jump (REG_BR_PROB_BASE * 80 / 100);
-  iter = ix86_zero_extend_to_Pmode (iter);
-  tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
-			     true, OPTAB_LIB_WIDEN);
-  if (tmp != destptr)
-    emit_move_insn (destptr, tmp);
-  if (srcptr)
+  if (change_ptrs)
     {
-      tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
+      iter = ix86_zero_extend_to_Pmode (iter);
+      tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
 				 true, OPTAB_LIB_WIDEN);
-      if (tmp != srcptr)
-	emit_move_insn (srcptr, tmp);
+      if (tmp != destptr)
+	emit_move_insn (destptr, tmp);
+      if (srcptr)
+	{
+	  tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
+				     true, OPTAB_LIB_WIDEN);
+	  if (tmp != srcptr)
+	    emit_move_insn (srcptr, tmp);
+	}
     }
   emit_label (out_label);
+  return iter;
+}
+
+/* When SRCPTR is non-NULL, output simple loop to move memory
+   pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
+   overall size is COUNT specified in bytes.  When SRCPTR is NULL, output the
+   equivalent loop to set memory by VALUE (supposed to be in MODE).
+
+   The size is rounded down to whole number of chunk size moved at once.
+   SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.  */
+
+static void
+expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
+			       rtx destptr, rtx srcptr, rtx value,
+			       rtx count, enum machine_mode mode, int unroll,
+			       int expected_size)
+{
+  expand_set_or_movmem_via_loop_with_iter (destmem, srcmem,
+				 destptr, srcptr, value,
+				 count, NULL_RTX, mode, unroll,
+				 expected_size, true);
 }
 
 /* Output "rep; mov" instruction.
@@ -19704,7 +19838,27 @@ emit_strmov (rtx destmem, rtx srcmem,
   emit_insn (gen_strmov (destptr, dest, srcptr, src));
 }
 
-/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST.  */
+/* Emit a strset instruction.  If RHS is constant, and a vector mode will be used,
+   then move this constant to a vector register before emitting strset.  */
+static void
+emit_strset (rtx destmem, rtx value,
+	     rtx destptr, enum machine_mode mode, int offset)
+{
+  rtx dest = adjust_automodify_address_nv (destmem, mode, destptr, offset);
+  rtx vec_reg;
+  if (vector_extensions_used_for_mode (mode) && CONSTANT_P (value))
+    {
+      if (mode == DImode)
+	mode = TARGET_64BIT ? V2DImode : V4SImode;
+      vec_reg = gen_reg_rtx (mode);
+      emit_move_insn (vec_reg, value);
+      emit_insn (gen_strset (destptr, dest, vec_reg));
+    }
+  else
+    emit_insn (gen_strset (destptr, dest, value));
+}
+
+/* Output code to copy (count % max_size) bytes from SRC to DEST.  */
 static void
 expand_movmem_epilogue (rtx destmem, rtx srcmem,
 			rtx destptr, rtx srcptr, rtx count, int max_size)
@@ -19715,43 +19869,55 @@ expand_movmem_epilogue (rtx destmem, rtx srcmem,
       HOST_WIDE_INT countval = INTVAL (count);
       int offset = 0;
 
-      if ((countval & 0x10) && max_size > 16)
+      int remainder_size = countval % max_size;
+      enum machine_mode move_mode = Pmode;
+
+      /* Firstly, try to move data with the widest possible mode.
+	 Remaining part we'll move using Pmode and narrower modes.  */
+      if (TARGET_SSE)
 	{
-	  if (TARGET_64BIT)
-	    {
-	      emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
-	      emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset + 8);
-	    }
-	  else
-	    gcc_unreachable ();
-	  offset += 16;
+	  if (max_size >= GET_MODE_SIZE (V4SImode))
+	    move_mode = V4SImode;
+	  else if (max_size >= GET_MODE_SIZE (DImode))
+	    move_mode = DImode;
 	}
-      if ((countval & 0x08) && max_size > 8)
+
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-	  if (TARGET_64BIT)
-	    emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
-	  else
-	    {
-	      emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
-	      emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset + 4);
-	    }
-	  offset += 8;
+	  emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
 	}
-      if ((countval & 0x04) && max_size > 4)
+
+      /* Move the remaining part of epilogue - its size might be
+	 a size of the widest mode.  */
+      move_mode = Pmode;
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
+	}
+
+      if (remainder_size >= 4)
+	{
+	  emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
 	  offset += 4;
+	  remainder_size -= 4;
 	}
-      if ((countval & 0x02) && max_size > 2)
+      if (remainder_size >= 2)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
 	  offset += 2;
+	  remainder_size -= 2;
 	}
-      if ((countval & 0x01) && max_size > 1)
+      if (remainder_size >= 1)
 	{
-          emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
+	  emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
 	  offset += 1;
+	  remainder_size -= 1;
 	}
+      gcc_assert (remainder_size == 0);
       return;
     }
   if (max_size > 8)
@@ -19857,87 +20023,122 @@ expand_setmem_epilogue_via_loop (rtx destmem, rtx destptr, rtx value,
 				 1, max_size / 2);
 }
 
-/* Output code to set at most count & (max_size - 1) bytes starting by DEST.  */
+/* Output code to set at most count & (max_size - 1) bytes starting by
+   DESTMEM.  */
 static void
-expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_size)
+expand_setmem_epilogue (rtx destmem, rtx destptr, rtx promoted_to_vector_value,
+			rtx value, rtx count, int max_size)
 {
-  rtx dest;
-
   if (CONST_INT_P (count))
     {
       HOST_WIDE_INT countval = INTVAL (count);
       int offset = 0;
 
-      if ((countval & 0x10) && max_size > 16)
+      int remainder_size = countval % max_size;
+      enum machine_mode move_mode = Pmode;
+      enum machine_mode sse_mode = TARGET_64BIT ? V2DImode : V4SImode;
+      rtx promoted_value = NULL_RTX;
+
+      /* Firstly, try to move data with the widest possible mode.
+	 Remaining part we'll move using Pmode and narrower modes.  */
+      if (TARGET_SSE)
 	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset + 8);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    gcc_unreachable ();
-	  offset += 16;
+	  if (max_size >= GET_MODE_SIZE (sse_mode))
+	    move_mode = sse_mode;
+	  else if (max_size >= GET_MODE_SIZE (DImode))
+	    move_mode = DImode;
+	  if (!VECTOR_MODE_P (GET_MODE (promoted_to_vector_value)))
+	    promoted_to_vector_value = NULL_RTX;
 	}
-      if ((countval & 0x08) && max_size > 8)
+
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
 	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    {
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset + 4);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  offset += 8;
+	  if (GET_MODE (destmem) != move_mode)
+	    destmem = change_address (destmem, move_mode, destptr);
+	  if (!promoted_to_vector_value)
+	    promoted_to_vector_value =
+	      targetm.promote_rtx_for_memset (move_mode, value);
+	  emit_strset (destmem, promoted_to_vector_value, destptr,
+		       move_mode, offset);
+
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
 	}
-      if ((countval & 0x04) && max_size > 4)
+
+      /* Move the remaining part of epilogue - its size might be
+	 a size of the widest mode.  */
+      move_mode = Pmode;
+      promoted_value = NULL_RTX;
+      while (remainder_size >= GET_MODE_SIZE (move_mode))
+	{
+	  if (!promoted_value)
+	    promoted_value = promote_duplicated_reg (move_mode, value);
+	  emit_strset (destmem, promoted_value, destptr, move_mode, offset);
+	  offset += GET_MODE_SIZE (move_mode);
+	  remainder_size -= GET_MODE_SIZE (move_mode);
+	}
+
+      if (!promoted_value)
+	promoted_value = promote_duplicated_reg (move_mode, value);
+      if (remainder_size >= 4)
 	{
-	  dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
+	  emit_strset (destmem, gen_lowpart (SImode, promoted_value), destptr,
+		       SImode, offset);
 	  offset += 4;
+	  remainder_size -= 4;
 	}
-      if ((countval & 0x02) && max_size > 2)
+      if (remainder_size >= 2)
 	{
-	  dest = adjust_automodify_address_nv (destmem, HImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
-	  offset += 2;
+	  emit_strset (destmem, gen_lowpart (HImode, promoted_value), destptr,
+		       HImode, offset);
+	  offset +=2;
+	  remainder_size -= 2;
 	}
-      if ((countval & 0x01) && max_size > 1)
+      if (remainder_size >= 1)
 	{
-	  dest = adjust_automodify_address_nv (destmem, QImode, destptr, offset);
-	  emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
+	  emit_strset (destmem, gen_lowpart (QImode, promoted_value), destptr,
+		       QImode, offset);
 	  offset += 1;
+	  remainder_size -= 1;
 	}
+      gcc_assert (remainder_size == 0);
       return;
     }
+
+  /* count isn't const.  */
   if (max_size > 32)
     {
-      expand_setmem_epilogue_via_loop (destmem, destptr, value, count, max_size);
+      expand_setmem_epilogue_via_loop (destmem, destptr, value, count,
+				       max_size);
       return;
     }
+  /* If it turned out, that we promoted value to non-vector register, we can
+     reuse it.  */
+  if (!VECTOR_MODE_P (GET_MODE (promoted_to_vector_value)))
+    value = promoted_to_vector_value;
+
   if (max_size > 16)
     {
       rtx label = ix86_expand_aligntest (count, 16, true);
       if (TARGET_64BIT)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
@@ -19947,14 +20148,17 @@ expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_
       rtx label = ix86_expand_aligntest (count, 8, true);
       if (TARGET_64BIT)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
 	}
       emit_label (label);
       LABEL_NUSES (label) = 1;
@@ -19962,24 +20166,24 @@ expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_
   if (max_size > 4)
     {
       rtx label = ix86_expand_aligntest (count, 4, true);
-      dest = change_address (destmem, SImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
+      destmem = change_address (destmem, SImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
   if (max_size > 2)
     {
       rtx label = ix86_expand_aligntest (count, 2, true);
-      dest = change_address (destmem, HImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
+      destmem = change_address (destmem, HImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (HImode, value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
   if (max_size > 1)
     {
       rtx label = ix86_expand_aligntest (count, 1, true);
-      dest = change_address (destmem, QImode, destptr);
-      emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
+      destmem = change_address (destmem, QImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (QImode, value)));
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
@@ -20022,7 +20226,27 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
-  gcc_assert (desired_alignment <= 8);
+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  srcmem = change_address (srcmem, DImode, srcptr);
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	}
+      else
+	{
+	  srcmem = change_address (srcmem, SImode, srcptr);
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	  emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
+	}
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);
 }
 
 /* Copy enough from DST to SRC to align DST known to DESIRED_ALIGN.
@@ -20078,6 +20302,37 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
       off = 4;
       emit_insn (gen_strmov (destreg, dst, srcreg, src));
     }
+  if (align_bytes & 8)
+    {
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, DImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      else
+	{
+	  dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, SImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
+	set_mem_align (dst, 8 * BITS_PER_UNIT);
+      if (src_align_bytes >= 0)
+	{
+	  unsigned int src_align = 0;
+	  if ((src_align_bytes & 7) == (align_bytes & 7))
+	    src_align = 8;
+	  else if ((src_align_bytes & 3) == (align_bytes & 3))
+	    src_align = 4;
+	  else if ((src_align_bytes & 1) == (align_bytes & 1))
+	    src_align = 2;
+	  if (MEM_ALIGN (src) < src_align * BITS_PER_UNIT)
+	    set_mem_align (src, src_align * BITS_PER_UNIT);
+	}
+      off = 8;
+    }
   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
   src = adjust_automodify_address_nv (src, BLKmode, srcreg, off);
   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
@@ -20137,7 +20392,17 @@ expand_setmem_prologue (rtx destmem, rtx destptr, rtx value, rtx count,
       emit_label (label);
       LABEL_NUSES (label) = 1;
     }
-  gcc_assert (desired_alignment <= 8);
+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      destmem = change_address (destmem, SImode, destptr);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);
 }
 
 /* Set enough from DST to align DST known to by aligned by ALIGN to
@@ -20173,6 +20438,19 @@ expand_constant_setmem_prologue (rtx dst, rtx destreg, rtx value,
       emit_insn (gen_strset (destreg, dst,
 			     gen_lowpart (SImode, value)));
     }
+  if (align_bytes & 8)
+    {
+      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+      emit_insn (gen_strset (destreg, dst,
+	    gen_lowpart (SImode, value)));
+      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+      emit_insn (gen_strset (destreg, dst,
+	    gen_lowpart (SImode, value)));
+
+      if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
+	set_mem_align (dst, 8 * BITS_PER_UNIT);
+      off = 8;
+    }
   dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
   if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
     set_mem_align (dst, desired_align * BITS_PER_UNIT);
@@ -20184,7 +20462,7 @@ expand_constant_setmem_prologue (rtx dst, rtx destreg, rtx value,
 /* Given COUNT and EXPECTED_SIZE, decide on codegen of string operation.  */
 static enum stringop_alg
 decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
-	    int *dynamic_check)
+	    int *dynamic_check, bool align_unknown)
 {
   const struct stringop_algs * algs;
   bool optimize_for_speed;
@@ -20193,7 +20471,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
      consider such algorithms if the user has appropriated those
      registers for their own purposes.	*/
   bool rep_prefix_usable = !(fixed_regs[CX_REG] || fixed_regs[DI_REG]
-                             || (memset
+			     || (memset
 				 ? fixed_regs[AX_REG] : fixed_regs[SI_REG]));
 
 #define ALG_USABLE_P(alg) (rep_prefix_usable			\
@@ -20206,7 +20484,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
      of time processing large blocks.  */
   if (optimize_function_for_size_p (cfun)
       || (optimize_insn_for_size_p ()
-          && expected_size != -1 && expected_size < 256))
+	  && expected_size != -1 && expected_size < 256))
     optimize_for_speed = false;
   else
     optimize_for_speed = true;
@@ -20215,9 +20493,9 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
 
   *dynamic_check = -1;
   if (memset)
-    algs = &cost->memset[TARGET_64BIT != 0];
+    algs = &cost->memset[align_unknown][TARGET_64BIT != 0];
   else
-    algs = &cost->memcpy[TARGET_64BIT != 0];
+    algs = &cost->memcpy[align_unknown][TARGET_64BIT != 0];
   if (ix86_stringop_alg != no_stringop && ALG_USABLE_P (ix86_stringop_alg))
     return ix86_stringop_alg;
   /* rep; movq or rep; movl is the smallest variant.  */
@@ -20281,29 +20559,33 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
       enum stringop_alg alg;
       int i;
       bool any_alg_usable_p = true;
+      bool only_libcall_fits = true;
 
       for (i = 0; i < MAX_STRINGOP_ALGS; i++)
-        {
-          enum stringop_alg candidate = algs->size[i].alg;
-          any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
+	{
+	  enum stringop_alg candidate = algs->size[i].alg;
+	  any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
 
-          if (candidate != libcall && candidate
-              && ALG_USABLE_P (candidate))
-              max = algs->size[i].max;
-        }
+	  if (candidate != libcall && candidate
+	      && ALG_USABLE_P (candidate))
+	    {
+	      max = algs->size[i].max;
+	      only_libcall_fits = false;
+	    }
+	}
       /* If there aren't any usable algorithms, then recursing on
-         smaller sizes isn't going to find anything.  Just return the
-         simple byte-at-a-time copy loop.  */
-      if (!any_alg_usable_p)
-        {
-          /* Pick something reasonable.  */
-          if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
-            *dynamic_check = 128;
-          return loop_1_byte;
-        }
+	 smaller sizes isn't going to find anything.  Just return the
+	 simple byte-at-a-time copy loop.  */
+      if (!any_alg_usable_p || only_libcall_fits)
+	{
+	  /* Pick something reasonable.  */
+	  if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
+	    *dynamic_check = 128;
+	  return loop_1_byte;
+	}
       if (max == -1)
 	max = 4096;
-      alg = decide_alg (count, max / 2, memset, dynamic_check);
+      alg = decide_alg (count, max / 2, memset, dynamic_check, align_unknown);
       gcc_assert (*dynamic_check == -1);
       gcc_assert (alg != libcall);
       if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
@@ -20327,9 +20609,11 @@ decide_alignment (int align,
       case no_stringop:
 	gcc_unreachable ();
       case loop:
-      case unrolled_loop:
 	desired_align = GET_MODE_SIZE (Pmode);
 	break;
+      case unrolled_loop:
+	desired_align = GET_MODE_SIZE (TARGET_SSE ? V4SImode : Pmode);
+	break;
       case rep_prefix_8_byte:
 	desired_align = 8;
 	break;
@@ -20417,6 +20701,11 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
   enum stringop_alg alg;
   int dynamic_check;
   bool need_zero_guard = false;
+  bool align_unknown;
+  int unroll_factor;
+  enum machine_mode move_mode;
+  rtx loop_iter = NULL_RTX;
+  int dst_offset, src_offset;
 
   if (CONST_INT_P (align_exp))
     align = INTVAL (align_exp);
@@ -20440,9 +20729,17 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 
   /* Step 0: Decide on preferred algorithm, desired alignment and
      size of chunks to be copied by main loop.  */
-
-  alg = decide_alg (count, expected_size, false, &dynamic_check);
+  dst_offset = get_mem_align_offset (dst, MOVE_MAX*BITS_PER_UNIT);
+  src_offset = get_mem_align_offset (src, MOVE_MAX*BITS_PER_UNIT);
+  align_unknown = (dst_offset < 0
+		   || src_offset < 0
+		   || src_offset != dst_offset);
+  alg = decide_alg (count, expected_size, false, &dynamic_check, align_unknown);
   desired_align = decide_alignment (align, alg, expected_size);
+  if (align_unknown)
+    desired_align = align;
+  unroll_factor = 1;
+  move_mode = Pmode;
 
   if (!TARGET_ALIGN_STRINGOPS)
     align = desired_align;
@@ -20461,11 +20758,16 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
       gcc_unreachable ();
     case loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode);
+      move_mode = Pmode;
+      unroll_factor = 1;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * (TARGET_64BIT ? 4 : 2);
+      /* Use SSE instructions, if possible.  */
+      move_mode = TARGET_SSE ? (align_unknown ? DImode : V4SImode) : Pmode;
+      unroll_factor = 4;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
       size_needed = 8;
@@ -20634,11 +20936,14 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
 				     count_exp, Pmode, 1, expected_size);
       break;
     case unrolled_loop:
-      /* Unroll only by factor of 2 in 32bit mode, since we don't have enough
-	 registers for 4 temporaries anyway.  */
-      expand_set_or_movmem_via_loop (dst, src, destreg, srcreg, NULL,
-				     count_exp, Pmode, TARGET_64BIT ? 4 : 2,
-				     expected_size);
+      /* In some cases we want to use the same iterator in several adjacent
+	 loops, so here we save loop iterator rtx and don't update addresses.  */
+      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
+							   srcreg, NULL,
+							   count_exp, NULL_RTX,
+							   move_mode,
+							   unroll_factor,
+							   expected_size, false);
       break;
     case rep_prefix_8_byte:
       expand_movmem_via_rep_mov (dst, src, destreg, srcreg, count_exp,
@@ -20689,9 +20994,50 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
       LABEL_NUSES (label) = 1;
     }
 
+  /* We haven't updated addresses, so we'll do it now.
+     Also, if the epilogue seems to be big, we'll generate a loop (not
+     unrolled) in it.  We'll do it only if alignment is unknown, because in
+     this case in epilogue we have to perform memmove by bytes, which is very
+     slow.  */
+  if (alg == unrolled_loop)
+    {
+      rtx tmp;
+      if (align_unknown && unroll_factor > 1)
+	{
+	  /* Reduce epilogue's size by creating not-unrolled loop.  If we won't
+	     do this, we can have very big epilogue - when alignment is statically
+	     unknown we'll have the epilogue byte by byte which may be very slow.  */
+	  rtx epilogue_loop_jump_around = gen_label_rtx ();
+	  rtx tmp = plus_constant (loop_iter, GET_MODE_SIZE (move_mode));
+	  emit_cmp_and_jump_insns (count_exp, tmp, LT, NULL_RTX,
+				   counter_mode (count_exp), true,
+				   epilogue_loop_jump_around);
+	  predict_jump (REG_BR_PROB_BASE * 10 / 100);
+	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
+	      srcreg, NULL, count_exp,
+	      loop_iter, move_mode, 1,
+	      expected_size, false);
+	  emit_label (epilogue_loop_jump_around);
+	  src = change_address (src, BLKmode, srcreg);
+	  dst = change_address (dst, BLKmode, destreg);
+	  epilogue_size_needed = GET_MODE_SIZE (move_mode);
+	}
+      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != destreg)
+	emit_move_insn (destreg, tmp);
+
+      tmp = expand_simple_binop (Pmode, PLUS, srcreg, loop_iter, srcreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != srcreg)
+	emit_move_insn (srcreg, tmp);
+    }
   if (count_exp != const0_rtx && epilogue_size_needed > 1)
-    expand_movmem_epilogue (dst, src, destreg, srcreg, count_exp,
-			    epilogue_size_needed);
+    {
+      expand_movmem_epilogue (dst, src, destreg, srcreg, count_exp,
+			      epilogue_size_needed);
+    }
+
   if (jump_around_label)
     emit_label (jump_around_label);
   return true;
@@ -20709,7 +21055,37 @@ promote_duplicated_reg (enum machine_mode mode, rtx val)
   rtx tmp;
   int nops = mode == DImode ? 3 : 2;
 
+  if (VECTOR_MODE_P (mode))
+    {
+      enum machine_mode inner = GET_MODE_INNER (mode);
+      rtx promoted_val, vec_reg;
+      if (CONST_INT_P (val))
+	return ix86_build_const_vector (mode, true, val);
+
+      promoted_val = promote_duplicated_reg (inner, val);
+      vec_reg = gen_reg_rtx (mode);
+      switch (mode)
+	{
+	case V2DImode:
+	  emit_insn (gen_vec_dupv2di (vec_reg, promoted_val));
+	  break;
+	case V4SImode:
+	  emit_insn (gen_vec_dupv4si (vec_reg, promoted_val));
+	  break;
+	default:
+	  gcc_unreachable ();
+	  break;
+	}
+
+      return vec_reg;
+    }
   gcc_assert (mode == SImode || mode == DImode);
+  if (mode == DImode && !TARGET_64BIT)
+    {
+      rtx vec_reg = promote_duplicated_reg (V4SImode, val);
+      vec_reg = convert_to_mode (V2DImode, vec_reg, 1);
+      return vec_reg;
+    }
   if (val == const0_rtx)
     return copy_to_mode_reg (mode, const0_rtx);
   if (CONST_INT_P (val))
@@ -20775,11 +21151,21 @@ promote_duplicated_reg (enum machine_mode mode, rtx val)
 static rtx
 promote_duplicated_reg_to_size (rtx val, int size_needed, int desired_align, int align)
 {
-  rtx promoted_val;
+  rtx promoted_val = NULL_RTX;
 
-  if (TARGET_64BIT
-      && (size_needed > 4 || (desired_align > align && desired_align > 4)))
-    promoted_val = promote_duplicated_reg (DImode, val);
+  if (size_needed > 8 || (desired_align > align && desired_align > 8))
+    {
+      gcc_assert (TARGET_SSE);
+      if (TARGET_64BIT)
+        promoted_val = promote_duplicated_reg (V2DImode, val);
+      else
+        promoted_val = promote_duplicated_reg (V4SImode, val);
+    }
+  else if (size_needed > 4 || (desired_align > align && desired_align > 4))
+    {
+      gcc_assert (TARGET_64BIT || TARGET_SSE);
+      promoted_val = promote_duplicated_reg (DImode, val);
+    }
   else if (size_needed > 2 || (desired_align > align && desired_align > 2))
     promoted_val = promote_duplicated_reg (SImode, val);
   else if (size_needed > 1 || (desired_align > align && desired_align > 1))
@@ -20805,12 +21191,17 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
   unsigned HOST_WIDE_INT count = 0;
   HOST_WIDE_INT expected_size = -1;
   int size_needed = 0, epilogue_size_needed;
+  int promote_size_needed = 0;
   int desired_align = 0, align_bytes = 0;
   enum stringop_alg alg;
   rtx promoted_val = NULL;
-  bool force_loopy_epilogue = false;
+  rtx vec_promoted_val = NULL;
   int dynamic_check;
   bool need_zero_guard = false;
+  bool align_unknown;
+  unsigned int unroll_factor;
+  enum machine_mode move_mode;
+  rtx loop_iter = NULL_RTX;
 
   if (CONST_INT_P (align_exp))
     align = INTVAL (align_exp);
@@ -20830,8 +21221,11 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
   /* Step 0: Decide on preferred algorithm, desired alignment and
      size of chunks to be copied by main loop.  */
 
-  alg = decide_alg (count, expected_size, true, &dynamic_check);
+  align_unknown = get_mem_align_offset (dst, BITS_PER_UNIT) < 0;
+  alg = decide_alg (count, expected_size, true, &dynamic_check, align_unknown);
   desired_align = decide_alignment (align, alg, expected_size);
+  unroll_factor = 1;
+  move_mode = Pmode;
 
   if (!TARGET_ALIGN_STRINGOPS)
     align = desired_align;
@@ -20849,11 +21243,21 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       gcc_unreachable ();
     case loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode);
+      move_mode = Pmode;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * 4;
+      /* Use SSE instructions, if possible.  */
+      move_mode = TARGET_SSE
+		  ? (TARGET_64BIT ? V2DImode : V4SImode)
+		  : Pmode;
+      unroll_factor = 1;
+      /* Select maximal available 1,2 or 4 unroll factor.  */
+      while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
+	     && unroll_factor < 4)
+	unroll_factor *= 2;
+      size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
       size_needed = 8;
@@ -20870,6 +21274,7 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       break;
     }
   epilogue_size_needed = size_needed;
+  promote_size_needed = GET_MODE_SIZE (Pmode);
 
   /* Step 1: Prologue guard.  */
 
@@ -20898,8 +21303,10 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
      main loop and epilogue (ie one load of the big constant in the
      front of all code.  */
   if (CONST_INT_P (val_exp))
-    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
-						   desired_align, align);
+    promoted_val = promote_duplicated_reg_to_size (val_exp,
+						   promote_size_needed,
+						   promote_size_needed,
+						   align);
   /* Ensure that alignment prologue won't copy past end of block.  */
   if (size_needed > 1 || (desired_align > 1 && desired_align > align))
     {
@@ -20908,12 +21315,6 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 	 Make sure it is power of 2.  */
       epilogue_size_needed = smallest_pow2_greater_than (epilogue_size_needed);
 
-      /* To improve performance of small blocks, we jump around the VAL
-	 promoting mode.  This mean that if the promoted VAL is not constant,
-	 we might not use it in the epilogue and have to use byte
-	 loop variant.  */
-      if (epilogue_size_needed > 2 && !promoted_val)
-        force_loopy_epilogue = true;
       if (count)
 	{
 	  if (count < (unsigned HOST_WIDE_INT)epilogue_size_needed)
@@ -20954,8 +21355,10 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 
   /* Do the expensive promotion once we branched off the small blocks.  */
   if (!promoted_val)
-    promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
-						   desired_align, align);
+    promoted_val = promote_duplicated_reg_to_size (val_exp,
+						   promote_size_needed,
+						   promote_size_needed,
+						   align);
   gcc_assert (desired_align >= 1 && align >= 1);
 
   if (desired_align > align)
@@ -20978,6 +21381,8 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 						 desired_align, align_bytes);
 	  count_exp = plus_constant (count_exp, -align_bytes);
 	  count -= align_bytes;
+	  if (count < (unsigned HOST_WIDE_INT) size_needed)
+	    goto epilogue;
 	}
       if (need_zero_guard
 	  && (count < (unsigned HOST_WIDE_INT) size_needed
@@ -21019,7 +21424,7 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
     case no_stringop:
       gcc_unreachable ();
     case loop_1_byte:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
+      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, val_exp,
 				     count_exp, QImode, 1, expected_size);
       break;
     case loop:
@@ -21027,8 +21432,14 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
 				     count_exp, Pmode, 1, expected_size);
       break;
     case unrolled_loop:
-      expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
-				     count_exp, Pmode, 4, expected_size);
+      vec_promoted_val =
+	promote_duplicated_reg_to_size (promoted_val,
+					GET_MODE_SIZE (move_mode),
+					desired_align, align);
+      loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+				     NULL, vec_promoted_val, count_exp,
+				     NULL_RTX, move_mode, unroll_factor,
+				     expected_size, false);
       break;
     case rep_prefix_8_byte:
       expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
@@ -21072,15 +21483,36 @@ ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
       LABEL_NUSES (label) = 1;
     }
  epilogue:
-  if (count_exp != const0_rtx && epilogue_size_needed > 1)
+  if (alg == unrolled_loop)
     {
-      if (force_loopy_epilogue)
-	expand_setmem_epilogue_via_loop (dst, destreg, val_exp, count_exp,
-					 epilogue_size_needed);
-      else
-	expand_setmem_epilogue (dst, destreg, promoted_val, count_exp,
-				epilogue_size_needed);
+      rtx tmp;
+      if (align_unknown && unroll_factor > 1)
+	{
+	  /* Reduce the epilogue's size by emitting a non-unrolled loop here.
+	     Without it the epilogue could be very big: when the alignment is
+	     statically unknown the epilogue would be expanded byte by byte,
+	     which may be very slow.  */
+	  rtx epilogue_loop_jump_around = gen_label_rtx ();
+	  rtx tmp = plus_constant (loop_iter, GET_MODE_SIZE (move_mode));
+	  emit_cmp_and_jump_insns (count_exp, tmp, LT, NULL_RTX,
+				   counter_mode (count_exp), true,
+				   epilogue_loop_jump_around);
+	  predict_jump (REG_BR_PROB_BASE * 10 / 100);
+	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
+	      NULL, vec_promoted_val, count_exp,
+	      loop_iter, move_mode, 1,
+	      expected_size, false);
+	  emit_label (epilogue_loop_jump_around);
+	  dst = change_address (dst, BLKmode, destreg);
+	  epilogue_size_needed = GET_MODE_SIZE (move_mode);
+	}
+      tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
+			       true, OPTAB_LIB_WIDEN);
+      if (tmp != destreg)
+	emit_move_insn (destreg, tmp);
     }
+  if (count_exp != const0_rtx && epilogue_size_needed > 1)
+    expand_setmem_epilogue (dst, destreg, promoted_val, val_exp, count_exp,
+			    epilogue_size_needed);
   if (jump_around_label)
     emit_label (jump_around_label);
   return true;
@@ -34598,6 +35030,87 @@ ix86_autovectorize_vector_sizes (void)
   return (TARGET_AVX && !TARGET_PREFER_AVX128) ? 32 | 16 : 0;
 }
 
+/* Target hook.  Return true if unaligned access to data in MODE is slow;
+   used to prevent unaligned accesses in vector modes.  */
+
+static bool
+ix86_slow_unaligned_access (enum machine_mode mode,
+			    unsigned int align)
+{
+  if (TARGET_AVX)
+    {
+      if (GET_MODE_SIZE (mode) == 32)
+	{
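+	  /* 256-bit accesses are considered slow when unaligned only if the
+	     target splits unaligned 256-bit loads or stores.  */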
+	  if (align <= 16)
+	    return (TARGET_AVX256_SPLIT_UNALIGNED_LOAD
+		    || TARGET_AVX256_SPLIT_UNALIGNED_STORE);
+	  else
+	    return false;
+	}
+    }
+
+  if (GET_MODE_SIZE (mode) > 8)
+    {
+      return (! TARGET_SSE_UNALIGNED_LOAD_OPTIMAL
+	      && ! TARGET_SSE_UNALIGNED_STORE_OPTIMAL);
+    }
+
+  return false;
+}
+
+/* Target hook.  Returns an rtx of mode MODE with the promoted value VAL,
+   which is supposed to represent one byte.  MODE could be a vector mode.
+   Examples:
+   1) VAL = const_int (0xAB), MODE = SImode:
+   the result is const_int (0xABABABAB).
+   2) If VAL isn't constant, the result is the result of multiplying VAL
+   by const_int (0x01010101) (for SImode).  */
+
+static rtx
+ix86_promote_rtx_for_memset (enum machine_mode mode  ATTRIBUTE_UNUSED,
+			      rtx val)
+{
+  enum machine_mode val_mode = GET_MODE (val);
+  gcc_assert (VALID_INT_MODE_P (val_mode) || val_mode == VOIDmode);
+
+  if (vector_extensions_used_for_mode (mode) && TARGET_SSE)
+    {
+      rtx promoted_val, vec_reg;
+      enum machine_mode vec_mode = TARGET_64BIT ? V2DImode : V4SImode;
+      if (CONST_INT_P (val))
+	{
+	  rtx const_vec;
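+	  /* Replicate the low byte of VAL across a full word, build a
+	     constant vector from it and load it into a vector register.  */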
+	  HOST_WIDE_INT int_val = (UINTVAL (val) & 0xFF)
+				   * (TARGET_64BIT
+				      ? 0x0101010101010101
+				      : 0x01010101);
+	  val = gen_int_mode (int_val, Pmode);
+	  vec_reg = gen_reg_rtx (vec_mode);
+	  const_vec = ix86_build_const_vector (vec_mode, true, val);
+	  if (mode != vec_mode)
+	    const_vec = convert_to_mode (vec_mode, const_vec, 1);
+	  emit_move_insn (vec_reg, const_vec);
+	  return vec_reg;
+	}
+      /* Else: val isn't const.  */
+      promoted_val = promote_duplicated_reg (Pmode, val);
+      vec_reg = gen_reg_rtx (vec_mode);
+      switch (vec_mode)
+	{
+	case V2DImode:
+	  emit_insn (gen_vec_dupv2di (vec_reg, promoted_val));
+	  break;
+	case V4SImode:
+	  emit_insn (gen_vec_dupv4si (vec_reg, promoted_val));
+	  break;
+	default:
+	  gcc_unreachable ();
+	  break;
+	}
+      return vec_reg;
+    }
+  return NULL_RTX;
+}
+
 /* Initialize the GCC target structure.  */
 #undef TARGET_RETURN_IN_MEMORY
 #define TARGET_RETURN_IN_MEMORY ix86_return_in_memory
@@ -34899,6 +35412,12 @@ ix86_autovectorize_vector_sizes (void)
 #undef TARGET_CONDITIONAL_REGISTER_USAGE
 #define TARGET_CONDITIONAL_REGISTER_USAGE ix86_conditional_register_usage
 
+#undef TARGET_SLOW_UNALIGNED_ACCESS
+#define TARGET_SLOW_UNALIGNED_ACCESS ix86_slow_unaligned_access
+
+#undef TARGET_PROMOTE_RTX_FOR_MEMSET
+#define TARGET_PROMOTE_RTX_FOR_MEMSET ix86_promote_rtx_for_memset
+
 #if TARGET_MACHO
 #undef TARGET_INIT_LIBFUNCS
 #define TARGET_INIT_LIBFUNCS darwin_rename_builtins
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 8cef4e7..cf6d092 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -156,8 +156,12 @@ struct processor_costs {
   const int fchs;		/* cost of FCHS instruction.  */
   const int fsqrt;		/* cost of FSQRT instruction.  */
 				/* Specify what algorithm
-				   to use for stringops on unknown size.  */
-  struct stringop_algs memcpy[2], memset[2];
+				   to use for stringops on unknown size.
+				   The first index specifies whether the
+				   alignment is known or not; the second
+				   whether 32-bit or 64-bit code is
+				   generated.  */
+  struct stringop_algs memcpy[2][2], memset[2][2];
   const int scalar_stmt_cost;   /* Cost of any scalar operation, excluding
 				   load and store.  */
   const int scalar_load_cost;   /* Cost of scalar load.  */
@@ -1712,7 +1716,7 @@ typedef struct ix86_args {
 /* If a clear memory operation would take CLEAR_RATIO or more simple
    move-instruction sequences, we will do a clrmem or libcall instead.  */
 
-#define CLEAR_RATIO(speed) ((speed) ? MIN (6, ix86_cost->move_ratio) : 2)
+#define CLEAR_RATIO(speed) ((speed) ? ix86_cost->move_ratio : 2)
 
 /* Define if shifts truncate the shift count which implies one can
    omit a sign-extension or zero-extension of a shift count.
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 7abee33..c2c8ef6 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -6345,6 +6345,13 @@
    (set_attr "prefix" "maybe_vex,maybe_vex,orig,orig,vex")
    (set_attr "mode" "TI,TI,V4SF,SF,SF")])
 
+(define_expand "sse2_loadq"
+ [(set (match_operand:V2DI 0 "register_operand")
+       (vec_concat:V2DI
+	 (match_operand:DI 1 "memory_operand")
+	 (const_int 0)))]
+  "!TARGET_64BIT && TARGET_SSE2")
+
 (define_insn_and_split "sse2_stored"
   [(set (match_operand:SI 0 "nonimmediate_operand" "=xm,r")
 	(vec_select:SI
@@ -6456,6 +6463,16 @@
    (set_attr "prefix" "maybe_vex,orig,vex,maybe_vex,orig,orig")
    (set_attr "mode" "V2SF,TI,TI,TI,V4SF,V2SF")])
 
+(define_expand "vec_dupv4si"
+  [(set (match_operand:V4SI 0 "register_operand" "")
+	(vec_duplicate:V4SI
+	  (match_operand:SI 1 "nonimmediate_operand" "")))]
+  "TARGET_SSE"
+{
+  if (!TARGET_AVX)
+    operands[1] = force_reg (V4SImode, operands[1]);
+})
+
 (define_insn "*vec_dupv4si_avx"
   [(set (match_operand:V4SI 0 "register_operand"     "=x,x")
 	(vec_duplicate:V4SI
@@ -6496,6 +6513,16 @@
    (set_attr "prefix" "orig,vex,maybe_vex")
    (set_attr "mode" "TI,TI,DF")])
 
+(define_expand "vec_dupv2di"
+  [(set (match_operand:V2DI 0 "register_operand" "")
+	(vec_duplicate:V2DI
+	  (match_operand:DI 1 "nonimmediate_operand" "")))]
+  "TARGET_SSE"
+{
+  if (!TARGET_AVX)
+    operands[1] = force_reg (V2DImode, operands[1]);
+})
+
 (define_insn "*vec_dupv2di"
   [(set (match_operand:V2DI 0 "register_operand" "=Y2,x")
 	(vec_duplicate:V2DI
diff --git a/gcc/cse.c b/gcc/cse.c
index a078329..9cf70ce 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -4614,7 +4614,10 @@ cse_insn (rtx insn)
 		 to fold switch statements when an ADDR_DIFF_VEC is used.  */
 	      || (GET_CODE (src_folded) == MINUS
 		  && GET_CODE (XEXP (src_folded, 0)) == LABEL_REF
-		  && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF)))
+		  && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF))
+	      /* Don't propagate vector-constants, as for now no architecture
+		 supports vector immediates.  */
+	  && !vector_extensions_used_for_mode (mode))
 	src_const = src_folded, src_const_elt = elt;
       else if (src_const == 0 && src_eqv_here && CONSTANT_P (src_eqv_here))
 	src_const = src_eqv_here, src_const_elt = src_eqv_elt;
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index c0648a5..44e9947 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5747,6 +5747,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum machine_mode @var{mode}, unsigned int @var{align})
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options require faster code, the hook can also be
+used to prevent access to unaligned data in some set of modes even if the
+processor can perform the access without trapping.
+
+By default the hook returns the value of the @code{SLOW_UNALIGNED_ACCESS}
+macro if it is defined, and @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
+@deftypefn {Target Hook} rtx TARGET_PROMOTE_RTX_FOR_MEMSET (enum machine_mode @var{mode}, rtx @var{val})
+This hook returns an rtx of mode @var{mode} with the value @var{val}
+promoted (replicated) to fill the whole mode, or @code{NULL_RTX};
+@var{val} is supposed to represent a single byte.  The hook emits the
+instructions needed to perform the promotion of @var{val} to mode
+@var{mode}.  If those instructions cannot be generated, the hook returns
+@code{NULL_RTX}.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6219,23 +6245,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 3660d36..0e41fb4 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -5690,6 +5690,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@hook TARGET_SLOW_UNALIGNED_ACCESS
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.
+
+When this hook returns true, the compiler will act as if
+@code{STRICT_ALIGNMENT} were nonzero when generating code for block
+moves.  This can cause significantly more instructions to be produced.
+Therefore, the hook should not return true if unaligned accesses only add a
+cycle or two to the time for a memory access.
+
+If the current compilation options require faster code, the hook can also be
+used to prevent access to unaligned data in some set of modes even if the
+processor can perform the access without trapping.
+
+By default the hook returns the value of the @code{SLOW_UNALIGNED_ACCESS}
+macro if it is defined, and @code{STRICT_ALIGNMENT} otherwise.
+@end deftypefn
+
+@hook TARGET_PROMOTE_RTX_FOR_MEMSET
+This hook returns an rtx of mode @var{mode} with the value @var{val}
+promoted (replicated) to fill the whole mode, or @code{NULL_RTX};
+@var{val} is supposed to represent a single byte.  The hook emits the
+instructions needed to perform the promotion of @var{val} to mode
+@var{mode}.  If those instructions cannot be generated, the hook returns
+@code{NULL_RTX}.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
@@ -6162,23 +6188,6 @@ may eliminate subsequent memory access if subsequent accesses occur to
 other fields in the same word of the structure, but to different bytes.
 @end defmac
 
-@defmac SLOW_UNALIGNED_ACCESS (@var{mode}, @var{alignment})
-Define this macro to be the value 1 if memory accesses described by the
-@var{mode} and @var{alignment} parameters have a cost many times greater
-than aligned accesses, for example if they are emulated in a trap
-handler.
-
-When this macro is nonzero, the compiler will act as if
-@code{STRICT_ALIGNMENT} were nonzero when generating code for block
-moves.  This can cause significantly more instructions to be produced.
-Therefore, do not set this macro nonzero if unaligned accesses only add a
-cycle or two to the time for a memory access.
-
-If the value of this macro is always zero, it need not be defined.  If
-this macro is defined, it should produce a nonzero value when
-@code{STRICT_ALIGNMENT} is nonzero.
-@end defmac
-
 @defmac MOVE_RATIO (@var{speed})
 The threshold of number of scalar memory-to-memory move insns, @emph{below}
 which a sequence of insns should be generated instead of a
diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c
index c641b7e..18e1a8c 100644
--- a/gcc/emit-rtl.c
+++ b/gcc/emit-rtl.c
@@ -1504,6 +1504,11 @@ get_mem_align_offset (rtx mem, unsigned int align)
       if (TYPE_ALIGN (TREE_TYPE (expr)) < (unsigned int) align)
 	return -1;
     }
+  else if (TREE_CODE (expr) == MEM_REF)
+    {
+      if (MEM_ALIGN (mem) < (unsigned int) align)
+	return -1;
+    }
   else if (TREE_CODE (expr) == COMPONENT_REF)
     {
       while (1)
@@ -2059,9 +2064,14 @@ adjust_address_1 (rtx memref, enum machine_mode mode, HOST_WIDE_INT offset,
      lowest-order set bit in OFFSET, but don't change the alignment if OFFSET
      if zero.  */
   if (offset != 0)
-    memalign
-      = MIN (memalign,
-	     (unsigned HOST_WIDE_INT) (offset & -offset) * BITS_PER_UNIT);
+    {
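+      /* If the offset of MEMREF from the maximum-alignment boundary is
+	 known, derive the alignment of the adjusted address from
+	 OLD_OFFSET + OFFSET; otherwise fall back to the alignment implied
+	 by the low bits of OFFSET alone.  */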
+      int old_offset = get_mem_align_offset (memref, MOVE_MAX*BITS_PER_UNIT);
+      if (old_offset >= 0)
+	memalign = compute_align_by_offset (old_offset + offset);
+      else
+	memalign = MIN (memalign,
+	      (unsigned HOST_WIDE_INT) (offset & -offset) * BITS_PER_UNIT);
+    }
 
   /* We can compute the size in a number of ways.  */
   if (GET_MODE (new_rtx) != BLKmode)
diff --git a/gcc/expr.c b/gcc/expr.c
index fb4379f..410779a 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -125,15 +125,18 @@ struct store_by_pieces_d
 static unsigned HOST_WIDE_INT move_by_pieces_ninsns (unsigned HOST_WIDE_INT,
 						     unsigned int,
 						     unsigned int);
-static void move_by_pieces_1 (rtx (*) (rtx, ...), enum machine_mode,
-			      struct move_by_pieces_d *);
+static void move_by_pieces_insn (rtx (*) (rtx, ...), enum machine_mode,
+				 struct move_by_pieces_d *);
 static bool block_move_libcall_safe_for_call_parm (void);
 static bool emit_block_move_via_movmem (rtx, rtx, rtx, unsigned, unsigned, HOST_WIDE_INT);
 static tree emit_block_move_libcall_fn (int);
 static void emit_block_move_via_loop (rtx, rtx, rtx, unsigned);
 static rtx clear_by_pieces_1 (void *, HOST_WIDE_INT, enum machine_mode);
 static void clear_by_pieces (rtx, unsigned HOST_WIDE_INT, unsigned int);
+static void set_by_pieces_1 (struct store_by_pieces_d *, unsigned int);
 static void store_by_pieces_1 (struct store_by_pieces_d *, unsigned int);
+static void set_by_pieces_2 (rtx (*) (rtx, ...), enum machine_mode,
+			       struct store_by_pieces_d *, rtx);
 static void store_by_pieces_2 (rtx (*) (rtx, ...), enum machine_mode,
 			       struct store_by_pieces_d *);
 static tree clear_storage_libcall_fn (int);
@@ -160,6 +163,12 @@ static void do_tablejump (rtx, enum machine_mode, rtx, rtx, rtx);
 static rtx const_vector_from_tree (tree);
 static void write_complex_part (rtx, rtx, bool);
 
+static enum machine_mode widest_mode_for_unaligned_mov (unsigned HOST_WIDE_INT);
+static enum machine_mode widest_mode_for_aligned_mov (unsigned HOST_WIDE_INT,
+						      unsigned int);
+static enum machine_mode generate_move_with_mode (struct store_by_pieces_d *,
+					   enum machine_mode, rtx *, rtx *);
+
 /* This macro is used to determine whether move_by_pieces should be called
    to perform a structure copy.  */
 #ifndef MOVE_BY_PIECES_P
@@ -808,7 +817,7 @@ alignment_for_piecewise_move (unsigned int max_pieces, unsigned int align)
 	   tmode != VOIDmode;
 	   xmode = tmode, tmode = GET_MODE_WIDER_MODE (tmode))
 	if (GET_MODE_SIZE (tmode) > max_pieces
-	    || SLOW_UNALIGNED_ACCESS (tmode, align))
+	    || targetm.slow_unaligned_access (tmode, align))
 	  break;
 
       align = MAX (align, GET_MODE_ALIGNMENT (xmode));
@@ -817,11 +826,66 @@ alignment_for_piecewise_move (unsigned int max_pieces, unsigned int align)
   return align;
 }
 
+/* Given an offset from the maximum-alignment boundary, compute the maximal
+   alignment that can be assumed for the offsetted data.  */
+unsigned int
+compute_align_by_offset (int offset)
+{
+  return (offset == 0
+	  ? MOVE_MAX * BITS_PER_UNIT
+	  : MIN (MOVE_MAX, (offset & -offset)) * BITS_PER_UNIT);
+}
+
+/* Estimate cost of move for given size and offset.  Offset is used for
+   determining max alignment.  */
+static int
+compute_aligned_cost (unsigned HOST_WIDE_INT size, int offset)
+{
+  unsigned HOST_WIDE_INT cost = 0;
+  int cur_off = offset;
+
+  while (size > 0)
+    {
+      enum machine_mode mode = widest_mode_for_aligned_mov (size,
+	  compute_align_by_offset (cur_off));
+      int cur_mode_cost;
+      enum vect_cost_for_stmt type_of_cost = vector_load;
+      if (GET_MODE_SIZE (mode) <= UNITS_PER_WORD
+	  && (SCALAR_INT_MODE_P (mode) || SCALAR_FLOAT_MODE_P (mode)))
+	type_of_cost = scalar_load;
+      cur_mode_cost =
+	targetm.vectorize.builtin_vectorization_cost (type_of_cost, NULL, 0);
+      size -= GET_MODE_SIZE (mode);
+      cur_off += GET_MODE_SIZE (mode);
+      cost += cur_mode_cost;
+    }
+  return cost;
+}
+
+/* Estimate the cost of a move of the given size.  The alignment is assumed
+   to be unknown, so unaligned moves have to be used.  */
+static int
+compute_unaligned_cost (unsigned HOST_WIDE_INT size)
+{
+  unsigned HOST_WIDE_INT cost = 0;
+  while (size > 0)
+    {
+      enum machine_mode mode = widest_mode_for_unaligned_mov (size);
+      unsigned HOST_WIDE_INT n_insns = size / GET_MODE_SIZE (mode);
+      int cur_mode_cost =
+	targetm.vectorize.builtin_vectorization_cost (unaligned_load, NULL, 0);
+
+      cost += n_insns * cur_mode_cost;
+      size %= GET_MODE_SIZE (mode);
+    }
+  return cost;
+}
+
 /* Return the widest integer mode no wider than SIZE.  If no such mode
    can be found, return VOIDmode.  */
 
 static enum machine_mode
-widest_int_mode_for_size (unsigned int size)
+widest_int_mode_for_size (unsigned HOST_WIDE_INT size)
 {
   enum machine_mode tmode, mode = VOIDmode;
 
@@ -833,6 +897,170 @@ widest_int_mode_for_size (unsigned int size)
   return mode;
 }
 
+/* If MODE is a scalar mode, find the corresponding preferred vector mode.
+   If no such mode can be found, return the vector mode corresponding to
+   Pmode (a kind of default vector mode).
+   For vector modes return the mode itself.  */
+
+static enum machine_mode
+vector_mode_for_mode (enum machine_mode mode)
+{
+  enum machine_mode xmode;
+  if (VECTOR_MODE_P (mode))
+    return mode;
+  xmode = targetm.vectorize.preferred_simd_mode (mode);
+  if (VECTOR_MODE_P (xmode))
+    return xmode;
+
+  return targetm.vectorize.preferred_simd_mode (Pmode);
+}
+
+/* Check whether operations with the specified mode require vector
+   instructions.
+   For vector modes, check whether the corresponding vector extension is
+   supported.
+   Operations in a scalar mode use vector extensions if this scalar mode is
+   wider than the default scalar mode (Pmode) and the vector extension for
+   the parent vector mode is available.  */
+
+bool
+vector_extensions_used_for_mode (enum machine_mode mode)
+{
+  enum machine_mode vector_mode = vector_mode_for_mode (mode);
+
+  if (VECTOR_MODE_P (mode))
+    return targetm.vector_mode_supported_p (mode);
+
+  /* mode is a scalar mode.  */
+  if (VECTOR_MODE_P (vector_mode)
+     && targetm.vector_mode_supported_p (vector_mode)
+     && (GET_MODE_SIZE (mode) > GET_MODE_SIZE (Pmode)))
+    return true;
+
+  return false;
+}
+
+/* Find the widest move mode for the given size if alignment is unknown.  */
+static enum machine_mode
+widest_mode_for_unaligned_mov (unsigned HOST_WIDE_INT size)
+{
+  enum machine_mode mode;
+  enum machine_mode tmode, xmode;
+  enum machine_mode best_simd_mode = targetm.vectorize.preferred_simd_mode (
+      mode_for_size (UNITS_PER_WORD*BITS_PER_UNIT, MODE_INT, 0));
+
+  /* Find the widest integer mode.  Here we can find modes wider than Pmode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_INT), xmode = VOIDmode;
+       tmode != VOIDmode;
+       tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size
+	  || targetm.slow_unaligned_access (tmode, BITS_PER_UNIT))
+	break;
+      if (optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.scalar_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+  mode = xmode;
+
+  /* Find the widest vector mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_VECTOR_INT), xmode = VOIDmode;
+       tmode != VOIDmode;
+       tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size
+	  || targetm.slow_unaligned_access (tmode, BITS_PER_UNIT))
+	break;
+      if (GET_MODE_SIZE (GET_MODE_INNER (tmode)) == UNITS_PER_WORD
+	  && optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.vector_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+
+  /* Choose between integer and vector modes.  */
+  if (xmode != VOIDmode && GET_MODE_SIZE (xmode) > GET_MODE_SIZE (mode))
+    mode = xmode;
+
+  /* If the vector and scalar modes found have the same size and the vector
+     mode is best_simd_mode, prefer the vector mode.  */
+  if (xmode != VOIDmode
+      && GET_MODE_SIZE (xmode) == GET_MODE_SIZE (mode)
+      && xmode == best_simd_mode)
+    mode = xmode;
+
+  /* If we failed to find a mode that might use vector extensions, try to
+     find widest ordinary integer mode.  */
+  if (mode == VOIDmode)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  /* If the mode found won't use vector extensions, there is no need to use
+     a mode wider than Pmode.  */
+  if (!vector_extensions_used_for_mode (mode)
+      && GET_MODE_SIZE (mode) > MOVE_MAX_PIECES)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  return mode;
+}
+
+/* Find the widest move mode for the given size and alignment.  */
+static enum machine_mode
+widest_mode_for_aligned_mov (unsigned HOST_WIDE_INT size, unsigned int align)
+{
+  enum machine_mode mode;
+  enum machine_mode tmode, xmode;
+  enum machine_mode best_simd_mode = targetm.vectorize.preferred_simd_mode (
+      mode_for_size (UNITS_PER_WORD*BITS_PER_UNIT, MODE_INT, 0));
+
+  /* Find the widest integer mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_INT), xmode = VOIDmode;
+      tmode != VOIDmode;
+      tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size || GET_MODE_ALIGNMENT (tmode) > align)
+	break;
+      if (optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.scalar_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+  mode = xmode;
+
+  /* Find the widest vector mode.  */
+  for (tmode = GET_CLASS_NARROWEST_MODE (MODE_VECTOR_INT), xmode = VOIDmode;
+      tmode != VOIDmode;
+      tmode = GET_MODE_WIDER_MODE (tmode))
+    {
+      if (GET_MODE_SIZE (tmode) > size || GET_MODE_ALIGNMENT (tmode) > align)
+	break;
+      if (GET_MODE_SIZE (GET_MODE_INNER (tmode)) == UNITS_PER_WORD
+	  && optab_handler (mov_optab, tmode) != CODE_FOR_nothing
+	  && targetm.vector_mode_supported_p (tmode))
+	xmode = tmode;
+    }
+
+  /* Choose between integer and vector modes.  */
+  if (xmode != VOIDmode && GET_MODE_SIZE (xmode) > GET_MODE_SIZE (mode))
+    mode = xmode;
+
+  /* If the vector and scalar modes found have the same size and the vector
+     mode is best_simd_mode, prefer the vector mode.  */
+  if (xmode != VOIDmode
+      && GET_MODE_SIZE (xmode) == GET_MODE_SIZE (mode)
+      && xmode == best_simd_mode)
+    mode = xmode;
+
+  /* If we failed to find a mode that might use vector extensions, try to
+     find widest ordinary integer mode.  */
+  if (mode == VOIDmode)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  /* If the mode found won't use vector extensions, there is no need to use
+     a mode wider than Pmode.  */
+  if (!vector_extensions_used_for_mode (mode)
+      && GET_MODE_SIZE (mode) > MOVE_MAX_PIECES)
+    mode = widest_int_mode_for_size (MIN (MOVE_MAX_PIECES, size) + 1);
+
+  return mode;
+}
+
 /* STORE_MAX_PIECES is the number of bytes at a time that we can
    store efficiently.  Due to internal GCC limitations, this is
    MOVE_MAX_PIECES limited by the number of bytes GCC can represent
@@ -873,6 +1101,7 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
   rtx to_addr, from_addr = XEXP (from, 0);
   unsigned int max_size = MOVE_MAX_PIECES + 1;
   enum insn_code icode;
+  int dst_offset, src_offset;
 
   align = MIN (to ? MEM_ALIGN (to) : align, MEM_ALIGN (from));
 
@@ -957,23 +1186,37 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
 	data.to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
     }
 
-  align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
-
-  /* First move what we can in the largest integer mode, then go to
-     successively smaller modes.  */
-
-  while (max_size > 1)
+  src_offset = get_mem_align_offset (from, MOVE_MAX*BITS_PER_UNIT);
+  dst_offset = get_mem_align_offset (to, MOVE_MAX*BITS_PER_UNIT);
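+  /* If the source and destination offsets from the maximum-alignment
+     boundary differ or are unknown, or if the aligned expansion is not
+     estimated to be cheaper, expand the copy with unaligned moves;
+     otherwise use aligned moves.  */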
+  if (src_offset < 0
+      || dst_offset < 0
+      || src_offset != dst_offset
+      || compute_aligned_cost (data.len, src_offset) >=
+	 compute_unaligned_cost (data.len))
     {
-      enum machine_mode mode = widest_int_mode_for_size (max_size);
-
-      if (mode == VOIDmode)
-	break;
+      while (data.len > 0)
+	{
+	  enum machine_mode mode = widest_mode_for_unaligned_mov (data.len);
 
-      icode = optab_handler (mov_optab, mode);
-      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
-	move_by_pieces_1 (GEN_FCN (icode), mode, &data);
+	  icode = optab_handler (mov_optab, mode);
+	  gcc_assert (icode != CODE_FOR_nothing);
+	  move_by_pieces_insn (GEN_FCN (icode), mode, &data);
+	}
+    }
+  else
+    {
+      while (data.len > 0)
+	{
+	  enum machine_mode mode;
+	  mode = widest_mode_for_aligned_mov (data.len,
+	      compute_align_by_offset (src_offset));
 
-      max_size = GET_MODE_SIZE (mode);
+	  icode = optab_handler (mov_optab, mode);
+	  gcc_assert (icode != CODE_FOR_nothing &&
+	      compute_align_by_offset (src_offset) >= GET_MODE_ALIGNMENT (mode));
+	  move_by_pieces_insn (GEN_FCN (icode), mode, &data);
+	  src_offset += GET_MODE_SIZE (mode);
+	}
     }
 
   /* The code above should have handled everything.  */
@@ -1011,35 +1254,47 @@ move_by_pieces (rtx to, rtx from, unsigned HOST_WIDE_INT len,
 }
 
 /* Return number of insns required to move L bytes by pieces.
-   ALIGN (in bits) is maximum alignment we can assume.  */
+   ALIGN (in bits) is maximum alignment we can assume.
+   This is only an estimate, so the actual number of instructions might
+   differ from it (there are several ways of expanding memmove).  */
 
 static unsigned HOST_WIDE_INT
 move_by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align,
-		       unsigned int max_size)
+		       unsigned int max_size ATTRIBUTE_UNUSED)
 {
   unsigned HOST_WIDE_INT n_insns = 0;
-
-  align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
-
-  while (max_size > 1)
+  unsigned HOST_WIDE_INT n_insns_u = 0;
+  enum machine_mode mode;
+  unsigned HOST_WIDE_INT len = l;
+  while (len > 0)
     {
-      enum machine_mode mode;
-      enum insn_code icode;
-
-      mode = widest_int_mode_for_size (max_size);
-
-      if (mode == VOIDmode)
-	break;
+      mode = widest_mode_for_aligned_mov (len, align);
+      if (GET_MODE_SIZE (mode) < MOVE_MAX)
+	{
+	  align += GET_MODE_ALIGNMENT (mode);
+	  len -= GET_MODE_SIZE (mode);
+	  n_insns++;
+	}
+      else
+	{
+	  /* We are using the widest mode.  */
+	  n_insns += len / GET_MODE_SIZE (mode);
+	  len = len % GET_MODE_SIZE (mode);
+	}
+    }
+  gcc_assert (!len);
 
-      icode = optab_handler (mov_optab, mode);
-      if (icode != CODE_FOR_nothing && align >= GET_MODE_ALIGNMENT (mode))
-	n_insns += l / GET_MODE_SIZE (mode), l %= GET_MODE_SIZE (mode);
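+  /* Now estimate the number of insns needed if the alignment is unknown and
+     unaligned moves have to be used; the smaller of the two estimates is
+     returned.  */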
+  len = l;
+  while (len > 0)
+    {
+      mode = widest_mode_for_unaligned_mov (len);
+      n_insns_u += len / GET_MODE_SIZE (mode);
+      len = len % GET_MODE_SIZE (mode);
 
-      max_size = GET_MODE_SIZE (mode);
     }
 
-  gcc_assert (!l);
-  return n_insns;
+  gcc_assert (!len);
+  return MIN (n_insns, n_insns_u);
 }
 
 /* Subroutine of move_by_pieces.  Move as many bytes as appropriate
@@ -1047,60 +1302,57 @@ move_by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align,
    to make a move insn for that mode.  DATA has all the other info.  */
 
 static void
-move_by_pieces_1 (rtx (*genfun) (rtx, ...), enum machine_mode mode,
+move_by_pieces_insn (rtx (*genfun) (rtx, ...), enum machine_mode mode,
 		  struct move_by_pieces_d *data)
 {
   unsigned int size = GET_MODE_SIZE (mode);
   rtx to1 = NULL_RTX, from1;
 
-  while (data->len >= size)
-    {
-      if (data->reverse)
-	data->offset -= size;
-
-      if (data->to)
-	{
-	  if (data->autinc_to)
-	    to1 = adjust_automodify_address (data->to, mode, data->to_addr,
-					     data->offset);
-	  else
-	    to1 = adjust_address (data->to, mode, data->offset);
-	}
+  if (data->reverse)
+    data->offset -= size;
 
-      if (data->autinc_from)
-	from1 = adjust_automodify_address (data->from, mode, data->from_addr,
-					   data->offset);
+  if (data->to)
+    {
+      if (data->autinc_to)
+	to1 = adjust_automodify_address (data->to, mode, data->to_addr,
+					 data->offset);
       else
-	from1 = adjust_address (data->from, mode, data->offset);
+	to1 = adjust_address (data->to, mode, data->offset);
+    }
 
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
-	emit_insn (gen_add2_insn (data->to_addr,
-				  GEN_INT (-(HOST_WIDE_INT)size)));
-      if (HAVE_PRE_DECREMENT && data->explicit_inc_from < 0)
-	emit_insn (gen_add2_insn (data->from_addr,
-				  GEN_INT (-(HOST_WIDE_INT)size)));
+  if (data->autinc_from)
+    from1 = adjust_automodify_address (data->from, mode, data->from_addr,
+				       data->offset);
+  else
+    from1 = adjust_address (data->from, mode, data->offset);
 
-      if (data->to)
-	emit_insn ((*genfun) (to1, from1));
-      else
-	{
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
+    emit_insn (gen_add2_insn (data->to_addr,
+			      GEN_INT (-(HOST_WIDE_INT)size)));
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_from < 0)
+    emit_insn (gen_add2_insn (data->from_addr,
+			      GEN_INT (-(HOST_WIDE_INT)size)));
+
+  if (data->to)
+    emit_insn ((*genfun) (to1, from1));
+  else
+    {
 #ifdef PUSH_ROUNDING
-	  emit_single_push_insn (mode, from1, NULL);
+      emit_single_push_insn (mode, from1, NULL);
 #else
-	  gcc_unreachable ();
+      gcc_unreachable ();
 #endif
-	}
+    }
 
-      if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
-	emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
-      if (HAVE_POST_INCREMENT && data->explicit_inc_from > 0)
-	emit_insn (gen_add2_insn (data->from_addr, GEN_INT (size)));
+  if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
+    emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
+  if (HAVE_POST_INCREMENT && data->explicit_inc_from > 0)
+    emit_insn (gen_add2_insn (data->from_addr, GEN_INT (size)));
 
-      if (! data->reverse)
-	data->offset += size;
+  if (! data->reverse)
+    data->offset += size;
 
-      data->len -= size;
-    }
+  data->len -= size;
 }
 \f
 /* Emit code to move a block Y to a block X.  This may be done with
@@ -1677,7 +1929,7 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src, tree type, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (src)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (src))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (src))
 	      || MEM_ALIGN (src) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -2067,7 +2319,7 @@ emit_group_store (rtx orig_dst, rtx src, tree type ATTRIBUTE_UNUSED, int ssize)
 
       /* Optimize the access just a bit.  */
       if (MEM_P (dest)
-	  && (! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (dest))
+	  && (! targetm.slow_unaligned_access (mode, MEM_ALIGN (dest))
 	      || MEM_ALIGN (dest) >= GET_MODE_ALIGNMENT (mode))
 	  && bytepos * BITS_PER_UNIT % GET_MODE_ALIGNMENT (mode) == 0
 	  && bytelen == GET_MODE_SIZE (mode))
@@ -2356,7 +2608,10 @@ store_by_pieces (rtx to, unsigned HOST_WIDE_INT len,
   data.constfundata = constfundata;
   data.len = len;
   data.to = to;
-  store_by_pieces_1 (&data, align);
+  if (memsetp)
+    set_by_pieces_1 (&data, align);
+  else
+    store_by_pieces_1 (&data, align);
   if (endp)
     {
       rtx to1;
@@ -2400,10 +2655,10 @@ clear_by_pieces (rtx to, unsigned HOST_WIDE_INT len, unsigned int align)
     return;
 
   data.constfun = clear_by_pieces_1;
-  data.constfundata = NULL;
+  data.constfundata = CONST0_RTX (QImode);
   data.len = len;
   data.to = to;
-  store_by_pieces_1 (&data, align);
+  set_by_pieces_1 (&data, align);
 }
 
 /* Callback routine for clear_by_pieces.
@@ -2417,13 +2672,121 @@ clear_by_pieces_1 (void *data ATTRIBUTE_UNUSED,
   return const0_rtx;
 }
 
-/* Subroutine of clear_by_pieces and store_by_pieces.
+/* Helper function for set_by_pieces_1: generate a move with the given mode.
+   Return the mode actually used for the generated move (it could differ from
+   the requested one if the requested mode isn't supported).  */
+
+static enum machine_mode
+generate_move_with_mode (struct store_by_pieces_d *data,
+			 enum machine_mode mode,
+			 rtx *promoted_to_vector_value_ptr,
+			 rtx *promoted_value_ptr)
+{
+  enum insn_code icode;
+  rtx rhs = NULL_RTX;
+
+  gcc_assert (promoted_to_vector_value_ptr && promoted_value_ptr);
+
+  if (vector_extensions_used_for_mode (mode))
+    {
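+      /* Promote the value into a vector register only once and cache it in
+	 *PROMOTED_TO_VECTOR_VALUE_PTR, so that subsequent moves can reuse
+	 the promoted value.  */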
+      enum machine_mode vec_mode = vector_mode_for_mode (mode);
+      if (!(*promoted_to_vector_value_ptr))
+	*promoted_to_vector_value_ptr
+	  = targetm.promote_rtx_for_memset (vec_mode, (rtx)data->constfundata);
+
+      rhs = convert_to_mode (vec_mode, *promoted_to_vector_value_ptr, 1);
+    }
+  else
+    {
+      if (CONST_INT_P ((rtx)data->constfundata))
+	{
+	  /* We don't need to load the constant into a register if it can be
+	     encoded as an immediate operand.  */
+	  rtx imm_const;
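+	  /* Replicate the low byte of the constant across all bytes of MODE
+	     and use the result as an immediate operand.  */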
+	  switch (mode)
+	    {
+	    case DImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x0101010101010101, DImode);
+	      break;
+	    case SImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x01010101, SImode);
+	      break;
+	    case HImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x00000101, HImode);
+	      break;
+	    case QImode:
+	      imm_const
+		= gen_int_mode ((UINTVAL ((rtx)data->constfundata) & 0xFF)
+				* 0x00000001, QImode);
+	      break;
+	    default:
+	      gcc_unreachable ();
+	      break;
+	    }
+	  rhs = imm_const;
+	}
+      else /* data->constfundata isn't const.  */
+	{
+	  if (!(*promoted_value_ptr))
+	    {
+	      rtx coeff;
+	      enum machine_mode promoted_value_mode;
+	      /* Choose the mode for the promoted value.  It shouldn't be
+		 narrower than Pmode.  */
+	      if (GET_MODE_SIZE (mode) > GET_MODE_SIZE (Pmode))
+		promoted_value_mode = mode;
+	      else
+		promoted_value_mode = Pmode;
+
+	      switch (promoted_value_mode)
+		{
+		case DImode:
+		  coeff = gen_int_mode (0x0101010101010101, DImode);
+		  break;
+		case SImode:
+		  coeff = gen_int_mode (0x01010101, SImode);
+		  break;
+		default:
+		  gcc_unreachable ();
+		  break;
+		}
+	      *promoted_value_ptr = convert_to_mode (promoted_value_mode,
+						     (rtx)data->constfundata,
+						     1);
+	      *promoted_value_ptr = expand_mult (promoted_value_mode,
+						 *promoted_value_ptr, coeff,
+						 NULL_RTX, 1);
+	    }
+	  rhs = convert_to_mode (mode, *promoted_value_ptr, 1);
+	}
+    }
+  /* If RHS is null, then the requested mode isn't supported and can't be used.
+     Use Pmode instead.  */
+  if (!rhs)
+    {
+      generate_move_with_mode (data, Pmode, promoted_to_vector_value_ptr,
+			       promoted_value_ptr);
+      return Pmode;
+    }
+
+  gcc_assert (rhs);
+  icode = optab_handler (mov_optab, mode);
+  gcc_assert (icode != CODE_FOR_nothing);
+  set_by_pieces_2 (GEN_FCN (icode), mode, data, rhs);
+  return mode;
+}
+
+/* Subroutine of store_by_pieces.
    Generate several move instructions to store LEN bytes of block TO.  (A MEM
    rtx with BLKmode).  ALIGN is maximum alignment we can assume.  */
 
 static void
-store_by_pieces_1 (struct store_by_pieces_d *data ATTRIBUTE_UNUSED,
-		   unsigned int align ATTRIBUTE_UNUSED)
+store_by_pieces_1 (struct store_by_pieces_d *data, unsigned int align)
 {
   enum machine_mode to_addr_mode
     = targetm.addr_space.address_mode (MEM_ADDR_SPACE (data->to));
@@ -2498,6 +2861,134 @@ store_by_pieces_1 (struct store_by_pieces_d *data ATTRIBUTE_UNUSED,
   gcc_assert (!data->len);
 }
 
+/* Subroutine of clear_by_pieces and store_by_pieces.
+   Generate several move instructions to store LEN bytes of block TO.  (A MEM
+   rtx with BLKmode).  ALIGN is maximum alignment we can assume.
+   As opposed to store_by_pieces_1, this routine always generates code for
+   memset.  (store_by_pieces_1 is sometimes used to generate code for memcpy
+   rather than for memset).  */
+
+static void
+set_by_pieces_1 (struct store_by_pieces_d *data, unsigned int align)
+{
+  enum machine_mode to_addr_mode
+    = targetm.addr_space.address_mode (MEM_ADDR_SPACE (data->to));
+  rtx to_addr = XEXP (data->to, 0);
+  unsigned int max_size = STORE_MAX_PIECES + 1;
+  int dst_offset;
+  rtx promoted_to_vector_value = NULL_RTX;
+  rtx promoted_value = NULL_RTX;
+
+  data->offset = 0;
+  data->to_addr = to_addr;
+  data->autinc_to
+    = (GET_CODE (to_addr) == PRE_INC || GET_CODE (to_addr) == PRE_DEC
+       || GET_CODE (to_addr) == POST_INC || GET_CODE (to_addr) == POST_DEC);
+
+  data->explicit_inc_to = 0;
+  data->reverse
+    = (GET_CODE (to_addr) == PRE_DEC || GET_CODE (to_addr) == POST_DEC);
+  if (data->reverse)
+    data->offset = data->len;
+
+  /* If storing requires more than two move insns,
+     copy addresses to registers (to make displacements shorter)
+     and use post-increment if available.  */
+  if (!data->autinc_to
+      && move_by_pieces_ninsns (data->len, align, max_size) > 2)
+    {
+      /* Determine the main mode we'll be using.
+	 MODE might not be used depending on the definitions of the
+	 USE_* macros below.  */
+      enum machine_mode mode ATTRIBUTE_UNUSED
+	= widest_int_mode_for_size (max_size);
+
+      if (USE_STORE_PRE_DECREMENT (mode) && data->reverse && ! data->autinc_to)
+	{
+	  data->to_addr = copy_to_mode_reg (to_addr_mode,
+					    plus_constant (to_addr, data->len));
+	  data->autinc_to = 1;
+	  data->explicit_inc_to = -1;
+	}
+
+      if (USE_STORE_POST_INCREMENT (mode) && ! data->reverse
+	  && ! data->autinc_to)
+	{
+	  data->to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
+	  data->autinc_to = 1;
+	  data->explicit_inc_to = 1;
+	}
+
+      if ( !data->autinc_to && CONSTANT_P (to_addr))
+	data->to_addr = copy_to_mode_reg (to_addr_mode, to_addr);
+    }
+
+  dst_offset = get_mem_align_offset (data->to, MOVE_MAX*BITS_PER_UNIT);
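+  /* As in move_by_pieces, choose between the aligned and the unaligned
+     expansion based on the estimated costs.  */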
+  if (dst_offset < 0
+      || compute_aligned_cost (data->len, dst_offset) >=
+	 compute_unaligned_cost (data->len))
+    {
+      while (data->len > 0)
+	{
+	  enum machine_mode mode = widest_mode_for_unaligned_mov (data->len);
+	  generate_move_with_mode (data, mode, &promoted_to_vector_value,
+				   &promoted_value);
+	}
+    }
+  else
+    {
+      while (data->len > 0)
+	{
+	  enum machine_mode mode;
+	  mode = widest_mode_for_aligned_mov (data->len,
+	      compute_align_by_offset (dst_offset));
+	  mode = generate_move_with_mode (data, mode, &promoted_to_vector_value,
+				   &promoted_value);
+	  dst_offset += GET_MODE_SIZE (mode);
+	}
+    }
+
+  /* The code above should have handled everything.  */
+  gcc_assert (!data->len);
+}
+
+/* Subroutine of set_by_pieces_1.  Emit move instruction with mode MODE.
+   DATA has info about destination, RHS is source, GENFUN is the gen_...
+   function to make a move insn for that mode.  */
+
+static void
+set_by_pieces_2 (rtx (*genfun) (rtx, ...), enum machine_mode mode,
+		   struct store_by_pieces_d *data, rtx rhs)
+{
+  unsigned int size = GET_MODE_SIZE (mode);
+  rtx to1;
+
+  if (data->reverse)
+    data->offset -= size;
+
+  if (data->autinc_to)
+    to1 = adjust_automodify_address (data->to, mode, data->to_addr,
+	data->offset);
+  else
+    to1 = adjust_address (data->to, mode, data->offset);
+
+  if (HAVE_PRE_DECREMENT && data->explicit_inc_to < 0)
+    emit_insn (gen_add2_insn (data->to_addr,
+	  GEN_INT (-(HOST_WIDE_INT) size)));
+
+  gcc_assert (rhs);
+
+  emit_insn ((*genfun) (to1, rhs));
+
+  if (HAVE_POST_INCREMENT && data->explicit_inc_to > 0)
+    emit_insn (gen_add2_insn (data->to_addr, GEN_INT (size)));
+
+  if (! data->reverse)
+    data->offset += size;
+
+  data->len -= size;
+}
+
 /* Subroutine of store_by_pieces_1.  Store as many bytes as appropriate
    with move instructions for mode MODE.  GENFUN is the gen_... function
    to make a move insn for that mode.  DATA has all the other info.  */
@@ -3714,7 +4205,7 @@ emit_push_insn (rtx x, enum machine_mode mode, tree type, rtx size,
 	  /* Here we avoid the case of a structure whose weak alignment
 	     forces many pushes of a small amount of data,
 	     and such small pushes do rounding that causes trouble.  */
-	  && ((! SLOW_UNALIGNED_ACCESS (word_mode, align))
+	  && ((! targetm.slow_unaligned_access (word_mode, align))
 	      || align >= BIGGEST_ALIGNMENT
 	      || (PUSH_ROUNDING (align / BITS_PER_UNIT)
 		  == (align / BITS_PER_UNIT)))
@@ -5839,7 +6330,7 @@ store_field (rtx target, HOST_WIDE_INT bitsize, HOST_WIDE_INT bitpos,
       || (mode != BLKmode
 	  && ((((MEM_ALIGN (target) < GET_MODE_ALIGNMENT (mode))
 		|| bitpos % GET_MODE_ALIGNMENT (mode))
-	       && SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (target)))
+	       && targetm.slow_unaligned_access (mode, MEM_ALIGN (target)))
 	      || (bitpos % BITS_PER_UNIT != 0)))
       /* If the RHS and field are a constant size and the size of the
 	 RHS isn't the same size as the bitfield, we must use bitfield
@@ -9195,7 +9686,7 @@ expand_expr_real_1 (tree exp, rtx target, enum machine_mode tmode,
 		     && ((modifier == EXPAND_CONST_ADDRESS
 			  || modifier == EXPAND_INITIALIZER)
 			 ? STRICT_ALIGNMENT
-			 : SLOW_UNALIGNED_ACCESS (mode1, MEM_ALIGN (op0))))
+			 : targetm.slow_unaligned_access (mode1, MEM_ALIGN (op0))))
 		    || (bitpos % BITS_PER_UNIT != 0)))
 	    /* If the type and the field are a constant size and the
 	       size of the type isn't the same size as the bitfield,
diff --git a/gcc/expr.h b/gcc/expr.h
index cb4050d..b9ec9c2 100644
--- a/gcc/expr.h
+++ b/gcc/expr.h
@@ -693,4 +693,8 @@ extern tree build_libfunc_function (const char *);
 /* Get the personality libfunc for a function decl.  */
 rtx get_personality_function (tree);
 
+/* Given offset from maximum alignment boundary, compute maximum alignment,
+   that can be assumed.  */
+unsigned int compute_align_by_offset (int);
+
 #endif /* GCC_EXPR_H */
diff --git a/gcc/fwprop.c b/gcc/fwprop.c
index 5db9ed8..f15f957 100644
--- a/gcc/fwprop.c
+++ b/gcc/fwprop.c
@@ -1270,6 +1270,10 @@ forward_propagate_and_simplify (df_ref use, rtx def_insn, rtx def_set)
       return false;
     }
 
+  /* Don't propagate vector constants, as for now no architecture supports
+     vector immediates.  */
+  if (vector_extensions_used_for_mode (GET_MODE (reg)) && CONSTANT_P (src))
+    return false;
+
   if (asm_use >= 0)
     return forward_propagate_asm (use, def_insn, def_set, reg);
 
diff --git a/gcc/rtl.h b/gcc/rtl.h
index e3ceecd..4584206 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -2428,6 +2428,10 @@ extern void emit_jump (rtx);
 /* In expr.c */
 extern rtx move_by_pieces (rtx, rtx, unsigned HOST_WIDE_INT,
 			   unsigned int, int);
+/* Check if vector instructions are required for operating with mode
+   specified.  */
+bool vector_extensions_used_for_mode (enum machine_mode);
+
 
 /* In cfgrtl.c */
 extern void print_rtl_with_bb (FILE *, const_rtx);
diff --git a/gcc/target.def b/gcc/target.def
index c67f0ba..378e6e5 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1479,6 +1479,22 @@ DEFHOOK
  bool, (struct ao_ref_s *ref),
  default_ref_may_alias_errno)
 
+/* True if access to unaligned data in given mode is too slow or
+   prohibited.  */
+DEFHOOK
+(slow_unaligned_access,
+ "",
+ bool, (enum machine_mode mode, unsigned int align),
+ default_slow_unaligned_access)
+
+/* Target hook.  Returns rtx of mode MODE with promoted value VAL or NULL.
+   VAL is supposed to represent one byte.  */
+DEFHOOK
+(promote_rtx_for_memset,
+ "",
+ rtx, (enum machine_mode mode, rtx val),
+ default_promote_rtx_for_memset)
+
 /* Support for named address spaces.  */
 #undef HOOK_PREFIX
 #define HOOK_PREFIX "TARGET_ADDR_SPACE_"
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index bcb8a12..4c4e4bd 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -1441,4 +1441,24 @@ default_pch_valid_p (const void *data_p, size_t len)
   return NULL;
 }
 
+bool
+default_slow_unaligned_access (enum machine_mode mode ATTRIBUTE_UNUSED,
+			       unsigned int align ATTRIBUTE_UNUSED)
+{
+#ifdef SLOW_UNALIGNED_ACCESS
+  return SLOW_UNALIGNED_ACCESS (mode, align);
+#else
+  return STRICT_ALIGNMENT;
+#endif
+}
+
+/* Target hook.  Returns rtx of mode MODE with promoted value VAL or NULL.
+   VAL is supposed to represent one byte.  */
+rtx
+default_promote_rtx_for_memset (enum machine_mode mode ATTRIBUTE_UNUSED,
+				 rtx val ATTRIBUTE_UNUSED)
+{
+  return NULL_RTX;
+}
+
 #include "gt-targhooks.h"
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index ce89d32..27f2f4d 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -174,3 +174,6 @@ extern enum machine_mode default_get_reg_raw_mode(int);
 
 extern void *default_get_pch_validity (size_t *);
 extern const char *default_pch_valid_p (const void *, size_t);
+extern bool default_slow_unaligned_access (enum machine_mode mode,
+					   unsigned int align);
+extern rtx default_promote_rtx_for_memset (enum machine_mode mode, rtx val);
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-1.c
new file mode 100644
index 0000000..39c8ef0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-2.c
new file mode 100644
index 0000000..439694b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-3.c
new file mode 100644
index 0000000..51f4c3b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-4.c
new file mode 100644
index 0000000..bca8680
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s16-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-1.c
new file mode 100644
index 0000000..5bc8e74
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-2.c
new file mode 100644
index 0000000..b7dff27
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-1.c
new file mode 100644
index 0000000..bee85fe
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemcpy" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-2.c
new file mode 100644
index 0000000..1160beb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s3072-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemcpy" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-1.c
new file mode 100644
index 0000000..b1c78ec
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-2.c
new file mode 100644
index 0000000..a15a0f7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-3.c
new file mode 100644
index 0000000..2789660
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-4.c
new file mode 100644
index 0000000..17e0342
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-1.c
new file mode 100644
index 0000000..e437378
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-2.c
new file mode 100644
index 0000000..ba716df
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-3.c
new file mode 100644
index 0000000..1845e95
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-4.c
new file mode 100644
index 0000000..2b23751
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-au-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-1.c
new file mode 100644
index 0000000..e751192
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemcpy" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-au-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-2.c
new file mode 100644
index 0000000..7defe7e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s512-au-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-3.c
new file mode 100644
index 0000000..ea27378
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s512-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-1.c
new file mode 100644
index 0000000..de2a557
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-2.c
new file mode 100644
index 0000000..1f82258
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-3.c
new file mode 100644
index 0000000..7f60806
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-4.c
new file mode 100644
index 0000000..94f0864
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-1.c
new file mode 100644
index 0000000..20545c8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-2.c
new file mode 100644
index 0000000..52dab8e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-3.c
new file mode 100644
index 0000000..c662480
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-4.c
new file mode 100644
index 0000000..9e8e152
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {}, src[SIZE + OFFSET] = {};
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-1.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-1.c
new file mode 100644
index 0000000..662fc20
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-2.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-2.c
new file mode 100644
index 0000000..c90e852
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-3.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-3.c
new file mode 100644
index 0000000..5a41f82
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-s64-au-4.c b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-4.c
new file mode 100644
index 0000000..ec2dfff
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-s64-au-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memcpy where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memcpy (void *, void *, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst, *src;
+
+void
+do_copy ()
+{
+  memcpy (dst+OFFSET, src+OFFSET, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-1.c
new file mode 100644
index 0000000..d6b2cd5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-2.c
new file mode 100644
index 0000000..9cd89e9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-3.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-3.c
new file mode 100644
index 0000000..ddf25fd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s16-a1-4.c b/gcc/testsuite/gcc.target/i386/memset-s16-a1-4.c
new file mode 100644
index 0000000..fde4f5d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s16-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	16
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-1.c
new file mode 100644
index 0000000..4fe2d36
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-2.c
new file mode 100644
index 0000000..2209563
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-au-1.c b/gcc/testsuite/gcc.target/i386/memset-s3072-au-1.c
new file mode 100644
index 0000000..8d99dde
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemset" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s3072-au-2.c b/gcc/testsuite/gcc.target/i386/memset-s3072-au-2.c
new file mode 100644
index 0000000..e0ad04a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s3072-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	3072
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "\tmemset" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-1.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-1.c
new file mode 100644
index 0000000..404d04e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-2.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-2.c
new file mode 100644
index 0000000..1df9db0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-3.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-3.c
new file mode 100644
index 0000000..beb005c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a0-4.c b/gcc/testsuite/gcc.target/i386/memset-s512-a0-4.c
new file mode 100644
index 0000000..29f5ea3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-1.c
new file mode 100644
index 0000000..2504333
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-2.c
new file mode 100644
index 0000000..b0aaada
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-3.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-3.c
new file mode 100644
index 0000000..3e250d0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-a1-4.c b/gcc/testsuite/gcc.target/i386/memset-s512-a1-4.c
new file mode 100644
index 0000000..c13edd7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-1.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-1.c
new file mode 100644
index 0000000..17d9525
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-2.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-2.c
new file mode 100644
index 0000000..8125e9d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-3.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-3.c
new file mode 100644
index 0000000..ff74811
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s512-au-4.c b/gcc/testsuite/gcc.target/i386/memset-s512-au-4.c
new file mode 100644
index 0000000..d7e0c3d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s512-au-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	512
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-1.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-1.c
new file mode 100644
index 0000000..ea7b439
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-10.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-10.c
new file mode 100644
index 0000000..5ef250d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-10.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-11.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-11.c
new file mode 100644
index 0000000..846a807
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-11.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-12.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-12.c
new file mode 100644
index 0000000..a8f7c3b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-12.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-2.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-2.c
new file mode 100644
index 0000000..ae05e93
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-3.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-3.c
new file mode 100644
index 0000000..96462bd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-4.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-4.c
new file mode 100644
index 0000000..6aee01e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-5.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-5.c
new file mode 100644
index 0000000..bbad9b9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-5.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-6.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-6.c
new file mode 100644
index 0000000..8e90d72
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-6.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-7.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-7.c
new file mode 100644
index 0000000..26d0b42
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-7.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-8.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-8.c
new file mode 100644
index 0000000..84ec749
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-8.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a0-9.c b/gcc/testsuite/gcc.target/i386/memset-s64-a0-9.c
new file mode 100644
index 0000000..ef15265
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a0-9.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-1.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-1.c
new file mode 100644
index 0000000..444a8de
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-2.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-2.c
new file mode 100644
index 0000000..9154fb9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-3.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-3.c
new file mode 100644
index 0000000..9b7dac1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-a1-4.c b/gcc/testsuite/gcc.target/i386/memset-s64-a1-4.c
new file mode 100644
index 0000000..713c8a8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-a1-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-1.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-1.c
new file mode 100644
index 0000000..8c700c0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-2.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-2.c
new file mode 100644
index 0000000..c344fd0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "movq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-3.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-3.c
new file mode 100644
index 0000000..125de2f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s64-au-4.c b/gcc/testsuite/gcc.target/i386/memset-s64-au-4.c
new file mode 100644
index 0000000..b50de1b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s64-au-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	64
+#define OFFSET	1
+char *dst;
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 0, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-1.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-1.c
new file mode 100644
index 0000000..c6fd271
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-1.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-2.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-2.c
new file mode 100644
index 0000000..32972e6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-2.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-3.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-3.c
new file mode 100644
index 0000000..ac615e8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-3.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-4.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-4.c
new file mode 100644
index 0000000..8458cfd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-4.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=atom" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-5.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-5.c
new file mode 100644
index 0000000..210946d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-5.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-6.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-6.c
new file mode 100644
index 0000000..e63feae
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-6.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-7.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-7.c
new file mode 100644
index 0000000..72b2ba0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-7.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set ()
+{
+  memset (dst+OFFSET, 5, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/memset-s768-a0-8.c b/gcc/testsuite/gcc.target/i386/memset-s768-a0-8.c
new file mode 100644
index 0000000..cb5dc85
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-s768-a0-8.c
@@ -0,0 +1,15 @@
+/* Ensure that we use SSE-moves for memset where it's needed.  */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=corei7 -mstringop-strategy=unrolled_loop" } */
+extern void *memset (void *, int, __SIZE_TYPE__);
+#define SIZE	768
+#define OFFSET	0
+char dst[SIZE + OFFSET] = {};
+
+void
+do_set (char c)
+{
+  memset (dst+OFFSET, c, sizeof (dst[0]) * SIZE);
+}
+
+/* { dg-final { scan-assembler "%xmm" } } */


* Re: Use of vector instructions in memmov/memset expanding
@ 2011-07-13  8:53 Uros Bizjak
  2011-07-13 12:16 ` Michael Zolotukhin
  0 siblings, 1 reply; 52+ messages in thread
From: Uros Bizjak @ 2011-07-13  8:53 UTC (permalink / raw)
  To: gcc-patches; +Cc: Michael Zolotukhin, H.J. Lu, Richard Guenther

Hello!

> Please don't use -m32/-m64 in testcases directly.
> You should use
>
> /* { dg-do compile { target { ! ia32 } } } */
>
> for 64-bit tests and
>
> /* { dg-do compile { target { ia32 } } } */
>
> for 32-bit tests.

Also, there is no need to add -mtune if -march is already specified;
-mtune defaults to the value of -march.
To scan for %xmm registers, you don't have to add -dp to the compile
flags.  -dp also dumps the insn pattern names into the assembly output,
so unless you are looking for a specific pattern name, you should omit
-dp.
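
For reference, a testcase header that follows both suggestions could look
like the sketch below.  This is only an illustration; the file layout,
SIZE value and function body are made up here and are not taken from the
patch:

/* Minimal sketch: ia32 / ! ia32 target selectors instead of -m32/-m64,
   -march without an extra -mtune, and a plain %xmm scan without -dp.  */
/* { dg-do compile { target { ! ia32 } } } */
/* { dg-options "-O2 -march=corei7" } */
extern void *memcpy (void *, const void *, __SIZE_TYPE__);
#define SIZE 256
char dst[SIZE], src[SIZE];

void
do_copy (void)
{
  memcpy (dst, src, SIZE);
}

/* { dg-final { scan-assembler "%xmm" } } */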

Uros.


end of thread, other threads:[~2011-11-07 16:56 UTC | newest]

Thread overview: 52+ messages
     [not found] <CANtU07-DAOMe9Nk4oYj3FJnkZqgkHvSnobsugeSfcRUzDChrrg@mail.gmail.com>
2011-07-11 21:03 ` Use of vector instructions in memmov/memset expanding Michael Zolotukhin
2011-07-11 21:09 ` Michael Zolotukhin
2011-07-12  5:11   ` H.J. Lu
2011-07-12 20:35     ` Michael Zolotukhin
2011-07-16  2:51   ` Jan Hubicka
2011-07-18 11:25     ` Michael Zolotukhin
2011-07-26 15:46       ` Michael Zolotukhin
2011-08-22  9:52       ` Michael Zolotukhin
     [not found]       ` <CANtU07-eCpAZ=VgvkdBCORq8bR0UZCgryofBXU_4FcRDJ7hWoQ@mail.gmail.com>
2011-09-28 12:29         ` Michael Zolotukhin
2011-09-28 12:36           ` Michael Zolotukhin
2011-09-28 12:38             ` Michael Zolotukhin
2011-09-28 12:49             ` Andi Kleen
2011-09-28 12:49               ` Jakub Jelinek
2011-09-28 12:51                 ` Jan Hubicka
2011-09-28 13:31                   ` Michael Zolotukhin
2011-09-28 13:33                     ` Jan Hubicka
2011-09-28 13:51                       ` Michael Zolotukhin
2011-09-28 16:21                       ` Andi Kleen
2011-09-28 16:33                         ` Michael Zolotukhin
2011-09-28 18:29                           ` Andi Kleen
2011-09-28 18:36                             ` Andi Kleen
2011-09-29  8:25                               ` Michael Zolotukhin
2011-09-28 12:54                 ` Michael Zolotukhin
2011-09-28 14:15           ` Jack Howarth
2011-09-28 14:28             ` Michael Zolotukhin
2011-09-28 22:52               ` Jack Howarth
2011-09-29  8:13                 ` Michael Zolotukhin
2011-09-29 12:09                   ` Michael Zolotukhin
2011-09-29 12:12                     ` Michael Zolotukhin
2011-09-29 12:23                       ` Michael Zolotukhin
2011-09-29 13:02                     ` Jakub Jelinek
2011-10-20  8:39                       ` Michael Zolotukhin
2011-10-20  8:46                         ` Michael Zolotukhin
2011-10-20  8:46                           ` Michael Zolotukhin
2011-10-20  8:51                             ` Michael Zolotukhin
2011-10-26 20:36                               ` Michael Zolotukhin
2011-11-07  2:48                                 ` Jan Hubicka
2011-10-27 16:10                             ` Jan Hubicka
2011-10-28 13:14                               ` Michael Zolotukhin
2011-10-28 16:38                                 ` Richard Henderson
2011-11-01 17:36                                   ` Michael Zolotukhin
2011-11-01 17:48                                     ` Michael Zolotukhin
2011-11-02 19:12                                     ` Jan Hubicka
2011-11-02 19:37                                       ` Michael Zolotukhin
2011-11-02 19:55                                     ` Jan Hubicka
2011-11-03 12:56                                       ` Michael Zolotukhin
2011-11-06 14:28                                         ` Jan Hubicka
2011-11-07 15:52                                         ` Jan Hubicka
2011-11-07 16:24                                           ` Michael Zolotukhin
2011-11-07 16:59                                             ` Jan Hubicka
2011-07-13  8:53 Uros Bizjak
2011-07-13 12:16 ` Michael Zolotukhin
