public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* Backport AVX256 load/store split patches to gcc 4.6 for performance boost on latest AMD/Intel hardware.
@ 2011-06-20 17:01 harsha.jagasia
  2011-06-20 17:03 ` H.J. Lu
  0 siblings, 1 reply; 8+ messages in thread
From: harsha.jagasia @ 2011-06-20 17:01 UTC (permalink / raw)
  To: gcc-patches, hubicka, ubizjak, hongjiu.lu; +Cc: harsha.jagasia

Is it OK to backport the patches with the ChangeLogs below, which are already
in trunk, to gcc 4.6? These patches implement AVX 256-bit load/store
splitting. They make a significant performance difference (>=3%) on several
CPU2006 and Polyhedron benchmarks on the latest AMD and Intel hardware. If OK,
I will post the backported patches for commit approval.

AMD plans to submit additional patches for AVX-256 load/store splitting to
trunk. We will send backport requests for those later, once they are
accepted and committed to trunk.

[PATCH] Split 32-byte AVX unaligned load/store.
gcc/
2011-03-27  H.J. Lu  <hongjiu.lu@intel.com>
	* config/i386/i386.c (flag_opts): Add -mavx256-split-unaligned-load
	and -mavx256-split-unaligned-store.
	(ix86_option_override_internal): Split 32-byte AVX unaligned
	load/store by default.
	(ix86_avx256_split_vector_move_misalign): New.
	(ix86_expand_vector_move_misalign): Use it.

	* config/i386/i386.opt: Add -mavx256-split-unaligned-load and
	-mavx256-split-unaligned-store.

	* config/i386/sse.md (*avx_mov<mode>_internal): Verify unaligned
	256bit load/store.  Generate unaligned store on misaligned memory
	operand.
	(*avx_movu<ssemodesuffix><avxmodesuffix>): Verify unaligned
	256bit load/store.
	(*avx_movdqu<avxmodesuffix>): Likewise.

	* doc/invoke.texi: Document -mavx256-split-unaligned-load and
	-mavx256-split-unaligned-store.
gcc/testsuite/
2011-03-27  H.J. Lu  <hongjiu.lu@intel.com>
	* gcc.target/i386/avx256-unaligned-load-1.c: New.
	* gcc.target/i386/avx256-unaligned-load-2.c: Likewise.
	* gcc.target/i386/avx256-unaligned-load-3.c: Likewise.
	* gcc.target/i386/avx256-unaligned-load-4.c: Likewise.
	* gcc.target/i386/avx256-unaligned-load-5.c: Likewise.
	* gcc.target/i386/avx256-unaligned-load-6.c: Likewise.
	* gcc.target/i386/avx256-unaligned-load-7.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-1.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-2.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-3.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-4.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-5.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-6.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-7.c: Likewise.

[PATCH] Don't assert unaligned 256bit load/store.
2011-03-27  H.J. Lu  <hongjiu.lu@intel.com>
	* config/i386/sse.md (*avx_mov<mode>_internal): Don't assert
	unaligned 256bit load/store.
	(*avx_movu<ssemodesuffix><avxmodesuffix>): Likewise.
	(*avx_movdqu<avxmodesuffix>): Likewise.

[PATCH] Fix a typo in -mavx256-split-unaligned-store.
2011-03-28  H.J. Lu  <hongjiu.lu@intel.com>
	* config/i386/i386.c (flag_opts): Fix a typo in
	-mavx256-split-unaligned-store.

[PATCH] * config/i386/i386.c (ix86_reorg): Run move_or_delete_vzeroupper first.
2011-05-04  Uros Bizjak  <ubizjak@gmail.com>
	* config/i386/i386.c (ix86_reorg): Run move_or_delete_vzeroupper first.

[PATCH] Save the initial options after checking vzeroupper.
gcc/
2011-05-23  H.J. Lu  <hongjiu.lu@intel.com>
	PR target/47315
	* config/i386/i386.c (ix86_option_override_internal): Save the
	initial options after checking vzeroupper.
gcc/testsuite/
2011-05-23  H.J. Lu  <hongjiu.lu@intel.com>
	PR target/47315
	* gcc.target/i386/pr47315.c: New test.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Backport AVX256 load/store split patches to gcc 4.6 for performance boost on latest AMD/Intel hardware.
  2011-06-20 17:01 Backport AVX256 load/store split patches to gcc 4.6 for performance boost on latest AMD/Intel hardware harsha.jagasia
@ 2011-06-20 17:03 ` H.J. Lu
  2011-06-20 17:17   ` Jagasia, Harsha
  0 siblings, 1 reply; 8+ messages in thread
From: H.J. Lu @ 2011-06-20 17:03 UTC (permalink / raw)
  To: harsha.jagasia; +Cc: gcc-patches, hubicka, ubizjak, hongjiu.lu

On Mon, Jun 20, 2011 at 9:58 AM,  <harsha.jagasia@amd.com> wrote:
> Is it ok to backport patches, with Changelogs below, already in trunk to gcc
> 4.6? These patches are for AVX-256bit load store splitting. These patches
> make significant performance difference >=3% to several CPU2006 and
> Polyhedron benchmarks on latest AMD and Intel hardware. If ok, I will post
> backported patches for commit approval.
>
> AMD plans to submit additional patches on AVX-256 load/store splitting to
> trunk. We will send additional backport requests for those later once they
> are accepted/committed to trunk.
>

Since we will make some changes on trunk, I would prefer to do the
backport after the trunk changes are finished.

Thanks.


-- 
H.J.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: Backport AVX256 load/store split patches to gcc 4.6 for performance boost on latest AMD/Intel hardware.
  2011-06-20 17:03 ` H.J. Lu
@ 2011-06-20 17:17   ` Jagasia, Harsha
  2011-06-20 22:17     ` Fang, Changpeng
  2011-06-27 23:10     ` Fang, Changpeng
  0 siblings, 2 replies; 8+ messages in thread
From: Jagasia, Harsha @ 2011-06-20 17:17 UTC (permalink / raw)
  To: 'H.J. Lu'
  Cc: 'gcc-patches@gcc.gnu.org', 'hubicka@ucw.cz',
	'ubizjak@gmail.com', 'hongjiu.lu@intel.com',
	Fang, Changpeng

> On Mon, Jun 20, 2011 at 9:58 AM,  <harsha.jagasia@amd.com> wrote:
> > Is it ok to backport patches, with Changelogs below, already in trunk
> to gcc
> > 4.6? These patches are for AVX-256bit load store splitting. These
> patches
> > make significant performance difference >=3% to several CPU2006 and
> > Polyhedron benchmarks on latest AMD and Intel hardware. If ok, I will
> post
> > backported patches for commit approval.
> >
> > AMD plans to submit additional patches on AVX-256 load/store
> splitting to
> > trunk. We will send additional backport requests for those later once
> they
> > are accepted/committed to trunk.
> >
> 
> Since we will make some changes on trunk, I would prefer to do
> the backport after trunk change is finished.

Ok, thanks. Adding Changpeng who is working on the trunk changes.

Harsha


^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: Backport AVX256 load/store split patches to gcc 4.6 for performance boost on latest AMD/Intel hardware.
  2011-06-20 17:17   ` Jagasia, Harsha
@ 2011-06-20 22:17     ` Fang, Changpeng
  2011-06-20 23:38       ` Lu, Hongjiu
  2011-06-27 23:10     ` Fang, Changpeng
  1 sibling, 1 reply; 8+ messages in thread
From: Fang, Changpeng @ 2011-06-20 22:17 UTC (permalink / raw)
  To: Jagasia, Harsha, 'H.J. Lu'
  Cc: 'gcc-patches@gcc.gnu.org', 'hubicka@ucw.cz',
	'ubizjak@gmail.com', 'hongjiu.lu@intel.com'

The patch that disables default setting of unaligned load splitting for bdver1 has been committed
to trunk as revision 175230.

Here is the patch: http://gcc.gnu.org/ml/gcc-patches/2011-06/msg01518.html.

H.J., is there anything else pending to be fixed at the moment regarding avx256 load/store splitting?

If not, can we backport the set of patches to the 4.6 branch now?

Thanks,

Changpeng 





________________________________________
From: Jagasia, Harsha
Sent: Monday, June 20, 2011 12:03 PM
To: 'H.J. Lu'
Cc: 'gcc-patches@gcc.gnu.org'; 'hubicka@ucw.cz'; 'ubizjak@gmail.com'; 'hongjiu.lu@intel.com'; Fang, Changpeng
Subject: RE: Backport AVX256 load/store split patches to gcc 4.6 for performance boost on latest AMD/Intel hardware.

> On Mon, Jun 20, 2011 at 9:58 AM,  <harsha.jagasia@amd.com> wrote:
> > Is it ok to backport patches, with Changelogs below, already in trunk
> to gcc
> > 4.6? These patches are for AVX-256bit load store splitting. These
> patches
> > make significant performance difference >=3% to several CPU2006 and
> > Polyhedron benchmarks on latest AMD and Intel hardware. If ok, I will
> post
> > backported patches for commit approval.
> >
> > AMD plans to submit additional patches on AVX-256 load/store
> splitting to
> > trunk. We will send additional backport requests for those later once
> they
> > are accepted/comitted to trunk.
> >
>
> Since we will make some changes on trunk, I would prefer to do
> the backport after trunk change is finished.

Ok, thanks. Adding Changpeng who is working on the trunk changes.

Harsha


^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: Backport AVX256 load/store split patches to gcc 4.6 for performance boost on latest AMD/Intel hardware.
  2011-06-20 22:17     ` Fang, Changpeng
@ 2011-06-20 23:38       ` Lu, Hongjiu
  2011-06-21  8:46         ` Richard Guenther
  0 siblings, 1 reply; 8+ messages in thread
From: Lu, Hongjiu @ 2011-06-20 23:38 UTC (permalink / raw)
  To: Fang, Changpeng, Jagasia, Harsha, 'H.J. Lu'
  Cc: 'gcc-patches@gcc.gnu.org', 'hubicka@ucw.cz',
	'ubizjak@gmail.com'

> 
> The patch that disables default setting of unaligned load splitting
> for bdver1 has been committed
> to trunk as revision 175230.
> 
> Here is the patch: http://gcc.gnu.org/ml/gcc-patches/2011-
> 06/msg01518.html.
> 
> H. J., is there anything else that is pending to fix at this moment
> regarding avx256 load/store splitting?
> 
> If not, can we backport the set of patches to the 4.6 branch now?
> 

I have no problems with backporting now.

Thanks.

H.J.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Backport AVX256 load/store split patches to gcc 4.6 for performance boost on latest AMD/Intel hardware.
  2011-06-20 23:38       ` Lu, Hongjiu
@ 2011-06-21  8:46         ` Richard Guenther
  0 siblings, 0 replies; 8+ messages in thread
From: Richard Guenther @ 2011-06-21  8:46 UTC (permalink / raw)
  To: Lu, Hongjiu
  Cc: Fang, Changpeng, Jagasia, Harsha, H.J. Lu, gcc-patches, hubicka, ubizjak

On Tue, Jun 21, 2011 at 1:02 AM, Lu, Hongjiu <hongjiu.lu@intel.com> wrote:
>>
>> The patch that disables default setting of unaligned load splitting
>> for bdver1 has been committed
>> to trunk as revision 175230.
>>
>> Here is the patch: http://gcc.gnu.org/ml/gcc-patches/2011-
>> 06/msg01518.html.
>>
>> H. J., is there anything else that is pending to fix at this moment
>> regarding avx256 load/store splitting?
>>
>> If not, can we backport the set of patches to the 4.6 branch now?
>>
>
> I have no problems with backporting now.

The 4.6 branch is frozen at the moment, please refrain from checking
in anything.

Thanks,
Richard.

> Thanks.
>
> H.J.
>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: Backport AVX256 load/store split patches to gcc 4.6 for performance boost on latest AMD/Intel hardware.
  2011-06-20 17:17   ` Jagasia, Harsha
  2011-06-20 22:17     ` Fang, Changpeng
@ 2011-06-27 23:10     ` Fang, Changpeng
  2011-06-28  9:44       ` Richard Guenther
  1 sibling, 1 reply; 8+ messages in thread
From: Fang, Changpeng @ 2011-06-27 23:10 UTC (permalink / raw)
  To: Jagasia, Harsha, 'H.J. Lu', gcc-patches
  Cc: 'hubicka@ucw.cz', 'ubizjak@gmail.com',
	'hongjiu.lu@intel.com'

[-- Attachment #1: Type: text/plain, Size: 2031 bytes --]

Hi,

Attached are the patches related to avx256 unaligned load/store splitting that
we propose to backport to the gcc 4.6 branch. As we mentioned before, the
combined effect of these patches is positive on both AMD and Intel CPUs on
CPU2006 and Polyhedron 2005.

0001-Split-32-byte-AVX-unaligned-load-store.patch
Initial patch that implements unaligned load/store splitting

0001-Don-t-assert-unaligned-256bit-load-store.patch
Remove the assert.

0001-Fix-a-typo-in-mavx256-split-unaligned-store.patch
Fix a typo.

0002-pr49089-enable-avx256-splitting-unaligned-load-store.patch
Disable unaligned load splitting for bdver1.

All these patches are in 4.7 trunk.

Bootstrap and tests are ongoing on the gcc 4.6 branch.

Is it OK to commit to the 4.6 branch once the tests pass?

Thanks,

Changpeng 



________________________________________
From: Jagasia, Harsha
Sent: Monday, June 20, 2011 12:03 PM
To: 'H.J. Lu'
Cc: 'gcc-patches@gcc.gnu.org'; 'hubicka@ucw.cz'; 'ubizjak@gmail.com'; 'hongjiu.lu@intel.com'; Fang, Changpeng
Subject: RE: Backport AVX256 load/store split patches to gcc 4.6 for performance boost on latest AMD/Intel hardware.

> On Mon, Jun 20, 2011 at 9:58 AM,  <harsha.jagasia@amd.com> wrote:
> > Is it ok to backport patches, with Changelogs below, already in trunk
> to gcc
> > 4.6? These patches are for AVX-256bit load store splitting. These
> patches
> > make significant performance difference >=3% to several CPU2006 and
> > Polyhedron benchmarks on latest AMD and Intel hardware. If ok, I will
> post
> > backported patches for commit approval.
> >
> > AMD plans to submit additional patches on AVX-256 load/store
> splitting to
> > trunk. We will send additional backport requests for those later once
> they
> > are accepted/committed to trunk.
> >
>
> Since we will make some changes on trunk, I would prefer to do
> the backport after trunk change is finished.

Ok, thanks. Adding Changpeng who is working on the trunk changes.

Harsha


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Split-32-byte-AVX-unaligned-load-store.patch --]
[-- Type: text/x-patch; name="0001-Split-32-byte-AVX-unaligned-load-store.patch", Size: 27084 bytes --]

From b8cb8d5224d650672add0fb6a74d759ef12e428f Mon Sep 17 00:00:00 2001
From: hjl <hjl@138bc75d-0d04-0410-961f-82ee72b054a4>
Date: Sun, 27 Mar 2011 18:56:00 +0000
Subject: [PATCH] Split 32-byte AVX unaligned load/store.

gcc/

2011-03-27  H.J. Lu  <hongjiu.lu@intel.com>

	* config/i386/i386.c (flag_opts): Add -mavx256-split-unaligned-load
	and -mavx256-split-unaligned-store.
	(ix86_option_override_internal): Split 32-byte AVX unaligned
	load/store by default.
	(ix86_avx256_split_vector_move_misalign): New.
	(ix86_expand_vector_move_misalign): Use it.

	* config/i386/i386.opt: Add -mavx256-split-unaligned-load and
	-mavx256-split-unaligned-store.

	* config/i386/sse.md (*avx_mov<mode>_internal): Verify unaligned
	256bit load/store.  Generate unaligned store on misaligned memory
	operand.
	(*avx_movu<ssemodesuffix><avxmodesuffix>): Verify unaligned
	256bit load/store.
	(*avx_movdqu<avxmodesuffix>): Likewise.

	* doc/invoke.texi: Document -mavx256-split-unaligned-load and
	-mavx256-split-unaligned-store.

gcc/testsuite/

2011-03-27  H.J. Lu  <hongjiu.lu@intel.com>

	* gcc.target/i386/avx256-unaligned-load-1.c: New.
	* gcc.target/i386/avx256-unaligned-load-2.c: Likewise.
	* gcc.target/i386/avx256-unaligned-load-3.c: Likewise.
	* gcc.target/i386/avx256-unaligned-load-4.c: Likewise.
	* gcc.target/i386/avx256-unaligned-load-5.c: Likewise.
	* gcc.target/i386/avx256-unaligned-load-6.c: Likewise.
	* gcc.target/i386/avx256-unaligned-load-7.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-1.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-2.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-3.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-4.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-5.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-6.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-7.c: Likewise.


git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@171578 138bc75d-0d04-0410-961f-82ee72b054a4
---
 gcc/ChangeLog                                      |   22 ++++++
 gcc/config/i386/i386.c                             |   76 +++++++++++++++++--
 gcc/config/i386/i386.opt                           |    8 ++
 gcc/config/i386/sse.md                             |   42 ++++++++++--
 gcc/doc/invoke.texi                                |    9 ++-
 gcc/testsuite/ChangeLog                            |   17 +++++
 .../gcc.target/i386/avx256-unaligned-load-1.c      |   19 +++++
 .../gcc.target/i386/avx256-unaligned-load-2.c      |   29 ++++++++
 .../gcc.target/i386/avx256-unaligned-load-3.c      |   19 +++++
 .../gcc.target/i386/avx256-unaligned-load-4.c      |   19 +++++
 .../gcc.target/i386/avx256-unaligned-load-5.c      |   43 +++++++++++
 .../gcc.target/i386/avx256-unaligned-load-6.c      |   42 +++++++++++
 .../gcc.target/i386/avx256-unaligned-load-7.c      |   60 +++++++++++++++
 .../gcc.target/i386/avx256-unaligned-store-1.c     |   22 ++++++
 .../gcc.target/i386/avx256-unaligned-store-2.c     |   29 ++++++++
 .../gcc.target/i386/avx256-unaligned-store-3.c     |   22 ++++++
 .../gcc.target/i386/avx256-unaligned-store-4.c     |   20 +++++
 .../gcc.target/i386/avx256-unaligned-store-5.c     |   42 +++++++++++
 .../gcc.target/i386/avx256-unaligned-store-6.c     |   42 +++++++++++
 .../gcc.target/i386/avx256-unaligned-store-7.c     |   45 ++++++++++++
 20 files changed, 613 insertions(+), 14 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/avx256-unaligned-load-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx256-unaligned-load-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx256-unaligned-load-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx256-unaligned-load-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx256-unaligned-load-5.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx256-unaligned-load-6.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx256-unaligned-load-7.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx256-unaligned-store-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx256-unaligned-store-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx256-unaligned-store-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx256-unaligned-store-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx256-unaligned-store-5.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx256-unaligned-store-6.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx256-unaligned-store-7.c

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 41c0ef2..ca0e3d6 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,25 @@
+2011-03-27  H.J. Lu  <hongjiu.lu@intel.com>
+
+	* config/i386/i386.c (flag_opts): Add -mavx256-split-unaligned-load
+	and -mavx256-split-unaligned-store.
+	(ix86_option_override_internal): Split 32-byte AVX unaligned
+	load/store by default.
+	(ix86_avx256_split_vector_move_misalign): New.
+	(ix86_expand_vector_move_misalign): Use it.
+
+	* config/i386/i386.opt: Add -mavx256-split-unaligned-load and
+	-mavx256-split-unaligned-store.
+
+	* config/i386/sse.md (*avx_mov<mode>_internal): Verify unaligned
+	256bit load/store.  Generate unaligned store on misaligned memory
+	operand.
+	(*avx_movu<ssemodesuffix><avxmodesuffix>): Verify unaligned
+	256bit load/store.
+	(*avx_movdqu<avxmodesuffix>): Likewise.
+
+	* doc/invoke.texi: Document -mavx256-split-unaligned-load and
+	-mavx256-split-unaligned-store.
+
 2011-03-27  Richard Sandiford  <rdsandiford@googlemail.com>
 
 	PR target/38598
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 4e8ca69..a4ca762 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -3130,6 +3130,8 @@ ix86_target_string (int isa, int flags, const char *arch, const char *tune,
     { "-mvect8-ret-in-mem",		MASK_VECT8_RETURNS },
     { "-m8bit-idiv",			MASK_USE_8BIT_IDIV },
     { "-mvzeroupper",			MASK_VZEROUPPER },
+    { "-mavx256-split-unaligned-load",	MASK_AVX256_SPLIT_UNALIGNED_LOAD},
+    { "-mavx256-split-unaligned-stroe",	MASK_AVX256_SPLIT_UNALIGNED_STORE},
   };
 
   const char *opts[ARRAY_SIZE (isa_opts) + ARRAY_SIZE (flag_opts) + 6][2];
@@ -4274,11 +4276,18 @@ ix86_option_override_internal (bool main_args_p)
   if (TARGET_AVX)
     {
       /* When not optimize for size, enable vzeroupper optimization for
-	 TARGET_AVX with -fexpensive-optimizations.  */
-      if (!optimize_size
-	  && flag_expensive_optimizations
-	  && !(target_flags_explicit & MASK_VZEROUPPER))
-	target_flags |= MASK_VZEROUPPER;
+	 TARGET_AVX with -fexpensive-optimizations and split 32-byte
+	 AVX unaligned load/store.  */
+      if (!optimize_size)
+	{
+	  if (flag_expensive_optimizations
+	      && !(target_flags_explicit & MASK_VZEROUPPER))
+	    target_flags |= MASK_VZEROUPPER;
+	  if (!(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_LOAD))
+	    target_flags |= MASK_AVX256_SPLIT_UNALIGNED_LOAD;
+	  if (!(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_STORE))
+	    target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
+	}
     }
   else 
     {
@@ -15588,6 +15597,57 @@ ix86_expand_vector_move (enum machine_mode mode, rtx operands[])
   emit_insn (gen_rtx_SET (VOIDmode, op0, op1));
 }
 
+/* Split 32-byte AVX unaligned load and store if needed.  */
+
+static void
+ix86_avx256_split_vector_move_misalign (rtx op0, rtx op1)
+{
+  rtx m;
+  rtx (*extract) (rtx, rtx, rtx);
+  rtx (*move_unaligned) (rtx, rtx);
+  enum machine_mode mode;
+
+  switch (GET_MODE (op0))
+    {
+    default:
+      gcc_unreachable ();
+    case V32QImode:
+      extract = gen_avx_vextractf128v32qi;
+      move_unaligned = gen_avx_movdqu256;
+      mode = V16QImode;
+      break;
+    case V8SFmode:
+      extract = gen_avx_vextractf128v8sf;
+      move_unaligned = gen_avx_movups256;
+      mode = V4SFmode;
+      break;
+    case V4DFmode:
+      extract = gen_avx_vextractf128v4df;
+      move_unaligned = gen_avx_movupd256;
+      mode = V2DFmode;
+      break;
+    }
+
+  if (MEM_P (op1) && TARGET_AVX256_SPLIT_UNALIGNED_LOAD)
+    {
+      rtx r = gen_reg_rtx (mode);
+      m = adjust_address (op1, mode, 0);
+      emit_move_insn (r, m);
+      m = adjust_address (op1, mode, 16);
+      r = gen_rtx_VEC_CONCAT (GET_MODE (op0), r, m);
+      emit_move_insn (op0, r);
+    }
+  else if (MEM_P (op0) && TARGET_AVX256_SPLIT_UNALIGNED_STORE)
+    {
+      m = adjust_address (op0, mode, 0);
+      emit_insn (extract (m, op1, const0_rtx));
+      m = adjust_address (op0, mode, 16);
+      emit_insn (extract (m, op1, const1_rtx));
+    }
+  else
+    emit_insn (move_unaligned (op0, op1));
+}
+
 /* Implement the movmisalign patterns for SSE.  Non-SSE modes go
    straight to ix86_expand_vector_move.  */
 /* Code generation for scalar reg-reg moves of single and double precision data:
@@ -15672,7 +15732,7 @@ ix86_expand_vector_move_misalign (enum machine_mode mode, rtx operands[])
 	    case 32:
 	      op0 = gen_lowpart (V32QImode, op0);
 	      op1 = gen_lowpart (V32QImode, op1);
-	      emit_insn (gen_avx_movdqu256 (op0, op1));
+	      ix86_avx256_split_vector_move_misalign (op0, op1);
 	      break;
 	    default:
 	      gcc_unreachable ();
@@ -15688,7 +15748,7 @@ ix86_expand_vector_move_misalign (enum machine_mode mode, rtx operands[])
 	      emit_insn (gen_avx_movups (op0, op1));
 	      break;
 	    case V8SFmode:
-	      emit_insn (gen_avx_movups256 (op0, op1));
+	      ix86_avx256_split_vector_move_misalign (op0, op1);
 	      break;
 	    case V2DFmode:
 	      if (TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL)
@@ -15701,7 +15761,7 @@ ix86_expand_vector_move_misalign (enum machine_mode mode, rtx operands[])
 	      emit_insn (gen_avx_movupd (op0, op1));
 	      break;
 	    case V4DFmode:
-	      emit_insn (gen_avx_movupd256 (op0, op1));
+	      ix86_avx256_split_vector_move_misalign (op0, op1);
 	      break;
 	    default:
 	      gcc_unreachable ();
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index e02d098..f63a406 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -420,3 +420,11 @@ Emit profiling counter call at function entry before prologue.
 m8bit-idiv
 Target Report Mask(USE_8BIT_IDIV) Save
 Expand 32bit/64bit integer divide into 8bit unsigned integer divide with run-time check
+
+mavx256-split-unaligned-load
+Target Report Mask(AVX256_SPLIT_UNALIGNED_LOAD) Save
+Split 32-byte AVX unaligned load
+
+mavx256-split-unaligned-store
+Target Report Mask(AVX256_SPLIT_UNALIGNED_STORE) Save
+Split 32-byte AVX unaligned store
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 70a0b34..de11f73 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -203,19 +203,35 @@
       return standard_sse_constant_opcode (insn, operands[1]);
     case 1:
     case 2:
+      if (GET_MODE_ALIGNMENT (<MODE>mode) == 256
+	  && ((TARGET_AVX256_SPLIT_UNALIGNED_STORE
+	       && misaligned_operand (operands[0], <MODE>mode))
+	      || (TARGET_AVX256_SPLIT_UNALIGNED_LOAD
+		  && misaligned_operand (operands[1], <MODE>mode))))
+	gcc_unreachable ();
       switch (get_attr_mode (insn))
         {
 	case MODE_V8SF:
 	case MODE_V4SF:
-	  return "vmovaps\t{%1, %0|%0, %1}";
+	  if (misaligned_operand (operands[0], <MODE>mode)
+	      || misaligned_operand (operands[1], <MODE>mode))
+	    return "vmovups\t{%1, %0|%0, %1}";
+	  else
+	    return "vmovaps\t{%1, %0|%0, %1}";
 	case MODE_V4DF:
 	case MODE_V2DF:
-	  if (TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL)
+	  if (misaligned_operand (operands[0], <MODE>mode)
+	      || misaligned_operand (operands[1], <MODE>mode))
+	    return "vmovupd\t{%1, %0|%0, %1}";
+	  else if (TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL)
 	    return "vmovaps\t{%1, %0|%0, %1}";
 	  else
 	    return "vmovapd\t{%1, %0|%0, %1}";
 	default:
-	  if (TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL)
+	  if (misaligned_operand (operands[0], <MODE>mode)
+	      || misaligned_operand (operands[1], <MODE>mode))
+	    return "vmovdqu\t{%1, %0|%0, %1}";
+	  else if (TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL)
 	    return "vmovaps\t{%1, %0|%0, %1}";
 	  else
 	    return "vmovdqa\t{%1, %0|%0, %1}";
@@ -400,7 +416,15 @@
 	  UNSPEC_MOVU))]
   "AVX_VEC_FLOAT_MODE_P (<MODE>mode)
    && !(MEM_P (operands[0]) && MEM_P (operands[1]))"
-  "vmovu<ssemodesuffix>\t{%1, %0|%0, %1}"
+{
+  if (GET_MODE_ALIGNMENT (<MODE>mode) == 256
+      && ((TARGET_AVX256_SPLIT_UNALIGNED_STORE
+	   && misaligned_operand (operands[0], <MODE>mode))
+	  || (TARGET_AVX256_SPLIT_UNALIGNED_LOAD
+	      && misaligned_operand (operands[1], <MODE>mode))))
+    gcc_unreachable ();
+  return "vmovu<ssemodesuffix>\t{%1, %0|%0, %1}";
+}
   [(set_attr "type" "ssemov")
    (set_attr "movu" "1")
    (set_attr "prefix" "vex")
@@ -459,7 +483,15 @@
 	  [(match_operand:AVXMODEQI 1 "nonimmediate_operand" "xm,x")]
 	  UNSPEC_MOVU))]
   "TARGET_AVX && !(MEM_P (operands[0]) && MEM_P (operands[1]))"
-  "vmovdqu\t{%1, %0|%0, %1}"
+{
+  if (GET_MODE_ALIGNMENT (<MODE>mode) == 256
+      && ((TARGET_AVX256_SPLIT_UNALIGNED_STORE
+	   && misaligned_operand (operands[0], <MODE>mode))
+	  || (TARGET_AVX256_SPLIT_UNALIGNED_LOAD
+	      && misaligned_operand (operands[1], <MODE>mode))))
+    gcc_unreachable ();
+  return "vmovdqu\t{%1, %0|%0, %1}";
+}
   [(set_attr "type" "ssemov")
    (set_attr "movu" "1")
    (set_attr "prefix" "vex")
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 925455d..85bf2b4 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -602,7 +602,8 @@ Objective-C and Objective-C++ Dialects}.
 -momit-leaf-frame-pointer  -mno-red-zone -mno-tls-direct-seg-refs @gol
 -mcmodel=@var{code-model} -mabi=@var{name} @gol
 -m32  -m64 -mlarge-data-threshold=@var{num} @gol
--msse2avx -mfentry -m8bit-idiv}
+-msse2avx -mfentry -m8bit-idiv @gol
+-mavx256-split-unaligned-load -mavx256-split-unaligned-store}
 
 @emph{i386 and x86-64 Windows Options}
 @gccoptlist{-mconsole -mcygwin -mno-cygwin -mdll @gol
@@ -12669,6 +12670,12 @@ runt-time check.  If both dividend and divisor are within range of 0
 to 255, 8bit unsigned integer divide will be used instead of
 32bit/64bit integer divide.
 
+@item -mavx256-split-unaligned-load
+@item -mavx256-split-unaligned-store
+@opindex avx256-split-unaligned-load
+@opindex avx256-split-unaligned-store
+Split 32-byte AVX unaligned load and store.
+
 @end table
 
 These @samp{-m} switches are supported in addition to the above
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 3cc61b0..fdcc95f 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,20 @@
+2011-03-27  H.J. Lu  <hongjiu.lu@intel.com>
+
+	* gcc.target/i386/avx256-unaligned-load-1.c: New.
+	* gcc.target/i386/avx256-unaligned-load-2.c: Likewise.
+	* gcc.target/i386/avx256-unaligned-load-3.c: Likewise.
+	* gcc.target/i386/avx256-unaligned-load-4.c: Likewise.
+	* gcc.target/i386/avx256-unaligned-load-5.c: Likewise.
+	* gcc.target/i386/avx256-unaligned-load-6.c: Likewise.
+	* gcc.target/i386/avx256-unaligned-load-7.c: Likewise.
+	* gcc.target/i386/avx256-unaligned-store-1.c: Likewise.
+	* gcc.target/i386/avx256-unaligned-store-2.c: Likewise.
+	* gcc.target/i386/avx256-unaligned-store-3.c: Likewise.
+	* gcc.target/i386/avx256-unaligned-store-4.c: Likewise.
+	* gcc.target/i386/avx256-unaligned-store-5.c: Likewise.
+	* gcc.target/i386/avx256-unaligned-store-6.c: Likewise.
+	* gcc.target/i386/avx256-unaligned-store-7.c: Likewise.
+
 2011-03-27  Thomas Koenig  <tkoenig@gcc.gnu.org>
 
 	PR fortran/47065
diff --git a/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-1.c b/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-1.c
new file mode 100644
index 0000000..023e859
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-1.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -dp -mavx -mavx256-split-unaligned-load" } */
+
+#define N 1024
+
+float a[N], b[N+3], c[N];
+
+void
+avx_test (void)
+{
+  int i;
+
+  for (i = 0; i < N; i++)
+    c[i] = a[i] * b[i+3];
+}
+
+/* { dg-final { scan-assembler-not "\\*avx_movups256/1" } } */
+/* { dg-final { scan-assembler "\\*avx_movups/1" } } */
+/* { dg-final { scan-assembler "vinsertf128" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-2.c b/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-2.c
new file mode 100644
index 0000000..8394e27
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-2.c
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target lp64 } */
+/* { dg-options "-O3 -dp -mavx -mavx256-split-unaligned-load" } */
+
+#define N 1024
+
+char **ep;
+char **fp;
+
+void
+avx_test (void)
+{
+  int i;
+  char **ap;
+  char **bp;
+  char **cp;
+
+  ap = ep;
+  bp = fp;
+  for (i = 128; i >= 0; i--)
+    {
+      *ap++ = *cp++;
+      *bp++ = 0;
+    }
+}
+
+/* { dg-final { scan-assembler-not "\\*avx_movdqu256/1" } } */
+/* { dg-final { scan-assembler "\\*avx_movdqu/1" } } */
+/* { dg-final { scan-assembler "vinsertf128" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-3.c b/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-3.c
new file mode 100644
index 0000000..ec7d59d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-3.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -dp -mavx -mavx256-split-unaligned-load" } */
+
+#define N 1024
+
+double a[N], b[N+3], c[N];
+
+void
+avx_test (void)
+{
+  int i;
+
+  for (i = 0; i < N; i++)
+    c[i] = a[i] * b[i+3];
+}
+
+/* { dg-final { scan-assembler-not "\\*avx_movupd256/1" } } */
+/* { dg-final { scan-assembler "\\*avx_movupd/1" } } */
+/* { dg-final { scan-assembler "vinsertf128" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-4.c b/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-4.c
new file mode 100644
index 0000000..0d3ef33
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-4.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -dp -mavx -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store" } */
+
+#define N 1024
+
+float a[N], b[N+3];
+
+void
+avx_test (void)
+{
+  int i;
+
+  for (i = 0; i < N; i++)
+    b[i] = a[i+3] * 2;
+}
+
+/* { dg-final { scan-assembler "\\*avx_movups256/1" } } */
+/* { dg-final { scan-assembler-not "\\*avx_movups/1" } } */
+/* { dg-final { scan-assembler-not "vinsertf128" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-5.c b/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-5.c
new file mode 100644
index 0000000..153b66f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-5.c
@@ -0,0 +1,43 @@
+/* { dg-do run } */
+/* { dg-require-effective-target avx } */
+/* { dg-options "-O3 -dp -mavx -mavx256-split-unaligned-load" } */
+
+#include "avx-check.h"
+
+#define N 8
+
+float a[N+3] = { -1, -1, -1, 24.43, 68.346, 43.35,
+		 546.46, 46.79, 82.78, 82.7, 9.4 };
+float b[N];
+float c[N];
+
+void
+foo (void)
+{
+  int i;
+
+  for (i = 0; i < N; i++)
+    b[i] = a[i+3] * 2;
+}
+
+__attribute__ ((noinline))
+float
+bar (float x)
+{
+  return x * 2;
+}
+
+void
+avx_test (void)
+{
+  int i;
+
+  foo ();
+
+  for (i = 0; i < N; i++)
+    c[i] = bar (a[i+3]);
+
+  for (i = 0; i < N; i++)
+    if (b[i] != c[i])
+      abort ();
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-6.c b/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-6.c
new file mode 100644
index 0000000..2fa984c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-6.c
@@ -0,0 +1,42 @@
+/* { dg-do run } */
+/* { dg-require-effective-target avx } */
+/* { dg-options "-O3 -dp -mavx -mavx256-split-unaligned-load" } */
+
+#include "avx-check.h"
+
+#define N 4
+
+double a[N+3] = { -1, -1, -1, 24.43, 68.346, 43.35, 546.46 };
+double b[N];
+double c[N];
+
+void
+foo (void)
+{
+  int i;
+
+  for (i = 0; i < N; i++)
+    b[i] = a[i+3] * 2;
+}
+
+__attribute__ ((noinline))
+double
+bar (double x)
+{
+  return x * 2;
+}
+
+void
+avx_test (void)
+{
+  int i;
+
+  foo ();
+
+  for (i = 0; i < N; i++)
+    c[i] = bar (a[i+3]);
+
+  for (i = 0; i < N; i++)
+    if (b[i] != c[i])
+      abort ();
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-7.c b/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-7.c
new file mode 100644
index 0000000..ad16a53
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-7.c
@@ -0,0 +1,60 @@
+/* { dg-do run } */
+/* { dg-require-effective-target avx } */
+/* { dg-options "-O3 -dp -mavx -mavx256-split-unaligned-load" } */
+
+#include "avx-check.h"
+
+#define N 128
+
+char **ep;
+char **fp;
+char **mp;
+char **lp;
+
+__attribute__ ((noinline))
+void
+foo (void)
+{
+  mp = (char **) malloc (N);
+  lp = (char **) malloc (N);
+  ep = (char **) malloc (N);
+  fp = (char **) malloc (N);
+}
+
+void
+avx_test (void)
+{
+  int i;
+  char **ap, **bp, **cp, **dp;
+  char *str = "STR";
+
+  foo ();
+
+  cp = mp;
+  dp = lp;
+
+  for (i = N; i >= 0; i--)
+    {
+      *cp++ = str;
+      *dp++ = str;
+    }
+
+  ap = ep;
+  bp = fp;
+  cp = mp;
+  dp = lp;
+
+  for (i = N; i >= 0; i--)
+    {
+      *ap++ = *cp++;
+      *bp++ = *dp++;
+    }
+
+  for (i = N; i >= 0; i--)
+    {
+      if (strcmp (*--ap, "STR") != 0)
+	abort ();
+      if (strcmp (*--bp, "STR") != 0)
+	abort ();
+    }
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-1.c b/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-1.c
new file mode 100644
index 0000000..99db55c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-1.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -dp -mavx -mavx256-split-unaligned-store" } */
+
+#define N 1024
+
+float a[N], b[N+3], c[N], d[N];
+
+void
+avx_test (void)
+{
+  int i;
+
+  for (i = 0; i < N; i++)
+    b[i+3] = a[i] * 10.0;
+
+  for (i = 0; i < N; i++)
+    d[i] = c[i] * 20.0;
+}
+
+/* { dg-final { scan-assembler-not "\\*avx_movups256/2" } } */
+/* { dg-final { scan-assembler "movups.*\\*avx_movv4sf_internal/3" } } */
+/* { dg-final { scan-assembler "vextractf128" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-2.c b/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-2.c
new file mode 100644
index 0000000..38ee9e2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-2.c
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target lp64 } */
+/* { dg-options "-O3 -dp -mavx -mavx256-split-unaligned-store" } */
+
+#define N 1024
+
+char **ep;
+char **fp;
+
+void
+avx_test (void)
+{
+  int i;
+  char **ap;
+  char **bp;
+  char **cp;
+
+  ap = ep;
+  bp = fp;
+  for (i = 128; i >= 0; i--)
+    {
+      *ap++ = *cp++;
+      *bp++ = 0;
+    }
+}
+
+/* { dg-final { scan-assembler-not "\\*avx_movdqu256/2" } } */
+/* { dg-final { scan-assembler "movdqu.*\\*avx_movv16qi_internal/3" } } */
+/* { dg-final { scan-assembler "vextractf128" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-3.c b/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-3.c
new file mode 100644
index 0000000..eaab6fd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-3.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -dp -mavx -mavx256-split-unaligned-store" } */
+
+#define N 1024
+
+double a[N], b[N+3], c[N], d[N];
+
+void
+avx_test (void)
+{
+  int i;
+
+  for (i = 0; i < N; i++)
+    b[i+3] = a[i] * 10.0;
+
+  for (i = 0; i < N; i++)
+    d[i] = c[i] * 20.0;
+}
+
+/* { dg-final { scan-assembler-not "\\*avx_movupd256/2" } } */
+/* { dg-final { scan-assembler "movupd.*\\*avx_movv2df_internal/3" } } */
+/* { dg-final { scan-assembler "vextractf128" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-4.c b/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-4.c
new file mode 100644
index 0000000..96cca66
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-4.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -dp -mavx -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store" } */
+
+#define N 1024
+
+float a[N], b[N+3], c[N];
+
+void
+avx_test (void)
+{
+  int i;
+
+  for (i = 0; i < N; i++)
+    b[i+3] = a[i] * c[i];
+}
+
+/* { dg-final { scan-assembler "\\*avx_movups256/2" } } */
+/* { dg-final { scan-assembler-not "\\*avx_movups/2" } } */
+/* { dg-final { scan-assembler-not "\\*avx_movv4sf_internal/3" } } */
+/* { dg-final { scan-assembler-not "vextractf128" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-5.c b/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-5.c
new file mode 100644
index 0000000..642da3c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-5.c
@@ -0,0 +1,42 @@
+/* { dg-do run } */
+/* { dg-require-effective-target avx } */
+/* { dg-options "-O3 -dp -mavx -mavx256-split-unaligned-store" } */
+
+#include "avx-check.h"
+
+#define N 8
+
+float a[N] = { 24.43, 68.346, 43.35, 546.46, 46.79, 82.78, 82.7, 9.4 };
+float b[N+3];
+float c[N+3];
+
+void
+foo (void)
+{
+  int i;
+
+  for (i = 0; i < N; i++)
+    b[i+3] = a[i] * 2;
+}
+
+__attribute__ ((noinline))
+float
+bar (float x)
+{
+  return x * 2;
+}
+
+void
+avx_test (void)
+{
+  int i;
+
+  foo ();
+
+  for (i = 0; i < N; i++)
+    c[i+3] = bar (a[i]);
+
+  for (i = 0; i < N; i++)
+    if (b[i+3] != c[i+3])
+      abort ();
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-6.c b/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-6.c
new file mode 100644
index 0000000..a0de7a5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-6.c
@@ -0,0 +1,42 @@
+/* { dg-do run } */
+/* { dg-require-effective-target avx } */
+/* { dg-options "-O3 -dp -mavx -mavx256-split-unaligned-store" } */
+
+#include "avx-check.h"
+
+#define N 4
+
+double a[N] = { 24.43, 68.346, 43.35, 546.46 };
+double b[N+3];
+double c[N+3];
+
+void
+foo (void)
+{
+  int i;
+
+  for (i = 0; i < N; i++)
+    b[i+3] = a[i] * 2;
+}
+
+__attribute__ ((noinline))
+double
+bar (double x)
+{
+  return x * 2;
+}
+
+void
+avx_test (void)
+{
+  int i;
+
+  foo ();
+
+  for (i = 0; i < N; i++)
+    c[i+3] = bar (a[i]);
+
+  for (i = 0; i < N; i++)
+    if (b[i+3] != c[i+3])
+      abort ();
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-7.c b/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-7.c
new file mode 100644
index 0000000..4272dc3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-7.c
@@ -0,0 +1,45 @@
+/* { dg-do run } */
+/* { dg-require-effective-target avx } */
+/* { dg-options "-O3 -dp -mavx -mavx256-split-unaligned-store" } */
+
+#include "avx-check.h"
+
+#define N 128
+
+char **ep;
+char **fp;
+
+__attribute__ ((noinline))
+void
+foo (void)
+{
+  ep = (char **) malloc (N);
+  fp = (char **) malloc (N);
+}
+
+void
+avx_test (void)
+{
+  int i;
+  char **ap, **bp;
+  char *str = "STR";
+
+  foo ();
+
+  ap = ep;
+  bp = fp;
+
+  for (i = N; i >= 0; i--)
+    {
+      *ap++ = str;
+      *bp++ = str;
+    }
+
+  for (i = N; i >= 0; i--)
+    {
+      if (strcmp (*--ap, "STR") != 0)
+	abort ();
+      if (strcmp (*--bp, "STR") != 0)
+	abort ();
+    }
+}
-- 
1.6.0.2
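The tests above all exercise the same transformation: with `-mavx256-split-unaligned-load`/`-store`, a 32-byte unaligned vector access is emitted as two 16-byte halves (vmovups xmm plus vinsertf128/vextractf128) instead of a single 32-byte vmovups ymm. A plain-C sketch of the idea, with memcpy standing in for the vector moves (helper names are invented for illustration; the real splitting happens inside the compiler's move expander, not in user code):

```c
#include <string.h>

/* Sketch of what -mavx256-split-unaligned-load changes: instead of one
   32-byte unaligned load (vmovups ymm), the compiler emits a 16-byte
   load of the low half (vmovups xmm) and merges the high half with
   vinsertf128.  Modeled here with memcpy; helper names are invented.  */

typedef struct { unsigned char b[32]; } v256;

static v256
load256_single (const void *p)      /* one 256-bit unaligned access */
{
  v256 v;
  memcpy (v.b, p, 32);
  return v;
}

static v256
load256_split (const void *p)       /* two 128-bit halves */
{
  v256 v;
  const unsigned char *s = (const unsigned char *) p;
  memcpy (v.b, s, 16);              /* vmovups xmm: low 128 bits  */
  memcpy (v.b + 16, s + 16, 16);    /* vinsertf128: high 128 bits */
  return v;
}
```

Both forms load the same 32 bytes; only the instruction sequence (and its cost on a given CPU) differs, which is why the tests scan the generated assembly rather than compare results.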


[-- Attachment #3: 0001-Don-t-assert-unaligned-256bit-load-store.patch --]
[-- Type: text/x-patch; name="0001-Don-t-assert-unaligned-256bit-load-store.patch", Size: 2963 bytes --]

From 30d07ab33d8126c2ff061bbb7e4b221672721a62 Mon Sep 17 00:00:00 2001
From: hjl <hjl@138bc75d-0d04-0410-961f-82ee72b054a4>
Date: Mon, 28 Mar 2011 02:49:34 +0000
Subject: [PATCH] Don't assert unaligned 256bit load/store.

2011-03-27  H.J. Lu  <hongjiu.lu@intel.com>

	* config/i386/sse.md (*avx_mov<mode>_internal): Don't assert
	unaligned 256bit load/store.
	(*avx_movu<ssemodesuffix><avxmodesuffix>): Likewise.
	(*avx_movdqu<avxmodesuffix>): Likewise.


git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@171590 138bc75d-0d04-0410-961f-82ee72b054a4
---
 gcc/ChangeLog          |    7 +++++++
 gcc/config/i386/sse.md |   26 ++------------------------
 2 files changed, 9 insertions(+), 24 deletions(-)

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index d8bf12e..e9166cf 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,10 @@
+2011-03-27  H.J. Lu  <hongjiu.lu@intel.com>
+
+	* config/i386/sse.md (*avx_mov<mode>_internal): Don't assert
+	unaligned 256bit load/store.
+	(*avx_movu<ssemodesuffix><avxmodesuffix>): Likewise.
+	(*avx_movdqu<avxmodesuffix>): Likewise.
+
 2011-03-27  Vladimir Makarov  <vmakarov@redhat.com>
 
 	PR bootstrap/48307
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index de11f73..4c22bc5 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -203,12 +203,6 @@
       return standard_sse_constant_opcode (insn, operands[1]);
     case 1:
     case 2:
-      if (GET_MODE_ALIGNMENT (<MODE>mode) == 256
-	  && ((TARGET_AVX256_SPLIT_UNALIGNED_STORE
-	       && misaligned_operand (operands[0], <MODE>mode))
-	      || (TARGET_AVX256_SPLIT_UNALIGNED_LOAD
-		  && misaligned_operand (operands[1], <MODE>mode))))
-	gcc_unreachable ();
       switch (get_attr_mode (insn))
         {
 	case MODE_V8SF:
@@ -416,15 +410,7 @@
 	  UNSPEC_MOVU))]
   "AVX_VEC_FLOAT_MODE_P (<MODE>mode)
    && !(MEM_P (operands[0]) && MEM_P (operands[1]))"
-{
-  if (GET_MODE_ALIGNMENT (<MODE>mode) == 256
-      && ((TARGET_AVX256_SPLIT_UNALIGNED_STORE
-	   && misaligned_operand (operands[0], <MODE>mode))
-	  || (TARGET_AVX256_SPLIT_UNALIGNED_LOAD
-	      && misaligned_operand (operands[1], <MODE>mode))))
-    gcc_unreachable ();
-  return "vmovu<ssemodesuffix>\t{%1, %0|%0, %1}";
-}
+  "vmovu<ssemodesuffix>\t{%1, %0|%0, %1}"
   [(set_attr "type" "ssemov")
    (set_attr "movu" "1")
    (set_attr "prefix" "vex")
@@ -483,15 +469,7 @@
 	  [(match_operand:AVXMODEQI 1 "nonimmediate_operand" "xm,x")]
 	  UNSPEC_MOVU))]
   "TARGET_AVX && !(MEM_P (operands[0]) && MEM_P (operands[1]))"
-{
-  if (GET_MODE_ALIGNMENT (<MODE>mode) == 256
-      && ((TARGET_AVX256_SPLIT_UNALIGNED_STORE
-	   && misaligned_operand (operands[0], <MODE>mode))
-	  || (TARGET_AVX256_SPLIT_UNALIGNED_LOAD
-	      && misaligned_operand (operands[1], <MODE>mode))))
-    gcc_unreachable ();
-  return "vmovdqu\t{%1, %0|%0, %1}";
-}
+  "vmovdqu\t{%1, %0|%0, %1}"
   [(set_attr "type" "ssemov")
    (set_attr "movu" "1")
    (set_attr "prefix" "vex")
-- 
1.6.0.2


[-- Attachment #4: 0001-Fix-a-typo-in-mavx256-split-unaligned-store.patch --]
[-- Type: text/x-patch; name="0001-Fix-a-typo-in-mavx256-split-unaligned-store.patch", Size: 1612 bytes --]

From 8e4ee659317b006dce70bc231f2dacf182244be0 Mon Sep 17 00:00:00 2001
From: hjl <hjl@138bc75d-0d04-0410-961f-82ee72b054a4>
Date: Mon, 28 Mar 2011 20:40:41 +0000
Subject: [PATCH] Fix a typo in -mavx256-split-unaligned-store.

2011-03-28  H.J. Lu  <hongjiu.lu@intel.com>

	* config/i386/i386.c (flag_opts): Fix a typo in
	-mavx256-split-unaligned-store.


git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@171626 138bc75d-0d04-0410-961f-82ee72b054a4
---
 gcc/ChangeLog          |    5 +++++
 gcc/config/i386/i386.c |    2 +-
 2 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 1de788d..1469c80 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2011-03-28  H.J. Lu  <hongjiu.lu@intel.com>
+
+	* config/i386/i386.c (flag_opts): Fix a typo in
+	-mavx256-split-unaligned-store.
+
 2011-03-28  Anatoly Sokolov  <aesok@post.ru>
 
 	* config/h8300/h8300.h (FUNCTION_VALUE_REGNO_P, FUNCTION_VALUE,
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index a4ca762..8542238 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -3131,7 +3131,7 @@ ix86_target_string (int isa, int flags, const char *arch, const char *tune,
     { "-m8bit-idiv",			MASK_USE_8BIT_IDIV },
     { "-mvzeroupper",			MASK_VZEROUPPER },
     { "-mavx256-split-unaligned-load",	MASK_AVX256_SPLIT_UNALIGNED_LOAD},
-    { "-mavx256-split-unaligned-stroe",	MASK_AVX256_SPLIT_UNALIGNED_STORE},
+    { "-mavx256-split-unaligned-store",	MASK_AVX256_SPLIT_UNALIGNED_STORE},
   };
 
   const char *opts[ARRAY_SIZE (isa_opts) + ARRAY_SIZE (flag_opts) + 6][2];
-- 
1.6.0.2


[-- Attachment #5: 0002-pr49089-enable-avx256-splitting-unaligned-load-store.patch --]
[-- Type: text/x-patch; name="0002-pr49089-enable-avx256-splitting-unaligned-load-store.patch", Size: 2027 bytes --]

From 50310fc367348b406fc88d54c3ab54d1a304ad52 Mon Sep 17 00:00:00 2001
From: Changpeng Fang <chfang@huainan.(none)>
Date: Mon, 13 Jun 2011 13:13:32 -0700
Subject: [PATCH 2/2] pr49089: enable avx256 splitting unaligned load/store only when beneficial

	* config/i386/i386.c (avx256_split_unaligned_load): New definition.
	  (avx256_split_unaligned_store): New definition.
	  (ix86_option_override_internal): Enable avx256 unaligned load(store)
	  splitting only when avx256_split_unaligned_load(store) is set.
---
 gcc/config/i386/i386.c |   12 ++++++++++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 7b266b9..3bc0b53 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2121,6 +2121,12 @@ static const unsigned int x86_arch_always_fancy_math_387
   = m_PENT | m_ATOM | m_PPRO | m_AMD_MULTIPLE | m_PENT4
     | m_NOCONA | m_CORE2I7 | m_GENERIC;
 
+static const unsigned int x86_avx256_split_unaligned_load
+  = m_COREI7 | m_GENERIC;
+
+static const unsigned int x86_avx256_split_unaligned_store
+  = m_COREI7 | m_BDVER1 | m_GENERIC;
+
 /* In case the average insn count for single function invocation is
    lower than this constant, emit fast (but longer) prologue and
    epilogue code.  */
@@ -4194,9 +4200,11 @@ ix86_option_override_internal (bool main_args_p)
 	  if (flag_expensive_optimizations
 	      && !(target_flags_explicit & MASK_VZEROUPPER))
 	    target_flags |= MASK_VZEROUPPER;
-	  if (!(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_LOAD))
+	  if ((x86_avx256_split_unaligned_load & ix86_tune_mask)
+	      && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_LOAD))
 	    target_flags |= MASK_AVX256_SPLIT_UNALIGNED_LOAD;
-	  if (!(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_STORE))
+	  if ((x86_avx256_split_unaligned_store & ix86_tune_mask)
+	      && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_STORE))
 	    target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
 	}
     }
-- 
1.7.0.4
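The hunk above is a standard tuning-mask default: a per-flag set of CPUs is ANDed against the active tune mask, and the flag is defaulted on only when the CPU is in the set and the user did not pass the option explicitly (so an explicit `-mno-…` always wins). A stand-alone sketch of that gating with invented mask values (the real masks and flag bits live in i386.c and i386.opt):

```c
/* Illustrative tune and flag bits -- not the real GCC values.  */
enum
{
  TUNE_COREI7  = 1 << 0,
  TUNE_BDVER1  = 1 << 1,
  TUNE_GENERIC = 1 << 2
};

enum
{
  FLAG_SPLIT_UNALIGNED_LOAD  = 1 << 0,
  FLAG_SPLIT_UNALIGNED_STORE = 1 << 1
};

/* CPUs where splitting pays off, mirroring the intent of
   x86_avx256_split_unaligned_{load,store}: load splitting is not
   enabled for bdver1, store splitting is.  */
static const unsigned split_load_tunes  = TUNE_COREI7 | TUNE_GENERIC;
static const unsigned split_store_tunes = TUNE_COREI7 | TUNE_BDVER1 | TUNE_GENERIC;

/* Default a flag on only if this CPU benefits and the user did not
   set it explicitly on the command line.  */
static unsigned
override_flags (unsigned tune_mask, unsigned explicit_flags, unsigned flags)
{
  if ((split_load_tunes & tune_mask)
      && !(explicit_flags & FLAG_SPLIT_UNALIGNED_LOAD))
    flags |= FLAG_SPLIT_UNALIGNED_LOAD;
  if ((split_store_tunes & tune_mask)
      && !(explicit_flags & FLAG_SPLIT_UNALIGNED_STORE))
    flags |= FLAG_SPLIT_UNALIGNED_STORE;
  return flags;
}
```

With this shape, tuning for bdver1 defaults only store splitting on, while a user who passed either option explicitly keeps exactly what they asked for.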



* Re: Backport AVX256 load/store split patches to gcc 4.6 for performance boost on latest AMD/Intel hardware.
  2011-06-27 23:10     ` Fang, Changpeng
@ 2011-06-28  9:44       ` Richard Guenther
  0 siblings, 0 replies; 8+ messages in thread
From: Richard Guenther @ 2011-06-28  9:44 UTC (permalink / raw)
  To: Fang, Changpeng
  Cc: Jagasia, Harsha, H.J. Lu, gcc-patches, hubicka, ubizjak, hongjiu.lu

On Tue, Jun 28, 2011 at 12:33 AM, Fang, Changpeng
<Changpeng.Fang@amd.com> wrote:
> Hi,
>
> Attached are the patches we propose to backport to the gcc 4.6 branch that are related to avx256 unaligned load/store splitting.
> As we mentioned before, the combined effect of these patches is positive on both AMD and Intel CPUs on CPU2006 and
> Polyhedron 2005.
>
> 0001-Split-32-byte-AVX-unaligned-load-store.patch
> Initial patch that implements unaligned load/store splitting
>
> 0001-Don-t-assert-unaligned-256bit-load-store.patch
> Remove the assert.
>
> 0001-Fix-a-typo-in-mavx256-split-unaligned-store.patch
> Fix a typo.
>
> 0002-pr49089-enable-avx256-splitting-unaligned-load-store.patch
> Disable unaligned load splitting for bdver1.
>
> All these patches are in 4.7 trunk.
>
> Bootstrap and tests are ongoing on the gcc 4.6 branch.
>
> Is it OK to commit to the 4.6 branch as long as the tests pass?

Yes, if they have been approved and checked in for trunk.

Thanks,
Richard.

> Thanks,
>
> Changpeng
>
>
>
> ________________________________________
> From: Jagasia, Harsha
> Sent: Monday, June 20, 2011 12:03 PM
> To: 'H.J. Lu'
> Cc: 'gcc-patches@gcc.gnu.org'; 'hubicka@ucw.cz'; 'ubizjak@gmail.com'; 'hongjiu.lu@intel.com'; Fang, Changpeng
> Subject: RE: Backport AVX256 load/store split patches to gcc 4.6 for performance boost on latest AMD/Intel hardware.
>
>> On Mon, Jun 20, 2011 at 9:58 AM,  <harsha.jagasia@amd.com> wrote:
>> > Is it ok to backport patches, with Changelogs below, already in trunk
>> to gcc
>> > 4.6? These patches are for AVX-256bit load store splitting. These
>> patches
>> > make significant performance difference >=3% to several CPU2006 and
>> > Polyhedron benchmarks on latest AMD and Intel hardware. If ok, I will
>> post
>> > backported patches for commit approval.
>> >
>> > AMD plans to submit additional patches on AVX-256 load/store
>> splitting to
>> > trunk. We will send additional backport requests for those later once
>> they
>> > are accepted/committed to trunk.
>> >
>>
>> Since we will make some changes on trunk, I would prefer to do
>> the backport after the trunk change is finished.
>
> Ok, thanks. Adding Changpeng who is working on the trunk changes.
>
> Harsha
>
>

