public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* AVX generic mode tuning discussion.
@ 2011-07-12 22:26 harsha.jagasia
  2011-07-12 22:29 ` Richard Henderson
  0 siblings, 1 reply; 12+ messages in thread
From: harsha.jagasia @ 2011-07-12 22:26 UTC (permalink / raw)
  To: gcc-patches, hubicka, ubizjak, hjl.tools, Changpeng.Fang, rth
  Cc: harsha.jagasia

We would like to propose changing AVX generic mode tuning to generate 128-bit
AVX instead of 256-bit AVX. As per H.J's suggestion, we have reviewed the
various tuning choices made for generic mode with respect to AMD's upcoming
Bulldozer processor. At this moment, this is the most significant change we
have to propose. While we are willing to re-engineer generic mode, this
feature needs immediate discussion since the performance impact on Bulldozer
is significant.

Here is the relative CPU2006 performance data we have gathered using gcc on AMD
Bulldozer (BD) and Intel Sandybridge (SB) machines with "-Ofast -mtune=generic
-mavx".

		%gain/loss avx256 vs avx128
		(negative % indicates loss
		positive % indicates gain)

		AMD BD	Intel SB
410.bwaves	-2.34	-1.52   	   
416.gamess	-1.11	-0.30
433.milc	0.47	-1.75
434.zeusmp	-3.61	0.68
435.gromacs	-0.54	-0.38
436.cactusADM	-23.56	21.49
437.leslie3d	-0.44	1.56
444.namd	0.00	0.00
447.dealII	-0.36	-0.23
450.soplex	-0.43	-0.29
453.povray	0.50	3.63
454.calculix	-8.29	1.38
459.GemsFDTD	2.37	-1.54
465.tonto	0.00	0.00
470.lbm		0.00	0.21
481.wrf		-4.80	0.00
482.sphinx3	-10.20	-3.65
SpecFP		-3.29	1.01

400.perlbench	0.93	1.47
401.bzip2	0.60	0.00
403.gcc		0.00	0.00
429.mcf		0.00	-0.36
445.gobmk	-1.03	0.37
456.hmmer	-0.64	0.38
458.sjeng	1.74	0.00
462.libquantum	0.31	0.00
464.h264ref	0.00	0.00
471.omnetpp	-1.27	0.00
473.astar	0.00	0.46
483.xalancbmk	0.51	0.00
SpecINT	      	0.09	0.19

As per the data, the 1% performance gain for Intel Sandybridge on SpecFP is
eclipsed by a 3% degradation for AMD Bulldozer.

For the data above, generic mode splits both 256-bit misaligned loads and
stores, as is currently the case in trunk. 

Even if we disable 256-bit misaligned load splitting, AVX 256-bit performance
improves only by ~1.4% on SpecFP for AMD Bulldozer, while it drops by 0.12% on
Intel Sandybridge. In that configuration (AVX 256-bit load splitting disabled),
comparing AVX 256 to AVX 128 shows a cumulative 0.9% gain for Intel Sandybridge
versus a 1.9% loss for AMD Bulldozer, so AVX 256 is still not a fair choice for
generic mode.
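
For concreteness, here is a minimal sketch of the kind of kernel this tuning
choice affects, together with the compile commands one would use to compare
the three variants (the kernel and file name are illustrative and not taken
from the benchmark sources; -mprefer-avx128 and the
-mavx256-split-unaligned-load/-store options are assumed to be available in
the compiler revision used):

/* gcc -Ofast -mtune=generic -mno-avx -msse4.2 kernel.c      : SSE4.2
   gcc -Ofast -mtune=generic -mavx -mprefer-avx128 kernel.c  : 128-bit AVX
   gcc -Ofast -mtune=generic -mavx kernel.c                  : 256-bit AVX
   -mavx256-split-unaligned-load/-store control the splitting of misaligned
   256-bit accesses discussed above.  */
void
daxpy (double *__restrict__ y, const double *__restrict__ x, double a, int n)
{
  int i;
  for (i = 0; i < n; i++)
    y[i] += a * x[i];
}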

Please share your thoughts. It would be great if HJ could verify the Intel
Sandybridge data.

Thanks,
Harsha


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: AVX generic mode tuning discussion.
  2011-07-12 22:26 AVX generic mode tuning discussion harsha.jagasia
@ 2011-07-12 22:29 ` Richard Henderson
  2011-07-13  8:49   ` Richard Guenther
                     ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Richard Henderson @ 2011-07-12 22:29 UTC (permalink / raw)
  To: harsha.jagasia; +Cc: gcc-patches, hubicka, ubizjak, hjl.tools, Changpeng.Fang

On 07/12/2011 02:22 PM, harsha.jagasia@amd.com wrote:
> We would like to propose changing AVX generic mode tuning to generate 128-bit
> AVX instead of 256-bit AVX.

You indicate a 3% reduction on bulldozer with avx256.
How does avx128 compare to -mno-avx -msse4.2?
Will the next AMD generation have a useable avx256?

I'm not keen on the idea of generic mode being tuned
for a single processor revision that maybe shouldn't
actually be using avx at all.


r~

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: AVX generic mode tuning discussion.
  2011-07-12 22:29 ` Richard Henderson
@ 2011-07-13  8:49   ` Richard Guenther
  2011-07-13  9:07     ` Jakub Jelinek
                       ` (3 more replies)
  2011-07-21 21:18   ` Jagasia, Harsha
       [not found]   ` <63EE40A00BA43F49B85FACBB03F078B60821086630@sausexmbp02.amd.com>
  2 siblings, 4 replies; 12+ messages in thread
From: Richard Guenther @ 2011-07-13  8:49 UTC (permalink / raw)
  To: Richard Henderson
  Cc: harsha.jagasia, gcc-patches, hubicka, ubizjak, hjl.tools, Changpeng.Fang

On Tue, Jul 12, 2011 at 11:56 PM, Richard Henderson <rth@redhat.com> wrote:
> On 07/12/2011 02:22 PM, harsha.jagasia@amd.com wrote:
>> We would like to propose changing AVX generic mode tuning to generate 128-bit
>> AVX instead of 256-bit AVX.
>
> You indicate a 3% reduction on bulldozer with avx256.
> How does avx128 compare to -mno-avx -msse4.2?
> Will the next AMD generation have a useable avx256?
>
> I'm not keen on the idea of generic mode being tuned
> for a single processor revision that maybe shouldn't
> actually be using avx at all.

Btw, it looks like the data is massively skewed by
436.cactusADM.  What are the overall numbers if you
disregard cactus?  It's also for sure the case that the vectorizer
cost model has not been touched for avx256 vs. avx128 vs. sse,
so a more sensible approach would be to look at differentiating
things there to improve the cactus numbers.  Harsha, did you
investigate why avx256 is such a loss for cactus or why it is
so much of a win for SB?

I suppose generic tuning is of less importance for AVX as
people need to enable that manually anyway (and will possibly
do so only via means of -march=native).

Thanks,
Richard.

>
> r~
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: AVX generic mode tuning discussion.
  2011-07-13  8:49   ` Richard Guenther
@ 2011-07-13  9:07     ` Jakub Jelinek
  2011-07-21 21:37     ` Jagasia, Harsha
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Jakub Jelinek @ 2011-07-13  9:07 UTC (permalink / raw)
  To: Richard Guenther
  Cc: Richard Henderson, harsha.jagasia, gcc-patches, hubicka, ubizjak,
	hjl.tools, Changpeng.Fang

On Wed, Jul 13, 2011 at 10:42:41AM +0200, Richard Guenther wrote:
> I suppose generic tuning is of less importance for AVX as
> people need to enable that manually anyway (and will possibly
> do so only via means of -march=native).

Yeah, but if somebody does compile with -mavx -mtune=generic,
I'd expect the intent is that they want the fastest code not just on the
current generation of CPUs, but on the next few following ones, and I'd say
that being able to use twice as big a vectorization factor ought to be a win
in most cases if the cost model gets it right.  If not for the vectorization
factor doubling, what would be the reasons to compile code with
-mavx -mtune=generic and rule out support for many recent chips?
Yeah, there are the > 2 operand forms, and such code can avoid the penalty
when mixed with AVX256 code, but would that be a strong enough reason to lose
the support of most of the recent CPUs?  When targeting just a particular CPU
and using -march= with a CPU which already includes AVX, -mtune=generic
probably doesn't make much sense; you probably want -march=native so you are
optimizing for the CPU you have.
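
To make the vectorization factor point concrete, here is a trivial sketch
(assuming double-precision data, so the element counts in the comment are
purely illustrative):

/* With 128-bit vectors the vectorized loop body handles 2 doubles per
   iteration (VF = 2); with 256-bit vectors it handles 4 (VF = 4), so the
   vector loop runs roughly half as many iterations.  Whether that halving
   pays off depends on how the wider operations behave on the target, which
   is what the cost model has to get right.  */
void
scale (double *a, double s, int n)
{
  int i;
  for (i = 0; i < n; i++)
    a[i] *= s;
}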

	Jakub

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: AVX generic mode tuning discussion.
  2011-07-12 22:29 ` Richard Henderson
  2011-07-13  8:49   ` Richard Guenther
@ 2011-07-21 21:18   ` Jagasia, Harsha
       [not found]   ` <63EE40A00BA43F49B85FACBB03F078B60821086630@sausexmbp02.amd.com>
  2 siblings, 0 replies; 12+ messages in thread
From: Jagasia, Harsha @ 2011-07-21 21:18 UTC (permalink / raw)
  To: 'Richard Henderson'
  Cc: 'gcc-patches@gcc.gnu.org', 'hubicka@ucw.cz',
	'ubizjak@gmail.com', 'hjl.tools@gmail.com',
	Fang, Changpeng

> On 07/12/2011 02:22 PM, harsha.jagasia@amd.com wrote:
> > We would like to propose changing AVX generic mode tuning to generate
> 128-bit
> > AVX instead of 256-bit AVX.
> 
> You indicate a 3% reduction on bulldozer with avx256.
> How does avx128 compare to -mno-avx -msse4.2?

We see these % differences going from SSE42 to AVX128 to AVX256 on Bulldozer with "-mtune=generic -Ofast".
(Positive is improvement, negative is degradation)

Bulldozer:		
                 AVX128/SSE42    AVX256/AVX-128
410.bwaves          -1.4%           -1.4%
416.gamess          -1.1%            0.0%
433.milc             0.5%           -2.4%
434.zeusmp           9.7%           -2.1%
435.gromacs          5.1%            0.5%
436.cactusADM        8.2%          -23.8%
437.leslie3d         8.1%            0.4%
444.namd             3.6%            0.0%
447.dealII          -1.4%           -0.4%
450.soplex          -0.4%           -0.4%
453.povray           0.0%           -1.5%
454.calculix        15.7%           -8.3%
459.GemsFDTD         4.9%            1.4%
465.tonto            1.3%           -0.6%
470.lbm              0.9%            0.3%
481.wrf              7.3%           -3.6%
482.sphinx3          5.0%           -9.8%
SPECFP               3.8%           -3.2%

> Will the next AMD generation have a useable avx256?
> I'm not keen on the idea of generic mode being tuned
> for a single processor revision that maybe shouldn't
> actually be using avx at all.

We see a substantial gain in several SPECFP benchmarks going from SSE42 to AVX128 on Bulldozer.
IMHO, accomplishing even a 5% gain in an individual benchmark takes a hardware company several man months.
The loss with AVX256 for Bulldozer is much more significant than the gain for SandyBridge.
While the general trend in the industry is a move toward AVX256, for now we would be disadvantaging Bulldozer with this choice.

We have several customers who use -mtune=generic, which is the default unless a user explicitly overrides it with -mtune=native. They are the ones who want to experiment with the latest ISA using gcc, but want to keep their ISA selection and tuning agnostic across x86-64 processors. IMHO, it is with these customers in mind that generic was introduced in the first place.

Thanks,
Harsha


^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: AVX generic mode tuning discussion.
  2011-07-13  8:49   ` Richard Guenther
  2011-07-13  9:07     ` Jakub Jelinek
@ 2011-07-21 21:37     ` Jagasia, Harsha
  2013-01-07 18:24     ` FW: " Jagasia, Harsha
       [not found]     ` <873A3B0C5474B84F92B91855BCB4FE1625438297@sausexdag01.amd.com>
  3 siblings, 0 replies; 12+ messages in thread
From: Jagasia, Harsha @ 2011-07-21 21:37 UTC (permalink / raw)
  To: 'Richard Guenther', 'Richard Henderson'
  Cc: 'gcc-patches@gcc.gnu.org', 'hubicka@ucw.cz',
	'ubizjak@gmail.com', 'hjl.tools@gmail.com',
	Fang, Changpeng

> >> We would like to propose changing AVX generic mode tuning to
> generate 128-bit
> >> AVX instead of 256-bit AVX.
> >
> > You indicate a 3% reduction on bulldozer with avx256.
> > How does avx128 compare to -mno-avx -msse4.2?
> > Will the next AMD generation have a useable avx256?
> >
> > I'm not keen on the idea of generic mode being tuned
> > for a single processor revision that maybe shouldn't
> > actually be using avx at all.
> 
> Btw, it looks like the data is massively skewed by
> 436.cactusADM.  What are the overall numbers if you
> disregard cactus?  

Disregarding cactus, these are the cumulative SpecFP scores we see.

On Bulldozer:
			AVX256/AVX128
SPECFP		-1.8%

On SandyBridge:
			AVX256/AVX128
SPECFP		-0.15%

> It's also for sure the case that the vectorizer
> cost model has not been touched for avx256 vs. avx128 vs. sse,
> so a more sensible approach would be to look at differentiating
> things there to improve the cactus numbers.  

I am not sure how much the vectorizer cost model can help here.
The cost model can decide whether to vectorize and/or what vectorization factor to use.
But in generic mode, that decision has to be processor family neutral anyway.

> Harsha, did you
> investigate why avx256 is such a loss for cactus or why it is
> so much of a win for SB?

We are planning to investigate cactus and other cases to better understand the reasons behind these observations on Bulldozer, but disregarding cactus, there appear to be no significant gains on Sandybridge with AVX256 over AVX128 either.

Thanks,
Harsha


^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: AVX generic mode tuning discussion.
       [not found]   ` <63EE40A00BA43F49B85FACBB03F078B60821086630@sausexmbp02.amd.com>
@ 2011-10-31 21:21     ` Jagasia, Harsha
  2011-11-01  9:47       ` Richard Guenther
  0 siblings, 1 reply; 12+ messages in thread
From: Jagasia, Harsha @ 2011-10-31 21:21 UTC (permalink / raw)
  To: 'Richard Henderson'
  Cc: 'gcc-patches@gcc.gnu.org', 'hubicka@ucw.cz',
	'ubizjak@gmail.com', 'hjl.tools@gmail.com'

> > > We would like to propose changing AVX generic mode tuning to
> generate
> > 128-bit
> > > AVX instead of 256-bit AVX.
> >
> > You indicate a 3% reduction on bulldozer with avx256.
> > How does avx128 compare to -mno-avx -msse4.2?
> 
> We see these % differences going from SSE42 to AVX128 to AVX256 on
> Bulldozer with "-mtune=generic -Ofast".
> (Positive is improvement, negative is degradation)
> 
> Bulldozer:
> 			AVX128/SSE42	AVX256/AVX-128
> 410.bwaves		-1.4%			-1.4%
> 416.gamess		-1.1%			0.0%
> 433.milc		0.5%			-2.4%
> 434.zeusmp		9.7%			-2.1%
> 435.gromacs		5.1%			0.5%
> 436.cactusADM	8.2%			-23.8%
> 437.leslie3d	8.1%			0.4%
> 444.namd		3.6%			0.0%
> 447.dealII		-1.4%			-0.4%
> 450.soplex		-0.4%			-0.4%
> 453.povray		0.0%			-1.5%
> 454.calculix	15.7%			-8.3%
> 459.GemsFDTD	4.9%			1.4%
> 465.tonto		1.3%			-0.6%
> 470.lbm		0.9%			0.3%
> 481.wrf		7.3%			-3.6%
> 482.sphinx3		5.0%			-9.8%
> SPECFP		3.8%			-3.2%
> 
> > Will the next AMD generation have a useable avx256?
>> > I'm not keen on the idea of generic mode being tuned
> > for a single processor revision that maybe shouldn't
> > actually be using avx at all.
> 
> We see a substantial gain in several SPECFP benchmarks going from SSE42
> to AVX128 on Bulldozer.
> IMHO, accomplishing even a 5% gain in an individual benchmark takes a
> hardware company several man months.
> The loss with AVX256 for Bulldozer is much more significant than the
> gain for SandyBridge.
> While the general trend in the industry is a move toward AVX256, for
> now we would be disadvantaging Bulldozer with this choice.
> 
> We have several customers who use -mtune=generic and it is default,
> unless a user explicitly overrides it with -mtune=native. They are the
> ones who want to experiment with latest ISA using gcc, but want to keep
> their ISA selection and tuning agnostic on x86/64. IMHO, it is with
> these customers in mind that generic was introduced in the first place.

Since stage 1 closure is around the corner, I just wanted to ping to see if the maintainers have made up their minds on this one.
AVX-128 is an improvement over SSE42 for Bulldozer, and AVX-256 wipes out pretty much all of that gain in generic mode.
Until there is convergence on AVX-256 for x86/64, we would like to propose having generic generate avx-128 by default, and letting users override to avx-256 manually when it is known to benefit performance.

Thanks,
Harsha


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: AVX generic mode tuning discussion.
  2011-10-31 21:21     ` Jagasia, Harsha
@ 2011-11-01  9:47       ` Richard Guenther
  2011-11-02 17:17         ` Jagasia, Harsha
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Guenther @ 2011-11-01  9:47 UTC (permalink / raw)
  To: Jagasia, Harsha
  Cc: Richard Henderson, gcc-patches, hubicka, ubizjak, hjl.tools

On Mon, Oct 31, 2011 at 9:36 PM, Jagasia, Harsha <harsha.jagasia@amd.com> wrote:
>> > > We would like to propose changing AVX generic mode tuning to
>> generate
>> > 128-bit
>> > > AVX instead of 256-bit AVX.
>> >
>> > You indicate a 3% reduction on bulldozer with avx256.
>> > How does avx128 compare to -mno-avx -msse4.2?
>>
>> We see these % differences going from SSE42 to AVX128 to AVX256 on
>> Bulldozer with "-mtune=generic -Ofast".
>> (Positive is improvement, negative is degradation)
>>
>> Bulldozer:
>>                       AVX128/SSE42    AVX256/AVX-128
>> 410.bwaves            -1.4%                   -1.4%
>> 416.gamess            -1.1%                   0.0%
>> 433.milc              0.5%                    -2.4%
>> 434.zeusmp            9.7%                    -2.1%
>> 435.gromacs           5.1%                    0.5%
>> 436.cactusADM 8.2%                    -23.8%
>> 437.leslie3d  8.1%                    0.4%
>> 444.namd              3.6%                    0.0%
>> 447.dealII            -1.4%                   -0.4%
>> 450.soplex            -0.4%                   -0.4%
>> 453.povray            0.0%                    -1.5%
>> 454.calculix  15.7%                   -8.3%
>> 459.GemsFDTD  4.9%                    1.4%
>> 465.tonto             1.3%                    -0.6%
>> 470.lbm               0.9%                    0.3%
>> 481.wrf               7.3%                    -3.6%
>> 482.sphinx3           5.0%                    -9.8%
>> SPECFP                3.8%                    -3.2%
>>
>> > Will the next AMD generation have a useable avx256?
>> > I'm not keen on the idea of generic mode being tuned
>> > for a single processor revision that maybe shouldn't
>> > actually be using avx at all.
>>
>> We see a substantial gain in several SPECFP benchmarks going from SSE42
>> to AVX128 on Bulldozer.
>> IMHO, accomplishing even a 5% gain in an individual benchmark takes a
>> hardware company several man months.
>> The loss with AVX256 for Bulldozer is much more significant than the
>> gain for SandyBridge.
>> While the general trend in the industry is a move toward AVX256, for
>> now we would be disadvantaging Bulldozer with this choice.
>>
>> We have several customers who use -mtune=generic and it is default,
>> unless a user explicitly overrides it with -mtune=native. They are the
>> ones who want to experiment with latest ISA using gcc, but want to keep
>> their ISA selection and tuning agnostic on x86/64. IMHO, it is with
>> these customers in mind that generic was introduced in the first place.
>
> Since stage 1 closure is around the corner, just wanted to ping to see if the maintainers have made up their mind on this one.
> AVX-128 is an improvement over SSE42 for Bulldozer and AVX-256 wipes out pretty much all of that gain in generic mode.
> Until there is a convergence on AVX-256 for x86/64, we would like to propose having generic generate avx-128 by default and have a user override to avx-256 manually when known to benefit performance.

Did somebody spend the time analyzing why CactusADM shows so much of a
difference?  With the recent improvements in vectorizing for AVX, did you
re-do the measurements with a recent trunk?

I don't think disabling avx-256 by default is a good idea until we
understand why these numbers happen and are convinced we cannot fix this
by proper cost modeling.

Richard.

> Thanks,
> Harsha
>
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: AVX generic mode tuning discussion.
  2011-11-01  9:47       ` Richard Guenther
@ 2011-11-02 17:17         ` Jagasia, Harsha
  2011-11-02 20:50           ` Richard Guenther
  0 siblings, 1 reply; 12+ messages in thread
From: Jagasia, Harsha @ 2011-11-02 17:17 UTC (permalink / raw)
  To: 'Richard Guenther'
  Cc: 'Richard Henderson', 'gcc-patches@gcc.gnu.org',
	'hubicka@ucw.cz', 'ubizjak@gmail.com',
	'hjl.tools@gmail.com'

> >> > > We would like to propose changing AVX generic mode tuning to
> >> generate
> >> > 128-bit
> >> > > AVX instead of 256-bit AVX.
> >> >
> >> > You indicate a 3% reduction on bulldozer with avx256.
> >> > How does avx128 compare to -mno-avx -msse4.2?
> >>
> >> We see these % differences going from SSE42 to AVX128 to AVX256 on
> >> Bulldozer with "-mtune=generic -Ofast".
> >> (Positive is improvement, negative is degradation)
> >>
> >> Bulldozer:
> >>                       AVX128/SSE42    AVX256/AVX-128
> >> 410.bwaves            -1.4%                   -1.4%
> >> 416.gamess            -1.1%                   0.0%
> >> 433.milc              0.5%                    -2.4%
> >> 434.zeusmp            9.7%                    -2.1%
> >> 435.gromacs           5.1%                    0.5%
> >> 436.cactusADM         8.2%                    -23.8%
> >> 437.leslie3d          8.1%                    0.4%
> >> 444.namd              3.6%                    0.0%
> >> 447.dealII            -1.4%                   -0.4%
> >> 450.soplex            -0.4%                   -0.4%
> >> 453.povray            0.0%                    -1.5%
> >> 454.calculix          15.7%                   -8.3%
> >> 459.GemsFDTD          4.9%                    1.4%
> >> 465.tonto             1.3%                    -0.6%
> >> 470.lbm               0.9%                    0.3%
> >> 481.wrf               7.3%                    -3.6%
> >> 482.sphinx3           5.0%                    -9.8%
> >> SPECFP                3.8%                    -3.2%
> >>
> >> > Will the next AMD generation have a useable avx256?
> >> > I'm not keen on the idea of generic mode being tuned
> >> > for a single processor revision that maybe shouldn't
> >> > actually be using avx at all.
> >>
> >> We see a substantial gain in several SPECFP benchmarks going from
> SSE42
> >> to AVX128 on Bulldozer.
> >> IMHO, accomplishing even a 5% gain in an individual benchmark takes
> a
> >> hardware company several man months.
> >> The loss with AVX256 for Bulldozer is much more significant than the
> >> gain for SandyBridge.
> >> While the general trend in the industry is a move toward AVX256, for
> >> now we would be disadvantaging Bulldozer with this choice.
> >>
> >> We have several customers who use -mtune=generic and it is default,
> >> unless a user explicitly overrides it with -mtune=native. They are
> the
> >> ones who want to experiment with latest ISA using gcc, but want to
> keep
> >> their ISA selection and tuning agnostic on x86/64. IMHO, it is with
> >> these customers in mind that generic was introduced in the first
> place.
> >
> > Since stage 1 closure is around the corner, just wanted to ping to
> see if the maintainers have made up their mind on this one.
> > AVX-128 is an improvement over SSE42 for Bulldozer and AVX-256 wipes
> out pretty much all of that gain in generic mode.
> > Until there is a convergence on AVX-256 for x86/64, we would like to
> propose having generic generate avx-128 by default and have a user
> override to avx-256 manually when known to benefit performance.
> 
> Did somebody spend the time analyzing why CactusADM shows so much of a
> difference?  With the recent improvements in vectorizing for AVX, did you
> re-do the measurements with a recent trunk?
> 
> I don't think disabling avx-256 by default is a good idea until we
> understand why these numbers happen and are convinced we cannot fix this
> by proper cost modeling.

We have observed cases where AVX 256-bit code is slower than AVX 128-bit code
on Bulldozer.  This is because internally the front end, data paths etc. for
Bulldozer are designed to be optimal for AVX 128-bit.  Throwing densely packed
256-bit code at the pipeline can congest the front end, causing stalls and
hence slowdowns.  We expect the behavior of cactus, calculix and sphinx, which
are the 3 benchmarks with the biggest avx-256 gaps, to be in the same vein.

In general, the hardware design engineers recommend running AVX 128-bit code
on Bulldozer.  Given the underlying hardware design, software tuning can't
really change the results here.  Any further analysis of cactus would be a
cycle sink at our end, and we may not even be able to discuss the details on a
public mailing list.  x86-64 has not yet converged on avx-256 and generic mode
should reflect that.

Posting the re-measurements on trunk for cactus, calculix and sphinx on Bulldozer:
                AVX128/SSE42    AVX256/AVX-128
436.cactusADM       10%             -30%
454.calculix        14.7%            -6%
482.sphinx3          7%              -9%

All positive % above are improvements, all negative % are degradations.

I will post re-measurements for all of Spec with latest trunk as soon as I have them.

Thoughts?

Thanks,
Harsha


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: AVX generic mode tuning discussion.
  2011-11-02 17:17         ` Jagasia, Harsha
@ 2011-11-02 20:50           ` Richard Guenther
  0 siblings, 0 replies; 12+ messages in thread
From: Richard Guenther @ 2011-11-02 20:50 UTC (permalink / raw)
  To: Jagasia, Harsha
  Cc: Richard Henderson, gcc-patches, hubicka, ubizjak, hjl.tools

On Wed, Nov 2, 2011 at 5:57 PM, Jagasia, Harsha <harsha.jagasia@amd.com> wrote:
>> >> > > We would like to propose changing AVX generic mode tuning to
>> >> generate
>> >> > 128-bit
>> >> > > AVX instead of 256-bit AVX.
>> >> >
>> >> > You indicate a 3% reduction on bulldozer with avx256.
>> >> > How does avx128 compare to -mno-avx -msse4.2?
>> >>
>> >> We see these % differences going from SSE42 to AVX128 to AVX256 on
>> >> Bulldozer with "-mtune=generic -Ofast".
>> >> (Positive is improvement, negative is degradation)
>> >>
>> >> Bulldozer:
>> >>                       AVX128/SSE42    AVX256/AVX-128
>> >> 410.bwaves            -1.4%                   -1.4%
>> >> 416.gamess            -1.1%                   0.0%
>> >> 433.milc              0.5%                    -2.4%
>> >> 434.zeusmp            9.7%                    -2.1%
>> >> 435.gromacs           5.1%                    0.5%
>> >> 436.cactusADM         8.2%                    -23.8%
>> >> 437.leslie3d          8.1%                    0.4%
>> >> 444.namd              3.6%                    0.0%
>> >> 447.dealII            -1.4%                   -0.4%
>> >> 450.soplex            -0.4%                   -0.4%
>> >> 453.povray            0.0%                    -1.5%
>> >> 454.calculix          15.7%                   -8.3%
>> >> 459.GemsFDTD          4.9%                    1.4%
>> >> 465.tonto             1.3%                    -0.6%
>> >> 470.lbm               0.9%                    0.3%
>> >> 481.wrf               7.3%                    -3.6%
>> >> 482.sphinx3           5.0%                    -9.8%
>> >> SPECFP                3.8%                    -3.2%
>> >>
>> >> > Will the next AMD generation have a useable avx256?
>> >> > I'm not keen on the idea of generic mode being tuned
>> >> > for a single processor revision that maybe shouldn't
>> >> > actually be using avx at all.
>> >>
>> >> We see a substantial gain in several SPECFP benchmarks going from
>> SSE42
>> >> to AVX128 on Bulldozer.
>> >> IMHO, accomplishing even a 5% gain in an individual benchmark takes
>> a
>> >> hardware company several man months.
>> >> The loss with AVX256 for Bulldozer is much more significant than the
>> >> gain for SandyBridge.
>> >> While the general trend in the industry is a move toward AVX256, for
>> >> now we would be disadvantaging Bulldozer with this choice.
>> >>
>> >> We have several customers who use -mtune=generic and it is default,
>> >> unless a user explicitly overrides it with -mtune=native. They are
>> the
>> >> ones who want to experiment with latest ISA using gcc, but want to
>> keep
>> >> their ISA selection and tuning agnostic on x86/64. IMHO, it is with
>> >> these customers in mind that generic was introduced in the first
>> place.
>> >
>> > Since stage 1 closure is around the corner, just wanted to ping to
>> see if the maintainers have made up their mind on this one.
>> > AVX-128 is an improvement over SSE42 for Bulldozer and AVX-256 wipes
>> out pretty much all of that gain in generic mode.
>> > Until there is a convergence on AVX-256 for x86/64, we would like to
>> propose having generic generate avx-128 by default and have a user
>> override to avx-256 manually when known to benefit performance.
>>
>> Did somebody spend the time analyzing why CactusADM shows so much of a
>> difference?  With the recent improvements in vectorizing for AVX, did you
>> re-do the measurements with a recent trunk?
>>
>> I don't think disabling avx-256 by default is a good idea until we
>> understand why these numbers happen and are convinced we cannot fix this
>> by proper cost modeling.
>
> We have observed cases where AVX-256 bit code is slower than AVX-128 bit code on Bulldozer. This is because internally the front end, data paths etc for Bulldozer are designed for optimal AVX 128-bit. Throwing densely packed 256-bit code at the pipeline can congest the front end causing stalls and hence slowdowns. We expect the behavior of cactus, calculix and sphinx, which are the 3 benchmarks with the biggest avx-256 gaps, to be in the same vein. In general, the hardware design engineers recommend running AVX 128-bit code on Bulldozer. Given the underlying hardware design, software tuning can't really change the results here. Any further analysis of cactus would be a cycle sink at our end and we may not even be able to discuss the details on a public mailing list. x86/64 has not yet converged on avx-256 and generic mode should reflect that.

Well, generic hasn't converged on AVX at all.  Cost modeling can deal
with code density just fine - are there any differences between code
density issues of, say, loads vs. stores vs. arithmetic?  I specifically
ask about analysis because AVX-256 has instruction set issues for certain
patterns the vectorizer generates, and the cost model currently does not
reflect these at all.

Richard.

> Posting the re-measurements on trunk for cactus, calculix and sphinx on Bulldozer:
>                AVX128/SSE42    AVX256/AVX-128
> 436.cactusADM   10%                     -30%
> 454.calculix    14.7%                   -6%
> 482.sphinx3         7%                  -9%
>
> All positive % above are improvements, all negative % are degradations.
>
> I will post re-measurements for all of Spec with latest trunk as soon as I have them.
>
> Thoughts?
>
> Thanks,
> Harsha
>
>
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* FW: AVX generic mode tuning discussion.
  2011-07-13  8:49   ` Richard Guenther
  2011-07-13  9:07     ` Jakub Jelinek
  2011-07-21 21:37     ` Jagasia, Harsha
@ 2013-01-07 18:24     ` Jagasia, Harsha
       [not found]     ` <873A3B0C5474B84F92B91855BCB4FE1625438297@sausexdag01.amd.com>
  3 siblings, 0 replies; 12+ messages in thread
From: Jagasia, Harsha @ 2013-01-07 18:24 UTC (permalink / raw)
  To: Richard Guenther, Richard Henderson
  Cc: gcc-patches, hubicka, ubizjak, hjl.tools

>>> We would like to propose changing AVX generic mode tuning to 
>>> generate 128-bit AVX instead of 256-bit AVX.
>>
>> You indicate a 3% reduction on bulldozer with avx256.
>> How does avx128 compare to -mno-avx -msse4.2?
> Will the next AMD generation have a useable avx256?
>>
>> I'm not keen on the idea of generic mode being tuned for a single
>> processor revision that maybe shouldn't actually be using avx at all.

>Btw, it looks like the data is massively skewed by 436.cactusADM.  What are the overall numbers if you disregard cactus?  It's also for sure the case that the vectorizer cost model has not been touched for avx256 vs. avx128 vs. sse, so a more sensible >approach would be to look at differentiating things there to improve the cactus numbers. 

>Harsha, did you investigate why avx256 is such a loss for cactus or why it is so much of a win for SB?

I know this thread has been left open from our end for a while now, but we (AMD) would really like to re-open this discussion. So here goes.

We did investigate why cactus is slower in avx-256 mode than avx-128 mode on AMD processors.

Using "-Ofast" flag (with appropriate flags to generate avx-128 code or avx-256 code) and running with the reference data set, we observe the following runtimes on Bulldozer. 
                                                                Runtime	%Diff AVX-256 versus AVX-128
AVX128                                                616s		38%
AVX256 with store splitting          853s

Scheduling and predictive commoning are turned off in the compiler for both cases, so that the code generated by the compiler for the avx-128 and avx-256 cases is mostly equivalent, i.e. only avx-128 instructions on one side are replaced by avx-256 instructions on the other.

Looking at the cactus source and oprofile reports, the hottest loop nest is a triply nested loop. The innermost loop of this nest has ~400 lines of Fortran code and takes up 99% of the run time of the benchmark.

Gcc vectorizes the innermost loop for both the 128-bit and 256-bit cases. In order to vectorize the innermost loop, gcc generates a SIMD scalar prologue loop to align the relevant vectors, followed by a SIMD packed avx loop, followed by a SIMD scalar epilogue loop to handle what's left after a whole multiple of the vectorization factor is taken care of.
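
That loop structure can be sketched roughly as follows; this is a hand-written
illustration of what the vectorizer conceptually emits, not actual gcc output,
and VF stands for the vectorization factor (2 doubles for 128-bit vectors, 4
for 256-bit):

#include <stddef.h>
#include <stdint.h>

#define VF 4   /* 4 doubles per 256-bit vector; 2 for 128-bit */

void
kernel (double *a, double s, size_t n)
{
  size_t i = 0;

  /* SIMD scalar prologue loop: peel elements until a[i] is aligned to the
     vector size (32 bytes for 256-bit AVX, 16 bytes for 128-bit AVX).  */
  while (i < n && ((uintptr_t) &a[i] % (VF * sizeof (double))) != 0)
    {
      a[i] = a[i] * s;
      i++;
    }

  /* SIMD packed vector loop: whole multiples of VF on aligned data
     (written here as scalar code; the compiler emits packed AVX).  */
  for (; i + VF <= n; i += VF)
    {
      size_t j;
      for (j = 0; j < VF; j++)
        a[i + j] = a[i + j] * s;
    }

  /* SIMD scalar epilogue loop: the remaining n mod VF elements.  */
  for (; i < n; i++)
    a[i] = a[i] * s;
}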

Here are the oprofile samples seen in the AVX-128 and AVX-256 cases for the innermost Fortran loop's 3 components.

Oprofile Samples
                          AVX 128    AVX-256-ss    Gap in samples    Gap as % of total runtime
Total                      153408        214448             61040                          38%
SIMD Vector loop           135653        183074             47421                          30%
SIMD Scalar Prolog loop      3817         10434              6617                           4%
SIMD Scalar Epilog loop      3471         10072              6601                           4%

The avx-256 code is spending 30% more time in the SIMD vector loop than the avx-128 code. The code gen appears to be equivalent for this vector loop in the 128b and 256b cases, i.e. only avx-128 instructions on one side are replaced by avx-256 instructions on the other. The instruction mix and scheduling are the same, except for the spilling and loading of one variable.

We know this gap is because there are fewer physical registers available for renaming for the avx-256 code, since our processor loses the upper halves of the FP registers for renaming.
Our entire SIMD pipeline in the processor is 128-bit and we don't have native true 256-bit, even for the foreseeable future generations, unlike Sandybridge/Ivybridge.

The avx-256 code is spending 8% more time in the SIMD scalar prologue and epilogue than the avx-128 code. The code gen is exactly the same for these scalar loops in the 128b and 256b cases, i.e. the exact same instruction mix and scheduling. The reason for the gap is the number of iterations that gcc executes in these loops in the two cases.

This is because gcc is following Sandybridge's recommendation and aligning avx-256 vectors to a 32-byte boundary instead of a 16-byte boundary, even on Bulldozer.
The Sandybridge Software Optimization Guide mentions that the optimal memory alignment of an AVX 256-bit vector, stored in memory, is 32 bytes. 
The Bulldozer Software Optimization Guide says "Align all packed floating-point data on 16-byte boundaries".

In the case of cactus, the relevant double-precision array has 118 elements that are stepped through in unit stride, and the first element handled in the Fortran loop sits at an address ending in 0x8, i.e. it is only 8-byte aligned.
In avx-128 mode, gcc generates a scalar prologue loop that processes one element at location 0x8, then a vector loop that processes the next 116 elements starting at location 0x10, i.e. a 16-byte aligned location, then a scalar epilogue loop that processes the one element left.
In avx-256 mode, gcc generates a scalar prologue loop that processes the first 3 elements at locations 0x8, 0x10 and 0x18, then a vector loop that processes the next 112 elements starting at location 0x20, i.e. a 32-byte aligned location, then a scalar epilogue loop that processes the last three elements.

Since this Fortran loop is nested inside another doubly nested loop, the overall impact of doing more work in the scalar loops and less in the vector loop is a reduction in overall vectorization benefit on Bulldozer.
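
The iteration counts above follow directly from that starting alignment; here
is a small self-contained sketch of the arithmetic (the 0x1008 start address
is made up, only its low bits matter):

#include <stdio.h>
#include <stdint.h>

/* Scalar prologue iterations needed to reach an 'align'-byte boundary,
   starting from byte address 'addr', for 8-byte doubles.  */
static unsigned
peel_count (uintptr_t addr, unsigned align)
{
  unsigned misalign = addr % align;
  return misalign ? (align - misalign) / sizeof (double) : 0;
}

int
main (void)
{
  uintptr_t start = 0x1008;  /* first element at an address ending in 0x8 */
  unsigned n = 118;          /* elements stepped through in the cactus loop */
  unsigned p16 = peel_count (start, 16);  /* avx-128 peels 1 element  */
  unsigned p32 = peel_count (start, 32);  /* avx-256 peels 3 elements */

  printf ("avx-128: prologue %u, vector %u, epilogue %u\n",
          p16, (n - p16) / 2 * 2, (n - p16) % 2);  /* 1, 116, 1 */
  printf ("avx-256: prologue %u, vector %u, epilogue %u\n",
          p32, (n - p32) / 4 * 4, (n - p32) % 4);  /* 3, 112, 3 */
  return 0;
}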

Enabling avx-256 and choosing alignments and vector factors that are optimal for Intel Sandybridge/Ivybridge and sub-optimal for AMD processors as default is truly against the spirit of generic mode here.

PS: I am not actively working on gcc right now, but other AMD gcc team members will pitch in if more is needed.

Thanks
Harsha

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: AVX generic mode tuning discussion.
       [not found]     ` <873A3B0C5474B84F92B91855BCB4FE1625438297@sausexdag01.amd.com>
@ 2013-01-08 11:22       ` Richard Biener
  0 siblings, 0 replies; 12+ messages in thread
From: Richard Biener @ 2013-01-08 11:22 UTC (permalink / raw)
  To: Jagasia, Harsha
  Cc: Richard Henderson, gcc-patches, hubicka, ubizjak, hjl.tools,
	Vishwanath, Prasad, Harle, Christophe, Eggone, NagaJyothi,
	Ekanathan, Saravanan, Gopalasubramanian, Ganesh, Kumar,
	Venkataramanan, Vermeulen, Michiel

On Mon, Jan 7, 2013 at 7:21 PM, Jagasia, Harsha <harsha.jagasia@amd.com> wrote:
>>>> We would like to propose changing AVX generic mode tuning to generate
>>>> 128-bit AVX instead of 256-bit AVX.
>>>
>>> You indicate a 3% reduction on bulldozer with avx256.
>>> How does avx128 compare to -mno-avx -msse4.2?
>> Will the next AMD generation have a useable avx256?
>>>
>>> I'm not keen on the idea of generic mode being tuned for a single
>>> processor revision that maybe shouldn't actually be using avx at all.
>
>>Btw, it looks like the data is massively skewed by 436.cactusADM.  What are the overall numbers if you disregard cactus?  It's also for sure the case that the vectorizer cost model has not been touched for avx256 vs. avx128 vs. sse, so a more sensible >approach would be to look at differentiating things there to improve the cactus numbers.
>
>>Harsha, did you investigate why avx256 is such a loss for cactus or why it is so much of a win for SB?
>
> I know this thread did not get closed from our end for a while now, but we (AMD) would really like to re-open this discussion. So here goes.
>
> We did investigate why cactus is slower in avx-256 mode than avx-128 mode on AMD processors.
>
> Using "-Ofast" flag (with appropriate flags to generate avx-128 code or avx-256 code) and running with the reference data set, we observe the following runtimes on Bulldozer.
>                                 Runtime    %Diff AVX-256 versus AVX-128
> AVX128                          616s
> AVX256 with store splitting     853s       38%
>
> Scheduling and predictive commoning are turned off in the compiler for both cases, so that the code generated by the compiler for the avx-128 and avx-256 cases are mostly equivalent i.e only avx-128 instructions on one side are being replaced by avx-256 instructions on the other side.
>
> Looking at the cactus source and oprofile reports, the hottest loop nest is a triple nested loop. The innermost loop of this nest has ~400 lines of Fortran code and takes up 99% of the run time of the benchmark.
>
> Gcc vectorizes the innermost loop for both the 128 and 256 bit cases. In order to vectorize the innermost loop, gcc generates a SIMD scalar prologue loop to align the relevant vectors, followed by a SIMD packed avx loop, followed by a SIMD scalar epilogue loop to handle what's left after a whole multiple of vector factor is taken care of.
>
> Here are the oprofile samples seen in the AVX-128 and AVX-256 case for the innermost Fortran loop's 3 components.
> Oprofile Samples
>                           AVX 128    AVX-256-ss    Gap in samples    Gap as % of total runtime
> Total                      153408        214448             61040                          38%
> SIMD Vector loop           135653        183074             47421                          30%
> SIMD Scalar Prolog loop      3817         10434              6617                           4%
> SIMD Scalar Epilog loop      3471         10072              6601                           4%
>
> The avx-256 code is spending 30% more time in the SIMD vector loop than the avx-128 code. The code gen appears to be equivalent for this vector loop in the 128b and 256b cases- i.e only avx-128 instructions on one side are being replaced by avx-256 instructions on the other side. The instruction mix and scheduling are same, except for the spilling and loading of one variable.
>
> We know this gap is because there are fewer physical registers available for renaming to the avx-256 code, since our processor loses the upper halves of the FP registers for renaming.
> Our entire SIMD pipeline in the processor  is 128-bit and we don't have native true 256-bit, even for foreseeable future generations, unlike Sandybridge/Ivybridge.
>
> The avx-256 code is spending 8% more time in the SIMD scalar prologue and epilogue than the avx-128 code. The code gen is exactly the same for these scalar loops in the 128b and 256b case - i.e exact same instruction mix and scheduling. The reason for the gap is actually the number of iterations that gcc executes in these loops for the 2 cases.
>
> This is because gcc is following Sandy bridge's recommendation and aligning avx-256 vectors to a 32-byte boundary instead of a 16-byte boundary, even on Bulldozer.
> The Sandybridge Software Optimization Guide mentions that the optimal memory alignment of an AVX 256-bit vector, stored in memory, is 32 bytes.
> The Bulldozer Software Optimization Guide says "Align all packed floating-point data on 16-byte boundaries".
>
> In case of cactus, the relevant double vector has 118 elements that are stepped through in unit stride and the first element handled in the Fortran loop is aligned at a boundary akin to 0x8.
> In avx-128 mode, gcc generates a scalar prologue loop that processes one element at location 0x8, then a vector loop that processes the next 116 elements starting at location 0x10 i.e a 16 byte aligned location, then a scalar epilogue loop that processes the one last element left.
> In avx-256 mode, gcc generates a scalar prologue loop that processes the first 3 elements at locations 0x8, 0x10 and 0x18, then a vector loop that processes the next 112 elements starting at location 0x20 i.e a 32 byte aligned location, then a scalar epilogue loop that processes the last three elements left.
>
> Since this Fortran loop is nested inside another double nested loop, the overall impact from doing more work in the scalar loops and less in the vector loop is a reduction in overall vectorization on Bulldozer.
>
> Enabling avx-256 and choosing alignments and vector factors that are optimal for Intel Sandybridge/Ivybridge and sub-optimal for AMD processors as default is truly against the spirit of generic mode here.
>
> PS: I am not actively working on gcc right now, but other AMD gcc team members will pitch in if more is needed.

As for the alignment issue, we probably want to have a target hook the vectorizer
can use to query the desired runtime alignment for a given vector type.  At the
moment things are hard-wired to vector-size / -alignment and it's probably more
involved to disentangle things here to support 128-bit dynamic alignment for
256-bit vectors.

With masked load/store support we can also implement the whole prologue
loop using a single vector iteration .... (if there is no reduction involved).

The register pressure issue can be resolved using the new vectorizer cost
model infrastructure where the target has the chance to look at the whole
vectorized set of instructions.

As for generally disabling 256bit vector support in the vectorizer for
explicit -mavx and generic tuning you know my opinion (even -mprefer-avx128
is a kludge).  Instead of

/* If AVX is enabled then try vectorizing with both 256bit and 128bit
   vectors.  */

static unsigned int
ix86_autovectorize_vector_sizes (void)
{
  return (TARGET_AVX && !TARGET_PREFER_AVX128) ? 32 | 16 : 0;
}

it would be better to do what the option suggests - only change the
preferred (first tried) vector size to 128 bits but do not disallow 256bit
vectorization if 128bit vectorization is not possible / profitable
(ix86_preferred_simd_mode seems already to be wired that way).
Disabling 256bit vectorization should be done with a different option
(-mavx128, to only enable the 128bit subset?).
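
A rough sketch of that suggestion, purely illustrative and untested: keep both
sizes in the returned set and leave the preference to ix86_preferred_simd_mode
(the separate option that would really disable 256-bit vectorization is not
shown):

/* Always offer both vector sizes when AVX is enabled; -mprefer-avx128 would
   then only affect which size is tried first via ix86_preferred_simd_mode.  */

static unsigned int
ix86_autovectorize_vector_sizes (void)
{
  return TARGET_AVX ? 32 | 16 : 0;
}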

Richard.

> Thanks
> Harsha
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2013-01-08 11:22 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-12 22:26 AVX generic mode tuning discussion harsha.jagasia
2011-07-12 22:29 ` Richard Henderson
2011-07-13  8:49   ` Richard Guenther
2011-07-13  9:07     ` Jakub Jelinek
2011-07-21 21:37     ` Jagasia, Harsha
2013-01-07 18:24     ` FW: " Jagasia, Harsha
     [not found]     ` <873A3B0C5474B84F92B91855BCB4FE1625438297@sausexdag01.amd.com>
2013-01-08 11:22       ` Richard Biener
2011-07-21 21:18   ` Jagasia, Harsha
     [not found]   ` <63EE40A00BA43F49B85FACBB03F078B60821086630@sausexmbp02.amd.com>
2011-10-31 21:21     ` Jagasia, Harsha
2011-11-01  9:47       ` Richard Guenther
2011-11-02 17:17         ` Jagasia, Harsha
2011-11-02 20:50           ` Richard Guenther

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).