-O3 and -ftree-vectorize

public inbox for gcc@gcc.gnu.org

* -O3 and -ftree-vectorize
@ 2014-02-06 18:51 Uros Bizjak
  2014-02-06 19:01 ` Xinliang David Li
  2014-02-06 22:21 ` Tim Prince
  0 siblings, 2 replies; 9+ messages in thread
From: Uros Bizjak @ 2014-02-06 18:51 UTC
  To: GCC Development; +Cc: Jakub Jelinek, Richard Biener

Hello!

4.9 does not enable -ftree-vectorize for -O3 (and -Ofast) anymore. Is
this intentional?

$ /ssd/uros/gcc-build/gcc/xgcc -B /ssd/uros/gcc-build/gcc -O3 -Q --help=optimizers

...
-ftree-vectorize                      [disabled]
...

This is the reason for the low scores on SciMark v2.0 in the tests
conducted by Phoronix. Please see comment #18 in [1].

[1] http://phoronix.com/forums/showthread.php?95147-Clang-s-Competition-For-GCC-On-Intel-Haswell/page2

Uros.


* Re: -O3 and -ftree-vectorize
  2014-02-06 18:51 -O3 and -ftree-vectorize Uros Bizjak
@ 2014-02-06 19:01 ` Xinliang David Li
  2014-02-06 19:20   ` Uros Bizjak
  2014-02-06 22:21 ` Tim Prince
  1 sibling, 1 reply; 9+ messages in thread
From: Xinliang David Li @ 2014-02-06 19:01 UTC
  To: Uros Bizjak; +Cc: GCC Development, Jakub Jelinek, Richard Biener

It is a bug in the help message dump. You can see that
-ftree-loop-vectorize and -ftree-slp-vectorize are indeed enabled.

David



* Re: -O3 and -ftree-vectorize
  2014-02-06 19:01 ` Xinliang David Li
@ 2014-02-06 19:20   ` Uros Bizjak
  2014-02-06 19:23     ` Xinliang David Li
  0 siblings, 1 reply; 9+ messages in thread
From: Uros Bizjak @ 2014-02-06 19:20 UTC
  To: Xinliang David Li; +Cc: GCC Development, Jakub Jelinek, Richard Biener

On Thu, Feb 6, 2014 at 8:01 PM, Xinliang David Li <davidxl@google.com> wrote:
> It is a bug in the help message dump. You can see that
> ftree-loop-vectorize ftree-slp-vectorize are indeed enabled.

I see. It would also be nice to fix the -O3 entry in doc/invoke.texi,
which currently reads:

--cut here--
@item -O3
@opindex O3
Optimize yet more.  @option{-O3} turns on all optimizations specified
by @option{-O2} and also turns on the @option{-finline-functions},
@option{-funswitch-loops}, @option{-fpredictive-commoning},
@option{-fgcse-after-reload}, @option{-ftree-loop-vectorize},
@option{-ftree-slp-vectorize}, @option{-fvect-cost-model},
@option{-ftree-partial-pre} and @option{-fipa-cp-clone} options.
--cut here--

Uros.


* Re: -O3 and -ftree-vectorize
  2014-02-06 19:20   ` Uros Bizjak
@ 2014-02-06 19:23     ` Xinliang David Li
  0 siblings, 0 replies; 9+ messages in thread
From: Xinliang David Li @ 2014-02-06 19:23 UTC
  To: Uros Bizjak; +Cc: GCC Development, Jakub Jelinek, Richard Biener

The invoke.texi is correct. -ftree-vectorize is now simply a
shorthand that enables both the loop and SLP vectorizers.  I will fix
the help message appropriately.

David
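
A minimal sketch of the two kinds of code the two sub-options target, as
an illustration of the shorthand described above. The function names
saxpy and add4 are hypothetical and not from the thread; whether a given
GCC build actually vectorizes either one depends on the target and cost
model.

```c
/* Hypothetical illustration; build with something like
 *   gcc -std=c99 -O3 -c vec.c
 * and inspect the result with -Q --help=optimizers or -fopt-info-vec. */
#include <stddef.h>

/* Candidate for the loop vectorizer (-ftree-loop-vectorize): the same
 * operation is applied to consecutive elements across iterations. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Candidate for the SLP vectorizer (-ftree-slp-vectorize): four
 * independent, isomorphic statements in straight-line code, no loop. */
void add4(const float *restrict a, const float *restrict b,
          float *restrict out)
{
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
    out[2] = a[2] + b[2];
    out[3] = a[3] + b[3];
}
```

Passing -ftree-vectorize at -O2 would be expected to enable both
sub-options at once, which is how it is used later in this thread.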



* Re: -O3 and -ftree-vectorize
  2014-02-06 18:51 -O3 and -ftree-vectorize Uros Bizjak
  2014-02-06 19:01 ` Xinliang David Li
@ 2014-02-06 22:21 ` Tim Prince
  2014-02-06 22:29   ` Xinliang David Li
  2014-02-07 15:22   ` Jakub Jelinek
  1 sibling, 2 replies; 9+ messages in thread
From: Tim Prince @ 2014-02-06 22:21 UTC
  To: gcc


On 2/6/2014 1:51 PM, Uros Bizjak wrote:
> Hello!
>
> 4.9 does not enable -ftree-vectorize for -O3 (and Ofast) anymore. Is
> this intentional?
>
> $/ssd/uros/gcc-build/gcc/xgcc -B /ssd/uros/gcc-build/gcc -O3 -Q
> --help=optimizers
>
> ...
> -ftree-vectorize                      [disabled]
> ...
>
>
I'm seeing vectorization but no output from -ftree-vectorizer-verbose,
and no dot product vectorization inside omp parallel regions, with gcc,
g++ or gfortran 4.9.  Primary targets are cygwin64 and linux x86_64.
I've been unable to use -O3 vectorization with gcc, although it works
with gfortran and g++, so I use gcc -O2 -ftree-vectorize together with
additional optimization flags which don't break.
I've made source code changes to take advantage of the new vectorization
with merge() and ? operators; while it's useful for -march=core-avx2,
it's sometimes a loss for -msse4.1.
gcc vectorization with #pragma omp parallel for simd is reasonably
effective in my tests only on 12 or more cores.
#pragma omp simd reduction(max: ) is giving correct results but poor
performance in my tests.

You've probably seen my gcc testresults posts.  The one major recent 
improvement is the ability to skip cilkplus tests on targets where it's 
totally unsupported.  Without cilk_for et al. even on "supported" 
targets cilkplus seems useless.
There are still lots of failing stabs tests on targets where those 
apparently aren't supported.

So there are some mysteries about what the developers intend.  I suppose
this was posted on the gcc list because such questions are ignored
on gcc-help.

-- 
Tim Prince
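
For readers unfamiliar with the construct, a minimal sketch of what an
omp simd max reduction looks like in C. The function name max_of is
hypothetical and this is not Tim's test code; it assumes n >= 1 and a
compiler accepting OpenMP 4.0 pragmas (for example via -fopenmp).
Without OpenMP enabled the pragma is ignored and the loop still gives
the same result.

```c
/* Hypothetical illustration; build with something like
 *   gcc -std=c99 -O3 -fopenmp -c maxred.c */
#include <stddef.h>

/* Returns the largest element of x[0..n-1]; assumes n >= 1. */
float max_of(const float *x, size_t n)
{
    float m = x[0];

    /* Each SIMD lane keeps a private running maximum; the lanes are
     * combined into m when the loop finishes. */
    #pragma omp simd reduction(max:m)
    for (size_t i = 1; i < n; i++)
        m = x[i] > m ? x[i] : m;

    return m;
}
```

The conditional inside the loop is also an instance of the "? operator"
form mentioned above.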


* Re: -O3 and -ftree-vectorize
  2014-02-06 22:21 ` Tim Prince
@ 2014-02-06 22:29   ` Xinliang David Li
  2014-02-07 15:22   ` Jakub Jelinek
  1 sibling, 0 replies; 9+ messages in thread
From: Xinliang David Li @ 2014-02-06 22:29 UTC
  To: tprince; +Cc: gcc

On Thu, Feb 6, 2014 at 2:21 PM, Tim Prince <n8tm@aol.com> wrote:
>
> On 2/6/2014 1:51 PM, Uros Bizjak wrote:
>>
>> Hello!
>>
>> 4.9 does not enable -ftree-vectorize for -O3 (and Ofast) anymore. Is
>> this intentional?
>>
>> $/ssd/uros/gcc-build/gcc/xgcc -B /ssd/uros/gcc-build/gcc -O3 -Q
>> --help=optimizers
>>
>> ...
>> -ftree-vectorize                      [disabled]
>> ...
>>
>>
> I'm seeing vectorization  but no output from -ftree-vectorizer-verbose,

Use -fopt-info instead.

The old option is deprecated.

>and
> no dot product vectorization inside omp parallel regions, with gcc g++ or
> gfortran 4.9.  Primary targets are cygwin64 and linux x86_64.
> I've been unable to use -O3 vectorization with gcc, although it works with
> gfortran and g++, so use gcc -O2 -ftree-vectorize together with additional
> optimization flags which don't break.

Reproducible case?


David
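
A minimal sketch of how -fopt-info might be used in place of the
deprecated option, assuming GCC 4.9 or later. The file name scale.c and
the function scale are hypothetical; the exact wording of the report
varies between GCC versions, so none is reproduced here.

```c
/* Hypothetical illustration, saved as scale.c.  Build with:
 *   gcc -std=c99 -O3 -fopt-info-vec -c scale.c
 * -fopt-info-vec reports loops that were vectorized;
 * -fopt-info-vec-missed reports loops that were not. */
#include <stddef.h>

/* Multiplies every element of v[0..n-1] by k in place. */
void scale(size_t n, double k, double *restrict v)
{
    for (size_t i = 0; i < n; i++)
        v[i] *= k;
}
```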



* Re: -O3 and -ftree-vectorize
  2014-02-06 22:21 ` Tim Prince
  2014-02-06 22:29   ` Xinliang David Li
@ 2014-02-07 15:22   ` Jakub Jelinek
  2014-02-07 16:09     ` Tim Prince
  1 sibling, 1 reply; 9+ messages in thread
From: Jakub Jelinek @ 2014-02-07 15:22 UTC
  To: tprince; +Cc: gcc

On Thu, Feb 06, 2014 at 05:21:00PM -0500, Tim Prince wrote:
> I'm seeing vectorization  but no output from
> -ftree-vectorizer-verbose, and no dot product vectorization inside
> omp parallel regions, with gcc g++ or gfortran 4.9.  Primary targets
> are cygwin64 and linux x86_64.
> I've been unable to use -O3 vectorization with gcc, although it
> works with gfortran and g++, so use gcc -O2 -ftree-vectorize
> together with additional optimization flags which don't break.

Can you file a GCC bugzilla PR with minimal testcases for this (or point us
at already filed bugreports)?

> I've made source code changes to take advantage of the new
> vectorization with merge() and ? operators; while it's useful for
> -march=core-avx2, it's sometimes a loss for -msse4.1.
> gcc vectorization with #pragma omp parallel for simd is reasonably
> effective in my tests only on 12 or more cores.

Likewise.

> #pragma omp simd reduction(max: ) is giving correct results but poor
> performance in my tests.

Likewise.

	Jakub


* Re: -O3 and -ftree-vectorize
  2014-02-07 15:22   ` Jakub Jelinek
@ 2014-02-07 16:09     ` Tim Prince
  2014-02-08 14:45       ` Tim Prince
  0 siblings, 1 reply; 9+ messages in thread
From: Tim Prince @ 2014-02-07 16:09 UTC
  To: Jakub Jelinek, tprince; +Cc: gcc


On 02/07/2014 10:22 AM, Jakub Jelinek wrote:
> On Thu, Feb 06, 2014 at 05:21:00PM -0500, Tim Prince wrote:
>> I'm seeing vectorization  but no output from
>> -ftree-vectorizer-verbose, and no dot product vectorization inside
>> omp parallel regions, with gcc g++ or gfortran 4.9.  Primary targets
>> are cygwin64 and linux x86_64.
>> I've been unable to use -O3 vectorization with gcc, although it
>> works with gfortran and g++, so use gcc -O2 -ftree-vectorize
>> together with additional optimization flags which don't break.
> Can you file a GCC bugzilla PR with minimal testcases for this (or point us
> at already filed bugreports)?
I have not managed to find a minimal test case for the problems with
gcc -O3 (called from gfortran).  When I run under the debugger, it
appears that somewhere prior to the crash some gfortran code is
overwritten with data by the gcc code, which is beyond my debugging
skill.  I can get full performance with -O2 plus a bunch of intermediate
flags.
As to non-vectorization of dot product in the omp parallel region,
-fopt-info (which I didn't know about) is reporting vectorization, but
there are no parallel simd instructions in the generated code for the
omp_fn.  I'll file a PR on that if it still reproduces in a minimal case.

>
>> I've made source code changes to take advantage of the new
>> vectorization with merge() and ? operators; while it's useful for
>> -march=core-avx2, it's sometimes a loss for -msse4.1.
>> gcc vectorization with #pragma omp parallel for simd is reasonably
>> effective in my tests only on 12 or more cores.
> Likewise.
Those are cases of 2 levels of loops from the netlib "vector" benchmark
where only one level is vectorizable and parallelizable.  By putting the
vectorizable loop on the outside, the parallelization scales to a large
number of cores.  I don't expect it to outperform single-thread
optimized AVX vectorization until 8 or more cores are in use, but it
needs more threads than expected even relative to SSE
vectorization.
>
>> #pragma omp simd reduction(max: ) is giving correct results but poor
>> performance in my tests.
> Likewise.
I'll file a PR on this; I didn't know if there might be interest.  I have
an Intel compiler issue "closed, will not be fixed", so the simd
reduction(max: ) isn't viable for icc in the near term.
Thanks,
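
A minimal sketch of the loop shape described above: two loop levels
where only the outer one is independent, annotated with omp parallel for
simd. This is a hypothetical example, not the netlib "vector" benchmark
code; the function name column_recurrence and the row-major layout are
assumptions. Whether the compiler actually emits SIMD code for the outer
loop is a separate question from whether the pragma is accepted.

```c
/* Hypothetical illustration; build with something like
 *   gcc -std=c99 -O3 -fopenmp -c recur.c */
#include <stddef.h>

/* a is a steps x n array stored row-major: element (j, i) is a[j*n + i].
 * Each column i carries a recurrence along j, so the inner loop is
 * sequential, while different columns are independent of each other. */
void column_recurrence(size_t n, size_t steps, float c, float *a)
{
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++)
        for (size_t j = 1; j < steps; j++)
            a[j * n + i] += c * a[(j - 1) * n + i];
}
```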


* Re: -O3 and -ftree-vectorize
  2014-02-07 16:09     ` Tim Prince
@ 2014-02-08 14:45       ` Tim Prince
  0 siblings, 0 replies; 9+ messages in thread
From: Tim Prince @ 2014-02-08 14:45 UTC
  To: Jakub Jelinek, tprince; +Cc: gcc


With further investigation, my case with reverse_copy outside and
inner_product inside an omp parallel region is working very well with
-O3 -ffast-math for the double data type.  There seems to be a possible
performance problem with reverse_copy for the float data type, so much so
that gfortran does better with the loop reversal pushed down into the
parallel dot_products.  I have seen at least 2 cases where the new gcc
vectorization of stride -1 with vpermd is superior to other compilers,
even for the float data type.
For the cases where omp parallel for simd is set in expectation of
gaining outer loop parallel simd, gcc is ignoring the simd clause.  So it
is understandable that a large number of cores is needed to overcome the
lack of parallel simd (other than by coding with simd intrinsics).
I'll choose an example of omp simd reduction(max: ) for a PR.
Thanks.

-- 
Tim Prince


