* Why vectorization didn't turn on by -O2
@ 2021-05-08 12:07 172060045
2021-05-08 16:54 ` Xi Ruoyao
0 siblings, 1 reply; 26+ messages in thread
From: 172060045 @ 2021-05-08 12:07 UTC (permalink / raw)
To: gcc-help; +Cc: bin.cheng, ook, rguenther
Hi all,
Recently I noticed that gcc -O2 doesn't turn on loop vectorization,
while clang does turn it on at -O2.
Does GCC consider it a space/speed trade-off, or are there other
considerations?
Thanks in advance.
Yancheng
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-05-08 12:07 Why vectorization didn't turn on by -O2 172060045
@ 2021-05-08 16:54 ` Xi Ruoyao
2021-05-09 13:44 ` Segher Boessenkool
0 siblings, 1 reply; 26+ messages in thread
From: Xi Ruoyao @ 2021-05-08 16:54 UTC (permalink / raw)
To: 172060045, gcc-help; +Cc: rguenther, bin.cheng, ook, Jan Hubicka
On Sat, 2021-05-08 at 20:07 +0800, 172060045@hdu.edu.cn wrote:
>
> Hi all,
>
> Recently I noticed that gcc -O2 didn't turn on vectorization
> optimization,
> which it turns on in clang -O2.
>
> Does GCC think it involves the trade-off of space speed, or other
> considerations?
>
> Thanks in advance.
>
> Yancheng
It's just a decision, I think. The "original" reason may be that
vectorization can make code *slower* in some cases.
There was some discussion about enabling -ftree-loop-vectorize at
-O2 for x86, but that was too late (for GCC 9):
https://gcc.gnu.org/pipermail/gcc/2019-January/228102.html
CC Jan: would this be reconsidered for GCC 12?
--
Xi Ruoyao <xry111@mengyan1223.wang>
School of Aerospace Science and Technology, Xidian University
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-05-08 16:54 ` Xi Ruoyao
@ 2021-05-09 13:44 ` Segher Boessenkool
2021-05-10 8:21 ` Richard Biener
0 siblings, 1 reply; 26+ messages in thread
From: Segher Boessenkool @ 2021-05-09 13:44 UTC (permalink / raw)
To: Xi Ruoyao; +Cc: 172060045, gcc-help, bin.cheng, rguenther, ook, Jan Hubicka
On Sun, May 09, 2021 at 12:54:08AM +0800, Xi Ruoyao via Gcc-help wrote:
> On Sat, 2021-05-08 at 20:07 +0800, 172060045@hdu.edu.cn wrote:
> > Recently I noticed that gcc -O2 didn't turn on vectorization
> > optimization,
> > which it turns on in clang -O2.
> >
> > Does GCC think it involves the trade-off of space speed, or other
> > considerations?
-O2 is for optimisations that (almost) never degrade code quality. -O3
is for those that are only beneficial "on average".
> It's just a decision, I think. The "original" reason may be that
> vectorization can make code *slower* in some cases.
Yup. Vectorisation always causes hugely different code.
> There was some discussion about enabling -ftree-loop-vectorize at
> -O2 for x86, but that was too late (for GCC 9):
AFAIK the current plan is to enable vectorisation at -O2 with a more
conservative cost model. This will be a generic change, for all
architectures, and hopefully will arrive in GCC 12.
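For reference, both pieces already exist as explicit options, so the
effect can be previewed today (foo.c is just a placeholder file name;
flag spellings as documented for GCC 11):

```shell
# Preview the plan: loop vectorization at -O2 with the conservative
# "very-cheap" cost model (the model itself was added in GCC 11).
gcc -O2 -ftree-loop-vectorize -fvect-cost-model=very-cheap -S foo.c

# For comparison, -O3 enables vectorization with the default
# "dynamic" cost model.
gcc -O3 -S foo.c
```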
Segher
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-05-09 13:44 ` Segher Boessenkool
@ 2021-05-10 8:21 ` Richard Biener
2021-05-10 9:11 ` Jan Hubicka
2021-05-10 9:24 ` Segher Boessenkool
0 siblings, 2 replies; 26+ messages in thread
From: Richard Biener @ 2021-05-10 8:21 UTC (permalink / raw)
To: Segher Boessenkool; +Cc: Xi Ruoyao, 172060045, gcc-help, bin.cheng, Jan Hubicka
On Sun, 9 May 2021, Segher Boessenkool wrote:
> On Sun, May 09, 2021 at 12:54:08AM +0800, Xi Ruoyao via Gcc-help wrote:
> > On Sat, 2021-05-08 at 20:07 +0800, 172060045@hdu.edu.cn wrote:
> > > Recently I noticed that gcc -O2 didn't turn on vectorization
> > > optimization,
> > > which it turns on in clang -O2.
> > >
> > > Does GCC think it involves the trade-off of space speed, or other
> > > considerations?
>
> -O2 is for optimisations that (almost) never degrade code quality. -O3
> is for those that are only beneficial "on average".
In particular -O2 is a balance of compile-time, generated code size
and resulting performance. Vectorization with the -O2 default
cost model of "cheap" tends to mostly increase the first and the second
whilst only moderately affecting the last.
It's all of course hugely dependent on the source base you are
working with.
Richard.
> > It's just a decision, I think. The "original" reason may be that
> > vectorization can make code *slower* in some cases.
>
> Yup. Vectorisation always causes hugely different code.
>
> > There was some discussion about enabling -ftree-loop-vectorize at
> > -O2 for x86, but that was too late (for GCC 9):
>
> AFAIK the current plan is to enable vectorisation at -O2 with a more
> conservative cost model. This will be a generic change, for all
> architectures, and hopefully will arrive in GCC 12.
>
>
> Segher
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-05-10 8:21 ` Richard Biener
@ 2021-05-10 9:11 ` Jan Hubicka
2021-05-10 9:27 ` Segher Boessenkool
2021-05-10 9:24 ` Segher Boessenkool
1 sibling, 1 reply; 26+ messages in thread
From: Jan Hubicka @ 2021-05-10 9:11 UTC (permalink / raw)
To: Richard Biener
Cc: Segher Boessenkool, Xi Ruoyao, 172060045, gcc-help, bin.cheng
> On Sun, 9 May 2021, Segher Boessenkool wrote:
>
> > On Sun, May 09, 2021 at 12:54:08AM +0800, Xi Ruoyao via Gcc-help wrote:
> > > On Sat, 2021-05-08 at 20:07 +0800, 172060045@hdu.edu.cn wrote:
> > > > Recently I noticed that gcc -O2 didn't turn on vectorization
> > > > optimization,
> > > > which it turns on in clang -O2.
> > > >
> > > > Does GCC think it involves the trade-off of space speed, or other
> > > > considerations?
> >
> > -O2 is for optimisations that (almost) never degrade code quality. -O3
> > is for those that are only beneficial "on average".
>
> In particular -O2 is a balance of compile-time, generated code size
> and resulting performance. Vectorization with the -O2 default
> cost model of "cheap" tends to mostly increase the first and the second
> whilst only moderately affecting the last.
Last time I ran -O2 with the cheap model enabled, the code size actually
decreased on average for SPEC, which was a bit of a surprise.
I plan to re-do the benchmarks soon, now that gcc11 has branched.
Honza
>
> It's all of course hugely dependent on the source base you are
> working with.
>
> Richard.
>
> > > It's just a decision, I think. The "original" reason may be that
> > > vectorization can make code *slower* in some cases.
> >
> > Yup. Vectorisation always causes hugely different code.
> >
> > > There was some discussion about enabling -ftree-loop-vectorize at
> > > -O2 for x86, but that was too late (for GCC 9):
> >
> > AFAIK the current plan is to enable vectorisation at -O2 with a more
> > conservative cost model. This will be a generic change, for all
> > architectures, and hopefully will arrive in GCC 12.
> >
> >
> > Segher
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-05-10 8:21 ` Richard Biener
2021-05-10 9:11 ` Jan Hubicka
@ 2021-05-10 9:24 ` Segher Boessenkool
2021-05-17 16:03 ` Jan Hubicka
1 sibling, 1 reply; 26+ messages in thread
From: Segher Boessenkool @ 2021-05-10 9:24 UTC (permalink / raw)
To: Richard Biener; +Cc: gcc-help, Jan Hubicka, bin.cheng, 172060045
On Mon, May 10, 2021 at 10:21:04AM +0200, Richard Biener wrote:
> On Sun, 9 May 2021, Segher Boessenkool wrote:
> > On Sun, May 09, 2021 at 12:54:08AM +0800, Xi Ruoyao via Gcc-help wrote:
> > > On Sat, 2021-05-08 at 20:07 +0800, 172060045@hdu.edu.cn wrote:
> > > > Recently I noticed that gcc -O2 didn't turn on vectorization
> > > > optimization,
> > > > which it turns on in clang -O2.
> > > >
> > > > Does GCC think it involves the trade-off of space speed, or other
> > > > considerations?
> >
> > -O2 is for optimisations that (almost) never degrade code quality. -O3
> > is for those that are only beneficial "on average".
>
> In particular -O2 is a balance of compile-time, generated code size
> and resulting performance.
Yeah, sorry, I glossed over those details, but they do matter.
My point is that -O2 optimisations should "almost never" degrade quality
(code speed, code size, compilation speed, and the tradeoffs between all
of those -- nothing is a hard and fast rule here).
> Vectorization with the -O2 default
> cost model of "cheap" tends to mostly increase the first and the second
> whilst only moderately affecting the last.
One promising plan is to use very-cheap instead. It should increase code
size even less (but see what Honza said).
> It's all of course hugely dependent on the source base you are
> working with.
Yeah.
Segher
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-05-10 9:11 ` Jan Hubicka
@ 2021-05-10 9:27 ` Segher Boessenkool
2021-05-10 12:10 ` Jan Hubicka
0 siblings, 1 reply; 26+ messages in thread
From: Segher Boessenkool @ 2021-05-10 9:27 UTC (permalink / raw)
To: Jan Hubicka; +Cc: Richard Biener, gcc-help, bin.cheng, 172060045
On Mon, May 10, 2021 at 11:11:36AM +0200, Jan Hubicka wrote:
> > On Sun, 9 May 2021, Segher Boessenkool wrote:
> > > On Sun, May 09, 2021 at 12:54:08AM +0800, Xi Ruoyao via Gcc-help wrote:
> > > > On Sat, 2021-05-08 at 20:07 +0800, 172060045@hdu.edu.cn wrote:
> > > > > Recently I noticed that gcc -O2 didn't turn on vectorization
> > > > > optimization,
> > > > > which it turns on in clang -O2.
> > > > >
> > > > > Does GCC think it involves the trade-off of space speed, or other
> > > > > considerations?
> > >
> > > -O2 is for optimisations that (almost) never degrade code quality. -O3
> > > is for those that are only beneficial "on average".
> >
> > In particular -O2 is a balance of compile-time, generated code size
> > and resulting performance. Vectorization with the -O2 default
> > cost model of "cheap" tends to mostly increase the first and the second
> > whilst only moderately affecting the last.
>
> Last time I ran -O2 with the cheap model enabled, the code size actually
> decreased on average for SPEC, which was a bit of a surprise.
Huh! Will you investigate how/why this happens? Also, it probably is
quite different on different archs. Interesting times ahead.
> I plan to re-do the benchmarks soon, now that gcc11 has branched.
Great, thanks in advance, gl;hf :-)
Segher
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-05-10 9:27 ` Segher Boessenkool
@ 2021-05-10 12:10 ` Jan Hubicka
0 siblings, 0 replies; 26+ messages in thread
From: Jan Hubicka @ 2021-05-10 12:10 UTC (permalink / raw)
To: Segher Boessenkool; +Cc: Richard Biener, gcc-help, bin.cheng, 172060045
> On Mon, May 10, 2021 at 11:11:36AM +0200, Jan Hubicka wrote:
> > > On Sun, 9 May 2021, Segher Boessenkool wrote:
> > > > On Sun, May 09, 2021 at 12:54:08AM +0800, Xi Ruoyao via Gcc-help wrote:
> > > > > On Sat, 2021-05-08 at 20:07 +0800, 172060045@hdu.edu.cn wrote:
> > > > > > Recently I noticed that gcc -O2 didn't turn on vectorization
> > > > > > optimization,
> > > > > > which it turns on in clang -O2.
> > > > > >
> > > > > > Does GCC think it involves the trade-off of space speed, or other
> > > > > > considerations?
> > > >
> > > > -O2 is for optimisations that (almost) never degrade code quality. -O3
> > > > is for those that are only beneficial "on average".
> > >
> > > In particular -O2 is a balance of compile-time, generated code size
> > > and resulting performance. Vectorization with the -O2 default
> > > cost model of "cheap" tends to mostly increase the first and the second
> > > whilst only moderately affecting the last.
> >
> > Last time I ran -O2 with the cheap model enabled, the code size actually
> > decreased on average for SPEC, which was a bit of a surprise.
>
> Huh! Will you investigate how/why this happens? Also, it probably is
> quite different on different archs. Interesting times ahead.
>
> > I plan to re-do the benchmarks soon, now that gcc11 has branched.
>
> Great, thanks in advance, gl;hf :-)
I was referring to the very-cheap model here. I think the cheap model
may need some love, since it seems to pick up quite some code overhead
and still produces noticeably smaller speedups than dynamic. This,
however, may be very SPEC-centric, since we have been tuning the
vectorizer on those benchmarks for years.
I am re-running the benchmarks now; let's see.
Honza
>
>
> Segher
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-05-10 9:24 ` Segher Boessenkool
@ 2021-05-17 16:03 ` Jan Hubicka
2021-05-17 18:56 ` Richard Sandiford
0 siblings, 1 reply; 26+ messages in thread
From: Jan Hubicka @ 2021-05-17 16:03 UTC (permalink / raw)
To: Segher Boessenkool; +Cc: Richard Biener, gcc-help, bin.cheng, 172060045
Hi,
here are updated scores.
https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
compares
base: mainline
1st column: mainline with very cheap vectorization at -O2 and -O3
2nd column: mainline with cheap vectorization at -O2 and -O3.
The short story is:
1) -O2 generic performance
kabylake (Intel):
very cheap
SPEC/SPEC2006/FP/total ~ 8.32%
SPEC/SPEC2006/total -0.38% 4.74%
SPEC/SPEC2006/INT/total -0.91% -0.14%
SPEC/SPEC2017/INT/total 4.71% 7.11%
SPEC/SPEC2017/total 2.22% 6.52%
SPEC/SPEC2017/FP/total 0.34% 6.06%
zen
SPEC/SPEC2006/FP/total 0.61% 10.23%
SPEC/SPEC2006/total 0.26% 6.27%
SPEC/SPEC2006/INT/total 34.006 -0.24% 0.90%
SPEC/SPEC2017/INT/total 3.937 5.34% 7.80%
SPEC/SPEC2017/total 3.02% 6.55%
SPEC/SPEC2017/FP/total 1.26% 5.60%
2) -O2 size:
-0.78% (very cheap) 6.51% (cheap) for spec2k2006
-0.32% (very cheap) 6.75% (cheap) for spec2k2017
3) build times:
0%, 0.16%, 0.71%, 0.93% (very cheap) 6.05% 4.80% 6.75% 7.15% (cheap) for spec2k2006
0.39% 0.57% 0.71% (very cheap) 5.40% 6.23% 8.44% (cheap) for spec2k2017
here I simply copied data from different configurations
So for SPEC I would say that most of the compile-time cost is derived
from code-size growth, which is a problem with the cheap model but not
with very-cheap. Very-cheap indeed results in code-size improvements,
and the compile-time impact is probably somewhere around 0.5%.
From these scores alone it would seem to me that vectorization makes
sense at -O2 with the very-cheap model (I am sure we have other
optimizations with worse benefit-to-compile-time tradeoffs).
However, there are the usual arguments against:
1) The vectorizer being tuned for SPEC. I think the only way to overcome
that argument is to enable it by default :)
2) The workloads improved are more of the -Ofast type.
Here are non-spec benchmarks we track:
https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&min_percentage_change=0.02&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
I also tried to run Firefox some time ago. The results are not
surprising - vectorization helps the rendering benchmarks, which are the
ones compiled with aggressive flags anyway.
Honza
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-05-17 16:03 ` Jan Hubicka
@ 2021-05-17 18:56 ` Richard Sandiford
2021-08-04 8:21 ` Hongtao Liu
0 siblings, 1 reply; 26+ messages in thread
From: Richard Sandiford @ 2021-05-17 18:56 UTC (permalink / raw)
To: Jan Hubicka
Cc: Segher Boessenkool, gcc-help, bin.cheng, Richard Biener, 172060045
Jan Hubicka <hubicka@ucw.cz> writes:
> Hi,
> here are updated scores.
> https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> compares
> base: mainline
> 1st column: mainline with very cheap vectorization at -O2 and -O3
> 2nd column: mainline with cheap vectorization at -O2 and -O3.
>
> The short story is:
>
> 1) -O2 generic performance
> kabylake (Intel):
> very cheap
> SPEC/SPEC2006/FP/total ~ 8.32%
> SPEC/SPEC2006/total -0.38% 4.74%
> SPEC/SPEC2006/INT/total -0.91% -0.14%
>
> SPEC/SPEC2017/INT/total 4.71% 7.11%
> SPEC/SPEC2017/total 2.22% 6.52%
> SPEC/SPEC2017/FP/total 0.34% 6.06%
> zen
> SPEC/SPEC2006/FP/total 0.61% 10.23%
> SPEC/SPEC2006/total 0.26% 6.27%
> SPEC/SPEC2006/INT/total 34.006 -0.24% 0.90%
>
> SPEC/SPEC2017/INT/total 3.937 5.34% 7.80%
> SPEC/SPEC2017/total 3.02% 6.55%
> SPEC/SPEC2017/FP/total 1.26% 5.60%
>
> 2) -O2 size:
> -0.78% (very cheap) 6.51% (cheap) for spec2k2006
> -0.32% (very cheap) 6.75% (cheap) for spec2k2017
> 3) build times:
> 0%, 0.16%, 0.71%, 0.93% (very cheap) 6.05% 4.80% 6.75% 7.15% (cheap) for spec2k2006
> 0.39% 0.57% 0.71% (very cheap) 5.40% 6.23% 8.44% (cheap) for spec2k2017
> here I simply copied data from different configurations
>
> So for SPEC I would say that most of the compile-time cost is derived
> from code-size growth, which is a problem with the cheap model but not
> with very-cheap. Very-cheap indeed results in code-size improvements,
> and the compile-time impact is probably somewhere around 0.5%.
>
> From these scores alone it would seem to me that vectorization makes
> sense at -O2 with the very-cheap model (I am sure we have other
> optimizations with worse benefit-to-compile-time tradeoffs).
Thanks for running these.
The biggest issue I know of for enabling very-cheap at -O2 is:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089
Perhaps we could get around that by (hopefully temporarily) disabling
BB SLP within loop vectorisation for the very-cheap model. This would
purely be a workaround and we should remove it once the PR is fixed.
(It would even be a compile-time win in the meantime :-))
Thanks,
Richard
> However, there are the usual arguments against:
>
> 1) The vectorizer being tuned for SPEC. I think the only way to overcome
> that argument is to enable it by default :)
> 2) The workloads improved are more of the -Ofast type.
>
> Here are non-spec benchmarks we track:
> https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&min_percentage_change=0.02&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
>
> I also tried to run Firefox some time ago. The results are not
> surprising - vectorization helps the rendering benchmarks, which are
> the ones compiled with aggressive flags anyway.
>
> Honza
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-05-17 18:56 ` Richard Sandiford
@ 2021-08-04 8:21 ` Hongtao Liu
2021-08-04 8:22 ` Richard Sandiford
0 siblings, 1 reply; 26+ messages in thread
From: Hongtao Liu @ 2021-08-04 8:21 UTC (permalink / raw)
To: Richard Sandiford, Jan Hubicka, Segher Boessenkool, gcc-help,
bin.cheng, Richard Biener, 172060045
On Tue, May 18, 2021 at 4:27 AM Richard Sandiford via Gcc-help
<gcc-help@gcc.gnu.org> wrote:
> [quoted text snipped]
Hi:
I would like to ask: can we turn on -O2 vectorization now?
--
BR,
Hongtao
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-08-04 8:21 ` Hongtao Liu
@ 2021-08-04 8:22 ` Richard Sandiford
2021-08-04 8:31 ` Richard Biener
2021-08-04 8:36 ` Hongtao Liu
0 siblings, 2 replies; 26+ messages in thread
From: Richard Sandiford @ 2021-08-04 8:22 UTC (permalink / raw)
To: Hongtao Liu
Cc: Jan Hubicka, Segher Boessenkool, gcc-help, bin.cheng,
Richard Biener, 172060045
Hongtao Liu <crazylht@gmail.com> writes:
> On Tue, May 18, 2021 at 4:27 AM Richard Sandiford via Gcc-help
> <gcc-help@gcc.gnu.org> wrote:
>>
>> [quoted text snipped]
>
> Hi:
> I would like to ask if we can turn on O2 vectorization now?
I think we still need to deal with the PR100089 issue that I mentioned above.
Like I say, “dealing with” it could be as simple as disabling:
  /* If we applied if-conversion then try to vectorize the
     BB of innermost loops.
     ??? Ideally BB vectorization would learn to vectorize
     control flow by applying if-conversion on-the-fly, the
     following retains the if-converted loop body even when
     only non-if-converted parts took part in BB vectorization.  */
  if (flag_tree_slp_vectorize != 0
      && loop_vectorized_call
      && ! loop->inner)
for the very-cheap vector cost model until the PR is fixed properly.
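As a sketch (untested, just to make the suggestion concrete), the guard
could additionally check the cost model; `flag_vect_cost_model` and
`VECT_COST_MODEL_VERY_CHEAP` are the internal names assumed from the GCC
sources of that time:

```c
  /* Hypothetical workaround for PR100089: don't retain the if-converted
     loop body for BB vectorization under the very-cheap cost model.  */
  if (flag_tree_slp_vectorize != 0
      && flag_vect_cost_model != VECT_COST_MODEL_VERY_CHEAP
      && loop_vectorized_call
      && ! loop->inner)
```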
Thanks,
Richard
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-08-04 8:22 ` Richard Sandiford
@ 2021-08-04 8:31 ` Richard Biener
2021-08-04 9:10 ` Richard Sandiford
` (3 more replies)
2021-08-04 8:36 ` Hongtao Liu
1 sibling, 4 replies; 26+ messages in thread
From: Richard Biener @ 2021-08-04 8:31 UTC (permalink / raw)
To: Richard Sandiford
Cc: Hongtao Liu, Jan Hubicka, Segher Boessenkool, gcc-help,
bin.cheng, 172060045
On Wed, 4 Aug 2021, Richard Sandiford wrote:
> Hongtao Liu <crazylht@gmail.com> writes:
> > [quoted text snipped]
> > Hi:
> > I would like to ask if we can turn on O2 vectorization now?
>
> I think we still need to deal with the PR100089 issue that I mentioned above.
> Like I say, “dealing with” it could be as simple as disabling:
>
>   /* If we applied if-conversion then try to vectorize the
>      BB of innermost loops.
>      ??? Ideally BB vectorization would learn to vectorize
>      control flow by applying if-conversion on-the-fly, the
>      following retains the if-converted loop body even when
>      only non-if-converted parts took part in BB vectorization.  */
>   if (flag_tree_slp_vectorize != 0
>       && loop_vectorized_call
>       && ! loop->inner)
>
> for the very-cheap vector cost model until the PR is fixed properly.
Alternatively only enable loop vectorization at -O2 (the above checks
flag_tree_slp_vectorize as well). At least the cost model kind
does not have any influence on BB vectorization, that is, we get the
same pros and cons as we do for -O3.
Did anyone benchmark -O2 -ftree-{loop,slp}-vectorize separately yet?
Richard.
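As an illustration of the -O2 loop-vectorization question being discussed, the very-cheap cost model is aimed at simple counted loops of roughly the following shape (a hand-written sketch, not taken from the thread):

```c
#include <assert.h>

/* Sketch (not from the thread): a simple counted loop over restrict-
   qualified pointers.  This is the kind of loop the very-cheap loop
   cost model is meant to vectorize at -O2 without emitting runtime
   versioning checks or large peeled prologues/epilogues. */
void scale(float *restrict dst, const float *restrict src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[i] * 2.0f;
}
```

Compiling this with `-O2 -ftree-loop-vectorize -fvect-cost-model=very-cheap` (real GCC flags) is one way to observe the behavior under discussion.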
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-08-04 8:22 ` Richard Sandiford
2021-08-04 8:31 ` Richard Biener
@ 2021-08-04 8:36 ` Hongtao Liu
1 sibling, 0 replies; 26+ messages in thread
From: Hongtao Liu @ 2021-08-04 8:36 UTC (permalink / raw)
To: Hongtao Liu, Jan Hubicka, Segher Boessenkool, gcc-help,
bin.cheng, Richard Biener, 172060045, Richard Sandiford
On Wed, Aug 4, 2021 at 4:22 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Hongtao Liu <crazylht@gmail.com> writes:
> > On Tue, May 18, 2021 at 4:27 AM Richard Sandiford via Gcc-help
> > <gcc-help@gcc.gnu.org> wrote:
> >>
> >> Jan Hubicka <hubicka@ucw.cz> writes:
> >> > Hi,
> >> > here are updated scores.
> >> > https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> >> > compares
> >> > base: mainline
> >> > 1st column: mainline with very cheap vectorization at -O2 and -O3
> >> > 2nd column: mainline with cheap vectorization at -O2 and -O3.
> >> >
> >> > The short story is:
> >> >
> >> > 1) -O2 generic performance
> >> > kabylake (Intel):
> >> > very cheap
> >> > SPEC/SPEC2006/FP/total ~ 8.32%
> >> > SPEC/SPEC2006/total -0.38% 4.74%
> >> > SPEC/SPEC2006/INT/total -0.91% -0.14%
> >> >
> >> > SPEC/SPEC2017/INT/total 4.71% 7.11%
> >> > SPEC/SPEC2017/total 2.22% 6.52%
> >> > SPEC/SPEC2017/FP/total 0.34% 6.06%
> >> > zen
> >> > SPEC/SPEC2006/FP/total 0.61% 10.23%
> >> > SPEC/SPEC2006/total 0.26% 6.27%
> >> > SPEC/SPEC2006/INT/total 34.006 -0.24% 0.90%
> >> >
> >> > SPEC/SPEC2017/INT/total 3.937 5.34% 7.80%
> >> > SPEC/SPEC2017/total 3.02% 6.55%
> >> > SPEC/SPEC2017/FP/total 1.26% 5.60%
> >> >
> >> > 2) -O2 size:
> >> > -0.78% (very cheap) 6.51% (cheap) for spec2k2006
> >> > -0.32% (very cheap) 6.75% (cheap) for spec2k2017
> >> > 3) build times:
> >> > 0%, 0.16%, 0.71%, 0.93% (very cheap) 6.05% 4.80% 6.75% 7.15% (cheap) for spec2k2006
> >> > 0.39% 0.57% 0.71% (very cheap) 5.40% 6.23% 8.44% (cheap) for spec2k2017
> >> > here I simply copied data from different configurations
> >> >
> >> > So for SPEC I would say that most of compile time costs are derived
> >> > from code size growth which is a problem with cheap model but not with
> >> > very cheap. Very cheap indeed results in code size improvements and
> >> > compile time impact is probably somewhere around 0.5%
> >> >
> >> > So from these scores alone this would seem that vectorization makes
> >> > sense at -O2 with very cheap model to me (I am sure we have other
> >> > optimizations with worse benefits to compile time tradeoffs).
> >>
> >> Thanks for running these.
> >>
> >> The biggest issue I know of for enabling very-cheap at -O2 is:
> >>
> >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089
> >>
> >> Perhaps we could get around that by (hopefully temporarily) disabling
> >> BB SLP within loop vectorisation for the very-cheap model. This would
> >> purely be a workaround and we should remove it once the PR is fixed.
> >> (It would even be a compile-time win in the meantime :-))
> >>
> >> Thanks,
> >> Richard
> >>
> >> > However there are usual arguments against:
> >> >
> >> > 1) Vectorizer being tuned for SPEC. I think the only way to overcome
> >> > that argument is to enable it by default :)
> >> > 2) Workloads improved are more of -Ofast type workloads
> >> >
> >> > Here are non-spec benchmarks we track:
> >> > https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&min_percentage_change=0.02&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> >> >
> >> > I also tried to run Firefox some time ago. Results are not surprising -
> >> > vectorization helps rendering benchmarks, which are those compiled with
> >> > aggressive flags anyway.
> >> >
> >> > Honza
> >
> > Hi:
> > I would like to ask if we can turn on O2 vectorization now?
>
> I think we still need to deal with the PR100089 issue that I mentioned above.
> Like I say, “dealing with” it could be as simple as disabling:
>
> /* If we applied if-conversion then try to vectorize the
> BB of innermost loops.
> ??? Ideally BB vectorization would learn to vectorize
> control flow by applying if-conversion on-the-fly, the
> following retains the if-converted loop body even when
> only non-if-converted parts took part in BB vectorization. */
> if (flag_tree_slp_vectorize != 0
> && loop_vectorized_call
> && ! loop->inner)
>
> for the very-cheap vector cost model until the PR is fixed properly.
>
I can test its impact on x86 platforms.
> Thanks,
> Richard
--
BR,
Hongtao
* Re: Why vectorization didn't turn on by -O2
2021-08-04 8:31 ` Richard Biener
@ 2021-08-04 9:10 ` Richard Sandiford
2021-08-04 9:56 ` Segher Boessenkool
2021-08-04 9:12 ` Hongtao Liu
` (2 subsequent siblings)
3 siblings, 1 reply; 26+ messages in thread
From: Richard Sandiford @ 2021-08-04 9:10 UTC (permalink / raw)
To: Richard Biener
Cc: Hongtao Liu, Jan Hubicka, Segher Boessenkool, gcc-help,
bin.cheng, 172060045
Richard Biener <rguenther@suse.de> writes:
> On Wed, 4 Aug 2021, Richard Sandiford wrote:
>
>> Hongtao Liu <crazylht@gmail.com> writes:
>> > On Tue, May 18, 2021 at 4:27 AM Richard Sandiford via Gcc-help
>> > <gcc-help@gcc.gnu.org> wrote:
>> >>
>> >> Jan Hubicka <hubicka@ucw.cz> writes:
>> >> > Hi,
>> >> > here are updated scores.
>> >> > https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
>> >> > compares
>> >> > base: mainline
>> >> > 1st column: mainline with very cheap vectorization at -O2 and -O3
>> >> > 2nd column: mainline with cheap vectorization at -O2 and -O3.
>> >> >
>> >> > The short story is:
>> >> >
>> >> > 1) -O2 generic performance
>> >> > kabylake (Intel):
>> >> > very cheap
>> >> > SPEC/SPEC2006/FP/total ~ 8.32%
>> >> > SPEC/SPEC2006/total -0.38% 4.74%
>> >> > SPEC/SPEC2006/INT/total -0.91% -0.14%
>> >> >
>> >> > SPEC/SPEC2017/INT/total 4.71% 7.11%
>> >> > SPEC/SPEC2017/total 2.22% 6.52%
>> >> > SPEC/SPEC2017/FP/total 0.34% 6.06%
>> >> > zen
>> >> > SPEC/SPEC2006/FP/total 0.61% 10.23%
>> >> > SPEC/SPEC2006/total 0.26% 6.27%
>> >> > SPEC/SPEC2006/INT/total 34.006 -0.24% 0.90%
>> >> >
>> >> > SPEC/SPEC2017/INT/total 3.937 5.34% 7.80%
>> >> > SPEC/SPEC2017/total 3.02% 6.55%
>> >> > SPEC/SPEC2017/FP/total 1.26% 5.60%
>> >> >
>> >> > 2) -O2 size:
>> >> > -0.78% (very cheap) 6.51% (cheap) for spec2k2006
>> >> > -0.32% (very cheap) 6.75% (cheap) for spec2k2017
>> >> > 3) build times:
>> >> > 0%, 0.16%, 0.71%, 0.93% (very cheap) 6.05% 4.80% 6.75% 7.15% (cheap) for spec2k2006
>> >> > 0.39% 0.57% 0.71% (very cheap) 5.40% 6.23% 8.44% (cheap) for spec2k2017
>> >> > here I simply copied data from different configurations
>> >> >
>> >> > So for SPEC I would say that most of compile time costs are derived
>> >> > from code size growth which is a problem with cheap model but not with
>> >> > very cheap. Very cheap indeed results in code size improvements and
>> >> > compile time impact is probably somewhere around 0.5%
>> >> >
>> >> > So from these scores alone this would seem that vectorization makes
>> >> > sense at -O2 with very cheap model to me (I am sure we have other
>> >> > optimizations with worse benefits to compile time tradeoffs).
>> >>
>> >> Thanks for running these.
>> >>
>> >> The biggest issue I know of for enabling very-cheap at -O2 is:
>> >>
>> >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089
>> >>
>> >> Perhaps we could get around that by (hopefully temporarily) disabling
>> >> BB SLP within loop vectorisation for the very-cheap model. This would
>> >> purely be a workaround and we should remove it once the PR is fixed.
>> >> (It would even be a compile-time win in the meantime :-))
>> >>
>> >> Thanks,
>> >> Richard
>> >>
>> >> > However there are usual arguments against:
>> >> >
>> >> > 1) Vectorizer being tuned for SPEC. I think the only way to overcome
>> >> > that argument is to enable it by default :)
>> >> > 2) Workloads improved are more of -Ofast type workloads
>> >> >
>> >> > Here are non-spec benchmarks we track:
>> >> > https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&min_percentage_change=0.02&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
>> >> >
>> >> > I also tried to run Firefox some time ago. Results are not surprising -
>> >> > vectorization helps rendering benchmarks, which are those compiled with
>> >> > aggressive flags anyway.
>> >> >
>> >> > Honza
>> >
>> > Hi:
>> > I would like to ask if we can turn on O2 vectorization now?
>>
>> I think we still need to deal with the PR100089 issue that I mentioned above.
>> Like I say, “dealing with” it could be as simple as disabling:
>>
>> /* If we applied if-conversion then try to vectorize the
>> BB of innermost loops.
>> ??? Ideally BB vectorization would learn to vectorize
>> control flow by applying if-conversion on-the-fly, the
>> following retains the if-converted loop body even when
>> only non-if-converted parts took part in BB vectorization. */
>> if (flag_tree_slp_vectorize != 0
>> && loop_vectorized_call
>> && ! loop->inner)
>>
>> for the very-cheap vector cost model until the PR is fixed properly.
>
> Alternatively only enable loop vectorization at -O2 (the above checks
> flag_tree_slp_vectorize as well). At least the cost model kind
> does not have any influence on BB vectorization, that is, we get the
> same pros and cons as we do for -O3.
Yeah, but a lot of the loop vector cost model choice is about controlling
code size growth and avoiding excessive runtime versioning tests. BB SLP
should be a win on both code size and performance (barring significant
target costing issues).
PR100089 was an exception because we ended up keeping unvectorised
scalar code that would never have existed otherwise. BB SLP proper
shouldn't have that problem.
Thanks,
Richard
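Basic-block (straight-line) SLP, the case Richard argues should be a win on both size and speed, applies to code like the following (an illustrative sketch, not from the thread):

```c
#include <assert.h>

/* Sketch (not from the thread): four independent element-wise adds in
   a single basic block, with no loop involved.  BB SLP can combine
   these into one vector load/add/store sequence, which is why it tends
   to shrink code while also speeding it up. */
void add4(int *restrict a, const int *restrict b)
{
  a[0] += b[0];
  a[1] += b[1];
  a[2] += b[2];
  a[3] += b[3];
}
```

With `-O2 -ftree-slp-vectorize` (a real GCC flag) this kind of block is a candidate for vectorization independent of the loop cost model.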
* Re: Why vectorization didn't turn on by -O2
2021-08-04 8:31 ` Richard Biener
2021-08-04 9:10 ` Richard Sandiford
@ 2021-08-04 9:12 ` Hongtao Liu
2021-08-11 17:14 ` Jan Hubicka
2021-08-16 3:22 ` Kewen.Lin
3 siblings, 0 replies; 26+ messages in thread
From: Hongtao Liu @ 2021-08-04 9:12 UTC (permalink / raw)
To: Richard Biener
Cc: Richard Sandiford, Jan Hubicka, Segher Boessenkool, gcc-help,
bin.cheng, 172060045
On Wed, Aug 4, 2021 at 4:31 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Wed, 4 Aug 2021, Richard Sandiford wrote:
>
> > Hongtao Liu <crazylht@gmail.com> writes:
> > > On Tue, May 18, 2021 at 4:27 AM Richard Sandiford via Gcc-help
> > > <gcc-help@gcc.gnu.org> wrote:
> > >>
> > >> Jan Hubicka <hubicka@ucw.cz> writes:
> > >> > Hi,
> > >> > here are updated scores.
> > >> > https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> > >> > compares
> > >> > base: mainline
> > >> > 1st column: mainline with very cheap vectorization at -O2 and -O3
> > >> > 2nd column: mainline with cheap vectorization at -O2 and -O3.
> > >> >
> > >> > The short story is:
> > >> >
> > >> > 1) -O2 generic performance
> > >> > kabylake (Intel):
> > >> > very cheap
> > >> > SPEC/SPEC2006/FP/total ~ 8.32%
> > >> > SPEC/SPEC2006/total -0.38% 4.74%
> > >> > SPEC/SPEC2006/INT/total -0.91% -0.14%
> > >> >
> > >> > SPEC/SPEC2017/INT/total 4.71% 7.11%
> > >> > SPEC/SPEC2017/total 2.22% 6.52%
> > >> > SPEC/SPEC2017/FP/total 0.34% 6.06%
> > >> > zen
> > >> > SPEC/SPEC2006/FP/total 0.61% 10.23%
> > >> > SPEC/SPEC2006/total 0.26% 6.27%
> > >> > SPEC/SPEC2006/INT/total 34.006 -0.24% 0.90%
> > >> >
> > >> > SPEC/SPEC2017/INT/total 3.937 5.34% 7.80%
> > >> > SPEC/SPEC2017/total 3.02% 6.55%
> > >> > SPEC/SPEC2017/FP/total 1.26% 5.60%
> > >> >
> > >> > 2) -O2 size:
> > >> > -0.78% (very cheap) 6.51% (cheap) for spec2k2006
> > >> > -0.32% (very cheap) 6.75% (cheap) for spec2k2017
> > >> > 3) build times:
> > >> > 0%, 0.16%, 0.71%, 0.93% (very cheap) 6.05% 4.80% 6.75% 7.15% (cheap) for spec2k2006
> > >> > 0.39% 0.57% 0.71% (very cheap) 5.40% 6.23% 8.44% (cheap) for spec2k2017
> > >> > here I simply copied data from different configurations
> > >> >
> > >> > So for SPEC I would say that most of compile time costs are derived
> > >> > from code size growth which is a problem with cheap model but not with
> > >> > very cheap. Very cheap indeed results in code size improvements and
> > >> > compile time impact is probably somewhere around 0.5%
> > >> >
> > >> > So from these scores alone this would seem that vectorization makes
> > >> > sense at -O2 with very cheap model to me (I am sure we have other
> > >> > optimizations with worse benefits to compile time tradeoffs).
> > >>
> > >> Thanks for running these.
> > >>
> > >> The biggest issue I know of for enabling very-cheap at -O2 is:
> > >>
> > >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089
> > >>
> > >> Perhaps we could get around that by (hopefully temporarily) disabling
> > >> BB SLP within loop vectorisation for the very-cheap model. This would
> > >> purely be a workaround and we should remove it once the PR is fixed.
> > >> (It would even be a compile-time win in the meantime :-))
> > >>
> > >> Thanks,
> > >> Richard
> > >>
> > >> > However there are usual arguments against:
> > >> >
> > >> > 1) Vectorizer being tuned for SPEC. I think the only way to overcome
> > >> > that argument is to enable it by default :)
> > >> > 2) Workloads improved are more of -Ofast type workloads
> > >> >
> > >> > Here are non-spec benchmarks we track:
> > >> > https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&min_percentage_change=0.02&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> > >> >
> > >> > I also tried to run Firefox some time ago. Results are not surprising -
> > >> > vectorization helps rendering benchmarks, which are those compiled with
> > >> > aggressive flags anyway.
> > >> >
> > >> > Honza
> > >
> > > Hi:
> > > I would like to ask if we can turn on O2 vectorization now?
> >
> > I think we still need to deal with the PR100089 issue that I mentioned above.
> > Like I say, “dealing with” it could be as simple as disabling:
> >
> > /* If we applied if-conversion then try to vectorize the
> > BB of innermost loops.
> > ??? Ideally BB vectorization would learn to vectorize
> > control flow by applying if-conversion on-the-fly, the
> > following retains the if-converted loop body even when
> > only non-if-converted parts took part in BB vectorization. */
> > if (flag_tree_slp_vectorize != 0
> > && loop_vectorized_call
> > && ! loop->inner)
> >
> > for the very-cheap vector cost model until the PR is fixed properly.
>
> Alternatively only enable loop vectorization at -O2 (the above checks
> flag_tree_slp_vectorize as well). At least the cost model kind
> does not have any influence on BB vectorization, that is, we get the
> same pros and cons as we do for -O3.
>
> Did anyone benchmark -O2 -ftree-{loop,slp}-vectorize separately yet?
I can collect 4 sets of data, including both code size and performance, on SPEC2017:
1. baseline: -O2
2. baseline + both slp and loop vectorizer: -O2 -ftree-vectorize
-fvect-cost-model=very-cheap
3. baseline + only loop vectorizer: -O2 -ftree-loop-vectorize
-fvect-cost-model=very-cheap
4. baseline + only bb vectorizer: -O2 -ftree-slp-vectorize
>
> Richard.
--
BR,
Hongtao
* Re: Why vectorization didn't turn on by -O2
2021-08-04 9:10 ` Richard Sandiford
@ 2021-08-04 9:56 ` Segher Boessenkool
2021-08-04 10:22 ` Richard Sandiford
0 siblings, 1 reply; 26+ messages in thread
From: Segher Boessenkool @ 2021-08-04 9:56 UTC (permalink / raw)
To: Richard Biener, Hongtao Liu, Jan Hubicka, gcc-help, bin.cheng,
172060045, richard.sandiford
On Wed, Aug 04, 2021 at 10:10:36AM +0100, Richard Sandiford wrote:
> Richard Biener <rguenther@suse.de> writes:
> > Alternatively only enable loop vectorization at -O2 (the above checks
> > flag_tree_slp_vectorize as well). At least the cost model kind
> > does not have any influence on BB vectorization, that is, we get the
> > same pros and cons as we do for -O3.
>
> Yeah, but a lot of the loop vector cost model choice is about controlling
> code size growth and avoiding excessive runtime versioning tests.
Both of those depend a lot on the target, and target-specific conditions
as well (which CPU model is selected for example). Can we factor that
in somehow? Maybe we need some target hook that returns the expected
percentage code growth for vectorising a given loop, for example, and
-O2 vs. -O3 then selects what percentage is acceptable.
> BB SLP
> should be a win on both code size and performance (barring significant
> target costing issues).
Yeah -- but this could use a similar hook as well (just a straightline
piece of code instead of a loop).
> PR100089 was an exception because we ended up keeping unvectorised
> scalar code that would never have existed otherwise. BB SLP proper
> shouldn't have that problem.
It also is a tiny piece of code. There will always be tiny examples
that are much worse (or much better) than average.
Segher
* Re: Why vectorization didn't turn on by -O2
2021-08-04 9:56 ` Segher Boessenkool
@ 2021-08-04 10:22 ` Richard Sandiford
2021-08-04 21:18 ` Segher Boessenkool
0 siblings, 1 reply; 26+ messages in thread
From: Richard Sandiford @ 2021-08-04 10:22 UTC (permalink / raw)
To: Segher Boessenkool
Cc: Richard Biener, Hongtao Liu, Jan Hubicka, gcc-help, bin.cheng, 172060045
Segher Boessenkool <segher@kernel.crashing.org> writes:
> On Wed, Aug 04, 2021 at 10:10:36AM +0100, Richard Sandiford wrote:
>> Richard Biener <rguenther@suse.de> writes:
>> > Alternatively only enable loop vectorization at -O2 (the above checks
>> > flag_tree_slp_vectorize as well). At least the cost model kind
>> > does not have any influence on BB vectorization, that is, we get the
>> > same pros and cons as we do for -O3.
>>
>> Yeah, but a lot of the loop vector cost model choice is about controlling
>> code size growth and avoiding excessive runtime versioning tests.
>
> Both of those depend a lot on the target, and target-specific conditions
> as well (which CPU model is selected for example). Can we factor that
> in somehow? Maybe we need some target hook that returns the expected
> percentage code growth for vectorising a given loop, for example, and
> -O2 vs. -O3 then selects what percentage is acceptable.
>
>> BB SLP
>> should be a win on both code size and performance (barring significant
>> target costing issues).
>
> Yeah -- but this could use a similar hook as well (just a straightline
> piece of code instead of a loop).
I think anything like that should be driven by motivating use cases.
It's not something that we can easily decide in the abstract.
The results so far with using very-cheap at -O2 have been promising,
so I don't think new hooks should block that becoming the default.
>> PR100089 was an exception because we ended up keeping unvectorised
>> scalar code that would never have existed otherwise. BB SLP proper
>> shouldn't have that problem.
>
> It also is a tiny piece of code. There will always be tiny examples
> that are much worse (or much better) than average.
Yeah, what makes PR100089 important isn't IMO the test itself, but the
underlying problem that the PR exposed. Enabling this “BB SLP in loop
vectorisation” code can lead to the generation of scalar COND_EXPRs even
though we know that ifcvt doesn't have a proper cost model for deciding
whether scalar COND_EXPRs are a win.
Introducing scalar COND_EXPRs at -O3 is arguably an acceptable risk
(although still dubious), but I think it's something we need to avoid
for -O2, even if that means losing the optimisation.
Thanks,
Richard
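The scalar COND_EXPRs under discussion come from if-conversion of loop bodies containing branches, roughly like the following (an illustrative C sketch, not the test case from PR100089):

```c
#include <assert.h>

/* Sketch (not from PR100089): a loop whose body branches on the data.
   If-conversion rewrites the branch as a conditional expression (a
   COND_EXPR in GIMPLE) so the loop vectorizer can handle it; the
   concern above is about that scalar if-converted form surviving even
   when vectorization does not end up happening. */
int sum_positive(const int *a, int n)
{
  int s = 0;
  for (int i = 0; i < n; i++)
    s += (a[i] > 0) ? a[i] : 0;
  return s;
}
```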
* Re: Why vectorization didn't turn on by -O2
2021-08-04 10:22 ` Richard Sandiford
@ 2021-08-04 21:18 ` Segher Boessenkool
0 siblings, 0 replies; 26+ messages in thread
From: Segher Boessenkool @ 2021-08-04 21:18 UTC (permalink / raw)
To: Richard Biener, Hongtao Liu, Jan Hubicka, gcc-help, bin.cheng,
172060045, richard.sandiford
On Wed, Aug 04, 2021 at 11:22:53AM +0100, Richard Sandiford wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> > On Wed, Aug 04, 2021 at 10:10:36AM +0100, Richard Sandiford wrote:
> >> Richard Biener <rguenther@suse.de> writes:
> >> > Alternatively only enable loop vectorization at -O2 (the above checks
> >> > flag_tree_slp_vectorize as well). At least the cost model kind
> >> > does not have any influence on BB vectorization, that is, we get the
> >> > same pros and cons as we do for -O3.
> >>
> >> Yeah, but a lot of the loop vector cost model choice is about controlling
> >> code size growth and avoiding excessive runtime versioning tests.
> >
> > Both of those depend a lot on the target, and target-specific conditions
> > as well (which CPU model is selected for example). Can we factor that
> > in somehow? Maybe we need some target hook that returns the expected
> > percentage code growth for vectorising a given loop, for example, and
> > -O2 vs. -O3 then selects what percentage is acceptable.
> >
> >> BB SLP
> >> should be a win on both code size and performance (barring significant
> >> target costing issues).
> >
> > Yeah -- but this could use a similar hook as well (just a straightline
> > piece of code instead of a loop).
>
> I think anything like that should be driven by motivating use cases.
> It's not something that we can easily decide in the abstract.
>
> The results so far with using very-cheap at -O2 have been promising,
> so I don't think new hooks should block that becoming the default.
Right, but it wouldn't hurt to think a sec if we are on the right path
forward. It is crystal clear that to make good decisions about what
and how to vectorise you need to take *some* target characteristics into
account, and that will have to happen sooner rather than later.
This was all in reply to
> >> Yeah, but a lot of the loop vector cost model choice is about controlling
> >> code size growth and avoiding excessive runtime versioning tests.
It was not meant to hold up these patches :-)
> >> PR100089 was an exception because we ended up keeping unvectorised
> >> scalar code that would never have existed otherwise. BB SLP proper
> >> shouldn't have that problem.
> >
> > It also is a tiny piece of code. There will always be tiny examples
> > that are much worse (or much better) than average.
>
> Yeah, what makes PR100089 important isn't IMO the test itself, but the
> underlying problem that the PR exposed. Enabling this “BB SLP in loop
> vectorisation” code can lead to the generation of scalar COND_EXPRs even
> though we know that ifcvt doesn't have a proper cost model for deciding
> whether scalar COND_EXPRs are a win.
>
> Introducing scalar COND_EXPRs at -O3 is arguably an acceptable risk
> (although still dubious), but I think it's something we need to avoid
> for -O2, even if that means losing the optimisation.
Yeah -- -O2 should almost always do the right thing, while -O3 can do
bad things more often; it just has to be better "on average".
Segher
* Re: Why vectorization didn't turn on by -O2
2021-08-04 8:31 ` Richard Biener
2021-08-04 9:10 ` Richard Sandiford
2021-08-04 9:12 ` Hongtao Liu
@ 2021-08-11 17:14 ` Jan Hubicka
2021-08-14 14:22 ` Jan Hubicka
2021-08-16 3:22 ` Kewen.Lin
3 siblings, 1 reply; 26+ messages in thread
From: Jan Hubicka @ 2021-08-11 17:14 UTC (permalink / raw)
To: Richard Biener
Cc: Richard Sandiford, Hongtao Liu, Segher Boessenkool, gcc-help,
bin.cheng, 172060045
>
> Alternatively only enable loop vectorization at -O2 (the above checks
> flag_tree_slp_vectorize as well). At least the cost model kind
> does not have any influence on BB vectorization, that is, we get the
> same pros and cons as we do for -O3.
>
> Did anyone benchmark -O2 -ftree-{loop,slp}-vectorize separately yet?
I did last year, and most performance changes were from tree-loop-vectorize,
as expected. I will re-run the benchmarks now.
Honza
>
> Richard.
* Re: Why vectorization didn't turn on by -O2
2021-08-11 17:14 ` Jan Hubicka
@ 2021-08-14 14:22 ` Jan Hubicka
2021-08-16 8:03 ` Richard Biener
0 siblings, 1 reply; 26+ messages in thread
From: Jan Hubicka @ 2021-08-14 14:22 UTC (permalink / raw)
To: Richard Biener
Cc: Richard Sandiford, Hongtao Liu, Segher Boessenkool, gcc-help,
bin.cheng, 172060045
Hi,
runs finished for the cheap vectorization model. I am running very cheap
too.
https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_changes=on&min_percentage_change=0.02&revisions=e54acea9e5a821448af97c66e94a1e4c4f977d5d%2Ce87209a1269622017bf3d98bf71502dcb0f893aa%2C73474527aaa24d9236aca074c5494a07f40ce058&include_user_branches=on
compares default -O2 (base), loop vectorize, slp vectorize.
I put "O2" in the machine name.
Overall scores are:
kaby.spec2006.O2_generic
Test loop slp
SPEC/SPEC2006/FP/total 8.16% 0.19%
SPEC/SPEC2006/total 4.96% 0.38%
SPEC/SPEC2006/INT/total 36.595 0.58% 0.65%
kaby.spec2006.O2_generic_lto
SPEC/SPEC2006/FP/total 9.06% -0.36%
SPEC/SPEC2006/total 5.32% ~
SPEC/SPEC2006/INT/total 0.24% 0.27%
kaby.spec2017.O2_generic
SPEC/SPEC2017/INT/total 6.54% -1.15%
SPEC/SPEC2017/total 5.66% -0.17%
SPEC/SPEC2017/FP/total 5.00% 0.59%
kaby.spec2017.O2_generic_lto
SPEC/SPEC2017/INT/total 6.62% -0.12%
SPEC/SPEC2017/total 5.69% -0.14%
SPEC/SPEC2017/FP/total 4.99% -0.16%
zenith.spec2006.O2_generic
SPEC/SPEC2006/FP/total 10.23% -0.35%
SPEC/SPEC2006/total 6.01% -0.48%
SPEC/SPEC2006/INT/total 0.31% -0.66%
zenith.spec2006.O2_generic_lto
SPEC/SPEC2006/FP/total 12.03% 0.82%
SPEC/SPEC2006/total 6.90% 0.44%
SPEC/SPEC2006/INT/total ~ -0.11%
zenith.spec2017.O2_generic
SPEC/SPEC2017/INT/total 7.46% -0.37%
SPEC/SPEC2017/total 6.81% 0.48%
SPEC/SPEC2017/FP/total 6.31% 1.15%
zenith.spec2017.O2_generic_lto
SPEC/SPEC2017/INT/total 7.81% -0.22%
SPEC/SPEC2017/total 7.07% 0.44%
SPEC/SPEC2017/FP/total 6.50% 0.94%
So loop vectorize is a consistent win; slp is mostly neutral.
Code size growth from loop vectorize is too large for -O2. SLP vectorize
seems a slight size win overall.
Noteworthy regressions caused by slp are:
5-6%: xz (kaby, lto), milc (zenith), astar (zenith)
4-5%: xalancbmk (kaby), blender (kaby)
3-4%: xz (kaby, nolto), dealII (zenith)
2-3%: povray (kaby), astar (kaby), perlbench (kaby), sjeng (zenith),
xz(zenith)
We get a 10% improvement on imagemagick (kaby), 17.5% (zenith),
imagemagick 6.99% (zenith)
https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&all_changes=on&min_percentage_change=0.02&revisions=e54acea9e5a821448af97c66e94a1e4c4f977d5d%2Ce87209a1269622017bf3d98bf71502dcb0f893aa%2C73474527aaa24d9236aca074c5494a07f40ce058&include_user_branches=on
is for the C++/polyhedron benchmarks. It shows several interesting
regressions in polyhedron and tsvc for loop vectorization (over 100%)
and also some for slp. Shall I try to search bugzilla for these?
Honza
* Re: Why vectorization didn't turn on by -O2
2021-08-04 8:31 ` Richard Biener
` (2 preceding siblings ...)
2021-08-11 17:14 ` Jan Hubicka
@ 2021-08-16 3:22 ` Kewen.Lin
2021-08-16 6:00 ` Hongtao Liu
3 siblings, 1 reply; 26+ messages in thread
From: Kewen.Lin @ 2021-08-16 3:22 UTC (permalink / raw)
To: Richard Biener
Cc: bin.cheng, Segher Boessenkool, 172060045, gcc-help, Jan Hubicka,
Richard Sandiford
on 2021/8/4 4:31 PM, Richard Biener wrote:
> On Wed, 4 Aug 2021, Richard Sandiford wrote:
>
>> Hongtao Liu <crazylht@gmail.com> writes:
>>> On Tue, May 18, 2021 at 4:27 AM Richard Sandiford via Gcc-help
>>> <gcc-help@gcc.gnu.org> wrote:
>>>>
>>>> Jan Hubicka <hubicka@ucw.cz> writes:
>>>>> Hi,
>>>>> here are updated scores.
>>>>> https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
>>>>> compares
>>>>> base: mainline
>>>>> 1st column: mainline with very cheap vectorization at -O2 and -O3
>>>>> 2nd column: mainline with cheap vectorization at -O2 and -O3.
>>>>>
>>>>> The short story is:
>>>>>
>>>>> 1) -O2 generic performance
>>>>> kabylake (Intel):
>>>>> very cheap
>>>>> SPEC/SPEC2006/FP/total ~ 8.32%
>>>>> SPEC/SPEC2006/total -0.38% 4.74%
>>>>> SPEC/SPEC2006/INT/total -0.91% -0.14%
>>>>>
>>>>> SPEC/SPEC2017/INT/total 4.71% 7.11%
>>>>> SPEC/SPEC2017/total 2.22% 6.52%
>>>>> SPEC/SPEC2017/FP/total 0.34% 6.06%
>>>>> zen
>>>>> SPEC/SPEC2006/FP/total 0.61% 10.23%
>>>>> SPEC/SPEC2006/total 0.26% 6.27%
>>>>> SPEC/SPEC2006/INT/total 34.006 -0.24% 0.90%
>>>>>
>>>>> SPEC/SPEC2017/INT/total 3.937 5.34% 7.80%
>>>>> SPEC/SPEC2017/total 3.02% 6.55%
>>>>> SPEC/SPEC2017/FP/total 1.26% 5.60%
>>>>>
>>>>> 2) -O2 size:
>>>>> -0.78% (very cheap) 6.51% (cheap) for spec2k2006
>>>>> -0.32% (very cheap) 6.75% (cheap) for spec2k2017
>>>>> 3) build times:
>>>>> 0%, 0.16%, 0.71%, 0.93% (very cheap) 6.05% 4.80% 6.75% 7.15% (cheap) for spec2k2006
>>>>> 0.39% 0.57% 0.71% (very cheap) 5.40% 6.23% 8.44% (cheap) for spec2k2017
>>>>> here I simply copied data from different configurations
>>>>>
>>>>> So for SPEC I would say that most of compile time costs are derived
>>>>> from code size growth which is a problem with cheap model but not with
>>>>> very cheap. Very cheap indeed results in code size improvements and
>>>>> compile time impact is probably somewhere around 0.5%
>>>>>
>>>>> So from these scores alone this would seem that vectorization makes
>>>>> sense at -O2 with very cheap model to me (I am sure we have other
>>>>> optimizations with worse benefits to compile time tradeoffs).
>>>>
>>>> Thanks for running these.
>>>>
>>>> The biggest issue I know of for enabling very-cheap at -O2 is:
>>>>
>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089
>>>>
>>>> Perhaps we could get around that by (hopefully temporarily) disabling
>>>> BB SLP within loop vectorisation for the very-cheap model. This would
>>>> purely be a workaround and we should remove it once the PR is fixed.
>>>> (It would even be a compile-time win in the meantime :-))
>>>>
>>>> Thanks,
>>>> Richard
>>>>
>>>>> However there are usual arguments against:
>>>>>
>>>>> 1) Vectorizer being tuned for SPEC. I think the only way to overcome
>>>>> that argument is to enable it by default :)
>>>>> 2) Workloads improved are more of -Ofast type workloads
>>>>>
>>>>> Here are non-spec benchmarks we track:
>>>>> https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&min_percentage_change=0.02&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
>>>>>
>>>>> I also tried to run Firefox some time ago. Results are not surprising -
>>>>> vectorization helps rendering benchmarks, which are those compiled with
>>>>> aggressive flags anyway.
>>>>>
>>>>> Honza
>>>
>>> Hi:
>>> I would like to ask if we can turn on O2 vectorization now?
>>
>> I think we still need to deal with the PR100089 issue that I mentioned above.
>> Like I say, “dealing with” it could be as simple as disabling:
>>
>> /* If we applied if-conversion then try to vectorize the
>> BB of innermost loops.
>> ??? Ideally BB vectorization would learn to vectorize
>> control flow by applying if-conversion on-the-fly, the
>> following retains the if-converted loop body even when
>> only non-if-converted parts took part in BB vectorization. */
>> if (flag_tree_slp_vectorize != 0
>> && loop_vectorized_call
>> && ! loop->inner)
>>
>> for the very-cheap vector cost model until the PR is fixed properly.
>
> Alternatively only enable loop vectorization at -O2 (the above checks
> flag_tree_slp_vectorize as well). At least the cost model kind
> does not have any influence on BB vectorization, that is, we get the
> same pros and cons as we do for -O3.
>
> Did anyone benchmark -O2 -ftree-{loop,slp}-vectorize separately yet?
>
Here is the measured performance speedup at O2 vect with
very cheap cost model on both Power8 and Power9.
INT: -O2 -mcpu=power{8,9} -ftree-{,loop-,slp-}vectorize -fvect-cost-model=very-cheap
FP: INT + -ffast-math
Column titles are:
<bmks> <both loop and slp> <loop only> <slp only> (+:improvement, -:degradation)
Power8:
500.perlbench_r 0.00% 0.00% 0.00%
502.gcc_r 0.39% 0.78% 0.00%
505.mcf_r 0.00% 0.00% 0.00%
520.omnetpp_r 1.21% 0.30% 0.00%
523.xalancbmk_r 0.00% 0.00% -0.57%
525.x264_r 41.84% 42.55% 0.00%
531.deepsjeng_r 0.00% -0.63% 0.00%
541.leela_r -3.44% -2.75% 0.00%
548.exchange2_r 1.66% 1.66% 0.00%
557.xz_r 1.39% 1.04% 0.00%
Geomean 3.67% 3.64% -0.06%
503.bwaves_r 0.00% 0.00% 0.00%
507.cactuBSSN_r 0.00% 0.29% 0.44%
508.namd_r 0.00% 0.29% 0.00%
510.parest_r 0.00% -0.36% -0.54%
511.povray_r 0.63% 0.31% 0.94%
519.lbm_r 2.71% 2.71% 0.00%
521.wrf_r 1.04% 1.04% 0.00%
526.blender_r -1.31% -0.78% 0.00%
527.cam4_r -0.62% -0.31% -0.62%
538.imagick_r 0.21% 0.21% -0.21%
544.nab_r 0.00% 0.00% 0.00%
549.fotonik3d_r 0.00% 0.00% 0.00%
554.roms_r 0.30% 0.00% 0.00%
Geomean 0.22% 0.26% 0.00%
Power9:
500.perlbench_r 0.62% 0.62% -1.54%
502.gcc_r -0.60% -0.60% -0.81%
505.mcf_r 2.05% 2.05% 0.00%
520.omnetpp_r -2.41% -0.30% -0.60%
523.xalancbmk_r -1.44% -2.30% -1.44%
525.x264_r 24.26% 23.93% -0.33%
531.deepsjeng_r 0.32% 0.32% 0.00%
541.leela_r 0.39% 1.18% -0.39%
548.exchange2_r 0.76% 0.76% 0.00%
557.xz_r 0.36% 0.36% -0.36%
Geomean 2.19% 2.38% -0.55%
503.bwaves_r 0.00% 0.36% 0.00%
507.cactuBSSN_r 0.00% 0.00% 0.00%
508.namd_r -3.73% -0.31% -3.73%
510.parest_r -0.21% -0.42% -0.42%
511.povray_r -0.96% -1.59% 0.64%
519.lbm_r 2.31% 2.31% 0.17%
521.wrf_r 2.66% 2.66% 0.00%
526.blender_r -1.96% -1.68% 1.40%
527.cam4_r 0.00% 0.91% 1.81%
538.imagick_r 0.39% -0.19% -10.29% // known noise, imagick_r can have big jitter on P9 box sometimes.
544.nab_r 0.25% 0.00% 0.00%
549.fotonik3d_r 0.94% 0.94% 0.00%
554.roms_r 0.00% 0.00% -1.05%
Geomean -0.03% 0.22% -0.93%
As above, the gains are mainly from loop vectorization.
btw, Power8 data can be more representative since some bmks can have jitters on our P9 perf box.
BR,
Kewen
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-08-16 3:22 ` Kewen.Lin
@ 2021-08-16 6:00 ` Hongtao Liu
2021-08-16 6:09 ` Hongtao Liu
0 siblings, 1 reply; 26+ messages in thread
From: Hongtao Liu @ 2021-08-16 6:00 UTC (permalink / raw)
To: Kewen.Lin
Cc: Richard Biener, bin.cheng, Segher Boessenkool, 172060045,
gcc-help, Richard Sandiford, Jan Hubicka
On Mon, Aug 16, 2021 at 11:23 AM Kewen.Lin via Gcc-help
<gcc-help@gcc.gnu.org> wrote:
>
> on 2021/8/4 下午4:31, Richard Biener wrote:
> > On Wed, 4 Aug 2021, Richard Sandiford wrote:
> >
> >> Hongtao Liu <crazylht@gmail.com> writes:
> >>> On Tue, May 18, 2021 at 4:27 AM Richard Sandiford via Gcc-help
> >>> <gcc-help@gcc.gnu.org> wrote:
> >>>>
> >>>> Jan Hubicka <hubicka@ucw.cz> writes:
> >>>>> Hi,
> >>>>> here are updated scores.
> >>>>> https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> >>>>> compares
> >>>>> base: mainline
> >>>>> 1st column: mainline with very cheap vectorization at -O2 and -O3
> >>>>> 2nd column: mainline with cheap vectorization at -O2 and -O3.
> >>>>>
> >>>>> The short story is:
> >>>>>
> >>>>> 1) -O2 generic performance
> >>>>> kabylake (Intel):
> >>>>> very cheap
> >>>>> SPEC/SPEC2006/FP/total ~ 8.32%
> >>>>> SPEC/SPEC2006/total -0.38% 4.74%
> >>>>> SPEC/SPEC2006/INT/total -0.91% -0.14%
> >>>>>
> >>>>> SPEC/SPEC2017/INT/total 4.71% 7.11%
> >>>>> SPEC/SPEC2017/total 2.22% 6.52%
> >>>>> SPEC/SPEC2017/FP/total 0.34% 6.06%
> >>>>> zen
> >>>>> SPEC/SPEC2006/FP/total 0.61% 10.23%
> >>>>> SPEC/SPEC2006/total 0.26% 6.27%
> >>>>> SPEC/SPEC2006/INT/total 34.006 -0.24% 0.90%
> >>>>>
> >>>>> SPEC/SPEC2017/INT/total 3.937 5.34% 7.80%
> >>>>> SPEC/SPEC2017/total 3.02% 6.55%
> >>>>> SPEC/SPEC2017/FP/total 1.26% 5.60%
> >>>>>
> >>>>> 2) -O2 size:
> >>>>> -0.78% (very cheap) 6.51% (cheap) for spec2k2006
> >>>>> -0.32% (very cheap) 6.75% (cheap) for spec2k2017
> >>>>> 3) build times:
> >>>>> 0%, 0.16%, 0.71%, 0.93% (very cheap) 6.05% 4.80% 6.75% 7.15% (cheap) for spec2k2006
> >>>>> 0.39% 0.57% 0.71% (very cheap) 5.40% 6.23% 8.44% (cheap) for spec2k2017
> >>>>> here I simply copied data from different configurations
> >>>>>
> >>>>> So for SPEC I would say that most of compile time costs are derived
> >>>>> from code size growth which is a problem with cheap model but not with
> >>>>> very cheap. Very cheap indeed results in code size improvements and
> >>>>> compile time impact is probably somewhere around 0.5%
> >>>>>
> >>>>> So from these scores alone this would seem that vectorization makes
> >>>>> sense at -O2 with very cheap model to me (I am sure we have other
> >>>>> optimizations with worse benefits to compile time tradeoffs).
> >>>>
> >>>> Thanks for running these.
> >>>>
> >>>> The biggest issue I know of for enabling very-cheap at -O2 is:
> >>>>
> >>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089
> >>>>
> >>>> Perhaps we could get around that by (hopefully temporarily) disabling
> >>>> BB SLP within loop vectorisation for the very-cheap model. This would
> >>>> purely be a workaround and we should remove it once the PR is fixed.
> >>>> (It would even be a compile-time win in the meantime :-))
> >>>>
> >>>> Thanks,
> >>>> Richard
> >>>>
> >>>>> However there are usual arguments against:
> >>>>>
> >>>>> 1) Vectorizer being tuned for SPEC. I think the only way to overcome
> >>>>> that argument is to enable it by default :)
> >>>>> 2) Workloads improved are more of -Ofast type workloads
> >>>>>
> >>>>> Here are non-spec benchmarks we track:
> >>>>> https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&min_percentage_change=0.02&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> >>>>>
> >>>>> I also tried to run Firefox some time ago. Results are not surprising -
> >>>>> vectorization helps rendering benchmarks, which are those compiled with
> >>>>> aggressive flags anyway.
> >>>>>
> >>>>> Honza
> >>>
> >>> Hi:
> >>> I would like to ask if we can turn on O2 vectorization now?
> >>
> >> I think we still need to deal with the PR100089 issue that I mentioned above.
> >> Like I say, “dealing with” it could be as simple as disabling:
> >>
> >> /* If we applied if-conversion then try to vectorize the
> >> BB of innermost loops.
> >> ??? Ideally BB vectorization would learn to vectorize
> >> control flow by applying if-conversion on-the-fly, the
> >> following retains the if-converted loop body even when
> >> only non-if-converted parts took part in BB vectorization. */
> >> if (flag_tree_slp_vectorize != 0
> >> && loop_vectorized_call
> >> && ! loop->inner)
> >>
> >> for the very-cheap vector cost model until the PR is fixed properly.
> >
> > Alternatively only enable loop vectorization at -O2 (the above checks
> > flag_tree_slp_vectorize as well). At least the cost model kind
> > does not have any influence on BB vectorization, that is, we get the
> > same pros and cons as we do for -O3.
> >
> > Did anyone benchmark -O2 -ftree-{loop,slp}-vectorize separately yet?
> >
>
>
> Here is the measured performance speedup at O2 vect with
> very cheap cost model on both Power8 and Power9.
>
> INT: -O2 -mcpu=power{8,9} -ftree-{,loop-,slp-}vectorize -fvect-cost-model=very-cheap
> FP: INT + -ffast-math
>
> Column titles are:
>
> <bmks> <both loop and slp> <loop only> <slp only> (+:improvement, -:degradation)
>
> Power8:
> 500.perlbench_r 0.00% 0.00% 0.00%
> 502.gcc_r 0.39% 0.78% 0.00%
> 505.mcf_r 0.00% 0.00% 0.00%
> 520.omnetpp_r 1.21% 0.30% 0.00%
> 523.xalancbmk_r 0.00% 0.00% -0.57%
> 525.x264_r 41.84% 42.55% 0.00%
> 531.deepsjeng_r 0.00% -0.63% 0.00%
> 541.leela_r -3.44% -2.75% 0.00%
> 548.exchange2_r 1.66% 1.66% 0.00%
> 557.xz_r 1.39% 1.04% 0.00%
> Geomean 3.67% 3.64% -0.06%
>
> 503.bwaves_r 0.00% 0.00% 0.00%
> 507.cactuBSSN_r 0.00% 0.29% 0.44%
> 508.namd_r 0.00% 0.29% 0.00%
> 510.parest_r 0.00% -0.36% -0.54%
> 511.povray_r 0.63% 0.31% 0.94%
> 519.lbm_r 2.71% 2.71% 0.00%
> 521.wrf_r 1.04% 1.04% 0.00%
> 526.blender_r -1.31% -0.78% 0.00%
> 527.cam4_r -0.62% -0.31% -0.62%
> 538.imagick_r 0.21% 0.21% -0.21%
> 544.nab_r 0.00% 0.00% 0.00%
> 549.fotonik3d_r 0.00% 0.00% 0.00%
> 554.roms_r 0.30% 0.00% 0.00%
> Geomean 0.22% 0.26% 0.00%
>
> Power9:
>
> 500.perlbench_r 0.62% 0.62% -1.54%
> 502.gcc_r -0.60% -0.60% -0.81%
> 505.mcf_r 2.05% 2.05% 0.00%
> 520.omnetpp_r -2.41% -0.30% -0.60%
> 523.xalancbmk_r -1.44% -2.30% -1.44%
> 525.x264_r 24.26% 23.93% -0.33%
> 531.deepsjeng_r 0.32% 0.32% 0.00%
> 541.leela_r 0.39% 1.18% -0.39%
> 548.exchange2_r 0.76% 0.76% 0.00%
> 557.xz_r 0.36% 0.36% -0.36%
> Geomean 2.19% 2.38% -0.55%
>
> 503.bwaves_r 0.00% 0.36% 0.00%
> 507.cactuBSSN_r 0.00% 0.00% 0.00%
> 508.namd_r -3.73% -0.31% -3.73%
> 510.parest_r -0.21% -0.42% -0.42%
> 511.povray_r -0.96% -1.59% 0.64%
> 519.lbm_r 2.31% 2.31% 0.17%
> 521.wrf_r 2.66% 2.66% 0.00%
> 526.blender_r -1.96% -1.68% 1.40%
> 527.cam4_r 0.00% 0.91% 1.81%
> 538.imagick_r 0.39% -0.19% -10.29% // known noise, imagick_r can have big jitter on P9 box sometimes.
> 544.nab_r 0.25% 0.00% 0.00%
> 549.fotonik3d_r 0.94% 0.94% 0.00%
> 554.roms_r 0.00% 0.00% -1.05%
> Geomean -0.03% 0.22% -0.93%
>
>
> As above, the gains are mainly from loop vectorization.
> btw, Power8 data can be more representative since some bmks can have jitters on our P9 perf box.
>
> BR,
> Kewen
Here is data on CLX.
+ for performance means better.
- for codesize means better.
We notice a codesize increase in 549.fotonik3d_r (3.36%) which did not
exist in our last measurement with GCC 11.0.0 20210317; it is not
related to the fix of PR100089.
The others are about the same as in the last measurement.
                     O2 -ftree-vectorize very-cheap   loop vectorizer         bb vectorizer
                     codesize  performance            codesize  performance   codesize  performance
500.perlbench_r       0.34%     0.55%                  0.05%     0.80%         0.29%     0.84%
502.gcc_r             0.29%    -0.32%                  0.02%    -0.12%         0.27%    -0.23%
505.mcf_r             1.36%    -1.20% (noise)          0.00%    -0.69%         1.16%    -0.85%
520.omnetpp_r        -0.65%    -0.83%                  0.05%    -0.97%        -0.70%    -0.52%
523.xalancbmk_r       0.04%    -0.59%                  0.26%    -0.56%        -0.04%    -0.52%
525.x264_r            1.29%    62.62%                  1.18%    64.80%         0.13%    -0.29%
531.deepsjeng_r       0.18%    -0.44%                  0.16%    -0.03%        -0.05%    -0.50%
541.leela_r          -1.10%    -0.12%                 -0.11%     0.59%        -0.99%    -1.12%
548.exchange2_r      -1.19%     0.34%                 -0.27%    -0.29%        -1.02%     0.17%
557.xz_r             -0.53%    -1.01% (cost model)    -0.76%    -0.10%        -0.10%    -1.28%
geomean for intrate   0.00%     4.60%                  0.06%     4.98%        -0.11%    -0.43%
503.bwaves_r         -0.29%    -1.19% (noise)          0.00%    -0.86%        -0.25%    -0.43%
507.cactuBSSN_r       0.01%    -0.55%                  0.01%    -0.35%         0.01%    -0.37%
508.namd_r           -0.61%     2.38%                 -0.13%    -0.09%        -0.67%     2.45%
510.parest_r         -0.41%     0.10%                 -0.16%     0.62%        -0.50%     0.72%
511.povray_r         -1.76%     3.79%                 -0.03%     0.41%        -1.74%     4.61%
519.lbm_r             0.38%    -0.33%                  0.00%    -0.31%         0.38%     0.05%
521.wrf_r            -0.85%     1.23%                 -0.03%     1.60%        -0.94%     0.00%
526.blender_r        -0.40%    -1.21% (noise)          0.00%    -1.49%        -0.43%    -1.64%
527.cam4_r           -0.27%     0.06%                  0.10%    -0.06%        -0.39%    -0.01%
538.imagick_r        -0.97%     1.10%                 -0.09%     0.32%        -0.90%     2.49%
544.nab_r            -0.65%     0.09%                  0.02%     0.20%        -0.69%     0.09%
549.fotonik3d_r       3.36%     0.30%                  2.42%     0.44%         0.93%    -0.08%
554.roms_r           -0.28%    -0.20%                  0.25%     0.06%        -0.52%     0.00%
geomean for fprate   -0.22%     0.42%                  0.18%     0.04%        -0.44%     0.59%
geomean              -0.12%     2.22%                  0.13%     2.16%        -0.30%     0.15%
--
BR,
Hongtao
* Re: Why vectorization didn't turn on by -O2
2021-08-16 6:00 ` Hongtao Liu
@ 2021-08-16 6:09 ` Hongtao Liu
2021-08-24 2:21 ` Hongtao Liu
0 siblings, 1 reply; 26+ messages in thread
From: Hongtao Liu @ 2021-08-16 6:09 UTC (permalink / raw)
To: Kewen.Lin
Cc: Richard Biener, bin.cheng, Segher Boessenkool, 172060045,
gcc-help, Richard Sandiford, Jan Hubicka
On Mon, Aug 16, 2021 at 2:00 PM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Mon, Aug 16, 2021 at 11:23 AM Kewen.Lin via Gcc-help
> <gcc-help@gcc.gnu.org> wrote:
> >
> > on 2021/8/4 下午4:31, Richard Biener wrote:
> > > On Wed, 4 Aug 2021, Richard Sandiford wrote:
> > >
> > >> Hongtao Liu <crazylht@gmail.com> writes:
> > >>> On Tue, May 18, 2021 at 4:27 AM Richard Sandiford via Gcc-help
> > >>> <gcc-help@gcc.gnu.org> wrote:
> > >>>>
> > >>>> Jan Hubicka <hubicka@ucw.cz> writes:
> > >>>>> Hi,
> > >>>>> here are updated scores.
> > >>>>> https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> > >>>>> compares
> > >>>>> base: mainline
> > >>>>> 1st column: mainline with very cheap vectorization at -O2 and -O3
> > >>>>> 2nd column: mainline with cheap vectorization at -O2 and -O3.
> > >>>>>
> > >>>>> The short story is:
> > >>>>>
> > >>>>> 1) -O2 generic performance
> > >>>>> kabylake (Intel):
> > >>>>> very cheap
> > >>>>> SPEC/SPEC2006/FP/total ~ 8.32%
> > >>>>> SPEC/SPEC2006/total -0.38% 4.74%
> > >>>>> SPEC/SPEC2006/INT/total -0.91% -0.14%
> > >>>>>
> > >>>>> SPEC/SPEC2017/INT/total 4.71% 7.11%
> > >>>>> SPEC/SPEC2017/total 2.22% 6.52%
> > >>>>> SPEC/SPEC2017/FP/total 0.34% 6.06%
> > >>>>> zen
> > >>>>> SPEC/SPEC2006/FP/total 0.61% 10.23%
> > >>>>> SPEC/SPEC2006/total 0.26% 6.27%
> > >>>>> SPEC/SPEC2006/INT/total 34.006 -0.24% 0.90%
> > >>>>>
> > >>>>> SPEC/SPEC2017/INT/total 3.937 5.34% 7.80%
> > >>>>> SPEC/SPEC2017/total 3.02% 6.55%
> > >>>>> SPEC/SPEC2017/FP/total 1.26% 5.60%
> > >>>>>
> > >>>>> 2) -O2 size:
> > >>>>> -0.78% (very cheap) 6.51% (cheap) for spec2k2006
> > >>>>> -0.32% (very cheap) 6.75% (cheap) for spec2k2017
> > >>>>> 3) build times:
> > >>>>> 0%, 0.16%, 0.71%, 0.93% (very cheap) 6.05% 4.80% 6.75% 7.15% (cheap) for spec2k2006
> > >>>>> 0.39% 0.57% 0.71% (very cheap) 5.40% 6.23% 8.44% (cheap) for spec2k2017
> > >>>>> here I simply copied data from different configurations
> > >>>>>
> > >>>>> So for SPEC I would say that most of compile time costs are derived
> > >>>>> from code size growth which is a problem with cheap model but not with
> > >>>>> very cheap. Very cheap indeed results in code size improvements and
> > >>>>> compile time impact is probably somewhere around 0.5%
> > >>>>>
> > >>>>> So from these scores alone this would seem that vectorization makes
> > >>>>> sense at -O2 with very cheap model to me (I am sure we have other
> > >>>>> optimizations with worse benefits to compile time tradeoffs).
> > >>>>
> > >>>> Thanks for running these.
> > >>>>
> > >>>> The biggest issue I know of for enabling very-cheap at -O2 is:
> > >>>>
> > >>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089
> > >>>>
> > >>>> Perhaps we could get around that by (hopefully temporarily) disabling
> > >>>> BB SLP within loop vectorisation for the very-cheap model. This would
> > >>>> purely be a workaround and we should remove it once the PR is fixed.
> > >>>> (It would even be a compile-time win in the meantime :-))
> > >>>>
> > >>>> Thanks,
> > >>>> Richard
> > >>>>
> > >>>>> However there are usual arguments against:
> > >>>>>
> > >>>>> 1) Vectorizer being tuned for SPEC. I think the only way to overcome
> > >>>>> that argument is to enable it by default :)
> > >>>>> 2) Workloads improved are more of -Ofast type workloads
> > >>>>>
> > >>>>> Here are non-spec benchmarks we track:
> > >>>>> https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&min_percentage_change=0.02&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> > >>>>>
> > >>>>> I also tried to run Firefox some time ago. Results are not surprising -
> > >>>>> vectorization helps rendering benchmarks, which are those compiled with
> > >>>>> aggressive flags anyway.
> > >>>>>
> > >>>>> Honza
> > >>>
> > >>> Hi:
> > >>> I would like to ask if we can turn on O2 vectorization now?
> > >>
> > >> I think we still need to deal with the PR100089 issue that I mentioned above.
> > >> Like I say, “dealing with” it could be as simple as disabling:
> > >>
> > >> /* If we applied if-conversion then try to vectorize the
> > >> BB of innermost loops.
> > >> ??? Ideally BB vectorization would learn to vectorize
> > >> control flow by applying if-conversion on-the-fly, the
> > >> following retains the if-converted loop body even when
> > >> only non-if-converted parts took part in BB vectorization. */
> > >> if (flag_tree_slp_vectorize != 0
> > >> && loop_vectorized_call
> > >> && ! loop->inner)
> > >>
> > >> for the very-cheap vector cost model until the PR is fixed properly.
> > >
> > > Alternatively only enable loop vectorization at -O2 (the above checks
> > > flag_tree_slp_vectorize as well). At least the cost model kind
> > > does not have any influence on BB vectorization, that is, we get the
> > > same pros and cons as we do for -O3.
> > >
> > > Did anyone benchmark -O2 -ftree-{loop,slp}-vectorize separately yet?
> > >
> >
> >
> > Here is the measured performance speedup at O2 vect with
> > very cheap cost model on both Power8 and Power9.
> >
> > INT: -O2 -mcpu=power{8,9} -ftree-{,loop-,slp-}vectorize -fvect-cost-model=very-cheap
> > FP: INT + -ffast-math
> >
> > Column titles are:
> >
> > <bmks> <both loop and slp> <loop only> <slp only> (+:improvement, -:degradation)
> >
> > Power8:
> > 500.perlbench_r 0.00% 0.00% 0.00%
> > 502.gcc_r 0.39% 0.78% 0.00%
> > 505.mcf_r 0.00% 0.00% 0.00%
> > 520.omnetpp_r 1.21% 0.30% 0.00%
> > 523.xalancbmk_r 0.00% 0.00% -0.57%
> > 525.x264_r 41.84% 42.55% 0.00%
> > 531.deepsjeng_r 0.00% -0.63% 0.00%
> > 541.leela_r -3.44% -2.75% 0.00%
> > 548.exchange2_r 1.66% 1.66% 0.00%
> > 557.xz_r 1.39% 1.04% 0.00%
> > Geomean 3.67% 3.64% -0.06%
> >
> > 503.bwaves_r 0.00% 0.00% 0.00%
> > 507.cactuBSSN_r 0.00% 0.29% 0.44%
> > 508.namd_r 0.00% 0.29% 0.00%
> > 510.parest_r 0.00% -0.36% -0.54%
> > 511.povray_r 0.63% 0.31% 0.94%
> > 519.lbm_r 2.71% 2.71% 0.00%
> > 521.wrf_r 1.04% 1.04% 0.00%
> > 526.blender_r -1.31% -0.78% 0.00%
> > 527.cam4_r -0.62% -0.31% -0.62%
> > 538.imagick_r 0.21% 0.21% -0.21%
> > 544.nab_r 0.00% 0.00% 0.00%
> > 549.fotonik3d_r 0.00% 0.00% 0.00%
> > 554.roms_r 0.30% 0.00% 0.00%
> > Geomean 0.22% 0.26% 0.00%
> >
> > Power9:
> >
> > 500.perlbench_r 0.62% 0.62% -1.54%
> > 502.gcc_r -0.60% -0.60% -0.81%
> > 505.mcf_r 2.05% 2.05% 0.00%
> > 520.omnetpp_r -2.41% -0.30% -0.60%
> > 523.xalancbmk_r -1.44% -2.30% -1.44%
> > 525.x264_r 24.26% 23.93% -0.33%
> > 531.deepsjeng_r 0.32% 0.32% 0.00%
> > 541.leela_r 0.39% 1.18% -0.39%
> > 548.exchange2_r 0.76% 0.76% 0.00%
> > 557.xz_r 0.36% 0.36% -0.36%
> > Geomean 2.19% 2.38% -0.55%
> >
> > 503.bwaves_r 0.00% 0.36% 0.00%
> > 507.cactuBSSN_r 0.00% 0.00% 0.00%
> > 508.namd_r -3.73% -0.31% -3.73%
> > 510.parest_r -0.21% -0.42% -0.42%
> > 511.povray_r -0.96% -1.59% 0.64%
> > 519.lbm_r 2.31% 2.31% 0.17%
> > 521.wrf_r 2.66% 2.66% 0.00%
> > 526.blender_r -1.96% -1.68% 1.40%
> > 527.cam4_r 0.00% 0.91% 1.81%
> > 538.imagick_r 0.39% -0.19% -10.29% // known noise, imagick_r can have big jitter on P9 box sometimes.
> > 544.nab_r 0.25% 0.00% 0.00%
> > 549.fotonik3d_r 0.94% 0.94% 0.00%
> > 554.roms_r 0.00% 0.00% -1.05%
> > Geomean -0.03% 0.22% -0.93%
> >
> >
> > As above, the gains are mainly from loop vectorization.
> > btw, Power8 data can be more representative since some bmks can have jitters on our P9 perf box.
> >
> > BR,
> > Kewen
>
> Here is data on CLX.
> + for performance means better.
> - for codesize means better.
>
> We notice a codesize increase in 549.fotonik3d_r (3.36%) which did not
> exist in our last measurement with GCC 11.0.0 20210317; it is not
> related to the fix of PR100089.
> The others are about the same as in the last measurement.
>
O2 -ftree-vectorize very-cheap
codesize performance
500.perlbench_r 0.34% 0.55%
502.gcc_r 0.29% -0.32%
505.mcf_r 1.36% -1.20%(noise)
520.omnetpp_r -0.65% -0.83%
523.xalancbmk_r 0.04% -0.59%
525.x264_r 1.29% 62.62%
531.deepsjeng_r 0.18% -0.44%
541.leela_r -1.10% -0.12%
548.exchange2_r -1.19% 0.34%
557.xz_r -0.53% -1.01%(cost model)
geomean for intrate 0.00% 4.60%
503.bwaves_r -0.29% -1.19%
507.cactuBSSN_r 0.01% -0.55%
508.namd_r -0.61% 2.38%
510.parest_r -0.41% 0.10%
511.povray_r -1.76% 3.79%
519.lbm_r 0.38% -0.33%
521.wrf_r -0.85% 1.23%
526.blender_r -0.40% -1.21%(noise)
527.cam4_r -0.27% 0.06%
538.imagick_r -0.97% 1.10%
544.nab_r -0.65% 0.09%
549.fotonik3d_r 3.36% 0.30%
554.roms_r -0.28% -0.20%
geomean for fprate -0.22% 0.42%
geomean -0.12% 2.22%
loop vectorizer bb vectorizer
codesize performance codesize performance
500.perlbench_r 0.05% 0.80% 0.29% 0.84%
502.gcc_r 0.02% -0.12% 0.27% -0.23%
505.mcf_r 0.00% -0.69% 1.16% -0.85%
520.omnetpp_r 0.05% -0.97% -0.70% -0.52%
523.xalancbmk_r 0.26% -0.56% -0.04% -0.52%
525.x264_r 1.18% 64.80% 0.13% -0.29%
531.deepsjeng_r 0.16% -0.03% -0.05% -0.50%
541.leela_r -0.11% 0.59% -0.99% -1.12%
548.exchange2_r -0.27% -0.29% -1.02% 0.17%
557.xz_r -0.76% -0.10% -0.10% -1.28%
geomean for intrate 0.06% 4.98% -0.11% -0.43%
503.bwaves_r 0.00% -0.86% -0.25% -0.43%
507.cactuBSSN_r 0.01% -0.35% 0.01% -0.37%
508.namd_r -0.13% -0.09% -0.67% 2.45%
510.parest_r -0.16% 0.62% -0.50% 0.72%
511.povray_r -0.03% 0.41% -1.74% 4.61%
519.lbm_r 0.00% -0.31% 0.38% 0.05%
521.wrf_r -0.03% 1.60% -0.94% 0.00%
526.blender_r 0.00% -1.49% -0.43% -1.64%
527.cam4_r 0.10% -0.06% -0.39% -0.01%
538.imagick_r -0.09% 0.32% -0.90% 2.49%
544.nab_r 0.02% 0.20% -0.69% 0.09%
549.fotonik3d_r 2.42% 0.44% 0.93% -0.08%
554.roms_r 0.25% 0.06% -0.52% 0.00%
geomean for fprate 0.18% 0.04% -0.44% 0.59%
geomean 0.13% 2.16% -0.30% 0.15%
--
BR,
Hongtao
* Re: Why vectorization didn't turn on by -O2
2021-08-14 14:22 ` Jan Hubicka
@ 2021-08-16 8:03 ` Richard Biener
0 siblings, 0 replies; 26+ messages in thread
From: Richard Biener @ 2021-08-16 8:03 UTC (permalink / raw)
To: Jan Hubicka
Cc: Richard Sandiford, Hongtao Liu, Segher Boessenkool, gcc-help,
bin.cheng, 172060045
On Sat, 14 Aug 2021, Jan Hubicka wrote:
> Hi,
> runs finished for cheap vectorization model. I am running very cheap
> too.
>
> https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_changes=on&min_percentage_change=0.02&revisions=e54acea9e5a821448af97c66e94a1e4c4f977d5d%2Ce87209a1269622017bf3d98bf71502dcb0f893aa%2C73474527aaa24d9236aca074c5494a07f40ce058&include_user_branches=on
>
> compares default -O2 (base), loop vectorize, slp vectorize.
> Put "O2" to machine name.
>
> Overall scores are:
>
> kaby.spec2006.O2_generic
> Test loop slp
> SPEC/SPEC2006/FP/total 8.16% 0.19%
> SPEC/SPEC2006/total 4.96% 0.38%
> SPEC/SPEC2006/INT/total 36.595 0.58% 0.65%
>
> kaby.spec2006.O2_generic_lto
> SPEC/SPEC2006/FP/total 9.06% -0.36%
> SPEC/SPEC2006/total 5.32% ~
> SPEC/SPEC2006/INT/total 0.24% 0.27%
>
> kaby.spec2017.O2_generic
> SPEC/SPEC2017/INT/total 6.54% -1.15%
> SPEC/SPEC2017/total 5.66% -0.17%
> SPEC/SPEC2017/FP/total 5.00% 0.59%
>
> kaby.spec2017.O2_generic_lto
> SPEC/SPEC2017/INT/total 6.62% -0.12%
> SPEC/SPEC2017/total 5.69% -0.14%
> SPEC/SPEC2017/FP/total 4.99% -0.16%
>
> zenith.spec2006.O2_generic
> SPEC/SPEC2006/FP/total 10.23% -0.35%
> SPEC/SPEC2006/total 6.01% -0.48%
> SPEC/SPEC2006/INT/total 0.31% -0.66%
>
> zenith.spec2006.O2_generic_lto
> SPEC/SPEC2006/FP/total 12.03% 0.82%
> SPEC/SPEC2006/total 6.90% 0.44%
> SPEC/SPEC2006/INT/total ~ -0.11%
>
> zenith.spec2017.O2_generic
> SPEC/SPEC2017/INT/total 7.46% -0.37%
> SPEC/SPEC2017/total 6.81% 0.48%
> SPEC/SPEC2017/FP/total 6.31% 1.15%
>
> zenith.spec2017.O2_generic_lto
> SPEC/SPEC2017/INT/total 7.81% -0.22%
> SPEC/SPEC2017/total 7.07% 0.44%
> SPEC/SPEC2017/FP/total 6.50% 0.94%
>
> So loop vectorization is a consistent win; SLP is mostly neutral.
>
> Code size growth from loop vectorization is too large for -O2. SLP
> vectorization seems a slight size win overall.
>
> Noteworthy regressions caused by slp are:
> 5-6%: xz (kaby, lto), milc (zenith), astar (zenith)
> 4-5%: xalancbmk (kaby), blender (kaby)
> 3-4%: xz (kaby, nolto), dealII (zenith)
> 2-3%: povray (kaby), astar (kaby), perlbench (kaby), sjeng (zenith),
> xz(zenith)
>
> We get 10% improvement on imagemagick (kaby), 17.5% (zenith),
> imagemagick 6.99% (zenith)
>
> https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&all_changes=on&min_percentage_change=0.02&revisions=e54acea9e5a821448af97c66e94a1e4c4f977d5d%2Ce87209a1269622017bf3d98bf71502dcb0f893aa%2C73474527aaa24d9236aca074c5494a07f40ce058&include_user_branches=on
>
> Is for C++/polyhedron benchmarks. It shows several interesting
> regressions in polyhedron and tsvc for loop vectorization (over 100%)
> and also some for slp. Shall I search bugzilla for these?
Yeah, even more so if they reproduce with -O3.
Richard.
* Re: Why vectorization didn't turn on by -O2
2021-08-16 6:09 ` Hongtao Liu
@ 2021-08-24 2:21 ` Hongtao Liu
0 siblings, 0 replies; 26+ messages in thread
From: Hongtao Liu @ 2021-08-24 2:21 UTC (permalink / raw)
To: Kewen.Lin
Cc: Richard Biener, bin.cheng, Segher Boessenkool, 172060045,
gcc-help, Richard Sandiford, Jan Hubicka
On Mon, Aug 16, 2021 at 2:09 PM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Mon, Aug 16, 2021 at 2:00 PM Hongtao Liu <crazylht@gmail.com> wrote:
> >
> > On Mon, Aug 16, 2021 at 11:23 AM Kewen.Lin via Gcc-help
> > <gcc-help@gcc.gnu.org> wrote:
> > >
> > > on 2021/8/4 下午4:31, Richard Biener wrote:
> > > > On Wed, 4 Aug 2021, Richard Sandiford wrote:
> > > >
> > > >> Hongtao Liu <crazylht@gmail.com> writes:
> > > >>> On Tue, May 18, 2021 at 4:27 AM Richard Sandiford via Gcc-help
> > > >>> <gcc-help@gcc.gnu.org> wrote:
> > > >>>>
> > > >>>> Jan Hubicka <hubicka@ucw.cz> writes:
> > > >>>>> Hi,
> > > >>>>> here are updated scores.
> > > >>>>> https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> > > >>>>> compares
> > > >>>>> base: mainline
> > > >>>>> 1st column: mainline with very cheap vectorization at -O2 and -O3
> > > >>>>> 2nd column: mainline with cheap vectorization at -O2 and -O3.
> > > >>>>>
> > > >>>>> The short story is:
> > > >>>>>
> > > >>>>> 1) -O2 generic performance
> > > >>>>> kabylake (Intel):
> > > >>>>> very cheap
> > > >>>>> SPEC/SPEC2006/FP/total ~ 8.32%
> > > >>>>> SPEC/SPEC2006/total -0.38% 4.74%
> > > >>>>> SPEC/SPEC2006/INT/total -0.91% -0.14%
> > > >>>>>
> > > >>>>> SPEC/SPEC2017/INT/total 4.71% 7.11%
> > > >>>>> SPEC/SPEC2017/total 2.22% 6.52%
> > > >>>>> SPEC/SPEC2017/FP/total 0.34% 6.06%
> > > >>>>> zen
> > > >>>>> SPEC/SPEC2006/FP/total 0.61% 10.23%
> > > >>>>> SPEC/SPEC2006/total 0.26% 6.27%
> > > >>>>> SPEC/SPEC2006/INT/total 34.006 -0.24% 0.90%
> > > >>>>>
> > > >>>>> SPEC/SPEC2017/INT/total 3.937 5.34% 7.80%
> > > >>>>> SPEC/SPEC2017/total 3.02% 6.55%
> > > >>>>> SPEC/SPEC2017/FP/total 1.26% 5.60%
> > > >>>>>
> > > >>>>> 2) -O2 size:
> > > >>>>> -0.78% (very cheap) 6.51% (cheap) for spec2k2006
> > > >>>>> -0.32% (very cheap) 6.75% (cheap) for spec2k2017
> > > >>>>> 3) build times:
> > > >>>>> 0%, 0.16%, 0.71%, 0.93% (very cheap) 6.05% 4.80% 6.75% 7.15% (cheap) for spec2k2006
> > > >>>>> 0.39% 0.57% 0.71% (very cheap) 5.40% 6.23% 8.44% (cheap) for spec2k2017
> > > >>>>> here I simply copied data from different configurations
> > > >>>>>
> > > >>>>> So for SPEC I would say that most of compile time costs are derived
> > > >>>>> from code size growth which is a problem with cheap model but not with
> > > >>>>> very cheap. Very cheap indeed results in code size improvements and
> > > >>>>> compile time impact is probably somewhere around 0.5%
> > > >>>>>
> > > >>>>> So from these scores alone this would seem that vectorization makes
> > > >>>>> sense at -O2 with very cheap model to me (I am sure we have other
> > > >>>>> optimizations with worse benefits to compile time tradeoffs).
> > > >>>>
> > > >>>> Thanks for running these.
> > > >>>>
> > > >>>> The biggest issue I know of for enabling very-cheap at -O2 is:
> > > >>>>
> > > >>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089
> > > >>>>
> > > >>>> Perhaps we could get around that by (hopefully temporarily) disabling
> > > >>>> BB SLP within loop vectorisation for the very-cheap model. This would
> > > >>>> purely be a workaround and we should remove it once the PR is fixed.
> > > >>>> (It would even be a compile-time win in the meantime :-))
Fixed by
commit r12-3103-g819b7c3a339e3bdaf85cd55954c5536bd98aae09
Author: liuhongt <hongtao.liu@intel.com>
Date: Wed Aug 4 16:39:31 2021 +0800
Disable slp in loop vectorizer when cost model is very-cheap.
Performance impact for the commit with option:
-march=x86-64 -O2 -ftree-vectorize -fvect-cost-model=very-cheap
SPEC2017 fprate
503.bwaves_r BuildSame
507.cactuBSSN_r -0.04
508.namd_r 0.14
510.parest_r -0.54
511.povray_r 0.10
519.lbm_r BuildSame
521.wrf_r 0.64
526.blender_r -0.32
527.cam4_r 0.17
538.imagick_r 0.09
544.nab_r BuildSame
549.fotonik3d_r BuildSame
554.roms_r BuildSame
997.specrand_fr -0.09
Geometric mean: 0.02
SPEC2017 intrate
500.perlbench_r 0.26
502.gcc_r 0.21
505.mcf_r -0.09
520.omnetpp_r BuildSame
523.xalancbmk_r BuildSame
525.x264_r -0.41
531.deepsjeng_r BuildSame
541.leela_r 0.13
548.exchange2_r BuildSame
557.xz_r BuildSame
999.specrand_ir BuildSame
Geometric mean: 0.02
EEMBC: no regressions, only improvements or identical builds; below are
the improved benchmarks.
mp2decoddata1 7.59
mp2decoddata2 31.80
mp2decoddata3 12.15
mp2decoddata4 11.16
mp2decoddata5 11.19
mp2decoddata1 7.06
mp2decoddata2 24.12
mp2decoddata3 10.83
mp2decoddata4 10.04
mp2decoddata5 10.07
gcc/ChangeLog:
PR tree-optimization/100089
* tree-vectorizer.c (try_vectorize_loop_1): Disable slp in
loop vectorizer when cost model is very-cheap.
--
BR,
Hongtao
end of thread, other threads:[~2021-08-24 2:16 UTC | newest]
Thread overview: 26+ messages
-- links below jump to the message on this page --
2021-05-08 12:07 Why vectorization didn't turn on by -O2 172060045
2021-05-08 16:54 ` Xi Ruoyao
2021-05-09 13:44 ` Segher Boessenkool
2021-05-10 8:21 ` Richard Biener
2021-05-10 9:11 ` Jan Hubicka
2021-05-10 9:27 ` Segher Boessenkool
2021-05-10 12:10 ` Jan Hubicka
2021-05-10 9:24 ` Segher Boessenkool
2021-05-17 16:03 ` Jan Hubicka
2021-05-17 18:56 ` Richard Sandiford
2021-08-04 8:21 ` Hongtao Liu
2021-08-04 8:22 ` Richard Sandiford
2021-08-04 8:31 ` Richard Biener
2021-08-04 9:10 ` Richard Sandiford
2021-08-04 9:56 ` Segher Boessenkool
2021-08-04 10:22 ` Richard Sandiford
2021-08-04 21:18 ` Segher Boessenkool
2021-08-04 9:12 ` Hongtao Liu
2021-08-11 17:14 ` Jan Hubicka
2021-08-14 14:22 ` Jan Hubicka
2021-08-16 8:03 ` Richard Biener
2021-08-16 3:22 ` Kewen.Lin
2021-08-16 6:00 ` Hongtao Liu
2021-08-16 6:09 ` Hongtao Liu
2021-08-24 2:21 ` Hongtao Liu
2021-08-04 8:36 ` Hongtao Liu