* Why vectorization didn't turn on by -O2
@ 2021-05-08 12:07 172060045
2021-05-08 16:54 ` Xi Ruoyao
0 siblings, 1 reply; 26+ messages in thread
From: 172060045 @ 2021-05-08 12:07 UTC (permalink / raw)
To: gcc-help; +Cc: bin.cheng, ook, rguenther
Hi all,
Recently I noticed that gcc -O2 doesn't turn on loop vectorization,
while clang does turn it on at -O2.
Does GCC consider it a space/speed trade-off, or are there other
considerations?
Thanks in advance.
Yancheng
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-05-08 12:07 Why vectorization didn't turn on by -O2 172060045
@ 2021-05-08 16:54 ` Xi Ruoyao
2021-05-09 13:44 ` Segher Boessenkool
0 siblings, 1 reply; 26+ messages in thread
From: Xi Ruoyao @ 2021-05-08 16:54 UTC (permalink / raw)
To: 172060045, gcc-help; +Cc: rguenther, bin.cheng, ook, Jan Hubicka
On Sat, 2021-05-08 at 20:07 +0800, 172060045@hdu.edu.cn wrote:
>
> Hi all,
>
> Recently I noticed that gcc -O2 didn't turn on vectorization
> optimization,
> which it turns on in clang -O2.
>
> Does GCC think it involves the trade-off of space speed, or other
> considerations?
>
> Thanks in advance.
>
> Yancheng
It's just a decision, I think. The "original" reason may be that
vectorization can make code *slower* in some cases.
There was some discussion about enabling -ftree-loop-vectorize at
-O2 for x86, but that was too late (for GCC 9):
https://gcc.gnu.org/pipermail/gcc/2019-January/228102.html
CC Jan: would this be reconsidered for GCC 12?
--
Xi Ruoyao <xry111@mengyan1223.wang>
School of Aerospace Science and Technology, Xidian University
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-05-08 16:54 ` Xi Ruoyao
@ 2021-05-09 13:44 ` Segher Boessenkool
2021-05-10 8:21 ` Richard Biener
0 siblings, 1 reply; 26+ messages in thread
From: Segher Boessenkool @ 2021-05-09 13:44 UTC (permalink / raw)
To: Xi Ruoyao; +Cc: 172060045, gcc-help, bin.cheng, rguenther, ook, Jan Hubicka
On Sun, May 09, 2021 at 12:54:08AM +0800, Xi Ruoyao via Gcc-help wrote:
> On Sat, 2021-05-08 at 20:07 +0800, 172060045@hdu.edu.cn wrote:
> > Recently I noticed that gcc -O2 didn't turn on vectorization
> > optimization,
> > which it turns on in clang -O2.
> >
> > Does GCC think it involves the trade-off of space speed, or other
> > considerations?
-O2 is for optimisations that (almost) never degrade code quality. -O3
is for those that are only beneficial "on average".
> It's just a decision, I think. The "original" reason may be that
> vectorization can make code *slower* in some cases.
Yup. Vectorisation always causes hugely different code.
> There was some discussion about enabling -ftree-loop-vectorize at
> -O2 for x86, but that was too late (for GCC 9):
AFAIK the current plan is to enable vectorisation at -O2 with a more
conservative cost model. This will be a generic change, for all
architectures, and hopefully will arrive in GCC 12.
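For reference, both pieces already exist as explicit options, so the
effect can be previewed today (foo.c is just a placeholder file name;
flag spellings as documented for GCC 11):

```shell
# Preview the plan: loop vectorization at -O2 with the conservative
# "very-cheap" cost model (the model itself was added in GCC 11).
gcc -O2 -ftree-loop-vectorize -fvect-cost-model=very-cheap -S foo.c

# For comparison, -O3 enables vectorization with the default
# "dynamic" cost model.
gcc -O3 -S foo.c
```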
Segher
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-05-09 13:44 ` Segher Boessenkool
@ 2021-05-10 8:21 ` Richard Biener
2021-05-10 9:11 ` Jan Hubicka
2021-05-10 9:24 ` Segher Boessenkool
0 siblings, 2 replies; 26+ messages in thread
From: Richard Biener @ 2021-05-10 8:21 UTC (permalink / raw)
To: Segher Boessenkool; +Cc: Xi Ruoyao, 172060045, gcc-help, bin.cheng, Jan Hubicka
On Sun, 9 May 2021, Segher Boessenkool wrote:
> On Sun, May 09, 2021 at 12:54:08AM +0800, Xi Ruoyao via Gcc-help wrote:
> > On Sat, 2021-05-08 at 20:07 +0800, 172060045@hdu.edu.cn wrote:
> > > Recently I noticed that gcc -O2 didn't turn on vectorization
> > > optimization,
> > > which it turns on in clang -O2.
> > >
> > > Does GCC think it involves the trade-off of space speed, or other
> > > considerations?
>
> -O2 is for optimisations that (almost) never degrade code quality. -O3
> is for those that are only beneficial "on average".
In particular -O2 is a balance of compile-time, generated code size
and resulting performance. Vectorization with the -O2 default
cost model of "cheap" tends to mostly increase the first and the second
whilst only moderately affecting the last.
It's all of course hugely dependent on the source base you are
working with.
Richard.
> > It's just a decision, I think. The "original" reason may be that
> > vectorization can make code *slower* in some cases.
>
> Yup. Vectorisation always causes hugely different code.
>
> > There was some discussion about enabling -ftree-loop-vectorize at
> > -O2 for x86, but that was too late (for GCC 9):
>
> AFAIK the current plan is to enable vectorisation at -O2 with a more
> conservative cost model. This will be a generic change, for all
> architectures, and hopefully will arrive in GCC 12.
>
>
> Segher
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-05-10 8:21 ` Richard Biener
@ 2021-05-10 9:11 ` Jan Hubicka
2021-05-10 9:27 ` Segher Boessenkool
2021-05-10 9:24 ` Segher Boessenkool
1 sibling, 1 reply; 26+ messages in thread
From: Jan Hubicka @ 2021-05-10 9:11 UTC (permalink / raw)
To: Richard Biener
Cc: Segher Boessenkool, Xi Ruoyao, 172060045, gcc-help, bin.cheng
> On Sun, 9 May 2021, Segher Boessenkool wrote:
>
> > On Sun, May 09, 2021 at 12:54:08AM +0800, Xi Ruoyao via Gcc-help wrote:
> > > On Sat, 2021-05-08 at 20:07 +0800, 172060045@hdu.edu.cn wrote:
> > > > Recently I noticed that gcc -O2 didn't turn on vectorization
> > > > optimization,
> > > > which it turns on in clang -O2.
> > > >
> > > > Does GCC think it involves the trade-off of space speed, or other
> > > > considerations?
> >
> > -O2 is for optimisations that (almost) never degrade code quality. -O3
> > is for those that are only beneficial "on average".
>
> In particular -O2 is a balance of compile-time, generated code size
> and resulting performance. Vectorization with the -O2 default
> cost model of "cheap" tends to mostly increase the first and the second
> whilst only moderately affecting the last.
Last time I ran -O2 with the cheap model enabled, the code size actually
decreased on average for SPEC, which was a bit of a surprise.
I plan to re-do the benchmarks soon, now that gcc11 has branched.
Honza
>
> It's all of course hugely dependent on the source base you are
> working with.
>
> Richard.
>
> > > It's just a decision, I think. The "original" reason may be that
> > > vectorization can make code *slower* in some cases.
> >
> > Yup. Vectorisation always causes hugely different code.
> >
> > > There was some discussion about enabling -ftree-loop-vectorize at
> > > -O2 for x86, but that was too late (for GCC 9):
> >
> > AFAIK the current plan is to enable vectorisation at -O2 with a more
> > conservative cost model. This will be a generic change, for all
> > architectures, and hopefully will arrive in GCC 12.
> >
> >
> > Segher
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-05-10 8:21 ` Richard Biener
2021-05-10 9:11 ` Jan Hubicka
@ 2021-05-10 9:24 ` Segher Boessenkool
2021-05-17 16:03 ` Jan Hubicka
1 sibling, 1 reply; 26+ messages in thread
From: Segher Boessenkool @ 2021-05-10 9:24 UTC (permalink / raw)
To: Richard Biener; +Cc: gcc-help, Jan Hubicka, bin.cheng, 172060045
On Mon, May 10, 2021 at 10:21:04AM +0200, Richard Biener wrote:
> On Sun, 9 May 2021, Segher Boessenkool wrote:
> > On Sun, May 09, 2021 at 12:54:08AM +0800, Xi Ruoyao via Gcc-help wrote:
> > > On Sat, 2021-05-08 at 20:07 +0800, 172060045@hdu.edu.cn wrote:
> > > > Recently I noticed that gcc -O2 didn't turn on vectorization
> > > > optimization,
> > > > which it turns on in clang -O2.
> > > >
> > > > Does GCC think it involves the trade-off of space speed, or other
> > > > considerations?
> >
> > -O2 is for optimisations that (almost) never degrade code quality. -O3
> > is for those that are only beneficial "on average".
>
> In particular -O2 is a balance of compile-time, generated code size
> and resulting performance.
Yeah, sorry, I glossed over those details, but they do matter.
My point is that -O2 optimisations should "almost never" degrade quality
(code speed, code size, compilation speed, and the tradeoffs between all
of those -- nothing is a hard and fast rule here).
> Vectorization with the -O2 default
> cost model of "cheap" tends to mostly increase the first and the second
> whilst only moderately affecting the last.
One promising plan is to use very-cheap instead. It should increase code
size even less (but see what Honza said).
> It's all of course hugely dependent on the source base you are
> working with.
Yeah.
Segher
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-05-10 9:11 ` Jan Hubicka
@ 2021-05-10 9:27 ` Segher Boessenkool
2021-05-10 12:10 ` Jan Hubicka
0 siblings, 1 reply; 26+ messages in thread
From: Segher Boessenkool @ 2021-05-10 9:27 UTC (permalink / raw)
To: Jan Hubicka; +Cc: Richard Biener, gcc-help, bin.cheng, 172060045
On Mon, May 10, 2021 at 11:11:36AM +0200, Jan Hubicka wrote:
> > On Sun, 9 May 2021, Segher Boessenkool wrote:
> > > On Sun, May 09, 2021 at 12:54:08AM +0800, Xi Ruoyao via Gcc-help wrote:
> > > > On Sat, 2021-05-08 at 20:07 +0800, 172060045@hdu.edu.cn wrote:
> > > > > Recently I noticed that gcc -O2 didn't turn on vectorization
> > > > > optimization,
> > > > > which it turns on in clang -O2.
> > > > >
> > > > > Does GCC think it involves the trade-off of space speed, or other
> > > > > considerations?
> > >
> > > -O2 is for optimisations that (almost) never degrade code quality. -O3
> > > is for those that are only beneficial "on average".
> >
> > In particular -O2 is a balance of compile-time, generated code size
> > and resulting performance. Vectorization with the -O2 default
> > cost model of "cheap" tends to mostly increase the first and the second
> > whilst only moderately affecting the last.
>
> Last time I ran -O2 with the cheap model enabled, the code size actually
> decreased on average for SPEC, which was a bit of a surprise.
Huh! Will you investigate how/why this happens? Also, it probably is
quite different on different archs. Interesting times ahead.
> I plan to re-do the benchmarks soon, now that gcc11 has branched.
Great, thanks in advance, gl;hf :-)
Segher
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-05-10 9:27 ` Segher Boessenkool
@ 2021-05-10 12:10 ` Jan Hubicka
0 siblings, 0 replies; 26+ messages in thread
From: Jan Hubicka @ 2021-05-10 12:10 UTC (permalink / raw)
To: Segher Boessenkool; +Cc: Richard Biener, gcc-help, bin.cheng, 172060045
> On Mon, May 10, 2021 at 11:11:36AM +0200, Jan Hubicka wrote:
> > > On Sun, 9 May 2021, Segher Boessenkool wrote:
> > > > On Sun, May 09, 2021 at 12:54:08AM +0800, Xi Ruoyao via Gcc-help wrote:
> > > > > On Sat, 2021-05-08 at 20:07 +0800, 172060045@hdu.edu.cn wrote:
> > > > > > Recently I noticed that gcc -O2 didn't turn on vectorization
> > > > > > optimization,
> > > > > > which it turns on in clang -O2.
> > > > > >
> > > > > > Does GCC think it involves the trade-off of space speed, or other
> > > > > > considerations?
> > > >
> > > > -O2 is for optimisations that (almost) never degrade code quality. -O3
> > > > is for those that are only beneficial "on average".
> > >
> > > In particular -O2 is a balance of compile-time, generated code size
> > > and resulting performance. Vectorization with the -O2 default
> > > cost model of "cheap" tends to mostly increase the first and the second
> > > whilst only moderately affecting the last.
> >
> > Last time I ran -O2 with the cheap model enabled, the code size actually
> > decreased on average for SPEC, which was a bit of a surprise.
>
> Huh! Will you investigate how/why this happens? Also, it probably is
> quite different on different archs. Interesting times ahead.
>
> > I plan to re-do the benchmarks soon, now that gcc11 has branched.
>
> Great, thanks in advance, gl;hf :-)
I was referring to the very-cheap model here. I think the cheap model
may need some love, since it seems to pick up quite some code overhead
and still produces noticeably smaller speedups than dynamic. This,
however, may be very SPEC-centric, since we have been tuning the
vectorizer on those benchmarks for years.
I am re-running the benchmarks now; let's see.
Honza
>
>
> Segher
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-05-10 9:24 ` Segher Boessenkool
@ 2021-05-17 16:03 ` Jan Hubicka
2021-05-17 18:56 ` Richard Sandiford
0 siblings, 1 reply; 26+ messages in thread
From: Jan Hubicka @ 2021-05-17 16:03 UTC (permalink / raw)
To: Segher Boessenkool; +Cc: Richard Biener, gcc-help, bin.cheng, 172060045
Hi,
here are updated scores.
https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
compares
base: mainline
1st column: mainline with very cheap vectorization at -O2 and -O3
2nd column: mainline with cheap vectorization at -O2 and -O3.
The short story is:
1) -O2 generic performance
kabylake (Intel):
very cheap
SPEC/SPEC2006/FP/total ~ 8.32%
SPEC/SPEC2006/total -0.38% 4.74%
SPEC/SPEC2006/INT/total -0.91% -0.14%
SPEC/SPEC2017/INT/total 4.71% 7.11%
SPEC/SPEC2017/total 2.22% 6.52%
SPEC/SPEC2017/FP/total 0.34% 6.06%
zen
SPEC/SPEC2006/FP/total 0.61% 10.23%
SPEC/SPEC2006/total 0.26% 6.27%
SPEC/SPEC2006/INT/total 34.006 -0.24% 0.90%
SPEC/SPEC2017/INT/total 3.937 5.34% 7.80%
SPEC/SPEC2017/total 3.02% 6.55%
SPEC/SPEC2017/FP/total 1.26% 5.60%
2) -O2 size:
-0.78% (very cheap) 6.51% (cheap) for spec2k2006
-0.32% (very cheap) 6.75% (cheap) for spec2k2017
3) build times:
0%, 0.16%, 0.71%, 0.93% (very cheap) 6.05% 4.80% 6.75% 7.15% (cheap) for spec2k2006
0.39% 0.57% 0.71% (very cheap) 5.40% 6.23% 8.44% (cheap) for spec2k2017
here I simply copied data from different configurations
So for SPEC I would say that most of the compile-time cost is derived
from code-size growth, which is a problem with the cheap model but not
with very-cheap. Very-cheap indeed results in code-size improvements,
and the compile-time impact is probably somewhere around 0.5%.
From these scores alone it would seem to me that vectorization makes
sense at -O2 with the very-cheap model (I am sure we have other
optimizations with worse benefit-to-compile-time tradeoffs).
However, there are the usual arguments against:
1) The vectorizer being tuned for SPEC. I think the only way to overcome
that argument is to enable it by default :)
2) The workloads improved are more of the -Ofast type.
Here are non-spec benchmarks we track:
https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&min_percentage_change=0.02&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
I also tried to run Firefox some time ago. The results are not
surprising - vectorization helps the rendering benchmarks, which are the
ones compiled with aggressive flags anyway.
Honza
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-05-17 16:03 ` Jan Hubicka
@ 2021-05-17 18:56 ` Richard Sandiford
2021-08-04 8:21 ` Hongtao Liu
0 siblings, 1 reply; 26+ messages in thread
From: Richard Sandiford @ 2021-05-17 18:56 UTC (permalink / raw)
To: Jan Hubicka
Cc: Segher Boessenkool, gcc-help, bin.cheng, Richard Biener, 172060045
Jan Hubicka <hubicka@ucw.cz> writes:
> Hi,
> here are updated scores.
> https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> compares
> base: mainline
> 1st column: mainline with very cheap vectorization at -O2 and -O3
> 2nd column: mainline with cheap vectorization at -O2 and -O3.
>
> The short story is:
>
> 1) -O2 generic performance
> kabylake (Intel):
> very cheap
> SPEC/SPEC2006/FP/total ~ 8.32%
> SPEC/SPEC2006/total -0.38% 4.74%
> SPEC/SPEC2006/INT/total -0.91% -0.14%
>
> SPEC/SPEC2017/INT/total 4.71% 7.11%
> SPEC/SPEC2017/total 2.22% 6.52%
> SPEC/SPEC2017/FP/total 0.34% 6.06%
> zen
> SPEC/SPEC2006/FP/total 0.61% 10.23%
> SPEC/SPEC2006/total 0.26% 6.27%
> SPEC/SPEC2006/INT/total 34.006 -0.24% 0.90%
>
> SPEC/SPEC2017/INT/total 3.937 5.34% 7.80%
> SPEC/SPEC2017/total 3.02% 6.55%
> SPEC/SPEC2017/FP/total 1.26% 5.60%
>
> 2) -O2 size:
> -0.78% (very cheap) 6.51% (cheap) for spec2k2006
> -0.32% (very cheap) 6.75% (cheap) for spec2k2017
> 3) build times:
> 0%, 0.16%, 0.71%, 0.93% (very cheap) 6.05% 4.80% 6.75% 7.15% (cheap) for spec2k2006
> 0.39% 0.57% 0.71% (very cheap) 5.40% 6.23% 8.44% (cheap) for spec2k2017
> here I simply copied data from different configurations
>
> So for SPEC I would say that most of the compile-time cost is derived
> from code-size growth, which is a problem with the cheap model but not
> with very-cheap. Very-cheap indeed results in code-size improvements,
> and the compile-time impact is probably somewhere around 0.5%.
>
> From these scores alone it would seem to me that vectorization makes
> sense at -O2 with the very-cheap model (I am sure we have other
> optimizations with worse benefit-to-compile-time tradeoffs).
Thanks for running these.
The biggest issue I know of for enabling very-cheap at -O2 is:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089
Perhaps we could get around that by (hopefully temporarily) disabling
BB SLP within loop vectorisation for the very-cheap model. This would
purely be a workaround and we should remove it once the PR is fixed.
(It would even be a compile-time win in the meantime :-))
Thanks,
Richard
> However, there are the usual arguments against:
>
> 1) The vectorizer being tuned for SPEC. I think the only way to overcome
> that argument is to enable it by default :)
> 2) The workloads improved are more of the -Ofast type.
>
> Here are non-spec benchmarks we track:
> https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&min_percentage_change=0.02&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
>
> I also tried to run Firefox some time ago. The results are not
> surprising - vectorization helps the rendering benchmarks, which are
> the ones compiled with aggressive flags anyway.
>
> Honza
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-05-17 18:56 ` Richard Sandiford
@ 2021-08-04 8:21 ` Hongtao Liu
2021-08-04 8:22 ` Richard Sandiford
0 siblings, 1 reply; 26+ messages in thread
From: Hongtao Liu @ 2021-08-04 8:21 UTC (permalink / raw)
To: Richard Sandiford, Jan Hubicka, Segher Boessenkool, gcc-help,
bin.cheng, Richard Biener, 172060045
On Tue, May 18, 2021 at 4:27 AM Richard Sandiford via Gcc-help
<gcc-help@gcc.gnu.org> wrote:
> [quoted text snipped]
Hi:
I would like to ask: can we turn on -O2 vectorization now?
--
BR,
Hongtao
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-08-04 8:21 ` Hongtao Liu
@ 2021-08-04 8:22 ` Richard Sandiford
2021-08-04 8:31 ` Richard Biener
2021-08-04 8:36 ` Hongtao Liu
0 siblings, 2 replies; 26+ messages in thread
From: Richard Sandiford @ 2021-08-04 8:22 UTC (permalink / raw)
To: Hongtao Liu
Cc: Jan Hubicka, Segher Boessenkool, gcc-help, bin.cheng,
Richard Biener, 172060045
Hongtao Liu <crazylht@gmail.com> writes:
> On Tue, May 18, 2021 at 4:27 AM Richard Sandiford via Gcc-help
> <gcc-help@gcc.gnu.org> wrote:
>>
>> [quoted text snipped]
>
> Hi:
> I would like to ask if we can turn on O2 vectorization now?
I think we still need to deal with the PR100089 issue that I mentioned above.
Like I say, “dealing with” it could be as simple as disabling:
  /* If we applied if-conversion then try to vectorize the
     BB of innermost loops.
     ??? Ideally BB vectorization would learn to vectorize
     control flow by applying if-conversion on-the-fly, the
     following retains the if-converted loop body even when
     only non-if-converted parts took part in BB vectorization.  */
  if (flag_tree_slp_vectorize != 0
      && loop_vectorized_call
      && ! loop->inner)
for the very-cheap vector cost model until the PR is fixed properly.
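As a sketch (untested, just to make the suggestion concrete), the guard
could additionally check the cost model; `flag_vect_cost_model` and
`VECT_COST_MODEL_VERY_CHEAP` are the internal names assumed from the GCC
sources of that time:

```c
  /* Hypothetical workaround for PR100089: don't retain the if-converted
     loop body for BB vectorization under the very-cheap cost model.  */
  if (flag_tree_slp_vectorize != 0
      && flag_vect_cost_model != VECT_COST_MODEL_VERY_CHEAP
      && loop_vectorized_call
      && ! loop->inner)
```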
Thanks,
Richard
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-08-04 8:22 ` Richard Sandiford
@ 2021-08-04 8:31 ` Richard Biener
2021-08-04 9:10 ` Richard Sandiford
` (3 more replies)
2021-08-04 8:36 ` Hongtao Liu
1 sibling, 4 replies; 26+ messages in thread
From: Richard Biener @ 2021-08-04 8:31 UTC (permalink / raw)
To: Richard Sandiford
Cc: Hongtao Liu, Jan Hubicka, Segher Boessenkool, gcc-help,
bin.cheng, 172060045
On Wed, 4 Aug 2021, Richard Sandiford wrote:
> Hongtao Liu <crazylht@gmail.com> writes:
> > [quoted text snipped]
> > Hi:
> > I would like to ask if we can turn on O2 vectorization now?
>
> I think we still need to deal with the PR100089 issue that I mentioned above.
> Like I say, “dealing with” it could be as simple as disabling:
>
>   /* If we applied if-conversion then try to vectorize the
>      BB of innermost loops.
>      ??? Ideally BB vectorization would learn to vectorize
>      control flow by applying if-conversion on-the-fly, the
>      following retains the if-converted loop body even when
>      only non-if-converted parts took part in BB vectorization.  */
>   if (flag_tree_slp_vectorize != 0
>       && loop_vectorized_call
>       && ! loop->inner)
>
> for the very-cheap vector cost model until the PR is fixed properly.
Alternatively only enable loop vectorization at -O2 (the above checks
flag_tree_slp_vectorize as well). At least the cost model kind
does not have any influence on BB vectorization, that is, we get the
same pros and cons as we do for -O3.
Did anyone benchmark -O2 -ftree-{loop,slp}-vectorize separately yet?
Richard.
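As an illustration of the -O2 loop-vectorization question being discussed, the very-cheap cost model is aimed at simple counted loops of roughly the following shape (a hand-written sketch, not taken from the thread):

```c
#include <assert.h>

/* Sketch (not from the thread): a simple counted loop over restrict-
   qualified pointers.  This is the kind of loop the very-cheap loop
   cost model is meant to vectorize at -O2 without emitting runtime
   versioning checks or large peeled prologues/epilogues. */
void scale(float *restrict dst, const float *restrict src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[i] * 2.0f;
}
```

Compiling this with `-O2 -ftree-loop-vectorize -fvect-cost-model=very-cheap` (real GCC flags) is one way to observe the behavior under discussion.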
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-08-04 8:22 ` Richard Sandiford
2021-08-04 8:31 ` Richard Biener
@ 2021-08-04 8:36 ` Hongtao Liu
1 sibling, 0 replies; 26+ messages in thread
From: Hongtao Liu @ 2021-08-04 8:36 UTC (permalink / raw)
To: Hongtao Liu, Jan Hubicka, Segher Boessenkool, gcc-help,
bin.cheng, Richard Biener, 172060045, Richard Sandiford
On Wed, Aug 4, 2021 at 4:22 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Hongtao Liu <crazylht@gmail.com> writes:
> > On Tue, May 18, 2021 at 4:27 AM Richard Sandiford via Gcc-help
> > <gcc-help@gcc.gnu.org> wrote:
> >>
> >> Jan Hubicka <hubicka@ucw.cz> writes:
> >> > Hi,
> >> > here are updated scores.
> >> > https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> >> > compares
> >> > base: mainline
> >> > 1st column: mainline with very cheap vectorization at -O2 and -O3
> >> > 2nd column: mainline with cheap vectorization at -O2 and -O3.
> >> >
> >> > The short story is:
> >> >
> >> > 1) -O2 generic performance
> >> > kabylake (Intel):
> >> > very cheap
> >> > SPEC/SPEC2006/FP/total ~ 8.32%
> >> > SPEC/SPEC2006/total -0.38% 4.74%
> >> > SPEC/SPEC2006/INT/total -0.91% -0.14%
> >> >
> >> > SPEC/SPEC2017/INT/total 4.71% 7.11%
> >> > SPEC/SPEC2017/total 2.22% 6.52%
> >> > SPEC/SPEC2017/FP/total 0.34% 6.06%
> >> > zen
> >> > SPEC/SPEC2006/FP/total 0.61% 10.23%
> >> > SPEC/SPEC2006/total 0.26% 6.27%
> >> > SPEC/SPEC2006/INT/total 34.006 -0.24% 0.90%
> >> >
> >> > SPEC/SPEC2017/INT/total 3.937 5.34% 7.80%
> >> > SPEC/SPEC2017/total 3.02% 6.55%
> >> > SPEC/SPEC2017/FP/total 1.26% 5.60%
> >> >
> >> > 2) -O2 size:
> >> > -0.78% (very cheap) 6.51% (cheap) for spec2k2006
> >> > -0.32% (very cheap) 6.75% (cheap) for spec2k2017
> >> > 3) build times:
> >> > 0%, 0.16%, 0.71%, 0.93% (very cheap) 6.05% 4.80% 6.75% 7.15% (cheap) for spec2k2006
> >> > 0.39% 0.57% 0.71% (very cheap) 5.40% 6.23% 8.44% (cheap) for spec2k2017
> >> > here I simply copied data from different configurations
> >> >
> >> > So for SPEC I would say that most of compile time costs are derived
> >> > from code size growth which is a problem with cheap model but not with
> >> > very cheap. Very cheap indeed results in code size improvements and
> >> > compile time impact is probably somewhere around 0.5%
> >> >
> >> > So from these scores alone this would seem that vectorization makes
> >> > sense at -O2 with very cheap model to me (I am sure we have other
> >> > optimizations with worse benefits to compile time tradeoffs).
> >>
> >> Thanks for running these.
> >>
> >> The biggest issue I know of for enabling very-cheap at -O2 is:
> >>
> >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089
> >>
> >> Perhaps we could get around that by (hopefully temporarily) disabling
> >> BB SLP within loop vectorisation for the very-cheap model. This would
> >> purely be a workaround and we should remove it once the PR is fixed.
> >> (It would even be a compile-time win in the meantime :-))
> >>
> >> Thanks,
> >> Richard
> >>
> >> > However there are usual arguments against:
> >> >
> >> > 1) Vectorizer being tuned for SPEC. I think the only way to overcome
> >> > that argument is to enable it by default :)
> >> > 2) Workloads improved are more of -Ofast type workloads
> >> >
> >> > Here are non-spec benchmarks we track:
> >> > https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&min_percentage_change=0.02&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> >> >
> >> > I also tried to run Firefox some time ago. Results are not surprising -
> >> > vectorization helps rendering benchmarks, which are those compiled with
> >> > aggressive flags anyway.
> >> >
> >> > Honza
> >
> > Hi:
> > I would like to ask if we can turn on O2 vectorization now?
>
> I think we still need to deal with the PR100089 issue that I mentioned above.
> Like I say, “dealing with” it could be as simple as disabling:
>
> /* If we applied if-conversion then try to vectorize the
> BB of innermost loops.
> ??? Ideally BB vectorization would learn to vectorize
> control flow by applying if-conversion on-the-fly, the
> following retains the if-converted loop body even when
> only non-if-converted parts took part in BB vectorization. */
> if (flag_tree_slp_vectorize != 0
> && loop_vectorized_call
> && ! loop->inner)
>
> for the very-cheap vector cost model until the PR is fixed properly.
>
I can test its impact on x86 platforms.
> Thanks,
> Richard
--
BR,
Hongtao
* Re: Why vectorization didn't turn on by -O2
2021-08-04 8:31 ` Richard Biener
@ 2021-08-04 9:10 ` Richard Sandiford
2021-08-04 9:56 ` Segher Boessenkool
2021-08-04 9:12 ` Hongtao Liu
` (2 subsequent siblings)
3 siblings, 1 reply; 26+ messages in thread
From: Richard Sandiford @ 2021-08-04 9:10 UTC (permalink / raw)
To: Richard Biener
Cc: Hongtao Liu, Jan Hubicka, Segher Boessenkool, gcc-help,
bin.cheng, 172060045
Richard Biener <rguenther@suse.de> writes:
> On Wed, 4 Aug 2021, Richard Sandiford wrote:
>
>> Hongtao Liu <crazylht@gmail.com> writes:
>> > On Tue, May 18, 2021 at 4:27 AM Richard Sandiford via Gcc-help
>> > <gcc-help@gcc.gnu.org> wrote:
>> >>
>> >> Jan Hubicka <hubicka@ucw.cz> writes:
>> >> > Hi,
>> >> > here are updated scores.
>> >> > https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
>> >> > compares
>> >> > base: mainline
>> >> > 1st column: mainline with very cheap vectorization at -O2 and -O3
>> >> > 2nd column: mainline with cheap vectorization at -O2 and -O3.
>> >> >
>> >> > The short story is:
>> >> >
>> >> > 1) -O2 generic performance
>> >> > kabylake (Intel):
>> >> > very cheap
>> >> > SPEC/SPEC2006/FP/total ~ 8.32%
>> >> > SPEC/SPEC2006/total -0.38% 4.74%
>> >> > SPEC/SPEC2006/INT/total -0.91% -0.14%
>> >> >
>> >> > SPEC/SPEC2017/INT/total 4.71% 7.11%
>> >> > SPEC/SPEC2017/total 2.22% 6.52%
>> >> > SPEC/SPEC2017/FP/total 0.34% 6.06%
>> >> > zen
>> >> > SPEC/SPEC2006/FP/total 0.61% 10.23%
>> >> > SPEC/SPEC2006/total 0.26% 6.27%
>> >> > SPEC/SPEC2006/INT/total 34.006 -0.24% 0.90%
>> >> >
>> >> > SPEC/SPEC2017/INT/total 3.937 5.34% 7.80%
>> >> > SPEC/SPEC2017/total 3.02% 6.55%
>> >> > SPEC/SPEC2017/FP/total 1.26% 5.60%
>> >> >
>> >> > 2) -O2 size:
>> >> > -0.78% (very cheap) 6.51% (cheap) for spec2k2006
>> >> > -0.32% (very cheap) 6.75% (cheap) for spec2k2017
>> >> > 3) build times:
>> >> > 0%, 0.16%, 0.71%, 0.93% (very cheap) 6.05% 4.80% 6.75% 7.15% (cheap) for spec2k2006
>> >> > 0.39% 0.57% 0.71% (very cheap) 5.40% 6.23% 8.44% (cheap) for spec2k2017
>> >> > here I simply copied data from different configurations
>> >> >
>> >> > So for SPEC I would say that most of compile time costs are derived
>> >> > from code size growth which is a problem with cheap model but not with
>> >> > very cheap. Very cheap indeed results in code size improvements and
>> >> > compile time impact is probably somewhere around 0.5%
>> >> >
>> >> > So from these scores alone this would seem that vectorization makes
>> >> > sense at -O2 with very cheap model to me (I am sure we have other
>> >> > optimizations with worse benefits to compile time tradeoffs).
>> >>
>> >> Thanks for running these.
>> >>
>> >> The biggest issue I know of for enabling very-cheap at -O2 is:
>> >>
>> >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089
>> >>
>> >> Perhaps we could get around that by (hopefully temporarily) disabling
>> >> BB SLP within loop vectorisation for the very-cheap model. This would
>> >> purely be a workaround and we should remove it once the PR is fixed.
>> >> (It would even be a compile-time win in the meantime :-))
>> >>
>> >> Thanks,
>> >> Richard
>> >>
>> >> > However there are usual arguments against:
>> >> >
>> >> > 1) Vectorizer being tuned for SPEC. I think the only way to overcome
>> >> > that argument is to enable it by default :)
>> >> > 2) Workloads improved are more of -Ofast type workloads
>> >> >
>> >> > Here are non-spec benchmarks we track:
>> >> > https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&min_percentage_change=0.02&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
>> >> >
>> >> > I also tried to run Firefox some time ago. Results are not surprising -
>> >> > vectorization helps rendering benchmarks, which are those compiled with
>> >> > aggressive flags anyway.
>> >> >
>> >> > Honza
>> >
>> > Hi:
>> > I would like to ask if we can turn on O2 vectorization now?
>>
>> I think we still need to deal with the PR100089 issue that I mentioned above.
>> Like I say, “dealing with” it could be as simple as disabling:
>>
>> /* If we applied if-conversion then try to vectorize the
>> BB of innermost loops.
>> ??? Ideally BB vectorization would learn to vectorize
>> control flow by applying if-conversion on-the-fly, the
>> following retains the if-converted loop body even when
>> only non-if-converted parts took part in BB vectorization. */
>> if (flag_tree_slp_vectorize != 0
>> && loop_vectorized_call
>> && ! loop->inner)
>>
>> for the very-cheap vector cost model until the PR is fixed properly.
>
> Alternatively only enable loop vectorization at -O2 (the above checks
> flag_tree_slp_vectorize as well). At least the cost model kind
> does not have any influence on BB vectorization, that is, we get the
> same pros and cons as we do for -O3.
Yeah, but a lot of the loop vector cost model choice is about controlling
code size growth and avoiding excessive runtime versioning tests. BB SLP
should be a win on both code size and performance (barring significant
target costing issues).
PR100089 was an exception because we ended up keeping unvectorised
scalar code that would never have existed otherwise. BB SLP proper
shouldn't have that problem.
Thanks,
Richard
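Basic-block (straight-line) SLP, the case Richard argues should be a win on both size and speed, applies to code like the following (an illustrative sketch, not from the thread):

```c
#include <assert.h>

/* Sketch (not from the thread): four independent element-wise adds in
   a single basic block, with no loop involved.  BB SLP can combine
   these into one vector load/add/store sequence, which is why it tends
   to shrink code while also speeding it up. */
void add4(int *restrict a, const int *restrict b)
{
  a[0] += b[0];
  a[1] += b[1];
  a[2] += b[2];
  a[3] += b[3];
}
```

With `-O2 -ftree-slp-vectorize` (a real GCC flag) this kind of block is a candidate for vectorization independent of the loop cost model.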
* Re: Why vectorization didn't turn on by -O2
2021-08-04 8:31 ` Richard Biener
2021-08-04 9:10 ` Richard Sandiford
@ 2021-08-04 9:12 ` Hongtao Liu
2021-08-11 17:14 ` Jan Hubicka
2021-08-16 3:22 ` Kewen.Lin
3 siblings, 0 replies; 26+ messages in thread
From: Hongtao Liu @ 2021-08-04 9:12 UTC (permalink / raw)
To: Richard Biener
Cc: Richard Sandiford, Jan Hubicka, Segher Boessenkool, gcc-help,
bin.cheng, 172060045
On Wed, Aug 4, 2021 at 4:31 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Wed, 4 Aug 2021, Richard Sandiford wrote:
>
> > Hongtao Liu <crazylht@gmail.com> writes:
> > > On Tue, May 18, 2021 at 4:27 AM Richard Sandiford via Gcc-help
> > > <gcc-help@gcc.gnu.org> wrote:
> > >>
> > >> Jan Hubicka <hubicka@ucw.cz> writes:
> > >> > Hi,
> > >> > here are updated scores.
> > >> > https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> > >> > compares
> > >> > base: mainline
> > >> > 1st column: mainline with very cheap vectorization at -O2 and -O3
> > >> > 2nd column: mainline with cheap vectorization at -O2 and -O3.
> > >> >
> > >> > The short story is:
> > >> >
> > >> > 1) -O2 generic performance
> > >> > kabylake (Intel):
> > >> > very cheap
> > >> > SPEC/SPEC2006/FP/total ~ 8.32%
> > >> > SPEC/SPEC2006/total -0.38% 4.74%
> > >> > SPEC/SPEC2006/INT/total -0.91% -0.14%
> > >> >
> > >> > SPEC/SPEC2017/INT/total 4.71% 7.11%
> > >> > SPEC/SPEC2017/total 2.22% 6.52%
> > >> > SPEC/SPEC2017/FP/total 0.34% 6.06%
> > >> > zen
> > >> > SPEC/SPEC2006/FP/total 0.61% 10.23%
> > >> > SPEC/SPEC2006/total 0.26% 6.27%
> > >> > SPEC/SPEC2006/INT/total 34.006 -0.24% 0.90%
> > >> >
> > >> > SPEC/SPEC2017/INT/total 3.937 5.34% 7.80%
> > >> > SPEC/SPEC2017/total 3.02% 6.55%
> > >> > SPEC/SPEC2017/FP/total 1.26% 5.60%
> > >> >
> > >> > 2) -O2 size:
> > >> > -0.78% (very cheap) 6.51% (cheap) for spec2k2006
> > >> > -0.32% (very cheap) 6.75% (cheap) for spec2k2017
> > >> > 3) build times:
> > >> > 0%, 0.16%, 0.71%, 0.93% (very cheap) 6.05% 4.80% 6.75% 7.15% (cheap) for spec2k2006
> > >> > 0.39% 0.57% 0.71% (very cheap) 5.40% 6.23% 8.44% (cheap) for spec2k2017
> > >> > here I simply copied data from different configurations
> > >> >
> > >> > So for SPEC I would say that most of compile time costs are derived
> > >> > from code size growth which is a problem with cheap model but not with
> > >> > very cheap. Very cheap indeed results in code size improvements and
> > >> > compile time impact is probably somewhere around 0.5%
> > >> >
> > >> > So from these scores alone this would seem that vectorization makes
> > >> > sense at -O2 with very cheap model to me (I am sure we have other
> > >> > optimizations with worse benefits to compile time tradeoffs).
> > >>
> > >> Thanks for running these.
> > >>
> > >> The biggest issue I know of for enabling very-cheap at -O2 is:
> > >>
> > >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089
> > >>
> > >> Perhaps we could get around that by (hopefully temporarily) disabling
> > >> BB SLP within loop vectorisation for the very-cheap model. This would
> > >> purely be a workaround and we should remove it once the PR is fixed.
> > >> (It would even be a compile-time win in the meantime :-))
> > >>
> > >> Thanks,
> > >> Richard
> > >>
> > >> > However there are usual arguments against:
> > >> >
> > >> > 1) Vectorizer being tuned for SPEC. I think the only way to overcome
> > >> > that argument is to enable it by default :)
> > >> > 2) Workloads improved are more of -Ofast type workloads
> > >> >
> > >> > Here are non-spec benchmarks we track:
> > >> > https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&min_percentage_change=0.02&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> > >> >
> > >> > I also tried to run Firefox some time ago. Results are not surprising -
> > >> > vectorization helps rendering benchmarks, which are those compiled with
> > >> > aggressive flags anyway.
> > >> >
> > >> > Honza
> > >
> > > Hi:
> > > I would like to ask if we can turn on O2 vectorization now?
> >
> > I think we still need to deal with the PR100089 issue that I mentioned above.
> > Like I say, “dealing with” it could be as simple as disabling:
> >
> > /* If we applied if-conversion then try to vectorize the
> > BB of innermost loops.
> > ??? Ideally BB vectorization would learn to vectorize
> > control flow by applying if-conversion on-the-fly, the
> > following retains the if-converted loop body even when
> > only non-if-converted parts took part in BB vectorization. */
> > if (flag_tree_slp_vectorize != 0
> > && loop_vectorized_call
> > && ! loop->inner)
> >
> > for the very-cheap vector cost model until the PR is fixed properly.
>
> Alternatively only enable loop vectorization at -O2 (the above checks
> flag_tree_slp_vectorize as well). At least the cost model kind
> does not have any influence on BB vectorization, that is, we get the
> same pros and cons as we do for -O3.
>
> Did anyone benchmark -O2 -ftree-{loop,slp}-vectorize separately yet?
I can collect 4 sets of data, including both code size and performance, on SPEC2017:
1. baseline: -O2
2. baseline + both slp and loop vectorizer: -O2 -ftree-vectorize
-fvect-cost-model=very-cheap
3. baseline + only loop vectorizer: -O2 -ftree-loop-vectorize
-fvect-cost-model=very-cheap
4. baseline + only bb vectorizer: -O2 -ftree-slp-vectorize
>
> Richard.
--
BR,
Hongtao
* Re: Why vectorization didn't turn on by -O2
2021-08-04 9:10 ` Richard Sandiford
@ 2021-08-04 9:56 ` Segher Boessenkool
2021-08-04 10:22 ` Richard Sandiford
0 siblings, 1 reply; 26+ messages in thread
From: Segher Boessenkool @ 2021-08-04 9:56 UTC (permalink / raw)
To: Richard Biener, Hongtao Liu, Jan Hubicka, gcc-help, bin.cheng,
172060045, richard.sandiford
On Wed, Aug 04, 2021 at 10:10:36AM +0100, Richard Sandiford wrote:
> Richard Biener <rguenther@suse.de> writes:
> > Alternatively only enable loop vectorization at -O2 (the above checks
> > flag_tree_slp_vectorize as well). At least the cost model kind
> > does not have any influence on BB vectorization, that is, we get the
> > same pros and cons as we do for -O3.
>
> Yeah, but a lot of the loop vector cost model choice is about controlling
> code size growth and avoiding excessive runtime versioning tests.
Both of those depend a lot on the target, and target-specific conditions
as well (which CPU model is selected for example). Can we factor that
in somehow? Maybe we need some target hook that returns the expected
percentage code growth for vectorising a given loop, for example, and
-O2 vs. -O3 then selects what percentage is acceptable.
> BB SLP
> should be a win on both code size and performance (barring significant
> target costing issues).
Yeah -- but this could use a similar hook as well (just a straightline
piece of code instead of a loop).
> PR100089 was an exception because we ended up keeping unvectorised
> scalar code that would never have existed otherwise. BB SLP proper
> shouldn't have that problem.
It also is a tiny piece of code. There will always be tiny examples
that are much worse (or much better) than average.
Segher
* Re: Why vectorization didn't turn on by -O2
2021-08-04 9:56 ` Segher Boessenkool
@ 2021-08-04 10:22 ` Richard Sandiford
2021-08-04 21:18 ` Segher Boessenkool
0 siblings, 1 reply; 26+ messages in thread
From: Richard Sandiford @ 2021-08-04 10:22 UTC (permalink / raw)
To: Segher Boessenkool
Cc: Richard Biener, Hongtao Liu, Jan Hubicka, gcc-help, bin.cheng, 172060045
Segher Boessenkool <segher@kernel.crashing.org> writes:
> On Wed, Aug 04, 2021 at 10:10:36AM +0100, Richard Sandiford wrote:
>> Richard Biener <rguenther@suse.de> writes:
>> > Alternatively only enable loop vectorization at -O2 (the above checks
>> > flag_tree_slp_vectorize as well). At least the cost model kind
>> > does not have any influence on BB vectorization, that is, we get the
>> > same pros and cons as we do for -O3.
>>
>> Yeah, but a lot of the loop vector cost model choice is about controlling
>> code size growth and avoiding excessive runtime versioning tests.
>
> Both of those depend a lot on the target, and target-specific conditions
> as well (which CPU model is selected for example). Can we factor that
> in somehow? Maybe we need some target hook that returns the expected
> percentage code growth for vectorising a given loop, for example, and
> -O2 vs. -O3 then selects what percentage is acceptable.
>
>> BB SLP
>> should be a win on both code size and performance (barring significant
>> target costing issues).
>
> Yeah -- but this could use a similar hook as well (just a straightline
> piece of code instead of a loop).
I think anything like that should be driven by motivating use cases.
It's not something that we can easily decide in the abstract.
The results so far with using very-cheap at -O2 have been promising,
so I don't think new hooks should block that becoming the default.
>> PR100089 was an exception because we ended up keeping unvectorised
>> scalar code that would never have existed otherwise. BB SLP proper
>> shouldn't have that problem.
>
> It also is a tiny piece of code. There will always be tiny examples
> that are much worse (or much better) than average.
Yeah, what makes PR100089 important isn't IMO the test itself, but the
underlying problem that the PR exposed. Enabling this “BB SLP in loop
vectorisation” code can lead to the generation of scalar COND_EXPRs even
though we know that ifcvt doesn't have a proper cost model for deciding
whether scalar COND_EXPRs are a win.
Introducing scalar COND_EXPRs at -O3 is arguably an acceptable risk
(although still dubious), but I think it's something we need to avoid
for -O2, even if that means losing the optimisation.
Thanks,
Richard
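The scalar COND_EXPRs under discussion come from if-conversion of loop bodies containing branches, roughly like the following (an illustrative C sketch, not the test case from PR100089):

```c
#include <assert.h>

/* Sketch (not from PR100089): a loop whose body branches on the data.
   If-conversion rewrites the branch as a conditional expression (a
   COND_EXPR in GIMPLE) so the loop vectorizer can handle it; the
   concern above is about that scalar if-converted form surviving even
   when vectorization does not end up happening. */
int sum_positive(const int *a, int n)
{
  int s = 0;
  for (int i = 0; i < n; i++)
    s += (a[i] > 0) ? a[i] : 0;
  return s;
}
```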
* Re: Why vectorization didn't turn on by -O2
2021-08-04 10:22 ` Richard Sandiford
@ 2021-08-04 21:18 ` Segher Boessenkool
0 siblings, 0 replies; 26+ messages in thread
From: Segher Boessenkool @ 2021-08-04 21:18 UTC (permalink / raw)
To: Richard Biener, Hongtao Liu, Jan Hubicka, gcc-help, bin.cheng,
172060045, richard.sandiford
On Wed, Aug 04, 2021 at 11:22:53AM +0100, Richard Sandiford wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> > On Wed, Aug 04, 2021 at 10:10:36AM +0100, Richard Sandiford wrote:
> >> Richard Biener <rguenther@suse.de> writes:
> >> > Alternatively only enable loop vectorization at -O2 (the above checks
> >> > flag_tree_slp_vectorize as well). At least the cost model kind
> >> > does not have any influence on BB vectorization, that is, we get the
> >> > same pros and cons as we do for -O3.
> >>
> >> Yeah, but a lot of the loop vector cost model choice is about controlling
> >> code size growth and avoiding excessive runtime versioning tests.
> >
> > Both of those depend a lot on the target, and target-specific conditions
> > as well (which CPU model is selected for example). Can we factor that
> > in somehow? Maybe we need some target hook that returns the expected
> > percentage code growth for vectorising a given loop, for example, and
> > -O2 vs. -O3 then selects what percentage is acceptable.
> >
> >> BB SLP
> >> should be a win on both code size and performance (barring significant
> >> target costing issues).
> >
> > Yeah -- but this could use a similar hook as well (just a straightline
> > piece of code instead of a loop).
>
> I think anything like that should be driven by motivating use cases.
> It's not something that we can easily decide in the abstract.
>
> The results so far with using very-cheap at -O2 have been promising,
> so I don't think new hooks should block that becoming the default.
Right, but it wouldn't hurt to think a sec if we are on the right path
forward. It is crystal clear that to make good decisions about what
and how to vectorise you need to take *some* target characteristics into
account, and that will have to happen sooner rather than later.
This was all in reply to
> >> Yeah, but a lot of the loop vector cost model choice is about controlling
> >> code size growth and avoiding excessive runtime versioning tests.
It was not meant to hold up these patches :-)
> >> PR100089 was an exception because we ended up keeping unvectorised
> >> scalar code that would never have existed otherwise. BB SLP proper
> >> shouldn't have that problem.
> >
> > It also is a tiny piece of code. There will always be tiny examples
> > that are much worse (or much better) than average.
>
> Yeah, what makes PR100089 important isn't IMO the test itself, but the
> underlying problem that the PR exposed. Enabling this “BB SLP in loop
> vectorisation” code can lead to the generation of scalar COND_EXPRs even
> though we know that ifcvt doesn't have a proper cost model for deciding
> whether scalar COND_EXPRs are a win.
>
> Introducing scalar COND_EXPRs at -O3 is arguably an acceptable risk
> (although still dubious), but I think it's something we need to avoid
> for -O2, even if that means losing the optimisation.
Yeah -- -O2 should almost always do the right thing, while -O3 can do
bad things more often; it just has to be better "on average".
Segher
* Re: Why vectorization didn't turn on by -O2
2021-08-04 8:31 ` Richard Biener
2021-08-04 9:10 ` Richard Sandiford
2021-08-04 9:12 ` Hongtao Liu
@ 2021-08-11 17:14 ` Jan Hubicka
2021-08-14 14:22 ` Jan Hubicka
2021-08-16 3:22 ` Kewen.Lin
3 siblings, 1 reply; 26+ messages in thread
From: Jan Hubicka @ 2021-08-11 17:14 UTC (permalink / raw)
To: Richard Biener
Cc: Richard Sandiford, Hongtao Liu, Segher Boessenkool, gcc-help,
bin.cheng, 172060045
>
> Alternatively only enable loop vectorization at -O2 (the above checks
> flag_tree_slp_vectorize as well). At least the cost model kind
> does not have any influence on BB vectorization, that is, we get the
> same pros and cons as we do for -O3.
>
> Did anyone benchmark -O2 -ftree-{loop,slp}-vectorize separately yet?
I did last year, and most performance changes were from tree-loop-vectorize,
as expected. I will re-run the benchmarks now.
Honza
>
> Richard.
* Re: Why vectorization didn't turn on by -O2
2021-08-11 17:14 ` Jan Hubicka
@ 2021-08-14 14:22 ` Jan Hubicka
2021-08-16 8:03 ` Richard Biener
0 siblings, 1 reply; 26+ messages in thread
From: Jan Hubicka @ 2021-08-14 14:22 UTC (permalink / raw)
To: Richard Biener
Cc: Richard Sandiford, Hongtao Liu, Segher Boessenkool, gcc-help,
bin.cheng, 172060045
Hi,
runs finished for the cheap vectorization model. I am running very cheap
too.
https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_changes=on&min_percentage_change=0.02&revisions=e54acea9e5a821448af97c66e94a1e4c4f977d5d%2Ce87209a1269622017bf3d98bf71502dcb0f893aa%2C73474527aaa24d9236aca074c5494a07f40ce058&include_user_branches=on
compares default -O2 (base), loop vectorize, slp vectorize.
I put "O2" in the machine name.
Overall scores are:
kaby.spec2006.O2_generic
Test loop slp
SPEC/SPEC2006/FP/total 8.16% 0.19%
SPEC/SPEC2006/total 4.96% 0.38%
SPEC/SPEC2006/INT/total 36.595 0.58% 0.65%
kaby.spec2006.O2_generic_lto
SPEC/SPEC2006/FP/total 9.06% -0.36%
SPEC/SPEC2006/total 5.32% ~
SPEC/SPEC2006/INT/total 0.24% 0.27%
kaby.spec2017.O2_generic
SPEC/SPEC2017/INT/total 6.54% -1.15%
SPEC/SPEC2017/total 5.66% -0.17%
SPEC/SPEC2017/FP/total 5.00% 0.59%
kaby.spec2017.O2_generic_lto
SPEC/SPEC2017/INT/total 6.62% -0.12%
SPEC/SPEC2017/total 5.69% -0.14%
SPEC/SPEC2017/FP/total 4.99% -0.16%
zenith.spec2006.O2_generic
SPEC/SPEC2006/FP/total 10.23% -0.35%
SPEC/SPEC2006/total 6.01% -0.48%
SPEC/SPEC2006/INT/total 0.31% -0.66%
zenith.spec2006.O2_generic_lto
SPEC/SPEC2006/FP/total 12.03% 0.82%
SPEC/SPEC2006/total 6.90% 0.44%
SPEC/SPEC2006/INT/total ~ -0.11%
zenith.spec2017.O2_generic
SPEC/SPEC2017/INT/total 7.46% -0.37%
SPEC/SPEC2017/total 6.81% 0.48%
SPEC/SPEC2017/FP/total 6.31% 1.15%
zenith.spec2017.O2_generic_lto
SPEC/SPEC2017/INT/total 7.81% -0.22%
SPEC/SPEC2017/total 7.07% 0.44%
SPEC/SPEC2017/FP/total 6.50% 0.94%
So loop vectorize is a consistent win; slp is mostly neutral.
Code size growth from loop vectorize is too large for -O2. SLP vectorize
seems a slight size win overall.
Noteworthy regressions caused by slp are:
5-6%: xz (kaby, lto), milc (zenith), astar (zenith)
4-5%: xalancbmk (kaby), blender (kaby)
3-4%: xz (kaby, nolto), dealII (zenith)
2-3%: povray (kaby), astar (kaby), perlbench (kaby), sjeng (zenith),
xz(zenith)
We get a 10% improvement on imagemagick (kaby), 17.5% (zenith),
imagemagick 6.99% (zenith)
https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&all_changes=on&min_percentage_change=0.02&revisions=e54acea9e5a821448af97c66e94a1e4c4f977d5d%2Ce87209a1269622017bf3d98bf71502dcb0f893aa%2C73474527aaa24d9236aca074c5494a07f40ce058&include_user_branches=on
is for the C++/polyhedron benchmarks. It shows several interesting
regressions in polyhedron and tsvc for loop vectorization (over 100%)
and also some for slp. Shall I try to search bugzilla for these?
Honza
* Re: Why vectorization didn't turn on by -O2
2021-08-04 8:31 ` Richard Biener
` (2 preceding siblings ...)
2021-08-11 17:14 ` Jan Hubicka
@ 2021-08-16 3:22 ` Kewen.Lin
2021-08-16 6:00 ` Hongtao Liu
3 siblings, 1 reply; 26+ messages in thread
From: Kewen.Lin @ 2021-08-16 3:22 UTC (permalink / raw)
To: Richard Biener
Cc: bin.cheng, Segher Boessenkool, 172060045, gcc-help, Jan Hubicka,
Richard Sandiford
on 2021/8/4 4:31 PM, Richard Biener wrote:
> On Wed, 4 Aug 2021, Richard Sandiford wrote:
>
>> Hongtao Liu <crazylht@gmail.com> writes:
>>> On Tue, May 18, 2021 at 4:27 AM Richard Sandiford via Gcc-help
>>> <gcc-help@gcc.gnu.org> wrote:
>>>>
>>>> Jan Hubicka <hubicka@ucw.cz> writes:
>>>>> Hi,
>>>>> here are updated scores.
>>>>> https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
>>>>> compares
>>>>> base: mainline
>>>>> 1st column: mainline with very cheap vectorization at -O2 and -O3
>>>>> 2nd column: mainline with cheap vectorization at -O2 and -O3.
>>>>>
>>>>> The short story is:
>>>>>
>>>>> 1) -O2 generic performance
>>>>> kabylake (Intel):
>>>>> very cheap
>>>>> SPEC/SPEC2006/FP/total ~ 8.32%
>>>>> SPEC/SPEC2006/total -0.38% 4.74%
>>>>> SPEC/SPEC2006/INT/total -0.91% -0.14%
>>>>>
>>>>> SPEC/SPEC2017/INT/total 4.71% 7.11%
>>>>> SPEC/SPEC2017/total 2.22% 6.52%
>>>>> SPEC/SPEC2017/FP/total 0.34% 6.06%
>>>>> zen
>>>>> SPEC/SPEC2006/FP/total 0.61% 10.23%
>>>>> SPEC/SPEC2006/total 0.26% 6.27%
>>>>> SPEC/SPEC2006/INT/total 34.006 -0.24% 0.90%
>>>>>
>>>>> SPEC/SPEC2017/INT/total 3.937 5.34% 7.80%
>>>>> SPEC/SPEC2017/total 3.02% 6.55%
>>>>> SPEC/SPEC2017/FP/total 1.26% 5.60%
>>>>>
>>>>> 2) -O2 size:
>>>>> -0.78% (very cheap) 6.51% (cheap) for spec2k2006
>>>>> -0.32% (very cheap) 6.75% (cheap) for spec2k2017
>>>>> 3) build times:
>>>>> 0%, 0.16%, 0.71%, 0.93% (very cheap) 6.05% 4.80% 6.75% 7.15% (cheap) for spec2k2006
>>>>> 0.39% 0.57% 0.71% (very cheap) 5.40% 6.23% 8.44% (cheap) for spec2k2017
>>>>> here I simply copied data from different configurations
>>>>>
>>>>> So for SPEC I would say that most of compile time costs are derived
>>>>> from code size growth which is a problem with cheap model but not with
>>>>> very cheap. Very cheap indeed results in code size improvements and
>>>>> compile time impact is probably somewhere around 0.5%
>>>>>
>>>>> So from these scores alone this would seem that vectorization makes
>>>>> sense at -O2 with very cheap model to me (I am sure we have other
>>>>> optimizations with worse benefits to compile time tradeoffs).
>>>>
>>>> Thanks for running these.
>>>>
>>>> The biggest issue I know of for enabling very-cheap at -O2 is:
>>>>
>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089
>>>>
>>>> Perhaps we could get around that by (hopefully temporarily) disabling
>>>> BB SLP within loop vectorisation for the very-cheap model. This would
>>>> purely be a workaround and we should remove it once the PR is fixed.
>>>> (It would even be a compile-time win in the meantime :-))
>>>>
>>>> Thanks,
>>>> Richard
>>>>
>>>>> However there are usual arguments against:
>>>>>
>>>>> 1) Vectorizer being tuned for SPEC. I think the only way to overcome
>>>>> that argument is to enable it by default :)
>>>>> 2) Workloads improved are more of -Ofast type workloads
>>>>>
>>>>> Here are non-spec benchmarks we track:
>>>>> https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&min_percentage_change=0.02&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
>>>>>
>>>>> I also tried to run Firefox some time ago. Results are not surprising -
>>>>> vectorization helps rendering benchmarks, which are those compiled with
>>>>> aggressive flags anyway.
>>>>>
>>>>> Honza
>>>
>>> Hi:
>>> I would like to ask if we can turn on O2 vectorization now?
>>
>> I think we still need to deal with the PR100089 issue that I mentioned above.
>> Like I say, “dealing with” it could be as simple as disabling:
>>
>> /* If we applied if-conversion then try to vectorize the
>> BB of innermost loops.
>> ??? Ideally BB vectorization would learn to vectorize
>> control flow by applying if-conversion on-the-fly, the
>> following retains the if-converted loop body even when
>> only non-if-converted parts took part in BB vectorization. */
>> if (flag_tree_slp_vectorize != 0
>> && loop_vectorized_call
>> && ! loop->inner)
>>
>> for the very-cheap vector cost model until the PR is fixed properly.
>
> Alternatively only enable loop vectorization at -O2 (the above checks
> flag_tree_slp_vectorize as well). At least the cost model kind
> does not have any influence on BB vectorization, that is, we get the
> same pros and cons as we do for -O3.
>
> Did anyone benchmark -O2 -ftree-{loop,slp}-vectorize separately yet?
>
Here is the measured performance speedup at O2 vect with
very cheap cost model on both Power8 and Power9.
INT: -O2 -mcpu=power{8,9} -ftree-{,loop-,slp-}vectorize -fvect-cost-model=very-cheap
FP: INT + -ffast-math
Column titles are:
<bmks> <both loop and slp> <loop only> <slp only> (+:improvement, -:degradation)
Power8:
500.perlbench_r 0.00% 0.00% 0.00%
502.gcc_r 0.39% 0.78% 0.00%
505.mcf_r 0.00% 0.00% 0.00%
520.omnetpp_r 1.21% 0.30% 0.00%
523.xalancbmk_r 0.00% 0.00% -0.57%
525.x264_r 41.84% 42.55% 0.00%
531.deepsjeng_r 0.00% -0.63% 0.00%
541.leela_r -3.44% -2.75% 0.00%
548.exchange2_r 1.66% 1.66% 0.00%
557.xz_r 1.39% 1.04% 0.00%
Geomean 3.67% 3.64% -0.06%
503.bwaves_r 0.00% 0.00% 0.00%
507.cactuBSSN_r 0.00% 0.29% 0.44%
508.namd_r 0.00% 0.29% 0.00%
510.parest_r 0.00% -0.36% -0.54%
511.povray_r 0.63% 0.31% 0.94%
519.lbm_r 2.71% 2.71% 0.00%
521.wrf_r 1.04% 1.04% 0.00%
526.blender_r -1.31% -0.78% 0.00%
527.cam4_r -0.62% -0.31% -0.62%
538.imagick_r 0.21% 0.21% -0.21%
544.nab_r 0.00% 0.00% 0.00%
549.fotonik3d_r 0.00% 0.00% 0.00%
554.roms_r 0.30% 0.00% 0.00%
Geomean 0.22% 0.26% 0.00%
Power9:
500.perlbench_r 0.62% 0.62% -1.54%
502.gcc_r -0.60% -0.60% -0.81%
505.mcf_r 2.05% 2.05% 0.00%
520.omnetpp_r -2.41% -0.30% -0.60%
523.xalancbmk_r -1.44% -2.30% -1.44%
525.x264_r 24.26% 23.93% -0.33%
531.deepsjeng_r 0.32% 0.32% 0.00%
541.leela_r 0.39% 1.18% -0.39%
548.exchange2_r 0.76% 0.76% 0.00%
557.xz_r 0.36% 0.36% -0.36%
Geomean 2.19% 2.38% -0.55%
503.bwaves_r 0.00% 0.36% 0.00%
507.cactuBSSN_r 0.00% 0.00% 0.00%
508.namd_r -3.73% -0.31% -3.73%
510.parest_r -0.21% -0.42% -0.42%
511.povray_r -0.96% -1.59% 0.64%
519.lbm_r 2.31% 2.31% 0.17%
521.wrf_r 2.66% 2.66% 0.00%
526.blender_r -1.96% -1.68% 1.40%
527.cam4_r 0.00% 0.91% 1.81%
538.imagick_r 0.39% -0.19% -10.29% // known noise, imagick_r can have big jitter on P9 box sometimes.
544.nab_r 0.25% 0.00% 0.00%
549.fotonik3d_r 0.94% 0.94% 0.00%
554.roms_r 0.00% 0.00% -1.05%
Geomean -0.03% 0.22% -0.93%
As above, the gains are mainly from loop vectorization.
btw, Power8 data can be more representative since some bmks can have jitters on our P9 perf box.
BR,
Kewen
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Why vectorization didn't turn on by -O2
2021-08-16 3:22 ` Kewen.Lin
@ 2021-08-16 6:00 ` Hongtao Liu
2021-08-16 6:09 ` Hongtao Liu
0 siblings, 1 reply; 26+ messages in thread
From: Hongtao Liu @ 2021-08-16 6:00 UTC (permalink / raw)
To: Kewen.Lin
Cc: Richard Biener, bin.cheng, Segher Boessenkool, 172060045,
gcc-help, Richard Sandiford, Jan Hubicka
On Mon, Aug 16, 2021 at 11:23 AM Kewen.Lin via Gcc-help
<gcc-help@gcc.gnu.org> wrote:
>
> on 2021/8/4 下午4:31, Richard Biener wrote:
> > On Wed, 4 Aug 2021, Richard Sandiford wrote:
> >
> >> Hongtao Liu <crazylht@gmail.com> writes:
> >>> On Tue, May 18, 2021 at 4:27 AM Richard Sandiford via Gcc-help
> >>> <gcc-help@gcc.gnu.org> wrote:
> >>>>
> >>>> Jan Hubicka <hubicka@ucw.cz> writes:
> >>>>> Hi,
> >>>>> here are updated scores.
> >>>>> https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> >>>>> compares
> >>>>> base: mainline
> >>>>> 1st column: mainline with very cheap vectorization at -O2 and -O3
> >>>>> 2nd column: mainline with cheap vectorization at -O2 and -O3.
> >>>>>
> >>>>> The short story is:
> >>>>>
> >>>>> 1) -O2 generic performance
> >>>>> kabylake (Intel):
> >>>>> very cheap
> >>>>> SPEC/SPEC2006/FP/total ~ 8.32%
> >>>>> SPEC/SPEC2006/total -0.38% 4.74%
> >>>>> SPEC/SPEC2006/INT/total -0.91% -0.14%
> >>>>>
> >>>>> SPEC/SPEC2017/INT/total 4.71% 7.11%
> >>>>> SPEC/SPEC2017/total 2.22% 6.52%
> >>>>> SPEC/SPEC2017/FP/total 0.34% 6.06%
> >>>>> zen
> >>>>> SPEC/SPEC2006/FP/total 0.61% 10.23%
> >>>>> SPEC/SPEC2006/total 0.26% 6.27%
> >>>>> SPEC/SPEC2006/INT/total 34.006 -0.24% 0.90%
> >>>>>
> >>>>> SPEC/SPEC2017/INT/total 3.937 5.34% 7.80%
> >>>>> SPEC/SPEC2017/total 3.02% 6.55%
> >>>>> SPEC/SPEC2017/FP/total 1.26% 5.60%
> >>>>>
> >>>>> 2) -O2 size:
> >>>>> -0.78% (very cheap) 6.51% (cheap) for spec2k2006
> >>>>> -0.32% (very cheap) 6.75% (cheap) for spec2k2017
> >>>>> 3) build times:
> >>>>> 0%, 0.16%, 0.71%, 0.93% (very cheap) 6.05% 4.80% 6.75% 7.15% (cheap) for spec2k2006
> >>>>> 0.39% 0.57% 0.71% (very cheap) 5.40% 6.23% 8.44% (cheap) for spec2k2017
> >>>>> here I simply copied data from different configurations
> >>>>>
> >>>>> So for SPEC I would say that most of compile time costs are derived
> >>>>> from code size growth which is a problem with cheap model but not with
> >>>>> very cheap. Very cheap indeed results in code size improvements and
> >>>>> compile time impact is probably somewhere around 0.5%
> >>>>>
> >>>>> So from these scores alone this would seem that vectorization makes
> >>>>> sense at -O2 with very cheap model to me (I am sure we have other
> >>>>> optimizations with worse benefits to compile time tradeoffs).
> >>>>
> >>>> Thanks for running these.
> >>>>
> >>>> The biggest issue I know of for enabling very-cheap at -O2 is:
> >>>>
> >>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089
> >>>>
> >>>> Perhaps we could get around that by (hopefully temporarily) disabling
> >>>> BB SLP within loop vectorisation for the very-cheap model. This would
> >>>> purely be a workaround and we should remove it once the PR is fixed.
> >>>> (It would even be a compile-time win in the meantime :-))
> >>>>
> >>>> Thanks,
> >>>> Richard
> >>>>
> >>>>> However there are usual arguments against:
> >>>>>
> >>>>> 1) Vectorizer being tuned for SPEC. I think the only way to overcome
> >>>>> that argument is to enable it by default :)
> >>>>> 2) Workloads improved are more of -Ofast type workloads
> >>>>>
> >>>>> Here are non-spec benchmarks we track:
> >>>>> https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&min_percentage_change=0.02&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> >>>>>
> >>>>> I also tried to run Firefox some time ago. Results are not surprising -
> >>>>> vectorization helps rendering benchmarks, which are those compiled with
> >>>>> aggressive flags anyway.
> >>>>>
> >>>>> Honza
> >>>
> >>> Hi:
> >>> I would like to ask if we can turn on O2 vectorization now?
> >>
> >> I think we still need to deal with the PR100089 issue that I mentioned above.
> >> Like I say, “dealing with” it could be as simple as disabling:
> >>
> >> /* If we applied if-conversion then try to vectorize the
> >> BB of innermost loops.
> >> ??? Ideally BB vectorization would learn to vectorize
> >> control flow by applying if-conversion on-the-fly, the
> >> following retains the if-converted loop body even when
> >> only non-if-converted parts took part in BB vectorization. */
> >> if (flag_tree_slp_vectorize != 0
> >> && loop_vectorized_call
> >> && ! loop->inner)
> >>
> >> for the very-cheap vector cost model until the PR is fixed properly.
> >
> > Alternatively only enable loop vectorization at -O2 (the above checks
> > flag_tree_slp_vectorize as well). At least the cost model kind
> > does not have any influence on BB vectorization, that is, we get the
> > same pros and cons as we do for -O3.
> >
> > Did anyone benchmark -O2 -ftree-{loop,slp}-vectorize separately yet?
> >
>
>
> Here is the measured performance speedup at O2 vect with
> very cheap cost model on both Power8 and Power9.
>
> INT: -O2 -mcpu=power{8,9} -ftree-{,loop-,slp-}vectorize -fvect-cost-model=very-cheap
> FP: INT + -ffast-math
>
> Column titles are:
>
> <bmks> <both loop and slp> <loop only> <slp only> (+:improvement, -:degradation)
>
> Power8:
> 500.perlbench_r 0.00% 0.00% 0.00%
> 502.gcc_r 0.39% 0.78% 0.00%
> 505.mcf_r 0.00% 0.00% 0.00%
> 520.omnetpp_r 1.21% 0.30% 0.00%
> 523.xalancbmk_r 0.00% 0.00% -0.57%
> 525.x264_r 41.84% 42.55% 0.00%
> 531.deepsjeng_r 0.00% -0.63% 0.00%
> 541.leela_r -3.44% -2.75% 0.00%
> 548.exchange2_r 1.66% 1.66% 0.00%
> 557.xz_r 1.39% 1.04% 0.00%
> Geomean 3.67% 3.64% -0.06%
>
> 503.bwaves_r 0.00% 0.00% 0.00%
> 507.cactuBSSN_r 0.00% 0.29% 0.44%
> 508.namd_r 0.00% 0.29% 0.00%
> 510.parest_r 0.00% -0.36% -0.54%
> 511.povray_r 0.63% 0.31% 0.94%
> 519.lbm_r 2.71% 2.71% 0.00%
> 521.wrf_r 1.04% 1.04% 0.00%
> 526.blender_r -1.31% -0.78% 0.00%
> 527.cam4_r -0.62% -0.31% -0.62%
> 538.imagick_r 0.21% 0.21% -0.21%
> 544.nab_r 0.00% 0.00% 0.00%
> 549.fotonik3d_r 0.00% 0.00% 0.00%
> 554.roms_r 0.30% 0.00% 0.00%
> Geomean 0.22% 0.26% 0.00%
>
> Power9:
>
> 500.perlbench_r 0.62% 0.62% -1.54%
> 502.gcc_r -0.60% -0.60% -0.81%
> 505.mcf_r 2.05% 2.05% 0.00%
> 520.omnetpp_r -2.41% -0.30% -0.60%
> 523.xalancbmk_r -1.44% -2.30% -1.44%
> 525.x264_r 24.26% 23.93% -0.33%
> 531.deepsjeng_r 0.32% 0.32% 0.00%
> 541.leela_r 0.39% 1.18% -0.39%
> 548.exchange2_r 0.76% 0.76% 0.00%
> 557.xz_r 0.36% 0.36% -0.36%
> Geomean 2.19% 2.38% -0.55%
>
> 503.bwaves_r 0.00% 0.36% 0.00%
> 507.cactuBSSN_r 0.00% 0.00% 0.00%
> 508.namd_r -3.73% -0.31% -3.73%
> 510.parest_r -0.21% -0.42% -0.42%
> 511.povray_r -0.96% -1.59% 0.64%
> 519.lbm_r 2.31% 2.31% 0.17%
> 521.wrf_r 2.66% 2.66% 0.00%
> 526.blender_r -1.96% -1.68% 1.40%
> 527.cam4_r 0.00% 0.91% 1.81%
> 538.imagick_r 0.39% -0.19% -10.29% // known noise, imagick_r can have big jitter on P9 box sometimes.
> 544.nab_r 0.25% 0.00% 0.00%
> 549.fotonik3d_r 0.94% 0.94% 0.00%
> 554.roms_r 0.00% 0.00% -1.05%
> Geomean -0.03% 0.22% -0.93%
>
>
> As above, the gains are mainly from loop vectorization.
> btw, Power8 data can be more representative since some bmks can have jitters on our P9 perf box.
>
> BR,
> Kewen
Here is data on CLX.
+ for performance means better.
- for codesize means better.
We notice a codesize increase in 549.fotonik3d_r (3.36%) which did not
exist in our last measurement with GCC 11.0.0 20210317; it is not
related to the fix of PR100089.
The others are about the same as in the last measurement.
                     O2 -ftree-vectorize very-cheap   loop vectorizer         bb vectorizer
                     codesize  performance            codesize  performance   codesize  performance
500.perlbench_r       0.34%     0.55%                  0.05%     0.80%         0.29%     0.84%
502.gcc_r             0.29%    -0.32%                  0.02%    -0.12%         0.27%    -0.23%
505.mcf_r             1.36%    -1.20% (noise)          0.00%    -0.69%         1.16%    -0.85%
520.omnetpp_r        -0.65%    -0.83%                  0.05%    -0.97%        -0.70%    -0.52%
523.xalancbmk_r       0.04%    -0.59%                  0.26%    -0.56%        -0.04%    -0.52%
525.x264_r            1.29%    62.62%                  1.18%    64.80%         0.13%    -0.29%
531.deepsjeng_r       0.18%    -0.44%                  0.16%    -0.03%        -0.05%    -0.50%
541.leela_r          -1.10%    -0.12%                 -0.11%     0.59%        -0.99%    -1.12%
548.exchange2_r      -1.19%     0.34%                 -0.27%    -0.29%        -1.02%     0.17%
557.xz_r             -0.53%    -1.01% (cost model)    -0.76%    -0.10%        -0.10%    -1.28%
geomean for intrate   0.00%     4.60%                  0.06%     4.98%        -0.11%    -0.43%
503.bwaves_r         -0.29%    -1.19% (noise)          0.00%    -0.86%        -0.25%    -0.43%
507.cactuBSSN_r       0.01%    -0.55%                  0.01%    -0.35%         0.01%    -0.37%
508.namd_r           -0.61%     2.38%                 -0.13%    -0.09%        -0.67%     2.45%
510.parest_r         -0.41%     0.10%                 -0.16%     0.62%        -0.50%     0.72%
511.povray_r         -1.76%     3.79%                 -0.03%     0.41%        -1.74%     4.61%
519.lbm_r             0.38%    -0.33%                  0.00%    -0.31%         0.38%     0.05%
521.wrf_r            -0.85%     1.23%                 -0.03%     1.60%        -0.94%     0.00%
526.blender_r        -0.40%    -1.21% (noise)          0.00%    -1.49%        -0.43%    -1.64%
527.cam4_r           -0.27%     0.06%                  0.10%    -0.06%        -0.39%    -0.01%
538.imagick_r        -0.97%     1.10%                 -0.09%     0.32%        -0.90%     2.49%
544.nab_r            -0.65%     0.09%                  0.02%     0.20%        -0.69%     0.09%
549.fotonik3d_r       3.36%     0.30%                  2.42%     0.44%         0.93%    -0.08%
554.roms_r           -0.28%    -0.20%                  0.25%     0.06%        -0.52%     0.00%
geomean for fprate   -0.22%     0.42%                  0.18%     0.04%        -0.44%     0.59%
geomean              -0.12%     2.22%                  0.13%     2.16%        -0.30%     0.15%
--
BR,
Hongtao
* Re: Why vectorization didn't turn on by -O2
2021-08-16 6:00 ` Hongtao Liu
@ 2021-08-16 6:09 ` Hongtao Liu
2021-08-24 2:21 ` Hongtao Liu
0 siblings, 1 reply; 26+ messages in thread
From: Hongtao Liu @ 2021-08-16 6:09 UTC (permalink / raw)
To: Kewen.Lin
Cc: Richard Biener, bin.cheng, Segher Boessenkool, 172060045,
gcc-help, Richard Sandiford, Jan Hubicka
On Mon, Aug 16, 2021 at 2:00 PM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Mon, Aug 16, 2021 at 11:23 AM Kewen.Lin via Gcc-help
> <gcc-help@gcc.gnu.org> wrote:
> >
> > on 2021/8/4 下午4:31, Richard Biener wrote:
> > > On Wed, 4 Aug 2021, Richard Sandiford wrote:
> > >
> > >> Hongtao Liu <crazylht@gmail.com> writes:
> > >>> On Tue, May 18, 2021 at 4:27 AM Richard Sandiford via Gcc-help
> > >>> <gcc-help@gcc.gnu.org> wrote:
> > >>>>
> > >>>> Jan Hubicka <hubicka@ucw.cz> writes:
> > >>>>> Hi,
> > >>>>> here are updated scores.
> > >>>>> https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> > >>>>> compares
> > >>>>> base: mainline
> > >>>>> 1st column: mainline with very cheap vectorization at -O2 and -O3
> > >>>>> 2nd column: mainline with cheap vectorization at -O2 and -O3.
> > >>>>>
> > >>>>> The short story is:
> > >>>>>
> > >>>>> 1) -O2 generic performance
> > >>>>> kabylake (Intel):
> > >>>>> very cheap
> > >>>>> SPEC/SPEC2006/FP/total ~ 8.32%
> > >>>>> SPEC/SPEC2006/total -0.38% 4.74%
> > >>>>> SPEC/SPEC2006/INT/total -0.91% -0.14%
> > >>>>>
> > >>>>> SPEC/SPEC2017/INT/total 4.71% 7.11%
> > >>>>> SPEC/SPEC2017/total 2.22% 6.52%
> > >>>>> SPEC/SPEC2017/FP/total 0.34% 6.06%
> > >>>>> zen
> > >>>>> SPEC/SPEC2006/FP/total 0.61% 10.23%
> > >>>>> SPEC/SPEC2006/total 0.26% 6.27%
> > >>>>> SPEC/SPEC2006/INT/total 34.006 -0.24% 0.90%
> > >>>>>
> > >>>>> SPEC/SPEC2017/INT/total 3.937 5.34% 7.80%
> > >>>>> SPEC/SPEC2017/total 3.02% 6.55%
> > >>>>> SPEC/SPEC2017/FP/total 1.26% 5.60%
> > >>>>>
> > >>>>> 2) -O2 size:
> > >>>>> -0.78% (very cheap) 6.51% (cheap) for spec2k2006
> > >>>>> -0.32% (very cheap) 6.75% (cheap) for spec2k2017
> > >>>>> 3) build times:
> > >>>>> 0%, 0.16%, 0.71%, 0.93% (very cheap) 6.05% 4.80% 6.75% 7.15% (cheap) for spec2k2006
> > >>>>> 0.39% 0.57% 0.71% (very cheap) 5.40% 6.23% 8.44% (cheap) for spec2k2017
> > >>>>> here I simply copied data from different configurations
> > >>>>>
> > >>>>> So for SPEC I would say that most of compile time costs are derived
> > >>>>> from code size growth which is a problem with cheap model but not with
> > >>>>> very cheap. Very cheap indeed results in code size improvements and
> > >>>>> compile time impact is probably somewhere around 0.5%
> > >>>>>
> > >>>>> So from these scores alone this would seem that vectorization makes
> > >>>>> sense at -O2 with very cheap model to me (I am sure we have other
> > >>>>> optimizations with worse benefits to compile time tradeoffs).
> > >>>>
> > >>>> Thanks for running these.
> > >>>>
> > >>>> The biggest issue I know of for enabling very-cheap at -O2 is:
> > >>>>
> > >>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089
> > >>>>
> > >>>> Perhaps we could get around that by (hopefully temporarily) disabling
> > >>>> BB SLP within loop vectorisation for the very-cheap model. This would
> > >>>> purely be a workaround and we should remove it once the PR is fixed.
> > >>>> (It would even be a compile-time win in the meantime :-))
> > >>>>
> > >>>> Thanks,
> > >>>> Richard
> > >>>>
> > >>>>> However there are usual arguments against:
> > >>>>>
> > >>>>> 1) Vectorizer being tuned for SPEC. I think the only way to overcome
> > >>>>> that argument is to enable it by default :)
> > >>>>> 2) Workloads improved are more of -Ofast type workloads
> > >>>>>
> > >>>>> Here are non-spec benchmarks we track:
> > >>>>> https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&min_percentage_change=0.02&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> > >>>>>
> > >>>>> I also tried to run Firefox some time ago. Results are not surprising -
> > >>>>> vectorization helps rendering benchmarks, which are those compiled with
> > >>>>> aggressive flags anyway.
> > >>>>>
> > >>>>> Honza
> > >>>
> > >>> Hi:
> > >>> I would like to ask if we can turn on O2 vectorization now?
> > >>
> > >> I think we still need to deal with the PR100089 issue that I mentioned above.
> > >> Like I say, “dealing with” it could be as simple as disabling:
> > >>
> > >> /* If we applied if-conversion then try to vectorize the
> > >> BB of innermost loops.
> > >> ??? Ideally BB vectorization would learn to vectorize
> > >> control flow by applying if-conversion on-the-fly, the
> > >> following retains the if-converted loop body even when
> > >> only non-if-converted parts took part in BB vectorization. */
> > >> if (flag_tree_slp_vectorize != 0
> > >> && loop_vectorized_call
> > >> && ! loop->inner)
> > >>
> > >> for the very-cheap vector cost model until the PR is fixed properly.
> > >
> > > Alternatively only enable loop vectorization at -O2 (the above checks
> > > flag_tree_slp_vectorize as well). At least the cost model kind
> > > does not have any influence on BB vectorization, that is, we get the
> > > same pros and cons as we do for -O3.
> > >
> > > Did anyone benchmark -O2 -ftree-{loop,slp}-vectorize separately yet?
> > >
> >
> >
> > Here is the measured performance speedup at O2 vect with
> > very cheap cost model on both Power8 and Power9.
> >
> > INT: -O2 -mcpu=power{8,9} -ftree-{,loop-,slp-}vectorize -fvect-cost-model=very-cheap
> > FP: INT + -ffast-math
> >
> > Column titles are:
> >
> > <bmks> <both loop and slp> <loop only> <slp only> (+:improvement, -:degradation)
> >
> > Power8:
> > 500.perlbench_r 0.00% 0.00% 0.00%
> > 502.gcc_r 0.39% 0.78% 0.00%
> > 505.mcf_r 0.00% 0.00% 0.00%
> > 520.omnetpp_r 1.21% 0.30% 0.00%
> > 523.xalancbmk_r 0.00% 0.00% -0.57%
> > 525.x264_r 41.84% 42.55% 0.00%
> > 531.deepsjeng_r 0.00% -0.63% 0.00%
> > 541.leela_r -3.44% -2.75% 0.00%
> > 548.exchange2_r 1.66% 1.66% 0.00%
> > 557.xz_r 1.39% 1.04% 0.00%
> > Geomean 3.67% 3.64% -0.06%
> >
> > 503.bwaves_r 0.00% 0.00% 0.00%
> > 507.cactuBSSN_r 0.00% 0.29% 0.44%
> > 508.namd_r 0.00% 0.29% 0.00%
> > 510.parest_r 0.00% -0.36% -0.54%
> > 511.povray_r 0.63% 0.31% 0.94%
> > 519.lbm_r 2.71% 2.71% 0.00%
> > 521.wrf_r 1.04% 1.04% 0.00%
> > 526.blender_r -1.31% -0.78% 0.00%
> > 527.cam4_r -0.62% -0.31% -0.62%
> > 538.imagick_r 0.21% 0.21% -0.21%
> > 544.nab_r 0.00% 0.00% 0.00%
> > 549.fotonik3d_r 0.00% 0.00% 0.00%
> > 554.roms_r 0.30% 0.00% 0.00%
> > Geomean 0.22% 0.26% 0.00%
> >
> > Power9:
> >
> > 500.perlbench_r 0.62% 0.62% -1.54%
> > 502.gcc_r -0.60% -0.60% -0.81%
> > 505.mcf_r 2.05% 2.05% 0.00%
> > 520.omnetpp_r -2.41% -0.30% -0.60%
> > 523.xalancbmk_r -1.44% -2.30% -1.44%
> > 525.x264_r 24.26% 23.93% -0.33%
> > 531.deepsjeng_r 0.32% 0.32% 0.00%
> > 541.leela_r 0.39% 1.18% -0.39%
> > 548.exchange2_r 0.76% 0.76% 0.00%
> > 557.xz_r 0.36% 0.36% -0.36%
> > Geomean 2.19% 2.38% -0.55%
> >
> > 503.bwaves_r 0.00% 0.36% 0.00%
> > 507.cactuBSSN_r 0.00% 0.00% 0.00%
> > 508.namd_r -3.73% -0.31% -3.73%
> > 510.parest_r -0.21% -0.42% -0.42%
> > 511.povray_r -0.96% -1.59% 0.64%
> > 519.lbm_r 2.31% 2.31% 0.17%
> > 521.wrf_r 2.66% 2.66% 0.00%
> > 526.blender_r -1.96% -1.68% 1.40%
> > 527.cam4_r 0.00% 0.91% 1.81%
> > 538.imagick_r 0.39% -0.19% -10.29% // known noise, imagick_r can have big jitter on P9 box sometimes.
> > 544.nab_r 0.25% 0.00% 0.00%
> > 549.fotonik3d_r 0.94% 0.94% 0.00%
> > 554.roms_r 0.00% 0.00% -1.05%
> > Geomean -0.03% 0.22% -0.93%
> >
> >
> > As above, the gains are mainly from loop vectorization.
> > btw, Power8 data can be more representative since some bmks can have jitters on our P9 perf box.
> >
> > BR,
> > Kewen
>
> Here is data on CLX.
> + for performance means better.
> - for codesize means better.
>
> We notice a codesize increase in 549.fotonik3d_r (3.36%) which did not
> exist in our last measurement with GCC 11.0.0 20210317; it is not
> related to the fix of PR100089.
> The others are about the same as in the last measurement.
>
O2 -ftree-vectorize very-cheap
codesize performance
500.perlbench_r 0.34% 0.55%
502.gcc_r 0.29% -0.32%
505.mcf_r 1.36% -1.20%(noise)
520.omnetpp_r -0.65% -0.83%
523.xalancbmk_r 0.04% -0.59%
525.x264_r 1.29% 62.62%
531.deepsjeng_r 0.18% -0.44%
541.leela_r -1.10% -0.12%
548.exchange2_r -1.19% 0.34%
557.xz_r -0.53% -1.01%(cost model)
geomean for intrate 0.00% 4.60%
503.bwaves_r -0.29% -1.19%
507.cactuBSSN_r 0.01% -0.55%
508.namd_r -0.61% 2.38%
510.parest_r -0.41% 0.10%
511.povray_r -1.76% 3.79%
519.lbm_r 0.38% -0.33%
521.wrf_r -0.85% 1.23%
526.blender_r -0.40% -1.21%(noise)
527.cam4_r -0.27% 0.06%
538.imagick_r -0.97% 1.10%
544.nab_r -0.65% 0.09%
549.fotonik3d_r 3.36% 0.30%
554.roms_r -0.28% -0.20%
geomean for fprate -0.22% 0.42%
geomean -0.12% 2.22%
loop vectorizer bb vectorizer
codesize performance codesize performance
500.perlbench_r 0.05% 0.80% 0.29% 0.84%
502.gcc_r 0.02% -0.12% 0.27% -0.23%
505.mcf_r 0.00% -0.69% 1.16% -0.85%
520.omnetpp_r 0.05% -0.97% -0.70% -0.52%
523.xalancbmk_r 0.26% -0.56% -0.04% -0.52%
525.x264_r 1.18% 64.80% 0.13% -0.29%
531.deepsjeng_r 0.16% -0.03% -0.05% -0.50%
541.leela_r -0.11% 0.59% -0.99% -1.12%
548.exchange2_r -0.27% -0.29% -1.02% 0.17%
557.xz_r -0.76% -0.10% -0.10% -1.28%
geomean for intrate 0.06% 4.98% -0.11% -0.43%
503.bwaves_r 0.00% -0.86% -0.25% -0.43%
507.cactuBSSN_r 0.01% -0.35% 0.01% -0.37%
508.namd_r -0.13% -0.09% -0.67% 2.45%
510.parest_r -0.16% 0.62% -0.50% 0.72%
511.povray_r -0.03% 0.41% -1.74% 4.61%
519.lbm_r 0.00% -0.31% 0.38% 0.05%
521.wrf_r -0.03% 1.60% -0.94% 0.00%
526.blender_r 0.00% -1.49% -0.43% -1.64%
527.cam4_r 0.10% -0.06% -0.39% -0.01%
538.imagick_r -0.09% 0.32% -0.90% 2.49%
544.nab_r 0.02% 0.20% -0.69% 0.09%
549.fotonik3d_r 2.42% 0.44% 0.93% -0.08%
554.roms_r 0.25% 0.06% -0.52% 0.00%
geomean for fprate 0.18% 0.04% -0.44% 0.59%
geomean 0.13% 2.16% -0.30% 0.15%
--
BR,
Hongtao
* Re: Why vectorization didn't turn on by -O2
2021-08-14 14:22 ` Jan Hubicka
@ 2021-08-16 8:03 ` Richard Biener
0 siblings, 0 replies; 26+ messages in thread
From: Richard Biener @ 2021-08-16 8:03 UTC (permalink / raw)
To: Jan Hubicka
Cc: Richard Sandiford, Hongtao Liu, Segher Boessenkool, gcc-help,
bin.cheng, 172060045
On Sat, 14 Aug 2021, Jan Hubicka wrote:
> Hi,
> runs finished for cheap vectorization model. I am running very cheap
> too.
>
> https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_changes=on&min_percentage_change=0.02&revisions=e54acea9e5a821448af97c66e94a1e4c4f977d5d%2Ce87209a1269622017bf3d98bf71502dcb0f893aa%2C73474527aaa24d9236aca074c5494a07f40ce058&include_user_branches=on
>
> compares default -O2 (base), loop vectorize, slp vectorize.
> Put "O2" to machine name.
>
> Overall scores are:
>
> kaby.spec2006.O2_generic
> Test loop slp
> SPEC/SPEC2006/FP/total 8.16% 0.19%
> SPEC/SPEC2006/total 4.96% 0.38%
> SPEC/SPEC2006/INT/total 36.595 0.58% 0.65%
>
> kaby.spec2006.O2_generic_lto
> SPEC/SPEC2006/FP/total 9.06% -0.36%
> SPEC/SPEC2006/total 5.32% ~
> SPEC/SPEC2006/INT/total 0.24% 0.27%
>
> kaby.spec2017.O2_generic
> SPEC/SPEC2017/INT/total 6.54% -1.15%
> SPEC/SPEC2017/total 5.66% -0.17%
> SPEC/SPEC2017/FP/total 5.00% 0.59%
>
> kaby.spec2017.O2_generic_lto
> SPEC/SPEC2017/INT/total 6.62% -0.12%
> SPEC/SPEC2017/total 5.69% -0.14%
> SPEC/SPEC2017/FP/total 4.99% -0.16%
>
> zenith.spec2006.O2_generic
> SPEC/SPEC2006/FP/total 10.23% -0.35%
> SPEC/SPEC2006/total 6.01% -0.48%
> SPEC/SPEC2006/INT/total 0.31% -0.66%
>
> zenith.spec2006.O2_generic_lto
> SPEC/SPEC2006/FP/total 12.03% 0.82%
> SPEC/SPEC2006/total 6.90% 0.44%
> SPEC/SPEC2006/INT/total ~ -0.11%
>
> zenith.spec2017.O2_generic
> SPEC/SPEC2017/INT/total 7.46% -0.37%
> SPEC/SPEC2017/total 6.81% 0.48%
> SPEC/SPEC2017/FP/total 6.31% 1.15%
>
> zenith.spec2017.O2_generic_lto
> SPEC/SPEC2017/INT/total 7.81% -0.22%
> SPEC/SPEC2017/total 7.07% 0.44%
> SPEC/SPEC2017/FP/total 6.50% 0.94%
>
> So loop vectorization is a consistent win; SLP is mostly neutral.
>
> Code size growth from loop vectorization is too large for -O2. SLP
> vectorization seems a slight size win overall.
>
> Noteworthy regressions caused by slp are:
> 5-6%: xz (kaby, lto), milc (zenith), astar (zenith)
> 4-5%: xalancbmk (kaby), blender (kaby)
> 3-4%: xz (kaby, nolto), dealII (zenith)
> 2-3%: povray (kaby), astar (kaby), perlbench (kaby), sjeng (zenith),
> xz(zenith)
>
> We get 10% improvement on imagemagick (kaby), 17.5% (zenith),
> imagemagick 6.99% (zenith)
>
> https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&all_changes=on&min_percentage_change=0.02&revisions=e54acea9e5a821448af97c66e94a1e4c4f977d5d%2Ce87209a1269622017bf3d98bf71502dcb0f893aa%2C73474527aaa24d9236aca074c5494a07f40ce058&include_user_branches=on
>
> Is for C++/polyhedron benchmarks. It shows several interesting
> regressions in polyhedron and tsvc for loop vectorization (over 100%)
> and also some for slp. Shall I search bugzilla for these?
Yeah, even more so if they reproduce with -O3.
Richard.
* Re: Why vectorization didn't turn on by -O2
2021-08-16 6:09 ` Hongtao Liu
@ 2021-08-24 2:21 ` Hongtao Liu
0 siblings, 0 replies; 26+ messages in thread
From: Hongtao Liu @ 2021-08-24 2:21 UTC (permalink / raw)
To: Kewen.Lin
Cc: Richard Biener, bin.cheng, Segher Boessenkool, 172060045,
gcc-help, Richard Sandiford, Jan Hubicka
On Mon, Aug 16, 2021 at 2:09 PM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Mon, Aug 16, 2021 at 2:00 PM Hongtao Liu <crazylht@gmail.com> wrote:
> >
> > On Mon, Aug 16, 2021 at 11:23 AM Kewen.Lin via Gcc-help
> > <gcc-help@gcc.gnu.org> wrote:
> > >
> > > on 2021/8/4 下午4:31, Richard Biener wrote:
> > > > On Wed, 4 Aug 2021, Richard Sandiford wrote:
> > > >
> > > >> Hongtao Liu <crazylht@gmail.com> writes:
> > > >>> On Tue, May 18, 2021 at 4:27 AM Richard Sandiford via Gcc-help
> > > >>> <gcc-help@gcc.gnu.org> wrote:
> > > >>>>
> > > >>>> Jan Hubicka <hubicka@ucw.cz> writes:
> > > >>>>> Hi,
> > > >>>>> here are updated scores.
> > > >>>>> https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> > > >>>>> compares
> > > >>>>> base: mainline
> > > >>>>> 1st column: mainline with very cheap vectorization at -O2 and -O3
> > > >>>>> 2nd column: mainline with cheap vectorization at -O2 and -O3.
> > > >>>>>
> > > >>>>> The short story is:
> > > >>>>>
> > > >>>>> 1) -O2 generic performance
> > > >>>>> kabylake (Intel):
> > > >>>>> very cheap
> > > >>>>> SPEC/SPEC2006/FP/total ~ 8.32%
> > > >>>>> SPEC/SPEC2006/total -0.38% 4.74%
> > > >>>>> SPEC/SPEC2006/INT/total -0.91% -0.14%
> > > >>>>>
> > > >>>>> SPEC/SPEC2017/INT/total 4.71% 7.11%
> > > >>>>> SPEC/SPEC2017/total 2.22% 6.52%
> > > >>>>> SPEC/SPEC2017/FP/total 0.34% 6.06%
> > > >>>>> zen
> > > >>>>> SPEC/SPEC2006/FP/total 0.61% 10.23%
> > > >>>>> SPEC/SPEC2006/total 0.26% 6.27%
> > > >>>>> SPEC/SPEC2006/INT/total 34.006 -0.24% 0.90%
> > > >>>>>
> > > >>>>> SPEC/SPEC2017/INT/total 3.937 5.34% 7.80%
> > > >>>>> SPEC/SPEC2017/total 3.02% 6.55%
> > > >>>>> SPEC/SPEC2017/FP/total 1.26% 5.60%
> > > >>>>>
> > > >>>>> 2) -O2 size:
> > > >>>>> -0.78% (very cheap) 6.51% (cheap) for spec2k2006
> > > >>>>> -0.32% (very cheap) 6.75% (cheap) for spec2k2017
> > > >>>>> 3) build times:
> > > >>>>> 0%, 0.16%, 0.71%, 0.93% (very cheap) 6.05% 4.80% 6.75% 7.15% (cheap) for spec2k2006
> > > >>>>> 0.39% 0.57% 0.71% (very cheap) 5.40% 6.23% 8.44% (cheap) for spec2k2017
> > > >>>>> here I simply copied data from different configurations
> > > >>>>>
> > > >>>>> So for SPEC I would say that most of compile time costs are derived
> > > >>>>> from code size growth which is a problem with cheap model but not with
> > > >>>>> very cheap. Very cheap indeed results in code size improvements and
> > > >>>>> compile time impact is probably somewhere around 0.5%
> > > >>>>>
> > > >>>>> So from these scores alone this would seem that vectorization makes
> > > >>>>> sense at -O2 with very cheap model to me (I am sure we have other
> > > >>>>> optimizations with worse benefits to compile time tradeoffs).
> > > >>>>
> > > >>>> Thanks for running these.
> > > >>>>
> > > >>>> The biggest issue I know of for enabling very-cheap at -O2 is:
> > > >>>>
> > > >>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089
> > > >>>>
> > > >>>> Perhaps we could get around that by (hopefully temporarily) disabling
> > > >>>> BB SLP within loop vectorisation for the very-cheap model. This would
> > > >>>> purely be a workaround and we should remove it once the PR is fixed.
> > > >>>> (It would even be a compile-time win in the meantime :-))
Fixed by
commit r12-3103-g819b7c3a339e3bdaf85cd55954c5536bd98aae09
Author: liuhongt <hongtao.liu@intel.com>
Date: Wed Aug 4 16:39:31 2021 +0800
Disable slp in loop vectorizer when cost model is very-cheap.
Performance impact for the commit with option:
-march=x86-64 -O2 -ftree-vectorize -fvect-cost-model=very-cheap
SPEC2017 fprate
503.bwaves_r BuildSame
507.cactuBSSN_r -0.04
508.namd_r 0.14
510.parest_r -0.54
511.povray_r 0.10
519.lbm_r BuildSame
521.wrf_r 0.64
526.blender_r -0.32
527.cam4_r 0.17
538.imagick_r 0.09
544.nab_r BuildSame
549.fotonik3d_r BuildSame
554.roms_r BuildSame
997.specrand_fr -0.09
Geometric mean: 0.02
SPEC2017 intrate
500.perlbench_r 0.26
502.gcc_r 0.21
505.mcf_r -0.09
520.omnetpp_r BuildSame
523.xalancbmk_r BuildSame
525.x264_r -0.41
531.deepsjeng_r BuildSame
541.leela_r 0.13
548.exchange2_r BuildSame
557.xz_r BuildSame
999.specrand_ir BuildSame
Geometric mean: 0.02
EEMBC: no regressions, only improvements or identical builds; below are
the improved benchmarks.
mp2decoddata1 7.59
mp2decoddata2 31.80
mp2decoddata3 12.15
mp2decoddata4 11.16
mp2decoddata5 11.19
mp2decoddata1 7.06
mp2decoddata2 24.12
mp2decoddata3 10.83
mp2decoddata4 10.04
mp2decoddata5 10.07
gcc/ChangeLog:
PR tree-optimization/100089
* tree-vectorizer.c (try_vectorize_loop_1): Disable slp in
loop vectorizer when cost model is very-cheap.
--
BR,
Hongtao
end of thread, other threads:[~2021-08-24 2:16 UTC | newest]
Thread overview: 26+ messages
-- links below jump to the message on this page --
2021-05-08 12:07 Why vectorization didn't turn on by -O2 172060045
2021-05-08 16:54 ` Xi Ruoyao
2021-05-09 13:44 ` Segher Boessenkool
2021-05-10 8:21 ` Richard Biener
2021-05-10 9:11 ` Jan Hubicka
2021-05-10 9:27 ` Segher Boessenkool
2021-05-10 12:10 ` Jan Hubicka
2021-05-10 9:24 ` Segher Boessenkool
2021-05-17 16:03 ` Jan Hubicka
2021-05-17 18:56 ` Richard Sandiford
2021-08-04 8:21 ` Hongtao Liu
2021-08-04 8:22 ` Richard Sandiford
2021-08-04 8:31 ` Richard Biener
2021-08-04 9:10 ` Richard Sandiford
2021-08-04 9:56 ` Segher Boessenkool
2021-08-04 10:22 ` Richard Sandiford
2021-08-04 21:18 ` Segher Boessenkool
2021-08-04 9:12 ` Hongtao Liu
2021-08-11 17:14 ` Jan Hubicka
2021-08-14 14:22 ` Jan Hubicka
2021-08-16 8:03 ` Richard Biener
2021-08-16 3:22 ` Kewen.Lin
2021-08-16 6:00 ` Hongtao Liu
2021-08-16 6:09 ` Hongtao Liu
2021-08-24 2:21 ` Hongtao Liu
2021-08-04 8:36 ` Hongtao Liu