* GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
@ 2010-04-29 16:29 Vladimir Makarov
2010-04-29 16:49 ` Jan Hubicka
` (3 more replies)
0 siblings, 4 replies; 68+ messages in thread
From: Vladimir Makarov @ 2010-04-29 16:29 UTC (permalink / raw)
To: gcc.gcc.gnu.org
GCC-4.5.0 and LLVM-2.7 were released recently. To understand
where we stand after the GCC-4.5.0 release, I benchmarked it on SPEC2000
for x86/x86-64 and posted a comparison with the previous GCC
releases and LLVM-2.7.
Even benchmarking SPEC2000 takes a lot of time on the fastest
machine I have, so I don't plan to use SPEC2006 for this in the near
future.
You can find the comparison at
http://vmakarov.fedorapeople.org/spec/ (please click the links at the
bottom of the left frame, starting with the link "GCC release comparison").
If you need exact numbers, please use the tables (links to them
are also given) that were used to generate the corresponding bar
graphs.
In general, GCC-4.5.0 became faster (up to 10%) in -O2 mode. This is
the first considerable compilation speed improvement since GCC-4.2.
GCC-4.5.0 also generates better code than the previous release (1-2% on
average, up to 4% for x86-64 SPECFP2000 in -O2 mode). That does not
include LTO and Graphite, which can give even more (especially LTO)
in many cases.
GCC-4.5.0 has two big new optimizations, LTO and Graphite (more
accurately, Graphite was introduced in the previous release).
Therefore I ran additional benchmarks to test them.
LTO is a promising technology, especially for integer benchmarks, for
which it results in smaller and faster code. But it can also result in
degradations on SPECFP2000, mainly because of big degradations on a
few benchmarks like wupwise or facerec. Another annoying thing about
LTO is that it considerably slows down the compiler.
Currently Graphite gives small improvements on x86 (one exception is
2% for peak x86 SPECFP2000) and mostly degradations on x86_64 (the
largest is more than 10% for SPECFP2000, because of big degradations
on mgrid and swim). So further work is needed on the project because
it does not seem mature yet.
As for LLVM, it became slower (e.g. by 15%-50% in comparison with
llvm-2.5 on x86-64). So the gap between the compilation speeds of GCC and
LLVM decreased and sometimes comes down to 4% on x86_64 and 8% on x86 (both
for SPECInt2000 in -O2 mode). Maybe I am wrong, but I don't think
CLANG will improve this situation significantly (in -O2 and -O3 mode)
because optimizations still take most of the time of any serious
optimizing compiler.
LLVM made progress in code performance, especially for floating
point benchmarks. But the gap between LLVM-2.7 and GCC-4.5 in peak
performance (not including GCC LTO and Graphite) is still 6-7% on
SPECInt2000 and 13-17% on SPECFP2000.
In general, IMHO GCC-4.5.0 is a good and promising release.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-29 16:29 GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64 Vladimir Makarov
@ 2010-04-29 16:49 ` Jan Hubicka
2010-04-29 17:25 ` Vladimir Makarov
2010-04-29 18:26 ` Xinliang David Li
` (2 subsequent siblings)
3 siblings, 1 reply; 68+ messages in thread
From: Jan Hubicka @ 2010-04-29 16:49 UTC (permalink / raw)
To: Vladimir Makarov; +Cc: gcc.gcc.gnu.org
> LTO is a promising technology especially for integer benchmarks for
> which it results in smaller and faster code. But it might result in
> degradations too on SPECFP2000 mainly because of big degradations on a
> few benchmarks like wupwise or facerec. Another annoying thing about
> LTO, it considerably slows down the compiler.
This seems to be sensitive to the setup. In our daily benchmarking LTO
is faster on wupwise (2116 compared to 1600), and facerec is 2003 compared to
2041 (so about the same).
http://gcc.opensuse.org/SPEC/CFP/sb-frescobaldi.suse.de-ai-64/list.html
http://gcc.opensuse.org/SPEC/CFP/sb-frescobaldi.suse.de-ipa-64/list.html
Did you test with -fwhole-program?
Honza
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-29 16:49 ` Jan Hubicka
@ 2010-04-29 17:25 ` Vladimir Makarov
2010-04-29 18:17 ` Vladimir Makarov
0 siblings, 1 reply; 68+ messages in thread
From: Vladimir Makarov @ 2010-04-29 17:25 UTC (permalink / raw)
To: Jan Hubicka; +Cc: gcc.gcc.gnu.org
Jan Hubicka wrote:
> This seems to be sensitive to the setup. In our daily benchmarking LTO
> is faster on wupwise (2116 compared to 1600), and facerec is 2003 compared to
> 2041 (so about the same).
>
> http://gcc.opensuse.org/SPEC/CFP/sb-frescobaldi.suse.de-ai-64/list.html
> http://gcc.opensuse.org/SPEC/CFP/sb-frescobaldi.suse.de-ipa-64/list.html
>
> Did you test with -fwhole-program?
>
Yes, I used -flto -fwhole-program. All this info is on the page. The
test machines are also not experimental ones (both are Dell machines).
I used the released sources, so maybe the reason for the difference is
the different sources. In any case, I'll check the current trunk on these
machines.
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-29 17:25 ` Vladimir Makarov
@ 2010-04-29 18:17 ` Vladimir Makarov
0 siblings, 0 replies; 68+ messages in thread
From: Vladimir Makarov @ 2010-04-29 18:17 UTC (permalink / raw)
To: Jan Hubicka; +Cc: gcc.gcc.gnu.org
Vladimir Makarov wrote:
> Jan Hubicka wrote:
>>
>> This seems to be sensitive to the setup. In our daily benchmarking LTO
>> is faster on wupwise (2116 compared to 1600), and facerec is 2003
>> compared to 2041 (so about the same).
>>
>> http://gcc.opensuse.org/SPEC/CFP/sb-frescobaldi.suse.de-ai-64/list.html
>> http://gcc.opensuse.org/SPEC/CFP/sb-frescobaldi.suse.de-ipa-64/list.html
>>
>> Did you test with -fwhole-program?
>>
> Yes, I used -flto -fwhole-program. All this info is on the page. The
> test machine are also not experimental ones (the both are Dell machines).
>
> I used the released sources may be a reason for the difference is in
> different sources. In any case, I'll check the current trunk on these
> machines.
>
>
The following is what I got on today's trunk for x86_64 (2.93 GHz Core i7):
wupwise
  -O3                                     2670
  -O3 -flto -fwhole-program               2211
  -O3 -ffast-math                         2753
  -O3 -flto -fwhole-program -ffast-math   4325
So nothing is wrong with my test machine. We simply measure different
things: you use -ffast-math and I don't.
For the comparison I used a simple combination of options for GCC and
LLVM. To me it is obvious that GCC results can be improved more than
LLVM's by finding the right options, because GCC has many more optimizations.
Still, it would be nice to fix the LTO SPEC2000 degradations when -ffast-math
is not used.
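To put the four wupwise numbers above in relative terms, here is a minimal Python sketch. It assumes the figures are SPEC ratios (higher is better), which the message does not state explicitly; the helper name `relative_change` is made up for illustration.

```python
# wupwise results quoted above (trunk, x86_64). Assumption: these are SPEC
# ratios, so a higher number means a faster run.
scores = {
    "-O3": 2670,
    "-O3 -flto -fwhole-program": 2211,
    "-O3 -ffast-math": 2753,
    "-O3 -flto -fwhole-program -ffast-math": 4325,
}


def relative_change(base, new):
    """Return the change from base to new, in percent."""
    return (new - base) / base * 100.0


# Effect of adding LTO, first without and then with -ffast-math.
lto_plain = relative_change(
    scores["-O3"], scores["-O3 -flto -fwhole-program"]
)
lto_fast = relative_change(
    scores["-O3 -ffast-math"], scores["-O3 -flto -fwhole-program -ffast-math"]
)

print(f"LTO without -ffast-math: {lto_plain:+.2f}%")
print(f"LTO with -ffast-math:    {lto_fast:+.2f}%")
```

Under that assumption, LTO costs wupwise roughly 17% without -ffast-math and gains roughly 57% with it, which is why the two setups appear to disagree.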
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-29 16:29 GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64 Vladimir Makarov
2010-04-29 16:49 ` Jan Hubicka
@ 2010-04-29 18:26 ` Xinliang David Li
2010-04-29 18:57 ` Vladimir Makarov
2010-04-29 22:42 ` Jack Howarth
2010-11-13 23:15 ` Xinliang David Li
3 siblings, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-04-29 18:26 UTC (permalink / raw)
To: Vladimir Makarov; +Cc: gcc.gcc.gnu.org
On Thu, Apr 29, 2010 at 9:25 AM, Vladimir Makarov <vmakarov@redhat.com> wrote:
> LTO is a promising technology especially for integer benchmarks for
> which it results in smaller and faster code. But it might result in
> degradations too on SPECFP2000 mainly because of big degradations on a
> few benchmarks like wupwise or facerec. Another annoying thing about
> LTO, it considerably slows down the compiler.
The LTO improvement on SPECint2000 is only 1.86%:
                4.5   4.5+lto   Improvement
164.gzip        955     950       -0.52%   <-- degrade
175.vpr         588     594        1.02%
176.gcc        1211    1216        0.41%
181.mcf         699     698       -0.14%
186.crafty     1011     987       -2.37%   <-- degrade
197.parser      792     813        2.65%
252.eon        1026    1023       -0.29%   <-- degrade
253.perlbmk    1312    1294       -1.37%   <-- degrade
254.gap        1021    1037        1.57%
255.vortex     1123    1319       17.45%
256.bzip2       737     768        4.21%
300.twolf       773     779        0.78%
-----------------------------------------------------
SPECint2000     913     930        1.86%
This matches our previous observation that to get the best out of
LTO, FDO is also needed. (For reference, LIPO improves over plain FDO
by ~4.5%, and vortex improves by 23%.) You will probably see an even smaller
improvement in SPEC2006.
It would be great if numbers were collected comparing LTO + FDO vs
plain FDO in the same setup.
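The per-benchmark percentages and the summary row of the SPECint2000 table above can be re-derived with a short Python sketch. It assumes the summary row is the geometric mean of the twelve scores, which is how SPEC aggregates its ratios; the helper names are hypothetical.

```python
import math

# (benchmark, GCC 4.5 score, GCC 4.5 + LTO score), copied from the table above.
results = [
    ("164.gzip", 955, 950),
    ("175.vpr", 588, 594),
    ("176.gcc", 1211, 1216),
    ("181.mcf", 699, 698),
    ("186.crafty", 1011, 987),
    ("197.parser", 792, 813),
    ("252.eon", 1026, 1023),
    ("253.perlbmk", 1312, 1294),
    ("254.gap", 1021, 1037),
    ("255.vortex", 1123, 1319),
    ("256.bzip2", 737, 768),
    ("300.twolf", 773, 779),
]


def improvement(base, lto):
    """Percentage change of the LTO score relative to the base score."""
    return (lto - base) / base * 100.0


def geomean(values):
    """Geometric mean, computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


for name, base, lto in results:
    print(f"{name:12s} {base:5d} {lto:5d} {improvement(base, lto):+7.2f}%")

base_mean = geomean([base for _, base, _ in results])
lto_mean = geomean([lto for _, _, lto in results])
print(f"{'geomean':12s} {base_mean:5.0f} {lto_mean:5.0f}")
```

The geometric means come out close to the 913 and 930 in the summary row, and most of the gain is carried by vortex alone.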
Thanks,
David
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-29 18:26 ` Xinliang David Li
@ 2010-04-29 18:57 ` Vladimir Makarov
2010-04-29 19:42 ` Xinliang David Li
2010-04-29 21:33 ` Jan Hubicka
0 siblings, 2 replies; 68+ messages in thread
From: Vladimir Makarov @ 2010-04-29 18:57 UTC (permalink / raw)
To: Xinliang David Li; +Cc: gcc.gcc.gnu.org
Xinliang David Li wrote:
> The LTO improvement on SPECint2000 is only 1.86%:
>
> 4.5 4.5+lto Improvement
> 164.gzip 955 950 -0.52% <-- degrade
> 175.vpr 588 594 1.02%
> 176.gcc 1211 1216 0.41%
> 181.mcf 699 698 -0.14%
> 186.crafty 1011 987 -2.37% <--- degrade
> 197.parser 792 813 2.65%
> 252.eon 1026 1023 -0.29% <-- degrade
> 253.perlbmk 1312 1294 -1.37% <-- degrade
> 254.gap 1021 1037 1.57%
> 255.vortex 1123 1319 17.45%
> 256.bzip2 737 768 4.21%
> 300.twolf 773 779 0.78%
> -----------------------------------------------------
> SPECint2000 913 930 1.86%
>
>
> This matches our previous observation that to bring the best out of
> LTO, FDO is also needed. (As a reference, LIPO improves over plain FDO
> by ~4.5%, vortex improves 23%). You will probably see even smaller
> improvement in SPEC2006.
>
>
Thanks for the comments. FDO will probably improve the SPEC2000 score,
although that is not obvious for some tests, because their train data sets
differ from the reference data sets and this might
actually mislead the compiler.
FDO is important for optimizations where all possible data sets do not
change the branch probability distribution much. IMHO that is why FDO is not
widely used by most developers (although I am sure that for Google
applications it is extremely important), and therefore I don't measure it
and it is not so interesting to me. A bigger reason not to use FDO, though,
is that it is inconvenient for a regular compiler user.
As for the vortex FDO improvement, vortex contains a moderate-size loop in
which most of the time is spent. The loop has an if-then-else at the top loop
level. On all SPEC2000 data sets, one branch of the if is taken practically
always (like 1 to 1,000,000). So it is not surprising to me that FDO
gives such an improvement for vortex.
> It would be great if there is number collected comparing LTO + FDO vs
> plain FDO in the same setup.
>
>
Usually after posting such comparisons, I get a lot of
requests. I'd like to do all of them, but unfortunately running the
benchmarks and preparing the results takes a lot of my time. Maybe I'll do
such a comparison next year.
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-29 18:57 ` Vladimir Makarov
@ 2010-04-29 19:42 ` Xinliang David Li
2010-04-29 20:19 ` Vladimir Makarov
2010-04-29 21:33 ` Jan Hubicka
1 sibling, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-04-29 19:42 UTC (permalink / raw)
To: Vladimir Makarov; +Cc: gcc.gcc.gnu.org
>>
>
> Thanks for the comments. FDO will probably improve SPEC2000 score.
> Although it is not obvious for some tests because the train data sets for
> them are different from the reference data sets and it might actually
> mislead the compiler.
>
> FDO is important for optimizations where all possible data sets do not
> change branch probability distribution much. IMHO therefore FDO is not
> widely used by most of developers (although I am sure that for Google
> applications it is extremely important) and therefore I don't measure it and
> it is not so interesting for me. Although bigger reason not use FDO is
> inconvenience to use it for regular compiler user.
>
> As for vortex FDO improvement, vortex contains a moderate size loop in which
> most of time is spent. The loop has if-then-else on the top loop level. On
> all SPEC2000 data sets, one if-branch is taken practically always (like 1
> to 1,000,000). So it is not amazing for me that FDO gives such
> improvement for vortex.
Actually, what I was trying to say is that LTO will be more powerful
when combined with FDO. In other words, I expect LTO + FDO to improve
over plain FDO by more than 1.86%.
>>
>> It would be great if there is number collected comparing LTO + FDO vs
>> plain FDO in the same setup.
>>
>>
>
> Usually after such posting the comparisons, I am getting a lot of requests.
> I'd like to do all of them but unfortunately running and the result
> preparation takes a lot of my time. May be I'll do such comparison next
> year.
Ok. Another comment is that SPEC2000, when used for performance testing,
is not indicative of today's real-world program sizes. Even
SPEC2006's largest C++ programs are not that big.
Thanks,
David
>
>
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-29 19:42 ` Xinliang David Li
@ 2010-04-29 20:19 ` Vladimir Makarov
2010-04-29 20:40 ` Xinliang David Li
0 siblings, 1 reply; 68+ messages in thread
From: Vladimir Makarov @ 2010-04-29 20:19 UTC (permalink / raw)
To: Xinliang David Li; +Cc: gcc.gcc.gnu.org
Xinliang David Li wrote:
> Ok. Another comment is that using SPEC2000 for performance testing
> won't be indicative of today's real world program size. Even
> SPEC2006's largest C++ programs are not that big.
>
>
>
What counts as today's real-world program size is very subjective. Usually
it reflects what you are working on. I understand that Google
applications are huge and their speed is important for saving money (or
energy) for their employees. Firefox is big enough, but for a regular
desktop user a 1% improvement may be invisible or unimportant if it is
already fast enough.
A math-physics program can be small, but its speed may be really
important because it runs for hours or days on a fast machine. Even big and
intensively used applications like a logistics system can have small
program parts (e.g. an ILP solver, or compression algorithms like gzip for
speeding up Internet communication) whose optimization is the most
important for the application, and SPEC contains such
calculation-intensive code (a lot of solvers for NP-complete tasks and
math-physics programs). So I would not say that using SPEC for performance
testing is unimportant for improving today's real-world-size
programs. Of course it is not as important as testing the program you
are working on. In other words, that program is the most important
benchmark for you but probably not for others.
As for me, GCC itself is a very important program and SPEC contains it
(an old version in 2000 and a more recent one in 2006). So SPEC is pretty
important and good for me (not perfect, of course, at least because it is
not free), although it is not the only one I care about.
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-29 20:19 ` Vladimir Makarov
@ 2010-04-29 20:40 ` Xinliang David Li
0 siblings, 0 replies; 68+ messages in thread
From: Xinliang David Li @ 2010-04-29 20:40 UTC (permalink / raw)
To: Vladimir Makarov; +Cc: gcc.gcc.gnu.org
Point well put. The benchmark suite should have a good mixture of
programs of different sizes. SPEC2k programs cluster at the lower
end of the spectrum though.
David
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-29 18:57 ` Vladimir Makarov
2010-04-29 19:42 ` Xinliang David Li
@ 2010-04-29 21:33 ` Jan Hubicka
2010-04-29 21:34 ` Jan Hubicka
` (3 more replies)
1 sibling, 4 replies; 68+ messages in thread
From: Jan Hubicka @ 2010-04-29 21:33 UTC (permalink / raw)
To: Vladimir Makarov; +Cc: Xinliang David Li, gcc.gcc.gnu.org
> Thanks for the comments. FDO will probably improve SPEC2000 score.
> Although it is not obvious for some tests because the train data sets
> for them are different from the reference data sets and it might
> actually mislead the compiler.
There are several studies on the topic and it is not that bad in practice.
In the vast majority of cases even pretty bad training runs get a significant
portion of the improvement you can get from training on the final benchmark
data. In the SPEC case FDO improves pretty much all benchmarks.
I think FDO is relatively little used because it is relatively hard
to use (i.e. the user has to modify makefiles and learn how the feature works)
and also because there is very little support for it (e.g. in automake and such).
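To illustrate why FDO means touching the build, here is a minimal Python sketch that only assembles (and does not run) the command lines of a two-pass build. `-fprofile-generate` and `-fprofile-use` are GCC's options for the two passes; the helper `fdo_build_steps`, the file names, and the training input are hypothetical.

```python
def fdo_build_steps(sources, output, train_command, opt_level="-O2"):
    """Return the commands of a two-pass FDO build as argument lists.

    Pass 1 builds an instrumented binary, the training run writes the
    profile data, and pass 2 rebuilds using that profile.
    """
    instrumented = ["gcc", opt_level, "-fprofile-generate", *sources, "-o", output]
    optimized = ["gcc", opt_level, "-fprofile-use", *sources, "-o", output]
    return [instrumented, list(train_command), optimized]


# Hypothetical single-file program with one training input.
steps = fdo_build_steps(["app.c"], "app", ["./app", "train-input.txt"])
for step in steps:
    print(" ".join(step))
```

Every target that should benefit needs this compile, train, recompile sequence, which is the kind of makefile change described above.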
> As for vortex FDO improvement, vortex contains a moderate size loop in
> which most of time is spent. The loop has if-then-else on the top loop
> level. On all SPEC2000 data sets, one if-branch is taken practically
> always (like 1 to 1,000,000). So it is not amazing for me that FDO
> gives such improvement for vortex.
It would be interesting to know whether the same improvement happens with LTO
and, if not, what LIPO does. I will unbreak vortex on our tester.
Honza
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-29 21:33 ` Jan Hubicka
@ 2010-04-29 21:34 ` Jan Hubicka
2010-04-29 21:36 ` Xinliang David Li
` (2 subsequent siblings)
3 siblings, 0 replies; 68+ messages in thread
From: Jan Hubicka @ 2010-04-29 21:34 UTC (permalink / raw)
To: Jan Hubicka; +Cc: Vladimir Makarov, Xinliang David Li, gcc.gcc.gnu.org
BTW we are also tracking SPEC2k6 with and without LTO (these are not FDO runs):
http://gcc.opensuse.org/SPEC/CINT/sb-barbella.suse.de-ai-64/recent.html
http://gcc.opensuse.org/SPEC/CINT/sb-barbella.suse.de-head-64-2006/recent.html
Not all 2k6 tests pass with LTO, so it will take a bit of care to compare the results.
Honza
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-29 21:33 ` Jan Hubicka
2010-04-29 21:34 ` Jan Hubicka
@ 2010-04-29 21:36 ` Xinliang David Li
2010-05-01 9:36 ` Jan Hubicka
2010-04-29 21:38 ` Xinliang David Li
2010-04-29 21:45 ` Steven Bosscher
3 siblings, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-04-29 21:36 UTC (permalink / raw)
To: Jan Hubicka; +Cc: Vladimir Makarov, gcc.gcc.gnu.org
On Thu, Apr 29, 2010 at 2:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Thanks for the comments. FDO will probably improve SPEC2000 score.
>> Although it is not obvious for some tests because the train data sets
>> for them are different from the reference data sets and it might
>> actually mislead the compiler.
>
> There are several studies on the topic and it is not that bad in practice.
> In the vast majority of cases even pretty bad training runs get a significant
> portion of the improvement you can get from training on the final benchmark
> data. In the SPEC case FDO improves pretty much all benchmarks.
Agree.
>
> I think the FDO is relatively little used because it is relatively hard
> to use (i.e. user has to modify makefiles and learn how the feature works)
> and also because there is very little support for it (i.e. in automake and such)
>> As for vortex FDO improvement, vortex contains a moderate size loop in
>> which most of time is spent. The loop has if-then-else on the top loop
>> level. On all SPEC2000 data sets, one if-branch is taken practically
>> always (like 1 to 1,000,000). So it is not amazing for me that FDO
>> gives such improvement for vortex.
>
> It would be interesting to know if same improvement happens with LTO and if
> not what LIPO does. I will unbreak vortex on our tester.
>
Vortex needs -fno-strict-aliasing. It casts between two record types
with one record being a 'prefix' of another.
David
> Honza
>
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-29 21:33 ` Jan Hubicka
2010-04-29 21:34 ` Jan Hubicka
2010-04-29 21:36 ` Xinliang David Li
@ 2010-04-29 21:38 ` Xinliang David Li
2010-04-29 21:46 ` Jan Hubicka
2010-04-29 21:45 ` Steven Bosscher
3 siblings, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-04-29 21:38 UTC (permalink / raw)
To: Jan Hubicka; +Cc: Vladimir Makarov, gcc.gcc.gnu.org
I noticed eon's peak options do not include FDO, is that intended?
David
On Thu, Apr 29, 2010 at 2:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Thanks for the comments. FDO will probably improve SPEC2000 score.
>> Although it is not obvious for some tests because the train data sets
>> for them are different from the reference data sets and it might
>> actually mislead the compiler.
>
> There are several studies on the topic and it is not that bad in practice.
> In the vast majority of cases even pretty bad training runs get a significant
> portion of the improvement you can get from training on the final benchmark
> data. In SPEC case FDO improves pretty much all benchmarks.
>
> I think the FDO is relatively little used because it is relatively hard
> to use (i.e. user has to modify makefiles and learn how the feature works)
> and also because there is very little support for it (i.e. in automake and such)
>> As for vortex FDO improvement, vortex contains a moderate size loop in
>> which most of time is spent. The loop has if-then-else on the top loop
>> level. On all SPEC2000 data sets, one if-branch is taken practically
>> always (like 1 to 1,000,000). So it is not amazing for me that FDO
>> gives such improvement for vortex.
>
> It would be interesting to know if same improvement happens with LTO and if
> not what LIPO does. I will unbreak vortex on our tester.
>
> Honza
>
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-29 21:33 ` Jan Hubicka
` (2 preceding siblings ...)
2010-04-29 21:38 ` Xinliang David Li
@ 2010-04-29 21:45 ` Steven Bosscher
2010-04-29 22:35 ` Xinliang David Li
3 siblings, 1 reply; 68+ messages in thread
From: Steven Bosscher @ 2010-04-29 21:45 UTC (permalink / raw)
To: Jan Hubicka; +Cc: Vladimir Makarov, Xinliang David Li, gcc.gcc.gnu.org
On Thu, Apr 29, 2010 at 11:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> It would be interesting to know if same improvement happens with LTO and if
> not what LIPO does. I will unbreak vortex on our tester.
Perhaps you can add a LIPO tester? It looks like a very interesting
and promising approach.
Ciao!
Steven
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-29 21:38 ` Xinliang David Li
@ 2010-04-29 21:46 ` Jan Hubicka
0 siblings, 0 replies; 68+ messages in thread
From: Jan Hubicka @ 2010-04-29 21:46 UTC (permalink / raw)
To: Xinliang David Li; +Cc: Jan Hubicka, Vladimir Makarov, gcc.gcc.gnu.org
> I noticed eon's peak options do not include FDO, is that intended?
I think it is just a bug in the page header, but I will double check.
Base and peak should match otherwise.
Honza
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-29 21:45 ` Steven Bosscher
@ 2010-04-29 22:35 ` Xinliang David Li
2010-04-29 22:50 ` Jan Hubicka
0 siblings, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-04-29 22:35 UTC (permalink / raw)
To: Steven Bosscher
Cc: Jan Hubicka, Vladimir Makarov, gcc.gcc.gnu.org, Raksit Ashok
Thanks for the suggestion. Raksit currently is busy with merging trunk
changes back to lw-ipo branch which can be a daunting task. After that
this can be done. (Our internal release is based on 4.4).
David
On Thu, Apr 29, 2010 at 2:38 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> On Thu, Apr 29, 2010 at 11:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> It would be interesting to know if same improvement happens with LTO and if
>> not what LIPO does. I will unbreak vortex on our tester.
>
> Perhaps you can add a LIPO tester? It looks like a very interesting
> and promising approach.
>
> Ciao!
> Steven
>
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-29 16:29 GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64 Vladimir Makarov
2010-04-29 16:49 ` Jan Hubicka
2010-04-29 18:26 ` Xinliang David Li
@ 2010-04-29 22:42 ` Jack Howarth
2010-11-13 23:15 ` Xinliang David Li
3 siblings, 0 replies; 68+ messages in thread
From: Jack Howarth @ 2010-04-29 22:42 UTC (permalink / raw)
To: Vladimir Makarov; +Cc: gcc.gcc.gnu.org
On Thu, Apr 29, 2010 at 12:25:15PM -0400, Vladimir Makarov wrote:
....
>
> Currently Graphite gives small improvements on x86 (one exception is
> 2% for peak x86 SPECFP2000) and mostly degradation on x86_64 (with
> maximum one more than 10% for SPECFP2000 because of big degradations
> on mgrid and swim). So further work is needed on the project because
> it seems not mature yet.
>
Vladimir,
Keep in mind that -fgraphite-identity currently still causes
vectorization opportunities to be missed. Once that is fixed,
the higher-level Graphite optimizations may look a lot better.
Jack
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-29 22:35 ` Xinliang David Li
@ 2010-04-29 22:50 ` Jan Hubicka
2010-04-29 22:51 ` Steven Bosscher
0 siblings, 1 reply; 68+ messages in thread
From: Jan Hubicka @ 2010-04-29 22:50 UTC (permalink / raw)
To: Xinliang David Li
Cc: Steven Bosscher, Jan Hubicka, Vladimir Makarov, gcc.gcc.gnu.org,
Raksit Ashok
> Thanks for the suggestion. Raksit currently is busy with merging trunk
> changes back to lw-ipo branch which can be a daunting task. After that
> this can be done. (Our internal release is based on 4.4).
I must say that LIPO is something I always intend to look into but didn't
seriously find time for that yet (well, hoping that submitting the thesis will
make this easier).
What are the LIPO's features that are missing in -flto -fprofile-use?
Honza
>
> David
>
> On Thu, Apr 29, 2010 at 2:38 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> > On Thu, Apr 29, 2010 at 11:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >> It would be interesting to know if same improvement happens with LTO and if
> >> not what LIPO does. I will unbreak vortex on our tester.
> >
> > Perhaps you can add a LIPO tester? It looks like a very interesting
> > and promising approach.
> >
> > Ciao!
> > Steven
> >
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-29 22:50 ` Jan Hubicka
@ 2010-04-29 22:51 ` Steven Bosscher
2010-04-29 23:06 ` Jan Hubicka
0 siblings, 1 reply; 68+ messages in thread
From: Steven Bosscher @ 2010-04-29 22:51 UTC (permalink / raw)
To: Jan Hubicka
Cc: Xinliang David Li, Vladimir Makarov, gcc.gcc.gnu.org, Raksit Ashok
2010/4/30 Jan Hubicka <hubicka@ucw.cz>:
>> Thanks for the suggestion. Raksit currently is busy with merging trunk
>> changes back to lw-ipo branch which can be a daunting task. After that
>> this can be done. (Our internal release is based on 4.4).
>
> I must say that LIPO is something I always intend to look into but didn't
> seriously find time for that yet (well, hoping that submitting the thesis will
> make this easier).
> What are the LIPO's features that are missing in -flto -fprofile-use?
LIPO is a completely different approach, basically independent of LTO.
There is a good explanation of it on the wiki, see
http://gcc.gnu.org/wiki/LightweightIpo.
Ciao!
Steven
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-29 22:51 ` Steven Bosscher
@ 2010-04-29 23:06 ` Jan Hubicka
2010-04-29 23:47 ` Steven Bosscher
2010-04-30 0:57 ` Xinliang David Li
0 siblings, 2 replies; 68+ messages in thread
From: Jan Hubicka @ 2010-04-29 23:06 UTC (permalink / raw)
To: Steven Bosscher
Cc: Jan Hubicka, Xinliang David Li, Vladimir Makarov,
gcc.gcc.gnu.org, Raksit Ashok
> 2010/4/30 Jan Hubicka <hubicka@ucw.cz>:
> >> Thanks for the suggestion. Raksit currently is busy with merging trunk
> >> changes back to lw-ipo branch which can be a daunting task. After that
> >> this can be done. (Our internal release is based on 4.4).
> >
> > I must say that LIPO is something I always intend to look into but didn't
> > seriously find time for that yet (well, hoping that submitting the thesis will
> > make this easier).
> > What are the LIPO's features that are missing in -flto -fprofile-use?
>
> LIPO is a completely different approach, basically independent of LTO.
> There is a good explanation of it on the wiki, see
> http://gcc.gnu.org/wiki/LightweightIpo.
Yep, I read that page (and saw some of implementation too). Just was not able
to follow the precise feature set of LIPO (i.e. if it gets better SPEC results
than LTO+FDO then why)
Honza
>
> Ciao!
> Steven
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-29 23:06 ` Jan Hubicka
@ 2010-04-29 23:47 ` Steven Bosscher
2010-09-28 0:28 ` Neil Vachharajani
2010-04-30 0:57 ` Xinliang David Li
1 sibling, 1 reply; 68+ messages in thread
From: Steven Bosscher @ 2010-04-29 23:47 UTC (permalink / raw)
To: Jan Hubicka
Cc: Xinliang David Li, Vladimir Makarov, gcc.gcc.gnu.org, Raksit Ashok
2010/4/30 Jan Hubicka <hubicka@ucw.cz>:
> Yep, I read that page (and saw some of implementation too). Just was not able
> to follow the precise feature set of LIPO (i.e. if it gets better SPEC results
> than LTO+FDO then why)
OK, that's an interesting question. The first question (if...) is
something you'll have to try yourself, I suppose :-)
BTW will the CGO presentation about LIPO and sampled FDO be published
somewhere in the open?
Ciao!
Steven
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-29 23:06 ` Jan Hubicka
2010-04-29 23:47 ` Steven Bosscher
@ 2010-04-30 0:57 ` Xinliang David Li
2010-04-30 8:42 ` Jan Hubicka
1 sibling, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-04-30 0:57 UTC (permalink / raw)
To: Jan Hubicka
Cc: Steven Bosscher, Vladimir Makarov, gcc.gcc.gnu.org, Raksit Ashok
On Thu, Apr 29, 2010 at 4:03 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> 2010/4/30 Jan Hubicka <hubicka@ucw.cz>:
>> >> Thanks for the suggestion. Raksit currently is busy with merging trunk
>> >> changes back to lw-ipo branch which can be a daunting task. After that
>> >> this can be done. (Our internal release is based on 4.4).
>> >
>> > I must say that LIPO is something I always intend to look into but didn't
>> > seriously find time for that yet (well, hoping that submitting the thesis will
>> > make this easier).
>> > What are the LIPO's features that are missing in -flto -fprofile-use?
>>
>> LIPO is a completely different approach, basically independent of LTO.
>> There is a good explanation of it on the wiki, see
>> http://gcc.gnu.org/wiki/LightweightIpo.
>
> Yep, I read that page (and saw some of implementation too). Just was not able
> to follow the precise feature set of LIPO (i.e. if it gets better SPEC results
> than LTO+FDO then why)
>
In theory, LIPO should not generate better results than LTO+FDO. What
makes LIPO attractive is that it allows distributed build from the
beginning. Its integration with large distributed build system is also
easy. Another point is that LIPO can be decoupled from FDO as well.
The reason is that cross module call clusters do not change that much
and can be determined statically or determined once using sample
profiling information. The grouping info can then be used for regular
O2 builds. This will remove the need for people to move functions into
header files which tend to penalize compile time unnecessarily.
If there is a performance difference, the following things unique to
LIPO may contribute to it (I have not validated them):
1) LIPO supports tracking indirect call targets across modules. This
is not feasible for plain FDO as there will be cgraph pid conflicts.
LIPO uses unique function id == (module_id << 32) + func_def_no, which
makes it possible.
2) comdat function resolution -- since LIPO uses aux module functions
for inlining purpose only, it has the freedom to choose which copy to
use. The current scheme chooses copy in current module with priority
for better profile data context sensitivity (see below)
3) in profile-gen phase, allow more inlining for comdat functions (in
einline2 and ipa-inline) -- this will cause profile data to be tracked
with module sensitivity (note that counters are not in comdat group)
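As an illustration of the id scheme in point 1, here is a minimal Python sketch. It is not LIPO code: the helper names are made up for this example, and the only thing taken from the message is the formula (module_id << 32) + func_def_no. It shows why two functions that share the same per-module number no longer collide once the module id is folded in.

```python
# Hypothetical helpers illustrating the id formula quoted in point 1.
MODULE_ID_SHIFT = 32
FUNC_NO_MASK = (1 << MODULE_ID_SHIFT) - 1


def make_global_func_id(module_id, func_def_no):
    """Pack a module id and a per-module function number into one id."""
    if module_id < 0:
        raise ValueError("module_id must be non-negative")
    if not 0 <= func_def_no <= FUNC_NO_MASK:
        raise ValueError("func_def_no must fit in 32 bits")
    return (module_id << MODULE_ID_SHIFT) + func_def_no


def split_global_func_id(global_id):
    """Recover (module_id, func_def_no) from a packed id."""
    return global_id >> MODULE_ID_SHIFT, global_id & FUNC_NO_MASK


# Two modules each have a function numbered 7 locally.
id_in_module_1 = make_global_func_id(1, 7)
id_in_module_2 = make_global_func_id(2, 7)
print(id_in_module_1, id_in_module_2)
print(split_global_func_id(id_in_module_2))
```

With a per-module number alone both functions would be recorded as 7, which is the kind of conflict the message describes for plain FDO.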
Thanks,
David
> Honza
>>
>> Ciao!
>> Steven
>
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-30 0:57 ` Xinliang David Li
@ 2010-04-30 8:42 ` Jan Hubicka
2010-04-30 18:13 ` Xinliang David Li
0 siblings, 1 reply; 68+ messages in thread
From: Jan Hubicka @ 2010-04-30 8:42 UTC (permalink / raw)
To: Xinliang David Li
Cc: Jan Hubicka, Steven Bosscher, Vladimir Makarov, gcc.gcc.gnu.org,
Raksit Ashok
> In theory, LIPO should not generate better results than LTO+FDO. What
> makes LIPO attractive is that it allows distributed build from the
> beginning. Its integration with large distributed build system is also
> easy. Another point is that LIPO can be decoupled from FDO as well.
The integration should be pretty much the same as with current FDO, right?
Just arrange to get everything built twice and trained in between.
> The reason is that cross module call clusters do not change that much
> and can be determined statically or determined once using sample
> profiling information. The grouping info can then be used for regular
> O2 builds. This will remove the need for people to move functions into
This means, build once to gather callgraph and instead of deciding grouping
at runtime with profile info in it, just do it via some tool statically?
> header files which tend to penalize compile time unnecessarily.
>
> If there is a performance difference, the following things unique to
> LIPO may contribute to it (I have not validated them):
>
> 1) LIPO supports tracking indirect call targets across modules. This
> is not feasible for plain FDO as there will be cgraph pid conflicts.
> LIPO uses unique function id == (module_id << 32) + func_def_no, which
> makes it possible.
Interesting. My plan for profiling with LTO is to ultimately make it a link-time
transform. This will be more difficult with WHOPR (i.e. instrumenting needs
function bodies that are not available at WPA time), but I believe it is
solvable: just assign uids to the edges and do instrumentation at ltrans. Then
we will save the cgraph profile in some easier way so WHOPR can read it in and read
the rest of the stuff in ltrans. This would involve shipping the correct profiles for
a given function etc., so it will be a bit of an implementation challenge.
> 2) comdat function resolution -- since LIPO uses aux module functions
> for inlining purpose only, it has the freedom to choose which copy to
> use. The current scheme chooses copy in current module with priority
> for better profile data context sensitivity (see below)
This is interesting. How do you solve the problem when a given comdat function
"loses"? I.e. it is replaced at link time by another function that may or may
not be profiled from another unit?
I am aware that current FDO gets this wrong (it assumes that comdat functions
are never replaced from another unit). I guess the situation can be improved a bit
by doing some localization even without -fwhole-program, or by teaching the runtime
to merge profiles into each individual copy of the comdat...
Honza
> 3) in profile-gen phase, allow more inlining for comdat functions (in
> einline2 and ipa-inline) -- this will cause profile data to be tracked
> with module sensitivity (note that counters are not in comdat group)
>
> Thanks,
>
> David
>
>
>
> > Honza
> >>
> >> Ciao!
> >> Steven
> >
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-30 8:42 ` Jan Hubicka
@ 2010-04-30 18:13 ` Xinliang David Li
2010-04-30 18:32 ` Jan Hubicka
0 siblings, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-04-30 18:13 UTC (permalink / raw)
To: Jan Hubicka
Cc: Steven Bosscher, Vladimir Makarov, gcc.gcc.gnu.org, Raksit Ashok
On Fri, Apr 30, 2010 at 1:37 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> In theory, LIPO should not generate better results than LTO+FDO. What
>> makes LIPO attractive is that it allows distributed build from the
>> beginning. Its integration with large distributed build system is also
>> easy. Another point is that LIPO can be decoupled from FDO as well.
>
> The integration should be pretty much the same as with current FDO, right?
> Just arrange to get everything built twice and trained in between.
Right. LIPO behaves similarly to plain FDO (no IPO).
>
>> The reason is that cross module call clusters do not change that much
>> and can be determined statically or determined once using sample
>> profiling information. The grouping info can then be used for regular
>> O2 builds. This will remove the need for people to move functions into
>
> This means, build once to gather callgraph and instead of deciding grouping
> at runtime with profile info in it, just do it via some tool statically?
>
One could run the LIPO-instrumented binary once, get the grouping, and
reuse it. It is also possible to determine this grouping
using a regular binary with hardware profiling (we have an internal tool
to do that). Or if the user knows the module affinity (but does not
want to rearrange source structures for SW engineering reasons), he
can choose to specify the grouping statically (not supported
yet). For instance, we can invent some directives for that:
#include_aux_module "../a/b/c.cpp"
The real scenario can be more complicated than this in order to
support different options, include search paths, etc.
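To make the idea of reusable grouping information concrete, here is a small Python sketch. It is purely illustrative and not the LIPO algorithm: the function name, the edge-count input format, and the fixed threshold are all assumptions made for this example. Given cross-module call counts from one profiling run, it records for each module the other modules it calls often enough to be worth having available for inlining.

```python
from collections import defaultdict


def group_aux_modules(call_edges, min_count):
    """Map each module to the modules it calls at least min_count times.

    call_edges: dict {(caller_module, callee_module): call_count}
    min_count:  hypothetical hotness threshold, chosen by the user
    """
    groups = defaultdict(set)
    for (caller, callee), count in call_edges.items():
        # Calls within one module need no grouping; cold edges are ignored.
        if caller != callee and count >= min_count:
            groups[caller].add(callee)
    return {module: sorted(aux) for module, aux in groups.items()}


# Made-up counts from a single training run.
edges = {
    ("a.cpp", "c.cpp"): 50000,
    ("a.cpp", "b.cpp"): 3,
    ("b.cpp", "c.cpp"): 1200,
    ("c.cpp", "c.cpp"): 9000,
}
groups = group_aux_modules(edges, min_count=1000)
print(groups)
```

If the hot cross-module edges really are stable between builds, a table like this could be computed once and then consulted by later non-FDO builds, which is the decoupling described above.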
>> header files which tend to penalize compile time unnecessarily.
>>
>> If there is a performance difference, the following things unique to
>> LIPO may contribute to it (I have not validated them):
>>
>> 1) LIPO supports tracking indirect call targets across modules. This
>> is not feasible for plain FDO as there will be cgraph pid conflicts.
>> LIPO uses unique function id == (module_id << 32) + func_def_no, which
>> makes it possible.
>
> Interesting. My plan for profiling with LTO is to ultimately make it a link-time
> transform. This will be more difficult with WHOPR (i.e. instrumenting needs
> function bodies that are not available at WPA time), but I believe it is
> solvable: just assign uids to the edges and do instrumentation at ltrans. Then
> we will save the cgraph profile in some easier way so WHOPR can read it in and read
> the rest of the stuff in ltrans. This would involve shipping the correct profiles for
> a given function etc., so it will be a bit of an implementation challenge.
This can be tricky -- to maximize FDO benefit, the
profile-use/annotation needs to happen early which means
instrumentation also needs to happen early (to avoid cfg mismatches).
>
>> 2) comdat function resolution -- since LIPO uses aux module functions
>> for inlining purpose only, it has the freedom to choose which copy to
>> use. The current scheme chooses copy in current module with priority
>> for better profile data context sensitivity (see below)
>
> This is interesting. How do you solve the problem when a given comdat function
> "loses"? I.e. it is replaced at link time by another function that may or may
> not be profiled from another unit?
Whichever function is selected will have profile data (assuming it is
called at runtime) -- but the profile data are merged from different
contexts, including from calls in different modules. For instance, if
both a.C and b.C define foo, b.C:foo is selected at runtime, and
a.C:foo is not inlined (after instrumentation) anywhere in a.C, then
a.C:foo won't have any profile data, and b.C:foo has merged profile
data resulting from calls in both a.C and b.C.
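The a.C/b.C scenario can be mimicked with a toy Python model. This is only a sketch of the bookkeeping, not of how the gcov runtime works: the function name and the call counts are invented, and it assumes every call to foo is out of line and therefore lands in the single copy the linker kept.

```python
def run_training(modules, calls, selected_module):
    """Return per-module entry counts for a comdat function foo.

    modules:         modules that each contain a copy of foo
    calls:           list of (calling_module, number_of_calls)
    selected_module: the module whose copy the linker kept
    """
    entry_counts = {module: 0 for module in modules}
    for _caller, n_calls in calls:
        # Whoever the caller is, the call executes the selected copy,
        # so only that copy's counters are incremented.
        entry_counts[selected_module] += n_calls
    return entry_counts


# Invented numbers: a.C calls foo often, b.C rarely; b.C:foo is kept.
profile = run_training(
    modules=["a.C", "b.C"],
    calls=[("a.C", 1000), ("b.C", 10)],
    selected_module="b.C",
)
print(profile)
```

The copy in a.C ends up with a zero count even though a.C is where most of the calls come from, while b.C:foo carries the merged total.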
>
> I am aware that current FDO gets this wrong (it assumes that comdat functions
> are never replaced from another unit). I guess the situation can be improved a bit
> by doing some localization even without -fwhole-program, or by teaching the runtime
> to merge profiles into each individual copy of the comdat...
Yes, current FDO assumption is wrong.
Thanks,
David
>
> Honza
>
>> 3) in profile-gen phase, allow more inlining for comdat functions (in
>> einline2 and ipa-inline) -- this will cause profile data to be tracked
>> with module sensitivity (note that counters are not in comdat group)
>>
>> Thanks,
>>
>> David
>>
>>
>>
>> > Honza
>> >>
>> >> Ciao!
>> >> Steven
>> >
>
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-30 18:13 ` Xinliang David Li
@ 2010-04-30 18:32 ` Jan Hubicka
2010-04-30 20:13 ` Xinliang David Li
0 siblings, 1 reply; 68+ messages in thread
From: Jan Hubicka @ 2010-04-30 18:32 UTC (permalink / raw)
To: Xinliang David Li
Cc: Jan Hubicka, Steven Bosscher, Vladimir Makarov, gcc.gcc.gnu.org,
Raksit Ashok
> >
> > Interesting. My plan for profiling with LTO is to ultimately make it a link-time
> > transform. This will be more difficult with WHOPR (i.e. instrumenting needs
> > function bodies that are not available at WPA time), but I believe it is
> > solvable: just assign uids to the edges and do instrumentation at ltrans. Then
> > we will save the cgraph profile in some easier way so WHOPR can read it in and read
> > the rest of the stuff in ltrans. This would involve shipping the correct profiles for
> > a given function etc., so it will be a bit of an implementation challenge.
>
> This can be tricky -- to maximize FDO benefit, the
> profile-use/annotation needs to happen early which means
> instrumentation also needs to happen early (to avoid cfg mismatches).
I don't see much problem in this particular area.
GCC optimization queue is organized in a way that we first do early
optimizations that are all intended to be simple cleanups without size/speed
tradeoffs. Then we do IPA and late optimizations that are both driven by
profile (estimated or read).
Profile reading happens early because we use the same infrastructure for gcov and
profile feedback. This does not make profile feedback more beneficial, quite the
converse, since early passes may not be able to update the profile precisely and we
also get higher profile overhead.
So I think decoupling gcov and profile feedback and pushing profile feedback
back in the queue is going to be a win.
Yes, optimization must match, but with LTO this is not a problem, and in general
the early optimization should be stable wrt memory layout (nothing else
changes). This used to be exercised before profiling was updated to tree
level in 4.x.
I would be very interested in the low overhead support - there is a lot to gain
especially because the profiling results are less dependent on setup and can be
better reused. I know part of the code was contributed (the support for reading not
100% valid profiles). Is there any extra info available on this?
The main problem IMO is how to get the profile into WHOPR without having function bodies.
I guess we will end up with summarizing the info in a WHOPR-friendly way and
letting it stream the other counters to LTRANS, which will annotate the function
body once it is read in from the file.
>
>
> >
> >> 2) comdat function resolution -- since LIPO uses aux module functions
> >> for inlining purpose only, it has the freedom to choose which copy to
> >> use. The current scheme chooses copy in current module with priority
> >> for better profile data context sensitivity (see below)
> >
> > This is interesting. How do you solve the problem when a given comdat function
> > "loses"? I.e. it is replaced at link time by another function that may or may
> > not be profiled from another unit?
>
> Whichever function is selected will have profile data (assuming it is
> called at runtime) -- but the profile data are merged from different
> contexts, including from calls in different modules. For instance, if
> both a.C and b.C define foo, b.C:foo is selected at runtime, and
> a.C:foo is not inlined (after instrumentation) anywhere in a.C, then
> a.C:foo won't have any profile data, and b.C:foo has merged profile
> data resulting from calls in both a.C and b.C.
Yes, but this is what I am concerned about. Without LTO, at least when
compiling a.C with profile feedback, we will have foo with 0 counts.
We might however work out that calls of foo are frequent and decide to
inline foo. We will take the counts and rescale, resulting in inlining
foo optimized for size.
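A small Python sketch of the rescaling concern follows. It is a deliberately simplified model, not GCC's inliner: the function name and the block counts are made up, and the assumption is only that the callee's block counts are scaled by call-site count over callee entry count when the body is copied into the caller.

```python
def scale_callee_counts(block_counts, callee_entry_count, call_site_count):
    """Scale a callee's basic-block counts for one inlined call site."""
    if callee_entry_count == 0:
        # The body was never seen executing, so there is no ratio to apply
        # and every block in the inlined copy keeps a zero count.
        return [0 for _ in block_counts]
    return [c * call_site_count // callee_entry_count for c in block_counts]


call_site_count = 1000  # a hot call to foo inside a.C (invented number)

# a.C:foo lost at link time, so its own counters were never incremented.
cold_looking = scale_callee_counts([0, 0, 0], 0, call_site_count)

# b.C:foo won, so its counters hold the merged profile.
with_profile = scale_callee_counts([1010, 1000, 10], 1010, call_site_count)

print(cold_looking)
print(with_profile)
```

In the first case the inlined body looks like it never runs despite sitting behind a hot call, which is why it would be treated as cold code.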
When comdats are resolved within LTO, this will not be a big deal, but LTO
still produces comdats that are later resolved with libraries etc., so we don't
solve the problem this way.
At the very least we should be able to figure out that we have a function
that has no profile and do something more sane.
Do you have any idea how common these scenarios are?
Honza
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-30 18:32 ` Jan Hubicka
@ 2010-04-30 20:13 ` Xinliang David Li
2010-09-28 0:29 ` Neil Vachharajani
0 siblings, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-04-30 20:13 UTC (permalink / raw)
To: Jan Hubicka
Cc: Steven Bosscher, Vladimir Makarov, gcc.gcc.gnu.org, Raksit Ashok,
Neil Vachharajani
On Fri, Apr 30, 2010 at 11:12 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> >
>> > Interesting. My plan for profiling with LTO is to ultimately make it a link-time
>> > transform. This will be more difficult with WHOPR (i.e. instrumenting needs
>> > function bodies that are not available at WPA time), but I believe it is
>> > solvable: just assign uids to the edges and do instrumentation at ltrans. Then
>> > we will save the cgraph profile in some easier way so WHOPR can read it in and read
>> > the rest of the stuff in ltrans. This would involve shipping the correct profiles for
>> > a given function etc., so it will be a bit of an implementation challenge.
>>
>> This can be tricky -- to maximize FDO benefit, the
>> profile-use/annotation needs to happen early which means
>> instrumentation also needs to happen early (to avoid cfg mismatches).
>
> I don't see much problem in this particular area.
>
> GCC optimization queue is organized in a way that we first do early
> optimizations that are all intended to be simple cleanups without size/speed
> tradeoffs. Then we do IPA and late optimizations that are both driven by
> profile (estimated or read).
> Profile reading happens early because we use the same infrastructure for gcov and
> profile feedback. This does not make profile feedback more beneficial, quite the
> converse, since early passes may not be able to update the profile precisely and we
> also get higher profile overhead.
>
> So I think decoupling gcov and profile feedback and pushing profile feedback
> back in the queue is going to be a win.
>
There are two parts of profile feedback:
1) cfg edge counts annotation.
For this part, yes, most of the early phases (other than possibly
einline-2) do not need or depend on it, and it can probably be pushed back (in
fact the static/guessed profile pass is later).
2) value profile transformations:
This part may benefit more from being done early -- not only because of
more cleanups, but also due to the requirement for getting a more
precise inline summary.
> Yes, optimization must match, but with LTO this is not a problem, and in general
> the early optimization should be stable wrt memory layout (nothing else
> changes). This used to be exercised before profiling was updated to tree
> level in 4.x.
You mean CFG layout is stable? But ccp, copy_prop, dce, tail recursion
etc. can all change the cfg.
>
> I would be very interested in the low overhead support - there is a lot to gain
> especially because the profiling results are less dependent on setup and can be
> better reused. I know part of the code was contributed (the support for reading not
> 100% valid profiles). Is there any extra info available on this?
>
For profile smoothing, Neil may point to more information.
> The main problem IMO is how to get the profile into WHOPR without having function bodies.
> I guess we will end up with summarizing the info in a WHOPR-friendly way and
> letting it stream the other counters to LTRANS, which will annotate the function
> body once it is read in from the file.
>>
I am a little lost here :)
>>
>> >
>> >> 2) comdat function resolution -- since LIPO uses aux module functions
>> >> for inlining purpose only, it has the freedom to choose which copy to
>> >> use. The current scheme chooses copy in current module with priority
>> >> for better profile data context sensitivity (see below)
>> >
>> > This is interesting. How do you solve the problem when a given comdat function
>> > "loses"? I.e. it is replaced at link time by another function that may or may
>> > not be profiled from another unit?
>>
>> Whichever function is selected will have profile data (assuming it is
>> called at runtime) -- but the profile data are merged from different
>> contexts, including from calls in different modules. For instance, if
>> both a.C and b.C define foo, b.C:foo is selected at runtime, and
>> a.C:foo is not inlined (after instrumentation) anywhere in a.C, then
>> a.C:foo won't have any profile data, and b.C:foo has merged profile
>> data resulting from calls in both a.C and b.C.
>
> Yes, but this is what I am concerned about. Without LTO, at least when
> compiling a.C with profile feedback, we will have foo with 0 counts.
> We might however work out that calls of foo are frequent and decide to
> inline foo. We will take the counts and rescale, resulting in inlining
> foo optimized for size.
Not always ideal though -- scaling does not expose whether foo is hot
or not (the call edge may be cold, but is still worth inlining).
>
> When comdats are resolved within LTO, this will not be a big deal, but LTO
> still produces comdats that are later resolved with libraries etc., so we don't
> solve the problem this way.
> At the very least we should be able to figure out that we have a function
> that has no profile and do something more sane.
You mean LTO does not discard duplicate bodies? Why?
>
> Do you have any idea how common these scenarios are?
I don't have direct data, but I think it can be common.
Thanks,
David
>
> Honza
>
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-29 21:36 ` Xinliang David Li
@ 2010-05-01 9:36 ` Jan Hubicka
2010-05-02 7:04 ` Xinliang David Li
0 siblings, 1 reply; 68+ messages in thread
From: Jan Hubicka @ 2010-05-01 9:36 UTC (permalink / raw)
To: Xinliang David Li; +Cc: Jan Hubicka, Vladimir Makarov, gcc.gcc.gnu.org
>
> Vortex needs -fno-strict-aliasing. It casts between two record types
> with one record being a 'prefix' of another.
So today's runs are complete. Thanks to Richi, who fixed the ICE in symtab merging
that affected perl and GCC. With vortex the problem was that, in addition to needing
-fno-strict-aliasing, it writes to closed files, which causes an ICE depending on
the particular glibc version.
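For readers who have not seen the "prefix record" pattern mentioned in the quoted note about vortex, here is a minimal sketch of it in C. The type and function names are hypothetical and are not taken from the vortex sources; the second variant shows one common way to express the same layout without relying on -fno-strict-aliasing.

```c
/* Hypothetical record types; Header's fields form a "prefix" of Object's. */
struct Header { int kind; int size; };
struct Object { int kind; int size; double payload; };

/* Risky pattern: Header and Object are unrelated types to the compiler, so
   with strict aliasing enabled it may assume that a store through one type
   does not affect a load through the other. */
int size_via_cast(struct Object *o)
{
    struct Header *h = (struct Header *)o;
    return h->size;
}

/* Safer layout: embed the prefix as the first member, so the Header that is
   read really is a Header object. */
struct SafeObject { struct Header hdr; double payload; };

int size_via_member(struct SafeObject *o)
{
    struct Header *h = &o->hdr;
    return h->size;
}
```

Code written in the first style is the kind that typically has to be built with -fno-strict-aliasing.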
Comparing http://gcc.opensuse.org/SPEC/CINT/sb-frescobaldi.suse.de-fdo-64-FDO/recent.html
vortex is 2036 with -O2 -flto, 2438 with -O2 -flto and FDO (so about 20% improvement)
http://gcc.opensuse.org/SPEC/CINT/sb-frescobaldi.suse.de-head-64/list.html
has -O2 runs without LTO that is 1859, so 31% for LTO+FDO, 10% LTO.
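The percentages above follow directly from the ratios of the SPEC scores (higher is better). A small sketch of that arithmetic, using the vortex scores quoted in this mail; the helper name gain_percent is hypothetical:

```c
#include <stdio.h>

/* Hypothetical helper: relative gain of `score` over `base`, in percent.
   SPEC ratios are "higher is better". */
double gain_percent(double base, double score)
{
    return (score / base - 1.0) * 100.0;
}

/* Prints the three vortex comparisons from the scores quoted in the mail. */
void print_vortex_gains(void)
{
    const double o2 = 1859.0, lto = 2036.0, lto_fdo = 2438.0;

    printf("LTO over -O2:     %.1f%%\n", gain_percent(o2, lto));
    printf("LTO+FDO over LTO: %.1f%%\n", gain_percent(lto, lto_fdo));
    printf("LTO+FDO over -O2: %.1f%%\n", gain_percent(o2, lto_fdo));
}
```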
Any idea if it is one of value transforms or just edge profile making the
difference? There are some cases of write only globals we can constant
propagate with -fwhole-program in SPEC, but I think it is parser.
Honza
>
> David
>
>
>
> > Honza
> >
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-05-01 9:36 ` Jan Hubicka
@ 2010-05-02 7:04 ` Xinliang David Li
2010-05-02 13:46 ` Jan Hubicka
0 siblings, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-05-02 7:04 UTC (permalink / raw)
To: Jan Hubicka; +Cc: Vladimir Makarov, gcc.gcc.gnu.org
On Sat, May 1, 2010 at 2:36 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> Vortex needs -fno-strict-aliasing. It casts between two record types
>> with one record being a 'prefix' of another.
>
> So today runs are complete. Thanks to Richi who fixed ICE in symtab merging
> that affected perl and GCC. With vortex problem was that in addition to
> -fno-strict-aliasing it is writing to closed files that cause ICE depending on
> particular glibc version.
>
> Comparing http://gcc.opensuse.org/SPEC/CINT/sb-frescobaldi.suse.de-fdo-64-FDO/recent.html
> vortex is 2036 with -O2 -flto, 2438 with -O2 -flto and FDO (so about 20% improvement)
> http://gcc.opensuse.org/SPEC/CINT/sb-frescobaldi.suse.de-head-64/list.html
> has -O2 runs without LTO that is 1859, so 31% for LTO+FDO, 10% LTO.
>
> Any idea if it is one of value transforms or just edge profile making the
> difference? There are some cases of write only globals we can constant
> propagate with -fwhole-program in SPEC, but I think it is parser.
>
I got the following numbers for O2, FDO, and LIPO: 2351, 2761 (17% over O2), 3448 (24% over FDO).
The FDO improvement over O2 comes from both the edge profile and vpt
(div, rem). With FDO, one of the important loops in Part_Delete may get
tail duplicated, which helps performance.
The LIPO improvement mainly comes from cross-module inlining of the hot
functions Mem_GetWord, Mem_GetAddr, Chunk_ChkGetChunk.
David
> Honza
>>
>> David
>>
>>
>>
>> > Honza
>> >
>
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-05-02 7:04 ` Xinliang David Li
@ 2010-05-02 13:46 ` Jan Hubicka
2010-05-03 4:57 ` Xinliang David Li
0 siblings, 1 reply; 68+ messages in thread
From: Jan Hubicka @ 2010-05-02 13:46 UTC (permalink / raw)
To: Xinliang David Li; +Cc: Jan Hubicka, Vladimir Makarov, gcc.gcc.gnu.org
> On Sat, May 1, 2010 at 2:36 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >>
> >> Vortex needs -fno-strict-aliasing. It casts between two record types
> >> with one record being a 'prefix' of another.
> >
> > So today runs are complete. Thanks to Richi who fixed ICE in symtab merging
> > that affected perl and GCC. With vortex problem was that in addition to
> > -fno-strict-aliasing it is writing to closed files that cause ICE depending on
> > particular glibc version.
> >
> > Comparing http://gcc.opensuse.org/SPEC/CINT/sb-frescobaldi.suse.de-fdo-64-FDO/recent.html
> > vortex is 2036 with -O2 -flto, 2438 with -O2 -flto and FDO (so about 20% improvement)
> > http://gcc.opensuse.org/SPEC/CINT/sb-frescobaldi.suse.de-head-64/list.html
> > has -O2 runs without LTO that is 1859, so 31% for LTO+FDO, 10% LTO.
> >
> > Any idea if it is one of value transforms or just edge profile making the
> > difference? There are some cases of write only globals we can constant
> > propagate with -fwhole-program in SPEC, but I think it is parser.
> >
>
> I got the following number for O2, FDO, and LIPO : 2351, 2761 (17%), 3448 (24%).
>
> The FDO improvement over O2 come from both edge profile and vpt
> (div,rem). With FDO, one of the important loops in Part_Delete may get
I see. I am particularly interested in the div/rem transform. With LTO such
things are sometimes doable at compile time (by propagating that the divisor is
a known constant value). We currently do no constant propagation across global
variables, except for a simple detection of whether a variable is read-only and
initialized. It would be possible to be a bit smarter here: look for vars whose
only stores write one constant value, and then replace the divisor in every
division/rem by that constant, counting on the fact that the value 0 cannot
reach the division.
Is this case detectable at compile time without feedback?
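As background for the div/rem value-profile transform discussed here: when the profile shows that a divisor is almost always one particular value, the division can be guarded by a compare and specialized for that value. A hand-written sketch of the idea follows; the hot value 8 and the function names are hypothetical, and this is not GCC's actual generated code.

```c
/* Generic division: the divisor is only known at run time. */
unsigned div_generic(unsigned a, unsigned divisor)
{
    return a / divisor;
}

/* Specialized form, assuming profiling showed divisor == 8 nearly always. */
unsigned div_specialized(unsigned a, unsigned divisor)
{
    if (divisor == 8)
        return a / 8;       /* constant divisor: can become a shift */
    return a / divisor;     /* rare fallback keeps the result correct */
}
```

If the compiler can instead prove statically that the divisor is always that constant, the compare and the fallback disappear as well, which is the compile-time variant asked about above.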
Honza
> tail duplicated which helps performance.
>
> LIPO improvement mainly come from cross module inlining of hot
> functions Mem_GetWord, Mem_GetAddr, Chunk_ChkGetChunk.
>
> David
>
>
> > Honza
> >>
> >> David
> >>
> >>
> >>
> >> > Honza
> >> >
> >
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-05-02 13:46 ` Jan Hubicka
@ 2010-05-03 4:57 ` Xinliang David Li
2010-05-04 18:04 ` Jan Hubicka
0 siblings, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-05-03 4:57 UTC (permalink / raw)
To: Jan Hubicka; +Cc: Vladimir Makarov, gcc.gcc.gnu.org
On Sun, May 2, 2010 at 6:45 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> On Sat, May 1, 2010 at 2:36 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> >>
>> >> Vortex needs -fno-strict-aliasing. It casts between two record types
>> >> with one record being a 'prefix' of another.
>> >
>> > So today runs are complete. Thanks to Richi who fixed ICE in symtab merging
>> > that affected perl and GCC. With vortex problem was that in addition to
>> > -fno-strict-aliasing it is writing to closed files that cause ICE depending on
>> > particular glibc version.
>> >
>> > Comparing http://gcc.opensuse.org/SPEC/CINT/sb-frescobaldi.suse.de-fdo-64-FDO/recent.html
>> > vortex is 2036 with -O2 -flto, 2438 with -O2 -flto and FDO (so about 20% improvement)
>> > http://gcc.opensuse.org/SPEC/CINT/sb-frescobaldi.suse.de-head-64/list.html
>> > has -O2 runs without LTO that is 1859, so 31% for LTO+FDO, 10% LTO.
>> >
>> > Any idea if it is one of value transforms or just edge profile making the
>> > difference? There are some cases of write only globals we can constant
>> > propagate with -fwhole-program in SPEC, but I think it is parser.
>> >
>>
>> I got the following number for O2, FDO, and LIPO : 2351, 2761 (17%), 3448 (24%).
>>
>> The FDO improvement over O2 come from both edge profile and vpt
>> (div,rem). With FDO, one of the important loops in Part_Delete may get
>
> I see. I am particularly interested in the div/rem transform. With LTO such
> things are sometimes doable at compile time (propagating that the divisor is
> know constant value). We currently make no constant propagation across global
> variables except for simple detection if it is readonly and initialized. It
> would be possible to be a bit smarter here and look for vars that are only used
> to store a constant value into it and then replace all the division/rem by that
> constant counting on fact that the value 0 can not reach the division.
>
> Is this case detectable at compile time without feedback?
That depends. The following cases exist in vortex:
1) The value is a runtime constant: it is read from an input file but
never changed, e.g. QueBug. Nothing can be done by the compiler in
this case.
2) A global variable written only once in the program, e.g.
StrucAlignment. The compiler needs to prove that the definition
dominates (interprocedurally) all uses. Sjeng has similar cases.
3) The simplest case: a global variable only initialized statically
and never written in the program. The compiler should be able to
recognize it.
4) A local variable with a known set of constant values, e.g. AllocSize.
This can be handled by the compiler with the help of static prediction.
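To make the four cases concrete, here is a small C sketch with one variable per case. All names and values are hypothetical stand-ins and are not taken from the vortex sources.

```c
/* All names below are hypothetical stand-ins, not vortex source. */

static int debug_level;        /* case 1: value comes from input at run time */
static int alignment;          /* case 2: written exactly once, in setup() */
static int page_size = 4096;   /* case 3: statically initialized, never written */

/* value_from_input models a number parsed from an input file. */
void setup(int value_from_input)
{
    debug_level = value_from_input;   /* unknowable at compile time */
    alignment = 8;                    /* the single store of case 2 */
}

/* Case 1: the divisor is whatever the input said; only a profile helps. */
int bucket(int x)        { return debug_level ? x % debug_level : 0; }

/* Case 2: foldable to n / 8 only if setup() is proven to run before any use. */
int aligned_words(int n) { return n / alignment; }

/* Case 3: foldable to n / 4096 as soon as "never written" is known. */
int pages(int n)         { return n / page_size; }

/* Case 4: a local divisor drawn from a small set of constants. */
int chunks(int n, int large)
{
    int alloc_size = large ? 1024 : 64;
    return n / alloc_size;
}
```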
David
>
> Honza
>> tail duplicated which helps performance.
>>
>> LIPO improvement mainly come from cross module inlining of hot
>> functions Mem_GetWord, Mem_GetAddr, Chunk_ChkGetChunk.
>>
>> David
>>
>>
>> > Honza
>> >>
>> >> David
>> >>
>> >>
>> >>
>> >> > Honza
>> >> >
>> >
>
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-05-03 4:57 ` Xinliang David Li
@ 2010-05-04 18:04 ` Jan Hubicka
0 siblings, 0 replies; 68+ messages in thread
From: Jan Hubicka @ 2010-05-04 18:04 UTC (permalink / raw)
To: Xinliang David Li; +Cc: Jan Hubicka, Vladimir Makarov, gcc.gcc.gnu.org
> On Sun, May 2, 2010 at 6:45 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> That depends. The following cases exist in vortex:
>
> 1) the value is runtime constant -- it is read from input file but
> never changed -- e.g.: QueBug. Nothing can be done by the compiler in
> this case;
>
> 2) Global variable written only once in the program, e.g
> StrucAlignment. Compiler needs to prove that the definition
> dominates (interprocedurally) all uses. Sjeng has similar cases.
>
> 3) The simplest case -- global variable only initialized statically
> and never written in the program -- compiler should be able to
> recognize it.
We handle this in ipa-reference already.
>
> 4) Local variable with known constant value sets -- AllocSize -- can
> be handled by compiler with the help of static prediction.
Yep, I was basically interested in whether the vortex case is 1) or not. I am thinking
about extending my ipa-ref collecting code to collect the list of known constants
a variable is initialized with and feed it into some local optimization passes
(value range profiling is a good case), as well as implementing expansion of
division and modulo into something like:
switch (divisor)
  {
  case known_cst1:
    res = a / known_cst1;
    break;
  case known_cst2:
    res = a / known_cst2;
    break;
  ...
  }
special-casing 0 as an impossible value. I know this can handle parser in
SPEC2000 well, but I am curious whether it is worth the (relatively little)
effort to implement this.
Given how easy it is to get this done, I guess I will do so.
Also, ipa-cp can be extended to handle static vars used in a function as parameters
and propagate across them, which can get some of this with flow sensitivity.
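Returning to the division expansion described earlier in this mail, here is a compilable sketch of what such an expansion could look like. The known constants 4 and 16 and the function name are hypothetical; the default arm is only a safety net for this sketch and would be unnecessary if the set of constants were proven complete.

```c
/* Assumes whole-program analysis found that the divisor is only ever
   assigned 4 or 16 (hypothetical values); 0 is assumed unreachable. */
int div_by_known(int a, int divisor)
{
    switch (divisor) {
    case 4:
        return a / 4;         /* constant divisor, no div instruction needed */
    case 16:
        return a / 16;
    default:
        return a / divisor;   /* safety net for this sketch only */
    }
}
```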
Honza
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-29 23:47 ` Steven Bosscher
@ 2010-09-28 0:28 ` Neil Vachharajani
2010-09-28 1:01 ` Jack Howarth
0 siblings, 1 reply; 68+ messages in thread
From: Neil Vachharajani @ 2010-09-28 0:28 UTC (permalink / raw)
To: Steven Bosscher
Cc: Jan Hubicka, Xinliang David Li, Vladimir Makarov,
gcc.gcc.gnu.org, Raksit Ashok
On Thu, Apr 29, 2010 at 4:07 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> 2010/4/30 Jan Hubicka <hubicka@ucw.cz>:
>> Yep, I read that page (and saw some of implementation too). Just was not able
>> to follow the precise feature set of LIPO (i.e. if it gets better SPEC results
>> than LTO+FDO then why)
>
> OK, that's an interesting question. The first question (if...) is
> something you'll have to try yourself, I suppose :-)
>
> BTW will the CGO presentation about LIPO and sampled FDO be published
> somewhere in the open?
All the slides from CGO are available here:
http://www.cgo.org/cgo2010/talks/
>
> Ciao!
> Steven
>
--
Neil Vachharajani
Google
650-214-1804
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-30 20:13 ` Xinliang David Li
@ 2010-09-28 0:29 ` Neil Vachharajani
0 siblings, 0 replies; 68+ messages in thread
From: Neil Vachharajani @ 2010-09-28 0:29 UTC (permalink / raw)
To: Xinliang David Li
Cc: Jan Hubicka, Steven Bosscher, Vladimir Makarov, gcc.gcc.gnu.org,
Raksit Ashok
On Fri, Apr 30, 2010 at 12:07 PM, Xinliang David Li <davidxl@google.com> wrote:
>
> On Fri, Apr 30, 2010 at 11:12 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >> >
> >> > Interesting. My plan for profiling with LTO is to ultimately make it linktime
> >> > transform. This will be more difficult with WHOPR (i.e. instrumenting need
> >> > function bodies that are not available at WPA time), but I believe it is
> >> > solvable: just assign uids to the edges and do instrumentation at ltrans. Then
> >> > we will save cgraph profile in some easier way so WHOPR can read it in and read
> >> > rest of stuff in ltrans. This would involve shipping the correct profiles for
> >> > given function etc so it will be a bit of implementation challenge.
> >>
> >> This can be tricky -- to maximize FDO benefit, the
> >> profile-use/annotation needs to happen early which means
> >> instrumentation also needs to happen early (to avoid cfg mismatches).
> >
> > I don't see much problem in this particular area.
> >
> > GCC optimization queue is organized in a way that we first do early
> > optimizations that all are intended to be simple cleanups without size/speed
> > tradeoffs. Then we do IPA and late optimizations that are both driven by
> > profile (estimated or read).
> > Profile reading happens early because we use same infrastructure for gcov and
> > profile feedback. This is not giving profile feedback better benefit, quite a
> > converse since early passes may not be able to update profile precisely and we
> > also get higher profile overhead.
> >
> > So I think decoupling gcov and profile feedback and pushing profile feedback
> > back in queue is going to be win.
> >
>
> There are two parts of profile-feedback
> 1) cfg edge counts annotation.
>
> For this part, yes, most of the early phases (other than possibly
> einline-2) do not need/depend on, and can probably pushed back (in
> fact the static/guessed profile pass is later).
>
> 2) value profile transformations:
>
> This part may benefit more from doing early -- not only because of
> more cleanups, but also due to the requirement for getting more
> precise inline summary.
>
>
> > Yes, optimization must match, but with LTO this is not problem and in general
> > the early optimization should be stable wrt memory layout (nothing else
> > changes). This used to be exercised before profiling was updated to tree
> > level in 4.x.
>
>
> You mean CFG layout is stable? but ccp, copy_prop, dce, tail recursion
> etc all can change cfg.
>
> >
> > I would be very interested in the low overhead support - there is a lot to gain
> > especially because the profiling results are less dependent on setup and can be
> > better reused. I know part of code was contributed (the support for reading not
> > 100% valid profiles). Is there any extra info available on this?
> >
>
> For profile smoothing, Neil may point to more information.
Sorry for the *very* delayed response, but some email filters went a bit wild.
Profile smoothing does a good job of taking imprecise profiles and
fixing them up. This doesn't address the stale profile problem with
GCC instrumentation based FDO profile collection. There are checks
which completely discard profiles if the function line numbers (IIRC)
do not match. I have some patches I've been meaning to send upstream
which help ease this restriction (i.e., add the ability to retain more
of a stale profile), but this opens up many bugs which I've been
incrementally squashing throughout the rest of the compiler.
>
> > Main problem IMO is how to get profile into WHOPR without having function bodies.
> > I guess we will end up with summarizing the info in WHOPR friendly way and
> > letting it to stream the other counters to LTRANS that will annotate the function
> > body once read in from the file.
> >>
>
> I am a little lost here :)
>
> >>
> >> >
> >> >> 2) comdat function resolution -- since LIPO uses aux module functions
> >> >> for inlining purpose only, it has the freedom to choose which copy to
> >> >> use. The current scheme chooses copy in current module with priority
> >> >> for better profile data context sensitivity (see below)
> >> >
> >> > This is interesting. How do you solve the problem when given comdat function
> >> > "loose"? I.e. it is replaced at linktime by other function that may or may
> >> > not be profiled from other unit?
> >>
> >> Whatever function that is selected will have profile data (assuming it
> >> called at runtime) -- but the profile data are merged from different
> >> contexts including from calls in different modules. For instance,
> >> both a.C and b.C define foo. and b.C:foo is selected at runtime, and
> >> a.C:foo is not inlined (after instrumentation) anywhere in a.C, then
> >> a.C:foo won't have any profile data, and b.C:foo has merged profile
> >> data resulting from calls in both a.C and b.C.
> >
> > Yes, but this is what I am concerned about. Without LTO at least when
> > compiling a.C with profile feedback we will have foo with 0 counts.
> > We might however work out that calls of foo are frequent and decide to
> > inline foo. We will take the counts and rescale resulting in inlining
> > foo optimized for size
>
> Not always ideal though -- scaling does not expose whether foo is hot
> or not (the call edge may be cold, but is still worth inlining).
>
> .
> >
> > When comdats are resolved within LTO, this will not be deal, but LTO
> > still produce comdats that are later resolved with library etc., so we don't
> > solve the problem this way.
> > At very least we should be able to figure out that we are having function
> > that has no profile and do something more sane.
>
> You mean LTO does not discard duplicate bodies? Why ?
>
> >
> > Do you have any idea how common these scenarios are?
>
> I don't have direct data, but I think it can be common.
>
> Thanks,
>
> David
>
> >
> > Honza
> >
--
Neil Vachharajani
Google
650-214-1804
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-09-28 0:28 ` Neil Vachharajani
@ 2010-09-28 1:01 ` Jack Howarth
0 siblings, 0 replies; 68+ messages in thread
From: Jack Howarth @ 2010-09-28 1:01 UTC (permalink / raw)
To: Neil Vachharajani
Cc: Steven Bosscher, Jan Hubicka, Xinliang David Li,
Vladimir Makarov, gcc.gcc.gnu.org, Raksit Ashok
On Mon, Sep 27, 2010 at 11:04:10AM -0700, Neil Vachharajani wrote:
> On Thu, Apr 29, 2010 at 4:07 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> > 2010/4/30 Jan Hubicka <hubicka@ucw.cz>:
> >> Yep, I read that page (and saw some of implementation too). Just was not able
> >> to follow the precise feature set of LIPO (i.e. if it gets better SPEC results
> >> than LTO+FDO then why)
> >
> > OK, that's an interesting question. The first question (if...) is
> > something you'll have to try yourself, I suppose :-)
> >
> > BTW will the CGO presentation about LIPO and sampled FDO be published
> > somewhere in the open?
>
> All the slides from CGO are available here:
> http://www.cgo.org/cgo2010/talks/
>
> >
> > Ciao!
> > Steven
> >
>
FYI, my recent Polyhedron 2008 benchmark runs for llvm-gcc-4.2 2.8rc2 on
x86_64-apple-darwin10 indicate that there are some significant performance
regressions between 2.7 and 2.8.
http://lists.cs.uiuc.edu/pipermail/llvmdev/2010-September/034780.html
>
>
> --
> Neil Vachharajani
> Google
> 650-214-1804
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-04-29 16:29 GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64 Vladimir Makarov
` (2 preceding siblings ...)
2010-04-29 22:42 ` Jack Howarth
@ 2010-11-13 23:15 ` Xinliang David Li
2010-11-14 14:48 ` Paolo Bonzini
` (2 more replies)
3 siblings, 3 replies; 68+ messages in thread
From: Xinliang David Li @ 2010-11-13 23:15 UTC (permalink / raw)
To: Vladimir Makarov; +Cc: gcc.gcc.gnu.org
I re-measured the performance difference using trunk gcc and trunk
clang/llvm on a core-2 box. -fno-strict-aliasing is added to gcc
because clang/llvm's type based aliasing is not complete and is not
enabled by default. I also added -fomit-frame-pointer to clang/llvm as
this is gcc's default. The base option is -O2.
32bit:
benchmark      llvm    gcc   gcc vs. llvm
164.gzip       1210   1239     2.44%
175.vpr        1662   1621    -2.42%
181.mcf        2733   3109    13.75%
186.crafty     1812   1721    -5.00%
197.parser     1328   1289    -2.92%
253.perlbmk    2086   2580    23.67%
254.gap        1968   1912    -2.86%
255.vortex     1842   1965     6.66%
256.bzip2      1440   1553     7.82%
300.twolf      2284   2213    -3.08%
64bit:
benchmark      llvm    gcc   gcc vs. llvm
164.gzip       1268   1320     4.15%
175.vpr        1605   1534    -4.42%
176.gcc        2203   2315     5.08%
181.mcf        1625   1737     6.85%
186.crafty     2411   2307    -4.30%
197.parser     1173   1166    -0.57%
252.eon        2245   2464     9.72%
253.perlbmk    2214   2444    10.37%
254.gap        1987   1978    -0.47%
255.vortex     2497   2422    -3.00%
256.bzip2      1585   1740     9.80%
300.twolf      2294   2281    -0.58%
Though gcc leads LLVM in performance overall, there are a couple of
benchmarks where gcc is worse: vpr and crafty (64bit and 32bit), parser and
twolf (32bit), vortex (64bit). This needs to be triaged. gcc
miscompiles gcc and eon in 32bit -- is there a bug tracking the
problem?
Thanks,
David
On Thu, Apr 29, 2010 at 9:25 AM, Vladimir Makarov <vmakarov@redhat.com> wrote:
> GCC-4.5.0 and LLVM-2.7 were released recently. To understand
> where we stand after releasing GCC-4.5.0 I benchmarked it on SPEC2000
> for x86/x86-64 and posted the comparison of it with the
> previous GCC releases and LLVM-2.7.
>
> Even benchmarking SPEC2000 takes a lot of time on the fastest
> machine I have. So I don't plan to use SPEC2006 for this in near
> future.
>
> You can find the comparison on
> http://vmakarov.fedorapeople.org/spec/ (please just click links at the
> bottom of the left frame starting with link "GCC release comparison").
>
> If you need exact numbers, please use the tables (the links to them
> are also given) which were used to generate the corresponding bar
> graphs.
>
>
> In general GCC-4.5.0 became faster (upto 10%) in -O2 mode. This is
> first considerable compilation speed improvement since GCC-4.2.
> GCC-4.5.0 generates a better (1-2% in average upto 4% for x86-64
> SPECFP2000 in -O2 mode) code too in comparison with the previous
> release. That is not including LTO and Graphite which can gives even
> more (especially LTO) in many cases.
>
> GCC-4.5.0 has new big optimizations LTO and Graphite (more
> accurately graphite was introduced in the previous release).
> Therefore I ran additional benchmarks to test them.
>
> LTO is a promising technology especially for integer benchmarks for
> which it results in smaller and faster code. But it might result in
> degradations too on SPECFP2000 mainly because of big degradations on a
> few benchmarks like wupwise or facerec. Another annoying thing about
> LTO, it considerably slows down the compiler.
>
> Currently Graphite gives small improvements on x86 (one exception is
> 2% for peak x86 SPECFP2000) and mostly degradation on x86_64 (with
> maximum one more than 10% for SPECFP2000 because of big degradations
> on mgrid and swim). So further work is needed on the project because
> it seems not mature yet.
>
> As for LLVM, LLVM became slower (e.g. in comparison with llvm-2.5 on
> 15%-50% for x86-64). So the gap between compilation speed of GCC and
> LLVM decreased and sometimes achieves 4% on x86_64 and 8% on x86 (both
> for SPECInt2000 in -O2 mode). May be I am wrong but I don't think
> CLANG will improve this situation significantly (in -O2 and -O3 mode)
> because optimizations still take most of time of any serious
> optimizing compiler.
>
> LLVM did a progress in code performance especially for floating
> point benchmarks. But the gap between LLVM-2.7 and GCC-4.5 in peak
> performance (not including GCC LTO and Graphite) still 6-7% on
> SPECInt200 and 13-17% on SPECFP2000.
>
> In general, IMHO GCC-4.5.0 is a good and promising release.
>
>
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-13 23:15 ` Xinliang David Li
@ 2010-11-14 14:48 ` Paolo Bonzini
2010-11-14 15:43 ` Xinliang David Li
2010-11-14 21:12 ` H.J. Lu
2010-11-15 15:49 ` Andrey Belevantsev
2 siblings, 1 reply; 68+ messages in thread
From: Paolo Bonzini @ 2010-11-14 14:48 UTC (permalink / raw)
To: Xinliang David Li; +Cc: Vladimir Makarov, gcc.gcc.gnu.org
On 11/13/2010 10:08 PM, Xinliang David Li wrote:
> Though gcc leads LLVM in performance overall, there are a couple of
> benchmarks gcc is worse: vpr and crafty (64bit and 32bit), parser and
> twolf (32bit), vortex (64bit). This needs to be triaged. gcc
> miscompiles gcc and eon in 32bit -- is there a bug tracking the
> problem?
Have you tried -ffast-math or -mfpmath=sse for eon?
Paolo
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-14 14:48 ` Paolo Bonzini
@ 2010-11-14 15:43 ` Xinliang David Li
0 siblings, 0 replies; 68+ messages in thread
From: Xinliang David Li @ 2010-11-14 15:43 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Vladimir Makarov, gcc.gcc.gnu.org
On Sat, Nov 13, 2010 at 2:39 PM, Paolo Bonzini <bonzini@gnu.org> wrote:
> On 11/13/2010 10:08 PM, Xinliang David Li wrote:
>>
>> Though gcc leads LLVM in performance overall, there are a couple of
>> benchmarks gcc is worse: vpr and crafty (64bit and 32bit), parser and
>> twolf (32bit), vortex (64bit). This needs to be triaged. gcc
>> miscompiles gcc and eon in 32bit -- is there a bug tracking the
>> problem?
>
> Have you tried -ffast-math or -mfpmath=sse for eon?
>
-ffast-math is used on eon.
David
> Paolo
>
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-13 23:15 ` Xinliang David Li
2010-11-14 14:48 ` Paolo Bonzini
@ 2010-11-14 21:12 ` H.J. Lu
2010-11-15 9:29 ` Xinliang David Li
2010-11-15 15:49 ` Andrey Belevantsev
2 siblings, 1 reply; 68+ messages in thread
From: H.J. Lu @ 2010-11-14 21:12 UTC (permalink / raw)
To: Xinliang David Li; +Cc: Vladimir Makarov, gcc.gcc.gnu.org
On Sat, Nov 13, 2010 at 1:08 PM, Xinliang David Li <davidxl@google.com> wrote:
>
> Though gcc leads LLVM in performance overall, there are a couple of
> benchmarks gcc is worse: vpr and crafty (64bit and 32bit), parser and
> twolf (32bit), vortex (64bit). This needs to be triaged. gcc
> miscompiles gcc and eon in 32bit -- is there a bug tracking the
> problem?
>
GCC trunk compiles and runs SPEC CPU 2K correctly at
-O2 and -O3 for both 32bit and 64bit on x86:
http://gcc.gnu.org/ml/gcc-testresults/2010-11/msg00977.html
http://gcc.gnu.org/ml/gcc-testresults/2010-11/msg00983.html
You need alternate source for eon. I use:
252.eon=default=default=default:
CXXPORTABILITY = -DHAS_ERRLIST
EXTRA_CXXFLAGS=-ffast-math -mpc64
EXTRA_LDFLAGS = -ffast-math -mpc64
srcalt=gcc43
176.gcc=default=default=default:
CPORTABILITY = -Dalloca=_alloca
--
H.J.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-14 21:12 ` H.J. Lu
@ 2010-11-15 9:29 ` Xinliang David Li
0 siblings, 0 replies; 68+ messages in thread
From: Xinliang David Li @ 2010-11-15 9:29 UTC (permalink / raw)
To: H.J. Lu; +Cc: Vladimir Makarov, gcc.gcc.gnu.org
Thanks, this works.
gcc vs llvm
176.gcc: +3.7%
252.eon: +6.1%
David
On Sat, Nov 13, 2010 at 3:14 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Sat, Nov 13, 2010 at 1:08 PM, Xinliang David Li <davidxl@google.com> wrote:
>>
>> Though gcc leads LLVM in performance overall, there are a couple of
>> benchmarks gcc is worse: vpr and crafty (64bit and 32bit), parser and
>> twolf (32bit), vortex (64bit). This needs to be triaged. gcc
>> miscompiles gcc and eon in 32bit -- is there a bug tracking the
>> problem?
>>
>
> GCC trunk compiles and runs SPEC CPU 2K correctly at
> -O2 and -O3 for both 32bit and 64bit on x86:
>
> http://gcc.gnu.org/ml/gcc-testresults/2010-11/msg00977.html
> http://gcc.gnu.org/ml/gcc-testresults/2010-11/msg00983.html
>
> You need alternate source for eon. I use:
>
> 252.eon=default=default=default:
> CXXPORTABILITY = -DHAS_ERRLIST
> EXTRA_CXXFLAGS=-ffast-math -mpc64
> EXTRA_LDFLAGS = -ffast-math -mpc64
> srcalt=gcc43
>
> 176.gcc=default=default=default:
> CPORTABILITY = -Dalloca=_alloca
>
>
> --
> H.J.
>
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-13 23:15 ` Xinliang David Li
2010-11-14 14:48 ` Paolo Bonzini
2010-11-14 21:12 ` H.J. Lu
@ 2010-11-15 15:49 ` Andrey Belevantsev
2010-11-15 17:41 ` Xinliang David Li
2 siblings, 1 reply; 68+ messages in thread
From: Andrey Belevantsev @ 2010-11-15 15:49 UTC (permalink / raw)
To: Xinliang David Li; +Cc: Vladimir Makarov, gcc.gcc.gnu.org
Hello,
On 14.11.2010 0:08, Xinliang David Li wrote:
> I re-measured the performance difference using trunk gcc and trunk
> clang/llvm on a core-2 box. -fno-strict-aliasing is added to gcc
> because clang/llvm's type based aliasing is not complete and not
> enabled by default. I also added -fomit-frame-pointer to clang/llvm as
> this is gcc's default. The base option is -O2.
It would be very interesting to compare also peak numbers, i.e. with LTO
and strict aliasing enabled, as well as -O3 and -ffast-math/-funroll-loops,
similar to Vlad's or OpenSUSE's options. Can you try to measure these?
Maybe you can also run SPEC2k6, if there are enough machine resources, but
that's probably asking too much...
Andrey
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-15 15:49 ` Andrey Belevantsev
@ 2010-11-15 17:41 ` Xinliang David Li
2010-11-15 18:31 ` Jan Hubicka
0 siblings, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-11-15 17:41 UTC (permalink / raw)
To: Andrey Belevantsev; +Cc: Vladimir Makarov, gcc.gcc.gnu.org
For peak, FDO is the most effective option. It can boost performance
by 7-10% depending on the program. The options you suggested probably
won't make too big a dent. -funroll-loops can hurt performance
without profiling. More aggressive inlining, ipa-cp, unswitching, etc.
enabled by O3 may help a little, if at all. -ffast-math won't
help for integer benchmarks other than eon. Traditionally, O3 helps
FP performance because of the loop transformations it enables, but this
won't be the case for gcc for now.
Thanks,
David
On Mon, Nov 15, 2010 at 4:29 AM, Andrey Belevantsev <abel@ispras.ru> wrote:
> Hello,
>
> On 14.11.2010 0:08, Xinliang David Li wrote:
>>
>> I re-measured the performance difference using trunk gcc and trunk
>> clang/llvm on a core-2 box. -fno-strict-aliasing is added to gcc
>> because clang/llvm's type based aliasing is not complete and not
>> enabled by default. I also added -fomit-frame-pointer to clang/llvm as
>> this is gcc's default. The base option is -O2.
>
> It would be very interesting to compare also peak numbers, i.e. with LTO and
> strict aliasing enabled, as well as -O3 and -ffast-math/-funroll-loops,
> similar to Vlad's or OpenSUSE's options. Can you try to measure these?
> Maybe you can also run SPEC2k6, if there is enough machine resources, but
> that's probably asking too much...
>
> Andrey
>
>
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-15 17:41 ` Xinliang David Li
@ 2010-11-15 18:31 ` Jan Hubicka
2010-11-15 22:25 ` Richard Guenther
2010-11-15 22:47 ` Xinliang David Li
0 siblings, 2 replies; 68+ messages in thread
From: Jan Hubicka @ 2010-11-15 18:31 UTC (permalink / raw)
To: Xinliang David Li; +Cc: Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org
> For peak, FDO is the most effective option. It can boost performance
> by 7-10% depending on the program. The options you suggested probably
> won't make too big a dent. -funroll-loops can hurt performance
> without profiling. More aggressive inlining, ipa-cp, unswitching etc.
-funroll-loops was overall a 2.2% win on SPECint, and -funroll-all-loops 2.5%, the last
time I noted down the SPECint results for this (that was in 2003, heh :)
http://www.ucw.cz/~hubicka/papers/amd64/node4.html
> enabled by -O3 may help a little, if at all. -ffast-math won't
> help for integer benchmarks other than eon. Traditionally, -O3 helps
> FP performance because of the loop transformations it enables, but this
> won't be the case for gcc for now.
Function inlining definitely helps. -O3 also implies vectorization and other stuff.
Honza
>
> Thanks,
>
> David
>
> On Mon, Nov 15, 2010 at 4:29 AM, Andrey Belevantsev <abel@ispras.ru> wrote:
> > Hello,
> >
> > On 14.11.2010 0:08, Xinliang David Li wrote:
> >>
> >> I re-measured the performance difference using trunk gcc and trunk
> >> clang/llvm on a core-2 box. -fno-strict-aliasing is added to gcc
> >> because clang/llvm's type based aliasing is not complete and not
> >> enabled by default. I also added -fomit-frame-pointer to clang/llvm as
> >> this is gcc's default. The base option is -O2.
> >
> > It would be very interesting to compare also peak numbers, i.e. with LTO and
> > strict aliasing enabled, as well as -O3 and -ffast-math/-funroll-loops,
> > similar to Vlad's or OpenSUSE's options. Can you try to measure these?
> > Maybe you can also run SPEC2k6, if there are enough machine resources, but
> > that's probably asking too much...
> >
> > Andrey
> >
> >
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-15 18:31 ` Jan Hubicka
@ 2010-11-15 22:25 ` Richard Guenther
2010-11-15 22:47 ` Xinliang David Li
1 sibling, 0 replies; 68+ messages in thread
From: Richard Guenther @ 2010-11-15 22:25 UTC (permalink / raw)
To: Jan Hubicka
Cc: Xinliang David Li, Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org
2010/11/15 Jan Hubicka <hubicka@ucw.cz>:
>> For peak, FDO is the most effective option. It can boost performance
>> by 7-10% depending on the program. The options you suggested probably
>> won't make too big a dent. -funroll-loops can hurt performance
>> without profiling. More aggressive inlining, ipa-cp, unswitching etc.
>
> -funroll-loops was overall a 2.2% win on SPECint, and -funroll-all-loops 2.5%, the last
> time I noted down the SPECint results for this (that was in 2003, heh :)
> http://www.ucw.cz/~hubicka/papers/amd64/node4.html
>
>> enabled by -O3 may help a little, if at all. -ffast-math won't
>> help for integer benchmarks other than eon. Traditionally, -O3 helps
>> FP performance because of the loop transformations it enables, but this
>> won't be the case for gcc for now.
>
> Function inlining definitely helps. -O3 also implies vectorization and other stuff.
Indeed. You can look at the various testers at gcc.opensuse.org which compare
-O2 vs. -O3 but also -O3 vs. -O3 -funroll-loops (and other things) to get an idea
of what helps and what does not.
Richard.
> Honza
>>
>> Thanks,
>>
>> David
>>
>> On Mon, Nov 15, 2010 at 4:29 AM, Andrey Belevantsev <abel@ispras.ru> wrote:
>> > Hello,
>> >
>> > On 14.11.2010 0:08, Xinliang David Li wrote:
>> >>
>> >> I re-measured the performance difference using trunk gcc and trunk
>> >> clang/llvm on a core-2 box. -fno-strict-aliasing is added to gcc
>> >> because clang/llvm's type based aliasing is not complete and not
>> >> enabled by default. I also added -fomit-frame-pointer to clang/llvm as
>> >> this is gcc's default. The base option is -O2.
>> >
>> > It would be very interesting to compare also peak numbers, i.e. with LTO and
>> > strict aliasing enabled, as well as -O3 and -ffast-math/-funroll-loops,
>> > similar to Vlad's or OpenSUSE's options. Can you try to measure these?
>> > Maybe you can also run SPEC2k6, if there are enough machine resources, but
>> > that's probably asking too much...
>> >
>> > Andrey
>> >
>> >
>
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-15 18:31 ` Jan Hubicka
2010-11-15 22:25 ` Richard Guenther
@ 2010-11-15 22:47 ` Xinliang David Li
2010-11-15 23:06 ` Jan Hubicka
1 sibling, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-11-15 22:47 UTC (permalink / raw)
To: Jan Hubicka; +Cc: Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org
I did some measurement (64bit).
Experiment 1:
-O2 -funroll-loops vs -O2
It improves performance (geomean) by 0.56%, not too much:
benchmark     O2    O2 unroll-loops  change
164.gzip      1324  1331              0.56%
175.vpr       1694  1605             -5.24%
176.gcc       2293  2350              2.47%
181.mcf       1772  1788              0.90%
186.crafty    2320  2326              0.26%
197.parser    1166  1162             -0.32%
252.eon       2443  2529              3.50%
253.perlbmk   2410  2460              2.07%
254.gap       1987  2019              1.58%
255.vortex    2392  2406              0.58%
256.bzip2     1719  1715             -0.25%
300.twolf     2288  2308              0.88%
Experiment 2: O3 vs O2:
The improvement on SPEC2k is larger than on the large internal programs
tested -- geomean 2.38%.
benchmark     O2    O3    change
164.gzip      1324  1329   0.40%
175.vpr       1694  1700   0.31%
176.gcc       2293  2336   1.89%
181.mcf       1772  1739  -1.81%
186.crafty    2320  2323   0.14%
197.parser    1166  1252   7.39%
252.eon       2443  2645   8.23%
253.perlbmk   2410  2452   1.74%
254.gap       1987  2020   1.62%
255.vortex    2392  2473   3.39%
256.bzip2     1719  1766   2.74%
300.twolf     2288  2350   2.70%
Experiment 3: O2 lto vs O2: geomean 0.72%
benchmark     O2    O2 LTO  change
164.gzip      1324  1317    -0.53%
175.vpr       1694  1697     0.18%
176.gcc       2293  2291    -0.08%
181.mcf       1772  1760    -0.65%
186.crafty    2320  2245    -3.26%
197.parser    1166  1163    -0.29%
252.eon       2443  2576     5.44%
253.perlbmk   2410  2433     0.93%
254.gap       1987  1995     0.36%
255.vortex    2392  2588     8.19%
256.bzip2     1719  1729     0.56%
300.twolf     2288  2248    -1.77%
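The geomean figures for Experiments 1 and 3 can be cross-checked from the per-benchmark scores. A minimal Python sketch, assuming the geomean is taken over the per-benchmark ratios of new score to -O2 score (the thread does not spell out how it was computed); the function name is made up for illustration:

```python
import math

# SPECint2000 scores copied from Experiments 1 and 3:
# benchmark -> (O2, O2 unroll-loops, O2 LTO)
SCORES = {
    "164.gzip":    (1324, 1331, 1317),
    "175.vpr":     (1694, 1605, 1697),
    "176.gcc":     (2293, 2350, 2291),
    "181.mcf":     (1772, 1788, 1760),
    "186.crafty":  (2320, 2326, 2245),
    "197.parser":  (1166, 1162, 1163),
    "252.eon":     (2443, 2529, 2576),
    "253.perlbmk": (2410, 2460, 2433),
    "254.gap":     (1987, 2019, 1995),
    "255.vortex":  (2392, 2406, 2588),
    "256.bzip2":   (1719, 1715, 1729),
    "300.twolf":   (2288, 2308, 2248),
}

def geomean_gain(pairs):
    """Percent gain of the geometric mean of new/base score ratios."""
    logs = [math.log(new / base) for base, new in pairs]
    return (math.exp(sum(logs) / len(logs)) - 1.0) * 100.0

unroll = geomean_gain([(o2, unr) for o2, unr, _ in SCORES.values()])
lto = geomean_gain([(o2, l) for o2, _, l in SCORES.values()])
print(f"-funroll-loops vs -O2: {unroll:+.2f}%")
print(f"-flto vs -O2:          {lto:+.2f}%")
```

A geometric mean of ratios weights a 5% loss on one benchmark the same as a 5% gain on another, which is why the vpr regression nearly cancels the eon and gcc gains in Experiment 1.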
David
On Mon, Nov 15, 2010 at 9:54 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> For peak, FDO is the most effective option. It can boost performance
>> by 7-10% depending on the program. The options you suggested probably
>> won't make too big a dent. -funroll-loops can hurt performance
>> without profiling. More aggressive inlining, ipa-cp, unswitching etc.
>
> -funroll-loops was overall a 2.2% win on SPECint, and -funroll-all-loops 2.5%, the last
> time I noted down the SPECint results for this (that was in 2003, heh :)
> http://www.ucw.cz/~hubicka/papers/amd64/node4.html
>
>> enabled by -O3 may help a little, if at all. -ffast-math won't
>> help for integer benchmarks other than eon. Traditionally, -O3 helps
>> FP performance because of the loop transformations it enables, but this
>> won't be the case for gcc for now.
>
> Function inlining definitely helps. -O3 also implies vectorization and other stuff.
>
> Honza
>>
>> Thanks,
>>
>> David
>>
>> On Mon, Nov 15, 2010 at 4:29 AM, Andrey Belevantsev <abel@ispras.ru> wrote:
>> > Hello,
>> >
>> > On 14.11.2010 0:08, Xinliang David Li wrote:
>> >>
>> >> I re-measured the performance difference using trunk gcc and trunk
>> >> clang/llvm on a core-2 box. -fno-strict-aliasing is added to gcc
>> >> because clang/llvm's type based aliasing is not complete and not
>> >> enabled by default. I also added -fomit-frame-pointer to clang/llvm as
>> >> this is gcc's default. The base option is -O2.
>> >
>> > It would be very interesting to compare also peak numbers, i.e. with LTO and
>> > strict aliasing enabled, as well as -O3 and -ffast-math/-funroll-loops,
>> > similar to Vlad's or OpenSUSE's options. Can you try to measure these?
>> > Maybe you can also run SPEC2k6, if there are enough machine resources, but
>> > that's probably asking too much...
>> >
>> > Andrey
>> >
>> >
>
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-15 22:47 ` Xinliang David Li
@ 2010-11-15 23:06 ` Jan Hubicka
2010-11-16 0:41 ` Xinliang David Li
0 siblings, 1 reply; 68+ messages in thread
From: Jan Hubicka @ 2010-11-15 23:06 UTC (permalink / raw)
To: Xinliang David Li
Cc: Jan Hubicka, Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org
> I did some measurement (64bit).
>
> Experiment 1:
>
> -O2 -funroll-loops vs -O2
>
> It improves performance (geomean) by 0.56%, not too much:
> O2 O2 unroll-loops
> 164.gzip 1324 1331 0.56%
> 175.vpr 1694 1605 -5.24%
> 176.gcc 2293 2350 2.47%
> 181.mcf 1772 1788 0.90%
> 186.crafty 2320 2326 0.26%
> 197.parser 1166 1162 -0.32%
> 252.eon 2443 2529 3.50%
> 253.perlbmk 2410 2460 2.07%
> 254.gap 1987 2019 1.58%
> 255.vortex 2392 2406 0.58%
> 256.bzip2 1719 1715 -0.25%
> 300.twolf 2288 2308 0.88%
Can you also try -funroll-all-loops? For pretty small programs like
spec2k, -funroll-all-loops is often a win. Only in a few loops can we work out
the number of iterations.
>
> Experiment 3: O2 lto vs O2: geomean 0.72%
> O2 O2 LTO
> 164.gzip 1324 1317 -0.53%
> 175.vpr 1694 1697 0.18%
> 176.gcc 2293 2291 -0.08%
> 181.mcf 1772 1760 -0.65%
> 186.crafty 2320 2245 -3.26%
> 197.parser 1166 1163 -0.29%
> 252.eon 2443 2576 5.44%
> 253.perlbmk 2410 2433 0.93%
> 254.gap 1987 1995 0.36%
> 255.vortex 2392 2588 8.19%
> 256.bzip2 1719 1729 0.56%
> 300.twolf 2288 2248 -1.77%
You need -O3 -fwhole-program -flto for reasonable cross-module inlining to happen.
-fwhole-program is quite essential to get a reasonable win from LTO (w/o profile feedback).
At least our nightly tester then gets quite nice improvements on a few benchmarks in spec2k,
see also my gccsummit slides.
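For reference, the flag combinations compared so far can be written down in one place. A small Python sketch that only builds the command lines and does not invoke gcc; the source file names and the output name are hypothetical, while the flags are the ones named in this thread:

```python
import shlex

SOURCES = ["a.c", "b.c"]  # hypothetical source files, for illustration only

# Configuration name -> optimization flags discussed in the thread.
CONFIGS = {
    "O2":              ["-O2"],
    "O2 unroll-loops": ["-O2", "-funroll-loops"],
    "O3":              ["-O3"],
    "O2 LTO":          ["-O2", "-flto"],
    "O3 LTO whole":    ["-O3", "-flto", "-fwhole-program"],
}

def gcc_command(flags, sources=SOURCES, output="prog"):
    """Return a single-step compile-and-link command as an argv list."""
    return ["gcc", *flags, *sources, "-o", output]

for name, flags in CONFIGS.items():
    print(f"{name:16s} {shlex.join(gcc_command(flags))}")
```

The single-step form keeps the sketch short; in a real build with separate compile and link steps, -flto would have to be given at both.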
Honza
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-15 23:06 ` Jan Hubicka
@ 2010-11-16 0:41 ` Xinliang David Li
2010-11-16 0:53 ` Xinliang David Li
0 siblings, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-11-16 0:41 UTC (permalink / raw)
To: Jan Hubicka; +Cc: Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org
Just measured: lto +O3 improves over O2 by a decent 4.8% geomean. More
data to come later.
benchmark     O2    O3 LTO  change
164.gzip      1324  1322    -0.10%
175.vpr       1694  1703     0.51%
176.gcc       2293  2347     2.34%
181.mcf       1772  1797     1.43%
186.crafty    2320  2486     7.12%
197.parser    1166  1236     6.02%
252.eon       2443  2810    14.98%
253.perlbmk   2410  2407    -0.16%
254.gap       1987  2024     1.82%
255.vortex    2392  2826    18.13%
256.bzip2     1719  1760     2.38%
300.twolf     2288  2394     4.63%
David
On Mon, Nov 15, 2010 at 2:38 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> I did some measurement (64bit).
>>
>> Experiment 1:
>>
>> -O2 -funroll-loops vs -O2
>>
>> It improves performance (geomean) by 0.56%, not too much:
>> O2 O2 unroll-loops
>> 164.gzip 1324 1331 0.56%
>> 175.vpr 1694 1605 -5.24%
>> 176.gcc 2293 2350 2.47%
>> 181.mcf 1772 1788 0.90%
>> 186.crafty 2320 2326 0.26%
>> 197.parser 1166 1162 -0.32%
>> 252.eon 2443 2529 3.50%
>> 253.perlbmk 2410 2460 2.07%
>> 254.gap 1987 2019 1.58%
>> 255.vortex 2392 2406 0.58%
>> 256.bzip2 1719 1715 -0.25%
>> 300.twolf 2288 2308 0.88%
>
> Can you also try -funroll-all-loops? For pretty small programs like
> spec2k, -funroll-all-loops is often a win. Only in a few loops can we work out
> the number of iterations.
>
>>
>> Experiment 3: O2 lto vs O2: geomean 0.72%
>> O2 O2 LTO
>> 164.gzip 1324 1317 -0.53%
>> 175.vpr 1694 1697 0.18%
>> 176.gcc 2293 2291 -0.08%
>> 181.mcf 1772 1760 -0.65%
>> 186.crafty 2320 2245 -3.26%
>> 197.parser 1166 1163 -0.29%
>> 252.eon 2443 2576 5.44%
>> 253.perlbmk 2410 2433 0.93%
>> 254.gap 1987 1995 0.36%
>> 255.vortex 2392 2588 8.19%
>> 256.bzip2 1719 1729 0.56%
>> 300.twolf 2288 2248 -1.77%
>
> You need -O3 -fwhole-program -flto for reasonable cross-module inlining to happen.
> -fwhole-program is quite essential to get a reasonable win from LTO (w/o profile feedback).
>
> At least our nightly tester then gets quite nice improvements on a few benchmarks in spec2k,
> see also my gccsummit slides.
>
> Honza
>
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-16 0:41 ` Xinliang David Li
@ 2010-11-16 0:53 ` Xinliang David Li
2010-11-16 1:02 ` Jan Hubicka
0 siblings, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-11-16 0:53 UTC (permalink / raw)
To: Jan Hubicka; +Cc: Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org
This means O3 level inlining should be turned on also for lto build by
default -- as -O2 lto performance is too unimpressive.
David
On Mon, Nov 15, 2010 at 3:36 PM, Xinliang David Li <davidxl@google.com> wrote:
> Just measured: lto +O3 improves over O2 by a decent 4.8% geomean. More
> data to come later.
>
> 164.gzip 1324 1322 -0.10%
> 175.vpr 1694 1703 0.51%
> 176.gcc 2293 2347 2.34%
> 181.mcf 1772 1797 1.43%
> 186.crafty 2320 2486 7.12%
> 197.parser 1166 1236 6.02%
> 252.eon 2443 2810 14.98%
> 253.perlbmk 2410 2407 -0.16%
> 254.gap 1987 2024 1.82%
> 255.vortex 2392 2826 18.13%
> 256.bzip2 1719 1760 2.38%
> 300.twolf 2288 2394 4.63%
>
>
> David
>
>
> On Mon, Nov 15, 2010 at 2:38 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> I did some measurement (64bit).
>>>
>>> Experiment 1:
>>>
>>> -O2 -funroll-loops vs -O2
>>>
>>> It improves performance (geomean) by 0.56%, not too much:
>>> O2 O2 unroll-loops
>>> 164.gzip 1324 1331 0.56%
>>> 175.vpr 1694 1605 -5.24%
>>> 176.gcc 2293 2350 2.47%
>>> 181.mcf 1772 1788 0.90%
>>> 186.crafty 2320 2326 0.26%
>>> 197.parser 1166 1162 -0.32%
>>> 252.eon 2443 2529 3.50%
>>> 253.perlbmk 2410 2460 2.07%
>>> 254.gap 1987 2019 1.58%
>>> 255.vortex 2392 2406 0.58%
>>> 256.bzip2 1719 1715 -0.25%
>>> 300.twolf 2288 2308 0.88%
>>
>> Can you also try -funroll-all-loops? For pretty small programs like
>> spec2k, -funroll-all-loops is often a win. Only in a few loops can we work out
>> the number of iterations.
>>
>>>
>>> Experiment 3: O2 lto vs O2: geomean 0.72%
>>> O2 O2 LTO
>>> 164.gzip 1324 1317 -0.53%
>>> 175.vpr 1694 1697 0.18%
>>> 176.gcc 2293 2291 -0.08%
>>> 181.mcf 1772 1760 -0.65%
>>> 186.crafty 2320 2245 -3.26%
>>> 197.parser 1166 1163 -0.29%
>>> 252.eon 2443 2576 5.44%
>>> 253.perlbmk 2410 2433 0.93%
>>> 254.gap 1987 1995 0.36%
>>> 255.vortex 2392 2588 8.19%
>>> 256.bzip2 1719 1729 0.56%
>>> 300.twolf 2288 2248 -1.77%
>>
>> You need -O3 -fwhole-program -flto for reasonable cross-module inlining to happen.
>> -fwhole-program is quite essential to get a reasonable win from LTO (w/o profile feedback).
>>
>> At least our nightly tester then gets quite nice improvements on a few benchmarks in spec2k,
>> see also my gccsummit slides.
>>
>> Honza
>>
>
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-16 0:53 ` Xinliang David Li
@ 2010-11-16 1:02 ` Jan Hubicka
2010-11-16 1:19 ` Jan Hubicka
2010-11-16 1:24 ` Xinliang David Li
0 siblings, 2 replies; 68+ messages in thread
From: Jan Hubicka @ 2010-11-16 1:02 UTC (permalink / raw)
To: Xinliang David Li
Cc: Jan Hubicka, Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org
> This means O3 level inlining should be turned on also for lto build by
> default -- as -O2 lto performance is too unimpressive.
I am just re-tuning the inliner and hope to get more speedups for smaller
costs than we get right now. I however don't think we can reasonably enable it
as it is at LTO with -O2. We sort of declare that -O2 is the level where the
compiler optimizes hard without bloating code size. Automatic inlining bloats a
lot. Enabling it at -O2 will make developers who care about code size unhappy.
Can you, please, try -O2 -fwhole-program, too?
Testing Firefox I however noticed that enabling inlining and --param
inline-unit-growth=5 gets most of the speedups from inlining at very little cost in
code size (in fact code size gets smaller on Firefox because of better
optimization). This is sort of logical: when not doing LTO, limiting unit
growth in each separate compilation unit loses, since the inliner has too little
freedom (some units require a lot of unit growth to compile well, while most
units won't need it at all).
When doing LTO however the inliner can use the space constraint more reasonably.
I am wondering what to do here - I just found that pushing unit growth down from
30% to 15% hurts some benchmarks (like tramp3d). I guess we will need to make
unit growth depend on unit size somehow: at the moment we bypass the unit growth limit
for very tiny units via the large-unit-insns parameter, but this is not good enough.
For medium-sized units we need growth as big as 30%, for large units we need 5%.
I guess I can either define very-large-unit-growth and very-large-unit-insns
to jump down in growth at some point, or define the growth to be a function of 1/size.
Do we know of better alternatives?
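The two options could look roughly like this. A Python sketch, not GCC code: only the 30% and 5% figures come from the text above; the instruction-count threshold and the scale constant are made-up values, and very-large-unit-insns / very-large-unit-growth are the proposed parameters, not existing ones:

```python
def step_growth(unit_insns, very_large_unit_insns=100_000,
                unit_growth=30.0, very_large_unit_growth=5.0):
    """Option 1: allowed growth (percent) jumps down past a size threshold.

    The 100_000 threshold is a hypothetical value for illustration.
    """
    if unit_insns >= very_large_unit_insns:
        return very_large_unit_growth
    return unit_growth

def inverse_growth(unit_insns, scale=300_000.0, lo=5.0, hi=30.0):
    """Option 2: allowed growth (percent) proportional to 1/size, clamped.

    The scale constant is hypothetical; lo and hi are the 5% and 30% above.
    """
    return max(lo, min(hi, scale / unit_insns))

for insns in (5_000, 30_000, 200_000):
    print(insns, step_growth(insns), inverse_growth(insns))
```

The step version has a cliff at the threshold, so two units of nearly the same size can get very different budgets; the 1/size version degrades smoothly between the two limits.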
Enabling such extensively trimmed-down automatic inlining at -O2 IMO can make
sense if we can prove it makes binaries of about the same size and brings
noticeable speedups.
After all, we want LTO to sell well - most people will probably repeat the
mistake you made and try it at -O2 without -fwhole-program. The latter I am hoping to
fight by enabling -fuse-linker-plugin by default, as discussed at the summit
(that has similar effects to -fwhole-program code-quality-wise even if the underlying
implementation is different).
Honza
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-16 1:02 ` Jan Hubicka
@ 2010-11-16 1:19 ` Jan Hubicka
2010-11-16 1:24 ` Xinliang David Li
1 sibling, 0 replies; 68+ messages in thread
From: Jan Hubicka @ 2010-11-16 1:19 UTC (permalink / raw)
To: Jan Hubicka
Cc: Xinliang David Li, Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org
> > This means O3 level inlining should be turned on also for lto build by
> > default -- as -O2 lto performance is too unimpressive.
>
> I am just re-tuning the inliner and hope to get more speedups for smaller
> costs than we get right now. I however don't think we can reasonably enable it
> as it is at LTO with -O2. We sort of declare that -O2 is the level where the
> compiler optimizes hard without bloating code size. Automatic inlining bloats a
> lot. Enabling it at -O2 will make developers who care about code size unhappy.
>
> Can you, please, try -O2 -fwhole-program, too?
Also, for my code size work, it would be great if you also tracked the sizes of the
stripped binaries in your tests ;)
Honza
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-16 1:02 ` Jan Hubicka
2010-11-16 1:19 ` Jan Hubicka
@ 2010-11-16 1:24 ` Xinliang David Li
2010-11-16 1:39 ` Jan Hubicka
1 sibling, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-11-16 1:24 UTC (permalink / raw)
To: Jan Hubicka
Cc: Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org, Mark Heffernan
On Mon, Nov 15, 2010 at 4:25 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> This means O3 level inlining should be turned on also for lto build by
>> default -- as -O2 lto performance is too unimpressive.
>
> I am just re-tuning the inliner and hope to get more speedups for smaller
> costs than we get right now. I however don't think we can reasonably enable it
> as it is at LTO with -O2. We sort of declare that -O2 is the level where the
> compiler optimizes hard without bloating code size. Automatic inlining bloats a
> lot. Enabling it at -O2 will make developers who care about code size unhappy.
Looks like you want to brand LTO as a size optimization technology
more than a performance one :) -- is that the right promotion for lto?
If more people care about performance, then the default should be
tuned toward it. For size optimization, use -Os -flto.
>
> Can you, please, try -O2 -fwhole-program, too?
Too many experiments -- but sure, I can do it.
>
> Testing Firefox I however noticed that enabling inlining and --param
> inline-unit-growth=5 gets most of the speedups from inlining at very little cost in
> code size (in fact code size gets smaller on Firefox because of better
> optimization). This is sort of logical: when not doing LTO, limiting unit
> growth in each separate compilation unit loses, since the inliner has too little
> freedom (some units require a lot of unit growth to compile well, while most
> units won't need it at all).
Yes, that is what I call an adaptive budget -- better with profiling.
> When doing LTO however the inliner can use the space constraint more reasonably.
>
Yes -- a global decision can be made.
> I am wondering what to do here - I just found that pushing unit growth down from
> 30% to 15% hurts some benchmarks (like tramp3d). I guess we will need to make
> unit growth depend on unit size somehow:
yes.
>at the moment we bypass the unit growth limit
> for very tiny units via the large-unit-insns parameter, but this is not good enough.
> For medium-sized units we need growth as big as 30%, for large units we need 5%.
> I guess I can either define very-large-unit-growth and very-large-unit-insns
> to jump down in growth at some point, or define the growth to be a function of 1/size.
> Do we know of better alternatives?
>
Mark can provide some suggestions -- he has many inliner patches
related to performance/size trade off.
> Enabling such extensively trimmed-down automatic inlining at -O2 IMO can make
> sense if we can prove it makes binaries of about the same size and brings
> noticeable speedups.
> After all, we want LTO to sell well - most people will probably repeat the
> mistake you made and try it at -O2 without -fwhole-program. The latter I am hoping to
> fight by enabling -fuse-linker-plugin by default, as discussed at the summit
> (that has similar effects to -fwhole-program code-quality-wise even if the underlying
> implementation is different).
>
I don't think that is a mistake -- a large percent of people will
likely not (be able to) use -fwhole-program for various reasons -- for
instance shared library build, partially available source, option
limitations etc. It is therefore more (at least equally) important to
sell lto without -fwhole-program.
Thanks,
David
> Honza
>
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-16 1:24 ` Xinliang David Li
@ 2010-11-16 1:39 ` Jan Hubicka
2010-11-16 1:45 ` Xinliang David Li
0 siblings, 1 reply; 68+ messages in thread
From: Jan Hubicka @ 2010-11-16 1:39 UTC (permalink / raw)
To: Xinliang David Li
Cc: Jan Hubicka, Andrey Belevantsev, Vladimir Makarov,
gcc.gcc.gnu.org, Mark Heffernan
> On Mon, Nov 15, 2010 at 4:25 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >> This means O3 level inlining should be turned on also for lto build by
> >> default -- as -O2 lto performance is too unimpressive.
> >
> > I am just re-tuning the inliner and hope to get more speedups for smaller
> > costs than we get right now. I however don't think we can reasonably enable it
> > as it is at LTO with -O2. We sort of declare that -O2 is the level where the
> > compiler optimizes hard without bloating code size. Automatic inlining bloats a
> > lot. Enabling it at -O2 will make developers who care about code size unhappy.
>
> Looks like you want to brand LTO as a size optimization technology
> more than a performance one :) -- is that the right promotion for lto?
No, I don't want to brand it as a size optimization. I do however want -O2 -flto to
be the right setting for compiling the majority of programs in your distro to give you good
overall system performance.
The size matters here - you don't want to bloat your distro by 20% to get 4% of
performance benefit when everything is in cache, since overall performance will
likely degrade. If this was desirable, everyone would be using -O3 already. In
my tests (without -fwhole-program) -O3 +LTO bloats code even more than -O3
alone.
-Ofast is for those who want maximal performance now and -O3 for those who
don't care about size but are old fashioned and afraid to use -Ofast ;)
> >at the moment we bypass the unit growth limit
> > for very tiny units via the large-unit-insns parameter, but this is not good enough.
> > For medium-sized units we need growth as big as 30%, for large units we need 5%.
> > I guess I can either define very-large-unit-growth and very-large-unit-insns
> > to jump down in growth at some point, or define the growth to be a function of 1/size.
> > Do we know of better alternatives?
> >
>
> Mark can provide some suggestions -- he has many inliner patches
> related to performance/size trade off.
I would definitely be interested to see them, too.
>
> > Enabling such extensively trimmed-down automatic inlining at -O2 IMO can make
> > sense if we can prove it makes binaries of about the same size and brings
> > noticeable speedups.
> > After all, we want LTO to sell well - most people will probably repeat the
> > mistake you made and try it at -O2 without -fwhole-program. The latter I am hoping to
> > fight by enabling -fuse-linker-plugin by default, as discussed at the summit
> > (that has similar effects to -fwhole-program code-quality-wise even if the underlying
> > implementation is different).
> >
>
> I don't think that is a mistake -- a large percent of people will
> likely not (be able to) use -fwhole-program for various reasons -- for
> instance shared library build, partially available source, option
> limitations etc. It is therefore more (at least equally) important to
Fortunately the linker plugin solves the problem here and this is why I want to
have it by default. GCC can then effectively do -fwhole-program for binaries
(since the linker knows what will be bound elsewhere) and take advantage of
visibility((hidden)) hints for shared libraries the same way. Most of the important
shared libraries get visibility((hidden)) right.
It is sad that LTO w/o the linker plugin doesn't give that much benefit.
Ideas are welcome here.
Honza
> sell lto without -fwhole-program.
>
> Thanks,
>
> David
>
> > Honza
> >
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-16 1:39 ` Jan Hubicka
@ 2010-11-16 1:45 ` Xinliang David Li
2010-11-16 4:11 ` Jan Hubicka
0 siblings, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-11-16 1:45 UTC (permalink / raw)
To: Jan Hubicka
Cc: Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org, Mark Heffernan
> Fortunately the linker plugin solves the problem here and this is why I want to
> have it by default. GCC can then effectively do -fwhole-program for binaries
> (since the linker knows what will be bound elsewhere) and take advantage of
> visibility((hidden)) hints for shared libraries the same way. Most of the important
> shared libraries get visibility((hidden)) right.
>
> It is sad that LTO w/o the linker plugin doesn't give that much benefit.
> Ideas are welcome here.
Linker feedback will be limited here -- it mostly helps global variable
aliasing (as I remember only 2/3 spec programs benefit from it).
You don't get whole program points-to, whole program mod-ref
(with context sensitivity), whole program structure layout. The latter
are the real kickers (in terms of SPEC performance), but promoting LTO
with those numbers can be misleading as many programs won't get it.
David
>
> Honza
>> sell lto without -fwhole-program.
>>
>> Thanks,
>>
>> David
>>
>> > Honza
>> >
>
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-16 1:45 ` Xinliang David Li
@ 2010-11-16 4:11 ` Jan Hubicka
2010-11-16 6:56 ` Xinliang David Li
0 siblings, 1 reply; 68+ messages in thread
From: Jan Hubicka @ 2010-11-16 4:11 UTC (permalink / raw)
To: Xinliang David Li
Cc: Jan Hubicka, Andrey Belevantsev, Vladimir Makarov,
gcc.gcc.gnu.org, Mark Heffernan
> > Fortunately the linker plugin solves the problem here and this is why I want to
> > have it by default. GCC can then effectively do -fwhole-program for binaries
> > (since the linker knows what will be bound elsewhere) and take advantage of
> > visibility((hidden)) hints for shared libraries the same way. Most of the important
> > shared libraries get visibility((hidden)) right.
> >
> > It is sad that LTO w/o the linker plugin doesn't give that much benefit.
> > Ideas are welcome here.
>
> Linker feedback will be limited here -- it mostly helps global variable
> aliasing (as I remember only 2/3 spec programs benefit from it).
> You don't get whole program points-to, whole program mod-ref
> (with context sensitivity), whole program structure layout. The latter
> are the real kickers (in terms of SPEC performance), but promoting LTO
> with those numbers can be misleading as many programs won't get it.
Well, I am speaking of our linker plugin here. What it does is pass GCC
resolution information so it knows what symbols are bound externally. Since
typically you link LTO alone or with a small non-LTO part, most symbols are
not bound and thus effectively you get -fwhole-program (-fwhole-program just
declares everything static except for main ())
We don't really do whole program points-to or structure layout. Mod-ref is just
simple ipa-reference code. How do you get context sensitivity on mod/ref?
Honza
>
> David
>
> >
> > Honza
> >> sell lto without -fwhole-program.
> >>
> >> Thanks,
> >>
> >> David
> >>
> >> > Honza
> >> >
> >
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-16 4:11 ` Jan Hubicka
@ 2010-11-16 6:56 ` Xinliang David Li
2010-11-16 8:26 ` Jan Hubicka
0 siblings, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-11-16 6:56 UTC (permalink / raw)
To: Jan Hubicka
Cc: Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org, Mark Heffernan
On Mon, Nov 15, 2010 at 5:39 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > Fortunately the linker plugin solves the problem here and this is why I want to
>> > have it by default. GCC can then effectively do -fwhole-program for binaries
>> > (since the linker knows what will be bound elsewhere) and take advantage of
>> > visibility((hidden)) hints for shared libraries the same way. Most of the important
>> > shared libraries get visibility((hidden)) right.
>> >
>> > It is sad that LTO w/o the linker plugin doesn't give that much benefit.
>> > Ideas are welcome here.
>>
>> Linker feedback will be limited here -- it mostly helps global variable
>> aliasing (as I remember only 2/3 spec programs benefit from it).
>> You don't get whole program points-to, whole program mod-ref
>> (with context sensitivity), whole program structure layout. The latter
>> are the real kickers (in terms of SPEC performance), but promoting LTO
>> with those numbers can be misleading as many programs won't get it.
>
> Well, I am speaking of our linker plugin here. What it does is pass GCC
> resolution information so it knows what symbols are bound externally. Since
> typically you link LTO alone or with a small non-LTO part, most symbols are
> not bound and thus effectively you get -fwhole-program (-fwhole-program just
> declares everything static except for main ())
>
> We don't really do whole program points-to or structure layout.
gcc will eventually, right?
> Mod-ref is just
> simple ipa-reference code. How do you get context sensitivity on mod/ref?
mod-ref relies on points-to. With context sensitive points-to, you can
also get CS mod-ref -- basically mod-ref info per callsite.
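As a toy illustration of the difference (plain Python, not how GCC represents any of this; the callsite names and variables are made up): one callee that writes through its pointer argument, called once with &a and once with &b.

```python
# Toy model: a callee store(p) that writes *p, called from two sites.
# Each callsite maps to what its pointer argument may point to.
CALLSITES = {
    "site1": {"a"},  # store(&a)
    "site2": {"b"},  # store(&b)
}

def insensitive_mod(callsites):
    """One summary for the callee: the union over all callers' arguments.

    Every call is then assumed to modify everything in that union.
    """
    merged = set()
    for targets in callsites.values():
        merged |= targets
    return {site: set(merged) for site in callsites}

def sensitive_mod(callsites):
    """One summary per callsite: only what that site's argument points to."""
    return {site: set(targets) for site, targets in callsites.items()}

print("context-insensitive:", insensitive_mod(CALLSITES))
print("context-sensitive:  ", sensitive_mod(CALLSITES))
```

With the per-callsite summary, a value of b held in a register across the call at site1 does not need to be reloaded, which the merged summary cannot prove.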
Thanks,
David
>
> Honza
>>
>> David
>>
>> >
>> > Honza
>> >> sell lto without -fwhole-program.
>> >>
>> >> Thanks,
>> >>
>> >> David
>> >>
>> >> > Honza
>> >> >
>> >
>
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-16 6:56 ` Xinliang David Li
@ 2010-11-16 8:26 ` Jan Hubicka
2010-11-16 9:00 ` Xinliang David Li
2010-11-16 15:43 ` Richard Guenther
0 siblings, 2 replies; 68+ messages in thread
From: Jan Hubicka @ 2010-11-16 8:26 UTC (permalink / raw)
To: Xinliang David Li
Cc: Jan Hubicka, Andrey Belevantsev, Vladimir Makarov,
gcc.gcc.gnu.org, Mark Heffernan
> On Mon, Nov 15, 2010 at 5:39 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >> > Fortunately linker plugin solves the problem here and this is why I want to
> >> > have it by default. GCC then can do effectively -fwhole-program for binaries
> >> > (since linker knows what will be bound elsewhere) and take advantage of
> >> > visibility((hidden)) hints for shared libraries same way. Most of important
> >> > shared libraries gets visibility ((hidden)) right.
> >> >
> >> > It is sad that LTO w/o linker plugin doesn't give that much benefits.
> >> > Ideas are welcome here.
> >>
> >> Linker feedback will be limited here -- mostly global variable
> >> aliasing (as I remember only 2/3 spec programs benefit from it), it
> >> helps You don't get whole program points-to, whole program mod-ref
> >> (with context sensitivity), whole program structure layout. The latter
> >> are the real kickers (in terms of SPEC performance), but promoting LTO
> >> with those numbers can be misleading as many programs won't get it.
> >
> > Well, I am speaking of our linker plugin here. What it does is to pass GCC
> > resolution information so it knows what symbols are bound externally. Since
> > typically you link LTO alone or with small non-LTO part, most of symbols are
> > not bound and thus effectively you get -fwhole-program (-fwhole-program just
> > declares everything static except for main ())
> >
> > We don't really do whole program points-to or structure layout.
>
> gcc will eventually, right?
Sure hope so ;)
We really need to solve scalability with our IPA points-to and make it
compatible with WHOPR.
>
> > Mod-ref is just
> > simple ipa-reference code. How you get context sensitivity on mod/ref?
>
> mod-ref relies on points-to. With context sensitive points-to, you can
> also get CS mod-ref -- basically mod-ref info per callsite.
Ah sure, I was too focused on our current "mod/ref" :)
Honza
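
The linker-plugin idea discussed above (the linker tells GCC which symbols are bound from outside the LTO unit, and everything else can be treated as if it were static) can be sketched with a toy model. This is an illustration only: the `Symbol` fields and `can_localize` are hypothetical names, not the plugin API or GCC internals.

```python
from dataclasses import dataclass


@dataclass
class Symbol:
    name: str
    defined_in_lto_unit: bool  # the definition comes from an LTO object
    referenced_outside: bool   # hypothetical flag: something outside the LTO unit binds to it


def can_localize(sym: Symbol) -> bool:
    """A symbol may be treated like `static` when it is defined inside the
    LTO unit and nothing outside that unit binds to it."""
    return sym.defined_in_lto_unit and not sym.referenced_outside


symbols = [
    Symbol("main", True, True),               # entry point, referenced by startup code
    Symbol("helper", True, False),            # only used inside the LTO unit
    Symbol("exported_callback", True, True),  # bound by a non-LTO object
    Symbol("printf", False, True),            # defined elsewhere
]

localizable = [s.name for s in symbols if can_localize(s)]
print(localizable)
```

Without resolution information every symbol in the list would have to be kept externally visible, which is why plain LTO without the plugin or -fwhole-program gains less.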
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-16 8:26 ` Jan Hubicka
@ 2010-11-16 9:00 ` Xinliang David Li
2010-11-16 14:23 ` Xinliang David Li
2010-11-16 15:43 ` Richard Guenther
1 sibling, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-11-16 9:00 UTC (permalink / raw)
To: Jan Hubicka
Cc: Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org, Mark Heffernan
More performance data:
-O2 -funroll-all-loops vs O2: +1.1% geomean
benchmark  O2  O2 + unroll-all-loops  change
164.gzip 1324 1336 0.94%
175.vpr 1694 1670 -1.44%
176.gcc 2293 2353 2.60%
181.mcf 1772 1793 1.20%
186.crafty 2320 2300 -0.86%
197.parser 1166 1171 0.39%
252.eon 2443 2515 2.93%
253.perlbmk 2410 2250 -6.66%
254.gap 1987 2041 2.68%
255.vortex 2392 2411 0.78%
256.bzip2 1719 1806 5.08%
300.twolf 2288 2436 6.44%
-O3 -flto -fwhole-program vs -O2: geomean +6% (-fwhole-program adds ~1%)
164.gzip 1324 1318 -0.45%
175.vpr 1694 1717 1.34%
176.gcc 2293 2359 2.88%
181.mcf 1772 1772 0.02%
186.crafty 2320 2526 8.86%
197.parser 1166 1248 7.04%
252.eon 2443 2898 18.59%
253.perlbmk 2410 2323 -3.62%
254.gap 1987 2039 2.58%
255.vortex 2392 2918 21.99%
256.bzip2 1719 1946 13.19%
300.twolf 2288 2342 2.34%
-O2 -flto -fwhole-program vs -O2: geomean +3.4%, mainly from three
programs: vortex, eon and bzip2.
164.gzip 1324 1313 -0.82%
175.vpr 1694 1659 -2.05%
176.gcc 2293 2300 0.30%
181.mcf 1772 1781 0.52%
186.crafty 2320 2327 0.30%
197.parser 1166 1188 1.92%
252.eon 2443 2664 9.00%
253.perlbmk 2410 2470 2.47%
254.gap 1987 1987 -0.02%
255.vortex 2392 2883 20.53%
256.bzip2 1719 1839 7.00%
300.twolf 2288 2365 3.34%
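
For anyone who wants to recompute the geomean figures from these tables, a minimal sketch follows. It uses the rounded scores from the first table (-O2 vs -O2 -funroll-all-loops), so the result can differ slightly from a figure computed on unrounded scores; `geomean_gain_percent` is an illustrative name.

```python
import math

# (baseline -O2 score, -O2 -funroll-all-loops score), copied from the first table.
scores = {
    "164.gzip": (1324, 1336),
    "175.vpr": (1694, 1670),
    "176.gcc": (2293, 2353),
    "181.mcf": (1772, 1793),
    "186.crafty": (2320, 2300),
    "197.parser": (1166, 1171),
    "252.eon": (2443, 2515),
    "253.perlbmk": (2410, 2250),
    "254.gap": (1987, 2041),
    "255.vortex": (2392, 2411),
    "256.bzip2": (1719, 1806),
    "300.twolf": (2288, 2436),
}


def geomean_gain_percent(pairs):
    """Geometric mean of new/base ratios, expressed as a percentage gain."""
    ratios = [new / base for base, new in pairs]
    # Average in log space, which is the usual way to compute a geomean.
    log_mean = sum(math.log(r) for r in ratios) / len(ratios)
    return (math.exp(log_mean) - 1.0) * 100.0


gain = geomean_gain_percent(scores.values())
print(f"{gain:+.2f}% geomean")
```

The geomean is used rather than the arithmetic mean of the percentage column so that a single large outlier (vortex or parser in the later tables) does not dominate the summary.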
Thanks,
David
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-16 9:00 ` Xinliang David Li
@ 2010-11-16 14:23 ` Xinliang David Li
2010-11-16 17:10 ` Jan Hubicka
2010-11-18 11:48 ` Xinliang David Li
0 siblings, 2 replies; 68+ messages in thread
From: Xinliang David Li @ 2010-11-16 14:23 UTC (permalink / raw)
To: Jan Hubicka
Cc: Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org, Mark Heffernan
More FDO related performance numbers
Experiment 1: trunk gcc O2 + FDO vs O2: FDO improves performance
by 5% geomean
Experiment 2: our internal gcc compiler (4.4.3 based with many local
patches) O2 + FDO vs O2 (trunk gcc): FDO improves perf by 6.6%
geomean
Experiment 3: our internal gcc (4.4.3 with local patches) O2 + LIPO vs
O2 (trunk gcc): LIPO improves by 12%
Experiment 4: trunk gcc O2 + LTO + fwhole-program + FDO vs O2: LTO +
FDO improves by 10.8%
1. Trunk gcc FDO vs O2 (5%)
164.gzip 1324 1302 -1.64%
175.vpr 1694 1725 1.84%
176.gcc 2293 2387 4.07%
181.mcf 1772 1756 -0.88%
186.crafty 2320 2280 -1.75%
197.parser 1166 1556 33.42%
252.eon 2443 2552 4.45%
253.perlbmk 2410 2586 7.28%
254.gap 1987 2021 1.71%
255.vortex 2392 2720 13.71%
256.bzip2 1719 1717 -0.12%
300.twolf 2288 2331 1.86%
2. 4.4.3 gcc with local patch FDO vs trunk O2 (6.6%)
164.gzip 1324 1317 -0.48%
175.vpr 1694 1758 3.76%
176.gcc 2293 2472 7.79%
181.mcf 1772 1730 -2.35%
186.crafty 2320 2353 1.40%
197.parser 1166 1652 41.70%
252.eon 2443 2610 6.82%
253.perlbmk 2410 2561 6.23%
254.gap 1987 1987 -0.04%
255.vortex 2392 2801 17.09%
256.bzip2 1719 1748 1.68%
300.twolf 2288 2335 2.04%
3. LIPO vs trunk O2 (12%)
164.gzip 1324 1350 1.99%
175.vpr 1694 1758 3.77%
176.gcc 2293 2519 9.83%
181.mcf 1772 1766 -0.33%
186.crafty 2320 2394 3.16%
197.parser 1166 1683 44.32%
252.eon 2443 2879 17.80%
253.perlbmk 2410 2556 6.04%
254.gap 1987 2139 7.61%
255.vortex 2392 3669 53.40%
256.bzip2 1719 1824 6.09%
300.twolf 2288 2345 2.49%
4. LTO + -fwhole-program + O2 + FDO vs O2 (10.8%)
164.gzip 1324 1340 1.25%
175.vpr 1694 1709 0.87%
176.gcc 2293 2411 5.13%
181.mcf 1772 1757 -0.80%
186.crafty 2320 2566 10.59%
197.parser 1166 1614 38.44%
252.eon 2443 2785 13.98%
253.perlbmk 2410 2618 8.61%
254.gap 1987 2063 3.81%
255.vortex 2392 3294 37.69%
256.bzip2 1719 1956 13.77%
300.twolf 2288 2404 5.07%
David
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-16 8:26 ` Jan Hubicka
2010-11-16 9:00 ` Xinliang David Li
@ 2010-11-16 15:43 ` Richard Guenther
1 sibling, 0 replies; 68+ messages in thread
From: Richard Guenther @ 2010-11-16 15:43 UTC (permalink / raw)
To: Jan Hubicka
Cc: Xinliang David Li, Andrey Belevantsev, Vladimir Makarov,
gcc.gcc.gnu.org, Mark Heffernan
2010/11/16 Jan Hubicka <hubicka@ucw.cz>:
>> On Mon, Nov 15, 2010 at 5:39 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> >> > Fortunately linker plugin solves the problem here and this is why I want to
>> >> > have it by default. GCC then can do effectively -fwhole-program for binaries
>> >> > (since linker knows what will be bound elsewhere) and take advantage of
>> >> > visibility((hidden)) hints for shared libraries same way. Most of important
>> >> > shared libraries gets visibility ((hidden)) right.
>> >> >
>> >> > It is sad that LTO w/o linker plugin doesn't give that much benefits.
>> >> > Ideas are welcome here.
>> >>
>> >> Linker feedback will be limited here -- mostly global variable
>> >> aliasing (as I remember only 2/3 spec programs benefit from it), it
>> >> helps You don't get whole program points-to, whole program mod-ref
>> >> (with context sensitivity), whole program structure layout. The latter
>> >> are the real kickers (in terms of SPEC performance), but promoting LTO
>> >> with those numbers can be misleading as many programs won't get it.
>> >
>> > Well, I am speaking of our linker plugin here. What it does is to pass GCC
>> > resolution information so it knows what symbols are bound externally. Since
>> > typically you link LTO alone or with small non-LTO part, most of symbols are
>> > not bound and thus effectively you get -fwhole-program (-fwhole-program just
>> > declares everything static except for main ())
>> >
>> > We don't really do whole program points-to or structure layout.
>>
>> gcc will eventually, right?
>
> Sure hope so ;)
> We really need to solve scalability with our IPA points-to and make it
> compatible with WHOPR.
>>
>> > Mod-ref is just
>> > simple ipa-reference code. How you get context sensitivity on mod/ref?
>>
>> mod-ref relies on points-to. With context sensitive points-to, you can
>> also get CS mod-ref -- basically mod-ref info per callsite.
>
> Ah sure, I was too focused on our current "mod/ref" :)
Btw, IPA-PTA also performs mod/ref analysis (but of course it is
context insensitive).
Richard.
> Honza
>
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-16 14:23 ` Xinliang David Li
@ 2010-11-16 17:10 ` Jan Hubicka
2010-11-16 19:11 ` Xinliang David Li
2010-11-18 11:48 ` Xinliang David Li
1 sibling, 1 reply; 68+ messages in thread
From: Jan Hubicka @ 2010-11-16 17:10 UTC (permalink / raw)
To: Xinliang David Li
Cc: Jan Hubicka, Andrey Belevantsev, Vladimir Makarov,
gcc.gcc.gnu.org, Mark Heffernan
> More FDO related performance numbers
>
> Experiment 1: trunk gcc O2 + FDO vs O2: FDO improves performance
> by 5% geomean
> Experiment 2: our internal gcc compiler (4.4.3 based with many local
> patches) O2 + FDO vs O2 (trunk gcc): FDO improves perf by 6.6%
> geomean
> Experiment 3: our internal gcc (4.4.3 with local patches) O2 + LIPO vs
> O2 (trunk gcc): LIPO improves by 12%
> Experiment 4: trunk gcc O2 + LTO + fwhole-program + FDO vs O2: LTO +
> FDO improves by 10.8%
>
>
> 1. Trunk gcc FDO vs O2 (5%)
>
> 164.gzip 1324 1302 -1.64%
> 175.vpr 1694 1725 1.84%
> 176.gcc 2293 2387 4.07%
> 181.mcf 1772 1756 -0.88%
> 186.crafty 2320 2280 -1.75%
> 197.parser 1166 1556 33.42%
> 252.eon 2443 2552 4.45%
> 253.perlbmk 2410 2586 7.28%
> 254.gap 1987 2021 1.71%
> 255.vortex 2392 2720 13.71%
> 256.bzip2 1719 1717 -0.12%
> 300.twolf 2288 2331 1.86%
>
> 2. 4.4.3 gcc with local patch FDO vs trunk O2 (6.6%)
Interesting, any idea where this 1.6% is coming from? I guess this might
also be the reason for that 2% difference in the LIPO results (in general LTO
-fwhole-program + FDO should be stronger, but it is not tuned at all yet).
Since the LIPO branch was updated to mainline some time ago, it would be nice
to compare LIPO from the branch with mainline LTO. I guess a fairer comparison
would be O2+FDO+LTO vs. O2+LIPO, as LIPO makes no whole-program assumptions
at all, right?
Honza
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-16 17:10 ` Jan Hubicka
@ 2010-11-16 19:11 ` Xinliang David Li
0 siblings, 0 replies; 68+ messages in thread
From: Xinliang David Li @ 2010-11-16 19:11 UTC (permalink / raw)
To: Jan Hubicka
Cc: Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org,
Mark Heffernan, Raksit Ashok
On Tue, Nov 16, 2010 at 6:35 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> More FDO related performance numbers
>>
>> Experiment 1: trunk gcc O2 + FDO vs O2: FDO improves performance
>> by 5% geomean
>> Experiment 2: our internal gcc compiler (4.4.3 based with many local
>> patches) O2 + FDO vs O2 (trunk gcc): FDO improves perf by 6.6%
>> geomean
>> Experiment 3: our internal gcc (4.4.3 with local patches) O2 + LIPO vs
>> O2 (trunk gcc): LIPO improves by 12%
>> Experiment 4: trunk gcc O2 + LTO + fwhole-program + FDO vs O2: LTO +
>> FDO improves by 10.8%
>>
>>
>> 1. Trunk gcc FDO vs O2 (5%)
>>
>> 164.gzip 1324 1302 -1.64%
>> 175.vpr 1694 1725 1.84%
>> 176.gcc 2293 2387 4.07%
>> 181.mcf 1772 1756 -0.88%
>> 186.crafty 2320 2280 -1.75%
>> 197.parser 1166 1556 33.42%
>> 252.eon 2443 2552 4.45%
>> 253.perlbmk 2410 2586 7.28%
>> 254.gap 1987 2021 1.71%
>> 255.vortex 2392 2720 13.71%
>> 256.bzip2 1719 1717 -0.12%
>> 300.twolf 2288 2331 1.86%
>>
>> 2. 4.4.3 gcc with local patch FDO vs trunk O2 (6.6%)
>
> Interesting, any idea where this 1.6% is coming from?
Probably due to local patches (inliner, lrs, etc) we have, but I have
not studied it.
> I guess this might
> also be the reason for that 2% difference in the LIPO results (in general LTO
> -fwhole-program + FDO should be stronger, but it is not tuned at all yet).
>
> Since the LIPO branch was updated to mainline some time ago, it would be nice
> to compare LIPO from the branch with mainline LTO. I guess a fairer comparison
> would be O2+FDO+LTO vs. O2+LIPO, as LIPO makes no whole-program assumptions
> at all, right?
Yes. Raksit maintains the upstream lipo branch, but it has not been
tuned for performance yet. We have open sourced our compiler changes
via Android. It is better to use that if anyone is interested.
Thanks,
David
>
> Honza
>
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-16 14:23 ` Xinliang David Li
2010-11-16 17:10 ` Jan Hubicka
@ 2010-11-18 11:48 ` Xinliang David Li
2010-11-18 13:06 ` Jan Hubicka
2010-11-18 13:28 ` Jan Hubicka
1 sibling, 2 replies; 68+ messages in thread
From: Xinliang David Li @ 2010-11-18 11:48 UTC (permalink / raw)
To: Jan Hubicka
Cc: Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org, Mark Heffernan
Some text size measurements.
Summary:
1) LTO with -O3 bloats up code considerably;
2) LTO with -O2 reduces text size compared with -O2
3) The Google 4.4.3-based compiler is really effective at reducing C++
program size -- this is where the tuning was focused, as witnessed by
eon in SPEC2k and all the C++ apps in SPEC06.
Notes:
1. -ffunction-sections -Wl,-gc-sections are used in the build.
2. SPEC06 dealII does not build with trunk GCC because of a parsing
error. Hj Lu, what alt source should be used? (It builds fine with the
4.4.3 compiler.)
3. xalancbmk and omnetpp do not build with the TOT gcc compiler using
FDO -- the compiler ICEs. Will investigate when there is time.
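
For readers trying to reproduce builds like these, the flag combinations named in this thread can be written down as below. This is a rough sketch under assumptions: the exact command lines used for the measurements are not given here, `-fprofile-generate`/`-fprofile-use` are the standard GCC FDO flags and are my assumption about how the FDO runs were built, and `CONFIGS` and `gcc_command` are illustrative names. LIPO is left out because it needs the Google branch.

```python
# Section flags from note 1, written as they appear in the thread.
COMMON = ["-ffunction-sections", "-Wl,-gc-sections"]

CONFIGS = {
    "O2": ["-O2"],
    "O3": ["-O3"],
    "O2+LTO+whole-program": ["-O2", "-flto", "-fwhole-program"],
    "O3+LTO+whole-program": ["-O3", "-flto", "-fwhole-program"],
    # FDO is a two-step build: instrument, run a training workload, rebuild.
    "O2+FDO (instrument)": ["-O2", "-fprofile-generate"],
    "O2+FDO (use)": ["-O2", "-fprofile-use"],
}


def gcc_command(config, sources, output, cc="gcc"):
    """Return the argv list for one configuration; nothing is executed."""
    return [cc] + CONFIGS[config] + COMMON + list(sources) + ["-o", output]


cmd = gcc_command("O2+LTO+whole-program", ["a.c", "b.c"], "app")
print(" ".join(cmd))
```

Passing every source file in a single invocation keeps the sketch short; a real build would compile objects with -flto and pass the same optimization flags again at link time.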
David
SPEC06 C++ program Data (the first data column is the TOT O2 base number)
1. TOT O3 vs TOT O2 ( 3.35% total increase)
471.omnetpp/ 853708 867988 1.67%
450.soplex/ 643273 656349 2.03%
483.xalancbmk/ 3634416 3777600 3.94%
444.namd/ 393142 402038 2.26%
473.astar/ 102182 111038 8.67%
size_sum 5626721 5815013 3.35%
2. TOT LTO+whole program + O3 vs TOT O2 (0.35% total increase)
471.omnetpp/ 853708 937728 9.84%
450.soplex/ 643273 654057 1.68%
483.xalancbmk/ 3634416 3540646 -2.58%
444.namd/ 393142 401318 2.08%
473.astar/ 102182 112538 10.13%
size_sum 5626721 5646287 0.35%
3. TOT LTO+whole program + O2 vs TOT O2 (8.10% total reduction)
471.omnetpp/ 853708 822868 -3.61%
450.soplex/ 643273 611653 -4.92%
483.xalancbmk/ 3634416 3245157 -10.71%
444.namd/ 393142 391698 -0.37%
473.astar/ 102182 99586 -2.54%
size_sum 5626721 5170962 -8.10%
4. google 4.4.3 compiler O2 vs TOT O2 (13.95% total reduction)
471.omnetpp/ 853708 545840 -36.06%
450.soplex/ 643273 374674 -41.76%
483.xalancbmk/ 3634416 3556306 -2.15%
444.namd/ 393142 329897 -16.09%
473.astar/ 102182 35301 -65.45%
size_sum 5626721 4842018 -13.95%
5. Google 4.4.3 compiler O2 FDO vs TOT O2 (24.81% total reduction)
471.omnetpp/ 853708 514732 -39.71%
450.soplex/ 643273 357426 -44.44%
483.xalancbmk/ 3634416 2985761 -17.85%
444.namd/ 393142 332806 -15.35%
473.astar/ 102182 39797 -61.05%
size_sum 5626721 4230522 -24.81%
6. Google 4.4.3 compiler O2 LIPO vs TOT O2 (20.86 % total reduction)
471.omnetpp/ 853708 559944 -34.41%
450.soplex/ 643273 393399 -38.84%
483.xalancbmk/ 3634416 3126428 -13.98%
444.namd/ 393142 334666 -14.87%
473.astar/ 102182 38749 -62.08%
size_sum 5626721 4453186 -20.86%
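
As a cross-check on how the size_sum rows are derived, here is a small sketch that recomputes table 3 above (TOT LTO + whole program + O2 vs TOT O2) from the per-benchmark numbers. The helper name `percent_change` is illustrative.

```python
# (TOT -O2 baseline text size, TOT -O2 + LTO + whole-program text size), table 3.
sizes = {
    "471.omnetpp": (853708, 822868),
    "450.soplex": (643273, 611653),
    "483.xalancbmk": (3634416, 3245157),
    "444.namd": (393142, 391698),
    "473.astar": (102182, 99586),
}


def percent_change(base, new):
    """Relative change of new against base, in percent (negative = smaller)."""
    return (new - base) / base * 100.0


base_sum = sum(base for base, _ in sizes.values())
new_sum = sum(new for _, new in sizes.values())
total_change = percent_change(base_sum, new_sum)

for name, (base, new) in sizes.items():
    print(f"{name:15s} {base:8d} {new:8d} {percent_change(base, new):7.2f}%")
print(f"{'size_sum':15s} {base_sum:8d} {new_sum:8d} {total_change:7.2f}%")
```

Note that the total is the change of the summed sizes, not the mean of the per-benchmark percentages, so the largest binary (xalancbmk here) dominates the headline number.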
SPEC2k text size data:
1. tot O1 vs tot O2 ( 4.48% total reduction)
300.twolf/ 182884 177223 -3.10%
181.mcf/ 11794 11338 -3.87%
164.gzip/ 36705 34388 -6.31%
186.crafty/ 171663 164898 -3.94%
255.vortex/ 463463 456034 -1.60%
256.bzip2/ 28803 28091 -2.47%
176.gcc/ 1422042 1368365 -3.77%
197.parser/ 103225 96644 -6.38%
253.perlbmk/ 563927 515898 -8.52%
175.vpr/ 139321 134316 -3.59%
252.eon/ 607704 591780 -2.62%
254.gap/ 496262 459593 -7.39%
size_sum 4227793 4038568 -4.48%
2. tot O3 vs tot O2 : (10.8% total size increase)
300.twolf/ 182884 194620 6.42%
181.mcf/ 11794 13290 12.68%
164.gzip/ 36705 46049 25.46%
186.crafty/ 171663 189892 10.62%
255.vortex/ 463463 495875 6.99%
256.bzip2/ 28803 39939 38.66%
176.gcc/ 1422042 1609786 13.20%
197.parser/ 103225 143558 39.07%
253.perlbmk/ 563927 616855 9.39%
175.vpr/ 139321 147081 5.57%
252.eon/ 607704 625176 2.88%
254.gap/ 496262 563187 13.49%
size_sum 4227793 4685308 10.82%
3. tot LTO + -fwhole-program + -O2 vs tot O2 : (3.65% total size reduction)
300.twolf/ 182884 176572 -3.45%
181.mcf/ 11794 9594 -18.65%
164.gzip/ 36705 34439 -6.17%
186.crafty/ 171663 173071 0.82%
255.vortex/ 463463 382157 -17.54%
256.bzip2/ 28803 27142 -5.77%
176.gcc/ 1422042 1364796 -4.03%
197.parser/ 103225 94997 -7.97%
253.perlbmk/ 563927 590087 4.64%
175.vpr/ 139321 123572 -11.30%
252.eon/ 607704 606226 -0.24%
254.gap/ 496262 491006 -1.06%
size_sum 4227793 4073659 -3.65%
4. tot LTO + -fwhole-program + -O3 vs tot O2: (16.57% total increase)
300.twolf/ 182884 196316 7.34%
181.mcf/ 11794 11402 -3.32%
164.gzip/ 36705 51477 40.25%
186.crafty/ 171663 214700 25.07%
255.vortex/ 463463 462329 -0.24%
256.bzip2/ 28803 34950 21.34%
176.gcc/ 1422042 1724868 21.30%
197.parser/ 103225 124698 20.80%
253.perlbmk/ 563927 729119 29.29%
175.vpr/ 139321 139729 0.29%
252.eon/ 607704 627194 3.21%
254.gap/ 496262 611515 23.22%
size_sum 4227793 4928297 16.57%
5. tot O2 FDO vs tot O2: (1.15% total increase)
300.twolf/ 182884 178247 -2.54%
181.mcf/ 11794 17370 47.28%
164.gzip/ 36705 42889 16.85%
186.crafty/ 171663 184085 7.24%
255.vortex/ 463463 483428 4.31%
256.bzip2/ 28803 33635 16.78%
176.gcc/ 1422042 1441797 1.39%
197.parser/ 103225 140401 36.01%
253.perlbmk/ 563927 546447 -3.10%
175.vpr/ 139321 147153 5.62%
252.eon/ 607704 572388 -5.81%
254.gap/ 496262 488758 -1.51%
size_sum 4227793 4276598 1.15%
6. google local compiler O2 FDO vs tot O2: (6.33% total increase)
Note the large reduction in the text size of the C++ program (eon),
which is where the size tuning was done.
300.twolf/ 182884 184736 1.01%
181.mcf/ 11794 26560 125.20%
164.gzip/ 36705 48499 32.13%
186.crafty/ 171663 187406 9.17%
255.vortex/ 463463 482090 4.02%
256.bzip2/ 28803 37905 31.60%
176.gcc/ 1422042 1729480 21.62%
197.parser/ 103225 237148 129.74%
253.perlbmk/ 563927 557040 -1.22%
175.vpr/ 139321 153453 10.14%
252.eon/ 607704 312506 -48.58%
254.gap/ 496262 538534 8.52%
size_sum 4227793 4495357 6.33%
Also for reference, the google compiler vanilla O2 vs tot O2 -- large
reduction in C++ size, while overall size increases a little (1.97%).
300.twolf/ 182884 207829 13.64%
181.mcf/ 11794 12008 1.81%
164.gzip/ 36705 41528 13.14%
186.crafty/ 171663 177104 3.17%
255.vortex/ 463463 473298 2.12%
256.bzip2/ 28803 37961 31.80%
176.gcc/ 1422042 1592952 12.02%
197.parser/ 103225 139969 35.60%
253.perlbmk/ 563927 598632 6.15%
175.vpr/ 139321 156869 12.60%
252.eon/ 607704 322478 -46.94%
254.gap/ 496262 550451 10.92%
size_sum 4227793 4311079 1.97%
7. LIPO vs tot O2: (23.32% total increase)
300.twolf/ 182884 185960 1.68%
181.mcf/ 11794 26544 125.06%
164.gzip/ 36705 54827 49.37%
186.crafty/ 171663 234494 36.60%
255.vortex/ 463463 596394 28.68%
256.bzip2/ 28803 40492 40.58%
176.gcc/ 1422042 2070851 45.63%
197.parser/ 103225 250537 142.71%
253.perlbmk/ 563927 638320 13.19%
175.vpr/ 139321 156117 12.06%
252.eon/ 607704 370949 -38.96%
254.gap/ 496262 588139 18.51%
size_sum 4227793 5213624 23.32%
8. LTO + whole-program + O2 + FDO vs O2: (0.37% total increase)
300.twolf/ 182884 174919 -4.36%
181.mcf/ 11794 16346 38.60%
164.gzip/ 36705 40743 11.00%
186.crafty/ 171663 197698 15.17%
255.vortex/ 463463 395626 -14.64%
256.bzip2/ 28803 36238 25.81%
176.gcc/ 1422042 1439295 1.21%
197.parser/ 103225 143237 38.76%
253.perlbmk/ 563927 590687 4.75%
175.vpr/ 139321 135276 -2.90%
252.eon/ 607704 585954 -3.58%
254.gap/ 496262 487289 -1.81%
size_sum 4227793 4243308 0.37%
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-18 11:48 ` Xinliang David Li
@ 2010-11-18 13:06 ` Jan Hubicka
2010-11-18 17:20 ` Xinliang David Li
[not found] ` <AANLkTinTVN_T06eG1-nxei_Vj999wFT7qKZ55vwW+TtC@mail.gmail.com>
2010-11-18 13:28 ` Jan Hubicka
1 sibling, 2 replies; 68+ messages in thread
From: Jan Hubicka @ 2010-11-18 13:06 UTC (permalink / raw)
To: Xinliang David Li
Cc: Jan Hubicka, Andrey Belevantsev, Vladimir Makarov,
gcc.gcc.gnu.org, Mark Heffernan
> Some text size measurement.
>
> Summary:
> 1) LTO with -O3 bloats up code considerably;
Yes, you need either -fwhole-program or -fuse-linker-plugin to make it behave
sanely.
For Mozilla I have the best experience with -fuse-linker-plugin --param
inline-unit-growth=5. That gives me about 16% code size savings (so LTO -O3 is
the same size as -Os). This means that it is smaller than -O2+LTO+-fwhole-program.
> 2) LTO with -O2 reduces text size compared with -O2
And more so with -fwhole-program, as you noticed ;)
> 3) Google 4.4.3 based compiler is really effective in reducing C++
> program size -- this is where the focus of the tuning was done.
> Witnessed by eon in SPEC2k and all C++ apps in SPEC06
Again, I would be very interested to see the patches (and sooner rather than
later, given that I am re-tuning the inliner for 4.6 now). Looking at Mozilla I
also concluded that we have a lot of room for improvement. How old is the tree
you use for testing? I recently improved code size somewhat on mainline.
>
> SPEC06 C++ program Data (the first data column is the TOT O2 base number)
>
> 1. TOT O3 vs TOT O2 ( 3.35% total increase)
>
> 471.omnetpp/ 853708 867988 1.67%
> 450.soplex/ 643273 656349 2.03%
> 483.xalancbmk/ 3634416 3777600 3.94%
> 444.namd/ 393142 402038 2.26%
> 473.astar/ 102182 111038 8.67%
> size_sum 5626721 5815013 3.35%
>
> 2. TOT LTO+whole program + O3 vs TOT O2 (0.35% total increase)
>
> 471.omnetpp/ 853708 937728 9.84%
> 450.soplex/ 643273 654057 1.68%
> 483.xalancbmk/ 3634416 3540646 -2.58%
> 444.namd/ 393142 401318 2.08%
> 473.astar/ 102182 112538 10.13%
> size_sum 5626721 5646287 0.35%
>
> 3. TOT LTO+whole program + O2 vs TOT O2 (8.10% total reduction)
>
> 471.omnetpp/ 853708 822868 -3.61%
> 450.soplex/ 643273 611653 -4.92%
> 483.xalancbmk/ 3634416 3245157 -10.71%
> 444.namd/ 393142 391698 -0.37%
> 473.astar/ 102182 99586 -2.54%
> size_sum 5626721 5170962 -8.10%
>
> 4. google 4.4.3 compiler O2 vs TOT O2 (13.95% total reduction)
>
> 471.omnetpp/ 853708 545840 -36.06%
> 450.soplex/ 643273 374674 -41.76%
> 483.xalancbmk/ 3634416 3556306 -2.15%
> 444.namd/ 393142 329897 -16.09%
> 473.astar/ 102182 35301 -65.45%
> size_sum 5626721 4842018 -13.95%
Hmm, this really seems interesting. Why were the changes not contributed during this stage1?
Honza
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-18 11:48 ` Xinliang David Li
2010-11-18 13:06 ` Jan Hubicka
@ 2010-11-18 13:28 ` Jan Hubicka
2010-11-18 18:18 ` Xinliang David Li
1 sibling, 1 reply; 68+ messages in thread
From: Jan Hubicka @ 2010-11-18 13:28 UTC (permalink / raw)
To: Xinliang David Li
Cc: Jan Hubicka, Andrey Belevantsev, Vladimir Makarov,
gcc.gcc.gnu.org, Mark Heffernan
Hi,
and for size, could you please also do -Os comparisons? I am aware that the -O2
inliner is tuned somewhat up for C++. This is because we have a C++ benchmark
suite that we use to monitor inlining:
http://gcc.opensuse.org/c++bench-frescobaldi/
Programs there are a lot more aggressive on abstraction than whatever SPEC2k
and SPEC2k6 do. I know I can tune the inliner down for SPEC, but then I get
regressions there... But given we can get a 10% difference on a normal C++
program, it might be interesting to consider some compromise, at least for -O2.
Honza
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-18 13:06 ` Jan Hubicka
@ 2010-11-18 17:20 ` Xinliang David Li
[not found] ` <AANLkTinTVN_T06eG1-nxei_Vj999wFT7qKZ55vwW+TtC@mail.gmail.com>
1 sibling, 0 replies; 68+ messages in thread
From: Xinliang David Li @ 2010-11-18 17:20 UTC (permalink / raw)
To: Jan Hubicka
Cc: Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org, Mark Heffernan
On Thu, Nov 18, 2010 at 3:58 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Some text size measurement.
>>
>> Summary:
>> 1) LTO with -O3 bloats up code considerably;
> Yes, you need either -fwhole-program or -fuse-linker-plugin to make it behave
> sanely.
>
> For Mozilla I have had the best experience with -fuse-linker-plugin --param
> inline-unit-growth=5. That gives me about 16% code size savings (so LTO -O3 is
> the same size as -Os). This means that it is smaller than -O2+LTO+-fwhole-program.
>
>> 2) LTO with -O2 reduces text size compared with -O2
>
> And more so with -fwhole-program, as you noticed ;)
Actually all the LTO experiments were done with -fwhole-program.
>
>> 3) Google 4.4.3 based compiler is really effective in reducing C++
>> program size -- this is where the focus of the tuning was done.
>> Witnessed by eon in SPEC2k and all C++ apps in SPEC06
>
> Again, I would be very interested to see the patches (and sooner rather than
> later, given that I am re-tuning the inliner for 4.6 now). Looking at Mozilla I
> also concluded that we have a lot of room for improvement. How old is the tree
> you use for testing? I recently improved code size somewhat on mainline.
>
Yes -- it may take a while, as the code base for our tuning is 4.4.3.
>>
>> SPEC06 C++ program Data (the first data column is the TOT O2 base number)
>>
>> 1. TOT O3 vs TOT O2 ( 3.35% total increase)
>>
>> 471.omnetpp/ 853708 867988 1.67%
>> 450.soplex/ 643273 656349 2.03%
>> 483.xalancbmk/ 3634416 3777600 3.94%
>> 444.namd/ 393142 402038 2.26%
>> 473.astar/ 102182 111038 8.67%
>> size_sum 5626721 5815013 3.35%
>>
>> 2. TOT LTO+whole program + O3 vs TOT O2 (0.35% total increase)
>>
>> 471.omnetpp/ 853708 937728 9.84%
>> 450.soplex/ 643273 654057 1.68%
>> 483.xalancbmk/ 3634416 3540646 -2.58%
>> 444.namd/ 393142 401318 2.08%
>> 473.astar/ 102182 112538 10.13%
>> size_sum 5626721 5646287 0.35%
>>
>> 3. TOT LTO+whole program + O2 vs TOT O2 (8.10% total reduction)
>>
>> 471.omnetpp/ 853708 822868 -3.61%
>> 450.soplex/ 643273 611653 -4.92%
>> 483.xalancbmk/ 3634416 3245157 -10.71%
>> 444.namd/ 393142 391698 -0.37%
>> 473.astar/ 102182 99586 -2.54%
>> size_sum 5626721 5170962 -8.10%
>>
>> 4. google 4.4.3 compiler O2 vs TOT O2 (13.95% total reduction)
>>
>> 471.omnetpp/ 853708 545840 -36.06%
>> 450.soplex/ 643273 374674 -41.76%
>> 483.xalancbmk/ 3634416 3556306 -2.15%
>> 444.namd/ 393142 329897 -16.09%
>> 473.astar/ 102182 35301 -65.45%
>> size_sum 5626721 4842018 -13.95%
>
> Hmm, this really seems interesting. Why were the changes not contributed during this stage1?
>
As you can see, the savings are mainly for C++, and the changes may hurt C --
more work is needed to make them suitable for upstream.
Thanks,
David
> Honza
>
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-18 13:28 ` Jan Hubicka
@ 2010-11-18 18:18 ` Xinliang David Li
0 siblings, 0 replies; 68+ messages in thread
From: Xinliang David Li @ 2010-11-18 18:18 UTC (permalink / raw)
To: Jan Hubicka
Cc: Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org, Mark Heffernan
I found an error in my size experiment setup (shared vs. non-shared
libstdc++). Please discard the size numbers; I will remeasure.
Thanks,
David
On Thu, Nov 18, 2010 at 4:02 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> Hi,
> and for size, could you please also do -Os comparisons? I am aware that the -O2
> inliner is tuned somewhat up for C++. This is because we have a C++ benchmark
> suite that we use to monitor inlining:
> http://gcc.opensuse.org/c++bench-frescobaldi/
>
> Programs there are a lot more aggressive on abstraction than whatever SPEC2k
> and SPEC2k6 do. I know I can tune the inliner down for SPEC, but then I get
> regressions there... But given we can get a 10% difference on a normal C++
> program, it might be interesting to consider some compromise, at least for -O2.
>
> Honza
>
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
[not found] ` <AANLkTinTVN_T06eG1-nxei_Vj999wFT7qKZ55vwW+TtC@mail.gmail.com>
@ 2010-11-19 4:22 ` Jan Hubicka
2010-11-19 7:26 ` Xinliang David Li
[not found] ` <AANLkTimvfm24_fvBdkYmqPVjSzcOgy0hx_0mO11AxbeC@mail.gmail.com>
0 siblings, 2 replies; 68+ messages in thread
From: Jan Hubicka @ 2010-11-19 4:22 UTC (permalink / raw)
To: Mark Heffernan
Cc: Jan Hubicka, Xinliang David Li, Andrey Belevantsev,
Vladimir Makarov, gcc.gcc.gnu.org
Hi,
> I'll get back to you with our local inlining changes. We're looking to move
> development closer to trunk to reduce this divergence in the future.
>
> Our tuning was done primarily on big c++ programs. A significant size
> improvement came from aggressively inlining functions which might be
> eliminated by the linker (garbage collection of uncalled functions). We
> found that for non-static functions, if all callsites of a function are
> inlined, the function rarely appears in the final binary (excepting address
> taken functions). Of course, this doesn't necessarily work for libraries
> which may need all non-static functions to be emitted.
Interesting. Coincidentally, I recently added logic for this for comdat
functions (setting the probability to 20%) to deal with the problem that a lot
of C++ programs do template instantiations that produce comdat functions for no
good reason. This indeed helped quite a lot. I have not gotten as far as adding
similar logic for normal external functions, since the current toolchain won't
eliminate them by default.
Did the posted size numbers include function garbage collection and
unification, set up the same way for mainline as for the Google compiler?
Honza
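
The size trade-off described above can be sketched in a few lines of Python.
This is only an illustrative model, not GCC's actual inliner heuristic: the
function name expected_size_change, its parameters, and the example sizes are
all hypothetical, and the 20% figure is read here as the chance that the
out-of-line copy of the function still ends up in the final binary.

```python
def expected_size_change(body_size, call_sites, call_cost, p_survive):
    """Expected text-size change from inlining every call site of one function.

    body_size  -- size of the function body (arbitrary units, hypothetical)
    call_sites -- number of call sites, all of which get inlined
    call_cost  -- size of one call sequence that inlining removes
    p_survive  -- assumed chance the out-of-line copy stays in the binary
    """
    # Each inlined call site trades a call sequence for a copy of the body.
    growth = call_sites * (body_size - call_cost)
    # The out-of-line copy is dropped with probability (1 - p_survive).
    expected_saving = (1.0 - p_survive) * body_size
    return growth - expected_saving


# Hypothetical function: 40 units of code, one call site, 5-unit call sequence.
for label, p in [("static", 0.0), ("comdat (20% assumed)", 0.2), ("external, never dropped", 1.0)]:
    change = expected_size_change(40, 1, 5, p)
    print(f"{label:<24} expected size change: {change:+.1f}")
```

Under these assumptions, a low survival probability makes inlining the last
call site nearly free in size terms, while a function that is never dropped
pays the full cost of the extra copy.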
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
2010-11-19 4:22 ` Jan Hubicka
@ 2010-11-19 7:26 ` Xinliang David Li
[not found] ` <AANLkTimvfm24_fvBdkYmqPVjSzcOgy0hx_0mO11AxbeC@mail.gmail.com>
1 sibling, 0 replies; 68+ messages in thread
From: Xinliang David Li @ 2010-11-19 7:26 UTC (permalink / raw)
To: Jan Hubicka
Cc: Mark Heffernan, Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org
On Thu, Nov 18, 2010 at 4:12 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> Hi,
>> I'll get back to you with our local inlining changes. We're looking to move
>> development closer to trunk to reduce this divergence in the future.
>>
>> Our tuning was done primarily on big c++ programs. A significant size
>> improvement came from aggressively inlining functions which might be
>> eliminated by the linker (garbage collection of uncalled functions). We
>> found that for non-static functions, if all callsites of a function are
>> inlined, the function rarely appears in the final binary (excepting address
>> taken functions). Of course, this doesn't necessarily work for libraries
>> which may need all non-static functions to be emitted.
>
> Interesting. Coincidentally, I recently added logic for this for comdat
> functions (setting the probability to 20%) to deal with the problem that a lot
> of C++ programs do template instantiations that produce comdat functions for no
> good reason. This indeed helped quite a lot. I have not gotten as far as adding
> similar logic for normal external functions, since the current toolchain won't
> eliminate them by default.
>
> Did the posted size numbers include function garbage collection and
> unification, set up the same way for mainline as for the Google compiler?
My previous size numbers are wrong -- it looks like trunk gcc does
pretty well in terms of text size. Updated numbers will be posted
later.
Yes, all experiments were done with GC, but safe ICF was not turned on.
David
>
> Honza
>
* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
[not found] ` <AANLkTimvfm24_fvBdkYmqPVjSzcOgy0hx_0mO11AxbeC@mail.gmail.com>
@ 2010-11-19 14:12 ` Xinliang David Li
0 siblings, 0 replies; 68+ messages in thread
From: Xinliang David Li @ 2010-11-19 14:12 UTC (permalink / raw)
To: Mark Heffernan
Cc: Jan Hubicka, Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org
New size data -- hopefully it is sane this time.
Changes in the experiment:
1) shared libstdc++ is used with trunk gcc
2) the bfd linker is used with both trunk and the patched 4.4.3 compiler
(which used gold before).
The size comparison for all C benchmarks in the previous report is still
valid. The following are the corrected SPEC06 C++ numbers and some new
data (Os).
a) SPEC06 size
1. tot O3 vs tot O2
471.omnetpp/ 564839 579088 2.52%
450.soplex/ 352709 365785 3.71%
483.xalancbmk/ 3357954 3501139 4.26%
444.namd/ 319553 328449 2.78%
473.astar/ 31343 40199 28.26%
size_sum 4626398 4814660 4.07%
2. tot Os vs tot O2
471.omnetpp/ 564839 512157 -9.33%
450.soplex/ 352709 256364 -27.32%
483.xalancbmk/ 3357954 2747152 -18.19%
444.namd/ 319553 261799 -18.07%
473.astar/ 31343 26201 -16.41%
size_sum 4626398 3803673 -17.78%
3. tot O2 lto wpa vs tot O2
471.omnetpp/ 564839 531878 -5.84%
450.soplex/ 352709 319476 -9.42%
483.xalancbmk/ 3357954 2966800 -11.65%
444.namd/ 319553 318099 -0.46%
473.astar/ 31343 28747 -8.28%
size_sum 4626398 4165000 -9.97%
4. tot O3 lto wpa vs tot O2
471.omnetpp/ 564839 646708 14.49%
450.soplex/ 352709 361895 2.60%
483.xalancbmk/ 3357954 3262255 -2.85%
444.namd/ 319553 327738 2.56%
473.astar/ 31343 41707 33.07%
size_sum 4626398 4640303 0.30%
5. patched 4.4.3 O2 vs tot O2
471.omnetpp/ 564839 539237 -4.53%
450.soplex/ 352709 373263 5.83%
483.xalancbmk/ 3357954 3476137 3.52%
444.namd/ 319553 329769 3.20%
473.astar/ 31343 35250 12.47%
size_sum 4626398 4753656 2.75%
6. Patched 4.4.3 Os vs tot O2
471.omnetpp/ 564839 486838 -13.81%
450.soplex/ 352709 272146 -22.84%
483.xalancbmk/ 3357954 2769330 -17.53%
444.namd/ 319553 255295 -20.11%
473.astar/ 31343 25852 -17.52%
size_sum 4626398 3809461 -17.66%
7. patched 4.4.3 O2 FDO vs tot O2:
471.omnetpp/ 564839 508676 -9.94%
450.soplex/ 352709 356223 1.00%
483.xalancbmk/ 3357954 2919924 -13.04%
444.namd/ 319553 332664 4.10%
473.astar/ 31343 39738 26.78%
size_sum 4626398 4157225 -10.14%
8. patched 4.4.3 O2 LIPO vs tot O2:
471.omnetpp/ 564839 552361 -2.21%
450.soplex/ 352709 392106 11.17%
483.xalancbmk/ 3357954 3058259 -8.92%
444.namd/ 319553 334522 4.68%
473.astar/ 31343 38690 23.44%
size_sum 4626398 4375938 -5.41%
SPEC2k Os:
1. tot Os vs tot O2
300.twolf/ 182884 150921 -17.48%
181.mcf/ 11794 10246 -13.13%
164.gzip/ 36705 30983 -15.59%
186.crafty/ 171663 149301 -13.03%
255.vortex/ 463463 398908 -13.93%
256.bzip2/ 28803 24795 -13.92%
176.gcc/ 1422042 1158844 -18.51%
197.parser/ 103225 84814 -17.84%
253.perlbmk/ 563927 457664 -18.84%
175.vpr/ 139321 118330 -15.07%
252.eon/ 314603 258560 -17.81%
254.gap/ 496262 403633 -18.67%
size_sum 3934692 3246999 -17.48%
2. patched 4.4.3 Os vs tot Os:
300.twolf/ 150921 156185 3.49%
181.mcf/ 10246 10062 -1.80%
164.gzip/ 30983 30991 0.03%
186.crafty/ 149301 151477 1.46%
255.vortex/ 398908 402780 0.97%
256.bzip2/ 24795 24619 -0.71%
176.gcc/ 1158844 1177628 1.62%
197.parser/ 84814 82718 -2.47%
253.perlbmk/ 457664 466152 1.85%
175.vpr/ 118330 121446 2.63%
252.eon/ 258560 281061 8.70%
254.gap/ 403633 411540 1.96%
size_sum 3246999 3316659 2.15%
David
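
The percentage column in the tables above can be reproduced with a short
Python sketch. The helper name percent_change is made up for illustration; the
sizes are copied from SPEC06 table 1 (tot O3 vs tot O2), and the assumption is
that each percentage is the change relative to the first data column.

```python
def percent_change(base, new):
    """Relative size change of `new` against `base`, in percent."""
    return 100.0 * (new - base) / base


# (tot O2 size, tot O3 size), copied from SPEC06 table 1.
sizes = {
    "471.omnetpp": (564839, 579088),
    "450.soplex": (352709, 365785),
    "483.xalancbmk": (3357954, 3501139),
    "444.namd": (319553, 328449),
    "473.astar": (31343, 40199),
}

for name, (base, new) in sizes.items():
    print(f"{name:<15} {base:>8} {new:>8} {percent_change(base, new):7.2f}%")

# The size_sum row is computed from the summed sizes, not by averaging
# the per-benchmark percentages.
base_sum = sum(base for base, _ in sizes.values())
new_sum = sum(new for _, new in sizes.values())
total = percent_change(base_sum, new_sum)
print(f"{'size_sum':<15} {base_sum:>8} {new_sum:>8} {total:7.2f}%")
```

Because size_sum is weighted by absolute size, 483.xalancbmk dominates the
total, and a large relative change on a small benchmark such as 473.astar
barely moves it.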
On Thu, Nov 18, 2010 at 4:37 PM, Mark Heffernan <meheff@google.com> wrote:
> On Thu, Nov 18, 2010 at 4:12 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> Interesting. Coincidentally, I recently added logic for this for comdat
>> functions (setting the probability to 20%) to deal with the problem that a
>> lot of C++ programs do template instantiations that produce comdat functions
>> for no good reason. This indeed helped quite a lot. I have not gotten as far
>> as adding similar logic for normal external functions, since the current
>> toolchain won't eliminate them by default.
>
> For non-static, no-address-taken functions, I found that they are emitted in
> the binary (after linker garbage collection) only about 20% of the time or
> so. Surprisingly small. This is for large c++ programs. I'd guess a fair
> number of these functions are template instantiations which may be
> instantiated a particular way (eg, with the same types) in only one
> compilation unit. Plus if all callsites of a function are inlined in one
> compilation unit, it's more likely that they might be inlined in other
> compilation units too. However, I didn't dive down deep to figure out
> exactly why this number is so low.
>>
>> Did the posted size numbers include function garbage collection and
>> unification, set up the same way for mainline as for the Google compiler?
>
> I think the size numbers David posted earlier had some problems (statically
> linking stdc++ vs non-statically linked, I believe), so I'd wait until he
> reposts them to draw any conclusions. Not sure if garbage collection was
> enabled or not. In any case, we found maybe a 2% reduction in code size for
> -Os on x86-64 over our benchmark set comparing our local 4.4.3 vs vanilla
> 4.4.3. -O2 is comparable in size, but faster because we inline more
> aggressively which balances out the code size reduction. I have not done
> the comparison vs trunk.
> Mark
>
>>
>> Honza
>
>
Thread overview: 68+ messages
2010-04-29 16:29 GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64 Vladimir Makarov
2010-04-29 16:49 ` Jan Hubicka
2010-04-29 17:25 ` Vladimir Makarov
2010-04-29 18:17 ` Vladimir Makarov
2010-04-29 18:26 ` Xinliang David Li
2010-04-29 18:57 ` Vladimir Makarov
2010-04-29 19:42 ` Xinliang David Li
2010-04-29 20:19 ` Vladimir Makarov
2010-04-29 20:40 ` Xinliang David Li
2010-04-29 21:33 ` Jan Hubicka
2010-04-29 21:34 ` Jan Hubicka
2010-04-29 21:36 ` Xinliang David Li
2010-05-01 9:36 ` Jan Hubicka
2010-05-02 7:04 ` Xinliang David Li
2010-05-02 13:46 ` Jan Hubicka
2010-05-03 4:57 ` Xinliang David Li
2010-05-04 18:04 ` Jan Hubicka
2010-04-29 21:38 ` Xinliang David Li
2010-04-29 21:46 ` Jan Hubicka
2010-04-29 21:45 ` Steven Bosscher
2010-04-29 22:35 ` Xinliang David Li
2010-04-29 22:50 ` Jan Hubicka
2010-04-29 22:51 ` Steven Bosscher
2010-04-29 23:06 ` Jan Hubicka
2010-04-29 23:47 ` Steven Bosscher
2010-09-28 0:28 ` Neil Vachharajani
2010-09-28 1:01 ` Jack Howarth
2010-04-30 0:57 ` Xinliang David Li
2010-04-30 8:42 ` Jan Hubicka
2010-04-30 18:13 ` Xinliang David Li
2010-04-30 18:32 ` Jan Hubicka
2010-04-30 20:13 ` Xinliang David Li
2010-09-28 0:29 ` Neil Vachharajani
2010-04-29 22:42 ` Jack Howarth
2010-11-13 23:15 ` Xinliang David Li
2010-11-14 14:48 ` Paolo Bonzini
2010-11-14 15:43 ` Xinliang David Li
2010-11-14 21:12 ` H.J. Lu
2010-11-15 9:29 ` Xinliang David Li
2010-11-15 15:49 ` Andrey Belevantsev
2010-11-15 17:41 ` Xinliang David Li
2010-11-15 18:31 ` Jan Hubicka
2010-11-15 22:25 ` Richard Guenther
2010-11-15 22:47 ` Xinliang David Li
2010-11-15 23:06 ` Jan Hubicka
2010-11-16 0:41 ` Xinliang David Li
2010-11-16 0:53 ` Xinliang David Li
2010-11-16 1:02 ` Jan Hubicka
2010-11-16 1:19 ` Jan Hubicka
2010-11-16 1:24 ` Xinliang David Li
2010-11-16 1:39 ` Jan Hubicka
2010-11-16 1:45 ` Xinliang David Li
2010-11-16 4:11 ` Jan Hubicka
2010-11-16 6:56 ` Xinliang David Li
2010-11-16 8:26 ` Jan Hubicka
2010-11-16 9:00 ` Xinliang David Li
2010-11-16 14:23 ` Xinliang David Li
2010-11-16 17:10 ` Jan Hubicka
2010-11-16 19:11 ` Xinliang David Li
2010-11-18 11:48 ` Xinliang David Li
2010-11-18 13:06 ` Jan Hubicka
2010-11-18 17:20 ` Xinliang David Li
[not found] ` <AANLkTinTVN_T06eG1-nxei_Vj999wFT7qKZ55vwW+TtC@mail.gmail.com>
2010-11-19 4:22 ` Jan Hubicka
2010-11-19 7:26 ` Xinliang David Li
[not found] ` <AANLkTimvfm24_fvBdkYmqPVjSzcOgy0hx_0mO11AxbeC@mail.gmail.com>
2010-11-19 14:12 ` Xinliang David Li
2010-11-18 13:28 ` Jan Hubicka
2010-11-18 18:18 ` Xinliang David Li
2010-11-16 15:43 ` Richard Guenther