GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86

public inbox for gcc@gcc.gnu.org

* GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
@ 2010-04-29 16:29 Vladimir Makarov
  2010-04-29 16:49 ` Jan Hubicka
                   ` (3 more replies)
  0 siblings, 4 replies; 68+ messages in thread
From: Vladimir Makarov @ 2010-04-29 16:29 UTC (permalink / raw)
  To: gcc.gcc.gnu.org

  GCC-4.5.0 and LLVM-2.7 were released recently.  To understand
where we stand after releasing GCC-4.5.0 I benchmarked it on SPEC2000
for x86/x86-64 and posted the comparison of it with the
previous GCC releases and LLVM-2.7.

  Even benchmarking SPEC2000 takes a lot of time on the fastest
machine I have. So I don't plan to use SPEC2006 for this in the near
future.

  You can find the comparison at
http://vmakarov.fedorapeople.org/spec/ (please just click the links at the
bottom of the left frame, starting with the link "GCC release comparison").

  If you need exact numbers, please use the tables (the links to them
are also given) which were used to generate the corresponding bar
graphs.

  In general GCC-4.5.0 compiles faster (up to 10%) in -O2 mode.  This is
the first considerable compilation speed improvement since GCC-4.2.
GCC-4.5.0 also generates better code than the previous release (1-2% on
average, up to 4% for x86-64 SPECFP2000 in -O2 mode).  That is not
including LTO and Graphite, which can give even more (especially LTO)
in many cases.

  GCC-4.5.0 has two big new optimizations, LTO and Graphite (more
accurately, Graphite was introduced in the previous release).
Therefore I ran additional benchmarks to test them.

  LTO is a promising technology especially for integer benchmarks for
which it results in smaller and faster code.  But it might result in
degradations too on SPECFP2000 mainly because of big degradations on a
few benchmarks like wupwise or facerec.  Another annoying thing about
LTO, it considerably slows down the compiler.

  Currently Graphite gives small improvements on x86 (one exception is
2% for peak x86 SPECFP2000) and mostly degradations on x86_64 (the
largest being more than 10% for SPECFP2000 because of big degradations
on mgrid and swim).  So further work is needed on the project because
it does not seem mature yet.

  As for LLVM, it became slower at compiling (e.g. by 15%-50% for x86-64
in comparison with llvm-2.5).  So the gap between the compilation speed
of GCC and LLVM decreased and sometimes reaches 4% on x86_64 and 8% on
x86 (both for SPECInt2000 in -O2 mode).  Maybe I am wrong, but I don't
think CLANG will improve this situation significantly (in -O2 and -O3
mode) because optimizations still take most of the time of any serious
optimizing compiler.

  LLVM made progress in code performance, especially for floating
point benchmarks.  But the gap between LLVM-2.7 and GCC-4.5 in peak
performance (not including GCC LTO and Graphite) is still 6-7% on
SPECInt2000 and 13-17% on SPECFP2000.

  In general, IMHO GCC-4.5.0 is a good and promising release.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-04-29 16:29 GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64 Vladimir Makarov
@ 2010-04-29 16:49 ` Jan Hubicka
  2010-04-29 17:25   ` Vladimir Makarov
  2010-04-29 18:26 ` Xinliang David Li
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 68+ messages in thread
From: Jan Hubicka @ 2010-04-29 16:49 UTC (permalink / raw)
  To: Vladimir Makarov; +Cc: gcc.gcc.gnu.org

>  LTO is a promising technology especially for integer benchmarks for
> which it results in smaller and faster code.  But it might result in
> degradations too on SPECFP2000 mainly because of big degradations on a
> few benchmarks like wupwise or facerec.  Another annoying thing about
> LTO, it considerably slows down the compiler.

Seems like something sensitive to setup.  In our daily benchmarking LTO
is faster on wupwise (2116 compared to 1600), and facerec is 2003 compared to
2041 (so about the same).

http://gcc.opensuse.org/SPEC/CFP/sb-frescobaldi.suse.de-ai-64/list.html
http://gcc.opensuse.org/SPEC/CFP/sb-frescobaldi.suse.de-ipa-64/list.html

Did you test with -fwhole-program?

Honza

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-04-29 16:49 ` Jan Hubicka
@ 2010-04-29 17:25   ` Vladimir Makarov
  2010-04-29 18:17     ` Vladimir Makarov
  0 siblings, 1 reply; 68+ messages in thread
From: Vladimir Makarov @ 2010-04-29 17:25 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc.gcc.gnu.org

Jan Hubicka wrote:
>>  LTO is a promising technology especially for integer benchmarks for
>> which it results in smaller and faster code.  But it might result in
>> degradations too on SPECFP2000 mainly because of big degradations on a
>> few benchmarks like wupwise or facerec.  Another annoying thing about
>> LTO, it considerably slows down the compiler.
>>     
>
> Seems like something sensitive to setup.  In our daily benchmarking LTO
> is faster on wupwise (2116 compared to 1600), and facerec is 2003 compared to
> 2041 (so about the same).
>
> http://gcc.opensuse.org/SPEC/CFP/sb-frescobaldi.suse.de-ai-64/list.html
> http://gcc.opensuse.org/SPEC/CFP/sb-frescobaldi.suse.de-ipa-64/list.html
>
> Did you test with -fwhole-program?
>   
Yes, I used -flto -fwhole-program.  All this info is on the page.  The
test machines are also not experimental ones (both are Dell machines).

I used the released sources; maybe the difference comes from the
different sources.  In any case, I'll check the current trunk on these
machines.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-04-29 17:25   ` Vladimir Makarov
@ 2010-04-29 18:17     ` Vladimir Makarov
  0 siblings, 0 replies; 68+ messages in thread
From: Vladimir Makarov @ 2010-04-29 18:17 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc.gcc.gnu.org

Vladimir Makarov wrote:
> Jan Hubicka wrote:
>>
>> Seems like something sensitive to setup.  In our daily benchmarking LTO
>> is faster on wupwise (2116 compared to 1600), and facerec is 2003
>> compared to 2041 (so about the same).
>>
>> http://gcc.opensuse.org/SPEC/CFP/sb-frescobaldi.suse.de-ai-64/list.html
>> http://gcc.opensuse.org/SPEC/CFP/sb-frescobaldi.suse.de-ipa-64/list.html
>>
>> Did you test with -fwhole-program?
>>   
> Yes, I used -flto -fwhole-program.  All this info is on the page.  The
> test machines are also not experimental ones (both are Dell machines).
>
> I used the released sources; maybe the difference comes from the
> different sources.  In any case, I'll check the current trunk on these
> machines.
>
>
Here is what I got on today's trunk for x86_64 (2.93 GHz Core i7):

                                              wupwise
-O3                                           2670
-O3 -flto -fwhole-program                     2211
-O3 -ffast-math                               2753
-O3 -flto -fwhole-program -ffast-math         4325


So nothing is wrong with my test machine.  We simply measure different 
things.  You use -ffast-math, I don't use it.

For the comparison I used a simple combination of options for GCC and
LLVM.  For me it is obvious that GCC results can be improved more than
LLVM's by finding the right options, because GCC has many more
optimizations.

Still, it would be nice to fix the LTO SPEC2000 degradations when
-ffast-math is not used.
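
To put the four wupwise numbers above on a common footing, here is a
minimal Python sketch that computes the relative effect of adding
-flto -fwhole-program with and without -ffast-math.  The scores are
copied from the table; the helper name percent_change is only
illustrative.

```python
# Scores copied from the wupwise table above (higher is better).
WUPWISE = {
    "-O3": 2670,
    "-O3 -flto -fwhole-program": 2211,
    "-O3 -ffast-math": 2753,
    "-O3 -flto -fwhole-program -ffast-math": 4325,
}


def percent_change(baseline, value):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# Effect of adding LTO, first without and then with -ffast-math.
lto_plain = percent_change(
    WUPWISE["-O3"], WUPWISE["-O3 -flto -fwhole-program"])
lto_fast = percent_change(
    WUPWISE["-O3 -ffast-math"],
    WUPWISE["-O3 -flto -fwhole-program -ffast-math"])

print(f"LTO without -ffast-math: {lto_plain:+.1f}%")
print(f"LTO with    -ffast-math: {lto_fast:+.1f}%")
```

With these scores LTO costs roughly 17% without -ffast-math and gains
roughly 57% with it, which is the gap between the two setups discussed
above.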


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on  SPEC2000 for x86/x86_64
  2010-04-29 16:29 GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64 Vladimir Makarov
  2010-04-29 16:49 ` Jan Hubicka
@ 2010-04-29 18:26 ` Xinliang David Li
  2010-04-29 18:57   ` Vladimir Makarov
  2010-04-29 22:42 ` Jack Howarth
  2010-11-13 23:15 ` Xinliang David Li
  3 siblings, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-04-29 18:26 UTC (permalink / raw)
  To: Vladimir Makarov; +Cc: gcc.gcc.gnu.org

On Thu, Apr 29, 2010 at 9:25 AM, Vladimir Makarov <vmakarov@redhat.com> wrote:
>  LTO is a promising technology especially for integer benchmarks for
> which it results in smaller and faster code.  But it might result in
> degradations too on SPECFP2000 mainly because of big degradations on a
> few benchmarks like wupwise or facerec.  Another annoying thing about
> LTO, it considerably slows down the compiler.


The LTO improvement on SPECint2000 is only 1.86%:

                4.5     4.5+lto Improvement
164.gzip        955     950     -0.52%   <-- degrade
175.vpr         588     594     1.02%
176.gcc         1211    1216    0.41%
181.mcf         699     698     -0.14%
186.crafty      1011    987     -2.37%   <-- degrade
197.parser      792     813     2.65%
252.eon         1026    1023    -0.29%   <-- degrade
253.perlbmk     1312    1294    -1.37%   <-- degrade
254.gap         1021    1037    1.57%
255.vortex      1123    1319    17.45%
256.bzip2       737     768     4.21%
300.twolf       773     779     0.78%
-----------------------------------------------------
SPECint2000     913     930     1.86%


This matches our previous observation that to bring the best out of
LTO, FDO is also needed. (As a reference, LIPO improves over plain FDO
by ~4.5%, vortex improves 23%).  You will probably see even smaller
improvement in SPEC2006.
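
The per-benchmark column and the bottom row of the table above can be
re-derived with a short Python sketch.  It assumes the SPECint2000 row
is the usual SPEC geometric mean of the per-benchmark scores; the names
improvement and geomean are illustrative.

```python
import math

# (benchmark, GCC 4.5 score, GCC 4.5 + LTO score), copied from the table.
SCORES = [
    ("164.gzip", 955, 950),
    ("175.vpr", 588, 594),
    ("176.gcc", 1211, 1216),
    ("181.mcf", 699, 698),
    ("186.crafty", 1011, 987),
    ("197.parser", 792, 813),
    ("252.eon", 1026, 1023),
    ("253.perlbmk", 1312, 1294),
    ("254.gap", 1021, 1037),
    ("255.vortex", 1123, 1319),
    ("256.bzip2", 737, 768),
    ("300.twolf", 773, 779),
]


def improvement(base, lto):
    """Percent change of the LTO score relative to the base score."""
    return (lto - base) / base * 100.0


def geomean(values):
    """Geometric mean, computed through logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


base_total = geomean(base for _, base, _ in SCORES)
lto_total = geomean(lto for _, _, lto in SCORES)

for name, base, lto in SCORES:
    print(f"{name:12s} {improvement(base, lto):+7.2f}%")
print(f"geomean      {base_total:.1f} -> {lto_total:.1f}")
```

The composite improvement stays small because eleven of the twelve
benchmarks move by less than 5% either way, so the vortex gain is
heavily diluted in the geometric mean.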

It would be great if numbers were collected comparing LTO + FDO vs
plain FDO in the same setup.

Thanks,

David





^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-04-29 18:26 ` Xinliang David Li
@ 2010-04-29 18:57   ` Vladimir Makarov
  2010-04-29 19:42     ` Xinliang David Li
  2010-04-29 21:33     ` Jan Hubicka
  0 siblings, 2 replies; 68+ messages in thread
From: Vladimir Makarov @ 2010-04-29 18:57 UTC (permalink / raw)
  To: Xinliang David Li; +Cc: gcc.gcc.gnu.org

Xinliang David Li wrote:
> On Thu, Apr 29, 2010 at 9:25 AM, Vladimir Makarov <vmakarov@redhat.com> wrote:
>>  LTO is a promising technology especially for integer benchmarks for
>> which it results in smaller and faster code.  But it might result in
>> degradations too on SPECFP2000 mainly because of big degradations on a
>> few benchmarks like wupwise or facerec.  Another annoying thing about
>> LTO, it considerably slows down the compiler.
>>     
>
>
> The LTO improvement on SPECint2000 is only 1.86%:
>
>                 4.5     4.5+lto Improvement
> 164.gzip        955     950     -0.52%   <-- degrade
> 175.vpr         588     594     1.02%
> 176.gcc         1211    1216    0.41%
> 181.mcf         699     698     -0.14%
> 186.crafty      1011    987     -2.37%   <-- degrade
> 197.parser      792     813     2.65%
> 252.eon         1026    1023    -0.29%   <-- degrade
> 253.perlbmk     1312    1294    -1.37%   <-- degrade
> 254.gap         1021    1037    1.57%
> 255.vortex      1123    1319    17.45%
> 256.bzip2       737     768     4.21%
> 300.twolf       773     779     0.78%
> -----------------------------------------------------
> SPECint2000     913     930     1.86%
>
>
> This matches our previous observation that to bring the best out of
> LTO, FDO is also needed. (As a reference, LIPO improves over plain FDO
> by ~4.5%, vortex improves 23%).  You will probably see even smaller
> improvement in SPEC2006.
>
>   
Thanks for the comments.  FDO will probably improve the SPEC2000 score,
although that is not obvious for some tests: their train data sets
differ from the reference data sets, which might actually mislead the
compiler.

FDO is useful for programs whose branch probability distribution does
not change much across the possible data sets.  IMHO that is why FDO is
not widely used by most developers (although I am sure that for Google
applications it is extremely important), and therefore I don't measure
it and it is not so interesting for me.  A bigger reason not to use FDO,
though, is that it is inconvenient for a regular compiler user.

As for the vortex FDO improvement, vortex contains a moderate-size loop
in which most of the time is spent.  The loop has an if-then-else at the
top loop level.  On all SPEC2000 data sets, one branch of the if is
taken practically always (like 1 to 1,000,000).  So it is not surprising
to me that FDO gives such an improvement for vortex.
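
The argument above can be illustrated with a toy edge profile.  The
sketch below is Python rather than anything taken from vortex:
profile_branch, taken_probability, rare and the two input ranges are
hypothetical stand-ins for a hot loop whose if-branch fires once per
million iterations.  It only shows that when the bias is the same on the
training and reference inputs, the profile gathered in training
describes the reference run well.

```python
from collections import Counter


def profile_branch(inputs, condition):
    """Count how often a two-way branch goes each way over one run."""
    counts = Counter()
    for value in inputs:
        counts["then" if condition(value) else "else"] += 1
    return counts


def taken_probability(counts):
    """Fraction of executions that took the 'then' side."""
    total = counts["then"] + counts["else"]
    return counts["then"] / total if total else 0.0


def rare(i):
    # Hypothetical hot-loop condition: true once per 1,000,000 iterations.
    return i % 1_000_000 == 0


train = profile_branch(range(1_000_000), rare)  # stand-in for a train run
ref = profile_branch(range(3_000_000), rare)    # stand-in for a reference run

print("train:", taken_probability(train))
print("ref:  ", taken_probability(ref))
```

If the training input had a very different bias from the reference
input, the two probabilities would diverge, which is the case where a
profile could mislead the compiler.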
> It would be great if numbers were collected comparing LTO + FDO vs
> plain FDO in the same setup.
>
>   
Usually after posting such comparisons I get a lot of requests.  I'd
like to do all of them, but unfortunately running the benchmarks and
preparing the results takes a lot of my time.  Maybe I'll do such a
comparison next year.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on  SPEC2000 for x86/x86_64
  2010-04-29 18:57   ` Vladimir Makarov
@ 2010-04-29 19:42     ` Xinliang David Li
  2010-04-29 20:19       ` Vladimir Makarov
  2010-04-29 21:33     ` Jan Hubicka
  1 sibling, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-04-29 19:42 UTC (permalink / raw)
  To: Vladimir Makarov; +Cc: gcc.gcc.gnu.org

>>
>
> Thanks for the comments.  FDO will probably improve the SPEC2000 score,
> although that is not obvious for some tests: their train data sets
> differ from the reference data sets, which might actually mislead the
> compiler.
>
> FDO is useful for programs whose branch probability distribution does
> not change much across the possible data sets.  IMHO that is why FDO is
> not widely used by most developers (although I am sure that for Google
> applications it is extremely important), and therefore I don't measure
> it and it is not so interesting for me.  A bigger reason not to use FDO,
> though, is that it is inconvenient for a regular compiler user.
>
> As for the vortex FDO improvement, vortex contains a moderate-size loop
> in which most of the time is spent.  The loop has an if-then-else at the
> top loop level.  On all SPEC2000 data sets, one branch of the if is
> taken practically always (like 1 to 1,000,000).  So it is not surprising
> to me that FDO gives such an improvement for vortex.

Actually what I was trying to say is that LTO will be more powerful
when combined with FDO. In other words, I expect LTO + FDO to improve
over plain FDO by more than 1.86%.


>>
>> It would be great if numbers were collected comparing LTO + FDO vs
>> plain FDO in the same setup.
>>
>>
>
> Usually after posting such comparisons I get a lot of requests.  I'd
> like to do all of them, but unfortunately running the benchmarks and
> preparing the results takes a lot of my time.  Maybe I'll do such a
> comparison next year.

Ok. Another comment is that using SPEC2000 for performance testing
won't be indicative of today's real world program size. Even
SPEC2006's largest C++ programs are not that big.

Thanks,

David


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on  SPEC2000 for x86/x86_64
  2010-04-29 19:42     ` Xinliang David Li
@ 2010-04-29 20:19       ` Vladimir Makarov
  2010-04-29 20:40         ` Xinliang David Li
  0 siblings, 1 reply; 68+ messages in thread
From: Vladimir Makarov @ 2010-04-29 20:19 UTC (permalink / raw)
  To: Xinliang David Li; +Cc: gcc.gcc.gnu.org

Xinliang David Li wrote:
> Ok. Another comment is that using SPEC2000 for performance testing
> won't be indicative of today's real world program size. Even
> SPEC2006's largest C++ programs are not that big.
>
>
>   
It is very subjective what today's real world program size is.  Usually
it reflects what you are working on.  I understand that Google
applications are huge and their speed is important for saving money (or
energy) for their employees.  Firefox is big enough, but for a regular
desktop user a 1% improvement may be invisible or not important if it is
already fast enough.

A math-physics program can be small but its speed may be really
important because it takes hours or days on a fast machine.  Even big
and intensively used applications like some logistics systems can have
small program parts (e.g. an ILP solver, or compression algorithms like
gzip for speeding up Internet communication) whose optimization is the
most important for the application, and SPEC contains such
calculation-intensive code (a lot of NP-complete task solvers and math
physics programs).  So I would not say that using SPEC for performance
testing is unimportant for improving today's real world size programs.
Of course it is not as important as testing the program you are working
on.  In other words, that program is the most important benchmark for
you but probably not for others.

As for me, GCC itself is a very important program and SPEC contains it
(an old version in 2000 and a more recent one in 2006).  So SPEC is
pretty important and good for me (not perfect of course, at least
because it is not free), although it is not the only one I care about.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on  SPEC2000 for x86/x86_64
  2010-04-29 20:19       ` Vladimir Makarov
@ 2010-04-29 20:40         ` Xinliang David Li
  0 siblings, 0 replies; 68+ messages in thread
From: Xinliang David Li @ 2010-04-29 20:40 UTC (permalink / raw)
  To: Vladimir Makarov; +Cc: gcc.gcc.gnu.org

Point well put. The benchmark suite should have a good mixture of
programs of different sizes. SPEC2k programs cluster at the lower
end of the spectrum, though.

David


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-04-29 18:57   ` Vladimir Makarov
  2010-04-29 19:42     ` Xinliang David Li
@ 2010-04-29 21:33     ` Jan Hubicka
  2010-04-29 21:34       ` Jan Hubicka
                         ` (3 more replies)
  1 sibling, 4 replies; 68+ messages in thread
From: Jan Hubicka @ 2010-04-29 21:33 UTC (permalink / raw)
  To: Vladimir Makarov; +Cc: Xinliang David Li, gcc.gcc.gnu.org

> Thanks for the comments.  FDO will probably improve the SPEC2000 score,
> although that is not obvious for some tests: their train data sets
> differ from the reference data sets, which might actually mislead the
> compiler.

There are several studies on the topic and it is not that bad in practice.
In the vast majority of cases even pretty bad training runs get a significant
portion of the improvement you can get from training on the final benchmark
data.  In the SPEC case FDO improves pretty much all benchmarks.

I think FDO is relatively little used because it is relatively hard
to use (i.e. the user has to modify makefiles and learn how the feature works)
and also because there is very little support for it (e.g. in automake and such).
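
For reference, the extra build steps mentioned above look roughly like
this.  The Python sketch only assembles the command lines of a basic GCC
profile-feedback cycle (-fprofile-generate, a training run, then
-fprofile-use) and does not execute them; the function name
fdo_commands, the source files, the program name and the training input
are placeholders.

```python
import shlex


def fdo_commands(sources, output, train_args, opt="-O2", cc="gcc"):
    """Return the three command lines of a basic GCC FDO cycle."""
    # Step 1: build an instrumented binary that records a profile.
    instrumented = [cc, opt, "-fprofile-generate", *sources, "-o", output]
    # Step 2: run it on representative training input to write the profile.
    training_run = [f"./{output}", *train_args]
    # Step 3: rebuild, letting the compiler read the recorded profile.
    optimized = [cc, opt, "-fprofile-use", *sources, "-o", output]
    return [instrumented, training_run, optimized]


# Placeholder file names, for illustration only.
for command in fdo_commands(["main.c", "util.c"], "app", ["train.input"]):
    print(shlex.join(command))
```

In a real project the same two flags have to be threaded through the
makefiles for both builds, which is the inconvenience being described.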
> As for the vortex FDO improvement, vortex contains a moderate-size loop
> in which most of the time is spent.  The loop has an if-then-else at the
> top loop level.  On all SPEC2000 data sets, one branch of the if is
> taken practically always (like 1 to 1,000,000).  So it is not surprising
> to me that FDO gives such an improvement for vortex.

It would be interesting to know if the same improvement happens with LTO
and, if not, what LIPO does.  I will unbreak vortex on our tester.

Honza

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-04-29 21:33     ` Jan Hubicka
@ 2010-04-29 21:34       ` Jan Hubicka
  2010-04-29 21:36       ` Xinliang David Li
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 68+ messages in thread
From: Jan Hubicka @ 2010-04-29 21:34 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Vladimir Makarov, Xinliang David Li, gcc.gcc.gnu.org


BTW we are also tracking SPEC2k6 with and without LTO (not FDO runs)
http://gcc.opensuse.org/SPEC/CINT/sb-barbella.suse.de-ai-64/recent.html
http://gcc.opensuse.org/SPEC/CINT/sb-barbella.suse.de-head-64-2006/recent.html

Not all 2k6 tests pass with LTO, so it will take a bit of care to compare results.

Honza

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on  SPEC2000 for x86/x86_64
  2010-04-29 21:33     ` Jan Hubicka
  2010-04-29 21:34       ` Jan Hubicka
@ 2010-04-29 21:36       ` Xinliang David Li
  2010-05-01  9:36         ` Jan Hubicka
  2010-04-29 21:38       ` Xinliang David Li
  2010-04-29 21:45       ` Steven Bosscher
  3 siblings, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-04-29 21:36 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Vladimir Makarov, gcc.gcc.gnu.org

On Thu, Apr 29, 2010 at 2:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Thanks for the comments.  FDO will probably improve SPEC2000 score.
>> Although it is not obvious for some tests because the train data sets
>> for them are different from the reference data sets and it might
>> actually mislead the  compiler.
>
> There are several studies on the topic and it is not that bad in practice.
> In the vast majority of cases even pretty bad training runs get a significant
> portion of the improvement you can get from training on the final benchmark
> data.  In the SPEC case FDO improves pretty much all benchmarks.

Agree.

>
> I think the FDO is relatively little used because it is relatively hard
> to use (i.e. user has to modify makefiles and learn how the feature works)
> and also because there is very little support for it (i.e. in automake and such)
>> As for vortex FDO improvement, vortex contains a moderate size loop in
>> which most of time is spent.  The loop has if-then-else on the top loop
>> level.  On all SPEC2000 data sets, one if-branch  is  taken practically
>> always  (like 1 to  1,000,000).   So it is not amazing for me that FDO
>> gives such improvement for vortex.
>
> It would be interesting to know if same improvement happens with LTO and if
> not what LIPO does.  I will unbreak vortex on our tester.
>

Vortex needs -fno-strict-aliasing.  It casts between two record types
with one record being a 'prefix' of another.
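
A minimal sketch of the pattern, with hypothetical types (this is not
vortex's source): struct base is a 'prefix' of struct derived, and a
struct derived object is accessed through a pointer to struct base.  The
names base_id and derived_id are made up for illustration.

```c
/* Hypothetical record types: struct base is a "prefix" of struct derived. */
struct base {
    int id;
    int kind;
};

struct derived {
    int id;        /* same leading members, in the same order, as struct base */
    int kind;
    long payload;  /* extra field only the longer record has */
};

/* Reads the shared leading field through the shorter type. */
int base_id(const struct base *b)
{
    return b->id;
}

/* Treats a derived record as its prefix.  Under strict aliasing the
   compiler may assume that accesses through struct base and through
   struct derived never touch the same object, so code written this way
   is normally built with -fno-strict-aliasing. */
int derived_id(const struct derived *d)
{
    return base_id((const struct base *)d);
}
```

The usual conforming alternative is to embed struct base as the first member
of struct derived, but that means changing the benchmark source.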

David



> Honza
>

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on  SPEC2000 for x86/x86_64
  2010-04-29 21:33     ` Jan Hubicka
  2010-04-29 21:34       ` Jan Hubicka
  2010-04-29 21:36       ` Xinliang David Li
@ 2010-04-29 21:38       ` Xinliang David Li
  2010-04-29 21:46         ` Jan Hubicka
  2010-04-29 21:45       ` Steven Bosscher
  3 siblings, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-04-29 21:38 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Vladimir Makarov, gcc.gcc.gnu.org

I noticed eon's peak options do not include FDO, is that intended?

David


On Thu, Apr 29, 2010 at 2:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Thanks for the comments.  FDO will probably improve SPEC2000 score.
>> Although it is not obvious for some tests because the train data sets
>> for them are different from the reference data sets and it might
>> actually mislead the  compiler.
>
> There are several studies on the topic and it is not that bad in practice.
> In the vast majority of cases even pretty bad training runs get a significant
> portion of the improvement you can get from training on the final benchmark
> data.  In the SPEC case FDO improves pretty much all benchmarks.
>
> I think the FDO is relatively little used because it is relatively hard
> to use (i.e. user has to modify makefiles and learn how the feature works)
> and also because there is very little support for it (i.e. in automake and such)
>> As for vortex FDO improvement, vortex contains a moderate size loop in
>> which most of time is spent.  The loop has if-then-else on the top loop
>> level.  On all SPEC2000 data sets, one if-branch  is  taken practically
>> always  (like 1 to  1,000,000).   So it is not amazing for me that FDO
>> gives such improvement for vortex.
>
> It would be interesting to know if same improvement happens with LTO and if
> not what LIPO does.  I will unbreak vortex on our tester.
>
> Honza
>

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on  SPEC2000 for x86/x86_64
  2010-04-29 21:33     ` Jan Hubicka
                         ` (2 preceding siblings ...)
  2010-04-29 21:38       ` Xinliang David Li
@ 2010-04-29 21:45       ` Steven Bosscher
  2010-04-29 22:35         ` Xinliang David Li
  3 siblings, 1 reply; 68+ messages in thread
From: Steven Bosscher @ 2010-04-29 21:45 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Vladimir Makarov, Xinliang David Li, gcc.gcc.gnu.org

On Thu, Apr 29, 2010 at 11:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> It would be interesting to know if same improvement happens with LTO and if
> not what LIPO does.  I will unbreak vortex on our tester.

Perhaps you can add a LIPO tester? It looks like a very interesting
and promising approach.

Ciao!
Steven

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-04-29 21:38       ` Xinliang David Li
@ 2010-04-29 21:46         ` Jan Hubicka
  0 siblings, 0 replies; 68+ messages in thread
From: Jan Hubicka @ 2010-04-29 21:46 UTC (permalink / raw)
  To: Xinliang David Li; +Cc: Jan Hubicka, Vladimir Makarov, gcc.gcc.gnu.org

> I noticed eon's peak options do not include FDO, is that intended?
I think it is just a bug in the page header, but I will double-check.
Base and peak should match otherwise.

Honza

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on  SPEC2000 for x86/x86_64
  2010-04-29 21:45       ` Steven Bosscher
@ 2010-04-29 22:35         ` Xinliang David Li
  2010-04-29 22:50           ` Jan Hubicka
  0 siblings, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-04-29 22:35 UTC (permalink / raw)
  To: Steven Bosscher
  Cc: Jan Hubicka, Vladimir Makarov, gcc.gcc.gnu.org, Raksit Ashok

Thanks for the suggestion. Raksit is currently busy merging trunk
changes back to the lw-ipo branch, which can be a daunting task. After that
this can be done.  (Our internal release is based on 4.4).

David

On Thu, Apr 29, 2010 at 2:38 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> On Thu, Apr 29, 2010 at 11:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> It would be interesting to know if same improvement happens with LTO and if
>> not what LIPO does.  I will unbreak vortex on our tester.
>
> Perhaps you can add a LIPO tester? It looks like a very interesting
> and promising approach.
>
> Ciao!
> Steven
>

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-04-29 16:29 GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64 Vladimir Makarov
  2010-04-29 16:49 ` Jan Hubicka
  2010-04-29 18:26 ` Xinliang David Li
@ 2010-04-29 22:42 ` Jack Howarth
  2010-11-13 23:15 ` Xinliang David Li
  3 siblings, 0 replies; 68+ messages in thread
From: Jack Howarth @ 2010-04-29 22:42 UTC (permalink / raw)
  To: Vladimir Makarov; +Cc: gcc.gcc.gnu.org

On Thu, Apr 29, 2010 at 12:25:15PM -0400, Vladimir Makarov wrote:
....
>
>  Currently Graphite gives small improvements on x86 (one exception is
> 2% for peak x86 SPECFP2000) and mostly degradation on x86_64 (with
> maximum one more than 10% for SPECFP2000 because of big degradations
> on mgrid and swim).  So further work is needed on the project because
> it seems not mature yet.
>

Vladimir,
  Keep in mind that -fgraphite-identity currently still causes
vectorization opportunities to be missed. Once that is fixed,
the higher-level Graphite optimizations may look a lot better.
             Jack

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-04-29 22:35         ` Xinliang David Li
@ 2010-04-29 22:50           ` Jan Hubicka
  2010-04-29 22:51             ` Steven Bosscher
  0 siblings, 1 reply; 68+ messages in thread
From: Jan Hubicka @ 2010-04-29 22:50 UTC (permalink / raw)
  To: Xinliang David Li
  Cc: Steven Bosscher, Jan Hubicka, Vladimir Makarov, gcc.gcc.gnu.org,
	Raksit Ashok

> Thanks for the suggestion. Raksit currently is busy with merging trunk
> changes back to lw-ipo branch which can be a daunting task. After that
> this can be done.  (Our internal release is based on 4.4).

I must say that LIPO is something I have always intended to look into but have
not seriously found time for yet (well, I am hoping that submitting the thesis
will make this easier).
Which LIPO features are missing in -flto -fprofile-use?

Honza
> 
> David
> 
> On Thu, Apr 29, 2010 at 2:38 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> > On Thu, Apr 29, 2010 at 11:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >> It would be interesting to know if same improvement happens with LTO and if
> >> not what LIPO does.  I will unbreak vortex on our tester.
> >
> > Perhaps you can add a LIPO tester? It looks like a very interesting
> > and promising approach.
> >
> > Ciao!
> > Steven
> >

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on  SPEC2000 for x86/x86_64
  2010-04-29 22:50           ` Jan Hubicka
@ 2010-04-29 22:51             ` Steven Bosscher
  2010-04-29 23:06               ` Jan Hubicka
  0 siblings, 1 reply; 68+ messages in thread
From: Steven Bosscher @ 2010-04-29 22:51 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Xinliang David Li, Vladimir Makarov, gcc.gcc.gnu.org, Raksit Ashok

2010/4/30 Jan Hubicka <hubicka@ucw.cz>:
>> Thanks for the suggestion. Raksit currently is busy with merging trunk
>> changes back to lw-ipo branch which can be a daunting task. After that
>> this can be done.  (Our internal release is based on 4.4).
>
> I must say that LIPO is something I always intend to look into but didn't
> seriously find time for that yet (well, hoping that submitting the thesis will
> make this easier).
> What are the LIPO's features that are missing in -flto -fprofile-use?

LIPO is a completely different approach, basically independent of LTO.
There is a good explanation of it on the wiki, see
http://gcc.gnu.org/wiki/LightweightIpo.

Ciao!
Steven

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-04-29 22:51             ` Steven Bosscher
@ 2010-04-29 23:06               ` Jan Hubicka
  2010-04-29 23:47                 ` Steven Bosscher
  2010-04-30  0:57                 ` Xinliang David Li
  0 siblings, 2 replies; 68+ messages in thread
From: Jan Hubicka @ 2010-04-29 23:06 UTC (permalink / raw)
  To: Steven Bosscher
  Cc: Jan Hubicka, Xinliang David Li, Vladimir Makarov,
	gcc.gcc.gnu.org, Raksit Ashok

> 2010/4/30 Jan Hubicka <hubicka@ucw.cz>:
> >> Thanks for the suggestion. Raksit currently is busy with merging trunk
> >> changes back to lw-ipo branch which can be a daunting task. After that
> >> this can be done.  (Our internal release is based on 4.4).
> >
> > I must say that LIPO is something I always intend to look into but didn't
> > seriously find time for that yet (well, hoping that submitting the thesis will
> > make this easier).
> > What are the LIPO's features that are missing in -flto -fprofile-use?
> 
> LIPO is a completely different approach, basically independent of LTO.
> There is a good explanation of it on the wiki, see
> http://gcc.gnu.org/wiki/LightweightIpo.

Yep, I read that page (and saw some of the implementation too).  I just was not
able to work out the precise feature set of LIPO (i.e. if it gets better SPEC
results than LTO+FDO, then why).

Honza
> 
> Ciao!
> Steven

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on  SPEC2000 for x86/x86_64
  2010-04-29 23:06               ` Jan Hubicka
@ 2010-04-29 23:47                 ` Steven Bosscher
  2010-09-28  0:28                   ` Neil Vachharajani
  2010-04-30  0:57                 ` Xinliang David Li
  1 sibling, 1 reply; 68+ messages in thread
From: Steven Bosscher @ 2010-04-29 23:47 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Xinliang David Li, Vladimir Makarov, gcc.gcc.gnu.org, Raksit Ashok

2010/4/30 Jan Hubicka <hubicka@ucw.cz>:
> Yep, I read that page (and saw some of implementation too).  Just was not able
> to follow the precise feature set of LIPO (i.e. if it gets better SPEC results
> than LTO+FDO then why)

OK, that's an interesting question. The first question (if...) is
something you'll have to try yourself, I suppose :-)

BTW will the CGO presentation about LIPO and sampled FDO be published
somewhere in the open?

Ciao!
Steven

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on  SPEC2000 for x86/x86_64
  2010-04-29 23:06               ` Jan Hubicka
  2010-04-29 23:47                 ` Steven Bosscher
@ 2010-04-30  0:57                 ` Xinliang David Li
  2010-04-30  8:42                   ` Jan Hubicka
  1 sibling, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-04-30  0:57 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Steven Bosscher, Vladimir Makarov, gcc.gcc.gnu.org, Raksit Ashok

On Thu, Apr 29, 2010 at 4:03 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> 2010/4/30 Jan Hubicka <hubicka@ucw.cz>:
>> >> Thanks for the suggestion. Raksit currently is busy with merging trunk
>> >> changes back to lw-ipo branch which can be a daunting task. After that
>> >> this can be done.  (Our internal release is based on 4.4).
>> >
>> > I must say that LIPO is something I always intend to look into but didn't
>> > seriously find time for that yet (well, hoping that submitting the thesis will
>> > make this easier).
>> > What are the LIPO's features that are missing in -flto -fprofile-use?
>>
>> LIPO is a completely different approach, basically independent of LTO.
>> There is a good explanation of it on the wiki, see
>> http://gcc.gnu.org/wiki/LightweightIpo.
>
> Yep, I read that page (and saw some of implementation too).  Just was not able
> to follow the precise feature set of LIPO (i.e. if it gets better SPEC results
> than LTO+FDO then why)
>

In theory, LIPO should not generate better results than LTO+FDO. What
makes LIPO attractive is that it allows distributed builds from the
beginning. Its integration with large distributed build systems is also
easy.  Another point is that LIPO can be decoupled from FDO as well.
The reason is that cross-module call clusters do not change that much
and can be determined statically, or determined once using sample
profiling information. The grouping info can then be used for regular
O2 builds. This will remove the need for people to move functions into
header files, which tends to penalize compile time unnecessarily.

If there is a performance difference, the following things unique to
LIPO may contribute to it (I have not validated them):

1) LIPO supports tracking indirect call targets across modules. This
is not feasible for plain FDO as there will be cgraph pid conflicts.
LIPO uses a unique function id == (module_id << 32) + func_def_no, which
makes it possible.
2) comdat function resolution -- since LIPO uses aux module functions
for inlining purposes only, it has the freedom to choose which copy to
use. The current scheme gives priority to the copy in the current module,
for better profile data context sensitivity (see below).
3) in the profile-gen phase, allow more inlining for comdat functions (in
einline2 and ipa-inline) -- this will cause profile data to be tracked
with module sensitivity (note that counters are not in the comdat group).
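
The function id formula in point 1 can be sketched as below.  This only
restates the arithmetic given above; the helper names and the choice of 32-bit
inputs are assumptions for illustration, not code from the lw-ipo branch.

```c
#include <stdint.h>

/* Packs a module id and a per-module function number into one 64-bit id:
   (module_id << 32) + func_def_no.  Two functions with the same
   func_def_no in different modules therefore get different ids. */
uint64_t global_func_id(uint32_t module_id, uint32_t func_def_no)
{
    return ((uint64_t)module_id << 32) + func_def_no;
}

/* Inverse helpers (hypothetical), e.g. for reading profile data back. */
uint32_t func_id_module(uint64_t id) { return (uint32_t)(id >> 32); }
uint32_t func_id_def_no(uint64_t id) { return (uint32_t)(id & 0xffffffffu); }
```

The cast to uint64_t before the shift matters: shifting a 32-bit value left
by 32 is undefined in C.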

Thanks,

David

> Honza
>>
>> Ciao!
>> Steven
>

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-04-30  0:57                 ` Xinliang David Li
@ 2010-04-30  8:42                   ` Jan Hubicka
  2010-04-30 18:13                     ` Xinliang David Li
  0 siblings, 1 reply; 68+ messages in thread
From: Jan Hubicka @ 2010-04-30  8:42 UTC (permalink / raw)
  To: Xinliang David Li
  Cc: Jan Hubicka, Steven Bosscher, Vladimir Makarov, gcc.gcc.gnu.org,
	Raksit Ashok

> In theory, LIPO should not generate better results than LTO+FDO. What
> makes LIPO attractive is that it allows distributed build from the
> beginning. Its integration with large distributed build system is also
> easy.  Another point is that LIPO can be decoupled from FDO as well.

The integration should be pretty much the same as with current FDO, right?
Just arrange to get everything built twice, with a training run in between.

> The reason is that cross module call clusters do not change that much
> and can be determined statically or determined once using sample
> profiling information. The grouping info can then be used for regular
> O2 builds. This will remove the need for people to move functions into

So this means: build once to gather the callgraph and, instead of deciding the
grouping at runtime from profile info, just do it statically via some tool?

> header files which tend to penalize compile time unnecessarily.
> 
> If there is performance difference, the following unique things in
> LIPO may contribute to it ( I have not validate them)
> 
> 1) LIPO supports tracking indirect call targets across modules. This
> is not feasible for plain FDO as there will be cgraph pid conflicts.
> LIPO uses unique function id == (module_id << 32) + func_def_no, which
> makes it possible.

Interesting.  My plan for profiling with LTO is to ultimately make it a link-time
transform.  This will be more difficult with WHOPR (i.e. instrumenting needs
function bodies that are not available at WPA time), but I believe it is
solvable: just assign uids to the edges and do the instrumentation at ltrans.
Then we will save the cgraph profile in some easier way so WHOPR can read it in
and read the rest of the stuff in ltrans.  This would involve shipping the
correct profiles for a given function etc., so it will be a bit of an
implementation challenge.

> 2) comdat function resolution -- since LIPO uses aux module functions
> for inlining purpose only, it has the freedom to choose which copy to
> use. The current scheme chooses copy in current module with priority
> for better profile data context sensitivity (see below)

This is interesting.  How do you solve the problem when a given comdat function
"loses", i.e. it is replaced at link time by another function that may or may
not be profiled from another unit?

I am aware that current FDO gets this wrong (it assumes that comdat functions
are never replaced from another unit).  I guess the situation can be improved a
bit by doing some localization even without -fwhole-program, or by teaching the
runtime to merge profiles into each individual copy of the comdat...

Honza

> 3) in profile-gen phase, allow more inlining for comdat functions (in
> einline2 and ipa-inline) -- this will cause profile data to be tracked
> with module sensitivity (note that counters are not in comdat group)
> 
> Thanks,
> 
> David
> 
> 
> 
> > Honza
> >>
> >> Ciao!
> >> Steven
> >

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on  SPEC2000 for x86/x86_64
  2010-04-30  8:42                   ` Jan Hubicka
@ 2010-04-30 18:13                     ` Xinliang David Li
  2010-04-30 18:32                       ` Jan Hubicka
  0 siblings, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-04-30 18:13 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Steven Bosscher, Vladimir Makarov, gcc.gcc.gnu.org, Raksit Ashok

On Fri, Apr 30, 2010 at 1:37 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> In theory, LIPO should not generate better results than LTO+FDO. What
>> makes LIPO attractive is that it allows distributed build from the
>> beginning. Its integration with large distributed build system is also
>> easy.  Another point is that LIPO can be decoupled from FDO as well.

>
> The integration should be pretty much same as with current FDO, right?
> Just arrange to get everything build twice and trained in between.


Right. LIPO behaves similarly to plain FDO (no IPO).

>
>> The reason is that cross module call clusters do not change that much
>> and can be determined statically or determined once using sample
>> profiling information. The grouping info can then be used for regular
>> O2 builds. This will remove the need for people to move functions into
>
> This means, build once to gather callgraph and instead of deciding grouping
> at runtime with profile info in it, just do it via some tool statically?
>

One could run the LIPO-instrumented binary once, get the grouping, and
reuse it.  It is also possible to determine this grouping
using a regular binary with hardware profiling (we have an internal tool
to do that). Or, if the user knows the module affinity (but does not
want to rearrange source structures for SW engineering reasons), he
can choose to specify the grouping statically (not supported
yet). For instance, we can invent some directives for that:

#include_aux_module "../a/b/c.cpp"

The real scenario can be more complicated than this in order to
support different options, include search paths etc.


>> header files which tend to penalize compile time unnecessarily.
>>
>> If there is performance difference, the following unique things in
>> LIPO may contribute to it ( I have not validate them)
>>
>> 1) LIPO supports tracking indirect call targets across modules. This
>> is not feasible for plain FDO as there will be cgraph pid conflicts.
>> LIPO uses unique function id == (module_id << 32) + func_def_no, which
>> makes it possible.
>
> Interesting.  My plan for profiling with LTO is to ultimately make it linktime
> transform.  This will be more difficult with WHOPR (i.e. instrumenting need
> function bodies that are not available at WPA time), but I believe it is
> solvable: just assign uids to the edges and do instrumentation at ltrans.  Then
> we will save cgraph profile in some easier way so WHOPR can read it in and read
> rest of stuff in ltrans.  This would involve shipping the correct profiles for
> given function etc so it will be a bit of implementation challenge.

This can be tricky -- to maximize FDO benefit, the
profile-use/annotation needs to happen early which means
instrumentation also needs to happen early (to avoid cfg mismatches).


>
>> 2) comdat function resolution -- since LIPO uses aux module functions
>> for inlining purpose only, it has the freedom to choose which copy to
>> use. The current scheme chooses copy in current module with priority
>> for better profile data context sensitivity (see below)
>
> This is interesting.  How do you solve the problem when given comdat function
> "loses"? I.e. it is replaced at linktime by other function that may or may
> not be profiled from other unit?

Whichever function is selected will have profile data (assuming it
is called at runtime) -- but the profile data are merged from different
contexts, including calls from different modules.  For instance, if
both a.C and b.C define foo, b.C:foo is selected at runtime, and
a.C:foo is not inlined (after instrumentation) anywhere in a.C, then
a.C:foo won't have any profile data, and b.C:foo has merged profile
data resulting from calls in both a.C and b.C.
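
A toy model of that a.C / b.C example, under stated assumptions: it only
counts function entries, it ignores inlined copies, and every name in it is
made up.  It is not how the profiling runtime is implemented; it just shows
why one copy ends up with merged counts and the other with none.

```c
#include <stdint.h>

enum module { MODULE_A, MODULE_B, NUM_MODULES };

/* One entry counter per module's out-of-line copy of the comdat foo. */
static uint64_t foo_entry_count[NUM_MODULES];

/* The single copy that survives comdat resolution; here b.C:foo. */
static const enum module selected_copy = MODULE_B;

/* An out-of-line call to foo from any module lands in the selected copy,
   so only that copy's counter advances. */
void call_foo_from(enum module caller)
{
    (void)caller;  /* the caller's identity is lost in the merged count */
    foo_entry_count[selected_copy]++;
}

uint64_t foo_count(enum module m) { return foo_entry_count[m]; }
```

After three calls from a.C and two from b.C, the b.C copy has a count of five
and the a.C copy has zero, which is the situation described above.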


>
> I am aware that current FDO gets this wrong (it assumes that comdat functions
> are never replaced from other unit).  I guess situation can be improved a bit
> by doing some localization even at no -fwhole-program or teach runtime to merge
> in profiles into each individual copy of comdat...

Yes, the current FDO assumption is wrong.

Thanks,

David

>
> Honza
>
>> 3) in profile-gen phase, allow more inlining for comdat functions (in
>> einline2 and ipa-inline) -- this will cause profile data to be tracked
>> with module sensitivity (note that counters are not in comdat group)
>>
>> Thanks,
>>
>> David
>>
>>
>>
>> > Honza
>> >>
>> >> Ciao!
>> >> Steven
>> >
>

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-04-30 18:13                     ` Xinliang David Li
@ 2010-04-30 18:32                       ` Jan Hubicka
  2010-04-30 20:13                         ` Xinliang David Li
  0 siblings, 1 reply; 68+ messages in thread
From: Jan Hubicka @ 2010-04-30 18:32 UTC (permalink / raw)
  To: Xinliang David Li
  Cc: Jan Hubicka, Steven Bosscher, Vladimir Makarov, gcc.gcc.gnu.org,
	Raksit Ashok

> >
> > Interesting.  My plan for profiling with LTO is to ultimately make it linktime
> > transform.  This will be more difficult with WHOPR (i.e. instrumenting need
> > function bodies that are not available at WPA time), but I believe it is
> > solvable: just assign uids to the edges and do instrumentation at ltrans.  Then
> > we will save cgraph profile in some easier way so WHOPR can read it in and read
> > rest of stuff in ltrans.  This would involve shipping the correct profiles for
> > given function etc so it will be a bit of implementation challenge.
> 
> This can be tricky -- to maximize FDO benefit, the
> profile-use/annotation needs to happen early which means
> instrumentation also needs to happen early (to avoid cfg mismatches).

I don't see much of a problem in this particular area.

The GCC optimization queue is organized so that we first do early
optimizations that are all intended to be simple cleanups without size/speed
tradeoffs.  Then we do IPA and late optimizations, both of which are driven by
the profile (estimated or read).
Profile reading happens early because we use the same infrastructure for gcov
and profile feedback.  This does not make profile feedback more beneficial,
quite the converse, since early passes may not be able to update the profile
precisely and we also get higher profiling overhead.

So I think decoupling gcov and profile feedback and pushing profile feedback
further back in the queue is going to be a win.

Yes, optimization must match, but with LTO this is not a problem, and in general
the early optimizations should be stable wrt memory layout (nothing else
changes).  This used to be exercised before profiling was moved to the tree
level in 4.x.

I would be very interested in the low overhead support - there is a lot to gain,
especially because the profiling results are less dependent on setup and can be
better reused.  I know part of the code was contributed (the support for reading
profiles that are not 100% valid).  Is there any extra info available on this?

The main problem IMO is how to get the profile into WHOPR without having function
bodies.  I guess we will end up summarizing the info in a WHOPR-friendly way and
letting it stream the other counters to LTRANS, which will annotate the function
body once it is read in from the file.
> 
> 
> >
> >> 2) comdat function resolution -- since LIPO uses aux module functions
> >> for inlining purpose only, it has the freedom to choose which copy to
> >> use. The current scheme chooses copy in current module with priority
> >> for better profile data context sensitivity (see below)
> >
> > This is interesting.  How do you solve the problem when given comdat function
> > "loses"? I.e. it is replaced at linktime by other function that may or may
> > not be profiled from other unit?
> 
> Whatever function that is selected will have profile data (assuming it
> called at runtime) -- but the profile data are merged from different
> contexts including from calls in different modules.   For instance,
> both a.C and b.C define foo. and b.C:foo is selected at runtime, and
> a.C:foo is not inlined (after instrumentation) anywhere in a.C, then
> a.C:foo won't have any profile data, and b.C:foo has merged profile
> data resulting from calls in both a.C and b.C.

Yes, but this is what I am concerned about.  Without LTO, at least when
compiling a.C with profile feedback, we will have foo with 0 counts.
We might however work out that calls of foo are frequent and decide to
inline foo.  We will then take the counts and rescale them, which results in
the inlined foo being optimized for size.
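
A simplified sketch of that rescaling, under the assumption that inlining
scales each basic-block count of the callee by call_count / entry_count.
The function name and signature are hypothetical and this is not GCC's
implementation; it only shows why an unprofiled copy stays at zero however
hot the call site is.

```c
#include <stddef.h>
#include <stdint.h>

/* Scales the basic-block counts of an inlined callee by
   call_count / callee_entry_count (assumed, simplified model). */
void rescale_inlined_counts(uint64_t *bb_counts, size_t n,
                            uint64_t call_count, uint64_t callee_entry_count)
{
    for (size_t i = 0; i < n; i++) {
        if (callee_entry_count == 0)
            bb_counts[i] = 0;  /* no profile for this copy: all blocks look cold */
        else
            bb_counts[i] = bb_counts[i] * call_count / callee_entry_count;
    }
}
```

With a profiled callee the counts shrink or grow in proportion to the call
site; with a zero-count callee the inlined body keeps zero counts, so later
passes see it as never executed.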

When comdats are resolved within LTO, this will not be a big deal, but LTO
still produces comdats that are later resolved against libraries etc., so we
don't solve the problem this way.
At the very least we should be able to figure out that we have a function
that has no profile and do something more sane.

Do you have any idea how common these scenarios are?

Honza

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on  SPEC2000 for x86/x86_64
  2010-04-30 18:32                       ` Jan Hubicka
@ 2010-04-30 20:13                         ` Xinliang David Li
  2010-09-28  0:29                           ` Neil Vachharajani
  0 siblings, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-04-30 20:13 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Steven Bosscher, Vladimir Makarov, gcc.gcc.gnu.org, Raksit Ashok,
	Neil Vachharajani

On Fri, Apr 30, 2010 at 11:12 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> >
>> > Interesting.  My plan for profiling with LTO is to ultimately make it linktime
>> > transform.  This will be more difficult with WHOPR (i.e. instrumenting need
>> > function bodies that are not available at WPA time), but I believe it is
>> > solvable: just assign uids to the edges and do instrumentation at ltrans.  Then
>> > we will save cgraph profile in some easier way so WHOPR can read it in and read
>> > rest of stuff in ltrans.  This would involve shipping the correct profiles for
>> > given function etc so it will be a bit of implementation challenge.
>>
>> This can be tricky -- to maximize FDO benefit, the
>> profile-use/annotation needs to happen early which means
>> instrumentation also needs to happen early (to avoid cfg mismatches).
>
> I don't see much problem in this particular area.
>
> GCC optimization queue is organized in a way that we first do early
> optimizations that all are intended to be simple cleanups without size/speed
> tradeoffs.  Then we do IPA and late optimizations that are both driven by
> profile (estimated or read).
> Profile reading happens early because we use same infrastructure for gcov and
> profile feedback.  This is not giving profile feedback better benefit, quite a
> converse since early passes may not be able to update profile precisely and we
> also get higher profile overhead.
>
> So I think decoupling gcov and profile feedback and pushing profile feedback
> back in queue is going to be win.
>

There are two parts of profile feedback:
1) cfg edge counts annotation.

  For this part, yes, most of the early phases (other than possibly
einline-2) do not need or depend on it, and it can probably be pushed back
(in fact the static/guessed profile pass is later).

2) value profile transformations:

This part may benefit more from being done early -- not only because of
more cleanups, but also due to the requirement for getting a more
precise inline summary.
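
One well-known value profile transformation is promoting an indirect call
to a guarded direct call.  The sketch below shows the before and after shape
in source form; the function names are hypothetical and the real
transformation is done on the compiler's internal representation, not on
source.

```c
typedef int (*handler_fn)(int);

/* Hypothetical targets; hot_target is the one the value profile saw most. */
int hot_target(int x)   { return x + 1; }
int other_target(int x) { return x * 2; }

/* Before: a plain indirect call that the inliner cannot see through. */
int dispatch(handler_fn fn, int x)
{
    return fn(x);
}

/* After: the call is guarded by a comparison against the profiled target.
   The direct call can now be inlined; the indirect call stays as fallback. */
int dispatch_promoted(handler_fn fn, int x)
{
    if (fn == hot_target)
        return hot_target(x);
    return fn(x);
}
```

Because the promoted direct call becomes an inlining candidate, doing this
before the inline summary is computed changes what the inliner sees.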


> Yes, optimization must match, but with LTO this is not problem and in general
> the early optimization should be stable wrt memory layout (nothing else
> changes).  This used to be exercised before profiling was updated to tree
> level in 4.x.


You mean the CFG layout is stable? But ccp, copy_prop, dce, tail recursion
etc. can all change the CFG.

>
> I would be very interested in the low overhead support - there is a lot to gain
> especially because the profiling results are less dependent on setup and can be
> better reused.  I know part of code was contributed (the support for reading not
> 100% valid profiles). Is there any extra info available on this?
>

For profile smoothing, Neil may point to more information.

> Main problem IMO is how to get profile into WHOPR without having function bodies.
> I guess we will end up with summarizing the info in WHOPR friendly way and
> letting it to stream the other counters to LTRANS that will annotate the function
> body once read in from the file.
>>

I am a little lost here :)

>>
>> >
>> >> 2) comdat function resolution -- since LIPO uses aux module functions
>> >> for inlining purpose only, it has the freedom to choose which copy to
>> >> use. The current scheme chooses copy in current module with priority
>> >> for better profile data context sensitivity (see below)
>> >
>> > This is interesting.  How do you solve the problem when a given comdat function
>> > "loses"? I.e. it is replaced at link time by another function that may or may
>> > not be profiled from another unit?
>>
>> Whatever function is selected will have profile data (assuming it
>> is called at runtime) -- but the profile data are merged from different
>> contexts, including from calls in different modules.  For instance, if
>> both a.C and b.C define foo, b.C:foo is selected at runtime, and
>> a.C:foo is not inlined (after instrumentation) anywhere in a.C, then
>> a.C:foo won't have any profile data, and b.C:foo has merged profile
>> data resulting from calls in both a.C and b.C.
>
> Yes, but this is what I am concerned about.  Without LTO at least when
> compiling a.C with profile feedback we will have foo with 0 counts.
> We might however work out that calls of foo are frequent and decide to
> inline foo. We will take the counts and rescale resulting in inlining
> foo optimized for size

Not always ideal though -- scaling does not expose whether foo is hot
or not (the call edge may be cold, but is still worth inlining).

>
> When comdats are resolved within LTO, this will not be a big deal, but LTO
> still produces comdats that are later resolved against libraries etc., so we don't
> solve the problem this way.
> At the very least we should be able to figure out that we have a function
> that has no profile and do something more sane.

You mean LTO does not discard duplicate bodies?  Why?

>
> Do you have any idea how common these scenarios are?

I don't have direct data, but I think it can be common.

Thanks,

David

>
> Honza
>


* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-04-29 21:36       ` Xinliang David Li
@ 2010-05-01  9:36         ` Jan Hubicka
  2010-05-02  7:04           ` Xinliang David Li
  0 siblings, 1 reply; 68+ messages in thread
From: Jan Hubicka @ 2010-05-01  9:36 UTC (permalink / raw)
  To: Xinliang David Li; +Cc: Jan Hubicka, Vladimir Makarov, gcc.gcc.gnu.org

> 
> Vortex needs -fno-strict-aliasing.  It casts between two record types
> with one record being a 'prefix' of another.

So today's runs are complete.  Thanks to Richi, who fixed the ICE in symtab merging
that affected perl and GCC.  With vortex, the problem was that in addition to
needing -fno-strict-aliasing it writes to closed files, which causes an ICE depending
on the particular glibc version.

Comparing http://gcc.opensuse.org/SPEC/CINT/sb-frescobaldi.suse.de-fdo-64-FDO/recent.html
vortex scores 2036 with -O2 -flto and 2438 with -O2 -flto and FDO (so about a 20%
improvement).
http://gcc.opensuse.org/SPEC/CINT/sb-frescobaldi.suse.de-head-64/list.html
has the -O2 runs without LTO at 1859, so about 31% for LTO+FDO and 10% for LTO alone.
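
For reference, the percentages above are plain relative improvements of the
SPEC ratios (higher is better).  A minimal C sketch of the arithmetic, using
only the three vortex scores quoted above; the helper name speedup_percent and
the constant names are made up for illustration:

```c
/* Relative improvement of `score` over `base`, in percent.
   Hypothetical helper; SPEC ratios are "higher is better". */
double speedup_percent(double base, double score)
{
    return (score - base) / base * 100.0;
}

/* vortex scores quoted in the message above */
static const double VORTEX_O2         = 1859.0;  /* -O2, no LTO        */
static const double VORTEX_O2_LTO     = 2036.0;  /* -O2 -flto          */
static const double VORTEX_O2_LTO_FDO = 2438.0;  /* -O2 -flto plus FDO */
```

speedup_percent(VORTEX_O2_LTO, VORTEX_O2_LTO_FDO) is roughly 19.7,
speedup_percent(VORTEX_O2, VORTEX_O2_LTO_FDO) roughly 31.1, and
speedup_percent(VORTEX_O2, VORTEX_O2_LTO) roughly 9.5, which matches the
rounded 20%, 31% and 10%.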

Any idea if it is one of the value transforms or just the edge profile making the
difference?  There are some cases of write-only globals in SPEC that we can constant
propagate with -fwhole-program, but I think that is in parser.

Honza
> 
> David
> 
> 
> 
> > Honza
> >


* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on  SPEC2000 for x86/x86_64
  2010-05-01  9:36         ` Jan Hubicka
@ 2010-05-02  7:04           ` Xinliang David Li
  2010-05-02 13:46             ` Jan Hubicka
  0 siblings, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-05-02  7:04 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Vladimir Makarov, gcc.gcc.gnu.org

On Sat, May 1, 2010 at 2:36 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> Vortex needs -fno-strict-aliasing.  It casts between two record types
>> with one record being a 'prefix' of another.
>
> So today's runs are complete.  Thanks to Richi, who fixed the ICE in symtab merging
> that affected perl and GCC.  With vortex, the problem was that in addition to
> needing -fno-strict-aliasing it writes to closed files, which causes an ICE depending
> on the particular glibc version.
>
> Comparing http://gcc.opensuse.org/SPEC/CINT/sb-frescobaldi.suse.de-fdo-64-FDO/recent.html
> vortex scores 2036 with -O2 -flto and 2438 with -O2 -flto and FDO (so about a 20%
> improvement).
> http://gcc.opensuse.org/SPEC/CINT/sb-frescobaldi.suse.de-head-64/list.html
> has the -O2 runs without LTO at 1859, so about 31% for LTO+FDO and 10% for LTO alone.
>
> Any idea if it is one of the value transforms or just the edge profile making the
> difference?  There are some cases of write-only globals in SPEC that we can constant
> propagate with -fwhole-program, but I think that is in parser.
>

I got the following numbers for O2, FDO, and LIPO: 2351, 2761 (17% over O2), 3448 (24% over FDO).

The FDO improvement over O2 comes from both the edge profile and vpt
(div, rem).  With FDO, one of the important loops in Part_Delete may get
tail duplicated, which helps performance.
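
To make the vpt (div, rem) part concrete: a value profile transformation
specializes an operation for a value that the profile reports as common.  The
following is only a hedged sketch of the idea in C, not what GCC actually
emits; the hot divisor 16 and both function names are assumptions made up for
illustration:

```c
/* Division specialized for one divisor that a (hypothetical) value
   profile reported as dominant.  Dividing by a literal lets the compiler
   use a shift or multiply instead of a hardware divide. */
unsigned div_hot16(unsigned n, unsigned d)
{
    if (d == 16)
        return n / 16;   /* fast path: constant divisor */
    return n / d;        /* fallback: generic division; d must be nonzero */
}

/* Remainder specialized for power-of-two divisors, another case a value
   profile can flag.  d must be nonzero. */
unsigned rem_pow2(unsigned n, unsigned d)
{
    if ((d & (d - 1)) == 0)
        return n & (d - 1);  /* d is a power of two: mask instead of divide */
    return n % d;            /* fallback: generic remainder */
}
```

Both functions return the same result as the plain operator for every nonzero
divisor; only the cost of the common case changes.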

The LIPO improvement mainly comes from cross-module inlining of the hot
functions Mem_GetWord, Mem_GetAddr, and Chunk_ChkGetChunk.

David


> Honza
>>
>> David
>>
>>
>>
>> > Honza
>> >
>


* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-05-02  7:04           ` Xinliang David Li
@ 2010-05-02 13:46             ` Jan Hubicka
  2010-05-03  4:57               ` Xinliang David Li
  0 siblings, 1 reply; 68+ messages in thread
From: Jan Hubicka @ 2010-05-02 13:46 UTC (permalink / raw)
  To: Xinliang David Li; +Cc: Jan Hubicka, Vladimir Makarov, gcc.gcc.gnu.org

> On Sat, May 1, 2010 at 2:36 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >>
> >> Vortex needs -fno-strict-aliasing.  It casts between two record types
> >> with one record being a 'prefix' of another.
> >
> > So today's runs are complete.  Thanks to Richi, who fixed the ICE in symtab merging
> > that affected perl and GCC.  With vortex, the problem was that in addition to
> > needing -fno-strict-aliasing it writes to closed files, which causes an ICE depending
> > on the particular glibc version.
> >
> > Comparing http://gcc.opensuse.org/SPEC/CINT/sb-frescobaldi.suse.de-fdo-64-FDO/recent.html
> > vortex scores 2036 with -O2 -flto and 2438 with -O2 -flto and FDO (so about a 20%
> > improvement).
> > http://gcc.opensuse.org/SPEC/CINT/sb-frescobaldi.suse.de-head-64/list.html
> > has the -O2 runs without LTO at 1859, so about 31% for LTO+FDO and 10% for LTO alone.
> >
> > Any idea if it is one of the value transforms or just the edge profile making the
> > difference?  There are some cases of write-only globals in SPEC that we can constant
> > propagate with -fwhole-program, but I think that is in parser.
> >
> 
> I got the following numbers for O2, FDO, and LIPO: 2351, 2761 (17%), 3448 (24%).
> 
> The FDO improvement over O2 comes from both the edge profile and vpt
> (div, rem).  With FDO, one of the important loops in Part_Delete may get

I see.  I am particularly interested in the div/rem transform.  With LTO such
things are sometimes doable at compile time (by propagating that the divisor is
a known constant value).  We currently do no constant propagation across global
variables except for simple detection of variables that are read-only and
initialized.  It would be possible to be a bit smarter here: look for variables
whose only stores write a single constant value, and then replace the divisor in
every division/rem by that constant, counting on the fact that the value 0 cannot
reach the division.

Is this case detectable at compile time without feedback?

Honza
> tail duplicated, which helps performance.
> 
> The LIPO improvement mainly comes from cross-module inlining of the hot
> functions Mem_GetWord, Mem_GetAddr, and Chunk_ChkGetChunk.
> 
> David
> 
> 
> > Honza
> >>
> >> David
> >>
> >>
> >>
> >> > Honza
> >> >
> >


* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on  SPEC2000 for x86/x86_64
  2010-05-02 13:46             ` Jan Hubicka
@ 2010-05-03  4:57               ` Xinliang David Li
  2010-05-04 18:04                 ` Jan Hubicka
  0 siblings, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-05-03  4:57 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Vladimir Makarov, gcc.gcc.gnu.org

On Sun, May 2, 2010 at 6:45 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> On Sat, May 1, 2010 at 2:36 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> >>
>> >> Vortex needs -fno-strict-aliasing.  It casts between two record types
>> >> with one record being a 'prefix' of another.
>> >
>> > So today's runs are complete.  Thanks to Richi, who fixed the ICE in symtab merging
>> > that affected perl and GCC.  With vortex, the problem was that in addition to
>> > needing -fno-strict-aliasing it writes to closed files, which causes an ICE depending
>> > on the particular glibc version.
>> >
>> > Comparing http://gcc.opensuse.org/SPEC/CINT/sb-frescobaldi.suse.de-fdo-64-FDO/recent.html
>> > vortex scores 2036 with -O2 -flto and 2438 with -O2 -flto and FDO (so about a 20%
>> > improvement).
>> > http://gcc.opensuse.org/SPEC/CINT/sb-frescobaldi.suse.de-head-64/list.html
>> > has the -O2 runs without LTO at 1859, so about 31% for LTO+FDO and 10% for LTO alone.
>> >
>> > Any idea if it is one of the value transforms or just the edge profile making the
>> > difference?  There are some cases of write-only globals in SPEC that we can constant
>> > propagate with -fwhole-program, but I think that is in parser.
>> >
>>
>> I got the following numbers for O2, FDO, and LIPO: 2351, 2761 (17%), 3448 (24%).
>>
>> The FDO improvement over O2 comes from both the edge profile and vpt
>> (div, rem).  With FDO, one of the important loops in Part_Delete may get
>
> I see.  I am particularly interested in the div/rem transform.  With LTO such
> things are sometimes doable at compile time (by propagating that the divisor is
> a known constant value).  We currently do no constant propagation across global
> variables except for simple detection of variables that are read-only and
> initialized.  It would be possible to be a bit smarter here: look for variables
> whose only stores write a single constant value, and then replace the divisor in
> every division/rem by that constant, counting on the fact that the value 0 cannot
> reach the division.
>
> Is this case detectable at compile time without feedback?

That depends.  The following cases exist in vortex:

1) The value is a runtime constant -- it is read from the input file but
never changed -- e.g. QueBug.  Nothing can be done by the compiler in
this case.

2) A global variable written only once in the program, e.g.
StrucAlignment.  The compiler needs to prove that the definition
dominates (interprocedurally) all uses.  Sjeng has similar cases.

3) The simplest case -- a global variable that is only initialized
statically and never written in the program.  The compiler should be
able to recognize it.

4) A local variable with a known set of constant values -- AllocSize --
can be handled by the compiler with the help of static prediction.
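
Roughly, the four cases look like this in C.  This is an illustrative sketch
only, not vortex source: every name and constant below is made up, and the
divisions simply stand in for the div/rem sites discussed above.

```c
/* Case 1: runtime constant.  The value comes from the input, so the
   compiler cannot know it without a value profile. */
static unsigned from_input = 1;
void read_config(unsigned v) { from_input = v; }

/* Case 2: written exactly once.  It is a compile-time constant only if
   the compiler can prove init_once() runs before every use. */
static unsigned written_once;
void init_once(void) { written_once = 8; }

/* Case 3: statically initialized and never written again. */
static unsigned never_written = 16;

unsigned div_case1(unsigned n) { return n / from_input; }
unsigned div_case2(unsigned n) { return n / written_once; }  /* needs init_once() first */
unsigned div_case3(unsigned n) { return n / never_written; }

/* Case 4: a local whose value is one of a small known set of constants. */
unsigned div_case4(unsigned n, int large)
{
    unsigned alloc_size = large ? 64 : 32;
    return n / alloc_size;
}
```

In this sketch div_case3 and div_case4 can be folded without any feedback,
div_case2 needs the interprocedural dominance proof, and div_case1 only
benefits from a profile.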


David

>
> Honza
>> tail duplicated which helps performance.
>>
>> The LIPO improvement mainly comes from cross-module inlining of the hot
>> functions Mem_GetWord, Mem_GetAddr, and Chunk_ChkGetChunk.
>>
>> David
>>
>>
>> > Honza
>> >>
>> >> David
>> >>
>> >>
>> >>
>> >> > Honza
>> >> >
>> >
>


* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-05-03  4:57               ` Xinliang David Li
@ 2010-05-04 18:04                 ` Jan Hubicka
  0 siblings, 0 replies; 68+ messages in thread
From: Jan Hubicka @ 2010-05-04 18:04 UTC (permalink / raw)
  To: Xinliang David Li; +Cc: Jan Hubicka, Vladimir Makarov, gcc.gcc.gnu.org

> On Sun, May 2, 2010 at 6:45 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> That depends. The following cases exist in vortex:
> 
> 1) The value is a runtime constant -- it is read from the input file but
> never changed -- e.g. QueBug.  Nothing can be done by the compiler in
> this case.
> 
> 2) A global variable written only once in the program, e.g.
> StrucAlignment.  The compiler needs to prove that the definition
> dominates (interprocedurally) all uses.  Sjeng has similar cases.
> 
> 3) The simplest case -- a global variable that is only initialized
> statically and never written in the program.  The compiler should be
> able to recognize it.
We handle this in ipa-reference already.
> 
> 4) A local variable with a known set of constant values -- AllocSize --
> can be handled by the compiler with the help of static prediction.

Yep, I was basically interested in whether the vortex case is 1) or not.  I am
thinking about extending my ipa-ref collecting code to collect the list of known
constants a variable is initialized with and feed it into some local optimization
passes (value range profiling is a good case), as well as implementing expansion
of division and modulo into something like:

switch (divisor)
{
  case known_cst1:
    res = a / known_cst1;   /* division by a compile-time constant */
    break;
  case known_cst2:
    res = a / known_cst2;
    break;
...
}

special-casing 0 as an impossible value.  I know this handles parser in
SPEC2000 well, but I am curious whether it is worth the (relatively little)
effort to implement this.

Given how easy it is to get this done, I guess I will do so.
Also, ipa-cp can be extended to treat static vars used in a function as parameters
and propagate across them, which can get some of this with flow sensitivity.
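
For illustration, a compilable C version of the expansion sketched above might
look like the following.  This is a hedged sketch, not planned GCC output: the
function name is hypothetical, the constants 4 and 10 are made-up stand-ins
for known_cst1 and known_cst2, and the default branch is an assumption added
so the function stays correct for divisors outside the known set.

```c
/* Hypothetical set of constants the divisor is known to take
   (e.g. as collected by an IPA pass); 4 and 10 are made-up values. */
enum { KNOWN_CST1 = 4, KNOWN_CST2 = 10 };

int divide_by_known(int a, int divisor)
{
    int res;

    switch (divisor) {
    case KNOWN_CST1:
        res = a / KNOWN_CST1;   /* constant divisor: no hardware divide needed */
        break;
    case KNOWN_CST2:
        res = a / KNOWN_CST2;
        break;
    default:
        /* 0 is assumed impossible here; any other value takes the
           generic path. */
        res = a / divisor;
        break;
    }
    return res;
}
```

If the analysis can prove the set of constants is complete, the default branch
could be dropped altogether, which is where treating 0 as impossible pays off.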

Honza


* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-04-29 23:47                 ` Steven Bosscher
@ 2010-09-28  0:28                   ` Neil Vachharajani
  2010-09-28  1:01                     ` Jack Howarth
  0 siblings, 1 reply; 68+ messages in thread
From: Neil Vachharajani @ 2010-09-28  0:28 UTC (permalink / raw)
  To: Steven Bosscher
  Cc: Jan Hubicka, Xinliang David Li, Vladimir Makarov,
	gcc.gcc.gnu.org, Raksit Ashok

On Thu, Apr 29, 2010 at 4:07 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> 2010/4/30 Jan Hubicka <hubicka@ucw.cz>:
>> Yep, I read that page (and saw some of implementation too).  Just was not able
>> to follow the precise feature set of LIPO (i.e. if it gets better SPEC results
>> than LTO+FDO then why)
>
> OK, that's an interesting question. The first question (if...) is
> something you'll have to try yourself, I suppose :-)
>
> BTW will the CGO presentation about LIPO and sampled FDO be published
> somewhere in the open?

All the slides from CGO are available here:
http://www.cgo.org/cgo2010/talks/

>
> Ciao!
> Steven
>



-- 
Neil Vachharajani
Google
650-214-1804


* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-04-30 20:13                         ` Xinliang David Li
@ 2010-09-28  0:29                           ` Neil Vachharajani
  0 siblings, 0 replies; 68+ messages in thread
From: Neil Vachharajani @ 2010-09-28  0:29 UTC (permalink / raw)
  To: Xinliang David Li
  Cc: Jan Hubicka, Steven Bosscher, Vladimir Makarov, gcc.gcc.gnu.org,
	Raksit Ashok

On Fri, Apr 30, 2010 at 12:07 PM, Xinliang David Li <davidxl@google.com> wrote:
>
> On Fri, Apr 30, 2010 at 11:12 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >> >
> >> > Interesting.  My plan for profiling with LTO is to ultimately make it a link-time
> >> > transform.  This will be more difficult with WHOPR (i.e. instrumenting needs
> >> > function bodies that are not available at WPA time), but I believe it is
> >> > solvable: just assign uids to the edges and do instrumentation at ltrans.  Then
> >> > we will save the cgraph profile in some easier way so WHOPR can read it in and read
> >> > the rest of the stuff in ltrans.  This would involve shipping the correct profiles for
> >> > a given function etc., so it will be a bit of an implementation challenge.
> >>
> >> This can be tricky -- to maximize FDO benefit, the
> >> profile-use/annotation needs to happen early which means
> >> instrumentation also needs to happen early (to avoid cfg mismatches).
> >
> > I don't see much problem in this particular area.
> >
> > The GCC optimization queue is organized in a way that we first do early
> > optimizations that are all intended to be simple cleanups without size/speed
> > tradeoffs.  Then we do IPA and late optimizations that are both driven by
> > profile (estimated or read).
> > Profile reading happens early because we use the same infrastructure for gcov and
> > profile feedback.  This does not give profile feedback any extra benefit; quite the
> > converse, since early passes may not be able to update the profile precisely and we
> > also get higher profile overhead.
> >
> > So I think decoupling gcov and profile feedback and pushing profile feedback
> > back in the queue is going to be a win.
> >
>
> There are two parts to profile feedback:
>
> 1) CFG edge count annotation.
>
>  For this part, yes, most of the early phases (other than possibly
> einline-2) do not need or depend on it, and it can probably be pushed
> back (in fact the static/guessed profile pass runs later).
>
> 2) Value profile transformations.
>
>  This part may benefit more from being done early -- not only because
> of more cleanups, but also because a more precise inline summary
> requires it.
>
>
> > Yes, optimization must match, but with LTO this is not a problem and in general
> > the early optimization should be stable wrt memory layout (nothing else
> > changes).  This used to be exercised before profiling was updated to tree
> > level in 4.x.
>
>
> You mean the CFG layout is stable?  But ccp, copy_prop, dce, tail recursion,
> etc. can all change the CFG.
>
> >
> > I would be very interested in the low overhead support - there is a lot to gain
> > especially because the profiling results are less dependent on setup and can be
> > better reused.  I know part of the code was contributed (the support for reading not
> > 100% valid profiles). Is there any extra info available on this?
> >
>
> For profile smoothing, Neil may point to more information.

Sorry for the *very* delayed response, but some email filters went a bit wild.

Profile smoothing does a good job of taking imprecise profiles and
fixing them up.  This doesn't address the stale profile problem with
GCC instrumentation based FDO profile collection.  There are checks
which completely discard profiles if the function line numbers (IIRC)
do not match.  I have some patches I've been meaning to send upstream
which help ease this restriction (i.e., add the ability to retain more
of a stale profile), but this opens up many bugs which I've been
incrementally squashing throughout the rest of the compiler.

>
> > The main problem IMO is how to get the profile into WHOPR without having function bodies.
> > I guess we will end up summarizing the info in a WHOPR-friendly way and
> > letting it stream the other counters to LTRANS, which will annotate the function
> > body once it is read in from the file.
> >>
>
> I am a little lost here :)
>
> >>
> >> >
> >> >> 2) comdat function resolution -- since LIPO uses aux module functions
> >> >> for inlining purpose only, it has the freedom to choose which copy to
> >> >> use. The current scheme chooses copy in current module with priority
> >> >> for better profile data context sensitivity (see below)
> >> >
> >> > This is interesting.  How do you solve the problem when a given comdat function
> >> > "loses"? I.e. it is replaced at link time by another function that may or may
> >> > not be profiled from another unit?
> >>
> >> Whatever function is selected will have profile data (assuming it
> >> is called at runtime) -- but the profile data are merged from different
> >> contexts, including from calls in different modules.  For instance, if
> >> both a.C and b.C define foo, b.C:foo is selected at runtime, and
> >> a.C:foo is not inlined (after instrumentation) anywhere in a.C, then
> >> a.C:foo won't have any profile data, and b.C:foo has merged profile
> >> data resulting from calls in both a.C and b.C.
> >
> > Yes, but this is what I am concerned about.  Without LTO at least when
> > compiling a.C with profile feedback we will have foo with 0 counts.
> > We might however work out that calls of foo are frequent and decide to
> > inline foo. We will take the counts and rescale resulting in inlining
> > foo optimized for size
>
> Not always ideal though -- scaling does not expose whether foo is hot
> or not (the call edge may be cold, but is still worth inlining).
>
> >
> > When comdats are resolved within LTO, this will not be a big deal, but LTO
> > still produces comdats that are later resolved against libraries etc., so we don't
> > solve the problem this way.
> > At the very least we should be able to figure out that we have a function
> > that has no profile and do something more sane.
>
> You mean LTO does not discard duplicate bodies?  Why?
>
> >
> > Do you have any idea how common these scenarios are?
>
> I don't have direct data, but I think it can be common.
>
> Thanks,
>
> David
>
> >
> > Honza
> >



--
Neil Vachharajani
Google
650-214-1804


* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-09-28  0:28                   ` Neil Vachharajani
@ 2010-09-28  1:01                     ` Jack Howarth
  0 siblings, 0 replies; 68+ messages in thread
From: Jack Howarth @ 2010-09-28  1:01 UTC (permalink / raw)
  To: Neil Vachharajani
  Cc: Steven Bosscher, Jan Hubicka, Xinliang David Li,
	Vladimir Makarov, gcc.gcc.gnu.org, Raksit Ashok

On Mon, Sep 27, 2010 at 11:04:10AM -0700, Neil Vachharajani wrote:
> On Thu, Apr 29, 2010 at 4:07 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> > 2010/4/30 Jan Hubicka <hubicka@ucw.cz>:
> >> Yep, I read that page (and saw some of implementation too).  Just was not able
> >> to follow the precise feature set of LIPO (i.e. if it gets better SPEC results
> >> than LTO+FDO then why)
> >
> > OK, that's an interesting question. The first question (if...) is
> > something you'll have to try yourself, I suppose :-)
> >
> > BTW will the CGO presentation about LIPO and sampled FDO be published
> > somewhere in the open?
> 
> All the slides from CGO are available here:
> http://www.cgo.org/cgo2010/talks/
> 
> >
> > Ciao!
> > Steven
> >
> 

FYI, my recent Polyhedron 2008 benchmark runs for llvm-gcc-4.2 2.8rc2 on
x86_64-apple-darwin10 indicate that there are some significant performance
regressions between 2.7 and 2.8.

http://lists.cs.uiuc.edu/pipermail/llvmdev/2010-September/034780.html

> 
> 
> -- 
> Neil Vachharajani
> Google
> 650-214-1804


* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-04-29 16:29 GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64 Vladimir Makarov
                   ` (2 preceding siblings ...)
  2010-04-29 22:42 ` Jack Howarth
@ 2010-11-13 23:15 ` Xinliang David Li
  2010-11-14 14:48   ` Paolo Bonzini
                     ` (2 more replies)
  3 siblings, 3 replies; 68+ messages in thread
From: Xinliang David Li @ 2010-11-13 23:15 UTC (permalink / raw)
  To: Vladimir Makarov; +Cc: gcc.gcc.gnu.org

I re-measured the performance difference using trunk gcc and trunk
clang/llvm on a Core 2 box.  -fno-strict-aliasing is added to gcc
because clang/llvm's type-based aliasing is incomplete and not
enabled by default.  I also added -fomit-frame-pointer to clang/llvm as
this is gcc's default.  The base option is -O2.

32bit:

           benchmark          clang/llvm                 gcc  gcc vs. llvm
            164.gzip                1210                1239      2.44%
             175.vpr                1662                1621     -2.42%
             181.mcf                2733                3109     13.75%
          186.crafty                1812                1721     -5.00%
          197.parser                1328                1289     -2.92%
         253.perlbmk                2086                2580     23.67%
             254.gap                1968                1912     -2.86%
          255.vortex                1842                1965      6.66%
           256.bzip2                1440                1553      7.82%
           300.twolf                2284                2213     -3.08%


64bit:

           benchmark          clang/llvm                 gcc  gcc vs. llvm
            164.gzip                1268                1320      4.15%
             175.vpr                1605                1534     -4.42%
             176.gcc                2203                2315      5.08%
             181.mcf                1625                1737      6.85%
          186.crafty                2411                2307     -4.30%
          197.parser                1173                1166     -0.57%
             252.eon                2245                2464      9.72%
         253.perlbmk                2214                2444     10.37%
             254.gap                1987                1978     -0.47%
          255.vortex                2497                2422     -3.00%
           256.bzip2                1585                1740      9.80%
           300.twolf                2294                2281     -0.58%


Though gcc leads LLVM in performance overall, there are a couple of
benchmarks where gcc is worse: vpr and crafty (64bit and 32bit), parser and
twolf (32bit), vortex (64bit).  This needs to be triaged.  gcc
miscompiles 176.gcc and 252.eon in 32bit -- is there a bug tracking the
problem?

Thanks,

David


On Thu, Apr 29, 2010 at 9:25 AM, Vladimir Makarov <vmakarov@redhat.com> wrote:
>  GCC-4.5.0 and LLVM-2.7 were released recently.  To understand
> where we stand after releasing GCC-4.5.0 I benchmarked it on SPEC2000
> for x86/x86-64 and posted the comparison of it with the
> previous GCC releases and LLVM-2.7.
>
>  Even benchmarking SPEC2000 takes a lot of time on the fastest
> machine I have. So I don't plan to use SPEC2006 for this in near
> future.
>
>  You can find the comparison on
> http://vmakarov.fedorapeople.org/spec/ (please just click links at the
> bottom of the left frame starting with link "GCC release comparison").
>
>  If you need exact numbers, please use the tables (the links to them
> are also given) which were used to generate the corresponding bar
> graphs.
>
>
>  In general GCC-4.5.0 became faster (up to 10%) in -O2 mode.  This is
> the first considerable compilation speed improvement since GCC-4.2.
> GCC-4.5.0 also generates better code (1-2% on average, up to 4% for x86-64
> SPECFP2000 in -O2 mode) in comparison with the previous
> release.  That is not including LTO and Graphite, which can give even
> more (especially LTO) in many cases.
>
>  GCC-4.5.0 has new big optimizations LTO and Graphite (more
> accurately graphite was introduced in the previous release).
> Therefore I ran additional benchmarks to test them.
>
>  LTO is a promising technology especially for integer benchmarks for
> which it results in smaller and faster code.  But it might result in
> degradations too on SPECFP2000 mainly because of big degradations on a
> few benchmarks like wupwise or facerec.  Another annoying thing about
> LTO, it considerably slows down the compiler.
>
>  Currently Graphite gives small improvements on x86 (one exception is
> 2% for peak x86 SPECFP2000) and mostly degradation on x86_64 (with
> maximum one more than 10% for SPECFP2000 because of big degradations
> on mgrid and swim).  So further work is needed on the project because
> it seems not mature yet.
>
>  As for LLVM, LLVM became slower (e.g. in comparison with llvm-2.5 by
> 15%-50% for x86-64).  So the gap between the compilation speed of GCC and
> LLVM decreased and sometimes reaches 4% on x86_64 and 8% on x86 (both
> for SPECInt2000 in -O2 mode).  Maybe I am wrong, but I don't think
> CLANG will improve this situation significantly (in -O2 and -O3 mode)
> because optimizations still take most of the time of any serious
> optimizing compiler.
>
>  LLVM made progress in code performance, especially for floating
> point benchmarks.  But the gap between LLVM-2.7 and GCC-4.5 in peak
> performance (not including GCC LTO and Graphite) is still 6-7% on
> SPECInt2000 and 13-17% on SPECFP2000.
>
>  In general, IMHO GCC-4.5.0 is a good and promising release.
>
>


* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-13 23:15 ` Xinliang David Li
@ 2010-11-14 14:48   ` Paolo Bonzini
  2010-11-14 15:43     ` Xinliang David Li
  2010-11-14 21:12   ` H.J. Lu
  2010-11-15 15:49   ` Andrey Belevantsev
  2 siblings, 1 reply; 68+ messages in thread
From: Paolo Bonzini @ 2010-11-14 14:48 UTC (permalink / raw)
  To: Xinliang David Li; +Cc: Vladimir Makarov, gcc.gcc.gnu.org

On 11/13/2010 10:08 PM, Xinliang David Li wrote:
> Though gcc leads LLVM in performance overall, there are a couple of
> benchmarks where gcc is worse: vpr and crafty (64bit and 32bit), parser and
> twolf (32bit), vortex (64bit).  This needs to be triaged.  gcc
> miscompiles 176.gcc and 252.eon in 32bit -- is there a bug tracking the
> problem?

Have you tried -ffast-math or -mfpmath=sse for eon?

Paolo


* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-14 14:48   ` Paolo Bonzini
@ 2010-11-14 15:43     ` Xinliang David Li
  0 siblings, 0 replies; 68+ messages in thread
From: Xinliang David Li @ 2010-11-14 15:43 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Vladimir Makarov, gcc.gcc.gnu.org

On Sat, Nov 13, 2010 at 2:39 PM, Paolo Bonzini <bonzini@gnu.org> wrote:
> On 11/13/2010 10:08 PM, Xinliang David Li wrote:
>>
>> Though gcc leads LLVM in performance overall, there are a couple of
>> benchmarks where gcc is worse: vpr and crafty (64bit and 32bit), parser and
>> twolf (32bit), vortex (64bit).  This needs to be triaged.  gcc
>> miscompiles 176.gcc and 252.eon in 32bit -- is there a bug tracking the
>> problem?
>
> Have you tried -ffast-math or -mfpmath=sse for eon?
>

-ffast-math is used on eon.

David

> Paolo
>


* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-13 23:15 ` Xinliang David Li
  2010-11-14 14:48   ` Paolo Bonzini
@ 2010-11-14 21:12   ` H.J. Lu
  2010-11-15  9:29     ` Xinliang David Li
  2010-11-15 15:49   ` Andrey Belevantsev
  2 siblings, 1 reply; 68+ messages in thread
From: H.J. Lu @ 2010-11-14 21:12 UTC (permalink / raw)
  To: Xinliang David Li; +Cc: Vladimir Makarov, gcc.gcc.gnu.org

On Sat, Nov 13, 2010 at 1:08 PM, Xinliang David Li <davidxl@google.com> wrote:
>
> Though gcc leads LLVM in performance overall, there are a couple of
> benchmarks where gcc is worse: vpr and crafty (64bit and 32bit), parser and
> twolf (32bit), vortex (64bit).  This needs to be triaged.  gcc
> miscompiles 176.gcc and 252.eon in 32bit -- is there a bug tracking the
> problem?
>

GCC trunk compiles and runs SPEC CPU 2K correctly at
-O2 and -O3 for both 32bit and 64bit on x86:

http://gcc.gnu.org/ml/gcc-testresults/2010-11/msg00977.html
http://gcc.gnu.org/ml/gcc-testresults/2010-11/msg00983.html

You need an alternate source for eon.  I use:

252.eon=default=default=default:
CXXPORTABILITY = -DHAS_ERRLIST
EXTRA_CXXFLAGS=-ffast-math -mpc64
EXTRA_LDFLAGS = -ffast-math -mpc64
srcalt=gcc43

176.gcc=default=default=default:
CPORTABILITY  = -Dalloca=_alloca


-- 
H.J.


* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-14 21:12   ` H.J. Lu
@ 2010-11-15  9:29     ` Xinliang David Li
  0 siblings, 0 replies; 68+ messages in thread
From: Xinliang David Li @ 2010-11-15  9:29 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Vladimir Makarov, gcc.gcc.gnu.org

Thanks, this works.

gcc vs llvm

176.gcc: +3.7%
252.eon: +6.1%

David


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-13 23:15 ` Xinliang David Li
  2010-11-14 14:48   ` Paolo Bonzini
  2010-11-14 21:12   ` H.J. Lu
@ 2010-11-15 15:49   ` Andrey Belevantsev
  2010-11-15 17:41     ` Xinliang David Li
  2 siblings, 1 reply; 68+ messages in thread
From: Andrey Belevantsev @ 2010-11-15 15:49 UTC (permalink / raw)
  To: Xinliang David Li; +Cc: Vladimir Makarov, gcc.gcc.gnu.org

Hello,

On 14.11.2010 0:08, Xinliang David Li wrote:
> I re-measured the performance difference using trunk gcc and trunk
> clang/llvm on a core-2 box.  -fno-strict-aliasing is added to gcc
> because clang/llvm's type based aliasing is incomplete and not
> enabled by default. I also added -fomit-frame-pointer to clang/llvm as
> this is gcc's default. The base option is -O2.

It would be very interesting to also compare peak numbers, i.e. with LTO
and strict aliasing enabled, as well as -O3 and -ffast-math/-funroll-loops,
similar to Vlad's or OpenSUSE's options.  Can you try to measure these?
Maybe you can also run SPEC2k6, if there are enough machine resources, but
that's probably asking too much...

Andrey

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-15 15:49   ` Andrey Belevantsev
@ 2010-11-15 17:41     ` Xinliang David Li
  2010-11-15 18:31       ` Jan Hubicka
  0 siblings, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-11-15 17:41 UTC (permalink / raw)
  To: Andrey Belevantsev; +Cc: Vladimir Makarov, gcc.gcc.gnu.org

For peak, FDO is the most effective option. It can boost performance
by 7-10% depending on the program. The options you suggested probably
won't make too big a difference.  -funroll-loops can hurt performance
without profiling.  More aggressive inlining, ipa-cp, unswitching, etc.
enabled by O3 may help a little, if at all. -ffast-math won't
help for integer benchmarks other than eon.  Traditionally, O3 helps
FP performance because of the loop transformations it enables, but this
won't be the case for gcc for now.

Thanks,

David


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-15 17:41     ` Xinliang David Li
@ 2010-11-15 18:31       ` Jan Hubicka
  2010-11-15 22:25         ` Richard Guenther
  2010-11-15 22:47         ` Xinliang David Li
  0 siblings, 2 replies; 68+ messages in thread
From: Jan Hubicka @ 2010-11-15 18:31 UTC (permalink / raw)
  To: Xinliang David Li; +Cc: Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org

> For peak, FDO is the most effective option. It can boost performance
> by 7-10% depending on the program. The options you suggested probably
> won't make too big a dent.  -funroll-loops can hurt performance
> without profiling.  More aggressive inlining, ipa-cp, unswitching etc

-funroll-loops overall was a 2.2% win on SPECint, -funroll-all-loops 2.5%, the last
time I noted down the SPECint results of this (that was in 2003, heh :)
http://www.ucw.cz/~hubicka/papers/amd64/node4.html

> enabled by O3 may help a little if there is any. -ffast-math won't
> help for integer benchmarks other than eon.  Traditionally, O3 helps
> FP performance because of the loop transformation enabled, but this
> won't be the case for gcc for now.

Function inlining definitely helps. -O3 also implies vectorization and other stuff.

Honza

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-15 18:31       ` Jan Hubicka
@ 2010-11-15 22:25         ` Richard Guenther
  2010-11-15 22:47         ` Xinliang David Li
  1 sibling, 0 replies; 68+ messages in thread
From: Richard Guenther @ 2010-11-15 22:25 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Xinliang David Li, Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org

2010/11/15 Jan Hubicka <hubicka@ucw.cz>:
>> For peak, FDO is the most effective option. It can boost performance
>> by 7-10% depending on the program. The options you suggested probably
>> won't make too big a dent.  -funroll-loops can hurt performance
>> without profiling.  More aggressive inlining, ipa-cp, unswitching etc
>
> -funroll-loops overall was a 2.2% win on SPECint, -funroll-all-loops 2.5%, the last
> time I noted down the SPECint results of this (that was in 2003, heh :)
> http://www.ucw.cz/~hubicka/papers/amd64/node4.html
>
>> enabled by O3 may help a little if there is any. -ffast-math won't
>> help for integer benchmarks other than eon.  Traditionally, O3 helps
>> FP performance because of the loop transformation enabled, but this
>> won't be the case for gcc for now.
>
> Function inlining definitely helps. -O3 also implies vectorization and other stuff.

Indeed.  You can look at the various testers at gcc.opensuse.org which compare
-O2 vs. -O3 but also -O3 vs. -O3 -funroll-loops (and other things) to get an
idea of what helps and what does not.

Richard.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-15 18:31       ` Jan Hubicka
  2010-11-15 22:25         ` Richard Guenther
@ 2010-11-15 22:47         ` Xinliang David Li
  2010-11-15 23:06           ` Jan Hubicka
  1 sibling, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-11-15 22:47 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org

I did some measurements (64-bit).

Experiment 1: -O2 -funroll-loops vs -O2

It improves performance (geomean) by 0.56%, which is not much:
           benchmark                  O2   O2 -funroll-loops     change
            164.gzip                1324                1331      0.56%
             175.vpr                1694                1605     -5.24%
             176.gcc                2293                2350      2.47%
             181.mcf                1772                1788      0.90%
          186.crafty                2320                2326      0.26%
          197.parser                1166                1162     -0.32%
             252.eon                2443                2529      3.50%
         253.perlbmk                2410                2460      2.07%
             254.gap                1987                2019      1.58%
          255.vortex                2392                2406      0.58%
           256.bzip2                1719                1715     -0.25%
           300.twolf                2288                2308      0.88%


Experiment 2: -O3 vs -O2

The improvement on SPEC2k is larger than on the large internal programs
tested -- geomean 2.38%.

           benchmark                  O2                  O3     change
            164.gzip                1324                1329      0.40%
             175.vpr                1694                1700      0.31%
             176.gcc                2293                2336      1.89%
             181.mcf                1772                1739     -1.81%
          186.crafty                2320                2323      0.14%
          197.parser                1166                1252      7.39%
             252.eon                2443                2645      8.23%
         253.perlbmk                2410                2452      1.74%
             254.gap                1987                2020      1.62%
          255.vortex                2392                2473      3.39%
           256.bzip2                1719                1766      2.74%
           300.twolf                2288                2350      2.70%

Experiment 3: O2 LTO vs O2: geomean 0.72%
           benchmark                  O2              O2 LTO     change
            164.gzip                1324                1317     -0.53%
             175.vpr                1694                1697      0.18%
             176.gcc                2293                2291     -0.08%
             181.mcf                1772                1760     -0.65%
          186.crafty                2320                2245     -3.26%
          197.parser                1166                1163     -0.29%
             252.eon                2443                2576      5.44%
         253.perlbmk                2410                2433      0.93%
             254.gap                1987                1995      0.36%
          255.vortex                2392                2588      8.19%
           256.bzip2                1719                1729      0.56%
           300.twolf                2288                2248     -1.77%
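
The geomean figures can be recomputed from the per-benchmark scores. The
following is a minimal Python sketch, assuming the two numeric columns are
SPEC ratios where higher is better; the function name and the SCORES list are
made up for illustration, and the numbers are copied from the O2 LTO table
above.

```python
import math

# (benchmark, O2 score, O2 LTO score), copied from the Experiment 3 table.
SCORES = [
    ("164.gzip", 1324, 1317),
    ("175.vpr", 1694, 1697),
    ("176.gcc", 2293, 2291),
    ("181.mcf", 1772, 1760),
    ("186.crafty", 2320, 2245),
    ("197.parser", 1166, 1163),
    ("252.eon", 2443, 2576),
    ("253.perlbmk", 2410, 2433),
    ("254.gap", 1987, 1995),
    ("255.vortex", 2392, 2588),
    ("256.bzip2", 1719, 1729),
    ("300.twolf", 2288, 2248),
]


def geomean_gain_percent(rows):
    """Geometric mean of the new/base score ratios, as a percentage gain."""
    # Averaging the logs of the ratios and exponentiating gives the geomean.
    log_sum = sum(math.log(new / base) for _, base, new in rows)
    return (math.exp(log_sum / len(rows)) - 1.0) * 100.0


if __name__ == "__main__":
    print(f"geomean gain: {geomean_gain_percent(SCORES):.2f}%")
```

Because the scores in the table are rounded, the result should land close to,
but not necessarily exactly on, the 0.72% quoted above. The same function can
be fed the other tables by swapping in their columns.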


David



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-15 22:47         ` Xinliang David Li
@ 2010-11-15 23:06           ` Jan Hubicka
  2010-11-16  0:41             ` Xinliang David Li
  0 siblings, 1 reply; 68+ messages in thread
From: Jan Hubicka @ 2010-11-15 23:06 UTC (permalink / raw)
  To: Xinliang David Li
  Cc: Jan Hubicka, Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org

> I did some measurement (64bit).
> 
> Experiment 1:
> 
> -O2 -funroll-loops vs -O2
> 
> It improves performance (geomean) by 0.56%, not too much:
>                                          O2                 O2 unroll-loops
>             164.gzip                1324                1331      0.56%
>              175.vpr                1694                1605     -5.24%
>              176.gcc                2293                2350      2.47%
>              181.mcf                1772                1788      0.90%
>           186.crafty                2320                2326      0.26%
>           197.parser                1166                1162     -0.32%
>              252.eon                2443                2529      3.50%
>          253.perlbmk                2410                2460      2.07%
>              254.gap                1987                2019      1.58%
>           255.vortex                2392                2406      0.58%
>            256.bzip2                1719                1715     -0.25%
>            300.twolf                2288                2308      0.88%

Can you also try -funroll-all-loops?  For pretty small programs, like
spec2k, -funroll-all-loops is often a win, since we can work out the
number of iterations for only a few loops.

> 
> Experiment 3:    O2 lto vs O2:    geomean 0.72%
>                                         O2                   O2 LTO
>            164.gzip                1324                1317     -0.53%
>              175.vpr                1694                1697      0.18%
>              176.gcc                2293                2291     -0.08%
>              181.mcf                1772                1760     -0.65%
>           186.crafty                2320                2245     -3.26%
>           197.parser                1166                1163     -0.29%
>              252.eon                2443                2576      5.44%
>          253.perlbmk                2410                2433      0.93%
>              254.gap                1987                1995      0.36%
>           255.vortex                2392                2588      8.19%
>            256.bzip2                1719                1729      0.56%
>            300.twolf                2288                2248     -1.77%

You need -O3 -fwhole-program -flto for reasonable cross-module inlining to happen.
-fwhole-program is quite essential to get a reasonable win from LTO (w/o profile feedback).

At least our nightly tester then gets quite nice improvements on a few benchmarks in spec2k,
see also my gccsummit slides.

Honza

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-15 23:06           ` Jan Hubicka
@ 2010-11-16  0:41             ` Xinliang David Li
  2010-11-16  0:53               ` Xinliang David Li
  0 siblings, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-11-16  0:41 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org

Just measured: LTO + O3 improves over O2 by a decent 4.8% geomean. More
data will come later.

           benchmark                  O2            O3 + LTO     change
            164.gzip                1324                1322     -0.10%
             175.vpr                1694                1703      0.51%
             176.gcc                2293                2347      2.34%
             181.mcf                1772                1797      1.43%
          186.crafty                2320                2486      7.12%
          197.parser                1166                1236      6.02%
             252.eon                2443                2810     14.98%
         253.perlbmk                2410                2407     -0.16%
             254.gap                1987                2024      1.82%
          255.vortex                2392                2826     18.13%
           256.bzip2                1719                1760      2.38%
           300.twolf                2288                2394      4.63%


David



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-16  0:41             ` Xinliang David Li
@ 2010-11-16  0:53               ` Xinliang David Li
  2010-11-16  1:02                 ` Jan Hubicka
  0 siblings, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-11-16  0:53 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org

This means O3-level inlining should also be turned on by default for LTO
builds, as -O2 LTO performance is too unimpressive.

David


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-16  0:53               ` Xinliang David Li
@ 2010-11-16  1:02                 ` Jan Hubicka
  2010-11-16  1:19                   ` Jan Hubicka
  2010-11-16  1:24                   ` Xinliang David Li
  0 siblings, 2 replies; 68+ messages in thread
From: Jan Hubicka @ 2010-11-16  1:02 UTC (permalink / raw)
  To: Xinliang David Li
  Cc: Jan Hubicka, Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org

> This means O3 level inlining should be turned on also for lto build by
> default -- as -O2 lto performance is too unimpressive.

I am just re-tuning the inliner and hope to get more speedups for smaller
costs than we get right now.  However, I don't think we can reasonably enable it
as it is at LTO with -O2. We sort of declare that -O2 is the level where the
compiler optimizes hard without bloating code size. Automatic inlining bloats a
lot.  Enabling it at -O2 will make developers who care about code size unhappy.

Can you please try -O2 -fwhole-program, too?

Testing Firefox, however, I noticed that enabling inlining with --param
inline-unit-growth=5 gets most of the speedups from inlining at very little cost in
code size (in fact code size gets smaller for Firefox because of better
optimization).  This is sort of logical: when not doing LTO, limiting unit
growth in each separate compilation unit loses, since the inliner has too little
freedom (some units require a lot of unit growth to compile well, while most
units won't need it at all).
When doing LTO, however, the inliner can use the space constraint more reasonably.

I am wondering what to do here - I just found that pushing unit growth down from
30% to 15% hurts some benchmarks (like tramp3d). I guess we will need to make
unit growth depend on unit size somehow: at the moment we bypass the unit growth limit
for very tiny units via the large-unit-insns parameter, but this is not good enough.
For medium-sized units we need growth as big as 30%; for large units we need 5%.
I guess I can either define very-large-unit-growth and very-large-unit-insns
to step down in growth at some point, or define the growth to be a function of 1/size.
Do we know of better alternatives?
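
To make the 1/size option concrete, here is a minimal Python sketch. It is
only an illustration and not how GCC's inliner is implemented: the function
name, the parameter names and the 10,000-instruction knee are hypothetical.
Only the 30% and 5% endpoints come from the numbers above.

```python
def unit_growth_percent(unit_insns, knee_insns=10_000,
                        max_growth=30.0, min_growth=5.0):
    """Allowed inline growth (in percent) for a unit of the given size.

    Hypothetical model: units up to knee_insns get the full max_growth;
    beyond that the budget falls off as 1/size until it hits min_growth.
    """
    if unit_insns <= 0:
        raise ValueError("unit size must be positive")
    if unit_insns <= knee_insns:
        return max_growth
    # Proportional to 1/size, scaled so the curve is continuous at the knee.
    return max(min_growth, max_growth * knee_insns / unit_insns)


if __name__ == "__main__":
    for size in (5_000, 20_000, 60_000, 1_000_000):
        print(size, unit_growth_percent(size))
```

Compared with a single very-large-unit-insns step, a curve like this avoids a
sudden jump in the budget at one threshold, at the price of one more constant
(the knee) to tune.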

Enabling such extensively trimmed-down automatic inlining at -O2 can IMO make
sense if we can prove it makes binaries of about the same size and brings
noticeable speedups.
After all, we want LTO to sell well - most people will probably repeat the
mistake you made and try it at -O2 without -fwhole-program.  The second I am hoping to
fight by enabling -fuse-linker-plugin by default, as discussed at the summit
(that has similar effects to -fwhole-program code-quality-wise, even if the underlying
implementation is different).

Honza

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-16  1:02                 ` Jan Hubicka
@ 2010-11-16  1:19                   ` Jan Hubicka
  2010-11-16  1:24                   ` Xinliang David Li
  1 sibling, 0 replies; 68+ messages in thread
From: Jan Hubicka @ 2010-11-16  1:19 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Xinliang David Li, Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org

> > This means O3 level inlining should be turned on also for lto build by
> > default -- as -O2 lto performance is too unimpressive.
> 
> I am just re-tuning the inliner and hope to get more speedups for smaller
> costs than we get right now.  However, I don't think we can reasonably enable it
> as it is at LTO with -O2. We sort of declare that -O2 is the level where the
> compiler optimizes hard without bloating code size. Automatic inlining bloats a
> lot.  Enabling it at -O2 will make developers who care about code size unhappy.
>
> Can you please try -O2 -fwhole-program, too?

Also, for my code size work, it would be great if you also tracked the sizes of the
stripped binaries in your tests ;)

Honza

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-16  1:02                 ` Jan Hubicka
  2010-11-16  1:19                   ` Jan Hubicka
@ 2010-11-16  1:24                   ` Xinliang David Li
  2010-11-16  1:39                     ` Jan Hubicka
  1 sibling, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-11-16  1:24 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org, Mark Heffernan

On Mon, Nov 15, 2010 at 4:25 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> This means O3 level inlining should be turned on also for lto build by
>> default -- as -O2 lto performance is too unimpressive.
>
> I am just re-tuning the inliner and hope to get more speedups for smaller
> costs than we get right now.  However, I don't think we can reasonably enable it
> as it is at LTO with -O2. We sort of declare that -O2 is the level where the
> compiler optimizes hard without bloating code size. Automatic inlining bloats a
> lot.  Enabling it at -O2 will make developers who care about code size unhappy.

Looks like you want to brand LTO as a size optimization technology
more than a performance one :) -- is that the right promotion for LTO?
If more people care about performance, then the default should be
tuned toward it. For size optimization, use -Os -flto.

>
> Can you, please, try -O2 -fwhole-program, too?

Too many experiments -- but sure, I can do it.

>
> Testing Firefox, however, I noticed that enabling inlining with --param
> inline-unit-growth=5 gets most of the speedups from inlining at very little cost in
> code size (in fact code size gets smaller for Firefox because of better
> optimization).  This is sort of logical: when not doing LTO, limiting unit
> growth in each separate compilation unit loses, since the inliner has too little
> freedom (some units require a lot of unit growth to compile well, while most
> units won't need it at all).

Yes, that is what I call an adaptive budget -- it works better with profiling.

> When doing LTO, however, the inliner can use the space constraint more reasonably.
>

Yes -- a global decision can be made.

> I am wondering what to do here - I just found that pushing unit growth down from
> 30% to 15% hurts some benchmarks (like tramp3d). I guess we will need to make
> unit growth depend on unit size somehow:

yes.

> at the moment we bypass unit growth
> for very tiny units via the large-unit-insns parameter, but this is not good enough.
> For medium sized units we need growths as big as 30%, for large units we need 5%.
> I guess I can either define very-large-unit-growth and very-large-unit-insns
> to jump down in growth at some point, or define the growth to be function of 1/size.
> Do we know of better alternatives?
>

Mark can provide some suggestions -- he has many inliner patches
related to performance/size trade off.

> Enabling such extensively trimmed down automatic inlining at -O2 IMO can make
> sense if we can prove it makes binaries of about same size and brings
> noticeable speedups.
> After all, we want to make LTO selling well - most people will probably repeat
> mistake you did and try it at -O2 without -fwhole-program.  The second I am hoping to
> fight with enabling -fuse-linker-plugin by default as discussed on the summit
> (that has similar effects to -fwhole-program code quality wise even if underlying
> implementation is different).
>

I don't think that is a mistake --  a large percent of people will
likely not (be able to) use -fwhole-program for various reasons -- for
instance shared library build, partially available source, option
limitations etc. It is therefore more (at least equally) important to
sell lto without -fwhole-program.

Thanks,

David

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-16  1:24                   ` Xinliang David Li
@ 2010-11-16  1:39                     ` Jan Hubicka
  2010-11-16  1:45                       ` Xinliang David Li
  0 siblings, 1 reply; 68+ messages in thread
From: Jan Hubicka @ 2010-11-16  1:39 UTC (permalink / raw)
  To: Xinliang David Li
  Cc: Jan Hubicka, Andrey Belevantsev, Vladimir Makarov,
	gcc.gcc.gnu.org, Mark Heffernan

> On Mon, Nov 15, 2010 at 4:25 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >> This means O3 level inlining should be turned on also for lto build by
> >> default -- as -O2 lto performance is too unimpressive.
> >
> > I am just re-tuning the inliner and hope to get more speedups for smaller
> > costs than we get right now.  I however don't think we can reasonably enable it
> > as it is at LTO with -O2. We sort of declare that -O2 is the level where
> > the compiler optimizes hard without bloating code size. Automatic inlining bloats a
> > lot.  Enabling it at -O2 will make developers who care about code size unhappy.
> 
> Looks like you want to brand LTO as a size optimization technology
> more than performance one :) -- is that the right promotion for lto?

No, I don't want to brand it as a size optimization.  I do, however, want -O2 -flto to
be the right setting for compiling the majority of programs in your distro, to give you good
overall system performance.

The size matters here - you don't want to bloat your distro by 20% to get 4% of
performance benefit when everything is in cache since overall performance will
likely degrade. If this was desirable, everyone would be using -O3 already.  In
my tests (without -fwhole-program) -O3 +LTO bloats code even more than -O3
alone.

-Ofast is for those who want maximal performance now and -O3 for those who
don't care about size but are old-fashioned and afraid to use -Ofast ;)

> > at the moment we bypass unit growth
> > for very tiny units via the large-unit-insns parameter, but this is not good enough.
> > For medium sized units we need growths as big as 30%, for large units we need 5%.
> > I guess I can either define very-large-unit-growth and very-large-unit-insns
> > to jump down in growth at some point, or define the growth to be function of 1/size.
> > Do we know of better alternatives?
> >
> 
> Mark can provide some suggestions -- he has many inliner patches
> related to performance/size trade off.

I would definitely be interested to see them, too.
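
The two options quoted above (a second size threshold, or growth as a function
of 1/size) can be sketched as follows.  This is only an illustration under
stated assumptions: the constants and function names are hypothetical and are
not actual GCC parameter values, and the bypass for tiny units follows the
description of large-unit-insns in the quoted text rather than the real
inliner code.

```python
# Hypothetical thresholds for illustration; NOT real GCC defaults.
LARGE_UNIT_INSNS = 10_000          # units smaller than this get a fixed budget base
VERY_LARGE_UNIT_INSNS = 1_000_000  # assumed cut-over point for the step policy
MEDIUM_GROWTH = 0.30               # "medium sized units need growths as big as 30%"
LARGE_GROWTH = 0.05                # "for large units we need 5%"


def step_growth(unit_insns):
    """Option 1: jump from 30% down to 5% at a second size threshold."""
    if unit_insns >= VERY_LARGE_UNIT_INSNS:
        return LARGE_GROWTH
    return MEDIUM_GROWTH


def inverse_growth(unit_insns):
    """Option 2: growth proportional to 1/size, clamped to [5%, 30%]."""
    raw = MEDIUM_GROWTH * LARGE_UNIT_INSNS / max(unit_insns, 1)
    return min(MEDIUM_GROWTH, max(LARGE_GROWTH, raw))


def max_unit_size(unit_insns, growth_fn):
    """Size limit for a unit: tiny units are budgeted as if they were
    LARGE_UNIT_INSNS big, which effectively bypasses the cap for them."""
    base = max(unit_insns, LARGE_UNIT_INSNS)
    return round(base * (1.0 + growth_fn(unit_insns)))


if __name__ == "__main__":
    for size in (1_000, 20_000, 200_000, 2_000_000):
        print(size,
              max_unit_size(size, step_growth),
              max_unit_size(size, inverse_growth))
```

The step policy keeps medium units at the full 30% but changes abruptly at the
threshold; the 1/size policy degrades smoothly but already trims medium units,
which is the kind of trade-off the tramp3d observation above hints at.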
> 
> > Enabling such extensively trimmed down automatic inlining at -O2 IMO can make
> > sense if we can prove it makes binaries of about same size and brings
> > noticeable speedups.
> > After all, we want to make LTO selling well - most people will probably repeat
> > mistake you did and try it at -O2 without -fwhole-program.  The second I am hoping to
> > fight with enabling -fuse-linker-plugin by default as discussed on the summit
> > (that has similar effects to -fwhole-program code quality wise even if underlying
> > implementation is different).
> >
> 
> I don't think that is a mistake --  a large percent of people will
> likely not (be able to) use -fwhole-program for various reasons -- for
> instance shared library build, partially available source, option
> limitations etc. It is therefore more (at least equally) important to
> sell lto without -fwhole-program.

Fortunately the linker plugin solves the problem here, and this is why I want to
have it enabled by default.  GCC can then effectively do -fwhole-program for binaries
(since the linker knows what will be bound elsewhere) and take advantage of
visibility((hidden)) hints for shared libraries in the same way.  Most of the important
shared libraries get visibility((hidden)) right.

It is sad that LTO w/o the linker plugin doesn't give that much benefit.
Ideas are welcome here.

Honza

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-16  1:39                     ` Jan Hubicka
@ 2010-11-16  1:45                       ` Xinliang David Li
  2010-11-16  4:11                         ` Jan Hubicka
  0 siblings, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-11-16  1:45 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org, Mark Heffernan

> Fortunately linker plugin solves the problem here and this is why I want to
> have it by default.  GCC then can do effectively -fwhole-program for binaries
> (since linker knows what will be bound elsewhere) and take advantage of
> visibility((hidden)) hints for shared libraries same way.  Most of important
> shared libraries gets visibility ((hidden)) right.
>
> It is sad that LTO w/o linker plugin doesn't give that much benefits.
> Ideas are welcome here.

Linker feedback will be limited here.  It mostly helps global variable
aliasing (as I remember, only 2/3 SPEC programs benefit from it).
You don't get whole program points-to, whole program mod-ref
(with context sensitivity), or whole program structure layout. The latter
are the real kickers (in terms of SPEC performance), but promoting LTO
with those numbers can be misleading as many programs won't get it.

David


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-16  1:45                       ` Xinliang David Li
@ 2010-11-16  4:11                         ` Jan Hubicka
  2010-11-16  6:56                           ` Xinliang David Li
  0 siblings, 1 reply; 68+ messages in thread
From: Jan Hubicka @ 2010-11-16  4:11 UTC (permalink / raw)
  To: Xinliang David Li
  Cc: Jan Hubicka, Andrey Belevantsev, Vladimir Makarov,
	gcc.gcc.gnu.org, Mark Heffernan

> > Fortunately linker plugin solves the problem here and this is why I want to
> > have it by default.  GCC then can do effectively -fwhole-program for binaries
> > (since linker knows what will be bound elsewhere) and take advantage of
> > visibility((hidden)) hints for shared libraries same way.  Most of important
> > shared libraries gets visibility ((hidden)) right.
> >
> > It is sad that LTO w/o linker plugin doesn't give that much benefits.
> > Ideas are welcome here.
> 
> Linker feedback will be limited here -- mostly global variable
> aliasing (as I remember only 2/3 spec programs benefit from it), it
> helps  You don't get whole program points-to, whole program mod-ref
> (with context sensitivity), whole program structure layout. The latter
> are the real kickers (in terms of SPEC performance), but promoting LTO
> with those numbers can be misleading as many programs won't get it.

Well, I am speaking of our linker plugin here.  What it does is pass GCC
resolution information so it knows which symbols are bound externally. Since
you typically link LTO objects alone or with a small non-LTO part, most symbols are
not bound externally, and thus you effectively get -fwhole-program (-fwhole-program
just declares everything static except for main ()).
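
As a rough illustration of the point above, the toy model below compares the
two ways of deciding which symbols may be made local.  It is a sketch only:
the function names and the example symbols are hypothetical, and it does not
use or reproduce the real linker plugin interface.

```python
def whole_program_locals(lto_defined):
    """-fwhole-program as described above: everything except main
    is treated as static (local to the program)."""
    return set(lto_defined) - {"main"}


def plugin_locals(lto_defined, bound_externally):
    """With resolution information: only symbols that the linker reports
    as bound from outside the LTO objects (plus main) must stay global."""
    return set(lto_defined) - set(bound_externally) - {"main"}


if __name__ == "__main__":
    defined = {"main", "parse", "emit", "on_signal"}

    # Small non-LTO part that references one symbol from the LTO objects.
    print(sorted(plugin_locals(defined, {"on_signal"})))

    # Pure LTO link: nothing is bound externally, so both views agree.
    print(plugin_locals(defined, set()) == whole_program_locals(defined))
```

When nothing outside the LTO objects references them, the two sets coincide,
which is the sense in which the plugin gives "effectively -fwhole-program".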

We don't really do whole program points-to or structure layout. Mod-ref is just
simple ipa-reference code. How do you get context sensitivity on mod/ref?

Honza

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-16  4:11                         ` Jan Hubicka
@ 2010-11-16  6:56                           ` Xinliang David Li
  2010-11-16  8:26                             ` Jan Hubicka
  0 siblings, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-11-16  6:56 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org, Mark Heffernan

On Mon, Nov 15, 2010 at 5:39 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > Fortunately linker plugin solves the problem here and this is why I want to
>> > have it by default.  GCC then can do effectively -fwhole-program for binaries
>> > (since linker knows what will be bound elsewhere) and take advantage of
>> > visibility((hidden)) hints for shared libraries same way.  Most of important
>> > shared libraries gets visibility ((hidden)) right.
>> >
>> > It is sad that LTO w/o linker plugin doesn't give that much benefits.
>> > Ideas are welcome here.
>>
>> Linker feedback will be limited here -- mostly global variable
>> aliasing (as I remember only 2/3 spec programs benefit from it), it
>> helps  You don't get whole program points-to, whole program mod-ref
>> (with context sensitivity), whole program structure layout. The latter
>> are the real kickers (in terms of SPEC performance), but promoting LTO
>> with those numbers can be misleading as many programs won't get it.
>
> Well, I am speaking of our linker plugin here.  What it does is to pass GCC
> resolution information so it knows what symbols are bound externally. Since
> typically you link LTO alone or with small non-LTO part, most of symbols are
> not bound and thus effecitvely you get -fwhole-program (-fwhole-program just
> declare everything static except for main ())
>
> We don't really do whole program points-to or structure layout.

gcc will eventually, right?

> Mod-ref is just
> simple ipa-reference code. How you get context sensitivity on mod/ref?

mod-ref relies on points-to. With context sensitive points-to, you can
also get CS mod-ref -- basically mod-ref info per callsite.
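
A toy illustration of the difference, assuming a hypothetical function
store(p) that writes through its pointer argument and is called from two
sites.  The data layout and function names are made up for this sketch; it is
not how ipa-reference or any GCC pass represents the information.

```python
# Each call site records the callee and what its pointer argument may point to.
CALL_SITES = {
    "call1": {"callee": "store", "arg_points_to": {"a"}},
    "call2": {"callee": "store", "arg_points_to": {"b"}},
}


def context_insensitive_mod(call_sites):
    """One mod set per callee: the union over all of its call sites."""
    per_callee = {}
    for site in call_sites.values():
        per_callee.setdefault(site["callee"], set()).update(site["arg_points_to"])
    return {name: set(per_callee[site["callee"]])
            for name, site in call_sites.items()}


def context_sensitive_mod(call_sites):
    """One mod set per call site, derived from that site's own points-to set."""
    return {name: set(site["arg_points_to"])
            for name, site in call_sites.items()}


if __name__ == "__main__":
    print(context_insensitive_mod(CALL_SITES))  # both sites may modify a and b
    print(context_sensitive_mod(CALL_SITES))    # call1 only a, call2 only b
```

With per-callsite information the optimizer can keep b in a register across
call1, which the merged summary would not allow.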

Thanks,

David

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-16  6:56                           ` Xinliang David Li
@ 2010-11-16  8:26                             ` Jan Hubicka
  2010-11-16  9:00                               ` Xinliang David Li
  2010-11-16 15:43                               ` Richard Guenther
  0 siblings, 2 replies; 68+ messages in thread
From: Jan Hubicka @ 2010-11-16  8:26 UTC (permalink / raw)
  To: Xinliang David Li
  Cc: Jan Hubicka, Andrey Belevantsev, Vladimir Makarov,
	gcc.gcc.gnu.org, Mark Heffernan

> On Mon, Nov 15, 2010 at 5:39 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >> > Fortunately linker plugin solves the problem here and this is why I want to
> >> > have it by default.  GCC then can do effectively -fwhole-program for binaries
> >> > (since linker knows what will be bound elsewhere) and take advantage of
> >> > visibility((hidden)) hints for shared libraries same way.  Most of important
> >> > shared libraries gets visibility ((hidden)) right.
> >> >
> >> > It is sad that LTO w/o linker plugin doesn't give that much benefits.
> >> > Ideas are welcome here.
> >>
> >> Linker feedback will be limited here -- mostly global variable
> >> aliasing (as I remember only 2/3 spec programs benefit from it), it
> >> helps  You don't get whole program points-to, whole program mod-ref
> >> (with context sensitivity), whole program structure layout. The latter
> >> are the real kickers (in terms of SPEC performance), but promoting LTO
> >> with those numbers can be misleading as many programs won't get it.
> >
> > Well, I am speaking of our linker plugin here.  What it does is to pass GCC
> > resolution information so it knows what symbols are bound externally. Since
> > typically you link LTO alone or with small non-LTO part, most of symbols are
> > not bound and thus effecitvely you get -fwhole-program (-fwhole-program just
> > declare everything static except for main ())
> >
> > We don't really do whole program points-to or structure layout.
> 
> gcc will eventually, right?

Sure hope so ;) 
We really need to solve scalability with our IPA points-to and make it
compatible with WHOPR.
> 
> > Mod-ref is just
> > simple ipa-reference code. How you get context sensitivity on mod/ref?
> 
> mod-ref relies on points-to. With context sensitive points-to, you can
> also get CS mod-ref -- basically mod-ref info per callsite.

Ah sure, I was too focused on our current "mod/ref" :)

Honza

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-16  8:26                             ` Jan Hubicka
@ 2010-11-16  9:00                               ` Xinliang David Li
  2010-11-16 14:23                                 ` Xinliang David Li
  2010-11-16 15:43                               ` Richard Guenther
  1 sibling, 1 reply; 68+ messages in thread
From: Xinliang David Li @ 2010-11-16  9:00 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org, Mark Heffernan

More performance data:

-O2 -funroll-all-loops vs -O2: geomean +1.1%

           Benchmark                 -O2            unrolled     Change
            164.gzip                1324                1336      0.94%
             175.vpr                1694                1670     -1.44%
             176.gcc                2293                2353      2.60%
             181.mcf                1772                1793      1.20%
          186.crafty                2320                2300     -0.86%
          197.parser                1166                1171      0.39%
             252.eon                2443                2515      2.93%
         253.perlbmk                2410                2250     -6.66%
             254.gap                1987                2041      2.68%
          255.vortex                2392                2411      0.78%
           256.bzip2                1719                1806      5.08%
           300.twolf                2288                2436      6.44%


-O3 -flto -fwhole-program vs -O2: geomean +6% (-fwhole-program adds ~1%)

            164.gzip                1324                1318     -0.45%
             175.vpr                1694                1717      1.34%
             176.gcc                2293                2359      2.88%
             181.mcf                1772                1772      0.02%
          186.crafty                2320                2526      8.86%
          197.parser                1166                1248      7.04%
             252.eon                2443                2898     18.59%
         253.perlbmk                2410                2323     -3.62%
             254.gap                1987                2039      2.58%
          255.vortex                2392                2918     21.99%
           256.bzip2                1719                1946     13.19%
           300.twolf                2288                2342      2.34%


-O2 -flto -fwhole-program vs -O2: geomean +3.4%, mainly from three
programs: vortex, eon and bzip2.

            164.gzip                1324                1313     -0.82%
             175.vpr                1694                1659     -2.05%
             176.gcc                2293                2300      0.30%
             181.mcf                1772                1781      0.52%
          186.crafty                2320                2327      0.30%
          197.parser                1166                1188      1.92%
             252.eon                2443                2664      9.00%
         253.perlbmk                2410                2470      2.47%
             254.gap                1987                1987     -0.02%
          255.vortex                2392                2883     20.53%
           256.bzip2                1719                1839      7.00%
           300.twolf                2288                2365      3.34%
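
For readers who want to check the geomean figures, here is a small sketch
that recomputes the gain for the -O2 -flto -fwhole-program table directly
above.  The scores are copied from that table; the function names are made up
for this example, and because the table shows rounded scores the result only
approximately matches the quoted +3.4%.

```python
import math

# (benchmark, -O2 score, -O2 -flto -fwhole-program score), from the table above.
SCORES = [
    ("164.gzip", 1324, 1313),
    ("175.vpr", 1694, 1659),
    ("176.gcc", 2293, 2300),
    ("181.mcf", 1772, 1781),
    ("186.crafty", 2320, 2327),
    ("197.parser", 1166, 1188),
    ("252.eon", 2443, 2664),
    ("253.perlbmk", 2410, 2470),
    ("254.gap", 1987, 1987),
    ("255.vortex", 2392, 2883),
    ("256.bzip2", 1719, 1839),
    ("300.twolf", 2288, 2365),
]


def geomean_gain_percent(rows):
    """Geometric mean of the new/base ratios, as a percentage gain."""
    log_sum = sum(math.log(new / base) for _, base, new in rows)
    return (math.exp(log_sum / len(rows)) - 1.0) * 100.0


def gain_without(rows, excluded):
    """Same gain, leaving out the named benchmarks."""
    return geomean_gain_percent([row for row in rows if row[0] not in excluded])


if __name__ == "__main__":
    print(f"all benchmarks: {geomean_gain_percent(SCORES):.2f}%")
    top3 = {"255.vortex", "252.eon", "256.bzip2"}
    print(f"without vortex, eon, bzip2: {gain_without(SCORES, top3):.2f}%")
```

Dropping vortex, eon and bzip2 leaves a gain well under 1%, which is
consistent with the remark that the improvement comes mainly from those three
programs.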


Thanks,

David



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-16  9:00                               ` Xinliang David Li
@ 2010-11-16 14:23                                 ` Xinliang David Li
  2010-11-16 17:10                                   ` Jan Hubicka
  2010-11-18 11:48                                   ` Xinliang David Li
  0 siblings, 2 replies; 68+ messages in thread
From: Xinliang David Li @ 2010-11-16 14:23 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org, Mark Heffernan

More FDO related performance numbers

Experiment 1: trunk gcc -O2 + FDO vs -O2: FDO improves performance
by 5% geomean
Experiment 2: our internal gcc compiler (4.4.3 based, with many local
patches) -O2 + FDO vs -O2 (trunk gcc): FDO improves performance by 6.6%
geomean
Experiment 3: our internal gcc (4.4.3 with local patches) -O2 + LIPO vs
-O2 (trunk gcc): LIPO improves performance by 12%
Experiment 4: trunk gcc -O2 + LTO + -fwhole-program + FDO vs -O2: LTO +
FDO improves performance by 10.8%

1. Trunk gcc FDO vs O2  (5%)

            164.gzip                1324                1302     -1.64%
             175.vpr                1694                1725      1.84%
             176.gcc                2293                2387      4.07%
             181.mcf                1772                1756     -0.88%
          186.crafty                2320                2280     -1.75%
          197.parser                1166                1556     33.42%
             252.eon                2443                2552      4.45%
         253.perlbmk                2410                2586      7.28%
             254.gap                1987                2021      1.71%
          255.vortex                2392                2720     13.71%
           256.bzip2                1719                1717     -0.12%
           300.twolf                2288                2331      1.86%

2. 4.4.3 gcc with local patch FDO vs trunk O2 (6.6%)

            164.gzip                1324                1317     -0.48%
             175.vpr                1694                1758      3.76%
             176.gcc                2293                2472      7.79%
             181.mcf                1772                1730     -2.35%
          186.crafty                2320                2353      1.40%
          197.parser                1166                1652     41.70%
             252.eon                2443                2610      6.82%
         253.perlbmk                2410                2561      6.23%
             254.gap                1987                1987     -0.04%
          255.vortex                2392                2801     17.09%
           256.bzip2                1719                1748      1.68%
           300.twolf                2288                2335      2.04%

3. LIPO  vs trunk O2 (12%)

            164.gzip                1324                1350      1.99%
             175.vpr                1694                1758      3.77%
             176.gcc                2293                2519      9.83%
             181.mcf                1772                1766     -0.33%
          186.crafty                2320                2394      3.16%
          197.parser                1166                1683     44.32%
             252.eon                2443                2879     17.80%
         253.perlbmk                2410                2556      6.04%
             254.gap                1987                2139      7.61%
          255.vortex                2392                3669     53.40%
           256.bzip2                1719                1824      6.09%
           300.twolf                2288                2345      2.49%

4. LTO + -fwhole-program + O2 + FDO vs O2 (10.8%)

            164.gzip                1324                1340      1.25%
             175.vpr                1694                1709      0.87%
             176.gcc                2293                2411      5.13%
             181.mcf                1772                1757     -0.80%
          186.crafty                2320                2566     10.59%
          197.parser                1166                1614     38.44%
             252.eon                2443                2785     13.98%
         253.perlbmk                2410                2618      8.61%
             254.gap                1987                2063      3.81%
          255.vortex                2392                3294     37.69%
           256.bzip2                1719                1956     13.77%
           300.twolf                2288                2404      5.07%


David



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-16  8:26                             ` Jan Hubicka
  2010-11-16  9:00                               ` Xinliang David Li
@ 2010-11-16 15:43                               ` Richard Guenther
  1 sibling, 0 replies; 68+ messages in thread
From: Richard Guenther @ 2010-11-16 15:43 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Xinliang David Li, Andrey Belevantsev, Vladimir Makarov,
	gcc.gcc.gnu.org, Mark Heffernan

2010/11/16 Jan Hubicka <hubicka@ucw.cz>:
>> On Mon, Nov 15, 2010 at 5:39 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> >> > Fortunately linker plugin solves the problem here and this is why I want to
>> >> > have it by default.  GCC then can do effectively -fwhole-program for binaries
>> >> > (since linker knows what will be bound elsewhere) and take advantage of
>> >> > visibility((hidden)) hints for shared libraries same way.  Most of important
>> >> > shared libraries gets visibility ((hidden)) right.
>> >> >
>> >> > It is sad that LTO w/o linker plugin doesn't give that much benefits.
>> >> > Ideas are welcome here.
>> >>
>> >> Linker feedback will be limited here -- mostly global variable
>> >> aliasing (as I remember only 2/3 spec programs benefit from it), it
>> >> helps  You don't get whole program points-to, whole program mod-ref
>> >> (with context sensitivity), whole program structure layout. The latter
>> >> are the real kickers (in terms of SPEC performance), but promoting LTO
>> >> with those numbers can be misleading as many programs won't get it.
>> >
>> > Well, I am speaking of our linker plugin here.  What it does is to pass GCC
>> > resolution information so it knows what symbols are bound externally. Since
>> > typically you link LTO alone or with small non-LTO part, most of symbols are
>> > not bound and thus effecitvely you get -fwhole-program (-fwhole-program just
>> > declare everything static except for main ())
>> >
>> > We don't really do whole program points-to or structure layout.
>>
>> gcc will eventually, right?
>
> Sure hope so ;)
> We really need to solve scalability with our IPA points-to and make it
> compatible with WHOPR.
>>
>> > Mod-ref is just
>> > simple ipa-reference code. How you get context sensitivity on mod/ref?
>>
>> mod-ref relies on points-to. With context sensitive points-to, you can
>> also get CS mod-ref -- basically mod-ref info per callsite.
>
> Ah sure, I was too focused on our current "mod/ref" :)

Btw, IPA-PTA also performs mod/ref analysis (but of course it is
context insensitive).

Richard.

> Honza
>

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-16 14:23                                 ` Xinliang David Li
@ 2010-11-16 17:10                                   ` Jan Hubicka
  2010-11-16 19:11                                     ` Xinliang David Li
  2010-11-18 11:48                                   ` Xinliang David Li
  1 sibling, 1 reply; 68+ messages in thread
From: Jan Hubicka @ 2010-11-16 17:10 UTC (permalink / raw)
  To: Xinliang David Li
  Cc: Jan Hubicka, Andrey Belevantsev, Vladimir Makarov,
	gcc.gcc.gnu.org, Mark Heffernan

> More FDO related performance numbers
> 
> Experiment 1:  trunk gcc O2 + FDO vs O2:      FDO improves performance
> by 5% geomean
> Experiment 2: our internal gcc compiler (4.4.3 based with many local
> patches) O2 + FDO vs O2 (trunk gcc):   FDO improves perf by 6.6%
> geomean
> Experiment 3: our internal gcc (4.4.3 with local patchs) O2 + LIPO vs
> O2 (trunk gcc):  LIPO improves by 12%
> Experiment 4: trunk gcc O2 + LTO + fwhole-program + FDO vs O2:  LTO +
> FDO improves by 10.8%
> 
> 
> 1. Trunk gcc FDO vs O2  (5%)
> 
>             164.gzip                1324                1302     -1.64%
>              175.vpr                1694                1725      1.84%
>              176.gcc                2293                2387      4.07%
>              181.mcf                1772                1756     -0.88%
>           186.crafty                2320                2280     -1.75%
>           197.parser                1166                1556     33.42%
>              252.eon                2443                2552      4.45%
>          253.perlbmk                2410                2586      7.28%
>              254.gap                1987                2021      1.71%
>           255.vortex                2392                2720     13.71%
>            256.bzip2                1719                1717     -0.12%
>            300.twolf                2288                2331      1.86%
> 
> 2. 4.4.3 gcc with local patch FDO vs trunk O2 (6.6%)

Interesting, any idea where this 1.6% is coming from?  I guess this might
also be the reason for that 2% difference in the LIPO results (in general LTO
-fwhole-program + FDO should be stronger, but it is not tuned at all yet).
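
For reference, the "5% geomean" in the quoted Experiment 1 table can be
reproduced from the per-benchmark numbers.  A minimal Python sketch, assuming
the two columns are higher-is-better SPEC scores (trunk O2 first, trunk
O2 + FDO second); the function and variable names are only illustrative:

```python
import math

# (benchmark, trunk -O2 score, trunk -O2 + FDO score), copied from the
# quoted "1. Trunk gcc FDO vs O2" table.  Treating the columns as
# higher-is-better SPEC scores is an assumption.
FDO_VS_O2 = [
    ("164.gzip", 1324, 1302),
    ("175.vpr", 1694, 1725),
    ("176.gcc", 2293, 2387),
    ("181.mcf", 1772, 1756),
    ("186.crafty", 2320, 2280),
    ("197.parser", 1166, 1556),
    ("252.eon", 2443, 2552),
    ("253.perlbmk", 2410, 2586),
    ("254.gap", 1987, 2021),
    ("255.vortex", 2392, 2720),
    ("256.bzip2", 1719, 1717),
    ("300.twolf", 2288, 2331),
]


def geomean_gain_percent(rows):
    """Geometric mean of new/base ratios, expressed as a percent change."""
    log_sum = sum(math.log(new / base) for _, base, new in rows)
    return (math.exp(log_sum / len(rows)) - 1.0) * 100.0


if __name__ == "__main__":
    print(f"geomean gain: {geomean_gain_percent(FDO_VS_O2):.2f}%")
```

The result comes out close to the 5% quoted.  Most of it is carried by
197.parser and 255.vortex, so a geomean difference of a percent or two
between compilers can come down to one or two benchmarks.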

Since the LIPO branch was updated to mainline some time ago, it would be nice
to compare LIPO from the branch with mainline LTO.  I guess a fairer comparison
would be O2+FDO+LTO vs. O2+LIPO, as LIPO makes no whole-program assumptions
at all, right?

Honza

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-16 17:10                                   ` Jan Hubicka
@ 2010-11-16 19:11                                     ` Xinliang David Li
  0 siblings, 0 replies; 68+ messages in thread
From: Xinliang David Li @ 2010-11-16 19:11 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org,
	Mark Heffernan, Raksit Ashok

On Tue, Nov 16, 2010 at 6:35 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> More FDO related performance numbers
>>
>> Experiment 1:  trunk gcc O2 + FDO vs O2:      FDO improves performance
>> by 5% geomean
>> Experiment 2: our internal gcc compiler (4.4.3 based with many local
>> patches) O2 + FDO vs O2 (trunk gcc):   FDO improves perf by 6.6%
>> geomean
>> Experiment 3: our internal gcc (4.4.3 with local patchs) O2 + LIPO vs
>> O2 (trunk gcc):  LIPO improves by 12%
>> Experiment 4: trunk gcc O2 + LTO + fwhole-program + FDO vs O2:  LTO +
>> FDO improves by 10.8%
>>
>>
>> 1. Trunk gcc FDO vs O2  (5%)
>>
>>             164.gzip                1324                1302     -1.64%
>>              175.vpr                1694                1725      1.84%
>>              176.gcc                2293                2387      4.07%
>>              181.mcf                1772                1756     -0.88%
>>           186.crafty                2320                2280     -1.75%
>>           197.parser                1166                1556     33.42%
>>              252.eon                2443                2552      4.45%
>>          253.perlbmk                2410                2586      7.28%
>>              254.gap                1987                2021      1.71%
>>           255.vortex                2392                2720     13.71%
>>            256.bzip2                1719                1717     -0.12%
>>            300.twolf                2288                2331      1.86%
>>
>> 2. 4.4.3 gcc with local patch FDO vs trunk O2 (6.6%)
>
> Interesting, any idea where this 1.6% is coming from?

Probably due to local patches (inliner, lrs, etc) we have, but I have
not studied it.

>  I guess this might
> also be the reason for that 2% difference in the LIPO results (in general LTO
> -fwhole-program + FDO should be stronger, but it is not tuned at all yet).
>
> Since the LIPO branch was updated to mainline some time ago, it would be nice
> to compare LIPO from the branch with mainline LTO.  I guess a fairer comparison
> would be O2+FDO+LTO vs. O2+LIPO, as LIPO makes no whole-program assumptions
> at all, right?

Yes. Raksit maintains the upstream LIPO branch, but it has not been
tuned for performance yet.  We have open-sourced our compiler changes
via Android. It is better to use that if anyone is interested.

Thanks,

David


>
> Honza
>

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-16 14:23                                 ` Xinliang David Li
  2010-11-16 17:10                                   ` Jan Hubicka
@ 2010-11-18 11:48                                   ` Xinliang David Li
  2010-11-18 13:06                                     ` Jan Hubicka
  2010-11-18 13:28                                     ` Jan Hubicka
  1 sibling, 2 replies; 68+ messages in thread
From: Xinliang David Li @ 2010-11-18 11:48 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org, Mark Heffernan

Some text size measurements.

Summary:
1) LTO with -O3 bloats up code considerably;
2) LTO with -O2 reduces text size compared with -O2;
3) The Google 4.4.3-based compiler is really effective in reducing C++
program size, which is where the tuning was focused, as witnessed by
eon in SPEC2k and all C++ apps in SPEC06.


Notes:
  1. -ffunction-sections -Wl,-gc-sections are used in the build.
  2. SPEC06 dealII does not build with trunk GCC because of a parsing
error.  H.J. Lu, what alt source should be used? (It builds fine with
the 4.4.3 compiler.)
  3. xalancbmk and omnetpp do not build with the TOT gcc compiler using
FDO: the compiler ICEs.  Will investigate when there is time.


David

SPEC06 C++ program Data (the first data column is the TOT O2 base number)

1. TOT O3 vs TOT O2 (3.35% total increase)

        471.omnetpp/    853708    867988      1.67%
         450.soplex/    643273    656349      2.03%
      483.xalancbmk/   3634416   3777600      3.94%
           444.namd/    393142    402038      2.26%
          473.astar/    102182    111038      8.67%
            size_sum   5626721   5815013      3.35%
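
The size_sum row in these tables is the change in the summed text size, not
the average of the per-benchmark percentages, so large programs such as
xalancbmk dominate it.  A minimal Python sketch using the table 1 numbers
(the function names are only illustrative, and reading the sizes as bytes of
text section is an assumption):

```python
# (benchmark, TOT -O2 text size, TOT -O3 text size), copied from table 1.
# Interpreting the numbers as bytes of text section is an assumption.
SIZES = [
    ("471.omnetpp", 853708, 867988),
    ("450.soplex", 643273, 656349),
    ("483.xalancbmk", 3634416, 3777600),
    ("444.namd", 393142, 402038),
    ("473.astar", 102182, 111038),
]


def total_change_percent(rows):
    """Percent change of the summed sizes (what the size_sum row reports)."""
    base_total = sum(base for _, base, _ in rows)
    new_total = sum(new for _, _, new in rows)
    return (new_total - base_total) / base_total * 100.0


def mean_of_changes_percent(rows):
    """Unweighted average of the per-benchmark percent changes."""
    changes = [(new - base) / base * 100.0 for _, base, new in rows]
    return sum(changes) / len(changes)


if __name__ == "__main__":
    print(f"size_sum change:       {total_change_percent(SIZES):.2f}%")
    print(f"mean of per-benchmark: {mean_of_changes_percent(SIZES):.2f}%")
```

The unweighted mean comes out higher than the size_sum figure here because
the small 473.astar has the largest relative growth.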

2. TOT LTO+whole program + O3 vs TOT O2 (0.35% total increase)

        471.omnetpp/    853708    937728      9.84%
         450.soplex/    643273    654057      1.68%
      483.xalancbmk/   3634416   3540646     -2.58%
           444.namd/    393142    401318      2.08%
          473.astar/    102182    112538     10.13%
            size_sum   5626721   5646287      0.35%

3. TOT LTO+whole program + O2 vs TOT O2 (8.10% total reduction)

        471.omnetpp/    853708    822868     -3.61%
         450.soplex/    643273    611653     -4.92%
      483.xalancbmk/   3634416   3245157    -10.71%
           444.namd/    393142    391698     -0.37%
          473.astar/    102182     99586     -2.54%
            size_sum   5626721   5170962     -8.10%

4. google 4.4.3 compiler  O2 vs TOT O2 (13.95% total reduction)

        471.omnetpp/    853708    545840    -36.06%
         450.soplex/    643273    374674    -41.76%
      483.xalancbmk/   3634416   3556306     -2.15%
           444.namd/    393142    329897    -16.09%
          473.astar/    102182     35301    -65.45%
            size_sum   5626721   4842018    -13.95%

5. Google 4.4.3 compiler O2 FDO vs TOT O2 (24.81% total reduction)

        471.omnetpp/    853708    514732    -39.71%
         450.soplex/    643273    357426    -44.44%
      483.xalancbmk/   3634416   2985761    -17.85%
           444.namd/    393142    332806    -15.35%
          473.astar/    102182     39797    -61.05%
            size_sum   5626721   4230522    -24.81%

6. Google 4.4.3 compiler O2 LIPO vs TOT O2 (20.86% total reduction)

        471.omnetpp/    853708    559944    -34.41%
         450.soplex/    643273    393399    -38.84%
      483.xalancbmk/   3634416   3126428    -13.98%
           444.namd/    393142    334666    -14.87%
          473.astar/    102182     38749    -62.08%
            size_sum   5626721   4453186    -20.86%



SPEC2k text size data:

1. tot O1 vs tot O2 ( 4.48% total reduction)

          300.twolf/    182884    177223     -3.10%
            181.mcf/     11794     11338     -3.87%
           164.gzip/     36705     34388     -6.31%
         186.crafty/    171663    164898     -3.94%
         255.vortex/    463463    456034     -1.60%
          256.bzip2/     28803     28091     -2.47%
            176.gcc/   1422042   1368365     -3.77%
         197.parser/    103225     96644     -6.38%
        253.perlbmk/    563927    515898     -8.52%
            175.vpr/    139321    134316     -3.59%
            252.eon/    607704    591780     -2.62%
            254.gap/    496262    459593     -7.39%
            size_sum   4227793   4038568     -4.48%


2. tot O3 vs tot O2 : (10.8% total size increase)

          300.twolf/    182884    194620      6.42%
            181.mcf/     11794     13290     12.68%
           164.gzip/     36705     46049     25.46%
         186.crafty/    171663    189892     10.62%
         255.vortex/    463463    495875      6.99%
          256.bzip2/     28803     39939     38.66%
            176.gcc/   1422042   1609786     13.20%
         197.parser/    103225    143558     39.07%
        253.perlbmk/    563927    616855      9.39%
            175.vpr/    139321    147081      5.57%
            252.eon/    607704    625176      2.88%
            254.gap/    496262    563187     13.49%
            size_sum   4227793   4685308     10.82%


3. tot LTO + -fwhole-program + -O2  vs tot O2 : (3.65% total size reduction)

          300.twolf/    182884    176572     -3.45%
            181.mcf/     11794      9594    -18.65%
           164.gzip/     36705     34439     -6.17%
         186.crafty/    171663    173071      0.82%
         255.vortex/    463463    382157    -17.54%
          256.bzip2/     28803     27142     -5.77%
            176.gcc/   1422042   1364796     -4.03%
         197.parser/    103225     94997     -7.97%
        253.perlbmk/    563927    590087      4.64%
            175.vpr/    139321    123572    -11.30%
            252.eon/    607704    606226     -0.24%
            254.gap/    496262    491006     -1.06%
            size_sum   4227793   4073659     -3.65%


4. tot LTO + -fwhole-program + -O3 vs tot O2 : (16.57% total increase)

          300.twolf/    182884    196316      7.34%
            181.mcf/     11794     11402     -3.32%
           164.gzip/     36705     51477     40.25%
         186.crafty/    171663    214700     25.07%
         255.vortex/    463463    462329     -0.24%
          256.bzip2/     28803     34950     21.34%
            176.gcc/   1422042   1724868     21.30%
         197.parser/    103225    124698     20.80%
        253.perlbmk/    563927    729119     29.29%
            175.vpr/    139321    139729      0.29%
            252.eon/    607704    627194      3.21%
            254.gap/    496262    611515     23.22%
            size_sum   4227793   4928297     16.57%

5. tot O2 FDO vs tot O2: (1.15% total increase)

          300.twolf/    182884    178247     -2.54%
            181.mcf/     11794     17370     47.28%
           164.gzip/     36705     42889     16.85%
         186.crafty/    171663    184085      7.24%
         255.vortex/    463463    483428      4.31%
          256.bzip2/     28803     33635     16.78%
            176.gcc/   1422042   1441797      1.39%
         197.parser/    103225    140401     36.01%
        253.perlbmk/    563927    546447     -3.10%
            175.vpr/    139321    147153      5.62%
            252.eon/    607704    572388     -5.81%
            254.gap/    496262    488758     -1.51%
            size_sum   4227793   4276598      1.15%


6. Google local compiler O2 FDO vs tot O2 : (6.33% total increase)

Note the large reduction in the text size of the C++ program (eon), which
is where the size tuning was done.

          300.twolf/    182884    184736      1.01%
            181.mcf/     11794     26560    125.20%
           164.gzip/     36705     48499     32.13%
         186.crafty/    171663    187406      9.17%
         255.vortex/    463463    482090      4.02%
          256.bzip2/     28803     37905     31.60%
            176.gcc/   1422042   1729480     21.62%
         197.parser/    103225    237148    129.74%
        253.perlbmk/    563927    557040     -1.22%
            175.vpr/    139321    153453     10.14%
            252.eon/    607704    312506    -48.58%
            254.gap/    496262    538534      8.52%
            size_sum   4227793   4495357      6.33%

Also for reference, the Google compiler vanilla O2 vs tot O2: a large
reduction in C++ size, but the overall size increases a little (1.97%).

          300.twolf/    182884    207829     13.64%
            181.mcf/     11794     12008      1.81%
           164.gzip/     36705     41528     13.14%
         186.crafty/    171663    177104      3.17%
         255.vortex/    463463    473298      2.12%
          256.bzip2/     28803     37961     31.80%
            176.gcc/   1422042   1592952     12.02%
         197.parser/    103225    139969     35.60%
        253.perlbmk/    563927    598632      6.15%
            175.vpr/    139321    156869     12.60%
            252.eon/    607704    322478    -46.94%
            254.gap/    496262    550451     10.92%
            size_sum   4227793   4311079      1.97%


7. LIPO vs tot O2:  (23.32% total increase)

          300.twolf/    182884    185960      1.68%
            181.mcf/     11794     26544    125.06%
           164.gzip/     36705     54827     49.37%
         186.crafty/    171663    234494     36.60%
         255.vortex/    463463    596394     28.68%
          256.bzip2/     28803     40492     40.58%
            176.gcc/   1422042   2070851     45.63%
         197.parser/    103225    250537    142.71%
        253.perlbmk/    563927    638320     13.19%
            175.vpr/    139321    156117     12.06%
            252.eon/    607704    370949    -38.96%
            254.gap/    496262    588139     18.51%
            size_sum   4227793   5213624     23.32%

8. LTO + whole-program + O2 + FDO vs O2: (0.37% total increase)

          300.twolf/    182884    174919     -4.36%
            181.mcf/     11794     16346     38.60%
           164.gzip/     36705     40743     11.00%
         186.crafty/    171663    197698     15.17%
         255.vortex/    463463    395626    -14.64%
          256.bzip2/     28803     36238     25.81%
            176.gcc/   1422042   1439295      1.21%
         197.parser/    103225    143237     38.76%
        253.perlbmk/    563927    590687      4.75%
            175.vpr/    139321    135276     -2.90%
            252.eon/    607704    585954     -3.58%
            254.gap/    496262    487289     -1.81%
            size_sum   4227793   4243308      0.37%


On Tue, Nov 16, 2010 at 12:26 AM, Xinliang David Li <davidxl@google.com> wrote:
> More FDO related performance numbers
>
> Experiment 1:  trunk gcc O2 + FDO vs O2:      FDO improves performance
> by 5% geomean
> Experiment 2: our internal gcc compiler (4.4.3 based with many local
> patches) O2 + FDO vs O2 (trunk gcc):   FDO improves perf by 6.6%
> geomean
> Experiment 3: our internal gcc (4.4.3 with local patchs) O2 + LIPO vs
> O2 (trunk gcc):  LIPO improves by 12%
> Experiment 4: trunk gcc O2 + LTO + fwhole-program + FDO vs O2:  LTO +
> FDO improves by 10.8%
>
>
> 1. Trunk gcc FDO vs O2  (5%)
>
>            164.gzip                1324                1302     -1.64%
>             175.vpr                1694                1725      1.84%
>             176.gcc                2293                2387      4.07%
>             181.mcf                1772                1756     -0.88%
>          186.crafty                2320                2280     -1.75%
>          197.parser                1166                1556     33.42%
>             252.eon                2443                2552      4.45%
>         253.perlbmk                2410                2586      7.28%
>             254.gap                1987                2021      1.71%
>          255.vortex                2392                2720     13.71%
>           256.bzip2                1719                1717     -0.12%
>           300.twolf                2288                2331      1.86%
>
> 2. 4.4.3 gcc with local patch FDO vs trunk O2 (6.6%)
>
>            164.gzip                1324                1317     -0.48%
>             175.vpr                1694                1758      3.76%
>             176.gcc                2293                2472      7.79%
>             181.mcf                1772                1730     -2.35%
>          186.crafty                2320                2353      1.40%
>          197.parser                1166                1652     41.70%
>             252.eon                2443                2610      6.82%
>         253.perlbmk                2410                2561      6.23%
>             254.gap                1987                1987     -0.04%
>          255.vortex                2392                2801     17.09%
>           256.bzip2                1719                1748      1.68%
>           300.twolf                2288                2335      2.04%
>
> 3. LIPO  vs trunk O2 (12%)
>
>            164.gzip                1324                1350      1.99%
>             175.vpr                1694                1758      3.77%
>             176.gcc                2293                2519      9.83%
>             181.mcf                1772                1766     -0.33%
>          186.crafty                2320                2394      3.16%
>          197.parser                1166                1683     44.32%
>             252.eon                2443                2879     17.80%
>         253.perlbmk                2410                2556      6.04%
>             254.gap                1987                2139      7.61%
>          255.vortex                2392                3669     53.40%
>           256.bzip2                1719                1824      6.09%
>           300.twolf                2288                2345      2.49%
>
> 4. LTO + -fwhole-program + O2 + FDO vs O2 (10.8%)
>
>            164.gzip                1324                1340      1.25%
>             175.vpr                1694                1709      0.87%
>             176.gcc                2293                2411      5.13%
>             181.mcf                1772                1757     -0.80%
>          186.crafty                2320                2566     10.59%
>          197.parser                1166                1614     38.44%
>             252.eon                2443                2785     13.98%
>         253.perlbmk                2410                2618      8.61%
>             254.gap                1987                2063      3.81%
>          255.vortex                2392                3294     37.69%
>           256.bzip2                1719                1956     13.77%
>           300.twolf                2288                2404      5.07%
>
>
> David
>
>
> On Mon, Nov 15, 2010 at 6:18 PM, Xinliang David Li <davidxl@google.com> wrote:
>> More performance data:
>>
>> -O2 -funroll-all-loops vs O2:   +1.1% geomean
>>
>>                                          O2               O2 unroll-all-loops
>>            164.gzip                1324                1336      0.94%
>>             175.vpr                1694                1670     -1.44%
>>             176.gcc                2293                2353      2.60%
>>             181.mcf                1772                1793      1.20%
>>          186.crafty                2320                2300     -0.86%
>>          197.parser                1166                1171      0.39%
>>             252.eon                2443                2515      2.93%
>>         253.perlbmk                2410                2250     -6.66%
>>             254.gap                1987                2041      2.68%
>>          255.vortex                2392                2411      0.78%
>>           256.bzip2                1719                1806      5.08%
>>           300.twolf                2288                2436      6.44%
>>
>>
>> -O3 -flto -fwhole-program vs -O2  : geomean +6%     (-fwhole-program add ~1% )
>>
>>            164.gzip                1324                1318     -0.45%
>>             175.vpr                1694                1717      1.34%
>>             176.gcc                2293                2359      2.88%
>>             181.mcf                1772                1772      0.02%
>>          186.crafty                2320                2526      8.86%
>>          197.parser                1166                1248      7.04%
>>             252.eon                2443                2898     18.59%
>>         253.perlbmk                2410                2323     -3.62%
>>             254.gap                1987                2039      2.58%
>>          255.vortex                2392                2918     21.99%
>>           256.bzip2                1719                1946     13.19%
>>           300.twolf                2288                2342      2.34%
>>
>>
>> -O2 -flto -fwhole-program vs -O2: geomean +3.4% . mainly from three
>> programs: vortex, eon and bzip2.
>>
>>            164.gzip                1324                1313     -0.82%
>>             175.vpr                1694                1659     -2.05%
>>             176.gcc                2293                2300      0.30%
>>             181.mcf                1772                1781      0.52%
>>          186.crafty                2320                2327      0.30%
>>          197.parser                1166                1188      1.92%
>>             252.eon                2443                2664      9.00%
>>         253.perlbmk                2410                2470      2.47%
>>             254.gap                1987                1987     -0.02%
>>          255.vortex                2392                2883     20.53%
>>           256.bzip2                1719                1839      7.00%
>>           300.twolf                2288                2365      3.34%
>>
>>
>> Thanks,
>>
>> David
>>
>>
>> On Mon, Nov 15, 2010 at 5:50 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>>> On Mon, Nov 15, 2010 at 5:39 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>>> >> > Fortunately linker plugin solves the problem here and this is why I want to
>>>> >> > have it by default.  GCC then can do effectively -fwhole-program for binaries
>>>> >> > (since linker knows what will be bound elsewhere) and take advantage of
>>>> >> > visibility((hidden)) hints for shared libraries same way.  Most of important
>>>> >> > shared libraries gets visibility ((hidden)) right.
>>>> >> >
>>>> >> > It is sad that LTO w/o linker plugin doesn't give that much benefits.
>>>> >> > Ideas are welcome here.
>>>> >>
>>>> >> Linker feedback will be limited here -- mostly global variable
>>>> >> aliasing (as I remember only 2/3 spec programs benefit from it), it
>>>> >> helps  You don't get whole program points-to, whole program mod-ref
>>>> >> (with context sensitivity), whole program structure layout. The latter
>>>> >> are the real kickers (in terms of SPEC performance), but promoting LTO
>>>> >> with those numbers can be misleading as many programs won't get it.
>>>> >
>>>> > Well, I am speaking of our linker plugin here.  What it does is to pass GCC
>>>> > resolution information so it knows what symbols are bound externally. Since
>>>> > typically you link LTO alone or with a small non-LTO part, most symbols are
>>>> > not bound and thus effectively you get -fwhole-program (-fwhole-program just
>>>> > declares everything static except for main ())
>>>> >
>>>> > We don't really do whole program points-to or structure layout.
>>>>
>>>> gcc will eventually, right?
>>>
>>> Sure hope so ;)
>>> We really need to solve scalability with our IPA points-to and make it
>>> compatible with WHOPR.
>>>>
>>>> > Mod-ref is just
>>>> > simple ipa-reference code. How you get context sensitivity on mod/ref?
>>>>
>>>> mod-ref relies on points-to. With context sensitive points-to, you can
>>>> also get CS mod-ref -- basically mod-ref info per callsite.
>>>
>>> Ah sure, I was too focused on our current "mod/ref" :)
>>>
>>> Honza
>>>
>>
>

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-18 11:48                                   ` Xinliang David Li
@ 2010-11-18 13:06                                     ` Jan Hubicka
  2010-11-18 17:20                                       ` Xinliang David Li
       [not found]                                       ` <AANLkTinTVN_T06eG1-nxei_Vj999wFT7qKZ55vwW+TtC@mail.gmail.com>
  2010-11-18 13:28                                     ` Jan Hubicka
  1 sibling, 2 replies; 68+ messages in thread
From: Jan Hubicka @ 2010-11-18 13:06 UTC (permalink / raw)
  To: Xinliang David Li
  Cc: Jan Hubicka, Andrey Belevantsev, Vladimir Makarov,
	gcc.gcc.gnu.org, Mark Heffernan

> Some text size measurement.
> 
> Summary:
> 1) LTO with -O3 bloats up code considerably;
Yes, you need either -fwhole-program or -fuse-linker-plugin to make it behave
sanely.

For Mozilla I have the best experience with -fuse-linker-plugin --param
inline-unit-growth=5.  That gives me about 16% code size savings (so LTO -O3 is
the same size as -Os).  This means that it is smaller than
-O2+LTO+-fwhole-program.

> 2) LTO with -O2 reduces text size compared with -O2

And more so with -fwhole-program, as you noticed ;)

> 3) Google 4.4.3 based compiler is really effective in reducing C++
> program size -- this is where the focus of the tuning was done.
> Witnessed by eon in SPEC2k and all C++ apps in SPEC06

Again, I would be very interested to see the patches (and sooner rather than
later, given that I am re-tuning the inliner for 4.6 now). Looking at Mozilla I
also concluded that we have a lot of room for improvement.  How old is the tree
you use for testing?  I recently improved code size somewhat on mainline.

> 
> SPEC06 C++ program Data (the first data column is the TOT O2 base number)
> 
> 1. TOT O3 vs TOT O2 (3.35% total increase)
>
>         471.omnetpp/    853708    867988      1.67%
>          450.soplex/    643273    656349      2.03%
>       483.xalancbmk/   3634416   3777600      3.94%
>            444.namd/    393142    402038      2.26%
>           473.astar/    102182    111038      8.67%
>             size_sum   5626721   5815013      3.35%
> 
> 2. TOT LTO+whole program + O3 vs TOT O2 (0.35% total increase)
> 
>         471.omnetpp/    853708    937728      9.84%
>          450.soplex/    643273    654057      1.68%
>       483.xalancbmk/   3634416   3540646     -2.58%
>            444.namd/    393142    401318      2.08%
>           473.astar/    102182    112538     10.13%
>             size_sum   5626721   5646287      0.35%
> 
> 3. TOT LTO+whole program + O2 vs TOT O2 (8.10% total reduction)
> 
>         471.omnetpp/    853708    822868     -3.61%
>          450.soplex/    643273    611653     -4.92%
>       483.xalancbmk/   3634416   3245157    -10.71%
>            444.namd/    393142    391698     -0.37%
>           473.astar/    102182     99586     -2.54%
>             size_sum   5626721   5170962     -8.10%
> 
> 4. google 4.4.3 compiler  O2 vs TOT O2 (13.95% total reduction)
> 
>         471.omnetpp/    853708    545840    -36.06%
>          450.soplex/    643273    374674    -41.76%
>       483.xalancbmk/   3634416   3556306     -2.15%
>            444.namd/    393142    329897    -16.09%
>           473.astar/    102182     35301    -65.45%
>             size_sum   5626721   4842018    -13.95%

Hmm, this really seems interesting. Why were the changes not contributed during this stage1?

Honza

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-18 11:48                                   ` Xinliang David Li
  2010-11-18 13:06                                     ` Jan Hubicka
@ 2010-11-18 13:28                                     ` Jan Hubicka
  2010-11-18 18:18                                       ` Xinliang David Li
  1 sibling, 1 reply; 68+ messages in thread
From: Jan Hubicka @ 2010-11-18 13:28 UTC (permalink / raw)
  To: Xinliang David Li
  Cc: Jan Hubicka, Andrey Belevantsev, Vladimir Makarov,
	gcc.gcc.gnu.org, Mark Heffernan

Hi,
and for size, could you please also do -Os comparisons?  I am aware that the
-O2 inliner is tuned up somewhat for C++.  This is because we have a C++
benchmark suite that we use to monitor inlining:
http://gcc.opensuse.org/c++bench-frescobaldi/

Programs there are a lot more aggressive on abstraction than whatever SPEC2k
and SPEC2k6 do.  I know I can tune the inliner down for SPEC, but I would get
regressions there...  But given that we can get a 10% difference on a normal
C++ program, it might be interesting to consider some compromise, at least for
-O2.

Honza

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-18 13:06                                     ` Jan Hubicka
@ 2010-11-18 17:20                                       ` Xinliang David Li
       [not found]                                       ` <AANLkTinTVN_T06eG1-nxei_Vj999wFT7qKZ55vwW+TtC@mail.gmail.com>
  1 sibling, 0 replies; 68+ messages in thread
From: Xinliang David Li @ 2010-11-18 17:20 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org, Mark Heffernan

On Thu, Nov 18, 2010 at 3:58 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Some text size measurement.
>>
>> Summary:
>> 1) LTO with -O3 bloats up code considerably;
> Yes, you need either -fwhole-program or -fuse-linker-plugin to make it behave
> sanely.
>
> For Mozilla I have the best experience with -fuse-linker-plugin --param
> inline-unit-growth=5.  That gives me about 16% code size savings (so LTO -O3 is
> the same size as -Os).  This means that it is smaller than
> -O2+LTO+-fwhole-program.
>
>> 2) LTO with -O2 reduces text size compared with -O2
>
> And more so with -fwhole-program, as you noticed ;)

Actually all the LTO experiments were done with -fwhole-program.

>
>> 3) Google 4.4.3 based compiler is really effective in reducing C++
>> program size -- this is where the focus of the tuning was done.
>> Witnessed by eon in SPEC2k and all C++ apps in SPEC06
>
> Again, I would be very interested to see the patches (and sooner rather than
> later, given that I am re-tuning the inliner for 4.6 now). Looking at Mozilla I
> also concluded that we have a lot of room for improvement.  How old is the tree
> you use for testing?  I recently improved code size somewhat on mainline.
>

Yes -- it may take a while, as the code base for our tuning is 4.4.3.

>>
>> SPEC06 C++ program Data (the first data column is the TOT O2 base number)
>>
>> 1. TOT O3 vs TOT O2 ( 3.35% total increase)
>>
>>         471.omnetpp/    853708    867988      1.67%
>>          450.soplex/    643273    656349      2.03%
>>       483.xalancbmk/   3634416   3777600      3.94%
>>            444.namd/    393142    402038      2.26%
>>           473.astar/    102182    111038      8.67%
>>             size_sum   5626721   5815013      3.35%
>>
>> 2. TOT LTO+whole program + O3 vs TOT O2 (0.35% total increase)
>>
>>         471.omnetpp/    853708    937728      9.84%
>>          450.soplex/    643273    654057      1.68%
>>       483.xalancbmk/   3634416   3540646     -2.58%
>>            444.namd/    393142    401318      2.08%
>>           473.astar/    102182    112538     10.13%
>>             size_sum   5626721   5646287      0.35%
>>
>> 3. TOT LTO+whole program + O2 vs TOT O2 (8.10% total reduction)
>>
>>         471.omnetpp/    853708    822868     -3.61%
>>          450.soplex/    643273    611653     -4.92%
>>       483.xalancbmk/   3634416   3245157    -10.71%
>>            444.namd/    393142    391698     -0.37%
>>           473.astar/    102182     99586     -2.54%
>>             size_sum   5626721   5170962     -8.10%
>>
>> 4. google 4.4.3 compiler  O2 vs TOT O2 (13.95% total reduction)
>>
>>         471.omnetpp/    853708    545840    -36.06%
>>          450.soplex/    643273    374674    -41.76%
>>       483.xalancbmk/   3634416   3556306     -2.15%
>>            444.namd/    393142    329897    -16.09%
>>           473.astar/    102182     35301    -65.45%
>>             size_sum   5626721   4842018    -13.95%
>
> Hmm, this really seems interesting. Why were the changes not contributed during this stage1?
>

As you can see, the savings are mainly for C++, and may hurt C --
there is more work needed to make it suitable for upstream.

Thanks,

David


> Honza
>

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-18 13:28                                     ` Jan Hubicka
@ 2010-11-18 18:18                                       ` Xinliang David Li
  0 siblings, 0 replies; 68+ messages in thread
From: Xinliang David Li @ 2010-11-18 18:18 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org, Mark Heffernan

I found an error in my size experiment setup (libstdc++ shared vs
non-shared) -- please discard the size numbers. I will remeasure.

Thanks,

David

On Thu, Nov 18, 2010 at 4:02 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> Hi,
> and for size, could you please also do -Os comparisons?  I am aware that the -O2
> inliner is tuned somewhat up for C++.  This is because we have a C++
> benchmark suite that we use to monitor inlining.
> http://gcc.opensuse.org/c++bench-frescobaldi/
>
> Programs there are a lot more aggressive about abstraction than anything in
> SPEC2k and SPEC2k6.  I know I can tune the inliner down for SPEC, but then I get
> regressions there... But given that we can get a 10% difference on a normal C++
> program, it might be interesting to consider some compromise, at least for -O2.
>
> Honza
>

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
       [not found]                                       ` <AANLkTinTVN_T06eG1-nxei_Vj999wFT7qKZ55vwW+TtC@mail.gmail.com>
@ 2010-11-19  4:22                                         ` Jan Hubicka
  2010-11-19  7:26                                           ` Xinliang David Li
       [not found]                                           ` <AANLkTimvfm24_fvBdkYmqPVjSzcOgy0hx_0mO11AxbeC@mail.gmail.com>
  0 siblings, 2 replies; 68+ messages in thread
From: Jan Hubicka @ 2010-11-19  4:22 UTC (permalink / raw)
  To: Mark Heffernan
  Cc: Jan Hubicka, Xinliang David Li, Andrey Belevantsev,
	Vladimir Makarov, gcc.gcc.gnu.org

Hi,
> I'll get back to you with our local inlining changes.  We're looking to move
> development closer to trunk to reduce this divergence in the future.
> 
> Our tuning was done primarily on big c++ programs.  A significant size
> improvement came from aggressively inlining functions which might be
> eliminated by the linker (garbage collection of uncalled functions).  We
> found that for non-static functions, if all callsites of a function are
> inlined, the function rarely appears in the final binary (excepting address
> taken functions).  Of course, this doesn't necessarily work for libraries
> which may need all non-static functions to be emitted.

Interesting. Coincidentally, I recently added logic for this for comdat
functions (setting the probability to 20%) to deal with the problem that a lot
of C++ programs do template instantiations that produce comdat functions for no
good reason.  This indeed helped quite a lot.  I didn't get as far as adding
similar logic for normal external functions, since the current toolchain won't
eliminate them by default.
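
The trade-off being discussed can be illustrated with a toy model in Python.
This is only a sketch and not GCC's inliner heuristic: the function
expected_size_delta, its parameters, and the example sizes are all
hypothetical. It assumes that when every call site of a function is inlined,
the out-of-line copy still ends up in the final binary with probability
p_kept (0.2 below, following the 20% figure mentioned in this thread).

```python
def expected_size_delta(body_size, call_size, n_callsites, p_kept):
    """Expected change in code size from inlining a function at all call sites.

    Before inlining: one out-of-line body plus n call sequences.
    After inlining:  n inlined bodies, plus the out-of-line body with
                     probability p_kept (e.g. if the linker cannot drop it).
    All sizes are in bytes; the model ignores any simplification of the
    inlined body, which a real compiler would also try to estimate.
    """
    before = body_size + n_callsites * call_size
    after = n_callsites * body_size + p_kept * body_size
    return after - before


# Hypothetical sizes, for illustration only.
# One call site, out-of-line copy usually dropped: inlining shrinks the code.
print(expected_size_delta(body_size=40, call_size=12, n_callsites=1, p_kept=0.2))
# Same function, but the copy is always kept (e.g. a library export): growth.
print(expected_size_delta(body_size=40, call_size=12, n_callsites=1, p_kept=1.0))
```

Under this model a lower p_kept makes inlining the last remaining call sites
look cheaper, which is why the estimate matters for functions the linker may
or may not discard.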

Did the posted size numbers include function garbage collection and unification,
and were these the same for mainline as for the Google compiler?

Honza

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
  2010-11-19  4:22                                         ` Jan Hubicka
@ 2010-11-19  7:26                                           ` Xinliang David Li
       [not found]                                           ` <AANLkTimvfm24_fvBdkYmqPVjSzcOgy0hx_0mO11AxbeC@mail.gmail.com>
  1 sibling, 0 replies; 68+ messages in thread
From: Xinliang David Li @ 2010-11-19  7:26 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Mark Heffernan, Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org

On Thu, Nov 18, 2010 at 4:12 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> Hi,
>> I'll get back to you with our local inlining changes.  We're looking to move
>> development closer to trunk to reduce this divergence in the future.
>>
>> Our tuning was done primarily on big c++ programs.  A significant size
>> improvement came from aggressively inlining functions which might be
>> eliminated by the linker (garbage collection of uncalled functions).  We
>> found that for non-static functions, if all callsites of a function are
>> inlined, the function rarely appears in the final binary (excepting address
>> taken functions).  Of course, this doesn't necessarily work for libraries
>> which may need all non-static functions to be emitted.
>
> Interesting. Coincidentally, I recently added logic for this for comdat
> functions (setting the probability to 20%) to deal with the problem that a lot
> of C++ programs do template instantiations that produce comdat functions for no
> good reason.  This indeed helped quite a lot.  I didn't get as far as adding
> similar logic for normal external functions, since the current toolchain won't
> eliminate them by default.
>
> Did the posted size numbers include function garbage collection and unification,
> and were these the same for mainline as for the Google compiler?

My previous size numbers are wrong --- it looks like trunk gcc does
pretty well in terms of text size. Updated numbers will be posted
later.

Yes, all experiments were done with GC, but safe ICF was not turned on.

David


>
> Honza
>

* Re: GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64
       [not found]                                           ` <AANLkTimvfm24_fvBdkYmqPVjSzcOgy0hx_0mO11AxbeC@mail.gmail.com>
@ 2010-11-19 14:12                                             ` Xinliang David Li
  0 siblings, 0 replies; 68+ messages in thread
From: Xinliang David Li @ 2010-11-19 14:12 UTC (permalink / raw)
  To: Mark Heffernan
  Cc: Jan Hubicka, Andrey Belevantsev, Vladimir Makarov, gcc.gcc.gnu.org

New size data -- hopefully it is sane this time.

Changes in the experiment:
1) shared libstdc++ is used with trunk gcc
2) bfd linker is used with both trunk and the patched 4.4.3 compiler (which
previously used gold).

The size comparison for all C benchmarks in the previous report is still
valid. The following are the corrected SPEC06 C++ numbers and some new
data (Os).

a) SPEC06 size

1. tot O3 vs tot O2

        471.omnetpp/    564839    579088      2.52%
         450.soplex/    352709    365785      3.71%
      483.xalancbmk/   3357954   3501139      4.26%
           444.namd/    319553    328449      2.78%
          473.astar/     31343     40199     28.26%
            size_sum   4626398   4814660      4.07%

2. tot Os vs tot O2

        471.omnetpp/    564839    512157     -9.33%
         450.soplex/    352709    256364    -27.32%
      483.xalancbmk/   3357954   2747152    -18.19%
           444.namd/    319553    261799    -18.07%
          473.astar/     31343     26201    -16.41%
            size_sum   4626398   3803673    -17.78%

3. tot O2 lto wpa vs tot O2

        471.omnetpp/    564839    531878     -5.84%
         450.soplex/    352709    319476     -9.42%
      483.xalancbmk/   3357954   2966800    -11.65%
           444.namd/    319553    318099     -0.46%
          473.astar/     31343     28747     -8.28%
            size_sum   4626398   4165000     -9.97%

4. tot O3 lto wpa vs tot O2

        471.omnetpp/    564839    646708     14.49%
         450.soplex/    352709    361895      2.60%
      483.xalancbmk/   3357954   3262255     -2.85%
           444.namd/    319553    327738      2.56%
          473.astar/     31343     41707     33.07%
            size_sum   4626398   4640303      0.30%

5. patched 4.4.3 O2 vs tot O2

        471.omnetpp/    564839    539237     -4.53%
         450.soplex/    352709    373263      5.83%
      483.xalancbmk/   3357954   3476137      3.52%
           444.namd/    319553    329769      3.20%
          473.astar/     31343     35250     12.47%
            size_sum   4626398   4753656      2.75%

6. patched 4.4.3 Os vs tot O2

        471.omnetpp/    564839    486838    -13.81%
         450.soplex/    352709    272146    -22.84%
      483.xalancbmk/   3357954   2769330    -17.53%
           444.namd/    319553    255295    -20.11%
          473.astar/     31343     25852    -17.52%
            size_sum   4626398   3809461    -17.66%

7. patched 4.4.3 O2 FDO vs tot O2

        471.omnetpp/    564839    508676     -9.94%
         450.soplex/    352709    356223      1.00%
      483.xalancbmk/   3357954   2919924    -13.04%
           444.namd/    319553    332664      4.10%
          473.astar/     31343     39738     26.78%
            size_sum   4626398   4157225    -10.14%

8. patched 4.4.3 O2 LIPO vs tot O2

        471.omnetpp/    564839    552361     -2.21%
         450.soplex/    352709    392106     11.17%
      483.xalancbmk/   3357954   3058259     -8.92%
           444.namd/    319553    334522      4.68%
          473.astar/     31343     38690     23.44%
            size_sum   4626398   4375938     -5.41%
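
The percentage column in these tables appears to be the relative change of
the second size against the first (the tot O2 base), and the size_sum row is
computed from the summed sizes, not by averaging the per-benchmark
percentages. A minimal Python sketch using the numbers from table 1 above
(tot O3 vs tot O2); the helper name pct_change is made up for illustration.

```python
# Sizes copied from table 1 above: (tot O2 base, tot O3), text size in bytes.
TABLE_1 = {
    "471.omnetpp": (564839, 579088),
    "450.soplex": (352709, 365785),
    "483.xalancbmk": (3357954, 3501139),
    "444.namd": (319553, 328449),
    "473.astar": (31343, 40199),
}


def pct_change(base, new):
    """Relative size change of `new` against `base`, in percent."""
    return (new - base) / base * 100.0


per_benchmark = {name: pct_change(b, n) for name, (b, n) in TABLE_1.items()}

# The size_sum row: sum the sizes first, then take the relative change.
base_sum = sum(b for b, _ in TABLE_1.values())
new_sum = sum(n for _, n in TABLE_1.values())
total = pct_change(base_sum, new_sum)

# Unweighted mean of the per-benchmark percentages, for contrast.
mean_of_pcts = sum(per_benchmark.values()) / len(per_benchmark)

for name, pct in per_benchmark.items():
    print(f"{name:>15} {pct:7.2f}%")
print(f"{'size_sum':>15} {total:7.2f}%")
print(f"{'mean of rows':>15} {mean_of_pcts:7.2f}%")
```

The size_sum figure is dominated by the largest binary (483.xalancbmk), so a
large relative change in a small program such as 473.astar barely moves the
total, while the unweighted mean of the rows comes out noticeably higher.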

SPEC2k Os:

1. tot Os vs tot O2

          300.twolf/    182884    150921    -17.48%
            181.mcf/     11794     10246    -13.13%
           164.gzip/     36705     30983    -15.59%
         186.crafty/    171663    149301    -13.03%
         255.vortex/    463463    398908    -13.93%
          256.bzip2/     28803     24795    -13.92%
            176.gcc/   1422042   1158844    -18.51%
         197.parser/    103225     84814    -17.84%
        253.perlbmk/    563927    457664    -18.84%
            175.vpr/    139321    118330    -15.07%
            252.eon/    314603    258560    -17.81%
            254.gap/    496262    403633    -18.67%
            size_sum   3934692   3246999    -17.48%

2. patched 4.4.3 Os vs tot Os:

          300.twolf/    150921    156185      3.49%
            181.mcf/     10246     10062     -1.80%
           164.gzip/     30983     30991      0.03%
         186.crafty/    149301    151477      1.46%
         255.vortex/    398908    402780      0.97%
          256.bzip2/     24795     24619     -0.71%
            176.gcc/   1158844   1177628      1.62%
         197.parser/     84814     82718     -2.47%
        253.perlbmk/    457664    466152      1.85%
            175.vpr/    118330    121446      2.63%
            252.eon/    258560    281061      8.70%
            254.gap/    403633    411540      1.96%
            size_sum   3246999   3316659      2.15%
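
To see where the 2.15% total growth in the table above comes from, the
size_sum delta can be broken down per benchmark. The following Python sketch
only rearranges the posted numbers (patched 4.4.3 Os vs tot Os) and makes no
claim about why each benchmark changed.

```python
# Sizes copied from the table above: (tot Os, patched 4.4.3 Os), in bytes.
SPEC2K_OS = {
    "300.twolf": (150921, 156185),
    "181.mcf": (10246, 10062),
    "164.gzip": (30983, 30991),
    "186.crafty": (149301, 151477),
    "255.vortex": (398908, 402780),
    "256.bzip2": (24795, 24619),
    "176.gcc": (1158844, 1177628),
    "197.parser": (84814, 82718),
    "253.perlbmk": (457664, 466152),
    "175.vpr": (118330, 121446),
    "252.eon": (258560, 281061),
    "254.gap": (403633, 411540),
}

# Absolute size change per benchmark, and the overall change.
deltas = {name: new - base for name, (base, new) in SPEC2K_OS.items()}
total_delta = sum(deltas.values())

# Largest contributors to the total delta first.
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    share = delta / total_delta * 100.0
    print(f"{name:>12} {delta:+7d} bytes {share:6.1f}% of total delta")
```

In this breakdown 252.eon, the C++ program in the suite, is the largest single
contributor, and together with 176.gcc it accounts for more than half of the
total difference.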

David


On Thu, Nov 18, 2010 at 4:37 PM, Mark Heffernan <meheff@google.com> wrote:
> On Thu, Nov 18, 2010 at 4:12 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> Interesting. Coincidentally, I recently added logic for this for comdat
>> functions (setting the probability to 20%) to deal with the problem that a
>> lot of C++ programs do template instantiations that produce comdat functions
>> for no good reason.  This indeed helped quite a lot.  I didn't get as far as
>> adding similar logic for normal external functions, since the current
>> toolchain won't eliminate them by default.
>
> For non-static, no-address-taken functions, I found that they are emitted in
> the binary (after linker garbage collection) only about 20% of the time or
> so.  Surprisingly small.  This is for large c++ programs.  I'd guess a fair
> number of these functions are template instantiations which may be
> instantiated a particular way (eg, with the same types) in only one
> compilation unit.   Plus if all callsites of a function are inlined in one
> compilation unit, it's more likely that they might be inlined in other
> compilation units too.  However, I didn't dive down deep to figure out
> exactly why this number is so low.
>>
>> Did the posted size numbers include function garbage collection and
>> unification, and were these the same for mainline as for the Google compiler?
>
> I think the size numbers David posted earlier had some problems (statically
> linking stdc++ vs non-statically linked, I believe), so I'd wait until he
> reposts them to draw any conclusions.  Not sure if garbage collection was
> enabled or not.  In any case, we found maybe a 2% reduction in code size for
> -Os on x86-64 over our benchmark set comparing our local 4.4.3 vs vanilla
> 4.4.3.  -O2 is comparable in size, but faster because we inline more
> aggressively which balances out the code size reduction.  I have not done
> the comparison vs trunk.
> Mark
>
>>
>> Honza
>
>

Thread overview: 68+ messages
2010-04-29 16:29 GCC-4.5.0 comparison with previous releases and LLVM-2.7 on SPEC2000 for x86/x86_64 Vladimir Makarov
2010-04-29 16:49 ` Jan Hubicka
2010-04-29 17:25   ` Vladimir Makarov
2010-04-29 18:17     ` Vladimir Makarov
2010-04-29 18:26 ` Xinliang David Li
2010-04-29 18:57   ` Vladimir Makarov
2010-04-29 19:42     ` Xinliang David Li
2010-04-29 20:19       ` Vladimir Makarov
2010-04-29 20:40         ` Xinliang David Li
2010-04-29 21:33     ` Jan Hubicka
2010-04-29 21:34       ` Jan Hubicka
2010-04-29 21:36       ` Xinliang David Li
2010-05-01  9:36         ` Jan Hubicka
2010-05-02  7:04           ` Xinliang David Li
2010-05-02 13:46             ` Jan Hubicka
2010-05-03  4:57               ` Xinliang David Li
2010-05-04 18:04                 ` Jan Hubicka
2010-04-29 21:38       ` Xinliang David Li
2010-04-29 21:46         ` Jan Hubicka
2010-04-29 21:45       ` Steven Bosscher
2010-04-29 22:35         ` Xinliang David Li
2010-04-29 22:50           ` Jan Hubicka
2010-04-29 22:51             ` Steven Bosscher
2010-04-29 23:06               ` Jan Hubicka
2010-04-29 23:47                 ` Steven Bosscher
2010-09-28  0:28                   ` Neil Vachharajani
2010-09-28  1:01                     ` Jack Howarth
2010-04-30  0:57                 ` Xinliang David Li
2010-04-30  8:42                   ` Jan Hubicka
2010-04-30 18:13                     ` Xinliang David Li
2010-04-30 18:32                       ` Jan Hubicka
2010-04-30 20:13                         ` Xinliang David Li
2010-09-28  0:29                           ` Neil Vachharajani
2010-04-29 22:42 ` Jack Howarth
2010-11-13 23:15 ` Xinliang David Li
2010-11-14 14:48   ` Paolo Bonzini
2010-11-14 15:43     ` Xinliang David Li
2010-11-14 21:12   ` H.J. Lu
2010-11-15  9:29     ` Xinliang David Li
2010-11-15 15:49   ` Andrey Belevantsev
2010-11-15 17:41     ` Xinliang David Li
2010-11-15 18:31       ` Jan Hubicka
2010-11-15 22:25         ` Richard Guenther
2010-11-15 22:47         ` Xinliang David Li
2010-11-15 23:06           ` Jan Hubicka
2010-11-16  0:41             ` Xinliang David Li
2010-11-16  0:53               ` Xinliang David Li
2010-11-16  1:02                 ` Jan Hubicka
2010-11-16  1:19                   ` Jan Hubicka
2010-11-16  1:24                   ` Xinliang David Li
2010-11-16  1:39                     ` Jan Hubicka
2010-11-16  1:45                       ` Xinliang David Li
2010-11-16  4:11                         ` Jan Hubicka
2010-11-16  6:56                           ` Xinliang David Li
2010-11-16  8:26                             ` Jan Hubicka
2010-11-16  9:00                               ` Xinliang David Li
2010-11-16 14:23                                 ` Xinliang David Li
2010-11-16 17:10                                   ` Jan Hubicka
2010-11-16 19:11                                     ` Xinliang David Li
2010-11-18 11:48                                   ` Xinliang David Li
2010-11-18 13:06                                     ` Jan Hubicka
2010-11-18 17:20                                       ` Xinliang David Li
     [not found]                                       ` <AANLkTinTVN_T06eG1-nxei_Vj999wFT7qKZ55vwW+TtC@mail.gmail.com>
2010-11-19  4:22                                         ` Jan Hubicka
2010-11-19  7:26                                           ` Xinliang David Li
     [not found]                                           ` <AANLkTimvfm24_fvBdkYmqPVjSzcOgy0hx_0mO11AxbeC@mail.gmail.com>
2010-11-19 14:12                                             ` Xinliang David Li
2010-11-18 13:28                                     ` Jan Hubicka
2010-11-18 18:18                                       ` Xinliang David Li
2010-11-16 15:43                               ` Richard Guenther
