public inbox for gcc@gcc.gnu.org
* GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
@ 2004-08-15 15:35 Scott Robert Ladd
  2004-08-15 20:48 ` Giovanni Bajo
                   ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Scott Robert Ladd @ 2004-08-15 15:35 UTC (permalink / raw)
  To: gcc mailing list

Good day,

Using a custom benchmark suite of my own design, I have compared the 
performance of code generated by recent and pending versions of GCC, for 
AMD Opteron and Intel Pentium 4 processors.

Raw Numbers
===========

System Corwin (x86_64)
   Dual Opteron 240, 1.4GHz
   Tyan K8W 2885 motherboard
   120GB Maxtor 7200 RPM ATA-133 HD
   2GB PC2700 DRAM (1GB per processor, NUMA)
   Radeon 9200 Pro, 128MB, HP f1903 LCD

   Linux 2.6.7 #2 SMP Sat Jun 19 20:16:20 EDT 2004
   GNU C Library 20040808 release version 2.3.4
   GNU assembler 2.15.90.0.1.1 20040303
   ln (coreutils) 5.2.1

                  3.2.3  3.3.3  3.4.2  3.5.0
                  -----  -----  -----  -----
      alma time:   43.1   43.4   42.3   28.1
      arco time:   24.8   25.4   24.7   24.8
       evo time:   47.0   65.9   25.0   24.8
       fft time:   27.7   27.8   27.4   28.1
      huff time:   28.4   28.3   23.6   22.4
       lin time:   30.1   30.1   29.8   29.3
      mat1 time:   28.5   28.3   28.7   29.7
      mole time:   10.7   12.8   12.2   28.8
      tree time:   41.8   40.9   37.7   30.4
--------------   -----  -----  -----  -----
total run time:  282.0  302.8  251.5  246.3


System Tycho (i686)
   2.8GHz Pentium 4, HT enabled in BIOS and OS
   Intel D850EMV2 motherboard
   80GB Maxtor 6L080J4, 7200RPM ATA-100 HD
   80GB Maxtor 6Y080P0, 7200RPM ATA-100 HD
   512MB PC800 RDRAM
   Radeon 9200 Pro, NEC FE990 monitor

   Linux 2.6.7 #1 SMP Sat Jun 26 12:39:11 EDT 2004
   GNU C Library 20040808 release version 2.3.4
   GNU assembler 2.14.90.0.8 20040114
   ln (coreutils) 5.2.1

                  3.2.3  3.3.3  3.4.2  3.5.0  icc 8
                  -----  -----  -----  -----  -----
      alma time:   39.5   39.6   39.0   22.3   13.3
      arco time:   27.8   26.9   25.1   27.3   27.7
       evo time:   43.1   42.9   42.4   42.1   30.1
       fft time:   27.4   27.4   27.0   27.3   30.2
      huff time:   23.1   23.6   18.0   13.1   16.3
       lin time:   19.1   19.1   18.9   19.5   19.1
      mat1 time:    7.4    7.5    7.5    7.5    7.4
      mole time:   31.6   30.5   30.9   31.3    5.1
      tree time:   30.9   32.3   28.3   25.6   28.8
     ----------   -----  -----  -----  -----  -----
     total time:  249.9  249.7  237.1  215.8  178.0


General Thoughts
================

Overall, GCC 3.5 provides a minor improvement in generated code 
performance when compared to GCC 3.4. The historical comparison with 
earlier GCCs shows that code performance *is* improving with subsequent 
releases.

At this time, GCC 3.5 and 3.4 often produce comparable code -- but on a 
few benchmarks, they differ greatly. For the Opteron, GCC 3.5 generates 
significantly faster code for the alma and tree benchmarks -- but it 
suffers a massive regression on the mole test. For the Pentium 4, GCC 
3.5 is superior for the alma, huff, and tree tests, but loses a bit of 
ground against 3.4 on others.

Intel C is still amazingly effective. HOWEVER, I do not have a more 
recent version of Intel C because my current commercial license has 
expired, and compiler updates won't install any more. In terms of 
intellectual and practical freedom, GCC wins hands down.


The Usual Explanations and Caveats
==================================

All compilers were built on the host systems, from official, unpatched 
archives (3.2 and 3.3) or CVS checkouts (3.4 and 3.5), acquired on the 
morning of 14 August 2004. The compiler configuration command was:

../gcc/configure --prefix=/opt/gcc-3.?
	--enable-shared
	--enable-threads=posix
	--enable-__cxa_atexit
	--disable-checking
	--disable-multilib
	--enable-languages=c,c++,f77 (f95 for gcc 3.5)

The compilers were built with make -j2 bootstrap.

Since we're interested in generated code speed, all compiles were 
performed with the option set used by typical users:

   -O3 -ffast-math -march=pentium4
   -O3 -ffast-math -march=athlon-mp (Opteron, for GCCs 3.2 and 3.3)
   -O3 -ffast-math -march=opteron   (Opteron, for GCCs 3.4 and 3.5)

On the Pentium 4, I also compiled the code with Intel's ICC compiler, 
version 8.0.055 (build 20031211Z), using the options:

   -xN -O3 -ipo

As my Acovea program has shown, a selection of individual optimization 
flags often produces code that performs faster than what is generated by 
the generic -O? options. However, most programmers don't have the time 
or expertise required for finding optimal optimizations (!) -- and as 
such, they tend to use the most "powerful" composite options (e.g., -O3).

Some folk may object to my use of -ffast-math -- however, in numerous 
accuracy tests, -ffast-math produces code that is both faster *and* more 
accurate than code generated without it. Yes, -ffast-math has other 
aspects that make for interesting debate; however, such discussions 
belong in another article.

This article is *NOT* a comparison of the Pentium 4 and Opteron 
processors; my two test systems are far too different for any such 
comparison to have meaning. Please do not ask me to test on systems I 
don't own, unless you're willing to send me hardware. Assuming I find 
some paying work this month, I'll be making some system upgrades in the 
near future; for now, what I've got is what I've got.

About the Benchmarks
====================

alma -- Calculates the daily ephemeris (at noon) for the years 
2000-2099; tests array handling, floating-point math, and mathematical 
functions such as sin() and cos().

evo -- A simple genetic algorithm that maximizes a two-dimensional 
function; tests 64-bit math, loop generation, and floating-point math.

fft -- Uses a Fast Fourier Transform to multiply two very (very) large 
polynomials; tests the C99 _Complex type and basic floating-point math.

huff -- Compresses a large block of data using the Huffman algorithm; 
tests string manipulation, bit twiddling, and the use of large memory 
blocks.

lin -- Solves a large linear equation via LUP decomposition; tests basic 
floating-point math, two-dimensional array performance, and loop 
optimization.

mat1 -- Multiplies two very large matrices using the brute-force 
algorithm; tests loop optimization.

mole -- A molecular dynamics simulation, with performance predicated on 
matrix operations, loop efficiency, and sin() and cos(). I recently 
added this test, which exhibits very different characteristics from alma 
(even if they appear similar).

tree -- Creates and modifies a large B-tree in memory; tests integer 
looping and dynamic memory management.

My benchmark suite is still in development, and isn't packaged as nicely 
as I'd like for general distribution. If you want the benchmark source 
code, or have any questions about these tests, please e-mail me.

Thank you!

-- 
Scott Robert Ladd
Coyote Gulch Productions (http://www.coyotegulch.com)
Software Invention for High-Performance Computing

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-15 15:35 GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004 Scott Robert Ladd
@ 2004-08-15 20:48 ` Giovanni Bajo
  2004-08-16 15:51   ` Scott Robert Ladd
  2004-08-17  5:14 ` Natros
  2004-08-18  1:35 ` Kaveh R. Ghazi
  2 siblings, 1 reply; 43+ messages in thread
From: Giovanni Bajo @ 2004-08-15 20:48 UTC (permalink / raw)
  To: Scott Robert Ladd; +Cc: gcc

Scott Robert Ladd wrote:


> For the Opteron, GCC 3.5
> generates significantly faster code for the alma and tree benchmarks
> -- but it suffers a massive regression on the mole test.

Can you please file a bugreport with a preprocessed version of the mole test?
We can flag it as a regression and hopefully have it fixed before the release.
-- 
Giovanni Bajo


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-15 20:48 ` Giovanni Bajo
@ 2004-08-16 15:51   ` Scott Robert Ladd
  2004-08-17 19:28     ` Marcel Cox
  0 siblings, 1 reply; 43+ messages in thread
From: Scott Robert Ladd @ 2004-08-16 15:51 UTC (permalink / raw)
  To: Giovanni Bajo; +Cc: gcc

Giovanni Bajo wrote:
> Can you please file a bugreport with a preprocessed version of the
> mole test?
> We can flag it as a regression and hopefully have it fixed before the release.

Filed as 17050, with -save-temps source file.

This problem is specific to either 64-bit or Opteron code generation, 
given that my Pentium 4 does not show the same regression. On the other 
hand, Intel produces code that is 6X faster on Pentium 4, suggesting 
that GCC's code generation may be weak there as well.

-- 
Scott Robert Ladd
Coyote Gulch Productions (http://www.coyotegulch.com)
Software Invention for High-Performance Computing

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-15 15:35 GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004 Scott Robert Ladd
  2004-08-15 20:48 ` Giovanni Bajo
@ 2004-08-17  5:14 ` Natros
  2004-08-18  1:35 ` Kaveh R. Ghazi
  2 siblings, 0 replies; 43+ messages in thread
From: Natros @ 2004-08-17  5:14 UTC (permalink / raw)
  To: gcc mailing list

On Sun, 15 Aug 2004 10:55:07 -0400, Scott Robert Ladd
<coyote@coyotegulch.com> wrote:
> Intel C is still amazingly effective. HOWEVER, I do not have a more
> recent version of Intel C because my current commercial license has
> expired, and compiler updates won't install any more. In terms of
> intellectual and practical freedom, GCC wins hands down.

But you can get a non-commercial license and the updates. I've been
using it for two years.

-- 
Natros

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-16 15:51   ` Scott Robert Ladd
@ 2004-08-17 19:28     ` Marcel Cox
  2004-08-17 21:26       ` Scott Robert Ladd
  0 siblings, 1 reply; 43+ messages in thread
From: Marcel Cox @ 2004-08-17 19:28 UTC (permalink / raw)
  To: gcc

Scott Robert Ladd wrote:

> This problem is unique to either 64-bit or Opteron code generation,
> given that my Pentium 4 does not show the same regression. On the
> other hand, Intel produces code that is 6X faster on Pentium 4,
> suggesting that GCC's code generation may be weak there as well.

I don't know where the difference on Opteron comes from, but for the
difference between GCC 3.4.x and ICC on Pentium 4, it becomes quite
clear which optimisations ICC performs that GCC misses. It boils down
to the following single function:


inline static double dv(double x)
{
    return 2.0 * sin(((x < PI2) ? x : PI2)) * cos(((x < PI2) ? x : PI2));
}


1) This function is poorly written. A much more efficient way to write
exactly the same function would be:

inline static double dv(double x)
{
	if (x >= PI2) return 0;
	else return sin(2.0 * x);
}

This hoists the test out of the function arguments and uses the
mathematical identity sin(2*x) = 2*sin(x)*cos(x).
On my computer, this transformation alone reduces the run time from 32s
to 18s. My guess is that ICC is able to make this transform itself, thus
avoiding having to calculate both sin and cos and then do the
multiplication. It is also possible that ICC does not do this transform,
but instead uses a function that calculates sin and cos at the same
time. In either case, it results in a single function call instead of two.

2) The function dv() is called from within a loop with the argument not
changing. I guess that ICC detects that dv() is a pure function and
moves it out of the loop. GCC however calculates the sin and cos at
each iteration. By moving the call outside the loop and storing the
result in a variable that is used in the loop, the run time goes down
to 7s thus very close to ICC.


So, all in all, GCC just misses two optimisations in this case:
- high-level transforms of trigonometric functions
- recognizing that trigonometric functions are called with the same
arguments in a loop, and moving them out of the loop


While both these types of optimizations would be nice to have, I still
consider them optimizations that someone writing a program that does
mathematical calculations should have applied himself when writing it.

-- 
Marcel Cox (using XanaNews 1.16.3.1)

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-17 19:28     ` Marcel Cox
@ 2004-08-17 21:26       ` Scott Robert Ladd
  2004-08-17 22:21         ` Joe Buck
                           ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Scott Robert Ladd @ 2004-08-17 21:26 UTC (permalink / raw)
  To: Marcel Cox; +Cc: gcc

Hello,

Thanks for the analysis; the mole benchmark is a cleaned-up automatic 
translation of Fortran 90 code written by Bill Magro of Kuck and 
Associates, Inc. I hadn't made a pass over it to look for hand 
optimizations.

I've implemented the changes you suggested in both C and Fortran 90 
versions, plus I made a few other optimizations that came to mind. As a 
result, the run-time of the mole benchmark dropped from 29+ seconds to 
6.7 seconds on the Opteron. This will be reflected in the new tables 
I'll be publishing this coming weekend.

Marcel Cox wrote:
 > While both these types of optimizations would be nice to have, I still
 > consider them to be optimizations that someone who writes a program
 > that does mathematical calculations should have done himself when
 > writing the program.

Almost any code can be hand-optimized for better performance; the 
SPEC2000 benchmarks, for example, are far from perfectly realized. And 
sometimes an excellent optimizer allows code to be written for greater 
clarity as opposed to cleverness. Not all loop invariants are so obvious.

Given that Intel is smart enough to generate both correct and fast code 
from the source, GCC should be able to do the same. I'm thankful that we 
have a very smart group of people who are working hard on such matters.

..Scott

-- 
Scott Robert Ladd
Coyote Gulch Productions (http://www.coyotegulch.com)
Software Invention for High-Performance Computing

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-17 21:26       ` Scott Robert Ladd
@ 2004-08-17 22:21         ` Joe Buck
  2004-08-17 22:39           ` Mike Stump
  2004-08-17 22:54           ` Scott Robert Ladd
  2004-08-17 23:53         ` Robert Dewar
  2004-08-18 11:14         ` Marcel Cox
  2 siblings, 2 replies; 43+ messages in thread
From: Joe Buck @ 2004-08-17 22:21 UTC (permalink / raw)
  To: Scott Robert Ladd; +Cc: Marcel Cox, gcc

On Tue, Aug 17, 2004 at 04:33:30PM -0400, Scott Robert Ladd wrote:
> Given that Intel is smart enough to generate both correct and fast code 
> from the source, GCC should be able to do the same. I'm thankful that we 
> have a very smart group of people who are working hard on such matters.

There's an issue here.  I hesitate to call it "ethics", but it is
borderline.

The issue is whether the compiler should specifically detect certain
transformation opportunities in benchmarks that are unlikely to exist in
real code, to get an artificially high score on that benchmark that does
not reflect what the user can expect on code that the compiler developers
have not seen before.

A classic example was the Whetstone benchmark, where many compiler
developers started specifically recognizing a particular expression
involving transcendental functions and applying a mathematical identity to
speed it up.  This particular transformation is useless for anything
except speeding up Whetstone, but it slows down the compiler a bit for all
programs.

In this case, a transformation could be added as part of -ffast-math that
would specifically recognize

inline static double dv(double x)
{
    return 2.0 * sin(((x < PI2) ? x : PI2)) * cos(((x < PI2) ? x : PI2));
}

and transform it into

inline static double dv(double x)
{
    return x >= PI2 ? 0.0 : sin(2.0 * x);
}

just to get a high score on mole.  But that strikes me as close to
immoral; any transformation should fall out of some generally useful
transformation sequence.  For example, if (x < PI2 ? x : PI2) is
recognized as a common subexpression, we have

inline static double dv(double x)
{
   double arg = x < PI2 ? x : PI2;
   return 2.0 * sin(arg) * cos(arg);
}

This might then be turned into

inline static double dv(double x)
{
   if (x < PI2)
       return 2.0 * sin(x) * cos(x);
   else
       return 0.0;
}

and if ICC does it this way, I would not consider this "cheating" at all.
That could be justified.  The trig identity that 2*sin(x)*cos(x) is
sin(2*x) is more iffy; we'd waste time trying to apply such
transformations to every tree, and it's pretty much exactly what caused
such controversy when everyone did it to Whetstone.




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-17 22:21         ` Joe Buck
@ 2004-08-17 22:39           ` Mike Stump
  2004-08-17 22:54           ` Scott Robert Ladd
  1 sibling, 0 replies; 43+ messages in thread
From: Mike Stump @ 2004-08-17 22:39 UTC (permalink / raw)
  To: Joe Buck; +Cc: Marcel Cox, gcc, Scott Robert Ladd

On Aug 17, 2004, at 3:15 PM, Joe Buck wrote:
> The issue is whether the compiler should specifically detect certain
> transformation opportunities in benchmarks that are unlikely to exist 
> in
> real code, to get an artificially high score on that benchmark that 
> does
> not reflect what the user can expect on code that the compiler 
> developers
> have not seen before.

[ cough, art hack, cough ]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-17 22:21         ` Joe Buck
  2004-08-17 22:39           ` Mike Stump
@ 2004-08-17 22:54           ` Scott Robert Ladd
  1 sibling, 0 replies; 43+ messages in thread
From: Scott Robert Ladd @ 2004-08-17 22:54 UTC (permalink / raw)
  To: Joe Buck, gcc

Joe Buck wrote:
> There's an issue here.  I hesitate to call it "ethics", but it is
> borderline.
> 
> The issue is whether the compiler should specifically detect certain
> transformation opportunities in benchmarks that are unlikely to exist in
> real code, to get an artificially high score on that benchmark that does
> not reflect what the user can expect on code that the compiler developers
> have not seen before.
> 
> A classic example was the Whetstone benchmark, where many compiler
> developers started specifically recognizing a particular expression
> involving transcendental functions and applying a mathematical identity to
> speed it up.  This particular transformation is useless for anything
> except speeding up Whetstone, but it slows down the compiler a bit for all
> programs.

A good point. I wrote my first benchmark article for Micro Cornucopia 
back in 1988, and ran into exactly the problem above. When I wrote a 
review of Fortran compilers for Computer Language in about '90, we had 
trouble with compilers built for specific benchmarks, and vendors who 
complained loudly if we did not use the specific benchmarks they were 
best at optimizing. And today we have chicanery involving drivers for 
various video cards -- ugh.

To combat this, I use non-standard benchmarks, often reducing larger 
"real world" programs to find something easily understandable yet 
complex enough to stress compilers' code generation.

> In this case, a transformation could be added as part of -ffast-math that
> would specifically recognize
> 
> inline static double dv(double x)
> {
>     return 2.0 * sin(((x < PI2) ? x : PI2)) * cos(((x < PI2) ? x : PI2));
> }
> 
> and transform it into
> 
> inline static double dv(double x)
> {
>     return x >= PI2 ? 0.0 : sin(2.0 * x);
> }
> 
> just to get a high score on mole.  But that strikes me as close to
> immoral; any transformation should fall out of some generally useful
> transformation sequence.

Consider that my "mole" benchmark is *not* a well-known benchmark 
program, and I published it *after* the release of the Intel compiler 
I'm using. I don't see how they could have anticipated my benchmark, 
unless another, better-known benchmark includes a similar function.

While I know Intel has taken some personal interest in my benchmarks, I 
sincerely doubt they have had the time or desire to make optimizations 
specific to my rather eclectic and obscure set of tests. I'm certain 
they spend much more time trying to co-opt SPEC, for example.

> inline static double dv(double x)
> {
>    double arg = x < PI2 ? x : PI2;
>    return 2.0 * sin(arg) * cos(arg);
> }
> 
> This might then be turned into
> 
> inline static double dv(double x)
> {
>    if (x < PI2)
>        return 2.0 * sin(x) * cos(x);
>    else
>        return 0.0;
> }
> 
> and if ICC does it this way, I would not consider this "cheating" at all.
> That could be justified.  The trig identity that 2*sin(x)*cos(x) is
> sin(2*x) is more iffy; we'd waste time trying to apply such
> transformations to every tree, and it's pretty much exactly what caused
> such controversy when everyone did it to Whetstone.

I agree with your analysis; a compiler should not need to "know" 
trigonometric identities. I do find it interesting that Intel's compiler 
can make some very interesting "optimizations" of this sort, rather quickly.

Also, how deep should such an analysis go? Would the compiler need to 
recognize various transformations of the identity in order to replace 
it? The number of special cases is rather daunting; I think it better to 
rely on programmers for reducing such expressions.

..Scott

-- 
Scott Robert Ladd
Coyote Gulch Productions (http://www.coyotegulch.com)
Software Invention for High-Performance Computing

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-17 21:26       ` Scott Robert Ladd
  2004-08-17 22:21         ` Joe Buck
@ 2004-08-17 23:53         ` Robert Dewar
  2004-08-18  0:28           ` Joe Buck
  2004-08-18 11:14         ` Marcel Cox
  2 siblings, 1 reply; 43+ messages in thread
From: Robert Dewar @ 2004-08-17 23:53 UTC (permalink / raw)
  To: Scott Robert Ladd; +Cc: Marcel Cox, gcc

Scott Robert Ladd wrote:

> Given that Intel is smart enough to generate both correct and fast code 
> from the source, GCC should be able to do the same. I'm thankful that we 
> have a very smart group of people who are working hard on such matters.

I am always suspicious of spec results. So much effort by some systems
houses (including Intel) is focused specifically on the spec programs
that I am never convinced the result is representative.

Of course in this case, we know some clear ways to improve results
that for sure *are* generally useful. I think we should always
concentrate on the notion of generally useful, and consider spec
results to be useful only as an indication of possible candidates,
not as an end in themselves.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-17 23:53         ` Robert Dewar
@ 2004-08-18  0:28           ` Joe Buck
  2004-08-18  0:41             ` Robert Dewar
                               ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Joe Buck @ 2004-08-18  0:28 UTC (permalink / raw)
  To: Robert Dewar; +Cc: Scott Robert Ladd, Marcel Cox, gcc

On Tue, Aug 17, 2004 at 07:44:34PM -0400, Robert Dewar wrote:
> I am always suspicious of spec results. So much effort is expended by
> some systems houses (including Intel) that is focused on the spec
> programs, and I am never convinced that the result is representative.

icc's good performance on Kahan's Paranoia benchmark is surprising given
that it makes obviously unsafe floating point transformations; a
reasonable conclusion is that its developers were asked for both a good
Paranoia score and good floating point performance, and they cut corners,
but took care not to break Paranoia.




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-18  0:28           ` Joe Buck
@ 2004-08-18  0:41             ` Robert Dewar
  2004-08-18  0:44             ` Robert Dewar
  2004-08-18  1:34             ` Scott Robert Ladd
  2 siblings, 0 replies; 43+ messages in thread
From: Robert Dewar @ 2004-08-18  0:41 UTC (permalink / raw)
  To: Joe Buck; +Cc: Scott Robert Ladd, Marcel Cox, gcc

Joe Buck wrote:

> icc's good performance on Kahan's Paranoia benchmark is surprising given
> that it makes obviously unsafe floating point transformations; a
> reasonable conclusion is that its developers were asked for both a good
> Paranoia score and good floating point performance, and they cut corners,
> but took care not to break Paranoia.

Such cynicism! :-) :-) :-)


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-18  0:28           ` Joe Buck
  2004-08-18  0:41             ` Robert Dewar
@ 2004-08-18  0:44             ` Robert Dewar
  2004-08-18  0:58               ` Joe Buck
  2004-08-18  1:34             ` Scott Robert Ladd
  2 siblings, 1 reply; 43+ messages in thread
From: Robert Dewar @ 2004-08-18  0:44 UTC (permalink / raw)
  To: Joe Buck; +Cc: Scott Robert Ladd, Marcel Cox, gcc

Joe Buck wrote:

> icc's good performance on Kahan's Paranoia benchmark is surprising given
> that it makes obviously unsafe floating point transformations; a
> reasonable conclusion is that its developers were asked for both a good
> Paranoia score and good floating point performance, and they cut corners,
> but took care not to break Paranoia.

Actually to be fair, we have often found in the past that an impression
that ICC was doing something horrible turned out to be wrong on more
careful examination, so let's make sure we know what we are talking
about here :-)


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-18  0:44             ` Robert Dewar
@ 2004-08-18  0:58               ` Joe Buck
  2004-08-18  1:04                 ` Robert Dewar
  0 siblings, 1 reply; 43+ messages in thread
From: Joe Buck @ 2004-08-18  0:58 UTC (permalink / raw)
  To: Robert Dewar; +Cc: Scott Robert Ladd, Marcel Cox, gcc

On Tue, Aug 17, 2004 at 08:40:09PM -0400, Robert Dewar wrote:
> Joe Buck wrote:
> 
> > icc's good performance on Kahan's Paranoia benchmark is surprising given
> > that it makes obviously unsafe floating point transformations; a
> > reasonable conclusion is that its developers were asked for both a good
> > Paranoia score and good floating point performance, and they cut corners,
> > but took care not to break Paranoia.
> 
> Actually to be fair, we have often found in the past that an impression
> that ICC was doing something horrible turned out to be wrong on more
> careful examination, so let's make sure we know what we are talking
> about here :-)

You said otherwise yourself:

http://gcc.gnu.org/ml/gcc/2004-03/msg01597.html


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-18  0:58               ` Joe Buck
@ 2004-08-18  1:04                 ` Robert Dewar
  2004-08-18  1:18                   ` Joe Buck
  0 siblings, 1 reply; 43+ messages in thread
From: Robert Dewar @ 2004-08-18  1:04 UTC (permalink / raw)
  To: Joe Buck; +Cc: Scott Robert Ladd, Marcel Cox, gcc

Joe Buck wrote:

>>Actually to be fair, we have often found in the past that an impression
>>that ICC was doing something horrible turned out to be wrong on more
>>careful examination, so let's make sure we know what we are talking
>>about here :-)

> You said otherwise yourself:

> http://gcc.gnu.org/ml/gcc/2004-03/msg01597.html

A little problem with quantifiers in your reasoning here :-) My
statement above does NOT have a universal quantifier in it,
so a single counterexample is not a contradiction!

I just think we need to be careful to be absolutely sure we
know what icc is doing before we criticize.


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-18  1:04                 ` Robert Dewar
@ 2004-08-18  1:18                   ` Joe Buck
  2004-08-18  1:27                     ` Robert Dewar
  0 siblings, 1 reply; 43+ messages in thread
From: Joe Buck @ 2004-08-18  1:18 UTC (permalink / raw)
  To: Robert Dewar; +Cc: Scott Robert Ladd, Marcel Cox, gcc

On Tue, Aug 17, 2004 at 08:49:51PM -0400, Robert Dewar wrote:
> Joe Buck wrote:
> 
> >>Actually to be fair, we have often found in the past that an impression
> >>that ICC was doing something horrible turned out to be wrong on more
> >>careful examination, so let's make sure we know what we are talking
> >>about here :-)
> 
> > You said otherwise yourself:
> 
> > http://gcc.gnu.org/ml/gcc/2004-03/msg01597.html
> 
> A little problem with quantifiers in your reasoning here :-) My
> statement above does NOT have a universal quantifier in it,
> so a single counterexample is not a contradiction!
> 
> I just think we need to be careful to be absolutely sure we
> know what icc is doing before we criticize.

I'm not accusing the icc developers of lack of ethics; merely pointing
out that, as for many compilers, performance on benchmarks can be
misleading.



* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-18  1:18                   ` Joe Buck
@ 2004-08-18  1:27                     ` Robert Dewar
  0 siblings, 0 replies; 43+ messages in thread
From: Robert Dewar @ 2004-08-18  1:27 UTC (permalink / raw)
  To: Joe Buck; +Cc: Scott Robert Ladd, Marcel Cox, gcc

Joe Buck wrote:

>>I just think we need to be careful to be absolutely sure we
>>know what icc is doing before we criticize.

> I'm not accusing the icc developers of lack of ethics; merely pointing
> out that, as for many compilers, performance on benchmarks can be
> misleading.

Sure, but the issue here is not ethics, it is a technical issue
of what optimizations are permissible. And secondarily, what
optimizations are actually useful.



* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-18  0:28           ` Joe Buck
  2004-08-18  0:41             ` Robert Dewar
  2004-08-18  0:44             ` Robert Dewar
@ 2004-08-18  1:34             ` Scott Robert Ladd
  2004-08-18  1:45               ` Zack Weinberg
                                 ` (3 more replies)
  2 siblings, 4 replies; 43+ messages in thread
From: Scott Robert Ladd @ 2004-08-18  1:34 UTC (permalink / raw)
  To: Joe Buck; +Cc: gcc

Joe Buck wrote:
> icc's good performance on Kahan's Paranoia benchmark is surprising given
> that it makes obviously unsafe floating point transformations; a
> reasonable conclusion is that its developers were asked for both a good
> Paranoia score and good floating point performance, and they cut corners,
> but took care not to break Paranoia.

Perhaps.

On my Pentium 4 (Northwood core), Intel's compiler produces an excellent
Paranoia result: a single flaw (Kahan's least significant category of
error), *when* the compile is performed with the -xN (optimize for
Northwood) option. The -mp and -mp1 options, which purportedly improve
floating-point consistency and accuracy, seem ineffective in terms of
Paranoia.

On the Pentium 4, the best GCC 3.4 can do is one defect and one flaw,
when the benchmark is compiled with -march=pentium4. Adding -O3
introduces a host of Paranoia errors (ten more!), while -ffast-math adds
a single defect.

There is one compiler that produces a perfect score on Paranoia: No 
failures, defects, or flaws.

GCC 3.5, on my Opteron system.

Compiled with -march=opteron, GCC 3.5 has a perfect paranoia score. 
Adding -O3 does *not* introduce any errors on the Opteron; adding 
-ffast-math results in a *single* minor "flaw".

Perhaps this is due to the underlying architecture of the processors; or
maybe GCC's x86_64 code generator is simply better than Intel's.

Ooh-ooh! Conspiracy theory: We all know how fanatical those AMD devotees 
are; maybe they quietly made certain that GCC produces perfect code on 
Paranoia for the Opteron!  This would both help GCC *and* promote the 
AMD64 architecture.

Muwahahahaha! ;)

-- 
Scott Robert Ladd
Coyote Gulch Productions (http://www.coyotegulch.com)
Software Invention for High-Performance Computing


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-15 15:35 GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004 Scott Robert Ladd
  2004-08-15 20:48 ` Giovanni Bajo
  2004-08-17  5:14 ` Natros
@ 2004-08-18  1:35 ` Kaveh R. Ghazi
  2004-08-18 13:14   ` Arnaud Desitter
  2 siblings, 1 reply; 43+ messages in thread
From: Kaveh R. Ghazi @ 2004-08-18  1:35 UTC (permalink / raw)
  To: coyote; +Cc: gcc

 >                  3.2.3  3.3.3  3.4.2  3.5.0  icc 8
 >                  -----  -----  -----  -----  -----
 >      alma time:   39.5   39.6   39.0   22.3   13.3
 >      arco time:   27.8   26.9   25.1   27.3   27.7
 >       evo time:   43.1   42.9   42.4   42.1   30.1
 >       fft time:   27.4   27.4   27.0   27.3   30.2
 >      huff time:   23.1   23.6   18.0   13.1   16.3
 >       lin time:   19.1   19.1   18.9   19.5   19.1
 >      mat1 time:    7.4    7.5    7.5    7.5    7.4
 >      mole time:   31.6   30.5   30.9   31.3    5.1
 >      tree time:   30.9   32.3   28.3   25.6   28.8
 >     ----------   -----  -----  -----  -----  -----
 >     total time:  249.9  249.7  237.1  215.8  178.0
 > 
 > Since we're interested in generated code speed, all compiles were performed with the option set used by typical users:
 > 
 >   -O3 -ffast-math -march=pentium4

Hi Scott,

Does your glibc contain the fixes posted here?
http://gcc.gnu.org/ml/gcc-patches/2004-05/msg01720.html
It seems Jakub expects these fixes to be in glibc 2.3.4, which is the
version you listed, but I wanted to be sure. Otherwise it's not a fair
test if the well-known glibc trig "inlines" are killing gcc vs. icc.

If you have the fixes, I'm curious to know if Jakub got all the
builtin cases gcc handles.  You can easily tell by looking at the -E
output for alma with 3.5.0 and checking for asm statements.  (There
shouldn't be any because I think all the trig funcs alma used were
done.)  Conversely, if -D__NO_MATH_INLINES improves alma, then he
missed a few.

I'd also like to know how icc works around this issue.  Does it
provide its own C library?  Does it use glibc but disable the inlines
somehow?

		Thanks,
		--Kaveh
--
Kaveh R. Ghazi			ghazi@caip.rutgers.edu


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-18  1:34             ` Scott Robert Ladd
@ 2004-08-18  1:45               ` Zack Weinberg
  2004-08-18 18:25                 ` Laurent GUERBY
  2004-08-18  2:04               ` Scott Robert Ladd
                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 43+ messages in thread
From: Zack Weinberg @ 2004-08-18  1:45 UTC (permalink / raw)
  To: Scott Robert Ladd; +Cc: Joe Buck, gcc

Scott Robert Ladd <coyote@coyotegulch.com> writes:

> There is one compiler that produces a perfect score on Paranoia: No 
> failures, defects, or flaws.
>
> GCC 3.5, on my Opteron system.
>
> Compiled with -march=opteron, GCC 3.5 has a perfect paranoia score. 
> Adding -O3 does *not* introduce any errors on the Opteron; adding 
> -ffast-math results in a *single* minor "flaw".

I imagine it helps that the Opteron actually has IEEE-conformant
floating point hardware.

zw


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-18  1:34             ` Scott Robert Ladd
  2004-08-18  1:45               ` Zack Weinberg
@ 2004-08-18  2:04               ` Scott Robert Ladd
  2004-08-18  8:37               ` Richard Henderson
  2004-08-18 18:48               ` Toon Moene
  3 siblings, 0 replies; 43+ messages in thread
From: Scott Robert Ladd @ 2004-08-18  2:04 UTC (permalink / raw)
  To: Scott Robert Ladd; +Cc: gcc

Scott Robert Ladd wrote:
> Compiled with -march=opteron, GCC 3.5 has a perfect paranoia score. 
> Adding -O3 does *not* introduce any errors on the Opteron; adding 
> -ffast-math results in a *single* minor "flaw".

A minor clarification:

*BOTH* GCC 3.4 and 3.5 achieve a perfect score on the Opteron, with or 
without any -O? optimization switch. When -ffast-math (actually, 
-funsafe-math-optimizations) is added, GCC 3.4 gains a single flaw, 
while 3.5 gains one defect and two flaws. I'll try to track this down 
and report it as a regression against 3.5.

-- 
Scott Robert Ladd
Coyote Gulch Productions (http://www.coyotegulch.com)
Software Invention for High-Performance Computing


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-18  1:34             ` Scott Robert Ladd
  2004-08-18  1:45               ` Zack Weinberg
  2004-08-18  2:04               ` Scott Robert Ladd
@ 2004-08-18  8:37               ` Richard Henderson
  2004-08-18 16:25                 ` Scott Robert Ladd
  2004-08-18 18:48               ` Toon Moene
  3 siblings, 1 reply; 43+ messages in thread
From: Richard Henderson @ 2004-08-18  8:37 UTC (permalink / raw)
  To: Scott Robert Ladd; +Cc: Joe Buck, gcc

On Tue, Aug 17, 2004 at 09:31:45PM -0400, Scott Robert Ladd wrote:
> There is one compiler that produces a perfect score on Paranoia: No 
> failures, defects, or flaws.
> 
> GCC 3.5, on my Opteron system.

You should get the same with "-msse2 -mfpmath=sse".

This is the default for the amd64 abi.


r~


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-17 21:26       ` Scott Robert Ladd
  2004-08-17 22:21         ` Joe Buck
  2004-08-17 23:53         ` Robert Dewar
@ 2004-08-18 11:14         ` Marcel Cox
  2 siblings, 0 replies; 43+ messages in thread
From: Marcel Cox @ 2004-08-18 11:14 UTC (permalink / raw)
  To: gcc

Scott Robert Ladd wrote:

> Marcel Cox wrote:
> > While both these types of optimizations would be nice to have, I
> > still consider them to be optimizations that someone who writes a
> > program that does mathematical calculations should have done
> > himself when writing the program.
> 
> Almost any code can be hand-optimized for better performance; the
> SPEC2000 benchmarks, for example, are far from perfectly realized.
> And sometimes an excellent optimizer allows code to be written for
> greater clarity as opposed to cleverness. Not all loop invariants are
> so obvious.

Yes, I should really have distinguished between the two optimizations.
The first, optimizing the body of the dv function, was an easy and
obvious optimization for a human, and it also makes the function much
cleaner and more readable.
The second, moving the function call out of the loop, is less obvious
to a human and does not improve readability, but it is exactly the kind
of transformation a good compiler should be able to figure out by
itself.

-- 
Marcel (using XanaNews 1.16.3.1)
I can resist everything but temptation


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-18  1:35 ` Kaveh R. Ghazi
@ 2004-08-18 13:14   ` Arnaud Desitter
  2004-08-18 15:37     ` Kaveh R. Ghazi
  0 siblings, 1 reply; 43+ messages in thread
From: Arnaud Desitter @ 2004-08-18 13:14 UTC (permalink / raw)
  To: Kaveh R. Ghazi; +Cc: gcc


----- Original Message ----- 
From: "Kaveh R. Ghazi" <ghazi@caip.rutgers.edu>
Newsgroups: gmane.comp.gcc.devel
Cc: <gcc@gcc.gnu.org>
Sent: Wednesday, August 18, 2004 2:33 AM
Subject: Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004


> 
> Does your glibc contain the fixes posted here?
> http://gcc.gnu.org/ml/gcc-patches/2004-05/msg01720.html
> It seems like Jakub predicts these fixes will be in glibc 2.3.4 which
> is the version you listed you have, but I wanted to be sure.
> Otherwise, it's not a fair test if the well known glibc trig "inlines"
> are killing gcc vs icc.
> 
> If you have the fixes, I'm curious to know if Jakub got all the
> builtin cases gcc handles.  You can easily tell by looking at the -E
> output for alma with 3.5.0 and checking for asm statements.  (There
> shouldn't be any because I think all the trig funcs alma used were
> done.)  Conversely, if -D__NO_MATH_INLINES improves alma, then he
> missed a few.
> 
> I'd also like to know how icc works around this issue.  Does it
> provide its own C library?  Does it use glibc but disable the inlines
> somehow?
> 

Using glibc 2.3.2, icc 8.0 gets its declarations of sin, cos, and
friends from "bits/mathcalls.h". Since __NO_MATH_INLINES is defined,
"bits/mathinline.h" is not used, and there is no expansion of standard
functions into fancy inline assembler.

Regards,

>cat qq.c
#include <math.h>
double aa(double x){
  return sin(x)*cos(x);
}
int main(){
  double x = 0, y;
  unsigned int i;
  for(i=0;i<10;++i){
    x += 10*i;
    aa(x+i);
  }
}
>icc -H -O3 -xN qq.c
/usr/include/math.h
 /usr/include/features.h
  /usr/include/sys/cdefs.h
  /usr/include/gnu/stubs.h
 /usr/include/bits/huge_val.h
  /usr/include/features.h
 /usr/include/bits/mathdef.h
 /usr/include/bits/mathcalls.h
 /usr/include/bits/mathcalls.h
 /usr/include/bits/mathcalls.h
>icc -E -dM -O3 -xN qq.c | grep INLINE
#define __NO_MATH_INLINES 1
#define __NO_STRING_INLINES 1
#define __NO_INLINE__ 1


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-18 13:14   ` Arnaud Desitter
@ 2004-08-18 15:37     ` Kaveh R. Ghazi
  2004-08-18 17:05       ` Scott Robert Ladd
  0 siblings, 1 reply; 43+ messages in thread
From: Kaveh R. Ghazi @ 2004-08-18 15:37 UTC (permalink / raw)
  To: arnaud.desitter; +Cc: coyote, gcc, ghazi

 > >icc -E -dM -O3 -xN qq.c | grep INLINE
 > #define __NO_MATH_INLINES 1
 > #define __NO_STRING_INLINES 1
 > #define __NO_INLINE__ 1


Well, I for one think this is *highly* unfair. :-)
If icc gets to avoid using glibc's macros, then so should we!

So my other question remains: did Jakub's patch get all the cases
where GCC does a better job than glibc?  If the benchmarks are
compiled with the above defines, does gcc's score get better on any of
the tests?  (Sorry, I don't have access to run it myself.)

		Thanks,
		--Kaveh
--
Kaveh R. Ghazi			ghazi@caip.rutgers.edu


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-18  8:37               ` Richard Henderson
@ 2004-08-18 16:25                 ` Scott Robert Ladd
  0 siblings, 0 replies; 43+ messages in thread
From: Scott Robert Ladd @ 2004-08-18 16:25 UTC (permalink / raw)
  To: Richard Henderson; +Cc: gcc

Richard Henderson wrote:
> You should get the same with "-msse2 -mfpmath=sse".
> 
> This is the default for the amd64 abi.

Adding -mfpmath=sse on the Pentium 4 produces an improved, but different 
result (one defect), as compared to compiling for AMD64 (where I see a 
single, less significant flaw).

BTW, -msse2 is implied by -march=pentium4.

-- 
Scott Robert Ladd
Coyote Gulch Productions (http://www.coyotegulch.com)
Software Invention for High-Performance Computing


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-18 15:37     ` Kaveh R. Ghazi
@ 2004-08-18 17:05       ` Scott Robert Ladd
  2004-08-18 18:05         ` Kaveh R. Ghazi
  0 siblings, 1 reply; 43+ messages in thread
From: Scott Robert Ladd @ 2004-08-18 17:05 UTC (permalink / raw)
  To: Kaveh R. Ghazi; +Cc: arnaud.desitter, gcc

Kaveh R. Ghazi wrote:
>  > >icc -E -dM -O3 -xN qq.c | grep INLINE
>  > #define __NO_MATH_INLINES 1
>  > #define __NO_STRING_INLINES 1
>  > #define __NO_INLINE__ 1
> 
> 
> Well, I for one think this is *highly* unfair. :-)
> If icc gets to avoid using glibc's macros, then so should we!

This issue has arisen before. Let's throw all these suggestions together 
(including the suggested changes in "mole"), run some compiles with GCC 
3.5-20040814, and see what we get:

A = gcc -o bench_3.5_O3_p4 -lrt -lm -std=gnu99
         -O3
         -march=pentium4
         *.c

B = gcc -o bench_3.5_O3_p4_fm -lrt -lm -std=gnu99
         -O3
         -march=pentium4
         -ffast-math
         *.c

C = gcc -o bench_3.5_all -lrt -lm -std=gnu99
         -O3
         -march=pentium4
         -ffast-math
         -mfpmath=sse
         -D__NO_MATH_INLINES
         -D__NO_STRING_INLINES
         -D__NO_INLINE__
         *.c

icc = icc -o iccbench -O3 -xN -tpp7 -ipo -lm -lrt *.c

               A      B      C     icc
             -----  -----  -----  -----
      alma:   43.2   22.2   23.7   13.2
      arco:   27.4   27.2   27.2   20.5
       evo:   43.4   42.0   63.8   29.8
       fft:   27.7   27.5   28.4   30.4
      huff:   13.9   13.1   13.2   16.4
       lin:   20.2   19.6   19.8   19.2
      mat1:    7.7    7.5    7.4    7.4
      mole:    8.8    6.7    6.8    2.1
      tree:   26.0   25.7   25.8   28.8
     -----   -----  -----  -----  -----
     total:  218.4  191.7  216.2  167.8

Hmmmm... I don't see where adding the -D__NO_??? options helped GCC -- 
in fact, those options hindered run time severely on the evo test!

Now people know why I don't specify all those #defines when I run my 
tests; I haven't seen a measurable gain in generated code speed from 
their use.

I note that icc wins four benchmarks decisively, including three that
are wholly of my own design and completely original (alma, arco, evo).
Furthermore, the code changes in "mole" helped both gcc and icc.

I'm likely to toss out mat1 and mole soon, replacing them with a wavelet 
transform and a particle system computation.

More food for thought, I hope.

-- 
Scott Robert Ladd
Coyote Gulch Productions (http://www.coyotegulch.com)
Software Invention for High-Performance Computing


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-18 17:05       ` Scott Robert Ladd
@ 2004-08-18 18:05         ` Kaveh R. Ghazi
  2004-08-18 18:05           ` Scott Robert Ladd
  0 siblings, 1 reply; 43+ messages in thread
From: Kaveh R. Ghazi @ 2004-08-18 18:05 UTC (permalink / raw)
  To: coyote; +Cc: arnaud.desitter, gcc, ghazi, jakub

 >               A      B      C     icc
 >             -----  -----  -----  -----
 >      alma:   43.2   22.2   23.7   13.2
 >      arco:   27.4   27.2   27.2   20.5
 >       evo:   43.4   42.0   63.8   29.8
 >       fft:   27.7   27.5   28.4   30.4
 >      huff:   13.9   13.1   13.2   16.4
 >       lin:   20.2   19.6   19.8   19.2
 >      mat1:    7.7    7.5    7.4    7.4
 >      mole:    8.8    6.7    6.8    2.1
 >      tree:   26.0   25.7   25.8   28.8
 >     -----   -----  -----  -----  -----
 >     total:  218.4  191.7  216.2  167.8
 > 
 > 
 > Hmmmm... I don't see where adding the -D__NO_??? options helped GCC --
 > in fact, those options hindered run time severely on the evo test!

Scott - thanks very much for running these tests, it was very informative.
Based on the results I conclude these things.

1.  Your current glibc *has* the patches that Jakub produced, which
    benefit both 3.4 and 3.5.  Whereas in May the -D__NO_* flags
    made an improvement:
    http://gcc.gnu.org/ml/gcc/2004-05/msg00037.html
    now, clearly, they don't help.  This is an improvement; I want us
    to compare apples to apples when competing against ICC.  We're
    closer to testing the compilers fairly now.

2.  In May, GCC 3.4.1 beat ICC on alma; now it's the reverse.  Did GCC
    get worse, or did ICC get better?  See:
    http://gcc.gnu.org/ml/gcc/2004-05/msg00114.html
    Something is fishy here.

3.  The evo test didn't regress in May when the inlines were off;
    something else must be going on here.  Again see:
    http://gcc.gnu.org/ml/gcc/2004-05/msg00037.html


 > Now people know why I don't specify all those #defines when I run my
 > tests; I haven't seen a measurable gain in generated code speed from
 > their use.

Yes you have, you just forgot.  See #1.

		Thanks,
		--Kaveh
--
Kaveh R. Ghazi			ghazi@caip.rutgers.edu


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-18 18:05         ` Kaveh R. Ghazi
@ 2004-08-18 18:05           ` Scott Robert Ladd
  0 siblings, 0 replies; 43+ messages in thread
From: Scott Robert Ladd @ 2004-08-18 18:05 UTC (permalink / raw)
  To: gcc mailing list

Kaveh R. Ghazi wrote:
> Yes you have, you just forgot.  See #1.

Must be old age, or maybe having spent the last two months in a
decidedly non-computer environment. Or the heat here in Florida...

I'll look back at what I posted in late May, and see where the
discrepancies lie.

..Scott

-- 
Scott Robert Ladd
Coyote Gulch Productions (http://www.coyotegulch.com)
Software Invention for High-Performance Computing


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-18  1:45               ` Zack Weinberg
@ 2004-08-18 18:25                 ` Laurent GUERBY
  2004-08-18 18:45                   ` Zack Weinberg
  0 siblings, 1 reply; 43+ messages in thread
From: Laurent GUERBY @ 2004-08-18 18:25 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: Scott Robert Ladd, Joe Buck, gcc

On Wed, 2004-08-18 at 03:34, Zack Weinberg wrote:
> I imagine it helps that the Opteron actually has IEEE-conformant
> floating point hardware.

Do you imply that the Intel Pentium 4 does not share this property? If
so, are there documented examples of the differences?

Laurent


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-18 18:25                 ` Laurent GUERBY
@ 2004-08-18 18:45                   ` Zack Weinberg
  0 siblings, 0 replies; 43+ messages in thread
From: Zack Weinberg @ 2004-08-18 18:45 UTC (permalink / raw)
  To: Laurent GUERBY; +Cc: Scott Robert Ladd, Joe Buck, gcc

Laurent GUERBY <laurent@guerby.net> writes:

> On Wed, 2004-08-18 at 03:34, Zack Weinberg wrote:
>> I imagine it helps that the Opteron actually has IEEE-conformant
>> floating point hardware.
>
> Do you imply that Intel Pentium IV do not share this property? If so
> are there documented examples of the differences?

See, oh, any of the dozens of bug reports closed with "use
-ffloat-store".

zw


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-18  1:34             ` Scott Robert Ladd
                                 ` (2 preceding siblings ...)
  2004-08-18  8:37               ` Richard Henderson
@ 2004-08-18 18:48               ` Toon Moene
  3 siblings, 0 replies; 43+ messages in thread
From: Toon Moene @ 2004-08-18 18:48 UTC (permalink / raw)
  To: Scott Robert Ladd; +Cc: Joe Buck, gcc

Scott Robert Ladd wrote:

> There is one compiler that produces a perfect score on Paranoia: No 
> failures, defects, or flaws.
> 
> GCC 3.5, on my Opteron system.
> 
> Compiled with -march=opteron, GCC 3.5 has a perfect paranoia score. 
> Adding -O3 does *not* introduce any errors on the Opteron; adding 
> -ffast-math results in a *single* minor "flaw".

Last week I was at meeting 169 of J3, the Fortran Standardization body.

During that meeting I had a discussion about the support for
floating-point computations in ICC and GCC.  It turns out that ICC uses
both the SSE ("multimedia") floating-point instructions *and* the old
ix87 ones.

As far as I know, on AMD64, GCC uses only the SSE ones.

That might explain the difference.

-- 
Toon Moene - mailto:toon@moene.indiv.nluug.nl - phoneto: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
Maintainer, GNU Fortran 77: http://gcc.gnu.org/onlinedocs/g77_news.html
A maintainer of GNU Fortran 95: http://gcc.gnu.org/fortran/


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-18 12:11 ` Paolo Bonzini
  2004-08-18 13:09   ` Daniel Berlin
@ 2004-08-18 13:48   ` Uros Bizjak
  1 sibling, 0 replies; 43+ messages in thread
From: Uros Bizjak @ 2004-08-18 13:48 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: gcc

Paolo Bonzini wrote:

> Now one *would* think that GCC was optimized specially for this 
> benchmark! :-)

Great!

However, for the non-cmov case, gcc currently generates one conditional
and one unconditional jump for the code "(a < b) ? a : b":

       ...

       sahf
       ja .L6
       fstp  %st(0)
       jmp .L4
       .p2align 4,,7
.L6:
       fstp  %st(1)
.L4:
       fsincos
       ...

icc in this case generates only one conditional jump:

       ...
       sahf                                                    #63.28
       ja        .L9           # Prob 50%                      #63.28
       fst       %st(1)                                        #63.28
.L9:                                                            #
       fstp      %st(0)                                        #63.28
       fsincos                                                 #63.18

       ...

IMHO, even better code could be generated in this case:

       ...
       sahf
       ja .L6
       fxch
.L6:
       fstp  %st(0)
       fsincos
       ...

Uros.


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
@ 2004-08-18 13:24 Wolfgang Bangerth
  0 siblings, 0 replies; 43+ messages in thread
From: Wolfgang Bangerth @ 2004-08-18 13:24 UTC (permalink / raw)
  To: Scott Robert Ladd; +Cc: Joe Buck, gcc


> A good point. I wrote my first benchmark article for Micro Cornucopia back 
> in 1988, and ran into exactly the problem above. When I wrote a review of
> Fortran compilers for Computer Language in about '90, we had trouble with
> compilers built for specific benchmarks, and vendors who complained loudly
> if we did not use the specific benchmarks they were best at optimizing. And
> today we have chicanery involving drivers for various video cards -- ugh.
>
> To combat this, I use non-standard benchmarks, often reducing larger "real
> world" programs to find something easily understandable yet complex enough
> to stress compilers' code generation.

Organizations like SPEC are clearly aware of these shortcomings of test
suites. The newer-generation programs in their suites are mostly
complete real-world programs rather than manufactured small ones. The
criteria for inclusion in SPEC also call for programs with flat
profiles and no hotspots, so that micro-optimizing a few patterns no
longer buys much.


> Consider that my "mole" benchmark is *not* a well-known benchmark program,
> and I published it *after* the release of the Intel compiler I'm using. I
> don't see how they could have anticipated my benchmark, unless another,
> better-known benchmark includes a similar function.
 
Mole is a pretty well-known piece of code. It may be included in other 
testsuites as well.
 
W.

-------------------------------------------------------------------------
Wolfgang Bangerth              email:            bangerth@ices.utexas.edu
                               www: http://www.ices.utexas.edu/~bangerth/


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-18 12:11 ` Paolo Bonzini
@ 2004-08-18 13:09   ` Daniel Berlin
  2004-08-18 13:48   ` Uros Bizjak
  1 sibling, 0 replies; 43+ messages in thread
From: Daniel Berlin @ 2004-08-18 13:09 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: gcc

>
> Now one *would* think that GCC was optimized specially for this 
> benchmark! :-)
>
> I don't know why MIN_EXPR and MAX_EXPR were left out of GIMPLE given 
> that ABS_EXPR is there.

I thought we had this discussion a while ago, and the result was "MIN
and MAX are GIMPLE, but we might as well keep the early simplification"
for some reason.
In fact, I know we had this discussion, and we decided that MIN_EXPR
and MAX_EXPR are GIMPLE, because the linear loop transform pass uses
them. It just didn't require any patches to make this true (because the
predicates test class '2' and whatnot).

--Dan


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-18 10:25 Uros Bizjak
@ 2004-08-18 12:11 ` Paolo Bonzini
  2004-08-18 13:09   ` Daniel Berlin
  2004-08-18 13:48   ` Uros Bizjak
  0 siblings, 2 replies; 43+ messages in thread
From: Paolo Bonzini @ 2004-08-18 12:11 UTC (permalink / raw)
  To: gcc

[-- Attachment #1: Type: text/plain, Size: 1523 bytes --]

> It looks like gcc does not detect that the parameters to sin() and cos() are 
> actually the same.

Yes, because they are converted to jumps too early.  You could paper 
over it in this case by disabling the call to gimplify_minimax_expr in 
gimplify.c.  The results of the attached patch, with -O2 -ffast-math, 
are as follows:

prova.c:
	#define __NO_MATH_INLINES
	#include <math.h>
	#define PI2 0x1.922p0

	double dv (double x)
	{
	  return 2.0 * sin (((x < PI2) ? x : PI2))
		* cos (((x < PI2) ? x : PI2));
	}

prova.c.t59.optimized:
	dv (x)
	{
	  double T.1;
	
	<bb 0>:
	  T.1 = MIN_EXPR <x, 1.57080078125e+0>;
	  return sin (T.1) * cos (T.1) * 2.0e+0;
	}

prova.s:
	dv:
		pushl	%ebp
		flds	.LC0
		movl	%esp, %ebp
		fldl	8(%ebp)
		fcom	%st(1)
		fnstsw	%ax
		testb	$69, %ah
		jne	.L4
		fstp	%st(0)
		jmp	.L2
	.L4:
		fstp	%st(1)
	.L2:
		fsincos
		leave
		fmulp	%st, %st(1)
		fadd	%st(0), %st
		ret

Now look at prova.s with -march=i686 -mtune=i686 -fomit-frame-pointer:

	dv:
		fldl	4(%esp)
		flds	.LC0
		fcomi	%st(1), %st
		fcmovnb	%st(1), %st
		fstp	%st(1)
		fsincos
		fmulp	%st, %st(1)
		fadd	%st(0), %st
		ret

Now one *would* think that GCC was optimized specially for this 
benchmark! :-)

I don't know why MIN_EXPR and MAX_EXPR were left out of GIMPLE given 
that ABS_EXPR is there.  But these jump threadings seem to be difficult, 
so maybe it is worth including in GIMPLE a COND_EXPR with cmov 
semantics.  Even when if statements are used in the source code, phiopt 
could synthesize it pretty easily.
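For illustration, here is a minimal C function (hypothetical, not taken from the thread) showing the source-level shape such a cmov-style COND_EXPR would cover: an if/else whose only effect is selecting between two values.

```c
/* Hypothetical example: an if/else that merely selects a value.
   This is the pattern phiopt could collapse into a COND_EXPR with
   cmov semantics (compare the fcmovnb in the -march=i686 assembly). */
double clamp_to(double x, double limit)
{
    double t;
    if (x < limit)
        t = x;          /* keep x as-is */
    else
        t = limit;      /* otherwise clamp to the limit */
    return t;           /* a single select, no control flow needed */
}
```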

Paolo

[-- Attachment #2: no-gimplify-minimax.patch --]
[-- Type: text/plain, Size: 483 bytes --]

Index: gimplify.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/gimplify.c,v
retrieving revision 2.62
diff -u -r2.62 gimplify.c
--- gimplify.c	12 Aug 2004 03:54:11 -0000	2.62
+++ gimplify.c	18 Aug 2004 11:42:57 -0000
@@ -3876,7 +3876,8 @@
 
 	case MIN_EXPR:
 	case MAX_EXPR:
-	  ret = gimplify_minimax_expr (expr_p, pre_p, post_p);
+	  if (0)
+	    ret = gimplify_minimax_expr (expr_p, pre_p, post_p);
 	  break;
 
 	case LABEL_DECL:


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
@ 2004-08-18 10:25 Uros Bizjak
  2004-08-18 12:11 ` Paolo Bonzini
  0 siblings, 1 reply; 43+ messages in thread
From: Uros Bizjak @ 2004-08-18 10:25 UTC (permalink / raw)
  To: gcc, dberlin


>Without -xN, it does:
># parameter 1: 8 + %ebx
>..B3.1:                         # Preds ..B3.0
>        pushl     %ebx                                          #62.1
>        movl      %esp, %ebx                                    #62.1
>        andl      $-8, %esp                                     #62.1
>        fldl      8(%ebx)                                       #61.15
>        fldl      PI2                                           #63.28
>        fcom      %st(1)                                        #63.28
>        fnstsw    %ax                                           #63.28
>        sahf                                                    #63.28
>        ja        .L9           # Prob 50%                      #63.28
>        fst       %st(1)                                        #63.28
>.L9:                                                            #
>        fstp      %st(0)                                        #63.28
>        fsincos                                                 #63.18
>        fxch      %st(1)                                        #63.18
>        fadd      %st(0), %st                                   #63.18
>        fmulp     %st, %st(1)                                   #63.47
>        movl      %ebx, %esp                                    #63.47
>        popl      %ebx                                          #63.47
>        ret                                                     #63.47
>        .align    4,0x90
>                                # LOE
>

ICC transforms the dv() function from:
{
  return 2.0 * sin (((x < PI2) ? x : PI2)) * cos (((x < PI2) ? x : PI2));
}

into:
{
  double tmp = (x < PI2) ? x : PI2;
  return 2.0 * sin (tmp) * cos (tmp);
}

By introducing a temporary variable, gcc is able to produce even better 
code for the second line:
.L4:
        fsincos
        fmulp %st, %st(1)
        fadd  %st(0), %st
        ret

... and a somewhat unoptimized first line (jumps!):
dv:
        fldl  .LC0
        fldl  4(%esp)
        fld %st(1)
        fxch  %st(1)
        fcom  %st(2)
        fnstsw %ax
        fstp  %st(2)
        sahf
        ja .L6
        fstp  %st(0)
        jmp .L4
        .p2align 4,,7
.L6:
        fstp  %st(1)
.L4:

It looks like gcc does not detect that the parameters to sin() and cos() are 
actually the same.

Uros.


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-18  1:46     ` Daniel Berlin
@ 2004-08-18  2:10       ` Daniel Berlin
  0 siblings, 0 replies; 43+ messages in thread
From: Daniel Berlin @ 2004-08-18  2:10 UTC (permalink / raw)
  To: Daniel Berlin; +Cc: Joe Buck, gcc, Robert Dewar, Richard Kenner

BTW, here are the timings for the respective versions:
The first uses SSE math + an SSE sincos
[dberlin@dberlin dberlin]$ icc mole.c -O3 -xN  -ipo
IPO: using IR for /tmp/iccQqtM99.o
IPO: performing single-file optimizations
mole.c(132) : (col. 5) remark: LOOP WAS VECTORIZED.
[dberlin@dberlin dberlin]$ time ./a.out

real    0m6.010s
user    0m5.950s
sys     0m0.020s


Making it stop using the SSE sincos and instead calling out to libm sin 
+ cos gives us:

[dberlin@dberlin dberlin]$ icc mole.c -O3 -xN  -ipo -nolib_inline
IPO: using IR for /tmp/iccb0fk4w.o
IPO: performing single-file optimizations
mole.c(132) : (col. 5) remark: LOOP WAS VECTORIZED.
[dberlin@dberlin dberlin]$ time ./a.out

real    0m20.016s
user    0m19.930s
sys     0m0.020s

Telling it not to use the vectorized math library, but still optimizing 
for P4, gives us this version, which uses the x87 FPU only:

[dberlin@dberlin dberlin]$ icc mole.c -O3   -ipo
IPO: using IR for /tmp/icckW7Va1.o
IPO: performing single-file optimizations
[dberlin@dberlin dberlin]$ time ./a.out

real    0m11.084s
user    0m10.210s
sys     0m0.040s
[dberlin@dberlin dberlin]$


So the vectorized intrinsic alone accounts for at least half of the performance.


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-18  0:21   ` Joe Buck
  2004-08-18  1:32     ` Daniel Berlin
@ 2004-08-18  1:46     ` Daniel Berlin
  2004-08-18  2:10       ` Daniel Berlin
  1 sibling, 1 reply; 43+ messages in thread
From: Daniel Berlin @ 2004-08-18  1:46 UTC (permalink / raw)
  To: Joe Buck; +Cc: gcc, Robert Dewar, Richard Kenner

> Are we sure that ICC does 2*sin(x)*cos(x) -> sin(2*x)?  It seems to me
> that just being able to produce x < PI2 ? 2*sin(x)*cos(x) : 0.0 gives
> most of the available speedup.  Such a transformation should be achievable
> and preserves the result.
>
>
It doesn't even do that.  This is ICC 8.0.

..B3.1:                         # Preds ..B3.0
         pushl     %ebx                                          #62.1
         movl      %esp, %ebx                                    #62.1
         andl      $-8, %esp                                     #62.1
         subl      $8, %esp                                      #62.1
         movsd     8(%ebx), %xmm0                                #61.15
         minsd     PI2, %xmm0                                    #63.28
         call      __libm_sse2_sincos                            #63.18
                                 # LOE ebp esi edi xmm0 xmm1
..B3.4:                         # Preds ..B3.1
         addsd     %xmm0, %xmm0                                  #63.18
         mulsd     %xmm1, %xmm0                                  #63.47
         movsd     %xmm0, (%esp)                                 #63.47
         fldl      (%esp)                                        #63.47
         movl      %ebx, %esp                                    #63.47
         popl      %ebx                                          #63.47
         ret                                                     #63.47
         .align    4,0x90

Even before it transformed to this builtin, it was just doing a normal 
(sin + sin) * cos, at all levels of its transformations.
About the only interesting thing it does to the code is to use a = sin; 
b = cos; return (a + a) * b instead of return 2 * sin * cos (i.e., it 
replaces a floating-point multiply by two with a floating-point add).

Note that it only uses this builtin if you specify -xN.
If you remove the -xN, the runtime doubles from 5 seconds to 10 seconds 
on my machine.

Without -xN, it does:
# parameter 1: 8 + %ebx
..B3.1:                         # Preds ..B3.0
         pushl     %ebx                                          #62.1
         movl      %esp, %ebx                                    #62.1
         andl      $-8, %esp                                     #62.1
         fldl      8(%ebx)                                       #61.15
         fldl      PI2                                           #63.28
         fcom      %st(1)                                        #63.28
         fnstsw    %ax                                           #63.28
         sahf                                                    #63.28
         ja        .L9           # Prob 50%                      #63.28
         fst       %st(1)                                        #63.28
.L9:                                                            #
         fstp      %st(0)                                        #63.28
         fsincos                                                 #63.18
         fxch      %st(1)                                        #63.18
         fadd      %st(0), %st                                   #63.18
         fmulp     %st, %st(1)                                   #63.47
         movl      %ebx, %esp                                    #63.47
         popl      %ebx                                          #63.47
         ret                                                     #63.47
         .align    4,0x90
                                 # LOE

If you use -xN -nolib_inline, to stop it from generating the intrinsic 
call, it does:

dv:
# parameter 1: 8 + %ebx
..B3.1:                         # Preds ..B3.0
         pushl     %ebx                                          #62.1
         movl      %esp, %ebx                                    #62.1
         andl      $-64, %esp                                    #62.1
         subl      $64, %esp                                     #62.1
         movsd     PI2, %xmm0                                    #63.28
         movsd     8(%ebx), %xmm1                                #63.28
         minsd     %xmm0, %xmm1                                  #63.28
         movsd     %xmm1, (%esp)                                 #63.28
         call      sin                                           #63.18
                                 # LOE ebp esi edi f1
..B3.6:                         # Preds ..B3.1
         fstpl     8(%esp)                                       #63.18
         movsd     8(%esp), %xmm0                                #63.18
         movsd     %xmm0, 24(%esp)                               #63.18
                                 # LOE ebp esi edi
..B3.2:                         # Preds ..B3.6
         movsd     PI2, %xmm0                                    #63.57
         movsd     8(%ebx), %xmm1                                #63.57
         minsd     %xmm0, %xmm1                                  #63.57
         movsd     %xmm1, (%esp)                                 #63.57
         call      cos                                           #63.47
                                 # LOE ebp esi edi f1
..B3.7:                         # Preds ..B3.2
         fstpl     16(%esp)                                      #63.47
         movsd     16(%esp), %xmm0                               #63.47
                                 # LOE ebp esi edi xmm0
..B3.3:                         # Preds ..B3.7
         movsd     24(%esp), %xmm1                               #63.18
         addsd     %xmm1, %xmm1                                  #63.18
         mulsd     %xmm0, %xmm1                                  #63.47
         movsd     %xmm1, 16(%esp)                               #63.47
         fldl      16(%esp)                                      #63.47
         movl      %ebx, %esp                                    #63.47
         popl      %ebx                                          #63.47
         ret                                                     #63.47


Again, no short-circuiting.

--Dan




* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-18  0:21   ` Joe Buck
@ 2004-08-18  1:32     ` Daniel Berlin
  2004-08-18  1:46     ` Daniel Berlin
  1 sibling, 0 replies; 43+ messages in thread
From: Daniel Berlin @ 2004-08-18  1:32 UTC (permalink / raw)
  To: Joe Buck; +Cc: gcc, Robert Dewar, Richard Kenner


On Aug 17, 2004, at 8:13 PM, Joe Buck wrote:

> On Tue, Aug 17, 2004 at 07:53:40PM -0400, Robert Dewar wrote:
>> Richard Kenner wrote:
>>
>>>     The trig identity that 2*sin(x)*cos(x) is sin(2*x) is more iffy; we'd
>>>     waste time trying to apply such transformations to every tree, and
>>>     it's pretty much exactly what caused such controversy when everyone
>>>     did it to Whetstone.
>>>
>>> Although I agree with you in general about tuning to benchmarks, I disagree
>>> with the above.  When you consider some of the even more obscure cases in
>>> both tree and RTL folding, that doesn't seem so peculiar anymore.
>>
>> It seems completely wrong to me. I guess people who use -ffast-math
>> tolerate these kind of non-optimizations (an optimization to me is
>> a transformation that is meaning preserving, not a substitution of
>> some other computation that gives a different result!)
>
> Are we sure that ICC does 2*sin(x)*cos(x) -> sin(2*x)?  It seems to me
> that just being able to produce x < PI2 ? 2*sin(x)*cos(x) : 0.0 gives
> most of the available speedup.  Such a transformation should be achievable
> and preserves the result.
>
Neither is what ICC does, actually, at least according to its debug 
dumps.


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-18  0:13 ` Robert Dewar
@ 2004-08-18  0:21   ` Joe Buck
  2004-08-18  1:32     ` Daniel Berlin
  2004-08-18  1:46     ` Daniel Berlin
  0 siblings, 2 replies; 43+ messages in thread
From: Joe Buck @ 2004-08-18  0:21 UTC (permalink / raw)
  To: Robert Dewar; +Cc: Richard Kenner, gcc

On Tue, Aug 17, 2004 at 07:53:40PM -0400, Robert Dewar wrote:
> Richard Kenner wrote:
> 
> >     The trig identity that 2*sin(x)*cos(x) is sin(2*x) is more iffy; we'd
> >     waste time trying to apply such transformations to every tree, and
> >     it's pretty much exactly what caused such controversy when everyone
> >     did it to Whetstone.
> > 
> > Although I agree with you in general about tuning to benchmarks, I disagree
> > with the above.  When you consider some of the even more obscure cases in
> > both tree and RTL folding, that doesn't seem so peculiar anymore.
> 
> It seems completely wrong to me. I guess people who use -ffast-math
> tolerate these kind of non-optimizations (an optimization to me is
> a transformation that is meaning preserving, not a substitution of
> some other computation that gives a different result!)

Are we sure that ICC does 2*sin(x)*cos(x) -> sin(2*x)?  It seems to me
that just being able to produce x < PI2 ? 2*sin(x)*cos(x) : 0.0 gives
most of the available speedup.  Such a transformation should be achievable
and preserves the result.


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
  2004-08-17 22:44 Richard Kenner
@ 2004-08-18  0:13 ` Robert Dewar
  2004-08-18  0:21   ` Joe Buck
  0 siblings, 1 reply; 43+ messages in thread
From: Robert Dewar @ 2004-08-18  0:13 UTC (permalink / raw)
  To: Richard Kenner; +Cc: Joe.Buck, gcc

Richard Kenner wrote:

>     The trig identity that 2*sin(x)*cos(x) is sin(2*x) is more iffy; we'd
>     waste time trying to apply such transformations to every tree, and
>     it's pretty much exactly what caused such controversy when everyone
>     did it to Whetstone.
> 
> Although I agree with you in general about tuning to benchmarks, I disagree
> with the above.  When you consider some of the even more obscure cases in
> both tree and RTL folding, that doesn't seem so peculiar anymore.

It seems completely wrong to me. I guess people who use -ffast-math
tolerate these kinds of non-optimizations (an optimization, to me, is
a transformation that is meaning-preserving, not a substitution of
some other computation that gives a different result!)


* Re: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
@ 2004-08-17 22:44 Richard Kenner
  2004-08-18  0:13 ` Robert Dewar
  0 siblings, 1 reply; 43+ messages in thread
From: Richard Kenner @ 2004-08-17 22:44 UTC (permalink / raw)
  To: Joe.Buck; +Cc: gcc

    The trig identity that 2*sin(x)*cos(x) is sin(2*x) is more iffy; we'd
    waste time trying to apply such transformations to every tree, and
    it's pretty much exactly what caused such controversy when everyone
    did it to Whetstone.

Although I agree with you in general about tuning to benchmarks, I disagree
with the above.  When you consider some of the even more obscure cases in
both tree and RTL folding, that doesn't seem so peculiar anymore.


end of thread, other threads:[~2004-08-18 18:40 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-08-15 15:35 GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004 Scott Robert Ladd
2004-08-15 20:48 ` Giovanni Bajo
2004-08-16 15:51   ` Scott Robert Ladd
2004-08-17 19:28     ` Marcel Cox
2004-08-17 21:26       ` Scott Robert Ladd
2004-08-17 22:21         ` Joe Buck
2004-08-17 22:39           ` Mike Stump
2004-08-17 22:54           ` Scott Robert Ladd
2004-08-17 23:53         ` Robert Dewar
2004-08-18  0:28           ` Joe Buck
2004-08-18  0:41             ` Robert Dewar
2004-08-18  0:44             ` Robert Dewar
2004-08-18  0:58               ` Joe Buck
2004-08-18  1:04                 ` Robert Dewar
2004-08-18  1:18                   ` Joe Buck
2004-08-18  1:27                     ` Robert Dewar
2004-08-18  1:34             ` Scott Robert Ladd
2004-08-18  1:45               ` Zack Weinberg
2004-08-18 18:25                 ` Laurent GUERBY
2004-08-18 18:45                   ` Zack Weinberg
2004-08-18  2:04               ` Scott Robert Ladd
2004-08-18  8:37               ` Richard Henderson
2004-08-18 16:25                 ` Scott Robert Ladd
2004-08-18 18:48               ` Toon Moene
2004-08-18 11:14         ` Marcel Cox
2004-08-17  5:14 ` Natros
2004-08-18  1:35 ` Kaveh R. Ghazi
2004-08-18 13:14   ` Arnaud Desitter
2004-08-18 15:37     ` Kaveh R. Ghazi
2004-08-18 17:05       ` Scott Robert Ladd
2004-08-18 18:05         ` Kaveh R. Ghazi
2004-08-18 18:05           ` Scott Robert Ladd
2004-08-17 22:44 Richard Kenner
2004-08-18  0:13 ` Robert Dewar
2004-08-18  0:21   ` Joe Buck
2004-08-18  1:32     ` Daniel Berlin
2004-08-18  1:46     ` Daniel Berlin
2004-08-18  2:10       ` Daniel Berlin
2004-08-18 10:25 Uros Bizjak
2004-08-18 12:11 ` Paolo Bonzini
2004-08-18 13:09   ` Daniel Berlin
2004-08-18 13:48   ` Uros Bizjak
2004-08-18 13:24 Wolfgang Bangerth
