[Bug c/46186] New: Clang creates code running 1600 times faster than gcc's

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's
@ 2010-10-26 14:27 jak@jak-linux.org
  2010-10-26 14:30 ` [Bug c/46186] " jak@jak-linux.org
                   ` (26 more replies)
  0 siblings, 27 replies; 28+ messages in thread
From: jak@jak-linux.org @ 2010-10-26 14:27 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

           Summary: Clang creates code running 1600 times faster than
                    gcc's
           Product: gcc
           Version: 4.5.1
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: c
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: jak@jak-linux.org


Created attachment 22161
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=22161
C file

The attached code compiles into an executable that takes 1.6 seconds to run,
when compiled with clang, it takes 0.001 seconds (both with -O2 as the only
compiler option).


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug c/46186] Clang creates code running 1600 times faster than gcc's
  2010-10-26 14:27 [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's jak@jak-linux.org
@ 2010-10-26 14:30 ` jak@jak-linux.org
  2010-10-26 14:31 ` paolo.carlini at oracle dot com
                   ` (25 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: jak@jak-linux.org @ 2010-10-26 14:30 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #1 from Julian Andres Klode <jak@jak-linux.org> 2010-10-26 14:30:24 UTC ---
Created attachment 22162
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=22162
Clang's assember

Attaching the assembler output from clang, it should help understand which
optimizations clang does (and improve gcc to do them as well).


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug c/46186] Clang creates code running 1600 times faster than gcc's
  2010-10-26 14:27 [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's jak@jak-linux.org
  2010-10-26 14:30 ` [Bug c/46186] " jak@jak-linux.org
@ 2010-10-26 14:31 ` paolo.carlini at oracle dot com
  2010-10-26 14:33 ` jak@jak-linux.org
                   ` (24 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: paolo.carlini at oracle dot com @ 2010-10-26 14:31 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

Paolo Carlini <paolo.carlini at oracle dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |paolo.carlini at oracle dot
                   |                            |com

--- Comment #2 from Paolo Carlini <paolo.carlini at oracle dot com> 2010-10-26 14:31:47 UTC ---
;)


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug c/46186] Clang creates code running 1600 times faster than gcc's
  2010-10-26 14:27 [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's jak@jak-linux.org
  2010-10-26 14:30 ` [Bug c/46186] " jak@jak-linux.org
  2010-10-26 14:31 ` paolo.carlini at oracle dot com
@ 2010-10-26 14:33 ` jak@jak-linux.org
  2010-10-26 14:47 ` redi at gcc dot gnu.org
                   ` (23 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: jak@jak-linux.org @ 2010-10-26 14:33 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #3 from Julian Andres Klode <jak@jak-linux.org> 2010-10-26 14:32:27 UTC ---
System information:

Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.4.5-5'
--with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs
--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
--program-suffix=-4.4 --enable-shared --enable-multiarch
--enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib
--without-included-gettext --enable-threads=posix
--with-gxx-include-dir=/usr/include/c++/4.4 --libdir=/usr/lib --enable-nls
--enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
--with-arch-32=i586 --with-tune=generic --enable-checking=release
--build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.4.5 (Debian 4.4.5-5)


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug c/46186] Clang creates code running 1600 times faster than gcc's
  2010-10-26 14:27 [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's jak@jak-linux.org
                   ` (2 preceding siblings ...)
  2010-10-26 14:33 ` jak@jak-linux.org
@ 2010-10-26 14:47 ` redi at gcc dot gnu.org
  2010-10-26 14:53 ` jak@jak-linux.org
                   ` (22 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: redi at gcc dot gnu.org @ 2010-10-26 14:47 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #4 from Jonathan Wakely <redi at gcc dot gnu.org> 2010-10-26 14:47:12 UTC ---
GCC's output is significantly faster at -O3 or without the noinline attribute


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug c/46186] Clang creates code running 1600 times faster than gcc's
  2010-10-26 14:27 [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's jak@jak-linux.org
                   ` (3 preceding siblings ...)
  2010-10-26 14:47 ` redi at gcc dot gnu.org
@ 2010-10-26 14:53 ` jak@jak-linux.org
  2010-10-26 14:59 ` dominiq at lps dot ens.fr
                   ` (21 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: jak@jak-linux.org @ 2010-10-26 14:53 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #5 from Julian Andres Klode <jak@jak-linux.org> 2010-10-26 14:53:24 UTC ---
(In reply to comment #4)
> GCC's output is significantly faster at -O3 or without the noinline attribute

I just tested and at -O3, gcc-4.4 creates slow code and gcc-4.5 fast code. At
-O2, both are equally fast. The clang 1.1 code at -O2 is as fast as GCC 4.5's
code at -O3.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug c/46186] Clang creates code running 1600 times faster than gcc's
  2010-10-26 14:27 [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's jak@jak-linux.org
                   ` (4 preceding siblings ...)
  2010-10-26 14:53 ` jak@jak-linux.org
@ 2010-10-26 14:59 ` dominiq at lps dot ens.fr
  2010-10-26 15:00 ` jak@jak-linux.org
                   ` (20 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: dominiq at lps dot ens.fr @ 2010-10-26 14:59 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #6 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2010-10-26 14:59:18 UTC ---
You get this kind of speedup if the compiler knows that the result of the loop
is

sum=(b*(b-1)-a*(a-1))/2

In which case the timing is meaningless (it is 0.000s on my laptop), so is the
ratio with the execution of the loop.

The basic question is: how much the user's ignorance should be repaired by the
optimizer? (A colleague of mine told me that he once audited a CFD code and
found that \int_a^b dx/x was evaluated numerically instead of using
log(b)-log(a)!-)


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug c/46186] Clang creates code running 1600 times faster than gcc's
  2010-10-26 14:27 [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's jak@jak-linux.org
                   ` (5 preceding siblings ...)
  2010-10-26 14:59 ` dominiq at lps dot ens.fr
@ 2010-10-26 15:00 ` jak@jak-linux.org
  2010-10-26 15:26 ` jak@jak-linux.org
                   ` (19 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: jak@jak-linux.org @ 2010-10-26 15:00 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #7 from Julian Andres Klode <jak@jak-linux.org> 2010-10-26 15:00:37 UTC ---
(In reply to comment #5)
> (In reply to comment #4)
> > GCC's output is significantly faster at -O3 or without the noinline attribute
> 
> I just tested and at -O3, gcc-4.4 creates slow code and gcc-4.5 fast code. At
> -O2, both are equally fast. The clang 1.1 code at -O2 is as fast as GCC 4.5's
> code at -O3.

Using gcc 4.5 with -O3 may work for the C code, but it does not optimize the
C++ code posted at http://lwn.net/Articles/411776/:

    g++-4.5 -O3: real 0m1.608s
    clang++ -O2: real 0m0.003s
    clang++ -Os: real 0m0.003s

But I guess the C++ problem might be a different one.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug c/46186] Clang creates code running 1600 times faster than gcc's
  2010-10-26 14:27 [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's jak@jak-linux.org
                   ` (6 preceding siblings ...)
  2010-10-26 15:00 ` jak@jak-linux.org
@ 2010-10-26 15:26 ` jak@jak-linux.org
  2010-10-26 15:29 ` redi at gcc dot gnu.org
                   ` (18 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: jak@jak-linux.org @ 2010-10-26 15:26 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #8 from Julian Andres Klode <jak@jak-linux.org> 2010-10-26 15:25:56 UTC ---
(In reply to comment #6)
> You get this kind of speedup if the compiler knows that the result of the loop
> is
> 
> sum=(b*(b-1)-a*(a-1))/2
> 
> In which case the timing is meaningless (it is 0.000s on my laptop), so is the
> ratio with the execution of the loop.
> 
> The basic question is: how much the user's ignorance should be repaired by the
> optimizer? (A colleague of mine told me that he once audited a CFD code and
> found that \int_a^b dx/x was evaluated numerically instead of using
> log(b)-log(a)!-)

Since the optimization seems to be mostly there in -O3, it's just a matter of
enabling it in -O2. 

I just found out that it does not optimize if you call f() via a global
function pointer, it still takes 1.6 seconds despite being compiled at -O3,
whereas clang can optimize it to 0.001s.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug c/46186] Clang creates code running 1600 times faster than gcc's
  2010-10-26 14:27 [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's jak@jak-linux.org
                   ` (7 preceding siblings ...)
  2010-10-26 15:26 ` jak@jak-linux.org
@ 2010-10-26 15:29 ` redi at gcc dot gnu.org
  2010-10-26 15:37 ` jakub at gcc dot gnu.org
                   ` (17 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: redi at gcc dot gnu.org @ 2010-10-26 15:29 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #9 from Jonathan Wakely <redi at gcc dot gnu.org> 2010-10-26 15:28:51 UTC ---
(In reply to comment #8)
> 
> Since the optimization seems to be mostly there in -O3, it's just a matter of
> enabling it in -O2. 

Or if you want all optimisations, it's just a matter of using -O3

Expecting all optimisations to be done below the maximum optimisation level is
... unexpected.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug c/46186] Clang creates code running 1600 times faster than gcc's
  2010-10-26 14:27 [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's jak@jak-linux.org
                   ` (8 preceding siblings ...)
  2010-10-26 15:29 ` redi at gcc dot gnu.org
@ 2010-10-26 15:37 ` jakub at gcc dot gnu.org
  2010-10-26 15:43 ` paolo.carlini at oracle dot com
                   ` (16 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: jakub at gcc dot gnu.org @ 2010-10-26 15:37 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #10 from Jakub Jelinek <jakub at gcc dot gnu.org> 2010-10-26 15:37:04 UTC ---
-O2 -fipa-cp-clone should be in theory enough, then f would be normally cloned,
assuming gcc thinks it is from a hot spot.  But this is not a real-world
testcase, the call from main where it happens just once for the process is not
considered a hot spot.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug c/46186] Clang creates code running 1600 times faster than gcc's
  2010-10-26 14:27 [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's jak@jak-linux.org
                   ` (9 preceding siblings ...)
  2010-10-26 15:37 ` jakub at gcc dot gnu.org
@ 2010-10-26 15:43 ` paolo.carlini at oracle dot com
       [not found] ` <4cc6e60b.2811970a.71aa.016aSMTPIN_ADDED@mx.google.com>
                   ` (15 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: paolo.carlini at oracle dot com @ 2010-10-26 15:43 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #11 from Paolo Carlini <paolo.carlini at oracle dot com> 2010-10-26 15:42:58 UTC ---
Can we please stop talking about nano and giga numbers like kids? If an
optimization like complete loop unrolling is involved of course very small or
large numbers can be involved, doesn't really contribute anything to the
problem talking about the exact figure.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Bug c/46186] Clang creates code running 1600 times faster than gcc's
       [not found] ` <4cc6e60b.2811970a.71aa.016aSMTPIN_ADDED@mx.google.com>
@ 2010-10-26 15:56   ` Andrew Pinski
  0 siblings, 0 replies; 28+ messages in thread
From: Andrew Pinski @ 2010-10-26 15:56 UTC (permalink / raw)
  To: jak@jak-linux.org; +Cc: gcc-bugs



On Oct 26, 2010, at 7:30 AM, "jak@jak-linux.org" <gcc-bugzilla@gcc.gnu.org 
 > wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186
>
> --- Comment #1 from Julian Andres Klode <jak@jak-linux.org>  
> 2010-10-26 14:30:24 UTC ---
> Created attachment 22162
>  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=22162
> Clang's assember

This multiplication transformation is incorrect if the loop wraps  
(unsigned always wraps; never overflows). Gcc is correct in its speed.  
Though -O3 is faster but not because of multiplication but rather  
constant propatagtion. So this bug is invalid and some one should  
report a bug to llvm folks about this.

>
> Attaching the assembler output from clang, it should help understand  
> which
> optimizations clang does (and improve gcc to do them as well).


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug c/46186] Clang creates code running 1600 times faster than gcc's
  2010-10-26 14:27 [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's jak@jak-linux.org
                   ` (11 preceding siblings ...)
       [not found] ` <4cc6e60b.2811970a.71aa.016aSMTPIN_ADDED@mx.google.com>
@ 2010-10-26 15:56 ` pinskia at gmail dot com
  2010-10-26 16:36 ` dominiq at lps dot ens.fr
                   ` (13 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: pinskia at gmail dot com @ 2010-10-26 15:56 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #12 from pinskia at gmail dot com <pinskia at gmail dot com> 2010-10-26 15:56:20 UTC ---
On Oct 26, 2010, at 7:30 AM, "jak@jak-linux.org" <gcc-bugzilla@gcc.gnu.org 
 > wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186
>
> --- Comment #1 from Julian Andres Klode <jak@jak-linux.org>  
> 2010-10-26 14:30:24 UTC ---
> Created attachment 22162
>  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=22162
> Clang's assember

This multiplication transformation is incorrect if the loop wraps  
(unsigned always wraps; never overflows). Gcc is correct in its speed.  
Though -O3 is faster but not because of multiplication but rather  
constant propatagtion. So this bug is invalid and some one should  
report a bug to llvm folks about this.

>
> Attaching the assembler output from clang, it should help understand  
> which
> optimizations clang does (and improve gcc to do them as well).


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug c/46186] Clang creates code running 1600 times faster than gcc's
  2010-10-26 14:27 [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's jak@jak-linux.org
                   ` (12 preceding siblings ...)
  2010-10-26 15:56 ` pinskia at gmail dot com
@ 2010-10-26 16:36 ` dominiq at lps dot ens.fr
  2010-10-26 16:59 ` jakub at gcc dot gnu.org
                   ` (12 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: dominiq at lps dot ens.fr @ 2010-10-26 16:36 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #13 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2010-10-26 16:36:05 UTC ---
> This multiplication transformation is incorrect if the loop wraps  
> (unsigned always wraps; never overflows).

I think this is wrong: wrapping is nothing but a modulo 2^n operation (n=64
here) which "works" for additions and multiplications, so if there is wrapping,
the result is sum=(b*(b-1)-a*(a-1))/2 modulo 2^n, i.e. correctly wrapped.

On my Core2duo 2.53Ghz with -Ofast the run time is ~1.2s for elementary 2*10^9
loops or .6ns/loop or ~1.5 clock cycle per loop. For a perfect vectorization
and no loop overhead, I would expect a minimum of 0.5 clock cycle per loop. If
you get anything below this number, it means that the loop

    for (; a < b; a++)
        sum += a;

is replaced with sum=(b*(b-1)-a*(a-1))/2 (you can confirm it by checking that
the timing behaves as O(len) or not). Apparently clang does this (valid)
transformation while gcc don't for any options I have tried.

Note that If I write such a loop, it is because I am interested by the timing
of the loop, not by the result I know for more than 40 years!

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug c/46186] Clang creates code running 1600 times faster than gcc's
  2010-10-26 14:27 [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's jak@jak-linux.org
                   ` (13 preceding siblings ...)
  2010-10-26 16:36 ` dominiq at lps dot ens.fr
@ 2010-10-26 16:59 ` jakub at gcc dot gnu.org
  2010-10-26 17:15 ` dominiq at lps dot ens.fr
                   ` (11 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: jakub at gcc dot gnu.org @ 2010-10-26 16:59 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2010.10.26 16:59:00
                 CC|                            |spop at gcc dot gnu.org
     Ever Confirmed|0                           |1

--- Comment #14 from Jakub Jelinek <jakub at gcc dot gnu.org> 2010-10-26 16:59:00 UTC ---
For sum += 2 or sum += b sccp handles this, so I wonder whether it couldn't
handle even the sum += a case.  Sebastian?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug c/46186] Clang creates code running 1600 times faster than gcc's
  2010-10-26 14:27 [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's jak@jak-linux.org
                   ` (14 preceding siblings ...)
  2010-10-26 16:59 ` jakub at gcc dot gnu.org
@ 2010-10-26 17:15 ` dominiq at lps dot ens.fr
  2010-10-26 18:44 ` jakub at gcc dot gnu.org
                   ` (10 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: dominiq at lps dot ens.fr @ 2010-10-26 17:15 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #15 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2010-10-26 17:15:31 UTC ---
> For sum += 2 or sum += b sccp handles this, so I wonder whether it couldn't
> handle even the sum += a case.

2 and b are constants while a is not. For constants you have to know that the
sum is cst*nloop, while if a is incremented you have to know that it is related
to nloop*(nloop+1)/2 (and if you use sum += a*a, nloop*(nloop+1)*(2*nloop+1)/6,
if sum += a*a*a, nloop^2*(nloop+1)^2/4 and so on).

However is it worth the work?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug c/46186] Clang creates code running 1600 times faster than gcc's
  2010-10-26 14:27 [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's jak@jak-linux.org
                   ` (15 preceding siblings ...)
  2010-10-26 17:15 ` dominiq at lps dot ens.fr
@ 2010-10-26 18:44 ` jakub at gcc dot gnu.org
  2010-10-26 18:54 ` dominiq at lps dot ens.fr
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: jakub at gcc dot gnu.org @ 2010-10-26 18:44 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #16 from Jakub Jelinek <jakub at gcc dot gnu.org> 2010-10-26 18:43:40 UTC ---
chrec_apply is called with
{a_4(D), +, {a_4(D) + 1, +, 1}_1}_1
chrec and ~a_4(D) + b_5(D) in x.
I wonder if this can be fixed just by recognizing such special cases in
chrec_apply (after checking e.g. that it is unsigned computation instead of
signed etc.).


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug c/46186] Clang creates code running 1600 times faster than gcc's
  2010-10-26 14:27 [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's jak@jak-linux.org
                   ` (16 preceding siblings ...)
  2010-10-26 18:44 ` jakub at gcc dot gnu.org
@ 2010-10-26 18:54 ` dominiq at lps dot ens.fr
  2010-10-26 19:12 ` jakub at gcc dot gnu.org
                   ` (8 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: dominiq at lps dot ens.fr @ 2010-10-26 18:54 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #17 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2010-10-26 18:53:49 UTC ---
Note that clang seems to know the general result: \sum_{i=a}^b p(i)=P(b), where
p(i) is a given polynomial of degree n and P(x) a polynomial of degree n+1 such
that P(x)=P(x-1)+p(x) and P(a)=p(a). At least clang is able to remove the loop
for

sum5 += a*a*a*a*a + 2*a*a*a +5*a;


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug c/46186] Clang creates code running 1600 times faster than gcc's
  2010-10-26 14:27 [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's jak@jak-linux.org
                   ` (17 preceding siblings ...)
  2010-10-26 18:54 ` dominiq at lps dot ens.fr
@ 2010-10-26 19:12 ` jakub at gcc dot gnu.org
  2010-10-26 20:30 ` joseph at codesourcery dot com
                   ` (7 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: jakub at gcc dot gnu.org @ 2010-10-26 19:12 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #18 from Jakub Jelinek <jakub at gcc dot gnu.org> 2010-10-26 19:11:59 UTC ---
I guess you mean LLVM instead of clang, I'm pretty sure the FE doesn't perform
this optimization.

Anyway, given:

#define F(n, exp) \
unsigned long                                   \
f##n (unsigned long a, unsigned long b)         \
{                                               \
  unsigned long sum = 0;                        \
  for (; a < b; a++)                            \
    exp;                                        \
  return sum;                                   \
}

F (1, sum += a)
F (2, sum += 2)
F (3, sum += b)
F (4, sum += a * a)
F (5, sum += a * a * a)
F (6, a * a * a * a * a + 2 * a * a * a + 5 * a)

only the f1/f2/f3 cases make it into chrec_apply (and only f2/f3 are currently
handled there).


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug c/46186] Clang creates code running 1600 times faster than gcc's
  2010-10-26 14:27 [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's jak@jak-linux.org
                   ` (18 preceding siblings ...)
  2010-10-26 19:12 ` jakub at gcc dot gnu.org
@ 2010-10-26 20:30 ` joseph at codesourcery dot com
  2010-10-26 21:00 ` jakub at gcc dot gnu.org
                   ` (6 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: joseph at codesourcery dot com @ 2010-10-26 20:30 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #19 from joseph at codesourcery dot com <joseph at codesourcery dot com> 2010-10-26 20:29:56 UTC ---
On Tue, 26 Oct 2010, dominiq at lps dot ens.fr wrote:

> --- Comment #13 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2010-10-26 16:36:05 UTC ---
> > This multiplication transformation is incorrect if the loop wraps  
> > (unsigned always wraps; never overflows).
> 
> I think this is wrong: wrapping is nothing but a modulo 2^n operation (n=64
> here) which "works" for additions and multiplications, so if there is wrapping,
> the result is sum=(b*(b-1)-a*(a-1))/2 modulo 2^n, i.e. correctly wrapped.

It's a bit more complicated than that, in that you can't just compute 
(b*(b-1)-a*(a-1)) mod 2^n, then divide by 2, as that will lose the top 
bit.  (I haven't checked exactly what the generated code is doing here.)


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug c/46186] Clang creates code running 1600 times faster than gcc's
  2010-10-26 14:27 [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's jak@jak-linux.org
                   ` (19 preceding siblings ...)
  2010-10-26 20:30 ` joseph at codesourcery dot com
@ 2010-10-26 21:00 ` jakub at gcc dot gnu.org
  2010-10-26 21:07 ` dominiq at lps dot ens.fr
                   ` (5 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: jakub at gcc dot gnu.org @ 2010-10-26 21:00 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #20 from Jakub Jelinek <jakub at gcc dot gnu.org> 2010-10-26 21:00:11 UTC ---
If I translate the assembly back to C, it seems it is performing part of the
arithmetics in TImode:

unsigned long f (unsigned long a, unsigned long b)
{
  if (a >= b)
    return 0;
  else
    return (a + 1) * (b - 1 - a) + a + (unsigned long)(((unsigned __int128) (b
- 1 - a) * (b - 2 - a)) >> 1);
}


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug c/46186] Clang creates code running 1600 times faster than gcc's
  2010-10-26 14:27 [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's jak@jak-linux.org
                   ` (20 preceding siblings ...)
  2010-10-26 21:00 ` jakub at gcc dot gnu.org
@ 2010-10-26 21:07 ` dominiq at lps dot ens.fr
  2010-10-29 21:07 ` tg at mirbsd dot org
                   ` (4 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: dominiq at lps dot ens.fr @ 2010-10-26 21:07 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #21 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2010-10-26 21:06:48 UTC ---
> I guess you mean LLVM instead of clang,

Yes, if you prefer. I was referring to the command I used.

> F (6, a * a * a * a * a + 2 * a * a * a + 5 * a)

you probably mean

F (6, sum +=a * a * a * a * a + 2 * a * a * a + 5 * a)

> It's a bit more complicated than that, in that you can't just compute 
> (b*(b-1)-a*(a-1)) mod 2^n, then divide by 2, as that will lose the top 
> bit.  (I haven't checked exactly what the generated code is doing here.)

This is right if you do the multiply before the division. However b or b-1 can
be divided exactly by 2, so you have to do (b/2)*(b-1) if b is even and
b*((b-1)/2) if b is odd. The same result applies for the sum of square and
cubes, although you may be one more trick if 2*b+1 wrap and can be divided
exactly by 3.

I have tested the following variant


[macbook] f90/bug% cat pr46186_db.c
#include <stdio.h>

#define len 1000000000L

unsigned long f(unsigned long a, unsigned long b) __attribute__((noinline));

int main()
{    
    unsigned long res;
    res = f(2, 2*len);
/*    printf("%lu\n", res); */
    return 0;
}

unsigned long f(unsigned long a, unsigned long b)
{
    unsigned long sum = 0, sum0 = 0, sum2 = 0, sum5 = 0;
    for (; a < b; a++)
      {
    sum0 += 2;
        sum += a;
    sum2 += a*a;
    sum5 += a*a*a*a*a + 2*a*a*a +5*a;
      }
    printf("%lu\n", sum0);
    printf("%lu\n", sum);
    printf("%lu\n", sum2);
    printf("%lu\n", sum5);
    return sum;
}
 which gives

[macbook] f90/bug% gcc46 -Ofast -ftree-vectorizer-verbose=2 pr46186_db.c

pr46186_db.c:18: note: LOOP VECTORIZED.
pr46186_db.c:15: note: vectorized 1 loops in function.
[macbook] f90/bug% time a.out
3999999996
1999999998999999999
10262176583330622975
11833798674328980984
18.114u 0.012s 0:18.15 99.8%    0+0k 0+0io 0pf+0w
[macbook] f90/bug% clang -O pr46186_db.c
[macbook] f90/bug% time a.out
3999999996
1999999998999999999
10262176583330622975
11833798674328980984
0.000u 0.000s 0:00.00 0.0%    0+0k 0+0io 0pf+0w

So the wrapping seems properly handled for the loop replacement.

Now I have also tested the following variant


[macbook] f90/bug% cat pr46186_db_1.c
#include <stdio.h>

#define len 1000000000L

unsigned long f(unsigned long a, unsigned long b) __attribute__((noinline));

int main()
{    
    unsigned long res;
    res = f(2, 2*len);
/*    printf("%lu\n", res); */
    return 0;
}

unsigned long f(unsigned long a, unsigned long b)
{
    unsigned long sum = 0, sum0 = 0, sum2 = 0, sum5 = 0;
    unsigned long a2;
    for (; a < b; a++)
      {
    sum0 += 2;
        sum += a;
    a2 = a*a;
    sum2 += a2;
    sum5 += a*(a2*(a2 + 2) +5);
      }
    printf("%lu\n", sum0);
    printf("%lu\n", sum);
    printf("%lu\n", sum2);
    printf("%lu\n", sum5);
    return sum;
}

and the timings are

[macbook] f90/bug% gcc46 -Ofast -ftree-vectorizer-verbose=2 pr46186_db_1.c

pr46186_db_1.c:19: note: LOOP VECTORIZED.
pr46186_db_1.c:15: note: vectorized 1 loops in function.
[macbook] f90/bug% time a.out
3999999996
1999999998999999999
10262176583330622975
11833798674328980984
12.227u 0.016s 0:12.36 98.9%    0+0k 0+0io 0pf+0w  <== was 18.114u 0.012s
0:18.15 without
hand optimization
[macbook] f90/bug% clang -O pr46186_db_1.c
[macbook] f90/bug% time a.out
3999999996
1999999998999999999
10262176583330622975
11833798674328980984
0.000u 0.000s 0:00.00 0.0%    0+0k 0+0io 0pf+0w

So clearly gcc is missing some generic optimizations in products.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug c/46186] Clang creates code running 1600 times faster than gcc's
  2010-10-26 14:27 [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's jak@jak-linux.org
                   ` (21 preceding siblings ...)
  2010-10-26 21:07 ` dominiq at lps dot ens.fr
@ 2010-10-29 21:07 ` tg at mirbsd dot org
  2010-10-29 21:44 ` [Bug tree-optimization/46186] " sebpop at gmail dot com
                   ` (3 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: tg at mirbsd dot org @ 2010-10-29 21:07 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

Thorsten Glaser <tg at mirbsd dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tg at mirbsd dot org

--- Comment #22 from Thorsten Glaser <tg at mirbsd dot org> 2010-10-29 21:06:44 UTC ---
(In reply to comment #19)
> It's a bit more complicated than that, in that you can't just compute 
> (b*(b-1)-a*(a-1)) mod 2^n, then divide by 2, as that will lose the top 
> bit.  (I haven't checked exactly what the generated code is doing here.)

Haven’t checked either, but in i386 asm, using RCR instead of SHR would work
if the carry flag is correct after the sequence calculating that (or SAR).


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug tree-optimization/46186] Clang creates code running 1600 times faster than gcc's
  2010-10-26 14:27 [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's jak@jak-linux.org
                   ` (22 preceding siblings ...)
  2010-10-29 21:07 ` tg at mirbsd dot org
@ 2010-10-29 21:44 ` sebpop at gmail dot com
  2010-10-29 21:59 ` joseph at codesourcery dot com
                   ` (2 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: sebpop at gmail dot com @ 2010-10-29 21:44 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #23 from sebpop at gmail dot com <sebpop at gmail dot com> 2010-10-29 21:44:09 UTC ---
Hi,

here is a preliminary patch (not tested yet other that the PR testcase).
This patch improves chrec_apply to also handle these very uncommon
cases that some like to make big titles about (I wonder if the guy who
submitted this bug report is part of some marketing division... anyways)

Note that for these cases
F (4, sum += a * a)
F (5, sum += a * a * a)
F (6, sum += a * a * a * a * a + 2 * a * a * a + 5 * a)

although GCC with this patch knows how to transform these into end
of loop values, GCC won't change them, because of this heuristic:

          /* Do not emit expensive expressions.  The rationale is that
         when someone writes a code like

         while (n > 45) n -= 45;

         he probably knows that n is not large, and does not want it
         to be turned into n %= 45.  */
          || expression_expensive_p (def))

one needs to also set the --param scev-max-expr-size to a pretty big
value for f6 to pass the fold steps...

Sebastian


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug tree-optimization/46186] Clang creates code running 1600 times faster than gcc's
  2010-10-26 14:27 [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's jak@jak-linux.org
                   ` (23 preceding siblings ...)
  2010-10-29 21:44 ` [Bug tree-optimization/46186] " sebpop at gmail dot com
@ 2010-10-29 21:59 ` joseph at codesourcery dot com
  2014-02-16 13:13 ` jackie.rosen at hushmail dot com
  2021-08-28 23:42 ` pinskia at gcc dot gnu.org
  26 siblings, 0 replies; 28+ messages in thread
From: joseph at codesourcery dot com @ 2010-10-29 21:59 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #24 from joseph at codesourcery dot com <joseph at codesourcery dot com> 2010-10-29 21:58:46 UTC ---
On Fri, 29 Oct 2010, sebpop at gmail dot com wrote:

> here is a preliminary patch (not tested yet other that the PR testcase).

How does this patch deal with needing to compute in a wider type to avoid 
losing high bits before the division - is that covered by some existing 
code?  It could certainly do with execution testcases that verify cases 
where computing everything naively in the original type would yield an 
incorrect result (and those were computing in a type of double the width 
would still have overflow, etc.).


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug tree-optimization/46186] Clang creates code running 1600 times faster than gcc's
  2010-10-26 14:27 [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's jak@jak-linux.org
                   ` (24 preceding siblings ...)
  2010-10-29 21:59 ` joseph at codesourcery dot com
@ 2014-02-16 13:13 ` jackie.rosen at hushmail dot com
  2021-08-28 23:42 ` pinskia at gcc dot gnu.org
  26 siblings, 0 replies; 28+ messages in thread
From: jackie.rosen at hushmail dot com @ 2014-02-16 13:13 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

Jackie Rosen <jackie.rosen at hushmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jackie.rosen at hushmail dot com

--- Comment #25 from Jackie Rosen <jackie.rosen at hushmail dot com> ---
*** Bug 260998 has been marked as a duplicate of this bug. ***
Seen from the domain http://volichat.com
Page where seen: http://volichat.com/adult-chat-rooms
Marked for reference. Resolved as fixed @bugzilla.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug tree-optimization/46186] Clang creates code running 1600 times faster than gcc's
  2010-10-26 14:27 [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's jak@jak-linux.org
                   ` (25 preceding siblings ...)
  2014-02-16 13:13 ` jackie.rosen at hushmail dot com
@ 2021-08-28 23:42 ` pinskia at gcc dot gnu.org
  26 siblings, 0 replies; 28+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-08-28 23:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |DUPLICATE
             Status|NEW                         |RESOLVED

--- Comment #28 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Dup of bug 65855.

*** This bug has been marked as a duplicate of bug 65855 ***

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2021-08-28 23:42 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-10-26 14:27 [Bug c/46186] New: Clang creates code running 1600 times faster than gcc's jak@jak-linux.org
2010-10-26 14:30 ` [Bug c/46186] " jak@jak-linux.org
2010-10-26 14:31 ` paolo.carlini at oracle dot com
2010-10-26 14:33 ` jak@jak-linux.org
2010-10-26 14:47 ` redi at gcc dot gnu.org
2010-10-26 14:53 ` jak@jak-linux.org
2010-10-26 14:59 ` dominiq at lps dot ens.fr
2010-10-26 15:00 ` jak@jak-linux.org
2010-10-26 15:26 ` jak@jak-linux.org
2010-10-26 15:29 ` redi at gcc dot gnu.org
2010-10-26 15:37 ` jakub at gcc dot gnu.org
2010-10-26 15:43 ` paolo.carlini at oracle dot com
     [not found] ` <4cc6e60b.2811970a.71aa.016aSMTPIN_ADDED@mx.google.com>
2010-10-26 15:56   ` Andrew Pinski
2010-10-26 15:56 ` pinskia at gmail dot com
2010-10-26 16:36 ` dominiq at lps dot ens.fr
2010-10-26 16:59 ` jakub at gcc dot gnu.org
2010-10-26 17:15 ` dominiq at lps dot ens.fr
2010-10-26 18:44 ` jakub at gcc dot gnu.org
2010-10-26 18:54 ` dominiq at lps dot ens.fr
2010-10-26 19:12 ` jakub at gcc dot gnu.org
2010-10-26 20:30 ` joseph at codesourcery dot com
2010-10-26 21:00 ` jakub at gcc dot gnu.org
2010-10-26 21:07 ` dominiq at lps dot ens.fr
2010-10-29 21:07 ` tg at mirbsd dot org
2010-10-29 21:44 ` [Bug tree-optimization/46186] " sebpop at gmail dot com
2010-10-29 21:59 ` joseph at codesourcery dot com
2014-02-16 13:13 ` jackie.rosen at hushmail dot com
2021-08-28 23:42 ` pinskia at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).