public inbox for gcc-bugs@sourceware.org
* [Bug fortran/32084]  New: gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: burnus at gcc dot gnu dot org @ 2007-05-25 13:23 UTC
  To: gcc-bugs

gfortran seemingly generates a significantly inferior internal TREE
representation compared to g95: for Polyhedron's induct.f90, gfortran is 18%
slower than g95, which is based on GCC 4.0.3. (Compared with other compilers the
difference is even larger.)

(GCC 4.3 is in general faster than GCC 4.1; for induct one does not see any
runtime change with the gfortran frontend during the last 1.5 years, though
GCC/gfortran 4.1.2 was seemingly slightly faster:
http://www.suse.de/~gcctest/c++bench/polyhedron/polyhedron-summary.txt-induct-19.png
)

If one looks at -ftree-vectorizer-verbose, GCC 4.3 is able to vectorize 3 loops
with gfortran whereas GCC 4.0 vectorizes 0 loops with g95.


For a reduced-size example (395 instead of 6635 lines), gfortran is still 13%
slower:

$ gfortran -march=opteron -ffast-math -funroll-loops -ftree-vectorize
-ftree-loop-linear -msse3 -O3  test2.f90
$ time a.out
real    0m4.632s  user    0m4.624s  sys     0m0.004s

$ g95 -march=opteron -ffast-math -funroll-loops -ftree-vectorize -msse3 -O3
test2.f90
$ time a.out
real    0m4.030s  user    0m4.024s  sys     0m0.004s

$ ifort test2.f90
$ time a.out
real    0m3.859s  user    0m3.856s  sys     0m0.000s

# NAG f95 + system gcc 4.1.3
$ f95 -O4 -ieee=full -Bstatic -march=opteron -ffast-math -funroll-loops
-ftree-vectorize -msse3 test2.f90
$ time a.out
real    0m3.381s  user    0m3.380s  sys     0m0.004s

$ sunf95 -w4 -fast -xarch=amd64a -xipo=0 test2.f90
$ time a.out
real    0m3.741s  user    0m3.736s  sys     0m0.000s




For induct (on x86_64-unknown-linux-gnu):
51.65 [100%]  gfortran -m64 as above
51.90 [100%]  gfortran with -fprofile-use
61.41 [118%]  gfortran 32bit, x87
46.12 [ 89%]  gfortran 32bit, SSE
43.33 [ 83%]  ifort 9.1
40.73 [ 78%]  ifort 10beta
42.53 [ 82%]  sunf95
30.16 [ 58%]  pathscale
38.86 [ 75%]  NAG f95 using system gcc 4.1
42.65 [ 82%]  g95/gcc 4.0.3 (g95 0.91!)
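
The bracketed percentages in the table above appear to be each runtime divided by the 64-bit gfortran baseline (51.65 s), truncated to a whole percent. A minimal Python sketch, assuming that reading; the dictionary and the helper name `relative_percent` are illustrative and not part of the report:

```python
# Runtimes in seconds, copied from the induct table above.
BASELINE = 51.65  # gfortran -m64

runtimes = {
    "gfortran with -fprofile-use": 51.90,
    "gfortran 32bit, x87": 61.41,
    "gfortran 32bit, SSE": 46.12,
    "ifort 9.1": 43.33,
    "ifort 10beta": 40.73,
    "sunf95": 42.53,
    "pathscale": 30.16,
    "NAG f95 using system gcc 4.1": 38.86,
    "g95/gcc 4.0.3": 42.65,
}

def relative_percent(seconds, baseline=BASELINE):
    # Assumption: the table truncates rather than rounds.
    return int(seconds / baseline * 100)

for name, seconds in runtimes.items():
    print(f"{seconds:6.2f} [{relative_percent(seconds):3d}%]  {name}")
```

On this reading g95 needs 82% of gfortran's time, which seems to be where the 18% figure in the summary comes from.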


-- 
           Summary: gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-
                    based competitor
           Product: gcc
           Version: 4.3.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: fortran
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: burnus at gcc dot gnu dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32084



* [Bug fortran/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: burnus at gcc dot gnu dot org @ 2007-05-25 13:26 UTC
  To: gcc-bugs



------- Comment #1 from burnus at gcc dot gnu dot org  2007-05-25 14:25 -------
Created an attachment (id=13611)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=13611&action=view)
test case, 395 lines; based on Polyhedron's induct.f90



* [Bug fortran/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: burnus at gcc dot gnu dot org @ 2007-05-25 13:54 UTC
  To: gcc-bugs



------- Comment #2 from burnus at gcc dot gnu dot org  2007-05-25 14:54 -------
Using the GCC 4.1.3 20070430 which comes with openSUSE Factory and contains
some patches from 4.2/4.3, I get the following timings:

$ gfortran-4.1 -march=opteron -ffast-math -funroll-loops -ftree-vectorize
-ftree-loop-linear -msse3 -O3 induct.f90
$ time a.out
real    0m47.043s  user    0m46.911s  sys     0m0.020s

which means that gcc/gfortran 4.1.3 was 10% faster for induct than 4.3's
gfortran, but still almost 10% slower than gcc/g95 4.0.3.


For the testcase (without "volatile"):
   real    0m4.194s  user    0m4.192s  sys     0m0.000s
which is timewise also between gfortran 4.3 and g95.



* [Bug tree-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: ubizjak at gmail dot com @ 2007-06-26 19:43 UTC
  To: gcc-bugs



------- Comment #3 from ubizjak at gmail dot com  2007-06-26 19:43 -------
(In reply to comment #0)
> gfortran seemingly generates a significantly inferior internal TREE
> representation compared to g95: for Polyhedron's induct.f90, gfortran is 18%
> slower than g95, which is based on GCC 4.0.3. (Compared with other compilers the
> difference is even larger.)

> If one looks at -ftree-vectorizer-verbose, GCC 4.3 is able to vectorize 3 loops
> with gfortran whereas GCC 4.0 vectorizes 0 loops with g95.

The problem is in -ftree-vectorize:

gfortran -march=core2 -ffast-math -funroll-loops -ftree-loop-linear
-ftree-vectorize -msse3 -O3 pr32084.f90
time ./a.out

real    0m2.941s
user    0m2.940s
sys     0m0.004s

gfortran -march=core2 -ffast-math -funroll-loops -ftree-loop-linear -msse3 -O3
pr32084.f90
time ./a.out

real    0m1.574s
user    0m1.572s
sys     0m0.004s

The testcase runs 47% faster without -ftree-vectorize.
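
One way to arrive at the 47% figure, assuming it is the reduction in user time relative to the vectorized run (the helper name `time_saved_percent` is hypothetical):

```python
def time_saved_percent(slow_s, fast_s):
    """Percentage of the slower runtime that the faster run saves."""
    return (slow_s - fast_s) / slow_s * 100

vectorized = 2.940      # user time with -ftree-vectorize
not_vectorized = 1.572  # user time without it

print(f"{time_saved_percent(vectorized, not_vectorized):.1f}% less time")
print(f"vectorized run takes {vectorized / not_vectorized:.2f}x as long")
```

Expressed the other way round, the vectorized binary takes close to twice as long as the scalar one on this machine.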

gcc -v
Target: x86_64-unknown-linux-gnu
...
gcc version 4.3.0 20070622 (experimental)

vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU         X6800  @ 2.93GHz
stepping        : 5
cpu MHz         : 2933.435
cache size      : 4096 KB

This is marked as a "tree-optimization" bug because we have no "vectorizer"
component to choose from.


-- 

ubizjak at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ubizjak at gmail dot com
             Status|UNCONFIRMED                 |NEW
          Component|fortran                     |tree-optimization
     Ever Confirmed|0                           |1
   Last reconfirmed|0000-00-00 00:00:00         |2007-06-26 19:43:36
               date|                            |



* [Bug rtl-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: ubizjak at gmail dot com @ 2007-06-27 11:24 UTC
  To: gcc-bugs



------- Comment #4 from ubizjak at gmail dot com  2007-06-27 11:24 -------
(In reply to comment #3)

> The problem is in -ftree-vectorize

The difference is that without -ftree-vectorize the inner loop (do k = 1, 9)
is completely unrolled, whereas with vectorization the loop is vectorized but
_not_ unrolled. Since the vectorization factor is only 2 for V2DF mode vectors,
we lose big time at this point.

My best guess for unroller problems would be rtl-optimization.
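
To see why a vectorization factor of 2 leaves little room for gain on a nine-trip loop, the following Python sketch counts loop iterations under a simplified model: no alignment peeling, and leftover scalar iterations handled in an epilogue. The function name and the model are illustrative assumptions, not a description of the code GCC actually emits.

```python
def vectorized_iterations(trip_count, vf):
    """Split a scalar trip count into vector iterations and a scalar epilogue."""
    vector_iters = trip_count // vf
    epilogue_iters = trip_count % vf
    return vector_iters, epilogue_iters

# do k = 1, 9 with V2DF vectors (two doubles per vector)
vec, tail = vectorized_iterations(9, 2)
print(f"{vec} vector iterations + {tail} scalar iteration(s)")
```

Under this model the vectorized loop still pays loop-control overhead on every iteration, whereas the completely unrolled scalar version has none.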


-- 

ubizjak at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|dorit at gcc dot gnu dot org|
          Component|tree-optimization           |rtl-optimization



* [Bug rtl-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: dorit at il dot ibm dot com @ 2007-06-27 11:57 UTC
  To: gcc-bugs



------- Comment #5 from dorit at il dot ibm dot com  2007-06-27 11:57 -------
(In reply to comment #4)
> (In reply to comment #3)
> > The problem is in -ftree-vectorize
> The difference is that without -ftree-vectorize the inner loop (do k = 1, 9)
> is completely unrolled, whereas with vectorization the loop is vectorized but
> _not_ unrolled. Since the vectorization factor is only 2 for V2DF mode vectors,
> we lose big time at this point.
> My best guess for unroller problems would be rtl-optimization.

Could it be the tree-level complete unroller? (Does the vectorizer peel the
loop to handle a misaligned store, by any chance? If so, and if the misalignment
amount is unknown, then the number of iterations of the vectorized loop is
unknown, in which case the complete unroller wouldn't work.) In autovect-branch
the tree-level complete unroller runs before the vectorizer; I wonder what
happens there.

Another thing to consider is using -fvect-cost-model (it's very preliminary and
hasn't been tuned much, but this could be a good data point for whoever wants
to tune the vectorizer cost model for x86_64).



* [Bug rtl-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: harsha dot jagasia at amd dot com @ 2007-06-28  0:41 UTC
  To: gcc-bugs



------- Comment #6 from harsha dot jagasia at amd dot com  2007-06-28 00:41 -------
Created an attachment (id=13796)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=13796&action=view)
vectorizer dump with cost model on



* [Bug rtl-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: harsha dot jagasia at amd dot com @ 2007-06-28  0:41 UTC
  To: gcc-bugs



------- Comment #7 from harsha dot jagasia at amd dot com  2007-06-28 00:41 -------
This is what I get without -ftree-vectorize, with -ftree-vectorize (default
cost model off) and with -ftree-vectorize -fvect-cost-model respectively on an
AMD x86-64 (with trunk plus the patch posted by Dorit at
http://gcc.gnu.org/ml/gcc-patches/2007-06/txt00156.txt )

Case 1: (no vectorization)
gfortran -static -march=opteron -msse3 -O3 -ffast-math -funroll-loops
pr32084.f90 -o 4.3.novect.out
time ./4.3.novect.out
real    0m4.414s
user    0m4.312s
sys     0m0.000s

Case 2: (vectorization without cost model)
gfortran -static -ftree-vectorize -march=opteron -msse3 -O3 -ffast-math
-funroll-loops -fdump-tree-vect-details -fno-show-column pr32084.f90 -o
4.3.nocost.out
time ./4.3.nocost.out
real    0m4.776s
user    0m4.668s
sys     0m0.004s

Case 3: (vectorization with cost model)
gfortran -static -ftree-vectorize -fvect-cost-model -march=opteron -msse3 -O3
-ffast-math -funroll-loops -fdump-tree-vect-details -fno-show-column
pr32084.f90 -o 4.3.cost.out
time ./4.3.cost.out
real    0m4.403s
user    0m4.300s
sys     0m0.000s

In short, the 8% advantage that the scalar version has over the vector version
disappears with the cost model.

Unless I am missing something, the inner loops at lines 207 and 319 (do k = 1,
9) don’t get vectorized (irrespective of the cost model).

Looking at the dumps, the lines being vectorized without the cost model are the
calls to TRANSPOSE and DOT_PRODUCT (lines 335, 333, 288, 223, 221 and 176).
The cost model determines that it is not profitable to vectorize these and
falls back to the scalar version instead.

The dumps are attached.

Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: /home/hjagasia/autovect/src-trunk/gcc/configure
--prefix=/local/hjagasia/autovect/obj-trunk-nobootstrap
--enable-languages=c,c++,fortran --enable-multilib --disable-bootstrap
Thread model: posix
gcc version 4.3.0 20070627 (experimental)

Thanks,
Harsha



* [Bug rtl-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: harsha dot jagasia at amd dot com @ 2007-06-28  0:42 UTC
  To: gcc-bugs



------- Comment #8 from harsha dot jagasia at amd dot com  2007-06-28 00:42 -------
Created an attachment (id=13797)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=13797&action=view)
vectorizer dump with cost model off



* [Bug rtl-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: ubizjak at gmail dot com @ 2007-06-28  8:36 UTC
  To: gcc-bugs



------- Comment #9 from ubizjak at gmail dot com  2007-06-28 08:36 -------
(In reply to comment #7)
> This is what I get without -ftree-vectorize, with -ftree-vectorize (default
> cost model off) and with -ftree-vectorize -fvect-cost-model respectively on an
> AMD x86-64 (with trunk plus the patch posted by Dorit at
> http://gcc.gnu.org/ml/gcc-patches/2007-06/txt00156.txt )
> 
> Case 1: (no vectorization)
> gfortran -static -march=opteron -msse3 -O3 -ffast-math -funroll-loops
> pr32084.f90 -o 4.3.novect.out
> time ./4.3.novect.out
> real    0m4.414s
> user    0m4.312s
> sys     0m0.000s
> 
> Case 2: (vectorization without cost model)
> gfortran -static -ftree-vectorize -march=opteron -msse3 -O3 -ffast-math
> -funroll-loops -fdump-tree-vect-details -fno-show-column pr32084.f90 -o
> 4.3.nocost.out
> time ./4.3.nocost.out
> real    0m4.776s
> user    0m4.668s
> sys     0m0.004s
>
> In short, the 8% advantage that the scalar version has over the vector version
> disappears with the cost model.
> 
> Unless I am missing something, the inner loops at lines 207 and 319 (do k = 1,
> 9) don’t get vectorized (irrespective of the cost model).

No, it is OK (but for core2 and nocona, -ftree-vectorize has a 50% disadvantage
compared to the scalar versions). The problem is that the vectorized loop is no
longer unrolled by the RTL unroller. My speculation is that, with the
vectorized loop unrolled, the vectorized version would be _faster_ than the
scalar versions.



* [Bug rtl-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: ubizjak at gmail dot com @ 2007-06-28  9:20 UTC
  To: gcc-bugs



------- Comment #10 from ubizjak at gmail dot com  2007-06-28 09:20 -------
Well, well - what can be found in _.146r.loop_unroll:

Loop 10 is simple:
  simple exit 40 -> 42
  number of iterations: (const_int 8 [0x8])
  upper bound: 8
;; Unable to prove that the loop rolls exactly once

;; Considering peeling completely
;; Not peeling loop completely, rolls too much (8 iterations > 8 [maximum
peelings])

Really funny... Since when is "8 more than 8"? ;(

However, gcc has no problems when unrolling without -ftree-vectorize:

Loop 8 is simple:
  simple exit 28 -> 30
  number of iterations: (const_int 8 [0x8])
  upper bound: 8
;; Unable to prove that the loop rolls exactly once

;; Considering peeling completely
;; Decided to peel loop completely

Investigating...



* [Bug rtl-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: ubizjak at gmail dot com @ 2007-06-28 11:39 UTC
  To: gcc-bugs



------- Comment #11 from ubizjak at gmail dot com  2007-06-28 11:39 -------
(In reply to comment #10)

> ;; Not peeling loop completely, rolls too much (8 iterations > 8 [maximum
> peelings])

This means that the original plus 8 unrolled iterations (9 copies) > 8. The
loop has 46 insns, so 9 copies of the loop are more than
PARAM_MAX_COMPLETELY_PEELED_INSNS (currently 400), and the unroll is rejected.
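
The arithmetic behind that rejection can be sketched in Python. This is a simplified model of the size check as described in this comment, not GCC's actual source; the function name and return shape are hypothetical.

```python
MAX_COMPLETELY_PEELED_INSNS = 400  # value quoted in the comment above

def can_peel_completely(loop_insns, niter, max_insns=MAX_COMPLETELY_PEELED_INSNS):
    """Return (allowed, total_insns) for complete peeling.

    Complete peeling needs niter + 1 copies of the body:
    the original plus one copy per further iteration.
    """
    copies = niter + 1
    total = copies * loop_insns
    return total <= max_insns, total

allowed, total = can_peel_completely(loop_insns=46, niter=8)
print(allowed, total)
```

With 46 insns per copy the budget allows at most 400 // 46 = 8 copies, one short of the 9 needed, which would explain the confusing "8 iterations > 8" message.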

However, even with the vectorized loop unrolled, we are still 50% slower. It looks
like tight sequences of subsd/subpd and mulsd/mulpd kill performance with
-ftree-vectorize:

        movapd  %xmm6, %xmm0
        movsd   %xmm1, -200(%ebp)
        subsd   %xmm5, %xmm0
        subpd   (%ebx), %xmm3
        mulsd   %xmm0, %xmm0
        mulpd   %xmm3, %xmm3
        haddpd  %xmm3, %xmm3
        movapd  %xmm3, %xmm2
        movsd   w2gauss.1408+8, %xmm3
        addsd   %xmm2, %xmm0
        mulsd   w1gauss.1411-8(,%eax,8), %xmm3
        sqrtsd  %xmm0, %xmm1

It looks like there is no remedy other than -fvect-cost-model. The results for
induct.f90 (gfortran -march=nocona -msse3 -O3 -ffast-math -mfpmath=sse
-funroll-loops) are:

induct.f90, -ftree-vectorize without -fvect-cost-model:
user    1m34.046s

induct.f90, -ftree-vectorize with -fvect-cost-model:
user    0m45.447s

induct.f90 without -ftree-vectorize:
user    0m45.215s



* [Bug rtl-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: rguenth at gcc dot gnu dot org @ 2007-06-28 11:40 UTC
  To: gcc-bugs



------- Comment #12 from rguenth at gcc dot gnu dot org  2007-06-28 11:40 -------
I suspect the vectorizer leaves us with too many dead statements that confuse
the complete unroller's size cost metric.  Running DCE after vectorization might
fix this.
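
A toy illustration of the suspected effect, in Python: if a size metric simply counts statements, dead statements left behind inflate it until a dead-code-elimination pass removes them. The three-address statement format, the variable names, and the `eliminate_dead_code` helper are all invented for this sketch and do not reflect GCC's internal representation.

```python
def eliminate_dead_code(stmts, live_out):
    """Remove statements whose result is never used.

    stmts: list of (dest, [operands]) in program order (straight-line code).
    live_out: names that are needed after the block.
    """
    live = set(live_out)
    kept = []
    # Walk backwards so uses are seen before their definitions.
    for dest, operands in reversed(stmts):
        if dest in live:
            live.discard(dest)
            live.update(operands)
            kept.append((dest, operands))
    kept.reverse()
    return kept

block = [
    ("t1", ["a", "b"]),
    ("t2", ["a", "c"]),    # only read by t4, which is itself dead
    ("t3", ["t1", "d"]),
    ("t4", ["t2", "t2"]),  # dead: t4 is never read
    ("r",  ["t3"]),
]

cleaned = eliminate_dead_code(block, live_out={"r"})
print(len(block), "statements before,", len(cleaned), "after")
```

A size-based unrolling heuristic that ran on `block` rather than `cleaned` would see a larger loop body than the one that ends up in the final code.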



* [Bug rtl-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: burnus at gcc dot gnu dot org @ 2007-06-28 12:03 UTC
  To: gcc-bugs



------- Comment #13 from burnus at gcc dot gnu dot org  2007-06-28 12:03 -------
core2      AMD
0m45.215s  0m4.312s  (no vectorize)
1m34.046s  0m4.668s  -ftree-vectorize
0m45.447s  0m4.300s  -ftree-vectorize -fvect-cost-model

i.e. "-ftree-vectorize -fvect-cost-model" is marginally faster than without
-ftree-vectorize on AMD but slower on Intel; and on Intel "-ftree-vectorize"
alone has a huge impact (80% slower) whereas on AMD it is only 8% slower.



* [Bug rtl-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: ubizjak at gmail dot com @ 2007-06-28 12:59 UTC
  To: gcc-bugs



------- Comment #14 from ubizjak at gmail dot com  2007-06-28 12:59 -------
(In reply to comment #13)
> core2      AMD
> 0m45.215s  0m4.312s  (no vectorize)

Ehm, the first column is the full induct.f90 run on _nocona_, whereas the AMD
column is the result of running the attached test. The table with comparable
results is then:

(gfortran -march=nocona -msse3 -O3 -ffast-math -mfpmath=sse -funroll-loops)

nocona(32) AMD(64)
0m4.176s   0m4.312s  (no vectorize)
0m8.169s   0m4.668s  -ftree-vectorize
0m4.108s   0m4.300s  -ftree-vectorize -fvect-cost-model
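
Using the comparable figures in the table above, the relative slowdown caused by -ftree-vectorize on each machine can be computed as follows. This is a small Python sketch; the dictionary layout and the helper name `slowdown_percent` are illustrative.

```python
# Test-case runtimes in seconds:
# (no vectorize, -ftree-vectorize, -ftree-vectorize -fvect-cost-model)
timings = {
    "nocona(32)": (4.176, 8.169, 4.108),
    "AMD(64)":    (4.312, 4.668, 4.300),
}

def slowdown_percent(base_s, other_s):
    """How much longer other_s is than base_s, in percent (negative = faster)."""
    return (other_s / base_s - 1) * 100

for machine, (scalar, vect, vect_cost) in timings.items():
    print(f"{machine}: vectorize {slowdown_percent(scalar, vect):+.1f}%, "
          f"with cost model {slowdown_percent(scalar, vect_cost):+.1f}%")
```

On these numbers plain -ftree-vectorize nearly doubles the runtime on nocona but costs about 8% on AMD, while the cost model brings both back to roughly the scalar time.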



* [Bug rtl-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: bonzini at gnu dot org @ 2007-12-10  8:37 UTC
  To: gcc-bugs



------- Comment #15 from bonzini at gnu dot org  2007-12-10 08:37 -------
As I committed PR32086 to use the cost model, this should be fixed.  However, I
prefer to leave it open as a missed optimization since Richard G.'s comments
suggest that: a) there should be a DCE pass after vectorization, b) the cost
model might actually be wrong?


-- 

bonzini at gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bonzini at gnu dot org



* [Bug rtl-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: rguenth at gcc dot gnu dot org @ 2007-12-10 10:08 UTC
  To: gcc-bugs



------- Comment #16 from rguenth at gcc dot gnu dot org  2007-12-10 10:07 -------
I have this noted down on my TODO list, so I suppose it's better to close
this PR.  I have opened PR34416 to track pass-pipeline issues we are aware of.


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED
   Target Milestone|---                         |4.3.0



Thread overview: 17+ messages
2007-05-25 13:23 [Bug fortran/32084] New: gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor burnus at gcc dot gnu dot org
2007-05-25 13:26 ` [Bug fortran/32084] " burnus at gcc dot gnu dot org
2007-05-25 13:54 ` burnus at gcc dot gnu dot org
2007-06-26 19:43 ` [Bug tree-optimization/32084] " ubizjak at gmail dot com
2007-06-27 11:24 ` [Bug rtl-optimization/32084] " ubizjak at gmail dot com
2007-06-27 11:57 ` dorit at il dot ibm dot com
2007-06-28  0:41 ` harsha dot jagasia at amd dot com
2007-06-28  0:41 ` harsha dot jagasia at amd dot com
2007-06-28  0:42 ` harsha dot jagasia at amd dot com
2007-06-28  8:36 ` ubizjak at gmail dot com
2007-06-28  9:20 ` ubizjak at gmail dot com
2007-06-28 11:39 ` ubizjak at gmail dot com
2007-06-28 11:40 ` rguenth at gcc dot gnu dot org
2007-06-28 12:03 ` burnus at gcc dot gnu dot org
2007-06-28 12:59 ` ubizjak at gmail dot com
2007-12-10  8:37 ` bonzini at gnu dot org
2007-12-10 10:08 ` rguenth at gcc dot gnu dot org
