public inbox for gcc-bugs@sourceware.org
* [Bug fortran/32084] New: gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: burnus at gcc dot gnu dot org @ 2007-05-25 13:23 UTC (permalink / raw)
To: gcc-bugs
gfortran seemingly generates a significantly inferior internal TREE
representation compared with g95: for Polyhedron's induct.f90, gfortran is 18%
slower than g95, which is based on GCC 4.0.3. (Compared with other compilers
the difference is even larger.)
(GCC 4.3 is in general faster than GCC 4.1; for induct one does not see any
runtime change with the gfortran frontend during the last 1.5 years, though
GCC/gfortran 4.1.2 was seemingly slightly faster:
http://www.suse.de/~gcctest/c++bench/polyhedron/polyhedron-summary.txt-induct-19.png
)
If one looks at -ftree-vectorizer-verbose, GCC 4.3 is able to vectorize 3 loops
with gfortran whereas GCC 4.0 vectorizes 0 loops with g95.
For a reduced-size example (395 instead of 6635 lines), gfortran is still 13%
slower:
$ gfortran -march=opteron -ffast-math -funroll-loops -ftree-vectorize
-ftree-loop-linear -msse3 -O3 test2.f90
$ time a.out
real 0m4.632s user 0m4.624s sys 0m0.004s
$ g95 -march=opteron -ffast-math -funroll-loops -ftree-vectorize -msse3 -O3
test2.f90
$ time a.out
real 0m4.030s user 0m4.024s sys 0m0.004s
$ ifort test2.f90
$ time a.out
real 0m3.859s user 0m3.856s sys 0m0.000s
# NAG f95 + system gcc 4.1.3
$ f95 -O4 -ieee=full -Bstatic -march=opteron -ffast-math -funroll-loops
-ftree-vectorize -msse3 test2.f90
$ time a.out
real 0m3.381s user 0m3.380s sys 0m0.004s
$ sunf95 -w4 -fast -xarch=amd64a -xipo=0 test2.f90
$ time a.out
real 0m3.741s user 0m3.736s sys 0m0.000s
For induct (on x86_64-unknown-linux-gnu):
51.65 [100%] gfortran -m64 as above
51.90 [100%] gfortran with -fprofile-use
61.41 [118%] gfortran 32bit, x87
46.12 [ 89%] gfortran 32bit, SSE
43.33 [ 83%] ifort 9.1
40.73 [ 78%] ifort 10beta
42.53 [ 82%] sunf95
30.16 [ 58%] pathscale
38.86 [ 75%] NAG f95 using system gcc 4.1
42.65 [ 82%] g95/gcc 4.0.3 (g95 0.91!)
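The bracketed percentages in the table above are each run time relative to the
64-bit gfortran baseline (51.65). A minimal Python sketch of that calculation
follows. The helper name `relative_percent` is illustrative only, and
truncating instead of rounding is an assumption that happens to reproduce the
listed figures.

```python
# Hypothetical helper: run time as a percentage of a baseline run time.
# Truncation via int() is an assumption; it matches the figures in the table.
def relative_percent(seconds, baseline):
    return int(100 * seconds / baseline)

BASELINE = 51.65  # gfortran -m64, from the table above

timings = {
    "gfortran 32bit, x87": 61.41,
    "gfortran 32bit, SSE": 46.12,
    "ifort 9.1": 43.33,
    "ifort 10beta": 40.73,
    "sunf95": 42.53,
    "pathscale": 30.16,
    "NAG f95 using system gcc 4.1": 38.86,
    "g95/gcc 4.0.3": 42.65,
}

for name, seconds in timings.items():
    print(f"{seconds:6.2f} [{relative_percent(seconds, BASELINE):3d}%] {name}")
```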
--
Summary: gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
Product: gcc
Version: 4.3.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: fortran
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: burnus at gcc dot gnu dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32084
* [Bug fortran/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: burnus at gcc dot gnu dot org @ 2007-05-25 13:26 UTC (permalink / raw)
To: gcc-bugs
------- Comment #1 from burnus at gcc dot gnu dot org 2007-05-25 14:25 -------
Created an attachment (id=13611)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=13611&action=view)
test case, 395 lines; based on Polyhedron's induct.f90
* [Bug fortran/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: burnus at gcc dot gnu dot org @ 2007-05-25 13:54 UTC (permalink / raw)
To: gcc-bugs
------- Comment #2 from burnus at gcc dot gnu dot org 2007-05-25 14:54 -------
Using the GCC 4.1.3 20070430 which comes with openSUSE Factory and contains
some patches from 4.2/4.3, I get the following timings:
$ gfortran-4.1 -march=opteron -ffast-math -funroll-loops -ftree-vectorize
-ftree-loop-linear -msse3 -O3 induct.f90
$ time a.out
real 0m47.043s user 0m46.911s sys 0m0.020s
which means that gcc/gfortran 4.1.3 was 10% faster for induct than 4.3's
gfortran, but still almost 10% slower than gcc/g95 4.0.3.
For the testcase (without "volatile"):
real 0m4.194s user 0m4.192s sys 0m0.000s
which is timewise also between gfortran 4.3 and g95.
* [Bug tree-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: ubizjak at gmail dot com @ 2007-06-26 19:43 UTC (permalink / raw)
To: gcc-bugs
------- Comment #3 from ubizjak at gmail dot com 2007-06-26 19:43 -------
(In reply to comment #0)
> gfortran seemingly generates a significantly inferior internal TREE
> representation than g95 as for Polyhedron's induct.f90 gfortran is 18% slower
> than g95, which is based on GCC 4.0.3. (Compared with other compilers the
> difference is even larger.)
> If one looks at -ftree-vectorizer-verbose, GCC 4.3 is able to vectorize 3 loops
> with gfortran whereas GCC 4.0 vectorizes 0 loops with g95.
The problem is in -ftree-vectorize:
gfortran -march=core2 -ffast-math -funroll-loops -ftree-loop-linear
-ftree-vectorize -msse3 -O3 pr32084.f90
time ./a.out
real 0m2.941s
user 0m2.940s
sys 0m0.004s
gfortran -march=core2 -ffast-math -funroll-loops -ftree-loop-linear -msse3 -O3
pr32084.f90
time ./a.out
real 0m1.574s
user 0m1.572s
sys 0m0.004s
The testcase needs about 47% less time without -ftree-vectorize (user time
1.572s vs. 2.940s).
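To put the two timings above side by side: the Python sketch below derives both
the fraction of run time saved and the equivalent speed-up factor from the user
times quoted in this comment. The helper name is illustrative and not part of
any tool mentioned here.

```python
def time_saved_and_speedup(slow_s, fast_s):
    """Return (fraction of run time saved, speed-up factor) for two run times."""
    return (slow_s - fast_s) / slow_s, slow_s / fast_s

# User times from the two runs above, in seconds.
with_vect = 2.940     # with -ftree-vectorize
without_vect = 1.572  # without -ftree-vectorize

saved, speedup = time_saved_and_speedup(with_vect, without_vect)
print(f"time saved: {saved:.1%}, speed-up: {speedup:.2f}x")
```

The two numbers describe the same measurement; which one is quoted depends on
whether the vectorized or the scalar run is taken as the reference.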
gcc -v
Target: x86_64-unknown-linux-gnu
...
gcc version 4.3.0 20070622 (experimental)
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 CPU X6800 @ 2.93GHz
stepping : 5
cpu MHz : 2933.435
cache size : 4096 KB
This is marked as a "tree-optimization" bug because we have no "vectorizer"
component to choose from.
--
ubizjak at gmail dot com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ubizjak at gmail dot com
             Status|UNCONFIRMED                 |NEW
          Component|fortran                     |tree-optimization
     Ever Confirmed|0                           |1
   Last reconfirmed|0000-00-00 00:00:00         |2007-06-26 19:43:36
               date|                            |
* [Bug rtl-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: ubizjak at gmail dot com @ 2007-06-27 11:24 UTC (permalink / raw)
To: gcc-bugs
------- Comment #4 from ubizjak at gmail dot com 2007-06-27 11:24 -------
(In reply to comment #3)
> The problem is in -ftree-vectorize
The difference is that without -ftree-vectorize the inner loop (do k = 1, 9)
is completely unrolled, whereas with vectorization the loop is vectorized but
_not_ unrolled. Since the vectorization factor is only 2 for V2DF mode vectors,
we lose big time at this point.
My best guess for unroller problems would be rtl-optimization.
--
ubizjak at gmail dot com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|dorit at gcc dot gnu dot org|
          Component|tree-optimization           |rtl-optimization
* [Bug rtl-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: dorit at il dot ibm dot com @ 2007-06-27 11:57 UTC (permalink / raw)
To: gcc-bugs
------- Comment #5 from dorit at il dot ibm dot com 2007-06-27 11:57 -------
(In reply to comment #4)
> (In reply to comment #3)
> > The problem is in -ftree-vectorize
> The difference is that without -ftree-vectorize the inner loop (do k = 1, 9)
> is completely unrolled, but with vectorization, the loop is vectorized, but
> _not_ unrolled. Since the vectorization factor is only 2 for V2DF mode vectors,
> we lose big time at this point.
> My best guess for unroller problems would be rtl-optimization.
Could it be the tree-level complete unroller? (Does the vectorizer peel the
loop to handle a misaligned store by any chance? If so, and if the misalignment
amount is unknown, then the number of iterations of the vectorized loop is
unknown, in which case the complete unroller wouldn't work.) In autovect-branch
the tree-level complete unroller runs before the vectorizer; I wonder what
happens there.
Another thing to consider is using -fvect-cost-model (it's very preliminary and
hasn't been tuned much, but this could be a good data point for whoever wants
to tune the vectorizer cost-model for x86_64).
* [Bug rtl-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: harsha dot jagasia at amd dot com @ 2007-06-28 0:41 UTC (permalink / raw)
To: gcc-bugs
------- Comment #6 from harsha dot jagasia at amd dot com 2007-06-28 00:41 -------
Created an attachment (id=13796)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=13796&action=view)
vectorizer dump with cost model on
* [Bug rtl-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: harsha dot jagasia at amd dot com @ 2007-06-28 0:41 UTC (permalink / raw)
To: gcc-bugs
------- Comment #7 from harsha dot jagasia at amd dot com 2007-06-28 00:41 -------
This is what I get without -ftree-vectorize, with -ftree-vectorize (default
cost model off) and with -ftree-vectorize -fvect-cost-model respectively on an
AMD x86-64 (with trunk plus the patch posted by Dorit at
http://gcc.gnu.org/ml/gcc-patches/2007-06/txt00156.txt )
Case 1: (no vectorization)
gfortran -static -march=opteron -msse3 -O3 -ffast-math -funroll-loops
pr32084.f90 -o 4.3.novect.out
time ./4.3.novect.out
real 0m4.414s
user 0m4.312s
sys 0m0.000s
Case 2: (vectorization without cost model)
gfortran -static -ftree-vectorize -march=opteron -msse3 -O3 -ffast-math
-funroll-loops -fdump-tree-vect-details -fno-show-column pr32084.f90 -o
4.3.nocost.out
time ./4.3.nocost.out
real 0m4.776s
user 0m4.668s
sys 0m0.004s
Case 3: (vectorization with cost model)
gfortran -static -ftree-vectorize -fvect-cost-model -march=opteron -msse3 -O3
-ffast-math -funroll-loops -fdump-tree-vect-details -fno-show-column
pr32084.f90 -o 4.3.cost.out
time ./4.3.cost.out
real 0m4.403s
user 0m4.300s
sys 0m0.000s
In short, the 8% advantage that the scalar version has over the vector version
disappears with the cost model.
Unless I am missing something, the inner loops at lines 207 and 319 (do k = 1,
9) don't get vectorized (irrespective of the cost model).
Looking at the dumps, the lines being vectorized without the cost model are the
calls to TRANSPOSE and DOT_PRODUCT (lines 335, 333, 288, 223, 221 and 176).
The cost model determines that it is not profitable to vectorize these and
falls back to the scalar version instead.
The dumps are attached.
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: /home/hjagasia/autovect/src-trunk/gcc/configure
--prefix=/local/hjagasia/autovect/obj-trunk-nobootstrap
--enable-languages=c,c++,fortran --enable-multilib --disable-bootstrap
Thread model: posix
gcc version 4.3.0 20070627 (experimental)
Thanks,
Harsha
* [Bug rtl-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: harsha dot jagasia at amd dot com @ 2007-06-28 0:42 UTC (permalink / raw)
To: gcc-bugs
------- Comment #8 from harsha dot jagasia at amd dot com 2007-06-28 00:42 -------
Created an attachment (id=13797)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=13797&action=view)
vectorizer dump with cost model off
* [Bug rtl-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: ubizjak at gmail dot com @ 2007-06-28 8:36 UTC (permalink / raw)
To: gcc-bugs
------- Comment #9 from ubizjak at gmail dot com 2007-06-28 08:36 -------
(In reply to comment #7)
> This is what I get without -ftree-vectorize, with -ftree-vectorize (default
> cost model off) and with -ftree-vectorize -fvect-cost-model respectively on an
> AMD x86-64 (with trunk plus the patch posted by Dorit at
> http://gcc.gnu.org/ml/gcc-patches/2007-06/txt00156.txt )
>
> Case 1: (no vectorization)
> gfortran -static -march=opteron -msse3 -O3 -ffast-math -funroll-loops
> pr32084.f90 -o 4.3.novect.out
> time ./4.3.novect.out
> real 0m4.414s
> user 0m4.312s
> sys 0m0.000s
>
> Case 2: (vectorization without cost model)
> gfortran -static -ftree-vectorize -march=opteron -msse3 -O3 -ffast-math
> -funroll-loops -fdump-tree-vect-details -fno-show-column pr32084.f90 -o
> 4.3.nocost.out
> time ./4.3.nocost.out
> real 0m4.776s
> user 0m4.668s
> sys 0m0.004s
>
> In short, the 8% advantage that the scalar version has over the vector version
> disappears with the cost model.
>
> Unless I am missing something, the inner loops at lines 207 and 319 (do k = 1,
> 9) don't get vectorized (irrespective of the cost model).
No, it is OK (but on core2 and nocona -ftree-vectorize has a 50% disadvantage
compared to the scalar versions). The problem is that the vectorized loop is no
longer unrolled by the RTL unroller. My speculation is that, once the
vectorized loop is unrolled, the vectorized version will be _faster_ than the
scalar versions.
* [Bug rtl-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: ubizjak at gmail dot com @ 2007-06-28 9:20 UTC (permalink / raw)
To: gcc-bugs
------- Comment #10 from ubizjak at gmail dot com 2007-06-28 09:20 -------
Well, well - what can be found in _.146r.loop_unroll:
Loop 10 is simple:
simple exit 40 -> 42
number of iterations: (const_int 8 [0x8])
upper bound: 8
;; Unable to prove that the loop rolls exactly once
;; Considering peeling completely
;; Not peeling loop completely, rolls too much (8 iterations > 8 [maximum
peelings])
Really funny... Since when is "8 more than 8"? ;(
However, gcc has no problems unrolling without -ftree-vectorize:
Loop 8 is simple:
simple exit 28 -> 30
number of iterations: (const_int 8 [0x8])
upper bound: 8
;; Unable to prove that the loop rolls exactly once
;; Considering peeling completely
;; Decided to peel loop completely
Investigating...
* [Bug rtl-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: ubizjak at gmail dot com @ 2007-06-28 11:39 UTC (permalink / raw)
To: gcc-bugs
------- Comment #11 from ubizjak at gmail dot com 2007-06-28 11:39 -------
(In reply to comment #10)
> ;; Not peeling loop completely, rolls too much (8 iterations > 8 [maximum
> peelings])
This means that the original loop plus 8 unrolled iterations (9 copies) is more
than 8. The loop has 46 insns, so 9 copies of the loop (414 insns) are more
than PARAM_MAX_COMPLETELY_PEELED_INSNS (currently 400), and the unroll is
rejected.
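The size check described above can be sketched as follows. This is a
simplified Python model of the explanation in this comment, not GCC's actual
code: the function name and the "niter + 1 copies" accounting are assumptions
taken from the wording above. Only the 46-insn loop size, the 8 iterations and
the 400-insn limit come from the thread.

```python
MAX_COMPLETELY_PEELED_INSNS = 400  # limit quoted in the comment above

def fits_complete_peel(loop_insns, niter, max_insns=MAX_COMPLETELY_PEELED_INSNS):
    """Simplified model: the original loop body plus `niter` peeled copies."""
    copies = niter + 1
    return loop_insns * copies <= max_insns

# Vectorized loop from the dump: 46 insns, 8 iterations, so 9 * 46 = 414 insns.
print(fits_complete_peel(46, 8))
```

Under this model a loop body of 44 insns or fewer would still fit
(9 * 44 = 396), which illustrates how close the vectorized loop is to the
limit.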
However, even with the vectorized loop unrolled, we are still 50% slower. It
looks like tight sequences of subsd/subpd and mulsd/mulpd kill performance with
-ftree-vectorize:
movapd %xmm6, %xmm0
movsd %xmm1, -200(%ebp)
subsd %xmm5, %xmm0
subpd (%ebx), %xmm3
mulsd %xmm0, %xmm0
mulpd %xmm3, %xmm3
haddpd %xmm3, %xmm3
movapd %xmm3, %xmm2
movsd w2gauss.1408+8, %xmm3
addsd %xmm2, %xmm0
mulsd w1gauss.1411-8(,%eax,8), %xmm3
sqrtsd %xmm0, %xmm1
It looks like nothing helps except -fvect-cost-model. The results for
induct.f90 (gfortran -march=nocona -msse3 -O3 -ffast-math -mfpmath=sse
-funroll-loops) are:
induct.f90, -ftree-vectorize without -fvect-cost-model:
user 1m34.046s
induct.f90, -ftree-vectorize with -fvect-cost-model:
user 0m45.447s
induct.f90 without -ftree-vectorize:
user 0m45.215s
* [Bug rtl-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: rguenth at gcc dot gnu dot org @ 2007-06-28 11:40 UTC (permalink / raw)
To: gcc-bugs
------- Comment #12 from rguenth at gcc dot gnu dot org 2007-06-28 11:40 -------
I suspect the vectorizer leaves us with too many dead statements that confuse
the complete unroller's size cost metric. Running DCE after vectorization might
fix this.
* [Bug rtl-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: burnus at gcc dot gnu dot org @ 2007-06-28 12:03 UTC (permalink / raw)
To: gcc-bugs
------- Comment #13 from burnus at gcc dot gnu dot org 2007-06-28 12:03 -------
core2        AMD
0m45.215s    0m4.312s   (no vectorize)
1m34.046s    0m4.668s   -ftree-vectorize
0m45.447s    0m4.300s   -ftree-vectorize -fvect-cost-model
i.e. "-ftree-vectorize -fvect-cost-model" is marginally faster than without
-ftree-vectorize on AMD but slower on Intel; and on Intel "-ftree-vectorize"
alone has a huge impact (80% slower) whereas on AMD it is only 8% slower.
* [Bug rtl-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: ubizjak at gmail dot com @ 2007-06-28 12:59 UTC (permalink / raw)
To: gcc-bugs
------- Comment #14 from ubizjak at gmail dot com 2007-06-28 12:59 -------
(In reply to comment #13)
> core2 AMD
> 0m45.215s 0m4.312s (no vectorize)
Ehm, the first column is the full induct.f90 run on _nocona_, whereas the AMD
column is the result of running the attached test. The table with comparable
results is then:
(gfortran -march=nocona -msse3 -O3 -ffast-math -mfpmath=sse -funroll-loops)
nocona(32)   AMD(64)
0m4.176s     0m4.312s   (no vectorize)
0m8.169s     0m4.668s   -ftree-vectorize
0m4.108s     0m4.300s   -ftree-vectorize -fvect-cost-model
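The slowdowns implied by this table can be computed directly from the
`time`-style strings. The Python sketch below is illustrative only:
`parse_time` and `slowdown_percent` are hypothetical helper names, and the
parser handles only the `<minutes>m<seconds>s` form used in this thread.

```python
import re

def parse_time(text):
    """Convert a `time`-style string such as '1m34.046s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized time format: {text!r}")
    return 60 * int(match.group(1)) + float(match.group(2))

def slowdown_percent(scalar, vectorized):
    """How much longer the vectorized run takes, in percent of the scalar run."""
    return 100 * (parse_time(vectorized) / parse_time(scalar) - 1)

# (no vectorize, -ftree-vectorize) pairs taken from the table above.
table = {
    "nocona(32)": ("0m4.176s", "0m8.169s"),
    "AMD(64)": ("0m4.312s", "0m4.668s"),
}
for machine, (scalar, vectorized) in table.items():
    print(f"{machine}: {slowdown_percent(scalar, vectorized):.0f}% slower")
```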
* [Bug rtl-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: bonzini at gnu dot org @ 2007-12-10 8:37 UTC (permalink / raw)
To: gcc-bugs
------- Comment #15 from bonzini at gnu dot org 2007-12-10 08:37 -------
As I committed PR32086 to use the cost model, this should be fixed. However, I
prefer to leave it open as a missed optimization since Richard G.'s comments
suggest that: a) there should be a DCE pass after vectorization, b) the cost
model might actually be wrong?
--
bonzini at gnu dot org changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bonzini at gnu dot org
* [Bug rtl-optimization/32084] gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
From: rguenth at gcc dot gnu dot org @ 2007-12-10 10:08 UTC (permalink / raw)
To: gcc-bugs
------- Comment #16 from rguenth at gcc dot gnu dot org 2007-12-10 10:07 -------
I have this noted down on my TODO list, so I suppose it's better to close
this PR. I have opened PR34416 to track pass-pipeline issues we are aware of.
--
rguenth at gcc dot gnu dot org changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED
   Target Milestone|---                         |4.3.0