public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/34265] New: Missed optimizations
@ 2007-11-28 15:27 dominiq at lps dot ens dot fr
2007-11-28 15:30 ` [Bug tree-optimization/34265] " dominiq at lps dot ens dot fr
` (32 more replies)
0 siblings, 33 replies; 38+ messages in thread
From: dominiq at lps dot ens dot fr @ 2007-11-28 15:27 UTC (permalink / raw)
To: gcc-bugs
I have had a closer look to the optimization of the polyhedron test
induct.f90.
3/4 of the runtime is spent in the subroutine 'mutual_ind_quad_cir_coil' and
1/4 in 'mutual_ind_quad_rec_coil'.
The two subroutines contain two main loops with the following structure:
do i = 1, 2*m
...
do j = 1, 9
...
do k = 1, 9
q_vector(1) = 0.5_longreal * a * (x2gauss(k) + 1.0_longreal)
q_vector(2) = 0.5_longreal * b1 * (y2gauss(k) - 1.0_longreal)
q_vector(3) = 0.0_longreal
!
! rotate quad vector into the global coordinate system
!
rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:))
rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:))
rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:))
!
! compute and add in quadrature term
!
numerator = w1gauss(j) * w2gauss(k) * &
dot_product(coil_current_vec,current_vector)
denominator = sqrt(dot_product(rot_c_vector-rot_q_vector, &
rot_c_vector-rot_q_vector))
l12_lower = l12_lower + numerator/denominator
end do
end do
end do
where the six first lines of code in the k loop do not depend on i nor j
and can be computed in a 'do k = 1, 9' loop outside the main loop by
replacing the length-three vector 'rot_q_vector' by a nine by three array.
The original code induct.f90 gives the following timing (all with -O3
-ffast-math -funroll-loops):
93.227u 0.094s 1:33.32 99.9% 0+0k 0+0io 0pf+0w
and output:
...
Maximum wand/quad abs rel mutual inductance = 5.95379428444656467E-002
Minimum resq coil/quad abs rel mutual inductance = 0.0000000000000000
Maximum resq coil/quad abs rel mutual inductance = 9.63995250242061230E-002
...
Unrolling bye hand 'numerator' and 'denominator' gives (see
http://gcc.gnu.org/ml/fortran/2007-11/msg00231.html):
65.563u 0.092s 1:05.66 99.9% 0+0k 0+0io 0pf+0w
Looking at the assembly I can see that for the original code the inner loops in
k are not unrolled, as guessed by Paul Thomas (only the implied vector loops
being unrolled).
QUESTION 1: Should the frontend do the unrolling for small vectors itself? or
should the middleend be more aggressive for nested loops with small known
iterations?
Moving the invariants on i and j in the k loops outside the main loops gives:
80.313u 0.074s 1:20.39 99.9% 0+0k 0+1io 0pf+0w
Combining the two hand optimizations gives:
35.925u 0.040s 0:35.97 99.9% 0+0k 0+0io 0pf+0w
(without -ffast-math the timing is
59.263u 0.067s 0:59.33 99.9% 0+0k 0+1io 0pf+0w)
but the results change to:
Maximum wand/quad abs rel mutual inductance = 5.95379428444656675E-002
Minimum resq coil/quad abs rel mutual inductance = 0.0000000000000000
Maximum resq coil/quad abs rel mutual inductance = 9.63995250242059842E-002
( Maximum wand/quad abs rel mutual inductance = 5.95379428444659520E-002
Minimum resq coil/quad abs rel mutual inductance = 0.0000000000000000
Maximum resq coil/quad abs rel mutual inductance = 9.63995250242060675E-002
without -ffast-math).
The attached file gives the differences between the original code and the three
variants.
This is to be compared to further optimizations (indu.v2.f90) leading to:
35.376u 0.062s 0:35.44 99.9% 0+0k 0+1io 0pf+0w
or after merging the two loops (indu.v3.f90) to:
34.452u 0.041s 0:34.49 100.0% 0+0k 0+0io 0pf+0w
I have counted the number of sqrt() in the assembly code and found 9 of them in
the slow codes while I only found 5 (10 for indu.v3.f90) of them for the fast
codes. I checked that this was not due to common
subexpressions I may have missed. Looking more closely to the assemby I saw
that he slow codes used 'sqrtsd' while the fast ones used 'sqrtpd' along with
several other packed operations. Now 65.66*5/9=36.48 explaining most of the
speedup. Note that 5=9/2+1, i.e., four packed computations
followed by a scalar one. Owing to the same structure in the two loops, the
two scalar computations could be merged in a packed one, but it is missed by
the optimizer.
I have tried without success to trigger this "vectorization" by making some
variables vectors in k.
QUESTION 2: Why the optimizer is able to vectorize in some cases and not in
others? Can the frontend help to vectorize?
--
Summary: Missed optimizations
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: dominiq at lps dot ens dot fr
GCC build triplet: i686-apple-darwin9
GCC host triplet: i686-apple-darwin9
GCC target triplet: i686-apple-darwin9
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
@ 2007-11-28 15:30 ` dominiq at lps dot ens dot fr
2007-11-28 16:06 ` rguenth at gcc dot gnu dot org
` (31 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: dominiq at lps dot ens dot fr @ 2007-11-28 15:30 UTC (permalink / raw)
To: gcc-bugs
------- Comment #1 from dominiq at lps dot ens dot fr 2007-11-28 15:30 -------
Created an attachment (id=14654)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=14654&action=view)
Diffs between the original file and the simplest variants
In induct.v1.f90 'nominator' and 'denominator' are unrolled by hand. In
induct.v2.f90 the invariants are moved outside the mail loop and induct.v3.f90
combines the two.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
2007-11-28 15:30 ` [Bug tree-optimization/34265] " dominiq at lps dot ens dot fr
@ 2007-11-28 16:06 ` rguenth at gcc dot gnu dot org
2007-11-28 16:14 ` dominiq at lps dot ens dot fr
` (30 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2007-11-28 16:06 UTC (permalink / raw)
To: gcc-bugs
------- Comment #2 from rguenth at gcc dot gnu dot org 2007-11-28 16:06 -------
GCC doesn't have a facility to split the inner loop and move it out of the
outer loops by introducing a array temporary.
As for completely unrolling, this only happens for innermost loops(?) and you
can tune the heuristics with --param max-completely-peeled-insns=N (defaults to
400) and --param max-completely-peel-times (defaults to 16). Use
-funroll-loops
to enable this.
Note that complete unrolling happens too late to help LIM or vectorization.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
2007-11-28 15:30 ` [Bug tree-optimization/34265] " dominiq at lps dot ens dot fr
2007-11-28 16:06 ` rguenth at gcc dot gnu dot org
@ 2007-11-28 16:14 ` dominiq at lps dot ens dot fr
2007-11-28 16:18 ` rguenth at gcc dot gnu dot org
` (29 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: dominiq at lps dot ens dot fr @ 2007-11-28 16:14 UTC (permalink / raw)
To: gcc-bugs
------- Comment #3 from dominiq at lps dot ens dot fr 2007-11-28 16:14 -------
> Note that complete unrolling happens too late to help LIM or vectorization.
Could this be translated as a YES to my first question: the fortran frontend
should unroll computations for short vectors?
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (2 preceding siblings ...)
2007-11-28 16:14 ` dominiq at lps dot ens dot fr
@ 2007-11-28 16:18 ` rguenth at gcc dot gnu dot org
2007-11-28 16:34 ` rguenth at gcc dot gnu dot org
` (28 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2007-11-28 16:18 UTC (permalink / raw)
To: gcc-bugs
------- Comment #4 from rguenth at gcc dot gnu dot org 2007-11-28 16:17 -------
I would in principle say no - we can instead improve the middle-end here. But
it may pay off to not generate a loop for short vectors in case the resulting
IL is smaller for example. Of course it would duplicate logic in the frontend
if that is not already available, so from a middle-end point of view we should
fix it there instead (the same problems happen for C and C++). See PR34223.
--
rguenth at gcc dot gnu dot org changed:
What |Removed |Added
----------------------------------------------------------------------------
BugsThisDependsOn| |34223
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (3 preceding siblings ...)
2007-11-28 16:18 ` rguenth at gcc dot gnu dot org
@ 2007-11-28 16:34 ` rguenth at gcc dot gnu dot org
2007-11-28 18:18 ` dominiq at lps dot ens dot fr
` (27 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2007-11-28 16:34 UTC (permalink / raw)
To: gcc-bugs
------- Comment #5 from rguenth at gcc dot gnu dot org 2007-11-28 16:33 -------
Created an attachment (id=14655)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=14655&action=view)
patch for early complete unrolling of inner loops
For example with a patch like this.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (4 preceding siblings ...)
2007-11-28 16:34 ` rguenth at gcc dot gnu dot org
@ 2007-11-28 18:18 ` dominiq at lps dot ens dot fr
2007-11-28 18:49 ` dominiq at lps dot ens dot fr
` (26 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: dominiq at lps dot ens dot fr @ 2007-11-28 18:18 UTC (permalink / raw)
To: gcc-bugs
------- Comment #6 from dominiq at lps dot ens dot fr 2007-11-28 18:18 -------
Subject: Re: Missed optimizations
> For example with a patch like this.
You also need
--- ../_gcc_clean/gcc/tree-flow.h 2007-11-16 16:17:46.000000000 +0100
+++ ../gcc-4.3-work/gcc/tree-flow.h 2007-11-28 18:56:42.000000000 +0100
@@ -980,7 +980,7 @@
void tree_ssa_lim (void);
unsigned int tree_ssa_unswitch_loops (void);
unsigned int canonicalize_induction_variables (void);
-unsigned int tree_unroll_loops_completely (bool);
+unsigned int tree_unroll_loops_completely (bool, bool);
unsigned int tree_ssa_prefetch_arrays (void);
unsigned int remove_empty_loops (void);
void tree_ssa_iv_optimize (void);
Still building.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (5 preceding siblings ...)
2007-11-28 18:18 ` dominiq at lps dot ens dot fr
@ 2007-11-28 18:49 ` dominiq at lps dot ens dot fr
2007-11-28 20:48 ` jb at gcc dot gnu dot org
` (25 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: dominiq at lps dot ens dot fr @ 2007-11-28 18:49 UTC (permalink / raw)
To: gcc-bugs
------- Comment #7 from dominiq at lps dot ens dot fr 2007-11-28 18:48 -------
Subject: Re: Missed optimizations
With your patch the runtime went from
93.670u 0.103s 1:33.85 99.9% 0+0k 0+0io 32pf+0w
to
38.741u 0.038s 0:38.85 99.7% 0+0k 0+1io 32pf+0w
Pretty impressive!
Note that with gfortran 4.2.2 the timing is
72.451u 0.046s 1:12.59 99.8% 0+0k 1+0io 33pf+0w
I'll run the full polyhedron suite.
Thanks
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (6 preceding siblings ...)
2007-11-28 18:49 ` dominiq at lps dot ens dot fr
@ 2007-11-28 20:48 ` jb at gcc dot gnu dot org
2007-11-28 21:27 ` burnus at gcc dot gnu dot org
` (24 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: jb at gcc dot gnu dot org @ 2007-11-28 20:48 UTC (permalink / raw)
To: gcc-bugs
------- Comment #8 from jb at gcc dot gnu dot org 2007-11-28 20:48 -------
The vectorization of dot products is covered by PR31738, I suppose
--
jb at gcc dot gnu dot org changed:
What |Removed |Added
----------------------------------------------------------------------------
BugsThisDependsOn| |31738
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (7 preceding siblings ...)
2007-11-28 20:48 ` jb at gcc dot gnu dot org
@ 2007-11-28 21:27 ` burnus at gcc dot gnu dot org
2007-11-28 22:05 ` rguenth at gcc dot gnu dot org
` (23 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: burnus at gcc dot gnu dot org @ 2007-11-28 21:27 UTC (permalink / raw)
To: gcc-bugs
------- Comment #9 from burnus at gcc dot gnu dot org 2007-11-28 21:27 -------
> With your patch the runtime went from
> 93.670u 0.103s 1:33.85 99.9% 0+0k 0+0io 32pf+0w
> to
> 38.741u 0.038s 0:38.85 99.7% 0+0k 0+1io 32pf+0w
Thus: 59% faster. Here, it "only" went ~30% down from 49.89s to ~35.2s.
(AMD64/Linux, -m64). Still quite impressive!
--
burnus at gcc dot gnu dot org changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |burnus at gcc dot gnu dot
| |org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (8 preceding siblings ...)
2007-11-28 21:27 ` burnus at gcc dot gnu dot org
@ 2007-11-28 22:05 ` rguenth at gcc dot gnu dot org
2007-11-28 22:36 ` dominiq at lps dot ens dot fr
` (22 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2007-11-28 22:05 UTC (permalink / raw)
To: gcc-bugs
------- Comment #10 from rguenth at gcc dot gnu dot org 2007-11-28 22:05 -------
Indeed - unexpectedly impressive ;) The patch has (obviously) received no
tuning
as of the placement of the early unrolling in the pass pipeline and early
unrolling is only done if that doesn't increase code-size (as of the metric
of the unrolling pass, of course), unless you specify -O3.
--
rguenth at gcc dot gnu dot org changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |rguenth at gcc dot gnu dot
| |org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (9 preceding siblings ...)
2007-11-28 22:05 ` rguenth at gcc dot gnu dot org
@ 2007-11-28 22:36 ` dominiq at lps dot ens dot fr
2007-11-28 22:49 ` steven at gcc dot gnu dot org
` (21 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: dominiq at lps dot ens dot fr @ 2007-11-28 22:36 UTC (permalink / raw)
To: gcc-bugs
------- Comment #11 from dominiq at lps dot ens dot fr 2007-11-28 22:35 -------
Here are the timings before and after the patch for the polyhedron tests and
some variants:
Before patch After patch
Benchmark Ave Run Number Estim : Ave Run Number Estim
Name (secs) Repeats Err % : (secs) Repeats Err %
--------- ------- ------- ------ : ------- ------- ------
ac 16.92 5 0.0183 : 16.16 5 0.0056
aermod 36.82 5 0.0082 : 36.92 5 0.0106
air 11.38 10 0.0479 : 11.43 11 0.0494
capacita 62.21 5 0.0036 : 61.97 5 0.0343
channel 4.04 12 0.0333 : 4.04 5 0.0160
doduc 58.07 5 0.0257 : 57.56 5 0.0164
fatigue 14.94 5 0.0338 : 14.33 5 0.0184
gas_dyn 11.78 17 0.0448 : 11.89 18 0.0349
induct 93.27 5 0.0093 : 36.93 5 0.0205
linpk 28.15 5 0.0099 : 28.21 5 0.0259
mdbx 16.80 5 0.0112 : 16.83 5 0.0051
nf 32.45 5 0.0388 : 32.63 10 0.0495
protein 55.63 5 0.0069 : 54.86 5 0.0305
rnflow 45.88 5 0.0366 : 46.06 5 0.0230
test_fpu 14.64 5 0.0115 : 14.46 5 0.0207
tfft 3.04 5 0.0380 : 3.06 20 0.0284
ac_v1 16.15 5 0.0197 : 15.16 5 0.0109
air_v1 10.81 5 0.0411 : 10.88 10 0.0471
capacita_8 69.45 5 0.0136 : 69.41 5 0.0091
capacita_10 113.20 5 0.0200 : 112.44 5 0.0290
chan_v1 2.23 5 0.0183 : 2.24 7 0.0471
channel_10 16.61 5 0.0351 : 16.64 14 0.0492
fatigue_v1 13.26 5 0.0071 : 12.05 5 0.0148
fatigue_10 20.54 15 0.0312 : 21.73 5 0.0117
induct_v2 35.07 5 0.0007 : 60.08 5 0.0236
induct_v3 34.40 5 0.0189 : 58.64 5 0.0249
induct_vm 262.95 2 0.0000 : 253.62 2 0.0197
induct_10 100.12 5 0.0053 : 84.65 5 0.0008
kepler 22.73 5 0.0123 : 26.11 5 0.0069
kepler_10 69.59 5 0.0047 : 61.42 5 0.0110
nf_10 58.00 5 0.0413 : 58.36 5 0.0388
protein_10 57.04 5 0.0167 : 56.38 5 0.0486
test_fpu_v1 15.15 5 0.0195 : 14.98 5 0.0104
test_fpu_10 34.75 5 0.0408 : 34.68 5 0.0120
tfft_8 6.81 5 0.0110 : 6.83 5 0.0371
tfft_10 14.36 5 0.0373 : 14.40 6 0.0496
Before patch After patch
Benchmark Compile Executable : Compile Executable
Name (secs) (bytes) : (secs) (bytes)
--------- ------- ---------- : ------- ----------
ac 4.52 50628 : 4.60 50628
aermod 96.22 1288460 : 106.72 1288460
air 6.57 80956 : 6.68 80956
capacita 3.18 60140 : 3.34 64236
channel 1.55 38532 : 1.65 38532
doduc 13.52 183264 : 14.02 191456
fatigue 5.69 84564 : 5.83 80468
gas_dyn 5.36 695776 : 5.50 695776
induct 11.65 160132 : 12.02 168324
linpk 1.67 46512 : 1.71 46512
mdbx 3.76 72672 : 3.85 72672
nf 4.45 87644 : 4.46 87644
protein 11.18 113900 : 11.45 113900
rnflow 11.58 187316 : 11.74 187316
test_fpu 11.23 182544 : 11.09 178448
tfft 1.30 34420 : 1.33 34420
ac_v1 4.51 50628 : 4.60 50628
air_v1 6.63 80956 : 6.75 80956
capacita_8 3.20 60136 : 3.33 64232
capacita_10 3.19 64216 : 3.42 68312
chan_v1 1.85 38500 : 1.85 38500
channel_10 1.32 34392 : 1.40 34392
fatigue_v1 5.75 84524 : 5.77 80428
fatigue_10 4.91 76352 : 4.91 76352
induct_v2 11.72 168324 : 12.19 172420
induct_v3 11.72 164228 : 12.12 172420
induct_vm 11.44 160132 : 11.73 164228
induct_10 11.47 159964 : 11.89 159964
kepler 0.34 17652 : 0.35 17652
kepler_10 0.33 17632 : 0.34 17632
nf_10 2.03 46684 : 2.07 46684
protein_10 7.01 93400 : 7.18 93400
test_fpu_v1 11.23 182592 : 11.17 178496
test_fpu_10 6.54 117056 : 6.40 117056
tfft_8 1.25 30348 : 1.31 30348
tfft_10 1.15 30328 : 1.18 30328
The only timings significantly changed by the patch are the induct avatars,
with the strange result that the variants which missed the vectorization are
now vectorized, while those previously vectorized are not any more (also true
for the variants of the first attachment). So there is probably some need of a
little bit of tuning.
I have also to regtest and do some further investigations.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (10 preceding siblings ...)
2007-11-28 22:36 ` dominiq at lps dot ens dot fr
@ 2007-11-28 22:49 ` steven at gcc dot gnu dot org
2007-11-28 23:07 ` kargl at gcc dot gnu dot org
` (20 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: steven at gcc dot gnu dot org @ 2007-11-28 22:49 UTC (permalink / raw)
To: gcc-bugs
------- Comment #12 from steven at gcc dot gnu dot org 2007-11-28 22:49 -------
The only timings significantly changed are actually the compile times, which go
up significantly.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (11 preceding siblings ...)
2007-11-28 22:49 ` steven at gcc dot gnu dot org
@ 2007-11-28 23:07 ` kargl at gcc dot gnu dot org
2007-11-28 23:18 ` steven at gcc dot gnu dot org
` (19 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: kargl at gcc dot gnu dot org @ 2007-11-28 23:07 UTC (permalink / raw)
To: gcc-bugs
------- Comment #13 from kargl at gcc dot gnu dot org 2007-11-28 23:06 -------
(In reply to comment #12)
> The only timings significantly changed are actually the compile times, which go
> up significantly.
>
Look at the kepler execution time. 22.73 s without the patch and
26.11 s with the patch. That's a 15% decrease in execution speed
(ie., it runs slower!).
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (12 preceding siblings ...)
2007-11-28 23:07 ` kargl at gcc dot gnu dot org
@ 2007-11-28 23:18 ` steven at gcc dot gnu dot org
2007-11-28 23:57 ` dominiq at lps dot ens dot fr
` (18 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: steven at gcc dot gnu dot org @ 2007-11-28 23:18 UTC (permalink / raw)
To: gcc-bugs
------- Comment #14 from steven at gcc dot gnu dot org 2007-11-28 23:17 -------
Yes, that too. It was more a sarcastic addendum to your remark that there were
so few significantly changed numbers. It seemed to me you should not look at
just the execution times ;-)
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (13 preceding siblings ...)
2007-11-28 23:18 ` steven at gcc dot gnu dot org
@ 2007-11-28 23:57 ` dominiq at lps dot ens dot fr
2007-11-29 8:06 ` dominiq at lps dot ens dot fr
` (17 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: dominiq at lps dot ens dot fr @ 2007-11-28 23:57 UTC (permalink / raw)
To: gcc-bugs
------- Comment #15 from dominiq at lps dot ens dot fr 2007-11-28 23:57 -------
If I am allowed to be sacarstic too, I'll say that the increase in compile time
(worst case 11%, arithmetic average 5%) is not against the current trend one
can see for instance in
http://www.suse.de/~gcctest/c++bench/polyhedron/polyhedron-summary.txt-1-0.html
for no gain at all on the execution time (see also the thread
http://gcc.gnu.org/ml/fortran/2007-07/msg00276.html).
Now I do expect that there will be never patch commited worst than the
Richard's one!
It came very fast: about one hour after my post.
It did not break anything so far.
It did the optimizations it was supposed to do on the intended code and some
variants, even if it broke the vectorization some other variants and increased
the execution time of kepler by 15%.
At least it comfirmed that the bottleneck for induct was both the loop
unrolling and vectorization. Indeed it remains to understand why vectorization
is no longer applied to codes for which it was before the patch.
To be clear, I think it is a mistake to use the f90 array features on small
vectors, but I have seen it more often than I'ld like. So this is a kind of
optimization that can find its place for real life codes and not only
benchmarks.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (14 preceding siblings ...)
2007-11-28 23:57 ` dominiq at lps dot ens dot fr
@ 2007-11-29 8:06 ` dominiq at lps dot ens dot fr
2007-11-29 10:12 ` rguenth at gcc dot gnu dot org
` (16 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: dominiq at lps dot ens dot fr @ 2007-11-29 8:06 UTC (permalink / raw)
To: gcc-bugs
------- Comment #16 from dominiq at lps dot ens dot fr 2007-11-29 08:06 -------
A quick report of the comparison between the regression results for revision
130500 + patch in comment #5 + Tobias' patch for pr34262 and revision 130489 +
some patches applied to rev. 130500. I have the following new failures:
ERROR: gcc.dg/tree-ssa/loop-1.c: error executing dg-final: no files matched
glob pattern "loop-1.c.[0-9][0-9][0-9]t.cunroll"
UNRESOLVED: gcc.dg/tree-ssa/loop-1.c: error executing dg-final: no files
matched glob pattern "loop-1.c.[0-9][0-9][0-9]t.cunroll"
FAIL: gcc.dg/tree-ssa/loop-17.c scan-tree-dump sccp "set_nb_iterations_in_loop
= 1"
ERROR: gcc.dg/tree-ssa/loop-23.c: error executing dg-final: no files matched
glob pattern "loop-23.c.[0-9][0-9][0-9]t.cunroll"
UNRESOLVED: gcc.dg/tree-ssa/loop-23.c: error executing dg-final: no files
matched glob pattern "loop-23.c.[0-9][0-9][0-9]t.cunroll"
FAIL: gcc.dg/vect/vect-105.c scan-tree-dump-times vect "vectorized 1 loops" 1
ERROR: gcc.dg/vect/vect-11a.c: error executing dg-final: no files matched glob
pattern "vect-11a.c.[0-9][0-9][0-9]t.vect"
UNRESOLVED: gcc.dg/vect/vect-11a.c: error executing dg-final: no files matched
glob pattern "vect-11a.c.[0-9][0-9][0-9]t.vect"
FAIL: gcc.dg/vect/vect-66.c scan-tree-dump-times vect "vectorized 3 loops" 1
FAIL: gcc.dg/vect/vect-76.c scan-tree-dump-times vect "vectorized 3 loops" 1
FAIL: gcc.dg/vect/vect-92.c scan-tree-dump-times vect "vectorized 1 loops" 3
FAIL: gcc.dg/vect/vect-92.c scan-tree-dump-times vect "Alignment of access
forced using peeling" 3
FAIL: gcc.dg/vect/vect-outer-1.c scan-tree-dump-times vect "strided access in
outer loop" 1
FAIL: gcc.dg/vect/vect-outer-6.c scan-tree-dump-times vect "OUTER LOOP
VECTORIZED" 1
FAIL: gcc.dg/vect/vect-outer-6.c scan-tree-dump-times vect "zero step in outer
loop." 1
ERROR: gcc.dg/vect/vect-shift-1.c: error executing dg-final: no files matched
glob pattern "vect-shift-1.c.[0-9][0-9][0-9]t.vect"
UNRESOLVED: gcc.dg/vect/vect-shift-1.c: error executing dg-final: no files
matched glob pattern "vect-shift-1.c.[0-9][0-9][0-9]t.vect"
FAIL: gcc.dg/vect/no-section-anchors-vect-66.c scan-tree-dump-times vect
"vectorized 3 loops" 1
FAIL: gcc.dg/vect/no-section-anchors-vect-66.c scan-tree-dump-times vect
"Alignment of access forced using peeling" 1
FAIL: gcc.dg/vect/no-section-anchors-vect-69.c scan-tree-dump-times vect
"vectorized 4 loops" 1
FAIL: gcc.dg/vect/no-section-anchors-vect-69.c scan-tree-dump-times vect
"Alignment of access forced using peeling" 2
ERROR: gcc.target/i386/vectorize1.c: error executing dg-final: no files matched
glob pattern "vectorize1.c.[0-9][0-9][0-9]t.vect"
UNRESOLVED: gcc.target/i386/vectorize1.c: error executing dg-final: no files
matched glob pattern "vectorize1.c.[0-9][0-9][0-9]t.vect"
FAIL: gfortran.dg/array_1.f90 -O3 -fomit-frame-pointer execution test
FAIL: gfortran.dg/array_1.f90 -O3 -fomit-frame-pointer -funroll-loops
execution test
FAIL: gfortran.dg/array_1.f90 -O3 -fomit-frame-pointer -funroll-all-loops
-finline-functions execution test
FAIL: gfortran.dg/array_1.f90 -O3 -g execution test
I am waiting for directives on how I can investigate further these problems.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (15 preceding siblings ...)
2007-11-29 8:06 ` dominiq at lps dot ens dot fr
@ 2007-11-29 10:12 ` rguenth at gcc dot gnu dot org
2007-11-29 10:22 ` dominiq at lps dot ens dot fr
` (15 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2007-11-29 10:12 UTC (permalink / raw)
To: gcc-bugs
------- Comment #17 from rguenth at gcc dot gnu dot org 2007-11-29 10:11 -------
Doh, not only I missed to diff the chunk mentioned in comment #6, but I also
added the original unrolling pass, not the one only supposed to unroll inner
loops #)
So, change the passes.c hunk to
Index: gcc/passes.c
===================================================================
--- gcc/passes.c (revision 130511)
+++ gcc/passes.c (working copy)
@@ -570,6 +570,9 @@ init_optimization_passes (void)
NEXT_PASS (pass_merge_phi);
NEXT_PASS (pass_vrp);
NEXT_PASS (pass_dce);
+ NEXT_PASS (pass_tree_loop_init);
+ NEXT_PASS (pass_complete_unrolli);
+ NEXT_PASS (pass_tree_loop_done);
NEXT_PASS (pass_cselim);
NEXT_PASS (pass_dominator);
/* The only const/copy propagation opportunities left after
that should fix some of the testsuite failures. Some thing also to experiment
with (to maybe fix some of the compile-time problems) is in the
tree-lssa-loop-ivcanon.c hunk change the condition to
if (!unroll_outer
&& loop->inner)
continue;
to only unroll innermost loops, not all-but-outermost loops.
As of pass placement another thing to look at is if it works as part of
early optimizations around
NEXT_PASS (pass_early_inline);
NEXT_PASS (pass_cleanup_cfg);
NEXT_PASS (pass_rename_ssa_copies);
.... here
NEXT_PASS (pass_ccp);
NEXT_PASS (pass_forwprop);
NEXT_PASS (pass_update_address_taken);
.... or here
NEXT_PASS (pass_simple_dse);
NEXT_PASS (pass_sra_early);
because this may enable SRA of variables in the loop body.
Most of the compile-time impact is actually from doing loop discovery, but
as we preserve loops now maybe we do not need pass_tree_loop_done after
the early unrolling and as well not pass_tree_loop_init before the rest
of loop optimizations anymore? Zdenek, can you confirm this?
--
rguenth at gcc dot gnu dot org changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |rakdver at gcc dot gnu dot
| |org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (16 preceding siblings ...)
2007-11-29 10:12 ` rguenth at gcc dot gnu dot org
@ 2007-11-29 10:22 ` dominiq at lps dot ens dot fr
2007-11-29 10:40 ` dominiq at lps dot ens dot fr
` (14 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: dominiq at lps dot ens dot fr @ 2007-11-29 10:22 UTC (permalink / raw)
To: gcc-bugs
------- Comment #18 from dominiq at lps dot ens dot fr 2007-11-29 10:22 -------
I have had a look at what's happening for kepler.f90 (from the 2004 polyhedron
test suite?) and it looks like another missed vectorization: if I count the
mulpd in the kepler.s files, I find 24 before the patch and none after.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (17 preceding siblings ...)
2007-11-29 10:22 ` dominiq at lps dot ens dot fr
@ 2007-11-29 10:40 ` dominiq at lps dot ens dot fr
2007-11-29 11:01 ` dominiq at lps dot ens dot fr
` (13 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: dominiq at lps dot ens dot fr @ 2007-11-29 10:40 UTC (permalink / raw)
To: gcc-bugs
------- Comment #19 from dominiq at lps dot ens dot fr 2007-11-29 10:40 -------
Richard,
I am not sure to understand your patch in comment #17. I have already in
gcc/passes.c (after your patch in comment #5):
NEXT_PASS (pass_merge_phi);
NEXT_PASS (pass_vrp);
NEXT_PASS (pass_dce);
NEXT_PASS (pass_tree_loop_init);
NEXT_PASS (pass_complete_unroll);
NEXT_PASS (pass_tree_loop_done);
NEXT_PASS (pass_cselim);
NEXT_PASS (pass_dominator);
/* The only const/copy propagation opportunities left after
do you mean that I should change
NEXT_PASS (pass_complete_unroll);
to
NEXT_PASS (pass_complete_unrolli);
? I am assuming my interpretation is correct and rebuild gcc right now.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (18 preceding siblings ...)
2007-11-29 10:40 ` dominiq at lps dot ens dot fr
@ 2007-11-29 11:01 ` dominiq at lps dot ens dot fr
2007-11-29 11:13 ` rguenther at suse dot de
` (12 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: dominiq at lps dot ens dot fr @ 2007-11-29 11:01 UTC (permalink / raw)
To: gcc-bugs
------- Comment #20 from dominiq at lps dot ens dot fr 2007-11-29 11:00 -------
I have applied my interpretation of the first two changes in comment #17.
gfortran.dg/array_1.f90 still abort and induct.v3.f90 is still not vectorized.
The good news are that induct.f90 is still properly unrolled and vectorized.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (19 preceding siblings ...)
2007-11-29 11:01 ` dominiq at lps dot ens dot fr
@ 2007-11-29 11:13 ` rguenther at suse dot de
2007-11-29 11:16 ` dominiq at lps dot ens dot fr
` (11 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: rguenther at suse dot de @ 2007-11-29 11:13 UTC (permalink / raw)
To: gcc-bugs
------- Comment #21 from rguenther at suse dot de 2007-11-29 11:13 -------
Subject: Re: Missed optimizations
On Thu, 29 Nov 2007, dominiq at lps dot ens dot fr wrote:
> Richard,
>
> I am not sure to understand your patch in comment #17. I have already in
> gcc/passes.c (after your patch in comment #5):
>
> NEXT_PASS (pass_merge_phi);
> NEXT_PASS (pass_vrp);
> NEXT_PASS (pass_dce);
> NEXT_PASS (pass_tree_loop_init);
> NEXT_PASS (pass_complete_unroll);
> NEXT_PASS (pass_tree_loop_done);
> NEXT_PASS (pass_cselim);
> NEXT_PASS (pass_dominator);
> /* The only const/copy propagation opportunities left after
>
> do you mean that I should change
>
> NEXT_PASS (pass_complete_unroll);
>
> to
>
> NEXT_PASS (pass_complete_unrolli);
>
> ? I am assuming my interpretation is correct and rebuild gcc right now.
Yes, that's correct - I did too much copy&paste there :)
Richard.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (20 preceding siblings ...)
2007-11-29 11:13 ` rguenther at suse dot de
@ 2007-11-29 11:16 ` dominiq at lps dot ens dot fr
2007-11-29 12:25 ` dominiq at lps dot ens dot fr
` (10 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: dominiq at lps dot ens dot fr @ 2007-11-29 11:16 UTC (permalink / raw)
To: gcc-bugs
------- Comment #22 from dominiq at lps dot ens dot fr 2007-11-29 11:16 -------
In top of the first two patches of comment #17, I have MOVED
+ NEXT_PASS (pass_tree_loop_init);
+ NEXT_PASS (pass_complete_unrolli);
+ NEXT_PASS (pass_tree_loop_done);
to the first suggested place. Now gfortran.dg/array_1.f90 pass the test, induct
is no longer unrolled/vectorized, but induct.v3 is: back to before the patch at
least on these quick tests.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (21 preceding siblings ...)
2007-11-29 11:16 ` dominiq at lps dot ens dot fr
@ 2007-11-29 12:25 ` dominiq at lps dot ens dot fr
2007-11-29 15:49 ` dominiq at lps dot ens dot fr
` (9 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: dominiq at lps dot ens dot fr @ 2007-11-29 12:25 UTC (permalink / raw)
To: gcc-bugs
------- Comment #23 from dominiq at lps dot ens dot fr 2007-11-29 12:24 -------
In top of the first two patches of comment #17, I have MOVED
+ NEXT_PASS (pass_tree_loop_init);
+ NEXT_PASS (pass_complete_unrolli);
+ NEXT_PASS (pass_tree_loop_done);
to the second suggested place. Same result as in comment #22.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (22 preceding siblings ...)
2007-11-29 12:25 ` dominiq at lps dot ens dot fr
@ 2007-11-29 15:49 ` dominiq at lps dot ens dot fr
2007-11-30 21:38 ` ubizjak at gmail dot com
` (8 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: dominiq at lps dot ens dot fr @ 2007-11-29 15:49 UTC (permalink / raw)
To: gcc-bugs
------- Comment #24 from dominiq at lps dot ens dot fr 2007-11-29 15:49 -------
I think I have now a partial understanding of what is happening for the induct
variants that do not vectorize with the patch in comment #5: they do not
contain any loop inside the k loop. If I replace
rot_q_vector(1) = rot_c_vector(1) - rot_qk_vector(k,1)
rot_q_vector(2) = rot_c_vector(2) - rot_qk_vector(k,2)
rot_q_vector(3) = rot_c_vector(3) - rot_qk_vector(k,3)
by
rot_q_vector(:) = rot_c_vector(:) - rot_qk_vector(k,:)
Then the loop is vectorized again.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (23 preceding siblings ...)
2007-11-29 15:49 ` dominiq at lps dot ens dot fr
@ 2007-11-30 21:38 ` ubizjak at gmail dot com
2007-12-03 14:08 ` dominiq at lps dot ens dot fr
` (7 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: ubizjak at gmail dot com @ 2007-11-30 21:38 UTC (permalink / raw)
To: gcc-bugs
------- Comment #25 from ubizjak at gmail dot com 2007-11-30 21:38 -------
(In reply to comment #24)
> Then the loop is vectorized again.
IMO, SLP should vectorize the sequence.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (24 preceding siblings ...)
2007-11-30 21:38 ` ubizjak at gmail dot com
@ 2007-12-03 14:08 ` dominiq at lps dot ens dot fr
2007-12-03 14:32 ` dominiq at lps dot ens dot fr
` (6 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: dominiq at lps dot ens dot fr @ 2007-12-03 14:08 UTC (permalink / raw)
To: gcc-bugs
------- Comment #26 from dominiq at lps dot ens dot fr 2007-12-03 14:08 -------
> IMO, SLP should vectorize the sequence.
Uros,
What is the meaning of the above sentence?
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (25 preceding siblings ...)
2007-12-03 14:08 ` dominiq at lps dot ens dot fr
@ 2007-12-03 14:32 ` dominiq at lps dot ens dot fr
2007-12-03 14:34 ` dominiq at lps dot ens dot fr
` (5 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: dominiq at lps dot ens dot fr @ 2007-12-03 14:32 UTC (permalink / raw)
To: gcc-bugs
------- Comment #27 from dominiq at lps dot ens dot fr 2007-12-03 14:32 -------
I have had a look at the failure of gfortran.dg/array_1.f90 with patch #5. The
following reduced code gives the same failure:
! { dg-do run }
! PR 15553 : the array used to be filled with garbage
! this problem disappeared between 2004-05-20 and 2004-09-15
program arrpack
implicit none
double precision x(10,10), tmp(6,5)
integer i, j
x = -1
do i=1,6
do j=1,5
x(i,j) = i+j*10
end do
end do
tmp(:,:) = x(1:6, 1:5)
print '(6F8.2)', tmp
end program arrpack
With -O3 and patch #5, the output is
11.00 12.00 13.00 14.00 15.00 16.00
-1.00 -1.00 -1.00 -1.00 21.00 22.00
23.00 24.00 25.00 26.00 -1.00 -1.00
-1.00 -1.00 31.00 32.00 33.00 34.00
35.00 36.00 -1.00 -1.00 -1.00 -1.00
instead of
11.00 12.00 13.00 14.00 15.00 16.00
21.00 22.00 23.00 24.00 25.00 26.00
31.00 32.00 33.00 34.00 35.00 36.00
41.00 42.00 43.00 44.00 45.00 46.00
51.00 52.00 53.00 54.00 55.00 56.00
I am amaze that it is the only failure of this kind for the several 1000 tests
I have passed!
I'll attach the the results of -fdump-tree-optimize for with and without patch
#5.
I have also looked at the gcc failures. Most of them are missed vectorizations
or new ones. So this is already reported. Is *.[0-9][0-9][0-9]t.vect supposed
to exist if the vectorization is missed? If yes, this explaina few failures.
Concerning the failures with *.[0-9][0-9][0-9]t.cunroll, I see *cunroll1/2, but
no cunroll.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (26 preceding siblings ...)
2007-12-03 14:32 ` dominiq at lps dot ens dot fr
@ 2007-12-03 14:34 ` dominiq at lps dot ens dot fr
2007-12-03 14:34 ` dominiq at lps dot ens dot fr
` (4 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: dominiq at lps dot ens dot fr @ 2007-12-03 14:34 UTC (permalink / raw)
To: gcc-bugs
------- Comment #28 from dominiq at lps dot ens dot fr 2007-12-03 14:33 -------
Created an attachment (id=14691)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=14691&action=view)
result of -fdump-tree-optimized with patch #5
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (27 preceding siblings ...)
2007-12-03 14:34 ` dominiq at lps dot ens dot fr
@ 2007-12-03 14:34 ` dominiq at lps dot ens dot fr
2007-12-03 16:30 ` ubizjak at gmail dot com
` (3 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: dominiq at lps dot ens dot fr @ 2007-12-03 14:34 UTC (permalink / raw)
To: gcc-bugs
------- Comment #29 from dominiq at lps dot ens dot fr 2007-12-03 14:34 -------
Created an attachment (id=14692)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=14692&action=view)
result of -fdump-tree-optimized without patch #5
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (28 preceding siblings ...)
2007-12-03 14:34 ` dominiq at lps dot ens dot fr
@ 2007-12-03 16:30 ` ubizjak at gmail dot com
2007-12-03 18:59 ` dominiq at lps dot ens dot fr
` (2 subsequent siblings)
32 siblings, 0 replies; 38+ messages in thread
From: ubizjak at gmail dot com @ 2007-12-03 16:30 UTC (permalink / raw)
To: gcc-bugs
------- Comment #30 from ubizjak at gmail dot com 2007-12-03 16:30 -------
(In reply to comment #26)
> > IMO, SLP should vectorize the sequence.
> What is the meaning of the above sentence?
Uh, sorry for being terse. If there are no loops, then "straight-line
parallelization" [SLP] should vectorize your manually unrolled sequence in
comment #24. You can look into testsuite/gcc.dg/vect/slp-*.c for some c code
examples.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (29 preceding siblings ...)
2007-12-03 16:30 ` ubizjak at gmail dot com
@ 2007-12-03 18:59 ` dominiq at lps dot ens dot fr
2007-12-04 6:57 ` irar at il dot ibm dot com
2008-04-23 21:27 ` dominiq at lps dot ens dot fr
32 siblings, 0 replies; 38+ messages in thread
From: dominiq at lps dot ens dot fr @ 2007-12-03 18:59 UTC (permalink / raw)
To: gcc-bugs
------- Comment #31 from dominiq at lps dot ens dot fr 2007-12-03 18:58 -------
> If there are no loops, then "straight-line parallelization" [SLP] should vectorize
> your manually unrolled sequence in comment #24.
Yes it should, but if does not after patch #5. The unanswered question so far
is why it does not, then how to change the patch so that it does it. Anyhow,
the "good" vectorization should be along the k loop (length 9 instead of 3). My
understanding of my tests is first that 5/9<2/3 and, more important, the
packing/unpacking overhead is a smaller penalty if it is shared as in the k
vectorization.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (30 preceding siblings ...)
2007-12-03 18:59 ` dominiq at lps dot ens dot fr
@ 2007-12-04 6:57 ` irar at il dot ibm dot com
2008-04-23 21:27 ` dominiq at lps dot ens dot fr
32 siblings, 0 replies; 38+ messages in thread
From: irar at il dot ibm dot com @ 2007-12-04 6:57 UTC (permalink / raw)
To: gcc-bugs
------- Comment #32 from irar at il dot ibm dot com 2007-12-04 06:56 -------
(In reply to comment #30)
> Uh, sorry for being terse. If there are no loops, then "straight-line
> parallelization" [SLP] should vectorize your manually unrolled sequence in
> comment #24.
Currently only loop-aware SLP is implemented, meaning the code must be enclosed
in a loop to get vectorized.
> You can look into testsuite/gcc.dg/vect/slp-*.c for some c code
> examples.
>
Right. For example,
for (i = 0; i < N; i++)
{
out[i*4] = a0;
out[i*4 + 1] = a1;
out[i*4 + 2] = a2;
out[i*4 + 3] = a3;
}
will be transformed into:
for (i = 0; i < N; i++)
{
out[i*4:i*4+3] = {a0, a1, a2, a3};
}
Ira
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
` (31 preceding siblings ...)
2007-12-04 6:57 ` irar at il dot ibm dot com
@ 2008-04-23 21:27 ` dominiq at lps dot ens dot fr
32 siblings, 0 replies; 38+ messages in thread
From: dominiq at lps dot ens dot fr @ 2008-04-23 21:27 UTC (permalink / raw)
To: gcc-bugs
------- Comment #33 from dominiq at lps dot ens dot fr 2008-04-23 21:26 -------
Created an attachment (id=15523)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=15523&action=view)
induct.f90 variants and their diff with the original file
The original diff's have space problems.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
[not found] <bug-34265-4@http.gcc.gnu.org/bugzilla/>
` (2 preceding siblings ...)
2011-09-17 17:53 ` dominiq at lps dot ens.fr
@ 2011-09-26 9:02 ` rguenth at gcc dot gnu.org
3 siblings, 0 replies; 38+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-09-26 9:02 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
Richard Guenther <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |RESOLVED
Resolution| |FIXED
--- Comment #37 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-09-26 08:36:26 UTC ---
Testcases for runtime properties are not supported.
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
[not found] <bug-34265-4@http.gcc.gnu.org/bugzilla/>
2011-05-22 12:33 ` dominiq at lps dot ens.fr
2011-09-16 15:53 ` dominiq at lps dot ens.fr
@ 2011-09-17 17:53 ` dominiq at lps dot ens.fr
2011-09-26 9:02 ` rguenth at gcc dot gnu.org
3 siblings, 0 replies; 38+ messages in thread
From: dominiq at lps dot ens.fr @ 2011-09-17 17:53 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
Dominique d'Humieres <dominiq at lps dot ens.fr> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |irar at gcc dot gnu.org,
| |wschmidt at gcc dot gnu.org
--- Comment #36 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-09-17 17:09:59 UTC ---
The pr 34265 and 49006 have been fixed by revision 176984:
Author: wschmidt
Date: Sun Jul 31 18:58:06 2011 UTC (6 weeks, 5 days ago)
Changed paths: 2
Log Message:
2011-07-29 Bill Schmidt <wschmidt@linux.vnet.ibm.com>
PR tree-optimization/49749
* tree-ssa-reassoc.c (get_rank): New forward declaration.
(PHI_LOOP_BIAS): New macro.
(phi_rank): New function.
(loop_carried_phi): Likewise.
(propagate_rank): Likewise.
(get_rank): Add calls to phi_rank and propagate_rank.
The following table compare the execution time and the line at which the
vectorizer reports the vectorization for the twelve variants described in
comment #34, showing that they are now all vectorized:
revision 176983 176984 178905
num den rot time line time line time line
o o o 2.154s 243 2.154s 243 2.133s 243
u o o 1.973s 243 1.970s 243 2.035s 243
o u o 2.024s 243 2.023s 243 1.977s 243
u u o 3.053s 1.817s 234 1.831s 234
o o u 3.015s 1.839s 234 1.841s 234
u o u 3.030s 1.828s 234 1.816s 234
o u u 3.049s 1.818s 234 1.834s 234
u u u 3.059s 1.820s 234 1.818s 234
o o f 3.010s 1.825s 234 1.822s 234
u o f 3.033s 1.836s 234 1.826s 234
o u f 3.061s 1.814s 234 1.828s 234
u u f 3.058s 1.825s 234 1.812s 234
graphite 1.937s 243 1.938s 243 1.912s 243
(num, den and rot stand for numerator, denominator and rotate respectively; o,
u, and f stand for original, unrolled, and folded.
Since now gcc has (at least for this class of code;) the three properties I
expect from a good optimizer:
(1) it does not destroy the hand optimization I have done;
(2) it optimizes the original code;
(3) it has a consistent behavior across variants;
I think a test should be added to the test suite to check that none of these
properties are lost in future revisions.
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
[not found] <bug-34265-4@http.gcc.gnu.org/bugzilla/>
2011-05-22 12:33 ` dominiq at lps dot ens.fr
@ 2011-09-16 15:53 ` dominiq at lps dot ens.fr
2011-09-17 17:53 ` dominiq at lps dot ens.fr
2011-09-26 9:02 ` rguenth at gcc dot gnu.org
3 siblings, 0 replies; 38+ messages in thread
From: dominiq at lps dot ens.fr @ 2011-09-16 15:53 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
--- Comment #35 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-09-16 15:42:15 UTC ---
This pr (as well as pr49006) seems to have been fixed between revisions 176696
and 177649. I am closing
pr49006 as fixed and I'll use this pr to track the remaining issues.
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Bug tree-optimization/34265] Missed optimizations
[not found] <bug-34265-4@http.gcc.gnu.org/bugzilla/>
@ 2011-05-22 12:33 ` dominiq at lps dot ens.fr
2011-09-16 15:53 ` dominiq at lps dot ens.fr
` (2 subsequent siblings)
3 siblings, 0 replies; 38+ messages in thread
From: dominiq at lps dot ens.fr @ 2011-05-22 12:33 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265
--- Comment #34 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-05-22 12:06:20 UTC ---
Created attachment 24325
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24325
reduced tests
The attached bzipped tar contains the files induct_red.f90 with the all the
infrastructure to provide a realistic framework to run a reduced version of
the subroutine mutual_ind_quad_cir_coil contained in induct_qc_x.F90
(reduced to only one critical nested loops).
When the macro XPA is defined the original rotate code
rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:))
rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:))
rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:))
is unrolled as (q_vector(2)==0) if the macro FLD is not defined
rot_q_vector(1) = rotate_quad(1,1) * q_vector(1) + &
rotate_quad(1,2) * q_vector(2)
rot_q_vector(2) = rotate_quad(2,1) * q_vector(1) + &
rotate_quad(2,2) * q_vector(2)
rot_q_vector(3) = rotate_quad(3,1) * q_vector(1) + &
rotate_quad(3,2) * q_vector(2)
Otherwise it is folded as
rot_q_vector(:) = rotate_quad(:,1) * q_vector(1) + &
rotate_quad(:,2) * q_vector(2)
When the macro XPB is defined the original numerator
numerator = w1gauss(j) * w2gauss(k) * &
dot_product(coil_current_vec,current_vector)
is unrolled as
numerator = w1gauss(j) * w2gauss(k) * &
(coil_current_vec(1)*current_vector(1) + &
coil_current_vec(2)*current_vector(2) + &
coil_current_vec(3)*current_vector(3))
When the macro XPC is defined the original denominator
denominator = sqrt(dot_product(rot_c_vector-rot_q_vector, &
rot_c_vector-rot_q_vector))
is unrolled as
denominator = sqrt((rot_c_vector(1)-rot_q_vector(1))**2 + &
(rot_c_vector(2)-rot_q_vector(2))**2 + &
(rot_c_vector(3)-rot_q_vector(3))**2)
It contains also a script to run the twelve cases and one case with
graphite and the raw results for revisions 167530, 167531, and 173917
(original, with r167531 reverted: 173917r1, and with /* NEXT_PASS
(pass_complete_unrolli); */ : 173917n since I think this is related to
revision 134730).
See also pr49006.
^ permalink raw reply [flat|nested] 38+ messages in thread
end of thread, other threads:[~2011-09-26 8:38 UTC | newest]
Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
2007-11-28 15:30 ` [Bug tree-optimization/34265] " dominiq at lps dot ens dot fr
2007-11-28 16:06 ` rguenth at gcc dot gnu dot org
2007-11-28 16:14 ` dominiq at lps dot ens dot fr
2007-11-28 16:18 ` rguenth at gcc dot gnu dot org
2007-11-28 16:34 ` rguenth at gcc dot gnu dot org
2007-11-28 18:18 ` dominiq at lps dot ens dot fr
2007-11-28 18:49 ` dominiq at lps dot ens dot fr
2007-11-28 20:48 ` jb at gcc dot gnu dot org
2007-11-28 21:27 ` burnus at gcc dot gnu dot org
2007-11-28 22:05 ` rguenth at gcc dot gnu dot org
2007-11-28 22:36 ` dominiq at lps dot ens dot fr
2007-11-28 22:49 ` steven at gcc dot gnu dot org
2007-11-28 23:07 ` kargl at gcc dot gnu dot org
2007-11-28 23:18 ` steven at gcc dot gnu dot org
2007-11-28 23:57 ` dominiq at lps dot ens dot fr
2007-11-29 8:06 ` dominiq at lps dot ens dot fr
2007-11-29 10:12 ` rguenth at gcc dot gnu dot org
2007-11-29 10:22 ` dominiq at lps dot ens dot fr
2007-11-29 10:40 ` dominiq at lps dot ens dot fr
2007-11-29 11:01 ` dominiq at lps dot ens dot fr
2007-11-29 11:13 ` rguenther at suse dot de
2007-11-29 11:16 ` dominiq at lps dot ens dot fr
2007-11-29 12:25 ` dominiq at lps dot ens dot fr
2007-11-29 15:49 ` dominiq at lps dot ens dot fr
2007-11-30 21:38 ` ubizjak at gmail dot com
2007-12-03 14:08 ` dominiq at lps dot ens dot fr
2007-12-03 14:32 ` dominiq at lps dot ens dot fr
2007-12-03 14:34 ` dominiq at lps dot ens dot fr
2007-12-03 14:34 ` dominiq at lps dot ens dot fr
2007-12-03 16:30 ` ubizjak at gmail dot com
2007-12-03 18:59 ` dominiq at lps dot ens dot fr
2007-12-04 6:57 ` irar at il dot ibm dot com
2008-04-23 21:27 ` dominiq at lps dot ens dot fr
[not found] <bug-34265-4@http.gcc.gnu.org/bugzilla/>
2011-05-22 12:33 ` dominiq at lps dot ens.fr
2011-09-16 15:53 ` dominiq at lps dot ens.fr
2011-09-17 17:53 ` dominiq at lps dot ens.fr
2011-09-26 9:02 ` rguenth at gcc dot gnu.org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).