[Bug tree-optimization/34265] New: Missed optimizations

public inbox for gcc-bugs@sourceware.org

* [Bug tree-optimization/34265]  New: Missed optimizations
@ 2007-11-28 15:27 dominiq at lps dot ens dot fr
  2007-11-28 15:30 ` [Bug tree-optimization/34265] " dominiq at lps dot ens dot fr
                   ` (32 more replies)
  0 siblings, 33 replies; 38+ messages in thread
From: dominiq at lps dot ens dot fr @ 2007-11-28 15:27 UTC
  To: gcc-bugs

I have had a closer look at the optimization of the polyhedron test
induct.f90.
3/4 of the runtime is spent in the subroutine 'mutual_ind_quad_cir_coil' and
1/4 in 'mutual_ind_quad_rec_coil'.
The two subroutines contain two main loops with the following structure:

do i = 1, 2*m
    ...
    do j = 1, 9
        ...
        do k = 1, 9
            q_vector(1) = 0.5_longreal * a * (x2gauss(k) + 1.0_longreal)
            q_vector(2) = 0.5_longreal * b1 * (y2gauss(k) - 1.0_longreal)
            q_vector(3) = 0.0_longreal
!
!       rotate quad vector into the global coordinate system
!
            rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:))
            rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:))
            rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:))
!
!       compute and add in quadrature term
!
            numerator = w1gauss(j) * w2gauss(k) *                         &
                        dot_product(coil_current_vec,current_vector)
            denominator = sqrt(dot_product(rot_c_vector-rot_q_vector,     &
                                           rot_c_vector-rot_q_vector))
            l12_lower = l12_lower + numerator/denominator
        end do
    end do
end do

where the first six lines of code in the k loop depend on neither i nor j
and can be computed in a 'do k = 1, 9' loop outside the main loop by
replacing the length-three vector 'rot_q_vector' by a nine by three array.
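
A minimal C sketch of this hoisting, under stated assumptions: the variable names mirror the Fortran above, but precompute_rot_q and dot3 are hypothetical helpers, not code from induct.f90. The k-only work fills a nine by three array once, before the i/j/k nest is entered.

```c
#include <stddef.h>

#define NK 9   /* number of points in the k loop, as in the report */

/* 3-element dot product, standing in for Fortran's dot_product. */
static double dot3(const double a[3], const double b[3])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

/* Hypothetical helper: fill rot_q[k][0..2] once, before the i/j loops.
   Nothing here depends on i or j, so the work is done NK times in total
   instead of NK times per (i, j) pair. */
void precompute_rot_q(double rot_q[NK][3],
                      double rotate_quad[3][3],
                      const double x2gauss[NK], const double y2gauss[NK],
                      double a, double b1)
{
    for (size_t k = 0; k < NK; k++) {
        double q[3];
        q[0] = 0.5 * a  * (x2gauss[k] + 1.0);
        q[1] = 0.5 * b1 * (y2gauss[k] - 1.0);
        q[2] = 0.0;
        /* rotate the quad vector into the global coordinate system */
        for (size_t r = 0; r < 3; r++)
            rot_q[k][r] = dot3(rotate_quad[r], q);
    }
}
```

Inside the main loops, the 'denominator' expression would then read rot_q[k] in place of rot_q_vector.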

The original code induct.f90 gives the following timing (all with -O3 
-ffast-math -funroll-loops):

93.227u 0.094s 1:33.32 99.9%    0+0k 0+0io 0pf+0w

and output:

...
Maximum wand/quad abs rel mutual inductance =   5.95379428444656467E-002
Minimum resq coil/quad abs rel mutual inductance =    0.0000000000000000     
Maximum resq coil/quad abs rel mutual inductance =   9.63995250242061230E-002
...

Unrolling by hand 'numerator' and 'denominator' gives (see
http://gcc.gnu.org/ml/fortran/2007-11/msg00231.html):

65.563u 0.092s 1:05.66 99.9%    0+0k 0+0io 0pf+0w
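
For illustration only, the kind of rewrite meant by unrolling by hand can be sketched in C. The actual Fortran change is in the message linked above; the function names here are hypothetical, and the last function returns the squared distance, of which the 'denominator' expression takes the square root.

```c
/* Loop form: roughly what a generic dot product over a 3-element
   array section amounts to. */
double dot3_loop(const double a[3], const double b[3])
{
    double s = 0.0;
    for (int n = 0; n < 3; n++)
        s += a[n] * b[n];
    return s;
}

/* Hand-unrolled form: the three products written out, no inner loop. */
double dot3_unrolled(const double a[3], const double b[3])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

/* Squared distance |c - q|^2 with the vector subtraction and the dot
   product both unrolled; no temporary difference vector is built. */
double dist2_unrolled(const double c[3], const double q[3])
{
    double d0 = c[0] - q[0];
    double d1 = c[1] - q[1];
    double d2 = c[2] - q[2];
    return d0 * d0 + d1 * d1 + d2 * d2;
}
```

With the loop written out, the compiler sees straight-line code in the body of the k loop, so that loop becomes the innermost one.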

Looking at the assembly I can see that for the original code the inner loops in
k are not unrolled, as guessed by Paul Thomas (only the implied vector loops
being unrolled).

QUESTION 1: Should the frontend do the unrolling for small vectors itself? Or
should the middleend be more aggressive for nested loops with small known
iteration counts?

Moving the i- and j-invariant computations in the k loops outside the main loops gives:

80.313u 0.074s 1:20.39 99.9%    0+0k 0+1io 0pf+0w

Combining the two hand optimizations gives:

35.925u 0.040s 0:35.97 99.9%    0+0k 0+0io 0pf+0w

(without -ffast-math the timing is
59.263u 0.067s 0:59.33 99.9%    0+0k 0+1io 0pf+0w)

but the results change to:

Maximum wand/quad abs rel mutual inductance =   5.95379428444656675E-002
Minimum resq coil/quad abs rel mutual inductance =    0.0000000000000000     
Maximum resq coil/quad abs rel mutual inductance =   9.63995250242059842E-002

( Maximum wand/quad abs rel mutual inductance =   5.95379428444659520E-002
  Minimum resq coil/quad abs rel mutual inductance =    0.0000000000000000     
  Maximum resq coil/quad abs rel mutual inductance =   9.63995250242060675E-002
without -ffast-math).

The attached file gives the differences between the original code and the three
variants.

This is to be compared to further optimizations (indu.v2.f90) leading to:

35.376u 0.062s 0:35.44 99.9%    0+0k 0+1io 0pf+0w

or after merging the two loops (indu.v3.f90) to:

34.452u 0.041s 0:34.49 100.0%   0+0k 0+0io 0pf+0w

I have counted the number of sqrt() in the assembly code and found 9 of them in
the slow codes, while I found only 5 (10 for indu.v3.f90) in the fast
codes. I checked that this was not due to common
subexpressions I may have missed. Looking more closely at the assembly, I saw
that the slow codes used 'sqrtsd' while the fast ones used 'sqrtpd' along with
several other packed operations. Now 65.66*5/9=36.48, explaining most of the
speedup. Note that 5=9/2+1 (integer division), i.e., four packed computations
followed by a scalar one.  Owing to the same structure in the two loops, the
two scalar computations could be merged into a packed one, but this is missed by
the optimizer.
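
The instruction counts above can be checked with a small C sketch. It assumes two doubles per packed instruction (as with SSE2 'sqrtpd') and that the runtime scales with the number of sqrt instructions, which is only an approximation; both function names are hypothetical.

```c
/* Number of sqrt instructions needed for n independent square roots.
   Unpacked: one scalar instruction (sqrtsd) per value.
   Packed: one instruction (sqrtpd) per pair, plus one scalar
   instruction for the leftover value when n is odd. */
int sqrt_instructions(int n, int packed)
{
    if (!packed)
        return n;
    return n / 2 + n % 2;
}

/* Runtime predicted from a measured time t, assuming it scales with
   the sqrt instruction count. */
double scaled_time(double t, int slow_count, int fast_count)
{
    return t * fast_count / slow_count;
}
```

For one loop of nine values this gives 9 scalar or 5 packed instructions. Treating the two loops together as 18 values would need 9 packed instructions rather than the 10 counted for indu.v3.f90.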

I have tried without success to trigger this "vectorization" by making some
variables vectors in k.

QUESTION 2:  Why is the optimizer able to vectorize in some cases and not in
others? Can the frontend help to vectorize?

-- 
           Summary: Missed optimizations
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: dominiq at lps dot ens dot fr
 GCC build triplet: i686-apple-darwin9
  GCC host triplet: i686-apple-darwin9
GCC target triplet: i686-apple-darwin9

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265


* [Bug tree-optimization/34265] Missed optimizations
From: dominiq at lps dot ens dot fr @ 2007-11-28 15:30 UTC
  To: gcc-bugs



------- Comment #1 from dominiq at lps dot ens dot fr  2007-11-28 15:30 -------
Created an attachment (id=14654)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=14654&action=view)
Diffs between the original file and the simplest variants

In induct.v1.f90 'numerator' and 'denominator' are unrolled by hand. In
induct.v2.f90 the invariants are moved outside the main loop and induct.v3.f90
combines the two.



* [Bug tree-optimization/34265] Missed optimizations
From: rguenth at gcc dot gnu dot org @ 2007-11-28 16:06 UTC
  To: gcc-bugs



------- Comment #2 from rguenth at gcc dot gnu dot org  2007-11-28 16:06 -------
GCC doesn't have a facility to split the inner loop and move it out of the
outer loops by introducing an array temporary.

As for complete unrolling, this only happens for innermost loops(?) and you
can tune the heuristics with --param max-completely-peeled-insns=N (defaults to
400) and --param max-completely-peel-times=N (defaults to 16).  Use
-funroll-loops to enable this.

Note that complete unrolling happens too late to help LIM or vectorization.



* [Bug tree-optimization/34265] Missed optimizations
From: dominiq at lps dot ens dot fr @ 2007-11-28 16:14 UTC
  To: gcc-bugs



------- Comment #3 from dominiq at lps dot ens dot fr  2007-11-28 16:14 -------
> Note that complete unrolling happens too late to help LIM or vectorization.

Could this be translated as a YES to my first question: the fortran frontend
should unroll computations for short vectors?



* [Bug tree-optimization/34265] Missed optimizations
From: rguenth at gcc dot gnu dot org @ 2007-11-28 16:18 UTC
  To: gcc-bugs



------- Comment #4 from rguenth at gcc dot gnu dot org  2007-11-28 16:17 -------
I would in principle say no - we can instead improve the middle-end here.  But
it may pay off to not generate a loop for short vectors in case the resulting
IL is smaller for example.  Of course it would duplicate logic in the frontend
if that is not already available, so from a middle-end point of view we should
fix it there instead (the same problems happen for C and C++).  See PR34223.


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  BugsThisDependsOn|                            |34223



* [Bug tree-optimization/34265] Missed optimizations
From: rguenth at gcc dot gnu dot org @ 2007-11-28 16:34 UTC
  To: gcc-bugs



------- Comment #5 from rguenth at gcc dot gnu dot org  2007-11-28 16:33 -------
Created an attachment (id=14655)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=14655&action=view)
patch for early complete unrolling of inner loops

For example with a patch like this.



* [Bug tree-optimization/34265] Missed optimizations
From: dominiq at lps dot ens dot fr @ 2007-11-28 18:18 UTC
  To: gcc-bugs



------- Comment #6 from dominiq at lps dot ens dot fr  2007-11-28 18:18 -------
Subject: Re:  Missed optimizations

> For example with a patch like this.

You also need

--- ../_gcc_clean/gcc/tree-flow.h       2007-11-16 16:17:46.000000000 +0100
+++ ../gcc-4.3-work/gcc/tree-flow.h     2007-11-28 18:56:42.000000000 +0100
@@ -980,7 +980,7 @@
 void tree_ssa_lim (void);
 unsigned int tree_ssa_unswitch_loops (void);
 unsigned int canonicalize_induction_variables (void);
-unsigned int tree_unroll_loops_completely (bool);
+unsigned int tree_unroll_loops_completely (bool, bool);
 unsigned int tree_ssa_prefetch_arrays (void);
 unsigned int remove_empty_loops (void);
 void tree_ssa_iv_optimize (void);

Still building.



* [Bug tree-optimization/34265] Missed optimizations
From: dominiq at lps dot ens dot fr @ 2007-11-28 18:49 UTC
  To: gcc-bugs



------- Comment #7 from dominiq at lps dot ens dot fr  2007-11-28 18:48 -------
Subject: Re:  Missed optimizations

With your patch the runtime went from

93.670u 0.103s 1:33.85 99.9%    0+0k 0+0io 32pf+0w

to

38.741u 0.038s 0:38.85 99.7%    0+0k 0+1io 32pf+0w

Pretty impressive!

Note that with gfortran 4.2.2 the timing is

72.451u 0.046s 1:12.59 99.8%    0+0k 1+0io 33pf+0w

I'll run the full polyhedron suite.

Thanks



* [Bug tree-optimization/34265] Missed optimizations
From: jb at gcc dot gnu dot org @ 2007-11-28 20:48 UTC
  To: gcc-bugs



------- Comment #8 from jb at gcc dot gnu dot org  2007-11-28 20:48 -------
The vectorization of dot products is covered by PR31738, I suppose


-- 

jb at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  BugsThisDependsOn|                            |31738



* [Bug tree-optimization/34265] Missed optimizations
From: burnus at gcc dot gnu dot org @ 2007-11-28 21:27 UTC
  To: gcc-bugs



------- Comment #9 from burnus at gcc dot gnu dot org  2007-11-28 21:27 -------
> With your patch the runtime went from
> 93.670u 0.103s 1:33.85 99.9%    0+0k 0+0io 32pf+0w
> to
> 38.741u 0.038s 0:38.85 99.7%    0+0k 0+1io 32pf+0w

Thus: a 59% reduction in runtime. Here, it "only" went ~30% down, from 49.89s
to ~35.2s (AMD64/Linux, -m64). Still quite impressive!


-- 

burnus at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |burnus at gcc dot gnu dot
                   |                            |org



* [Bug tree-optimization/34265] Missed optimizations
From: rguenth at gcc dot gnu dot org @ 2007-11-28 22:05 UTC
  To: gcc-bugs



------- Comment #10 from rguenth at gcc dot gnu dot org  2007-11-28 22:05 -------
Indeed - unexpectedly impressive ;)  The patch has (obviously) received no
tuning as to the placement of the early unrolling in the pass pipeline, and
early unrolling is only done if that doesn't increase code size (by the metric
of the unrolling pass, of course), unless you specify -O3.


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu dot
                   |                            |org



* [Bug tree-optimization/34265] Missed optimizations
From: dominiq at lps dot ens dot fr @ 2007-11-28 22:36 UTC
  To: gcc-bugs



------- Comment #11 from dominiq at lps dot ens dot fr  2007-11-28 22:35 -------
Here are the timings before and after the patch for the polyhedron tests and
some variants:

                   Before patch                   After patch 

  Benchmark   Ave Run  Number   Estim    :   Ave Run  Number   Estim
       Name    (secs) Repeats   Err %    :    (secs) Repeats   Err %
  ---------   ------- -------  ------    :   ------- -------  ------
         ac     16.92       5  0.0183    :     16.16       5  0.0056
     aermod     36.82       5  0.0082    :     36.92       5  0.0106
        air     11.38      10  0.0479    :     11.43      11  0.0494
   capacita     62.21       5  0.0036    :     61.97       5  0.0343
    channel      4.04      12  0.0333    :      4.04       5  0.0160
      doduc     58.07       5  0.0257    :     57.56       5  0.0164
    fatigue     14.94       5  0.0338    :     14.33       5  0.0184
    gas_dyn     11.78      17  0.0448    :     11.89      18  0.0349
     induct     93.27       5  0.0093    :     36.93       5  0.0205
      linpk     28.15       5  0.0099    :     28.21       5  0.0259
       mdbx     16.80       5  0.0112    :     16.83       5  0.0051
         nf     32.45       5  0.0388    :     32.63      10  0.0495
    protein     55.63       5  0.0069    :     54.86       5  0.0305
     rnflow     45.88       5  0.0366    :     46.06       5  0.0230
   test_fpu     14.64       5  0.0115    :     14.46       5  0.0207
       tfft      3.04       5  0.0380    :      3.06      20  0.0284
      ac_v1     16.15       5  0.0197    :     15.16       5  0.0109
     air_v1     10.81       5  0.0411    :     10.88      10  0.0471
 capacita_8     69.45       5  0.0136    :     69.41       5  0.0091
capacita_10    113.20       5  0.0200    :    112.44       5  0.0290
    chan_v1      2.23       5  0.0183    :      2.24       7  0.0471
 channel_10     16.61       5  0.0351    :     16.64      14  0.0492
 fatigue_v1     13.26       5  0.0071    :     12.05       5  0.0148
 fatigue_10     20.54      15  0.0312    :     21.73       5  0.0117
  induct_v2     35.07       5  0.0007    :     60.08       5  0.0236
  induct_v3     34.40       5  0.0189    :     58.64       5  0.0249
  induct_vm    262.95       2  0.0000    :    253.62       2  0.0197
  induct_10    100.12       5  0.0053    :     84.65       5  0.0008
     kepler     22.73       5  0.0123    :     26.11       5  0.0069
  kepler_10     69.59       5  0.0047    :     61.42       5  0.0110
      nf_10     58.00       5  0.0413    :     58.36       5  0.0388
 protein_10     57.04       5  0.0167    :     56.38       5  0.0486
test_fpu_v1     15.15       5  0.0195    :     14.98       5  0.0104
test_fpu_10     34.75       5  0.0408    :     34.68       5  0.0120
     tfft_8      6.81       5  0.0110    :      6.83       5  0.0371
    tfft_10     14.36       5  0.0373    :     14.40       6  0.0496

                 Before patch                After patch 

  Benchmark   Compile  Executable     :  Compile  Executable
       Name    (secs)     (bytes)     :   (secs)     (bytes)
  ---------   -------  ----------     :  -------  ----------
         ac      4.52       50628     :     4.60       50628
     aermod     96.22     1288460     :   106.72     1288460
        air      6.57       80956     :     6.68       80956
   capacita      3.18       60140     :     3.34       64236
    channel      1.55       38532     :     1.65       38532
      doduc     13.52      183264     :    14.02      191456
    fatigue      5.69       84564     :     5.83       80468
    gas_dyn      5.36      695776     :     5.50      695776
     induct     11.65      160132     :    12.02      168324
      linpk      1.67       46512     :     1.71       46512
       mdbx      3.76       72672     :     3.85       72672
         nf      4.45       87644     :     4.46       87644
    protein     11.18      113900     :    11.45      113900
     rnflow     11.58      187316     :    11.74      187316
   test_fpu     11.23      182544     :    11.09      178448
       tfft      1.30       34420     :     1.33       34420
      ac_v1      4.51       50628     :     4.60       50628
     air_v1      6.63       80956     :     6.75       80956
 capacita_8      3.20       60136     :     3.33       64232
capacita_10      3.19       64216     :     3.42       68312
    chan_v1      1.85       38500     :     1.85       38500
 channel_10      1.32       34392     :     1.40       34392
 fatigue_v1      5.75       84524     :     5.77       80428
 fatigue_10      4.91       76352     :     4.91       76352
  induct_v2     11.72      168324     :    12.19      172420
  induct_v3     11.72      164228     :    12.12      172420
  induct_vm     11.44      160132     :    11.73      164228
  induct_10     11.47      159964     :    11.89      159964
     kepler      0.34       17652     :     0.35       17652
  kepler_10      0.33       17632     :     0.34       17632
      nf_10      2.03       46684     :     2.07       46684
 protein_10      7.01       93400     :     7.18       93400
test_fpu_v1     11.23      182592     :    11.17      178496
test_fpu_10      6.54      117056     :     6.40      117056
     tfft_8      1.25       30348     :     1.31       30348
    tfft_10      1.15       30328     :     1.18       30328

The only timings significantly changed by the patch are those of the induct
variants, with the strange result that the variants which missed the
vectorization are now vectorized, while those previously vectorized are not any
more (also true for the variants of the first attachment). So some tuning is
probably needed.

I also have to regtest and do some further investigations.



* [Bug tree-optimization/34265] Missed optimizations
From: steven at gcc dot gnu dot org @ 2007-11-28 22:49 UTC
  To: gcc-bugs



------- Comment #12 from steven at gcc dot gnu dot org  2007-11-28 22:49 -------
The only timings significantly changed are actually the compile times, which go
up significantly.



* [Bug tree-optimization/34265] Missed optimizations
From: kargl at gcc dot gnu dot org @ 2007-11-28 23:07 UTC
  To: gcc-bugs



------- Comment #13 from kargl at gcc dot gnu dot org  2007-11-28 23:06 -------
(In reply to comment #12)
> The only timings significantly changed are actually the compile times, which go
> up significantly.
> 

Look at the kepler execution time.  22.73 s without the patch and
26.11 s with the patch.  That's a 15% increase in execution time
(i.e., it runs slower!).
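
A small C sketch of this arithmetic, with hypothetical helper names: going from 22.73 s to 26.11 s is about a 15% increase in run time, which corresponds to roughly a 13% drop in speed (work done per second).

```c
/* Relative change in run time, in percent; positive means slower. */
double time_change_percent(double before, double after)
{
    return 100.0 * (after - before) / before;
}

/* Relative change in speed (1/time), in percent; negative means slower. */
double speed_change_percent(double before, double after)
{
    return 100.0 * (before / after - 1.0);
}
```

The same helpers applied to the induct timings quoted earlier in the thread (93.670 s to 38.741 s) give a run-time change of about -59%.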



* [Bug tree-optimization/34265] Missed optimizations
From: steven at gcc dot gnu dot org @ 2007-11-28 23:18 UTC
  To: gcc-bugs



------- Comment #14 from steven at gcc dot gnu dot org  2007-11-28 23:17 -------
Yes, that too.  It was more a sarcastic addendum to your remark that there were
so few significantly changed numbers.  It seemed to me you should not look at
just the execution times ;-)



* [Bug tree-optimization/34265] Missed optimizations
From: dominiq at lps dot ens dot fr @ 2007-11-28 23:57 UTC
  To: gcc-bugs

------- Comment #15 from dominiq at lps dot ens dot fr  2007-11-28 23:57 -------
If I am allowed to be sarcastic too, I'll say that the increase in compile time
(worst case 11%, arithmetic average 5%) does not go against the current trend
one can see for instance in

http://www.suse.de/~gcctest/c++bench/polyhedron/polyhedron-summary.txt-1-0.html

for no gain at all on the execution time (see also the thread

http://gcc.gnu.org/ml/fortran/2007-07/msg00276.html).

Now I do expect that no patch worse than Richard's will ever be committed!

It came very fast: about one hour after my post.

It did not break anything so far.

It did the optimizations it was supposed to do on the intended code and some
variants, even if it broke the vectorization of some other variants and
increased the execution time of kepler by 15%.

At least it confirmed that the bottleneck for induct was both the loop
unrolling and vectorization. Indeed it remains to understand why vectorization
is no longer applied to codes for which it was before the patch.

To be clear, I think it is a mistake to use the f90 array features on small
vectors, but I have seen it more often than I'd like. So this is a kind of
optimization that can find its place in real-life codes and not only in
benchmarks.


* [Bug tree-optimization/34265] Missed optimizations
From: dominiq at lps dot ens dot fr @ 2007-11-29  8:06 UTC
  To: gcc-bugs



------- Comment #16 from dominiq at lps dot ens dot fr  2007-11-29 08:06 -------
A quick report of the comparison between the regression results for revision
130500 + patch in comment #5 + Tobias' patch for pr34262 and revision 130489 +
some patches applied to rev. 130500. I have the following new failures:

ERROR: gcc.dg/tree-ssa/loop-1.c: error executing dg-final: no files matched
glob pattern "loop-1.c.[0-9][0-9][0-9]t.cunroll"
UNRESOLVED: gcc.dg/tree-ssa/loop-1.c: error executing dg-final: no files
matched glob pattern "loop-1.c.[0-9][0-9][0-9]t.cunroll"
FAIL: gcc.dg/tree-ssa/loop-17.c scan-tree-dump sccp "set_nb_iterations_in_loop
= 1"
ERROR: gcc.dg/tree-ssa/loop-23.c: error executing dg-final: no files matched
glob pattern "loop-23.c.[0-9][0-9][0-9]t.cunroll"
UNRESOLVED: gcc.dg/tree-ssa/loop-23.c: error executing dg-final: no files
matched glob pattern "loop-23.c.[0-9][0-9][0-9]t.cunroll"
FAIL: gcc.dg/vect/vect-105.c scan-tree-dump-times vect "vectorized 1 loops" 1
ERROR: gcc.dg/vect/vect-11a.c: error executing dg-final: no files matched glob
pattern "vect-11a.c.[0-9][0-9][0-9]t.vect"
UNRESOLVED: gcc.dg/vect/vect-11a.c: error executing dg-final: no files matched
glob pattern "vect-11a.c.[0-9][0-9][0-9]t.vect"
FAIL: gcc.dg/vect/vect-66.c scan-tree-dump-times vect "vectorized 3 loops" 1
FAIL: gcc.dg/vect/vect-76.c scan-tree-dump-times vect "vectorized 3 loops" 1
FAIL: gcc.dg/vect/vect-92.c scan-tree-dump-times vect "vectorized 1 loops" 3
FAIL: gcc.dg/vect/vect-92.c scan-tree-dump-times vect "Alignment of access
forced using peeling" 3
FAIL: gcc.dg/vect/vect-outer-1.c scan-tree-dump-times vect "strided access in
outer loop" 1
FAIL: gcc.dg/vect/vect-outer-6.c scan-tree-dump-times vect "OUTER LOOP
VECTORIZED" 1
FAIL: gcc.dg/vect/vect-outer-6.c scan-tree-dump-times vect "zero step in outer
loop." 1
ERROR: gcc.dg/vect/vect-shift-1.c: error executing dg-final: no files matched
glob pattern "vect-shift-1.c.[0-9][0-9][0-9]t.vect"
UNRESOLVED: gcc.dg/vect/vect-shift-1.c: error executing dg-final: no files
matched glob pattern "vect-shift-1.c.[0-9][0-9][0-9]t.vect"
FAIL: gcc.dg/vect/no-section-anchors-vect-66.c scan-tree-dump-times vect
"vectorized 3 loops" 1
FAIL: gcc.dg/vect/no-section-anchors-vect-66.c scan-tree-dump-times vect
"Alignment of access forced using peeling" 1
FAIL: gcc.dg/vect/no-section-anchors-vect-69.c scan-tree-dump-times vect
"vectorized 4 loops" 1
FAIL: gcc.dg/vect/no-section-anchors-vect-69.c scan-tree-dump-times vect
"Alignment of access forced using peeling" 2
ERROR: gcc.target/i386/vectorize1.c: error executing dg-final: no files matched
glob pattern "vectorize1.c.[0-9][0-9][0-9]t.vect"
UNRESOLVED: gcc.target/i386/vectorize1.c: error executing dg-final: no files
matched glob pattern "vectorize1.c.[0-9][0-9][0-9]t.vect"

FAIL: gfortran.dg/array_1.f90  -O3 -fomit-frame-pointer  execution test
FAIL: gfortran.dg/array_1.f90  -O3 -fomit-frame-pointer -funroll-loops 
execution test
FAIL: gfortran.dg/array_1.f90  -O3 -fomit-frame-pointer -funroll-all-loops
-finline-functions  execution test
FAIL: gfortran.dg/array_1.f90  -O3 -g  execution test

I am waiting for directions on how I can investigate these problems further.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265


* [Bug tree-optimization/34265] Missed optimizations
From: rguenth at gcc dot gnu dot org @ 2007-11-29 10:12 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #17 from rguenth at gcc dot gnu dot org  2007-11-29 10:11 -------
Doh, not only did I fail to include the chunk mentioned in comment #6 in the
diff, but I also added the original unrolling pass, not the one that is
supposed to unroll only inner loops #)

So, change the passes.c hunk to

Index: gcc/passes.c
===================================================================
--- gcc/passes.c        (revision 130511)
+++ gcc/passes.c        (working copy)
@@ -570,6 +570,9 @@ init_optimization_passes (void)
       NEXT_PASS (pass_merge_phi);
       NEXT_PASS (pass_vrp);
       NEXT_PASS (pass_dce);
+      NEXT_PASS (pass_tree_loop_init);
+      NEXT_PASS (pass_complete_unrolli);
+      NEXT_PASS (pass_tree_loop_done);
       NEXT_PASS (pass_cselim);
       NEXT_PASS (pass_dominator);
       /* The only const/copy propagation opportunities left after


That should fix some of the testsuite failures.  Something else to experiment
with (which may fix some of the compile-time problems): in the
tree-ssa-loop-ivcanon.c hunk, change the condition to

           if (!unroll_outer
               && loop->inner)
             continue;

to only unroll innermost loops, not all-but-outermost loops.

As for pass placement, another thing to look at is whether it works as part of
the early optimizations, around

          NEXT_PASS (pass_early_inline);
          NEXT_PASS (pass_cleanup_cfg);
          NEXT_PASS (pass_rename_ssa_copies);
.... here
          NEXT_PASS (pass_ccp);
          NEXT_PASS (pass_forwprop);
          NEXT_PASS (pass_update_address_taken);
.... or here
          NEXT_PASS (pass_simple_dse);
          NEXT_PASS (pass_sra_early);

because this may enable SRA of variables in the loop body.

Most of the compile-time impact actually comes from loop discovery.  But since
we preserve loops now, maybe we no longer need pass_tree_loop_done after the
early unrolling, nor pass_tree_loop_init before the rest of the loop
optimizations?  Zdenek, can you confirm this?


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rakdver at gcc dot gnu dot
                   |                            |org


* [Bug tree-optimization/34265] Missed optimizations
From: dominiq at lps dot ens dot fr @ 2007-11-29 10:22 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #18 from dominiq at lps dot ens dot fr  2007-11-29 10:22 -------
I have had a look at what is happening for kepler.f90 (from the 2004 polyhedron
test suite?) and it looks like another missed vectorization: counting the
mulpd instructions in the kepler.s files, I find 24 before the patch and none
after.


* [Bug tree-optimization/34265] Missed optimizations
From: dominiq at lps dot ens dot fr @ 2007-11-29 10:40 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #19 from dominiq at lps dot ens dot fr  2007-11-29 10:40 -------
Richard,

I am not sure I understand your patch in comment #17. After your patch in
comment #5, I already have in gcc/passes.c:

      NEXT_PASS (pass_merge_phi);
      NEXT_PASS (pass_vrp);
      NEXT_PASS (pass_dce);
      NEXT_PASS (pass_tree_loop_init);
      NEXT_PASS (pass_complete_unroll);
      NEXT_PASS (pass_tree_loop_done);
      NEXT_PASS (pass_cselim);
      NEXT_PASS (pass_dominator);
      /* The only const/copy propagation opportunities left after

do you mean that I should change

      NEXT_PASS (pass_complete_unroll);

to

      NEXT_PASS (pass_complete_unrolli);

? I am assuming my interpretation is correct and am rebuilding gcc right now.


* [Bug tree-optimization/34265] Missed optimizations
From: dominiq at lps dot ens dot fr @ 2007-11-29 11:01 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #20 from dominiq at lps dot ens dot fr  2007-11-29 11:00 -------
I have applied my interpretation of the first two changes in comment #17.
gfortran.dg/array_1.f90 still aborts and induct.v3.f90 is still not vectorized.
The good news is that induct.f90 is still properly unrolled and vectorized.


* [Bug tree-optimization/34265] Missed optimizations
From: rguenther at suse dot de @ 2007-11-29 11:13 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #21 from rguenther at suse dot de  2007-11-29 11:13 -------
Subject: Re:  Missed optimizations

On Thu, 29 Nov 2007, dominiq at lps dot ens dot fr wrote:

> Richard,
> 
> I am not sure I understand your patch in comment #17. After your patch in
> comment #5, I already have in gcc/passes.c:
> 
>       NEXT_PASS (pass_merge_phi);
>       NEXT_PASS (pass_vrp);
>       NEXT_PASS (pass_dce);
>       NEXT_PASS (pass_tree_loop_init);
>       NEXT_PASS (pass_complete_unroll);
>       NEXT_PASS (pass_tree_loop_done);
>       NEXT_PASS (pass_cselim);
>       NEXT_PASS (pass_dominator);
>       /* The only const/copy propagation opportunities left after
> 
> do you mean that I should change
> 
>       NEXT_PASS (pass_complete_unroll);
> 
> to
> 
>       NEXT_PASS (pass_complete_unrolli);
> 
> ? I am assuming my interpretation is correct and am rebuilding gcc right now.

Yes, that's correct - I did too much copy&paste there :)

Richard.


* [Bug tree-optimization/34265] Missed optimizations
From: dominiq at lps dot ens dot fr @ 2007-11-29 11:16 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #22 from dominiq at lps dot ens dot fr  2007-11-29 11:16 -------
On top of the first two changes in comment #17, I have MOVED

+      NEXT_PASS (pass_tree_loop_init);
+      NEXT_PASS (pass_complete_unrolli);
+      NEXT_PASS (pass_tree_loop_done);

to the first suggested place. Now gfortran.dg/array_1.f90 passes the test and
induct is no longer unrolled/vectorized, but induct.v3 is: we are back to the
pre-patch behavior, at least on these quick tests.


* [Bug tree-optimization/34265] Missed optimizations
From: dominiq at lps dot ens dot fr @ 2007-11-29 12:25 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #23 from dominiq at lps dot ens dot fr  2007-11-29 12:24 -------
On top of the first two changes in comment #17, I have MOVED

+      NEXT_PASS (pass_tree_loop_init);
+      NEXT_PASS (pass_complete_unrolli);
+      NEXT_PASS (pass_tree_loop_done);

to the second suggested place. Same result as in comment #22.


* [Bug tree-optimization/34265] Missed optimizations
From: dominiq at lps dot ens dot fr @ 2007-11-29 15:49 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #24 from dominiq at lps dot ens dot fr  2007-11-29 15:49 -------
I think I now have a partial understanding of what is happening for the induct
variants that are not vectorized with the patch in comment #5: they do not
contain any loop inside the k loop. If I replace

                  rot_q_vector(1) = rot_c_vector(1) - rot_qk_vector(k,1)
                  rot_q_vector(2) = rot_c_vector(2) - rot_qk_vector(k,2)
                  rot_q_vector(3) = rot_c_vector(3) - rot_qk_vector(k,3)

by

                  rot_q_vector(:) = rot_c_vector(:) - rot_qk_vector(k,:)

Then the loop is vectorized again.


* [Bug tree-optimization/34265] Missed optimizations
From: ubizjak at gmail dot com @ 2007-11-30 21:38 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #25 from ubizjak at gmail dot com  2007-11-30 21:38 -------
(In reply to comment #24)

> Then the loop is vectorized again.

IMO, SLP should vectorize the sequence.


* [Bug tree-optimization/34265] Missed optimizations
From: dominiq at lps dot ens dot fr @ 2007-12-03 14:08 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #26 from dominiq at lps dot ens dot fr  2007-12-03 14:08 -------
> IMO, SLP should vectorize the sequence.

Uros,

What is the meaning of the above sentence?


* [Bug tree-optimization/34265] Missed optimizations
From: dominiq at lps dot ens dot fr @ 2007-12-03 14:32 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #27 from dominiq at lps dot ens dot fr  2007-12-03 14:32 -------
I have had a look at the failure of gfortran.dg/array_1.f90 with patch #5. The
following reduced code gives the same failure:

! { dg-do run }
! PR 15553 : the array used to be filled with garbage
! this problem disappeared between 2004-05-20 and 2004-09-15
program arrpack
  implicit none

  double precision x(10,10), tmp(6,5)
  integer i, j

  x = -1
  do i=1,6
     do j=1,5
        x(i,j) = i+j*10
     end do
  end do
  tmp(:,:) = x(1:6, 1:5)
  print '(6F8.2)', tmp

end program arrpack

With -O3 and patch #5, the output is

   11.00   12.00   13.00   14.00   15.00   16.00
   -1.00   -1.00   -1.00   -1.00   21.00   22.00
   23.00   24.00   25.00   26.00   -1.00   -1.00
   -1.00   -1.00   31.00   32.00   33.00   34.00
   35.00   36.00   -1.00   -1.00   -1.00   -1.00

instead of

   11.00   12.00   13.00   14.00   15.00   16.00
   21.00   22.00   23.00   24.00   25.00   26.00
   31.00   32.00   33.00   34.00   35.00   36.00
   41.00   42.00   43.00   44.00   45.00   46.00
   51.00   52.00   53.00   54.00   55.00   56.00

I am amazed that this is the only failure of this kind among the several
thousand tests I have run!

I'll attach the results of -fdump-tree-optimized with and without patch #5.

I have also looked at the gcc failures. Most of them are missed vectorizations
or new ones, so this is already reported. Is *.[0-9][0-9][0-9]t.vect supposed
to exist if the vectorization is missed? If yes, this explains a few failures.
Concerning the failures with *.[0-9][0-9][0-9]t.cunroll, I see *cunroll1/2, but
no cunroll.


* [Bug tree-optimization/34265] Missed optimizations
From: dominiq at lps dot ens dot fr @ 2007-12-03 14:34 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #28 from dominiq at lps dot ens dot fr  2007-12-03 14:33 -------
Created an attachment (id=14691)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=14691&action=view)
result of -fdump-tree-optimized with patch #5


* [Bug tree-optimization/34265] Missed optimizations
From: dominiq at lps dot ens dot fr @ 2007-12-03 14:34 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #29 from dominiq at lps dot ens dot fr  2007-12-03 14:34 -------
Created an attachment (id=14692)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=14692&action=view)
result of -fdump-tree-optimized without patch #5


* [Bug tree-optimization/34265] Missed optimizations
From: ubizjak at gmail dot com @ 2007-12-03 16:30 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #30 from ubizjak at gmail dot com  2007-12-03 16:30 -------
(In reply to comment #26)
> > IMO, SLP should vectorize the sequence.
> What is the meaning of the above sentence?

Uh, sorry for being terse. If there are no loops, then "straight-line
parallelization" [SLP] should vectorize your manually unrolled sequence in
comment #24. You can look into testsuite/gcc.dg/vect/slp-*.c for some c code
examples.


* [Bug tree-optimization/34265] Missed optimizations
From: dominiq at lps dot ens dot fr @ 2007-12-03 18:59 UTC (permalink / raw)
  To: gcc-bugs

------- Comment #31 from dominiq at lps dot ens dot fr  2007-12-03 18:58 -------
> If there are no loops, then "straight-line parallelization" [SLP] should vectorize 
> your manually unrolled sequence in comment #24.

Yes it should, but it does not after patch #5.  The unanswered questions so far
are why it does not, and how to change the patch so that it does. Anyhow,
the "good" vectorization should be along the k loop (length 9 instead of 3). My
understanding of my tests is, first, that 5/9<2/3 and, more importantly, that the
packing/unpacking overhead is a smaller penalty when it is shared, as in the k
vectorization.

* [Bug tree-optimization/34265] Missed optimizations
From: irar at il dot ibm dot com @ 2007-12-04  6:57 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #32 from irar at il dot ibm dot com  2007-12-04 06:56 -------
(In reply to comment #30)
> Uh, sorry for being terse. If there are no loops, then "straight-line
> parallelization" [SLP] should vectorize your manually unrolled sequence in
> comment #24. 

Currently only loop-aware SLP is implemented, meaning the code must be enclosed
in a loop to get vectorized.

> You can look into testsuite/gcc.dg/vect/slp-*.c for some c code
> examples.
> 

Right. For example,

  for (i = 0; i < N; i++)
    {
      out[i*4] = a0;
      out[i*4 + 1] = a1;
      out[i*4 + 2] = a2;
      out[i*4 + 3] = a3;
    }

will be transformed into:

  for (i = 0; i < N; i++)
    {
      out[i*4:i*4+3] = {a0, a1, a2, a3};
    }

Ira


* [Bug tree-optimization/34265] Missed optimizations
From: dominiq at lps dot ens dot fr @ 2008-04-23 21:27 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #33 from dominiq at lps dot ens dot fr  2008-04-23 21:26 -------
Created an attachment (id=15523)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=15523&action=view)
induct.f90 variants and their diff with the original file

The original diffs have space problems.


* [Bug tree-optimization/34265] Missed optimizations
From: rguenth at gcc dot gnu.org @ 2011-09-26  9:02 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|                            |FIXED

--- Comment #37 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-09-26 08:36:26 UTC ---
Testcases for runtime properties are not supported.


* [Bug tree-optimization/34265] Missed optimizations
From: dominiq at lps dot ens.fr @ 2011-09-17 17:53 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265

Dominique d'Humieres <dominiq at lps dot ens.fr> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |irar at gcc dot gnu.org,
                   |                            |wschmidt at gcc dot gnu.org

--- Comment #36 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-09-17 17:09:59 UTC ---
PR 34265 and PR 49006 have been fixed by revision 176984:

Author:    wschmidt
Date:    Sun Jul 31 18:58:06 2011 UTC (6 weeks, 5 days ago)
Changed paths:    2
Log Message:    
2011-07-29  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>

    PR tree-optimization/49749
    * tree-ssa-reassoc.c (get_rank): New forward declaration.
    (PHI_LOOP_BIAS): New macro.
    (phi_rank): New function.
    (loop_carried_phi): Likewise.
    (propagate_rank): Likewise.
    (get_rank): Add calls to phi_rank and propagate_rank.

The following table compares the execution time and the line at which the
vectorizer reports the vectorization for the twelve variants described in
comment #34, showing that they are now all vectorized:

revision           176983          176984          178905
num den rot      time  line      time  line      time  line
 o   o   o      2.154s  243     2.154s  243     2.133s  243 
 u   o   o      1.973s  243     1.970s  243     2.035s  243 
 o   u   o      2.024s  243     2.023s  243     1.977s  243 
 u   u   o      3.053s          1.817s  234     1.831s  234 
 o   o   u      3.015s          1.839s  234     1.841s  234 
 u   o   u      3.030s          1.828s  234     1.816s  234 
 o   u   u      3.049s          1.818s  234     1.834s  234 
 u   u   u      3.059s          1.820s  234     1.818s  234 
 o   o   f      3.010s          1.825s  234     1.822s  234 
 u   o   f      3.033s          1.836s  234     1.826s  234 
 o   u   f      3.061s          1.814s  234     1.828s  234 
 u   u   f      3.058s          1.825s  234     1.812s  234 
graphite        1.937s  243     1.938s  243     1.912s  243 

(num, den and rot stand for numerator, denominator and rotate respectively; o,
u, and f stand for original, unrolled, and folded.)

Since gcc now has (at least for this class of code ;) the three properties I
expect from a good optimizer:
(1) it does not destroy the hand optimization I have done;
(2) it optimizes the original code;
(3) it has a consistent behavior across variants;
I think a test should be added to the test suite to check that none of these
properties are lost in future revisions.


* [Bug tree-optimization/34265] Missed optimizations
From: dominiq at lps dot ens.fr @ 2011-09-16 15:53 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265

--- Comment #35 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-09-16 15:42:15 UTC ---
This pr (as well as pr49006) seems to have been fixed between revisions 176696
and 177649. I am closing pr49006 as fixed and I'll use this pr to track the
remaining issues.


* [Bug tree-optimization/34265] Missed optimizations
From: dominiq at lps dot ens.fr @ 2011-05-22 12:33 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265

--- Comment #34 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-05-22 12:06:20 UTC ---
Created attachment 24325
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24325
reduced tests

The attached bzipped tar contains the files induct_red.f90 with all the
infrastructure needed to provide a realistic framework for running a reduced
version of the subroutine mutual_ind_quad_cir_coil contained in induct_qc_x.F90
(reduced to only one of the critical nested loops).

When the macro XPA is defined the original rotate code

          rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:))
          rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:))
          rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:))

is unrolled as follows (using q_vector(3)==0) if the macro FLD is not defined

          rot_q_vector(1) = rotate_quad(1,1) * q_vector(1) + &
                    rotate_quad(1,2) * q_vector(2)
          rot_q_vector(2) = rotate_quad(2,1) * q_vector(1) + &
                    rotate_quad(2,2) * q_vector(2)
          rot_q_vector(3) = rotate_quad(3,1) * q_vector(1) + &
                    rotate_quad(3,2) * q_vector(2)

Otherwise it is folded as

          rot_q_vector(:) = rotate_quad(:,1) * q_vector(1) + &
                    rotate_quad(:,2) * q_vector(2)

When the macro XPB is defined the original numerator

          numerator = w1gauss(j) * w2gauss(k) *               &
                  dot_product(coil_current_vec,current_vector)

is unrolled as

          numerator = w1gauss(j) * w2gauss(k) *               &
                 (coil_current_vec(1)*current_vector(1) + &
                  coil_current_vec(2)*current_vector(2) + &
                  coil_current_vec(3)*current_vector(3))

When the macro XPC is defined the original denominator

          denominator = sqrt(dot_product(rot_c_vector-rot_q_vector, &
                         rot_c_vector-rot_q_vector))

is unrolled as
          denominator = sqrt((rot_c_vector(1)-rot_q_vector(1))**2 + &
                     (rot_c_vector(2)-rot_q_vector(2))**2 + &
                     (rot_c_vector(3)-rot_q_vector(3))**2)


It also contains a script to run the twelve cases and one case with
graphite, and the raw results for revisions 167530, 167531, and 173917
(original; with r167531 reverted: 173917r1; and with /* NEXT_PASS
(pass_complete_unrolli); */ : 173917n, since I think this is related to
revision 134730).

See also pr49006.

