public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/34265]  New: Missed optimizations
@ 2007-11-28 15:27 dominiq at lps dot ens dot fr
  2007-11-28 15:30 ` [Bug tree-optimization/34265] " dominiq at lps dot ens dot fr
                   ` (32 more replies)
  0 siblings, 33 replies; 34+ messages in thread
From: dominiq at lps dot ens dot fr @ 2007-11-28 15:27 UTC (permalink / raw)
  To: gcc-bugs

I have had a closer look to the optimization of the polyhedron test 
induct.f90. 
3/4 of the runtime is spent in the subroutine 'mutual_ind_quad_cir_coil' and
1/4 in 'mutual_ind_quad_rec_coil'.
The two subroutines contain two main loops with the following structure:

do i = 1, 2*m
    ...
    do j = 1, 9
        ...
        do k = 1, 9
            q_vector(1) = 0.5_longreal * a * (x2gauss(k) + 1.0_longreal)
            q_vector(2) = 0.5_longreal * b1 * (y2gauss(k) - 1.0_longreal)
            q_vector(3) = 0.0_longreal
!
!       rotate quad vector into the global coordinate system
!
            rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:))
            rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:))
            rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:))
!
!       compute and add in quadrature term
!
            numerator = w1gauss(j) * w2gauss(k) *                         &
                        dot_product(coil_current_vec,current_vector)
            denominator = sqrt(dot_product(rot_c_vector-rot_q_vector,     &
                                           rot_c_vector-rot_q_vector))
            l12_lower = l12_lower + numerator/denominator
        end do
    end do
end do

where the six first lines of code in the k loop do not depend on i nor j
and can be computed in a 'do k = 1, 9' loop outside the main loop by
replacing the length-three vector 'rot_q_vector' by a nine by three array.

The original code induct.f90 gives the following timing (all with -O3 
-ffast-math -funroll-loops):

93.227u 0.094s 1:33.32 99.9%    0+0k 0+0io 0pf+0w

and output:

...
Maximum wand/quad abs rel mutual inductance =   5.95379428444656467E-002
Minimum resq coil/quad abs rel mutual inductance =    0.0000000000000000     
Maximum resq coil/quad abs rel mutual inductance =   9.63995250242061230E-002
...

Unrolling bye hand 'numerator' and 'denominator' gives (see 
http://gcc.gnu.org/ml/fortran/2007-11/msg00231.html):

65.563u 0.092s 1:05.66 99.9%    0+0k 0+0io 0pf+0w

Looking at the assembly I can see that for the original code the inner loops in
k are not  unrolled, as guessed by Paul Thomas (only the implied vector loops
being unrolled).

QUESTION 1: Should the frontend do the unrolling for small vectors itself? or
should the middleend be more aggressive for nested loops with small known
iterations?

Moving the invariants on i and j in the k loops outside the main loops gives:

80.313u 0.074s 1:20.39 99.9%    0+0k 0+1io 0pf+0w

Combining the two hand optimizations gives:

35.925u 0.040s 0:35.97 99.9%    0+0k 0+0io 0pf+0w

(without -ffast-math the timing is
59.263u 0.067s 0:59.33 99.9%    0+0k 0+1io 0pf+0w)

but the results change to:

Maximum wand/quad abs rel mutual inductance =   5.95379428444656675E-002
Minimum resq coil/quad abs rel mutual inductance =    0.0000000000000000     
Maximum resq coil/quad abs rel mutual inductance =   9.63995250242059842E-002

( Maximum wand/quad abs rel mutual inductance =   5.95379428444659520E-002
  Minimum resq coil/quad abs rel mutual inductance =    0.0000000000000000     
  Maximum resq coil/quad abs rel mutual inductance =   9.63995250242060675E-002
without -ffast-math).

The attached file gives the differences between the original code and the three
variants.

This is to be compared to further optimizations (indu.v2.f90) leading to:

35.376u 0.062s 0:35.44 99.9%    0+0k 0+1io 0pf+0w

or after merging the two loops (indu.v3.f90) to:

34.452u 0.041s 0:34.49 100.0%   0+0k 0+0io 0pf+0w

I have counted the number of sqrt() in the assembly code and found 9 of them in
the slow codes while I only found 5 (10 for indu.v3.f90) of them for the fast
codes. I checked that this was not due to common 
subexpressions I may have missed. Looking more closely to the assemby I saw
that he slow codes used 'sqrtsd' while the fast ones used 'sqrtpd' along with
several other packed operations. Now 65.66*5/9=36.48 explaining most of the
speedup. Note that 5=9/2+1, i.e., four packed computations 
followed by a scalar one.  Owing to the same structure in the two loops, the
two scalar computations could be merged in a packed one, but it is missed by
the optimizer.

I have tried without success to trigger this "vectorization" by making some
variables vectors in k.

QUESTION 2:  Why the optimizer is able to vectorize in some cases and not in
others? Can the frontend help to vectorize?


-- 
           Summary: Missed optimizations
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: dominiq at lps dot ens dot fr
 GCC build triplet: i686-apple-darwin9
  GCC host triplet: i686-apple-darwin9
GCC target triplet: i686-apple-darwin9


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34265


^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2008-04-23 21:27 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-11-28 15:27 [Bug tree-optimization/34265] New: Missed optimizations dominiq at lps dot ens dot fr
2007-11-28 15:30 ` [Bug tree-optimization/34265] " dominiq at lps dot ens dot fr
2007-11-28 16:06 ` rguenth at gcc dot gnu dot org
2007-11-28 16:14 ` dominiq at lps dot ens dot fr
2007-11-28 16:18 ` rguenth at gcc dot gnu dot org
2007-11-28 16:34 ` rguenth at gcc dot gnu dot org
2007-11-28 18:18 ` dominiq at lps dot ens dot fr
2007-11-28 18:49 ` dominiq at lps dot ens dot fr
2007-11-28 20:48 ` jb at gcc dot gnu dot org
2007-11-28 21:27 ` burnus at gcc dot gnu dot org
2007-11-28 22:05 ` rguenth at gcc dot gnu dot org
2007-11-28 22:36 ` dominiq at lps dot ens dot fr
2007-11-28 22:49 ` steven at gcc dot gnu dot org
2007-11-28 23:07 ` kargl at gcc dot gnu dot org
2007-11-28 23:18 ` steven at gcc dot gnu dot org
2007-11-28 23:57 ` dominiq at lps dot ens dot fr
2007-11-29  8:06 ` dominiq at lps dot ens dot fr
2007-11-29 10:12 ` rguenth at gcc dot gnu dot org
2007-11-29 10:22 ` dominiq at lps dot ens dot fr
2007-11-29 10:40 ` dominiq at lps dot ens dot fr
2007-11-29 11:01 ` dominiq at lps dot ens dot fr
2007-11-29 11:13 ` rguenther at suse dot de
2007-11-29 11:16 ` dominiq at lps dot ens dot fr
2007-11-29 12:25 ` dominiq at lps dot ens dot fr
2007-11-29 15:49 ` dominiq at lps dot ens dot fr
2007-11-30 21:38 ` ubizjak at gmail dot com
2007-12-03 14:08 ` dominiq at lps dot ens dot fr
2007-12-03 14:32 ` dominiq at lps dot ens dot fr
2007-12-03 14:34 ` dominiq at lps dot ens dot fr
2007-12-03 14:34 ` dominiq at lps dot ens dot fr
2007-12-03 16:30 ` ubizjak at gmail dot com
2007-12-03 18:59 ` dominiq at lps dot ens dot fr
2007-12-04  6:57 ` irar at il dot ibm dot com
2008-04-23 21:27 ` dominiq at lps dot ens dot fr

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).