[Bug tree-optimization/31040] New: unroll/peel loops not aggressive enough

* [Bug tree-optimization/31040]  New: unroll/peel loops not aggressive enough
From: jv244 at cam dot ac dot uk @ 2007-03-05  9:11 UTC (permalink / raw)
  To: gcc-bugs

Looking at the asm for the program below, there are plenty of loops left after
compiling with

> gfortran  -S -march=native -O3 -funroll-loops -funroll-all-loops -fpeel-loops test.f90

or any combination of these options. A full unrolling (and in that case a
return of the value 3) would be possible and much faster.

> cat test.f90

INTEGER FUNCTION lxy()
   lxy=0
   DO lxa=0,1
   DO lxb=0,0
     DO lya=0,1-lxa
     DO lyb=0,0-lxb
       lxy=lxy+1
     ENDDO
     ENDDO
   ENDDO
   ENDDO
END FUNCTION
write(6,*) lxy()
END
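
For reference, the value 3 can be checked by hand: the loop body only runs
for (lxa,lxb,lya,lyb) = (0,0,0,0), (0,0,1,0) and (1,0,0,0). The following is a
minimal sketch of that check, not part of the original test case; the names
lxy_sketch, lxy_loops, lxy_unrolled and check_lxy are made up for
illustration.

```fortran
! Sketch only: compares the reported loop nest with a loop-free version.
! All names here are hypothetical and do not appear in the original report.
MODULE lxy_sketch
   IMPLICIT NONE
CONTAINS
   ! The loop nest exactly as in test.f90, with explicit declarations.
   INTEGER FUNCTION lxy_loops() RESULT(lxy)
      INTEGER :: lxa, lxb, lya, lyb
      lxy = 0
      DO lxa = 0, 1
      DO lxb = 0, 0
         DO lya = 0, 1-lxa
         DO lyb = 0, 0-lxb
            lxy = lxy + 1
         ENDDO
         ENDDO
      ENDDO
      ENDDO
   END FUNCTION lxy_loops

   ! What complete unrolling would leave behind: one increment per
   ! iteration of the innermost body.
   INTEGER FUNCTION lxy_unrolled() RESULT(lxy)
      lxy = 0
      lxy = lxy + 1   ! lxa=0, lxb=0, lya=0, lyb=0
      lxy = lxy + 1   ! lxa=0, lxb=0, lya=1, lyb=0
      lxy = lxy + 1   ! lxa=1, lxb=0, lya=0, lyb=0
   END FUNCTION lxy_unrolled
END MODULE lxy_sketch

PROGRAM check_lxy
   USE lxy_sketch
   IMPLICIT NONE
   ! Stop with a non-zero code if the two versions disagree.
   IF (lxy_loops() /= lxy_unrolled()) STOP 1
   WRITE(6,*) lxy_loops(), lxy_unrolled()
END PROGRAM check_lxy
```

Both functions should return the same value, which is the constant a fully
unrolling compiler could substitute for the call.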


-- 
           Summary: unroll/peel loops not aggressive enough
           Product: gcc
           Version: 4.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: jv244 at cam dot ac dot uk


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31040



* [Bug tree-optimization/31040] unroll/peel loops not aggressive enough
From: rguenth at gcc dot gnu dot org @ 2007-03-05 10:18 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #1 from rguenth at gcc dot gnu dot org  2007-03-05 10:18 -------
We don't unroll non-innermost loops at the moment.  I don't know if sccp can
be taught to handle this case (and if it's worth it).


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu dot
                   |                            |org, rakdver at gcc dot gnu
                   |                            |dot org
             Status|UNCONFIRMED                 |NEW
     Ever Confirmed|0                           |1
           Keywords|                            |missed-optimization
   Last reconfirmed|0000-00-00 00:00:00         |2007-03-05 10:18:14
               date|                            |



* [Bug tree-optimization/31040] unroll/peel loops not aggressive enough
From: jv244 at cam dot ac dot uk @ 2007-03-05 11:47 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #2 from jv244 at cam dot ac dot uk  2007-03-05 11:47 -------
(In reply to comment #1)
> We don't unroll non-innermost loops at the moment.  I don't know if sccp can
> be taught to handle this case (and if it's worth it).

Such small loops are quite typical of some quantum chemistry integral
routines.
I'm currently experimenting with rewriting the kernel mentioned in PR 31021.
If I do this unrolling by hand I get quite a speedup on the full kernel:

hand unrolled:
# best time    5.260329
loops:
# best time    6.616413

which is quite impressive, because these loops take at most 30% of the
kernel's total time.

The actual code in question is:

             coef(:,:)=0.0_wp
             lxy=0 ; lx=0
             DO lxa=0,1
             DO lxb=0,1
              lx = lx + 1
              g1=0.0_wp
              g2=0.0_wp
              g1k=0.0_wp
              g2k=0.0_wp
              DO lya=0,1-lxa
              DO lyb=0,1-lxb
                    lxy=lxy+1
                    g1=g1+pyx(1,lxy)*dpy(lyb,lya,jg)
                    g2=g2+pyx(1,lxy)*dpy(lyb,lya,jg2)
                    g1k=g1k+pyx(2,lxy)*dpy(lyb,lya,jg)
                    g2k=g2k+pyx(2,lxy)*dpy(lyb,lya,jg2)
              ENDDO
              ENDDO
              DO icoef=1,3
                 coef(icoef,1)=coef(icoef,1)+alpha(icoef,lx)*g1
                 coef(icoef,2)=coef(icoef,2)+alpha(icoef,lx)*g2
                 coef(icoef,3)=coef(icoef,3)+alpha(icoef,lx)*g1k
                 coef(icoef,4)=coef(icoef,4)+alpha(icoef,lx)*g2k
              ENDDO
             ENDDO
             ENDDO

The hand-unrolled version simply expands all loops into a loop-free sequence
of exactly the same statements:

             coef(:,:)=0.0_wp
              g1=0.0_wp
              g2=0.0_wp
              g1k=0.0_wp
              g2k=0.0_wp
                    g1=g1+pyx(1,1)*dpy(0,0,jg)
                    g2=g2+pyx(1,1)*dpy(0,0,jg2)
                    g1k=g1k+pyx(2,1)*dpy(0,0,jg)
                    g2k=g2k+pyx(2,1)*dpy(0,0,jg2)
                    g1=g1+pyx(1,2)*dpy(1,0,jg)
                    g2=g2+pyx(1,2)*dpy(1,0,jg2)
                    g1k=g1k+pyx(2,2)*dpy(1,0,jg)
                    g2k=g2k+pyx(2,2)*dpy(1,0,jg2)
                    g1=g1+pyx(1,3)*dpy(0,1,jg)
                    g2=g2+pyx(1,3)*dpy(0,1,jg2)
                    g1k=g1k+pyx(2,3)*dpy(0,1,jg)
                    g2k=g2k+pyx(2,3)*dpy(0,1,jg2)
                    g1=g1+pyx(1,4)*dpy(1,1,jg)
                    g2=g2+pyx(1,4)*dpy(1,1,jg2)
                    g1k=g1k+pyx(2,4)*dpy(1,1,jg)
                    g2k=g2k+pyx(2,4)*dpy(1,1,jg2)
                 coef(01,01)=coef(01,01)+alpha(1,1)*g1
                 coef(01,02)=coef(01,02)+alpha(1,1)*g2
                 coef(01,03)=coef(01,03)+alpha(1,1)*g1k
                 coef(01,04)=coef(01,04)+alpha(1,1)*g2k
                 coef(02,01)=coef(02,01)+alpha(2,1)*g1
                 coef(02,02)=coef(02,02)+alpha(2,1)*g2
                 coef(02,03)=coef(02,03)+alpha(2,1)*g1k
                 coef(02,04)=coef(02,04)+alpha(2,1)*g2k
                 coef(03,01)=coef(03,01)+alpha(3,1)*g1
                 coef(03,02)=coef(03,02)+alpha(3,1)*g2
                 coef(03,03)=coef(03,03)+alpha(3,1)*g1k
                 coef(03,04)=coef(03,04)+alpha(3,1)*g2k
              g1=0.0_wp
              g2=0.0_wp
              g1k=0.0_wp
              g2k=0.0_wp
                    g1=g1+pyx(1,5)*dpy(0,0,jg)
                    g2=g2+pyx(1,5)*dpy(0,0,jg2)
                    g1k=g1k+pyx(2,5)*dpy(0,0,jg)
                    g2k=g2k+pyx(2,5)*dpy(0,0,jg2)
                    g1=g1+pyx(1,6)*dpy(0,1,jg)
                    g2=g2+pyx(1,6)*dpy(0,1,jg2)
                    g1k=g1k+pyx(2,6)*dpy(0,1,jg)
                    g2k=g2k+pyx(2,6)*dpy(0,1,jg2)
                 coef(01,01)=coef(01,01)+alpha(1,2)*g1
                 coef(01,02)=coef(01,02)+alpha(1,2)*g2
                 coef(01,03)=coef(01,03)+alpha(1,2)*g1k
                 coef(01,04)=coef(01,04)+alpha(1,2)*g2k
                 coef(02,01)=coef(02,01)+alpha(2,2)*g1
                 coef(02,02)=coef(02,02)+alpha(2,2)*g2
                 coef(02,03)=coef(02,03)+alpha(2,2)*g1k
                 coef(02,04)=coef(02,04)+alpha(2,2)*g2k
                 coef(03,01)=coef(03,01)+alpha(3,2)*g1
                 coef(03,02)=coef(03,02)+alpha(3,2)*g2
                 coef(03,03)=coef(03,03)+alpha(3,2)*g1k
                 coef(03,04)=coef(03,04)+alpha(3,2)*g2k
              g1=0.0_wp
              g2=0.0_wp
              g1k=0.0_wp
              g2k=0.0_wp
                    g1=g1+pyx(1,7)*dpy(0,0,jg)
                    g2=g2+pyx(1,7)*dpy(0,0,jg2)
                    g1k=g1k+pyx(2,7)*dpy(0,0,jg)
                    g2k=g2k+pyx(2,7)*dpy(0,0,jg2)
                    g1=g1+pyx(1,8)*dpy(1,0,jg)
                    g2=g2+pyx(1,8)*dpy(1,0,jg2)
                    g1k=g1k+pyx(2,8)*dpy(1,0,jg)
                    g2k=g2k+pyx(2,8)*dpy(1,0,jg2)
                 coef(01,01)=coef(01,01)+alpha(1,3)*g1
                 coef(01,02)=coef(01,02)+alpha(1,3)*g2
                 coef(01,03)=coef(01,03)+alpha(1,3)*g1k
                 coef(01,04)=coef(01,04)+alpha(1,3)*g2k
                 coef(02,01)=coef(02,01)+alpha(2,3)*g1
                 coef(02,02)=coef(02,02)+alpha(2,3)*g2
                 coef(02,03)=coef(02,03)+alpha(2,3)*g1k
                 coef(02,04)=coef(02,04)+alpha(2,3)*g2k
                 coef(03,01)=coef(03,01)+alpha(3,3)*g1
                 coef(03,02)=coef(03,02)+alpha(3,3)*g2
                 coef(03,03)=coef(03,03)+alpha(3,3)*g1k
                 coef(03,04)=coef(03,04)+alpha(3,3)*g2k
              g1=0.0_wp
              g2=0.0_wp
              g1k=0.0_wp
              g2k=0.0_wp
                    g1=g1+pyx(1,9)*dpy(0,0,jg)
                    g2=g2+pyx(1,9)*dpy(0,0,jg2)
                    g1k=g1k+pyx(2,9)*dpy(0,0,jg)
                    g2k=g2k+pyx(2,9)*dpy(0,0,jg2)
                 coef(01,01)=coef(01,01)+alpha(1,4)*g1
                 coef(01,02)=coef(01,02)+alpha(1,4)*g2
                 coef(01,03)=coef(01,03)+alpha(1,4)*g1k
                 coef(01,04)=coef(01,04)+alpha(1,4)*g2k
                 coef(02,01)=coef(02,01)+alpha(2,4)*g1
                 coef(02,02)=coef(02,02)+alpha(2,4)*g2
                 coef(02,03)=coef(02,03)+alpha(2,4)*g1k
                 coef(02,04)=coef(02,04)+alpha(2,4)*g2k
                 coef(03,01)=coef(03,01)+alpha(3,4)*g1
                 coef(03,02)=coef(03,02)+alpha(3,4)*g2
                 coef(03,03)=coef(03,03)+alpha(3,4)*g1k
                 coef(03,04)=coef(03,04)+alpha(3,4)*g2k
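
The index values used in the hand-unrolled statements above can be checked
against the loop nest mechanically. The following is a minimal sketch of such
a check, not taken from the kernel itself; the names unroll_check, expected,
count_mismatches and check_unroll are made up for illustration, and the table
was read off the unrolled statements by hand.

```fortran
! Sketch only: verifies that the (lx, lxy, lyb, lya) values produced by the
! loop nest match those written out in the hand-unrolled code.
! All names here are hypothetical.
MODULE unroll_check
   IMPLICIT NONE
   ! One column per innermost iteration: lx, lxy, lyb, lya
   ! (lyb before lya because dpy is indexed as dpy(lyb,lya,...)).
   INTEGER, PARAMETER :: expected(4,9) = RESHAPE( (/ &
      1,1,0,0,  1,2,1,0,  1,3,0,1,  1,4,1,1,  &
      2,5,0,0,  2,6,0,1,                      &
      3,7,0,0,  3,8,1,0,                      &
      4,9,0,0 /), (/ 4, 9 /) )
CONTAINS
   ! Runs the original loop structure and counts disagreements with the table.
   INTEGER FUNCTION count_mismatches() RESULT(nbad)
      INTEGER :: lxa, lxb, lya, lyb, lx, lxy
      nbad = 0
      lxy = 0 ; lx = 0
      DO lxa = 0, 1
      DO lxb = 0, 1
         lx = lx + 1
         DO lya = 0, 1-lxa
         DO lyb = 0, 1-lxb
            lxy = lxy + 1
            IF (lxy > 9) THEN
               nbad = nbad + 1
            ELSE IF (ANY(expected(:,lxy) /= (/ lx, lxy, lyb, lya /))) THEN
               nbad = nbad + 1
            END IF
         ENDDO
         ENDDO
      ENDDO
      ENDDO
      ! The nest should visit exactly 9 inner and 4 outer iterations.
      IF (lxy /= 9 .OR. lx /= 4) nbad = nbad + 1
   END FUNCTION count_mismatches
END MODULE unroll_check

PROGRAM check_unroll
   USE unroll_check
   IMPLICIT NONE
   ! Stop with a non-zero code if any index differs.
   IF (count_mismatches() /= 0) STOP 1
   WRITE(6,*) 'loop nest and hand-unrolled indices agree'
END PROGRAM check_unroll
```

This only checks the index bookkeeping (9 inner iterations spread over 4
values of lx); it does not touch pyx, dpy, alpha or coef.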



* [Bug tree-optimization/31040] unroll/peel loops not aggressive enough
From: rakdver at atrey dot karlin dot mff dot cuni dot cz @ 2007-03-05 11:50 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #3 from rakdver at atrey dot karlin dot mff dot cuni dot cz  2007-03-05 11:49 -------
Subject: Re:  unroll/peel loops not aggressive enough

> We don't unroll non-innermost loops at the moment.  I don't know if sccp can
> be taught to handle this case (and if it's worth it).

It is fairly easy to make gcc completely unroll non-innermost loops; I
am working on that.



* [Bug tree-optimization/31040] unroll/peel loops not aggressive enough
From: rguenth at gcc dot gnu dot org @ 2007-03-05 12:22 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #4 from rguenth at gcc dot gnu dot org  2007-03-05 12:22 -------
Note that in addition to unrolling the outermost loop you can experiment with
adjusting --param max-completely-peeled-insns.  Also I wonder if

  DO lxb=0,0

is really common (if so, the frontend might want to lower this differently).



* [Bug tree-optimization/31040] unroll/peel loops not aggressive enough
From: jv244 at cam dot ac dot uk @ 2007-07-03 18:21 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #5 from jv244 at cam dot ac dot uk  2007-07-03 18:21 -------
The optimization asked for in this PR is now being performed:

> gfortran -O3 -funroll-loops -S test.f90

yields

.globl lxy_
        .type   lxy_, @function
lxy_:
.LFB2:
        movl    $3, %eax
        ret
.LFE2:
        .size   lxy_, .-lxy_
        .section        .eh_frame,"a",@progbits
.Lframe1:


-- 

jv244 at cam dot ac dot uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED



* [Bug tree-optimization/31040] unroll/peel loops not aggressive enough
From: pinskia at gcc dot gnu dot org @ 2007-07-21  8:59 UTC (permalink / raw)
  To: gcc-bugs



-- 

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pinskia at gcc dot gnu dot
                   |                            |org
   Target Milestone|---                         |4.3.0

