public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/31040] New: unroll/peel loops not aggressive enough
From: jv244 at cam dot ac dot uk @ 2007-03-05 9:11 UTC
To: gcc-bugs
Looking at the asm for the program below, there are plenty of loops left after
compiling with
> gfortran -S -march=native -O3 -funroll-loops -funroll-all-loops -fpeel-loops test.f90
or any combination of these options. Full unrolling (in which case the function
would simply return the value 3) would be possible and much faster.
> cat test.f90
INTEGER FUNCTION lxy()
lxy=0
DO lxa=0,1
DO lxb=0,0
DO lya=0,1-lxa
DO lyb=0,0-lxb
lxy=lxy+1
ENDDO
ENDDO
ENDDO
ENDDO
END FUNCTION
write(6,*) lxy()
END
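A minimal sketch of what full unrolling amounts to for this test case, assuming
only standard Fortran. The nest runs 2 + 1 = 3 innermost iterations (lya takes
two values when lxa=0 and one when lxa=1, while lxb and lyb only ever take the
value 0), so the function could fold to the constant 3. The names m, lxy_loops,
lxy_folded and check are hypothetical and introduced only for this comparison.

```fortran
! Sketch: compare the loop nest from test.f90 with its fully folded form.
! All names here are hypothetical and used for illustration only.
MODULE m
  IMPLICIT NONE
CONTAINS
  ! Same loop nest as lxy() in test.f90, with explicit declarations.
  INTEGER FUNCTION lxy_loops()
    INTEGER :: lxa, lxb, lya, lyb
    lxy_loops = 0
    DO lxa = 0, 1
      DO lxb = 0, 0
        DO lya = 0, 1 - lxa
          DO lyb = 0, 0 - lxb
            lxy_loops = lxy_loops + 1
          ENDDO
        ENDDO
      ENDDO
    ENDDO
  END FUNCTION lxy_loops

  ! What complete unrolling plus constant folding would reduce it to.
  INTEGER FUNCTION lxy_folded()
    lxy_folded = 3
  END FUNCTION lxy_folded
END MODULE m

PROGRAM check
  USE m
  IMPLICIT NONE
  ! Stop with a non-zero code if the two versions disagree.
  IF (lxy_loops() /= lxy_folded()) STOP 1
  WRITE(6,*) lxy_loops(), lxy_folded()
END PROGRAM check
```

Comparing the asm generated for lxy_loops and lxy_folded is one way to see
whether a given compiler version performs the requested optimization.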
--
Summary: unroll/peel loops not aggressive enough
Product: gcc
Version: 4.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: jv244 at cam dot ac dot uk
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31040
* [Bug tree-optimization/31040] unroll/peel loops not aggressive enough
From: rguenth at gcc dot gnu dot org @ 2007-03-05 10:18 UTC
To: gcc-bugs
------- Comment #1 from rguenth at gcc dot gnu dot org 2007-03-05 10:18 -------
We don't unroll non-innermost loops at the moment. I don't know if sccp can
be taught to handle this case (and if it's worth it).
--
rguenth at gcc dot gnu dot org changed:
What                  |Removed              |Added
----------------------------------------------------------------------------
CC                    |                     |rguenth at gcc dot gnu dot org,
                      |                     |rakdver at gcc dot gnu dot org
Status                |UNCONFIRMED          |NEW
Ever Confirmed        |0                    |1
Keywords              |                     |missed-optimization
Last reconfirmed date |0000-00-00 00:00:00  |2007-03-05 10:18:14
* [Bug tree-optimization/31040] unroll/peel loops not aggressive enough
From: jv244 at cam dot ac dot uk @ 2007-03-05 11:47 UTC
To: gcc-bugs
------- Comment #2 from jv244 at cam dot ac dot uk 2007-03-05 11:47 -------
(In reply to comment #1)
> We don't unroll non-innermost loops at the moment. I don't know if sccp can
> be taught to handle this case (and if it's worth it).
Such small loops are quite typical of some quantum chemistry integral
routines.
I'm currently experimenting with rewriting the kernel mentioned in PR 31021.
If I do this unrolling by hand, I get quite a speedup on the full kernel:
hand unrolled:
# best time 5.260329
loops:
# best time 6.616413
This is quite impressive (about a 20% reduction in the best time), because
these loops account for at most 30% of the kernel's total time.
The actual code in question is:
coef(:,:)=0.0_wp
lxy=0 ; lx=0
DO lxa=0,1
DO lxb=0,1
lx = lx + 1
g1=0.0_wp
g2=0.0_wp
g1k=0.0_wp
g2k=0.0_wp
DO lya=0,1-lxa
DO lyb=0,1-lxb
lxy=lxy+1
g1=g1+pyx(1,lxy)*dpy(lyb,lya,jg)
g2=g2+pyx(1,lxy)*dpy(lyb,lya,jg2)
g1k=g1k+pyx(2,lxy)*dpy(lyb,lya,jg)
g2k=g2k+pyx(2,lxy)*dpy(lyb,lya,jg2)
ENDDO
ENDDO
DO icoef=1,3
coef(icoef,1)=coef(icoef,1)+alpha(icoef,lx)*g1
coef(icoef,2)=coef(icoef,2)+alpha(icoef,lx)*g2
coef(icoef,3)=coef(icoef,3)+alpha(icoef,lx)*g1k
coef(icoef,4)=coef(icoef,4)+alpha(icoef,lx)*g2k
ENDDO
ENDDO
ENDDO
The hand-unrolled version simply expands all loops explicitly into a loop-free
sequence of exactly the same statements:
coef(:,:)=0.0_wp
g1=0.0_wp
g2=0.0_wp
g1k=0.0_wp
g2k=0.0_wp
g1=g1+pyx(1,1)*dpy(0,0,jg)
g2=g2+pyx(1,1)*dpy(0,0,jg2)
g1k=g1k+pyx(2,1)*dpy(0,0,jg)
g2k=g2k+pyx(2,1)*dpy(0,0,jg2)
g1=g1+pyx(1,2)*dpy(1,0,jg)
g2=g2+pyx(1,2)*dpy(1,0,jg2)
g1k=g1k+pyx(2,2)*dpy(1,0,jg)
g2k=g2k+pyx(2,2)*dpy(1,0,jg2)
g1=g1+pyx(1,3)*dpy(0,1,jg)
g2=g2+pyx(1,3)*dpy(0,1,jg2)
g1k=g1k+pyx(2,3)*dpy(0,1,jg)
g2k=g2k+pyx(2,3)*dpy(0,1,jg2)
g1=g1+pyx(1,4)*dpy(1,1,jg)
g2=g2+pyx(1,4)*dpy(1,1,jg2)
g1k=g1k+pyx(2,4)*dpy(1,1,jg)
g2k=g2k+pyx(2,4)*dpy(1,1,jg2)
coef(01,01)=coef(01,01)+alpha(1,1)*g1
coef(01,02)=coef(01,02)+alpha(1,1)*g2
coef(01,03)=coef(01,03)+alpha(1,1)*g1k
coef(01,04)=coef(01,04)+alpha(1,1)*g2k
coef(02,01)=coef(02,01)+alpha(2,1)*g1
coef(02,02)=coef(02,02)+alpha(2,1)*g2
coef(02,03)=coef(02,03)+alpha(2,1)*g1k
coef(02,04)=coef(02,04)+alpha(2,1)*g2k
coef(03,01)=coef(03,01)+alpha(3,1)*g1
coef(03,02)=coef(03,02)+alpha(3,1)*g2
coef(03,03)=coef(03,03)+alpha(3,1)*g1k
coef(03,04)=coef(03,04)+alpha(3,1)*g2k
g1=0.0_wp
g2=0.0_wp
g1k=0.0_wp
g2k=0.0_wp
g1=g1+pyx(1,5)*dpy(0,0,jg)
g2=g2+pyx(1,5)*dpy(0,0,jg2)
g1k=g1k+pyx(2,5)*dpy(0,0,jg)
g2k=g2k+pyx(2,5)*dpy(0,0,jg2)
g1=g1+pyx(1,6)*dpy(0,1,jg)
g2=g2+pyx(1,6)*dpy(0,1,jg2)
g1k=g1k+pyx(2,6)*dpy(0,1,jg)
g2k=g2k+pyx(2,6)*dpy(0,1,jg2)
coef(01,01)=coef(01,01)+alpha(1,2)*g1
coef(01,02)=coef(01,02)+alpha(1,2)*g2
coef(01,03)=coef(01,03)+alpha(1,2)*g1k
coef(01,04)=coef(01,04)+alpha(1,2)*g2k
coef(02,01)=coef(02,01)+alpha(2,2)*g1
coef(02,02)=coef(02,02)+alpha(2,2)*g2
coef(02,03)=coef(02,03)+alpha(2,2)*g1k
coef(02,04)=coef(02,04)+alpha(2,2)*g2k
coef(03,01)=coef(03,01)+alpha(3,2)*g1
coef(03,02)=coef(03,02)+alpha(3,2)*g2
coef(03,03)=coef(03,03)+alpha(3,2)*g1k
coef(03,04)=coef(03,04)+alpha(3,2)*g2k
g1=0.0_wp
g2=0.0_wp
g1k=0.0_wp
g2k=0.0_wp
g1=g1+pyx(1,7)*dpy(0,0,jg)
g2=g2+pyx(1,7)*dpy(0,0,jg2)
g1k=g1k+pyx(2,7)*dpy(0,0,jg)
g2k=g2k+pyx(2,7)*dpy(0,0,jg2)
g1=g1+pyx(1,8)*dpy(1,0,jg)
g2=g2+pyx(1,8)*dpy(1,0,jg2)
g1k=g1k+pyx(2,8)*dpy(1,0,jg)
g2k=g2k+pyx(2,8)*dpy(1,0,jg2)
coef(01,01)=coef(01,01)+alpha(1,3)*g1
coef(01,02)=coef(01,02)+alpha(1,3)*g2
coef(01,03)=coef(01,03)+alpha(1,3)*g1k
coef(01,04)=coef(01,04)+alpha(1,3)*g2k
coef(02,01)=coef(02,01)+alpha(2,3)*g1
coef(02,02)=coef(02,02)+alpha(2,3)*g2
coef(02,03)=coef(02,03)+alpha(2,3)*g1k
coef(02,04)=coef(02,04)+alpha(2,3)*g2k
coef(03,01)=coef(03,01)+alpha(3,3)*g1
coef(03,02)=coef(03,02)+alpha(3,3)*g2
coef(03,03)=coef(03,03)+alpha(3,3)*g1k
coef(03,04)=coef(03,04)+alpha(3,3)*g2k
g1=0.0_wp
g2=0.0_wp
g1k=0.0_wp
g2k=0.0_wp
g1=g1+pyx(1,9)*dpy(0,0,jg)
g2=g2+pyx(1,9)*dpy(0,0,jg2)
g1k=g1k+pyx(2,9)*dpy(0,0,jg)
g2k=g2k+pyx(2,9)*dpy(0,0,jg2)
coef(01,01)=coef(01,01)+alpha(1,4)*g1
coef(01,02)=coef(01,02)+alpha(1,4)*g2
coef(01,03)=coef(01,03)+alpha(1,4)*g1k
coef(01,04)=coef(01,04)+alpha(1,4)*g2k
coef(02,01)=coef(02,01)+alpha(2,4)*g1
coef(02,02)=coef(02,02)+alpha(2,4)*g2
coef(02,03)=coef(02,03)+alpha(2,4)*g1k
coef(02,04)=coef(02,04)+alpha(2,4)*g2k
coef(03,01)=coef(03,01)+alpha(3,4)*g1
coef(03,02)=coef(03,02)+alpha(3,4)*g2
coef(03,03)=coef(03,03)+alpha(3,4)*g1k
coef(03,04)=coef(03,04)+alpha(3,4)*g2k
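The constant indices in the hand-unrolled statements can be cross-checked
mechanically. The sketch below is a hypothetical helper program (not part of
the kernel) that runs the same loop nest and prints, for each innermost
iteration, the lx, lxy, lyb and lya values that the corresponding unrolled
statement uses.

```fortran
! Sketch: list the constant indices used by each iteration of the loop nest,
! as a cross-check for the hand-unrolled statements. The program name is
! hypothetical; the loop bounds are those of the kernel excerpt.
PROGRAM enumerate_indices
  IMPLICIT NONE
  INTEGER :: lxa, lxb, lya, lyb, lx, lxy
  lxy = 0 ; lx = 0
  DO lxa = 0, 1
    DO lxb = 0, 1
      lx = lx + 1
      DO lya = 0, 1 - lxa
        DO lyb = 0, 1 - lxb
          lxy = lxy + 1
          ! This iteration reads pyx(1:2,lxy) and dpy(lyb,lya,jg/jg2);
          ! its sums are later scaled by alpha(icoef,lx).
          WRITE(6,'(A,I0,A,I0,A,I0,A,I0)') 'lx=', lx, ' lxy=', lxy, &
               ' lyb=', lyb, ' lya=', lya
        ENDDO
      ENDDO
    ENDDO
  ENDDO
  ! The nest visits 4 + 2 + 2 + 1 = 9 (lya,lyb) pairs over 4 (lxa,lxb) pairs.
  IF (lxy /= 9 .OR. lx /= 4) STOP 1
END PROGRAM enumerate_indices
```

For lxy = 1..9 the (lyb,lya) pairs come out as (0,0), (1,0), (0,1), (1,1),
(0,0), (0,1), (0,0), (1,0), (0,0), which matches the dpy indices in the
expanded code above.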
* [Bug tree-optimization/31040] unroll/peel loops not aggressive enough
From: rakdver at atrey dot karlin dot mff dot cuni dot cz @ 2007-03-05 11:50 UTC
To: gcc-bugs
------- Comment #3 from rakdver at atrey dot karlin dot mff dot cuni dot cz 2007-03-05 11:49 -------
Subject: Re: unroll/peel loops not aggressive enough
> We don't unroll non-innermost loops at the moment. I don't know if sccp can
> be taught to handle this case (and if it's worth it).
It is fairly easy to make gcc completely unroll non-innermost loops; I
am working on that.
* [Bug tree-optimization/31040] unroll/peel loops not aggressive enough
From: rguenth at gcc dot gnu dot org @ 2007-03-05 12:22 UTC
To: gcc-bugs
------- Comment #4 from rguenth at gcc dot gnu dot org 2007-03-05 12:22 -------
Note that, in addition to unrolling the outermost loop, you can experiment with
adjusting the max-completely-peeled-insns parameter (via --param). Also, I wonder if
DO lxb=0,0
is really common (if so, the frontend might want to lower this differently).
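As a source-level illustration only (this is not a description of how the
gfortran frontend actually lowers such loops): a DO loop with equal bounds and
the default step runs its body exactly once, and the loop variable holds the
upper bound plus one afterwards. The sketch below, with hypothetical names,
checks that equivalence against straight-line code.

```fortran
! Sketch: a DO loop with equal bounds is equivalent to executing the body
! once. Names are hypothetical and used for illustration only.
PROGRAM single_trip
  IMPLICIT NONE
  INTEGER :: lxb, n_loop, n_flat, lxb_flat
  ! Loop form, as in the test case.
  n_loop = 0
  DO lxb = 0, 0
    n_loop = n_loop + 1
  ENDDO
  ! Straight-line form: the body once with the index at 0, then the index
  ! takes the value it would have after the loop terminates.
  lxb_flat = 0
  n_flat = 0
  n_flat = n_flat + 1
  lxb_flat = lxb_flat + 1
  ! Stop with a non-zero code if the two forms disagree.
  IF (n_loop /= n_flat .OR. lxb /= lxb_flat) STOP 1
  WRITE(6,*) n_loop, lxb
END PROGRAM single_trip
```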
* [Bug tree-optimization/31040] unroll/peel loops not aggressive enough
From: jv244 at cam dot ac dot uk @ 2007-07-03 18:21 UTC
To: gcc-bugs
------- Comment #5 from jv244 at cam dot ac dot uk 2007-07-03 18:21 -------
The optimization asked for in this PR is now being performed:
> gfortran -O3 -funroll-loops -S test.f90
yields
.globl lxy_
.type lxy_, @function
lxy_:
.LFB2:
movl $3, %eax
ret
.LFE2:
.size lxy_, .-lxy_
.section .eh_frame,"a",@progbits
.Lframe1:
--
jv244 at cam dot ac dot uk changed:
What                  |Removed              |Added
----------------------------------------------------------------------------
Status                |NEW                  |RESOLVED
Resolution            |                     |FIXED
* [Bug tree-optimization/31040] unroll/peel loops not aggressive enough
From: pinskia at gcc dot gnu dot org @ 2007-07-21 8:59 UTC
To: gcc-bugs
--
pinskia at gcc dot gnu dot org changed:
What                  |Removed              |Added
----------------------------------------------------------------------------
CC                    |                     |pinskia at gcc dot gnu dot org
Target Milestone      |---                  |4.3.0