public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/25621] New: Missed optimisation
From: jv244 at cam dot ac dot uk @ 2006-01-01 12:40 UTC
To: gcc-bugs

The following simple loop doesn't run as fast as the 'hand-optimised' routine
provided below (using current 4.2 on an Opteron) with -ffast-math -O2; it makes
a factor of 2 difference here. I've tried a number of further switches, but
didn't manage to find a case where the simple loop was as fast as the other.

    ! simple loop
    ! assume N is even
    SUBROUTINE S31(a,b,c,N)
     IMPLICIT NONE
     integer :: N
     real*8  :: a(N),b(N),c
     integer :: i
     c=0.0D0
     DO i=1,N
       c=c+a(i)*b(i)
     ENDDO
    END SUBROUTINE

    ! 'improved' loop
    SUBROUTINE S32(a,b,c,N)
     IMPLICIT NONE
     integer :: N
     real*8  :: a(N),b(N),c,tmp
     integer :: i
     c=0.0D0
     tmp=0.0D0
     DO i=1,N,2
        c=c+a(i)*b(i)
        tmp=tmp+a(i+1)*b(i+1)
     ENDDO
     c=c+tmp
    END SUBROUTINE

--
           Summary: Missed optimisation
           Product: gcc
           Version: 4.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: jv244 at cam dot ac dot uk

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621

* [Bug tree-optimization/25621] Missed optimisation
From: pinskia at gcc dot gnu dot org @ 2006-01-01 17:31 UTC
To: gcc-bugs

------- Comment #1 from pinskia at gcc dot gnu dot org  2006-01-01 17:31 -------
What happens if you use -funroll-loops? It should get about the same
improvement.
Also, your two loops are not equal if N is odd.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621

* [Bug tree-optimization/25621] Missed optimisation
From: jv244 at cam dot ac dot uk @ 2006-01-01 18:14 UTC
To: gcc-bugs

------- Comment #2 from jv244 at cam dot ac dot uk  2006-01-01 18:14 -------
(In reply to comment #1)
> What happens if you use -funroll-loops? It should get about the same
> improvement.

I have the following timings (for N=1024, calling these subroutines a number of
times plus some external initialisation):

    flags                             S31            S32
    -O2 -ffast-math -funroll-loops    0.0229959786   0.0119980276
    -O2 -ffast-math                   0.0229960084   0.0119979978

I think the issue is not pure unrolling but the fact that you have two
independent sums in the loop.

In fact, I now find that
-O2 -ffast-math -funroll-loops -ftree-loop-ivcanon -fivopts
-fvariable-expansion-in-unroller
yields much improved code:

    S31            S32
    0.0119979978   0.0079990029

The last option indeed seems to do what I did by hand; still, the routine S32
seems about 30% faster.

> Also, your two loops are not equal if N is odd.

I've added at least the comment ;-)
! assume N is even

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621
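
The point about N in comments #1 and #2 can be made concrete. S32 reads a(i+1) and b(i+1) on every trip, so for odd N its last iteration touches element N+1. The following is a minimal sketch of a variant that keeps the two independent sums but does not rely on N being even. The name S32_ANY and the variable M are hypothetical and are not part of the reported test case.

```fortran
! Sketch only: S32_ANY and M are illustrative names, not from the bug report.
! Same two-accumulator idea as S32, plus one leftover step for odd N.
SUBROUTINE S32_ANY(a,b,c,N)
 IMPLICIT NONE
 integer :: N
 real*8  :: a(N),b(N),c,tmp
 integer :: i,M
 c=0.0D0
 tmp=0.0D0
 M=N-MOD(N,2)              ! largest even count not exceeding N
 DO i=1,M,2
    c=c+a(i)*b(i)          ! odd-indexed elements
    tmp=tmp+a(i+1)*b(i+1)  ! even-indexed elements
 ENDDO
 IF (M < N) c=c+a(N)*b(N)  ! leftover element, only when N is odd
 c=c+tmp
END SUBROUTINE
```

For even N this performs the same additions in the same order as S32. As with S32, the summation order differs from S31, so a compiler may only derive it from S31 when reassociation is allowed (for example with -ffast-math).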

* [Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
From: pinskia at gcc dot gnu dot org @ 2006-01-06 14:07 UTC
To: gcc-bugs

------- Comment #3 from pinskia at gcc dot gnu dot org  2006-01-06 14:07 -------
Confirmed.

--
pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
             Status|UNCONFIRMED                 |NEW
     Ever Confirmed|0                           |1
   Last reconfirmed|0000-00-00 00:00:00         |2006-01-06 14:07:04
               date|                            |
            Summary|Missed optimisation         |Missed optimization when
                   |                            |unrolling the loop
                   |                            |(splitting up the sum) (only
                   |                            |with -ffast-math)

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621

* [Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
From: jv244 at cam dot ac dot uk @ 2007-07-03 19:30 UTC
To: gcc-bugs

------- Comment #4 from jv244 at cam dot ac dot uk  2007-07-03 19:30 -------
Now I get the same timings for the hand-optimised loop and the compiled loop if
I use the options:

    gfortran -O3 -ffast-math -ftree-vectorize -march=native -funroll-loops -fvariable-expansion-in-unroller test.f90

Whereas -funroll-loops is quite commonly added, -fvariable-expansion-in-unroller
is not. Could one have a heuristic that switches it on by default when
-funroll-loops (and -ffast-math) is given?

For S31 the timings are:

    > gfortran -O3 -ffast-math -ftree-vectorize -march=native -funroll-loops test.f90
    > time ./a.out
    real    0m6.618s

    > gfortran -O3 -ffast-math -ftree-vectorize -march=native -funroll-loops -fvariable-expansion-in-unroller test.f90
    > time ./a.out
    real    0m4.457s

so roughly a 1.5x speedup.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621

* [Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
From: eres at il dot ibm dot com @ 2007-07-04 8:58 UTC
To: gcc-bugs

------- Comment #5 from eres at il dot ibm dot com  2007-07-04 08:57 -------
You can also try to tune --param max-variable-expansions-in-unroller. The
default is to add one expansion (which seems to be the most helpful due to the
fact that adding more expansions can increase register pressure).

--
eres at il dot ibm dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |eres at il dot ibm dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621

* [Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
From: jv244 at cam dot ac dot uk @ 2007-07-04 9:23 UTC
To: gcc-bugs

------- Comment #6 from jv244 at cam dot ac dot uk  2007-07-04 09:23 -------
(In reply to comment #5)
> You can also try to tune --param max-variable-expansions-in-unroller. The
> default is to add one expansion (which seems to be the most helpful due to the
> fact that adding more expansions can increase register pressure).

There seems to be no effect from --param max-variable-expansions-in-unroller; I
get the same timings for all values. I do notice that ifort is twice as fast as
gfortran on the original loop on my machine (core2):

    > gfortran -O3 -ffast-math -ftree-vectorize -march=native -funroll-loops -fvariable-expansion-in-unroller --param max-variable-expansions-in-unroller=4 pr25621.f90
    > ./a.out
     default loop  0.868054000000000
     hand optimized loop  0.864054000000000

    > ifort -xT -O3 pr25621.f90
    pr25621.f90(32) : (col. 0) remark: LOOP WAS VECTORIZED.
    pr25621.f90(33) : (col. 0) remark: LOOP WAS VECTORIZED.
    pr25621.f90(9) : (col. 2) remark: LOOP WAS VECTORIZED.
    > ./a.out
     default loop  0.440027000000000
     hand optimized loop  0.876055000000000

It looks like ifort vectorizes the first loop, whereas gfortran does not
(it reports 'unsupported use in stmt').

As a reference:

    > gfortran -O3 -ffast-math -ftree-vectorize -march=native -funroll-loops pr25621.f90
    > ./a.out
     default loop   1.29608100000000
     hand optimized loop  0.860054000000000

The code actually used for testing is:

    ! simple loop
    ! assume N is even
    SUBROUTINE S31(a,b,c,N)
     IMPLICIT NONE
     integer :: N
     real*8  :: a(N),b(N),c
     integer :: i
     c=0.0D0
     DO i=1,N
       c=c+a(i)*b(i)
     ENDDO
    END SUBROUTINE

    ! 'improved' loop
    SUBROUTINE S32(a,b,c,N)
     IMPLICIT NONE
     integer :: N
     real*8  :: a(N),b(N),c,tmp
     integer :: i
     c=0.0D0
     tmp=0.0D0
     DO i=1,N,2
        c=c+a(i)*b(i)
        tmp=tmp+a(i+1)*b(i+1)
     ENDDO
     c=c+tmp
    END SUBROUTINE

    integer, parameter :: N=1024
    real*8  :: a(N),b(N),c,tmp,t1,t2

    a=0.0_8
    b=0.0_8
    ! untimed warm-up calls
    DO i=1,2000000
      CALL S31(a,b,c,N)
    ENDDO
    CALL CPU_TIME(t1)
    DO i=1,1000000
      CALL S31(a,b,c,N)
    ENDDO
    CALL CPU_TIME(t2)
    write(6,*) "default loop", t2-t1
    CALL CPU_TIME(t1)
    DO i=1,1000000
      CALL S32(a,b,c,N)
    ENDDO
    CALL CPU_TIME(t2)
    write(6,*) "hand optimized loop",t2-t1
    END

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621

* [Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
From: dorit at gcc dot gnu dot org @ 2007-07-04 11:14 UTC
To: gcc-bugs

------- Comment #7 from dorit at gcc dot gnu dot org  2007-07-04 11:14 -------
The vectorizer reports:

    pr25621.f90:7: note: reduction used in loop.
    pr25621.f90:7: note: Unknown def-use cycle pattern.

because of the seemingly redundant assignment:

    c__lsm.63_30 = D.1361_38;

which uses the reduction variable D.1361_38 inside the loop (only to be used
outside the loop). Need to teach the vectorizer to ignore this assignment or
clean it away before the vectorizer.

    <bb 4>:
      # prephitmp.57_5 = PHI <storetmp.55_34(3), D.1361_38(5)>
      # i_3 = PHI <1(3), i_40(5)>
      D.1357_31 = i_3 + -1;
      D.1358_33 = (*a_32(D))[D.1357_31];
      D.1359_36 = (*b_35(D))[D.1357_31];
      D.1360_37 = D.1359_36 * D.1358_33;
      D.1361_38 = prephitmp.57_5 + D.1360_37;
      c__lsm.63_30 = D.1361_38;
      i_40 = i_3 + 1;
      if (i_3 == D.1339_28) goto <bb 6>; else goto <bb 5>;

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621

* [Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
From: eres at il dot ibm dot com @ 2007-07-04 11:24 UTC
To: gcc-bugs

------- Comment #8 from eres at il dot ibm dot com  2007-07-04 11:24 -------
I think c__lsm.63_30 is created during the store motion optimization.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621

* [Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
From: dorit at gcc dot gnu dot org @ 2007-08-14 20:17 UTC
To: gcc-bugs

------- Comment #9 from dorit at gcc dot gnu dot org  2007-08-14 20:17 -------
PR32824 discusses a similar issue.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621

* [Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
From: jv244 at cam dot ac dot uk @ 2008-12-05 16:27 UTC
To: gcc-bugs

------- Comment #10 from jv244 at cam dot ac dot uk  2008-12-05 16:25 -------
Timings in 4.4 are essentially unchanged.

gfortran -O3 -ffast-math -march=native PR25621.f90:

     default loop   1.2920810000000000
     hand optimized loop  0.86405399999999988

Funnily enough, the timings are inverted with a recent Intel compiler:

     default loop  0.440028000000000
     hand optimized loop   1.26007800000000

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621

* [Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
From: jv244 at cam dot ac dot uk @ 2010-04-27 18:25 UTC
To: gcc-bugs

------- Comment #11 from jv244 at cam dot ac dot uk  2010-04-27 18:25 -------
The original loop now (4.6.0) gets vectorized, and gets the same performance as
the 'hand optimized loop' (which does not get vectorized):

    > ./a.out
     default loop  0.88005500000000003
     hand optimized loop  0.86005399999999987

It is still not quite as fast as the ifort code:

    ifort -fno-inline -O3 -xT -static t.f90
    > ~/a.out
     default loop  0.444028000000000
     hand optimized loop  0.964060000000000

ifort's asm looks good:

    # -- Begin  s31_
    # mark_begin;
           .align    16,0x90
            .globl s31_
    s31_:
    # parameter 1: %rdi
    # parameter 2: %rsi
    # parameter 3: %rdx
    # parameter 4: %rcx
    ..B2.1:                         # Preds ..B2.0
    ..___tag_value_s31_.10:                                 #3.12
            xorps     %xmm1, %xmm1                          #9.2
            movaps    %xmm1, %xmm0                          #9.2
            xorl      %eax, %eax                            #9.2
                                    # LOE rax rdx rbx rbp rsi rdi r12 r13 r14 r15 xmm0 xmm1
    ..B2.2:                         # Preds ..B2.2 ..B2.1
            movaps    (%rdi,%rax,8), %xmm2                  #10.8
            movaps    16(%rdi,%rax,8), %xmm3                #10.8
            movaps    32(%rdi,%rax,8), %xmm4                #10.8
            movaps    48(%rdi,%rax,8), %xmm5                #10.8
            mulpd     (%rsi,%rax,8), %xmm2                  #10.12
            mulpd     16(%rsi,%rax,8), %xmm3                #10.12
            mulpd     32(%rsi,%rax,8), %xmm4                #10.12
            mulpd     48(%rsi,%rax,8), %xmm5                #10.12
            addpd     %xmm2, %xmm0                          #10.4
            addq      $8, %rax                              #9.2
            cmpq      $1024, %rax                           #9.2
            addpd     %xmm3, %xmm1                          #10.4
            addpd     %xmm4, %xmm0                          #10.4
            addpd     %xmm5, %xmm1                          #10.4
            jb        ..B2.2        # Prob 82%              #9.2
                                    # LOE rax rdx rbx rbp rsi rdi r12 r13 r14 r15 xmm0 xmm1
    ..B2.3:                         # Preds ..B2.2
            addpd     %xmm1, %xmm0                          #9.2
            haddpd    %xmm0, %xmm0                          #9.2
            movsd     %xmm0, (%rdx)                         #10.4
            ret                                             #12.1
            .align    16,0x90
    ..___tag_value_s31_.11:                                 #

while gcc has more complicated-looking asm:

            .globl s31_
            .type   s31_, @function
    s31_:
    .LFB0:
            movl    (%rcx), %r9d
            movq    $0, (%rdx)
            testl   %r9d, %r9d
            jle     .L9
            movl    %r9d, %r8d
            shrl    %r8d
            cmpl    $4, %r9d
            leal    (%r8,%r8), %r10d
            jbe     .L15
            testl   %r10d, %r10d
            je      .L15
            xorl    %eax, %eax
            xorl    %ecx, %ecx
            xorpd   %xmm1, %xmm1
            .p2align 4,,10
            .p2align 3
    .L12:
            movsd   (%rsi,%rax), %xmm2
            movsd   (%rdi,%rax), %xmm3
            movhpd  8(%rsi,%rax), %xmm2
            movhpd  8(%rdi,%rax), %xmm3
            movapd  %xmm2, %xmm0
            incl    %ecx
            mulpd   %xmm3, %xmm0
            addq    $16, %rax
            addpd   %xmm0, %xmm1
            cmpl    %ecx, %r8d
            ja      .L12
            haddpd  %xmm1, %xmm1
            leal    1(%r10), %eax
            cmpl    %r9d, %r10d
            je      .L13
    .L11:
            movslq  %eax, %rcx
            subl    %eax, %r9d
            leaq    -8(,%rcx,8), %rcx
            xorl    %eax, %eax
            addq    %rcx, %rdi
            addq    %rcx, %rsi
            leaq    8(,%r9,8), %rcx
            .p2align 4,,10
            .p2align 3
    .L14:
            movsd   (%rsi), %xmm0
            addq    $8, %rax
            mulsd   (%rdi), %xmm0
            addq    $8, %rsi
            addq    $8, %rdi
            addsd   %xmm0, %xmm1
            cmpq    %rcx, %rax
            jne     .L14
    .L13:
            movsd   %xmm1, (%rdx)
    .L9:
            rep
            ret
    .L15:
            xorpd   %xmm1, %xmm1
            movl    $1, %eax
            jmp     .L11
    .LFE0:
            .size   s31_, .-s31_

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621
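
Read at source level, the ifort loop in comment #11 keeps two packed accumulators (xmm0 and xmm1, two doubles each), consumes eight elements per iteration, and only combines the partial sums after the loop (addpd followed by haddpd). The following is a rough scalar model of that structure, offered as a sketch under assumptions rather than as what either compiler emits. The name S31_VMODEL, the arrays v0 and v1, and the scalar remainder loop are hypothetical; the ifort code shown handles exactly 1024 elements and has no remainder loop.

```fortran
! Sketch only: S31_VMODEL, v0, v1 and M are illustrative names.
! v0 and v1 stand in for the two packed accumulators xmm0 and xmm1.
SUBROUTINE S31_VMODEL(a,b,c,N)
 IMPLICIT NONE
 integer :: N
 real*8  :: a(N),b(N),c
 real*8  :: v0(2),v1(2)
 integer :: i,M
 v0=0.0D0
 v1=0.0D0
 M=N-MOD(N,8)                 ! elements handled eight at a time
 DO i=1,M,8
    v0=v0+a(i:i+1)*b(i:i+1)        ! addpd %xmm2, %xmm0
    v1=v1+a(i+2:i+3)*b(i+2:i+3)    ! addpd %xmm3, %xmm1
    v0=v0+a(i+4:i+5)*b(i+4:i+5)    ! addpd %xmm4, %xmm0
    v1=v1+a(i+6:i+7)*b(i+6:i+7)    ! addpd %xmm5, %xmm1
 ENDDO
 v0=v0+v1                     ! addpd %xmm1, %xmm0
 c=v0(1)+v0(2)                ! haddpd %xmm0, %xmm0
 DO i=M+1,N                   ! scalar remainder (assumption, not in the ifort asm)
    c=c+a(i)*b(i)
 ENDDO
END SUBROUTINE
```

The gcc loop at .L12 has the same shape but with a single accumulator (xmm1) and two elements per iteration, so every addpd depends on the previous one; the four partial sums above are what shortens that dependency chain.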
[parent not found: <bug-25621-4@http.gcc.gnu.org/bugzilla/>]
* [Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
From: Joost.VandeVondele at mat dot ethz.ch @ 2013-03-29 10:07 UTC
To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621

Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |Joost.VandeVondele at mat
                   |                            |dot ethz.ch
         Depends on|                            |53947

--- Comment #12 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2013-03-29 10:07:06 UTC ---
This has become much more a vectorizer problem. Basically ifort generates code
that is twice as fast for routine S31 of the initial comment. Given that this
is a common dot product, it might be good to see why that happens. Both
compilers fail to notice that S32 is basically the same code hand-unrolled.

Tested with the code in comment #6 (without inlining):

    > gfortran -march=native -ffast-math -O3 -fno-inline PR25621.f90
    > ./a.out
     default loop  0.56491500000000006
     hand optimized loop  0.74488600000000016

    > ifort -xHost -O3 -fno-inline PR25621.f90
    > ./a.out
     default loop  0.377943000000000
     hand optimized loop  0.579911000000000

* [Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
From: Joost.VandeVondele at mat dot ethz.ch @ 2014-03-16 15:54 UTC
To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621

--- Comment #13 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> ---
(In reply to Joost VandeVondele from comment #12)
> Both compilers fail to notice that S32 is basically the same code
> hand-unrolled.

With gcc 4.9:

    > ./a.out
     default loop  0.54291800000000001
     hand optimized loop  0.54291700000000009

So, some progress: both versions of the loop give the same performance. Still
not quite as good as ifort, however.

* [Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
From: iliyapalachev at gmail dot com @ 2014-03-24 11:02 UTC
To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621

Ilya Palachev <iliyapalachev at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |iliyapalachev at gmail dot com

--- Comment #14 from Ilya Palachev <iliyapalachev at gmail dot com> ---
(In reply to Joost VandeVondele from comment #13)

The page http://gcc.gnu.org/wiki/VectorizationTasks states that the
generalization of reduction support
(http://gcc.gnu.org/ml/gcc-patches/2006-04/msg00172.html) can help to fix this
bug. Is this information still correct for gcc-4.9?

* [Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
From: rguenth at gcc dot gnu.org @ 2023-09-23 21:08 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #16 from Richard Biener <rguenth at gcc dot gnu.org> ---
We are generating the same vectorized loop for S31 and S32 now.