[Bug tree-optimization/25621] New: Missed optimisation


* [Bug tree-optimization/25621]  New: Missed optimisation
From: jv244 at cam dot ac dot uk @ 2006-01-01 12:40 UTC (permalink / raw)
  To: gcc-bugs

The following simple loop doesn't run as fast as the 'hand-optimised' routine
provided alongside it (using current 4.2 on an Opteron) with -ffast-math -O2;
the difference is a factor of 2 here. I've tried a number of further switches,
but didn't manage to find a case where the simple loop was as fast as the other.

! simple loop
! assume N is even
SUBROUTINE S31(a,b,c,N)
 IMPLICIT NONE
 integer :: N
 real*8  :: a(N),b(N),c
 integer :: i
 c=0.0D0
 DO i=1,N
   c=c+a(i)*b(i)
 ENDDO
END SUBROUTINE

! 'improved' loop
SUBROUTINE S32(a,b,c,N)
 IMPLICIT NONE
 integer :: N
 real*8  :: a(N),b(N),c,tmp
 integer :: i
 c=0.0D0
 tmp=0.0D0
 DO i=1,N,2
    c=c+a(i)*b(i)
    tmp=tmp+a(i+1)*b(i+1)
 ENDDO
 c=c+tmp
END SUBROUTINE


-- 
           Summary: Missed optimisation
           Product: gcc
           Version: 4.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: jv244 at cam dot ac dot uk


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621



* [Bug tree-optimization/25621] Missed optimisation
From: pinskia at gcc dot gnu dot org @ 2006-01-01 17:31 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #1 from pinskia at gcc dot gnu dot org  2006-01-01 17:31 -------
What happens if you use -funroll-loops?  It should get about the same
improvement.

Also, your two loops are not equivalent if N is odd.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621



* [Bug tree-optimization/25621] Missed optimisation
From: jv244 at cam dot ac dot uk @ 2006-01-01 18:14 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #2 from jv244 at cam dot ac dot uk  2006-01-01 18:14 -------
(In reply to comment #1)
> What happens if you use -funroll-loops?  It should get about the same
> improvement.

I have the following timings (for N=1024, calling these subroutines a number of
times, plus some external initialisation):

flags                             S31             S32
-O2 -ffast-math -funroll-loops    0.0229959786    0.0119980276
-O2 -ffast-math                   0.0229960084    0.0119979978

I think the issue is not pure unrolling but the fact that you have two
independent sums in the loop.
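
Splitting one running sum into two independent partial sums changes the order
of the floating-point additions, which is presumably why a compiler can only do
this on its own under flags such as -ffast-math. A minimal sketch of the effect
follows; the routine name SUM_ORDER_DEMO and its test data are made up for
illustration and are not part of the original test case.

```fortran
! Hypothetical demo: the same four products summed in two different orders.
SUBROUTINE SUM_ORDER_DEMO(seq,split)
 IMPLICIT NONE
 double precision, intent(out) :: seq,split
 integer, parameter :: N=4
 double precision :: a(N),b(N),tmp
 integer :: i
 a=(/ 1.0D16, 1.0D0, -1.0D16, 1.0D0 /)
 b=1.0D0
 ! one running sum, in the order used by S31
 seq=0.0D0
 DO i=1,N
   seq=seq+a(i)*b(i)
 ENDDO
 ! two independent partial sums, in the order used by S32
 split=0.0D0
 tmp=0.0D0
 DO i=1,N,2
    split=split+a(i)*b(i)
    tmp=tmp+a(i+1)*b(i+1)
 ENDDO
 split=split+tmp
END SUBROUTINE
```

Compiled without -ffast-math, and assuming strict IEEE double evaluation (e.g.
SSE2 on x86-64), seq should come out as 1.0 and split as 2.0, because
1.0D16+1.0D0 rounds back to 1.0D16 in the sequential order.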

In fact, I now find that
-O2 -ffast-math -funroll-loops -ftree-loop-ivcanon -fivopts
-fvariable-expansion-in-unroller
yields much improved code for S31:

S31             S32
0.0119979978    0.0079990029

The last option indeed seems to do what I did by hand; still, the routine S32
seems about 30% faster.

> Also, your two loops are not equivalent if N is odd.
I've added at least the comment ;-)
! assume N is even
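
One way to drop the "N is even" assumption would be to stop the paired loop one
element early and add a possible unpaired last element separately. This is only
a sketch; the name S32_ANY is hypothetical and no timings are claimed for it.

```fortran
! Hypothetical variant of S32 that also accepts odd N.
SUBROUTINE S32_ANY(a,b,c,N)
 IMPLICIT NONE
 integer :: N
 double precision :: a(N),b(N),c,tmp
 integer :: i
 c=0.0D0
 tmp=0.0D0
 ! paired part: i+1 never exceeds N because the loop stops at N-1
 DO i=1,N-1,2
    c=c+a(i)*b(i)
    tmp=tmp+a(i+1)*b(i+1)
 ENDDO
 ! remainder: the unpaired last element when N is odd
 IF (MOD(N,2)==1) c=c+a(N)*b(N)
 c=c+tmp
END SUBROUTINE
```

For even N this performs the same additions as S32; for odd N it adds exactly
one extra product after the loop.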


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621



* [Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
From: pinskia at gcc dot gnu dot org @ 2006-01-06 14:07 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #3 from pinskia at gcc dot gnu dot org  2006-01-06 14:07 -------
Confirmed.


-- 

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
             Status|UNCONFIRMED                 |NEW
     Ever Confirmed|0                           |1
   Last reconfirmed|0000-00-00 00:00:00         |2006-01-06 14:07:04
               date|                            |
            Summary|Missed optimisation         |Missed optimization when
                   |                            |unrolling the loop
                   |                            |(splitting up the sum) (only
                   |                            |with -ffast-math)


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621



* [Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
From: jv244 at cam dot ac dot uk @ 2007-07-03 19:30 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #4 from jv244 at cam dot ac dot uk  2007-07-03 19:30 -------
Now I get the same timings for the hand-optimised loop and the compiled loop if
I use the options:

gfortran -O3 -ffast-math -ftree-vectorize -march=native  -funroll-loops
-fvariable-expansion-in-unroller test.f90

Whereas -funroll-loops is commonly added, -fvariable-expansion-in-unroller is
not. Could there be a heuristic that switches it on by default when
-funroll-loops (and -ffast-math) is given? For S31 the timings are:

> gfortran -O3 -ffast-math -ftree-vectorize -march=native  -funroll-loops test.f90
> time ./a.out
real    0m6.618s

> gfortran -O3 -ffast-math -ftree-vectorize -march=native  -funroll-loops -fvariable-expansion-in-unroller test.f90
> time ./a.out
real    0m4.457s

so S31 runs roughly 1.5 times faster (about a third less run time).


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621


* [Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
From: eres at il dot ibm dot com @ 2007-07-04  8:58 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #5 from eres at il dot ibm dot com  2007-07-04 08:57 -------
You can also try to tune --param max-variable-expansions-in-unroller. The
default is to add one expansion (which seems to be the most helpful due to the
fact that adding more expansions can increase register pressure).
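
As a source-level picture of what more expansions would amount to, assuming
each expansion corresponds to one additional copy of the accumulator, here is a
hand-written version with four partial sums. The name S34 is hypothetical, and
this is not the code the unroller itself produces; every extra partial sum
occupies another register for the whole loop, which is the pressure referred to
above.

```fortran
! Hypothetical hand-expanded dot product with four partial sums.
SUBROUTINE S34(a,b,c,N)
 IMPLICIT NONE
 integer :: N
 double precision :: a(N),b(N),c,t1,t2,t3
 integer :: i,M
 c=0.0D0
 t1=0.0D0
 t2=0.0D0
 t3=0.0D0
 ! M is the largest multiple of 4 not exceeding N
 M=N-MOD(N,4)
 DO i=1,M,4
    c=c+a(i)*b(i)
    t1=t1+a(i+1)*b(i+1)
    t2=t2+a(i+2)*b(i+2)
    t3=t3+a(i+3)*b(i+3)
 ENDDO
 ! remainder loop for the last 0..3 elements
 DO i=M+1,N
    c=c+a(i)*b(i)
 ENDDO
 c=(c+t1)+(t2+t3)
END SUBROUTINE
```

Whether four partial sums are faster than two depends on the target; no
measurement is implied here.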


-- 

eres at il dot ibm dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |eres at il dot ibm dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621


* [Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
From: jv244 at cam dot ac dot uk @ 2007-07-04  9:23 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #6 from jv244 at cam dot ac dot uk  2007-07-04 09:23 -------
(In reply to comment #5)
> You can also try to tune --param max-variable-expansions-in-unroller. The
> default is to add one expansion (which seems to be the most helpful due to the
> fact that adding more expansions can increase register pressure).
> 

There seems to be no effect from --param max-variable-expansions-in-unroller; I
get the same timings for all values.

I do notice that ifort is twice as fast as gfortran on the original loop on my
machine (Core 2):

> gfortran -O3 -ffast-math -ftree-vectorize -march=native  -funroll-loops -fvariable-expansion-in-unroller --param max-variable-expansions-in-unroller=4 pr25621.f90
> ./a.out
 default loop  0.868054000000000
 hand optimized loop  0.864054000000000

> ifort -xT -O3 pr25621.f90
pr25621.f90(32) : (col. 0) remark: LOOP WAS VECTORIZED.
pr25621.f90(33) : (col. 0) remark: LOOP WAS VECTORIZED.
pr25621.f90(9) : (col. 2) remark: LOOP WAS VECTORIZED.
> ./a.out
 default loop  0.440027000000000
 hand optimized loop  0.876055000000000

and it looks like ifort vectorizes the first loop (whereas gfortran does not,
reporting 'unsupported use in stmt'). As a reference:

> gfortran -O3 -ffast-math -ftree-vectorize -march=native  -funroll-loops pr25621.f90
> ./a.out
 default loop   1.29608100000000
 hand optimized loop  0.860054000000000

The code actually used for testing is:

! simple loop
! assume N is even
SUBROUTINE S31(a,b,c,N)
 IMPLICIT NONE
 integer :: N
 real*8  :: a(N),b(N),c
 integer :: i
 c=0.0D0
 DO i=1,N
   c=c+a(i)*b(i)
 ENDDO
END SUBROUTINE

! 'improved' loop
SUBROUTINE S32(a,b,c,N)
 IMPLICIT NONE
 integer :: N
 real*8  :: a(N),b(N),c,tmp
 integer :: i
 c=0.0D0
 tmp=0.0D0
 DO i=1,N,2
    c=c+a(i)*b(i)
    tmp=tmp+a(i+1)*b(i+1)
 ENDDO
 c=c+tmp
END SUBROUTINE

integer, parameter :: N=1024
real*8  :: a(N),b(N),c,tmp,t1,t2

a=0.0_8
b=0.0_8
DO i=1,2000000
   CALL S31(a,b,c,N)
ENDDO

CALL CPU_TIME(t1)
DO i=1,1000000
   CALL S31(a,b,c,N)
ENDDO
CALL CPU_TIME(t2)
write(6,*) "default loop", t2-t1
CALL CPU_TIME(t1)
DO i=1,1000000
   CALL S32(a,b,c,N)
ENDDO
CALL CPU_TIME(t2)
write(6,*) "hand optimized loop",t2-t1
END





-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621


* [Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
From: dorit at gcc dot gnu dot org @ 2007-07-04 11:14 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #7 from dorit at gcc dot gnu dot org  2007-07-04 11:14 -------
The vectorizer reports:
pr25621.f90:7: note: reduction used in loop.
pr25621.f90:7: note: Unknown def-use cycle pattern.

because of the seemingly redundant assignment:
c__lsm.63_30 = D.1361_38;
which uses the reduction variable D.1361_38 inside the loop (the result is only
needed outside the loop). We need to teach the vectorizer to ignore this
assignment, or clean it away before the vectorizer runs.

<bb 4>:
  # prephitmp.57_5 = PHI <storetmp.55_34(3), D.1361_38(5)>
  # i_3 = PHI <1(3), i_40(5)>
  D.1357_31 = i_3 + -1;
  D.1358_33 = (*a_32(D))[D.1357_31];
  D.1359_36 = (*b_35(D))[D.1357_31];
  D.1360_37 = D.1359_36 * D.1358_33;
  D.1361_38 = prephitmp.57_5 + D.1360_37;
  c__lsm.63_30 = D.1361_38;
  i_40 = i_3 + 1;
  if (i_3 == D.1339_28)
    goto <bb 6>;
  else
    goto <bb 5>;
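
At the source level, the extra assignment in the dump above corresponds to the
result being kept up to date in the dummy argument c while the loop runs. A
rewrite that keeps the reduction in a local scalar and writes c once after the
loop is sketched below. The names S31_LOCAL and acc are hypothetical, and
whether this changes what the vectorizer reports on any given GCC version has
not been checked here.

```fortran
! Hypothetical variant of S31: reduce into a local, store to c only once.
SUBROUTINE S31_LOCAL(a,b,c,N)
 IMPLICIT NONE
 integer :: N
 double precision :: a(N),b(N),c,acc
 integer :: i
 ! acc is local, so nothing outside the routine can observe it mid-loop
 acc=0.0D0
 DO i=1,N
   acc=acc+a(i)*b(i)
 ENDDO
 ! single store to the dummy argument after the reduction is complete
 c=acc
END SUBROUTINE
```

The additions happen in the same order as in S31, so the result is unchanged.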


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621


* [Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
From: eres at il dot ibm dot com @ 2007-07-04 11:24 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #8 from eres at il dot ibm dot com  2007-07-04 11:24 -------
I think c__lsm.63_30 is created during the store motion optimization.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621


* [Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
From: dorit at gcc dot gnu dot org @ 2007-08-14 20:17 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #9 from dorit at gcc dot gnu dot org  2007-08-14 20:17 -------
PR32824 discusses a similar issue.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621


* [Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
From: jv244 at cam dot ac dot uk @ 2008-12-05 16:27 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #10 from jv244 at cam dot ac dot uk  2008-12-05 16:25 -------
Timings in 4.4 are essentially unchanged

gfortran -O3 -ffast-math -march=native PR25621.f90:

 default loop   1.2920810000000000
 hand optimized loop  0.86405399999999988

Funnily enough, the timings are inverted with a recent Intel compiler:

 default loop  0.440028000000000
 hand optimized loop   1.26007800000000


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621


* [Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
From: jv244 at cam dot ac dot uk @ 2010-04-27 18:25 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #11 from jv244 at cam dot ac dot uk  2010-04-27 18:25 -------
The original loop now (4.6.0) gets vectorized and reaches the same performance
as the 'hand optimized loop' (which does not get vectorized):

> ./a.out
 default loop  0.88005500000000003
 hand optimized loop  0.86005399999999987

it is still not quite as fast as the ifort code:

ifort -fno-inline -O3 -xT -static t.f90
> ~/a.out
 default loop  0.444028000000000
 hand optimized loop  0.964060000000000

ifort's asm looks good:

# -- Begin  s31_
# mark_begin;
       .align    16,0x90
        .globl s31_
s31_:
# parameter 1: %rdi
# parameter 2: %rsi
# parameter 3: %rdx
# parameter 4: %rcx
..B2.1:                         # Preds ..B2.0
..___tag_value_s31_.10:                                         #3.12
        xorps     %xmm1, %xmm1                                  #9.2
        movaps    %xmm1, %xmm0                                  #9.2
        xorl      %eax, %eax                                    #9.2
                                # LOE rax rdx rbx rbp rsi rdi r12 r13 r14 r15 xmm0 xmm1
..B2.2:                         # Preds ..B2.2 ..B2.1
        movaps    (%rdi,%rax,8), %xmm2                          #10.8
        movaps    16(%rdi,%rax,8), %xmm3                        #10.8
        movaps    32(%rdi,%rax,8), %xmm4                        #10.8
        movaps    48(%rdi,%rax,8), %xmm5                        #10.8
        mulpd     (%rsi,%rax,8), %xmm2                          #10.12
        mulpd     16(%rsi,%rax,8), %xmm3                        #10.12
        mulpd     32(%rsi,%rax,8), %xmm4                        #10.12
        mulpd     48(%rsi,%rax,8), %xmm5                        #10.12
        addpd     %xmm2, %xmm0                                  #10.4
        addq      $8, %rax                                      #9.2
        cmpq      $1024, %rax                                   #9.2
        addpd     %xmm3, %xmm1                                  #10.4
        addpd     %xmm4, %xmm0                                  #10.4
        addpd     %xmm5, %xmm1                                  #10.4
        jb        ..B2.2        # Prob 82%                      #9.2
                                # LOE rax rdx rbx rbp rsi rdi r12 r13 r14 r15 xmm0 xmm1
..B2.3:                         # Preds ..B2.2
        addpd     %xmm1, %xmm0                                  #9.2
        haddpd    %xmm0, %xmm0                                  #9.2
        movsd     %xmm0, (%rdx)                                 #10.4
        ret                                                     #12.1
        .align    16,0x90
..___tag_value_s31_.11:                                         #

while gcc has more complicated-looking asm:
.globl s31_
        .type   s31_, @function
s31_:
.LFB0:
        movl    (%rcx), %r9d
        movq    $0, (%rdx)
        testl   %r9d, %r9d
        jle     .L9
        movl    %r9d, %r8d
        shrl    %r8d
        cmpl    $4, %r9d
        leal    (%r8,%r8), %r10d
        jbe     .L15
        testl   %r10d, %r10d
        je      .L15
        xorl    %eax, %eax
        xorl    %ecx, %ecx
        xorpd   %xmm1, %xmm1
        .p2align 4,,10
        .p2align 3
.L12:
        movsd   (%rsi,%rax), %xmm2
        movsd   (%rdi,%rax), %xmm3
        movhpd  8(%rsi,%rax), %xmm2
        movhpd  8(%rdi,%rax), %xmm3
        movapd  %xmm2, %xmm0
        incl    %ecx
        mulpd   %xmm3, %xmm0
        addq    $16, %rax
        addpd   %xmm0, %xmm1
        cmpl    %ecx, %r8d
        ja      .L12
        haddpd  %xmm1, %xmm1
        leal    1(%r10), %eax
        cmpl    %r9d, %r10d
        je      .L13
.L11:
        movslq  %eax, %rcx
        subl    %eax, %r9d
        leaq    -8(,%rcx,8), %rcx
        xorl    %eax, %eax
        addq    %rcx, %rdi
        addq    %rcx, %rsi
        leaq    8(,%r9,8), %rcx
        .p2align 4,,10
        .p2align 3
.L14:
        movsd   (%rsi), %xmm0
        addq    $8, %rax
        mulsd   (%rdi), %xmm0
        addq    $8, %rsi
        addq    $8, %rdi
        addsd   %xmm0, %xmm1
        cmpq    %rcx, %rax
        jne     .L14
.L13:
        movsd   %xmm1, (%rdx)
.L9:
        rep
        ret
.L15:
        xorpd   %xmm1, %xmm1
        movl    $1, %eax
        jmp     .L11
.LFE0:
        .size   s31_, .-s31_


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621


* [Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
From: Joost.VandeVondele at mat dot ethz.ch @ 2013-03-29 10:07 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621

Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |Joost.VandeVondele at mat
                   |                            |dot ethz.ch
         Depends on|                            |53947

--- Comment #12 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2013-03-29 10:07:06 UTC ---
This has become much more of a vectorizer problem. ifort generates code that is
about 1.5 times as fast for routine S31 of the initial comment (timings below).
Given that this is a common dot product, it might be good to see why that
happens. Both compilers fail to notice that S32 is basically the same code,
hand-unrolled.
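
Since the loop is an ordinary dot product, a third variant built on the
standard DOT_PRODUCT intrinsic could serve as a reference point when comparing
the two routines. The name S33 is hypothetical, and no timings are claimed for
it.

```fortran
! Hypothetical reference variant using the DOT_PRODUCT intrinsic.
SUBROUTINE S33(a,b,c,N)
 IMPLICIT NONE
 integer :: N
 double precision :: a(N),b(N),c
 ! the intrinsic leaves the summation order to the compiler and its runtime
 c=DOT_PRODUCT(a,b)
END SUBROUTINE
```

It can be timed by adding a third loop that calls S33 to the test program of
comment #6, alongside the loops for S31 and S32.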

Tested with the code in comment #6 (without inlining)

> gfortran -march=native -ffast-math -O3 -fno-inline PR25621.f90
> ./a.out
 default loop  0.56491500000000006     
 hand optimized loop  0.74488600000000016     
> ifort -xHost -O3 -fno-inline PR25621.f90
> ./a.out
 default loop  0.377943000000000     
 hand optimized loop  0.579911000000000


* [Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
From: Joost.VandeVondele at mat dot ethz.ch @ 2014-03-16 15:54 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621

--- Comment #13 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> ---
(In reply to Joost VandeVondele from comment #12)
> Both compilers fail to notice that S32 is basically the same code
> hand-unrolled.

with gcc 4.9

> ./a.out
 default loop  0.54291800000000001     
 hand optimized loop  0.54291700000000009     

So, some progress: both versions of the loop now give the same performance.
Still not quite as good as ifort, however.


* [Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
From: iliyapalachev at gmail dot com @ 2014-03-24 11:02 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621

Ilya Palachev <iliyapalachev at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |iliyapalachev at gmail dot com

--- Comment #14 from Ilya Palachev <iliyapalachev at gmail dot com> ---
(In reply to Joost VandeVondele from comment #13)

The page http://gcc.gnu.org/wiki/VectorizationTasks

states that the generalization of reduction support
(http://gcc.gnu.org/ml/gcc-patches/2006-04/msg00172.html) can help to fix this
bug.

Is this information still correct for gcc-4.9?


* [Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
From: rguenth at gcc dot gnu.org @ 2023-09-23 21:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #16 from Richard Biener <rguenth at gcc dot gnu.org> ---
We are generating the same vectorized loop for S31 and S32 now.
