* Difference in assembly code generated (gfortran vs ifort) (optimization flags missing?)
From: Laércio LIMA PILLA @ 2018-04-12 13:56 UTC
To: fortran

Dear all,

TL;DR version: I have been noticing very extreme performance differences
(up to a factor of 3) between ifort and gfortran.
As I checked the assembly code, I noticed that the compilers are using
different instructions (e.g., ifort uses 'vbroadcastsd').
Am I missing any special optimization flags (besides -march and -mtune
native) or is this expected?

Original version:

I have been working on the optimization of a Fortran 95 [and over]
application that makes use of several matrix-vector multiplication kernels.
As the sizes of the matrices are well-known, the original developers
generated different kernels for different sizes.
An example for size 4 is given below.

USE ISO_C_BINDING
!...
subroutine mv_mult_4_4(mat,vec,res)
    REAL(C_DOUBLE), INTENT(IN), DIMENSION(4,4) :: mat
    REAL(C_DOUBLE), INTENT(IN), DIMENSION(4) :: vec
    REAL(C_DOUBLE), INTENT(OUT), DIMENSION(4) :: res
    INTEGER(C_INT) :: iRow, iCol

    res = 0.0

    do iCol=1,4
        do iRow=1,4
            res(iRow) = res(iRow) + mat(iRow,iCol)*vec(iCol)
        end do
    end do

end subroutine mv_mult_4_4

I have been noticing very significant performance differences in my tests
with gfortran, ifort, and different optimizations on my local system.
On the special case for a 20x20 matrix, ifort provides a code that reduces
the execution time by a factor of 3 for the same optimization flags.
I started checking the assembly code generated by the different compilers
and noticed some differences.
For the code snippet above, the assembly versions from ifort and gfortran
are presented below.
We can notice that ifort is using some instructions (vbroadcastsd) that are
not used by gfortran even though I am telling the compiler the specific
architecture of my processor.
As the general users of the application use gfortran, I would like to know:

1) Is this difference in instructions used expected?
2) Am I missing any additional optimization flag (besides -march and
-mtune) that could change that?
3) Are there any directives (besides OpenMP ones) that could help in this
case?

Assembly:
CPU: Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz

ifort w/ -O3 -march=native -mtune=native -autodouble -S:
# mark_description "Intel(R) Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 17.0.3.191 Build 2017";
# mark_description "0404";
# mark_description "-O3 -march=native -mtune=native -autodouble -S";
# -- Begin mv_mult_4_4_
        .text
# mark_begin;
        .align 16,0x90
        .globl mv_mult_4_4_
mv_mult_4_4_:
# parameter 1: %rdi
# parameter 2: %rsi
# parameter 3: %rdx
#...
        vbroadcastsd (%rsi), %ymm0                  #157.50
        vbroadcastsd 8(%rsi), %ymm2                 #157.50
        vbroadcastsd 16(%rsi), %ymm3                #157.50
        vbroadcastsd 24(%rsi), %ymm4                #157.50
        vmulpd (%rdi), %ymm0, %ymm1                 #157.11
        vfmadd132pd 32(%rdi), %ymm1, %ymm2          #157.11
        vfmadd132pd 64(%rdi), %ymm2, %ymm3          #157.11
        vfmadd132pd 96(%rdi), %ymm3, %ymm4          #157.11
        vmovupd %ymm4, (%rdx)                       #157.11
        vzeroupper                                  #165.3
        ret                                         #165.3
        .align 16,0x90
# LOE
        .cfi_endproc
# mark_end;

---

gfortran (GNU Fortran (Ubuntu 7.2.0-1ubuntu1~16.04) 7.2.0) w/ -O3
-march=native -mtune=native -fdefault-double-8 -fdefault-real-8 -S:
        .p2align 4,,15
        .globl __mv_mult_4_4
        .type __mv_mult_4_4, @function
__mv_mult_4_4:
.LFB12:
        .cfi_startproc
        vpxor %xmm0, %xmm0, %xmm0
        vmovups %xmm0, (%rdx)
        vmovups %xmm0, 16(%rdx)
        vmovupd (%rsi), %ymm0
        vmovupd (%rdx), %ymm4
        vpermpd $0, %ymm0, %ymm3
        vfmadd132pd (%rdi), %ymm4, %ymm3
        vpermpd $85, %ymm0, %ymm2
        vfmadd132pd 32(%rdi), %ymm3, %ymm2
        vpermpd $170, %ymm0, %ymm1
        vpermpd $255, %ymm0, %ymm0
        vfmadd132pd 64(%rdi), %ymm2, %ymm1
        vfmadd132pd 96(%rdi), %ymm1, %ymm0
        vmovupd %ymm0, (%rdx)
        vzeroupper
        ret
        .cfi_endproc
.LFE12:
        .size __mv_mult_4_4, .-__mv_mult_4_4
        .p2align 4,,15

---

Best regards,

Laércio LIMA PILLA
Postdoctoral Researcher @ Inria Grenoble - Rhône-Alpes, CORSE project-team
Associate Professor @ UFSC, Brazil
* Re: Difference in assembly code generated (gfortran vs ifort) (optimization flags missing?)
From: Richard Biener @ 2018-04-12 14:00 UTC
To: Laércio LIMA PILLA; +Cc: fortran

On Thu, Apr 12, 2018 at 3:55 PM, Laércio LIMA PILLA
<laercio.lima@inria.fr> wrote:
> Dear all,
>
> TL;DR version: I have been noticing very extreme performance differences
> (up to a factor of 3) between ifort and gfortran.
> As I checked the assembly code, I noticed that the compilers are using
> different instructions (e.g., ifort uses 'vbroadcastsd').
> Am I missing any special optimization flags (besides -march and -mtune
> native) or is this expected?
>
> Original version:
>
> I have been working on the optimization of a Fortran 95 [and over]
> application that makes use of several matrix-vector multiplication kernels.
> As the sizes of the matrices are well-known, the original developers
> generated different kernels for different sizes.
> An example for size 4 is given below.
>
> USE ISO_C_BINDING
> !...
> subroutine mv_mult_4_4(mat,vec,res)
>     REAL(C_DOUBLE), INTENT(IN), DIMENSION(4,4) :: mat
>     REAL(C_DOUBLE), INTENT(IN), DIMENSION(4) :: vec
>     REAL(C_DOUBLE), INTENT(OUT), DIMENSION(4) :: res
>     INTEGER(C_INT) :: iRow, iCol
>
>     res = 0.0
>
>     do iCol=1,4
>         do iRow=1,4
>             res(iRow) = res(iRow) + mat(iRow,iCol)*vec(iCol)
>         end do
>     end do
>
> end subroutine mv_mult_4_4

This doesn't seem complete as it doesn't compile for me...

> I have been noticing very significant performance differences in my tests
> with gfortran, ifort, and different optimizations on my local system.
> On the special case for a 20x20 matrix, ifort provides a code that reduces
> the execution time by a factor of 3 for the same optimization flags.
> I started checking the assembly code generated by the different compilers
> and noticed some differences.
> For the code snippet above, the assembly versions from ifort and gfortran
> are presented below.
> We can notice that ifort is using some instructions (vbroadcastsd) that are
> not used by gfortran even though I am telling the compiler the specific
> architecture of my processor.
> As the general users of the application use gfortran, I would like to know:
>
> 1) Is this difference in instructions used expected?
> 2) Am I missing any additional optimization flag (besides -march and
> -mtune) that could change that?
> 3) Are there any directives (besides OpenMP ones) that could help in this
> case?

It looks like ifort does loop vectorization on the inner loop while GCC
most certainly unrolls that fully and vectorizes the outer loop, which in
turn requires all the shuffling.  You can see if -fdisable-tree-cunrolli
solves this (just for debugging!).

Richard.
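Richard's distinction between vectorizing the inner loop and vectorizing the outer loop can be made concrete by writing the inner iRow loop as a single array operation: each step scales one whole column of mat by one scalar from vec, which is the access pattern the vbroadcastsd + vfmadd sequence in the ifort listing implements. The following is only a sketch under assumptions: the module name example_cols and the routine name mv_mult_4_4_cols are hypothetical, and nobody in the thread tested whether GCC 7.2 generates different code for this form than for the nested loops.

```fortran
module example_cols
    use iso_c_binding, only: c_double, c_int
    implicit none
contains
    ! Hypothetical column-wise variant of mv_mult_4_4 (not from the thread).
    ! The inner iRow loop is expressed as one array statement per column.
    subroutine mv_mult_4_4_cols(mat, vec, res)
        real(c_double), intent(in),  dimension(4,4) :: mat
        real(c_double), intent(in),  dimension(4)   :: vec
        real(c_double), intent(out), dimension(4)   :: res
        integer(c_int) :: iCol

        res = 0.0_c_double
        do iCol = 1, 4
            ! One scalar of vec scales a full (contiguous) column of mat
            res = res + mat(:,iCol) * vec(iCol)
        end do
    end subroutine mv_mult_4_4_cols
end module example_cols
```

Compiling this with the flags already used in the thread (-O3 -march=native -mtune=native -S) would show whether the shuffles (vpermpd) disappear for this formulation.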
* Re: Difference in assembly code generated (gfortran vs ifort) (optimization flags missing?)
From: Steve Kargl @ 2018-04-12 14:48 UTC
To: Richard Biener; +Cc: Laércio LIMA PILLA, fortran

On Thu, Apr 12, 2018 at 04:00:46PM +0200, Richard Biener wrote:
> >
> > USE ISO_C_BINDING

Move the above statement to ...

> > !...
> > subroutine mv_mult_4_4(mat,vec,res)

here.

> >     REAL(C_DOUBLE), INTENT(IN), DIMENSION(4,4) :: mat
> >     REAL(C_DOUBLE), INTENT(IN), DIMENSION(4) :: vec
> >     REAL(C_DOUBLE), INTENT(OUT), DIMENSION(4) :: res
> >     INTEGER(C_INT) :: iRow, iCol
> >
> >     res = 0.0
> >
> >     do iCol=1,4
> >         do iRow=1,4
> >             res(iRow) = res(iRow) + mat(iRow,iCol)*vec(iCol)
> >         end do
> >     end do
> >
> > end subroutine mv_mult_4_4
>
> This doesn't seem complete as it doesn't compile for me...

See above.

It would also be interesting to see the result of replacing
the loops with

    res = matmul(mat, vec)

as tkoenig (and jerryd?) works on optimizing matmul.

--
Steve
20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4
20161221 https://www.youtube.com/watch?v=IbCHE-hONow
* Re: Difference in assembly code generated (gfortran vs ifort) (optimization flags missing?)
From: Thomas Koenig @ 2018-04-12 20:10 UTC
To: sgk, Richard Biener; +Cc: Laércio LIMA PILLA, fortran

Steve wrote:

> It would also be interesting to see the result of replacing
> the loops with
>
>     res = matmul(mat, vec)
>
> as tkoenig (and jerryd?) works on optimizing matmul.

For a 4*4 matrix and a 4 vector, with optimization, the code
will be inlined to equivalent DO loops. It should result in
the same speed as the explicit DO loops.

Regards

    Thomas
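Steve's suggestion amounts to replacing the two nested DO loops with the MATMUL intrinsic. A minimal sketch of the kernel in that form is shown here for reference; the module name example_matmul and the routine name mv_mult_4_4_matmul are hypothetical, and the USE statement sits inside the module as Steve pointed out it must.

```fortran
module example_matmul
    use iso_c_binding, only: c_double
    implicit none
contains
    ! Hypothetical variant of mv_mult_4_4 using the MATMUL intrinsic
    ! instead of explicit DO loops (names are not from the thread).
    subroutine mv_mult_4_4_matmul(mat, vec, res)
        real(c_double), intent(in),  dimension(4,4) :: mat
        real(c_double), intent(in),  dimension(4)   :: vec
        real(c_double), intent(out), dimension(4)   :: res

        ! No explicit zeroing needed: the assignment defines all of res
        res = matmul(mat, vec)
    end subroutine mv_mult_4_4_matmul
end module example_matmul
```

Per Thomas's reply, with optimization enabled this small case should be inlined to equivalent DO loops, so comparing its assembly (-S) against the original listing would be the way to confirm that on a given gfortran version.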
* Re: Difference in assembly code generated (gfortran vs ifort) (optimization flags missing?)
From: Laércio LIMA PILLA @ 2018-04-12 15:00 UTC
To: Richard Biener; +Cc: fortran

Thank you for the quick reply.

2018-04-12 16:00 GMT+02:00 Richard Biener <richard.guenther@gmail.com>:

> On Thu, Apr 12, 2018 at 3:55 PM, Laércio LIMA PILLA
> <laercio.lima@inria.fr> wrote:
> > Dear all,
> >
> > TL;DR version: I have been noticing very extreme performance differences
> > (up to a factor of 3) between ifort and gfortran.
> > As I checked the assembly code, I noticed that the compilers are using
> > different instructions (e.g., ifort uses 'vbroadcastsd').
> > Am I missing any special optimization flags (besides -march and -mtune
> > native) or is this expected?
> >
> > Original version:
> >
> > I have been working on the optimization of a Fortran 95 [and over]
> > application that makes use of several matrix-vector multiplication
> > kernels.
> > As the sizes of the matrices are well-known, the original developers
> > generated different kernels for different sizes.
> > An example for size 4 is given below.
> >
> > USE ISO_C_BINDING
> > !...
> > subroutine mv_mult_4_4(mat,vec,res)
> >     REAL(C_DOUBLE), INTENT(IN), DIMENSION(4,4) :: mat
> >     REAL(C_DOUBLE), INTENT(IN), DIMENSION(4) :: vec
> >     REAL(C_DOUBLE), INTENT(OUT), DIMENSION(4) :: res
> >     INTEGER(C_INT) :: iRow, iCol
> >
> >     res = 0.0
> >
> >     do iCol=1,4
> >         do iRow=1,4
> >             res(iRow) = res(iRow) + mat(iRow,iCol)*vec(iCol)
> >         end do
> >     end do
> >
> > end subroutine mv_mult_4_4
>
> This doesn't seem complete as it doesn't compile for me...

Yes. My fault. When I took this part out of the code, I forgot to add the
module information.
Here is a more complete version:

module example
    USE ISO_C_BINDING
contains

subroutine mv_mult_4_4(mat,vec,res)
    REAL(C_DOUBLE), INTENT(IN), DIMENSION(4,4) :: mat
    REAL(C_DOUBLE), INTENT(IN), DIMENSION(4) :: vec
    REAL(C_DOUBLE), INTENT(OUT), DIMENSION(4) :: res
    INTEGER(C_INT) :: iRow, iCol

    res = 0.0

    do iCol=1,4
        do iRow=1,4
            res(iRow) = res(iRow) + mat(iRow,iCol)*vec(iCol)
        end do
    end do

end subroutine mv_mult_4_4

end module example

> > I have been noticing very significant performance differences in my tests
> > with gfortran, ifort, and different optimizations on my local system.
> > On the special case for a 20x20 matrix, ifort provides a code that
> > reduces the execution time by a factor of 3 for the same optimization
> > flags.
> > I started checking the assembly code generated by the different compilers
> > and noticed some differences.
> > For the code snippet above, the assembly versions from ifort and gfortran
> > are presented below.
> > We can notice that ifort is using some instructions (vbroadcastsd) that
> > are not used by gfortran even though I am telling the compiler the
> > specific architecture of my processor.
> > As the general users of the application use gfortran, I would like to
> > know:
> >
> > 1) Is this difference in instructions used expected?
> > 2) Am I missing any additional optimization flag (besides -march and
> > -mtune) that could change that?
> > 3) Are there any directives (besides OpenMP ones) that could help in this
> > case?
>
> It looks like ifort does loop vectorization on the inner loop while GCC
> most certainly unrolls that fully and vectorizes the outer loop, which in
> turn requires all the shuffling.  You can see if -fdisable-tree-cunrolli
> solves this (just for debugging!).

I took your suggestion into account and added that flag.
The result is better code that even includes vbroadcastsd:

        .file "example.f90"
        .text
        .p2align 4,,15
        .globl __example_MOD_mv_mult_4_4
        .type __example_MOD_mv_mult_4_4, @function
__example_MOD_mv_mult_4_4:
.LFB0:
        .cfi_startproc
        vpxor %xmm0, %xmm0, %xmm0
        vbroadcastsd (%rsi), %ymm1
        vmovups %xmm0, (%rdx)
        vmovups %xmm0, 16(%rdx)
        vmovupd (%rdi), %ymm0
        vfmadd213pd (%rdx), %ymm1, %ymm0
        vbroadcastsd 8(%rsi), %ymm1
        vfmadd132pd 32(%rdi), %ymm0, %ymm1
        vbroadcastsd 16(%rsi), %ymm0
        vfmadd231pd 64(%rdi), %ymm0, %ymm1
        vbroadcastsd 24(%rsi), %ymm0
        vfmadd132pd 96(%rdi), %ymm1, %ymm0
        vmovupd %ymm0, (%rdx)
        vzeroupper
        ret
        .cfi_endproc
.LFE0:
        .size __example_MOD_mv_mult_4_4, .-__example_MOD_mv_mult_4_4
        .ident "GCC: (Ubuntu 7.2.0-1ubuntu1~16.04) 7.2.0"
        .section .note.GNU-stack,"",@progbits

I also experimented with loop permutation, which also led to better
assembly:

        .file "example.f90"
        .text
        .p2align 4,,15
        .globl __example_MOD_mv_mult_4_4
        .type __example_MOD_mv_mult_4_4, @function
__example_MOD_mv_mult_4_4:
.LFB0:
        .cfi_startproc
        vpxor %xmm0, %xmm0, %xmm0
        vbroadcastsd (%rsi), %ymm3
        vbroadcastsd 8(%rsi), %ymm2
        vmovups %xmm0, (%rdx)
        vbroadcastsd 16(%rsi), %ymm1
        vmovups %xmm0, 16(%rdx)
        vmovupd (%rdx), %ymm4
        vfmadd132pd (%rdi), %ymm4, %ymm3
        vfmadd132pd 32(%rdi), %ymm3, %ymm2
        vbroadcastsd 24(%rsi), %ymm0
        vfmadd132pd 64(%rdi), %ymm2, %ymm1
        vfmadd132pd 96(%rdi), %ymm1, %ymm0
        vmovupd %ymm0, (%rdx)
        vzeroupper
        ret
        .cfi_endproc
.LFE0:
        .size __example_MOD_mv_mult_4_4, .-__example_MOD_mv_mult_4_4
        .ident "GCC: (Ubuntu 7.2.0-1ubuntu1~16.04) 7.2.0"
        .section .note.GNU-stack,"",@progbits

Still, this does not seem to improve the code for the situation with a
20x20 matrix.
I will try some more things in the next few days.

Best regards,

Laércio LIMA PILLA
Postdoctoral Researcher @ Inria Grenoble - Rhône-Alpes, CORSE project-team
Associate Professor @ UFSC, Brazil
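The message above mentions loop permutation but does not include the permuted source. One plausible reading, shown here purely as an assumption, is swapping the two DO loops so that iRow becomes the outer loop and iCol the inner one. The module name example_permuted and the routine name mv_mult_4_4_perm are hypothetical, and this sketch is not claimed to be the code that produced the second listing.

```fortran
module example_permuted
    use iso_c_binding, only: c_double, c_int
    implicit none
contains
    ! Hypothetical loop-interchanged variant of mv_mult_4_4.
    ! Assumption: "loop permutation" means swapping the iCol/iRow nesting.
    subroutine mv_mult_4_4_perm(mat, vec, res)
        real(c_double), intent(in),  dimension(4,4) :: mat
        real(c_double), intent(in),  dimension(4)   :: vec
        real(c_double), intent(out), dimension(4)   :: res
        integer(c_int) :: iRow, iCol

        res = 0.0_c_double
        do iRow = 1, 4
            ! Each res(iRow) accumulates a dot product of row iRow with vec
            do iCol = 1, 4
                res(iRow) = res(iRow) + mat(iRow,iCol)*vec(iCol)
            end do
        end do
    end subroutine mv_mult_4_4_perm
end module example_permuted
```

For each iRow the terms are accumulated in the same iCol order as in the original nest, so the two versions are expected to agree numerically; only the iteration order the compiler starts from changes.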
Thread overview: 5+ messages (end of thread, newest: 2018-04-12 20:10 UTC)
2018-04-12 13:56 Difference in assembly code generated (gfortran vs ifort) (optimization flags missing?) Laércio LIMA PILLA
2018-04-12 14:00 ` Richard Biener
2018-04-12 14:48   ` Steve Kargl
2018-04-12 20:10     ` Thomas Koenig
2018-04-12 15:00   ` Laércio LIMA PILLA