public inbox for fortran@gcc.gnu.org
* Difference in assembly code generated (gfortran vs ifort) (optimization flags missing?)
@ 2018-04-12 13:56 Laércio LIMA PILLA
From: Laércio LIMA PILLA @ 2018-04-12 13:56 UTC (permalink / raw)
  To: fortran

Dear all,

TL;DR version: I have been seeing large performance differences
(up to a factor of 3) between ifort and gfortran.
When I checked the assembly code, I noticed that the compilers use
different instructions (e.g., ifort uses 'vbroadcastsd').
Am I missing any optimization flags (besides -march=native and
-mtune=native), or is this expected?

Original version:

I have been working on the optimization of a Fortran 95 (and later)
application that makes use of several matrix-vector multiplication kernels.
As the sizes of the matrices are known in advance, the original developers
generated a separate kernel for each size.
An example for size 4 is given below.

  USE ISO_C_BINDING
  !...
  subroutine mv_mult_4_4(mat,vec,res)
    REAL(C_DOUBLE), INTENT(IN),  DIMENSION(4,4) :: mat
    REAL(C_DOUBLE), INTENT(IN),  DIMENSION(4)   :: vec
    REAL(C_DOUBLE), INTENT(OUT), DIMENSION(4)   :: res
    INTEGER(C_INT) :: iRow, iCol

    res = 0.0

    do iCol=1,4
       do iRow=1,4
          res(iRow) = res(iRow) + mat(iRow,iCol)*vec(iCol)
       end do
    end do

  end subroutine mv_mult_4_4
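
For anyone who wants to compile the kernel in isolation, a minimal driver along the following lines may be useful. It is only a sketch: the module name mv_kernels, the program name check_mv_mult, and the test data are assumptions made for illustration and are not part of the original application. It wraps the kernel above in a module and compares its result against the intrinsic MATMUL.

```fortran
! Sketch only: mv_kernels, check_mv_mult and the test data are
! illustrative assumptions, not taken from the original application.
module mv_kernels
  use iso_c_binding, only: c_double, c_int
  implicit none
contains
  subroutine mv_mult_4_4(mat, vec, res)
    real(c_double), intent(in),  dimension(4,4) :: mat
    real(c_double), intent(in),  dimension(4)   :: vec
    real(c_double), intent(out), dimension(4)   :: res
    integer(c_int) :: iRow, iCol

    res = 0.0_c_double

    ! Column-major traversal: the inner loop walks down one column of mat.
    do iCol = 1, 4
       do iRow = 1, 4
          res(iRow) = res(iRow) + mat(iRow,iCol)*vec(iCol)
       end do
    end do
  end subroutine mv_mult_4_4
end module mv_kernels

program check_mv_mult
  use iso_c_binding, only: c_double
  use mv_kernels, only: mv_mult_4_4
  implicit none
  real(c_double) :: mat(4,4), vec(4), res(4), ref(4)
  integer :: i, j

  ! Small integer values are exactly representable, so the hand-written
  ! kernel and MATMUL should agree regardless of FMA or summation order.
  do j = 1, 4
     do i = 1, 4
        mat(i,j) = real(i + 4*(j-1), c_double)
     end do
     vec(j) = real(j, c_double)
  end do

  call mv_mult_4_4(mat, vec, res)
  ref = matmul(mat, vec)

  if (maxval(abs(res - ref)) > 1.0e-12_c_double) error stop "mismatch"
  print *, "ok"
end program check_mv_mult
```

Compiling this file with the flags listed in the Assembly section and adding -S should produce assembly for the kernel that can be compared with the listings there, although the symbol names will differ because of the assumed module name.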

I have been seeing significant performance differences in my tests
with gfortran, ifort, and different optimizations on my local system.
In the special case of a 20x20 matrix, ifort generates code that reduces
the execution time by a factor of 3 with the same optimization flags.
I started checking the assembly code generated by the two compilers
and noticed some differences.
For the code snippet above, the assembly versions from ifort and gfortran
are presented below.
Note that ifort uses some instructions (vbroadcastsd) that gfortran does
not use, even though I am telling the compiler the specific
architecture of my processor.
As the application's users generally use gfortran, I would like to know:

1) Is this difference in the instructions used expected?
2) Am I missing any additional optimization flags (besides -march and
-mtune) that could change that?
3) Are there any directives (besides OpenMP ones) that could help in this
case?
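
To make the 20x20 comparison reproducible, a timing harness of roughly the following shape could be compiled with both compilers. This is a sketch under stated assumptions: the names time_mv_mult, mv_mult_n, and nrep are hypothetical, the repetition count is arbitrary, and the kernel is a generic-size copy of the loop nest shown earlier rather than the application's actual 20x20 routine.

```fortran
! Timing sketch. All names (time_mv_mult, mv_mult_n, nrep) are
! illustrative assumptions and do not come from the original application.
program time_mv_mult
  use iso_c_binding, only: c_double
  use iso_fortran_env, only: int64
  implicit none
  integer, parameter :: n = 20          ! matrix size discussed in the text
  integer, parameter :: nrep = 1000000  ! arbitrary repetition count
  real(c_double) :: mat(n,n), vec(n), res(n), checksum
  integer(int64) :: t0, t1, rate
  integer :: k

  call random_number(mat)
  call random_number(vec)
  checksum = 0.0_c_double

  call system_clock(t0, rate)
  do k = 1, nrep
     call mv_mult_n(mat, vec, res)
     ! Feed part of the result back so each call depends on the previous one.
     vec(1) = res(1) * 1.0e-3_c_double
     checksum = checksum + res(n)
  end do
  call system_clock(t1)

  print '(a,f10.6,a)', 'elapsed: ', &
       real(t1 - t0, c_double) / real(rate, c_double), ' s'
  print '(a,es14.6)', 'checksum: ', checksum

contains

  ! Same loop nest as mv_mult_4_4, with the size taken from the parameter n.
  subroutine mv_mult_n(a, x, y)
    real(c_double), intent(in)  :: a(n,n), x(n)
    real(c_double), intent(out) :: y(n)
    integer :: iRow, iCol

    y = 0.0_c_double
    do iCol = 1, n
       do iRow = 1, n
          y(iRow) = y(iRow) + a(iRow,iCol)*x(iCol)
       end do
    end do
  end subroutine mv_mult_n

end program time_mv_mult
```

Because the kernel is an internal procedure here, the compiler is free to inline it, so the timings are only indicative of what a separately compiled kernel would show.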

Assembly:
CPU: Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz

ifort w/ -O3 -march=native -mtune=native -autodouble -S:
# mark_description "Intel(R) Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 17.0.3.191 Build 2017";
# mark_description "0404";
# mark_description "-O3 -march=native -mtune=native -autodouble -S";
# -- Begin  mv_mult_4_4_
.text
# mark_begin;
       .align    16,0x90
.globl mv_mult_4_4_
mv_mult_4_4_:
# parameter 1: %rdi
# parameter 2: %rsi
# parameter 3: %rdx
#...
        vbroadcastsd (%rsi), %ymm0                              #157.50
        vbroadcastsd 8(%rsi), %ymm2                             #157.50
        vbroadcastsd 16(%rsi), %ymm3                            #157.50
        vbroadcastsd 24(%rsi), %ymm4                            #157.50
        vmulpd    (%rdi), %ymm0, %ymm1                          #157.11
        vfmadd132pd 32(%rdi), %ymm1, %ymm2                      #157.11
        vfmadd132pd 64(%rdi), %ymm2, %ymm3                      #157.11
        vfmadd132pd 96(%rdi), %ymm3, %ymm4                      #157.11
        vmovupd   %ymm4, (%rdx)                                 #157.11
        vzeroupper                                              #165.3
        ret                                                     #165.3
        .align    16,0x90
                                # LOE
.cfi_endproc
# mark_end;

---

gfortran (GNU Fortran (Ubuntu 7.2.0-1ubuntu1~16.04) 7.2.0) w/ -O3
-march=native -mtune=native -fdefault-double-8 -fdefault-real-8 -S:
.p2align 4,,15
.globl __mv_mult_4_4
.type __mv_mult_4_4, @function
__mv_mult_4_4:
.LFB12:
.cfi_startproc
vpxor %xmm0, %xmm0, %xmm0
vmovups %xmm0, (%rdx)
vmovups %xmm0, 16(%rdx)
vmovupd (%rsi), %ymm0
vmovupd (%rdx), %ymm4
vpermpd $0, %ymm0, %ymm3
vfmadd132pd (%rdi), %ymm4, %ymm3
vpermpd $85, %ymm0, %ymm2
vfmadd132pd 32(%rdi), %ymm3, %ymm2
vpermpd $170, %ymm0, %ymm1
vpermpd $255, %ymm0, %ymm0
vfmadd132pd 64(%rdi), %ymm2, %ymm1
vfmadd132pd 96(%rdi), %ymm1, %ymm0
vmovupd %ymm0, (%rdx)
vzeroupper
ret
.cfi_endproc
.LFE12:
.size __mv_mult_4_4, .-__mv_mult_4_4
.p2align 4,,15

---

Best regards,

Laércio LIMA PILLA
Postdoctoral Researcher @ Inria Grenoble - Rhône-Alpes, CORSE project-team
Associate Professor @ UFSC, Brazil

