How to get best AVX2 performance from gfortran?

public inbox for fortran@gcc.gnu.org

* How to get best AVX2 performance from gfortran?
@ 2018-04-18 14:22 Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC]
  2018-04-19  3:40 ` Jerry DeLisle
  2018-04-19  6:33 ` Thomas Koenig
  0 siblings, 2 replies; 7+ messages in thread
From: Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC] @ 2018-04-18 14:22 UTC
  To: fortran

Dear GNU Fortran Gurus,

I have a question mainly based around AVX2 and flags.

To wit, I work on a climate/weather model and our main compiler is 
ifort, but I also maintain a gfortran port as well for portability and 
standards checking.

Our "stock" ifort flags for the model are very safe:

-O3 -qopt-report0 -ftz -align all -fno-alias -traceback -assume 
realloc_lhs     -fPIC -fpe0 -fp-model source -heap-arrays 32 -assume 
noold_maxminloc -fimf-arch-consistency=true  -align dcommons

These have evolved over many years, but the science they produce is
validated.

Recently, a push was made to get better performance, and I found that on our
Haswell nodes, if we turn on AVX2 and loosen some of the flags above:

-O3 -xCORE-AVX2 -fma -qopt-report0 -ftz -align all -fno-alias -align 
array32byte -traceback -assume realloc_lhs     -fPIC -fpe3 -fp-model 
consistent -g -assume noold_maxminloc  -align dcommons

we get a good performance boost in our dynamics. Here are some of our 
timers for a 1-degree equivalent run (96 Haswell cores) focusing on the 
main routines in dynamics. Using Intel 18.0.1:

Intel Safe:

    C_SW       2.0175
    D_SW       8.2301
    PG_D       1.1315
    TRACER_2D 49.4546
    REMAPPING 26.0622

Intel AVX2:

    C_SW       1.4931
    D_SW       5.4254
    PG_D       1.0878
    TRACER_2D 24.7418
    REMAPPING 27.2644

Now at this low of a resolution, these aren't too impressive, but scale 
up and essentially dynamics runs a little under 2x faster.
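
To make the comparison concrete, the per-routine speedup can be computed directly from the two timer tables above. This is only a small sketch in Python; the dictionary and function names are illustrative, and the numbers are copied from the tables.

```python
# Timer values copied from the "Intel Safe" and "Intel AVX2" tables above.
intel_safe = {"C_SW": 2.0175, "D_SW": 8.2301, "PG_D": 1.1315,
              "TRACER_2D": 49.4546, "REMAPPING": 26.0622}
intel_avx2 = {"C_SW": 1.4931, "D_SW": 5.4254, "PG_D": 1.0878,
              "TRACER_2D": 24.7418, "REMAPPING": 27.2644}


def speedups(baseline, candidate):
    """Return baseline/candidate per routine; values above 1 mean candidate is faster."""
    return {name: baseline[name] / candidate[name] for name in baseline}


ratios = speedups(intel_safe, intel_avx2)
for name, ratio in ratios.items():
    print(f"{name:<10} {ratio:5.2f}x")
```

At this resolution TRACER_2D comes out just under 2x faster, while REMAPPING is marginally slower with the AVX2 flags.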

Now I looked at GNU Fortran (7.3.0). Here my "stock" flags are quite 
boring (and all flags, not just the optimization ones):

-O3 -march=westmere -mtune=generic -funroll-loops -g
-fcoarray=single    -fcray-pointer -D__GFORTRAN__ 
-ffree-line-length-none -fno-range-check  -Wno-missing-include-dirs 
-fPIC -ffpe-trap=zero,overflow -fbacktrace  -falign-commons

Since this is aimed mainly at running on desktops and laptops, it's 
fairly bland and Westmere is probably a bit old but it's safe. With 
these flags I get:

GNU Stock:
    C_SW        2.0775
    D_SW        9.6586
    PG_D        0.6730
    TRACER_2D 273.3723
    REMAPPING 115.3358

Well huh. gfortran does *not* like those two routines! But I might be 
targeting too low. I searched the net for people's suggestions for good 
AVX2 flags and found:

-O3 -ffast-math -fno-cx-limited-range -funroll-loops --param 
max-unroll-times=2 -march=haswell    -fcray-pointer -D__GFORTRAN__ 
-ffree-line-length-none -fno-range-check  -Wno-missing-include-dirs 
-fPIC -g -fbacktrace

Using these:

GNU Haswell:
    C_SW        2.3807
    D_SW        9.6157
    PG_D        0.7708
    TRACER_2D 176.9858
    REMAPPING 111.1660

Well, that's not nothing. The TRACER_2D is faster.

More searches online said that perhaps GNU doesn't do FMA as well and 
that should be turned off (added -mno-fma):

GNU Haswell NoFMA:
    C_SW        2.3933
    D_SW        9.6624
    PG_D        0.7837
    TRACER_2D 162.7971
    REMAPPING 100.4335

A little better... And then as I was looking at this mailing list, I saw 
Wilco Dijkstra's post where he said -frepack-arrays helped and, why not 
try that! And:

GNU Haswell NoFMA Repack:
    C_SW        2.4350
    D_SW        9.7109
    PG_D        0.7869
    TRACER_2D 163.6474
    REMAPPING 100.6820

So, my questions to you gurus are: Is there something I could try adding 
to my gfortran options that might help with this discrepancy between 
Intel AVX2 and GCC? Or perhaps I need to *remove* something (some flag 
kills the vectorizer)?

At this point I'm mainly searching for speed for speed's sake so I'm
willing to try anything even if it's thought of as "stupid aggressive".
As long as it doesn't crash the model, I'm willing to try and then check
to make sure the world isn't boiling or freezing because I broke reality.

Matt
-- 
Matt Thompson, SSAI, Sr Scientific Programmer/Analyst
NASA GSFC,    Global Modeling and Assimilation Office
Code 610.1,  8800 Greenbelt Rd,  Greenbelt,  MD 20771
Phone: 301-614-6712                 Fax: 301-614-6246
http://science.gsfc.nasa.gov/sed/bio/matthew.thompson


* Re: How to get best AVX2 performance from gfortran?
  2018-04-18 14:22 How to get best AVX2 performance from gfortran? Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC]
@ 2018-04-19  3:40 ` Jerry DeLisle
  2018-04-20 15:51   ` Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC]
  2018-04-19  6:33 ` Thomas Koenig
  1 sibling, 1 reply; 7+ messages in thread
From: Jerry DeLisle @ 2018-04-19  3:40 UTC
  To: Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC],
	fortran

On 04/18/2018 07:22 AM, Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND 
APPLICATIONS INC] wrote:
> Dear GNU Fortran Gurus,
> 
> I have a question mainly based around AVX2 and flags.
> 
--- snip ---

> 
> So, my questions to you gurus are: Is there something I could try adding 
> to my gfortran options that might help with this discrepancy between 
> Intel AVX2 and GCC? Or perhaps I need to *remove* something (some flag 
> kills the vectorizer)?
> 
> At this point I'm mainly searching for speed for speed's sake so I'm 
> willing to try anything even if it's thought of as "stupid aggressive".
> As long as it doesn't crash the model, I'm willing to try and then check 
> to make sure the world isn't boiling or freezing because I broke reality.
> 
> Matt

Well, I have seen some codes pessimized by -O3, so try -O2.  Also,
I generally use -march=native.

Also you may try something like this:

-ffast-math -ftree-vectorize -funroll-loops --param max-unroll-times=4 
-mprefer-avx128

Curious about what you find out. Generally speaking, one has to fine-tune
to the code being compiled, since there are many factors that can
influence the outcome.

Regards,

Jerry


* Re: How to get best AVX2 performance from gfortran?
  2018-04-18 14:22 How to get best AVX2 performance from gfortran? Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC]
  2018-04-19  3:40 ` Jerry DeLisle
@ 2018-04-19  6:33 ` Thomas Koenig
  2018-04-19  8:35   ` Richard Biener
  1 sibling, 1 reply; 7+ messages in thread
From: Thomas Koenig @ 2018-04-19  6:33 UTC
  To: Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC],
	fortran, gcc mailing list

Hi Matt,
[timings]

> Intel AVX2:
> 
>     C_SW       1.4931
>     D_SW       5.4254
>     PG_D       1.0878
>     TRACER_2D 24.7418
>     REMAPPING 27.2644

> Now I looked at GNU Fortran (7.3.0). Here my "stock" flags are quite 
> boring (and all flags, not just the optimization ones):

[Various options elided, the best was]:

> GNU Haswell NoFMA Repack:
>     C_SW        2.4350
>     D_SW        9.7109
>     PG_D        0.7869
>     TRACER_2D 163.6474
>     REMAPPING 100.6820
> 
> So, my questions to you gurus are: Is there something I could try adding 
> to my gfortran options that might help with this discrepancy between 
> Intel AVX2 and GCC? Or perhaps I need to *remove* something (some flag 
> kills the vectorizer)?

The gcc 8 release is just around the corner, and a lot of improvements
have been made to code generation, also for AVX2. You might want to give
the current trunk, the soon-to-be-released release candidate, or the
then newly released gcc 8 a spin.

Second, this performance gap with respect to Intel (a factor of 6.6 for
your TRACER_2D routine) is dramatic. If anything like this persists in
gcc 8, the only way to get this fixed is to submit a bug report.
Profile the code and try to reduce it to something that shows
the problem (and that you can put in a bug report).
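
One way to start narrowing things down before profiling is to ask GCC which loops it did and did not vectorize. The sketch below only builds and prints such a compile command; it does not run the compiler. `-fopt-info-vec-missed` and `-fopt-info-vec-optimized` are GCC's vectorization report options, the base flags are a subset of the "GNU Haswell" set from the original post, and the file name `tracer_2d.F90` is hypothetical.

```python
import shlex

# Subset of the optimization flags from the "GNU Haswell" set in the original post.
base_flags = ["-O3", "-march=haswell", "-funroll-loops"]

# -fopt-info-vec-missed reports loops GCC could not vectorize;
# -fopt-info-vec-optimized reports the ones it did vectorize.
report_flags = ["-fopt-info-vec-missed", "-fopt-info-vec-optimized"]


def compile_command(source, flags):
    """Build a compile-only gfortran command line as a list of arguments."""
    return ["gfortran", "-c", *flags, source]


# "tracer_2d.F90" is a hypothetical file name, not one taken from the thread.
cmd = compile_command("tracer_2d.F90", base_flags + report_flags)
print(shlex.join(cmd))
```

Comparing the report for a slow routine across flag sets may show which loops stop being vectorized, which is also a reasonable starting point for a reduced test case.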

Regards

	Thomas


* Re: How to get best AVX2 performance from gfortran?
  2018-04-19  6:33 ` Thomas Koenig
@ 2018-04-19  8:35   ` Richard Biener
  0 siblings, 0 replies; 7+ messages in thread
From: Richard Biener @ 2018-04-19  8:35 UTC
  To: Thomas Koenig
  Cc: Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC],
	fortran, gcc mailing list

On Thu, Apr 19, 2018 at 8:33 AM, Thomas Koenig <tkoenig@netcologne.de> wrote:
> Hi Matt,
> [timings]
>
>> Intel AVX2:
>>
>>     C_SW       1.4931
>>     D_SW       5.4254
>>     PG_D       1.0878
>>     TRACER_2D 24.7418
>>     REMAPPING 27.2644
>
>
>> Now I looked at GNU Fortran (7.3.0). Here my "stock" flags are quite
>> boring (and all flags, not just the optimization ones):
>
>
> [Various options elided, the best was]:
>
>> GNU Haswell NoFMA Repack:
>>     C_SW        2.4350
>>     D_SW        9.7109
>>     PG_D        0.7869
>>     TRACER_2D 163.6474
>>     REMAPPING 100.6820
>>
>> So, my questions to you gurus are: Is there something I could try adding
>> to my gfortran options that might help with this discrepancy between Intel
>> AVX2 and GCC? Or perhaps I need to *remove* something (some flag kills the
>> vectorizer)?
>
> The gcc 8 release is just around the corner, and a lot of improvements
> have been made to code generation, also for AVX2. You might want to give
> the current trunk, the soon-to-be-released release candidate, or the
> then newly released gcc 8 a spin.
>
> Second, this performance gap with respect to Intel (a factor of 6.6 for
> your TRACER_2D routine) is dramatic. If anything like this persists in gcc 8,
> the only way to get this fixed is to submit a bug report.
> Profile the code and try to reduce it to something that shows
> the problem (and that you can put in a bug report).

Depending on what those routines do (do they call math intrinsics like
sin or cos?), ICC has an advantage with its highly optimized vectorized
math library.  You can use that from gfortran as well by using
-mveclibabi=svml and linking against libsvml.{a,so}, which comes with ICC.
Unfortunately, gfortran cannot use glibc's libmvec at the moment.

Richard.

> Regards
>
>         Thomas


* Re: How to get best AVX2 performance from gfortran?
  2018-04-19  3:40 ` Jerry DeLisle
@ 2018-04-20 15:51   ` Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC]
  2018-04-20 17:18     ` Jerry DeLisle
  0 siblings, 1 reply; 7+ messages in thread
From: Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC] @ 2018-04-20 15:51 UTC
  To: Jerry DeLisle, fortran

On 04/18/2018 11:40 PM, Jerry DeLisle wrote:
> On 04/18/2018 07:22 AM, Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND 
> APPLICATIONS INC] wrote:
>> Dear GNU Fortran Gurus,
>>
>> I have a question mainly based around AVX2 and flags.
>>
> --- snip ---
> 
>>
>> So, my questions to you gurus are: Is there something I could try 
>> adding to my gfortran options that might help with this discrepancy 
>> between Intel AVX2 and GCC? Or perhaps I need to *remove* something 
>> (some flag kills the vectorizer)?
>>
>> At this point I'm mainly searching for speed for speed's sake so I'm 
>> willing to try anything even if it's thought of as "stupid aggressive".
>> As long as it doesn't crash the model, I'm willing to try and then 
>> check to make sure the world isn't boiling or freezing because I broke 
>> reality.
>>
>> Matt
> 
> Well, I have seen some codes pessimized by -O3, so try -O2.  Also,
> I generally use -march=native.
> 
> Also you may try something like this:
> 
> -ffast-math -ftree-vectorize -funroll-loops --param max-unroll-times=4 
> -mprefer-avx128
> 
> Curious about what you find out. Generally speaking, one has to fine-tune
> to the code being compiled, since there are many factors that can
> influence the outcome.

An overall update. Jerry's flags with -mno-fma and -O2:

-O2 -march=native -ffast-math -ftree-vectorize -funroll-loops --param 
max-unroll-times=4 -mprefer-avx128 -mno-fma

seem to be the best I've found:

Original:

    TRACER_2D 273.3723
    REMAPPING 115.3358

Jerry's:

    TRACER_2D 138.4858
    REMAPPING  94.4236

Though compared to Intel:

    TRACER_2D 24.7418
    REMAPPING 27.2644

still a ways to go.

That said, it turns out for now I can't actually *use* these flags. One 
of the requirements for our model is that it regress under layout change 
in MPI. A lot of work was put into getting flags with Intel that did 
this but still vectorized (-fp-model consistent was the key, I think). 
The offending flag in GNU seems to be -ffast-math, but if you remove 
that from Jerry's flags:

Native O2 Delisle NoFMA NoFastMath:

    TRACER_2D 264.1194
    REMAPPING 113.4083

Well, nuts. Looks like -ffast-math was the special flag.
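
A plausible reason -ffast-math breaks regression under layout change (an assumption here; the thread does not pin down the cause) is that it lets the compiler reassociate floating-point sums, and floating-point addition is not associative. The sketch below, in Python rather than Fortran, adds the same four values once as a single sequence and once as two per-rank partial sums. `ordered_sum`, `layout_sum`, and the sample values are purely illustrative.

```python
def ordered_sum(values):
    """Sum left to right in plain double precision.

    An explicit loop is used on purpose: Python's built-in sum() may apply
    compensated summation to floats, which would hide the effect.
    """
    acc = 0.0
    for v in values:
        acc += v
    return acc


def layout_sum(values, nranks):
    """Split values into nranks contiguous chunks, sum each, then add the partial sums."""
    chunk = -(-len(values) // nranks)  # ceiling division
    partials = [ordered_sum(values[i:i + chunk])
                for i in range(0, len(values), chunk)]
    return ordered_sum(partials)


values = [1e16, 1.0, -1e16, 1.0]
one_rank = layout_sum(values, 1)   # a single left-to-right sum
two_ranks = layout_sum(values, 2)  # two partial sums, then combined
print(one_rank, two_ranks)
```

The two results differ (1.0 versus 0.0) even though the inputs are identical, which is the kind of order dependence that strict floating-point settings are meant to rule out.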

I'll ask for a GCC 8 snapshot to be installed on our cluster to test 
with that at least to see if vectorization/speed is improved, though 
without reproducibility...

Thanks,
Matt


-- 
Matt Thompson, SSAI, Sr Scientific Programmer/Analyst
NASA GSFC,    Global Modeling and Assimilation Office
Code 610.1,  8800 Greenbelt Rd,  Greenbelt,  MD 20771
Phone: 301-614-6712                 Fax: 301-614-6246
http://science.gsfc.nasa.gov/sed/bio/matthew.thompson


* Re: How to get best AVX2 performance from gfortran?
  2018-04-20 15:51   ` Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC]
@ 2018-04-20 17:18     ` Jerry DeLisle
  2018-04-20 17:25       ` Richard Biener
  0 siblings, 1 reply; 7+ messages in thread
From: Jerry DeLisle @ 2018-04-20 17:18 UTC
  To: Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC],
	fortran

On 04/20/2018 08:51 AM, Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND 
APPLICATIONS INC] wrote:
> On 04/18/2018 11:40 PM, Jerry DeLisle wrote:
>> On 04/18/2018 07:22 AM, Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS 
--- snip---
> that from Jerry's flags:
> 
> Native O2 Delisle NoFMA NoFastMath:
> 
>     TRACER_2D 264.1194
>     REMAPPING 113.4083
> 
> Well, nuts. Looks like -ffast-math was the special flag.

Look in the manual at gcc.gnu.org.  -ffast-math implies a few other flags
that you can try separately.

-ffast-math

     Sets the options -fno-math-errno, -funsafe-math-optimizations, 
-ffinite-math-only, -fno-rounding-math, -fno-signaling-nans, 
-fcx-limited-range and -fexcess-precision=fast.
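
Trying the sub-options separately can be organized as a simple one-at-a-time sweep. The sketch below only generates and prints the candidate command lines; it does not run the compiler or the model. The base flags are the non-fast-math variant of the set discussed in this thread, the sub-option list is the one from the manual excerpt above, and the file name `tracer_2d.F90` is hypothetical. Whether each option is accepted or has any effect for Fortran in a given GCC version is not checked here.

```python
import shlex

# Base flags: the set discussed in this thread with -ffast-math removed.
base_flags = ["-O2", "-march=native", "-ftree-vectorize", "-funroll-loops",
              "--param", "max-unroll-times=4", "-mprefer-avx128", "-mno-fma"]

# Sub-options listed in the manual excerpt for -ffast-math.
fast_math_parts = ["-fno-math-errno", "-funsafe-math-optimizations",
                   "-ffinite-math-only", "-fno-rounding-math",
                   "-fno-signaling-nans", "-fcx-limited-range",
                   "-fexcess-precision=fast"]


def one_at_a_time(base, extras):
    """Yield (extra_flag, full_flag_list) pairs, adding a single extra flag each time."""
    for flag in extras:
        yield flag, base + [flag]


# "tracer_2d.F90" is a hypothetical source file name.
trials = list(one_at_a_time(base_flags, fast_math_parts))
for flag, flags in trials:
    print(shlex.join(["gfortran", "-c", *flags, "tracer_2d.F90"]))
```

Each variant can then be timed and checked for reproducibility on its own, so that a flag that buys speed without breaking regression can be kept.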

Also, if you can run some profiles and narrow down the code in question, 
maybe obfuscate it a bit or share privately with one of us, we can get 
someone in the know to take a closer look.

> 
> I'll ask for a GCC 8 snapshot to be installed on our cluster to test 
> with that at least to see if vectorization/speed is improved, though 
> without reproducibility...
> 
> Thanks,
> Matt
> 
> 


* Re: How to get best AVX2 performance from gfortran?
  2018-04-20 17:18     ` Jerry DeLisle
@ 2018-04-20 17:25       ` Richard Biener
  0 siblings, 0 replies; 7+ messages in thread
From: Richard Biener @ 2018-04-20 17:25 UTC
  To: fortran, Jerry DeLisle, Thompson,
	Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC]

On April 20, 2018 7:18:37 PM GMT+02:00, Jerry DeLisle <jvdelisle@charter.net> wrote:
>On 04/20/2018 08:51 AM, Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND
>
>APPLICATIONS INC] wrote:
>> On 04/18/2018 11:40 PM, Jerry DeLisle wrote:
>>> On 04/18/2018 07:22 AM, Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS 
>--- snip---
>> that from Jerry's flags:
>> 
>> Native O2 Delisle NoFMA NoFastMath:
>> 
>>     TRACER_2D 264.1194
>>     REMAPPING 113.4083
>> 
>> Well, nuts. Looks like -ffast-math was the special flag.
>
>Look in the manual at gcc.gnu.org.  -ffast-math implies a few other flags
>that you can try separately.
>
>-ffast-math
>
>     Sets the options -fno-math-errno, -funsafe-math-optimizations, 
>-ffinite-math-only, -fno-rounding-math, -fno-signaling-nans, 
>-fcx-limited-range and -fexcess-precision=fast.

That doesn't seem to be a complete list. The Fortran FE also enables some of them unconditionally, like -fno-math-errno. One important flag is -fno-trapping-math.

Richard. 

>Also, if you can run some profiles and narrow down the code in
>question, 
>maybe obfuscate it a bit or share privately with one of us, we can get 
>someone in the know to take a closer look.
>
>> 
>> I'll ask for a GCC 8 snapshot to be installed on our cluster to test 
>> with that at least to see if vectorization/speed is improved, though 
>> without reproducibility...
>> 
>> Thanks,
>> Matt
>> 
>> 


Thread overview: 7+ messages
2018-04-18 14:22 How to get best AVX2 performance from gfortran? Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC]
2018-04-19  3:40 ` Jerry DeLisle
2018-04-20 15:51   ` Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC]
2018-04-20 17:18     ` Jerry DeLisle
2018-04-20 17:25       ` Richard Biener
2018-04-19  6:33 ` Thomas Koenig
2018-04-19  8:35   ` Richard Biener
