* How to get best AVX2 performance from gfortran?
@ 2018-04-18 14:22 Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC]
2018-04-19 3:40 ` Jerry DeLisle
2018-04-19 6:33 ` Thomas Koenig
0 siblings, 2 replies; 7+ messages in thread
From: Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC] @ 2018-04-18 14:22 UTC
To: fortran
Dear GNU Fortran Gurus,
I have a question mainly based around AVX2 and flags.
To wit, I work on a climate/weather model and our main compiler is
ifort, but I also maintain a gfortran port for portability and
standards checking.
Our "stock" ifort flags for the model are very safe:
-O3 -qopt-report0 -ftz -align all -fno-alias -traceback -assume
realloc_lhs -fPIC -fpe0 -fp-model source -heap-arrays 32 -assume
noold_maxminloc -fimf-arch-consistency=true -align dcommons
These have evolved over many years, but the science they produce is
validated.
Recently, a push was made to get better performance, and I found that on
our Haswell nodes, if we turn on AVX2 and loosen some of the flags above:
-O3 -xCORE-AVX2 -fma -qopt-report0 -ftz -align all -fno-alias -align
array32byte -traceback -assume realloc_lhs -fPIC -fpe3 -fp-model
consistent -g -assume noold_maxminloc -align dcommons
we get a good performance boost in our dynamics. Here are some of our
timers for a 1-degree equivalent run (96 Haswell cores) focusing on the
main routines in dynamics. Using Intel 18.0.1:
Intel Safe:
C_SW 2.0175
D_SW 8.2301
PG_D 1.1315
TRACER_2D 49.4546
REMAPPING 26.0622
Intel AVX2:
C_SW 1.4931
D_SW 5.4254
PG_D 1.0878
TRACER_2D 24.7418
REMAPPING 27.2644
Now at this low of a resolution, these aren't too impressive, but scale
up and essentially dynamics runs a little under 2x faster.
Now I looked at GNU Fortran (7.3.0). Here my "stock" flags are quite
boring (these are all the flags, not just the optimization ones):
-O3 -march=westmere -mtune=generic -funroll-loops -g
-fcoarray=single -fcray-pointer -D__GFORTRAN__
-ffree-line-length-none -fno-range-check -Wno-missing-include-dirs
-fPIC -ffpe-trap=zero,overflow -fbacktrace -falign-commons
Since this is aimed mainly at running on desktops and laptops, it's
fairly bland and Westmere is probably a bit old but it's safe. With
these flags I get:
GNU Stock:
C_SW 2.0775
D_SW 9.6586
PG_D 0.6730
TRACER_2D 273.3723
REMAPPING 115.3358
Well huh. gfortran does *not* like those two routines! But I might be
targeting too low. I searched the net for people's suggestions for good
AVX2 flags and found:
-O3 -ffast-math -fno-cx-limited-range -funroll-loops --param
max-unroll-times=2 -march=haswell -fcray-pointer -D__GFORTRAN__
-ffree-line-length-none -fno-range-check -Wno-missing-include-dirs
-fPIC -g -fbacktrace
Using these:
GNU Haswell:
C_SW 2.3807
D_SW 9.6157
PG_D 0.7708
TRACER_2D 176.9858
REMAPPING 111.1660
Well, that's not nothing. The TRACER_2D is faster.
More searches online said that perhaps GNU doesn't do FMA as well and
that should be turned off (added -mno-fma):
GNU Haswell NoFMA:
C_SW 2.3933
D_SW 9.6624
PG_D 0.7837
TRACER_2D 162.7971
REMAPPING 100.4335
A little better... And then as I was looking at this mailing list, I saw
Wilco Dijkstra's post where he said -frepack-arrays helped and, why not
try that! And:
GNU Haswell NoFMA Repack:
C_SW 2.4350
D_SW 9.7109
PG_D 0.7869
TRACER_2D 163.6474
REMAPPING 100.6820
So, my questions to you gurus are: Is there something I could try adding
to my gfortran options that might help with this discrepancy between
Intel AVX2 and GCC? Or perhaps I need to *remove* something (some flag
kills the vectorizer)?
At this point I'm mainly searching for speed for speed's sake, so I'm
willing to try anything even if it's thought of as "stupid aggressive".
As long as it doesn't crash the model, I'm willing to try and then check
to make sure the world isn't boiling or freezing because I broke reality.
Matt
--
Matt Thompson, SSAI, Sr Scientific Programmer/Analyst
NASA GSFC, Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
Phone: 301-614-6712 Fax: 301-614-6246
http://science.gsfc.nasa.gov/sed/bio/matthew.thompson
* Re: How to get best AVX2 performance from gfortran?
From: Jerry DeLisle @ 2018-04-19 3:40 UTC
To: Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC],
fortran
On 04/18/2018 07:22 AM, Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND
APPLICATIONS INC] wrote:
> Dear GNU Fortran Gurus,
>
> I have a question mainly based around AVX2 and flags.
>
--- snip ---
>
> So, my questions to you gurus are: Is there something I could try adding
> to my gfortran options that might help with this discrepancy between
> Intel AVX2 and GCC? Or perhaps I need to *remove* something (some flag
> kills the vectorizer)?
>
> At this point I'm mainly searching for speed for speed's sake so I'm
> willing to try anything even if it's thought of as "stupid aggressive".
> As long as it doesn't crash the model, I'm willing to try and then check
> to make sure the world isn't boiling or freezing because I broke reality.
>
> Matt
Well, I have seen some codes pessimized by -O3, so try -O2. Also, I
generally use -march=native.
Also you may try something like this:
-ffast-math -ftree-vectorize -funroll-loops --param max-unroll-times=4
-mprefer-avx128
Curious about what you find out. Generally speaking, one has to fine-tune
the flags to the code being compiled, since there are many factors that
can influence the outcome.
Regards,
Jerry
* Re: How to get best AVX2 performance from gfortran?
From: Thomas Koenig @ 2018-04-19 6:33 UTC
To: Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC],
fortran, gcc mailing list
Hi Matt,
[timings]
> Intel AVX2:
>
>   C_SW      1.4931
>   D_SW      5.4254
>   PG_D      1.0878
>   TRACER_2D 24.7418
>   REMAPPING 27.2644
> Now I looked at GNU Fortran (7.3.0). Here my "stock" flags are quite
> boring (and all flags, not just the optimization ones):
[Various options elided, the best was]:
> GNU Haswell NoFMA Repack:
>   C_SW      2.4350
>   D_SW      9.7109
>   PG_D      0.7869
>   TRACER_2D 163.6474
>   REMAPPING 100.6820
>
> So, my questions to you gurus are: Is there something I could try adding
> to my gfortran options that might help with this discrepancy between
> Intel AVX2 and GCC? Or perhaps I need to *remove* something (some flag
> kills the vectorizer)?
The gcc 8 release is just around the corner, and a lot of improvements
have been made to code generation, also for AVX2. You might want to give
the current trunk (or the soon-to-be-released release candidate), or the
then newly released gcc8, a spin.
Second, this performance gap with respect to Intel (a factor of 6.6 for
your TRACER_2D routine) is dramatic. If anything like this persists in
gcc8, the only way to get this fixed is to submit a bug report.
Profile the code, and try to reduce it to something that shows
the problem (and that you can put in a bug report).
Regards
Thomas
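The factor of 6.6 quoted above can be checked directly from the timings
posted in this thread. What follows is only a minimal Python sketch: the
dictionary and function names are made up for illustration, and only the
numbers come from the thread (Intel AVX2 versus GNU Haswell NoFMA Repack).

```python
# Timings (seconds) copied from the thread. The variable names are
# hypothetical and exist only for this sketch.
intel_avx2 = {
    "C_SW": 1.4931,
    "D_SW": 5.4254,
    "PG_D": 1.0878,
    "TRACER_2D": 24.7418,
    "REMAPPING": 27.2644,
}
gnu_haswell_nofma_repack = {
    "C_SW": 2.4350,
    "D_SW": 9.7109,
    "PG_D": 0.7869,
    "TRACER_2D": 163.6474,
    "REMAPPING": 100.6820,
}


def slowdown(gnu, intel):
    """Return {timer: gnu_time / intel_time} for timers present in both."""
    return {name: gnu[name] / intel[name] for name in gnu if name in intel}


ratios = slowdown(gnu_haswell_nofma_repack, intel_avx2)
for name, ratio in ratios.items():
    # A ratio above 1 means gfortran is slower than ifort for that timer.
    print(f"{name:10s} {ratio:5.1f}x")
```

With these numbers TRACER_2D comes out at roughly 6.6 and REMAPPING at
roughly 3.7, while PG_D is below 1, so gfortran is already ahead there.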
* Re: How to get best AVX2 performance from gfortran?
From: Richard Biener @ 2018-04-19 8:35 UTC
To: Thomas Koenig
Cc: Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC],
fortran, gcc mailing list
On Thu, Apr 19, 2018 at 8:33 AM, Thomas Koenig <tkoenig@netcologne.de> wrote:
> Hi Matt,
> [timings]
>
>> Intel AVX2:
>>
>> C_SW 1.4931
>> D_SW 5.4254
>> PG_D 1.0878
>> TRACER_2D 24.7418
>> REMAPPING 27.2644
>
>
>> Now I looked at GNU Fortran (7.3.0). Here my "stock" flags are quite
>> boring (and all flags, not just the optimization ones):
>
>
> [Various options elided, the best was]:
>
>> GNU Haswell NoFMA Repack:
>> C_SW 2.4350
>> D_SW 9.7109
>> PG_D 0.7869
>> TRACER_2D 163.6474
>> REMAPPING 100.6820
>>
>> So, my questions to you gurus are: Is there something I could try adding
>> to my gfortran options that might help with this discrepancy between Intel
>> AVX2 and GCC? Or perhaps I need to *remove* something (some flag kills the
>> vectorizer)?
>
> The gcc 8 release is just around the corner, and a lot of improvements
> have been made to code generation, also for AVX2. You might want to give
> the current trunk (or the soon-to-be-released) release candidate, or the
> then newly released gcc8 a spin.
>
> Second, this performance gap with respect to Intel (a factor of 6.6 for
> your TRACER_2D routine) is dramatic. If anything like this persists in gcc8,
> the only way to get this fixed is to submit a bug report.
> Profile the code, try to reduce the code to something that shows
> the problem (and that you can put in a bug report).
Depending on what those routines do (do they call math intrinsics like
sin or cos?), ICC has an advantage with its highly optimized vectorized
math library. You can use that from gfortran as well by using
-mveclibabi=svml and linking against libsvml.{a,so}, which comes with ICC.
Unfortunately, gfortran cannot exercise glibc's libmvec at the moment.
Richard.
> Regards
>
> Thomas
* Re: How to get best AVX2 performance from gfortran?
From: Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC] @ 2018-04-20 15:51 UTC
To: Jerry DeLisle, fortran
On 04/18/2018 11:40 PM, Jerry DeLisle wrote:
> On 04/18/2018 07:22 AM, Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND
> APPLICATIONS INC] wrote:
>> Dear GNU Fortran Gurus,
>>
>> I have a question mainly based around AVX2 and flags.
>>
> --- snip ---
>
>>
>> So, my questions to you gurus are: Is there something I could try
>> adding to my gfortran options that might help with this discrepancy
>> between Intel AVX2 and GCC? Or perhaps I need to *remove* something
>> (some flag kills the vectorizer)?
>>
>> At this point I'm mainly searching for speed for speed's sake so I'm
>> willing to try anything even if it's thought of as "stupid aggressive".
>> As long as it doesn't crash the model, I'm willing to try and then
>> check to make sure the world isn't boiling or freezing because I broke
>> reality.
>>
>> Matt
>
> Well I have seen some codes pessimized by -O3 so try -O2. Also
> generally I do -march=native.
>
> Also you may try something like this:
>
> -ffast-math -ftree-vectorize -funroll-loops --param max-unroll-times=4
> -mprefer-avx128
>
> Curious about what you find out, generally speaking one has to fine tune
> to the code being compiled since there are many many factors that can
> influence the outcome.
An overall update. Jerry's flags with -mno-fma and -O2:
-O2 -march=native -ffast-math -ftree-vectorize -funroll-loops --param
max-unroll-times=4 -mprefer-avx128 -mno-fma
seem to be the best I've found:
Original:
TRACER_2D 273.3723
REMAPPING 115.3358
Jerry's:
TRACER_2D 138.4858
REMAPPING 94.4236
Though compared to Intel:
TRACER_2D 24.7418
REMAPPING 27.2644
still a ways to go.
That said, it turns out for now I can't actually *use* these flags. One
of the requirements for our model is that it regress under layout change
in MPI. A lot of work was put into getting flags with Intel that did
this but still vectorized (-fp-model consistent was the key, I think).
The offending flag in GNU seems to be -ffast-math, but if you remove
that from Jerry's flags:
Native O2 Delisle NoFMA NoFastMath:
TRACER_2D 264.1194
REMAPPING 113.4083
Well, nuts. Looks like -ffast-math was the special flag.
I'll ask for a GCC 8 snapshot to be installed on our cluster to test
with that at least to see if vectorization/speed is improved, though
without reproducibility...
Thanks,
Matt
--
Matt Thompson, SSAI, Sr Scientific Programmer/Analyst
NASA GSFC, Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
Phone: 301-614-6712 Fax: 301-614-6246
http://science.gsfc.nasa.gov/sed/bio/matthew.thompson
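One plausible reading of the reproducibility problem described above, not
confirmed anywhere in this thread: -ffast-math lets the compiler
reassociate floating-point sums, and floating-point addition is not
associative, so the way a reduction is grouped can change the last bits
of the result. The Python sketch below only mimics that effect by summing
the same three numbers with two different groupings. The function names
and the chunk "layouts" are made up for illustration and have nothing to
do with the model's actual MPI decomposition.

```python
def partial_sums(values, sizes):
    """Sum consecutive chunks of `values`; `sizes` gives each chunk length."""
    sums = []
    start = 0
    for size in sizes:
        total = 0.0
        for value in values[start:start + size]:
            total += value
        sums.append(total)
        start += size
    return sums


def layout_sum(values, sizes):
    """Add up the per-chunk partial sums, left to right."""
    total = 0.0
    for partial in partial_sums(values, sizes):
        total += partial
    return total


values = [0.1, 0.2, 0.3]

one_chunk = layout_sum(values, [3])      # (0.1 + 0.2) + 0.3
two_chunks = layout_sum(values, [1, 2])  # 0.1 + (0.2 + 0.3)

# Same inputs, different grouping, different last bit.
print(repr(one_chunk), repr(two_chunks), one_chunk == two_chunks)
```

The two results differ only in the last bit, but that is enough to fail a
bit-for-bit comparison between two runs.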
* Re: How to get best AVX2 performance from gfortran?
From: Jerry DeLisle @ 2018-04-20 17:18 UTC
To: Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC],
fortran
On 04/20/2018 08:51 AM, Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND
APPLICATIONS INC] wrote:
> On 04/18/2018 11:40 PM, Jerry DeLisle wrote:
>> On 04/18/2018 07:22 AM, Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS
--- snip---
> that from Jerry's flags:
>
> Native O2 Delisle NoFMA NoFastMath:
>
>   TRACER_2D 264.1194
>   REMAPPING 113.4083
>
> Well, nuts. Looks like -ffast-math was the special flag.
Look in the manual at gcc.gnu.org. -ffast-math implies a few other flags
that you can try separately.
-ffast-math
Sets the options -fno-math-errno, -funsafe-math-optimizations,
-ffinite-math-only, -fno-rounding-math, -fno-signaling-nans,
-fcx-limited-range and -fexcess-precision=fast.
Also, if you can run some profiles and narrow down the code in question,
maybe obfuscate it a bit or share privately with one of us, we can get
someone in the know to take a closer look.
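Trying the -ffast-math sub-options one at a time, as suggested above,
amounts to building one flag line per option on top of the flags that did
not use -ffast-math. A minimal Python sketch of that bookkeeping follows.
The baseline is taken from the "NoFastMath" run in this thread and the
option list is the one quoted from the manual above; the names in the code
are hypothetical, and whether each option is accepted or has any effect
for Fortran is not checked here. The script only prints candidate flag
lines and does not run the compiler.

```python
# Baseline: the flags from the 2018-04-20 update with -ffast-math removed.
BASE_FLAGS = [
    "-O2", "-march=native", "-ftree-vectorize", "-funroll-loops",
    "--param", "max-unroll-times=4", "-mprefer-avx128", "-mno-fma",
]

# Sub-options of -ffast-math exactly as quoted from the GCC manual above.
FAST_MATH_PARTS = [
    "-fno-math-errno",
    "-funsafe-math-optimizations",
    "-ffinite-math-only",
    "-fno-rounding-math",
    "-fno-signaling-nans",
    "-fcx-limited-range",
    "-fexcess-precision=fast",
]


def candidate_flag_lines(base, parts):
    """Return one flag string per sub-option: the baseline plus that option."""
    return [" ".join(base + [part]) for part in parts]


flag_lines = candidate_flag_lines(BASE_FLAGS, FAST_MATH_PARTS)
for line in flag_lines:
    print(line)
```

Each printed line can then be used for one build and one timing run, so
that any speedup or loss of reproducibility can be pinned on a single
option.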
>
> I'll ask for a GCC 8 snapshot to be installed on our cluster to test
> with that at least to see if vectorization/speed is improved, though
> without reproducibility...
>
> Thanks,
> Matt
>
>
* Re: How to get best AVX2 performance from gfortran?
From: Richard Biener @ 2018-04-20 17:25 UTC
To: fortran, Jerry DeLisle, Thompson,
Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC]
On April 20, 2018 7:18:37 PM GMT+02:00, Jerry DeLisle <jvdelisle@charter.net> wrote:
>On 04/20/2018 08:51 AM, Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND
>
>APPLICATIONS INC] wrote:
>> On 04/18/2018 11:40 PM, Jerry DeLisle wrote:
>>> On 04/18/2018 07:22 AM, Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS
>--- snip---
>> that from Jerry's flags:
>>
>> Native O2 Delisle NoFMA NoFastMath:
>>
>> TRACER_2D 264.1194
>> REMAPPING 113.4083
>>
>> Well, nuts. Looks like -ffast-math was the special flag.
>
>Look in manual at gcc.gnu.org. -ffast-math implies a few other flags
>that you can further try separately.
>
>-ffast-math
>
> Sets the options -fno-math-errno, -funsafe-math-optimizations,
>-ffinite-math-only, -fno-rounding-math, -fno-signaling-nans,
>-fcx-limited-range and -fexcess-precision=fast.
That doesn't seem to be a complete list. The Fortran FE also enables
some of them unconditionally, like -fno-math-errno. One important flag
is -fno-trapping-math.
Richard.
>Also, if you can run some profiles and narrow down the code in
>question,
>maybe obfuscate it a bit or share privately with one of us, we can get
>someone in the know to take a closer look.
>
>>
>> I'll ask for a GCC 8 snapshot to be installed on our cluster to test
>> with that at least to see if vectorization/speed is improved, though
>> without reproducibility...
>>
>> Thanks,
>> Matt
>>
>>