* Re: Loop Vectorization
  [not found] <93wB1t00W2qVqVd013wCG6>
@ 2016-06-20 18:46 ` sdcycling
  0 siblings, 0 replies; 7+ messages in thread

From: sdcycling @ 2016-06-20 18:46 UTC (permalink / raw)
To: fortran, tprince; +Cc: Tim Prince

Hello Tim,

Details of my software configuration and compiler options are provided below. I am running Ubuntu 16.04. The processors are dual Intel Xeon E5-2683 v4 CPUs. The range of the do loops varies from 1:2 to 1:256 in powers of 2, depending on the details of the multigrid and AMR levels. For the smaller loops, I will try to unroll them manually.

Thank you,
Doug.

mpif90 -O3 -Ofast -fopt-info-vec-optimized -fopt-info-vec-missed -fdefault-real-8 -fdefault-double-8 -ffixed-line-length-none -ffree-line-length-none

mpif90 -v
mpifort for MPICH version 3.2
Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 5.3.1-14ubuntu2.1' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 5.3.1 20160413 (Ubuntu 5.3.1-14ubuntu2.1)

gfortran -v
Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 5.3.1-14ubuntu2.1' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 5.3.1 20160413 (Ubuntu 5.3.1-14ubuntu2.1)

---- Tim Prince <n8tm@aol.com> wrote:
>
> On 6/20/2016 11:04 AM, sdcycling wrote:
> > Hi Tim,
> >
> > I am currently using MPICH. I could try MPICH in combination with OMP, but the number of processors on my workstation is currently maxed out.
> > The most computationally intensive loop uses derived data types within loops that are nested 3 deep:
> >
> > do concurrent (k=mg(mlev)%kb:mg(mlev)%ke:1)
> >   do concurrent (j=mg(mlev)%jb:mg(mlev)%je:1)
> >     do concurrent (i=mg(mlev)%ib:mg(mlev)%ie:2)
> >       residual = ...
> >       mgamr(mlev,ilev)%subdomain(n)%array(i,j,k) = residual ...
> >     enddo
> >   enddo
> > enddo
> >
> > It is a fragment of a Gauss-Seidel iteration. mlev is the multigrid level, ilev is the AMR level, and n is the subdomain. The data dependency is eliminated by red-black ordering with a stride of 2 in the innermost loop. The innermost loop is not vectorizing. The specific compiler output is "note: not vectorized: control flow in loop."
>
> I guess we are supposed to figure out your compiler according to the wording of the (possibly misleading) message. I don't think we can infer your compile options.
> Both ifort and gfortran have become more aggressive lately about vectorizing stride-2 inner loops, even in cases where it slows them down. I want to set -march=avx2 or -xHost without incurring such slowdowns, and so have had to explore ways of disabling such vectorization. If you have double precision and don't set an AVX target, you are unlikely to see a benefit from vectorization, and compilers are unlikely to try it without more of a directive than you have given. At a vector length of 4 it's anybody's guess whether vectorization of stride 2 will be useful, and compilers don't analyze deeply enough to get the answer (including whether you have enough arithmetic to outweigh possibly slower memory access).
> As you are asking for a scatter store, presumably on a target which doesn't have scatter instructions, it would likely involve vmask stores, which are notoriously slow. I don't have insider information about which future instruction set with scatter stores might change the picture (Skylake server?).
>
> --
> Tim Prince
* Loop Vectorization
@ 2016-06-20 3:06 sdcycling
  2016-06-20 6:11 ` Tobias Burnus
  0 siblings, 1 reply; 7+ messages in thread

From: sdcycling @ 2016-06-20 3:06 UTC (permalink / raw)
To: fortran

Hello,

I am using gfortran to build a finite-difference code. How do I tell whether a do loop is being vectorized? Also, how do I use compiler directives in gfortran to indicate that a do loop does not have any data dependencies?

Thank you,
Doug.
* Re: Loop Vectorization
  2016-06-20 3:06 sdcycling
@ 2016-06-20 6:11 ` Tobias Burnus
  2016-06-20 7:53 ` Tim Prince
  [not found] ` <8vtq1t00B2qVqVd01vtrfL>
  0 siblings, 2 replies; 7+ messages in thread

From: Tobias Burnus @ 2016-06-20 6:11 UTC (permalink / raw)
To: sdcycling, fortran

Hello,

sdcycling wrote:
> I am using gfortran to build a finite-difference code. How do I tell whether a do loop is being vectorized?

Try the -fopt-info-... options of GCC, in particular -fopt-info-vec-optimized.

> Also, how do I use compiler directives in gfortran to indicate that a do loop does not have any data dependencies?

Not as a directive, but using Fortran's "DO CONCURRENT" instead of a normal DO will imply this.

Cheers,
Tobias
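[Editor's illustration, not part of the original thread: a minimal sketch of both suggestions above. The file name and loop body are hypothetical; the flags are standard gfortran options.]

```fortran
! Minimal sketch: DO CONCURRENT asserts that iterations are independent,
! and -fopt-info-vec-optimized makes gfortran report each loop it vectorized.
! Hypothetical build line:
!   gfortran -O3 -fopt-info-vec-optimized saxpy.f90
program saxpy_demo
  implicit none
  integer, parameter :: n = 1024
  real :: x(n), y(n)
  integer :: i
  x = 1.0
  y = 2.0
  do concurrent (i = 1:n)   ! no data dependencies between iterations
     y(i) = y(i) + 2.0 * x(i)
  end do
  print *, y(1), y(n)
end program saxpy_demo
```

If the loop is vectorized, the -fopt-info-vec-optimized output typically includes a note such as "loop vectorized" with the source line number; -fopt-info-vec-missed reports the loops that were not.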
* Re: Loop Vectorization
  2016-06-20 6:11 ` Tobias Burnus
@ 2016-06-20 7:53 ` Tim Prince
  [not found] ` <8vtq1t00B2qVqVd01vtrfL>
  1 sibling, 0 replies; 7+ messages in thread

From: Tim Prince @ 2016-06-20 7:53 UTC (permalink / raw)
To: fortran

On 6/20/2016 2:11 AM, Tobias Burnus wrote:
> Hello,
>
> sdcycling wrote:
>> I am using gfortran to build a finite-difference code. How do I tell whether a do loop is being vectorized?
> Try the -fopt-info-... options of GCC, in particular -fopt-info-vec-optimized
>
>> Also, how do I use compiler directives in gfortran to indicate that a do loop does not have any data dependencies?
>
> Not as a directive, but using Fortran's "DO CONCURRENT" instead of a normal DO will imply this.

!$omp simd may be useful, if your objective is SIMD vectorization.

--
Tim Prince
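[Editor's illustration, not part of the original thread: a sketch of the directive approach mentioned above. The subroutine and its names are hypothetical; it assumes compiling with -fopenmp-simd, which honors the directive without threading.]

```fortran
! !$omp simd asks specifically for SIMD vectorization of the loop that
! follows; with gfortran, -fopenmp-simd enables the directive without
! linking the OpenMP runtime or creating any threads.
! Hypothetical build line:
!   gfortran -O3 -fopenmp-simd -fopt-info-vec scale.f90
subroutine scale(a, b, n)
  implicit none
  integer, intent(in) :: n
  real, intent(in)    :: a(n)
  real, intent(out)   :: b(n)
  integer :: i
  !$omp simd
  do i = 1, n
     b(i) = 2.0 * a(i)
  end do
end subroutine scale
```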
[parent not found: <8vtq1t00B2qVqVd01vtrfL>]
* Re: Loop Vectorization
  [not found] ` <8vtq1t00B2qVqVd01vtrfL>
@ 2016-06-20 15:04 ` sdcycling
  2016-06-20 15:56 ` Tim Prince
  2016-06-20 19:06 ` Tobias Burnus
  0 siblings, 2 replies; 7+ messages in thread

From: sdcycling @ 2016-06-20 15:04 UTC (permalink / raw)
To: tprince; +Cc: fortran

Hi Tim,

I am currently using MPICH. I could try MPICH in combination with OMP, but the number of processors on my workstation is currently maxed out.

The most computationally intensive loop uses derived data types within loops that are nested 3 deep:

do concurrent (k=mg(mlev)%kb:mg(mlev)%ke:1)
  do concurrent (j=mg(mlev)%jb:mg(mlev)%je:1)
    do concurrent (i=mg(mlev)%ib:mg(mlev)%ie:2)
      residual = ...
      mgamr(mlev,ilev)%subdomain(n)%array(i,j,k) = residual ...
    enddo
  enddo
enddo

It is a fragment of a Gauss-Seidel iteration. mlev is the multigrid level, ilev is the AMR level, and n is the subdomain. The data dependency is eliminated by red-black ordering with a stride of 2 in the innermost loop. The innermost loop is not vectorizing. The specific compiler output is "note: not vectorized: control flow in loop."

Thank you,
Doug.

> On Jun 20, 2016, at 12:53 AM, Tim Prince <n8tm@aol.com> wrote:
>
> On 6/20/2016 2:11 AM, Tobias Burnus wrote:
>> Hello,
>>
>> sdcycling wrote:
>>> I am using gfortran to build a finite-difference code. How do I tell
>>> whether a do loop is being vectorized?
>> Try the -fopt-info-... options of GCC, in particular
>> -fopt-info-vec-optimized
>>
>>> Also, how do I use compiler directives in gfortran to indicate that a
>>> do loop does not have any data dependencies?
>>
>> Not as a directive, but using Fortran's "DO CONCURRENT" instead of a
>> normal DO will imply this.
>>
> !$omp simd may be useful, if your objective is SIMD vectorization.
>
> --
> Tim Prince
* Re: Loop Vectorization
  2016-06-20 15:04 ` sdcycling
@ 2016-06-20 15:56 ` Tim Prince
  2016-06-20 19:06 ` Tobias Burnus
  1 sibling, 0 replies; 7+ messages in thread

From: Tim Prince @ 2016-06-20 15:56 UTC (permalink / raw)
To: fortran

On 6/20/2016 11:04 AM, sdcycling wrote:
> Hi Tim,
>
> I am currently using MPICH. I could try MPICH in combination with OMP, but the number of processors on my workstation is currently maxed out.
> The most computationally intensive loop uses derived data types within loops that are nested 3 deep:
>
> do concurrent (k=mg(mlev)%kb:mg(mlev)%ke:1)
>   do concurrent (j=mg(mlev)%jb:mg(mlev)%je:1)
>     do concurrent (i=mg(mlev)%ib:mg(mlev)%ie:2)
>       residual = ...
>       mgamr(mlev,ilev)%subdomain(n)%array(i,j,k) = residual ...
>     enddo
>   enddo
> enddo
>
> It is a fragment of a Gauss-Seidel iteration. mlev is the multigrid level, ilev is the AMR level, and n is the subdomain. The data dependency is eliminated by red-black ordering with a stride of 2 in the innermost loop. The innermost loop is not vectorizing. The specific compiler output is "note: not vectorized: control flow in loop."

I guess we are supposed to figure out your compiler according to the wording of the (possibly misleading) message. I don't think we can infer your compile options.

Both ifort and gfortran have become more aggressive lately about vectorizing stride-2 inner loops, even in cases where it slows them down. I want to set -march=avx2 or -xHost without incurring such slowdowns, and so have had to explore ways of disabling such vectorization. If you have double precision and don't set an AVX target, you are unlikely to see a benefit from vectorization, and compilers are unlikely to try it without more of a directive than you have given. At a vector length of 4 it's anybody's guess whether vectorization of stride 2 will be useful, and compilers don't analyze deeply enough to get the answer (including whether you have enough arithmetic to outweigh possibly slower memory access).

As you are asking for a scatter store, presumably on a target which doesn't have scatter instructions, it would likely involve vmask stores, which are notoriously slow. I don't have insider information about which future instruction set with scatter stores might change the picture (Skylake server?).

--
Tim Prince
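[Editor's illustration, not part of the original thread: a heavily simplified one-dimensional sketch of the kind of red-black sweep under discussion. All names are hypothetical and this is not the poster's code; it only shows the stride-2 inner-loop shape that forces strided (scatter-like) stores when vectorized on hardware without scatter instructions.]

```fortran
! Simplified 1-D red-black Gauss-Seidel sweep (hypothetical example).
! Updating only one color per pass removes the loop-carried dependency,
! but the resulting stride-2 loop stores to every other element, which
! a vectorizer must implement with masked or scattered stores.
subroutine rb_sweep(u, f, n, color)
  implicit none
  integer, intent(in)    :: n, color   ! color = 0 (red) or 1 (black)
  real,    intent(inout) :: u(n)
  real,    intent(in)    :: f(n)
  integer :: i
  do i = 2 + color, n - 1, 2           ! stride 2: one color per sweep
     u(i) = 0.5 * (u(i-1) + u(i+1) - f(i))
  end do
end subroutine rb_sweep
```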
* Re: Loop Vectorization
  2016-06-20 15:04 ` sdcycling
  2016-06-20 15:56 ` Tim Prince
@ 2016-06-20 19:06 ` Tobias Burnus
  1 sibling, 0 replies; 7+ messages in thread

From: Tobias Burnus @ 2016-06-20 19:06 UTC (permalink / raw)
To: sdcycling, tprince; +Cc: fortran

sdcycling wrote:
> I am currently using MPICH. I could try MPICH in combination with OMP, but the number of processors on my workstation is currently maxed out.

Side note: OpenMP's "simd" directives work on vectorization and do not imply multiple threads. With -fopenmp-simd, the OpenMP run-time library is not even linked.

Regarding the code the compiler produces: using "-fdump-tree-original", one can see what operations the compiler does. (This is a dump of the internal representation; it looks like low-level C, but not all properties are shown.) It's quite lengthy, but it might give a hint about what code the compiler adds.

Tobias
end of thread, other threads: [~2016-06-20 19:06 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <93wB1t00W2qVqVd013wCG6>
2016-06-20 18:46 ` Loop Vectorization sdcycling
2016-06-20  3:06 Loop Vectorization sdcycling
2016-06-20  6:11 ` Tobias Burnus
2016-06-20  7:53 ` Tim Prince
[not found] ` <8vtq1t00B2qVqVd01vtrfL>
2016-06-20 15:04 ` sdcycling
2016-06-20 15:56 ` Tim Prince
2016-06-20 19:06 ` Tobias Burnus