* Re: Loop Vectorization
  [not found] <93wB1t00W2qVqVd013wCG6>
@ 2016-06-20 18:46 ` sdcycling
  0 siblings, 0 replies; 7+ messages in thread

From: sdcycling @ 2016-06-20 18:46 UTC (permalink / raw)
To: fortran, tprince; +Cc: Tim Prince

Hello Tim,

Details of my software configuration and compiler options are provided below. I am running Ubuntu 16.04. The processors are dual Intel Xeon E5-2683 v4 CPUs. The range of the do loops varies from 1:2 to 1:256 in powers of 2, depending on the details of the multigrid and AMR levels. For the smaller loops, I will try to unroll them manually.

Thank you,
Doug.

mpif90 -O3 -Ofast -fopt-info-vec-optimized -fopt-info-vec-missed -fdefault-real-8 -fdefault-double-8 -ffixed-line-length-none -ffree-line-length-none

mpif90 -v
mpifort for MPICH version 3.2
Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 5.3.1-14ubuntu2.1' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 5.3.1 20160413 (Ubuntu 5.3.1-14ubuntu2.1)

gfortran -v
Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 5.3.1-14ubuntu2.1' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 5.3.1 20160413 (Ubuntu 5.3.1-14ubuntu2.1)

---- Tim Prince <n8tm@aol.com> wrote:
>
> On 6/20/2016 11:04 AM, sdcycling wrote:
> > Hi Tim,
> >
> > I am currently using MPICH. I could try MPICH in combination with OMP, but the number of processors on my workstation is currently maxed out.
> > The most computationally intensive loop uses derived data types within loops that are nested 3 deep:
> >
> > do concurrent (k=mg(mlev)%kb:mg(mlev)%ke:1)
> >   do concurrent (j=mg(mlev)%jb:mg(mlev)%je:1)
> >     do concurrent (i=mg(mlev)%ib:mg(mlev)%ie:2)
> >       residual = ...
> >       mgamr(mlev,ilev)%subdomain(n)%array(i,j,k) = residual ...
> >     enddo
> >   enddo
> > enddo
> >
> > It is a fragment of a Gauss-Seidel iteration. mlev is the multigrid level, ilev is the AMR level, and n is the subdomain. The data dependency is eliminated by red-black ordering with a stride of 2 in the innermost loop. The innermost loop is not vectorizing. The specific compiler output is "note: not vectorized: control flow in loop."
>
> I guess we are supposed to figure out your compiler according to the wording of the (possibly misleading) message. I don't think we can infer your compile options.
> Both ifort and gfortran have become more aggressive lately about vectorizing stride-2 inner loops, even in cases where it slows them down. I want to set -march=avx2 or -xHost without incurring such slowdowns, and so have had to explore ways of disabling such vectorization. If you have double precision and don't set an AVX target, you are unlikely to see a benefit from vectorization, and compilers are unlikely to try it without more of a directive than you have given. At a vector length of 4 it's anybody's guess whether vectorization of stride 2 will be useful, and compilers don't analyze deeply enough to get the answer (including whether you have enough arithmetic to outweigh possibly slower memory access).
> As you are asking for a scatter store, presumably on a target which doesn't have scatter instructions, it would likely involve vmask stores, which are notoriously slow. I don't have insider information about which future instruction set with scatter stores might change the picture (Skylake server?).
>
> --
> Tim Prince
* Loop Vectorization
@ 2016-06-20 3:06 sdcycling
  2016-06-20 6:11 ` Tobias Burnus
  0 siblings, 1 reply; 7+ messages in thread

From: sdcycling @ 2016-06-20 3:06 UTC (permalink / raw)
To: fortran

Hello,

I am using gfortran to build a finite-difference code. How do I tell whether a do loop is being vectorized? Also, how do I use compiler directives in gfortran to indicate that a do loop does not have any data dependencies?

Thank you,
Doug.
* Re: Loop Vectorization
  2016-06-20 3:06 sdcycling
@ 2016-06-20 6:11 ` Tobias Burnus
  2016-06-20 7:53 ` Tim Prince
  [not found] ` <8vtq1t00B2qVqVd01vtrfL>
  0 siblings, 2 replies; 7+ messages in thread

From: Tobias Burnus @ 2016-06-20 6:11 UTC (permalink / raw)
To: sdcycling, fortran

Hello,

sdcycling wrote:
> I am using gfortran to build a finite-difference code. How do I tell whether a do loop is being vectorized?

Try the -fopt-info-... options of GCC, in particular -fopt-info-vec-optimized.

> Also, how do I use compiler directives in gfortran to indicate that a do loop does not have any data dependencies?

Not as a directive, but using Fortran's "DO CONCURRENT" instead of a normal DO will imply this.

Cheers,
Tobias
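[Editor's illustration, not part of the original thread: a minimal sketch of both suggestions above. The file name and loop body are hypothetical; the flags are standard gfortran options.]

```fortran
! Minimal sketch: DO CONCURRENT asserts that iterations are independent,
! and -fopt-info-vec-optimized makes gfortran report each loop it vectorized.
! Hypothetical build line:
!   gfortran -O3 -fopt-info-vec-optimized saxpy.f90
program saxpy_demo
  implicit none
  integer, parameter :: n = 1024
  real :: x(n), y(n)
  integer :: i
  x = 1.0
  y = 2.0
  do concurrent (i = 1:n)   ! no data dependencies between iterations
     y(i) = y(i) + 2.0 * x(i)
  end do
  print *, y(1), y(n)
end program saxpy_demo
```

If the loop is vectorized, the -fopt-info-vec-optimized output typically includes a note such as "loop vectorized" with the source line number; -fopt-info-vec-missed reports the loops that were not.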
* Re: Loop Vectorization
  2016-06-20 6:11 ` Tobias Burnus
@ 2016-06-20 7:53 ` Tim Prince
  [not found] ` <8vtq1t00B2qVqVd01vtrfL>
  1 sibling, 0 replies; 7+ messages in thread

From: Tim Prince @ 2016-06-20 7:53 UTC (permalink / raw)
To: fortran

On 6/20/2016 2:11 AM, Tobias Burnus wrote:
> Hello,
>
> sdcycling wrote:
>> I am using gfortran to build a finite-difference code. How do I tell whether a do loop is being vectorized?
> Try the -fopt-info-... options of GCC, in particular -fopt-info-vec-optimized
>
>> Also, how do I use compiler directives in gfortran to indicate that a do loop does not have any data dependencies?
>
> Not as a directive, but using Fortran's "DO CONCURRENT" instead of a normal DO will imply this.

!$omp simd may be useful, if your objective is SIMD vectorization.

--
Tim Prince
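[Editor's illustration, not part of the original thread: a sketch of the directive approach mentioned above. The subroutine and its names are hypothetical; it assumes compiling with -fopenmp-simd, which honors the directive without threading.]

```fortran
! !$omp simd asks specifically for SIMD vectorization of the loop that
! follows; with gfortran, -fopenmp-simd enables the directive without
! linking the OpenMP runtime or creating any threads.
! Hypothetical build line:
!   gfortran -O3 -fopenmp-simd -fopt-info-vec scale.f90
subroutine scale(a, b, n)
  implicit none
  integer, intent(in) :: n
  real, intent(in)    :: a(n)
  real, intent(out)   :: b(n)
  integer :: i
  !$omp simd
  do i = 1, n
     b(i) = 2.0 * a(i)
  end do
end subroutine scale
```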
[parent not found: <8vtq1t00B2qVqVd01vtrfL>]
* Re: Loop Vectorization
  [not found] ` <8vtq1t00B2qVqVd01vtrfL>
@ 2016-06-20 15:04 ` sdcycling
  2016-06-20 15:56 ` Tim Prince
  2016-06-20 19:06 ` Tobias Burnus
  0 siblings, 2 replies; 7+ messages in thread

From: sdcycling @ 2016-06-20 15:04 UTC (permalink / raw)
To: tprince; +Cc: fortran

Hi Tim,

I am currently using MPICH. I could try MPICH in combination with OMP, but the number of processors on my workstation is currently maxed out.

The most computationally intensive loop uses derived data types within loops that are nested 3 deep:

do concurrent (k=mg(mlev)%kb:mg(mlev)%ke:1)
  do concurrent (j=mg(mlev)%jb:mg(mlev)%je:1)
    do concurrent (i=mg(mlev)%ib:mg(mlev)%ie:2)
      residual = ...
      mgamr(mlev,ilev)%subdomain(n)%array(i,j,k) = residual ...
    enddo
  enddo
enddo

It is a fragment of a Gauss-Seidel iteration. mlev is the multigrid level, ilev is the AMR level, and n is the subdomain. The data dependency is eliminated by red-black ordering with a stride of 2 in the innermost loop. The innermost loop is not vectorizing. The specific compiler output is "note: not vectorized: control flow in loop."

Thank you,
Doug.

> On Jun 20, 2016, at 12:53 AM, Tim Prince <n8tm@aol.com> wrote:
>
> On 6/20/2016 2:11 AM, Tobias Burnus wrote:
>> Hello,
>>
>> sdcycling wrote:
>>> I am using gfortran to build a finite-difference code. How do I tell
>>> whether a do loop is being vectorized?
>> Try the -fopt-info-... options of GCC, in particular
>> -fopt-info-vec-optimized
>>
>>> Also, how do I use compiler directives in gfortran to indicate that a
>>> do loop does not have any data dependencies?
>>
>> Not as a directive, but using Fortran's "DO CONCURRENT" instead of a
>> normal DO will imply this.
>>
> !$omp simd may be useful, if your objective is SIMD vectorization.
>
> --
> Tim Prince
* Re: Loop Vectorization
  2016-06-20 15:04 ` sdcycling
@ 2016-06-20 15:56 ` Tim Prince
  2016-06-20 19:06 ` Tobias Burnus
  1 sibling, 0 replies; 7+ messages in thread

From: Tim Prince @ 2016-06-20 15:56 UTC (permalink / raw)
To: fortran

On 6/20/2016 11:04 AM, sdcycling wrote:
> Hi Tim,
>
> I am currently using MPICH. I could try MPICH in combination with OMP, but the number of processors on my workstation is currently maxed out.
> The most computationally intensive loop uses derived data types within loops that are nested 3 deep:
>
> do concurrent (k=mg(mlev)%kb:mg(mlev)%ke:1)
>   do concurrent (j=mg(mlev)%jb:mg(mlev)%je:1)
>     do concurrent (i=mg(mlev)%ib:mg(mlev)%ie:2)
>       residual = ...
>       mgamr(mlev,ilev)%subdomain(n)%array(i,j,k) = residual ...
>     enddo
>   enddo
> enddo
>
> It is a fragment of a Gauss-Seidel iteration. mlev is the multigrid level, ilev is the AMR level, and n is the subdomain. The data dependency is eliminated by red-black ordering with a stride of 2 in the innermost loop. The innermost loop is not vectorizing. The specific compiler output is "note: not vectorized: control flow in loop."

I guess we are supposed to figure out your compiler according to the wording of the (possibly misleading) message. I don't think we can infer your compile options.

Both ifort and gfortran have become more aggressive lately about vectorizing stride-2 inner loops, even in cases where it slows them down. I want to set -march=avx2 or -xHost without incurring such slowdowns, and so have had to explore ways of disabling such vectorization. If you have double precision and don't set an AVX target, you are unlikely to see a benefit from vectorization, and compilers are unlikely to try it without more of a directive than you have given. At a vector length of 4 it's anybody's guess whether vectorization of stride 2 will be useful, and compilers don't analyze deeply enough to get the answer (including whether you have enough arithmetic to outweigh possibly slower memory access).

As you are asking for a scatter store, presumably on a target which doesn't have scatter instructions, it would likely involve vmask stores, which are notoriously slow. I don't have insider information about which future instruction set with scatter stores might change the picture (Skylake server?).

--
Tim Prince
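[Editor's illustration, not part of the original thread: a heavily simplified one-dimensional sketch of the kind of red-black sweep under discussion. All names are hypothetical and this is not the poster's code; it only shows the stride-2 inner-loop shape that forces strided (scatter-like) stores when vectorized on hardware without scatter instructions.]

```fortran
! Simplified 1-D red-black Gauss-Seidel sweep (hypothetical example).
! Updating only one color per pass removes the loop-carried dependency,
! but the resulting stride-2 loop stores to every other element, which
! a vectorizer must implement with masked or scattered stores.
subroutine rb_sweep(u, f, n, color)
  implicit none
  integer, intent(in)    :: n, color   ! color = 0 (red) or 1 (black)
  real,    intent(inout) :: u(n)
  real,    intent(in)    :: f(n)
  integer :: i
  do i = 2 + color, n - 1, 2           ! stride 2: one color per sweep
     u(i) = 0.5 * (u(i-1) + u(i+1) - f(i))
  end do
end subroutine rb_sweep
```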
* Re: Loop Vectorization
  2016-06-20 15:04 ` sdcycling
  2016-06-20 15:56 ` Tim Prince
@ 2016-06-20 19:06 ` Tobias Burnus
  1 sibling, 0 replies; 7+ messages in thread

From: Tobias Burnus @ 2016-06-20 19:06 UTC (permalink / raw)
To: sdcycling, tprince; +Cc: fortran

sdcycling wrote:
> I am currently using MPICH. I could try MPICH in combination with OMP, but the number of processors on my workstation is currently maxed out.

Side note: OpenMP's "simd" directives work on vectorization and do not imply multiple threads. With -fopenmp-simd, the OpenMP run-time library is not even linked.

Regarding the code the compiler produces: using "-fdump-tree-original", one can see what operations the compiler does. (This is a dump of the internal representation; it looks like low-level C, but not all properties are shown.) It's quite lengthy, but it might give a hint about what code the compiler adds.

Tobias
end of thread, other threads: [~2016-06-20 19:06 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <93wB1t00W2qVqVd013wCG6>
2016-06-20 18:46 ` Loop Vectorization sdcycling
2016-06-20  3:06 Loop Vectorization sdcycling
2016-06-20  6:11 ` Tobias Burnus
2016-06-20  7:53 ` Tim Prince
[not found] ` <8vtq1t00B2qVqVd01vtrfL>
2016-06-20 15:04 ` sdcycling
2016-06-20 15:56 ` Tim Prince
2016-06-20 19:06 ` Tobias Burnus