public inbox for gcc-help@gcc.gnu.org
 help / color / mirror / Atom feed
* Array intrinsics vs do loops in gfortran
@ 2012-05-27 19:27 Chris Kerr
  2012-05-27 20:13 ` Tim Prince
  2012-05-28  8:12 ` N.M. Maclaren
  0 siblings, 2 replies; 3+ messages in thread
From: Chris Kerr @ 2012-05-27 19:27 UTC (permalink / raw)
  To: gcc-help, fortran

[-- Attachment #1: Type: text/plain, Size: 1701 bytes --]

Hi,

I'm working on RMCProfile ( http://www.rmcprofile.org ), trying to reduce the 
run time (which is on the order of 3 CPU years for my datasets). I've done 
some profiling and about 75% of the runtime is spent executing an operation of 
the form y(p:q) += a * x(r:s) [where q-p == s-r] - i.e. DAXPY but on slices of 
the original y and x arrays.

I've tried implementing this as a do loop and using Fortran array intrinsics - 
the version using the do loop is significantly faster (the overall running time 
of the program on a reduced problem set taking ~1 hour is 7% faster). Looking 
at the generated assembly, the do loop uses 
incq	%rax
cmpq	%rcx, %rax
jne	.L5
to increment and check the loop counter, whereas the array intrinsic uses
movq	%rdx, %rdi
leaq	1(%rdi), %rdx
cmpq	%rdi, %r8
jge	.L13

The same pattern can be reproduced using much simpler source files - I've 
attached the fortran source and the assembly output. 
NB: as far as I can tell, none of the above registers (rax, rdx, rdi) play any 
part in the actual calculations; they function purely as loop counters.

I have two questions about the above behaviour:
1) Why does the array intrinsic method use an extra instruction compared to 
the do loop? Is there any way of stopping this?
2) Is there any way of getting these loops to use SIMD instructions such as 
vfmaddpd? Currently even when -march=native is switched on, the loop only uses 
the scalar instruction vfmaddsd. I'd rather not have to hand-code an unrolled 
loop, especially as I'm more used to C and Python so there would probably be 
off-by-one errors all over the place on my first ten tries.

Thanks in advance,

Chris Kerr
Department of Chemistry
University of Cambridge

[-- Attachment #2: make.out --]
[-- Type: text/plain, Size: 2719 bytes --]

gfortran -v -save-temps -S -fverbose-asm -O3 -march=native slice_daxpy.f90 -o slice_daxpy_O3_native.s
Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.6.1/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 4.6.1-9ubuntu3' --with-bugurl=file:///usr/share/doc/gcc-4.6/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++,go --prefix=/usr --program-suffix=-4.6 --enable-shared --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.6 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-plugin --enable-objc-gc --disable-werror --with-arch-32=i686 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.6.1 (Ubuntu/Linaro 4.6.1-9ubuntu3) 
COLLECT_GCC_OPTIONS='-v' '-save-temps' '-S' '-fverbose-asm' '-O3' '-march=native' '-o' 'slice_daxpy_O3_native.s'
 /usr/lib/gcc/x86_64-linux-gnu/4.6.1/f951 slice_daxpy.f90 -march=bdver1 -mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt -mabm -mlwp -mno-fma -mfma4 -mxop -mno-bmi -mno-tbm -mavx -msse4.2 -msse4.1 --param l1-cache-size=16 --param l1-cache-line-size=64 --param l2-cache-size=2048 -mtune=bdver1 -quiet -dumpbase slice_daxpy.f90 -auxbase-strip slice_daxpy_O3_native.s -O3 -version -fverbose-asm -o slice_daxpy_O3_native.s -fintrinsic-modules-path /usr/lib/gcc/x86_64-linux-gnu/4.6.1/finclude
GNU Fortran (Ubuntu/Linaro 4.6.1-9ubuntu3) version 4.6.1 (x86_64-linux-gnu)
	compiled by GNU C version 4.6.1, GMP version 5.0.1, MPFR version 3.0.1-p3, MPC version 0.9
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
GNU Fortran (Ubuntu/Linaro 4.6.1-9ubuntu3) version 4.6.1 (x86_64-linux-gnu)
	compiled by GNU C version 4.6.1, GMP version 5.0.1, MPFR version 3.0.1-p3, MPC version 0.9
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
COMPILER_PATH=/usr/lib/gcc/x86_64-linux-gnu/4.6.1/:/usr/lib/gcc/x86_64-linux-gnu/4.6.1/:/usr/lib/gcc/x86_64-linux-gnu/:/usr/lib/gcc/x86_64-linux-gnu/4.6.1/:/usr/lib/gcc/x86_64-linux-gnu/
LIBRARY_PATH=/usr/lib/gcc/x86_64-linux-gnu/4.6.1/:/usr/lib/gcc/x86_64-linux-gnu/4.6.1/../../../x86_64-linux-gnu/:/usr/lib/gcc/x86_64-linux-gnu/4.6.1/../../../../lib/:/lib/x86_64-linux-gnu/:/lib/../lib/:/usr/lib/x86_64-linux-gnu/:/usr/lib/../lib/:/usr/lib/gcc/x86_64-linux-gnu/4.6.1/../../../:/lib/:/usr/lib/
COLLECT_GCC_OPTIONS='-v' '-save-temps' '-S' '-fverbose-asm' '-O3' '-march=native' '-o' 'slice_daxpy_O3_native.s'

[-- Attachment #3: slice_daxpy.f90 --]
[-- Type: text/x-fortran, Size: 600 bytes --]

module slice_daxpy
implicit none
contains

subroutine slice_daxpy_arrayintrinsics (a, x, y, xstart, ystart, yend)
  integer :: xstart, ystart, yend
  double precision :: a, x(:), y(:)

  integer :: xend

  xend = xstart + (yend-ystart)
  y(ystart:yend) = y(ystart:yend) + a * x(xstart:xend)

end subroutine


subroutine slice_daxpy_do (a, x, y, xstart, ystart, yend)
  integer :: xstart, ystart, yend
  double precision :: a, x(:), y(:)

  integer :: xpos, ypos

  xpos = xstart
  do ypos = ystart,yend
    y(ypos) = y(ypos) + a * x(xpos)
    xpos = xpos + 1
  end do

end subroutine
  
end module

[-- Attachment #4: slice_daxpy_O3_native.s --]
[-- Type: text/plain, Size: 6889 bytes --]

	.file	"slice_daxpy.f90"
# GNU Fortran (Ubuntu/Linaro 4.6.1-9ubuntu3) version 4.6.1 (x86_64-linux-gnu)
#	compiled by GNU C version 4.6.1, GMP version 5.0.1, MPFR version 3.0.1-p3, MPC version 0.9
# GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
# options passed:  slice_daxpy.f90 -march=bdver1 -mcx16 -msahf -mno-movbe
# -maes -mpclmul -mpopcnt -mabm -mlwp -mno-fma -mfma4 -mxop -mno-bmi
# -mno-tbm -mavx -msse4.2 -msse4.1 --param l1-cache-size=16
# --param l1-cache-line-size=64 --param l2-cache-size=2048 -mtune=bdver1
# -auxbase-strip slice_daxpy_O3_native.s -O3 -fverbose-asm
# -fintrinsic-modules-path /usr/lib/gcc/x86_64-linux-gnu/4.6.1/finclude
# options enabled:  -fasynchronous-unwind-tables -fauto-inc-dec
# -fbranch-count-reg -fcaller-saves -fcombine-stack-adjustments -fcommon
# -fcompare-elim -fcprop-registers -fcrossjumping -fcse-follow-jumps
# -fdefer-pop -fdelete-null-pointer-checks -fdevirtualize -fdwarf2-cfi-asm
# -fearly-inlining -feliminate-unused-debug-types -fexpensive-optimizations
# -fforward-propagate -ffunction-cse -fgcse -fgcse-after-reload -fgcse-lm
# -fguess-branch-probability -fident -fif-conversion -fif-conversion2
# -findirect-inlining -finline -finline-functions
# -finline-functions-called-once -finline-small-functions -fipa-cp
# -fipa-cp-clone -fipa-profile -fipa-pure-const -fipa-reference -fipa-sra
# -fira-share-save-slots -fira-share-spill-slots -fivopts
# -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-constants
# -fmerge-debug-strings -fmove-loop-invariants -fomit-frame-pointer
# -foptimize-register-move -foptimize-sibling-calls -fpartial-inlining
# -fpeephole -fpeephole2 -fpredictive-commoning -fprefetch-loop-arrays
# -freg-struct-return -fregmove -freorder-blocks -freorder-functions
# -frerun-cse-after-loop -fsched-critical-path-heuristic
# -fsched-dep-count-heuristic -fsched-group-heuristic -fsched-interblock
# -fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec
# -fsched-spec-insn-heuristic -fsched-stalled-insns-dep -fschedule-insns2
# -fshow-column -fsigned-zeros -fsplit-ivs-in-unroller -fsplit-wide-types
# -fstrict-aliasing -fstrict-overflow -fstrict-volatile-bitfields
# -fthread-jumps -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp
# -ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-copy-prop
# -ftree-copyrename -ftree-cselim -ftree-dce -ftree-dominator-opts
# -ftree-dse -ftree-forwprop -ftree-fre -ftree-loop-distribute-patterns
# -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon
# -ftree-loop-optimize -ftree-parallelize-loops= -ftree-phiprop -ftree-pre
# -ftree-pta -ftree-reassoc -ftree-scev-cprop -ftree-sink
# -ftree-slp-vectorize -ftree-sra -ftree-switch-conversion -ftree-ter
# -ftree-vect-loop-version -ftree-vectorize -ftree-vrp -funit-at-a-time
# -funswitch-loops -funwind-tables -fvect-cost-model -fverbose-asm -fzee
# -fzero-initialized-in-bss -m128bit-long-double -m64 -m80387 -mabm
# -maccumulate-outgoing-args -maes -malign-stringops -mavx
# -mavx256-split-unaligned-store -mcx16 -mfancy-math-387 -mfma4
# -mfp-ret-in-387 -mglibc -mieee-fp -mlwp -mmmx -mpclmul -mpopcnt
# -mprefer-avx128 -mpush-args -mred-zone -msahf -msse -msse2 -msse3 -msse4
# -msse4.1 -msse4.2 -msse4a -mssse3 -mtls-direct-seg-refs -mvzeroupper
# -mxop

	.text
	.p2align 5,,31
	.globl	__slice_daxpy_MOD_slice_daxpy_arrayintrinsics
	.type	__slice_daxpy_MOD_slice_daxpy_arrayintrinsics, @function
__slice_daxpy_MOD_slice_daxpy_arrayintrinsics:
.LFB0:
	.cfi_startproc
	pushq	%rbx	#
	.cfi_def_cfa_offset 16
	.cfi_offset 3, -16
	movq	24(%rdx), %r10	# y_5(D)->dim[0].stride, stride.5
	movl	$1, %r11d	#, tmp155
	movq	24(%rsi), %rax	# x_22(D)->dim[0].stride, stride.1
	movq	(%rdx), %rbx	# y_5(D)->data, y.0
	movq	(%rsi), %rdx	# x_22(D)->data, x.0
	movl	(%r8), %esi	# *ystart_41(D), ypos
	movl	(%rcx), %ecx	# *xstart_39(D), xpos
	testq	%r10, %r10	# stride.5
	cmove	%r11, %r10	# stride.5,, tmp155, stride.5
	testq	%rax, %rax	# stride.1
	cmove	%r11, %rax	# stride.1,, tmp155, stride.1
	movl	(%r9), %r11d	# *yend_43(D), D.1584
	cmpl	%r11d, %esi	# D.1584, ypos
	jg	.L1	#,
	vmovsd	(%rdi), %xmm1	# *a_53(D), pretmp.53
	movslq	%ecx, %rcx	# xpos, xpos
	movslq	%esi, %rdi	# ypos, ypos
	decq	%rdi	# tmp146
	decq	%rcx	# tmp150
	leaq	0(,%rax,8), %r8	#, D.1753
	imulq	%rax, %rcx	# stride.1, tmp150
	subl	%esi, %r11d	# ypos, tmp153
	leaq	0(,%r10,8), %r9	#, D.1747
	imulq	%r10, %rdi	# stride.5, tmp146
	xorl	%eax, %eax	# ivtmp.61
	leaq	(%rdx,%rcx,8), %rdx	#, ivtmp.68
	leaq	1(%r11), %rcx	#, D.1765
	leaq	(%rbx,%rdi,8), %rdi	#, ivtmp.66
	.p2align 5,,24
	.p2align 3
.L5:
	vmovsd	(%rdx), %xmm2	# MEM[base: D.1759_116, offset: 0B],
	incq	%rax	# ivtmp.61
	addq	%r8, %rdx	# D.1753, ivtmp.68
	vfmaddsd	(%rdi), %xmm1, %xmm2, %xmm0	# MEM[base: D.1758_115, offset: 0B], pretmp.53,, tmp154
	vmovsd	%xmm0, (%rdi)	# tmp154, MEM[base: D.1758_115, offset: 0B]
	addq	%r9, %rdi	# D.1747, ivtmp.66
	cmpq	%rcx, %rax	# D.1765, ivtmp.61
	jne	.L5	#,
.L1:
	popq	%rbx	#
	.cfi_def_cfa_offset 8
	ret
	.cfi_endproc
.LFE0:
	.size	__slice_daxpy_MOD_slice_daxpy_arrayintrinsics, .-__slice_daxpy_MOD_slice_daxpy_arrayintrinsics
	.p2align 5,,31
	.globl	__slice_daxpy_MOD_slice_daxpy_do
	.type	__slice_daxpy_MOD_slice_daxpy_do, @function
__slice_daxpy_MOD_slice_daxpy_do:
.LFB1:
	.cfi_startproc
	pushq	%rbx	#
	.cfi_def_cfa_offset 16
	.cfi_offset 3, -16
	movq	24(%rdx), %r10	# y_4(D)->dim[0].stride, stride.13
	movl	$1, %r11d	#, tmp134
	movq	24(%rsi), %rax	# x_21(D)->dim[0].stride, stride.9
	movq	(%rdx), %rbx	# y_4(D)->data, y.0
	movq	(%rsi), %rdx	# x_21(D)->data, x.0
	movslq	(%r8), %rsi	# *ystart_42(D), S.16
	movslq	(%r9), %r8	# *yend_40(D), D.1611
	vmovsd	(%rdi), %xmm1	# *a_60(D), D.1618
	testq	%r10, %r10	# stride.13
	cmove	%r11, %r10	# stride.13,, tmp134, stride.13
	testq	%rax, %rax	# stride.9
	cmove	%r11, %rax	# stride.9,, tmp134, stride.9
	movslq	(%rcx), %r11	# *xstart_38(D), D.1803
	cmpq	%r8, %rsi	# D.1611, S.16
	jg	.L8	#,
	leaq	1(%rsi), %rdi	#, D.1810
	decq	%rsi	# tmp128
	leaq	0(,%r10,8), %r9	#, D.1795
	imulq	%r10, %rsi	# stride.13, tmp128
	leaq	-1(%r11), %r10	#, tmp131
	imulq	%rax, %r10	# stride.9, tmp131
	leaq	(%rbx,%rsi,8), %rcx	#, ivtmp.93
	leaq	0(,%rax,8), %rsi	#, D.1801
	leaq	(%rdx,%r10,8), %rax	#, ivtmp.95
	jmp	.L12	#
	.p2align 5,,7
	.p2align 3
.L13:
	movq	%rdx, %rdi	# D.1810, D.1810
.L12:
	vmovsd	(%rax), %xmm2	# MEM[base: D.1808_127, offset: 0B],
	addq	%rsi, %rax	# D.1801, ivtmp.95
	leaq	1(%rdi), %rdx	#, D.1810
	vfmaddsd	(%rcx), %xmm1, %xmm2, %xmm0	# MEM[base: D.1807_126, offset: 0B], D.1618,, tmp133
	vmovsd	%xmm0, (%rcx)	# tmp133, MEM[base: D.1807_126, offset: 0B]
	addq	%r9, %rcx	# D.1795, ivtmp.93
	cmpq	%rdi, %r8	# D.1810, D.1611
	jge	.L13	#,
.L8:
	popq	%rbx	#
	.cfi_def_cfa_offset 8
	ret
	.cfi_endproc
.LFE1:
	.size	__slice_daxpy_MOD_slice_daxpy_do, .-__slice_daxpy_MOD_slice_daxpy_do
	.ident	"GCC: (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1"
	.section	.note.GNU-stack,"",@progbits

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Array intrinsics vs do loops in gfortran
  2012-05-27 19:27 Array intrinsics vs do loops in gfortran Chris Kerr
@ 2012-05-27 20:13 ` Tim Prince
  2012-05-28  8:12 ` N.M. Maclaren
  1 sibling, 0 replies; 3+ messages in thread
From: Tim Prince @ 2012-05-27 20:13 UTC (permalink / raw)
  To: gcc-help

On 05/27/2012 03:27 PM, Chris Kerr wrote:
> Hi,
>
> I'm working on RMCProfile ( http://www.rmcprofile.org ), trying to reduce the
> run time (which is on the order of 3 CPU years for my datasets). I've done
> some profiling and about 75% of the runtime is spent executing an operation of
> the form y(p:q) += a * x(r:s) [where q-p == s-r] - i.e. DAXPY but on slices of
> the original y and x arrays.
>
> I've tried implementing this as a do loop and using Fortran array intrinsics -
> the version using the do loop is significantly faster (the overall running time
> of the program on a reduced problem set taking ~1 hour is 7% faster). Looking
> at the generated assembly, the do loop uses
> incq	%rax
> cmpq	%rcx, %rax
> jne	.L5
> to increment and check the loop counter, whereas the array intrinsic uses
> movq	%rdx, %rdi
> leaq	1(%rdi), %rdx
> cmpq	%rdi, %r8
> jge	.L13
>
> The same pattern can be reproduced using much simpler source files - I've
> attached the fortran source and the assembly output.
> NB as far as I can tell, none of the above rax, rdx or rdi play any part in
> their respective calculations, they function purely as counters.
>
> I have two questions about the above behaviour:
> 1) Why does the array intrinsic method use an extra instruction compared to
> the do loop? Is there any way of stopping this?
> 2) Is there any way of getting these loops to use SIMD instructions such as
> vfmaddpd? Currently even when -march=native is switched on, the loop only uses
> the scalar instruction vfmaddsd. I'd rather not have to hand-code an unrolled
> loop, especially as I'm more used to C and Python so there would probably be
> off-by-one errors all over the place on my first ten tries.
>
> Thanks in advance,
>
> Chris Kerr
> Department of Chemistry
> University of Cambridge
Compilers appear to be concerned about possible data overlaps between the 
array arguments of these module subroutines.  ifort generates the same 
vectorized code body (unrolled by 4, plus vectorization) in each case, but 
with a scalar copy of the arrays, which may sharply cut the advantage of 
vectorization.

gfortran -ftree-vectorizer-verbose complains about storage layout:

Analyzing loop at ck.f90:12

12: not vectorized: data dependence conflict prevents gather/strided load
D.2022_26 = *y.0_6[D.2019_25];

One would expect ready vectorization of these examples in the usual 
context (not using modules).  Without alignment specifications, avx-128 
code generation is likely.
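
Concretely, the usual workaround is to hand the compiler the stride and 
overlap guarantees it is missing.  A minimal sketch (not code from this 
thread; the subroutine name is illustrative, and whether it enables packed 
vfmaddpd depends on the compiler version and flags):

```fortran
! Sketch only: explicit-shape dummies tell the compiler the arrays are
! unit-stride, which removes the strided-load complaint quoted above.
subroutine slice_daxpy_explicit (n, a, x, y)
  implicit none
  integer, intent(in) :: n
  double precision, intent(in) :: a
  double precision, intent(in) :: x(n)
  double precision, intent(inout) :: y(n)
  integer :: i

  do i = 1, n
    y(i) = y(i) + a * x(i)
  end do
end subroutine
```

The caller can then pass slices by sequence association, e.g.
call slice_daxpy_explicit(yend-ystart+1, a, x(xstart), y(ystart)).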

-- 
Tim Prince

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Array intrinsics vs do loops in gfortran
  2012-05-27 19:27 Array intrinsics vs do loops in gfortran Chris Kerr
  2012-05-27 20:13 ` Tim Prince
@ 2012-05-28  8:12 ` N.M. Maclaren
  1 sibling, 0 replies; 3+ messages in thread
From: N.M. Maclaren @ 2012-05-28  8:12 UTC (permalink / raw)
  To: Chris Kerr; +Cc: gcc-help, fortran

On May 27 2012, Chris Kerr wrote:
>
> I'm working on RMCProfile ( http://www.rmcprofile.org ), trying to reduce 
> the run time (which is on the order of 3 CPU years for my datasets). I've 
> done some profiling and about 75% of the runtime is spent executing an 
> operation of the form y(p:q) += a * x(r:s) [where q-p == s-r] - i.e. 
> DAXPY but on slices of the original y and x arrays.
>
> I have two questions about the above behaviour: 1) Why does the array 
> intrinsic method use an extra instruction compared to the do loop? Is 
> there any way of stopping this?

scientific-computing@ucs would have got to me more directly :-)
[ Aside: that is for the OP, as we work for the same university - it is
not for external use. ]

While many Fortran compilers used to generate very poor code for array
operations and intrinsics, it is typically better now, and I doubt that
the code difference you have seen is relevant.  In particular, the days
when the odd instruction made much difference are long gone.

> 2) Is there any way of getting these 
> loops to use SIMD instructions such as vfmaddpd? Currently even when 
> -march=native is switched on, the loop only uses the scalar instruction 
> vfmaddsd. I'd rather not have to hand-code an unrolled loop, especially 
> as I'm more used to C and Python so there would probably be off-by-one 
> errors all over the place on my first ten tries.

Now, THAT is the issue, and you are tackling it the wrong way.  There
is no difficulty using DAXPY on slices, and that is what to do if the
slices are reasonably long.  You should use a tuned BLAS such as ATLAS,
Intel MKL or AMD ACML, not the basic reference one; all of them can be
called from gfortran code.  You could even do the code conversion
automatically using a Python script, if such constructions are spread
all over the code.
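
The conversion might be sketched as follows (an illustration only, not
code from this thread; it assumes one of the tuned BLAS libraries above
is linked in, and uses the standard reference-BLAS interface
daxpy(n, da, dx, incx, dy, incy)):

```fortran
! Sketch only: delegate the slice update to a tuned BLAS daxpy,
! which computes dy = dy + da*dx over n elements.
subroutine slice_daxpy_blas (a, x, y, xstart, ystart, yend)
  implicit none
  double precision, intent(in) :: a
  double precision, intent(in) :: x(:)
  double precision, intent(inout) :: y(:)
  integer, intent(in) :: xstart, ystart, yend
  external daxpy

  call daxpy(yend - ystart + 1, a, x(xstart:), 1, y(ystart:), 1)
end subroutine
```

Note that passing sections of assumed-shape arrays to an external
routine may create contiguous temporaries if the underlying arrays are
strided; for contiguous arrays the call passes addresses directly.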

Now, DAXPY is only a level-1 BLAS routine, and it is quite likely that
this change alone will give only a moderate gain.  If you contact me at
that address with an offending procedure, I may be able to suggest a
better method.


Regards,
Nick Maclaren.




^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2012-05-28  8:12 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-27 19:27 Array intrinsics vs do loops in gfortran Chris Kerr
2012-05-27 20:13 ` Tim Prince
2012-05-28  8:12 ` N.M. Maclaren

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).