public inbox for gcc-help@gcc.gnu.org
* suboptimal asm generated for a simple loop
@ 2014-02-05 21:40 Marcin Krotkiewski
  2014-02-05 22:09 ` Marc Glisse
  0 siblings, 1 reply; 2+ messages in thread
From: Marcin Krotkiewski @ 2014-02-05 21:40 UTC (permalink / raw)
  To: gcc-help

Hello,

I am seeing suboptimal performance of the following loop compiled with
gcc (4.7.3 and 4.4.7, Ubuntu):

    for(i=0; i<NSIZE; i++){
      a[i] += b[i];
      c[i] += d[i];
    }

The arrays are dynamically allocated, aligned to a page boundary, and
declared with __restrict__ and __attribute__((aligned(32))). I am
running on an Intel i7-2620M (Sandy Bridge).
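
A reduced, self-contained version of the test looks roughly like this
(a sketch rather than my exact benchmark: the NSIZE value, the
posix_memalign allocation, and __builtin_assume_aligned standing in for
the aligned(32) attribute are just one way to reproduce the setup):

#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

#define NSIZE 1024   /* illustrative size, small enough to fit in L1 */

void add2(double *__restrict__ a, const double *__restrict__ b,
          double *__restrict__ c, const double *__restrict__ d)
{
    /* tell gcc about the 32-byte alignment of the data */
    double *__restrict__ pa = __builtin_assume_aligned(a, 32);
    const double *__restrict__ pb = __builtin_assume_aligned(b, 32);
    double *__restrict__ pc = __builtin_assume_aligned(c, 32);
    const double *__restrict__ pd = __builtin_assume_aligned(d, 32);
    int i;

    for (i = 0; i < NSIZE; i++) {
        pa[i] += pb[i];
        pc[i] += pd[i];
    }
}

int main(void)
{
    double *a, *b, *c, *d;
    int i;

    /* page-aligned allocations: each array starts at offset 0 of a 4 KiB page */
    if (posix_memalign((void **)&a, 4096, NSIZE * sizeof(double)) ||
        posix_memalign((void **)&b, 4096, NSIZE * sizeof(double)) ||
        posix_memalign((void **)&c, 4096, NSIZE * sizeof(double)) ||
        posix_memalign((void **)&d, 4096, NSIZE * sizeof(double)))
        return 1;

    for (i = 0; i < NSIZE; i++)
        a[i] = b[i] = c[i] = d[i] = (double)i;

    add2(a, b, c, d);

    free(a); free(b); free(c); free(d);
    return 0;
}

Compiled with the same flags, this reduced version should produce
essentially the inner loop shown below.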

The problem is IMHO related to '4k aliasing'. It happens in the most
common case of a/b/c/d starting at a page boundary (e.g., the natural
result of malloc). To demonstrate, here is the inner loop of the
assembly generated with 'gcc -mtune=native -mavx -O3':

.L8:
        vmovapd (%rdx,%rdi), %ymm0		#1 load a
        addq    $1, %r8				#2
        vaddpd  (%rcx,%rdi), %ymm0, %ymm0	#3 load b and add
        vmovapd %ymm0, (%rdx,%rdi)		#4 store a
        vmovapd (%rax,%rdi), %ymm0		#5 load c
        vaddpd  (%rsi,%rdi), %ymm0, %ymm0	#6 load d and add
        vmovapd %ymm0, (%rax,%rdi)		#7 store c
        addq    $32, %rdi			#8
        cmpq    %r8, %r12			#9
        ja      .L8				#10

The 4k aliasing problem is caused by lines 4 and 5: the result is stored
to array a, and the very next instruction loads from array c, which sits
at the same page offset. From my tests this seems to be the default
behavior for both the AVX and SSE2 instruction sets, and for both the
vectorized and non-vectorized cases.
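
As I understand it, the hardware decides whether a load may conflict
with an earlier, still-buffered store by comparing only the low 12
address bits, i.e. the offset within a 4 KiB page. Since all four arrays
are page-aligned, a[i], c[i] and d[i] always share that offset, so the
load at line 5 is treated as depending on the store at line 4. A trivial
helper (just an illustration, not part of the benchmark) shows the
condition:

#include <stdint.h>

/* Two accesses can suffer 4k aliasing when their addresses agree in bits
   [11:0], i.e. when they fall at the same offset within a 4 KiB page. */
static int same_page_offset(const void *p, const void *q)
{
    return ((uintptr_t)p & 0xfff) == ((uintptr_t)q & 0xfff);
}

/* For the page-aligned arrays above, same_page_offset(&a[i], &c[i]) and
   same_page_offset(&a[i], &d[i]) are always 1. */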

It is easy to fix the problem by placing the two writes together, at the
end of the iteration, e.g.:

.L8:
        vmovapd (%rdx,%rdi), %ymm1		#1 load a
        addq    $1, %r8				#2
        vaddpd  (%rcx,%rdi), %ymm1, %ymm1	#3 load b and add
        vmovapd (%rax,%rdi), %ymm0		#4 load c
        vaddpd  (%rsi,%rdi), %ymm0, %ymm0	#5 load d and add
        vmovapd %ymm1, (%rdx,%rdi)		#6 store a
        vmovapd %ymm0, (%rax,%rdi)		#7 store c
        addq    $32, %rdi			#8
        cmpq    %r8, %r12			#9
        ja      .L8				#10

In this case the writes happen after all the loads. The above code is
(almost) what ICC generates for this case. For problem sizes small
enough to fit in L1 the speedup is roughly 50%.

I would like to know whether this can be considered a 'performance bug'
worth reporting, or whether (most likely...) there is a reason for this
behavior that I am not aware of. If it is the latter, can you suggest
how I can write optimal C code without resorting to intrinsics or
changing the alignment of the data?

Thank you!

Marcin

* Re: suboptimal asm generated for a simple loop
  2014-02-05 21:40 suboptimal asm generated for a simple loop Marcin Krotkiewski
@ 2014-02-05 22:09 ` Marc Glisse
  0 siblings, 0 replies; 2+ messages in thread
From: Marc Glisse @ 2014-02-05 22:09 UTC (permalink / raw)
  To: Marcin Krotkiewski; +Cc: gcc-help

On Wed, 5 Feb 2014, Marcin Krotkiewski wrote:

> Hello,
>
> I am seeing suboptimal performance of the following loop compiled with
> gcc (4.7.3 and 4.4.7, Ubuntu):
>
>    for(i=0; i<NSIZE; i++){
>      a[i] += b[i];
>      c[i] += d[i];
>    }

Note: it really helps to show a complete example, so we can just 
copy-paste it.

> The arrays are dynamically allocated, aligned to a page boundary, and
> declared with __restrict__ and __attribute__((aligned(32))). I am
> running on an Intel i7-2620M (Sandy Bridge).
>
> The problem is IMHO related to '4k aliasing'. It happens in the most
> common case of a/b/c/d starting at a page boundary (e.g., the natural
> result of malloc). To demonstrate, here is the inner loop of the
> assembly generated with 'gcc -mtune=native -mavx -O3':
>
> .L8:
>        vmovapd (%rdx,%rdi), %ymm0		#1 load a
>        addq    $1, %r8				#2
>        vaddpd  (%rcx,%rdi), %ymm0, %ymm0	#3 load b and add
>        vmovapd %ymm0, (%rdx,%rdi)		#4 store a
>        vmovapd (%rax,%rdi), %ymm0		#5 load c
>        vaddpd  (%rsi,%rdi), %ymm0, %ymm0	#6 load d and add
>        vmovapd %ymm0, (%rax,%rdi)		#7 store c
>        addq    $32, %rdi			#8
>        cmpq    %r8, %r12			#9
>        ja      .L8				#10
>
> The 4k aliasing problem is caused by lines 4 and 5: the result is stored
> to array a, and the very next instruction loads from array c, which sits
> at the same page offset. From my tests this seems to be the default
> behavior for both the AVX and SSE2 instruction sets, and for both the
> vectorized and non-vectorized cases.
>
> It is easy to fix the problem by placing the two writes together, at the
> end of the iteration, e.g.:
>
> .L8:
>        vmovapd (%rdx,%rdi), %ymm1		#1 load a
>        addq    $1, %r8				#2
>        vaddpd  (%rcx,%rdi), %ymm1, %ymm1	#3 load b and add
>        vmovapd (%rax,%rdi), %ymm0		#4 load c
>        vaddpd  (%rsi,%rdi), %ymm0, %ymm0	#5 load d and add
>        vmovapd %ymm1, (%rdx,%rdi)		#6 store a
>        vmovapd %ymm0, (%rax,%rdi)		#7 store c
>        addq    $32, %rdi			#8
>        cmpq    %r8, %r12			#9
>        ja      .L8				#10
>
> In this case the writes happen after all the loads. The above code is
> (almost) what ICC generates for this case. For problem sizes small
> enough to fit in L1 the speedup is roughly 50%.
>
> I would like to know whether this can be considered a 'performance bug'
> worth reporting, or whether (most likely...) there is a reason for this
> behavior that I am not aware of. If it is the latter, can you suggest
> how I can write optimal C code without resorting to intrinsics or
> changing the alignment of the data?

If you can't find a duplicate in bugzilla (please do search first), it 
seems worth reporting, yes. You can help gcc generate the asm you expect 
by writing C code that looks more similar:

     double aa=a[i];
     double bb=b[i];
     double cc=c[i];
     double dd=d[i];
     a[i] = aa+bb;
     c[i] = cc+dd;
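
i.e., for the whole loop, something along these lines (a sketch reusing
the names and NSIZE from your mail; only the relative order of the loads
and stores matters):

     int i;
     for (i = 0; i < NSIZE; i++) {
       /* read all four inputs first ... */
       double aa = a[i];
       double bb = b[i];
       double cc = c[i];
       double dd = d[i];
       /* ... then write, so both stores come after all the loads */
       a[i] = aa + bb;
       c[i] = cc + dd;
     }

Note that gcc is still free to reschedule this, so it is worth checking
the generated asm again after the change.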

-- 
Marc Glisse
