Missing optimization on ARM NEON

public inbox for gcc-help@gcc.gnu.org
 help / color / mirror / Atom feed

* Missing optimization on ARM NEON
@ 2013-11-11  0:57 Povilas Kanapickas
  2013-11-11 16:22 ` Yufeng Zhang
  0 siblings, 1 reply; 3+ messages in thread
From: Povilas Kanapickas @ 2013-11-11  0:57 UTC (permalink / raw)
  To: gcc-help

Hello,

[ I don't have a way to test the described testcases against a newer
compiler: could someone verify whether this bug applies to the SVN
version of GCC? ]

GCC-4.8.1 misses several optimizations when using NEON intrinsics.
Consider the following snippet:

#include <arm_neon.h>

uint64_t* foo(uint64_t* x, uint32_t y)
{
    uint64x2_t d = vreinterpretq_u64_u32(vdupq_n_u32(y));
    vst1q_u64(x, d);
    x+=2;
    vst1q_u64(x, d);
    x+=2;
    vst1q_u64(x, d);
    x+=2;
    vst1q_u64(x, d);
    x+=2;
    vst1q_u64(x, d);
    x+=2;
    vst1q_u64(x, d);
    x+=2;
    vst1q_u64(x, d);
    x+=2;
    vst1q_u64(x, d);
    x+=2;
    return x;
}

'g++ test.cc -O3 -mfpu=neon --save-temps -c' produces the following
assembly:

_Z3fooPyj:
	push	{r4, r5, r6, r7}
	vdup.32	q8, r1
	add	r7, r0, #32
	add	r6, r0, #48
	add	r5, r0, #64
	add	r4, r0, #80
	add	r1, r0, #96
	add	r2, r0, #112
	mov	r3, r0
	adds	r0, r0, #128
	vst1.64	{d16-d17}, [r3:64]!
	vst1.64	{d16-d17}, [r3:64]
	vst1.64	{d16-d17}, [r7:64]
	vst1.64	{d16-d17}, [r6:64]
	vst1.64	{d16-d17}, [r5:64]
	vst1.64	{d16-d17}, [r4:64]
	vst1.64	{d16-d17}, [r1:64]
	vst1.64	{d16-d17}, [r2:64]
	pop	{r4, r5, r6, r7}
	bx	lr

It's obvious that the GCC aproach is not optimal. The main problem is
that pointer autoincrement feature of the vst1.64 instruction is not
fully utilized. GCC apparently figures it out for the first store, but
it becomes confused later. I would expect GCC to produce the following
output:

_Z3fooPyj:
	vdup.32	q8, r1
        vst1.64	{d16-d17}, [r0:64]!
	vst1.64	{d16-d17}, [r0:64]!
	vst1.64	{d16-d17}, [r0:64]!
	vst1.64	{d16-d17}, [r0:64]!
	vst1.64	{d16-d17}, [r0:64]!
	vst1.64	{d16-d17}, [r0:64]!
	vst1.64	{d16-d17}, [r0:64]!
	vst1.64	{d16-d17}, [r0:64]!
	bx	lr

On unrolled loops GCC spills almost all registers to memory, which
causes two to three times worse performance compared to the optimal
version. Unfortunately I couldn't force GCC to generate it by any means
and had to use assembly.

Could someone verify whether the above bug ispresent in the SVN version?

Thanks,
Povilas

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Missing optimization on ARM NEON
  2013-11-11  0:57 Missing optimization on ARM NEON Povilas Kanapickas
@ 2013-11-11 16:22 ` Yufeng Zhang
  2013-11-13 19:50   ` Povilas Kanapickas
  0 siblings, 1 reply; 3+ messages in thread
From: Yufeng Zhang @ 2013-11-11 16:22 UTC (permalink / raw)
  To: Povilas Kanapickas; +Cc: gcc-help, Ramana Radhakrishnan

Hi Povilas,

I can confirm that the mainline arm gcc generates the similar code to 
what you've observed.  Can you please raise a bugzilla for this issue at 
http://gcc.gnu.org/bugzilla/

Thanks,
Yufeng

On 11/11/13 00:57, Povilas Kanapickas wrote:
> Hello,
>
> [ I don't have a way to test the described testcases against a newer
> compiler: could someone verify whether this bug applies to the SVN
> version of GCC? ]
>
> GCC-4.8.1 misses several optimizations when using NEON intrinsics.
> Consider the following snippet:
>
> #include<arm_neon.h>
>
> uint64_t* foo(uint64_t* x, uint32_t y)
> {
>      uint64x2_t d = vreinterpretq_u64_u32(vdupq_n_u32(y));
>      vst1q_u64(x, d);
>      x+=2;
>      vst1q_u64(x, d);
>      x+=2;
>      vst1q_u64(x, d);
>      x+=2;
>      vst1q_u64(x, d);
>      x+=2;
>      vst1q_u64(x, d);
>      x+=2;
>      vst1q_u64(x, d);
>      x+=2;
>      vst1q_u64(x, d);
>      x+=2;
>      vst1q_u64(x, d);
>      x+=2;
>      return x;
> }
>
> 'g++ test.cc -O3 -mfpu=neon --save-temps -c' produces the following
> assembly:
>
> _Z3fooPyj:
> 	push	{r4, r5, r6, r7}
> 	vdup.32	q8, r1
> 	add	r7, r0, #32
> 	add	r6, r0, #48
> 	add	r5, r0, #64
> 	add	r4, r0, #80
> 	add	r1, r0, #96
> 	add	r2, r0, #112
> 	mov	r3, r0
> 	adds	r0, r0, #128
> 	vst1.64	{d16-d17}, [r3:64]!
> 	vst1.64	{d16-d17}, [r3:64]
> 	vst1.64	{d16-d17}, [r7:64]
> 	vst1.64	{d16-d17}, [r6:64]
> 	vst1.64	{d16-d17}, [r5:64]
> 	vst1.64	{d16-d17}, [r4:64]
> 	vst1.64	{d16-d17}, [r1:64]
> 	vst1.64	{d16-d17}, [r2:64]
> 	pop	{r4, r5, r6, r7}
> 	bx	lr
>
> It's obvious that the GCC aproach is not optimal. The main problem is
> that pointer autoincrement feature of the vst1.64 instruction is not
> fully utilized. GCC apparently figures it out for the first store, but
> it becomes confused later. I would expect GCC to produce the following
> output:
>
> _Z3fooPyj:
> 	vdup.32	q8, r1
>          vst1.64	{d16-d17}, [r0:64]!
> 	vst1.64	{d16-d17}, [r0:64]!
> 	vst1.64	{d16-d17}, [r0:64]!
> 	vst1.64	{d16-d17}, [r0:64]!
> 	vst1.64	{d16-d17}, [r0:64]!
> 	vst1.64	{d16-d17}, [r0:64]!
> 	vst1.64	{d16-d17}, [r0:64]!
> 	vst1.64	{d16-d17}, [r0:64]!
> 	bx	lr
>
> On unrolled loops GCC spills almost all registers to memory, which
> causes two to three times worse performance compared to the optimal
> version. Unfortunately I couldn't force GCC to generate it by any means
> and had to use assembly.
>
> Could someone verify whether the above bug ispresent in the SVN version?
>
> Thanks,
> Povilas
>


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Missing optimization on ARM NEON
  2013-11-11 16:22 ` Yufeng Zhang
@ 2013-11-13 19:50   ` Povilas Kanapickas
  0 siblings, 0 replies; 3+ messages in thread
From: Povilas Kanapickas @ 2013-11-13 19:50 UTC (permalink / raw)
  To: Yufeng Zhang; +Cc: gcc-help, Ramana Radhakrishnan

On 11/11/2013 06:22 PM, Yufeng Zhang wrote:
> Hi Povilas,
> 
> I can confirm that the mainline arm gcc generates the similar code to 
> what you've observed.  Can you please raise a bugzilla for this issue at 
> http://gcc.gnu.org/bugzilla/
> 
> Thanks,
> Yufeng
> 

Thanks for confirming. Opened as #59078.

Regards,
Povilas

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2013-11-13 19:50 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-11-11  0:57 Missing optimization on ARM NEON Povilas Kanapickas
2013-11-11 16:22 ` Yufeng Zhang
2013-11-13 19:50   ` Povilas Kanapickas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).