* Missing optimization on ARM NEON
@ 2013-11-11 0:57 Povilas Kanapickas
2013-11-11 16:22 ` Yufeng Zhang
0 siblings, 1 reply; 3+ messages in thread
From: Povilas Kanapickas @ 2013-11-11 0:57 UTC
To: gcc-help
Hello,
[ I don't have a way to test the described testcases against a newer
compiler: could someone verify whether this bug applies to the SVN
version of GCC? ]
GCC-4.8.1 misses several optimizations when using NEON intrinsics.
Consider the following snippet:
#include <arm_neon.h>

uint64_t* foo(uint64_t* x, uint32_t y)
{
    uint64x2_t d = vreinterpretq_u64_u32(vdupq_n_u32(y));
    vst1q_u64(x, d);
    x+=2;
    vst1q_u64(x, d);
    x+=2;
    vst1q_u64(x, d);
    x+=2;
    vst1q_u64(x, d);
    x+=2;
    vst1q_u64(x, d);
    x+=2;
    vst1q_u64(x, d);
    x+=2;
    vst1q_u64(x, d);
    x+=2;
    vst1q_u64(x, d);
    x+=2;
    return x;
}
'g++ test.cc -O3 -mfpu=neon --save-temps -c' produces the following
assembly:
_Z3fooPyj:
    push    {r4, r5, r6, r7}
    vdup.32 q8, r1
    add     r7, r0, #32
    add     r6, r0, #48
    add     r5, r0, #64
    add     r4, r0, #80
    add     r1, r0, #96
    add     r2, r0, #112
    mov     r3, r0
    adds    r0, r0, #128
    vst1.64 {d16-d17}, [r3:64]!
    vst1.64 {d16-d17}, [r3:64]
    vst1.64 {d16-d17}, [r7:64]
    vst1.64 {d16-d17}, [r6:64]
    vst1.64 {d16-d17}, [r5:64]
    vst1.64 {d16-d17}, [r4:64]
    vst1.64 {d16-d17}, [r1:64]
    vst1.64 {d16-d17}, [r2:64]
    pop     {r4, r5, r6, r7}
    bx      lr
The generated code is clearly not optimal. The main problem is that the
pointer post-increment (writeback) feature of the vst1.64 instruction is
not fully utilized. GCC uses it for the first store only; for the later
stores it computes each address into a separate register, which also
forces it to save and restore r4-r7. I would expect GCC to produce the
following output:
_Z3fooPyj:
    vdup.32 q8, r1
    vst1.64 {d16-d17}, [r0:64]!
    vst1.64 {d16-d17}, [r0:64]!
    vst1.64 {d16-d17}, [r0:64]!
    vst1.64 {d16-d17}, [r0:64]!
    vst1.64 {d16-d17}, [r0:64]!
    vst1.64 {d16-d17}, [r0:64]!
    vst1.64 {d16-d17}, [r0:64]!
    vst1.64 {d16-d17}, [r0:64]!
    bx      lr
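
For reference, the store-then-advance pattern that post-increment
addressing is meant to cover can be written as a loop. This is only a
sketch: fill_blocks is a hypothetical helper, not part of the test case,
and the non-NEON branch exists only so the code also builds on other
hosts. Which instructions GCC emits for it has not been checked here.

```cpp
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper (an assumption, not from the original report):
// writes `blocks` 16-byte blocks, each holding y replicated in all four
// 32-bit lanes, and returns the advanced pointer. foo() above corresponds
// to blocks == 8.
uint64_t* fill_blocks(uint64_t* x, uint32_t y, size_t blocks)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint64x2_t d = vreinterpretq_u64_u32(vdupq_n_u32(y));
    for (size_t i = 0; i < blocks; ++i) {
        vst1q_u64(x, d);  // store 16 bytes at x ...
        x += 2;           // ... then advance x by 16 bytes
    }
#else
    // Portable fallback: two copies of y per 64-bit element.
    uint64_t lane = ((uint64_t)y << 32) | y;
    for (size_t i = 0; i < blocks; ++i) {
        x[0] = lane;
        x[1] = lane;
        x += 2;
    }
#endif
    return x;
}
```

Each iteration is one store followed by one pointer bump of exactly the
store size, which is what a single "vst1.64 {...}, [rN:64]!" expresses.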
In unrolled loops GCC spills almost all registers to memory, which makes
the code two to three times slower than the optimal version.
Unfortunately, I couldn't get GCC to generate the optimal code by any
means and had to resort to assembly.
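
A workaround of that kind could look roughly like the sketch below. It
is an illustration only, not the code actually used: foo_postinc is a
hypothetical name, the ARM branch assumes GCC extended asm on 32-bit ARM
with NEON enabled (the %e and %f operand modifiers select the low and
high D halves of a Q register), and the other branch is a plain fallback
so the sketch builds elsewhere.

```cpp
#include <stdint.h>
#if defined(__GNUC__) && defined(__arm__) && \
    (defined(__ARM_NEON) || defined(__ARM_NEON__))
#include <arm_neon.h>
#define USE_NEON_POSTINC_ASM 1
#endif

// Hypothetical variant of foo() (an assumption, not the author's code):
// eight 16-byte stores, each post-incrementing the pointer.
uint64_t* foo_postinc(uint64_t* x, uint32_t y)
{
#ifdef USE_NEON_POSTINC_ASM
    uint64x2_t d = vreinterpretq_u64_u32(vdupq_n_u32(y));
    // "+r"(x): the asm reads and updates x through the '!' writeback.
    // "w"(d):  d lives in a NEON register; %e1-%f1 prints e.g. d16-d17.
    __asm__ volatile(
        "vst1.64 {%e1-%f1}, [%0:64]!\n\t"
        "vst1.64 {%e1-%f1}, [%0:64]!\n\t"
        "vst1.64 {%e1-%f1}, [%0:64]!\n\t"
        "vst1.64 {%e1-%f1}, [%0:64]!\n\t"
        "vst1.64 {%e1-%f1}, [%0:64]!\n\t"
        "vst1.64 {%e1-%f1}, [%0:64]!\n\t"
        "vst1.64 {%e1-%f1}, [%0:64]!\n\t"
        "vst1.64 {%e1-%f1}, [%0:64]!"
        : "+r"(x)
        : "w"(d)
        : "memory");
#else
    // Fallback for targets without the asm path: same memory contents.
    uint64_t lane = ((uint64_t)y << 32) | y;
    for (int i = 0; i < 8; ++i) {
        x[0] = lane;
        x[1] = lane;
        x += 2;
    }
#endif
    return x;
}
```

The "memory" clobber is there because the asm writes through x without
the compiler being able to see which bytes change.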
Could someone verify whether the above bug is present in the SVN version?
Thanks,
Povilas
* Re: Missing optimization on ARM NEON
2013-11-11 0:57 Missing optimization on ARM NEON Povilas Kanapickas
@ 2013-11-11 16:22 ` Yufeng Zhang
2013-11-13 19:50 ` Povilas Kanapickas
0 siblings, 1 reply; 3+ messages in thread
From: Yufeng Zhang @ 2013-11-11 16:22 UTC
To: Povilas Kanapickas; +Cc: gcc-help, Ramana Radhakrishnan
Hi Povilas,
I can confirm that mainline ARM GCC generates code similar to what
you've observed. Could you please file a Bugzilla report for this issue at
http://gcc.gnu.org/bugzilla/
Thanks,
Yufeng
* Re: Missing optimization on ARM NEON
2013-11-11 16:22 ` Yufeng Zhang
@ 2013-11-13 19:50 ` Povilas Kanapickas
0 siblings, 0 replies; 3+ messages in thread
From: Povilas Kanapickas @ 2013-11-13 19:50 UTC
To: Yufeng Zhang; +Cc: gcc-help, Ramana Radhakrishnan
On 11/11/2013 06:22 PM, Yufeng Zhang wrote:
> Hi Povilas,
>
> I can confirm that the mainline arm gcc generates the similar code to
> what you've observed. Can you please raise a bugzilla for this issue at
> http://gcc.gnu.org/bugzilla/
>
> Thanks,
> Yufeng
>
Thanks for confirming. Opened as #59078.
Regards,
Povilas