public inbox for gcc-bugs@sourceware.org
* [Bug target/43725] New: Poor instructions selection, scheduling and registers allocation for ARM NEON intrinsics
@ 2010-04-12 7:27 siarhei dot siamashka at gmail dot com
2010-05-11 7:35 ` [Bug target/43725] " ramana at gcc dot gnu dot org
0 siblings, 1 reply; 13+ messages in thread
From: siarhei dot siamashka at gmail dot com @ 2010-04-12 7:27 UTC (permalink / raw)
To: gcc-bugs
gcc version 4.5.0-rc20100406
/**************/
#include <arm_neon.h>
void x(int32x4_t a, int32x4_t b, int32x4_t *p)
{
#define X(n) p[n] = vaddq_s32(p[n], a); p[n] = vorrq_s32(p[n], b);
X(0); X(1); X(2); X(3); X(4); X(5); X(6); X(7);
X(8); X(9); X(10); X(11); X(12);
}
/**************/
# gcc -O2 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=hard -c test.c
# objdump -d test.o
00000000 <x>:
0: edd0eb2c vldr d30, [r0, #176] ; 0xb0
4: edd0fb2e vldr d31, [r0, #184] ; 0xb8
8: ecd02b04 vldmia r0, {d18-d19}
c: ed2d8b10 vpush {d8-d15}
10: ed904b30 vldr d4, [r0, #192] ; 0xc0
14: ed905b32 vldr d5, [r0, #200] ; 0xc8
18: e24dd020 sub sp, sp, #32
1c: f22ec8c0 vadd.i32 q6, q15, q0
20: f26228c0 vadd.i32 q9, q9, q0
24: edd04b08 vldr d20, [r0, #32]
28: edd05b0a vldr d21, [r0, #40] ; 0x28
2c: edd0cb18 vldr d28, [r0, #96] ; 0x60
30: edd0db1a vldr d29, [r0, #104] ; 0x68
34: f264e840 vadd.i32 q15, q2, q0
38: f26448c0 vadd.i32 q10, q10, q0
3c: f26cc8c0 vadd.i32 q14, q14, q0
40: edd00b04 vldr d16, [r0, #16]
44: edd01b06 vldr d17, [r0, #24]
48: edd0ab14 vldr d26, [r0, #80] ; 0x50
4c: edd0bb16 vldr d27, [r0, #88] ; 0x58
50: ec8dcb04 vstmia sp, {d12-d13}
54: f222c1d2 vorr q6, q9, q1
58: f26008c0 vadd.i32 q8, q8, q0
5c: f26aa8c0 vadd.i32 q13, q13, q0
60: edd06b0c vldr d22, [r0, #48] ; 0x30
64: edd07b0e vldr d23, [r0, #56] ; 0x38
68: edd08b10 vldr d24, [r0, #64] ; 0x40
6c: edd09b12 vldr d25, [r0, #72] ; 0x48
70: ed906b1c vldr d6, [r0, #112] ; 0x70
74: ed907b1e vldr d7, [r0, #120] ; 0x78
78: ed908b20 vldr d8, [r0, #128] ; 0x80
7c: ed909b22 vldr d9, [r0, #136] ; 0x88
80: ed90ab24 vldr d10, [r0, #144] ; 0x90
84: ed90bb26 vldr d11, [r0, #152] ; 0x98
88: ed90eb28 vldr d14, [r0, #160] ; 0xa0
8c: ed90fb2a vldr d15, [r0, #168] ; 0xa8
90: edcdeb04 vstr d30, [sp, #16]
94: edcdfb06 vstr d31, [sp, #24]
98: ec80cb04 vstmia r0, {d12-d13}
9c: f224c1d2 vorr q6, q10, q1
a0: f26c41d2 vorr q10, q14, q1
a4: ecddcb04 vldmia sp, {d28-d29}
a8: f26021d2 vorr q9, q8, q1
ac: f26888c0 vadd.i32 q12, q12, q0
b0: f26ae1d2 vorr q15, q13, q1
b4: f26668c0 vadd.i32 q11, q11, q0
b8: f26ca1d2 vorr q13, q14, q1
bc: f2266840 vadd.i32 q3, q3, q0
c0: f2288840 vadd.i32 q4, q4, q0
c4: f22aa840 vadd.i32 q5, q5, q0
c8: f22ee840 vadd.i32 q7, q7, q0
cc: edddcb04 vldr d28, [sp, #16]
d0: eddddb06 vldr d29, [sp, #24]
d4: f22601d2 vorr q0, q11, q1
d8: f22841d2 vorr q2, q12, q1
dc: f2680152 vorr q8, q4, q1
e0: f26a6152 vorr q11, q5, q1
e4: f26e8152 vorr q12, q7, q1
e8: edc02b04 vstr d18, [r0, #16]
ec: edc03b06 vstr d19, [r0, #24]
f0: f2662152 vorr q9, q3, q1
f4: f22c21d2 vorr q1, q14, q1
f8: ed80cb08 vstr d12, [r0, #32]
fc: ed80db0a vstr d13, [r0, #40] ; 0x28
100: ed800b0c vstr d0, [r0, #48] ; 0x30
104: ed801b0e vstr d1, [r0, #56] ; 0x38
108: ed804b10 vstr d4, [r0, #64] ; 0x40
10c: ed805b12 vstr d5, [r0, #72] ; 0x48
110: edc0eb14 vstr d30, [r0, #80] ; 0x50
114: edc0fb16 vstr d31, [r0, #88] ; 0x58
118: edc04b18 vstr d20, [r0, #96] ; 0x60
11c: edc05b1a vstr d21, [r0, #104] ; 0x68
120: edc02b1c vstr d18, [r0, #112] ; 0x70
124: edc03b1e vstr d19, [r0, #120] ; 0x78
128: edc00b20 vstr d16, [r0, #128] ; 0x80
12c: edc01b22 vstr d17, [r0, #136] ; 0x88
130: edc06b24 vstr d22, [r0, #144] ; 0x90
134: edc07b26 vstr d23, [r0, #152] ; 0x98
138: edc08b28 vstr d24, [r0, #160] ; 0xa0
13c: edc09b2a vstr d25, [r0, #168] ; 0xa8
140: edc0ab2c vstr d26, [r0, #176] ; 0xb0
144: edc0bb2e vstr d27, [r0, #184] ; 0xb8
148: ed802b30 vstr d2, [r0, #192] ; 0xc0
14c: ed803b32 vstr d3, [r0, #200] ; 0xc8
150: e28dd020 add sp, sp, #32
154: ecbd8b10 vpop {d8-d15}
158: e12fff1e bx lr
This shows multiple performance problems:
1. The use of inherently slower VLDR/VSTR instructions instead of VLD1/VST1
2. Failure to make proper use of ARM Cortex-A8 NEON LS/ALU dual issue
3. Unnecessary spills to stack
This is a general issue with NEON intrinsics, causing serious performance
problems for practically any nontrivial code. I guess this itself can be a
meta-bug, with each individual performance issue tracked separately.
--
Summary: Poor instructions selection, scheduling and registers
allocation for ARM NEON intrinsics
Product: gcc
Version: 4.5.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: siarhei dot siamashka at gmail dot com
GCC target triplet: armv7l-unknown-linux-gnueabi
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725
From: rearnsha at gcc dot gnu.org @ 2010-09-29 20:50 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725
Richard Earnshaw <rearnsha at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Last reconfirmed|2010-05-11 07:35:23 |2010-09-29 7:35:23
date| |
CC| |rearnsha at gcc dot gnu.org
--- Comment #1 from Richard Earnshaw <rearnsha at gcc dot gnu.org> 2010-09-29 16:28:17 UTC ---
So the compiler is correct not to be using vld1 for this code. The memory
format of int32x4_t is defined to be the format of a neon register that has
been filled from an array of int32 values and then stored to memory using VSTM
(or equivalent sequence). The implication of all this is that int32x4_t does
not (necessarily) have the same memory layout as int32_t[4].
arm_neon.h provides intrinsics for filling neon registers from arrays in
memory, and in this case I think you should be using these directly. That is,
your macro should be modified to contain:
#define X(n) {int32x4_t v; v = vld1q_s32((const int32_t*)&p[n]); \
v = vaddq_s32(v, a); v = vorrq_s32(v, b); vst1q_s32((int32_t*)&p[n], v);}
There are still problems after doing this, however. In particular the compiler
is not correctly tracking alias information for the load/store intrinsics,
which means it is unable to move stores past loads to reduce stalls in the
pipeline.
The stack wastage appears to be fixed in trunk gcc; at least I don't see any
stack allocation for your testcase.
I haven't looked into the scheduling issues at this time.
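For reference, the suggestion above can be sketched as a complete function. This is only an illustration under stated assumptions: the name add_or_blocks and its array-based signature are hypothetical (the testcase in this report passes int32x4_t values directly), and the scalar fallback is there only so the sketch also builds on targets without NEON.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: for every group of four int32 values in data,
   compute data[i] = (data[i] + a[lane]) | b[lane]. */
static void add_or_blocks(int32_t *data, size_t nvec,
                          const int32_t a[4], const int32_t b[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const int32x4_t va = vld1q_s32(a);
    const int32x4_t vb = vld1q_s32(b);
    for (size_t n = 0; n < nvec; n++) {
        int32x4_t v = vld1q_s32(data + 4 * n);  /* explicit load from an int32_t array */
        v = vaddq_s32(v, va);
        v = vorrq_s32(v, vb);
        vst1q_s32(data + 4 * n, v);             /* explicit store back to the array */
    }
#else
    /* Scalar fallback with the same per-lane arithmetic (wrapping add). */
    for (size_t n = 0; n < nvec; n++) {
        for (int lane = 0; lane < 4; lane++) {
            uint32_t sum = (uint32_t)data[4 * n + lane] + (uint32_t)a[lane];
            data[4 * n + lane] = (int32_t)(sum | (uint32_t)b[lane]);
        }
    }
#endif
}
```

Because the data is addressed as int32_t[4] throughout, the sketch does not depend on the in-memory layout of int32x4_t.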
From: siarhei.siamashka at gmail dot com @ 2010-10-04 23:00 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725
--- Comment #2 from Siarhei Siamashka <siarhei.siamashka at gmail dot com> 2010-10-04 22:59:56 UTC ---
(In reply to comment #1)
> So the compiler is correct not to be using vld1 for this code. The memory
> format of int32x4_t is defined to be the format of a neon register that has
> been filled from an array of int32 values and then stored to memory using VSTM
> (or equivalent sequence). The implication of all this is that int32x4_t does
> not (necessarily) have the same memory layout as int32_t[4].
Could you elaborate on this? Specifically about the case when memory format for
VSTM and VST1 may differ.
I thought that VST1 instruction could be always used as a replacement for VSTM,
it is just a little bit less convenient in some cases because it is lacking
some more advanced addressing modes. Moreover, VSTM is VFP instruction and VST1
is NEON one. So I guess mixing VSTM with true NEON instructions may be
additionally a bad idea (for performance reasons on Cortex-A9 or other
processors?).
There also used to be FLDMX/FSTMX instructions, but they are deprecated now. I
believe they existed specifically to reserve the use of normal VFP load/store
instructions for floating point data formats only, but later this turned out to
be unnecessary.
> arm_neon.h provides intrinsics for filling neon registers from arrays in
> memory, and in this case I think you should be using these directly. That is,
> your macro should be modified to contain:
>
> #define X(n) {int32x4_t v; v = vld1q_s32((const int32_t*)&p[n]); \
> v = vaddq_s32(v, a); v = vorrq_s32(v, b); vst1q_s32((int32_t*)&p[n], v);}
I'm sorry, but this looks like a completely unjustified limitation to me. Why
should intrinsics be so much more difficult and less intuitive to use than plain
inline assembly? Additionally, gcc allows normal arithmetic operators to be used
on vector data types, something like:
void x(int32x4_t a, int32x4_t b, int32x4_t *p)
{
#define X(n) p[n] += a; p[n] |= b;
X(0); X(1); X(2); X(3); X(4); X(5); X(6); X(7);
X(8); X(9); X(10); X(11); X(12);
}
> There are still problems after doing this, however. In particular the compiler
> is not correctly tracking alias information for the load/store intrinsics,
> which means it is unable to move stores past loads to reduce stalls in the
> pipeline.
OK, thanks for the explanation.
> The stack wastage appears to be fixed in trunk gcc; at least I don't see any
> stack allocation for your testcase.
Yes, looks like it got a little bit better. Anyway stack allocation shows up
again after adding just a few more of these X() macros:
... X(13); X(14); X(15); X(16); ...
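The operator-based variant earlier in this comment can also be tried on a non-ARM host through GCC's generic vector extension. The sketch below is an assumption-laden illustration, not part of the original testcase: the type name v4si and the function add_or_vectors are hypothetical, and a loop replaces the X() macro.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang generic vector type: four 32-bit integers in 16 bytes.
   The name v4si is illustrative. */
typedef int32_t v4si __attribute__((vector_size(16)));

/* Same arithmetic as the X() macro, written with ordinary operators. */
static void add_or_vectors(v4si a, v4si b, v4si *p, size_t count)
{
    for (size_t n = 0; n < count; n++) {
        p[n] += a;   /* lane-wise addition */
        p[n] |= b;   /* lane-wise bitwise OR */
    }
}
```

Here the compiler chooses the loads and stores itself, which is exactly the part of code generation this report is about.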
From: joseph at codesourcery dot com @ 2010-10-04 23:46 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725
--- Comment #3 from joseph at codesourcery dot com <joseph at codesourcery dot com> 2010-10-04 23:45:57 UTC ---
On Mon, 4 Oct 2010, siarhei.siamashka at gmail dot com wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725
>
> --- Comment #2 from Siarhei Siamashka <siarhei.siamashka at gmail dot com> 2010-10-04 22:59:56 UTC ---
> (In reply to comment #1)
> > So the compiler is correct not to be using vld1 for this code. The memory
> > format of int32x4_t is defined to be the format of a neon register that has
> > been filled from an array of int32 values and then stored to memory using VSTM
> > (or equivalent sequence). The implication of all this is that int32x4_t does
> > not (necessarily) have the same memory layout as int32_t[4].
>
> Could you elaborate on this? Specifically about the case when memory format for
> VSTM and VST1 may differ.
Big-endian.
I previously explained the issues with big-endian NEON vectors in GCC at
length:
http://gcc.gnu.org/ml/gcc-patches/2010-06/msg00409.html
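A simplified way to picture the big-endian difference is a byte-level model of the two kinds of load. The sketch below is an assumption, not a description of GCC internals or an exact hardware specification: it models an element-wise load (in the style of VLD1.32) as four 32-bit reads, and a doubleword load (in the style of VLDM/VLDR) as two 64-bit reads whose lanes are numbered from the least significant bits. All function names are hypothetical.

```c
#include <stdint.h>

/* Write a 32-bit value to memory in the chosen byte order. */
static void store_u32(uint8_t *mem, uint32_t value, int big_endian)
{
    for (int i = 0; i < 4; i++) {
        int shift = big_endian ? 8 * (3 - i) : 8 * i;
        mem[i] = (uint8_t)(value >> shift);
    }
}

/* Read `size` bytes (at most 8) as one unsigned integer in the chosen byte order. */
static uint64_t read_uint(const uint8_t *mem, int size, int big_endian)
{
    uint64_t value = 0;
    for (int i = 0; i < size; i++) {
        int shift = big_endian ? 8 * (size - 1 - i) : 8 * i;
        value |= (uint64_t)mem[i] << shift;
    }
    return value;
}

/* Element-wise load: array element i lands in lane i. */
static void load_by_element(const uint8_t *mem, int big_endian, uint32_t lanes[4])
{
    for (int i = 0; i < 4; i++)
        lanes[i] = (uint32_t)read_uint(mem + 4 * i, 4, big_endian);
}

/* Doubleword load: each 64-bit half is read as a single integer, and
   lanes are taken from its low and high 32 bits. */
static void load_by_doubleword(const uint8_t *mem, int big_endian, uint32_t lanes[4])
{
    for (int d = 0; d < 2; d++) {
        uint64_t reg = read_uint(mem + 8 * d, 8, big_endian);
        lanes[2 * d] = (uint32_t)reg;
        lanes[2 * d + 1] = (uint32_t)(reg >> 32);
    }
}
```

In this model the two loads agree for little-endian data, while for big-endian data the doubleword load swaps the two 32-bit lanes inside each 64-bit half. The message linked above remains the reference for the actual GCC conventions.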
From: ramana at gcc dot gnu.org @ 2010-10-05 7:16 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725
--- Comment #4 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> 2010-10-05 07:16:35 UTC ---
(In reply to comment #2)
> (In reply to comment #1)
> > So the compiler is correct not to be using vld1 for this code. The memory
> > format of int32x4_t is defined to be the format of a neon register that has
> > been filled from an array of int32 values and then stored to memory using VSTM
> > (or equivalent sequence). The implication of all this is that int32x4_t does
> > not (necessarily) have the same memory layout as int32_t[4].
>
> Could you elaborate on this? Specifically about the case when memory format for
> VSTM and VST1 may differ.
>
> I thought that VST1 instruction could be always used as a replacement for VSTM,
> it is just a little bit less convenient in some cases because it is lacking
> some more advanced addressing modes. Moreover, VSTM is VFP instruction and VST1
> is NEON one. So I guess mixing VSTM with true NEON instructions may be
> additionally a bad idea (for performance reasons on Cortex-A9 or other
> processors?).
The ARM ARM states that VLDM / VSTM and VLDR / VSTR for 64-bit values are
compliant with VFPv2 / VFPv3 and Advanced SIMD, i.e. they can be executed by
both units. Thus, AFAIK, there should be no performance regressions on the A9
for VLDM / VSTM and VLDR / VSTR of 64-bit registers interleaved with
other NEON instructions.
cheers
Ramana
From: siarhei.siamashka at gmail dot com @ 2010-10-08 14:13 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725
--- Comment #5 from Siarhei Siamashka <siarhei.siamashka at gmail dot com> 2010-10-08 14:13:08 UTC ---
(In reply to comment #3)
> On Mon, 4 Oct 2010, siarhei.siamashka at gmail dot com wrote:
>
> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725
> >
> > --- Comment #2 from Siarhei Siamashka <siarhei.siamashka at gmail dot com> 2010-10-04 22:59:56 UTC ---
> > (In reply to comment #1)
> > > So the compiler is correct not to be using vld1 for this code. The memory
> > > format of int32x4_t is defined to be the format of a neon register that has
> > > been filled from an array of int32 values and then stored to memory using VSTM
> > > (or equivalent sequence). The implication of all this is that int32x4_t does
> > > not (necessarily) have the same memory layout as int32_t[4].
> >
> > Could you elaborate on this? Specifically about the case when memory format for
> > VSTM and VST1 may differ.
>
> Big-endian.
OK, I see. Looks like VLDM/VSTM instructions could be replaced with VLD1/VST1
(by artificially forcing element size to 64) in almost all cases except when
SCTLR.A == 1 due to unwanted alignment traps potentially happening in this
case.
But the question is whether it is really necessary to suffer from a performance
penalty on little endian systems?
> I previously explained the issues with big-endian NEON vectors in GCC at
> length:
>
> http://gcc.gnu.org/ml/gcc-patches/2010-06/msg00409.html
Thanks for the link, something seems to be seriously overengineered. Looks like
you brought a problem upon yourself and now are trying to valiantly solve it.
Does (efficient) support of NEON intrinsics on big endian systems even have any
practical value? Maybe it makes sense to get a reasonable performance at least
on little endian systems first. To me it looks like you are just running after
two hares...
From: siarhei.siamashka at gmail dot com @ 2011-06-29 13:35 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725
--- Comment #6 from Siarhei Siamashka <siarhei.siamashka at gmail dot com> 2011-06-29 13:35:13 UTC ---
Created attachment 24630
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24630
test.c
Attached a slightly updated testcase that demonstrates unnecessary spills
to the stack even with more recent versions of gcc, as explained in comment 2
earlier (it just slightly increases the number of uses of the X() macro).
From: m.zakirov at samsung dot com @ 2014-07-09 12:26 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725
Marat Zakirov <m.zakirov at samsung dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |joseph at codesourcery dot com,
| |m.zakirov at samsung dot com
--- Comment #7 from Marat Zakirov <m.zakirov at samsung dot com> ---
Another neon alloc issue.
Code:
#include <arm_neon.h>
#include <inttypes.h>
extern uint16x8x4_t m0;
extern uint16x8x4_t m1;
void foo(uint16_t * in_ptr)
{
uint16x8x4_t t0, t1;
t0 = vld4q_u16((uint16_t *)&in_ptr[0 ]);
t1 = vld4q_u16((uint16_t *)&in_ptr[64]);
t0.val[0] *= 333;
t0.val[1] *= 333;
t0.val[2] *= 333;
t0.val[3] *= 333;
t1.val[0] *= 333;
t1.val[1] *= 333;
t1.val[2] *= 333;
t1.val[3] *= 333;
m0 = t0;
m1 = t1;
}
Asm file:
.vsave {d8, d9, d10, d11, d12, d13, d14, d15}
add r1, r0, #160
vld4.16 {d8, d10, d12, d14}, [r0]
add r0, r0, #32
.pad #64
sub sp, sp, #64
vld4.16 {d16, d18, d20, d22}, [r2]
movw r3, #:lower16:m1
movw r2, #:lower16:m0
vldr d6, .L3
vldr d7, .L3+8
movt r3, #:upper16:m1
movt r2, #:upper16:m0
vld4.16 {d9, d11, d13, d15}, [r0]
vld4.16 {d17, d19, d21, d23}, [r1]
vmul.i16 q12, q3, q4
vstmia sp, {d16-d23} <<< *
vld1.64 {d4-d5}, [sp:64] <<< *
vmul.i16 q13, q3, q5 <<< **
vmul.i16 q9, q3, q9
vmul.i16 q14, q3, q6 <<< **
vmul.i16 q10, q3, q10
vmul.i16 q8, q3, q2 <<< **, ***
vmul.i16 q15, q3, q7 <<< **
vmul.i16 q11, q3, q11
vstmia r2, {d24-d31}
vstmia r3, {d16-d23}
add sp, sp, #64
@ sp needed
fldmfdd sp!, {d8-d15}
bx lr
So my questions are:
1) Why do we need the stack store and reload marked *, and why did the compiler use q2 in ***?
2) Why didn't the compiler reuse registers q5, q6, q2, q7 in **?
Command line:
cc1 -quiet -v t.c -quiet -dumpbase t.c -mfpu=neon -mcpu=cortex-a15
-mfloat-abi=softfp -marm -mtls-dialect=gnu -auxbase-strip t.s -O3
-Wno-error=unused-local-typedefs -version -fdump-tree-all -fdump-rtl-all
-funwind-tables -o t.s
gcc version = 4.10.0
--build=x86_64-pc-linux-gnu
--host=x86_64-pc-linux-gnu
--target=arm-v7a15v5r2-linux-gnueabi
--Marat
From: m.zakirov at samsung dot com @ 2014-07-29 11:35 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725
--- Comment #8 from Marat Zakirov <m.zakirov at samsung dot com> ---
UPDATE
With a small fix in CSE (described below) you can get much better code:
transpose_16x16:
.fnstart
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
add r2, r0, #128
vld4.16 {d24, d26, d28, d30}, [r0]
add r1, r0, #160
vld4.16 {d16, d18, d20, d22}, [r2]
add r0, r0, #32
movw r3, #:lower16:m1
vldr d6, .L2
vldr d7, .L2+8
movw r2, #:lower16:m0
movt r3, #:upper16:m1
movt r2, #:upper16:m0
vld4.16 {d25, d27, d29, d31}, [r0]
vld4.16 {d17, d19, d21, d23}, [r1]
vmul.i16 q12, q3, q12
vmul.i16 q8, q3, q8
vmul.i16 q13, q3, q13
vmul.i16 q9, q3, q9
vmul.i16 q14, q3, q14
vmul.i16 q10, q3, q10
vmul.i16 q15, q3, q15
vmul.i16 q11, q3, q11
vstmia r2, {d24-d31}
vstmia r3, {d16-d23}
bx lr
.L3:
About fix:
I discovered that the GCC register allocator has only 'weak' support for stream (in my
case NEON) registers. The RA treats stream registers as unsplittable
ranges, so if one register of a range becomes free, GCC does not reuse it until
the whole range becomes free.
That is actually OK, but...
I found that the GCC CSE phase performs partial substitution on register ranges, and
this leads to a terrible increase in register pressure.
Example
Before CSE
a = b
a0 = a0 * 3
a1 = a1 * 3
a2 = a2 * 3
a3 = a3 * 3
After
a = b
a0 = b0 * 3
a1 = a1 * 3 <<< *
a2 = a2 * 3
a3 = a3 * 3
CSE does not substitute b1 for a1 because at the point marked (*) a0 has already been
redefined, so a != b as a whole. It is still true that a1 == b1, but unfortunately CSE,
like the RA, does not know how to handle parts of register ranges. And I am not sure
that it really is 'unfortunately', because
a0 = b0 * 3
a1 = b1 * 3
a2 = b2 * 3
a3 = b3 * 3
Also requires twice as many stream registers as it really needs.
My solution here is to forbid CSE for XImode registers.
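The register-pressure argument above can be checked with a toy liveness model. This is a sketch under strong assumptions and not a model of GCC's actual allocator: each range (a or b) occupies four registers as one indivisible block, b is live on entry, a is live on exit, and a block cannot be reused until its last use. The names struct insn and peak_registers are hypothetical.

```c
#include <stddef.h>

enum { RANGE_A, RANGE_B, NUM_RANGES };
enum { REGS_PER_RANGE = 4 };

/* One instruction of the toy program: reads part of `use`, writes part of `def`. */
struct insn {
    int def;
    int use;
};

/* Peak register count when whole ranges are allocated and freed as blocks. */
static int peak_registers(const struct insn *code, size_t n)
{
    /* Live intervals in instruction indices: b is defined before
       instruction 0 (live on entry), a stays live past the end. */
    long first_def[NUM_RANGES] = { [RANGE_A] = (long)n, [RANGE_B] = -1 };
    long last_use[NUM_RANGES]  = { [RANGE_A] = (long)n, [RANGE_B] = -1 };

    for (size_t i = 0; i < n; i++) {
        long idx = (long)i;
        if (idx < first_def[code[i].def])
            first_def[code[i].def] = idx;
        if (idx > last_use[code[i].use])
            last_use[code[i].use] = idx;
    }

    int peak = 0;
    /* Point -1 is function entry; point i is just after instruction i. */
    for (long point = -1; point < (long)n; point++) {
        int live = 0;
        for (int r = 0; r < NUM_RANGES; r++)
            if (first_def[r] <= point && last_use[r] > point)
                live++;
        if (live * REGS_PER_RANGE > peak)
            peak = live * REGS_PER_RANGE;
    }
    return peak;
}
```

Feeding it the "Before CSE" sequence (a = b, then four updates of a from a) gives 4 registers in this model, while both the partially substituted and the fully substituted sequences keep b alive after a is defined and give 8.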
From: m.zakirov at samsung dot com @ 2014-07-29 11:46 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725
--- Comment #9 from Marat Zakirov <m.zakirov at samsung dot com> ---
I used the following patch:
diff --git a/gcc/cse.c b/gcc/cse.c
index 34f9364..a9e0442 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -2862,6 +2862,9 @@ canon_reg (rtx x, rtx insn)
|| ! REGNO_QTY_VALID_P (REGNO (x)))
return x;
+ if (GET_MODE (x) == XImode)
+ return x;
+
q = REG_QTY (REGNO (x));
ent = &qty_table[q];
first = ent->first_reg;
diff --git a/gcc/fwprop.c b/gcc/fwprop.c
index 547fcd6..eadc729 100644
--- a/gcc/fwprop.c
+++ b/gcc/fwprop.c
@@ -1317,6 +1317,9 @@ forward_propagate_and_simplify (df_ref use, rtx def_insn, rtx def_set)
if (!new_rtx)
return false;
+ if (GET_MODE (reg) == XImode)
+ return false;
+
return try_fwprop_subst (use, loc, new_rtx, def_insn, set_reg_equal);
}
From: mkuvyrkov at gcc dot gnu.org @ 2014-08-20 16:44 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725
Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |ASSIGNED
CC| |mkuvyrkov at gcc dot gnu.org
Assignee|unassigned at gcc dot gnu.org |mkuvyrkov at gcc dot gnu.org
From: pinskia at gcc dot gnu.org @ 2021-09-27 7:21 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |NEW
Ever confirmed|0 |1