public inbox for gcc-bugs@sourceware.org
* [Bug target/51509] New: Inefficient neon intrinsic code sequence
From: carrot at google dot com @ 2011-12-12 7:33 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509
Bug #: 51509
Summary: Inefficient neon intrinsic code sequence
Classification: Unclassified
Product: gcc
Version: 4.7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
AssignedTo: unassigned@gcc.gnu.org
ReportedBy: carrot@google.com
Target: arm-linux-androideabi
Compile the following code with options -march=armv7-a -mfloat-abi=softfp
-mfpu=neon -mthumb -O2 -Wall -fpic
#include <arm_neon.h>

void simple_vld_intrin(uint8_t *src, uint8_t *dst)
{
    uint8x8x4_t x;
    uint8x8x2_t y;
    x = vld4_lane_u8(src, x, 0);
    y.val[0][0] = x.val[1][0];
    y.val[1][0] = x.val[2][0];
    vst2_lane_u8(dst, y, 0);
}
gcc 4.7 generates:
.LC0:
    .word 0
    .word 0
    .word 0
    .word 0
    .word 0
    .word 0
    .word 0
    .word 0
    .text
    .align 2
    .global simple_vld_intrin
    .thumb
    .thumb_func
    .type simple_vld_intrin, %function
simple_vld_intrin:
    @ args = 0, pretend = 0, frame = 32
    @ frame_needed = 0, uses_anonymous_args = 0
    @ link register save eliminated.
    ldr r2, .L2
    sub sp, sp, #32
.LPIC0:
    add r2, pc
    vldmia r2, {d18-d21}
    vmov.i32 d19, #0 @ v8qi
    vmov d20, d19 @ v8qi
    vmov q11, q9 @ ti
    vmov q12, q10 @ ti
    vmov d16, d19 @ v8qi
    vmov d17, d19 @ v8qi
    vld4.8 {d22[0], d23[0], d24[0], d25[0]}, [r0]
    vstmia sp, {d22-d25}
    ldrb r2, [sp, #8] @ zero_extendqisi2
    vmov.8 d16[0], r2
    vmov.u8 r3, d24[0]
    vmov.8 d17[0], r3
    vst2.8 {d16[0], d17[0]}, [r1]
    add sp, sp, #32
    bx lr
.L3:
    .align 2
.L2:
    .word .LC0-(.LPIC0+4)
An ideal result should be:
    vld4.8 {d16[0], d17[0], d18[0], d19[0]}, [r0]
    vmov d20, d17 @ v8qi
    vmov d21, d18 @ v8qi
    vst2.8 {d20[0], d21[0]}, [r1]
    bx lr
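
For reference, the test case only moves two bytes. With lane 0, vld4_lane_u8 reads src[0..3] into lane 0 of the four vectors, and vst2_lane_u8 writes lane 0 of the two vectors to dst[0..1]. The following is a minimal scalar sketch of the same observable behaviour, assuming that reading of the intrinsics. The names simple_vld_scalar and simple_vld_scalar_check are hypothetical and do not appear in the report.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar equivalent of simple_vld_intrin (not from the report).
   Lane 0 of x.val[1] holds src[1] and lane 0 of x.val[2] holds src[2];
   those are the only two bytes that reach dst. */
void simple_vld_scalar(const uint8_t *src, uint8_t *dst)
{
    dst[0] = src[1];
    dst[1] = src[2];
}

/* Small usage example: load four bytes, expect the middle two in dst. */
void simple_vld_scalar_check(void)
{
    const uint8_t in[4] = { 10, 20, 30, 40 };
    uint8_t out[2] = { 0, 0 };

    simple_vld_scalar(in, out);
    assert(out[0] == 20);
    assert(out[1] == 30);
}
```

This is why the ideal sequence above needs only one structure load, two register moves, and one structure store.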
* [Bug target/51509] Inefficient neon intrinsic code sequence
From: ramana at gcc dot gnu.org @ 2011-12-12 19:05 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509
Ramana Radhakrishnan <ramana at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|arm-linux-androideabi       |arm-linux-androideabi,
                   |                            |arm-linux-gnueabi
             Status|UNCONFIRMED                 |NEW
           Keywords|                            |missed-optimization
   Last reconfirmed|                            |2011-12-12
                 CC|                            |ramana at gcc dot gnu.org,
                   |                            |rsandifo at gcc dot gnu.org
             Blocks|                            |47562
     Ever Confirmed|0                           |1
* [Bug target/51509] Inefficient neon intrinsic code sequence
From: rsandifo at gcc dot gnu.org @ 2011-12-13 9:14 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509
--- Comment #1 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> 2011-12-13 09:07:38 UTC ---
At least part of the problem here is the uninitialised
variable in the vld4 call. GCC tries to create a zero
initialisation of "x" before the vld4, so that the other
lanes have defined values. Obviously we could be doing
that much better than we are, and perhaps we should have
some kind of special case so that uninitialised NEON vectors
are never zero-initialised (e.g. use a plain clobber instead).
But uninitialised variables aren't really ideal either way.
Something like:
    x = vld4_dup_u8(src);
    y.val[0][0] = x.val[1][0];
    y.val[1][0] = x.val[2][0];
    vst2_lane_u8(dst, y, 0);
would be better in principle. Unfortunately, we don't
generate good code for that either. Part of the problem
is introduced by lower-subreg, but it's not good even
with -fno-split-wide-types.
* [Bug target/51509] Inefficient neon intrinsic code sequence
From: rsandifo at gcc dot gnu.org @ 2011-12-13 9:36 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509
--- Comment #2 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> 2011-12-13 09:20:54 UTC ---
FWIW,
    uint8x8x4_t x;
    uint8x8x2_t y;
    x = vld4_dup_u8(src);
    y.val[0] = x.val[1];
    y.val[1] = x.val[2];
    vst2_lane_u8(dst, y, 0);
does give the expected output. I.e. the remaining inefficiency
from comment #1 is in the uninitialised parts of y.
Richard
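
To make the difference between the two variants concrete, here is a portable sketch that models the NEON types as plain structs. It assumes the usual semantics: vld4_dup_u8 reads four bytes and replicates byte i across all lanes of val[i], and vst2_lane_u8 stores one lane of each of two vectors to consecutive bytes. All model_* names and whole_vector_variant are hypothetical stand-ins, not GCC or arm_neon.h identifiers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical portable stand-ins for uint8x8_t, uint8x8x4_t, uint8x8x2_t. */
typedef struct { uint8_t lane[8]; } model_u8x8;
typedef struct { model_u8x8 val[4]; } model_u8x8x4;
typedef struct { model_u8x8 val[2]; } model_u8x8x2;

/* Model of vld4_dup_u8: every lane of val[i] receives src[i], so no lane
   of the result is left undefined. */
model_u8x8x4 model_vld4_dup_u8(const uint8_t *src)
{
    model_u8x8x4 r;
    for (int i = 0; i < 4; i++)
        memset(r.val[i].lane, src[i], sizeof r.val[i].lane);
    return r;
}

/* Model of vst2_lane_u8: store the chosen lane of both vectors. */
void model_vst2_lane_u8(uint8_t *dst, model_u8x8x2 y, int lane)
{
    dst[0] = y.val[0].lane[lane];
    dst[1] = y.val[1].lane[lane];
}

/* The variant from this comment: whole-vector copies define all of y,
   unlike the lane-0-only writes in comment #1. */
void whole_vector_variant(const uint8_t *src, uint8_t *dst)
{
    model_u8x8x4 x = model_vld4_dup_u8(src);
    model_u8x8x2 y;

    y.val[0] = x.val[1];
    y.val[1] = x.val[2];
    model_vst2_lane_u8(dst, y, 0);
}
```

Both variants store the same two bytes. The whole-vector form simply leaves no part of y uninitialised for the compiler to fill in.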
* [Bug target/51509] Inefficient neon intrinsic code sequence
From: ramana at gcc dot gnu.org @ 2012-06-15 0:51 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509
--- Comment #3 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> 2012-06-15 00:51:26 UTC ---
With -fno-split-wide-types on FSF trunk, I get output identical to what is
expected in this case. I suspect this is another of the lower-subreg cost
issues.
Ramana
* [Bug target/51509] Inefficient neon intrinsic code sequence
From: mkuvyrkov at gcc dot gnu.org @ 2015-04-13 16:30 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509
Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> changed:
           What    |Removed                       |Added
----------------------------------------------------------------------------
                 CC|                              |clyon at gcc dot gnu.org,
                   |                              |mkuvyrkov at gcc dot gnu.org
           Assignee|unassigned at gcc dot gnu.org |kugan at gcc dot gnu.org
--- Comment #4 from Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> ---
Kugan,
Would you please check if your patch for
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65375 also affects this one?
* [Bug target/51509] Inefficient neon intrinsic code sequence
From: mkuvyrkov at gcc dot gnu.org @ 2015-04-13 16:33 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509
Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|kugan at gcc dot gnu.org     |cbaylis at gcc dot gnu.org
--- Comment #5 from Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> ---
Oh, sorry, I missed the fact that PR65375 is for aarch64 and this one is for
armv7. Charles, would you please look at this?