public inbox for gcc-bugs@sourceware.org
* [Bug target/51509] New: Inefficient neon intrinsic code sequence
@ 2011-12-12  7:33 carrot at google dot com
  2011-12-12 19:05 ` [Bug target/51509] " ramana at gcc dot gnu.org
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: carrot at google dot com @ 2011-12-12  7:33 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509

             Bug #: 51509
           Summary: Inefficient neon intrinsic code sequence
    Classification: Unclassified
           Product: gcc
           Version: 4.7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: carrot@google.com
            Target: arm-linux-androideabi


Compile the following code with the options -march=armv7-a -mfloat-abi=softfp
-mfpu=neon -mthumb -O2 -Wall -fpic:

#include <arm_neon.h>
void simple_vld_intrin(uint8_t *src, uint8_t *dst)
{
  uint8x8x4_t x;
  uint8x8x2_t y;

  x = vld4_lane_u8(src, x, 0);

  y.val[0][0] = x.val[1][0];
  y.val[1][0] = x.val[2][0];

  vst2_lane_u8(dst, y, 0);
}
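
For reference, the data movement this test case asks for can be modelled in
plain C. This sketch is not part of the report: simple_vld_model is a
hypothetical name, and it assumes the usual semantics of the two intrinsics
(vld4_lane_u8 reads four consecutive bytes into one lane of four vectors,
vst2_lane_u8 writes one lane of two vectors to two consecutive bytes). It is
only meant to show how little work the ideal sequence has to do.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of simple_vld_intrin (illustration only).
   Lane 0 of x.val[0..3] holds src[0..3] after the vld4 lane load;
   lane 0 of y.val[0..1] is what the vst2 lane store writes to dst[0..1]. */
void simple_vld_model(const uint8_t *src, uint8_t *dst)
{
  assert(src && dst);
  dst[0] = src[1];  /* y.val[0][0] = x.val[1][0] */
  dst[1] = src[2];  /* y.val[1][0] = x.val[2][0] */
}
```

Under these assumptions the function copies two bytes, which is why the
surrounding stack traffic and constant-pool load in the generated code look
avoidable.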

gcc 4.7 generates:


.LC0:
    .word    0
    .word    0
    .word    0
    .word    0
    .word    0
    .word    0
    .word    0
    .word    0
    .text
    .align    2
    .global    simple_vld_intrin
    .thumb
    .thumb_func
    .type    simple_vld_intrin, %function
simple_vld_intrin:
    @ args = 0, pretend = 0, frame = 32
    @ frame_needed = 0, uses_anonymous_args = 0
    @ link register save eliminated.
    ldr    r2, .L2
    sub    sp, sp, #32
.LPIC0:
    add    r2, pc
    vldmia    r2, {d18-d21}
    vmov.i32    d19, #0  @ v8qi
    vmov    d20, d19  @ v8qi
    vmov    q11, q9  @ ti
    vmov    q12, q10  @ ti
    vmov    d16, d19  @ v8qi
    vmov    d17, d19  @ v8qi
    vld4.8    {d22[0], d23[0], d24[0], d25[0]}, [r0]
    vstmia    sp, {d22-d25}
    ldrb    r2, [sp, #8]    @ zero_extendqisi2
    vmov.8    d16[0], r2
    vmov.u8    r3, d24[0]
    vmov.8    d17[0], r3
    vst2.8    {d16[0], d17[0]}, [r1]
    add    sp, sp, #32
    bx    lr
.L3:
    .align    2
.L2:
    .word    .LC0-(.LPIC0+4)


The ideal result would be:

    vld4.8    {d16[0], d17[0], d18[0], d19[0]}, [r0]
    vmov    d20, d17  @ v8qi
    vmov    d21, d18  @ v8qi
    vst2.8    {d20[0], d21[0]}, [r1]
    bx    lr



* [Bug target/51509] Inefficient neon intrinsic code sequence
  2011-12-12  7:33 [Bug target/51509] New: Inefficient neon intrinsic code sequence carrot at google dot com
@ 2011-12-12 19:05 ` ramana at gcc dot gnu.org
  2011-12-13  9:14 ` rsandifo at gcc dot gnu.org
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: ramana at gcc dot gnu.org @ 2011-12-12 19:05 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509

Ramana Radhakrishnan <ramana at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|arm-linux-androideabi       |arm-linux-androideabi,
                   |                            |arm-linux-gnueabi
             Status|UNCONFIRMED                 |NEW
           Keywords|                            |missed-optimization
   Last reconfirmed|                            |2011-12-12
                 CC|                            |ramana at gcc dot gnu.org,
                   |                            |rsandifo at gcc dot gnu.org
             Blocks|                            |47562
     Ever Confirmed|0                           |1



* [Bug target/51509] Inefficient neon intrinsic code sequence
  2011-12-12  7:33 [Bug target/51509] New: Inefficient neon intrinsic code sequence carrot at google dot com
  2011-12-12 19:05 ` [Bug target/51509] " ramana at gcc dot gnu.org
@ 2011-12-13  9:14 ` rsandifo at gcc dot gnu.org
  2011-12-13  9:36 ` rsandifo at gcc dot gnu.org
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2011-12-13  9:14 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509

--- Comment #1 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> 2011-12-13 09:07:38 UTC ---
At least part of the problem here is the uninitialised
variable in the vld4 call.  GCC tries to create a zero
initialisation of "x" before the vld4, so that the other
lanes have defined values.  Obviously we could be doing
that much better than we are, and perhaps we should have
some kind of special case so that uninitialised NEON vectors
are never zero-initialised (e.g. use a plain clobber instead).
But uninitialised variables aren't really ideal either way.
Something like:

  x = vld4_dup_u8(src);

  y.val[0][0] = x.val[1][0];
  y.val[1][0] = x.val[2][0];

  vst2_lane_u8(dst, y, 0);

would be better in principle.  Unfortunately, we don't
generate good code for that either.  Part of the problem
is introduced by lower-subreg, but it's not good even
with -fno-split-wide-types.
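
As a rough illustration of why vld4_dup_u8 avoids the uninitialised-input
problem, here is a plain-C model of what it loads. The struct and function
names are hypothetical stand-ins, not the arm_neon.h definitions, and the
model assumes the documented behaviour of the "load to all lanes" form: each
of the four bytes at src is replicated across every lane of its vector, so no
lane of x is left undefined.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for uint8x8x4_t: four vectors of eight bytes. */
typedef struct { uint8_t val[4][8]; } model_u8x8x4;

/* Model of vld4_dup_u8: byte v of src is copied to all lanes of val[v]. */
model_u8x8x4 model_vld4_dup_u8(const uint8_t *src)
{
  model_u8x8x4 x;
  assert(src);
  for (int v = 0; v < 4; v++)
    for (int lane = 0; lane < 8; lane++)
      x.val[v][lane] = src[v];
  return x;
}
```

Unlike the lane form, this takes no vector input, so the compiler has nothing
to zero-initialise before the load.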



* [Bug target/51509] Inefficient neon intrinsic code sequence
  2011-12-12  7:33 [Bug target/51509] New: Inefficient neon intrinsic code sequence carrot at google dot com
  2011-12-12 19:05 ` [Bug target/51509] " ramana at gcc dot gnu.org
  2011-12-13  9:14 ` rsandifo at gcc dot gnu.org
@ 2011-12-13  9:36 ` rsandifo at gcc dot gnu.org
  2012-06-15  0:51 ` ramana at gcc dot gnu.org
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2011-12-13  9:36 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509

--- Comment #2 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> 2011-12-13 09:20:54 UTC ---
FWIW,

  uint8x8x4_t x;
  uint8x8x2_t y;

  x = vld4_dup_u8(src);

  y.val[0] = x.val[1];
  y.val[1] = x.val[2];

  vst2_lane_u8(dst, y, 0);

does give the expected output.  I.e. the remaining inefficiency
from comment #1 is in the uninitialised parts of y.
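
The difference between the two forms can be sketched with a toy model that
tracks which lanes of a vector hold a defined value. Everything here
(model_vec, model_set_lane, model_copy) is hypothetical and only illustrates
the reasoning; it is not how GCC represents these values internally.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: an 8-lane vector plus a bitmask of defined lanes. */
typedef struct { uint8_t lane[8]; uint8_t defined; } model_vec;

/* Lane write, as in y.val[0][0] = x.val[1][0]: only lane i of dst changes,
   so the other seven lanes keep whatever state they had before. */
void model_set_lane(model_vec *dst, const model_vec *src, int i)
{
  assert(dst && src && i >= 0 && i < 8);
  uint8_t bit = (uint8_t)(1u << i);
  dst->lane[i] = src->lane[i];
  dst->defined = (uint8_t)((dst->defined & ~bit) | (src->defined & bit));
}

/* Whole-vector copy, as in y.val[0] = x.val[1]: every lane of dst is
   overwritten, so dst is exactly as defined as src. */
void model_copy(model_vec *dst, const model_vec *src)
{
  assert(dst && src);
  *dst = *src;
}
```

Starting from a y with no defined lanes, the lane write leaves seven lanes
undefined while the whole-vector copy leaves none, which matches the
observation that the leftover inefficiency comes from the uninitialised
parts of y.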

Richard



* [Bug target/51509] Inefficient neon intrinsic code sequence
  2011-12-12  7:33 [Bug target/51509] New: Inefficient neon intrinsic code sequence carrot at google dot com
                   ` (2 preceding siblings ...)
  2011-12-13  9:36 ` rsandifo at gcc dot gnu.org
@ 2012-06-15  0:51 ` ramana at gcc dot gnu.org
  2015-04-13 16:30 ` mkuvyrkov at gcc dot gnu.org
  2015-04-13 16:33 ` mkuvyrkov at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: ramana at gcc dot gnu.org @ 2012-06-15  0:51 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509

--- Comment #3 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> 2012-06-15 00:51:26 UTC ---
With -fno-split-wide-types, FSF trunk produces output identical to what is
expected in this case. I suspect this is another of those lower-subreg cost
issues.


Ramana



* [Bug target/51509] Inefficient neon intrinsic code sequence
  2011-12-12  7:33 [Bug target/51509] New: Inefficient neon intrinsic code sequence carrot at google dot com
                   ` (3 preceding siblings ...)
  2012-06-15  0:51 ` ramana at gcc dot gnu.org
@ 2015-04-13 16:30 ` mkuvyrkov at gcc dot gnu.org
  2015-04-13 16:33 ` mkuvyrkov at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: mkuvyrkov at gcc dot gnu.org @ 2015-04-13 16:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509

Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |clyon at gcc dot gnu.org,
                   |                            |mkuvyrkov at gcc dot gnu.org
           Assignee|unassigned at gcc dot gnu.org|kugan at gcc dot gnu.org

--- Comment #4 from Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> ---
Kugan,

Would you please check if your patch for
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65375 also affects this one?



* [Bug target/51509] Inefficient neon intrinsic code sequence
  2011-12-12  7:33 [Bug target/51509] New: Inefficient neon intrinsic code sequence carrot at google dot com
                   ` (4 preceding siblings ...)
  2015-04-13 16:30 ` mkuvyrkov at gcc dot gnu.org
@ 2015-04-13 16:33 ` mkuvyrkov at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: mkuvyrkov at gcc dot gnu.org @ 2015-04-13 16:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509

Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|kugan at gcc dot gnu.org    |cbaylis at gcc dot gnu.org

--- Comment #5 from Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> ---
Oh, sorry, I missed the fact that PR65375 is for aarch64 and this one is for
armv7.  Charles, would you please look at this?


