public inbox for gcc-bugs@sourceware.org
* [Bug target/63277] New: ARM - NEON excessive use of vmov for vtbl2 / uint8x8x2 for shuffling data unnecessarily around
@ 2014-09-16 12:07 janne-gcc at jannau dot net
  2014-09-16 13:20 ` [Bug target/63277] " ktkachov at gcc dot gnu.org
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: janne-gcc at jannau dot net @ 2014-09-16 12:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63277

            Bug ID: 63277
           Summary: ARM - NEON excessive use of vmov for vtbl2 / uint8x8x2
                    for shuffling data unnecessarily around
           Product: gcc
           Version: 5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: janne-gcc at jannau dot net

Created attachment 33500
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33500&action=edit
small example source code

armv7a-hardfloat-linux-gnueabi-gcc-5.0.0 -v                      
Using built-in specs.
COLLECT_GCC=/opt/gcc/bin/armv7a-hardfloat-linux-gnueabi-gcc-5.0.0
COLLECT_LTO_WRAPPER=/opt/gcc/libexec/gcc/armv7a-hardfloat-linux-gnueabi/5.0.0/lto-wrapper
Target: armv7a-hardfloat-linux-gnueabi
Configured with: /home/janne/src/gcc-trunk/configure --host=x86_64-pc-linux-gnu
--target=armv7a-hardfloat-linux-gnueabi --build=x86_64-pc-linux-gnu
--prefix=/opt/gcc/ --enable-languages=c,c++,fortran --enable-obsolete
--enable-secureplt --disable-werror --with-system-zlib --enable-nls
--without-included-gettext --enable-checking=release --enable-libstdcxx-time
--enable-poison-system-directories
--with-sysroot=/usr/armv7a-hardfloat-linux-gnueabi --disable-bootstrap
--enable-__cxa_atexit --enable-clocale=gnu --disable-multilib --disable-altivec
--disable-fixed-point --with-float=hard --with-arch=armv7-a --with-float=hard
--with-fpu=neon --disable-libgcj --enable-libgomp --disable-libmudflap
--disable-libssp --enable-lto --without-cloog
Thread model: posix
gcc version 5.0.0 20140916 (experimental) (GCC) 

armv7a-hardfloat-linux-gnueabi-gcc-5.0.0 -march=armv7-a -mfpu=neon -O3 -S arm_neon_excessive_vmov.c -o -
        .arch armv7-a
        .eabi_attribute 27, 3
        .eabi_attribute 28, 1
        .fpu neon
        .eabi_attribute 20, 1
        .eabi_attribute 21, 1
        .eabi_attribute 23, 3
        .eabi_attribute 24, 1
        .eabi_attribute 25, 1
        .eabi_attribute 26, 2
        .eabi_attribute 30, 2
        .eabi_attribute 34, 1
        .eabi_attribute 18, 4
        .file   "arm_neon_excessive_vmov.c"
        .text
        .align  2
        .global gf_w8_split_multiply_region_neon
        .type   gf_w8_split_multiply_region_neon, %function
gf_w8_split_multiply_region_neon:
        @ args = 4, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        str     lr, [sp, #-4]!
        mov     r3, r3, asl #4
        ldr     ip, [sp, #4]
        add     lr, r0, #4096
        add     lr, lr, r3
        add     r0, r0, r3
        vmov.i8 q15, #15  @ v16qi
        add     ip, r2, ip, lsl #4
        vld1.8  {d18-d19}, [lr]
        cmp     r1, ip
        vld1.8  {d16-d17}, [r0]
        ldrcs   pc, [sp], #4
        vmov    d27, d18  @ v8qi
        vmov    d26, d19  @ v8qi
        vmov    d25, d16  @ v8qi
        vmov    d24, d17  @ v8qi
.L3:
        vld1.8  {d18-d19}, [r1]
        vmov    d20, d25  @ v8qi
        vmov    d21, d24  @ v8qi
        add     r1, r1, #16
        vshr.u8 q14, q9, #4
        cmp     r1, ip
        vmov    d22, d27  @ v8qi
        vmov    d23, d26  @ v8qi
        vtbl.8  d16, {d20, d21}, d28
        vand    q9, q9, q15
        vtbl.8  d28, {d20, d21}, d29
        vtbl.8  d17, {d22, d23}, d18
        vmov    d29, d28  @ v8qi
        vmov    d28, d16  @ v8qi
        vtbl.8  d16, {d22, d23}, d19
        vswp    d16, d17
        veor    q8, q8, q14
        vst1.8  {d16-d17}, [r2]
        add     r2, r2, #16
        bcc     .L3
        ldr     pc, [sp], #4
        .size   gf_w8_split_multiply_region_neon, .-gf_w8_split_multiply_region_neon
        .ident  "GCC: (GNU) 5.0.0 20140916 (experimental)"
        .section        .note.GNU-stack,"",%progbits

None of the vmov/vswp instructions are needed; clang 3.4 generates assembly
without them from the same source file.
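
For readers without the attachment, the loop at .L3 can be read back into
scalar C. The sketch below is only a model inferred from the assembly above,
not the attached source: the names tbl2_lane, split_multiply_model, tbl_hi and
tbl_lo are made up here, and which of the two loaded tables the original code
treats as "high" or "low" is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one lane of a two-register vtbl.8: a lookup into a
   16-byte table where any index outside 0..15 yields 0. */
uint8_t tbl2_lane(const uint8_t table[16], uint8_t index)
{
    return index < 16 ? table[index] : 0;
}

/* Hypothetical scalar equivalent of the .L3 loop: each source byte is
   split into nibbles, each nibble indexes its own 16-entry table, and
   the two lookups are XORed together. */
void split_multiply_model(const uint8_t tbl_hi[16], const uint8_t tbl_lo[16],
                          const uint8_t *src, uint8_t *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        uint8_t hi = src[i] >> 4;   /* vshr.u8 q14, q9, #4 */
        uint8_t lo = src[i] & 15;   /* vand q9, q9, q15 (q15 = 15 per lane) */
        dst[i] = tbl2_lane(tbl_hi, hi) ^ tbl2_lane(tbl_lo, lo);  /* veor */
    }
}
```

Per 16 source bytes this amounts to four table lookups, one shift, one AND
and one XOR; nothing in the data flow calls for a register-to-register copy
inside the loop.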



* [Bug target/63277] ARM - NEON excessive use of vmov for vtbl2 / uint8x8x2 for shuffling data unnecessarily around
From: ktkachov at gcc dot gnu.org @ 2014-09-16 13:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63277

ktkachov at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Target|                            |arm
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2014-09-16
                 CC|                            |ktkachov at gcc dot gnu.org
     Ever confirmed|0                           |1
      Known to fail|                            |5.0

--- Comment #1 from ktkachov at gcc dot gnu.org ---
Confirmed.

The vmovs and vswp are generated from the implementation of our vcombine. The
define_insn_and_split for neon_vcombine<mode> splits into moves/swaps after
reload, thus not giving the register allocator a chance to optimise the moves
away.



* [Bug target/63277] ARM - NEON excessive use of vmov for vtbl2 / uint8x8x2 for shuffling data unnecessarily around
From: janne-gcc at jannau dot net @ 2014-09-16 14:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63277

--- Comment #3 from Janne Grunau <janne-gcc at jannau dot net> ---
Created attachment 33501
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33501&action=edit
arm_neon_excessive_vmov_wo_vcombine.c



* [Bug target/63277] ARM - NEON excessive use of vmov for vtbl2 / uint8x8x2 for shuffling data unnecessarily around
From: janne-gcc at jannau dot net @ 2014-09-16 14:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63277

--- Comment #2 from Janne Grunau <janne-gcc at jannau dot net> ---
It is not only the vcombine.

The handling of the table vectors is even more dreadful. The loads are combined
into properly paired registers, then moved in reverse order to different
registers, only to be reassembled inside the loop into properly paired
registers for vtbl2. See the attached arm_neon_excessive_vmov_wo_vcombine.c.



* [Bug target/63277] ARM - NEON excessive use of vmov for vtbl2 / uint8x8x2 for shuffling data unnecessarily around
From: cbaylis at gcc dot gnu.org @ 2014-09-18  1:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63277

cbaylis at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |cbaylis at gcc dot gnu.org

--- Comment #4 from cbaylis at gcc dot gnu.org ---
A much simplified test case, based on arm_neon_excessive_vmov_wo_vcombine.c:

$ arm-unknown-linux-gnueabihf-gcc -O2 -S -o - -mfpu=neon mini.c 

#include <arm_neon.h>

void f(int8_t *p)
{
   int8x16_t v;
   int8x8_t v2;
   int8x8x2_t vx;

   v=vld1q_s8(p);
   v2=vld1_s8(p);
   vx.val[0] = vget_low_s8(v);
   vx.val[1] = vget_high_s8(v);
   v2 = vtbl2_s8(vx, v2);
   vst1_s8(p, v2);
}

With -dp, the generated code is:
f:
        vld1.8  {d18-d19}, [r0] @ 6     neon_vld1v16qi  [length = 4]
        vmov    d16, d18  @ v8qi        @ 10    *neon_movv8qi/1 [length = 4]
        vld1.8  {d20}, [r0]     @ 7     neon_vld1v8qi   [length = 4]
        vmov    d17, d19  @ v8qi        @ 11    *neon_movv8qi/1 [length = 4]
        vtbl.8  d16, {d16, d17}, d20    @ 12    neon_vtbl2v8qi  [length = 4]
        vst1.8  {d16}, [r0]     @ 13    neon_vst1v8qi   [length = 4]
        bx      lr      @ 24    *thumb2_return  [length = 4]

By the time IRA runs, the insns which result in the moves look like this:
(insn 9 18 11 2 (set (subreg:V8QI (reg/v:TI 116 [ vx ]) 0)
        (subreg:V8QI (reg:V16QI 114 [ D.14019 ]) 0)) /tmp/mini.c:11 827 {*neon_movv8qi}

The registers 116 and 114 are allocated to different hard registers, as they
conflict. Presumably, the register allocator could be taught to treat this
subreg->subreg move as a copy and allow the same hard register to be allocated.
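
To illustrate why the copy carries no information, here is a rough scalar
model of f() above. It is a sketch, not GCC or NEON code: qreg_model,
vtbl2_model and f_model are names invented for this note, and uint8_t is used
in place of int8_t because the table instruction treats index bytes as
unsigned. The union mirrors the architectural fact that a quad register qN
shares its storage with d(2N) and d(2N+1).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of one quad register: the 128-bit view and the two 64-bit views
   are the same bytes, with d[0] holding the low half. */
typedef union {
    uint8_t q[16];     /* e.g. q9 */
    uint8_t d[2][8];   /* e.g. d18 (low half) and d19 (high half) */
} qreg_model;

/* Scalar model of vtbl2: the two 8-byte halves form one 16-byte table;
   indices 0..7 hit the low half, 8..15 the high half, others give 0. */
void vtbl2_model(const uint8_t lo[8], const uint8_t hi[8],
                 const uint8_t idx[8], uint8_t out[8])
{
    for (int i = 0; i < 8; i++) {
        uint8_t k = idx[i];
        out[i] = k < 8 ? lo[k] : k < 16 ? hi[k - 8] : 0;
    }
}

/* Model of f(): the table operand is read straight out of the two halves
   of v, without copying them anywhere first. */
void f_model(uint8_t *p)
{
    qreg_model v;
    uint8_t v2[8], out[8];

    assert(p != NULL);
    memcpy(v.q, p, 16);                    /* v  = vld1q_s8(p) */
    memcpy(v2, p, 8);                      /* v2 = vld1_s8(p)  */
    vtbl2_model(v.d[0], v.d[1], v2, out);  /* vget_low / vget_high as views */
    memcpy(p, out, 8);                     /* vst1_s8(p, v2)   */
}
```

In this model the halves of v are already exactly the table that vtbl2
consumes, which is the situation an allocator would exploit by giving pseudos
114 and 116 the same hard registers.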



* [Bug rtl-optimization/63277] ARM - NEON excessive use of vmov for vtbl2 / uint8x8x2 for shuffling data unnecessarily around
From: ramana at gcc dot gnu.org @ 2015-05-14 10:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63277

Ramana Radhakrishnan <ramana at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |ra
                 CC|                            |ramana at gcc dot gnu.org,
                   |                            |vmakarov at gcc dot gnu.org
          Component|target                      |rtl-optimization

--- Comment #5 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> ---
Adding Vlad to CC.



* [Bug rtl-optimization/63277] ARM - NEON excessive use of vmov for vtbl2 / uint8x8x2 for shuffling data unnecessarily around
From: ramana at gcc dot gnu.org @ 2015-06-25 20:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63277

Ramana Radhakrishnan <ramana at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |47562

--- Comment #6 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> ---
Link to meta bug.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47562
[Bug 47562] [meta-bug] keep track of Neon enhancements


