[Bug target/43725] New: Poor instructions selection, scheduling and registers allocation for ARM NEON intrinsics

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug target/43725]  New: Poor instructions selection, scheduling and registers allocation for ARM NEON intrinsics
@ 2010-04-12  7:27 siarhei dot siamashka at gmail dot com
  2010-05-11  7:35 ` [Bug target/43725] " ramana at gcc dot gnu dot org
  0 siblings, 1 reply; 13+ messages in thread
From: siarhei dot siamashka at gmail dot com @ 2010-04-12  7:27 UTC (permalink / raw)
  To: gcc-bugs

gcc version 4.5.0-rc20100406

/**************/
#include <arm_neon.h>

void x(int32x4_t a, int32x4_t b, int32x4_t *p)
{
        #define X(n) p[n] = vaddq_s32(p[n], a); p[n] = vorrq_s32(p[n], b);
        X(0); X(1); X(2); X(3); X(4); X(5); X(6); X(7);
        X(8); X(9); X(10); X(11); X(12);
}
/**************/

# gcc -O2 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=hard -c test.c
# objdump -d test.o

00000000 <x>:
   0:   edd0eb2c        vldr    d30, [r0, #176] ; 0xb0
   4:   edd0fb2e        vldr    d31, [r0, #184] ; 0xb8
   8:   ecd02b04        vldmia  r0, {d18-d19}         
   c:   ed2d8b10        vpush   {d8-d15}              
  10:   ed904b30        vldr    d4, [r0, #192]  ; 0xc0
  14:   ed905b32        vldr    d5, [r0, #200]  ; 0xc8
  18:   e24dd020        sub     sp, sp, #32           
  1c:   f22ec8c0        vadd.i32        q6, q15, q0   
  20:   f26228c0        vadd.i32        q9, q9, q0    
  24:   edd04b08        vldr    d20, [r0, #32]        
  28:   edd05b0a        vldr    d21, [r0, #40]  ; 0x28
  2c:   edd0cb18        vldr    d28, [r0, #96]  ; 0x60
  30:   edd0db1a        vldr    d29, [r0, #104] ; 0x68
  34:   f264e840        vadd.i32        q15, q2, q0   
  38:   f26448c0        vadd.i32        q10, q10, q0  
  3c:   f26cc8c0        vadd.i32        q14, q14, q0  
  40:   edd00b04        vldr    d16, [r0, #16]        
  44:   edd01b06        vldr    d17, [r0, #24]        
  48:   edd0ab14        vldr    d26, [r0, #80]  ; 0x50
  4c:   edd0bb16        vldr    d27, [r0, #88]  ; 0x58
  50:   ec8dcb04        vstmia  sp, {d12-d13}         
  54:   f222c1d2        vorr    q6, q9, q1            
  58:   f26008c0        vadd.i32        q8, q8, q0    
  5c:   f26aa8c0        vadd.i32        q13, q13, q0  
  60:   edd06b0c        vldr    d22, [r0, #48]  ; 0x30
  64:   edd07b0e        vldr    d23, [r0, #56]  ; 0x38
  68:   edd08b10        vldr    d24, [r0, #64]  ; 0x40
  6c:   edd09b12        vldr    d25, [r0, #72]  ; 0x48
  70:   ed906b1c        vldr    d6, [r0, #112]  ; 0x70
  74:   ed907b1e        vldr    d7, [r0, #120]  ; 0x78
  78:   ed908b20        vldr    d8, [r0, #128]  ; 0x80
  7c:   ed909b22        vldr    d9, [r0, #136]  ; 0x88
  80:   ed90ab24        vldr    d10, [r0, #144] ; 0x90
  84:   ed90bb26        vldr    d11, [r0, #152] ; 0x98
  88:   ed90eb28        vldr    d14, [r0, #160] ; 0xa0
  8c:   ed90fb2a        vldr    d15, [r0, #168] ; 0xa8
  90:   edcdeb04        vstr    d30, [sp, #16]        
  94:   edcdfb06        vstr    d31, [sp, #24]        
  98:   ec80cb04        vstmia  r0, {d12-d13}         
  9c:   f224c1d2        vorr    q6, q10, q1           
  a0:   f26c41d2        vorr    q10, q14, q1          
  a4:   ecddcb04        vldmia  sp, {d28-d29}         
  a8:   f26021d2        vorr    q9, q8, q1            
  ac:   f26888c0        vadd.i32        q12, q12, q0  
  b0:   f26ae1d2        vorr    q15, q13, q1          
  b4:   f26668c0        vadd.i32        q11, q11, q0  
  b8:   f26ca1d2        vorr    q13, q14, q1          
  bc:   f2266840        vadd.i32        q3, q3, q0    
  c0:   f2288840        vadd.i32        q4, q4, q0    
  c4:   f22aa840        vadd.i32        q5, q5, q0    
  c8:   f22ee840        vadd.i32        q7, q7, q0    
  cc:   edddcb04        vldr    d28, [sp, #16]        
  d0:   eddddb06        vldr    d29, [sp, #24]        
  d4:   f22601d2        vorr    q0, q11, q1           
  d8:   f22841d2        vorr    q2, q12, q1           
  dc:   f2680152        vorr    q8, q4, q1            
  e0:   f26a6152        vorr    q11, q5, q1           
  e4:   f26e8152        vorr    q12, q7, q1           
  e8:   edc02b04        vstr    d18, [r0, #16]        
  ec:   edc03b06        vstr    d19, [r0, #24]        
  f0:   f2662152        vorr    q9, q3, q1            
  f4:   f22c21d2        vorr    q1, q14, q1           
  f8:   ed80cb08        vstr    d12, [r0, #32]        
  fc:   ed80db0a        vstr    d13, [r0, #40]  ; 0x28
 100:   ed800b0c        vstr    d0, [r0, #48]   ; 0x30
 104:   ed801b0e        vstr    d1, [r0, #56]   ; 0x38
 108:   ed804b10        vstr    d4, [r0, #64]   ; 0x40
 10c:   ed805b12        vstr    d5, [r0, #72]   ; 0x48
 110:   edc0eb14        vstr    d30, [r0, #80]  ; 0x50
 114:   edc0fb16        vstr    d31, [r0, #88]  ; 0x58
 118:   edc04b18        vstr    d20, [r0, #96]  ; 0x60
 11c:   edc05b1a        vstr    d21, [r0, #104] ; 0x68
 120:   edc02b1c        vstr    d18, [r0, #112] ; 0x70
 124:   edc03b1e        vstr    d19, [r0, #120] ; 0x78
 128:   edc00b20        vstr    d16, [r0, #128] ; 0x80
 12c:   edc01b22        vstr    d17, [r0, #136] ; 0x88
 130:   edc06b24        vstr    d22, [r0, #144] ; 0x90
 134:   edc07b26        vstr    d23, [r0, #152] ; 0x98
 138:   edc08b28        vstr    d24, [r0, #160] ; 0xa0
 13c:   edc09b2a        vstr    d25, [r0, #168] ; 0xa8
 140:   edc0ab2c        vstr    d26, [r0, #176] ; 0xb0
 144:   edc0bb2e        vstr    d27, [r0, #184] ; 0xb8
 148:   ed802b30        vstr    d2, [r0, #192]  ; 0xc0
 14c:   ed803b32        vstr    d3, [r0, #200]  ; 0xc8
 150:   e28dd020        add     sp, sp, #32
 154:   ecbd8b10        vpop    {d8-d15}
 158:   e12fff1e        bx      lr

This shows multiple performance problems:
1. The use of inherently slower VLDR/VSTR instructions instead of VLD1/VST1
2. Failure to make proper use of ARM Cortex-A8 NEON LS/ALU dual issue
3. Unnecessary spills to stack

This is a general issue with NEON intrinsics, causing serious performance
problems for practically any nontrivial code. I guess this itself can be a
meta-bug, with each individual performance issue tracked separately.


-- 
           Summary: Poor instructions selection, scheduling and registers
                    allocation for ARM NEON intrinsics
           Product: gcc
           Version: 4.5.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: siarhei dot siamashka at gmail dot com
GCC target triplet: armv7l-unknown-linux-gnueabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/43725] Poor instructions selection, scheduling and registers allocation for ARM NEON intrinsics
  2010-04-12  7:27 [Bug target/43725] New: Poor instructions selection, scheduling and registers allocation for ARM NEON intrinsics siarhei dot siamashka at gmail dot com
@ 2010-05-11  7:35 ` ramana at gcc dot gnu dot org
  0 siblings, 0 replies; 13+ messages in thread
From: ramana at gcc dot gnu dot org @ 2010-05-11  7:35 UTC (permalink / raw)
  To: gcc-bugs



-- 

ramana at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever Confirmed|0                           |1
           Keywords|                            |missed-optimization
   Last reconfirmed|0000-00-00 00:00:00         |2010-05-11 07:35:23
               date|                            |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/43725] Poor instructions selection, scheduling and registers allocation for ARM NEON intrinsics
       [not found] <bug-43725-4@http.gcc.gnu.org/bugzilla/>
                   ` (9 preceding siblings ...)
  2014-08-20 16:44 ` mkuvyrkov at gcc dot gnu.org
@ 2021-09-27  7:21 ` pinskia at gcc dot gnu.org
  10 siblings, 0 replies; 13+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-09-27  7:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/43725] Poor instructions selection, scheduling and registers allocation for ARM NEON intrinsics
       [not found] <bug-43725-4@http.gcc.gnu.org/bugzilla/>
                   ` (8 preceding siblings ...)
  2014-07-29 11:46 ` m.zakirov at samsung dot com
@ 2014-08-20 16:44 ` mkuvyrkov at gcc dot gnu.org
  2021-09-27  7:21 ` pinskia at gcc dot gnu.org
  10 siblings, 0 replies; 13+ messages in thread
From: mkuvyrkov at gcc dot gnu.org @ 2014-08-20 16:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725

Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
                 CC|                            |mkuvyrkov at gcc dot gnu.org
           Assignee|unassigned at gcc dot gnu.org      |mkuvyrkov at gcc dot gnu.org


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/43725] Poor instructions selection, scheduling and registers allocation for ARM NEON intrinsics
       [not found] <bug-43725-4@http.gcc.gnu.org/bugzilla/>
                   ` (7 preceding siblings ...)
  2014-07-29 11:35 ` m.zakirov at samsung dot com
@ 2014-07-29 11:46 ` m.zakirov at samsung dot com
  2014-08-20 16:44 ` mkuvyrkov at gcc dot gnu.org
  2021-09-27  7:21 ` pinskia at gcc dot gnu.org
  10 siblings, 0 replies; 13+ messages in thread
From: m.zakirov at samsung dot com @ 2014-07-29 11:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725

--- Comment #9 from Marat Zakirov <m.zakirov at samsung dot com> ---
I used following patch

diff --git a/gcc/cse.c b/gcc/cse.c
index 34f9364..a9e0442 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -2862,6 +2862,9 @@ canon_reg (rtx x, rtx insn)
            || ! REGNO_QTY_VALID_P (REGNO (x)))
          return x;

+        if (GET_MODE (x) == XImode)
+          return x;
+
        q = REG_QTY (REGNO (x));
        ent = &qty_table[q];
        first = ent->first_reg;
diff --git a/gcc/fwprop.c b/gcc/fwprop.c
index 547fcd6..eadc729 100644
--- a/gcc/fwprop.c
+++ b/gcc/fwprop.c
@@ -1317,6 +1317,9 @@ forward_propagate_and_simplify (df_ref use, rtx def_insn,
rtx def_set)
   if (!new_rtx)
     return false;

+  if (GET_MODE (reg) == XImode)
+    return false;
+
   return try_fwprop_subst (use, loc, new_rtx, def_insn, set_reg_equal);
 }


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/43725] Poor instructions selection, scheduling and registers allocation for ARM NEON intrinsics
       [not found] <bug-43725-4@http.gcc.gnu.org/bugzilla/>
                   ` (6 preceding siblings ...)
  2014-07-09 12:26 ` m.zakirov at samsung dot com
@ 2014-07-29 11:35 ` m.zakirov at samsung dot com
  2014-07-29 11:46 ` m.zakirov at samsung dot com
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: m.zakirov at samsung dot com @ 2014-07-29 11:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725

--- Comment #8 from Marat Zakirov <m.zakirov at samsung dot com> ---
UPDATE

Using little fix you may got a much better code...

transpose_16x16:
        .fnstart
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        add     r2, r0, #128
        vld4.16 {d24, d26, d28, d30}, [r0]
        add     r1, r0, #160
        vld4.16 {d16, d18, d20, d22}, [r2]
        add     r0, r0, #32
        movw    r3, #:lower16:m1
        vldr    d6, .L2
        vldr    d7, .L2+8(in CSE)
        movw    r2, #:lower16:m0
        movt    r3, #:upper16:m1
        movt    r2, #:upper16:m0
        vld4.16 {d25, d27, d29, d31}, [r0]
        vld4.16 {d17, d19, d21, d23}, [r1]
        vmul.i16        q12, q3, q12
        vmul.i16        q8, q3, q8
        vmul.i16        q13, q3, q13
        vmul.i16        q9, q3, q9
        vmul.i16        q14, q3, q14
        vmul.i16        q10, q3, q10
        vmul.i16        q15, q3, q15
        vmul.i16        q11, q3, q11
        vstmia  r2, {d24-d31}
        vstmia  r3, {d16-d23}
        bx      lr
.L3:

About fix:

I discovered that GCC register allocator has 'weak' support for stream (in my
case NEON) registers. RA works with stream resgisters as with unsplitible
ranges. So if some register of range becomes free GCC do not reuse them untill
whole range becomes free.

Is actually OK, but...

I found that GCC CSE phase makes partly substitution for register-ranges and
this leads to terrible register pressure increse.

Example

Before CSE

a = b
a0 = a0 * 3
a1 = a1 * 3
a2 = a2 * 3
a3 = a3 * 3

After

a = b
a0 = b0 * 3 
a1 = a1 * 3 <<< *
a2 = a2 * 3
a3 = a3 * 3

CSE do not substitute b1 to a1 because at the moment (*) a0 was define so
actually a != b. Yes but a1 = b1, unfortuanatly CSE also do not how to handle
register-ranges parts as RA does. And I am not sure that 'unfortuanatly'.

Because.

a0 = b0 * 3 
a1 = b1 * 3
a2 = b2 * 3
a3 = b3 * 3

Also requres x2 more stream registers than its really need to.

My solution here is to forbid CSE for XImode registers.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/43725] Poor instructions selection, scheduling and registers allocation for ARM NEON intrinsics
       [not found] <bug-43725-4@http.gcc.gnu.org/bugzilla/>
                   ` (5 preceding siblings ...)
  2011-06-29 13:35 ` siarhei.siamashka at gmail dot com
@ 2014-07-09 12:26 ` m.zakirov at samsung dot com
  2014-07-29 11:35 ` m.zakirov at samsung dot com
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: m.zakirov at samsung dot com @ 2014-07-09 12:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725

Marat Zakirov <m.zakirov at samsung dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |joseph at codesourcery dot com,
                   |                            |m.zakirov at samsung dot com

--- Comment #7 from Marat Zakirov <m.zakirov at samsung dot com> ---
Another neon alloc issue.

Code:

#include <arm_neon.h>
#include <inttypes.h>

extern  uint16x8x4_t m0;
extern  uint16x8x4_t m1;

void foo(uint16_t * in_ptr)
{
    uint16x8x4_t t0, t1;
    t0 = vld4q_u16((uint16_t *)&in_ptr[0 ]);
    t1 = vld4q_u16((uint16_t *)&in_ptr[64]);
    t0.val[0] *= 333;
    t0.val[1] *= 333;
    t0.val[2] *= 333;
    t0.val[3] *= 333;
    t1.val[0] *= 333;
    t1.val[1] *= 333;
    t1.val[2] *= 333;
    t1.val[3] *= 333;
    m0 = t0;
    m1 = t1;
}

Asm file:

       .vsave {d8, d9, d10, d11, d12, d13, d14, d15}
        add     r1, r0, #160
        vld4.16 {d8, d10, d12, d14}, [r0]
        add     r0, r0, #32
        .pad #64
        sub     sp, sp, #64
        vld4.16 {d16, d18, d20, d22}, [r2]
        movw    r3, #:lower16:m1
        movw    r2, #:lower16:m0
        vldr    d6, .L3
        vldr    d7, .L3+8
        movt    r3, #:upper16:m1
        movt    r2, #:upper16:m0
        vld4.16 {d9, d11, d13, d15}, [r0]
        vld4.16 {d17, d19, d21, d23}, [r1]
        vmul.i16        q12, q3, q4
        vstmia  sp, {d16-d23}      <<< *
        vld1.64 {d4-d5}, [sp:64]   <<< *
        vmul.i16        q13, q3, q5  <<< **
        vmul.i16        q9, q3, q9   
        vmul.i16        q14, q3, q6  <<< **
        vmul.i16        q10, q3, q10
        vmul.i16        q8, q3, q2   <<< **, ***
        vmul.i16        q15, q3, q7  <<< **
        vmul.i16        q11, q3, q11
        vstmia  r2, {d24-d31}
        vstmia  r3, {d16-d23}
        add     sp, sp, #64
        @ sp needed
        fldmfdd sp!, {d8-d15}
        bx      lr

So my qustion are:
1) Why do we need * and why compiler used q2 in *** ?
2) Why compiler didn't reuse registers q5,q6,q2,q7 in ** ?

Command line:

cc1 -quiet -v t.c -quiet -dumpbase t.c -mfpu=neon -mcpu=cortex-a15
-mfloat-abi=softfp -marm -mtls-dialect=gnu -auxbase-strip t.s -O3
-Wno-error=unused-local-typedefs -version -fdump-tree-all -fdump-rtl-all
-funwind-tables -o t.s

gcc version = 4.10.0
--build=x86_64-pc-linux-gnu
--host=x86_64-pc-linux-gnu
--target=arm-v7a15v5r2-linux-gnueabi

--Marat


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/43725] Poor instructions selection, scheduling and registers allocation for ARM NEON intrinsics
       [not found] <bug-43725-4@http.gcc.gnu.org/bugzilla/>
                   ` (4 preceding siblings ...)
  2010-10-08 14:13 ` siarhei.siamashka at gmail dot com
@ 2011-06-29 13:35 ` siarhei.siamashka at gmail dot com
  2014-07-09 12:26 ` m.zakirov at samsung dot com
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: siarhei.siamashka at gmail dot com @ 2011-06-29 13:35 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725

--- Comment #6 from Siarhei Siamashka <siarhei.siamashka at gmail dot com> 2011-06-29 13:35:13 UTC ---
Created attachment 24630
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24630
test.c

Attached a slightly updated testcase, which can demonstrate unnecessary spills
to stack even with more recent versions of gcc as explained in comment 2
earlier (just slightly increased the number of uses for X() macro)


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/43725] Poor instructions selection, scheduling and registers allocation for ARM NEON intrinsics
       [not found] <bug-43725-4@http.gcc.gnu.org/bugzilla/>
                   ` (3 preceding siblings ...)
  2010-10-05  7:16 ` ramana at gcc dot gnu.org
@ 2010-10-08 14:13 ` siarhei.siamashka at gmail dot com
  2011-06-29 13:35 ` siarhei.siamashka at gmail dot com
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: siarhei.siamashka at gmail dot com @ 2010-10-08 14:13 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725

--- Comment #5 from Siarhei Siamashka <siarhei.siamashka at gmail dot com> 2010-10-08 14:13:08 UTC ---
(In reply to comment #3)
> On Mon, 4 Oct 2010, siarhei.siamashka at gmail dot com wrote:
> 
> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725
> > 
> > --- Comment #2 from Siarhei Siamashka <siarhei.siamashka at gmail dot com> 2010-10-04 22:59:56 UTC ---
> > (In reply to comment #1)
> > > So the compiler is correct not to be using vld1 for this code.  The memory
> > > format of int32x4_t is defined to be the format of a neon register that has
> > > been filled from an array of int32 values and then stored to memory using VSTM
> > > (or equivalent sequence).  The implication of all this is that int32x4_t does
> > > not (necessarily) have the same memory layout as int32_t[4].
> > 
> > Could you elaborate on this? Specifically about the case when memory format for
> > VSTM and VST1 may differ.
> 
> Big-endian.

OK, I see. Looks like VLDM/VSTM instructions could be replaced with VLD1/VST1
(by artificially forcing element size to 64) in almost all cases except when
SCTLR.A == 1 due to unwanted alignment traps potentially happening in this
case.

But the question is whether it is really necessary to suffer from a performance
penalty on little endian systems?

> I previously explained the issues with big-endian NEON vectors in GCC at 
> length:
> 
> http://gcc.gnu.org/ml/gcc-patches/2010-06/msg00409.html

Thanks for the link, something seems to be seriously overengineered. Looks like
you brought a problem upon yourself and now are trying to valiantly solve it.

Does (efficient) support of NEON intrinsics on big endian systems even have any
practical value? Maybe it makes sense to get a reasonable performance at least
on little endian systems first. To me it looks like you are just running after
two hares...

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/43725] Poor instructions selection, scheduling and registers allocation for ARM NEON intrinsics
       [not found] <bug-43725-4@http.gcc.gnu.org/bugzilla/>
                   ` (2 preceding siblings ...)
  2010-10-04 23:46 ` joseph at codesourcery dot com
@ 2010-10-05  7:16 ` ramana at gcc dot gnu.org
  2010-10-08 14:13 ` siarhei.siamashka at gmail dot com
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: ramana at gcc dot gnu.org @ 2010-10-05  7:16 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725

--- Comment #4 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> 2010-10-05 07:16:35 UTC ---
(In reply to comment #2)
> (In reply to comment #1)
> > So the compiler is correct not to be using vld1 for this code.  The memory
> > format of int32x4_t is defined to be the format of a neon register that has
> > been filled from an array of int32 values and then stored to memory using VSTM
> > (or equivalent sequence).  The implication of all this is that int32x4_t does
> > not (necessarily) have the same memory layout as int32_t[4].
> 
> Could you elaborate on this? Specifically about the case when memory format for
> VSTM and VST1 may differ.
> 
> I thought that VST1 instruction could be always used as a replacement for VSTM,
> it is just a little bit less convenient in some cases because it is lacking
> some more advanced addressing modes. Moreover, VSTM is VFP instruction and VST1
> is NEON one. So I guess mixing VSTM with true NEON instructions may be
> additionally a bad idea (for performance reasons on Cortex-A9 or other
> processors?).

The ARM ARM states that VLDM / VSTM and VLDR / VSTR for 64 bit values are
compliant with VFPv2 / VFPv3 and advanced SIMD i.e. they can be executed by
both the units . Thus there should be no performance regressions on the A9
AFAIK for VLDM and VSTM / VLDR and VSTR of 64 bit registers interleaved with
other Neon instructions. 


cheers
Ramana


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/43725] Poor instructions selection, scheduling and registers allocation for ARM NEON intrinsics
       [not found] <bug-43725-4@http.gcc.gnu.org/bugzilla/>
  2010-09-29 20:50 ` rearnsha at gcc dot gnu.org
  2010-10-04 23:00 ` siarhei.siamashka at gmail dot com
@ 2010-10-04 23:46 ` joseph at codesourcery dot com
  2010-10-05  7:16 ` ramana at gcc dot gnu.org
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: joseph at codesourcery dot com @ 2010-10-04 23:46 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725

--- Comment #3 from joseph at codesourcery dot com <joseph at codesourcery dot com> 2010-10-04 23:45:57 UTC ---
On Mon, 4 Oct 2010, siarhei.siamashka at gmail dot com wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725
> 
> --- Comment #2 from Siarhei Siamashka <siarhei.siamashka at gmail dot com> 2010-10-04 22:59:56 UTC ---
> (In reply to comment #1)
> > So the compiler is correct not to be using vld1 for this code.  The memory
> > format of int32x4_t is defined to be the format of a neon register that has
> > been filled from an array of int32 values and then stored to memory using VSTM
> > (or equivalent sequence).  The implication of all this is that int32x4_t does
> > not (necessarily) have the same memory layout as int32_t[4].
> 
> Could you elaborate on this? Specifically about the case when memory format for
> VSTM and VST1 may differ.

Big-endian.

I previously explained the issues with big-endian NEON vectors in GCC at 
length:

http://gcc.gnu.org/ml/gcc-patches/2010-06/msg00409.html


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/43725] Poor instructions selection, scheduling and registers allocation for ARM NEON intrinsics
       [not found] <bug-43725-4@http.gcc.gnu.org/bugzilla/>
  2010-09-29 20:50 ` rearnsha at gcc dot gnu.org
@ 2010-10-04 23:00 ` siarhei.siamashka at gmail dot com
  2010-10-04 23:46 ` joseph at codesourcery dot com
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: siarhei.siamashka at gmail dot com @ 2010-10-04 23:00 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725

--- Comment #2 from Siarhei Siamashka <siarhei.siamashka at gmail dot com> 2010-10-04 22:59:56 UTC ---
(In reply to comment #1)
> So the compiler is correct not to be using vld1 for this code.  The memory
> format of int32x4_t is defined to be the format of a neon register that has
> been filled from an array of int32 values and then stored to memory using VSTM
> (or equivalent sequence).  The implication of all this is that int32x4_t does
> not (necessarily) have the same memory layout as int32_t[4].

Could you elaborate on this? Specifically about the case when memory format for
VSTM and VST1 may differ.

I thought that VST1 instruction could be always used as a replacement for VSTM,
it is just a little bit less convenient in some cases because it is lacking
some more advanced addressing modes. Moreover, VSTM is VFP instruction and VST1
is NEON one. So I guess mixing VSTM with true NEON instructions may be
additionally a bad idea (for performance reasons on Cortex-A9 or other
processors?).

There also used to be FLDMX/FSTMX instructions, but they are deprecated now. I
believe they existed specifically to reserve the use of normal VFP load/store
instructions for floating point data formats only, but later this turned out to
be unnecessary.

> arm_neon.h provides intrinsics for filling neon registers from arrays in
> memory, and in this case I think you should be using these directly.  That is,
> your macro should be modified to contain:
> 
> #define X(n) {int32x4_t v; v = vld1q_s32((const int32_t*)&p[n]); v =
> vaddq_s32(v, a); v = vorrq_s32(v, b); vst1q_s32 ((int32_t*)&p[n], v);}

I'm sorry, but this looks like a completely unjustified limitation to me. Why
intrinsics should be so much more difficult and less intuitive to use than just
inline assembly? Additionally, gcc allows to use normal arithmetic operations
on vector data types, something like:

void x(int32x4_t a, int32x4_t b, int32x4_t *p)
{
        #define X(n) p[n] += a; p[n] |= b;
        X(0); X(1); X(2); X(3); X(4); X(5); X(6); X(7);
        X(8); X(9); X(10); X(11); X(12);
}

> There are still problems after doing this, however.  In particular the compiler
> is not correctly tracking alias information for the load/store intrinsics,
> which means it is unable to move stores past loads to reduce stalls in the
> pipeline.

OK, thanks for the explanation.

> The stack wastage appears to be fixed in trunk gcc; at least I don't see any
> stack allocation for your testcase.

Yes, looks like it got a little bit better. Anyway stack allocation shows up
again after adding just a few more of these X() macros:
... X(13); X(14); X(15); X(16); ...

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/43725] Poor instructions selection, scheduling and registers allocation for ARM NEON intrinsics
       [not found] <bug-43725-4@http.gcc.gnu.org/bugzilla/>
@ 2010-09-29 20:50 ` rearnsha at gcc dot gnu.org
  2010-10-04 23:00 ` siarhei.siamashka at gmail dot com
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: rearnsha at gcc dot gnu.org @ 2010-09-29 20:50 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725

Richard Earnshaw <rearnsha at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2010-05-11 07:35:23         |2010-09-29 7:35:23
               date|                            |
                 CC|                            |rearnsha at gcc dot gnu.org

--- Comment #1 from Richard Earnshaw <rearnsha at gcc dot gnu.org> 2010-09-29 16:28:17 UTC ---
So the compiler is correct not to be using vld1 for this code.  The memory
format of int32x4_t is defined to be the format of a neon register that has
been filled from an array of int32 values and then stored to memory using VSTM
(or equivalent sequence).  The implication of all this is that int32x4_t does
not (necessarily) have the same memory layout as int32_t[4].

arm_neon.h provides intrinsics for filling neon registers from arrays in
memory, and in this case I think you should be using these directly.  That is,
your macro should be modified to contain:

#define X(n) {int32x4_t v; v = vld1q_s32((const int32_t*)&p[n]); v =
vaddq_s32(v, a); v = vorrq_s32(v, b); vst1q_s32 ((int32_t*)&p[n], v);}

There are still problems after doing this, however.  In particular the compiler
is not correctly tracking alias information for the load/store intrinsics,
which means it is unable to move stores past loads to reduce stalls in the
pipeline.

The stack wastage appears to be fixed in trunk gcc; at least I don't see any
stack allocation for your testcase.

I haven't looked into the scheduling issues at this time.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-09-27  7:21 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-04-12  7:27 [Bug target/43725] New: Poor instructions selection, scheduling and registers allocation for ARM NEON intrinsics siarhei dot siamashka at gmail dot com
2010-05-11  7:35 ` [Bug target/43725] " ramana at gcc dot gnu dot org
     [not found] <bug-43725-4@http.gcc.gnu.org/bugzilla/>
2010-09-29 20:50 ` rearnsha at gcc dot gnu.org
2010-10-04 23:00 ` siarhei.siamashka at gmail dot com
2010-10-04 23:46 ` joseph at codesourcery dot com
2010-10-05  7:16 ` ramana at gcc dot gnu.org
2010-10-08 14:13 ` siarhei.siamashka at gmail dot com
2011-06-29 13:35 ` siarhei.siamashka at gmail dot com
2014-07-09 12:26 ` m.zakirov at samsung dot com
2014-07-29 11:35 ` m.zakirov at samsung dot com
2014-07-29 11:46 ` m.zakirov at samsung dot com
2014-08-20 16:44 ` mkuvyrkov at gcc dot gnu.org
2021-09-27  7:21 ` pinskia at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).