public inbox for gcc-bugs@sourceware.org
* [Bug c/49362] New: Arm Neon intrinsic types not correctly interpreted by compiler.
@ 2011-06-10 11:35 mark.pupilli at dyson dot com
From: mark.pupilli at dyson dot com @ 2011-06-10 11:35 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49362

           Summary: Arm Neon intrinsic types not correctly interpreted by
                    compiler.
           Product: gcc
           Version: 4.4.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: mark.pupilli@dyson.com


Created attachment 24485
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24485
C file with 2 functions that show the bug when compiled.

The ARM NEON intrinsics header (arm_neon.h) defines the type uint32x4x2_t as

typedef struct uint32x4x2_t { uint32x4_t val[2]; } uint32x4x2_t;

This is interpreted by the compiler literally as a struct. This should not be
the case. The compiler should treat it as a pair of registers, just as it
treats uint32_t as a single register and not an array of 4 x uint32_t.
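
To make the data layout concrete, the typedef above can be pictured with a portable scalar model. This is a sketch, not the arm_neon.h implementation: the names u32x4_model, u32x4x2_model and vld2q_u32_model are hypothetical, and only the layout is mirrored. vld2q_u32 reads eight consecutive words and de-interleaves them, so val[0] receives the even-indexed elements and val[1] the odd-indexed ones. On NEON hardware each val[i] corresponds to one Q register, which is why the report expects no round trip through memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the NEON types: one "vector" is four
   32-bit lanes, and the x2 type holds two of them in val[0]/val[1],
   mirroring the shape of the arm_neon.h typedef. */
typedef struct { uint32_t lane[4]; } u32x4_model;
typedef struct { u32x4_model val[2]; } u32x4x2_model;

/* Scalar model of vld2q_u32: load eight words and de-interleave them. */
static u32x4x2_model vld2q_u32_model(const uint32_t *p)
{
    u32x4x2_model r;
    assert(p != NULL);
    for (int i = 0; i < 4; i++) {
        r.val[0].lane[i] = p[2 * i];     /* elements 0, 2, 4, 6 */
        r.val[1].lane[i] = p[2 * i + 1]; /* elements 1, 3, 5, 7 */
    }
    return r;
}
```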

The attached C file contains two versions of the same function: one that uses
quad word loads (vld1q) and one that uses double quad word loads (vld2q). The
function that uses double quad word loads should take 2 instructions fewer, but
it is actually 44 instructions long compared to 19 for the vld1q version. (Both
functions compute the same result.)
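
The attachment itself is not reproduced in this thread. Judging from the disassembly (two 16-byte loads per pointer, veor, vcnt.8 and pairwise adds), both functions appear to compute the Hamming distance between two 256-bit values stored as eight 32-bit words each. Under that assumption, a scalar reference such as the following (hamming_distance_ref is a hypothetical name) is a simple way to check that the two NEON variants agree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference: Hamming distance between two
   256-bit values given as eight uint32_t words each. */
static uint32_t hamming_distance_ref(const uint32_t *a, const uint32_t *b)
{
    uint32_t total = 0;
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < 8; i++) {
        uint32_t x = a[i] ^ b[i];  /* bits that differ in this word */
        while (x != 0) {           /* portable popcount */
            x &= x - 1;            /* clear the lowest set bit */
            total++;
        }
    }
    return total;
}
```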

I believe this bug arises because the compiler treats the following as an
array access instead of a reference into the register file:

uint32x4x2_t A = vld2q_u32 ( a );
A.val[0]; // This statement should be treated as a reference to a register,
          // not an array access!
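
A hedged reconstruction of the vld2q variant, inferred from the disassembly rather than copied from the attachment: the function name hamming_distance_vld2q appears in the trunk listing later in this thread, but the body is an assumption. The relevant pattern is the one described above: A.val[0] and A.val[1] are passed straight to veorq_u32, so ideally they never leave the Q registers. Because both inputs are de-interleaved the same way, XORing val[i] with val[i] still pairs a[k] with b[k], so the result is the same as with vld1q. The #else branch is only a portable fallback so the sketch builds on hosts without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical vld2q variant: each input is eight uint32_t words. */
static uint32_t hamming_distance_vld2q(const uint32_t *a, const uint32_t *b)
{
    assert(a != NULL && b != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint32x4x2_t A = vld2q_u32(a);  /* val[0] = even words, val[1] = odd words */
    uint32x4x2_t B = vld2q_u32(b);
    /* Use the struct members directly as vector operands. */
    uint8x16_t c0 = vcntq_u8(vreinterpretq_u8_u32(veorq_u32(A.val[0], B.val[0])));
    uint8x16_t c1 = vcntq_u8(vreinterpretq_u8_u32(veorq_u32(A.val[1], B.val[1])));
    /* Per-byte counts (at most 16 after the add) widened to 4 x u32. */
    uint32x4_t s32 = vpaddlq_u16(vpaddlq_u8(vaddq_u8(c0, c1)));
    uint32x2_t s = vpadd_u32(vget_low_u32(s32), vget_high_u32(s32));
    s = vpadd_u32(s, s);            /* total now in both lanes */
    return vget_lane_u32(s, 0);
#else
    /* Portable fallback so the sketch compiles on non-NEON hosts. */
    uint32_t total = 0;
    for (size_t i = 0; i < 8; i++) {
        uint32_t x = a[i] ^ b[i];
        while (x != 0) { x &= x - 1; total++; }
    }
    return total;
#endif
}
```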

Assembly for the vld2q version follows. I am new to ARM assembly, so I may be
mistaken, but it appears to perform the double quad word loads in the NEON
pipeline, store the registers to the stack, copy them through the ARM core
registers (ldm/stm) as if indexing arrays, and then reload them into the NEON
pipeline again:

vld2q variant, 44 instructions:

00000014 <_ZN4Neon16hamming_distanceEPjS0_>:
  14:    e92d0070     push    {r4, r5, r6}
  18:    e24dd084     sub    sp, sp, #132    ; 0x84
  1c:    f460c38f     vld2.32    {d28-d31}, [r0]
  20:    e28d6020     add    r6, sp, #32
  24:    ecc6cb08     vstmia    r6, {d28-d31}
  28:    e1a0c001     mov    ip, r1
  2c:    e8b6000f     ldm    r6!, {r0, r1, r2, r3}
  30:    e28d4060     add    r4, sp, #96    ; 0x60
  34:    e1a05004     mov    r5, r4
  38:    f46c038f     vld2.32    {d16-d19}, [ip]
  3c:    e8a5000f     stmia    r5!, {r0, r1, r2, r3}
  40:    eccd0b08     vstmia    sp, {d16-d19}
  44:    e896000f     ldm    r6, {r0, r1, r2, r3}
  48:    e1a0c00d     mov    ip, sp
  4c:    e28d4040     add    r4, sp, #64    ; 0x40
  50:    e885000f     stm    r5, {r0, r1, r2, r3}
  54:    e1a05004     mov    r5, r4
  58:    e8bc000f     ldm    ip!, {r0, r1, r2, r3}
  5c:    e8a5000f     stmia    r5!, {r0, r1, r2, r3}
  60:    e89c000f     ldm    ip, {r0, r1, r2, r3}
  64:    e885000f     stm    r5, {r0, r1, r2, r3}
  68:    eddd4b10     vldr    d20, [sp, #64]    ; 0x40
  6c:    eddd5b12     vldr    d21, [sp, #72]    ; 0x48
  70:    edddab18     vldr    d26, [sp, #96]    ; 0x60
  74:    edddbb1a     vldr    d27, [sp, #104]    ; 0x68
  78:    f34a61f4     veor    q11, q13, q10
  7c:    eddd8b14     vldr    d24, [sp, #80]    ; 0x50
  80:    eddd9b16     vldr    d25, [sp, #88]    ; 0x58
  84:    eddd4b1c     vldr    d20, [sp, #112]    ; 0x70
  88:    eddd5b1e     vldr    d21, [sp, #120]    ; 0x78
  8c:    f30461f8     veor    q3, q10, q12
  90:    f3f00546     vcnt.8    q8, q3
  94:    f3b04566     vcnt.8    q2, q11
  98:    f2042860     vadd.i8    q1, q2, q8
  9c:    f3f022c2     vpaddl.u8    q9, q1
  a0:    f3f422e2     vpaddl.u16    q9, q9
  a4:    f22201b2     vorr    d0, d18, d18
  a8:    f26321b3     vorr    d18, d19, d19
  ac:    f2620b90     vpadd.i32    d16, d18, d0
  b0:    f2600bb0     vpadd.i32    d16, d16, d16
  b4:    ee100b90     vmov.32    r0, d16[0]
  b8:    e28dd084     add    sp, sp, #132    ; 0x84
  bc:    e8bd0070     pop    {r4, r5, r6}
  c0:    e12fff1e     bx    lr
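
Apart from register numbering and the stack teardown (add sp / pop), the listing above from the first veor onward matches the tail of the vld1q listing below, so the extra instructions come from the spills and reloads around the vld2.32 loads. As an illustration, that shared tail can be modelled step by step in portable C. popcount_reduce_model is a hypothetical name, and the model follows the instruction sequence shown rather than any GCC internals.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the shared reduction tail:
   vcnt.8 -> vadd.i8 -> vpaddl.u8 -> vpaddl.u16 -> vpadd.i32 (twice).
   x and y are the two 16-byte XOR results (one per Q register). */
static uint32_t popcount_reduce_model(const uint8_t x[16], const uint8_t y[16])
{
    uint8_t bytes[16];
    uint16_t halves[8];
    uint32_t words[4];

    for (int i = 0; i < 16; i++) {
        uint8_t cx = 0, cy = 0;
        for (int bit = 0; bit < 8; bit++) {   /* vcnt.8: per-byte popcount */
            cx += (x[i] >> bit) & 1u;
            cy += (y[i] >> bit) & 1u;
        }
        bytes[i] = (uint8_t)(cx + cy);        /* vadd.i8: at most 16, no overflow */
    }
    for (int i = 0; i < 8; i++)               /* vpaddl.u8: add pairs, widen */
        halves[i] = (uint16_t)(bytes[2 * i] + bytes[2 * i + 1]);
    for (int i = 0; i < 4; i++)               /* vpaddl.u16: add pairs, widen */
        words[i] = (uint32_t)halves[2 * i] + halves[2 * i + 1];

    /* vpadd.i32 d16, d18, d0: fold the two D halves into two lanes */
    uint32_t lo = words[0] + words[1];
    uint32_t hi = words[2] + words[3];
    /* vpadd.i32 d16, d16, d16 then vmov.32 r0, d16[0] */
    return lo + hi;
}
```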

vld1q variant, only 19 instructions:

00000014 <_ZN4Neon16hamming_distanceEPjS0_>:
  14:    e2802010     add    r2, r0, #16
  18:    e2813010     add    r3, r1, #16
  1c:    f4606a8f     vld1.32    {d22-d23}, [r0]
  20:    f4624a8f     vld1.32    {d20-d21}, [r2]
  24:    f463aa8f     vld1.32    {d26-d27}, [r3]
  28:    f461ca8f     vld1.32    {d28-d29}, [r1]
  2c:    f34681fc     veor    q12, q11, q14
  30:    f30461fa     veor    q3, q10, q13
  34:    f3f00546     vcnt.8    q8, q3
  38:    f3b04568     vcnt.8    q2, q12
  3c:    f2042860     vadd.i8    q1, q2, q8
  40:    f3f022c2     vpaddl.u8    q9, q1
  44:    f3f422e2     vpaddl.u16    q9, q9
  48:    f22201b2     vorr    d0, d18, d18
  4c:    f26321b3     vorr    d18, d19, d19
  50:    f2620b90     vpadd.i32    d16, d18, d0
  54:    f2600bb0     vpadd.i32    d16, d16, d16
  58:    ee100b90     vmov.32    r0, d16[0]
  5c:    e12fff1e     bx    lr
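
For comparison, a hedged reconstruction of the vld1q variant. In the listing, the two add instructions with #16 form the second load addresses (16 bytes, i.e. 4 words past each pointer), which suggests plain loads of words 0-3 and 4-7 with no de-interleaving. The name hamming_distance_vld1q is taken from the trunk listing later in this thread; the body is an assumption, and the #else branch is only a portable fallback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical vld1q variant: two plain 128-bit loads per input. */
static uint32_t hamming_distance_vld1q(const uint32_t *a, const uint32_t *b)
{
    assert(a != NULL && b != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint32x4_t a0 = vld1q_u32(a), a1 = vld1q_u32(a + 4);  /* words 0-3, 4-7 */
    uint32x4_t b0 = vld1q_u32(b), b1 = vld1q_u32(b + 4);
    uint8x16_t c0 = vcntq_u8(vreinterpretq_u8_u32(veorq_u32(a0, b0)));
    uint8x16_t c1 = vcntq_u8(vreinterpretq_u8_u32(veorq_u32(a1, b1)));
    uint32x4_t s32 = vpaddlq_u16(vpaddlq_u8(vaddq_u8(c0, c1)));
    uint32x2_t s = vpadd_u32(vget_low_u32(s32), vget_high_u32(s32));
    s = vpadd_u32(s, s);            /* total now in both lanes */
    return vget_lane_u32(s, 0);
#else
    /* Portable fallback so the sketch compiles on non-NEON hosts. */
    uint32_t total = 0;
    for (size_t i = 0; i < 8; i++) {
        uint32_t x = a[i] ^ b[i];
        while (x != 0) { x &= x - 1; total++; }
    }
    return total;
#endif
}
```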



* [Bug c/49362] Arm Neon intrinsic types not correctly interpreted by compiler.
From: mark.pupilli at dyson dot com @ 2011-06-10 11:38 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49362

--- Comment #1 from mark.pupilli at dyson dot com 2011-06-10 11:38:40 UTC ---

There is a typo:

'treats uint32_t as a single register and not an array of 4 x uint32_t'

should read:

'treats uint32x4_t as a single register and not an array of 4 x uint32_t'



* [Bug c/49362] Arm Neon intrinsic types not correctly interpreted by compiler.
From: mark.pupilli at dyson dot com @ 2011-06-13 19:57 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49362

--- Comment #2 from mark.pupilli at dyson dot com 2011-06-13 19:56:43 UTC ---
The vld2q version should actually be 15 instructions, not 17 (2 fewer than the vld1q version's 19), as follows:

     vld2.32    {d20-d23}, [r0]
     vld2.32    {d26-d29}, [r1]
     veor       q12, q11, q14
     veor       q3, q10, q13
     vcnt.8     q8, q3
     vcnt.8     q2, q12
     vadd.i8    q1, q2, q8
     vpaddl.u8  q9, q1
     vpaddl.u16 q9, q9
     vorr       d0, d18, d18
     vorr       d18, d19, d19
     vpadd.i32  d16, d18, d0
     vpadd.i32  d16, d16, d16
     vmov.32    r0, d16[0]
     bx         lr



* [Bug c/49362] Arm Neon intrinsic types not correctly interpreted by compiler.
From: Greta.Yorsh at arm dot com @ 2011-06-14 12:59 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49362

Greta Yorsh <Greta.Yorsh at arm dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |Greta.Yorsh at arm dot com

--- Comment #3 from Greta Yorsh <Greta.Yorsh at arm dot com> 2011-06-14 12:59:11 UTC ---
It looks like the problem you described has already been fixed on trunk.

When the example is compiled with gcc from trunk (gcc version 4.7.0, -O2),
the vld1q variant has 15 instructions and the vld2q variant has 13, not
counting the final bx lr (see below).
You are using gcc 4.4.1. The issue is not fixed in the latest release, gcc 4.6,
but the fix should be included in the next release; it probably won't be
backported to the 4.5 and 4.6 releases.


Disassembly of section .text:

00000000 <hamming_distance_vld2q>:
   0:    f460438f     vld2.32    {d20-d23}, [r0]
   4:    f461038f     vld2.32    {d16-d19}, [r1]
   8:    f34481f0     veor    q12, q10, q8
   c:    f34601f2     veor    q8, q11, q9
  10:    f3f02568     vcnt.8    q9, q12
  14:    f3f00560     vcnt.8    q8, q8
  18:    f24208e0     vadd.i8    q8, q9, q8
  1c:    f3f002e0     vpaddl.u8    q8, q8
  20:    f3f402e0     vpaddl.u16    q8, q8
  24:    f26121b1     vorr    d18, d17, d17
  28:    f2620bb0     vpadd.i32    d16, d18, d16
  2c:    f2600bb0     vpadd.i32    d16, d16, d16
  30:    ee100b90     vmov.32    r0, d16[0]
  34:    e12fff1e     bx    lr

00000038 <hamming_distance_vld1q>:
  38:    f4602a8d     vld1.32    {d18-d19}, [r0]!
  3c:    f4610a8d     vld1.32    {d16-d17}, [r1]!
  40:    f34221f0     veor    q9, q9, q8
  44:    f4604a8f     vld1.32    {d20-d21}, [r0]
  48:    f3f02562     vcnt.8    q9, q9
  4c:    f4610a8f     vld1.32    {d16-d17}, [r1]
  50:    f34401f0     veor    q8, q10, q8
  54:    f3f00560     vcnt.8    q8, q8
  58:    f24208e0     vadd.i8    q8, q9, q8
  5c:    f3f002e0     vpaddl.u8    q8, q8
  60:    f3f402e0     vpaddl.u16    q8, q8
  64:    f26121b1     vorr    d18, d17, d17
  68:    f2620bb0     vpadd.i32    d16, d18, d16
  6c:    f2600bb0     vpadd.i32    d16, d16, d16
  70:    ee100b90     vmov.32    r0, d16[0]
  74:    e12fff1e     bx    lr



* [Bug c/49362] Arm Neon intrinsic types not correctly interpreted by compiler.
From: mark.pupilli at dyson dot com @ 2011-06-14 13:47 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49362

mark.pupilli at dyson dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|                            |FIXED

--- Comment #4 from mark.pupilli at dyson dot com 2011-06-14 13:46:51 UTC ---
OK, it looks like the optimiser will be better in 4.7 as well. Thank you!



* [Bug c/49362] Arm Neon intrinsic types not correctly interpreted by compiler.
From: ramana at gcc dot gnu.org @ 2011-06-14 14:26 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49362

Ramana Radhakrishnan <ramana at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ramana at gcc dot gnu.org
   Target Milestone|---                         |4.7.0

