From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugs-return-457314-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 27219 invoked by alias); 29 Jul 2014 11:35:49 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-bugs.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-help@gcc.gnu.org>
Sender: gcc-bugs-owner@gcc.gnu.org
Received: (qmail 27172 invoked by uid 48); 29 Jul 2014 11:35:46 -0000
From: "m.zakirov at samsung dot com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/43725] Poor instructions selection, scheduling and registers allocation for ARM NEON intrinsics
Date: Tue, 29 Jul 2014 11:35:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 4.5.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: enhancement
X-Bugzilla-Who: m.zakirov at samsung dot com
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields:
Message-ID: <bug-43725-4-xLJEOVrVfT@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-43725-4@http.gcc.gnu.org/bugzilla/>
References: <bug-43725-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2014-07/txt/msg01905.txt.bz2

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725

--- Comment #8 from Marat Zakirov <m.zakirov at samsung dot com> ---
UPDATE

Using little fix you may got a much better code...

transpose_16x16:
        .fnstart
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        add     r2, r0, #128
        vld4.16 {d24, d26, d28, d30}, [r0]
        add     r1, r0, #160
        vld4.16 {d16, d18, d20, d22}, [r2]
        add     r0, r0, #32
        movw    r3, #:lower16:m1
        vldr    d6, .L2
        vldr    d7, .L2+8(in CSE)
        movw    r2, #:lower16:m0
        movt    r3, #:upper16:m1
        movt    r2, #:upper16:m0
        vld4.16 {d25, d27, d29, d31}, [r0]
        vld4.16 {d17, d19, d21, d23}, [r1]
        vmul.i16        q12, q3, q12
        vmul.i16        q8, q3, q8
        vmul.i16        q13, q3, q13
        vmul.i16        q9, q3, q9
        vmul.i16        q14, q3, q14
        vmul.i16        q10, q3, q10
        vmul.i16        q15, q3, q15
        vmul.i16        q11, q3, q11
        vstmia  r2, {d24-d31}
        vstmia  r3, {d16-d23}
        bx      lr
.L3:

About fix:

I discovered that GCC register allocator has 'weak' support for stream (in my
case NEON) registers. RA works with stream resgisters as with unsplitible
ranges. So if some register of range becomes free GCC do not reuse them untill
whole range becomes free.

Is actually OK, but...

I found that GCC CSE phase makes partly substitution for register-ranges and
this leads to terrible register pressure increse.

Example

Before CSE

a = b
a0 = a0 * 3
a1 = a1 * 3
a2 = a2 * 3
a3 = a3 * 3

After

a = b
a0 = b0 * 3 
a1 = a1 * 3 <<< *
a2 = a2 * 3
a3 = a3 * 3

CSE do not substitute b1 to a1 because at the moment (*) a0 was define so
actually a != b. Yes but a1 = b1, unfortuanatly CSE also do not how to handle
register-ranges parts as RA does. And I am not sure that 'unfortuanatly'.

Because.

a0 = b0 * 3 
a1 = b1 * 3
a2 = b2 * 3
a3 = b3 * 3

Also requres x2 more stream registers than its really need to.

My solution here is to forbid CSE for XImode registers.