From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 27219 invoked by alias); 29 Jul 2014 11:35:49 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id:
List-Archive:
List-Post:
List-Help:
Sender: gcc-bugs-owner@gcc.gnu.org
Received: (qmail 27172 invoked by uid 48); 29 Jul 2014 11:35:46 -0000
From: "m.zakirov at samsung dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/43725] Poor instructions selection, scheduling and registers allocation for ARM NEON intrinsics
Date: Tue, 29 Jul 2014 11:35:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 4.5.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: enhancement
X-Bugzilla-Who: m.zakirov at samsung dot com
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields:
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2014-07/txt/msg01905.txt.bz2

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725

--- Comment #8 from Marat Zakirov ---
UPDATE

With a small fix you can get much better code:

transpose_16x16:
        .fnstart
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        add     r2, r0, #128
        vld4.16 {d24, d26, d28, d30}, [r0]
        add     r1, r0, #160
        vld4.16 {d16, d18, d20, d22}, [r2]
        add     r0, r0, #32
        movw    r3, #:lower16:m1
        vldr    d6, .L2
        vldr    d7, .L2+8(in CSE)
        movw    r2, #:lower16:m0
        movt    r3, #:upper16:m1
        movt    r2, #:upper16:m0
        vld4.16 {d25, d27, d29, d31}, [r0]
        vld4.16 {d17, d19, d21, d23}, [r1]
        vmul.i16        q12, q3, q12
        vmul.i16        q8, q3, q8
        vmul.i16        q13, q3, q13
        vmul.i16        q9, q3, q9
        vmul.i16        q14, q3, q14
        vmul.i16        q10, q3, q10
        vmul.i16        q15, q3, q15
        vmul.i16        q11, q3, q11
        vstmia  r2, {d24-d31}
        vstmia  r3, {d16-d23}
        bx      lr
.L3:

About the fix:

I discovered that the GCC register allocator has 'weak' support for stream
(in my case NEON) registers. The RA treats a group of stream registers as an
unsplittable range, so if one register of a range becomes free, GCC does not
reuse it until the whole range becomes free. That by itself is OK, but I found
that the GCC CSE phase performs partial substitution on register ranges, and
this leads to a terrible increase in register pressure.

Example

Before CSE:

a = b
a0 = a0 * 3
a1 = a1 * 3
a2 = a2 * 3
a3 = a3 * 3

After CSE:

a = b
a0 = b0 * 3
a1 = a1 * 3 <<< *
a2 = a2 * 3
a3 = a3 * 3

CSE does not substitute b1 for a1 because at point (*) a0 has already been
redefined, so a != b as a whole. It is still true that a1 = b1, but
unfortunately CSE does not know how to handle parts of register ranges, just
as the RA does not. And I am not sure 'unfortunately' is the right word,
because the fully substituted form

a0 = b0 * 3
a1 = b1 * 3
a2 = b2 * 3
a3 = b3 * 3

also requires twice as many stream registers as it really needs.

My solution here is to forbid CSE for XImode registers.
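
The register-pressure argument above can be sketched with a toy liveness model. This is only an illustration, not GCC's allocator or its CSE pass: the instruction encoding, the `peak_registers` helper and the extra load/store instructions around the example are all hypothetical, and a range is assumed to hold 4 registers. When ranges are treated as unsplittable, a range occupies all of its registers from its first definition to the last use of any of its parts.

```python
LANES = 4  # assumed number of registers per range in this toy model


def expand(operand):
    """Turn ("a", None) into every lane of a; ("a", 1) stays one lane."""
    name, lane = operand
    lanes = range(LANES) if lane is None else [lane]
    return [(name, i) for i in lanes]


def peak_registers(program, unsplittable):
    """Return the peak number of registers live between instructions.

    program is a list of (defs, uses) pairs. With unsplittable=True a
    whole range stays allocated until the last use of any of its lanes.
    """
    first_def, last_use = {}, {}
    for idx, (defs, uses) in enumerate(program):
        for op in uses:
            for name, lane in expand(op):
                key = name if unsplittable else (name, lane)
                last_use[key] = idx
        for op in defs:
            for name, lane in expand(op):
                key = name if unsplittable else (name, lane)
                first_def.setdefault(key, idx)
    weight = LANES if unsplittable else 1
    peak = 0
    # "point" is the gap right after instruction number `point`.
    for point in range(len(program)):
        live = sum(1 for key, d in first_def.items()
                   if d <= point < last_use.get(key, d))
        peak = max(peak, live * weight)
    return peak


def body(sources):
    """a_i = <sources[i]>_i * 3 for every lane i."""
    return [([("a", i)], [(sources[i], i)]) for i in range(LANES)]


load = ([("b", None)], [])               # b = <load>   (hypothetical)
copy = ([("a", None)], [("b", None)])    # a = b
store = ([], [("a", None)])              # <store> a    (hypothetical)

before_cse = [load, copy] + body("aaaa") + [store]
partial_cse = [load, copy] + body("baaa") + [store]
full_subst = [load] + body("bbbb") + [store]

if __name__ == "__main__":
    print("before CSE, whole ranges:", peak_registers(before_cse, True))
    print("partial CSE, whole ranges:", peak_registers(partial_cse, True))
    print("full substitution, whole ranges:", peak_registers(full_subst, True))
    print("full substitution, per lane:", peak_registers(full_subst, False))
```

In this model the sequence before CSE peaks at 4 registers, while both the partially and the fully substituted sequences peak at 8 when ranges are unsplittable. Tracking each lane separately, the fully substituted sequence needs only 4, which is the "twice as many as needed" gap described above.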