From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugs-return-332969-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 18351 invoked by alias); 4 Oct 2010 23:00:16 -0000
Received: (qmail 18310 invoked by uid 22791); 4 Oct 2010 23:00:14 -0000
X-SWARE-Spam-Status: No, hits=-2.3 required=5.0	tests=ALL_TRUSTED,AWL,BAYES_00,MISSING_MID
X-Spam-Check-By: sourceware.org
Received: from localhost (HELO gcc.gnu.org) (127.0.0.1)    by sourceware.org (qpsmtpd/0.43rc1) with ESMTP; Mon, 04 Oct 2010 23:00:07 +0000
From: "siarhei.siamashka at gmail dot com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/43725] Poor instructions selection, scheduling and registers allocation for ARM NEON intrinsics
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: enhancement
X-Bugzilla-Who: siarhei.siamashka at gmail dot com
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Changed-Fields:
In-Reply-To: <bug-43725-4@http.gcc.gnu.org/bugzilla/>
References: <bug-43725-4@http.gcc.gnu.org/bugzilla/>
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
Content-Type: text/plain; charset="UTF-8"
MIME-Version: 1.0
Date: Mon, 04 Oct 2010 23:00:00 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-bugs.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-help@gcc.gnu.org>
Sender: gcc-bugs-owner@gcc.gnu.org
X-SW-Source: 2010-10/txt/msg00358.txt.bz2
Message-ID: <20101004230000.FcPAXhHHqRFMVa9vMcfbL6b4UUt1mFN8IvLlgXptOj0@z>

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725

--- Comment #2 from Siarhei Siamashka <siarhei.siamashka at gmail dot com> 2010-10-04 22:59:56 UTC ---
(In reply to comment #1)
> So the compiler is correct not to be using vld1 for this code.  The memory
> format of int32x4_t is defined to be the format of a neon register that has
> been filled from an array of int32 values and then stored to memory using VSTM
> (or equivalent sequence).  The implication of all this is that int32x4_t does
> not (necessarily) have the same memory layout as int32_t[4].

Could you elaborate on this? Specifically about the case when memory format for
VSTM and VST1 may differ.

I thought that VST1 instruction could be always used as a replacement for VSTM,
it is just a little bit less convenient in some cases because it is lacking
some more advanced addressing modes. Moreover, VSTM is VFP instruction and VST1
is NEON one. So I guess mixing VSTM with true NEON instructions may be
additionally a bad idea (for performance reasons on Cortex-A9 or other
processors?).

There also used to be FLDMX/FSTMX instructions, but they are deprecated now. I
believe they existed specifically to reserve the use of normal VFP load/store
instructions for floating point data formats only, but later this turned out to
be unnecessary.

> arm_neon.h provides intrinsics for filling neon registers from arrays in
> memory, and in this case I think you should be using these directly.  That is,
> your macro should be modified to contain:
> 
> #define X(n) {int32x4_t v; v = vld1q_s32((const int32_t*)&p[n]); v =
> vaddq_s32(v, a); v = vorrq_s32(v, b); vst1q_s32 ((int32_t*)&p[n], v);}

I'm sorry, but this looks like a completely unjustified limitation to me. Why
intrinsics should be so much more difficult and less intuitive to use than just
inline assembly? Additionally, gcc allows to use normal arithmetic operations
on vector data types, something like:

void x(int32x4_t a, int32x4_t b, int32x4_t *p)
{
        #define X(n) p[n] += a; p[n] |= b;
        X(0); X(1); X(2); X(3); X(4); X(5); X(6); X(7);
        X(8); X(9); X(10); X(11); X(12);
}

> There are still problems after doing this, however.  In particular the compiler
> is not correctly tracking alias information for the load/store intrinsics,
> which means it is unable to move stores past loads to reduce stalls in the
> pipeline.

OK, thanks for the explanation.

> The stack wastage appears to be fixed in trunk gcc; at least I don't see any
> stack allocation for your testcase.

Yes, looks like it got a little bit better. Anyway stack allocation shows up
again after adding just a few more of these X() macros:
... X(13); X(14); X(15); X(16); ...