From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 18351 invoked by alias); 4 Oct 2010 23:00:16 -0000 Received: (qmail 18310 invoked by uid 22791); 4 Oct 2010 23:00:14 -0000 X-SWARE-Spam-Status: No, hits=-2.3 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00,MISSING_MID X-Spam-Check-By: sourceware.org Received: from localhost (HELO gcc.gnu.org) (127.0.0.1) by sourceware.org (qpsmtpd/0.43rc1) with ESMTP; Mon, 04 Oct 2010 23:00:07 +0000 From: "siarhei.siamashka at gmail dot com" To: gcc-bugs@gcc.gnu.org Subject: [Bug target/43725] Poor instructions selection, scheduling and registers allocation for ARM NEON intrinsics X-Bugzilla-Reason: CC X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: gcc X-Bugzilla-Component: target X-Bugzilla-Keywords: missed-optimization X-Bugzilla-Severity: enhancement X-Bugzilla-Who: siarhei.siamashka at gmail dot com X-Bugzilla-Status: NEW X-Bugzilla-Priority: P3 X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org X-Bugzilla-Target-Milestone: --- X-Bugzilla-Changed-Fields: In-Reply-To: References: X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/ Auto-Submitted: auto-generated Content-Type: text/plain; charset="UTF-8" MIME-Version: 1.0 Date: Mon, 04 Oct 2010 23:00:00 -0000 Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-bugs-owner@gcc.gnu.org X-SW-Source: 2010-10/txt/msg00358.txt.bz2 Message-ID: <20101004230000.FcPAXhHHqRFMVa9vMcfbL6b4UUt1mFN8IvLlgXptOj0@z> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725 --- Comment #2 from Siarhei Siamashka 2010-10-04 22:59:56 UTC --- (In reply to comment #1) > So the compiler is correct not to be using vld1 for this code. The memory > format of int32x4_t is defined to be the format of a neon register that has > been filled from an array of int32 values and then stored to memory using VSTM > (or equivalent sequence). The implication of all this is that int32x4_t does > not (necessarily) have the same memory layout as int32_t[4]. Could you elaborate on this? Specifically about the case when memory format for VSTM and VST1 may differ. I thought that VST1 instruction could be always used as a replacement for VSTM, it is just a little bit less convenient in some cases because it is lacking some more advanced addressing modes. Moreover, VSTM is VFP instruction and VST1 is NEON one. So I guess mixing VSTM with true NEON instructions may be additionally a bad idea (for performance reasons on Cortex-A9 or other processors?). There also used to be FLDMX/FSTMX instructions, but they are deprecated now. I believe they existed specifically to reserve the use of normal VFP load/store instructions for floating point data formats only, but later this turned out to be unnecessary. > arm_neon.h provides intrinsics for filling neon registers from arrays in > memory, and in this case I think you should be using these directly. That is, > your macro should be modified to contain: > > #define X(n) {int32x4_t v; v = vld1q_s32((const int32_t*)&p[n]); v = > vaddq_s32(v, a); v = vorrq_s32(v, b); vst1q_s32 ((int32_t*)&p[n], v);} I'm sorry, but this looks like a completely unjustified limitation to me. Why intrinsics should be so much more difficult and less intuitive to use than just inline assembly? Additionally, gcc allows to use normal arithmetic operations on vector data types, something like: void x(int32x4_t a, int32x4_t b, int32x4_t *p) { #define X(n) p[n] += a; p[n] |= b; X(0); X(1); X(2); X(3); X(4); X(5); X(6); X(7); X(8); X(9); X(10); X(11); X(12); } > There are still problems after doing this, however. In particular the compiler > is not correctly tracking alias information for the load/store intrinsics, > which means it is unable to move stores past loads to reduce stalls in the > pipeline. OK, thanks for the explanation. > The stack wastage appears to be fixed in trunk gcc; at least I don't see any > stack allocation for your testcase. Yes, looks like it got a little bit better. Anyway stack allocation shows up again after adding just a few more of these X() macros: ... X(13); X(14); X(15); X(16); ...