From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: (qmail 26763 invoked by alias); 8 Oct 2010 14:13:30 -0000
Received: (qmail 26752 invoked by uid 22791); 8 Oct 2010 14:13:29 -0000
X-SWARE-Spam-Status: No, hits=-2.4 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00,MISSING_MID
X-Spam-Check-By: sourceware.org
Received: from localhost (HELO gcc.gnu.org) (127.0.0.1) by sourceware.org (qpsmtpd/0.43rc1) with ESMTP; Fri, 08 Oct 2010 14:13:25 +0000
From: "siarhei.siamashka at gmail dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/43725] Poor instructions selection, scheduling and registers allocation for ARM NEON intrinsics
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: enhancement
X-Bugzilla-Who: siarhei.siamashka at gmail dot com
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Changed-Fields:
In-Reply-To:
References:
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
Content-Type: text/plain; charset="UTF-8"
MIME-Version: 1.0
Date: Fri, 08 Oct 2010 14:13:00 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id:
List-Archive:
List-Post:
List-Help:
Sender: gcc-bugs-owner@gcc.gnu.org
X-SW-Source: 2010-10/txt/msg00730.txt.bz2
Message-ID: <20101008141300.K5ScD5cIb09I9aY-h33f5U3IO2q65Mh3BX-hFISmzrc@z>

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725

--- Comment #5 from Siarhei Siamashka 2010-10-08 14:13:08 UTC ---
(In reply to comment #3)
> On Mon, 4 Oct 2010, siarhei.siamashka at gmail dot com wrote:
>
> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725
> >
> > --- Comment #2 from Siarhei Siamashka 2010-10-04 22:59:56 UTC ---
> > (In reply to comment #1)
> > > So the compiler is correct not to be using vld1 for this code. The memory
> > > format of int32x4_t is defined to be the format of a neon register that has
> > > been filled from an array of int32 values and then stored to memory using VSTM
> > > (or equivalent sequence). The implication of all this is that int32x4_t does
> > > not (necessarily) have the same memory layout as int32_t[4].
> >
> > Could you elaborate on this? Specifically about the case when memory format for
> > VSTM and VST1 may differ.
>
> Big-endian.

OK, I see. Looks like VLDM/VSTM instructions could be replaced with VLD1/VST1
(by artificially forcing element size to 64) in almost all cases except when
SCTLR.A == 1 due to unwanted alignment traps potentially happening in this
case. But the question is whether it is really necessary to suffer from a
performance penalty on little endian systems?

> I previously explained the issues with big-endian NEON vectors in GCC at
> length:
>
> http://gcc.gnu.org/ml/gcc-patches/2010-06/msg00409.html

Thanks for the link, something seems to be seriously overengineered. Looks like
you brought a problem upon yourself and now are trying to valiantly solve it.
Does (efficient) support of NEON intrinsics on big endian systems even have any
practical value? Maybe it makes sense to get a reasonable performance at least
on little endian systems first. To me it looks like you are just running after
two hares...
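
For illustration only, the layout difference discussed above can be modelled on any host without NEON hardware. The sketch below is not GCC or ARM code: the names qreg_model, put_uint, store_vst1_32_like, store_vstm_like and layouts_match are all hypothetical. It assumes that a VST1.32-style store writes the four 32-bit lanes in lane order, and that a VSTM-style store (or VST1 with the element size forced to 64) writes two 64-bit doublewords with the even lane in the low half of each doubleword.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit NEON Q register as four 32-bit lanes. */
typedef struct { uint32_t lane[4]; } qreg_model;

/* Write the low `nbytes` bytes of `v` at `p` in the requested byte order. */
static void put_uint(uint8_t *p, uint64_t v, int nbytes, int big_endian)
{
    for (int i = 0; i < nbytes; i++) {
        int shift = 8 * (big_endian ? nbytes - 1 - i : i);
        p[i] = (uint8_t)(v >> shift);
    }
}

/* Assumed VST1.32-like store: lanes go to memory in lane order,
   each as one 32-bit element. */
static void store_vst1_32_like(const qreg_model *q, uint8_t out[16],
                               int big_endian)
{
    for (int i = 0; i < 4; i++)
        put_uint(out + 4 * i, q->lane[i], 4, big_endian);
}

/* Assumed VSTM-like (or VST1.64-like) store: two 64-bit doublewords,
   with lane 2*d in the low 32 bits of doubleword d. */
static void store_vstm_like(const qreg_model *q, uint8_t out[16],
                            int big_endian)
{
    for (int d = 0; d < 2; d++) {
        uint64_t dw = (uint64_t)q->lane[2 * d]
                    | ((uint64_t)q->lane[2 * d + 1] << 32);
        put_uint(out + 8 * d, dw, 8, big_endian);
    }
}

/* Returns 1 if both modelled stores produce identical memory, else 0. */
int layouts_match(int big_endian)
{
    qreg_model q = {{0x00010203u, 0x10111213u, 0x20212223u, 0x30313233u}};
    uint8_t a[16], b[16];

    store_vst1_32_like(&q, a, big_endian);
    store_vstm_like(&q, b, big_endian);
    return memcmp(a, b, sizeof a) == 0;
}
```

Under these assumptions layouts_match(0) returns 1 and layouts_match(1) returns 0: in the little-endian model the two stores are indistinguishable, while in the big-endian model the 32-bit lanes within each doubleword come out swapped, which is the case where int32x4_t and int32_t[4] stop sharing a memory layout.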