From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 117999 invoked by alias); 22 Apr 2015 14:03:42 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id:
List-Archive:
List-Post:
List-Help:
Sender: gcc-bugs-owner@gcc.gnu.org
Received: (qmail 117579 invoked by uid 48); 22 Apr 2015 14:03:38 -0000
From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/65847] SSE2 code for adding two structs is much worse at -O3 than at -O2
Date: Wed, 22 Apr 2015 14:03:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 6.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenth at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields: keywords cf_gcctarget bug_status cf_reconfirmed_on cc everconfirmed
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2015-04/txt/msg01910.txt.bz2

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65847

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Target|                            |x86_64-*-*
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2015-04-22
                 CC|                            |rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener ---
Confirmed.  The issue is that the vectorizer thinks x and y reside in memory
and thus it vectorizes the code as:

  vect__2.5_11 = MEM[(double *)&x];
  vect__3.8_13 = MEM[(double *)&y];
  vect__4.9_14 = vect__2.5_11 + vect__3.8_13;
  MEM[(double *)&D.1840] = vect__4.9_14;
  return D.1840;

which looks good.
But then the ABI comes into play and passes x, y and the return value in
registers ...  Even the best vectorized sequence would still have four stmts:
two to pack the arguments into vector registers, one add, and one unpack for
the return value.  Thus it seems the vectorizer should either be informed of
this ABI detail, or, as a simple heuristic, never consider function arguments
to be "memory" it can perform vector loads on (which probably means disabling
group analysis on them?).

On i?86 with SSE2 we get

        movupd  8(%esp), %xmm1
        movl    4(%esp), %eax
        movupd  24(%esp), %xmm0
        addpd   %xmm1, %xmm0
        movups  %xmm0, (%eax)

vs.

        movsd   16(%esp), %xmm0
        movl    4(%esp), %eax
        movsd   8(%esp), %xmm1
        addsd   32(%esp), %xmm0
        addsd   24(%esp), %xmm1
        movsd   %xmm0, 8(%eax)
        movsd   %xmm1, (%eax)

so here the vectorized version eventually looks even profitable (with
-mfpmath=sse).  A simple heuristic might therefore pessimize things too much.
Replicating the calls.c code that computes how the arguments are passed sounds
odd, though...  Eventually the target can pessimize these loads in the target
cost model instead (at least it can apply a more reasonable heuristic there).