From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 9994 invoked by alias); 30 Jan 2012 23:17:41 -0000
Received: (qmail 9984 invoked by uid 22791); 30 Jan 2012 23:17:39 -0000
X-SWARE-Spam-Status: No, hits=-2.8 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00,TW_MX
X-Spam-Check-By: sourceware.org
Received: from localhost (HELO gcc.gnu.org) (127.0.0.1) by sourceware.org (qpsmtpd/0.43rc1) with ESMTP; Mon, 30 Jan 2012 23:17:26 +0000
From: "jakub at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug middle-end/52056] Code optimization sensitive to trivial changes
Date: Tue, 31 Jan 2012 01:04:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: middle-end
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: minor
X-Bugzilla-Who: jakub at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Changed-Fields: CC
Message-ID:
In-Reply-To:
References:
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
Content-Type: text/plain; charset="UTF-8"
MIME-Version: 1.0
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id:
List-Archive:
List-Post:
List-Help:
Sender: gcc-bugs-owner@gcc.gnu.org
X-SW-Source: 2012-01/txt/msg03603.txt.bz2

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52056

Jakub Jelinek changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |irar at gcc dot gnu.org,
                   |                            |jakub at gcc dot gnu.org

--- Comment #2 from Jakub Jelinek 2012-01-30 23:16:03 UTC ---
The difference between a signed and an unsigned long right shift is quite significant here: Intel chips have no vector instruction for signed (arithmetic) quadword right shifts, only for unsigned (logical) quadword right shifts and for left shifts. AMD chips do support the signed form when compiling with -mxop.

So with the unsigned long right shift the loop is vectorized, while with the signed long right shift it is not. Clearly the vectorization (at most two elements at a time) isn't beneficial in this case, but the cost model doesn't figure that out. The faster times are therefore the ones without vectorization; you can get the same speed with -O3 -fno-tree-vectorize even with the unsigned shift. Even AVX can't process more than two elements at a time; only AVX2 will be able to. How fast that loop is on AVX2-capable chips compared to the non-vectorized version remains to be seen.
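
To make the distinction concrete, here is a minimal sketch of the two loop shapes being compared. It is not the test case from the bug report: the function names, the use of int64_t/uint64_t in place of long/unsigned long, and the restrict qualifiers are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernels for illustration only; they are not taken from
 * the bug report. int64_t/uint64_t stand in for long/unsigned long on
 * an LP64 target such as x86-64 Linux. */

/* Unsigned (logical) 64-bit right shift. SSE2 has a packed quadword
 * logical right shift, so this loop is a candidate for vectorization,
 * two elements per 128-bit vector. */
void shift_unsigned(uint64_t *restrict dst, const uint64_t *restrict src,
                    size_t n, unsigned count)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] >> count;
}

/* Signed (arithmetic) 64-bit right shift. Without XOP there is no
 * packed quadword arithmetic right shift, so per the comment above the
 * loop stays scalar. Note: right-shifting a negative value is
 * implementation-defined in C; GCC performs an arithmetic shift. */
void shift_signed(int64_t *restrict dst, const int64_t *restrict src,
                  size_t n, unsigned count)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] >> count;
}
```

Building such a file once with -O3 and once with -O3 -fno-tree-vectorize, and timing both functions, is one way to check whether vectorizing the unsigned variant pays off on a given machine.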