From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 31746 invoked by alias); 12 Nov 2013 04:24:13 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id:
List-Archive:
List-Post:
List-Help:
Sender: gcc-bugs-owner@gcc.gnu.org
Received: (qmail 31680 invoked by uid 48); 12 Nov 2013 04:24:08 -0000
From: "hendrik.greving.intel at gmail dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug c/59084] New: Sub-optimal vector moves in AVX2 vectorized loop for unaligned loads.
Date: Tue, 12 Nov 2013 04:24:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: c
X-Bugzilla-Version: 4.9.0
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: hendrik.greving.intel at gmail dot com
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status bug_severity priority component assigned_to reporter
Message-ID:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2013-11/txt/msg01109.txt.bz2

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59084

            Bug ID: 59084
           Summary: Sub-optimal vector moves in AVX2 vectorized loop for
                    unaligned loads.
           Product: gcc
           Version: 4.9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hendrik.greving.intel at gmail dot com

The simple test case below produces sub-optimal split load/stores (AVX1/16
bytes), apparently due to the fact that g_a, g_b, g_c are put in a common
section which doesn't guarantee alignment. Compiling with -fno-common actually
produces good code. Only affects C, due to the described alignment issue above.
This bug might be related to or be a duplicate of #41464.

Sub-optimal code:

compiled with
gcc -S -O3 -march=core-avx2 foo.c -ftree-vectorizer-verbose=1 -dp -v -da

    vmovdqu (%rsi,%rax), %xmm0  # 160   sse2_loaddquv16qi   [length = 5]
    vinserti128 $0x1, 16(%rsi,%rax), %ymm0, %ymm0   # 161   avx_vec_concatv32qi/1   [length = 8]
    addl    $1, %edx    # 165   *addsi_1/1  [length = 3]
    vpaddd  (%r8,%rax), %ymm0, %ymm0    # 162   *addv8si3/2 [length = 6]
    vmovups %xmm0, (%rcx,%rax)  # 410   *movv16qi_internal/3    [length = 5]
    vextracti128    $0x1, %ymm0, 16(%rcx,%rax)  # 164   vec_extract_hi_v32qi/2  [leng

Good code:

compiled with
gcc -S -O3 -march=core-avx2 foo.c -ftree-vectorizer-verbose=1 -dp -v -da
-fno-common

    vmovdqa g_a(%rax), %ymm0    # 26    *movv8si_internal/2 [length = 8]
    vpaddd  g_b(%rax), %ymm0, %ymm0 # 27    *addv8si3/2 [length = 8]
    addq    $32, %rax   # 29    *adddi_1/1  [length = 4]
    vmovaps %ymm0, g_c-32(%rax) # 28    *movv8si_internal/3 [length = 8]

Test case:

#include
#include
#define LENGTH 10000

int g_a[LENGTH];
int g_b[LENGTH];
int g_c[LENGTH];

void foo()
{
  int i;
  for (i = 0; i < LENGTH; i++) {
    g_c[i] = g_a[i] + g_b[i];
  }
}
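
Possible per-variable equivalent of -fno-common (a sketch, not verified
against the code generation shown above): GCC documents a `nocommon`
variable attribute that asks for a variable to be allocated directly
rather than in a common block. Assuming it has the same effect on
alignment as the command-line flag, the test case could be written as
below. The helper check_foo() is hypothetical and only there to confirm
that the loop result is unchanged; it is not part of the original report.

```c
#include <assert.h>

#define LENGTH 10000

/* Assumption: `nocommon` keeps these arrays out of the common section,
   as -fno-common would do for the whole translation unit. Whether the
   vectorizer then emits the aligned 32-byte moves has not been checked. */
int g_a[LENGTH] __attribute__((nocommon));
int g_b[LENGTH] __attribute__((nocommon));
int g_c[LENGTH] __attribute__((nocommon));

/* Same loop as in the report. */
void foo(void)
{
    int i;
    for (i = 0; i < LENGTH; i++) {
        g_c[i] = g_a[i] + g_b[i];
    }
}

/* Hypothetical helper: fills the inputs, runs foo(), and asserts that
   every output element is the sum of the two inputs. */
void check_foo(void)
{
    int i;
    for (i = 0; i < LENGTH; i++) {
        g_a[i] = i;
        g_b[i] = 2 * i;
    }
    foo();
    for (i = 0; i < LENGTH; i++) {
        assert(g_c[i] == 3 * i);
    }
}
```

To compare against the listings above, this would be compiled with the
same command line as the sub-optimal case (without -fno-common) and the
generated assembly inspected for vmovdqa/vmovaps on %ymm registers.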