From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: (qmail 12111 invoked by alias); 7 Nov 2007 09:05:39 -0000
Received: (qmail 12079 invoked by uid 48); 7 Nov 2007 09:05:28 -0000
Date: Wed, 07 Nov 2007 09:05:00 -0000
Subject: [Bug rtl-optimization/34011] New: Memory load is not eliminated from
 tight vectorized loop
X-Bugzilla-Reason: CC
Message-ID: 
Reply-To: gcc-bugzilla@gcc.gnu.org
To: gcc-bugs@gcc.gnu.org
From: "ubizjak at gmail dot com" 
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: 
List-Archive: 
List-Post: 
List-Help: 
Sender: gcc-bugs-owner@gcc.gnu.org
X-SW-Source: 2007-11/txt/msg00567.txt.bz2

The following testcase exposes an optimization problem with current SVN gcc:

--cut here--
extern const int srcshift;

void good (const int *srcdata, int *dstdata)
{
  int i;

  for (i = 0; i < 256; i++)
    dstdata[i] = srcdata[i] << srcshift;
}

void bad (const int *srcdata, int *dstdata)
{
  int i;

  for (i = 0; i < 256; i++)
    {
      dstdata[i] |= srcdata[i] << srcshift;
    }
}
--cut here--

Using -O3 -msse2, the loop in the above testcase gets vectorized, and the
produced code differs substantially between the good and bad functions:

good:
...
.L8:
        xorl    %eax, %eax
        movd    srcshift, %xmm1
        .p2align 4,,7
        .p2align 3
.L4:
        movdqu  (%ebx,%eax), %xmm0
        pslld   %xmm1, %xmm0
        movdqa  %xmm0, (%esi,%eax)
        addl    $16, %eax
        cmpl    $1024, %eax
        jne     .L4
...

bad:
...
.L21:
        movl    %esi, %eax              (2)
        movl    %ebx, %edx
        leal    1024(%esi), %ecx
        .p2align 4,,7
        .p2align 3
.L17:
        movdqu  (%edx), %xmm0
        movd    srcshift, %xmm1         (1)
        pslld   %xmm1, %xmm0
        movdqu  (%eax), %xmm1           (3)
        por     %xmm1, %xmm0
        movdqa  %xmm0, (%eax)
        addl    $16, %eax               (4)
        addl    $16, %edx
        cmpl    %ecx, %eax
        jne     .L17
        popl    %ebx
        popl    %esi
        popl    %ebp
        ret

In addition to the memory load of srcshift inside the loop (1), several other
problems can be identified: there is no need to copy registers at (2), since
the loop is followed only by the function exit; for some reason an additional
IV is used (4); and the same address is accessed with both an unaligned
access (3) and an aligned access.
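For reference, the transformation the vectorizer is expected to perform on
bad() can be sketched by hand with SSE2 intrinsics: the srcshift load is
hoisted out of the loop, and the destination is read exactly once per
iteration, with an aligned load. This is only an illustrative sketch, not
compiler output; the name bad_expected and the srcshift parameter are
invented here (the testcase uses a global), and dstdata is assumed to be
16-byte aligned, as the movdqa store in the generated code implies.

```c
#include <emmintrin.h>

/* Hand-written sketch of the code expected for bad(): the shift count
   is loaded into an XMM register once, before the loop, and each
   iteration does one unaligned source load, one aligned destination
   load, a shift, an OR, and one aligned store.  */
void bad_expected (const int *srcdata, int *dstdata, int srcshift)
{
  __m128i cnt = _mm_cvtsi32_si128 (srcshift);   /* movd srcshift, %xmm1 */
  int i;

  for (i = 0; i < 256; i += 4)
    {
      __m128i s = _mm_loadu_si128 ((const __m128i *) (srcdata + i)); /* movdqu */
      __m128i d = _mm_load_si128 ((__m128i *) (dstdata + i));        /* movdqa */
      s = _mm_sll_epi32 (s, cnt);                                    /* pslld  */
      d = _mm_or_si128 (d, s);                                       /* por    */
      _mm_store_si128 ((__m128i *) (dstdata + i), d);                /* movdqa */
    }
}
```

With this shape there is a single loop-invariant movd before the loop and no
redundant second load of the destination address inside it.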
The expected code for the "bad" case would be something like the "good" case
with an additional aligned load of the destination and a por (note that the
destination must still be read once, since bad() performs a read-modify-write):

.L8:
        xorl    %eax, %eax
        movd    srcshift, %xmm1
        .p2align 4,,7
        .p2align 3
.L4:
        movdqu  (%ebx,%eax), %xmm0
        movdqa  (%esi,%eax), %xmm2
        pslld   %xmm1, %xmm0
        por     %xmm2, %xmm0
        movdqa  %xmm0, (%esi,%eax)
        addl    $16, %eax
        cmpl    $1024, %eax
        jne     .L4

The missing IV elimination could be attributed to tree loop optimizations,
but the others are IMO RTL optimization problems, because we enter RTL
generation with:

good:

:
  MEM[base: dstdata, index: ivtmp.60] = M*(vect_p.29 + ivtmp.60){misalignment: 0} << srcshift.1;

bad:

:
  MEM[index: ivtmp.127] = M*(vector int *) ivtmp.130{misalignment: 0} << srcshift.3 | M*(vector int *) ivtmp.127{misalignment: 0};

-- 
           Summary: Memory load is not eliminated from tight vectorized
                    loop
           Product: gcc
           Version: 4.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: ubizjak at gmail dot com
GCC target triplet: i686-*-*, x86_64-*-*


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34011