From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 32671 invoked by alias); 22 Aug 2008 09:54:35 -0000
Received: (qmail 32403 invoked by uid 48); 22 Aug 2008 09:53:11 -0000
Date: Fri, 22 Aug 2008 09:54:00 -0000
Message-ID: <20080822095311.32402.qmail@sourceware.org>
X-Bugzilla-Reason: CC
References:
Subject: [Bug tree-optimization/37194] Autovectorization of small constant iteration loop degrades performance
In-Reply-To:
Reply-To: gcc-bugzilla@gcc.gnu.org
To: gcc-bugs@gcc.gnu.org
From: "rguenth at gcc dot gnu dot org"
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id:
List-Archive:
List-Post:
List-Help:
Sender: gcc-bugs-owner@gcc.gnu.org
X-SW-Source: 2008-08/txt/msg01594.txt.bz2

------- Comment #2 from rguenth at gcc dot gnu dot org  2008-08-22 09:53 -------
The x86_64 generated code looks like

ggSpectrum_Set:
.LFB0:
        .cfi_startproc
        movq    %rdi, %rax
        xorl    %ecx, %ecx
        movq    %rdi, %rdx
        andl    $15, %eax
        shrq    $2, %rax
        negl    %eax
        andl    $3, %eax
        je      .L15
        movl    $8, %r8d
        .p2align 4,,10
        .p2align 3
.L10:
        addl    $1, %ecx
        movl    %r8d, %esi
        movss   %xmm0, (%rdx)
        subl    %ecx, %esi
        addq    $4, %rdx
        cmpl    %ecx, %eax
        ja      .L10
.L3:
        movl    $8, %r10d
        subl    %eax, %r10d
        movl    %r10d, %r8d
        shrl    $2, %r8d
        leal    0(,%r8,4), %r9d
        testl   %r9d, %r9d
        je      .L5
        movaps  %xmm0, %xmm2
        sall    $2, %eax
        mov     %eax, %eax
        xorl    %edx, %edx
        shufps  $0, %xmm2, %xmm2
        leaq    (%rdi,%rax), %rax
        movaps  %xmm2, %xmm1
        .p2align 4,,10
        .p2align 3
.L6:
        addl    $1, %edx
        movaps  %xmm1, (%rax)
        addq    $16, %rax
        cmpl    %r8d, %edx
        jb      .L6
        addl    %r9d, %ecx
        subl    %r9d, %esi
        cmpl    %r9d, %r10d
        je      .L9
.L5:
        movslq  %ecx,%rax
        leaq    (%rdi,%rax,4), %rax
        .p2align 4,,10
        .p2align 3
.L8:
        movss   %xmm0, (%rax)
        addq    $4, %rax
        subl    $1, %esi
        jne     .L8
.L9:
        rep ret
.L15:
        movl    $8, %esi
        movl    %eax, %ecx
        jmp     .L3
        .cfi_endproc

I wonder why we do not use movups instead.

t.i:3: note: Alignment of access forced using peeling.
t.i:3: note: Peeling for alignment will be applied.
t.i:3: note: Cost model analysis:
  Vector inside of loop cost: 1
  Vector outside of loop cost: 13
  Scalar iteration cost: 1
  Scalar outside cost: 7
  prologue iterations: 2
  epilogue iterations: 2
  Calculated minimum iters for profitability: 7

t.i:3: note: === vect_do_peeling_for_alignment ===
t.i:3: note: created vect_p.29_13
t.i:3: note: niters for prolog loop: (unsigned int) (4 - (((long unsigned int) vect_p.29_13 & 15) >> 2)) & 3
t.i:3: note: Vectorization may not be profitable.


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu dot
                   |                            |org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37194
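
For reference, the two pieces of arithmetic in the dump above can be checked with a small C sketch. `prolog_niters` transcribes the "niters for prolog loop" expression directly. `min_profitable_iters` is only an assumed break-even formula (the dump does not show how the costs are combined); the function names, the parameter names, and the vectorization factor of 4 (four 4-byte elements per 16-byte `movaps` store) are illustrative assumptions, not taken from the GCC sources. With the costs reported above, this assumed formula evaluates to 7, the same value the dump prints.

```c
#include <assert.h>
#include <stdint.h>

/* Number of scalar iterations peeled before the vector loop so that a
   stream of 4-byte stores reaches 16-byte alignment.  Transcribed from
   the dump: (4 - ((addr & 15) >> 2)) & 3.  The assembly computes the
   same value with negl/andl, since (-x) & 3 == (4 - x) & 3. */
unsigned prolog_niters(uintptr_t addr)
{
    return (4u - (unsigned)((addr & 15u) >> 2)) & 3u;
}

/* ASSUMPTION: a break-even estimate, not confirmed by the dump.
   Finds the smallest iteration count at which the vector version
   (setup cost plus per-iteration cost, with peeled iterations charged
   at the vector per-iteration cost) is cheaper than the scalar loop.
   vf is the assumed vectorization factor. */
int min_profitable_iters(int vec_inside, int vec_outside,
                         int scalar_iter, int scalar_outside,
                         int vf, int prologue, int epilogue)
{
    int denom = scalar_iter * vf - vec_inside;
    assert(denom > 0); /* otherwise vectorizing never pays off */

    int iters = ((vec_outside - scalar_outside) * vf
                 - vec_inside * prologue
                 - vec_inside * epilogue) / denom;

    /* Integer division rounds down; bump by one if the scalar loop is
       still no more expensive at this iteration count. */
    if (scalar_iter * vf * iters
        <= vec_inside * iters + (vec_outside - scalar_outside) * vf)
        iters++;

    return iters;
}
```

Calling `min_profitable_iters(1, 13, 1, 7, 4, 2, 2)` with the costs from the dump gives 7. The loop in question runs only 8 iterations (the `movl $8` constants in the assembly), just above that threshold, which fits the "Vectorization may not be profitable" note.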