From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugs-return-259244-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 32671 invoked by alias); 22 Aug 2008 09:54:35 -0000
Received: (qmail 32403 invoked by uid 48); 22 Aug 2008 09:53:11 -0000
Date: Fri, 22 Aug 2008 09:54:00 -0000
Message-ID: <20080822095311.32402.qmail@sourceware.org>
X-Bugzilla-Reason: CC
References: <bug-37194-14936@http.gcc.gnu.org/bugzilla/>
Subject: [Bug tree-optimization/37194] Autovectorization of small constant iteration loop degrades performance
In-Reply-To: <bug-37194-14936@http.gcc.gnu.org/bugzilla/>
Reply-To: gcc-bugzilla@gcc.gnu.org
To: gcc-bugs@gcc.gnu.org
From: "rguenth at gcc dot gnu dot org" <gcc-bugzilla@gcc.gnu.org>
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-bugs.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-help@gcc.gnu.org>
Sender: gcc-bugs-owner@gcc.gnu.org
X-SW-Source: 2008-08/txt/msg01594.txt.bz2


------- Comment #2 from rguenth at gcc dot gnu dot org  2008-08-22 09:53 -------
The generated x86_64 code looks like this:

ggSpectrum_Set:
.LFB0:
        .cfi_startproc
        movq    %rdi, %rax
        xorl    %ecx, %ecx
        movq    %rdi, %rdx
        andl    $15, %eax
        shrq    $2, %rax
        negl    %eax
        andl    $3, %eax
        je      .L15
        movl    $8, %r8d
        .p2align 4,,10
        .p2align 3
.L10:
        addl    $1, %ecx
        movl    %r8d, %esi
        movss   %xmm0, (%rdx)
        subl    %ecx, %esi
        addq    $4, %rdx
        cmpl    %ecx, %eax
        ja      .L10
.L3:
        movl    $8, %r10d
        subl    %eax, %r10d
        movl    %r10d, %r8d
        shrl    $2, %r8d
        leal    0(,%r8,4), %r9d
        testl   %r9d, %r9d
        je      .L5
        movaps  %xmm0, %xmm2
        sall    $2, %eax
        mov     %eax, %eax
        xorl    %edx, %edx
        shufps  $0, %xmm2, %xmm2
        leaq    (%rdi,%rax), %rax
        movaps  %xmm2, %xmm1
        .p2align 4,,10
        .p2align 3
.L6:
        addl    $1, %edx
        movaps  %xmm1, (%rax)
        addq    $16, %rax
        cmpl    %r8d, %edx
        jb      .L6
        addl    %r9d, %ecx
        subl    %r9d, %esi
        cmpl    %r9d, %r10d
        je      .L9
.L5:
        movslq  %ecx,%rax
        leaq    (%rdi,%rax,4), %rax
        .p2align 4,,10
        .p2align 3
.L8:
        movss   %xmm0, (%rax)
        addq    $4, %rax
        subl    $1, %esi
        jne     .L8
.L9:
        rep
        ret
.L15:
        movl    $8, %esi
        movl    %eax, %ecx
        jmp     .L3
        .cfi_endproc

I wonder why we do not use unaligned stores (movups) instead of peeling for
alignment.
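
To make the shape of the listing easier to follow, here is a rough C model of
its three loops: the scalar prologue (.L10), the aligned vector loop (.L6) and
the scalar epilogue (.L8). It is only a sketch. The names fill8_peeled and
struct peel_split are hypothetical, the original source of ggSpectrum_Set is
not reproduced here, and each movaps is modelled as four plain float stores.

```c
#include <stdint.h>

/* Hypothetical record of how the 8 stores are divided among the loops. */
struct peel_split {
    unsigned prologue;     /* scalar stores before the aligned loop (.L10) */
    unsigned vector_iters; /* 16-byte stores in the aligned loop (.L6) */
    unsigned epilogue;     /* scalar stores after the aligned loop (.L8) */
};

/* Fill 8 floats in the same order as the listing, using plain C in place
   of movss/movaps, and report how many iterations each loop ran. */
static struct peel_split fill8_peeled(float *dst, float value)
{
    struct peel_split s = {0, 0, 0};
    const unsigned n = 8;
    unsigned i = 0;

    /* andl $15 / shrq $2 / negl / andl $3: number of floats to store
       before dst + i reaches a 16-byte boundary. */
    unsigned misalign = (unsigned)(((uintptr_t)dst & 15u) >> 2);
    unsigned peel = (0u - misalign) & 3u;

    for (; i < peel; i++, s.prologue++)
        dst[i] = value;

    /* Each outer iteration stands in for one movaps of four floats. */
    unsigned vec = (n - peel) >> 2;
    for (unsigned v = 0; v < vec; v++, s.vector_iters++) {
        for (unsigned k = 0; k < 4; k++)
            dst[i + k] = value;
        i += 4;
    }

    for (; i < n; i++, s.epilogue++)
        dst[i] = value;

    return s;
}
```

For a 16-byte aligned dst the model runs two vector iterations and nothing
else. For any other 4-byte aligned dst it runs a single vector iteration plus
four scalar stores split between prologue and epilogue. With unaligned stores
the same 32 bytes could be written by two 16-byte stores regardless of
alignment.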

t.i:3: note: Alignment of access forced using peeling.
t.i:3: note: Peeling for alignment will be applied.

t.i:3: note: Cost model analysis:
  Vector inside of loop cost: 1
  Vector outside of loop cost: 13
  Scalar iteration cost: 1
  Scalar outside cost: 7
  prologue iterations: 2
  epilogue iterations: 2
  Calculated minimum iters for profitability: 7
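
As a sanity check on the figures above, the following sketch shows one integer
break-even calculation that reproduces the reported threshold of 7. It is an
assumption about how the numbers combine and is not quoted from the dump. The
function name, the parameter names and the vectorization factor of 4 (four
floats per 16-byte vector) are likewise assumptions.

```c
/* Hypothetical helper: smallest iteration count at which the vector
   version is estimated to be cheaper than the scalar one.
   Returns -1 if the vector loop body is never cheaper per iteration. */
static int min_profitable_iters(int vec_inside, int vec_outside,
                                int scalar_iter, int scalar_outside,
                                int prologue_iters, int epilogue_iters,
                                int vf /* assumed vectorization factor */)
{
    int denom = scalar_iter * vf - vec_inside;
    if (denom <= 0)
        return -1;

    /* Extra fixed overhead of the vector version, scaled by vf, minus
       the vector-body cost attributed to the peeled iterations. */
    int num = (vec_outside - scalar_outside) * vf
              - vec_inside * prologue_iters
              - vec_inside * epilogue_iters;
    int iters = num / denom;

    /* Integer division rounds down; bump by one if the scalar version
       is still no more expensive at this count. */
    if (scalar_iter * vf * iters
        <= vec_inside * iters + (vec_outside - scalar_outside) * vf)
        iters++;

    return iters;
}
```

Calling min_profitable_iters(1, 13, 1, 7, 2, 2, 4) gives (24 - 2 - 2) / 3 = 6,
which the final check raises to 7. The loop in the listing stores 8 elements,
so it sits only just above that estimate.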

t.i:3: note: === vect_do_peeling_for_alignment ===
t.i:3: note: created vect_p.29_13
t.i:3: note: niters for prolog loop: (unsigned int) (4 - (((long unsigned int) vect_p.29_13 & 15) >> 2)) & 3
t.i:3: note: Vectorization may not be profitable.
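
The "niters for prolog loop" expression in the dump and the
andl/shrq/negl/andl sequence at the top of the assembly compute the same
value. A small sketch, with hypothetical function names and the address passed
in as an integer, shows both forms side by side:

```c
#include <stdint.h>

/* The expression from the dump, for 16-byte vectors of 4-byte floats. */
static unsigned prolog_niters_dump(uintptr_t addr)
{
    return (unsigned)(4 - ((addr & 15) >> 2)) & 3;
}

/* The negate-and-mask form used in the assembly listing. */
static unsigned prolog_niters_asm(uintptr_t addr)
{
    unsigned eax = (unsigned)((addr & 15) >> 2); /* andl $15; shrq $2 */
    eax = 0u - eax;                              /* negl */
    return eax & 3u;                             /* andl $3 */
}
```

For byte offsets 0, 4, 8 and 12 within a 16-byte block, both functions return
0, 3, 2 and 1 respectively: the number of scalar stores needed to reach the
next 16-byte boundary.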


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu dot
                   |                            |org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37194