From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: (qmail 14434 invoked by alias); 11 Oct 2011 01:12:06 -0000
Received: (qmail 14418 invoked by uid 22791); 11 Oct 2011 01:12:04 -0000
X-SWARE-Spam-Status: No, hits=-2.8 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00,TW_DD,TW_OV
X-Spam-Check-By: sourceware.org
Received: from localhost (HELO gcc.gnu.org) (127.0.0.1) by sourceware.org (qpsmtpd/0.43rc1) with ESMTP; Tue, 11 Oct 2011 01:11:50 +0000
From: "dje at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/50693] Loop optimization restricted by GOTOs
Date: Tue, 11 Oct 2011 01:12:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: dje at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Changed-Fields: Status Last reconfirmed Ever Confirmed
Message-ID:
In-Reply-To:
References:
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
Content-Type: text/plain; charset="UTF-8"
MIME-Version: 1.0
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id:
List-Archive:
List-Post:
List-Help:
Sender: gcc-bugs-owner@gcc.gnu.org
X-SW-Source: 2011-10/txt/msg00941.txt.bz2

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

David Edelsohn changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011-10-11
     Ever Confirmed|0                           |1

--- Comment #8 from David Edelsohn 2011-10-11 01:11:47 UTC ---
Both loop1 and loop2 produce the same code on LLVM, presumably from its memset
pattern:

        movq    %rax, 8(%r15)
        movq    %rbx, (%r15)
        testq   %rbx, %rbx
        je      .LBB1_3
# BB#1:
        movq    %rbx, %rcx
        movq    %rax, %rdx
        .align  16, 0x90
.LBB1_2:                                # %.lr.ph
                                        # =>This Inner Loop Header: Depth=1
        movb    %r14b, (%rdx)
        incq    %rdx
        decq    %rcx
        jne     .LBB1_2
.LBB1_3:                                # %._crit_edge
        movb    $0, (%rax,%rbx)

Direct pointer arithmetic might not be recommended, but Intel makes do.

For loop1, GCC produces:

        testq   %rbx, %rbx
        movq    %rax, 8(%rbp)
        movq    %rbx, 0(%rbp)
        je      .L3
        xorl    %edx, %edx
        .p2align 4,,10
        .p2align 3
.L5:
        movb    %r12b, (%rax,%rdx)
        addq    $1, %rdx
        movq    8(%rbp), %rax
        cmpq    %rbx, %rdx
        jne     .L5
.L3:
        movb    $0, (%rax,%rbx)

For loop2, GCC produces:

        xorl    %edx, %edx
        testq   %rbx, %rbx
        movq    %rax, 8(%rbp)
        movq    %rbx, 0(%rbp)
        jne     .L13
        jmp     .L9
        .p2align 4,,10
        .p2align 3
.L11:
        movq    8(%rbp), %rax
.L8:
.L13:
.L10:
        movb    %r12b, (%rax,%rdx)
        addq    $1, %rdx
        cmpq    %rbx, %rdx
        jne     .L11
        movq    8(%rbp), %rax
.L9:
        movb    $0, (%rax,%rbx)

In both cases GCC unnecessarily re-reads v->chars (the movq 8(%rbp), %rax
inside the loop).

Is loop2 slower because the jne .L13 jump into the middle of the loop confuses
the Intel loop branch predictor logic?  Or does the loop2 instruction order
crack into uops badly?  The cause of the performance difference is not obvious.
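
For readers without the test case at hand, the re-read of v->chars can be illustrated with a minimal C sketch. This is not the source from the bug report: the struct layout (length at offset 0, chars pointer at offset 8) is only inferred from the stores in the listings above, and the names struct vec, fill_reload and fill_hoisted are hypothetical. One plausible explanation, offered as an assumption rather than a diagnosis, is that a store through a char pointer may alias any object, so unless the compiler can prove the buffer does not overlap *v it has to reload v->chars after every byte store. Copying the pointer into a local removes that dependency.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout, inferred from the 0(%rbp) and 8(%rbp) stores. */
struct vec {
    size_t len;    /* offset 0 on x86-64 */
    char  *chars;  /* offset 8 on x86-64 */
};

/* Writes through v->chars directly. Each char store could, as far as the
 * type rules are concerned, modify v->chars itself, so a compiler that
 * cannot rule out overlap must reload the pointer every iteration. */
static void fill_reload(struct vec *v, size_t n, char c)
{
    assert(v != NULL && v->chars != NULL);
    v->len = n;
    for (size_t i = 0; i < n; i++)
        v->chars[i] = c;
    v->chars[n] = '\0';
}

/* Same result, but the pointer is read once into a local whose address is
 * never taken, so no store in the loop can change it. */
static void fill_hoisted(struct vec *v, size_t n, char c)
{
    assert(v != NULL && v->chars != NULL);
    char *p = v->chars;
    v->len = n;
    for (size_t i = 0; i < n; i++)
        p[i] = c;
    p[n] = '\0';
}
```

Both functions require the caller to supply a buffer of at least n + 1 bytes. Comparing the output of gcc -O2 -S for the two variants should show whether the in-loop load disappears in the hoisted form; the result will depend on the compiler version and on what it can prove about the buffer.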