From: Oleg Smolsky
Date: Tue, 23 Aug 2011 17:47:00 -0000
To: Andrew Pinski
CC: Xinliang David Li, gcc@gcc.gnu.org
Subject: Re: Performance degradation on g++ 4.6

Hey Andrew,

On 2011/8/22 18:37, Andrew Pinski wrote:
> On Mon, Aug 22, 2011 at 6:34 PM, Oleg Smolsky wrote:
>> On 2011/8/22 18:09, Oleg Smolsky wrote:
>>> Both compilers fully inline the templated function and the emitted code
>>> looks very similar. I am puzzled as to why one of these loops is
>>> significantly slower than the other. I've attached disassembled listings -
>>> perhaps someone could have a look please? (the body of the loop starts at
>>> 0000000000400FD for gcc41 and at 0000000000400D90 for gcc46)
>>
>> The difference, theoretically, should be due to the inner loop:
>>
>> v4.6:
>> .text:0000000000400DA0 loc_400DA0:
>> .text:0000000000400DA0     add     eax, 0Ah
>> .text:0000000000400DA3     add     al, [rdx]
>> .text:0000000000400DA5     add     rdx, 1
>> .text:0000000000400DA9     cmp     rdx, 5034E0h
>> .text:0000000000400DB0     jnz     short loc_400DA0
>>
>> v4.1:
>> .text:0000000000400FE0 loc_400FE0:
>> .text:0000000000400FE0     movzx   eax, ds:data8[rdx]
>> .text:0000000000400FE7     add     rdx, 1
>> .text:0000000000400FEB     add     eax, 0Ah
>> .text:0000000000400FEE     cmp     rdx, 1F40h
>> .text:0000000000400FF5     lea     ecx, [rax+rcx]
>> .text:0000000000400FF8     jnz     short loc_400FE0
>>
>> However, I cannot see how the first version would be slow... The custom
>> templated "shifter" degenerates into "add 0xa", which is the point of the
>> test... Hmm...
>
> It is slower because of the subregister dependency between eax and al.

Hmm... it is a little difficult to reason about these fragments because they are not functionally equivalent: the g++ 4.1 version discards the result, while the other version (correctly) accumulates.

Oh, I've just realized that I grabbed the first iteration of the inner loop, which was factored out (perhaps due to unrolling?). Oops, my apologies.
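(For reference, the benchmark source itself isn't pasted in this thread. A minimal sketch of what the kernel presumably boils down to is below - only the data8 name comes from the listings; the other names, the 8000-element size and the exact functor shape are illustrative guesses, not the actual test code.)

#include <stdint.h>

// Illustrative stand-in for the templated "shifter"; the whole call is
// expected to fold down to a single "add eax, 0Ah".
struct add_constant {
    int8_t operator()(int8_t v) const { return int8_t(v + 10); }
};

const int SIZE = 8000;          // 0x1F40 bytes, per the 4.1 listing
int8_t data8[SIZE];

// The inner loop: apply the shifter to every element and accumulate into
// an 8-bit sum (both compilers sign-extend it afterwards with movsx).
template <typename Op>
int8_t accumulate8(const int8_t* p, int n, Op op) {
    int8_t sum = 0;
    for (int i = 0; i < n; ++i)
        sum = int8_t(sum + op(p[i]));
    return sum;
}

With that shape, the interesting difference between the two listings is only where the running sum lives: in a full 32-bit register (4.1: lea ecx, [rax+rcx]) or in the al subregister (4.6: add al, [rdx]).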
Here are the complete loops, from a further digested test:

g++ 4.1 (1.35 sec, 1185M ops/s):

.text:0000000000400FDB loc_400FDB:
.text:0000000000400FDB     xor     ecx, ecx
.text:0000000000400FDD     xor     edx, edx
.text:0000000000400FDF     nop
.text:0000000000400FE0
.text:0000000000400FE0 loc_400FE0:
.text:0000000000400FE0     movzx   eax, ds:data8[rdx]
.text:0000000000400FE7     add     rdx, 1
.text:0000000000400FEB     add     eax, 0Ah
.text:0000000000400FEE     cmp     rdx, 1F40h
.text:0000000000400FF5     lea     ecx, [rax+rcx]
.text:0000000000400FF8     jnz     short loc_400FE0
.text:0000000000400FFA     movsx   eax, cl
.text:0000000000400FFD     add     esi, 1
.text:0000000000401000     add     ebx, eax
.text:0000000000401002     cmp     esi, edi
.text:0000000000401004     jnz     short loc_400FDB

g++ 4.6 (2.86 sec, 563M ops/s):

.text:0000000000400D80 loc_400D80:
.text:0000000000400D80     mov     edx, offset data8
.text:0000000000400D85     xor     eax, eax
.text:0000000000400D87     db 66h, 66h
.text:0000000000400D87     nop
.text:0000000000400D8A     db 66h, 66h
.text:0000000000400D8A     nop
.text:0000000000400D8D     db 66h, 66h
.text:0000000000400D8D     nop
.text:0000000000400D90
.text:0000000000400D90 loc_400D90:
.text:0000000000400D90     add     eax, 0Ah
.text:0000000000400D93     add     al, [rdx]
.text:0000000000400D95     add     rdx, 1
.text:0000000000400D99     cmp     rdx, 503480h
.text:0000000000400DA0     jnz     short loc_400D90
.text:0000000000400DA2     movsx   eax, al
.text:0000000000400DA5     add     ecx, 1
.text:0000000000400DA8     add     ebx, eax
.text:0000000000400DAA     cmp     ecx, esi
.text:0000000000400DAC     jnz     short loc_400D80

Your observation still holds - the 4.6 inner loop has two back-to-back instructions that operate on the same register (eax/al). So, I manually patched the 4.6 binary's inner loop to the following:

.text:0000000000400D90     add     al, [rdx]
.text:0000000000400D92     add     rdx, 1
.text:0000000000400D96     add     eax, 0Ah
.text:0000000000400D99     cmp     rdx, 503480h
.text:0000000000400DA0     jnz     short loc_400D90

and that made no significant difference in performance. Is this dependency really a performance issue? BTW, the outer loop executes 200,000 times...

Thanks!
Oleg.

P.S. GDB disassembles the padding emitted by v4.6 as:

   0x0000000000400d87 <+231>:   data32 xchg ax,ax
   0x0000000000400d8a <+234>:   data32 xchg ax,ax
   0x0000000000400d8d <+237>:   data32 xchg ax,ax
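P.P.S. In case it helps anyone poke at this outside the original harness, a standalone driver along the following lines should give comparable numbers. The 200,000 outer iterations are as stated above and the 8000-byte array matches the 0x1F40 count in the 4.1 listing; everything else (names, timing, output format) is just an illustrative guess at the harness, not the real benchmark. "ops/s" here means bytes processed per second.

#include <stdint.h>
#include <stdio.h>
#include <time.h>

const int SIZE  = 8000;      // inner loop trip count (0x1F40 bytes)
const int ITERS = 200000;    // outer loop trip count

int8_t data8[SIZE];

int main() {
    for (int i = 0; i < SIZE; ++i)
        data8[i] = int8_t(i);

    long long checksum = 0;                 // keeps the sums live so the loop isn't discarded
    clock_t start = clock();
    for (int it = 0; it < ITERS; ++it) {
        int8_t sum = 0;                     // 8-bit accumulator, as in both listings
        for (int i = 0; i < SIZE; ++i)
            sum = int8_t(sum + int8_t(data8[i] + 10));
        checksum += sum;
    }
    double secs = double(clock() - start) / CLOCKS_PER_SEC;
    printf("%.2f sec, %.0fM ops/s (checksum %lld)\n",
           secs, double(SIZE) * ITERS / secs / 1e6, checksum);
    return 0;
}

Changing sum to a plain int and truncating once after the inner loop should reproduce the 4.1-style code, and would be an easy source-level way to check whether the al/eax handling is really what costs the time.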