From: Oleg Smolsky
Date: Tue, 23 Aug 2011 17:47:00 -0000
To: Andrew Pinski
CC: Xinliang David Li, gcc@gcc.gnu.org
Subject: Re: Performance degradation on g++ 4.6

Hey Andrew,

On 2011/8/22 18:37, Andrew Pinski wrote:
> On Mon, Aug 22, 2011 at 6:34 PM, Oleg Smolsky wrote:
>> On 2011/8/22 18:09, Oleg Smolsky wrote:
>>> Both compilers fully inline the templated function and the emitted code
>>> looks very similar. I am puzzled as to why one of these loops is
>>> significantly slower than the other. I've attached disassembled listings -
>>> perhaps someone could have a look please? (the body of the loop starts at
>>> 0000000000400FD for gcc41 and at 0000000000400D90 for gcc46)
>>
>> The difference, theoretically, should be due to the inner loop:
>>
>> v4.6:
>> .text:0000000000400DA0 loc_400DA0:
>> .text:0000000000400DA0     add     eax, 0Ah
>> .text:0000000000400DA3     add     al, [rdx]
>> .text:0000000000400DA5     add     rdx, 1
>> .text:0000000000400DA9     cmp     rdx, 5034E0h
>> .text:0000000000400DB0     jnz     short loc_400DA0
>>
>> v4.1:
>> .text:0000000000400FE0 loc_400FE0:
>> .text:0000000000400FE0     movzx   eax, ds:data8[rdx]
>> .text:0000000000400FE7     add     rdx, 1
>> .text:0000000000400FEB     add     eax, 0Ah
>> .text:0000000000400FEE     cmp     rdx, 1F40h
>> .text:0000000000400FF5     lea     ecx, [rax+rcx]
>> .text:0000000000400FF8     jnz     short loc_400FE0
>>
>> However, I cannot see how the first version would be slow... The custom
>> templated "shifter" degenerates into "add 0xa", which is the point of the
>> test... Hmm...
>
> It is slower because of the subregister dependency between eax and al.

Hmm... it is a little difficult to reason about these fragments because they are not functionally equivalent: the g++ 4.1 version discards the result, while the other version (correctly) accumulates.

Oh, I've just realized that I grabbed the first iteration of the inner loop, which was factored out (perhaps due to unrolling?). Oops, my apologies.
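(For reference, the benchmark source itself isn't pasted in this thread. A minimal sketch of what the kernel presumably boils down to is below - only the data8 name comes from the listings; the other names, the 8000-element size and the exact functor shape are illustrative guesses, not the actual test code.)

#include <stdint.h>

// Illustrative stand-in for the templated "shifter"; the whole call is
// expected to fold down to a single "add eax, 0Ah".
struct add_constant {
    int8_t operator()(int8_t v) const { return int8_t(v + 10); }
};

const int SIZE = 8000;          // 0x1F40 bytes, per the 4.1 listing
int8_t data8[SIZE];

// The inner loop: apply the shifter to every element and accumulate into
// an 8-bit sum (both compilers sign-extend it afterwards with movsx).
template <typename Op>
int8_t accumulate8(const int8_t* p, int n, Op op) {
    int8_t sum = 0;
    for (int i = 0; i < n; ++i)
        sum = int8_t(sum + op(p[i]));
    return sum;
}

With that shape, the interesting difference between the two listings is only where the running sum lives: in a full 32-bit register (4.1: lea ecx, [rax+rcx]) or in the al subregister (4.6: add al, [rdx]).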
Here are the complete loops, from a further digested test:

g++ 4.1 (1.35 sec, 1185M ops/s):

.text:0000000000400FDB loc_400FDB:
.text:0000000000400FDB     xor     ecx, ecx
.text:0000000000400FDD     xor     edx, edx
.text:0000000000400FDF     nop
.text:0000000000400FE0
.text:0000000000400FE0 loc_400FE0:
.text:0000000000400FE0     movzx   eax, ds:data8[rdx]
.text:0000000000400FE7     add     rdx, 1
.text:0000000000400FEB     add     eax, 0Ah
.text:0000000000400FEE     cmp     rdx, 1F40h
.text:0000000000400FF5     lea     ecx, [rax+rcx]
.text:0000000000400FF8     jnz     short loc_400FE0
.text:0000000000400FFA     movsx   eax, cl
.text:0000000000400FFD     add     esi, 1
.text:0000000000401000     add     ebx, eax
.text:0000000000401002     cmp     esi, edi
.text:0000000000401004     jnz     short loc_400FDB

g++ 4.6 (2.86 sec, 563M ops/s):

.text:0000000000400D80 loc_400D80:
.text:0000000000400D80     mov     edx, offset data8
.text:0000000000400D85     xor     eax, eax
.text:0000000000400D87     db 66h, 66h
.text:0000000000400D87     nop
.text:0000000000400D8A     db 66h, 66h
.text:0000000000400D8A     nop
.text:0000000000400D8D     db 66h, 66h
.text:0000000000400D8D     nop
.text:0000000000400D90
.text:0000000000400D90 loc_400D90:
.text:0000000000400D90     add     eax, 0Ah
.text:0000000000400D93     add     al, [rdx]
.text:0000000000400D95     add     rdx, 1
.text:0000000000400D99     cmp     rdx, 503480h
.text:0000000000400DA0     jnz     short loc_400D90
.text:0000000000400DA2     movsx   eax, al
.text:0000000000400DA5     add     ecx, 1
.text:0000000000400DA8     add     ebx, eax
.text:0000000000400DAA     cmp     ecx, esi
.text:0000000000400DAC     jnz     short loc_400D80

Your observation still holds - the 4.6 inner loop has two back-to-back instructions that operate on the same register (eax/al). So, I manually patched the 4.6 binary's inner loop to the following:

.text:0000000000400D90     add     al, [rdx]
.text:0000000000400D92     add     rdx, 1
.text:0000000000400D96     add     eax, 0Ah
.text:0000000000400D99     cmp     rdx, 503480h
.text:0000000000400DA0     jnz     short loc_400D90

and that made no significant difference in performance. Is this dependency really a performance issue? BTW, the outer loop executes 200,000 times...

Thanks!
Oleg.

P.S. GDB disassembles the padding emitted by v4.6 as:

   0x0000000000400d87 <+231>:   data32 xchg ax,ax
   0x0000000000400d8a <+234>:   data32 xchg ax,ax
   0x0000000000400d8d <+237>:   data32 xchg ax,ax
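P.P.S. In case it helps anyone poke at this outside the original harness, a standalone driver along the following lines should give comparable numbers. The 200,000 outer iterations are as stated above and the 8000-byte array matches the 0x1F40 count in the 4.1 listing; everything else (names, timing, output format) is just an illustrative guess at the harness, not the real benchmark. "ops/s" here means bytes processed per second.

#include <stdint.h>
#include <stdio.h>
#include <time.h>

const int SIZE  = 8000;      // inner loop trip count (0x1F40 bytes)
const int ITERS = 200000;    // outer loop trip count

int8_t data8[SIZE];

int main() {
    for (int i = 0; i < SIZE; ++i)
        data8[i] = int8_t(i);

    long long checksum = 0;                 // keeps the sums live so the loop isn't discarded
    clock_t start = clock();
    for (int it = 0; it < ITERS; ++it) {
        int8_t sum = 0;                     // 8-bit accumulator, as in both listings
        for (int i = 0; i < SIZE; ++i)
            sum = int8_t(sum + int8_t(data8[i] + 10));
        checksum += sum;
    }
    double secs = double(clock() - start) / CLOCKS_PER_SEC;
    printf("%.2f sec, %.0fM ops/s (checksum %lld)\n",
           secs, double(SIZE) * ITERS / secs / 1e6, checksum);
    return 0;
}

Changing sum to a plain int and truncating once after the inner loop should reproduce the 4.1-style code, and would be an easy source-level way to check whether the al/eax handling is really what costs the time.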