From mboxrd@z Thu Jan  1 00:00:00 1970
Received: (qmail 16646 invoked by alias); 14 May 2009 09:01:44 -0000
Received: (qmail 14332 invoked by uid 48); 14 May 2009 09:01:09 -0000
Date: Thu, 14 May 2009 09:01:00 -0000
Message-ID: <20090514090109.14331.qmail@sourceware.org>
X-Bugzilla-Reason: CC
Subject: [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
Reply-To: gcc-bugzilla@gcc.gnu.org
To: gcc-bugs@gcc.gnu.org
From: "vvv at ru dot ru"
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
Sender: gcc-bugs-owner@gcc.gnu.org
X-SW-Source: 2009-05/txt/msg01210.txt.bz2

------- Comment #30 from vvv at ru dot ru  2009-05-14 09:01 -------
Created an attachment (id=17863)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17863&action=view)
Testing tool.

Here are the results of my testing.

Code:

        align   128
test_cikl:
rept 14         ; 14 if SH=0, 15 if SH=1, 16 if SH=2
{
        nop
}
        cmp     al,0            ; 2 bytes
        jz      $+10h+NOPS      ; 2 bytes  offset=xxxx0
        cmp     al,1            ; 2 bytes  offset=xxxx2
        jz      $+0Ch+NOPS      ; 2 bytes  offset=xxxx4
        cmp     al,2            ; 2 bytes  offset=xxxx6
        jz      $+08h+NOPS      ; 2 bytes  offset=xxxx8
        cmp     al,3            ; 2 bytes  offset=xxxxA
match =1, NOPS
{
        nop
}
match =2, NOPS
{
        xchg    eax,eax         ; 2-byte NOP
}
        jz      $+04h           ; 2 bytes  offset=xxxxC
        ja      $+02h           ; 2 bytes  offset=xxxxE
        mov     eax,ecx
        and     eax,7h
        loop    test_cikl

This code was tested on Core2, Xeon and P4 CPUs. Results are in RDTSC ticks.

; Core 2 Duo
;         NOPS/tick/Max   NOPS/tick/Max   NOPS/tick/Max
; SH=0    0/571/729       1/306/594       2/315/630
; SH=1    0/338/612       1/338/648       2/339/648
; SH=2    0/339/666       1/339/675       2/333/693

; Xeon 3110
;         NOPS/tick/Max   NOPS/tick/Max   NOPS/tick/Max
; SH=0    0/586/693       1/310/675       2/310/675
; SH=1    0/333/657       1/330/648       2/464/630
; SH=2    0/333/657       1/470/594       2/474/603

; P4
;         NOPS/tick/Max   NOPS/tick/Max   NOPS/tick/Max
; SH=0    0/1027/1317     1/1094/1258     2/1028/1207
; SH=1    0/1151/1377     1/1068/1352     2/902/1275
; SH=2    0/1124/1275     1/1148/1335     2/979/1139

Conclusion:

1. Core2 and Xeon give similar results. P4 behaves strangely. For Core2 and
Xeon, padding is very effective: with SH=0, the code with padding is almost
2 times faster than the code without it. Does padding make no sense for P4?

2. My previous statement

VVV> 1. AMD limitation for 16-bytes page (memory range XXX0 - XXXF),but
VVV> Intel limitation for 16-bytes chunk (memory range XXXX - XXXX+10h)

is wrong, at least for Core2 and Xeon. For these CPUs, "16-byte chunk" means
the memory range XXX0 - XXXF.

Unfortunately, I can't test AMD.

PS. My testing tool is in the attachment. It starts under MSDOS, switches to
32-bit mode, switches to 64-bit mode and measures rdtsc ticks for the test
code.

-- 

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
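
One way to read conclusion 2 is to count, for each SH/NOPS combination, how
many of the five jz/ja instructions fit entirely inside one 16-byte chunk.
The sketch below is not part of the attached testing tool. It is a
hypothetical Python model (all function names are invented here) that assumes
each branch is 2 bytes, as in the listing's comments, ignores the final loop
instruction, and counts a branch only when both of its bytes fall inside the
chunk. Under those assumptions, only SH=0 with NOPS=0 puts five branches in an
aligned chunk XXX0 - XXXF, while a sliding window XXXX - XXXX+10h would also
put five branches together for SH=1 and SH=2 with NOPS=0, and neither of those
cases is slow in the Core2 and Xeon timings.

```python
# Hypothetical model of the test loop's layout; not the author's tool.
CHUNK = 16  # chunk size in bytes


def branch_spans(sh, nops):
    """Return (first_byte, last_byte) offsets of the five 2-byte jz/ja
    instructions, relative to the 128-byte-aligned label test_cikl."""
    base = 14 + sh  # number of leading 1-byte NOPs
    starts = [
        base + 2,          # jz after cmp al,0
        base + 6,          # jz after cmp al,1
        base + 10,         # jz after cmp al,2
        base + 14 + nops,  # jz after cmp al,3 and the optional padding
        base + 16 + nops,  # ja
    ]
    return [(s, s + 1) for s in starts]


def max_in_aligned_chunk(spans):
    """Largest number of branches lying entirely in one chunk XXX0 - XXXF."""
    counts = {}
    for first, last in spans:
        if first // CHUNK == last // CHUNK:  # skip branches that straddle
            counts[first // CHUNK] = counts.get(first // CHUNK, 0) + 1
    return max(counts.values(), default=0)


def max_in_sliding_window(spans):
    """Largest number of branches lying entirely in any 16-byte window.
    A best window can always be shifted to start at some branch's first
    byte, so only those start positions are tried."""
    best = 0
    for w, _ in spans:
        inside = sum(1 for first, last in spans
                     if first >= w and last <= w + CHUNK - 1)
        best = max(best, inside)
    return best


if __name__ == "__main__":
    for sh in range(3):
        for nops in range(3):
            spans = branch_spans(sh, nops)
            print(f"SH={sh} NOPS={nops}: "
                  f"aligned={max_in_aligned_chunk(spans)} "
                  f"sliding={max_in_sliding_window(spans)}")
```

Whether the CPU attributes a straddling branch to the chunk holding its first
or its last byte is not something the measurements above settle; the
"entirely inside" rule is just the simplest choice for this sketch.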