From: Michael S. Zick
To: Andy Walker
Cc: gcc@gcc.gnu.org
Subject: Re: An unusual Performance approach using Synthetic registers
Date: Wed, 08 Jan 2003 19:29:00 -0000
Message-Id: <03010812102700.00905@localhost.localdomain>

On Tuesday 07 January 2003 11:35 pm, Andy Walker wrote:
> On Tuesday 07 January 2003 12:16 pm, tm_gccmail@mail.kloo.net wrote:
> >
> > Now that I think about it, it's even worse on the Pentium/Pentium MMX
> > than I initially thought.
> >
> > There's two instruction pipelines on the Pentium: the U pipe and the V
> > pipe. The U pipe can execute all the instructions, but the V pipe can
> > only execute simple instructions.
> >
> > Toshi
>
> I will take your good advice and not use XCHG as a performance enhancing
> option.
>
> Andy

Andy,

I do not make any claims of this being anything other than a WAFG...
It wasn't used as a numerical measure, just "==", "<", ">" to determine
an order among alternative code sequences.
But I used it as my guide in the past, and it is why I suggested XCHG.

Why:
  If user wanted "Best Size" I dropped the "C" term;
  if user wanted "Best Speed" I dropped the "D" term.
  Otherwise, just use the diagonal of a cube.

How:
  Scaled everything so it could be done with integer math.

Legend:
  B == Bus Cycles
  C == Clock Cycles
  S == Instruction Size (bytes)
  D == (Instruction Size DIV D-Cache Line Size)
  Cost == SQRT(256*(B*B + C*C + D*D))

Presumes:
  1) Write to Stack meets the "Write Before Read" requirement,
     so the first stack read does not generate a bus cycle.
  2) If temporary is required, use EAX
  3) If EAX not available, spill/restore with push/pop
  4) Newer processors will never be worse than 80386
  5) D-Cache line size 64 bytes

Notes:
  Case 1 leaves a bus write pending
     Follow with a Reg <-> Reg to hide write cycle
  Case 2 the "load/store" version, needs register
     Follow with another Reg <-> Reg if available
  Case 3 leaves a bus write pending
  Case 4 puts other Reg <-> Reg ops to hide bus write

PATH                  B |  C |  S |  D   | Cost

Case 1 == Cost 80
xchg ebx, [esp+16]    0 |  5 |  3 | 0.05 |   80
    With a pending bus cycle, so Reg <-> Reg pad here

Case 2 == Cost 129
mov eax, [esp+16]     0 |  4 |  3 | 0.05
mov [esp+16], ebx     1 |  2 |  3 | 0.05
mov ebx, eax          0 |  2 |  2 | 0.03
- - - - - - -
                      1 |  8 |  8 | 0.13 |  129

Case 3 == Cost 229
push eax              1 |  2 |  1 | 0.02
mov eax, [esp+20]     0 |  4 |  3 | 0.05
mov [esp+20], ebx     1 |  2 |  3 | 0.05
mov ebx, eax          0 |  2 |  2 | 0.03
pop eax               1 |  4 |  1 | 0.02
- - - - - - -
                      3 | 14 | 10 | 0.16 |  229

Case 4 == Cost 226
push eax              1 |  2 |  1 | 0.02
mov eax, [esp+20]     0 |  4 |  3 | 0.05
mov [esp+20], ebx     0 |  2 |  3 | 0.05
mov ebx, eax          0 |  2 |  2 | 0.03
    >> Reg <-> Reg pad here
pop eax               1 |  4 |  1 | 0.02
- - - - - - -
                      2 | 14 | 10 | 0.16 |  226
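
For anyone who wants to check the arithmetic, here is a rough sketch of the cost formula from the legend, done with integer math only. The names cost, rank, CASES and CACHE_LINE are made up for illustration, and the particular scaling (multiplying through by the square of the 64-byte line size before taking the integer square root) is an assumption about one way to avoid fractions, not necessarily the scaling originally used. It takes the per-case totals of B, C and S from the table and truncates the result.

```python
from math import isqrt  # integer square root, Python 3.8+

CACHE_LINE = 64  # bytes, from presumption 5 (hypothetical constant name)

# Per-case totals (B, C, S) copied from the table.
CASES = {
    "Case 1": (0, 5, 3),
    "Case 2": (1, 8, 8),
    "Case 3": (3, 14, 10),
    "Case 4": (2, 14, 10),
}


def cost(b, c, s, mode="balanced"):
    """Cost == SQRT(256*(B*B + C*C + D*D)) with D = S / CACHE_LINE.

    Assumed integer scaling: multiply the sum by CACHE_LINE**2 so that
    D*D becomes S*S, take the integer square root, then divide the
    line size back out. The result is truncated, not rounded.
    """
    if mode == "size":
        c = 0  # "Best Size": drop the C term
    elif mode == "speed":
        s = 0  # "Best Speed": drop the D term
    line_sq = CACHE_LINE * CACHE_LINE
    return isqrt(256 * (line_sq * (b * b + c * c) + s * s)) // CACHE_LINE


def rank(cases, mode="balanced"):
    """Return case names ordered from cheapest to most expensive."""
    return sorted(cases, key=lambda name: cost(*cases[name], mode=mode))


if __name__ == "__main__":
    for name in rank(CASES):
        print(name, cost(*CASES[name]))
```

Run as a script, this reproduces the four Cost figures in the table (80, 129, 226, 229) and orders the cases Case 1, Case 2, Case 4, Case 3. Since the number is only meant for "==", "<", ">" comparisons, the ordering is the part that matters.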