From: Xinliang David Li
To: Jan Hubicka
Cc: GCC Patches, Teresa Johnson
Subject: Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
Date: Thu, 13 Dec 2012 00:16:00 -0000
In-Reply-To: <20121212183036.GB5303@atrey.karlin.mff.cuni.cz>
References: <20121212163722.GA21037@atrey.karlin.mff.cuni.cz> <20121212183036.GB5303@atrey.karlin.mff.cuni.cz>

On Wed, Dec 12, 2012 at 10:30 AM, Jan Hubicka wrote:
> Concerning 1 push per cycle, I think it is the same as the K7 hardware did, so the move
> prologue should be a win.
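
[A minimal sketch of the two prologue/epilogue styles being compared; the function below and the assembly in the comments are only illustrative, not output from the patched compiler, and the exact frame layout and register choice depend on the ABI and the register allocator.]

    /* prologue_demo.c - keep a value live across a call so a callee-saved
       register (typically %rbx at -O2) must be preserved in the prologue.  */
    extern int helper (int);          /* hypothetical external helper */

    int
    prologue_demo (int x)
    {
      return helper (x) + helper (x + 1);
    }

    /* Push/pop prologue (what the patch enables for modern CPUs), roughly:
           pushq  %rbx
           ...
           popq   %rbx
           ret
       Move-based prologue (the alternative being discussed), roughly:
           subq   $24, %rsp
           movq   %rbx, 8(%rsp)
           ...
           movq   8(%rsp), %rbx
           addq   $24, %rsp
           ret   */
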
>> > Index: config/i386/i386.c
>> > ===================================================================
>> > --- config/i386/i386.c  (revision 194452)
>> > +++ config/i386/i386.c  (working copy)
>> > @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
>> >    COSTS_N_INSNS (8),                   /* cost of FABS instruction.  */
>> >    COSTS_N_INSNS (8),                   /* cost of FCHS instruction.  */
>> >    COSTS_N_INSNS (40),                  /* cost of FSQRT instruction.  */
>> > -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>> > -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
>> > +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>> > +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>> >               {-1, libcall, false}}}},
>> >   {{libcall, {{6, loop_1_byte, true},
>> >               {24, loop, true},
>> >               {8192, rep_prefix_4_byte, true},
>> >               {-1, libcall, false}}},
>> > -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
>> > +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>
> libcall is not faster up to 8KB than the rep sequence, which is better for regalloc/code
> cache than a fully blown function call.

Be careful with this.  My recollection is that the REP sequence is not good for every
size -- for smaller sizes the REP initial set up cost is too high (10s of cycles), while
for large copies it is less efficient than the library version.

>> > @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
>> >    m_PPRO,
>> >
>> >    /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
>> > -  m_CORE2I7 | m_GENERIC,
>> > +  m_GENERIC | m_CORE2,
>
> This disables shifts that store just some flags.  According to Agner's manual I7 handles
> this well.

ok.

> Partial flags stall
> The Sandy Bridge uses the method of an extra µop to join partial registers not only for
> general purpose registers but also for the flags register, unlike previous processors which
> used this method only for general purpose registers.  This occurs when a write to a part of
> the flags register is followed by a read from a larger part of the flags register.  The partial
> flags stall of previous processors (See page 75) is therefore replaced by an extra µop.  The
> Sandy Bridge also generates an extra µop when reading the flags after a rotate instruction.
>
> This is cheaper than the 7 cycle delay on Core that this flag is trying to avoid.

ok.

>> >
>> >    /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
>> >     * on 16-bit immediate moves into memory on Core2 and Corei7.  */
>> > @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
>> >    m_K6,
>> >
>> >    /* X86_TUNE_USE_CLTD */
>> > -  ~(m_PENT | m_ATOM | m_K6),
>> > +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),

My change was to enable CLTD for generic.  Is your change intended to revert that?

>
> None of the CPUs that generic cares about are !USE_CLTD now after your change.

>> > @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
>> >    m_ATHLON_K8,
>> >
>> >    /* X86_TUNE_SSE_TYPELESS_STORES */
>> > -  m_AMD_MULTIPLE,
>> > +  m_AMD_MULTIPLE | m_CORE2I7, /*????*/
>
> Hmm, I can not seem to find this in the manual now, but I believe that stores also are not
> typed, so a movaps store is preferred over a movapd store because it is shorter.  If not,
> this change should produce a lot of slowdowns.

>> >
>> >    /* X86_TUNE_SSE_LOAD0_BY_PXOR */
>> > -  m_PPRO | m_P4_NOCONA,
>> > +  m_PPRO | m_P4_NOCONA | m_CORE2I7, /*????*/
>
> Agner:
> A common way of setting a register to zero is XOR EAX,EAX or SUB EBX,EBX.  The
> Core2 and Nehalem processors recognize that certain instructions are independent of the
> prior value of the register if the source and destination registers are the same.
>
> This applies to all of the following instructions: XOR, SUB, PXOR, XORPS, XORPD, and all
> variants of PSUBxxx and PCMPxxx except PCMPEQQ.
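
[A minimal sketch, not from the original thread, of what X86_TUNE_SSE_LOAD0_BY_PXOR toggles: the instruction used to clear an SSE register.  The function is illustrative only; which form GCC actually emits also depends on the mode and on other tuning flags.]

    /* sse_zero_demo.c */
    #include <emmintrin.h>

    __m128d
    sse_zero_demo (void)
    {
      /* GCC materializes the all-zero constant with a register-clearing
         idiom rather than a load from memory.  */
      return _mm_setzero_pd ();
    }

    /* With the tuning bit set the zero is emitted with the integer-domain
       form, roughly
           pxor   %xmm0, %xmm0
       and otherwise with the FP-domain form
           xorps  %xmm0, %xmm0
       Per the Agner quote above, Core2/Nehalem treat both as
       dependency-breaking zero idioms, so the remaining difference is
       mostly execution domain and encoding length.  */
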
>> >
>> >    /* X86_TUNE_MEMORY_MISMATCH_STALL */
>> >    m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>> > @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
>> >
>> >    /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
>> >       than 4 branch instructions in the 16 byte window.  */
>> > -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>> > +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>
> This is a special pass to handle limitations of AMD's K7/K8/K10 branch prediction.
> Intel never had a similar design, so this flag is pointless.

I noticed that too, but Andi has a better answer to it.

>
> We apparently ought to disable it for K10, at least per Agner's manual.

>> >
>> >    /* X86_TUNE_SCHEDULE */
>> >    m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE | m_GENERIC,
>> > @@ -1947,10 +1947,10 @@ static unsigned int initial_ix86_tune_fe
>> >    m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>> >
>> >    /* X86_TUNE_USE_INCDEC */
>> > -  ~(m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_GENERIC),
>> > +  ~(m_P4_NOCONA | m_ATOM | m_GENERIC),
>
> Skipping inc/dec is to avoid the partial flag stall, which happens on P4 only.

K8 and K10 partition the flags into groups.  References to flags in the same group can
still cause the stall -- not sure how that can be handled.

>> >    /* X86_TUNE_PAD_RETURNS */
>> > -  m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC,
>> > +  m_AMD_MULTIPLE | m_GENERIC,
>
> Again this deals specifically with AMD K7/K8/K10 branch prediction.  I am not even
> sure this should be enabled for K10.

>> >
>> >    /* X86_TUNE_PAD_SHORT_FUNCTION: Pad short funtion.  */
>> >    m_ATOM,
>> > @@ -1959,7 +1959,7 @@ static unsigned int initial_ix86_tune_fe
>> >    m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_ATHLON_K8 | m_GENERIC,
>> >
>> >    /* X86_TUNE_AVOID_VECTOR_DECODE */
>> > -  m_CORE2I7 | m_K8 | m_GENERIC64,
>> > +  m_K8 | m_GENERIC64,
>
> This avoids AMD vector-decoded instructions; again, if it helped, it did so by accident.

>> >
>> >    /* X86_TUNE_PROMOTE_HIMODE_IMUL: Modern CPUs have same latency for HImode
>> >       and SImode multiply, but 386 and 486 do HImode multiply faster.  */
>> > @@ -1967,11 +1967,11 @@ static unsigned int initial_ix86_tune_fe
>> >
>> >    /* X86_TUNE_SLOW_IMUL_IMM32_MEM: Imul of 32-bit constant and memory is
>> >       vector path on AMD machines.  */
>> > -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
>> > +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
>> >
>> >    /* X86_TUNE_SLOW_IMUL_IMM8: Imul of 8-bit constant is vector path on AMD
>> >       machines.  */
>> > -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
>> > +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
>
> This is similarly targeted at AMD hardware only.  I did not find a similar limitation
> in the optimization manual.

>> >
>> >    /* X86_TUNE_MOVE_M1_VIA_OR: On pentiums, it is faster to load -1 via OR
>> >       than a MOV.  */
>> > @@ -1988,7 +1988,7 @@ static unsigned int initial_ix86_tune_fe
>> >
>> >    /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
>> >       from FP to FP.  */
>> > -  m_CORE2I7 | m_AMDFAM10 | m_GENERIC,
>> > +  m_AMDFAM10 | m_GENERIC,
>
> This is a quite specific feature of AMD chips to prefer packed converts over
> scalar.  Nothing like this is documented for Cores.
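
[A minimal sketch, not from the original thread, of the scalar-vs-packed conversion choice behind X86_TUNE_USE_VECTOR_FP_CONVERTS.  The function is illustrative only; the exact sequence GCC emits depends on the surrounding code.]

    /* fp_convert_demo.c */
    double
    fp_convert_demo (float x)
    {
      /* A plain scalar widening conversion.  */
      return (double) x;
    }

    /* Scalar form:
           cvtss2sd  %xmm0, %xmm0      ; merges into the old upper bits of %xmm0
       Packed form preferred when the tuning bit is set, roughly:
           unpcklps  %xmm0, %xmm0
           cvtps2pd  %xmm0, %xmm0
       The packed sequence avoids the partial-register-style dependency of
       cvtss2sd on AMD Fam10, which is what the comment above refers to.  */
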
>> >
>> >    /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
>> >       from integer to FP.  */
>> > @@ -1997,7 +1997,7 @@ static unsigned int initial_ix86_tune_fe
>> >    /* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
>> >       with a subsequent conditional jump instruction into a single
>> >       compare-and-branch uop.  */
>> > -  m_BDVER,
>> > +  m_BDVER | m_CORE2I7,
>
> Core implements fusion similar to what AMD does, so I think this just applies here.

yes.

thanks,

David

>> >
>> >    /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit.  This flag
>> >       will impact LEA instruction selection.  */
>> > @@ -2052,7 +2052,7 @@ static unsigned int initial_ix86_arch_fe
>> >  };
>> >
>> >  static const unsigned int x86_accumulate_outgoing_args
>> > -  = m_PPRO | m_P4_NOCONA | m_ATOM | m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC;
>> > +  = m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;
>
> The stack engine should make this cheap, just like prologues using moves.
> This definitely needs some validation; the accumulate-outgoing-args
> codegen differs quite a lot.  Also this leads to unwind table bloat.

>> >
>> >  static const unsigned int x86_arch_always_fancy_math_387
>> >    = m_PENT | m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;
>
> Honza
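
[A minimal sketch, not from the original thread, of the compare-and-branch fusion discussed above for X86_TUNE_FUSE_CMP_AND_BRANCH.  The function and helper names are hypothetical; the description of the flag's effect is my reading of its comment.]

    /* fuse_demo.c */
    extern void on_zero (void);       /* hypothetical helpers */
    extern void on_nonzero (void);

    void
    fuse_demo (int x)
    {
      if (x == 0)
        on_zero ();
      else
        on_nonzero ();
    }

    /* The tuning bit encourages keeping the compare/test directly in front
       of the conditional branch, e.g.
           testl  %edi, %edi
           jne    .L_nonzero
       so that decoders capable of macro-fusion (Core2/i7 and Bulldozer, per
       the patched flag) can merge the pair into a single compare-and-branch
       uop; scheduling an instruction between the two would block the fusion.  */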