From: Xinliang David Li
To: Jan Hubicka
Cc: GCC Patches, Teresa Johnson
Subject: Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
Date: Thu, 13 Dec 2012 00:16:00 -0000
In-Reply-To: <20121212183036.GB5303@atrey.karlin.mff.cuni.cz>
References: <20121212163722.GA21037@atrey.karlin.mff.cuni.cz> <20121212183036.GB5303@atrey.karlin.mff.cuni.cz>

On Wed, Dec 12, 2012 at 10:30 AM, Jan Hubicka wrote:
> Concerning 1 push per cycle, I think it is the same as the K7 hardware did, so the move
> prologue should be a win.
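
[A minimal sketch of the two prologue/epilogue styles being compared; the function below and the assembly in the comments are only illustrative, not output from the patched compiler, and the exact frame layout and register choice depend on the ABI and the register allocator.]

    /* prologue_demo.c - keep a value live across a call so a callee-saved
       register (typically %rbx at -O2) must be preserved in the prologue.  */
    extern int helper (int);          /* hypothetical external helper */

    int
    prologue_demo (int x)
    {
      return helper (x) + helper (x + 1);
    }

    /* Push/pop prologue (what the patch enables for modern CPUs), roughly:
           pushq  %rbx
           ...
           popq   %rbx
           ret
       Move-based prologue (the alternative being discussed), roughly:
           subq   $24, %rsp
           movq   %rbx, 8(%rsp)
           ...
           movq   8(%rsp), %rbx
           addq   $24, %rsp
           ret   */
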
>> > Index: config/i386/i386.c
>> > ===================================================================
>> > --- config/i386/i386.c  (revision 194452)
>> > +++ config/i386/i386.c  (working copy)
>> > @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
>> >    COSTS_N_INSNS (8),                   /* cost of FABS instruction.  */
>> >    COSTS_N_INSNS (8),                   /* cost of FCHS instruction.  */
>> >    COSTS_N_INSNS (40),                  /* cost of FSQRT instruction.  */
>> > -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>> > -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
>> > +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>> > +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>> >               {-1, libcall, false}}}},
>> >   {{libcall, {{6, loop_1_byte, true},
>> >               {24, loop, true},
>> >               {8192, rep_prefix_4_byte, true},
>> >               {-1, libcall, false}}},
>> > -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
>> > +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>
> libcall is not faster up to 8KB than the rep sequence, which is better for regalloc/code
> cache than a fully blown function call.

Be careful with this.  My recollection is that the REP sequence is not good for every
size -- for smaller sizes the REP initial set up cost is too high (10s of cycles), while
for large copies it is less efficient than the library version.

>> > @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
>> >    m_PPRO,
>> >
>> >    /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
>> > -  m_CORE2I7 | m_GENERIC,
>> > +  m_GENERIC | m_CORE2,
>
> This disables shifts that store just some flags.  According to Agner's manual I7 handles
> this well.

ok.

> Partial flags stall
> The Sandy Bridge uses the method of an extra µop to join partial registers not only for
> general purpose registers but also for the flags register, unlike previous processors which
> used this method only for general purpose registers.  This occurs when a write to a part of
> the flags register is followed by a read from a larger part of the flags register.  The partial
> flags stall of previous processors (See page 75) is therefore replaced by an extra µop.  The
> Sandy Bridge also generates an extra µop when reading the flags after a rotate instruction.
>
> This is cheaper than the 7 cycle delay on Core that this flag is trying to avoid.

ok.

>> >
>> >    /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
>> >     * on 16-bit immediate moves into memory on Core2 and Corei7.  */
>> > @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
>> >    m_K6,
>> >
>> >    /* X86_TUNE_USE_CLTD */
>> > -  ~(m_PENT | m_ATOM | m_K6),
>> > +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),

My change was to enable CLTD for generic.  Is your change intended to revert that?

>
> None of the CPUs that generic cares about are !USE_CLTD now after your change.

>> > @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
>> >    m_ATHLON_K8,
>> >
>> >    /* X86_TUNE_SSE_TYPELESS_STORES */
>> > -  m_AMD_MULTIPLE,
>> > +  m_AMD_MULTIPLE | m_CORE2I7, /*????*/
>
> Hmm, I can not seem to find this in the manual now, but I believe that stores also are not
> typed, so a movaps store is preferred over a movapd store because it is shorter.  If not,
> this change should produce a lot of slowdowns.

>> >
>> >    /* X86_TUNE_SSE_LOAD0_BY_PXOR */
>> > -  m_PPRO | m_P4_NOCONA,
>> > +  m_PPRO | m_P4_NOCONA | m_CORE2I7, /*????*/
>
> Agner:
> A common way of setting a register to zero is XOR EAX,EAX or SUB EBX,EBX.  The
> Core2 and Nehalem processors recognize that certain instructions are independent of the
> prior value of the register if the source and destination registers are the same.
>
> This applies to all of the following instructions: XOR, SUB, PXOR, XORPS, XORPD, and all
> variants of PSUBxxx and PCMPxxx except PCMPEQQ.
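
[A minimal sketch, not from the original thread, of what X86_TUNE_SSE_LOAD0_BY_PXOR toggles: the instruction used to clear an SSE register.  The function is illustrative only; which form GCC actually emits also depends on the mode and on other tuning flags.]

    /* sse_zero_demo.c */
    #include <emmintrin.h>

    __m128d
    sse_zero_demo (void)
    {
      /* GCC materializes the all-zero constant with a register-clearing
         idiom rather than a load from memory.  */
      return _mm_setzero_pd ();
    }

    /* With the tuning bit set the zero is emitted with the integer-domain
       form, roughly
           pxor   %xmm0, %xmm0
       and otherwise with the FP-domain form
           xorps  %xmm0, %xmm0
       Per the Agner quote above, Core2/Nehalem treat both as
       dependency-breaking zero idioms, so the remaining difference is
       mostly execution domain and encoding length.  */
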
>> >
>> >    /* X86_TUNE_MEMORY_MISMATCH_STALL */
>> >    m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>> > @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
>> >
>> >    /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
>> >       than 4 branch instructions in the 16 byte window.  */
>> > -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>> > +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>
> This is a special pass to handle limitations of AMD's K7/K8/K10 branch prediction.
> Intel never had a similar design, so this flag is pointless.

I noticed that too, but Andi has a better answer to it.

>
> We apparently ought to disable it for K10, at least per Agner's manual.

>> >
>> >    /* X86_TUNE_SCHEDULE */
>> >    m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE | m_GENERIC,
>> > @@ -1947,10 +1947,10 @@ static unsigned int initial_ix86_tune_fe
>> >    m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>> >
>> >    /* X86_TUNE_USE_INCDEC */
>> > -  ~(m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_GENERIC),
>> > +  ~(m_P4_NOCONA | m_ATOM | m_GENERIC),
>
> Skipping inc/dec is to avoid the partial flag stall, which happens on P4 only.

K8 and K10 partition the flags into groups.  References to flags in the same group can
still cause the stall -- not sure how that can be handled.

>> >    /* X86_TUNE_PAD_RETURNS */
>> > -  m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC,
>> > +  m_AMD_MULTIPLE | m_GENERIC,
>
> Again this deals specifically with AMD K7/K8/K10 branch prediction.  I am not even
> sure this should be enabled for K10.

>> >
>> >    /* X86_TUNE_PAD_SHORT_FUNCTION: Pad short funtion.  */
>> >    m_ATOM,
>> > @@ -1959,7 +1959,7 @@ static unsigned int initial_ix86_tune_fe
>> >    m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_ATHLON_K8 | m_GENERIC,
>> >
>> >    /* X86_TUNE_AVOID_VECTOR_DECODE */
>> > -  m_CORE2I7 | m_K8 | m_GENERIC64,
>> > +  m_K8 | m_GENERIC64,
>
> This avoids AMD vector-decoded instructions; again, if it helped, it did so by accident.

>> >
>> >    /* X86_TUNE_PROMOTE_HIMODE_IMUL: Modern CPUs have same latency for HImode
>> >       and SImode multiply, but 386 and 486 do HImode multiply faster.  */
>> > @@ -1967,11 +1967,11 @@ static unsigned int initial_ix86_tune_fe
>> >
>> >    /* X86_TUNE_SLOW_IMUL_IMM32_MEM: Imul of 32-bit constant and memory is
>> >       vector path on AMD machines.  */
>> > -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
>> > +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
>> >
>> >    /* X86_TUNE_SLOW_IMUL_IMM8: Imul of 8-bit constant is vector path on AMD
>> >       machines.  */
>> > -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
>> > +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
>
> This is similarly targeted at AMD hardware only.  I did not find a similar limitation
> in the optimization manual.

>> >
>> >    /* X86_TUNE_MOVE_M1_VIA_OR: On pentiums, it is faster to load -1 via OR
>> >       than a MOV.  */
>> > @@ -1988,7 +1988,7 @@ static unsigned int initial_ix86_tune_fe
>> >
>> >    /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
>> >       from FP to FP.  */
>> > -  m_CORE2I7 | m_AMDFAM10 | m_GENERIC,
>> > +  m_AMDFAM10 | m_GENERIC,
>
> This is a quite specific feature of AMD chips to prefer packed converts over
> scalar.  Nothing like this is documented for Cores.
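
[A minimal sketch, not from the original thread, of the scalar-vs-packed conversion choice behind X86_TUNE_USE_VECTOR_FP_CONVERTS.  The function is illustrative only; the exact sequence GCC emits depends on the surrounding code.]

    /* fp_convert_demo.c */
    double
    fp_convert_demo (float x)
    {
      /* A plain scalar widening conversion.  */
      return (double) x;
    }

    /* Scalar form:
           cvtss2sd  %xmm0, %xmm0      ; merges into the old upper bits of %xmm0
       Packed form preferred when the tuning bit is set, roughly:
           unpcklps  %xmm0, %xmm0
           cvtps2pd  %xmm0, %xmm0
       The packed sequence avoids the partial-register-style dependency of
       cvtss2sd on AMD Fam10, which is what the comment above refers to.  */
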
>> >
>> >    /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
>> >       from integer to FP.  */
>> > @@ -1997,7 +1997,7 @@ static unsigned int initial_ix86_tune_fe
>> >    /* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
>> >       with a subsequent conditional jump instruction into a single
>> >       compare-and-branch uop.  */
>> > -  m_BDVER,
>> > +  m_BDVER | m_CORE2I7,
>
> Core implements fusion similar to what AMD does, so I think this just applies here.

yes.

thanks,

David

>> >
>> >    /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit.  This flag
>> >       will impact LEA instruction selection.  */
>> > @@ -2052,7 +2052,7 @@ static unsigned int initial_ix86_arch_fe
>> >  };
>> >
>> >  static const unsigned int x86_accumulate_outgoing_args
>> > -  = m_PPRO | m_P4_NOCONA | m_ATOM | m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC;
>> > +  = m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;
>
> The stack engine should make this cheap, just like prologues using moves.
> This definitely needs some validation; the accumulate-outgoing-args
> codegen differs quite a lot.  Also this leads to unwind table bloat.

>> >
>> >  static const unsigned int x86_arch_always_fancy_math_387
>> >    = m_PENT | m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;
>
> Honza
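
[A minimal sketch, not from the original thread, of the compare-and-branch fusion discussed above for X86_TUNE_FUSE_CMP_AND_BRANCH.  The function and helper names are hypothetical; the description of the flag's effect is my reading of its comment.]

    /* fuse_demo.c */
    extern void on_zero (void);       /* hypothetical helpers */
    extern void on_nonzero (void);

    void
    fuse_demo (int x)
    {
      if (x == 0)
        on_zero ();
      else
        on_nonzero ();
    }

    /* The tuning bit encourages keeping the compare/test directly in front
       of the conditional branch, e.g.
           testl  %edi, %edi
           jne    .L_nonzero
       so that decoders capable of macro-fusion (Core2/i7 and Bulldozer, per
       the patched flag) can merge the pair into a single compare-and-branch
       uop; scheduling an instruction between the two would block the fusion.  */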