public inbox for gcc-patches@gcc.gnu.org
From: Jan Hubicka <hubicka@ucw.cz>
To: Xinliang David Li <davidxl@google.com>
Cc: Jan Hubicka <hubicka@ucw.cz>,
	GCC Patches <gcc-patches@gcc.gnu.org>,
	Teresa Johnson <tejohnson@google.com>
Subject: Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
Date: Wed, 12 Dec 2012 18:30:00 -0000	[thread overview]
Message-ID: <20121212183036.GB5303@atrey.karlin.mff.cuni.cz> (raw)
In-Reply-To: <CAAkRFZKBa3GtEh=mmWiAmy-oGffYFxrmWetpaz+pKYSG1zSvSw@mail.gmail.com>

Concerning 1 push per cycle, I think it is the same as the K7 hardware did, so a move
prologue should be a win.
> > Index: config/i386/i386.c
> > ===================================================================
> > --- config/i386/i386.c  (revision 194452)
> > +++ config/i386/i386.c  (working copy)
> > @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
> >    COSTS_N_INSNS (8),                   /* cost of FABS instruction.  */
> >    COSTS_N_INSNS (8),                   /* cost of FCHS instruction.  */
> >    COSTS_N_INSNS (40),                  /* cost of FSQRT instruction.  */
> > -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> > -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
> > +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> > +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
> >                {-1, libcall, false}}}},
> >    {{libcall, {{6, loop_1_byte, true},
> >                {24, loop, true},
> >                {8192, rep_prefix_4_byte, true},
> >                {-1, libcall, false}}},
> > -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
> > +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},

A libcall is not faster than the rep sequence up to 8KB, and the rep sequence is better for
regalloc/code cache than a full-blown function call.
> > @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
> >    m_PPRO,
> >
> >    /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
> > -  m_CORE2I7 | m_GENERIC,
> > +  m_GENERIC | m_CORE2,

This disables shifts that store just some flags.  According to Agner's manual, the i7 handles
this well.

Partial flags stall
The Sandy Bridge uses the method of an extra µop to join partial registers not only for
general purpose registers but also for the flags register, unlike previous processors which
used this method only for general purpose registers. This occurs when a write to a part of
the flags register is followed by a read from a larger part of the flags register. The partial
flags stall of previous processors (See page 75) is therefore replaced by an extra µop. The
Sandy Bridge also generates an extra µop when reading the flags after a rotate instruction.

This is cheaper than the 7 cycle delay on Core that this flag is trying to avoid.
> >
> >    /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
> >     * on 16-bit immediate moves into memory on Core2 and Corei7.  */
> > @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
> >    m_K6,
> >
> >    /* X86_TUNE_USE_CLTD */
> > -  ~(m_PENT | m_ATOM | m_K6),
> > +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),

After your change, none of the CPUs that generic cares about are !USE_CLTD.
> > @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
> >    m_ATHLON_K8,
> >
> >    /* X86_TUNE_SSE_TYPELESS_STORES */
> > -  m_AMD_MULTIPLE,
> > +  m_AMD_MULTIPLE | m_CORE2I7, /*????*/

Hmm, I cannot seem to find this in the manual now, but I believe that stores are also typeless,
so a movaps store is preferred over a movapd store because it is shorter.  If not, this change
should produce a lot of slowdowns.
> >
> >    /* X86_TUNE_SSE_LOAD0_BY_PXOR */
> > -  m_PPRO | m_P4_NOCONA,
> > +  m_PPRO | m_P4_NOCONA | m_CORE2I7, /*????*/

Agner:
A common way of setting a register to zero is XOR EAX,EAX or SUB EBX,EBX. The
Core2 and Nehalem processors recognize that certain instructions are independent of the
prior value of the register if the source and destination registers are the same.

This applies to all of the following instructions: XOR, SUB, PXOR, XORPS, XORPD, and all
variants of PSUBxxx and PCMPxxx except PCMPEQQ.
> >
> >    /* X86_TUNE_MEMORY_MISMATCH_STALL */
> >    m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> > @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
> >
> >    /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
> >       than 4 branch instructions in the 16 byte window.  */
> > -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> > +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,

This is a special pass to handle limitations of AMD's K7/K8/K10 branch prediction.
Intel never had a similar design, so this flag is pointless there.

We apparently ought to disable it for K10, at least per Agner's manual.
> >
> >    /* X86_TUNE_SCHEDULE */
> >    m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE | m_GENERIC,
> > @@ -1947,10 +1947,10 @@ static unsigned int initial_ix86_tune_fe
> >    m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> >
> >    /* X86_TUNE_USE_INCDEC */
> > -  ~(m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_GENERIC),
> > +  ~(m_P4_NOCONA | m_ATOM | m_GENERIC),

Skipping inc/dec is meant to avoid the partial flags stall, which happens on P4 only.
> >
> >    /* X86_TUNE_PAD_RETURNS */
> > -  m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC,
> > +  m_AMD_MULTIPLE | m_GENERIC,

Again, this deals specifically with AMD K7/K8/K10 branch prediction.  I am not even
sure this should be enabled for K10.
> >
> >    /* X86_TUNE_PAD_SHORT_FUNCTION: Pad short funtion.  */
> >    m_ATOM,
> > @@ -1959,7 +1959,7 @@ static unsigned int initial_ix86_tune_fe
> >    m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_ATHLON_K8 | m_GENERIC,
> >
> >    /* X86_TUNE_AVOID_VECTOR_DECODE */
> > -  m_CORE2I7 | m_K8 | m_GENERIC64,
> > +  m_K8 | m_GENERIC64,

This avoids AMD vector-decoded instructions; again, if it helped elsewhere, it did so by accident.
> >
> >    /* X86_TUNE_PROMOTE_HIMODE_IMUL: Modern CPUs have same latency for HImode
> >       and SImode multiply, but 386 and 486 do HImode multiply faster.  */
> > @@ -1967,11 +1967,11 @@ static unsigned int initial_ix86_tune_fe
> >
> >    /* X86_TUNE_SLOW_IMUL_IMM32_MEM: Imul of 32-bit constant and memory is
> >       vector path on AMD machines.  */
> > -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
> > +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
> >
> >    /* X86_TUNE_SLOW_IMUL_IMM8: Imul of 8-bit constant is vector path on AMD
> >       machines.  */
> > -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
> > +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,

This is similarly targeted at AMD hardware only.  I did not find a similar limitation
in the optimization manual.
> >
> >    /* X86_TUNE_MOVE_M1_VIA_OR: On pentiums, it is faster to load -1 via OR
> >       than a MOV.  */
> > @@ -1988,7 +1988,7 @@ static unsigned int initial_ix86_tune_fe
> >
> >    /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
> >       from FP to FP. */
> > -  m_CORE2I7 | m_AMDFAM10 | m_GENERIC,
> > +  m_AMDFAM10 | m_GENERIC,

Preferring packed converts over scalar ones is a quite specific feature of AMD chips.
Nothing like this is documented for the Cores.
> >
> >    /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
> >       from integer to FP. */
> > @@ -1997,7 +1997,7 @@ static unsigned int initial_ix86_tune_fe
> >    /* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
> >       with a subsequent conditional jump instruction into a single
> >       compare-and-branch uop.  */
> > -  m_BDVER,
> > +  m_BDVER | m_CORE2I7,

Core implements fusion similar to what AMD does, so I think this applies here as well.
> >
> >    /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
> >       will impact LEA instruction selection. */
> > @@ -2052,7 +2052,7 @@ static unsigned int initial_ix86_arch_fe
> >  };
> >
> >  static const unsigned int x86_accumulate_outgoing_args
> > -  = m_PPRO | m_P4_NOCONA | m_ATOM | m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC;
> > +  = m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;

The stack engine should make this cheap, just like the moves in prologues.
This definitely needs some validation; the accumulate-outgoing-args
codegen differs quite a lot.  Also, this leads to unwind table bloat.
> >
> >  static const unsigned int x86_arch_always_fancy_math_387
> >    = m_PENT | m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;

Honza

Thread overview: 43+ messages
2012-12-08 18:13 Xinliang David Li
2012-12-12 16:37 ` Jan Hubicka
2012-12-12 17:25   ` Xinliang David Li
2012-12-12 17:34   ` Xinliang David Li
2012-12-12 18:30     ` Jan Hubicka [this message]
2012-12-12 18:37       ` Andi Kleen
2012-12-12 18:43         ` Andi Kleen
2012-12-12 18:43         ` Jan Hubicka
2012-12-12 20:56         ` x86-64 medium memory model Leif Ekblad
2012-12-12 20:59           ` H.J. Lu
2012-12-12 21:33             ` Leif Ekblad
2012-12-13  0:16       ` [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs Xinliang David Li
2012-12-13  0:16         ` Xinliang David Li
2012-12-13  1:19         ` Jan Hubicka
2012-12-13  6:09           ` Xinliang David Li
2012-12-13  6:21             ` Jakub Jelinek
2012-12-13  7:05               ` Xinliang David Li
2012-12-13 19:28                 ` Jan Hubicka
2012-12-13 10:22               ` Richard Biener
2012-12-13 19:43               ` H.J. Lu
2012-12-13 20:26                 ` Jan Hubicka
2012-12-13 20:28                   ` H.J. Lu
2012-12-13 20:40                     ` Jan Hubicka
2012-12-13 21:02                       ` H.J. Lu
2012-12-13 21:35                         ` Jan Hubicka
2012-12-20 12:13                         ` Melik-adamyan, Areg
2012-12-20 14:08                           ` H.J. Lu
2012-12-20 15:05                             ` Jan Hubicka
2012-12-20 15:07                               ` Jan Hubicka
2012-12-20 15:22                                 ` H.J. Lu
2012-12-21  8:28   ` Zamyatin, Igor
2012-12-09 13:50 Uros Bizjak
2012-12-09 17:09 ` Дмитрий Дьяченко
2012-12-10  9:23 ` Richard Biener
2012-12-10 20:42   ` Xinliang David Li
2012-12-10 21:07     ` Mike Stump
2012-12-11  9:49       ` Richard Biener
2012-12-11 17:15         ` Xinliang David Li
2012-12-11 22:53 Xinliang David Li
2012-12-11 23:39 ` Xinliang David Li
2012-12-12 11:00 ` Richard Biener
2012-12-21  7:26 Xinliang David Li
2012-12-21  8:20 ` Zamyatin, Igor
