Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
@ 2012-12-09 13:50 Uros Bizjak
  2012-12-09 17:09 ` Дмитрий Дьяченко
  2012-12-10  9:23 ` Richard Biener
  0 siblings, 2 replies; 40+ messages in thread
From: Uros Bizjak @ 2012-12-09 13:50 UTC (permalink / raw)
  To: gcc-patches; +Cc: Xinliang David Li

Hello!

> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
> SP adjustment instead of a sequence of pushes/pops. The preference to
> the MOVs are good for old CPU micro-architectures (before pentium-4,
> K10), because it breaks the data dependency.  In modern
> micro-architecture, push/pop is implemented using a mechanism called
> stack engine. The data dependency is removed by the hardware, and
> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
> smaller. There is no longer the need to avoid using them.   This is
> also what ICC does.

> 2012-12-08  Xinliang David Li  <davidxl@google.com>
>            * config/i386/i386.c: Eanble push/pop in pro/epilogue for moderen CPUs.

s/moderen/modern

OK for mainline SVN.

Thanks,
Uros.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-09 13:50 [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs Uros Bizjak
@ 2012-12-09 17:09 ` Дмитрий Дьяченко
  2012-12-10  9:23 ` Richard Biener
  1 sibling, 0 replies; 40+ messages in thread
From: Дмитрий Дьяченко @ 2012-12-09 17:09 UTC (permalink / raw)
  To: gcc-patches, Xinliang David Li

s/Eanble/Enable/


Thanks,
Dmitry

2012/12/9 Uros Bizjak <ubizjak@gmail.com>:
> Hello!
>
>> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
>> SP adjustment instead of a sequence of pushes/pops. The preference to
>> the MOVs are good for old CPU micro-architectures (before pentium-4,
>> K10), because it breaks the data dependency.  In modern
>> micro-architecture, push/pop is implemented using a mechanism called
>> stack engine. The data dependency is removed by the hardware, and
>> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
>> smaller. There is no longer the need to avoid using them.   This is
>> also what ICC does.
>
>> 2012-12-08  Xinliang David Li  <davidxl@google.com>
>>            * config/i386/i386.c: Eanble push/pop in pro/epilogue for moderen CPUs.
>
> s/moderen/modern
>
> OK for mainline SVN.
>
> Thanks,
> Uros.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-09 13:50 [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs Uros Bizjak
  2012-12-09 17:09 ` Дмитрий Дьяченко
@ 2012-12-10  9:23 ` Richard Biener
  2012-12-10 20:42   ` Xinliang David Li
  1 sibling, 1 reply; 40+ messages in thread
From: Richard Biener @ 2012-12-10  9:23 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: gcc-patches, Xinliang David Li

On Sun, Dec 9, 2012 at 2:50 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
> Hello!
>
>> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
>> SP adjustment instead of a sequence of pushes/pops. The preference to
>> the MOVs are good for old CPU micro-architectures (before pentium-4,
>> K10), because it breaks the data dependency.  In modern
>> micro-architecture, push/pop is implemented using a mechanism called
>> stack engine. The data dependency is removed by the hardware, and
>> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
>> smaller. There is no longer the need to avoid using them.   This is
>> also what ICC does.
>
>> 2012-12-08  Xinliang David Li  <davidxl@google.com>
>>            * config/i386/i386.c: Eanble push/pop in pro/epilogue for moderen CPUs.
>
> s/moderen/modern
>
> OK for mainline SVN.

It's also more costly for unwind info in the prologue/epilogue.  Thus, did you
measure the effect on CFI size?

Thanks,
Richard.

> Thanks,
> Uros.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-10  9:23 ` Richard Biener
@ 2012-12-10 20:42   ` Xinliang David Li
  2012-12-10 21:07     ` Mike Stump
  0 siblings, 1 reply; 40+ messages in thread
From: Xinliang David Li @ 2012-12-10 20:42 UTC (permalink / raw)
  To: Richard Biener; +Cc: Uros Bizjak, GCC Patches

I have not measured the CFI size impact -- but conceivably it should
be larger -- which is unfortunate.

David

On Mon, Dec 10, 2012 at 1:23 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Sun, Dec 9, 2012 at 2:50 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
>> Hello!
>>
>>> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
>>> SP adjustment instead of a sequence of pushes/pops. The preference to
>>> the MOVs are good for old CPU micro-architectures (before pentium-4,
>>> K10), because it breaks the data dependency.  In modern
>>> micro-architecture, push/pop is implemented using a mechanism called
>>> stack engine. The data dependency is removed by the hardware, and
>>> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
>>> smaller. There is no longer the need to avoid using them.   This is
>>> also what ICC does.
>>
>>> 2012-12-08  Xinliang David Li  <davidxl@google.com>
>>>            * config/i386/i386.c: Eanble push/pop in pro/epilogue for moderen CPUs.
>>
>> s/moderen/modern
>>
>> OK for mainline SVN.
>
> It's also more costly for unwind info in the prologue/epilogue.  Thus, did you
> measure the effect on CFI size?
>
> Thanks,
> Richard.
>
>> Thanks,
>> Uros.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-10 20:42   ` Xinliang David Li
@ 2012-12-10 21:07     ` Mike Stump
  2012-12-11  9:49       ` Richard Biener
  0 siblings, 1 reply; 40+ messages in thread
From: Mike Stump @ 2012-12-10 21:07 UTC (permalink / raw)
  To: Xinliang David Li; +Cc: Richard Biener, Uros Bizjak, GCC Patches

On Dec 10, 2012, at 12:42 PM, Xinliang David Li <davidxl@google.com> wrote:
> I have not measured the CFI size impact -- but conceivably it should
> be larger -- which is unfortunate.

Code speed and size are preferable to optimizing dwarf size…  :-)  I'd let dwarf 5 fix it!

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-10 21:07     ` Mike Stump
@ 2012-12-11  9:49       ` Richard Biener
  2012-12-11 17:15         ` Xinliang David Li
  0 siblings, 1 reply; 40+ messages in thread
From: Richard Biener @ 2012-12-11  9:49 UTC (permalink / raw)
  To: Mike Stump; +Cc: Xinliang David Li, Uros Bizjak, GCC Patches

On Mon, Dec 10, 2012 at 10:07 PM, Mike Stump <mikestump@comcast.net> wrote:
> On Dec 10, 2012, at 12:42 PM, Xinliang David Li <davidxl@google.com> wrote:
>> I have not measured the CFI size impact -- but conceivably it should
>> be larger -- which is unfortunate.
>
> Code speed and size are preferable to optimizing dwarf size…  :-)  I'd let dwarf 5 fix it!

Well, different to debug info, CFI data has to be in memory to make
unwinding work.
These days most Linux distributions enable asyncronous unwind tables so any
size savings due to shorter push/pop epilogue/prologue sequences has to be
offsetted by the increase in CFI data.  I'm not sure there is really a
speed difference
between both variants (well, maybe due to better icache footprint of
the push/pop
variant).

That said - I'd prefer to have more data on this before making the switch for
the generic model.  What was your original motivation?  Just "theory" or was
it a real case?

Thanks,
Richard.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-11  9:49       ` Richard Biener
@ 2012-12-11 17:15         ` Xinliang David Li
  0 siblings, 0 replies; 40+ messages in thread
From: Xinliang David Li @ 2012-12-11 17:15 UTC (permalink / raw)
  To: Richard Biener; +Cc: Mike Stump, Uros Bizjak, GCC Patches

On Tue, Dec 11, 2012 at 1:49 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Mon, Dec 10, 2012 at 10:07 PM, Mike Stump <mikestump@comcast.net> wrote:
>> On Dec 10, 2012, at 12:42 PM, Xinliang David Li <davidxl@google.com> wrote:
>>> I have not measured the CFI size impact -- but conceivably it should
>>> be larger -- which is unfortunate.
>>
>> Code speed and size are preferable to optimizing dwarf size…  :-)  I'd let dwarf 5 fix it!
>
> Well, different to debug info, CFI data has to be in memory to make
> unwinding work.
> These days most Linux distributions enable asyncronous unwind tables so any
> size savings due to shorter push/pop epilogue/prologue sequences has to be
> offsetted by the increase in CFI data.  I'm not sure there is really a
> speed difference
> between both variants (well, maybe due to better icache footprint of
> the push/pop
> variant).

Yes, for large applications, this can be crucial to performance.

>
> That said - I'd prefer to have more data on this before making the switch for
> the generic model.  What was your original motivation?  Just "theory" or was
> it a real case?

1) some of the very large internal apps I measured benefit from this
change (in terms of performance)
2) both ICC and LLVM do the same.

I have already committed the patch. I will find some time to collect
more size data and post it later.

thanks,

David


>
> Thanks,
> Richard.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
@ 2012-12-21  7:26 Xinliang David Li
  2012-12-21  8:20 ` Zamyatin, Igor
  0 siblings, 1 reply; 40+ messages in thread
From: Xinliang David Li @ 2012-12-21  7:26 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches, Ahmad Sharif

Ahmad has helped doing some atom performance testing (ChromeOS
benchmarks) with this patch. In summary, there is no statistically
significant regression seen. There is one improvement of about +1.9%
(v8 benchmark) which looks real.

David

On Wed, Dec 12, 2012 at 9:24 AM, Xinliang David Li <davidxl@google.com> wrote:
> On Wed, Dec 12, 2012 at 8:37 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
>>> SP adjustment instead of a sequence of pushes/pops. The preference to
>>> the MOVs are good for old CPU micro-architectures (before pentium-4,
>>> K10), because it breaks the data dependency.  In modern
>>> micro-architecture, push/pop is implemented using a mechanism called
>>> stack engine. The data dependency is removed by the hardware, and
>>> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
>>> smaller. There is no longer the need to avoid using them.   This is
>>> also what ICC does.
>>>
>>> The following patch fixed the problem. It passes bootstrap/regression
>>> test. OK to install?
>>>
>>> thanks,
>>>
>>> David
>>>
>>> Index: config/i386/i386.c
>>> ===================================================================
>>> --- config/i386/i386.c (revision 194324)
>>> +++ config/i386/i386.c (working copy)
>>> @@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
>>>    m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>>
>>>    /* X86_TUNE_PROLOGUE_USING_MOVE */
>>> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
>>> +  m_PPRO | m_ATHLON_K8,
>>>
>>>    /* X86_TUNE_EPILOGUE_USING_MOVE */
>>> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
>>> +  m_PPRO | m_ATHLON_K8,
>>
>> Push/pops wrt moves was always difficult to tune on old CPUs, so I am happy it
>> is gone from generic (in fact I had similar patch pending).
>> Are you sure about Atom having stack engine, too?
>>
>
> Good question. The instruction latency table
> (http://www.agner.org/optimize/instruction_tables.pdf) shows that for
> Atom: push r has one 1uop, 1 cycle latency. However the instruction is
> not pairable which will affect ILP. The guide here
> http://www.agner.org/optimize/microarchitecture.pdf does not mention
> Atom has stack engine either.
>
> I will help collect some performance data on Atom.
>
>
> thanks,
>
> David
>
>
>> Related thing is accumulate_outgoing_args. Igor is testing it on Core and I will
>> give it a try on K10.
>>
>> Honza
>>
>> I am attaching the changes for core costs I made if someone is interested in
>> testing them.  If we can declare P4/PPRo and maybe K8 chips obsolette for
>> generic, there is room for improvement in generic, too. Like using inc/dec
>> again.
>>
>> Honza
>>
>> Index: config/i386/i386.c
>> ===================================================================
>> --- config/i386/i386.c  (revision 194452)
>> +++ config/i386/i386.c  (working copy)
>> @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
>>    COSTS_N_INSNS (8),                   /* cost of FABS instruction.  */
>>    COSTS_N_INSNS (8),                   /* cost of FCHS instruction.  */
>>    COSTS_N_INSNS (40),                  /* cost of FSQRT instruction.  */
>> -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>> -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
>> +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>> +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>>                {-1, libcall, false}}}},
>>    {{libcall, {{6, loop_1_byte, true},
>>                {24, loop, true},
>>                {8192, rep_prefix_4_byte, true},
>>                {-1, libcall, false}}},
>> -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
>> +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>>                {-1, libcall, false}}}},
>>    1,                                   /* scalar_stmt_cost.  */
>>    1,                                   /* scalar load_cost.  */
>> @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
>>    m_PPRO,
>>
>>    /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
>> -  m_CORE2I7 | m_GENERIC,
>> +  m_GENERIC | m_CORE2,
>>
>>    /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
>>     * on 16-bit immediate moves into memory on Core2 and Corei7.  */
>> @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
>>    m_K6,
>>
>>    /* X86_TUNE_USE_CLTD */
>> -  ~(m_PENT | m_ATOM | m_K6),
>> +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
>>
>>    /* X86_TUNE_USE_XCHGB: Use xchgb %rh,%rl instead of rolw/rorw $8,rx.  */
>>    m_PENT4,
>> @@ -1901,7 +1901,7 @@ static unsigned int initial_ix86_tune_fe
>>    m_COREI7 | m_BDVER,
>>
>>    /* X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL */
>> -  m_BDVER ,
>> +  m_BDVER,
>>
>>    /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies
>>       are resolved on SSE register parts instead of whole registers, so we may
>> @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
>>    m_ATHLON_K8,
>>
>>    /* X86_TUNE_SSE_TYPELESS_STORES */
>> -  m_AMD_MULTIPLE,
>> +  m_AMD_MULTIPLE | m_CORE2I7, /*????*/
>>
>>    /* X86_TUNE_SSE_LOAD0_BY_PXOR */
>> -  m_PPRO | m_P4_NOCONA,
>> +  m_PPRO | m_P4_NOCONA | m_CORE2I7, /*????*/
>>
>>    /* X86_TUNE_MEMORY_MISMATCH_STALL */
>>    m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>> @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
>>
>>    /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
>>       than 4 branch instructions in the 16 byte window.  */
>> -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>> +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>
>>    /* X86_TUNE_SCHEDULE */
>>    m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE | m_GENERIC,
>> @@ -1947,10 +1947,10 @@ static unsigned int initial_ix86_tune_fe
>>    m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>
>>    /* X86_TUNE_USE_INCDEC */
>> -  ~(m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_GENERIC),
>> +  ~(m_P4_NOCONA | m_ATOM | m_GENERIC),
>>
>>    /* X86_TUNE_PAD_RETURNS */
>> -  m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC,
>> +  m_AMD_MULTIPLE | m_GENERIC,
>>
>>    /* X86_TUNE_PAD_SHORT_FUNCTION: Pad short funtion.  */
>>    m_ATOM,
>> @@ -1959,7 +1959,7 @@ static unsigned int initial_ix86_tune_fe
>>    m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_ATHLON_K8 | m_GENERIC,
>>
>>    /* X86_TUNE_AVOID_VECTOR_DECODE */
>> -  m_CORE2I7 | m_K8 | m_GENERIC64,
>> +  m_K8 | m_GENERIC64,
>>
>>    /* X86_TUNE_PROMOTE_HIMODE_IMUL: Modern CPUs have same latency for HImode
>>       and SImode multiply, but 386 and 486 do HImode multiply faster.  */
>> @@ -1967,11 +1967,11 @@ static unsigned int initial_ix86_tune_fe
>>
>>    /* X86_TUNE_SLOW_IMUL_IMM32_MEM: Imul of 32-bit constant and memory is
>>       vector path on AMD machines.  */
>> -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
>> +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
>>
>>    /* X86_TUNE_SLOW_IMUL_IMM8: Imul of 8-bit constant is vector path on AMD
>>       machines.  */
>> -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
>> +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
>>
>>    /* X86_TUNE_MOVE_M1_VIA_OR: On pentiums, it is faster to load -1 via OR
>>       than a MOV.  */
>> @@ -1988,7 +1988,7 @@ static unsigned int initial_ix86_tune_fe
>>
>>    /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
>>       from FP to FP. */
>> -  m_CORE2I7 | m_AMDFAM10 | m_GENERIC,
>> +  m_AMDFAM10 | m_GENERIC,
>>
>>    /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
>>       from integer to FP. */
>> @@ -1997,7 +1997,7 @@ static unsigned int initial_ix86_tune_fe
>>    /* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
>>       with a subsequent conditional jump instruction into a single
>>       compare-and-branch uop.  */
>> -  m_BDVER,
>> +  m_BDVER | m_CORE2I7,
>>
>>    /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
>>       will impact LEA instruction selection. */
>> @@ -2052,7 +2052,7 @@ static unsigned int initial_ix86_arch_fe
>>  };
>>
>>  static const unsigned int x86_accumulate_outgoing_args
>> -  = m_PPRO | m_P4_NOCONA | m_ATOM | m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC;
>> +  = m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;
>>
>>  static const unsigned int x86_arch_always_fancy_math_387
>>    = m_PENT | m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;
>> Index: config/i386/i386.md
>> ===================================================================
>> --- config/i386/i386.md (revision 194452)
>> +++ config/i386/i386.md (working copy)
>> @@ -965,7 +965,7 @@
>>         (compare:CC (match_operand:SDWIM 1 "nonimmediate_operand")
>>                     (match_operand:SDWIM 2 "<general_operand>")))
>>     (set (pc) (if_then_else
>> -              (match_operator 0 "ordered_comparison_operator"
>> +              (match_operator 0 "comparison_operator"
>>                 [(reg:CC FLAGS_REG) (const_int 0)])
>>                (label_ref (match_operand 3))
>>                (pc)))]
>> @@ -983,7 +983,7 @@
>>         (compare:CC (match_operand:SWIM 2 "nonimmediate_operand")
>>                     (match_operand:SWIM 3 "<general_operand>")))
>>     (set (match_operand:QI 0 "register_operand")
>> -       (match_operator 1 "ordered_comparison_operator"
>> +       (match_operator 1 "comparison_operator"
>>           [(reg:CC FLAGS_REG) (const_int 0)]))]
>>    ""
>>  {

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-21  7:26 Xinliang David Li
@ 2012-12-21  8:20 ` Zamyatin, Igor
  0 siblings, 0 replies; 40+ messages in thread
From: Zamyatin, Igor @ 2012-12-21  8:20 UTC (permalink / raw)
  To: Xinliang David Li, Jan Hubicka; +Cc: GCC Patches, Ahmad Sharif

We checked also spec2000 and eembc_2_0 on Atom - no visible regressions and gains

-----Original Message-----
From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-owner@gcc.gnu.org] On Behalf Of Xinliang David Li
Sent: Friday, December 21, 2012 11:26 AM
To: Jan Hubicka
Cc: GCC Patches; Ahmad Sharif
Subject: Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

Ahmad has helped doing some atom performance testing (ChromeOS
benchmarks) with this patch. In summary, there is no statistically significant regression seen. There is one improvement of about +1.9%
(v8 benchmark) which looks real.

David

On Wed, Dec 12, 2012 at 9:24 AM, Xinliang David Li <davidxl@google.com> wrote:
> On Wed, Dec 12, 2012 at 8:37 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by 
>>> a SP adjustment instead of a sequence of pushes/pops. The preference 
>>> to the MOVs are good for old CPU micro-architectures (before 
>>> pentium-4, K10), because it breaks the data dependency.  In modern 
>>> micro-architecture, push/pop is implemented using a mechanism called 
>>> stack engine. The data dependency is removed by the hardware, and 
>>> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
>>> smaller. There is no longer the need to avoid using them.   This is
>>> also what ICC does.
>>>
>>> The following patch fixed the problem. It passes 
>>> bootstrap/regression test. OK to install?
>>>
>>> thanks,
>>>
>>> David
>>>
>>> Index: config/i386/i386.c
>>> ===================================================================
>>> --- config/i386/i386.c (revision 194324)
>>> +++ config/i386/i386.c (working copy)
>>> @@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
>>>    m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>>
>>>    /* X86_TUNE_PROLOGUE_USING_MOVE */
>>> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
>>> +  m_PPRO | m_ATHLON_K8,
>>>
>>>    /* X86_TUNE_EPILOGUE_USING_MOVE */
>>> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
>>> +  m_PPRO | m_ATHLON_K8,
>>
>> Push/pops wrt moves was always difficult to tune on old CPUs, so I am 
>> happy it is gone from generic (in fact I had similar patch pending).
>> Are you sure about Atom having stack engine, too?
>>
>
> Good question. The instruction latency table
> (http://www.agner.org/optimize/instruction_tables.pdf) shows that for
> Atom: push r has one 1uop, 1 cycle latency. However the instruction is 
> not pairable which will affect ILP. The guide here 
> http://www.agner.org/optimize/microarchitecture.pdf does not mention 
> Atom has stack engine either.
>
> I will help collect some performance data on Atom.
>
>
> thanks,
>
> David
>
>
>> Related thing is accumulate_outgoing_args. Igor is testing it on Core 
>> and I will give it a try on K10.
>>
>> Honza
>>
>> I am attaching the changes for core costs I made if someone is 
>> interested in testing them.  If we can declare P4/PPRo and maybe K8 
>> chips obsolette for generic, there is room for improvement in 
>> generic, too. Like using inc/dec again.
>>
>> Honza
>>
>> Index: config/i386/i386.c
>> ===================================================================
>> --- config/i386/i386.c  (revision 194452)
>> +++ config/i386/i386.c  (working copy)
>> @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
>>    COSTS_N_INSNS (8),                   /* cost of FABS instruction.  */
>>    COSTS_N_INSNS (8),                   /* cost of FCHS instruction.  */
>>    COSTS_N_INSNS (40),                  /* cost of FSQRT instruction.  */
>> -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>> -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
>> +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>> +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>>                {-1, libcall, false}}}},
>>    {{libcall, {{6, loop_1_byte, true},
>>                {24, loop, true},
>>                {8192, rep_prefix_4_byte, true},
>>                {-1, libcall, false}}},
>> -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
>> +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>>                {-1, libcall, false}}}},
>>    1,                                   /* scalar_stmt_cost.  */
>>    1,                                   /* scalar load_cost.  */
>> @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
>>    m_PPRO,
>>
>>    /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
>> -  m_CORE2I7 | m_GENERIC,
>> +  m_GENERIC | m_CORE2,
>>
>>    /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
>>     * on 16-bit immediate moves into memory on Core2 and Corei7.  */ 
>> @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
>>    m_K6,
>>
>>    /* X86_TUNE_USE_CLTD */
>> -  ~(m_PENT | m_ATOM | m_K6),
>> +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
>>
>>    /* X86_TUNE_USE_XCHGB: Use xchgb %rh,%rl instead of rolw/rorw $8,rx.  */
>>    m_PENT4,
>> @@ -1901,7 +1901,7 @@ static unsigned int initial_ix86_tune_fe
>>    m_COREI7 | m_BDVER,
>>
>>    /* X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL */
>> -  m_BDVER ,
>> +  m_BDVER,
>>
>>    /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies
>>       are resolved on SSE register parts instead of whole registers, 
>> so we may @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
>>    m_ATHLON_K8,
>>
>>    /* X86_TUNE_SSE_TYPELESS_STORES */
>> -  m_AMD_MULTIPLE,
>> +  m_AMD_MULTIPLE | m_CORE2I7, /*????*/
>>
>>    /* X86_TUNE_SSE_LOAD0_BY_PXOR */
>> -  m_PPRO | m_P4_NOCONA,
>> +  m_PPRO | m_P4_NOCONA | m_CORE2I7, /*????*/
>>
>>    /* X86_TUNE_MEMORY_MISMATCH_STALL */
>>    m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC, @@ 
>> -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
>>
>>    /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
>>       than 4 branch instructions in the 16 byte window.  */
>> -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | 
>> m_GENERIC,
>> +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>
>>    /* X86_TUNE_SCHEDULE */
>>    m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE 
>> | m_GENERIC, @@ -1947,10 +1947,10 @@ static unsigned int initial_ix86_tune_fe
>>    m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>
>>    /* X86_TUNE_USE_INCDEC */
>> -  ~(m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_GENERIC),
>> +  ~(m_P4_NOCONA | m_ATOM | m_GENERIC),
>>
>>    /* X86_TUNE_PAD_RETURNS */
>> -  m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC,
>> +  m_AMD_MULTIPLE | m_GENERIC,
>>
>>    /* X86_TUNE_PAD_SHORT_FUNCTION: Pad short funtion.  */
>>    m_ATOM,
>> @@ -1959,7 +1959,7 @@ static unsigned int initial_ix86_tune_fe
>>    m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_K6_GEODE | 
>> m_ATHLON_K8 | m_GENERIC,
>>
>>    /* X86_TUNE_AVOID_VECTOR_DECODE */
>> -  m_CORE2I7 | m_K8 | m_GENERIC64,
>> +  m_K8 | m_GENERIC64,
>>
>>    /* X86_TUNE_PROMOTE_HIMODE_IMUL: Modern CPUs have same latency for HImode
>>       and SImode multiply, but 386 and 486 do HImode multiply faster.  
>> */ @@ -1967,11 +1967,11 @@ static unsigned int initial_ix86_tune_fe
>>
>>    /* X86_TUNE_SLOW_IMUL_IMM32_MEM: Imul of 32-bit constant and memory is
>>       vector path on AMD machines.  */
>> -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
>> +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
>>
>>    /* X86_TUNE_SLOW_IMUL_IMM8: Imul of 8-bit constant is vector path on AMD
>>       machines.  */
>> -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
>> +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
>>
>>    /* X86_TUNE_MOVE_M1_VIA_OR: On pentiums, it is faster to load -1 via OR
>>       than a MOV.  */
>> @@ -1988,7 +1988,7 @@ static unsigned int initial_ix86_tune_fe
>>
>>    /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
>>       from FP to FP. */
>> -  m_CORE2I7 | m_AMDFAM10 | m_GENERIC,
>> +  m_AMDFAM10 | m_GENERIC,
>>
>>    /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
>>       from integer to FP. */
>> @@ -1997,7 +1997,7 @@ static unsigned int initial_ix86_tune_fe
>>    /* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
>>       with a subsequent conditional jump instruction into a single
>>       compare-and-branch uop.  */
>> -  m_BDVER,
>> +  m_BDVER | m_CORE2I7,
>>
>>    /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
>>       will impact LEA instruction selection. */ @@ -2052,7 +2052,7 @@ 
>> static unsigned int initial_ix86_arch_fe  };
>>
>>  static const unsigned int x86_accumulate_outgoing_args
>> -  = m_PPRO | m_P4_NOCONA | m_ATOM | m_CORE2I7 | m_AMD_MULTIPLE | 
>> m_GENERIC;
>> +  = m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;
>>
>>  static const unsigned int x86_arch_always_fancy_math_387
>>    = m_PENT | m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | 
>> m_AMD_MULTIPLE | m_GENERIC;
>> Index: config/i386/i386.md
>> ===================================================================
>> --- config/i386/i386.md (revision 194452)
>> +++ config/i386/i386.md (working copy)
>> @@ -965,7 +965,7 @@
>>         (compare:CC (match_operand:SDWIM 1 "nonimmediate_operand")
>>                     (match_operand:SDWIM 2 "<general_operand>")))
>>     (set (pc) (if_then_else
>> -              (match_operator 0 "ordered_comparison_operator"
>> +              (match_operator 0 "comparison_operator"
>>                 [(reg:CC FLAGS_REG) (const_int 0)])
>>                (label_ref (match_operand 3))
>>                (pc)))]
>> @@ -983,7 +983,7 @@
>>         (compare:CC (match_operand:SWIM 2 "nonimmediate_operand")
>>                     (match_operand:SWIM 3 "<general_operand>")))
>>     (set (match_operand:QI 0 "register_operand")
>> -       (match_operator 1 "ordered_comparison_operator"
>> +       (match_operator 1 "comparison_operator"
>>           [(reg:CC FLAGS_REG) (const_int 0)]))]
>>    ""
>>  {

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
@ 2012-12-11 22:53 Xinliang David Li
  2012-12-11 23:39 ` Xinliang David Li
  2012-12-12 11:00 ` Richard Biener
  0 siblings, 2 replies; 40+ messages in thread
From: Xinliang David Li @ 2012-12-11 22:53 UTC (permalink / raw)
  To: Richard Biener; +Cc: Mike Stump, Uros Bizjak, GCC Patches

The following the O2 size data from SPEC2k.  Note that with push/pop,
it is a always a net win (negative delta) in terms of total binary or
total loadable section size.

thanks,

David

                   .text    .eh_frame  Total_binary
vortex-move 440252 40796 584066
vortex-push 415436 57452 575906
delta            -5.6% 40.8%  -1.397%

twolf-move 169324 10748 223521
twolf-push 168876 11124 223449
delta       -0.3% 3.5% -0.032%

gzip-move 30668 3652 374399
gzip-push 30524 3740 374343
delta     -0.5% 2.4% -0.015%

bzip2-move 22748 3196 111616
bzip2-push 22636 3284 111592
delta      -0.5% 2.8% -0.022%

vpr-move 104684 9380 147378
vpr-push 104236 9788 147338
delta     -0.4% 4.3% -0.027%

mcf-move 8444 1244 26760
mcf-push 8444 1244 26760
delta    0.0% 0.0% 0.000%

cc1-move 1093964 90772 1576994
cc1-push 1078988 104068 1575314
delta      -1.4% 14.6% -0.107%

crafty-move 130556 5508 1256037
crafty-push 130236 5772 1255981
delta        -0.2% 4.8% -0.004%

eon-move 333660 33220 516491
eon-push 330140 35812 515555
delta     -1.1% 7.8% -0.181%

gap-move 404092 46732 1457735
gap-push 396012 53180 1456103
delta     -2.0% 13.8% -0.112%

perlbmk-move 456572 45324 618585
perlbmk-push 449516 52340 618545
delta         -1.5% 15.5% -0.006%

parser-move 81244 15788 334003
parser-push 80684 16332 333987
delta       -0.7% 3.4% -0.005%


On Tue, Dec 11, 2012 at 9:14 AM, Xinliang David Li <davidxl@google.com> wrote:
> On Tue, Dec 11, 2012 at 1:49 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Mon, Dec 10, 2012 at 10:07 PM, Mike Stump <mikestump@comcast.net> wrote:
>>> On Dec 10, 2012, at 12:42 PM, Xinliang David Li <davidxl@google.com> wrote:
>>>> I have not measured the CFI size impact -- but conceivably it should
>>>> be larger -- which is unfortunate.
>>>
>>> Code speed and size are preferable to optimizing dwarf size…  :-)  I'd let dwarf 5 fix it!
>>
>> Well, different to debug info, CFI data has to be in memory to make
>> unwinding work.
>> These days most Linux distributions enable asyncronous unwind tables so any
>> size savings due to shorter push/pop epilogue/prologue sequences has to be
>> offsetted by the increase in CFI data.  I'm not sure there is really a
>> speed difference
>> between both variants (well, maybe due to better icache footprint of
>> the push/pop
>> variant).
>
> Yes, for large applications, this can be crucial to performance.
>
>>
>> That said - I'd prefer to have more data on this before making the switch for
>> the generic model.  What was your original motivation?  Just "theory" or was
>> it a real case?
>
> 1) some of the very large internal apps I measured benefit from this
> change (in terms of performance)
> 2) both ICC and LLVM do the same.
>
> I have already committed the patch. I will find some time to collect
> more size data and post it later.
>
> thanks,
>
> David
>
>
>>
>> Thanks,
>> Richard.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-11 22:53 Xinliang David Li
@ 2012-12-11 23:39 ` Xinliang David Li
  2012-12-12 11:00 ` Richard Biener
  1 sibling, 0 replies; 40+ messages in thread
From: Xinliang David Li @ 2012-12-11 23:39 UTC (permalink / raw)
  To: Richard Biener; +Cc: Mike Stump, Uros Bizjak, GCC Patches

Some SPEC2k performance number (with 3 runs on core2):

Push wins over move on 3 benchmarks. Others are noises.

perlbmk : ~+1.9%
gap:       ~+1.4%
vortex:    ~ +0.7%

David

On Tue, Dec 11, 2012 at 2:53 PM, Xinliang David Li <davidxl@google.com> wrote:
> The following the O2 size data from SPEC2k.  Note that with push/pop,
> it is a always a net win (negative delta) in terms of total binary or
> total loadable section size.
>
> thanks,
>
> David
>
>                    .text    .eh_frame  Total_binary
> vortex-move 440252 40796 584066
> vortex-push 415436 57452 575906
> delta            -5.6% 40.8%  -1.397%
>
> twolf-move 169324 10748 223521
> twolf-push 168876 11124 223449
> delta       -0.3% 3.5% -0.032%
>
> gzip-move 30668 3652 374399
> gzip-push 30524 3740 374343
> delta     -0.5% 2.4% -0.015%
>
> bzip2-move 22748 3196 111616
> bzip2-push 22636 3284 111592
> delta      -0.5% 2.8% -0.022%
>
> vpr-move 104684 9380 147378
> vpr-push 104236 9788 147338
> delta     -0.4% 4.3% -0.027%
>
> mcf-move 8444 1244 26760
> mcf-push 8444 1244 26760
> delta    0.0% 0.0% 0.000%
>
> cc1-move 1093964 90772 1576994
> cc1-push 1078988 104068 1575314
> delta      -1.4% 14.6% -0.107%
>
> crafty-move 130556 5508 1256037
> crafty-push 130236 5772 1255981
> delta        -0.2% 4.8% -0.004%
>
> eon-move 333660 33220 516491
> eon-push 330140 35812 515555
> delta     -1.1% 7.8% -0.181%
>
> gap-move 404092 46732 1457735
> gap-push 396012 53180 1456103
> delta     -2.0% 13.8% -0.112%
>
> perlbmk-move 456572 45324 618585
> perlbmk-push 449516 52340 618545
> delta         -1.5% 15.5% -0.006%
>
> parser-move 81244 15788 334003
> parser-push 80684 16332 333987
> delta       -0.7% 3.4% -0.005%
>
>
> On Tue, Dec 11, 2012 at 9:14 AM, Xinliang David Li <davidxl@google.com> wrote:
>> On Tue, Dec 11, 2012 at 1:49 AM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Mon, Dec 10, 2012 at 10:07 PM, Mike Stump <mikestump@comcast.net> wrote:
>>>> On Dec 10, 2012, at 12:42 PM, Xinliang David Li <davidxl@google.com> wrote:
>>>>> I have not measured the CFI size impact -- but conceivably it should
>>>>> be larger -- which is unfortunate.
>>>>
>>>> Code speed and size are preferable to optimizing dwarf size…  :-)  I'd let dwarf 5 fix it!
>>>
>>> Well, different to debug info, CFI data has to be in memory to make
>>> unwinding work.
>>> These days most Linux distributions enable asyncronous unwind tables so any
>>> size savings due to shorter push/pop epilogue/prologue sequences has to be
>>> offsetted by the increase in CFI data.  I'm not sure there is really a
>>> speed difference
>>> between both variants (well, maybe due to better icache footprint of
>>> the push/pop
>>> variant).
>>
>> Yes, for large applications, this can be crucial to performance.
>>
>>>
>>> That said - I'd prefer to have more data on this before making the switch for
>>> the generic model.  What was your original motivation?  Just "theory" or was
>>> it a real case?
>>
>> 1) some of the very large internal apps I measured benefit from this
>> change (in terms of performance)
>> 2) both ICC and LLVM do the same.
>>
>> I have already committed the patch. I will find some time to collect
>> more size data and post it later.
>>
>> thanks,
>>
>> David
>>
>>
>>>
>>> Thanks,
>>> Richard.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-11 22:53 Xinliang David Li
  2012-12-11 23:39 ` Xinliang David Li
@ 2012-12-12 11:00 ` Richard Biener
  1 sibling, 0 replies; 40+ messages in thread
From: Richard Biener @ 2012-12-12 11:00 UTC (permalink / raw)
  To: Xinliang David Li; +Cc: Mike Stump, Uros Bizjak, GCC Patches

On Tue, Dec 11, 2012 at 11:53 PM, Xinliang David Li <davidxl@google.com> wrote:
> The following the O2 size data from SPEC2k.  Note that with push/pop,
> it is a always a net win (negative delta) in terms of total binary or
> total loadable section size.

Thanks for the data!

Richard.

> thanks,
>
> David
>
>                    .text    .eh_frame  Total_binary
> vortex-move 440252 40796 584066
> vortex-push 415436 57452 575906
> delta            -5.6% 40.8%  -1.397%
>
> twolf-move 169324 10748 223521
> twolf-push 168876 11124 223449
> delta       -0.3% 3.5% -0.032%
>
> gzip-move 30668 3652 374399
> gzip-push 30524 3740 374343
> delta     -0.5% 2.4% -0.015%
>
> bzip2-move 22748 3196 111616
> bzip2-push 22636 3284 111592
> delta      -0.5% 2.8% -0.022%
>
> vpr-move 104684 9380 147378
> vpr-push 104236 9788 147338
> delta     -0.4% 4.3% -0.027%
>
> mcf-move 8444 1244 26760
> mcf-push 8444 1244 26760
> delta    0.0% 0.0% 0.000%
>
> cc1-move 1093964 90772 1576994
> cc1-push 1078988 104068 1575314
> delta      -1.4% 14.6% -0.107%
>
> crafty-move 130556 5508 1256037
> crafty-push 130236 5772 1255981
> delta        -0.2% 4.8% -0.004%
>
> eon-move 333660 33220 516491
> eon-push 330140 35812 515555
> delta     -1.1% 7.8% -0.181%
>
> gap-move 404092 46732 1457735
> gap-push 396012 53180 1456103
> delta     -2.0% 13.8% -0.112%
>
> perlbmk-move 456572 45324 618585
> perlbmk-push 449516 52340 618545
> delta         -1.5% 15.5% -0.006%
>
> parser-move 81244 15788 334003
> parser-push 80684 16332 333987
> delta       -0.7% 3.4% -0.005%
>
>
> On Tue, Dec 11, 2012 at 9:14 AM, Xinliang David Li <davidxl@google.com> wrote:
>> On Tue, Dec 11, 2012 at 1:49 AM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Mon, Dec 10, 2012 at 10:07 PM, Mike Stump <mikestump@comcast.net> wrote:
>>>> On Dec 10, 2012, at 12:42 PM, Xinliang David Li <davidxl@google.com> wrote:
>>>>> I have not measured the CFI size impact -- but conceivably it should
>>>>> be larger -- which is unfortunate.
>>>>
>>>> Code speed and size are preferable to optimizing dwarf size…  :-)  I'd let dwarf 5 fix it!
>>>
>>> Well, different to debug info, CFI data has to be in memory to make
>>> unwinding work.
>>> These days most Linux distributions enable asyncronous unwind tables so any
>>> size savings due to shorter push/pop epilogue/prologue sequences has to be
>>> offsetted by the increase in CFI data.  I'm not sure there is really a
>>> speed difference
>>> between both variants (well, maybe due to better icache footprint of
>>> the push/pop
>>> variant).
>>
>> Yes, for large applications, this can be crucial to performance.
>>
>>>
>>> That said - I'd prefer to have more data on this before making the switch for
>>> the generic model.  What was your original motivation?  Just "theory" or was
>>> it a real case?
>>
>> 1) some of the very large internal apps I measured benefit from this
>> change (in terms of performance)
>> 2) both ICC and LLVM do the same.
>>
>> I have already committed the patch. I will find some time to collect
>> more size data and post it later.
>>
>> thanks,
>>
>> David
>>
>>
>>>
>>> Thanks,
>>> Richard.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
@ 2012-12-08 18:13 Xinliang David Li
  2012-12-12 16:37 ` Jan Hubicka
  0 siblings, 1 reply; 40+ messages in thread
From: Xinliang David Li @ 2012-12-08 18:13 UTC (permalink / raw)
  To: GCC Patches

I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
SP adjustment instead of a sequence of pushes/pops. The preference to
the MOVs are good for old CPU micro-architectures (before pentium-4,
K10), because it breaks the data dependency.  In modern
micro-architecture, push/pop is implemented using a mechanism called
stack engine. The data dependency is removed by the hardware, and
push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
smaller. There is no longer the need to avoid using them.   This is
also what ICC does.

The following patch fixed the problem. It passes bootstrap/regression
test. OK to install?

thanks,

David

Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c (revision 194324)
+++ config/i386/i386.c (working copy)
@@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
   m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,

   /* X86_TUNE_PROLOGUE_USING_MOVE */
-  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
+  m_PPRO | m_ATHLON_K8,

   /* X86_TUNE_EPILOGUE_USING_MOVE */
-  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
+  m_PPRO | m_ATHLON_K8,

   /* X86_TUNE_SHIFT1 */
   ~m_486,

2012-12-08  Xinliang David Li  <davidxl@google.com>
           * config/i386/i386.c: Eanble push/pop in pro/epilogue for
moderen CPUs.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-08 18:13 Xinliang David Li
@ 2012-12-12 16:37 ` Jan Hubicka
  2012-12-12 17:25   ` Xinliang David Li
                     ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Jan Hubicka @ 2012-12-12 16:37 UTC (permalink / raw)
  To: Xinliang David Li; +Cc: GCC Patches

> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
> SP adjustment instead of a sequence of pushes/pops. The preference to
> the MOVs are good for old CPU micro-architectures (before pentium-4,
> K10), because it breaks the data dependency.  In modern
> micro-architecture, push/pop is implemented using a mechanism called
> stack engine. The data dependency is removed by the hardware, and
> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
> smaller. There is no longer the need to avoid using them.   This is
> also what ICC does.
> 
> The following patch fixed the problem. It passes bootstrap/regression
> test. OK to install?
> 
> thanks,
> 
> David
> 
> Index: config/i386/i386.c
> ===================================================================
> --- config/i386/i386.c (revision 194324)
> +++ config/i386/i386.c (working copy)
> @@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
>    m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> 
>    /* X86_TUNE_PROLOGUE_USING_MOVE */
> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
> +  m_PPRO | m_ATHLON_K8,
> 
>    /* X86_TUNE_EPILOGUE_USING_MOVE */
> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
> +  m_PPRO | m_ATHLON_K8,

Push/pops wrt moves was always difficult to tune on old CPUs, so I am happy it
is gone from generic (in fact I had similar patch pending).
Are you sure about Atom having stack engine, too?

Related thing is accumulate_outgoing_args. Igor is testing it on Core and I will
give it a try on K10.

Honza

I am attaching the changes for core costs I made if someone is interested in
testing them.  If we can declare P4/PPRo and maybe K8 chips obsolette for
generic, there is room for improvement in generic, too. Like using inc/dec
again.

Honza

Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 194452)
+++ config/i386/i386.c	(working copy)
@@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
-   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
+  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
+   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
               {-1, libcall, false}}}},
   {{libcall, {{6, loop_1_byte, true},
               {24, loop, true},
               {8192, rep_prefix_4_byte, true},
               {-1, libcall, false}}},
-   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
+   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
               {-1, libcall, false}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
@@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
   m_PPRO,
 
   /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
-  m_CORE2I7 | m_GENERIC,
+  m_GENERIC | m_CORE2,
 
   /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
    * on 16-bit immediate moves into memory on Core2 and Corei7.  */
@@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
   m_K6,
 
   /* X86_TUNE_USE_CLTD */
-  ~(m_PENT | m_ATOM | m_K6),
+  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
 
   /* X86_TUNE_USE_XCHGB: Use xchgb %rh,%rl instead of rolw/rorw $8,rx.  */
   m_PENT4,
@@ -1901,7 +1901,7 @@ static unsigned int initial_ix86_tune_fe
   m_COREI7 | m_BDVER,
 
   /* X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL */
-  m_BDVER ,
+  m_BDVER,
 
   /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies
      are resolved on SSE register parts instead of whole registers, so we may
@@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
   m_ATHLON_K8,
 
   /* X86_TUNE_SSE_TYPELESS_STORES */
-  m_AMD_MULTIPLE,
+  m_AMD_MULTIPLE | m_CORE2I7, /*????*/
 
   /* X86_TUNE_SSE_LOAD0_BY_PXOR */
-  m_PPRO | m_P4_NOCONA,
+  m_PPRO | m_P4_NOCONA | m_CORE2I7, /*????*/
 
   /* X86_TUNE_MEMORY_MISMATCH_STALL */
   m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
@@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
 
   /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
      than 4 branch instructions in the 16 byte window.  */
-  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
+  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
 
   /* X86_TUNE_SCHEDULE */
   m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE | m_GENERIC,
@@ -1947,10 +1947,10 @@ static unsigned int initial_ix86_tune_fe
   m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
 
   /* X86_TUNE_USE_INCDEC */
-  ~(m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_GENERIC),
+  ~(m_P4_NOCONA | m_ATOM | m_GENERIC),
 
   /* X86_TUNE_PAD_RETURNS */
-  m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC,
+  m_AMD_MULTIPLE | m_GENERIC,
 
   /* X86_TUNE_PAD_SHORT_FUNCTION: Pad short funtion.  */
   m_ATOM,
@@ -1959,7 +1959,7 @@ static unsigned int initial_ix86_tune_fe
   m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_ATHLON_K8 | m_GENERIC,
 
   /* X86_TUNE_AVOID_VECTOR_DECODE */
-  m_CORE2I7 | m_K8 | m_GENERIC64,
+  m_K8 | m_GENERIC64,
 
   /* X86_TUNE_PROMOTE_HIMODE_IMUL: Modern CPUs have same latency for HImode
      and SImode multiply, but 386 and 486 do HImode multiply faster.  */
@@ -1967,11 +1967,11 @@ static unsigned int initial_ix86_tune_fe
 
   /* X86_TUNE_SLOW_IMUL_IMM32_MEM: Imul of 32-bit constant and memory is
      vector path on AMD machines.  */
-  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
+  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
 
   /* X86_TUNE_SLOW_IMUL_IMM8: Imul of 8-bit constant is vector path on AMD
      machines.  */
-  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
+  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
 
   /* X86_TUNE_MOVE_M1_VIA_OR: On pentiums, it is faster to load -1 via OR
      than a MOV.  */
@@ -1988,7 +1988,7 @@ static unsigned int initial_ix86_tune_fe
 
   /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
      from FP to FP. */
-  m_CORE2I7 | m_AMDFAM10 | m_GENERIC,
+  m_AMDFAM10 | m_GENERIC,
 
   /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
      from integer to FP. */
@@ -1997,7 +1997,7 @@ static unsigned int initial_ix86_tune_fe
   /* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
      with a subsequent conditional jump instruction into a single
      compare-and-branch uop.  */
-  m_BDVER,
+  m_BDVER | m_CORE2I7,
 
   /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
      will impact LEA instruction selection. */
@@ -2052,7 +2052,7 @@ static unsigned int initial_ix86_arch_fe
 };
 
 static const unsigned int x86_accumulate_outgoing_args
-  = m_PPRO | m_P4_NOCONA | m_ATOM | m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC;
+  = m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;
 
 static const unsigned int x86_arch_always_fancy_math_387
   = m_PENT | m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;
Index: config/i386/i386.md
===================================================================
--- config/i386/i386.md	(revision 194452)
+++ config/i386/i386.md	(working copy)
@@ -965,7 +965,7 @@
 	(compare:CC (match_operand:SDWIM 1 "nonimmediate_operand")
 		    (match_operand:SDWIM 2 "<general_operand>")))
    (set (pc) (if_then_else
-	       (match_operator 0 "ordered_comparison_operator"
+	       (match_operator 0 "comparison_operator"
 		[(reg:CC FLAGS_REG) (const_int 0)])
 	       (label_ref (match_operand 3))
 	       (pc)))]
@@ -983,7 +983,7 @@
 	(compare:CC (match_operand:SWIM 2 "nonimmediate_operand")
 		    (match_operand:SWIM 3 "<general_operand>")))
    (set (match_operand:QI 0 "register_operand")
-	(match_operator 1 "ordered_comparison_operator"
+	(match_operator 1 "comparison_operator"
 	  [(reg:CC FLAGS_REG) (const_int 0)]))]
   ""
 {

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-12 16:37 ` Jan Hubicka
@ 2012-12-12 17:25   ` Xinliang David Li
  2012-12-12 17:34   ` Xinliang David Li
  2012-12-21  8:28   ` Zamyatin, Igor
  2 siblings, 0 replies; 40+ messages in thread
From: Xinliang David Li @ 2012-12-12 17:25 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches

On Wed, Dec 12, 2012 at 8:37 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
>> SP adjustment instead of a sequence of pushes/pops. The preference to
>> the MOVs are good for old CPU micro-architectures (before pentium-4,
>> K10), because it breaks the data dependency.  In modern
>> micro-architecture, push/pop is implemented using a mechanism called
>> stack engine. The data dependency is removed by the hardware, and
>> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
>> smaller. There is no longer the need to avoid using them.   This is
>> also what ICC does.
>>
>> The following patch fixed the problem. It passes bootstrap/regression
>> test. OK to install?
>>
>> thanks,
>>
>> David
>>
>> Index: config/i386/i386.c
>> ===================================================================
>> --- config/i386/i386.c (revision 194324)
>> +++ config/i386/i386.c (working copy)
>> @@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
>>    m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>
>>    /* X86_TUNE_PROLOGUE_USING_MOVE */
>> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
>> +  m_PPRO | m_ATHLON_K8,
>>
>>    /* X86_TUNE_EPILOGUE_USING_MOVE */
>> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
>> +  m_PPRO | m_ATHLON_K8,
>
> Push/pops wrt moves was always difficult to tune on old CPUs, so I am happy it
> is gone from generic (in fact I had similar patch pending).
> Are you sure about Atom having stack engine, too?
>

Good question. The instruction latency table
(http://www.agner.org/optimize/instruction_tables.pdf) shows that for
Atom: push r has one 1uop, 1 cycle latency. However the instruction is
not pairable which will affect ILP. The guide here
http://www.agner.org/optimize/microarchitecture.pdf does not mention
Atom has stack engine either.

I will help collect some performance data on Atom.


thanks,

David


> Related thing is accumulate_outgoing_args. Igor is testing it on Core and I will
> give it a try on K10.
>
> Honza
>
> I am attaching the changes for core costs I made if someone is interested in
> testing them.  If we can declare P4/PPRo and maybe K8 chips obsolette for
> generic, there is room for improvement in generic, too. Like using inc/dec
> again.
>
> Honza
>
> Index: config/i386/i386.c
> ===================================================================
> --- config/i386/i386.c  (revision 194452)
> +++ config/i386/i386.c  (working copy)
> @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
>    COSTS_N_INSNS (8),                   /* cost of FABS instruction.  */
>    COSTS_N_INSNS (8),                   /* cost of FCHS instruction.  */
>    COSTS_N_INSNS (40),                  /* cost of FSQRT instruction.  */
> -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
> +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>                {-1, libcall, false}}}},
>    {{libcall, {{6, loop_1_byte, true},
>                {24, loop, true},
>                {8192, rep_prefix_4_byte, true},
>                {-1, libcall, false}}},
> -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
> +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>                {-1, libcall, false}}}},
>    1,                                   /* scalar_stmt_cost.  */
>    1,                                   /* scalar load_cost.  */
> @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
>    m_PPRO,
>
>    /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
> -  m_CORE2I7 | m_GENERIC,
> +  m_GENERIC | m_CORE2,
>
>    /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
>     * on 16-bit immediate moves into memory on Core2 and Corei7.  */
> @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
>    m_K6,
>
>    /* X86_TUNE_USE_CLTD */
> -  ~(m_PENT | m_ATOM | m_K6),
> +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
>
>    /* X86_TUNE_USE_XCHGB: Use xchgb %rh,%rl instead of rolw/rorw $8,rx.  */
>    m_PENT4,
> @@ -1901,7 +1901,7 @@ static unsigned int initial_ix86_tune_fe
>    m_COREI7 | m_BDVER,
>
>    /* X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL */
> -  m_BDVER ,
> +  m_BDVER,
>
>    /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies
>       are resolved on SSE register parts instead of whole registers, so we may
> @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
>    m_ATHLON_K8,
>
>    /* X86_TUNE_SSE_TYPELESS_STORES */
> -  m_AMD_MULTIPLE,
> +  m_AMD_MULTIPLE | m_CORE2I7, /*????*/
>
>    /* X86_TUNE_SSE_LOAD0_BY_PXOR */
> -  m_PPRO | m_P4_NOCONA,
> +  m_PPRO | m_P4_NOCONA | m_CORE2I7, /*????*/
>
>    /* X86_TUNE_MEMORY_MISMATCH_STALL */
>    m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
>
>    /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
>       than 4 branch instructions in the 16 byte window.  */
> -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>
>    /* X86_TUNE_SCHEDULE */
>    m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE | m_GENERIC,
> @@ -1947,10 +1947,10 @@ static unsigned int initial_ix86_tune_fe
>    m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>
>    /* X86_TUNE_USE_INCDEC */
> -  ~(m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_GENERIC),
> +  ~(m_P4_NOCONA | m_ATOM | m_GENERIC),
>
>    /* X86_TUNE_PAD_RETURNS */
> -  m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC,
> +  m_AMD_MULTIPLE | m_GENERIC,
>
>    /* X86_TUNE_PAD_SHORT_FUNCTION: Pad short funtion.  */
>    m_ATOM,
> @@ -1959,7 +1959,7 @@ static unsigned int initial_ix86_tune_fe
>    m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_ATHLON_K8 | m_GENERIC,
>
>    /* X86_TUNE_AVOID_VECTOR_DECODE */
> -  m_CORE2I7 | m_K8 | m_GENERIC64,
> +  m_K8 | m_GENERIC64,
>
>    /* X86_TUNE_PROMOTE_HIMODE_IMUL: Modern CPUs have same latency for HImode
>       and SImode multiply, but 386 and 486 do HImode multiply faster.  */
> @@ -1967,11 +1967,11 @@ static unsigned int initial_ix86_tune_fe
>
>    /* X86_TUNE_SLOW_IMUL_IMM32_MEM: Imul of 32-bit constant and memory is
>       vector path on AMD machines.  */
> -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
> +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
>
>    /* X86_TUNE_SLOW_IMUL_IMM8: Imul of 8-bit constant is vector path on AMD
>       machines.  */
> -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
> +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
>
>    /* X86_TUNE_MOVE_M1_VIA_OR: On pentiums, it is faster to load -1 via OR
>       than a MOV.  */
> @@ -1988,7 +1988,7 @@ static unsigned int initial_ix86_tune_fe
>
>    /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
>       from FP to FP. */
> -  m_CORE2I7 | m_AMDFAM10 | m_GENERIC,
> +  m_AMDFAM10 | m_GENERIC,
>
>    /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
>       from integer to FP. */
> @@ -1997,7 +1997,7 @@ static unsigned int initial_ix86_tune_fe
>    /* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
>       with a subsequent conditional jump instruction into a single
>       compare-and-branch uop.  */
> -  m_BDVER,
> +  m_BDVER | m_CORE2I7,
>
>    /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
>       will impact LEA instruction selection. */
> @@ -2052,7 +2052,7 @@ static unsigned int initial_ix86_arch_fe
>  };
>
>  static const unsigned int x86_accumulate_outgoing_args
> -  = m_PPRO | m_P4_NOCONA | m_ATOM | m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC;
> +  = m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;
>
>  static const unsigned int x86_arch_always_fancy_math_387
>    = m_PENT | m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;
> Index: config/i386/i386.md
> ===================================================================
> --- config/i386/i386.md (revision 194452)
> +++ config/i386/i386.md (working copy)
> @@ -965,7 +965,7 @@
>         (compare:CC (match_operand:SDWIM 1 "nonimmediate_operand")
>                     (match_operand:SDWIM 2 "<general_operand>")))
>     (set (pc) (if_then_else
> -              (match_operator 0 "ordered_comparison_operator"
> +              (match_operator 0 "comparison_operator"
>                 [(reg:CC FLAGS_REG) (const_int 0)])
>                (label_ref (match_operand 3))
>                (pc)))]
> @@ -983,7 +983,7 @@
>         (compare:CC (match_operand:SWIM 2 "nonimmediate_operand")
>                     (match_operand:SWIM 3 "<general_operand>")))
>     (set (match_operand:QI 0 "register_operand")
> -       (match_operator 1 "ordered_comparison_operator"
> +       (match_operator 1 "comparison_operator"
>           [(reg:CC FLAGS_REG) (const_int 0)]))]
>    ""
>  {

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-12 16:37 ` Jan Hubicka
  2012-12-12 17:25   ` Xinliang David Li
@ 2012-12-12 17:34   ` Xinliang David Li
  2012-12-12 18:30     ` Jan Hubicka
  2012-12-21  8:28   ` Zamyatin, Igor
  2 siblings, 1 reply; 40+ messages in thread
From: Xinliang David Li @ 2012-12-12 17:34 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches, Teresa Johnson

Honza, can you explain each change and point to the reference?

thanks,

David

On Wed, Dec 12, 2012 at 8:37 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
>> SP adjustment instead of a sequence of pushes/pops. The preference to
>> the MOVs are good for old CPU micro-architectures (before pentium-4,
>> K10), because it breaks the data dependency.  In modern
>> micro-architecture, push/pop is implemented using a mechanism called
>> stack engine. The data dependency is removed by the hardware, and
>> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
>> smaller. There is no longer the need to avoid using them.   This is
>> also what ICC does.
>>
>> The following patch fixed the problem. It passes bootstrap/regression
>> test. OK to install?
>>
>> thanks,
>>
>> David
>>
>> Index: config/i386/i386.c
>> ===================================================================
>> --- config/i386/i386.c (revision 194324)
>> +++ config/i386/i386.c (working copy)
>> @@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
>>    m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>
>>    /* X86_TUNE_PROLOGUE_USING_MOVE */
>> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
>> +  m_PPRO | m_ATHLON_K8,
>>
>>    /* X86_TUNE_EPILOGUE_USING_MOVE */
>> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
>> +  m_PPRO | m_ATHLON_K8,
>
> Push/pops wrt moves was always difficult to tune on old CPUs, so I am happy it
> is gone from generic (in fact I had similar patch pending).
> Are you sure about Atom having stack engine, too?
>
> Related thing is accumulate_outgoing_args. Igor is testing it on Core and I will
> give it a try on K10.
>
> Honza
>
> I am attaching the changes for core costs I made if someone is interested in
> testing them.  If we can declare P4/PPRo and maybe K8 chips obsolette for
> generic, there is room for improvement in generic, too. Like using inc/dec
> again.
>
> Honza
>
> Index: config/i386/i386.c
> ===================================================================
> --- config/i386/i386.c  (revision 194452)
> +++ config/i386/i386.c  (working copy)
> @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
>    COSTS_N_INSNS (8),                   /* cost of FABS instruction.  */
>    COSTS_N_INSNS (8),                   /* cost of FCHS instruction.  */
>    COSTS_N_INSNS (40),                  /* cost of FSQRT instruction.  */
> -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
> +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>                {-1, libcall, false}}}},
>    {{libcall, {{6, loop_1_byte, true},
>                {24, loop, true},
>                {8192, rep_prefix_4_byte, true},
>                {-1, libcall, false}}},
> -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
> +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>                {-1, libcall, false}}}},
>    1,                                   /* scalar_stmt_cost.  */
>    1,                                   /* scalar load_cost.  */
> @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
>    m_PPRO,
>
>    /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
> -  m_CORE2I7 | m_GENERIC,
> +  m_GENERIC | m_CORE2,
>
>    /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
>     * on 16-bit immediate moves into memory on Core2 and Corei7.  */
> @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
>    m_K6,
>
>    /* X86_TUNE_USE_CLTD */
> -  ~(m_PENT | m_ATOM | m_K6),
> +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
>
>    /* X86_TUNE_USE_XCHGB: Use xchgb %rh,%rl instead of rolw/rorw $8,rx.  */
>    m_PENT4,
> @@ -1901,7 +1901,7 @@ static unsigned int initial_ix86_tune_fe
>    m_COREI7 | m_BDVER,
>
>    /* X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL */
> -  m_BDVER ,
> +  m_BDVER,
>
>    /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies
>       are resolved on SSE register parts instead of whole registers, so we may
> @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
>    m_ATHLON_K8,
>
>    /* X86_TUNE_SSE_TYPELESS_STORES */
> -  m_AMD_MULTIPLE,
> +  m_AMD_MULTIPLE | m_CORE2I7, /*????*/
>
>    /* X86_TUNE_SSE_LOAD0_BY_PXOR */
> -  m_PPRO | m_P4_NOCONA,
> +  m_PPRO | m_P4_NOCONA | m_CORE2I7, /*????*/
>
>    /* X86_TUNE_MEMORY_MISMATCH_STALL */
>    m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
>
>    /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
>       than 4 branch instructions in the 16 byte window.  */
> -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>
>    /* X86_TUNE_SCHEDULE */
>    m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE | m_GENERIC,
> @@ -1947,10 +1947,10 @@ static unsigned int initial_ix86_tune_fe
>    m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>
>    /* X86_TUNE_USE_INCDEC */
> -  ~(m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_GENERIC),
> +  ~(m_P4_NOCONA | m_ATOM | m_GENERIC),
>
>    /* X86_TUNE_PAD_RETURNS */
> -  m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC,
> +  m_AMD_MULTIPLE | m_GENERIC,
>
>    /* X86_TUNE_PAD_SHORT_FUNCTION: Pad short funtion.  */
>    m_ATOM,
> @@ -1959,7 +1959,7 @@ static unsigned int initial_ix86_tune_fe
>    m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_ATHLON_K8 | m_GENERIC,
>
>    /* X86_TUNE_AVOID_VECTOR_DECODE */
> -  m_CORE2I7 | m_K8 | m_GENERIC64,
> +  m_K8 | m_GENERIC64,
>
>    /* X86_TUNE_PROMOTE_HIMODE_IMUL: Modern CPUs have same latency for HImode
>       and SImode multiply, but 386 and 486 do HImode multiply faster.  */
> @@ -1967,11 +1967,11 @@ static unsigned int initial_ix86_tune_fe
>
>    /* X86_TUNE_SLOW_IMUL_IMM32_MEM: Imul of 32-bit constant and memory is
>       vector path on AMD machines.  */
> -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
> +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
>
>    /* X86_TUNE_SLOW_IMUL_IMM8: Imul of 8-bit constant is vector path on AMD
>       machines.  */
> -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
> +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
>
>    /* X86_TUNE_MOVE_M1_VIA_OR: On pentiums, it is faster to load -1 via OR
>       than a MOV.  */
> @@ -1988,7 +1988,7 @@ static unsigned int initial_ix86_tune_fe
>
>    /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
>       from FP to FP. */
> -  m_CORE2I7 | m_AMDFAM10 | m_GENERIC,
> +  m_AMDFAM10 | m_GENERIC,
>
>    /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
>       from integer to FP. */
> @@ -1997,7 +1997,7 @@ static unsigned int initial_ix86_tune_fe
>    /* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
>       with a subsequent conditional jump instruction into a single
>       compare-and-branch uop.  */
> -  m_BDVER,
> +  m_BDVER | m_CORE2I7,
>
>    /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
>       will impact LEA instruction selection. */
> @@ -2052,7 +2052,7 @@ static unsigned int initial_ix86_arch_fe
>  };
>
>  static const unsigned int x86_accumulate_outgoing_args
> -  = m_PPRO | m_P4_NOCONA | m_ATOM | m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC;
> +  = m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;
>
>  static const unsigned int x86_arch_always_fancy_math_387
>    = m_PENT | m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;
> Index: config/i386/i386.md
> ===================================================================
> --- config/i386/i386.md (revision 194452)
> +++ config/i386/i386.md (working copy)
> @@ -965,7 +965,7 @@
>         (compare:CC (match_operand:SDWIM 1 "nonimmediate_operand")
>                     (match_operand:SDWIM 2 "<general_operand>")))
>     (set (pc) (if_then_else
> -              (match_operator 0 "ordered_comparison_operator"
> +              (match_operator 0 "comparison_operator"
>                 [(reg:CC FLAGS_REG) (const_int 0)])
>                (label_ref (match_operand 3))
>                (pc)))]
> @@ -983,7 +983,7 @@
>         (compare:CC (match_operand:SWIM 2 "nonimmediate_operand")
>                     (match_operand:SWIM 3 "<general_operand>")))
>     (set (match_operand:QI 0 "register_operand")
> -       (match_operator 1 "ordered_comparison_operator"
> +       (match_operator 1 "comparison_operator"
>           [(reg:CC FLAGS_REG) (const_int 0)]))]
>    ""
>  {

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-12 17:34   ` Xinliang David Li
@ 2012-12-12 18:30     ` Jan Hubicka
  2012-12-12 18:37       ` Andi Kleen
  2012-12-13  0:16       ` Xinliang David Li
  0 siblings, 2 replies; 40+ messages in thread
From: Jan Hubicka @ 2012-12-12 18:30 UTC (permalink / raw)
  To: Xinliang David Li; +Cc: Jan Hubicka, GCC Patches, Teresa Johnson

Concerning 1push per cycle, I think it is same as K7 hardware did, so move
prologue should be a win.
> > Index: config/i386/i386.c
> > ===================================================================
> > --- config/i386/i386.c  (revision 194452)
> > +++ config/i386/i386.c  (working copy)
> > @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
> >    COSTS_N_INSNS (8),                   /* cost of FABS instruction.  */
> >    COSTS_N_INSNS (8),                   /* cost of FCHS instruction.  */
> >    COSTS_N_INSNS (40),                  /* cost of FSQRT instruction.  */
> > -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> > -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
> > +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> > +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
> >                {-1, libcall, false}}}},
> >    {{libcall, {{6, loop_1_byte, true},
> >                {24, loop, true},
> >                {8192, rep_prefix_4_byte, true},
> >                {-1, libcall, false}}},
> > -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
> > +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},

libcall is not faster up to 8KB to rep sequence that is better for regalloc/code
cache than fully blowin function call.
> > @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
> >    m_PPRO,
> >
> >    /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
> > -  m_CORE2I7 | m_GENERIC,
> > +  m_GENERIC | m_CORE2,

This disable shifts that store just some flags. Acroding to Agner's manual I7 handle
this well.

Partial flags stall
The Sandy Bridge uses the method of an extra Ă‚Âľop to join partial registers not only for
general purpose registers but also for the flags register, unlike previous processors which
used this method only for general purpose registers. This occurs when a write to a part of
the flags register is followed by a read from a larger part of the flags register. The partial
flags stall of previous processors (See page 75) is therefore replaced by an extra Ă‚Âľop. The
Sandy Bridge also generates an extra Ă‚Âľop when reading the flags after a rotate instruction.

This is cheaper than the 7 cycle delay on Core this flags is trying to avoid.
> >
> >    /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
> >     * on 16-bit immediate moves into memory on Core2 and Corei7.  */
> > @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
> >    m_K6,
> >
> >    /* X86_TUNE_USE_CLTD */
> > -  ~(m_PENT | m_ATOM | m_K6),
> > +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),

None of CPUs that generic care about are !USE_CLTD now after your change.
> > @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
> >    m_ATHLON_K8,
> >
> >    /* X86_TUNE_SSE_TYPELESS_STORES */
> > -  m_AMD_MULTIPLE,
> > +  m_AMD_MULTIPLE | m_CORE2I7, /*????*/

Hmm, I can not seem to find this in manual now, but I believe that stores also do not type,
so movaps is preferred over movapd store because it is shorter.  If not, this change should
produce a lot of slowdowns.
> >
> >    /* X86_TUNE_SSE_LOAD0_BY_PXOR */
> > -  m_PPRO | m_P4_NOCONA,
> > +  m_PPRO | m_P4_NOCONA | m_CORE2I7, /*????*/

Agner:
A common way of setting a register to zero is XOR EAX,EAX or SUB EBX,EBX. The
Core2 and Nehalem processors recognize that certain instructions are independent of the
prior value of the register if the source and destination registers are the same.

This applies to all of the following instructions: XOR, SUB, PXOR, XORPS, XORPD, and all
variants of PSUBxxx and PCMPxxx except PCMPEQQ.
> >
> >    /* X86_TUNE_MEMORY_MISMATCH_STALL */
> >    m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> > @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
> >
> >    /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
> >       than 4 branch instructions in the 16 byte window.  */
> > -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> > +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,

This is special passs to handle limitations of AMD's K7/K8/K10 branch prediction.
Intel never had similar design, so this flag is pointless.

We apparently ought to disable it for K10, at least per Agner's manual.
> >
> >    /* X86_TUNE_SCHEDULE */
> >    m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE | m_GENERIC,
> > @@ -1947,10 +1947,10 @@ static unsigned int initial_ix86_tune_fe
> >    m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> >
> >    /* X86_TUNE_USE_INCDEC */
> > -  ~(m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_GENERIC),
> > +  ~(m_P4_NOCONA | m_ATOM | m_GENERIC),

Skipping inc/dec is to avoid partial flag stall happening on P4 only.
> >
> >    /* X86_TUNE_PAD_RETURNS */
> > -  m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC,
> > +  m_AMD_MULTIPLE | m_GENERIC,

Again this deals specifically with AMD K7/K8/K10 branch prediction.  I am not even
sure this should be enabled for K10.
> >
> >    /* X86_TUNE_PAD_SHORT_FUNCTION: Pad short funtion.  */
> >    m_ATOM,
> > @@ -1959,7 +1959,7 @@ static unsigned int initial_ix86_tune_fe
> >    m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_ATHLON_K8 | m_GENERIC,
> >
> >    /* X86_TUNE_AVOID_VECTOR_DECODE */
> > -  m_CORE2I7 | m_K8 | m_GENERIC64,
> > +  m_K8 | m_GENERIC64,

This avoid AMD vector decoded instructions, again if it helped it did so by accident.
> >
> >    /* X86_TUNE_PROMOTE_HIMODE_IMUL: Modern CPUs have same latency for HImode
> >       and SImode multiply, but 386 and 486 do HImode multiply faster.  */
> > @@ -1967,11 +1967,11 @@ static unsigned int initial_ix86_tune_fe
> >
> >    /* X86_TUNE_SLOW_IMUL_IMM32_MEM: Imul of 32-bit constant and memory is
> >       vector path on AMD machines.  */
> > -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
> > +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
> >
> >    /* X86_TUNE_SLOW_IMUL_IMM8: Imul of 8-bit constant is vector path on AMD
> >       machines.  */
> > -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
> > +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,

This is similarly targetd for AMD hardware only. I did not find ismilar limiation
in the optimization manual.
> >
> >    /* X86_TUNE_MOVE_M1_VIA_OR: On pentiums, it is faster to load -1 via OR
> >       than a MOV.  */
> > @@ -1988,7 +1988,7 @@ static unsigned int initial_ix86_tune_fe
> >
> >    /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
> >       from FP to FP. */
> > -  m_CORE2I7 | m_AMDFAM10 | m_GENERIC,
> > +  m_AMDFAM10 | m_GENERIC,

This is quite specific feature of AMD chips to preffer packed converts over
scalar. Nothing like this is documented for cores
> >
> >    /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
> >       from integer to FP. */
> > @@ -1997,7 +1997,7 @@ static unsigned int initial_ix86_tune_fe
> >    /* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
> >       with a subsequent conditional jump instruction into a single
> >       compare-and-branch uop.  */
> > -  m_BDVER,
> > +  m_BDVER | m_CORE2I7,

Core iplements fusion similar to what AMD does so I think this just applies here.
> >
> >    /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
> >       will impact LEA instruction selection. */
> > @@ -2052,7 +2052,7 @@ static unsigned int initial_ix86_arch_fe
> >  };
> >
> >  static const unsigned int x86_accumulate_outgoing_args
> > -  = m_PPRO | m_P4_NOCONA | m_ATOM | m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC;
> > +  = m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;

Stack engine should make this cheap, just like prologues in moves.
This definitely needs some validation, the accumulate-outgoing-args
codegen differs quite a lot. Also this leads to unwind tables bloat.
> >
> >  static const unsigned int x86_arch_always_fancy_math_387
> >    = m_PENT | m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;

Honza

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-12 18:30     ` Jan Hubicka
@ 2012-12-12 18:37       ` Andi Kleen
  2012-12-12 18:43         ` Jan Hubicka
  2012-12-12 18:43         ` Andi Kleen
  2012-12-13  0:16       ` Xinliang David Li
  1 sibling, 2 replies; 40+ messages in thread
From: Andi Kleen @ 2012-12-12 18:37 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Xinliang David Li, GCC Patches, Teresa Johnson

Jan Hubicka <hubicka@ucw.cz> writes:
>
> libcall is not faster up to 8KB to rep sequence that is better for regalloc/code
> cache than fully blowin function call.

I noticed btw that some of the generated string instructions are slower 
than just calling the C library.

rep scasb etc. is rarely a win over an optimized library function,
it's not very optimized. Perhaps those patterns should just be disabled.
The way to optimize that on modern CPUs is to use PCMP*STR*, but that's
quite a bit more complicated and has some constraints.


>> >    /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
>> >       than 4 branch instructions in the 16 byte window.  */
>> > -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>> > +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>
> This is special passs to handle limitations of AMD's K7/K8/K10 branch prediction.
> Intel never had similar design, so this flag is pointless.

Actually the Sandy Bridge decoded icache has a limit of 3 jumps per
16 byte window. If you exceed that it falls back to running 
the full decoder from the normal icache.

I don't have solid data, but it may be a win for frontend limited
code (otherwise possibly more in power than performance)

I would revisit that for Sandy Bridge

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-12 18:37       ` Andi Kleen
@ 2012-12-12 18:43         ` Jan Hubicka
  2012-12-12 18:43         ` Andi Kleen
  1 sibling, 0 replies; 40+ messages in thread
From: Jan Hubicka @ 2012-12-12 18:43 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Jan Hubicka, Xinliang David Li, GCC Patches, Teresa Johnson

> Jan Hubicka <hubicka@ucw.cz> writes:
> >
> > libcall is not faster up to 8KB to rep sequence that is better for regalloc/code
> > cache than fully blowin function call.
> 
> I noticed btw that some of the generated string instructions are slower 
> than just calling the C library.
> 
> rep scasb etc. is rarely a win over an optimized library function,
> it's not very optimized. Perhaps those patterns should just be disabled.
> The way to optimize that on modern CPUs is to use PCMP*STR*, but that's
> quite a bit more complicated and has some constraints.

This is only about memset/memcpy expanding.  The other sequences are quite lame indeed...
> 
> 
> >> >    /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
> >> >       than 4 branch instructions in the 16 byte window.  */
> >> > -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> >> > +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> >
> > This is special passs to handle limitations of AMD's K7/K8/K10 branch prediction.
> > Intel never had similar design, so this flag is pointless.
> 
> Actually the Sandy Bridge decoded icache has a limit of 3 jumps per
> 16 byte window. If you exceed that it falls back to running 
> the full decoder from the normal icache.
> 
> I don't have solid data, but it may be a win for frontend limited
> code (otherwise possibly more in power than performance)
> 
> I would revisit that for Sandy Bridge

We are not particularly good on avoiding the branches - basically the code inserts alignment
whenever it things the 4 consecutive branches fit in the window.
I can make patch to change this to 3 and we can see if it helps at all.
> 
> -Andi
> -- 
> ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-12 18:37       ` Andi Kleen
  2012-12-12 18:43         ` Jan Hubicka
@ 2012-12-12 18:43         ` Andi Kleen
  1 sibling, 0 replies; 40+ messages in thread
From: Andi Kleen @ 2012-12-12 18:43 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Xinliang David Li, GCC Patches, Teresa Johnson

Andi Kleen <andi@firstfloor.org> writes:
>
>>> >    /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
>>> >       than 4 branch instructions in the 16 byte window.  */
>>> > -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>> > +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>
>> This is special passs to handle limitations of AMD's K7/K8/K10 branch prediction.
>> Intel never had similar design, so this flag is pointless.
>
> Actually the Sandy Bridge decoded icache has a limit of 3 jumps per
> 16 byte window.

Actually it's four per 32bytes, sorry.

Here's an old patch I had lying around to optimize for that.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 1b871be..9b57316 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2713,6 +2713,7 @@ ix86_target_string (HOST_WIDE_INT isa, int flags, const char *arch,
     { "-mavx256-split-unaligned-load",	MASK_AVX256_SPLIT_UNALIGNED_LOAD},
     { "-mavx256-split-unaligned-store",	MASK_AVX256_SPLIT_UNALIGNED_STORE},
     { "-mprefer-avx128",		MASK_PREFER_AVX128},
+    { "-mjump-pad-32bytes",		MASK_JUMP_PAD_32BYTES},
   };
 
   const char *opts[ARRAY_SIZE (isa_opts) + ARRAY_SIZE (flag_opts) + 6][2];
@@ -32182,6 +32183,7 @@ ix86_avoid_jump_mispredicts (void)
   rtx insn, start = get_insns ();
   int nbytes = 0, njumps = 0;
   int isjump = 0;
+  int jump_pad_window_size = TARGET_JUMP_PAD_32BYTES ? 32 : 16;
 
   /* Look for all minimal intervals of instructions containing 4 jumps.
      The intervals are bounded by START and INSN.  NBYTES is the total
@@ -32202,8 +32204,8 @@ ix86_avoid_jump_mispredicts (void)
 	  int align = label_to_alignment (insn);
 	  int max_skip = label_to_max_skip (insn);
 
-	  if (max_skip > 15)
-	    max_skip = 15;
+	  if (max_skip > jump_pad_window_size - 1)
+	    max_skip = jump_pad_window_size - 1;
 	  /* If align > 3, only up to 16 - max_skip - 1 bytes can be
 	     already in the current 16 byte page, because otherwise
 	     ASM_OUTPUT_MAX_SKIP_ALIGN could skip max_skip or fewer
@@ -32216,7 +32218,7 @@ ix86_avoid_jump_mispredicts (void)
 		     INSN_UID (insn), max_skip);
 	  if (max_skip)
 	    {
-	      while (nbytes + max_skip >= 16)
+	      while (nbytes + max_skip >= jump_pad_window_size)
 		{
 		  start = NEXT_INSN (start);
 		  if ((JUMP_P (start)
@@ -32262,10 +32264,11 @@ ix86_avoid_jump_mispredicts (void)
         fprintf (dump_file, "Interval %i to %i has %i bytes\n",
 		 INSN_UID (start), INSN_UID (insn), nbytes);
 
-      if (njumps == 3 && isjump && nbytes < 16)
+      if (njumps == 3 && isjump && nbytes < jump_pad_window_size)
 	{
-	  int padsize = 15 - nbytes + min_insn_size (insn);
-
+	  int padsize = jump_pad_window_size - 1 - nbytes + 
+	    min_insn_size (insn);
+	  
 	  if (dump_file)
 	    fprintf (dump_file, "Padding insn %i by %i bytes!\n",
 		     INSN_UID (insn), padsize);
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 6c516e7..b38d163 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -223,6 +223,10 @@ mintel-syntax
 Target Undocumented Alias(masm=, intel, att) Warn(%<-mintel-syntax%> and %<-mno-intel-syntax%> are deprecated; use %<-masm=intel%> and %<-masm=att%> instead)
 ;; Deprecated
 
+mjump-pad-32bytes
+Target RejectNegative Mask(JUMP_PAD_32BYTES) Save
+Avoid more than 4 jumps in each 32byte code window.
+
 mms-bitfields
 Target Report Mask(MS_BITFIELD_LAYOUT) Save
 Use native (MS) bitfield layout


-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-12 18:30     ` Jan Hubicka
  2012-12-12 18:37       ` Andi Kleen
@ 2012-12-13  0:16       ` Xinliang David Li
  2012-12-13  0:16         ` Xinliang David Li
  2012-12-13  1:19         ` Jan Hubicka
  1 sibling, 2 replies; 40+ messages in thread
From: Xinliang David Li @ 2012-12-13  0:16 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches, Teresa Johnson

On Wed, Dec 12, 2012 at 10:30 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> Concerning 1push per cycle, I think it is same as K7 hardware did, so move
> prologue should be a win.
>> > Index: config/i386/i386.c
>> > ===================================================================
>> > --- config/i386/i386.c  (revision 194452)
>> > +++ config/i386/i386.c  (working copy)
>> > @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
>> >    COSTS_N_INSNS (8),                   /* cost of FABS instruction.  */
>> >    COSTS_N_INSNS (8),                   /* cost of FCHS instruction.  */
>> >    COSTS_N_INSNS (40),                  /* cost of FSQRT instruction.  */
>> > -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>> > -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
>> > +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>> > +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>> >                {-1, libcall, false}}}},
>> >    {{libcall, {{6, loop_1_byte, true},
>> >                {24, loop, true},
>> >                {8192, rep_prefix_4_byte, true},
>> >                {-1, libcall, false}}},
>> > -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
>> > +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>
> libcall is not faster up to 8KB to rep sequence that is better for regalloc/code
> cache than fully blowin function call.

Be careful with this. My recollection is that REP sequence is good for
any size -- for smaller size, the REP initial set up cost is too high
(10s of cycles), while for large size copy, it is less efficient
compared with library version.


>> > @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
>> >    m_PPRO,
>> >
>> >    /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
>> > -  m_CORE2I7 | m_GENERIC,
>> > +  m_GENERIC | m_CORE2,
>
> This disable shifts that store just some flags. Acroding to Agner's manual I7 handle
> this well.
>

ok.

> Partial flags stall
> The Sandy Bridge uses the method of an extra Âľop to join partial registers not only for
> general purpose registers but also for the flags register, unlike previous processors which
> used this method only for general purpose registers. This occurs when a write to a part of
> the flags register is followed by a read from a larger part of the flags register. The partial
> flags stall of previous processors (See page 75) is therefore replaced by an extra Âľop. The
> Sandy Bridge also generates an extra Âľop when reading the flags after a rotate instruction.
>
> This is cheaper than the 7 cycle delay on Core this flags is trying to avoid.

ok.

>> >
>> >    /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
>> >     * on 16-bit immediate moves into memory on Core2 and Corei7.  */
>> > @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
>> >    m_K6,
>> >
>> >    /* X86_TUNE_USE_CLTD */
>> > -  ~(m_PENT | m_ATOM | m_K6),
>> > +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),

My change was to enable CLTD for generic. Is your change intended to
revert that?

>
> None of CPUs that generic care about are !USE_CLTD now after your change.
>> > @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
>> >    m_ATHLON_K8,
>> >
>> >    /* X86_TUNE_SSE_TYPELESS_STORES */
>> > -  m_AMD_MULTIPLE,
>> > +  m_AMD_MULTIPLE | m_CORE2I7, /*????*/
>
> Hmm, I can not seem to find this in manual now, but I believe that stores also do not type,
> so movaps is preferred over movapd store because it is shorter.  If not, this change should
> produce a lot of slowdowns.
>> >
>> >    /* X86_TUNE_SSE_LOAD0_BY_PXOR */
>> > -  m_PPRO | m_P4_NOCONA,
>> > +  m_PPRO | m_P4_NOCONA | m_CORE2I7, /*????*/
>
> Agner:
> A common way of setting a register to zero is XOR EAX,EAX or SUB EBX,EBX. The
> Core2 and Nehalem processors recognize that certain instructions are independent of the
> prior value of the register if the source and destination registers are the same.
>
> This applies to all of the following instructions: XOR, SUB, PXOR, XORPS, XORPD, and all
> variants of PSUBxxx and PCMPxxx except PCMPEQQ.
>> >
>> >    /* X86_TUNE_MEMORY_MISMATCH_STALL */
>> >    m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>> > @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
>> >
>> >    /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
>> >       than 4 branch instructions in the 16 byte window.  */
>> > -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>> > +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>
> This is special passs to handle limitations of AMD's K7/K8/K10 branch prediction.
> Intel never had similar design, so this flag is pointless.

I noticed that too, but Andi has a better answer to it.

>
> We apparently ought to disable it for K10, at least per Agner's manual.
>> >
>> >    /* X86_TUNE_SCHEDULE */
>> >    m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE | m_GENERIC,
>> > @@ -1947,10 +1947,10 @@ static unsigned int initial_ix86_tune_fe
>> >    m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>> >
>> >    /* X86_TUNE_USE_INCDEC */
>> > -  ~(m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_GENERIC),
>> > +  ~(m_P4_NOCONA | m_ATOM | m_GENERIC),
>
> Skipping inc/dec is to avoid partial flag stall happening on P4 only.
>> >


K8 and K10 partitions the flags into groups. References to flags to
the same group can still cause the stall -- not sure how that can be
handled.

>> >    /* X86_TUNE_PAD_RETURNS */
>> > -  m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC,
>> > +  m_AMD_MULTIPLE | m_GENERIC,
>
> Again this deals specifically with AMD K7/K8/K10 branch prediction.  I am not even
> sure this should be enabled for K10.
>> >
>> >    /* X86_TUNE_PAD_SHORT_FUNCTION: Pad short funtion.  */
>> >    m_ATOM,
>> > @@ -1959,7 +1959,7 @@ static unsigned int initial_ix86_tune_fe
>> >    m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_ATHLON_K8 | m_GENERIC,
>> >
>> >    /* X86_TUNE_AVOID_VECTOR_DECODE */
>> > -  m_CORE2I7 | m_K8 | m_GENERIC64,
>> > +  m_K8 | m_GENERIC64,
>
> This avoid AMD vector decoded instructions, again if it helped it did so by accident.
>> >
>> >    /* X86_TUNE_PROMOTE_HIMODE_IMUL: Modern CPUs have same latency for HImode
>> >       and SImode multiply, but 386 and 486 do HImode multiply faster.  */
>> > @@ -1967,11 +1967,11 @@ static unsigned int initial_ix86_tune_fe
>> >
>> >    /* X86_TUNE_SLOW_IMUL_IMM32_MEM: Imul of 32-bit constant and memory is
>> >       vector path on AMD machines.  */
>> > -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
>> > +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
>> >
>> >    /* X86_TUNE_SLOW_IMUL_IMM8: Imul of 8-bit constant is vector path on AMD
>> >       machines.  */
>> > -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
>> > +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
>
> This is similarly targetd for AMD hardware only. I did not find ismilar limiation
> in the optimization manual.
>> >
>> >    /* X86_TUNE_MOVE_M1_VIA_OR: On pentiums, it is faster to load -1 via OR
>> >       than a MOV.  */
>> > @@ -1988,7 +1988,7 @@ static unsigned int initial_ix86_tune_fe
>> >
>> >    /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
>> >       from FP to FP. */
>> > -  m_CORE2I7 | m_AMDFAM10 | m_GENERIC,
>> > +  m_AMDFAM10 | m_GENERIC,
>
> This is quite specific feature of AMD chips to preffer packed converts over
> scalar. Nothing like this is documented for cores
>> >
>> >    /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
>> >       from integer to FP. */
>> > @@ -1997,7 +1997,7 @@ static unsigned int initial_ix86_tune_fe
>> >    /* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
>> >       with a subsequent conditional jump instruction into a single
>> >       compare-and-branch uop.  */
>> > -  m_BDVER,
>> > +  m_BDVER | m_CORE2I7,
>
> Core iplements fusion similar to what AMD does so I think this just applies here.

yes.


thanks,

David


>> >
>> >    /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
>> >       will impact LEA instruction selection. */
>> > @@ -2052,7 +2052,7 @@ static unsigned int initial_ix86_arch_fe
>> >  };
>> >
>> >  static const unsigned int x86_accumulate_outgoing_args
>> > -  = m_PPRO | m_P4_NOCONA | m_ATOM | m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC;
>> > +  = m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;
>
> Stack engine should make this cheap, just like prologues in moves.
> This definitely needs some validation, the accumulate-outgoing-args
> codegen differs quite a lot. Also this leads to unwind tables bloat.
>> >
>> >  static const unsigned int x86_arch_always_fancy_math_387
>> >    = m_PENT | m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;
>
> Honza

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-13  0:16       ` Xinliang David Li
@ 2012-12-13  0:16         ` Xinliang David Li
  2012-12-13  1:19         ` Jan Hubicka
  1 sibling, 0 replies; 40+ messages in thread
From: Xinliang David Li @ 2012-12-13  0:16 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches, Teresa Johnson

On Wed, Dec 12, 2012 at 4:16 PM, Xinliang David Li <davidxl@google.com> wrote:
> On Wed, Dec 12, 2012 at 10:30 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Concerning 1push per cycle, I think it is same as K7 hardware did, so move
>> prologue should be a win.
>>> > Index: config/i386/i386.c
>>> > ===================================================================
>>> > --- config/i386/i386.c  (revision 194452)
>>> > +++ config/i386/i386.c  (working copy)
>>> > @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
>>> >    COSTS_N_INSNS (8),                   /* cost of FABS instruction.  */
>>> >    COSTS_N_INSNS (8),                   /* cost of FCHS instruction.  */
>>> >    COSTS_N_INSNS (40),                  /* cost of FSQRT instruction.  */
>>> > -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>>> > -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
>>> > +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>>> > +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>>> >                {-1, libcall, false}}}},
>>> >    {{libcall, {{6, loop_1_byte, true},
>>> >                {24, loop, true},
>>> >                {8192, rep_prefix_4_byte, true},
>>> >                {-1, libcall, false}}},
>>> > -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
>>> > +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>>
>> libcall is not faster up to 8KB to rep sequence that is better for regalloc/code
>> cache than fully blowin function call.
>
> Be careful with this. My recollection is that REP sequence is good for


s/good/not good/


David

> any size -- for smaller size, the REP initial set up cost is too high
> (10s of cycles), while for large size copy, it is less efficient
> compared with library version.
>
>
>>> > @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
>>> >    m_PPRO,
>>> >
>>> >    /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
>>> > -  m_CORE2I7 | m_GENERIC,
>>> > +  m_GENERIC | m_CORE2,
>>
>> This disable shifts that store just some flags. Acroding to Agner's manual I7 handle
>> this well.
>>
>
> ok.
>
>> Partial flags stall
>> The Sandy Bridge uses the method of an extra Âľop to join partial registers not only for
>> general purpose registers but also for the flags register, unlike previous processors which
>> used this method only for general purpose registers. This occurs when a write to a part of
>> the flags register is followed by a read from a larger part of the flags register. The partial
>> flags stall of previous processors (See page 75) is therefore replaced by an extra Âľop. The
>> Sandy Bridge also generates an extra Âľop when reading the flags after a rotate instruction.
>>
>> This is cheaper than the 7 cycle delay on Core this flags is trying to avoid.
>
> ok.
>
>>> >
>>> >    /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
>>> >     * on 16-bit immediate moves into memory on Core2 and Corei7.  */
>>> > @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
>>> >    m_K6,
>>> >
>>> >    /* X86_TUNE_USE_CLTD */
>>> > -  ~(m_PENT | m_ATOM | m_K6),
>>> > +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
>
> My change was to enable CLTD for generic. Is your change intended to
> revert that?
>
>>
>> None of CPUs that generic care about are !USE_CLTD now after your change.
>>> > @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
>>> >    m_ATHLON_K8,
>>> >
>>> >    /* X86_TUNE_SSE_TYPELESS_STORES */
>>> > -  m_AMD_MULTIPLE,
>>> > +  m_AMD_MULTIPLE | m_CORE2I7, /*????*/
>>
>> Hmm, I can not seem to find this in manual now, but I believe that stores also do not type,
>> so movaps is preferred over movapd store because it is shorter.  If not, this change should
>> produce a lot of slowdowns.
>>> >
>>> >    /* X86_TUNE_SSE_LOAD0_BY_PXOR */
>>> > -  m_PPRO | m_P4_NOCONA,
>>> > +  m_PPRO | m_P4_NOCONA | m_CORE2I7, /*????*/
>>
>> Agner:
>> A common way of setting a register to zero is XOR EAX,EAX or SUB EBX,EBX. The
>> Core2 and Nehalem processors recognize that certain instructions are independent of the
>> prior value of the register if the source and destination registers are the same.
>>
>> This applies to all of the following instructions: XOR, SUB, PXOR, XORPS, XORPD, and all
>> variants of PSUBxxx and PCMPxxx except PCMPEQQ.
>>> >
>>> >    /* X86_TUNE_MEMORY_MISMATCH_STALL */
>>> >    m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>> > @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
>>> >
>>> >    /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
>>> >       than 4 branch instructions in the 16 byte window.  */
>>> > -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>> > +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>
>> This is special passs to handle limitations of AMD's K7/K8/K10 branch prediction.
>> Intel never had similar design, so this flag is pointless.
>
> I noticed that too, but Andi has a better answer to it.
>
>>
>> We apparently ought to disable it for K10, at least per Agner's manual.
>>> >
>>> >    /* X86_TUNE_SCHEDULE */
>>> >    m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE | m_GENERIC,
>>> > @@ -1947,10 +1947,10 @@ static unsigned int initial_ix86_tune_fe
>>> >    m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>> >
>>> >    /* X86_TUNE_USE_INCDEC */
>>> > -  ~(m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_GENERIC),
>>> > +  ~(m_P4_NOCONA | m_ATOM | m_GENERIC),
>>
>> Skipping inc/dec is to avoid partial flag stall happening on P4 only.
>>> >
>
>
> K8 and K10 partitions the flags into groups. References to flags to
> the same group can still cause the stall -- not sure how that can be
> handled.
>
>>> >    /* X86_TUNE_PAD_RETURNS */
>>> > -  m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC,
>>> > +  m_AMD_MULTIPLE | m_GENERIC,
>>
>> Again this deals specifically with AMD K7/K8/K10 branch prediction.  I am not even
>> sure this should be enabled for K10.
>>> >
>>> >    /* X86_TUNE_PAD_SHORT_FUNCTION: Pad short funtion.  */
>>> >    m_ATOM,
>>> > @@ -1959,7 +1959,7 @@ static unsigned int initial_ix86_tune_fe
>>> >    m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_ATHLON_K8 | m_GENERIC,
>>> >
>>> >    /* X86_TUNE_AVOID_VECTOR_DECODE */
>>> > -  m_CORE2I7 | m_K8 | m_GENERIC64,
>>> > +  m_K8 | m_GENERIC64,
>>
>> This avoid AMD vector decoded instructions, again if it helped it did so by accident.
>>> >
>>> >    /* X86_TUNE_PROMOTE_HIMODE_IMUL: Modern CPUs have same latency for HImode
>>> >       and SImode multiply, but 386 and 486 do HImode multiply faster.  */
>>> > @@ -1967,11 +1967,11 @@ static unsigned int initial_ix86_tune_fe
>>> >
>>> >    /* X86_TUNE_SLOW_IMUL_IMM32_MEM: Imul of 32-bit constant and memory is
>>> >       vector path on AMD machines.  */
>>> > -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
>>> > +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
>>> >
>>> >    /* X86_TUNE_SLOW_IMUL_IMM8: Imul of 8-bit constant is vector path on AMD
>>> >       machines.  */
>>> > -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
>>> > +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
>>
>> This is similarly targetd for AMD hardware only. I did not find ismilar limiation
>> in the optimization manual.
>>> >
>>> >    /* X86_TUNE_MOVE_M1_VIA_OR: On pentiums, it is faster to load -1 via OR
>>> >       than a MOV.  */
>>> > @@ -1988,7 +1988,7 @@ static unsigned int initial_ix86_tune_fe
>>> >
>>> >    /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
>>> >       from FP to FP. */
>>> > -  m_CORE2I7 | m_AMDFAM10 | m_GENERIC,
>>> > +  m_AMDFAM10 | m_GENERIC,
>>
>> This is quite specific feature of AMD chips to preffer packed converts over
>> scalar. Nothing like this is documented for cores
>>> >
>>> >    /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
>>> >       from integer to FP. */
>>> > @@ -1997,7 +1997,7 @@ static unsigned int initial_ix86_tune_fe
>>> >    /* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
>>> >       with a subsequent conditional jump instruction into a single
>>> >       compare-and-branch uop.  */
>>> > -  m_BDVER,
>>> > +  m_BDVER | m_CORE2I7,
>>
>> Core iplements fusion similar to what AMD does so I think this just applies here.
>
> yes.
>
>
> thanks,
>
> David
>
>
>>> >
>>> >    /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
>>> >       will impact LEA instruction selection. */
>>> > @@ -2052,7 +2052,7 @@ static unsigned int initial_ix86_arch_fe
>>> >  };
>>> >
>>> >  static const unsigned int x86_accumulate_outgoing_args
>>> > -  = m_PPRO | m_P4_NOCONA | m_ATOM | m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC;
>>> > +  = m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;
>>
>> Stack engine should make this cheap, just like prologues in moves.
>> This definitely needs some validation, the accumulate-outgoing-args
>> codegen differs quite a lot. Also this leads to unwind tables bloat.
>>> >
>>> >  static const unsigned int x86_arch_always_fancy_math_387
>>> >    = m_PENT | m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;
>>
>> Honza

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-13  0:16       ` Xinliang David Li
  2012-12-13  0:16         ` Xinliang David Li
@ 2012-12-13  1:19         ` Jan Hubicka
  2012-12-13  6:09           ` Xinliang David Li
  1 sibling, 1 reply; 40+ messages in thread
From: Jan Hubicka @ 2012-12-13  1:19 UTC (permalink / raw)
  To: Xinliang David Li; +Cc: Jan Hubicka, GCC Patches, Teresa Johnson

> > libcall is not faster up to 8KB to rep sequence that is better for regalloc/code
> > cache than fully blowin function call.
> 
> Be careful with this. My recollection is that REP sequence is good for
> any size -- for smaller size, the REP initial set up cost is too high
> (10s of cycles), while for large size copy, it is less efficient
> compared with library version.

Well this is based on the data from the memtest script.  
Core has good REP implementation - it is a win from rather small blocks (16
bytes if I recall) and it does not need alignment.
Library version starts to be interesting with caching hints, but I think till 80KB
it is still not a win for my setup (glibc-2.15)
> >> >
> >> >    /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
> >> >     * on 16-bit immediate moves into memory on Core2 and Corei7.  */
> >> > @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
> >> >    m_K6,
> >> >
> >> >    /* X86_TUNE_USE_CLTD */
> >> > -  ~(m_PENT | m_ATOM | m_K6),
> >> > +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
> 
> My change was to enable CLTD for generic. Is your change intended to
> revert that?

No, it is merge conflict, sorry.  I will update it in my tree.
> > Skipping inc/dec is to avoid partial flag stall happening on P4 only.
> >> >
> 
> 
> K8 and K10 partitions the flags into groups. References to flags to
> the same group can still cause the stall -- not sure how that can be
> handled.

I  belive the stalls happends only in quite special cases where compare instruction
combines flags from multiple instructions.  GCC don't generate this type of code, so
we should be safe.

Honza

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-13  1:19         ` Jan Hubicka
@ 2012-12-13  6:09           ` Xinliang David Li
  2012-12-13  6:21             ` Jakub Jelinek
  0 siblings, 1 reply; 40+ messages in thread
From: Xinliang David Li @ 2012-12-13  6:09 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches, Teresa Johnson

On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > libcall is not faster up to 8KB to rep sequence that is better for regalloc/code
>> > cache than fully blowin function call.
>>
>> Be careful with this. My recollection is that REP sequence is good for
>> any size -- for smaller size, the REP initial set up cost is too high
>> (10s of cycles), while for large size copy, it is less efficient
>> compared with library version.
>
> Well this is based on the data from the memtest script.
> Core has good REP implementation - it is a win from rather small blocks (16
> bytes if I recall) and it does not need alignment.
> Library version starts to be interesting with caching hints, but I think till 80KB
> it is still not a win for my setup (glibc-2.15)

A simple test shows that -mstringop-strategy=libcall always beats
-mstringop-strategy=rep_8byte (on core2 and corei7) except for size
smaller than 8 where the rep_8byte strategy simply bypasses REP movs.
Can you share your memtest ?

thanks,

David

>> >> >
>> >> >    /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
>> >> >     * on 16-bit immediate moves into memory on Core2 and Corei7.  */
>> >> > @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
>> >> >    m_K6,
>> >> >
>> >> >    /* X86_TUNE_USE_CLTD */
>> >> > -  ~(m_PENT | m_ATOM | m_K6),
>> >> > +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
>>
>> My change was to enable CLTD for generic. Is your change intended to
>> revert that?
>
> No, it is merge conflict, sorry.  I will update it in my tree.
>> > Skipping inc/dec is to avoid partial flag stall happening on P4 only.
>> >> >
>>
>>
>> K8 and K10 partitions the flags into groups. References to flags to
>> the same group can still cause the stall -- not sure how that can be
>> handled.
>
> I  belive the stalls happends only in quite special cases where compare instruction
> combines flags from multiple instructions.  GCC don't generate this type of code, so
> we should be safe.
>
> Honza

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-13  6:09           ` Xinliang David Li
@ 2012-12-13  6:21             ` Jakub Jelinek
  2012-12-13  7:05               ` Xinliang David Li
                                 ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Jakub Jelinek @ 2012-12-13  6:21 UTC (permalink / raw)
  To: Xinliang David Li; +Cc: Jan Hubicka, GCC Patches, Teresa Johnson

On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >> > libcall is not faster up to 8KB to rep sequence that is better for regalloc/code
> >> > cache than fully blowin function call.
> >>
> >> Be careful with this. My recollection is that REP sequence is good for
> >> any size -- for smaller size, the REP initial set up cost is too high
> >> (10s of cycles), while for large size copy, it is less efficient
> >> compared with library version.
> >
> > Well this is based on the data from the memtest script.
> > Core has good REP implementation - it is a win from rather small blocks (16
> > bytes if I recall) and it does not need alignment.
> > Library version starts to be interesting with caching hints, but I think till 80KB
> > it is still not a win for my setup (glibc-2.15)
> 
> A simple test shows that -mstringop-strategy=libcall always beats
> -mstringop-strategy=rep_8byte (on core2 and corei7) except for size
> smaller than 8 where the rep_8byte strategy simply bypasses REP movs.
> Can you share your memtest ?

I can't believe that say 16 byte or 32 byte memcpy can be ever faster using a
libcall.  The PLT call overhead is simply too high.

	Jakub

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-13  6:21             ` Jakub Jelinek
@ 2012-12-13  7:05               ` Xinliang David Li
  2012-12-13 19:28                 ` Jan Hubicka
  2012-12-13 10:22               ` Richard Biener
  2012-12-13 19:43               ` H.J. Lu
  2 siblings, 1 reply; 40+ messages in thread
From: Xinliang David Li @ 2012-12-13  7:05 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Jan Hubicka, GCC Patches, Teresa Johnson

Try the following one. 1) -minline-all-stringops
-mstringop-strategy=rep_8byte -O2 vs 1) -mstringop_strategy=libcall
-O2.

David


#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#ifndef LEN
#define LEN 16
#endif

void copy(char* s1, char* s2,int len) __attribute__((noinline));
void copy(char* s1, char* s2,int len)
{
   memcpy(s2,s1,len);
}


int main() {

  char* s1 = (char*) malloc(LEN  +10);
  char* s2 = (char*) malloc(LEN  +10);
  int i = 0;
  for (i =  0; i < 1000000000; i++)
  {
    copy(s1+1,s2+3,LEN);
  }
}

On Wed, Dec 12, 2012 at 10:21 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
>> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> >> > libcall is not faster up to 8KB to rep sequence that is better for regalloc/code
>> >> > cache than fully blowin function call.
>> >>
>> >> Be careful with this. My recollection is that REP sequence is good for
>> >> any size -- for smaller size, the REP initial set up cost is too high
>> >> (10s of cycles), while for large size copy, it is less efficient
>> >> compared with library version.
>> >
>> > Well this is based on the data from the memtest script.
>> > Core has good REP implementation - it is a win from rather small blocks (16
>> > bytes if I recall) and it does not need alignment.
>> > Library version starts to be interesting with caching hints, but I think till 80KB
>> > it is still not a win for my setup (glibc-2.15)
>>
>> A simple test shows that -mstringop-strategy=libcall always beats
>> -mstringop-strategy=rep_8byte (on core2 and corei7) except for size
>> smaller than 8 where the rep_8byte strategy simply bypasses REP movs.
>> Can you share your memtest ?
>
> I can't believe that say 16 byte or 32 byte memcpy can be ever faster using a
> libcall.  The PLT call overhead is simply too high.
>
>         Jakub

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-13  7:05               ` Xinliang David Li
@ 2012-12-13 19:28                 ` Jan Hubicka
  0 siblings, 0 replies; 40+ messages in thread
From: Jan Hubicka @ 2012-12-13 19:28 UTC (permalink / raw)
  To: Xinliang David Li; +Cc: Jakub Jelinek, Jan Hubicka, GCC Patches, Teresa Johnson

> Try the following one. 1) -minline-all-stringops
> -mstringop-strategy=rep_8byte -O2 vs 1) -mstringop_strategy=libcall
> -O2.
> 
> David
> 
> 
> #include <string.h>
> #include <stdio.h>
> #include <stdlib.h>
> #ifndef LEN
> #define LEN 16
> #endif
> 
> void copy(char* s1, char* s2,int len) __attribute__((noinline));
> void copy(char* s1, char* s2,int len)
> {
>    memcpy(s2,s1,len);
> }

I guess the catch here is that you force the copy to be noinline and thus you
eliminate the benefits of inlined sequence.  With inline stringop one saves
regalloc and often can get rid of the alignment tests.

This is script I use to tune the tables.

Honza

test()
{
rm -f a.out
cat <<END | $1 -x c -O3 $3 -DAVG_SIZE=$2 $STRINGOP -DMEMORY_COPIES=$memsize -
#define BUFFER_SIZE (16*1024*1024 + AVG_SIZE*2)
/*#define MEMORY_COPIES (1024*1024*64*(long long)10)*/
$type t[BUFFER_SIZE];
main()
{
  unsigned int i;
  for (i=0;i<((long long)MEMORY_COPIES + AVG_SIZE * 2 - 1)/AVG_SIZE*2;i++)
#ifdef test_memset
    __builtin_memset (t+(i*1024*1024+i*1)%(BUFFER_SIZE - AVG_SIZE*2), i, (AVG_SIZE + i) % (AVG_SIZE * 2 + 0));
#else
    __builtin_memcpy (t+(i*1024*1024+i*1)%(BUFFER_SIZE - AVG_SIZE*2), t+((i+1)*1024*1024*4+i*1)%(BUFFER_SIZE - AVG_SIZE *2), (AVG_SIZE + i) % (AVG_SIZE * 2 + 0));
#endif
  return 0;
}
END
TIME=`/usr/bin/time -f "%E" ./a.out 2>&1`
echo -n " "$TIME
echo $TIME $4 >>/tmp/accum
}

testrow()
{
echo -n "" >/tmp/accum
printf "block size %7i" $3
test "$2" "$3" "-mstringop-strategy=libcall" libcall
test "$2" "$3" "-mstringop-strategy=rep_byte -malign-stringops" rep1
test "$2" "$3" "-mstringop-strategy=rep_byte -mno-align-stringops" rep1noalign
test "$2" "$3" "-mstringop-strategy=rep_4byte -malign-stringops" rep4
test "$2" "$3" "-mstringop-strategy=rep_4byte -mno-align-stringops" rep4noalign
if [ "$mode" == 64 ]
then
test "$2" "$3" "-mstringop-strategy=rep_8byte -malign-stringops" rep8
test "$2" "$3" "-mstringop-strategy=rep_8byte -mno-align-stringops" rep8noalign
fi
test "$2" "$3" "-mstringop-strategy=loop -malign-stringops"  loop
test "$2" "$3" "-mstringop-strategy=loop -mno-align-stringops"  loopnoalign
test "$2" "$3" "-mstringop-strategy=unrolled_loop -malign-stringops" unrl
test "$2" "$3" "-mstringop-strategy=unrolled_loop -mno-align-stringops" unrlnoalign
test "$2" "$3" "-mstringop-strategy=sse_loop -malign-stringops" sse
test "$2" "$3" "-mstringop-strategy=sse_loop -mno-align-stringops -msse2" ssenoalign
test "$2" "$3" "-mstringop-strategy=byte_loop" byte
best=`cat /tmp/accum | sort | head -1`
test "$2" "$3" " -fprofile-generate" >/dev/null 2>&1
test "$2" "$3" " -fprofile-use"
test "$2" "$3" " -minline-stringops-dynamically"
echo " best: $best"
}

test_all_sizes()
{
if [ "$mode" == 64 ]
then
echo "                   libcall   rep1   noalg    rep4   noalg    rep8   noalg    loop   noalg    unrl   noalg    sse   noalg    byte profiled dynamic"
else
echo "                   libcall   rep1   noalg    rep4   noalg    loop   noalg    unrl   noalg    sse    noalg    byte profiled dynamic"
fi
#for size in 1 2 3 4 6 8 10 12 14 16 24 32 48 64 128 256 512 1024 4096 8192 81920 819200 8192000
#for size in 8192000 819200 81920 8192 4096 2048 1024 512 256 128 64 48 32 24 16 14 12 10 8 6 5 4 3 2 1
for size in 8192000 819200 81920 20480 8192 4096 2048 1024 512 256 128 64 48 32 24 16 14 12 10 8 6 4 1
#for size in 128 256 1024 4096 8192 81920 819200
do
testrow "$1" "$2" $size
done
}

mode=$1
shift
export memsize=$1
shift
cmdline=$*
if [ "$mode" != 32 ]
then
  if [ "$mode" != 64 ]
  then
    echo "Usage:"
    echo "test_stringop mode size cmdline"
    echo "mode is either 32 or 64"
    echo "size is amount of memory copied in each test.  Should be chosed small enough so runtime is less than minute for each test and sorting works"
    echo "Example: test_stringop 32 640000000 ./xgcc -B ./ -march=pentium3"
    exit
  fi
fi
echo "memcpy mode:$mode size:$memsize"
export STRINGOP=""
type=char
test_all_sizes $mode "$cmdline -m$mode"
echo "Aligned"
type=long
test_all_sizes $mode "$cmdline -m$mode"
echo "memset"
type=char
export STRINGOP="-Dtest_memset=1"
test_all_sizes $mode "$cmdline -m$mode"
echo "Aligned"
type=long
test_all_sizes $mode "$cmdline -m$mode"

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-13  6:21             ` Jakub Jelinek
  2012-12-13  7:05               ` Xinliang David Li
@ 2012-12-13 10:22               ` Richard Biener
  2012-12-13 19:43               ` H.J. Lu
  2 siblings, 0 replies; 40+ messages in thread
From: Richard Biener @ 2012-12-13 10:22 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Xinliang David Li, Jan Hubicka, GCC Patches, Teresa Johnson

On Thu, Dec 13, 2012 at 7:21 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
>> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> >> > libcall is not faster up to 8KB to rep sequence that is better for regalloc/code
>> >> > cache than fully blowin function call.
>> >>
>> >> Be careful with this. My recollection is that REP sequence is good for
>> >> any size -- for smaller size, the REP initial set up cost is too high
>> >> (10s of cycles), while for large size copy, it is less efficient
>> >> compared with library version.
>> >
>> > Well this is based on the data from the memtest script.
>> > Core has good REP implementation - it is a win from rather small blocks (16
>> > bytes if I recall) and it does not need alignment.
>> > Library version starts to be interesting with caching hints, but I think till 80KB
>> > it is still not a win for my setup (glibc-2.15)
>>
>> A simple test shows that -mstringop-strategy=libcall always beats
>> -mstringop-strategy=rep_8byte (on core2 and corei7) except for size
>> smaller than 8 where the rep_8byte strategy simply bypasses REP movs.
>> Can you share your memtest ?
>
> I can't believe that say 16 byte or 32 byte memcpy can be ever faster using a
> libcall.  The PLT call overhead is simply too high.

I believe the PLT call overhead may be effectively zero if the benchmarking
is just a loop around a memcpy.  Thus for measuring the PLT overhead
I call the benchmark broken ;)

Richard.

>         Jakub

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-13  6:21             ` Jakub Jelinek
  2012-12-13  7:05               ` Xinliang David Li
  2012-12-13 10:22               ` Richard Biener
@ 2012-12-13 19:43               ` H.J. Lu
  2012-12-13 20:26                 ` Jan Hubicka
  2 siblings, 1 reply; 40+ messages in thread
From: H.J. Lu @ 2012-12-13 19:43 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Xinliang David Li, Jan Hubicka, GCC Patches, Teresa Johnson

On Wed, Dec 12, 2012 at 10:21 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
>> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> >> > libcall is not faster up to 8KB to rep sequence that is better for regalloc/code
>> >> > cache than fully blowin function call.
>> >>
>> >> Be careful with this. My recollection is that REP sequence is good for
>> >> any size -- for smaller size, the REP initial set up cost is too high
>> >> (10s of cycles), while for large size copy, it is less efficient
>> >> compared with library version.
>> >
>> > Well this is based on the data from the memtest script.
>> > Core has good REP implementation - it is a win from rather small blocks (16
>> > bytes if I recall) and it does not need alignment.
>> > Library version starts to be interesting with caching hints, but I think till 80KB
>> > it is still not a win for my setup (glibc-2.15)
>>
>> A simple test shows that -mstringop-strategy=libcall always beats
>> -mstringop-strategy=rep_8byte (on core2 and corei7) except for size
>> smaller than 8 where the rep_8byte strategy simply bypasses REP movs.
>> Can you share your memtest ?
>
> I can't believe that say 16 byte or 32 byte memcpy can be ever faster using a
> libcall.  The PLT call overhead is simply too high.
>

The x86 string/memory functions in the current glibc are
extremely fast and tuned for Core 2/Core i7.  GCC is having
a very hard time to beat them with inlining:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052

-- 
H.J.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-13 19:43               ` H.J. Lu
@ 2012-12-13 20:26                 ` Jan Hubicka
  2012-12-13 20:28                   ` H.J. Lu
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Hubicka @ 2012-12-13 20:26 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Jakub Jelinek, Xinliang David Li, Jan Hubicka, GCC Patches,
	Teresa Johnson

> On Wed, Dec 12, 2012 at 10:21 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> > On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
> >> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >> >> > libcall is not faster up to 8KB to rep sequence that is better for regalloc/code
> >> >> > cache than fully blowin function call.
> >> >>
> >> >> Be careful with this. My recollection is that REP sequence is good for
> >> >> any size -- for smaller size, the REP initial set up cost is too high
> >> >> (10s of cycles), while for large size copy, it is less efficient
> >> >> compared with library version.
> >> >
> >> > Well this is based on the data from the memtest script.
> >> > Core has good REP implementation - it is a win from rather small blocks (16
> >> > bytes if I recall) and it does not need alignment.
> >> > Library version starts to be interesting with caching hints, but I think till 80KB
> >> > it is still not a win for my setup (glibc-2.15)
> >>
> >> A simple test shows that -mstringop-strategy=libcall always beats
> >> -mstringop-strategy=rep_8byte (on core2 and corei7) except for size
> >> smaller than 8 where the rep_8byte strategy simply bypasses REP movs.
> >> Can you share your memtest ?
> >
> > I can't believe that say 16 byte or 32 byte memcpy can be ever faster using a
> > libcall.  The PLT call overhead is simply too high.
> >
> 
> The x86 string/memory functions in the current glibc are
> extremely fast and tuned for Core 2/Core i7.  GCC is having
> a very hard time to beat them with inlining:
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052

Here we speak about memcpy/memset only.  I never got around to modernize
strlen and friends, unfortunately...

memcmp and friends are different beats.  They realy need some TLC...

Honza

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-13 20:26                 ` Jan Hubicka
@ 2012-12-13 20:28                   ` H.J. Lu
  2012-12-13 20:40                     ` Jan Hubicka
  0 siblings, 1 reply; 40+ messages in thread
From: H.J. Lu @ 2012-12-13 20:28 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Jakub Jelinek, Xinliang David Li, GCC Patches, Teresa Johnson

On Thu, Dec 13, 2012 at 12:26 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> On Wed, Dec 12, 2012 at 10:21 PM, Jakub Jelinek <jakub@redhat.com> wrote:
>> > On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
>> >> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> >> >> > libcall is not faster up to 8KB to rep sequence that is better for regalloc/code
>> >> >> > cache than fully blowin function call.
>> >> >>
>> >> >> Be careful with this. My recollection is that REP sequence is good for
>> >> >> any size -- for smaller size, the REP initial set up cost is too high
>> >> >> (10s of cycles), while for large size copy, it is less efficient
>> >> >> compared with library version.
>> >> >
>> >> > Well this is based on the data from the memtest script.
>> >> > Core has good REP implementation - it is a win from rather small blocks (16
>> >> > bytes if I recall) and it does not need alignment.
>> >> > Library version starts to be interesting with caching hints, but I think till 80KB
>> >> > it is still not a win for my setup (glibc-2.15)
>> >>
>> >> A simple test shows that -mstringop-strategy=libcall always beats
>> >> -mstringop-strategy=rep_8byte (on core2 and corei7) except for size
>> >> smaller than 8 where the rep_8byte strategy simply bypasses REP movs.
>> >> Can you share your memtest ?
>> >
>> > I can't believe that say 16 byte or 32 byte memcpy can be ever faster using a
>> > libcall.  The PLT call overhead is simply too high.
>> >
>>
>> The x86 string/memory functions in the current glibc are
>> extremely fast and tuned for Core 2/Core i7.  GCC is having
>> a very hard time to beat them with inlining:
>>
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
>
> Here we speak about memcpy/memset only.  I never got around to modernize
> strlen and friends, unfortunately...
>
> memcmp and friends are different beats.  They realy need some TLC...

memcpy and memset in glibc are also extremely fast.


-- 
H.J.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-13 20:28                   ` H.J. Lu
@ 2012-12-13 20:40                     ` Jan Hubicka
  2012-12-13 21:02                       ` H.J. Lu
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Hubicka @ 2012-12-13 20:40 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Jan Hubicka, Jakub Jelinek, Xinliang David Li, GCC Patches,
	Teresa Johnson

> > Here we speak about memcpy/memset only.  I never got around to modernize
> > strlen and friends, unfortunately...
> >
> > memcmp and friends are different beats.  They realy need some TLC...
> 
> memcpy and memset in glibc are also extremely fast.

The default strategy now is to inline only when the block is known to be small
(either constant or via profile feedback, we do not really use the info on
upper bound of size of the copied object that would be useful but not readilly
available at expansion time).

You can try the test_stringop script I attached and send me the results.  For
me libc starts to be win only for rather large blocks (i.e. >8KB)

Honza

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-13 20:40                     ` Jan Hubicka
@ 2012-12-13 21:02                       ` H.J. Lu
  2012-12-13 21:35                         ` Jan Hubicka
  2012-12-20 12:13                         ` Melik-adamyan, Areg
  0 siblings, 2 replies; 40+ messages in thread
From: H.J. Lu @ 2012-12-13 21:02 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Jakub Jelinek, Xinliang David Li, GCC Patches, Teresa Johnson,
	Melik-adamyan, Areg

On Thu, Dec 13, 2012 at 12:40 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > Here we speak about memcpy/memset only.  I never got around to modernize
>> > strlen and friends, unfortunately...
>> >
>> > memcmp and friends are different beats.  They realy need some TLC...
>>
>> memcpy and memset in glibc are also extremely fast.
>
> The default strategy now is to inline only when the block is known to be small
> (either constant or via profile feedback, we do not really use the info on
> upper bound of size of the copied object that would be useful but not readilly
> available at expansion time).
>
> You can try the test_stringop script I attached and send me the results.  For

Areg, can you give it a try?  Thanks.

> me libc starts to be win only for rather large blocks (i.e. >8KB)
>

Which glibc are you using?

-- 
H.J.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-13 21:02                       ` H.J. Lu
@ 2012-12-13 21:35                         ` Jan Hubicka
  2012-12-20 12:13                         ` Melik-adamyan, Areg
  1 sibling, 0 replies; 40+ messages in thread
From: Jan Hubicka @ 2012-12-13 21:35 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Jan Hubicka, Jakub Jelinek, Xinliang David Li, GCC Patches,
	Teresa Johnson, Melik-adamyan, Areg

> > me libc starts to be win only for rather large blocks (i.e. >8KB)
> >
> 
> Which glibc are you using?

2.15 as it comes with opensuse 12.2

Honza
> 
> -- 
> H.J.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-13 21:02                       ` H.J. Lu
  2012-12-13 21:35                         ` Jan Hubicka
@ 2012-12-20 12:13                         ` Melik-adamyan, Areg
  2012-12-20 14:08                           ` H.J. Lu
  1 sibling, 1 reply; 40+ messages in thread
From: Melik-adamyan, Areg @ 2012-12-20 12:13 UTC (permalink / raw)
  To: H.J. Lu, Jan Hubicka
  Cc: Jakub Jelinek, Xinliang David Li, GCC Patches, Teresa Johnson

We checked,  no significant gains or losses.

-----Original Message-----
From: H.J. Lu [mailto:hjl.tools@gmail.com] 
Sent: Friday, December 14, 2012 1:03 AM
To: Jan Hubicka
Cc: Jakub Jelinek; Xinliang David Li; GCC Patches; Teresa Johnson; Melik-adamyan, Areg
Subject: Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

On Thu, Dec 13, 2012 at 12:40 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > Here we speak about memcpy/memset only.  I never got around to 
>> > modernize strlen and friends, unfortunately...
>> >
>> > memcmp and friends are different beats.  They realy need some TLC...
>>
>> memcpy and memset in glibc are also extremely fast.
>
> The default strategy now is to inline only when the block is known to 
> be small (either constant or via profile feedback, we do not really 
> use the info on upper bound of size of the copied object that would be 
> useful but not readilly available at expansion time).
>
> You can try the test_stringop script I attached and send me the 
> results.  For

Areg, can you give it a try?  Thanks.

> me libc starts to be win only for rather large blocks (i.e. >8KB)
>

Which glibc are you using?

--
H.J.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-20 12:13                         ` Melik-adamyan, Areg
@ 2012-12-20 14:08                           ` H.J. Lu
  2012-12-20 15:05                             ` Jan Hubicka
  0 siblings, 1 reply; 40+ messages in thread
From: H.J. Lu @ 2012-12-20 14:08 UTC (permalink / raw)
  To: Melik-adamyan, Areg
  Cc: Jan Hubicka, Jakub Jelinek, Xinliang David Li, GCC Patches,
	Teresa Johnson

On Thu, Dec 20, 2012 at 4:13 AM, Melik-adamyan, Areg
<areg.melik-adamyan@intel.com> wrote:
> We checked,  no significant gains or losses.
>
> -----Original Message-----
> From: H.J. Lu [mailto:hjl.tools@gmail.com]
> Sent: Friday, December 14, 2012 1:03 AM
> To: Jan Hubicka
> Cc: Jakub Jelinek; Xinliang David Li; GCC Patches; Teresa Johnson; Melik-adamyan, Areg
> Subject: Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
>
> On Thu, Dec 13, 2012 at 12:40 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> > Here we speak about memcpy/memset only.  I never got around to
>>> > modernize strlen and friends, unfortunately...
>>> >
>>> > memcmp and friends are different beats.  They realy need some TLC...
>>>
>>> memcpy and memset in glibc are also extremely fast.
>>
>> The default strategy now is to inline only when the block is known to
>> be small (either constant or via profile feedback, we do not really
>> use the info on upper bound of size of the copied object that would be
>> useful but not readilly available at expansion time).
>>
>> You can try the test_stringop script I attached and send me the
>> results.  For
>
> Areg, can you give it a try?  Thanks.
>

Hi Areg,

Did you mean inlined memcpy/memset are as fast as
the ones in libc.so on both ia32 and Intel64?

Please keep in mind that memcpy/memset in libc.a
may not be optimized.  You must not use -static for
linking.

-- 
H.J.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-20 14:08                           ` H.J. Lu
@ 2012-12-20 15:05                             ` Jan Hubicka
  2012-12-20 15:07                               ` Jan Hubicka
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Hubicka @ 2012-12-20 15:05 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Melik-adamyan, Areg, Jan Hubicka, Jakub Jelinek,
	Xinliang David Li, GCC Patches, Teresa Johnson

> Hi Areg,
> 
> Did you mean inlined memcpy/memset are as fast as
> the ones in libc.so on both ia32 and Intel64?

I would be interested in output of the stringop script.
> 
> Please keep in mind that memcpy/memset in libc.a
> may not be optimized.  You must not use -static for
> linking.

In my setup I use dynamic linking...
(this is quite anoying property in general - people tend to use --static for
performance critical binaries to save expenses of PIC.  It would be really cool
to have way to call proper stringops based on -march switch....)

Honza
> 
> -- 
> H.J.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-20 15:05                             ` Jan Hubicka
@ 2012-12-20 15:07                               ` Jan Hubicka
  2012-12-20 15:22                                 ` H.J. Lu
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Hubicka @ 2012-12-20 15:07 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: H.J. Lu, Melik-adamyan, Areg, Jakub Jelinek, Xinliang David Li,
	GCC Patches, Teresa Johnson

> > Hi Areg,
> > 
> > Did you mean inlined memcpy/memset are as fast as
> > the ones in libc.so on both ia32 and Intel64?
> 
> I would be interested in output of the stringop script.

Also as far as I can remember, none of spec2k6 benchmarks is really stringop
bound.  On Spec2k GCC was quite bound by memset (within alloc_rtx and bitmap
oprations) but mostly by collecting page faults there.  Inlining that one made
quite a lot of difference on K8 hardware, but not on later chips.

Honza

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-20 15:07                               ` Jan Hubicka
@ 2012-12-20 15:22                                 ` H.J. Lu
  0 siblings, 0 replies; 40+ messages in thread
From: H.J. Lu @ 2012-12-20 15:22 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Melik-adamyan, Areg, Jakub Jelinek, Xinliang David Li,
	GCC Patches, Teresa Johnson

On Thu, Dec 20, 2012 at 7:06 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > Hi Areg,
>> >
>> > Did you mean inlined memcpy/memset are as fast as
>> > the ones in libc.so on both ia32 and Intel64?
>>
>> I would be interested in output of the stringop script.
>
> Also as far as I can remember, none of spec2k6 benchmarks is really stringop
> bound.  On Spec2k GCC was quite bound by memset (within alloc_rtx and bitmap
> oprations) but mostly by collecting page faults there.  Inlining that one made
> quite a lot of difference on K8 hardware, but not on later chips.
>

There is a GCC performance regression bug on EEMBC.  It turns out
that -static was used for linking and optimized memory functions weren't
used.  Remove -static fixed the performance regression.

-- 
H.J.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
  2012-12-12 16:37 ` Jan Hubicka
  2012-12-12 17:25   ` Xinliang David Li
  2012-12-12 17:34   ` Xinliang David Li
@ 2012-12-21  8:28   ` Zamyatin, Igor
  2 siblings, 0 replies; 40+ messages in thread
From: Zamyatin, Igor @ 2012-12-21  8:28 UTC (permalink / raw)
  To: Jan Hubicka, Xinliang David Li; +Cc: GCC Patches

So far we see a regression on one of eembc_1_1 test because of following change:

   /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
      from FP to FP. */
-  m_CORE2I7 | m_AMDFAM10 | m_GENERIC,
+  m_AMDFAM10 | m_GENERIC,

Probably we should keep it as is while there is nothing about it in docs indeed...


Thanks,
Igor


-----Original Message-----
From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-owner@gcc.gnu.org] On Behalf Of Jan Hubicka
Sent: Wednesday, December 12, 2012 8:37 PM
To: Xinliang David Li
Cc: GCC Patches
Subject: Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a 
> SP adjustment instead of a sequence of pushes/pops. The preference to 
> the MOVs are good for old CPU micro-architectures (before pentium-4, 
> K10), because it breaks the data dependency.  In modern 
> micro-architecture, push/pop is implemented using a mechanism called 
> stack engine. The data dependency is removed by the hardware, and 
> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
> smaller. There is no longer the need to avoid using them.   This is
> also what ICC does.
> 
> The following patch fixed the problem. It passes bootstrap/regression 
> test. OK to install?
> 
> thanks,
> 
> David
> 
> Index: config/i386/i386.c
> ===================================================================
> --- config/i386/i386.c (revision 194324)
> +++ config/i386/i386.c (working copy)
> @@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
>    m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> 
>    /* X86_TUNE_PROLOGUE_USING_MOVE */
> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
> +  m_PPRO | m_ATHLON_K8,
> 
>    /* X86_TUNE_EPILOGUE_USING_MOVE */
> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
> +  m_PPRO | m_ATHLON_K8,

Push/pops wrt moves was always difficult to tune on old CPUs, so I am happy it is gone from generic (in fact I had similar patch pending).
Are you sure about Atom having stack engine, too?

Related thing is accumulate_outgoing_args. Igor is testing it on Core and I will give it a try on K10.

Honza

I am attaching the changes for core costs I made if someone is interested in testing them.  If we can declare P4/PPRo and maybe K8 chips obsolette for generic, there is room for improvement in generic, too. Like using inc/dec again.

Honza

Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 194452)
+++ config/i386/i386.c	(working copy)
@@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
   COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
   COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
   COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
-  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
-   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
+  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
+   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
               {-1, libcall, false}}}},
   {{libcall, {{6, loop_1_byte, true},
               {24, loop, true},
               {8192, rep_prefix_4_byte, true},
               {-1, libcall, false}}},
-   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
+   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
               {-1, libcall, false}}}},
   1,					/* scalar_stmt_cost.  */
   1,					/* scalar load_cost.  */
@@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
   m_PPRO,
 
   /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
-  m_CORE2I7 | m_GENERIC,
+  m_GENERIC | m_CORE2,
 
   /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
    * on 16-bit immediate moves into memory on Core2 and Corei7.  */ @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
   m_K6,
 
   /* X86_TUNE_USE_CLTD */
-  ~(m_PENT | m_ATOM | m_K6),
+  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
 
   /* X86_TUNE_USE_XCHGB: Use xchgb %rh,%rl instead of rolw/rorw $8,rx.  */
   m_PENT4,
@@ -1901,7 +1901,7 @@ static unsigned int initial_ix86_tune_fe
   m_COREI7 | m_BDVER,
 
   /* X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL */
-  m_BDVER ,
+  m_BDVER,
 
   /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies
      are resolved on SSE register parts instead of whole registers, so we may @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
   m_ATHLON_K8,
 
   /* X86_TUNE_SSE_TYPELESS_STORES */
-  m_AMD_MULTIPLE,
+  m_AMD_MULTIPLE | m_CORE2I7, /*????*/
 
   /* X86_TUNE_SSE_LOAD0_BY_PXOR */
-  m_PPRO | m_P4_NOCONA,
+  m_PPRO | m_P4_NOCONA | m_CORE2I7, /*????*/
 
   /* X86_TUNE_MEMORY_MISMATCH_STALL */
   m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC, @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
 
   /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
      than 4 branch instructions in the 16 byte window.  */
-  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
+  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
 
   /* X86_TUNE_SCHEDULE */
   m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE | m_GENERIC, @@ -1947,10 +1947,10 @@ static unsigned int initial_ix86_tune_fe
   m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
 
   /* X86_TUNE_USE_INCDEC */
-  ~(m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_GENERIC),
+  ~(m_P4_NOCONA | m_ATOM | m_GENERIC),
 
   /* X86_TUNE_PAD_RETURNS */
-  m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC,
+  m_AMD_MULTIPLE | m_GENERIC,
 
   /* X86_TUNE_PAD_SHORT_FUNCTION: Pad short funtion.  */
   m_ATOM,
@@ -1959,7 +1959,7 @@ static unsigned int initial_ix86_tune_fe
   m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_ATHLON_K8 | m_GENERIC,
 
   /* X86_TUNE_AVOID_VECTOR_DECODE */
-  m_CORE2I7 | m_K8 | m_GENERIC64,
+  m_K8 | m_GENERIC64,
 
   /* X86_TUNE_PROMOTE_HIMODE_IMUL: Modern CPUs have same latency for HImode
      and SImode multiply, but 386 and 486 do HImode multiply faster.  */ @@ -1967,11 +1967,11 @@ static unsigned int initial_ix86_tune_fe
 
   /* X86_TUNE_SLOW_IMUL_IMM32_MEM: Imul of 32-bit constant and memory is
      vector path on AMD machines.  */
-  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
+  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
 
   /* X86_TUNE_SLOW_IMUL_IMM8: Imul of 8-bit constant is vector path on AMD
      machines.  */
-  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
+  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
 
   /* X86_TUNE_MOVE_M1_VIA_OR: On pentiums, it is faster to load -1 via OR
      than a MOV.  */
@@ -1988,7 +1988,7 @@ static unsigned int initial_ix86_tune_fe
 
   /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
      from FP to FP. */
-  m_CORE2I7 | m_AMDFAM10 | m_GENERIC,
+  m_AMDFAM10 | m_GENERIC,
 
   /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
      from integer to FP. */
@@ -1997,7 +1997,7 @@ static unsigned int initial_ix86_tune_fe
   /* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
      with a subsequent conditional jump instruction into a single
      compare-and-branch uop.  */
-  m_BDVER,
+  m_BDVER | m_CORE2I7,
 
   /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
      will impact LEA instruction selection. */ @@ -2052,7 +2052,7 @@ static unsigned int initial_ix86_arch_fe  };
 
 static const unsigned int x86_accumulate_outgoing_args
-  = m_PPRO | m_P4_NOCONA | m_ATOM | m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC;
+  = m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;
 
 static const unsigned int x86_arch_always_fancy_math_387
   = m_PENT | m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;
Index: config/i386/i386.md
===================================================================
--- config/i386/i386.md	(revision 194452)
+++ config/i386/i386.md	(working copy)
@@ -965,7 +965,7 @@
 	(compare:CC (match_operand:SDWIM 1 "nonimmediate_operand")
 		    (match_operand:SDWIM 2 "<general_operand>")))
    (set (pc) (if_then_else
-	       (match_operator 0 "ordered_comparison_operator"
+	       (match_operator 0 "comparison_operator"
 		[(reg:CC FLAGS_REG) (const_int 0)])
 	       (label_ref (match_operand 3))
 	       (pc)))]
@@ -983,7 +983,7 @@
 	(compare:CC (match_operand:SWIM 2 "nonimmediate_operand")
 		    (match_operand:SWIM 3 "<general_operand>")))
    (set (match_operand:QI 0 "register_operand")
-	(match_operator 1 "ordered_comparison_operator"
+	(match_operator 1 "comparison_operator"
 	  [(reg:CC FLAGS_REG) (const_int 0)]))]
   ""
 {

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2012-12-21  8:28 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-12-09 13:50 [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs Uros Bizjak
2012-12-09 17:09 ` Дмитрий Дьяченко
2012-12-10  9:23 ` Richard Biener
2012-12-10 20:42   ` Xinliang David Li
2012-12-10 21:07     ` Mike Stump
2012-12-11  9:49       ` Richard Biener
2012-12-11 17:15         ` Xinliang David Li
  -- strict thread matches above, loose matches on Subject: below --
2012-12-21  7:26 Xinliang David Li
2012-12-21  8:20 ` Zamyatin, Igor
2012-12-11 22:53 Xinliang David Li
2012-12-11 23:39 ` Xinliang David Li
2012-12-12 11:00 ` Richard Biener
2012-12-08 18:13 Xinliang David Li
2012-12-12 16:37 ` Jan Hubicka
2012-12-12 17:25   ` Xinliang David Li
2012-12-12 17:34   ` Xinliang David Li
2012-12-12 18:30     ` Jan Hubicka
2012-12-12 18:37       ` Andi Kleen
2012-12-12 18:43         ` Jan Hubicka
2012-12-12 18:43         ` Andi Kleen
2012-12-13  0:16       ` Xinliang David Li
2012-12-13  0:16         ` Xinliang David Li
2012-12-13  1:19         ` Jan Hubicka
2012-12-13  6:09           ` Xinliang David Li
2012-12-13  6:21             ` Jakub Jelinek
2012-12-13  7:05               ` Xinliang David Li
2012-12-13 19:28                 ` Jan Hubicka
2012-12-13 10:22               ` Richard Biener
2012-12-13 19:43               ` H.J. Lu
2012-12-13 20:26                 ` Jan Hubicka
2012-12-13 20:28                   ` H.J. Lu
2012-12-13 20:40                     ` Jan Hubicka
2012-12-13 21:02                       ` H.J. Lu
2012-12-13 21:35                         ` Jan Hubicka
2012-12-20 12:13                         ` Melik-adamyan, Areg
2012-12-20 14:08                           ` H.J. Lu
2012-12-20 15:05                             ` Jan Hubicka
2012-12-20 15:07                               ` Jan Hubicka
2012-12-20 15:22                                 ` H.J. Lu
2012-12-21  8:28   ` Zamyatin, Igor

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).