public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
@ 2012-12-09 13:50 Uros Bizjak
  2012-12-09 17:09 ` Дмитрий Дьяченко
  2012-12-10  9:23 ` Richard Biener
  0 siblings, 2 replies; 40+ messages in thread
From: Uros Bizjak @ 2012-12-09 13:50 UTC (permalink / raw)
  To: gcc-patches; +Cc: Xinliang David Li

Hello!

> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
> SP adjustment instead of a sequence of pushes/pops. The preference to
> the MOVs are good for old CPU micro-architectures (before pentium-4,
> K10), because it breaks the data dependency.  In modern
> micro-architecture, push/pop is implemented using a mechanism called
> stack engine. The data dependency is removed by the hardware, and
> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
> smaller. There is no longer the need to avoid using them.   This is
> also what ICC does.

> 2012-12-08  Xinliang David Li  <davidxl@google.com>
>            * config/i386/i386.c: Eanble push/pop in pro/epilogue for moderen CPUs.

s/moderen/modern

OK for mainline SVN.

Thanks,
Uros.

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
@ 2012-12-21  7:26 Xinliang David Li
  2012-12-21  8:20 ` Zamyatin, Igor
  0 siblings, 1 reply; 40+ messages in thread
From: Xinliang David Li @ 2012-12-21  7:26 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches, Ahmad Sharif

Ahmad has helped doing some atom performance testing (ChromeOS
benchmarks) with this patch. In summary, there is no statistically
significant regression seen. There is one improvement of about +1.9%
(v8 benchmark) which looks real.

David

On Wed, Dec 12, 2012 at 9:24 AM, Xinliang David Li <davidxl@google.com> wrote:
> On Wed, Dec 12, 2012 at 8:37 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
>>> SP adjustment instead of a sequence of pushes/pops. The preference to
>>> the MOVs are good for old CPU micro-architectures (before pentium-4,
>>> K10), because it breaks the data dependency.  In modern
>>> micro-architecture, push/pop is implemented using a mechanism called
>>> stack engine. The data dependency is removed by the hardware, and
>>> push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
>>> smaller. There is no longer the need to avoid using them.   This is
>>> also what ICC does.
>>>
>>> The following patch fixed the problem. It passes bootstrap/regression
>>> test. OK to install?
>>>
>>> thanks,
>>>
>>> David
>>>
>>> Index: config/i386/i386.c
>>> ===================================================================
>>> --- config/i386/i386.c (revision 194324)
>>> +++ config/i386/i386.c (working copy)
>>> @@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
>>>    m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>>
>>>    /* X86_TUNE_PROLOGUE_USING_MOVE */
>>> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
>>> +  m_PPRO | m_ATHLON_K8,
>>>
>>>    /* X86_TUNE_EPILOGUE_USING_MOVE */
>>> -  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
>>> +  m_PPRO | m_ATHLON_K8,
>>
>> Push/pops wrt moves was always difficult to tune on old CPUs, so I am happy it
>> is gone from generic (in fact I had similar patch pending).
>> Are you sure about Atom having stack engine, too?
>>
>
> Good question. The instruction latency table
> (http://www.agner.org/optimize/instruction_tables.pdf) shows that for
> Atom: push r has one 1uop, 1 cycle latency. However the instruction is
> not pairable which will affect ILP. The guide here
> http://www.agner.org/optimize/microarchitecture.pdf does not mention
> Atom has stack engine either.
>
> I will help collect some performance data on Atom.
>
>
> thanks,
>
> David
>
>
>> Related thing is accumulate_outgoing_args. Igor is testing it on Core and I will
>> give it a try on K10.
>>
>> Honza
>>
>> I am attaching the changes for core costs I made if someone is interested in
>> testing them.  If we can declare P4/PPRo and maybe K8 chips obsolette for
>> generic, there is room for improvement in generic, too. Like using inc/dec
>> again.
>>
>> Honza
>>
>> Index: config/i386/i386.c
>> ===================================================================
>> --- config/i386/i386.c  (revision 194452)
>> +++ config/i386/i386.c  (working copy)
>> @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
>>    COSTS_N_INSNS (8),                   /* cost of FABS instruction.  */
>>    COSTS_N_INSNS (8),                   /* cost of FCHS instruction.  */
>>    COSTS_N_INSNS (40),                  /* cost of FSQRT instruction.  */
>> -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>> -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
>> +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
>> +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>>                {-1, libcall, false}}}},
>>    {{libcall, {{6, loop_1_byte, true},
>>                {24, loop, true},
>>                {8192, rep_prefix_4_byte, true},
>>                {-1, libcall, false}}},
>> -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
>> +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
>>                {-1, libcall, false}}}},
>>    1,                                   /* scalar_stmt_cost.  */
>>    1,                                   /* scalar load_cost.  */
>> @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
>>    m_PPRO,
>>
>>    /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
>> -  m_CORE2I7 | m_GENERIC,
>> +  m_GENERIC | m_CORE2,
>>
>>    /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
>>     * on 16-bit immediate moves into memory on Core2 and Corei7.  */
>> @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
>>    m_K6,
>>
>>    /* X86_TUNE_USE_CLTD */
>> -  ~(m_PENT | m_ATOM | m_K6),
>> +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
>>
>>    /* X86_TUNE_USE_XCHGB: Use xchgb %rh,%rl instead of rolw/rorw $8,rx.  */
>>    m_PENT4,
>> @@ -1901,7 +1901,7 @@ static unsigned int initial_ix86_tune_fe
>>    m_COREI7 | m_BDVER,
>>
>>    /* X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL */
>> -  m_BDVER ,
>> +  m_BDVER,
>>
>>    /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies
>>       are resolved on SSE register parts instead of whole registers, so we may
>> @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
>>    m_ATHLON_K8,
>>
>>    /* X86_TUNE_SSE_TYPELESS_STORES */
>> -  m_AMD_MULTIPLE,
>> +  m_AMD_MULTIPLE | m_CORE2I7, /*????*/
>>
>>    /* X86_TUNE_SSE_LOAD0_BY_PXOR */
>> -  m_PPRO | m_P4_NOCONA,
>> +  m_PPRO | m_P4_NOCONA | m_CORE2I7, /*????*/
>>
>>    /* X86_TUNE_MEMORY_MISMATCH_STALL */
>>    m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>> @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
>>
>>    /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
>>       than 4 branch instructions in the 16 byte window.  */
>> -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>> +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>
>>    /* X86_TUNE_SCHEDULE */
>>    m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE | m_GENERIC,
>> @@ -1947,10 +1947,10 @@ static unsigned int initial_ix86_tune_fe
>>    m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>
>>    /* X86_TUNE_USE_INCDEC */
>> -  ~(m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_GENERIC),
>> +  ~(m_P4_NOCONA | m_ATOM | m_GENERIC),
>>
>>    /* X86_TUNE_PAD_RETURNS */
>> -  m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC,
>> +  m_AMD_MULTIPLE | m_GENERIC,
>>
>>    /* X86_TUNE_PAD_SHORT_FUNCTION: Pad short funtion.  */
>>    m_ATOM,
>> @@ -1959,7 +1959,7 @@ static unsigned int initial_ix86_tune_fe
>>    m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_ATHLON_K8 | m_GENERIC,
>>
>>    /* X86_TUNE_AVOID_VECTOR_DECODE */
>> -  m_CORE2I7 | m_K8 | m_GENERIC64,
>> +  m_K8 | m_GENERIC64,
>>
>>    /* X86_TUNE_PROMOTE_HIMODE_IMUL: Modern CPUs have same latency for HImode
>>       and SImode multiply, but 386 and 486 do HImode multiply faster.  */
>> @@ -1967,11 +1967,11 @@ static unsigned int initial_ix86_tune_fe
>>
>>    /* X86_TUNE_SLOW_IMUL_IMM32_MEM: Imul of 32-bit constant and memory is
>>       vector path on AMD machines.  */
>> -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
>> +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
>>
>>    /* X86_TUNE_SLOW_IMUL_IMM8: Imul of 8-bit constant is vector path on AMD
>>       machines.  */
>> -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
>> +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
>>
>>    /* X86_TUNE_MOVE_M1_VIA_OR: On pentiums, it is faster to load -1 via OR
>>       than a MOV.  */
>> @@ -1988,7 +1988,7 @@ static unsigned int initial_ix86_tune_fe
>>
>>    /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
>>       from FP to FP. */
>> -  m_CORE2I7 | m_AMDFAM10 | m_GENERIC,
>> +  m_AMDFAM10 | m_GENERIC,
>>
>>    /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
>>       from integer to FP. */
>> @@ -1997,7 +1997,7 @@ static unsigned int initial_ix86_tune_fe
>>    /* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
>>       with a subsequent conditional jump instruction into a single
>>       compare-and-branch uop.  */
>> -  m_BDVER,
>> +  m_BDVER | m_CORE2I7,
>>
>>    /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
>>       will impact LEA instruction selection. */
>> @@ -2052,7 +2052,7 @@ static unsigned int initial_ix86_arch_fe
>>  };
>>
>>  static const unsigned int x86_accumulate_outgoing_args
>> -  = m_PPRO | m_P4_NOCONA | m_ATOM | m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC;
>> +  = m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;
>>
>>  static const unsigned int x86_arch_always_fancy_math_387
>>    = m_PENT | m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;
>> Index: config/i386/i386.md
>> ===================================================================
>> --- config/i386/i386.md (revision 194452)
>> +++ config/i386/i386.md (working copy)
>> @@ -965,7 +965,7 @@
>>         (compare:CC (match_operand:SDWIM 1 "nonimmediate_operand")
>>                     (match_operand:SDWIM 2 "<general_operand>")))
>>     (set (pc) (if_then_else
>> -              (match_operator 0 "ordered_comparison_operator"
>> +              (match_operator 0 "comparison_operator"
>>                 [(reg:CC FLAGS_REG) (const_int 0)])
>>                (label_ref (match_operand 3))
>>                (pc)))]
>> @@ -983,7 +983,7 @@
>>         (compare:CC (match_operand:SWIM 2 "nonimmediate_operand")
>>                     (match_operand:SWIM 3 "<general_operand>")))
>>     (set (match_operand:QI 0 "register_operand")
>> -       (match_operator 1 "ordered_comparison_operator"
>> +       (match_operator 1 "comparison_operator"
>>           [(reg:CC FLAGS_REG) (const_int 0)]))]
>>    ""
>>  {

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
@ 2012-12-11 22:53 Xinliang David Li
  2012-12-11 23:39 ` Xinliang David Li
  2012-12-12 11:00 ` Richard Biener
  0 siblings, 2 replies; 40+ messages in thread
From: Xinliang David Li @ 2012-12-11 22:53 UTC (permalink / raw)
  To: Richard Biener; +Cc: Mike Stump, Uros Bizjak, GCC Patches

The following the O2 size data from SPEC2k.  Note that with push/pop,
it is a always a net win (negative delta) in terms of total binary or
total loadable section size.

thanks,

David

                   .text    .eh_frame  Total_binary
vortex-move 440252 40796 584066
vortex-push 415436 57452 575906
delta            -5.6% 40.8%  -1.397%

twolf-move 169324 10748 223521
twolf-push 168876 11124 223449
delta       -0.3% 3.5% -0.032%

gzip-move 30668 3652 374399
gzip-push 30524 3740 374343
delta     -0.5% 2.4% -0.015%

bzip2-move 22748 3196 111616
bzip2-push 22636 3284 111592
delta      -0.5% 2.8% -0.022%

vpr-move 104684 9380 147378
vpr-push 104236 9788 147338
delta     -0.4% 4.3% -0.027%

mcf-move 8444 1244 26760
mcf-push 8444 1244 26760
delta    0.0% 0.0% 0.000%

cc1-move 1093964 90772 1576994
cc1-push 1078988 104068 1575314
delta      -1.4% 14.6% -0.107%

crafty-move 130556 5508 1256037
crafty-push 130236 5772 1255981
delta        -0.2% 4.8% -0.004%

eon-move 333660 33220 516491
eon-push 330140 35812 515555
delta     -1.1% 7.8% -0.181%

gap-move 404092 46732 1457735
gap-push 396012 53180 1456103
delta     -2.0% 13.8% -0.112%

perlbmk-move 456572 45324 618585
perlbmk-push 449516 52340 618545
delta         -1.5% 15.5% -0.006%

parser-move 81244 15788 334003
parser-push 80684 16332 333987
delta       -0.7% 3.4% -0.005%


On Tue, Dec 11, 2012 at 9:14 AM, Xinliang David Li <davidxl@google.com> wrote:
> On Tue, Dec 11, 2012 at 1:49 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Mon, Dec 10, 2012 at 10:07 PM, Mike Stump <mikestump@comcast.net> wrote:
>>> On Dec 10, 2012, at 12:42 PM, Xinliang David Li <davidxl@google.com> wrote:
>>>> I have not measured the CFI size impact -- but conceivably it should
>>>> be larger -- which is unfortunate.
>>>
>>> Code speed and size are preferable to optimizing dwarf size…  :-)  I'd let dwarf 5 fix it!
>>
>> Well, different to debug info, CFI data has to be in memory to make
>> unwinding work.
>> These days most Linux distributions enable asyncronous unwind tables so any
>> size savings due to shorter push/pop epilogue/prologue sequences has to be
>> offsetted by the increase in CFI data.  I'm not sure there is really a
>> speed difference
>> between both variants (well, maybe due to better icache footprint of
>> the push/pop
>> variant).
>
> Yes, for large applications, this can be crucial to performance.
>
>>
>> That said - I'd prefer to have more data on this before making the switch for
>> the generic model.  What was your original motivation?  Just "theory" or was
>> it a real case?
>
> 1) some of the very large internal apps I measured benefit from this
> change (in terms of performance)
> 2) both ICC and LLVM do the same.
>
> I have already committed the patch. I will find some time to collect
> more size data and post it later.
>
> thanks,
>
> David
>
>
>>
>> Thanks,
>> Richard.

^ permalink raw reply	[flat|nested] 40+ messages in thread
* [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
@ 2012-12-08 18:13 Xinliang David Li
  2012-12-12 16:37 ` Jan Hubicka
  0 siblings, 1 reply; 40+ messages in thread
From: Xinliang David Li @ 2012-12-08 18:13 UTC (permalink / raw)
  To: GCC Patches

I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
SP adjustment instead of a sequence of pushes/pops. The preference to
the MOVs are good for old CPU micro-architectures (before pentium-4,
K10), because it breaks the data dependency.  In modern
micro-architecture, push/pop is implemented using a mechanism called
stack engine. The data dependency is removed by the hardware, and
push/pop becomes very cheap (1 uOp, 1 cycle latency), and they are
smaller. There is no longer the need to avoid using them.   This is
also what ICC does.

The following patch fixed the problem. It passes bootstrap/regression
test. OK to install?

thanks,

David

Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c (revision 194324)
+++ config/i386/i386.c (working copy)
@@ -1919,10 +1919,10 @@ static unsigned int initial_ix86_tune_fe
   m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,

   /* X86_TUNE_PROLOGUE_USING_MOVE */
-  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
+  m_PPRO | m_ATHLON_K8,

   /* X86_TUNE_EPILOGUE_USING_MOVE */
-  m_PPRO | m_CORE2I7 | m_ATOM | m_ATHLON_K8 | m_GENERIC,
+  m_PPRO | m_ATHLON_K8,

   /* X86_TUNE_SHIFT1 */
   ~m_486,


2012-12-08  Xinliang David Li  <davidxl@google.com>
           * config/i386/i386.c: Eanble push/pop in pro/epilogue for
moderen CPUs.

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2012-12-21  8:28 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-12-09 13:50 [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs Uros Bizjak
2012-12-09 17:09 ` Дмитрий Дьяченко
2012-12-10  9:23 ` Richard Biener
2012-12-10 20:42   ` Xinliang David Li
2012-12-10 21:07     ` Mike Stump
2012-12-11  9:49       ` Richard Biener
2012-12-11 17:15         ` Xinliang David Li
  -- strict thread matches above, loose matches on Subject: below --
2012-12-21  7:26 Xinliang David Li
2012-12-21  8:20 ` Zamyatin, Igor
2012-12-11 22:53 Xinliang David Li
2012-12-11 23:39 ` Xinliang David Li
2012-12-12 11:00 ` Richard Biener
2012-12-08 18:13 Xinliang David Li
2012-12-12 16:37 ` Jan Hubicka
2012-12-12 17:25   ` Xinliang David Li
2012-12-12 17:34   ` Xinliang David Li
2012-12-12 18:30     ` Jan Hubicka
2012-12-12 18:37       ` Andi Kleen
2012-12-12 18:43         ` Jan Hubicka
2012-12-12 18:43         ` Andi Kleen
2012-12-13  0:16       ` Xinliang David Li
2012-12-13  0:16         ` Xinliang David Li
2012-12-13  1:19         ` Jan Hubicka
2012-12-13  6:09           ` Xinliang David Li
2012-12-13  6:21             ` Jakub Jelinek
2012-12-13  7:05               ` Xinliang David Li
2012-12-13 19:28                 ` Jan Hubicka
2012-12-13 10:22               ` Richard Biener
2012-12-13 19:43               ` H.J. Lu
2012-12-13 20:26                 ` Jan Hubicka
2012-12-13 20:28                   ` H.J. Lu
2012-12-13 20:40                     ` Jan Hubicka
2012-12-13 21:02                       ` H.J. Lu
2012-12-13 21:35                         ` Jan Hubicka
2012-12-20 12:13                         ` Melik-adamyan, Areg
2012-12-20 14:08                           ` H.J. Lu
2012-12-20 15:05                             ` Jan Hubicka
2012-12-20 15:07                               ` Jan Hubicka
2012-12-20 15:22                                 ` H.J. Lu
2012-12-21  8:28   ` Zamyatin, Igor

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).