Richard Guenther wrote:
> On Sat, Oct 17, 2009 at 5:34 AM, Vladimir Makarov <vmakarov@redhat.com> wrote:
>   
>> Richard Guenther wrote:
>>     
>>> On Wed, Oct 14, 2009 at 6:27 PM, Vladimir Makarov <vmakarov@redhat.com>
>>> wrote:
>>>
>>>       
>>>> Zdenek Dvorak wrote:
>>>>
>>>>         
>>>>> Hi,
>>>>>
>>>>>
>>>>>           
>>>>>>>> +      if (i < ira_reg_class_cover_size)
>>>>>>>> +       size_cost = comp_cost + 10;
>>>>>>>> +      else
>>>>>>>> +       size_cost = 0;
>>>>>>>>
>>>>>>>>
>>>>>>>>                 
>>>>>>> Including comp_cost in size_cost makes no sense (this would prevent us
>>>>>>> from
>>>>>>> moving even very costly invariants out of the loop if we run out of
>>>>>>> registers).
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>> That is exactly what I intended.  As I wrote above, I tried a lot of
>>>>>>  heuristics with different parameters which decided to move loop
>>>>>>  invariant
>>>>>> depending on spill cost and loop invariant cost.  But they  don't  work
>>>>>> well
>>>>>> at least for x86/x86_64 and power6.  I have some  speculation for this.
>>>>>>  x86/x86_64 is OOO processors these days.  And  costly invariant will
>>>>>> be
>>>>>> hidden because usually the invariant has a lot  of freedom to be
>>>>>> executed
>>>>>> out-of-order.  For power6, long latency is  hidden by insn scheduling.
>>>>>>  It
>>>>>> is hard to me find a processor where it  will be important.  Another
>>>>>> reason
>>>>>> for this, it is very hard to evaluate  accurately spill cost at this
>>>>>> stage.
>>>>>>  So I decided not to use  combination of register pressure and
>>>>>> invariant
>>>>>> cost in my approach.
>>>>>>
>>>>>>             
>>>>> could you please add this reasoning to the comment?  Another reason why
>>>>> preventing the invariant motion does not hurt might be that all
>>>>> expensive
>>>>> invariants were already moved out of the loop by PRE and gimple
>>>>> invariant
>>>>> motion pass.
>>>>>
>>>>>
>>>>>           
>>>>>> +      for (i = 0; i < ira_reg_class_cover_size; i++)
>>>>>> +       {
>>>>>> +         cover_class = ira_reg_class_cover[i];
>>>>>> +         if ((int) new_regs[cover_class]
>>>>>> +             + (int) regs_needed[cover_class]
>>>>>> +             + LOOP_DATA (curr_loop)->max_reg_pressure[cover_class]
>>>>>> +             + IRA_LOOP_RESERVED_REGS
>>>>>> +             - ira_available_class_regs[cover_class] > 0)
>>>>>> +           break;
>>>>>> +       }
>>>>>>
>>>>>>             
>>>>> It might be clearer to write this as ... >
>>>>> ira_available_class_regs[cover_class] instead
>>>>> of ... - ira_available_class_regs[cover_class] > 0.  Otherwise, the
>>>>> patch
>>>>> is OK.
>>>>>
>>>>>
>>>>>           
>>>> Zdenek, thanks for the additional comments.  I incorporated them into the
>>>> patch just before committing.  Here is the affected patch part:
>>>>
>>>>         
>>> I think this consistently regressed both compile-time and runtime for
>>> Polyhedron on x86_64.  For Itanium the story isn't clear, but effects
>>> are seen there as well (it's also the only one I see off-noise effects
>>> on SPEC 2000 - significant ups and downs).
>>>
>>>
>>>       
>>  Yes, it is expensive optimization (at least 3 additional passes
>> through RTL insns one for calculating register pressure and two very
>> expensive passes for finding register classes for pseudos).  It is
>> clearly seen from SPEC compilation time graphs on
>>
>> http://vmakarov.fedorapeople.org/spec
>>
>> for 2 last benchmarking.  Therefore I proposed it only for -O3.
>>
>> Overall SPEC2000 scores are practically the same on x86/x86_64.
>>
>> As for Polyhedron benchmarks, here is my results on Core I7:
>>
>> first:  -ffast-math -funroll-loops -O3 -fno-ira-loop-pressure
>> second: -ffast-math -funroll-loops -O3 -fira-loop-pressure
>>
>> x86:
>> Geometric Mean Execution Time =      12.84 seconds
>> Geometric Mean Execution Time =      12.82 seconds
>>
>> x86_64:
>> Geometric Mean Execution Time =       9.89 seconds
>> Geometric Mean Execution Time =       9.91 seconds
>>
>> On power6:
>> first:  -mtune=power6 -ffast-math -funroll-loops -O3 -fno-ira-loop-pressure
>> second: -mtune=power6 -ffast-math -funroll-loops -O3 -fira-loop-pressure
>>
>> Geometric Mean Execution Time =      19.22 seconds
>> Geometric Mean Execution Time =      19.04 seconds
>>
>>  As I wrote earlier the winner of the optimization usage will be
>> loops with pressure lower (but not too lower) than #registers.  For
>> x86/x86_64, practically all loops have pressure more than #registers.
>> For such loops, evaluation of invariant cost vs spill cost would be
>> important.  But at this stage, spill cost is impossible to evaluate
>> accurately.  So usage of old and new loop invariant motion criteria on
>> processors similar x86/x86_64 will give different results for particular
>> tests (some tests better, some worse) but overall score will be
>> practically the same.
>>
>>  Probably, there is no sense to use IRA-based register pressure calculation
>> for all targets (including x86/x86_64) but for power it is a clear win as it
>> is seen from polyhedron and as I reported for SPEC2000.
>>
>>  So we could switch it off by default for -O3.  What do you think about this
>> solution, Richard?
>>     
>
> I think we could switch it on by default at -O3 for a selected group of
> targets.  Itanium overall also improves with the new heuristics.  That would
> make it power and Itanium.
The patch is below.  Ok to commit?
>   Did you try restricting the heuristics to certain
> register classes, like SSE registers on x86_64?
>
>   
No, I did not try.  I am not sure it is worth to do  it.


2009-10-19  Vladimir Makarov  <vmakarov@redhat.com>

    * doc/invoke.texi (fira-loop-pressure): Update default value.
    * opts.c (decode_options): Remove default value setting for
    flag_ira_loop_pressure.
    * config/ia64/ia64.c (ia64_override_options): Set
    flag_ira_loop_pressure up for -O3.
    * config/rs6000/rs6000.c (rs6000_override_options): Ditto.