RFC: improving estimate_num

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* RFC: improving estimate_num_insns
@ 2005-07-12 17:56 Josh Conner
  2005-07-12 20:07 ` Steven Bosscher
  0 siblings, 1 reply; 6+ messages in thread
From: Josh Conner @ 2005-07-12 17:56 UTC (permalink / raw)
  To: gcc

[-- Attachment #1: Type: text/plain, Size: 4642 bytes --]

In looking at the function inliner, I did some analysis of the  
correctness of estimate_num_insns (from tree-inline.c), as this  
measurement is critical to the inlining heuristics.  Here is the  
overview of what I found on the functions in two sample files from  
the FAAD2 AAC library (measured at -O3 with inlining disabled):

syntax.c
               avg.    avg.     std.
target        ratio   error    deviation
arm-none      2.88     57%     2.37
darwin-ppc    3.17     55%     2.50
linux-x86     2.72     55%     2.06

huffman.c
               avg.    avg.     std.
target        ratio   error    deviation
arm-none      3.27    56%      2.37
darwin-ppc    4.19    53%      2.88
linux-x86     3.42    56%      2.39

     fn. differential = actual bytes / estimated insns
   avg. ratio = average (fn. differential) of all functions
     fn. error = max (fn. differential, avg. ratio) / min (fn.  
differential, avg. ratio)
   avg. error = average (fn. error) of all functions
   std. deviation = standard deviation (fn. differential) across all  
functions

Note that by choosing these metrics, my intent was to isolate the  
consistency of the differential from the magnitude of the  
differential.  In other words, the goal was to have estimates that  
were a consistent ratio to the actual results, whatever that ratio  
may be.

Thinking that there may be room for improvement on this, I tried this  
same experiment with a couple of adjustments to estimate_num_insns:
- Instead of ignoring constants, assign them a weight of 2  
instructions (one for the constant itself and one to load into memory)
- Instead of ignoring dereferences, assign a weight of 2 to an array  
reference and 1 for a pointer dereference
- Instead of ignoring case labels, assign them a weight of 2  
instructions (to represent the cost of control logic to get there)

With these modifications (tree-estimate.patch), I see the following  
results:

syntax.c
               avg.    avg.     std.
target        ratio   error    deviation
arm-none      1.68    29%      0.67
darwin-ppc    1.84    29%      0.69
linux-x86     1.62    26%      0.55

huffman.c
               avg.    avg.     std.
target        ratio   error    deviation
arm-none      1.75    26%      0.52
darwin-ppc    2.31    24%      0.69
linux-x86     1.95    25%      0.67

Which appears to be a significant improvement in the accuracy of  
estimate_num_insns.  However, note that since the avg. ratio has  
decreased significantly (estimates have gone up because we're  
counting things that were ignored before), any heuristics that are  
based on instruction counts will have effectively decreased by  
~25-50%.  To compensate for that, I bumped up any constants that are  
based on instruction estimates by 50% each (see inl-costs.patch).

With both of these changes in place, I ran some simple SPEC2000  
integer benchmarks (not fine-tuned, not reportable) and saw the  
following impact on performance (positive is better):

           darwin-ppc   linux-x86  linux-arm*
gzip      -4.7%        +11.7%     +1.3%
vpr       -0.7%        -1.1%      -----
mcf       -0.3%        +1.1%      -----
crafty    -0.8%        +1.8%      -----
parser    -----        -----      -----
eon       -4.4%        +0.3%      -----
perlbmk   +0.4%        0.0%       -1.1%
gap       -----        +1.0%      -2.9%
vortex    +4.1%        0.0%       0.0%
bzip2     +1.2%        +1.1%      -1.3%
twolf     -0.5%        -2.9%      -1.2%

   ----- = unable to obtain a valid result
   * linux-arm results are based on a subset of SPEC data and command- 
line options for use on a 32MB embedded board -- not even close to  
the full SPEC suite.

For what it's worth, code size is equal to or smaller for all  
benchmarks across all platforms.

So, here are the open issues I see at this point:
1. It appears that this change to estimate_num_instructions generates  
a much better estimate of actual code size.  However, the benchmark  
results are ambiguous.  Is this patch worth considering as-is?
2. Increasing instruction weights causes the instruction-based values  
(e.g., --param max-inline-insns-auto) to be effectively lower.   
However, changing these constants/defaults as in the second patch  
will cause a semantic change to anyone who is setting these values at  
the command line.  Is that change acceptable?

I do realize that this area is one that will eventually be likely  
best served by target-dependent routines (and constants), however I  
also see a significant benefit to all targets in fixing the default  
implementation first.

Thoughts?  Advice?

Thanks -

Josh
~~~~
Josh Conner



[-- Attachment #2: tree-estimate.patch --]
[-- Type: application/octet-stream, Size: 3220 bytes --]

Index: tree-inline.c
===================================================================
RCS file: /cvsroot/gcc/gcc/gcc/tree-inline.c,v
retrieving revision 1.200
diff -c -3 -p -r1.200 tree-inline.c
*** tree-inline.c	28 Jun 2005 19:33:23 -0000	1.200
--- tree-inline.c	1 Jul 2005 23:08:11 -0000
*************** estimate_num_insns_1 (tree *tp, int *wal
*** 1564,1574 ****
        *walk_subtrees = 0;
        return NULL;
      }
!   /* Assume that constants and references counts nothing.  These should
!      be majorized by amount of operations among them we count later
!      and are common target of CSE and similar optimizations.  */
!   else if (CONSTANT_CLASS_P (x) || REFERENCE_CLASS_P (x))
!     return NULL;
  
    switch (TREE_CODE (x))
      {
--- 1564,1593 ----
        *walk_subtrees = 0;
        return NULL;
      }
!   else if (CONSTANT_CLASS_P (x))
!     {
!       /* One word to hold the constant and a load instruction.  */
!       *count += 2;
!       *walk_subtrees = 0;
!       return NULL;
!     }
!   else if (REFERENCE_CLASS_P (x))
!     {
!       switch (TREE_CODE (x))
! 	{
! 	  /* Approximate an array access as an addition + pointer
!              dereference.  */
! 	  case ARRAY_REF:
! 	    *count += 2;
! 	    break;
! 
! 	  /* Any other dereference.  */
! 	  default:
! 	    *count += 1;
! 	    break;
! 	}
!       return NULL;
!     }
  
    switch (TREE_CODE (x))
      {
*************** estimate_num_insns_1 (tree *tp, int *wal
*** 1576,1588 ****
      case TREE_LIST:
      case TREE_VEC:
      case BLOCK:
-     case COMPONENT_REF:
-     case BIT_FIELD_REF:
-     case INDIRECT_REF:
-     case ALIGN_INDIRECT_REF:
-     case MISALIGNED_INDIRECT_REF:
-     case ARRAY_REF:
-     case ARRAY_RANGE_REF:
      case OBJ_TYPE_REF:
      case EXC_PTR_EXPR: /* ??? */
      case FILTER_EXPR: /* ??? */
--- 1595,1600 ----
*************** estimate_num_insns_1 (tree *tp, int *wal
*** 1595,1601 ****
      case ADDR_EXPR:
      case COMPLEX_EXPR:
      case RANGE_EXPR:
-     case CASE_LABEL_EXPR:
      case SSA_NAME:
      case CATCH_EXPR:
      case EH_FILTER_EXPR:
--- 1607,1612 ----
*************** estimate_num_insns_1 (tree *tp, int *wal
*** 1613,1632 ****
      case LOOP_EXPR:
      case PHI_NODE:
      case WITH_SIZE_EXPR:
        break;
  
!     /* We don't account constants for now.  Assume that the cost is amortized
!        by operations that do use them.  We may re-consider this decision once
!        we are able to optimize the tree before estimating its size and break
!        out static initializers.  */
!     case IDENTIFIER_NODE:
!     case INTEGER_CST:
!     case REAL_CST:
!     case COMPLEX_CST:
!     case VECTOR_CST:
!     case STRING_CST:
!       *walk_subtrees = 0;
!       return NULL;
  
      /* Try to estimate the cost of assignments.  We have three cases to
         deal with:
--- 1624,1636 ----
      case LOOP_EXPR:
      case PHI_NODE:
      case WITH_SIZE_EXPR:
+     case IDENTIFIER_NODE:
        break;
  
!     /* A switch statement generally requires testing logic for each case.  */
!     case CASE_LABEL_EXPR:
!       *count += 2;
!       break;
  
      /* Try to estimate the cost of assignments.  We have three cases to
         deal with:

[-- Attachment #3: inl-costs.patch --]
[-- Type: application/octet-stream, Size: 6182 bytes --]

Index: params.def
===================================================================
RCS file: /cvsroot/gcc/gcc/gcc/params.def,v
retrieving revision 1.61
diff -c -3 -p -r1.61 params.def
*** params.def	27 Jun 2005 07:40:48 -0000	1.61
--- params.def	1 Jul 2005 23:08:11 -0000
*************** DEFPARAM (PARAM_SRA_FIELD_STRUCTURE_RATI
*** 66,72 ****
     of a function counted in internal gcc instructions (not in
     real machine instructions) that is eligible for inlining
     by the tree inliner.
!    The default value is 450.
     Only functions marked inline (or methods defined in the class
     definition for C++) are affected by this.
     There are more restrictions to inlining: If inlined functions
--- 66,72 ----
     of a function counted in internal gcc instructions (not in
     real machine instructions) that is eligible for inlining
     by the tree inliner.
!    The default value is 675.
     Only functions marked inline (or methods defined in the class
     definition for C++) are affected by this.
     There are more restrictions to inlining: If inlined functions
*************** DEFPARAM (PARAM_SRA_FIELD_STRUCTURE_RATI
*** 77,83 ****
  DEFPARAM (PARAM_MAX_INLINE_INSNS_SINGLE,
  	  "max-inline-insns-single",
  	  "The maximum number of instructions in a single function eligible for inlining",
! 	  450, 0, 0)
  
  /* The single function inlining limit for functions that are
     inlined by virtue of -finline-functions (-O3).
--- 77,83 ----
  DEFPARAM (PARAM_MAX_INLINE_INSNS_SINGLE,
  	  "max-inline-insns-single",
  	  "The maximum number of instructions in a single function eligible for inlining",
! 	  675, 0, 0)
  
  /* The single function inlining limit for functions that are
     inlined by virtue of -finline-functions (-O3).
*************** DEFPARAM (PARAM_MAX_INLINE_INSNS_SINGLE,
*** 85,105 ****
     that is applied to functions marked inlined (or defined in the
     class declaration in C++) given by the "max-inline-insns-single"
     parameter.
!    The default value is 90.  */
  DEFPARAM (PARAM_MAX_INLINE_INSNS_AUTO,
  	  "max-inline-insns-auto",
  	  "The maximum number of instructions when automatically inlining",
! 	  90, 0, 0)
  
  DEFPARAM (PARAM_MAX_INLINE_INSNS_RECURSIVE,
  	  "max-inline-insns-recursive",
  	  "The maximum number of instructions inline function can grow to via recursive inlining",
! 	  450, 0, 0)
  
  DEFPARAM (PARAM_MAX_INLINE_INSNS_RECURSIVE_AUTO,
  	  "max-inline-insns-recursive-auto",
  	  "The maximum number of instructions non-inline function can grow to via recursive inlining",
! 	  450, 0, 0)
  
  DEFPARAM (PARAM_MAX_INLINE_RECURSIVE_DEPTH,
  	  "max-inline-recursive-depth",
--- 85,105 ----
     that is applied to functions marked inlined (or defined in the
     class declaration in C++) given by the "max-inline-insns-single"
     parameter.
!    The default value is 135.  */
  DEFPARAM (PARAM_MAX_INLINE_INSNS_AUTO,
  	  "max-inline-insns-auto",
  	  "The maximum number of instructions when automatically inlining",
! 	  135, 0, 0)
  
  DEFPARAM (PARAM_MAX_INLINE_INSNS_RECURSIVE,
  	  "max-inline-insns-recursive",
  	  "The maximum number of instructions inline function can grow to via recursive inlining",
! 	  675, 0, 0)
  
  DEFPARAM (PARAM_MAX_INLINE_INSNS_RECURSIVE_AUTO,
  	  "max-inline-insns-recursive-auto",
  	  "The maximum number of instructions non-inline function can grow to via recursive inlining",
! 	  675, 0, 0)
  
  DEFPARAM (PARAM_MAX_INLINE_RECURSIVE_DEPTH,
  	  "max-inline-recursive-depth",
*************** DEFPARAM(PARAM_MAX_PENDING_LIST_LENGTH,
*** 155,161 ****
  DEFPARAM(PARAM_LARGE_FUNCTION_INSNS,
  	 "large-function-insns",
  	 "The size of function body to be considered large",
! 	 2700, 0, 0)
  DEFPARAM(PARAM_LARGE_FUNCTION_GROWTH,
  	 "large-function-growth",
  	 "Maximal growth due to inlining of large function (in percent)",
--- 155,161 ----
  DEFPARAM(PARAM_LARGE_FUNCTION_INSNS,
  	 "large-function-insns",
  	 "The size of function body to be considered large",
! 	 4050, 0, 0)
  DEFPARAM(PARAM_LARGE_FUNCTION_GROWTH,
  	 "large-function-growth",
  	 "Maximal growth due to inlining of large function (in percent)",
*************** DEFPARAM(PARAM_MAX_PEEL_TIMES,
*** 233,239 ****
  DEFPARAM(PARAM_MAX_COMPLETELY_PEELED_INSNS,
  	"max-completely-peeled-insns",
  	"The maximum number of insns of a completely peeled loop",
! 	400, 0, 0)
  /* The maximum number of peelings of a single loop that is peeled completely.  */
  DEFPARAM(PARAM_MAX_COMPLETELY_PEEL_TIMES,
  	"max-completely-peel-times",
--- 233,239 ----
  DEFPARAM(PARAM_MAX_COMPLETELY_PEELED_INSNS,
  	"max-completely-peeled-insns",
  	"The maximum number of insns of a completely peeled loop",
! 	600, 0, 0)
  /* The maximum number of peelings of a single loop that is peeled completely.  */
  DEFPARAM(PARAM_MAX_COMPLETELY_PEEL_TIMES,
  	"max-completely-peel-times",
Index: tree-ssa-loop-ch.c
===================================================================
RCS file: /cvsroot/gcc/gcc/gcc/tree-ssa-loop-ch.c,v
retrieving revision 2.18
diff -c -3 -p -r2.18 tree-ssa-loop-ch.c
*** tree-ssa-loop-ch.c	25 Jun 2005 02:01:39 -0000	2.18
--- tree-ssa-loop-ch.c	1 Jul 2005 23:08:11 -0000
*************** should_duplicate_loop_header_p (basic_bl
*** 76,82 ****
      return false;
  
    /* Approximately copy the conditions that used to be used in jump.c --
!      at most 20 insns and no calls.  */
    for (bsi = bsi_start (header); !bsi_end_p (bsi); bsi_next (&bsi))
      {
        last = bsi_stmt (bsi);
--- 76,82 ----
      return false;
  
    /* Approximately copy the conditions that used to be used in jump.c --
!      at most 30 insns and no calls.  */
    for (bsi = bsi_start (header); !bsi_end_p (bsi); bsi_next (&bsi))
      {
        last = bsi_stmt (bsi);
*************** copy_loop_headers (void)
*** 150,157 ****
  
    for (i = 1; i < loops->num; i++)
      {
!       /* Copy at most 20 insns.  */
!       int limit = 20;
  
        loop = loops->parray[i];
        if (!loop)
--- 150,157 ----
  
    for (i = 1; i < loops->num; i++)
      {
!       /* Copy at most 30 insns.  */
!       int limit = 30;
  
        loop = loops->parray[i];
        if (!loop)

[-- Attachment #4: Type: text/plain, Size: 1 bytes --]



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: RFC: improving estimate_num_insns
  2005-07-12 17:56 RFC: improving estimate_num_insns Josh Conner
@ 2005-07-12 20:07 ` Steven Bosscher
  2005-07-13  2:31   ` Josh Conner
  0 siblings, 1 reply; 6+ messages in thread
From: Steven Bosscher @ 2005-07-12 20:07 UTC (permalink / raw)
  To: gcc; +Cc: Josh Conner, jh

On Tuesday 12 July 2005 19:56, Josh Conner wrote:
> In looking at the function inliner, I did some analysis of the
> correctness of estimate_num_insns (from tree-inline.c), as this
> measurement is critical to the inlining heuristics.

You don't say what compiler you used for these measurements.  I suppose
you used mainline?

>   Here is the  
> overview of what I found on the functions in two sample files from  
> the FAAD2 AAC library (measured at -O3 with inlining disabled):

I think you should look at a lot more code than this.

> Thinking that there may be room for improvement on this, I tried this
> same experiment with a couple of adjustments to estimate_num_insns:
> - Instead of ignoring constants, assign them a weight of 2
> instructions (one for the constant itself and one to load into memory)

Why the load into memory, you mean constants that must be loaded from
the constant pool?  I guess the cost for the constant itself depends on
whether it is a legitimate constant or not, and how it is loaded, and
this is all very target specific.  But a cost greater than 0 probably
makes sense.

It would be nice if you retain the comment about constants in the
existing code somehow.  The cost of constants is not trivial, and you
should explain your choice better in a comment.

> - Instead of ignoring dereferences, assign a weight of 2 to an array
> reference and 1 for a pointer dereference

Makes sense.

> - Instead of ignoring case labels, assign them a weight of 2
> instructions (to represent the cost of control logic to get there)

This depends on how the switch statement is expanded, e.g. to a binary
decision tree, or to a table jump, or to something smaller than that.
So again (like everything else, sadly) this is highly target-specific
and even context-specific.  I'd think a cost of 2 is too pessimistic in
most cases.

You could look at the code in stmt.c to see how switch statements are
expanded.  Maybe there is a cheap test you can do to make CASE_LABELs
free for very small switch statements (i.e. the ones which should never
hold you back from inlining the function containing them ;-).

> For what it's worth, code size is equal to or smaller for all
> benchmarks across all platforms.

What about the compile time?

> So, here are the open issues I see at this point:
> 1. It appears that this change to estimate_num_instructions generates
> a much better estimate of actual code size.  However, the benchmark
> results are ambiguous.  Is this patch worth considering as-is?

I would say you'd at least have to look into the ppc gzip and eon
regressions before this is acceptable.  But it is not my decision to
make, of course.

> 2. Increasing instruction weights causes the instruction-based values
> (e.g., --param max-inline-insns-auto) to be effectively lower.
> However, changing these constants/defaults as in the second patch
> will cause a semantic change to anyone who is setting these values at
> the command line.  Is that change acceptable?

This has constantly changed from one release to the next since GCC 3.3,
so I don't think this should be a problem.

> I do realize that this area is one that will eventually be likely
> best served by target-dependent routines (and constants), however I
> also see a significant benefit to all targets in fixing the default
> implementation first.

I don't really think we want to make too many things target-dependent.
This whole estimate is just that: An estimate.  And it's not intended
to be absolute.  It is just a relative measure of cost.  There are IMHO
very few trees that should have target-dependent inline cost estimates.

OK, that mas the thoughts part...

> Thoughts?  Advice?

...on to advice then.

First of all, I think you should handle ARRAY_RANGE_REF and ARRAY_REF
the same.  And you probably should make BIT_FIELD_REF more expensive,
and maybe make its cost depend on which bit you're addressing (you can
see that in operands 1 and 2 of the BIT_FIELD_REF).  Its current cost
of just 1 is probably too optimistic, just like the other cases you are
trying to address with this patch.

Second, you may want to add a target hook to return the cost of target
builtins.  Even builtins that expand to just one instruction are now
counted as 16 insns plus the cost of the arguments list.  This badly
hurts when you use e.g. SSE intrinsics.  It's probably not an issue for
the benchmark you looked at now, but maybe you want to look into it
anyway, while you're at it.

Third,
> (measured at -O3 with inlining disabled):
Then why not just -O2 with inlining disabled?  Now you have enabled
loop unswitching, which is known to sometimes significantly increase
code size.

Fourth, look at _much_ more code than this.  I would especially look at
a lot more C++ code, which is where our inliner heuristics can really
dramatically improve or destroy performance.  See Richard Guenther's
inline heuristics tweaks from earlier this year, in the thread starting
here: http://gcc.gnu.org/ml/gcc-patches/2005-03/msg01936.html.

Finally, you've apparently used grep to find all the places where
PARAM_MAX_INLINE_INSNS_SINGLE and its friends are used, but you hve
missed the ones in opts.c and maybe elsewhere.

Good luck, and thanks for working on this difficult stuff.

Gr.
Steven

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: RFC: improving estimate_num_insns
  2005-07-12 20:07 ` Steven Bosscher
@ 2005-07-13  2:31   ` Josh Conner
  2005-07-13  7:59     ` Steven Bosscher
  0 siblings, 1 reply; 6+ messages in thread
From: Josh Conner @ 2005-07-13  2:31 UTC (permalink / raw)
  To: gcc; +Cc: Steven Bosscher, Jan Hubicka


On Jul 12, 2005, at 1:07 PM, Steven Bosscher wrote:

> You don't say what compiler you used for these measurements.  I  
> suppose
> you used mainline?

Yes, I am working with the mainline.

> I think you should look at a lot more code than this.

OK - I stopped because I was seeing fairly consistent results.   
However, since both of these files were from the same source base, I  
can see how this may not be representative.  I'll try some more  
examples from different source code, including C++.

>> Thinking that there may be room for improvement on this, I tried this
>> same experiment with a couple of adjustments to estimate_num_insns:
>> - Instead of ignoring constants, assign them a weight of 2
>> instructions (one for the constant itself and one to load into  
>> memory)
>>
>
> Why the load into memory, you mean constants that must be loaded from
> the constant pool?  I guess the cost for the constant itself  
> depends on
> whether it is a legitimate constant or not, and how it is loaded, and
> this is all very target specific.  But a cost greater than 0 probably
> makes sense.

Good point - I was thinking of constant pools when I wrote that, even  
though that isn't always the case.

> It would be nice if you retain the comment about constants in the
> existing code somehow.  The cost of constants is not trivial, and you
> should explain your choice better in a comment.

I'm not sure which comment you mean - the one from my email, or the  
one that was originally in the source code?

>> - Instead of ignoring case labels, assign them a weight of 2
>> instructions (to represent the cost of control logic to get there)
>>
>
> This depends on how the switch statement is expanded, e.g. to a binary
> decision tree, or to a table jump, or to something smaller than that.
> So again (like everything else, sadly) this is highly target-specific
> and even context-specific.  I'd think a cost of 2 is too  
> pessimistic in
> most cases.

Do you mean that 2 is too high?  I actually got (slightly) better  
statistical results with a cost of 3, but I had the same reaction  
(that it was too pessimistic), which is why I settled on 2.

> You could look at the code in stmt.c to see how switch statements are
> expanded.  Maybe there is a cheap test you can do to make CASE_LABELs
> free for very small switch statements (i.e. the ones which should  
> never
> hold you back from inlining the function containing them ;-).

I'll look into this.

>> For what it's worth, code size is equal to or smaller for all
>> benchmarks across all platforms.
>>
>
> What about the compile time?

Oh no!  I didn't measure this.  I will have a look.

>> So, here are the open issues I see at this point:
>> 1. It appears that this change to estimate_num_instructions generates
>> a much better estimate of actual code size.  However, the benchmark
>> results are ambiguous.  Is this patch worth considering as-is?
>>
>
> I would say you'd at least have to look into the ppc gzip and eon
> regressions before this is acceptable.  But it is not my decision to
> make, of course.

Makes sense.  I'll see what kind of time I can put into this.

>> 2. Increasing instruction weights causes the instruction-based values
>> (e.g., --param max-inline-insns-auto) to be effectively lower.
>> However, changing these constants/defaults as in the second patch
>> will cause a semantic change to anyone who is setting these values at
>> the command line.  Is that change acceptable?
>>
>
> This has constantly changed from one release to the next since GCC  
> 3.3,
> so I don't think this should be a problem.

Whew...

>> Thoughts?  Advice?
>>
>
> ...on to advice then.
>
> First of all, I think you should handle ARRAY_RANGE_REF and ARRAY_REF
> the same.  And you probably should make BIT_FIELD_REF more expensive,
> and maybe make its cost depend on which bit you're addressing (you can
> see that in operands 1 and 2 of the BIT_FIELD_REF).  Its current cost
> of just 1 is probably too optimistic, just like the other cases you  
> are
> trying to address with this patch.

Great - thanks for the suggestions, I'll try these changes as well  
and see if I get even better correlation.

> Second, you may want to add a target hook to return the cost of target
> builtins.  Even builtins that expand to just one instruction are now
> counted as 16 insns plus the cost of the arguments list.  This badly
> hurts when you use e.g. SSE intrinsics.  It's probably not an issue  
> for
> the benchmark you looked at now, but maybe you want to look into it
> anyway, while you're at it.

OK, I'll look into this.

> Third,
>
>> (measured at -O3 with inlining disabled):
>>
> Then why not just -O2 with inlining disabled?  Now you have enabled
> loop unswitching, which is known to sometimes significantly increase
> code size.

Well, I thought that inlining estimates were most likely to be  
relevant at O3, and so I should include as many other optimizations  
as were likely to be performed alongside inlining, even though this  
may make it more difficult to get accurate estimates.

> Fourth, look at _much_ more code than this.  I would especially  
> look at
> a lot more C++ code, which is where our inliner heuristics can really
> dramatically improve or destroy performance.

OK.

>   See Richard Guenther's
> inline heuristics tweaks from earlier this year, in the thread  
> starting
> here: http://gcc.gnu.org/ml/gcc-patches/2005-03/msg01936.html.

I remember reading this when it was first posted, but looking at it  
again after my experiments I see the additional relevance.   
Especially the decision to assign a function call a cost of 16, which  
I quickly identified as hurting our size estimates, but which also  
improves inlining performance!

> Finally, you've apparently used grep to find all the places where
> PARAM_MAX_INLINE_INSNS_SINGLE and its friends are used, but you hve
> missed the ones in opts.c and maybe elsewhere.

Hmmm - I looked for all of the places where estimate_num_insns was  
called.  I still don't see anything in opts.c -- can you give me a  
little more of a hint?

> Good luck, and thanks for working on this difficult stuff.

Thank you for the helpful comments!

- Josh

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: RFC: improving estimate_num_insns
  2005-07-13  2:31   ` Josh Conner
@ 2005-07-13  7:59     ` Steven Bosscher
  2005-07-13  8:47       ` Richard Guenther
  0 siblings, 1 reply; 6+ messages in thread
From: Steven Bosscher @ 2005-07-13  7:59 UTC (permalink / raw)
  To: Josh Conner; +Cc: gcc, Jan Hubicka

On Wednesday 13 July 2005 04:29, Josh Conner wrote:
> > Finally, you've apparently used grep to find all the places where
> > PARAM_MAX_INLINE_INSNS_SINGLE and its friends are used, but you hve
> > missed the ones in opts.c and maybe elsewhere.
>
> Hmmm - I looked for all of the places where estimate_num_insns was
> called.  I still don't see anything in opts.c -- can you give me a
> little more of a hint?

From opts.c:decode_options()

  if (optimize_size)
    {
      /* Inlining of very small functions usually reduces total size.  */
      set_param_value ("max-inline-insns-single", 5);
      set_param_value ("max-inline-insns-auto", 5);
      flag_inline_functions = 1;

      (...)
    }

You didn't update those two.

Gr.
Steven


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: RFC: improving estimate_num_insns
  2005-07-13  7:59     ` Steven Bosscher
@ 2005-07-13  8:47       ` Richard Guenther
  0 siblings, 0 replies; 6+ messages in thread
From: Richard Guenther @ 2005-07-13  8:47 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: Josh Conner, gcc, Jan Hubicka

[-- Attachment #1: Type: text/plain, Size: 1912 bytes --]

On 7/13/05, Steven Bosscher <stevenb@suse.de> wrote:
> On Wednesday 13 July 2005 04:29, Josh Conner wrote:
> > > Finally, you've apparently used grep to find all the places where
> > > PARAM_MAX_INLINE_INSNS_SINGLE and its friends are used, but you hve
> > > missed the ones in opts.c and maybe elsewhere.
> >
> > Hmmm - I looked for all of the places where estimate_num_insns was
> > called.  I still don't see anything in opts.c -- can you give me a
> > little more of a hint?
> 
> From opts.c:decode_options()
> 
>   if (optimize_size)
>     {
>       /* Inlining of very small functions usually reduces total size.  */
>       set_param_value ("max-inline-insns-single", 5);
>       set_param_value ("max-inline-insns-auto", 5);
>       flag_inline_functions = 1;
> 
>       (...)
>     }
> 
> You didn't update those two.

Note that I'd rather change these to set the inline-unit-growth to zero
for -Os.  I even have a patch for this laying around.  Though it will
no longer apply :)

Also rather than making estimate_num_insns target dependent, we should
maybe conditionalize the results on whether we are optimizing for size
(which is what I think the current metric should do) or for speed (where we
really should drop the costs for all the real multiplication/division stuff).

An important invariant is, that we do not grow the size of a forwarder
function like int foo(int i) { return i; } int bar(int i) { return
foo(i); } whether
foo is inlined or not.  I don't remember if I added testcases to verify this
in the last round of adjustments.

Finally, CSiBE is your friend for all size checks, certain C++ testcases
from PRs, like Geralds testcase, for compile-time (and runtime, if available).
And of course compiling gcc itself.  Remember, it's hard work to get overall
improvements, especially if you consider embedded targets (I didn't ;)).

Richard.

[-- Attachment #2: optimize_size --]
[-- Type: application/octet-stream, Size: 2996 bytes --]

2005-03-02  Richard Guenther  <rguenth@gcc.gnu.org>

	* opts.c (decode_options): Do not fiddle with inlining
	parameters in case of optimizing for size.
	* cgraphunit.c (cgraph_check_inline_limits): If optimizing
	for size make sure we do not grow the unit-size by inlining.
	(cgraph_decide_recursive_inlining): Likewise.

Index: opts.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/opts.c,v
retrieving revision 1.94
diff -c -3 -p -r1.94 opts.c
*** opts.c	24 Feb 2005 09:24:13 -0000	1.94
--- opts.c	2 Mar 2005 13:10:58 -0000
*************** decode_options (unsigned int argc, const
*** 572,580 ****
  
    if (optimize_size)
      {
!       /* Inlining of very small functions usually reduces total size.  */
!       set_param_value ("max-inline-insns-single", 5);
!       set_param_value ("max-inline-insns-auto", 5);
        flag_inline_functions = 1;
  
        /* We want to crossjump as much as possible.  */
--- 572,579 ----
  
    if (optimize_size)
      {
!       /* Inlining of functions reducing size is a good idea regardless
! 	 of them being declared inline.  */
        flag_inline_functions = 1;
  
        /* We want to crossjump as much as possible.  */
Index: cgraphunit.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/cgraphunit.c,v
retrieving revision 1.93.2.1
diff -c -3 -p -r1.93.2.1 cgraphunit.c
*** cgraphunit.c	2 Mar 2005 10:10:30 -0000	1.93.2.1
--- cgraphunit.c	2 Mar 2005 13:10:58 -0000
*************** cgraph_check_inline_limits (struct cgrap
*** 1184,1189 ****
--- 1189,1201 ----
    limit += limit * PARAM_VALUE (PARAM_LARGE_FUNCTION_GROWTH) / 100;
  
    newsize = cgraph_estimate_size_after_inlining (times, to, what);
+   if (optimize_size
+       && newsize > to->global.insns)
+     {
+       if (reason)
+ 	*reason = N_("optimizing for size");
+       return false;
+     }
    if (newsize > PARAM_VALUE (PARAM_LARGE_FUNCTION_INSNS)
        && newsize > limit)
      {
*************** cgraph_decide_recursive_inlining (struct
*** 1279,1284 ****
--- 1291,1297 ----
    struct cgraph_node *master_clone;
    int depth = 0;
    int n = 0;
+   int newsize;
  
    if (DECL_DECLARED_INLINE_P (node->decl))
      {
*************** cgraph_decide_recursive_inlining (struct
*** 1287,1294 ****
      }
  
    /* Make sure that function is small enough to be considered for inlining.  */
!   if (!max_depth
!       || cgraph_estimate_size_after_inlining (1, node, node)  >= limit)
      return;
    lookup_recursive_calls (node, node, &first_call, &last_call);
    if (!first_call)
--- 1300,1309 ----
      }
  
    /* Make sure that function is small enough to be considered for inlining.  */
!   newsize = cgraph_estimate_size_after_inlining (1, node, node);
!   if (! max_depth
!       || newsize  >= limit
!       || (optimize_size && newsize > node->global.insns))
      return;
    lookup_recursive_calls (node, node, &first_call, &last_call);
    if (!first_call)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: RFC: improving estimate_num_insns
@ 2005-07-13  2:44 Richard Kenner
  0 siblings, 0 replies; 6+ messages in thread
From: Richard Kenner @ 2005-07-13  2:44 UTC (permalink / raw)
  To: stevenb; +Cc: gcc

    First of all, I think you should handle ARRAY_RANGE_REF and ARRAY_REF
    the same.

I don't think so.  The latter is a memory reference while the former
is more like a NOP_EXPR: it's basically just creating a new array, which
can then either be indexed or pointed to.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2005-07-13  8:47 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-07-12 17:56 RFC: improving estimate_num_insns Josh Conner
2005-07-12 20:07 ` Steven Bosscher
2005-07-13  2:31   ` Josh Conner
2005-07-13  7:59     ` Steven Bosscher
2005-07-13  8:47       ` Richard Guenther
2005-07-13  2:44 Richard Kenner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).