public inbox for gcc-patches@gcc.gnu.org
From: "Bin.Cheng" <amker.cheng@gmail.com>
To: "Martin Liška" <mliska@suse.cz>
Cc: gcc-patches List <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH 2/3] Add profiling support for IVOPTS
Date: Tue, 24 May 2016 10:19:00 -0000	[thread overview]
Message-ID: <CAHFci2-wYWJ=gsZ0THYoD1M28DCCBc362-DB0CSBSzFj7EjNhQ@mail.gmail.com> (raw)
In-Reply-To: <573D9549.4070700@suse.cz>

On Thu, May 19, 2016 at 11:28 AM, Martin Liška <mliska@suse.cz> wrote:
> On 05/17/2016 12:27 AM, Bin.Cheng wrote:
>>> As profile-guided optimization can provide very useful information
>>> about basic block frequencies within a loop, following patch set leverages
>>> that information. It speeds up a single benchmark from upcoming SPECv6
>>> suite by 20% (-O2 -fprofile-generate/-fprofile-use), and I think it can
>>> also improve others (currently measuring numbers for PGO).
>> Hi,
>> Is this 20% improvement from this patch, or does it include the
>> existing PGO's improvement?
>
> Hello.
>
> It shows that current trunk (compared to GCC 6 branch)
> has significantly improved the benchmark with PGO.
> Currently, my patch improves PGO by ~5% w/ -O2, but our plan is to
> improve the static profile so that it can utilize the patch.
>
>>
>> For the patch:
>>> +
>>> +  /* Return true if the frequency has a valid value.  */
>>> +  bool has_frequency ();
>>> +
>>>    /* Return infinite comp_cost.  */
>>>    static comp_cost get_infinite ();
>>>
>>> @@ -249,6 +272,9 @@ private:
>>>       complexity field should be larger for more
>>>       complex expressions and addressing modes).  */
>>>    int m_scratch;  /* Scratch used during cost computation.  */
>>> +  sreal m_frequency;  /* Frequency of the basic block this comp_cost
>>> +     belongs to.  */
>>> +  sreal m_cost_scaled;  /* Scaled runtime cost.  */
>> IMHO we shouldn't embed frequency in comp_cost, nor record scaled
>> cost in it.  I would suggest we compute the cost and amortize it
>> over frequency in get_computation_cost_at before storing it into
>> comp_cost.  That is, once a cost is computed/stored in comp_cost, it is
>> already scaled with frequency.  One argument is that frequency info is
>> only valid for the use's statement/basic_block; it really doesn't have
>> a clear meaning in the comp_cost structure.  Outside of function
>> get_computation_cost_at, I found it hard to understand/remember
>> what comp_cost.m_frequency means and where it came from.
>> There are other reasons embedded in the comments below.
>>>
>>>
>>>  comp_cost&
>>> @@ -257,6 +283,8 @@ comp_cost::operator= (const comp_cost& other)
>>>    m_cost = other.m_cost;
>>>    m_complexity = other.m_complexity;
>>>    m_scratch = other.m_scratch;
>>> +  m_frequency = other.m_frequency;
>>> +  m_cost_scaled = other.m_cost_scaled;
>>>
>>>    return *this;
>>>  }
>>> @@ -275,6 +303,7 @@ operator+ (comp_cost cost1, comp_cost cost2)
>>>
>>>    cost1.m_cost += cost2.m_cost;
>>>    cost1.m_complexity += cost2.m_complexity;
>>> +  cost1.m_cost_scaled += cost2.m_cost_scaled;
>>>
>>>    return cost1;
>>>  }
>>> @@ -290,6 +319,8 @@ comp_cost
>>>  comp_cost::operator+= (HOST_WIDE_INT c)
>> This and below operators need check for infinite cost first and return
>> immediately.
>>>  {
>>>    this->m_cost += c;
>>> +  if (has_frequency ())
>>> +    this->m_cost_scaled += scale_cost (c);
>>>
>>>    return *this;
>>>  }
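The early-out requested above can be sketched as a self-contained model; the struct below is illustrative only (the real comp_cost lives in tree-ssa-loop-ivopts.c and represents infinite cost differently):

```cpp
#include <cassert>
#include <climits>

// Minimal model of the requested change: arithmetic on an infinite cost
// must return immediately instead of touching (and possibly overflowing)
// the stored value.  INT_MAX stands in for GCC's infinite-cost sentinel.
struct comp_cost
{
  int m_cost;

  bool infinite_cost_p () const { return m_cost == INT_MAX; }

  comp_cost &operator+= (int c)
  {
    if (infinite_cost_p ())   /* Check for infinite cost first.  */
      return *this;
    m_cost += c;
    return *this;
  }
};
```

With this guard, adding any finite increment to an infinite cost leaves it infinite, which is what the subsequent cost comparisons rely on.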
>>> @@ -5047,18 +5128,21 @@ get_computation_cost_at (struct ivopts_data *data,
>>>       (symbol/var1/const parts may be omitted).  If we are looking for an
>>>       address, find the cost of addressing this.  */
>>>    if (address_p)
>>> -    return cost + get_address_cost (symbol_present, var_present,
>>> -    offset, ratio, cstepi,
>>> -    mem_mode,
>>> -    TYPE_ADDR_SPACE (TREE_TYPE (utype)),
>>> -    speed, stmt_is_after_inc, can_autoinc);
>>> +    {
>>> +      cost += get_address_cost (symbol_present, var_present,
>>> + offset, ratio, cstepi,
>>> + mem_mode,
>>> + TYPE_ADDR_SPACE (TREE_TYPE (utype)),
>>> + speed, stmt_is_after_inc, can_autoinc);
>>> +      goto ret;
>>> +    }
>>>
>>>    /* Otherwise estimate the costs for computing the expression.  */
>>>    if (!symbol_present && !var_present && !offset)
>>>      {
>>>        if (ratio != 1)
>>>   cost += mult_by_coeff_cost (ratio, TYPE_MODE (ctype), speed);
>>> -      return cost;
>>> +      goto ret;
>>>      }
>>>
>>>    /* Symbol + offset should be compile-time computable so consider that they
>>> @@ -5077,7 +5161,8 @@ get_computation_cost_at (struct ivopts_data *data,
>>>    aratio = ratio > 0 ? ratio : -ratio;
>>>    if (aratio != 1)
>>>      cost += mult_by_coeff_cost (aratio, TYPE_MODE (ctype), speed);
>>> -  return cost;
>>> +
>>> +  goto ret;
>>>
>>>  fallback:
>>>    if (can_autoinc)
>>> @@ -5093,8 +5178,13 @@ fallback:
>>>      if (address_p)
>>>        comp = build_simple_mem_ref (comp);
>>>
>>> -    return comp_cost (computation_cost (comp, speed), 0);
>>> +    cost = comp_cost (computation_cost (comp, speed), 0);
>>>    }
>>> +
>>> +ret:
>>> +  cost.calculate_scaled_cost (at->bb->frequency,
>>> +      data->current_loop->header->frequency);
>> Here cost consists of two parts.  One is for the loop-invariant
>> computation; we amortize it against avg_loop_niter and record register
>> pressure (either via invariant variables or invariant expressions) for
>> it.  The other is the loop-variant part.  The first part we should
>> not scale using frequency, since we have already assumed it would
>> be hoisted out of the loop.  No matter where the use is, a hoisted loop
>> invariant has the same frequency as the loop header.  This is the second
>> reason I want to factor frequency out of comp_cost.  It's easier to
>> scale with frequency only when it's necessary.
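The split described above can be modelled outside GCC with plain ints standing in for the frequency representation (the function name and types are illustrative, not the actual comp_cost API):

```cpp
#include <cassert>

// Model of the proposed scaling: the loop-invariant ("scratch") part is
// assumed hoisted out of the loop, so it effectively runs at the loop
// header's frequency and is left alone; only the loop-variant remainder
// is scaled by bb_freq / loop_freq, the frequency of the use's basic
// block relative to the loop header.
static int
scale_variant_cost (int cost, int scratch, int bb_freq, int loop_freq)
{
  if (loop_freq == 0)   /* No profile information: keep the unscaled cost.  */
    return cost;
  assert (scratch <= cost);
  return scratch + (cost - scratch) * bb_freq / loop_freq;
}
```

For a use in a branch taken half the time (bb_freq = loop_freq / 2), a cost of 10 with a scratch part of 4 becomes 4 + 6 / 2 = 7, while the scratch part stays unscaled.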
>>
>>> +  return cost;
>>>  }
>>>
>>>  /* Determines the cost of the computation by that USE is expressed
>>> @@ -5922,16 +6012,19 @@ determine_group_iv_costs (struct ivopts_data *data)
>>>    group = data->vgroups[i];
>>>
>>>    fprintf (dump_file, "Group %d:\n", i);
>>> -  fprintf (dump_file, "  cand\tcost\tcompl.\tinv.ex.\tdepends on\n");
>>> +  fprintf (dump_file, "  cand\tcost\tscaled\tfreq.\tcompl.\tinv.ex."
>>> +   "\tdepends on\n");
>>>    for (j = 0; j < group->n_map_members; j++)
>>>      {
>>>        if (!group->cost_map[j].cand
>>>    || group->cost_map[j].cost.infinite_cost_p ())
>>>   continue;
>>>
>>> -      fprintf (dump_file, "  %d\t%d\t%d\t",
>>> +      fprintf (dump_file, "  %d\t%d\t%2.2f\t%2.2f\t%d\t",
>>>         group->cost_map[j].cand->id,
>>>         group->cost_map[j].cost.get_cost (),
>>> +       group->cost_map[j].cost.get_cost_scaled (),
>>> +       group->cost_map[j].cost.get_frequency (),
>>>         group->cost_map[j].cost.get_complexity ());
>>>        if (group->cost_map[j].inv_expr != NULL)
>>>   fprintf (dump_file, "%d\t",
>>> @@ -5948,6 +6041,19 @@ determine_group_iv_costs (struct ivopts_data *data)
>>>   }
>>>        fprintf (dump_file, "\n");
>>>      }
>>> +
>>> +  for (i = 0; i < data->vgroups.length (); i++)
>>> +    {
>>> +      group = data->vgroups[i];
>>> +      for (j = 0; j < group->n_map_members; j++)
>>> + {
>>> +  if (!group->cost_map[j].cand
>>> +      || group->cost_map[j].cost.infinite_cost_p ())
>>> +    continue;
>>> +
>>> +  group->cost_map[j].cost.propagate_scaled_cost ();
>>> + }
>>> +    }
>> This is wrong.  m_frequency and m_cost_scaled are initialized to
>> sreal(0) by default, and are never changed later for conditional
>> iv_uses.  As a matter of fact, costs computed for all conditional
>> iv_uses are wrong (the value is 0).  This makes the observed improvement
>> not that promising.  Considering code generation is very sensitive to
>> cost computation, it may just have hit some special cases.  Either way we
>> need more work/investigation on the impact of this patch.
>>
>> Again, I would suggest we factor frequency out of comp_cost and
>> only scale the cost in place when we compute the cost for each use.  Then
>> we can measure the impact on code generation.
>>
>> Thanks,
>> bin
>>
>
> All remarks were applied in the third version of the patch. Together with the
> previous patch, it survives bootstrap and regression tests on x86_64-linux-gnu.
> I'm going to re-test the patch on SPEC benchmarks.
> +
> +ret:
> +  /* Scale (multiply) the computed cost (except the scratch part, which is
> +     expected to be hoisted out of the loop) by at->bb->frequency
> +     / loop header->frequency, which makes the expected cost more accurate.  */
> +   int loop_freq = data->current_loop->header->frequency;
> +   int bb_freq = at->bb->frequency;
> +   if (loop_freq != 0)
> +     {
> +       gcc_assert (cost.scratch <= cost.cost);
> +       int scaled_cost
> +     = cost.scratch + (cost.cost - cost.scratch) * bb_freq / loop_freq;
> +
> +       if (dump_file && (dump_flags & TDF_DETAILS))
> +     fprintf (dump_file, "Scaling iv_use based on cand %d "
> +          "by %2.2f: %d (scratch: %d) -> %d (%d/%d)\n",
> +          cand->id, 1.0f * bb_freq / loop_freq, cost.cost,
> +          cost.scratch, scaled_cost, bb_freq, loop_freq);
> +
> +       cost.cost = scaled_cost;
> +     }
> +
> +   return cost;
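The `ret:` block above can be pulled out into a helper so that each early exit returns directly instead of jumping to a shared label. A minimal self-contained sketch (the struct fields and function name are illustrative, not GCC's actual API):

```cpp
#include <cassert>

// Illustrative stand-in for GCC's comp_cost: a runtime cost plus the
// "scratch" part expected to be hoisted out of the loop.
struct comp_cost
{
  int cost;
  int scratch;
};

// Helper replacing the shared "ret:" label: scale the non-scratch part
// of COST by bb_freq / loop_freq and return the result, so each exit
// point of get_computation_cost_at can simply write
//   return get_scaled_cost (cost, bb_freq, loop_freq);
static comp_cost
get_scaled_cost (comp_cost cost, int bb_freq, int loop_freq)
{
  if (loop_freq != 0)
    {
      assert (cost.scratch <= cost.cost);
      cost.cost = cost.scratch
		  + (cost.cost - cost.scratch) * bb_freq / loop_freq;
    }
  return cost;
}
```

Routing every exit through one helper also removes the need for the `goto` statements while keeping the scaling logic in a single place.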
Hi,
Could you please factor this out into a function and remove the goto
statements?  Okay with this change, assuming no fallout in the benchmarks
you run.

Thanks,
bin
>
> Martin
>

  parent reply	other threads:[~2016-05-24 10:19 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-04-29 11:58 [PATCH 0/3] IVOPTS: support profiling marxin
2016-04-29 11:58 ` [PATCH 2/3] Add profiling support for IVOPTS marxin
2016-05-16 13:56   ` Martin Liška
2016-05-16 22:27     ` Bin.Cheng
2016-05-19 10:28       ` Martin Liška
2016-05-20 10:04         ` Bin.Cheng
2016-05-24 10:19         ` Bin.Cheng [this message]
2016-05-24 10:33           ` Bin.Cheng
2016-05-24 11:01           ` Bin.Cheng
2016-05-30 19:51           ` Martin Liška
2016-04-29 11:58 ` [PATCH 1/3] Encapsulate comp_cost within a class with methods marxin
2016-05-16 10:14   ` Bin.Cheng
2016-05-16 13:55     ` Martin Liška
2016-05-19 10:23       ` Martin Liška
2016-05-19 11:24         ` Bin.Cheng
2016-05-26 21:02           ` Martin Liška
2016-04-29 11:58 ` [PATCH 3/3] Enhance dumps of IVOPTS marxin
2016-05-06  9:19   ` Martin Liška
2016-05-09  9:47     ` Richard Biener
2016-05-10 13:16       ` Bin.Cheng
2016-05-11 14:18         ` Martin Liška
2016-05-12 12:14         ` Martin Liška
2016-05-12 13:51           ` Bin.Cheng
2016-05-12 16:42             ` Martin Liška
2016-05-13  9:43               ` Bin.Cheng
2016-05-13 10:44                 ` Martin Liška
2016-05-13 12:12                   ` H.J. Lu
2016-05-13 12:39                     ` Martin Liška
2016-05-13 12:44                       ` Kyrill Tkachov
2016-05-13 12:47                         ` Richard Biener
2016-05-13 12:51                           ` Martin Liška
2016-05-13 14:17                             ` H.J. Lu
2016-05-13 14:46                               ` H.J. Lu
2016-05-03  9:28 ` [PATCH 0/3] IVOPTS: support profiling Bin.Cheng
