From mboxrd@z Thu Jan 1 00:00:00 1970
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Subject: Re: [PATCH 2/3] Add profiling support for IVOPTS
From: "Bin.Cheng"
To: Martin Liška
Cc: gcc-patches List
Date: Tue, 24 May 2016 10:19:00 -0000
In-Reply-To: <573D9549.4070700@suse.cz>
References: <00578bce6fdccccacdd740ff0067ccde46f98f51.1461931011.git.mliska@suse.cz>
 <5739D16A.9020907@suse.cz> <573D9549.4070700@suse.cz>
X-SW-Source: 2016-05/txt/msg01877.txt.bz2

On Thu, May 19, 2016 at 11:28 AM, Martin Liška wrote:
> On 05/17/2016 12:27 AM, Bin.Cheng wrote:
>>> As profile-guided optimization can provide very useful information
>>> about basic block frequencies within a loop, the following patch set
>>> leverages that information.  It speeds up a single benchmark from the
>>> upcoming SPECv6 suite by 20% (-O2 -fprofile-generate/-fprofile-use),
>>> and I think it can also improve others (currently measuring numbers
>>> for PGO).
>> Hi,
>> Is this 20% improvement from this patch alone, or does it include the
>> existing PGO improvement?
>
> Hello.
>
> It shows that current trunk (compared to the GCC 6 branch)
> has significantly improved the benchmark with PGO.
> Currently, my patch improves PGO by ~5% w/ -O2, but our plan is to
> improve the static profile, which would utilize the patch.
>
>>
>> For the patch:
>>> +
>>> +  /* Return true if the frequency has a valid value.  */
>>> +  bool has_frequency ();
>>> +
>>>    /* Return infinite comp_cost.  */
>>>    static comp_cost get_infinite ();
>>>
>>> @@ -249,6 +272,9 @@ private:
>>>                            complexity field should be larger for more
>>>                            complex expressions and addressing modes).  */
>>>    int m_scratch;        /* Scratch used during cost computation.  */
>>> +  sreal m_frequency;    /* Frequency of the basic block this comp_cost
>>> +                           belongs to.  */
>>> +  sreal m_cost_scaled;  /* Scaled runtime cost.  */
>> IMHO we shouldn't embed the frequency in comp_cost, nor record the
>> scaled cost in it.  I would suggest we compute the cost and amortize
>> it over the frequency in get_computation_cost_at before storing it
>> into comp_cost.
That is, once a cost is computed and stored in
>> comp_cost, it is already scaled with frequency.  One argument is that
>> frequency info is only valid for the use's statement/basic_block; it
>> really doesn't have a clear meaning in the comp_cost structure.
>> Outside of get_computation_cost_at, I found it hard to
>> understand/remember what comp_cost.m_frequency means and where it
>> came from.  There are other reasons embedded in the comments below.
>>>
>>>
>>>  comp_cost&
>>> @@ -257,6 +283,8 @@ comp_cost::operator= (const comp_cost& other)
>>>    m_cost = other.m_cost;
>>>    m_complexity = other.m_complexity;
>>>    m_scratch = other.m_scratch;
>>> +  m_frequency = other.m_frequency;
>>> +  m_cost_scaled = other.m_cost_scaled;
>>>
>>>    return *this;
>>>  }
>>> @@ -275,6 +303,7 @@ operator+ (comp_cost cost1, comp_cost cost2)
>>>
>>>    cost1.m_cost += cost2.m_cost;
>>>    cost1.m_complexity += cost2.m_complexity;
>>> +  cost1.m_cost_scaled += cost2.m_cost_scaled;
>>>
>>>    return cost1;
>>>  }
>>> @@ -290,6 +319,8 @@ comp_cost
>>>  comp_cost::operator+= (HOST_WIDE_INT c)
>> This and the operators below need to check for infinite cost first
>> and return immediately.
>>>  {
>>>    this->m_cost += c;
>>> +  if (has_frequency ())
>>> +    this->m_cost_scaled += scale_cost (c);
>>>
>>>    return *this;
>>>  }
>>> @@ -5047,18 +5128,21 @@ get_computation_cost_at (struct ivopts_data *data,
>>>       (symbol/var1/const parts may be omitted).  If we are looking for an
>>>       address, find the cost of addressing this.  */
>>>    if (address_p)
>>> -    return cost + get_address_cost (symbol_present, var_present,
>>> -                                    offset, ratio, cstepi,
>>> -                                    mem_mode,
>>> -                                    TYPE_ADDR_SPACE (TREE_TYPE (utype)),
>>> -                                    speed, stmt_is_after_inc, can_autoinc);
>>> +    {
>>> +      cost += get_address_cost (symbol_present, var_present,
>>> +                                offset, ratio, cstepi,
>>> +                                mem_mode,
>>> +                                TYPE_ADDR_SPACE (TREE_TYPE (utype)),
>>> +                                speed, stmt_is_after_inc, can_autoinc);
>>> +      goto ret;
>>> +    }
>>>
>>>    /* Otherwise estimate the costs for computing the expression.  */
>>>    if (!symbol_present && !var_present && !offset)
>>>      {
>>>        if (ratio != 1)
>>>          cost += mult_by_coeff_cost (ratio, TYPE_MODE (ctype), speed);
>>> -      return cost;
>>> +      goto ret;
>>>      }
>>>
>>>    /* Symbol + offset should be compile-time computable so consider that they
>>> @@ -5077,7 +5161,8 @@ get_computation_cost_at (struct ivopts_data *data,
>>>    aratio = ratio > 0 ? ratio : -ratio;
>>>    if (aratio != 1)
>>>      cost += mult_by_coeff_cost (aratio, TYPE_MODE (ctype), speed);
>>> -  return cost;
>>> +
>>> +  goto ret;
>>>
>>>  fallback:
>>>    if (can_autoinc)
>>> @@ -5093,8 +5178,13 @@ fallback:
>>>      if (address_p)
>>>        comp = build_simple_mem_ref (comp);
>>>
>>> -    return comp_cost (computation_cost (comp, speed), 0);
>>> +    cost = comp_cost (computation_cost (comp, speed), 0);
>>>    }
>>> +
>>> +ret:
>>> +  cost.calculate_scaled_cost (at->bb->frequency,
>>> +                              data->current_loop->header->frequency);
>> Here the cost consists of two parts.  One is for the loop invariant
>> computation; we amortize it against avg_loop_niter and record register
>> pressure for it (either via invariant variables or invariant
>> expressions).  The other is the loop variant part.  For the first
>> part, we should not scale it by frequency, since we have already
>> assumed it would be hoisted out of the loop.  No matter where the use
>> is, a hoisted loop invariant has the same frequency as the loop
>> header.  This is the second reason I want to factor frequency out of
>> comp_cost.
It's easier to scale by frequency only when it's
>> necessary.
>>
>>> +  return cost;
>>>  }
>>>
>>>  /* Determines the cost of the computation by that USE is expressed
>>> @@ -5922,16 +6012,19 @@ determine_group_iv_costs (struct ivopts_data *data)
>>>        group = data->vgroups[i];
>>>
>>>        fprintf (dump_file, "Group %d:\n", i);
>>> -      fprintf (dump_file, "  cand\tcost\tcompl.\tinv.ex.\tdepends on\n");
>>> +      fprintf (dump_file, "  cand\tcost\tscaled\tfreq.\tcompl.\tinv.ex."
>>> +               "\tdepends on\n");
>>>        for (j = 0; j < group->n_map_members; j++)
>>>          {
>>>            if (!group->cost_map[j].cand
>>>                || group->cost_map[j].cost.infinite_cost_p ())
>>>              continue;
>>>
>>> -          fprintf (dump_file, "  %d\t%d\t%d\t",
>>> +          fprintf (dump_file, "  %d\t%d\t%2.2f\t%2.2f\t%d\t",
>>>                     group->cost_map[j].cand->id,
>>>                     group->cost_map[j].cost.get_cost (),
>>> +                   group->cost_map[j].cost.get_cost_scaled (),
>>> +                   group->cost_map[j].cost.get_frequency (),
>>>                     group->cost_map[j].cost.get_complexity ());
>>>            if (group->cost_map[j].inv_expr != NULL)
>>>              fprintf (dump_file, "%d\t",
>>> @@ -5948,6 +6041,19 @@ determine_group_iv_costs (struct ivopts_data *data)
>>>          }
>>>        fprintf (dump_file, "\n");
>>>      }
>>> +
>>> +  for (i = 0; i < data->vgroups.length (); i++)
>>> +    {
>>> +      group = data->vgroups[i];
>>> +      for (j = 0; j < group->n_map_members; j++)
>>> +        {
>>> +          if (!group->cost_map[j].cand
>>> +              || group->cost_map[j].cost.infinite_cost_p ())
>>> +            continue;
>>> +
>>> +          group->cost_map[j].cost.propagate_scaled_cost ();
>>> +        }
>>> +    }
>> This is wrong.  m_frequency and m_cost_scaled are initialized to
>> sreal (0) by default and are never changed later for a conditional
>> iv_use.  As a matter of fact, the costs computed for all conditional
>> iv_uses are wrong (the value is 0).  This makes the observed
>> improvement not that promising.  Considering that code generation is
>> very sensitive to cost computation, it may just be hitting some
>> special cases.  Either way, we need more work/investigation on the
>> impact of this patch.
>>
>> Again, I would suggest we factor frequency out of comp_cost and only
>> scale the cost in place when we compute the cost for each use.  Then
>> we can measure the impact on code generation.
>>
>> Thanks,
>> bin
>>
>
> All remarks were applied in the third version of the patch.  Together
> with the previous patch, it survives bootstrap and regression tests on
> x86_64-linux-gnu.  I'm going to re-test the patch on SPEC benchmarks.
> +
> +ret:
> +  /* Scale (multiply) the computed cost (except the scratch part, which
> +     should be hoisted out of the loop) by at->bb->frequency /
> +     header->frequency, which makes the expected cost more accurate.  */
> +  int loop_freq = data->current_loop->header->frequency;
> +  int bb_freq = at->bb->frequency;
> +  if (loop_freq != 0)
> +    {
> +      gcc_assert (cost.scratch <= cost.cost);
> +      int scaled_cost
> +        = cost.scratch + (cost.cost - cost.scratch) * bb_freq / loop_freq;
> +
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +        fprintf (dump_file, "Scaling iv_use based on cand %d "
> +                 "by %2.2f: %d (scratch: %d) -> %d (%d/%d)\n",
> +                 cand->id, 1.0f * bb_freq / loop_freq, cost.cost,
> +                 cost.scratch, scaled_cost, bb_freq, loop_freq);
> +
> +      cost.cost = scaled_cost;
> +    }
> +
> +  return cost;

Hi,
Could you please factor this out as a function and remove the goto
statements?  Okay with this change if there is no fallout in the
benchmarks you run.
Thanks,
bin

>
> Martin
>