From mboxrd@z Thu Jan 1 00:00:00 1970
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Subject: Re: [PATCH 2/3] Add profiling support for IVOPTS
From: "Bin.Cheng"
To: Martin Liška
Cc: gcc-patches List
Date: Tue, 24 May 2016 10:19:00 -0000
In-Reply-To: <573D9549.4070700@suse.cz>
References: <00578bce6fdccccacdd740ff0067ccde46f98f51.1461931011.git.mliska@suse.cz>
 <5739D16A.9020907@suse.cz> <573D9549.4070700@suse.cz>
X-SW-Source: 2016-05/txt/msg01877.txt.bz2

On Thu, May 19, 2016 at 11:28 AM, Martin Liška wrote:
> On 05/17/2016 12:27 AM, Bin.Cheng wrote:
>>> As profile-guided optimization can provide very useful information
>>> about basic block frequencies within a loop, the following patch set
>>> leverages that information.  It speeds up a single benchmark from the
>>> upcoming SPECv6 suite by 20% (-O2 -fprofile-generate/-fprofile-use),
>>> and I think it can also improve others (currently measuring numbers
>>> for PGO).
>> Hi,
>> Is this 20% improvement from this patch alone, or does it include the
>> existing PGO improvement?
>
> Hello.
>
> It shows that current trunk (compared to the GCC 6 branch)
> has significantly improved the benchmark with PGO.
> Currently, my patch improves PGO by ~5% w/ -O2, but our plan is to
> improve the static profile, which would utilize the patch.
>
>>
>> For the patch:
>>> +
>>> +  /* Return true if the frequency has a valid value.  */
>>> +  bool has_frequency ();
>>> +
>>>    /* Return infinite comp_cost.  */
>>>    static comp_cost get_infinite ();
>>>
>>> @@ -249,6 +272,9 @@ private:
>>>                            complexity field should be larger for more
>>>                            complex expressions and addressing modes).  */
>>>    int m_scratch;        /* Scratch used during cost computation.  */
>>> +  sreal m_frequency;    /* Frequency of the basic block this comp_cost
>>> +                           belongs to.  */
>>> +  sreal m_cost_scaled;  /* Scaled runtime cost.  */
>> IMHO we shouldn't embed the frequency in comp_cost, nor record the
>> scaled cost in it.  I would suggest we compute the cost and amortize
>> it over the frequency in get_computation_cost_at before storing it
>> into comp_cost.
That is, once a cost is computed and stored in
>> comp_cost, it is already scaled with frequency.  One argument is that
>> frequency info is only valid for the use's statement/basic_block; it
>> really doesn't have a clear meaning in the comp_cost structure.
>> Outside of get_computation_cost_at, I found it hard to
>> understand/remember what comp_cost.m_frequency means and where it
>> came from.  There are other reasons embedded in the comments below.
>>>
>>>
>>>  comp_cost&
>>> @@ -257,6 +283,8 @@ comp_cost::operator= (const comp_cost& other)
>>>    m_cost = other.m_cost;
>>>    m_complexity = other.m_complexity;
>>>    m_scratch = other.m_scratch;
>>> +  m_frequency = other.m_frequency;
>>> +  m_cost_scaled = other.m_cost_scaled;
>>>
>>>    return *this;
>>>  }
>>> @@ -275,6 +303,7 @@ operator+ (comp_cost cost1, comp_cost cost2)
>>>
>>>    cost1.m_cost += cost2.m_cost;
>>>    cost1.m_complexity += cost2.m_complexity;
>>> +  cost1.m_cost_scaled += cost2.m_cost_scaled;
>>>
>>>    return cost1;
>>>  }
>>> @@ -290,6 +319,8 @@ comp_cost
>>>  comp_cost::operator+= (HOST_WIDE_INT c)
>> This and the operators below need to check for infinite cost first
>> and return immediately.
>>>  {
>>>    this->m_cost += c;
>>> +  if (has_frequency ())
>>> +    this->m_cost_scaled += scale_cost (c);
>>>
>>>    return *this;
>>>  }
>>> @@ -5047,18 +5128,21 @@ get_computation_cost_at (struct ivopts_data *data,
>>>       (symbol/var1/const parts may be omitted).  If we are looking for an
>>>       address, find the cost of addressing this.  */
>>>    if (address_p)
>>> -    return cost + get_address_cost (symbol_present, var_present,
>>> -                                    offset, ratio, cstepi,
>>> -                                    mem_mode,
>>> -                                    TYPE_ADDR_SPACE (TREE_TYPE (utype)),
>>> -                                    speed, stmt_is_after_inc, can_autoinc);
>>> +    {
>>> +      cost += get_address_cost (symbol_present, var_present,
>>> +                                offset, ratio, cstepi,
>>> +                                mem_mode,
>>> +                                TYPE_ADDR_SPACE (TREE_TYPE (utype)),
>>> +                                speed, stmt_is_after_inc, can_autoinc);
>>> +      goto ret;
>>> +    }
>>>
>>>    /* Otherwise estimate the costs for computing the expression.  */
>>>    if (!symbol_present && !var_present && !offset)
>>>      {
>>>        if (ratio != 1)
>>>          cost += mult_by_coeff_cost (ratio, TYPE_MODE (ctype), speed);
>>> -      return cost;
>>> +      goto ret;
>>>      }
>>>
>>>    /* Symbol + offset should be compile-time computable so consider that they
>>> @@ -5077,7 +5161,8 @@ get_computation_cost_at (struct ivopts_data *data,
>>>    aratio = ratio > 0 ? ratio : -ratio;
>>>    if (aratio != 1)
>>>      cost += mult_by_coeff_cost (aratio, TYPE_MODE (ctype), speed);
>>> -  return cost;
>>> +
>>> +  goto ret;
>>>
>>>  fallback:
>>>    if (can_autoinc)
>>> @@ -5093,8 +5178,13 @@ fallback:
>>>      if (address_p)
>>>        comp = build_simple_mem_ref (comp);
>>>
>>> -    return comp_cost (computation_cost (comp, speed), 0);
>>> +    cost = comp_cost (computation_cost (comp, speed), 0);
>>>    }
>>> +
>>> +ret:
>>> +  cost.calculate_scaled_cost (at->bb->frequency,
>>> +                              data->current_loop->header->frequency);
>> Here the cost consists of two parts.  One is for the loop invariant
>> computation; we amortize it against avg_loop_niter and record register
>> pressure for it (either via invariant variables or invariant
>> expressions).  The other is the loop variant part.  For the first
>> part, we should not scale it by frequency, since we have already
>> assumed it would be hoisted out of the loop.  No matter where the use
>> is, a hoisted loop invariant has the same frequency as the loop
>> header.  This is the second reason I want to factor frequency out of
>> comp_cost.
It's easier to scale by frequency only when it's
>> necessary.
>>
>>> +  return cost;
>>>  }
>>>
>>>  /* Determines the cost of the computation by that USE is expressed
>>> @@ -5922,16 +6012,19 @@ determine_group_iv_costs (struct ivopts_data *data)
>>>        group = data->vgroups[i];
>>>
>>>        fprintf (dump_file, "Group %d:\n", i);
>>> -      fprintf (dump_file, "  cand\tcost\tcompl.\tinv.ex.\tdepends on\n");
>>> +      fprintf (dump_file, "  cand\tcost\tscaled\tfreq.\tcompl.\tinv.ex."
>>> +               "\tdepends on\n");
>>>        for (j = 0; j < group->n_map_members; j++)
>>>          {
>>>            if (!group->cost_map[j].cand
>>>                || group->cost_map[j].cost.infinite_cost_p ())
>>>              continue;
>>>
>>> -          fprintf (dump_file, "  %d\t%d\t%d\t",
>>> +          fprintf (dump_file, "  %d\t%d\t%2.2f\t%2.2f\t%d\t",
>>>                     group->cost_map[j].cand->id,
>>>                     group->cost_map[j].cost.get_cost (),
>>> +                   group->cost_map[j].cost.get_cost_scaled (),
>>> +                   group->cost_map[j].cost.get_frequency (),
>>>                     group->cost_map[j].cost.get_complexity ());
>>>            if (group->cost_map[j].inv_expr != NULL)
>>>              fprintf (dump_file, "%d\t",
>>> @@ -5948,6 +6041,19 @@ determine_group_iv_costs (struct ivopts_data *data)
>>>          }
>>>        fprintf (dump_file, "\n");
>>>      }
>>> +
>>> +  for (i = 0; i < data->vgroups.length (); i++)
>>> +    {
>>> +      group = data->vgroups[i];
>>> +      for (j = 0; j < group->n_map_members; j++)
>>> +        {
>>> +          if (!group->cost_map[j].cand
>>> +              || group->cost_map[j].cost.infinite_cost_p ())
>>> +            continue;
>>> +
>>> +          group->cost_map[j].cost.propagate_scaled_cost ();
>>> +        }
>>> +    }
>> This is wrong.  m_frequency and m_cost_scaled are initialized to
>> sreal (0) by default and are never changed later for a conditional
>> iv_use.  As a matter of fact, the costs computed for all conditional
>> iv_uses are wrong (the value is 0).  This makes the observed
>> improvement not that promising.  Considering that code generation is
>> very sensitive to cost computation, it may just be hitting some
>> special cases.  Either way, we need more work/investigation on the
>> impact of this patch.
>>
>> Again, I would suggest we factor frequency out of comp_cost and only
>> scale the cost in place when we compute the cost for each use.  Then
>> we can measure the impact on code generation.
>>
>> Thanks,
>> bin
>>
>
> All remarks were applied in the third version of the patch.  Together
> with the previous patch, it survives bootstrap and regression tests on
> x86_64-linux-gnu.  I'm going to re-test the patch on SPEC benchmarks.
> +
> +ret:
> +  /* Scale (multiply) the computed cost (except the scratch part, which
> +     should be hoisted out of the loop) by at->bb->frequency /
> +     header->frequency, which makes the expected cost more accurate.  */
> +  int loop_freq = data->current_loop->header->frequency;
> +  int bb_freq = at->bb->frequency;
> +  if (loop_freq != 0)
> +    {
> +      gcc_assert (cost.scratch <= cost.cost);
> +      int scaled_cost
> +        = cost.scratch + (cost.cost - cost.scratch) * bb_freq / loop_freq;
> +
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +        fprintf (dump_file, "Scaling iv_use based on cand %d "
> +                 "by %2.2f: %d (scratch: %d) -> %d (%d/%d)\n",
> +                 cand->id, 1.0f * bb_freq / loop_freq, cost.cost,
> +                 cost.scratch, scaled_cost, bb_freq, loop_freq);
> +
> +      cost.cost = scaled_cost;
> +    }
> +
> +  return cost;

Hi,
Could you please factor this out as a function and remove the goto
statements?  Okay with this change if there is no fallout in the
benchmarks you run.
Thanks,
bin

>
> Martin
>