* [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
From: Fang, Changpeng @ 2010-12-13 21:57 UTC
To: gcc-patches
[-- Attachment #1: Type: text/plain, Size: 1233 bytes --]
Hi,
The attached patch adds the logic to disable certain loop optimizations on pre-/post-loops.
Some loop optimizations (auto-vectorization, loop unrolling, etc) may peel a few iterations
of a loop to form pre- and/or post-loops for various purposes (alignment, loop bounds, etc).
Currently, the GCC loop optimizer is unable to recognize that such loops will roll only a few
iterations, and still performs optimizations on them. While this does not hurt performance in general,
it may significantly increase compilation time and code size without any performance benefit.
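To make the pre-/post-loop shape concrete, here is a hand-written C sketch of roughly what peeling for alignment and for the loop bound produces. It is an illustration under assumptions only: the function name `scale_peeled`, the vectorization factor `VF` of 4 and the 16-byte alignment target are made up for the example, and the code GCC actually generates differs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical vectorization factor: 4 floats per 16-byte vector.  */
#define VF 4

/* Multiply every element of A by K, written the way a loop looks
   after peeling.  Returns the number of iterations executed by the
   pre-loop and the post-loop together, which is at most
   2 * (VF - 1) for a naturally aligned float array.  */
size_t
scale_peeled (float *a, size_t n, float k)
{
  size_t i = 0, j, peeled = 0;

  assert (a != NULL || n == 0);

  /* Pre-loop: scalar iterations until &a[i] is vector-aligned.
     It rolls at most VF - 1 times.  */
  while (i < n && ((uintptr_t) &a[i] % (VF * sizeof (float))) != 0)
    {
      a[i] *= k;
      i++;
      peeled++;
    }

  /* Main loop: stands in for the vectorized body, VF elements per trip.  */
  for (; i + VF <= n; i += VF)
    for (j = 0; j < VF; j++)
      a[i + j] *= k;

  /* Post-loop: the leftover tail, again at most VF - 1 iterations.  */
  for (; i < n; i++)
    {
      a[i] *= k;
      peeled++;
    }

  return peeled;
}
```

Whatever the array length, the pre- and post-loops in this sketch run only a handful of iterations, which is why unrolling, unswitching or prefetching them is unlikely to pay off.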
This patch adds logic for the loop optimizer to recognize pre- and/or post-loops, and disables
prefetching, unswitching and loop unrolling on them. On polyhedron with -Ofast -funroll-loops -march=amdfam10,
the patch reduces compilation time by 28% on average and binary size by 20% on
average (see the attached data). Note that the small speed improvement (0.5%) could be noise, although the
code size reduction could possibly improve performance in some cases (I-cache improvement?).
The patch passed bootstrap and gcc regression tests on x86_64-unknown-linux-gnu.
Is it OK to commit to trunk?
Thanks,
Changpeng
[-- Attachment #2: polyhedron.txt --]
[-- Type: text/plain, Size: 956 bytes --]
Effect of the pre-/post-loop patch on polyhedron
option: gfortran -Ofast -funroll-loops -march=amdfam10
            compilation       code size     speed
            time reduction    reduction     improvement
            (%)               (%)           (%)
-----------------------------------------------------------
ac          -20.54            -17.15         0
aermod      -15.93            -10.15         2.51
air          -5.74             -5.45        -0.09
capacita    -31.35            -18.27         0.08
channel     -11.32            -10.24         1.22
doduc        -4.52             -6.12         0.82
fatigue     -34.51            -15.94         0
gas_dyn     -45.56            -28.66         2.31
induct       -3.1              -1.91         0.05
linpk       -25.55            -27.5          0.26
mdbx        -24.06            -19.74         1.27
nf          -60.85            -48.92        -0.77
protein     -44.73            -24.02        -0.19
rnflow      -50.55            -36.69         0.47
test_fpu    -52.49            -41.35         1.18
tfft        -24.83            -18.29         0.39
-----------------------------------------------------------
average     -28.48            -20.65         0.59
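The averages in the last row can be re-derived from the per-benchmark rows. The C sketch below does that as a sanity check; the struct, enum and function names are invented for this check, and the numbers are copied from the table above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One row of the table: percentage change per benchmark.  */
struct polyhedron_row
{
  const char *name;
  double compile_time;  /* compilation time change (%) */
  double code_size;     /* code size change (%) */
  double speed;         /* speed improvement (%) */
};

static const struct polyhedron_row polyhedron_rows[] = {
  { "ac",       -20.54, -17.15,  0.00 },
  { "aermod",   -15.93, -10.15,  2.51 },
  { "air",       -5.74,  -5.45, -0.09 },
  { "capacita", -31.35, -18.27,  0.08 },
  { "channel",  -11.32, -10.24,  1.22 },
  { "doduc",     -4.52,  -6.12,  0.82 },
  { "fatigue",  -34.51, -15.94,  0.00 },
  { "gas_dyn",  -45.56, -28.66,  2.31 },
  { "induct",    -3.10,  -1.91,  0.05 },
  { "linpk",    -25.55, -27.50,  0.26 },
  { "mdbx",     -24.06, -19.74,  1.27 },
  { "nf",       -60.85, -48.92, -0.77 },
  { "protein",  -44.73, -24.02, -0.19 },
  { "rnflow",   -50.55, -36.69,  0.47 },
  { "test_fpu", -52.49, -41.35,  1.18 },
  { "tfft",     -24.83, -18.29,  0.39 },
};

enum column { COMPILE_TIME, CODE_SIZE, SPEED };

/* Arithmetic mean of one column over all benchmarks.  */
double
column_mean (enum column col)
{
  size_t n = sizeof polyhedron_rows / sizeof polyhedron_rows[0];
  double sum = 0.0;
  size_t i;

  for (i = 0; i < n; i++)
    switch (col)
      {
      case COMPILE_TIME: sum += polyhedron_rows[i].compile_time; break;
      case CODE_SIZE:    sum += polyhedron_rows[i].code_size;    break;
      case SPEED:        sum += polyhedron_rows[i].speed;        break;
      }
  return sum / (double) n;
}

/* True if VALUE agrees with REPORTED to two decimal places.  */
bool
rounds_to (double value, double reported)
{
  double diff = value - reported;
  return diff > -0.005 && diff < 0.005;
}
```

Calling `rounds_to (column_mean (COMPILE_TIME), -28.48)`, and likewise for the other two columns with -20.65 and 0.59, confirms that the reported averages are plain arithmetic means over the 16 benchmarks.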
[-- Attachment #3: 0001-Don-t-perform-certain-loop-optimizations-on-pre-post.patch --]
[-- Type: text/x-patch; name="0001-Don-t-perform-certain-loop-optimizations-on-pre-post.patch", Size: 8306 bytes --]
From e8636e80de4d6de8ba2dbc8f08bd2daddd02edc3 Mon Sep 17 00:00:00 2001
From: Changpeng Fang <chfang@houghton.(none)>
Date: Mon, 13 Dec 2010 12:01:49 -0800
Subject: [PATCH] Don't perform certain loop optimizations on pre/post loops
* basic-block.h (bb_flags): Add a new flag BB_PRE_POST_LOOP_HEADER.
* cfg.c (clear_bb_flags): Keep BB_PRE_POST_LOOP_HEADER marker.
* cfgloop.h (mark_pre_or_post_loop): New function declaration.
(pre_or_post_loop_p): New function declaration.
* loop-unroll.c (decide_unroll_runtime_iterations): Do not unroll a
pre- or post-loop.
* loop-unswitch.c (unswitch_single_loop): Do not unswitch a pre- or
post-loop.
* tree-ssa-loop-manip.c (tree_transform_and_unroll_loop): Mark the
post-loop.
* tree-ssa-loop-niter.c (mark_pre_or_post_loop): Implement the new
function. (pre_or_post_loop_p): Implement the new function.
* tree-ssa-loop-prefetch.c (loop_prefetch_arrays): Don't prefetch
a pre- or post-loop.
* tree-ssa-loop-unswitch.c (tree_ssa_unswitch_loops): Do not unswitch
a pre- or post-loop.
* tree-vect-loop-manip.c (vect_do_peeling_for_loop_bound): Mark the
post-loop. (vect_do_peeling_for_alignment): Mark the pre-loop.
---
gcc/basic-block.h | 6 +++++-
gcc/cfg.c | 7 ++++---
gcc/cfgloop.h | 2 ++
gcc/loop-unroll.c | 7 +++++++
gcc/loop-unswitch.c | 8 ++++++++
gcc/tree-ssa-loop-manip.c | 3 +++
gcc/tree-ssa-loop-niter.c | 20 ++++++++++++++++++++
gcc/tree-ssa-loop-prefetch.c | 7 +++++++
gcc/tree-ssa-loop-unswitch.c | 8 ++++++++
gcc/tree-vect-loop-manip.c | 8 ++++++++
10 files changed, 72 insertions(+), 4 deletions(-)
diff --git a/gcc/basic-block.h b/gcc/basic-block.h
index be0a1d1..78552fd 100644
--- a/gcc/basic-block.h
+++ b/gcc/basic-block.h
@@ -245,7 +245,11 @@ enum bb_flags
/* Set on blocks that cannot be threaded through.
Only used in cfgcleanup.c. */
- BB_NONTHREADABLE_BLOCK = 1 << 11
+ BB_NONTHREADABLE_BLOCK = 1 << 11,
+
+ /* Set on blocks that are headers of pre- or post-loops. */
+ BB_PRE_POST_LOOP_HEADER = 1 << 12
+
};
/* Dummy flag for convenience in the hot/cold partitioning code. */
diff --git a/gcc/cfg.c b/gcc/cfg.c
index c8ef799..e9b394a 100644
--- a/gcc/cfg.c
+++ b/gcc/cfg.c
@@ -425,8 +425,8 @@ redirect_edge_pred (edge e, basic_block new_pred)
connect_src (e);
}
-/* Clear all basic block flags, with the exception of partitioning and
- setjmp_target. */
+/* Clear all basic block flags, with the exception of partitioning,
+ setjmp_target, and the pre/post loop marker. */
void
clear_bb_flags (void)
{
@@ -434,7 +434,8 @@ clear_bb_flags (void)
FOR_BB_BETWEEN (bb, ENTRY_BLOCK_PTR, NULL, next_bb)
bb->flags = (BB_PARTITION (bb)
- | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET)));
+ | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET
+ + BB_PRE_POST_LOOP_HEADER)));
}
\f
/* Check the consistency of profile information. We can't do that
diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index bf2614e..ce848cc 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -279,6 +279,8 @@ extern rtx doloop_condition_get (rtx);
void estimate_numbers_of_iterations_loop (struct loop *, bool);
HOST_WIDE_INT estimated_loop_iterations_int (struct loop *, bool);
bool estimated_loop_iterations (struct loop *, bool, double_int *);
+void mark_pre_or_post_loop (struct loop *);
+bool pre_or_post_loop_p (struct loop *);
/* Loop manipulation. */
extern bool can_duplicate_loop_p (const struct loop *loop);
diff --git a/gcc/loop-unroll.c b/gcc/loop-unroll.c
index 67d6ea0..6f095f6 100644
--- a/gcc/loop-unroll.c
+++ b/gcc/loop-unroll.c
@@ -857,6 +857,13 @@ decide_unroll_runtime_iterations (struct loop *loop, int flags)
fprintf (dump_file, ";; Loop iterates constant times\n");
return;
}
+
+ if (pre_or_post_loop_p (loop))
+ {
+ if (dump_file)
+ fprintf (dump_file, ";; Not unrolling, a pre- or post-loop\n");
+ return;
+ }
/* If we have profile feedback, check whether the loop rolls. */
if (loop->header->count && expected_loop_iterations (loop) < 2 * nunroll)
diff --git a/gcc/loop-unswitch.c b/gcc/loop-unswitch.c
index 77524d8..59373bf 100644
--- a/gcc/loop-unswitch.c
+++ b/gcc/loop-unswitch.c
@@ -276,6 +276,14 @@ unswitch_single_loop (struct loop *loop, rtx cond_checked, int num)
return;
}
+ /* Pre- or post loop usually just roll a few iterations. */
+ if (pre_or_post_loop_p (loop))
+ {
+ if (dump_file)
+ fprintf (dump_file, ";; Not unswitching, a pre- or post loop\n");
+ return;
+ }
+
/* We must be able to duplicate loop body. */
if (!can_duplicate_loop_p (loop))
{
diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
index 87b2c0d..f8ddbab 100644
--- a/gcc/tree-ssa-loop-manip.c
+++ b/gcc/tree-ssa-loop-manip.c
@@ -931,6 +931,9 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
gcc_assert (new_loop != NULL);
update_ssa (TODO_update_ssa);
+ /* NEW_LOOP is a post-loop. */
+ mark_pre_or_post_loop (new_loop);
+
/* Determine the probability of the exit edge of the unrolled loop. */
new_est_niter = est_niter / factor;
diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index ee85f6f..33e8cc3 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -3011,6 +3011,26 @@ estimate_numbers_of_iterations (bool use_undefined_p)
fold_undefer_and_ignore_overflow_warnings ();
}
+/* Mark LOOP as a pre- or post loop. */
+
+void
+mark_pre_or_post_loop (struct loop *loop)
+{
+ gcc_assert (loop && loop->header);
+ loop->header->flags |= BB_PRE_POST_LOOP_HEADER;
+}
+
+/* Return true if LOOP is a pre- or post loop. */
+
+bool
+pre_or_post_loop_p (struct loop *loop)
+{
+ int masked_flags;
+ gcc_assert (loop && loop->header);
+ masked_flags = (loop->header->flags & BB_PRE_POST_LOOP_HEADER);
+ return (masked_flags != 0);
+}
+
/* Returns true if statement S1 dominates statement S2. */
bool
diff --git a/gcc/tree-ssa-loop-prefetch.c b/gcc/tree-ssa-loop-prefetch.c
index 59c65d3..5c9f640 100644
--- a/gcc/tree-ssa-loop-prefetch.c
+++ b/gcc/tree-ssa-loop-prefetch.c
@@ -1793,6 +1793,13 @@ loop_prefetch_arrays (struct loop *loop)
return false;
}
+ if (pre_or_post_loop_p (loop))
+ {
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file, " Not Prefetching -- pre- or post loop\n");
+ return false;
+ }
+
/* FIXME: the time should be weighted by the probabilities of the blocks in
the loop body. */
time = tree_num_loop_insns (loop, &eni_time_weights);
diff --git a/gcc/tree-ssa-loop-unswitch.c b/gcc/tree-ssa-loop-unswitch.c
index b6b32dc..f3b8108 100644
--- a/gcc/tree-ssa-loop-unswitch.c
+++ b/gcc/tree-ssa-loop-unswitch.c
@@ -88,6 +88,14 @@ tree_ssa_unswitch_loops (void)
if (dump_file && (dump_flags & TDF_DETAILS))
fprintf (dump_file, ";; Considering loop %d\n", loop->num);
+ /* Do not unswitch a pre- or post loop. */
+ if (pre_or_post_loop_p (loop))
+ {
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file, ";; Not unswitching, a pre- or post loop\n");
+ continue;
+ }
+
/* Do not unswitch in cold regions. */
if (optimize_loop_for_size_p (loop))
{
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 6ecd304..9a63f7e 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1938,6 +1938,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio,
cond_expr, cond_expr_stmt_list);
gcc_assert (new_loop);
gcc_assert (loop_num == loop->num);
+
+ /* NEW_LOOP is a post loop. */
+ mark_pre_or_post_loop (new_loop);
+
#ifdef ENABLE_CHECKING
slpeel_verify_cfg_after_peeling (loop, new_loop);
#endif
@@ -2191,6 +2195,10 @@ vect_do_peeling_for_alignment (loop_vec_info loop_vinfo)
th, true, NULL_TREE, NULL);
gcc_assert (new_loop);
+
+ /* NEW_LOOP is a pre-loop. */
+ mark_pre_or_post_loop (new_loop);
+
#ifdef ENABLE_CHECKING
slpeel_verify_cfg_after_peeling (new_loop, loop);
#endif
--
1.6.3.3
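The mechanism in the patch above is a sticky flag bit on the loop header block: peeling code sets it, optimization passes test it, and `clear_bb_flags` is taught to preserve it. The following self-contained C model sketches that idea outside of GCC. The struct layouts, `BB_SCRATCH`, `BB_STICKY` and `clear_bb_flags_model` are simplified, hypothetical stand-ins; only `BB_PRE_POST_LOOP_HEADER = 1 << 12` and the two helper names are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum bb_flags
{
  BB_SCRATCH = 1 << 0,               /* hypothetical: dropped on every clear */
  BB_STICKY = 1 << 1,                /* hypothetical: survives a clear */
  BB_PRE_POST_LOOP_HEADER = 1 << 12  /* value taken from the patch */
};

/* Minimal stand-ins for GCC's basic block and loop structures.  */
struct basic_block_def
{
  int flags;
};

struct loop
{
  struct basic_block_def *header;
};

/* Mark LOOP as a pre- or post-loop by setting the bit on its header.  */
void
mark_pre_or_post_loop (struct loop *loop)
{
  assert (loop && loop->header);
  loop->header->flags |= BB_PRE_POST_LOOP_HEADER;
}

/* Return true if LOOP was marked as a pre- or post-loop.  */
bool
pre_or_post_loop_p (const struct loop *loop)
{
  assert (loop && loop->header);
  return (loop->header->flags & BB_PRE_POST_LOOP_HEADER) != 0;
}

/* Model of clearing flags on one block: every bit is dropped except
   the ones listed in the preserve mask.  */
void
clear_bb_flags_model (struct basic_block_def *bb)
{
  assert (bb);
  bb->flags &= (BB_STICKY | BB_PRE_POST_LOOP_HEADER);
}
```

The design consequence is that the marker lives on the header block rather than on the loop structure itself, so in this model it only stays meaningful as long as the same block remains the loop header and every flag-clearing routine knows to keep the bit.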
* Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
From: Sebastian Pop @ 2010-12-14 0:56 UTC
To: Fang, Changpeng; +Cc: gcc-patches, Richard Guenther
On Mon, Dec 13, 2010 at 14:35, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
> Hi,
>
> The attached patch adds the logic to disable certain loop optimizations on pre-/post-loops.
>
> Some loop optimizations (auto-vectorization, loop unrolling, etc) may peel a few iterations
> of a loop to form pre- and/or post-loops for various purposes (alignment, loop bounds, etc).
> Currently, the GCC loop optimizer is unable to recognize that such loops will roll only a few
> iterations, and still performs optimizations on them. While this does not hurt performance in general,
> it may significantly increase compilation time and code size without any performance benefit.
>
> This patch adds logic for the loop optimizer to recognize pre- and/or post-loops, and disables
> prefetching, unswitching and loop unrolling on them. On polyhedron with -Ofast -funroll-loops -march=amdfam10,
> the patch reduces compilation time by 28% on average and binary size by 20% on
> average (see the attached data). Note that the small speed improvement (0.5%) could be noise, although the
> code size reduction could possibly improve performance in some cases (I-cache improvement?).
>
> The patch passed bootstrap and gcc regression tests on x86_64-unknown-linux-gnu.
>
> Is it OK to commit to trunk?
I like the way you solved this problem, but I cannot approve your patch.
I will let Richi or someone else comment on it.
Thanks for fixing this,
Sebastian
* Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
From: Zdenek Dvorak @ 2010-12-14 8:25 UTC
To: Fang, Changpeng; +Cc: gcc-patches
Hi,
> The attached patch adds the logic to disable certain loop optimizations on pre-/post-loops.
>
> Some loop optimizations (auto-vectorization, loop unrolling, etc) may peel a few iterations
> of a loop to form pre- and/or post-loops for various purposes (alignment, loop bounds, etc).
> Currently, the GCC loop optimizer is unable to recognize that such loops will roll only a few
> iterations, and still performs optimizations on them. While this does not hurt performance in general,
> it may significantly increase compilation time and code size without any performance benefit.
>
> This patch adds logic for the loop optimizer to recognize pre- and/or post-loops, and disables
> prefetching, unswitching and loop unrolling on them.
Why not simply change the profile updating to correctly indicate that these loops do not roll?
That way, all the optimizations would profit, not just those aware of the new bb flag.
Zdenek
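The alternative suggested here relies on heuristics that already consult the profile, such as the `loop->header->count && expected_loop_iterations (loop) < 2 * nunroll` check visible in the loop-unroll.c context of the patch. The C sketch below models that idea with invented types: `struct loop_profile_model`, `expected_iterations_model` and `worth_unrolling_model` are hypothetical names, and GCC's real profile representation is more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified profile of one loop.  */
struct loop_profile_model
{
  long entry_count;   /* times the loop is entered from outside */
  long header_count;  /* times the loop header executes; 0 = no profile */
};

/* Rough expected number of iterations per entry into the loop.  */
long
expected_iterations_model (const struct loop_profile_model *p)
{
  assert (p && p->entry_count >= 0 && p->header_count >= 0);
  if (p->entry_count == 0)
    return 0;
  return p->header_count / p->entry_count;
}

/* Mirror of the shape of the existing check: with profile data, refuse
   to unroll NUNROLL times a loop expected to roll fewer than
   2 * NUNROLL iterations.  */
bool
worth_unrolling_model (const struct loop_profile_model *p, int nunroll)
{
  assert (p && nunroll > 0);
  if (p->header_count == 0)
    return true;  /* no profile: this particular check cannot reject */
  return expected_iterations_model (p) >= 2L * nunroll;
}
```

Under this model, if the peeling code scaled a post-loop's counts to reflect that it runs about three iterations per entry, an unroll factor of 4 would be rejected by the existing check with no dedicated flag, and any other profile-driven pass would see the same information.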
* Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
From: Jack Howarth @ 2010-12-14 16:13 UTC
To: Fang, Changpeng; +Cc: gcc-patches
On Mon, Dec 13, 2010 at 02:35:35PM -0600, Fang, Changpeng wrote:
> Hi,
>
> The attached patch adds the logic to disable certain loop optimizations on pre-/post-loops.
>
> Some loop optimizations (auto-vectorization, loop unrolling, etc) may peel a few iterations
> of a loop to form pre- and/or post-loops for various purposes (alignment, loop bounds, etc).
> Currently, the GCC loop optimizer is unable to recognize that such loops will roll only a few
> iterations, and still performs optimizations on them. While this does not hurt performance in general,
> it may significantly increase compilation time and code size without any performance benefit.
>
> This patch adds logic for the loop optimizer to recognize pre- and/or post-loops, and disables
> prefetching, unswitching and loop unrolling on them. On polyhedron with -Ofast -funroll-loops -march=amdfam10,
> the patch reduces compilation time by 28% on average and binary size by 20% on
> average (see the attached data). Note that the small speed improvement (0.5%) could be noise, although the
> code size reduction could possibly improve performance in some cases (I-cache improvement?).
>
> The patch passed bootstrap and gcc regression tests on x86_64-unknown-linux-gnu.
>
> Is it OK to commit to trunk?
>
> Thanks,
>
> Changpeng
Changpeng,
On x86_64-apple-darwin10, this patch produces some regressions in the gcc testsuite.
In particular at both -m32 and -m64...
XPASS: gcc.dg/pr30957-1.c execution test
FAIL: gcc.dg/pr30957-1.c scan-rtl-dump loop2_unroll "Expanding Accumulator"
and
FAIL: gcc.dg/var-expand1.c scan-rtl-dump loop2_unroll "Expanding Accumulator"
Do you see those as well on linux?
Jack
* RE: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
From: Fang, Changpeng @ 2010-12-14 17:33 UTC
To: Jack Howarth; +Cc: gcc-patches
No, I didn't see these failures.
The failures from my bootstrap and regression run are listed below; none of them looks relevant:
FAIL: gcc.dg/guality/pr43077-1.c -O2 -flto -flto-partition=none line 42 varb == 2
FAIL: gcc.dg/guality/pr43077-1.c -O2 -flto line 42 varb == 2
FAIL: gcc.dg/guality/sra-1.c -O1 line 21 a.j == 14
FAIL: gcc.dg/guality/sra-1.c -O2 line 21 a.j == 14
FAIL: gcc.dg/guality/sra-1.c -O3 -fomit-frame-pointer line 21 a.j == 14
FAIL: gcc.dg/guality/sra-1.c -O3 -g line 21 a.j == 14
FAIL: gcc.dg/guality/sra-1.c -Os line 21 a.j == 14
FAIL: gcc.dg/guality/vla-1.c -O0 line 17 sizeof (a) == 6
FAIL: gcc.dg/guality/vla-1.c -O0 line 24 sizeof (a) == 17 * sizeof (short)
FAIL: gcc.dg/guality/vla-1.c -O1 line 24 sizeof (a) == 17 * sizeof (short)
FAIL: gcc.dg/guality/vla-1.c -O2 line 24 sizeof (a) == 17 * sizeof (short)
FAIL: gcc.dg/guality/vla-1.c -O3 -fomit-frame-pointer line 24 sizeof (a) == 17 * sizeof (short)
FAIL: gcc.dg/guality/vla-1.c -O3 -g line 24 sizeof (a) == 17 * sizeof (short)
FAIL: gcc.dg/guality/vla-1.c -Os line 24 sizeof (a) == 17 * sizeof (short)
FAIL: gcc.dg/guality/vla-2.c -O0 line 16 sizeof (a) == 5 * sizeof (int)
FAIL: gcc.dg/guality/vla-2.c -O0 line 25 sizeof (a) == 6 * sizeof (int)
FAIL: gcc.dg/guality/vla-2.c -O1 line 16 sizeof (a) == 5 * sizeof (int)
FAIL: gcc.dg/guality/vla-2.c -O1 line 25 sizeof (a) == 6 * sizeof (int)
FAIL: gcc.dg/guality/vla-2.c -O2 line 16 sizeof (a) == 5 * sizeof (int)
FAIL: gcc.dg/guality/vla-2.c -O2 line 25 sizeof (a) == 6 * sizeof (int)
FAIL: gcc.dg/guality/vla-2.c -O3 -fomit-frame-pointer line 16 sizeof (a) == 5 * sizeof (int)
FAIL: gcc.dg/guality/vla-2.c -O3 -fomit-frame-pointer line 25 sizeof (a) == 6 * sizeof (int)
FAIL: gcc.dg/guality/vla-2.c -O3 -g line 16 sizeof (a) == 5 * sizeof (int)
FAIL: gcc.dg/guality/vla-2.c -O3 -g line 25 sizeof (a) == 6 * sizeof (int)
FAIL: gcc.dg/guality/vla-2.c -Os line 16 sizeof (a) == 5 * sizeof (int)
FAIL: gcc.dg/guality/vla-2.c -Os line 25 sizeof (a) == 6 * sizeof (int)
FAIL: g++.dg/guality/redeclaration1.C -O0 line 17 i == 24
FAIL: g++.dg/guality/redeclaration1.C -O1 line 17 i == 24
FAIL: g++.dg/guality/redeclaration1.C -O2 line 17 i == 24
FAIL: g++.dg/guality/redeclaration1.C -O3 -fomit-frame-pointer line 17 i == 24
FAIL: g++.dg/guality/redeclaration1.C -O3 -g line 17 i == 24
FAIL: g++.dg/guality/redeclaration1.C -Os line 17 i == 24
FAIL: libmudflap.c/pass49-frag.c execution test
FAIL: libmudflap.c/pass49-frag.c output pattern test
FAIL: libmudflap.c/pass49-frag.c execution test
FAIL: libmudflap.c/pass49-frag.c output pattern test
FAIL: libmudflap.c/pass49-frag.c (-static) execution test
FAIL: libmudflap.c/pass49-frag.c (-static) output pattern test
FAIL: libmudflap.c/pass49-frag.c (-static) execution test
FAIL: libmudflap.c/pass49-frag.c (-static) output pattern test
FAIL: libmudflap.c/pass49-frag.c (-O2) execution test
FAIL: libmudflap.c/pass49-frag.c (-O2) output pattern test
FAIL: libmudflap.c/pass49-frag.c (-O2) execution test
FAIL: libmudflap.c/pass49-frag.c (-O2) output pattern test
FAIL: libmudflap.c/pass49-frag.c (-O3) execution test
FAIL: libmudflap.c/pass49-frag.c (-O3) output pattern test
FAIL: libmudflap.c/pass49-frag.c (-O3) execution test
FAIL: libmudflap.c/pass49-frag.c (-O3) output pattern test
FAIL: gcc.dg/cproj-fails-with-broken-glibc.c execution test
Thanks,
Changpeng
________________________________________
From: Jack Howarth [howarth@bromo.med.uc.edu]
Sent: Tuesday, December 14, 2010 8:27 AM
To: Fang, Changpeng
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
On Mon, Dec 13, 2010 at 02:35:35PM -0600, Fang, Changpeng wrote:
> Hi,
>
> The attached patch adds the logic to disable certain loop optimizations on pre-/post-loops.
>
> Some loop optimizations (auto-vectorization, loop unrolling, etc) may peel a few iterations
> of a loop to form pre- and/or post-loops for various purposes (alignment, loop bounds, etc).
> Currently, the GCC loop optimizer is unable to recognize that such loops will roll only a few
> iterations, and still performs optimizations on them. While this does not hurt performance in general,
> it may significantly increase compilation time and code size without any performance benefit.
>
> This patch adds logic for the loop optimizer to recognize pre- and/or post-loops, and disables
> prefetching, unswitching and loop unrolling on them. On polyhedron with -Ofast -funroll-loops -march=amdfam10,
> the patch reduces compilation time by 28% on average and binary size by 20% on
> average (see the attached data). Note that the small speed improvement (0.5%) could be noise, although the
> code size reduction could possibly improve performance in some cases (I-cache improvement?).
>
> The patch passed bootstrap and gcc regression tests on x86_64-unknown-linux-gnu.
>
> Is it OK to commit to trunk?
>
> Thanks,
>
> Changpeng
Changpeng,
On x86_64-apple-darwin10, this patch produces some regressions in the gcc testsuite.
In particular at both -m32 and -m64...
XPASS: gcc.dg/pr30957-1.c execution test
FAIL: gcc.dg/pr30957-1.c scan-rtl-dump loop2_unroll "Expanding Accumulator"
and
FAIL: gcc.dg/var-expand1.c scan-rtl-dump loop2_unroll "Expanding Accumulator"
Do you see those as well on linux?
Jack
> connect_src (e);
> }
>
> -/* Clear all basic block flags, with the exception of partitioning and
> - setjmp_target. */
> +/* Clear all basic block flags, with the exception of partitioning,
> + setjmp_target, and the pre/post loop marker. */
> void
> clear_bb_flags (void)
> {
> @@ -434,7 +434,8 @@ clear_bb_flags (void)
>
> FOR_BB_BETWEEN (bb, ENTRY_BLOCK_PTR, NULL, next_bb)
> bb->flags = (BB_PARTITION (bb)
> - | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET)));
> + | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET
> + + BB_PRE_POST_LOOP_HEADER)));
> }
>
> /* Check the consistency of profile information. We can't do that
> diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
> index bf2614e..ce848cc 100644
> --- a/gcc/cfgloop.h
> +++ b/gcc/cfgloop.h
> @@ -279,6 +279,8 @@ extern rtx doloop_condition_get (rtx);
> void estimate_numbers_of_iterations_loop (struct loop *, bool);
> HOST_WIDE_INT estimated_loop_iterations_int (struct loop *, bool);
> bool estimated_loop_iterations (struct loop *, bool, double_int *);
> +void mark_pre_or_post_loop (struct loop *);
> +bool pre_or_post_loop_p (struct loop *);
>
> /* Loop manipulation. */
> extern bool can_duplicate_loop_p (const struct loop *loop);
> diff --git a/gcc/loop-unroll.c b/gcc/loop-unroll.c
> index 67d6ea0..6f095f6 100644
> --- a/gcc/loop-unroll.c
> +++ b/gcc/loop-unroll.c
> @@ -857,6 +857,13 @@ decide_unroll_runtime_iterations (struct loop *loop, int flags)
> fprintf (dump_file, ";; Loop iterates constant times\n");
> return;
> }
> +
> + if (pre_or_post_loop_p (loop))
> + {
> + if (dump_file)
> + fprintf (dump_file, ";; Not unrolling, a pre- or post-loop\n");
> + return;
> + }
>
> /* If we have profile feedback, check whether the loop rolls. */
> if (loop->header->count && expected_loop_iterations (loop) < 2 * nunroll)
> diff --git a/gcc/loop-unswitch.c b/gcc/loop-unswitch.c
> index 77524d8..59373bf 100644
> --- a/gcc/loop-unswitch.c
> +++ b/gcc/loop-unswitch.c
> @@ -276,6 +276,14 @@ unswitch_single_loop (struct loop *loop, rtx cond_checked, int num)
> return;
> }
>
> + /* Pre- or post loop usually just roll a few iterations. */
> + if (pre_or_post_loop_p (loop))
> + {
> + if (dump_file)
> + fprintf (dump_file, ";; Not unswitching, a pre- or post loop\n");
> + return;
> + }
> +
> /* We must be able to duplicate loop body. */
> if (!can_duplicate_loop_p (loop))
> {
> diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
> index 87b2c0d..f8ddbab 100644
> --- a/gcc/tree-ssa-loop-manip.c
> +++ b/gcc/tree-ssa-loop-manip.c
> @@ -931,6 +931,9 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
> gcc_assert (new_loop != NULL);
> update_ssa (TODO_update_ssa);
>
> + /* NEW_LOOP is a post-loop. */
> + mark_pre_or_post_loop (new_loop);
> +
> /* Determine the probability of the exit edge of the unrolled loop. */
> new_est_niter = est_niter / factor;
>
> diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
> index ee85f6f..33e8cc3 100644
> --- a/gcc/tree-ssa-loop-niter.c
> +++ b/gcc/tree-ssa-loop-niter.c
> @@ -3011,6 +3011,26 @@ estimate_numbers_of_iterations (bool use_undefined_p)
> fold_undefer_and_ignore_overflow_warnings ();
> }
>
> +/* Mark LOOP as a pre- or post loop. */
> +
> +void
> +mark_pre_or_post_loop (struct loop *loop)
> +{
> + gcc_assert (loop && loop->header);
> + loop->header->flags |= BB_PRE_POST_LOOP_HEADER;
> +}
> +
> +/* Return true if LOOP is a pre- or post loop. */
> +
> +bool
> +pre_or_post_loop_p (struct loop *loop)
> +{
> + int masked_flags;
> + gcc_assert (loop && loop->header);
> + masked_flags = (loop->header->flags & BB_PRE_POST_LOOP_HEADER);
> + return (masked_flags != 0);
> +}
> +
> /* Returns true if statement S1 dominates statement S2. */
>
> bool
> diff --git a/gcc/tree-ssa-loop-prefetch.c b/gcc/tree-ssa-loop-prefetch.c
> index 59c65d3..5c9f640 100644
> --- a/gcc/tree-ssa-loop-prefetch.c
> +++ b/gcc/tree-ssa-loop-prefetch.c
> @@ -1793,6 +1793,13 @@ loop_prefetch_arrays (struct loop *loop)
> return false;
> }
>
> + if (pre_or_post_loop_p (loop))
> + {
> + if (dump_file && (dump_flags & TDF_DETAILS))
> + fprintf (dump_file, " Not Prefetching -- pre- or post loop\n");
> + return false;
> + }
> +
> /* FIXME: the time should be weighted by the probabilities of the blocks in
> the loop body. */
> time = tree_num_loop_insns (loop, &eni_time_weights);
> diff --git a/gcc/tree-ssa-loop-unswitch.c b/gcc/tree-ssa-loop-unswitch.c
> index b6b32dc..f3b8108 100644
> --- a/gcc/tree-ssa-loop-unswitch.c
> +++ b/gcc/tree-ssa-loop-unswitch.c
> @@ -88,6 +88,14 @@ tree_ssa_unswitch_loops (void)
> if (dump_file && (dump_flags & TDF_DETAILS))
> fprintf (dump_file, ";; Considering loop %d\n", loop->num);
>
> + /* Do not unswitch a pre- or post loop. */
> + if (pre_or_post_loop_p (loop))
> + {
> + if (dump_file && (dump_flags & TDF_DETAILS))
> + fprintf (dump_file, ";; Not unswitching, a pre- or post loop\n");
> + continue;
> + }
> +
> /* Do not unswitch in cold regions. */
> if (optimize_loop_for_size_p (loop))
> {
> diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> index 6ecd304..9a63f7e 100644
> --- a/gcc/tree-vect-loop-manip.c
> +++ b/gcc/tree-vect-loop-manip.c
> @@ -1938,6 +1938,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio,
> cond_expr, cond_expr_stmt_list);
> gcc_assert (new_loop);
> gcc_assert (loop_num == loop->num);
> +
> + /* NEW_LOOP is a post loop. */
> + mark_pre_or_post_loop (new_loop);
> +
> #ifdef ENABLE_CHECKING
> slpeel_verify_cfg_after_peeling (loop, new_loop);
> #endif
> @@ -2191,6 +2195,10 @@ vect_do_peeling_for_alignment (loop_vec_info loop_vinfo)
> th, true, NULL_TREE, NULL);
>
> gcc_assert (new_loop);
> +
> + /* NEW_LOOP is a pre-loop. */
> + mark_pre_or_post_loop (new_loop);
> +
> #ifdef ENABLE_CHECKING
> slpeel_verify_cfg_after_peeling (new_loop, loop);
> #endif
> --
> 1.6.3.3
>
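For readers following the patch, the marking scheme can be tried outside GCC with a small stand-alone sketch. Everything prefixed toy_ or TOY_ below is a hypothetical stand-in for the real basic_block and struct loop types; only the bit positions and the mark/query logic are taken from the quoted diff, and the early exit is a simplified version of the checks added to the unroller and unswitcher.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for GCC's bb_flags; the bit positions follow the patch.  */
enum toy_bb_flags
{
  TOY_BB_NONTHREADABLE_BLOCK = 1 << 11,
  TOY_BB_PRE_POST_LOOP_HEADER = 1 << 12
};

/* Toy basic block: only the flags word is modelled.  */
struct toy_basic_block
{
  int flags;
};

/* Toy loop: only the header block is modelled.  */
struct toy_loop
{
  struct toy_basic_block *header;
};

/* Mark LOOP as a pre- or post-loop by setting the flag on its header.  */
void
toy_mark_pre_or_post_loop (struct toy_loop *loop)
{
  assert (loop && loop->header);
  loop->header->flags |= TOY_BB_PRE_POST_LOOP_HEADER;
}

/* Return true if LOOP was marked as a pre- or post-loop.  */
bool
toy_pre_or_post_loop_p (const struct toy_loop *loop)
{
  assert (loop && loop->header);
  return (loop->header->flags & TOY_BB_PRE_POST_LOOP_HEADER) != 0;
}

/* Simplified form of the early exit the patch adds to each pass:
   a marked loop is no longer a candidate for the transformation.  */
bool
toy_should_unroll (const struct toy_loop *loop)
{
  return !toy_pre_or_post_loop_p (loop);
}
```

Because the mark lives on the header block rather than on the loop structure, it only survives as long as the block flags do, which is presumably why the patch also teaches clear_bb_flags to preserve it.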
* RE: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-14 8:25 ` Zdenek Dvorak
@ 2010-12-14 20:02 ` Fang, Changpeng
2010-12-14 21:55 ` Zdenek Dvorak
0 siblings, 1 reply; 36+ messages in thread
From: Fang, Changpeng @ 2010-12-14 20:02 UTC (permalink / raw)
To: Zdenek Dvorak; +Cc: gcc-patches
>why not simply change the profile updating to correctly indicate that these loops do not roll?
>That way, all the optimizations would profit, not just those aware of the new bb flag,
Maybe my understanding is not correct, but I do not feel comfortable using the profiled trip count
to guard loop optimizations. For a given program, different data sizes will result in quite different
loop trip counts.
By the way, what other optimizations do you think would benefit significantly from being disabled
for small-trip-count loops?
Thanks,
Changpeng
* Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-14 20:02 ` Fang, Changpeng
@ 2010-12-14 21:55 ` Zdenek Dvorak
2010-12-15 6:16 ` Richard Guenther
0 siblings, 1 reply; 36+ messages in thread
From: Zdenek Dvorak @ 2010-12-14 21:55 UTC (permalink / raw)
To: Fang, Changpeng; +Cc: gcc-patches
Hi,
> >why not simply change the profile updating to correctly indicate that these loops do not roll?
> >That way, all the optimizations would profit, not just those aware of the new bb flag,
>
> Maybe my understanding is not correct. But I feel not comfortable using profile of trip count
> to guard loop optimizations.
it is already used that way; i.e., you do not need to change anything in the optimizations, just
make sure that the edge probabilities are sensible.
> For a given program, different data sizes will result in quite different
> loop trip counts.
That should not be the case -- for the pre/post loops generated in vectorization, we know the
expected # of iterations, based on their purpose; e.g., for loops inserted so that the # of iterations
is divisible by 4, we know that the loop will iterate at most three times (and probably fewer), etc.
> By the way, what optimizations else do you think will benefit from disabling for small trip count
> loops, significantly?
Anything where we check whether we should optimize for speed or code size,
Zdenek
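The "at most three times" bound mentioned above can be checked with a hand-written analogue of a vectorized loop. This is only an illustrative sketch: TOY_VF, toy_sum_with_post_loop and the post_iters counter are hypothetical names, and the inner lane loop merely stands in for a real vector operation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vectorization factor, matching the "divisible by 4" example.  */
#define TOY_VF ((size_t) 4)

/* Sum the N elements of A with a main loop that consumes TOY_VF elements
   per trip and a scalar post-loop for the leftovers.  The number of
   post-loop iterations is stored in *POST_ITERS.  */
int
toy_sum_with_post_loop (const int *a, size_t n, size_t *post_iters)
{
  int lanes[TOY_VF] = { 0 };
  size_t main_end = n - n % TOY_VF;
  size_t i = 0;
  int sum = 0;

  assert (a && post_iters);

  /* Main loop: stands in for the vectorized body.  */
  for (; i < main_end; i += TOY_VF)
    for (size_t lane = 0; lane < TOY_VF; lane++)
      lanes[lane] += a[i + lane];

  /* Reduce the per-lane partial sums.  */
  for (size_t lane = 0; lane < TOY_VF; lane++)
    sum += lanes[lane];

  /* Post-loop: handles the n % TOY_VF leftover elements.  */
  *post_iters = 0;
  for (; i < n; i++)
    {
      sum += a[i];
      ++*post_iters;
    }

  /* The leftover count is a remainder modulo TOY_VF, so it is below TOY_VF.  */
  assert (*post_iters < TOY_VF);
  return sum;
}
```

Whatever n is, the post-loop trip count is n % TOY_VF, so it depends on the vectorization factor and not on the data size.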
* Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-14 21:55 ` Zdenek Dvorak
@ 2010-12-15 6:16 ` Richard Guenther
2010-12-15 8:34 ` Fang, Changpeng
0 siblings, 1 reply; 36+ messages in thread
From: Richard Guenther @ 2010-12-15 6:16 UTC (permalink / raw)
To: Zdenek Dvorak; +Cc: Fang, Changpeng, gcc-patches
On Tue, Dec 14, 2010 at 10:05 PM, Zdenek Dvorak <rakdver@kam.mff.cuni.cz> wrote:
> Hi,
>
>> >why not simply change the profile updating to correctly indicate that these loops do not roll?
>> >That way, all the optimizations would profit, not just those aware of the new bb flag,
>>
>> Maybe my understanding is not correct. But I feel not comfortable using profile of trip count
>> to guard loop optimizations.
>
> it is already used that way; i.e., you do not need to change anything in the optimizations, just
> make sure that the edge probabilities are sensible.
>
>> For a given program, different data sizes will result in quite different
>> loop trip counts.
>
> That should not be the case -- for the pre/post loops generated in vectorization, we know the
> expected # of iterations, based on their purpose; e.g., for loops inserted so that the # of iterations
> is divisible by 4, we know that the loop will iterate at most three times (and probably less), etc.
>
>> By the way, what optimizations else do you think will benefit from disabling for small trip count
>> loops, significantly?
>
> Anything where we check whether we should optimize for speed or code size,
I agree with Zdenek (without having looked at the patch so far).
Richard.
> Zdenek
>
* RE: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-15 6:16 ` Richard Guenther
@ 2010-12-15 8:34 ` Fang, Changpeng
2010-12-15 9:22 ` Xinliang David Li
0 siblings, 1 reply; 36+ messages in thread
From: Fang, Changpeng @ 2010-12-15 8:34 UTC (permalink / raw)
To: Richard Guenther, Zdenek Dvorak; +Cc: gcc-patches
________________________________________
From: Richard Guenther [richard.guenther@gmail.com]
Sent: Tuesday, December 14, 2010 8:35 PM
To: Zdenek Dvorak
Cc: Fang, Changpeng; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
On Tue, Dec 14, 2010 at 10:05 PM, Zdenek Dvorak <rakdver@kam.mff.cuni.cz> wrote:
> Hi,
>
>> >why not simply change the profile updating to correctly indicate that these loops do not roll?
>> >That way, all the optimizations would profit, not just those aware of the new bb flag,
>>
>> Maybe my understanding is not correct. But I feel not comfortable using profile of trip count
>> to guard loop optimizations.
>
> it is already used that way; i.e., you do not need to change anything in the optimizations, just
> make sure that the edge probabilities are sensible.
>
>> For a given program, different data sizes will result in quite different
>> loop trip counts.
>
> That should not be the case -- for the pre/post loops generated in vectorization, we know the
> expected # of iterations, based on their purpose; e.g., for loops inserted so that the # of iterations
> is divisible by 4, we know that the loop will iterate at most three times (and probably less), etc.
>
>> By the way, what optimizations else do you think will benefit from disabling for small trip count
>> loops, significantly?
>
> Anything where we check whether we should optimize for speed or code size,
>I agree with Zdenek (without having looked at the patch so far).
I think my patch (adding a bb flag) provides a simple yet effective solution to the unnecessary
code expansion problem in prefetching, unswitching, and loop unrolling. However, I don't mind
updating the profile information for the same purpose.
Now, suppose we know a loop will roll at most 3 times at runtime. How should we update the profile
information so that expected_loop_iterations knows this value? (I got lost here on the
edge-probability issue.)
Thanks,
Changpeng
>Richard.
* Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-15 8:34 ` Fang, Changpeng
@ 2010-12-15 9:22 ` Xinliang David Li
2010-12-15 10:00 ` Zdenek Dvorak
0 siblings, 1 reply; 36+ messages in thread
From: Xinliang David Li @ 2010-12-15 9:22 UTC (permalink / raw)
To: Fang, Changpeng; +Cc: Richard Guenther, Zdenek Dvorak, gcc-patches
On Tue, Dec 14, 2010 at 10:22 PM, Fang, Changpeng
<Changpeng.Fang@amd.com> wrote:
>
>
> ________________________________________
> From: Richard Guenther [richard.guenther@gmail.com]
> Sent: Tuesday, December 14, 2010 8:35 PM
> To: Zdenek Dvorak
> Cc: Fang, Changpeng; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
>
> On Tue, Dec 14, 2010 at 10:05 PM, Zdenek Dvorak <rakdver@kam.mff.cuni.cz> wrote:
>> Hi,
>>
>>> >why not simply change the profile updating to correctly indicate that these loops do not roll?
>>> >That way, all the optimizations would profit, not just those aware of the new bb flag,
>>>
>>> Maybe my understanding is not correct. But I feel not comfortable using profile of trip count
>>> to guard loop optimizations.
>>
>> it is already used that way; i.e., you do not need to change anything in the optimizations, just
>> make sure that the edge probabilities are sensible.
>>
>>> For a given program, different data sizes will result in quite different
>>> loop trip counts.
>>
>> That should not be the case -- for the pre/post loops generated in vectorization, we know the
>> expected # of iterations, based on their purpose; e.g., for loops inserted so that the # of iterations
>> is divisible by 4, we know that the loop will iterate at most three times (and probably less), etc.
>>
>>> By the way, what optimizations else do you think will benefit from disabling for small trip count
>>> loops, significantly?
>>
>> Anything where we check whether we should optimize for speed or code size,
>
>>I agree with Zdenek (without having looked at the patch so far).
>
> I think my patch (adding a bb flag) provides a simple and yet effective solution for the unnecessary
> code expansion problem in prefetching, unswitching, and loop unrolling. However, I don't mind
> updating the profile information for the same purpose.
>
> Now, suppose we know a loop will roll at most 3 times at runtime. How should we update the profile
> information to let the expected_loop_iterations to know this value? ( I got lost here about the
> edge probabilities issues)
>
With profile data, the average loop trip count is recorded in the
nb_iterations_estimate field of the loop structure. You can get it
from estimated_loop_iterations_int (loop, false). For the
small trip count loops introduced by the optimization, you can use
the interface record_niter_bound (..., true, ..) to record a realistic
estimate, which may or may not be an upper bound.
In general, without FDO, gcc does not estimate loop iteration counts
according to the back-edge probability computed by static prediction
(predict.c). This is less than ideal. For instance, when
__builtin_expect is used to annotate the loop bound, the information
will be lost.
David
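The distinction between a realistic estimate and an upper bound can be illustrated with a toy model. This is a sketch under stated assumptions, not GCC's implementation: the struct, the toy_ functions, the use of unsigned long instead of double_int, the "keep the smaller value" rule and the -1 "unknown" result are all simplifications introduced here. Only the field name nb_iterations_estimate and the (bound, realistic, upper) argument order come from the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the iteration-count fields of a loop (assumed layout).  */
struct toy_niter_loop
{
  bool any_estimate;
  unsigned long nb_iterations_estimate;
  bool any_upper_bound;
  unsigned long nb_iterations_upper_bound;
};

/* Record BOUND for LOOP.  REALISTIC means BOUND is a plausible typical
   trip count; UPPER means the trip count can never exceed BOUND.  The
   smaller of the old and new values is kept (an assumption of this toy).  */
void
toy_record_niter_bound (struct toy_niter_loop *loop, unsigned long bound,
                        bool realistic, bool upper)
{
  assert (loop);

  if (upper
      && (!loop->any_upper_bound || bound < loop->nb_iterations_upper_bound))
    {
      loop->any_upper_bound = true;
      loop->nb_iterations_upper_bound = bound;
    }

  if (realistic
      && (!loop->any_estimate || bound < loop->nb_iterations_estimate))
    {
      loop->any_estimate = true;
      loop->nb_iterations_estimate = bound;
    }
}

/* Return the recorded estimate, or -1 if none has been recorded.  */
long
toy_estimated_loop_iterations_int (const struct toy_niter_loop *loop)
{
  assert (loop);
  return loop->any_estimate ? (long) loop->nb_iterations_estimate : -1;
}
```

In this model a peeling transformation would call toy_record_niter_bound (loop, 3, true, false) on the loop it creates, and any later pass that consults the estimate would see a small trip count without knowing where the loop came from.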
> Thanks,
>
> Changpeng
>
>
>
>
>
>>Richard.
>
* Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-15 9:22 ` Xinliang David Li
@ 2010-12-15 10:00 ` Zdenek Dvorak
2010-12-15 16:46 ` Fang, Changpeng
` (2 more replies)
0 siblings, 3 replies; 36+ messages in thread
From: Zdenek Dvorak @ 2010-12-15 10:00 UTC (permalink / raw)
To: Xinliang David Li; +Cc: Fang, Changpeng, Richard Guenther, gcc-patches
Hi,
> >>> >why not simply change the profile updating to correctly indicate that these loops do not roll?
> >>> >That way, all the optimizations would profit, not just those aware of the new bb flag,
> >>>
> >>> Maybe my understanding is not correct. But I feel not comfortable using profile of trip count
> >>> to guard loop optimizations.
> >>
> >> it is already used that way; i.e., you do not need to change anything in the optimizations, just
> >> make sure that the edge probabilities are sensible.
> >>
> >>> For a given program, different data sizes will result in quite different
> >>> loop trip counts.
> >>
> >> That should not be the case -- for the pre/post loops generated in vectorization, we know the
> >> expected # of iterations, based on their purpose; e.g., for loops inserted so that the # of iterations
> >> is divisible by 4, we know that the loop will iterate at most three times (and probably less), etc.
> >>
> >>> By the way, what optimizations else do you think will benefit from disabling for small trip count
> >>> loops, significantly?
> >>
> >> Anything where we check whether we should optimize for speed or code size,
> >
> >>I agree with Zdenek (without having looked at the patch so far).
> >
> > I think my patch (adding a bb flag) provides a simple and yet effective solution for the unnecessary
> > code expansion problem in prefetching, unswitching, and loop unrolling. However, I don't mind
> > updating the profile information for the same purpose.
> >
> > Now, suppose we know a loop will roll at most 3 times at runtime. How should we update the profile
> > information to let the expected_loop_iterations to know this value? ( I got lost here about the
> > edge probabilities issues)
> >
>
>
> In general, without FDO, gcc does not estimate loop iteration
> according to the back-edge probability computed by static prediction
> (predict.c). This is less than ideal. For instance, when
> builtin_expect is used to annotate the loop bound, the information
> will be lost.
hmmm.... I forgot about this. OK, I withdraw my objection to the patch, although
I would suggest the following changes:
-- rename BB_PRE_POST_LOOP_HEADER to something like BB_HEADER_OF_NONROLLING_LOOP,
-- in estimate_numbers_of_iterations_loop, for loops with this flag use
record_niter_bound (loop, double_int_two, true, false)
to make tree-level loop optimizations know that the loop does not roll,
-- the check for the flag in loop_prefetch_arrays should not be needed, then.
Zdenek
* RE: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-15 10:00 ` Zdenek Dvorak
@ 2010-12-15 16:46 ` Fang, Changpeng
2010-12-15 16:47 ` Zdenek Dvorak
2010-12-15 17:08 ` Xinliang David Li
2010-12-16 12:09 ` Richard Guenther
2 siblings, 1 reply; 36+ messages in thread
From: Fang, Changpeng @ 2010-12-15 16:46 UTC (permalink / raw)
To: Zdenek Dvorak, Xinliang David Li; +Cc: Richard Guenther, gcc-patches
Hi,
>hmmm.... I forgot about this. OK, I withdraw my objection against the patch, although
>I would suggest the following changes:
>-- rename BB_PRE_POST_LOOP_HEADER to something like BB_HEADER_OF_NONROLLING_LOOP,
Thanks. I will do this.
>-- in estimate_numbers_of_iterations_loop, for loops with this flags use
> record_niter_bound (loop, double_int_two, true, false)
> to make tree-level loop optimizations know that the loop does not roll,
>-- the check for the flag in loop_prefetch_arrays should not be needed, then.
>Zdenek
I have a new idea about this. How about this: if the flag is on, we treat the loop as "optimize for size"?
That way, the loop is considered a cold area and the related optimizations are turned off on it.
Thanks,
Changpeng
* Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-15 16:46 ` Fang, Changpeng
@ 2010-12-15 16:47 ` Zdenek Dvorak
0 siblings, 0 replies; 36+ messages in thread
From: Zdenek Dvorak @ 2010-12-15 16:47 UTC (permalink / raw)
To: Fang, Changpeng; +Cc: Xinliang David Li, Richard Guenther, gcc-patches
Hi,
> >hmmm.... I forgot about this. OK, I withdraw my objection against the patch, although
> >I would suggest the following changes:
> >-- rename BB_PRE_POST_LOOP_HEADER to something like BB_HEADER_OF_NONROLLING_LOOP,
>
> Thanks. I will do this.
>
> >-- in estimate_numbers_of_iterations_loop, for loops with this flags use
> > record_niter_bound (loop, double_int_two, true, false)
> > to make tree-level loop optimizations know that the loop does not roll,
> >-- the check for the flag in loop_prefetch_arrays should not be needed, then.
> >Zdenek
>
> I have a new idea about this. How about, "if the flag is ON, we consider the loop as "optimize for size"")?
> In this way, we will consider the loop as a cold area and turn off related optimizations on it.
yes, modifying optimize_loop_for_size_p is also a good idea,
Zdenek
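The appeal of routing the flag through optimize_loop_for_size_p is that passes which already consult that predicate need no per-pass check. A minimal sketch of the idea follows, with the caveat that the struct, its fields and every toy_ function are hypothetical; the real predicate's inputs are not shown in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a loop; the field names are illustrative only.  */
struct toy_size_loop
{
  bool in_cold_region;  /* Stands in for the existing profile-based test.  */
  bool nonrolling;      /* Stands in for the proposed header flag.  */
};

/* Treat a loop as "optimize for size" if it is cold or flagged as
   non-rolling.  */
bool
toy_optimize_loop_for_size_p (const struct toy_size_loop *loop)
{
  assert (loop);
  return loop->in_cold_region || loop->nonrolling;
}

/* Two toy passes that consult only the shared predicate, so neither
   needs to know about the flag itself.  */
bool
toy_unswitch_wanted_p (const struct toy_size_loop *loop)
{
  return !toy_optimize_loop_for_size_p (loop);
}

bool
toy_prefetch_wanted_p (const struct toy_size_loop *loop)
{
  return !toy_optimize_loop_for_size_p (loop);
}
```

The trade-off in this design is coarseness: a flagged loop is treated exactly like a cold one, so a pass cannot distinguish "rarely executed" from "executed often but for few iterations".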
* Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-15 10:00 ` Zdenek Dvorak
2010-12-15 16:46 ` Fang, Changpeng
@ 2010-12-15 17:08 ` Xinliang David Li
2010-12-16 12:09 ` Richard Guenther
2 siblings, 0 replies; 36+ messages in thread
From: Xinliang David Li @ 2010-12-15 17:08 UTC (permalink / raw)
To: Zdenek Dvorak; +Cc: Fang, Changpeng, Richard Guenther, gcc-patches
One more thing about FDO: using the average trip count can be misleading
too. However, if loops were multi-versioned according to trip-count value
profiling (currently missing), it would be more precise.
David
On Wed, Dec 15, 2010 at 1:22 AM, Zdenek Dvorak <rakdver@kam.mff.cuni.cz> wrote:
> Hi,
>
>> >>> >why not simply change the profile updating to correctly indicate that these loops do not roll?
>> >>> >That way, all the optimizations would profit, not just those aware of the new bb flag,
>> >>>
>> >>> Maybe my understanding is not correct. But I feel not comfortable using profile of trip count
>> >>> to guard loop optimizations.
>> >>
>> >> it is already used that way; i.e., you do not need to change anything in the optimizations, just
>> >> make sure that the edge probabilities are sensible.
>> >>
>> >>> For a given program, different data sizes will result in quite different
>> >>> loop trip counts.
>> >>
>> >> That should not be the case -- for the pre/post loops generated in vectorization, we know the
>> >> expected # of iterations, based on their purpose; e.g., for loops inserted so that the # of iterations
>> >> is divisible by 4, we know that the loop will iterate at most three times (and probably less), etc.
>> >>
>> >>> By the way, what optimizations else do you think will benefit from disabling for small trip count
>> >>> loops, significantly?
>> >>
>> >> Anything where we check whether we should optimize for speed or code size,
>> >
>> >>I agree with Zdenek (without having looked at the patch so far).
>> >
>> > I think my patch (adding a bb flag) provides a simple and yet effective solution for the unnecessary
>> > code expansion problem in prefetching, unswitching, and loop unrolling. However, I don't mind
>> > updating the profile information for the same purpose.
>> >
>> > Now, suppose we know a loop will roll at most 3 times at runtime. How should we update the profile
>> > information to let the expected_loop_iterations to know this value? ( I got lost here about the
>> > edge probabilities issues)
>> >
>>
>>
>> In general, without FDO, gcc does not estimate loop iteration
>> according to the back-edge probability computed by static prediction
>> (predict.c). This is less than ideal. For instance, when
>> builtin_expect is used to annotate the loop bound, the information
>> will be lost.
>
> hmmm.... I forgot about this. OK, I withdraw my objection against the patch, although
> I would suggest the following changes:
> -- rename BB_PRE_POST_LOOP_HEADER to something like BB_HEADER_OF_NONROLLING_LOOP,
> -- in estimate_numbers_of_iterations_loop, for loops with this flags use
> record_niter_bound (loop, double_int_two, true, false)
> to make tree-level loop optimizations know that the loop does not roll,
> -- the check for the flag in loop_prefetch_arrays should not be needed, then.
>
> Zdenek
>
* Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-15 10:00 ` Zdenek Dvorak
2010-12-15 16:46 ` Fang, Changpeng
2010-12-15 17:08 ` Xinliang David Li
@ 2010-12-16 12:09 ` Richard Guenther
2010-12-16 12:41 ` Zdenek Dvorak
2 siblings, 1 reply; 36+ messages in thread
From: Richard Guenther @ 2010-12-16 12:09 UTC (permalink / raw)
To: Zdenek Dvorak; +Cc: Xinliang David Li, Fang, Changpeng, gcc-patches
2010/12/15 Zdenek Dvorak <rakdver@kam.mff.cuni.cz>:
> Hi,
>
>> >>> >why not simply change the profile updating to correctly indicate that these loops do not roll?
>> >>> >That way, all the optimizations would profit, not just those aware of the new bb flag,
>> >>>
>> >>> Maybe my understanding is not correct. But I feel not comfortable using profile of trip count
>> >>> to guard loop optimizations.
>> >>
>> >> it is already used that way; i.e., you do not need to change anything in the optimizations, just
>> >> make sure that the edge probabilities are sensible.
>> >>
>> >>> For a given program, different data sizes will result in quite different
>> >>> loop trip counts.
>> >>
>> >> That should not be the case -- for the pre/post loops generated in vectorization, we know the
>> >> expected # of iterations, based on their purpose; e.g., for loops inserted so that the # of iterations
>> >> is divisible by 4, we know that the loop will iterate at most three times (and probably less), etc.
>> >>
>> >>> By the way, what optimizations else do you think will benefit from disabling for small trip count
>> >>> loops, significantly?
>> >>
>> >> Anything where we check whether we should optimize for speed or code size,
>> >
>> >>I agree with Zdenek (without having looked at the patch so far).
>> >
>> > I think my patch (adding a bb flag) provides a simple and yet effective solution for the unnecessary
>> > code expansion problem in prefetching, unswitching, and loop unrolling. However, I don't mind
>> > updating the profile information for the same purpose.
>> >
>> > Now, suppose we know a loop will roll at most 3 times at runtime. How should we update the profile
>> > information to let the expected_loop_iterations to know this value? ( I got lost here about the
>> > edge probabilities issues)
>> >
>>
>>
>> In general, without FDO, gcc does not estimate loop iteration
>> according to the back-edge probability computed by static prediction
>> (predict.c). This is less than ideal. For instance, when
>> builtin_expect is used to annotate the loop bound, the information
>> will be lost.
>
> hmmm.... I forgot about this. OK, I withdraw my objection against the patch, although
> I would suggest the following changes:
> -- rename BB_PRE_POST_LOOP_HEADER to something like BB_HEADER_OF_NONROLLING_LOOP,
> -- in estimate_numbers_of_iterations_loop, for loops with this flag use
> record_niter_bound (loop, double_int_two, true, false)
> to make tree-level loop optimizations know that the loop does not roll,
> -- the check for the flag in loop_prefetch_arrays should not be needed, then.
Btw, it would be nice if number-of-iteration analysis would figure out an
upper bound for niter for the typical prologue loops (which have exit
tests like i < niter & CST). It's of course more difficult for epilogues
where we'd need to figure out the exit test and increment of a preceding
loop.
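To illustrate the kind of bound meant here, the following is a minimal hand-written sketch, not vectorizer output. It reads the exit test as i < (niter & CST) with CST assumed to be 3, and the function name is hypothetical. Because the trip count is masked with a constant, it can never exceed that constant, which is the upper bound number-of-iterations analysis could in principle derive.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: a hand-written analogue of a peeled prologue loop.
   The trip count is masked with a constant (3 is an assumed value), so it
   can never exceed that constant, whatever N is.  */
static size_t
prologue_trip_count (size_t n)
{
  size_t niter = n & 3;   /* exit bound of the form "niter & CST" */
  size_t i, trips = 0;

  for (i = 0; i < niter; i++)
    trips++;              /* stands in for the scalar loop body */

  assert (trips <= 3);    /* the bound that niter analysis could derive */
  return trips;
}
```

Calling prologue_trip_count with any argument returns a value between 0 and 3, independent of how large the argument is.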
Btw, any reason why we do not use static profiles for number of iteration
estimates? We after all _do_ use the static profile to guide the
maybe_hot/cold_bb tests.
Richard.
> Zdenek
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-16 12:09 ` Richard Guenther
@ 2010-12-16 12:41 ` Zdenek Dvorak
2010-12-16 18:26 ` Fang, Changpeng
0 siblings, 1 reply; 36+ messages in thread
From: Zdenek Dvorak @ 2010-12-16 12:41 UTC (permalink / raw)
To: Richard Guenther; +Cc: Xinliang David Li, Fang, Changpeng, gcc-patches
Hi,
> Btw, any reason why we do not use static profiles for number of iteration
> estimates? We after all _do_ use the static profile to guide the
> maybe_hot/cold_bb tests.
for loops for which we cannot determine the # of iterations statically,
basically the only important predictors are PRED_LOOP_BRANCH and
PRED_LOOP_EXIT, which predict that the loop will iterate about 10 times. So,
by using static profile, we would just learn that every such loop is expected
to iterate 10 times, which is kind of useless,
Zdenek
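The arithmetic behind a figure like "about 10 times" can be sketched as follows. If a static predictor assigns the same probability p of leaving the loop at every exit test, the implied expected trip count is 1/p (the mean of a geometric distribution), so every loop governed by the same predictor gets the same estimate. The helper name is hypothetical, and an exit probability of 10% is an assumption chosen to match the figure above; the actual predictor hit rates are not quoted in this thread.

```c
#include <assert.h>

/* Hypothetical helper, not a GCC function: expected number of iterations
   of a loop whose exit test leaves the loop with probability EXIT_PROB on
   every evaluation.  This is the mean of a geometric distribution, 1 / p.  */
static double
expected_iterations (double exit_prob)
{
  assert (exit_prob > 0.0 && exit_prob <= 1.0);
  return 1.0 / exit_prob;
}
```

With an assumed exit probability of 0.1 the estimate is roughly ten iterations for every such loop, which carries no loop-specific information.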
^ permalink raw reply [flat|nested] 36+ messages in thread
* RE: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-16 12:41 ` Zdenek Dvorak
@ 2010-12-16 18:26 ` Fang, Changpeng
2010-12-16 20:06 ` Zdenek Dvorak
2010-12-19 0:29 ` Richard Guenther
0 siblings, 2 replies; 36+ messages in thread
From: Fang, Changpeng @ 2010-12-16 18:26 UTC (permalink / raw)
To: Zdenek Dvorak, Richard Guenther; +Cc: Xinliang David Li, gcc-patches
My initial intention is not to unroll prologue and epilogue loops. An estimated trip count
may not be that useful for the unrolling decision. To me, unrolling a loop that has at most
3 (or 7) iterations does not make sense. RTL unrolling does not use the estimated trip
count to determine the unroll factor, and thus it may still unroll the loop 4 or 8 times if
the loop is small (in # of insns). To keep things simple, we just don't unroll such loops.
However, a prologue or epilogue loop may still be a hot loop, depending on the outer
loops. It may still be beneficial to perform other optimizations on such loops, as long as the code
size is not expanded multiple times.
For prefetching of prologue or epilogue loops, we have two choices: (1) prefetching but
not unrolling, (2) not prefetching. Which one do you prefer?
Thanks,
Changpeng
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-16 18:26 ` Fang, Changpeng
@ 2010-12-16 20:06 ` Zdenek Dvorak
2010-12-17 3:53 ` Fang, Changpeng
2010-12-19 0:29 ` Richard Guenther
1 sibling, 1 reply; 36+ messages in thread
From: Zdenek Dvorak @ 2010-12-16 20:06 UTC (permalink / raw)
To: Fang, Changpeng; +Cc: Richard Guenther, Xinliang David Li, gcc-patches
Hi,
> For prefetching of prologue or epilogue loops, we have two choices: (1) prefetching but
> not unrolling, (2) not prefetching. Which one do you prefer?
it is better not to prefetch (the current placement of prefetches is not good for non-rolling
loops),
Zdenek
^ permalink raw reply [flat|nested] 36+ messages in thread
* RE: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-16 20:06 ` Zdenek Dvorak
@ 2010-12-17 3:53 ` Fang, Changpeng
2010-12-17 6:36 ` Jack Howarth
0 siblings, 1 reply; 36+ messages in thread
From: Fang, Changpeng @ 2010-12-17 3:53 UTC (permalink / raw)
To: Zdenek Dvorak; +Cc: Richard Guenther, Xinliang David Li, gcc-patches
[-- Attachment #1: Type: text/plain, Size: 1201 bytes --]
Hi,
Based on previous discussions, I modified the patch as follows.
If a loop is marked as non-rolling, optimize_loop_for_size_p returns TRUE
and optimize_loop_for_speed_p returns FALSE. All users of these two
functions will be affected.
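The behaviour described above can be modelled outside GCC with a small stand-alone sketch. The struct types and the "hot" field below are simplified stand-ins invented for illustration (the real predicates defer to optimize_bb_for_size_p and optimize_bb_for_speed_p on the loop header); only the flag handling mirrors the attached patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for GCC's basic_block and struct loop; only the
   fields needed for this illustration are modelled.  */
enum { BB_HEADER_OF_NONROLLING_LOOP = 1 << 12 };

struct bb_model { int flags; bool hot; };
struct loop_model { struct bb_model *header; };

/* Mark LOOP as non-rolling by setting the flag on its header block.  */
static void
mark_non_rolling_loop (struct loop_model *loop)
{
  assert (loop && loop->header);
  loop->header->flags |= BB_HEADER_OF_NONROLLING_LOOP;
}

/* Return true if LOOP's header carries the non-rolling flag.  */
static bool
non_rolling_loop_p (const struct loop_model *loop)
{
  assert (loop && loop->header);
  return (loop->header->flags & BB_HEADER_OF_NONROLLING_LOOP) != 0;
}

/* The "hot" bit is an assumed stand-in for optimize_bb_for_speed_p.  */
static bool
optimize_loop_for_speed_p (const struct loop_model *loop)
{
  if (non_rolling_loop_p (loop))
    return false;
  return loop->header->hot;
}

/* Likewise, "!hot" is an assumed stand-in for optimize_bb_for_size_p.  */
static bool
optimize_loop_for_size_p (const struct loop_model *loop)
{
  if (non_rolling_loop_p (loop))
    return true;
  return !loop->header->hot;
}
```

In this model a hot loop is optimized for speed until mark_non_rolling_loop is called on it, after which every caller of the two predicates sees it as a size-optimized loop.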
After applying the modified patch, pb05 compilation time decreases by 29% and binary
size decreases by 20%, while a small (0.5%) performance increase was observed, which may
be just noise.
The modified patch passed bootstrapping and gcc regression tests on x86_64-unknown-linux-gnu.
Is it OK to commit to trunk?
Thanks,
Changpeng
[-- Attachment #2: polyhedron1.txt --]
[-- Type: text/plain, Size: 2986 bytes --]
Comparison of Polyhedron (2005) Compile Time, Binary Size and Performance After Applying the NON-ROLLING Marking Patch
gfortran -Ofast -funroll-loops -march=amdfam10 %n.f90 -o %n
============================================================================================================================
|| | Before Patch | After Patch | Changes ||
||========================================================================================================================||
|| Benchmark | Compile Binary Size Run Time | Compile Binary Size Run Time | Compile Binary Performance ||
|| Name | Time (s) (bytes) (s) | Time(s) (bytes) (secs) | Time (%) Size(%) (%) ||
||========================================================================================================================||
|| ac | 3.36 41976 13.26 | 2.52 34424 13.21 | -25.00 -17.99 0.38 ||
|| aermod | 103.45 1412221 44.36 | 86.55 1268861 43.08 | -16.34 -10.15 2.97 ||
|| air | 6.11 75186 11.73 | 5.72 71090 11.56 | -6.38 -5.45 1.47 ||
|| capacita | 6.83 91257 88.40 | 4.70 74585 88.01 | -31.19 -18.27 0.44 ||
|| channel | 2.14 39984 6.65 | 1.84 35888 6.69 | -14.02 -10.24 -0.60 ||
|| doduc | 12.78 198624 38.59 | 12.20 186336 38.18 | -4.54 -6.19 1.07 ||
|| fatigue | 9.11 110008 10.15 | 5.93 92472 10.12 | -34.91 -15.94 0.30 ||
|| gas_dyn | 15.69 149726 7.14 | 8.45 109342 6.96 | -46.14 -26.97 2.59 ||
|| induct | 10.98 191800 20.66 | 10.62 188168 20.61 | -3.28 -1.89 0.24 ||
|| linpk | 2.27 46073 19.03 | 1.68 33401 19.03 | -25.99 -27.50 0.00 ||
|| mdbx | 5.63 103731 21.41 | 4.24 83251 21.33 | -24.69 -19.74 0.38 ||
|| nf | 14.18 118451 22.88 | 5.55 60499 23.14 | -60.86 -48.92 -1.12 ||
|| protein | 34.20 177700 47.04 | 19.16 135012 46.94 | -43.98 -24.02 0.21 ||
|| rnflow | 42.13 283645 40.30 | 20.92 178477 40.65 | -50.34 -37.08 -0.86 ||
|| test_fpu | 30.17 252080 14.46 | 14.44 149136 14.32 | -52.14 -40.84 0.98 ||
|| tfft | 1.50 32450 7.71 | 1.12 26546 7.67 | -25.33 -18.19 0.52 ||
||========================================================================================================================||
|| average | 19.57 | 19.46 | -29.07 -20.59 0.56 ||
============================================================================================================================
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: 0001-Don-t-perform-certain-loop-optimizations-on-pre-post.patch --]
[-- Type: text/x-patch; name="0001-Don-t-perform-certain-loop-optimizations-on-pre-post.patch", Size: 6190 bytes --]
From cd8b85bba1b39e108235f44d9d07918179ff3d79 Mon Sep 17 00:00:00 2001
From: Changpeng Fang <chfang@houghton.(none)>
Date: Mon, 13 Dec 2010 12:01:49 -0800
Subject: [PATCH] Don't perform certain loop optimizations on pre/post loops
* basic-block.h (bb_flags): Add a new flag BB_HEADER_OF_NONROLLING_LOOP.
* cfg.c (clear_bb_flags): Keep BB_HEADER_OF_NONROLLING marker.
* cfgloop.h (mark_non_rolling_loop): New function declaration.
(non_rolling_loop_p): New function declaration.
* predict.c (optimize_loop_for_size_p): Return true if the loop was marked
NON-ROLLING. (optimize_loop_for_speed_p): Return false if the loop was
marked NON-ROLLING.
* tree-ssa-loop-manip.c (tree_transform_and_unroll_loop): Mark the
non-rolling loop.
* tree-ssa-loop-niter.c (mark_non_rolling_loop): Implement the new
function. (non_rolling_loop_p): Implement the new function.
* tree-vect-loop-manip.c (vect_do_peeling_for_loop_bound): Mark the
non-rolling loop. (vect_do_peeling_for_alignment): Mark the non-rolling
loop.
---
gcc/basic-block.h | 6 +++++-
gcc/cfg.c | 7 ++++---
gcc/cfgloop.h | 2 ++
gcc/predict.c | 6 ++++++
gcc/tree-ssa-loop-manip.c | 3 +++
gcc/tree-ssa-loop-niter.c | 20 ++++++++++++++++++++
gcc/tree-vect-loop-manip.c | 8 ++++++++
7 files changed, 48 insertions(+), 4 deletions(-)
diff --git a/gcc/basic-block.h b/gcc/basic-block.h
index be0a1d1..850472d 100644
--- a/gcc/basic-block.h
+++ b/gcc/basic-block.h
@@ -245,7 +245,11 @@ enum bb_flags
/* Set on blocks that cannot be threaded through.
Only used in cfgcleanup.c. */
- BB_NONTHREADABLE_BLOCK = 1 << 11
+ BB_NONTHREADABLE_BLOCK = 1 << 11,
+
+ /* Set on blocks that are headers of non-rolling loops. */
+ BB_HEADER_OF_NONROLLING_LOOP = 1 << 12
+
};
/* Dummy flag for convenience in the hot/cold partitioning code. */
diff --git a/gcc/cfg.c b/gcc/cfg.c
index c8ef799..e59a637 100644
--- a/gcc/cfg.c
+++ b/gcc/cfg.c
@@ -425,8 +425,8 @@ redirect_edge_pred (edge e, basic_block new_pred)
connect_src (e);
}
-/* Clear all basic block flags, with the exception of partitioning and
- setjmp_target. */
+/* Clear all basic block flags, with the exception of partitioning,
+ setjmp_target, and the non-rolling loop marker. */
void
clear_bb_flags (void)
{
@@ -434,7 +434,8 @@ clear_bb_flags (void)
FOR_BB_BETWEEN (bb, ENTRY_BLOCK_PTR, NULL, next_bb)
bb->flags = (BB_PARTITION (bb)
- | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET)));
+ | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET
+ + BB_HEADER_OF_NONROLLING_LOOP)));
}
\f
/* Check the consistency of profile information. We can't do that
diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index bf2614e..e856a78 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -279,6 +279,8 @@ extern rtx doloop_condition_get (rtx);
void estimate_numbers_of_iterations_loop (struct loop *, bool);
HOST_WIDE_INT estimated_loop_iterations_int (struct loop *, bool);
bool estimated_loop_iterations (struct loop *, bool, double_int *);
+void mark_non_rolling_loop (struct loop *);
+bool non_rolling_loop_p (struct loop *);
/* Loop manipulation. */
extern bool can_duplicate_loop_p (const struct loop *loop);
diff --git a/gcc/predict.c b/gcc/predict.c
index c691990..bf729f8 100644
--- a/gcc/predict.c
+++ b/gcc/predict.c
@@ -279,6 +279,9 @@ optimize_insn_for_speed_p (void)
bool
optimize_loop_for_size_p (struct loop *loop)
{
+ /* Loops marked NON-ROLLING are not likely to be hot. */
+ if (non_rolling_loop_p (loop))
+ return true;
return optimize_bb_for_size_p (loop->header);
}
@@ -287,6 +290,9 @@ optimize_loop_for_size_p (struct loop *loop)
bool
optimize_loop_for_speed_p (struct loop *loop)
{
+ /* Loops marked NON-ROLLING are not likely to be hot. */
+ if (non_rolling_loop_p (loop))
+ return false;
return optimize_bb_for_speed_p (loop->header);
}
diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
index 87b2c0d..bc977bb 100644
--- a/gcc/tree-ssa-loop-manip.c
+++ b/gcc/tree-ssa-loop-manip.c
@@ -931,6 +931,9 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
gcc_assert (new_loop != NULL);
update_ssa (TODO_update_ssa);
+ /* NEW_LOOP is a non-rolling loop. */
+ mark_non_rolling_loop (new_loop);
+
/* Determine the probability of the exit edge of the unrolled loop. */
new_est_niter = est_niter / factor;
diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index ee85f6f..1e2e4b2 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -3011,6 +3011,26 @@ estimate_numbers_of_iterations (bool use_undefined_p)
fold_undefer_and_ignore_overflow_warnings ();
}
+/* Mark LOOP as a non-rolling loop. */
+
+void
+mark_non_rolling_loop (struct loop *loop)
+{
+ gcc_assert (loop && loop->header);
+ loop->header->flags |= BB_HEADER_OF_NONROLLING_LOOP;
+}
+
+/* Return true if LOOP is a non-rolling loop. */
+
+bool
+non_rolling_loop_p (struct loop *loop)
+{
+ int masked_flags;
+ gcc_assert (loop && loop->header);
+ masked_flags = (loop->header->flags & BB_HEADER_OF_NONROLLING_LOOP);
+ return (masked_flags != 0);
+}
+
/* Returns true if statement S1 dominates statement S2. */
bool
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 6ecd304..216de78 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1938,6 +1938,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio,
cond_expr, cond_expr_stmt_list);
gcc_assert (new_loop);
gcc_assert (loop_num == loop->num);
+
+ /* NEW_LOOP is a non-rolling loop. */
+ mark_non_rolling_loop (new_loop);
+
#ifdef ENABLE_CHECKING
slpeel_verify_cfg_after_peeling (loop, new_loop);
#endif
@@ -2191,6 +2195,10 @@ vect_do_peeling_for_alignment (loop_vec_info loop_vinfo)
th, true, NULL_TREE, NULL);
gcc_assert (new_loop);
+
+ /* NEW_LOOP is a non-rolling loop. */
+ mark_non_rolling_loop (new_loop);
+
#ifdef ENABLE_CHECKING
slpeel_verify_cfg_after_peeling (new_loop, loop);
#endif
--
1.6.3.3
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-17 3:53 ` Fang, Changpeng
@ 2010-12-17 6:36 ` Jack Howarth
2010-12-17 9:55 ` Fang, Changpeng
2010-12-17 21:45 ` Jack Howarth
0 siblings, 2 replies; 36+ messages in thread
From: Jack Howarth @ 2010-12-17 6:36 UTC (permalink / raw)
To: Fang, Changpeng
Cc: Zdenek Dvorak, Richard Guenther, Xinliang David Li, gcc-patches
On Thu, Dec 16, 2010 at 06:13:52PM -0600, Fang, Changpeng wrote:
> Hi,
>
> Based on previous discussions, I modified the patch as such.
>
> If a loop is marked as non-rolling, optimize_loop_for_size_p returns TRUE
> and optimize_loop_for_speed_p returns FALSE. All users of these two
> functions will be affected.
>
> After applying the modified patch, pb05 compilation time decreases 29%, binary
> size decreases 20%, while a small (0.5%) performance increase was found which may
> be just noise.
>
> Modified patch passed bootstrapping and gcc regression tests on x86_64-unknown-linux-gnu.
>
> Is it OK to commit to trunk?
Changpeng,
On x86_64-apple-darwin10, I am finding a more severe penalty for this patch with
the pb05 benchmarks. Using a profiledbootstrap BOOT_CFLAGS="-g -O3" build with...
Configured with: ../gcc-4.6-20101216/configure --prefix=/sw --prefix=/sw/lib/gcc4.6 --mandir=/sw/share/man --infodir=/sw/lib/gcc4.6/info --enable-languages=c,c++,fortran,objc,obj-c++,java --with-gmp=/sw --with-libiconv-prefix=/sw --with-ppl=/sw --with-cloog=/sw --with-mpc=/sw --with-system-zlib --x-includes=/usr/X11R6/include --x-libraries=/usr/X11R6/lib --program-suffix=-fsf-4.6 --enable-checking=yes --enable-cloog-backend=isl --enable-build-with-cxx
I get without the patch...
================================================================================
Date & Time : 16 Dec 2010 23:36:27
Test Name : gfortran_lin_O3
Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n
Benchmarks : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times : 2000.0
Target Error % : 0.100
Minimum Repeats : 10
Maximum Repeats : 100
Benchmark Compile Executable Ave Run Number Estim
Name (secs) (bytes) (secs) Repeats Err %
--------- ------- ---------- ------- ------- ------
ac 1.76 10000 8.78 12 0.0081
aermod 54.94 10000 17.28 10 0.0307
air 3.44 10000 5.53 13 0.0734
capacita 2.64 10000 32.65 10 0.0096
channel 0.89 10000 1.84 20 0.0977
doduc 8.12 10000 27.00 10 0.0132
fatigue 3.06 10000 8.36 10 0.0104
gas_dyn 5.00 10000 4.30 17 0.0915
induct 6.24 10000 12.42 10 0.0100
linpk 1.02 10000 15.50 12 0.0542
mdbx 2.55 10000 11.24 10 0.0256
nf 2.90 10000 30.16 20 0.0989
protein 7.98 10000 33.72 10 0.0070
rnflow 9.34 10000 23.21 10 0.0551
test_fpu 6.72 10000 8.05 10 0.0426
tfft 0.76 10000 1.87 10 0.0597
Geometric Mean Execution Time = 10.87 seconds
and with the patch...
================================================================================
Date & Time : 16 Dec 2010 21:31:06
Test Name : gfortran_lin_O3
Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n
Benchmarks : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times : 2000.0
Target Error % : 0.100
Minimum Repeats : 10
Maximum Repeats : 100
Benchmark Compile Executable Ave Run Number Estim
Name (secs) (bytes) (secs) Repeats Err %
--------- ------- ---------- ------- ------- ------
ac 1.19 10000 8.78 10 0.0099
aermod 47.91 10000 16.95 10 0.0123
air 2.85 10000 5.34 12 0.0715
capacita 1.63 10000 33.10 10 0.0361
channel 0.67 10000 1.87 10 0.0884
doduc 6.42 10000 27.35 10 0.0206
fatigue 2.10 10000 8.32 10 0.0194
gas_dyn 2.07 10000 4.30 17 0.0843
induct 5.38 10000 12.58 10 0.0088
linpk 0.71 10000 15.69 18 0.0796
mdbx 1.95 10000 11.41 10 0.0238
nf 1.24 10000 31.34 12 0.0991
protein 3.88 10000 35.13 10 0.0659
rnflow 4.73 10000 25.97 10 0.0629
test_fpu 3.66 10000 8.88 11 0.0989
tfft 0.52 10000 1.89 10 0.0403
Geometric Mean Execution Time = 11.09 seconds
This shows about a 2.0% performance reduction in the Geometric
Mean Execution Time. I would note that intel darwin now defaults
to -mtune=core2 and always has defaulted to -fPIC.
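For reference, the percentages can be rechecked from the two tables above with a small helper. This is a sketch only: the function name is hypothetical, and the inputs in the comments are copied from the reported geometric means and run times rather than recomputed from raw timings.

```c
/* Hypothetical helper: relative change from BEFORE to AFTER in percent.
   Positive values mean AFTER is larger (slower, for run times).  */
static double
percent_change (double before, double after)
{
  return (after - before) / before * 100.0;
}

/* Inputs taken from the tables above:
     geometric mean run time  10.87 -> 11.09   (about 2% slower)
     rnflow run time          23.21 -> 25.97   (about 12% slower)
     test_fpu run time         8.05 ->  8.88   (about 10% slower)  */
```

The overall figure is dominated by rnflow and test_fpu; most other benchmarks move by roughly 1% or less in either direction.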
Jack
^ permalink raw reply [flat|nested] 36+ messages in thread
* RE: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-17 6:36 ` Jack Howarth
@ 2010-12-17 9:55 ` Fang, Changpeng
2010-12-17 16:13 ` Jack Howarth
2011-01-04 3:33 ` Jack Howarth
2010-12-17 21:45 ` Jack Howarth
1 sibling, 2 replies; 36+ messages in thread
From: Fang, Changpeng @ 2010-12-17 9:55 UTC (permalink / raw)
To: Jack Howarth
Cc: Zdenek Dvorak, Richard Guenther, Xinliang David Li, gcc-patches
Hi, Jack:
Thanks for the testing.
This patch is not supposed to slow down a program by 10% (rnflow and test_fpu).
It would be helpful if you could provide an analysis of why they slowed down.
We did see a significant compilation time reduction for most pb05 programs.
(I don't know why you do not have executable size data.)
>I would note that intel darwin now defaults
>to -mtune=core2 and always has defaulted to -fPIC.
I could not understand these defaults for darwin. My understanding is that,
for x86_64, the default should be -mtune=generic.
Thanks,
Changpeng
________________________________________
From: Jack Howarth [howarth@bromo.med.uc.edu]
Sent: Thursday, December 16, 2010 11:31 PM
To: Fang, Changpeng
Cc: Zdenek Dvorak; Richard Guenther; Xinliang David Li; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
On Thu, Dec 16, 2010 at 06:13:52PM -0600, Fang, Changpeng wrote:
> Hi,
>
> Based on previous discussions, I modified the patch as such.
>
> If a loop is marked as non-rolling, optimize_loop_for_size_p returns TRUE
> and optimize_loop_for_speed_p returns FALSE. All users of these two
> functions will be affected.
>
> After applying the modified patch, pb05 compilation time decreases 29%, binary
> size decreases 20%, while a small (0.5%) performance increase was found which may
> be just noise.
>
> Modified patch passed bootstrapping and gcc regression tests on x86_64-unknown-linux-gnu.
>
> Is it OK to commit to trunk?
Changpeng,
On x86_64-apple-darwin10, I am finding a more severe penalty for this patch with
the pb05 benchmarks. Using a profiledbootstrap BOOT_CFLAGS="-g -O3" build with...
Configured with: ../gcc-4.6-20101216/configure --prefix=/sw --prefix=/sw/lib/gcc4.6 --mandir=/sw/share/man --infodir=/sw/lib/gcc4.6/info --enable-languages=c,c++,fortran,objc,obj-c++,java --with-gmp=/sw --with-libiconv-prefix=/sw --with-ppl=/sw --with-cloog=/sw --with-mpc=/sw --with-system-zlib --x-includes=/usr/X11R6/include --x-libraries=/usr/X11R6/lib --program-suffix=-fsf-4.6 --enable-checking=yes --enable-cloog-backend=isl --enable-build-with-cxx
I get without the patch...
================================================================================
Date & Time : 16 Dec 2010 23:36:27
Test Name : gfortran_lin_O3
Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n
Benchmarks : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times : 2000.0
Target Error % : 0.100
Minimum Repeats : 10
Maximum Repeats : 100
Benchmark Compile Executable Ave Run Number Estim
Name (secs) (bytes) (secs) Repeats Err %
--------- ------- ---------- ------- ------- ------
ac 1.76 10000 8.78 12 0.0081
aermod 54.94 10000 17.28 10 0.0307
air 3.44 10000 5.53 13 0.0734
capacita 2.64 10000 32.65 10 0.0096
channel 0.89 10000 1.84 20 0.0977
doduc 8.12 10000 27.00 10 0.0132
fatigue 3.06 10000 8.36 10 0.0104
gas_dyn 5.00 10000 4.30 17 0.0915
induct 6.24 10000 12.42 10 0.0100
linpk 1.02 10000 15.50 12 0.0542
mdbx 2.55 10000 11.24 10 0.0256
nf 2.90 10000 30.16 20 0.0989
protein 7.98 10000 33.72 10 0.0070
rnflow 9.34 10000 23.21 10 0.0551
test_fpu 6.72 10000 8.05 10 0.0426
tfft 0.76 10000 1.87 10 0.0597
Geometric Mean Execution Time = 10.87 seconds
and with the patch...
================================================================================
Date & Time : 16 Dec 2010 21:31:06
Test Name : gfortran_lin_O3
Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n
Benchmarks : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times : 2000.0
Target Error % : 0.100
Minimum Repeats : 10
Maximum Repeats : 100
Benchmark Compile Executable Ave Run Number Estim
Name (secs) (bytes) (secs) Repeats Err %
--------- ------- ---------- ------- ------- ------
ac 1.19 10000 8.78 10 0.0099
aermod 47.91 10000 16.95 10 0.0123
air 2.85 10000 5.34 12 0.0715
capacita 1.63 10000 33.10 10 0.0361
channel 0.67 10000 1.87 10 0.0884
doduc 6.42 10000 27.35 10 0.0206
fatigue 2.10 10000 8.32 10 0.0194
gas_dyn 2.07 10000 4.30 17 0.0843
induct 5.38 10000 12.58 10 0.0088
linpk 0.71 10000 15.69 18 0.0796
mdbx 1.95 10000 11.41 10 0.0238
nf 1.24 10000 31.34 12 0.0991
protein 3.88 10000 35.13 10 0.0659
rnflow 4.73 10000 25.97 10 0.0629
test_fpu 3.66 10000 8.88 11 0.0989
tfft 0.52 10000 1.89 10 0.0403
Geometric Mean Execution Time = 11.09 seconds
This shows about a 2.0% performance reduction in the Geometric
Mean Execution Time. I would note that intel darwin now defaults
to -mtune=core2 and always has defaulted to -fPIC.
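The roughly 2% figure can be reproduced from the two "Ave Run" columns above. The sketch below is only an illustration: the array and function names are made up, and it takes the 16th root as four successive square roots (Heron's iteration) so that it builds without libm. Using log() and exp() from <math.h> would be the more usual route.

```c
#include <assert.h>
#include <stddef.h>

#define PB05_COUNT 16

/* Average run times in seconds, copied from the two tables above in
   benchmark order: ac, aermod, air, ..., tfft. */
static const double run_without_patch[PB05_COUNT] = {
  8.78, 17.28, 5.53, 32.65, 1.84, 27.00, 8.36, 4.30,
  12.42, 15.50, 11.24, 30.16, 33.72, 23.21, 8.05, 1.87
};

static const double run_with_patch[PB05_COUNT] = {
  8.78, 16.95, 5.34, 33.10, 1.87, 27.35, 8.32, 4.30,
  12.58, 15.69, 11.41, 31.34, 35.13, 25.97, 8.88, 1.89
};

/* Square root by Heron's (Newton's) iteration.  X must be positive. */
double
heron_sqrt (double x)
{
  assert (x > 0.0);
  /* Start at or above the root so the iteration decreases towards it. */
  double guess = x > 1.0 ? x : 1.0;
  for (int i = 0; i < 200; i++)
    guess = 0.5 * (guess + x / guess);
  return guess;
}

/* Geometric mean of exactly PB05_COUNT (= 2^4) positive values: the
   16th root of the product is four successive square roots. */
double
geometric_mean16 (const double *values)
{
  double product = 1.0;
  for (size_t i = 0; i < PB05_COUNT; i++)
    product *= values[i];
  for (int i = 0; i < 4; i++)
    product = heron_sqrt (product);
  return product;
}

/* Relative change of the geometric mean, in percent.  Positive means
   the AFTER runs were slower. */
double
geomean_change_percent (const double *before, const double *after)
{
  return (geometric_mean16 (after) / geometric_mean16 (before) - 1.0)
         * 100.0;
}
```

Called on the two arrays, geometric_mean16 should land close to the reported 10.87 and 11.09 seconds, and geomean_change_percent close to 2%.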
Jack
>
> Thanks,
>
> Changpeng
>
> ________________________________________
> From: Zdenek Dvorak [rakdver@kam.mff.cuni.cz]
> Sent: Thursday, December 16, 2010 12:47 PM
> To: Fang, Changpeng
> Cc: Richard Guenther; Xinliang David Li; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
>
> Hi,
>
> > For prefetching of prologue or epilogue loops, we have two choices: (1) prefetching but
> > not unrolling, (2) not prefetching. Which one do you prefer?
>
> it is better not to prefetch (the current placement of prefetches is not good for non-rolling
> loops),
>
> Zdenek
>
Content-Description: polyhedron1.txt
> Comparison of Polyhedron (2005) Compile Time, Binary Size and Performance After Applying the NON-ROLLING Marking Patch
>
> gfortran -Ofast -funroll-loops -march=amdfam10 %n.f90 -o %n
> ============================================================================================================================
> || | Before Patch | After Patch | Changes ||
> ||========================================================================================================================||
> || Benchmark | Compile Binary Size Run Time | Compile Binary Size Run Time | Compile Binary Performance ||
> || Name | Time (s) (bytes) (s) | Time (s) (bytes) (s) | Time (%) Size (%) (%) ||
> ||========================================================================================================================||
> || ac | 3.36 41976 13.26 | 2.52 34424 13.21 | -25.00 -17.99 0.38 ||
> || aermod | 103.45 1412221 44.36 | 86.55 1268861 43.08 | -16.34 -10.15 2.97 ||
> || air | 6.11 75186 11.73 | 5.72 71090 11.56 | -6.38 -5.45 1.47 ||
> || capacita | 6.83 91257 88.40 | 4.70 74585 88.01 | -31.19 -18.27 0.44 ||
> || channel | 2.14 39984 6.65 | 1.84 35888 6.69 | -14.02 -10.24 -0.60 ||
> || doduc | 12.78 198624 38.59 | 12.20 186336 38.18 | -4.54 -6.19 1.07 ||
> || fatigue | 9.11 110008 10.15 | 5.93 92472 10.12 | -34.91 -15.94 0.30 ||
> || gas_dyn | 15.69 149726 7.14 | 8.45 109342 6.96 | -46.14 -26.97 2.59 ||
> || induct | 10.98 191800 20.66 | 10.62 188168 20.61 | -3.28 -1.89 0.24 ||
> || linpk | 2.27 46073 19.03 | 1.68 33401 19.03 | -25.99 -27.50 0.00 ||
> || mdbx | 5.63 103731 21.41 | 4.24 83251 21.33 | -24.69 -19.74 0.38 ||
> || nf | 14.18 118451 22.88 | 5.55 60499 23.14 | -60.86 -48.92 -1.12 ||
> || protein | 34.20 177700 47.04 | 19.16 135012 46.94 | -43.98 -24.02 0.21 ||
> || rnflow | 42.13 283645 40.30 | 20.92 178477 40.65 | -50.34 -37.08 -0.86 ||
> || test_fpu | 30.17 252080 14.46 | 14.44 149136 14.32 | -52.14 -40.84 0.98 ||
> || tfft | 1.50 32450 7.71 | 1.12 26546 7.67 | -25.33 -18.19 0.52 ||
> ||========================================================================================================================||
> || average | 19.57 | 19.46 | -29.07 -20.59 0.56 ||
> ============================================================================================================================
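The "Changes" columns appear to be derived from the raw columns as sketched below. The formulas are inferred from the numbers in the table (they reproduce the ac and aermod rows to two decimals), not stated in the post, and the function names are hypothetical.

```c
#include <assert.h>

/* Change of AFTER relative to BEFORE, in percent.  This matches the
   "Compile Time (%)" and "Binary Size (%)" columns; negative means the
   patched compiler was faster or produced a smaller binary. */
double
percent_change (double before, double after)
{
  assert (before > 0.0);
  return (after - before) / before * 100.0;
}

/* Run-time improvement in percent, relative to the time AFTER the
   patch.  This matches the "Performance (%)" column; positive means
   the patched build ran faster. */
double
speedup_percent (double time_before, double time_after)
{
  assert (time_after > 0.0);
  return (time_before - time_after) / time_after * 100.0;
}
```

Note the asymmetry: compile time and binary size are taken relative to the value before the patch, while the performance column is relative to the run time after the patch. For aermod, dividing by the earlier run time instead would give about 2.89 rather than the 2.97 shown.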
Content-Description: 0001-Don-t-perform-certain-loop-optimizations-on-pre-post.patch
> From cd8b85bba1b39e108235f44d9d07918179ff3d79 Mon Sep 17 00:00:00 2001
> From: Changpeng Fang <chfang@houghton.(none)>
> Date: Mon, 13 Dec 2010 12:01:49 -0800
> Subject: [PATCH] Don't perform certain loop optimizations on pre/post loops
>
> * basic-block.h (bb_flags): Add a new flag BB_HEADER_OF_NONROLLING_LOOP.
> * cfg.c (clear_bb_flags): Keep the BB_HEADER_OF_NONROLLING_LOOP marker.
> * cfgloop.h (mark_non_rolling_loop): New function declaration.
> (non_rolling_loop_p): New function declaration.
> * predict.c (optimize_loop_for_size_p): Return true if the loop was marked
> NON-ROLLING.
> (optimize_loop_for_speed_p): Return false if the loop was marked NON-ROLLING.
> * tree-ssa-loop-manip.c (tree_transform_and_unroll_loop): Mark the
> non-rolling loop.
> * tree-ssa-loop-niter.c (mark_non_rolling_loop): Implement the new
> function.
> (non_rolling_loop_p): Implement the new function.
> * tree-vect-loop-manip.c (vect_do_peeling_for_loop_bound): Mark the
> non-rolling loop.
> (vect_do_peeling_for_alignment): Mark the non-rolling loop.
> ---
> gcc/basic-block.h | 6 +++++-
> gcc/cfg.c | 7 ++++---
> gcc/cfgloop.h | 2 ++
> gcc/predict.c | 6 ++++++
> gcc/tree-ssa-loop-manip.c | 3 +++
> gcc/tree-ssa-loop-niter.c | 20 ++++++++++++++++++++
> gcc/tree-vect-loop-manip.c | 8 ++++++++
> 7 files changed, 48 insertions(+), 4 deletions(-)
>
> diff --git a/gcc/basic-block.h b/gcc/basic-block.h
> index be0a1d1..850472d 100644
> --- a/gcc/basic-block.h
> +++ b/gcc/basic-block.h
> @@ -245,7 +245,11 @@ enum bb_flags
>
> /* Set on blocks that cannot be threaded through.
> Only used in cfgcleanup.c. */
> - BB_NONTHREADABLE_BLOCK = 1 << 11
> + BB_NONTHREADABLE_BLOCK = 1 << 11,
> +
> + /* Set on blocks that are headers of non-rolling loops. */
> + BB_HEADER_OF_NONROLLING_LOOP = 1 << 12
> +
> };
>
> /* Dummy flag for convenience in the hot/cold partitioning code. */
> diff --git a/gcc/cfg.c b/gcc/cfg.c
> index c8ef799..e59a637 100644
> --- a/gcc/cfg.c
> +++ b/gcc/cfg.c
> @@ -425,8 +425,8 @@ redirect_edge_pred (edge e, basic_block new_pred)
> connect_src (e);
> }
>
> -/* Clear all basic block flags, with the exception of partitioning and
> - setjmp_target. */
> +/* Clear all basic block flags, with the exception of partitioning,
> + setjmp_target, and the non-rolling loop marker. */
> void
> clear_bb_flags (void)
> {
> @@ -434,7 +434,8 @@ clear_bb_flags (void)
>
> FOR_BB_BETWEEN (bb, ENTRY_BLOCK_PTR, NULL, next_bb)
> bb->flags = (BB_PARTITION (bb)
> - | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET)));
> + | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET
> + + BB_HEADER_OF_NONROLLING_LOOP)));
> }
>
> /* Check the consistency of profile information. We can't do that
> diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
> index bf2614e..e856a78 100644
> --- a/gcc/cfgloop.h
> +++ b/gcc/cfgloop.h
> @@ -279,6 +279,8 @@ extern rtx doloop_condition_get (rtx);
> void estimate_numbers_of_iterations_loop (struct loop *, bool);
> HOST_WIDE_INT estimated_loop_iterations_int (struct loop *, bool);
> bool estimated_loop_iterations (struct loop *, bool, double_int *);
> +void mark_non_rolling_loop (struct loop *);
> +bool non_rolling_loop_p (struct loop *);
>
> /* Loop manipulation. */
> extern bool can_duplicate_loop_p (const struct loop *loop);
> diff --git a/gcc/predict.c b/gcc/predict.c
> index c691990..bf729f8 100644
> --- a/gcc/predict.c
> +++ b/gcc/predict.c
> @@ -279,6 +279,9 @@ optimize_insn_for_speed_p (void)
> bool
> optimize_loop_for_size_p (struct loop *loop)
> {
> + /* Loops marked NON-ROLLING are not likely to be hot. */
> + if (non_rolling_loop_p (loop))
> + return true;
> return optimize_bb_for_size_p (loop->header);
> }
>
> @@ -287,6 +290,9 @@ optimize_loop_for_size_p (struct loop *loop)
> bool
> optimize_loop_for_speed_p (struct loop *loop)
> {
> + /* Loops marked NON-ROLLING are not likely to be hot. */
> + if (non_rolling_loop_p (loop))
> + return false;
> return optimize_bb_for_speed_p (loop->header);
> }
>
> diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
> index 87b2c0d..bc977bb 100644
> --- a/gcc/tree-ssa-loop-manip.c
> +++ b/gcc/tree-ssa-loop-manip.c
> @@ -931,6 +931,9 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
> gcc_assert (new_loop != NULL);
> update_ssa (TODO_update_ssa);
>
> + /* NEW_LOOP is a non-rolling loop. */
> + mark_non_rolling_loop (new_loop);
> +
> /* Determine the probability of the exit edge of the unrolled loop. */
> new_est_niter = est_niter / factor;
>
> diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
> index ee85f6f..1e2e4b2 100644
> --- a/gcc/tree-ssa-loop-niter.c
> +++ b/gcc/tree-ssa-loop-niter.c
> @@ -3011,6 +3011,26 @@ estimate_numbers_of_iterations (bool use_undefined_p)
> fold_undefer_and_ignore_overflow_warnings ();
> }
>
> +/* Mark LOOP as a non-rolling loop. */
> +
> +void
> +mark_non_rolling_loop (struct loop *loop)
> +{
> + gcc_assert (loop && loop->header);
> + loop->header->flags |= BB_HEADER_OF_NONROLLING_LOOP;
> +}
> +
> +/* Return true if LOOP is a non-rolling loop. */
> +
> +bool
> +non_rolling_loop_p (struct loop *loop)
> +{
> + int masked_flags;
> + gcc_assert (loop && loop->header);
> + masked_flags = (loop->header->flags & BB_HEADER_OF_NONROLLING_LOOP);
> + return (masked_flags != 0);
> +}
> +
> /* Returns true if statement S1 dominates statement S2. */
>
> bool
> diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> index 6ecd304..216de78 100644
> --- a/gcc/tree-vect-loop-manip.c
> +++ b/gcc/tree-vect-loop-manip.c
> @@ -1938,6 +1938,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio,
> cond_expr, cond_expr_stmt_list);
> gcc_assert (new_loop);
> gcc_assert (loop_num == loop->num);
> +
> + /* NEW_LOOP is a non-rolling loop. */
> + mark_non_rolling_loop (new_loop);
> +
> #ifdef ENABLE_CHECKING
> slpeel_verify_cfg_after_peeling (loop, new_loop);
> #endif
> @@ -2191,6 +2195,10 @@ vect_do_peeling_for_alignment (loop_vec_info loop_vinfo)
> th, true, NULL_TREE, NULL);
>
> gcc_assert (new_loop);
> +
> + /* NEW_LOOP is a non-rolling loop. */
> + mark_non_rolling_loop (new_loop);
> +
> #ifdef ENABLE_CHECKING
> slpeel_verify_cfg_after_peeling (new_loop, loop);
> #endif
> --
> 1.6.3.3
>
* Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-17 9:55 ` Fang, Changpeng
@ 2010-12-17 16:13 ` Jack Howarth
2010-12-17 16:48 ` Fang, Changpeng
2011-01-04 3:33 ` Jack Howarth
1 sibling, 1 reply; 36+ messages in thread
From: Jack Howarth @ 2010-12-17 16:13 UTC (permalink / raw)
To: Fang, Changpeng
Cc: Zdenek Dvorak, Richard Guenther, Xinliang David Li, gcc-patches
On Fri, Dec 17, 2010 at 01:14:49AM -0600, Fang, Changpeng wrote:
> Hi, Jack:
>
> Thanks for the testing.
>
> This patch is not supposed to slow down a program by 10% (rnflow and test_fpu).
> It would be helpful if you could provide an analysis of why they slowed down.
Unfortunately gprof is pretty broken on darwin10 but I can try to do profiling
in Shark with -fno-omit-frame-pointer. Could you try rebenchmarking with the flags...
-ffast-math -funroll-loops -O3 -fPIC -mtune=generic
I'll try benchmarking again with '-Ofast -funroll-loops -mtune=generic' and
see if the performance loss is reduced.
>
> We did see a significant compilation time reduction for most pb05 programs.
> (I don't know why you do not have executable size data).
The sources I have for pb05 aren't fixed to detect the code size properly
under darwin, and I never spent the time trying to fix that.
>
> >I would note that intel darwin now defaults
> >to -mtune=core2 and always has defaulted to -fPIC.
>
> I could not understand these defaults for darwin. My understanding is that,
> for x86_64, the default should be -mtune=generic.
This is a recent change.
Author: mrs
Date: Wed Dec 8 23:32:27 2010
New Revision: 167611
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=167611
Log:
2010-12-08 Iain Sandoe <iains@gcc.gnu.org>
gcc/config.gcc (with_cpu): Default i[34567]86-*-darwin* and
x86_64-*-darwin* to with_cpu:-core2.
gcc/config/i386/mmx.md (*mov<mode>_internal_rex64): Replace movq
with movd for darwin assembler.
gcc/config/i386/sse.md (*vec_concatv2di_rex64_sse4_1): Ditto.
(*vec_concatv2di_rex64_sse): Ditto.
Modified:
trunk/gcc/ChangeLog
trunk/gcc/config.gcc
trunk/gcc/config/i386/mmx.md
trunk/gcc/config/i386/sse.md
We had intended to do this a while back (since all darwin hardware is core2
or recent Xeon), but held off until the core2 cost tables were improved to
match those of generic.
http://gcc.gnu.org/ml/gcc-patches/2010-12/msg00000.html
>
> Thanks,
>
> Changpeng
>
* RE: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-17 16:13 ` Jack Howarth
@ 2010-12-17 16:48 ` Fang, Changpeng
2010-12-17 17:20 ` Jack Howarth
2010-12-17 18:01 ` Jack Howarth
0 siblings, 2 replies; 36+ messages in thread
From: Fang, Changpeng @ 2010-12-17 16:48 UTC (permalink / raw)
To: Jack Howarth
Cc: Zdenek Dvorak, Richard Guenther, Xinliang David Li, gcc-patches
Hi, Jack:
Have you tested my original patch from this thread (I remember you bootstrapped it)?
It uses the same logic, but applies it only to prefetching, unswitching and unrolling.
It would help a lot if you have data on that.
Thanks,
Changpeng
________________________________________
From: Jack Howarth [howarth@bromo.med.uc.edu]
Sent: Friday, December 17, 2010 9:12 AM
To: Fang, Changpeng
Cc: Zdenek Dvorak; Richard Guenther; Xinliang David Li; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
On Fri, Dec 17, 2010 at 01:14:49AM -0600, Fang, Changpeng wrote:
> Hi, Jack:
>
> Thanks for the testing.
>
> This patch is not supposed to slow down a program by 10% (rnflow and test_fpu).
> It would be helpful if you could provide an analysis of why they slowed down.
Unfortunately gprof is pretty broken on darwin10 but I can try to do profiling
in Shark with -fno-omit-frame-pointer. Could you try rebenchmarking with the flags...
-ffast-math -funroll-loops -O3 -fPIC -mtune=generic
I'll try benchmarking again with '-Ofast -funroll-loops -mtune=generic' and
see if the performance loss is reduced.
>
> We did see a significant compilation time reduction for most pb05 programs.
> (I don't know why you do not have executable size data).
The sources I have for pb05 aren't fixed to detect the code size properly
under darwin, and I never spent the time trying to fix that.
>
> >I would note that intel darwin now defaults
> >to -mtune=core2 and always has defaulted to -fPIC.
>
> I could not understand these defaults for darwin. My understanding is that,
> for x86_64, the default should be -mtune=generic.
This is a recent change.
Author: mrs
Date: Wed Dec 8 23:32:27 2010
New Revision: 167611
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=167611
Log:
2010-12-08 Iain Sandoe <iains@gcc.gnu.org>
gcc/config.gcc (with_cpu): Default i[34567]86-*-darwin* and
x86_64-*-darwin* to with_cpu:-core2.
gcc/config/i386/mmx.md (*mov<mode>_internal_rex64): Replace movq
with movd for darwin assembler.
gcc/config/i386/sse.md (*vec_concatv2di_rex64_sse4_1): Ditto.
(*vec_concatv2di_rex64_sse): Ditto.
Modified:
trunk/gcc/ChangeLog
trunk/gcc/config.gcc
trunk/gcc/config/i386/mmx.md
trunk/gcc/config/i386/sse.md
We had intended to do this a while back (since all darwin hardware is core2
or recent Xeon), but held off until the core2 cost tables were improved to
match those of generic.
http://gcc.gnu.org/ml/gcc-patches/2010-12/msg00000.html
>
> Thanks,
>
> Changpeng
>
> ________________________________________
> From: Jack Howarth [howarth@bromo.med.uc.edu]
> Sent: Thursday, December 16, 2010 11:31 PM
> To: Fang, Changpeng
> Cc: Zdenek Dvorak; Richard Guenther; Xinliang David Li; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
>
> On Thu, Dec 16, 2010 at 06:13:52PM -0600, Fang, Changpeng wrote:
> > Hi,
> >
> > Based on previous discussions, I modified the patch as such.
> >
> > If a loop is marked as non-rolling, optimize_loop_for_size_p returns TRUE
> > and optimize_loop_for_speed_p returns FALSE. All users of these two
> > functions will be affected.
> >
> > After applying the modified patch, pb05 compilation time decreases 29%, binary
> > size decreases 20%, while a small (0.5%) performance increase was found which may
> > be just noise.
> >
> > Modified patch passed bootstrapping and gcc regression tests on x86_64-unknown-linux-gnu.
> >
> > Is it OK to commit to trunk?
>
> Changpeng,
> On x86_64-apple-darwin10, I am finding a more severe penalty for this patch with
> the pb05 benchmarks. Using a profiledbootstrap BOOT_CFLAGS="-g -O3" build with...
>
> Configured with: ../gcc-4.6-20101216/configure --prefix=/sw --prefix=/sw/lib/gcc4.6 --mandir=/sw/share/man --infodir=/sw/lib/gcc4.6/info --enable-languages=c,c++,fortran,objc,obj-c++,java --with-gmp=/sw --with-libiconv-prefix=/sw --with-ppl=/sw --with-cloog=/sw --with-mpc=/sw --with-system-zlib --x-includes=/usr/X11R6/include --x-libraries=/usr/X11R6/lib --program-suffix=-fsf-4.6 --enable-checking=yes --enable-cloog-backend=isl --enable-build-with-cxx
>
> I get without the patch...
>
> ================================================================================
> Date & Time : 16 Dec 2010 23:36:27
> Test Name : gfortran_lin_O3
> Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n
> Benchmarks : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
> Maximum Times : 2000.0
> Target Error % : 0.100
> Minimum Repeats : 10
> Maximum Repeats : 100
>
> Benchmark  Compile  Executable  Ave Run   Number   Estim
> Name        (secs)     (bytes)   (secs)  Repeats   Err %
> ---------  -------  ----------  -------  -------  ------
> ac            1.76       10000     8.78       12  0.0081
> aermod       54.94       10000    17.28       10  0.0307
> air           3.44       10000     5.53       13  0.0734
> capacita      2.64       10000    32.65       10  0.0096
> channel       0.89       10000     1.84       20  0.0977
> doduc         8.12       10000    27.00       10  0.0132
> fatigue       3.06       10000     8.36       10  0.0104
> gas_dyn       5.00       10000     4.30       17  0.0915
> induct        6.24       10000    12.42       10  0.0100
> linpk         1.02       10000    15.50       12  0.0542
> mdbx          2.55       10000    11.24       10  0.0256
> nf            2.90       10000    30.16       20  0.0989
> protein       7.98       10000    33.72       10  0.0070
> rnflow        9.34       10000    23.21       10  0.0551
> test_fpu      6.72       10000     8.05       10  0.0426
> tfft          0.76       10000     1.87       10  0.0597
>
> Geometric Mean Execution Time = 10.87 seconds
>
> and with the patch...
>
> ================================================================================
> Date & Time : 16 Dec 2010 21:31:06
> Test Name : gfortran_lin_O3
> Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n
> Benchmarks : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
> Maximum Times : 2000.0
> Target Error % : 0.100
> Minimum Repeats : 10
> Maximum Repeats : 100
>
> Benchmark  Compile  Executable  Ave Run   Number   Estim
> Name        (secs)     (bytes)   (secs)  Repeats   Err %
> ---------  -------  ----------  -------  -------  ------
> ac            1.19       10000     8.78       10  0.0099
> aermod       47.91       10000    16.95       10  0.0123
> air           2.85       10000     5.34       12  0.0715
> capacita      1.63       10000    33.10       10  0.0361
> channel       0.67       10000     1.87       10  0.0884
> doduc         6.42       10000    27.35       10  0.0206
> fatigue       2.10       10000     8.32       10  0.0194
> gas_dyn       2.07       10000     4.30       17  0.0843
> induct        5.38       10000    12.58       10  0.0088
> linpk         0.71       10000    15.69       18  0.0796
> mdbx          1.95       10000    11.41       10  0.0238
> nf            1.24       10000    31.34       12  0.0991
> protein       3.88       10000    35.13       10  0.0659
> rnflow        4.73       10000    25.97       10  0.0629
> test_fpu      3.66       10000     8.88       11  0.0989
> tfft          0.52       10000     1.89       10  0.0403
>
> Geometric Mean Execution Time = 11.09 seconds
>
> This shows about a 2.0% performance reduction in the Geometric
> Mean Execution Time. I would note that intel darwin now defaults
> to -mtune=core2 and always has defaulted to -fPIC.
> Jack
>
> >
> > Thanks,
> >
> > Changpeng
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > ________________________________________
> > From: Zdenek Dvorak [rakdver@kam.mff.cuni.cz]
> > Sent: Thursday, December 16, 2010 12:47 PM
> > To: Fang, Changpeng
> > Cc: Richard Guenther; Xinliang David Li; gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
> >
> > Hi,
> >
> > > For prefetching of prologue or epilogue loops, we have two choices: (1) prefetching
> > > but not unrolling, (2) not prefetching. Which one do you prefer?
> >
> > it is better not to prefetch (the current placement of prefetches is not good for non-rolling
> > loops),
> >
> > Zdenek
> >
>
> Content-Description: polyhedron1.txt
> > Comparison of Polyhedron (2005) Compile Time, Binary Size and Performance After Applying the NON-ROLLING Marking Patch
> >
> > gfortran -Ofast -funroll-loops -march=amdfam10 %n.f90 -o %n
> > ===============================================================================================================
> > ||           |         Before Patch          |          After Patch          |            Changes            ||
> > ||===========================================================================================================||
> > || Benchmark |  Compile Binary Size Run Time |  Compile Binary Size Run Time |  Compile   Binary Performance ||
> > || Name      | Time (s)     (bytes)      (s) | Time (s)     (bytes)      (s) | Time (%) Size (%)         (%) ||
> > ||===========================================================================================================||
> > || ac        |     3.36       41976    13.26 |     2.52       34424    13.21 |   -25.00   -17.99        0.38 ||
> > || aermod    |   103.45     1412221    44.36 |    86.55     1268861    43.08 |   -16.34   -10.15        2.97 ||
> > || air       |     6.11       75186    11.73 |     5.72       71090    11.56 |    -6.38    -5.45        1.47 ||
> > || capacita  |     6.83       91257    88.40 |     4.70       74585    88.01 |   -31.19   -18.27        0.44 ||
> > || channel   |     2.14       39984     6.65 |     1.84       35888     6.69 |   -14.02   -10.24       -0.60 ||
> > || doduc     |    12.78      198624    38.59 |    12.20      186336    38.18 |    -4.54    -6.19        1.07 ||
> > || fatigue   |     9.11      110008    10.15 |     5.93       92472    10.12 |   -34.91   -15.94        0.30 ||
> > || gas_dyn   |    15.69      149726     7.14 |     8.45      109342     6.96 |   -46.14   -26.97        2.59 ||
> > || induct    |    10.98      191800    20.66 |    10.62      188168    20.61 |    -3.28    -1.89        0.24 ||
> > || linpk     |     2.27       46073    19.03 |     1.68       33401    19.03 |   -25.99   -27.50        0.00 ||
> > || mdbx      |     5.63      103731    21.41 |     4.24       83251    21.33 |   -24.69   -19.74        0.38 ||
> > || nf        |    14.18      118451    22.88 |     5.55       60499    23.14 |   -60.86   -48.92       -1.12 ||
> > || protein   |    34.20      177700    47.04 |    19.16      135012    46.94 |   -43.98   -24.02        0.21 ||
> > || rnflow    |    42.13      283645    40.30 |    20.92      178477    40.65 |   -50.34   -37.08       -0.86 ||
> > || test_fpu  |    30.17      252080    14.46 |    14.44      149136    14.32 |   -52.14   -40.84        0.98 ||
> > || tfft      |     1.50       32450     7.71 |     1.12       26546     7.67 |   -25.33   -18.19        0.52 ||
> > ||===========================================================================================================||
> > || average   |                         19.57 |                         19.46 |   -29.07   -20.59        0.56 ||
> > ===============================================================================================================
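For anyone checking the numbers: the Changes columns in the table above appear to use two conventions. Compile time and binary size look like a plain relative change against the "before" value, while the performance column looks like a speedup whose divisor is the "after" run time (so positive means faster). This is inferred from the rows, not stated in the attachment. A minimal sketch in C, with hypothetical function names:

```c
#include <assert.h>

/* Percent change of a cost metric (compile time or binary size).
   Negative means the patched compiler produced the smaller number. */
double cost_change_pct(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}

/* Percent speedup computed from run times. Positive means the patched
   binary ran faster. Note that the divisor is the AFTER time, which is
   the convention the table rows seem to follow. */
double speedup_pct(double before, double after)
{
    assert(after > 0.0);
    return (before / after - 1.0) * 100.0;
}
```

For the aermod row, cost_change_pct(103.45, 86.55) comes out at roughly -16.34 and speedup_pct(44.36, 43.08) at roughly 2.97, matching the table.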
>
> Content-Description: 0001-Don-t-perform-certain-loop-optimizations-on-pre-post.patch
> > From cd8b85bba1b39e108235f44d9d07918179ff3d79 Mon Sep 17 00:00:00 2001
> > From: Changpeng Fang <chfang@houghton.(none)>
> > Date: Mon, 13 Dec 2010 12:01:49 -0800
> > Subject: [PATCH] Don't perform certain loop optimizations on pre/post loops
> >
> > * basic-block.h (bb_flags): Add a new flag BB_HEADER_OF_NONROLLING_LOOP.
> > * cfg.c (clear_bb_flags): Keep BB_HEADER_OF_NONROLLING_LOOP marker.
> > * cfgloop.h (mark_non_rolling_loop): New function declaration.
> > (non_rolling_loop_p): New function declaration.
> > * predict.c (optimize_loop_for_size_p): Return true if the loop was marked
> > NON-ROLLING. (optimize_loop_for_speed_p): Return false if the loop was
> > marked NON-ROLLING.
> > * tree-ssa-loop-manip.c (tree_transform_and_unroll_loop): Mark the
> > non-rolling loop.
> > * tree-ssa-loop-niter.c (mark_non_rolling_loop): Implement the new
> > function. (non_rolling_loop_p): Implement the new function.
> > * tree-vect-loop-manip.c (vect_do_peeling_for_loop_bound): Mark the
> > non-rolling loop. (vect_do_peeling_for_alignment): Mark the non-rolling
> > loop.
> > ---
> >  gcc/basic-block.h          |    6 +++++-
> >  gcc/cfg.c                  |    7 ++++---
> >  gcc/cfgloop.h              |    2 ++
> >  gcc/predict.c              |    6 ++++++
> >  gcc/tree-ssa-loop-manip.c  |    3 +++
> >  gcc/tree-ssa-loop-niter.c  |   20 ++++++++++++++++++++
> >  gcc/tree-vect-loop-manip.c |    8 ++++++++
> >  7 files changed, 48 insertions(+), 4 deletions(-)
> >
> > diff --git a/gcc/basic-block.h b/gcc/basic-block.h
> > index be0a1d1..850472d 100644
> > --- a/gcc/basic-block.h
> > +++ b/gcc/basic-block.h
> > @@ -245,7 +245,11 @@ enum bb_flags
> >
> >    /* Set on blocks that cannot be threaded through.
> >       Only used in cfgcleanup.c. */
> > -  BB_NONTHREADABLE_BLOCK = 1 << 11
> > +  BB_NONTHREADABLE_BLOCK = 1 << 11,
> > +
> > +  /* Set on blocks that are headers of non-rolling loops. */
> > +  BB_HEADER_OF_NONROLLING_LOOP = 1 << 12
> > +
> >  };
> >
> >  /* Dummy flag for convenience in the hot/cold partitioning code. */
> > diff --git a/gcc/cfg.c b/gcc/cfg.c
> > index c8ef799..e59a637 100644
> > --- a/gcc/cfg.c
> > +++ b/gcc/cfg.c
> > @@ -425,8 +425,8 @@ redirect_edge_pred (edge e, basic_block new_pred)
> >    connect_src (e);
> >  }
> >
> > -/* Clear all basic block flags, with the exception of partitioning and
> > -   setjmp_target. */
> > +/* Clear all basic block flags, with the exception of partitioning,
> > +   setjmp_target, and the non-rolling loop marker. */
> >  void
> >  clear_bb_flags (void)
> >  {
> > @@ -434,7 +434,8 @@ clear_bb_flags (void)
> >
> >    FOR_BB_BETWEEN (bb, ENTRY_BLOCK_PTR, NULL, next_bb)
> >      bb->flags = (BB_PARTITION (bb)
> > -                 | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET)));
> > +                 | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET
> > +                                 + BB_HEADER_OF_NONROLLING_LOOP)));
> >  }
> >
> >  /* Check the consistency of profile information. We can't do that
> > diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
> > index bf2614e..e856a78 100644
> > --- a/gcc/cfgloop.h
> > +++ b/gcc/cfgloop.h
> > @@ -279,6 +279,8 @@ extern rtx doloop_condition_get (rtx);
> > void estimate_numbers_of_iterations_loop (struct loop *, bool);
> > HOST_WIDE_INT estimated_loop_iterations_int (struct loop *, bool);
> > bool estimated_loop_iterations (struct loop *, bool, double_int *);
> > +void mark_non_rolling_loop (struct loop *);
> > +bool non_rolling_loop_p (struct loop *);
> >
> > /* Loop manipulation. */
> > extern bool can_duplicate_loop_p (const struct loop *loop);
> > diff --git a/gcc/predict.c b/gcc/predict.c
> > index c691990..bf729f8 100644
> > --- a/gcc/predict.c
> > +++ b/gcc/predict.c
> > @@ -279,6 +279,9 @@ optimize_insn_for_speed_p (void)
> >  bool
> >  optimize_loop_for_size_p (struct loop *loop)
> >  {
> > +  /* Loops marked NON-ROLLING are not likely to be hot. */
> > +  if (non_rolling_loop_p (loop))
> > +    return true;
> >    return optimize_bb_for_size_p (loop->header);
> >  }
> >
> > @@ -287,6 +290,9 @@ optimize_loop_for_size_p (struct loop *loop)
> >  bool
> >  optimize_loop_for_speed_p (struct loop *loop)
> >  {
> > +  /* Loops marked NON-ROLLING are not likely to be hot. */
> > +  if (non_rolling_loop_p (loop))
> > +    return false;
> >    return optimize_bb_for_speed_p (loop->header);
> >  }
> >
> > diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
> > index 87b2c0d..bc977bb 100644
> > --- a/gcc/tree-ssa-loop-manip.c
> > +++ b/gcc/tree-ssa-loop-manip.c
> > @@ -931,6 +931,9 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
> >    gcc_assert (new_loop != NULL);
> >    update_ssa (TODO_update_ssa);
> >
> > +  /* NEW_LOOP is a non-rolling loop. */
> > +  mark_non_rolling_loop (new_loop);
> > +
> >    /* Determine the probability of the exit edge of the unrolled loop. */
> >    new_est_niter = est_niter / factor;
> >
> > diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
> > index ee85f6f..1e2e4b2 100644
> > --- a/gcc/tree-ssa-loop-niter.c
> > +++ b/gcc/tree-ssa-loop-niter.c
> > @@ -3011,6 +3011,26 @@ estimate_numbers_of_iterations (bool use_undefined_p)
> >    fold_undefer_and_ignore_overflow_warnings ();
> >  }
> >
> > +/* Mark LOOP as a non-rolling loop. */
> > +
> > +void
> > +mark_non_rolling_loop (struct loop *loop)
> > +{
> > +  gcc_assert (loop && loop->header);
> > +  loop->header->flags |= BB_HEADER_OF_NONROLLING_LOOP;
> > +}
> > +
> > +/* Return true if LOOP is a non-rolling loop. */
> > +
> > +bool
> > +non_rolling_loop_p (struct loop *loop)
> > +{
> > +  int masked_flags;
> > +  gcc_assert (loop && loop->header);
> > +  masked_flags = (loop->header->flags & BB_HEADER_OF_NONROLLING_LOOP);
> > +  return (masked_flags != 0);
> > +}
> > +
> >  /* Returns true if statement S1 dominates statement S2. */
> >
> >  bool
> > diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> > index 6ecd304..216de78 100644
> > --- a/gcc/tree-vect-loop-manip.c
> > +++ b/gcc/tree-vect-loop-manip.c
> > @@ -1938,6 +1938,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio,
> >                                              cond_expr, cond_expr_stmt_list);
> >    gcc_assert (new_loop);
> >    gcc_assert (loop_num == loop->num);
> > +
> > +  /* NEW_LOOP is a non-rolling loop. */
> > +  mark_non_rolling_loop (new_loop);
> > +
> >  #ifdef ENABLE_CHECKING
> >    slpeel_verify_cfg_after_peeling (loop, new_loop);
> >  #endif
> > @@ -2191,6 +2195,10 @@ vect_do_peeling_for_alignment (loop_vec_info loop_vinfo)
> >                                     th, true, NULL_TREE, NULL);
> >
> >    gcc_assert (new_loop);
> > +
> > +  /* NEW_LOOP is a non-rolling loop. */
> > +  mark_non_rolling_loop (new_loop);
> > +
> >  #ifdef ENABLE_CHECKING
> >    slpeel_verify_cfg_after_peeling (new_loop, loop);
> >  #endif
> > --
> > 1.6.3.3
> >
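The mechanism in the quoted patch can be tried outside GCC with a small stand-alone model. This is only a sketch under stated assumptions: toy_bb, toy_loop, TOY_BB_SCRATCH, the hot field and toy_clear_bb_flags are invented stand-ins, not GCC types or functions; only the marker name, its bit value and the predicate behaviour are taken from the patch text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; these are NOT the GCC data structures. */
enum toy_bb_flags {
    TOY_BB_SCRATCH = 1 << 0,                 /* hypothetical transient flag */
    BB_HEADER_OF_NONROLLING_LOOP = 1 << 12   /* bit value as in the patch */
};

struct toy_bb   { int flags; bool hot; };    /* 'hot' stands in for profile data */
struct toy_loop { struct toy_bb *header; };

/* Mark LOOP as non-rolling by setting the marker on its header block. */
void mark_non_rolling_loop(struct toy_loop *loop)
{
    assert(loop != NULL && loop->header != NULL);
    loop->header->flags |= BB_HEADER_OF_NONROLLING_LOOP;
}

/* Return true if LOOP's header carries the non-rolling marker. */
bool non_rolling_loop_p(const struct toy_loop *loop)
{
    assert(loop != NULL && loop->header != NULL);
    return (loop->header->flags & BB_HEADER_OF_NONROLLING_LOOP) != 0;
}

/* A marked loop is never optimized for speed, whatever the profile says. */
bool optimize_loop_for_speed_p(const struct toy_loop *loop)
{
    if (non_rolling_loop_p(loop))
        return false;
    return loop->header->hot;   /* stand-in for optimize_bb_for_speed_p */
}

/* A marked loop is always optimized for size. */
bool optimize_loop_for_size_p(const struct toy_loop *loop)
{
    if (non_rolling_loop_p(loop))
        return true;
    return !loop->header->hot;  /* stand-in for optimize_bb_for_size_p */
}

/* Model of the clear_bb_flags change: drop transient flags but keep the
   marker, so the loop stays marked for later passes. */
void toy_clear_bb_flags(struct toy_bb *bb)
{
    bb->flags &= BB_HEADER_OF_NONROLLING_LOOP;
}
```

The point of routing the decision through the two predicates is that every pass that already asks optimize_loop_for_speed_p or optimize_loop_for_size_p picks up the marking without being changed itself, which is also why the effect is broader than the original prefetch/unswitch/unroll-only patch.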
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-17 16:48 ` Fang, Changpeng
@ 2010-12-17 17:20 ` Jack Howarth
2010-12-17 18:01 ` Jack Howarth
1 sibling, 0 replies; 36+ messages in thread
From: Jack Howarth @ 2010-12-17 17:20 UTC (permalink / raw)
To: Fang, Changpeng
Cc: Zdenek Dvorak, Richard Guenther, Xinliang David Li, gcc-patches
On Fri, Dec 17, 2010 at 10:13:24AM -0600, Fang, Changpeng wrote:
>
> Hi, Jack:
>
> Have you tested my original patch in this thread (I remember you bootstrapped it)?
> It uses the same logic, but applies it only to prefetch, unswitch and unrolling.
>
> It would help a lot if you have data on that.
Not for pb05 benchmarking, but I will try that patch and benchmark against it as well.
Jack
>
> Thanks,
>
> Changpeng
>
>
> ________________________________________
> From: Jack Howarth [howarth@bromo.med.uc.edu]
> Sent: Friday, December 17, 2010 9:12 AM
> To: Fang, Changpeng
> Cc: Zdenek Dvorak; Richard Guenther; Xinliang David Li; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
>
> On Fri, Dec 17, 2010 at 01:14:49AM -0600, Fang, Changpeng wrote:
> > Hi, Jack:
> >
> > Thanks for the testing.
> >
> > This patch is not supposed to slow down a program by 10% (rnflow and test_fpu).
> > It would be helpful if you could provide an analysis of why they slow down.
>
> Unfortunately gprof is pretty broken on darwin10 but I can try to do profiling
> in Shark with -fno-omit-frame-pointer. Could you try rebenchmarking with the flags...
>
> -ffast-math -funroll-loops -O3 -fPIC -mtune=generic
>
> I'll try benchmarking again with '-Ofast -funroll-loops -mtune=generic' and
> see if the performance loss is reduced.
>
> >
> > We did see a significant compilation time reduction for most pb05 programs.
> > (I don't know why you do not have executable size data).
>
> The sources I have for pb05 aren't fixed for detecting the code size properly
> under darwin and I never spent the time trying to fix that.
>
> >
> > >I would note that intel darwin now defaults
> > >to -mtune=core2 and always has defaulted to -fPIC.
> >
> > I could not understand these defaults for darwin. My understanding is that,
> > for x86_64, the default should be -mtune=generic.
>
> This is a recent change.
>
> Author: mrs
> Date: Wed Dec 8 23:32:27 2010
> New Revision: 167611
>
> URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=167611
> Log:
> 2010-12-08 Iain Sandoe <iains@gcc.gnu.org>
>
> gcc/config.gcc (with_cpu): Default i[34567]86-*-darwin* and
> x86_64-*-darwin* to with_cpu:-core2.
> gcc/config/i386/mmx.md (*mov<mode>_internal_rex64): Replace movq
> with movd for darwin assembler.
> gcc/config/i386/sse.md (*vec_concatv2di_rex64_sse4_1): Ditto.
> (*vec_concatv2di_rex64_sse): Ditto.
>
> Modified:
> trunk/gcc/ChangeLog
> trunk/gcc/config.gcc
> trunk/gcc/config/i386/mmx.md
> trunk/gcc/config/i386/sse.md
>
> We had intended to do this a while back (since all darwin hardware is core2
> or recent Xeon), but held off until the core2 costs tables were improved to
> match those of generic.
>
> http://gcc.gnu.org/ml/gcc-patches/2010-12/msg00000.html
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-17 16:48 ` Fang, Changpeng
2010-12-17 17:20 ` Jack Howarth
@ 2010-12-17 18:01 ` Jack Howarth
2010-12-17 18:31 ` Fang, Changpeng
1 sibling, 1 reply; 36+ messages in thread
From: Jack Howarth @ 2010-12-17 18:01 UTC (permalink / raw)
To: Fang, Changpeng
Cc: Zdenek Dvorak, Richard Guenther, Xinliang David Li, gcc-patches
On Fri, Dec 17, 2010 at 10:13:24AM -0600, Fang, Changpeng wrote:
>
> Hi, Jack:
>
> Have you tested my original patch in this thread (I remember you bootstrapped it)?
> It uses the same logic, but applies it only to prefetch, unswitch and unrolling.
>
> It would help a lot if you have data on that.
>
> Thanks,
>
> Changpeng
>
>
Changpeng,
I noticed that neither of your patches applies cleanly to recent gcc trunk due to
the context change in gcc/basic-block.h around...

  /* Set on blocks that cannot be threaded through.
     Only used in cfgcleanup.c. */
  BB_NONTHREADABLE_BLOCK = 1 << 11,

  /* Set on blocks that were modified in some way. This bit is set in
     df_set_bb_dirty, but not cleared by df_analyze, so it can be used
     to test whether a block has been modified prior to a df_analyze
     call. */
  BB_MODIFIED = 1 << 12
};

You might want to rebenchmark pb05 against current gcc trunk to make sure
that r167779 didn't impact the results.
Jack
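
To illustrate the conflict described above: the patch assigned bit 12 to its new flag, but trunk now uses bit 12 for BB_MODIFIED. A minimal sketch of a rebased enum follows, assuming the new flag simply moves to the next free bit. The value 1 << 13 and the helper flags_overlap are assumptions for illustration, not taken from any posted patch, and the enum is a shortened excerpt rather than the full GCC definition.

```c
#include <assert.h>

/* Shortened excerpt of the flag enum; only the tail is modeled here.  */
enum bb_flags
{
  BB_NONTHREADABLE_BLOCK = 1 << 11,
  BB_MODIFIED = 1 << 12,
  /* Assumed rebased value: the next free bit after BB_MODIFIED.  */
  BB_HEADER_OF_NONROLLING_LOOP = 1 << 13
};

/* Return nonzero if the two flag values share any bit.  */
static int
flags_overlap (int a, int b)
{
  return (a & b) != 0;
}
```

With the original value of 1 << 12, flags_overlap would report a clash with BB_MODIFIED, which is why the hunk needs more than a context refresh.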
^ permalink raw reply [flat|nested] 36+ messages in thread
* RE: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-17 18:01 ` Jack Howarth
@ 2010-12-17 18:31 ` Fang, Changpeng
0 siblings, 0 replies; 36+ messages in thread
From: Fang, Changpeng @ 2010-12-17 18:31 UTC (permalink / raw)
To: Jack Howarth
Cc: Zdenek Dvorak, Richard Guenther, Xinliang David Li, gcc-patches
Thanks, Jack:
I am going to update the patch against the current trunk!
Thanks,
Changpeng
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-17 6:36 ` Jack Howarth
2010-12-17 9:55 ` Fang, Changpeng
@ 2010-12-17 21:45 ` Jack Howarth
2010-12-17 22:35 ` Fang, Changpeng
1 sibling, 1 reply; 36+ messages in thread
From: Jack Howarth @ 2010-12-17 21:45 UTC (permalink / raw)
To: Fang, Changpeng
Cc: Zdenek Dvorak, Richard Guenther, Xinliang David Li, gcc-patches
On Fri, Dec 17, 2010 at 12:31:15AM -0500, Jack Howarth wrote:
> On Thu, Dec 16, 2010 at 06:13:52PM -0600, Fang, Changpeng wrote:
> > Hi,
> >
> > Based on previous discussions, I modified the patch as such.
> >
> > If a loop is marked as non-rolling, optimize_loop_for_size_p returns TRUE
> > and optimize_loop_for_speed_p returns FALSE. All users of these two
> > functions will be affected.
> >
> > After applying the modified patch, pb05 compilation time decreases 29%, binary
> > size decreases 20%, while a small (0.5%) performance increase was found which may
> > be just noise.
> >
> > Modified patch passed bootstrapping and gcc regression tests on x86_64-unknown-linux-gnu.
> >
> > Is it OK to commit to trunk?
>
> Changpeng,
> On x86_64-apple-darwin10, I am finding a more severe penalty for this patch with
> the pb05 benchmarks. Using a profiledbootstrap BOOT_CFLAGS="-g -O3" build with...
>
> Configured with: ../gcc-4.6-20101216/configure --prefix=/sw --prefix=/sw/lib/gcc4.6 --mandir=/sw/share/man --infodir=/sw/lib/gcc4.6/info --enable-languages=c,c++,fortran,objc,obj-c++,java --with-gmp=/sw --with-libiconv-prefix=/sw --with-ppl=/sw --with-cloog=/sw --with-mpc=/sw --with-system-zlib --x-includes=/usr/X11R6/include --x-libraries=/usr/X11R6/lib --program-suffix=-fsf-4.6 --enable-checking=yes --enable-cloog-backend=isl --enable-build-with-cxx
>
> I get without the patch...
>
> ================================================================================
> Date & Time : 16 Dec 2010 23:36:27
> Test Name : gfortran_lin_O3
> Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n
> Benchmarks : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
> Maximum Times : 2000.0
> Target Error % : 0.100
> Minimum Repeats : 10
> Maximum Repeats : 100
>
> Benchmark Compile Executable Ave Run Number Estim
> Name (secs) (bytes) (secs) Repeats Err %
> --------- ------- ---------- ------- ------- ------
> ac 1.76 10000 8.78 12 0.0081
> aermod 54.94 10000 17.28 10 0.0307
> air 3.44 10000 5.53 13 0.0734
> capacita 2.64 10000 32.65 10 0.0096
> channel 0.89 10000 1.84 20 0.0977
> doduc 8.12 10000 27.00 10 0.0132
> fatigue 3.06 10000 8.36 10 0.0104
> gas_dyn 5.00 10000 4.30 17 0.0915
> induct 6.24 10000 12.42 10 0.0100
> linpk 1.02 10000 15.50 12 0.0542
> mdbx 2.55 10000 11.24 10 0.0256
> nf 2.90 10000 30.16 20 0.0989
> protein 7.98 10000 33.72 10 0.0070
> rnflow 9.34 10000 23.21 10 0.0551
> test_fpu 6.72 10000 8.05 10 0.0426
> tfft 0.76 10000 1.87 10 0.0597
>
> Geometric Mean Execution Time = 10.87 seconds
>
> and with the patch...
>
> ================================================================================
> Date & Time : 16 Dec 2010 21:31:06
> Test Name : gfortran_lin_O3
> Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n
> Benchmarks : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
> Maximum Times : 2000.0
> Target Error % : 0.100
> Minimum Repeats : 10
> Maximum Repeats : 100
>
> Benchmark Compile Executable Ave Run Number Estim
> Name (secs) (bytes) (secs) Repeats Err %
> --------- ------- ---------- ------- ------- ------
> ac 1.19 10000 8.78 10 0.0099
> aermod 47.91 10000 16.95 10 0.0123
> air 2.85 10000 5.34 12 0.0715
> capacita 1.63 10000 33.10 10 0.0361
> channel 0.67 10000 1.87 10 0.0884
> doduc 6.42 10000 27.35 10 0.0206
> fatigue 2.10 10000 8.32 10 0.0194
> gas_dyn 2.07 10000 4.30 17 0.0843
> induct 5.38 10000 12.58 10 0.0088
> linpk 0.71 10000 15.69 18 0.0796
> mdbx 1.95 10000 11.41 10 0.0238
> nf 1.24 10000 31.34 12 0.0991
> protein 3.88 10000 35.13 10 0.0659
> rnflow 4.73 10000 25.97 10 0.0629
> test_fpu 3.66 10000 8.88 11 0.0989
> tfft 0.52 10000 1.89 10 0.0403
>
> Geometric Mean Execution Time = 11.09 seconds
>
> This shows about a 2.0% performance reduction in the Geometric
> Mean Execution Time. I would note that intel darwin now defaults
> to -mtune=core2 and always has defaulted to -fPIC.
> Jack
>
Changpeng,
Using your original patch from http://gcc.gnu.org/ml/gcc-patches/2010-12/msg01024.html
I get...
================================================================================
Date & Time : 17 Dec 2010 14:52:35
Test Name : gfortran_lin_O3
Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n
Benchmarks : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times : 2000.0
Target Error % : 0.100
Minimum Repeats : 10
Maximum Repeats : 100
Benchmark Compile Executable Ave Run Number Estim
Name (secs) (bytes) (secs) Repeats Err %
--------- ------- ---------- ------- ------- ------
ac 1.24 10000 8.76 10 0.0118
aermod 63.85 10000 17.23 10 0.0061
air 2.92 10000 5.39 10 0.0577
capacita 1.66 10000 33.06 10 0.0259
channel 0.82 10000 1.87 12 0.0223
doduc 6.51 10000 27.39 10 0.0064
fatigue 2.14 10000 8.31 10 0.0228
gas_dyn 2.33 10000 4.29 18 0.0889
induct 5.66 10000 12.57 10 0.0099
linpk 0.76 10000 15.54 10 0.0560
mdbx 2.01 10000 11.45 10 0.0239
nf 1.29 10000 31.32 17 0.0929
protein 4.06 10000 33.54 10 0.0081
rnflow 4.89 10000 25.93 10 0.0152
test_fpu 3.77 10000 8.83 10 0.0279
tfft 0.54 10000 1.89 10 0.0266
Geometric Mean Execution Time = 11.06 seconds
which is only a slightly lower performance penalty on the
Geometric Mean Execution Time (1.7%).
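
For reference, the penalty figures quoted here are relative changes of the Geometric Mean Execution Time against the unpatched 10.87 seconds. A minimal sketch of that arithmetic, where the helper name percent_change is hypothetical:

```c
#include <assert.h>

/* Relative change from BEFORE to AFTER in percent.  A positive result
   for run times means the patched build is slower.  BEFORE must be
   nonzero.  */
static double
percent_change (double before, double after)
{
  assert (before != 0.0);
  return (after - before) / before * 100.0;
}
```

Calling percent_change (10.87, 11.09) gives roughly 2.0 for the modified patch, and percent_change (10.87, 11.06) gives roughly 1.7 for the original patch.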
Jack
> >
> > Thanks,
> >
> > Changpeng
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > ________________________________________
> > From: Zdenek Dvorak [rakdver@kam.mff.cuni.cz]
> > Sent: Thursday, December 16, 2010 12:47 PM
> > To: Fang, Changpeng
> > Cc: Richard Guenther; Xinliang David Li; gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
> >
> > Hi,
> >
> > > > For prefetching of prologue or epilogue loops, we have two choices: (1) prefetching but
> > > > not unrolling, (2) not prefetching. Which one do you prefer?
> >
> > it is better not to prefetch (the current placement of prefetches is not good for non-rolling
> > loops),
> >
> > Zdenek
> >
>
> Content-Description: polyhedron1.txt
> > Comparison of Polyhedron (2005) Compile Time, Binary Size and Performance After Applying the NON-ROLLING Marking Patch
> >
> > gfortran -Ofast -funroll-loops -march=amdfam10 %n.f90 -o %n
> > ============================================================================================================================
> > || | Before Patch | After Patch | Changes ||
> > ||========================================================================================================================||
> > || Benchmark | Compile Binary Size Run Time | Compile Binary Size Run Time | Compile Binary Performance ||
> > || Name | Time (s) (bytes) (s) | Time(s) (bytes) (secs) | Time (%) Size(%) (%) ||
> > ||========================================================================================================================||
> > || ac | 3.36 41976 13.26 | 2.52 34424 13.21 | -25.00 -17.99 0.38 ||
> > || aermod | 103.45 1412221 44.36 | 86.55 1268861 43.08 | -16.34 -10.15 2.97 ||
> > || air | 6.11 75186 11.73 | 5.72 71090 11.56 | -6.38 -5.45 1.47 ||
> > || capacita | 6.83 91257 88.40 | 4.70 74585 88.01 | -31.19 -18.27 0.44 ||
> > || channel | 2.14 39984 6.65 | 1.84 35888 6.69 | -14.02 -10.24 -0.60 ||
> > || doduc | 12.78 198624 38.59 | 12.20 186336 38.18 | -4.54 -6.19 1.07 ||
> > || fatigue | 9.11 110008 10.15 | 5.93 92472 10.12 | -34.91 -15.94 0.30 ||
> > || gas_dyn | 15.69 149726 7.14 | 8.45 109342 6.96 | -46.14 -26.97 2.59 ||
> > || induct | 10.98 191800 20.66 | 10.62 188168 20.61 | -3.28 -1.89 0.24 ||
> > || linpk | 2.27 46073 19.03 | 1.68 33401 19.03 | -25.99 -27.50 0.00 ||
> > || mdbx | 5.63 103731 21.41 | 4.24 83251 21.33 | -24.69 -19.74 0.38 ||
> > || nf | 14.18 118451 22.88 | 5.55 60499 23.14 | -60.86 -48.92 -1.12 ||
> > || protein | 34.20 177700 47.04 | 19.16 135012 46.94 | -43.98 -24.02 0.21 ||
> > || rnflow | 42.13 283645 40.30 | 20.92 178477 40.65 | -50.34 -37.08 -0.86 ||
> > || test_fpu | 30.17 252080 14.46 | 14.44 149136 14.32 | -52.14 -40.84 0.98 ||
> > || tfft | 1.50 32450 7.71 | 1.12 26546 7.67 | -25.33 -18.19 0.52 ||
> > ||========================================================================================================================||
> > || average | 19.57 | 19.46 | -29.07 -20.59 0.56 ||
> > ============================================================================================================================
>
> Content-Description: 0001-Don-t-perform-certain-loop-optimizations-on-pre-post.patch
> > From cd8b85bba1b39e108235f44d9d07918179ff3d79 Mon Sep 17 00:00:00 2001
> > From: Changpeng Fang <chfang@houghton.(none)>
> > Date: Mon, 13 Dec 2010 12:01:49 -0800
> > Subject: [PATCH] Don't perform certain loop optimizations on pre/post loops
> >
> > * basic-block.h (bb_flags): Add a new flag BB_HEADER_OF_NONROLLING_LOOP.
> > > * cfg.c (clear_bb_flags): Keep BB_HEADER_OF_NONROLLING_LOOP marker.
> > * cfgloop.h (mark_non_rolling_loop): New function declaration.
> > (non_rolling_loop_p): New function declaration.
> > * predict.c (optimize_loop_for_size_p): Return true if the loop was marked
> > NON-ROLLING. (optimize_loop_for_speed_p): Return false if the loop was
> > marked NON-ROLLING.
> > * tree-ssa-loop-manip.c (tree_transform_and_unroll_loop): Mark the
> > non-rolling loop.
> > * tree-ssa-loop-niter.c (mark_non_rolling_loop): Implement the new
> > function. (non_rolling_loop_p): Implement the new function.
> > * tree-vect-loop-manip.c (vect_do_peeling_for_loop_bound): Mark the
> > non-rolling loop. (vect_do_peeling_for_alignment): Mark the non-rolling
> > loop.
> > ---
> > gcc/basic-block.h | 6 +++++-
> > gcc/cfg.c | 7 ++++---
> > gcc/cfgloop.h | 2 ++
> > gcc/predict.c | 6 ++++++
> > gcc/tree-ssa-loop-manip.c | 3 +++
> > gcc/tree-ssa-loop-niter.c | 20 ++++++++++++++++++++
> > gcc/tree-vect-loop-manip.c | 8 ++++++++
> > 7 files changed, 48 insertions(+), 4 deletions(-)
> >
> > diff --git a/gcc/basic-block.h b/gcc/basic-block.h
> > index be0a1d1..850472d 100644
> > --- a/gcc/basic-block.h
> > +++ b/gcc/basic-block.h
> > @@ -245,7 +245,11 @@ enum bb_flags
> >
> > /* Set on blocks that cannot be threaded through.
> > Only used in cfgcleanup.c. */
> > - BB_NONTHREADABLE_BLOCK = 1 << 11
> > + BB_NONTHREADABLE_BLOCK = 1 << 11,
> > +
> > + /* Set on blocks that are headers of non-rolling loops. */
> > + BB_HEADER_OF_NONROLLING_LOOP = 1 << 12
> > +
> > };
> >
> > /* Dummy flag for convenience in the hot/cold partitioning code. */
> > diff --git a/gcc/cfg.c b/gcc/cfg.c
> > index c8ef799..e59a637 100644
> > --- a/gcc/cfg.c
> > +++ b/gcc/cfg.c
> > @@ -425,8 +425,8 @@ redirect_edge_pred (edge e, basic_block new_pred)
> > connect_src (e);
> > }
> >
> > -/* Clear all basic block flags, with the exception of partitioning and
> > - setjmp_target. */
> > +/* Clear all basic block flags, with the exception of partitioning,
> > + setjmp_target, and the non-rolling loop marker. */
> > void
> > clear_bb_flags (void)
> > {
> > @@ -434,7 +434,8 @@ clear_bb_flags (void)
> >
> > FOR_BB_BETWEEN (bb, ENTRY_BLOCK_PTR, NULL, next_bb)
> > bb->flags = (BB_PARTITION (bb)
> > - | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET)));
> > + | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET
> > + + BB_HEADER_OF_NONROLLING_LOOP)));
> > }
> > \f
> > /* Check the consistency of profile information. We can't do that
> > diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
> > index bf2614e..e856a78 100644
> > --- a/gcc/cfgloop.h
> > +++ b/gcc/cfgloop.h
> > @@ -279,6 +279,8 @@ extern rtx doloop_condition_get (rtx);
> > void estimate_numbers_of_iterations_loop (struct loop *, bool);
> > HOST_WIDE_INT estimated_loop_iterations_int (struct loop *, bool);
> > bool estimated_loop_iterations (struct loop *, bool, double_int *);
> > +void mark_non_rolling_loop (struct loop *);
> > +bool non_rolling_loop_p (struct loop *);
> >
> > /* Loop manipulation. */
> > extern bool can_duplicate_loop_p (const struct loop *loop);
> > diff --git a/gcc/predict.c b/gcc/predict.c
> > index c691990..bf729f8 100644
> > --- a/gcc/predict.c
> > +++ b/gcc/predict.c
> > @@ -279,6 +279,9 @@ optimize_insn_for_speed_p (void)
> > bool
> > optimize_loop_for_size_p (struct loop *loop)
> > {
> > + /* Loops marked NON-ROLLING are not likely to be hot. */
> > + if (non_rolling_loop_p (loop))
> > + return true;
> > return optimize_bb_for_size_p (loop->header);
> > }
> >
> > @@ -287,6 +290,9 @@ optimize_loop_for_size_p (struct loop *loop)
> > bool
> > optimize_loop_for_speed_p (struct loop *loop)
> > {
> > + /* Loops marked NON-ROLLING are not likely to be hot. */
> > + if (non_rolling_loop_p (loop))
> > + return false;
> > return optimize_bb_for_speed_p (loop->header);
> > }
> >
> > diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
> > index 87b2c0d..bc977bb 100644
> > --- a/gcc/tree-ssa-loop-manip.c
> > +++ b/gcc/tree-ssa-loop-manip.c
> > @@ -931,6 +931,9 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
> > gcc_assert (new_loop != NULL);
> > update_ssa (TODO_update_ssa);
> >
> > + /* NEW_LOOP is a non-rolling loop. */
> > + mark_non_rolling_loop (new_loop);
> > +
> > /* Determine the probability of the exit edge of the unrolled loop. */
> > new_est_niter = est_niter / factor;
> >
> > diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
> > index ee85f6f..1e2e4b2 100644
> > --- a/gcc/tree-ssa-loop-niter.c
> > +++ b/gcc/tree-ssa-loop-niter.c
> > @@ -3011,6 +3011,26 @@ estimate_numbers_of_iterations (bool use_undefined_p)
> > fold_undefer_and_ignore_overflow_warnings ();
> > }
> >
> > +/* Mark LOOP as a non-rolling loop. */
> > +
> > +void
> > +mark_non_rolling_loop (struct loop *loop)
> > +{
> > + gcc_assert (loop && loop->header);
> > + loop->header->flags |= BB_HEADER_OF_NONROLLING_LOOP;
> > +}
> > +
> > +/* Return true if LOOP is a non-rolling loop. */
> > +
> > +bool
> > +non_rolling_loop_p (struct loop *loop)
> > +{
> > + int masked_flags;
> > + gcc_assert (loop && loop->header);
> > + masked_flags = (loop->header->flags & BB_HEADER_OF_NONROLLING_LOOP);
> > + return (masked_flags != 0);
> > +}
> > +
> > /* Returns true if statement S1 dominates statement S2. */
> >
> > bool
> > diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> > index 6ecd304..216de78 100644
> > --- a/gcc/tree-vect-loop-manip.c
> > +++ b/gcc/tree-vect-loop-manip.c
> > @@ -1938,6 +1938,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio,
> > cond_expr, cond_expr_stmt_list);
> > gcc_assert (new_loop);
> > gcc_assert (loop_num == loop->num);
> > +
> > + /* NEW_LOOP is a non-rolling loop. */
> > + mark_non_rolling_loop (new_loop);
> > +
> > #ifdef ENABLE_CHECKING
> > slpeel_verify_cfg_after_peeling (loop, new_loop);
> > #endif
> > @@ -2191,6 +2195,10 @@ vect_do_peeling_for_alignment (loop_vec_info loop_vinfo)
> > th, true, NULL_TREE, NULL);
> >
> > gcc_assert (new_loop);
> > +
> > + /* NEW_LOOP is a non-rolling loop. */
> > + mark_non_rolling_loop (new_loop);
> > +
> > #ifdef ENABLE_CHECKING
> > slpeel_verify_cfg_after_peeling (new_loop, loop);
> > #endif
> > --
> > 1.6.3.3
> >
^ permalink raw reply [flat|nested] 36+ messages in thread
* RE: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-17 21:45 ` Jack Howarth
@ 2010-12-17 22:35 ` Fang, Changpeng
2010-12-17 22:54 ` [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loopsj Jack Howarth
2010-12-18 11:35 ` [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops Jack Howarth
0 siblings, 2 replies; 36+ messages in thread
From: Fang, Changpeng @ 2010-12-17 22:35 UTC (permalink / raw)
To: Jack Howarth
Cc: Zdenek Dvorak, Richard Guenther, Xinliang David Li, gcc-patches
Hi, Jack:
Is prefetching on by default at -O3 on your system?
Can you run an additional test with
-O3 -ffast-math, with and without the original patch?
That way we can tell whether the problem is in RTL loop unrolling.
Thanks,
Changpeng
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-17 22:35 ` Fang, Changpeng
@ 2010-12-17 22:54 ` Jack Howarth
2010-12-18 11:35 ` [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops Jack Howarth
1 sibling, 0 replies; 36+ messages in thread
From: Jack Howarth @ 2010-12-17 22:54 UTC (permalink / raw)
To: Fang, Changpeng
Cc: Zdenek Dvorak, Richard Guenther, Xinliang David Li, gcc-patches
On Fri, Dec 17, 2010 at 03:30:37PM -0600, Fang, Changpeng wrote:
> Hi, Jack:
>
> Is prefetching enabled by default at -O3 on your systems?
>
Changpeng,
On x86_64-apple-darwin10, from 'touch t.f90; gfortran -fverbose-asm -O3 t.f90 -S',
I get...
# GNU Fortran (GCC) version 4.6.0 20101217 (experimental) (x86_64-apple-darwin10.5.0)
# compiled by GNU C version 4.6.0 20101217 (experimental), GMP version 4.3.2, MPFR version 2.4.2-p3, MPC version 0.8.2
# GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096
# options passed: t.f90 -fPIC -mmacosx-version-min=10.6.5 -mtune=core2 -O3
# -fverbose-asm
# -fintrinsic-modules-path /sw/lib/gcc4.6/lib/gcc/x86_64-apple-darwin10.5.0/4.6.0/finclude
# options enabled: -Wnonportable-cfstrings -fPIC
# -fasynchronous-unwind-tables -fauto-inc-dec -fbranch-count-reg
# -fcaller-saves -fcombine-stack-adjustments -fcommon -fcprop-registers
# -fcrossjumping -fcse-follow-jumps -fdefer-pop
# -fdelete-null-pointer-checks -fearly-inlining
# -feliminate-unused-debug-types -fexpensive-optimizations
# -fforward-propagate -ffunction-cse -fgcse -fgcse-after-reload -fgcse-lm
# -fguess-branch-probability -fident -fif-conversion -fif-conversion2
# -findirect-inlining -finline -finline-functions
# -finline-functions-called-once -finline-small-functions -fipa-cp
# -fipa-cp-clone -fipa-profile -fipa-pure-const -fipa-reference -fipa-sra
# -fira-share-save-slots -fira-share-spill-slots -fivopts
# -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-constants
# -fmerge-debug-strings -fmove-loop-invariants -fomit-frame-pointer
# -foptimize-register-move -foptimize-sibling-calls -fpartial-inlining
# -fpeephole -fpeephole2 -fpredictive-commoning -fprefetch-loop-arrays
# -freg-struct-return -fregmove -freorder-blocks -freorder-functions
# -frerun-cse-after-loop -fsched-critical-path-heuristic
# -fsched-dep-count-heuristic -fsched-group-heuristic -fsched-interblock
# -fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec
# -fsched-spec-insn-heuristic -fsched-stalled-insns-dep -fschedule-insns2
# -fshow-column -fsigned-zeros -fsplit-ivs-in-unroller -fsplit-wide-types
# -fstrict-aliasing -fstrict-overflow -fstrict-volatile-bitfields
# -fthread-jumps -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp
# -ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-copy-prop
# -ftree-copyrename -ftree-cselim -ftree-dce -ftree-dominator-opts
# -ftree-dse -ftree-forwprop -ftree-fre -ftree-loop-distribute-patterns
# -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon
# -ftree-loop-optimize -ftree-parallelize-loops= -ftree-phiprop -ftree-pre
# -ftree-pta -ftree-reassoc -ftree-scev-cprop -ftree-sink
# -ftree-slp-vectorize -ftree-sra -ftree-switch-conversion -ftree-ter
# -ftree-vect-loop-version -ftree-vectorize -ftree-vrp -funit-at-a-time
# -funswitch-loops -funwind-tables -fvect-cost-model -fverbose-asm -fzee
# -fzero-initialized-in-bss -gstrict-dwarf -m128bit-long-double -m64
# -m80387 -maccumulate-outgoing-args -malign-stringops -matt-stubs
# -mconstant-cfstrings -mfancy-math-387 -mfp-ret-in-387 -mieee-fp -mmmx
# -mno-sse4 -mpush-args -mred-zone -msse -msse2 -msse3
.subsections_via_symbols
> Can you run an additional test with
> -O3 -ffast-math, with and without the original patch?
>
> This way, we can tell whether it is an RTL loop unrolling problem.
I'll do this later tonight.
>
> Thanks,
>
> Changpeng
>
>
>
> ________________________________________
> From: Jack Howarth [howarth@bromo.med.uc.edu]
> Sent: Friday, December 17, 2010 3:04 PM
> To: Fang, Changpeng
> Cc: Zdenek Dvorak; Richard Guenther; Xinliang David Li; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
>
> On Fri, Dec 17, 2010 at 12:31:15AM -0500, Jack Howarth wrote:
> > On Thu, Dec 16, 2010 at 06:13:52PM -0600, Fang, Changpeng wrote:
> > > Hi,
> > >
> > > Based on previous discussions, I modified the patch as such.
> > >
> > > If a loop is marked as non-rolling, optimize_loop_for_size_p returns TRUE
> > > and optimize_loop_for_speed_p returns FALSE. All users of these two
> > > functions will be affected.
> > >
> > > After applying the modified patch, pb05 compilation time decreases 29%, binary
> > > size decreases 20%, while a small (0.5%) performance increase was found which may
> > > be just noise.
> > >
> > > Modified patch passed bootstrapping and gcc regression tests on x86_64-unknown-linux-gnu.
> > >
> > > Is it OK to commit to trunk?
> >
> > Changpeng,
> > On x86_64-apple-darwin10, I am finding a more severe penalty for this patch with
> > the pb05 benchmarks. Using a profiledbootstrap BOOT_CFLAGS="-g -O3" build with...
> >
> > Configured with: ../gcc-4.6-20101216/configure --prefix=/sw --prefix=/sw/lib/gcc4.6 --mandir=/sw/share/man --infodir=/sw/lib/gcc4.6/info --enable-languages=c,c++,fortran,objc,obj-c++,java --with-gmp=/sw --with-libiconv-prefix=/sw --with-ppl=/sw --with-cloog=/sw --with-mpc=/sw --with-system-zlib --x-includes=/usr/X11R6/include --x-libraries=/usr/X11R6/lib --program-suffix=-fsf-4.6 --enable-checking=yes --enable-cloog-backend=isl --enable-build-with-cxx
> >
> > I get without the patch...
> >
> > ================================================================================
> > Date & Time : 16 Dec 2010 23:36:27
> > Test Name : gfortran_lin_O3
> > Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n
> > Benchmarks : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
> > Maximum Times : 2000.0
> > Target Error % : 0.100
> > Minimum Repeats : 10
> > Maximum Repeats : 100
> >
> > Benchmark Compile Executable Ave Run Number Estim
> > Name (secs) (bytes) (secs) Repeats Err %
> > --------- ------- ---------- ------- ------- ------
> > ac 1.76 10000 8.78 12 0.0081
> > aermod 54.94 10000 17.28 10 0.0307
> > air 3.44 10000 5.53 13 0.0734
> > capacita 2.64 10000 32.65 10 0.0096
> > channel 0.89 10000 1.84 20 0.0977
> > doduc 8.12 10000 27.00 10 0.0132
> > fatigue 3.06 10000 8.36 10 0.0104
> > gas_dyn 5.00 10000 4.30 17 0.0915
> > induct 6.24 10000 12.42 10 0.0100
> > linpk 1.02 10000 15.50 12 0.0542
> > mdbx 2.55 10000 11.24 10 0.0256
> > nf 2.90 10000 30.16 20 0.0989
> > protein 7.98 10000 33.72 10 0.0070
> > rnflow 9.34 10000 23.21 10 0.0551
> > test_fpu 6.72 10000 8.05 10 0.0426
> > tfft 0.76 10000 1.87 10 0.0597
> >
> > Geometric Mean Execution Time = 10.87 seconds
> >
> > and with the patch...
> >
> > ================================================================================
> > Date & Time : 16 Dec 2010 21:31:06
> > Test Name : gfortran_lin_O3
> > Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n
> > Benchmarks : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
> > Maximum Times : 2000.0
> > Target Error % : 0.100
> > Minimum Repeats : 10
> > Maximum Repeats : 100
> >
> > Benchmark Compile Executable Ave Run Number Estim
> > Name (secs) (bytes) (secs) Repeats Err %
> > --------- ------- ---------- ------- ------- ------
> > ac 1.19 10000 8.78 10 0.0099
> > aermod 47.91 10000 16.95 10 0.0123
> > air 2.85 10000 5.34 12 0.0715
> > capacita 1.63 10000 33.10 10 0.0361
> > channel 0.67 10000 1.87 10 0.0884
> > doduc 6.42 10000 27.35 10 0.0206
> > fatigue 2.10 10000 8.32 10 0.0194
> > gas_dyn 2.07 10000 4.30 17 0.0843
> > induct 5.38 10000 12.58 10 0.0088
> > linpk 0.71 10000 15.69 18 0.0796
> > mdbx 1.95 10000 11.41 10 0.0238
> > nf 1.24 10000 31.34 12 0.0991
> > protein 3.88 10000 35.13 10 0.0659
> > rnflow 4.73 10000 25.97 10 0.0629
> > test_fpu 3.66 10000 8.88 11 0.0989
> > tfft 0.52 10000 1.89 10 0.0403
> >
> > Geometric Mean Execution Time = 11.09 seconds
> >
> > This shows about a 2.0% performance reduction in the Geometric
> > Mean Execution Time. I would note that intel darwin now defaults
> > to -mtune=core2 and always has defaulted to -fPIC.
> > Jack
> >
>
> Changpeng,
> Using your original patch from http://gcc.gnu.org/ml/gcc-patches/2010-12/msg01024.html
> I get...
>
> ================================================================================
> Date & Time : 17 Dec 2010 14:52:35
> Test Name : gfortran_lin_O3
> Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n
> Benchmarks : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
> Maximum Times : 2000.0
> Target Error % : 0.100
> Minimum Repeats : 10
> Maximum Repeats : 100
>
> Benchmark Compile Executable Ave Run Number Estim
> Name (secs) (bytes) (secs) Repeats Err %
> --------- ------- ---------- ------- ------- ------
> ac 1.24 10000 8.76 10 0.0118
> aermod 63.85 10000 17.23 10 0.0061
> air 2.92 10000 5.39 10 0.0577
> capacita 1.66 10000 33.06 10 0.0259
> channel 0.82 10000 1.87 12 0.0223
> doduc 6.51 10000 27.39 10 0.0064
> fatigue 2.14 10000 8.31 10 0.0228
> gas_dyn 2.33 10000 4.29 18 0.0889
> induct 5.66 10000 12.57 10 0.0099
> linpk 0.76 10000 15.54 10 0.0560
> mdbx 2.01 10000 11.45 10 0.0239
> nf 1.29 10000 31.32 17 0.0929
> protein 4.06 10000 33.54 10 0.0081
> rnflow 4.89 10000 25.93 10 0.0152
> test_fpu 3.77 10000 8.83 10 0.0279
> tfft 0.54 10000 1.89 10 0.0266
>
> Geometric Mean Execution Time = 11.06 seconds
>
> which is only a slightly lower performance penalty on the
> Geometric Mean Execution Time (1.7%).
> Jack
>
> > >
> > > Thanks,
> > >
> > > Changpeng
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > ________________________________________
> > > From: Zdenek Dvorak [rakdver@kam.mff.cuni.cz]
> > > Sent: Thursday, December 16, 2010 12:47 PM
> > > To: Fang, Changpeng
> > > Cc: Richard Guenther; Xinliang David Li; gcc-patches@gcc.gnu.org
> > > Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
> > >
> > > Hi,
> > >
> > > > For prefetching of prologue or epilogue loops, we have two choices: (1) prefetching but
> > > > not unrolling, (2) not prefetching. Which one do you prefer?
> > >
> > > it is better not to prefetch (the current placement of prefetches is not good for non-rolling
> > > loops),
> > >
> > > Zdenek
> > >
> >
> > Content-Description: polyhedron1.txt
> > > Comparison of Polyhedron (2005) Compile Time, Binary Size and Performance After Applying the NON-ROLLING Marking Patch
> > >
> > > gfortran -Ofast -funroll-loops -march=amdfam10 %n.f90 -o %n
> > > ============================================================================================================================
> > > || | Before Patch | After Patch | Changes ||
> > > ||========================================================================================================================||
> > > || Benchmark | Compile Binary Size Run Time | Compile Binary Size Run Time | Compile Binary Performance ||
> > > || Name | Time (s) (bytes) (s) | Time(s) (bytes) (secs) | Time (%) Size(%) (%) ||
> > > ||========================================================================================================================||
> > > || ac | 3.36 41976 13.26 | 2.52 34424 13.21 | -25.00 -17.99 0.38 ||
> > > || aermod | 103.45 1412221 44.36 | 86.55 1268861 43.08 | -16.34 -10.15 2.97 ||
> > > || air | 6.11 75186 11.73 | 5.72 71090 11.56 | -6.38 -5.45 1.47 ||
> > > || capacita | 6.83 91257 88.40 | 4.70 74585 88.01 | -31.19 -18.27 0.44 ||
> > > || channel | 2.14 39984 6.65 | 1.84 35888 6.69 | -14.02 -10.24 -0.60 ||
> > > || doduc | 12.78 198624 38.59 | 12.20 186336 38.18 | -4.54 -6.19 1.07 ||
> > > || fatigue | 9.11 110008 10.15 | 5.93 92472 10.12 | -34.91 -15.94 0.30 ||
> > > || gas_dyn | 15.69 149726 7.14 | 8.45 109342 6.96 | -46.14 -26.97 2.59 ||
> > > || induct | 10.98 191800 20.66 | 10.62 188168 20.61 | -3.28 -1.89 0.24 ||
> > > || linpk | 2.27 46073 19.03 | 1.68 33401 19.03 | -25.99 -27.50 0.00 ||
> > > || mdbx | 5.63 103731 21.41 | 4.24 83251 21.33 | -24.69 -19.74 0.38 ||
> > > || nf | 14.18 118451 22.88 | 5.55 60499 23.14 | -60.86 -48.92 -1.12 ||
> > > || protein | 34.20 177700 47.04 | 19.16 135012 46.94 | -43.98 -24.02 0.21 ||
> > > || rnflow | 42.13 283645 40.30 | 20.92 178477 40.65 | -50.34 -37.08 -0.86 ||
> > > || test_fpu | 30.17 252080 14.46 | 14.44 149136 14.32 | -52.14 -40.84 0.98 ||
> > > || tfft | 1.50 32450 7.71 | 1.12 26546 7.67 | -25.33 -18.19 0.52 ||
> > > ||========================================================================================================================||
> > > || average | 19.57 | 19.46 | -29.07 -20.59 0.56 ||
> > > ============================================================================================================================
> >
> > Content-Description: 0001-Don-t-perform-certain-loop-optimizations-on-pre-post.patch
> > > From cd8b85bba1b39e108235f44d9d07918179ff3d79 Mon Sep 17 00:00:00 2001
> > > From: Changpeng Fang <chfang@houghton.(none)>
> > > Date: Mon, 13 Dec 2010 12:01:49 -0800
> > > Subject: [PATCH] Don't perform certain loop optimizations on pre/post loops
> > >
> > > * basic-block.h (bb_flags): Add a new flag BB_HEADER_OF_NONROLLING_LOOP.
> > > * cfg.c (clear_bb_flags): Keep BB_HEADER_OF_NONROLLING marker.
> > > * cfgloop.h (mark_non_rolling_loop): New function declaration.
> > > (non_rolling_loop_p): New function declaration.
> > > * predict.c (optimize_loop_for_size_p): Return true if the loop was marked
> > > NON-ROLLING. (optimize_loop_for_speed_p): Return false if the loop was
> > > marked NON-ROLLING.
> > > * tree-ssa-loop-manip.c (tree_transform_and_unroll_loop): Mark the
> > > non-rolling loop.
> > > * tree-ssa-loop-niter.c (mark_non_rolling_loop): Implement the new
> > > function. (non_rolling_loop_p): Implement the new function.
> > > * tree-vect-loop-manip.c (vect_do_peeling_for_loop_bound): Mark the
> > > non-rolling loop. (vect_do_peeling_for_alignment): Mark the non-rolling
> > > loop.
> > > ---
> > > gcc/basic-block.h | 6 +++++-
> > > gcc/cfg.c | 7 ++++---
> > > gcc/cfgloop.h | 2 ++
> > > gcc/predict.c | 6 ++++++
> > > gcc/tree-ssa-loop-manip.c | 3 +++
> > > gcc/tree-ssa-loop-niter.c | 20 ++++++++++++++++++++
> > > gcc/tree-vect-loop-manip.c | 8 ++++++++
> > > 7 files changed, 48 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/gcc/basic-block.h b/gcc/basic-block.h
> > > index be0a1d1..850472d 100644
> > > --- a/gcc/basic-block.h
> > > +++ b/gcc/basic-block.h
> > > @@ -245,7 +245,11 @@ enum bb_flags
> > >
> > > /* Set on blocks that cannot be threaded through.
> > > Only used in cfgcleanup.c. */
> > > - BB_NONTHREADABLE_BLOCK = 1 << 11
> > > + BB_NONTHREADABLE_BLOCK = 1 << 11,
> > > +
> > > + /* Set on blocks that are headers of non-rolling loops. */
> > > + BB_HEADER_OF_NONROLLING_LOOP = 1 << 12
> > > +
> > > };
> > >
> > > /* Dummy flag for convenience in the hot/cold partitioning code. */
> > > diff --git a/gcc/cfg.c b/gcc/cfg.c
> > > index c8ef799..e59a637 100644
> > > --- a/gcc/cfg.c
> > > +++ b/gcc/cfg.c
> > > @@ -425,8 +425,8 @@ redirect_edge_pred (edge e, basic_block new_pred)
> > > connect_src (e);
> > > }
> > >
> > > -/* Clear all basic block flags, with the exception of partitioning and
> > > - setjmp_target. */
> > > +/* Clear all basic block flags, with the exception of partitioning,
> > > + setjmp_target, and the non-rolling loop marker. */
> > > void
> > > clear_bb_flags (void)
> > > {
> > > @@ -434,7 +434,8 @@ clear_bb_flags (void)
> > >
> > > FOR_BB_BETWEEN (bb, ENTRY_BLOCK_PTR, NULL, next_bb)
> > > bb->flags = (BB_PARTITION (bb)
> > > - | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET)));
> > > + | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET
> > > + + BB_HEADER_OF_NONROLLING_LOOP)));
> > > }
> > >
> > > /* Check the consistency of profile information. We can't do that
> > > diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
> > > index bf2614e..e856a78 100644
> > > --- a/gcc/cfgloop.h
> > > +++ b/gcc/cfgloop.h
> > > @@ -279,6 +279,8 @@ extern rtx doloop_condition_get (rtx);
> > > void estimate_numbers_of_iterations_loop (struct loop *, bool);
> > > HOST_WIDE_INT estimated_loop_iterations_int (struct loop *, bool);
> > > bool estimated_loop_iterations (struct loop *, bool, double_int *);
> > > +void mark_non_rolling_loop (struct loop *);
> > > +bool non_rolling_loop_p (struct loop *);
> > >
> > > /* Loop manipulation. */
> > > extern bool can_duplicate_loop_p (const struct loop *loop);
> > > diff --git a/gcc/predict.c b/gcc/predict.c
> > > index c691990..bf729f8 100644
> > > --- a/gcc/predict.c
> > > +++ b/gcc/predict.c
> > > @@ -279,6 +279,9 @@ optimize_insn_for_speed_p (void)
> > > bool
> > > optimize_loop_for_size_p (struct loop *loop)
> > > {
> > > + /* Loops marked NON-ROLLING are not likely to be hot. */
> > > + if (non_rolling_loop_p (loop))
> > > + return true;
> > > return optimize_bb_for_size_p (loop->header);
> > > }
> > >
> > > @@ -287,6 +290,9 @@ optimize_loop_for_size_p (struct loop *loop)
> > > bool
> > > optimize_loop_for_speed_p (struct loop *loop)
> > > {
> > > + /* Loops marked NON-ROLLING are not likely to be hot. */
> > > + if (non_rolling_loop_p (loop))
> > > + return false;
> > > return optimize_bb_for_speed_p (loop->header);
> > > }
> > >
> > > diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
> > > index 87b2c0d..bc977bb 100644
> > > --- a/gcc/tree-ssa-loop-manip.c
> > > +++ b/gcc/tree-ssa-loop-manip.c
> > > @@ -931,6 +931,9 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
> > > gcc_assert (new_loop != NULL);
> > > update_ssa (TODO_update_ssa);
> > >
> > > + /* NEW_LOOP is a non-rolling loop. */
> > > + mark_non_rolling_loop (new_loop);
> > > +
> > > /* Determine the probability of the exit edge of the unrolled loop. */
> > > new_est_niter = est_niter / factor;
> > >
> > > diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
> > > index ee85f6f..1e2e4b2 100644
> > > --- a/gcc/tree-ssa-loop-niter.c
> > > +++ b/gcc/tree-ssa-loop-niter.c
> > > @@ -3011,6 +3011,26 @@ estimate_numbers_of_iterations (bool use_undefined_p)
> > > fold_undefer_and_ignore_overflow_warnings ();
> > > }
> > >
> > > +/* Mark LOOP as a non-rolling loop. */
> > > +
> > > +void
> > > +mark_non_rolling_loop (struct loop *loop)
> > > +{
> > > + gcc_assert (loop && loop->header);
> > > + loop->header->flags |= BB_HEADER_OF_NONROLLING_LOOP;
> > > +}
> > > +
> > > +/* Return true if LOOP is a non-rolling loop. */
> > > +
> > > +bool
> > > +non_rolling_loop_p (struct loop *loop)
> > > +{
> > > + int masked_flags;
> > > + gcc_assert (loop && loop->header);
> > > + masked_flags = (loop->header->flags & BB_HEADER_OF_NONROLLING_LOOP);
> > > + return (masked_flags != 0);
> > > +}
> > > +
> > > /* Returns true if statement S1 dominates statement S2. */
> > >
> > > bool
> > > diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> > > index 6ecd304..216de78 100644
> > > --- a/gcc/tree-vect-loop-manip.c
> > > +++ b/gcc/tree-vect-loop-manip.c
> > > @@ -1938,6 +1938,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio,
> > > cond_expr, cond_expr_stmt_list);
> > > gcc_assert (new_loop);
> > > gcc_assert (loop_num == loop->num);
> > > +
> > > + /* NEW_LOOP is a non-rolling loop. */
> > > + mark_non_rolling_loop (new_loop);
> > > +
> > > #ifdef ENABLE_CHECKING
> > > slpeel_verify_cfg_after_peeling (loop, new_loop);
> > > #endif
> > > @@ -2191,6 +2195,10 @@ vect_do_peeling_for_alignment (loop_vec_info loop_vinfo)
> > > th, true, NULL_TREE, NULL);
> > >
> > > gcc_assert (new_loop);
> > > +
> > > + /* NEW_LOOP is a non-rolling loop. */
> > > + mark_non_rolling_loop (new_loop);
> > > +
> > > #ifdef ENABLE_CHECKING
> > > slpeel_verify_cfg_after_peeling (new_loop, loop);
> > > #endif
> > > --
> > > 1.6.3.3
> > >
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-17 22:35 ` Fang, Changpeng
2010-12-17 22:54 ` [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops Jack Howarth
@ 2010-12-18 11:35 ` Jack Howarth
1 sibling, 0 replies; 36+ messages in thread
From: Jack Howarth @ 2010-12-18 11:35 UTC (permalink / raw)
To: Fang, Changpeng
Cc: Zdenek Dvorak, Richard Guenther, Xinliang David Li, gcc-patches
On Fri, Dec 17, 2010 at 03:30:37PM -0600, Fang, Changpeng wrote:
> Hi, Jack:
>
> Is prefetching enabled by default at -O3 on your systems?
>
> Can you run an additional test with
> -O3 -ffast-math, with and without the original patch?
>
> This way, we can tell whether it is an RTL loop unrolling problem.
unpatched r168001
================================================================================
Date & Time : 17 Dec 2010 19:24:47
Test Name : gfortran_lin_O3_nounroll
Compile Command : gfortran -ffast-math -O3 %n.f90 -o %n
Benchmarks : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times : 2000.0
Target Error % : 0.100
Minimum Repeats : 10
Maximum Repeats : 100
Benchmark Compile Executable Ave Run Number Estim
Name (secs) (bytes) (secs) Repeats Err %
--------- ------- ---------- ------- ------- ------
ac 1.01 10000 8.76 10 0.0129
aermod 46.64 10000 17.23 10 0.0154
air 2.57 10000 5.45 10 0.0579
capacita 1.63 10000 33.17 10 0.0407
channel 0.62 10000 1.87 10 0.0166
doduc 6.21 10000 27.33 10 0.0067
fatigue 2.02 10000 7.95 10 0.0339
gas_dyn 2.18 10000 4.36 19 0.0871
induct 5.08 10000 12.44 10 0.0054
linpk 0.70 10000 15.52 10 0.0638
mdbx 1.87 10000 11.46 10 0.0080
nf 1.25 10000 31.27 19 0.0955
protein 3.66 10000 35.68 10 0.0062
rnflow 4.54 10000 26.09 10 0.0104
test_fpu 3.46 10000 8.81 10 0.0443
tfft 0.50 10000 1.90 10 0.0442
Geometric Mean Execution Time = 11.09 seconds
================================================================================
r168001 with original patch
================================================================================
Date & Time : 17 Dec 2010 21:49:34
Test Name : gfortran_lin_O3_nounroll
Compile Command : gfortran -ffast-math -O3 %n.f90 -o %n
Benchmarks : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times : 2000.0
Target Error % : 0.100
Minimum Repeats : 10
Maximum Repeats : 100
Benchmark Compile Executable Ave Run Number Estim
Name (secs) (bytes) (secs) Repeats Err %
--------- ------- ---------- ------- ------- ------
ac 0.89 10000 8.76 10 0.0146
aermod 45.65 10000 17.23 10 0.0068
air 2.48 10000 5.44 12 0.0345
capacita 1.50 10000 33.20 10 0.0661
channel 0.60 10000 1.87 10 0.0331
doduc 6.07 10000 27.34 10 0.0118
fatigue 1.84 10000 7.96 10 0.0169
gas_dyn 1.95 10000 4.36 17 0.0911
induct 4.97 10000 12.44 10 0.0068
linpk 0.69 10000 15.54 10 0.0884
mdbx 1.83 10000 11.46 10 0.0258
nf 1.09 10000 31.38 20 0.0899
protein 3.09 10000 35.64 10 0.0141
rnflow 4.22 10000 26.06 10 0.0883
test_fpu 3.24 10000 8.82 12 0.0866
tfft 0.50 10000 1.91 10 0.0475
Geometric Mean Execution Time = 11.09 seconds
================================================================================
So the performance regression with the patch only manifests itself with
-funroll-loops.
Jack
>
> Thanks,
>
> Changpeng
>
>
>
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-16 18:26 ` Fang, Changpeng
2010-12-16 20:06 ` Zdenek Dvorak
@ 2010-12-19 0:29 ` Richard Guenther
2010-12-21 22:45 ` Fang, Changpeng
1 sibling, 1 reply; 36+ messages in thread
From: Richard Guenther @ 2010-12-19 0:29 UTC (permalink / raw)
To: Fang, Changpeng; +Cc: Zdenek Dvorak, Xinliang David Li, gcc-patches
On Thu, Dec 16, 2010 at 6:22 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
> My initial intention is not to unroll prologue and epilogue loops. An estimated trip count
> may not be that useful for the unrolling decision. To me, unrolling a loop that has at most
> 3 (or 7) iterations does not make sense. RTL unrolling does not use the estimated trip
> count to determine the unroll factor, and thus it may still unroll the loop 4 or 8 times if
> the loop is small (in #insns). To make things simple, we just don't unroll such loops.
>
> However, a prologue or epilogue loop may still be a hot loop, depending on the outer
> loops. It may still be beneficial to perform other optimizations on such loops, if the code
> size is not expanded multiple times.
>
> For prefetching of prologue or epilogue loops, we have two choices: (1) prefetching but
> not unrolling, (2) not prefetching. Which one do you prefer?
For small loop bodies it might make sense to completely peel the
prologue/epilogue loops (think of vectorizing doubles where those
loops roll at most once). It would be nice to figure out whether
loop analysis (or later jump threading) is able to do that.
Richard.
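
To make the "roll at most once" remark concrete, here is a minimal sketch, assuming a loop over N elements, a vectorization factor VF, and no peeling for alignment. The function names are hypothetical and are not GCC internals; they only show that the scalar epilogue runs N % VF times, i.e. at most VF - 1 iterations (at most one for VF == 2, as with doubles in a 128-bit vector).

```c
#include <assert.h>

/* Elements covered by the vectorized main loop for N elements and
   vectorization factor VF (hypothetical helper, not a GCC function).  */
unsigned
main_loop_elems (unsigned n, unsigned vf)
{
  assert (vf > 0);
  return n - n % vf;
}

/* Iterations left for the scalar epilogue loop: always in [0, VF - 1].  */
unsigned
epilogue_iters (unsigned n, unsigned vf)
{
  assert (vf > 0);
  return n % vf;
}

/* Sum an array the way a vectorizer with VF == 2 might split it: a main
   loop stepping by 2, then an epilogue that runs at most once.  */
double
sum_split (const double *a, unsigned n)
{
  double s0 = 0.0, s1 = 0.0;
  unsigned i, limit = main_loop_elems (n, 2);

  for (i = 0; i < limit; i += 2)   /* "vector" body, two lanes */
    {
      s0 += a[i];
      s1 += a[i + 1];
    }
  for (; i < n; i++)               /* epilogue: n % 2 iterations */
    s0 += a[i];
  return s0 + s1;
}
```

With so few iterations possible, unrolling the epilogue by 4 or 8 cannot pay off, while fully peeling it is cheap when the body is small.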
> Thanks,
>
> Changpeng
>
>
>
> ________________________________________
> From: Zdenek Dvorak [rakdver@kam.mff.cuni.cz]
> Sent: Thursday, December 16, 2010 6:09 AM
> To: Richard Guenther
> Cc: Xinliang David Li; Fang, Changpeng; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
>
> Hi,
>
>> Btw, any reason why we do not use static profiles for number of iteration
>> estimates? We after all _do_ use the static profile to guide the
>> maybe_hot/cold_bb tests.
>
> for loops for which we cannot determine the # of iterations statically,
> basically the only important predictors are PRED_LOOP_BRANCH and
> PRED_LOOP_EXIT, which predict that the loop will iterate about 10 times. So,
> by using static profile, we would just learn that every such loop is expected
> to iterate 10 times, which is kind of useless,
>
> Zdenek
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* RE: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-19 0:29 ` Richard Guenther
@ 2010-12-21 22:45 ` Fang, Changpeng
0 siblings, 0 replies; 36+ messages in thread
From: Fang, Changpeng @ 2010-12-21 22:45 UTC (permalink / raw)
To: Richard Guenther; +Cc: Zdenek Dvorak, Xinliang David Li, gcc-patches
>For small loop bodies it might make sense to completely peel the
>prologue/epilogue loops (think of vectorizing doubles where those
>loops roll at most once). It would be nice to figure out whether
>loop analysis (or later jump threading) is able to do that.
Hi, Richard,
This is a good point. My intention is to apply the NON-ROLLING loop
marking approach only to loops whose trip count cannot be
determined at compile time. So, before applying the approach,
we should first check whether the loop rolls a constant number of times.
Thanks,
Changpeng
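
The idea above could be sketched roughly as follows. This is only a toy model under stated assumptions: the toy_* types and functions are hypothetical stand-ins and not GCC's struct loop or basic_block, and the niter_known field is an invented placeholder for whatever analysis reports a compile-time constant trip count. Only the flag test and the early return in the speed predicate follow the posted patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for GCC's structures; only the fields used here.  */
enum { TOY_BB_HEADER_OF_NONROLLING_LOOP = 1 << 12 };

struct toy_bb { int flags; };

struct toy_loop
{
  struct toy_bb *header;
  bool niter_known;        /* trip count is a compile-time constant */
  unsigned niter;          /* valid only when NITER_KNOWN */
};

/* Mark LOOP as non-rolling unless its trip count is a known constant,
   in which case complete peeling may be the better choice.
   Returns true if the loop was marked.  */
bool
toy_maybe_mark_non_rolling (struct toy_loop *loop)
{
  assert (loop && loop->header);
  if (loop->niter_known)
    return false;
  loop->header->flags |= TOY_BB_HEADER_OF_NONROLLING_LOOP;
  return true;
}

/* Return true if LOOP was marked as non-rolling.  */
bool
toy_non_rolling_loop_p (const struct toy_loop *loop)
{
  assert (loop && loop->header);
  return (loop->header->flags & TOY_BB_HEADER_OF_NONROLLING_LOOP) != 0;
}

/* Mirrors the patched predicate: non-rolling loops are never optimized
   for speed.  HEADER_HOT stands in for the result of
   optimize_bb_for_speed_p on the loop header.  */
bool
toy_optimize_loop_for_speed_p (const struct toy_loop *loop, bool header_hot)
{
  if (toy_non_rolling_loop_p (loop))
    return false;
  return header_hot;
}
```

A loop with a constant trip count stays unmarked here, so later passes remain free to peel it completely.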
>Richard.
> Thanks,
>
> Changpeng
>
>
>
> ________________________________________
> From: Zdenek Dvorak [rakdver@kam.mff.cuni.cz]
> Sent: Thursday, December 16, 2010 6:09 AM
> To: Richard Guenther
> Cc: Xinliang David Li; Fang, Changpeng; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
>
> Hi,
>
>> Btw, any reason why we do not use static profiles for number of iteration
>> estimates? We after all _do_ use the static profile to guide the
>> maybe_hot/cold_bb tests.
>
> for loops for that we cannot determine the # of iterations statically,
> basically the only important predictors are PRED_LOOP_BRANCH and
> PRED_LOOP_EXIT, which predict that the loop will iterate about 10 times. So,
> by using static profile, we would just learn that every such loop is expected
> to iterate 10 times, which is kind of useless,
>
> Zdenek
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2010-12-17 9:55 ` Fang, Changpeng
2010-12-17 16:13 ` Jack Howarth
@ 2011-01-04 3:33 ` Jack Howarth
2011-01-04 22:40 ` Fang, Changpeng
1 sibling, 1 reply; 36+ messages in thread
From: Jack Howarth @ 2011-01-04 3:33 UTC (permalink / raw)
To: Fang, Changpeng
Cc: Zdenek Dvorak, Richard Guenther, Xinliang David Li, gcc-patches
On Fri, Dec 17, 2010 at 01:14:49AM -0600, Fang, Changpeng wrote:
> Hi, Jack:
>
> Thanks for the testing.
>
> This patch is not supposed to slow down a program by 10% (rnflow and test_fpu).
> It would be helpful if you could provide an analysis of why they are slowed down.
Changpeng,
The corrected merge against gcc trunk of...
Index: gcc/basic-block.h
===================================================================
--- gcc/basic-block.h (revision 168437)
+++ gcc/basic-block.h (working copy)
@@ -247,11 +247,14 @@
Only used in cfgcleanup.c. */
BB_NONTHREADABLE_BLOCK = 1 << 11,
+ /* Set on blocks that are headers of non-rolling loops. */
+ BB_HEADER_OF_NONROLLING_LOOP = 1 << 12,
+
/* Set on blocks that were modified in some way. This bit is set in
df_set_bb_dirty, but not cleared by df_analyze, so it can be used
to test whether a block has been modified prior to a df_analyze
call. */
- BB_MODIFIED = 1 << 12
+ BB_MODIFIED = 1 << 13
};
/* Dummy flag for convenience in the hot/cold partitioning code. */
for the proposed patch from http://gcc.gnu.org/ml/gcc-patches/2010-12/msg01344.html
eliminated the performance regressions on x86_64-apple-darwin10. I now get...
Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n
Execution Time
-m32
stock patched %increase
ac 10.59 10.59 0.0
aermod 19.49 19.13 -1.8
air 6.07 6.07 0.0
capacita 44.60 44.61 0.0
channel 1.98 1.98 0.0
doduc 31.19 31.31 0.4
fatigue 9.90 10.29 3.9
gas_dyn 4.72 4.71 -0.2
induct 13.93 13.93 0.0
linpk 15.50 15.49 -0.1
mdbx 11.28 11.26 -0.2
nf 27.62 27.58 -0.1
protein 38.70 38.60 -0.3
rnflow 24.68 24.68 0.0
test_fpu 10.13 10.13 0.0
tfft 1.92 1.92 0.0
Geometric Mean Execution Time 12.06 12.08 0.2
-m64
stock patched %increase
ac 8.80 8.80 0.0
aermod 17.34 17.17 -1.0
air 5.48 5.52 0.7
capacita 32.38 32.50 0.4
channel 1.84 1.84 0.0
doduc 26.50 26.52 0.1
fatigue 8.35 8.33 -0.2
gas_dyn 4.30 4.29 -0.2
induct 12.83 12.83 0.0
linpk 15.49 15.49 0.0
mdbx 11.23 11.22 -0.1
nf 30.21 30.16 -0.2
protein 34.13 32.07 -6.0
rnflow 23.18 23.19 0.0
test_fpu 8.04 8.02 -0.2
tfft 1.87 1.86 -0.5
Geometric Mean Execution Time 10.87 10.82 -0.5
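One plausible reading of the "corrected merge" above, sketched in C: the
patch as originally posted gave BB_HEADER_OF_NONROLLING_LOOP the value
1 << 12, and trunk at revision 168437 already used 1 << 12 for BB_MODIFIED.
The thread does not show what the earlier, uncorrected merge looked like, so
the colliding enum below is an assumption, and all names are hypothetical
stand-ins rather than GCC's real declarations.

```c
#include <stdbool.h>

/* Assumed shape of a bad merge: both flags share bit 12.  */
enum colliding_flags
{
  COLLIDING_MODIFIED = 1 << 12,
  COLLIDING_NONROLLING = 1 << 12
};

/* Shape of the corrected merge quoted above: distinct bits.  */
enum distinct_flags
{
  DISTINCT_NONROLLING = 1 << 12,
  DISTINCT_MODIFIED = 1 << 13
};

/* Return true if any bit selected by MASK is set in FLAGS.  */
bool
flag_set_p (int flags, int mask)
{
  return (flags & mask) != 0;
}

/* Mark a block as modified, then ask whether it looks like the header
   of a non-rolling loop.  With colliding bits the answer is true.  */
bool
modified_looks_nonrolling_colliding (void)
{
  int flags = 0;
  flags |= COLLIDING_MODIFIED;
  return flag_set_p (flags, COLLIDING_NONROLLING);
}

/* Same question with distinct bits: the answer is false.  */
bool
modified_looks_nonrolling_distinct (void)
{
  int flags = 0;
  flags |= DISTINCT_MODIFIED;
  return flag_set_p (flags, DISTINCT_NONROLLING);
}
```

If the two flags did share a bit, every block touched by df_set_bb_dirty
would also read as a non-rolling loop header, which could explain hot loops
being treated as cold.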
>
> We did see a significant compilation time reduction for most pb05 programs.
> (I don't know why you do not have executable size data).
>
> >I would note that intel darwin now defaults
> >to -mtune=core2 and always has defaulted to -fPIC.
>
> I could not understand these defaults for darwin. My understanding is that,
> for x86_64, the default should be -mtune=generic.
>
> Thanks,
>
> Changpeng
>
> ________________________________________
> From: Jack Howarth [howarth@bromo.med.uc.edu]
> Sent: Thursday, December 16, 2010 11:31 PM
> To: Fang, Changpeng
> Cc: Zdenek Dvorak; Richard Guenther; Xinliang David Li; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
>
> On Thu, Dec 16, 2010 at 06:13:52PM -0600, Fang, Changpeng wrote:
> > Hi,
> >
> > Based on previous discussions, I modified the patch as such.
> >
> > If a loop is marked as non-rolling, optimize_loop_for_size_p returns TRUE
> > and optimize_loop_for_speed_p returns FALSE. All users of these two
> > functions will be affected.
> >
> > After applying the modified patch, pb05 compilation time decreases 29%, binary
> > size decreases 20%, while a small (0.5%) performance increase was found which may
> > be just noise.
> >
> > Modified patch passed bootstrapping and gcc regression tests on x86_64-unknown-linux-gnu.
> >
> > Is it OK to commit to trunk?
>
> Changpeng,
> On x86_64-apple-darwin10, I am finding a more severe penalty for this patch with
> the pb05 benchmarks. Using a profiledbootstrap BOOT_CFLAGS="-g -O3" build with...
>
> Configured with: ../gcc-4.6-20101216/configure --prefix=/sw --prefix=/sw/lib/gcc4.6 --mandir=/sw/share/man --infodir=/sw/lib/gcc4.6/info --enable-languages=c,c++,fortran,objc,obj-c++,java --with-gmp=/sw --with-libiconv-prefix=/sw --with-ppl=/sw --with-cloog=/sw --with-mpc=/sw --with-system-zlib --x-includes=/usr/X11R6/include --x-libraries=/usr/X11R6/lib --program-suffix=-fsf-4.6 --enable-checking=yes --enable-cloog-backend=isl --enable-build-with-cxx
>
> I get without the patch...
>
> ================================================================================
> Date & Time : 16 Dec 2010 23:36:27
> Test Name : gfortran_lin_O3
> Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n
> Benchmarks : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
> Maximum Times : 2000.0
> Target Error % : 0.100
> Minimum Repeats : 10
> Maximum Repeats : 100
>
> Benchmark Compile Executable Ave Run Number Estim
> Name (secs) (bytes) (secs) Repeats Err %
> --------- ------- ---------- ------- ------- ------
> ac 1.76 10000 8.78 12 0.0081
> aermod 54.94 10000 17.28 10 0.0307
> air 3.44 10000 5.53 13 0.0734
> capacita 2.64 10000 32.65 10 0.0096
> channel 0.89 10000 1.84 20 0.0977
> doduc 8.12 10000 27.00 10 0.0132
> fatigue 3.06 10000 8.36 10 0.0104
> gas_dyn 5.00 10000 4.30 17 0.0915
> induct 6.24 10000 12.42 10 0.0100
> linpk 1.02 10000 15.50 12 0.0542
> mdbx 2.55 10000 11.24 10 0.0256
> nf 2.90 10000 30.16 20 0.0989
> protein 7.98 10000 33.72 10 0.0070
> rnflow 9.34 10000 23.21 10 0.0551
> test_fpu 6.72 10000 8.05 10 0.0426
> tfft 0.76 10000 1.87 10 0.0597
>
> Geometric Mean Execution Time = 10.87 seconds
>
> and with the patch...
>
> ================================================================================
> Date & Time : 16 Dec 2010 21:31:06
> Test Name : gfortran_lin_O3
> Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n
> Benchmarks : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
> Maximum Times : 2000.0
> Target Error % : 0.100
> Minimum Repeats : 10
> Maximum Repeats : 100
>
> Benchmark Compile Executable Ave Run Number Estim
> Name (secs) (bytes) (secs) Repeats Err %
> --------- ------- ---------- ------- ------- ------
> ac 1.19 10000 8.78 10 0.0099
> aermod 47.91 10000 16.95 10 0.0123
> air 2.85 10000 5.34 12 0.0715
> capacita 1.63 10000 33.10 10 0.0361
> channel 0.67 10000 1.87 10 0.0884
> doduc 6.42 10000 27.35 10 0.0206
> fatigue 2.10 10000 8.32 10 0.0194
> gas_dyn 2.07 10000 4.30 17 0.0843
> induct 5.38 10000 12.58 10 0.0088
> linpk 0.71 10000 15.69 18 0.0796
> mdbx 1.95 10000 11.41 10 0.0238
> nf 1.24 10000 31.34 12 0.0991
> protein 3.88 10000 35.13 10 0.0659
> rnflow 4.73 10000 25.97 10 0.0629
> test_fpu 3.66 10000 8.88 11 0.0989
> tfft 0.52 10000 1.89 10 0.0403
>
> Geometric Mean Execution Time = 11.09 seconds
>
> This shows about a 2.0% performance reduction in the Geometric
> Mean Execution Time. I would note that intel darwin now defaults
> to -mtune=core2 and always has defaulted to -fPIC.
> Jack
>
> >
> > Thanks,
> >
> > Changpeng
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > ________________________________________
> > From: Zdenek Dvorak [rakdver@kam.mff.cuni.cz]
> > Sent: Thursday, December 16, 2010 12:47 PM
> > To: Fang, Changpeng
> > Cc: Richard Guenther; Xinliang David Li; gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
> >
> > Hi,
> >
> > > For prefetching of prologue or epilogue loops, we have two choices (1) prefetching but
> > > not unrolling, (2) not prefetching. Which one do you prefer?
> >
> > it is better not to prefetch (the current placement of prefetches is not good for non-rolling
> > loops),
> >
> > Zdenek
> >
>
> Content-Description: polyhedron1.txt
> > Comparison of Polyhedron (2005) Compile Time, Binary Size and Performance After Applying the NON-ROLLING Marking Patch
> >
> > gfortran -Ofast -funroll-loops -march=amdfam10 %n.f90 -o %n
> > ============================================================================================================================
> > || | Before Patch | After Patch | Changes ||
> > ||========================================================================================================================||
> > || Benchmark | Compile Binary Size Run Time | Compile Binary Size Run Time | Compile Binary Performance ||
> > || Name | Time (s) (bytes) (s) | Time(s) (bytes) (secs) | Time (%) Size(%) (%) ||
> > ||========================================================================================================================||
> > || ac | 3.36 41976 13.26 | 2.52 34424 13.21 | -25.00 -17.99 0.38 ||
> > || aermod | 103.45 1412221 44.36 | 86.55 1268861 43.08 | -16.34 -10.15 2.97 ||
> > || air | 6.11 75186 11.73 | 5.72 71090 11.56 | -6.38 -5.45 1.47 ||
> > || capacita | 6.83 91257 88.40 | 4.70 74585 88.01 | -31.19 -18.27 0.44 ||
> > || channel | 2.14 39984 6.65 | 1.84 35888 6.69 | -14.02 -10.24 -0.60 ||
> > || doduc | 12.78 198624 38.59 | 12.20 186336 38.18 | -4.54 -6.19 1.07 ||
> > || fatigue | 9.11 110008 10.15 | 5.93 92472 10.12 | -34.91 -15.94 0.30 ||
> > || gas_dyn | 15.69 149726 7.14 | 8.45 109342 6.96 | -46.14 -26.97 2.59 ||
> > || induct | 10.98 191800 20.66 | 10.62 188168 20.61 | -3.28 -1.89 0.24 ||
> > || linpk | 2.27 46073 19.03 | 1.68 33401 19.03 | -25.99 -27.50 0.00 ||
> > || mdbx | 5.63 103731 21.41 | 4.24 83251 21.33 | -24.69 -19.74 0.38 ||
> > || nf | 14.18 118451 22.88 | 5.55 60499 23.14 | -60.86 -48.92 -1.12 ||
> > || protein | 34.20 177700 47.04 | 19.16 135012 46.94 | -43.98 -24.02 0.21 ||
> > || rnflow | 42.13 283645 40.30 | 20.92 178477 40.65 | -50.34 -37.08 -0.86 ||
> > || test_fpu | 30.17 252080 14.46 | 14.44 149136 14.32 | -52.14 -40.84 0.98 ||
> > || tfft | 1.50 32450 7.71 | 1.12 26546 7.67 | -25.33 -18.19 0.52 ||
> > ||========================================================================================================================||
> > || average | 19.57 | 19.46 | -29.07 -20.59 0.56 ||
> > ============================================================================================================================
>
> Content-Description: 0001-Don-t-perform-certain-loop-optimizations-on-pre-post.patch
> > From cd8b85bba1b39e108235f44d9d07918179ff3d79 Mon Sep 17 00:00:00 2001
> > From: Changpeng Fang <chfang@houghton.(none)>
> > Date: Mon, 13 Dec 2010 12:01:49 -0800
> > Subject: [PATCH] Don't perform certain loop optimizations on pre/post loops
> >
> > * basic-block.h (bb_flags): Add a new flag BB_HEADER_OF_NONROLLING_LOOP.
> > * cfg.c (clear_bb_flags): Keep BB_HEADER_OF_NONROLLING marker.
> > * cfgloop.h (mark_non_rolling_loop): New function declaration.
> > (non_rolling_loop_p): New function declaration.
> > * predict.c (optimize_loop_for_size_p): Return true if the loop was marked
> > NON-ROLLING. (optimize_loop_for_speed_p): Return false if the loop was
> > marked NON-ROLLING.
> > * tree-ssa-loop-manip.c (tree_transform_and_unroll_loop): Mark the
> > non-rolling loop.
> > * tree-ssa-loop-niter.c (mark_non_rolling_loop): Implement the new
> > function. (non_rolling_loop_p): Implement the new function.
> > * tree-vect-loop-manip.c (vect_do_peeling_for_loop_bound): Mark the
> > non-rolling loop. (vect_do_peeling_for_alignment): Mark the non-rolling
> > loop.
> > ---
> > gcc/basic-block.h | 6 +++++-
> > gcc/cfg.c | 7 ++++---
> > gcc/cfgloop.h | 2 ++
> > gcc/predict.c | 6 ++++++
> > gcc/tree-ssa-loop-manip.c | 3 +++
> > gcc/tree-ssa-loop-niter.c | 20 ++++++++++++++++++++
> > gcc/tree-vect-loop-manip.c | 8 ++++++++
> > 7 files changed, 48 insertions(+), 4 deletions(-)
> >
> > diff --git a/gcc/basic-block.h b/gcc/basic-block.h
> > index be0a1d1..850472d 100644
> > --- a/gcc/basic-block.h
> > +++ b/gcc/basic-block.h
> > @@ -245,7 +245,11 @@ enum bb_flags
> >
> > /* Set on blocks that cannot be threaded through.
> > Only used in cfgcleanup.c. */
> > - BB_NONTHREADABLE_BLOCK = 1 << 11
> > + BB_NONTHREADABLE_BLOCK = 1 << 11,
> > +
> > + /* Set on blocks that are headers of non-rolling loops. */
> > + BB_HEADER_OF_NONROLLING_LOOP = 1 << 12
> > +
> > };
> >
> > /* Dummy flag for convenience in the hot/cold partitioning code. */
> > diff --git a/gcc/cfg.c b/gcc/cfg.c
> > index c8ef799..e59a637 100644
> > --- a/gcc/cfg.c
> > +++ b/gcc/cfg.c
> > @@ -425,8 +425,8 @@ redirect_edge_pred (edge e, basic_block new_pred)
> > connect_src (e);
> > }
> >
> > -/* Clear all basic block flags, with the exception of partitioning and
> > - setjmp_target. */
> > +/* Clear all basic block flags, with the exception of partitioning,
> > + setjmp_target, and the non-rolling loop marker. */
> > void
> > clear_bb_flags (void)
> > {
> > @@ -434,7 +434,8 @@ clear_bb_flags (void)
> >
> > FOR_BB_BETWEEN (bb, ENTRY_BLOCK_PTR, NULL, next_bb)
> > bb->flags = (BB_PARTITION (bb)
> > - | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET)));
> > + | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET
> > + + BB_HEADER_OF_NONROLLING_LOOP)));
> > }
> >
> > /* Check the consistency of profile information. We can't do that
> > diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
> > index bf2614e..e856a78 100644
> > --- a/gcc/cfgloop.h
> > +++ b/gcc/cfgloop.h
> > @@ -279,6 +279,8 @@ extern rtx doloop_condition_get (rtx);
> > void estimate_numbers_of_iterations_loop (struct loop *, bool);
> > HOST_WIDE_INT estimated_loop_iterations_int (struct loop *, bool);
> > bool estimated_loop_iterations (struct loop *, bool, double_int *);
> > +void mark_non_rolling_loop (struct loop *);
> > +bool non_rolling_loop_p (struct loop *);
> >
> > /* Loop manipulation. */
> > extern bool can_duplicate_loop_p (const struct loop *loop);
> > diff --git a/gcc/predict.c b/gcc/predict.c
> > index c691990..bf729f8 100644
> > --- a/gcc/predict.c
> > +++ b/gcc/predict.c
> > @@ -279,6 +279,9 @@ optimize_insn_for_speed_p (void)
> > bool
> > optimize_loop_for_size_p (struct loop *loop)
> > {
> > + /* Loops marked NON-ROLLING are not likely to be hot. */
> > + if (non_rolling_loop_p (loop))
> > + return true;
> > return optimize_bb_for_size_p (loop->header);
> > }
> >
> > @@ -287,6 +290,9 @@ optimize_loop_for_size_p (struct loop *loop)
> > bool
> > optimize_loop_for_speed_p (struct loop *loop)
> > {
> > + /* Loops marked NON-ROLLING are not likely to be hot. */
> > + if (non_rolling_loop_p (loop))
> > + return false;
> > return optimize_bb_for_speed_p (loop->header);
> > }
> >
> > diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
> > index 87b2c0d..bc977bb 100644
> > --- a/gcc/tree-ssa-loop-manip.c
> > +++ b/gcc/tree-ssa-loop-manip.c
> > @@ -931,6 +931,9 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
> > gcc_assert (new_loop != NULL);
> > update_ssa (TODO_update_ssa);
> >
> > + /* NEW_LOOP is a non-rolling loop. */
> > + mark_non_rolling_loop (new_loop);
> > +
> > /* Determine the probability of the exit edge of the unrolled loop. */
> > new_est_niter = est_niter / factor;
> >
> > diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
> > index ee85f6f..1e2e4b2 100644
> > --- a/gcc/tree-ssa-loop-niter.c
> > +++ b/gcc/tree-ssa-loop-niter.c
> > @@ -3011,6 +3011,26 @@ estimate_numbers_of_iterations (bool use_undefined_p)
> > fold_undefer_and_ignore_overflow_warnings ();
> > }
> >
> > +/* Mark LOOP as a non-rolling loop. */
> > +
> > +void
> > +mark_non_rolling_loop (struct loop *loop)
> > +{
> > + gcc_assert (loop && loop->header);
> > + loop->header->flags |= BB_HEADER_OF_NONROLLING_LOOP;
> > +}
> > +
> > +/* Return true if LOOP is a non-rolling loop. */
> > +
> > +bool
> > +non_rolling_loop_p (struct loop *loop)
> > +{
> > + int masked_flags;
> > + gcc_assert (loop && loop->header);
> > + masked_flags = (loop->header->flags & BB_HEADER_OF_NONROLLING_LOOP);
> > + return (masked_flags != 0);
> > +}
> > +
> > /* Returns true if statement S1 dominates statement S2. */
> >
> > bool
> > diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> > index 6ecd304..216de78 100644
> > --- a/gcc/tree-vect-loop-manip.c
> > +++ b/gcc/tree-vect-loop-manip.c
> > @@ -1938,6 +1938,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio,
> > cond_expr, cond_expr_stmt_list);
> > gcc_assert (new_loop);
> > gcc_assert (loop_num == loop->num);
> > +
> > + /* NEW_LOOP is a non-rolling loop. */
> > + mark_non_rolling_loop (new_loop);
> > +
> > #ifdef ENABLE_CHECKING
> > slpeel_verify_cfg_after_peeling (loop, new_loop);
> > #endif
> > @@ -2191,6 +2195,10 @@ vect_do_peeling_for_alignment (loop_vec_info loop_vinfo)
> > th, true, NULL_TREE, NULL);
> >
> > gcc_assert (new_loop);
> > +
> > + /* NEW_LOOP is a non-rolling loop. */
> > + mark_non_rolling_loop (new_loop);
> > +
> > #ifdef ENABLE_CHECKING
> > slpeel_verify_cfg_after_peeling (new_loop, loop);
> > #endif
> > --
> > 1.6.3.3
> >
* RE: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2011-01-04 3:33 ` Jack Howarth
@ 2011-01-04 22:40 ` Fang, Changpeng
2011-01-05 12:07 ` Richard Guenther
2011-01-05 14:12 ` Jack Howarth
0 siblings, 2 replies; 36+ messages in thread
From: Fang, Changpeng @ 2011-01-04 22:40 UTC (permalink / raw)
To: Jack Howarth
Cc: Zdenek Dvorak, Richard Guenther, Xinliang David Li, gcc-patches
[-- Attachment #1: Type: text/plain, Size: 3589 bytes --]
Hi,
Thanks, Jack. Hopefully the 6% protein gain (for -m64) is not just noise.
I updated the patch based on the current trunk (REV 168477).
Is it OK to commit now?
Thanks,
Changpeng
________________________________________
From: Jack Howarth [howarth@bromo.med.uc.edu]
Sent: Monday, January 03, 2011 9:04 PM
To: Fang, Changpeng
Cc: Zdenek Dvorak; Richard Guenther; Xinliang David Li; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
On Fri, Dec 17, 2010 at 01:14:49AM -0600, Fang, Changpeng wrote:
> Hi, Jack:
>
> Thanks for the testing.
>
> This patch is not supposed to slow down a program by 10% (rnflow and test_fpu).
> It would be helpful if you can provide analysis why they are slowed down.
Changpeng,
The corrected merge against gcc trunk of...
Index: gcc/basic-block.h
===================================================================
--- gcc/basic-block.h (revision 168437)
+++ gcc/basic-block.h (working copy)
@@ -247,11 +247,14 @@
Only used in cfgcleanup.c. */
BB_NONTHREADABLE_BLOCK = 1 << 11,
+ /* Set on blocks that are headers of non-rolling loops. */
+ BB_HEADER_OF_NONROLLING_LOOP = 1 << 12,
+
/* Set on blocks that were modified in some way. This bit is set in
df_set_bb_dirty, but not cleared by df_analyze, so it can be used
to test whether a block has been modified prior to a df_analyze
call. */
- BB_MODIFIED = 1 << 12
+ BB_MODIFIED = 1 << 13
};
/* Dummy flag for convenience in the hot/cold partitioning code. */
for the proposed patch from http://gcc.gnu.org/ml/gcc-patches/2010-12/msg01344.html
eliminated the performance regressions on x86_64-apple-darwin10. I now get...
Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n
Execution Time
-m32
stock patched %increase
ac 10.59 10.59 0.0
aermod 19.49 19.13 -1.8
air 6.07 6.07 0.0
capacita 44.60 44.61 0.0
channel 1.98 1.98 0.0
doduc 31.19 31.31 0.4
fatigue 9.90 10.29 3.9
gas_dyn 4.72 4.71 -0.2
induct 13.93 13.93 0.0
linpk 15.50 15.49 -0.1
mdbx 11.28 11.26 -0.2
nf 27.62 27.58 -0.1
protein 38.70 38.60 -0.3
rnflow 24.68 24.68 0.0
test_fpu 10.13 10.13 0.0
tfft 1.92 1.92 0.0
Geometric Mean Execution Time 12.06 12.08 0.2
-m64
stock patched %increase
ac 8.80 8.80 0.0
aermod 17.34 17.17 -1.0
air 5.48 5.52 0.7
capacita 32.38 32.50 0.4
channel 1.84 1.84 0.0
doduc 26.50 26.52 0.1
fatigue 8.35 8.33 -0.2
gas_dyn 4.30 4.29 -0.2
induct 12.83 12.83 0.0
linpk 15.49 15.49 0.0
mdbx 11.23 11.22 -0.1
nf 30.21 30.16 -0.2
protein 34.13 32.07 -6.0
rnflow 23.18 23.19 0.0
test_fpu 8.04 8.02 -0.2
tfft 1.87 1.86 -0.5
Geometric Mean Execution Time 10.87 10.82 -0.5
[-- Attachment #2: 0001-Consider-a-loop-not-hot-if-it-rolls-only-a-few-times.patch --]
[-- Type: text/x-patch; name="0001-Consider-a-loop-not-hot-if-it-rolls-only-a-few-times.patch", Size: 6232 bytes --]
From 7d10fcfe379e3fe4387cf192fbbdbea7daff0637 Mon Sep 17 00:00:00 2001
From: Changpeng Fang <chfang@houghton.(none)>
Date: Tue, 4 Jan 2011 14:36:50 -0800
Subject: [PATCH] Consider a loop not hot if it rolls only a few times (non-rolling)
* basic-block.h (bb_flags): Add a new flag BB_HEADER_OF_NONROLLING_LOOP.
* cfg.c (clear_bb_flags): Keep BB_HEADER_OF_NONROLLING marker.
* cfgloop.h (mark_non_rolling_loop): New function declaration.
(non_rolling_loop_p): New function declaration.
* predict.c (optimize_loop_for_size_p): Return true if the loop was marked
NON-ROLLING. (optimize_loop_for_speed_p): Return false if the loop was
marked NON-ROLLING.
* tree-ssa-loop-manip.c (tree_transform_and_unroll_loop): Mark the
non-rolling loop.
* tree-ssa-loop-niter.c (mark_non_rolling_loop): Implement the new
function. (non_rolling_loop_p): Implement the new function.
* tree-vect-loop-manip.c (vect_do_peeling_for_loop_bound): Mark the
non-rolling loop. (vect_do_peeling_for_alignment): Mark the non-rolling
loop.
---
gcc/basic-block.h | 5 ++++-
gcc/cfg.c | 7 ++++---
gcc/cfgloop.h | 2 ++
gcc/predict.c | 6 ++++++
gcc/tree-ssa-loop-manip.c | 3 +++
gcc/tree-ssa-loop-niter.c | 20 ++++++++++++++++++++
gcc/tree-vect-loop-manip.c | 8 ++++++++
7 files changed, 47 insertions(+), 4 deletions(-)
diff --git a/gcc/basic-block.h b/gcc/basic-block.h
index 3594eea..081c175 100644
--- a/gcc/basic-block.h
+++ b/gcc/basic-block.h
@@ -251,7 +251,10 @@ enum bb_flags
df_set_bb_dirty, but not cleared by df_analyze, so it can be used
to test whether a block has been modified prior to a df_analyze
call. */
- BB_MODIFIED = 1 << 12
+ BB_MODIFIED = 1 << 12,
+
+ /* Set on blocks that are headers of non-rolling loops. */
+ BB_HEADER_OF_NONROLLING_LOOP = 1 << 13
};
/* Dummy flag for convenience in the hot/cold partitioning code. */
diff --git a/gcc/cfg.c b/gcc/cfg.c
index c8ef799..e59a637 100644
--- a/gcc/cfg.c
+++ b/gcc/cfg.c
@@ -425,8 +425,8 @@ redirect_edge_pred (edge e, basic_block new_pred)
connect_src (e);
}
-/* Clear all basic block flags, with the exception of partitioning and
- setjmp_target. */
+/* Clear all basic block flags, with the exception of partitioning,
+ setjmp_target, and the non-rolling loop marker. */
void
clear_bb_flags (void)
{
@@ -434,7 +434,8 @@ clear_bb_flags (void)
FOR_BB_BETWEEN (bb, ENTRY_BLOCK_PTR, NULL, next_bb)
bb->flags = (BB_PARTITION (bb)
- | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET)));
+ | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET
+ + BB_HEADER_OF_NONROLLING_LOOP)));
}
\f
/* Check the consistency of profile information. We can't do that
diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index f7bb134..0f48115 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -279,6 +279,8 @@ extern rtx doloop_condition_get (rtx);
void estimate_numbers_of_iterations_loop (struct loop *, bool);
HOST_WIDE_INT estimated_loop_iterations_int (struct loop *, bool);
bool estimated_loop_iterations (struct loop *, bool, double_int *);
+void mark_non_rolling_loop (struct loop *);
+bool non_rolling_loop_p (struct loop *);
/* Loop manipulation. */
extern bool can_duplicate_loop_p (const struct loop *loop);
diff --git a/gcc/predict.c b/gcc/predict.c
index a86708a..34d7dff 100644
--- a/gcc/predict.c
+++ b/gcc/predict.c
@@ -280,6 +280,9 @@ optimize_insn_for_speed_p (void)
bool
optimize_loop_for_size_p (struct loop *loop)
{
+ /* Loops marked NON-ROLLING are not likely to be hot. */
+ if (non_rolling_loop_p (loop))
+ return true;
return optimize_bb_for_size_p (loop->header);
}
@@ -288,6 +291,9 @@ optimize_loop_for_size_p (struct loop *loop)
bool
optimize_loop_for_speed_p (struct loop *loop)
{
+ /* Loops marked NON-ROLLING are not likely to be hot. */
+ if (non_rolling_loop_p (loop))
+ return false;
return optimize_bb_for_speed_p (loop->header);
}
diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
index 8176ed8..3ff0790 100644
--- a/gcc/tree-ssa-loop-manip.c
+++ b/gcc/tree-ssa-loop-manip.c
@@ -931,6 +931,9 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
gcc_assert (new_loop != NULL);
update_ssa (TODO_update_ssa);
+ /* NEW_LOOP is a non-rolling loop. */
+ mark_non_rolling_loop (new_loop);
+
/* Determine the probability of the exit edge of the unrolled loop. */
new_est_niter = est_niter / factor;
diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index ee85f6f..ec108c2 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -3011,6 +3011,26 @@ estimate_numbers_of_iterations (bool use_undefined_p)
fold_undefer_and_ignore_overflow_warnings ();
}
+/* Mark LOOP as a non-rolling loop. */
+
+void
+mark_non_rolling_loop (struct loop *loop)
+{
+ gcc_assert (loop && loop->header);
+ loop->header->flags |= BB_HEADER_OF_NONROLLING_LOOP;
+}
+
+/* Return true if LOOP is a non-rolling loop. */
+
+bool
+non_rolling_loop_p (struct loop *loop)
+{
+ int masked_flags;
+ gcc_assert (loop && loop->header);
+ masked_flags = (loop->header->flags & BB_HEADER_OF_NONROLLING_LOOP);
+ return (masked_flags != 0);
+}
+
/* Returns true if statement S1 dominates statement S2. */
bool
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 28b75f1..9bbe68b 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1938,6 +1938,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio,
cond_expr, cond_expr_stmt_list);
gcc_assert (new_loop);
gcc_assert (loop_num == loop->num);
+
+ /* NEW_LOOP is a non-rolling loop. */
+ mark_non_rolling_loop (new_loop);
+
#ifdef ENABLE_CHECKING
slpeel_verify_cfg_after_peeling (loop, new_loop);
#endif
@@ -2191,6 +2195,10 @@ vect_do_peeling_for_alignment (loop_vec_info loop_vinfo)
th, true, NULL_TREE, NULL);
gcc_assert (new_loop);
+
+ /* NEW_LOOP is a non-rolling loop. */
+ mark_non_rolling_loop (new_loop);
+
#ifdef ENABLE_CHECKING
slpeel_verify_cfg_after_peeling (new_loop, loop);
#endif
--
1.6.3.3
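The behaviour the patch aims for can be modelled outside GCC. The following
is a minimal sketch, not the patch itself: the struct types, the "hot" field
and the model_ prefixed functions are simplified stand-ins invented for
illustration, and the bit value 1 << 13 is taken from the updated diff above.

```c
#include <stdbool.h>

/* Stand-in for the new bb_flags entry in the updated patch.  */
enum { MODEL_BB_HEADER_OF_NONROLLING_LOOP = 1 << 13 };

/* Simplified stand-ins for GCC's basic block and loop structures.  */
struct bb_model
{
  int flags;
  bool hot;   /* stands in for the result of optimize_bb_for_speed_p */
};

struct loop_model
{
  struct bb_model *header;
};

/* Mark LOOP as non-rolling by setting the flag on its header block.  */
void
model_mark_non_rolling_loop (struct loop_model *loop)
{
  loop->header->flags |= MODEL_BB_HEADER_OF_NONROLLING_LOOP;
}

/* Return true if LOOP's header carries the non-rolling flag.  */
bool
model_non_rolling_loop_p (const struct loop_model *loop)
{
  return (loop->header->flags & MODEL_BB_HEADER_OF_NONROLLING_LOOP) != 0;
}

/* A non-rolling loop is never optimized for speed.  */
bool
model_optimize_loop_for_speed_p (const struct loop_model *loop)
{
  if (model_non_rolling_loop_p (loop))
    return false;
  return loop->header->hot;
}

/* A non-rolling loop is always optimized for size.  */
bool
model_optimize_loop_for_size_p (const struct loop_model *loop)
{
  if (model_non_rolling_loop_p (loop))
    return true;
  return !loop->header->hot;
}
```

In this model a loop with a hot header is optimized for speed until it is
marked; after marking, both predicates flip, so any pass that consults them
(unrolling, prefetching, unswitching) would back off.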
* Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2011-01-04 22:40 ` Fang, Changpeng
@ 2011-01-05 12:07 ` Richard Guenther
2011-01-05 14:12 ` Jack Howarth
1 sibling, 0 replies; 36+ messages in thread
From: Richard Guenther @ 2011-01-05 12:07 UTC (permalink / raw)
To: Fang, Changpeng
Cc: Jack Howarth, Zdenek Dvorak, Xinliang David Li, gcc-patches
On Tue, Jan 4, 2011 at 11:30 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
> Hi,
>
> Thanks, Jack. Hopefully the 6% protein gain (for -m64) is not just noise.
>
> I updated the patch based on the current trunk (REV 168477).
>
> Is it OK to commit now?
The patch contains neither a testcase nor a reference to a bug, so it
isn't appropriate for stage4 (or even stage3). Also, using BB flags for
this doesn't sound quite right, as you'd have to make sure CFG
manipulations properly copy/unset this flag (which is likely the reason
for the polyhedron regressions seen).
That said, I don't like the patch too much anyway; it looks like a hack.
The proper fix is to finally go the way of preserving loop information
and putting such a flag in the loop information (or, well, just manually
setting the upper bound for the number of iterations therein).
Please postpone this for stage1 of GCC 4.7. Thanks.
Richard.
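To make the contrast concrete, here is a minimal sketch of the two designs
discussed above: a flag on the header block versus a bound stored with the
loop itself. It is only an illustration under stated assumptions. The struct
layouts, field names and sketch_ functions are hypothetical and are not
GCC's actual loop data structures or API.

```c
#include <stdbool.h>

enum { SKETCH_NONROLLING_BB_FLAG = 1 << 13 };

/* Hypothetical basic block: only the flags word matters here.  */
struct bb_sketch
{
  int flags;
};

/* Hypothetical loop record carrying its own iteration bound.  */
struct loop_sketch
{
  struct bb_sketch *header;
  bool has_iteration_bound;
  unsigned max_iterations;
};

/* Record an upper bound on the iteration count in the loop itself.  */
void
sketch_set_iteration_bound (struct loop_sketch *loop, unsigned bound)
{
  loop->has_iteration_bound = true;
  loop->max_iterations = bound;
}

/* Loop-based test: non-rolling if a known bound is at most THRESHOLD.  */
bool
sketch_non_rolling_by_bound_p (const struct loop_sketch *loop,
                               unsigned threshold)
{
  return loop->has_iteration_bound && loop->max_iterations <= threshold;
}

/* Block-based test: non-rolling if the header block carries the flag.  */
bool
sketch_non_rolling_by_flag_p (const struct loop_sketch *loop)
{
  return (loop->header->flags & SKETCH_NONROLLING_BB_FLAG) != 0;
}

/* Model a CFG change that gives LOOP a fresh header block whose flags
   were not copied from the old one.  */
void
sketch_replace_header (struct loop_sketch *loop, struct bb_sketch *fresh)
{
  fresh->flags = 0;
  loop->header = fresh;
}
```

In this sketch, replacing the header silently drops the block flag, while
the bound kept in the loop record survives the same change.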
> Thanks,
>
> Changpeng
>
>
>
> ________________________________________
> From: Jack Howarth [howarth@bromo.med.uc.edu]
> Sent: Monday, January 03, 2011 9:04 PM
> To: Fang, Changpeng
> Cc: Zdenek Dvorak; Richard Guenther; Xinliang David Li; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
>
> On Fri, Dec 17, 2010 at 01:14:49AM -0600, Fang, Changpeng wrote:
>> Hi, Jack:
>>
>> Thanks for the testing.
>>
>> This patch is not supposed to slow down a program by 10% (rnflow and test_fpu).
>> It would be helpful if you can provide analysis why they are slowed down.
>
> Changpeng,
> The corrected merge against gcc trunk of...
>
> Index: gcc/basic-block.h
> ===================================================================
> --- gcc/basic-block.h (revision 168437)
> +++ gcc/basic-block.h (working copy)
> @@ -247,11 +247,14 @@
> Only used in cfgcleanup.c. */
> BB_NONTHREADABLE_BLOCK = 1 << 11,
>
> + /* Set on blocks that are headers of non-rolling loops. */
> + BB_HEADER_OF_NONROLLING_LOOP = 1 << 12,
> +
> /* Set on blocks that were modified in some way. This bit is set in
> df_set_bb_dirty, but not cleared by df_analyze, so it can be used
> to test whether a block has been modified prior to a df_analyze
> call. */
> - BB_MODIFIED = 1 << 12
> + BB_MODIFIED = 1 << 13
> };
>
> /* Dummy flag for convenience in the hot/cold partitioning code. */
>
> for the proposed patch from http://gcc.gnu.org/ml/gcc-patches/2010-12/msg01344.html
> eliminated the performance regressions on x86_64-apple-darwin10. I now get...
>
> Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n
>
> Execution Time
> -m32
> stock patched %increase
> ac 10.59 10.59 0.0
> aermod 19.49 19.13 -1.8
> air 6.07 6.07 0.0
> capacita 44.60 44.61 0.0
> channel 1.98 1.98 0.0
> doduc 31.19 31.31 0.4
> fatigue 9.90 10.29 3.9
> gas_dyn 4.72 4.71 -0.2
> induct 13.93 13.93 0.0
> linpk 15.50 15.49 -0.1
> mdbx 11.28 11.26 -0.2
> nf 27.62 27.58 -0.1
> protein 38.70 38.60 -0.3
> rnflow 24.68 24.68 0.0
> test_fpu 10.13 10.13 0.0
> tfft 1.92 1.92 0.0
>
> Geometric Mean Execution Time 12.06 12.08 0.2
>
> -m64
> stock patched %increase
> ac 8.80 8.80 0.0
> aermod 17.34 17.17 -1.0
> air 5.48 5.52 0.7
> capacita 32.38 32.50 0.4
> channel 1.84 1.84 0.0
> doduc 26.50 26.52 0.1
> fatigue 8.35 8.33 -0.2
> gas_dyn 4.30 4.29 -0.2
> induct 12.83 12.83 0.0
> linpk 15.49 15.49 0.0
> mdbx 11.23 11.22 -0.1
> nf 30.21 30.16 -0.2
> protein 34.13 32.07 -6.0
> rnflow 23.18 23.19 0.0
> test_fpu 8.04 8.02 -0.2
> tfft 1.87 1.86 -0.5
>
> Geometric Mean Execution Time 10.87 10.82 -0.5
>
* Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
2011-01-04 22:40 ` Fang, Changpeng
2011-01-05 12:07 ` Richard Guenther
@ 2011-01-05 14:12 ` Jack Howarth
1 sibling, 0 replies; 36+ messages in thread
From: Jack Howarth @ 2011-01-05 14:12 UTC (permalink / raw)
To: Fang, Changpeng
Cc: Zdenek Dvorak, Richard Guenther, Xinliang David Li, gcc-patches
On Tue, Jan 04, 2011 at 04:30:38PM -0600, Fang, Changpeng wrote:
> Hi,
>
> Thanks, Jack. Hopefully the 6% protein gain (for -m64) is not just noise.
>
> I updated the patch based on the current trunk (REV 168477).
>
> Is it OK to commit now?
I can't approve it but the benchmarks now seem reasonable with
the patch (although you might want to investigate why fatigue
runs 3.9% slower at -m32).
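For reference, the %increase column in the tables quoted below appears to be
the relative change in run time, so a positive value means the patched
compiler produced slower code. A small sketch of that arithmetic, with a
hypothetical helper name:

```c
/* Relative change in run time, in percent.  Positive means PATCHED is
   slower than STOCK; negative means it is faster.  */
double
percent_increase (double stock, double patched)
{
  return (patched - stock) / stock * 100.0;
}
```

With the quoted figures, fatigue at -m32 (9.90 to 10.29 seconds) comes out
just under 4%, and protein at -m64 (34.13 to 32.07 seconds) at about -6%,
consistent with the rounded table entries.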
>
> Thanks,
>
> Changpeng
>
>
>
> ________________________________________
> From: Jack Howarth [howarth@bromo.med.uc.edu]
> Sent: Monday, January 03, 2011 9:04 PM
> To: Fang, Changpeng
> Cc: Zdenek Dvorak; Richard Guenther; Xinliang David Li; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
>
> On Fri, Dec 17, 2010 at 01:14:49AM -0600, Fang, Changpeng wrote:
> > Hi, Jack:
> >
> > Thanks for the testing.
> >
> > This patch is not supposed to slow down a program by 10% (rnflow and test_fpu).
> > It would be helpful if you can provide analysis why they are slowed down.
>
> Changpeng,
> The corrected merge against gcc trunk of...
>
> Index: gcc/basic-block.h
> ===================================================================
> --- gcc/basic-block.h (revision 168437)
> +++ gcc/basic-block.h (working copy)
> @@ -247,11 +247,14 @@
> Only used in cfgcleanup.c. */
> BB_NONTHREADABLE_BLOCK = 1 << 11,
>
> + /* Set on blocks that are headers of non-rolling loops. */
> + BB_HEADER_OF_NONROLLING_LOOP = 1 << 12,
> +
> /* Set on blocks that were modified in some way. This bit is set in
> df_set_bb_dirty, but not cleared by df_analyze, so it can be used
> to test whether a block has been modified prior to a df_analyze
> call. */
> - BB_MODIFIED = 1 << 12
> + BB_MODIFIED = 1 << 13
> };
>
> /* Dummy flag for convenience in the hot/cold partitioning code. */
>
> for the proposed patch from http://gcc.gnu.org/ml/gcc-patches/2010-12/msg01344.html
> eliminated the performance regressions on x86_64-apple-darwin10. I now get...
>
> Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n
>
> Execution Time
> -m32
> stock patched %increase
> ac 10.59 10.59 0.0
> aermod 19.49 19.13 -1.8
> air 6.07 6.07 0.0
> capacita 44.60 44.61 0.0
> channel 1.98 1.98 0.0
> doduc 31.19 31.31 0.4
> fatigue 9.90 10.29 3.9
> gas_dyn 4.72 4.71 -0.2
> induct 13.93 13.93 0.0
> linpk 15.50 15.49 -0.1
> mdbx 11.28 11.26 -0.2
> nf 27.62 27.58 -0.1
> protein 38.70 38.60 -0.3
> rnflow 24.68 24.68 0.0
> test_fpu 10.13 10.13 0.0
> tfft 1.92 1.92 0.0
>
> Geometric Mean Execution Time 12.06 12.08 0.2
>
> -m64
> stock patched %increase
> ac 8.80 8.80 0.0
> aermod 17.34 17.17 -1.0
> air 5.48 5.52 0.7
> capacita 32.38 32.50 0.4
> channel 1.84 1.84 0.0
> doduc 26.50 26.52 0.1
> fatigue 8.35 8.33 -0.2
> gas_dyn 4.30 4.29 -0.2
> induct 12.83 12.83 0.0
> linpk 15.49 15.49 0.0
> mdbx 11.23 11.22 -0.1
> nf 30.21 30.16 -0.2
> protein 34.13 32.07 -6.0
> rnflow 23.18 23.19 0.0
> test_fpu 8.04 8.02 -0.2
> tfft 1.87 1.86 -0.5
>
> Geometric Mean Execution Time 10.87 10.82 -0.5
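The %increase column in the tables above appears to be the change in execution time relative to the stock compiler, so a negative value means the patched build ran faster. A minimal sketch of that arithmetic follows; the function name `pct_increase` is hypothetical, and the formula is an assumption that happens to match the reported rows (for example protein at -m64, 34.13 to 32.07, and fatigue at -m32, 9.90 to 10.29).

```c
/* Percent change of PATCHED relative to STOCK.  A negative result means
   the patched compiler produced a faster binary.  Hypothetical helper,
   not part of the benchmark harness.  */
double
pct_increase (double stock, double patched)
{
  return (patched - stock) / stock * 100.0;
}
```

Rounded to one decimal place, the two rows named above come out at about -6.0 and 3.9, which agrees with the tables.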
Content-Description: 0001-Consider-a-loop-not-hot-if-it-rolls-only-a-few-times.patch
> From 7d10fcfe379e3fe4387cf192fbbdbea7daff0637 Mon Sep 17 00:00:00 2001
> From: Changpeng Fang <chfang@houghton.(none)>
> Date: Tue, 4 Jan 2011 14:36:50 -0800
> Subject: [PATCH] Consider a loop not hot if it rolls only a few times (non-rolling)
>
> * basic-block.h (bb_flags): Add a new flag BB_HEADER_OF_NONROLLING_LOOP.
> * cfg.c (clear_bb_flags): Keep BB_HEADER_OF_NONROLLING_LOOP marker.
> * cfgloop.h (mark_non_rolling_loop): New function declaration.
> (non_rolling_loop_p): New function declaration.
> * predict.c (optimize_loop_for_size_p): Return true if the loop was marked
> NON-ROLLING. (optimize_loop_for_speed_p): Return false if the loop was
> marked NON-ROLLING.
> * tree-ssa-loop-manip.c (tree_transform_and_unroll_loop): Mark the
> non-rolling loop.
> * tree-ssa-loop-niter.c (mark_non_rolling_loop): Implement the new
> function. (non_rolling_loop_p): Implement the new function.
> * tree-vect-loop-manip.c (vect_do_peeling_for_loop_bound): Mark the
> non-rolling loop. (vect_do_peeling_for_alignment): Mark the non-rolling
> loop.
> ---
> gcc/basic-block.h | 5 ++++-
> gcc/cfg.c | 7 ++++---
> gcc/cfgloop.h | 2 ++
> gcc/predict.c | 6 ++++++
> gcc/tree-ssa-loop-manip.c | 3 +++
> gcc/tree-ssa-loop-niter.c | 20 ++++++++++++++++++++
> gcc/tree-vect-loop-manip.c | 8 ++++++++
> 7 files changed, 47 insertions(+), 4 deletions(-)
>
> diff --git a/gcc/basic-block.h b/gcc/basic-block.h
> index 3594eea..081c175 100644
> --- a/gcc/basic-block.h
> +++ b/gcc/basic-block.h
> @@ -251,7 +251,10 @@ enum bb_flags
> df_set_bb_dirty, but not cleared by df_analyze, so it can be used
> to test whether a block has been modified prior to a df_analyze
> call. */
> - BB_MODIFIED = 1 << 12
> + BB_MODIFIED = 1 << 12,
> +
> + /* Set on blocks that are headers of non-rolling loops. */
> + BB_HEADER_OF_NONROLLING_LOOP = 1 << 13
> };
>
> /* Dummy flag for convenience in the hot/cold partitioning code. */
> diff --git a/gcc/cfg.c b/gcc/cfg.c
> index c8ef799..e59a637 100644
> --- a/gcc/cfg.c
> +++ b/gcc/cfg.c
> @@ -425,8 +425,8 @@ redirect_edge_pred (edge e, basic_block new_pred)
> connect_src (e);
> }
>
> -/* Clear all basic block flags, with the exception of partitioning and
> - setjmp_target. */
> +/* Clear all basic block flags, with the exception of partitioning,
> + setjmp_target, and the non-rolling loop marker. */
> void
> clear_bb_flags (void)
> {
> @@ -434,7 +434,8 @@ clear_bb_flags (void)
>
> FOR_BB_BETWEEN (bb, ENTRY_BLOCK_PTR, NULL, next_bb)
> bb->flags = (BB_PARTITION (bb)
> - | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET)));
> + | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET
> + + BB_HEADER_OF_NONROLLING_LOOP)));
> }
> \f
> /* Check the consistency of profile information. We can't do that
> diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
> index f7bb134..0f48115 100644
> --- a/gcc/cfgloop.h
> +++ b/gcc/cfgloop.h
> @@ -279,6 +279,8 @@ extern rtx doloop_condition_get (rtx);
> void estimate_numbers_of_iterations_loop (struct loop *, bool);
> HOST_WIDE_INT estimated_loop_iterations_int (struct loop *, bool);
> bool estimated_loop_iterations (struct loop *, bool, double_int *);
> +void mark_non_rolling_loop (struct loop *);
> +bool non_rolling_loop_p (struct loop *);
>
> /* Loop manipulation. */
> extern bool can_duplicate_loop_p (const struct loop *loop);
> diff --git a/gcc/predict.c b/gcc/predict.c
> index a86708a..34d7dff 100644
> --- a/gcc/predict.c
> +++ b/gcc/predict.c
> @@ -280,6 +280,9 @@ optimize_insn_for_speed_p (void)
> bool
> optimize_loop_for_size_p (struct loop *loop)
> {
> + /* Loops marked NON-ROLLING are not likely to be hot. */
> + if (non_rolling_loop_p (loop))
> + return true;
> return optimize_bb_for_size_p (loop->header);
> }
>
> @@ -288,6 +291,9 @@ optimize_loop_for_size_p (struct loop *loop)
> bool
> optimize_loop_for_speed_p (struct loop *loop)
> {
> + /* Loops marked NON-ROLLING are not likely to be hot. */
> + if (non_rolling_loop_p (loop))
> + return false;
> return optimize_bb_for_speed_p (loop->header);
> }
>
> diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
> index 8176ed8..3ff0790 100644
> --- a/gcc/tree-ssa-loop-manip.c
> +++ b/gcc/tree-ssa-loop-manip.c
> @@ -931,6 +931,9 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
> gcc_assert (new_loop != NULL);
> update_ssa (TODO_update_ssa);
>
> + /* NEW_LOOP is a non-rolling loop. */
> + mark_non_rolling_loop (new_loop);
> +
> /* Determine the probability of the exit edge of the unrolled loop. */
> new_est_niter = est_niter / factor;
>
> diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
> index ee85f6f..ec108c2 100644
> --- a/gcc/tree-ssa-loop-niter.c
> +++ b/gcc/tree-ssa-loop-niter.c
> @@ -3011,6 +3011,26 @@ estimate_numbers_of_iterations (bool use_undefined_p)
> fold_undefer_and_ignore_overflow_warnings ();
> }
>
> +/* Mark LOOP as a non-rolling loop. */
> +
> +void
> +mark_non_rolling_loop (struct loop *loop)
> +{
> + gcc_assert (loop && loop->header);
> + loop->header->flags |= BB_HEADER_OF_NONROLLING_LOOP;
> +}
> +
> +/* Return true if LOOP is a non-rolling loop. */
> +
> +bool
> +non_rolling_loop_p (struct loop *loop)
> +{
> + int masked_flags;
> + gcc_assert (loop && loop->header);
> + masked_flags = (loop->header->flags & BB_HEADER_OF_NONROLLING_LOOP);
> + return (masked_flags != 0);
> +}
> +
> /* Returns true if statement S1 dominates statement S2. */
>
> bool
> diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> index 28b75f1..9bbe68b 100644
> --- a/gcc/tree-vect-loop-manip.c
> +++ b/gcc/tree-vect-loop-manip.c
> @@ -1938,6 +1938,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio,
> cond_expr, cond_expr_stmt_list);
> gcc_assert (new_loop);
> gcc_assert (loop_num == loop->num);
> +
> + /* NEW_LOOP is a non-rolling loop. */
> + mark_non_rolling_loop (new_loop);
> +
> #ifdef ENABLE_CHECKING
> slpeel_verify_cfg_after_peeling (loop, new_loop);
> #endif
> @@ -2191,6 +2195,10 @@ vect_do_peeling_for_alignment (loop_vec_info loop_vinfo)
> th, true, NULL_TREE, NULL);
>
> gcc_assert (new_loop);
> +
> + /* NEW_LOOP is a non-rolling loop. */
> + mark_non_rolling_loop (new_loop);
> +
> #ifdef ENABLE_CHECKING
> slpeel_verify_cfg_after_peeling (new_loop, loop);
> #endif
> --
> 1.6.3.3
>
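The mechanism in the patch above can be tried outside of GCC with a small stand-alone sketch. Everything prefixed with `sk_` or `SK_` is a hypothetical stand-in, not GCC code: `struct sk_block` and `struct sk_loop` keep only the fields needed here, `SK_BB_RTL` represents any flag that clear_bb_flags already preserves, and the `header_is_hot` parameter replaces the real call to optimize_bb_for_speed_p.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified flag set.  The last two values mirror the
   patch; SK_BB_RTL stands in for a flag that survives clearing.  */
enum sk_bb_flags
{
  SK_BB_RTL = 1 << 0,
  SK_BB_MODIFIED = 1 << 12,
  SK_BB_HEADER_OF_NONROLLING_LOOP = 1 << 13
};

/* Minimal stand-ins for basic_block and struct loop.  */
struct sk_block { int flags; };
struct sk_loop { struct sk_block *header; };

/* Mark LOOP as non-rolling by setting a bit on its header block.  */
void
sk_mark_non_rolling_loop (struct sk_loop *loop)
{
  assert (loop && loop->header);
  loop->header->flags |= SK_BB_HEADER_OF_NONROLLING_LOOP;
}

/* Return true if LOOP was marked non-rolling.  */
bool
sk_non_rolling_loop_p (const struct sk_loop *loop)
{
  assert (loop && loop->header);
  return (loop->header->flags & SK_BB_HEADER_OF_NONROLLING_LOOP) != 0;
}

/* Clear every flag on BB except the ones in the preserved mask, which
   now includes the non-rolling marker.  */
void
sk_clear_bb_flags (struct sk_block *bb)
{
  bb->flags &= (SK_BB_RTL | SK_BB_HEADER_OF_NONROLLING_LOOP);
}

/* A non-rolling loop is never optimized for speed; otherwise defer to
   the hotness of the header (passed in here as an assumption).  */
bool
sk_optimize_loop_for_speed_p (const struct sk_loop *loop, bool header_is_hot)
{
  if (sk_non_rolling_loop_p (loop))
    return false;
  return header_is_hot;
}
```

Because the mark lives in the header block's flags rather than in the loop structure, it is lost whenever the flags are reset. That is presumably why the patch also touches clear_bb_flags in cfg.c: without the extra term in the preserved mask, a marked loop would look like an ordinary one again after the next clearing.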
Thread overview: 36+ messages
2010-12-13 21:57 [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops Fang, Changpeng
2010-12-14 0:56 ` Sebastian Pop
2010-12-14 8:25 ` Zdenek Dvorak
2010-12-14 20:02 ` Fang, Changpeng
2010-12-14 21:55 ` Zdenek Dvorak
2010-12-15 6:16 ` Richard Guenther
2010-12-15 8:34 ` Fang, Changpeng
2010-12-15 9:22 ` Xinliang David Li
2010-12-15 10:00 ` Zdenek Dvorak
2010-12-15 16:46 ` Fang, Changpeng
2010-12-15 16:47 ` Zdenek Dvorak
2010-12-15 17:08 ` Xinliang David Li
2010-12-16 12:09 ` Richard Guenther
2010-12-16 12:41 ` Zdenek Dvorak
2010-12-16 18:26 ` Fang, Changpeng
2010-12-16 20:06 ` Zdenek Dvorak
2010-12-17 3:53 ` Fang, Changpeng
2010-12-17 6:36 ` Jack Howarth
2010-12-17 9:55 ` Fang, Changpeng
2010-12-17 16:13 ` Jack Howarth
2010-12-17 16:48 ` Fang, Changpeng
2010-12-17 17:20 ` Jack Howarth
2010-12-17 18:01 ` Jack Howarth
2010-12-17 18:31 ` Fang, Changpeng
2011-01-04 3:33 ` Jack Howarth
2011-01-04 22:40 ` Fang, Changpeng
2011-01-05 12:07 ` Richard Guenther
2011-01-05 14:12 ` Jack Howarth
2010-12-17 21:45 ` Jack Howarth
2010-12-17 22:35 ` Fang, Changpeng
2010-12-17 22:54 ` [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loopsj Jack Howarth
2010-12-18 11:35 ` [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops Jack Howarth
2010-12-19 0:29 ` Richard Guenther
2010-12-21 22:45 ` Fang, Changpeng
2010-12-14 16:13 ` Jack Howarth
2010-12-14 17:33 ` Fang, Changpeng