* [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check
@ 2017-06-12 17:03 Bin Cheng
2017-06-20 9:22 ` Bin.Cheng
0 siblings, 1 reply; 16+ messages in thread
From: Bin Cheng @ 2017-06-12 17:03 UTC (permalink / raw)
To: gcc-patches; +Cc: nd
[-- Attachment #1: Type: text/plain, Size: 4118 bytes --]
Hi,
This is the main patch rewriting loop distribution in order to handle hmmer.
It improves loop distribution by versioning loop under runtime alias check conditions.
As described in comments, the patch basically implements distribution in the following
steps:
1) Seed partitions with specific type statements. For now we support
two types seed statements: statement defining variable used outside
of loop; statement storing to memory.
2) Build reduced dependence graph (RDG) for loop to be distributed.
The vertices (RDG:V) model all statements in the loop and the edges
(RDG:E) model flow and control dependencies between statements.
3) Apart from RDG, compute data dependencies between memory references.
4) Starting from seed statement, build up partition by adding depended
statements according to RDG's dependence information. Partition is
classified as parallel type if it can be executed paralleled; or as
sequential type if it can't. Parallel type partition is further
classified as different builtin kinds if it can be implemented as
builtin function calls.
5) Build partition dependence graph (PG) based on data dependencies.
The vertices (PG:V) model all partitions and the edges (PG:E) model
all data dependencies between every partitions pair. In general,
data dependence is either compilation time known or unknown. In C
family languages, there exists quite amount compilation time unknown
dependencies because of possible alias relation of data references.
We categorize PG's edge to two types: "true" edge that represents
compilation time known data dependencies; "alias" edge for all other
data dependencies.
6) Traverse subgraph of PG as if all "alias" edges don't exist. Merge
partitions in each strong connected component (SCC) correspondingly.
Build new PG for merged partitions.
7) Traverse PG again and this time with both "true" and "alias" edges
included. We try to break SCCs by removing some edges. Because
SCCs by "true" edges are all fused in step 6), we can break SCCs
by removing some "alias" edges. It's NP-hard to choose optimal
edge set, fortunately simple approximation is good enough for us
given the small problem scale.
8) Collect all data dependencies of the removed "alias" edges. Create
runtime alias checks for collected data dependencies.
9) Version loop under the condition of runtime alias checks. Given
loop distribution generally introduces additional overhead, it is
only useful if vectorization is achieved in distributed loop. We
version loop with internal function call IFN_LOOP_DIST_ALIAS. If
no distributed loop can be vectorized, we simply remove distributed
loops and recover to the original one.
Also, there are some more to improve in the future (which isn't difficult I think):
TODO:
1) We only distribute innermost loops now. This pass should handle loop
nests in the future.
2) We only fuse partitions in SCC now. A better fusion algorithm is
desired to minimize loop overhead, maximize parallelism and maximize
Bootstrap and test on x86_64 and AArch64. Is it OK?
Thanks,
bin
2017-06-07 Bin Cheng <bin.cheng@arm.com>
* tree-loop-distribution.c: Add general explanantion on the pass.
(pg_add_dependence_edges): New parameter. handle alias data
dependence specially and record it in the parameter if asked.
(struct pg_vdata, pg_edata, pg_edge_callback_data): New structs.
(init_partition_graph_vertices, add_partition_graph_edge): New.
(pg_skip_alias_edge, free_partition_graph_edata_cb): New.
(free_partition_graph_vdata, build_partition_graph): New.
(sort_partitions_by_post_order, merge_dep_scc_partitions): New.
(pg_collect_alias_ddrs, break_alias_scc_partitions): New.
(data_ref_segment_size, latch_dominated_by_data_ref): New.
(compute_alias_check_pairs, version_loop_by_alias_check): New.
(version_for_distribution_p, finalize_partitions): New.
(distribute_loop): Handle alias data dependence specially. Factor
out loop fustion code as functions.
[-- Attachment #2: 0013-loop-distribution-20170607.txt --]
[-- Type: text/plain, Size: 36066 bytes --]
From ab801ff60a3a92d7ee90b422b89d9a96a003b7ba Mon Sep 17 00:00:00 2001
From: Bin Cheng <binche01@e108451-lin.cambridge.arm.com>
Date: Fri, 9 Jun 2017 14:24:52 +0100
Subject: [PATCH 13/13] loop-distribution-20170607.txt
---
gcc/testsuite/gcc.dg/tree-ssa/ldist-12.c | 3 +-
gcc/testsuite/gcc.dg/tree-ssa/ldist-13.c | 5 +-
gcc/testsuite/gcc.dg/tree-ssa/ldist-14.c | 5 +-
gcc/testsuite/gcc.dg/tree-ssa/ldist-4.c | 4 +-
gcc/tree-loop-distribution.c | 843 +++++++++++++++++++++++++++----
5 files changed, 743 insertions(+), 117 deletions(-)
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-12.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-12.c
index 53551ca..625dd92 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-12.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-12.c
@@ -18,4 +18,5 @@ int foo (int * __restrict__ ia,
return oya[22] + oyb[21];
}
-/* { dg-final { scan-tree-dump-times "distributed: split to 2 loops" 1 "ldist" } } */
+/* Distributing the loop doesn't expose more parallelism. */
+/* { dg-final { scan-tree-dump-times "distributed: split to 2 loops" 0 "ldist" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-13.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-13.c
index ba39d4d..8c9fd56 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-13.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-13.c
@@ -16,6 +16,5 @@ float foo (int n)
return tmp;
}
-/* We should apply loop distribution. */
-
-/* { dg-final { scan-tree-dump "Loop 1 distributed: split to 2 loops" "ldist" } } */
+/* Distributing the loop doesn't expose more parallelism. */
+/* { dg-final { scan-tree-dump-not "Loop 1 distributed: split to 2 loops" "ldist" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-14.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-14.c
index 48c1040..fa4d1a8 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-14.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-14.c
@@ -21,6 +21,5 @@ float foo (int n)
return tmp;
}
-/* We should apply loop distribution. */
-
-/* { dg-final { scan-tree-dump "Loop 1 distributed: split to 2 loops" "ldist" } } */
+/* Distributing the loop doesn't expose more parallelism. */
+/* { dg-final { scan-tree-dump-not "Loop 1 distributed: split to 2 loops" "ldist" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-4.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-4.c
index c36daf071..4def9b4 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-4.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-4.c
@@ -20,7 +20,5 @@ int loop1 (int k)
return b[100-1][1];
}
-/* The current cost model fuses the two partitions because they have
- similar memory accesses. */
-/* { dg-final { scan-tree-dump "similar memory accesses" "ldist" } } */
+/* Distributing inner loop doesn't expose more parallelism. */
/* { dg-final { scan-tree-dump-times "distributed: split to 2 loops" 0 "ldist" } } */
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 167155e..b28a499 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -36,10 +36,58 @@ along with GCC; see the file COPYING3. If not see
| D(I) = A(I-1)*E
|ENDDO
- This pass uses an RDG, Reduced Dependence Graph built on top of the
- data dependence relations. The RDG is then topologically sorted to
- obtain a map of information producers/consumers based on which it
- generates the new loops. */
+ Loop distribution is the dual of loop fusion. It separates statements
+ of a loop (or loop nest) into multiple loops (or loop nests) with the
+ same loop header. The major goal is to separate statements which may
+ be vectorized from those that can't. This pass implements distribution
+ in the following steps:
+
+ 1) Seed partitions with specific type statements. For now we support
+ two types seed statements: statement defining variable used outside
+ of loop; statement storing to memory.
+ 2) Build reduced dependence graph (RDG) for loop to be distributed.
+ The vertices (RDG:V) model all statements in the loop and the edges
+ (RDG:E) model flow and control dependences between statements.
+ 3) Apart from RDG, compute data dependences between memory references.
+ 4) Starting from seed statement, build up partition by adding depended
+ statements according to RDG's dependence information. Partition is
+ classified as parallel type if it can be executed parallelly; or as
+ sequential type if it can't. Parallel type partition is further
+ classified as different builtin kinds if it can be implemented as
+ builtin function calls.
+ 5) Build partition dependence graph (PG) based on data dependences.
+ The vertices (PG:V) model all partitions and the edges (PG:E) model
+ all data dependences between every partitions pair. In general,
+ data dependence is either compilation time known or unknown. In C
+ family languages, there exists quite amount compilation time unknown
+ dependences because of possible alias relation of data references.
+ We categorize PG's edge to two types: "true" edge that represents
+ compilation time known data dependences; "alias" edge for all other
+ data dependences.
+ 6) Traverse subgraph of PG as if all "alias" edges don't exist. Merge
+ partitions in each strong connected commponent (SCC) correspondingly.
+ Build new PG for merged partitions.
+ 7) Traverse PG again and this time with both "true" and "alias" edges
+ included. We try to break SCCs by removing some edges. Because
+ SCCs by "true" edges are all fused in step 6), we can break SCCs
+ by removing some "alias" edges. It's NP-hard to choose optimal
+ edge set, fortunately simple approximation is good enough for us
+ given the small problem scale.
+ 8) Collect all data dependences of the removed "alias" edges. Create
+ runtime alias checks for collected data dependences.
+ 9) Version loop under the condition of runtime alias checks. Given
+ loop distribution generally introduces additional overhead, it is
+ only useful if vectorization is achieved in distributed loop. We
+ version loop with internal function call IFN_LOOP_DIST_ALIAS. If
+ no distributed loop can be vectorized, we simply remove distributed
+ loops and recover to the original one.
+
+ TODO:
+ 1) We only distribute innermost loops now. This pass should handle loop
+ nests in the future.
+ 2) We only fuse partitions in SCC now. A better fusion algorithm is
+ desired to minimize loop overhead, maximize parallelism and maximize
+ data reuse. */
#include "config.h"
#include "system.h"
@@ -1676,11 +1724,14 @@ partition_contains_all_rw (struct graph *rdg,
}
/* Compute partition dependence created by the data references in DRS1
- and DRS2 and modify and return DIR according to that. */
+ and DRS2, modify and return DIR according to that. IF ALIAS_DDR is
+ not NULL, we record dependence introduced by possible alias between
+ two data references in ALIAS_DDRS; otherwise, we simply ignore such
+ dependence as if it doesn't exist at all. */
static int
pg_add_dependence_edges (struct graph *rdg, int dir,
- bitmap drs1, bitmap drs2)
+ bitmap drs1, bitmap drs2, vec<ddr_p> *alias_ddrs)
{
unsigned i, j;
bitmap_iterator bi, bj;
@@ -1694,11 +1745,12 @@ pg_add_dependence_edges (struct graph *rdg, int dir,
EXECUTE_IF_SET_IN_BITMAP (drs2, 0, j, bj)
{
+ int res, this_dir = 1;
+ ddr_p ddr;
+
dr2 = (*datarefs_vec)[j];
saved_dr1 = dr1;
- int this_dir = 1;
- ddr_p ddr;
- /* Re-shuffle data-refs to be in dominator order. */
+ /* Re-shuffle data-refs to be in topological order. */
if (rdg_vertex_for_stmt (rdg, DR_STMT (dr1))
> rdg_vertex_for_stmt (rdg, DR_STMT (dr2)))
{
@@ -1707,14 +1759,33 @@ pg_add_dependence_edges (struct graph *rdg, int dir,
}
ddr = get_data_dependence (rdg, dr1, dr2);
if (DDR_ARE_DEPENDENT (ddr) == chrec_dont_know)
- this_dir = 2;
+ {
+ this_dir = 0;
+ res = data_ref_compare_tree (DR_BASE_ADDRESS (dr1),
+ DR_BASE_ADDRESS (dr2));
+ /* Be conservative. If data references are not well analyzed,
+ or the two data references have the same base address and
+ offset, add dependence and consider it alias to each other.
+ In other words, the dependence can not be resolved by
+ runtime alias check. */
+ if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
+ || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
+ || !DR_INIT (dr1) || !DR_INIT (dr2)
+ || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
+ || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
+ || res == 0)
+ this_dir = 2;
+ /* Data dependence could be resolved by runtime alias check,
+ record it in ALIAS_DDRS. */
+ else if (alias_ddrs != NULL)
+ alias_ddrs->safe_push (ddr);
+ /* Or simply ignore it. */
+ }
else if (DDR_ARE_DEPENDENT (ddr) == NULL_TREE)
{
if (DDR_REVERSED_P (ddr))
- {
- std::swap (dr1, dr2);
- this_dir = -this_dir;
- }
+ this_dir = -this_dir;
+
/* Known dependences can still be unordered througout the
iteration space, see gcc.dg/tree-ssa/ldist-16.c. */
if (DDR_NUM_DIST_VECTS (ddr) != 1)
@@ -1722,12 +1793,10 @@ pg_add_dependence_edges (struct graph *rdg, int dir,
/* If the overlap is exact preserve stmt order. */
else if (lambda_vector_zerop (DDR_DIST_VECT (ddr, 0), 1))
;
+ /* Else as the distance vector is lexicographic positive swap
+ the dependence direction. */
else
- {
- /* Else as the distance vector is lexicographic positive
- swap the dependence direction. */
- this_dir = -this_dir;
- }
+ this_dir = -this_dir;
}
else
this_dir = 0;
@@ -1754,11 +1823,635 @@ pgcmp (const void *v1_, const void *v2_)
return v2->post - v1->post;
}
-/* Distributes the code from LOOP in such a way that producer
- statements are placed before consumer statements. Tries to separate
- only the statements from STMTS into separate loops.
- Returns the number of distributed loops. Set *DESTROY_P to whether
- LOOP needs to be destroyed. */
+/* Data attached to vertices of partition dependence graph. */
+struct pg_vdata
+{
+ /* ID of the corresponding partition. */
+ int id;
+ /* The partition. */
+ struct partition *partition;
+};
+
+/* Data attached to edges of partition dependence graph. */
+struct pg_edata
+{
+ /* If the dependence edge can be resolved by runtime alias check,
+ this vector contains data dependence relations for runtime alias
+ check. On the other hand, if the dependence edge is introduced
+ because of compilation time known data dependence, this vector
+ contains nothing. */
+ vec<ddr_p> alias_ddrs;
+};
+
+/* Callback data for traversing edges in graph. */
+struct pg_edge_callback_data
+{
+ /* Bitmap contains strong connected components should be merged. */
+ bitmap sccs_to_merge;
+ /* Array constains component information for all vertices. */
+ int *vertices_component;
+ /* Vector to record all data dependence relations which are needed
+ to break strong connected components by runtime alias checks. */
+ vec<ddr_p> *alias_ddrs;
+};
+
+/* Initialize vertice's data for partition dependence graph PG with
+ PARTITIONS. */
+
+static void
+init_partition_graph_vertices (struct graph *pg,
+ vec<struct partition *> *partitions)
+{
+ int i;
+ partition *partition;
+ struct pg_vdata *data;
+
+ for (i = 0; partitions->iterate (i, &partition); ++i)
+ {
+ data = new pg_vdata;
+ pg->vertices[i].data = data;
+ data->id = i;
+ data->partition = partition;
+ }
+}
+
+/* Add edge <I, J> to partition dependence graph PG. Attach vector of data
+ dependence relations to the EDGE if DDRS isn't NULL. */
+
+static void
+add_partition_graph_edge (struct graph *pg, int i, int j, vec<ddr_p> *ddrs)
+{
+ struct graph_edge *e = add_edge (pg, i, j);
+
+ /* If the edge is attached with data dependence relations, it means this
+ dependence edge can be resolved by runtime alias checks. */
+ if (ddrs != NULL)
+ {
+ struct pg_edata *data = new pg_edata;
+
+ gcc_assert (ddrs->length () > 0);
+ e->data = data;
+ data->alias_ddrs = vNULL;
+ data->alias_ddrs.safe_splice (*ddrs);
+ }
+}
+
+/* Callback function for graph travesal algorithm. It returns true
+ if edge E should skipped when traversing the graph. */
+
+static bool
+pg_skip_alias_edge (struct graph_edge *e)
+{
+ struct pg_edata *data = (struct pg_edata *)e->data;
+ return (data != NULL && data->alias_ddrs.length () > 0);
+}
+
+/* Callback function freeing data attached to edge E of graph. */
+
+static void
+free_partition_graph_edata_cb (struct graph *, struct graph_edge *e, void *)
+{
+ if (e->data != NULL)
+ {
+ struct pg_edata *data = (struct pg_edata *)e->data;
+ data->alias_ddrs.release ();
+ delete data;
+ }
+}
+
+/* Free data attached to vertice of partition dependence graph PG. */
+
+static void
+free_partition_graph_vdata (struct graph *pg)
+{
+ int i;
+ struct pg_vdata *data;
+
+ for (i = 0; i < pg->n_vertices; ++i)
+ {
+ data = (struct pg_vdata *)pg->vertices[i].data;
+ delete data;
+ }
+}
+
+/* Build and return partition dependence graph for PARTITIONS. RDG is
+ reduced dependence graph for the loop to be distributed. If IGNORE_ALIAS_P
+ is true, data dependence caused by possible alias between references
+ is ignored, as if it doesn't exist at all; otherwise all depdendences
+ are considered. */
+
+static struct graph *
+build_partition_graph (struct graph *rdg,
+ vec<struct partition *> *partitions,
+ bool ignore_alias_p)
+{
+ int i, j;
+ struct partition *partition1, *partition2;
+ graph *pg = new_graph (partitions->length ());
+ auto_vec<ddr_p> alias_ddrs, *alias_ddrs_p;
+
+ alias_ddrs_p = ignore_alias_p ? NULL : &alias_ddrs;
+
+ init_partition_graph_vertices (pg, partitions);
+
+ for (i = 0; partitions->iterate (i, &partition1); ++i)
+ {
+ for (j = i + 1; partitions->iterate (j, &partition2); ++j)
+ {
+ /* dependence direction - 0 is no dependence, -1 is back,
+ 1 is forth, 2 is both (we can stop then, merging will occur). */
+ int dir = 0;
+
+ /* If the first partition has reduction, add back edge; if the
+ second partition has reduction, add forth edge. This makes
+ sure that reduction partition will be sorted as the last one. */
+ if (partition_reduction_p (partition1))
+ dir = -1;
+ else if (partition_reduction_p (partition2))
+ dir = 1;
+
+ /* Cleanup the temporary vector. */
+ alias_ddrs.truncate (0);
+
+ dir = pg_add_dependence_edges (rdg, dir, partition1->writes,
+ partition2->reads, alias_ddrs_p);
+ if (dir != 2)
+ dir = pg_add_dependence_edges (rdg, dir, partition1->reads,
+ partition2->writes, alias_ddrs_p);
+ if (dir != 2)
+ dir = pg_add_dependence_edges (rdg, dir, partition1->writes,
+ partition2->writes, alias_ddrs_p);
+
+ /* Add edge to partition graph if there exists dependence. There
+ are two types of edges. One type edge is caused by compilation
+ time known dependence, this type can not be resolved by runtime
+ alias check. The other type can be resolved by runtime alias
+ check. */
+ if (dir == 1 || dir == 2
+ || alias_ddrs.length () > 0)
+ {
+ /* Attach data dependence relations to edge that can be resolved
+ by runtime alias check. */
+ bool alias_edge_p = (dir != 1 && dir != 2);
+ add_partition_graph_edge (pg, i, j,
+ (alias_edge_p) ? &alias_ddrs : NULL);
+ }
+ if (dir == -1 || dir == 2
+ || alias_ddrs.length () > 0)
+ {
+ /* Attach data dependence relations to edge that can be resolved
+ by runtime alias check. */
+ bool alias_edge_p = (dir != -1 && dir != 2);
+ add_partition_graph_edge (pg, j, i,
+ (alias_edge_p) ? &alias_ddrs : NULL);
+ }
+ }
+ }
+ return pg;
+}
+
+/* Sort partitions in PG by post order and store them in PARTITIONS. */
+
+static void
+sort_partitions_by_post_order (struct graph *pg,
+ vec<struct partition *> *partitions)
+{
+ int i;
+ struct pg_vdata *data;
+
+ /* Now order the remaining nodes in postorder. */
+ qsort (pg->vertices, pg->n_vertices, sizeof (vertex), pgcmp);
+ partitions->truncate (0);
+ for (i = 0; i < pg->n_vertices; ++i)
+ {
+ data = (struct pg_vdata *)pg->vertices[i].data;
+ if (data->partition)
+ partitions->safe_push (data->partition);
+ }
+}
+
+/* Given reduced dependence graph RDG merge strong connected components
+ of PARTITIONS. If IGNORE_ALIAS_P is true, data dependence caused by
+ possible alias between references is ignored, as if it doesn't exist
+ at all; otherwise all depdendences are considered. */
+
+static void
+merge_dep_scc_partitions (struct graph *rdg,
+ vec<struct partition *> *partitions,
+ bool ignore_alias_p)
+{
+ struct partition *partition1, *partition2;
+ struct pg_vdata *data;
+ graph *pg = build_partition_graph (rdg, partitions, ignore_alias_p);
+ int i, j, num_sccs = graphds_scc (pg, NULL);
+
+ /* Strong connected compoenent means dependence cycle, we cannot distribute
+ them. So fuse them together. */
+ if ((unsigned) num_sccs < partitions->length ())
+ {
+ for (i = 0; i < num_sccs; ++i)
+ {
+ for (j = 0; partitions->iterate (j, &partition1); ++j)
+ if (pg->vertices[j].component == i)
+ break;
+ for (j = j + 1; partitions->iterate (j, &partition2); ++j)
+ if (pg->vertices[j].component == i)
+ {
+ partition_merge_into (partition1, partition2, FUSE_SAME_SCC);
+ (*partitions)[j] = NULL;
+ partition_free (partition2);
+ data = (struct pg_vdata *)pg->vertices[j].data;
+ data->partition = NULL;
+ }
+ }
+ sort_partitions_by_post_order (pg, partitions);
+ }
+ gcc_assert (partitions->length () == (unsigned)num_sccs);
+ free_partition_graph_vdata (pg);
+ free_graph (pg);
+}
+
+/* Callback function for traversing edge E in graph G. DATA is private
+ callback data. */
+
+static void
+pg_collect_alias_ddrs (struct graph *g, struct graph_edge *e, void *data)
+{
+ int i, j, component;
+ struct pg_edge_callback_data *cbdata;
+ struct pg_edata *edata = (struct pg_edata *) e->data;
+
+ /* If the edge doesn't have attached data dependence, it represents
+ compilation time known dependences. This type dependence cannot
+ be resolved by runtime alias check. */
+ if (edata == NULL || edata->alias_ddrs.length () == 0)
+ return;
+
+ cbdata = (struct pg_edge_callback_data *) data;
+ i = e->src;
+ j = e->dest;
+ component = cbdata->vertices_component[i];
+ /* Vertices are topologically sorted according to compilation time
+ known dependences, so we can break strong connected components
+ by removing edges of the opposite direction, i.e, edges pointing
+ from vertice with smaller post number to vertice with bigger post
+ number. */
+ if (g->vertices[i].post < g->vertices[j].post
+ /* We only need to remove edges connecting vertices in the same
+ strong connected component to break it. */
+ && component == cbdata->vertices_component[j]
+ /* Check if we want to break the strong connected component or not. */
+ && !bitmap_bit_p (cbdata->sccs_to_merge, component))
+ cbdata->alias_ddrs->safe_splice (edata->alias_ddrs);
+}
+
+/* This is the main function breaking strong conected components in
+ PARTITIONS giving reduced depdendence graph RDG. Store data dependence
+ relations for runtime alias check in ALIAS_DDRS. */
+
+static void
+break_alias_scc_partitions (struct graph *rdg,
+ vec<struct partition *> *partitions,
+ vec<ddr_p> *alias_ddrs)
+{
+ int i, j, num_sccs, num_sccs_no_alias;
+ /* Build partition dependence graph. */
+ graph *pg = build_partition_graph (rdg, partitions, false);
+
+ alias_ddrs->truncate (0);
+ /* Find strong connected components in the graph, with all dependence edges
+ considered. */
+ num_sccs = graphds_scc (pg, NULL);
+ /* All SCCs now can be broken by runtime alias checks because SCCs caused by
+ compilation time known dependences are merged before this function. */
+ if ((unsigned) num_sccs < partitions->length ())
+ {
+ struct pg_edge_callback_data cbdata;
+ auto_bitmap sccs_to_merge;
+ auto_vec<enum partition_type> scc_types;
+ struct partition *partition, *first;
+
+ /* If all paritions in a SCC has the same type, we can simply merge the
+ SCC. This loop finds out such SCCS and record them in bitmap. */
+ bitmap_set_range (sccs_to_merge, 0, (unsigned) num_sccs);
+ for (i = 0; i < num_sccs; ++i)
+ {
+ for (j = 0; partitions->iterate (j, &first); ++j)
+ if (pg->vertices[j].component == i)
+ break;
+ for (++j; partitions->iterate (j, &partition); ++j)
+ {
+ if (pg->vertices[j].component != i)
+ continue;
+
+ if (first->type != partition->type)
+ {
+ bitmap_clear_bit (sccs_to_merge, i);
+ break;
+ }
+ }
+ }
+
+ /* Initialize callback data for traversing. */
+ cbdata.sccs_to_merge = sccs_to_merge;
+ cbdata.alias_ddrs = alias_ddrs;
+ cbdata.vertices_component = XNEWVEC (int, pg->n_vertices);
+ /* Record the component information which will be corrupted by next
+ graph scc finding call. */
+ for (i = 0; i < pg->n_vertices; ++i)
+ cbdata.vertices_component[i] = pg->vertices[i].component;
+
+ /* Collect data dependences for runtime alias checks to break SCCs. */
+ if (bitmap_count_bits (sccs_to_merge) != (unsigned) num_sccs)
+ {
+ /* Run SCC finding algorithm again, with alias dependence edges
+ skipped. This is to topologically sort paritions according to
+ compilation time known dependence. Note the topological order
+ is stored in the form of pg's post order number. */
+ num_sccs_no_alias = graphds_scc (pg, NULL, pg_skip_alias_edge);
+ gcc_assert (partitions->length () == (unsigned) num_sccs_no_alias);
+ /* With topological order, we can construct two subgraphs L and R.
+ L contains edge <x, y> where x < y in terms of post order, while
+ R contains edge <x, y> where x > y. Edges for compilation time
+ known dependence all fall in R, so we break SCCs by removing all
+ (alias) edges of in subgraph L. */
+ for_each_edge (pg, pg_collect_alias_ddrs, &cbdata);
+ }
+
+ /* For SCC that doesn't need to be broken, merge it. */
+ for (i = 0; i < num_sccs; ++i)
+ {
+ if (!bitmap_bit_p (sccs_to_merge, i))
+ continue;
+
+ for (j = 0; partitions->iterate (j, &first); ++j)
+ if (cbdata.vertices_component[j] == i)
+ break;
+ for (++j; partitions->iterate (j, &partition); ++j)
+ {
+ struct pg_vdata *data;
+
+ if (cbdata.vertices_component[j] != i)
+ continue;
+
+ partition_merge_into (first, partition, FUSE_SAME_SCC);
+ (*partitions)[j] = NULL;
+ partition_free (partition);
+ data = (struct pg_vdata *)pg->vertices[j].data;
+ gcc_assert (data->id == j);
+ data->partition = NULL;
+ }
+ }
+ }
+
+ sort_partitions_by_post_order (pg, partitions);
+ free_partition_graph_vdata (pg);
+ for_each_edge (pg, free_partition_graph_edata_cb, NULL);
+ free_graph (pg);
+
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ {
+ fprintf (dump_file, "Possible alias data dependence to break:\n");
+ dump_data_dependence_relations (dump_file, *alias_ddrs);
+ }
+}
+
+/* Compute and return an expression whose value is the segment length which
+ will be accessed by DR in NITERS iterations. */
+
+static tree
+data_ref_segment_size (struct data_reference *dr, tree niters)
+{
+ tree segment_length;
+
+ if (integer_zerop (DR_STEP (dr)))
+ segment_length = TYPE_SIZE_UNIT (TREE_TYPE (DR_REF (dr)));
+ else
+ segment_length = size_binop (MULT_EXPR,
+ fold_convert (sizetype, DR_STEP (dr)),
+ fold_convert (sizetype, niters));
+
+ return segment_length;
+}
+
+/* Return true if LOOP's latch is dominated by statement for data reference
+ DR. */
+
+static inline bool
+latch_dominated_by_data_ref (struct loop *loop, data_reference *dr)
+{
+ return dominated_by_p (CDI_DOMINATORS, single_exit (loop)->src,
+ gimple_bb (DR_STMT (dr)));
+}
+
+/* Compute alias check pairs and store them in COMP_ALIAS_PAIRS for LOOP's
+ data dependence relations ALIAS_DDRS. */
+
+static void
+compute_alias_check_pairs (struct loop *loop, vec<ddr_p> *alias_ddrs,
+ vec<dr_with_seg_len_pair_t> *comp_alias_pairs)
+{
+ unsigned int i;
+ unsigned HOST_WIDE_INT factor = 1;
+ tree niters_plus_one, niters = number_of_latch_executions (loop);
+
+ gcc_assert (niters != NULL_TREE && niters != chrec_dont_know);
+ niters = fold_convert (sizetype, niters);
+ niters_plus_one = size_binop (PLUS_EXPR, niters, size_one_node);
+
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file, "Creating alias check pairs:\n");
+
+ /* Iterate all data dependence relations and compute alias check pairs. */
+ for (i = 0; i < alias_ddrs->length (); i++)
+ {
+ ddr_p ddr = (*alias_ddrs)[i];
+ struct data_reference *dr_a = DDR_A (ddr);
+ struct data_reference *dr_b = DDR_B (ddr);
+ tree seg_length_a, seg_length_b;
+ int comp_res = data_ref_compare_tree (DR_BASE_ADDRESS (dr_a),
+ DR_BASE_ADDRESS (dr_b));
+
+ if (comp_res == 0)
+ comp_res = data_ref_compare_tree (DR_OFFSET (dr_a), DR_OFFSET (dr_b));
+ gcc_assert (comp_res != 0);
+
+ if (latch_dominated_by_data_ref (loop, dr_a))
+ seg_length_a = data_ref_segment_size (dr_a, niters_plus_one);
+ else
+ seg_length_a = data_ref_segment_size (dr_a, niters);
+
+ if (latch_dominated_by_data_ref (loop, dr_b))
+ seg_length_b = data_ref_segment_size (dr_b, niters_plus_one);
+ else
+ seg_length_b = data_ref_segment_size (dr_b, niters);
+
+ dr_with_seg_len_pair_t dr_with_seg_len_pair
+ (dr_with_seg_len (dr_a, seg_length_a),
+ dr_with_seg_len (dr_b, seg_length_b));
+
+ /* Canonicalize pairs by sorting the two DR members. */
+ if (comp_res > 0)
+ std::swap (dr_with_seg_len_pair.first, dr_with_seg_len_pair.second);
+
+ comp_alias_pairs->safe_push (dr_with_seg_len_pair);
+ }
+
+ if (tree_fits_uhwi_p (niters))
+ factor = tree_to_uhwi (niters);
+
+ /* Prune alias check pairs. */
+ prune_runtime_alias_test_list (comp_alias_pairs, factor);
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file,
+ "Improved number of alias checks from %d to %d\n",
+ alias_ddrs->length (), comp_alias_pairs->length ());
+}
+
+/* Given data dependence relations in ALIAS_DDRS, generate runtime alias
+ checks and version LOOP under condition of these runtime alias checks. */
+
+static void
+version_loop_by_alias_check (struct loop *loop, vec<ddr_p> *alias_ddrs)
+{
+ unsigned then_prob, else_prob, then_scale, else_scale;
+ basic_block cond_bb;
+ struct loop *nloop;
+ tree lhs, arg0, cond_expr = NULL_TREE;
+ gimple_seq cond_stmts = NULL;
+ gimple *stmt;
+ auto_vec<dr_with_seg_len_pair_t> comp_alias_pairs;
+
+ /* Generate code for runtime alias checks if necessary. */
+ if (alias_ddrs->length () > 0)
+ {
+ compute_alias_check_pairs (loop, alias_ddrs, &comp_alias_pairs);
+ create_runtime_alias_checks (loop, &comp_alias_pairs, &cond_expr);
+ cond_expr = force_gimple_operand_1 (cond_expr, &cond_stmts,
+ is_gimple_condexpr, NULL_TREE);
+ then_prob = 9 * REG_BR_PROB_BASE / 10;
+ else_prob = REG_BR_PROB_BASE - then_prob;
+ then_scale = 9 * REG_BR_PROB_BASE / 10;
+ else_scale = REG_BR_PROB_BASE - then_scale;
+ }
+ else
+ {
+ cond_expr = boolean_true_node;
+ /* Use REG_BR_PROB_BASE for both branches since only one branch
+ will be kept in the end. */
+ then_prob = REG_BR_PROB_BASE;
+ else_prob = REG_BR_PROB_BASE;
+ then_scale = REG_BR_PROB_BASE;
+ else_scale = REG_BR_PROB_BASE;
+ }
+
+ /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS. */
+ if (flag_tree_loop_vectorize)
+ {
+ /* Generate internal function call for loop distribution alias check. */
+ arg0 = build_int_cst (integer_type_node, loop->num);
+ stmt = gimple_build_call_internal (IFN_LOOP_DIST_ALIAS,
+ 2, arg0, cond_expr);
+ lhs = make_ssa_name (boolean_type_node);
+ gimple_call_set_lhs (stmt, lhs);
+ gimple_seq_add_stmt (&cond_stmts, stmt);
+ }
+ else
+ lhs = cond_expr;
+
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file,
+ "Version loop <%d> with runtime alias check\n", loop->num);
+
+ initialize_original_copy_tables ();
+ nloop = loop_version (loop, lhs, &cond_bb, then_prob,
+ else_prob, then_scale, else_scale, true);
+ free_original_copy_tables ();
+ nloop->aux = (void *)nloop;
+ /* Record the original loop number in newly generated loops. */
+ nloop->ldist_alias_id = loop->num;
+ nloop->dont_vectorize = true;
+ nloop->force_vectorize = false;
+
+ if (cond_stmts)
+ {
+ gimple_stmt_iterator cond_gsi = gsi_last_bb (cond_bb);
+ gsi_insert_seq_before (&cond_gsi, cond_stmts, GSI_SAME_STMT);
+ }
+ update_ssa (TODO_update_ssa);
+}
+
+/* Return true if loop versioning is needed to distrubute PARTITIONS.
+ ALIAS_DDRS are data dependence relations for runtime alias check. */
+
+static inline bool
+version_for_distribution_p (vec<struct partition *> *partitions,
+ vec<ddr_p> *alias_ddrs)
+{
+ unsigned i;
+ struct partition *partition;
+
+ /* No need to version loop if we have only one partition. */
+ if (partitions->length () == 1)
+ return false;
+
+ /* Need to version loop if runtime alias check is necessary. */
+ if (alias_ddrs->length () > 0)
+ return true;
+
+ /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
+ vectorizer is not enable because no other pass can fold it. */
+ if (!flag_tree_loop_vectorize)
+ return false;
+
+ /* Don't version loop if any partition is builtin. */
+ for (i = 0; partitions->iterate (i, &partition); ++i)
+ {
+ if (partition->kind != PKIND_NORMAL)
+ break;
+ }
+ return (i == partitions->length ());
+}
+
+/* Fuse all partitions if necessary before finalizing distribution. */
+
+static void
+finalize_partitions (vec<struct partition *> *partitions,
+ vec<ddr_p> *alias_ddrs)
+{
+ unsigned i;
+ struct partition *a, *partition;
+
+ if (partitions->length () == 1
+ || alias_ddrs->length () > 0)
+ return;
+
+ a = (*partitions)[0];
+ if (a->kind != PKIND_NORMAL)
+ return;
+
+ for (i = 1; partitions->iterate (i, &partition); ++i)
+ {
+ /* Don't fuse if partition has different type or it is a builtin. */
+ if (partition->type != a->type
+ || partition->kind != PKIND_NORMAL)
+ return;
+ }
+
+ /* Fuse all partitions. */
+ for (i = 1; partitions->iterate (i, &partition); ++i)
+ {
+ partition_merge_into (a, partition, FUSE_FINALIZE);
+ partition_free (partition);
+ }
+ partitions->truncate (1);
+}
+
+/* Distributes the code from LOOP in such a way that producer statements
+ are placed before consumer statements. Tries to separate only the
+ statements from STMTS into separate loops. Returns the number of
+ distributed loops. Set NB_CALLS to number of generated builtin calls.
+ Set *DESTROY_P to whether LOOP needs to be destroyed. */
static int
distribute_loop (struct loop *loop, vec<gimple *> stmts,
@@ -1768,8 +2461,6 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
partition *partition;
bool any_builtin;
int i, nbp;
- graph *pg = NULL;
- int num_sccs = 1;
*destroy_p = false;
*nb_calls = 0;
@@ -1826,6 +2517,11 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
ddrs_vec = new vec<ddr_p> ();
ddrs_table = new hash_table<ddr_entry_hasher> (389);
+ /* Can't do runtime alias check if loop niter is unknown. */
+ tree niters = number_of_latch_executions (loop);
+ bool rt_alias_check_p = (niters != NULL_TREE && niters != chrec_dont_know);
+ auto_vec<ddr_p> alias_ddrs;
+
auto_bitmap stmt_in_all_partitions;
auto_vec<struct partition *, 3> partitions;
bitmap_set_range (stmt_in_all_partitions, 0, rdg->n_vertices);
@@ -1912,88 +2608,14 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
/* Build the partition dependency graph. */
if (partitions.length () > 1)
{
- pg = new_graph (partitions.length ());
- struct pgdata {
- struct partition *partition;
- };
-#define PGDATA(i) ((pgdata *)(pg->vertices[i].data))
- for (i = 0; partitions.iterate (i, &partition); ++i)
- {
- vertex *v = &pg->vertices[i];
- pgdata *data = new pgdata;
- /* FIXME - leaks. */
- v->data = data;
- data->partition = partition;
- }
- struct partition *partition1, *partition2;
- for (i = 0; partitions.iterate (i, &partition1); ++i)
- for (int j = i + 1; partitions.iterate (j, &partition2); ++j)
- {
- /* dependence direction - 0 is no dependence, -1 is back,
- 1 is forth, 2 is both (we can stop then, merging will occur). */
- int dir = 0;
- dir = pg_add_dependence_edges (rdg, dir,
- partition1->writes,
- partition2->reads);
- if (dir != 2)
- dir = pg_add_dependence_edges (rdg, dir,
- partition1->reads,
- partition2->writes);
- if (dir != 2)
- dir = pg_add_dependence_edges (rdg, dir,
- partition1->writes,
- partition2->writes);
- if (dir == 1 || dir == 2)
- add_edge (pg, i, j);
- if (dir == -1 || dir == 2)
- add_edge (pg, j, i);
- }
-
- /* Add edges to the reduction partition (if any) to force it last. */
- unsigned j;
- for (j = 0; partitions.iterate (j, &partition); ++j)
- if (partition_reduction_p (partition))
- break;
- if (j < partitions.length ())
- {
- for (unsigned i = 0; partitions.iterate (i, &partition); ++i)
- if (i != j)
- add_edge (pg, i, j);
- }
-
- /* Compute partitions we cannot separate and fuse them. */
- num_sccs = graphds_scc (pg, NULL);
- for (i = 0; i < num_sccs; ++i)
- {
- struct partition *first;
- int j;
- for (j = 0; partitions.iterate (j, &first); ++j)
- if (pg->vertices[j].component == i)
- break;
- for (j = j + 1; partitions.iterate (j, &partition); ++j)
- if (pg->vertices[j].component == i)
- {
- partition_merge_into (first, partition, FUSE_SAME_SCC);
- partitions[j] = NULL;
- partition_free (partition);
- PGDATA (j)->partition = NULL;
- }
- }
-
- /* Now order the remaining nodes in postorder. */
- qsort (pg->vertices, pg->n_vertices, sizeof (vertex), pgcmp);
- partitions.truncate (0);
- for (i = 0; i < pg->n_vertices; ++i)
- {
- pgdata *data = PGDATA (i);
- if (data->partition)
- partitions.safe_push (data->partition);
- delete data;
- }
- gcc_assert (partitions.length () == (unsigned)num_sccs);
- free_graph (pg);
+ merge_dep_scc_partitions (rdg, &partitions, rt_alias_check_p);
+ alias_ddrs.truncate (0);
+ if (rt_alias_check_p && partitions.length () > 1)
+ break_alias_scc_partitions (rdg, &partitions, &alias_ddrs);
}
+ finalize_partitions (&partitions, &alias_ddrs);
+
nbp = partitions.length ();
if (nbp == 0
|| (nbp == 1 && !partition_builtin_p (partitions[0]))
@@ -2003,8 +2625,15 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
goto ldist_done;
}
+ if (version_for_distribution_p (&partitions, &alias_ddrs))
+ version_loop_by_alias_check (loop, &alias_ddrs);
+
if (dump_file && (dump_flags & TDF_DETAILS))
- dump_rdg_partitions (dump_file, partitions);
+ {
+ fprintf (dump_file,
+ "distribute loop <%d> into partitions:\n", loop->num);
+ dump_rdg_partitions (dump_file, partitions);
+ }
FOR_EACH_VEC_ELT (partitions, i, partition)
{
--
1.9.1
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check
2017-06-12 17:03 [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check Bin Cheng
@ 2017-06-20 9:22 ` Bin.Cheng
2017-06-23 10:30 ` Bin.Cheng
0 siblings, 1 reply; 16+ messages in thread
From: Bin.Cheng @ 2017-06-20 9:22 UTC (permalink / raw)
To: gcc-patches, Richard Guenther
[-- Attachment #1: Type: text/plain, Size: 4847 bytes --]
On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
> Hi,
> This is the main patch rewriting loop distribution in order to handle hmmer.
> It improves loop distribution by versioning loop under runtime alias check conditions.
> As described in comments, the patch basically implements distribution in the following
> steps:
>
> 1) Seed partitions with specific type statements. For now we support
> two types seed statements: statement defining variable used outside
> of loop; statement storing to memory.
> 2) Build reduced dependence graph (RDG) for loop to be distributed.
> The vertices (RDG:V) model all statements in the loop and the edges
> (RDG:E) model flow and control dependencies between statements.
> 3) Apart from RDG, compute data dependencies between memory references.
> 4) Starting from seed statement, build up partition by adding depended
> statements according to RDG's dependence information. Partition is
> classified as parallel type if it can be executed paralleled; or as
> sequential type if it can't. Parallel type partition is further
> classified as different builtin kinds if it can be implemented as
> builtin function calls.
> 5) Build partition dependence graph (PG) based on data dependencies.
> The vertices (PG:V) model all partitions and the edges (PG:E) model
> all data dependencies between every partitions pair. In general,
> data dependence is either compilation time known or unknown. In C
> family languages, there exists quite amount compilation time unknown
> dependencies because of possible alias relation of data references.
> We categorize PG's edge to two types: "true" edge that represents
> compilation time known data dependencies; "alias" edge for all other
> data dependencies.
> 6) Traverse subgraph of PG as if all "alias" edges don't exist. Merge
> partitions in each strong connected component (SCC) correspondingly.
> Build new PG for merged partitions.
> 7) Traverse PG again and this time with both "true" and "alias" edges
> included. We try to break SCCs by removing some edges. Because
> SCCs by "true" edges are all fused in step 6), we can break SCCs
> by removing some "alias" edges. It's NP-hard to choose optimal
> edge set, fortunately simple approximation is good enough for us
> given the small problem scale.
> 8) Collect all data dependencies of the removed "alias" edges. Create
> runtime alias checks for collected data dependencies.
> 9) Version loop under the condition of runtime alias checks. Given
> loop distribution generally introduces additional overhead, it is
> only useful if vectorization is achieved in distributed loop. We
> version loop with internal function call IFN_LOOP_DIST_ALIAS. If
> no distributed loop can be vectorized, we simply remove distributed
> loops and recover to the original one.
>
> Also, there are some more to improve in the future (which isn't difficult I think):
> TODO:
> 1) We only distribute innermost loops now. This pass should handle loop
> nests in the future.
> 2) We only fuse partitions in SCC now. A better fusion algorithm is
> desired to minimize loop overhead, maximize parallelism and maximize
>
> Bootstrap and test on x86_64 and AArch64. Is it OK?
>
Trivial updated due to changes in previous patches. Also fixed issues
mentioned by Kugan.
Thanks,
bin
2017-06-17 Bin Cheng <bin.cheng@arm.com>
* tree-loop-distribution.c: Add general explanantion on the pass.
(pg_add_dependence_edges): New parameter. handle alias data
dependence specially and record it in the parameter if asked.
(struct pg_vdata, pg_edata, pg_edge_callback_data): New structs.
(init_partition_graph_vertices, add_partition_graph_edge): New.
(pg_skip_alias_edge, free_partition_graph_edata_cb): New.
(free_partition_graph_vdata, build_partition_graph): New.
(sort_partitions_by_post_order, merge_dep_scc_partitions): New.
(pg_collect_alias_ddrs, break_alias_scc_partitions): New.
(data_ref_segment_size, latch_dominated_by_data_ref): New.
(compute_alias_check_pairs, version_loop_by_alias_check): New.
(version_for_distribution_p, finalize_partitions): New.
(distribute_loop): Handle alias data dependence specially. Factor
out loop fusion code as functions and call these functions.
gcc/testsuite/ChangeLog
2017-06-17 Bin Cheng <bin.cheng@arm.com>
* gcc.dg/tree-ssa/ldist-4.c: Adjust test string.
* gcc.dg/tree-ssa/ldist-12.c: Ditto.
* gcc.dg/tree-ssa/ldist-13.c: Ditto.
* gcc.dg/tree-ssa/ldist-14.c: Ditto.
[-- Attachment #2: 0012-loop-distribution-20170608.txt.patch --]
[-- Type: text/x-patch, Size: 35681 bytes --]
From 9a03d3b423df71fd59f357485fbcebc034b7dd4b Mon Sep 17 00:00:00 2001
From: Bin Cheng <binche01@e108451-lin.cambridge.arm.com>
Date: Fri, 9 Jun 2017 14:24:52 +0100
Subject: [PATCH 12/13] loop-distribution-20170608.txt
---
gcc/testsuite/gcc.dg/tree-ssa/ldist-12.c | 3 +-
gcc/testsuite/gcc.dg/tree-ssa/ldist-13.c | 5 +-
gcc/testsuite/gcc.dg/tree-ssa/ldist-14.c | 5 +-
gcc/testsuite/gcc.dg/tree-ssa/ldist-4.c | 4 +-
gcc/tree-loop-distribution.c | 827 +++++++++++++++++++++++++++----
5 files changed, 736 insertions(+), 108 deletions(-)
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-12.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-12.c
index 53551ca..625dd92 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-12.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-12.c
@@ -18,4 +18,5 @@ int foo (int * __restrict__ ia,
return oya[22] + oyb[21];
}
-/* { dg-final { scan-tree-dump-times "distributed: split to 2 loops" 1 "ldist" } } */
+/* Distributing the loop doesn't expose more parallelism. */
+/* { dg-final { scan-tree-dump-times "distributed: split to 2 loops" 0 "ldist" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-13.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-13.c
index ba39d4d..8c9fd56 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-13.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-13.c
@@ -16,6 +16,5 @@ float foo (int n)
return tmp;
}
-/* We should apply loop distribution. */
-
-/* { dg-final { scan-tree-dump "Loop 1 distributed: split to 2 loops" "ldist" } } */
+/* Distributing the loop doesn't expose more parallelism. */
+/* { dg-final { scan-tree-dump-not "Loop 1 distributed: split to 2 loops" "ldist" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-14.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-14.c
index 48c1040..fa4d1a8 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-14.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-14.c
@@ -21,6 +21,5 @@ float foo (int n)
return tmp;
}
-/* We should apply loop distribution. */
-
-/* { dg-final { scan-tree-dump "Loop 1 distributed: split to 2 loops" "ldist" } } */
+/* Distributing the loop doesn't expose more parallelism. */
+/* { dg-final { scan-tree-dump-not "Loop 1 distributed: split to 2 loops" "ldist" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-4.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-4.c
index c36daf071..4def9b4 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-4.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-4.c
@@ -20,7 +20,5 @@ int loop1 (int k)
return b[100-1][1];
}
-/* The current cost model fuses the two partitions because they have
- similar memory accesses. */
-/* { dg-final { scan-tree-dump "similar memory accesses" "ldist" } } */
+/* Distributing inner loop doesn't expose more parallelism. */
/* { dg-final { scan-tree-dump-times "distributed: split to 2 loops" 0 "ldist" } } */
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 1b4e238..29a8c20 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -36,10 +36,58 @@ along with GCC; see the file COPYING3. If not see
| D(I) = A(I-1)*E
|ENDDO
- This pass uses an RDG, Reduced Dependence Graph built on top of the
- data dependence relations. The RDG is then topologically sorted to
- obtain a map of information producers/consumers based on which it
- generates the new loops. */
+ Loop distribution is the dual of loop fusion. It separates statements
+ of a loop (or loop nest) into multiple loops (or loop nests) with the
+ same loop header. The major goal is to separate statements which may
+ be vectorized from those that can't. This pass implements distribution
+ in the following steps:
+
+ 1) Seed partitions with specific type statements. For now we support
+ two types seed statements: statement defining variable used outside
+ of loop; statement storing to memory.
+ 2) Build reduced dependence graph (RDG) for loop to be distributed.
+ The vertices (RDG:V) model all statements in the loop and the edges
+ (RDG:E) model flow and control dependencies between statements.
+ 3) Apart from RDG, compute data dependencies between memory references.
+ 4) Starting from seed statement, build up partition by adding depended
+ statements according to RDG's dependence information. Partition is
+ classified as parallel type if it can be executed paralleled; or as
+ sequential type if it can't. Parallel type partition is further
+ classified as different builtin kinds if it can be implemented as
+ builtin function calls.
+ 5) Build partition dependence graph (PG) based on data dependencies.
+ The vertices (PG:V) model all partitions and the edges (PG:E) model
+ all data dependencies between every partitions pair. In general,
+ data dependence is either compilation time known or unknown. In C
+ family languages, there exists quite amount compilation time unknown
+ dependencies because of possible alias relation of data references.
+ We categorize PG's edge to two types: "true" edge that represents
+ compilation time known data dependencies; "alias" edge for all other
+ data dependencies.
+ 6) Traverse subgraph of PG as if all "alias" edges don't exist. Merge
+ partitions in each strong connected component (SCC) correspondingly.
+ Build new PG for merged partitions.
+ 7) Traverse PG again and this time with both "true" and "alias" edges
+ included. We try to break SCCs by removing some edges. Because
+ SCCs by "true" edges are all fused in step 6), we can break SCCs
+ by removing some "alias" edges. It's NP-hard to choose optimal
+ edge set, fortunately simple approximation is good enough for us
+ given the small problem scale.
+ 8) Collect all data dependencies of the removed "alias" edges. Create
+ runtime alias checks for collected data dependencies.
+ 9) Version loop under the condition of runtime alias checks. Given
+ loop distribution generally introduces additional overhead, it is
+ only useful if vectorization is achieved in distributed loop. We
+ version loop with internal function call IFN_LOOP_DIST_ALIAS. If
+ no distributed loop can be vectorized, we simply remove distributed
+ loops and recover to the original one.
+
+ TODO:
+ 1) We only distribute innermost loops now. This pass should handle loop
+ nests in the future.
+ 2) We only fuse partitions in SCC now. A better fusion algorithm is
+ desired to minimize loop overhead, maximize parallelism and maximize
+ data reuse. */
#include "config.h"
#include "system.h"
@@ -1575,11 +1623,14 @@ partition_contains_all_rw (struct graph *rdg,
}
/* Compute partition dependence created by the data references in DRS1
- and DRS2 and modify and return DIR according to that. */
+ and DRS2, modify and return DIR according to that. IF ALIAS_DDR is
+ not NULL, we record dependence introduced by possible alias between
+ two data references in ALIAS_DDRS; otherwise, we simply ignore such
+ dependence as if it doesn't exist at all. */
static int
pg_add_dependence_edges (struct graph *rdg, int dir,
- bitmap drs1, bitmap drs2)
+ bitmap drs1, bitmap drs2, vec<ddr_p> *alias_ddrs)
{
unsigned i, j;
bitmap_iterator bi, bj;
@@ -1593,6 +1644,9 @@ pg_add_dependence_edges (struct graph *rdg, int dir,
EXECUTE_IF_SET_IN_BITMAP (drs2, 0, j, bj)
{
+ int res, this_dir = 1;
+ ddr_p ddr;
+
dr2 = datarefs_vec[j];
/* Skip all <read, read> data dependence. */
@@ -1600,9 +1654,7 @@ pg_add_dependence_edges (struct graph *rdg, int dir,
continue;
saved_dr1 = dr1;
- int this_dir = 1;
- ddr_p ddr;
- /* Re-shuffle data-refs to be in dominator order. */
+ /* Re-shuffle data-refs to be in topological order. */
if (rdg_vertex_for_stmt (rdg, DR_STMT (dr1))
> rdg_vertex_for_stmt (rdg, DR_STMT (dr2)))
{
@@ -1611,14 +1663,33 @@ pg_add_dependence_edges (struct graph *rdg, int dir,
}
ddr = get_data_dependence (rdg, dr1, dr2);
if (DDR_ARE_DEPENDENT (ddr) == chrec_dont_know)
- this_dir = 2;
+ {
+ this_dir = 0;
+ res = data_ref_compare_tree (DR_BASE_ADDRESS (dr1),
+ DR_BASE_ADDRESS (dr2));
+ /* Be conservative. If data references are not well analyzed,
+ or the two data references have the same base address and
+ offset, add dependence and consider it alias to each other.
+ In other words, the dependence can not be resolved by
+ runtime alias check. */
+ if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
+ || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
+ || !DR_INIT (dr1) || !DR_INIT (dr2)
+ || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
+ || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
+ || res == 0)
+ this_dir = 2;
+ /* Data dependence could be resolved by runtime alias check,
+ record it in ALIAS_DDRS. */
+ else if (alias_ddrs != NULL)
+ alias_ddrs->safe_push (ddr);
+ /* Or simply ignore it. */
+ }
else if (DDR_ARE_DEPENDENT (ddr) == NULL_TREE)
{
if (DDR_REVERSED_P (ddr))
- {
- std::swap (dr1, dr2);
- this_dir = -this_dir;
- }
+ this_dir = -this_dir;
+
/* Known dependences can still be unordered througout the
iteration space, see gcc.dg/tree-ssa/ldist-16.c. */
if (DDR_NUM_DIST_VECTS (ddr) != 1)
@@ -1626,12 +1697,10 @@ pg_add_dependence_edges (struct graph *rdg, int dir,
/* If the overlap is exact preserve stmt order. */
else if (lambda_vector_zerop (DDR_DIST_VECT (ddr, 0), 1))
;
+ /* Else as the distance vector is lexicographic positive swap
+ the dependence direction. */
else
- {
- /* Else as the distance vector is lexicographic positive
- swap the dependence direction. */
- this_dir = -this_dir;
- }
+ this_dir = -this_dir;
}
else
this_dir = 0;
@@ -1658,11 +1727,628 @@ pgcmp (const void *v1_, const void *v2_)
return v2->post - v1->post;
}
-/* Distributes the code from LOOP in such a way that producer
- statements are placed before consumer statements. Tries to separate
- only the statements from STMTS into separate loops.
- Returns the number of distributed loops. Set *DESTROY_P to whether
- LOOP needs to be destroyed. */
+/* Data attached to vertices of partition dependence graph. */
+struct pg_vdata
+{
+ /* ID of the corresponding partition. */
+ int id;
+ /* The partition. */
+ struct partition *partition;
+};
+
+/* Data attached to edges of partition dependence graph. */
+struct pg_edata
+{
+ /* If the dependence edge can be resolved by runtime alias check,
+ this vector contains data dependence relations for runtime alias
+ check. On the other hand, if the dependence edge is introduced
+ because of compilation time known data dependence, this vector
+ contains nothing. */
+ vec<ddr_p> alias_ddrs;
+};
+
+/* Callback data for traversing edges in graph. */
+struct pg_edge_callback_data
+{
+ /* Bitmap contains strong connected components should be merged. */
+ bitmap sccs_to_merge;
+ /* Array constains component information for all vertices. */
+ int *vertices_component;
+ /* Vector to record all data dependence relations which are needed
+ to break strong connected components by runtime alias checks. */
+ vec<ddr_p> *alias_ddrs;
+};
+
+/* Initialize vertice's data for partition dependence graph PG with
+ PARTITIONS. */
+
+static void
+init_partition_graph_vertices (struct graph *pg,
+ vec<struct partition *> *partitions)
+{
+ int i;
+ partition *partition;
+ struct pg_vdata *data;
+
+ for (i = 0; partitions->iterate (i, &partition); ++i)
+ {
+ data = new pg_vdata;
+ pg->vertices[i].data = data;
+ data->id = i;
+ data->partition = partition;
+ }
+}
+
+/* Add edge <I, J> to partition dependence graph PG. Attach vector of data
+ dependence relations to the EDGE if DDRS isn't NULL. */
+
+static void
+add_partition_graph_edge (struct graph *pg, int i, int j, vec<ddr_p> *ddrs)
+{
+ struct graph_edge *e = add_edge (pg, i, j);
+
+ /* If the edge is attached with data dependence relations, it means this
+ dependence edge can be resolved by runtime alias checks. */
+ if (ddrs != NULL)
+ {
+ struct pg_edata *data = new pg_edata;
+
+ gcc_assert (ddrs->length () > 0);
+ e->data = data;
+ data->alias_ddrs = vNULL;
+ data->alias_ddrs.safe_splice (*ddrs);
+ }
+}
+
+/* Callback function for graph travesal algorithm. It returns true
+ if edge E should skipped when traversing the graph. */
+
+static bool
+pg_skip_alias_edge (struct graph_edge *e)
+{
+ struct pg_edata *data = (struct pg_edata *)e->data;
+ return (data != NULL && data->alias_ddrs.length () > 0);
+}
+
+/* Callback function freeing data attached to edge E of graph. */
+
+static void
+free_partition_graph_edata_cb (struct graph *, struct graph_edge *e, void *)
+{
+ if (e->data != NULL)
+ {
+ struct pg_edata *data = (struct pg_edata *)e->data;
+ data->alias_ddrs.release ();
+ delete data;
+ }
+}
+
+/* Free data attached to vertice of partition dependence graph PG. */
+
+static void
+free_partition_graph_vdata (struct graph *pg)
+{
+ int i;
+ struct pg_vdata *data;
+
+ for (i = 0; i < pg->n_vertices; ++i)
+ {
+ data = (struct pg_vdata *)pg->vertices[i].data;
+ delete data;
+ }
+}
+
+/* Build and return partition dependence graph for PARTITIONS. RDG is
+ reduced dependence graph for the loop to be distributed. If IGNORE_ALIAS_P
+ is true, data dependence caused by possible alias between references
+ is ignored, as if it doesn't exist at all; otherwise all depdendences
+ are considered. */
+
+static struct graph *
+build_partition_graph (struct graph *rdg,
+ vec<struct partition *> *partitions,
+ bool ignore_alias_p)
+{
+ int i, j;
+ struct partition *partition1, *partition2;
+ graph *pg = new_graph (partitions->length ());
+ auto_vec<ddr_p> alias_ddrs, *alias_ddrs_p;
+
+ alias_ddrs_p = ignore_alias_p ? NULL : &alias_ddrs;
+
+ init_partition_graph_vertices (pg, partitions);
+
+ for (i = 0; partitions->iterate (i, &partition1); ++i)
+ {
+ for (j = i + 1; partitions->iterate (j, &partition2); ++j)
+ {
+ /* dependence direction - 0 is no dependence, -1 is back,
+ 1 is forth, 2 is both (we can stop then, merging will occur). */
+ int dir = 0;
+
+ /* If the first partition has reduction, add back edge; if the
+ second partition has reduction, add forth edge. This makes
+ sure that reduction partition will be sorted as the last one. */
+ if (partition_reduction_p (partition1))
+ dir = -1;
+ else if (partition_reduction_p (partition2))
+ dir = 1;
+
+ /* Cleanup the temporary vector. */
+ alias_ddrs.truncate (0);
+
+ dir = pg_add_dependence_edges (rdg, dir, partition1->datarefs,
+ partition2->datarefs, alias_ddrs_p);
+
+ /* Add edge to partition graph if there exists dependence. There
+ are two types of edges. One type edge is caused by compilation
+ time known dependence, this type can not be resolved by runtime
+ alias check. The other type can be resolved by runtime alias
+ check. */
+ if (dir == 1 || dir == 2
+ || alias_ddrs.length () > 0)
+ {
+ /* Attach data dependence relations to edge that can be resolved
+ by runtime alias check. */
+ bool alias_edge_p = (dir != 1 && dir != 2);
+ add_partition_graph_edge (pg, i, j,
+ (alias_edge_p) ? &alias_ddrs : NULL);
+ }
+ if (dir == -1 || dir == 2
+ || alias_ddrs.length () > 0)
+ {
+ /* Attach data dependence relations to edge that can be resolved
+ by runtime alias check. */
+ bool alias_edge_p = (dir != -1 && dir != 2);
+ add_partition_graph_edge (pg, j, i,
+ (alias_edge_p) ? &alias_ddrs : NULL);
+ }
+ }
+ }
+ return pg;
+}
+
+/* Sort partitions in PG by post order and store them in PARTITIONS. */
+
+static void
+sort_partitions_by_post_order (struct graph *pg,
+ vec<struct partition *> *partitions)
+{
+ int i;
+ struct pg_vdata *data;
+
+ /* Now order the remaining nodes in postorder. */
+ qsort (pg->vertices, pg->n_vertices, sizeof (vertex), pgcmp);
+ partitions->truncate (0);
+ for (i = 0; i < pg->n_vertices; ++i)
+ {
+ data = (struct pg_vdata *)pg->vertices[i].data;
+ if (data->partition)
+ partitions->safe_push (data->partition);
+ }
+}
+
+/* Given reduced dependence graph RDG merge strong connected components
+ of PARTITIONS. If IGNORE_ALIAS_P is true, data dependence caused by
+ possible alias between references is ignored, as if it doesn't exist
+ at all; otherwise all depdendences are considered. */
+
+static void
+merge_dep_scc_partitions (struct graph *rdg,
+ vec<struct partition *> *partitions,
+ bool ignore_alias_p)
+{
+ struct partition *partition1, *partition2;
+ struct pg_vdata *data;
+ graph *pg = build_partition_graph (rdg, partitions, ignore_alias_p);
+ int i, j, num_sccs = graphds_scc (pg, NULL);
+
+ /* Strong connected compoenent means dependence cycle, we cannot distribute
+ them. So fuse them together. */
+ if ((unsigned) num_sccs < partitions->length ())
+ {
+ for (i = 0; i < num_sccs; ++i)
+ {
+ for (j = 0; partitions->iterate (j, &partition1); ++j)
+ if (pg->vertices[j].component == i)
+ break;
+ for (j = j + 1; partitions->iterate (j, &partition2); ++j)
+ if (pg->vertices[j].component == i)
+ {
+ partition_merge_into (partition1, partition2, FUSE_SAME_SCC);
+ (*partitions)[j] = NULL;
+ partition_free (partition2);
+ data = (struct pg_vdata *)pg->vertices[j].data;
+ data->partition = NULL;
+ }
+ }
+ sort_partitions_by_post_order (pg, partitions);
+ }
+ gcc_assert (partitions->length () == (unsigned)num_sccs);
+ free_partition_graph_vdata (pg);
+ free_graph (pg);
+}
+
+/* Callback function for traversing edge E in graph G. DATA is private
+ callback data. */
+
+static void
+pg_collect_alias_ddrs (struct graph *g, struct graph_edge *e, void *data)
+{
+ int i, j, component;
+ struct pg_edge_callback_data *cbdata;
+ struct pg_edata *edata = (struct pg_edata *) e->data;
+
+ /* If the edge doesn't have attached data dependence, it represents
+ compilation time known dependences. This type dependence cannot
+ be resolved by runtime alias check. */
+ if (edata == NULL || edata->alias_ddrs.length () == 0)
+ return;
+
+ cbdata = (struct pg_edge_callback_data *) data;
+ i = e->src;
+ j = e->dest;
+ component = cbdata->vertices_component[i];
+ /* Vertices are topologically sorted according to compilation time
+ known dependences, so we can break strong connected components
+ by removing edges of the opposite direction, i.e, edges pointing
+ from vertice with smaller post number to vertice with bigger post
+ number. */
+ if (g->vertices[i].post < g->vertices[j].post
+ /* We only need to remove edges connecting vertices in the same
+ strong connected component to break it. */
+ && component == cbdata->vertices_component[j]
+ /* Check if we want to break the strong connected component or not. */
+ && !bitmap_bit_p (cbdata->sccs_to_merge, component))
+ cbdata->alias_ddrs->safe_splice (edata->alias_ddrs);
+}
+
+/* This is the main function breaking strong conected components in
+ PARTITIONS giving reduced depdendence graph RDG. Store data dependence
+ relations for runtime alias check in ALIAS_DDRS. */
+
+static void
+break_alias_scc_partitions (struct graph *rdg,
+ vec<struct partition *> *partitions,
+ vec<ddr_p> *alias_ddrs)
+{
+ int i, j, num_sccs, num_sccs_no_alias;
+ /* Build partition dependence graph. */
+ graph *pg = build_partition_graph (rdg, partitions, false);
+
+ alias_ddrs->truncate (0);
+ /* Find strong connected components in the graph, with all dependence edges
+ considered. */
+ num_sccs = graphds_scc (pg, NULL);
+ /* All SCCs now can be broken by runtime alias checks because SCCs caused by
+ compilation time known dependences are merged before this function. */
+ if ((unsigned) num_sccs < partitions->length ())
+ {
+ struct pg_edge_callback_data cbdata;
+ auto_bitmap sccs_to_merge;
+ auto_vec<enum partition_type> scc_types;
+ struct partition *partition, *first;
+
+ /* If all paritions in a SCC has the same type, we can simply merge the
+ SCC. This loop finds out such SCCS and record them in bitmap. */
+ bitmap_set_range (sccs_to_merge, 0, (unsigned) num_sccs);
+ for (i = 0; i < num_sccs; ++i)
+ {
+ for (j = 0; partitions->iterate (j, &first); ++j)
+ if (pg->vertices[j].component == i)
+ break;
+ for (++j; partitions->iterate (j, &partition); ++j)
+ {
+ if (pg->vertices[j].component != i)
+ continue;
+
+ if (first->type != partition->type)
+ {
+ bitmap_clear_bit (sccs_to_merge, i);
+ break;
+ }
+ }
+ }
+
+ /* Initialize callback data for traversing. */
+ cbdata.sccs_to_merge = sccs_to_merge;
+ cbdata.alias_ddrs = alias_ddrs;
+ cbdata.vertices_component = XNEWVEC (int, pg->n_vertices);
+ /* Record the component information which will be corrupted by next
+ graph scc finding call. */
+ for (i = 0; i < pg->n_vertices; ++i)
+ cbdata.vertices_component[i] = pg->vertices[i].component;
+
+ /* Collect data dependences for runtime alias checks to break SCCs. */
+ if (bitmap_count_bits (sccs_to_merge) != (unsigned) num_sccs)
+ {
+ /* Run SCC finding algorithm again, with alias dependence edges
+ skipped. This is to topologically sort paritions according to
+ compilation time known dependence. Note the topological order
+ is stored in the form of pg's post order number. */
+ num_sccs_no_alias = graphds_scc (pg, NULL, pg_skip_alias_edge);
+ gcc_assert (partitions->length () == (unsigned) num_sccs_no_alias);
+ /* With topological order, we can construct two subgraphs L and R.
+ L contains edge <x, y> where x < y in terms of post order, while
+ R contains edge <x, y> where x > y. Edges for compilation time
+ known dependence all fall in R, so we break SCCs by removing all
+ (alias) edges of in subgraph L. */
+ for_each_edge (pg, pg_collect_alias_ddrs, &cbdata);
+ }
+
+ /* For SCC that doesn't need to be broken, merge it. */
+ for (i = 0; i < num_sccs; ++i)
+ {
+ if (!bitmap_bit_p (sccs_to_merge, i))
+ continue;
+
+ for (j = 0; partitions->iterate (j, &first); ++j)
+ if (cbdata.vertices_component[j] == i)
+ break;
+ for (++j; partitions->iterate (j, &partition); ++j)
+ {
+ struct pg_vdata *data;
+
+ if (cbdata.vertices_component[j] != i)
+ continue;
+
+ partition_merge_into (first, partition, FUSE_SAME_SCC);
+ (*partitions)[j] = NULL;
+ partition_free (partition);
+ data = (struct pg_vdata *)pg->vertices[j].data;
+ gcc_assert (data->id == j);
+ data->partition = NULL;
+ }
+ }
+ }
+
+ sort_partitions_by_post_order (pg, partitions);
+ free_partition_graph_vdata (pg);
+ for_each_edge (pg, free_partition_graph_edata_cb, NULL);
+ free_graph (pg);
+
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ {
+ fprintf (dump_file, "Possible alias data dependence to break:\n");
+ dump_data_dependence_relations (dump_file, *alias_ddrs);
+ }
+}
+
+/* Compute and return an expression whose value is the segment length which
+ will be accessed by DR in NITERS iterations. */
+
+static tree
+data_ref_segment_size (struct data_reference *dr, tree niters)
+{
+ tree segment_length;
+
+ if (integer_zerop (DR_STEP (dr)))
+ segment_length = TYPE_SIZE_UNIT (TREE_TYPE (DR_REF (dr)));
+ else
+ segment_length = size_binop (MULT_EXPR,
+ fold_convert (sizetype, DR_STEP (dr)),
+ fold_convert (sizetype, niters));
+
+ return segment_length;
+}
+
+/* Return true if LOOP's latch is dominated by statement for data reference
+ DR. */
+
+static inline bool
+latch_dominated_by_data_ref (struct loop *loop, data_reference *dr)
+{
+ return dominated_by_p (CDI_DOMINATORS, single_exit (loop)->src,
+ gimple_bb (DR_STMT (dr)));
+}
+
+/* Compute alias check pairs and store them in COMP_ALIAS_PAIRS for LOOP's
+ data dependence relations ALIAS_DDRS. */
+
+static void
+compute_alias_check_pairs (struct loop *loop, vec<ddr_p> *alias_ddrs,
+ vec<dr_with_seg_len_pair_t> *comp_alias_pairs)
+{
+ unsigned int i;
+ unsigned HOST_WIDE_INT factor = 1;
+ tree niters_plus_one, niters = number_of_latch_executions (loop);
+
+ gcc_assert (niters != NULL_TREE && niters != chrec_dont_know);
+ niters = fold_convert (sizetype, niters);
+ niters_plus_one = size_binop (PLUS_EXPR, niters, size_one_node);
+
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file, "Creating alias check pairs:\n");
+
+ /* Iterate all data dependence relations and compute alias check pairs. */
+ for (i = 0; i < alias_ddrs->length (); i++)
+ {
+ ddr_p ddr = (*alias_ddrs)[i];
+ struct data_reference *dr_a = DDR_A (ddr);
+ struct data_reference *dr_b = DDR_B (ddr);
+ tree seg_length_a, seg_length_b;
+ int comp_res = data_ref_compare_tree (DR_BASE_ADDRESS (dr_a),
+ DR_BASE_ADDRESS (dr_b));
+
+ if (comp_res == 0)
+ comp_res = data_ref_compare_tree (DR_OFFSET (dr_a), DR_OFFSET (dr_b));
+ gcc_assert (comp_res != 0);
+
+ if (latch_dominated_by_data_ref (loop, dr_a))
+ seg_length_a = data_ref_segment_size (dr_a, niters_plus_one);
+ else
+ seg_length_a = data_ref_segment_size (dr_a, niters);
+
+ if (latch_dominated_by_data_ref (loop, dr_b))
+ seg_length_b = data_ref_segment_size (dr_b, niters_plus_one);
+ else
+ seg_length_b = data_ref_segment_size (dr_b, niters);
+
+ dr_with_seg_len_pair_t dr_with_seg_len_pair
+ (dr_with_seg_len (dr_a, seg_length_a),
+ dr_with_seg_len (dr_b, seg_length_b));
+
+ /* Canonicalize pairs by sorting the two DR members. */
+ if (comp_res > 0)
+ std::swap (dr_with_seg_len_pair.first, dr_with_seg_len_pair.second);
+
+ comp_alias_pairs->safe_push (dr_with_seg_len_pair);
+ }
+
+ if (tree_fits_uhwi_p (niters))
+ factor = tree_to_uhwi (niters);
+
+ /* Prune alias check pairs. */
+ prune_runtime_alias_test_list (comp_alias_pairs, factor);
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file,
+ "Improved number of alias checks from %d to %d\n",
+ alias_ddrs->length (), comp_alias_pairs->length ());
+}
+
+/* Given data dependence relations in ALIAS_DDRS, generate runtime alias
+ checks and version LOOP under condition of these runtime alias checks. */
+
+static void
+version_loop_by_alias_check (struct loop *loop, vec<ddr_p> *alias_ddrs)
+{
+ unsigned then_prob, else_prob, then_scale, else_scale;
+ basic_block cond_bb;
+ struct loop *nloop;
+ tree lhs, arg0, cond_expr = NULL_TREE;
+ gimple_seq cond_stmts = NULL;
+ gimple *stmt;
+ auto_vec<dr_with_seg_len_pair_t> comp_alias_pairs;
+
+ /* Generate code for runtime alias checks if necessary. */
+ if (alias_ddrs->length () > 0)
+ {
+ compute_alias_check_pairs (loop, alias_ddrs, &comp_alias_pairs);
+ create_runtime_alias_checks (loop, &comp_alias_pairs, &cond_expr);
+ cond_expr = force_gimple_operand_1 (cond_expr, &cond_stmts,
+ is_gimple_condexpr, NULL_TREE);
+ then_prob = 9 * REG_BR_PROB_BASE / 10;
+ else_prob = REG_BR_PROB_BASE - then_prob;
+ then_scale = 9 * REG_BR_PROB_BASE / 10;
+ else_scale = REG_BR_PROB_BASE - then_scale;
+ }
+ else
+ {
+ cond_expr = boolean_true_node;
+ /* Use REG_BR_PROB_BASE for both branches since only one branch
+ will be kept in the end. */
+ then_prob = REG_BR_PROB_BASE;
+ else_prob = REG_BR_PROB_BASE;
+ then_scale = REG_BR_PROB_BASE;
+ else_scale = REG_BR_PROB_BASE;
+ }
+
+ /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS. */
+ if (flag_tree_loop_vectorize)
+ {
+ /* Generate internal function call for loop distribution alias check. */
+ arg0 = build_int_cst (integer_type_node, loop->num);
+ stmt = gimple_build_call_internal (IFN_LOOP_DIST_ALIAS,
+ 2, arg0, cond_expr);
+ lhs = make_ssa_name (boolean_type_node);
+ gimple_call_set_lhs (stmt, lhs);
+ gimple_seq_add_stmt (&cond_stmts, stmt);
+ }
+ else
+ lhs = cond_expr;
+
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file,
+ "Version loop <%d> with runtime alias check\n", loop->num);
+
+ initialize_original_copy_tables ();
+ nloop = loop_version (loop, lhs, &cond_bb, then_prob,
+ else_prob, then_scale, else_scale, true);
+ free_original_copy_tables ();
+ /* Record the original loop number in newly generated loops. */
+ nloop->ldist_alias_id = loop->num;
+ nloop->dont_vectorize = true;
+ nloop->force_vectorize = false;
+
+ if (cond_stmts)
+ {
+ gimple_stmt_iterator cond_gsi = gsi_last_bb (cond_bb);
+ gsi_insert_seq_before (&cond_gsi, cond_stmts, GSI_SAME_STMT);
+ }
+ update_ssa (TODO_update_ssa);
+}
+
+/* Return true if loop versioning is needed to distrubute PARTITIONS.
+ ALIAS_DDRS are data dependence relations for runtime alias check. */
+
+static inline bool
+version_for_distribution_p (vec<struct partition *> *partitions,
+ vec<ddr_p> *alias_ddrs)
+{
+ unsigned i;
+ struct partition *partition;
+
+ /* No need to version loop if we have only one partition. */
+ if (partitions->length () == 1)
+ return false;
+
+ /* Need to version loop if runtime alias check is necessary. */
+ if (alias_ddrs->length () > 0)
+ return true;
+
+ /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
+ vectorizer is not enable because no other pass can fold it. */
+ if (!flag_tree_loop_vectorize)
+ return false;
+
+ /* Don't version loop if any partition is builtin. */
+ for (i = 0; partitions->iterate (i, &partition); ++i)
+ {
+ if (partition->kind != PKIND_NORMAL)
+ break;
+ }
+ return (i == partitions->length ());
+}
+
+/* Fuse all partitions if necessary before finalizing distribution. */
+
+static void
+finalize_partitions (vec<struct partition *> *partitions,
+ vec<ddr_p> *alias_ddrs)
+{
+ unsigned i;
+ struct partition *a, *partition;
+
+ if (partitions->length () == 1
+ || alias_ddrs->length () > 0)
+ return;
+
+ a = (*partitions)[0];
+ if (a->kind != PKIND_NORMAL)
+ return;
+
+ for (i = 1; partitions->iterate (i, &partition); ++i)
+ {
+ /* Don't fuse if partition has different type or it is a builtin. */
+ if (partition->type != a->type
+ || partition->kind != PKIND_NORMAL)
+ return;
+ }
+
+ /* Fuse all partitions. */
+ for (i = 1; partitions->iterate (i, &partition); ++i)
+ {
+ partition_merge_into (a, partition, FUSE_FINALIZE);
+ partition_free (partition);
+ }
+ partitions->truncate (1);
+}
+
+/* Distributes the code from LOOP in such a way that producer statements
+ are placed before consumer statements. Tries to separate only the
+ statements from STMTS into separate loops. Returns the number of
+ distributed loops. Set NB_CALLS to number of generated builtin calls.
+ Set *DESTROY_P to whether LOOP needs to be destroyed. */
static int
distribute_loop (struct loop *loop, vec<gimple *> stmts,
@@ -1672,8 +2358,6 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
partition *partition;
bool any_builtin;
int i, nbp;
- graph *pg = NULL;
- int num_sccs = 1;
*destroy_p = false;
*nb_calls = 0;
@@ -1721,6 +2405,11 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
auto_vec<struct partition *, 3> partitions;
rdg_build_partitions (rdg, stmts, &partitions);
+ /* Can't do runtime alias check if loop niter is unknown. */
+ tree niters = number_of_latch_executions (loop);
+ bool rt_alias_check_p = (niters != NULL_TREE && niters != chrec_dont_know);
+ auto_vec<ddr_p> alias_ddrs;
+
auto_bitmap stmt_in_all_partitions;
bitmap_copy (stmt_in_all_partitions, partitions[0]->stmts);
for (i = 1; partitions.iterate (i, &partition); ++i)
@@ -1807,79 +2496,14 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
/* Build the partition dependency graph. */
if (partitions.length () > 1)
{
- pg = new_graph (partitions.length ());
- struct pgdata {
- struct partition *partition;
- };
-#define PGDATA(i) ((pgdata *)(pg->vertices[i].data))
- for (i = 0; partitions.iterate (i, &partition); ++i)
- {
- vertex *v = &pg->vertices[i];
- pgdata *data = new pgdata;
- /* FIXME - leaks. */
- v->data = data;
- data->partition = partition;
- }
- struct partition *partition1, *partition2;
- for (i = 0; partitions.iterate (i, &partition1); ++i)
- for (int j = i + 1; partitions.iterate (j, &partition2); ++j)
- {
- /* dependence direction - 0 is no dependence, -1 is back,
- 1 is forth, 2 is both (we can stop then, merging will occur). */
- int dir = pg_add_dependence_edges (rdg, dir,
- partition1->datarefs,
- partition2->datarefs);
- if (dir == 1 || dir == 2)
- add_edge (pg, i, j);
- if (dir == -1 || dir == 2)
- add_edge (pg, j, i);
- }
-
- /* Add edges to the reduction partition (if any) to force it last. */
- unsigned j;
- for (j = 0; partitions.iterate (j, &partition); ++j)
- if (partition_reduction_p (partition))
- break;
- if (j < partitions.length ())
- {
- for (unsigned i = 0; partitions.iterate (i, &partition); ++i)
- if (i != j)
- add_edge (pg, i, j);
- }
-
- /* Compute partitions we cannot separate and fuse them. */
- num_sccs = graphds_scc (pg, NULL);
- for (i = 0; i < num_sccs; ++i)
- {
- struct partition *first;
- int j;
- for (j = 0; partitions.iterate (j, &first); ++j)
- if (pg->vertices[j].component == i)
- break;
- for (j = j + 1; partitions.iterate (j, &partition); ++j)
- if (pg->vertices[j].component == i)
- {
- partition_merge_into (first, partition, FUSE_SAME_SCC);
- partitions[j] = NULL;
- partition_free (partition);
- PGDATA (j)->partition = NULL;
- }
- }
-
- /* Now order the remaining nodes in postorder. */
- qsort (pg->vertices, pg->n_vertices, sizeof (vertex), pgcmp);
- partitions.truncate (0);
- for (i = 0; i < pg->n_vertices; ++i)
- {
- pgdata *data = PGDATA (i);
- if (data->partition)
- partitions.safe_push (data->partition);
- delete data;
- }
- gcc_assert (partitions.length () == (unsigned)num_sccs);
- free_graph (pg);
+ merge_dep_scc_partitions (rdg, &partitions, rt_alias_check_p);
+ alias_ddrs.truncate (0);
+ if (rt_alias_check_p && partitions.length () > 1)
+ break_alias_scc_partitions (rdg, &partitions, &alias_ddrs);
}
+ finalize_partitions (&partitions, &alias_ddrs);
+
nbp = partitions.length ();
if (nbp == 0
|| (nbp == 1 && !partition_builtin_p (partitions[0]))
@@ -1889,8 +2513,15 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
goto ldist_done;
}
+ if (version_for_distribution_p (&partitions, &alias_ddrs))
+ version_loop_by_alias_check (loop, &alias_ddrs);
+
if (dump_file && (dump_flags & TDF_DETAILS))
- dump_rdg_partitions (dump_file, partitions);
+ {
+ fprintf (dump_file,
+ "distribute loop <%d> into partitions:\n", loop->num);
+ dump_rdg_partitions (dump_file, partitions);
+ }
FOR_EACH_VEC_ELT (partitions, i, partition)
{
--
1.9.1
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check
2017-06-20 9:22 ` Bin.Cheng
@ 2017-06-23 10:30 ` Bin.Cheng
2017-06-27 12:44 ` Richard Biener
0 siblings, 1 reply; 16+ messages in thread
From: Bin.Cheng @ 2017-06-23 10:30 UTC (permalink / raw)
To: gcc-patches, Richard Guenther
[-- Attachment #1: Type: text/plain, Size: 5133 bytes --]
On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>> Hi,
>> This is the main patch rewriting loop distribution in order to handle hmmer.
>> It improves loop distribution by versioning loop under runtime alias check conditions.
>> As described in comments, the patch basically implements distribution in the following
>> steps:
>>
>> 1) Seed partitions with specific type statements. For now we support
>> two types seed statements: statement defining variable used outside
>> of loop; statement storing to memory.
>> 2) Build reduced dependence graph (RDG) for loop to be distributed.
>> The vertices (RDG:V) model all statements in the loop and the edges
>> (RDG:E) model flow and control dependencies between statements.
>> 3) Apart from RDG, compute data dependencies between memory references.
>> 4) Starting from seed statement, build up partition by adding depended
>> statements according to RDG's dependence information. Partition is
>> classified as parallel type if it can be executed paralleled; or as
>> sequential type if it can't. Parallel type partition is further
>> classified as different builtin kinds if it can be implemented as
>> builtin function calls.
>> 5) Build partition dependence graph (PG) based on data dependencies.
>> The vertices (PG:V) model all partitions and the edges (PG:E) model
>> all data dependencies between every partitions pair. In general,
>> data dependence is either compilation time known or unknown. In C
>> family languages, there exists quite amount compilation time unknown
>> dependencies because of possible alias relation of data references.
>> We categorize PG's edge to two types: "true" edge that represents
>> compilation time known data dependencies; "alias" edge for all other
>> data dependencies.
>> 6) Traverse subgraph of PG as if all "alias" edges don't exist. Merge
>> partitions in each strong connected component (SCC) correspondingly.
>> Build new PG for merged partitions.
>> 7) Traverse PG again and this time with both "true" and "alias" edges
>> included. We try to break SCCs by removing some edges. Because
>> SCCs by "true" edges are all fused in step 6), we can break SCCs
>> by removing some "alias" edges. It's NP-hard to choose optimal
>> edge set, fortunately simple approximation is good enough for us
>> given the small problem scale.
>> 8) Collect all data dependencies of the removed "alias" edges. Create
>> runtime alias checks for collected data dependencies.
>> 9) Version loop under the condition of runtime alias checks. Given
>> loop distribution generally introduces additional overhead, it is
>> only useful if vectorization is achieved in distributed loop. We
>> version loop with internal function call IFN_LOOP_DIST_ALIAS. If
>> no distributed loop can be vectorized, we simply remove distributed
>> loops and recover to the original one.
>>
>> Also, there are some more to improve in the future (which isn't difficult I think):
>> TODO:
>> 1) We only distribute innermost loops now. This pass should handle loop
>> nests in the future.
>> 2) We only fuse partitions in SCC now. A better fusion algorithm is
>> desired to minimize loop overhead, maximize parallelism and maximize
>>
>> Bootstrap and test on x86_64 and AArch64. Is it OK?
>>
> Trivial updated due to changes in previous patches. Also fixed issues
> mentioned by Kugan.
Rebased V3 for changes in previous patches. Bootstap and test on
x86_64 and aarch64.
Thanks,
bin
>
> Thanks,
> bin
>
> 2017-06-17 Bin Cheng <bin.cheng@arm.com>
>
> * tree-loop-distribution.c: Add general explanantion on the pass.
> (pg_add_dependence_edges): New parameter. handle alias data
> dependence specially and record it in the parameter if asked.
> (struct pg_vdata, pg_edata, pg_edge_callback_data): New structs.
> (init_partition_graph_vertices, add_partition_graph_edge): New.
> (pg_skip_alias_edge, free_partition_graph_edata_cb): New.
> (free_partition_graph_vdata, build_partition_graph): New.
> (sort_partitions_by_post_order, merge_dep_scc_partitions): New.
> (pg_collect_alias_ddrs, break_alias_scc_partitions): New.
> (data_ref_segment_size, latch_dominated_by_data_ref): New.
> (compute_alias_check_pairs, version_loop_by_alias_check): New.
> (version_for_distribution_p, finalize_partitions): New.
> (distribute_loop): Handle alias data dependence specially. Factor
> out loop fusion code as functions and call these functions.
>
> gcc/testsuite/ChangeLog
> 2017-06-17 Bin Cheng <bin.cheng@arm.com>
>
> * gcc.dg/tree-ssa/ldist-4.c: Adjust test string.
> * gcc.dg/tree-ssa/ldist-12.c: Ditto.
> * gcc.dg/tree-ssa/ldist-13.c: Ditto.
> * gcc.dg/tree-ssa/ldist-14.c: Ditto.
[-- Attachment #2: 0012-loop-distribution-20170609.txt.patch --]
[-- Type: text/x-patch, Size: 36318 bytes --]
From b063a362af4d59dac06be8b4dac6269627f52d26 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binche01@e108451-lin.cambridge.arm.com>
Date: Thu, 22 Jun 2017 17:18:49 +0100
Subject: [PATCH 12/13] loop-distribution-20170609.txt
---
gcc/testsuite/gcc.dg/tree-ssa/ldist-12.c | 3 +-
gcc/testsuite/gcc.dg/tree-ssa/ldist-13.c | 5 +-
gcc/testsuite/gcc.dg/tree-ssa/ldist-14.c | 5 +-
gcc/testsuite/gcc.dg/tree-ssa/ldist-4.c | 4 +-
gcc/tree-loop-distribution.c | 835 +++++++++++++++++++++++++++----
5 files changed, 742 insertions(+), 110 deletions(-)
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-12.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-12.c
index 53551ca..625dd92 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-12.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-12.c
@@ -18,4 +18,5 @@ int foo (int * __restrict__ ia,
return oya[22] + oyb[21];
}
-/* { dg-final { scan-tree-dump-times "distributed: split to 2 loops" 1 "ldist" } } */
+/* Distributing the loop doesn't expose more parallelism. */
+/* { dg-final { scan-tree-dump-times "distributed: split to 2 loops" 0 "ldist" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-13.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-13.c
index ba39d4d..8c9fd56 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-13.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-13.c
@@ -16,6 +16,5 @@ float foo (int n)
return tmp;
}
-/* We should apply loop distribution. */
-
-/* { dg-final { scan-tree-dump "Loop 1 distributed: split to 2 loops" "ldist" } } */
+/* Distributing the loop doesn't expose more parallelism. */
+/* { dg-final { scan-tree-dump-not "Loop 1 distributed: split to 2 loops" "ldist" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-14.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-14.c
index 48c1040..fa4d1a8 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-14.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-14.c
@@ -21,6 +21,5 @@ float foo (int n)
return tmp;
}
-/* We should apply loop distribution. */
-
-/* { dg-final { scan-tree-dump "Loop 1 distributed: split to 2 loops" "ldist" } } */
+/* Distributing the loop doesn't expose more parallelism. */
+/* { dg-final { scan-tree-dump-not "Loop 1 distributed: split to 2 loops" "ldist" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-4.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-4.c
index c36daf071..4def9b4 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-4.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-4.c
@@ -20,7 +20,5 @@ int loop1 (int k)
return b[100-1][1];
}
-/* The current cost model fuses the two partitions because they have
- similar memory accesses. */
-/* { dg-final { scan-tree-dump "similar memory accesses" "ldist" } } */
+/* Distributing inner loop doesn't expose more parallelism. */
/* { dg-final { scan-tree-dump-times "distributed: split to 2 loops" 0 "ldist" } } */
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index b15ec04..afc7bd0 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -36,10 +36,58 @@ along with GCC; see the file COPYING3. If not see
| D(I) = A(I-1)*E
|ENDDO
- This pass uses an RDG, Reduced Dependence Graph built on top of the
- data dependence relations. The RDG is then topologically sorted to
- obtain a map of information producers/consumers based on which it
- generates the new loops. */
+ Loop distribution is the dual of loop fusion. It separates statements
+ of a loop (or loop nest) into multiple loops (or loop nests) with the
+ same loop header. The major goal is to separate statements which may
+ be vectorized from those that can't. This pass implements distribution
+ in the following steps:
+
+ 1) Seed partitions with specific type statements. For now we support
+ two types seed statements: statement defining variable used outside
+ of loop; statement storing to memory.
+ 2) Build reduced dependence graph (RDG) for loop to be distributed.
+ The vertices (RDG:V) model all statements in the loop and the edges
+ (RDG:E) model flow and control dependencies between statements.
+ 3) Apart from RDG, compute data dependencies between memory references.
+ 4) Starting from seed statement, build up partition by adding depended
+ statements according to RDG's dependence information. Partition is
+ classified as parallel type if it can be executed paralleled; or as
+ sequential type if it can't. Parallel type partition is further
+ classified as different builtin kinds if it can be implemented as
+ builtin function calls.
+ 5) Build partition dependence graph (PG) based on data dependencies.
+ The vertices (PG:V) model all partitions and the edges (PG:E) model
+ all data dependencies between every partitions pair. In general,
+ data dependence is either compilation time known or unknown. In C
+ family languages, there exists quite amount compilation time unknown
+ dependencies because of possible alias relation of data references.
+ We categorize PG's edge to two types: "true" edge that represents
+ compilation time known data dependencies; "alias" edge for all other
+ data dependencies.
+ 6) Traverse subgraph of PG as if all "alias" edges don't exist. Merge
+ partitions in each strong connected component (SCC) correspondingly.
+ Build new PG for merged partitions.
+ 7) Traverse PG again and this time with both "true" and "alias" edges
+ included. We try to break SCCs by removing some edges. Because
+ SCCs by "true" edges are all fused in step 6), we can break SCCs
+ by removing some "alias" edges. It's NP-hard to choose optimal
+ edge set, fortunately simple approximation is good enough for us
+ given the small problem scale.
+ 8) Collect all data dependencies of the removed "alias" edges. Create
+ runtime alias checks for collected data dependencies.
+ 9) Version loop under the condition of runtime alias checks. Given
+ loop distribution generally introduces additional overhead, it is
+ only useful if vectorization is achieved in distributed loop. We
+ version loop with internal function call IFN_LOOP_DIST_ALIAS. If
+ no distributed loop can be vectorized, we simply remove distributed
+ loops and recover to the original one.
+
+ TODO:
+ 1) We only distribute innermost loops now. This pass should handle loop
+ nests in the future.
+ 2) We only fuse partitions in SCC now. A better fusion algorithm is
+ desired to minimize loop overhead, maximize parallelism and maximize
+ data reuse. */
#include "config.h"
#include "system.h"
@@ -744,11 +792,15 @@ generate_loops_for_partition (struct loop *loop, partition *partition,
if (copy_p)
{
+ int ldist_alias_id = loop->num;
loop = copy_loop_before (loop);
gcc_assert (loop != NULL);
+ loop->ldist_alias_id = ldist_alias_id;
create_preheader (loop, CP_SIMPLE_PREHEADERS);
create_bb_after_loop (loop);
}
+ else
+ loop->ldist_alias_id = loop->num;
/* Remove stmts not in the PARTITION bitmap. */
bbs = get_loop_body_in_dom_order (loop);
@@ -1601,11 +1653,14 @@ partition_contains_all_rw (struct graph *rdg,
}
/* Compute partition dependence created by the data references in DRS1
- and DRS2 and modify and return DIR according to that. */
+ and DRS2, modify and return DIR according to that. IF ALIAS_DDR is
+ not NULL, we record dependence introduced by possible alias between
+ two data references in ALIAS_DDRS; otherwise, we simply ignore such
+ dependence as if it doesn't exist at all. */
static int
pg_add_dependence_edges (struct graph *rdg, int dir,
- bitmap drs1, bitmap drs2)
+ bitmap drs1, bitmap drs2, vec<ddr_p> *alias_ddrs)
{
unsigned i, j;
bitmap_iterator bi, bj;
@@ -1619,6 +1674,9 @@ pg_add_dependence_edges (struct graph *rdg, int dir,
EXECUTE_IF_SET_IN_BITMAP (drs2, 0, j, bj)
{
+ int res, this_dir = 1;
+ ddr_p ddr;
+
dr2 = datarefs_vec[j];
/* Skip all <read, read> data dependence. */
@@ -1626,9 +1684,7 @@ pg_add_dependence_edges (struct graph *rdg, int dir,
continue;
saved_dr1 = dr1;
- int this_dir = 1;
- ddr_p ddr;
- /* Re-shuffle data-refs to be in dominator order. */
+ /* Re-shuffle data-refs to be in topological order. */
if (rdg_vertex_for_stmt (rdg, DR_STMT (dr1))
> rdg_vertex_for_stmt (rdg, DR_STMT (dr2)))
{
@@ -1637,14 +1693,33 @@ pg_add_dependence_edges (struct graph *rdg, int dir,
}
ddr = get_data_dependence (rdg, dr1, dr2);
if (DDR_ARE_DEPENDENT (ddr) == chrec_dont_know)
- this_dir = 2;
+ {
+ this_dir = 0;
+ res = data_ref_compare_tree (DR_BASE_ADDRESS (dr1),
+ DR_BASE_ADDRESS (dr2));
+ /* Be conservative. If data references are not well analyzed,
+ or the two data references have the same base address and
+ offset, add dependence and consider it alias to each other.
+ In other words, the dependence can not be resolved by
+ runtime alias check. */
+ if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
+ || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
+ || !DR_INIT (dr1) || !DR_INIT (dr2)
+ || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
+ || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
+ || res == 0)
+ this_dir = 2;
+ /* Data dependence could be resolved by runtime alias check,
+ record it in ALIAS_DDRS. */
+ else if (alias_ddrs != NULL)
+ alias_ddrs->safe_push (ddr);
+ /* Or simply ignore it. */
+ }
else if (DDR_ARE_DEPENDENT (ddr) == NULL_TREE)
{
if (DDR_REVERSED_P (ddr))
- {
- std::swap (dr1, dr2);
- this_dir = -this_dir;
- }
+ this_dir = -this_dir;
+
/* Known dependences can still be unordered througout the
iteration space, see gcc.dg/tree-ssa/ldist-16.c. */
if (DDR_NUM_DIST_VECTS (ddr) != 1)
@@ -1652,12 +1727,10 @@ pg_add_dependence_edges (struct graph *rdg, int dir,
/* If the overlap is exact preserve stmt order. */
else if (lambda_vector_zerop (DDR_DIST_VECT (ddr, 0), 1))
;
+ /* Else as the distance vector is lexicographic positive swap
+ the dependence direction. */
else
- {
- /* Else as the distance vector is lexicographic positive
- swap the dependence direction. */
- this_dir = -this_dir;
- }
+ this_dir = -this_dir;
}
else
this_dir = 0;
@@ -1684,11 +1757,630 @@ pgcmp (const void *v1_, const void *v2_)
return v2->post - v1->post;
}
-/* Distributes the code from LOOP in such a way that producer
- statements are placed before consumer statements. Tries to separate
- only the statements from STMTS into separate loops.
- Returns the number of distributed loops. Set *DESTROY_P to whether
- LOOP needs to be destroyed. */
+/* Data attached to vertices of partition dependence graph. */
+struct pg_vdata
+{
+ /* ID of the corresponding partition. */
+ int id;
+ /* The partition. */
+ struct partition *partition;
+};
+
+/* Data attached to edges of partition dependence graph. */
+struct pg_edata
+{
+ /* If the dependence edge can be resolved by runtime alias check,
+ this vector contains data dependence relations for runtime alias
+ check. On the other hand, if the dependence edge is introduced
+ because of compilation time known data dependence, this vector
+ contains nothing. */
+ vec<ddr_p> alias_ddrs;
+};
+
+/* Callback data for traversing edges in graph. */
+struct pg_edge_callback_data
+{
+ /* Bitmap contains strong connected components should be merged. */
+ bitmap sccs_to_merge;
+ /* Array constains component information for all vertices. */
+ int *vertices_component;
+ /* Vector to record all data dependence relations which are needed
+ to break strong connected components by runtime alias checks. */
+ vec<ddr_p> *alias_ddrs;
+};
+
+/* Initialize vertice's data for partition dependence graph PG with
+ PARTITIONS. */
+
+static void
+init_partition_graph_vertices (struct graph *pg,
+ vec<struct partition *> *partitions)
+{
+ int i;
+ partition *partition;
+ struct pg_vdata *data;
+
+ for (i = 0; partitions->iterate (i, &partition); ++i)
+ {
+ data = new pg_vdata;
+ pg->vertices[i].data = data;
+ data->id = i;
+ data->partition = partition;
+ }
+}
+
+/* Add edge <I, J> to partition dependence graph PG. Attach vector of data
+ dependence relations to the EDGE if DDRS isn't NULL. */
+
+static void
+add_partition_graph_edge (struct graph *pg, int i, int j, vec<ddr_p> *ddrs)
+{
+ struct graph_edge *e = add_edge (pg, i, j);
+
+ /* If the edge is attached with data dependence relations, it means this
+ dependence edge can be resolved by runtime alias checks. */
+ if (ddrs != NULL)
+ {
+ struct pg_edata *data = new pg_edata;
+
+ gcc_assert (ddrs->length () > 0);
+ e->data = data;
+ data->alias_ddrs = vNULL;
+ data->alias_ddrs.safe_splice (*ddrs);
+ }
+}
+
+/* Callback function for graph travesal algorithm. It returns true
+ if edge E should skipped when traversing the graph. */
+
+static bool
+pg_skip_alias_edge (struct graph_edge *e)
+{
+ struct pg_edata *data = (struct pg_edata *)e->data;
+ return (data != NULL && data->alias_ddrs.length () > 0);
+}
+
+/* Callback function freeing data attached to edge E of graph. */
+
+static void
+free_partition_graph_edata_cb (struct graph *, struct graph_edge *e, void *)
+{
+ if (e->data != NULL)
+ {
+ struct pg_edata *data = (struct pg_edata *)e->data;
+ data->alias_ddrs.release ();
+ delete data;
+ }
+}
+
+/* Free data attached to vertice of partition dependence graph PG. */
+
+static void
+free_partition_graph_vdata (struct graph *pg)
+{
+ int i;
+ struct pg_vdata *data;
+
+ for (i = 0; i < pg->n_vertices; ++i)
+ {
+ data = (struct pg_vdata *)pg->vertices[i].data;
+ delete data;
+ }
+}
+
+/* Build and return partition dependence graph for PARTITIONS. RDG is
+ reduced dependence graph for the loop to be distributed. If IGNORE_ALIAS_P
+ is true, data dependence caused by possible alias between references
+ is ignored, as if it doesn't exist at all; otherwise all depdendences
+ are considered. */
+
+static struct graph *
+build_partition_graph (struct graph *rdg,
+ vec<struct partition *> *partitions,
+ bool ignore_alias_p)
+{
+ int i, j;
+ struct partition *partition1, *partition2;
+ graph *pg = new_graph (partitions->length ());
+ auto_vec<ddr_p> alias_ddrs, *alias_ddrs_p;
+
+ alias_ddrs_p = ignore_alias_p ? NULL : &alias_ddrs;
+
+ init_partition_graph_vertices (pg, partitions);
+
+ for (i = 0; partitions->iterate (i, &partition1); ++i)
+ {
+ for (j = i + 1; partitions->iterate (j, &partition2); ++j)
+ {
+ /* dependence direction - 0 is no dependence, -1 is back,
+ 1 is forth, 2 is both (we can stop then, merging will occur). */
+ int dir = 0;
+
+ /* If the first partition has reduction, add back edge; if the
+ second partition has reduction, add forth edge. This makes
+ sure that reduction partition will be sorted as the last one. */
+ if (partition_reduction_p (partition1))
+ dir = -1;
+ else if (partition_reduction_p (partition2))
+ dir = 1;
+
+ /* Cleanup the temporary vector. */
+ alias_ddrs.truncate (0);
+
+ dir = pg_add_dependence_edges (rdg, dir, partition1->datarefs,
+ partition2->datarefs, alias_ddrs_p);
+
+ /* Add edge to partition graph if there exists dependence. There
+ are two types of edges. One type edge is caused by compilation
+ time known dependence, this type can not be resolved by runtime
+ alias check. The other type can be resolved by runtime alias
+ check. */
+ if (dir == 1 || dir == 2
+ || alias_ddrs.length () > 0)
+ {
+ /* Attach data dependence relations to edge that can be resolved
+ by runtime alias check. */
+ bool alias_edge_p = (dir != 1 && dir != 2);
+ add_partition_graph_edge (pg, i, j,
+ (alias_edge_p) ? &alias_ddrs : NULL);
+ }
+ if (dir == -1 || dir == 2
+ || alias_ddrs.length () > 0)
+ {
+ /* Attach data dependence relations to edge that can be resolved
+ by runtime alias check. */
+ bool alias_edge_p = (dir != -1 && dir != 2);
+ add_partition_graph_edge (pg, j, i,
+ (alias_edge_p) ? &alias_ddrs : NULL);
+ }
+ }
+ }
+ return pg;
+}
+
+/* Sort partitions in PG by post order and store them in PARTITIONS. */
+
+static void
+sort_partitions_by_post_order (struct graph *pg,
+ vec<struct partition *> *partitions)
+{
+ int i;
+ struct pg_vdata *data;
+
+ /* Now order the remaining nodes in postorder. */
+ qsort (pg->vertices, pg->n_vertices, sizeof (vertex), pgcmp);
+ partitions->truncate (0);
+ for (i = 0; i < pg->n_vertices; ++i)
+ {
+ data = (struct pg_vdata *)pg->vertices[i].data;
+ if (data->partition)
+ partitions->safe_push (data->partition);
+ }
+}
+
+/* Given reduced dependence graph RDG merge strong connected components
+ of PARTITIONS. If IGNORE_ALIAS_P is true, data dependence caused by
+ possible alias between references is ignored, as if it doesn't exist
+ at all; otherwise all depdendences are considered. */
+
+static void
+merge_dep_scc_partitions (struct graph *rdg,
+ vec<struct partition *> *partitions,
+ bool ignore_alias_p)
+{
+ struct partition *partition1, *partition2;
+ struct pg_vdata *data;
+ graph *pg = build_partition_graph (rdg, partitions, ignore_alias_p);
+ int i, j, num_sccs = graphds_scc (pg, NULL);
+
+ /* Strong connected compoenent means dependence cycle, we cannot distribute
+ them. So fuse them together. */
+ if ((unsigned) num_sccs < partitions->length ())
+ {
+ for (i = 0; i < num_sccs; ++i)
+ {
+ for (j = 0; partitions->iterate (j, &partition1); ++j)
+ if (pg->vertices[j].component == i)
+ break;
+ for (j = j + 1; partitions->iterate (j, &partition2); ++j)
+ if (pg->vertices[j].component == i)
+ {
+ partition_merge_into (NULL, partition1,
+ partition2, FUSE_SAME_SCC);
+ partition1->type = PTYPE_SEQUENTIAL;
+ (*partitions)[j] = NULL;
+ partition_free (partition2);
+ data = (struct pg_vdata *)pg->vertices[j].data;
+ data->partition = NULL;
+ }
+ }
+ sort_partitions_by_post_order (pg, partitions);
+ }
+ gcc_assert (partitions->length () == (unsigned)num_sccs);
+ free_partition_graph_vdata (pg);
+ free_graph (pg);
+}
+
+/* Callback function for traversing edge E in graph G. DATA is private
+ callback data. */
+
+static void
+pg_collect_alias_ddrs (struct graph *g, struct graph_edge *e, void *data)
+{
+ int i, j, component;
+ struct pg_edge_callback_data *cbdata;
+ struct pg_edata *edata = (struct pg_edata *) e->data;
+
+ /* If the edge doesn't have attached data dependence, it represents
+ compilation time known dependences. This type dependence cannot
+ be resolved by runtime alias check. */
+ if (edata == NULL || edata->alias_ddrs.length () == 0)
+ return;
+
+ cbdata = (struct pg_edge_callback_data *) data;
+ i = e->src;
+ j = e->dest;
+ component = cbdata->vertices_component[i];
+ /* Vertices are topologically sorted according to compilation time
+ known dependences, so we can break strong connected components
+ by removing edges of the opposite direction, i.e, edges pointing
+ from vertice with smaller post number to vertice with bigger post
+ number. */
+ if (g->vertices[i].post < g->vertices[j].post
+ /* We only need to remove edges connecting vertices in the same
+ strong connected component to break it. */
+ && component == cbdata->vertices_component[j]
+ /* Check if we want to break the strong connected component or not. */
+ && !bitmap_bit_p (cbdata->sccs_to_merge, component))
+ cbdata->alias_ddrs->safe_splice (edata->alias_ddrs);
+}
+
+/* This is the main function breaking strong conected components in
+ PARTITIONS giving reduced depdendence graph RDG. Store data dependence
+ relations for runtime alias check in ALIAS_DDRS. */
+
+static void
+break_alias_scc_partitions (struct graph *rdg,
+ vec<struct partition *> *partitions,
+ vec<ddr_p> *alias_ddrs)
+{
+ int i, j, num_sccs, num_sccs_no_alias;
+ /* Build partition dependence graph. */
+ graph *pg = build_partition_graph (rdg, partitions, false);
+
+ alias_ddrs->truncate (0);
+ /* Find strong connected components in the graph, with all dependence edges
+ considered. */
+ num_sccs = graphds_scc (pg, NULL);
+ /* All SCCs now can be broken by runtime alias checks because SCCs caused by
+ compilation time known dependences are merged before this function. */
+ if ((unsigned) num_sccs < partitions->length ())
+ {
+ struct pg_edge_callback_data cbdata;
+ auto_bitmap sccs_to_merge;
+ auto_vec<enum partition_type> scc_types;
+ struct partition *partition, *first;
+
+ /* If all paritions in a SCC has the same type, we can simply merge the
+ SCC. This loop finds out such SCCS and record them in bitmap. */
+ bitmap_set_range (sccs_to_merge, 0, (unsigned) num_sccs);
+ for (i = 0; i < num_sccs; ++i)
+ {
+ for (j = 0; partitions->iterate (j, &first); ++j)
+ if (pg->vertices[j].component == i)
+ break;
+ for (++j; partitions->iterate (j, &partition); ++j)
+ {
+ if (pg->vertices[j].component != i)
+ continue;
+
+ if (first->type != partition->type)
+ {
+ bitmap_clear_bit (sccs_to_merge, i);
+ break;
+ }
+ }
+ }
+
+ /* Initialize callback data for traversing. */
+ cbdata.sccs_to_merge = sccs_to_merge;
+ cbdata.alias_ddrs = alias_ddrs;
+ cbdata.vertices_component = XNEWVEC (int, pg->n_vertices);
+ /* Record the component information which will be corrupted by next
+ graph scc finding call. */
+ for (i = 0; i < pg->n_vertices; ++i)
+ cbdata.vertices_component[i] = pg->vertices[i].component;
+
+ /* Collect data dependences for runtime alias checks to break SCCs. */
+ if (bitmap_count_bits (sccs_to_merge) != (unsigned) num_sccs)
+ {
+ /* Run SCC finding algorithm again, with alias dependence edges
+ skipped. This is to topologically sort paritions according to
+ compilation time known dependence. Note the topological order
+ is stored in the form of pg's post order number. */
+ num_sccs_no_alias = graphds_scc (pg, NULL, pg_skip_alias_edge);
+ gcc_assert (partitions->length () == (unsigned) num_sccs_no_alias);
+ /* With topological order, we can construct two subgraphs L and R.
+ L contains edge <x, y> where x < y in terms of post order, while
+ R contains edge <x, y> where x > y. Edges for compilation time
+ known dependence all fall in R, so we break SCCs by removing all
+ (alias) edges of in subgraph L. */
+ for_each_edge (pg, pg_collect_alias_ddrs, &cbdata);
+ }
+
+ /* For SCC that doesn't need to be broken, merge it. */
+ for (i = 0; i < num_sccs; ++i)
+ {
+ if (!bitmap_bit_p (sccs_to_merge, i))
+ continue;
+
+ for (j = 0; partitions->iterate (j, &first); ++j)
+ if (cbdata.vertices_component[j] == i)
+ break;
+ for (++j; partitions->iterate (j, &partition); ++j)
+ {
+ struct pg_vdata *data;
+
+ if (cbdata.vertices_component[j] != i)
+ continue;
+
+ partition_merge_into (NULL, first, partition, FUSE_SAME_SCC);
+ (*partitions)[j] = NULL;
+ partition_free (partition);
+ data = (struct pg_vdata *)pg->vertices[j].data;
+ gcc_assert (data->id == j);
+ data->partition = NULL;
+ }
+ }
+ }
+
+ sort_partitions_by_post_order (pg, partitions);
+ free_partition_graph_vdata (pg);
+ for_each_edge (pg, free_partition_graph_edata_cb, NULL);
+ free_graph (pg);
+
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ {
+ fprintf (dump_file, "Possible alias data dependence to break:\n");
+ dump_data_dependence_relations (dump_file, *alias_ddrs);
+ }
+}
+
+/* Compute and return an expression whose value is the segment length which
+ will be accessed by DR in NITERS iterations. */
+
+static tree
+data_ref_segment_size (struct data_reference *dr, tree niters)
+{
+ tree segment_length;
+
+ if (integer_zerop (DR_STEP (dr)))
+ segment_length = TYPE_SIZE_UNIT (TREE_TYPE (DR_REF (dr)));
+ else
+ segment_length = size_binop (MULT_EXPR,
+ fold_convert (sizetype, DR_STEP (dr)),
+ fold_convert (sizetype, niters));
+
+ return segment_length;
+}
+
+/* Return true if LOOP's latch is dominated by statement for data reference
+ DR. */
+
+static inline bool
+latch_dominated_by_data_ref (struct loop *loop, data_reference *dr)
+{
+ return dominated_by_p (CDI_DOMINATORS, single_exit (loop)->src,
+ gimple_bb (DR_STMT (dr)));
+}
+
+/* Compute alias check pairs and store them in COMP_ALIAS_PAIRS for LOOP's
+ data dependence relations ALIAS_DDRS. */
+
+static void
+compute_alias_check_pairs (struct loop *loop, vec<ddr_p> *alias_ddrs,
+ vec<dr_with_seg_len_pair_t> *comp_alias_pairs)
+{
+ unsigned int i;
+ unsigned HOST_WIDE_INT factor = 1;
+ tree niters_plus_one, niters = number_of_latch_executions (loop);
+
+ gcc_assert (niters != NULL_TREE && niters != chrec_dont_know);
+ niters = fold_convert (sizetype, niters);
+ niters_plus_one = size_binop (PLUS_EXPR, niters, size_one_node);
+
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file, "Creating alias check pairs:\n");
+
+ /* Iterate all data dependence relations and compute alias check pairs. */
+ for (i = 0; i < alias_ddrs->length (); i++)
+ {
+ ddr_p ddr = (*alias_ddrs)[i];
+ struct data_reference *dr_a = DDR_A (ddr);
+ struct data_reference *dr_b = DDR_B (ddr);
+ tree seg_length_a, seg_length_b;
+ int comp_res = data_ref_compare_tree (DR_BASE_ADDRESS (dr_a),
+ DR_BASE_ADDRESS (dr_b));
+
+ if (comp_res == 0)
+ comp_res = data_ref_compare_tree (DR_OFFSET (dr_a), DR_OFFSET (dr_b));
+ gcc_assert (comp_res != 0);
+
+ if (latch_dominated_by_data_ref (loop, dr_a))
+ seg_length_a = data_ref_segment_size (dr_a, niters_plus_one);
+ else
+ seg_length_a = data_ref_segment_size (dr_a, niters);
+
+ if (latch_dominated_by_data_ref (loop, dr_b))
+ seg_length_b = data_ref_segment_size (dr_b, niters_plus_one);
+ else
+ seg_length_b = data_ref_segment_size (dr_b, niters);
+
+ dr_with_seg_len_pair_t dr_with_seg_len_pair
+ (dr_with_seg_len (dr_a, seg_length_a),
+ dr_with_seg_len (dr_b, seg_length_b));
+
+ /* Canonicalize pairs by sorting the two DR members. */
+ if (comp_res > 0)
+ std::swap (dr_with_seg_len_pair.first, dr_with_seg_len_pair.second);
+
+ comp_alias_pairs->safe_push (dr_with_seg_len_pair);
+ }
+
+ if (tree_fits_uhwi_p (niters))
+ factor = tree_to_uhwi (niters);
+
+ /* Prune alias check pairs. */
+ prune_runtime_alias_test_list (comp_alias_pairs, factor);
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file,
+ "Improved number of alias checks from %d to %d\n",
+ alias_ddrs->length (), comp_alias_pairs->length ());
+}
+
+/* Given data dependence relations in ALIAS_DDRS, generate runtime alias
+ checks and version LOOP under condition of these runtime alias checks. */
+
+static void
+version_loop_by_alias_check (struct loop *loop, vec<ddr_p> *alias_ddrs)
+{
+ unsigned then_prob, else_prob, then_scale, else_scale;
+ basic_block cond_bb;
+ struct loop *nloop;
+ tree lhs, arg0, cond_expr = NULL_TREE;
+ gimple_seq cond_stmts = NULL;
+ gimple *stmt;
+ auto_vec<dr_with_seg_len_pair_t> comp_alias_pairs;
+
+ /* Generate code for runtime alias checks if necessary. */
+ if (alias_ddrs->length () > 0)
+ {
+ compute_alias_check_pairs (loop, alias_ddrs, &comp_alias_pairs);
+ create_runtime_alias_checks (loop, &comp_alias_pairs, &cond_expr);
+ cond_expr = force_gimple_operand_1 (cond_expr, &cond_stmts,
+ is_gimple_condexpr, NULL_TREE);
+ then_prob = 9 * REG_BR_PROB_BASE / 10;
+ else_prob = REG_BR_PROB_BASE - then_prob;
+ then_scale = 9 * REG_BR_PROB_BASE / 10;
+ else_scale = REG_BR_PROB_BASE - then_scale;
+ }
+ else
+ {
+ cond_expr = boolean_true_node;
+ /* Use REG_BR_PROB_BASE for both branches since only one branch
+ will be kept in the end. */
+ then_prob = REG_BR_PROB_BASE;
+ else_prob = REG_BR_PROB_BASE;
+ then_scale = REG_BR_PROB_BASE;
+ else_scale = REG_BR_PROB_BASE;
+ }
+
+ /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS. */
+ if (flag_tree_loop_vectorize)
+ {
+ /* Generate internal function call for loop distribution alias check. */
+ arg0 = build_int_cst (integer_type_node, loop->num);
+ stmt = gimple_build_call_internal (IFN_LOOP_DIST_ALIAS,
+ 2, arg0, cond_expr);
+ lhs = make_ssa_name (boolean_type_node);
+ gimple_call_set_lhs (stmt, lhs);
+ gimple_seq_add_stmt (&cond_stmts, stmt);
+ }
+ else
+ lhs = cond_expr;
+
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file,
+ "Version loop <%d> with runtime alias check\n", loop->num);
+
+ initialize_original_copy_tables ();
+ nloop = loop_version (loop, lhs, &cond_bb, then_prob,
+ else_prob, then_scale, else_scale, true);
+ free_original_copy_tables ();
+ /* Record the original loop number in newly generated loops. */
+ nloop->ldist_alias_id = loop->num;
+ nloop->dont_vectorize = true;
+ nloop->force_vectorize = false;
+
+ if (cond_stmts)
+ {
+ gimple_stmt_iterator cond_gsi = gsi_last_bb (cond_bb);
+ gsi_insert_seq_before (&cond_gsi, cond_stmts, GSI_SAME_STMT);
+ }
+ update_ssa (TODO_update_ssa);
+}
+
+/* Return true if loop versioning is needed to distrubute PARTITIONS.
+ ALIAS_DDRS are data dependence relations for runtime alias check. */
+
+static inline bool
+version_for_distribution_p (vec<struct partition *> *partitions,
+ vec<ddr_p> *alias_ddrs)
+{
+ unsigned i;
+ struct partition *partition;
+
+ /* No need to version loop if we have only one partition. */
+ if (partitions->length () == 1)
+ return false;
+
+ /* Need to version loop if runtime alias check is necessary. */
+ if (alias_ddrs->length () > 0)
+ return true;
+
+ /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
+ vectorizer is not enable because no other pass can fold it. */
+ if (!flag_tree_loop_vectorize)
+ return false;
+
+ /* Don't version loop if any partition is builtin. */
+ for (i = 0; partitions->iterate (i, &partition); ++i)
+ {
+ if (partition->kind != PKIND_NORMAL)
+ break;
+ }
+ return (i == partitions->length ());
+}
+
+/* Fuse all partitions if necessary before finalizing distribution. */
+
+static void
+finalize_partitions (vec<struct partition *> *partitions,
+ vec<ddr_p> *alias_ddrs)
+{
+ unsigned i;
+ struct partition *a, *partition;
+
+ if (partitions->length () == 1
+ || alias_ddrs->length () > 0)
+ return;
+
+ a = (*partitions)[0];
+ if (a->kind != PKIND_NORMAL)
+ return;
+
+ for (i = 1; partitions->iterate (i, &partition); ++i)
+ {
+ /* Don't fuse if partition has different type or it is a builtin. */
+ if (partition->type != a->type
+ || partition->kind != PKIND_NORMAL)
+ return;
+ }
+
+ /* Fuse all partitions. */
+ for (i = 1; partitions->iterate (i, &partition); ++i)
+ {
+ partition_merge_into (NULL, a, partition, FUSE_FINALIZE);
+ partition_free (partition);
+ }
+ partitions->truncate (1);
+}
+
+/* Distributes the code from LOOP in such a way that producer statements
+ are placed before consumer statements. Tries to separate only the
+ statements from STMTS into separate loops. Returns the number of
+ distributed loops. Set NB_CALLS to number of generated builtin calls.
+ Set *DESTROY_P to whether LOOP needs to be destroyed. */
static int
distribute_loop (struct loop *loop, vec<gimple *> stmts,
@@ -1698,8 +2390,6 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
partition *partition;
bool any_builtin;
int i, nbp;
- graph *pg = NULL;
- int num_sccs = 1;
*destroy_p = false;
*nb_calls = 0;
@@ -1747,6 +2437,11 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
auto_vec<struct partition *, 3> partitions;
rdg_build_partitions (rdg, stmts, &partitions);
+ /* Can't do runtime alias check if loop niter is unknown. */
+ tree niters = number_of_latch_executions (loop);
+ bool rt_alias_check_p = (niters != NULL_TREE && niters != chrec_dont_know);
+ auto_vec<ddr_p> alias_ddrs;
+
auto_bitmap stmt_in_all_partitions;
bitmap_copy (stmt_in_all_partitions, partitions[0]->stmts);
for (i = 1; partitions.iterate (i, &partition); ++i)
@@ -1833,81 +2528,14 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
/* Build the partition dependency graph. */
if (partitions.length () > 1)
{
- pg = new_graph (partitions.length ());
- struct pgdata {
- struct partition *partition;
- };
-#define PGDATA(i) ((pgdata *)(pg->vertices[i].data))
- for (i = 0; partitions.iterate (i, &partition); ++i)
- {
- vertex *v = &pg->vertices[i];
- pgdata *data = new pgdata;
- /* FIXME - leaks. */
- v->data = data;
- data->partition = partition;
- }
- struct partition *partition1, *partition2;
- for (i = 0; partitions.iterate (i, &partition1); ++i)
- for (int j = i + 1; partitions.iterate (j, &partition2); ++j)
- {
- /* dependence direction - 0 is no dependence, -1 is back,
- 1 is forth, 2 is both (we can stop then, merging will occur). */
- int dir = pg_add_dependence_edges (rdg, 0,
- partition1->datarefs,
- partition2->datarefs);
- if (dir == 1 || dir == 2)
- add_edge (pg, i, j);
- if (dir == -1 || dir == 2)
- add_edge (pg, j, i);
- }
-
- /* Add edges to the reduction partition (if any) to force it last. */
- unsigned j;
- for (j = 0; partitions.iterate (j, &partition); ++j)
- if (partition_reduction_p (partition))
- break;
- if (j < partitions.length ())
- {
- for (unsigned i = 0; partitions.iterate (i, &partition); ++i)
- if (i != j)
- add_edge (pg, i, j);
- }
-
- /* Compute partitions we cannot separate and fuse them. */
- num_sccs = graphds_scc (pg, NULL);
- for (i = 0; i < num_sccs; ++i)
- {
- struct partition *first;
- int j;
- for (j = 0; partitions.iterate (j, &first); ++j)
- if (pg->vertices[j].component == i)
- break;
- for (j = j + 1; partitions.iterate (j, &partition); ++j)
- if (pg->vertices[j].component == i)
- {
- partition_merge_into (NULL, first,
- partition, FUSE_SAME_SCC);
- first->type = PTYPE_SEQUENTIAL;
- partitions[j] = NULL;
- partition_free (partition);
- PGDATA (j)->partition = NULL;
- }
- }
-
- /* Now order the remaining nodes in postorder. */
- qsort (pg->vertices, pg->n_vertices, sizeof (vertex), pgcmp);
- partitions.truncate (0);
- for (i = 0; i < pg->n_vertices; ++i)
- {
- pgdata *data = PGDATA (i);
- if (data->partition)
- partitions.safe_push (data->partition);
- delete data;
- }
- gcc_assert (partitions.length () == (unsigned)num_sccs);
- free_graph (pg);
+ merge_dep_scc_partitions (rdg, &partitions, rt_alias_check_p);
+ alias_ddrs.truncate (0);
+ if (rt_alias_check_p && partitions.length () > 1)
+ break_alias_scc_partitions (rdg, &partitions, &alias_ddrs);
}
+ finalize_partitions (&partitions, &alias_ddrs);
+
nbp = partitions.length ();
if (nbp == 0
|| (nbp == 1 && !partition_builtin_p (partitions[0]))
@@ -1917,8 +2545,15 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
goto ldist_done;
}
+ if (version_for_distribution_p (&partitions, &alias_ddrs))
+ version_loop_by_alias_check (loop, &alias_ddrs);
+
if (dump_file && (dump_flags & TDF_DETAILS))
- dump_rdg_partitions (dump_file, partitions);
+ {
+ fprintf (dump_file,
+ "distribute loop <%d> into partitions:\n", loop->num);
+ dump_rdg_partitions (dump_file, partitions);
+ }
FOR_EACH_VEC_ELT (partitions, i, partition)
{
--
1.9.1
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check
2017-06-23 10:30 ` Bin.Cheng
@ 2017-06-27 12:44 ` Richard Biener
2017-06-27 14:07 ` Bin.Cheng
0 siblings, 1 reply; 16+ messages in thread
From: Richard Biener @ 2017-06-27 12:44 UTC (permalink / raw)
To: Bin.Cheng; +Cc: gcc-patches, Richard Guenther
On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>> Hi,
>>> This is the main patch rewriting loop distribution in order to handle hmmer.
>>> It improves loop distribution by versioning loop under runtime alias check conditions.
>>> As described in comments, the patch basically implements distribution in the following
>>> steps:
>>>
>>> 1) Seed partitions with specific type statements. For now we support
>>> two types seed statements: statement defining variable used outside
>>> of loop; statement storing to memory.
>>> 2) Build reduced dependence graph (RDG) for loop to be distributed.
>>> The vertices (RDG:V) model all statements in the loop and the edges
>>> (RDG:E) model flow and control dependencies between statements.
>>> 3) Apart from RDG, compute data dependencies between memory references.
>>> 4) Starting from seed statement, build up partition by adding depended
>>> statements according to RDG's dependence information. Partition is
>>> classified as parallel type if it can be executed paralleled; or as
>>> sequential type if it can't. Parallel type partition is further
>>> classified as different builtin kinds if it can be implemented as
>>> builtin function calls.
>>> 5) Build partition dependence graph (PG) based on data dependencies.
>>> The vertices (PG:V) model all partitions and the edges (PG:E) model
>>> all data dependencies between every partitions pair. In general,
>>> data dependence is either compilation time known or unknown. In C
>>> family languages, there exists quite amount compilation time unknown
>>> dependencies because of possible alias relation of data references.
>>> We categorize PG's edge to two types: "true" edge that represents
>>> compilation time known data dependencies; "alias" edge for all other
>>> data dependencies.
>>> 6) Traverse subgraph of PG as if all "alias" edges don't exist. Merge
>>> partitions in each strong connected component (SCC) correspondingly.
>>> Build new PG for merged partitions.
>>> 7) Traverse PG again and this time with both "true" and "alias" edges
>>> included. We try to break SCCs by removing some edges. Because
>>> SCCs by "true" edges are all fused in step 6), we can break SCCs
>>> by removing some "alias" edges. It's NP-hard to choose optimal
>>> edge set, fortunately simple approximation is good enough for us
>>> given the small problem scale.
>>> 8) Collect all data dependencies of the removed "alias" edges. Create
>>> runtime alias checks for collected data dependencies.
>>> 9) Version loop under the condition of runtime alias checks. Given
>>> loop distribution generally introduces additional overhead, it is
>>> only useful if vectorization is achieved in distributed loop. We
>>> version loop with internal function call IFN_LOOP_DIST_ALIAS. If
>>> no distributed loop can be vectorized, we simply remove distributed
>>> loops and recover to the original one.
>>>
>>> Also, there are some more to improve in the future (which isn't difficult I think):
>>> TODO:
>>> 1) We only distribute innermost loops now. This pass should handle loop
>>> nests in the future.
>>> 2) We only fuse partitions in SCC now. A better fusion algorithm is
>>> desired to minimize loop overhead, maximize parallelism and maximize
>>>
>>> Bootstrap and test on x86_64 and AArch64. Is it OK?
>>>
>> Trivial updated due to changes in previous patches. Also fixed issues
>> mentioned by Kugan.
> Rebased V3 for changes in previous patches. Bootstap and test on
> x86_64 and aarch64.
why is ldist-12.c no longer distributed? your comment says it doesn't expose
more "parallelism" but the point is to reduce memory bandwith requirements
which it clearly does.
Likewise for -13.c, -14.c. -4.c may be a questionable case but the wording
of "parallelism" still confuses me.
Can you elaborate on that. Now onto the patch:
+ Loop distribution is the dual of loop fusion. It separates statements
+ of a loop (or loop nest) into multiple loops (or loop nests) with the
+ same loop header. The major goal is to separate statements which may
+ be vectorized from those that can't. This pass implements distribution
+ in the following steps:
misses the goal of being a memory stream optimization, not only a vectorization
enabler. distributing a loop can also reduce register pressure.
You introduce ldist_alias_id in struct loop (probably in 01/n which I
didn't look
into yet). If you don't use that please introduce it separately.
+ /* Be conservative. If data references are not well analyzed,
+ or the two data references have the same base address and
+ offset, add dependence and consider it alias to each other.
+ In other words, the dependence can not be resolved by
+ runtime alias check. */
+ if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
+ || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
+ || !DR_INIT (dr1) || !DR_INIT (dr2)
+ || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
+ || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
+ || res == 0)
ISTR a helper that computes whether we can handle a runtime alias check for
a specific case?
+ /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS. */
+ if (flag_tree_loop_vectorize)
+ {
so at this point I'd condition the whole runtime-alias check generating
on flag_tree_loop_vectorize. You seem to support versioning w/o
that here but in other places disable versioning w/o flag_tree_loop_vectorize.
That looks somewhat inconsistent...
+ /* Don't version loop if any partition is builtin. */
+ for (i = 0; partitions->iterate (i, &partition); ++i)
+ {
+ if (partition->kind != PKIND_NORMAL)
+ break;
+ }
why's that? Do you handle the case where only a subset of partitions
require a runtime alias check to be generated? Thus from a loop
generate three loops, only two of them being versioned? The
complication would be that both the runtime alias and non-alias
versions would be "transformed". Or do we rely on recursively
distributing the distribution result, thus if we have partitions that
can be handled w/o runtime alias check fuse the remaining partitions
and recurse on those?
Otherwise the patch looks good to me.
Thanks,
Richard.
> Thanks,
> bin
>>
>> Thanks,
>> bin
>>
>> 2017-06-17 Bin Cheng <bin.cheng@arm.com>
>>
>> * tree-loop-distribution.c: Add general explanantion on the pass.
>> (pg_add_dependence_edges): New parameter. handle alias data
>> dependence specially and record it in the parameter if asked.
>> (struct pg_vdata, pg_edata, pg_edge_callback_data): New structs.
>> (init_partition_graph_vertices, add_partition_graph_edge): New.
>> (pg_skip_alias_edge, free_partition_graph_edata_cb): New.
>> (free_partition_graph_vdata, build_partition_graph): New.
>> (sort_partitions_by_post_order, merge_dep_scc_partitions): New.
>> (pg_collect_alias_ddrs, break_alias_scc_partitions): New.
>> (data_ref_segment_size, latch_dominated_by_data_ref): New.
>> (compute_alias_check_pairs, version_loop_by_alias_check): New.
>> (version_for_distribution_p, finalize_partitions): New.
>> (distribute_loop): Handle alias data dependence specially. Factor
>> out loop fusion code as functions and call these functions.
>>
>> gcc/testsuite/ChangeLog
>> 2017-06-17 Bin Cheng <bin.cheng@arm.com>
>>
>> * gcc.dg/tree-ssa/ldist-4.c: Adjust test string.
>> * gcc.dg/tree-ssa/ldist-12.c: Ditto.
>> * gcc.dg/tree-ssa/ldist-13.c: Ditto.
>> * gcc.dg/tree-ssa/ldist-14.c: Ditto.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check
2017-06-27 12:44 ` Richard Biener
@ 2017-06-27 14:07 ` Bin.Cheng
2017-06-28 10:58 ` Richard Biener
0 siblings, 1 reply; 16+ messages in thread
From: Bin.Cheng @ 2017-06-27 14:07 UTC (permalink / raw)
To: Richard Biener; +Cc: gcc-patches
On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>> Hi,
>> Rebased V3 for changes in previous patches. Bootstap and test on
>> x86_64 and aarch64.
>
> why is ldist-12.c no longer distributed? your comment says it doesn't expose
> more "parallelism" but the point is to reduce memory bandwith requirements
> which it clearly does.
>
> Likewise for -13.c, -14.c. -4.c may be a questionable case but the wording
> of "parallelism" still confuses me.
>
> Can you elaborate on that. Now onto the patch:
Given we don't model data locality or memory bandwidth, whether
distribution enables loops that can be executed paralleled becomes the
major criteria for distribution. BTW, I think a good memory stream
optimization model shouldn't consider small loops as in ldist-12.c,
etc., appropriate for distribution.
>
> + Loop distribution is the dual of loop fusion. It separates statements
> + of a loop (or loop nest) into multiple loops (or loop nests) with the
> + same loop header. The major goal is to separate statements which may
> + be vectorized from those that can't. This pass implements distribution
> + in the following steps:
>
> misses the goal of being a memory stream optimization, not only a vectorization
> enabler. distributing a loop can also reduce register pressure.
I will revise the comment, but as explained, enabling more
vectorization is the major criteria for distribution to some extend
now.
>
> You introduce ldist_alias_id in struct loop (probably in 01/n which I
> didn't look
> into yet). If you don't use that please introduce it separately.
Hmm, yes it is introduced in patch [01/n] and set in this patch.
>
> + /* Be conservative. If data references are not well analyzed,
> + or the two data references have the same base address and
> + offset, add dependence and consider it alias to each other.
> + In other words, the dependence can not be resolved by
> + runtime alias check. */
> + if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
> + || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
> + || !DR_INIT (dr1) || !DR_INIT (dr2)
> + || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
> + || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
> + || res == 0)
>
> ISTR a helper that computes whether we can handle a runtime alias check for
> a specific case?
I guess you mean runtime_alias_check_p that I factored out previously?
Unfortunately, it's factored out vectorizer's usage and doesn't fit
here straightforwardly. Shall I try to further generalize the
interface as independence patch to this one?
>
> + /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS. */
> + if (flag_tree_loop_vectorize)
> + {
>
> so at this point I'd condition the whole runtime-alias check generating
> on flag_tree_loop_vectorize. You seem to support versioning w/o
> that here but in other places disable versioning w/o flag_tree_loop_vectorize.
> That looks somewhat inconsistent...
It is a bit complicated. In function version_for_distribution_p, we have
+
+ /* Need to version loop if runtime alias check is necessary. */
+ if (alias_ddrs->length () > 0)
+ return true;
+
+ /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
+ vectorizer is not enable because no other pass can fold it. */
+ if (!flag_tree_loop_vectorize)
+ return false;
+
It means we also versioning loops even if runtime alias check is
unnecessary. I did this because we lack cost model and current
distribution may result in too many distribution? If that's the case,
at least vectorizer will remove distributed version loop and fall back
to the original one. Hmm, shall I change it into below code:
+
+ /* Need to version loop if runtime alias check is necessary. */
+ if (alias_ddrs->length () == 0)
+ return false;
+
+ /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
+ vectorizer is not enable because no other pass can fold it. */
+ if (!flag_tree_loop_vectorize)
+ return false;
+
Then I can remove the check you mentioned in function
version_loop_by_alias_check?
>
> + /* Don't version loop if any partition is builtin. */
> + for (i = 0; partitions->iterate (i, &partition); ++i)
> + {
> + if (partition->kind != PKIND_NORMAL)
> + break;
> + }
>
> why's that? Do you handle the case where only a subset of partitions
One reason is I generally consider distributed builtin functions as a
win, thus distribution won't be canceled later in vectorizer. Another
reason is if all distributed loops are recognized as builtins, we
can't really version with current logic because the
IFN_LOOP_DIST_ALIAS won't be folded in vectorizer.
> require a runtime alias check to be generated? Thus from a loop
> generate three loops, only two of them being versioned? The
> complication would be that both the runtime alias and non-alias
> versions would be "transformed". Or do we rely on recursively
> distributing the distribution result, thus if we have partitions that
> can be handled w/o runtime alias check fuse the remaining partitions
> and recurse on those?
No, this is not precisely handled now, the pass will version the whole
loop once. Though I think it's not very difficult to do two stages
distribution, I am not sure how useful it is.
Thanks,
bin
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check
2017-06-27 14:07 ` Bin.Cheng
@ 2017-06-28 10:58 ` Richard Biener
2017-06-28 11:46 ` Bin.Cheng
0 siblings, 1 reply; 16+ messages in thread
From: Richard Biener @ 2017-06-28 10:58 UTC (permalink / raw)
To: Bin.Cheng; +Cc: gcc-patches
On Tue, Jun 27, 2017 at 4:07 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>>> Hi,
>>> Rebased V3 for changes in previous patches. Bootstap and test on
>>> x86_64 and aarch64.
>>
>> why is ldist-12.c no longer distributed? your comment says it doesn't expose
>> more "parallelism" but the point is to reduce memory bandwith requirements
>> which it clearly does.
>>
>> Likewise for -13.c, -14.c. -4.c may be a questionable case but the wording
>> of "parallelism" still confuses me.
>>
>> Can you elaborate on that. Now onto the patch:
> Given we don't model data locality or memory bandwidth, whether
> distribution enables loops that can be executed paralleled becomes the
> major criteria for distribution. BTW, I think a good memory stream
> optimization model shouldn't consider small loops as in ldist-12.c,
> etc., appropriate for distribution.
True. But what means "parallel" here? ldist-13.c if partitioned into two loops
can be executed "in parallel"
>>
>> + Loop distribution is the dual of loop fusion. It separates statements
>> + of a loop (or loop nest) into multiple loops (or loop nests) with the
>> + same loop header. The major goal is to separate statements which may
>> + be vectorized from those that can't. This pass implements distribution
>> + in the following steps:
>>
>> misses the goal of being a memory stream optimization, not only a vectorization
>> enabler. distributing a loop can also reduce register pressure.
> I will revise the comment, but as explained, enabling more
> vectorization is the major criteria for distribution to some extend
> now.
Yes, I agree -- originally it was written to optimize the stream benchmark IIRC.
>>
>> You introduce ldist_alias_id in struct loop (probably in 01/n which I
>> didn't look
>> into yet). If you don't use that please introduce it separately.
> Hmm, yes it is introduced in patch [01/n] and set in this patch.
>
>>
>> + /* Be conservative. If data references are not well analyzed,
>> + or the two data references have the same base address and
>> + offset, add dependence and consider it alias to each other.
>> + In other words, the dependence can not be resolved by
>> + runtime alias check. */
>> + if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
>> + || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
>> + || !DR_INIT (dr1) || !DR_INIT (dr2)
>> + || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
>> + || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
>> + || res == 0)
>>
>> ISTR a helper that computes whether we can handle a runtime alias check for
>> a specific case?
> I guess you mean runtime_alias_check_p that I factored out previously?
> Unfortunately, it's factored out vectorizer's usage and doesn't fit
> here straightforwardly. Shall I try to further generalize the
> interface as independence patch to this one?
That would be nice.
>>
>> + /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS. */
>> + if (flag_tree_loop_vectorize)
>> + {
>>
>> so at this point I'd condition the whole runtime-alias check generating
>> on flag_tree_loop_vectorize. You seem to support versioning w/o
>> that here but in other places disable versioning w/o flag_tree_loop_vectorize.
>> That looks somewhat inconsistent...
> It is a bit complicated. In function version_for_distribution_p, we have
> +
> + /* Need to version loop if runtime alias check is necessary. */
> + if (alias_ddrs->length () > 0)
> + return true;
> +
> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
> + vectorizer is not enable because no other pass can fold it. */
> + if (!flag_tree_loop_vectorize)
> + return false;
> +
>
> It means we also versioning loops even if runtime alias check is
> unnecessary. I did this because we lack cost model and current
> distribution may result in too many distribution? If that's the case,
> at least vectorizer will remove distributed version loop and fall back
> to the original one. Hmm, shall I change it into below code:
Hmm, ok - that really ties things to vectorization apart from the
builtin recognition. So what happens if the vectorizer can vectorize
both the distributed and non-distributed loops?
> +
> + /* Need to version loop if runtime alias check is necessary. */
> + if (alias_ddrs->length () == 0)
> + return false;
> +
> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
> + vectorizer is not enable because no other pass can fold it. */
> + if (!flag_tree_loop_vectorize)
> + return false;
> +
>
> Then I can remove the check you mentioned in function
> version_loop_by_alias_check?
Yeah, I guess that would be easier to understand. Need to update
the comment above the alias_ddrs check though.
>>
>> + /* Don't version loop if any partition is builtin. */
>> + for (i = 0; partitions->iterate (i, &partition); ++i)
>> + {
>> + if (partition->kind != PKIND_NORMAL)
>> + break;
>> + }
>>
>> why's that? Do you handle the case where only a subset of partitions
> One reason is I generally consider distributed builtin functions as a
> win, thus distribution won't be canceled later in vectorizer. Another
> reason is if all distributed loops are recognized as builtins, we
> can't really version with current logic because the
> IFN_LOOP_DIST_ALIAS won't be folded in vectorizer.
Ah, ok. I guess the maze was too twisted for me to see what
version_for_distribution_p
does ;)
>> require a runtime alias check to be generated? Thus from a loop
>> generate three loops, only two of them being versioned? The
>> complication would be that both the runtime alias and non-alias
>> versions would be "transformed". Or do we rely on recursively
>> distributing the distribution result, thus if we have partitions that
>> can be handled w/o runtime alias check fuse the remaining partitions
>> and recurse on those?
> No, this is not precisely handled now, the pass will version the whole
> loop once. Though I think it's not very difficult to do two stages
> distribution, I am not sure how useful it is.
Not sure either.
So to understand you're looking at loop-distribution as vectorization enabler
and pattern detector. I think that is reasonable without a better cost model.
Note that I think the state before your patches had the sensible cost-model
that it fused very conservatively and just produced the maximum distribution
with the idea that the looping overhead itself is cheap. Note that with
a more "maximum" distribution the vectorizer also gets the chance to
do "partial vectorization" in case profitability is different. Of course the
setup cost may offset that in the case all loops end up vectorized...
So in reality the vectorizer itself should "distribute away"
non-vectorizable parts
of the loop ... :/
Thanks,
Richard.
> Thanks,
> bin
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check
2017-06-28 10:58 ` Richard Biener
@ 2017-06-28 11:46 ` Bin.Cheng
2017-06-28 12:29 ` Richard Biener
0 siblings, 1 reply; 16+ messages in thread
From: Bin.Cheng @ 2017-06-28 11:46 UTC (permalink / raw)
To: Richard Biener; +Cc: gcc-patches
On Wed, Jun 28, 2017 at 11:58 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Tue, Jun 27, 2017 at 4:07 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>>>> Hi,
>>>> Rebased V3 for changes in previous patches. Bootstap and test on
>>>> x86_64 and aarch64.
>>>
>>> why is ldist-12.c no longer distributed? your comment says it doesn't expose
>>> more "parallelism" but the point is to reduce memory bandwith requirements
>>> which it clearly does.
>>>
>>> Likewise for -13.c, -14.c. -4.c may be a questionable case but the wording
>>> of "parallelism" still confuses me.
>>>
>>> Can you elaborate on that. Now onto the patch:
>> Given we don't model data locality or memory bandwidth, whether
>> distribution enables loops that can be executed paralleled becomes the
>> major criteria for distribution. BTW, I think a good memory stream
>> optimization model shouldn't consider small loops as in ldist-12.c,
>> etc., appropriate for distribution.
>
> True. But what means "parallel" here? ldist-13.c if partitioned into two loops
> can be executed "in parallel"
So if a loop by itself can be vectorized (or so called can be executed
paralleled), we tend to no distribute it into small ones. But there
is one exception here, if the distributed small loops are recognized
as builtin functions, we still distribute it. I assume it's generally
better to call builtin memory functions than vectorize it by GCC?
>
>>>
>>> + Loop distribution is the dual of loop fusion. It separates statements
>>> + of a loop (or loop nest) into multiple loops (or loop nests) with the
>>> + same loop header. The major goal is to separate statements which may
>>> + be vectorized from those that can't. This pass implements distribution
>>> + in the following steps:
>>>
>>> misses the goal of being a memory stream optimization, not only a vectorization
>>> enabler. distributing a loop can also reduce register pressure.
>> I will revise the comment, but as explained, enabling more
>> vectorization is the major criteria for distribution to some extend
>> now.
>
> Yes, I agree -- originally it was written to optimize the stream benchmark IIRC.
Let's see if any performance drop will be reported against this patch.
Let's see if we can create a cost model for it.
>
>>>
>>> You introduce ldist_alias_id in struct loop (probably in 01/n which I
>>> didn't look
>>> into yet). If you don't use that please introduce it separately.
>> Hmm, yes it is introduced in patch [01/n] and set in this patch.
>>
>>>
>>> + /* Be conservative. If data references are not well analyzed,
>>> + or the two data references have the same base address and
>>> + offset, add dependence and consider it alias to each other.
>>> + In other words, the dependence can not be resolved by
>>> + runtime alias check. */
>>> + if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
>>> + || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
>>> + || !DR_INIT (dr1) || !DR_INIT (dr2)
>>> + || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
>>> + || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
>>> + || res == 0)
>>>
>>> ISTR a helper that computes whether we can handle a runtime alias check for
>>> a specific case?
>> I guess you mean runtime_alias_check_p that I factored out previously?
>> Unfortunately, it's factored out vectorizer's usage and doesn't fit
>> here straightforwardly. Shall I try to further generalize the
>> interface as independence patch to this one?
>
> That would be nice.
>
>>>
>>> + /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS. */
>>> + if (flag_tree_loop_vectorize)
>>> + {
>>>
>>> so at this point I'd condition the whole runtime-alias check generating
>>> on flag_tree_loop_vectorize. You seem to support versioning w/o
>>> that here but in other places disable versioning w/o flag_tree_loop_vectorize.
>>> That looks somewhat inconsistent...
>> It is a bit complicated. In function version_for_distribution_p, we have
>> +
>> + /* Need to version loop if runtime alias check is necessary. */
>> + if (alias_ddrs->length () > 0)
>> + return true;
>> +
>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>> + vectorizer is not enable because no other pass can fold it. */
>> + if (!flag_tree_loop_vectorize)
>> + return false;
>> +
>>
>> It means we also versioning loops even if runtime alias check is
>> unnecessary. I did this because we lack cost model and current
>> distribution may result in too many distribution? If that's the case,
>> at least vectorizer will remove distributed version loop and fall back
>> to the original one. Hmm, shall I change it into below code:
>
> Hmm, ok - that really ties things to vectorization apart from the
> builtin recognition. So what happens if the vectorizer can vectorize
> both the distributed and non-distributed loops?
Hmm, there is no such cases. Under condition no builtins is
recognized, we wouldn't distribute loop if by itself can be
vectorized. Does this make sense? Of course, cost model for memory
behavior can change this behavior and is wanted.
>
>> +
>> + /* Need to version loop if runtime alias check is necessary. */
>> + if (alias_ddrs->length () == 0)
>> + return false;
>> +
>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>> + vectorizer is not enable because no other pass can fold it. */
>> + if (!flag_tree_loop_vectorize)
>> + return false;
>> +
>>
>> Then I can remove the check you mentioned in function
>> version_loop_by_alias_check?
>
> Yeah, I guess that would be easier to understand. Need to update
> the comment above the alias_ddrs check though.
Yes the logic here is complicated. On one hand, I want to be
conservative by versioning with IFN_LOOP_DIST_ALIAS so that vectorizer
can undo all "unnecessary" distribution before memory behavior is
modeled.
>
>>>
>>> + /* Don't version loop if any partition is builtin. */
>>> + for (i = 0; partitions->iterate (i, &partition); ++i)
>>> + {
>>> + if (partition->kind != PKIND_NORMAL)
>>> + break;
>>> + }
>>>
>>> why's that? Do you handle the case where only a subset of partitions
>> One reason is I generally consider distributed builtin functions as a
>> win, thus distribution won't be canceled later in vectorizer. Another
>> reason is if all distributed loops are recognized as builtins, we
>> can't really version with current logic because the
>> IFN_LOOP_DIST_ALIAS won't be folded in vectorizer.
>
> Ah, ok. I guess the maze was too twisted for me to see what
> version_for_distribution_p
> does ;)
>
>>> require a runtime alias check to be generated? Thus from a loop
>>> generate three loops, only two of them being versioned? The
>>> complication would be that both the runtime alias and non-alias
>>> versions would be "transformed". Or do we rely on recursively
>>> distributing the distribution result, thus if we have partitions that
>>> can be handled w/o runtime alias check fuse the remaining partitions
>>> and recurse on those?
>> No, this is not precisely handled now, the pass will version the whole
>> loop once. Though I think it's not very difficult to do two stages
>> distribution, I am not sure how useful it is.
>
> Not sure either.
>
> So to understand you're looking at loop-distribution as vectorization enabler
> and pattern detector. I think that is reasonable without a better cost model.
>
> Note that I think the state before your patches had the sensible cost-model
> that it fused very conservatively and just produced the maximum distribution
> with the idea that the looping overhead itself is cheap. Note that with
> a more "maximum" distribution the vectorizer also gets the chance to
> do "partial vectorization" in case profitability is different. Of course the
> setup cost may offset that in the case all loops end up vectorized...
Ideally, we have cost model for memory behavior in distribution. If
we know distribution is beneficial in loop distribution, we can simply
distribute it; otherwise we pass distribution cost (including memory
cost as well as runtime alias check cost as an argument to
IFN_LOOP_DIST_ALIAS), thus vectorizer can measure the cost/benefit
together with vectorization.
>
> So in reality the vectorizer itself should "distribute away"
> non-vectorizable parts
> of the loop ... :/
Do we really want to put more and more things into vectorizer? :)
Michael's opinion (IIUC, about introducing different loop
optimizations as building blocks for loop optimization infrastructure)
is quite inspiring to me. In this way, we can at least make different
optimization conservative. In general, it's much worse to do
aggressive (unnecessary) transformation than simply miss the
opportunity?
Thanks,
bin
>
> Thanks,
> Richard.
>
>> Thanks,
>> bin
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check
2017-06-28 11:46 ` Bin.Cheng
@ 2017-06-28 12:29 ` Richard Biener
2017-06-28 13:09 ` Bin.Cheng
0 siblings, 1 reply; 16+ messages in thread
From: Richard Biener @ 2017-06-28 12:29 UTC (permalink / raw)
To: Bin.Cheng; +Cc: gcc-patches
On Wed, Jun 28, 2017 at 1:46 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Wed, Jun 28, 2017 at 11:58 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Tue, Jun 27, 2017 at 4:07 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>> On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener
>>> <richard.guenther@gmail.com> wrote:
>>>> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>>>>> Hi,
>>>>> Rebased V3 for changes in previous patches. Bootstap and test on
>>>>> x86_64 and aarch64.
>>>>
>>>> why is ldist-12.c no longer distributed? your comment says it doesn't expose
>>>> more "parallelism" but the point is to reduce memory bandwith requirements
>>>> which it clearly does.
>>>>
>>>> Likewise for -13.c, -14.c. -4.c may be a questionable case but the wording
>>>> of "parallelism" still confuses me.
>>>>
>>>> Can you elaborate on that. Now onto the patch:
>>> Given we don't model data locality or memory bandwidth, whether
>>> distribution enables loops that can be executed paralleled becomes the
>>> major criteria for distribution. BTW, I think a good memory stream
>>> optimization model shouldn't consider small loops as in ldist-12.c,
>>> etc., appropriate for distribution.
>>
>> True. But what means "parallel" here? ldist-13.c if partitioned into two loops
>> can be executed "in parallel"
> So if a loop by itself can be vectorized (or so called can be executed
> paralleled), we tend to no distribute it into small ones. But there
> is one exception here, if the distributed small loops are recognized
> as builtin functions, we still distribute it. I assume it's generally
> better to call builtin memory functions than vectorize it by GCC?
Yes.
>>
>>>>
>>>> + Loop distribution is the dual of loop fusion. It separates statements
>>>> + of a loop (or loop nest) into multiple loops (or loop nests) with the
>>>> + same loop header. The major goal is to separate statements which may
>>>> + be vectorized from those that can't. This pass implements distribution
>>>> + in the following steps:
>>>>
>>>> misses the goal of being a memory stream optimization, not only a vectorization
>>>> enabler. distributing a loop can also reduce register pressure.
>>> I will revise the comment, but as explained, enabling more
>>> vectorization is the major criteria for distribution to some extend
>>> now.
>>
>> Yes, I agree -- originally it was written to optimize the stream benchmark IIRC.
> Let's see if any performance drop will be reported against this patch.
> Let's see if we can create a cost model for it.
Fine.
>>
>>>>
>>>> You introduce ldist_alias_id in struct loop (probably in 01/n which I
>>>> didn't look
>>>> into yet). If you don't use that please introduce it separately.
>>> Hmm, yes it is introduced in patch [01/n] and set in this patch.
>>>
>>>>
>>>> + /* Be conservative. If data references are not well analyzed,
>>>> + or the two data references have the same base address and
>>>> + offset, add dependence and consider it alias to each other.
>>>> + In other words, the dependence can not be resolved by
>>>> + runtime alias check. */
>>>> + if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
>>>> + || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
>>>> + || !DR_INIT (dr1) || !DR_INIT (dr2)
>>>> + || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
>>>> + || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
>>>> + || res == 0)
>>>>
>>>> ISTR a helper that computes whether we can handle a runtime alias check for
>>>> a specific case?
>>> I guess you mean runtime_alias_check_p that I factored out previously?
>>> Unfortunately, it's factored out vectorizer's usage and doesn't fit
>>> here straightforwardly. Shall I try to further generalize the
>>> interface as independence patch to this one?
>>
>> That would be nice.
>>
>>>>
>>>> + /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS. */
>>>> + if (flag_tree_loop_vectorize)
>>>> + {
>>>>
>>>> so at this point I'd condition the whole runtime-alias check generating
>>>> on flag_tree_loop_vectorize. You seem to support versioning w/o
>>>> that here but in other places disable versioning w/o flag_tree_loop_vectorize.
>>>> That looks somewhat inconsistent...
>>> It is a bit complicated. In function version_for_distribution_p, we have
>>> +
>>> + /* Need to version loop if runtime alias check is necessary. */
>>> + if (alias_ddrs->length () > 0)
>>> + return true;
>>> +
>>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>> + vectorizer is not enable because no other pass can fold it. */
>>> + if (!flag_tree_loop_vectorize)
>>> + return false;
>>> +
>>>
>>> It means we also versioning loops even if runtime alias check is
>>> unnecessary. I did this because we lack cost model and current
>>> distribution may result in too many distribution? If that's the case,
>>> at least vectorizer will remove distributed version loop and fall back
>>> to the original one. Hmm, shall I change it into below code:
>>
>> Hmm, ok - that really ties things to vectorization apart from the
>> builtin recognition. So what happens if the vectorizer can vectorize
>> both the distributed and non-distributed loops?
> Hmm, there is no such cases. Under condition no builtins is
> recognized, we wouldn't distribute loop if by itself can be
> vectorized. Does this make sense? Of course, cost model for memory
> behavior can change this behavior and is wanted.
So which cases _do_ we split loops then? "more parallelism" -- but what
does that mean exactly? Is there any testcase that shows the desired
splits for vectorization?
>>
>>> +
>>> + /* Need to version loop if runtime alias check is necessary. */
>>> + if (alias_ddrs->length () == 0)
>>> + return false;
>>> +
>>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>> + vectorizer is not enable because no other pass can fold it. */
>>> + if (!flag_tree_loop_vectorize)
>>> + return false;
>>> +
>>>
>>> Then I can remove the check you mentioned in function
>>> version_loop_by_alias_check?
>>
>> Yeah, I guess that would be easier to understand. Need to update
>> the comment above the alias_ddrs check though.
> Yes the logic here is complicated. On one hand, I want to be
> conservative by versioning with IFN_LOOP_DIST_ALIAS so that vectorizer
> can undo all "unnecessary" distribution before memory behavior is
> modeled.
>
>
>>
>>>>
>>>> + /* Don't version loop if any partition is builtin. */
>>>> + for (i = 0; partitions->iterate (i, &partition); ++i)
>>>> + {
>>>> + if (partition->kind != PKIND_NORMAL)
>>>> + break;
>>>> + }
>>>>
>>>> why's that? Do you handle the case where only a subset of partitions
>>> One reason is I generally consider distributed builtin functions as a
>>> win, thus distribution won't be canceled later in vectorizer. Another
>>> reason is if all distributed loops are recognized as builtins, we
>>> can't really version with current logic because the
>>> IFN_LOOP_DIST_ALIAS won't be folded in vectorizer.
>>
>> Ah, ok. I guess the maze was too twisted for me to see what
>> version_for_distribution_p
>> does ;)
>>
>>>> require a runtime alias check to be generated? Thus from a loop
>>>> generate three loops, only two of them being versioned? The
>>>> complication would be that both the runtime alias and non-alias
>>>> versions would be "transformed". Or do we rely on recursively
>>>> distributing the distribution result, thus if we have partitions that
>>>> can be handled w/o runtime alias check fuse the remaining partitions
>>>> and recurse on those?
>>> No, this is not precisely handled now, the pass will version the whole
>>> loop once. Though I think it's not very difficult to do two stages
>>> distribution, I am not sure how useful it is.
>>
>> Not sure either.
>>
>> So to understand you're looking at loop-distribution as vectorization enabler
>> and pattern detector. I think that is reasonable without a better cost model.
>>
>> Note that I think the state before your patches had the sensible cost-model
>> that it fused very conservatively and just produced the maximum distribution
>> with the idea that the looping overhead itself is cheap. Note that with
>> a more "maximum" distribution the vectorizer also gets the chance to
>> do "partial vectorization" in case profitability is different. Of course the
>> setup cost may offset that in the case all loops end up vectorized...
> Ideally, we have cost model for memory behavior in distribution. If
> we know distribution is beneficial in loop distribution, we can simply
> distribute it; otherwise we pass distribution cost (including memory
> cost as well as runtime alias check cost as an argument to
> IFN_LOOP_DIST_ALIAS), thus vectorizer can measure the cost/benefit
> together with vectorization.
Yes. The old cost model wasn't really one thus loop distribution was never
enabled by default.
>>
>> So in reality the vectorizer itself should "distribute away"
>> non-vectorizable parts
>> of the loop ... :/
> Do we really want to put more and more things into vectorizer? :)
> Michael's opinion (IIUC, about introducing different loop
> optimizations as building blocks for loop optimization infrastructure)
> is quite inspiring to me. In this way, we can at least make different
> optimization conservative. In general, it's much worse to do
> aggressive (unnecessary) transformation than simply miss the
> opportunity?
Heh, but with increasing number of "different loop optimizations" we
miss an overall goal. Like either each pass knows what the other
would do (vectorize or predcom for example) or we need to have
a meta-engine driving the building blocks and the building blocks
divided into analysis + costing and a transform phase (and if ever
two transforms shall apply their analysis may not be invalidated
by an earlier one). Integrating the simple building blocks into one
framework is what is sexy about GRAPHITE ...
For example predcom might be more beneficial than vectorization
with v2df vectors for example.
Richard.
>
> Thanks,
> bin
>>
>> Thanks,
>> Richard.
>>
>>> Thanks,
>>> bin
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check
2017-06-28 12:29 ` Richard Biener
@ 2017-06-28 13:09 ` Bin.Cheng
2017-06-30 10:43 ` Bin.Cheng
0 siblings, 1 reply; 16+ messages in thread
From: Bin.Cheng @ 2017-06-28 13:09 UTC (permalink / raw)
To: Richard Biener; +Cc: gcc-patches
On Wed, Jun 28, 2017 at 1:29 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Wed, Jun 28, 2017 at 1:46 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Wed, Jun 28, 2017 at 11:58 AM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Tue, Jun 27, 2017 at 4:07 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>> On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener
>>>> <richard.guenther@gmail.com> wrote:
>>>>> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>>>>>> Hi,
>>>>>> Rebased V3 for changes in previous patches. Bootstap and test on
>>>>>> x86_64 and aarch64.
>>>>>
>>>>> why is ldist-12.c no longer distributed? your comment says it doesn't expose
>>>>> more "parallelism" but the point is to reduce memory bandwith requirements
>>>>> which it clearly does.
>>>>>
>>>>> Likewise for -13.c, -14.c. -4.c may be a questionable case but the wording
>>>>> of "parallelism" still confuses me.
>>>>>
>>>>> Can you elaborate on that. Now onto the patch:
>>>> Given we don't model data locality or memory bandwidth, whether
>>>> distribution enables loops that can be executed paralleled becomes the
>>>> major criteria for distribution. BTW, I think a good memory stream
>>>> optimization model shouldn't consider small loops as in ldist-12.c,
>>>> etc., appropriate for distribution.
>>>
>>> True. But what means "parallel" here? ldist-13.c if partitioned into two loops
>>> can be executed "in parallel"
>> So if a loop by itself can be vectorized (or so called can be executed
>> paralleled), we tend to no distribute it into small ones. But there
>> is one exception here, if the distributed small loops are recognized
>> as builtin functions, we still distribute it. I assume it's generally
>> better to call builtin memory functions than vectorize it by GCC?
>
> Yes.
>
>>>
>>>>>
>>>>> + Loop distribution is the dual of loop fusion. It separates statements
>>>>> + of a loop (or loop nest) into multiple loops (or loop nests) with the
>>>>> + same loop header. The major goal is to separate statements which may
>>>>> + be vectorized from those that can't. This pass implements distribution
>>>>> + in the following steps:
>>>>>
>>>>> misses the goal of being a memory stream optimization, not only a vectorization
>>>>> enabler. distributing a loop can also reduce register pressure.
>>>> I will revise the comment, but as explained, enabling more
>>>> vectorization is the major criteria for distribution to some extend
>>>> now.
>>>
>>> Yes, I agree -- originally it was written to optimize the stream benchmark IIRC.
>> Let's see if any performance drop will be reported against this patch.
>> Let's see if we can create a cost model for it.
>
> Fine.
I will run some benchmarks to see if there is breakage.
>
>>>
>>>>>
>>>>> You introduce ldist_alias_id in struct loop (probably in 01/n which I
>>>>> didn't look
>>>>> into yet). If you don't use that please introduce it separately.
>>>> Hmm, yes it is introduced in patch [01/n] and set in this patch.
>>>>
>>>>>
>>>>> + /* Be conservative. If data references are not well analyzed,
>>>>> + or the two data references have the same base address and
>>>>> + offset, add dependence and consider it alias to each other.
>>>>> + In other words, the dependence can not be resolved by
>>>>> + runtime alias check. */
>>>>> + if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
>>>>> + || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
>>>>> + || !DR_INIT (dr1) || !DR_INIT (dr2)
>>>>> + || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
>>>>> + || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
>>>>> + || res == 0)
>>>>>
>>>>> ISTR a helper that computes whether we can handle a runtime alias check for
>>>>> a specific case?
>>>> I guess you mean runtime_alias_check_p that I factored out previously?
>>>> Unfortunately, it's factored out vectorizer's usage and doesn't fit
>>>> here straightforwardly. Shall I try to further generalize the
>>>> interface as independence patch to this one?
>>>
>>> That would be nice.
>>>
>>>>>
>>>>> + /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS. */
>>>>> + if (flag_tree_loop_vectorize)
>>>>> + {
>>>>>
>>>>> so at this point I'd condition the whole runtime-alias check generating
>>>>> on flag_tree_loop_vectorize. You seem to support versioning w/o
>>>>> that here but in other places disable versioning w/o flag_tree_loop_vectorize.
>>>>> That looks somewhat inconsistent...
>>>> It is a bit complicated. In function version_for_distribution_p, we have
>>>> +
>>>> + /* Need to version loop if runtime alias check is necessary. */
>>>> + if (alias_ddrs->length () > 0)
>>>> + return true;
>>>> +
>>>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>>> + vectorizer is not enable because no other pass can fold it. */
>>>> + if (!flag_tree_loop_vectorize)
>>>> + return false;
>>>> +
>>>>
>>>> It means we also versioning loops even if runtime alias check is
>>>> unnecessary. I did this because we lack cost model and current
>>>> distribution may result in too many distribution? If that's the case,
>>>> at least vectorizer will remove distributed version loop and fall back
>>>> to the original one. Hmm, shall I change it into below code:
>>>
>>> Hmm, ok - that really ties things to vectorization apart from the
>>> builtin recognition. So what happens if the vectorizer can vectorize
>>> both the distributed and non-distributed loops?
>> Hmm, there is no such cases. Under condition no builtins is
>> recognized, we wouldn't distribute loop if by itself can be
>> vectorized. Does this make sense? Of course, cost model for memory
>> behavior can change this behavior and is wanted.
>
> So which cases _do_ we split loops then? "more parallelism" -- but what
> does that mean exactly? Is there any testcase that shows the desired
> splits for vectorization?
At least one of distributed loop can be executed paralleled while the
original loop can't.
Not many, but ldist-26.c is added by one of patch.
>
>>>
>>>> +
>>>> + /* Need to version loop if runtime alias check is necessary. */
>>>> + if (alias_ddrs->length () == 0)
>>>> + return false;
>>>> +
>>>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>>> + vectorizer is not enable because no other pass can fold it. */
>>>> + if (!flag_tree_loop_vectorize)
>>>> + return false;
>>>> +
>>>>
>>>> Then I can remove the check you mentioned in function
>>>> version_loop_by_alias_check?
>>>
>>> Yeah, I guess that would be easier to understand. Need to update
>>> the comment above the alias_ddrs check though.
>> Yes the logic here is complicated. On one hand, I want to be
>> conservative by versioning with IFN_LOOP_DIST_ALIAS so that vectorizer
>> can undo all "unnecessary" distribution before memory behavior is
>> modeled.
>>
>>
>>>
>>>>>
>>>>> + /* Don't version loop if any partition is builtin. */
>>>>> + for (i = 0; partitions->iterate (i, &partition); ++i)
>>>>> + {
>>>>> + if (partition->kind != PKIND_NORMAL)
>>>>> + break;
>>>>> + }
>>>>>
>>>>> why's that? Do you handle the case where only a subset of partitions
>>>> One reason is I generally consider distributed builtin functions as a
>>>> win, thus distribution won't be canceled later in vectorizer. Another
>>>> reason is if all distributed loops are recognized as builtins, we
>>>> can't really version with current logic because the
>>>> IFN_LOOP_DIST_ALIAS won't be folded in vectorizer.
>>>
>>> Ah, ok. I guess the maze was too twisted for me to see what
>>> version_for_distribution_p
>>> does ;)
>>>
>>>>> require a runtime alias check to be generated? Thus from a loop
>>>>> generate three loops, only two of them being versioned? The
>>>>> complication would be that both the runtime alias and non-alias
>>>>> versions would be "transformed". Or do we rely on recursively
>>>>> distributing the distribution result, thus if we have partitions that
>>>>> can be handled w/o runtime alias check fuse the remaining partitions
>>>>> and recurse on those?
>>>> No, this is not precisely handled now, the pass will version the whole
>>>> loop once. Though I think it's not very difficult to do two stages
>>>> distribution, I am not sure how useful it is.
>>>
>>> Not sure either.
>>>
>>> So to understand you're looking at loop-distribution as vectorization enabler
>>> and pattern detector. I think that is reasonable without a better cost model.
>>>
>>> Note that I think the state before your patches had the sensible cost-model
>>> that it fused very conservatively and just produced the maximum distribution
>>> with the idea that the looping overhead itself is cheap. Note that with
>>> a more "maximum" distribution the vectorizer also gets the chance to
>>> do "partial vectorization" in case profitability is different. Of course the
>>> setup cost may offset that in the case all loops end up vectorized...
>> Ideally, we have cost model for memory behavior in distribution. If
>> we know distribution is beneficial in loop distribution, we can simply
>> distribute it; otherwise we pass distribution cost (including memory
>> cost as well as runtime alias check cost as an argument to
>> IFN_LOOP_DIST_ALIAS), thus vectorizer can measure the cost/benefit
>> together with vectorization.
>
> Yes. The old cost model wasn't really one thus loop distribution was never
> enabled by default.
>
>>>
>>> So in reality the vectorizer itself should "distribute away"
>>> non-vectorizable parts
>>> of the loop ... :/
>> Do we really want to put more and more things into vectorizer? :)
>> Michael's opinion (IIUC, about introducing different loop
>> optimizations as building blocks for loop optimization infrastructure)
>> is quite inspiring to me. In this way, we can at least make different
>> optimization conservative. In general, it's much worse to do
>> aggressive (unnecessary) transformation than simply miss the
>> opportunity?
>
> Heh, but with increasing number of "different loop optimizations" we
> miss an overall goal. Like either each pass knows what the other
> would do (vectorize or predcom for example) or we need to have
> a meta-engine driving the building blocks and the building blocks
> divided into analysis + costing and a transform phase (and if ever
> two transforms shall apply their analysis may not be invalidated
> by an earlier one). Integrating the simple building blocks into one
> framework is what is sexy about GRAPHITE ...
>
> For example predcom might be more beneficial than vectorization
> with v2df vectors for example.
We need to be careful/conservative for transformations with
complicated impact. But..., as talk is easy, I think I better to stop
here. :)
Thanks,
bin
>
> Richard.
>
>>
>> Thanks,
>> bin
>>>
>>> Thanks,
>>> Richard.
>>>
>>>> Thanks,
>>>> bin
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check
2017-06-28 13:09 ` Bin.Cheng
@ 2017-06-30 10:43 ` Bin.Cheng
2017-07-03 11:50 ` Richard Biener
2017-07-10 9:32 ` Christophe Lyon
0 siblings, 2 replies; 16+ messages in thread
From: Bin.Cheng @ 2017-06-30 10:43 UTC (permalink / raw)
To: Richard Biener; +Cc: gcc-patches
[-- Attachment #1: Type: text/plain, Size: 10529 bytes --]
On Wed, Jun 28, 2017 at 2:09 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Wed, Jun 28, 2017 at 1:29 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Wed, Jun 28, 2017 at 1:46 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>> On Wed, Jun 28, 2017 at 11:58 AM, Richard Biener
>>> <richard.guenther@gmail.com> wrote:
>>>> On Tue, Jun 27, 2017 at 4:07 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>> On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener
>>>>> <richard.guenther@gmail.com> wrote:
>>>>>> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>>>>>>> Hi,
>>>>>>> Rebased V3 for changes in previous patches. Bootstap and test on
>>>>>>> x86_64 and aarch64.
>>>>>>
>>>>>> why is ldist-12.c no longer distributed? your comment says it doesn't expose
>>>>>> more "parallelism" but the point is to reduce memory bandwith requirements
>>>>>> which it clearly does.
>>>>>>
>>>>>> Likewise for -13.c, -14.c. -4.c may be a questionable case but the wording
>>>>>> of "parallelism" still confuses me.
>>>>>>
>>>>>> Can you elaborate on that. Now onto the patch:
>>>>> Given we don't model data locality or memory bandwidth, whether
>>>>> distribution enables loops that can be executed paralleled becomes the
>>>>> major criteria for distribution. BTW, I think a good memory stream
>>>>> optimization model shouldn't consider small loops as in ldist-12.c,
>>>>> etc., appropriate for distribution.
>>>>
>>>> True. But what means "parallel" here? ldist-13.c if partitioned into two loops
>>>> can be executed "in parallel"
>>> So if a loop by itself can be vectorized (or so called can be executed
>>> paralleled), we tend to no distribute it into small ones. But there
>>> is one exception here, if the distributed small loops are recognized
>>> as builtin functions, we still distribute it. I assume it's generally
>>> better to call builtin memory functions than vectorize it by GCC?
>>
>> Yes.
>>
>>>>
>>>>>>
>>>>>> + Loop distribution is the dual of loop fusion. It separates statements
>>>>>> + of a loop (or loop nest) into multiple loops (or loop nests) with the
>>>>>> + same loop header. The major goal is to separate statements which may
>>>>>> + be vectorized from those that can't. This pass implements distribution
>>>>>> + in the following steps:
>>>>>>
>>>>>> misses the goal of being a memory stream optimization, not only a vectorization
>>>>>> enabler. distributing a loop can also reduce register pressure.
>>>>> I will revise the comment, but as explained, enabling more
>>>>> vectorization is the major criteria for distribution to some extend
>>>>> now.
>>>>
>>>> Yes, I agree -- originally it was written to optimize the stream benchmark IIRC.
>>> Let's see if any performance drop will be reported against this patch.
>>> Let's see if we can create a cost model for it.
>>
>> Fine.
> I will run some benchmarks to see if there is breakage.
>>
>>>>
>>>>>>
>>>>>> You introduce ldist_alias_id in struct loop (probably in 01/n which I
>>>>>> didn't look
>>>>>> into yet). If you don't use that please introduce it separately.
>>>>> Hmm, yes it is introduced in patch [01/n] and set in this patch.
>>>>>
>>>>>>
>>>>>> + /* Be conservative. If data references are not well analyzed,
>>>>>> + or the two data references have the same base address and
>>>>>> + offset, add dependence and consider it alias to each other.
>>>>>> + In other words, the dependence can not be resolved by
>>>>>> + runtime alias check. */
>>>>>> + if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
>>>>>> + || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
>>>>>> + || !DR_INIT (dr1) || !DR_INIT (dr2)
>>>>>> + || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
>>>>>> + || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
>>>>>> + || res == 0)
>>>>>>
>>>>>> ISTR a helper that computes whether we can handle a runtime alias check for
>>>>>> a specific case?
>>>>> I guess you mean runtime_alias_check_p that I factored out previously?
>>>>> Unfortunately, it's factored out vectorizer's usage and doesn't fit
>>>>> here straightforwardly. Shall I try to further generalize the
>>>>> interface as independence patch to this one?
>>>>
>>>> That would be nice.
>>>>
>>>>>>
>>>>>> + /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS. */
>>>>>> + if (flag_tree_loop_vectorize)
>>>>>> + {
>>>>>>
>>>>>> so at this point I'd condition the whole runtime-alias check generating
>>>>>> on flag_tree_loop_vectorize. You seem to support versioning w/o
>>>>>> that here but in other places disable versioning w/o flag_tree_loop_vectorize.
>>>>>> That looks somewhat inconsistent...
>>>>> It is a bit complicated. In function version_for_distribution_p, we have
>>>>> +
>>>>> + /* Need to version loop if runtime alias check is necessary. */
>>>>> + if (alias_ddrs->length () > 0)
>>>>> + return true;
>>>>> +
>>>>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>>>> + vectorizer is not enable because no other pass can fold it. */
>>>>> + if (!flag_tree_loop_vectorize)
>>>>> + return false;
>>>>> +
>>>>>
>>>>> It means we also versioning loops even if runtime alias check is
>>>>> unnecessary. I did this because we lack cost model and current
>>>>> distribution may result in too many distribution? If that's the case,
>>>>> at least vectorizer will remove distributed version loop and fall back
>>>>> to the original one. Hmm, shall I change it into below code:
>>>>
>>>> Hmm, ok - that really ties things to vectorization apart from the
>>>> builtin recognition. So what happens if the vectorizer can vectorize
>>>> both the distributed and non-distributed loops?
>>> Hmm, there is no such cases. Under condition no builtins is
>>> recognized, we wouldn't distribute loop if by itself can be
>>> vectorized. Does this make sense? Of course, cost model for memory
>>> behavior can change this behavior and is wanted.
>>
>> So which cases _do_ we split loops then? "more parallelism" -- but what
>> does that mean exactly? Is there any testcase that shows the desired
>> splits for vectorization?
> At least one of distributed loop can be executed paralleled while the
> original loop can't.
> Not many, but ldist-26.c is added by one of patch.
>>
>>>>
>>>>> +
>>>>> + /* Need to version loop if runtime alias check is necessary. */
>>>>> + if (alias_ddrs->length () == 0)
>>>>> + return false;
>>>>> +
>>>>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>>>> + vectorizer is not enable because no other pass can fold it. */
>>>>> + if (!flag_tree_loop_vectorize)
>>>>> + return false;
>>>>> +
>>>>>
>>>>> Then I can remove the check you mentioned in function
>>>>> version_loop_by_alias_check?
>>>>
>>>> Yeah, I guess that would be easier to understand. Need to update
>>>> the comment above the alias_ddrs check though.
>>> Yes the logic here is complicated. On one hand, I want to be
>>> conservative by versioning with IFN_LOOP_DIST_ALIAS so that vectorizer
>>> can undo all "unnecessary" distribution before memory behavior is
>>> modeled.
>>>
>>>
>>>>
>>>>>>
>>>>>> + /* Don't version loop if any partition is builtin. */
>>>>>> + for (i = 0; partitions->iterate (i, &partition); ++i)
>>>>>> + {
>>>>>> + if (partition->kind != PKIND_NORMAL)
>>>>>> + break;
>>>>>> + }
>>>>>>
>>>>>> why's that? Do you handle the case where only a subset of partitions
>>>>> One reason is I generally consider distributed builtin functions as a
>>>>> win, thus distribution won't be canceled later in vectorizer. Another
>>>>> reason is if all distributed loops are recognized as builtins, we
>>>>> can't really version with current logic because the
>>>>> IFN_LOOP_DIST_ALIAS won't be folded in vectorizer.
>>>>
>>>> Ah, ok. I guess the maze was too twisted for me to see what
>>>> version_for_distribution_p
>>>> does ;)
>>>>
>>>>>> require a runtime alias check to be generated? Thus from a loop
>>>>>> generate three loops, only two of them being versioned? The
>>>>>> complication would be that both the runtime alias and non-alias
>>>>>> versions would be "transformed". Or do we rely on recursively
>>>>>> distributing the distribution result, thus if we have partitions that
>>>>>> can be handled w/o runtime alias check fuse the remaining partitions
>>>>>> and recurse on those?
>>>>> No, this is not precisely handled now, the pass will version the whole
>>>>> loop once. Though I think it's not very difficult to do two stages
>>>>> distribution, I am not sure how useful it is.
>>>>
>>>> Not sure either.
>>>>
>>>> So to understand you're looking at loop-distribution as vectorization enabler
>>>> and pattern detector. I think that is reasonable without a better cost model.
>>>>
>>>> Note that I think the state before your patches had the sensible cost-model
>>>> that it fused very conservatively and just produced the maximum distribution
>>>> with the idea that the looping overhead itself is cheap. Note that with
>>>> a more "maximum" distribution the vectorizer also gets the chance to
>>>> do "partial vectorization" in case profitability is different. Of course the
>>>> setup cost may offset that in the case all loops end up vectorized...
>>> Ideally, we have cost model for memory behavior in distribution. If
>>> we know distribution is beneficial in loop distribution, we can simply
>>> distribute it; otherwise we pass distribution cost (including memory
>>> cost as well as runtime alias check cost as an argument to
>>> IFN_LOOP_DIST_ALIAS), thus vectorizer can measure the cost/benefit
>>> together with vectorization.
>>
>> Yes. The old cost model wasn't really one thus loop distribution was never
>> enabled by default.
>>
Hi,
This is updated patch. It makes below changes as well as renaming
ldist_alias_id to orig_loop_num.
1) Simplifies relation with flag_tree_loop_vectorization. Now it only
versions loop if runtime alias check is needed.
2) Record the new versioned loop as original loop in order to simplify
dominance working routine. It also makes sense since versioned loop
is the one same with the original loop.
No change for ChangeLog entry. Bootstrap and test. Is it OK?
Thanks,
bin
[-- Attachment #2: 0012-loop-distribution-20170621.txt.patch --]
[-- Type: text/x-patch, Size: 35997 bytes --]
From 08b5784294f41b79f980eac86d7150269fcf8295 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binche01@e108451-lin.cambridge.arm.com>
Date: Thu, 22 Jun 2017 17:18:49 +0100
Subject: [PATCH 12/13] loop-distribution-20170621.txt
---
gcc/testsuite/gcc.dg/tree-ssa/ldist-12.c | 3 +-
gcc/testsuite/gcc.dg/tree-ssa/ldist-13.c | 5 +-
gcc/testsuite/gcc.dg/tree-ssa/ldist-14.c | 5 +-
gcc/testsuite/gcc.dg/tree-ssa/ldist-4.c | 4 +-
gcc/tree-loop-distribution.c | 820 +++++++++++++++++++++++++++----
5 files changed, 727 insertions(+), 110 deletions(-)
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-12.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-12.c
index 53551ca..625dd92 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-12.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-12.c
@@ -18,4 +18,5 @@ int foo (int * __restrict__ ia,
return oya[22] + oyb[21];
}
-/* { dg-final { scan-tree-dump-times "distributed: split to 2 loops" 1 "ldist" } } */
+/* Distributing the loop doesn't expose more parallelism. */
+/* { dg-final { scan-tree-dump-times "distributed: split to 2 loops" 0 "ldist" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-13.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-13.c
index ba39d4d..8c9fd56 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-13.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-13.c
@@ -16,6 +16,5 @@ float foo (int n)
return tmp;
}
-/* We should apply loop distribution. */
-
-/* { dg-final { scan-tree-dump "Loop 1 distributed: split to 2 loops" "ldist" } } */
+/* Distributing the loop doesn't expose more parallelism. */
+/* { dg-final { scan-tree-dump-not "Loop 1 distributed: split to 2 loops" "ldist" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-14.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-14.c
index 48c1040..fa4d1a8 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-14.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-14.c
@@ -21,6 +21,5 @@ float foo (int n)
return tmp;
}
-/* We should apply loop distribution. */
-
-/* { dg-final { scan-tree-dump "Loop 1 distributed: split to 2 loops" "ldist" } } */
+/* Distributing the loop doesn't expose more parallelism. */
+/* { dg-final { scan-tree-dump-not "Loop 1 distributed: split to 2 loops" "ldist" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-4.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-4.c
index c36daf071..4def9b4 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-4.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-4.c
@@ -20,7 +20,5 @@ int loop1 (int k)
return b[100-1][1];
}
-/* The current cost model fuses the two partitions because they have
- similar memory accesses. */
-/* { dg-final { scan-tree-dump "similar memory accesses" "ldist" } } */
+/* Distributing inner loop doesn't expose more parallelism. */
/* { dg-final { scan-tree-dump-times "distributed: split to 2 loops" 0 "ldist" } } */
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index b15ec04..47f6303 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -36,10 +36,58 @@ along with GCC; see the file COPYING3. If not see
| D(I) = A(I-1)*E
|ENDDO
- This pass uses an RDG, Reduced Dependence Graph built on top of the
- data dependence relations. The RDG is then topologically sorted to
- obtain a map of information producers/consumers based on which it
- generates the new loops. */
+ Loop distribution is the dual of loop fusion. It separates statements
+ of a loop (or loop nest) into multiple loops (or loop nests) with the
+ same loop header. The major goal is to separate statements which may
+ be vectorized from those that can't. This pass implements distribution
+ in the following steps:
+
+ 1) Seed partitions with specific type statements. For now we support
+ two types seed statements: statement defining variable used outside
+ of loop; statement storing to memory.
+ 2) Build reduced dependence graph (RDG) for loop to be distributed.
+ The vertices (RDG:V) model all statements in the loop and the edges
+ (RDG:E) model flow and control dependencies between statements.
+ 3) Apart from RDG, compute data dependencies between memory references.
+ 4) Starting from seed statement, build up partition by adding depended
+ statements according to RDG's dependence information. Partition is
+ classified as parallel type if it can be executed paralleled; or as
+ sequential type if it can't. Parallel type partition is further
+ classified as different builtin kinds if it can be implemented as
+ builtin function calls.
+ 5) Build partition dependence graph (PG) based on data dependencies.
+ The vertices (PG:V) model all partitions and the edges (PG:E) model
+ all data dependencies between every partitions pair. In general,
+ data dependence is either compilation time known or unknown. In C
+ family languages, there exists quite amount compilation time unknown
+ dependencies because of possible alias relation of data references.
+ We categorize PG's edge to two types: "true" edge that represents
+ compilation time known data dependencies; "alias" edge for all other
+ data dependencies.
+ 6) Traverse subgraph of PG as if all "alias" edges don't exist. Merge
+ partitions in each strong connected component (SCC) correspondingly.
+ Build new PG for merged partitions.
+ 7) Traverse PG again and this time with both "true" and "alias" edges
+ included. We try to break SCCs by removing some edges. Because
+ SCCs by "true" edges are all fused in step 6), we can break SCCs
+ by removing some "alias" edges. It's NP-hard to choose optimal
+ edge set, fortunately simple approximation is good enough for us
+ given the small problem scale.
+ 8) Collect all data dependencies of the removed "alias" edges. Create
+ runtime alias checks for collected data dependencies.
+ 9) Version loop under the condition of runtime alias checks. Given
+ loop distribution generally introduces additional overhead, it is
+ only useful if vectorization is achieved in distributed loop. We
+ version loop with internal function call IFN_LOOP_DIST_ALIAS. If
+ no distributed loop can be vectorized, we simply remove distributed
+ loops and recover to the original one.
+
+ TODO:
+ 1) We only distribute innermost loops now. This pass should handle loop
+ nests in the future.
+ 2) We only fuse partitions in SCC now. A better fusion algorithm is
+ desired to minimize loop overhead, maximize parallelism and maximize
+ data reuse. */
#include "config.h"
#include "system.h"
@@ -744,11 +792,18 @@ generate_loops_for_partition (struct loop *loop, partition *partition,
if (copy_p)
{
+ int orig_loop_num = loop->orig_loop_num;
loop = copy_loop_before (loop);
gcc_assert (loop != NULL);
+ loop->orig_loop_num = orig_loop_num;
create_preheader (loop, CP_SIMPLE_PREHEADERS);
create_bb_after_loop (loop);
}
+ else
+ {
+ /* Origin number is set to the new versioned loop's num. */
+ gcc_assert (loop->orig_loop_num != loop->num);
+ }
/* Remove stmts not in the PARTITION bitmap. */
bbs = get_loop_body_in_dom_order (loop);
@@ -1601,11 +1656,14 @@ partition_contains_all_rw (struct graph *rdg,
}
/* Compute partition dependence created by the data references in DRS1
- and DRS2 and modify and return DIR according to that. */
+ and DRS2, modify and return DIR according to that. IF ALIAS_DDR is
+ not NULL, we record dependence introduced by possible alias between
+ two data references in ALIAS_DDRS; otherwise, we simply ignore such
+ dependence as if it doesn't exist at all. */
static int
pg_add_dependence_edges (struct graph *rdg, int dir,
- bitmap drs1, bitmap drs2)
+ bitmap drs1, bitmap drs2, vec<ddr_p> *alias_ddrs)
{
unsigned i, j;
bitmap_iterator bi, bj;
@@ -1619,6 +1677,9 @@ pg_add_dependence_edges (struct graph *rdg, int dir,
EXECUTE_IF_SET_IN_BITMAP (drs2, 0, j, bj)
{
+ int res, this_dir = 1;
+ ddr_p ddr;
+
dr2 = datarefs_vec[j];
/* Skip all <read, read> data dependence. */
@@ -1626,9 +1687,7 @@ pg_add_dependence_edges (struct graph *rdg, int dir,
continue;
saved_dr1 = dr1;
- int this_dir = 1;
- ddr_p ddr;
- /* Re-shuffle data-refs to be in dominator order. */
+ /* Re-shuffle data-refs to be in topological order. */
if (rdg_vertex_for_stmt (rdg, DR_STMT (dr1))
> rdg_vertex_for_stmt (rdg, DR_STMT (dr2)))
{
@@ -1637,14 +1696,33 @@ pg_add_dependence_edges (struct graph *rdg, int dir,
}
ddr = get_data_dependence (rdg, dr1, dr2);
if (DDR_ARE_DEPENDENT (ddr) == chrec_dont_know)
- this_dir = 2;
+ {
+ this_dir = 0;
+ res = data_ref_compare_tree (DR_BASE_ADDRESS (dr1),
+ DR_BASE_ADDRESS (dr2));
+ /* Be conservative. If data references are not well analyzed,
+ or the two data references have the same base address and
+ offset, add dependence and consider it alias to each other.
+ In other words, the dependence can not be resolved by
+ runtime alias check. */
+ if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
+ || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
+ || !DR_INIT (dr1) || !DR_INIT (dr2)
+ || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
+ || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
+ || res == 0)
+ this_dir = 2;
+ /* Data dependence could be resolved by runtime alias check,
+ record it in ALIAS_DDRS. */
+ else if (alias_ddrs != NULL)
+ alias_ddrs->safe_push (ddr);
+ /* Or simply ignore it. */
+ }
else if (DDR_ARE_DEPENDENT (ddr) == NULL_TREE)
{
if (DDR_REVERSED_P (ddr))
- {
- std::swap (dr1, dr2);
- this_dir = -this_dir;
- }
+ this_dir = -this_dir;
+
/* Known dependences can still be unordered througout the
iteration space, see gcc.dg/tree-ssa/ldist-16.c. */
if (DDR_NUM_DIST_VECTS (ddr) != 1)
@@ -1652,12 +1730,10 @@ pg_add_dependence_edges (struct graph *rdg, int dir,
/* If the overlap is exact preserve stmt order. */
else if (lambda_vector_zerop (DDR_DIST_VECT (ddr, 0), 1))
;
+ /* Else as the distance vector is lexicographic positive swap
+ the dependence direction. */
else
- {
- /* Else as the distance vector is lexicographic positive
- swap the dependence direction. */
- this_dir = -this_dir;
- }
+ this_dir = -this_dir;
}
else
this_dir = 0;
@@ -1684,11 +1760,612 @@ pgcmp (const void *v1_, const void *v2_)
return v2->post - v1->post;
}
-/* Distributes the code from LOOP in such a way that producer
- statements are placed before consumer statements. Tries to separate
- only the statements from STMTS into separate loops.
- Returns the number of distributed loops. Set *DESTROY_P to whether
- LOOP needs to be destroyed. */
+/* Data attached to vertices of partition dependence graph. */
+struct pg_vdata
+{
+ /* ID of the corresponding partition. */
+ int id;
+ /* The partition. */
+ struct partition *partition;
+};
+
+/* Data attached to edges of partition dependence graph. */
+struct pg_edata
+{
+ /* If the dependence edge can be resolved by runtime alias check,
+ this vector contains data dependence relations for runtime alias
+ check. On the other hand, if the dependence edge is introduced
+ because of compilation time known data dependence, this vector
+ contains nothing. */
+ vec<ddr_p> alias_ddrs;
+};
+
+/* Callback data for traversing edges in graph. */
+struct pg_edge_callback_data
+{
+ /* Bitmap contains strong connected components should be merged. */
+ bitmap sccs_to_merge;
+ /* Array constains component information for all vertices. */
+ int *vertices_component;
+ /* Vector to record all data dependence relations which are needed
+ to break strong connected components by runtime alias checks. */
+ vec<ddr_p> *alias_ddrs;
+};
+
+/* Initialize vertice's data for partition dependence graph PG with
+ PARTITIONS. */
+
+static void
+init_partition_graph_vertices (struct graph *pg,
+ vec<struct partition *> *partitions)
+{
+ int i;
+ partition *partition;
+ struct pg_vdata *data;
+
+ for (i = 0; partitions->iterate (i, &partition); ++i)
+ {
+ data = new pg_vdata;
+ pg->vertices[i].data = data;
+ data->id = i;
+ data->partition = partition;
+ }
+}
+
+/* Add edge <I, J> to partition dependence graph PG. Attach vector of data
+ dependence relations to the EDGE if DDRS isn't NULL. */
+
+static void
+add_partition_graph_edge (struct graph *pg, int i, int j, vec<ddr_p> *ddrs)
+{
+ struct graph_edge *e = add_edge (pg, i, j);
+
+ /* If the edge is attached with data dependence relations, it means this
+ dependence edge can be resolved by runtime alias checks. */
+ if (ddrs != NULL)
+ {
+ struct pg_edata *data = new pg_edata;
+
+ gcc_assert (ddrs->length () > 0);
+ e->data = data;
+ data->alias_ddrs = vNULL;
+ data->alias_ddrs.safe_splice (*ddrs);
+ }
+}
+
+/* Callback function for graph travesal algorithm. It returns true
+ if edge E should skipped when traversing the graph. */
+
+static bool
+pg_skip_alias_edge (struct graph_edge *e)
+{
+ struct pg_edata *data = (struct pg_edata *)e->data;
+ return (data != NULL && data->alias_ddrs.length () > 0);
+}
+
+/* Callback function freeing data attached to edge E of graph. */
+
+static void
+free_partition_graph_edata_cb (struct graph *, struct graph_edge *e, void *)
+{
+ if (e->data != NULL)
+ {
+ struct pg_edata *data = (struct pg_edata *)e->data;
+ data->alias_ddrs.release ();
+ delete data;
+ }
+}
+
+/* Free data attached to vertice of partition dependence graph PG. */
+
+static void
+free_partition_graph_vdata (struct graph *pg)
+{
+ int i;
+ struct pg_vdata *data;
+
+ for (i = 0; i < pg->n_vertices; ++i)
+ {
+ data = (struct pg_vdata *)pg->vertices[i].data;
+ delete data;
+ }
+}
+
+/* Build and return partition dependence graph for PARTITIONS. RDG is
+ reduced dependence graph for the loop to be distributed. If IGNORE_ALIAS_P
+ is true, data dependence caused by possible alias between references
+ is ignored, as if it doesn't exist at all; otherwise all depdendences
+ are considered. */
+
+static struct graph *
+build_partition_graph (struct graph *rdg,
+ vec<struct partition *> *partitions,
+ bool ignore_alias_p)
+{
+ int i, j;
+ struct partition *partition1, *partition2;
+ graph *pg = new_graph (partitions->length ());
+ auto_vec<ddr_p> alias_ddrs, *alias_ddrs_p;
+
+ alias_ddrs_p = ignore_alias_p ? NULL : &alias_ddrs;
+
+ init_partition_graph_vertices (pg, partitions);
+
+ for (i = 0; partitions->iterate (i, &partition1); ++i)
+ {
+ for (j = i + 1; partitions->iterate (j, &partition2); ++j)
+ {
+ /* dependence direction - 0 is no dependence, -1 is back,
+ 1 is forth, 2 is both (we can stop then, merging will occur). */
+ int dir = 0;
+
+ /* If the first partition has reduction, add back edge; if the
+ second partition has reduction, add forth edge. This makes
+ sure that reduction partition will be sorted as the last one. */
+ if (partition_reduction_p (partition1))
+ dir = -1;
+ else if (partition_reduction_p (partition2))
+ dir = 1;
+
+ /* Cleanup the temporary vector. */
+ alias_ddrs.truncate (0);
+
+ dir = pg_add_dependence_edges (rdg, dir, partition1->datarefs,
+ partition2->datarefs, alias_ddrs_p);
+
+ /* Add edge to partition graph if there exists dependence. There
+ are two types of edges. One type edge is caused by compilation
+ time known dependence, this type can not be resolved by runtime
+ alias check. The other type can be resolved by runtime alias
+ check. */
+ if (dir == 1 || dir == 2
+ || alias_ddrs.length () > 0)
+ {
+ /* Attach data dependence relations to edge that can be resolved
+ by runtime alias check. */
+ bool alias_edge_p = (dir != 1 && dir != 2);
+ add_partition_graph_edge (pg, i, j,
+ (alias_edge_p) ? &alias_ddrs : NULL);
+ }
+ if (dir == -1 || dir == 2
+ || alias_ddrs.length () > 0)
+ {
+ /* Attach data dependence relations to edge that can be resolved
+ by runtime alias check. */
+ bool alias_edge_p = (dir != -1 && dir != 2);
+ add_partition_graph_edge (pg, j, i,
+ (alias_edge_p) ? &alias_ddrs : NULL);
+ }
+ }
+ }
+ return pg;
+}
+
+/* Sort partitions in PG by post order and store them in PARTITIONS. */
+
+static void
+sort_partitions_by_post_order (struct graph *pg,
+ vec<struct partition *> *partitions)
+{
+ int i;
+ struct pg_vdata *data;
+
+ /* Now order the remaining nodes in postorder. */
+ qsort (pg->vertices, pg->n_vertices, sizeof (vertex), pgcmp);
+ partitions->truncate (0);
+ for (i = 0; i < pg->n_vertices; ++i)
+ {
+ data = (struct pg_vdata *)pg->vertices[i].data;
+ if (data->partition)
+ partitions->safe_push (data->partition);
+ }
+}
+
+/* Given reduced dependence graph RDG merge strong connected components
+ of PARTITIONS. If IGNORE_ALIAS_P is true, data dependence caused by
+ possible alias between references is ignored, as if it doesn't exist
+ at all; otherwise all depdendences are considered. */
+
+static void
+merge_dep_scc_partitions (struct graph *rdg,
+ vec<struct partition *> *partitions,
+ bool ignore_alias_p)
+{
+ struct partition *partition1, *partition2;
+ struct pg_vdata *data;
+ graph *pg = build_partition_graph (rdg, partitions, ignore_alias_p);
+ int i, j, num_sccs = graphds_scc (pg, NULL);
+
+ /* Strong connected compoenent means dependence cycle, we cannot distribute
+ them. So fuse them together. */
+ if ((unsigned) num_sccs < partitions->length ())
+ {
+ for (i = 0; i < num_sccs; ++i)
+ {
+ for (j = 0; partitions->iterate (j, &partition1); ++j)
+ if (pg->vertices[j].component == i)
+ break;
+ for (j = j + 1; partitions->iterate (j, &partition2); ++j)
+ if (pg->vertices[j].component == i)
+ {
+ partition_merge_into (NULL, partition1,
+ partition2, FUSE_SAME_SCC);
+ partition1->type = PTYPE_SEQUENTIAL;
+ (*partitions)[j] = NULL;
+ partition_free (partition2);
+ data = (struct pg_vdata *)pg->vertices[j].data;
+ data->partition = NULL;
+ }
+ }
+ sort_partitions_by_post_order (pg, partitions);
+ }
+ gcc_assert (partitions->length () == (unsigned)num_sccs);
+ free_partition_graph_vdata (pg);
+ free_graph (pg);
+}
+
+/* Callback function for traversing edge E in graph G. DATA is private
+ callback data. */
+
+static void
+pg_collect_alias_ddrs (struct graph *g, struct graph_edge *e, void *data)
+{
+ int i, j, component;
+ struct pg_edge_callback_data *cbdata;
+ struct pg_edata *edata = (struct pg_edata *) e->data;
+
+ /* If the edge doesn't have attached data dependence, it represents
+ compilation time known dependences. This type dependence cannot
+ be resolved by runtime alias check. */
+ if (edata == NULL || edata->alias_ddrs.length () == 0)
+ return;
+
+ cbdata = (struct pg_edge_callback_data *) data;
+ i = e->src;
+ j = e->dest;
+ component = cbdata->vertices_component[i];
+ /* Vertices are topologically sorted according to compilation time
+ known dependences, so we can break strong connected components
+ by removing edges of the opposite direction, i.e, edges pointing
+ from vertice with smaller post number to vertice with bigger post
+ number. */
+ if (g->vertices[i].post < g->vertices[j].post
+ /* We only need to remove edges connecting vertices in the same
+ strong connected component to break it. */
+ && component == cbdata->vertices_component[j]
+ /* Check if we want to break the strong connected component or not. */
+ && !bitmap_bit_p (cbdata->sccs_to_merge, component))
+ cbdata->alias_ddrs->safe_splice (edata->alias_ddrs);
+}
+
+/* This is the main function breaking strong conected components in
+ PARTITIONS giving reduced depdendence graph RDG. Store data dependence
+ relations for runtime alias check in ALIAS_DDRS. */
+
+static void
+break_alias_scc_partitions (struct graph *rdg,
+ vec<struct partition *> *partitions,
+ vec<ddr_p> *alias_ddrs)
+{
+ int i, j, num_sccs, num_sccs_no_alias;
+ /* Build partition dependence graph. */
+ graph *pg = build_partition_graph (rdg, partitions, false);
+
+ alias_ddrs->truncate (0);
+ /* Find strong connected components in the graph, with all dependence edges
+ considered. */
+ num_sccs = graphds_scc (pg, NULL);
+ /* All SCCs now can be broken by runtime alias checks because SCCs caused by
+ compilation time known dependences are merged before this function. */
+ if ((unsigned) num_sccs < partitions->length ())
+ {
+ struct pg_edge_callback_data cbdata;
+ auto_bitmap sccs_to_merge;
+ auto_vec<enum partition_type> scc_types;
+ struct partition *partition, *first;
+
+ /* If all paritions in a SCC has the same type, we can simply merge the
+ SCC. This loop finds out such SCCS and record them in bitmap. */
+ bitmap_set_range (sccs_to_merge, 0, (unsigned) num_sccs);
+ for (i = 0; i < num_sccs; ++i)
+ {
+ for (j = 0; partitions->iterate (j, &first); ++j)
+ if (pg->vertices[j].component == i)
+ break;
+ for (++j; partitions->iterate (j, &partition); ++j)
+ {
+ if (pg->vertices[j].component != i)
+ continue;
+
+ if (first->type != partition->type)
+ {
+ bitmap_clear_bit (sccs_to_merge, i);
+ break;
+ }
+ }
+ }
+
+ /* Initialize callback data for traversing. */
+ cbdata.sccs_to_merge = sccs_to_merge;
+ cbdata.alias_ddrs = alias_ddrs;
+ cbdata.vertices_component = XNEWVEC (int, pg->n_vertices);
+ /* Record the component information which will be corrupted by next
+ graph scc finding call. */
+ for (i = 0; i < pg->n_vertices; ++i)
+ cbdata.vertices_component[i] = pg->vertices[i].component;
+
+ /* Collect data dependences for runtime alias checks to break SCCs. */
+ if (bitmap_count_bits (sccs_to_merge) != (unsigned) num_sccs)
+ {
+ /* Run SCC finding algorithm again, with alias dependence edges
+ skipped. This is to topologically sort paritions according to
+ compilation time known dependence. Note the topological order
+ is stored in the form of pg's post order number. */
+ num_sccs_no_alias = graphds_scc (pg, NULL, pg_skip_alias_edge);
+ gcc_assert (partitions->length () == (unsigned) num_sccs_no_alias);
+ /* With topological order, we can construct two subgraphs L and R.
+ L contains edge <x, y> where x < y in terms of post order, while
+ R contains edge <x, y> where x > y. Edges for compilation time
+ known dependence all fall in R, so we break SCCs by removing all
+ (alias) edges of in subgraph L. */
+ for_each_edge (pg, pg_collect_alias_ddrs, &cbdata);
+ }
+
+ /* For SCC that doesn't need to be broken, merge it. */
+ for (i = 0; i < num_sccs; ++i)
+ {
+ if (!bitmap_bit_p (sccs_to_merge, i))
+ continue;
+
+ for (j = 0; partitions->iterate (j, &first); ++j)
+ if (cbdata.vertices_component[j] == i)
+ break;
+ for (++j; partitions->iterate (j, &partition); ++j)
+ {
+ struct pg_vdata *data;
+
+ if (cbdata.vertices_component[j] != i)
+ continue;
+
+ partition_merge_into (NULL, first, partition, FUSE_SAME_SCC);
+ (*partitions)[j] = NULL;
+ partition_free (partition);
+ data = (struct pg_vdata *)pg->vertices[j].data;
+ gcc_assert (data->id == j);
+ data->partition = NULL;
+ }
+ }
+ }
+
+ sort_partitions_by_post_order (pg, partitions);
+ free_partition_graph_vdata (pg);
+ for_each_edge (pg, free_partition_graph_edata_cb, NULL);
+ free_graph (pg);
+
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ {
+ fprintf (dump_file, "Possible alias data dependence to break:\n");
+ dump_data_dependence_relations (dump_file, *alias_ddrs);
+ }
+}
+
+/* Compute and return an expression whose value is the segment length which
+ will be accessed by DR in NITERS iterations. */
+
+static tree
+data_ref_segment_size (struct data_reference *dr, tree niters)
+{
+ tree segment_length;
+
+ if (integer_zerop (DR_STEP (dr)))
+ segment_length = TYPE_SIZE_UNIT (TREE_TYPE (DR_REF (dr)));
+ else
+ segment_length = size_binop (MULT_EXPR,
+ fold_convert (sizetype, DR_STEP (dr)),
+ fold_convert (sizetype, niters));
+
+ return segment_length;
+}
+
+/* Return true if LOOP's latch is dominated by statement for data reference
+ DR. */
+
+static inline bool
+latch_dominated_by_data_ref (struct loop *loop, data_reference *dr)
+{
+ return dominated_by_p (CDI_DOMINATORS, single_exit (loop)->src,
+ gimple_bb (DR_STMT (dr)));
+}
+
+/* Compute alias check pairs and store them in COMP_ALIAS_PAIRS for LOOP's
+ data dependence relations ALIAS_DDRS. */
+
+static void
+compute_alias_check_pairs (struct loop *loop, vec<ddr_p> *alias_ddrs,
+ vec<dr_with_seg_len_pair_t> *comp_alias_pairs)
+{
+ unsigned int i;
+ unsigned HOST_WIDE_INT factor = 1;
+ tree niters_plus_one, niters = number_of_latch_executions (loop);
+
+ gcc_assert (niters != NULL_TREE && niters != chrec_dont_know);
+ niters = fold_convert (sizetype, niters);
+ niters_plus_one = size_binop (PLUS_EXPR, niters, size_one_node);
+
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file, "Creating alias check pairs:\n");
+
+ /* Iterate all data dependence relations and compute alias check pairs. */
+ for (i = 0; i < alias_ddrs->length (); i++)
+ {
+ ddr_p ddr = (*alias_ddrs)[i];
+ struct data_reference *dr_a = DDR_A (ddr);
+ struct data_reference *dr_b = DDR_B (ddr);
+ tree seg_length_a, seg_length_b;
+ int comp_res = data_ref_compare_tree (DR_BASE_ADDRESS (dr_a),
+ DR_BASE_ADDRESS (dr_b));
+
+ if (comp_res == 0)
+ comp_res = data_ref_compare_tree (DR_OFFSET (dr_a), DR_OFFSET (dr_b));
+ gcc_assert (comp_res != 0);
+
+ if (latch_dominated_by_data_ref (loop, dr_a))
+ seg_length_a = data_ref_segment_size (dr_a, niters_plus_one);
+ else
+ seg_length_a = data_ref_segment_size (dr_a, niters);
+
+ if (latch_dominated_by_data_ref (loop, dr_b))
+ seg_length_b = data_ref_segment_size (dr_b, niters_plus_one);
+ else
+ seg_length_b = data_ref_segment_size (dr_b, niters);
+
+ dr_with_seg_len_pair_t dr_with_seg_len_pair
+ (dr_with_seg_len (dr_a, seg_length_a),
+ dr_with_seg_len (dr_b, seg_length_b));
+
+ /* Canonicalize pairs by sorting the two DR members. */
+ if (comp_res > 0)
+ std::swap (dr_with_seg_len_pair.first, dr_with_seg_len_pair.second);
+
+ comp_alias_pairs->safe_push (dr_with_seg_len_pair);
+ }
+
+ if (tree_fits_uhwi_p (niters))
+ factor = tree_to_uhwi (niters);
+
+ /* Prune alias check pairs. */
+ prune_runtime_alias_test_list (comp_alias_pairs, factor);
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file,
+ "Improved number of alias checks from %d to %d\n",
+ alias_ddrs->length (), comp_alias_pairs->length ());
+}
+
+/* Given data dependence relations in ALIAS_DDRS, generate runtime alias
+ checks and version LOOP under condition of these runtime alias checks. */
+
+static void
+version_loop_by_alias_check (struct loop *loop, vec<ddr_p> *alias_ddrs)
+{
+ unsigned then_prob, else_prob, then_scale, else_scale;
+ basic_block cond_bb;
+ struct loop *nloop;
+ tree lhs, arg0, cond_expr = NULL_TREE;
+ gimple_seq cond_stmts = NULL;
+ gimple *call_stmt = NULL;
+ auto_vec<dr_with_seg_len_pair_t> comp_alias_pairs;
+
+ /* Generate code for runtime alias checks if necessary. */
+ gcc_assert (alias_ddrs->length () > 0);
+
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file,
+ "Version loop <%d> with runtime alias check\n", loop->num);
+
+ compute_alias_check_pairs (loop, alias_ddrs, &comp_alias_pairs);
+ create_runtime_alias_checks (loop, &comp_alias_pairs, &cond_expr);
+ cond_expr = force_gimple_operand_1 (cond_expr, &cond_stmts,
+ is_gimple_condexpr, NULL_TREE);
+ then_prob = 9 * REG_BR_PROB_BASE / 10;
+ else_prob = REG_BR_PROB_BASE - then_prob;
+ then_scale = 9 * REG_BR_PROB_BASE / 10;
+ else_scale = REG_BR_PROB_BASE - then_scale;
+
+ /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS. */
+ if (flag_tree_loop_vectorize)
+ {
+ /* Generate internal function call for loop distribution alias check. */
+ call_stmt = gimple_build_call_internal (IFN_LOOP_DIST_ALIAS,
+ 2, NULL_TREE, cond_expr);
+ lhs = make_ssa_name (boolean_type_node);
+ gimple_call_set_lhs (call_stmt, lhs);
+ }
+ else
+ lhs = cond_expr;
+
+ initialize_original_copy_tables ();
+ nloop = loop_version (loop, lhs, &cond_bb, then_prob,
+ else_prob, then_scale, else_scale, true);
+ free_original_copy_tables ();
+ /* Record the original loop number in newly generated loops. In case of
+ distribution, the original loop will be distributed and the new loop
+ is kept. */
+ loop->orig_loop_num = nloop->num;
+ nloop->orig_loop_num = nloop->num;
+ nloop->dont_vectorize = true;
+ nloop->force_vectorize = false;
+
+ if (call_stmt)
+ {
+ /* Record new loop's num in IFN_LOOP_DIST_ALIAS because the original
+ loop could be destroyed. */
+ arg0 = build_int_cst (integer_type_node, loop->orig_loop_num);
+ gimple_call_set_arg (call_stmt, 0, arg0);
+ gimple_seq_add_stmt_without_update (&cond_stmts, call_stmt);
+ }
+
+ if (cond_stmts)
+ {
+ gimple_stmt_iterator cond_gsi = gsi_last_bb (cond_bb);
+ gsi_insert_seq_before (&cond_gsi, cond_stmts, GSI_SAME_STMT);
+ }
+ update_ssa (TODO_update_ssa);
+}
+
+/* Return true if loop versioning is needed to distrubute PARTITIONS.
+ ALIAS_DDRS are data dependence relations for runtime alias check. */
+
+static inline bool
+version_for_distribution_p (vec<struct partition *> *partitions,
+ vec<ddr_p> *alias_ddrs)
+{
+ /* No need to version loop if we have only one partition. */
+ if (partitions->length () == 1)
+ return false;
+
+ /* Need to version loop if runtime alias check is necessary. */
+ return (alias_ddrs->length () > 0);
+}
+
+/* Fuse all partitions if necessary before finalizing distribution. */
+
+static void
+finalize_partitions (vec<struct partition *> *partitions,
+ vec<ddr_p> *alias_ddrs)
+{
+ unsigned i;
+ struct partition *a, *partition;
+
+ if (partitions->length () == 1
+ || alias_ddrs->length () > 0)
+ return;
+
+ a = (*partitions)[0];
+ if (a->kind != PKIND_NORMAL)
+ return;
+
+ for (i = 1; partitions->iterate (i, &partition); ++i)
+ {
+ /* Don't fuse if partition has different type or it is a builtin. */
+ if (partition->type != a->type
+ || partition->kind != PKIND_NORMAL)
+ return;
+ }
+
+ /* Fuse all partitions. */
+ for (i = 1; partitions->iterate (i, &partition); ++i)
+ {
+ partition_merge_into (NULL, a, partition, FUSE_FINALIZE);
+ partition_free (partition);
+ }
+ partitions->truncate (1);
+}
+
+/* Distributes the code from LOOP in such a way that producer statements
+ are placed before consumer statements. Tries to separate only the
+ statements from STMTS into separate loops. Returns the number of
+ distributed loops. Set NB_CALLS to number of generated builtin calls.
+ Set *DESTROY_P to whether LOOP needs to be destroyed. */
static int
distribute_loop (struct loop *loop, vec<gimple *> stmts,
@@ -1698,8 +2375,6 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
partition *partition;
bool any_builtin;
int i, nbp;
- graph *pg = NULL;
- int num_sccs = 1;
*destroy_p = false;
*nb_calls = 0;
@@ -1747,6 +2422,11 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
auto_vec<struct partition *, 3> partitions;
rdg_build_partitions (rdg, stmts, &partitions);
+ /* Can't do runtime alias check if loop niter is unknown. */
+ tree niters = number_of_latch_executions (loop);
+ bool rt_alias_check_p = (niters != NULL_TREE && niters != chrec_dont_know);
+ auto_vec<ddr_p> alias_ddrs;
+
auto_bitmap stmt_in_all_partitions;
bitmap_copy (stmt_in_all_partitions, partitions[0]->stmts);
for (i = 1; partitions.iterate (i, &partition); ++i)
@@ -1833,81 +2513,14 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
/* Build the partition dependency graph. */
if (partitions.length () > 1)
{
- pg = new_graph (partitions.length ());
- struct pgdata {
- struct partition *partition;
- };
-#define PGDATA(i) ((pgdata *)(pg->vertices[i].data))
- for (i = 0; partitions.iterate (i, &partition); ++i)
- {
- vertex *v = &pg->vertices[i];
- pgdata *data = new pgdata;
- /* FIXME - leaks. */
- v->data = data;
- data->partition = partition;
- }
- struct partition *partition1, *partition2;
- for (i = 0; partitions.iterate (i, &partition1); ++i)
- for (int j = i + 1; partitions.iterate (j, &partition2); ++j)
- {
- /* dependence direction - 0 is no dependence, -1 is back,
- 1 is forth, 2 is both (we can stop then, merging will occur). */
- int dir = pg_add_dependence_edges (rdg, 0,
- partition1->datarefs,
- partition2->datarefs);
- if (dir == 1 || dir == 2)
- add_edge (pg, i, j);
- if (dir == -1 || dir == 2)
- add_edge (pg, j, i);
- }
-
- /* Add edges to the reduction partition (if any) to force it last. */
- unsigned j;
- for (j = 0; partitions.iterate (j, &partition); ++j)
- if (partition_reduction_p (partition))
- break;
- if (j < partitions.length ())
- {
- for (unsigned i = 0; partitions.iterate (i, &partition); ++i)
- if (i != j)
- add_edge (pg, i, j);
- }
-
- /* Compute partitions we cannot separate and fuse them. */
- num_sccs = graphds_scc (pg, NULL);
- for (i = 0; i < num_sccs; ++i)
- {
- struct partition *first;
- int j;
- for (j = 0; partitions.iterate (j, &first); ++j)
- if (pg->vertices[j].component == i)
- break;
- for (j = j + 1; partitions.iterate (j, &partition); ++j)
- if (pg->vertices[j].component == i)
- {
- partition_merge_into (NULL, first,
- partition, FUSE_SAME_SCC);
- first->type = PTYPE_SEQUENTIAL;
- partitions[j] = NULL;
- partition_free (partition);
- PGDATA (j)->partition = NULL;
- }
- }
-
- /* Now order the remaining nodes in postorder. */
- qsort (pg->vertices, pg->n_vertices, sizeof (vertex), pgcmp);
- partitions.truncate (0);
- for (i = 0; i < pg->n_vertices; ++i)
- {
- pgdata *data = PGDATA (i);
- if (data->partition)
- partitions.safe_push (data->partition);
- delete data;
- }
- gcc_assert (partitions.length () == (unsigned)num_sccs);
- free_graph (pg);
+ merge_dep_scc_partitions (rdg, &partitions, rt_alias_check_p);
+ alias_ddrs.truncate (0);
+ if (rt_alias_check_p && partitions.length () > 1)
+ break_alias_scc_partitions (rdg, &partitions, &alias_ddrs);
}
+ finalize_partitions (&partitions, &alias_ddrs);
+
nbp = partitions.length ();
if (nbp == 0
|| (nbp == 1 && !partition_builtin_p (partitions[0]))
@@ -1917,8 +2530,15 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
goto ldist_done;
}
+ if (version_for_distribution_p (&partitions, &alias_ddrs))
+ version_loop_by_alias_check (loop, &alias_ddrs);
+
if (dump_file && (dump_flags & TDF_DETAILS))
- dump_rdg_partitions (dump_file, partitions);
+ {
+ fprintf (dump_file,
+ "distribute loop <%d> into partitions:\n", loop->num);
+ dump_rdg_partitions (dump_file, partitions);
+ }
FOR_EACH_VEC_ELT (partitions, i, partition)
{
--
1.9.1
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check
2017-06-30 10:43 ` Bin.Cheng
@ 2017-07-03 11:50 ` Richard Biener
2017-07-10 9:32 ` Christophe Lyon
1 sibling, 0 replies; 16+ messages in thread
From: Richard Biener @ 2017-07-03 11:50 UTC (permalink / raw)
To: Bin.Cheng; +Cc: gcc-patches
On Fri, Jun 30, 2017 at 12:43 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Wed, Jun 28, 2017 at 2:09 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Wed, Jun 28, 2017 at 1:29 PM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Wed, Jun 28, 2017 at 1:46 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>> On Wed, Jun 28, 2017 at 11:58 AM, Richard Biener
>>>> <richard.guenther@gmail.com> wrote:
>>>>> On Tue, Jun 27, 2017 at 4:07 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>> On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener
>>>>>> <richard.guenther@gmail.com> wrote:
>>>>>>> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>>>>>>>> Hi,
>>>>>>>> Rebased V3 for changes in previous patches. Bootstap and test on
>>>>>>>> x86_64 and aarch64.
>>>>>>>
>>>>>>> why is ldist-12.c no longer distributed? your comment says it doesn't expose
>>>>>>> more "parallelism" but the point is to reduce memory bandwith requirements
>>>>>>> which it clearly does.
>>>>>>>
>>>>>>> Likewise for -13.c, -14.c. -4.c may be a questionable case but the wording
>>>>>>> of "parallelism" still confuses me.
>>>>>>>
>>>>>>> Can you elaborate on that. Now onto the patch:
>>>>>> Given we don't model data locality or memory bandwidth, whether
>>>>>> distribution enables loops that can be executed paralleled becomes the
>>>>>> major criteria for distribution. BTW, I think a good memory stream
>>>>>> optimization model shouldn't consider small loops as in ldist-12.c,
>>>>>> etc., appropriate for distribution.
>>>>>
>>>>> True. But what means "parallel" here? ldist-13.c if partitioned into two loops
>>>>> can be executed "in parallel"
>>>> So if a loop by itself can be vectorized (or so called can be executed
>>>> paralleled), we tend to no distribute it into small ones. But there
>>>> is one exception here, if the distributed small loops are recognized
>>>> as builtin functions, we still distribute it. I assume it's generally
>>>> better to call builtin memory functions than vectorize it by GCC?
>>>
>>> Yes.
>>>
>>>>>
>>>>>>>
>>>>>>> + Loop distribution is the dual of loop fusion. It separates statements
>>>>>>> + of a loop (or loop nest) into multiple loops (or loop nests) with the
>>>>>>> + same loop header. The major goal is to separate statements which may
>>>>>>> + be vectorized from those that can't. This pass implements distribution
>>>>>>> + in the following steps:
>>>>>>>
>>>>>>> misses the goal of being a memory stream optimization, not only a vectorization
>>>>>>> enabler. distributing a loop can also reduce register pressure.
>>>>>> I will revise the comment, but as explained, enabling more
>>>>>> vectorization is the major criteria for distribution to some extend
>>>>>> now.
>>>>>
>>>>> Yes, I agree -- originally it was written to optimize the stream benchmark IIRC.
>>>> Let's see if any performance drop will be reported against this patch.
>>>> Let's see if we can create a cost model for it.
>>>
>>> Fine.
>> I will run some benchmarks to see if there is breakage.
>>>
>>>>>
>>>>>>>
>>>>>>> You introduce ldist_alias_id in struct loop (probably in 01/n which I
>>>>>>> didn't look
>>>>>>> into yet). If you don't use that please introduce it separately.
>>>>>> Hmm, yes it is introduced in patch [01/n] and set in this patch.
>>>>>>
>>>>>>>
>>>>>>> + /* Be conservative. If data references are not well analyzed,
>>>>>>> + or the two data references have the same base address and
>>>>>>> + offset, add dependence and consider it alias to each other.
>>>>>>> + In other words, the dependence can not be resolved by
>>>>>>> + runtime alias check. */
>>>>>>> + if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
>>>>>>> + || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
>>>>>>> + || !DR_INIT (dr1) || !DR_INIT (dr2)
>>>>>>> + || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
>>>>>>> + || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
>>>>>>> + || res == 0)
>>>>>>>
>>>>>>> ISTR a helper that computes whether we can handle a runtime alias check for
>>>>>>> a specific case?
>>>>>> I guess you mean runtime_alias_check_p that I factored out previously?
>>>>>> Unfortunately, it's factored out vectorizer's usage and doesn't fit
>>>>>> here straightforwardly. Shall I try to further generalize the
>>>>>> interface as independence patch to this one?
>>>>>
>>>>> That would be nice.
>>>>>
>>>>>>>
>>>>>>> + /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS. */
>>>>>>> + if (flag_tree_loop_vectorize)
>>>>>>> + {
>>>>>>>
>>>>>>> so at this point I'd condition the whole runtime-alias check generating
>>>>>>> on flag_tree_loop_vectorize. You seem to support versioning w/o
>>>>>>> that here but in other places disable versioning w/o flag_tree_loop_vectorize.
>>>>>>> That looks somewhat inconsistent...
>>>>>> It is a bit complicated. In function version_for_distribution_p, we have
>>>>>> +
>>>>>> + /* Need to version loop if runtime alias check is necessary. */
>>>>>> + if (alias_ddrs->length () > 0)
>>>>>> + return true;
>>>>>> +
>>>>>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>>>>> + vectorizer is not enable because no other pass can fold it. */
>>>>>> + if (!flag_tree_loop_vectorize)
>>>>>> + return false;
>>>>>> +
>>>>>>
>>>>>> It means we also versioning loops even if runtime alias check is
>>>>>> unnecessary. I did this because we lack cost model and current
>>>>>> distribution may result in too many distribution? If that's the case,
>>>>>> at least vectorizer will remove distributed version loop and fall back
>>>>>> to the original one. Hmm, shall I change it into below code:
>>>>>
>>>>> Hmm, ok - that really ties things to vectorization apart from the
>>>>> builtin recognition. So what happens if the vectorizer can vectorize
>>>>> both the distributed and non-distributed loops?
>>>> Hmm, there is no such cases. Under condition no builtins is
>>>> recognized, we wouldn't distribute loop if by itself can be
>>>> vectorized. Does this make sense? Of course, cost model for memory
>>>> behavior can change this behavior and is wanted.
>>>
>>> So which cases _do_ we split loops then? "more parallelism" -- but what
>>> does that mean exactly? Is there any testcase that shows the desired
>>> splits for vectorization?
>> At least one of distributed loop can be executed paralleled while the
>> original loop can't.
>> Not many, but ldist-26.c is added by one of patch.
>>>
>>>>>
>>>>>> +
>>>>>> + /* Need to version loop if runtime alias check is necessary. */
>>>>>> + if (alias_ddrs->length () == 0)
>>>>>> + return false;
>>>>>> +
>>>>>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>>>>> + vectorizer is not enable because no other pass can fold it. */
>>>>>> + if (!flag_tree_loop_vectorize)
>>>>>> + return false;
>>>>>> +
>>>>>>
>>>>>> Then I can remove the check you mentioned in function
>>>>>> version_loop_by_alias_check?
>>>>>
>>>>> Yeah, I guess that would be easier to understand. Need to update
>>>>> the comment above the alias_ddrs check though.
>>>> Yes the logic here is complicated. On one hand, I want to be
>>>> conservative by versioning with IFN_LOOP_DIST_ALIAS so that vectorizer
>>>> can undo all "unnecessary" distribution before memory behavior is
>>>> modeled.
>>>>
>>>>
>>>>>
>>>>>>>
>>>>>>> + /* Don't version loop if any partition is builtin. */
>>>>>>> + for (i = 0; partitions->iterate (i, &partition); ++i)
>>>>>>> + {
>>>>>>> + if (partition->kind != PKIND_NORMAL)
>>>>>>> + break;
>>>>>>> + }
>>>>>>>
>>>>>>> why's that? Do you handle the case where only a subset of partitions
>>>>>> One reason is I generally consider distributed builtin functions as a
>>>>>> win, thus distribution won't be canceled later in vectorizer. Another
>>>>>> reason is if all distributed loops are recognized as builtins, we
>>>>>> can't really version with current logic because the
>>>>>> IFN_LOOP_DIST_ALIAS won't be folded in vectorizer.
>>>>>
>>>>> Ah, ok. I guess the maze was too twisted for me to see what
>>>>> version_for_distribution_p
>>>>> does ;)
>>>>>
>>>>>>> require a runtime alias check to be generated? Thus from a loop
>>>>>>> generate three loops, only two of them being versioned? The
>>>>>>> complication would be that both the runtime alias and non-alias
>>>>>>> versions would be "transformed". Or do we rely on recursively
>>>>>>> distributing the distribution result, thus if we have partitions that
>>>>>>> can be handled w/o runtime alias check fuse the remaining partitions
>>>>>>> and recurse on those?
>>>>>> No, this is not precisely handled now, the pass will version the whole
>>>>>> loop once. Though I think it's not very difficult to do two stages
>>>>>> distribution, I am not sure how useful it is.
>>>>>
>>>>> Not sure either.
>>>>>
>>>>> So to understand you're looking at loop-distribution as vectorization enabler
>>>>> and pattern detector. I think that is reasonable without a better cost model.
>>>>>
>>>>> Note that I think the state before your patches had the sensible cost-model
>>>>> that it fused very conservatively and just produced the maximum distribution
>>>>> with the idea that the looping overhead itself is cheap. Note that with
>>>>> a more "maximum" distribution the vectorizer also gets the chance to
>>>>> do "partial vectorization" in case profitability is different. Of course the
>>>>> setup cost may offset that in the case all loops end up vectorized...
>>>> Ideally, we have cost model for memory behavior in distribution. If
>>>> we know distribution is beneficial in loop distribution, we can simply
>>>> distribute it; otherwise we pass distribution cost (including memory
>>>> cost as well as runtime alias check cost as an argument to
>>>> IFN_LOOP_DIST_ALIAS), thus vectorizer can measure the cost/benefit
>>>> together with vectorization.
>>>
>>> Yes. The old cost model wasn't really one thus loop distribution was never
>>> enabled by default.
>>>
>
> Hi,
> This is updated patch. It makes below changes as well as renaming
> ldist_alias_id to orig_loop_num.
> 1) Simplifies relation with flag_tree_loop_vectorization. Now it only
> versions loop if runtime alias check is needed.
> 2) Record the new versioned loop as original loop in order to simplify
> dominance working routine. It also makes sense since versioned loop
> is the one same with the original loop.
>
> No change for ChangeLog entry. Bootstrap and test. Is it OK?
Ok.
The profile stuff might need updating now, please double-check.
Thanks,
Richard.
> Thanks,
> bin
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check
2017-06-30 10:43 ` Bin.Cheng
2017-07-03 11:50 ` Richard Biener
@ 2017-07-10 9:32 ` Christophe Lyon
2017-07-10 9:39 ` Bin.Cheng
2017-07-17 10:06 ` Bin.Cheng
1 sibling, 2 replies; 16+ messages in thread
From: Christophe Lyon @ 2017-07-10 9:32 UTC (permalink / raw)
To: Bin.Cheng; +Cc: Richard Biener, gcc-patches
Hi Bin,
On 30 June 2017 at 12:43, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Wed, Jun 28, 2017 at 2:09 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Wed, Jun 28, 2017 at 1:29 PM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Wed, Jun 28, 2017 at 1:46 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>> On Wed, Jun 28, 2017 at 11:58 AM, Richard Biener
>>>> <richard.guenther@gmail.com> wrote:
>>>>> On Tue, Jun 27, 2017 at 4:07 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>> On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener
>>>>>> <richard.guenther@gmail.com> wrote:
>>>>>>> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>>>>>>>> Hi,
>>>>>>>> Rebased V3 for changes in previous patches. Bootstap and test on
>>>>>>>> x86_64 and aarch64.
>>>>>>>
>>>>>>> why is ldist-12.c no longer distributed? your comment says it doesn't expose
>>>>>>> more "parallelism" but the point is to reduce memory bandwith requirements
>>>>>>> which it clearly does.
>>>>>>>
>>>>>>> Likewise for -13.c, -14.c. -4.c may be a questionable case but the wording
>>>>>>> of "parallelism" still confuses me.
>>>>>>>
>>>>>>> Can you elaborate on that. Now onto the patch:
>>>>>> Given we don't model data locality or memory bandwidth, whether
>>>>>> distribution enables loops that can be executed paralleled becomes the
>>>>>> major criteria for distribution. BTW, I think a good memory stream
>>>>>> optimization model shouldn't consider small loops as in ldist-12.c,
>>>>>> etc., appropriate for distribution.
>>>>>
>>>>> True. But what means "parallel" here? ldist-13.c if partitioned into two loops
>>>>> can be executed "in parallel"
>>>> So if a loop by itself can be vectorized (or so called can be executed
>>>> paralleled), we tend to no distribute it into small ones. But there
>>>> is one exception here, if the distributed small loops are recognized
>>>> as builtin functions, we still distribute it. I assume it's generally
>>>> better to call builtin memory functions than vectorize it by GCC?
>>>
>>> Yes.
>>>
>>>>>
>>>>>>>
>>>>>>> + Loop distribution is the dual of loop fusion. It separates statements
>>>>>>> + of a loop (or loop nest) into multiple loops (or loop nests) with the
>>>>>>> + same loop header. The major goal is to separate statements which may
>>>>>>> + be vectorized from those that can't. This pass implements distribution
>>>>>>> + in the following steps:
>>>>>>>
>>>>>>> misses the goal of being a memory stream optimization, not only a vectorization
>>>>>>> enabler. distributing a loop can also reduce register pressure.
>>>>>> I will revise the comment, but as explained, enabling more
>>>>>> vectorization is the major criteria for distribution to some extend
>>>>>> now.
>>>>>
>>>>> Yes, I agree -- originally it was written to optimize the stream benchmark IIRC.
>>>> Let's see if any performance drop will be reported against this patch.
>>>> Let's see if we can create a cost model for it.
>>>
>>> Fine.
>> I will run some benchmarks to see if there is breakage.
>>>
>>>>>
>>>>>>>
>>>>>>> You introduce ldist_alias_id in struct loop (probably in 01/n which I
>>>>>>> didn't look
>>>>>>> into yet). If you don't use that please introduce it separately.
>>>>>> Hmm, yes it is introduced in patch [01/n] and set in this patch.
>>>>>>
>>>>>>>
>>>>>>> + /* Be conservative. If data references are not well analyzed,
>>>>>>> + or the two data references have the same base address and
>>>>>>> + offset, add dependence and consider it alias to each other.
>>>>>>> + In other words, the dependence can not be resolved by
>>>>>>> + runtime alias check. */
>>>>>>> + if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
>>>>>>> + || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
>>>>>>> + || !DR_INIT (dr1) || !DR_INIT (dr2)
>>>>>>> + || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
>>>>>>> + || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
>>>>>>> + || res == 0)
>>>>>>>
>>>>>>> ISTR a helper that computes whether we can handle a runtime alias check for
>>>>>>> a specific case?
>>>>>> I guess you mean runtime_alias_check_p that I factored out previously?
>>>>>> Unfortunately, it's factored out vectorizer's usage and doesn't fit
>>>>>> here straightforwardly. Shall I try to further generalize the
>>>>>> interface as independence patch to this one?
>>>>>
>>>>> That would be nice.
>>>>>
>>>>>>>
>>>>>>> + /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS. */
>>>>>>> + if (flag_tree_loop_vectorize)
>>>>>>> + {
>>>>>>>
>>>>>>> so at this point I'd condition the whole runtime-alias check generating
>>>>>>> on flag_tree_loop_vectorize. You seem to support versioning w/o
>>>>>>> that here but in other places disable versioning w/o flag_tree_loop_vectorize.
>>>>>>> That looks somewhat inconsistent...
>>>>>> It is a bit complicated. In function version_for_distribution_p, we have
>>>>>> +
>>>>>> + /* Need to version loop if runtime alias check is necessary. */
>>>>>> + if (alias_ddrs->length () > 0)
>>>>>> + return true;
>>>>>> +
>>>>>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>>>>> + vectorizer is not enable because no other pass can fold it. */
>>>>>> + if (!flag_tree_loop_vectorize)
>>>>>> + return false;
>>>>>> +
>>>>>>
>>>>>> It means we also versioning loops even if runtime alias check is
>>>>>> unnecessary. I did this because we lack cost model and current
>>>>>> distribution may result in too many distribution? If that's the case,
>>>>>> at least vectorizer will remove distributed version loop and fall back
>>>>>> to the original one. Hmm, shall I change it into below code:
>>>>>
>>>>> Hmm, ok - that really ties things to vectorization apart from the
>>>>> builtin recognition. So what happens if the vectorizer can vectorize
>>>>> both the distributed and non-distributed loops?
>>>> Hmm, there is no such cases. Under condition no builtins is
>>>> recognized, we wouldn't distribute loop if by itself can be
>>>> vectorized. Does this make sense? Of course, cost model for memory
>>>> behavior can change this behavior and is wanted.
>>>
>>> So which cases _do_ we split loops then? "more parallelism" -- but what
>>> does that mean exactly? Is there any testcase that shows the desired
>>> splits for vectorization?
>> At least one of distributed loop can be executed paralleled while the
>> original loop can't.
>> Not many, but ldist-26.c is added by one of patch.
>>>
>>>>>
>>>>>> +
>>>>>> + /* Need to version loop if runtime alias check is necessary. */
>>>>>> + if (alias_ddrs->length () == 0)
>>>>>> + return false;
>>>>>> +
>>>>>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>>>>> + vectorizer is not enable because no other pass can fold it. */
>>>>>> + if (!flag_tree_loop_vectorize)
>>>>>> + return false;
>>>>>> +
>>>>>>
>>>>>> Then I can remove the check you mentioned in function
>>>>>> version_loop_by_alias_check?
>>>>>
>>>>> Yeah, I guess that would be easier to understand. Need to update
>>>>> the comment above the alias_ddrs check though.
>>>> Yes the logic here is complicated. On one hand, I want to be
>>>> conservative by versioning with IFN_LOOP_DIST_ALIAS so that vectorizer
>>>> can undo all "unnecessary" distribution before memory behavior is
>>>> modeled.
>>>>
>>>>
>>>>>
>>>>>>>
>>>>>>> + /* Don't version loop if any partition is builtin. */
>>>>>>> + for (i = 0; partitions->iterate (i, &partition); ++i)
>>>>>>> + {
>>>>>>> + if (partition->kind != PKIND_NORMAL)
>>>>>>> + break;
>>>>>>> + }
>>>>>>>
>>>>>>> why's that? Do you handle the case where only a subset of partitions
>>>>>> One reason is I generally consider distributed builtin functions as a
>>>>>> win, thus distribution won't be canceled later in vectorizer. Another
>>>>>> reason is if all distributed loops are recognized as builtins, we
>>>>>> can't really version with current logic because the
>>>>>> IFN_LOOP_DIST_ALIAS won't be folded in vectorizer.
>>>>>
>>>>> Ah, ok. I guess the maze was too twisted for me to see what
>>>>> version_for_distribution_p
>>>>> does ;)
>>>>>
>>>>>>> require a runtime alias check to be generated? Thus from a loop
>>>>>>> generate three loops, only two of them being versioned? The
>>>>>>> complication would be that both the runtime alias and non-alias
>>>>>>> versions would be "transformed". Or do we rely on recursively
>>>>>>> distributing the distribution result, thus if we have partitions that
>>>>>>> can be handled w/o runtime alias check fuse the remaining partitions
>>>>>>> and recurse on those?
>>>>>> No, this is not precisely handled now, the pass will version the whole
>>>>>> loop once. Though I think it's not very difficult to do two stages
>>>>>> distribution, I am not sure how useful it is.
>>>>>
>>>>> Not sure either.
>>>>>
>>>>> So to understand you're looking at loop-distribution as vectorization enabler
>>>>> and pattern detector. I think that is reasonable without a better cost model.
>>>>>
>>>>> Note that I think the state before your patches had the sensible cost-model
>>>>> that it fused very conservatively and just produced the maximum distribution
>>>>> with the idea that the looping overhead itself is cheap. Note that with
>>>>> a more "maximum" distribution the vectorizer also gets the chance to
>>>>> do "partial vectorization" in case profitability is different. Of course the
>>>>> setup cost may offset that in the case all loops end up vectorized...
>>>> Ideally, we have cost model for memory behavior in distribution. If
>>>> we know distribution is beneficial in loop distribution, we can simply
>>>> distribute it; otherwise we pass distribution cost (including memory
>>>> cost as well as runtime alias check cost as an argument to
>>>> IFN_LOOP_DIST_ALIAS), thus vectorizer can measure the cost/benefit
>>>> together with vectorization.
>>>
>>> Yes. The old cost model wasn't really one thus loop distribution was never
>>> enabled by default.
>>>
>
> Hi,
> This is updated patch. It makes below changes as well as renaming
> ldist_alias_id to orig_loop_num.
> 1) Simplifies relation with flag_tree_loop_vectorization. Now it only
> versions loop if runtime alias check is needed.
> 2) Record the new versioned loop as original loop in order to simplify
> dominance working routine. It also makes sense since versioned loop
> is the one same with the original loop.
>
> No change for ChangeLog entry. Bootstrap and test. Is it OK?
>
I've noticed that this patch introduces regressions on armeb:
FAIL: gcc.dg/torture/pr52028.c -O3 -fomit-frame-pointer
-funroll-loops -fpeel-loops -ftracer -finline-functions execution
test
FAIL: gcc.dg/torture/pr52028.c -O3 -g execution test
For instance for
armeb-none-linux-gnueabihf
--with-cpu=cortex-a9
--with-fpu=neon-fp16
Christophe
> Thanks,
> bin
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check
2017-07-10 9:32 ` Christophe Lyon
@ 2017-07-10 9:39 ` Bin.Cheng
2017-07-17 10:06 ` Bin.Cheng
1 sibling, 0 replies; 16+ messages in thread
From: Bin.Cheng @ 2017-07-10 9:39 UTC (permalink / raw)
To: Christophe Lyon; +Cc: Richard Biener, gcc-patches
On Mon, Jul 10, 2017 at 10:32 AM, Christophe Lyon
<christophe.lyon@linaro.org> wrote:
> Hi Bin,
>
> On 30 June 2017 at 12:43, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Wed, Jun 28, 2017 at 2:09 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>> On Wed, Jun 28, 2017 at 1:29 PM, Richard Biener
>>> <richard.guenther@gmail.com> wrote:
>>>> On Wed, Jun 28, 2017 at 1:46 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>> On Wed, Jun 28, 2017 at 11:58 AM, Richard Biener
>>>>> <richard.guenther@gmail.com> wrote:
>>>>>> On Tue, Jun 27, 2017 at 4:07 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>> On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener
>>>>>>> <richard.guenther@gmail.com> wrote:
>>>>>>>> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>>>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>>>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>>>>>>>>> Hi,
>>>>>>>>> Rebased V3 for changes in previous patches. Bootstap and test on
>>>>>>>>> x86_64 and aarch64.
>>>>>>>>
>>>>>>>> why is ldist-12.c no longer distributed? your comment says it doesn't expose
>>>>>>>> more "parallelism" but the point is to reduce memory bandwith requirements
>>>>>>>> which it clearly does.
>>>>>>>>
>>>>>>>> Likewise for -13.c, -14.c. -4.c may be a questionable case but the wording
>>>>>>>> of "parallelism" still confuses me.
>>>>>>>>
>>>>>>>> Can you elaborate on that. Now onto the patch:
>>>>>>> Given we don't model data locality or memory bandwidth, whether
>>>>>>> distribution enables loops that can be executed paralleled becomes the
>>>>>>> major criteria for distribution. BTW, I think a good memory stream
>>>>>>> optimization model shouldn't consider small loops as in ldist-12.c,
>>>>>>> etc., appropriate for distribution.
>>>>>>
>>>>>> True. But what means "parallel" here? ldist-13.c if partitioned into two loops
>>>>>> can be executed "in parallel"
>>>>> So if a loop by itself can be vectorized (or so called can be executed
>>>>> paralleled), we tend to no distribute it into small ones. But there
>>>>> is one exception here, if the distributed small loops are recognized
>>>>> as builtin functions, we still distribute it. I assume it's generally
>>>>> better to call builtin memory functions than vectorize it by GCC?
>>>>
>>>> Yes.
>>>>
>>>>>>
>>>>>>>>
>>>>>>>> + Loop distribution is the dual of loop fusion. It separates statements
>>>>>>>> + of a loop (or loop nest) into multiple loops (or loop nests) with the
>>>>>>>> + same loop header. The major goal is to separate statements which may
>>>>>>>> + be vectorized from those that can't. This pass implements distribution
>>>>>>>> + in the following steps:
>>>>>>>>
>>>>>>>> misses the goal of being a memory stream optimization, not only a vectorization
>>>>>>>> enabler. distributing a loop can also reduce register pressure.
>>>>>>> I will revise the comment, but as explained, enabling more
>>>>>>> vectorization is the major criteria for distribution to some extend
>>>>>>> now.
>>>>>>
>>>>>> Yes, I agree -- originally it was written to optimize the stream benchmark IIRC.
>>>>> Let's see if any performance drop will be reported against this patch.
>>>>> Let's see if we can create a cost model for it.
>>>>
>>>> Fine.
>>> I will run some benchmarks to see if there is breakage.
>>>>
>>>>>>
>>>>>>>>
>>>>>>>> You introduce ldist_alias_id in struct loop (probably in 01/n which I
>>>>>>>> didn't look
>>>>>>>> into yet). If you don't use that please introduce it separately.
>>>>>>> Hmm, yes it is introduced in patch [01/n] and set in this patch.
>>>>>>>
>>>>>>>>
>>>>>>>> + /* Be conservative. If data references are not well analyzed,
>>>>>>>> + or the two data references have the same base address and
>>>>>>>> + offset, add dependence and consider it alias to each other.
>>>>>>>> + In other words, the dependence can not be resolved by
>>>>>>>> + runtime alias check. */
>>>>>>>> + if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
>>>>>>>> + || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
>>>>>>>> + || !DR_INIT (dr1) || !DR_INIT (dr2)
>>>>>>>> + || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
>>>>>>>> + || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
>>>>>>>> + || res == 0)
>>>>>>>>
>>>>>>>> ISTR a helper that computes whether we can handle a runtime alias check for
>>>>>>>> a specific case?
>>>>>>> I guess you mean runtime_alias_check_p that I factored out previously?
>>>>>>> Unfortunately, it's factored out vectorizer's usage and doesn't fit
>>>>>>> here straightforwardly. Shall I try to further generalize the
>>>>>>> interface as independence patch to this one?
>>>>>>
>>>>>> That would be nice.
>>>>>>
>>>>>>>>
>>>>>>>> + /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS. */
>>>>>>>> + if (flag_tree_loop_vectorize)
>>>>>>>> + {
>>>>>>>>
>>>>>>>> so at this point I'd condition the whole runtime-alias check generating
>>>>>>>> on flag_tree_loop_vectorize. You seem to support versioning w/o
>>>>>>>> that here but in other places disable versioning w/o flag_tree_loop_vectorize.
>>>>>>>> That looks somewhat inconsistent...
>>>>>>> It is a bit complicated. In function version_for_distribution_p, we have
>>>>>>> +
>>>>>>> + /* Need to version loop if runtime alias check is necessary. */
>>>>>>> + if (alias_ddrs->length () > 0)
>>>>>>> + return true;
>>>>>>> +
>>>>>>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>>>>>> + vectorizer is not enable because no other pass can fold it. */
>>>>>>> + if (!flag_tree_loop_vectorize)
>>>>>>> + return false;
>>>>>>> +
>>>>>>>
>>>>>>> It means we also versioning loops even if runtime alias check is
>>>>>>> unnecessary. I did this because we lack cost model and current
>>>>>>> distribution may result in too many distribution? If that's the case,
>>>>>>> at least vectorizer will remove distributed version loop and fall back
>>>>>>> to the original one. Hmm, shall I change it into below code:
>>>>>>
>>>>>> Hmm, ok - that really ties things to vectorization apart from the
>>>>>> builtin recognition. So what happens if the vectorizer can vectorize
>>>>>> both the distributed and non-distributed loops?
>>>>> Hmm, there is no such cases. Under condition no builtins is
>>>>> recognized, we wouldn't distribute loop if by itself can be
>>>>> vectorized. Does this make sense? Of course, cost model for memory
>>>>> behavior can change this behavior and is wanted.
>>>>
>>>> So which cases _do_ we split loops then? "more parallelism" -- but what
>>>> does that mean exactly? Is there any testcase that shows the desired
>>>> splits for vectorization?
>>> At least one of distributed loop can be executed paralleled while the
>>> original loop can't.
>>> Not many, but ldist-26.c is added by one of patch.
>>>>
>>>>>>
>>>>>>> +
>>>>>>> + /* Need to version loop if runtime alias check is necessary. */
>>>>>>> + if (alias_ddrs->length () == 0)
>>>>>>> + return false;
>>>>>>> +
>>>>>>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>>>>>> + vectorizer is not enable because no other pass can fold it. */
>>>>>>> + if (!flag_tree_loop_vectorize)
>>>>>>> + return false;
>>>>>>> +
>>>>>>>
>>>>>>> Then I can remove the check you mentioned in function
>>>>>>> version_loop_by_alias_check?
>>>>>>
>>>>>> Yeah, I guess that would be easier to understand. Need to update
>>>>>> the comment above the alias_ddrs check though.
>>>>> Yes the logic here is complicated. On one hand, I want to be
>>>>> conservative by versioning with IFN_LOOP_DIST_ALIAS so that vectorizer
>>>>> can undo all "unnecessary" distribution before memory behavior is
>>>>> modeled.
>>>>>
>>>>>
>>>>>>
>>>>>>>>
>>>>>>>> + /* Don't version loop if any partition is builtin. */
>>>>>>>> + for (i = 0; partitions->iterate (i, &partition); ++i)
>>>>>>>> + {
>>>>>>>> + if (partition->kind != PKIND_NORMAL)
>>>>>>>> + break;
>>>>>>>> + }
>>>>>>>>
>>>>>>>> why's that? Do you handle the case where only a subset of partitions
>>>>>>> One reason is I generally consider distributed builtin functions as a
>>>>>>> win, thus distribution won't be canceled later in vectorizer. Another
>>>>>>> reason is if all distributed loops are recognized as builtins, we
>>>>>>> can't really version with current logic because the
>>>>>>> IFN_LOOP_DIST_ALIAS won't be folded in vectorizer.
>>>>>>
>>>>>> Ah, ok. I guess the maze was too twisted for me to see what
>>>>>> version_for_distribution_p
>>>>>> does ;)
>>>>>>
>>>>>>>> require a runtime alias check to be generated? Thus from a loop
>>>>>>>> generate three loops, only two of them being versioned? The
>>>>>>>> complication would be that both the runtime alias and non-alias
>>>>>>>> versions would be "transformed". Or do we rely on recursively
>>>>>>>> distributing the distribution result, thus if we have partitions that
>>>>>>>> can be handled w/o runtime alias check fuse the remaining partitions
>>>>>>>> and recurse on those?
>>>>>>> No, this is not precisely handled now, the pass will version the whole
>>>>>>> loop once. Though I think it's not very difficult to do two stages
>>>>>>> distribution, I am not sure how useful it is.
>>>>>>
>>>>>> Not sure either.
>>>>>>
>>>>>> So to understand you're looking at loop-distribution as vectorization enabler
>>>>>> and pattern detector. I think that is reasonable without a better cost model.
>>>>>>
>>>>>> Note that I think the state before your patches had the sensible cost-model
>>>>>> that it fused very conservatively and just produced the maximum distribution
>>>>>> with the idea that the looping overhead itself is cheap. Note that with
>>>>>> a more "maximum" distribution the vectorizer also gets the chance to
>>>>>> do "partial vectorization" in case profitability is different. Of course the
>>>>>> setup cost may offset that in the case all loops end up vectorized...
>>>>> Ideally, we have cost model for memory behavior in distribution. If
>>>>> we know distribution is beneficial in loop distribution, we can simply
>>>>> distribute it; otherwise we pass distribution cost (including memory
>>>>> cost as well as runtime alias check cost as an argument to
>>>>> IFN_LOOP_DIST_ALIAS), thus vectorizer can measure the cost/benefit
>>>>> together with vectorization.
>>>>
>>>> Yes. The old cost model wasn't really one thus loop distribution was never
>>>> enabled by default.
>>>>
>>
>> Hi,
>> This is updated patch. It makes below changes as well as renaming
>> ldist_alias_id to orig_loop_num.
>> 1) Simplifies relation with flag_tree_loop_vectorization. Now it only
>> versions loop if runtime alias check is needed.
>> 2) Record the new versioned loop as original loop in order to simplify
>> dominance working routine. It also makes sense since versioned loop
>> is the one same with the original loop.
>>
>> No change for ChangeLog entry. Bootstrap and test. Is it OK?
>>
>
> I've noticed that this patch introduces regressions on armeb:
> FAIL: gcc.dg/torture/pr52028.c -O3 -fomit-frame-pointer
> -funroll-loops -fpeel-loops -ftracer -finline-functions execution
> test
> FAIL: gcc.dg/torture/pr52028.c -O3 -g execution test
Thanks for reporting this, I will create a PR and investigate.
Thanks,
bin
>
> For instance for
> armeb-none-linux-gnueabihf
> --with-cpu=cortex-a9
> --with-fpu=neon-fp16
>
> Christophe
>
>> Thanks,
>> bin
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check
2017-07-10 9:32 ` Christophe Lyon
2017-07-10 9:39 ` Bin.Cheng
@ 2017-07-17 10:06 ` Bin.Cheng
2017-07-17 12:09 ` Christophe Lyon
1 sibling, 1 reply; 16+ messages in thread
From: Bin.Cheng @ 2017-07-17 10:06 UTC (permalink / raw)
To: Christophe Lyon; +Cc: Richard Biener, gcc-patches
On Mon, Jul 10, 2017 at 10:32 AM, Christophe Lyon
<christophe.lyon@linaro.org> wrote:
> Hi Bin,
>
> On 30 June 2017 at 12:43, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Wed, Jun 28, 2017 at 2:09 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>> On Wed, Jun 28, 2017 at 1:29 PM, Richard Biener
>>> <richard.guenther@gmail.com> wrote:
>>>> On Wed, Jun 28, 2017 at 1:46 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>> On Wed, Jun 28, 2017 at 11:58 AM, Richard Biener
>>>>> <richard.guenther@gmail.com> wrote:
>>>>>> On Tue, Jun 27, 2017 at 4:07 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>> On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener
>>>>>>> <richard.guenther@gmail.com> wrote:
>>>>>>>> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>>>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>>>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>>>>>>>>> Hi,
>>>>>>>>> Rebased V3 for changes in previous patches. Bootstap and test on
>>>>>>>>> x86_64 and aarch64.
>>>>>>>>
>>>>>>>> why is ldist-12.c no longer distributed? your comment says it doesn't expose
>>>>>>>> more "parallelism" but the point is to reduce memory bandwith requirements
>>>>>>>> which it clearly does.
>>>>>>>>
>>>>>>>> Likewise for -13.c, -14.c. -4.c may be a questionable case but the wording
>>>>>>>> of "parallelism" still confuses me.
>>>>>>>>
>>>>>>>> Can you elaborate on that. Now onto the patch:
>>>>>>> Given we don't model data locality or memory bandwidth, whether
>>>>>>> distribution enables loops that can be executed paralleled becomes the
>>>>>>> major criteria for distribution. BTW, I think a good memory stream
>>>>>>> optimization model shouldn't consider small loops as in ldist-12.c,
>>>>>>> etc., appropriate for distribution.
>>>>>>
>>>>>> True. But what means "parallel" here? ldist-13.c if partitioned into two loops
>>>>>> can be executed "in parallel"
>>>>> So if a loop by itself can be vectorized (or so called can be executed
>>>>> paralleled), we tend to no distribute it into small ones. But there
>>>>> is one exception here, if the distributed small loops are recognized
>>>>> as builtin functions, we still distribute it. I assume it's generally
>>>>> better to call builtin memory functions than vectorize it by GCC?
>>>>
>>>> Yes.
>>>>
>>>>>>
>>>>>>>>
>>>>>>>> + Loop distribution is the dual of loop fusion. It separates statements
>>>>>>>> + of a loop (or loop nest) into multiple loops (or loop nests) with the
>>>>>>>> + same loop header. The major goal is to separate statements which may
>>>>>>>> + be vectorized from those that can't. This pass implements distribution
>>>>>>>> + in the following steps:
>>>>>>>>
>>>>>>>> misses the goal of being a memory stream optimization, not only a vectorization
>>>>>>>> enabler. distributing a loop can also reduce register pressure.
>>>>>>> I will revise the comment, but as explained, enabling more
>>>>>>> vectorization is the major criteria for distribution to some extend
>>>>>>> now.
>>>>>>
>>>>>> Yes, I agree -- originally it was written to optimize the stream benchmark IIRC.
>>>>> Let's see if any performance drop will be reported against this patch.
>>>>> Let's see if we can create a cost model for it.
>>>>
>>>> Fine.
>>> I will run some benchmarks to see if there is breakage.
>>>>
>>>>>>
>>>>>>>>
>>>>>>>> You introduce ldist_alias_id in struct loop (probably in 01/n which I
>>>>>>>> didn't look
>>>>>>>> into yet). If you don't use that please introduce it separately.
>>>>>>> Hmm, yes it is introduced in patch [01/n] and set in this patch.
>>>>>>>
>>>>>>>>
>>>>>>>> + /* Be conservative. If data references are not well analyzed,
>>>>>>>> + or the two data references have the same base address and
>>>>>>>> + offset, add dependence and consider it alias to each other.
>>>>>>>> + In other words, the dependence can not be resolved by
>>>>>>>> + runtime alias check. */
>>>>>>>> + if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
>>>>>>>> + || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
>>>>>>>> + || !DR_INIT (dr1) || !DR_INIT (dr2)
>>>>>>>> + || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
>>>>>>>> + || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
>>>>>>>> + || res == 0)
>>>>>>>>
>>>>>>>> ISTR a helper that computes whether we can handle a runtime alias check for
>>>>>>>> a specific case?
>>>>>>> I guess you mean runtime_alias_check_p that I factored out previously?
>>>>>>> Unfortunately, it's factored out vectorizer's usage and doesn't fit
>>>>>>> here straightforwardly. Shall I try to further generalize the
>>>>>>> interface as independence patch to this one?
>>>>>>
>>>>>> That would be nice.
>>>>>>
>>>>>>>>
>>>>>>>> + /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS. */
>>>>>>>> + if (flag_tree_loop_vectorize)
>>>>>>>> + {
>>>>>>>>
>>>>>>>> so at this point I'd condition the whole runtime-alias check generating
>>>>>>>> on flag_tree_loop_vectorize. You seem to support versioning w/o
>>>>>>>> that here but in other places disable versioning w/o flag_tree_loop_vectorize.
>>>>>>>> That looks somewhat inconsistent...
>>>>>>> It is a bit complicated. In function version_for_distribution_p, we have
>>>>>>> +
>>>>>>> + /* Need to version loop if runtime alias check is necessary. */
>>>>>>> + if (alias_ddrs->length () > 0)
>>>>>>> + return true;
>>>>>>> +
>>>>>>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>>>>>> + vectorizer is not enable because no other pass can fold it. */
>>>>>>> + if (!flag_tree_loop_vectorize)
>>>>>>> + return false;
>>>>>>> +
>>>>>>>
>>>>>>> It means we also versioning loops even if runtime alias check is
>>>>>>> unnecessary. I did this because we lack cost model and current
>>>>>>> distribution may result in too many distribution? If that's the case,
>>>>>>> at least vectorizer will remove distributed version loop and fall back
>>>>>>> to the original one. Hmm, shall I change it into below code:
>>>>>>
>>>>>> Hmm, ok - that really ties things to vectorization apart from the
>>>>>> builtin recognition. So what happens if the vectorizer can vectorize
>>>>>> both the distributed and non-distributed loops?
>>>>> Hmm, there is no such cases. Under condition no builtins is
>>>>> recognized, we wouldn't distribute loop if by itself can be
>>>>> vectorized. Does this make sense? Of course, cost model for memory
>>>>> behavior can change this behavior and is wanted.
>>>>
>>>> So which cases _do_ we split loops then? "more parallelism" -- but what
>>>> does that mean exactly? Is there any testcase that shows the desired
>>>> splits for vectorization?
>>> At least one of distributed loop can be executed paralleled while the
>>> original loop can't.
>>> Not many, but ldist-26.c is added by one of patch.
>>>>
>>>>>>
>>>>>>> +
>>>>>>> + /* Need to version loop if runtime alias check is necessary. */
>>>>>>> + if (alias_ddrs->length () == 0)
>>>>>>> + return false;
>>>>>>> +
>>>>>>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>>>>>> + vectorizer is not enable because no other pass can fold it. */
>>>>>>> + if (!flag_tree_loop_vectorize)
>>>>>>> + return false;
>>>>>>> +
>>>>>>>
>>>>>>> Then I can remove the check you mentioned in function
>>>>>>> version_loop_by_alias_check?
>>>>>>
>>>>>> Yeah, I guess that would be easier to understand. Need to update
>>>>>> the comment above the alias_ddrs check though.
>>>>> Yes the logic here is complicated. On one hand, I want to be
>>>>> conservative by versioning with IFN_LOOP_DIST_ALIAS so that vectorizer
>>>>> can undo all "unnecessary" distribution before memory behavior is
>>>>> modeled.
>>>>>
>>>>>
>>>>>>
>>>>>>>>
>>>>>>>> + /* Don't version loop if any partition is builtin. */
>>>>>>>> + for (i = 0; partitions->iterate (i, &partition); ++i)
>>>>>>>> + {
>>>>>>>> + if (partition->kind != PKIND_NORMAL)
>>>>>>>> + break;
>>>>>>>> + }
>>>>>>>>
>>>>>>>> why's that? Do you handle the case where only a subset of partitions
>>>>>>> One reason is I generally consider distributed builtin functions as a
>>>>>>> win, thus distribution won't be canceled later in vectorizer. Another
>>>>>>> reason is if all distributed loops are recognized as builtins, we
>>>>>>> can't really version with current logic because the
>>>>>>> IFN_LOOP_DIST_ALIAS won't be folded in vectorizer.
>>>>>>
>>>>>> Ah, ok. I guess the maze was too twisted for me to see what
>>>>>> version_for_distribution_p
>>>>>> does ;)
>>>>>>
>>>>>>>> require a runtime alias check to be generated? Thus from a loop
>>>>>>>> generate three loops, only two of them being versioned? The
>>>>>>>> complication would be that both the runtime alias and non-alias
>>>>>>>> versions would be "transformed". Or do we rely on recursively
>>>>>>>> distributing the distribution result, thus if we have partitions that
>>>>>>>> can be handled w/o runtime alias check fuse the remaining partitions
>>>>>>>> and recurse on those?
>>>>>>> No, this is not precisely handled now, the pass will version the whole
>>>>>>> loop once. Though I think it's not very difficult to do two stages
>>>>>>> distribution, I am not sure how useful it is.
>>>>>>
>>>>>> Not sure either.
>>>>>>
>>>>>> So to understand you're looking at loop-distribution as vectorization enabler
>>>>>> and pattern detector. I think that is reasonable without a better cost model.
>>>>>>
>>>>>> Note that I think the state before your patches had the sensible cost-model
>>>>>> that it fused very conservatively and just produced the maximum distribution
>>>>>> with the idea that the looping overhead itself is cheap. Note that with
>>>>>> a more "maximum" distribution the vectorizer also gets the chance to
>>>>>> do "partial vectorization" in case profitability is different. Of course the
>>>>>> setup cost may offset that in the case all loops end up vectorized...
>>>>> Ideally, we have cost model for memory behavior in distribution. If
>>>>> we know distribution is beneficial in loop distribution, we can simply
>>>>> distribute it; otherwise we pass distribution cost (including memory
>>>>> cost as well as runtime alias check cost as an argument to
>>>>> IFN_LOOP_DIST_ALIAS), thus vectorizer can measure the cost/benefit
>>>>> together with vectorization.
>>>>
>>>> Yes. The old cost model wasn't really one thus loop distribution was never
>>>> enabled by default.
>>>>
>>
>> Hi,
>> This is updated patch. It makes below changes as well as renaming
>> ldist_alias_id to orig_loop_num.
>> 1) Simplifies relation with flag_tree_loop_vectorization. Now it only
>> versions loop if runtime alias check is needed.
>> 2) Record the new versioned loop as original loop in order to simplify
>> dominance working routine. It also makes sense since versioned loop
>> is the one same with the original loop.
>>
>> No change for ChangeLog entry. Bootstrap and test. Is it OK?
>>
>
> I've noticed that this patch introduces regressions on armeb:
> FAIL: gcc.dg/torture/pr52028.c -O3 -fomit-frame-pointer
> -funroll-loops -fpeel-loops -ftracer -finline-functions execution
> test
> FAIL: gcc.dg/torture/pr52028.c -O3 -g execution test
>
> For instance for
> armeb-none-linux-gnueabihf
> --with-cpu=cortex-a9
> --with-fpu=neon-fp16
Hi Christophe,
I am having difficulty in reproducing this one. According to you test
results at
http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/248356/report-build-info.html
http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/248356/armeb-none-linux-gnueabihf/fail-gcc-rh60-armeb-none-linux-gnueabihf-arm-cortex-a9-neon-fp16.txt
It started at least from r248356? Also I didn't get any difference in
generated assembly between r248356/r248342 for my local toolchain.
Maybe because I used different configuration options?
Thanks,
bin
>
> Christophe
>
>> Thanks,
>> bin
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check
2017-07-17 10:06 ` Bin.Cheng
@ 2017-07-17 12:09 ` Christophe Lyon
2017-07-18 9:04 ` Bin.Cheng
0 siblings, 1 reply; 16+ messages in thread
From: Christophe Lyon @ 2017-07-17 12:09 UTC (permalink / raw)
To: Bin.Cheng; +Cc: Richard Biener, gcc-patches
On 17 July 2017 at 12:06, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Mon, Jul 10, 2017 at 10:32 AM, Christophe Lyon
> <christophe.lyon@linaro.org> wrote:
>> Hi Bin,
>>
>> On 30 June 2017 at 12:43, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>> On Wed, Jun 28, 2017 at 2:09 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>> On Wed, Jun 28, 2017 at 1:29 PM, Richard Biener
>>>> <richard.guenther@gmail.com> wrote:
>>>>> On Wed, Jun 28, 2017 at 1:46 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>> On Wed, Jun 28, 2017 at 11:58 AM, Richard Biener
>>>>>> <richard.guenther@gmail.com> wrote:
>>>>>>> On Tue, Jun 27, 2017 at 4:07 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>>> On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener
>>>>>>>> <richard.guenther@gmail.com> wrote:
>>>>>>>>> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>>>>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>>>>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>> Rebased V3 for changes in previous patches. Bootstap and test on
>>>>>>>>>> x86_64 and aarch64.
>>>>>>>>>
>>>>>>>>> why is ldist-12.c no longer distributed? your comment says it doesn't expose
>>>>>>>>> more "parallelism" but the point is to reduce memory bandwith requirements
>>>>>>>>> which it clearly does.
>>>>>>>>>
>>>>>>>>> Likewise for -13.c, -14.c. -4.c may be a questionable case but the wording
>>>>>>>>> of "parallelism" still confuses me.
>>>>>>>>>
>>>>>>>>> Can you elaborate on that. Now onto the patch:
>>>>>>>> Given we don't model data locality or memory bandwidth, whether
>>>>>>>> distribution enables loops that can be executed paralleled becomes the
>>>>>>>> major criteria for distribution. BTW, I think a good memory stream
>>>>>>>> optimization model shouldn't consider small loops as in ldist-12.c,
>>>>>>>> etc., appropriate for distribution.
>>>>>>>
>>>>>>> True. But what means "parallel" here? ldist-13.c if partitioned into two loops
>>>>>>> can be executed "in parallel"
>>>>>> So if a loop by itself can be vectorized (or so called can be executed
>>>>>> paralleled), we tend to no distribute it into small ones. But there
>>>>>> is one exception here, if the distributed small loops are recognized
>>>>>> as builtin functions, we still distribute it. I assume it's generally
>>>>>> better to call builtin memory functions than vectorize it by GCC?
>>>>>
>>>>> Yes.
>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>> + Loop distribution is the dual of loop fusion. It separates statements
>>>>>>>>> + of a loop (or loop nest) into multiple loops (or loop nests) with the
>>>>>>>>> + same loop header. The major goal is to separate statements which may
>>>>>>>>> + be vectorized from those that can't. This pass implements distribution
>>>>>>>>> + in the following steps:
>>>>>>>>>
>>>>>>>>> misses the goal of being a memory stream optimization, not only a vectorization
>>>>>>>>> enabler. distributing a loop can also reduce register pressure.
>>>>>>>> I will revise the comment, but as explained, enabling more
>>>>>>>> vectorization is the major criteria for distribution to some extend
>>>>>>>> now.
>>>>>>>
>>>>>>> Yes, I agree -- originally it was written to optimize the stream benchmark IIRC.
>>>>>> Let's see if any performance drop will be reported against this patch.
>>>>>> Let's see if we can create a cost model for it.
>>>>>
>>>>> Fine.
>>>> I will run some benchmarks to see if there is breakage.
>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>> You introduce ldist_alias_id in struct loop (probably in 01/n which I
>>>>>>>>> didn't look
>>>>>>>>> into yet). If you don't use that please introduce it separately.
>>>>>>>> Hmm, yes it is introduced in patch [01/n] and set in this patch.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> + /* Be conservative. If data references are not well analyzed,
>>>>>>>>> + or the two data references have the same base address and
>>>>>>>>> + offset, add dependence and consider it alias to each other.
>>>>>>>>> + In other words, the dependence can not be resolved by
>>>>>>>>> + runtime alias check. */
>>>>>>>>> + if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
>>>>>>>>> + || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
>>>>>>>>> + || !DR_INIT (dr1) || !DR_INIT (dr2)
>>>>>>>>> + || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
>>>>>>>>> + || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
>>>>>>>>> + || res == 0)
>>>>>>>>>
>>>>>>>>> ISTR a helper that computes whether we can handle a runtime alias check for
>>>>>>>>> a specific case?
>>>>>>>> I guess you mean runtime_alias_check_p that I factored out previously?
>>>>>>>> Unfortunately, it's factored out vectorizer's usage and doesn't fit
>>>>>>>> here straightforwardly. Shall I try to further generalize the
>>>>>>>> interface as independence patch to this one?
>>>>>>>
>>>>>>> That would be nice.
>>>>>>>
>>>>>>>>>
>>>>>>>>> + /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS. */
>>>>>>>>> + if (flag_tree_loop_vectorize)
>>>>>>>>> + {
>>>>>>>>>
>>>>>>>>> so at this point I'd condition the whole runtime-alias check generating
>>>>>>>>> on flag_tree_loop_vectorize. You seem to support versioning w/o
>>>>>>>>> that here but in other places disable versioning w/o flag_tree_loop_vectorize.
>>>>>>>>> That looks somewhat inconsistent...
>>>>>>>> It is a bit complicated. In function version_for_distribution_p, we have
>>>>>>>> +
>>>>>>>> + /* Need to version loop if runtime alias check is necessary. */
>>>>>>>> + if (alias_ddrs->length () > 0)
>>>>>>>> + return true;
>>>>>>>> +
>>>>>>>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>>>>>>> + vectorizer is not enable because no other pass can fold it. */
>>>>>>>> + if (!flag_tree_loop_vectorize)
>>>>>>>> + return false;
>>>>>>>> +
>>>>>>>>
>>>>>>>> It means we also versioning loops even if runtime alias check is
>>>>>>>> unnecessary. I did this because we lack cost model and current
>>>>>>>> distribution may result in too many distribution? If that's the case,
>>>>>>>> at least vectorizer will remove distributed version loop and fall back
>>>>>>>> to the original one. Hmm, shall I change it into below code:
>>>>>>>
>>>>>>> Hmm, ok - that really ties things to vectorization apart from the
>>>>>>> builtin recognition. So what happens if the vectorizer can vectorize
>>>>>>> both the distributed and non-distributed loops?
>>>>>> Hmm, there is no such cases. Under condition no builtins is
>>>>>> recognized, we wouldn't distribute loop if by itself can be
>>>>>> vectorized. Does this make sense? Of course, cost model for memory
>>>>>> behavior can change this behavior and is wanted.
>>>>>
>>>>> So which cases _do_ we split loops then? "more parallelism" -- but what
>>>>> does that mean exactly? Is there any testcase that shows the desired
>>>>> splits for vectorization?
>>>> At least one of distributed loop can be executed paralleled while the
>>>> original loop can't.
>>>> Not many, but ldist-26.c is added by one of patch.
>>>>>
>>>>>>>
>>>>>>>> +
>>>>>>>> + /* Need to version loop if runtime alias check is necessary. */
>>>>>>>> + if (alias_ddrs->length () == 0)
>>>>>>>> + return false;
>>>>>>>> +
>>>>>>>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>>>>>>> + vectorizer is not enable because no other pass can fold it. */
>>>>>>>> + if (!flag_tree_loop_vectorize)
>>>>>>>> + return false;
>>>>>>>> +
>>>>>>>>
>>>>>>>> Then I can remove the check you mentioned in function
>>>>>>>> version_loop_by_alias_check?
>>>>>>>
>>>>>>> Yeah, I guess that would be easier to understand. Need to update
>>>>>>> the comment above the alias_ddrs check though.
>>>>>> Yes the logic here is complicated. On one hand, I want to be
>>>>>> conservative by versioning with IFN_LOOP_DIST_ALIAS so that vectorizer
>>>>>> can undo all "unnecessary" distribution before memory behavior is
>>>>>> modeled.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>> + /* Don't version loop if any partition is builtin. */
>>>>>>>>> + for (i = 0; partitions->iterate (i, &partition); ++i)
>>>>>>>>> + {
>>>>>>>>> + if (partition->kind != PKIND_NORMAL)
>>>>>>>>> + break;
>>>>>>>>> + }
>>>>>>>>>
>>>>>>>>> why's that? Do you handle the case where only a subset of partitions
>>>>>>>> One reason is I generally consider distributed builtin functions as a
>>>>>>>> win, thus distribution won't be canceled later in vectorizer. Another
>>>>>>>> reason is if all distributed loops are recognized as builtins, we
>>>>>>>> can't really version with current logic because the
>>>>>>>> IFN_LOOP_DIST_ALIAS won't be folded in vectorizer.
>>>>>>>
>>>>>>> Ah, ok. I guess the maze was too twisted for me to see what
>>>>>>> version_for_distribution_p
>>>>>>> does ;)
>>>>>>>
>>>>>>>>> require a runtime alias check to be generated? Thus from a loop
>>>>>>>>> generate three loops, only two of them being versioned? The
>>>>>>>>> complication would be that both the runtime alias and non-alias
>>>>>>>>> versions would be "transformed". Or do we rely on recursively
>>>>>>>>> distributing the distribution result, thus if we have partitions that
>>>>>>>>> can be handled w/o runtime alias check fuse the remaining partitions
>>>>>>>>> and recurse on those?
>>>>>>>> No, this is not precisely handled now, the pass will version the whole
>>>>>>>> loop once. Though I think it's not very difficult to do two stages
>>>>>>>> distribution, I am not sure how useful it is.
>>>>>>>
>>>>>>> Not sure either.
>>>>>>>
>>>>>>> So to understand you're looking at loop-distribution as vectorization enabler
>>>>>>> and pattern detector. I think that is reasonable without a better cost model.
>>>>>>>
>>>>>>> Note that I think the state before your patches had the sensible cost-model
>>>>>>> that it fused very conservatively and just produced the maximum distribution
>>>>>>> with the idea that the looping overhead itself is cheap. Note that with
>>>>>>> a more "maximum" distribution the vectorizer also gets the chance to
>>>>>>> do "partial vectorization" in case profitability is different. Of course the
>>>>>>> setup cost may offset that in the case all loops end up vectorized...
>>>>>> Ideally, we have cost model for memory behavior in distribution. If
>>>>>> we know distribution is beneficial in loop distribution, we can simply
>>>>>> distribute it; otherwise we pass distribution cost (including memory
>>>>>> cost as well as runtime alias check cost as an argument to
>>>>>> IFN_LOOP_DIST_ALIAS), thus vectorizer can measure the cost/benefit
>>>>>> together with vectorization.
>>>>>
>>>>> Yes. The old cost model wasn't really one thus loop distribution was never
>>>>> enabled by default.
>>>>>
>>>
>>> Hi,
>>> This is updated patch. It makes below changes as well as renaming
>>> ldist_alias_id to orig_loop_num.
>>> 1) Simplifies relation with flag_tree_loop_vectorization. Now it only
>>> versions loop if runtime alias check is needed.
>>> 2) Record the new versioned loop as original loop in order to simplify
>>> dominance working routine. It also makes sense since versioned loop
>>> is the one same with the original loop.
>>>
>>> No change for ChangeLog entry. Bootstrap and test. Is it OK?
>>>
>>
>> I've noticed that this patch introduces regressions on armeb:
>> FAIL: gcc.dg/torture/pr52028.c -O3 -fomit-frame-pointer
>> -funroll-loops -fpeel-loops -ftracer -finline-functions execution
>> test
>> FAIL: gcc.dg/torture/pr52028.c -O3 -g execution test
>>
>> For instance for
>> armeb-none-linux-gnueabihf
>> --with-cpu=cortex-a9
>> --with-fpu=neon-fp16
> Hi Christophe,
> I am having difficulty in reproducing this one. According to you test
> results at
> http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/248356/report-build-info.html
> http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/248356/armeb-none-linux-gnueabihf/fail-gcc-rh60-armeb-none-linux-gnueabihf-arm-cortex-a9-neon-fp16.txt
>
> It started at least from r248356? Also I didn't get any difference in
> generated assembly between r248356/r248342 for my local toolchain.
> Maybe because I used different configuration options?
>
Hi Bin,
Sorry I should have shared more details.
I noticed the regression when you committed this patch (r249994):
http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/249994/report-build-info.html
Checking in my validation logs, it was also present
between r248356 and r249990 (your patch 9/13 Simply cost model merges
partitions with the same references)
and is failing since r249994 (this patch)
Was it "fixed" by accident by r249990 and latent until r249994 ?
HTH
Thanks,
Christophe
> Thanks,
> bin
>>
>> Christophe
>>
>>> Thanks,
>>> bin
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check
2017-07-17 12:09 ` Christophe Lyon
@ 2017-07-18 9:04 ` Bin.Cheng
0 siblings, 0 replies; 16+ messages in thread
From: Bin.Cheng @ 2017-07-18 9:04 UTC (permalink / raw)
To: Christophe Lyon; +Cc: Richard Biener, gcc-patches
On Mon, Jul 17, 2017 at 1:09 PM, Christophe Lyon
<christophe.lyon@linaro.org> wrote:
> On 17 July 2017 at 12:06, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Mon, Jul 10, 2017 at 10:32 AM, Christophe Lyon
>> <christophe.lyon@linaro.org> wrote:
>>> Hi Bin,
>>>
>>> On 30 June 2017 at 12:43, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>> On Wed, Jun 28, 2017 at 2:09 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>> On Wed, Jun 28, 2017 at 1:29 PM, Richard Biener
>>>>> <richard.guenther@gmail.com> wrote:
>>>>>> On Wed, Jun 28, 2017 at 1:46 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>> On Wed, Jun 28, 2017 at 11:58 AM, Richard Biener
>>>>>>> <richard.guenther@gmail.com> wrote:
>>>>>>>> On Tue, Jun 27, 2017 at 4:07 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>>>> On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener
>>>>>>>>> <richard.guenther@gmail.com> wrote:
>>>>>>>>>> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>>>>>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>>>>>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>> Rebased V3 for changes in previous patches. Bootstap and test on
>>>>>>>>>>> x86_64 and aarch64.
>>>>>>>>>>
>>>>>>>>>> why is ldist-12.c no longer distributed? your comment says it doesn't expose
>>>>>>>>>> more "parallelism" but the point is to reduce memory bandwith requirements
>>>>>>>>>> which it clearly does.
>>>>>>>>>>
>>>>>>>>>> Likewise for -13.c, -14.c. -4.c may be a questionable case but the wording
>>>>>>>>>> of "parallelism" still confuses me.
>>>>>>>>>>
>>>>>>>>>> Can you elaborate on that. Now onto the patch:
>>>>>>>>> Given we don't model data locality or memory bandwidth, whether
>>>>>>>>> distribution enables loops that can be executed paralleled becomes the
>>>>>>>>> major criteria for distribution. BTW, I think a good memory stream
>>>>>>>>> optimization model shouldn't consider small loops as in ldist-12.c,
>>>>>>>>> etc., appropriate for distribution.
>>>>>>>>
>>>>>>>> True. But what means "parallel" here? ldist-13.c if partitioned into two loops
>>>>>>>> can be executed "in parallel"
>>>>>>> So if a loop by itself can be vectorized (or so called can be executed
>>>>>>> paralleled), we tend to no distribute it into small ones. But there
>>>>>>> is one exception here, if the distributed small loops are recognized
>>>>>>> as builtin functions, we still distribute it. I assume it's generally
>>>>>>> better to call builtin memory functions than vectorize it by GCC?
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> + Loop distribution is the dual of loop fusion. It separates statements
>>>>>>>>>> + of a loop (or loop nest) into multiple loops (or loop nests) with the
>>>>>>>>>> + same loop header. The major goal is to separate statements which may
>>>>>>>>>> + be vectorized from those that can't. This pass implements distribution
>>>>>>>>>> + in the following steps:
>>>>>>>>>>
>>>>>>>>>> misses the goal of being a memory stream optimization, not only a vectorization
>>>>>>>>>> enabler. distributing a loop can also reduce register pressure.
>>>>>>>>> I will revise the comment, but as explained, enabling more
>>>>>>>>> vectorization is the major criteria for distribution to some extend
>>>>>>>>> now.
>>>>>>>>
>>>>>>>> Yes, I agree -- originally it was written to optimize the stream benchmark IIRC.
>>>>>>> Let's see if any performance drop will be reported against this patch.
>>>>>>> Let's see if we can create a cost model for it.
>>>>>>
>>>>>> Fine.
>>>>> I will run some benchmarks to see if there is breakage.
>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> You introduce ldist_alias_id in struct loop (probably in 01/n which I
>>>>>>>>>> didn't look
>>>>>>>>>> into yet). If you don't use that please introduce it separately.
>>>>>>>>> Hmm, yes it is introduced in patch [01/n] and set in this patch.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> + /* Be conservative. If data references are not well analyzed,
>>>>>>>>>> + or the two data references have the same base address and
>>>>>>>>>> + offset, add dependence and consider it alias to each other.
>>>>>>>>>> + In other words, the dependence can not be resolved by
>>>>>>>>>> + runtime alias check. */
>>>>>>>>>> + if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
>>>>>>>>>> + || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
>>>>>>>>>> + || !DR_INIT (dr1) || !DR_INIT (dr2)
>>>>>>>>>> + || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
>>>>>>>>>> + || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
>>>>>>>>>> + || res == 0)
>>>>>>>>>>
>>>>>>>>>> ISTR a helper that computes whether we can handle a runtime alias check for
>>>>>>>>>> a specific case?
>>>>>>>>> I guess you mean runtime_alias_check_p that I factored out previously?
>>>>>>>>> Unfortunately, it's factored out vectorizer's usage and doesn't fit
>>>>>>>>> here straightforwardly. Shall I try to further generalize the
>>>>>>>>> interface as independence patch to this one?
>>>>>>>>
>>>>>>>> That would be nice.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> + /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS. */
>>>>>>>>>> + if (flag_tree_loop_vectorize)
>>>>>>>>>> + {
>>>>>>>>>>
>>>>>>>>>> so at this point I'd condition the whole runtime-alias check generating
>>>>>>>>>> on flag_tree_loop_vectorize. You seem to support versioning w/o
>>>>>>>>>> that here but in other places disable versioning w/o flag_tree_loop_vectorize.
>>>>>>>>>> That looks somewhat inconsistent...
>>>>>>>>> It is a bit complicated. In function version_for_distribution_p, we have
>>>>>>>>> +
>>>>>>>>> + /* Need to version loop if runtime alias check is necessary. */
>>>>>>>>> + if (alias_ddrs->length () > 0)
>>>>>>>>> + return true;
>>>>>>>>> +
>>>>>>>>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>>>>>>>> + vectorizer is not enable because no other pass can fold it. */
>>>>>>>>> + if (!flag_tree_loop_vectorize)
>>>>>>>>> + return false;
>>>>>>>>> +
>>>>>>>>>
>>>>>>>>> It means we also versioning loops even if runtime alias check is
>>>>>>>>> unnecessary. I did this because we lack cost model and current
>>>>>>>>> distribution may result in too many distribution? If that's the case,
>>>>>>>>> at least vectorizer will remove distributed version loop and fall back
>>>>>>>>> to the original one. Hmm, shall I change it into below code:
>>>>>>>>
>>>>>>>> Hmm, ok - that really ties things to vectorization apart from the
>>>>>>>> builtin recognition. So what happens if the vectorizer can vectorize
>>>>>>>> both the distributed and non-distributed loops?
>>>>>>> Hmm, there is no such cases. Under condition no builtins is
>>>>>>> recognized, we wouldn't distribute loop if by itself can be
>>>>>>> vectorized. Does this make sense? Of course, cost model for memory
>>>>>>> behavior can change this behavior and is wanted.
>>>>>>
>>>>>> So which cases _do_ we split loops then? "more parallelism" -- but what
>>>>>> does that mean exactly? Is there any testcase that shows the desired
>>>>>> splits for vectorization?
>>>>> At least one of distributed loop can be executed paralleled while the
>>>>> original loop can't.
>>>>> Not many, but ldist-26.c is added by one of patch.
>>>>>>
>>>>>>>>
>>>>>>>>> +
>>>>>>>>> + /* Need to version loop if runtime alias check is necessary. */
>>>>>>>>> + if (alias_ddrs->length () == 0)
>>>>>>>>> + return false;
>>>>>>>>> +
>>>>>>>>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>>>>>>>> + vectorizer is not enable because no other pass can fold it. */
>>>>>>>>> + if (!flag_tree_loop_vectorize)
>>>>>>>>> + return false;
>>>>>>>>> +
>>>>>>>>>
>>>>>>>>> Then I can remove the check you mentioned in function
>>>>>>>>> version_loop_by_alias_check?
>>>>>>>>
>>>>>>>> Yeah, I guess that would be easier to understand. Need to update
>>>>>>>> the comment above the alias_ddrs check though.
>>>>>>> Yes the logic here is complicated. On one hand, I want to be
>>>>>>> conservative by versioning with IFN_LOOP_DIST_ALIAS so that vectorizer
>>>>>>> can undo all "unnecessary" distribution before memory behavior is
>>>>>>> modeled.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> + /* Don't version loop if any partition is builtin. */
>>>>>>>>>> + for (i = 0; partitions->iterate (i, &partition); ++i)
>>>>>>>>>> + {
>>>>>>>>>> + if (partition->kind != PKIND_NORMAL)
>>>>>>>>>> + break;
>>>>>>>>>> + }
>>>>>>>>>>
>>>>>>>>>> why's that? Do you handle the case where only a subset of partitions
>>>>>>>>> One reason is I generally consider distributed builtin functions as a
>>>>>>>>> win, thus distribution won't be canceled later in vectorizer. Another
>>>>>>>>> reason is if all distributed loops are recognized as builtins, we
>>>>>>>>> can't really version with current logic because the
>>>>>>>>> IFN_LOOP_DIST_ALIAS won't be folded in vectorizer.
>>>>>>>>
>>>>>>>> Ah, ok. I guess the maze was too twisted for me to see what
>>>>>>>> version_for_distribution_p
>>>>>>>> does ;)
>>>>>>>>
>>>>>>>>>> require a runtime alias check to be generated? Thus from a loop
>>>>>>>>>> generate three loops, only two of them being versioned? The
>>>>>>>>>> complication would be that both the runtime alias and non-alias
>>>>>>>>>> versions would be "transformed". Or do we rely on recursively
>>>>>>>>>> distributing the distribution result, thus if we have partitions that
>>>>>>>>>> can be handled w/o runtime alias check fuse the remaining partitions
>>>>>>>>>> and recurse on those?
>>>>>>>>> No, this is not precisely handled now, the pass will version the whole
>>>>>>>>> loop once. Though I think it's not very difficult to do two stages
>>>>>>>>> distribution, I am not sure how useful it is.
>>>>>>>>
>>>>>>>> Not sure either.
>>>>>>>>
>>>>>>>> So to understand you're looking at loop-distribution as vectorization enabler
>>>>>>>> and pattern detector. I think that is reasonable without a better cost model.
>>>>>>>>
>>>>>>>> Note that I think the state before your patches had the sensible cost-model
>>>>>>>> that it fused very conservatively and just produced the maximum distribution
>>>>>>>> with the idea that the looping overhead itself is cheap. Note that with
>>>>>>>> a more "maximum" distribution the vectorizer also gets the chance to
>>>>>>>> do "partial vectorization" in case profitability is different. Of course the
>>>>>>>> setup cost may offset that in the case all loops end up vectorized...
>>>>>>> Ideally, we have cost model for memory behavior in distribution. If
>>>>>>> we know distribution is beneficial in loop distribution, we can simply
>>>>>>> distribute it; otherwise we pass distribution cost (including memory
>>>>>>> cost as well as runtime alias check cost as an argument to
>>>>>>> IFN_LOOP_DIST_ALIAS), thus vectorizer can measure the cost/benefit
>>>>>>> together with vectorization.
>>>>>>
>>>>>> Yes. The old cost model wasn't really one thus loop distribution was never
>>>>>> enabled by default.
>>>>>>
>>>>
>>>> Hi,
>>>> This is updated patch. It makes below changes as well as renaming
>>>> ldist_alias_id to orig_loop_num.
>>>> 1) Simplifies relation with flag_tree_loop_vectorization. Now it only
>>>> versions loop if runtime alias check is needed.
>>>> 2) Record the new versioned loop as original loop in order to simplify
>>>> dominance working routine. It also makes sense since versioned loop
>>>> is the one same with the original loop.
>>>>
>>>> No change for ChangeLog entry. Bootstrap and test. Is it OK?
>>>>
>>>
>>> I've noticed that this patch introduces regressions on armeb:
>>> FAIL: gcc.dg/torture/pr52028.c -O3 -fomit-frame-pointer
>>> -funroll-loops -fpeel-loops -ftracer -finline-functions execution
>>> test
>>> FAIL: gcc.dg/torture/pr52028.c -O3 -g execution test
>>>
>>> For instance for
>>> armeb-none-linux-gnueabihf
>>> --with-cpu=cortex-a9
>>> --with-fpu=neon-fp16
>> Hi Christophe,
>> I am having difficulty in reproducing this one. According to you test
>> results at
>> http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/248356/report-build-info.html
>> http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/248356/armeb-none-linux-gnueabihf/fail-gcc-rh60-armeb-none-linux-gnueabihf-arm-cortex-a9-neon-fp16.txt
>>
>> It started at least from r248356? Also I didn't get any difference in
>> generated assembly between r248356/r248342 for my local toolchain.
>> Maybe because I used different configuration options?
>>
>
> Hi Bin,
>
> Sorry I should have shared more details.
> I noticed the regression when you committed this patch (r249994):
> http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/249994/report-build-info.html
>
> Checking in my validation logs, it was also present
> between r248356 and r249990 (your patch 9/13 Simply cost model merges
> partitions with the same references)
> and is failing since r249994 (this patch)
>
> Was it "fixed" by accident by r249990 and latent until r249994 ?
Ah, thanks very much for the information. Most likely it was latent
because of temporarily conservative loop distribution. I filed
PR81472 for tracking.
Thanks,
bin
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2017-07-18 9:04 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-12 17:03 [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check Bin Cheng
2017-06-20 9:22 ` Bin.Cheng
2017-06-23 10:30 ` Bin.Cheng
2017-06-27 12:44 ` Richard Biener
2017-06-27 14:07 ` Bin.Cheng
2017-06-28 10:58 ` Richard Biener
2017-06-28 11:46 ` Bin.Cheng
2017-06-28 12:29 ` Richard Biener
2017-06-28 13:09 ` Bin.Cheng
2017-06-30 10:43 ` Bin.Cheng
2017-07-03 11:50 ` Richard Biener
2017-07-10 9:32 ` Christophe Lyon
2017-07-10 9:39 ` Bin.Cheng
2017-07-17 10:06 ` Bin.Cheng
2017-07-17 12:09 ` Christophe Lyon
2017-07-18 9:04 ` Bin.Cheng
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).