* [gomp4] Preserve NVPTX "reconvergence" points
@ 2015-05-28 14:20 Julian Brown
  2015-05-28 14:59 ` Jakub Jelinek
  2015-05-28 15:02 ` Richard Biener
  0 siblings, 2 replies; 23+ messages in thread
From: Julian Brown @ 2015-05-28 14:20 UTC (permalink / raw)
  To: gcc-patches, Bernd Schmidt, Jakub Jelinek, Thomas Schwinge

[-- Attachment #1: Type: text/plain, Size: 6991 bytes --]

For NVPTX, it is vitally important that the divergence of threads
within a warp can be controlled: in particular we must be able to
generate code that we know "reconverges" at a particular point.
Unfortunately GCC's middle-end optimisers can cause this property to
be violated, which causes problems for the OpenACC execution model
we're planning to use for NVPTX.

As a brief example: code running in vector-single mode runs on a
single thread of a warp, and must broadcast condition results to the
other threads of the warp so that they can "follow along" and be
ready for vector-partitioned execution when necessary.

  #pragma acc parallel
  {
    #pragma acc loop gang
    for (i = 0; i < N; i++)
      {
        /* This is vector-single mode.  */
        n = ...;
        switch (n)
          {
          case 1:
            #pragma acc loop vector
            for (...)
              {
                /* This is vector-partitioned mode.  */
              }
            ...
          }
      }
  }

Here, the calculation "n = ..." takes place on a single thread (of
each partitioned gang of the outer loop), but the switch statement
(terminating the BB) must be executed by all threads in the warp.
The vector-single statements will be translated with a branch that
the "idle" threads take around them:

  if (threadIdx.x == 0)
    {
      n_0 = ...;
    }
  n_x = broadcast (n_0)
  switch (n_x)
  ...

Here "broadcast" is an operation that transfers values from some
other thread of a warp (i.e., the zeroth) to the current thread
(implemented as a "shfl" instruction for NVPTX).

I observed an example similar to this in which the broadcast and
switch instructions were cloned (in the .dom1 dump), along the lines
of:

  if (threadIdx.x == 0)
    {
      n_0 = ...;
      n_x = broadcast (n_0)
      switch (n_x)
      ...
    }
  else
    {
      n_x = broadcast (n_0)
      switch (n_x)
      ...
    }

This doesn't work, because the "broadcast" operation has to be run
with non-diverged warps for correct operation, and here there is
divergence due to the "if (threadIdx.x == 0)" condition.
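To make the constraint concrete, here is a minimal CUDA-style sketch
of the two shapes above, using the warp-shuffle intrinsic (__shfl, as
spelled before CUDA 9) that implements the "broadcast" operation on
NVPTX. The function names and the single-warp simplification are
illustrative only, not part of the patch:

  /* Correct shape: the shuffle executes with the warp converged, so
     every lane receives lane 0's value of n.  */
  __device__ int broadcast_ok (const int *src)
  {
    int n = 0;
    if (threadIdx.x == 0)
      n = *src;          /* Vector-single work on a single lane.  */
    n = __shfl (n, 0);   /* All lanes participate: well-defined.  */
    return n;
  }

  /* Broken shape, corresponding to the cloned CFG above: each copy of
     the shuffle executes under a divergent branch, so lane 0 and the
     remaining lanes never participate in the same __shfl, and reading
     lane 0's value is undefined.  */
  __device__ int broadcast_bad (const int *src)
  {
    int n = 0;
    if (threadIdx.x == 0)
      {
        n = *src;
        n = __shfl (n, 0);  /* Only lane 0 reaches this copy.  */
      }
    else
      n = __shfl (n, 0);    /* Lane 0 never reaches this copy.  */
    return n;
  }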
So, the way I have tried to handle this is by attempting to inhibit
optimisation along edges which have a reconvergence point as their
destination.  The essential idea is to make such edges "abnormal",
although the existing EDGE_ABNORMAL flag is not used, because that
already has implicit meaning built into it, and the new edge type may
need to be handled differently in some areas.  One example is that at
present, blocks concluding with GIMPLE_COND cannot have EDGE_ABNORMAL
set on their EDGE_TRUE or EDGE_FALSE outgoing edges.

The attached patch introduces a new edge flag
(EDGE_TO_RECONVERGENCE), for the GIMPLE CFG only.  In principle
there's nothing to stop the flag being propagated to the RTL CFG
also, in which case it'd probably be set at the same time as
EDGE_ABNORMAL, mirroring the semantics of e.g. EDGE_EH,
EDGE_ABNORMAL_CALL and EDGE_SIBCALL.  Then, passes which inspect the
RTL CFG can continue to check only the ABNORMAL flag.  But so far (in
rather limited testing!) that has not been observed to be necessary.
(We can control RTL CFG manipulation indirectly by using the
CANNOT_COPY_INSN_P target hook, sensitive e.g. to the "broadcast"
instruction.)

For the GIMPLE CFG (i.e. in passes operating on GIMPLE form),
EDGE_TO_RECONVERGENCE behaves mostly the same as EDGE_ABNORMAL (i.e.,
inhibiting certain optimisations), and so has been added to relevant
conditionals largely mechanically.  Places where it is treated
specially are:

* tree-cfg.c:gimple_verify_flow_info does not permit EDGE_ABNORMAL on
  outgoing edges of a block concluding with a GIMPLE_COND statement,
  but we allow EDGE_TO_RECONVERGENCE there.

* tree-vrp.c:find_conditional_asserts skips over outgoing GIMPLE_COND
  edges with EDGE_TO_RECONVERGENCE set (avoiding an ICE when the pass
  tries to split the edge later).

There are probably other optimisations that will be tripped up by the
new flag along the same lines as the VRP tweak above, which we will
no doubt discover in due course.

Together with the patch
https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02612.html, this shows
no regressions for the libgomp tests.

OK for gomp4 branch?

Thanks,

Julian

ChangeLog

    gcc/
    * basic-block.h (EDGE_COMPLEX): Add EDGE_TO_RECONVERGENCE flag.
    (bb_has_abnorm_or_reconv_pred): New function.
    (has_abnormal_or_eh_outgoing_edge_p): Consider
    EDGE_TO_RECONVERGENCE also.
    * cfg-flags.def (TO_RECONVERGENCE): Add flag.
    * omp-low.c (predicate_bb): Set EDGE_TO_RECONVERGENCE on edges
    leading to a reconvergence point.
    * cfgbuild.c (purge_dead_tablejump_edges): Consider
    EDGE_TO_RECONVERGENCE.
    * cfgcleanup.c (try_crossjump_to_edge, try_head_merge_bb): Likewise.
    * cfgexpand.c (expand_gimple_tailcall, construct_exit_block)
    (pass_expand::execute): Likewise.
    * cfghooks.c (can_copy_bbs_p): Likewise.
    * cfgloop.c (bb_loop_header_p): Likewise.
    * cfgloopmanip.c (scale_loop_profile): Likewise.
    * gimple-iterator.c (gimple_find_edge_insert_loc): Likewise.
    * graph.c (draw_cfg_node_succ_edges): Likewise.
    * graphite-scop-detection.c (canonicalize_loop_closed_ssa):
    Likewise.
    * predict.c (tree_bb_level_predictions): Likewise.
    * profile.c (instrument_edges, branch_prob, find_spanning_tree):
    Likewise.
    * tree-cfg.c (replace_uses_by, gimple_split_edge)
    (gimple_redirect_edge_and_branch, split_critical_edges): Likewise.
    * tree-cfgcleanup.c (tree_forwarder_block_p, remove_forwarder_block)
    (pass_merge_phi::execute): Likewise.
    * tree-chkp.c (chkp_fix_cfg): Likewise.
    * tree-if-conv.c (if_convertible_bb_p): Likewise.
    * tree-inline.c (update_ssa_across_abnormal_edges): Likewise.
    * tree-into-ssa.c (rewrite_update_phi_arguments)
    (rewrite_update_dom_walker::before_dom_children)
    (create_new_def_for): Likewise.
    * tree-outof-ssa.c (eliminate_phi): Likewise.
    * tree-phinodes.c (add_phi_arg): Likewise.
    * tree-ssa-coalesce.c (coalesce_cost_edge, create_outofssa_var_map)
    (coalesce_partitions): Likewise.
    * tree-ssa-dom.c (cprop_into_successor_phis)
    (dom_opt_dom_walker::after_dom_children, propagate_rhs_into_lhs):
    Likewise.
    * tree-ssa-loop-im.c (loop_suitable_for_sm): Likewise.
    * tree-ssa-loop-prefetch.c (emit_mfence_after_loop)
    (may_use_storent_in_loop_p): Likewise.
    * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Likewise.
    * tree-ssa-pre.c (compute_antic, insert_into_preds_of_block):
    Likewise.
    * tree-ssa-propagate.c (simulate_block, replace_phi_args_in):
    Likewise.
    * tree-ssa-sink.c (sink_code_in_bb): Likewise.
    * tree-ssa-threadedge.c (thread_across_edge): Likewise.
    * tree-ssa-threadupdate.c (thread_single_edge): Likewise.
    * tree-ssa-uninit.c (compute_control_dep_chain): Likewise.
    * tree-ssa.c (verify_phi_args): Likewise.
    * tree-vect-loop.c (vect_analyze_loop_form): Likewise.
    * value-prof.c (gimple_ic): Likewise.
    * tree-vrp.c (infer_value_range, process_assert_insertions_for):
    Likewise.
(find_conditional_asserts): Skip over EDGE_TO_RECONVERGENCE edges. [-- Attachment #2: to-reconvergence-4.diff --] [-- Type: text/x-patch, Size: 31591 bytes --] commit 472bd543b30356f7a4c59efc961f9f61b11ca197 Author: Julian Brown <julian@codesourcery.com> Date: Wed May 20 11:35:45 2015 -0700 Introduce EDGE_TO_RECONVERGENCE, and tweak some uses of EDGE_ABNORMAL. diff --git a/gcc/basic-block.h b/gcc/basic-block.h index f28fa57..7fe25f0 100644 --- a/gcc/basic-block.h +++ b/gcc/basic-block.h @@ -70,7 +70,8 @@ enum cfg_edge_flags { Test the edge flags on EDGE_COMPLEX to detect all forms of "strange" control flow transfers. */ #define EDGE_COMPLEX \ - (EDGE_ABNORMAL | EDGE_ABNORMAL_CALL | EDGE_EH | EDGE_PRESERVE) + (EDGE_ABNORMAL | EDGE_ABNORMAL_CALL | EDGE_EH | EDGE_PRESERVE \ + | EDGE_TO_RECONVERGENCE) struct GTY(()) rtl_bb_info { /* The first insn of the block is embedded into bb->il.x. */ @@ -559,6 +560,20 @@ bb_has_abnormal_pred (basic_block bb) return false; } +static inline bool +bb_has_abnorm_or_reconv_pred (basic_block bb) +{ + edge e; + edge_iterator ei; + + FOR_EACH_EDGE (e, ei, bb->preds) + { + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) + return true; + } + return false; +} + /* Return the fallthru edge in EDGES if it exists, NULL otherwise. */ static inline edge find_fallthru_edge (vec<edge, va_gc> *edges) @@ -629,9 +644,10 @@ has_abnormal_or_eh_outgoing_edge_p (basic_block bb) edge_iterator ei; FOR_EACH_EDGE (e, ei, bb->succs) - if (e->flags & (EDGE_ABNORMAL | EDGE_EH)) + if (e->flags & (EDGE_ABNORMAL | EDGE_EH | EDGE_TO_RECONVERGENCE)) return true; return false; } + #endif /* GCC_BASIC_BLOCK_H */ diff --git a/gcc/cfg-flags.def b/gcc/cfg-flags.def index eedcd69..fd51e2f 100644 --- a/gcc/cfg-flags.def +++ b/gcc/cfg-flags.def @@ -177,6 +177,10 @@ DEF_EDGE_FLAG(TM_UNINSTRUMENTED, 15) /* Abort (over) edge out of a GIMPLE_TRANSACTION statement. */ DEF_EDGE_FLAG(TM_ABORT, 16) +/* An "immutable" edge to an OpenACC (currently, NVPTX) reconvergence point. + This flag is only used for the GIMPLE CFG. */ +DEF_EDGE_FLAG(TO_RECONVERGENCE, 17) + #endif /* diff --git a/gcc/cfgbuild.c b/gcc/cfgbuild.c index 7cbed50..7185f07 100644 --- a/gcc/cfgbuild.c +++ b/gcc/cfgbuild.c @@ -449,7 +449,7 @@ purge_dead_tablejump_edges (basic_block bb, rtx_jump_table_data *table) if (FULL_STATE (e->dest) & BLOCK_USED_BY_TABLEJUMP) SET_STATE (e->dest, FULL_STATE (e->dest) & ~(size_t) BLOCK_USED_BY_TABLEJUMP); - else if (!(e->flags & (EDGE_ABNORMAL | EDGE_EH))) + else if (!(e->flags & (EDGE_ABNORMAL | EDGE_EH | EDGE_TO_RECONVERGENCE))) { remove_edge (e); continue; diff --git a/gcc/cfgcleanup.c b/gcc/cfgcleanup.c index 797d14a..e73062a 100644 --- a/gcc/cfgcleanup.c +++ b/gcc/cfgcleanup.c @@ -2031,7 +2031,7 @@ try_crossjump_to_edge (int mode, edge e1, edge e2, /* Avoid deleting preserve label when redirecting ABNORMAL edges. 
*/ if (block_has_preserve_label (e1->dest) - && (e1->flags & EDGE_ABNORMAL)) + && (e1->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE))) return false; /* Here we know that the insns in the end of SRC1 which are common with SRC2 @@ -2389,7 +2389,7 @@ try_head_merge_bb (basic_block bb) return false; } - if (e->flags & EDGE_ABNORMAL) + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) return false; /* Normally, all destination blocks must only be reachable from this diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c index 5905ddb..688158b 100644 --- a/gcc/cfgexpand.c +++ b/gcc/cfgexpand.c @@ -3569,7 +3569,7 @@ expand_gimple_tailcall (basic_block bb, gcall *stmt, bool *can_fallthru) for (ei = ei_start (bb->succs); (e = ei_safe_edge (ei)); ) { - if (!(e->flags & (EDGE_ABNORMAL | EDGE_EH))) + if (!(e->flags & (EDGE_ABNORMAL | EDGE_EH | EDGE_TO_RECONVERGENCE))) { if (e->dest != EXIT_BLOCK_PTR_FOR_FN (cfun)) { @@ -5674,7 +5674,7 @@ construct_exit_block (void) while (ix < EDGE_COUNT (EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)) { e = EDGE_PRED (EXIT_BLOCK_PTR_FOR_FN (cfun), ix); - if (!(e->flags & EDGE_ABNORMAL)) + if (!(e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE))) redirect_edge_succ (e, exit_block); else ix++; @@ -6222,7 +6222,7 @@ pass_expand::execute (function *fun) representation. It is safe to remove them here as find_many_sub_basic_blocks will rediscover them. In the future we should get this fixed properly. */ - if ((e->flags & EDGE_ABNORMAL) + if ((e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) && !(e->flags & EDGE_SIBCALL)) remove_edge (e); else diff --git a/gcc/cfghooks.c b/gcc/cfghooks.c index fc23edb..64483f4 100644 --- a/gcc/cfghooks.c +++ b/gcc/cfghooks.c @@ -1304,7 +1304,7 @@ can_copy_bbs_p (basic_block *bbs, unsigned n) /* In case we should redirect abnormal edge during duplication, fail. */ edge_iterator ei; FOR_EACH_EDGE (e, ei, bbs[i]->succs) - if ((e->flags & EDGE_ABNORMAL) + if ((e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) && (e->dest->flags & BB_DUPLICATED)) { ret = false; diff --git a/gcc/cfgloop.c b/gcc/cfgloop.c index 5767494..d4a1c93 100644 --- a/gcc/cfgloop.c +++ b/gcc/cfgloop.c @@ -387,7 +387,7 @@ bb_loop_header_p (basic_block header) /* If we have an abnormal predecessor, do not consider the loop (not worth the problems). */ - if (bb_has_abnormal_pred (header)) + if (bb_has_abnorm_or_reconv_pred (header)) return false; /* Look for back edges where a predecessor is dominated diff --git a/gcc/cfgloopmanip.c b/gcc/cfgloopmanip.c index 45cc85d..8bbf4cb 100644 --- a/gcc/cfgloopmanip.c +++ b/gcc/cfgloopmanip.c @@ -554,7 +554,8 @@ scale_loop_profile (struct loop *loop, int scale, gcov_type iteration_bound) gcov_type count_delta; FOR_EACH_EDGE (other_e, ei, e->src->succs) - if (!(other_e->flags & (EDGE_ABNORMAL | EDGE_FAKE)) + if (!(other_e->flags & (EDGE_ABNORMAL | EDGE_FAKE + | EDGE_TO_RECONVERGENCE)) && e != other_e) break; diff --git a/gcc/gimple-iterator.c b/gcc/gimple-iterator.c index df29123..bcb6649 100644 --- a/gcc/gimple-iterator.c +++ b/gcc/gimple-iterator.c @@ -792,7 +792,7 @@ gimple_find_edge_insert_loc (edge e, gimple_stmt_iterator *gsi, the last statement does not end a basic block, insert there. Except for the entry block. 
*/ src = e->src; - if ((e->flags & EDGE_ABNORMAL) == 0 + if ((e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) == 0 && single_succ_p (src) && src != ENTRY_BLOCK_PTR_FOR_FN (cfun)) { diff --git a/gcc/graph.c b/gcc/graph.c index 5fb0d78..e5d3646 100644 --- a/gcc/graph.c +++ b/gcc/graph.c @@ -142,7 +142,7 @@ draw_cfg_node_succ_edges (pretty_printer *pp, int funcdef_no, basic_block bb) weight = 100; } - if (e->flags & EDGE_ABNORMAL) + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) color = "red"; pp_printf (pp, diff --git a/gcc/graphite-scop-detection.c b/gcc/graphite-scop-detection.c index 02e9e50..8f91f6e 100644 --- a/gcc/graphite-scop-detection.c +++ b/gcc/graphite-scop-detection.c @@ -1339,7 +1339,7 @@ canonicalize_loop_closed_ssa (loop_p loop) edge e = single_exit (loop); basic_block bb; - if (!e || e->flags & EDGE_ABNORMAL) + if (!e || e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) return; bb = e->dest; diff --git a/gcc/omp-low.c b/gcc/omp-low.c index f82247b..e28f5b4 100644 --- a/gcc/omp-low.c +++ b/gcc/omp-low.c @@ -10447,6 +10447,7 @@ predicate_bb (basic_block bb, struct omp_region *parent) generate_vector_broadcast (broadcast_cond, cond_var, gsi_asgn); edge e = split_block (bb, asgn); + e->flags = EDGE_TO_RECONVERGENCE; skip_dest_bb = e->dest; gimple_cond_set_condition (as_a <gcond *> (stmt), EQ_EXPR, @@ -10465,6 +10466,7 @@ predicate_bb (basic_block bb, struct omp_region *parent) generate_vector_broadcast (new_var, var, gsi_asgn); edge e = split_block (bb, asgn); + e->flags = EDGE_TO_RECONVERGENCE; skip_dest_bb = e->dest; gimple_switch_set_index (sstmt, new_var); @@ -10477,6 +10479,7 @@ predicate_bb (basic_block bb, struct omp_region *parent) && gimple_code (stmt) != GIMPLE_OMP_CONTINUE) { edge e = single_succ_edge (bb); + e->flags = EDGE_TO_RECONVERGENCE; skip_dest_bb = e->dest; if (gimple_code (stmt) == GIMPLE_OMP_RETURN) { @@ -10490,6 +10493,7 @@ predicate_bb (basic_block bb, struct omp_region *parent) if (!split_stmt) return; edge e = split_block (bb, split_stmt); + e->flags = EDGE_TO_RECONVERGENCE; skip_dest_bb = e->dest; if (gimple_code (stmt) == GIMPLE_OMP_CONTINUE) { @@ -10508,6 +10512,7 @@ predicate_bb (basic_block bb, struct omp_region *parent) else if (single_succ_p (bb)) { edge e = single_succ_edge (bb); + e->flags |= EDGE_TO_RECONVERGENCE; skip_dest_bb = e->dest; if (gimple_code (stmt) == GIMPLE_GOTO) gsi_prev (&gsi); @@ -10540,7 +10545,8 @@ predicate_bb (basic_block bb, struct omp_region *parent) gsi_insert_after (&tmp_gsi, cond_stmt, GSI_CONTINUE_LINKING); e2->flags = EDGE_TRUE_VALUE; - make_edge (cond_bb, skip_dest_bb, EDGE_FALSE_VALUE); + make_edge (cond_bb, skip_dest_bb, + EDGE_FALSE_VALUE | EDGE_TO_RECONVERGENCE); } } diff --git a/gcc/predict.c b/gcc/predict.c index 67d5d20..9098bce 100644 --- a/gcc/predict.c +++ b/gcc/predict.c @@ -2188,7 +2188,8 @@ tree_bb_level_predictions (void) edge_iterator ei; FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds) - if (!(e->flags & (EDGE_ABNORMAL | EDGE_FAKE | EDGE_EH))) + if (!(e->flags & (EDGE_ABNORMAL | EDGE_FAKE | EDGE_EH + | EDGE_TO_RECONVERGENCE))) { has_return_edges = true; break; diff --git a/gcc/profile.c b/gcc/profile.c index a178a1b..cd96b9f 100644 --- a/gcc/profile.c +++ b/gcc/profile.c @@ -170,7 +170,8 @@ instrument_edges (struct edge_list *el) if (!inf->ignore && !inf->on_tree) { - gcc_assert (!(e->flags & EDGE_ABNORMAL)); + gcc_assert (!(e->flags & (EDGE_ABNORMAL + | EDGE_TO_RECONVERGENCE))); if (dump_file) fprintf (dump_file, "Edge %d to %d instrumented%s\n", e->src->index, 
e->dest->index, @@ -1107,7 +1108,8 @@ branch_prob (void) edge ne = single_succ_edge (new_bb); ne->goto_locus = e->goto_locus; } - if ((e->flags & (EDGE_ABNORMAL | EDGE_ABNORMAL_CALL)) + if ((e->flags & (EDGE_ABNORMAL | EDGE_ABNORMAL_CALL + | EDGE_TO_RECONVERGENCE)) && e->dest != EXIT_BLOCK_PTR_FOR_FN (cfun)) need_exit_edge = 1; if (e->dest == EXIT_BLOCK_PTR_FOR_FN (cfun)) @@ -1115,7 +1117,8 @@ branch_prob (void) } FOR_EACH_EDGE (e, ei, bb->preds) { - if ((e->flags & (EDGE_ABNORMAL | EDGE_ABNORMAL_CALL)) + if ((e->flags & (EDGE_ABNORMAL | EDGE_ABNORMAL_CALL + | EDGE_TO_RECONVERGENCE)) && e->src != ENTRY_BLOCK_PTR_FOR_FN (cfun)) need_entry_edge = 1; if (e->src == ENTRY_BLOCK_PTR_FOR_FN (cfun)) @@ -1180,7 +1183,8 @@ branch_prob (void) e->count = 0; /* Mark edges we've replaced by fake edges above as ignored. */ - if ((e->flags & (EDGE_ABNORMAL | EDGE_ABNORMAL_CALL)) + if ((e->flags & (EDGE_ABNORMAL | EDGE_ABNORMAL_CALL + | EDGE_TO_RECONVERGENCE)) && e->src != ENTRY_BLOCK_PTR_FOR_FN (cfun) && e->dest != EXIT_BLOCK_PTR_FOR_FN (cfun)) { @@ -1430,7 +1434,8 @@ find_spanning_tree (struct edge_list *el) for (i = 0; i < num_edges; i++) { edge e = INDEX_EDGE (el, i); - if (((e->flags & (EDGE_ABNORMAL | EDGE_ABNORMAL_CALL | EDGE_FAKE)) + if (((e->flags & (EDGE_ABNORMAL | EDGE_ABNORMAL_CALL | EDGE_FAKE + | EDGE_TO_RECONVERGENCE)) || e->dest == EXIT_BLOCK_PTR_FOR_FN (cfun)) && !EDGE_INFO (e)->ignore && (find_group (e->src) != find_group (e->dest))) diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c index 99b27c7..fa12678 100644 --- a/gcc/tree-cfg.c +++ b/gcc/tree-cfg.c @@ -1865,7 +1865,7 @@ replace_uses_by (tree name, tree val) { e = gimple_phi_arg_edge (as_a <gphi *> (stmt), PHI_ARG_INDEX_FROM_USE (use)); - if (e->flags & EDGE_ABNORMAL + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE) && !SSA_NAME_OCCURS_IN_ABNORMAL_PHI (val)) { /* This can only occur for virtual operands, since @@ -2776,7 +2776,7 @@ gimple_split_edge (edge edge_in) edge new_edge, e; /* Abnormal edges cannot be split. 
*/ - gcc_assert (!(edge_in->flags & EDGE_ABNORMAL)); + gcc_assert (!(edge_in->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE))); dest = edge_in->dest; @@ -5578,7 +5578,7 @@ gimple_redirect_edge_and_branch (edge e, basic_block dest) edge ret; gimple stmt; - if (e->flags & EDGE_ABNORMAL) + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) return NULL; if (e->dest == dest) @@ -5724,7 +5724,7 @@ gimple_redirect_edge_and_branch (edge e, basic_block dest) static bool gimple_can_remove_branch_p (const_edge e) { - if (e->flags & (EDGE_ABNORMAL | EDGE_EH)) + if (e->flags & (EDGE_ABNORMAL | EDGE_EH | EDGE_TO_RECONVERGENCE)) return false; return true; @@ -8288,7 +8288,8 @@ split_critical_edges (void) { FOR_EACH_EDGE (e, ei, bb->succs) { - if (EDGE_CRITICAL_P (e) && !(e->flags & EDGE_ABNORMAL)) + if (EDGE_CRITICAL_P (e) && !(e->flags & (EDGE_ABNORMAL + | EDGE_TO_RECONVERGENCE))) split_edge (e); /* PRE inserts statements to edges and expects that since split_critical_edges was done beforehand, committing edge @@ -8301,7 +8302,7 @@ split_critical_edges (void) || !gimple_seq_empty_p (phi_nodes (e->dest)) || e->dest == EXIT_BLOCK_PTR_FOR_FN (cfun)) && e->src != ENTRY_BLOCK_PTR_FOR_FN (cfun) - && !(e->flags & EDGE_ABNORMAL)) + && !(e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE))) { gimple_stmt_iterator gsi; diff --git a/gcc/tree-cfgcleanup.c b/gcc/tree-cfgcleanup.c index 26258aa..e361833 100644 --- a/gcc/tree-cfgcleanup.c +++ b/gcc/tree-cfgcleanup.c @@ -307,7 +307,8 @@ tree_forwarder_block_p (basic_block bb, bool phi_wanted) /* Nor should this be an infinite loop. */ || single_succ (bb) == bb /* BB may not have an abnormal outgoing edge. */ - || (single_succ_edge (bb)->flags & EDGE_ABNORMAL)) + || (single_succ_edge (bb)->flags & (EDGE_ABNORMAL + | EDGE_TO_RECONVERGENCE))) return false; gcc_checking_assert (bb != ENTRY_BLOCK_PTR_FOR_FN (cfun)); @@ -451,8 +452,8 @@ remove_forwarder_block (basic_block bb) So if there is an abnormal edge to BB, proceed only if there is no abnormal edge to DEST and there are no phi nodes in DEST. */ - if (bb_has_abnormal_pred (bb) - && (bb_has_abnormal_pred (dest) + if (bb_has_abnorm_or_reconv_pred (bb) + && (bb_has_abnorm_or_reconv_pred (dest) || !gimple_seq_empty_p (phi_nodes (dest)))) return false; @@ -483,7 +484,7 @@ remove_forwarder_block (basic_block bb) { bitmap_set_bit (cfgcleanup_altered_bbs, e->src->index); - if (e->flags & EDGE_ABNORMAL) + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) { /* If there is an abnormal edge, redirect it anyway, and move the labels to the new block to make it legal. */ @@ -1019,7 +1020,7 @@ pass_merge_phi::execute (function *fun) if (gimple_seq_empty_p (phi_nodes (dest)) /* We don't want to deal with a basic block with abnormal edges. */ - || bb_has_abnormal_pred (bb)) + || bb_has_abnorm_or_reconv_pred (bb)) continue; if (!dominated_by_p (CDI_DOMINATORS, dest, bb)) diff --git a/gcc/tree-chkp.c b/gcc/tree-chkp.c index 288470b..a916525 100644 --- a/gcc/tree-chkp.c +++ b/gcc/tree-chkp.c @@ -3994,7 +3994,7 @@ chkp_fix_cfg () /* We cannot split abnormal edge. Therefore we store its params, make it regular and then rebuild abnormal edge after split. 
*/ - if (fall->flags & EDGE_ABNORMAL) + if (fall->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) { flags = fall->flags & ~EDGE_FALLTHRU; dest = fall->dest; diff --git a/gcc/tree-if-conv.c b/gcc/tree-if-conv.c index 49ff458..7849934 100644 --- a/gcc/tree-if-conv.c +++ b/gcc/tree-if-conv.c @@ -1070,7 +1070,8 @@ if_convertible_bb_p (struct loop *loop, basic_block bb, basic_block exit_bb) /* Be less adventurous and handle only normal edges. */ FOR_EACH_EDGE (e, ei, bb->succs) - if (e->flags & (EDGE_EH | EDGE_ABNORMAL | EDGE_IRREDUCIBLE_LOOP)) + if (e->flags & (EDGE_EH | EDGE_ABNORMAL | EDGE_IRREDUCIBLE_LOOP + | EDGE_TO_RECONVERGENCE)) { if (dump_file && (dump_flags & TDF_DETAILS)) fprintf (dump_file, "Difficult to handle edges\n"); diff --git a/gcc/tree-inline.c b/gcc/tree-inline.c index 71d75d9..72aa414 100644 --- a/gcc/tree-inline.c +++ b/gcc/tree-inline.c @@ -2182,8 +2182,10 @@ update_ssa_across_abnormal_edges (basic_block bb, basic_block ret_bb, re = find_edge (ret_bb, e->dest); gcc_checking_assert (re); - gcc_assert ((re->flags & (EDGE_EH | EDGE_ABNORMAL)) - == (e->flags & (EDGE_EH | EDGE_ABNORMAL))); + gcc_assert ((re->flags & (EDGE_EH | EDGE_ABNORMAL + | EDGE_TO_RECONVERGENCE)) + == (e->flags & (EDGE_EH | EDGE_ABNORMAL + | EDGE_TO_RECONVERGENCE))); SET_USE (PHI_ARG_DEF_PTR_FROM_EDGE (phi, e), USE_FROM_PTR (PHI_ARG_DEF_PTR_FROM_EDGE (phi, re))); diff --git a/gcc/tree-into-ssa.c b/gcc/tree-into-ssa.c index e8b55c1..7e9cb57 100644 --- a/gcc/tree-into-ssa.c +++ b/gcc/tree-into-ssa.c @@ -2101,7 +2101,7 @@ rewrite_update_phi_arguments (basic_block bb) } - if (e->flags & EDGE_ABNORMAL) + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) SSA_NAME_OCCURS_IN_ABNORMAL_PHI (USE_FROM_PTR (arg_p)) = 1; } } @@ -2138,7 +2138,7 @@ rewrite_update_dom_walker::before_dom_children (basic_block bb) /* Mark the LHS if any of the arguments flows through an abnormal edge. */ - is_abnormal_phi = bb_has_abnormal_pred (bb); + is_abnormal_phi = bb_has_abnorm_or_reconv_pred (bb); /* If any of the PHI nodes is a replacement for a name in OLD_SSA_NAMES or it's one of the names in NEW_SSA_NAMES, then @@ -2899,7 +2899,8 @@ create_new_def_for (tree old_name, gimple stmt, def_operand_p def) basic_block bb = gimple_bb (stmt); /* If needed, mark NEW_NAME as occurring in an abnormal PHI node. */ - SSA_NAME_OCCURS_IN_ABNORMAL_PHI (new_name) = bb_has_abnormal_pred (bb); + SSA_NAME_OCCURS_IN_ABNORMAL_PHI (new_name) + = bb_has_abnorm_or_reconv_pred (bb); } add_new_name_mapping (new_name, old_name); diff --git a/gcc/tree-outof-ssa.c b/gcc/tree-outof-ssa.c index e23bc0b..fe24cf8 100644 --- a/gcc/tree-outof-ssa.c +++ b/gcc/tree-outof-ssa.c @@ -766,7 +766,7 @@ eliminate_phi (edge e, elim_graph g) gcc_assert (g->copy_locus.length () == 0); /* Abnormal edges already have everything coalesced. */ - if (e->flags & EDGE_ABNORMAL) + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) return; g->e = e; diff --git a/gcc/tree-phinodes.c b/gcc/tree-phinodes.c index d657907..05c383d 100644 --- a/gcc/tree-phinodes.c +++ b/gcc/tree-phinodes.c @@ -392,7 +392,7 @@ add_phi_arg (gphi *phi, tree def, edge e, source_location locus) /* Copy propagation needs to know what object occur in abnormal PHI nodes. This is a convenient place to record such information. 
*/ - if (e->flags & EDGE_ABNORMAL) + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) { SSA_NAME_OCCURS_IN_ABNORMAL_PHI (def) = 1; SSA_NAME_OCCURS_IN_ABNORMAL_PHI (PHI_RESULT (phi)) = 1; diff --git a/gcc/tree-ssa-coalesce.c b/gcc/tree-ssa-coalesce.c index eeac5a4..a72c86c 100644 --- a/gcc/tree-ssa-coalesce.c +++ b/gcc/tree-ssa-coalesce.c @@ -170,7 +170,7 @@ coalesce_cost_edge (edge e) /* Inserting copy on critical edge costs more than inserting it elsewhere. */ if (EDGE_CRITICAL_P (e)) mult = 2; - if (e->flags & EDGE_ABNORMAL) + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) return MUST_COALESCE_COST; if (e->flags & EDGE_EH) { @@ -975,11 +975,11 @@ create_outofssa_var_map (coalesce_list_p cl, bitmap used_in_copy) register_ssa_partition (map, arg); if (gimple_can_coalesce_p (arg, res) - || (e->flags & EDGE_ABNORMAL)) + || (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE))) { saw_copy = true; bitmap_set_bit (used_in_copy, SSA_NAME_VERSION (arg)); - if ((e->flags & EDGE_ABNORMAL) == 0) + if ((e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) == 0) { int cost = coalesce_cost_edge (e); if (cost == 1 && has_single_use (arg)) @@ -1206,7 +1206,7 @@ coalesce_partitions (var_map map, ssa_conflicts_p graph, coalesce_list_p cl, FOR_EACH_BB_FN (bb, cfun) { FOR_EACH_EDGE (e, ei, bb->preds) - if (e->flags & EDGE_ABNORMAL) + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) { gphi_iterator gsi; for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); diff --git a/gcc/tree-ssa-dom.c b/gcc/tree-ssa-dom.c index 14f3e9e..9a27acf 100644 --- a/gcc/tree-ssa-dom.c +++ b/gcc/tree-ssa-dom.c @@ -1874,7 +1874,7 @@ cprop_into_successor_phis (basic_block bb) /* If this is an abnormal edge, then we do not want to copy propagate into the PHI alternative associated with this edge. */ - if (e->flags & EDGE_ABNORMAL) + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) continue; gsi = gsi_start_phis (e->dest); @@ -1983,7 +1983,8 @@ dom_opt_dom_walker::after_dom_children (basic_block bb) may be able to statically determine which of the outgoing edges will be traversed when the incoming edge from BB is traversed. */ if (single_succ_p (bb) - && (single_succ_edge (bb)->flags & EDGE_ABNORMAL) == 0 + && (single_succ_edge (bb)->flags & (EDGE_ABNORMAL + | EDGE_TO_RECONVERGENCE)) == 0 && potentially_threadable_block (single_succ (bb))) { thread_across_edge (single_succ_edge (bb)); @@ -1991,8 +1992,10 @@ dom_opt_dom_walker::after_dom_children (basic_block bb) else if ((last = last_stmt (bb)) && gimple_code (last) == GIMPLE_COND && EDGE_COUNT (bb->succs) == 2 - && (EDGE_SUCC (bb, 0)->flags & EDGE_ABNORMAL) == 0 - && (EDGE_SUCC (bb, 1)->flags & EDGE_ABNORMAL) == 0) + && (EDGE_SUCC (bb, 0)->flags & (EDGE_ABNORMAL + | EDGE_TO_RECONVERGENCE)) == 0 + && (EDGE_SUCC (bb, 1)->flags & (EDGE_ABNORMAL + | EDGE_TO_RECONVERGENCE)) == 0) { edge true_edge, false_edge; @@ -2957,7 +2960,7 @@ propagate_rhs_into_lhs (gimple stmt, tree lhs, tree rhs, bitmap interesting_name /* And fixup the flags on the single remaining edge. 
*/ te->flags &= ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE); - te->flags &= ~EDGE_ABNORMAL; + te->flags &= ~(EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE); te->flags |= EDGE_FALLTHRU; if (te->probability > REG_BR_PROB_BASE) te->probability = REG_BR_PROB_BASE; diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c index 11fc699..cb61979 100644 --- a/gcc/tree-ssa-loop-im.c +++ b/gcc/tree-ssa-loop-im.c @@ -2313,7 +2313,7 @@ loop_suitable_for_sm (struct loop *loop ATTRIBUTE_UNUSED, edge ex; FOR_EACH_VEC_ELT (exits, i, ex) - if (ex->flags & (EDGE_ABNORMAL | EDGE_EH)) + if (ex->flags & (EDGE_ABNORMAL | EDGE_EH | EDGE_TO_RECONVERGENCE)) return false; return true; diff --git a/gcc/tree-ssa-loop-prefetch.c b/gcc/tree-ssa-loop-prefetch.c index a948d68..f817b9e 100644 --- a/gcc/tree-ssa-loop-prefetch.c +++ b/gcc/tree-ssa-loop-prefetch.c @@ -1283,7 +1283,7 @@ emit_mfence_after_loop (struct loop *loop) if (!single_pred_p (exit->dest) /* If possible, we prefer not to insert the fence on other paths in cfg. */ - && !(exit->flags & EDGE_ABNORMAL)) + && !(exit->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE))) split_loop_exit_edge (exit); bsi = gsi_after_labels (exit->dest); @@ -1313,7 +1313,7 @@ may_use_storent_in_loop_p (struct loop *loop) edge exit; FOR_EACH_VEC_ELT (exits, i, exit) - if ((exit->flags & EDGE_ABNORMAL) + if ((exit->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) && exit->dest == EXIT_BLOCK_PTR_FOR_FN (cfun)) ret = false; diff --git a/gcc/tree-ssa-phiopt.c b/gcc/tree-ssa-phiopt.c index 7c846c2..91ae8dc 100644 --- a/gcc/tree-ssa-phiopt.c +++ b/gcc/tree-ssa-phiopt.c @@ -239,8 +239,8 @@ tree_ssa_phiopt_worker (bool do_store_elim, bool do_hoist_loads) bb2 = e2->dest; /* We cannot do the optimization on abnormal edges. */ - if ((e1->flags & EDGE_ABNORMAL) != 0 - || (e2->flags & EDGE_ABNORMAL) != 0) + if ((e1->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) != 0 + || (e2->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) != 0) continue; /* If either bb1's succ or bb2 or bb2's succ is non NULL. */ diff --git a/gcc/tree-ssa-pre.c b/gcc/tree-ssa-pre.c index d857d84..2f69389 100644 --- a/gcc/tree-ssa-pre.c +++ b/gcc/tree-ssa-pre.c @@ -2410,7 +2410,7 @@ compute_antic (void) FOR_EACH_EDGE (e, ei, block->preds) { e->flags &= ~EDGE_DFS_BACK; - if (e->flags & EDGE_ABNORMAL) + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) { bitmap_set_bit (has_abnormal_preds, block->index); break; @@ -3035,7 +3035,7 @@ insert_into_preds_of_block (basic_block block, unsigned int exprnum, { builtexpr = create_expression_by_pieces (bprime, eprime, &stmts, type); - gcc_assert (!(pred->flags & EDGE_ABNORMAL)); + gcc_assert (!(pred->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE))); gsi_insert_seq_on_edge (pred, stmts); if (!builtexpr) { diff --git a/gcc/tree-ssa-propagate.c b/gcc/tree-ssa-propagate.c index e23da70..26b8ec8 100644 --- a/gcc/tree-ssa-propagate.c +++ b/gcc/tree-ssa-propagate.c @@ -526,7 +526,7 @@ simulate_block (basic_block block) normal_edge = NULL; FOR_EACH_EDGE (e, ei, block->succs) { - if (e->flags & (EDGE_ABNORMAL | EDGE_EH)) + if (e->flags & (EDGE_ABNORMAL | EDGE_EH | EDGE_TO_RECONVERGENCE)) add_control_edge (e); else { @@ -1072,7 +1072,7 @@ replace_phi_args_in (gphi *phi, ssa_prop_get_value_fn get_value) through an abnormal edge, update the replacement accordingly. 
*/ if (TREE_CODE (val) == SSA_NAME - && e->flags & EDGE_ABNORMAL + && e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE) && !SSA_NAME_OCCURS_IN_ABNORMAL_PHI (val)) { /* This can only occur for virtual operands, since diff --git a/gcc/tree-ssa-sink.c b/gcc/tree-ssa-sink.c index 1ed8a0e..c3d091c 100644 --- a/gcc/tree-ssa-sink.c +++ b/gcc/tree-ssa-sink.c @@ -501,7 +501,7 @@ sink_code_in_bb (basic_block bb) /* We can't move things across abnormal edges, so don't try. */ FOR_EACH_EDGE (e, ei, bb->succs) - if (e->flags & EDGE_ABNORMAL) + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) goto earlyout; for (gsi = gsi_last_bb (bb); !gsi_end_p (gsi);) diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c index acbbb67..5b6626a 100644 --- a/gcc/tree-ssa-threadedge.c +++ b/gcc/tree-ssa-threadedge.c @@ -1417,7 +1417,7 @@ thread_across_edge (gcond *dummy_cond, /* If E->dest has abnormal outgoing edges, then there's no guarantee we can safely redirect any of the edges. Just punt those cases. */ FOR_EACH_EDGE (taken_edge, ei, e->dest->succs) - if (taken_edge->flags & EDGE_ABNORMAL) + if (taken_edge->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) { const_and_copies->pop_to_marker (); BITMAP_FREE (visited); diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c index 0d61c18..49b010d 100644 --- a/gcc/tree-ssa-threadupdate.c +++ b/gcc/tree-ssa-threadupdate.c @@ -1661,7 +1661,8 @@ thread_single_edge (edge e) remove_ctrl_stmt_and_useless_edges (bb, eto->dest); /* And fixup the flags on the single remaining edge. */ - eto->flags &= ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE | EDGE_ABNORMAL); + eto->flags &= ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE | EDGE_ABNORMAL + | EDGE_TO_RECONVERGENCE); eto->flags |= EDGE_FALLTHRU; return bb; diff --git a/gcc/tree-ssa-uninit.c b/gcc/tree-ssa-uninit.c index 19a3e82..a23d83a 100644 --- a/gcc/tree-ssa-uninit.c +++ b/gcc/tree-ssa-uninit.c @@ -459,7 +459,7 @@ compute_control_dep_chain (basic_block bb, basic_block dep_bb, { basic_block cd_bb; int post_dom_check = 0; - if (e->flags & (EDGE_FAKE | EDGE_ABNORMAL)) + if (e->flags & (EDGE_FAKE | EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) continue; cd_bb = e->dest; diff --git a/gcc/tree-ssa.c b/gcc/tree-ssa.c index 10d3314..86fd30c 100644 --- a/gcc/tree-ssa.c +++ b/gcc/tree-ssa.c @@ -883,7 +883,9 @@ verify_phi_args (gphi *phi, basic_block bb, basic_block *definition_block) { err = verify_ssa_name (op, virtual_operand_p (gimple_phi_result (phi))); err |= verify_use (e->src, definition_block[SSA_NAME_VERSION (op)], - op_p, phi, e->flags & EDGE_ABNORMAL, NULL); + op_p, phi, + e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE), + NULL); } if (TREE_CODE (op) == ADDR_EXPR) diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c index 49bf518..dcca3fe 100644 --- a/gcc/tree-vect-loop.c +++ b/gcc/tree-vect-loop.c @@ -1277,7 +1277,7 @@ vect_analyze_loop_form (struct loop *loop) if (!single_pred_p (single_exit (loop)->dest)) { edge e = single_exit (loop); - if (!(e->flags & EDGE_ABNORMAL)) + if (!(e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE))) { split_loop_exit_edge (e); if (dump_enabled_p ()) diff --git a/gcc/tree-vrp.c b/gcc/tree-vrp.c index 6744a91..97a3ccf 100644 --- a/gcc/tree-vrp.c +++ b/gcc/tree-vrp.c @@ -4926,7 +4926,7 @@ infer_value_range (gimple stmt, tree op, enum tree_code *comp_code_p, tree *val_ edge e; FOR_EACH_EDGE (e, ei, gimple_bb (stmt)->succs) - if (!(e->flags & EDGE_ABNORMAL)) + if (!(e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE))) break; if (e == NULL) return false; @@ -5949,7 +5949,9 @@ 
find_conditional_asserts (basic_block bb, gcond *last) insert. */ FOR_EACH_EDGE (e, ei, bb->succs) { - if (e->dest == bb) + /* Skip over EDGE_TO_RECONVERGENCE edges because they cannot be split + later in the pass. */ + if (e->dest == bb || (e->flags & EDGE_TO_RECONVERGENCE)) continue; /* Register the necessary assertions for each operand in the @@ -6412,7 +6414,7 @@ process_assert_insertions_for (tree name, assert_locus_t loc) STMT is not control flow, there may only be one non-abnormal edge out of BB. */ FOR_EACH_EDGE (e, ei, loc->bb->succs) - if (!(e->flags & EDGE_ABNORMAL)) + if (!(e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE))) { gsi_insert_on_edge (e, assert_stmt); return true; diff --git a/gcc/value-prof.c b/gcc/value-prof.c index b16bce8..7a70c49 100644 --- a/gcc/value-prof.c +++ b/gcc/value-prof.c @@ -1565,7 +1565,7 @@ gimple_ic (gcall *icall_stmt, struct cgraph_node *direct_call, } FOR_EACH_EDGE (e_eh, ei, icall_bb->succs) - if (e_eh->flags & (EDGE_EH | EDGE_ABNORMAL)) + if (e_eh->flags & (EDGE_EH | EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) { e = make_edge (dcall_bb, e_eh->dest, e_eh->flags); for (gphi_iterator psi = gsi_start_phis (e_eh->dest); ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [gomp4] Preserve NVPTX "reconvergence" points
  2015-05-28 14:20 [gomp4] Preserve NVPTX "reconvergence" points Julian Brown
@ 2015-05-28 14:59 ` Jakub Jelinek
  2015-05-28 15:14   ` Thomas Schwinge
  2015-05-28 15:02 ` Richard Biener
  1 sibling, 1 reply; 23+ messages in thread
From: Jakub Jelinek @ 2015-05-28 14:59 UTC (permalink / raw)
  To: Julian Brown; +Cc: gcc-patches, Bernd Schmidt, Thomas Schwinge

On Thu, May 28, 2015 at 03:06:35PM +0100, Julian Brown wrote:
> For NVPTX, it is vitally important that the divergence of threads
> within a warp can be controlled: in particular we must be able to
> generate code that we know "reconverges" at a particular point.
> Unfortunately GCC's middle-end optimisers can cause this property to
> be violated, which causes problems for the OpenACC execution model
> we're planning to use for NVPTX.
>
> As a brief example: code running in vector-single mode runs on a
> single thread of a warp, and must broadcast condition results to the
> other threads of the warp so that they can "follow along" and be
> ready for vector-partitioned execution when necessary.

I think the lowering of this already at ompexp time is premature;
I think much better would be to have a function attribute (or cgraph
flag) that would be set for functions you want to compile this way
(plus a targetm flag saying that the target wants to support it that
way), plus a flag in the loop structure for the acc loop vector loops
(perhaps the current OpenMP simd loop flags are good enough for that),
and lower it somewhere around the vectorization pass or so.

Or, what exactly do you emit for the fallback code, or for other
GPGPUs or XeonPhi?  To me, e.g. for XeonPhi or HSA, this sounds like
you want to implement the acc loop gang as a work-sharing loop among
threads (like #pragma omp for) and #pragma acc loop vector as a loop
that should be vectorized if at all possible (like #pragma omp simd).
I really think it is important that OpenACC support in GCC is not so
strongly tied to one specific GPGPU, and similarly OpenMP should be
usable for all offloading targets GCC supports.

That way, it is possible to auto-vectorize the code too, the decision
on how to expand the code of an offloaded function is made separately
for each offloading target, there is room for optimizations on a much
simpler CFG, etc.

	Jakub

^ permalink raw reply	[flat|nested] 23+ messages in thread
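To make the suggested mapping concrete, here is an illustrative
source-level correspondence between the OpenACC levels and the OpenMP
constructs named above; this is a hand-written analogy, not the output
of any implemented transformation:

  /* OpenACC form: gang-partitioned outer loop, vector-partitioned
     inner loop.  */
  void
  scale_acc (int n, int m, float a[n][m], const float b[n][m], float s)
  {
  #pragma acc parallel loop gang
    for (int i = 0; i < n; i++)
  #pragma acc loop vector
      for (int j = 0; j < m; j++)
        a[i][j] += s * b[i][j];
  }

  /* Rough OpenMP analogue for a XeonPhi-style target or the host
     fallback: gang becomes a work-sharing loop among threads, vector
     becomes a loop the vectorizer should handle.  */
  void
  scale_omp (int n, int m, float a[n][m], const float b[n][m], float s)
  {
  #pragma omp parallel for
    for (int i = 0; i < n; i++)
  #pragma omp simd
      for (int j = 0; j < m; j++)
        a[i][j] += s * b[i][j];
  }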
* Re: [gomp4] Preserve NVPTX "reconvergence" points
  2015-05-28 14:59 ` Jakub Jelinek
@ 2015-05-28 15:14   ` Thomas Schwinge
  2015-05-28 15:28     ` Jakub Jelinek
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Schwinge @ 2015-05-28 15:14 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc-patches, Bernd Schmidt, Nathan Sidwell, Julian Brown

[-- Attachment #1: Type: text/plain, Size: 2239 bytes --]

Hi!

On Thu, 28 May 2015 16:20:11 +0200, Jakub Jelinek <jakub@redhat.com> wrote:
> On Thu, May 28, 2015 at 03:06:35PM +0100, Julian Brown wrote:
> > [...]
> I think the lowering of this already at ompexp time is premature

Yes, we're aware of this "wart".  :-|

> I think much better would be to have a function attribute (or cgraph
> flag) that would be set for functions you want to compile this way
> (plus a targetm flag saying that the target wants to support it that
> way), plus a flag in the loop structure for the acc loop vector loops
> (perhaps the current OpenMP simd loop flags are good enough for that),
> and lower it somewhere around the vectorization pass or so.

Moving the loop lowering/expansion later is along the same lines as
we've been thinking.  Figuring out how the OpenMP simd implementation
works is another thing I wanted to look into.

> Or, what exactly do you emit for the fallback code, or for other
> GPGPUs or XeonPhi?  To me, e.g. for XeonPhi or HSA, this sounds like
> you want to implement the acc loop gang as a work-sharing loop among
> threads (like #pragma omp for) and #pragma acc loop vector as a loop
> that should be vectorized if at all possible (like #pragma omp simd).
> I really think it is important that OpenACC support in GCC is not so
> strongly tied to one specific GPGPU

Not disagreeing, but: we have to start somewhere.  GPU offloading and
all its peculiarities are still unknown territory in GCC; we're still
learning, and shall try to converge the emerging different
implementations in the future.  Doing the completely generic (agnostic
of specific offloading device) implementation right now is a
challenging task, hence the work on a "nvptx-specific prototype"
first, to put it this way.  That said, we of course very much welcome
your continued review of our work, and your suggestions!

> and similarly OpenMP should be usable for all offloading targets GCC
> supports.
>
> That way, it is possible to auto-vectorize the code too, the decision
> on how to expand the code of an offloaded function is made separately
> for each offloading target, there is room for optimizations on a much
> simpler CFG, etc.

Grüße,
 Thomas

[-- Attachment #2: Type: application/pgp-signature, Size: 472 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [gomp4] Preserve NVPTX "reconvergence" points
  2015-05-28 15:14   ` Thomas Schwinge
@ 2015-05-28 15:28     ` Jakub Jelinek
  2015-06-19 10:44       ` Bernd Schmidt
  0 siblings, 1 reply; 23+ messages in thread
From: Jakub Jelinek @ 2015-05-28 15:28 UTC (permalink / raw)
  To: Thomas Schwinge; +Cc: gcc-patches, Bernd Schmidt, Nathan Sidwell, Julian Brown

On Thu, May 28, 2015 at 04:49:43PM +0200, Thomas Schwinge wrote:
> > I think much better would be to have a function attribute (or cgraph
> > flag) that would be set for functions you want to compile this way
> > (plus a targetm flag saying that the target wants to support it that
> > way), plus a flag in the loop structure for the acc loop vector loops
> > (perhaps the current OpenMP simd loop flags are good enough for that),
> > and lower it somewhere around the vectorization pass or so.
>
> Moving the loop lowering/expansion later is along the same lines as
> we've been thinking.  Figuring out how the OpenMP simd implementation
> works is another thing I wanted to look into.

The OpenMP simd expansion is actually quite a simple thing.
Basically, the simd loop is in ompexp expanded as a normal loop with
some flags in the loop structure (which are pretty much optimization
hints).  There is a flag saying that the user would really like the
loop to be vectorized, and another field that says (from what the user
told the compiler) what vectorization factor is safe to use regardless
of the compiler's analysis.  There are some complications with
privatization clauses, so some variables are in GIMPLE represented as
arrays with maximum vf elements, indexed by an internal function (the
simd lane), which the vectorizer then either turns back into scalars
(if the loop isn't vectorized), or vectorizes, keeping addressables in
arrays with the actual vf elements.

I admit I don't know too much about OpenACC, but I'd think doing
something similar (i.e. some loop structure hint or request that a
particular loop is vectorized, and perhaps something about lexical
forward/backward dependencies in the loop) could work.  Then for
XeonPhi or host fallback, you'd just use the normal vectorizer.  And
for PTX, at about the same point, you could instead of vectorization
lower the code to a single working thread, except for the simd-marked
loops, which would be lowered to run on all threads in the warp.

> Not disagreeing, but: we have to start somewhere.  GPU offloading and
> all its peculiarities are still unknown territory in GCC; we're still
> learning, and shall try to converge the emerging different
> implementations in the future.  Doing the completely generic (agnostic
> of specific offloading device) implementation right now is a
> challenging task, hence the work on a "nvptx-specific prototype"
> first, to put it this way.

I understand it is more work; I'd just like to ask that when designing
stuff for the OpenACC offloading you (plural) try to take the other
offloading devices and host fallback into account.  E.g. the XeonPhi
is not hard to understand: it is pretty much just a many-core x86_64
chip, where offloading is just a matter of running something on the
other device, and the emulation mode emulates that quite faithfully by
running it in a different process.  This stuff is already about what
happens in offloaded code, so considerations for it are similar to
those for host code (especially hosts that can vectorize).
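For reference, a sketch of the lane-array representation described
earlier in this message.  The GIMPLE fragment is conceptual
pseudo-output, not an actual dump; .GOMP_SIMD_LANE is the internal
function used for the lane index:

  /* Source form: "t" is private to each simd lane.  */
  void
  f (int n, float *a, const float *b, const float *c)
  {
  #pragma omp simd
    for (int i = 0; i < n; i++)
      {
        float t = b[i] * c[i];
        a[i] = t + t;
      }
  }

  /* Conceptually, after omp-lowering "t" becomes an array of up to
     max-vf elements, indexed via the simd-lane internal function:

       _lane = .GOMP_SIMD_LANE (simduid.0);
       t_arr[_lane] = b[i] * c[i];
       a[i] = t_arr[_lane] + t_arr[_lane];

     If the loop is vectorized with factor vf, t_arr shrinks to vf
     elements; if the loop is not vectorized, t_arr collapses back to
     a scalar.  */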
As far as OpenMP / PTX goes, I'll try to find time for it again soon
(busy with OpenMP 4.1 work so far), but e.g. the above stuff (having a
single thread in a warp do most of the non-vectorized work, and only
using the other threads of the warp for vectorization) is definitely
something OpenMP will benefit from too.

	Jakub

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [gomp4] Preserve NVPTX "reconvergence" points
  2015-05-28 15:28     ` Jakub Jelinek
@ 2015-06-19 10:44       ` Bernd Schmidt
  2015-06-19 12:32         ` Jakub Jelinek
  0 siblings, 1 reply; 23+ messages in thread
From: Bernd Schmidt @ 2015-06-19 10:44 UTC (permalink / raw)
  To: Jakub Jelinek, Thomas Schwinge; +Cc: gcc-patches, Nathan Sidwell, Julian Brown

[-- Attachment #1: Type: text/plain, Size: 2492 bytes --]

On 05/28/2015 05:08 PM, Jakub Jelinek wrote:
> I understand it is more work; I'd just like to ask that when designing
> stuff for the OpenACC offloading you (plural) try to take the other
> offloading devices and host fallback into account.

The problem is that many of the transformations we need to do are
really GPU specific, and with the current structure of omplow/ompexp
they are being done in the host compiler.  The offloading scheme we
decided on does not give us the means to write out multiple versions
of an offloaded function where each target gets a different one.  For
that reason I think we should postpone these lowering decisions until
we're in the accel compiler, where they could be controlled by target
hooks, and over the last two weeks I've been doing some experiments to
see how that could be achieved.

The basic idea is to delay expanding the inner regions of an OpenACC
target region during ompexp, write out offload LTO (almost)
immediately afterwards, and then have another ompexp phase which runs
on the accel compiler to take the offloaded function to its final
form.

The first attempt really did write LTO immediately after, before
moving to SSA form.  It seems that this could be made to work, but the
pass manager and LTO code rather expect that what is being read in is
in SSA form already.  Also, some offloaded code is produced by OpenACC
kernels expansion much later in the compilation, so with this approach
we have an inconsistency where functions we get back from LTO are at
very different levels of lowering.

The next attempt was to run the into-ssa passes after ompexpand, and
only then write things out.  I've changed the gimple representation of
some OMP statements (primarily gimple_omp_for) so that they are
relatively normal statements with operands that can be transformed
into SSA form.

As far as what's easier to work with - I believe some of the
transformations we have to do could benefit from being in SSA, but on
the other hand the OpenACC predication code has given me some trouble.
I've still not completely convinced myself that the update_ssa call
I've added will actually do the right thing after we've mucked up the
CFG.

I'm appending a proof-of-concept patch.  This is intended to show the
general outline of what I have in mind, rather than to pass the
testsuite.  It's good enough to compile some of the OpenACC testcases
(let's say worker-single-3 if you need one).  Let me know what you
think.
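To illustrate the operand-based GIMPLE_OMP_FOR representation the
attached patch introduces, here is a minimal sketch of how a later
(accel-compiler) ompexp phase might rebuild the increment statement
from the new accessors.  The function and its context are
hypothetical; only the accessors are from the patch:

  /* Hypothetical helper: rebuild dimension 0's increment of FOR_STMT
     as the ordinary assignment "index = index <incr-code> step".  */
  static gimple
  rebuild_loop_increment (gomp_for *for_stmt)
  {
    tree index = gimple_omp_for_index (for_stmt, 0);
    tree step = gimple_omp_for_incr (for_stmt, 0);
    /* PLUS_EXPR, POINTER_PLUS_EXPR or MINUS_EXPR, per the new
       gimple_omp_for_incr_code accessor.  */
    enum tree_code code = gimple_omp_for_incr_code (for_stmt, 0);
    return gimple_build_assign (index, code, index, step);
  }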
Bernd [-- Attachment #2: offload-early.diff --] [-- Type: text/x-patch, Size: 60689 bytes --] Index: gcc/cgraphunit.c =================================================================== --- gcc/cgraphunit.c (revision 224547) +++ gcc/cgraphunit.c (working copy) @@ -2171,6 +2171,23 @@ ipa_passes (void) execute_ipa_pass_list (passes->all_small_ipa_passes); if (seen_error ()) return; + + if (g->have_offload) + { + extern void write_offload_lto (); + section_name_prefix = OFFLOAD_SECTION_NAME_PREFIX; + write_offload_lto (); + } + } + bool do_local_opts = !in_lto_p; +#ifdef ACCEL_COMPILER + do_local_opts = true; +#endif + if (do_local_opts) + { + execute_ipa_pass_list (passes->all_local_opt_passes); + if (seen_error ()) + return; } /* This extra symtab_remove_unreachable_nodes pass tends to catch some @@ -2182,7 +2199,7 @@ ipa_passes (void) if (symtab->state < IPA_SSA) symtab->state = IPA_SSA; - if (!in_lto_p) + if (do_local_opts) { /* Generate coverage variables and constructors. */ coverage_finish (); @@ -2285,6 +2302,14 @@ symbol_table::compile (void) if (seen_error ()) return; +#ifdef ACCEL_COMPILER + { + cgraph_node *node; + FOR_EACH_DEFINED_FUNCTION (node) + node->get_untransformed_body (); + } +#endif + #ifdef ENABLE_CHECKING symtab_node::verify_symtab_nodes (); #endif Index: gcc/config/nvptx/nvptx.c =================================================================== --- gcc/config/nvptx/nvptx.c (revision 224547) +++ gcc/config/nvptx/nvptx.c (working copy) @@ -1171,18 +1171,42 @@ nvptx_section_from_addr_space (addr_spac } } -/* Determine whether DECL goes into .const or .global. */ +/* Determine the address space DECL lives in. */ -const char * -nvptx_section_for_decl (const_tree decl) +static addr_space_t +nvptx_addr_space_for_decl (const_tree decl) { + if (decl == NULL_TREE || TREE_CODE (decl) == FUNCTION_DECL) + return ADDR_SPACE_GENERIC; + + if (lookup_attribute ("oacc ganglocal", DECL_ATTRIBUTES (decl)) != NULL_TREE) + return ADDR_SPACE_SHARED; + bool is_const = (CONSTANT_CLASS_P (decl) || TREE_CODE (decl) == CONST_DECL || TREE_READONLY (decl)); if (is_const) - return ".const"; + return ADDR_SPACE_CONST; - return ".global"; + return ADDR_SPACE_GLOBAL; +} + +/* Return a ptx string representing the address space for a variable DECL. */ + +const char * +nvptx_section_for_decl (const_tree decl) +{ + switch (nvptx_addr_space_for_decl (decl)) + { + case ADDR_SPACE_CONST: + return ".const"; + case ADDR_SPACE_SHARED: + return ".shared"; + case ADDR_SPACE_GLOBAL: + return ".global"; + default: + gcc_unreachable (); + } } /* Look for a SYMBOL_REF in ADDR and return the address space to be used @@ -1196,17 +1220,7 @@ nvptx_addr_space_from_address (rtx addr) if (GET_CODE (addr) != SYMBOL_REF) return ADDR_SPACE_GENERIC; - tree decl = SYMBOL_REF_DECL (addr); - if (decl == NULL_TREE || TREE_CODE (decl) == FUNCTION_DECL) - return ADDR_SPACE_GENERIC; - - bool is_const = (CONSTANT_CLASS_P (decl) - || TREE_CODE (decl) == CONST_DECL - || TREE_READONLY (decl)); - if (is_const) - return ADDR_SPACE_CONST; - - return ADDR_SPACE_GLOBAL; + return nvptx_addr_space_for_decl (SYMBOL_REF_DECL (addr)); } \f /* Machinery to output constant initializers. 
*/ Index: gcc/gimple-pretty-print.c =================================================================== --- gcc/gimple-pretty-print.c (revision 224547) +++ gcc/gimple-pretty-print.c (working copy) @@ -1175,11 +1175,12 @@ dump_gimple_omp_for (pretty_printer *buf dump_gimple_fmt (buffer, spc, flags, " >,"); for (i = 0; i < gimple_omp_for_collapse (gs); i++) dump_gimple_fmt (buffer, spc, flags, - "%+%T, %T, %T, %s, %T,%n", + "%+%T, %T, %T, %s, %s, %T,%n", gimple_omp_for_index (gs, i), gimple_omp_for_initial (gs, i), gimple_omp_for_final (gs, i), get_tree_code_name (gimple_omp_for_cond (gs, i)), + get_tree_code_name (gimple_omp_for_incr_code (gs, i)), gimple_omp_for_incr (gs, i)); dump_gimple_fmt (buffer, spc, flags, "PRE_BODY <%S>%->", gimple_omp_for_pre_body (gs)); @@ -1259,6 +1260,20 @@ dump_gimple_omp_for (pretty_printer *buf dump_generic_node (buffer, gimple_omp_for_index (gs, i), spc, flags, false); pp_string (buffer, " = "); + dump_generic_node (buffer, gimple_omp_for_index (gs, i), spc, + flags, false); + switch (gimple_omp_for_incr_code (gs, i)) + { + case POINTER_PLUS_EXPR: + case PLUS_EXPR: + pp_plus (buffer); + break; + case MINUS_EXPR: + pp_minus (buffer); + break; + default: + gcc_unreachable (); + } dump_generic_node (buffer, gimple_omp_for_incr (gs, i), spc, flags, false); pp_right_paren (buffer); Index: gcc/gimple-streamer-in.c =================================================================== --- gcc/gimple-streamer-in.c (revision 224547) +++ gcc/gimple-streamer-in.c (working copy) @@ -176,6 +176,7 @@ input_gimple_stmt (struct lto_input_bloc } /* Fallthru */ + case GIMPLE_OMP_ENTRY_END: case GIMPLE_ASSIGN: case GIMPLE_CALL: case GIMPLE_RETURN: @@ -225,6 +226,7 @@ input_gimple_stmt (struct lto_input_bloc case GIMPLE_NOP: case GIMPLE_PREDICT: + case GIMPLE_OMP_RETURN: break; case GIMPLE_TRANSACTION: @@ -232,6 +234,42 @@ input_gimple_stmt (struct lto_input_bloc stream_read_tree (ib, data_in)); break; + case GIMPLE_OMP_FOR: + { + gomp_for *for_stmt = as_a <gomp_for *> (stmt); + gimple_omp_for_set_clauses (for_stmt, stream_read_tree (ib, data_in)); + size_t collapse = streamer_read_hwi (ib); + for_stmt->collapse = collapse; + for_stmt->iter = ggc_cleared_vec_alloc<gimple_omp_for_iter> (collapse); + for (size_t i = 0; i < collapse; i++) + { + gimple_omp_for_set_cond (stmt, i, streamer_read_enum (ib, tree_code, + MAX_TREE_CODES)); + gimple_omp_for_set_incr_code (stmt, i, streamer_read_enum (ib, tree_code, + MAX_TREE_CODES)); + gimple_omp_for_set_index (stmt, i, stream_read_tree (ib, data_in)); + gimple_omp_for_set_initial (stmt, i, stream_read_tree (ib, data_in)); + gimple_omp_for_set_final (stmt, i, stream_read_tree (ib, data_in)); + gimple_omp_for_set_incr (stmt, i, stream_read_tree (ib, data_in)); + } + } + break; + + case GIMPLE_OMP_CONTINUE: + { + gomp_continue *cont_stmt = as_a <gomp_continue *> (stmt); + gimple_omp_continue_set_control_def (cont_stmt, stream_read_tree (ib, data_in)); + gimple_omp_continue_set_control_use (cont_stmt, stream_read_tree (ib, data_in)); + } + break; + + case GIMPLE_OMP_TARGET: + { + gomp_target *tgt_stmt = as_a <gomp_target *> (stmt); + gimple_omp_target_set_clauses (tgt_stmt, stream_read_tree (ib, data_in)); + } + break; + default: internal_error ("bytecode stream: unknown GIMPLE statement tag %s", lto_tag_name (tag)); @@ -239,9 +277,9 @@ input_gimple_stmt (struct lto_input_bloc /* Update the properties of symbols, SSA names and labels associated with STMT. 
*/ - if (code == GIMPLE_ASSIGN || code == GIMPLE_CALL) + if (code == GIMPLE_ASSIGN || code == GIMPLE_CALL || code == GIMPLE_OMP_CONTINUE) { - tree lhs = gimple_get_lhs (stmt); + tree lhs = gimple_op (stmt, 0); if (lhs && TREE_CODE (lhs) == SSA_NAME) SSA_NAME_DEF_STMT (lhs) = stmt; } @@ -257,7 +295,16 @@ input_gimple_stmt (struct lto_input_bloc SSA_NAME_DEF_STMT (op) = stmt; } } - + else if (code == GIMPLE_OMP_FOR) + { + gomp_for *for_stmt = as_a <gomp_for *> (stmt); + for (unsigned i = 0; i < gimple_omp_for_collapse (for_stmt); i++) + { + tree op = gimple_omp_for_index (for_stmt, i); + if (TREE_CODE (op) == SSA_NAME) + SSA_NAME_DEF_STMT (op) = stmt; + } + } /* Reset alias information. */ if (code == GIMPLE_CALL) gimple_call_reset_alias_info (as_a <gcall *> (stmt)); Index: gcc/gimple-streamer-out.c =================================================================== --- gcc/gimple-streamer-out.c (revision 224547) +++ gcc/gimple-streamer-out.c (working copy) @@ -147,6 +147,7 @@ output_gimple_stmt (struct output_block } /* Fallthru */ + case GIMPLE_OMP_ENTRY_END: case GIMPLE_ASSIGN: case GIMPLE_CALL: case GIMPLE_RETURN: @@ -201,6 +202,7 @@ output_gimple_stmt (struct output_block case GIMPLE_NOP: case GIMPLE_PREDICT: + case GIMPLE_OMP_RETURN: break; case GIMPLE_TRANSACTION: @@ -211,6 +213,45 @@ output_gimple_stmt (struct output_block } break; + case GIMPLE_OMP_FOR: + { + gomp_for *for_stmt = as_a <gomp_for *> (stmt); + stream_write_tree (ob, gimple_omp_for_clauses (for_stmt), true); + size_t collapse_count = gimple_omp_for_collapse (for_stmt); + streamer_write_hwi (ob, collapse_count); + for (size_t i = 0; i < collapse_count; i++) + { + streamer_write_enum (ob->main_stream, tree_code, MAX_TREE_CODES, + gimple_omp_for_cond (for_stmt, i)); + streamer_write_enum (ob->main_stream, tree_code, MAX_TREE_CODES, + gimple_omp_for_incr_code (for_stmt, i)); + stream_write_tree (ob, gimple_omp_for_index (for_stmt, i), true); + stream_write_tree (ob, gimple_omp_for_initial (for_stmt, i), true); + stream_write_tree (ob, gimple_omp_for_final (for_stmt, i), true); + stream_write_tree (ob, gimple_omp_for_incr (for_stmt, i), true); + } + /* No need to write out the pre-body, it's empty by the time we + get here. */ + } + break; + + case GIMPLE_OMP_CONTINUE: + { + gomp_continue *cont_stmt = as_a <gomp_continue *> (stmt); + stream_write_tree (ob, gimple_omp_continue_control_def (cont_stmt), + true); + stream_write_tree (ob, gimple_omp_continue_control_use (cont_stmt), + true); + } + break; + + case GIMPLE_OMP_TARGET: + { + gomp_target *tgt_stmt = as_a <gomp_target *> (stmt); + stream_write_tree (ob, gimple_omp_target_clauses (tgt_stmt), true); + } + break; + default: gcc_unreachable (); } Index: gcc/gimple.c =================================================================== --- gcc/gimple.c (revision 224547) +++ gcc/gimple.c (working copy) @@ -855,9 +855,11 @@ gimple_build_debug_source_bind_stat (tre /* Build a GIMPLE_OMP_ENTRY_END statement. 
*/ gimple -gimple_build_omp_entry_end (void) +gimple_build_omp_entry_end (tree var) { - return gimple_alloc (GIMPLE_OMP_ENTRY_END, 0); + gimple t = gimple_alloc (GIMPLE_OMP_ENTRY_END, 1); + gimple_set_op (t, 0, var); + return t; } @@ -890,13 +892,14 @@ gomp_for * gimple_build_omp_for (gimple_seq body, int kind, tree clauses, size_t collapse, gimple_seq pre_body) { - gomp_for *p = as_a <gomp_for *> (gimple_alloc (GIMPLE_OMP_FOR, 0)); + int nops = collapse * 4; + gomp_for *p = as_a <gomp_for *> (gimple_alloc (GIMPLE_OMP_FOR, nops)); if (body) gimple_omp_set_body (p, body); gimple_omp_for_set_clauses (p, clauses); gimple_omp_for_set_kind (p, kind); p->collapse = collapse; - p->iter = ggc_cleared_vec_alloc<gimple_omp_for_iter> (collapse); + p->iter = ggc_cleared_vec_alloc<gimple_omp_for_iter> (collapse); if (pre_body) gimple_omp_for_set_pre_body (p, pre_body); @@ -1011,7 +1014,7 @@ gomp_continue * gimple_build_omp_continue (tree control_def, tree control_use) { gomp_continue *p - = as_a <gomp_continue *> (gimple_alloc (GIMPLE_OMP_CONTINUE, 0)); + = as_a <gomp_continue *> (gimple_alloc (GIMPLE_OMP_CONTINUE, 2)); gimple_omp_continue_set_control_def (p, control_def); gimple_omp_continue_set_control_use (p, control_use); return p; Index: gcc/gimple.def =================================================================== --- gcc/gimple.def (revision 224547) +++ gcc/gimple.def (working copy) @@ -225,11 +225,11 @@ DEFGSCODE(GIMPLE_OMP_ATOMIC_STORE, "gimp /* GIMPLE_OMP_CONTINUE marks the location of the loop or sections iteration in partially lowered OpenMP code. */ -DEFGSCODE(GIMPLE_OMP_CONTINUE, "gimple_omp_continue", GSS_OMP_CONTINUE) +DEFGSCODE(GIMPLE_OMP_CONTINUE, "gimple_omp_continue", GSS_WITH_OPS) /* GIMPLE_OMP_ENTRY_END marks the end of the unpredicated entry block into an offloaded region. */ -DEFGSCODE(GIMPLE_OMP_ENTRY_END, "gimple_omp_entry_end", GSS_BASE) +DEFGSCODE(GIMPLE_OMP_ENTRY_END, "gimple_omp_entry_end", GSS_WITH_OPS) /* GIMPLE_OMP_CRITICAL <NAME, BODY> represents Index: gcc/gimple.h =================================================================== --- gcc/gimple.h (revision 224547) +++ gcc/gimple.h (working copy) @@ -301,7 +301,7 @@ struct GTY((tag("GSS_CALL"))) /* OMP statements. */ struct GTY((tag("GSS_OMP"))) - gimple_statement_omp : public gimple_statement_base + gimple_statement_omp : public gimple_statement_with_ops_base { /* [ WORD 1-6 ] : base class */ @@ -520,20 +520,8 @@ struct GTY((tag("GSS_OMP_CRITICAL"))) struct GTY(()) gimple_omp_for_iter { - /* Condition code. */ - enum tree_code cond; - - /* Index variable. */ - tree index; - - /* Initial value. */ - tree initial; - - /* Final value. */ - tree final; - - /* Increment. */ - tree incr; + /* Condition code and increment code. */ + enum tree_code cond, incr; }; /* GIMPLE_OMP_FOR */ @@ -556,6 +544,12 @@ struct GTY((tag("GSS_OMP_FOR"))) /* [ WORD 11 ] Pre-body evaluated before the loop body begins. */ gimple_seq pre_body; + + /* [ WORD 12 ] + Operand vector. NOTE! This must always be the last field + of this structure. In particular, this means that this + structure cannot be embedded inside another one. */ + tree GTY((length ("%h.num_ops"))) op[1]; }; @@ -581,10 +575,6 @@ struct GTY((tag("GSS_OMP_PARALLEL_LAYOUT /* [ WORD 11 ] Size of the gang-local memory to allocate. */ tree ganglocal_size; - - /* [ WORD 12 ] - A pointer to the array to be used for broadcasting across threads. 
*/ - tree broadcast_array; }; /* GIMPLE_OMP_PARALLEL or GIMPLE_TASK */ @@ -655,16 +645,10 @@ struct GTY((tag("GSS_OMP_SECTIONS"))) Note: This does not inherit from gimple_statement_omp, because we do not need the body field. */ -struct GTY((tag("GSS_OMP_CONTINUE"))) - gomp_continue : public gimple_statement_base +struct GTY((tag("GSS_WITH_OPS"))) + gomp_continue : public gimple_statement_with_ops { - /* [ WORD 1-6 ] : base class */ - - /* [ WORD 7 ] */ - tree control_def; - - /* [ WORD 8 ] */ - tree control_use; + /* no additional fields; this uses the layout for GSS_WITH_OPS. */ }; /* GIMPLE_OMP_SINGLE, GIMPLE_OMP_TEAMS */ @@ -1356,7 +1340,7 @@ gimple gimple_build_omp_taskgroup (gimpl gomp_continue *gimple_build_omp_continue (tree, tree); gimple gimple_build_omp_ordered (gimple_seq); gimple gimple_build_omp_return (bool); -gimple gimple_build_omp_entry_end (); +gimple gimple_build_omp_entry_end (tree); gomp_sections *gimple_build_omp_sections (gimple_seq, tree); gimple gimple_build_omp_sections_switch (void); gomp_single *gimple_build_omp_single (gimple_seq, tree); @@ -1853,7 +1837,10 @@ gimple_init_singleton (gimple g) static inline bool gimple_has_ops (const_gimple g) { - return gimple_code (g) >= GIMPLE_COND && gimple_code (g) <= GIMPLE_RETURN; + return ((gimple_code (g) >= GIMPLE_COND && gimple_code (g) <= GIMPLE_RETURN) + || gimple_code (g) == GIMPLE_OMP_FOR + || gimple_code (g) == GIMPLE_OMP_ENTRY_END + || gimple_code (g) == GIMPLE_OMP_CONTINUE); } template <> @@ -4559,6 +4546,27 @@ gimple_omp_for_set_cond (gimple gs, size omp_for_stmt->iter[i].cond = cond; } +/* Return the increment code associated with the OMP_FOR statement GS. */ + +static inline enum tree_code +gimple_omp_for_incr_code (const_gimple gs, size_t i) +{ + const gomp_for *omp_for_stmt = as_a <const gomp_for *> (gs); + gcc_gimple_checking_assert (i < omp_for_stmt->collapse); + return omp_for_stmt->iter[i].incr; +} + + +/* Set INCR to be the increment code for the OMP_FOR statement GS. */ + +static inline void +gimple_omp_for_set_incr_code (gimple gs, size_t i, enum tree_code incr) +{ + gomp_for *omp_for_stmt = as_a <gomp_for *> (gs); + gcc_gimple_checking_assert (i < omp_for_stmt->collapse); + omp_for_stmt->iter[i].incr = incr; +} + /* Return the index variable for the OMP_FOR statement GS. 
*/ @@ -4567,7 +4575,7 @@ gimple_omp_for_index (const_gimple gs, s { const gomp_for *omp_for_stmt = as_a <const gomp_for *> (gs); gcc_gimple_checking_assert (i < omp_for_stmt->collapse); - return omp_for_stmt->iter[i].index; + return gimple_op (gs, i); } @@ -4578,7 +4586,7 @@ gimple_omp_for_index_ptr (gimple gs, siz { gomp_for *omp_for_stmt = as_a <gomp_for *> (gs); gcc_gimple_checking_assert (i < omp_for_stmt->collapse); - return &omp_for_stmt->iter[i].index; + return gimple_op_ptr (gs, i); } @@ -4588,8 +4596,9 @@ static inline void gimple_omp_for_set_index (gimple gs, size_t i, tree index) { gomp_for *omp_for_stmt = as_a <gomp_for *> (gs); - gcc_gimple_checking_assert (i < omp_for_stmt->collapse); - omp_for_stmt->iter[i].index = index; + size_t c = omp_for_stmt->collapse; + gcc_gimple_checking_assert (i < c); + gimple_set_op (gs, i, index); } @@ -4599,8 +4608,9 @@ static inline tree gimple_omp_for_initial (const_gimple gs, size_t i) { const gomp_for *omp_for_stmt = as_a <const gomp_for *> (gs); - gcc_gimple_checking_assert (i < omp_for_stmt->collapse); - return omp_for_stmt->iter[i].initial; + size_t c = omp_for_stmt->collapse; + gcc_gimple_checking_assert (i < c); + return gimple_op (gs, i + c); } @@ -4610,8 +4620,9 @@ static inline tree * gimple_omp_for_initial_ptr (gimple gs, size_t i) { gomp_for *omp_for_stmt = as_a <gomp_for *> (gs); - gcc_gimple_checking_assert (i < omp_for_stmt->collapse); - return &omp_for_stmt->iter[i].initial; + size_t c = omp_for_stmt->collapse; + gcc_gimple_checking_assert (i < c); + return gimple_op_ptr (gs, i + c); } @@ -4621,8 +4632,9 @@ static inline void gimple_omp_for_set_initial (gimple gs, size_t i, tree initial) { gomp_for *omp_for_stmt = as_a <gomp_for *> (gs); - gcc_gimple_checking_assert (i < omp_for_stmt->collapse); - omp_for_stmt->iter[i].initial = initial; + size_t c = omp_for_stmt->collapse; + gcc_gimple_checking_assert (i < c); + gimple_set_op (gs, i + c, initial); } @@ -4632,8 +4644,9 @@ static inline tree gimple_omp_for_final (const_gimple gs, size_t i) { const gomp_for *omp_for_stmt = as_a <const gomp_for *> (gs); - gcc_gimple_checking_assert (i < omp_for_stmt->collapse); - return omp_for_stmt->iter[i].final; + size_t c = omp_for_stmt->collapse; + gcc_gimple_checking_assert (i < c); + return gimple_op (gs, i + c * 2); } @@ -4643,8 +4656,9 @@ static inline tree * gimple_omp_for_final_ptr (gimple gs, size_t i) { gomp_for *omp_for_stmt = as_a <gomp_for *> (gs); - gcc_gimple_checking_assert (i < omp_for_stmt->collapse); - return &omp_for_stmt->iter[i].final; + size_t c = omp_for_stmt->collapse; + gcc_gimple_checking_assert (i < c); + return gimple_op_ptr (gs, i + c * 2); } @@ -4654,8 +4668,9 @@ static inline void gimple_omp_for_set_final (gimple gs, size_t i, tree final) { gomp_for *omp_for_stmt = as_a <gomp_for *> (gs); - gcc_gimple_checking_assert (i < omp_for_stmt->collapse); - omp_for_stmt->iter[i].final = final; + size_t c = omp_for_stmt->collapse; + gcc_gimple_checking_assert (i < c); + gimple_set_op (gs, i + c * 2, final); } @@ -4665,8 +4680,9 @@ static inline tree gimple_omp_for_incr (const_gimple gs, size_t i) { const gomp_for *omp_for_stmt = as_a <const gomp_for *> (gs); - gcc_gimple_checking_assert (i < omp_for_stmt->collapse); - return omp_for_stmt->iter[i].incr; + size_t c = omp_for_stmt->collapse; + gcc_gimple_checking_assert (i < c); + return gimple_op (gs, i + c * 3); } @@ -4676,8 +4692,9 @@ static inline tree * gimple_omp_for_incr_ptr (gimple gs, size_t i) { gomp_for *omp_for_stmt = as_a <gomp_for *> (gs); - 
gcc_gimple_checking_assert (i < omp_for_stmt->collapse); - return &omp_for_stmt->iter[i].incr; + size_t c = omp_for_stmt->collapse; + gcc_gimple_checking_assert (i < c); + return gimple_op_ptr (gs, i + c * 3); } @@ -4687,8 +4704,9 @@ static inline void gimple_omp_for_set_incr (gimple gs, size_t i, tree incr) { gomp_for *omp_for_stmt = as_a <gomp_for *> (gs); - gcc_gimple_checking_assert (i < omp_for_stmt->collapse); - omp_for_stmt->iter[i].incr = incr; + size_t c = omp_for_stmt->collapse; + gcc_gimple_checking_assert (i < c); + gimple_set_op (gs, i + c * 3, incr); } @@ -5248,25 +5266,6 @@ gimple_omp_target_set_ganglocal_size (go } -/* Return the pointer to the broadcast array associated with OMP_TARGET GS. */ - -static inline tree -gimple_omp_target_broadcast_array (const gomp_target *omp_target_stmt) -{ - return omp_target_stmt->broadcast_array; -} - - -/* Set PTR to be the broadcast array associated with OMP_TARGET - GS. */ - -static inline void -gimple_omp_target_set_broadcast_array (gomp_target *omp_target_stmt, tree ptr) -{ - omp_target_stmt->broadcast_array = ptr; -} - - /* Return the clauses associated with OMP_TEAMS GS. */ static inline tree @@ -5446,7 +5445,7 @@ gimple_omp_atomic_load_rhs_ptr (gomp_ato static inline tree gimple_omp_continue_control_def (const gomp_continue *cont_stmt) { - return cont_stmt->control_def; + return gimple_op (cont_stmt, 0); } /* The same as above, but return the address. */ @@ -5454,7 +5453,7 @@ gimple_omp_continue_control_def (const g static inline tree * gimple_omp_continue_control_def_ptr (gomp_continue *cont_stmt) { - return &cont_stmt->control_def; + return gimple_op_ptr (cont_stmt, 0); } /* Set the definition of the control variable in a GIMPLE_OMP_CONTINUE. */ @@ -5462,7 +5461,7 @@ gimple_omp_continue_control_def_ptr (gom static inline void gimple_omp_continue_set_control_def (gomp_continue *cont_stmt, tree def) { - cont_stmt->control_def = def; + gimple_set_op (cont_stmt, 0, def); } @@ -5471,7 +5470,7 @@ gimple_omp_continue_set_control_def (gom static inline tree gimple_omp_continue_control_use (const gomp_continue *cont_stmt) { - return cont_stmt->control_use; + return gimple_op (cont_stmt, 1); } @@ -5480,7 +5479,7 @@ gimple_omp_continue_control_use (const g static inline tree * gimple_omp_continue_control_use_ptr (gomp_continue *cont_stmt) { - return &cont_stmt->control_use; + return gimple_op_ptr (cont_stmt, 1); } @@ -5489,7 +5488,7 @@ gimple_omp_continue_control_use_ptr (gom static inline void gimple_omp_continue_set_control_use (gomp_continue *cont_stmt, tree use) { - cont_stmt->control_use = use; + gimple_set_op (cont_stmt, 1, use); } /* Return a pointer to the body for the GIMPLE_TRANSACTION statement Index: gcc/gimplify.c =================================================================== --- gcc/gimplify.c (revision 224547) +++ gcc/gimplify.c (working copy) @@ -7582,12 +7582,15 @@ gimplify_omp_for (tree *expr_p, gimple_s for (i = 0; i < TREE_VEC_LENGTH (OMP_FOR_INIT (for_stmt)); i++) { t = TREE_VEC_ELT (OMP_FOR_INIT (for_stmt), i); - gimple_omp_for_set_index (gfor, i, TREE_OPERAND (t, 0)); + tree idxvar = TREE_OPERAND (t, 0); + gimple_omp_for_set_index (gfor, i, idxvar); gimple_omp_for_set_initial (gfor, i, TREE_OPERAND (t, 1)); t = TREE_VEC_ELT (OMP_FOR_COND (for_stmt), i); gimple_omp_for_set_cond (gfor, i, TREE_CODE (t)); gimple_omp_for_set_final (gfor, i, TREE_OPERAND (t, 1)); t = TREE_VEC_ELT (OMP_FOR_INCR (for_stmt), i); + t = TREE_OPERAND (t, 1); + gimple_omp_for_set_incr_code (gfor, i, TREE_CODE (t)); gimple_omp_for_set_incr 
(gfor, i, TREE_OPERAND (t, 1)); } Index: gcc/gsstruct.def =================================================================== --- gcc/gsstruct.def (revision 224547) +++ gcc/gsstruct.def (working copy) @@ -42,12 +42,11 @@ DEFGSSTRUCT(GSS_EH_ELSE, geh_else, false DEFGSSTRUCT(GSS_WCE, gimple_statement_wce, false) DEFGSSTRUCT(GSS_OMP, gimple_statement_omp, false) DEFGSSTRUCT(GSS_OMP_CRITICAL, gomp_critical, false) -DEFGSSTRUCT(GSS_OMP_FOR, gomp_for, false) +DEFGSSTRUCT(GSS_OMP_FOR, gomp_for, true) DEFGSSTRUCT(GSS_OMP_PARALLEL_LAYOUT, gimple_statement_omp_parallel_layout, false) DEFGSSTRUCT(GSS_OMP_TASK, gomp_task, false) DEFGSSTRUCT(GSS_OMP_SECTIONS, gomp_sections, false) DEFGSSTRUCT(GSS_OMP_SINGLE_LAYOUT, gimple_statement_omp_single_layout, false) -DEFGSSTRUCT(GSS_OMP_CONTINUE, gomp_continue, false) DEFGSSTRUCT(GSS_OMP_ATOMIC_LOAD, gomp_atomic_load, false) DEFGSSTRUCT(GSS_OMP_ATOMIC_STORE_LAYOUT, gomp_atomic_store, false) DEFGSSTRUCT(GSS_TRANSACTION, gtransaction, false) Index: gcc/ipa-inline-analysis.c =================================================================== --- gcc/ipa-inline-analysis.c (revision 224547) +++ gcc/ipa-inline-analysis.c (working copy) @@ -4122,10 +4122,12 @@ inline_generate_summary (void) { struct cgraph_node *node; +#ifndef ACCEL_COMPILER /* When not optimizing, do not bother to analyze. Inlining is still done because edge redirection needs to happen there. */ if (!optimize && !flag_generate_lto && !flag_generate_offload && !flag_wpa) return; +#endif if (!inline_summaries) inline_summaries = (inline_summary_t*) inline_summary_t::create_ggc (symtab); Index: gcc/lto/lto.c =================================================================== --- gcc/lto/lto.c (revision 224547) +++ gcc/lto/lto.c (working copy) @@ -3115,8 +3115,10 @@ read_cgraph_and_symbols (unsigned nfiles /* Read the IPA summary data. */ if (flag_ltrans) ipa_read_optimization_summaries (); +#ifndef ACCEL_COMPILER else ipa_read_summaries (); +#endif for (i = 0; all_file_decl_data[i]; i++) { Index: gcc/lto-streamer-out.c =================================================================== --- gcc/lto-streamer-out.c (revision 224547) +++ gcc/lto-streamer-out.c (working copy) @@ -1800,27 +1800,32 @@ output_ssa_names (struct output_block *o { unsigned int i, len; - len = vec_safe_length (SSANAMES (fn)); - streamer_write_uhwi (ob, len); - - for (i = 1; i < len; i++) + if (cfun->gimple_df) { - tree ptr = (*SSANAMES (fn))[i]; + len = vec_safe_length (SSANAMES (fn)); + streamer_write_uhwi (ob, len); - if (ptr == NULL_TREE - || SSA_NAME_IN_FREE_LIST (ptr) - || virtual_operand_p (ptr)) - continue; + for (i = 1; i < len; i++) + { + tree ptr = (*SSANAMES (fn))[i]; - streamer_write_uhwi (ob, i); - streamer_write_char_stream (ob->main_stream, - SSA_NAME_IS_DEFAULT_DEF (ptr)); - if (SSA_NAME_VAR (ptr)) - stream_write_tree (ob, SSA_NAME_VAR (ptr), true); - else - /* ??? This drops SSA_NAME_IDENTIFIER on the floor. */ - stream_write_tree (ob, TREE_TYPE (ptr), true); + if (ptr == NULL_TREE + || SSA_NAME_IN_FREE_LIST (ptr) + || virtual_operand_p (ptr)) + continue; + + streamer_write_uhwi (ob, i); + streamer_write_char_stream (ob->main_stream, + SSA_NAME_IS_DEFAULT_DEF (ptr)); + if (SSA_NAME_VAR (ptr)) + stream_write_tree (ob, SSA_NAME_VAR (ptr), true); + else + /* ??? This drops SSA_NAME_IDENTIFIER on the floor. 
*/ + stream_write_tree (ob, TREE_TYPE (ptr), true); + } } + else + streamer_write_zero (ob); streamer_write_zero (ob); } Index: gcc/omp-low.c =================================================================== --- gcc/omp-low.c (revision 224547) +++ gcc/omp-low.c (working copy) @@ -110,7 +110,7 @@ along with GCC; see the file COPYING3. #include "gomp-constants.h" #include "gimple-pretty-print.h" #include "set" - +#include "output.h" /* Lowering of OMP parallel and workshare constructs proceeds in two phases. The first phase scans the function looking for OMP statements @@ -597,17 +597,17 @@ extract_omp_for_data (gomp_for *for_stmt } t = gimple_omp_for_incr (for_stmt, i); - gcc_assert (TREE_OPERAND (t, 0) == var); - switch (TREE_CODE (t)) + enum tree_code incr_code = gimple_omp_for_incr_code (for_stmt, i); + switch (incr_code) { case PLUS_EXPR: - loop->step = TREE_OPERAND (t, 1); + loop->step = t; break; case POINTER_PLUS_EXPR: - loop->step = fold_convert (ssizetype, TREE_OPERAND (t, 1)); + loop->step = fold_convert (ssizetype, t); break; case MINUS_EXPR: - loop->step = TREE_OPERAND (t, 1); + loop->step = t; loop->step = fold_build1_loc (loc, NEGATE_EXPR, TREE_TYPE (loop->step), loop->step); @@ -9721,12 +9721,21 @@ loop_get_oacc_kernels_region_entry (stru } } +static bool +was_offloaded_p (tree fn) +{ +#ifdef ACCEL_COMPILER + return true; +#endif + struct cgraph_node *node = cgraph_node::get (fn); + return node->offloadable; +} + /* Expand the GIMPLE_OMP_TARGET starting at REGION. */ static void expand_omp_target (struct omp_region *region) { - basic_block entry_bb, exit_bb, new_bb; struct function *child_cfun; tree child_fn, block, t; gimple_stmt_iterator gsi; @@ -9736,12 +9745,33 @@ expand_omp_target (struct omp_region *re bool offloaded, data_region; bool do_emit_library_call = true; bool do_splitoff = true; + bool already_offloaded = was_offloaded_p (current_function_decl); entry_stmt = as_a <gomp_target *> (last_stmt (region->entry)); + location_t entry_loc = gimple_location (entry_stmt); - new_bb = region->entry; + basic_block new_bb = region->entry; + basic_block entry_bb = region->entry; + basic_block exit_bb = region->exit; + basic_block entry_succ_bb = single_succ (entry_bb); - offloaded = is_gimple_omp_offloaded (entry_stmt); + if (already_offloaded) + { + gsi = gsi_for_stmt (entry_stmt); + gsi_remove (&gsi, true); + + gsi = gsi_last_bb (exit_bb); + gcc_assert (!gsi_end_p (gsi) + && gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_RETURN); + gsi_remove (&gsi, true); + + gsi = gsi_last_bb (entry_succ_bb); + if (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_ENTRY_END) + gsi_remove (&gsi, true); + return; + } + + offloaded = !already_offloaded && is_gimple_omp_offloaded (entry_stmt); switch (gimple_omp_target_kind (entry_stmt)) { case GF_OMP_TARGET_KIND_REGION: @@ -9773,9 +9803,6 @@ expand_omp_target (struct omp_region *re if (child_cfun != NULL) gcc_checking_assert (!child_cfun->cfg); - entry_bb = region->entry; - exit_bb = region->exit; - if (gimple_omp_target_kind (entry_stmt) == GF_OMP_TARGET_KIND_OACC_KERNELS) { if (!gimple_in_ssa_p (cfun)) @@ -9814,13 +9841,7 @@ expand_omp_target (struct omp_region *re } } - basic_block entry_succ_bb = single_succ (entry_bb); - if (offloaded && !gimple_in_ssa_p (cfun)) - { - gsi = gsi_last_bb (entry_succ_bb); - if (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_ENTRY_END) - gsi_remove (&gsi, true); - } + tree data_arg = gimple_omp_target_data_arg (entry_stmt); if (offloaded && do_splitoff) @@ -9840,7 +9861,6 @@ expand_omp_target (struct omp_region *re a 
function call that has been inlined, the original PARM_DECL .OMP_DATA_I may have been converted into a different local variable. In which case, we need to keep the assignment. */ - tree data_arg = gimple_omp_target_data_arg (entry_stmt); if (data_arg) { gimple_stmt_iterator gsi; @@ -9923,8 +9943,12 @@ expand_omp_target (struct omp_region *re stmt = gsi_stmt (gsi); gcc_assert (stmt && gimple_code (stmt) == gimple_code (entry_stmt)); + gsi_prev (&gsi); + stmt = gsi_stmt (gsi); e = split_block (entry_bb, stmt); +#if 0 gsi_remove (&gsi, true); +#endif entry_bb = e->dest; single_succ_edge (entry_bb)->flags = EDGE_FALLTHRU; @@ -9932,11 +9956,16 @@ expand_omp_target (struct omp_region *re if (exit_bb) { gsi = gsi_last_bb (exit_bb); + gimple ompret = gsi_stmt (gsi); gcc_assert (!gsi_end_p (gsi) - && gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_RETURN); + && gimple_code (ompret) == GIMPLE_OMP_RETURN); stmt = gimple_build_return (NULL); gsi_insert_after (&gsi, stmt, GSI_SAME_STMT); +#if 0 gsi_remove (&gsi, true); +#endif + edge e1 = split_block (exit_bb, ompret); + exit_bb = e1->dest; /* A vuse in single_succ (exit_bb) may use a vdef from the region which is about to be split off. Mark the vdef for renaming. */ @@ -9955,6 +9984,9 @@ expand_omp_target (struct omp_region *re else block = gimple_block (entry_stmt); + /* Make sure we don't try to copy these. */ + gimple_omp_target_set_child_fn (entry_stmt, NULL); + gimple_omp_target_set_data_arg (entry_stmt, NULL); new_bb = move_sese_region_to_fn (child_cfun, entry_bb, exit_bb, block); if (exit_bb) single_succ_edge (new_bb)->flags = EDGE_FALLTHRU; @@ -9979,6 +10011,8 @@ expand_omp_target (struct omp_region *re /* Inform the callgraph about the new function. */ DECL_STRUCT_FUNCTION (child_fn)->curr_properties = cfun->curr_properties; + DECL_STRUCT_FUNCTION (child_fn)->curr_properties &= ~PROP_gimple_eomp; + cgraph_node::add_new_function (child_fn, true); cgraph_node::get (child_fn)->parallelized_function = 1; @@ -10088,7 +10122,7 @@ expand_omp_target (struct omp_region *re clause_loc = OMP_CLAUSE_LOCATION (c); } else - clause_loc = gimple_location (entry_stmt); + clause_loc = entry_loc; /* Ensure 'device' is of the correct type. */ device = fold_convert_loc (clause_loc, integer_type_node, device); @@ -10147,7 +10181,7 @@ expand_omp_target (struct omp_region *re gsi = gsi_last_bb (new_bb); t = gimple_omp_target_data_arg (entry_stmt); - if (t == NULL) + if (data_arg == NULL) { t1 = size_zero_node; t2 = build_zero_cst (ptr_type_node); @@ -10156,11 +10190,11 @@ expand_omp_target (struct omp_region *re } else { - t1 = TYPE_MAX_VALUE (TYPE_DOMAIN (TREE_TYPE (TREE_VEC_ELT (t, 1)))); + t1 = TYPE_MAX_VALUE (TYPE_DOMAIN (TREE_TYPE (TREE_VEC_ELT (data_arg, 1)))); t1 = size_binop (PLUS_EXPR, t1, size_int (1)); - t2 = build_fold_addr_expr (TREE_VEC_ELT (t, 0)); - t3 = build_fold_addr_expr (TREE_VEC_ELT (t, 1)); - t4 = build_fold_addr_expr (TREE_VEC_ELT (t, 2)); + t2 = build_fold_addr_expr (TREE_VEC_ELT (data_arg, 0)); + t3 = build_fold_addr_expr (TREE_VEC_ELT (data_arg, 1)); + t4 = build_fold_addr_expr (TREE_VEC_ELT (data_arg, 2)); } gimple g; @@ -10209,8 +10243,7 @@ expand_omp_target (struct omp_region *re /* Default values for num_gangs, num_workers, and vector_length. 
*/ t_num_gangs = t_num_workers = t_vector_length - = fold_convert_loc (gimple_location (entry_stmt), - integer_type_node, integer_one_node); + = fold_convert_loc (entry_loc, integer_type_node, integer_one_node); /* ..., but if present, use the value specified by the respective clause, making sure that are of the correct type. */ c = find_omp_clause (clauses, OMP_CLAUSE_NUM_GANGS); @@ -10241,8 +10274,7 @@ expand_omp_target (struct omp_region *re int t_wait_idx; /* Default values for t_async. */ - t_async = fold_convert_loc (gimple_location (entry_stmt), - integer_type_node, + t_async = fold_convert_loc (entry_loc, integer_type_node, build_int_cst (integer_type_node, GOMP_ASYNC_SYNC)); /* ..., but if present, use the value specified by the respective @@ -10257,8 +10289,7 @@ expand_omp_target (struct omp_region *re /* Save the index, and... */ t_wait_idx = args.length (); /* ... push a default value. */ - args.quick_push (fold_convert_loc (gimple_location (entry_stmt), - integer_type_node, + args.quick_push (fold_convert_loc (entry_loc, integer_type_node, integer_zero_node)); c = find_omp_clause (clauses, OMP_CLAUSE_WAIT); if (c) @@ -10279,8 +10310,7 @@ expand_omp_target (struct omp_region *re /* Now that we know the number, replace the default value. */ args.ordered_remove (t_wait_idx); args.quick_insert (t_wait_idx, - fold_convert_loc (gimple_location (entry_stmt), - integer_type_node, + fold_convert_loc (entry_loc, integer_type_node, build_int_cst (integer_type_node, n))); } } @@ -10290,7 +10320,7 @@ expand_omp_target (struct omp_region *re } g = gimple_build_call_vec (builtin_decl_explicit (start_ix), args); - gimple_set_location (g, gimple_location (entry_stmt)); + gimple_set_location (g, entry_loc); gsi_insert_before (&gsi, g, GSI_SAME_STMT); if (!offloaded) { @@ -10310,6 +10340,23 @@ expand_omp_target (struct omp_region *re update_ssa (TODO_update_ssa_only_virtuals); } +static bool +expand_region_inner_p (omp_region *region) +{ + if (!region->inner) + return false; + + if (region->type != GIMPLE_OMP_TARGET) + return true; + if (was_offloaded_p (current_function_decl)) + return true; + + gomp_target *entry_stmt = as_a <gomp_target *> (last_stmt (region->entry)); + bool offloaded = is_gimple_omp_offloaded (entry_stmt); + + return !offloaded || !is_gimple_omp_oacc (entry_stmt); +} + /* Expand the parallel region tree rooted at REGION. Expansion proceeds in depth-first order. Innermost regions are expanded first. This way, parallel regions that require a new function to @@ -10340,8 +10387,7 @@ expand_omp (struct omp_region *region) if (region->type == GIMPLE_OMP_FOR && gimple_omp_for_combined_p (last_stmt (region->entry))) inner_stmt = last_stmt (region->inner->entry); - - if (region->inner) + if (expand_region_inner_p (region)) expand_omp (region->inner); saved_location = input_location; @@ -10439,7 +10485,9 @@ find_omp_target_region_data (struct omp_ region->gwv_this |= MASK_WORKER; if (find_omp_clause (clauses, OMP_CLAUSE_VECTOR_LENGTH)) region->gwv_this |= MASK_VECTOR; - region->broadcast_array = gimple_omp_target_broadcast_array (stmt); + basic_block entry_succ = single_succ (region->entry); + gimple ee_stmt = last_stmt (entry_succ); + region->broadcast_array = gimple_op (ee_stmt, 0); } /* Helper for build_omp_regions. 
Scan the dominator tree starting at @@ -10666,6 +10714,7 @@ generate_vector_broadcast (tree dest_var conv1 = gimple_build_assign (casted_var, NOP_EXPR, var); gsi_insert_after (&where, conv1, GSI_CONTINUE_LINKING); + retval = conv1; } tree decl = builtin_decl_explicit (fn); @@ -10709,19 +10758,21 @@ generate_oacc_broadcast (omp_region *reg omp_region *parent = enclosing_target_region (region); tree elttype = build_qualified_type (TREE_TYPE (var), TYPE_QUAL_VOLATILE); - tree ptr = create_tmp_var (build_pointer_type (elttype)); - gassign *cast1 = gimple_build_assign (ptr, NOP_EXPR, + tree ptrtype = build_pointer_type (elttype); + tree ptr1 = make_ssa_name (ptrtype); + tree ptr2 = make_ssa_name (ptrtype); + gassign *cast1 = gimple_build_assign (ptr1, NOP_EXPR, parent->broadcast_array); gsi_insert_after (&where, cast1, GSI_NEW_STMT); - gassign *st = gimple_build_assign (build_simple_mem_ref (ptr), var); + gassign *st = gimple_build_assign (build_simple_mem_ref (ptr1), var); gsi_insert_after (&where, st, GSI_NEW_STMT); gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT); - gassign *cast2 = gimple_build_assign (ptr, NOP_EXPR, + gassign *cast2 = gimple_build_assign (ptr2, NOP_EXPR, parent->broadcast_array); gsi_insert_after (&where, cast2, GSI_NEW_STMT); - gassign *ld = gimple_build_assign (dest_var, build_simple_mem_ref (ptr)); + gassign *ld = gimple_build_assign (dest_var, build_simple_mem_ref (ptr2)); gsi_insert_after (&where, ld, GSI_NEW_STMT); gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT); @@ -10735,7 +10786,8 @@ generate_oacc_broadcast (omp_region *reg the bits MASK_VECTOR and/or MASK_WORKER. */ static void -make_predication_test (edge true_edge, basic_block skip_dest_bb, int mask) +make_predication_test (edge true_edge, basic_block skip_dest_bb, int mask, + bool set_dominator) { basic_block cond_bb = true_edge->src; @@ -10747,7 +10799,7 @@ make_predication_test (edge true_edge, b if (mask & MASK_VECTOR) { gimple call = gimple_build_call (decl, 1, integer_zero_node); - vvar = create_tmp_var (unsigned_type_node); + vvar = make_ssa_name (unsigned_type_node); comp_var = vvar; gimple_call_set_lhs (call, vvar); gsi_insert_after (&tmp_gsi, call, GSI_NEW_STMT); @@ -10755,14 +10807,14 @@ make_predication_test (edge true_edge, b if (mask & MASK_WORKER) { gimple call = gimple_build_call (decl, 1, integer_one_node); - wvar = create_tmp_var (unsigned_type_node); + wvar = make_ssa_name (unsigned_type_node); comp_var = wvar; gimple_call_set_lhs (call, wvar); gsi_insert_after (&tmp_gsi, call, GSI_NEW_STMT); } if (wvar && vvar) { - comp_var = create_tmp_var (unsigned_type_node); + comp_var = make_ssa_name (unsigned_type_node); gassign *ior = gimple_build_assign (comp_var, BIT_IOR_EXPR, wvar, vvar); gsi_insert_after (&tmp_gsi, ior, GSI_NEW_STMT); } @@ -10782,6 +10834,9 @@ make_predication_test (edge true_edge, b basic_block false_abnorm_bb = split_edge (e); edge abnorm_edge = single_succ_edge (false_abnorm_bb); abnorm_edge->flags |= EDGE_ABNORMAL; + + if (set_dominator) + set_immediate_dominator (CDI_DOMINATORS, skip_dest_bb, cond_bb); } /* Apply OpenACC predication to basic block BB which is in @@ -10791,6 +10846,8 @@ make_predication_test (edge true_edge, b static void predicate_bb (basic_block bb, struct omp_region *parent, int mask) { + bool set_dominator = true; + /* We handle worker-single vector-partitioned loops by jumping around them if not in the controlling worker. Don't insert unnecessary (and incorrect) predication. 
*/ @@ -10816,8 +10873,8 @@ predicate_bb (basic_block bb, struct omp if (gimple_code (stmt) == GIMPLE_COND) { - tree cond_var = create_tmp_var (boolean_type_node); - tree broadcast_cond = create_tmp_var (boolean_type_node); + tree cond_var = make_ssa_name (boolean_type_node); + tree broadcast_cond = make_ssa_name (boolean_type_node); gassign *asgn = gimple_build_assign (cond_var, gimple_cond_code (stmt), gimple_cond_lhs (stmt), @@ -10830,30 +10887,36 @@ predicate_bb (basic_block bb, struct omp mask); edge e = split_block (bb, splitpoint); + set_immediate_dominator (CDI_DOMINATORS, e->dest, e->src); e->flags = EDGE_ABNORMAL; skip_dest_bb = e->dest; gimple_cond_set_condition (as_a <gcond *> (stmt), EQ_EXPR, broadcast_cond, boolean_true_node); + update_stmt (stmt); } else if (gimple_code (stmt) == GIMPLE_SWITCH) { gswitch *sstmt = as_a <gswitch *> (stmt); tree var = gimple_switch_index (sstmt); - tree new_var = create_tmp_var (TREE_TYPE (var)); + tree new_var = make_ssa_name (TREE_TYPE (var)); +#if 0 gassign *asgn = gimple_build_assign (new_var, var); gsi_insert_before (&gsi, asgn, GSI_CONTINUE_LINKING); gimple_stmt_iterator gsi_asgn = gsi_for_stmt (asgn); - +#endif + gsi_prev (&gsi); gimple splitpoint = generate_oacc_broadcast (parent, new_var, var, - gsi_asgn, mask); + gsi, mask); edge e = split_block (bb, splitpoint); + set_immediate_dominator (CDI_DOMINATORS, e->dest, e->src); e->flags = EDGE_ABNORMAL; skip_dest_bb = e->dest; gimple_switch_set_index (sstmt, new_var); + update_stmt (stmt); } else if (is_gimple_omp (stmt)) { @@ -10876,6 +10939,7 @@ predicate_bb (basic_block bb, struct omp gimple_stmt_iterator head_gsi = gsi_start_bb (bb); gsi_prev (&head_gsi); edge e0 = split_block (bb, gsi_stmt (head_gsi)); + set_immediate_dominator (CDI_DOMINATORS, e0->dest, e0->src); int mask2 = mask; if (code == GIMPLE_OMP_FOR) mask2 &= ~MASK_VECTOR; @@ -10885,7 +10949,7 @@ predicate_bb (basic_block bb, struct omp so we just need to make one branch around the entire loop. */ inner->entry = e0->dest; - make_predication_test (e0, skip_dest_bb, mask2); + make_predication_test (e0, skip_dest_bb, mask2, true); return; } basic_block for_block = e0->dest; @@ -10896,9 +10960,9 @@ predicate_bb (basic_block bb, struct omp edge e2 = split_block (for_block, split_stmt); basic_block bb2 = e2->dest; - make_predication_test (e0, bb2, mask); + make_predication_test (e0, bb2, mask, true); make_predication_test (single_pred_edge (bb3), skip_dest_bb, - mask2); + mask2, true); inner->entry = bb3; return; } @@ -10917,6 +10981,7 @@ predicate_bb (basic_block bb, struct omp if (!split_stmt) return; edge e = split_block (bb, split_stmt); + set_immediate_dominator (CDI_DOMINATORS, e->dest, e->src); skip_dest_bb = e->dest; if (gimple_code (stmt) == GIMPLE_OMP_CONTINUE) { @@ -10945,6 +11010,8 @@ predicate_bb (basic_block bb, struct omp gsi_prev (&gsi); if (gsi_stmt (gsi) == 0) return; + if (get_immediate_dominator (CDI_DOMINATORS, skip_dest_bb) != bb) + set_dominator = false; } if (skip_dest_bb != NULL) @@ -10952,24 +11019,31 @@ predicate_bb (basic_block bb, struct omp gimple_stmt_iterator head_gsi = gsi_start_bb (bb); gsi_prev (&head_gsi); edge e2 = split_block (bb, gsi_stmt (head_gsi)); - make_predication_test (e2, skip_dest_bb, mask); + set_immediate_dominator (CDI_DOMINATORS, e2->dest, e2->src); + make_predication_test (e2, skip_dest_bb, mask, set_dominator); } } /* Walk the dominator tree starting at BB to collect basic blocks in WORKLIST which need OpenACC vector predication applied to them. 
*/ -static void +static bool find_predicatable_bbs (basic_block bb, vec<basic_block> &worklist) { + bool ret = false; struct omp_region *parent = *bb_region_map->get (bb); if (required_predication_mask (parent) != 0) - worklist.safe_push (bb); + { + worklist.safe_push (bb); + ret = true; + } + basic_block son; for (son = first_dom_son (CDI_DOMINATORS, bb); son; son = next_dom_son (CDI_DOMINATORS, son)) - find_predicatable_bbs (son, worklist); + ret |= find_predicatable_bbs (son, worklist); + return ret; } /* Apply OpenACC vector predication to all basic blocks. HEAD_BB is the @@ -10979,7 +11053,9 @@ static void predicate_omp_regions (basic_block head_bb) { vec<basic_block> worklist = vNULL; - find_predicatable_bbs (head_bb, worklist); + if (!find_predicatable_bbs (head_bb, worklist)) + return; + int i; basic_block bb; FOR_EACH_VEC_ELT (worklist, i, bb) @@ -10988,6 +11064,11 @@ predicate_omp_regions (basic_block head_ int mask = required_predication_mask (region); predicate_bb (bb, region, mask); } + free_dominance_info (CDI_DOMINATORS); + calculate_dominance_info (CDI_DOMINATORS); + mark_virtual_operands_for_renaming (cfun); + update_ssa (TODO_update_ssa); + verify_ssa (true, true); } /* USE and GET sets for variable broadcasting. */ @@ -11176,7 +11257,8 @@ oacc_broadcast (basic_block entry_bb, ba /* Currently, subroutines aren't supported. */ gcc_assert (!lookup_attribute ("oacc function", - DECL_ATTRIBUTES (current_function_decl))); + DECL_ATTRIBUTES (current_function_decl)) + || was_offloaded_p (current_function_decl)); /* Populate live_in. */ oacc_populate_live_in (entry_bb, region); @@ -11236,7 +11318,7 @@ oacc_broadcast (basic_block entry_bb, ba gsi_prev (&gsi); edge e2 = split_block (entry_bb, gsi_stmt (gsi)); e2->flags |= EDGE_ABNORMAL; - make_predication_test (e2, dest_bb, mask); + make_predication_test (e2, dest_bb, mask, true); /* Update entry_bb. */ entry_bb = dest_bb; @@ -11249,7 +11331,7 @@ oacc_broadcast (basic_block entry_bb, ba /* Main entry point for expanding OMP-GIMPLE into runtime calls. 
*/ static unsigned int -execute_expand_omp (void) +execute_expand_omp (bool first) { bb_region_map = new hash_map<basic_block, omp_region *>; @@ -11264,7 +11346,8 @@ execute_expand_omp (void) fprintf (dump_file, "\n"); } - predicate_omp_regions (ENTRY_BLOCK_PTR_FOR_FN (cfun)); + if (!first) + predicate_omp_regions (ENTRY_BLOCK_PTR_FOR_FN (cfun)); remove_exit_barriers (root_omp_region); @@ -11317,9 +11400,10 @@ public: if (!gate) return 0; - return execute_expand_omp (); + return execute_expand_omp (true); } + opt_pass * clone () { return new pass_expand_omp (m_ctxt); } }; // class pass_expand_omp } // anon namespace @@ -11400,9 +11484,9 @@ public: } virtual unsigned int execute (function *) { - unsigned res = execute_expand_omp (); + unsigned res = execute_expand_omp (false); release_dangling_ssa_names (); - return res; + return res | TODO_update_ssa; } opt_pass * clone () { return new pass_expand_omp_ssa (m_ctxt); } @@ -12562,7 +12646,7 @@ lower_omp_for (gimple_stmt_iterator *gsi if (!is_gimple_min_invariant (*rhs_p)) *rhs_p = get_formal_tmp_var (*rhs_p, &body); - rhs_p = &TREE_OPERAND (gimple_omp_for_incr (stmt, i), 1); + rhs_p = gimple_omp_for_incr_ptr (stmt, i); if (!is_gimple_min_invariant (*rhs_p)) *rhs_p = get_formal_tmp_var (*rhs_p, &body); } @@ -13547,7 +13631,7 @@ lower_omp_target (gimple_stmt_iterator * if (offloaded) { - gimple_seq_add_stmt (&new_body, gimple_build_omp_entry_end ()); + gimple_seq_add_stmt (&new_body, gimple_build_omp_entry_end (ctx->worker_sync_elt)); if (has_reduction) { gimple_seq_add_seq (&irlist, tgt_body); @@ -13583,7 +13667,6 @@ lower_omp_target (gimple_stmt_iterator * gsi_insert_seq_before (gsi_p, sz_ilist, GSI_SAME_STMT); gimple_omp_target_set_ganglocal_size (stmt, sz); - gimple_omp_target_set_broadcast_array (stmt, ctx->worker_sync_elt); pop_gimplify_context (NULL); } Index: gcc/pass_manager.h =================================================================== --- gcc/pass_manager.h (revision 224547) +++ gcc/pass_manager.h (working copy) @@ -28,6 +28,7 @@ struct register_pass_info; #define GCC_PASS_LISTS \ DEF_PASS_LIST (all_lowering_passes) \ DEF_PASS_LIST (all_small_ipa_passes) \ + DEF_PASS_LIST (all_local_opt_passes) \ DEF_PASS_LIST (all_regular_ipa_passes) \ DEF_PASS_LIST (all_late_ipa_passes) \ DEF_PASS_LIST (all_passes) @@ -82,6 +83,7 @@ public: /* The root of the compilation pass tree, once constructed. */ opt_pass *all_passes; opt_pass *all_small_ipa_passes; + opt_pass *all_local_opt_passes; opt_pass *all_lowering_passes; opt_pass *all_regular_ipa_passes; opt_pass *all_late_ipa_passes; Index: gcc/passes.c =================================================================== --- gcc/passes.c (revision 224547) +++ gcc/passes.c (working copy) @@ -454,8 +454,12 @@ public: /* opt_pass methods: */ virtual bool gate (function *) { - /* Don't bother doing anything if the program has errors. 
*/ - return (!seen_error () && !in_lto_p); + if (seen_error ()) + return false; +#ifdef ACCEL_COMPILER + return true; +#endif + return !in_lto_p; } }; // class pass_local_optimization_passes @@ -952,6 +956,7 @@ pass_manager::dump_passes () const dump_pass_list (all_lowering_passes, 1); dump_pass_list (all_small_ipa_passes, 1); + dump_pass_list (all_local_opt_passes, 1); dump_pass_list (all_regular_ipa_passes, 1); dump_pass_list (all_late_ipa_passes, 1); dump_pass_list (all_passes, 1); @@ -1463,6 +1468,8 @@ pass_manager::register_pass (struct regi if (!success || all_instances) success |= position_pass (pass_info, &all_small_ipa_passes); if (!success || all_instances) + success |= position_pass (pass_info, &all_local_opt_passes); + if (!success || all_instances) success |= position_pass (pass_info, &all_regular_ipa_passes); if (!success || all_instances) success |= position_pass (pass_info, &all_late_ipa_passes); @@ -1515,9 +1522,10 @@ pass_manager::register_pass (struct regi If we are optimizing, compile is then invoked: compile () - ipa_passes () -> all_small_ipa_passes + ipa_passes () -> all_small_ipa_passes, + all_local_opt_passes -> Analysis of all_regular_ipa_passes - * possible LTO streaming at copmilation time * + * possible LTO streaming at compilation time * -> Execution of all_regular_ipa_passes * possible LTO streaming at link time * -> all_late_ipa_passes @@ -1541,8 +1549,8 @@ pass_manager::operator delete (void *ptr } pass_manager::pass_manager (context *ctxt) -: all_passes (NULL), all_small_ipa_passes (NULL), all_lowering_passes (NULL), - all_regular_ipa_passes (NULL), +: all_passes (NULL), all_small_ipa_passes (NULL), all_local_opt_passes (NULL), + all_lowering_passes (NULL), all_regular_ipa_passes (NULL), all_late_ipa_passes (NULL), passes_by_id (NULL), passes_by_id_size (0), m_ctxt (ctxt) { @@ -1592,6 +1600,7 @@ pass_manager::pass_manager (context *ctx /* Register the passes with the tree dump code. */ register_dump_files (all_lowering_passes); register_dump_files (all_small_ipa_passes); + register_dump_files (all_local_opt_passes); register_dump_files (all_regular_ipa_passes); register_dump_files (all_late_ipa_passes); register_dump_files (all_passes); @@ -2463,24 +2472,15 @@ ipa_write_summaries_1 (lto_symtab_encode lto_delete_out_decl_state (state); } -/* Write out summaries for all the nodes in the callgraph. */ - -void -ipa_write_summaries (void) +static lto_symtab_encoder_t +build_symtab_encoder (void) { - lto_symtab_encoder_t encoder; + lto_symtab_encoder_t encoder = lto_symtab_encoder_new (false); int i, order_pos; varpool_node *vnode; struct cgraph_node *node; struct cgraph_node **order; - if ((!flag_generate_lto && !flag_generate_offload) || seen_error ()) - return; - - select_what_to_stream (); - - encoder = lto_symtab_encoder_new (false); - /* Create the callgraph set in the same order used in cgraph_expand_all_functions. This mostly facilitates debugging, since it causes the gimple file to be processed in the same order @@ -2515,10 +2515,50 @@ ipa_write_summaries (void) FOR_EACH_DEFINED_VARIABLE (vnode) if (vnode->need_lto_streaming) lto_set_symtab_encoder_in_partition (encoder, vnode); + free (order); + return encoder; +} +/* Write out summaries for all the nodes in the callgraph. 
*/ + +void +ipa_write_summaries (void) +{ + if ((!flag_generate_lto && !flag_generate_offload) || seen_error ()) + return; + + select_what_to_stream (); + lto_symtab_encoder_t encoder = build_symtab_encoder (); ipa_write_summaries_1 (compute_ltrans_boundary (encoder)); +} - free (order); +void +write_offload_lto (void) +{ + if (!flag_generate_offload || seen_error ()) + return; + + lto_stream_offload_p = true; + + select_what_to_stream (); + lto_symtab_encoder_t encoder = build_symtab_encoder (); + encoder = compute_ltrans_boundary (encoder); + + struct lto_out_decl_state *state = lto_new_out_decl_state (); + state->symtab_node_encoder = encoder; + + lto_output_init_mode_table (); + lto_push_out_decl_state (state); + + gcc_assert (!flag_wpa); + + write_lto (); + + gcc_assert (lto_get_out_decl_state () == state); + lto_pop_out_decl_state (); + lto_delete_out_decl_state (state); + + lto_stream_offload_p = false; } /* Same as execute_pass_list but assume that subpasses of IPA passes Index: gcc/passes.def =================================================================== --- gcc/passes.def (revision 224547) +++ gcc/passes.def (working copy) @@ -60,6 +60,10 @@ along with GCC; see the file COPYING3. NEXT_PASS (pass_early_warn_uninitialized); NEXT_PASS (pass_nothrow); POP_INSERT_PASSES () + TERMINATE_PASS_LIST () + + /* Local optimization passes. */ + INSERT_PASSES_AFTER (all_local_opt_passes) NEXT_PASS (pass_chkp_instrumentation_passes); PUSH_INSERT_PASSES_WITHIN (pass_chkp_instrumentation_passes) @@ -70,6 +74,7 @@ along with GCC; see the file COPYING3. NEXT_PASS (pass_local_optimization_passes); PUSH_INSERT_PASSES_WITHIN (pass_local_optimization_passes) + NEXT_PASS (pass_expand_omp_ssa); NEXT_PASS (pass_fixup_cfg); NEXT_PASS (pass_rebuild_cgraph_edges); NEXT_PASS (pass_inline_parameters); Index: gcc/ssa-iterators.h =================================================================== --- gcc/ssa-iterators.h (revision 224547) +++ gcc/ssa-iterators.h (working copy) @@ -609,17 +609,21 @@ op_iter_init (ssa_op_iter *ptr, gimple s { switch (gimple_code (stmt)) { - case GIMPLE_ASSIGN: - case GIMPLE_CALL: - ptr->numops = 1; - break; - case GIMPLE_ASM: - ptr->numops = gimple_asm_noutputs (as_a <gasm *> (stmt)); - break; - default: - ptr->numops = 0; - flags &= ~(SSA_OP_DEF | SSA_OP_VDEF); - break; + case GIMPLE_ASSIGN: + case GIMPLE_CALL: + case GIMPLE_OMP_CONTINUE: + ptr->numops = 1; + break; + case GIMPLE_ASM: + ptr->numops = gimple_asm_noutputs (as_a <gasm *> (stmt)); + break; + case GIMPLE_OMP_FOR: + ptr->numops = gimple_omp_for_collapse (stmt); + break; + default: + ptr->numops = 0; + flags &= ~(SSA_OP_DEF | SSA_OP_VDEF); + break; } } ptr->uses = (flags & (SSA_OP_USE|SSA_OP_VUSE)) ? gimple_use_ops (stmt) : NULL; Index: gcc/tree-cfg.c =================================================================== --- gcc/tree-cfg.c (revision 224547) +++ gcc/tree-cfg.c (working copy) @@ -6649,6 +6649,7 @@ move_stmt_r (gimple_stmt_iterator *gsi_p case GIMPLE_OMP_RETURN: case GIMPLE_OMP_CONTINUE: + case GIMPLE_OMP_ENTRY_END: break; default: if (is_gimple_omp (stmt)) @@ -6659,7 +6660,7 @@ move_stmt_r (gimple_stmt_iterator *gsi_p function. 
*/ bool save_remap_decls_p = p->remap_decls_p; p->remap_decls_p = false; - *handled_ops_p = true; + // *handled_ops_p = true; walk_gimple_seq_mod (gimple_omp_body_ptr (stmt), move_stmt_r, move_stmt_op, wi); Index: gcc/tree-into-ssa.c =================================================================== --- gcc/tree-into-ssa.c (revision 224547) +++ gcc/tree-into-ssa.c (working copy) @@ -2442,6 +2442,7 @@ pass_build_ssa::execute (function *fun) SET_SSA_NAME_VAR_OR_IDENTIFIER (name, DECL_NAME (decl)); } + verify_ssa (false, true); return 0; } Index: gcc/tree-nested.c =================================================================== --- gcc/tree-nested.c (revision 224547) +++ gcc/tree-nested.c (working copy) @@ -673,14 +673,8 @@ walk_gimple_omp_for (gomp_for *for_stmt, wi.is_lhs = false; walk_tree (gimple_omp_for_final_ptr (for_stmt, i), callback_op, &wi, NULL); - - t = gimple_omp_for_incr (for_stmt, i); - gcc_assert (BINARY_CLASS_P (t)); - wi.val_only = false; - walk_tree (&TREE_OPERAND (t, 0), callback_op, &wi, NULL); - wi.val_only = true; - wi.is_lhs = false; - walk_tree (&TREE_OPERAND (t, 1), callback_op, &wi, NULL); + walk_tree (gimple_omp_for_incr_ptr (for_stmt, i), callback_op, + &wi, NULL); } seq = gsi_seq (wi.gsi); Index: gcc/tree-ssa-operands.c =================================================================== --- gcc/tree-ssa-operands.c (revision 224547) +++ gcc/tree-ssa-operands.c (working copy) @@ -942,11 +942,18 @@ parse_ssa_operands (struct function *fn, append_vuse (gimple_vop (fn)); goto do_default; + case GIMPLE_OMP_FOR: + start = gimple_omp_for_collapse (stmt); + for (i = 0; i < start; i++) + get_expr_operands (fn, stmt, gimple_op_ptr (stmt, i), opf_def); + goto do_default; + case GIMPLE_CALL: /* Add call-clobbered operands, if needed. */ maybe_add_call_vops (fn, as_a <gcall *> (stmt)); /* FALLTHRU */ + case GIMPLE_OMP_CONTINUE: case GIMPLE_ASSIGN: get_expr_operands (fn, stmt, gimple_op_ptr (stmt, 0), opf_def); start = 1; ^ permalink raw reply [flat|nested] 23+ messages in thread
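To make the operand-layout change in the gcc/gimple.h hunks concrete: after this patch, a GIMPLE_OMP_FOR with collapse count C stores its per-iteration trees in the generic operand vector rather than in the on-the-side iter array (which now keeps only the two tree codes). A minimal sketch of how the patch's accessors map onto the operand vector; walk_omp_for_operands is a hypothetical helper for illustration only:

    /* Operand layout of a gomp_for with collapse count C:
         op[0  .. C-1 ]  index variables
         op[C  .. 2C-1]  initial values
         op[2C .. 3C-1]  final values
         op[3C .. 4C-1]  increments
       The condition and increment tree codes remain in iter[i].  */

    static void
    walk_omp_for_operands (gomp_for *gs)
    {
      size_t c = gimple_omp_for_collapse (gs);
      for (size_t i = 0; i < c; i++)
        {
          tree index   = gimple_omp_for_index (gs, i);    /* gimple_op (gs, i)         */
          tree initial = gimple_omp_for_initial (gs, i);  /* gimple_op (gs, i + c)     */
          tree final   = gimple_omp_for_final (gs, i);    /* gimple_op (gs, i + c * 2) */
          tree incr    = gimple_omp_for_incr (gs, i);     /* gimple_op (gs, i + c * 3) */
          /* ... process the trees ...  */
        }
    }

This is also why gimple_build_omp_for now allocates collapse * 4 operands, and why the LTO streamers above read and write exactly four trees plus two tree codes per collapse level.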
* Re: [gomp4] Preserve NVPTX "reconvergence" points
  2015-06-19 10:44 ` Bernd Schmidt
@ 2015-06-19 12:32 ` Jakub Jelinek
  2015-06-19 13:07   ` Bernd Schmidt
  2015-06-22 14:00   ` Julian Brown
  0 siblings, 2 replies; 23+ messages in thread
From: Jakub Jelinek @ 2015-06-19 12:32 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Thomas Schwinge, gcc-patches, Nathan Sidwell, Julian Brown

On Fri, Jun 19, 2015 at 11:53:14AM +0200, Bernd Schmidt wrote:
> On 05/28/2015 05:08 PM, Jakub Jelinek wrote:
> 
> >I understand it is more work, I'd just like to ask that when designing stuff
> >for the OpenACC offloading you (plural) try to take the other offloading
> >devices and host fallback into account.
> 
> The problem is that many of the transformations we need to do are really GPU
> specific, and with the current structure of omplow/ompexp they are being
> done in the host compiler. The offloading scheme we decided on does not give
> us the means to write out multiple versions of an offloaded function where
> each target gets a different one. For that reason I think we should postpone
> these lowering decisions until we're in the accel compiler, where they could
> be controlled by target hooks, and over the last two weeks I've been doing
> some experiments to see how that could be achieved.

Emitting PTX specific code from current ompexp is highly undesirable of
course, but I must say I'm not a big fan of keeping the GOMP_* gimple trees
around for too long either, they were never meant to be used in low gimple,
and even all the early optimization passes could screw them up badly, they
are also very much OpenMP or OpenACC specific, rather than representing
language neutral behavior, so there is a problem that you'd need M x N
different expansions of those constructs, which is not really maintainable
(M being the number of supported offloading standards, right now 2, and N
the number of different offloading devices (host, XeonPhi, PTX, HSA, ...)).

I wonder why struct loop flags and other info together with function
attributes and/or cgraph flags and other info aren't sufficient for the
OpenACC needs.
Have you or Thomas looked at what we're doing for OpenMP simd / Cilk+ simd?

Why can't the execution model (normal, vector-single and worker-single)
be simply attributes on functions or cgraph node flags and the kind of
#acc loop simply be flags on struct loop, like OpenMP simd / Cilk+ simd
already is?

I mean, you need to implement the PTX broadcasting etc. for the 3 different
modes: one where each thread executes everything; another where only the
first thread in a warp executes everything, and the other threads only call
functions with the same mode or execute specially marked loops; and another
where only a single thread in the CTA executes everything, and the other
threads only call functions with the same mode or execute specially marked
loops.  Because if you have #acc routine (something) ... that is just an
attribute of a function, not really some construct in the body of it.

The vector level parallelism is something where on the host/host_noshm/XeonPhi
(dunno about HSA) you want vectorization to happen, and that is already
implemented in the vectorizer pass, implementing it again elsewhere is
highly undesirable.  For PTX the implementation is of course different,
and the vectorizer is likely not the right pass to handle them, but why
can't the same struct loop flags be used by the pass that handles the
conditionalization of execution for 2 of the 3 above modes?
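For reference, this is roughly what the simd path already records on struct
loop during OpenMP lowering (a minimal sketch assuming the 2015-era fields;
simduid stands for the decl omp-low creates for the construct):

    struct loop *loop = ...;       /* loop created for the simd construct */
    loop->safelen = INT_MAX;       /* no loop-carried dependences to worry about */
    loop->force_vectorize = true;  /* vectorize even against the cost model */
    loop->simduid = simduid;       /* ties the loop to its "omp simd array" accesses */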
Then there is the worker level parallelism, but I'd hope it can be handled
similarly, and supposedly the pass that handles vector-single and
worker-single lowering for PTX could do the same for non-PTX targets - if
the OpenACC execution model is that all the (e.g. pthread based) threads
are started immediately, and in worker-single mode the work is skipped on
all threads other than the first, then those targets need to behave
similarly to PTX, just probably using library calls rather than PTX
builtins to query the thread number.

	Jakub

^ permalink raw reply	[flat|nested] 23+ messages in thread
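A hypothetical sketch of the worker-single lowering just described, shared
between target kinds; GOACC_thread_num is an illustrative stand-in, not an
existing builtin or library entry point:

    /* Worker-single region; same shape on PTX and on a pthreads host,
       only the thread-number query differs.  */
    if (GOACC_thread_num () == 0)   /* PTX: a %tid read; host: a library call */
      {
        /* statements executed by the first thread only */
      }
    /* reconvergence point: all threads continue here */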
* Re: [gomp4] Preserve NVPTX "reconvergence" points
  2015-06-19 12:32 ` Jakub Jelinek
@ 2015-06-19 13:07   ` Bernd Schmidt
  2015-06-19 14:10     ` Jakub Jelinek
  0 siblings, 1 reply; 23+ messages in thread
From: Bernd Schmidt @ 2015-06-19 13:07 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Thomas Schwinge, gcc-patches, Nathan Sidwell, Julian Brown

On 06/19/2015 02:25 PM, Jakub Jelinek wrote:
> Emitting PTX specific code from current ompexp is highly undesirable of
> course, but I must say I'm not a big fan of keeping the GOMP_* gimple trees
> around for too long either, they were never meant to be used in low gimple,
> and even all the early optimization passes could screw them up badly,

The idea is not to keep them around for very long, but I think there's no
reason why they couldn't survive a while longer. Between ompexpand and the
end of build_ssa_passes, we have (ignoring things like chkp and ubsan which
can just be turned off for offloaded functions if necessary):

      NEXT_PASS (pass_ipa_free_lang_data);
      NEXT_PASS (pass_ipa_function_and_variable_visibility);
      NEXT_PASS (pass_fixup_cfg);
      NEXT_PASS (pass_init_datastructures);
      NEXT_PASS (pass_build_ssa);
      NEXT_PASS (pass_early_warn_uninitialized);
      NEXT_PASS (pass_nothrow);

Nothing in there strikes me as particularly problematic if we can make
things like GIMPLE_OMP_FOR survive into-ssa - which I think I did in my
patch. Besides, the OpenACC kernels path generates them in SSA form anyway
during parloops, so one could make the argument that this is a step towards
better consistency.

> they are also very much OpenMP or OpenACC specific, rather than representing
> language neutral behavior, so there is a problem that you'd need M x N
> different expansions of those constructs, which is not really maintainable
> (M being the number of supported offloading standards, right now 2, and N
> the number of different offloading devices (host, XeonPhi, PTX, HSA, ...)).

Well, that's a problem we have anyway, independently of how we implement all
these devices and standards. I don't see how that's relevant to the
discussion.

> I wonder why struct loop flags and other info together with function
> attributes and/or cgraph flags and other info aren't sufficient for the
> OpenACC needs.
> Have you or Thomas looked at what we're doing for OpenMP simd / Cilk+ simd?
> 
> Why can't the execution model (normal, vector-single and worker-single)
> be simply attributes on functions or cgraph node flags and the kind of
> #acc loop simply be flags on struct loop, like OpenMP simd / Cilk+ simd
> already is?

We haven't looked at Cilk+ or anything like that. You suggest using
attributes and flags, but at what point do you intend to actually lower the
IR to represent what's going on?

> The vector level parallelism is something where on the host/host_noshm/XeonPhi
> (dunno about HSA) you want vectorization to happen, and that is already
> implemented in the vectorizer pass, implementing it again elsewhere is
> highly undesirable.  For PTX the implementation is of course different,
> and the vectorizer is likely not the right pass to handle them, but why
> can't the same struct loop flags be used by the pass that handles the
> conditionalization of execution for 2 of the 3 above modes?

Agreed on wanting the vectorizer to handle things for "normal" machines,
that is one of the motivations for pushing the lowering past the offload LTO
writeout stage.
The problem with OpenACC on GPUs is that the predication really changes the CFG and the data flow - I fear unpredictable effects if we let any optimizers run before lowering OpenACC to the point where we actually represent what's going on in the function. Bernd ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [gomp4] Preserve NVPTX "reconvergence" points
  2015-06-19 13:07 ` Bernd Schmidt
@ 2015-06-19 14:10   ` Jakub Jelinek
  2015-06-22 14:04     ` Bernd Schmidt
  2015-06-24 13:37     ` Bernd Schmidt
  0 siblings, 2 replies; 23+ messages in thread
From: Jakub Jelinek @ 2015-06-19 14:10 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Thomas Schwinge, gcc-patches, Nathan Sidwell, Julian Brown

On Fri, Jun 19, 2015 at 03:03:38PM +0200, Bernd Schmidt wrote:
> >they are also very much OpenMP or OpenACC specific, rather than representing
> >language neutral behavior, so there is a problem that you'd need M x N
> >different expansions of those constructs, which is not really maintainable
> >(M being the number of supported offloading standards, right now 2, and N
> >the number of different offloading devices (host, XeonPhi, PTX, HSA, ...)).
> 
> Well, that's a problem we have anyway, independently of how we implement all
> these devices and standards. I don't see how that's relevant to the
> discussion.

It is relevant, because if you lower early (omplower/ompexp) into some IL
form common to all the offloading standards, then it is M + N.

> >I wonder why struct loop flags and other info together with function
> >attributes and/or cgraph flags and other info aren't sufficient for the
> >OpenACC needs.
> >Have you or Thomas looked at what we're doing for OpenMP simd / Cilk+ simd?
> 
> >Why can't the execution model (normal, vector-single and worker-single)
> >be simply attributes on functions or cgraph node flags and the kind of
> >#acc loop simply be flags on struct loop, like OpenMP simd / Cilk+ simd
> >already is?
> 
> We haven't looked at Cilk+ or anything like that. You suggest using
> attributes and flags, but at what point do you intend to actually lower the
> IR to represent what's going on?

I think around where the vectorizer is, perhaps before the loop
optimization pass queue (or after it, some investigation is needed).

> >The vector level parallelism is something where on the host/host_noshm/XeonPhi
> >(dunno about HSA) you want vectorization to happen, and that is already
> >implemented in the vectorizer pass, implementing it again elsewhere is
> >highly undesirable.  For PTX the implementation is of course different,
> >and the vectorizer is likely not the right pass to handle them, but why
> >can't the same struct loop flags be used by the pass that handles the
> >conditionalization of execution for 2 of the 3 above modes?
> 
> Agreed on wanting the vectorizer to handle things for "normal" machines,
> that is one of the motivations for pushing the lowering past the offload LTO
> writeout stage. The problem with OpenACC on GPUs is that the predication
> really changes the CFG and the data flow - I fear unpredictable effects if
> we let any optimizers run before lowering OpenACC to the point where we
> actually represent what's going on in the function.

I actually believe having some optimization passes in between the ompexp
and the lowering of the IR into the form PTX wants is highly desirable;
the form with the worker-single or vector-single mode lowered will contain
too complex a CFG for many optimizations to be really effective, especially
if it uses abnormal edges.  E.g. inlining supposedly would have a harder
job etc.  What exact unpredictable effects do you fear?  If the loop
remains in the IL (isn't optimized away as unreachable or isn't removed,
e.g. as a non-loop - say if it contains a noreturn call), the flags on
struct loop should still be there.
For the loop clauses (reduction always, and private/lastprivate if addressable etc.) for OpenMP simd / Cilk+ simd we use special arrays indexed by internal functions, which then during vectorization are shrunk (but in theory could be expanded too) to the right vectorization factor if vectorized, of course accesses within the loop vectorized using SIMD, and if not vectorized, shrunk to 1 element. So the PTX IL lowering pass could use the same arrays ("omp simd array" attribute) to transform the decls into thread local vars as opposed to vars shared by the whole CTA. Jakub ^ permalink raw reply [flat|nested] 23+ messages in thread
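To make the "omp simd array" mechanism concrete, here is a rough sketch of the lowered form -- not the exact GIMPLE GCC emits, and the internal-function operands are simplified. A loop such as

  #pragma omp simd lastprivate(x)
  for (i = 0; i < n; i++)
    x = a[i];

is lowered along the lines of

  int D.x[MAX_VF];   /* decl carries the "omp simd array" attribute */
  for (i = 0; i < n; i++)
    {
      _lane = .GOMP_SIMD_LANE (simduid);   /* internal function */
      D.x[_lane] = a[i];
    }
  x = D.x[.GOMP_SIMD_LAST_LANE (simduid, ...)];

If the loop is vectorized, D.x is shrunk to the vectorization factor and _lane becomes the SIMD lane; if not, it is shrunk to a single element and the accesses collapse back to a scalar.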
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-06-19 14:10 ` Jakub Jelinek @ 2015-06-22 14:04 ` Bernd Schmidt 2015-06-22 14:25 ` Jakub Jelinek 2015-06-24 13:37 ` Bernd Schmidt 1 sibling, 1 reply; 23+ messages in thread From: Bernd Schmidt @ 2015-06-22 14:04 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Thomas Schwinge, gcc-patches, Nathan Sidwell, Julian Brown On 06/19/2015 03:45 PM, Jakub Jelinek wrote: > I actually believe having some optimization passes in between the ompexp > and the lowering of the IR into the form PTX wants is highly desirable, > the form with the worker-single or vector-single mode lowered will contain > too complex CFG for many optimizations to be really effective, especially > if it uses abnormal edges. E.g. inlining supposedly would have harder job > etc. What exact unpredictable effects do you fear? Mostly the ones I can't predict. But let's take one example, LICM: let's say you pull some assignment out of a loop, then you find yourself in one of two possible situations: either it's become not actually available inside the loop (because the data and control flow is not described correctly and the compiler doesn't know what's going on), or, to avoid that, you introduce additional broadcasting operations when entering the loop, which might be quite expensive. Bernd ^ permalink raw reply [flat|nested] 23+ messages in thread
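A made-up sketch of the LICM hazard described here, reusing the broadcast pseudo-operation from the start of the thread (illustrative only):

  /* Before LICM: t0 is computed under the single-thread predicate and
     broadcast to the rest of the warp on each iteration.  */
  for (i = 0; i < n; i++)
    {
      if (threadIdx.x == 0)
        t0 = invariant_value;   /* loop-invariant */
      t = broadcast (t0);
      a[i] += t;
    }

  /* After hoisting the assignment: either the other threads of the warp
     read a t0 they never computed, or the compiler has to insert a
     preheader broadcast it did not account for.  */
  if (threadIdx.x == 0)
    t0 = invariant_value;
  t = broadcast (t0);           /* extra operation, needed for correctness */
  for (i = 0; i < n; i++)
    a[i] += t;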
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-06-22 14:04 ` Bernd Schmidt @ 2015-06-22 14:25 ` Jakub Jelinek 0 siblings, 0 replies; 23+ messages in thread From: Jakub Jelinek @ 2015-06-22 14:25 UTC (permalink / raw) To: Bernd Schmidt; +Cc: Thomas Schwinge, gcc-patches, Nathan Sidwell, Julian Brown On Mon, Jun 22, 2015 at 03:59:57PM +0200, Bernd Schmidt wrote: > On 06/19/2015 03:45 PM, Jakub Jelinek wrote: > >I actually believe having some optimization passes in between the ompexp > >and the lowering of the IR into the form PTX wants is highly desirable, > >the form with the worker-single or vector-single mode lowered will contain > >too complex CFG for many optimizations to be really effective, especially > >if it uses abnormal edges. E.g. inlining supposedly would have harder job > >etc. What exact unpredictable effects do you fear? > > Mostly the ones I can't predict. But let's take one example, LICM: let's say > you pull some assignment out of a loop, then you find yourself in one of two > possible situations: either it's become not actually available inside the > loop (because the data and control flow is not described correctly and the > compiler doesn't know what's going on), or, to avoid that, you introduce Why do you think that would happen? E.g. for non-addressable gimple types you'd most likely just have a PHI for it on the loop. > additional broadcasting operations when entering the loop, which might be > quite expensive. If the PHI has cheap initialization, it is not a problem to emit that initialization in the loop instead of a broadcast (kind of like RA rematerialization). And by actually adding such an optimization, you help even code that has a computation in vector-single code and uses it in a vector acc loop. Jakub ^ permalink raw reply [flat|nested] 23+ messages in thread
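A minimal sketch of the rematerialization idea suggested above: if the value feeding the broadcast has a cheap, thread-independent initialization, then instead of

  if (threadIdx.x == 0)
    c0 = 16;
  c = broadcast (c0);

each thread can simply recompute it locally,

  c = 16;

much as the register allocator rematerializes cheap constants rather than reloading them from a spill slot.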
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-06-19 14:10 ` Jakub Jelinek 2015-06-22 14:04 ` Bernd Schmidt @ 2015-06-24 13:37 ` Bernd Schmidt 2015-06-24 14:08 ` Jakub Jelinek 1 sibling, 1 reply; 23+ messages in thread From: Bernd Schmidt @ 2015-06-24 13:37 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Thomas Schwinge, gcc-patches, Nathan Sidwell, Julian Brown On 06/19/2015 03:45 PM, Jakub Jelinek wrote: > If the loop remains in the IL (isn't optimized away as unreachable or > isn't removed, e.g. as a non-loop - say if it contains a noreturn call), > the flags on struct loop should be still there. For the loop clauses > (reduction always, and private/lastprivate if addressable etc.) for > OpenMP simd / Cilk+ simd we use special arrays indexed by internal > functions, which then during vectorization are shrunk (but in theory could > be expanded too) to the right vectorization factor if vectorized, of course > accesses within the loop vectorized using SIMD, and if not vectorized, > shrunk to 1 element. I'd appreciate it if you could describe that mechanism in more detail. As far as I can tell it is very poorly commented and documented in the code. I mean, it doesn't even follow the minimal coding standards of describing function inputs:

/* Helper function of lower_rec_input_clauses, used for #pragma omp simd
   privatization.  */

static bool
lower_rec_simd_input_clauses (tree new_var, omp_context *ctx, int &max_vf,
                              tree &idx, tree &lane, tree &ivar, tree &lvar)

Bernd ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-06-24 13:37 ` Bernd Schmidt @ 2015-06-24 14:08 ` Jakub Jelinek 0 siblings, 0 replies; 23+ messages in thread From: Jakub Jelinek @ 2015-06-24 14:08 UTC (permalink / raw) To: Bernd Schmidt; +Cc: Thomas Schwinge, gcc-patches, Nathan Sidwell, Julian Brown On Wed, Jun 24, 2015 at 03:11:04PM +0200, Bernd Schmidt wrote: > On 06/19/2015 03:45 PM, Jakub Jelinek wrote: > > >If the loop remains in the IL (isn't optimized away as unreachable or > >isn't removed, e.g. as a non-loop - say if it contains a noreturn call), > >the flags on struct loop should be still there. For the loop clauses > >(reduction always, and private/lastprivate if addressable etc.) for > >OpenMP simd / Cilk+ simd we use special arrays indexed by internal > >functions, which then during vectorization are shrunk (but in theory could > >be expanded too) to the right vectorization factor if vectorized, of course > >accesses within the loop vectorized using SIMD, and if not vectorized, > >shrunk to 1 element. > > I'd appreciate if you could describe that mechanism in more detail. As far > as I can tell it is very poorly commented and documented in the code. I > mean, it doesn't even follow the minimal coding standards of describing > function inputs: > > /* Helper function of lower_rec_input_clauses, used for #pragma omp simd > privatization. */ > > static bool > lower_rec_simd_input_clauses (tree new_var, omp_context *ctx, int &max_vf, > tree &idx, tree &lane, tree &ivar, tree &lvar) Here is the theory behind it: https://gcc.gnu.org/ml/gcc-patches/2013-04/msg01661.html In the end it is using internal functions instead of uglified builtins. I'd suggest you look at some of the libgomp.c/simd*.c tests, say with -O2 -mavx2 -fdump-tree-{omplower,ssa,ifcvt,vect,optimized} to see how it is lowered and expanded. I assume #pragma omp simd roughly corresponds to #pragma acc loop vector, maxvf for PTX vectorization is supposedly 32 (warp size). For SIMD vectorization, if the vectorization fails, the arrays are shrunk to 1 element, otherwise they are shrunk to the vectorization factor, and, if they aren't really addressable, they are later optimized using FRE and other memory optimizations so that they don't touch memory unless really needed. For the PTX-style vectorization (parallelization between threads in a warp), I'd say you would always shrink to 1 element again, but such variables would be local to each of the threads in the warp (or another possibility is shared arrays of size 32 indexed by %tid.x & 31), while addressable variables without such magic type would be shared among all threads; non-addressable variables (SSA_NAMEs) depending on where they are used. You'd need to transform reductions (which are right now represented as another loop, from 0 to an internal function, so easily recognizable) into the PTX reductions. Also, lastprivate is now an access to the array using the last-lane internal function; I don't know what that corresponds to in PTX (perhaps also a reduction, where all but the thread executing the last iteration OR in, say, 0, and the remaining thread ORs in the lastprivate value). Jakub ^ permalink raw reply [flat|nested] 23+ messages in thread
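A sketch of the reduction representation mentioned here, in simplified form (the exact internal-function operands are elided):

  /* #pragma omp simd reduction(+:sum)  */
  int D.sum[MAX_VF] = { 0, ... };              /* "omp simd array" */
  for (i = 0; i < n; i++)
    D.sum[.GOMP_SIMD_LANE (simduid)] += a[i];
  /* The reduction epilogue is itself a loop bounded by an internal
     function, which is what makes it easily recognizable:  */
  for (k = 0; k < .GOMP_SIMD_VF (simduid); k++)
    sum += D.sum[k];

After vectorization the bound becomes the real vectorization factor; if the loop is not vectorized it becomes 1. A PTX lowering could instead map D.sum onto per-thread storage and replace the epilogue with a warp-level reduction.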
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-06-19 12:32 ` Jakub Jelinek 2015-06-19 13:07 ` Bernd Schmidt @ 2015-06-22 14:00 ` Julian Brown 2015-06-22 14:36 ` Jakub Jelinek 1 sibling, 1 reply; 23+ messages in thread From: Julian Brown @ 2015-06-22 14:00 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Bernd Schmidt, Thomas Schwinge, gcc-patches, Nathan Sidwell On Fri, 19 Jun 2015 14:25:57 +0200 Jakub Jelinek <jakub@redhat.com> wrote: > On Fri, Jun 19, 2015 at 11:53:14AM +0200, Bernd Schmidt wrote: > > On 05/28/2015 05:08 PM, Jakub Jelinek wrote: > > > > >I understand it is more work, I'd just like to ask that when > > >designing stuff for the OpenACC offloading you (plural) try to > > >take the other offloading devices and host fallback into account. > > > > The problem is that many of the transformations we need to do are > > really GPU specific, and with the current structure of > > omplow/ompexp they are being done in the host compiler. The > > offloading scheme we decided on does not give us the means to write > > out multiple versions of an offloaded function where each target > > gets a different one. For that reason I think we should postpone > > these lowering decisions until we're in the accel compiler, where > > they could be controlled by target hooks, and over the last two > > weeks I've been doing some experiments to see how that could be > > achieved. > I wonder why struct loop flags and other info together with function > attributes and/or cgraph flags and other info aren't sufficient for > the OpenACC needs. > Have you or Thomas looked what we're doing for OpenMP simd / Cilk+ > simd? > > Why can't the execution model (normal, vector-single and > worker-single) be simply attributes on functions or cgraph node flags > and the kind of #acc loop simply be flags on struct loop, like > already OpenMP simd / Cilk+ simd is? One problem is that (at least on the GPU hardware we've considered so far) we're somewhat constrained in how much control we have over how the underlying hardware executes code: it's possible to draw up a scheme where OpenACC source-level control-flow semantics are reflected directly in the PTX assembly output (e.g. to say "all threads in a CTA/warp will be coherent after such-and-such a loop"), and lowering OpenACC directives quite early seems to make that relatively tractable. (Even if the resulting code is relatively un-optimisable due to the abnormal edges inserted to make sure that the CFG doesn't become "ill-formed".) If arbitrary optimisations are done between OMP-lowering time and somewhere around vectorisation (say), it's less clear if that correspondence can be maintained. Say if the code executed by half the threads in a warp becomes physically separated from the code executed by the other half of the threads in a warp due to some loop optimisation, we can no longer easily determine where that warp will reconverge, and certain other operations (relying on coherent warps -- e.g. CTA synchronisation) become impossible. A similar issue exists for warps within a CTA. So, essentially -- I don't know how "late" loop lowering would interact with: (a) Maintaining a CFG that will work with PTX. (b) Predication for worker-single and/or vector-single modes (actually all currently-proposed schemes have problems with proper representation of data-dependencies for variables and compiler-generated temporaries between predicated regions.) Julian ^ permalink raw reply [flat|nested] 23+ messages in thread
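A contrived sketch of the divergence worry, with tid standing in for threadIdx.x: in a vector-partitioned loop such as

  for (i = tid; i < n; i += 32)
    {
      if (i & 1)
        a[i] = f (i);
      else
        b[i] = g (i);
    }
  /* unique reconvergence point; a safe place for bar.sync */

the condition (i & 1) is invariant per thread (i steps by the warp size), so a generic optimizer is entitled to unswitch the loop:

  if (tid & 1)
    for (i = tid; i < n; i += 32)
      a[i] = f (i);
  else
    for (i = tid; i < n; i += 32)
      b[i] = g (i);

after which the two halves of the warp execute physically separate loops and there is no longer a single point at which the warp is known to reconverge.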
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-06-22 14:00 ` Julian Brown @ 2015-06-22 14:36 ` Jakub Jelinek 2015-06-22 15:18 ` Julian Brown ` (2 more replies) 0 siblings, 3 replies; 23+ messages in thread From: Jakub Jelinek @ 2015-06-22 14:36 UTC (permalink / raw) To: Julian Brown; +Cc: Bernd Schmidt, Thomas Schwinge, gcc-patches, Nathan Sidwell On Mon, Jun 22, 2015 at 02:55:49PM +0100, Julian Brown wrote: > One problem is that (at least on the GPU hardware we've considered so > far) we're somewhat constrained in how much control we have over how the > underlying hardware executes code: it's possible to draw up a scheme > where OpenACC source-level control-flow semantics are reflected directly > in the PTX assembly output (e.g. to say "all threads in a CTA/warp will > be coherent after such-and-such a loop"), and lowering OpenACC > directives quite early seems to make that relatively tractable. (Even > if the resulting code is relatively un-optimisable due to the abnormal > edges inserted to make sure that the CFG doesn't become "ill-formed".) > > If arbitrary optimisations are done between OMP-lowering time and > somewhere around vectorisation (say), it's less clear if that > correspondence can be maintained. Say if the code executed by half the > threads in a warp becomes physically separated from the code executed > by the other half of the threads in a warp due to some loop > optimisation, we can no longer easily determine where that warp will > reconverge, and certain other operations (relying on coherent warps -- > e.g. CTA synchronisation) become impossible. A similar issue exists for > warps within a CTA. > > So, essentially -- I don't know how "late" loop lowering would interact > with: > > (a) Maintaining a CFG that will work with PTX. > > (b) Predication for worker-single and/or vector-single modes > (actually all currently-proposed schemes have problems with proper > representation of data-dependencies for variables and > compiler-generated temporaries between predicated regions.) I don't understand why lowering the way you suggest helps here at all. In the proposed scheme, you essentially have whole function in e.g. worker-single or vector-single mode, which you need to be able to handle properly in any case, because users can write such routines themselves. And then you can have a loop in such a function that has some special attribute, a hint that it is desirable to vectorize it (for PTX the PTX way) or use vector-single mode for it in a worker-single function. So, the special pass then of course needs to handle all the needed broadcasting and reduction required to change the mode from e.g. worker-single to vector-single, but the convergence points still would be either on the boundary of such loops to be vectorized or parallelized, or wherever else they appear in normal vector-single or worker-single functions (around the calls to certainly calls?). Jakub ^ permalink raw reply [flat|nested] 23+ messages in thread
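For orientation, the per-loop metadata referred to here lives in struct loop in gcc/cfgloop.h; the fields below are paraphrased rather than quoted exactly, for roughly this era of GCC:

  struct loop
  {
    ...
    int safelen;           /* no dependence distance smaller than this
                              (e.g. from the simd safelen clause) */
    bool force_vectorize;  /* the user asserted the loop can and should
                              be vectorized (OpenMP simd / Cilk+ simd) */
    tree simduid;          /* ties "omp simd array" accesses to the loop */
    ...
  };

The suggestion is that the different kinds of "#pragma acc loop" could be carried the same way: flags on struct loop plus attributes or cgraph flags on the containing function.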
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-06-22 14:36 ` Jakub Jelinek @ 2015-06-22 15:18 ` Julian Brown 2015-06-22 15:33 ` Bernd Schmidt 2015-06-22 17:54 ` Julian Brown 2 siblings, 0 replies; 23+ messages in thread From: Julian Brown @ 2015-06-22 15:18 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Bernd Schmidt, Thomas Schwinge, gcc-patches, Nathan Sidwell On Mon, 22 Jun 2015 16:24:56 +0200 Jakub Jelinek <jakub@redhat.com> wrote: > On Mon, Jun 22, 2015 at 02:55:49PM +0100, Julian Brown wrote: > > One problem is that (at least on the GPU hardware we've considered > > so far) we're somewhat constrained in how much control we have over > > how the underlying hardware executes code: it's possible to draw up > > a scheme where OpenACC source-level control-flow semantics are > > reflected directly in the PTX assembly output (e.g. to say "all > > threads in a CTA/warp will be coherent after such-and-such a > > loop"), and lowering OpenACC directives quite early seems to make > > that relatively tractable. (Even if the resulting code is > > relatively un-optimisable due to the abnormal edges inserted to > > make sure that the CFG doesn't become "ill-formed".) > > > > If arbitrary optimisations are done between OMP-lowering time and > > somewhere around vectorisation (say), it's less clear if that > > correspondence can be maintained. Say if the code executed by half > > the threads in a warp becomes physically separated from the code > > executed by the other half of the threads in a warp due to some loop > > optimisation, we can no longer easily determine where that warp will > > reconverge, and certain other operations (relying on coherent warps > > -- e.g. CTA synchronisation) become impossible. A similar issue > > exists for warps within a CTA. > > > > So, essentially -- I don't know how "late" loop lowering would > > interact with: > > > > (a) Maintaining a CFG that will work with PTX. > > > > (b) Predication for worker-single and/or vector-single modes > > (actually all currently-proposed schemes have problems with proper > > representation of data-dependencies for variables and > > compiler-generated temporaries between predicated regions.) > > I don't understand why lowering the way you suggest helps here at all. > In the proposed scheme, you essentially have whole function > in e.g. worker-single or vector-single mode, which you need to be > able to handle properly in any case, because users can write such > routines themselves. And then you can have a loop in such a function > that has some special attribute, a hint that it is desirable to > vectorize it (for PTX the PTX way) or use vector-single mode for it > in a worker-single function. So, the special pass then of course > needs to handle all the needed broadcasting and reduction required to > change the mode from e.g. worker-single to vector-single, but the > convergence points still would be either on the boundary of such > loops to be vectorized or parallelized, or wherever else they appear > in normal vector-single or worker-single functions (around the calls > to certainly calls?). I think most of my concerns are centred around loops (with the markings you suggest) that might be split into parts: if that cannot happen for loops that are annotated as you describe, maybe things will work out OK. (Apologies for my ignorance here, this isn't a part of the compiler that I know anything about.) Julian ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-06-22 14:36 ` Jakub Jelinek 2015-06-22 15:18 ` Julian Brown @ 2015-06-22 15:33 ` Bernd Schmidt 2015-06-22 16:13 ` Nathan Sidwell 2015-06-22 17:54 ` Julian Brown 2 siblings, 1 reply; 23+ messages in thread From: Bernd Schmidt @ 2015-06-22 15:33 UTC (permalink / raw) To: Jakub Jelinek, Julian Brown; +Cc: Thomas Schwinge, gcc-patches, Nathan Sidwell On 06/22/2015 04:24 PM, Jakub Jelinek wrote: > I don't understand why lowering the way you suggest helps here at all. > In the proposed scheme, you essentially have whole function > in e.g. worker-single or vector-single mode, which you need to be able to > handle properly in any case, because users can write such routines > themselves. And then you can have a loop in such a function that > has some special attribute, a hint that it is desirable to vectorize it > (for PTX the PTX way) or use vector-single mode for it in a worker-single > function. You can have a hint that it is desirable, but not a hint that it is correct (because passes in between may invalidate that). The OpenACC directives guarantee to the compiler that the program can be transformed into a parallel form. If we lose them early we must then rely on our analysis which may not be strong enough to prove that the loop can be parallelized. If we make these transformations early enough, while we still have the OpenACC directives, we can guarantee that we do exactly what the programmer specified. Bernd ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-06-22 15:33 ` Bernd Schmidt @ 2015-06-22 16:13 ` Nathan Sidwell 2015-06-22 16:27 ` Jakub Jelinek 0 siblings, 1 reply; 23+ messages in thread From: Nathan Sidwell @ 2015-06-22 16:13 UTC (permalink / raw) To: Bernd Schmidt, Jakub Jelinek, Julian Brown; +Cc: Thomas Schwinge, gcc-patches On 06/22/15 11:18, Bernd Schmidt wrote: > You can have a hint that it is desirable, but not a hint that it is correct > (because passes in between may invalidate that). The OpenACC directives > guarantee to the compiler that the program can be transformed into a parallel > form. If we lose them early we must then rely on our analysis which may not be > strong enough to prove that the loop can be parallelized. If we make these > transformations early enough, while we still have the OpenACC directives, we can > guarantee that we do exactly what the programmer specified. How does this differ from openmp's needs to preserve parallelism on a parallel loop? Is it more than the reconvergence issue? nathan -- Nathan Sidwell ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-06-22 16:13 ` Nathan Sidwell @ 2015-06-22 16:27 ` Jakub Jelinek 2015-06-22 16:35 ` Nathan Sidwell 0 siblings, 1 reply; 23+ messages in thread From: Jakub Jelinek @ 2015-06-22 16:27 UTC (permalink / raw) To: Nathan Sidwell; +Cc: Bernd Schmidt, Julian Brown, Thomas Schwinge, gcc-patches On Mon, Jun 22, 2015 at 12:08:36PM -0400, Nathan Sidwell wrote: > On 06/22/15 11:18, Bernd Schmidt wrote: > > >You can have a hint that it is desirable, but not a hint that it is correct > >(because passes in between may invalidate that). The OpenACC directives > >guarantee to the compiler that the program can be transformed into a parallel > >form. If we lose them early we must then rely on our analysis which may not be > >strong enough to prove that the loop can be parallelized. If we make these > >transformations early enough, while we still have the OpenACC directives, we can > >guarantee that we do exactly what the programmer specified. > > How does this differ from openmp's needs to preserve parallelism on a > parallel loop? Is it more than the reconvergence issue? OpenMP has a significantly different execution model: a parallel block in OpenMP is run by a certain number of threads (the initial thread (the one encountering that region) and then, depending on clauses and library decisions, perhaps others), with a barrier at the end of the region, and afterwards only the initial thread continues again. So, an OpenMP parallel is implemented as a library call, taking the outlined function from the parallel's body as one of its arguments, and the body is executed by the initial thread and perhaps others. OpenMP worksharing loop is just coordination between the threads in the team, which thread takes which subset of the loop's iterations, and optionally followed by a barrier. OpenMP simd loop is a loop that has certain properties guaranteed by the user and can be vectorized. In contrast to this, OpenACC spawns all the threads/CTAs upfront, and then idles on some of them until there is work for them. Jakub ^ permalink raw reply [flat|nested] 23+ messages in thread
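Concretely, the outlining described here: ompexp turns

  #pragma omp parallel
  {
    body;
  }

into roughly

  static void fn0 (void *data)
  {
    /* body, with shared variables accessed through *data */
  }
  ...
  GOMP_parallel (fn0, &shared, /* num_threads */ 0, /* flags */ 0);
  /* implicit barrier inside; only the initial thread continues here */

where GOMP_parallel is the libgomp entry point. This is a simplified sketch: how shared variables are packed into the data record, and all clause handling, are omitted.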
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-06-22 16:27 ` Jakub Jelinek @ 2015-06-22 16:35 ` Nathan Sidwell 0 siblings, 0 replies; 23+ messages in thread From: Nathan Sidwell @ 2015-06-22 16:35 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Bernd Schmidt, Julian Brown, Thomas Schwinge, gcc-patches On 06/22/15 12:20, Jakub Jelinek wrote: > OpenMP worksharing loop is just coordination between the threads in the > team, which thread takes which subset of the loop's iterations, and > optionally followed by a barrier. OpenMP simd loop is a loop that has > certain properties guaranteed by the user and can be vectorized. > In contrast to this, OpenACC spawns all the threads/CTAs upfront, and then > idles on some of them until there is work for them. Correct. I expressed my question poorly. What I mean is that in OpenMP, a loop that is parallelizable (by user decree, I guess [*]) should not be transformed such that it is not parallelizable. This seems to me to be a common requirement of both languages. How one gets parallel threads of execution to the body of the loop is a different question. nathan [*] For ones where the compiler needs to detect parallelizability, it's preferable that it doesn't do something earlier to force serializability. -- Nathan Sidwell ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-06-22 14:36 ` Jakub Jelinek @ 2015-06-22 17:54 ` Julian Brown 2015-06-22 18:48 ` Jakub Jelinek 2 siblings, 1 reply; 23+ messages in thread From: Julian Brown @ 2015-06-22 17:54 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Bernd Schmidt, Thomas Schwinge, gcc-patches, Nathan Sidwell On Mon, 22 Jun 2015 16:24:56 +0200 Jakub Jelinek <jakub@redhat.com> wrote: > On Mon, Jun 22, 2015 at 02:55:49PM +0100, Julian Brown wrote: > > One problem is that (at least on the GPU hardware we've considered > > so far) we're somewhat constrained in how much control we have over > > how the underlying hardware executes code: it's possible to draw up > > a scheme where OpenACC source-level control-flow semantics are > > reflected directly in the PTX assembly output (e.g. to say "all > > threads in a CTA/warp will be coherent after such-and-such a > > loop"), and lowering OpenACC directives quite early seems to make > > that relatively tractable. (Even if the resulting code is > > relatively un-optimisable due to the abnormal edges inserted to > > make sure that the CFG doesn't become "ill-formed".) > > > > If arbitrary optimisations are done between OMP-lowering time and > > somewhere around vectorisation (say), it's less clear if that > > correspondence can be maintained. Say if the code executed by half > > the threads in a warp becomes physically separated from the code > > executed by the other half of the threads in a warp due to some loop > > optimisation, we can no longer easily determine where that warp will > > reconverge, and certain other operations (relying on coherent warps > > -- e.g. CTA synchronisation) become impossible. A similar issue > > exists for warps within a CTA. > > > > So, essentially -- I don't know how "late" loop lowering would > > interact with: > > > > (a) Maintaining a CFG that will work with PTX. > > > > (b) Predication for worker-single and/or vector-single modes > > (actually all currently-proposed schemes have problems with proper > > representation of data-dependencies for variables and > > compiler-generated temporaries between predicated regions.) > > I don't understand why lowering the way you suggest helps here at all. > In the proposed scheme, you essentially have whole function > in e.g. worker-single or vector-single mode, which you need to be > able to handle properly in any case, because users can write such > routines themselves. In vector-single or worker-single mode, divergence of threads within a warp or a CTA is controlled by broadcasting the controlling expression of conditional branches to the set of "inactive" threads, so each of those follows along with the active thread. So you only get potentially-problematic thread divergence when workers or vectors are operating in partitioned mode. So, for instance, a made-up example:

#pragma acc parallel
{
  #pragma acc loop gang
  for (i = 0; i < N; i++)
    {
      #pragma acc loop worker
      for (j = 0; j < M; j++)
        {
          if (j < M / 2)
            /* stmt 1 */
          else
            /* stmt 2 */
        }

      /* reconvergence point: thread barrier */

      [...]
    }
}

Here "stmt 1" and "stmt 2" execute in worker-partitioned, vector-single mode. With "early lowering", the reconvergence point can be inserted at the end of the loop, and abnormal edges (etc.) can be used to ensure that the CFG does not get changed in such a way that there is no longer a unique point at which the loop threads reconverge.
With "late lowering", it's no longer obvious to me if that can still be done. Julian ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-06-22 17:54 ` Julian Brown @ 2015-06-22 18:48 ` Jakub Jelinek 0 siblings, 0 replies; 23+ messages in thread From: Jakub Jelinek @ 2015-06-22 18:48 UTC (permalink / raw) To: Julian Brown; +Cc: Bernd Schmidt, Thomas Schwinge, gcc-patches, Nathan Sidwell On Mon, Jun 22, 2015 at 06:48:10PM +0100, Julian Brown wrote: > In vector-single or worker-single mode, divergence of threads within a > warp or a CTA is controlled by broadcasting the controlling expression > of conditional branches to the set of "inactive" threads, so each of > those follows along with the active thread. So you only get > potentially-problematic thread divergence when workers or vectors are > operating in partitioned mode. > > So, for instance, a made-up example: > > #pragma acc parallel > { > #pragma acc loop gang > for (i = 0; i < N; i++)) > { > #pragma acc loop worker > for (j = 0; j < M; j++) > { > if (j < M / 2) > /* stmt 1 */ > else > /* stmt 2 */ > } > > /* reconvergence point: thread barrier */ > > [...] > } > } > > Here "stmt 1" and "stmt 2" execute in worker-partitioned, vector-single > mode. With "early lowering", the reconvergence point can be > inserted at the end of the loop, and abnormal edges (etc.) can be used > to ensure that the CFG does not get changed in such a way that there is > no longer a unique point at which the loop threads reconverge. > > With "late lowering", it's no longer obvious to me if that can still be > done. Why? The loop still has an exit edge (if there is no break/return/throw out of the loop which I bet is not allowed), so you just insert the reconvergence point at the exit edge from the loop. For the "late lowering", I said it is up for benchmarking/investigation where it would be best placed, it doesn't have to be after the loop passes, there are plenty of optimization passes even before those. But once you turn many of the SSA_NAMEs in a function into (ab) ssa vars, many optimizations just give up. And, if you really want to avoid certain loop optimizations, you have always the possibility to e.g. wrap certain statement in the loop in internal function (e.g. the loop condition) or something similar to make the passes more careful about those loops and make it easier to lower it later. Jakub ^ permalink raw reply [flat|nested] 23+ messages in thread
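A sketch of that last suggestion; the internal-function name is invented here purely for illustration, nothing like it exists in GCC. At the GIMPLE level the loop condition could be wrapped so that most passes treat it as opaque:

  <bb loop_cond>:
  _c = i_1 < n_2;
  _t = .ACC_LOOP_COND (_c);   /* hypothetical internal function */
  if (_t != 0)
    goto <bb loop_body>;
  else
    goto <bb loop_exit>;

Passes that cannot see through the call would leave the loop's shape alone, and a late OpenACC lowering pass would strip the wrapper when it emits the final form.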
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-05-28 14:20 [gomp4] Preserve NVPTX "reconvergence" points Julian Brown 2015-05-28 14:59 ` Jakub Jelinek @ 2015-05-28 15:02 ` Richard Biener 2015-06-03 11:47 ` Julian Brown 1 sibling, 1 reply; 23+ messages in thread From: Richard Biener @ 2015-05-28 15:02 UTC (permalink / raw) To: Julian Brown; +Cc: GCC Patches, Bernd Schmidt, Jakub Jelinek, Thomas Schwinge On Thu, May 28, 2015 at 4:06 PM, Julian Brown <julian@codesourcery.com> wrote: > For NVPTX, it is vitally important that the divergence of threads > within a warp can be controlled: in particular we must be able to > generate code that we know "reconverges" at a particular point. > Unfortunately GCC's middle-end optimisers can cause this property to > be violated, which causes problems for the OpenACC execution model > we're planning to use for NVPTX. > > As a brief example: code running in vector-single mode runs on a > single thread of a warp, and must broadcast condition results to other > threads of the warp so that they can "follow along" and be ready for > vector-partitioned execution when necessary. > > #pragma acc parallel > { > #pragma acc loop gang > for (i = 0; i < N; i++) > { > /* This is vector-single mode. */ > n = ...; > switch (n) > { > case 1: > #pragma acc loop vector > for (...) > { > /* This is vector-partitioned mode. */ > } > ... > } > } > } > > Here, the calculation "n = ..." takes place on a single thread (of > each partitioned gang of the outer loop), but the switch statement > (terminating the BB) must be executed by all threads in the warp. The > vector-single statements will be translated using a branch around for > the "idle" threads: > > if (threadIdx.x == 0) > { > n_0 = ...; > } > n_x = broadcast (n_0) > switch (n_x) > ... > > Where "broadcast" is an operation that transfers values from some > other thread of a warp (i.e., the zeroth) to the current thread > (implemented as a "shfl" instruction for NVPTX). > > I observed a similar example to this cloning the broadcast and switch > instructions (in the .dom1 dump), along the lines of: > > if (threadIdx.x == 0) > { > n_0 = ...; > n_x = broadcast (n_0) > switch (n_x) > ... > } > else > { > n_x = broadcast (n_0) > switch (n_x) > ... > } > > This doesn't work because the "broadcast" operation has to be run with > non-diverged warps for correct operation, and here there is divergence > due to the "if (threadIdx.x == 0)" condition. > > So, the way I have tried to handle this is by attempting to inhibit > optimisation along edges which have a reconvergence point as their > destination. The essential idea is to make such edges "abnormal", > although the existing EDGE_ABNORMAL flag is not used because that has > implicit meaning built into it already, and the new edge type may need > to be handled differently in some areas. One example is that at > present, blocks concluding with GIMPLE_COND cannot have EDGE_ABNORMAL > set on their EDGE_TRUE or EDGE_FALSE outgoing edges. > > The attached patch introduces a new edge flag (EDGE_TO_RECONVERGENCE), > for the GIMPLE CFG only. In principle there's nothing to stop the flag > being propagated to the RTL CFG also, in which case it'd probably be > set at the same time as EDGE_ABNORMAL, mirroring the semantics of e.g. > EDGE_EH, EDGE_ABNORMAL_CALL and EDGE_SIBCALL. Then, passes which > inspect the RTL CFG can continue to only check the ABNORMAL flag. But > so far (in rather limited testing!), that has not been observed to be > necessary. 
(We can control RTL CFG manipulation indirectly by using the > CANNOT_COPY_INSN_P target hook, sensitive e.g. to the "broadcast" > instruction.) > > For the GIMPLE CFG (i.e. in passes operating on GIMPLE form), > EDGE_TO_RECONVERGENCE behaves mostly the same as EDGE_ABNORMAL (i.e., > inhibiting certain optimisations), and so has been added to relevant > conditionals largely mechanically. Places where it is treated specially > are: > > * tree-cfg.c:gimple_verify_flow_info does not permit EDGE_ABNORMAL on > outgoing edges of a block concluding with a GIMPLE_COND statement. > But, we allow EDGE_TO_RECONVERGENCE there. > > * tree-vrp.c:find_conditional_asserts skips over outgoing GIMPLE_COND > edges with EDGE_TO_RECONVERGENCE set (avoiding an ICE when the pass > tries to split the edge later). > > There are probably other optimisations that will be tripped up by the > new flag along the same lines as the VRP tweak above, which we will no > doubt discover in due course. > > Together with the patch, > > https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02612.html > > This shows no regressions for the libgomp tests. > > OK for gomp4 branch? Hmm, I don't think adding a new edge flag is good nor necessary. It seems to me that instead the broadcast operation should have abnormal control flow and thus basic-blocks should be split either before or after it (so either incoming or outgoing edge(s) should be abnormal). I suppose splitting before the broadcast would be best (thus handle it similar to setjmp ()). Richard. > Thanks, > > Julian > > ChangeLog > > gcc/ > * basic-block.h (EDGE_COMPLEX): Add EDGE_TO_RECONVERGENCE flag. > (bb_hash_abnorm_or_reconv_pred): New function. > (hash_abnormal_or_eh_outgoing_edge_p): Consider > EDGE_TO_RECONVERGENCE also. > * cfg-flags.def (TO_RECONVERGENCE): Add flag. > * omp-low.c (predicate_bb): Set EDGE_TO_RECONVERGENCE on edges > leading to a reconvergence point. > * cfgbuild.c (purge_dead_tablejump_edges): Consider > EDGE_TO_RECONVERGENCE. > * cfgcleanup.c (try_crossjump_to_edge, try_head_merge_bb): Likewise. > * cfgexpand.c (expand_gimple_tailcall, construct_exit_block) > (pass_expand::execute): Likewise. > * cfghooks.c (can_copy_bbs_p): Likewise. > * cfgloop.c (bb_loop_header_p): Likewise. > * cfgloopmanip.c (scale_loop_profile): Likewise. > * gimple-iterator.c (gimple_find_edge_insert_loc): Likewise. > * graph.c (draw_cfg_node_succ_edges): Likewise. > * graphite-scope-detection.c (canonicalize_loop_closed_ssa): > Likewise. > * predict.c (tree_bb_level_predictions): Likewise. > * profile.c (instrument_edges, branch_prop, find_spanning_tree): > Likewise. > * tree-cfg.c (replace_uses_by, gimple_split_edge) > (gimple_redirect_edge_and_branch, split_critical_edges): Likewise. > * tree-cfgcleanup.c (tree_forwarder_block_p, remove_forwarder_block) > (pass_merge_phi::execute): Likewise. > * tree-chkp.c (chkp_fix_cfg): Likewise. > * tree-if-conv.c (if_convertible_bb_p): Likewise. > * tree-inline.c (update_ssa_across_abnormal_edges): Likewise. > * tree-into-ssa.c (rewrite_update_phi_arguments) > (rewrite_update_dom_walker::before_dom_children) > (create_new_def_for): Likewise. > * tree-outof-ssa.c (eliminate_phi): Likewise. > * tree-phinodes.c (add_phi_arg): Likewise. > * tree-ssa-coalesce (coalesce_cost_edge, create_outofssa_var_map) > (coalesce_partitions): Likewise. > * tree-ssa-dom.c (cprop_into_successor_phis) > (dom_opt_dom_walker::after_dom_children, propagate_rhs_into_lhs): > Likewise. > * tree-ssa-loop-im.c (loop_suitable_for_sm): Likewise. 
> * tree-ssa-loop-prefetch.c (emit_mfence_after_loop) > (may_use_storent_in_loop_p): Likewise. > * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Likewise. > * tree-ssa-pre.c (compute_antic, insert_into_preds_of_block): > Likewise. > * tree-ssa-propagate.c (simulate_block, replace_phi_args_in): > Likewise. > * tree-ssa-sink.c (sink_code_in_bb): Likewise. > * tree-ssa-threadedge.c (thread_across_edge): Likewise. > * tree-ssa-threadupdate.c (thread_single_edge): Likewise. > * tree-ssa-uninit.c (compute_control_dep_chain): Likewise. > * tree-ssa.c (verify_phi_args): Likewise. > * tree-vect-loop.c (vect_analyze_loop_form): Likewise. > * value-prof.c (gimple_ic): Likewise. > * tree-vrp.c (infer_value_range, process_assert_insertions_for): > Likewise. > (find_conditional_asserts): Skip over EDGE_TO_RECONVERGENCE edges. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-05-28 15:02 ` Richard Biener @ 2015-06-03 11:47 ` Julian Brown 0 siblings, 0 replies; 23+ messages in thread From: Julian Brown @ 2015-06-03 11:47 UTC (permalink / raw) To: Richard Biener; +Cc: GCC Patches, Bernd Schmidt, Jakub Jelinek, Thomas Schwinge [-- Attachment #1: Type: text/plain, Size: 2666 bytes --] On Thu, 28 May 2015 16:37:04 +0200 Richard Biener <richard.guenther@gmail.com> wrote: > On Thu, May 28, 2015 at 4:06 PM, Julian Brown > <julian@codesourcery.com> wrote: > > For NVPTX, it is vitally important that the divergence of threads > > within a warp can be controlled: in particular we must be able to > > generate code that we know "reconverges" at a particular point. > > Unfortunately GCC's middle-end optimisers can cause this property to > > be violated, which causes problems for the OpenACC execution model > > we're planning to use for NVPTX. > > Hmm, I don't think adding a new edge flag is good nor necessary. It > seems to me that instead the broadcast operation should have abnormal > control flow and thus basic-blocks should be split either before or > after it (so either incoming or outgoing edge(s) should be > abnormal). I suppose splitting before the broadcast would be best > (thus handle it similar to setjmp ()). Here's a version of the patch that uses abnormal edges with semantics unchanged, splitting the "false"/non-execution edge using a dummy block to avoid the prohibited case of both EDGE_TRUE/EDGE_FALSE and EDGE_ABNORMAL on the outgoing edges of a GIMPLE_COND. So for a fragment like this:

if (threadIdx.x == 0) /* cond_bb */
  {
    /* work */
    p0 = ...; /* assign */
  }
pN = broadcast(p0);
if (pN) goto T; else goto F;

Incoming edges to a broadcast operation have EDGE_ABNORMAL set:

+--------+
|cond_bb |--------,
+--------+        |
    | (true edge) | (false edge)
    v             v
+--------+    +-------+
| (work) |    | dummy |
+--------+    +-------+
| assign |        |
+--------+        |
  ABNORM|         |ABNORM
        v         |
+--------+<-------'
| bcast  |
+--------+
| cond   |
+--------+
   / \
  T   F

The abnormal edges actually serve two purposes, I think: as well as ensuring the broadcast operation takes place when a warp is non-diverged/coherent, they ensure that p0 is not seen as uninitialised along the "false" path from cond_bb, possibly leading to the broadcast operation being optimised away as partially redundant. This feels somewhat fragile though! We'll have to continue to think about warp divergence in subsequent patches. The patch passes libgomp testing (with Bernd's recent worker-single patch also). OK for gomp4 branch (together with the previously-mentioned inline thread builtin patch)? Thanks, Julian ChangeLog gcc/ * omp-low.c (make_predication_test): Split false block out of cond_bb, making latter edge abnormal. (predicate_bb): Set EDGE_ABNORMAL on edges before broadcast operations.
[-- Attachment #2: to-reconvergence-5.diff --] [-- Type: text/x-patch, Size: 1751 bytes --]

commit 38056ae4a29f93ce54715dfad843a233f3b0fd2a
Author: Julian Brown <julian@codesourcery.com>
Date:   Mon Jun 1 11:12:41 2015 -0700

    Use abnormal edges before broadcast ops

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 7048f9f..310eb72 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -10555,7 +10555,16 @@ make_predication_test (edge true_edge, basic_block skip_dest_bb, int mask)
   gsi_insert_after (&tmp_gsi, cond_stmt, GSI_NEW_STMT);
 
   true_edge->flags = EDGE_TRUE_VALUE;
-  make_edge (cond_bb, skip_dest_bb, EDGE_FALSE_VALUE);
+
+  /* Force an abnormal edge before a broadcast operation that might be present
+     in SKIP_DEST_BB.  This is only done for the non-execution edge (with
+     respect to the predication done by this function) -- the opposite
+     (execution) edge that reaches the broadcast operation must be made
+     abnormal also, e.g. in this function's caller.  */
+  edge e = make_edge (cond_bb, skip_dest_bb, EDGE_FALSE_VALUE);
+  basic_block false_abnorm_bb = split_edge (e);
+  edge abnorm_edge = single_succ_edge (false_abnorm_bb);
+  abnorm_edge->flags |= EDGE_ABNORMAL;
 }
 
 /* Apply OpenACC predication to basic block BB which is in
@@ -10605,6 +10614,7 @@ predicate_bb (basic_block bb, struct omp_region *parent, int mask)
                                                   mask);
 
           edge e = split_block (bb, splitpoint);
+          e->flags = EDGE_ABNORMAL;
           skip_dest_bb = e->dest;
 
           gimple_cond_set_condition (as_a <gcond *> (stmt), EQ_EXPR,
@@ -10624,6 +10634,7 @@ predicate_bb (basic_block bb, struct omp_region *parent, int mask)
                                              gsi_asgn, mask);
 
           edge e = split_block (bb, splitpoint);
+          e->flags = EDGE_ABNORMAL;
           skip_dest_bb = e->dest;
 
           gimple_switch_set_index (sstmt, new_var);

^ permalink raw reply [flat|nested] 23+ messages in thread