From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 62431 invoked by alias); 28 May 2015 14:37:16 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 62391 invoked by uid 89); 28 May 2015 14:37:15 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-2.0 required=5.0 tests=AWL,BAYES_00,FREEMAIL_FROM,RCVD_IN_DNSWL_LOW,SPF_PASS autolearn=ham version=3.3.2 X-HELO: mail-ob0-f170.google.com Received: from mail-ob0-f170.google.com (HELO mail-ob0-f170.google.com) (209.85.214.170) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-GCM-SHA256 encrypted) ESMTPS; Thu, 28 May 2015 14:37:07 +0000 Received: by obbea2 with SMTP id ea2so34226483obb.3 for ; Thu, 28 May 2015 07:37:04 -0700 (PDT) MIME-Version: 1.0 X-Received: by 10.182.56.4 with SMTP id w4mr2735101obp.79.1432823824690; Thu, 28 May 2015 07:37:04 -0700 (PDT) Received: by 10.76.115.167 with HTTP; Thu, 28 May 2015 07:37:04 -0700 (PDT) In-Reply-To: <20150528150635.7bd5db23@octopus> References: <20150528150635.7bd5db23@octopus> Date: Thu, 28 May 2015 15:02:00 -0000 Message-ID: Subject: Re: [gomp4] Preserve NVPTX "reconvergence" points From: Richard Biener To: Julian Brown Cc: GCC Patches , Bernd Schmidt , Jakub Jelinek , Thomas Schwinge Content-Type: text/plain; charset=UTF-8 X-IsSubscribed: yes X-SW-Source: 2015-05/txt/msg02656.txt.bz2 On Thu, May 28, 2015 at 4:06 PM, Julian Brown wrote: > For NVPTX, it is vitally important that the divergence of threads > within a warp can be controlled: in particular we must be able to > generate code that we know "reconverges" at a particular point. > Unfortunately GCC's middle-end optimisers can cause this property to > be violated, which causes problems for the OpenACC execution model > we're planning to use for NVPTX. > > As a brief example: code running in vector-single mode runs on a > single thread of a warp, and must broadcast condition results to other > threads of the warp so that they can "follow along" and be ready for > vector-partitioned execution when necessary. > > #pragma acc parallel > { > #pragma acc loop gang > for (i = 0; i < N; i++) > { > /* This is vector-single mode. */ > n = ...; > switch (n) > { > case 1: > #pragma acc loop vector > for (...) > { > /* This is vector-partitioned mode. */ > } > ... > } > } > } > > Here, the calculation "n = ..." takes place on a single thread (of > each partitioned gang of the outer loop), but the switch statement > (terminating the BB) must be executed by all threads in the warp. The > vector-single statements will be translated using a branch around for > the "idle" threads: > > if (threadIdx.x == 0) > { > n_0 = ...; > } > n_x = broadcast (n_0) > switch (n_x) > ... > > Where "broadcast" is an operation that transfers values from some > other thread of a warp (i.e., the zeroth) to the current thread > (implemented as a "shfl" instruction for NVPTX). > > I observed a similar example to this cloning the broadcast and switch > instructions (in the .dom1 dump), along the lines of: > > if (threadIdx.x == 0) > { > n_0 = ...; > n_x = broadcast (n_0) > switch (n_x) > ... > } > else > { > n_x = broadcast (n_0) > switch (n_x) > ... > } > > This doesn't work because the "broadcast" operation has to be run with > non-diverged warps for correct operation, and here there is divergence > due to the "if (threadIdx.x == 0)" condition. > > So, the way I have tried to handle this is by attempting to inhibit > optimisation along edges which have a reconvergence point as their > destination. The essential idea is to make such edges "abnormal", > although the existing EDGE_ABNORMAL flag is not used because that has > implicit meaning built into it already, and the new edge type may need > to be handled differently in some areas. One example is that at > present, blocks concluding with GIMPLE_COND cannot have EDGE_ABNORMAL > set on their EDGE_TRUE or EDGE_FALSE outgoing edges. > > The attached patch introduces a new edge flag (EDGE_TO_RECONVERGENCE), > for the GIMPLE CFG only. In principle there's nothing to stop the flag > being propagated to the RTL CFG also, in which case it'd probably be > set at the same time as EDGE_ABNORMAL, mirroring the semantics of e.g. > EDGE_EH, EDGE_ABNORMAL_CALL and EDGE_SIBCALL. Then, passes which > inspect the RTL CFG can continue to only check the ABNORMAL flag. But > so far (in rather limited testing!), that has not been observed to be > necessary. (We can control RTL CFG manipulation indirectly by using the > CANNOT_COPY_INSN_P target hook, sensitive e.g. to the "broadcast" > instruction.) > > For the GIMPLE CFG (i.e. in passes operating on GIMPLE form), > EDGE_TO_RECONVERGENCE behaves mostly the same as EDGE_ABNORMAL (i.e., > inhibiting certain optimisations), and so has been added to relevant > conditionals largely mechanically. Places where it is treated specially > are: > > * tree-cfg.c:gimple_verify_flow_info does not permit EDGE_ABNORMAL on > outgoing edges of a block concluding with a GIMPLE_COND statement. > But, we allow EDGE_TO_RECONVERGENCE there. > > * tree-vrp.c:find_conditional_asserts skips over outgoing GIMPLE_COND > edges with EDGE_TO_RECONVERGENCE set (avoiding an ICE when the pass > tries to split the edge later). > > There are probably other optimisations that will be tripped up by the > new flag along the same lines as the VRP tweak above, which we will no > doubt discover in due course. > > Together with the patch, > > https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02612.html > > This shows no regressions for the libgomp tests. > > OK for gomp4 branch? Hmm, I don't think adding a new edge flag is good nor necessary. It seems to me that instead the broadcast operation should have abnormal control flow and thus basic-blocks should be split either before or after it (so either incoming or outgoing edge(s) should be abnormal). I suppose splitting before the broadcast would be best (thus handle it similar to setjmp ()). Richard. > Thanks, > > Julian > > ChangeLog > > gcc/ > * basic-block.h (EDGE_COMPLEX): Add EDGE_TO_RECONVERGENCE flag. > (bb_hash_abnorm_or_reconv_pred): New function. > (hash_abnormal_or_eh_outgoing_edge_p): Consider > EDGE_TO_RECONVERGENCE also. > * cfg-flags.def (TO_RECONVERGENCE): Add flag. > * omp-low.c (predicate_bb): Set EDGE_TO_RECONVERGENCE on edges > leading to a reconvergence point. > * cfgbuild.c (purge_dead_tablejump_edges): Consider > EDGE_TO_RECONVERGENCE. > * cfgcleanup.c (try_crossjump_to_edge, try_head_merge_bb): Likewise. > * cfgexpand.c (expand_gimple_tailcall, construct_exit_block) > (pass_expand::execute): Likewise. > * cfghooks.c (can_copy_bbs_p): Likewise. > * cfgloop.c (bb_loop_header_p): Likewise. > * cfgloopmanip.c (scale_loop_profile): Likewise. > * gimple-iterator.c (gimple_find_edge_insert_loc): Likewise. > * graph.c (draw_cfg_node_succ_edges): Likewise. > * graphite-scope-detection.c (canonicalize_loop_closed_ssa): > Likewise. > * predict.c (tree_bb_level_predictions): Likewise. > * profile.c (instrument_edges, branch_prop, find_spanning_tree): > Likewise. > * tree-cfg.c (replace_uses_by, gimple_split_edge) > (gimple_redirect_edge_and_branch, split_critical_edges): Likewise. > * tree-cfgcleanup.c (tree_forwarder_block_p, remove_forwarder_block) > (pass_merge_phi::execute): Likewise. > * tree-chkp.c (chkp_fix_cfg): Likewise. > * tree-if-conv.c (if_convertible_bb_p): Likewise. > * tree-inline.c (update_ssa_across_abnormal_edges): Likewise. > * tree-into-ssa.c (rewrite_update_phi_arguments) > (rewrite_update_dom_walker::before_dom_children) > (create_new_def_for): Likewise. > * tree-outof-ssa.c (eliminate_phi): Likewise. > * tree-phinodes.c (add_phi_arg): Likewise. > * tree-ssa-coalesce (coalesce_cost_edge, create_outofssa_var_map) > (coalesce_partitions): Likewise. > * tree-ssa-dom.c (cprop_into_successor_phis) > (dom_opt_dom_walker::after_dom_children, propagate_rhs_into_lhs): > Likewise. > * tree-ssa-loop-im.c (loop_suitable_for_sm): Likewise. > * tree-ssa-loop-prefetch.c (emit_mfence_after_loop) > (may_use_storent_in_loop_p): Likewise. > * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Likewise. > * tree-ssa-pre.c (compute_antic, insert_into_preds_of_block): > Likewise. > * tree-ssa-propagate.c (simulate_block, replace_phi_args_in): > Likewise. > * tree-ssa-sink.c (sink_code_in_bb): Likewise. > * tree-ssa-threadedge.c (thread_across_edge): Likewise. > * tree-ssa-threadupdate.c (thread_single_edge): Likewise. > * tree-ssa-uninit.c (compute_control_dep_chain): Likewise. > * tree-ssa.c (verify_phi_args): Likewise. > * tree-vect-loop.c (vect_analyze_loop_form): Likewise. > * value-prof.c (gimple_ic): Likewise. > * tree-vrp.c (infer_value_range, process_assert_insertions_for): > Likewise. > (find_conditional_asserts): Skip over EDGE_TO_RECONVERGENCE edges.