Hi Cesar! On Tue, 16 Sep 2014 17:32:54 -0700, Cesar Philippidis wrote: > The patch introduces the following OpenACC/PTX-specific built-ins: > > * GOACC_ntid > * GOACC_tid > * GOACC_nctaid > * GOACC_ctaid > * acc_on_device > * GOACC_get_thread_num > * GOACC_get_num_threads > > Of these functions, the only one part of the OpenACC spec is > acc_on_device. The other functions are helpers for omp-low.c. In > particular, I'm using GOACC_get_thread_num and GOACC_get_num_threads to > determine the number of accelerator threads available to the reduction > clause. Current GOACC_get_num_threads is num_gangs * vector_length, but > value is subject to change later on. It's probably a premature to > include the PTX built-ins right now, but I'd like to middle end of our > internal OpenACC branch in sync with gomp-4_0-branch. In my opinion (and we've once only very briefly discussed this internally), exposing concepts such as TID or CTA is wrong at this abstraction level, for these are PTX concepts, but we'd like OpenACC concepts here: numbers and IDs of gangs, workers, vector-length. That said, I'm not opposed to having these committed to gomp-4_0-branch, and we'll fix it in the following. (And -- full disclosure ;-) -- it was me who internally added GOACC_ntid/GOACC_tid, when I needed those as prototypes, and never got around to re-doing that properly.) > This patch also allows OpenACC reductions to process the array holding > partial reductions on the accelerator, instead of copying that array > back to the host. Currently, this only happens when num_gangs = 1. For > PTX targets, we're going to need to use another kernel to process the > array of partial results because PTX lacks inter-CTA synchronization > (we're currently mapping gangs to CTAs). That's why I was working on the > routine clause recently. > > Is this OK for gomp-4_0-branch? If we agree to fix this up as discussed above (and I guess we have no chance but to agree on this), I'd say so. A few comments, none of which are blocking: > --- a/gcc/builtins.def > +++ b/gcc/builtins.def > #undef DEF_GOACC_BUILTIN > #define DEF_GOACC_BUILTIN(ENUM, NAME, TYPE, ATTRS) \ > DEF_BUILTIN (ENUM, "__builtin_" NAME, BUILT_IN_NORMAL, TYPE, TYPE, \ > - false, true, true, ATTRS, false, flag_openacc) > + false, true, true, ATTRS, false, \ > + (/* TODO */ true || flag_openacc)) Hack that, I hope, will soon be addressed properly. > #define DEF_GOMP_BUILTIN(ENUM, NAME, TYPE, ATTRS) \ > DEF_BUILTIN (ENUM, "__builtin_" NAME, BUILT_IN_NORMAL, TYPE, TYPE, \ > false, true, true, ATTRS, false, \ > - (flag_openmp || flag_tree_parallelize_loops)) > + (/* TODO */ true || flag_openmp || flag_tree_parallelize_loops)) Likewise. > --- a/gcc/omp-builtins.def > +++ b/gcc/omp-builtins.def > @@ -236,6 +236,3 @@ DEF_GOMP_BUILTIN (BUILT_IN_GOMP_TARGET_UPDATE, "GOMP_target_update", > BT_FN_VOID_INT_PTR_SIZE_PTR_PTR_PTR, ATTR_NOTHROW_LIST) > DEF_GOMP_BUILTIN (BUILT_IN_GOMP_TEAMS, "GOMP_teams", > BT_FN_VOID_UINT_UINT, ATTR_NOTHROW_LIST) > - > -DEF_GOMP_BUILTIN (BUILT_IN_OMP_SET_NUM_THREADS, "omp_set_num_threads", > - BT_FN_VOID_INT, ATTR_CONST_NOTHROW_LEAF_LIST) To avoid confusion: that has been added to gomp-4_0-branch earlier, and is now reverted to the trunk state. > --- a/gcc/omp-low.c > +++ b/gcc/omp-low.c > +static bool > +finish_reduction_on_host (omp_context *ctx) > +static tree > +oacc_host_nthreads (omp_context *ctx) > +static void > +finalize_reduction_data (tree clauses, tree nthreads, gimple_seq *stmt_seqp, > @@ -4433,61 +4519,26 @@ lower_reduction_clauses (tree clauses, gimple_seq *stmt_seqp, omp_context *ctx) > @@ -9782,6 +9900,14 @@ initialize_reduction_data (tree clauses, tree nthreads, gimple_seq *stmt_seqp, > @@ -9913,16 +10089,15 @@ finalize_reduction_data (tree clauses, tree nthreads, gimple_seq *stmt_seqp, > @@ -10026,7 +10211,6 @@ process_reduction_data (gimple_seq *body, gimple_seq *in_stmt_seqp, Have not reviewed in detail, but that also applies to previous reduction changes, so OK to apply on top of that. > @@ -5644,9 +5739,9 @@ expand_oacc_offload (struct omp_region *region) > tree openmp_target = get_offload_symbol_decl (); > tree fnaddr = build_fold_addr_expr (child_fn); > g = gimple_build_call (builtin_decl_explicit (start_ix), 10, device, > - fnaddr, build_fold_addr_expr (openmp_target), > - t1, t2, t3, t4, > - t_num_gangs, t_num_workers, t_vector_length); > + fnaddr, build_fold_addr_expr (openmp_target), > + t1, t2, t3, t4, > + t_num_gangs, t_num_workers, t_vector_length); > gimple_set_location (g, gimple_location (entry_stmt)); > gsi_insert_before (&gsi, g, GSI_SAME_STMT); > } Unintentional? > @@ -6913,8 +7008,10 @@ expand_omp_for_static_nochunk (struct omp_region *region, > + /* Ensure nthreads is at least 1. BUILT_IN_GOACC_NTID returns 0 for a target > + that does not have a specific expansion. */ Remains to be decided. > @@ -7317,8 +7423,10 @@ expand_omp_for_static_chunk (struct omp_region *region, > + /* Ensure nthreads is at least 1. BUILT_IN_GOACC_NTID returns 0 for a target > + that does not have a specific expansion. */ > + nthreads > + = fold_build2 (MAX_EXPR, itype, nthreads, > + fold_convert (TREE_TYPE (nthreads), integer_one_node)); > + nthreads = force_gimple_operand_gsi (&gsi, nthreads, true, NULL_TREE, > + true, GSI_SAME_STMT); Likewise. > /* Initialize the reduction variables to be value of the first array > - element. */ > + element. FIXME: A parallel loop should use the original reduction > + variable as the initial value. */ Is that , and where you had sent that question to the OpenACC technical mailing list, weeks ago, which has not yet been answered as far as I know -- might be a good idea to "ping" them? Grüße, Thomas