On 09/11/15 20:52, Tom de Vries wrote:
> On 09/11/15 16:35, Tom de Vries wrote:
>> Hi,
>>
>> this patch series for stage1 trunk adds support to:
>> - parallelize oacc kernels regions using parloops, and
>> - map the loops onto the oacc gang dimension.
>>
>> The patch series contains these patches:
>>
>>    1  Insert new exit block only when needed in
>>       transform_to_exit_first_loop_alt
>>    2  Make create_parallel_loop return void
>>    3  Ignore reduction clause on kernels directive
>>    4  Implement -foffload-alias
>>    5  Add in_oacc_kernels_region in struct loop
>>    6  Add pass_oacc_kernels
>>    7  Add pass_dominator_oacc_kernels
>>    8  Add pass_ch_oacc_kernels
>>    9  Add pass_parallelize_loops_oacc_kernels
>>   10  Add pass_oacc_kernels pass group in passes.def
>>   11  Update testcases after adding kernels pass group
>>   12  Handle acc loop directive
>>   13  Add c-c++-common/goacc/kernels-*.c
>>   14  Add gfortran.dg/goacc/kernels-*.f95
>>   15  Add libgomp.oacc-c-c++-common/kernels-*.c
>>   16  Add libgomp.oacc-fortran/kernels-*.f95
>>
>> The first 9 patches are more or less independent, but patches 10-16 are
>> intended to be committed at the same time.
>>
>> Bootstrapped and reg-tested on x86_64.
>>
>> Build and reg-tested with nvidia accelerator, in combination with a
>> patch that enables accelerator testing (which is submitted at
>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>
>> I'll post the individual patches in reply to this message.
>
> This patch adds pass_parallelize_loops_oacc_kernels.
>
> There's a number of things we do differently in parloops for oacc kernels:
> - in normal parloops, we generate code to choose between a parallel
>   version of the loop, and a sequential (low iteration count) version.
>   Since the code in oacc kernels region is supposed to run on the
>   accelerator anyway, we skip this check, and don't add a low iteration
>   count loop.
> - in normal parloops, we generate an #pragma omp parallel /
>   GIMPLE_OMP_RETURN pair to delimit the region which we will split off
>   into a thread function. Since the oacc kernels region is already
>   split off, we don't add this pair.
> - we indicate the parallelization factor by setting the oacc function
>   attributes.
> - we generate an #pragma oacc loop instead of an #pragma omp for, and
>   we add the gang clause.
> - in normal parloops, we rewrite the variable accesses in the loop
>   into accesses relative to a thread function parameter. For the
>   oacc kernels region, that rewrite has already been done at omp-lower,
>   so we skip this.
> - we need to ensure that the entire kernels region can be run in
>   parallel. The loop independence check is already present, so for oacc
>   kernels we add a check between blocks outside the loop and the entire
>   region.
> - we guard stores in the blocks outside the loop with gang_pos == 0.
>   There's no need for each gang to write to a single location; we can
>   do this in just one gang. (Typically this is the write of the final
>   value of the iteration variable if that one is copied back to the
>   host).
>

Reposting with loop optimizer init added in
pass_parallelize_loops_oacc_kernels::execute.

Thanks,
- Tom
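
P.S. To illustrate the gang_pos == 0 guard from the list above, here is a rough host-side sketch in plain C. It is not code from the patch: NUM_GANGS, run_gang, run_region and final_i_stores are hypothetical names, the cyclic distribution of iterations over gangs is an assumption made for the example, and the gangs are simply run one after another instead of on an accelerator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical number of gangs; in the patch the real factor is
   communicated through the oacc function attributes.  */
#define NUM_GANGS 4

/* Counts stores to *final_i, to show that only one gang performs it.  */
static int final_i_stores;

/* What a single gang executes for a region like

     #pragma acc kernels
     for (i = 0; i < n; i++)
       c[i] = a[i] + b[i];

   Assumption: iterations are distributed cyclically over the gangs.  */
static void
run_gang (int gang_pos, size_t n, const int *a, const int *b, int *c,
          size_t *final_i)
{
  assert (gang_pos >= 0 && gang_pos < NUM_GANGS);

  /* Loop body: each gang handles its own share of the iterations.  */
  for (size_t i = (size_t) gang_pos; i < n; i += NUM_GANGS)
    c[i] = a[i] + b[i];

  /* Block outside the loop: every gang would store the same final value
     of the iteration variable, so only gang 0 does the store.  */
  if (gang_pos == 0)
    {
      *final_i = n;
      final_i_stores++;
    }
}

/* Host-side stand-in for launching the region: runs the gangs in turn.  */
static void
run_region (size_t n, const int *a, const int *b, int *c, size_t *final_i)
{
  for (int gang_pos = 0; gang_pos < NUM_GANGS; gang_pos++)
    run_gang (gang_pos, n, a, b, c, final_i);
}
```

Calling run_region on three arrays of length n fills c with the element-wise sums, and the final value of the iteration variable is stored exactly once, by gang 0, no matter how many gangs take part.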