On 09/11/15 20:52, Tom de Vries wrote:
> On 09/11/15 16:35, Tom de Vries wrote:
>> Hi,
>>
>> this patch series for stage1 trunk adds support to:
>> - parallelize oacc kernels regions using parloops, and
>> - map the loops onto the oacc gang dimension.
>>
>> The patch series contains these patches:
>>
>>       1    Insert new exit block only when needed in
>>          transform_to_exit_first_loop_alt
>>       2    Make create_parallel_loop return void
>>       3    Ignore reduction clause on kernels directive
>>       4    Implement -foffload-alias
>>       5    Add in_oacc_kernels_region in struct loop
>>       6    Add pass_oacc_kernels
>>       7    Add pass_dominator_oacc_kernels
>>       8    Add pass_ch_oacc_kernels
>>       9    Add pass_parallelize_loops_oacc_kernels
>>      10    Add pass_oacc_kernels pass group in passes.def
>>      11    Update testcases after adding kernels pass group
>>      12    Handle acc loop directive
>>      13    Add c-c++-common/goacc/kernels-*.c
>>      14    Add gfortran.dg/goacc/kernels-*.f95
>>      15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>      16    Add libgomp.oacc-fortran/kernels-*.f95
>>
>> The first 9 patches are more or less independent, but patches 10-16 are
>> intended to be committed at the same time.
>>
>> Bootstrapped and reg-tested on x86_64.
>>
>> Built and reg-tested with nvidia accelerator, in combination with a
>> patch that enables accelerator testing (which is submitted at
>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>
>> I'll post the individual patches in reply to this message.
>
> This patch adds pass_parallelize_loops_oacc_kernels.
>
> There are a number of things we do differently in parloops for oacc kernels:
> - in normal parloops, we generate code to choose between a parallel
>    version of the loop, and a sequential (low iteration count) version.
>    Since the code in an oacc kernels region is supposed to run on the
>    accelerator anyway, we skip this check, and don't add a low iteration
>    count loop.
> - in normal parloops, we generate an #pragma omp parallel /
>    GIMPLE_OMP_RETURN pair to delimit the region which we will split off
>    into a thread function. Since the oacc kernels region is already
>    split off, we don't add this pair.
> - we indicate the parallelization factor by setting the oacc function
>    attributes
> - we generate an #pragma oacc loop instead of an #pragma omp for, and
>    we add the gang clause
> - in normal parloops, we rewrite the variable accesses in the loop in
>    terms of accesses relative to a thread function parameter. For the
>    oacc kernels region, that rewrite has already been done at omp-lower,
>    so we skip this.
> - we need to ensure that the entire kernels region can be run in
>    parallel. The loop independence check is already present, so for oacc
>    kernels we add a check between blocks outside the loop and the entire
>    region.
> - we guard stores in the blocks outside the loop with gang_pos == 0.
>    There's no need for every gang to write to the same location; we can
>    do this in just one gang. (Typically this is the write of the final
>    value of the iteration variable if that one is copied back to the
>    host).
>
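As a rough illustration of the gang_pos == 0 guard described in the last
item above, here is a plain C simulation. It is not code from the patch:
NUM_GANGS, kernels_region, run_all_gangs and the cyclic distribution of
iterations over gangs are all assumptions made for the sketch, and the
gangs are run one after another instead of concurrently.

```c
#define NUM_GANGS 4   /* hypothetical gang count, for illustration only */
#define N 10          /* loop trip count */

int a[N];
int final_i;          /* stands in for a value copied back to the host */
int final_i_stores;   /* counts how many gangs performed the store */

/* The kernels region as a single gang would execute it.  */
void
kernels_region (int gang_pos)
{
  /* Loop iterations are distributed over the gangs.  The cyclic
     distribution used here is an assumption of this sketch.  */
  for (int i = gang_pos; i < N; i += NUM_GANGS)
    a[i] = 2 * i;

  /* Store in a block outside the loop.  Every gang would write the same
     value to the same location, so only gang 0 performs it.  */
  if (gang_pos == 0)
    {
      final_i = N;
      final_i_stores++;
    }
}

/* Sequential stand-in for all gangs executing the region.  */
void
run_all_gangs (void)
{
  for (int g = 0; g < NUM_GANGS; g++)
    kernels_region (g);
}
```

After run_all_gangs (), every element of a has been written by exactly
one gang, and the store to final_i has happened once rather than
NUM_GANGS times.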

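The first item in the list (skipping the low iteration count check) can
be sketched in plain C as well. Again this is only an illustration under
stated assumptions: MIN_ITERS_PER_THREAD, its value, the function names
and the chunked "parallel" version are hypothetical and are not taken
from the parloops sources.

```c
/* Hypothetical threshold; the name and value are illustrative only.  */
#define MIN_ITERS_PER_THREAD 100

enum version { SEQUENTIAL, PARALLEL };

/* One loop iteration.  */
void
body (int *v, long i)
{
  v[i] = (int) (2 * i);
}

/* Sequential (low iteration count) version of the loop.  */
void
run_sequential (int *v, long n)
{
  for (long i = 0; i < n; i++)
    body (v, i);
}

/* Stand-in for the parallel version: each "thread" gets one chunk.
   The chunks are processed one after another in this sketch.  */
void
run_parallel (int *v, long n, int n_threads)
{
  long chunk = (n + n_threads - 1) / n_threads;
  for (int t = 0; t < n_threads; t++)
    {
      long lo = t * chunk;
      long hi = lo + chunk < n ? lo + chunk : n;
      for (long i = lo; i < hi; i++)
        body (v, i);
    }
}

/* Normal parloops: choose a version at run time.  */
enum version
normal_parloops (int *v, long n, int n_threads)
{
  if (n >= (long) MIN_ITERS_PER_THREAD * n_threads)
    {
      run_parallel (v, n, n_threads);
      return PARALLEL;
    }
  run_sequential (v, n);
  return SEQUENTIAL;
}

/* oacc kernels: no check and no sequential copy of the loop.  */
enum version
oacc_kernels_parloops (int *v, long n, int n_threads)
{
  run_parallel (v, n, n_threads);
  return PARALLEL;
}
```

With n = 5 and n_threads = 4, normal_parloops takes the sequential path,
while oacc_kernels_parloops always takes the parallel one.
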
Reposting with loop optimizer init added in
pass_parallelize_loops_oacc_kernels::execute.

Thanks,
- Tom