public inbox for gcc-patches@gcc.gnu.org
* [PATCH series, 16] Use parloops to parallelize oacc kernels regions
@ 2015-11-09 15:35 Tom de Vries
  2015-11-09 15:44 ` [PATCH, 1/16] Insert new exit block only when needed in transform_to_exit_first_loop_alt Tom de Vries
                   ` (15 more replies)
  0 siblings, 16 replies; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 15:35 UTC
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 1382 bytes --]

Hi,

this patch series for stage1 trunk adds support to:
- parallelize oacc kernels regions using parloops, and
- map the loops onto the oacc gang dimension.
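
As a rough illustration of the kind of input this targets, here is a minimal
sketch of a kernels region. The function name vec_add and the data clauses
are assumptions made up for this example, not taken from the testcases in
the series. The region contains a plain loop with no acc loop directive, so
the compiler has to establish on its own that the iterations are
independent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example, not taken from the series' testsuite.  Without
   -fopenacc the pragma is ignored and the loop runs sequentially on the
   host, so the result is the same either way.  */
void
vec_add (size_t n, const unsigned *a, const unsigned *b, unsigned *c)
{
  assert (a != NULL && b != NULL && c != NULL);

#pragma acc kernels copyin (a[0:n], b[0:n]) copyout (c[0:n])
  {
    /* No "acc loop" directive here: the compiler itself has to prove
       that the iterations are independent before parallelizing.  */
    for (size_t i = 0; i < n; i++)
      c[i] = a[i] + b[i];
  }
}
```

Calling vec_add on three distinct arrays of length n stores the element-wise
sums in c, whether or not the region is actually offloaded.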

The patch series contains these patches:

      1	Insert new exit block only when needed in
         transform_to_exit_first_loop_alt
      2	Make create_parallel_loop return void
      3	Ignore reduction clause on kernels directive
      4	Implement -foffload-alias
      5	Add in_oacc_kernels_region in struct loop
      6	Add pass_oacc_kernels
      7	Add pass_dominator_oacc_kernels
      8	Add pass_ch_oacc_kernels
      9	Add pass_parallelize_loops_oacc_kernels
     10	Add pass_oacc_kernels pass group in passes.def
     11	Update testcases after adding kernels pass group
     12	Handle acc loop directive
     13	Add c-c++-common/goacc/kernels-*.c
     14	Add gfortran.dg/goacc/kernels-*.f95
     15	Add libgomp.oacc-c-c++-common/kernels-*.c
     16	Add libgomp.oacc-fortran/kernels-*.f95

The first 9 patches are more or less independent, but patches 10-16 are 
intended to be committed at the same time.

Bootstrapped and reg-tested on x86_64.

Built and reg-tested with nvidia accelerator, in combination with a 
patch that enables accelerator testing (which is submitted at 
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).

I'll post the individual patches in reply to this message.

Thanks,
- Tom

[-- Attachment #2: patch-series-summmary.txt --]
[-- Type: text/plain, Size: 10935 bytes --]

---

1
Insert new exit block only when needed in transform_to_exit_first_loop_alt

2015-06-30  Tom de Vries  <tom@codesourcery.com>

	* tree-parloops.c (transform_to_exit_first_loop_alt): Insert new exit
	block only when needed.
---

2
Make create_parallel_loop return void

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* tree-parloops.c (create_parallel_loop): Return void.
---

3
Ignore reduction clause on kernels directive

2015-11-08  Tom de Vries  <tom@codesourcery.com>

	* c-omp.c (c_oacc_split_loop_clauses): Don't copy OMP_CLAUSE_REDUCTION,
	classify as loop clause.
---

4
Implement -foffload-alias

2015-11-03  Tom de Vries  <tom@codesourcery.com>

	* common.opt (foffload-alias): New option.
	* flag-types.h (enum offload_alias): New enum.
	* omp-low.c (install_var_field): Handle flag_offload_alias.
	* doc/invoke.texi (@item Code Generation Options): Add -foffload-alias.
	(@item -foffload-alias): New item.

	* c-c++-common/goacc/kernels-loop-offload-alias-none.c: New test.
	* c-c++-common/goacc/kernels-loop-offload-alias-ptr.c: New test.
---

5
Add in_oacc_kernels_region in struct loop

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* cfgloop.h (struct loop): Add in_oacc_kernels_region field.
	* omp-low.c (mark_loops_in_oacc_kernels_region): New function.
	(expand_omp_target): Call mark_loops_in_oacc_kernels_region.
---

6
Add pass_oacc_kernels

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* tree-pass.h (make_pass_oacc_kernels): Declare.
	* tree-ssa-loop.c (gate_oacc_kernels): New static function.
	(pass_data_oacc_kernels): New pass_data.
	(class pass_oacc_kernels): New pass.
	(make_pass_oacc_kernels): New function.
---

7
Add pass_dominator_oacc_kernels

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* tree-pass.h (make_pass_dominator_oacc_kernels): Declare.
	* tree-ssa-dom.c (class dominator_base): New class.  Factor out of ...
	(class pass_dominator): ... here.
	(dominator_base::may_peel_loop_headers_p)
        (pass_dominator::may_peel_loop_headers_p): New function.
	(pass_dominator_oacc_kernels): New pass.
	(make_pass_dominator_oacc_kernels): New function.
	(dominator_base::execute): Use may_peel_loop_headers_p.
---

8
Add pass_ch_oacc_kernels

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* tree-pass.h (make_pass_ch_oacc_kernels): Declare.
	* tree-ssa-loop-ch.c (pass_ch::pass_ch (pass_data, gcc::context)): New
	constructor.
	(pass_data_ch_oacc_kernels): New pass_data.
	(class pass_ch_oacc_kernels): New pass.
	(pass_ch_oacc_kernels::process_loop_p): New function.
	(make_pass_ch_oacc_kernels): New function.
---

9
Add pass_parallelize_loops_oacc_kernels

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* omp-low.c (set_oacc_fn_attrib): Make extern.
	* omp-low.c (expand_omp_atomic_fetch_op):  Release defs of update stmt.
	* omp-low.h (set_oacc_fn_attrib): Declare.
	* tree-parloops.c (struct reduction_info): Add reduc_addr field.
        (create_call_for_reduction_1): Handle case that reduc_addr is non-NULL.
	(create_parallel_loop, gen_parallel_loop, try_create_reduction_list):
	Add and handle function parameter oacc_kernels_p.
	(get_omp_data_i_param): New function.
	(ref_conflicts_with_region, oacc_entry_exit_ok_1)
	(oacc_entry_exit_single_gang, oacc_entry_exit_ok): New function.
	(parallelize_loops): Add and handle function parameter oacc_kernels_p.
	Calculate dominance info.  Skip loops that are not in a kernels region
	in oacc_kernels_p mode.  Skip inner loops of parallelized loops.
	(pass_parallelize_loops::execute): Call parallelize_loops with false
	argument.
	(pass_data_parallelize_loops_oacc_kernels): New pass_data.
	(class pass_parallelize_loops_oacc_kernels): New pass.
	(pass_parallelize_loops_oacc_kernels::execute)
	(make_pass_parallelize_loops_oacc_kernels): New function.
	* tree-pass.h (make_pass_parallelize_loops_oacc_kernels): Declare.
---

10
Add pass_oacc_kernels pass group in passes.def

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* omp-low.c (pass_expand_omp_ssa::clone): New function.
	* tree-ssa-loop.c (pass_scev_cprop::clone, pass_tree_loop_init::clone)
	(pass_tree_loop_done::clone): New function.
	* passes.def: Add pass_oacc_kernels pass group.
---

11
Update testcases after adding kernels pass group

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* c-c++-common/restrict-2.c: Update after adding pass_oacc_kernels pass
	group.
	* c-c++-common/restrict-4.c: Same.
	* g++.dg/tree-ssa/copyprop-1.C: Same.
	* g++.dg/tree-ssa/pr33615.C: Same.
	* g++.dg/tree-ssa/restrict1.C: Same.
	* gcc.dg/gomp/notify-new-function-3.c: Same.
	* gcc.dg/pr23911.c: Same.
	* gcc.dg/pr41488.c: Same.
	* gcc.dg/tm/pub-safety-1.c: Same.
	* gcc.dg/tm/reg-promotion.c: Same.
	* gcc.dg/tree-ssa/20030709-2.c: Same.
	* gcc.dg/tree-ssa/20030731-2.c: Same.
	* gcc.dg/tree-ssa/20040729-1.c: Same.
	* gcc.dg/tree-ssa/20050314-1.c: Same.
	* gcc.dg/tree-ssa/cfgcleanup-1.c: Same.
	* gcc.dg/tree-ssa/loop-17.c: Same.
	* gcc.dg/tree-ssa/loop-32.c: Same.
	* gcc.dg/tree-ssa/loop-33.c: Same.
	* gcc.dg/tree-ssa/loop-34.c: Same.
	* gcc.dg/tree-ssa/loop-35.c: Same.
	* gcc.dg/tree-ssa/loop-36.c: Same.
	* gcc.dg/tree-ssa/loop-39.c: Same.
	* gcc.dg/tree-ssa/loop-7.c: Same.
	* gcc.dg/tree-ssa/pr21086.c: Same.
	* gcc.dg/tree-ssa/pr23109.c: Same.
	* gcc.dg/tree-ssa/restrict-3.c: Same.
	* gcc.dg/tree-ssa/restrict-5.c: Same.
	* gcc.dg/tree-ssa/scev-7.c: Same.
	* gcc.dg/tree-ssa/ssa-dce-1.c: Same.
	* gcc.dg/tree-ssa/ssa-dce-2.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-1.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-10.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-11.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-12.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-2.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-3.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-6.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-7.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-8.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-9.c: Same.
	* gcc.dg/tree-ssa/structopt-1.c: Same.
	* gcc.dg/vect/pr26359.c: Same.
	* gfortran.dg/pr32921.f: Same.
---

12
Handle acc loop directive

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* omp-low.c (struct omp_region): Add inside_kernels_p field.
	(expand_omp_for_generic): Set address taken for istart0 and end0
	only when necessary.  Adjust to generate a 'sequential' loop
	when GOMP builtin arguments are BUILT_IN_NONE.
	(expand_omp_for): Use expand_omp_for_generic() to generate a
	non-parallelized loop for OMP_FORs inside OpenACC kernels regions.
	(expand_omp): Mark inside_kernels_p field true for regions
	nested inside OpenACC kernels constructs.
---

13
Add c-c++-common/goacc/kernels-*.c

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* c-c++-common/goacc/kernels-acc-loop-reduction.c: New test.
	* c-c++-common/goacc/kernels-acc-loop-smaller-equal.c: New test.
	* c-c++-common/goacc/kernels-counter-var-redundant-load.c: New test.
	* c-c++-common/goacc/kernels-counter-vars-function-scope.c: New test.
	* c-c++-common/goacc/kernels-double-reduction.c: New test.
	* c-c++-common/goacc/kernels-empty.c: New test.
	* c-c++-common/goacc/kernels-eternal.c: New test.
	* c-c++-common/goacc/kernels-loop-2-acc-loop.c: New test.
	* c-c++-common/goacc/kernels-loop-2.c: New test.
	* c-c++-common/goacc/kernels-loop-3-acc-loop.c: New test.
	* c-c++-common/goacc/kernels-loop-3.c: New test.
	* c-c++-common/goacc/kernels-loop-acc-loop.c: New test.
	* c-c++-common/goacc/kernels-loop-data-2.c: New test.
	* c-c++-common/goacc/kernels-loop-data-enter-exit-2.c: New test.
	* c-c++-common/goacc/kernels-loop-data-enter-exit.c: New test.
	* c-c++-common/goacc/kernels-loop-data-update.c: New test.
	* c-c++-common/goacc/kernels-loop-data.c: New test.
	* c-c++-common/goacc/kernels-loop-g.c: New test.
	* c-c++-common/goacc/kernels-loop-mod-not-zero.c: New test.
	* c-c++-common/goacc/kernels-loop-n-acc-loop.c: New test.
	* c-c++-common/goacc/kernels-loop-n.c: New test.
	* c-c++-common/goacc/kernels-loop-nest.c: New test.
	* c-c++-common/goacc/kernels-loop.c: New test.
	* c-c++-common/goacc/kernels-noreturn.c: New test.
	* c-c++-common/goacc/kernels-one-counter-var.c: New test.
	* c-c++-common/goacc/kernels-parallel-loop-data-enter-exit.c: New test.
	* c-c++-common/goacc/kernels-reduction.c: New test.
---

14
Add gfortran.dg/goacc/kernels-*.f95

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* gfortran.dg/goacc/kernels-loop-2.f95: New test.
	* gfortran.dg/goacc/kernels-loop-data-2.f95: New test.
	* gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95: New test.
	* gfortran.dg/goacc/kernels-loop-data-enter-exit.f95: New test.
	* gfortran.dg/goacc/kernels-loop-data-update.f95: New test.
	* gfortran.dg/goacc/kernels-loop-data.f95: New test.
	* gfortran.dg/goacc/kernels-loop.f95: New test.
	* gfortran.dg/goacc/kernels-parallel-loop-data-enter-exit.f95: New test.
---

15
Add libgomp.oacc-c-c++-common/kernels-*.c

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c: New test.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-2.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit-2.c:
	Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit.c:
	Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-update.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-parallel-loop-data-enter-exit.c:
	Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c: Same.
---

16
Add libgomp.oacc-fortran/kernels-*.f95

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* testsuite/libgomp.oacc-fortran/kernels-loop-2.f95: New test.
	* testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95: Same.
	* testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95:
	Same.
	* testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95: Same.
	* testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95: Same.
	* testsuite/libgomp.oacc-fortran/kernels-loop-data.f95: Same.
	* testsuite/libgomp.oacc-fortran/kernels-loop.f95: Same.
	* testsuite/libgomp.oacc-fortran/kernels-parallel-loop-data-enter-exit.f95:
	Same.
---


* [PATCH, 1/16] Insert new exit block only when needed in transform_to_exit_first_loop_alt
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
@ 2015-11-09 15:44 ` Tom de Vries
  2015-11-11 10:50   ` Richard Biener
  2015-11-09 15:45 ` [PATCH, 2/16] Make create_parallel_loop return void Tom de Vries
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 15:44 UTC
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 1801 bytes --]

In transform_to_exit_first_loop_alt we insert a new exit block between
the new loop header and the old exit block. Currently, we do this even
when it is not necessary.

This patch inserts the new exit block only when it is needed, that is,
when the old exit block has more than one predecessor
(!single_pred_p (exit->dest)).
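
The decision can be sketched on a toy CFG. Everything below (toy_block,
toy_edge, split_exit_if_needed) is a hypothetical stand-in invented for
illustration; GCC's real basic_block and edge types, and split_edge itself,
do considerably more bookkeeping.

```c
#include <assert.h>
#include <stddef.h>

/* Toy CFG types for illustration only.  */
struct toy_block
{
  int n_preds;			/* Number of incoming edges.  */
  struct toy_block *succ;	/* Single successor, or NULL.  */
};

struct toy_edge
{
  struct toy_block *src;
  struct toy_block *dest;
};

/* If the destination of EXIT has other predecessors, route EXIT through
   SPARE (caller-provided storage) and return SPARE.  Otherwise leave the
   CFG alone and return NULL, which corresponds to the case where the
   patch skips the split because single_pred_p (exit->dest) holds.  */
struct toy_block *
split_exit_if_needed (struct toy_edge *exit, struct toy_block *spare)
{
  assert (exit != NULL && exit->dest != NULL && spare != NULL);

  if (exit->dest->n_preds == 1)
    return NULL;

  /* The new block has EXIT as its only incoming edge and falls through
     to the old exit block, whose predecessor count is unchanged.  */
  spare->n_preds = 1;
  spare->succ = exit->dest;
  exit->dest = spare;
  return spare;
}
```

A NULL result plays the role of new_exit_block == NULL in the patch, where
the existing phi in the old exit block is used directly instead of a
duplicated one.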

Thanks,
- Tom


[-- Attachment #2: 0001-Insert-new-exit-block-only-when-needed-in-transform_.patch --]
[-- Type: text/x-patch, Size: 3396 bytes --]

Insert new exit block only when needed in transform_to_exit_first_loop_alt

2015-06-30  Tom de Vries  <tom@codesourcery.com>

	* tree-parloops.c (transform_to_exit_first_loop_alt): Insert new exit
	block only when needed.
---
 gcc/tree-parloops.c | 42 ++++++++++++++++++++++++++++--------------
 1 file changed, 28 insertions(+), 14 deletions(-)

diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
index 3d41275..6a49aa9 100644
--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c
@@ -1695,10 +1695,15 @@ transform_to_exit_first_loop_alt (struct loop *loop,
   /* Set the latch arguments of the new phis to ivtmp/sum_b.  */
   flush_pending_stmts (post_inc_edge);
 
-  /* Create a new empty exit block, inbetween the new loop header and the old
-     exit block.  The function separate_decls_in_region needs this block to
-     insert code that is active on loop exit, but not any other path.  */
-  basic_block new_exit_block = split_edge (exit);
+
+  basic_block new_exit_block = NULL;
+  if (!single_pred_p (exit->dest))
+    {
+      /* Create a new empty exit block, inbetween the new loop header and the
+	 old exit block.  The function separate_decls_in_region needs this block
+	 to insert code that is active on loop exit, but not any other path.  */
+      new_exit_block = split_edge (exit);
+    }
 
   /* Insert and register the reduction exit phis.  */
   for (gphi_iterator gsi = gsi_start_phis (exit_block);
@@ -1706,17 +1711,24 @@ transform_to_exit_first_loop_alt (struct loop *loop,
        gsi_next (&gsi))
     {
       gphi *phi = gsi.phi ();
+      gphi *nphi = NULL;
       tree res_z = PHI_RESULT (phi);
+      tree res_c;
 
-      /* Now that we have a new exit block, duplicate the phi of the old exit
-	 block in the new exit block to preserve loop-closed ssa.  */
-      edge succ_new_exit_block = single_succ_edge (new_exit_block);
-      edge pred_new_exit_block = single_pred_edge (new_exit_block);
-      tree res_y = copy_ssa_name (res_z, phi);
-      gphi *nphi = create_phi_node (res_y, new_exit_block);
-      tree res_c = PHI_ARG_DEF_FROM_EDGE (phi, succ_new_exit_block);
-      add_phi_arg (nphi, res_c, pred_new_exit_block, UNKNOWN_LOCATION);
-      add_phi_arg (phi, res_y, succ_new_exit_block, UNKNOWN_LOCATION);
+      if (new_exit_block != NULL)
+	{
+	  /* Now that we have a new exit block, duplicate the phi of the old
+	     exit block in the new exit block to preserve loop-closed ssa.  */
+	  edge succ_new_exit_block = single_succ_edge (new_exit_block);
+	  edge pred_new_exit_block = single_pred_edge (new_exit_block);
+	  tree res_y = copy_ssa_name (res_z, phi);
+	  nphi = create_phi_node (res_y, new_exit_block);
+	  res_c = PHI_ARG_DEF_FROM_EDGE (phi, succ_new_exit_block);
+	  add_phi_arg (nphi, res_c, pred_new_exit_block, UNKNOWN_LOCATION);
+	  add_phi_arg (phi, res_y, succ_new_exit_block, UNKNOWN_LOCATION);
+	}
+      else
+	res_c = PHI_ARG_DEF_FROM_EDGE (phi, exit);
 
       if (virtual_operand_p (res_z))
 	continue;
@@ -1724,7 +1736,9 @@ transform_to_exit_first_loop_alt (struct loop *loop,
       gimple *reduc_phi = SSA_NAME_DEF_STMT (res_c);
       struct reduction_info *red = reduction_phi (reduction_list, reduc_phi);
       if (red != NULL)
-	red->keep_res = nphi;
+	red->keep_res = (nphi != NULL
+			 ? nphi
+			 : phi);
     }
 
   /* We're going to cancel the loop at the end of gen_parallel_loop, but until
-- 
1.9.1



* [PATCH, 2/16] Make create_parallel_loop return void
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
  2015-11-09 15:44 ` [PATCH, 1/16] Insert new exit block only when needed in transform_to_exit_first_loop_alt Tom de Vries
@ 2015-11-09 15:45 ` Tom de Vries
  2015-11-11 10:50   ` Richard Biener
  2015-11-09 15:51 ` [PATCH, 3/16] Ignore reduction clause on kernels directive Tom de Vries
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 15:45 UTC
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 1618 bytes --]

This patch makes create_parallel_loop return void, since its result is
currently unused.

Thanks,
- Tom


[-- Attachment #2: 0002-Make-create_parallel_loop-return-void.patch --]
[-- Type: text/x-patch, Size: 1362 bytes --]

Make create_parallel_loop return void

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* tree-parloops.c (create_parallel_loop): Return void.
---
 gcc/tree-parloops.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
index 6a49aa9..17415a8 100644
--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c
@@ -1986,10 +1986,9 @@ transform_to_exit_first_loop (struct loop *loop,
 /* Create the parallel constructs for LOOP as described in gen_parallel_loop.
    LOOP_FN and DATA are the arguments of GIMPLE_OMP_PARALLEL.
    NEW_DATA is the variable that should be initialized from the argument
-   of LOOP_FN.  N_THREADS is the requested number of threads.  Returns the
-   basic block containing GIMPLE_OMP_PARALLEL tree.  */
+   of LOOP_FN.  N_THREADS is the requested number of threads.  */
 
-static basic_block
+static void
 create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
 		      tree new_data, unsigned n_threads, location_t loc)
 {
@@ -2162,8 +2161,6 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
   /* After the above dom info is hosed.  Re-compute it.  */
   free_dominance_info (CDI_DOMINATORS);
   calculate_dominance_info (CDI_DOMINATORS);
-
-  return paral_bb;
 }
 
 /* Generates code to execute the iterations of LOOP in N_THREADS
-- 
1.9.1



* [PATCH, 3/16] Ignore reduction clause on kernels directive
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
  2015-11-09 15:44 ` [PATCH, 1/16] Insert new exit block only when needed in transform_to_exit_first_loop_alt Tom de Vries
  2015-11-09 15:45 ` [PATCH, 2/16] Make create_parallel_loop return void Tom de Vries
@ 2015-11-09 15:51 ` Tom de Vries
  2015-11-24 12:25   ` [PING][PATCH, " Tom de Vries
  2015-11-09 16:10 ` [PATCH, 4/16] Implement -foffload-alias Tom de Vries
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 15:51 UTC
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener, Thomas Schwinge

[-- Attachment #1: Type: text/plain, Size: 1698 bytes --]

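As discussed here ( 
https://gcc.gnu.org/ml/gcc-patches/2015-11/msg00785.html ), the kernels 
directive does not allow the reduction clause.  This patch fixes that:
c_oacc_split_loop_clauses no longer copies OMP_CLAUSE_REDUCTION to both
constructs, but classifies it as a loop clause only.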
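
The splitting can be sketched with a toy clause list. The toy_ names below
are hypothetical stand-ins for GCC's tree-based clause chains, and only a
few clause kinds are modelled; the loop structure follows the shape visible
in the diff, with the reduction clause handled like any other loop clause.

```c
#include <assert.h>
#include <stddef.h>

/* Toy clause representation for illustration only.  */
enum toy_clause_code
{
  TOY_CLAUSE_GANG,
  TOY_CLAUSE_PRIVATE,
  TOY_CLAUSE_REDUCTION,
  TOY_CLAUSE_COPYIN
};

struct toy_clause
{
  enum toy_clause_code code;
  struct toy_clause *chain;
};

/* Split CLAUSES into loop clauses (returned) and all other clauses
   (stored in *NOT_LOOP_CLAUSES).  Each clause ends up on exactly one
   list; the reduction clause goes to the loop list and is not copied.  */
struct toy_clause *
toy_split_loop_clauses (struct toy_clause *clauses,
			struct toy_clause **not_loop_clauses)
{
  struct toy_clause *next, *loop_clauses = NULL;

  assert (not_loop_clauses != NULL);
  *not_loop_clauses = NULL;

  for (; clauses; clauses = next)
    {
      next = clauses->chain;

      switch (clauses->code)
	{
	  /* Loop clauses.  */
	case TOY_CLAUSE_GANG:
	case TOY_CLAUSE_PRIVATE:
	case TOY_CLAUSE_REDUCTION:
	  clauses->chain = loop_clauses;
	  loop_clauses = clauses;
	  break;

	  /* Parallel/kernels clauses.  */
	default:
	  clauses->chain = *not_loop_clauses;
	  *not_loop_clauses = clauses;
	  break;
	}
    }

  return loop_clauses;
}
```

Because each clause is pushed onto the front of its list, both result lists
come out in reverse order relative to the input.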

Thanks,
- Tom


[-- Attachment #2: 0003-Ignore-reduction-clause-on-kernels-directive.patch --]
[-- Type: text/x-patch, Size: 1306 bytes --]

Ignore reduction clause on kernels directive

2015-11-08  Tom de Vries  <tom@codesourcery.com>

	* c-omp.c (c_oacc_split_loop_clauses): Don't copy OMP_CLAUSE_REDUCTION,
	classify as loop clause.
---
 gcc/c-family/c-omp.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/gcc/c-family/c-omp.c b/gcc/c-family/c-omp.c
index 3e93b59..a3b99b2 100644
--- a/gcc/c-family/c-omp.c
+++ b/gcc/c-family/c-omp.c
@@ -867,7 +867,7 @@ c_omp_check_loop_iv_exprs (location_t stmt_loc, tree declv, tree decl,
 tree
 c_oacc_split_loop_clauses (tree clauses, tree *not_loop_clauses)
 {
-  tree next, loop_clauses, t;
+  tree next, loop_clauses;
 
   loop_clauses = *not_loop_clauses = NULL_TREE;
   for (; clauses ; clauses = next)
@@ -886,16 +886,11 @@ c_oacc_split_loop_clauses (tree clauses, tree *not_loop_clauses)
 	case OMP_CLAUSE_SEQ:
 	case OMP_CLAUSE_INDEPENDENT:
 	case OMP_CLAUSE_PRIVATE:
+	case OMP_CLAUSE_REDUCTION:
 	  OMP_CLAUSE_CHAIN (clauses) = loop_clauses;
 	  loop_clauses = clauses;
 	  break;
 
-	  /* Reductions belong in both constructs.  */
-	case OMP_CLAUSE_REDUCTION:
-	  t = copy_node (clauses);
-	  OMP_CLAUSE_CHAIN (t) = loop_clauses;
-	  loop_clauses = t;
-
 	  /* Parallel/kernels clauses.  */
 	default:
 	  OMP_CLAUSE_CHAIN (clauses) = *not_loop_clauses;
-- 
1.9.1



* [PATCH, 4/16] Implement -foffload-alias
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
                   ` (2 preceding siblings ...)
  2015-11-09 15:51 ` [PATCH, 3/16] Ignore reduction clause on kernels directive Tom de Vries
@ 2015-11-09 16:10 ` Tom de Vries
  2015-11-11 10:53   ` Richard Biener
  2015-11-09 16:31 ` [PATCH, 5/16] Add in_oacc_kernels_region in struct loop Tom de Vries
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 16:10 UTC
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 2652 bytes --]

This patch addresses the problem that once the offloading region has 
been split off from the original function, alias analysis can no longer 
use information available in the original function that would allow it 
to do a more precise analysis for the offloading function. [ At some 
point we could use -fipa-pta for that, as discussed in PR46032, but 
that's not feasible now. ]

The basic idea behind the patch is that for typical usage, the base 
pointers used in an offloaded region are non-aliasing. The patch works 
by adding restrict to the types of the fields used to pass data to an 
offloading region.
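
To make the idea concrete, here is a hand-written approximation of what an
outlined offload function and its data record might look like at the source
level. The names toy_omp_data and toy_offload_fn are assumptions for this
sketch; the compiler builds the real record internally, and its layout is
not shown in this message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the record used to pass data to an outlined
   offload function.  The restrict qualifiers on the pointer fields are
   the point: they state that the pointed-to objects do not overlap.  */
struct toy_omp_data
{
  const unsigned *restrict a;
  const unsigned *restrict b;
  unsigned *restrict c;
  size_t n;
};

/* Hypothetical outlined function.  With the restrict-qualified fields,
   alias analysis may assume that stores to data->c[i] do not change
   data->a[j] or data->b[j], even though the original function that set
   up the arrays is no longer visible here.  */
void
toy_offload_fn (struct toy_omp_data *data)
{
  assert (data != NULL);

  for (size_t i = 0; i < data->n; i++)
    data->c[i] = data->a[i] + data->b[i];
}
```

The caller is responsible for passing non-overlapping arrays; if the arrays
do overlap, the restrict assumption is violated and the behaviour is
undefined.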


The patch implements a new option
-foffload-alias=<none|pointer|all>.

The option -foffload-alias=none instructs the compiler to assume that
object references and pointer dereferences in an offload region do not
alias.

The option -foffload-alias=pointer instructs the compiler to assume that 
object references in an offload region do not alias.

The option -foffload-alias=all instructs the compiler to make no
assumptions about aliasing in offload regions.

The default value is -foffload-alias=none.
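
The three values and the default can be summarized in a small sketch. This
is not how GCC parses the option (that is generated from the common.opt
entry in the patch); toy_offload_alias and toy_parse_offload_alias are
hypothetical names used only to spell out the mapping described above.

```c
#include <assert.h>
#include <string.h>

/* Toy mirror of the three option values; the enumerator names are
   made up for this sketch.  */
enum toy_offload_alias
{
  TOY_OFFLOAD_ALIAS_ALL,	/* Make no non-aliasing assumptions.  */
  TOY_OFFLOAD_ALIAS_POINTER,	/* Object references do not alias.  */
  TOY_OFFLOAD_ALIAS_NONE	/* Object references and pointer
				   dereferences do not alias.  */
};

/* Map ARG, the text after "-foffload-alias=", to a value stored in *OUT.
   A NULL ARG models the option being absent and selects the default,
   "none".  Returns 1 on success and 0 for an unknown value.  */
int
toy_parse_offload_alias (const char *arg, enum toy_offload_alias *out)
{
  assert (out != NULL);

  if (arg == NULL || strcmp (arg, "none") == 0)
    *out = TOY_OFFLOAD_ALIAS_NONE;
  else if (strcmp (arg, "pointer") == 0)
    *out = TOY_OFFLOAD_ALIAS_POINTER;
  else if (strcmp (arg, "all") == 0)
    *out = TOY_OFFLOAD_ALIAS_ALL;
  else
    return 0;

  return 1;
}
```

Note that the default is the most aggressive setting, so code that does
rely on aliasing inside an offload region would have to ask for
-foffload-alias=all explicitly.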

Thanks,
- Tom


[-- Attachment #2: 0004-Implement-foffload-alias.patch --]
[-- Type: text/x-patch, Size: 9102 bytes --]

Implement -foffload-alias

2015-11-03  Tom de Vries  <tom@codesourcery.com>

	* common.opt (foffload-alias): New option.
	* flag-types.h (enum offload_alias): New enum.
	* omp-low.c (install_var_field): Handle flag_offload_alias.
	* doc/invoke.texi (@item Code Generation Options): Add -foffload-alias.
	(@item -foffload-alias): New item.

	* c-c++-common/goacc/kernels-loop-offload-alias-none.c: New test.
	* c-c++-common/goacc/kernels-loop-offload-alias-ptr.c: New test.
---
 gcc/common.opt                                     | 16 ++++++
 gcc/doc/invoke.texi                                | 11 ++++
 gcc/flag-types.h                                   |  7 +++
 gcc/omp-low.c                                      | 28 +++++++++-
 .../goacc/kernels-loop-offload-alias-none.c        | 61 ++++++++++++++++++++++
 .../goacc/kernels-loop-offload-alias-ptr.c         | 44 ++++++++++++++++
 6 files changed, 165 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-offload-alias-none.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-offload-alias-ptr.c

diff --git a/gcc/common.opt b/gcc/common.opt
index 961a1b6..7135b1a 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1735,6 +1735,22 @@ Enum(offload_abi) String(ilp32) Value(OFFLOAD_ABI_ILP32)
 EnumValue
 Enum(offload_abi) String(lp64) Value(OFFLOAD_ABI_LP64)
 
+foffload-alias=
+Common Joined RejectNegative Enum(offload_alias) Var(flag_offload_alias) Init(OFFLOAD_ALIAS_NONE)
+-foffload-alias=[all|pointer|none]     Assume non-aliasing in an offload region
+
+Enum
+Name(offload_alias) Type(enum offload_alias) UnknownError(unknown offload aliasing %qs)
+
+EnumValue
+Enum(offload_alias) String(all) Value(OFFLOAD_ALIAS_ALL)
+
+EnumValue
+Enum(offload_alias) String(pointer) Value(OFFLOAD_ALIAS_POINTER)
+
+EnumValue
+Enum(offload_alias) String(none) Value(OFFLOAD_ALIAS_NONE)
+
 fomit-frame-pointer
 Common Report Var(flag_omit_frame_pointer) Optimization
 When possible do not generate stack frames.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 2e5953b..6928efd 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -1143,6 +1143,7 @@ See S/390 and zSeries Options.
 -finstrument-functions-exclude-function-list=@var{sym},@var{sym},@dots{} @gol
 -finstrument-functions-exclude-file-list=@var{file},@var{file},@dots{} @gol
 -fno-common  -fno-ident @gol
+-foffload-alias=@r{[}none@r{|}pointer@r{|}all@r{]} @gol
 -fpcc-struct-return  -fpic  -fPIC -fpie -fPIE -fno-plt @gol
 -fno-jump-tables @gol
 -frecord-gcc-switches @gol
@@ -23852,6 +23853,16 @@ The options @option{-ftrapv} and @option{-fwrapv} override each other, so using
 using @option{-ftrapv} @option{-fwrapv} @option{-fno-wrapv} on the command-line
 results in @option{-ftrapv} being effective.
 
+@item -foffload-alias=@r{[}none@r{|}pointer@r{|}all@r{]}
+@opindex -foffload-alias
+The option @option{-foffload-alias=none} instructs the compiler to assume that
+object references and pointer dereferences in an offload region do not alias.
+The option @option{-foffload-alias=pointer} instructs the compiler to assume that
+object references in an offload region do not alias.  The option
+@option{-foffload-alias=all} instructs the compiler to make no assumptions about
+aliasing in offload regions.  The default value is
+@option{-foffload-alias=none}.
+
 @item -fexceptions
 @opindex fexceptions
 Enable exception handling.  Generates extra code needed to propagate
diff --git a/gcc/flag-types.h b/gcc/flag-types.h
index 6301cea..87b1677 100644
--- a/gcc/flag-types.h
+++ b/gcc/flag-types.h
@@ -293,5 +293,12 @@ enum gfc_convert
   GFC_FLAG_CONVERT_LITTLE
 };
 
+enum offload_alias
+{
+  OFFLOAD_ALIAS_ALL,
+  OFFLOAD_ALIAS_POINTER,
+  OFFLOAD_ALIAS_NONE
+};
+
 
 #endif /* ! GCC_FLAG_TYPES_H */
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 45d1927..d052c13 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -1371,6 +1371,14 @@ install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
   tree field, type, sfield = NULL_TREE;
   splay_tree_key key = (splay_tree_key) var;
 
+  /* We use flag_offload_alias only for the oacc kernels region for the
+     moment.  */
+  bool offload_alias_p = is_oacc_kernels (ctx);
+  bool no_alias_var_p
+    = offload_alias_p && flag_offload_alias != OFFLOAD_ALIAS_ALL;
+  bool no_alias_ptr_p
+    = offload_alias_p && flag_offload_alias == OFFLOAD_ALIAS_NONE;
+
   if ((mask & 8) != 0)
     {
       key = (splay_tree_key) &DECL_UID (var);
@@ -1387,10 +1395,26 @@ install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
   if (mask & 4)
     {
       gcc_assert (TREE_CODE (type) == ARRAY_TYPE);
-      type = build_pointer_type (build_pointer_type (type));
+
+      type = build_pointer_type (type);
+      if (no_alias_var_p)
+	type = build_qualified_type (type, TYPE_QUAL_RESTRICT);
+
+      type = build_pointer_type (type);
+      if (no_alias_var_p)
+	type = build_qualified_type (type, TYPE_QUAL_RESTRICT);
     }
   else if (by_ref)
-    type = build_pointer_type (type);
+    {
+      if (no_alias_ptr_p
+	  && POINTER_TYPE_P (type))
+	type = build_qualified_type (type, TYPE_QUAL_RESTRICT);
+
+      type = build_pointer_type (type);
+
+      if (no_alias_var_p)
+	type = build_qualified_type (type, TYPE_QUAL_RESTRICT);
+    }
   else if ((mask & 3) == 1 && is_reference (var))
     type = TREE_TYPE (type);
 
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-offload-alias-none.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-offload-alias-none.c
new file mode 100644
index 0000000..79d8daa
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-offload-alias-none.c
@@ -0,0 +1,61 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+/* { dg-additional-options "-fdump-tree-alias-all" } */
+/* { dg-additional-options "-foffload-alias=none" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+static void
+foo (unsigned int *a, unsigned int *b, unsigned int *c)
+{
+  for (COUNTERTYPE i = 0; i < N; i++)
+    a[i] = i * 2;
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+}
+
+int
+main (void)
+{
+  unsigned int *a;
+  unsigned int *b;
+  unsigned int *c;
+
+  a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+  foo (a, b, c);
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*\\._omp_fn\\.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "clique 1 base 1" 3 "alias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 2" 1 "alias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 3" 1 "alias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 4" 1 "alias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 5" 1 "alias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 6" 1 "alias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 7" 1 "alias" } } */
+/* { dg-final { scan-tree-dump-times "(?n)clique .* base .*" 9 "alias" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-offload-alias-ptr.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-offload-alias-ptr.c
new file mode 100644
index 0000000..de4f45a
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-offload-alias-ptr.c
@@ -0,0 +1,44 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+/* { dg-additional-options "-fdump-tree-alias-all" } */
+/* { dg-additional-options "-foffload-alias=pointer" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+unsigned int a[N];
+unsigned int b[N];
+unsigned int c[N];
+
+int
+main (void)
+{
+  for (COUNTERTYPE i = 0; i < N; i++)
+    a[i] = i * 2;
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  return 0;
+}
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "clique 1 base 1" 3 "alias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 2" 1 "alias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 3" 1 "alias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 4" 1 "alias" } } */
+/* { dg-final { scan-tree-dump-times "(?n)clique .* base .*" 6 "alias" } } */
-- 
1.9.1


^ permalink raw reply	[flat|nested] 133+ messages in thread

* [PATCH, 5/16] Add in_oacc_kernels_region in struct loop
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
                   ` (3 preceding siblings ...)
  2015-11-09 16:10 ` [PATCH, 4/16] Implement -foffload-alias Tom de Vries
@ 2015-11-09 16:31 ` Tom de Vries
  2015-11-11 10:57   ` Richard Biener
  2015-11-09 17:39 ` [PATCH, 6/16] Add pass_oacc_kernels Tom de Vries
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 16:31 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 1976 bytes --]

On 09/11/15 16:35, Tom de Vries wrote:
> I'll post the individual patches in reply to this message.

This patch adds and initializes the in_oacc_kernels_region field in
struct loop.

The field is used to signal to subsequent passes that we're dealing with
a loop in a kernels region that we're trying to parallelize.

Note that we do not parallelize kernels regions with more than one loop
nest. [ In general, kernels regions with more than one loop nest should
be split up into separate kernels regions, but that's not supported at
the moment. ]
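
The marking logic can be sketched on a toy model as follows. This is only an illustration under assumptions: blocks are plain indices, idom[] is a hand-written immediate-dominator array, and toy_loop, dominated_by, block_in_region and mark_loops are made-up names, not GCC's data structures or API. A block counts as part of the region if it is dominated by the region entry but not by the region exit, and loops are only marked if exactly one outermost loop has its header in the region:

```cpp
#include <vector>

// Toy model, not GCC code.  idom[b] is the immediate dominator of block b;
// the root block is its own immediate dominator.
struct toy_loop
{
  int header;         // index of the loop header block
  bool is_outermost;  // true if the loop is not nested in another loop
  bool in_region;     // analogue of in_oacc_kernels_region
};

// Walk up the dominator tree from BB and report whether DOM is reached.
static bool dominated_by (const std::vector<int> &idom, int bb, int dom)
{
  for (;;)
    {
      if (bb == dom)
        return true;
      if (idom[bb] == bb)
        return false;
      bb = idom[bb];
    }
}

// In the region: dominated by the entry, but not by the exit (if any).
static bool block_in_region (const std::vector<int> &idom, int bb,
                             int entry, int exit)
{
  if (!dominated_by (idom, bb, entry))
    return false;
  return exit < 0 || !dominated_by (idom, bb, exit);
}

// Mark the loops in the region, unless it holds more than one loop nest.
// Returns true if loops were marked.
bool mark_loops (const std::vector<int> &idom, int entry, int exit,
                 std::vector<toy_loop> &loops)
{
  unsigned nr_outer_loops = 0;
  for (const toy_loop &l : loops)
    if (l.is_outermost && block_in_region (idom, l.header, entry, exit))
      nr_outer_loops++;
  if (nr_outer_loops != 1)
    return false;

  for (toy_loop &l : loops)
    if (block_in_region (idom, l.header, entry, exit))
      l.in_region = true;
  return true;
}
```

Passing a negative exit index plays the role of the region_exit == NULL case in the patch.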

Thanks,
- Tom


[-- Attachment #2: 0005-Add-in_oacc_kernels_region-in-struct-loop.patch --]
[-- Type: text/x-patch, Size: 3333 bytes --]

Add in_oacc_kernels_region in struct loop

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* cfgloop.h (struct loop): Add in_oacc_kernels_region field.
	* omp-low.c (mark_loops_in_oacc_kernels_region): New function.
	(expand_omp_target): Call mark_loops_in_oacc_kernels_region.
---
 gcc/cfgloop.h |  3 +++
 gcc/omp-low.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+)

diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 6af6893..ee73bf9 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -191,6 +191,9 @@ struct GTY ((chain_next ("%h.next"))) loop {
   /* True if we should try harder to vectorize this loop.  */
   bool force_vectorize;
 
+  /* True if the loop is part of an oacc kernels region.  */
+  bool in_oacc_kernels_region;
+
   /* For SIMD loops, this is a unique identifier of the loop, referenced
      by IFN_GOMP_SIMD_VF, IFN_GOMP_SIMD_LANE and IFN_GOMP_SIMD_LAST_LANE
      builtins.  */
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index d052c13..7121d73 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -12429,6 +12429,61 @@ get_oacc_ifn_dim_arg (const gimple *stmt)
   return (int) axis;
 }
 
+/* Mark the loops inside the kernels region starting at REGION_ENTRY and ending
+   at REGION_EXIT.  */
+
+static void
+mark_loops_in_oacc_kernels_region (basic_block region_entry,
+				   basic_block region_exit)
+{
+  bitmap dominated_bitmap = BITMAP_GGC_ALLOC ();
+  bitmap excludes_bitmap = BITMAP_GGC_ALLOC ();
+  unsigned di;
+  basic_block bb;
+
+  bitmap_clear (dominated_bitmap);
+  bitmap_clear (excludes_bitmap);
+
+  /* Get all the blocks dominated by the region entry.  That will include the
+     entire region.  */
+  vec<basic_block> dominated
+    = get_all_dominated_blocks (CDI_DOMINATORS, region_entry);
+  FOR_EACH_VEC_ELT (dominated, di, bb)
+      bitmap_set_bit (dominated_bitmap, bb->index);
+
+  /* Exclude all the blocks which are not in the region: the blocks dominated by
+     the region exit.  */
+  if (region_exit != NULL)
+    {
+      vec<basic_block> excludes
+	= get_all_dominated_blocks (CDI_DOMINATORS, region_exit);
+      FOR_EACH_VEC_ELT (excludes, di, bb)
+	bitmap_set_bit (excludes_bitmap, bb->index);
+    }
+
+  /* Don't parallelize the kernels region if it contains more than one outer
+     loop.  */
+  unsigned int nr_outer_loops = 0;
+  struct loop *loop;
+  FOR_EACH_LOOP (loop, 0)
+    {
+      if (loop_outer (loop) != current_loops->tree_root)
+	continue;
+
+      if (bitmap_bit_p (dominated_bitmap, loop->header->index)
+	  && !bitmap_bit_p (excludes_bitmap, loop->header->index))
+	nr_outer_loops++;
+    }
+  if (nr_outer_loops != 1)
+    return;
+
+  /* Mark the loops in the region.  */
+  FOR_EACH_LOOP (loop, 0)
+    if (bitmap_bit_p (dominated_bitmap, loop->header->index)
+	&& !bitmap_bit_p (excludes_bitmap, loop->header->index))
+      loop->in_oacc_kernels_region = true;
+}
+
 /* Expand the GIMPLE_OMP_TARGET starting at REGION.  */
 
 static void
@@ -12483,6 +12538,9 @@ expand_omp_target (struct omp_region *region)
   entry_bb = region->entry;
   exit_bb = region->exit;
 
+  if (gimple_omp_target_kind (entry_stmt) == GF_OMP_TARGET_KIND_OACC_KERNELS)
+    mark_loops_in_oacc_kernels_region (region->entry, region->exit);
+
   if (offloaded)
     {
       unsigned srcidx, dstidx, num;
-- 
1.9.1


^ permalink raw reply	[flat|nested] 133+ messages in thread

* [PATCH, 6/16] Add pass_oacc_kernels
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
                   ` (4 preceding siblings ...)
  2015-11-09 16:31 ` [PATCH, 5/16] Add in_oacc_kernels_region in struct loop Tom de Vries
@ 2015-11-09 17:39 ` Tom de Vries
  2015-11-11 10:59   ` Richard Biener
  2016-02-05 12:06   ` Use plain -fopenacc to enable OpenACC kernels processing (was: [PATCH, 6/16] Add pass_oacc_kernels) Thomas Schwinge
  2015-11-09 18:14 ` [PATCH, 7/16] Add pass_dominator_oacc_kernels Tom de Vries
                   ` (9 subsequent siblings)
  15 siblings, 2 replies; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 17:39 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 3473 bytes --]

On 09/11/15 16:35, Tom de Vries wrote:
> I'll post the individual patches in reply to this message.

This patch adds a pass group pass_oacc_kernels (which will be added to
the pass list as a whole in patch 10).

At the moment, the parallelization behaviour for the kernels region is
controlled by flag_tree_parallelize_loops, which is also used to control
generic auto-parallelization by autopar using omp. That is not ideal, and
we may want a separate flag (or param) to control the behaviour for oacc
kernels, e.g. -foacc-kernels-gang-parallelize=<n>. I'm open to suggestions.

The purpose of the pass group as a whole is to massage the offloaded
function into a shape that parloops can deal with, and then run
parloops on it.

Consider a testcase with a reduction, and a loop counter declared 
outside the offload region:
...
unsigned int a[n];

unsigned int
foo (void)
{
   int i;
   unsigned int sum = 1;

#pragma acc kernels copyin (a[0:n]) copy (sum)
   {
     for (i = 0; i < n; ++i)
       sum += a[i];
   }

   return sum;
}
...

After ealias, the loop body looks like this:
...
   <bb 5>:
   _8 = *.omp_data_i_3(D).a;
   _9 = *.omp_data_i_3(D).i;
   _10 = *_9;
   _11 = *_8[_10];
   _12 = *.omp_data_i_3(D).sum;
   sum.0_13 = *_12;
   sum.1_14 = _11 + sum.0_13;
   _15 = *.omp_data_i_3(D).sum;
   *_15 = sum.1_14;
   _17 = *.omp_data_i_3(D).i;
   _18 = *_17;
   _19 = *.omp_data_i_3(D).i;
   _20 = _18 + 1;
   *_19 = _20;
   goto <bb 6>;
...
In other words, the iteration variable is in memory, as is the reduction 
variable, and the body contains lots of loop invariant loads.

At the end of the pass group, just before parloops, the body has been 
rewritten to have a local iteration variable and a local reduction 
variable, and all the loop invariant loads have been moved out of the loop:
...
   <bb 4>:
   # _27 = PHI <0(2), _20(5)>
   # D__lsm.7_28 = PHI <D__lsm.7_29(2), sum.1_14(5)>
   _11 = *_8[_27];
   sum.1_14 = _11 + D__lsm.7_28;
   _20 = _27 + 1;
   if (_20 <= 9999)
     goto <bb 5>;
   else
     goto <bb 3>;
...
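
A hypothetical source-level analogue of the two dumps may make the rewrite easier to follow. This is only a sketch: struct omp_data, body_before and body_after are made-up names that mimic the shape of the .omp_data_i accesses, not what the compiler generates. In the first function the counter and the sum live in memory and every access goes through the data struct. In the second the invariant loads are hoisted and both variables are locals that are written back once after the loop:

```cpp
// Hypothetical analogue of the offloaded function's data struct; the field
// names follow the dump, but this is not compiler-generated code.
struct omp_data
{
  const unsigned *a;  // array being summed
  int *i;             // iteration variable, living in memory
  unsigned *sum;      // reduction variable, living in memory
};

// Shape after ealias: every iteration reloads the pointers and goes
// through memory for both the counter and the sum.
void body_before (omp_data *d, int n)
{
  for (*d->i = 0; *d->i < n; ++*d->i)
    *d->sum += d->a[*d->i];
}

// Shape just before parloops: invariant loads hoisted, local counter and
// local sum, results stored back once after the loop.
void body_after (omp_data *d, int n)
{
  const unsigned *a = d->a;
  unsigned sum = *d->sum;
  int i;
  for (i = 0; i < n; ++i)
    sum += a[i];
  *d->sum = sum;
  *d->i = i;
}
```

In this sketch the two functions only agree as long as the stores through i and sum do not overlap the array a.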

Thanks,
- Tom


[-- Attachment #2: 0006-Add-pass_oacc_kernels.patch --]
[-- Type: text/x-patch, Size: 2986 bytes --]

Add pass_oacc_kernels

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* tree-pass.h (make_pass_oacc_kernels): Declare.
	* tree-ssa-loop.c (gate_oacc_kernels): New static function.
	(pass_data_oacc_kernels): New pass_data.
	(class pass_oacc_kernels): New pass.
	(make_pass_oacc_kernels): New function.
---
 gcc/tree-pass.h     |  1 +
 gcc/tree-ssa-loop.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 66 insertions(+)

diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 49e22a9..4ed8da6 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -463,6 +463,7 @@ extern gimple_opt_pass *make_pass_strength_reduction (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_vtable_verify (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ubsan (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_sanopt (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_oacc_kernels (gcc::context *ctxt);
 
 /* IPA Passes */
 extern simple_ipa_opt_pass *make_pass_ipa_lower_emutls (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index 8ecd140..b51cac2 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -35,6 +35,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-inline.h"
 #include "tree-scalar-evolution.h"
 #include "tree-vectorizer.h"
+#include "omp-low.h"
 
 
 /* A pass making sure loops are fixed up.  */
@@ -141,6 +142,70 @@ make_pass_tree_loop (gcc::context *ctxt)
   return new pass_tree_loop (ctxt);
 }
 
+/* Gate for oacc kernels pass group.  */
+
+static bool
+gate_oacc_kernels (function *fn)
+{
+  if (flag_tree_parallelize_loops <= 1)
+    return false;
+
+  tree oacc_function_attr = get_oacc_fn_attrib (fn->decl);
+  if (oacc_function_attr == NULL_TREE)
+    return false;
+
+  tree val = TREE_VALUE (oacc_function_attr);
+  while (val != NULL_TREE && TREE_VALUE (val) == NULL_TREE)
+    val = TREE_CHAIN (val);
+
+  if (val != NULL_TREE)
+    return false;
+
+  struct loop *loop;
+  FOR_EACH_LOOP (loop, 0)
+    if (loop->in_oacc_kernels_region)
+      return true;
+
+  return false;
+}
+
+/* The oacc kernels superpass.  */
+
+namespace {
+
+const pass_data pass_data_oacc_kernels =
+{
+  GIMPLE_PASS, /* type */
+  "oacc_kernels", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_TREE_LOOP, /* tv_id */
+  PROP_cfg, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_oacc_kernels : public gimple_opt_pass
+{
+public:
+  pass_oacc_kernels (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_oacc_kernels, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *fn) { return gate_oacc_kernels (fn); }
+
+}; // class pass_oacc_kernels
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_oacc_kernels (gcc::context *ctxt)
+{
+  return new pass_oacc_kernels (ctxt);
+}
+
 /* The no-loop superpass.  */
 
 namespace {
-- 
1.9.1


^ permalink raw reply	[flat|nested] 133+ messages in thread

* [PATCH, 7/16] Add pass_dominator_oacc_kernels
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
                   ` (5 preceding siblings ...)
  2015-11-09 17:39 ` [PATCH, 6/16] Add pass_oacc_kernels Tom de Vries
@ 2015-11-09 18:14 ` Tom de Vries
  2015-11-11 11:05   ` Richard Biener
  2015-11-09 18:34 ` [PATCH, 8/16] Add pass_ch_oacc_kernels Tom de Vries
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 18:14 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 2069 bytes --]

On 09/11/15 16:35, Tom de Vries wrote:
> I'll post the individual patches in reply to this message.

This patch adds pass_dominator_oacc_kernels, to be used in the kernels
pass group. (We may as well call it pass_dominator_no_peel_loop_headers,
since it doesn't do anything oacc-kernels-specific.)

The reason I'm adding a new pass instead of using pass_dominator is that 
pass_dominator uses first_pass_instance. So adding a pass_dominator 
instance A before a pass_dominator instance B has the unexpected 
consequence that it may change the behaviour of instance B. I've filed 
PR68247 - "Remove pass_first_instance" to note this issue.
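
The shape of the refactoring, reduced to a toy: the shared execute body lives in a base class, and the only variation point is a virtual hook that decides whether loop headers may be peeled. The class names below are made up for illustration, the first_instance constructor argument stands in for first_pass_instance, and thread_jumps is a placeholder that just returns its argument:

```cpp
// Toy sketch of the hook pattern; not GCC code.
class toy_dominator_base
{
public:
  virtual ~toy_dominator_base () {}

  // Shared body.  The hook's answer is passed on to the jump threader; the
  // placeholder threader simply returns it so that it can be observed.
  bool execute () { return thread_jumps (may_peel_loop_headers_p ()); }

protected:
  virtual bool may_peel_loop_headers_p () { return true; }

private:
  static bool thread_jumps (bool may_peel) { return may_peel; }
};

// Analogue of pass_dominator: the answer depends on which instance runs.
class toy_dominator : public toy_dominator_base
{
public:
  explicit toy_dominator (bool first_instance)
    : m_first_instance (first_instance)
  {}

protected:
  bool may_peel_loop_headers_p () override { return m_first_instance; }

private:
  bool m_first_instance;
};

// Analogue of pass_dominator_oacc_kernels: never peels, independent of how
// many other instances exist in the pass list.
class toy_dominator_no_peel : public toy_dominator_base
{
protected:
  bool may_peel_loop_headers_p () override { return false; }
};
```

The point of the separate subclass is that adding an instance of it to the pass list does not change what any toy_dominator instance answers.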

Thanks,
- Tom


[-- Attachment #2: 0007-Add-pass_dominator_oacc_kernels.patch --]
[-- Type: text/x-patch, Size: 4482 bytes --]

Add pass_dominator_oacc_kernels

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* tree-pass.h (make_pass_dominator_oacc_kernels): Declare.
	* tree-ssa-dom.c (class dominator_base): New class.  Factor out of ...
	(class pass_dominator): ... here.
	(dominator_base::may_peel_loop_headers_p)
	(pass_dominator::may_peel_loop_headers_p): New function.
	(pass_dominator_oacc_kernels): New pass.
	(make_pass_dominator_oacc_kernels): New function.
	(dominator_base::execute): Use may_peel_loop_headers_p.
---
 gcc/tree-pass.h    |  1 +
 gcc/tree-ssa-dom.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 53 insertions(+), 5 deletions(-)

diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 4ed8da6..2825aea 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -395,6 +395,7 @@ extern gimple_opt_pass *make_pass_build_ssa (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_build_alias (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_build_ealias (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_dominator (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_dominator_oacc_kernels (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_dce (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_cd_dce (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_call_cdce (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-dom.c b/gcc/tree-ssa-dom.c
index 3887bbe1..e4ff63a 100644
--- a/gcc/tree-ssa-dom.c
+++ b/gcc/tree-ssa-dom.c
@@ -519,6 +519,19 @@ private:
 
 namespace {
 
+class dominator_base : public gimple_opt_pass
+{
+ protected:
+  dominator_base (pass_data data, gcc::context *ctxt)
+    : gimple_opt_pass (data, ctxt)
+  {}
+
+  unsigned int execute (function *);
+
+ protected:
+  virtual bool may_peel_loop_headers_p (void) { return true; }
+}; // class dominator_base
+
 const pass_data pass_data_dominator =
 {
   GIMPLE_PASS, /* type */
@@ -532,22 +545,23 @@ const pass_data pass_data_dominator =
   ( TODO_cleanup_cfg | TODO_update_ssa ), /* todo_flags_finish */
 };
 
-class pass_dominator : public gimple_opt_pass
+class pass_dominator : public dominator_base
 {
 public:
   pass_dominator (gcc::context *ctxt)
-    : gimple_opt_pass (pass_data_dominator, ctxt)
+    : dominator_base (pass_data_dominator, ctxt)
   {}
 
   /* opt_pass methods: */
   opt_pass * clone () { return new pass_dominator (m_ctxt); }
   virtual bool gate (function *) { return flag_tree_dom != 0; }
-  virtual unsigned int execute (function *);
 
+ protected:
+  virtual bool may_peel_loop_headers_p (void) { return first_pass_instance; }
 }; // class pass_dominator
 
 unsigned int
-pass_dominator::execute (function *fun)
+dominator_base::execute (function *fun)
 {
   memset (&opt_stats, 0, sizeof (opt_stats));
 
@@ -619,7 +633,7 @@ pass_dominator::execute (function *fun)
   free_all_edge_infos ();
 
   /* Thread jumps, creating duplicate blocks as needed.  */
-  cfg_altered |= thread_through_all_blocks (first_pass_instance);
+  cfg_altered |= thread_through_all_blocks (may_peel_loop_headers_p ());
 
   if (cfg_altered)
     free_dominance_info (CDI_DOMINATORS);
@@ -700,6 +714,34 @@ pass_dominator::execute (function *fun)
   return 0;
 }
 
+const pass_data pass_data_dominator_oacc_kernels =
+{
+  GIMPLE_PASS, /* type */
+  "dom_oacc_kernels", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_TREE_SSA_DOMINATOR_OPTS, /* tv_id */
+  ( PROP_cfg | PROP_ssa ), /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  ( TODO_cleanup_cfg | TODO_update_ssa ), /* todo_flags_finish */
+};
+
+class pass_dominator_oacc_kernels : public dominator_base
+{
+public:
+  pass_dominator_oacc_kernels (gcc::context *ctxt)
+    : dominator_base (pass_data_dominator_oacc_kernels, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  opt_pass * clone () { return new pass_dominator_oacc_kernels (m_ctxt); }
+  virtual bool gate (function *) { return true; }
+
+ protected:
+  virtual bool may_peel_loop_headers_p (void) { return false; }
+}; // class pass_dominator_oacc_kernels
+
 } // anon namespace
 
 gimple_opt_pass *
@@ -708,6 +750,11 @@ make_pass_dominator (gcc::context *ctxt)
   return new pass_dominator (ctxt);
 }
 
+gimple_opt_pass *
+make_pass_dominator_oacc_kernels (gcc::context *ctxt)
+{
+  return new pass_dominator_oacc_kernels (ctxt);
+}
 
 /* Given a conditional statement CONDSTMT, convert the
    condition to a canonical form.  */
-- 
1.9.1


^ permalink raw reply	[flat|nested] 133+ messages in thread

* [PATCH, 8/16] Add pass_ch_oacc_kernels
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
                   ` (6 preceding siblings ...)
  2015-11-09 18:14 ` [PATCH, 7/16] Add pass_dominator_oacc_kernels Tom de Vries
@ 2015-11-09 18:34 ` Tom de Vries
  2015-11-11 20:29   ` Tom de Vries
  2015-11-09 19:53 ` [PATCH, 9/16] Add pass_parallelize_loops_oacc_kernels Tom de Vries
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 18:34 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 2076 bytes --]

On 09/11/15 16:35, Tom de Vries wrote:
> I'll post the individual patches in reply to this message.

This patch adds a pass pass_ch_oacc_kernels, which is like pass_ch, but
only runs for loops with in_oacc_kernels_region set.

[ But... thinking about it a bit more, I think that we could use a
regular pass_ch instead. We only use the kernels pass group for a single
loop nest in a kernels region, and we mark all the loops in the loop
nest with in_oacc_kernels_region. So I think that the
in_oacc_kernels_region test in pass_ch_oacc_kernels::process_loop_p
always evaluates to true. ]

So, I'll try to confirm with retesting that we can drop this patch.

Thanks,
- Tom


[-- Attachment #2: 0008-Add-pass_ch_oacc_kernels.patch --]
[-- Type: text/x-patch, Size: 3397 bytes --]

Add pass_ch_oacc_kernels

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* tree-pass.h (make_pass_ch_oacc_kernels): Declare.
	* tree-ssa-loop-ch.c (pass_ch::pass_ch (pass_data, gcc::context)): New
	constructor.
	(pass_data_ch_oacc_kernels): New pass_data.
	(class pass_ch_oacc_kernels): New pass.
	(pass_ch_oacc_kernels::process_loop_p): New function.
	(make_pass_ch_oacc_kernels): New function.
---
 gcc/tree-pass.h        |  1 +
 gcc/tree-ssa-loop-ch.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 2825aea..f95a820 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -389,6 +389,7 @@ extern gimple_opt_pass *make_pass_iv_optimize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_loop_done (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ch_vect (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_ch_oacc_kernels (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ccp (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_phi_only_cprop (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_build_ssa (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-loop-ch.c b/gcc/tree-ssa-loop-ch.c
index 7e618bf..8bf47fe 100644
--- a/gcc/tree-ssa-loop-ch.c
+++ b/gcc/tree-ssa-loop-ch.c
@@ -33,6 +33,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-inline.h"
 #include "tree-ssa-scopedtables.h"
 #include "tree-ssa-threadedge.h"
+#include "omp-low.h"
 
 /* Duplicates headers of loops if they are small enough, so that the statements
    in the loop body are always executed when the loop is entered.  This
@@ -124,7 +125,7 @@ do_while_loop_p (struct loop *loop)
 
 namespace {
 
-/* Common superclass for both header-copying phases.  */
+/* Common superclass for header-copying phases.  */
 class ch_base : public gimple_opt_pass
 {
   protected:
@@ -159,6 +160,10 @@ public:
     : ch_base (pass_data_ch, ctxt)
   {}
 
+  pass_ch (pass_data data, gcc::context *ctxt)
+    : ch_base (data, ctxt)
+  {}
+
   /* opt_pass methods: */
   virtual bool gate (function *) { return flag_tree_ch != 0; }
   
@@ -414,3 +419,50 @@ make_pass_ch (gcc::context *ctxt)
 {
   return new pass_ch (ctxt);
 }
+
+namespace {
+
+const pass_data pass_data_ch_oacc_kernels =
+{
+  GIMPLE_PASS, /* type */
+  "ch_oacc_kernels", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_TREE_CH, /* tv_id */
+  ( PROP_cfg | PROP_ssa ), /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_cleanup_cfg, /* todo_flags_finish */
+};
+
+class pass_ch_oacc_kernels : public pass_ch
+{
+public:
+  pass_ch_oacc_kernels (gcc::context *ctxt)
+    : pass_ch (pass_data_ch_oacc_kernels, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *) { return true; }
+
+protected:
+  /* ch_base method: */
+  virtual bool process_loop_p (struct loop *loop);
+}; // class pass_ch_oacc_kernels
+
+} // anon namespace
+
+bool
+pass_ch_oacc_kernels::process_loop_p (struct loop *loop)
+{
+  if (!loop->in_oacc_kernels_region)
+    return false;
+
+  return pass_ch::process_loop_p (loop);
+}
+
+gimple_opt_pass *
+make_pass_ch_oacc_kernels (gcc::context *ctxt)
+{
+  return new pass_ch_oacc_kernels (ctxt);
+}
-- 
1.9.1


^ permalink raw reply	[flat|nested] 133+ messages in thread

* [PATCH, 9/16] Add pass_parallelize_loops_oacc_kernels
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
                   ` (7 preceding siblings ...)
  2015-11-09 18:34 ` [PATCH, 8/16] Add pass_ch_oacc_kernels Tom de Vries
@ 2015-11-09 19:53 ` Tom de Vries
  2015-11-16 11:59   ` Tom de Vries
  2015-11-09 19:59 ` [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def Tom de Vries
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 19:53 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 3122 bytes --]

On 09/11/15 16:35, Tom de Vries wrote:
> Hi,
>
> this patch series for stage1 trunk adds support to:
> - parallelize oacc kernels regions using parloops, and
> - map the loops onto the oacc gang dimension.
>
> The patch series contains these patches:
>
>       1    Insert new exit block only when needed in
>          transform_to_exit_first_loop_alt
>       2    Make create_parallel_loop return void
>       3    Ignore reduction clause on kernels directive
>       4    Implement -foffload-alias
>       5    Add in_oacc_kernels_region in struct loop
>       6    Add pass_oacc_kernels
>       7    Add pass_dominator_oacc_kernels
>       8    Add pass_ch_oacc_kernels
>       9    Add pass_parallelize_loops_oacc_kernels
>      10    Add pass_oacc_kernels pass group in passes.def
>      11    Update testcases after adding kernels pass group
>      12    Handle acc loop directive
>      13    Add c-c++-common/goacc/kernels-*.c
>      14    Add gfortran.dg/goacc/kernels-*.f95
>      15    Add libgomp.oacc-c-c++-common/kernels-*.c
>      16    Add libgomp.oacc-fortran/kernels-*.f95
>
> The first 9 patches are more or less independent, but patches 10-16 are
> intended to be committed at the same time.
>
> Bootstrapped and reg-tested on x86_64.
>
> Build and reg-tested with nvidia accelerator, in combination with a
> patch that enables accelerator testing (which is submitted at
> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>
> I'll post the individual patches in reply to this message.

This patch adds pass_parallelize_loops_oacc_kernels.

There are a number of things we do differently in parloops for oacc kernels:
- in normal parloops, we generate code to choose between a parallel
   version of the loop and a sequential (low iteration count) version.
   Since the code in an oacc kernels region is supposed to run on the
   accelerator anyway, we skip this check and don't add a low iteration
   count loop.
- in normal parloops, we generate a #pragma omp parallel /
   GIMPLE_OMP_RETURN pair to delimit the region which we will split off
   into a thread function. Since the oacc kernels region is already
   split off, we don't add this pair.
- we indicate the parallelization factor by setting the oacc function
   attributes.
- we generate a #pragma acc loop instead of a #pragma omp for, and
   we add the gang clause.
- in normal parloops, we rewrite the variable accesses in the loop
   into accesses relative to a thread function parameter. For the
   oacc kernels region, that rewrite has already been done at omp-lower,
   so we skip this.
- we need to ensure that the entire kernels region can be run in
   parallel. The loop independence check is already present, so for oacc
   kernels we add a check between blocks outside the loop and the entire
   region.
- we guard stores in the blocks outside the loop with gang_pos == 0.
   There's no need for every gang to write to the same location; we can
   do this in just one gang. (Typically this is the write of the final
   value of the iteration variable, if that one is copied back to the
   host.)
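
The last item can be illustrated with a minimal host-side simulation
(not GCC code, and not what the pass emits). The names region_result and
run_region are hypothetical, the gangs are modelled by a sequential outer
loop, and the cyclic distribution of iterations is only an assumption for
the sake of the example. The point is that the loop body is shared among
the gangs, while the store outside the loop happens once, in gang 0.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of what a kernels region leaves behind.
struct region_result
{
  std::vector<int> a;   // array written by the parallelized loop
  int final_i;          // copied-back final value of the iteration variable
  int final_i_stores;   // how many times final_i was stored
};

// Simulate the region for N iterations spread over NUM_GANGS gangs.
region_result
run_region (int n, int num_gangs)
{
  region_result r;
  r.a.assign (n, 0);
  r.final_i = -1;
  r.final_i_stores = 0;

  for (int gang_pos = 0; gang_pos < num_gangs; gang_pos++)
    {
      // Inside the loop: iterations are distributed over the gangs
      // (cyclically here, which is an assumption of this sketch).
      for (int i = gang_pos; i < n; i += num_gangs)
        r.a[i] = 2 * i;

      // Outside the loop: every gang would store the same value, so the
      // store is guarded and only gang 0 performs it.
      if (gang_pos == 0)
        {
          r.final_i = n;
          r.final_i_stores++;
        }
    }

  return r;
}
```

For example, run_region (10, 4) fills all ten elements of a, and final_i
is stored exactly once regardless of the number of gangs.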

Thanks,
- Tom


[-- Attachment #2: 0009-Add-pass_parallelize_loops_oacc_kernels.patch --]
[-- Type: text/x-patch, Size: 30668 bytes --]

Add pass_parallelize_loops_oacc_kernels

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* omp-low.c (set_oacc_fn_attrib): Make extern.
	(expand_omp_atomic_fetch_op): Release defs of update stmt.
	* omp-low.h (set_oacc_fn_attrib): Declare.
	* tree-parloops.c (struct reduction_info): Add reduc_addr field.
	(create_call_for_reduction_1): Handle case that reduc_addr is non-NULL.
	(create_parallel_loop, gen_parallel_loop, try_create_reduction_list):
	Add and handle function parameter oacc_kernels_p.
	(get_omp_data_i_param): New function.
	(ref_conflicts_with_region, oacc_entry_exit_ok_1)
	(oacc_entry_exit_single_gang, oacc_entry_exit_ok): New functions.
	(parallelize_loops): Add and handle function parameter oacc_kernels_p.
	Calculate dominance info.  Skip loops that are not in a kernels region
	in oacc_kernels_p mode.  Skip inner loops of parallelized loops.
	(pass_parallelize_loops::execute): Call parallelize_loops with false
	argument.
	(pass_data_parallelize_loops_oacc_kernels): New pass_data.
	(class pass_parallelize_loops_oacc_kernels): New pass.
	(pass_parallelize_loops_oacc_kernels::execute)
	(make_pass_parallelize_loops_oacc_kernels): New functions.
	* tree-pass.h (make_pass_parallelize_loops_oacc_kernels): Declare.
---
 gcc/omp-low.c       |   8 +-
 gcc/omp-low.h       |   1 +
 gcc/tree-parloops.c | 689 +++++++++++++++++++++++++++++++++++++++++++++++-----
 gcc/tree-pass.h     |   2 +
 4 files changed, 636 insertions(+), 64 deletions(-)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 39c12c1..13fa456 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -11967,10 +11967,14 @@ expand_omp_atomic_fetch_op (basic_block load_bb,
   gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_ATOMIC_STORE);
   gsi_remove (&gsi, true);
   gsi = gsi_last_bb (store_bb);
+  stmt = gsi_stmt (gsi);
   gsi_remove (&gsi, true);
 
   if (gimple_in_ssa_p (cfun))
-    update_ssa (TODO_update_ssa_no_phi);
+    {
+      release_defs (stmt);
+      update_ssa (TODO_update_ssa_no_phi);
+    }
 
   return true;
 }
@@ -12344,7 +12348,7 @@ replace_oacc_fn_attrib (tree fn, tree dims)
    function attribute.  Push any that are non-constant onto the ARGS
    list, along with an appropriate GOMP_LAUNCH_DIM tag.  */
 
-static void
+void
 set_oacc_fn_attrib (tree fn, tree clauses, vec<tree> *args)
 {
   /* Must match GOMP_DIM ordering.  */
diff --git a/gcc/omp-low.h b/gcc/omp-low.h
index ee0f8ac..fa5396d 100644
--- a/gcc/omp-low.h
+++ b/gcc/omp-low.h
@@ -31,6 +31,7 @@ extern bool make_gimple_omp_edges (basic_block, struct omp_region **, int *);
 extern void omp_finish_file (void);
 extern tree omp_member_access_dummy_var (tree);
 extern tree get_oacc_fn_attrib (tree);
+extern void set_oacc_fn_attrib (tree, tree, vec<tree> *);
 extern int get_oacc_ifn_dim_arg (const gimple *);
 extern int get_oacc_fn_dim_size (tree, int);
 
diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
index 17415a8..0222016 100644
--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c
@@ -53,6 +53,10 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa.h"
 #include "params.h"
 #include "params-enum.h"
+#include "tree-ssa-alias.h"
+#include "tree-eh.h"
+#include "gomp-constants.h"
+#include "tree-dfa.h"
 
 /* This pass tries to distribute iterations of loops into several threads.
    The implementation is straightforward -- for each loop we test whether its
@@ -192,6 +196,8 @@ struct reduction_info
 				   of the reduction variable when existing the loop. */
   tree initial_value;		/* The initial value of the reduction var before entering the loop.  */
   tree field;			/*  the name of the field in the parloop data structure intended for reduction.  */
+  tree reduc_addr;		/* The address of the reduction variable for
+				   openacc reductions.  */
   tree init;			/* reduction initialization value.  */
   gphi *new_phi;		/* (helper field) Newly created phi node whose result
 				   will be passed to the atomic operation.  Represents
@@ -1085,10 +1091,29 @@ create_call_for_reduction_1 (reduction_info **slot, struct clsn_data *clsn_data)
   tree tmp_load, name;
   gimple *load;
 
-  load_struct = build_simple_mem_ref (clsn_data->load);
-  t = build3 (COMPONENT_REF, type, load_struct, reduc->field, NULL_TREE);
+  if (reduc->reduc_addr == NULL_TREE)
+    {
+      load_struct = build_simple_mem_ref (clsn_data->load);
+      t = build3 (COMPONENT_REF, type, load_struct, reduc->field, NULL_TREE);
+
+      addr = build_addr (t);
+    }
+  else
+    {
+      /* Set the address for the atomic store.  */
+      addr = reduc->reduc_addr;
 
-  addr = build_addr (t);
+      /* Remove the non-atomic store '*addr = sum'.  */
+      tree res = PHI_RESULT (reduc->keep_res);
+      use_operand_p use_p;
+      gimple *stmt;
+      bool single_use_p = single_imm_use (res, &use_p, &stmt);
+      gcc_assert (single_use_p);
+      replace_uses_by (gimple_vdef (stmt),
+		       gimple_vuse (stmt));
+      gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+      gsi_remove (&gsi, true);
+    }
 
   /* Create phi node.  */
   bb = clsn_data->load_bb;
@@ -1990,7 +2015,8 @@ transform_to_exit_first_loop (struct loop *loop,
 
 static void
 create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
-		      tree new_data, unsigned n_threads, location_t loc)
+		      tree new_data, unsigned n_threads, location_t loc,
+		      bool oacc_kernels_p)
 {
   gimple_stmt_iterator gsi;
   basic_block bb, paral_bb, for_bb, ex_bb, continue_bb;
@@ -2003,19 +2029,33 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
   gomp_continue *omp_cont_stmt;
   tree cvar, cvar_init, initvar, cvar_next, cvar_base, type;
   edge exit, nexit, guard, end, e;
+  tree for_clauses = NULL_TREE;
 
   /* Prepare the GIMPLE_OMP_PARALLEL statement.  */
   bb = loop_preheader_edge (loop)->src;
-  paral_bb = single_pred (bb);
-  gsi = gsi_last_bb (paral_bb);
+  if (!oacc_kernels_p)
+    {
+      paral_bb = single_pred (bb);
+      gsi = gsi_last_bb (paral_bb);
+    }
 
-  t = build_omp_clause (loc, OMP_CLAUSE_NUM_THREADS);
-  OMP_CLAUSE_NUM_THREADS_EXPR (t)
-    = build_int_cst (integer_type_node, n_threads);
-  omp_par_stmt = gimple_build_omp_parallel (NULL, t, loop_fn, data);
-  gimple_set_location (omp_par_stmt, loc);
+  if (!oacc_kernels_p)
+    {
+      t = build_omp_clause (loc, OMP_CLAUSE_NUM_THREADS);
+      OMP_CLAUSE_NUM_THREADS_EXPR (t)
+	= build_int_cst (integer_type_node, n_threads);
+      omp_par_stmt = gimple_build_omp_parallel (NULL, t, loop_fn, data);
+      gimple_set_location (omp_par_stmt, loc);
 
-  gsi_insert_after (&gsi, omp_par_stmt, GSI_NEW_STMT);
+      gsi_insert_after (&gsi, omp_par_stmt, GSI_NEW_STMT);
+    }
+  else
+    {
+      tree clause = build_omp_clause (loc, OMP_CLAUSE_NUM_GANGS);
+      OMP_CLAUSE_NUM_GANGS_EXPR (clause)
+	= build_int_cst (integer_type_node, n_threads);
+      set_oacc_fn_attrib (cfun->decl, clause, NULL);
+    }
 
   /* Initialize NEW_DATA.  */
   if (data)
@@ -2033,12 +2073,18 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
       gsi_insert_before (&gsi, assign_stmt, GSI_SAME_STMT);
     }
 
-  /* Emit GIMPLE_OMP_RETURN for GIMPLE_OMP_PARALLEL.  */
-  bb = split_loop_exit_edge (single_dom_exit (loop));
-  gsi = gsi_last_bb (bb);
-  omp_return_stmt1 = gimple_build_omp_return (false);
-  gimple_set_location (omp_return_stmt1, loc);
-  gsi_insert_after (&gsi, omp_return_stmt1, GSI_NEW_STMT);
+  /* Skip insertion of OMP_RETURN for oacc_kernels_p.  We've already generated
+     one when lowering the oacc kernels directive in
+     pass_lower_omp/lower_omp (). */
+  if (!oacc_kernels_p)
+    {
+      /* Emit GIMPLE_OMP_RETURN for GIMPLE_OMP_PARALLEL.  */
+      bb = split_loop_exit_edge (single_dom_exit (loop));
+      gsi = gsi_last_bb (bb);
+      omp_return_stmt1 = gimple_build_omp_return (false);
+      gimple_set_location (omp_return_stmt1, loc);
+      gsi_insert_after (&gsi, omp_return_stmt1, GSI_NEW_STMT);
+    }
 
   /* Extract data for GIMPLE_OMP_FOR.  */
   gcc_assert (loop->header == single_dom_exit (loop)->src);
@@ -2130,7 +2176,17 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
     OMP_CLAUSE_SCHEDULE_CHUNK_EXPR (t)
       = build_int_cst (integer_type_node, chunk_size);
 
-  for_stmt = gimple_build_omp_for (NULL, GF_OMP_FOR_KIND_FOR, t, 1, NULL);
+  if (1)
+    {
+      /* In combination with the NUM_GANGS on the parallel.  */
+      for_clauses = build_omp_clause (loc, OMP_CLAUSE_GANG);
+    }
+
+  for_stmt = gimple_build_omp_for (NULL,
+				   (oacc_kernels_p
+				    ? GF_OMP_FOR_KIND_OACC_LOOP
+				    : GF_OMP_FOR_KIND_FOR),
+				   for_clauses, 1, NULL);
   gimple_set_location (for_stmt, loc);
   gimple_omp_for_set_index (for_stmt, 0, initvar);
   gimple_omp_for_set_initial (for_stmt, 0, cvar_init);
@@ -2172,7 +2228,8 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
 static void
 gen_parallel_loop (struct loop *loop,
 		   reduction_info_table_type *reduction_list,
-		   unsigned n_threads, struct tree_niter_desc *niter)
+		   unsigned n_threads, struct tree_niter_desc *niter,
+		   bool oacc_kernels_p)
 {
   tree many_iterations_cond, type, nit;
   tree arg_struct, new_arg_struct;
@@ -2253,40 +2310,44 @@ gen_parallel_loop (struct loop *loop,
   if (stmts)
     gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
 
-  if (loop->inner)
-    m_p_thread=2;
-  else
-    m_p_thread=MIN_PER_THREAD;
-
-   many_iterations_cond =
-     fold_build2 (GE_EXPR, boolean_type_node,
-                nit, build_int_cst (type, m_p_thread * n_threads));
-
-  many_iterations_cond
-    = fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
-		   invert_truthvalue (unshare_expr (niter->may_be_zero)),
-		   many_iterations_cond);
-  many_iterations_cond
-    = force_gimple_operand (many_iterations_cond, &stmts, false, NULL_TREE);
-  if (stmts)
-    gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
-  if (!is_gimple_condexpr (many_iterations_cond))
+  if (!oacc_kernels_p)
     {
+      if (loop->inner)
+	m_p_thread=2;
+      else
+	m_p_thread=MIN_PER_THREAD;
+
+      many_iterations_cond =
+	fold_build2 (GE_EXPR, boolean_type_node,
+		     nit, build_int_cst (type, m_p_thread * n_threads));
+
+      many_iterations_cond
+	= fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
+		       invert_truthvalue (unshare_expr (niter->may_be_zero)),
+		       many_iterations_cond);
       many_iterations_cond
-	= force_gimple_operand (many_iterations_cond, &stmts,
-				true, NULL_TREE);
+	= force_gimple_operand (many_iterations_cond, &stmts, false, NULL_TREE);
       if (stmts)
 	gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
-    }
+      if (!is_gimple_condexpr (many_iterations_cond))
+	{
+	  many_iterations_cond
+	    = force_gimple_operand (many_iterations_cond, &stmts,
+				    true, NULL_TREE);
+	  if (stmts)
+	    gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop),
+					      stmts);
+	}
 
-  initialize_original_copy_tables ();
+      initialize_original_copy_tables ();
 
-  /* We assume that the loop usually iterates a lot.  */
-  prob = 4 * REG_BR_PROB_BASE / 5;
-  loop_version (loop, many_iterations_cond, NULL,
-		prob, prob, REG_BR_PROB_BASE - prob, true);
-  update_ssa (TODO_update_ssa);
-  free_original_copy_tables ();
+      /* We assume that the loop usually iterates a lot.  */
+      prob = 4 * REG_BR_PROB_BASE / 5;
+      loop_version (loop, many_iterations_cond, NULL,
+		    prob, prob, REG_BR_PROB_BASE - prob, true);
+      update_ssa (TODO_update_ssa);
+      free_original_copy_tables ();
+    }
 
   /* Base all the induction variables in LOOP on a single control one.  */
   canonicalize_loop_ivs (loop, &nit, true);
@@ -2306,6 +2367,9 @@ gen_parallel_loop (struct loop *loop,
     }
   else
     {
+      if (oacc_kernels_p)
+	n_threads = 1;
+
       /* Fall back on the method that handles more cases, but duplicates the
 	 loop body: move the exit condition of LOOP to the beginning of its
 	 header, and duplicate the part of the last iteration that gets disabled
@@ -2322,19 +2386,34 @@ gen_parallel_loop (struct loop *loop,
   entry = loop_preheader_edge (loop);
   exit = single_dom_exit (loop);
 
-  eliminate_local_variables (entry, exit);
-  /* In the old loop, move all variables non-local to the loop to a structure
-     and back, and create separate decls for the variables used in loop.  */
-  separate_decls_in_region (entry, exit, reduction_list, &arg_struct,
-			    &new_arg_struct, &clsn_data);
+  /* This rewrites the body in terms of new variables.  This has already
+     been done for oacc_kernels_p in pass_lower_omp/lower_omp ().  */
+  if (!oacc_kernels_p)
+    {
+      eliminate_local_variables (entry, exit);
+      /* In the old loop, move all variables non-local to the loop to a
+	 structure and back, and create separate decls for the variables used in
+	 loop.  */
+      separate_decls_in_region (entry, exit, reduction_list, &arg_struct,
+				&new_arg_struct, &clsn_data);
+    }
+  else
+    {
+      arg_struct = NULL_TREE;
+      new_arg_struct = NULL_TREE;
+      clsn_data.load = NULL_TREE;
+      clsn_data.load_bb = exit->dest;
+      clsn_data.store = NULL_TREE;
+      clsn_data.store_bb = NULL;
+    }
 
   /* Create the parallel constructs.  */
   loc = UNKNOWN_LOCATION;
   cond_stmt = last_stmt (loop->header);
   if (cond_stmt)
     loc = gimple_location (cond_stmt);
-  create_parallel_loop (loop, create_loop_fn (loc), arg_struct,
-			new_arg_struct, n_threads, loc);
+  create_parallel_loop (loop, create_loop_fn (loc), arg_struct, new_arg_struct,
+			n_threads, loc, oacc_kernels_p);
   if (reduction_list->elements () > 0)
     create_call_for_reduction (loop, reduction_list, &clsn_data);
 
@@ -2527,12 +2606,21 @@ try_get_loop_niter (loop_p loop, struct tree_niter_desc *niter)
   return true;
 }
 
+static tree
+get_omp_data_i_param (void)
+{
+  tree decl = DECL_ARGUMENTS (cfun->decl);
+  gcc_assert (DECL_CHAIN (decl) == NULL_TREE);
+  return ssa_default_def (cfun, decl);
+}
+
 /* Try to initialize REDUCTION_LIST for code generation part.
    REDUCTION_LIST describes the reductions.  */
 
 static bool
 try_create_reduction_list (loop_p loop,
-			   reduction_info_table_type *reduction_list)
+			   reduction_info_table_type *reduction_list,
+			   bool oacc_kernels_p)
 {
   edge exit = single_dom_exit (loop);
   gphi_iterator gsi;
@@ -2588,6 +2676,7 @@ try_create_reduction_list (loop_p loop,
 			 "  FAILED: it is not a part of reduction.\n");
 	      return false;
 	    }
+	  red->keep_res = phi;
 	  if (dump_file && (dump_flags & TDF_DETAILS))
 	    {
 	      fprintf (dump_file, "reduction phi is  ");
@@ -2622,15 +2711,402 @@ try_create_reduction_list (loop_p loop,
     }
 
 
+  if (oacc_kernels_p)
+    {
+      edge e = loop_preheader_edge (loop);
+
+      for (gsi = gsi_start_phis (loop->header); !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gphi *phi = gsi.phi ();
+	  tree def = PHI_RESULT (phi);
+	  affine_iv iv;
+
+	  if (!virtual_operand_p (def)
+	      && !simple_iv (loop, loop, def, &iv, true))
+	    {
+	      struct reduction_info *red;
+	      red = reduction_phi (reduction_list, phi);
+
+	      /* Look for pattern:
+
+		 <bb preheader>
+		   .omp_data_i = &.omp_data_arr;
+		   addr = .omp_data_i->sum;
+		   sum_a = *addr;
+
+		 <bb header>:
+		   sum_b = PHI <sum_a (preheader), sum_c (latch)>
+
+		 and assign addr to reduc->reduc_addr.  */
+
+	      tree arg = PHI_ARG_DEF_FROM_EDGE (phi, e);
+	      gimple *stmt = SSA_NAME_DEF_STMT (arg);
+	      if (!gimple_assign_single_p (stmt))
+		return false;
+	      tree memref = gimple_assign_rhs1 (stmt);
+	      if (TREE_CODE (memref) != MEM_REF)
+		return false;
+	      tree addr = TREE_OPERAND (memref, 0);
+
+	      gimple *stmt2 = SSA_NAME_DEF_STMT (addr);
+	      if (!gimple_assign_single_p (stmt2))
+		return false;
+	      tree compref = gimple_assign_rhs1 (stmt2);
+	      if (TREE_CODE (compref) != COMPONENT_REF)
+		return false;
+	      tree addr2 = TREE_OPERAND (compref, 0);
+	      if (TREE_CODE (addr2) != MEM_REF)
+		return false;
+	      addr2 = TREE_OPERAND (addr2, 0);
+	      if (TREE_CODE (addr2) != SSA_NAME
+		  || addr2 != get_omp_data_i_param ())
+		return false;
+	      red->reduc_addr = addr;
+	    }
+	}
+    }
+
+  return true;
+}
+
+static bool
+ref_conflicts_with_region (gimple_stmt_iterator gsi, ao_ref *ref,
+			   bool ref_is_store, vec<basic_block> region_bbs,
+			   unsigned int i, gimple *skip_stmt)
+{
+  basic_block bb = region_bbs[i];
+  gsi_next (&gsi);
+
+  while (true)
+    {
+      for (; !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  if (stmt == skip_stmt)
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "skipping reduction store: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+	      continue;
+	    }
+
+	  if (!gimple_vdef (stmt)
+	      && !gimple_vuse (stmt))
+	    continue;
+
+	  if (gimple_code (stmt) == GIMPLE_RETURN)
+	    continue;
+
+	  if (ref_is_store)
+	    {
+	      if (ref_maybe_used_by_stmt_p (stmt, ref))
+		{
+		  if (dump_file)
+		    {
+		      fprintf (dump_file, "Stmt ");
+		      print_gimple_stmt (dump_file, stmt, 0, 0);
+		    }
+		  return true;
+		}
+	    }
+	  else
+	    {
+	      if (stmt_may_clobber_ref_p_1 (stmt, ref))
+		{
+		  if (dump_file)
+		    {
+		      fprintf (dump_file, "Stmt ");
+		      print_gimple_stmt (dump_file, stmt, 0, 0);
+		    }
+		  return true;
+		}
+	    }
+	}
+      i++;
+      if (i == region_bbs.length ())
+	break;
+      bb = region_bbs[i];
+      gsi = gsi_start_bb (bb);
+    }
+
+  return false;
+}
+
+static bool
+oacc_entry_exit_ok_1 (bitmap in_loop_bbs, vec<basic_block> region_bbs,
+		      tree omp_data_i,
+		      reduction_info_table_type *reduction_list,
+		      bitmap reduction_stores)
+{
+  unsigned i;
+  basic_block bb;
+  FOR_EACH_VEC_ELT (region_bbs, i, bb)
+    {
+      if (bitmap_bit_p (in_loop_bbs, bb->index))
+	continue;
+
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  gimple *skip_stmt = NULL;
+
+	  if (is_gimple_debug (stmt)
+	      || gimple_code (stmt) == GIMPLE_COND)
+	    continue;
+
+	  ao_ref ref;
+	  bool ref_is_store = false;
+	  if (gimple_assign_load_p (stmt))
+	    {
+	      tree rhs = gimple_assign_rhs1 (stmt);
+	      tree base = get_base_address (rhs);
+	      if (TREE_CODE (base) == MEM_REF
+		  && operand_equal_p (TREE_OPERAND (base, 0), omp_data_i, 0))
+		continue;
+
+	      tree lhs = gimple_assign_lhs (stmt);
+	      if (TREE_CODE (lhs) == SSA_NAME
+		  && has_single_use (lhs))
+		{
+		  use_operand_p use_p;
+		  gimple *use_stmt;
+		  single_imm_use (lhs, &use_p, &use_stmt);
+		  if (gimple_code (use_stmt) == GIMPLE_PHI)
+		    {
+		      struct reduction_info *red;
+		      red = reduction_phi (reduction_list, use_stmt);
+		      tree val = PHI_RESULT (red->keep_res);
+		      if (has_single_use (val))
+			{
+			  single_imm_use (val, &use_p, &use_stmt);
+			  if (gimple_store_p (use_stmt))
+			    {
+			      unsigned int id
+				= SSA_NAME_VERSION (gimple_vdef (use_stmt));
+			      bitmap_set_bit (reduction_stores, id);
+			      skip_stmt = use_stmt;
+			      if (dump_file)
+				{
+				  fprintf (dump_file, "found reduction load: ");
+				  print_gimple_stmt (dump_file, stmt, 0, 0);
+				}
+			    }
+			}
+		    }
+		}
+
+	      ao_ref_init (&ref, rhs);
+	    }
+	  else if (gimple_store_p (stmt))
+	    {
+	      ao_ref_init (&ref, gimple_assign_lhs (stmt));
+	      ref_is_store = true;
+	    }
+	  else if (gimple_code (stmt) == GIMPLE_OMP_RETURN)
+	    continue;
+	  else if (!gimple_has_side_effects (stmt)
+		   && !gimple_could_trap_p (stmt)
+		   && !stmt_could_throw_p (stmt)
+		   && !gimple_vdef (stmt)
+		   && !gimple_vuse (stmt))
+	    continue;
+	  else if (is_gimple_call (stmt)
+		   && gimple_call_internal_p (stmt)
+		   && gimple_call_internal_fn (stmt) == IFN_GOACC_DIM_POS)
+	    continue;
+	  else if (gimple_code (stmt) == GIMPLE_RETURN)
+	    continue;
+	  else
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "Unhandled stmt in entry/exit: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+	      return false;
+	    }
+
+	  if (ref_conflicts_with_region (gsi, &ref, ref_is_store, region_bbs,
+					 i, skip_stmt))
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "conflicts with entry/exit stmt: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+	      return false;
+	    }
+	}
+    }
+
   return true;
 }
 
+/* Find stores inside REGION_BBS and outside IN_LOOP_BBS, and guard them with
+   gang_pos == 0, except when the stores are REDUCTION_STORES.  Return true
+   if any changes were made.  */
+
+static bool
+oacc_entry_exit_single_gang (bitmap in_loop_bbs, vec<basic_block> region_bbs,
+			     bitmap reduction_stores)
+{
+  tree gang_pos = NULL_TREE;
+  bool changed = false;
+
+  unsigned i;
+  basic_block bb;
+  FOR_EACH_VEC_ELT (region_bbs, i, bb)
+    {
+      if (bitmap_bit_p (in_loop_bbs, bb->index))
+	continue;
+
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);)
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+
+	  if (!gimple_store_p (stmt))
+	    {
+	      /* Update gsi to point to next stmt.  */
+	      gsi_next (&gsi);
+	      continue;
+	    }
+
+	  if (bitmap_bit_p (reduction_stores,
+			    SSA_NAME_VERSION (gimple_vdef (stmt))))
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file,
+			   "skipped reduction store for single-gang"
+			   " neutering: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+
+	      /* Update gsi to point to next stmt.  */
+	      gsi_next (&gsi);
+	      continue;
+	    }
+
+	  changed = true;
+
+	  if (gang_pos == NULL_TREE)
+	    {
+	      tree arg = build_int_cst (integer_type_node, GOMP_DIM_GANG);
+	      gcall *gang_single
+		= gimple_build_call_internal (IFN_GOACC_DIM_POS, 1, arg);
+	      gang_pos = make_ssa_name (integer_type_node);
+	      gimple_call_set_lhs (gang_single, gang_pos);
+	      gimple_stmt_iterator start
+		= gsi_start_bb (single_succ (ENTRY_BLOCK_PTR_FOR_FN (cfun)));
+	      tree vuse = ssa_default_def (cfun, gimple_vop (cfun));
+	      gimple_set_vuse (gang_single, vuse);
+	      gsi_insert_before (&start, gang_single, GSI_SAME_STMT);
+	    }
+
+	  if (dump_file)
+	    {
+	      fprintf (dump_file,
+		       "found store that needs single-gang neutering: ");
+	      print_gimple_stmt (dump_file, stmt, 0, 0);
+	    }
+
+	  {
+	    /* Split block before store.  */
+	    gimple_stmt_iterator gsi2 = gsi;
+	    gsi_prev (&gsi2);
+	    edge e;
+	    if (gsi_end_p (gsi2))
+	      {
+		e = split_block_after_labels (bb);
+		gsi2 = gsi_last_bb (bb);
+	      }
+	    else
+	      e = split_block (bb, gsi_stmt (gsi2));
+	    basic_block bb2 = e->dest;
+
+	    /* Split block after store.  */
+	    gimple_stmt_iterator gsi3 = gsi_start_bb (bb2);
+	    edge e2 = split_block (bb2, gsi_stmt (gsi3));
+	    basic_block bb3 = e2->dest;
+
+	    gimple *cond
+	      = gimple_build_cond (EQ_EXPR, gang_pos, integer_zero_node,
+				   NULL_TREE, NULL_TREE);
+	    gsi_insert_after (&gsi2, cond, GSI_NEW_STMT);
+
+	    edge e3 = make_edge (bb, bb3, EDGE_FALSE_VALUE);
+	    e->flags = EDGE_TRUE_VALUE;
+
+	    tree vdef = gimple_vdef (stmt);
+	    tree vuse = gimple_vuse (stmt);
+
+	    tree phi_res = copy_ssa_name (vdef);
+	    gphi *new_phi = create_phi_node (phi_res, bb3);
+	    replace_uses_by (vdef, phi_res);
+	    add_phi_arg (new_phi, vuse, e3, UNKNOWN_LOCATION);
+	    add_phi_arg (new_phi, vdef, e2, UNKNOWN_LOCATION);
+
+	    /* Update gsi to point to next stmt.  */
+	    bb = bb3;
+	    gsi = gsi_start_bb (bb);
+	  }
+	}
+    }
+
+  return changed;
+}
+
+static bool
+oacc_entry_exit_ok (struct loop *loop,
+		    reduction_info_table_type *reduction_list)
+{
+  basic_block *loop_bbs = get_loop_body_in_dom_order (loop);
+  tree omp_data_i = get_omp_data_i_param ();
+  gcc_assert (omp_data_i != NULL_TREE);
+  vec<basic_block> region_bbs
+    = get_all_dominated_blocks (CDI_DOMINATORS, ENTRY_BLOCK_PTR_FOR_FN (cfun));
+
+  bitmap in_loop_bbs = BITMAP_ALLOC (NULL);
+  bitmap_clear (in_loop_bbs);
+  for (unsigned int i = 0; i < loop->num_nodes; i++)
+    bitmap_set_bit (in_loop_bbs, loop_bbs[i]->index);
+
+  bitmap reduction_stores = BITMAP_ALLOC (NULL);
+  bool res = oacc_entry_exit_ok_1 (in_loop_bbs, region_bbs, omp_data_i,
+				   reduction_list, reduction_stores);
+
+  if (res)
+    {
+      bool changed = oacc_entry_exit_single_gang (in_loop_bbs, region_bbs,
+						  reduction_stores);
+      if (changed)
+	{
+	  free_dominance_info (CDI_DOMINATORS);
+	  calculate_dominance_info (CDI_DOMINATORS);
+	}
+    }
+
+  free (loop_bbs);
+
+  BITMAP_FREE (in_loop_bbs);
+  BITMAP_FREE (reduction_stores);
+
+  return res;
+}
+
 /* Detect parallel loops and generate parallel code using libgomp
    primitives.  Returns true if some loop was parallelized, false
    otherwise.  */
 
 static bool
-parallelize_loops (void)
+parallelize_loops (bool oacc_kernels_p)
 {
   unsigned n_threads = flag_tree_parallelize_loops;
   bool changed = false;
@@ -2642,19 +3118,29 @@ parallelize_loops (void)
   source_location loop_loc;
 
   /* Do not parallelize loops in the functions created by parallelization.  */
-  if (parallelized_function_p (cfun->decl))
+  if (!oacc_kernels_p
+      && parallelized_function_p (cfun->decl))
     return false;
+
+  /* Do not parallelize loops in offloaded functions.  */
+  if (!oacc_kernels_p
+      && get_oacc_fn_attrib (cfun->decl) != NULL)
+    return false;
+
   if (cfun->has_nonlocal_label)
     return false;
 
   gcc_obstack_init (&parloop_obstack);
   reduction_info_table_type reduction_list (10);
 
+  calculate_dominance_info (CDI_DOMINATORS);
+
   FOR_EACH_LOOP (loop, 0)
     {
       if (loop == skip_loop)
 	{
-	  if (dump_file && (dump_flags & TDF_DETAILS))
+	  if (!loop->in_oacc_kernels_region
+	      && dump_file && (dump_flags & TDF_DETAILS))
 	    fprintf (dump_file,
 		     "Skipping loop %d as inner loop of parallelized loop\n",
 		     loop->num);
@@ -2666,6 +3152,22 @@ parallelize_loops (void)
 	skip_loop = NULL;
 
       reduction_list.empty ();
+
+      if (oacc_kernels_p)
+	{
+	  if (!loop->in_oacc_kernels_region)
+	    continue;
+
+	  /* Don't try to parallelize inner loops in an oacc kernels region.  */
+	  if (loop->inner)
+	    skip_loop = loop->inner;
+
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file,
+		     "Trying loop %d with header bb %d in oacc kernels"
+		     " region\n", loop->num, loop->header->index);
+	}
+
       if (dump_file && (dump_flags & TDF_DETAILS))
       {
         fprintf (dump_file, "Trying loop %d as candidate\n",loop->num);
@@ -2707,6 +3209,7 @@ parallelize_loops (void)
       /* FIXME: Bypass this check as graphite doesn't update the
 	 count and frequency correctly now.  */
       if (!flag_loop_parallelize_all
+	  && !oacc_kernels_p
 	  && ((estimated != -1
 	       && estimated <= (HOST_WIDE_INT) n_threads * MIN_PER_THREAD)
 	      /* Do not bother with loops in cold areas.  */
@@ -2716,14 +3219,23 @@ parallelize_loops (void)
       if (!try_get_loop_niter (loop, &niter_desc))
 	continue;
 
-      if (!try_create_reduction_list (loop, &reduction_list))
+      if (!try_create_reduction_list (loop, &reduction_list, oacc_kernels_p))
 	continue;
 
       if (!flag_loop_parallelize_all
 	  && !loop_parallel_p (loop, &parloop_obstack))
 	continue;
 
+      if (oacc_kernels_p
+	  && !oacc_entry_exit_ok (loop, &reduction_list))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "entry/exit not ok: FAILED\n");
+	  continue;
+	}
+
       changed = true;
+      /* Skip inner loop(s) of parallelized loop.  */
       skip_loop = loop->inner;
       if (dump_file && (dump_flags & TDF_DETAILS))
       {
@@ -2736,8 +3248,9 @@ parallelize_loops (void)
 	  fprintf (dump_file, "\nloop at %s:%d: ",
 		   LOCATION_FILE (loop_loc), LOCATION_LINE (loop_loc));
       }
+
       gen_parallel_loop (loop, &reduction_list,
-			 n_threads, &niter_desc);
+			 n_threads, &niter_desc, oacc_kernels_p);
     }
 
   obstack_free (&parloop_obstack, NULL);
@@ -2787,7 +3300,7 @@ pass_parallelize_loops::execute (function *fun)
   if (number_of_loops (fun) <= 1)
     return 0;
 
-  if (parallelize_loops ())
+  if (parallelize_loops (false))
     {
       fun->curr_properties &= ~(PROP_gimple_eomp);
 
@@ -2806,3 +3319,55 @@ make_pass_parallelize_loops (gcc::context *ctxt)
 {
   return new pass_parallelize_loops (ctxt);
 }
+
+namespace {
+
+const pass_data pass_data_parallelize_loops_oacc_kernels =
+{
+  GIMPLE_PASS, /* type */
+  "parloops_oacc_kernels", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_TREE_PARALLELIZE_LOOPS, /* tv_id */
+  ( PROP_cfg | PROP_ssa ), /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_parallelize_loops_oacc_kernels : public gimple_opt_pass
+{
+public:
+  pass_parallelize_loops_oacc_kernels (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_parallelize_loops_oacc_kernels, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *) { return flag_tree_parallelize_loops > 1; }
+  virtual unsigned int execute (function *);
+
+}; // class pass_parallelize_loops_oacc_kernels
+
+unsigned
+pass_parallelize_loops_oacc_kernels::execute (function *fun)
+{
+  if (number_of_loops (fun) <= 1)
+    return 0;
+
+  if (parallelize_loops (true))
+    {
+      fun->curr_properties &= ~(PROP_gimple_eomp);
+
+      return TODO_update_ssa;
+    }
+
+  return 0;
+}
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_parallelize_loops_oacc_kernels (gcc::context *ctxt)
+{
+  return new pass_parallelize_loops_oacc_kernels (ctxt);
+}
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index f95a820..8eaf678 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -384,6 +384,8 @@ extern gimple_opt_pass *make_pass_slp_vectorize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_complete_unroll (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_complete_unrolli (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_parallelize_loops (gcc::context *ctxt);
+extern gimple_opt_pass *
+  make_pass_parallelize_loops_oacc_kernels (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_loop_prefetch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_iv_optimize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_loop_done (gcc::context *ctxt);
-- 
1.9.1
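
For illustration only (this is not part of the patch): a minimal sketch
of the kind of source the oacc_kernels_p path in parallelize_loops is
meant to handle.  The function scale_rows and its parameters are
hypothetical.  In a loop nest inside a kernels region, the code above
only tries the outer loop and sets skip_loop to loop->inner.  The new
pass is gated on flag_tree_parallelize_loops > 1, so presumably an
option such as -ftree-parallelize-loops=32 is needed next to -fopenacc.
Without -fopenacc the pragma is ignored and the loops run sequentially
on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: multiply every element of a rows x cols matrix
// by K inside an OpenACC kernels region.
void
scale_rows (std::vector<float> &m, std::size_t rows, std::size_t cols,
            float k)
{
  assert (m.size () == rows * cols);
  float *p = m.data ();

#pragma acc kernels copy (p[0:rows * cols])
  {
    // Outer loop: the candidate the kernels variant of parloops tries.
    for (std::size_t i = 0; i < rows; i++)
      // Inner loop: skipped, per the skip_loop = loop->inner assignment.
      for (std::size_t j = 0; j < cols; j++)
        p[i * cols + j] *= k;
  }
}
```

Whether a given loop is actually parallelized still depends on the
other checks in parallelize_loops, such as try_get_loop_niter and
oacc_entry_exit_ok.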



* [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
                   ` (8 preceding siblings ...)
  2015-11-09 19:53 ` [PATCH, 9/16] Add pass_parallelize_loops_oacc_kernels Tom de Vries
@ 2015-11-09 19:59 ` Tom de Vries
  2015-11-11 11:03   ` Richard Biener
  2015-11-09 20:02 ` [PATCH, 11/16] Update testcases after adding kernels pass group Tom de Vries
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 19:59 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 1764 bytes --]

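This patch adds the pass_oacc_kernels pass group to the pass list in 
passes.def.  Since pass_expand_omp_ssa, pass_tree_loop_init, 
pass_scev_cprop and pass_tree_loop_done now occur more than once in the 
pass list, the patch also adds a clone method to each of them.

Note that the pass_lim/pass_copy_prop pair is repeated on purpose: the 
first pair handles an inner loop in a loop nest, the second pair the 
outer loop.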
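
As a rough illustration of why the ChangeLog below lists new clone
methods: a pass that is scheduled more than once needs a separate
object per occurrence.  The sketch below is a toy model under that
assumption, not GCC's pass manager; toy_pass, cloneable_pass and
toy_pass_list are hypothetical names.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a pass object; by default it cannot be cloned.
struct toy_pass
{
  explicit toy_pass (std::string n) : name (std::move (n)) {}
  virtual ~toy_pass () = default;

  virtual std::unique_ptr<toy_pass> clone () const
  {
    throw std::logic_error ("pass " + name + " does not support cloning");
  }

  std::string name;
};

// A pass that provides clone (), in the spirit of the methods added here.
struct cloneable_pass : toy_pass
{
  using toy_pass::toy_pass;

  std::unique_ptr<toy_pass> clone () const override
  {
    return std::unique_ptr<toy_pass> (new cloneable_pass (name));
  }
};

struct toy_pass_list
{
  std::vector<const toy_pass *> order;            // execution order
  std::vector<std::unique_ptr<toy_pass>> clones;  // owned extra instances

  // Schedule PROTO.  The first occurrence uses PROTO itself, any later
  // occurrence needs a fresh object obtained through clone ().
  void next_pass (const toy_pass &proto)
  {
    assert (!proto.name.empty ());
    for (const toy_pass *p : order)
      if (p == &proto)
        {
          clones.push_back (proto.clone ());
          order.push_back (clones.back ().get ());
          return;
        }
    order.push_back (&proto);
  }
};
```

In this model, scheduling a cloneable_pass twice yields two distinct
objects, while scheduling a plain toy_pass a second time throws and
leaves the list unchanged.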

Thanks,
- Tom


[-- Attachment #2: 0010-Add-pass_oacc_kernels-pass-group-in-passes.def.patch --]
[-- Type: text/x-patch, Size: 3025 bytes --]

Add pass_oacc_kernels pass group in passes.def

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* omp-low.c (pass_expand_omp_ssa::clone): New function.
	* tree-ssa-loop.c (pass_scev_cprop::clone, pass_tree_loop_init::clone)
	(pass_tree_loop_done::clone): New function.
	* passes.def: Add pass_oacc_kernels pass group.
---
 gcc/omp-low.c       |  1 +
 gcc/passes.def      | 21 +++++++++++++++++++++
 gcc/tree-ssa-loop.c |  3 +++
 3 files changed, 25 insertions(+)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 13fa456..1283cc7 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -13360,6 +13360,7 @@ public:
       return !(fun->curr_properties & PROP_gimple_eomp);
     }
   virtual unsigned int execute (function *) { return execute_expand_omp (); }
+  opt_pass * clone () { return new pass_expand_omp_ssa (m_ctxt); }
 
 }; // class pass_expand_omp_ssa
 
diff --git a/gcc/passes.def b/gcc/passes.def
index c0ab6b9..b7a5424 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -86,6 +86,27 @@ along with GCC; see the file COPYING3.  If not see
 	  /* pass_build_ealias is a dummy pass that ensures that we
 	     execute TODO_rebuild_alias at this point.  */
 	  NEXT_PASS (pass_build_ealias);
+	  /* Pass group that runs when there are oacc kernels in the
+	     function.  */
+	  NEXT_PASS (pass_oacc_kernels);
+	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
+	      NEXT_PASS (pass_dominator_oacc_kernels);
+	      NEXT_PASS (pass_ch_oacc_kernels);
+	      NEXT_PASS (pass_dominator_oacc_kernels);
+	      NEXT_PASS (pass_tree_loop_init);
+	      NEXT_PASS (pass_lim);
+	      NEXT_PASS (pass_copy_prop);
+	      NEXT_PASS (pass_lim);
+	      NEXT_PASS (pass_copy_prop);
+	      NEXT_PASS (pass_scev_cprop);
+	      NEXT_PASS (pass_tree_loop_done);
+	      NEXT_PASS (pass_dominator_oacc_kernels);
+	      NEXT_PASS (pass_dce);
+	      NEXT_PASS (pass_tree_loop_init);
+	      NEXT_PASS (pass_parallelize_loops_oacc_kernels);
+	      NEXT_PASS (pass_expand_omp_ssa);
+	      NEXT_PASS (pass_tree_loop_done);
+	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_fre);
 	  NEXT_PASS (pass_merge_phi);
           NEXT_PASS (pass_dse);
diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index b51cac2..0557f99 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -270,6 +270,7 @@ public:
 
   /* opt_pass methods: */
   virtual unsigned int execute (function *);
+  opt_pass * clone () { return new pass_tree_loop_init (m_ctxt); }
 
 }; // class pass_tree_loop_init
 
@@ -374,6 +375,7 @@ public:
   /* opt_pass methods: */
   virtual bool gate (function *) { return flag_tree_scev_cprop; }
   virtual unsigned int execute (function *) { return scev_const_prop (); }
+  opt_pass * clone () { return new pass_scev_cprop (m_ctxt); }
 
 }; // class pass_scev_cprop
 
@@ -516,6 +518,7 @@ public:
 
   /* opt_pass methods: */
   virtual unsigned int execute (function *) { return tree_ssa_loop_done (); }
+  opt_pass * clone () { return new pass_tree_loop_done (m_ctxt); }
 
 }; // class pass_tree_loop_done
 
-- 
1.9.1



* [PATCH, 11/16] Update testcases after adding kernels pass group
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
                   ` (9 preceding siblings ...)
  2015-11-09 19:59 ` [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def Tom de Vries
@ 2015-11-09 20:02 ` Tom de Vries
  2015-11-11 11:03   ` Richard Biener
  2015-11-09 20:06 ` [PATCH, 12/16] Handle acc loop directive Tom de Vries
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 20:02 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 1658 bytes --]

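This patch updates existing testcases to use the new pass instance 
numbers in their dump names (for instance lim1 becomes lim3, and sccp 
becomes sccp2), which changed because of the passes added to the pass 
list in patch 10.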
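
The renumbering can be sketched with a toy model.  It assumes that a
pass scheduled once keeps its bare dump name, and that a pass scheduled
several times gets a 1-based instance number in pass-list order, which
is consistent with the renames in this patch.  The function
toy_dump_names and the shortened pass lists are hypothetical, not the
real passes.def contents.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy model of dump-name numbering; not GCC's implementation.
std::vector<std::string>
toy_dump_names (const std::vector<std::string> &pass_list)
{
  // Count how often each pass occurs in the list.
  std::map<std::string, int> total;
  for (const std::string &name : pass_list)
    ++total[name];

  // Name each occurrence: bare name if unique, else name + instance.
  std::map<std::string, int> seen;
  std::vector<std::string> result;
  for (const std::string &name : pass_list)
    {
      int instance = ++seen[name];
      assert (instance <= total[name]);
      result.push_back (total[name] == 1
                        ? name : name + std::to_string (instance));
    }
  return result;
}
```

With a shortened list {lim, sccp, lim}, the first lim dumps as lim1 and
sccp has no number.  Prepending a group containing {lim, lim, sccp}
turns the former lim1 into lim3 and the former sccp into sccp2, which
is the kind of change the testcases below need.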

Thanks,
- Tom


[-- Attachment #2: 0011-Update-testcases-after-adding-kernels-pass-group.patch --]
[-- Type: text/x-patch, Size: 35345 bytes --]

Update testcases after adding kernels pass group

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* c-c++-common/restrict-2.c: Update after adding pass_oacc_kernels pass
	group.
	* c-c++-common/restrict-4.c: Same.
	* g++.dg/tree-ssa/copyprop-1.C: Same.
	* g++.dg/tree-ssa/pr33615.C: Same.
	* g++.dg/tree-ssa/restrict1.C: Same.
	* gcc.dg/gomp/notify-new-function-3.c: Same.
	* gcc.dg/pr23911.c: Same.
	* gcc.dg/pr41488.c: Same.
	* gcc.dg/tm/pub-safety-1.c: Same.
	* gcc.dg/tm/reg-promotion.c: Same.
	* gcc.dg/tree-ssa/20030709-2.c: Same.
	* gcc.dg/tree-ssa/20030731-2.c: Same.
	* gcc.dg/tree-ssa/20040729-1.c: Same.
	* gcc.dg/tree-ssa/20050314-1.c: Same.
	* gcc.dg/tree-ssa/cfgcleanup-1.c: Same.
	* gcc.dg/tree-ssa/loop-17.c: Same.
	* gcc.dg/tree-ssa/loop-32.c: Same.
	* gcc.dg/tree-ssa/loop-33.c: Same.
	* gcc.dg/tree-ssa/loop-34.c: Same.
	* gcc.dg/tree-ssa/loop-35.c: Same.
	* gcc.dg/tree-ssa/loop-36.c: Same.
	* gcc.dg/tree-ssa/loop-39.c: Same.
	* gcc.dg/tree-ssa/loop-7.c: Same.
	* gcc.dg/tree-ssa/pr21086.c: Same.
	* gcc.dg/tree-ssa/pr23109.c: Same.
	* gcc.dg/tree-ssa/restrict-3.c: Same.
	* gcc.dg/tree-ssa/restrict-5.c: Same.
	* gcc.dg/tree-ssa/scev-7.c: Same.
	* gcc.dg/tree-ssa/ssa-dce-1.c: Same.
	* gcc.dg/tree-ssa/ssa-dce-2.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-1.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-10.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-11.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-12.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-2.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-3.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-6.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-7.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-8.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-9.c: Same.
	* gcc.dg/tree-ssa/structopt-1.c: Same.
	* gcc.dg/vect/pr26359.c: Same.
	* gfortran.dg/pr32921.f: Same.
---
 gcc/testsuite/c-c++-common/restrict-2.c           | 4 ++--
 gcc/testsuite/c-c++-common/restrict-4.c           | 4 ++--
 gcc/testsuite/g++.dg/tree-ssa/copyprop-1.C        | 4 ++--
 gcc/testsuite/g++.dg/tree-ssa/pr33615.C           | 4 ++--
 gcc/testsuite/g++.dg/tree-ssa/restrict1.C         | 4 ++--
 gcc/testsuite/gcc.dg/gomp/notify-new-function-3.c | 2 +-
 gcc/testsuite/gcc.dg/pr23911.c                    | 6 +++---
 gcc/testsuite/gcc.dg/pr41488.c                    | 4 ++--
 gcc/testsuite/gcc.dg/tm/pub-safety-1.c            | 4 ++--
 gcc/testsuite/gcc.dg/tm/reg-promotion.c           | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/20030709-2.c        | 8 ++++----
 gcc/testsuite/gcc.dg/tree-ssa/20030731-2.c        | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/20040729-1.c        | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/20050314-1.c        | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/cfgcleanup-1.c      | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/loop-17.c           | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/loop-32.c           | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/loop-33.c           | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/loop-34.c           | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/loop-35.c           | 6 +++---
 gcc/testsuite/gcc.dg/tree-ssa/loop-36.c           | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/loop-39.c           | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/loop-7.c            | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/pr21086.c           | 6 +++---
 gcc/testsuite/gcc.dg/tree-ssa/pr23109.c           | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/restrict-3.c        | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/restrict-5.c        | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/scev-7.c            | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dce-1.c         | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dce-2.c         | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-1.c         | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-10.c        | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-11.c        | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-12.c        | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-2.c         | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-3.c         | 6 +++---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-6.c         | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-7.c         | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-8.c         | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-9.c         | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/structopt-1.c       | 4 ++--
 gcc/testsuite/gcc.dg/vect/pr26359.c               | 4 ++--
 gcc/testsuite/gfortran.dg/pr32921.f               | 4 ++--
 43 files changed, 91 insertions(+), 91 deletions(-)

diff --git a/gcc/testsuite/c-c++-common/restrict-2.c b/gcc/testsuite/c-c++-common/restrict-2.c
index 5e8bca7..183a0de 100644
--- a/gcc/testsuite/c-c++-common/restrict-2.c
+++ b/gcc/testsuite/c-c++-common/restrict-2.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fno-strict-aliasing -fdump-tree-lim1-details" } */
+/* { dg-options "-O -fno-strict-aliasing -fdump-tree-lim3-details" } */
 
 void foo (float * __restrict__ a, float * __restrict__ b, int n, int j)
 {
@@ -10,4 +10,4 @@ void foo (float * __restrict__ a, float * __restrict__ b, int n, int j)
 
 /* We should move the RHS of the store out of the loop.  */
 
-/* { dg-final { scan-tree-dump-times "Moving statement" 11 "lim1" } } */
+/* { dg-final { scan-tree-dump-times "Moving statement" 11 "lim3" } } */
diff --git a/gcc/testsuite/c-c++-common/restrict-4.c b/gcc/testsuite/c-c++-common/restrict-4.c
index cea6cd8..8dd597c 100644
--- a/gcc/testsuite/c-c++-common/restrict-4.c
+++ b/gcc/testsuite/c-c++-common/restrict-4.c
@@ -1,5 +1,5 @@
 /* { dg-do compile }  */
-/* { dg-options "-O2 -fdump-tree-lim1-details" } */
+/* { dg-options "-O2 -fdump-tree-lim3-details" } */
 
 struct Foo
 {
@@ -15,4 +15,4 @@ void bar(struct Foo f, int * __restrict__ q)
     }
 }
 
-/* { dg-final { scan-tree-dump "Executing store motion" "lim1" } } */
+/* { dg-final { scan-tree-dump "Executing store motion" "lim3" } } */
diff --git a/gcc/testsuite/g++.dg/tree-ssa/copyprop-1.C b/gcc/testsuite/g++.dg/tree-ssa/copyprop-1.C
index 5ff289c..34a9f7b 100644
--- a/gcc/testsuite/g++.dg/tree-ssa/copyprop-1.C
+++ b/gcc/testsuite/g++.dg/tree-ssa/copyprop-1.C
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fdump-tree-dce2" } */
+/* { dg-options "-O -fdump-tree-dce3" } */
 
 /* Verify that we can eliminate the useless conversions to/from
    const qualified pointer types
@@ -27,4 +27,4 @@ int foo(Object&o)
 
 /* Remaining should be two loads.  */
 
-/* { dg-final { scan-tree-dump-times " = \[^\n\]*;" 2 "dce2" } } */
+/* { dg-final { scan-tree-dump-times " = \[^\n\]*;" 2 "dce3" } } */
diff --git a/gcc/testsuite/g++.dg/tree-ssa/pr33615.C b/gcc/testsuite/g++.dg/tree-ssa/pr33615.C
index f1b7a64..dd2bbb2 100644
--- a/gcc/testsuite/g++.dg/tree-ssa/pr33615.C
+++ b/gcc/testsuite/g++.dg/tree-ssa/pr33615.C
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fnon-call-exceptions -fdump-tree-lim1-details -w" } */
+/* { dg-options "-O -fnon-call-exceptions -fdump-tree-lim3-details -w" } */
 
 extern volatile int y;
 
@@ -16,4 +16,4 @@ foo (double a, int x)
 
 // The expression 1.0 / 0.0 should not be treated as a loop invariant
 // if it may throw an exception.
-// { dg-final { scan-tree-dump-times "invariant up to" 0 "lim1" } }
+// { dg-final { scan-tree-dump-times "invariant up to" 0 "lim3" } }
diff --git a/gcc/testsuite/g++.dg/tree-ssa/restrict1.C b/gcc/testsuite/g++.dg/tree-ssa/restrict1.C
index 5952fca..718d1ec 100644
--- a/gcc/testsuite/g++.dg/tree-ssa/restrict1.C
+++ b/gcc/testsuite/g++.dg/tree-ssa/restrict1.C
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-lim1-details" } */
+/* { dg-options "-O2 -fdump-tree-lim3-details" } */
 
 struct Foo
 {
@@ -16,4 +16,4 @@ void bar(Foo f, int * __restrict__ q)
     }
 }
 
-/* { dg-final { scan-tree-dump "Executing store motion" "lim1" } } */
+/* { dg-final { scan-tree-dump "Executing store motion" "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/gomp/notify-new-function-3.c b/gcc/testsuite/gcc.dg/gomp/notify-new-function-3.c
index a8f24b1..033a407 100644
--- a/gcc/testsuite/gcc.dg/gomp/notify-new-function-3.c
+++ b/gcc/testsuite/gcc.dg/gomp/notify-new-function-3.c
@@ -11,4 +11,4 @@ foo (int *__restrict a, int *__restrict b, int *__restrict c)
 
 
 /* Check for new function notification in ompexpssa dump.  */
-/* { dg-final { scan-tree-dump-times "Added new ssa gimple function foo\\.\[\\\$_\]loopfn\\.0 to callgraph" 1 "ompexpssa" } } */
+/* { dg-final { scan-tree-dump-times "Added new ssa gimple function foo\\.\[\\\$_\]loopfn\\.0 to callgraph" 1 "ompexpssa2" } } */
diff --git a/gcc/testsuite/gcc.dg/pr23911.c b/gcc/testsuite/gcc.dg/pr23911.c
index 2c27397..3fa0412 100644
--- a/gcc/testsuite/gcc.dg/pr23911.c
+++ b/gcc/testsuite/gcc.dg/pr23911.c
@@ -1,7 +1,7 @@
 /* This was a missed optimization in tree constant propagation
    that CSE would catch later on.  */
 /* { dg-do compile } */
-/* { dg-options "-O -fdump-tree-dce2" } */
+/* { dg-options "-O -fdump-tree-dce3" } */
 
 double _Complex *a; 
 static const double _Complex b[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; 
@@ -16,5 +16,5 @@ test (void)
 
 /* After DCE2 which runs after FRE, the expressions should be fully
    constant folded.  There should be no loads from b left.  */
-/* { dg-final { scan-tree-dump-times "__complex__ \\\(1.0e\\\+0, 0.0\\\)" 2 "dce2" } } */
-/* { dg-final { scan-tree-dump-times "= b" 0 "dce2" } } */
+/* { dg-final { scan-tree-dump-times "__complex__ \\\(1.0e\\\+0, 0.0\\\)" 2 "dce3" } } */
+/* { dg-final { scan-tree-dump-times "= b" 0 "dce3" } } */
diff --git a/gcc/testsuite/gcc.dg/pr41488.c b/gcc/testsuite/gcc.dg/pr41488.c
index b9bc718..6c7686b 100644
--- a/gcc/testsuite/gcc.dg/pr41488.c
+++ b/gcc/testsuite/gcc.dg/pr41488.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-sccp-scev" } */
+/* { dg-options "-O2 -fdump-tree-sccp2-scev" } */
 
 struct struct_t
 {
@@ -14,4 +14,4 @@ void foo (struct struct_t* sp, int start, int end)
     sp->data[i+start] = 0;
 }
 
-/* { dg-final { scan-tree-dump-times "Simplify PEELED_CHREC into POLYNOMIAL_CHREC" 1 "sccp" } } */
+/* { dg-final { scan-tree-dump-times "Simplify PEELED_CHREC into POLYNOMIAL_CHREC" 1 "sccp2" } } */
diff --git a/gcc/testsuite/gcc.dg/tm/pub-safety-1.c b/gcc/testsuite/gcc.dg/tm/pub-safety-1.c
index c95111c..3841c08 100644
--- a/gcc/testsuite/gcc.dg/tm/pub-safety-1.c
+++ b/gcc/testsuite/gcc.dg/tm/pub-safety-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-fgnu-tm -O1 -fdump-tree-lim1" } */
+/* { dg-options "-fgnu-tm -O1 -fdump-tree-lim3" } */
 
 /* Test that thread visible loads do not get hoisted out of loops if
    the load would not have occurred on each path out of the loop.  */
@@ -20,4 +20,4 @@ void reader()
     }
 }
 
-/* { dg-final { scan-tree-dump-times "Cannot hoist.*DATA_DATA because it is in a transaction" 1 "lim1" } } */
+/* { dg-final { scan-tree-dump-times "Cannot hoist.*DATA_DATA because it is in a transaction" 1 "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tm/reg-promotion.c b/gcc/testsuite/gcc.dg/tm/reg-promotion.c
index 0200600..e0e5f62 100644
--- a/gcc/testsuite/gcc.dg/tm/reg-promotion.c
+++ b/gcc/testsuite/gcc.dg/tm/reg-promotion.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-fgnu-tm -O2 -fdump-tree-lim1" } */
+/* { dg-options "-fgnu-tm -O2 -fdump-tree-lim3" } */
 
 /* Test that `count' is not written to unless p->data>0.  */
 
@@ -20,4 +20,4 @@ void func()
   }
 }
 
-/* { dg-final { scan-tree-dump-times "Cannot hoist conditional load of count because it is in a transaction" 1 "lim1" } } */
+/* { dg-final { scan-tree-dump-times "Cannot hoist conditional load of count because it is in a transaction" 1 "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/20030709-2.c b/gcc/testsuite/gcc.dg/tree-ssa/20030709-2.c
index d4f42f9..5009cd6 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/20030709-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/20030709-2.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fdump-tree-dce2" } */
+/* { dg-options "-O -fdump-tree-dce3" } */
   
 struct rtx_def;
 typedef struct rtx_def *rtx;
@@ -42,13 +42,13 @@ get_alias_set (t)
 
 /* There should be precisely one load of ->decl.rtl.  If there is
    more than, then the dominator optimizations failed.  */
-/* { dg-final { scan-tree-dump-times "->decl\\.rtl" 1 "dce2"} } */
+/* { dg-final { scan-tree-dump-times "->decl\\.rtl" 1 "dce3"} } */
   
 /* There should be no loads of .rtmem since the complex return statement
    is just "return 0".  */
-/* { dg-final { scan-tree-dump-times ".rtmem" 0 "dce2"} } */
+/* { dg-final { scan-tree-dump-times ".rtmem" 0 "dce3"} } */
   
 /* There should be one IF statement (the complex return statement should
    collapse down to a simple return 0 without any conditionals).  */
-/* { dg-final { scan-tree-dump-times "if " 1 "dce2"} } */
+/* { dg-final { scan-tree-dump-times "if " 1 "dce3"} } */
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/20030731-2.c b/gcc/testsuite/gcc.dg/tree-ssa/20030731-2.c
index bdb22ff..069f953 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/20030731-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/20030731-2.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-dce1" } */
+/* { dg-options "-O2 -fdump-tree-dce2" } */
 
 void foo (void);
 
@@ -15,4 +15,4 @@ bar (int i, int partial, int args_addr)
 
 /* There should be only one IF conditional since the first does nothing
    useful.  */
-/* { dg-final { scan-tree-dump-times "if " 1 "dce1"} } */
+/* { dg-final { scan-tree-dump-times "if " 1 "dce2"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/20040729-1.c b/gcc/testsuite/gcc.dg/tree-ssa/20040729-1.c
index 6e7ffbb..812887a 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/20040729-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/20040729-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O1 -fdump-tree-dce2" } */
+/* { dg-options "-O1 -fdump-tree-dce3" } */
 
 int
 foo ()
@@ -16,4 +16,4 @@ foo ()
    compiler was mistakenly thinking that the statement had volatile
    operands.  But 'p' itself is not volatile and taking the address of
    a volatile does not constitute a volatile operand.  */
-/* { dg-final { scan-tree-dump-times "&x" 0 "dce2"} } */
+/* { dg-final { scan-tree-dump-times "&x" 0 "dce3"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/20050314-1.c b/gcc/testsuite/gcc.dg/tree-ssa/20050314-1.c
index fe220cd..1ad61f1 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/20050314-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/20050314-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O1 -fdump-tree-lim1-details --param allow-store-data-races=1" } */
+/* { dg-options "-O1 -fdump-tree-lim3-details --param allow-store-data-races=1" } */
 
 float a[100];
 
@@ -17,4 +17,4 @@ void xxx (void)
 /* Store motion may be applied to the assignment to a[k], since sinf
    cannot read nor write the memory.  */
 
-/* { dg-final { scan-tree-dump-times "Moving statement" 1 "lim1" } } */
+/* { dg-final { scan-tree-dump-times "Moving statement" 1 "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/cfgcleanup-1.c b/gcc/testsuite/gcc.dg/tree-ssa/cfgcleanup-1.c
index 4d22a42..53ce973 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/cfgcleanup-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/cfgcleanup-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */ 
-/* { dg-options "-O2 -fdump-tree-dce1" } */
+/* { dg-options "-O2 -fdump-tree-dce2" } */
 void
 cleanup (int a, int b)
 {
@@ -15,4 +15,4 @@ cleanup (int a, int b)
   return;
 }
 /* Dce should get rid of the initializers and cfgcleanup should elliminate ifs  */
-/* { dg-final { scan-tree-dump-times "if " 0 "dce1"} } */
+/* { dg-final { scan-tree-dump-times "if " 0 "dce2"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-17.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-17.c
index 588cf4c..4cb1438 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/loop-17.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-17.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fdump-tree-sccp-details" } */
+/* { dg-options "-O -fdump-tree-sccp2-details" } */
 
 /* To determine the number of iterations in this loop we need to fold
    p_4 + 4B > p_4 + 8B to false.  This transformation has caused
@@ -15,4 +15,4 @@ int foo (int *p)
   return i;
 }
 
-/* { dg-final { scan-tree-dump "# of iterations 1, bounded by 1" "sccp" } } */
+/* { dg-final { scan-tree-dump "# of iterations 1, bounded by 1" "sccp2" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-32.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-32.c
index 9953bb5..9b69c73 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/loop-32.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-32.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-lim1-details" } */
+/* { dg-options "-O2 -fdump-tree-lim3-details" } */
 
 int x;
 int a[100];
@@ -42,4 +42,4 @@ void test3(struct a *A)
     }
 }
 
-/* { dg-final { scan-tree-dump-times "Executing store motion of" 3 "lim1" } } */
+/* { dg-final { scan-tree-dump-times "Executing store motion of" 3 "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-33.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-33.c
index 2cf4c5a..98a16fb 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/loop-33.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-33.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-lim1-details" } */
+/* { dg-options "-O2 -fdump-tree-lim3-details" } */
 
 int x;
 int a[100];
@@ -36,4 +36,4 @@ void test5(struct a *A, unsigned b)
     }
 }
 
-/* { dg-final { scan-tree-dump-times "Executing store motion of" 4 "lim1" { xfail { lp64 || llp64 } } } } */
+/* { dg-final { scan-tree-dump-times "Executing store motion of" 4 "lim3" { xfail { lp64 || llp64 } } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-34.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-34.c
index 67493a5..26fb281 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/loop-34.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-34.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-lim1-details" } */
+/* { dg-options "-O2 -fdump-tree-lim3-details" } */
 
 int r[6];
 
@@ -17,4 +17,4 @@ void f (int n)
 }
 
 
-/* { dg-final { scan-tree-dump-times "Executing store motion of r" 6 "lim1" } } */
+/* { dg-final { scan-tree-dump-times "Executing store motion of r" 6 "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-35.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-35.c
index 70557c5..87d105a 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/loop-35.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-35.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-lim1-details" } */
+/* { dg-options "-O2 -fdump-tree-lim3-details" } */
 
 int x;
 int a[100];
@@ -67,5 +67,5 @@ void test4(struct a *A, unsigned LONG b)
     }
 }
 /* long index not hoisted for avr target PR 36561 */
-/* { dg-final { scan-tree-dump-times "Executing store motion of" 8 "lim1" { xfail { "avr-*-*" } } } } */
-/* { dg-final { scan-tree-dump-times "Executing store motion of" 6 "lim1" { target { "avr-*-*" } } } } */
+/* { dg-final { scan-tree-dump-times "Executing store motion of" 8 "lim3" { xfail { "avr-*-*" } } } } */
+/* { dg-final { scan-tree-dump-times "Executing store motion of" 6 "lim3" { target { "avr-*-*" } } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-36.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-36.c
index d922991..516cad9 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/loop-36.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-36.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-dce2" } */
+/* { dg-options "-O2 -fdump-tree-dce3" } */
 
 struct X { float array[2]; };
 
@@ -18,4 +18,4 @@ float foobar () {
 
 /* The temporary structure should have been promoted to registers
    by FRE after the loops have been unrolled by the early unrolling pass.  */
-/* { dg-final { scan-tree-dump-not "c\.array" "dce2" } } */
+/* { dg-final { scan-tree-dump-not "c\.array" "dce3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-39.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-39.c
index 53680dd..d1edbd5 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/loop-39.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-39.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-sccp-details" } */
+/* { dg-options "-O2 -fdump-tree-sccp2-details" } */
 
 int
 foo (unsigned int n)
@@ -22,4 +22,4 @@ foo (unsigned int n)
   return r + n;
 }
 
-/* { dg-final { scan-tree-dump "# of iterations \[^\n\r]*, bounded by 8" "sccp" } } */
+/* { dg-final { scan-tree-dump "# of iterations \[^\n\r]*, bounded by 8" "sccp2" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-7.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-7.c
index 26fb4ec..e28e4c9 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/loop-7.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-7.c
@@ -1,6 +1,6 @@
 /* PR tree-optimization/19828 */
 /* { dg-do compile } */
-/* { dg-options "-O1 -fdump-tree-lim1-details" } */
+/* { dg-options "-O1 -fdump-tree-lim3-details" } */
 
 int cst_fun1 (int) __attribute__((__const__));
 int cst_fun2 (int) __attribute__((__const__));
@@ -31,4 +31,4 @@ int xxx (void)
    Calls to cst_fun2 and pure_fun2 should not be, since calling
    with k = 0 may be invalid.  */
 
-/* { dg-final { scan-tree-dump-times "Moving statement" 2 "lim1" } } */
+/* { dg-final { scan-tree-dump-times "Moving statement" 2 "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr21086.c b/gcc/testsuite/gcc.dg/tree-ssa/pr21086.c
index 26ea817..e8b62c2 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr21086.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr21086.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-vrp1 -fdump-tree-dce1 -fdelete-null-pointer-checks" } */
+/* { dg-options "-O2 -fdump-tree-vrp1 -fdump-tree-dce2 -fdelete-null-pointer-checks" } */
 
 int
 foo (int *p)
@@ -18,5 +18,5 @@ foo (int *p)
 /* Target disabling -fdelete-null-pointer-checks should not fold checks */
 /* { dg-final { scan-tree-dump "Folding predicate " "vrp1" { target { ! keeps_null_pointer_checks } } } } */
 /* { dg-final { scan-tree-dump-times "Folding predicate " 0 "vrp1" { target {   keeps_null_pointer_checks } } } } */
-/* { dg-final { scan-tree-dump-not "b_. =" "dce1" { target { ! avr-*-* } } } } */
-/* { dg-final { scan-tree-dump "b_. =" "dce1" { target { avr-*-* } } } } */
+/* { dg-final { scan-tree-dump-not "b_. =" "dce2" { target { ! avr-*-* } } } } */
+/* { dg-final { scan-tree-dump "b_. =" "dce2" { target { avr-*-* } } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr23109.c b/gcc/testsuite/gcc.dg/tree-ssa/pr23109.c
index 8281a98..040f3ae 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr23109.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr23109.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -funsafe-math-optimizations -ftrapping-math -fdump-tree-recip -fdump-tree-lim1" } */
+/* { dg-options "-O2 -funsafe-math-optimizations -ftrapping-math -fdump-tree-recip -fdump-tree-lim3" } */
 /* { dg-warning "-fassociative-math disabled" "" { target *-*-* } 1 } */
 
 double F[2] = { 0., 0. }, e = 0.;
@@ -29,6 +29,6 @@ int main()
 /* LIM only performs the transformation in the no-trapping-math case.  In
    the future we will do it for trapping-math as well in recip, check that
    this is not wrongly optimized.  */
-/* { dg-final { scan-tree-dump-not "reciptmp" "lim1" } } */
+/* { dg-final { scan-tree-dump-not "reciptmp" "lim3" } } */
 /* { dg-final { scan-tree-dump-not "reciptmp" "recip" } } */
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/restrict-3.c b/gcc/testsuite/gcc.dg/tree-ssa/restrict-3.c
index e9e1438..a352129 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/restrict-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/restrict-3.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fno-strict-aliasing -fdump-tree-lim1-details" } */
+/* { dg-options "-O -fno-strict-aliasing -fdump-tree-lim3-details" } */
 
 void f(int * __restrict__ r,
        int a[__restrict__ 16][16],
@@ -14,4 +14,4 @@ void f(int * __restrict__ r,
 
 /* We should apply store motion to the store to *r.  */
 
-/* { dg-final { scan-tree-dump "Executing store motion of \\\*r" "lim1" } } */
+/* { dg-final { scan-tree-dump "Executing store motion of \\\*r" "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/restrict-5.c b/gcc/testsuite/gcc.dg/tree-ssa/restrict-5.c
index 6dd4c99..2e0edab 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/restrict-5.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/restrict-5.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fno-strict-aliasing -fdump-tree-lim1-details" } */
+/* { dg-options "-O -fno-strict-aliasing -fdump-tree-lim3-details" } */
 
 static inline __attribute__((always_inline))
 void f(int * __restrict__ r,
@@ -20,4 +20,4 @@ void g(int *r, int a[16][16], int b[16][16], int i, int j)
 
 /* We should apply store motion to the store to *r.  */
 
-/* { dg-final { scan-tree-dump "Executing store motion of \\\*r" "lim1" } } */
+/* { dg-final { scan-tree-dump "Executing store motion of \\\*r" "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/scev-7.c b/gcc/testsuite/gcc.dg/tree-ssa/scev-7.c
index 5dfc7b1..ead68d0 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/scev-7.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/scev-7.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-sccp-scev" } */
+/* { dg-options "-O2 -fdump-tree-sccp2-scev" } */
 
 struct struct_t
 {
@@ -14,4 +14,4 @@ void foo (struct struct_t* sp, int start, int end)
     sp->data[i+start] = 0;
 }
 
-/* { dg-final { scan-tree-dump-times "Simplify PEELED_CHREC into POLYNOMIAL_CHREC" 1 "sccp" } } */
+/* { dg-final { scan-tree-dump-times "Simplify PEELED_CHREC into POLYNOMIAL_CHREC" 1 "sccp2" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dce-1.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dce-1.c
index 4a8c6b6..0c478d1 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dce-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dce-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O1 -fdump-tree-dce2" } */
+/* { dg-options "-O1 -fdump-tree-dce3" } */
 
 int t() __attribute__ ((const));
 void
@@ -10,4 +10,4 @@ q()
     i = t();
 }
 /* There should be no IF conditionals.  */
-/* { dg-final { scan-tree-dump-times "if " 0 "dce2"} } */
+/* { dg-final { scan-tree-dump-times "if " 0 "dce3"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dce-2.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dce-2.c
index 6281a1e..b3f5073 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dce-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dce-2.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-dce2" } */
+/* { dg-options "-O2 -fdump-tree-dce3" } */
 
 /* We should notice constantness of this function. */
 static int __attribute__((noinline)) t(int a) 
@@ -13,4 +13,4 @@ void q(void)
     i = t(1);
 }
 /* There should be no IF conditionals.  */
-/* { dg-final { scan-tree-dump-times "if " 0 "dce2"} } */
+/* { dg-final { scan-tree-dump-times "if " 0 "dce3"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-1.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-1.c
index 1b387cd..6a4b819 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fdump-tree-lim1" } */
+/* { dg-options "-O -fdump-tree-lim3" } */
 
 /* This is a variant that does cause fold to place a cast to
    int before testing bit 1.  */
@@ -18,4 +18,4 @@ quantum_toffoli (int control1, int control2, int target,
     }
 }
 
-/* { dg-final { scan-tree-dump-times "1 <<" 3 "lim1" } } */
+/* { dg-final { scan-tree-dump-times "1 <<" 3 "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-10.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-10.c
index 79ea042..afa547c 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-10.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-10.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-lim1-details" } */
+/* { dg-options "-O2 -fdump-tree-lim3-details" } */
 
 int *l, *r;
 int test_func(void)
@@ -27,4 +27,4 @@ int test_func(void)
   return i;
 }
 
-/* { dg-final { scan-tree-dump "Executing store motion of pos" "lim1" } } */
+/* { dg-final { scan-tree-dump "Executing store motion of pos" "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-11.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-11.c
index eadf71c..d55f644 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-11.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-11.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fprofile-arcs -fdump-tree-lim1-details" } */
+/* { dg-options "-O -fprofile-arcs -fdump-tree-lim3-details" } */
 /* { dg-require-profiling "-fprofile-generate" } */
 
 struct thread_param
@@ -22,4 +22,4 @@ void access_buf(struct thread_param* p)
     }
 }
 
-/* { dg-final { scan-tree-dump-times "Executing store motion of __gcov0.access_buf\\\[\[01\]\\\] from loop 1" 2 "lim1" } } */
+/* { dg-final { scan-tree-dump-times "Executing store motion of __gcov0.access_buf\\\[\[01\]\\\] from loop 1" 2 "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-12.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-12.c
index 35f17d5..18b055f 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-12.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-12.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fdump-tree-lim1" } */
+/* { dg-options "-O -fdump-tree-lim3" } */
 
 int a[1024];
 
@@ -23,4 +23,4 @@ void bar (int x, int z)
     }
 }
 
-/* { dg-final { scan-tree-dump-times "!= 0 ? " 2 "lim1" } } */
+/* { dg-final { scan-tree-dump-times "!= 0 ? " 2 "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-2.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-2.c
index 8e72f78..9ef7bae 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-2.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fdump-tree-lim1" } */
+/* { dg-options "-O -fdump-tree-lim3" } */
 
 /* This is a variant that doesn't cause fold to place a cast to
    int before testing bit 1.  */
@@ -18,4 +18,4 @@ int size)
     }
 }
 
-/* { dg-final { scan-tree-dump-times "1 <<" 3 "lim1" } } */
+/* { dg-final { scan-tree-dump-times "1 <<" 3 "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-3.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-3.c
index 2035215..dc7f41a 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-3.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fdump-tree-lim1-details" } */
+/* { dg-options "-O -fdump-tree-lim3-details" } */
 
 struct { int x; int y; } global;
 void foo(int n)
@@ -9,5 +9,5 @@ void foo(int n)
     global.y += global.x*global.x;
 }
 
-/* { dg-final { scan-tree-dump "Executing store motion of global.y" "lim1" } } */
-/* { dg-final { scan-tree-dump "Moving statement.*global.x.*out of loop 1" "lim1" } } */
+/* { dg-final { scan-tree-dump "Executing store motion of global.y" "lim3" } } */
+/* { dg-final { scan-tree-dump "Moving statement.*global.x.*out of loop 1" "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-6.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-6.c
index 283d206..535d627 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-6.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-6.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-lim1-details" } */
+/* { dg-options "-O2 -fdump-tree-lim3-details" } */
 
 double a[16][64], y[64], x[16];
 void foo(void)
@@ -10,4 +10,4 @@ void foo(void)
       y[j] = y[j] + a[i][j] * x[i];
 }
 
-/* { dg-final { scan-tree-dump "Executing store motion of y" "lim1" } } */
+/* { dg-final { scan-tree-dump "Executing store motion of y" "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-7.c
index f9d685e..bf4e8ec 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-7.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-7.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fdump-tree-lim1-details" } */
+/* { dg-options "-O -fdump-tree-lim3-details" } */
 
 extern const int srcshift;
 
@@ -11,4 +11,4 @@ void foo (int *srcdata, int *dstdata)
     dstdata[i] = srcdata[i] << srcshift;
 }
 
-/* { dg-final { scan-tree-dump "Moving statement" "lim1" } } */
+/* { dg-final { scan-tree-dump "Moving statement" "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-8.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-8.c
index aaad0f0..fb69af3 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-8.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-8.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fdump-tree-lim1-details" } */
+/* { dg-options "-O -fdump-tree-lim3-details" } */
 
 void bar (int);
 void foo (int n, int m)
@@ -16,4 +16,4 @@ void foo (int n, int m)
     }
 }
 
-/* { dg-final { scan-tree-dump-times "Moving PHI node" 1 "lim1"  } } */
+/* { dg-final { scan-tree-dump-times "Moving PHI node" 1 "lim3"  } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-9.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-9.c
index 8abc2c7..9d2e817 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-9.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-9.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fdump-tree-lim1-details" } */
+/* { dg-options "-O -fdump-tree-lim3-details" } */
 
 void bar (int);
 void foo (int n, int m)
@@ -16,4 +16,4 @@ void foo (int n, int m)
     }
 }
 
-/* { dg-final { scan-tree-dump-times "Moving PHI node" 1 "lim1"  } } */
+/* { dg-final { scan-tree-dump-times "Moving PHI node" 1 "lim3"  } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/structopt-1.c b/gcc/testsuite/gcc.dg/tree-ssa/structopt-1.c
index 0582e26..6abcb6c 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/structopt-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/structopt-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-lim1-details" } */
+/* { dg-options "-O2 -fdump-tree-lim3-details" } */
 int x; int y;
 struct { int x; int y; } global;
 int foo() {
@@ -10,5 +10,5 @@ int foo() {
 		global.y += global.x*global.x;
 }
 
-/* { dg-final { scan-tree-dump-times "Executing store motion of global.y" 1 "lim1" } } */
+/* { dg-final { scan-tree-dump-times "Executing store motion of global.y" 1 "lim3" } } */
 /* XXX: We should also check for the load motion of global.x, but there is no easy way to do this.  */
diff --git a/gcc/testsuite/gcc.dg/vect/pr26359.c b/gcc/testsuite/gcc.dg/vect/pr26359.c
index 597ee7e..5b445a9 100644
--- a/gcc/testsuite/gcc.dg/vect/pr26359.c
+++ b/gcc/testsuite/gcc.dg/vect/pr26359.c
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
 /* { dg-require-effective-target vect_int } */
-/* { dg-additional-options "-fdump-tree-dce5-details" } */
+/* { dg-additional-options "-fdump-tree-dce6-details" } */
 
 int a[256], b[256], c[256];
 
@@ -13,4 +13,4 @@ foo () {
   }
 }
 
-/* { dg-final { scan-tree-dump-times "Deleting : vect_" 0 "dce5" } } */
+/* { dg-final { scan-tree-dump-times "Deleting : vect_" 0 "dce6" } } */
diff --git a/gcc/testsuite/gfortran.dg/pr32921.f b/gcc/testsuite/gfortran.dg/pr32921.f
index 1c45d1e..e7264b7 100644
--- a/gcc/testsuite/gfortran.dg/pr32921.f
+++ b/gcc/testsuite/gfortran.dg/pr32921.f
@@ -1,5 +1,5 @@
 ! { dg-do compile }
-! { dg-options "-O2 -fdump-tree-lim1" }
+! { dg-options "-O2 -fdump-tree-lim3" }
 ! gfortran -c -m32 -O2 -S junk.f
 !
       MODULE LES3D_DATA
@@ -45,4 +45,4 @@
 
       RETURN
       END
-! { dg-final { scan-tree-dump-times "stride" 4 "lim1" } }
+! { dg-final { scan-tree-dump-times "stride" 4 "lim3" } }
-- 
1.9.1


* [PATCH, 12/16] Handle acc loop directive
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
                   ` (10 preceding siblings ...)
  2015-11-09 20:02 ` [PATCH, 11/16] Update testcases after adding kernels pass group Tom de Vries
@ 2015-11-09 20:06 ` Tom de Vries
  2015-11-24 12:30   ` [PING][PATCH, " Tom de Vries
  2015-11-09 20:08 ` [PATCH, 13/16] Add c-c++-common/goacc/kernels-*.c Tom de Vries
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 20:06 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 1733 bytes --]

This patch handles loops in an oacc kernels region that are
annotated with "#pragma acc loop". It expands such a loop as a normal,
sequential loop, which has the effect of ignoring the "#pragma acc loop".
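
As a rough illustration (a sketch only, not taken from the patch; the
function names add_annotated and add_plain and the data clauses are
made up for this example), the intent is that the annotated loop in the
first function is expanded the same way as the plain loop in the second:

```c
/* Hypothetical example, not part of the patch.  */

/* A loop inside a kernels region, annotated with "#pragma acc loop".  */
void
add_annotated (unsigned int n, const int *a, const int *b, int *c)
{
#pragma acc kernels copyin (a[0:n], b[0:n]) copyout (c[0:n])
  {
#pragma acc loop
    for (unsigned int i = 0; i < n; i++)
      c[i] = a[i] + b[i];
  }
}

/* The same region without the annotation.  The assumption here is that
   the annotated loop above ends up as an ordinary loop like this one.  */
void
add_plain (unsigned int n, const int *a, const int *b, int *c)
{
#pragma acc kernels copyin (a[0:n], b[0:n]) copyout (c[0:n])
  {
    for (unsigned int i = 0; i < n; i++)
      c[i] = a[i] + b[i];
  }
}
```

Built without -fopenacc the pragmas are ignored, so both functions are
ordinary host loops and compute the same result.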

Thanks,
- Tom


[-- Attachment #2: 0012-Handle-acc-loop-directive.patch --]
[-- Type: text/x-patch, Size: 8736 bytes --]

Handle acc loop directive

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* omp-low.c (struct omp_region): Add inside_kernels_p field.
	(expand_omp_for_generic): Only set address taken for istart0
	and iend0 if necessary.  Adjust to generate a 'sequential' loop
	when GOMP builtin arguments are BUILT_IN_NONE.
	(expand_omp_for): Use expand_omp_for_generic() to generate a
	non-parallelized loop for OMP_FORs inside OpenACC kernels regions.
	(expand_omp): Mark inside_kernels_p field true for regions
	nested inside OpenACC kernels constructs.
---
 gcc/omp-low.c | 127 ++++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 87 insertions(+), 40 deletions(-)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 1283cc7..859a2eb 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -136,6 +136,9 @@ struct omp_region
   /* The ordered stmt if type is GIMPLE_OMP_ORDERED and it has
      a depend clause.  */
   gomp_ordered *ord_stmt;
+
+  /* True if this is nested inside an OpenACC kernels construct.  */
+  bool inside_kernels_p;
 };
 
 /* Context structure.  Used to store information about each parallel
@@ -8238,6 +8241,7 @@ expand_omp_for_generic (struct omp_region *region,
   gassign *assign_stmt;
   bool in_combined_parallel = is_combined_parallel (region);
   bool broken_loop = region->cont == NULL;
+  bool seq_loop = (start_fn == BUILT_IN_NONE || next_fn == BUILT_IN_NONE);
   edge e, ne;
   tree *counts = NULL;
   int i;
@@ -8335,8 +8339,12 @@ expand_omp_for_generic (struct omp_region *region,
   type = TREE_TYPE (fd->loop.v);
   istart0 = create_tmp_var (fd->iter_type, ".istart0");
   iend0 = create_tmp_var (fd->iter_type, ".iend0");
-  TREE_ADDRESSABLE (istart0) = 1;
-  TREE_ADDRESSABLE (iend0) = 1;
+
+  if (!seq_loop)
+    {
+      TREE_ADDRESSABLE (istart0) = 1;
+      TREE_ADDRESSABLE (iend0) = 1;
+    }
 
   /* See if we need to bias by LLONG_MIN.  */
   if (fd->iter_type == long_long_unsigned_type_node
@@ -8366,7 +8374,20 @@ expand_omp_for_generic (struct omp_region *region,
   gsi_prev (&gsif);
 
   tree arr = NULL_TREE;
-  if (in_combined_parallel)
+  if (seq_loop)
+    {
+      tree n1 = fold_convert (fd->iter_type, fd->loop.n1);
+      tree n2 = fold_convert (fd->iter_type, fd->loop.n2);
+
+      assign_stmt = gimple_build_assign (istart0, n1);
+      gsi_insert_before (&gsi, assign_stmt, GSI_SAME_STMT);
+
+      assign_stmt = gimple_build_assign (iend0, n2);
+      gsi_insert_before (&gsi, assign_stmt, GSI_SAME_STMT);
+
+      t = fold_build2 (NE_EXPR, boolean_type_node, istart0, iend0);
+    }
+  else if (in_combined_parallel)
     {
       gcc_assert (fd->ordered == 0);
       /* In a combined parallel loop, emit a call to
@@ -8788,39 +8809,45 @@ expand_omp_for_generic (struct omp_region *region,
 	collapse_bb = extract_omp_for_update_vars (fd, cont_bb, l1_bb);
 
       /* Emit code to get the next parallel iteration in L2_BB.  */
-      gsi = gsi_start_bb (l2_bb);
+      if (!seq_loop)
+	{
+	  gsi = gsi_start_bb (l2_bb);
 
-      t = build_call_expr (builtin_decl_explicit (next_fn), 2,
-			   build_fold_addr_expr (istart0),
-			   build_fold_addr_expr (iend0));
-      t = force_gimple_operand_gsi (&gsi, t, true, NULL_TREE,
-				    false, GSI_CONTINUE_LINKING);
-      if (TREE_TYPE (t) != boolean_type_node)
-	t = fold_build2 (NE_EXPR, boolean_type_node,
-			 t, build_int_cst (TREE_TYPE (t), 0));
-      gcond *cond_stmt = gimple_build_cond_empty (t);
-      gsi_insert_after (&gsi, cond_stmt, GSI_CONTINUE_LINKING);
+	  t = build_call_expr (builtin_decl_explicit (next_fn), 2,
+			       build_fold_addr_expr (istart0),
+			       build_fold_addr_expr (iend0));
+	  t = force_gimple_operand_gsi (&gsi, t, true, NULL_TREE,
+					false, GSI_CONTINUE_LINKING);
+	  if (TREE_TYPE (t) != boolean_type_node)
+	    t = fold_build2 (NE_EXPR, boolean_type_node,
+			     t, build_int_cst (TREE_TYPE (t), 0));
+	  gcond *cond_stmt = gimple_build_cond_empty (t);
+	  gsi_insert_after (&gsi, cond_stmt, GSI_CONTINUE_LINKING);
+	}
     }
 
   /* Add the loop cleanup function.  */
   gsi = gsi_last_bb (exit_bb);
-  if (gimple_omp_return_nowait_p (gsi_stmt (gsi)))
-    t = builtin_decl_explicit (BUILT_IN_GOMP_LOOP_END_NOWAIT);
-  else if (gimple_omp_return_lhs (gsi_stmt (gsi)))
-    t = builtin_decl_explicit (BUILT_IN_GOMP_LOOP_END_CANCEL);
-  else
-    t = builtin_decl_explicit (BUILT_IN_GOMP_LOOP_END);
-  gcall *call_stmt = gimple_build_call (t, 0);
-  if (gimple_omp_return_lhs (gsi_stmt (gsi)))
-    gimple_call_set_lhs (call_stmt, gimple_omp_return_lhs (gsi_stmt (gsi)));
-  gsi_insert_after (&gsi, call_stmt, GSI_SAME_STMT);
-  if (fd->ordered)
+  if (!seq_loop)
     {
-      tree arr = counts[fd->ordered];
-      tree clobber = build_constructor (TREE_TYPE (arr), NULL);
-      TREE_THIS_VOLATILE (clobber) = 1;
-      gsi_insert_after (&gsi, gimple_build_assign (arr, clobber),
-			GSI_SAME_STMT);
+      if (gimple_omp_return_nowait_p (gsi_stmt (gsi)))
+	t = builtin_decl_explicit (BUILT_IN_GOMP_LOOP_END_NOWAIT);
+      else if (gimple_omp_return_lhs (gsi_stmt (gsi)))
+	t = builtin_decl_explicit (BUILT_IN_GOMP_LOOP_END_CANCEL);
+      else
+	t = builtin_decl_explicit (BUILT_IN_GOMP_LOOP_END);
+      gcall *call_stmt = gimple_build_call (t, 0);
+      if (gimple_omp_return_lhs (gsi_stmt (gsi)))
+	gimple_call_set_lhs (call_stmt, gimple_omp_return_lhs (gsi_stmt (gsi)));
+      gsi_insert_after (&gsi, call_stmt, GSI_SAME_STMT);
+      if (fd->ordered)
+	{
+	  tree arr = counts[fd->ordered];
+	  tree clobber = build_constructor (TREE_TYPE (arr), NULL);
+	  TREE_THIS_VOLATILE (clobber) = 1;
+	  gsi_insert_after (&gsi, gimple_build_assign (arr, clobber),
+			    GSI_SAME_STMT);
+	}
     }
   gsi_remove (&gsi, true);
 
@@ -8833,7 +8860,9 @@ expand_omp_for_generic (struct omp_region *region,
       gimple_seq phis;
 
       e = find_edge (cont_bb, l3_bb);
-      ne = make_edge (l2_bb, l3_bb, EDGE_FALSE_VALUE);
+      ne = make_edge (l2_bb, l3_bb, (seq_loop
+				     ? EDGE_FALLTHRU
+				     : EDGE_FALSE_VALUE));
 
       phis = phi_nodes (l3_bb);
       for (gsi = gsi_start (phis); !gsi_end_p (gsi); gsi_next (&gsi))
@@ -8873,7 +8902,8 @@ expand_omp_for_generic (struct omp_region *region,
 	  e = find_edge (cont_bb, l2_bb);
 	  e->flags = EDGE_FALLTHRU;
 	}
-      make_edge (l2_bb, l0_bb, EDGE_TRUE_VALUE);
+      if (!seq_loop)
+	make_edge (l2_bb, l0_bb, EDGE_TRUE_VALUE);
 
       if (gimple_in_ssa_p (cfun))
 	{
@@ -8929,12 +8959,16 @@ expand_omp_for_generic (struct omp_region *region,
 
       add_bb_to_loop (l2_bb, outer_loop);
 
-      /* We've added a new loop around the original loop.  Allocate the
-	 corresponding loop struct.  */
-      struct loop *new_loop = alloc_loop ();
-      new_loop->header = l0_bb;
-      new_loop->latch = l2_bb;
-      add_loop (new_loop, outer_loop);
+      struct loop *new_loop = NULL;
+      if (!seq_loop)
+	{
+	  /* We've added a new loop around the original loop.  Allocate the
+	     corresponding loop struct.  */
+	  new_loop = alloc_loop ();
+	  new_loop->header = l0_bb;
+	  new_loop->latch = l2_bb;
+	  add_loop (new_loop, outer_loop);
+	}
 
       /* Allocate a loop structure for the original loop unless we already
 	 had one.  */
@@ -8944,7 +8978,9 @@ expand_omp_for_generic (struct omp_region *region,
 	  struct loop *orig_loop = alloc_loop ();
 	  orig_loop->header = l1_bb;
 	  /* The loop may have multiple latches.  */
-	  add_loop (orig_loop, new_loop);
+	  add_loop (orig_loop, (new_loop != NULL
+				? new_loop
+				: outer_loop));
 	}
     }
 }
@@ -11348,7 +11384,10 @@ expand_omp_for (struct omp_region *region, gimple *inner_stmt)
        original loops from being detected.  Fix that up.  */
     loops_state_set (LOOPS_NEED_FIXUP);
 
-  if (gimple_omp_for_kind (fd.for_stmt) & GF_OMP_FOR_SIMD)
+  if (region->inside_kernels_p)
+    expand_omp_for_generic (region, &fd, BUILT_IN_NONE, BUILT_IN_NONE,
+			    inner_stmt);
+  else if (gimple_omp_for_kind (fd.for_stmt) & GF_OMP_FOR_SIMD)
     expand_omp_simd (region, &fd);
   else if (gimple_omp_for_kind (fd.for_stmt) == GF_OMP_FOR_KIND_CILKFOR)
     expand_cilk_for (region, &fd);
@@ -13030,6 +13069,14 @@ expand_omp (struct omp_region *region)
       if (region->type == GIMPLE_OMP_PARALLEL)
 	determine_parallel_type (region);
 
+      if (region->type == GIMPLE_OMP_TARGET && region->inner)
+	{
+	  gomp_target *entry = as_a <gomp_target *> (last_stmt (region->entry));
+	  if (gimple_omp_target_kind (entry) == GF_OMP_TARGET_KIND_OACC_KERNELS
+	      || region->inside_kernels_p)
+	    region->inner->inside_kernels_p = true;
+	}
+
       if (region->type == GIMPLE_OMP_FOR
 	  && gimple_omp_for_combined_p (last_stmt (region->entry)))
 	inner_stmt = last_stmt (region->inner->entry);
-- 
1.9.1


* [PATCH, 13/16] Add c-c++-common/goacc/kernels-*.c
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
                   ` (11 preceding siblings ...)
  2015-11-09 20:06 ` [PATCH, 12/16] Handle acc loop directive Tom de Vries
@ 2015-11-09 20:08 ` Tom de Vries
  2016-01-18 13:33   ` [committed] Add oacc kernels tests in goacc Tom de Vries
  2015-11-09 20:09 ` [PATCH, 14/16] Add gfortran.dg/goacc/kernels-*.f95 Tom de Vries
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 20:08 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 1587 bytes --]

This patch adds C/C++ oacc kernels compilation tests.
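
For readers unfamiliar with this kind of test, a compile test in this
directory might look roughly like the sketch below. It is not one of the
files added by the patch: the function name foo, the array size and the
options in dg-additional-options are assumptions made for illustration.
A real test would typically also carry a dg-final directive such as
scan-tree-dump-times against the relevant pass dump.

```c
/* Hypothetical sketch of a kernels compile test; not one of the files
   added by this patch.  The options below are assumptions.  */
/* { dg-do compile } */
/* { dg-additional-options "-O2 -ftree-parallelize-loops=32" } */

#define N 1024

unsigned int a[N];
unsigned int b[N];
unsigned int c[N];

/* A single loop in a kernels region, with explicit data clauses.  */
void
foo (void)
{
#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
  {
    for (unsigned int i = 0; i < N; i++)
      c[i] = a[i] + b[i];
  }
}
```

The dg-* comments are only interpreted by the testsuite harness; to a
plain compiler invocation they are ordinary comments.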

Thanks,
- Tom


[-- Attachment #2: 0013-Add-c-c-common-goacc-kernels-.c.patch --]
[-- Type: text/x-patch, Size: 46798 bytes --]

Add c-c++-common/goacc/kernels-*.c

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* c-c++-common/goacc/kernels-acc-loop-reduction.c: New test.
	* c-c++-common/goacc/kernels-acc-loop-smaller-equal.c: New test.
	* c-c++-common/goacc/kernels-counter-var-redundant-load.c: New test.
	* c-c++-common/goacc/kernels-counter-vars-function-scope.c: New test.
	* c-c++-common/goacc/kernels-double-reduction.c: New test.
	* c-c++-common/goacc/kernels-empty.c: New test.
	* c-c++-common/goacc/kernels-eternal.c: New test.
	* c-c++-common/goacc/kernels-loop-2-acc-loop.c: New test.
	* c-c++-common/goacc/kernels-loop-2.c: New test.
	* c-c++-common/goacc/kernels-loop-3-acc-loop.c: New test.
	* c-c++-common/goacc/kernels-loop-3.c: New test.
	* c-c++-common/goacc/kernels-loop-acc-loop.c: New test.
	* c-c++-common/goacc/kernels-loop-data-2.c: New test.
	* c-c++-common/goacc/kernels-loop-data-enter-exit-2.c: New test.
	* c-c++-common/goacc/kernels-loop-data-enter-exit.c: New test.
	* c-c++-common/goacc/kernels-loop-data-update.c: New test.
	* c-c++-common/goacc/kernels-loop-data.c: New test.
	* c-c++-common/goacc/kernels-loop-g.c: New test.
	* c-c++-common/goacc/kernels-loop-mod-not-zero.c: New test.
	* c-c++-common/goacc/kernels-loop-n-acc-loop.c: New test.
	* c-c++-common/goacc/kernels-loop-n.c: New test.
	* c-c++-common/goacc/kernels-loop-nest.c: New test.
	* c-c++-common/goacc/kernels-loop.c: New test.
	* c-c++-common/goacc/kernels-noreturn.c: New test.
	* c-c++-common/goacc/kernels-one-counter-var.c: New test.
	* c-c++-common/goacc/kernels-parallel-loop-data-enter-exit.c: New test.
	* c-c++-common/goacc/kernels-reduction.c: New test.
---
 .../goacc/kernels-acc-loop-reduction.c             | 25 ++++++++
 .../goacc/kernels-acc-loop-smaller-equal.c         | 25 ++++++++
 .../goacc/kernels-counter-var-redundant-load.c     | 36 +++++++++++
 .../goacc/kernels-counter-vars-function-scope.c    | 54 +++++++++++++++++
 .../c-c++-common/goacc/kernels-double-reduction.c  | 37 ++++++++++++
 gcc/testsuite/c-c++-common/goacc/kernels-empty.c   |  6 ++
 gcc/testsuite/c-c++-common/goacc/kernels-eternal.c | 11 ++++
 .../c-c++-common/goacc/kernels-loop-2-acc-loop.c   | 21 +++++++
 gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c  | 70 ++++++++++++++++++++++
 .../c-c++-common/goacc/kernels-loop-3-acc-loop.c   | 17 ++++++
 gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c  | 49 +++++++++++++++
 .../c-c++-common/goacc/kernels-loop-acc-loop.c     | 17 ++++++
 .../c-c++-common/goacc/kernels-loop-data-2.c       | 70 ++++++++++++++++++++++
 .../goacc/kernels-loop-data-enter-exit-2.c         | 68 +++++++++++++++++++++
 .../goacc/kernels-loop-data-enter-exit.c           | 65 ++++++++++++++++++++
 .../c-c++-common/goacc/kernels-loop-data-update.c  | 65 ++++++++++++++++++++
 .../c-c++-common/goacc/kernels-loop-data.c         | 64 ++++++++++++++++++++
 gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c  | 17 ++++++
 .../c-c++-common/goacc/kernels-loop-mod-not-zero.c | 52 ++++++++++++++++
 .../c-c++-common/goacc/kernels-loop-n-acc-loop.c   | 17 ++++++
 gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c  | 56 +++++++++++++++++
 .../c-c++-common/goacc/kernels-loop-nest.c         | 39 ++++++++++++
 gcc/testsuite/c-c++-common/goacc/kernels-loop.c    | 56 +++++++++++++++++
 .../c-c++-common/goacc/kernels-noreturn.c          | 12 ++++
 .../c-c++-common/goacc/kernels-one-counter-var.c   | 54 +++++++++++++++++
 .../goacc/kernels-parallel-loop-data-enter-exit.c  | 66 ++++++++++++++++++++
 .../c-c++-common/goacc/kernels-reduction.c         | 36 +++++++++++
 27 files changed, 1105 insertions(+)
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-reduction.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-smaller-equal.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-counter-var-redundant-load.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-counter-vars-function-scope.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-double-reduction.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-empty.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-eternal.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-2-acc-loop.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-3-acc-loop.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-acc-loop.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-data-2.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-data-enter-exit-2.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-data-enter-exit.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-data-update.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-data.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-mod-not-zero.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-n-acc-loop.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-nest.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-noreturn.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-one-counter-var.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-parallel-loop-data-enter-exit.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-reduction.c

diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-reduction.c b/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-reduction.c
new file mode 100644
index 0000000..dcc5891
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-reduction.c
@@ -0,0 +1,25 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+unsigned int
+foo (int n, unsigned int *a)
+{
+  unsigned int sum = 0;
+
+#pragma acc kernels loop gang reduction(+:sum)
+  for (int i = 0; i < n; i++)
+    sum += a[i];
+
+  return sum;
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*\\._omp_fn\\.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-smaller-equal.c b/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-smaller-equal.c
new file mode 100644
index 0000000..c05c694
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-smaller-equal.c
@@ -0,0 +1,25 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+unsigned int
+foo (int n)
+{
+  unsigned int sum = 1;
+
+  #pragma acc kernels loop
+  for (int i = 1; i <= n; i++)
+    sum += i;
+
+  return sum;
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*\\._omp_fn\\.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-counter-var-redundant-load.c b/gcc/testsuite/c-c++-common/goacc/kernels-counter-var-redundant-load.c
new file mode 100644
index 0000000..ad101dd
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-counter-var-redundant-load.c
@@ -0,0 +1,36 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-dom_oacc_kernels3" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+COUNTERTYPE
+foo (unsigned int *c)
+{
+  COUNTERTYPE ii;
+
+#pragma acc kernels copyout (c[0:N])
+  {
+    for (ii = 0; ii < N; ii++)
+      c[ii] = 1;
+  }
+
+  return ii;
+}
+
+/* We're expecting:
+
+   .omp_data_i_10 = &.omp_data_arr.3;
+   _11 = .omp_data_i_10->ii;
+   *_11 = 0;
+   _15 = .omp_data_i_10->c;
+   c.1_16 = *_15;
+
+   Check that there is one load from an anonymous ssa-name, which we
+   assume to be:
+   - the one to read c.  */
+
+/* { dg-final { scan-tree-dump-times "(?n)\\*_\[0-9\]\[0-9\]*;$" 1 "dom_oacc_kernels3" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-counter-vars-function-scope.c b/gcc/testsuite/c-c++-common/goacc/kernels-counter-vars-function-scope.c
new file mode 100644
index 0000000..650fb8ca
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-counter-vars-function-scope.c
@@ -0,0 +1,54 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+  COUNTERTYPE i;
+  COUNTERTYPE ii;
+
+  a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+  for (i = 0; i < N; i++)
+    a[i] = i * 2;
+
+  for (i = 0; i < N; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+    for (ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-double-reduction.c b/gcc/testsuite/c-c++-common/goacc/kernels-double-reduction.c
new file mode 100644
index 0000000..da20f34
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-double-reduction.c
@@ -0,0 +1,37 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N 500
+
+unsigned int a[N][N];
+
+void  __attribute__((noinline,noclone))
+foo (void)
+{
+  int i, j;
+  unsigned int sum = 1;
+
+#pragma acc kernels copyin (a[0:N]) copy (sum)
+  {
+    for (i = 0; i < N; ++i)
+      for (j = 0; j < N; ++j)
+	sum += a[i][j];
+  }
+
+  if (sum != 5001)
+    abort ();
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-empty.c b/gcc/testsuite/c-c++-common/goacc/kernels-empty.c
new file mode 100644
index 0000000..e91b81c
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-empty.c
@@ -0,0 +1,6 @@
+void
+foo (void)
+{
+#pragma acc kernels
+  ;
+}
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-eternal.c b/gcc/testsuite/c-c++-common/goacc/kernels-eternal.c
new file mode 100644
index 0000000..edc17d2
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-eternal.c
@@ -0,0 +1,11 @@
+int
+main (void)
+{
+#pragma acc kernels
+  {
+    while (1)
+      ;
+  }
+
+  return 0;
+}
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-2-acc-loop.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-2-acc-loop.c
new file mode 100644
index 0000000..6a4fb1f
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-2-acc-loop.c
@@ -0,0 +1,21 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+/* Check that loops tagged with '#pragma acc loop' get properly parallelized.  */
+#define ACC_LOOP
+#include "kernels-loop-2.c"
+
+/* Check that only three loops are analyzed, and that all can be
+   parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that each loop has been split off into its own function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.1" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.2" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 3 "parloops_oacc_kernels" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c
new file mode 100644
index 0000000..514591e
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c
@@ -0,0 +1,70 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+#pragma acc kernels copyout (a[0:N])
+  {
+#ifdef ACC_LOOP
+    #pragma acc loop
+#endif
+    for (COUNTERTYPE i = 0; i < N; i++)
+      a[i] = i * 2;
+  }
+
+#pragma acc kernels copyout (b[0:N])
+  {
+#ifdef ACC_LOOP
+    #pragma acc loop
+#endif
+    for (COUNTERTYPE i = 0; i < N; i++)
+      b[i] = i * 4;
+  }
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+#ifdef ACC_LOOP
+    #pragma acc loop
+#endif
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only three loops are analyzed, and that all can be
+   parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that each loop has been split off into its own function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.1" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.2" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 3 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-3-acc-loop.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-3-acc-loop.c
new file mode 100644
index 0000000..a9e81ee
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-3-acc-loop.c
@@ -0,0 +1,17 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+/* Check that loops tagged with '#pragma acc loop' get properly parallelized.  */
+#define ACC_LOOP
+#include "kernels-loop-3.c"
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c
new file mode 100644
index 0000000..790add9
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c
@@ -0,0 +1,49 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int i;
+
+  unsigned int *__restrict c;
+
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    c[i] = i * 2;
+
+#pragma acc kernels copy (c[0:N])
+  {
+#ifdef ACC_LOOP
+    #pragma acc loop
+#endif
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = c[ii] + ii + 1;
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != i * 2 + i + 1)
+      abort ();
+
+  free (c);
+
+  return 0;
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-acc-loop.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-acc-loop.c
new file mode 100644
index 0000000..516598f
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-acc-loop.c
@@ -0,0 +1,17 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+/* Check that loops tagged with '#pragma acc loop' get properly parallelized.  */
+#define ACC_LOOP
+#include "kernels-loop.c"
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-data-2.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-data-2.c
new file mode 100644
index 0000000..095ed6c
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-data-2.c
@@ -0,0 +1,70 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+#pragma acc data copyout (a[0:N])
+  {
+#pragma acc kernels present (a[0:N])
+    {
+      for (COUNTERTYPE i = 0; i < N; i++)
+	a[i] = i * 2;
+    }
+  }
+
+#pragma acc data copyout (b[0:N])
+  {
+#pragma acc kernels present (b[0:N])
+    {
+      for (COUNTERTYPE i = 0; i < N; i++)
+	b[i] = i * 4;
+    }
+  }
+
+#pragma acc data copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
+    {
+      for (COUNTERTYPE ii = 0; ii < N; ii++)
+	c[ii] = a[ii] + b[ii];
+    }
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only three loops are analyzed, and that all can be
+   parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that each loop has been split off into its own function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.1" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.2" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 3 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-data-enter-exit-2.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-data-enter-exit-2.c
new file mode 100644
index 0000000..9efffac
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-data-enter-exit-2.c
@@ -0,0 +1,68 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+#pragma acc enter data create (a[0:N])
+#pragma acc kernels present (a[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      a[i] = i * 2;
+  }
+#pragma acc exit data copyout (a[0:N])
+
+#pragma acc enter data create (b[0:N])
+#pragma acc kernels present (b[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      b[i] = i * 4;
+  }
+#pragma acc exit data copyout (b[0:N])
+
+
+#pragma acc enter data copyin (a[0:N], b[0:N]) create (c[0:N])
+#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+#pragma acc exit data copyout (c[0:N])
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only three loops are analyzed, and that all can be
+   parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that each loop has been split off into its own function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.1" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.2" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 3 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-data-enter-exit.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-data-enter-exit.c
new file mode 100644
index 0000000..2da20b4
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-data-enter-exit.c
@@ -0,0 +1,65 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+#pragma acc enter data create (a[0:N], b[0:N], c[0:N])
+
+#pragma acc kernels present (a[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      a[i] = i * 2;
+  }
+
+#pragma acc kernels present (b[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      b[i] = i * 4;
+  }
+
+#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+#pragma acc exit data copyout (a[0:N], c[0:N])
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only three loops are analyzed, and that all can be
+   parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that each loop has been split off into its own function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.1" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.2" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 3 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-data-update.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-data-update.c
new file mode 100644
index 0000000..09b63e5
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-data-update.c
@@ -0,0 +1,65 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+#pragma acc enter data create (a[0:N], b[0:N], c[0:N])
+
+#pragma acc kernels present (a[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      a[i] = i * 2;
+  }
+
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      b[i] = i * 4;
+  }
+
+#pragma acc update device (b[0:N])
+
+#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+#pragma acc exit data copyout (a[0:N], c[0:N])
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only two loops are analyzed, and that both can be
+   parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that each loop has been split off into its own function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.1" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 2 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-data.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-data.c
new file mode 100644
index 0000000..437fd73
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-data.c
@@ -0,0 +1,64 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+#pragma acc data copyout (a[0:N], b[0:N], c[0:N])
+  {
+#pragma acc kernels present (a[0:N])
+    {
+      for (COUNTERTYPE i = 0; i < N; i++)
+	a[i] = i * 2;
+    }
+
+#pragma acc kernels present (b[0:N])
+    {
+      for (COUNTERTYPE i = 0; i < N; i++)
+	b[i] = i * 4;
+    }
+
+#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
+    {
+      for (COUNTERTYPE ii = 0; ii < N; ii++)
+	c[ii] = a[ii] + b[ii];
+    }
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only three loops are analyzed, and that all can be
+   parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that each loop has been split off into its own function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.1" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.2" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 3 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c
new file mode 100644
index 0000000..27e23f8
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c
@@ -0,0 +1,17 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-g" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include "kernels-loop.c"
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-mod-not-zero.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-mod-not-zero.c
new file mode 100644
index 0000000..940341d
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-mod-not-zero.c
@@ -0,0 +1,52 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N ((1024 * 512) + 1)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    a[i] = i * 2;
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-n-acc-loop.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-n-acc-loop.c
new file mode 100644
index 0000000..64e59a2
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-n-acc-loop.c
@@ -0,0 +1,17 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+/* Check that loops tagged with '#pragma acc loop' get properly parallelized.  */
+#define ACC_LOOP
+#include "kernels-loop-n.c"
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c
new file mode 100644
index 0000000..73c6142
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c
@@ -0,0 +1,56 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N ((1024 * 512) + 1)
+#define COUNTERTYPE unsigned int
+
+int
+foo (COUNTERTYPE n)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (n * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (n * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (n * sizeof (unsigned int));
+
+  for (COUNTERTYPE i = 0; i < n; i++)
+    a[i] = i * 2;
+
+  for (COUNTERTYPE i = 0; i < n; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:n], b[0:n]) copyout (c[0:n])
+  {
+#ifdef ACC_LOOP
+    #pragma acc loop
+#endif
+    for (COUNTERTYPE ii = 0; ii < n; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < n; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-nest.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-nest.c
new file mode 100644
index 0000000..d2aeda6
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-nest.c
@@ -0,0 +1,39 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+/* Based on autopar/outer-1.c.  */
+
+#include <stdlib.h>
+
+#define N 1000
+
+int
+main (void)
+{
+  int x[N][N];
+
+#pragma acc kernels copyout (x)
+  {
+    for (int ii = 0; ii < N; ii++)
+      for (int jj = 0; jj < N; jj++)
+	x[ii][jj] = ii + jj + 3;
+  }
+
+  for (int i = 0; i < N; i++)
+    for (int j = 0; j < N; j++)
+      if (x[i][j] != i + j + 3)
+	abort ();
+
+  return 0;
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop.c
new file mode 100644
index 0000000..925a84e
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop.c
@@ -0,0 +1,56 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    a[i] = i * 2;
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+#ifdef ACC_LOOP
+    #pragma acc loop
+#endif
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-noreturn.c b/gcc/testsuite/c-c++-common/goacc/kernels-noreturn.c
new file mode 100644
index 0000000..1a8cc67
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-noreturn.c
@@ -0,0 +1,12 @@
+int
+main (void)
+{
+
+#pragma acc kernels
+  {
+    __builtin_abort ();
+  }
+
+  return 0;
+}
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-one-counter-var.c b/gcc/testsuite/c-c++-common/goacc/kernels-one-counter-var.c
new file mode 100644
index 0000000..b000a8c
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-one-counter-var.c
@@ -0,0 +1,54 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+  COUNTERTYPE i;
+
+  a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+  for (i = 0; i < N; i++)
+    a[i] = i * 2;
+
+  for (i = 0; i < N; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+    for (i = 0; i < N; i++)
+      c[i] = a[i] + b[i];
+  }
+
+  for (i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-parallel-loop-data-enter-exit.c b/gcc/testsuite/c-c++-common/goacc/kernels-parallel-loop-data-enter-exit.c
new file mode 100644
index 0000000..31b06bd
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-parallel-loop-data-enter-exit.c
@@ -0,0 +1,66 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+#pragma acc enter data create (a[0:N], b[0:N], c[0:N])
+
+#pragma acc kernels present (a[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      a[i] = i * 2;
+  }
+
+#pragma acc parallel present (b[0:N])
+  {
+#pragma acc loop
+    for (COUNTERTYPE i = 0; i < N; i++)
+      b[i] = i * 4;
+  }
+
+#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+#pragma acc exit data copyout (a[0:N], b[0:N], c[0:N])
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only two loops are analyzed, and that both can be
+   parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loops have been split off into functions.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.1" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.2" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 2 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-reduction.c b/gcc/testsuite/c-c++-common/goacc/kernels-reduction.c
new file mode 100644
index 0000000..6a0b7a2
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-reduction.c
@@ -0,0 +1,36 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define n 10000
+
+unsigned int a[n];
+
+void  __attribute__((noinline,noclone))
+foo (void)
+{
+  int i;
+  unsigned int sum = 1;
+
+#pragma acc kernels copyin (a[0:n]) copy (sum)
+  {
+    for (i = 0; i < n; ++i)
+      sum += a[i];
+  }
+
+  if (sum != 5001)
+    abort ();
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
+
-- 
1.9.1


^ permalink raw reply	[flat|nested] 133+ messages in thread

* [PATCH, 14/16] Add gfortran.dg/goacc/kernels-*.f95
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
                   ` (12 preceding siblings ...)
  2015-11-09 20:08 ` [PATCH, 13/16] Add c-c++-common/goacc/kernels-*.c Tom de Vries
@ 2015-11-09 20:09 ` Tom de Vries
  2015-11-09 20:11 ` [PATCH, 15/16] Add libgomp.oacc-c-c++-common/kernels-*.c Tom de Vries
  2015-11-09 20:12 ` [PATCH, 16/16] Add libgomp.oacc-fortran/kernels-*.f95 Tom de Vries
  15 siblings, 0 replies; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 20:09 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 1589 bytes --]

This patch adds Fortran oacc kernels compilation tests.

Thanks,
- Tom


[-- Attachment #2: 0014-Add-gfortran.dg-goacc-kernels-.f95.patch --]
[-- Type: text/x-patch, Size: 16544 bytes --]

Add gfortran.dg/goacc/kernels-*.f95

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* gfortran.dg/goacc/kernels-loop-2.f95: New test.
	* gfortran.dg/goacc/kernels-loop-data-2.f95: New test.
	* gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95: New test.
	* gfortran.dg/goacc/kernels-loop-data-enter-exit.f95: New test.
	* gfortran.dg/goacc/kernels-loop-data-update.f95: New test.
	* gfortran.dg/goacc/kernels-loop-data.f95: New test.
	* gfortran.dg/goacc/kernels-loop.f95: New test.
	* gfortran.dg/goacc/kernels-parallel-loop-data-enter-exit.f95: New test.
---
 gcc/testsuite/gfortran.dg/goacc/kernels-loop-2.f95 | 45 +++++++++++++++++++
 .../gfortran.dg/goacc/kernels-loop-data-2.f95      | 51 ++++++++++++++++++++++
 .../goacc/kernels-loop-data-enter-exit-2.f95       | 51 ++++++++++++++++++++++
 .../goacc/kernels-loop-data-enter-exit.f95         | 49 +++++++++++++++++++++
 .../gfortran.dg/goacc/kernels-loop-data-update.f95 | 48 ++++++++++++++++++++
 .../gfortran.dg/goacc/kernels-loop-data.f95        | 49 +++++++++++++++++++++
 gcc/testsuite/gfortran.dg/goacc/kernels-loop.f95   | 39 +++++++++++++++++
 .../kernels-parallel-loop-data-enter-exit.f95      | 50 +++++++++++++++++++++
 8 files changed, 382 insertions(+)
 create mode 100644 gcc/testsuite/gfortran.dg/goacc/kernels-loop-2.f95
 create mode 100644 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-2.f95
 create mode 100644 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95
 create mode 100644 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit.f95
 create mode 100644 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-update.f95
 create mode 100644 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data.f95
 create mode 100644 gcc/testsuite/gfortran.dg/goacc/kernels-loop.f95
 create mode 100644 gcc/testsuite/gfortran.dg/goacc/kernels-parallel-loop-data-enter-exit.f95

diff --git a/gcc/testsuite/gfortran.dg/goacc/kernels-loop-2.f95 b/gcc/testsuite/gfortran.dg/goacc/kernels-loop-2.f95
new file mode 100644
index 0000000..7fd6d4e
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/goacc/kernels-loop-2.f95
@@ -0,0 +1,45 @@
+! { dg-additional-options "-O2" }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" }
+! { dg-additional-options "-fdump-tree-optimized" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc kernels copyout (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+
+  !$acc kernels copyout (b(0:n-1))
+  do i = 0, n -1
+     b(i) = i * 4
+  end do
+  !$acc end kernels
+
+  !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
+
+! Check that only three loops are analyzed, and that all can be parallelized.
+! { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops_oacc_kernels" } }
+! { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } }
+
+! Check that the loops have been split off into functions.
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.2 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 3 "parloops_oacc_kernels" } }
diff --git a/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-2.f95 b/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-2.f95
new file mode 100644
index 0000000..f788f67
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-2.f95
@@ -0,0 +1,51 @@
+! { dg-additional-options "-O2" }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" }
+! { dg-additional-options "-fdump-tree-optimized" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc data copyout (a(0:n-1))
+  !$acc kernels present (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+  !$acc end data
+
+  !$acc data copyout (b(0:n-1))
+  !$acc kernels present (b(0:n-1))
+  do i = 0, n -1
+     b(i) = i * 4
+  end do
+  !$acc end kernels
+  !$acc end data
+
+  !$acc data copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
+  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+  !$acc end data
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
+
+! Check that only three loops are analyzed, and that all can be parallelized.
+! { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops_oacc_kernels" } }
+! { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } }
+
+! Check that the loops have been split off into functions.
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.2 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 3 "parloops_oacc_kernels" } }
diff --git a/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95 b/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95
new file mode 100644
index 0000000..3599052
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95
@@ -0,0 +1,51 @@
+! { dg-additional-options "-O2" }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" }
+! { dg-additional-options "-fdump-tree-optimized" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc enter data create (a(0:n-1))
+  !$acc kernels present (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+  !$acc exit data copyout (a(0:n-1))
+
+  !$acc enter data create (b(0:n-1))
+  !$acc kernels present (b(0:n-1))
+  do i = 0, n -1
+     b(i) = i * 4
+  end do
+  !$acc end kernels
+  !$acc exit data copyout (b(0:n-1))
+
+  !$acc enter data copyin (a(0:n-1), b(0:n-1)) create (c(0:n-1))
+  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+  !$acc exit data copyout (c(0:n-1))
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
+
+! Check that only three loops are analyzed, and that all can be parallelized.
+! { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops_oacc_kernels" } }
+! { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } }
+
+! Check that the loops have been split off into functions.
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.2 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 3 "parloops_oacc_kernels" } }
diff --git a/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit.f95 b/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit.f95
new file mode 100644
index 0000000..562422e
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit.f95
@@ -0,0 +1,49 @@
+! { dg-additional-options "-O2" }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" }
+! { dg-additional-options "-fdump-tree-optimized" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc enter data create (a(0:n-1), b(0:n-1), c(0:n-1))
+
+  !$acc kernels present (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+
+  !$acc kernels present (b(0:n-1))
+  do i = 0, n -1
+     b(i) = i * 4
+  end do
+  !$acc end kernels
+
+  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+
+  !$acc exit data copyout (a(0:n-1), b(0:n-1), c(0:n-1))
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
+
+! Check that only three loops are analyzed, and that all can be parallelized.
+! { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops_oacc_kernels" } }
+! { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } }
+
+! Check that the loops have been split off into functions.
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.2 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 3 "parloops_oacc_kernels" } }
diff --git a/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-update.f95 b/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-update.f95
new file mode 100644
index 0000000..ed18fe1
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-update.f95
@@ -0,0 +1,48 @@
+! { dg-additional-options "-O2" }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" }
+! { dg-additional-options "-fdump-tree-optimized" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc enter data create (a(0:n-1), b(0:n-1), c(0:n-1))
+
+  !$acc kernels present (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+
+  do i = 0, n -1
+     b(i) = i * 4
+  end do
+
+  !$acc update device (b(0:n-1))
+
+  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+
+  !$acc exit data copyout (a(0:n-1), c(0:n-1))
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
+
+! Check that only two loops are analyzed, and that both can be parallelized.
+! { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloops_oacc_kernels" } }
+! { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } }
+
+! Check that the loops have been split off into functions.
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 2 "parloops_oacc_kernels" } }
diff --git a/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data.f95 b/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data.f95
new file mode 100644
index 0000000..177aa64
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data.f95
@@ -0,0 +1,49 @@
+! { dg-additional-options "-O2" }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" }
+! { dg-additional-options "-fdump-tree-optimized" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc data copyout (a(0:n-1), b(0:n-1), c(0:n-1))
+
+  !$acc kernels present (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+
+  !$acc kernels present (b(0:n-1))
+  do i = 0, n -1
+     b(i) = i * 4
+  end do
+  !$acc end kernels
+
+  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+
+  !$acc end data
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
+
+! Check that only three loops are analyzed, and that all can be parallelized.
+! { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops_oacc_kernels" } }
+! { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } }
+
+! Check that the loops have been split off into functions.
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.2 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 3 "parloops_oacc_kernels" } }
diff --git a/gcc/testsuite/gfortran.dg/goacc/kernels-loop.f95 b/gcc/testsuite/gfortran.dg/goacc/kernels-loop.f95
new file mode 100644
index 0000000..c9364dd
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/goacc/kernels-loop.f95
@@ -0,0 +1,39 @@
+! { dg-additional-options "-O2" }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" }
+! { dg-additional-options "-fdump-tree-optimized" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+
+  do i = 0, n -1
+     b(i) = i * 4
+  end do
+
+  !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
+
+! Check that only one loop is analyzed, and that it can be parallelized.
+! { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } }
+! { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } }
+
+! Check that the loop has been split off into a function.
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } }
diff --git a/gcc/testsuite/gfortran.dg/goacc/kernels-parallel-loop-data-enter-exit.f95 b/gcc/testsuite/gfortran.dg/goacc/kernels-parallel-loop-data-enter-exit.f95
new file mode 100644
index 0000000..d805938
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/goacc/kernels-parallel-loop-data-enter-exit.f95
@@ -0,0 +1,50 @@
+! { dg-additional-options "-O2" }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" }
+! { dg-additional-options "-fdump-tree-optimized" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc enter data create (a(0:n-1), b(0:n-1), c(0:n-1))
+
+  !$acc kernels present (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+
+  !$acc parallel present (b(0:n-1))
+  !$acc loop
+  do i = 0, n -1
+     b(i) = i * 4
+  end do
+  !$acc end parallel
+
+  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+
+  !$acc exit data copyout (a(0:n-1), b(0:n-1), c(0:n-1))
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
+
+! Check that only two loops are analyzed, and that both can be parallelized.
+! { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloops_oacc_kernels" } }
+! { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } }
+
+! Check that the loops have been split off into functions.
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.2 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 2 "parloops_oacc_kernels" } }
-- 
1.9.1


^ permalink raw reply	[flat|nested] 133+ messages in thread

* [PATCH, 15/16] Add libgomp.oacc-c-c++-common/kernels-*.c
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
                   ` (13 preceding siblings ...)
  2015-11-09 20:09 ` [PATCH, 14/16] Add gfortran.dg/goacc/kernels-*.f95 Tom de Vries
@ 2015-11-09 20:11 ` Tom de Vries
  2016-01-18 13:39   ` [comitted] Add oacc kernels test in libgomp Tom de Vries
  2016-03-09  9:18   ` [PATCH, 15/16] Add libgomp.oacc-c-c++-common/kernels-*.c Tom de Vries
  2015-11-09 20:12 ` [PATCH, 16/16] Add libgomp.oacc-fortran/kernels-*.f95 Tom de Vries
  15 siblings, 2 replies; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 20:11 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 1585 bytes --]

This patch adds C/C++ oacc kernels execution tests.
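
As a rough illustration of the shape most of these tests share (a sketch
only, not a file from the patch; vector_add and run_vector_add_check are
hypothetical names), a test fills its inputs, runs a loop inside an acc
kernels region, and then verifies the result on the host:

```c
#include <stdlib.h>

/* Hypothetical helper, not part of the patch: add two arrays inside an
   acc kernels region.  Without -fopenacc the pragma is ignored and the
   loop simply runs on the host.  */
static void
vector_add (unsigned int n, const unsigned int *a, const unsigned int *b,
	    unsigned int *c)
{
#pragma acc kernels copyin (a[0:n], b[0:n]) copyout (c[0:n])
  {
    for (unsigned int i = 0; i < n; i++)
      c[i] = a[i] + b[i];
  }
}

/* Hypothetical driver: returns 0 if every element matches the host-side
   reference, 1 otherwise (including allocation failure).  */
static int
run_vector_add_check (unsigned int n)
{
  unsigned int *a = (unsigned int *) malloc (n * sizeof (unsigned int));
  unsigned int *b = (unsigned int *) malloc (n * sizeof (unsigned int));
  unsigned int *c = (unsigned int *) malloc (n * sizeof (unsigned int));
  int failed = (a == NULL || b == NULL || c == NULL);

  if (!failed)
    {
      /* Same input pattern as the tests: a[i] = 2i, b[i] = 4i.  */
      for (unsigned int i = 0; i < n; i++)
	{
	  a[i] = i * 2;
	  b[i] = i * 4;
	}

      vector_add (n, a, b, c);

      /* Host-side verification of the offloaded result.  */
      for (unsigned int i = 0; i < n; i++)
	if (c[i] != a[i] + b[i])
	  failed = 1;
    }

  free (a);
  free (b);
  free (c);
  return failed;
}
```

The tests in the patch call abort () on a mismatch instead of returning a
status, and are additionally compiled with -ftree-parallelize-loops=32.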

Thanks,
- Tom


[-- Attachment #2: 0015-Add-libgomp.oacc-c-c-common-kernels-.c.patch --]
[-- Type: text/x-patch, Size: 27286 bytes --]

Add libgomp.oacc-c-c++-common/kernels-*.c

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c: New test.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-2.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit-2.c:
	Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit.c:
	Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-update.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-parallel-loop-data-enter-exit.c:
	Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c: Same.
---
 .../libgomp.oacc-c-c++-common/kernels-loop-2.c     | 47 ++++++++++++++++++
 .../libgomp.oacc-c-c++-common/kernels-loop-3.c     | 34 +++++++++++++
 .../kernels-loop-and-seq-2.c                       | 36 ++++++++++++++
 .../kernels-loop-and-seq-3.c                       | 37 ++++++++++++++
 .../kernels-loop-and-seq-4.c                       | 36 ++++++++++++++
 .../kernels-loop-and-seq-5.c                       | 37 ++++++++++++++
 .../kernels-loop-and-seq-6.c                       | 36 ++++++++++++++
 .../kernels-loop-and-seq.c                         | 37 ++++++++++++++
 .../kernels-loop-collapse.c                        | 40 ++++++++++++++++
 .../kernels-loop-data-2.c                          | 56 ++++++++++++++++++++++
 .../kernels-loop-data-enter-exit-2.c               | 54 +++++++++++++++++++++
 .../kernels-loop-data-enter-exit.c                 | 51 ++++++++++++++++++++
 .../kernels-loop-data-update.c                     | 53 ++++++++++++++++++++
 .../libgomp.oacc-c-c++-common/kernels-loop-data.c  | 50 +++++++++++++++++++
 .../libgomp.oacc-c-c++-common/kernels-loop-g.c     |  5 ++
 .../kernels-loop-mod-not-zero.c                    | 41 ++++++++++++++++
 .../libgomp.oacc-c-c++-common/kernels-loop-n.c     | 47 ++++++++++++++++++
 .../libgomp.oacc-c-c++-common/kernels-loop-nest.c  | 26 ++++++++++
 .../libgomp.oacc-c-c++-common/kernels-loop.c       | 41 ++++++++++++++++
 .../kernels-parallel-loop-data-enter-exit.c        | 52 ++++++++++++++++++++
 .../libgomp.oacc-c-c++-common/kernels-reduction.c  | 37 ++++++++++++++
 21 files changed, 853 insertions(+)
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-2.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit-2.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-update.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-parallel-loop-data-enter-exit.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c
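
Note on kernels-reduction.c: the expected value 5001 follows from the
initial sum of 1 plus the 5000 odd indices below 10000, each of which
contributes a[i] = i % 2 = 1.  A minimal host-side sketch of that
arithmetic (reference_sum and check_expected are hypothetical helpers,
not part of the patch):

```c
#include <stdlib.h>

/* Hypothetical helper, not part of the patch: host-side reference for
   the sum that the test accumulates over a[i] = i % 2.  */
static unsigned int
reference_sum (unsigned int n, unsigned int initial)
{
  unsigned int sum = initial;

  for (unsigned int i = 0; i < n; ++i)
    sum += i % 2;

  return sum;
}

/* Mirrors the check in the test: abort on a mismatch.  */
static void
check_expected (void)
{
  if (reference_sum (10000, 1) != 5001)
    abort ();
}
```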

diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c
new file mode 100644
index 0000000..13e57bd
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c
@@ -0,0 +1,47 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+#pragma acc kernels copyout (a[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      a[i] = i * 2;
+  }
+
+#pragma acc kernels copyout (b[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      b[i] = i * 4;
+  }
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c
new file mode 100644
index 0000000..f61a74a
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c
@@ -0,0 +1,34 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int i;
+
+  unsigned int *__restrict c;
+
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    c[i] = i * 2;
+
+#pragma acc kernels copy (c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = c[ii] + ii + 1;
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != i * 2 + i + 1)
+      abort ();
+
+  free (c);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
new file mode 100644
index 0000000..2e4100f
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
@@ -0,0 +1,36 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N 32
+
+unsigned int
+foo (int n, unsigned int *a)
+{
+#pragma acc kernels copy (a[0:N])
+  {
+    a[0] = a[0] + 1;
+
+    for (int i = 0; i < n; i++)
+      a[i] = 1;
+  }
+
+  return a[0];
+}
+
+int
+main (void)
+{
+  unsigned int a[N];
+  unsigned res, i;
+
+  for (i = 0; i < N; ++i)
+    a[i] = i % 4;
+
+  res = foo (N, a);
+  if (res != 1)
+    abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
new file mode 100644
index 0000000..b3e736b
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
@@ -0,0 +1,37 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N 32
+
+unsigned int
+foo (int n, unsigned int *a)
+{
+
+#pragma acc kernels copy (a[0:N])
+  {
+    for (int i = 0; i < n; i++)
+      a[i] = 1;
+
+    a[0] = 2;
+  }
+
+  return a[0];
+}
+
+int
+main (void)
+{
+  unsigned int a[N];
+  unsigned res, i;
+
+  for (i = 0; i < N; ++i)
+    a[i] = i % 4;
+
+  res = foo (N, a);
+  if (res != 2)
+    abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
new file mode 100644
index 0000000..8b9affa
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
@@ -0,0 +1,36 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N 32
+
+unsigned int
+foo (int n, unsigned int *a)
+{
+#pragma acc kernels copy (a[0:N])
+  {
+    a[0] = 2;
+
+    for (int i = 0; i < n; i++)
+      a[i] = 1;
+  }
+
+  return a[0];
+}
+
+int
+main (void)
+{
+  unsigned int a[N];
+  unsigned res, i;
+
+  for (i = 0; i < N; ++i)
+    a[i] = i % 4;
+
+  res = foo (N, a);
+  if (res != 1)
+    abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
new file mode 100644
index 0000000..83d4e7f
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
@@ -0,0 +1,37 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N 32
+
+unsigned int
+foo (int n, unsigned int *a)
+{
+  int r;
+#pragma acc kernels copyout(r) copy (a[0:N])
+  {
+    r = a[0];
+
+    for (int i = 0; i < n; i++)
+      a[i] = 1;
+  }
+
+  return r;
+}
+
+int
+main (void)
+{
+  unsigned int a[N];
+  unsigned res, i;
+
+  for (i = 0; i < N; ++i)
+    a[i] = i % 4;
+
+  res = foo (N, a);
+  if (res != 0)
+    abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
new file mode 100644
index 0000000..01d5e5e
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
@@ -0,0 +1,36 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N 32
+
+unsigned int
+foo (int n, unsigned int *a)
+{
+#pragma acc kernels copy (a[0:N])
+  {
+    int r = a[0];
+
+    for (int i = 0; i < n; i++)
+      a[i] = 1 + r;
+  }
+
+  return a[0];
+}
+
+int
+main (void)
+{
+  unsigned int a[N];
+  unsigned res, i;
+
+  for (i = 0; i < N; ++i)
+    a[i] = i % 4;
+
+  res = foo (N, a);
+  if (res != 1)
+    abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
new file mode 100644
index 0000000..61d1283
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
@@ -0,0 +1,37 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N 32
+
+unsigned int
+foo (int n, unsigned int *a)
+{
+
+#pragma acc kernels copy (a[0:N])
+  {
+    for (int i = 0; i < n; i++)
+      a[i] = 1;
+
+    a[0] = a[0] + 1;
+  }
+
+  return a[0];
+}
+
+int
+main (void)
+{
+  unsigned int a[N];
+  unsigned res, i;
+
+  for (i = 0; i < N; ++i)
+    a[i] = i % 4;
+
+  res = foo (N, a);
+  if (res != 2)
+    abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
new file mode 100644
index 0000000..f7f04cb
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
@@ -0,0 +1,40 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N 100
+
+int a[N][N];
+
+void __attribute__((noinline, noclone))
+foo (int m, int n)
+{
+  int i, j;
+#pragma acc kernels
+  {
+#pragma acc loop collapse(2)
+    for (i = 0; i < m; i++)
+      for (j = 0; j < n; j++)
+	a[i][j] = 1;
+  }
+}
+
+int
+main (void)
+{
+  int i, j;
+
+  for (i = 0; i < N; i++)
+    for (j = 0; j < N; j++)
+      a[i][j] = 0;
+
+  foo (N, N);
+
+  for (i = 0; i < N; i++)
+    for (j = 0; j < N; j++)
+      if (a[i][j] != 1)
+	abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-2.c
new file mode 100644
index 0000000..b889ef9
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-2.c
@@ -0,0 +1,56 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+#pragma acc data copyout (a[0:N])
+  {
+#pragma acc kernels present (a[0:N])
+    {
+      for (COUNTERTYPE i = 0; i < N; i++)
+	a[i] = i * 2;
+    }
+  }
+
+#pragma acc data copyout (b[0:N])
+  {
+#pragma acc kernels present (b[0:N])
+    {
+      for (COUNTERTYPE i = 0; i < N; i++)
+	b[i] = i * 4;
+    }
+  }
+
+#pragma acc data copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
+    {
+      for (COUNTERTYPE ii = 0; ii < N; ii++)
+	c[ii] = a[ii] + b[ii];
+    }
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit-2.c
new file mode 100644
index 0000000..d508a44
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit-2.c
@@ -0,0 +1,54 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+#pragma acc enter data create (a[0:N])
+#pragma acc kernels present (a[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      a[i] = i * 2;
+  }
+#pragma acc exit data copyout (a[0:N])
+
+#pragma acc enter data create (b[0:N])
+#pragma acc kernels present (b[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      b[i] = i * 4;
+  }
+#pragma acc exit data copyout (b[0:N])
+
+
+#pragma acc enter data copyin (a[0:N], b[0:N]) create (c[0:N])
+#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+#pragma acc exit data copyout (c[0:N])
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit.c
new file mode 100644
index 0000000..11d82f7
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit.c
@@ -0,0 +1,51 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+#pragma acc enter data create (a[0:N], b[0:N], c[0:N])
+
+#pragma acc kernels present (a[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      a[i] = i * 2;
+  }
+
+#pragma acc kernels present (b[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      b[i] = i * 4;
+  }
+
+#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+#pragma acc exit data copyout (a[0:N], b[0:N], c[0:N])
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-update.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-update.c
new file mode 100644
index 0000000..a7d4e84
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-update.c
@@ -0,0 +1,53 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+#pragma acc enter data create (a[0:N], b[0:N], c[0:N])
+
+#pragma acc kernels present (a[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      a[i] = i * 2;
+  }
+
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      b[i] = i * 4;
+  }
+
+#pragma acc update device (b[0:N])
+
+#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+#pragma acc exit data copyout (a[0:N], c[0:N])
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data.c
new file mode 100644
index 0000000..607d7de
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data.c
@@ -0,0 +1,50 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+#pragma acc data copyout (a[0:N], b[0:N], c[0:N])
+  {
+#pragma acc kernels present (a[0:N])
+    {
+      for (COUNTERTYPE i = 0; i < N; i++)
+	a[i] = i * 2;
+    }
+
+#pragma acc kernels present (b[0:N])
+    {
+      for (COUNTERTYPE i = 0; i < N; i++)
+	b[i] = i * 4;
+    }
+
+#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
+    {
+      for (COUNTERTYPE ii = 0; ii < N; ii++)
+	c[ii] = a[ii] + b[ii];
+    }
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c
new file mode 100644
index 0000000..96b6e4e
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c
@@ -0,0 +1,5 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-g" } */
+
+#include "kernels-loop.c"
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c
new file mode 100644
index 0000000..1433cb2
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c
@@ -0,0 +1,41 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N ((1024 * 512) + 1)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    a[i] = i * 2;
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c
new file mode 100644
index 0000000..fd0d5b1
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c
@@ -0,0 +1,47 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N ((1024 * 512) + 1)
+#define COUNTERTYPE unsigned int
+
+static int __attribute__((noinline,noclone))
+foo (COUNTERTYPE n)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (n * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (n * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (n * sizeof (unsigned int));
+
+  for (COUNTERTYPE i = 0; i < n; i++)
+    a[i] = i * 2;
+
+  for (COUNTERTYPE i = 0; i < n; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:n], b[0:n]) copyout (c[0:n])
+  {
+    for (COUNTERTYPE ii = 0; ii < n; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < n; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+int
+main (void)
+{
+  return foo (N);
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c
new file mode 100644
index 0000000..21d2599
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c
@@ -0,0 +1,26 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N 1000
+
+int
+main (void)
+{
+  int x[N][N];
+
+#pragma acc kernels copyout (x)
+  {
+    for (int ii = 0; ii < N; ii++)
+      for (int jj = 0; jj < N; jj++)
+	x[ii][jj] = ii + jj + 3;
+  }
+
+  for (int i = 0; i < N; i++)
+    for (int j = 0; j < N; j++)
+      if (x[i][j] != i + j + 3)
+	abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c
new file mode 100644
index 0000000..3762e5a
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c
@@ -0,0 +1,41 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    a[i] = i * 2;
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-parallel-loop-data-enter-exit.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-parallel-loop-data-enter-exit.c
new file mode 100644
index 0000000..767f6c8
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-parallel-loop-data-enter-exit.c
@@ -0,0 +1,52 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+#pragma acc enter data create (a[0:N], b[0:N], c[0:N])
+
+#pragma acc kernels present (a[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      a[i] = i * 2;
+  }
+
+#pragma acc parallel present (b[0:N])
+  {
+#pragma acc loop
+    for (COUNTERTYPE i = 0; i < N; i++)
+      b[i] = i * 4;
+  }
+
+#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+#pragma acc exit data copyout (a[0:N], b[0:N], c[0:N])
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c
new file mode 100644
index 0000000..511e25f
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c
@@ -0,0 +1,37 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define n 10000
+
+unsigned int a[n];
+
+void  __attribute__((noinline,noclone))
+foo (void)
+{
+  int i;
+  unsigned int sum = 1;
+
+#pragma acc kernels copyin (a[0:n]) copy (sum)
+  {
+    for (i = 0; i < n; ++i)
+      sum += a[i];
+  }
+
+  if (sum != 5001)
+    abort ();
+}
+
+int
+main ()
+{
+  int i;
+
+  for (i = 0; i < n; ++i)
+    a[i] = i % 2;
+
+  foo ();
+
+  return 0;
+}
-- 
1.9.1


^ permalink raw reply	[flat|nested] 133+ messages in thread

* [PATCH, 16/16] Add libgomp.oacc-fortran/kernels-*.f95
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
                   ` (14 preceding siblings ...)
  2015-11-09 20:11 ` [PATCH, 15/16] Add libgomp.oacc-c-c++-common/kernels-*.c Tom de Vries
@ 2015-11-09 20:12 ` Tom de Vries
  2016-03-09  9:19   ` Tom de Vries
  15 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 20:12 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 1587 bytes --]

This patch adds Fortran oacc kernels execution tests.

Thanks,
- Tom


[-- Attachment #2: 0016-Add-libgomp.oacc-fortran-kernels-.f95.patch --]
[-- Type: text/x-patch, Size: 10459 bytes --]

Add libgomp.oacc-fortran/kernels-*.f95

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* testsuite/libgomp.oacc-fortran/kernels-loop-2.f95: New test.
	* testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95: Same.
	* testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95:
	Same.
	* testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95: Same.
	* testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95: Same.
	* testsuite/libgomp.oacc-fortran/kernels-loop-data.f95: Same.
	* testsuite/libgomp.oacc-fortran/kernels-loop.f95: Same.
	* testsuite/libgomp.oacc-fortran/kernels-parallel-loop-data-enter-exit.f95:
	Same.
---
 .../libgomp.oacc-fortran/kernels-loop-2.f95        | 32 ++++++++++++++++++
 .../libgomp.oacc-fortran/kernels-loop-data-2.f95   | 38 ++++++++++++++++++++++
 .../kernels-loop-data-enter-exit-2.f95             | 38 ++++++++++++++++++++++
 .../kernels-loop-data-enter-exit.f95               | 36 ++++++++++++++++++++
 .../kernels-loop-data-update.f95                   | 36 ++++++++++++++++++++
 .../libgomp.oacc-fortran/kernels-loop-data.f95     | 36 ++++++++++++++++++++
 .../libgomp.oacc-fortran/kernels-loop.f95          | 28 ++++++++++++++++
 .../kernels-parallel-loop-data-enter-exit.f95      | 37 +++++++++++++++++++++
 8 files changed, 281 insertions(+)
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-2.f95
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data.f95
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop.f95
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/kernels-parallel-loop-data-enter-exit.f95

diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-2.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-2.f95
new file mode 100644
index 0000000..1fb40ee
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-2.f95
@@ -0,0 +1,32 @@
+! { dg-do run }
+! { dg-options "-ftree-parallelize-loops=32" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc kernels copyout (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+
+  !$acc kernels copyout (b(0:n-1))
+  do i = 0, n -1
+     b(i) = i * 4
+  end do
+  !$acc end kernels
+
+  !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95
new file mode 100644
index 0000000..7b52253
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95
@@ -0,0 +1,38 @@
+! { dg-do run }
+! { dg-options "-ftree-parallelize-loops=32" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc data copyout (a(0:n-1))
+  !$acc kernels present (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+  !$acc end data
+
+  !$acc data copyout (b(0:n-1))
+  !$acc kernels present (b(0:n-1))
+  do i = 0, n -1
+     b(i) = i * 4
+  end do
+  !$acc end kernels
+  !$acc end data
+
+  !$acc data copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
+  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+  !$acc end data
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95
new file mode 100644
index 0000000..af98efa
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95
@@ -0,0 +1,38 @@
+! { dg-do run }
+! { dg-options "-ftree-parallelize-loops=32" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc enter data create (a(0:n-1))
+  !$acc kernels present (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+  !$acc exit data copyout (a(0:n-1))
+
+  !$acc enter data create (b(0:n-1))
+  !$acc kernels present (b(0:n-1))
+  do i = 0, n -1
+     b(i) = i * 4
+  end do
+  !$acc end kernels
+  !$acc exit data copyout (b(0:n-1))
+
+  !$acc enter data copyin (a(0:n-1), b(0:n-1)) create (c(0:n-1))
+  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+  !$acc exit data copyout (c(0:n-1))
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95
new file mode 100644
index 0000000..bb6f8dc
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95
@@ -0,0 +1,36 @@
+! { dg-do run }
+! { dg-options "-ftree-parallelize-loops=32" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc enter data create (a(0:n-1), b(0:n-1), c(0:n-1))
+
+  !$acc kernels present (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+
+  !$acc kernels present (b(0:n-1))
+  do i = 0, n -1
+     b(i) = i * 4
+  end do
+  !$acc end kernels
+
+  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+
+  !$acc exit data copyout (a(0:n-1), b(0:n-1), c(0:n-1))
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95
new file mode 100644
index 0000000..cab1f2c
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95
@@ -0,0 +1,36 @@
+! { dg-do run }
+! { dg-options "-ftree-parallelize-loops=32" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc enter data create (a(0:n-1), b(0:n-1), c(0:n-1))
+
+  !$acc kernels present (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+
+  do i = 0, n -1
+     b(i) = i * 4
+  end do
+
+  !$acc update device (b(0:n-1))
+
+  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+
+  !$acc exit data copyout (a(0:n-1), c(0:n-1))
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data.f95
new file mode 100644
index 0000000..f26671d
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data.f95
@@ -0,0 +1,36 @@
+! { dg-do run }
+! { dg-options "-ftree-parallelize-loops=32" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc data copyout (a(0:n-1), b(0:n-1), c(0:n-1))
+
+  !$acc kernels present (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+
+  !$acc kernels present (b(0:n-1))
+  do i = 0, n -1
+     b(i) = i * 4
+  end do
+  !$acc end kernels
+
+  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+
+  !$acc end data
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop.f95
new file mode 100644
index 0000000..b02dd57
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop.f95
@@ -0,0 +1,28 @@
+! { dg-do run }
+! { dg-options "-ftree-parallelize-loops=32" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+
+  do i = 0, n -1
+     b(i) = i * 4
+  end do
+
+  !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-parallel-loop-data-enter-exit.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-parallel-loop-data-enter-exit.f95
new file mode 100644
index 0000000..2322152
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-parallel-loop-data-enter-exit.f95
@@ -0,0 +1,37 @@
+! { dg-do run }
+! { dg-options "-ftree-parallelize-loops=32" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc enter data create (a(0:n-1), b(0:n-1), c(0:n-1))
+
+  !$acc kernels present (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+
+  !$acc parallel present (b(0:n-1))
+  !$acc loop
+  do i = 0, n -1
+     b(i) = i * 4
+  end do
+  !$acc end parallel
+
+  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+
+  !$acc exit data copyout (a(0:n-1), b(0:n-1), c(0:n-1))
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
-- 
1.9.1



* Re: [PATCH, 2/16] Make create_parallel_loop return void
  2015-11-09 15:45 ` [PATCH, 2/16] Make create_parallel_loop return void Tom de Vries
@ 2015-11-11 10:50   ` Richard Biener
  0 siblings, 0 replies; 133+ messages in thread
From: Richard Biener @ 2015-11-11 10:50 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Mon, 9 Nov 2015, Tom de Vries wrote:

> On 09/11/15 16:35, Tom de Vries wrote:
> > Hi,
> > 
> > this patch series for stage1 trunk adds support to:
> > - parallelize oacc kernels regions using parloops, and
> > - map the loops onto the oacc gang dimension.
> > 
> > The patch series contains these patches:
> > 
> >       1    Insert new exit block only when needed in
> >          transform_to_exit_first_loop_alt
> >       2    Make create_parallel_loop return void
> >       3    Ignore reduction clause on kernels directive
> >       4    Implement -foffload-alias
> >       5    Add in_oacc_kernels_region in struct loop
> >       6    Add pass_oacc_kernels
> >       7    Add pass_dominator_oacc_kernels
> >       8    Add pass_ch_oacc_kernels
> >       9    Add pass_parallelize_loops_oacc_kernels
> >      10    Add pass_oacc_kernels pass group in passes.def
> >      11    Update testcases after adding kernels pass group
> >      12    Handle acc loop directive
> >      13    Add c-c++-common/goacc/kernels-*.c
> >      14    Add gfortran.dg/goacc/kernels-*.f95
> >      15    Add libgomp.oacc-c-c++-common/kernels-*.c
> >      16    Add libgomp.oacc-fortran/kernels-*.f95
> > 
> > The first 9 patches are more or less independent, but patches 10-16 are
> > intended to be committed at the same time.
> > 
> > Bootstrapped and reg-tested on x86_64.
> > 
> > Build and reg-tested with nvidia accelerator, in combination with a
> > patch that enables accelerator testing (which is submitted at
> > https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
> > 
> > I'll post the individual patches in reply to this message.
> 
> this patch makes create_parallel_loop return void.  The result is currently
> unused.

Ok.

Richard.

> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: [PATCH, 1/16] Insert new exit block only when needed in transform_to_exit_first_loop_alt
  2015-11-09 15:44 ` [PATCH, 1/16] Insert new exit block only when needed in transform_to_exit_first_loop_alt Tom de Vries
@ 2015-11-11 10:50   ` Richard Biener
  0 siblings, 0 replies; 133+ messages in thread
From: Richard Biener @ 2015-11-11 10:50 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Mon, 9 Nov 2015, Tom de Vries wrote:

> On 09/11/15 16:35, Tom de Vries wrote:
> > Hi,
> > 
> > this patch series for stage1 trunk adds support to:
> > - parallelize oacc kernels regions using parloops, and
> > - map the loops onto the oacc gang dimension.
> > 
> > The patch series contains these patches:
> > 
> >       1    Insert new exit block only when needed in
> >          transform_to_exit_first_loop_alt
> >       2    Make create_parallel_loop return void
> >       3    Ignore reduction clause on kernels directive
> >       4    Implement -foffload-alias
> >       5    Add in_oacc_kernels_region in struct loop
> >       6    Add pass_oacc_kernels
> >       7    Add pass_dominator_oacc_kernels
> >       8    Add pass_ch_oacc_kernels
> >       9    Add pass_parallelize_loops_oacc_kernels
> >      10    Add pass_oacc_kernels pass group in passes.def
> >      11    Update testcases after adding kernels pass group
> >      12    Handle acc loop directive
> >      13    Add c-c++-common/goacc/kernels-*.c
> >      14    Add gfortran.dg/goacc/kernels-*.f95
> >      15    Add libgomp.oacc-c-c++-common/kernels-*.c
> >      16    Add libgomp.oacc-fortran/kernels-*.f95
> > 
> > The first 9 patches are more or less independent, but patches 10-16 are
> > intended to be committed at the same time.
> > 
> > Bootstrapped and reg-tested on x86_64.
> > 
> > Build and reg-tested with nvidia accelerator, in combination with a
> > patch that enables accelerator testing (which is submitted at
> > https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
> > 
> > I'll post the individual patches in reply to this message.
> > 
> 
> In transform_to_exit_first_loop_alt we insert a new exit block in between the
> new loop header and the old exit block. Currently, we also do this if this is
> not necessary.
> 
> This patch figures out when we need to insert a new exit block, and only then
> inserts it.

Ok.

Richard.

> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-09 16:10 ` [PATCH, 4/16] Implement -foffload-alias Tom de Vries
@ 2015-11-11 10:53   ` Richard Biener
  2015-11-11 11:01     ` Jakub Jelinek
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-11 10:53 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Mon, 9 Nov 2015, Tom de Vries wrote:

> On 09/11/15 16:35, Tom de Vries wrote:
> > Hi,
> > 
> > this patch series for stage1 trunk adds support to:
> > - parallelize oacc kernels regions using parloops, and
> > - map the loops onto the oacc gang dimension.
> > 
> > The patch series contains these patches:
> > 
> >       1    Insert new exit block only when needed in
> >          transform_to_exit_first_loop_alt
> >       2    Make create_parallel_loop return void
> >       3    Ignore reduction clause on kernels directive
> >       4    Implement -foffload-alias
> >       5    Add in_oacc_kernels_region in struct loop
> >       6    Add pass_oacc_kernels
> >       7    Add pass_dominator_oacc_kernels
> >       8    Add pass_ch_oacc_kernels
> >       9    Add pass_parallelize_loops_oacc_kernels
> >      10    Add pass_oacc_kernels pass group in passes.def
> >      11    Update testcases after adding kernels pass group
> >      12    Handle acc loop directive
> >      13    Add c-c++-common/goacc/kernels-*.c
> >      14    Add gfortran.dg/goacc/kernels-*.f95
> >      15    Add libgomp.oacc-c-c++-common/kernels-*.c
> >      16    Add libgomp.oacc-fortran/kernels-*.f95
> > 
> > The first 9 patches are more or less independent, but patches 10-16 are
> > intended to be committed at the same time.
> > 
> > Bootstrapped and reg-tested on x86_64.
> > 
> > Build and reg-tested with nvidia accelerator, in combination with a
> > patch that enables accelerator testing (which is submitted at
> > https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
> > 
> > I'll post the individual patches in reply to this message.
> 
> this patch addresses the problem that once the offloading region has been
> split off from the original function, alias analysis can no longer use
> information available in the original function that would allow it to do a
> more precise analysis for the offloading function. [ At some point we could
> use fipa-pta for that, as discussed in PR46032, but that's not feasible now. ]
> 
> The basic idea behind the patch is that for typical usage, the base pointers
> used in an offloaded region are non-aliasing. The patch works by adding
> restrict to the types of the fields used to pass data to an offloading region.
> 
> 
> The patch implements a new option
> -foffload-alias=<none|pointer|all>.
> 
> The option -foffload-alias=none instructs the compiler to assume that
> object references and pointer dereferences in an offload region do not
> alias.
> 
> The option -foffload-alias=pointer instructs the compiler to assume that
> object references in an offload region do not alias.
> 
> The option -foffload-alias=all instructs the compiler to make no
> assumptions about aliasing in offload regions.
> 
> The default value is -foffload-alias=none.

I think a global option for this is nonsense.  Please follow what
we do for #pragma GCC ivdep for example, thus allow the alias
behavior to be specified per "region" (whatever makes sense here
in the context of offloading).
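
For reference, a minimal sketch of the per-loop annotation style mentioned
above.  #pragma GCC ivdep is an existing GCC pragma that asserts the loop
following it has no loop-carried dependencies that would prevent
vectorization.  The function scale_add and its parameters are made up for
illustration; this shows only the existing per-loop precedent, not what a
per-region annotation for offloading would look like.

```c
#include <stddef.h>

/* Hypothetical example function, not part of the patch series.  */
void
scale_add (int *dst, const int *src, size_t n, int k)
{
  /* Promise to the compiler: the iterations of the next loop are
     independent, even though dst and src could alias as far as their
     types are concerned.  */
#pragma GCC ivdep
  for (size_t i = 0; i < n; ++i)
    dst[i] = src[i] * k + 1;
}
```

The point of the precedent is scope: the assertion applies to one loop
rather than to a whole translation unit.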

Thanks,
Richard.

> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: [PATCH, 5/16] Add in_oacc_kernels_region in struct loop
  2015-11-09 16:31 ` [PATCH, 5/16] Add in_oacc_kernels_region in struct loop Tom de Vries
@ 2015-11-11 10:57   ` Richard Biener
  2015-11-16 11:39     ` Tom de Vries
  2015-11-16 11:39     ` Tom de Vries
  0 siblings, 2 replies; 133+ messages in thread
From: Richard Biener @ 2015-11-11 10:57 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Mon, 9 Nov 2015, Tom de Vries wrote:

> On 09/11/15 16:35, Tom de Vries wrote:
> > Hi,
> > 
> > this patch series for stage1 trunk adds support to:
> > - parallelize oacc kernels regions using parloops, and
> > - map the loops onto the oacc gang dimension.
> > 
> > The patch series contains these patches:
> > 
> >       1    Insert new exit block only when needed in
> >          transform_to_exit_first_loop_alt
> >       2    Make create_parallel_loop return void
> >       3    Ignore reduction clause on kernels directive
> >       4    Implement -foffload-alias
> >       5    Add in_oacc_kernels_region in struct loop
> >       6    Add pass_oacc_kernels
> >       7    Add pass_dominator_oacc_kernels
> >       8    Add pass_ch_oacc_kernels
> >       9    Add pass_parallelize_loops_oacc_kernels
> >      10    Add pass_oacc_kernels pass group in passes.def
> >      11    Update testcases after adding kernels pass group
> >      12    Handle acc loop directive
> >      13    Add c-c++-common/goacc/kernels-*.c
> >      14    Add gfortran.dg/goacc/kernels-*.f95
> >      15    Add libgomp.oacc-c-c++-common/kernels-*.c
> >      16    Add libgomp.oacc-fortran/kernels-*.f95
> > 
> > The first 9 patches are more or less independent, but patches 10-16 are
> > intended to be committed at the same time.
> > 
> > Bootstrapped and reg-tested on x86_64.
> > 
> > Build and reg-tested with nvidia accelerator, in combination with a
> > patch that enables accelerator testing (which is submitted at
> > https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
> > 
> > I'll post the individual patches in reply to this message.
> 
> this patch adds and initializes the in_oacc_kernels_region field in
> struct loop.
> 
> The field is used to signal to subsequent passes that we're dealing with a
> loop in a kernels region that we're trying to parallelize.
> 
> Note that we do not parallelize kernels regions with more than one loop nest.
> [ In general, kernels regions with more than one loop nest should be split up
> into separate kernels regions, but that's not supported atm. ]

I think mark_loops_in_oacc_kernels_region can be greatly simplified.

Both region entry and exit should have the same ->loop_father (a SESE
region).  Then you can just walk that loop's inner loops (and their
siblings), checking each header's dominance relation with the region entry
and exit (only necessary for direct inner loops).
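
A rough sketch of the walk suggested above, on a toy loop tree.  The
inner/next linkage mirrors how struct loop chains sub-loops and siblings;
everything else (struct toy_loop, the header_in_region flag standing in for
the dominance checks against region entry and exit, and both function
names) is hypothetical and not taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy loop-tree node; a stand-in, not GCC's struct loop.  */
struct toy_loop
{
  struct toy_loop *inner;       /* first sub-loop */
  struct toy_loop *next;        /* next sibling */
  bool header_in_region;        /* stand-in for the dominance checks */
  bool in_oacc_kernels_region;  /* the flag being computed */
};

/* Mark LOOP and everything nested inside it.  No check is needed here:
   a loop nested in a marked loop is in the region as well.  */
static void
mark_subtree (struct toy_loop *loop)
{
  loop->in_oacc_kernels_region = true;
  for (struct toy_loop *l = loop->inner; l; l = l->next)
    mark_subtree (l);
}

/* FATHER is the loop that contains both region entry and exit.  Only its
   direct inner loops need the region check.  */
void
mark_loops_in_region (struct toy_loop *father)
{
  for (struct toy_loop *l = father->inner; l; l = l->next)
    if (l->header_in_region)
      mark_subtree (l);
}
```

In this sketch a sibling loop outside the region is skipped at the top
level, while nested loops inherit the mark without any further test.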

Richard.

> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: [PATCH, 6/16] Add pass_oacc_kernels
  2015-11-09 17:39 ` [PATCH, 6/16] Add pass_oacc_kernels Tom de Vries
@ 2015-11-11 10:59   ` Richard Biener
  2015-11-19 13:51     ` Tom de Vries
  2016-02-05 12:06   ` Use plain -fopenacc to enable OpenACC kernels processing (was: [PATCH, 6/16] Add pass_oacc_kernels) Thomas Schwinge
  1 sibling, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-11 10:59 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Mon, 9 Nov 2015, Tom de Vries wrote:

> On 09/11/15 16:35, Tom de Vries wrote:
> > Hi,
> > 
> > this patch series for stage1 trunk adds support to:
> > - parallelize oacc kernels regions using parloops, and
> > - map the loops onto the oacc gang dimension.
> > 
> > The patch series contains these patches:
> > 
> >       1    Insert new exit block only when needed in
> >          transform_to_exit_first_loop_alt
> >       2    Make create_parallel_loop return void
> >       3    Ignore reduction clause on kernels directive
> >       4    Implement -foffload-alias
> >       5    Add in_oacc_kernels_region in struct loop
> >       6    Add pass_oacc_kernels
> >       7    Add pass_dominator_oacc_kernels
> >       8    Add pass_ch_oacc_kernels
> >       9    Add pass_parallelize_loops_oacc_kernels
> >      10    Add pass_oacc_kernels pass group in passes.def
> >      11    Update testcases after adding kernels pass group
> >      12    Handle acc loop directive
> >      13    Add c-c++-common/goacc/kernels-*.c
> >      14    Add gfortran.dg/goacc/kernels-*.f95
> >      15    Add libgomp.oacc-c-c++-common/kernels-*.c
> >      16    Add libgomp.oacc-fortran/kernels-*.f95
> > 
> > The first 9 patches are more or less independent, but patches 10-16 are
> > intended to be committed at the same time.
> > 
> > Bootstrapped and reg-tested on x86_64.
> > 
> > Build and reg-tested with nvidia accelerator, in combination with a
> > patch that enables accelerator testing (which is submitted at
> > https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
> > 
> > I'll post the individual patches in reply to this message.
> 
> this patch adds a pass group pass_oacc_kernels (which will be added to the
> pass list as a whole in patch 10).

Just to understand (while also skimming the HSA patches).

You are basically relying on autopar for what the HSA patches call
"gridification"?  That is, OMP lowering produces loopy kernels
and autopar then will basically strip the outermost loop?

Richard.

> Atm, the parallelization behaviour for the kernels region is controlled by
> flag_tree_parallelize_loops, which is also used to control generic
> auto-parallelization by autopar using omp. That is not ideal, and we may want
> a separate flag (or param) to control the behaviour for oacc kernels, f.i.
> -foacc-kernels-gang-parallelize=<n>. I'm open to suggestions.
> 
> The purpose of the pass group as a whole is to massage the offloaded function
> into a shape that parloops can deal with, and then run parloops on it.
> 
> Consider a testcase with a reduction, and a loop counter declared outside the
> offload region:
> ...
> unsigned int a[n];
> 
> unsigned int
> foo (void)
> {
>   int i;
>   unsigned int sum = 1;
> 
> #pragma acc kernels copyin (a[0:n]) copy (sum)
>   {
>     for (i = 0; i < n; ++i)
>       sum += a[i];
>   }
> 
>   return sum;
> }
> ...
> 
> After ealias, the loop body looks like this:
> ...
>   <bb 5>:
>   _8 = *.omp_data_i_3(D).a;
>   _9 = *.omp_data_i_3(D).i;
>   _10 = *_9;
>   _11 = *_8[_10];
>   _12 = *.omp_data_i_3(D).sum;
>   sum.0_13 = *_12;
>   sum.1_14 = _11 + sum.0_13;
>   _15 = *.omp_data_i_3(D).sum;
>   *_15 = sum.1_14;
>   _17 = *.omp_data_i_3(D).i;
>   _18 = *_17;
>   _19 = *.omp_data_i_3(D).i;
>   _20 = _18 + 1;
>   *_19 = _20;
>   goto <bb 6>;
> ...
> In other words, the iteration variable is in memory, as is the reduction
> variable, and the body contains lots of loop invariant loads.
> 
> At the end of the pass group, just before parloops, the body has been
> rewritten to have a local iteration variable and a local reduction variable,
> and all the loop invariant loads have been moved out of the loop:
> ...
>   <bb 4>:
>   # _27 = PHI <0(2), _20(5)>
>   # D__lsm.7_28 = PHI <D__lsm.7_29(2), sum.1_14(5)>
>   _11 = *_8[_27];
>   sum.1_14 = _11 + D__lsm.7_28;
>   _20 = _27 + 1;
>   if (_20 <= 9999)
>     goto <bb 5>;
>   else
>     goto <bb 3>;
> ...
> 
> Thanks,
> - Tom
> 
> 
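
The two GIMPLE shapes quoted above can be approximated at the source level
as follows.  This is only an illustration of the difference between a
counter and accumulator kept in memory and ones kept in locals; the
function names, the N macro and the run_sketch helper are invented here,
and the code is not what the pass group emits.

```c
#define N 10000

/* Shape before the pass group: the counter and the accumulator are reached
   through pointers and are loaded and stored on every iteration.  */
void
reduce_in_memory (const unsigned int *a, int *ip, unsigned int *sump)
{
  for (*ip = 0; *ip < N; ++*ip)
    *sump += a[*ip];
}

/* Shape just before parloops: local counter, local accumulator, one load
   before the loop and one store after it.  */
void
reduce_in_locals (const unsigned int *a, int *ip, unsigned int *sump)
{
  unsigned int sum = *sump;   /* invariant load moved out of the loop */
  int i;

  for (i = 0; i < N; ++i)
    sum += a[i];

  *ip = i;                    /* final values written back once */
  *sump = sum;
}

/* Helper for trying both variants on the input used by the reduction
   testcase (a[i] = i % 2, sum starting at 1).  */
unsigned int
run_sketch (int use_locals)
{
  static unsigned int a[N];
  unsigned int sum = 1;
  int i;

  for (i = 0; i < N; ++i)
    a[i] = i % 2;

  if (use_locals)
    reduce_in_locals (a, &i, &sum);
  else
    reduce_in_memory (a, &i, &sum);

  return sum;
}
```

Both variants compute the same value (5001 for this input, the value
checked in kernels-reduction.c); only the second has the simple induction
variable and reduction form that a loop parallelizer can recognize.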

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-11 10:53   ` Richard Biener
@ 2015-11-11 11:01     ` Jakub Jelinek
  2015-11-12 16:04       ` Tom de Vries
  2015-12-03 11:53       ` Tom de Vries
  0 siblings, 2 replies; 133+ messages in thread
From: Jakub Jelinek @ 2015-11-11 11:01 UTC (permalink / raw)
  To: Richard Biener; +Cc: Tom de Vries, gcc-patches

On Wed, Nov 11, 2015 at 11:51:02AM +0100, Richard Biener wrote:
> > The option -foffload-alias=pointer instructs the compiler to assume that
> > object references in an offload region do not alias.
> > 
> > The option -foffload-alias=all instructs the compiler to make no
> > assumptions about aliasing in offload regions.
> > 
> > The default value is -foffload-alias=none.
> 
> I think a global option for this is nonsense.  Please follow what
> we do for #pragma GCC ivdep for example, thus allow the alias
> behavior to be specified per "region" (whatever makes sense here
> in the context of offloading).

Yeah, completely agreed.  I don't see why offloaded regions would be in
any way special; they are C/C++/Fortran code like any other.
What we can and should improve is to teach IPA aliasing/points-to analysis
about the way we lower the host vs. offloading region boundary, so that
if alias analysis on the caller of GOMP_target_ext/GOACC_parallel_keyed
determines something, it can be used on the offloaded function side and vice
versa, but a switch like the above is just wrong.
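
To make the boundary in question concrete, a rough C sketch of the
situation: the host side knows that two objects are distinct, but the
outlined function only sees pointers loaded from a record.  All names here
(struct omp_data_sketch, outlined_region, host_caller) are hypothetical,
and the record GCC actually generates differs.  The restrict qualifiers on
the fields illustrate the assumption the patch under discussion would add,
not something the analysis has derived.

```c
#include <stddef.h>

/* Hypothetical stand-in for the record used to pass data to an outlined
   region.  */
struct omp_data_sketch
{
  const unsigned int *restrict a;  /* restrict: assumed, not proven */
  unsigned int *restrict sum;
  size_t n;
};

/* Stand-in for the outlined (offloaded) function.  It sees only the
   record, so without extra information it cannot tell whether a and sum
   overlap.  */
static void
outlined_region (struct omp_data_sketch *data)
{
  for (size_t i = 0; i < data->n; ++i)
    *data->sum += data->a[i];
}

/* Stand-in for the host side, where a and sum are visibly distinct
   objects.  */
unsigned int
host_caller (const unsigned int *a, size_t n)
{
  unsigned int sum = 1;
  struct omp_data_sketch data = { a, &sum, n };

  outlined_region (&data);
  return sum;
}
```

An interprocedural points-to analysis that understood this call boundary
could carry the caller's knowledge into the outlined function without any
qualifier or option.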

	Jakub


* Re: [PATCH, 11/16] Update testcases after adding kernels pass group
  2015-11-09 20:02 ` [PATCH, 11/16] Update testcases after adding kernels pass group Tom de Vries
@ 2015-11-11 11:03   ` Richard Biener
  2015-11-12 14:32     ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-11 11:03 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Mon, 9 Nov 2015, Tom de Vries wrote:

> On 09/11/15 16:35, Tom de Vries wrote:
> > Hi,
> > 
> > this patch series for stage1 trunk adds support to:
> > - parallelize oacc kernels regions using parloops, and
> > - map the loops onto the oacc gang dimension.
> > 
> > The patch series contains these patches:
> > 
> >       1    Insert new exit block only when needed in
> >          transform_to_exit_first_loop_alt
> >       2    Make create_parallel_loop return void
> >       3    Ignore reduction clause on kernels directive
> >       4    Implement -foffload-alias
> >       5    Add in_oacc_kernels_region in struct loop
> >       6    Add pass_oacc_kernels
> >       7    Add pass_dominator_oacc_kernels
> >       8    Add pass_ch_oacc_kernels
> >       9    Add pass_parallelize_loops_oacc_kernels
> >      10    Add pass_oacc_kernels pass group in passes.def
> >      11    Update testcases after adding kernels pass group
> >      12    Handle acc loop directive
> >      13    Add c-c++-common/goacc/kernels-*.c
> >      14    Add gfortran.dg/goacc/kernels-*.f95
> >      15    Add libgomp.oacc-c-c++-common/kernels-*.c
> >      16    Add libgomp.oacc-fortran/kernels-*.f95
> > 
> > The first 9 patches are more or less independent, but patches 10-16 are
> > intended to be committed at the same time.
> > 
> > Bootstrapped and reg-tested on x86_64.
> > 
> > Build and reg-tested with nvidia accelerator, in combination with a
> > patch that enables accelerator testing (which is submitted at
> > https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
> > 
> > I'll post the individual patches in reply to this message.
> 
> This patch updates existing testcases with new pass numbers, given the passes
> that were added in the pass list in patch 10.

I think it would be nice to be able to specify the number in the .def
file instead so we can avoid this kind of churn every time we do this.

> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-09 19:59 ` [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def Tom de Vries
@ 2015-11-11 11:03   ` Richard Biener
  2015-11-16 11:55     ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-11 11:03 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Mon, 9 Nov 2015, Tom de Vries wrote:

> 
> This patch adds the pass_oacc_kernels pass group to the pass list in
> passes.def.
> 
> Note the repetition of pass_lim/pass_copy_prop. The first pair is for an inner
> loop in a loop nest, the second for an outer loop in a loop nest.

@@ -86,6 +86,27 @@ along with GCC; see the file COPYING3.  If not see
          /* pass_build_ealias is a dummy pass that ensures that we
             execute TODO_rebuild_alias at this point.  */
          NEXT_PASS (pass_build_ealias);
+         /* Pass group that runs when there are oacc kernels in the
+            function.  */
+         NEXT_PASS (pass_oacc_kernels);
+         PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
+             NEXT_PASS (pass_dominator_oacc_kernels);
+             NEXT_PASS (pass_ch_oacc_kernels);
+             NEXT_PASS (pass_dominator_oacc_kernels);
+             NEXT_PASS (pass_tree_loop_init);
+             NEXT_PASS (pass_lim);
+             NEXT_PASS (pass_copy_prop);
+             NEXT_PASS (pass_lim);
+             NEXT_PASS (pass_copy_prop);

iterate lim/copyprop twice?!  Why's that needed?

+             NEXT_PASS (pass_scev_cprop);

What's that for?  It's supposed to help removing loops - I don't
expect kernels to vanish.

+             NEXT_PASS (pass_tree_loop_done);
+             NEXT_PASS (pass_dominator_oacc_kernels);

Three times DOM?  No please.  I wonder why you don't run oacc_kernels
after FRE and drop the initial DOM(s).

+             NEXT_PASS (pass_dce);
+             NEXT_PASS (pass_tree_loop_init);
+             NEXT_PASS (pass_parallelize_loops_oacc_kernels);
+             NEXT_PASS (pass_expand_omp_ssa);
+             NEXT_PASS (pass_tree_loop_done);

The switches into/out of tree_loop also look odd to me, but well
(they'll be controlled by -ftree-loop-optimize).

+         POP_INSERT_PASSES ()

Please get some more sense into this pass pipeline.

Richard.


> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

* Re: [PATCH, 7/16] Add pass_dominator_oacc_kernels
  2015-11-09 18:14 ` [PATCH, 7/16] Add pass_dominator_oacc_kernels Tom de Vries
@ 2015-11-11 11:05   ` Richard Biener
  2015-11-16 12:04     ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-11 11:05 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Mon, 9 Nov 2015, Tom de Vries wrote:

> 
> this patch adds pass_dominator_oacc_kernels (which we may as well call
> pass_dominator_no_peel_loop_headers, since it doesn't do anything
> oacc-kernels-specific), to be used in the kernels pass group.
> 
> The reason I'm adding a new pass instead of using pass_dominator is that
> pass_dominator uses first_pass_instance. So adding a pass_dominator instance A
> before a pass_dominator instance B has the unexpected consequence that it may
> change the behaviour of instance B. I've filed PR68247 - "Remove
> pass_first_instance" to note this issue.

This looks ok (minus my comments to patch #10)

Richard.

> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

* Re: [PATCH, 8/16] Add pass_ch_oacc_kernels
  2015-11-09 18:34 ` [PATCH, 8/16] Add pass_ch_oacc_kernels Tom de Vries
@ 2015-11-11 20:29   ` Tom de Vries
  2015-11-30 12:12     ` [gomp4] Use pass_ch instead of pass_ch_oacc_kernels (was: [PATCH, 8/16] Add pass_ch_oacc_kernels) Thomas Schwinge
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-11 20:29 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

On 09/11/15 19:33, Tom de Vries wrote:
>
> this patch adds a pass pass_ch_oacc_kernels, which is like pass_ch, but
> only runs for loops with oacc_kernels_region set.
>
> [ But... thinking about it a bit more, I think that we could use a
> regular pass_ch instead. We only use the kernels pass group for a single
> loop nest in a kernels region, and we mark all the loops in the loop
> nest with oacc_kernels_region. So I think that the oacc_kernels_region
> test in pass_ch_oacc_kernels::process_loop_p evaluates to true. ]
>
> So, I'll try to confirm with retesting that we can drop this patch.
>

That's confirmed. I can use pass_ch instead of pass_ch_oacc_kernels, so 
I'm dropping this patch from the series.
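
The reasoning behind dropping the patch can be sketched in a few lines of
C++. This is a toy model, not GCC code: mini_loop, process_loop_kernels_p
and filter_is_noop are hypothetical names, and the only assumption is the
one stated above, namely that every loop in the nest handled by the kernels
pass group carries the in_oacc_kernels_region mark.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the relevant bit of GCC's struct loop.
struct mini_loop
{
  bool in_oacc_kernels_region;
};

// The extra filter described for pass_ch_oacc_kernels: accept only loops
// that are marked as being in an oacc kernels region.
bool process_loop_kernels_p (const mini_loop &l)
{
  return l.in_oacc_kernels_region;
}

// True if the filter accepts every loop in the nest, i.e. it filters
// nothing out and a pass without the filter sees the same loops.
bool filter_is_noop (const std::vector<mini_loop> &nest)
{
  assert (!nest.empty ());
  for (const mini_loop &l : nest)
    if (!process_loop_kernels_p (l))
      return false;
  return true;
}
```

When all loops in the nest are marked, filter_is_noop returns true, which
is the situation in which the specialized pass adds nothing over pass_ch.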

Thanks,
- Tom

* Re: [PATCH, 11/16] Update testcases after adding kernels pass group
  2015-11-11 11:03   ` Richard Biener
@ 2015-11-12 14:32     ` Tom de Vries
  2015-11-12 14:43       ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-12 14:32 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

On 11/11/15 12:03, Richard Biener wrote:
> On Mon, 9 Nov 2015, Tom de Vries wrote:
>
>>
>> This patch updates existing testcases with new pass numbers, given the passes
>> that were added in the pass list in patch 10.
>
> I think it would be nice to be able to specify the number in the .def
> file instead so we can avoid this kind of churn every time we do this.

How about something along the lines of:
...
   /* pass_build_ealias is a dummy pass that ensures that we
      execute TODO_rebuild_alias at this point.  */
   NEXT_PASS (pass_build_ealias);
   /* Pass group that runs when there are oacc kernels in the
   function.  */
   NEXT_PASS (pass_oacc_kernels);
   PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
   PUSH_ID ("oacc_kernels")
     ...
   POP_ID ()
   POP_INSERT_PASSES ()
   NEXT_PASS (pass_fre);
...

where the PUSH_ID/POP_ID pair has the functionality that all the 
contained passes:
- have the id prefixed to the dump file, so the dump file of pass_ch
   which normally is "ch" becomes "oacc_kernels_ch", and
- the pass name in pass_instances.def becomes pass_oacc_kernels_ch, such
   that it doesn't count as numbered instance of pass_ch
?
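
To make the intended PUSH_ID/POP_ID semantics concrete, here is a minimal
C++ sketch. It is not GCC code: pass_list_builder, pass_entry, push_id,
pop_id and next_pass are hypothetical names, and the sketch only models how
dump names and instance numbers could be derived, assuming that instance
numbers are counted per (prefixed) dump name.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical model of the proposal; none of these names are GCC APIs.
struct pass_entry
{
  std::string dump_name;  // e.g. "ch" or "oacc_kernels_ch"
  int instance;           // 1-based count among passes with this dump name
};

class pass_list_builder
{
public:
  void push_id (const std::string &id) { m_ids.push_back (id); }

  void pop_id ()
  {
    assert (!m_ids.empty ());  // every POP_ID needs a matching PUSH_ID
    m_ids.pop_back ();
  }

  // Register a pass; every currently active id is prefixed to its name.
  void next_pass (const std::string &base_name)
  {
    std::string name;
    for (const std::string &id : m_ids)
      name += id + "_";
    name += base_name;
    int instance = ++m_counts[name];  // counted per prefixed name
    m_passes.push_back ({name, instance});
  }

  const std::vector<pass_entry> &passes () const { return m_passes; }

private:
  std::vector<std::string> m_ids;       // stack of active ids
  std::map<std::string, int> m_counts;  // instances seen per dump name
  std::vector<pass_entry> m_passes;     // resulting pass list
};
```

In this model a "ch" pass registered between push_id ("oacc_kernels") and
pop_id () gets the dump name "oacc_kernels_ch" with instance 1, while the
first "ch" registered outside the group is still instance 1 of "ch", so the
numbering of the passes outside the group is unaffected.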

Thanks,
- Tom

* Re: [PATCH, 11/16] Update testcases after adding kernels pass group
  2015-11-12 14:32     ` Tom de Vries
@ 2015-11-12 14:43       ` Richard Biener
  2015-11-12 15:42         ` David Malcolm
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-12 14:43 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

On Thu, Nov 12, 2015 at 3:31 PM, Tom de Vries <Tom_deVries@mentor.com> wrote:
> On 11/11/15 12:03, Richard Biener wrote:
>>
>> On Mon, 9 Nov 2015, Tom de Vries wrote:
>>
>>>
>>> This patch updates existing testcases with new pass numbers, given the
>>> passes
>>> that were added in the pass list in patch 10.
>>
>>
>> I think it would be nice to be able to specify the number in the .def
>> file instead so we can avoid this kind of churn every time we do this.
>
>
> How about something along the lines of:
> ...
>   /* pass_build_ealias is a dummy pass that ensures that we
>      execute TODO_rebuild_alias at this point.  */
>   NEXT_PASS (pass_build_ealias);
>   /* Pass group that runs when there are oacc kernels in the
>   function.  */
>   NEXT_PASS (pass_oacc_kernels);
>   PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
>   PUSH_ID ("oacc_kernels")
>     ...
>   POP_ID ()
>   POP_INSERT_PASSES ()
>   NEXT_PASS (pass_fre);
> ...
>
> where the PUSH_ID/POP_ID pair has the functionality that all the contained
> passes:
> - have the id prefixed to the dump file, so the dump file of pass_ch
>   which normally is "ch" becomes "oacc_kernels_ch", and
> - the pass name in pass_instances.def becomes pass_oacc_kernels_ch, such
>   that it doesn't count as numbered instance of pass_ch
> ?

Hmm.  I'd like to have sth that allows me to add "slp" to both
pass_slp_vectorize
instances, having them share the suffix (as no two functions are in both dumps).

We similarly have "duplicates" across the -Og vs. the -O[0-3] pipeline.

Basically make all dump file name suffixes manually specified which means moving
them from the class definition to the actual instance.

Well, just an idea.  In the distant future I'd like our pass pipeline to become more
dynamic, getting away from a static passes.def towards, say, a pass "script"
(to be able to say "if inlining did nothing skip this group" or similar).

Richard.


> Thanks,
> - Tom

* Re: [PATCH, 11/16] Update testcases after adding kernels pass group
  2015-11-12 14:43       ` Richard Biener
@ 2015-11-12 15:42         ` David Malcolm
  2015-11-13  9:44           ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: David Malcolm @ 2015-11-12 15:42 UTC (permalink / raw)
  To: Richard Biener; +Cc: Tom de Vries, Richard Biener, gcc-patches, Jakub Jelinek

On Thu, 2015-11-12 at 15:43 +0100, Richard Biener wrote:
> On Thu, Nov 12, 2015 at 3:31 PM, Tom de Vries <Tom_deVries@mentor.com> wrote:
> > On 11/11/15 12:03, Richard Biener wrote:
> >>
> >> On Mon, 9 Nov 2015, Tom de Vries wrote:
> >>
> >>>
> >>> This patch updates existing testcases with new pass numbers, given the
> >>> passes
> >>> that were added in the pass list in patch 10.
> >>
> >>
> >> I think it would be nice to be able to specify the number in the .def
> >> file instead so we can avoid this kind of churn every time we do this.
> >
> >
> > How about something along the lines of:
> > ...
> >   /* pass_build_ealias is a dummy pass that ensures that we
> >      execute TODO_rebuild_alias at this point.  */
> >   NEXT_PASS (pass_build_ealias);
> >   /* Pass group that runs when there are oacc kernels in the
> >   function.  */
> >   NEXT_PASS (pass_oacc_kernels);
> >   PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
> >   PUSH_ID ("oacc_kernels")
> >     ...
> >   POP_ID ()
> >   POP_INSERT_PASSES ()
> >   NEXT_PASS (pass_fre);
> > ...
> >
> > where the PUSH_ID/POP_ID pair has the functionality that all the contained
> > passes:
> > - have the id prefixed to the dump file, so the dump file of pass_ch
> >   which normally is "ch" becomes "oacc_kernels_ch", and
> > - the pass name in pass_instances.def becomes pass_oacc_kernels_ch, such
> >   that it doesn't count as numbered instance of pass_ch
> > ?
> 
> Hmm.  I'd like to have sth that allows me to add "slp" to both
> pass_slp_vectorize
> instances, having them share the suffix (as no two functions are in both dumps).
> 
> We similarly have "duplicates" across the -Og vs. the -O[0-3] pipeline.
> 
> Basically make all dump file name suffixes manually specified which means moving
> them from the class definition to the actual instance.
> 
> Well, just an idea.  In the distant future I'd like our pass pipeline to become more
> dynamic, getting away from a static passes.def towards, say, a pass "script"
> (to be able to say "if inlining did nothing skip this group" or similar).

Can't that be done by having a parent pass to hold them, with a gate
function?

Or are you thinking of having another domain-specific language?

Thinking aloud, I've sometimes wondered if it would be helpful to be
able to subclass pass_manager, so that multiple passes.def files could
generate alternative pass_manager subclasses, with the precise choice of
pass_manager subclass being determined by options+target.  I don't know
if that latter idea is useful though.

Dave

* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-11 11:01     ` Jakub Jelinek
@ 2015-11-12 16:04       ` Tom de Vries
  2015-11-13  8:46         ` Richard Biener
  2015-12-03 11:53       ` Tom de Vries
  1 sibling, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-12 16:04 UTC (permalink / raw)
  To: Jakub Jelinek, Richard Biener; +Cc: gcc-patches

On 11/11/15 12:00, Jakub Jelinek wrote:
> On Wed, Nov 11, 2015 at 11:51:02AM +0100, Richard Biener wrote:
>>> The option -foffload-alias=pointer instructs the compiler to assume that
>>> objects references in an offload region do not alias.
>>>
>>> The option -foffload-alias=all instructs the compiler to make no
>>> assumptions about aliasing in offload regions.
>>>
>>> The default value is -foffload-alias=none.
>>
>> I think global options for this is nonsense.  Please follow what
>> we do for #pragma GCC ivdep for example, thus allow the alias
>> behavior to be specified per "region" (whatever makes sense here
>> in the context of offloading).

So, IIUC, instead of a global option foffload-alias, you're saying 
something like the following would be acceptable:
...
#pragma GCC offload-alias=<none|pointer|all>
#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
   {
     #pragma acc loop
     for (COUNTERTYPE ii = 0; ii < N; ii++)
       c[ii] = a[ii] + b[ii];
   }
...
?

I suppose that would work (though a global option would allow us to
easily switch between none/pointer/all values in a large number of
files, something that might be useful when for instance running an
OpenACC test suite).
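
Why the aliasing assumption matters for parallelizing such a loop can be
shown with a small C++ sketch. It is an illustration only, not how parloops
analyzes dependences; vec_add and order_matters are hypothetical helpers.
If c overlaps a at an offset, one iteration reads what another one writes,
so the iteration order changes the result and the iterations cannot simply
be distributed over gangs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative only: compute c[i] = a[i] + b[i], visiting the indices in
// ascending order, or in descending order when 'reverse' is set.
void vec_add (int *c, const int *a, const int *b, std::size_t n,
              bool reverse)
{
  assert (c && a && b);
  for (std::size_t k = 0; k < n; k++)
    {
      std::size_t i = reverse ? n - 1 - k : k;
      c[i] = a[i] + b[i];
    }
}

// Place c 'shift' elements after a in the same buffer and report whether
// the ascending and descending iteration orders give different results.
bool order_matters (std::size_t shift)
{
  const std::size_t n = 4;
  std::vector<int> fwd (n + shift, 1), rev (n + shift, 1);
  std::vector<int> b (n, 1);
  vec_add (fwd.data () + shift, fwd.data (), b.data (), n, false);
  vec_add (rev.data () + shift, rev.data (), b.data (), n, true);
  return fwd != rev;
}
```

Here order_matters (1) is true, because c[i] is the same object as
a[i + 1], whereas order_matters (0) is false: with c and a coinciding
exactly, each iteration only touches its own element.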

> Yeah, completely agreed.  I don't see why the offloaded region would be in
> any way special, they are C/C++/Fortran code as any other.
> What we can and should improve is teach IPA aliasing/points to analysis
> about the way we lower the host vs. offloading region boundary, so that
> if alias analysis on the caller of GOMP_target_ext/GOACC_parallel_keyed
> determines something it can be used on the offloaded function side and vice
> versa,

I agree this would be a nice way to solve the aliasing info problem, but 
considering the remark of Richard at 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032#c19 :
...
Not that I think IPA PTA is anywhere near production ready
...
I haven't considered proceeding in that direction.

Thanks,
- Tom

> but a switch like the above is just wrong.

* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-12 16:04       ` Tom de Vries
@ 2015-11-13  8:46         ` Richard Biener
  2015-11-13 11:03           ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-13  8:46 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Jakub Jelinek, gcc-patches

On Thu, 12 Nov 2015, Tom de Vries wrote:

> On 11/11/15 12:00, Jakub Jelinek wrote:
> > On Wed, Nov 11, 2015 at 11:51:02AM +0100, Richard Biener wrote:
> > > > The option -foffload-alias=pointer instructs the compiler to assume that
> > > > objects references in an offload region do not alias.
> > > > 
> > > > The option -foffload-alias=all instructs the compiler to make no
> > > > assumptions about aliasing in offload regions.
> > > > 
> > > > The default value is -foffload-alias=none.
> > > 
> > > I think global options for this is nonsense.  Please follow what
> > > we do for #pragma GCC ivdep for example, thus allow the alias
> > > behavior to be specified per "region" (whatever makes sense here
> > > in the context of offloading).
> 
> So, IIUC, instead of a global option foffload-alias, you're saying something
> like the following would be acceptable:
> ...
> #pragma GCC offload-alias=<none|pointer|all>
> #pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
>   {
>     #pragma acc loop
>     for (COUNTERTYPE ii = 0; ii < N; ii++)
>       c[ii] = a[ii] + b[ii];
>   }
> ...
> ?
> 
> I suppose that would work (though a global option would allow us to easily
> switch between none/pointer/all values in a large number of files, something
> that might be useful when for instance running an OpenACC test suite).
> 
> > Yeah, completely agreed.  I don't see why the offloaded region would be in
> > any way special, they are C/C++/Fortran code as any other.
> > What we can and should improve is teach IPA aliasing/points to analysis
> > about the way we lower the host vs. offloading region boundary, so that
> > if alias analysis on the caller of GOMP_target_ext/GOACC_parallel_keyed
> > determines something it can be used on the offloaded function side and vice
> > versa,
> 
> I agree this would be a nice way to solve the aliasing info problem, but
> considering the remark of Richard at
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032#c19 :
> ...
> Not that I think IPA PTA is anywhere near production ready

Just to clarify on that sentence:
 1) we lack good testing coverage for IPA PTA so wrong-code bugs might 
still exist
 2) IPA PTA can use a _lot_ of memory and compile-time
 3) for existing wrong-code issues I have merely dumbed down the
use of the analysis result resulting in weaker alias analysis compared to
the local PTA (for some cases)

Because of 2) and no good way to avoid this I decided to not make
fixing 3) a priority (and 1) still holds).

Richard.

> ...
> I haven't considered proceeding in that direction.
> 
> Thanks,
> - Tom
> 
> > but a switch like the above is just wrong.
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

* Re: [PATCH, 11/16] Update testcases after adding kernels pass group
  2015-11-12 15:42         ` David Malcolm
@ 2015-11-13  9:44           ` Richard Biener
  0 siblings, 0 replies; 133+ messages in thread
From: Richard Biener @ 2015-11-13  9:44 UTC (permalink / raw)
  To: David Malcolm; +Cc: Richard Biener, Tom de Vries, gcc-patches, Jakub Jelinek

On Thu, 12 Nov 2015, David Malcolm wrote:

> On Thu, 2015-11-12 at 15:43 +0100, Richard Biener wrote:
> > On Thu, Nov 12, 2015 at 3:31 PM, Tom de Vries <Tom_deVries@mentor.com> wrote:
> > > On 11/11/15 12:03, Richard Biener wrote:
> > >>
> > >> On Mon, 9 Nov 2015, Tom de Vries wrote:
> > >>
> > >>>
> > >>> This patch updates existing testcases with new pass numbers, given the
> > >>> passes
> > >>> that were added in the pass list in patch 10.
> > >>
> > >>
> > >> I think it would be nice to be able to specify the number in the .def
> > >> file instead so we can avoid this kind of churn every time we do this.
> > >
> > >
> > > How about something along the lines of:
> > > ...
> > >   /* pass_build_ealias is a dummy pass that ensures that we
> > >      execute TODO_rebuild_alias at this point.  */
> > >   NEXT_PASS (pass_build_ealias);
> > >   /* Pass group that runs when there are oacc kernels in the
> > >   function.  */
> > >   NEXT_PASS (pass_oacc_kernels);
> > >   PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
> > >   PUSH_ID ("oacc_kernels")
> > >     ...
> > >   POP_ID ()
> > >   POP_INSERT_PASSES ()
> > >   NEXT_PASS (pass_fre);
> > > ...
> > >
> > > where the PUSH_ID/POP_ID pair has the functionality that all the contained
> > > passes:
> > > - have the id prefixed to the dump file, so the dump file of pass_ch
> > >   which normally is "ch" becomes "oacc_kernels_ch", and
> > > - the pass name in pass_instances.def becomes pass_oacc_kernels_ch, such
> > >   that it doesn't count as numbered instance of pass_ch
> > > ?
> > 
> > Hmm.  I'd like to have sth that allows me to add "slp" to both
> > pass_slp_vectorize
> > instances, having them share the suffix (as no two functions are in both dumps).
> > 
> > We similarly have "duplicates" across the -Og vs. the -O[0-3] pipeline.
> > 
> > Basically make all dump file name suffixes manually specified which means moving
> > them from the class definition to the actual instance.
> > 
> > Well, just an idea.  In the distant future I'd like our pass pipeline to become more
> > dynamic, getting away from a static passes.def towards, say, a pass "script"
> > (to be able to say "if inlining did nothing skip this group" or similar).
> 
> Can't that be done by having a parent pass to hold them, with a gate
> function?

Sure, that's how we do it for the loop sub-pipeline for example.
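
The parent-pass-with-gate arrangement can be sketched as follows. This is a
toy model, not GCC's pass manager: mini_pass and run_passes are
hypothetical, and an empty gate is assumed to mean "always run".

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical miniature pass; not GCC's opt_pass.
struct mini_pass
{
  std::string name;
  std::function<bool ()> gate;      // empty means "always run"
  std::vector<mini_pass> children;  // sub-pipeline held by this pass
};

// Append the names of the passes that would execute, in order.  A closed
// gate skips the pass together with its whole sub-pipeline.
void run_passes (const std::vector<mini_pass> &passes,
                 std::vector<std::string> &executed)
{
  for (const mini_pass &p : passes)
    {
      assert (!p.name.empty ());
      if (p.gate && !p.gate ())
        continue;
      executed.push_back (p.name);
      run_passes (p.children, executed);
    }
}
```

A group whose gate returns false, for example because an earlier pass
recorded that it changed nothing, is skipped as a unit; anything beyond
that, such as reordering or repeating groups, is where a more dynamic
"script" would come in.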

> Or are you thinking of having another domain-specific language?

Kind of.  I'm thinking of the pass pipeline being dynamic in the
sense of a program controlling execution of passes.  Basically
"scripting" the pass manager itself (yes, also with the idea to
give users and us more control).

Of course specific features can be implemented in the pass manager
itself (it's a "script" with static configuration).

> Thinking aloud, I've sometimes wondered if it would be helpful to be
> able to subclass pass_manager, so that multiple passes.def files could
> generate alternative pass_manager subclasses, with the precise choice of
> pass_manager subclass being determined by options+target.  I don't know
> if that latter idea is useful though.

I think the "use" of passes.def is simply too static.  We shouldn't bother
to create all the instances and dump file metadata until we need it.

The first thing to do is of course making the pass manager really
control the flow of compilation rather than various bits of
cgraph infrastructure executing specific (sub-)pass queues.

Richard.

* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-13  8:46         ` Richard Biener
@ 2015-11-13 11:03           ` Tom de Vries
  2015-11-13 11:30             ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-13 11:03 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jakub Jelinek, gcc-patches

On 13/11/15 09:46, Richard Biener wrote:
> On Thu, 12 Nov 2015, Tom de Vries wrote:
>
>> On 11/11/15 12:00, Jakub Jelinek wrote:
>>> On Wed, Nov 11, 2015 at 11:51:02AM +0100, Richard Biener wrote:
>>>>> The option -foffload-alias=pointer instructs the compiler to assume that
>>>>> objects referenced in an offload region do not alias.
>>>>>
>>>>> The option -foffload-alias=all instructs the compiler to make no
>>>>> assumptions about aliasing in offload regions.
>>>>>
>>>>> The default value is -foffload-alias=none.
>>>>
>>>> I think global options for this is nonsense.  Please follow what
>>>> we do for #pragma GCC ivdep for example, thus allow the alias
>>>> behavior to be specified per "region" (whatever makes sense here
>>>> in the context of offloading).
>>
>> So, IIUC, instead of a global option foffload-alias, you're saying something
>> like the following would be acceptable:
>> ...
>> #pragma GCC offload-alias=<none|pointer|all>
>> #pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
>>    {
>>      #pragma acc loop
>>      for (COUNTERTYPE ii = 0; ii < N; ii++)
>>        c[ii] = a[ii] + b[ii];
>>    }
>> ...
>> ?
>>
>> I suppose that would work (though a global option would allow us to easily
>> switch between none/pointer/all values in a large number of files, something
>> that might be useful when f.i. running an openacc  test suite).
>>
>>> Yeah, completely agreed.  I don't see why the offloaded region would be in
>>> any way special, they are C/C++/Fortran code as any other.
>>> What we can and should improve is teach IPA aliasing/points to analysis
>>> about the way we lower the host vs. offloading region boundary, so that
>>> if alias analysis on the caller of GOMP_target_ext/GOACC_parallel_keyed
>>> determines something it can be used on the offloaded function side and vice
>>> versa,
>>
>> I agree this would be a nice way to solve the aliasing info problem, but
>> considering the remark of Richard at
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032#c19 :
>> ...
>> Not that I think IPA PTA is anywhere near production ready
>
> Just to clarify on that sentence:
>   1) we lack good testing coverage for IPA PTA so wrong-code bugs might
> still exist
>   2) IPA PTA can use a _lot_ of memory and compile-time
>   3) for existing wrong-code issues I have merely dumbed down the
> use of the analysis result resulting in weaker alias analysis compared to
> the local PTA (for some cases)
>
> Because of 2) and no good way to avoid this I decided to not make
> fixing 3) a priority (and 1) still holds).
>

Hi,

thanks for the explanation. Filed as PR68331 - '[meta-bug] fipa-pta issues'.

Any feedback on the '#pragma GCC offload-alias=<none|pointer|all>' bit 
above? Is that sort of what you had in mind?

Thanks,
- Tom



* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-13 11:03           ` Tom de Vries
@ 2015-11-13 11:30             ` Richard Biener
  2015-11-13 11:39               ` Jakub Jelinek
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-13 11:30 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Jakub Jelinek, gcc-patches

On Fri, 13 Nov 2015, Tom de Vries wrote:

> On 13/11/15 09:46, Richard Biener wrote:
> > On Thu, 12 Nov 2015, Tom de Vries wrote:
> > 
> > > On 11/11/15 12:00, Jakub Jelinek wrote:
> > > > On Wed, Nov 11, 2015 at 11:51:02AM +0100, Richard Biener wrote:
> > > > > > The option -foffload-alias=pointer instructs the compiler to assume
> > > > > > that
> > > > > > > objects referenced in an offload region do not alias.
> > > > > > 
> > > > > > The option -foffload-alias=all instructs the compiler to make no
> > > > > > assumptions about aliasing in offload regions.
> > > > > > 
> > > > > > The default value is -foffload-alias=none.
> > > > > 
> > > > > I think global options for this is nonsense.  Please follow what
> > > > > we do for #pragma GCC ivdep for example, thus allow the alias
> > > > > behavior to be specified per "region" (whatever makes sense here
> > > > > in the context of offloading).
> > > 
> > > So, IIUC, instead of a global option foffload-alias, you're saying
> > > something
> > > like the following would be acceptable:
> > > ...
> > > #pragma GCC offload-alias=<none|pointer|all>
> > > #pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
> > >    {
> > >      #pragma acc loop
> > >      for (COUNTERTYPE ii = 0; ii < N; ii++)
> > >        c[ii] = a[ii] + b[ii];
> > >    }
> > > ...
> > > ?
> > > 
> > > I suppose that would work (though a global option would allow us to easily
> > > switch between none/pointer/all values in a large number of files,
> > > something
> > > that might be useful when f.i. running an openacc  test suite).
> > > 
> > > > Yeah, completely agreed.  I don't see why the offloaded region would be
> > > > in
> > > > any way special, they are C/C++/Fortran code as any other.
> > > > What we can and should improve is teach IPA aliasing/points to analysis
> > > > about the way we lower the host vs. offloading region boundary, so that
> > > > if alias analysis on the caller of GOMP_target_ext/GOACC_parallel_keyed
> > > > determines something it can be used on the offloaded function side and
> > > > vice
> > > > versa,
> > > 
> > > I agree this would be a nice way to solve the aliasing info problem, but
> > > considering the remark of Richard at
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032#c19 :
> > > ...
> > > Not that I think IPA PTA is anywhere near production ready
> > 
> > Just to clarify on that sentence:
> >   1) we lack good testing coverage for IPA PTA so wrong-code bugs might
> > still exist
> >   2) IPA PTA can use a _lot_ of memory and compile-time
> >   3) for existing wrong-code issues I have merely dumbed down the
> > use of the analysis result resulting in weaker alias analysis compared to
> > the local PTA (for some cases)
> > 
> > Because of 2) and no good way to avoid this I decided to not make
> > fixing 3) a priority (and 1) still holds).
> > 
> 
> Hi,
> 
> thanks for the explanation. Filed as PR68331 - '[meta-bug] fipa-pta issues'.
> 
> Any feedback on the '#pragma GCC offload-alias=<none|pointer|all>' bit above?
> Is that sort of what you had in mind?

Yes.  Whether that makes sense is another question of course.  You can
annotate memory references with MR_DEPENDENCE_BASE/CLIQUE yourself
as well if you know dependences without the user's intervention.
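
As a rough illustration of what such an annotation buys, here is a toy model
of the disambiguation rule.  It assumes the usual reading of the clique/base
pair (two references in the same non-zero clique but with different bases
are independent); toy_mem_ref and toy_refs_independent_p are hypothetical
names, not GCC's data structures.

```cpp
// Toy model only: mirrors the idea behind MR_DEPENDENCE_CLIQUE and
// MR_DEPENDENCE_BASE, but is not GCC code.
struct toy_mem_ref
{
  unsigned clique;  // 0 is assumed to mean "no dependence info"
  unsigned base;    // identifies the object within the clique
};

// Return true if the annotations alone prove that A and B do not alias:
// both references are in the same non-zero clique and have different bases.
bool
toy_refs_independent_p (const toy_mem_ref &a, const toy_mem_ref &b)
{
  return a.clique != 0 && a.clique == b.clique && a.base != b.base;
}
```

In the a[]/b[]/c[] example above, giving all three arrays the same clique and
distinct bases would be enough for the loads and stores to be treated as
independent, without any user-visible option or pragma.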

Richard.

> Thanks,
> - Tom
> 
> 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-13 11:30             ` Richard Biener
@ 2015-11-13 11:39               ` Jakub Jelinek
  2015-11-21 12:24                 ` Tom de Vries
  2015-12-11 12:45                 ` Tom de Vries
  0 siblings, 2 replies; 133+ messages in thread
From: Jakub Jelinek @ 2015-11-13 11:39 UTC (permalink / raw)
  To: Richard Biener; +Cc: Tom de Vries, gcc-patches

On Fri, Nov 13, 2015 at 12:29:51PM +0100, Richard Biener wrote:
> > thanks for the explanation. Filed as PR68331 - '[meta-bug] fipa-pta issues'.
> > 
> > Any feedback on the '#pragma GCC offload-alias=<none|pointer|all>' bit above?
> > Is that sort of what you had in mind?
> 
> Yes.  Whether that makes sense is another question of course.  You can
> annotate memory references with MR_DEPENDENCE_BASE/CLIQUE yourself
> as well if you know dependences without the user's intervention.

I really don't like even the GCC offload-alias pragma; I just don't see
anything special about the offload code.  Not to mention that the same issue
already exists with other outlined functions, like OpenMP tasks or parallel
regions; those aren't offloaded, yet they can suffer from worse
alias/points-to analysis too.

We simply have a compiler-internal interface between the caller and the
callee of each outlined region, and each such interface has its own
structure type used to communicate the info; we can attach attributes to
the fields, or some flags, to indicate properties that are interesting from
an aliasing POV.  We don't really need to perform full IPA-PTA; perhaps it
would be enough to:
a) record somewhere in cgraph the relationship between such callers and
   callees (for offloading regions we already have the "omp target
   entrypoint" attribute on the callee and a single caller),
b) tell LTO not to split those into different partitions if that is easily
   possible, and
c) just for these pairs, perform aliasing/points-to analysis in the caller
   and record the result on the callee side using cliques/special
   attributes/whatever, so that the callee (outlined OpenMP/OpenACC/Cilk+
   region) can then improve its alias analysis.

	Jakub

* Re: [PATCH, 5/16] Add in_oacc_kernels_region in struct loop
  2015-11-11 10:57   ` Richard Biener
  2015-11-16 11:39     ` Tom de Vries
@ 2015-11-16 11:39     ` Tom de Vries
  1 sibling, 0 replies; 133+ messages in thread
From: Tom de Vries @ 2015-11-16 11:39 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 2655 bytes --]

On 11/11/15 11:55, Richard Biener wrote:
> On Mon, 9 Nov 2015, Tom de Vries wrote:
>
>> On 09/11/15 16:35, Tom de Vries wrote:
>>> Hi,
>>>
>>> this patch series for stage1 trunk adds support to:
>>> - parallelize oacc kernels regions using parloops, and
>>> - map the loops onto the oacc gang dimension.
>>>
>>> The patch series contains these patches:
>>>
>>>        1    Insert new exit block only when needed in
>>>           transform_to_exit_first_loop_alt
>>>        2    Make create_parallel_loop return void
>>>        3    Ignore reduction clause on kernels directive
>>>        4    Implement -foffload-alias
>>>        5    Add in_oacc_kernels_region in struct loop
>>>        6    Add pass_oacc_kernels
>>>        7    Add pass_dominator_oacc_kernels
>>>        8    Add pass_ch_oacc_kernels
>>>        9    Add pass_parallelize_loops_oacc_kernels
>>>       10    Add pass_oacc_kernels pass group in passes.def
>>>       11    Update testcases after adding kernels pass group
>>>       12    Handle acc loop directive
>>>       13    Add c-c++-common/goacc/kernels-*.c
>>>       14    Add gfortran.dg/goacc/kernels-*.f95
>>>       15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>>       16    Add libgomp.oacc-fortran/kernels-*.f95
>>>
>>> The first 9 patches are more or less independent, but patches 10-16 are
>>> intended to be committed at the same time.
>>>
>>> Bootstrapped and reg-tested on x86_64.
>>>
>>> Build and reg-tested with nvidia accelerator, in combination with a
>>> patch that enables accelerator testing (which is submitted at
>>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>>
>>> I'll post the individual patches in reply to this message.
>>
>> this patch adds and initializes the in_oacc_kernels_region field in
>> struct loop.
>>
>> The field is used to signal to subsequent passes that we're dealing with a
>> loop in a kernels region that we're trying to parallelize.
>>
>> Note that we do not parallelize kernels regions with more than one loop nest.
>> [ In general, kernels regions with more than one loop nest should be split up
>> into separate kernels regions, but that's not supported atm. ]
>
> I think mark_loops_in_oacc_kernels_region can be greatly simplified.
>
> Both region entry and exit should have the same ->loop_father (a SESE
> region).  Then you can just walk that loop's inner (and their sibling)
> loops, checking their header domination relation with the region entry
> and exit (only necessary for direct inner loops).

Updated patch to use the loops structure.  Atm I'm also skipping loops 
containing sibling loops, since I have no test-cases for that yet.
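
The shape of that walk can be sketched on a stand-alone loop tree.  This is
only a simplified model: toy_loop and toy_mark_single_loop_nest are
hypothetical names, and the dominance checks against the region entry and
exit are left out, so every direct child of OUTER is treated as being inside
the region.

```cpp
#include <cassert>

// Hypothetical stand-in for struct loop: only the tree links and the new
// flag are modelled.
struct toy_loop
{
  toy_loop *inner = nullptr;  // first child loop
  toy_loop *next = nullptr;   // next sibling loop
  bool in_oacc_kernels_region = false;
};

// Mark the loop nest below OUTER if it is a single chain of loops, i.e. no
// loop in it has a sibling.  Return true if the loops were marked.
bool
toy_mark_single_loop_nest (toy_loop *outer)
{
  assert (outer != nullptr);

  // Require exactly one direct child loop ...
  toy_loop *single_outer = outer->inner;
  if (single_outer == nullptr || single_outer->next != nullptr)
    return false;

  // ... and no sibling loops at any deeper level.
  for (toy_loop *loop = single_outer->inner; loop != nullptr;
       loop = loop->inner)
    if (loop->next != nullptr)
      return false;

  // Mark the loops in the nest; OUTER itself is not part of the region.
  for (toy_loop *loop = single_outer; loop != nullptr; loop = loop->inner)
    loop->in_oacc_kernels_region = true;
  return true;
}
```

A nest with a sibling at any level is left unmarked as a whole, which matches
the "skipping loops containing sibling loops" behaviour described above.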

Thanks,
- Tom


[-- Attachment #2: 0003-Add-in_oacc_kernels_region-in-struct-loop.patch --]
[-- Type: text/x-patch, Size: 2785 bytes --]

Add in_oacc_kernels_region in struct loop

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* cfgloop.h (struct loop): Add in_oacc_kernels_region field.
	* omp-low.c (mark_loops_in_oacc_kernels_region): New function.
	(expand_omp_target): Call mark_loops_in_oacc_kernels_region.

---
 gcc/cfgloop.h |  3 +++
 gcc/omp-low.c | 43 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+)

diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 6af6893..ee73bf9 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -191,6 +191,9 @@ struct GTY ((chain_next ("%h.next"))) loop {
   /* True if we should try harder to vectorize this loop.  */
   bool force_vectorize;
 
+  /* True if the loop is part of an oacc kernels region.  */
+  bool in_oacc_kernels_region;
+
   /* For SIMD loops, this is a unique identifier of the loop, referenced
      by IFN_GOMP_SIMD_VF, IFN_GOMP_SIMD_LANE and IFN_GOMP_SIMD_LAST_LANE
      builtins.  */
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 5f76434..fba7bbd 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -12450,6 +12450,46 @@ get_oacc_ifn_dim_arg (const gimple *stmt)
   return (int) axis;
 }
 
+/* Mark the loops inside the kernels region starting at REGION_ENTRY and ending
+   at REGION_EXIT.  */
+
+static void
+mark_loops_in_oacc_kernels_region (basic_block region_entry,
+				   basic_block region_exit)
+{
+  struct loop *outer = region_entry->loop_father;
+  gcc_assert (region_exit == NULL || outer == region_exit->loop_father);
+
+  /* Don't parallelize the kernels region if it contains more than one outer
+     loop.  */
+  unsigned int nr_outer_loops = 0;
+  struct loop *single_outer;
+  for (struct loop *loop = outer->inner; loop != NULL; loop = loop->next)
+    {
+      gcc_assert (loop_outer (loop) == outer);
+
+      if (!dominated_by_p (CDI_DOMINATORS, loop->header, region_entry))
+	continue;
+
+      if (region_exit != NULL
+	  && dominated_by_p (CDI_DOMINATORS, loop->header, region_exit))
+	continue;
+
+      nr_outer_loops++;
+      single_outer = loop;
+    }
+  if (nr_outer_loops != 1)
+    return;
+
+  for (struct loop *loop = single_outer->inner; loop != NULL; loop = loop->inner)
+    if (loop->next)
+      return;
+
+  /* Mark the loops in the region.  */
+  for (struct loop *loop = single_outer; loop != NULL; loop = loop->inner)
+    loop->in_oacc_kernels_region = true;
+}
+
 /* Expand the GIMPLE_OMP_TARGET starting at REGION.  */
 
 static void
@@ -12505,6 +12545,9 @@ expand_omp_target (struct omp_region *region)
   entry_bb = region->entry;
   exit_bb = region->exit;
 
+  if (gimple_omp_target_kind (entry_stmt) == GF_OMP_TARGET_KIND_OACC_KERNELS)
+    mark_loops_in_oacc_kernels_region (region->entry, region->exit);
+
   if (offloaded)
     {
       unsigned srcidx, dstidx, num;

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-11 11:03   ` Richard Biener
@ 2015-11-16 11:55     ` Tom de Vries
  2015-11-16 12:45       ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-16 11:55 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 4549 bytes --]

On 11/11/15 12:02, Richard Biener wrote:
> On Mon, 9 Nov 2015, Tom de Vries wrote:
>
>> On 09/11/15 16:35, Tom de Vries wrote:
>>> Hi,
>>>
>>> this patch series for stage1 trunk adds support to:
>>> - parallelize oacc kernels regions using parloops, and
>>> - map the loops onto the oacc gang dimension.
>>>
>>> The patch series contains these patches:
>>>
>>>        1    Insert new exit block only when needed in
>>>           transform_to_exit_first_loop_alt
>>>        2    Make create_parallel_loop return void
>>>        3    Ignore reduction clause on kernels directive
>>>        4    Implement -foffload-alias
>>>        5    Add in_oacc_kernels_region in struct loop
>>>        6    Add pass_oacc_kernels
>>>        7    Add pass_dominator_oacc_kernels
>>>        8    Add pass_ch_oacc_kernels
>>>        9    Add pass_parallelize_loops_oacc_kernels
>>>       10    Add pass_oacc_kernels pass group in passes.def
>>>       11    Update testcases after adding kernels pass group
>>>       12    Handle acc loop directive
>>>       13    Add c-c++-common/goacc/kernels-*.c
>>>       14    Add gfortran.dg/goacc/kernels-*.f95
>>>       15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>>       16    Add libgomp.oacc-fortran/kernels-*.f95
>>>
>>> The first 9 patches are more or less independent, but patches 10-16 are
>>> intended to be committed at the same time.
>>>
>>> Bootstrapped and reg-tested on x86_64.
>>>
>>> Build and reg-tested with nvidia accelerator, in combination with a
>>> patch that enables accelerator testing (which is submitted at
>>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>>
>>> I'll post the individual patches in reply to this message.
>>>
>>
>> This patch adds the pass_oacc_kernels pass group to the pass list in
>> passes.def.
>>
>> Note the repetition of pass_lim/pass_copy_prop. The first pair is for an inner
>> loop in a loop nest, the second for an outer loop in a loop nest.
>
> @@ -86,6 +86,27 @@ along with GCC; see the file COPYING3.  If not see
>            /* pass_build_ealias is a dummy pass that ensures that we
>               execute TODO_rebuild_alias at this point.  */
>            NEXT_PASS (pass_build_ealias);
> +         /* Pass group that runs when there are oacc kernels in the
> +            function.  */
> +         NEXT_PASS (pass_oacc_kernels);
> +         PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
> +             NEXT_PASS (pass_dominator_oacc_kernels);
> +             NEXT_PASS (pass_ch_oacc_kernels);
> +             NEXT_PASS (pass_dominator_oacc_kernels);
> +             NEXT_PASS (pass_tree_loop_init);
> +             NEXT_PASS (pass_lim);
> +             NEXT_PASS (pass_copy_prop);
> +             NEXT_PASS (pass_lim);
> +             NEXT_PASS (pass_copy_prop);
>
> iterate lim/copyprop twice?!  Why's that needed?
>

I've managed to eliminate the last pass_copy_prop, but not pass_lim. 
I've added a comment:
...
   /* We use pass_lim to rewrite in-memory iteration and reduction
      variable accesses in loops into local variable accesses.
      However, a single pass instantiation manages to do this only for
      one loop level, so we use pass_lim twice to at least be able to
      handle a loop nest with a depth of two.  */
   NEXT_PASS (pass_lim);
   NEXT_PASS (pass_copy_prop);
   NEXT_PASS (pass_lim);
...
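
To make the rewrite concrete, here is a source-level sketch of the effect on
a single loop level.  It is only an illustration: the function names are made
up, and it assumes that the accumulator does not alias the array, which is
what the compiler has to prove before it can do this.

```cpp
// Before: the reduction variable lives in memory and is loaded and stored
// on every iteration.
void
sum_in_memory (const int *a, int n, int *sum)
{
  for (int i = 0; i < n; i++)
    *sum = *sum + a[i];
}

// After: the reduction variable is a local variable inside the loop, with
// one load before the loop and one store after it.
void
sum_in_local (const int *a, int n, int *sum)
{
  int tmp = *sum;
  for (int i = 0; i < n; i++)
    tmp = tmp + a[i];
  *sum = tmp;
}
```

In a loop nest of depth two, the accesses in the inner loop become local
first; the load and store that this leaves in the outer loop body are then
candidates for the second pass_lim.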

> +             NEXT_PASS (pass_scev_cprop);
>
> What's that for?  It's supposed to help removing loops - I don't
> expect kernels to vanish.

I'm using pass_scev_cprop for the "final value replacement" 
functionality. Added comment.
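
As a small source-level illustration of final value replacement (the function
names are made up for this sketch), a use of an induction variable after the
loop is replaced by a closed-form expression of the loop bound:

```cpp
// Before: the value of I after the loop is only available by running the
// loop.
int
final_value_by_loop (int n)
{
  int i = 0;
  while (i < n)
    i++;
  return i;
}

// After: the use after the loop is replaced by a closed-form expression, so
// it no longer depends on the loop.
int
final_value_closed_form (int n)
{
  return n > 0 ? n : 0;
}
```

The loop itself stays in place when it has other effects; only the value
computed from the induction variable after the loop is rewritten.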

>
> +             NEXT_PASS (pass_tree_loop_done);
> +             NEXT_PASS (pass_dominator_oacc_kernels);
>
> Three times DOM?  No please.  I wonder why you don't run oacc_kernels
> after FRE and drop the initial DOM(s).
>

Done. There's just one pass_dominator_oacc_kernels left now.

> +             NEXT_PASS (pass_dce);
> +             NEXT_PASS (pass_tree_loop_init);
> +             NEXT_PASS (pass_parallelize_loops_oacc_kernels);
> +             NEXT_PASS (pass_expand_omp_ssa);
> +             NEXT_PASS (pass_tree_loop_done);
>
> The switches into/outof tree_loop also look odd to me, but well
> (they'll be controlled by -ftree-loop-optimize)).
>

I've eliminated all uses of pass_tree_loop_init/pass_tree_loop_done
in the pass group. Instead, I've added conditional loop optimizer setup in:
- pass_lim and pass_scev_cprop (added in this patch), and
- pass_parallelize_loops_oacc_kernels (added in patch "Add
  pass_parallelize_loops_oacc_kernels").

Thanks,
- Tom


[-- Attachment #2: 0007-Add-pass_oacc_kernels-pass-group-in-passes.def.patch --]
[-- Type: text/x-patch, Size: 5177 bytes --]

Add pass_oacc_kernels pass group in passes.def

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* omp-low.c (pass_expand_omp_ssa::clone): New function.
	* passes.def: Add pass_oacc_kernels pass group.
	* tree-ssa-loop-ch.c (pass_ch::clone): New function.
	* tree-ssa-loop-im.c (tree_ssa_lim): Allow to run outside
	pass_tree_loop.
	* tree-ssa-loop.c (pass_scev_cprop::clone): New function.
	(pass_scev_cprop::execute): Allow to run outside pass_tree_loop.

---
 gcc/omp-low.c          |  1 +
 gcc/passes.def         | 25 +++++++++++++++++++++++++
 gcc/tree-ssa-loop-ch.c |  2 ++
 gcc/tree-ssa-loop-im.c | 14 ++++++++++++++
 gcc/tree-ssa-loop.c    | 22 +++++++++++++++++++++-
 5 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 9eae09a..8078afb 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -13385,6 +13385,7 @@ public:
       return !(fun->curr_properties & PROP_gimple_eomp);
     }
   virtual unsigned int execute (function *) { return execute_expand_omp (); }
+  opt_pass * clone () { return new pass_expand_omp_ssa (m_ctxt); }
 
 }; // class pass_expand_omp_ssa
 
diff --git a/gcc/passes.def b/gcc/passes.def
index db822d3..d76cfd3 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -87,6 +87,31 @@ along with GCC; see the file COPYING3.  If not see
 	     execute TODO_rebuild_alias at this point.  */
 	  NEXT_PASS (pass_build_ealias);
 	  NEXT_PASS (pass_fre);
+	  /* Pass group that runs when the function is an offloaded function
+	     containing oacc kernels loops.  */
+	  NEXT_PASS (pass_oacc_kernels);
+	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
+	      /* We need pass_ch here, because pass_lim has no effect on
+	         exit-first loops (PR65442).  Ideally we want to remove both
+		 this pass instantiation, and the reverse transformation
+		 transform_to_exit_first_loop_alt, which is done in
+		 pass_parallelize_loops_oacc_kernels. */
+	      NEXT_PASS (pass_ch);
+	      /* We use pass_lim to rewrite in-memory iteration and reduction
+	         variable accesses in loops into local variable accesses.
+		 However, a single pass instantiation manages to do this only
+		 for one loop level, so we use pass_lim twice to at least be
+		 able to handle a loop nest with a depth of two.  */
+	      NEXT_PASS (pass_lim);
+	      NEXT_PASS (pass_copy_prop);
+	      NEXT_PASS (pass_lim);
+	      /* We use pass_scev_cprop here for final value replacement.  */
+	      NEXT_PASS (pass_scev_cprop);
+	      NEXT_PASS (pass_dominator_oacc_kernels);
+	      NEXT_PASS (pass_dce);
+	      NEXT_PASS (pass_parallelize_loops_oacc_kernels);
+	      NEXT_PASS (pass_expand_omp_ssa);
+	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_merge_phi);
           NEXT_PASS (pass_dse);
 	  NEXT_PASS (pass_cd_dce);
diff --git a/gcc/tree-ssa-loop-ch.c b/gcc/tree-ssa-loop-ch.c
index 7e618bf..6493fcc 100644
--- a/gcc/tree-ssa-loop-ch.c
+++ b/gcc/tree-ssa-loop-ch.c
@@ -165,6 +165,8 @@ public:
   /* Initialize and finalize loop structures, copying headers inbetween.  */
   virtual unsigned int execute (function *);
 
+  opt_pass * clone () { return new pass_ch (m_ctxt); }
+
 protected:
   /* ch_base method: */
   virtual bool process_loop_p (struct loop *loop);
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 30b53ce..48810f3 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -43,6 +43,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa-propagate.h"
 #include "trans-mem.h"
 #include "gimple-fold.h"
+#include "tree-scalar-evolution.h"
 
 /* TODO:  Support for predicated code motion.  I.e.
 
@@ -2501,6 +2502,19 @@ tree_ssa_lim (void)
 {
   unsigned int todo;
 
+  if (!loops_state_satisfies_p (LOOPS_NORMAL
+				| LOOPS_HAVE_RECORDED_EXITS
+				| LOOP_CLOSED_SSA))
+    {
+      loop_optimizer_init (LOOPS_NORMAL
+			   | LOOPS_HAVE_RECORDED_EXITS);
+      rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
+
+      /* We might discover new loops, e.g. when turning irreducible
+	 regions into reducible.  */
+      scev_initialize ();
+    }
+
   tree_ssa_lim_initialize ();
 
   /* Gathers information about memory accesses in the loops.  */
diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index b51cac2..570406f 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -373,10 +373,30 @@ public:
 
   /* opt_pass methods: */
   virtual bool gate (function *) { return flag_tree_scev_cprop; }
-  virtual unsigned int execute (function *) { return scev_const_prop (); }
+  virtual unsigned int execute (function *);
+  opt_pass * clone () { return new pass_scev_cprop (m_ctxt); }
 
 }; // class pass_scev_cprop
 
+unsigned int
+pass_scev_cprop::execute (function *)
+{
+  if (!loops_state_satisfies_p (LOOPS_NORMAL
+				| LOOPS_HAVE_RECORDED_EXITS
+				| LOOP_CLOSED_SSA))
+    {
+      loop_optimizer_init (LOOPS_NORMAL
+			   | LOOPS_HAVE_RECORDED_EXITS);
+      rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
+
+      /* We might discover new loops, e.g. when turning irreducible
+	 regions into reducible.  */
+      scev_initialize ();
+    }
+
+  return scev_const_prop ();
+}
+
 } // anon namespace
 
 gimple_opt_pass *

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 9/16] Add pass_parallelize_loops_oacc_kernels
  2015-11-09 19:53 ` [PATCH, 9/16] Add pass_parallelize_loops_oacc_kernels Tom de Vries
@ 2015-11-16 11:59   ` Tom de Vries
  2015-11-24 12:27     ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-16 11:59 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 3350 bytes --]

On 09/11/15 20:52, Tom de Vries wrote:
> On 09/11/15 16:35, Tom de Vries wrote:
>> Hi,
>>
>> this patch series for stage1 trunk adds support to:
>> - parallelize oacc kernels regions using parloops, and
>> - map the loops onto the oacc gang dimension.
>>
>> The patch series contains these patches:
>>
>>       1    Insert new exit block only when needed in
>>          transform_to_exit_first_loop_alt
>>       2    Make create_parallel_loop return void
>>       3    Ignore reduction clause on kernels directive
>>       4    Implement -foffload-alias
>>       5    Add in_oacc_kernels_region in struct loop
>>       6    Add pass_oacc_kernels
>>       7    Add pass_dominator_oacc_kernels
>>       8    Add pass_ch_oacc_kernels
>>       9    Add pass_parallelize_loops_oacc_kernels
>>      10    Add pass_oacc_kernels pass group in passes.def
>>      11    Update testcases after adding kernels pass group
>>      12    Handle acc loop directive
>>      13    Add c-c++-common/goacc/kernels-*.c
>>      14    Add gfortran.dg/goacc/kernels-*.f95
>>      15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>      16    Add libgomp.oacc-fortran/kernels-*.f95
>>
>> The first 9 patches are more or less independent, but patches 10-16 are
>> intended to be committed at the same time.
>>
>> Bootstrapped and reg-tested on x86_64.
>>
>> Build and reg-tested with nvidia accelerator, in combination with a
>> patch that enables accelerator testing (which is submitted at
>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>
>> I'll post the individual patches in reply to this message.
>
> This patch adds pass_parallelize_loops_oacc_kernels.
>
> There are a number of things we do differently in parloops for oacc kernels:
> - in normal parloops, we generate code to choose between a parallel
>    version of the loop, and a sequential (low iteration count) version.
>    Since the code in an oacc kernels region is supposed to run on the
>    accelerator anyway, we skip this check, and don't add a low iteration
>    count loop.
> - in normal parloops, we generate a #pragma omp parallel /
>    GIMPLE_OMP_RETURN pair to delimit the region which we will split off
>    into a thread function. Since the oacc kernels region is already
>    split off, we don't add this pair.
> - we indicate the parallelization factor by setting the oacc function
>    attributes
> - we generate a #pragma oacc loop instead of a #pragma omp for, and
>    we add the gang clause
> - in normal parloops, we rewrite the variable accesses in the loop
>    into accesses relative to a thread function parameter. For the
>    oacc kernels region, that rewrite has already been done at omp-lower,
>    so we skip this.
> - we need to ensure that the entire kernels region can be run in
>    parallel. The loop independence check is already present, so for oacc
>    kernels we add a conflict check between the memory references in
>    blocks outside the loop and the entire region.
> - we guard stores in the blocks outside the loop with gang_pos == 0.
>    There's no need for every gang to write to the same location; we can
>    do this in just one gang. (Typically this is the write of the final
>    value of the iteration variable if that one is copied back to the
>    host).
>
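
For illustration only (this is not part of the patch): the single-gang
guard from the last bullet can be modelled in plain C++. The sketch below
assumes that every gang executes the whole region, that loop iterations
are distributed cyclically over the gangs, and that final_i stands for a
location copied back to the host. RegionResult and run_region are made-up
names, not GCC or libgomp API.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of one kernels region after parallelization.
struct RegionResult {
  int final_i = 0;      // stands for a value copied back to the host
  int store_count = 0;  // number of gangs that performed the store
};

// Simulate NUM_GANGS gangs, each executing the whole region in turn.
// The loop iterations are shared out over the gangs; the store that
// follows the loop is guarded so that only gang 0 performs it.
RegionResult run_region(int num_gangs, int n, std::vector<int> &a) {
  assert(num_gangs > 0);
  assert(n >= 0 && static_cast<std::size_t>(n) <= a.size());

  RegionResult res;
  for (int gang_pos = 0; gang_pos < num_gangs; ++gang_pos) {
    // Loop body: each gang handles its own share of the iterations
    // (cyclic distribution is an assumption of this sketch).
    for (int i = gang_pos; i < n; i += num_gangs)
      a[i] = 2 * i;

    // Store outside the loop: every gang would write the same value,
    // so the guard gang_pos == 0 lets a single gang do it.
    if (gang_pos == 0) {
      res.final_i = n;
      ++res.store_count;
    }
  }
  return res;
}
```

Calling run_region (4, 10, a) on a ten-element vector fills all ten
elements, while the guarded store is executed by one gang only.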

Reposting with loop optimizer init added in 
pass_parallelize_loops_oacc_kernels::execute.
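
As a rough model of what "loop optimizer init" guarantees: other passes
in this series initialize the loop structures only when
loops_state_satisfies_p reports that some required property is missing.
The sketch below shows that "initialize only if needed" check in isolation.
The flag values and the names state_satisfies and ensure_loop_state are
assumptions made for illustration; they are not GCC's definitions.

```cpp
#include <cassert>

// Hypothetical flag values; GCC's real LOOPS_* constants are not
// reproduced here.
enum LoopState : unsigned {
  kLoopsNormal = 1u << 0,
  kLoopsHaveRecordedExits = 1u << 1,
  kLoopClosedSsa = 1u << 2,
};

// True if every bit in REQUIRED is also set in STATE.
bool state_satisfies(unsigned state, unsigned required) {
  return (state & required) == required;
}

// Return the state after making sure all REQUIRED properties hold.
// *INITIALIZED reports whether any setup work was necessary.
unsigned ensure_loop_state(unsigned state, unsigned required,
                           bool *initialized) {
  assert(initialized != nullptr);
  *initialized = !state_satisfies(state, required);
  return state | required;
}
```

A pass that may run with or without prior loop setup would call such a
helper first; a second call on the returned state finds nothing left to do.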

Thanks,
- Tom

[-- Attachment #2: 0006-Add-pass_parallelize_loops_oacc_kernels.patch --]
[-- Type: text/x-patch, Size: 30773 bytes --]

Add pass_parallelize_loops_oacc_kernels

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* omp-low.c (set_oacc_fn_attrib): Make extern.
	(expand_omp_atomic_fetch_op): Release defs of update stmt.
	* omp-low.h (set_oacc_fn_attrib): Declare.
	* tree-parloops.c (struct reduction_info): Add reduc_addr field.
	(create_call_for_reduction_1): Handle case that reduc_addr is non-NULL.
	(create_parallel_loop, gen_parallel_loop, try_create_reduction_list):
	Add and handle function parameter oacc_kernels_p.
	(get_omp_data_i_param): New function.
	(ref_conflicts_with_region, oacc_entry_exit_ok_1)
	(oacc_entry_exit_single_gang, oacc_entry_exit_ok): New functions.
	(parallelize_loops): Add and handle function parameter oacc_kernels_p.
	Calculate dominance info.  Skip loops that are not in a kernels region
	in oacc_kernels_p mode.  Skip inner loops of parallelized loops.
	(pass_parallelize_loops::execute): Call parallelize_loops with false
	argument.
	(pass_data_parallelize_loops_oacc_kernels): New pass_data.
	(class pass_parallelize_loops_oacc_kernels): New pass.
	(pass_parallelize_loops_oacc_kernels::execute)
	(make_pass_parallelize_loops_oacc_kernels): New functions.
	* tree-pass.h (make_pass_parallelize_loops_oacc_kernels): Declare.

---
 gcc/omp-low.c       |   8 +-
 gcc/omp-low.h       |   1 +
 gcc/tree-parloops.c | 693 +++++++++++++++++++++++++++++++++++++++++++++++-----
 gcc/tree-pass.h     |   2 +
 4 files changed, 640 insertions(+), 64 deletions(-)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index fba7bbd..9eae09a 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -11944,10 +11944,14 @@ expand_omp_atomic_fetch_op (basic_block load_bb,
   gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_ATOMIC_STORE);
   gsi_remove (&gsi, true);
   gsi = gsi_last_bb (store_bb);
+  stmt = gsi_stmt (gsi);
   gsi_remove (&gsi, true);
 
   if (gimple_in_ssa_p (cfun))
-    update_ssa (TODO_update_ssa_no_phi);
+    {
+      release_defs (stmt);
+      update_ssa (TODO_update_ssa_no_phi);
+    }
 
   return true;
 }
@@ -12321,7 +12325,7 @@ replace_oacc_fn_attrib (tree fn, tree dims)
    function attribute.  Push any that are non-constant onto the ARGS
    list, along with an appropriate GOMP_LAUNCH_DIM tag.  */
 
-static void
+void
 set_oacc_fn_attrib (tree fn, tree clauses, vec<tree> *args)
 {
   /* Must match GOMP_DIM ordering.  */
diff --git a/gcc/omp-low.h b/gcc/omp-low.h
index 194b3d1..1790f40 100644
--- a/gcc/omp-low.h
+++ b/gcc/omp-low.h
@@ -33,6 +33,7 @@ extern tree omp_member_access_dummy_var (tree);
 extern void replace_oacc_fn_attrib (tree, tree);
 extern tree build_oacc_routine_dims (tree);
 extern tree get_oacc_fn_attrib (tree);
+extern void set_oacc_fn_attrib (tree, tree, vec<tree> *);
 extern int get_oacc_ifn_dim_arg (const gimple *);
 extern int get_oacc_fn_dim_size (tree, int);
 
diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
index 17415a8..96b8415 100644
--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c
@@ -53,6 +53,10 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa.h"
 #include "params.h"
 #include "params-enum.h"
+#include "tree-ssa-alias.h"
+#include "tree-eh.h"
+#include "gomp-constants.h"
+#include "tree-dfa.h"
 
 /* This pass tries to distribute iterations of loops into several threads.
    The implementation is straightforward -- for each loop we test whether its
@@ -192,6 +196,8 @@ struct reduction_info
 				   of the reduction variable when existing the loop. */
   tree initial_value;		/* The initial value of the reduction var before entering the loop.  */
   tree field;			/*  the name of the field in the parloop data structure intended for reduction.  */
+  tree reduc_addr;		/* The address of the reduction variable for
+				   openacc reductions.  */
   tree init;			/* reduction initialization value.  */
   gphi *new_phi;		/* (helper field) Newly created phi node whose result
 				   will be passed to the atomic operation.  Represents
@@ -1085,10 +1091,29 @@ create_call_for_reduction_1 (reduction_info **slot, struct clsn_data *clsn_data)
   tree tmp_load, name;
   gimple *load;
 
-  load_struct = build_simple_mem_ref (clsn_data->load);
-  t = build3 (COMPONENT_REF, type, load_struct, reduc->field, NULL_TREE);
+  if (reduc->reduc_addr == NULL_TREE)
+    {
+      load_struct = build_simple_mem_ref (clsn_data->load);
+      t = build3 (COMPONENT_REF, type, load_struct, reduc->field, NULL_TREE);
+
+      addr = build_addr (t);
+    }
+  else
+    {
+      /* Set the address for the atomic store.  */
+      addr = reduc->reduc_addr;
 
-  addr = build_addr (t);
+      /* Remove the non-atomic store '*addr = sum'.  */
+      tree res = PHI_RESULT (reduc->keep_res);
+      use_operand_p use_p;
+      gimple *stmt;
+      bool single_use_p = single_imm_use (res, &use_p, &stmt);
+      gcc_assert (single_use_p);
+      replace_uses_by (gimple_vdef (stmt),
+		       gimple_vuse (stmt));
+      gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+      gsi_remove (&gsi, true);
+    }
 
   /* Create phi node.  */
   bb = clsn_data->load_bb;
@@ -1990,7 +2015,8 @@ transform_to_exit_first_loop (struct loop *loop,
 
 static void
 create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
-		      tree new_data, unsigned n_threads, location_t loc)
+		      tree new_data, unsigned n_threads, location_t loc,
+		      bool oacc_kernels_p)
 {
   gimple_stmt_iterator gsi;
   basic_block bb, paral_bb, for_bb, ex_bb, continue_bb;
@@ -2003,19 +2029,33 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
   gomp_continue *omp_cont_stmt;
   tree cvar, cvar_init, initvar, cvar_next, cvar_base, type;
   edge exit, nexit, guard, end, e;
+  tree for_clauses = NULL_TREE;
 
   /* Prepare the GIMPLE_OMP_PARALLEL statement.  */
   bb = loop_preheader_edge (loop)->src;
-  paral_bb = single_pred (bb);
-  gsi = gsi_last_bb (paral_bb);
+  if (!oacc_kernels_p)
+    {
+      paral_bb = single_pred (bb);
+      gsi = gsi_last_bb (paral_bb);
+    }
 
-  t = build_omp_clause (loc, OMP_CLAUSE_NUM_THREADS);
-  OMP_CLAUSE_NUM_THREADS_EXPR (t)
-    = build_int_cst (integer_type_node, n_threads);
-  omp_par_stmt = gimple_build_omp_parallel (NULL, t, loop_fn, data);
-  gimple_set_location (omp_par_stmt, loc);
+  if (!oacc_kernels_p)
+    {
+      t = build_omp_clause (loc, OMP_CLAUSE_NUM_THREADS);
+      OMP_CLAUSE_NUM_THREADS_EXPR (t)
+	= build_int_cst (integer_type_node, n_threads);
+      omp_par_stmt = gimple_build_omp_parallel (NULL, t, loop_fn, data);
+      gimple_set_location (omp_par_stmt, loc);
 
-  gsi_insert_after (&gsi, omp_par_stmt, GSI_NEW_STMT);
+      gsi_insert_after (&gsi, omp_par_stmt, GSI_NEW_STMT);
+    }
+  else
+    {
+      tree clause = build_omp_clause (loc, OMP_CLAUSE_NUM_GANGS);
+      OMP_CLAUSE_NUM_GANGS_EXPR (clause)
+	= build_int_cst (integer_type_node, n_threads);
+      set_oacc_fn_attrib (cfun->decl, clause, NULL);
+    }
 
   /* Initialize NEW_DATA.  */
   if (data)
@@ -2033,12 +2073,18 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
       gsi_insert_before (&gsi, assign_stmt, GSI_SAME_STMT);
     }
 
-  /* Emit GIMPLE_OMP_RETURN for GIMPLE_OMP_PARALLEL.  */
-  bb = split_loop_exit_edge (single_dom_exit (loop));
-  gsi = gsi_last_bb (bb);
-  omp_return_stmt1 = gimple_build_omp_return (false);
-  gimple_set_location (omp_return_stmt1, loc);
-  gsi_insert_after (&gsi, omp_return_stmt1, GSI_NEW_STMT);
+  /* Skip insertion of OMP_RETURN for oacc_kernels_p.  We've already generated
+     one when lowering the oacc kernels directive in
+     pass_lower_omp/lower_omp ().  */
+  if (!oacc_kernels_p)
+    {
+      /* Emit GIMPLE_OMP_RETURN for GIMPLE_OMP_PARALLEL.  */
+      bb = split_loop_exit_edge (single_dom_exit (loop));
+      gsi = gsi_last_bb (bb);
+      omp_return_stmt1 = gimple_build_omp_return (false);
+      gimple_set_location (omp_return_stmt1, loc);
+      gsi_insert_after (&gsi, omp_return_stmt1, GSI_NEW_STMT);
+    }
 
   /* Extract data for GIMPLE_OMP_FOR.  */
   gcc_assert (loop->header == single_dom_exit (loop)->src);
@@ -2130,7 +2176,17 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
     OMP_CLAUSE_SCHEDULE_CHUNK_EXPR (t)
       = build_int_cst (integer_type_node, chunk_size);
 
-  for_stmt = gimple_build_omp_for (NULL, GF_OMP_FOR_KIND_FOR, t, 1, NULL);
+  if (1)
+    {
+      /* In combination with the NUM_GANGS on the parallel.  */
+      for_clauses = build_omp_clause (loc, OMP_CLAUSE_GANG);
+    }
+
+  for_stmt = gimple_build_omp_for (NULL,
+				   (oacc_kernels_p
+				    ? GF_OMP_FOR_KIND_OACC_LOOP
+				    : GF_OMP_FOR_KIND_FOR),
+				   for_clauses, 1, NULL);
   gimple_set_location (for_stmt, loc);
   gimple_omp_for_set_index (for_stmt, 0, initvar);
   gimple_omp_for_set_initial (for_stmt, 0, cvar_init);
@@ -2172,7 +2228,8 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
 static void
 gen_parallel_loop (struct loop *loop,
 		   reduction_info_table_type *reduction_list,
-		   unsigned n_threads, struct tree_niter_desc *niter)
+		   unsigned n_threads, struct tree_niter_desc *niter,
+		   bool oacc_kernels_p)
 {
   tree many_iterations_cond, type, nit;
   tree arg_struct, new_arg_struct;
@@ -2253,40 +2310,44 @@ gen_parallel_loop (struct loop *loop,
   if (stmts)
     gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
 
-  if (loop->inner)
-    m_p_thread=2;
-  else
-    m_p_thread=MIN_PER_THREAD;
-
-   many_iterations_cond =
-     fold_build2 (GE_EXPR, boolean_type_node,
-                nit, build_int_cst (type, m_p_thread * n_threads));
-
-  many_iterations_cond
-    = fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
-		   invert_truthvalue (unshare_expr (niter->may_be_zero)),
-		   many_iterations_cond);
-  many_iterations_cond
-    = force_gimple_operand (many_iterations_cond, &stmts, false, NULL_TREE);
-  if (stmts)
-    gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
-  if (!is_gimple_condexpr (many_iterations_cond))
+  if (!oacc_kernels_p)
     {
+      if (loop->inner)
+	m_p_thread=2;
+      else
+	m_p_thread=MIN_PER_THREAD;
+
+      many_iterations_cond =
+	fold_build2 (GE_EXPR, boolean_type_node,
+		     nit, build_int_cst (type, m_p_thread * n_threads));
+
+      many_iterations_cond
+	= fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
+		       invert_truthvalue (unshare_expr (niter->may_be_zero)),
+		       many_iterations_cond);
       many_iterations_cond
-	= force_gimple_operand (many_iterations_cond, &stmts,
-				true, NULL_TREE);
+	= force_gimple_operand (many_iterations_cond, &stmts, false, NULL_TREE);
       if (stmts)
 	gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
-    }
+      if (!is_gimple_condexpr (many_iterations_cond))
+	{
+	  many_iterations_cond
+	    = force_gimple_operand (many_iterations_cond, &stmts,
+				    true, NULL_TREE);
+	  if (stmts)
+	    gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop),
+					      stmts);
+	}
 
-  initialize_original_copy_tables ();
+      initialize_original_copy_tables ();
 
-  /* We assume that the loop usually iterates a lot.  */
-  prob = 4 * REG_BR_PROB_BASE / 5;
-  loop_version (loop, many_iterations_cond, NULL,
-		prob, prob, REG_BR_PROB_BASE - prob, true);
-  update_ssa (TODO_update_ssa);
-  free_original_copy_tables ();
+      /* We assume that the loop usually iterates a lot.  */
+      prob = 4 * REG_BR_PROB_BASE / 5;
+      loop_version (loop, many_iterations_cond, NULL,
+		    prob, prob, REG_BR_PROB_BASE - prob, true);
+      update_ssa (TODO_update_ssa);
+      free_original_copy_tables ();
+    }
 
   /* Base all the induction variables in LOOP on a single control one.  */
   canonicalize_loop_ivs (loop, &nit, true);
@@ -2306,6 +2367,9 @@ gen_parallel_loop (struct loop *loop,
     }
   else
     {
+      if (oacc_kernels_p)
+	n_threads = 1;
+
       /* Fall back on the method that handles more cases, but duplicates the
 	 loop body: move the exit condition of LOOP to the beginning of its
 	 header, and duplicate the part of the last iteration that gets disabled
@@ -2322,19 +2386,34 @@ gen_parallel_loop (struct loop *loop,
   entry = loop_preheader_edge (loop);
   exit = single_dom_exit (loop);
 
-  eliminate_local_variables (entry, exit);
-  /* In the old loop, move all variables non-local to the loop to a structure
-     and back, and create separate decls for the variables used in loop.  */
-  separate_decls_in_region (entry, exit, reduction_list, &arg_struct,
-			    &new_arg_struct, &clsn_data);
+  /* This rewrites the body in terms of new variables.  This has already
+     been done for oacc_kernels_p in pass_lower_omp/lower_omp ().  */
+  if (!oacc_kernels_p)
+    {
+      eliminate_local_variables (entry, exit);
+      /* In the old loop, move all variables non-local to the loop to a
+	 structure and back, and create separate decls for the variables used in
+	 loop.  */
+      separate_decls_in_region (entry, exit, reduction_list, &arg_struct,
+				&new_arg_struct, &clsn_data);
+    }
+  else
+    {
+      arg_struct = NULL_TREE;
+      new_arg_struct = NULL_TREE;
+      clsn_data.load = NULL_TREE;
+      clsn_data.load_bb = exit->dest;
+      clsn_data.store = NULL_TREE;
+      clsn_data.store_bb = NULL;
+    }
 
   /* Create the parallel constructs.  */
   loc = UNKNOWN_LOCATION;
   cond_stmt = last_stmt (loop->header);
   if (cond_stmt)
     loc = gimple_location (cond_stmt);
-  create_parallel_loop (loop, create_loop_fn (loc), arg_struct,
-			new_arg_struct, n_threads, loc);
+  create_parallel_loop (loop, create_loop_fn (loc), arg_struct, new_arg_struct,
+			n_threads, loc, oacc_kernels_p);
   if (reduction_list->elements () > 0)
     create_call_for_reduction (loop, reduction_list, &clsn_data);
 
@@ -2527,12 +2606,21 @@ try_get_loop_niter (loop_p loop, struct tree_niter_desc *niter)
   return true;
 }
 
+static tree
+get_omp_data_i_param (void)
+{
+  tree decl = DECL_ARGUMENTS (cfun->decl);
+  gcc_assert (DECL_CHAIN (decl) == NULL_TREE);
+  return ssa_default_def (cfun, decl);
+}
+
 /* Try to initialize REDUCTION_LIST for code generation part.
    REDUCTION_LIST describes the reductions.  */
 
 static bool
 try_create_reduction_list (loop_p loop,
-			   reduction_info_table_type *reduction_list)
+			   reduction_info_table_type *reduction_list,
+			   bool oacc_kernels_p)
 {
   edge exit = single_dom_exit (loop);
   gphi_iterator gsi;
@@ -2588,6 +2676,7 @@ try_create_reduction_list (loop_p loop,
 			 "  FAILED: it is not a part of reduction.\n");
 	      return false;
 	    }
+	  red->keep_res = phi;
 	  if (dump_file && (dump_flags & TDF_DETAILS))
 	    {
 	      fprintf (dump_file, "reduction phi is  ");
@@ -2622,15 +2711,402 @@ try_create_reduction_list (loop_p loop,
     }
 
 
+  if (oacc_kernels_p)
+    {
+      edge e = loop_preheader_edge (loop);
+
+      for (gsi = gsi_start_phis (loop->header); !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gphi *phi = gsi.phi ();
+	  tree def = PHI_RESULT (phi);
+	  affine_iv iv;
+
+	  if (!virtual_operand_p (def)
+	      && !simple_iv (loop, loop, def, &iv, true))
+	    {
+	      struct reduction_info *red;
+	      red = reduction_phi (reduction_list, phi);
+
+	      /* Look for pattern:
+
+		 <bb preheader>
+		   .omp_data_i = &.omp_data_arr;
+		   addr = .omp_data_i->sum;
+		   sum_a = *addr;
+
+		 <bb header>:
+		   sum_b = PHI <sum_a (preheader), sum_c (latch)>
+
+		 and assign addr to reduc->reduc_addr.  */
+
+	      tree arg = PHI_ARG_DEF_FROM_EDGE (phi, e);
+	      gimple *stmt = SSA_NAME_DEF_STMT (arg);
+	      if (!gimple_assign_single_p (stmt))
+		return false;
+	      tree memref = gimple_assign_rhs1 (stmt);
+	      if (TREE_CODE (memref) != MEM_REF)
+		return false;
+	      tree addr = TREE_OPERAND (memref, 0);
+
+	      gimple *stmt2 = SSA_NAME_DEF_STMT (addr);
+	      if (!gimple_assign_single_p (stmt2))
+		return false;
+	      tree compref = gimple_assign_rhs1 (stmt2);
+	      if (TREE_CODE (compref) != COMPONENT_REF)
+		return false;
+	      tree addr2 = TREE_OPERAND (compref, 0);
+	      if (TREE_CODE (addr2) != MEM_REF)
+		return false;
+	      addr2 = TREE_OPERAND (addr2, 0);
+	      if (TREE_CODE (addr2) != SSA_NAME
+		  || addr2 != get_omp_data_i_param ())
+		return false;
+	      red->reduc_addr = addr;
+	    }
+	}
+    }
+
   return true;
 }
 
+static bool
+ref_conflicts_with_region (gimple_stmt_iterator gsi, ao_ref *ref,
+			   bool ref_is_store, vec<basic_block> region_bbs,
+			   unsigned int i, gimple *skip_stmt)
+{
+  basic_block bb = region_bbs[i];
+  gsi_next (&gsi);
+
+  while (true)
+    {
+      for (; !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  if (stmt == skip_stmt)
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "skipping reduction store: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+	      continue;
+	    }
+
+	  if (!gimple_vdef (stmt)
+	      && !gimple_vuse (stmt))
+	    continue;
+
+	  if (gimple_code (stmt) == GIMPLE_RETURN)
+	    continue;
+
+	  if (ref_is_store)
+	    {
+	      if (ref_maybe_used_by_stmt_p (stmt, ref))
+		{
+		  if (dump_file)
+		    {
+		      fprintf (dump_file, "Stmt ");
+		      print_gimple_stmt (dump_file, stmt, 0, 0);
+		    }
+		  return true;
+		}
+	    }
+	  else
+	    {
+	      if (stmt_may_clobber_ref_p_1 (stmt, ref))
+		{
+		  if (dump_file)
+		    {
+		      fprintf (dump_file, "Stmt ");
+		      print_gimple_stmt (dump_file, stmt, 0, 0);
+		    }
+		  return true;
+		}
+	    }
+	}
+      i++;
+      if (i == region_bbs.length ())
+	break;
+      bb = region_bbs[i];
+      gsi = gsi_start_bb (bb);
+    }
+
+  return false;
+}
+
+static bool
+oacc_entry_exit_ok_1 (bitmap in_loop_bbs, vec<basic_block> region_bbs,
+		      tree omp_data_i,
+		      reduction_info_table_type *reduction_list,
+		      bitmap reduction_stores)
+{
+  unsigned i;
+  basic_block bb;
+  FOR_EACH_VEC_ELT (region_bbs, i, bb)
+    {
+      if (bitmap_bit_p (in_loop_bbs, bb->index))
+	continue;
+
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  gimple *skip_stmt = NULL;
+
+	  if (is_gimple_debug (stmt)
+	      || gimple_code (stmt) == GIMPLE_COND)
+	    continue;
+
+	  ao_ref ref;
+	  bool ref_is_store = false;
+	  if (gimple_assign_load_p (stmt))
+	    {
+	      tree rhs = gimple_assign_rhs1 (stmt);
+	      tree base = get_base_address (rhs);
+	      if (TREE_CODE (base) == MEM_REF
+		  && operand_equal_p (TREE_OPERAND (base, 0), omp_data_i, 0))
+		continue;
+
+	      tree lhs = gimple_assign_lhs (stmt);
+	      if (TREE_CODE (lhs) == SSA_NAME
+		  && has_single_use (lhs))
+		{
+		  use_operand_p use_p;
+		  gimple *use_stmt;
+		  single_imm_use (lhs, &use_p, &use_stmt);
+		  if (gimple_code (use_stmt) == GIMPLE_PHI)
+		    {
+		      struct reduction_info *red;
+		      red = reduction_phi (reduction_list, use_stmt);
+		      tree val = PHI_RESULT (red->keep_res);
+		      if (has_single_use (val))
+			{
+			  single_imm_use (val, &use_p, &use_stmt);
+			  if (gimple_store_p (use_stmt))
+			    {
+			      unsigned int id
+				= SSA_NAME_VERSION (gimple_vdef (use_stmt));
+			      bitmap_set_bit (reduction_stores, id);
+			      skip_stmt = use_stmt;
+			      if (dump_file)
+				{
+				  fprintf (dump_file, "found reduction load: ");
+				  print_gimple_stmt (dump_file, stmt, 0, 0);
+				}
+			    }
+			}
+		    }
+		}
+
+	      ao_ref_init (&ref, rhs);
+	    }
+	  else if (gimple_store_p (stmt))
+	    {
+	      ao_ref_init (&ref, gimple_assign_lhs (stmt));
+	      ref_is_store = true;
+	    }
+	  else if (gimple_code (stmt) == GIMPLE_OMP_RETURN)
+	    continue;
+	  else if (!gimple_has_side_effects (stmt)
+		   && !gimple_could_trap_p (stmt)
+		   && !stmt_could_throw_p (stmt)
+		   && !gimple_vdef (stmt)
+		   && !gimple_vuse (stmt))
+	    continue;
+	  else if (is_gimple_call (stmt)
+		   && gimple_call_internal_p (stmt)
+		   && gimple_call_internal_fn (stmt) == IFN_GOACC_DIM_POS)
+	    continue;
+	  else if (gimple_code (stmt) == GIMPLE_RETURN)
+	    continue;
+	  else
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "Unhandled stmt in entry/exit: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+	      return false;
+	    }
+
+	  if (ref_conflicts_with_region (gsi, &ref, ref_is_store, region_bbs,
+					 i, skip_stmt))
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "conflicts with entry/exit stmt: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+	      return false;
+	    }
+	}
+    }
+
+  return true;
+}
+
+/* Find stores inside REGION_BBS and outside IN_LOOP_BBS, and guard them with
+   gang_pos == 0, except when the stores are REDUCTION_STORES.  Return true
+   if any changes were made.  */
+
+static bool
+oacc_entry_exit_single_gang (bitmap in_loop_bbs, vec<basic_block> region_bbs,
+			     bitmap reduction_stores)
+{
+  tree gang_pos = NULL_TREE;
+  bool changed = false;
+
+  unsigned i;
+  basic_block bb;
+  FOR_EACH_VEC_ELT (region_bbs, i, bb)
+    {
+      if (bitmap_bit_p (in_loop_bbs, bb->index))
+	continue;
+
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);)
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+
+	  if (!gimple_store_p (stmt))
+	    {
+	      /* Update gsi to point to next stmt.  */
+	      gsi_next (&gsi);
+	      continue;
+	    }
+
+	  if (bitmap_bit_p (reduction_stores,
+			    SSA_NAME_VERSION (gimple_vdef (stmt))))
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file,
+			   "skipped reduction store for single-gang"
+			   " neutering: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+
+	      /* Update gsi to point to next stmt.  */
+	      gsi_next (&gsi);
+	      continue;
+	    }
+
+	  changed = true;
+
+	  if (gang_pos == NULL_TREE)
+	    {
+	      tree arg = build_int_cst (integer_type_node, GOMP_DIM_GANG);
+	      gcall *gang_single
+		= gimple_build_call_internal (IFN_GOACC_DIM_POS, 1, arg);
+	      gang_pos = make_ssa_name (integer_type_node);
+	      gimple_call_set_lhs (gang_single, gang_pos);
+	      gimple_stmt_iterator start
+		= gsi_start_bb (single_succ (ENTRY_BLOCK_PTR_FOR_FN (cfun)));
+	      tree vuse = ssa_default_def (cfun, gimple_vop (cfun));
+	      gimple_set_vuse (gang_single, vuse);
+	      gsi_insert_before (&start, gang_single, GSI_SAME_STMT);
+	    }
+
+	  if (dump_file)
+	    {
+	      fprintf (dump_file,
+		       "found store that needs single-gang neutering: ");
+	      print_gimple_stmt (dump_file, stmt, 0, 0);
+	    }
+
+	  {
+	    /* Split block before store.  */
+	    gimple_stmt_iterator gsi2 = gsi;
+	    gsi_prev (&gsi2);
+	    edge e;
+	    if (gsi_end_p (gsi2))
+	      {
+		e = split_block_after_labels (bb);
+		gsi2 = gsi_last_bb (bb);
+	      }
+	    else
+	      e = split_block (bb, gsi_stmt (gsi2));
+	    basic_block bb2 = e->dest;
+
+	    /* Split block after store.  */
+	    gimple_stmt_iterator gsi3 = gsi_start_bb (bb2);
+	    edge e2 = split_block (bb2, gsi_stmt (gsi3));
+	    basic_block bb3 = e2->dest;
+
+	    gimple *cond
+	      = gimple_build_cond (EQ_EXPR, gang_pos, integer_zero_node,
+				   NULL_TREE, NULL_TREE);
+	    gsi_insert_after (&gsi2, cond, GSI_NEW_STMT);
+
+	    edge e3 = make_edge (bb, bb3, EDGE_FALSE_VALUE);
+	    e->flags = EDGE_TRUE_VALUE;
+
+	    tree vdef = gimple_vdef (stmt);
+	    tree vuse = gimple_vuse (stmt);
+
+	    tree phi_res = copy_ssa_name (vdef);
+	    gphi *new_phi = create_phi_node (phi_res, bb3);
+	    replace_uses_by (vdef, phi_res);
+	    add_phi_arg (new_phi, vuse, e3, UNKNOWN_LOCATION);
+	    add_phi_arg (new_phi, vdef, e2, UNKNOWN_LOCATION);
+
+	    /* Update gsi to point to next stmt.  */
+	    bb = bb3;
+	    gsi = gsi_start_bb (bb);
+	  }
+	}
+    }
+
+  return changed;
+}
+
+static bool
+oacc_entry_exit_ok (struct loop *loop,
+		    reduction_info_table_type *reduction_list)
+{
+  basic_block *loop_bbs = get_loop_body_in_dom_order (loop);
+  tree omp_data_i = get_omp_data_i_param ();
+  gcc_assert (omp_data_i != NULL_TREE);
+  vec<basic_block> region_bbs
+    = get_all_dominated_blocks (CDI_DOMINATORS, ENTRY_BLOCK_PTR_FOR_FN (cfun));
+
+  bitmap in_loop_bbs = BITMAP_ALLOC (NULL);
+  bitmap_clear (in_loop_bbs);
+  for (unsigned int i = 0; i < loop->num_nodes; i++)
+    bitmap_set_bit (in_loop_bbs, loop_bbs[i]->index);
+
+  bitmap reduction_stores = BITMAP_ALLOC (NULL);
+  bool res = oacc_entry_exit_ok_1 (in_loop_bbs, region_bbs, omp_data_i,
+				   reduction_list, reduction_stores);
+
+  if (res)
+    {
+      bool changed = oacc_entry_exit_single_gang (in_loop_bbs, region_bbs,
+						  reduction_stores);
+      if (changed)
+	{
+	  free_dominance_info (CDI_DOMINATORS);
+	  calculate_dominance_info (CDI_DOMINATORS);
+	}
+    }
+
+  free (loop_bbs);
+
+  BITMAP_FREE (in_loop_bbs);
+  BITMAP_FREE (reduction_stores);
+
+  return res;
+}
+
 /* Detect parallel loops and generate parallel code using libgomp
    primitives.  Returns true if some loop was parallelized, false
    otherwise.  */
 
 static bool
-parallelize_loops (void)
+parallelize_loops (bool oacc_kernels_p)
 {
   unsigned n_threads = flag_tree_parallelize_loops;
   bool changed = false;
@@ -2642,19 +3118,29 @@ parallelize_loops (void)
   source_location loop_loc;
 
   /* Do not parallelize loops in the functions created by parallelization.  */
-  if (parallelized_function_p (cfun->decl))
+  if (!oacc_kernels_p
+      && parallelized_function_p (cfun->decl))
     return false;
+
+  /* Do not parallelize loops in offloaded functions.  */
+  if (!oacc_kernels_p
+      && get_oacc_fn_attrib (cfun->decl) != NULL)
+    return false;
+
   if (cfun->has_nonlocal_label)
     return false;
 
   gcc_obstack_init (&parloop_obstack);
   reduction_info_table_type reduction_list (10);
 
+  calculate_dominance_info (CDI_DOMINATORS);
+
   FOR_EACH_LOOP (loop, 0)
     {
       if (loop == skip_loop)
 	{
-	  if (dump_file && (dump_flags & TDF_DETAILS))
+	  if (!loop->in_oacc_kernels_region
+	      && dump_file && (dump_flags & TDF_DETAILS))
 	    fprintf (dump_file,
 		     "Skipping loop %d as inner loop of parallelized loop\n",
 		     loop->num);
@@ -2666,6 +3152,22 @@ parallelize_loops (void)
 	skip_loop = NULL;
 
       reduction_list.empty ();
+
+      if (oacc_kernels_p)
+	{
+	  if (!loop->in_oacc_kernels_region)
+	    continue;
+
+	  /* Don't try to parallelize inner loops in an oacc kernels region.  */
+	  if (loop->inner)
+	    skip_loop = loop->inner;
+
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file,
+		     "Trying loop %d with header bb %d in oacc kernels"
+		     " region\n", loop->num, loop->header->index);
+	}
+
       if (dump_file && (dump_flags & TDF_DETAILS))
       {
         fprintf (dump_file, "Trying loop %d as candidate\n",loop->num);
@@ -2707,6 +3209,7 @@ parallelize_loops (void)
       /* FIXME: Bypass this check as graphite doesn't update the
 	 count and frequency correctly now.  */
       if (!flag_loop_parallelize_all
+	  && !oacc_kernels_p
 	  && ((estimated != -1
 	       && estimated <= (HOST_WIDE_INT) n_threads * MIN_PER_THREAD)
 	      /* Do not bother with loops in cold areas.  */
@@ -2716,14 +3219,23 @@ parallelize_loops (void)
       if (!try_get_loop_niter (loop, &niter_desc))
 	continue;
 
-      if (!try_create_reduction_list (loop, &reduction_list))
+      if (!try_create_reduction_list (loop, &reduction_list, oacc_kernels_p))
 	continue;
 
       if (!flag_loop_parallelize_all
 	  && !loop_parallel_p (loop, &parloop_obstack))
 	continue;
 
+      if (oacc_kernels_p
+	  && !oacc_entry_exit_ok (loop, &reduction_list))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "entry/exit not ok: FAILED\n");
+	  continue;
+	}
+
       changed = true;
+      /* Skip inner loop(s) of parallelized loop.  */
       skip_loop = loop->inner;
       if (dump_file && (dump_flags & TDF_DETAILS))
       {
@@ -2736,8 +3248,9 @@ parallelize_loops (void)
 	  fprintf (dump_file, "\nloop at %s:%d: ",
 		   LOCATION_FILE (loop_loc), LOCATION_LINE (loop_loc));
       }
+
       gen_parallel_loop (loop, &reduction_list,
-			 n_threads, &niter_desc);
+			 n_threads, &niter_desc, oacc_kernels_p);
     }
 
   obstack_free (&parloop_obstack, NULL);
@@ -2787,7 +3300,7 @@ pass_parallelize_loops::execute (function *fun)
   if (number_of_loops (fun) <= 1)
     return 0;
 
-  if (parallelize_loops ())
+  if (parallelize_loops (false))
     {
       fun->curr_properties &= ~(PROP_gimple_eomp);
 
@@ -2806,3 +3319,59 @@ make_pass_parallelize_loops (gcc::context *ctxt)
 {
   return new pass_parallelize_loops (ctxt);
 }
+
+namespace {
+
+const pass_data pass_data_parallelize_loops_oacc_kernels =
+{
+  GIMPLE_PASS, /* type */
+  "parloops_oacc_kernels", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_TREE_PARALLELIZE_LOOPS, /* tv_id */
+  ( PROP_cfg | PROP_ssa ), /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_parallelize_loops_oacc_kernels : public gimple_opt_pass
+{
+public:
+  pass_parallelize_loops_oacc_kernels (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_parallelize_loops_oacc_kernels, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *) { return flag_tree_parallelize_loops > 1; }
+  virtual unsigned int execute (function *);
+
+}; // class pass_parallelize_loops_oacc_kernels
+
+unsigned
+pass_parallelize_loops_oacc_kernels::execute (function *fun)
+{
+  loop_optimizer_init (LOOPS_NORMAL
+		       | LOOPS_HAVE_RECORDED_EXITS);
+  rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
+
+  if (number_of_loops (fun) <= 1)
+    return 0;
+
+  if (parallelize_loops (true))
+    {
+      fun->curr_properties &= ~(PROP_gimple_eomp);
+
+      return TODO_update_ssa;
+    }
+
+  return 0;
+}
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_parallelize_loops_oacc_kernels (gcc::context *ctxt)
+{
+  return new pass_parallelize_loops_oacc_kernels (ctxt);
+}
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 395a93a..f5803d0 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -384,6 +384,8 @@ extern gimple_opt_pass *make_pass_slp_vectorize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_complete_unroll (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_complete_unrolli (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_parallelize_loops (gcc::context *ctxt);
+extern gimple_opt_pass *
+  make_pass_parallelize_loops_oacc_kernels (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_loop_prefetch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_iv_optimize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_loop_done (gcc::context *ctxt);

* Re: [PATCH, 7/16] Add pass_dominator_oacc_kernels
  2015-11-11 11:05   ` Richard Biener
@ 2015-11-16 12:04     ` Tom de Vries
  0 siblings, 0 replies; 133+ messages in thread
From: Tom de Vries @ 2015-11-16 12:04 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

On 11/11/15 12:05, Richard Biener wrote:
> On Mon, 9 Nov 2015, Tom de Vries wrote:
>
>> On 09/11/15 16:35, Tom de Vries wrote:
>>> Hi,
>>>
>>> this patch series for stage1 trunk adds support to:
>>> - parallelize oacc kernels regions using parloops, and
>>> - map the loops onto the oacc gang dimension.
>>>
>>> The patch series contains these patches:
>>>
>>>        1    Insert new exit block only when needed in
>>>           transform_to_exit_first_loop_alt
>>>        2    Make create_parallel_loop return void
>>>        3    Ignore reduction clause on kernels directive
>>>        4    Implement -foffload-alias
>>>        5    Add in_oacc_kernels_region in struct loop
>>>        6    Add pass_oacc_kernels
>>>        7    Add pass_dominator_oacc_kernels
>>>        8    Add pass_ch_oacc_kernels
>>>        9    Add pass_parallelize_loops_oacc_kernels
>>>       10    Add pass_oacc_kernels pass group in passes.def
>>>       11    Update testcases after adding kernels pass group
>>>       12    Handle acc loop directive
>>>       13    Add c-c++-common/goacc/kernels-*.c
>>>       14    Add gfortran.dg/goacc/kernels-*.f95
>>>       15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>>       16    Add libgomp.oacc-fortran/kernels-*.f95
>>>
>>> The first 9 patches are more or less independent, but patches 10-16 are
>>> intended to be committed at the same time.
>>>
>>> Bootstrapped and reg-tested on x86_64.
>>>
>>> Build and reg-tested with nvidia accelerator, in combination with a
>>> patch that enables accelerator testing (which is submitted at
>>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>>
>>> I'll post the individual patches in reply to this message.
>>
>> this patch adds pass_dominator_oacc_kernels, to be used in the kernels pass
>> group. (We may as well call it pass_dominator_no_peel_loop_headers, since it
>> doesn't do anything oacc-kernels-specific.)
>>
>> The reason I'm adding a new pass instead of using pass_dominator is that
>> pass_dominator uses first_pass_instance. So adding a pass_dominator instance A
>> before a pass_dominator instance B has the unexpected consequence that it may
>> change the behaviour of instance B. I've filed PR68247 - "Remove
>> pass_first_instance" to note this issue.
>
> This looks ok (minus my comments to patch #10)
>

AFAIU, if "Remove first_pass_instance from pass_dominator" gets approved
and committed, we can drop this patch, and use this pass instantiation 
instead in the oacc_kernels pass group:
...
   NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
...

Thanks,
- Tom

* Re: [PATCH, 5/16] Add in_oacc_kernels_region in struct loop
  2015-11-16 11:39     ` Tom de Vries
@ 2015-11-16 12:41       ` Richard Biener
  0 siblings, 0 replies; 133+ messages in thread
From: Richard Biener @ 2015-11-16 12:41 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Mon, 16 Nov 2015, Tom de Vries wrote:

> On 11/11/15 11:55, Richard Biener wrote:
> > On Mon, 9 Nov 2015, Tom de Vries wrote:
> > 
> > > this patch adds and initializes the in_oacc_kernels_region field in
> > > struct loop.
> > > 
> > > The field is used to signal to subsequent passes that we're dealing with a
> > > loop in a kernels region that we're trying to parallelize.
> > > 
> > > Note that we do not parallelize kernels regions with more than one loop
> > > nest.
> > > [ In general, kernels regions with more than one loop nest should be split
> > > up into separate kernels regions, but that's not supported atm. ]
> > 
> > I think mark_loops_in_oacc_kernels_region can be greatly simplified.
> > 
> > Both region entry and exit should have the same ->loop_father (a SESE
> > region).  Then you can just walk that loop's inner (and their sibling)
> > loops checking their header domination relation with the region entry
> > and exit (only necessary for direct inner loops).
> 
> Updated patch to use the loops structure.  Atm I'm also skipping loops
> containing sibling loops, since I have no test-cases for that yet.

Looks ok to me now.  You want to update copy_loop_info btw.
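
For reference, the walk over inner and sibling loops discussed above could
look roughly like the following.  This is only a sketch under assumptions:
struct loop_sketch is a simplified stand-in for struct loop (only the inner
and next links are modelled), and header_in_region is a hypothetical
precomputed flag standing in for the actual header dominance checks against
the region entry and exit.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct loop; not the real GCC type.  */
struct loop_sketch
{
  struct loop_sketch *inner;	/* First subloop, or NULL.  */
  struct loop_sketch *next;	/* Next sibling loop, or NULL.  */
  bool header_in_region;	/* Hypothetical: result of the dominance
				   checks for the loop header.  */
  bool in_oacc_kernels_region;
};

/* Mark LOOP and every loop nested in it.  No check is needed here: a loop
   nested in a loop inside the region is itself inside the region.  */
static void
mark_loop_and_subloops (struct loop_sketch *loop)
{
  loop->in_oacc_kernels_region = true;
  for (struct loop_sketch *sub = loop->inner; sub != NULL; sub = sub->next)
    mark_loop_and_subloops (sub);
}

/* FATHER is the common ->loop_father of region entry and exit.  Only its
   direct inner loops need the region check.  */
void
mark_loops_in_region (struct loop_sketch *father)
{
  for (struct loop_sketch *l = father->inner; l != NULL; l = l->next)
    if (l->header_in_region)
      mark_loop_and_subloops (l);
}
```

The sketch does not model the extra restriction of skipping loops that
contain sibling loops.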

Richard.

> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-16 11:55     ` Tom de Vries
@ 2015-11-16 12:45       ` Richard Biener
  2015-11-16 23:21         ` Tom de Vries
  2015-11-19 10:31         ` Tom de Vries
  0 siblings, 2 replies; 133+ messages in thread
From: Richard Biener @ 2015-11-16 12:45 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Mon, 16 Nov 2015, Tom de Vries wrote:

> On 11/11/15 12:02, Richard Biener wrote:
> > On Mon, 9 Nov 2015, Tom de Vries wrote:
> > 
> > > This patch adds the pass_oacc_kernels pass group to the pass list in
> > > passes.def.
> > > 
> > > Note the repetition of pass_lim/pass_copy_prop. The first pair is for an
> > > inner loop in a loop nest, the second for an outer loop in a loop nest.
> > 
> > @@ -86,6 +86,27 @@ along with GCC; see the file COPYING3.  If not see
> >            /* pass_build_ealias is a dummy pass that ensures that we
> >               execute TODO_rebuild_alias at this point.  */
> >            NEXT_PASS (pass_build_ealias);
> > +         /* Pass group that runs when there are oacc kernels in the
> > +            function.  */
> > +         NEXT_PASS (pass_oacc_kernels);
> > +         PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
> > +             NEXT_PASS (pass_dominator_oacc_kernels);
> > +             NEXT_PASS (pass_ch_oacc_kernels);
> > +             NEXT_PASS (pass_dominator_oacc_kernels);
> > +             NEXT_PASS (pass_tree_loop_init);
> > +             NEXT_PASS (pass_lim);
> > +             NEXT_PASS (pass_copy_prop);
> > +             NEXT_PASS (pass_lim);
> > +             NEXT_PASS (pass_copy_prop);
> > 
> > iterate lim/copyprop twice?!  Why's that needed?
> > 
> 
> I've managed to eliminate the last pass_copy_prop, but not pass_lim. I've
> added a comment:
> ...
>   /* We use pass_lim to rewrite in-memory iteration and reduction
>      variable accesses in loops into local variable accesses.
>      However, a single pass instantiation manages to do this only for
>      one loop level, so we use pass_lim twice to at least be able to
>      handle a loop nest with a depth of two.  */
>   NEXT_PASS (pass_lim);
>   NEXT_PASS (pass_copy_prop);
>   NEXT_PASS (pass_lim);
> ...

Huh.  Testcase?  LIM is perfectly able to handle nests.

> > +             NEXT_PASS (pass_scev_cprop);
> > 
> > What's that for?  It's supposed to help removing loops - I don't
> > expect kernels to vanish.
> 
> I'm using pass_scev_cprop for the "final value replacement" functionality.
> Added comment.

That functionality is intended to enable loop removal.

> > 
> > +             NEXT_PASS (pass_tree_loop_done);
> > +             NEXT_PASS (pass_dominator_oacc_kernels);
> > 
> > Three times DOM?  No please.  I wonder why you don't run oacc_kernels
> > after FRE and drop the initial DOM(s).
> > 
> 
> Done. There's just one pass_dominator_oacc_kernels left now.
> 
> > +             NEXT_PASS (pass_dce);
> > +             NEXT_PASS (pass_tree_loop_init);
> > +             NEXT_PASS (pass_parallelize_loops_oacc_kernels);
> > +             NEXT_PASS (pass_expand_omp_ssa);
> > +             NEXT_PASS (pass_tree_loop_done);
> > 
> > The switches into/outof tree_loop also look odd to me, but well
> > (they'll be controlled by -ftree-loop-optimize).
> > 
> 
> I've eliminated all the uses for pass_tree_loop_init/pass_tree_loop_done in
> the pass group. Instead, I've added conditional loop optimizer setup in:
> - pass_lim and pass_scev_cprop (added in this patch), and
> - pass_parallelize_loops_oacc_kernels (added in patch "Add
>   pass_parallelize_loops_oacc_kernels").

You miss calling scev_finalize ().

Much better otherwise.  I still wonder about scev_cprop and LIM two
times.

Richard.

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-16 12:45       ` Richard Biener
@ 2015-11-16 23:21         ` Tom de Vries
  2015-11-17 10:05           ` Richard Biener
  2015-11-19 10:31         ` Tom de Vries
  1 sibling, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-16 23:21 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

On 16/11/15 13:45, Richard Biener wrote:
>>> +             NEXT_PASS (pass_scev_cprop);
>>>
>>> What's that for?  It's supposed to help removing loops - I don't
>>> expect kernels to vanish.
>>
>> I'm using pass_scev_cprop for the "final value replacement" functionality.
>> Added comment.

> That functionality is intended to enable loop removal.

Let me try to explain in a bit more detail.


I.

Consider a parloops testcase test.c, with a use of the final value of 
the iteration variable (return i):
...
unsigned int
foo (int n, int *a)
{
   int i;
   for (i = 0; i < n; ++i)
     a[i] = 1;

   return i;
}
...

Say we compile with:
...
$ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details
...

We can see here in the parloops dump-file that the loop was parallelized:
...
   SUCCESS: may be parallelized
...

Now say that we run with -fno-tree-scev-cprop in addition. Instead we 
find in the parloops dump-file:
...
phi is i_1 = PHI <i_10(4)>
arg of phi to exit:   value i_10 used outside loop
   checking if it a part of reduction pattern:
   FAILED: it is not a part of reduction.
...

Auto-parallelization fails in this case because there is a loop exit phi 
(the one in bb 6 defining i_1) which is not part of a reduction:
...
   <bb 4>:
   # i_13 = PHI <0(3), i_10(5)>
   _5 = (long unsigned int) i_13;
   _6 = _5 * 4;
   _8 = a_7(D) + _6;
   *_8 = 1;
   i_10 = i_13 + 1;
   if (n_4(D) > i_10)
     goto <bb 5>;
   else
     goto <bb 6>;

   <bb 5>:
   goto <bb 4>;

   <bb 6>:
   # i_1 = PHI <i_10(4)>
   _20 = (unsigned int) i_1;
...

With -ftree-scev-cprop, we find in the pass_scev_cprop dump-file:
...
final value replacement:
   i_1 = PHI <i_10(4)>
   with
   i_1 = n_4(D);
...

And the resulting loop no longer has any loop exit phis, so 
auto-parallelization succeeds:
...
   <bb 4>:
   # i_13 = PHI <0(3), i_10(5)>
   _5 = (long unsigned int) i_13;
   _6 = _5 * 4;
   _8 = a_7(D) + _6;
   *_8 = 1;
   i_10 = i_13 + 1;
   if (n_4(D) > i_10)
     goto <bb 5>;
   else
     goto <bb 6>;

   <bb 5>:
   goto <bb 4>;

   <bb 6>:
   _20 = (unsigned int) n_4(D);
...

[ I've filed PR68373 - "autopar fails on loop exit phi with argument 
defined outside loop", for a slightly different testcase where despite 
the final value replacement autopar still fails. ]
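
To make the effect of final value replacement concrete at the source level,
here is a hand-written sketch (not compiler output).  foo is the testcase
from above; foo_fvr is a hypothetical equivalent in which the value used
after the loop no longer depends on anything computed inside the loop.  Note
that the dump above shows the replacement 'i_1 = n_4(D)' on the path where
the loop is actually entered; the sketch folds the n <= 0 case into the same
expression.

```c
/* The testcase from above: the value of i computed by the loop is used
   after the loop.  */
unsigned int
foo (int n, int *a)
{
  int i;
  for (i = 0; i < n; ++i)
    a[i] = 1;

  return i;
}

/* Hand-written equivalent after final value replacement: the return value
   is computed from n alone, so nothing defined in the loop is live after
   it.  */
unsigned int
foo_fvr (int n, int *a)
{
  int i;
  for (i = 0; i < n; ++i)
    a[i] = 1;

  /* The loop runs n times if n > 0, and not at all otherwise.  */
  return n > 0 ? n : 0;
}
```

In the second form the iterations are independent and there is no value
flowing out of the loop, which is the shape parloops handles.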


II.

Now, back to oacc kernels.

Consider test-case kernels-loop-n.f95 (will add this one to the test-cases):
...
module test
contains
   subroutine foo(n)
     implicit none
     integer :: n
     integer, dimension (0:n-1) :: a, b, c
     integer                    :: i, ii
     do i = 0, n - 1
        a(i) = i * 2
     end do

     do i = 0, n -1
        b(i) = i * 4
     end do

     !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
     do ii = 0, n - 1
        c(ii) = a(ii) + b(ii)
     end do
     !$acc end kernels

     do i = 0, n - 1
        if (c(i) .ne. a(i) + b(i)) call abort
     end do

   end subroutine foo
end module test
...

The loop at the start of the kernels pass group contains an in-memory 
iteration variable, with a store to '*_9 = _38'.
...
   <bb 4>:
   _13 = *.omp_data_i_4(D).c;
   c.21_14 = *_13;
   _16 = *_9;
   _17 = (integer(kind=8)) _16;
   _18 = *.omp_data_i_4(D).a;
   a.22_19 = *_18;
   _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
   _24 = *.omp_data_i_4(D).b;
   b.23_25 = *_24;
   _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
   _30 = _23 + _29;
   MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
   _38 = _16 + 1;
   *_9 = _38;
   if (_8 == _16)
     goto <bb 3>;
   else
     goto <bb 4>;
...

After pass_lim/pass_copy_prop, we've rewritten that into using a local 
iteration variable, but we've generated a read of the final value of the 
iteration variable outside the loop, which means auto-parallelization 
will fail:
...
   <bb 5>:
   # D__lsm.29_12 = PHI <D__lsm.29_15(4), _38(7)>
   _17 = (integer(kind=8)) D__lsm.29_12;
   _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
   _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
   _30 = _23 + _29;
   MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
   _38 = D__lsm.29_12 + 1;
   if (_8 == D__lsm.29_12)
     goto <bb 6>;
   else
     goto <bb 7>;

   <bb 6>:
   # D__lsm.29_27 = PHI <_38(5)>
   *_9 = D__lsm.29_27;
   goto <bb 3>;

   <bb 7>:
   goto <bb 5>;
...

This makes it similar to the parloops example above, and that's why I've 
added pass_scev_cprop in the kernels pass group.

[ And for some kernels test-cases with constant loop bound, it's not the 
final value replacement bit that does the substitution, but the first 
bit in scev_const_prop using resolve_mixers. So that's a related reason 
to use pass_scev_cprop. ]

Thanks,
- Tom

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-16 23:21         ` Tom de Vries
@ 2015-11-17 10:05           ` Richard Biener
  2015-11-17 14:54             ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-17 10:05 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

On Tue, Nov 17, 2015 at 12:20 AM, Tom de Vries <Tom_deVries@mentor.com> wrote:
> On 16/11/15 13:45, Richard Biener wrote:
>>>> +             NEXT_PASS (pass_scev_cprop);
>>>>
>>>> What's that for?  It's supposed to help removing loops - I don't
>>>> expect kernels to vanish.
>>>
>>> I'm using pass_scev_cprop for the "final value replacement"
>>> functionality.
>>> Added comment.
>
>
>> That functionality is intended to enable loop removal.
>
>
> Let me try to explain in a bit more detail.
>
>
> I.
>
> Consider a parloops testcase test.c, with a use of the final value of the
> iteration variable (return i):
> ...
> unsigned int
> foo (int n, int *a)
> {
>   int i;
>   for (i = 0; i < n; ++i)
>     a[i] = 1;
>
>   return i;
> }
> ...
>
> Say we compile with:
> ...
> $ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details
> ...
>
> We can see here in the parloops dump-file that the loop was parallelized:
> ...
>   SUCCESS: may be parallelized
> ...
>
> Now say that we run with -fno-tree-scev-cprop in addition. Instead we find
> in the parloops dump-file:
> ...
> phi is i_1 = PHI <i_10(4)>
> arg of phi to exit:   value i_10 used outside loop
>   checking if it a part of reduction pattern:
>   FAILED: it is not a part of reduction.
> ...
>
> Auto-parallelization fails in this case because there is a loop exit phi
> (the one in bb 6 defining i_1) which is not part of a reduction:
> ...
>   <bb 4>:
>   # i_13 = PHI <0(3), i_10(5)>
>   _5 = (long unsigned int) i_13;
>   _6 = _5 * 4;
>   _8 = a_7(D) + _6;
>   *_8 = 1;
>   i_10 = i_13 + 1;
>   if (n_4(D) > i_10)
>     goto <bb 5>;
>   else
>     goto <bb 6>;
>
>   <bb 5>:
>   goto <bb 4>;
>
>   <bb 6>:
>   # i_1 = PHI <i_10(4)>
>   _20 = (unsigned int) i_1;
> ...
>
> With -ftree-scev-cprop, we find in the pass_scev_cprop dump-file:
> ...
> final value replacement:
>   i_1 = PHI <i_10(4)>
>   with
>   i_1 = n_4(D);
> ...
>
> And the resulting loop no longer has any loop exit phis, so
> auto-parallelization succeeds:
> ...
>   <bb 4>:
>   # i_13 = PHI <0(3), i_10(5)>
>   _5 = (long unsigned int) i_13;
>   _6 = _5 * 4;
>   _8 = a_7(D) + _6;
>   *_8 = 1;
>   i_10 = i_13 + 1;
>   if (n_4(D) > i_10)
>     goto <bb 5>;
>   else
>     goto <bb 6>;
>
>   <bb 5>:
>   goto <bb 4>;
>
>   <bb 6>:
>   _20 = (unsigned int) n_4(D);
> ...
>
> [ I've filed PR68373 - "autopar fails on loop exit phi with argument defined
> outside loop", for a slightly different testcase where despite the final
> value replacement autopar still fails. ]
>
>
> II.
>
> Now, back to oacc kernels.
>
> Consider test-case kernels-loop-n.f95 (will add this one to the test-cases):
> ...
> module test
> contains
>   subroutine foo(n)
>     implicit none
>     integer :: n
>     integer, dimension (0:n-1) :: a, b, c
>     integer                    :: i, ii
>     do i = 0, n - 1
>        a(i) = i * 2
>     end do
>
>     do i = 0, n -1
>        b(i) = i * 4
>     end do
>
>     !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
>     do ii = 0, n - 1
>        c(ii) = a(ii) + b(ii)
>     end do
>     !$acc end kernels
>
>     do i = 0, n - 1
>        if (c(i) .ne. a(i) + b(i)) call abort
>     end do
>
>   end subroutine foo
> end module test
> ...
>
> The loop at the start of the kernels pass group contains an in-memory
> iteration variable, with a store to '*_9 = _38'.
> ...
>   <bb 4>:
>   _13 = *.omp_data_i_4(D).c;
>   c.21_14 = *_13;
>   _16 = *_9;
>   _17 = (integer(kind=8)) _16;
>   _18 = *.omp_data_i_4(D).a;
>   a.22_19 = *_18;
>   _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
>   _24 = *.omp_data_i_4(D).b;
>   b.23_25 = *_24;
>   _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
>   _30 = _23 + _29;
>   MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
>   _38 = _16 + 1;
>   *_9 = _38;
>   if (_8 == _16)
>     goto <bb 3>;
>   else
>     goto <bb 4>;
> ...
>
> After pass_lim/pass_copy_prop, we've rewritten that into using a local
> iteration variable, but we've generated a read of the final value of the
> iteration variable outside the loop, which means auto-parallelization will
> fail:
> ...
>   <bb 5>:
>   # D__lsm.29_12 = PHI <D__lsm.29_15(4), _38(7)>
>   _17 = (integer(kind=8)) D__lsm.29_12;
>   _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
>   _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
>   _30 = _23 + _29;
>   MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
>   _38 = D__lsm.29_12 + 1;
>   if (_8 == D__lsm.29_12)
>     goto <bb 6>;
>   else
>     goto <bb 7>;
>
>   <bb 6>:
>   # D__lsm.29_27 = PHI <_38(5)>
>   *_9 = D__lsm.29_27;
>   goto <bb 3>;

So this store is not actually necessary?  Or just in an inconvenient place?

>
>   <bb 7>:
>   goto <bb 5>;
> ...
>
> This makes it similar to the parloops example above, and that's why I've
> added pass_scev_cprop in the kernels pass group.
>
> [ And for some kernels test-cases with constant loop bound, it's not the
> final value replacement bit that does the substitution, but the first bit in
> scev_const_prop using resolve_mixers. So that's a related reason to use
> pass_scev_cprop. ]

IMHO autopar needs to handle induction itself.  And the above LIM example
does not show why you need two LIM passes...

Richard.

> Thanks,
> - Tom

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-17 10:05           ` Richard Biener
@ 2015-11-17 14:54             ` Tom de Vries
  2015-11-17 15:18               ` Richard Biener
  2015-11-19  0:35               ` [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def Tom de Vries
  0 siblings, 2 replies; 133+ messages in thread
From: Tom de Vries @ 2015-11-17 14:54 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

On 17/11/15 11:05, Richard Biener wrote:
> On Tue, Nov 17, 2015 at 12:20 AM, Tom de Vries <Tom_deVries@mentor.com> wrote:
>> On 16/11/15 13:45, Richard Biener wrote:
>>>>> +             NEXT_PASS (pass_scev_cprop);
>>>>>
>>>>> What's that for?  It's supposed to help removing loops - I don't
>>>>> expect kernels to vanish.
>>>>
>>>> I'm using pass_scev_cprop for the "final value replacement"
>>>> functionality.
>>>> Added comment.
>>
>>
>>> That functionality is intended to enable loop removal.
>>
>>
>> Let me try to explain in a bit more detail.
>>
>>
>> I.
>>
>> Consider a parloops testcase test.c, with a use of the final value of the
>> iteration variable (return i):
>> ...
>> unsigned int
>> foo (int n, int *a)
>> {
>>    int i;
>>    for (i = 0; i < n; ++i)
>>      a[i] = 1;
>>
>>    return i;
>> }
>> ...
>>
>> Say we compile with:
>> ...
>> $ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details
>> ...
>>
>> We can see here in the parloops dump-file that the loop was parallelized:
>> ...
>>    SUCCESS: may be parallelized
>> ...
>>
>> Now say that we run with -fno-tree-scev-cprop in addition. Instead we find
>> in the parloops dump-file:
>> ...
>> phi is i_1 = PHI <i_10(4)>
>> arg of phi to exit:   value i_10 used outside loop
>>    checking if it a part of reduction pattern:
>>    FAILED: it is not a part of reduction.
>> ...
>>
>> Auto-parallelization fails in this case because there is a loop exit phi
>> (the one in bb 6 defining i_1) which is not part of a reduction:
>> ...
>>    <bb 4>:
>>    # i_13 = PHI <0(3), i_10(5)>
>>    _5 = (long unsigned int) i_13;
>>    _6 = _5 * 4;
>>    _8 = a_7(D) + _6;
>>    *_8 = 1;
>>    i_10 = i_13 + 1;
>>    if (n_4(D) > i_10)
>>      goto <bb 5>;
>>    else
>>      goto <bb 6>;
>>
>>    <bb 5>:
>>    goto <bb 4>;
>>
>>    <bb 6>:
>>    # i_1 = PHI <i_10(4)>
>>    _20 = (unsigned int) i_1;
>> ...
>>
>> With -ftree-scev-cprop, we find in the pass_scev_cprop dump-file:
>> ...
>> final value replacement:
>>    i_1 = PHI <i_10(4)>
>>    with
>>    i_1 = n_4(D);
>> ...
>>
>> And the resulting loop no longer has any loop exit phis, so
>> auto-parallelization succeeds:
>> ...
>>    <bb 4>:
>>    # i_13 = PHI <0(3), i_10(5)>
>>    _5 = (long unsigned int) i_13;
>>    _6 = _5 * 4;
>>    _8 = a_7(D) + _6;
>>    *_8 = 1;
>>    i_10 = i_13 + 1;
>>    if (n_4(D) > i_10)
>>      goto <bb 5>;
>>    else
>>      goto <bb 6>;
>>
>>    <bb 5>:
>>    goto <bb 4>;
>>
>>    <bb 6>:
>>    _20 = (unsigned int) n_4(D);
>> ...
>>
>> [ I've filed PR68373 - "autopar fails on loop exit phi with argument defined
>> outside loop", for a slightly different testcase where despite the final
>> value replacement autopar still fails. ]
>>
>>
>> II.
>>
>> Now, back to oacc kernels.
>>
>> Consider test-case kernels-loop-n.f95 (will add this one to the test-cases):
>> ...
>> module test
>> contains
>>    subroutine foo(n)
>>      implicit none
>>      integer :: n
>>      integer, dimension (0:n-1) :: a, b, c
>>      integer                    :: i, ii
>>      do i = 0, n - 1
>>         a(i) = i * 2
>>      end do
>>
>>      do i = 0, n -1
>>         b(i) = i * 4
>>      end do
>>
>>      !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
>>      do ii = 0, n - 1
>>         c(ii) = a(ii) + b(ii)
>>      end do
>>      !$acc end kernels
>>
>>      do i = 0, n - 1
>>         if (c(i) .ne. a(i) + b(i)) call abort
>>      end do
>>
>>    end subroutine foo
>> end module test
>> ...
>>
>> The loop at the start of the kernels pass group contains an in-memory
>> iteration variable, with a store to '*_9 = _38'.
>> ...
>>    <bb 4>:
>>    _13 = *.omp_data_i_4(D).c;
>>    c.21_14 = *_13;
>>    _16 = *_9;
>>    _17 = (integer(kind=8)) _16;
>>    _18 = *.omp_data_i_4(D).a;
>>    a.22_19 = *_18;
>>    _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
>>    _24 = *.omp_data_i_4(D).b;
>>    b.23_25 = *_24;
>>    _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
>>    _30 = _23 + _29;
>>    MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
>>    _38 = _16 + 1;
>>    *_9 = _38;
>>    if (_8 == _16)
>>      goto <bb 3>;
>>    else
>>      goto <bb 4>;
>> ...
>>
>> After pass_lim/pass_copy_prop, we've rewritten that into using a local
>> iteration variable, but we've generated a read of the final value of the
>> iteration variable outside the loop, which means auto-parallelization will
>> fail:
>> ...
>>    <bb 5>:
>>    # D__lsm.29_12 = PHI <D__lsm.29_15(4), _38(7)>
>>    _17 = (integer(kind=8)) D__lsm.29_12;
>>    _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
>>    _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
>>    _30 = _23 + _29;
>>    MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
>>    _38 = D__lsm.29_12 + 1;
>>    if (_8 == D__lsm.29_12)
>>      goto <bb 6>;
>>    else
>>      goto <bb 7>;
>>
>>    <bb 6>:
>>    # D__lsm.29_27 = PHI <_38(5)>
>>    *_9 = D__lsm.29_27;
>>    goto <bb 3>;
>
> So this store is not actually necessary?

a.
In the case of this example, the store is dead.

There is a corresponding load at the point that we split off the region:
...
   <bb 9>:
   #pragma omp return

   <bb 10>:
   D.3635 = .omp_data_arr.25.ii;
   ii = *D.3635;
...

This load is later removed, given that ii is unused after the region. 
But once the region is split off, there's nothing in the context of the
store to suggest that it's dead.

And to get rid of the load of ii before the region is split off, we 
would have to implement some sort of liveness analysis on pre-ssa code.

b.
There's the case where there is an explicit use of ii after the region, 
in which case the store is not dead.

c.
And there's the case where we use a data clause on the region, e.g. 
'create (ii)', to indicate that the variable is neither copied in nor 
copied out of the region (the default for a scalar in a kernels region 
is 'copy', meaning copy-in-and-out).

[ This means the value of ii after the region is uninitialized. So even 
if there's a read from ii after the region, we cannot consider it 
connected to the store, given that the value written by the store on the 
accelerator will not be copied back to the host. ]

In this case, we already don't have any load of ii after the region:
...
   <bb 9>:
   #pragma omp return

   <bb 10>:
   .omp_data_sizes.28 = {CLOBBER};
   .omp_data_arr.27 = {CLOBBER};
...

We could insert clobbers for the bits of .omp_data_arr at the end of the 
region to indicate that those are not used. That might enable dse to get 
rid of the dead store.


But, I think we want a generic solution that handles cases a, b and c, 
which means we have to solve the most difficult case, which is b, where 
the store is not dead.

>  Or just in an inconvenient place?

I don't think the place of the store is inconvenient; it would be worse 
to have the store in the loop.

What is inconvenient about the store is the fact that it reads the final 
value of the iteration variable (which inhibits parloops).

>>    <bb 7>:
>>    goto <bb 5>;
>> ...
>>
>> This makes it similar to the parloops example above, and that's why I've
>> added pass_scev_cprop in the kernels pass group.
>>
>> [ And for some kernels test-cases with constant loop bound, it's not the
>> final value replacement bit that does the substitution, but the first bit in
>> scev_const_prop using resolve_mixers. So that's a related reason to use
>> pass_scev_cprop. ]
>
> IMHO autopar needs to handle induction itself.

I'm not sure what you mean. Could you elaborate?  Autopar handles 
induction variables, but it doesn't handle exit phis reading the final 
value of the induction variable. Is that what you want fixed? How?

> And the above LIM example
> is none for why you need two LIM passes...

Indeed. I'm planning a separate reply to explain in more detail the need 
for the two pass_lims.

Thanks,
- Tom


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-17 14:54             ` Tom de Vries
@ 2015-11-17 15:18               ` Richard Biener
  2015-11-17 15:39                 ` Tom de Vries
  2015-11-19  0:35               ` [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def Tom de Vries
  1 sibling, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-17 15:18 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

On Tue, 17 Nov 2015, Tom de Vries wrote:

> On 17/11/15 11:05, Richard Biener wrote:
> > On Tue, Nov 17, 2015 at 12:20 AM, Tom de Vries <Tom_deVries@mentor.com>
> > wrote:
> > > On 16/11/15 13:45, Richard Biener wrote:
> > > > > > 
> > > > > > +             NEXT_PASS (pass_scev_cprop);
> > > > > > > > 
> > > > > > > > What's that for?  It's supposed to help removing loops - I don't
> > > > > > > > expect kernels to vanish.
> > > > > 
> > > > > > 
> > > > > > I'm using pass_scev_cprop for the "final value replacement"
> > > > > > functionality.
> > > > > > Added comment.
> > > 
> > > 
> > > > That functionality is intented to enable loop removal.
> > > 
> > > 
> > > Let me try to explain in a bit more detail.
> > > 
> > > 
> > > I.
> > > 
> > > Consider a parloops testcase test.c, with a use of the final value of the
> > > iteration variable (return i):
> > > ...
> > > unsigned int
> > > foo (int n, int *a)
> > > {
> > >    int i;
> > >    for (i = 0; i < n; ++i)
> > >      a[i] = 1;
> > > 
> > >    return i;
> > > }
> > > ...
> > > 
> > > Say we compile with:
> > > ...
> > > $ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details
> > > ...
> > > 
> > > We can see here in the parloops dump-file that the loop was parallelized:
> > > ...
> > >    SUCCESS: may be parallelized
> > > ...
> > > 
> > > Now say that we run with -fno-tree-scev-cprop in addition. Instead we find
> > > in the parloops dump-file:
> > > ...
> > > phi is i_1 = PHI <i_10(4)>
> > > arg of phi to exit:   value i_10 used outside loop
> > >    checking if it a part of reduction pattern:
> > >    FAILED: it is not a part of reduction.
> > > ...
> > > 
> > > Auto-parallelization fails in this case because there is a loop exit phi
> > > (the one in bb 6 defining i_1) which is not part of a reduction:
> > > ...
> > >    <bb 4>:
> > >    # i_13 = PHI <0(3), i_10(5)>
> > >    _5 = (long unsigned int) i_13;
> > >    _6 = _5 * 4;
> > >    _8 = a_7(D) + _6;
> > >    *_8 = 1;
> > >    i_10 = i_13 + 1;
> > >    if (n_4(D) > i_10)
> > >      goto <bb 5>;
> > >    else
> > >      goto <bb 6>;
> > > 
> > >    <bb 5>:
> > >    goto <bb 4>;
> > > 
> > >    <bb 6>:
> > >    # i_1 = PHI <i_10(4)>
> > >    _20 = (unsigned int) i_1;
> > > ...
> > > 
> > > With -ftree-scev-cprop, we find in the pass_scev_cprop dump-file:
> > > ...
> > > final value replacement:
> > >    i_1 = PHI <i_10(4)>
> > >    with
> > >    i_1 = n_4(D);
> > > ...
> > > 
> > > And the resulting loop no longer has any loop exit phis, so
> > > auto-parallelization succeeds:
> > > ...
> > >    <bb 4>:
> > >    # i_13 = PHI <0(3), i_10(5)>
> > >    _5 = (long unsigned int) i_13;
> > >    _6 = _5 * 4;
> > >    _8 = a_7(D) + _6;
> > >    *_8 = 1;
> > >    i_10 = i_13 + 1;
> > >    if (n_4(D) > i_10)
> > >      goto <bb 5>;
> > >    else
> > >      goto <bb 6>;
> > > 
> > >    <bb 5>:
> > >    goto <bb 4>;
> > > 
> > >    <bb 6>:
> > >    _20 = (unsigned int) n_4(D);
> > > ...
> > > 
> > > [ I've filed PR68373 - "autopar fails on loop exit phi with argument
> > > defined
> > > outside loop", for a slightly different testcase where despite the final
> > > value replacement autopar still fails. ]
> > > 
> > > 
> > > II.
> > > 
> > > Now, back to oacc kernels.
> > > 
> > > Consider test-case kernels-loop-n.f95 (will add this one to the
> > > test-cases):
> > > ...
> > > module test
> > > contains
> > >    subroutine foo(n)
> > >      implicit none
> > >      integer :: n
> > >      integer, dimension (0:n-1) :: a, b, c
> > >      integer                    :: i, ii
> > >      do i = 0, n - 1
> > >         a(i) = i * 2
> > >      end do
> > > 
> > >      do i = 0, n -1
> > >         b(i) = i * 4
> > >      end do
> > > 
> > >      !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
> > >      do ii = 0, n - 1
> > >         c(ii) = a(ii) + b(ii)
> > >      end do
> > >      !$acc end kernels
> > > 
> > >      do i = 0, n - 1
> > >         if (c(i) .ne. a(i) + b(i)) call abort
> > >      end do
> > > 
> > >    end subroutine foo
> > > end module test
> > > ...
> > > 
> > > The loop at the start of the kernels pass group contains an in-memory
> > > iteration variable, with a store to '*_9 = _38'.
> > > ...
> > >    <bb 4>:
> > >    _13 = *.omp_data_i_4(D).c;
> > >    c.21_14 = *_13;
> > >    _16 = *_9;
> > >    _17 = (integer(kind=8)) _16;
> > >    _18 = *.omp_data_i_4(D).a;
> > >    a.22_19 = *_18;
> > >    _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
> > >    _24 = *.omp_data_i_4(D).b;
> > >    b.23_25 = *_24;
> > >    _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
> > >    _30 = _23 + _29;
> > >    MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
> > >    _38 = _16 + 1;
> > >    *_9 = _38;
> > >    if (_8 == _16)
> > >      goto <bb 3>;
> > >    else
> > >      goto <bb 4>;
> > > ...
> > > 
> > > After pass_lim/pass_copy_prop, we've rewritten that into using a local
> > > iteration variable, but we've generated a read of the final value of the
> > > iteration variable outside the loop, which means auto-parallelization will
> > > fail:
> > > ...
> > >    <bb 5>:
> > >    # D__lsm.29_12 = PHI <D__lsm.29_15(4), _38(7)>
> > >    _17 = (integer(kind=8)) D__lsm.29_12;
> > >    _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
> > >    _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
> > >    _30 = _23 + _29;
> > >    MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
> > >    _38 = D__lsm.29_12 + 1;
> > >    if (_8 == D__lsm.29_12)
> > >      goto <bb 6>;
> > >    else
> > >      goto <bb 7>;
> > > 
> > >    <bb 6>:
> > >    # D__lsm.29_27 = PHI <_38(5)>
> > >    *_9 = D__lsm.29_27;
> > >    goto <bb 3>;
> > 
> > So this store is not actually necessary?
> 
> a.
> In the case of this example, the store is dead.
> 
> There is a corresponding load at the point that we split off the region:
> ...
>   <bb 9>:
>   #pragma omp return
> 
>   <bb 10>:
>   D.3635 = .omp_data_arr.25.ii;
>   ii = *D.3635;
> ...
> 
> This load is later removed, given that ii is unused after the region. But once
> the region is split off,  there's nothing in the context of the store to
> suggest that it's dead.
> 
> And to get rid of the load of ii before the region is split off, we would have
> to implement some sort of liveness analysis on pre-ssa code.
> 
> b.
> There's the case where there is an explicit use of ii after the region, in
> which case the store is not dead.
> 
> c.
> And there's the case were we use a data clause on the region, f.i. 'create
> (ii)' to indicate that the variable is neither copied in nor copied out of the
> region (the default for a scalar in a kernels region is 'copy', meaning
> copy-in-and-out).
> 
> [ This means the value of ii after the region is uninitialized. So even if
> there's a read from ii after the region, we cannot consider it connected to
> the store, given that the value written by the store on the accelerator will
> not be copied back to the host. ]
> 
> In this case, we already don't have any load of ii after the region:
> ...
>   <bb 9>:
>   #pragma omp return
> 
>   <bb 10>:
>   .omp_data_sizes.28 = {CLOBBER};
>   .omp_data_arr.27 = {CLOBBER};
> ...
> 
> We could insert clobbers for the bits of .omp_data_arr at the end of the
> region to indicate that those are not used. That might enable dse to get rid
> of the dead store.
> 
> 
> But, I think we want a generic solution that handles cases a, b and c, which
> means we have to solve the most difficult case, which is b, where the store is
> not dead.
> 
> >  Or just in an inconvenient place?
> 
> I don't think the place of the store is inconvenient, it would be worse to
> have the store in the loop.
> 
> What is inconvenient about the store is the fact that it reads the final value
> of the iteration variable (which inhibits parloops).
> 
> > >    <bb 7>:
> > >    goto <bb 5>;
> > > ...
> > > 
> > > This makes it similar to the parloops example above, and that's why I've
> > > added pass_scev_cprop in the kernels pass group.
> > > 
> > > [ And for some kernels test-cases with constant loop bound, it's not the
> > > final value replacement bit that does the substitution, but the first bit
> > > in
> > > scev_const_prop using resolve_mixers. So that's a related reason to use
> > > pass_scev_cprop. ]
> > 
> > IMHO autopar needs to handle induction itself.
> 
> I'm not sure what you mean. Could you elaborate?  Autopar handles induction
> variables, but it doesn't handle exit phis reading the final value of the
> induction variable. Is that what you want fixed? How?

Yes.  Perform final value replacement.

> > And the above LIM example
> > is none for why you need two LIM passes...
> 
> Indeed. I'm planning a separate reply to explain in more detail the need for
> the two pass_lims.

Thanks.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-17 15:18               ` Richard Biener
@ 2015-11-17 15:39                 ` Tom de Vries
  2015-11-17 22:21                   ` [PATCH, PR68373 ] Call scev_const_prop in pass_parallelize_loops::execute Tom de Vries
  2015-11-18  8:30                   ` [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def Richard Biener
  0 siblings, 2 replies; 133+ messages in thread
From: Tom de Vries @ 2015-11-17 15:39 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

On 17/11/15 16:18, Richard Biener wrote:
>>> IMHO autopar needs to handle induction itself.
>> >
>> >I'm not sure what you mean. Could you elaborate?  Autopar handles induction
>> >variables, but it doesn't handle exit phis reading the final value of the
>> >induction variable. Is that what you want fixed? How?
> Yes.  Perform final value replacement.
>

I see. Calling scev_const_prop in pass_parallelize_loops_oacc_kernels 
seems to work fine.

Doing the same for pass_parallelize_loops like this:
...
diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
index 17415a8..d944395 100644
--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c
@@ -2787,6 +2787,9 @@ pass_parallelize_loops::execute (function *fun)
    if (number_of_loops (fun) <= 1)
      return 0;

+  unsigned int sccp_todo = scev_const_prop ();
+  gcc_assert (sccp_todo == 0);
+
    if (parallelize_loops ())
      {
        fun->curr_properties &= ~(PROP_gimple_eomp);
...
seems to fix PR 68373 - "autopar fails on loop exit phi with argument 
defined outside loop".

The new scev_const_prop call in autopar rewrites this phi into an 
assignment, and that allows parloops to succeed:
...
final value replacement:
   n_2 = PHI <n_4(D)(4)>
   with
   n_2 = n_4(D);
...

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 133+ messages in thread

* [PATCH, PR68373 ] Call scev_const_prop in pass_parallelize_loops::execute
  2015-11-17 15:39                 ` Tom de Vries
@ 2015-11-17 22:21                   ` Tom de Vries
  2015-11-19  9:36                     ` Tom de Vries
  2015-11-18  8:30                   ` [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def Richard Biener
  1 sibling, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-17 22:21 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 1562 bytes --]

[ was: Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def ]

Hi,

Consider test-case test.c, with a use of the final value of the 
iteration variable (return i):
...
unsigned int
foo (int *a, unsigned int n)
{
   unsigned int i;
   for (i = 0; i < n; ++i)
     a[i] = 1;

   return i;
}
...

Compiled with:
...
$ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details
...

Before parloops, we have:
...
  <bb 4>:
   # i_12 = PHI <0(3), i_10(5)>
   _5 = (long unsigned int) i_12;
   _6 = _5 * 4;
   _8 = a_7(D) + _6;
   *_8 = 1;
   i_10 = i_12 + 1;
   if (n_4(D) > i_10)
     goto <bb 5>;
   else
     goto <bb 6>;

   <bb 5>:
   goto <bb 4>;

   <bb 6>:
   # i_14 = PHI <n_4(D)(4), 0(2)>
...

Parloops will fail because:
...
phi is n_2 = PHI <n_4(D)(4)>
arg of phi to exit:   value n_4(D) used outside loop
   checking if it a part of reduction pattern:
   FAILED: it is not a part of reduction....
...
[ note that the phi looks slightly different. In 
gather_scalar_reductions -> vect_analyze_loop_form -> 
vect_analyze_loop_form_1 -> split_loop_exit_edge we split the edge from 
bb4 to bb6. ]

This patch uses scev_const_prop at the start of parloops. 
scev_const_prop first also splits the exit edge, and then replaces the 
phi with an assignment:
...
  final value replacement:
   n_2 = PHI <n_4(D)(4)>
   with
   n_2 = n_4(D);
...

This allows parloops to succeed.
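
[ As a source-level illustration (a hand-written sketch, not compiler 
output): final value replacement effectively turns foo into something 
like foo_fvr below. The names foo_fvr and check_fvr are made up for this 
illustration and are not part of the patch. Since i and n are both 
unsigned here, i equals n at the loop exit for every n, including n == 0. ]

```c
#include <assert.h>

/* Original form: the final value of i is read after the loop, which
   shows up as a loop exit phi in GIMPLE.  */
unsigned int
foo (int *a, unsigned int n)
{
  unsigned int i;
  for (i = 0; i < n; ++i)
    a[i] = 1;

  return i;
}

/* Hand-written equivalent after final value replacement: the loop counts
   from 0 up to n in steps of 1, so the value of i at the exit is n, and
   the use after the loop no longer depends on a value computed inside
   the loop.  */
unsigned int
foo_fvr (int *a, unsigned int n)
{
  unsigned int i;
  for (i = 0; i < n; ++i)
    a[i] = 1;

  return n;
}

/* Hypothetical helper, not part of the patch: check that both forms
   return the same value for a few trip counts.  */
void
check_fvr (void)
{
  int a[8], b[8];
  unsigned int n;
  for (n = 0; n <= 8; ++n)
    assert (foo (a, n) == foo_fvr (b, n));
}
```

With the result expressed in terms of n, nothing after the loop reads a 
value defined in the loop body, which is the property parloops checks for 
on exit phis that are not reductions.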

And there's a similar story when we compile with -fno-tree-scev-cprop in 
addition.

Bootstrapped and reg-tested on x86_64.

OK for stage3/stage1?

Thanks,
- Tom


[-- Attachment #2: 0005-Call-scev_const_prop-in-pass_parallelize_loops-execute.patch --]
[-- Type: text/x-patch, Size: 1372 bytes --]

Call scev_const_prop in pass_parallelize_loops::execute

2015-11-17  Tom de Vries  <tom@codesourcery.com>

	PR tree-optimization/68373
	* tree-parloops.c (pass_parallelize_loops::execute): Call
	scev_const_prop.

	* gcc.dg/autopar/pr68373.c: New test.

---
 gcc/testsuite/gcc.dg/autopar/pr68373.c | 14 ++++++++++++++
 gcc/tree-parloops.c                    |  3 +++
 2 files changed, 17 insertions(+)

diff --git a/gcc/testsuite/gcc.dg/autopar/pr68373.c b/gcc/testsuite/gcc.dg/autopar/pr68373.c
new file mode 100644
index 0000000..8e0f8a5
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/autopar/pr68373.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-details" } */
+
+unsigned int
+foo (int *a, unsigned int n)
+{
+  unsigned int i;
+  for (i = 0; i < n; ++i)
+    a[i] = 1;
+
+  return i;
+}
+
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
index 17415a8..d944395 100644
--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c
@@ -2787,6 +2787,9 @@ pass_parallelize_loops::execute (function *fun)
   if (number_of_loops (fun) <= 1)
     return 0;
 
+  unsigned int sccp_todo = scev_const_prop ();
+  gcc_assert (sccp_todo == 0);
+
   if (parallelize_loops ())
     {
       fun->curr_properties &= ~(PROP_gimple_eomp);

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-17 15:39                 ` Tom de Vries
  2015-11-17 22:21                   ` [PATCH, PR68373 ] Call scev_const_prop in pass_parallelize_loops::execute Tom de Vries
@ 2015-11-18  8:30                   ` Richard Biener
  2015-11-18 16:22                     ` Bernhard Reutner-Fischer
  1 sibling, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-18  8:30 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

On Tue, 17 Nov 2015, Tom de Vries wrote:

> On 17/11/15 16:18, Richard Biener wrote:
> > > > IMHO autopar needs to handle induction itself.
> > > >
> > > >I'm not sure what you mean. Could you elaborate?  Autopar handles
> > > induction
> > > >variables, but it doesn't handle exit phis reading the final value of the
> > > >induction variable. Is that what you want fixed? How?
> > Yes.  Perform final value replacement.
> > 
> 
> I see. Calling scev_const_prop in pass_parallelize_loops_oacc_kernels seems to
> work fine.
> 
> Doing the same for pass_parallelize_loops like this:
> ...
> diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
> index 17415a8..d944395 100644
> --- a/gcc/tree-parloops.c
> +++ b/gcc/tree-parloops.c
> @@ -2787,6 +2787,9 @@ pass_parallelize_loops::execute (function *fun)
>    if (number_of_loops (fun) <= 1)
>      return 0;
> 
> +  unsigned int sccp_todo = scev_const_prop ();
> +  gcc_assert (sccp_todo == 0);
> +
>    if (parallelize_loops ())
>      {
>        fun->curr_properties &= ~(PROP_gimple_eomp);
> ...
> seems to fix PR 68373 - "autopar fails on loop exit phi with argument defined
> outside loop".
> 
> The new scev_const_prop call in autopar rewrites this phi into an assignment,
> and that allows parloops to succeed:
> ...
> final value replacement:
>   n_2 = PHI <n_4(D)(4)>
>   with
>   n_2 = n_4(D);
> ...

That works for me but please factor out the final value replacement
code from scev_const_prop.  I think best would be to have a
helper that does final value replacement for a single loop so you
can call it for loops to parallelize only.

Richard.

> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-18  8:30                   ` [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def Richard Biener
@ 2015-11-18 16:22                     ` Bernhard Reutner-Fischer
  2015-11-20 12:53                       ` [committed, trivial] Fix typo and trailing whitespace in dump-file strings in parloops Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Bernhard Reutner-Fischer @ 2015-11-18 16:22 UTC (permalink / raw)
  To: Richard Biener, Tom de Vries; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

On November 18, 2015 9:30:23 AM GMT+01:00, Richard Biener <rguenther@suse.de> wrote:
>On Tue, 17 Nov 2015, Tom de Vries wrote:
>
>> On 17/11/15 16:18, Richard Biener wrote:
>> > > > IMHO autopar needs to handle induction itself.
>> > > >
>> > > >I'm not sure what you mean. Could you elaborate?  Autopar
>handles
>> > > induction
>> > > >variables, but it doesn't handle exit phis reading the final
>value of the
>> > > >induction variable. Is that what you want fixed? How?
>> > Yes.  Perform final value replacement.
>> > 
>> 
>> I see. Calling scev_const_prop in pass_parallelize_loops_oacc_kernels
>seems to
>> work fine.
>> 
>> Doing the same for pass_parallelize_loops like this:
>> ...
>> diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
>> index 17415a8..d944395 100644
>> --- a/gcc/tree-parloops.c
>> +++ b/gcc/tree-parloops.c
>> @@ -2787,6 +2787,9 @@ pass_parallelize_loops::execute (function *fun)
>>    if (number_of_loops (fun) <= 1)
>>      return 0;
>> 
>> +  unsigned int sccp_todo = scev_const_prop ();
>> +  gcc_assert (sccp_todo == 0);
>> +
>>    if (parallelize_loops ())
>>      {
>>        fun->curr_properties &= ~(PROP_gimple_eomp);
>> ...
>> seems to fix PR 68373 - "autopar fails on loop exit phi with argument
>defined
>> outside loop".
>> 
>> The new scev_const_prop call in autopar rewrites this phi into an
>assignment,
>> and that allows parloops to succeed:
>> ...
>> final value replacement:
>>   n_2 = PHI <n_4(D)(4)>
>>   with
>>   n_2 = n_4(D);
>> ...
>
>That works for me but please factor out the final value replacement
>code from scev_const_prop.  I think best would be to have a
>helper that does final value replacement for a single loop so you
>can call it for loops to paralellize only.

Bonus points for fixing the dump_file to parse in:

>Parloops will fail because:
>...
>phi is n_2 = PHI <n_4(D)(4)>
>arg of phi to exit: value n_4(D) used outside loop
>checking if it a part of reduction pattern:

s/it a/it is/

>FAILED: it is not a part of reduction....
>...

TIA,
>
>Richard.
>
>> Thanks,
>> - Tom
>> 
>> 


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-17 14:54             ` Tom de Vries
  2015-11-17 15:18               ` Richard Biener
@ 2015-11-19  0:35               ` Tom de Vries
  2015-11-20 10:28                 ` Richard Biener
  1 sibling, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-19  0:35 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

On 17/11/15 15:53, Tom de Vries wrote:
>> And the above LIM example
>> is none for why you need two LIM passes...
>
> Indeed. I'm planning a separate reply to explain in more detail the need
> for the two pass_lims.

I.

I managed to get rid of the two pass_lims for the motivating example 
that I used until now (goacc/kernels-double-reduction.c). I found that 
by adding a pass_dominator instance after pass_ch, I could get rid of 
the second pass_lim (and pass_copyprop as well).

But... then I wrote a counter example 
(goacc/kernels-double-reduction-n.c), and I'm back at two pass_lims (and 
two pass_dominators).
Also, I've split the pass group into a part before and a part after pass_fre.

So, the current pass group looks like:
...
NEXT_PASS (pass_build_ealias);

/* Pass group that runs when the function is an offloaded function
    containing oacc kernels loops.  Part 1.  */
NEXT_PASS (pass_oacc_kernels);
PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
     /* We need pass_ch here, because pass_lim has no effect on
        exit-first loops (PR65442).  Ideally we want to remove both
        this pass instantiation, and the reverse transformation
        transform_to_exit_first_loop_alt, which is done in
        pass_parallelize_loops_oacc_kernels. */
     NEXT_PASS (pass_ch);
POP_INSERT_PASSES ()

NEXT_PASS (pass_fre);

/* Pass group that runs when the function is an offloaded function
    containing oacc kernels loops.  Part 2.  */
NEXT_PASS (pass_oacc_kernels2);
PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
     /* We use pass_lim to rewrite in-memory iteration and reduction
        variable accesses in loops into local variable accesses.  */
     NEXT_PASS (pass_lim);
     NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
     NEXT_PASS (pass_lim);
     NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
     NEXT_PASS (pass_dce);
     NEXT_PASS (pass_parallelize_loops_oacc_kernels);
     NEXT_PASS (pass_expand_omp_ssa);
POP_INSERT_PASSES ()
NEXT_PASS (pass_merge_phi);
...


II.

The motivating test-case kernels-double-reduction-n.c:
...
#include <stdlib.h>

#define N 500

unsigned int a[N][N];

void  __attribute__((noinline,noclone))
foo (unsigned int n)
{
   int i, j;
   unsigned int sum = 1;

#pragma acc kernels copyin (a[0:n]) copy (sum)
   {
     for (i = 0; i < n; ++i)
       for (j = 0; j < n; ++j)
         sum += a[i][j];
   }

   if (sum != 5001)
     abort ();
}
...


III.

Before first pass_lim. Note no phis on inner or outer loop header for 
iteration variables or reduction variable:
...
   <bb 2>:
   _5 = *.omp_data_i_4(D).i;
   *_5 = 0;
   _44 = *.omp_data_i_4(D).n;
   _45 = *_44;
   if (_45 != 0)
     goto <bb 4>;
   else
     goto <bb 3>;

   <bb 4>: outer loop header
   _12 = *.omp_data_i_4(D).j;
   *_12 = 0;
   if (_45 != 0)
     goto <bb 6>;
   else
     goto <bb 5>;

   <bb 6>: inner loop header, latch
   _19 = *.omp_data_i_4(D).a;
   _21 = *_5;
   _23 = *_12;
   _24 = *_19[_21][_23];
   _25 = *.omp_data_i_4(D).sum;
   sum.0_26 = *_25;
   sum.1_27 = _24 + sum.0_26;
   *_25 = sum.1_27;
   _33 = _23 + 1;
   *_12 = _33;
   j.2_16 = (unsigned int) _33;
   if (j.2_16 < _45)
     goto <bb 6>;
   else
     goto <bb 5>;

   <bb 5>: outer loop latch
   _36 = *_5;
   _38 = _36 + 1;
   *_5 = _38;
   i.3_9 = (unsigned int) _38;
   if (i.3_9 < _45)
     goto <bb 4>;
   else
     goto <bb 3>;

   <bb 3>:
   return;
...


IV.

After first pass_lim/pass_dom pair. Note there are phis on the inner 
loop header for the reduction and the iteration variable, but not on the 
outer loop header:
...
   <bb 2>:
   _5 = *.omp_data_i_4(D).i;
   *_5 = 0;
   _44 = *.omp_data_i_4(D).n;
   _45 = *_44;
   if (_45 != 0)
     goto <bb 4>;
   else
     goto <bb 3>;

   <bb 4>:
   _12 = *.omp_data_i_4(D).j;
   _19 = *.omp_data_i_4(D).a;
   D__lsm.10_50 = *_12;
   D__lsm.11_51 = 0;
   _25 = *.omp_data_i_4(D).sum;

   <bb 5>: outer loop header
   D__lsm.10_20 = 0;
   D__lsm.11_22 = 1;
   _21 = *_5;
   D__lsm.12_28 = *_25;
   D__lsm.13_30 = 0;
   goto <bb 7>;

   <bb 7>: inner loop header, latch
   # D__lsm.10_47 = PHI <0(5), _33(7)>
   # D__lsm.12_49 = PHI <D__lsm.12_28(5), sum.1_27(7)>
   _23 = D__lsm.10_47;
   _24 = *_19[_21][D__lsm.10_47];
   sum.0_26 = D__lsm.12_49;
   sum.1_27 = _24 + D__lsm.12_49;
   D__lsm.12_31 = sum.1_27;
   D__lsm.13_32 = 1;
   _33 = D__lsm.10_47 + 1;
   D__lsm.10_14 = _33;
   D__lsm.11_15 = 1;
   j.2_16 = (unsigned int) _33;
   if (j.2_16 < _45)
     goto <bb 7>;
   else
     goto <bb 8>;

   <bb 8>: outer loop latch
   # D__lsm.10_35 = PHI <_33(7)>
   # D__lsm.11_37 = PHI <1(7)>
   # D__lsm.12_7 = PHI <sum.1_27(7)>
   # D__lsm.13_8 = PHI <1(7)>
   *_25 = sum.1_27;
   _36 = *_5;
   _38 = _36 + 1;
   *_5 = _38;
   i.3_9 = (unsigned int) _38;
   if (i.3_9 < _45)
     goto <bb 5>;
   else
     goto <bb 6>;

   <bb 6>:
   # D__lsm.10_10 = PHI <_33(8)>
   # D__lsm.11_11 = PHI <1(8)>
   *_12 = _33;
   goto <bb 3>;

   <bb 3>:
   return;
...


V.

After second pass_lim/pass_dom pair. Note there are phis on the inner 
and outer loop header for the reduction and the iteration variables:
...
   <bb 2>:
   _5 = *.omp_data_i_4(D).i;
   *_5 = 0;
   _44 = *.omp_data_i_4(D).n;
   _45 = *_44;
   if (_45 != 0)
     goto <bb 4>;
   else
     goto <bb 3>;

   <bb 4>:
   _12 = *.omp_data_i_4(D).j;
   _19 = *.omp_data_i_4(D).a;
   D__lsm.10_50 = *_12;
   D__lsm.11_51 = 0;
   _25 = *.omp_data_i_4(D).sum;
   D__lsm.14_40 = 0;
   D__lsm.15_2 = 0;
   D__lsm.16_1 = *_25;
   D__lsm.17_46 = 0;

   <bb 5>: outer loop header
   # D__lsm.14_13 = PHI <0(4), _38(8)>
   # D__lsm.16_34 = PHI <D__lsm.16_1(4), sum.1_27(8)>
   D__lsm.10_20 = 0;
   D__lsm.11_22 = 1;
   _21 = D__lsm.14_13;
   D__lsm.12_28 = D__lsm.16_34;
   D__lsm.13_30 = 0;
   goto <bb 7>;

   <bb 7>: inner loop header, latch
   # D__lsm.10_47 = PHI <0(5), _33(7)>
   # D__lsm.12_49 = PHI <D__lsm.16_34(5), sum.1_27(7)>
   _23 = D__lsm.10_47;
   _24 = *_19[D__lsm.14_13][D__lsm.10_47];
   sum.0_26 = D__lsm.12_49;
   sum.1_27 = _24 + D__lsm.12_49;
   D__lsm.12_31 = sum.1_27;
   D__lsm.13_32 = 1;
   _33 = D__lsm.10_47 + 1;
   D__lsm.10_14 = _33;
   D__lsm.11_15 = 1;
   j.2_16 = (unsigned int) _33;
   if (j.2_16 < _45)
     goto <bb 7>;
   else
     goto <bb 8>;

   <bb 8>: outer loop latch
   # D__lsm.10_35 = PHI <_33(7)>
   # D__lsm.11_37 = PHI <1(7)>
   # D__lsm.12_7 = PHI <sum.1_27(7)>
   # D__lsm.13_8 = PHI <1(7)>
   # sum.1_48 = PHI <sum.1_27(7)>
   # _53 = PHI <_33(7)>
   D__lsm.16_56 = sum.1_27;
   D__lsm.17_57 = 1;
   _36 = D__lsm.14_13;
   _38 = D__lsm.14_13 + 1;
   D__lsm.14_58 = _38;
   D__lsm.15_59 = 1;
   i.3_9 = (unsigned int) _38;
   if (i.3_9 < _45)
     goto <bb 5>;
   else
     goto <bb 6>;

   <bb 6>:
   # D__lsm.10_10 = PHI <_33(8)>
   # D__lsm.11_11 = PHI <1(8)>
   # _43 = PHI <_33(8)>
   # D__lsm.16_62 = PHI <sum.1_27(8)>
   # D__lsm.17_63 = PHI <1(8)>
   # D__lsm.14_64 = PHI <_38(8)>
   # D__lsm.15_65 = PHI <1(8)>
   *_5 = _38;
   *_25 = sum.1_27;
   *_12 = _33;
   goto <bb 3>;

   <bb 3>:
   return;
...


VI.

After pass_dce, so before parloops-oacc-kernels:
...
   <bb 2>:
   _5 = *.omp_data_i_4(D).i;
   *_5 = 0;
   _44 = *.omp_data_i_4(D).n;
   _45 = *_44;
   if (_45 != 0)
     goto <bb 4>;
   else
     goto <bb 3>;

   <bb 4>:
   _12 = *.omp_data_i_4(D).j;
   _19 = *.omp_data_i_4(D).a;
   _25 = *.omp_data_i_4(D).sum;
   D__lsm.16_1 = *_25;

   <bb 5>: outer loop header
   # D__lsm.14_13 = PHI <0(4), _38(8)>
   # D__lsm.16_34 = PHI <D__lsm.16_1(4), sum.1_27(8)>
   goto <bb 7>;

   <bb 7>: inner loop header, latch
   # D__lsm.10_47 = PHI <0(5), _33(7)>
   # D__lsm.12_49 = PHI <D__lsm.16_34(5), sum.1_27(7)>
   _24 = *_19[D__lsm.14_13][D__lsm.10_47];
   sum.1_27 = _24 + D__lsm.12_49;
   _33 = D__lsm.10_47 + 1;
   j.2_16 = (unsigned int) _33;
   if (j.2_16 < _45)
     goto <bb 7>;
   else
     goto <bb 8>;

   <bb 8>: outer loop latch
   _38 = D__lsm.14_13 + 1;
   i.3_9 = (unsigned int) _38;
   if (i.3_9 < _45)
     goto <bb 5>;
   else
     goto <bb 6>;

   <bb 6>:
   *_5 = _38;
   *_25 = sum.1_27;
   *_12 = _33;
   goto <bb 3>;

   <bb 3>:
   return;
...
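
[ A source-level sketch of what the two pass_lim instances achieve 
together, hand-written and simplified rather than derived from the dumps. 
sum_in_memory corresponds roughly to section III, sum_in_locals to 
section VI. The function names and check_store_motion are hypothetical. 
The sketch assumes that the pointers alias neither each other nor the 
array, uses unsigned iteration variables to avoid the casts, and leaves 
out the D__lsm flag variables. Both loops are in do-while form, as after 
pass_ch, so the caller has to guarantee n != 0. ]

```c
#include <assert.h>

#define N 4

/* In-memory form: i, j and sum are loaded and stored through pointers
   in every iteration, so there are no phis on the loop headers.  */
void
sum_in_memory (unsigned int a[N][N], unsigned int n,
               unsigned int *ip, unsigned int *jp, unsigned int *sump)
{
  *ip = 0;
  do
    {
      *jp = 0;
      do
        {
          *sump += a[*ip][*jp];
          *jp += 1;
        }
      while (*jp < n);
      *ip += 1;
    }
  while (*ip < n);
}

/* After store motion on both loop levels: i, j and sum live in locals
   inside the loop nest, and the stores happen once, after the outer
   loop.  */
void
sum_in_locals (unsigned int a[N][N], unsigned int n,
               unsigned int *ip, unsigned int *jp, unsigned int *sump)
{
  unsigned int i = 0, j, sum = *sump;
  do
    {
      j = 0;
      do
        {
          sum += a[i][j];
          j += 1;
        }
      while (j < n);
      i += 1;
    }
  while (i < n);

  /* Stores of the final values, as in bb 6 of the last dump.  */
  *ip = i;
  *sump = sum;
  *jp = j;
}

/* Hypothetical helper: both forms leave the same values in memory.  */
void
check_store_motion (void)
{
  unsigned int a[N][N];
  unsigned int i1, j1, s1 = 1, i2, j2, s2 = 1;
  unsigned int r, c;

  for (r = 0; r < N; ++r)
    for (c = 0; c < N; ++c)
      a[r][c] = r * N + c;

  sum_in_memory (a, N, &i1, &j1, &s1);
  sum_in_locals (a, N, &i2, &j2, &s2);
  assert (i1 == i2 && j1 == j2 && s1 == s2);
}
```

The three stores after the loop nest read the final values of i, sum and 
j. Sum is a reduction, but the stores of i and j are the kind of final 
value use that the final value replacement discussed earlier in the 
thread is meant to deal with.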

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, PR68373 ] Call scev_const_prop in pass_parallelize_loops::execute
  2015-11-17 22:21                   ` [PATCH, PR68373 ] Call scev_const_prop in pass_parallelize_loops::execute Tom de Vries
@ 2015-11-19  9:36                     ` Tom de Vries
  2015-11-20 10:15                       ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-19  9:36 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 2068 bytes --]

On 17/11/15 23:20, Tom de Vries wrote:
> [ was: Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def ]
>
> Hi,
>
> Consider test-case test.c, with a use of the final value of the
> iteration variable (return i):
> ...
> unsigned int
> foo (int *a, unsigned int n)
> {
>    unsigned int i;
>    for (i = 0; i < n; ++i)
>      a[i] = 1;
>
>    return i;
> }
> ...
>
> Compiled with:
> ...
> $ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details
> ...
>
> Before parloops, we have:
> ...
>   <bb 4>:
>    # i_12 = PHI <0(3), i_10(5)>
>    _5 = (long unsigned int) i_12;
>    _6 = _5 * 4;
>    _8 = a_7(D) + _6;
>    *_8 = 1;
>    i_10 = i_12 + 1;
>    if (n_4(D) > i_10)
>      goto <bb 5>;
>    else
>      goto <bb 6>;
>
>    <bb 5>:
>    goto <bb 4>;
>
>    <bb 6>:
>    # i_14 = PHI <n_4(D)(4), 0(2)>
> ...
>
> Parloops will fail because:
> ...
> phi is n_2 = PHI <n_4(D)(4)>
> arg of phi to exit:   value n_4(D) used outside loop
>    checking if it a part of reduction pattern:
>    FAILED: it is not a part of reduction....
> ...
> [ note that the phi looks slightly different. In
> gather_scalar_reductions -> vect_analyze_loop_form ->
> vect_analyze_loop_form_1 -> split_loop_exit_edge we split the edge from
> bb4 to bb6. ]
>
> This patch uses scev_const_prop at the start of parloops.
> scev_const_prop first also splits the exit edge, and then replaces the
> phi with an assignment:
> ...
>   final value replacement:
>    n_2 = PHI <n_4(D)(4)>
>    with
>    n_2 = n_4(D);
> ...
>
> This allows parloops to succeed.
>
> And there's a similar story when we compile with -fno-tree-scev-cprop in
> addition.
>
> Bootstrapped and reg-tested on x86_64.
>
> OK for stage3/stage1?

The patch has been updated to do the final value replacement only for 
the loop that parloops is processing, as suggested in review comment at 
https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02166.html .
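
As a source-level illustration of what final value replacement amounts to
for the test-case (a sketch only; the actual transformation is done on
GIMPLE, and foo_replaced is a hypothetical name used just for this
comparison):

```c
/* Original form: the final value of i is used after the loop.  */
unsigned int
foo (int *a, unsigned int n)
{
  unsigned int i;
  for (i = 0; i < n; ++i)
    a[i] = 1;

  return i;
}

/* Hand-written equivalent after final value replacement: i counts up
   from 0 in steps of 1, so on exit it equals n (also when n is 0).
   Nothing defined inside the loop is used after it anymore.  */
unsigned int
foo_replaced (int *a, unsigned int n)
{
  for (unsigned int i = 0; i < n; ++i)
    a[i] = 1;

  return n;
}
```

With the exit phi gone, the only remaining loop-carried value is the
iteration variable itself, which is what parloops needs in order to
accept the loop.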

That means the patch is now also required for the kernels patch series.

Bootstrapped and reg-tested on x86_64.

OK for stage 3 trunk?

Thanks,
- Tom

[-- Attachment #2: 0001-Do-final-value-replacement-in-try_create_reduction_list.patch --]
[-- Type: text/x-patch, Size: 10690 bytes --]

Do final value replacement in try_create_reduction_list

2015-11-18  Tom de Vries  <tom@codesourcery.com>

	* tree-scalar-evolution.c (final_value_replacement_loop): Factor out of ...
	(scev_const_prop): ... here.
	* tree-scalar-evolution.h (final_value_replacement_loop): Declare.
	* tree-parloops.c (try_create_reduction_list): Call
	final_value_replacement_loop.

	* gcc.dg/autopar/pr68373.c: New test.

---
 gcc/testsuite/gcc.dg/autopar/pr68373.c |  14 ++
 gcc/tree-parloops.c                    |   3 +
 gcc/tree-scalar-evolution.c            | 248 +++++++++++++++++----------------
 gcc/tree-scalar-evolution.h            |   1 +
 4 files changed, 145 insertions(+), 121 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/autopar/pr68373.c b/gcc/testsuite/gcc.dg/autopar/pr68373.c
new file mode 100644
index 0000000..8e0f8a5
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/autopar/pr68373.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-details" } */
+
+unsigned int
+foo (int *a, unsigned int n)
+{
+  unsigned int i;
+  for (i = 0; i < n; ++i)
+    a[i] = 1;
+
+  return i;
+}
+
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
index 17415a8..8d7912d 100644
--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c
@@ -2539,6 +2539,9 @@ try_create_reduction_list (loop_p loop,
 
   gcc_assert (exit);
 
+  /* Try to get rid of exit phis.  */
+  final_value_replacement_loop (loop);
+
   gather_scalar_reductions (loop, reduction_list);
 
 
diff --git a/gcc/tree-scalar-evolution.c b/gcc/tree-scalar-evolution.c
index 27630f0..9b33693 100644
--- a/gcc/tree-scalar-evolution.c
+++ b/gcc/tree-scalar-evolution.c
@@ -3417,6 +3417,131 @@ expression_expensive_p (tree expr)
     }
 }
 
+/* Do final value replacement for LOOP.  */
+
+void
+final_value_replacement_loop (struct loop *loop)
+{
+  /* If we do not know exact number of iterations of the loop, we cannot
+     replace the final value.  */
+  edge exit = single_exit (loop);
+  if (!exit)
+    return;
+
+  tree niter = number_of_latch_executions (loop);
+  if (niter == chrec_dont_know)
+    return;
+
+  /* Ensure that it is possible to insert new statements somewhere.  */
+  if (!single_pred_p (exit->dest))
+    split_loop_exit_edge (exit);
+
+  /* Set stmt insertion pointer.  All stmts are inserted before this point.  */
+  gimple_stmt_iterator gsi = gsi_after_labels (exit->dest);
+
+  struct loop *ex_loop
+    = superloop_at_depth (loop,
+			  loop_depth (exit->dest->loop_father) + 1);
+
+  gphi_iterator psi;
+  for (psi = gsi_start_phis (exit->dest); !gsi_end_p (psi); )
+    {
+      gphi *phi = psi.phi ();
+      tree rslt = PHI_RESULT (phi);
+      tree def = PHI_ARG_DEF_FROM_EDGE (phi, exit);
+      if (virtual_operand_p (def))
+	{
+	  gsi_next (&psi);
+	  continue;
+	}
+
+      if (!POINTER_TYPE_P (TREE_TYPE (def))
+	  && !INTEGRAL_TYPE_P (TREE_TYPE (def)))
+	{
+	  gsi_next (&psi);
+	  continue;
+	}
+
+      bool folded_casts;
+      def = analyze_scalar_evolution_in_loop (ex_loop, loop, def,
+					      &folded_casts);
+      def = compute_overall_effect_of_inner_loop (ex_loop, def);
+      if (!tree_does_not_contain_chrecs (def)
+	  || chrec_contains_symbols_defined_in_loop (def, ex_loop->num)
+	  /* Moving the computation from the loop may prolong life range
+	     of some ssa names, which may cause problems if they appear
+	     on abnormal edges.  */
+	  || contains_abnormal_ssa_name_p (def)
+	  /* Do not emit expensive expressions.  The rationale is that
+	     when someone writes a code like
+
+	     while (n > 45) n -= 45;
+
+	     he probably knows that n is not large, and does not want it
+	     to be turned into n %= 45.  */
+	  || expression_expensive_p (def))
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    {
+	      fprintf (dump_file, "not replacing:\n  ");
+	      print_gimple_stmt (dump_file, phi, 0, 0);
+	      fprintf (dump_file, "\n");
+	    }
+	  gsi_next (&psi);
+	  continue;
+	}
+
+      /* Eliminate the PHI node and replace it by a computation outside
+	 the loop.  */
+      if (dump_file)
+	{
+	  fprintf (dump_file, "\nfinal value replacement:\n  ");
+	  print_gimple_stmt (dump_file, phi, 0, 0);
+	  fprintf (dump_file, "  with\n  ");
+	}
+      def = unshare_expr (def);
+      remove_phi_node (&psi, false);
+
+      /* If def's type has undefined overflow and there were folded
+	 casts, rewrite all stmts added for def into arithmetics
+	 with defined overflow behavior.  */
+      if (folded_casts && ANY_INTEGRAL_TYPE_P (TREE_TYPE (def))
+	  && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (def)))
+	{
+	  gimple_seq stmts;
+	  gimple_stmt_iterator gsi2;
+	  def = force_gimple_operand (def, &stmts, true, NULL_TREE);
+	  gsi2 = gsi_start (stmts);
+	  while (!gsi_end_p (gsi2))
+	    {
+	      gimple *stmt = gsi_stmt (gsi2);
+	      gimple_stmt_iterator gsi3 = gsi2;
+	      gsi_next (&gsi2);
+	      gsi_remove (&gsi3, false);
+	      if (is_gimple_assign (stmt)
+		  && arith_code_with_undefined_signed_overflow
+		  (gimple_assign_rhs_code (stmt)))
+		gsi_insert_seq_before (&gsi,
+				       rewrite_to_defined_overflow (stmt),
+				       GSI_SAME_STMT);
+	      else
+		gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
+	    }
+	}
+      else
+	def = force_gimple_operand_gsi (&gsi, def, false, NULL_TREE,
+					true, GSI_SAME_STMT);
+
+      gassign *ass = gimple_build_assign (rslt, def);
+      gsi_insert_before (&gsi, ass, GSI_SAME_STMT);
+      if (dump_file)
+	{
+	  print_gimple_stmt (dump_file, ass, 0, 0);
+	  fprintf (dump_file, "\n");
+	}
+    }
+}
+
 /* Replace ssa names for that scev can prove they are constant by the
    appropriate constants.  Also perform final value replacement in loops,
    in case the replacement expressions are cheap.
@@ -3430,8 +3555,7 @@ scev_const_prop (void)
   basic_block bb;
   tree name, type, ev;
   gphi *phi;
-  gassign *ass;
-  struct loop *loop, *ex_loop;
+  struct loop *loop;
   bitmap ssa_names_to_remove = NULL;
   unsigned i;
   gphi_iterator psi;
@@ -3507,126 +3631,8 @@ scev_const_prop (void)
 
   /* Now the regular final value replacement.  */
   FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
-    {
-      edge exit;
-      tree def, rslt, niter;
-      gimple_stmt_iterator gsi;
-
-      /* If we do not know exact number of iterations of the loop, we cannot
-	 replace the final value.  */
-      exit = single_exit (loop);
-      if (!exit)
-	continue;
-
-      niter = number_of_latch_executions (loop);
-      if (niter == chrec_dont_know)
-	continue;
-
-      /* Ensure that it is possible to insert new statements somewhere.  */
-      if (!single_pred_p (exit->dest))
-	split_loop_exit_edge (exit);
-      gsi = gsi_after_labels (exit->dest);
+    final_value_replacement_loop (loop);
 
-      ex_loop = superloop_at_depth (loop,
-				    loop_depth (exit->dest->loop_father) + 1);
-
-      for (psi = gsi_start_phis (exit->dest); !gsi_end_p (psi); )
-	{
-	  phi = psi.phi ();
-	  rslt = PHI_RESULT (phi);
-	  def = PHI_ARG_DEF_FROM_EDGE (phi, exit);
-	  if (virtual_operand_p (def))
-	    {
-	      gsi_next (&psi);
-	      continue;
-	    }
-
-	  if (!POINTER_TYPE_P (TREE_TYPE (def))
-	      && !INTEGRAL_TYPE_P (TREE_TYPE (def)))
-	    {
-	      gsi_next (&psi);
-	      continue;
-	    }
-
-	  bool folded_casts;
-	  def = analyze_scalar_evolution_in_loop (ex_loop, loop, def,
-						  &folded_casts);
-	  def = compute_overall_effect_of_inner_loop (ex_loop, def);
-	  if (!tree_does_not_contain_chrecs (def)
-	      || chrec_contains_symbols_defined_in_loop (def, ex_loop->num)
-	      /* Moving the computation from the loop may prolong life range
-		 of some ssa names, which may cause problems if they appear
-		 on abnormal edges.  */
-	      || contains_abnormal_ssa_name_p (def)
-	      /* Do not emit expensive expressions.  The rationale is that
-		 when someone writes a code like
-
-		 while (n > 45) n -= 45;
-
-		 he probably knows that n is not large, and does not want it
-		 to be turned into n %= 45.  */
-	      || expression_expensive_p (def))
-	    {
-	      if (dump_file && (dump_flags & TDF_DETAILS))
-		{
-	          fprintf (dump_file, "not replacing:\n  ");
-	          print_gimple_stmt (dump_file, phi, 0, 0);
-	          fprintf (dump_file, "\n");
-		}
-	      gsi_next (&psi);
-	      continue;
-	    }
-
-	  /* Eliminate the PHI node and replace it by a computation outside
-	     the loop.  */
-	  if (dump_file)
-	    {
-	      fprintf (dump_file, "\nfinal value replacement:\n  ");
-	      print_gimple_stmt (dump_file, phi, 0, 0);
-	      fprintf (dump_file, "  with\n  ");
-	    }
-	  def = unshare_expr (def);
-	  remove_phi_node (&psi, false);
-
-	  /* If def's type has undefined overflow and there were folded
-	     casts, rewrite all stmts added for def into arithmetics
-	     with defined overflow behavior.  */
-	  if (folded_casts && ANY_INTEGRAL_TYPE_P (TREE_TYPE (def))
-	      && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (def)))
-	    {
-	      gimple_seq stmts;
-	      gimple_stmt_iterator gsi2;
-	      def = force_gimple_operand (def, &stmts, true, NULL_TREE);
-	      gsi2 = gsi_start (stmts);
-	      while (!gsi_end_p (gsi2))
-		{
-		  gimple *stmt = gsi_stmt (gsi2);
-		  gimple_stmt_iterator gsi3 = gsi2;
-		  gsi_next (&gsi2);
-		  gsi_remove (&gsi3, false);
-		  if (is_gimple_assign (stmt)
-		      && arith_code_with_undefined_signed_overflow
-					(gimple_assign_rhs_code (stmt)))
-		    gsi_insert_seq_before (&gsi,
-					   rewrite_to_defined_overflow (stmt),
-					   GSI_SAME_STMT);
-		  else
-		    gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
-		}
-	    }
-	  else
-	    def = force_gimple_operand_gsi (&gsi, def, false, NULL_TREE,
-					    true, GSI_SAME_STMT);
-
-	  ass = gimple_build_assign (rslt, def);
-	  gsi_insert_before (&gsi, ass, GSI_SAME_STMT);
-	  if (dump_file)
-	    {
-	      print_gimple_stmt (dump_file, ass, 0, 0);
-	      fprintf (dump_file, "\n");
-	    }
-	}
-    }
   return 0;
 }
 
diff --git a/gcc/tree-scalar-evolution.h b/gcc/tree-scalar-evolution.h
index 6d31280..29c7cd4 100644
--- a/gcc/tree-scalar-evolution.h
+++ b/gcc/tree-scalar-evolution.h
@@ -33,6 +33,7 @@ extern tree analyze_scalar_evolution (struct loop *, tree);
 extern tree instantiate_scev (basic_block, struct loop *, tree);
 extern tree resolve_mixers (struct loop *, tree, bool *);
 extern void gather_stats_on_scev_database (void);
+extern void final_value_replacement_loop (struct loop *);
 extern unsigned int scev_const_prop (void);
 extern bool expression_expensive_p (tree);
 extern bool simple_iv (struct loop *, struct loop *, tree, struct affine_iv *,

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-16 12:45       ` Richard Biener
  2015-11-16 23:21         ` Tom de Vries
@ 2015-11-19 10:31         ` Tom de Vries
  2015-11-20 10:37           ` Richard Biener
  1 sibling, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-19 10:31 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 1148 bytes --]

On 16/11/15 13:45, Richard Biener wrote:
>> I've eliminated all the uses for pass_tree_loop_init/pass_tree_loop_done in
>> >the pass group. Instead, I've added conditional loop optimizer setup in:
>> >-  pass_lim and pass_scev_cprop (added in this patch), and

Reposting the "Add pass_oacc_kernels pass group in passes.def" patch.

pass_scev_cprop is no longer part of the pass group.

And I've dropped the scev_initialize in pass_lim.

Pass_lim is part of the pass_tree_loop pass group, where AFAIU scev info 
is initialized at the start of the pass group and updated or reset by 
passes in the pass group if necessary, such that it's always available, 
or can be recalculated on the spot.

First, pass_lim doesn't invalidate scev info. And second, AFAIU pass_lim 
doesn't use scev info. So there doesn't seem to be a need to do anything 
about scev info for using pass_lim outside pass_tree_loop.

>> >- pass_parallelize_loops_oacc_kernels (added in patch "Add
>> >   pass_parallelize_loops_oacc_kernels").
> You miss calling scev_finalize ().

I've added the scev_finalize () in patch "Add 
pass_parallelize_loops_oacc_kernels".

Thanks,
- Tom


[-- Attachment #2: 0005-Add-pass_oacc_kernels-pass-group-in-passes.def.patch --]
[-- Type: text/x-patch, Size: 4035 bytes --]

Add pass_oacc_kernels pass group in passes.def

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* omp-low.c (pass_expand_omp_ssa::clone): New function.
	* passes.def: Add pass_oacc_kernels pass group.
	* tree-ssa-loop-ch.c (pass_ch::clone): New function.
	* tree-ssa-loop-im.c (tree_ssa_lim): Make static.
	(pass_lim::execute): Allow to run outside pass_tree_loop.

---
 gcc/omp-low.c          |  1 +
 gcc/passes.def         | 25 +++++++++++++++++++++++++
 gcc/tree-ssa-loop-ch.c |  2 ++
 gcc/tree-ssa-loop-im.c | 10 +++++++++-
 4 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 9c27396..d2f88b3 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -13385,6 +13385,7 @@ public:
       return !(fun->curr_properties & PROP_gimple_eomp);
     }
   virtual unsigned int execute (function *) { return execute_expand_omp (); }
+  opt_pass * clone () { return new pass_expand_omp_ssa (m_ctxt); }
 
 }; // class pass_expand_omp_ssa
 
diff --git a/gcc/passes.def b/gcc/passes.def
index 17027786..00446c3 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -88,7 +88,32 @@ along with GCC; see the file COPYING3.  If not see
 	  /* pass_build_ealias is a dummy pass that ensures that we
 	     execute TODO_rebuild_alias at this point.  */
 	  NEXT_PASS (pass_build_ealias);
+	  /* Pass group that runs when the function is an offloaded function
+	     containing oacc kernels loops.  Part 1.  */
+	  NEXT_PASS (pass_oacc_kernels);
+	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
+	      /* We need pass_ch here, because pass_lim has no effect on
+	         exit-first loops (PR65442).  Ideally we want to remove both
+		 this pass instantiation, and the reverse transformation
+		 transform_to_exit_first_loop_alt, which is done in
+		 pass_parallelize_loops_oacc_kernels. */
+	      NEXT_PASS (pass_ch);
+	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_fre);
+	  /* Pass group that runs when the function is an offloaded function
+	     containing oacc kernels loops.  Part 2.  */
+	  NEXT_PASS (pass_oacc_kernels2);
+	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
+	      /* We use pass_lim to rewrite in-memory iteration and reduction
+	         variable accesses in loops into local variables accesses.  */
+	      NEXT_PASS (pass_lim);
+	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+	      NEXT_PASS (pass_lim);
+	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+	      NEXT_PASS (pass_dce);
+	      NEXT_PASS (pass_parallelize_loops_oacc_kernels);
+	      NEXT_PASS (pass_expand_omp_ssa);
+	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_merge_phi);
           NEXT_PASS (pass_dse);
 	  NEXT_PASS (pass_cd_dce);
diff --git a/gcc/tree-ssa-loop-ch.c b/gcc/tree-ssa-loop-ch.c
index 7e618bf..6493fcc 100644
--- a/gcc/tree-ssa-loop-ch.c
+++ b/gcc/tree-ssa-loop-ch.c
@@ -165,6 +165,8 @@ public:
   /* Initialize and finalize loop structures, copying headers inbetween.  */
   virtual unsigned int execute (function *);
 
+  opt_pass * clone () { return new pass_ch (m_ctxt); }
+
 protected:
   /* ch_base method: */
   virtual bool process_loop_p (struct loop *loop);
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 30b53ce..96f05f2 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -2496,7 +2496,7 @@ tree_ssa_lim_finalize (void)
 /* Moves invariants from loops.  Only "expensive" invariants are moved out --
    i.e. those that are likely to be win regardless of the register pressure.  */
 
-unsigned int
+static unsigned int
 tree_ssa_lim (void)
 {
   unsigned int todo;
@@ -2560,9 +2560,17 @@ public:
 unsigned int
 pass_lim::execute (function *fun)
 {
+  if (!loops_state_satisfies_p (LOOPS_NORMAL
+				| LOOPS_HAVE_RECORDED_EXITS))
+    loop_optimizer_init (LOOPS_NORMAL
+			 | LOOPS_HAVE_RECORDED_EXITS);
+
   if (number_of_loops (fun) <= 1)
     return 0;
 
+  if (!loops_state_satisfies_p (LOOP_CLOSED_SSA))
+    rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
+
   return tree_ssa_lim ();
 }
 

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 6/16] Add pass_oacc_kernels
  2015-11-11 10:59   ` Richard Biener
@ 2015-11-19 13:51     ` Tom de Vries
  2015-11-24 12:17       ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-19 13:51 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

On 11/11/15 11:58, Richard Biener wrote:
> On Mon, 9 Nov 2015, Tom de Vries wrote:
>
>> On 09/11/15 16:35, Tom de Vries wrote:
>>> Hi,
>>>
>>> this patch series for stage1 trunk adds support to:
>>> - parallelize oacc kernels regions using parloops, and
>>> - map the loops onto the oacc gang dimension.
>>>
>>> The patch series contains these patches:
>>>
>>>        1    Insert new exit block only when needed in
>>>           transform_to_exit_first_loop_alt
>>>        2    Make create_parallel_loop return void
>>>        3    Ignore reduction clause on kernels directive
>>>        4    Implement -foffload-alias
>>>        5    Add in_oacc_kernels_region in struct loop
>>>        6    Add pass_oacc_kernels
>>>        7    Add pass_dominator_oacc_kernels
>>>        8    Add pass_ch_oacc_kernels
>>>        9    Add pass_parallelize_loops_oacc_kernels
>>>       10    Add pass_oacc_kernels pass group in passes.def
>>>       11    Update testcases after adding kernels pass group
>>>       12    Handle acc loop directive
>>>       13    Add c-c++-common/goacc/kernels-*.c
>>>       14    Add gfortran.dg/goacc/kernels-*.f95
>>>       15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>>       16    Add libgomp.oacc-fortran/kernels-*.f95
>>>
>>> The first 9 patches are more or less independent, but patches 10-16 are
>>> intended to be committed at the same time.
>>>
>>> Bootstrapped and reg-tested on x86_64.
>>>
>>> Build and reg-tested with nvidia accelerator, in combination with a
>>> patch that enables accelerator testing (which is submitted at
>>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>>
>>> I'll post the individual patches in reply to this message.
>>
>> this patch adds a pass group pass_oacc_kernels (which will be added to the
>> pass list as a whole in patch 10).
>
> Just to understand (while also skimming the HSA patches).
>
> You are basically relying on autopar for what the HSA patches call
> "gridification"?  That is, OMP lowering produces loopy kernels
> and autopar then will basically strip the outermost loop?

Short answer: no. In more detail...

Existing openmp support maps explicitly independent loops (annotated with 
omp-for) in omp-parallel regions onto pthreads. It generates thread 
functions containing sequential loops that iterate on a subset of data 
of the original loop.

Parloops maps sequential loops onto pthreads by:
- proving the loop is independent
- identifying reductions
- rewriting the loop into an omp-for annotated loop
- wrapping the loop in an omp-parallel region
- rewriting the variable accesses in the loop such that they are
   relative to base pointers passed into the region
   (note: this bit is done by omplower for omp-for loops from source)
- rewriting the preloop-read and postloop-write pair of a reduction
   variable into an atomic update
- letting a subsequent ompexpand expand the omp-for and omp-parallel
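
To make the list above concrete, here is a source-level sketch of the
effect, written by hand with OpenMP directives. It is an approximation
under my own assumptions: parloops works on GIMPLE rather than on source,
and the names sum_seq, sum_par and local are hypothetical.

```c
/* Sequential loop as parloops would see it.  */
unsigned int
sum_seq (const unsigned int *a, unsigned int n)
{
  unsigned int sum = 0;
  for (unsigned int i = 0; i < n; i++)
    sum += a[i];
  return sum;
}

/* Hand-written approximation of the parallelized form: an omp-for loop
   inside an omp-parallel region, with each thread accumulating into a
   private partial sum that is merged by a single atomic update.  */
unsigned int
sum_par (const unsigned int *a, unsigned int n)
{
  unsigned int sum = 0;
#pragma omp parallel
  {
    unsigned int local = 0;
#pragma omp for
    for (unsigned int i = 0; i < n; i++)
      local += a[i];
#pragma omp atomic
    sum += local;
  }
  return sum;
}
```

Without -fopenmp the pragmas are ignored and sum_par simply runs
sequentially, which makes it easy to compare the two forms.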

The HSA support maps explicitly independent loops in openmp target 
regions onto a shared memory accelerator. By default, it generates 
kernel functions containing sequential loops that iterate on a subset of 
data of the original loop. The control flow has a performance penalty on 
the accelerator, so there's a concept called gridification (explained 
here: https://gcc.gnu.org/ml/gcc-patches/2015-11/msg00586.html ). [ I'm 
not sure if it is an additional transformation or a different style of 
generation ].  The gridification increases the launch dimensions of the 
kernels to a point that there's only one iteration left in the loop, 
which means that the control flow can be eliminated.
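
A rough sketch of the difference, in plain C with the launch simulated on
the host. Everything here is an assumption made for illustration: the
function names are hypothetical, the cyclic distribution of iterations is
just one possible choice, and this is not how the HSA patches implement
it.

```c
/* Non-gridified kernel: each of NTHREADS work items loops over its own
   subset of the iteration space.  */
static void
kernel_loop (int *a, unsigned int n, unsigned int tid,
	     unsigned int nthreads)
{
  for (unsigned int i = tid; i < n; i += nthreads)
    a[i] += 1;
}

/* Gridified kernel: launched with one work item per iteration, so the
   loop and its control flow are gone.  */
static void
kernel_grid (int *a, unsigned int tid)
{
  a[tid] += 1;
}

/* Host-side stand-ins for a kernel launch; they run the work items one
   after another.  */
static void
launch_loop (int *a, unsigned int n, unsigned int nthreads)
{
  for (unsigned int tid = 0; tid < nthreads; tid++)
    kernel_loop (a, n, tid, nthreads);
}

static void
launch_grid (int *a, unsigned int n)
{
  for (unsigned int tid = 0; tid < n; tid++)
    kernel_grid (a, tid);
}
```

Both launches touch every element exactly once; the gridified one moves
the iteration count into the launch dimensions.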

The openacc kernels support maps loops in an oacc kernels region onto a 
non-shared memory accelerator. These loops can be unannotated loops, or 
acc-loop annotated loops. If an acc-loop directive contains the 
independent clause, the loop is explicitly independent.

The current oacc kernels implementation mostly ignores the acc-loop 
directive, in order to unify handling of the annotated and unannotated 
loop. The patch "Handle acc loop directive" (at 
https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01089.html ) expands the 
annotated loop as a sequential loop.
At the point that we get to pass_parallelize_loops_oacc_kernels, we have 
sequential loops in an offloaded function (atm, there's no support for 
the independent clause yet).

So pass_parallelize_loops_oacc_kernels transforms sequential loops in an 
offloaded function originating from a kernels region into explicitly 
independent loops by:
- proving the loop is independent
- identifying reductions
- rewriting the loop into an acc-loop annotated loop
- annotating the offloaded function with kernel launch dimensions
- rewriting the preloop-load and postloop-store pair of a reduction
   variable into an atomic update
- letting a subsequent ompexpand expand the acc-loop
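
As a source-level analogy (a sketch, not what the pass literally emits):
the first function below is what the user writes, the second is roughly
the explicitly parallel form one could write by hand to get a similar
result. The function names are hypothetical, and the choice of
"parallel loop gang reduction" is my assumption about the closest
hand-written equivalent.

```c
/* What the user writes: an unannotated loop in a kernels region.  */
unsigned int
sum_kernels (const unsigned int *a, unsigned int n)
{
  unsigned int sum = 0;
#pragma acc kernels copyin (a[0:n]) copy (sum)
  {
    for (unsigned int i = 0; i < n; i++)
      sum += a[i];
  }
  return sum;
}

/* Hand-written explicit form: the loop is declared parallel, mapped
   onto the gang dimension, and the reduction is spelled out.  */
unsigned int
sum_explicit (const unsigned int *a, unsigned int n)
{
  unsigned int sum = 0;
#pragma acc parallel loop gang reduction (+:sum) copyin (a[0:n])
  for (unsigned int i = 0; i < n; i++)
    sum += a[i];
  return sum;
}
```

Without -fopenacc the pragmas are ignored and both functions run
sequentially on the host.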

I'd say there is no explicit gridification in there.

AFAIU, gridification is something that can result from determining the 
launch dimensions of the offloaded function, and optimizing for those 
dimensions. Currently pass_parallelize_loops_oacc_kernels is a place 
where we set launch dimensions, but we're not optimizing for that; that 
happens later on. (And I'm starting to wonder whether I can get rid of 
the setting of the gang dimension in pass_parallelize_loops_oacc_kernels).

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, PR68373 ] Call scev_const_prop in pass_parallelize_loops::execute
  2015-11-19  9:36                     ` Tom de Vries
@ 2015-11-20 10:15                       ` Richard Biener
  0 siblings, 0 replies; 133+ messages in thread
From: Richard Biener @ 2015-11-20 10:15 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

On Thu, 19 Nov 2015, Tom de Vries wrote:

> On 17/11/15 23:20, Tom de Vries wrote:
> > [ was: Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def ]
> > 
> > Hi,
> > 
> > Consider test-case test.c, with a use of the final value of the
> > iteration variable (return i):
> > ...
> > unsigned int
> > foo (int *a, unsigned int n)
> > {
> >    unsigned int i;
> >    for (i = 0; i < n; ++i)
> >      a[i] = 1;
> > 
> >    return i;
> > }
> > ...
> > 
> > Compiled with:
> > ...
> > $ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details
> > ...
> > 
> > Before parloops, we have:
> > ...
> >   <bb 4>:
> >    # i_12 = PHI <0(3), i_10(5)>
> >    _5 = (long unsigned int) i_12;
> >    _6 = _5 * 4;
> >    _8 = a_7(D) + _6;
> >    *_8 = 1;
> >    i_10 = i_12 + 1;
> >    if (n_4(D) > i_10)
> >      goto <bb 5>;
> >    else
> >      goto <bb 6>;
> > 
> >    <bb 5>:
> >    goto <bb 4>;
> > 
> >    <bb 6>:
> >    # i_14 = PHI <n_4(D)(4), 0(2)>
> > ...
> > 
> > Parloops will fail because:
> > ...
> > phi is n_2 = PHI <n_4(D)(4)>
> > arg of phi to exit:   value n_4(D) used outside loop
> >    checking if it a part of reduction pattern:
> >    FAILED: it is not a part of reduction....
> > ...
> > [ note that the phi looks slightly different. In
> > gather_scalar_reductions -> vect_analyze_loop_form ->
> > vect_analyze_loop_form_1 -> split_loop_exit_edge we split the edge from
> > bb4 to bb6. ]
> > 
> > This patch uses scev_const_prop at the start of parloops.
> > scev_const_prop first also splits the exit edge, and then replaces the
> > phi with an assignment:
> > ...
> >   final value replacement:
> >    n_2 = PHI <n_4(D)(4)>
> >    with
> >    n_2 = n_4(D);
> > ...
> > 
> > This allows parloops to succeed.
> > 
> > And there's a similar story when we compile with -fno-tree-scev-cprop in
> > addition.
> > 
> > Bootstrapped and reg-tested on x86_64.
> > 
> > OK for stage3/stage1?
> 
> The patch has been updated to do the final value replacement only for the loop
> that parloops is processing, as suggested in review comment at
> https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02166.html .
> 
> That means the patch is now also required for the kernels patch series.
> 
> Bootstrapped and reg-tested on x86_64.
> 
> OK for stage 3 trunk?

Ok.  Please mention tree-optimization/68373 in the changelog.

Thanks,
Richard.

> Thanks,
> - Tom
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-19  0:35               ` [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def Tom de Vries
@ 2015-11-20 10:28                 ` Richard Biener
  2015-11-21  8:42                   ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-20 10:28 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

On Thu, 19 Nov 2015, Tom de Vries wrote:

> On 17/11/15 15:53, Tom de Vries wrote:
> > > And the above LIM example
> > > is none for why you need two LIM passes...
> > 
> > Indeed. I'm planning a separate reply to explain in more detail the need
> > for the two pass_lims.
> 
> I.
> 
> I managed to get rid of the two pass_lims for the motivating example that I
> used until now (goacc/kernels-double-reduction.c). I found that by adding a
> pass_dominator instance after pass_ch, I could get rid of the second pass_lim
> (and pass_copyprop as well).
> 
> But... then I wrote a counter example (goacc/kernels-double-reduction-n.c),
> and I'm back at two pass_lims (and two pass_dominators).
> Also I've split the pass group into a bit before and after pass_fre.
> 
> So, the current pass group looks like:
> ...
> NEXT_PASS (pass_build_ealias);
> 
> /* Pass group that runs when the function is an offloaded function
>    containing oacc kernels loops.  Part 1.  */
> NEXT_PASS (pass_oacc_kernels);
> PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
>     /* We need pass_ch here, because pass_lim has no effect on
>        exit-first loops (PR65442).  Ideally we want to remove both
>        this pass instantiation, and the reverse transformation
>        transform_to_exit_first_loop_alt, which is done in
>        pass_parallelize_loops_oacc_kernels. */
>     NEXT_PASS (pass_ch);
> POP_INSERT_PASSES ()
> 
> NEXT_PASS (pass_fre);
> 
> /* Pass group that runs when the function is an offloaded function
>    containing oacc kernels loops.  Part 2.  */
> NEXT_PASS (pass_oacc_kernels2);
> PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
>     /* We use pass_lim to rewrite in-memory iteration and reduction
>        variable accesses in loops into local variables accesses.  */
>     NEXT_PASS (pass_lim);
>     NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
>     NEXT_PASS (pass_lim);
>     NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
>     NEXT_PASS (pass_dce);
>     NEXT_PASS (pass_parallelize_loops_oacc_kernels);
>     NEXT_PASS (pass_expand_omp_ssa);
> POP_INSERT_PASSES ()
> NEXT_PASS (pass_merge_phi);
> ...
> 
> 
> II.
> 
> The motivating test-case kernels-double-reduction-n.c:
> ...
> #include <stdlib.h>
> 
> #define N 500
> 
> unsigned int a[N][N];
> 
> void  __attribute__((noinline,noclone))
> foo (unsigned int n)
> {
>   int i, j;
>   unsigned int sum = 1;
> 
> #pragma acc kernels copyin (a[0:n]) copy (sum)
>   {
>     for (i = 0; i < n; ++i)
>       for (j = 0; j < n; ++j)
>         sum += a[i][j];
>   }
> 
>   if (sum != 5001)
>     abort ();
> }
> ...
> 
> 
> III.
> 
> Before first pass_lim. Note no phis on inner or outer loop header for
> iteration variables or reduction variable:
> ...
>   <bb 2>:
>   _5 = *.omp_data_i_4(D).i;
>   *_5 = 0;
>   _44 = *.omp_data_i_4(D).n;
>   _45 = *_44;
>   if (_45 != 0)
>     goto <bb 4>;
>   else
>     goto <bb 3>;
> 
>   <bb 4>: outer loop header
>   _12 = *.omp_data_i_4(D).j;
>   *_12 = 0;
>   if (_45 != 0)
>     goto <bb 6>;
>   else
>     goto <bb 5>;
> 
>   <bb 6>: inner loop header, latch
>   _19 = *.omp_data_i_4(D).a;
>   _21 = *_5;
>   _23 = *_12;
>   _24 = *_19[_21][_23];
>   _25 = *.omp_data_i_4(D).sum;
>   sum.0_26 = *_25;
>   sum.1_27 = _24 + sum.0_26;
>   *_25 = sum.1_27;
>   _33 = _23 + 1;
>   *_12 = _33;
>   j.2_16 = (unsigned int) _33;
>   if (j.2_16 < _45)
>     goto <bb 6>;
>   else
>     goto <bb 5>;
> 
>   <bb 5>: outer loop latch
>   _36 = *_5;
>   _38 = _36 + 1;
>   *_5 = _38;
>   i.3_9 = (unsigned int) _38;
>   if (i.3_9 < _45)
>     goto <bb 4>;
>   else
>     goto <bb 3>;
> 
>   <bb 3>:
>   return;
> ...
> 
> 
> IV.
> 
> After first pass_lim/pass_dom pair. Note there are phis on the inner loop
> header for the reduction and the iteration variable, but not on the outer loop
> header:
> ...
>   <bb 2>:
>   _5 = *.omp_data_i_4(D).i;
>   *_5 = 0;
>   _44 = *.omp_data_i_4(D).n;
>   _45 = *_44;
>   if (_45 != 0)
>     goto <bb 4>;
>   else
>     goto <bb 3>;
> 
>   <bb 4>:
>   _12 = *.omp_data_i_4(D).j;
>   _19 = *.omp_data_i_4(D).a;
>   D__lsm.10_50 = *_12;
>   D__lsm.11_51 = 0;
>   _25 = *.omp_data_i_4(D).sum;
> 
>   <bb 5>: outer loop header
>   D__lsm.10_20 = 0;
>   D__lsm.11_22 = 1;
>   _21 = *_5;
>   D__lsm.12_28 = *_25;
>   D__lsm.13_30 = 0;
>   goto <bb 7>;
> 
>   <bb 7>: inner loop header, latch
>   # D__lsm.10_47 = PHI <0(5), _33(7)>
>   # D__lsm.12_49 = PHI <D__lsm.12_28(5), sum.1_27(7)>
>   _23 = D__lsm.10_47;
>   _24 = *_19[_21][D__lsm.10_47];
>   sum.0_26 = D__lsm.12_49;
>   sum.1_27 = _24 + D__lsm.12_49;
>   D__lsm.12_31 = sum.1_27;
>   D__lsm.13_32 = 1;
>   _33 = D__lsm.10_47 + 1;
>   D__lsm.10_14 = _33;
>   D__lsm.11_15 = 1;
>   j.2_16 = (unsigned int) _33;
>   if (j.2_16 < _45)
>     goto <bb 7>;
>   else
>     goto <bb 8>;
> 
>   <bb 8>: outer loop latch
>   # D__lsm.10_35 = PHI <_33(7)>
>   # D__lsm.11_37 = PHI <1(7)>
>   # D__lsm.12_7 = PHI <sum.1_27(7)>
>   # D__lsm.13_8 = PHI <1(7)>
>   *_25 = sum.1_27;
>   _36 = *_5;
>   _38 = _36 + 1;
>   *_5 = _38;
>   i.3_9 = (unsigned int) _38;
>   if (i.3_9 < _45)
>     goto <bb 5>;
>   else
>     goto <bb 6>;
> 
>   <bb 6>:
>   # D__lsm.10_10 = PHI <_33(8)>
>   # D__lsm.11_11 = PHI <1(8)>
>   *_12 = _33;
>   goto <bb 3>;
> 
>   <bb 3>:
>   return;
> ...
> 
> 
> V.
> 
> After second pass_lim/pass_dom pair. Note there are phis on the inner and
> outer loop header for the reduction and the iteration variables:
> ...
>   <bb 2>:
>   _5 = *.omp_data_i_4(D).i;
>   *_5 = 0;
>   _44 = *.omp_data_i_4(D).n;
>   _45 = *_44;
>   if (_45 != 0)
>     goto <bb 4>;
>   else
>     goto <bb 3>;
> 
>   <bb 4>:
>   _12 = *.omp_data_i_4(D).j;
>   _19 = *.omp_data_i_4(D).a;
>   D__lsm.10_50 = *_12;
>   D__lsm.11_51 = 0;
>   _25 = *.omp_data_i_4(D).sum;
>   D__lsm.14_40 = 0;
>   D__lsm.15_2 = 0;
>   D__lsm.16_1 = *_25;
>   D__lsm.17_46 = 0;
> 
>   <bb 5>: outer loop header
>   # D__lsm.14_13 = PHI <0(4), _38(8)>
>   # D__lsm.16_34 = PHI <D__lsm.16_1(4), sum.1_27(8)>
>   D__lsm.10_20 = 0;
>   D__lsm.11_22 = 1;
>   _21 = D__lsm.14_13;
>   D__lsm.12_28 = D__lsm.16_34;
>   D__lsm.13_30 = 0;
>   goto <bb 7>;
> 
>   <bb 7>: inner loop header, latch
>   # D__lsm.10_47 = PHI <0(5), _33(7)>
>   # D__lsm.12_49 = PHI <D__lsm.16_34(5), sum.1_27(7)>
>   _23 = D__lsm.10_47;
>   _24 = *_19[D__lsm.14_13][D__lsm.10_47];
>   sum.0_26 = D__lsm.12_49;
>   sum.1_27 = _24 + D__lsm.12_49;
>   D__lsm.12_31 = sum.1_27;
>   D__lsm.13_32 = 1;
>   _33 = D__lsm.10_47 + 1;
>   D__lsm.10_14 = _33;
>   D__lsm.11_15 = 1;
>   j.2_16 = (unsigned int) _33;
>   if (j.2_16 < _45)
>     goto <bb 7>;
>   else
>     goto <bb 8>;
> 
>   <bb 8>: outer loop latch
>   # D__lsm.10_35 = PHI <_33(7)>
>   # D__lsm.11_37 = PHI <1(7)>
>   # D__lsm.12_7 = PHI <sum.1_27(7)>
>   # D__lsm.13_8 = PHI <1(7)>
>   # sum.1_48 = PHI <sum.1_27(7)>
>   # _53 = PHI <_33(7)>
>   D__lsm.16_56 = sum.1_27;
>   D__lsm.17_57 = 1;
>   _36 = D__lsm.14_13;
>   _38 = D__lsm.14_13 + 1;
>   D__lsm.14_58 = _38;
>   D__lsm.15_59 = 1;
>   i.3_9 = (unsigned int) _38;
>   if (i.3_9 < _45)
>     goto <bb 5>;
>   else
>     goto <bb 6>;
> 
>   <bb 6>:
>   # D__lsm.10_10 = PHI <_33(8)>
>   # D__lsm.11_11 = PHI <1(8)>
>   # _43 = PHI <_33(8)>
>   # D__lsm.16_62 = PHI <sum.1_27(8)>
>   # D__lsm.17_63 = PHI <1(8)>
>   # D__lsm.14_64 = PHI <_38(8)>
>   # D__lsm.15_65 = PHI <1(8)>
>   *_5 = _38;
>   *_25 = sum.1_27;
>   *_12 = _33;
>   goto <bb 3>;
> 
>   <bb 3>:
>   return;
> ...

Sorry but staring at dumps doesn't make me understand the issue you
run into.  Where can I reproduce this if I have time to look at this?

From the dump below I understand you want no memory references in
the outer loop?  So the issue seems to be that store motion fails
to insert the preheader load / exit store to the outermost loop
possible and thus another LIM pass is needed to "store motion" those
again?  But a simple testcase

int a;
int *p = &a;
int foo (int n)
{
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < 100; ++j)
      *p += j + i;
  return a;
}

shows that LIM can do this in one step.  Which means it should
be investigated why it doesn't do this properly for your testcase
(store motion of *_25).

Simply adding two LIM passes either papers over a wrong-code
bug (in LIM or in DOM) or over a missed-optimization in LIM.
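
To make the store motion on the simple testcase above concrete, here is a
hand-written sketch of it (not LIM output).  The function names and the
pointer parameter are assumptions made for the sketch; the testcase itself
uses a global pointer.  The sketch stores unconditionally after the loop
nest, whereas the dumps in this thread show extra D__lsm temporaries set to
0 and 1 next to the moved values, which appear to be store flags.

```c
#include <assert.h>

/* Hypothetical variant of the testcase: *p is loaded and stored on
   every iteration of the inner loop.  */
static int
sum_in_memory (int *p, int n)
{
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < 100; ++j)
      *p += j + i;
  return *p;
}

/* Hand-applied store motion to the outermost loop: one load in the
   preheader, one store at the exit, no memory references to *p in
   either loop.  */
static int
sum_store_motion (int *p, int n)
{
  int tmp = *p;			/* preheader load */
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < 100; ++j)
      tmp += j + i;
  *p = tmp;			/* exit store */
  return *p;
}
```

Both functions compute the same result; the second is the form the
parallelizer wants to see, with the reduction carried in a local rather
than in memory.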

Richard.
 
> 
> VI.
> 
> After pass_dce, so before parloops-oacc-kernels:
> ...
>   <bb 2>:
>   _5 = *.omp_data_i_4(D).i;
>   *_5 = 0;
>   _44 = *.omp_data_i_4(D).n;
>   _45 = *_44;
>   if (_45 != 0)
>     goto <bb 4>;
>   else
>     goto <bb 3>;
> 
>   <bb 4>:
>   _12 = *.omp_data_i_4(D).j;
>   _19 = *.omp_data_i_4(D).a;
>   _25 = *.omp_data_i_4(D).sum;
>   D__lsm.16_1 = *_25;
> 
>   <bb 5>: outer loop header
>   # D__lsm.14_13 = PHI <0(4), _38(8)>
>   # D__lsm.16_34 = PHI <D__lsm.16_1(4), sum.1_27(8)>
>   goto <bb 7>;
> 
>   <bb 7>: inner loop header, latch
>   # D__lsm.10_47 = PHI <0(5), _33(7)>
>   # D__lsm.12_49 = PHI <D__lsm.16_34(5), sum.1_27(7)>
>   _24 = *_19[D__lsm.14_13][D__lsm.10_47];
>   sum.1_27 = _24 + D__lsm.12_49;
>   _33 = D__lsm.10_47 + 1;
>   j.2_16 = (unsigned int) _33;
>   if (j.2_16 < _45)
>     goto <bb 7>;
>   else
>     goto <bb 8>;
> 
>   <bb 8>: outer loop latch
>   _38 = D__lsm.14_13 + 1;
>   i.3_9 = (unsigned int) _38;
>   if (i.3_9 < _45)
>     goto <bb 5>;
>   else
>     goto <bb 6>;
> 
>   <bb 6>:
>   *_5 = _38;
>   *_25 = sum.1_27;
>   *_12 = _33;
>   goto <bb 3>;
> 
>   <bb 3>:
>   return;
> ...
> 
> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-19 10:31         ` Tom de Vries
@ 2015-11-20 10:37           ` Richard Biener
  2015-11-20 13:27             ` Tom de Vries
  2015-11-22 23:37             ` [PATCH] Don't reapply loops flags if unnecessary in loop_optimizer_init Tom de Vries
  0 siblings, 2 replies; 133+ messages in thread
From: Richard Biener @ 2015-11-20 10:37 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Thu, 19 Nov 2015, Tom de Vries wrote:

> On 16/11/15 13:45, Richard Biener wrote:
> > > I've eliminated all the uses for pass_tree_loop_init/pass_tree_loop_done
> > > in
> > > >the pass group. Instead, I've added conditional loop optimizer setup in:
> > > >-  pass_lim and pass_scev_cprop (added in this patch), and
> 
> Reposting the "Add pass_oacc_kernels pass group in passes.def" patch.
> 
> pass_scev_cprop is no longer part of the pass group.
> 
> And I've dropped the scev_initialize in pass_lim.
> 
> Pass_lim is part of the pass_tree_loop pass group, where AFAIU scev info is
> initialized at the start of the pass group and updated or reset by passes in
> the pass group if necessary, such that it's always available, or can be
> recalculated on the spot.
> 
> First, pass_lim doesn't invalidate scev info. And second, AFAIU pass_lim
> doesn't use scev info. So there doesn't seem to be a need to do anything about
> scev info for using pass_lim outside pass_tree_loop.
> 
> > > >- pass_parallelize_loops_oacc_kernels (added in patch "Add
> > > >   pass_parallelize_loops_oacc_kernels").
> > You miss calling scev_finalize ().
> 
> I've added the scev_finalize () in patch "Add
> pass_parallelize_loops_oacc_kernels".

 pass_lim::execute (function *fun)
 {
+  if (!loops_state_satisfies_p (LOOPS_NORMAL
+                               | LOOPS_HAVE_RECORDED_EXITS))
+    loop_optimizer_init (LOOPS_NORMAL
+                        | LOOPS_HAVE_RECORDED_EXITS);
+

Note that this will, when not in the loop pipeline, not properly
fix up loops if LOOPS_NEED_FIXUP is set (that doesn't clear other
loop flags).  I'd rather make loop_optimizer_init do nothing
if the requested flags are already set and no fixup is needed, and
call it unconditionally here.  Thus something like

Index: gcc/loop-init.c
===================================================================
--- gcc/loop-init.c     (revision 230649)
+++ gcc/loop-init.c     (working copy)
@@ -103,7 +103,11 @@ loop_optimizer_init (unsigned flags)
       calculate_dominance_info (CDI_DOMINATORS);
 
       if (!needs_fixup)
-       checking_verify_loop_structure ();
+       {
+         checking_verify_loop_structure ();
+         if (loops_state_satisfies_p (flags))
+           goto out;
+       }
 
       /* Clear all flags.  */
       if (recorded_exits)
@@ -122,11 +126,12 @@ loop_optimizer_init (unsigned flags)
   /* Apply flags to loops.  */
   apply_loop_flags (flags);
 
+  checking_verify_loop_structure ();
+
+out:
   /* Dump loops.  */
   flow_loops_dump (dump_file, NULL, 1);
 
-  checking_verify_loop_structure ();
-
   timevar_pop (TV_LOOP_INIT);
 }
 



   if (number_of_loops (fun) <= 1)
     return 0;
 
+  if (!loops_state_satisfies_p (LOOP_CLOSED_SSA))
+    rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
+
   return tree_ssa_lim ();
 }

that looks bogus.  The into-loop-closed SSA rewrite should be
only done if the state _satisfies_ it.  I understand LIM doesn't
require loop-closed SSA.  But it also doesn't destroy it obviously.
So just remove that.



> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 133+ messages in thread

* [committed, trivial] Fix typo and trailing whitespace in dump-file strings in parloops
  2015-11-18 16:22                     ` Bernhard Reutner-Fischer
@ 2015-11-20 12:53                       ` Tom de Vries
  0 siblings, 0 replies; 133+ messages in thread
From: Tom de Vries @ 2015-11-20 12:53 UTC (permalink / raw)
  To: Bernhard Reutner-Fischer, Richard Biener
  Cc: Richard Biener, gcc-patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 572 bytes --]

[ was: Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def ]

On 18/11/15 17:22, Bernhard Reutner-Fischer wrote:
> Bonus points for fixing the dump_file to parse in:
>
>> >Parloops will fail because:
>> >...
>> >phi is n_2 = PHI <n_4(D)(4)>
>> >arg of phi to exit: value n_4(D) used outside loop
>> >checking if it a part of reduction pattern:
> s/it a/it is/
>

This patch fixes a typo and trailing whitespace in dump-file strings in 
parloops.

Built for C and Fortran, tested -fdump-tree-parloops testcases.

Committed to trunk as trivial.

Thanks,
- Tom

[-- Attachment #2: 0001-Fix-typo-and-trailing-whitespace-in-dump-file-strings-in-parloops.patch --]
[-- Type: text/x-patch, Size: 1244 bytes --]

Fix typo and trailing whitespace in dump-file strings in parloops

2015-11-19  Tom de Vries  <tom@codesourcery.com>

	* tree-parloops.c (build_new_reduction): Fix trailing whitespace in
	dump-file string.
	(try_create_reduction_list): Same.  Fix typo in dump-file string.

---
 gcc/tree-parloops.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
index 8d7912d..aca2370 100644
--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c
@@ -2383,7 +2383,7 @@ build_new_reduction (reduction_info_table_type *reduction_list,
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
       fprintf (dump_file,
-	       "Detected reduction. reduction stmt is: \n");
+	       "Detected reduction. reduction stmt is:\n");
       print_gimple_stmt (dump_file, reduc_stmt, 0, 0);
       fprintf (dump_file, "\n");
     }
@@ -2564,7 +2564,7 @@ try_create_reduction_list (loop_p loop,
 	      print_generic_expr (dump_file, val, 0);
 	      fprintf (dump_file, " used outside loop\n");
 	      fprintf (dump_file,
-		       "  checking if it a part of reduction pattern:  \n");
+		       "  checking if it is part of reduction pattern:\n");
 	    }
 	  if (reduction_list->elements () == 0)
 	    {

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-20 10:37           ` Richard Biener
@ 2015-11-20 13:27             ` Tom de Vries
  2015-11-20 13:29               ` Richard Biener
  2015-11-22 23:37             ` [PATCH] Don't reapply loops flags if unnecessary in loop_optimizer_init Tom de Vries
  1 sibling, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-20 13:27 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

On 20/11/15 11:37, Richard Biener wrote:
>    I'd rather make loop_optimizer_init do nothing
> if requested flags are already set and no fixup is needed

> Thus sth like
>
> Index: gcc/loop-init.c
> ===================================================================
> --- gcc/loop-init.c     (revision 230649)
> +++ gcc/loop-init.c     (working copy)
> @@ -103,7 +103,11 @@ loop_optimizer_init (unsigned flags)
>         calculate_dominance_info (CDI_DOMINATORS);
>
>         if (!needs_fixup)
> -       checking_verify_loop_structure ();
> +       {
> +         checking_verify_loop_structure ();
> +         if (loops_state_satisfies_p (flags))
> +           goto out;

What about flags that are present in the loops state, but not requested 
in flags? Should we try to clear those flags?

Thanks,
- Tom

> +       }
>
>         /* Clear all flags.  */
>         if (recorded_exits)
> @@ -122,11 +126,12 @@ loop_optimizer_init (unsigned flags)
>     /* Apply flags to loops.  */
>     apply_loop_flags (flags);
>
> +  checking_verify_loop_structure ();
> +
> +out:
>     /* Dump loops.  */
>     flow_loops_dump (dump_file, NULL, 1);
>
> -  checking_verify_loop_structure ();
> -
>     timevar_pop (TV_LOOP_INIT);
>   }

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-20 13:27             ` Tom de Vries
@ 2015-11-20 13:29               ` Richard Biener
  2015-11-20 16:34                 ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-20 13:29 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Fri, 20 Nov 2015, Tom de Vries wrote:

> On 20/11/15 11:37, Richard Biener wrote:
> >    I'd rather make loop_optimizer_init do nothing
> > if requested flags are already set and no fixup is needed
> 
> > Thus sth like
> > 
> > Index: gcc/loop-init.c
> > ===================================================================
> > --- gcc/loop-init.c     (revision 230649)
> > +++ gcc/loop-init.c     (working copy)
> > @@ -103,7 +103,11 @@ loop_optimizer_init (unsigned flags)
> >         calculate_dominance_info (CDI_DOMINATORS);
> > 
> >         if (!needs_fixup)
> > -       checking_verify_loop_structure ();
> > +       {
> > +         checking_verify_loop_structure ();
> > +         if (loops_state_satisfies_p (flags))
> > +           goto out;
> 
> What about flags that are present in the loops state, but not requested in
> flags? Should we try to clear those flags?

No, I don't think so, that would break in-loop-pipeline LIM, dropping
loop-closed SSA for example.

I agree it's somewhat of an odd behavior but all passes should
either be placed in a sub-pipeline with an outer 
loop_optimizer_init()/finalize () call or call both themselves.
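
The behavior discussed here can be modelled with plain bit flags.  This is
a minimal sketch, not GCC code: every name below is hypothetical, the flag
values are illustrative, and the slow path is reduced to "drop everything
and set what was requested".

```c
#include <assert.h>

/* Hypothetical stand-ins for loop-state flags.  */
enum
{
  MODEL_HAVE_PREHEADERS = 1 << 0,
  MODEL_HAVE_RECORDED_EXITS = 1 << 1,
  MODEL_CLOSED_SSA = 1 << 2,
  MODEL_NEED_FIXUP = 1 << 3
};

/* Model of the current loops state.  */
static unsigned model_state;

/* True if every requested flag is present in the state.  */
static int
model_satisfies_p (unsigned flags)
{
  return (model_state & flags) == flags;
}

/* Model of the proposed early exit.  Returns 0 if nothing had to be
   done, 1 if the flags were reapplied.  */
static int
model_loop_optimizer_init (unsigned flags)
{
  if (!(model_state & MODEL_NEED_FIXUP) && model_satisfies_p (flags))
    return 0;			/* early exit: extra flags are kept */

  model_state = flags;		/* simplified slow path */
  return 1;
}
```

In this model a caller inside the loop pipeline that requests fewer flags
than are set keeps the extra ones (for example the closed-SSA bit), which
is the property the reply above wants to preserve.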

Richard.

> Thanks,
> - Tom
> 
> > +       }
> > 
> >         /* Clear all flags.  */
> >         if (recorded_exits)
> > @@ -122,11 +126,12 @@ loop_optimizer_init (unsigned flags)
> >     /* Apply flags to loops.  */
> >     apply_loop_flags (flags);
> > 
> > +  checking_verify_loop_structure ();
> > +
> > +out:
> >     /* Dump loops.  */
> >     flow_loops_dump (dump_file, NULL, 1);
> > 
> > -  checking_verify_loop_structure ();
> > -
> >     timevar_pop (TV_LOOP_INIT);
> >   }
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-20 13:29               ` Richard Biener
@ 2015-11-20 16:34                 ` Tom de Vries
  2015-11-23 10:11                   ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-20 16:34 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

On 20/11/15 14:29, Richard Biener wrote:
> I agree it's somewhat of an odd behavior but all passes should
> either be placed in a sub-pipeline with an outer
> loop_optimizer_init()/finalize () call or call both themselves.

Hmm, but adding loop_optimizer_finalize at the end of pass_lim breaks 
the loop pipeline.

We could use the style used in pass_slp_vectorize::execute:
...
pass_slp_vectorize::execute (function *fun)
{
   basic_block bb;

   bool in_loop_pipeline = scev_initialized_p ();
   if (!in_loop_pipeline)
     {
       loop_optimizer_init (LOOPS_NORMAL);
       scev_initialize ();
     }

   ...

   if (!in_loop_pipeline)
     {
       scev_finalize ();
       loop_optimizer_finalize ();
     }
...

Although that doesn't strike me as particularly clean.

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-20 10:28                 ` Richard Biener
@ 2015-11-21  8:42                   ` Tom de Vries
  2015-11-23 11:31                     ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-21  8:42 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 10151 bytes --]

On 20/11/15 11:28, Richard Biener wrote:
> On Thu, 19 Nov 2015, Tom de Vries wrote:
>
>> >On 17/11/15 15:53, Tom de Vries wrote:
>>>> > > >And the above LIM example
>>>> > > >is none for why you need two LIM passes...
>>> > >
>>> > >Indeed. I'm planning a separate reply to explain in more detail the need
>>> > >for the two pass_lims.
>> >
>> >I.
>> >
>> >I managed to get rid of the two pass_lims for the motivating example that I
>> >used until now (goacc/kernels-double-reduction.c). I found that by adding a
>> >pass_dominator instance after pass_ch, I could get rid of the second pass_lim
>> >(and pass_copyprop as well).
>> >
>> >But... then I wrote a counter example (goacc/kernels-double-reduction-n.c),
>> >and I'm back at two pass_lims (and two pass_dominators).
>> >Also I've split the pass group into a bit before and after pass_fre.
>> >
>> >So, the current pass group looks like:
>> >...
>> >NEXT_PASS (pass_build_ealias);
>> >
>> >/* Pass group that runs when the function is an offloaded function
>> >    containing oacc kernels loops.  Part 1.  */
>> >NEXT_PASS (pass_oacc_kernels);
>> >PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
>> >     /* We need pass_ch here, because pass_lim has no effect on
>> >        exit-first loops (PR65442).  Ideally we want to remove both
>> >        this pass instantiation, and the reverse transformation
>> >        transform_to_exit_first_loop_alt, which is done in
>> >        pass_parallelize_loops_oacc_kernels. */
>> >     NEXT_PASS (pass_ch);
>> >POP_INSERT_PASSES ()
>> >
>> >NEXT_PASS (pass_fre);
>> >
>> >/* Pass group that runs when the function is an offloaded function
>> >    containing oacc kernels loops.  Part 2.  */
>> >NEXT_PASS (pass_oacc_kernels2);
>> >PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
>> >     /* We use pass_lim to rewrite in-memory iteration and reduction
>> >        variable accesses in loops into local variables accesses.  */
>> >     NEXT_PASS (pass_lim);
>> >     NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
>> >     NEXT_PASS (pass_lim);
>> >     NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
>> >     NEXT_PASS (pass_dce);
>> >     NEXT_PASS (pass_parallelize_loops_oacc_kernels);
>> >     NEXT_PASS (pass_expand_omp_ssa);
>> >POP_INSERT_PASSES ()
>> >NEXT_PASS (pass_merge_phi);
>> >...
>> >
>> >
>> >II.
>> >
>> >The motivating test-case kernels-double-reduction-n.c:
>> >...
>> >#include <stdlib.h>
>> >
>> >#define N 500
>> >
>> >unsigned int a[N][N];
>> >
>> >void  __attribute__((noinline,noclone))
>> >foo (unsigned int n)
>> >{
>> >   int i, j;
>> >   unsigned int sum = 1;
>> >
>> >#pragma acc kernels copyin (a[0:n]) copy (sum)
>> >   {
>> >     for (i = 0; i < n; ++i)
>> >       for (j = 0; j < n; ++j)
>> >         sum += a[i][j];
>> >   }
>> >
>> >   if (sum != 5001)
>> >     abort ();
>> >}
>> >...
>> >
>> >
>> >III.
>> >
>> >Before first pass_lim. Note no phis on inner or outer loop header for
>> >iteration variables or reduction variable:
>> >...
>> >   <bb 2>:
>> >   _5 = *.omp_data_i_4(D).i;
>> >   *_5 = 0;
>> >   _44 = *.omp_data_i_4(D).n;
>> >   _45 = *_44;
>> >   if (_45 != 0)
>> >     goto <bb 4>;
>> >   else
>> >     goto <bb 3>;
>> >
>> >   <bb 4>: outer loop header
>> >   _12 = *.omp_data_i_4(D).j;
>> >   *_12 = 0;
>> >   if (_45 != 0)
>> >     goto <bb 6>;
>> >   else
>> >     goto <bb 5>;
>> >
>> >   <bb 6>: inner loop header, latch
>> >   _19 = *.omp_data_i_4(D).a;
>> >   _21 = *_5;
>> >   _23 = *_12;
>> >   _24 = *_19[_21][_23];
>> >   _25 = *.omp_data_i_4(D).sum;
>> >   sum.0_26 = *_25;
>> >   sum.1_27 = _24 + sum.0_26;
>> >   *_25 = sum.1_27;
>> >   _33 = _23 + 1;
>> >   *_12 = _33;
>> >   j.2_16 = (unsigned int) _33;
>> >   if (j.2_16 < _45)
>> >     goto <bb 6>;
>> >   else
>> >     goto <bb 5>;
>> >
>> >   <bb 5>: outer loop latch
>> >   _36 = *_5;
>> >   _38 = _36 + 1;
>> >   *_5 = _38;
>> >   i.3_9 = (unsigned int) _38;
>> >   if (i.3_9 < _45)
>> >     goto <bb 4>;
>> >   else
>> >     goto <bb 3>;
>> >
>> >   <bb 3>:
>> >   return;
>> >...
>> >
>> >
>> >IV.
>> >
>> >After first pass_lim/pass_dom pair. Note there are phis on the inner loop
>> >header for the reduction and the iteration variable, but not on the outer loop
>> >header:
>> >...
>> >   <bb 2>:
>> >   _5 = *.omp_data_i_4(D).i;
>> >   *_5 = 0;
>> >   _44 = *.omp_data_i_4(D).n;
>> >   _45 = *_44;
>> >   if (_45 != 0)
>> >     goto <bb 4>;
>> >   else
>> >     goto <bb 3>;
>> >
>> >   <bb 4>:
>> >   _12 = *.omp_data_i_4(D).j;
>> >   _19 = *.omp_data_i_4(D).a;
>> >   D__lsm.10_50 = *_12;
>> >   D__lsm.11_51 = 0;
>> >   _25 = *.omp_data_i_4(D).sum;
>> >
>> >   <bb 5>: outer loop header
>> >   D__lsm.10_20 = 0;
>> >   D__lsm.11_22 = 1;
>> >   _21 = *_5;
>> >   D__lsm.12_28 = *_25;
>> >   D__lsm.13_30 = 0;
>> >   goto <bb 7>;
>> >
>> >   <bb 7>: inner loop header, latch
>> >   # D__lsm.10_47 = PHI <0(5), _33(7)>
>> >   # D__lsm.12_49 = PHI <D__lsm.12_28(5), sum.1_27(7)>
>> >   _23 = D__lsm.10_47;
>> >   _24 = *_19[_21][D__lsm.10_47];
>> >   sum.0_26 = D__lsm.12_49;
>> >   sum.1_27 = _24 + D__lsm.12_49;
>> >   D__lsm.12_31 = sum.1_27;
>> >   D__lsm.13_32 = 1;
>> >   _33 = D__lsm.10_47 + 1;
>> >   D__lsm.10_14 = _33;
>> >   D__lsm.11_15 = 1;
>> >   j.2_16 = (unsigned int) _33;
>> >   if (j.2_16 < _45)
>> >     goto <bb 7>;
>> >   else
>> >     goto <bb 8>;
>> >
>> >   <bb 8>: outer loop latch
>> >   # D__lsm.10_35 = PHI <_33(7)>
>> >   # D__lsm.11_37 = PHI <1(7)>
>> >   # D__lsm.12_7 = PHI <sum.1_27(7)>
>> >   # D__lsm.13_8 = PHI <1(7)>
>> >   *_25 = sum.1_27;
>> >   _36 = *_5;
>> >   _38 = _36 + 1;
>> >   *_5 = _38;
>> >   i.3_9 = (unsigned int) _38;
>> >   if (i.3_9 < _45)
>> >     goto <bb 5>;
>> >   else
>> >     goto <bb 6>;
>> >
>> >   <bb 6>:
>> >   # D__lsm.10_10 = PHI <_33(8)>
>> >   # D__lsm.11_11 = PHI <1(8)>
>> >   *_12 = _33;
>> >   goto <bb 3>;
>> >
>> >   <bb 3>:
>> >   return;
>> >...
>> >
>> >
>> >V.
>> >
>> >After second pass_lim/pass_dom pair. Note there are phis on the inner and
>> >outer loop header for the reduction and the iteration variables:
>> >...
>> >   <bb 2>:
>> >   _5 = *.omp_data_i_4(D).i;
>> >   *_5 = 0;
>> >   _44 = *.omp_data_i_4(D).n;
>> >   _45 = *_44;
>> >   if (_45 != 0)
>> >     goto <bb 4>;
>> >   else
>> >     goto <bb 3>;
>> >
>> >   <bb 4>:
>> >   _12 = *.omp_data_i_4(D).j;
>> >   _19 = *.omp_data_i_4(D).a;
>> >   D__lsm.10_50 = *_12;
>> >   D__lsm.11_51 = 0;
>> >   _25 = *.omp_data_i_4(D).sum;
>> >   D__lsm.14_40 = 0;
>> >   D__lsm.15_2 = 0;
>> >   D__lsm.16_1 = *_25;
>> >   D__lsm.17_46 = 0;
>> >
>> >   <bb 5>: outer loop header
>> >   # D__lsm.14_13 = PHI <0(4), _38(8)>
>> >   # D__lsm.16_34 = PHI <D__lsm.16_1(4), sum.1_27(8)>
>> >   D__lsm.10_20 = 0;
>> >   D__lsm.11_22 = 1;
>> >   _21 = D__lsm.14_13;
>> >   D__lsm.12_28 = D__lsm.16_34;
>> >   D__lsm.13_30 = 0;
>> >   goto <bb 7>;
>> >
>> >   <bb 7>: inner loop header, latch
>> >   # D__lsm.10_47 = PHI <0(5), _33(7)>
>> >   # D__lsm.12_49 = PHI <D__lsm.16_34(5), sum.1_27(7)>
>> >   _23 = D__lsm.10_47;
>> >   _24 = *_19[D__lsm.14_13][D__lsm.10_47];
>> >   sum.0_26 = D__lsm.12_49;
>> >   sum.1_27 = _24 + D__lsm.12_49;
>> >   D__lsm.12_31 = sum.1_27;
>> >   D__lsm.13_32 = 1;
>> >   _33 = D__lsm.10_47 + 1;
>> >   D__lsm.10_14 = _33;
>> >   D__lsm.11_15 = 1;
>> >   j.2_16 = (unsigned int) _33;
>> >   if (j.2_16 < _45)
>> >     goto <bb 7>;
>> >   else
>> >     goto <bb 8>;
>> >
>> >   <bb 8>: outer loop latch
>> >   # D__lsm.10_35 = PHI <_33(7)>
>> >   # D__lsm.11_37 = PHI <1(7)>
>> >   # D__lsm.12_7 = PHI <sum.1_27(7)>
>> >   # D__lsm.13_8 = PHI <1(7)>
>> >   # sum.1_48 = PHI <sum.1_27(7)>
>> >   # _53 = PHI <_33(7)>
>> >   D__lsm.16_56 = sum.1_27;
>> >   D__lsm.17_57 = 1;
>> >   _36 = D__lsm.14_13;
>> >   _38 = D__lsm.14_13 + 1;
>> >   D__lsm.14_58 = _38;
>> >   D__lsm.15_59 = 1;
>> >   i.3_9 = (unsigned int) _38;
>> >   if (i.3_9 < _45)
>> >     goto <bb 5>;
>> >   else
>> >     goto <bb 6>;
>> >
>> >   <bb 6>:
>> >   # D__lsm.10_10 = PHI <_33(8)>
>> >   # D__lsm.11_11 = PHI <1(8)>
>> >   # _43 = PHI <_33(8)>
>> >   # D__lsm.16_62 = PHI <sum.1_27(8)>
>> >   # D__lsm.17_63 = PHI <1(8)>
>> >   # D__lsm.14_64 = PHI <_38(8)>
>> >   # D__lsm.15_65 = PHI <1(8)>
>> >   *_5 = _38;
>> >   *_25 = sum.1_27;
>> >   *_12 = _33;
>> >   goto <bb 3>;
>> >
>> >   <bb 3>:
>> >   return;
>> >...
> Sorry but staring at dumps doesn't make me understand the issue you
> run into.  Where can I reproduce this if I have time to look at this?

I've posted the state of the patch series that reproduces this problem 
at 
https://github.com/vries/gcc/commits/vries/master-port-kernels-test-rb , 
run goacc.exp, testcase kernels-double-reduction-n.c.

> From the dump below I understand you want no memory references in
> the outer loop?
> So the issue seems to be that store motion fails
> to insert the preheader load / exit store to the outermost loop
> possible and thus another LIM pass is needed to "store motion" those
> again?

Yep.

>  But a simple testcase
>
> int a;
> int *p = &a;
> int foo (int n)
> {
>    for (int i = 0; i < n; ++i)
>      for (int j = 0; j < 100; ++j)
>        *p += j + i;
>    return a;
> }
>
> shows that LIM can do this in one step.

For the record, I've filed PR68465 - "pass_lim doesn't detect identical loop 
entry conditions" for a test-case where that doesn't happen (when using 
-fno-tree-dominator-opts).

> Which means it should
> be investigated why it doesn't do this properly for your testcase
> (store motion of *_25).

There seem to be two related problems:
1. the store has tree_could_trap_p (ref->mem.ref) true, which should be
    false. I'll work on a fix for this.
2. Given that the store can trap, I was running into PR68465. I managed
    to eliminate the 2nd pass_lim by moving the pass_dominator instance
    before the pass_lim instance.
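
To illustrate the "identical loop entry conditions" in the PR title: in
dump III both bb 2 and bb 4 test _45 != 0.  Below is a hand-written C
rendering of that shape, as a sketch only.  The function names, the small
array size and the fill helper are assumptions, and unsigned counters are
used for simplicity (the testcase declares i and j as int).

```c
#include <assert.h>

#define MODEL_N 4

static unsigned int model_a[MODEL_N][MODEL_N];

/* Hypothetical helper: give every element a distinct value.  */
static void
model_fill (void)
{
  for (unsigned int i = 0; i < MODEL_N; ++i)
    for (unsigned int j = 0; j < MODEL_N; ++j)
      model_a[i][j] = i * MODEL_N + j;
}

/* Shape of dump III: the inner loop is guarded by the same n != 0
   test that already guards the outer loop.  Requires n <= MODEL_N.  */
static unsigned int
sum_guarded (unsigned int n, unsigned int sum)
{
  if (n != 0)
    {
      unsigned int i = 0;
      do
	{
	  if (n != 0)		/* repeats the outer entry condition */
	    {
	      unsigned int j = 0;
	      do
		sum += model_a[i][j];
	      while (++j < n);
	    }
	}
      while (++i < n);
    }
  return sum;
}

/* Shape of dump IV: the redundant inner guard is gone, so the inner
   loop body is known to execute on every outer iteration.  */
static unsigned int
sum_unguarded (unsigned int n, unsigned int sum)
{
  if (n != 0)
    {
      unsigned int i = 0;
      do
	{
	  unsigned int j = 0;
	  do
	    sum += model_a[i][j];
	  while (++j < n);
	}
      while (++i < n);
    }
  return sum;
}
```

The two functions return the same value for any n up to MODEL_N; the
second form corresponds to the dump in which bb 5 jumps to the inner loop
unconditionally.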

Attached patch shows the pass group with only one pass_lim. I hope to be 
able to eliminate the first pass_dominator instance before pass_lim once 
I fix 1.

> Simply adding two LIM passes either papers over a wrong-code
> bug (in LIM or in DOM) or over a missed-optimization in LIM.

AFAIU now, it's PR68465, a missed optimization in LIM.

Thanks,
- Tom



[-- Attachment #2: 0005-Add-pass_oacc_kernels-pass-group-in-passes.def.patch --]
[-- Type: text/x-patch, Size: 4721 bytes --]

Add pass_oacc_kernels pass group in passes.def

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* loop-init.c (loop_optimizer_init): If loops state doesn't need fixup,
	and requested flags are present in the loops state, don't reapply flags.
	* omp-low.c (pass_expand_omp_ssa::clone): New function.
	* passes.def: Add pass_oacc_kernels pass group.
	* tree-ssa-loop-ch.c (pass_ch::clone): New function.
	* tree-ssa-loop-im.c (tree_ssa_lim): Make static.
	(pass_lim::execute): Allow to run outside pass_tree_loop.

---
 gcc/loop-init.c        | 11 ++++++++---
 gcc/omp-low.c          |  1 +
 gcc/passes.def         | 24 ++++++++++++++++++++++++
 gcc/tree-ssa-loop-ch.c |  2 ++
 gcc/tree-ssa-loop-im.c |  4 +++-
 5 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/gcc/loop-init.c b/gcc/loop-init.c
index e32c94a..5bc0c54 100644
--- a/gcc/loop-init.c
+++ b/gcc/loop-init.c
@@ -103,7 +103,11 @@ loop_optimizer_init (unsigned flags)
       calculate_dominance_info (CDI_DOMINATORS);
 
       if (!needs_fixup)
-	checking_verify_loop_structure ();
+	{
+	  checking_verify_loop_structure ();
+	  if (loops_state_satisfies_p (flags))
+	    goto out;
+	}
 
       /* Clear all flags.  */
       if (recorded_exits)
@@ -122,11 +126,12 @@ loop_optimizer_init (unsigned flags)
   /* Apply flags to loops.  */
   apply_loop_flags (flags);
 
+  checking_verify_loop_structure ();
+
+ out:
   /* Dump loops.  */
   flow_loops_dump (dump_file, NULL, 1);
 
-  checking_verify_loop_structure ();
-
   timevar_pop (TV_LOOP_INIT);
 }
 
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 9c27396..d2f88b3 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -13385,6 +13385,7 @@ public:
       return !(fun->curr_properties & PROP_gimple_eomp);
     }
   virtual unsigned int execute (function *) { return execute_expand_omp (); }
+  opt_pass * clone () { return new pass_expand_omp_ssa (m_ctxt); }
 
 }; // class pass_expand_omp_ssa
 
diff --git a/gcc/passes.def b/gcc/passes.def
index 17027786..67f6829 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -88,7 +88,31 @@ along with GCC; see the file COPYING3.  If not see
 	  /* pass_build_ealias is a dummy pass that ensures that we
 	     execute TODO_rebuild_alias at this point.  */
 	  NEXT_PASS (pass_build_ealias);
+	  /* Pass group that runs when the function is an offloaded function
+	     containing oacc kernels loops.  Part 1.  */
+	  NEXT_PASS (pass_oacc_kernels);
+	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
+	      /* We need pass_ch here, because pass_lim has no effect on
+	         exit-first loops (PR65442).  Ideally we want to remove both
+		 this pass instantiation, and the reverse transformation
+		 transform_to_exit_first_loop_alt, which is done in
+		 pass_parallelize_loops_oacc_kernels. */
+	      NEXT_PASS (pass_ch);
+	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_fre);
+	  /* Pass group that runs when the function is an offloaded function
+	     containing oacc kernels loops.  Part 2.  */
+	  NEXT_PASS (pass_oacc_kernels2);
+	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
+	      /* We use pass_lim to rewrite in-memory iteration and reduction
+	         variable accesses in loops into local variables accesses.  */
+	      NEXT_PASS (pass_lim);
+	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+	      NEXT_PASS (pass_dce);
+	      NEXT_PASS (pass_parallelize_loops_oacc_kernels);
+	      NEXT_PASS (pass_expand_omp_ssa);
+	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_merge_phi);
           NEXT_PASS (pass_dse);
 	  NEXT_PASS (pass_cd_dce);
diff --git a/gcc/tree-ssa-loop-ch.c b/gcc/tree-ssa-loop-ch.c
index 7e618bf..6493fcc 100644
--- a/gcc/tree-ssa-loop-ch.c
+++ b/gcc/tree-ssa-loop-ch.c
@@ -165,6 +165,8 @@ public:
   /* Initialize and finalize loop structures, copying headers inbetween.  */
   virtual unsigned int execute (function *);
 
+  opt_pass * clone () { return new pass_ch (m_ctxt); }
+
 protected:
   /* ch_base method: */
   virtual bool process_loop_p (struct loop *loop);
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 30b53ce..2435da6 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -2496,7 +2496,7 @@ tree_ssa_lim_finalize (void)
 /* Moves invariants from loops.  Only "expensive" invariants are moved out --
    i.e. those that are likely to be win regardless of the register pressure.  */
 
-unsigned int
+static unsigned int
 tree_ssa_lim (void)
 {
   unsigned int todo;
@@ -2560,6 +2560,8 @@ public:
 unsigned int
 pass_lim::execute (function *fun)
 {
+  loop_optimizer_init (LOOPS_NORMAL | LOOPS_HAVE_RECORDED_EXITS);
+
   if (number_of_loops (fun) <= 1)
     return 0;
 

* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-13 11:39               ` Jakub Jelinek
@ 2015-11-21 12:24                 ` Tom de Vries
  2015-11-23 11:46                   ` Richard Biener
  2015-12-11 12:45                 ` Tom de Vries
  1 sibling, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-21 12:24 UTC (permalink / raw)
  To: Jakub Jelinek, Richard Biener; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 3387 bytes --]

On 13/11/15 12:39, Jakub Jelinek wrote:
> On Fri, Nov 13, 2015 at 12:29:51PM +0100, Richard Biener wrote:
>>> thanks for the explanation. Filed as PR68331 - '[meta-bug] fipa-pta issues'.
>>>
>>> Any feedback on the '#pragma GCC offload-alias=<none|pointer|all>' bit above?
>>> Is that sort of what you had in mind?
>>
>> Yes.  Whether that makes sense is another question of course.  You can
>> annotate memory references with MR_DEPENDENCE_BASE/CLIQUE yourself
>> as well if you know dependences without the user's intervention.
>
> I really don't like even the GCC offload-alias, I just don't see anything
> special on the offload code.  Not to mention that the same issue is already
> with other outlined functions, like OpenMP tasks or parallel regions, those
> aren't offloaded, yet they can suffer from worse alias/points-to analysis
> too.

AFAIU there is one aspect that is different for offloaded code: the 
setup of the data on the device.

Consider this example:
...
unsigned int a[N];
unsigned int b[N];
unsigned int c[N];

int
main (void)
{
   ...

#pragma acc kernels copyin (a) copyin (b) copyout (c)
   {
     for (COUNTERTYPE ii = 0; ii < N; ii++)
       c[ii] = a[ii] + b[ii];
   }

   ...
...

At gimple level, we have:
...
#pragma omp target oacc_kernels \
   map(force_from:c [len: 2097152]) \
   map(force_to:b [len: 2097152]) \
   map(force_to:a [len: 2097152])
...

[ The meaning of the force_from/force_to mappings is given in 
include/gomp-constants.h:
...
     /* Allocate.  */
     GOMP_MAP_FORCE_ALLOC = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_ALLOC),
     /* ..., and copy to device.  */
     GOMP_MAP_FORCE_TO = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_TO),
     /* ..., and copy from device.  */
     GOMP_MAP_FORCE_FROM = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_FROM),
     /* ..., and copy to and from device.  */
     GOMP_MAP_FORCE_TOFROM = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_TOFROM),
...  ]

So before calling the offloaded function, a separate alloc is done for 
a, b and c, and the base pointers of the newly allocated objects are 
passed to the offloaded function.

This means we can mark those base pointers as restrict in the offloaded 
function.

The attached proof-of-concept patch implements that.
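
To illustrate what the restrict marking buys, here is a standalone sketch.
It is not code from the patch: vec_add and its parameter names are made up
for this example. Once the three base pointers are restrict-qualified, the
compiler may assume that stores through c never change what a or b point
to, so alias analysis no longer forces the iterations to be ordered:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the body of the offloaded function.  The
   restrict qualifiers model the promise that a, b and c point to
   separately allocated, non-overlapping objects on the device.  */
void
vec_add (const unsigned int *restrict a, const unsigned int *restrict b,
	 unsigned int *restrict c, size_t n)
{
  /* Sanity check only: a non-empty range needs valid base pointers.  */
  assert (n == 0 || (a != NULL && b != NULL && c != NULL));

  for (size_t i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}
```

Calling vec_add with overlapping arrays would break the promise and is
undefined behaviour, which is why the patch only sets the qualifier when
every map clause forces a fresh allocation.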

> We simply have some compiler internal interface between the caller and
> callee of the outlined regions, each interface in between those has
> its own structure type used to communicate the info;
> we can attach attributes on the fields, or some flags to indicate some
> properties interesting from aliasing POV.
> We don't really need to perform
> full IPA-PTA, perhaps it would be enough to a) record somewhere in cgraph
> the relationship in between such callers and callees (for offloading regions
> we already have "omp target entrypoint" attribute on the callee and a
> single caller), tell LTO if possible not to split those into different
> partitions if easily possible, and then just for these pairs perform
> aliasing/points-to analysis in the caller and the result record using
> cliques/special attributes/whatever to the callee side, so that the callee
> (outlined OpenMP/OpenACC/Cilk+ region) can then improve its alias analysis.

As a start, is the approach of this patch OK?

It will allow us to commit the oacc kernels patch series with the 
ability to parallelize non-trivial testcases, and work on improving the 
alias bit after that.

Thanks,
- Tom




[-- Attachment #2: 0018-Mark-pointers-to-allocated-target-vars-as-restricted-if-possible.patch --]
[-- Type: text/x-patch, Size: 4201 bytes --]

Mark pointers to allocated target vars as restricted, if possible

---
 gcc/omp-low.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 62 insertions(+), 5 deletions(-)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 268b67b..0ce822d 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -1372,7 +1372,8 @@ build_sender_ref (tree var, omp_context *ctx)
 /* Add a new field for VAR inside the structure CTX->SENDER_DECL.  */
 
 static void
-install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
+install_var_field_1 (tree var, bool by_ref, int mask, omp_context *ctx,
+		     bool base_pointers_restrict)
 {
   tree field, type, sfield = NULL_TREE;
   splay_tree_key key = (splay_tree_key) var;
@@ -1396,7 +1397,11 @@ install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
       type = build_pointer_type (build_pointer_type (type));
     }
   else if (by_ref)
-    type = build_pointer_type (type);
+    {
+      type = build_pointer_type (type);
+      if (base_pointers_restrict)
+	type = build_qualified_type (type, TYPE_QUAL_RESTRICT);
+    }
   else if ((mask & 3) == 1 && is_reference (var))
     type = TREE_TYPE (type);
 
@@ -1460,6 +1465,12 @@ install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
     splay_tree_insert (ctx->sfield_map, key, (splay_tree_value) sfield);
 }
 
+static void
+install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
+{
+  install_var_field_1 (var, by_ref, mask, ctx, false);
+}
+
 static tree
 install_var_local (tree var, omp_context *ctx)
 {
@@ -1816,7 +1827,8 @@ fixup_child_record_type (omp_context *ctx)
    specified by CLAUSES.  */
 
 static void
-scan_sharing_clauses (tree clauses, omp_context *ctx)
+scan_sharing_clauses_1 (tree clauses, omp_context *ctx,
+			bool base_pointers_restrict)
 {
   tree c, decl;
   bool scan_array_reductions = false;
@@ -2073,7 +2085,7 @@ scan_sharing_clauses (tree clauses, omp_context *ctx)
 		      && TREE_CODE (TREE_TYPE (decl)) == ARRAY_TYPE)
 		    install_var_field (decl, true, 7, ctx);
 		  else
-		    install_var_field (decl, true, 3, ctx);
+		    install_var_field_1 (decl, true, 3, ctx, base_pointers_restrict);
 		  if (is_gimple_omp_offloaded (ctx->stmt))
 		    install_var_local (decl, ctx);
 		}
@@ -2339,6 +2351,12 @@ scan_sharing_clauses (tree clauses, omp_context *ctx)
 	scan_omp (&OMP_CLAUSE_LINEAR_GIMPLE_SEQ (c), ctx);
 }
 
+static void
+scan_sharing_clauses (tree clauses, omp_context *ctx)
+{
+  scan_sharing_clauses_1 (clauses, ctx, false);
+}
+
 /* Create a new name for omp child function.  Returns an identifier.  If
    IS_CILK_FOR is true then the suffix for the child function is
    "_cilk_for_fn."  */
@@ -3056,13 +3074,52 @@ scan_omp_target (gomp_target *stmt, omp_context *outer_ctx)
   DECL_NAMELESS (name) = 1;
   TYPE_NAME (ctx->record_type) = name;
   TYPE_ARTIFICIAL (ctx->record_type) = 1;
+
+  bool base_pointers_restrict = false;
   if (offloaded)
     {
       create_omp_child_function (ctx, false);
       gimple_omp_target_set_child_fn (stmt, ctx->cb.dst_fn);
+
+      /* If all the clauses force allocation, we can be certain that the objects
+	 on the target are disjoint, and therefore mark the base pointers as
+	 restrict.  */
+      base_pointers_restrict = true;
+      tree c;
+      for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+	{
+	  switch (OMP_CLAUSE_CODE (c))
+	    {
+	    case OMP_CLAUSE_MAP:
+	      switch (OMP_CLAUSE_MAP_KIND (c))
+		{
+		case GOMP_MAP_FORCE_ALLOC:
+		case GOMP_MAP_FORCE_TO:
+		case GOMP_MAP_FORCE_FROM:
+		case GOMP_MAP_FORCE_TOFROM:
+		  break;
+		default:
+		  base_pointers_restrict = false;
+		  break;
+		}
+	      break;
+
+	    default:
+	      base_pointers_restrict = false;
+	      break;
+	    }
+
+	  if (!base_pointers_restrict)
+	    break;
+	}
+      if (base_pointers_restrict)
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file, "Base pointers in offloaded function are restrict\n");
+	}
     }
 
-  scan_sharing_clauses (clauses, ctx);
+  scan_sharing_clauses_1 (clauses, ctx, base_pointers_restrict);
   scan_omp (gimple_omp_body_ptr (stmt), ctx);
 
   if (TYPE_FIELDS (ctx->record_type) == NULL)

* [PATCH] Don't reapply loops flags if unnecessary in loop_optimizer_init
  2015-11-20 10:37           ` Richard Biener
  2015-11-20 13:27             ` Tom de Vries
@ 2015-11-22 23:37             ` Tom de Vries
  2015-11-23 10:33               ` Richard Biener
  1 sibling, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-22 23:37 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 1459 bytes --]

[ was: Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def ]

On 20/11/15 11:37, Richard Biener wrote:
> I'd rather make loop_optimizer_init do nothing
> if requested flags are already set and no fixup is needed and
> call the above unconditionally.  Thus sth like
>
> Index: gcc/loop-init.c
> ===================================================================
> --- gcc/loop-init.c     (revision 230649)
> +++ gcc/loop-init.c     (working copy)
> @@ -103,7 +103,11 @@ loop_optimizer_init (unsigned flags)
>         calculate_dominance_info (CDI_DOMINATORS);
>
>         if (!needs_fixup)
> -       checking_verify_loop_structure ();
> +       {
> +         checking_verify_loop_structure ();
> +         if (loops_state_satisfies_p (flags))
> +           goto out;
> +       }
>
>         /* Clear all flags.  */
>         if (recorded_exits)
> @@ -122,11 +126,12 @@ loop_optimizer_init (unsigned flags)
>     /* Apply flags to loops.  */
>     apply_loop_flags (flags);
>
> +  checking_verify_loop_structure ();
> +
> +out:
>     /* Dump loops.  */
>     flow_loops_dump (dump_file, NULL, 1);
>
> -  checking_verify_loop_structure ();
> -
>     timevar_pop (TV_LOOP_INIT);
>   }

This patch implements that approach, but the patch is slightly more 
complicated because of the need to handle 
LOOPS_MAY_HAVE_MULTIPLE_LATCHES differently than the rest of the flags.
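
The special case can be sketched in isolation as follows. This is a
simplified model, not the patch itself: the SK_* constants and both helper
names are hypothetical, the flag values are chosen for the sketch only, and
the loop state is passed in explicitly instead of being read from
current_loops. One way to read the condition is that
LOOPS_MAY_HAVE_MULTIPLE_LATCHES is a permission rather than a guarantee: if
the state carries it and the caller did not ask for it, the flags must be
reapplied, while a caller asking for it when the state lacks it needs
nothing done:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values, chosen for this sketch only.  */
enum
{
  SK_HAVE_PREHEADERS = 1 << 0,
  SK_HAVE_SIMPLE_LATCHES = 1 << 1,
  SK_HAVE_RECORDED_EXITS = 1 << 2,
  SK_MAY_HAVE_MULTIPLE_LATCHES = 1 << 3,
  SK_NEED_FIXUP = 1 << 4
};

/* Model of loops_state_satisfies_p: true if all bits in FLAGS are set
   in STATE.  */
static bool
sk_state_satisfies_p (unsigned state, unsigned flags)
{
  return (state & flags) == flags;
}

/* Return true if FLAGS have to be reapplied to a loop state STATE.  */
static bool
sk_need_reapply (unsigned state, unsigned flags)
{
  /* Like the checking assert in the patch: fixup is never requested.  */
  assert ((flags & SK_NEED_FIXUP) == 0);

  /* Some requested guarantee is not present yet.  */
  bool missing
    = !sk_state_satisfies_p (state,
			     flags & ~(unsigned) SK_MAY_HAVE_MULTIPLE_LATCHES);

  /* The state still permits multiple latches, but the caller does not.  */
  bool stale
    = (sk_state_satisfies_p (state, SK_MAY_HAVE_MULTIPLE_LATCHES)
       && (flags & SK_MAY_HAVE_MULTIPLE_LATCHES) == 0);

  return missing || stale;
}
```

With a plain "state satisfies flags" test, as in the suggestion quoted
above, the second condition would be missed.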

Bootstrapped and reg-tested on x86_64.

OK for stage3 trunk?

Thanks,
- Tom


[-- Attachment #2: 0002-Don-t-reapply-loops-flags-if-unnecessary-in-loop_optimizer_init.patch --]
[-- Type: text/x-patch, Size: 1546 bytes --]

Don't reapply loops flags if unnecessary in loop_optimizer_init

2015-11-22  Tom de Vries  <tom@codesourcery.com>

	* loop-init.c (loop_optimizer_init): Don't reapply loops flags if
	unnecessary.

---
 gcc/loop-init.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/gcc/loop-init.c b/gcc/loop-init.c
index e32c94a..4b72cab 100644
--- a/gcc/loop-init.c
+++ b/gcc/loop-init.c
@@ -85,6 +85,8 @@ loop_optimizer_init (unsigned flags)
 {
   timevar_push (TV_LOOP_INIT);
 
+  gcc_checking_assert ((flags & (LOOP_CLOSED_SSA | LOOPS_NEED_FIXUP)) == 0);
+
   if (!current_loops)
     {
       gcc_assert (!(cfun->curr_properties & PROP_loops));
@@ -103,7 +105,17 @@ loop_optimizer_init (unsigned flags)
       calculate_dominance_info (CDI_DOMINATORS);
 
       if (!needs_fixup)
-	checking_verify_loop_structure ();
+	{
+	  checking_verify_loop_structure ();
+
+	  bool need_reapply
+	    = (!loops_state_satisfies_p (flags
+					 & (~LOOPS_MAY_HAVE_MULTIPLE_LATCHES))
+	       || (loops_state_satisfies_p (LOOPS_MAY_HAVE_MULTIPLE_LATCHES)
+		   && ((flags & LOOPS_MAY_HAVE_MULTIPLE_LATCHES) == 0)));
+	  if (!need_reapply)
+	    goto out;
+	}
 
       /* Clear all flags.  */
       if (recorded_exits)
@@ -122,11 +134,12 @@ loop_optimizer_init (unsigned flags)
   /* Apply flags to loops.  */
   apply_loop_flags (flags);
 
+  checking_verify_loop_structure ();
+
+ out:
   /* Dump loops.  */
   flow_loops_dump (dump_file, NULL, 1);
 
-  checking_verify_loop_structure ();
-
   timevar_pop (TV_LOOP_INIT);
 }
 

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-20 16:34                 ` Tom de Vries
@ 2015-11-23 10:11                   ` Richard Biener
  2015-11-24 12:22                     ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-23 10:11 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Fri, 20 Nov 2015, Tom de Vries wrote:

> On 20/11/15 14:29, Richard Biener wrote:
> > I agree it's somewhat of an odd behavior but all passes should
> > either be placed in a sub-pipeline with an outer
> > loop_optimizer_init()/finalize () call or call both themselves.
> 
> Hmm, but adding loop_optimizer_finalize at the end of pass_lim breaks the loop
> pipeline.
> 
> We could use the style used in pass_slp_vectorize::execute:
> ...
> pass_slp_vectorize::execute (function *fun)
> {
>   basic_block bb;
> 
>   bool in_loop_pipeline = scev_initialized_p ();
>   if (!in_loop_pipeline)
>     {
>       loop_optimizer_init (LOOPS_NORMAL);
>       scev_initialize ();
>     }
> 
>   ...
> 
>   if (!in_loop_pipeline)
>     {
>       scev_finalize ();
>       loop_optimizer_finalize ();
>     }
> ...
> 
> Although that doesn't strike me as particularly clean.

At least it would be a consistent "unclean" style.  So yes, the
above would work for me.

Thanks,
Richard.

* Re: [PATCH] Don't reapply loops flags if unnecessary in loop_optimizer_init
  2015-11-22 23:37             ` [PATCH] Don't reapply loops flags if unnecessary in loop_optimizer_init Tom de Vries
@ 2015-11-23 10:33               ` Richard Biener
  2015-11-23 11:27                 ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-23 10:33 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Mon, 23 Nov 2015, Tom de Vries wrote:

> [ was: Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def ]
> 
> On 20/11/15 11:37, Richard Biener wrote:
> > I'd rather make loop_optimizer_init do nothing
> > if requested flags are already set and no fixup is needed and
> > call the above unconditionally.  Thus sth like
> > 
> > Index: gcc/loop-init.c
> > ===================================================================
> > --- gcc/loop-init.c     (revision 230649)
> > +++ gcc/loop-init.c     (working copy)
> > @@ -103,7 +103,11 @@ loop_optimizer_init (unsigned flags)
> >         calculate_dominance_info (CDI_DOMINATORS);
> > 
> >         if (!needs_fixup)
> > -       checking_verify_loop_structure ();
> > +       {
> > +         checking_verify_loop_structure ();
> > +         if (loops_state_satisfies_p (flags))
> > +           goto out;
> > +       }
> > 
> >         /* Clear all flags.  */
> >         if (recorded_exits)
> > @@ -122,11 +126,12 @@ loop_optimizer_init (unsigned flags)
> >     /* Apply flags to loops.  */
> >     apply_loop_flags (flags);
> > 
> > +  checking_verify_loop_structure ();
> > +
> > +out:
> >     /* Dump loops.  */
> >     flow_loops_dump (dump_file, NULL, 1);
> > 
> > -  checking_verify_loop_structure ();
> > -
> >     timevar_pop (TV_LOOP_INIT);
> >   }
> 
> This patch implements that approach, but the patch is slightly more
> complicated because of the need to handle LOOPS_MAY_HAVE_MULTIPLE_LATCHES
> differently than the rest of the flags.
> 
> Bootstrapped and reg-tested on x86_64.
> 
> OK for stage3 trunk?

Let's revisit this during stage1 if the scev_initialized_p () thing
SLP vectorization uses works, ok?

Thanks,
Richard.

> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

* Re: [PATCH] Don't reapply loops flags if unnecessary in loop_optimizer_init
  2015-11-23 10:33               ` Richard Biener
@ 2015-11-23 11:27                 ` Tom de Vries
  0 siblings, 0 replies; 133+ messages in thread
From: Tom de Vries @ 2015-11-23 11:27 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 2818 bytes --]

On 23/11/15 11:29, Richard Biener wrote:
> On Mon, 23 Nov 2015, Tom de Vries wrote:
>
>> [ was: Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def ]
>>
>> On 20/11/15 11:37, Richard Biener wrote:
>>> I'd rather make loop_optimizer_init do nothing
>>> if requested flags are already set and no fixup is needed and
>>> call the above unconditionally.  Thus sth like
>>>
>>> Index: gcc/loop-init.c
>>> ===================================================================
>>> --- gcc/loop-init.c     (revision 230649)
>>> +++ gcc/loop-init.c     (working copy)
>>> @@ -103,7 +103,11 @@ loop_optimizer_init (unsigned flags)
>>>          calculate_dominance_info (CDI_DOMINATORS);
>>>
>>>          if (!needs_fixup)
>>> -       checking_verify_loop_structure ();
>>> +       {
>>> +         checking_verify_loop_structure ();
>>> +         if (loops_state_satisfies_p (flags))
>>> +           goto out;
>>> +       }
>>>
>>>          /* Clear all flags.  */
>>>          if (recorded_exits)
>>> @@ -122,11 +126,12 @@ loop_optimizer_init (unsigned flags)
>>>      /* Apply flags to loops.  */
>>>      apply_loop_flags (flags);
>>>
>>> +  checking_verify_loop_structure ();
>>> +
>>> +out:
>>>      /* Dump loops.  */
>>>      flow_loops_dump (dump_file, NULL, 1);
>>>
>>> -  checking_verify_loop_structure ();
>>> -
>>>      timevar_pop (TV_LOOP_INIT);
>>>    }
>>
>> This patch implements that approach, but the patch is slightly more
>> complicated because of the need to handle LOOPS_MAY_HAVE_MULTIPLE_LATCHES
>> differently than the rest of the flags.
>>
>> Bootstrapped and reg-tested on x86_64.
>>
>> OK for stage3 trunk?
>
> Let's revisit this during stage1 if the scev_initialized_p () thing
> SLP vectorization uses works, ok?
>

OK, I'll give that a try.

FTR, the two attached patches are an attempt at a cleaner solution for 
pass_slp_vectorize::execute (in combination with patch "Don't reapply 
loops flags if unnecessary in loop_optimizer_init").

The first patch introduces a property PROP_scev, set for the duration of 
the loop pipeline. It allows us to call scev_initialize and 
scev_finalize unconditionally. Outside the loop pipeline, calling the 
functions has the usual effect. Inside the loop pipeline, calling the 
functions has no effect.
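
The intended behaviour can be modelled with a small standalone sketch. All
sk_* names are hypothetical and only stand in for the GCC entities named in
the comments; the real functions of course do much more than flip a flag:

```c
#include <assert.h>
#include <stdbool.h>

#define SK_PROP_SCEV (1u << 0)	/* Stand-in for PROP_scev.  */

static unsigned sk_curr_properties;	/* Models cfun->curr_properties.  */
static bool sk_scev_ready;		/* Models scev_initialized_p ().  */

/* Models scev_initialize with the property check added.  */
void
sk_scev_initialize (void)
{
  if (sk_curr_properties & SK_PROP_SCEV)
    {
      /* Inside the loop pipeline: already set up, nothing to do.  */
      assert (sk_scev_ready);
      return;
    }
  sk_scev_ready = true;
}

/* Models scev_finalize with the property check added.  */
void
sk_scev_finalize (void)
{
  if (sk_curr_properties & SK_PROP_SCEV)
    {
      /* Inside the loop pipeline: keep the info alive.  */
      assert (sk_scev_ready);
      return;
    }
  sk_scev_ready = false;
}

/* Models the start of the loop pipeline: initialize, then provide the
   property.  */
void
sk_pipeline_enter (void)
{
  sk_scev_initialize ();
  sk_curr_properties |= SK_PROP_SCEV;
}

/* Models tree_ssa_loop_done: drop the property first, so that the
   finalize call really tears things down.  */
void
sk_pipeline_leave (void)
{
  sk_curr_properties &= ~SK_PROP_SCEV;
  sk_scev_finalize ();
}
```

A pass can then call the initialize/finalize pair unconditionally: between
sk_pipeline_enter and sk_pipeline_leave the pair is a no-op, elsewhere it
sets up and tears down as before.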

The second patch introduces a property PROP_loops_normal_re_lcssa, set 
for the duration of the loop pipeline. It allows us (in combination with 
"Don't reapply loops flags if unnecessary in loop_optimizer_init") to 
call loop_optimizer_init and loop_optimizer_finalize unconditionally.
Outside the loop pipeline, calling the functions has the usual effect. 
Inside the loop pipeline, calling loop_optimizer_finalize has no effect, 
and calling loop_optimizer_init has no effect unless a fixup or a 
new loop property is needed.

Thanks,
- Tom


[-- Attachment #2: 0020-Add-PROP_scev.patch --]
[-- Type: text/x-patch, Size: 3142 bytes --]

Add PROP_scev

---
 gcc/tree-pass.h             |  1 +
 gcc/tree-scalar-evolution.c | 13 +++++++++++++
 gcc/tree-ssa-loop.c         |  3 ++-
 gcc/tree-vectorizer.c       |  4 ++--
 4 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 004db77..4e66b2c 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -227,6 +227,7 @@ protected:
 						   of math functions; the
 						   current choices have
 						   been optimized.  */
+#define PROP_scev		(1 << 16)	/* preserve scev info.  */
 
 #define PROP_trees \
   (PROP_gimple_any | PROP_gimple_lcf | PROP_gimple_leh | PROP_gimple_lomp)
diff --git a/gcc/tree-scalar-evolution.c b/gcc/tree-scalar-evolution.c
index 9b33693..5d5e354 100644
--- a/gcc/tree-scalar-evolution.c
+++ b/gcc/tree-scalar-evolution.c
@@ -280,6 +280,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "params.h"
 #include "tree-ssa-propagate.h"
 #include "gimple-fold.h"
+#include "tree-pass.h"
 
 static tree analyze_scalar_evolution_1 (struct loop *, tree, tree);
 static tree analyze_scalar_evolution_for_address_of (struct loop *loop,
@@ -3168,6 +3169,12 @@ scev_initialize (void)
 {
   struct loop *loop;
 
+  if (cfun->curr_properties & PROP_scev)
+    {
+      gcc_assert (scev_initialized_p ());
+      return;
+    }
+
   scalar_evolution_info = hash_table<scev_info_hasher>::create_ggc (100);
 
   initialize_scalar_evolutions_analyzer ();
@@ -3367,6 +3374,12 @@ simple_iv (struct loop *wrto_loop, struct loop *use_loop, tree op,
 void
 scev_finalize (void)
 {
+  if (cfun->curr_properties & PROP_scev)
+    {
+      gcc_assert (scev_initialized_p ());
+      return;
+    }
+
   if (!scalar_evolution_info)
     return;
   scalar_evolution_info->empty ();
diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index d30e3c8..739fda7 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -290,7 +290,7 @@ const pass_data pass_data_tree_loop_init =
   OPTGROUP_LOOP, /* optinfo_flags */
   TV_NONE, /* tv_id */
   PROP_cfg, /* properties_required */
-  0, /* properties_provided */
+  PROP_scev, /* properties_provided */
   0, /* properties_destroyed */
   0, /* todo_flags_start */
   0, /* todo_flags_finish */
@@ -524,6 +524,7 @@ make_pass_iv_optimize (gcc::context *ctxt)
 static unsigned int
 tree_ssa_loop_done (void)
 {
+  cfun->curr_properties &= ~PROP_scev;
   free_numbers_of_iterations_estimates (cfun);
   scev_finalize ();
   loop_optimizer_finalize ();
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index b721c56..b06433d 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -731,8 +731,8 @@ pass_slp_vectorize::execute (function *fun)
   if (!in_loop_pipeline)
     {
       loop_optimizer_init (LOOPS_NORMAL);
-      scev_initialize ();
     }
+  scev_initialize ();
 
   /* Mark all stmts as not belonging to the current region and unvisited.  */
   FOR_EACH_BB_FN (bb, fun)
@@ -757,9 +757,9 @@ pass_slp_vectorize::execute (function *fun)
 
   free_stmt_vec_info_vec ();
 
+  scev_finalize ();
   if (!in_loop_pipeline)
     {
-      scev_finalize ();
       loop_optimizer_finalize ();
     }
 

[-- Attachment #3: 0021-Add-PROP_loops_normal_re_lcssa.patch --]
[-- Type: text/x-patch, Size: 3449 bytes --]

Add PROP_loops_normal_re_lcssa

---
 gcc/loop-init.c       | 13 +++++++++++++
 gcc/tree-pass.h       |  3 +++
 gcc/tree-ssa-loop.c   |  4 ++--
 gcc/tree-vectorizer.c | 11 ++---------
 4 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/gcc/loop-init.c b/gcc/loop-init.c
index 4b72cab..9ce3e9e 100644
--- a/gcc/loop-init.c
+++ b/gcc/loop-init.c
@@ -100,6 +100,10 @@ loop_optimizer_init (unsigned flags)
       bool needs_fixup = loops_state_satisfies_p (LOOPS_NEED_FIXUP);
 
       gcc_assert (cfun->curr_properties & PROP_loops);
+      if (cfun->curr_properties & PROP_loops_normal_re_lcssa)
+	gcc_assert (loops_state_satisfies_p (LOOPS_NORMAL
+					     | LOOPS_HAVE_RECORDED_EXITS
+					     | LOOP_CLOSED_SSA));
 
       /* Ensure that the dominators are computed, like flow_loops_find does.  */
       calculate_dominance_info (CDI_DOMINATORS);
@@ -151,6 +155,15 @@ loop_optimizer_finalize (struct function *fn)
   struct loop *loop;
   basic_block bb;
 
+  if (fn->curr_properties & PROP_loops_normal_re_lcssa)
+    {
+      gcc_assert (loops_state_satisfies_p (fn, LOOPS_NORMAL
+					   | LOOPS_HAVE_RECORDED_EXITS
+					   | LOOP_CLOSED_SSA));
+      gcc_assert (!loops_state_satisfies_p (fn, LOOPS_NEED_FIXUP));
+      return;
+    }
+
   timevar_push (TV_LOOP_FINI);
 
   if (loops_state_satisfies_p (fn, LOOPS_HAVE_RECORDED_EXITS))
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 4e66b2c..c43a5f3 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -228,6 +228,9 @@ protected:
 						   current choices have
 						   been optimized.  */
 #define PROP_scev		(1 << 16)	/* preserve scev info.  */
+/* preserve loop structures in LOOPS_NORMAL with recorded exits, and in loop
+   closed ssa.  */
+#define PROP_loops_normal_re_lcssa	(1 << 17)
 
 #define PROP_trees \
   (PROP_gimple_any | PROP_gimple_lcf | PROP_gimple_leh | PROP_gimple_lomp)
diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index 739fda7..73fbb43 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -290,7 +290,7 @@ const pass_data pass_data_tree_loop_init =
   OPTGROUP_LOOP, /* optinfo_flags */
   TV_NONE, /* tv_id */
   PROP_cfg, /* properties_required */
-  PROP_scev, /* properties_provided */
+  PROP_loops_normal_re_lcssa | PROP_scev, /* properties_provided */
   0, /* properties_destroyed */
   0, /* todo_flags_start */
   0, /* todo_flags_finish */
@@ -524,7 +524,7 @@ make_pass_iv_optimize (gcc::context *ctxt)
 static unsigned int
 tree_ssa_loop_done (void)
 {
-  cfun->curr_properties &= ~PROP_scev;
+  cfun->curr_properties &= ~(PROP_loops_normal_re_lcssa | PROP_scev);
   free_numbers_of_iterations_estimates (cfun);
   scev_finalize ();
   loop_optimizer_finalize ();
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index b06433d..503f227 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -727,11 +727,7 @@ pass_slp_vectorize::execute (function *fun)
 {
   basic_block bb;
 
-  bool in_loop_pipeline = scev_initialized_p ();
-  if (!in_loop_pipeline)
-    {
-      loop_optimizer_init (LOOPS_NORMAL);
-    }
+  loop_optimizer_init (LOOPS_NORMAL);
   scev_initialize ();
 
   /* Mark all stmts as not belonging to the current region and unvisited.  */
@@ -758,10 +754,7 @@ pass_slp_vectorize::execute (function *fun)
   free_stmt_vec_info_vec ();
 
   scev_finalize ();
-  if (!in_loop_pipeline)
-    {
-      loop_optimizer_finalize ();
-    }
+  loop_optimizer_finalize ();
 
   return 0;
 }

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-21  8:42                   ` Tom de Vries
@ 2015-11-23 11:31                     ` Richard Biener
  2015-11-23 15:53                       ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-23 11:31 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

On Sat, 21 Nov 2015, Tom de Vries wrote:

> On 20/11/15 11:28, Richard Biener wrote:
> > On Thu, 19 Nov 2015, Tom de Vries wrote:
> > 
> > > >On 17/11/15 15:53, Tom de Vries wrote:
> > > > > > > >And the above LIM example
> > > > > > > >is none for why you need two LIM passes...
> > > > > >
> > > > > >Indeed. I'm planning a separate reply to explain in more detail
> > > > > >the need for the two pass_lims.
> > > >
> > > >I.
> > > >
> > > >I managed to get rid of the two pass_lims for the motivating example
> > > >that I used until now (goacc/kernels-double-reduction.c). I found that
> > > >by adding a pass_dominator instance after pass_ch, I could get rid of
> > > >the second pass_lim (and pass_copyprop as well).
> > > >
> > > >But... then I wrote a counter example
> > > >(goacc/kernels-double-reduction-n.c),
> > > >and I'm back at two pass_lims (and two pass_dominators).
> > > >Also I've split the pass group into a bit before and after pass_fre.
> > > >
> > > >So, the current pass group looks like:
> > > >...
> > > >NEXT_PASS (pass_build_ealias);
> > > >
> > > >/* Pass group that runs when the function is an offloaded function
> > > >    containing oacc kernels loops.  Part 1.  */
> > > >NEXT_PASS (pass_oacc_kernels);
> > > >PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
> > > >     /* We need pass_ch here, because pass_lim has no effect on
> > > >        exit-first loops (PR65442).  Ideally we want to remove both
> > > >        this pass instantiation, and the reverse transformation
> > > >        transform_to_exit_first_loop_alt, which is done in
> > > >        pass_parallelize_loops_oacc_kernels. */
> > > >     NEXT_PASS (pass_ch);
> > > >POP_INSERT_PASSES ()
> > > >
> > > >NEXT_PASS (pass_fre);
> > > >
> > > >/* Pass group that runs when the function is an offloaded function
> > > >    containing oacc kernels loops.  Part 2.  */
> > > >NEXT_PASS (pass_oacc_kernels2);
> > > >PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
> > > >     /* We use pass_lim to rewrite in-memory iteration and reduction
> > > >        variable accesses in loops into local variables accesses.  */
> > > >     NEXT_PASS (pass_lim);
> > > >     NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
> > > >     NEXT_PASS (pass_lim);
> > > >     NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
> > > >     NEXT_PASS (pass_dce);
> > > >     NEXT_PASS (pass_parallelize_loops_oacc_kernels);
> > > >     NEXT_PASS (pass_expand_omp_ssa);
> > > >POP_INSERT_PASSES ()
> > > >NEXT_PASS (pass_merge_phi);
> > > >...
> > > >
> > > >
> > > >II.
> > > >
> > > >The motivating test-case kernels-double-reduction-n.c:
> > > >...
> > > >#include <stdlib.h>
> > > >
> > > >#define N 500
> > > >
> > > >unsigned int a[N][N];
> > > >
> > > >void  __attribute__((noinline,noclone))
> > > >foo (unsigned int n)
> > > >{
> > > >   int i, j;
> > > >   unsigned int sum = 1;
> > > >
> > > >#pragma acc kernels copyin (a[0:n]) copy (sum)
> > > >   {
> > > >     for (i = 0; i < n; ++i)
> > > >       for (j = 0; j < n; ++j)
> > > >         sum += a[i][j];
> > > >   }
> > > >
> > > >   if (sum != 5001)
> > > >     abort ();
> > > >}
> > > >...
> > > >
> > > >
> > > >III.
> > > >
> > > >Before first pass_lim. Note no phis on inner or outer loop header for
> > > >iteration variables or reduction variable:
> > > >...
> > > >   <bb 2>:
> > > >   _5 = *.omp_data_i_4(D).i;
> > > >   *_5 = 0;
> > > >   _44 = *.omp_data_i_4(D).n;
> > > >   _45 = *_44;
> > > >   if (_45 != 0)
> > > >     goto <bb 4>;
> > > >   else
> > > >     goto <bb 3>;
> > > >
> > > >   <bb 4>: outer loop header
> > > >   _12 = *.omp_data_i_4(D).j;
> > > >   *_12 = 0;
> > > >   if (_45 != 0)
> > > >     goto <bb 6>;
> > > >   else
> > > >     goto <bb 5>;
> > > >
> > > >   <bb 6>: inner loop header, latch
> > > >   _19 = *.omp_data_i_4(D).a;
> > > >   _21 = *_5;
> > > >   _23 = *_12;
> > > >   _24 = *_19[_21][_23];
> > > >   _25 = *.omp_data_i_4(D).sum;
> > > >   sum.0_26 = *_25;
> > > >   sum.1_27 = _24 + sum.0_26;
> > > >   *_25 = sum.1_27;
> > > >   _33 = _23 + 1;
> > > >   *_12 = _33;
> > > >   j.2_16 = (unsigned int) _33;
> > > >   if (j.2_16 < _45)
> > > >     goto <bb 6>;
> > > >   else
> > > >     goto <bb 5>;
> > > >
> > > >   <bb 5>: outer loop latch
> > > >   _36 = *_5;
> > > >   _38 = _36 + 1;
> > > >   *_5 = _38;
> > > >   i.3_9 = (unsigned int) _38;
> > > >   if (i.3_9 < _45)
> > > >     goto <bb 4>;
> > > >   else
> > > >     goto <bb 3>;
> > > >
> > > >   <bb 3>:
> > > >   return;
> > > >...
> > > >
> > > >
> > > >IV.
> > > >
> > > >After first pass_lim/pass_dom pair. Note there are phis on the inner loop
> > > >header for the reduction and the iteration variable, but not on the
> > > >outer loop header:
> > > >...
> > > >   <bb 2>:
> > > >   _5 = *.omp_data_i_4(D).i;
> > > >   *_5 = 0;
> > > >   _44 = *.omp_data_i_4(D).n;
> > > >   _45 = *_44;
> > > >   if (_45 != 0)
> > > >     goto <bb 4>;
> > > >   else
> > > >     goto <bb 3>;
> > > >
> > > >   <bb 4>:
> > > >   _12 = *.omp_data_i_4(D).j;
> > > >   _19 = *.omp_data_i_4(D).a;
> > > >   D__lsm.10_50 = *_12;
> > > >   D__lsm.11_51 = 0;
> > > >   _25 = *.omp_data_i_4(D).sum;
> > > >
> > > >   <bb 5>: outer loop header
> > > >   D__lsm.10_20 = 0;
> > > >   D__lsm.11_22 = 1;
> > > >   _21 = *_5;
> > > >   D__lsm.12_28 = *_25;
> > > >   D__lsm.13_30 = 0;
> > > >   goto <bb 7>;
> > > >
> > > >   <bb 7>: inner loop header, latch
> > > >   # D__lsm.10_47 = PHI <0(5), _33(7)>
> > > >   # D__lsm.12_49 = PHI <D__lsm.12_28(5), sum.1_27(7)>
> > > >   _23 = D__lsm.10_47;
> > > >   _24 = *_19[_21][D__lsm.10_47];
> > > >   sum.0_26 = D__lsm.12_49;
> > > >   sum.1_27 = _24 + D__lsm.12_49;
> > > >   D__lsm.12_31 = sum.1_27;
> > > >   D__lsm.13_32 = 1;
> > > >   _33 = D__lsm.10_47 + 1;
> > > >   D__lsm.10_14 = _33;
> > > >   D__lsm.11_15 = 1;
> > > >   j.2_16 = (unsigned int) _33;
> > > >   if (j.2_16 < _45)
> > > >     goto <bb 7>;
> > > >   else
> > > >     goto <bb 8>;
> > > >
> > > >   <bb 8>: outer loop latch
> > > >   # D__lsm.10_35 = PHI <_33(7)>
> > > >   # D__lsm.11_37 = PHI <1(7)>
> > > >   # D__lsm.12_7 = PHI <sum.1_27(7)>
> > > >   # D__lsm.13_8 = PHI <1(7)>
> > > >   *_25 = sum.1_27;
> > > >   _36 = *_5;
> > > >   _38 = _36 + 1;
> > > >   *_5 = _38;
> > > >   i.3_9 = (unsigned int) _38;
> > > >   if (i.3_9 < _45)
> > > >     goto <bb 5>;
> > > >   else
> > > >     goto <bb 6>;
> > > >
> > > >   <bb 6>:
> > > >   # D__lsm.10_10 = PHI <_33(8)>
> > > >   # D__lsm.11_11 = PHI <1(8)>
> > > >   *_12 = _33;
> > > >   goto <bb 3>;
> > > >
> > > >   <bb 3>:
> > > >   return;
> > > >...
> > > >
> > > >
> > > >V.
> > > >
> > > >After second pass_lim/pass_dom pair. Note there are phis on the inner and
> > > >outer loop header for the reduction and the iteration variables:
> > > >...
> > > >   <bb 2>:
> > > >   _5 = *.omp_data_i_4(D).i;
> > > >   *_5 = 0;
> > > >   _44 = *.omp_data_i_4(D).n;
> > > >   _45 = *_44;
> > > >   if (_45 != 0)
> > > >     goto <bb 4>;
> > > >   else
> > > >     goto <bb 3>;
> > > >
> > > >   <bb 4>:
> > > >   _12 = *.omp_data_i_4(D).j;
> > > >   _19 = *.omp_data_i_4(D).a;
> > > >   D__lsm.10_50 = *_12;
> > > >   D__lsm.11_51 = 0;
> > > >   _25 = *.omp_data_i_4(D).sum;
> > > >   D__lsm.14_40 = 0;
> > > >   D__lsm.15_2 = 0;
> > > >   D__lsm.16_1 = *_25;
> > > >   D__lsm.17_46 = 0;
> > > >
> > > >   <bb 5>: outer loop header
> > > >   # D__lsm.14_13 = PHI <0(4), _38(8)>
> > > >   # D__lsm.16_34 = PHI <D__lsm.16_1(4), sum.1_27(8)>
> > > >   D__lsm.10_20 = 0;
> > > >   D__lsm.11_22 = 1;
> > > >   _21 = D__lsm.14_13;
> > > >   D__lsm.12_28 = D__lsm.16_34;
> > > >   D__lsm.13_30 = 0;
> > > >   goto <bb 7>;
> > > >
> > > >   <bb 7>: inner loop header, latch
> > > >   # D__lsm.10_47 = PHI <0(5), _33(7)>
> > > >   # D__lsm.12_49 = PHI <D__lsm.16_34(5), sum.1_27(7)>
> > > >   _23 = D__lsm.10_47;
> > > >   _24 = *_19[D__lsm.14_13][D__lsm.10_47];
> > > >   sum.0_26 = D__lsm.12_49;
> > > >   sum.1_27 = _24 + D__lsm.12_49;
> > > >   D__lsm.12_31 = sum.1_27;
> > > >   D__lsm.13_32 = 1;
> > > >   _33 = D__lsm.10_47 + 1;
> > > >   D__lsm.10_14 = _33;
> > > >   D__lsm.11_15 = 1;
> > > >   j.2_16 = (unsigned int) _33;
> > > >   if (j.2_16 < _45)
> > > >     goto <bb 7>;
> > > >   else
> > > >     goto <bb 8>;
> > > >
> > > >   <bb 8>: outer loop latch
> > > >   # D__lsm.10_35 = PHI <_33(7)>
> > > >   # D__lsm.11_37 = PHI <1(7)>
> > > >   # D__lsm.12_7 = PHI <sum.1_27(7)>
> > > >   # D__lsm.13_8 = PHI <1(7)>
> > > >   # sum.1_48 = PHI <sum.1_27(7)>
> > > >   # _53 = PHI <_33(7)>
> > > >   D__lsm.16_56 = sum.1_27;
> > > >   D__lsm.17_57 = 1;
> > > >   _36 = D__lsm.14_13;
> > > >   _38 = D__lsm.14_13 + 1;
> > > >   D__lsm.14_58 = _38;
> > > >   D__lsm.15_59 = 1;
> > > >   i.3_9 = (unsigned int) _38;
> > > >   if (i.3_9 < _45)
> > > >     goto <bb 5>;
> > > >   else
> > > >     goto <bb 6>;
> > > >
> > > >   <bb 6>:
> > > >   # D__lsm.10_10 = PHI <_33(8)>
> > > >   # D__lsm.11_11 = PHI <1(8)>
> > > >   # _43 = PHI <_33(8)>
> > > >   # D__lsm.16_62 = PHI <sum.1_27(8)>
> > > >   # D__lsm.17_63 = PHI <1(8)>
> > > >   # D__lsm.14_64 = PHI <_38(8)>
> > > >   # D__lsm.15_65 = PHI <1(8)>
> > > >   *_5 = _38;
> > > >   *_25 = sum.1_27;
> > > >   *_12 = _33;
> > > >   goto <bb 3>;
> > > >
> > > >   <bb 3>:
> > > >   return;
> > > >...
> > Sorry but staring at dumps doesn't make me understand the issue you
> > run into.  Where can I reproduce this if I have time to look at this?
> 
> I've posted the state of the patch series that reproduces this problem at
> https://github.com/vries/gcc/commits/vries/master-port-kernels-test-rb , run
> goacc.exp, testcase kernels-double-reduction-n.c.
> 
> > From the dump below I understand you want no memory references in
> > the outer loop?
> > So the issue seems to be that store motion fails
> > to insert the preheader load / exit store to the outermost loop
> > possible and thus another LIM pass is needed to "store motion" those
> > again?
> 
> Yep.
> 
> >  But a simple testcase
> > 
> > int a;
> > int *p = &a;
> > int foo (int n)
> > {
> >    for (int i = 0; i < n; ++i)
> >      for (int j = 0; j < 100; ++j)
> >        *p += j + i;
> >    return a;
> > }
> > 
> > shows that LIM can do this in one step.
> 
> I've filed a FTR PR68465 - "pass_lim doesn't detect identical loop entry
> conditions" for a test-case where that doesn't happen (when using
> -fno-tree-dominator-opts).
> 
> > Which means it should
> > be investigated why it doesn't do this properly for your testcase
> > (store motion of *_25).
> 
> There seem to be two related problems:
> 1. the store has tree_could_trap_p (ref->mem.ref) true, which should be
>    false. I'll work on a fix for this.
> 2. Given that the store can trap, I was running into PR68465. I managed
>    to eliminate the 2nd pass_lim by moving the pass_dominator instance
>    before the pass_lim instance.
> 
> Attached patch shows the pass group with only one pass_lim. I hope to be able
> to eliminate the first pass_dominator instance before pass_lim once I fix 1.
> 
> > Simply adding two LIM passes either papers over a wrong-code
> > bug (in LIM or in DOM) or over a missed-optimization in LIM.
> 
> AFAIU now, it's PR68465, a missed optimization in LIM.

OK, it's not really LIM's job to clean up loop header copying that way.

DOM performs jump-threading for this but FRE should also be able
to handle this just fine.  Ah, it doesn't because the outer loop
header directly contains the condition

Index: gcc/tree-ssa-sccvn.c
===================================================================
--- gcc/tree-ssa-sccvn.c        (revision 230737)
+++ gcc/tree-ssa-sccvn.c        (working copy)
@@ -4357,20 +4402,32 @@ sccvn_dom_walker::before_dom_children (b
 
   /* If we have a single predecessor record the equivalence from a
      possible condition on the predecessor edge.  */
-  if (single_pred_p (bb))
+  edge pred_e = NULL;
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    {
+      if (e->flags & EDGE_DFS_BACK)
+       continue;
+      if (! pred_e)
+       pred_e = e;
+      else
+       {
+         pred_e = NULL;
+         break;
+       }
+    }
+  if (pred_e)
     {
-      edge e = single_pred_edge (bb);
       /* Check if there are multiple executable successor edges in
         the source block.  Otherwise there is no additional info
         to be recorded.  */
       edge e2;
-      FOR_EACH_EDGE (e2, ei, e->src->succs)
-       if (e2 != e
+      FOR_EACH_EDGE (e2, ei, pred_e->src->succs)
+       if (e2 != pred_e
            && e2->flags & EDGE_EXECUTABLE)
          break;
       if (e2 && (e2->flags & EDGE_EXECUTABLE))
        {
-         gimple *stmt = last_stmt (e->src);
+         gimple *stmt = last_stmt (pred_e->src);
          if (stmt
              && gimple_code (stmt) == GIMPLE_COND)
            {
@@ -4378,11 +4435,11 @@ sccvn_dom_walker::before_dom_children (b
              tree lhs = gimple_cond_lhs (stmt);
              tree rhs = gimple_cond_rhs (stmt);
              record_conds (bb, code, lhs, rhs,
-                           (e->flags & EDGE_TRUE_VALUE) != 0);
+                           (pred_e->flags & EDGE_TRUE_VALUE) != 0);
              code = invert_tree_comparison (code, HONOR_NANS (lhs));
              if (code != ERROR_MARK)
                record_conds (bb, code, lhs, rhs,
-                             (e->flags & EDGE_TRUE_VALUE) == 0);
+                             (pred_e->flags & EDGE_TRUE_VALUE) == 0);
            }
        }
     }

fixes this for me (for a small testcase).  Does it help yours?

Otherwise untested of course (I hope EDGE_DFS_BACK is good enough,
it's supposed to match edges that have the src dominated by the dest).
Testing the above now.
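
The predecessor selection in the patch above can be illustrated outside of GCC. The following is a minimal sketch, assuming a toy edge structure; struct edge_s, EDGE_BACK and single_forward_pred are hypothetical names for illustration only and are not GCC's edge type, EDGE_DFS_BACK, or any GCC function. It shows the idea of accepting a block with several predecessors as long as exactly one of them is not a back edge, which is the case for a loop header reached from its preheader and its latch.

```c
#include <stddef.h>

#define EDGE_BACK 1   /* stand-in for a back-edge flag; value is arbitrary */

/* Toy predecessor edge: source block index plus flags.  */
struct edge_s { int src; int flags; };

/* Return the only predecessor edge that is not a back edge, or NULL if
   there is no such edge or more than one.  */
const struct edge_s *
single_forward_pred (const struct edge_s *preds, size_t n)
{
  const struct edge_s *pred_e = NULL;
  for (size_t i = 0; i < n; i++)
    {
      if (preds[i].flags & EDGE_BACK)
        continue;               /* ignore the latch edge */
      if (pred_e != NULL)
        return NULL;            /* second forward edge: ambiguous */
      pred_e = &preds[i];
    }
  return pred_e;
}
```

With this selection, a condition tested at the end of the unique forward predecessor can still be recorded for the header block, even though the header formally has two incoming edges.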

Thanks,
Richard.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-21 12:24                 ` Tom de Vries
@ 2015-11-23 11:46                   ` Richard Biener
  2015-11-27 11:44                     ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-23 11:46 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Jakub Jelinek, gcc-patches

On Sat, 21 Nov 2015, Tom de Vries wrote:

> On 13/11/15 12:39, Jakub Jelinek wrote:
> > On Fri, Nov 13, 2015 at 12:29:51PM +0100, Richard Biener wrote:
> > > > thanks for the explanation. Filed as PR68331 - '[meta-bug] fipa-pta
> > > > issues'.
> > > > 
> > > > Any feedback on the '#pragma GCC offload-alias=<none|pointer|all>' bit
> > > > above?
> > > > Is that sort of what you had in mind?
> > > 
> > > Yes.  Whether that makes sense is another question of course.  You can
> > > annotate memory references with MR_DEPENDENCE_BASE/CLIQUE yourself
> > > as well if you know dependences without the user's intervention.
> > 
> > I really don't like even the GCC offload-alias, I just don't see anything
> > special on the offload code.  Not to mention that the same issue is already
> > with other outlined functions, like OpenMP tasks or parallel regions, those
> > aren't offloaded, yet they can suffer from worse alias/points-to analysis
> > too.
> 
> AFAIU there is one aspect that is different for offloaded code: the setup of
> the data on the device.
> 
> Consider this example:
> ...
> unsigned int a[N];
> unsigned int b[N];
> unsigned int c[N];
> 
> int
> main (void)
> {
>   ...
> 
> #pragma acc kernels copyin (a) copyin (b) copyout (c)
>   {
>     for (COUNTERTYPE ii = 0; ii < N; ii++)
>       c[ii] = a[ii] + b[ii];
>   }
> 
>   ...
> ...
> 
> At gimple level, we have:
> ...
> #pragma omp target oacc_kernels \
>   map(force_from:c [len: 2097152]) \
>   map(force_to:b [len: 2097152]) \
>   map(force_to:a [len: 2097152])
> ...
> 
> [ The meaning of the force_from/force_to mappings is given in
> include/gomp-constants.h:
> ...
>     /* Allocate.  */
>     GOMP_MAP_FORCE_ALLOC = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_ALLOC),
>     /* ..., and copy to device.  */
>     GOMP_MAP_FORCE_TO = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_TO),
>     /* ..., and copy from device.  */
>     GOMP_MAP_FORCE_FROM = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_FROM),
>     /* ..., and copy to and from device.  */
>     GOMP_MAP_FORCE_TOFROM = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_TOFROM),
> ...  ]
> 
> So before calling the offloaded function, a separate alloc is done for a, b
> and c, and the base pointers of the newly allocated objects are passed to the
> offloaded function.
> 
> This means we can mark those base pointers as restrict in the offloaded
> function.
> 
> Attached proof-of-concept patch implements that.
> 
> > We simply have some compiler internal interface between the caller and
> > callee of the outlined regions, each interface in between those has
> > its own structure type used to communicate the info;
> > we can attach attributes on the fields, or some flags to indicate some
> > properties interesting from aliasing POV.
> > We don't really need to perform
> > full IPA-PTA, perhaps it would be enough to a) record somewhere in cgraph
> > the relationship in between such callers and callees (for offloading regions
> > we already have "omp target entrypoint" attribute on the callee and a
> > single caller), tell LTO if possible not to split those into different
> > partitions if easily possible, and then just for these pairs perform
> > aliasing/points-to analysis in the caller and the result record using
> > cliques/special attributes/whatever to the callee side, so that the callee
> > (outlined OpenMP/OpenACC/Cilk+ region) can then improve its alias analysis.
> 
> As a start, is the approach of this patch OK?

Works for me but leaving to Jakub to review for correctness.
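
As an illustration of the restrict idea discussed in the quoted text, here is a minimal sketch in plain C. It assumes, as the quoted explanation states, that a, b and c are separately allocated objects whose base pointers are handed to the outlined function; vec_add is a hypothetical stand-in and not the function GCC actually generates for the kernels region.

```c
#include <stddef.h>

/* Hypothetical stand-in for the outlined kernels body.  Because the three
   base pointers refer to distinct allocations, they can be declared
   restrict, which tells the compiler that stores through c cannot change
   what is read through a or b.  */
void
vec_add (const unsigned int *restrict a, const unsigned int *restrict b,
         unsigned int *restrict c, size_t n)
{
  for (size_t ii = 0; ii < n; ii++)
    c[ii] = a[ii] + b[ii];
}
```

Without the restrict qualifiers the loop is still correct, but the compiler has to assume that c may overlap a or b, which is the kind of dependence that can block parallelization.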

Richard.

> It will allow us to commit the oacc kernels patch series with the ability to
> parallelize non-trivial testcases, and work on improving the alias bit after
> that.
> 
> Thanks,
> - Tom
> 
> 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-23 11:31                     ` Richard Biener
@ 2015-11-23 15:53                       ` Tom de Vries
  2015-11-23 16:38                         ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-23 15:53 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

On 23/11/15 12:31, Richard Biener wrote:
>>> From the dump below I understand you want no memory references in
>>> the outer loop?
>>> So the issue seems to be that store motion fails
>>> to insert the preheader load / exit store to the outermost loop
>>> possible and thus another LIM pass is needed to "store motion" those
>>> again?
>>
>> Yep.
>>
>>> But a simple testcase
>>>
>>> int a;
>>> int *p = &a;
>>> int foo (int n)
>>> {
>>>    for (int i = 0; i < n; ++i)
>>>      for (int j = 0; j < 100; ++j)
>>>        *p += j + i;
>>>    return a;
>>> }
>>>
>>> shows that LIM can do this in one step.
>>
>> I've filed a FTR PR68465 - "pass_lim doesn't detect identical loop entry
>> conditions" for a test-case where that doesn't happen (when using
>> -fno-tree-dominator-opts).
>>
>>> Which means it should
>>> be investigated why it doesn't do this properly for your testcase
>>> (store motion of *_25).
>>
>> There seem to be two related problems:
>> 1. the store has tree_could_trap_p (ref->mem.ref) true, which should be
>>    false. I'll work on a fix for this.
>> 2. Given that the store can trap, I was running into PR68465. I managed
>>    to eliminate the 2nd pass_lim by moving the pass_dominator instance
>>    before the pass_lim instance.
>>
>> Attached patch shows the pass group with only one pass_lim. I hope to be able
>> to eliminate the first pass_dominator instance before pass_lim once I fix 1.
>>
>>> Simply adding two LIM passes either papers over a wrong-code
>>> bug (in LIM or in DOM) or over a missed-optimization in LIM.
>>
>> AFAIU now, it's PR68465, a missed optimization in LIM.
> OK, it's not really LIM's job to clean up loop header copying that way.
>
> DOM performs jump-threading for this but FRE should also be able
> to handle this just fine.  Ah, it doesn't because the outer loop
> header directly contains the condition
>
> Index: gcc/tree-ssa-sccvn.c
> ===================================================================
> --- gcc/tree-ssa-sccvn.c        (revision 230737)
> +++ gcc/tree-ssa-sccvn.c        (working copy)
> @@ -4357,20 +4402,32 @@ sccvn_dom_walker::before_dom_children (b
>
>     /* If we have a single predecessor record the equivalence from a
>        possible condition on the predecessor edge.  */
> -  if (single_pred_p (bb))
> +  edge pred_e = NULL;
> +  FOR_EACH_EDGE (e, ei, bb->preds)
> +    {
> +      if (e->flags & EDGE_DFS_BACK)
> +       continue;
> +      if (! pred_e)
> +       pred_e = e;
> +      else
> +       {
> +         pred_e = NULL;
> +         break;
> +       }
> +    }
> +  if (pred_e)
>       {
> -      edge e = single_pred_edge (bb);
>         /* Check if there are multiple executable successor edges in
>           the source block.  Otherwise there is no additional info
>           to be recorded.  */
>         edge e2;
> -      FOR_EACH_EDGE (e2, ei, e->src->succs)
> -       if (e2 != e
> +      FOR_EACH_EDGE (e2, ei, pred_e->src->succs)
> +       if (e2 != pred_e
>              && e2->flags & EDGE_EXECUTABLE)
>            break;
>         if (e2 && (e2->flags & EDGE_EXECUTABLE))
>          {
> -         gimple *stmt = last_stmt (e->src);
> +         gimple *stmt = last_stmt (pred_e->src);
>            if (stmt
>                && gimple_code (stmt) == GIMPLE_COND)
>              {
> @@ -4378,11 +4435,11 @@ sccvn_dom_walker::before_dom_children (b
>                tree lhs = gimple_cond_lhs (stmt);
>                tree rhs = gimple_cond_rhs (stmt);
>                record_conds (bb, code, lhs, rhs,
> -                           (e->flags & EDGE_TRUE_VALUE) != 0);
> +                           (pred_e->flags & EDGE_TRUE_VALUE) != 0);
>                code = invert_tree_comparison (code, HONOR_NANS (lhs));
>                if (code != ERROR_MARK)
>                  record_conds (bb, code, lhs, rhs,
> -                             (e->flags & EDGE_TRUE_VALUE) == 0);
> +                             (pred_e->flags & EDGE_TRUE_VALUE) == 0);
>              }
>          }
>       }
>
> fixes this for me (for a small testcase).  Does it help yours?
>

Yes, it has the desired effect (of not needing pass_dominator before
pass_lim).  But the patch "Mark by_ref mem_ref in build_receiver_ref as
non-trapping", committed as r230738, also has that effect, so AFAIU I
don't require this tree-ssa-sccvn.c fix.
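
For readers following the store-motion discussion, here is a minimal hand-written sketch of what a single store-motion step is expected to achieve on the small testcase quoted above. The function name foo_sm and the temporary p_lsm are illustrative and do not come from any GCC dump; the dumps earlier in this thread also show additional D__lsm flag variables that are omitted here.

```c
/* The quoted testcase as written: every inner iteration loads and
   stores *p.  */
int a;
int *p = &a;

int
foo (int n)
{
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < 100; ++j)
      *p += j + i;
  return a;
}

/* Hand-written equivalent after store motion to the outermost loop:
   one load before the loop nest, one store after it.  */
int
foo_sm (int n)
{
  int p_lsm = *p;            /* preheader load */
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < 100; ++j)
      p_lsm += j + i;        /* no memory reference in either loop */
  *p = p_lsm;                /* exit store */
  return a;
}
```

The unconditional load and store outside the loop nest are only valid if the access cannot trap, which is why the tree_could_trap_p result for the store matters in this thread.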

Thanks,
- Tom

> Otherwise untested of course (I hope EDGE_DFS_BACK is good enough,
> it's supposed to match edges that have the src dominated by the dest).
> Testing the above now.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-23 15:53                       ` Tom de Vries
@ 2015-11-23 16:38                         ` Richard Biener
  0 siblings, 0 replies; 133+ messages in thread
From: Richard Biener @ 2015-11-23 16:38 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

On November 23, 2015 4:37:18 PM GMT+01:00, Tom de Vries <Tom_deVries@mentor.com> wrote:
>On 23/11/15 12:31, Richard Biener wrote:
>>>> From the dump below I understand you want no memory references in
>>>> the outer loop?
>>>> So the issue seems to be that store motion fails
>>>> to insert the preheader load / exit store to the outermost loop
>>>> possible and thus another LIM pass is needed to "store motion" those
>>>> again?
>>>
>>> Yep.
>>>
>>>> But a simple testcase
>>>>
>>>> int a;
>>>> int *p = &a;
>>>> int foo (int n)
>>>> {
>>>>    for (int i = 0; i < n; ++i)
>>>>      for (int j = 0; j < 100; ++j)
>>>>        *p += j + i;
>>>>    return a;
>>>> }
>>>>
>>>> shows that LIM can do this in one step.
>>>
>>> I've filed a FTR PR68465 - "pass_lim doesn't detect identical loop entry
>>> conditions" for a test-case where that doesn't happen (when using
>>> -fno-tree-dominator-opts).
>>>
>>>> Which means it should
>>>> be investigated why it doesn't do this properly for your testcase
>>>> (store motion of *_25).
>>>
>>> There seem to be two related problems:
>>> 1. the store has tree_could_trap_p (ref->mem.ref) true, which should be
>>>    false. I'll work on a fix for this.
>>> 2. Given that the store can trap, I was running into PR68465. I managed
>>>    to eliminate the 2nd pass_lim by moving the pass_dominator instance
>>>    before the pass_lim instance.
>>>
>>> Attached patch shows the pass group with only one pass_lim. I hope to be able
>>> to eliminate the first pass_dominator instance before pass_lim once I fix 1.
>>>
>>>> Simply adding two LIM passes either papers over a wrong-code
>>>> bug (in LIM or in DOM) or over a missed-optimization in LIM.
>>>
>>> AFAIU now, it's PR68465, a missed optimization in LIM.
>> OK, it's not really LIM's job to clean up loop header copying that way.
>>
>> DOM performs jump-threading for this but FRE should also be able
>> to handle this just fine.  Ah, it doesn't because the outer loop
>> header directly contains the condition
>>
>> Index: gcc/tree-ssa-sccvn.c
>> ===================================================================
>> --- gcc/tree-ssa-sccvn.c        (revision 230737)
>> +++ gcc/tree-ssa-sccvn.c        (working copy)
>> @@ -4357,20 +4402,32 @@ sccvn_dom_walker::before_dom_children (b
>>
>>     /* If we have a single predecessor record the equivalence from a
>>        possible condition on the predecessor edge.  */
>> -  if (single_pred_p (bb))
>> +  edge pred_e = NULL;
>> +  FOR_EACH_EDGE (e, ei, bb->preds)
>> +    {
>> +      if (e->flags & EDGE_DFS_BACK)
>> +       continue;
>> +      if (! pred_e)
>> +       pred_e = e;
>> +      else
>> +       {
>> +         pred_e = NULL;
>> +         break;
>> +       }
>> +    }
>> +  if (pred_e)
>>       {
>> -      edge e = single_pred_edge (bb);
>>         /* Check if there are multiple executable successor edges in
>>           the source block.  Otherwise there is no additional info
>>           to be recorded.  */
>>         edge e2;
>> -      FOR_EACH_EDGE (e2, ei, e->src->succs)
>> -       if (e2 != e
>> +      FOR_EACH_EDGE (e2, ei, pred_e->src->succs)
>> +       if (e2 != pred_e
>>              && e2->flags & EDGE_EXECUTABLE)
>>            break;
>>         if (e2 && (e2->flags & EDGE_EXECUTABLE))
>>          {
>> -         gimple *stmt = last_stmt (e->src);
>> +         gimple *stmt = last_stmt (pred_e->src);
>>            if (stmt
>>                && gimple_code (stmt) == GIMPLE_COND)
>>              {
>> @@ -4378,11 +4435,11 @@ sccvn_dom_walker::before_dom_children (b
>>                tree lhs = gimple_cond_lhs (stmt);
>>                tree rhs = gimple_cond_rhs (stmt);
>>                record_conds (bb, code, lhs, rhs,
>> -                           (e->flags & EDGE_TRUE_VALUE) != 0);
>> +                           (pred_e->flags & EDGE_TRUE_VALUE) != 0);
>>                code = invert_tree_comparison (code, HONOR_NANS (lhs));
>>                if (code != ERROR_MARK)
>>                  record_conds (bb, code, lhs, rhs,
>> -                             (e->flags & EDGE_TRUE_VALUE) == 0);
>> +                             (pred_e->flags & EDGE_TRUE_VALUE) == 0);
>>              }
>>          }
>>       }
>>
>> fixes this for me (for a small testcase).  Does it help yours?
>>
>
>Yes, it has the desired effect (of not needing pass_dominator before
>pass_lim).  But the patch "Mark by_ref mem_ref in build_receiver_ref as
>non-trapping", committed as r230738, also has that effect, so AFAIU I
>don't require this tree-ssa-sccvn.c fix.

OK, I committed it anyway already.

Richard.

>Thanks,
>- Tom
>
>> Otherwise untested of course (I hope EDGE_DFS_BACK is good enough,
>> it's supposed to match edges that have the src dominated by the dest).
>> Testing the above now.


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 6/16] Add pass_oacc_kernels
  2015-11-19 13:51     ` Tom de Vries
@ 2015-11-24 12:17       ` Tom de Vries
  2015-11-25 10:42         ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-24 12:17 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 2285 bytes --]

On 19/11/15 14:50, Tom de Vries wrote:
> On 11/11/15 11:58, Richard Biener wrote:
>> On Mon, 9 Nov 2015, Tom de Vries wrote:
>>
>>> On 09/11/15 16:35, Tom de Vries wrote:
>>>> Hi,
>>>>
>>>> this patch series for stage1 trunk adds support to:
>>>> - parallelize oacc kernels regions using parloops, and
>>>> - map the loops onto the oacc gang dimension.
>>>>
>>>> The patch series contains these patches:
>>>>
>>>>        1    Insert new exit block only when needed in
>>>>           transform_to_exit_first_loop_alt
>>>>        2    Make create_parallel_loop return void
>>>>        3    Ignore reduction clause on kernels directive
>>>>        4    Implement -foffload-alias
>>>>        5    Add in_oacc_kernels_region in struct loop
>>>>        6    Add pass_oacc_kernels
>>>>        7    Add pass_dominator_oacc_kernels
>>>>        8    Add pass_ch_oacc_kernels
>>>>        9    Add pass_parallelize_loops_oacc_kernels
>>>>       10    Add pass_oacc_kernels pass group in passes.def
>>>>       11    Update testcases after adding kernels pass group
>>>>       12    Handle acc loop directive
>>>>       13    Add c-c++-common/goacc/kernels-*.c
>>>>       14    Add gfortran.dg/goacc/kernels-*.f95
>>>>       15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>>>       16    Add libgomp.oacc-fortran/kernels-*.f95
>>>>
>>>> The first 9 patches are more or less independent, but patches 10-16 are
>>>> intended to be committed at the same time.
>>>>
>>>> Bootstrapped and reg-tested on x86_64.
>>>>
>>>> Build and reg-tested with nvidia accelerator, in combination with a
>>>> patch that enables accelerator testing (which is submitted at
>>>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>>>
>>>> I'll post the individual patches in reply to this message.
>>>
>>> this patch adds a pass group pass_oacc_kernels (which will be added to the
>>> pass list as a whole in patch 10).
>>
>> Just to understand (while also skimming the HSA patches).
>>
>> You are basically relying on autopar for what the HSA patches call
>> "gridification"?  That is, OMP lowering produces loopy kernels
>> and autopar then will basically strip the outermost loop?
>
> Short answer: no. In more detail...
<SNIP>

Reposting patch, after splitting the pass group into two.

Thanks,
- Tom


[-- Attachment #2: 0002-Add-pass_oacc_kernels.patch --]
[-- Type: text/x-patch, Size: 4336 bytes --]

Add pass_oacc_kernels

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* tree-pass.h (make_pass_oacc_kernels, make_pass_oacc_kernels2):
	Declare.
	* tree-ssa-loop.c (gate_oacc_kernels): New static function.
	(pass_data_oacc_kernels, pass_data_oacc_kernels2): New pass_data.
	(class pass_oacc_kernels, class pass_oacc_kernels2): New pass.
	(make_pass_oacc_kernels, make_pass_oacc_kernels2): New function.

---
 gcc/tree-pass.h     |   2 +
 gcc/tree-ssa-loop.c | 110 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 112 insertions(+)

diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index dcd2d5e..9704918 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -465,6 +465,8 @@ extern gimple_opt_pass *make_pass_strength_reduction (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_vtable_verify (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ubsan (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_sanopt (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_oacc_kernels (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_oacc_kernels2 (gcc::context *ctxt);
 
 /* IPA Passes */
 extern simple_ipa_opt_pass *make_pass_ipa_lower_emutls (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index afdef12..cf7d94e 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -35,6 +35,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-inline.h"
 #include "tree-scalar-evolution.h"
 #include "tree-vectorizer.h"
+#include "omp-low.h"
 
 
 /* A pass making sure loops are fixed up.  */
@@ -141,6 +142,115 @@ make_pass_tree_loop (gcc::context *ctxt)
   return new pass_tree_loop (ctxt);
 }
 
+/* Gate for oacc kernels pass group.  */
+
+static bool
+gate_oacc_kernels (function *fn)
+{
+  if (flag_tree_parallelize_loops <= 1)
+    return false;
+
+  tree oacc_function_attr = get_oacc_fn_attrib (fn->decl);
+  if (oacc_function_attr == NULL_TREE)
+    return false;
+
+  tree val = TREE_VALUE (oacc_function_attr);
+  while (val != NULL_TREE && TREE_VALUE (val) == NULL_TREE)
+    val = TREE_CHAIN (val);
+
+  if (val != NULL_TREE)
+    return false;
+
+  struct loop *loop;
+  FOR_EACH_LOOP (loop, 0)
+    if (loop->in_oacc_kernels_region)
+      return true;
+
+  return false;
+}
+
+/* The oacc kernels superpass.  */
+
+namespace {
+
+const pass_data pass_data_oacc_kernels =
+{
+  GIMPLE_PASS, /* type */
+  "oacc_kernels", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_TREE_LOOP, /* tv_id */
+  PROP_cfg, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_oacc_kernels : public gimple_opt_pass
+{
+public:
+  pass_oacc_kernels (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_oacc_kernels, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *fn) { return gate_oacc_kernels (fn); }
+
+}; // class pass_oacc_kernels
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_oacc_kernels (gcc::context *ctxt)
+{
+  return new pass_oacc_kernels (ctxt);
+}
+
+namespace {
+
+const pass_data pass_data_oacc_kernels2 =
+{
+  GIMPLE_PASS, /* type */
+  "oacc_kernels2", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_TREE_LOOP, /* tv_id */
+  PROP_cfg, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_oacc_kernels2 : public gimple_opt_pass
+{
+public:
+  pass_oacc_kernels2 (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_oacc_kernels2, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *fn) { return gate_oacc_kernels (fn); }
+  virtual unsigned int execute (function *fn)
+    {
+      /* Rather than having a copy of the previous dump, get some use out of
+	 this dump, and try to minimize differences with the following pass
+	 (pass_lim), which will initialize the loop optimizer with
+	 LOOPS_NORMAL.  */
+      loop_optimizer_init (LOOPS_NORMAL);
+      loop_optimizer_finalize (fn);
+      return 0;
+    }
+
+}; // class pass_oacc_kernels2
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_oacc_kernels2 (gcc::context *ctxt)
+{
+  return new pass_oacc_kernels2 (ctxt);
+}
+
 /* The no-loop superpass.  */
 
 namespace {

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-23 10:11                   ` Richard Biener
@ 2015-11-24 12:22                     ` Tom de Vries
  2015-11-24 13:19                       ` Richard Biener
  2015-11-25 10:44                       ` Richard Biener
  0 siblings, 2 replies; 133+ messages in thread
From: Tom de Vries @ 2015-11-24 12:22 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 1113 bytes --]

On 23/11/15 11:02, Richard Biener wrote:
> On Fri, 20 Nov 2015, Tom de Vries wrote:
>
>> On 20/11/15 14:29, Richard Biener wrote:
>>> I agree it's somewhat of an odd behavior but all passes should
>>> either be placed in a sub-pipeline with an outer
>>> loop_optimizer_init()/finalize () call or call both themselves.
>>
>> Hmm, but adding loop_optimizer_finalize at the end of pass_lim breaks the loop
>> pipeline.
>>
>> We could use the style used in pass_slp_vectorize::execute:
>> ...
>> pass_slp_vectorize::execute (function *fun)
>> {
>>    basic_block bb;
>>
>>    bool in_loop_pipeline = scev_initialized_p ();
>>    if (!in_loop_pipeline)
>>      {
>>        loop_optimizer_init (LOOPS_NORMAL);
>>        scev_initialize ();
>>      }
>>
>>    ...
>>
>>    if (!in_loop_pipeline)
>>      {
>>        scev_finalize ();
>>        loop_optimizer_finalize ();
>>      }
>> ...
>>
>> Although that doesn't strike me as particularly clean.
>
> At least it would be a consistent "unclean" style.  So yes, the
> above would work for me.
>

Reposting using the in_loop_pipeline style in pass_lim.
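
The in_loop_pipeline style mentioned here can be sketched with a toy model. This is a minimal illustration, assuming a single flag that records whether the shared loop state is already set up; loop_state_live, toy_loop_init, toy_loop_finalize and toy_pass_execute are hypothetical names, and the quoted pass_slp_vectorize code tests scev_initialized_p () rather than a flag like this.

```c
#include <stdbool.h>

/* Toy model of the shared loop-optimizer state and call counters.  */
bool loop_state_live;
int init_calls, finalize_calls;

void toy_loop_init (void) { loop_state_live = true; init_calls++; }
void toy_loop_finalize (void) { loop_state_live = false; finalize_calls++; }

/* A pass body that works both inside and outside a loop pipeline: it only
   sets up and tears down the shared state when nobody else has done so.  */
unsigned int
toy_pass_execute (void)
{
  bool in_loop_pipeline = loop_state_live;
  if (!in_loop_pipeline)
    toy_loop_init ();

  /* ... the actual transformation would run here ... */

  if (!in_loop_pipeline)
    toy_loop_finalize ();
  return 0;
}
```

Run standalone, the pass initializes and finalizes the state itself; run inside an enclosing pipeline that already initialized it, the pass leaves the state alone so later passes in that pipeline still find it.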

Thanks,
- Tom


[-- Attachment #2: 0004-Add-pass_oacc_kernels-pass-group-in-passes.def.patch --]
[-- Type: text/x-patch, Size: 3891 bytes --]

Add pass_oacc_kernels pass group in passes.def

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* omp-low.c (pass_expand_omp_ssa::clone): New function.
	* passes.def: Add pass_oacc_kernels pass group.
	* tree-ssa-loop-ch.c (pass_ch::clone): New function.
	* tree-ssa-loop-im.c (tree_ssa_lim): Make static.
	(pass_lim::execute): Allow to run outside pass_tree_loop.

---
 gcc/omp-low.c          |  1 +
 gcc/passes.def         | 18 ++++++++++++++++++
 gcc/tree-ssa-loop-ch.c |  2 ++
 gcc/tree-ssa-loop-im.c | 12 ++++++++++--
 4 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index efe5d3a..7318b0e 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -13366,6 +13366,7 @@ public:
       return !(fun->curr_properties & PROP_gimple_eomp);
     }
   virtual unsigned int execute (function *) { return execute_expand_omp (); }
+  opt_pass * clone () { return new pass_expand_omp_ssa (m_ctxt); }
 
 }; // class pass_expand_omp_ssa
 
diff --git a/gcc/passes.def b/gcc/passes.def
index 17027786..f1969c0 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -88,7 +88,25 @@ along with GCC; see the file COPYING3.  If not see
 	  /* pass_build_ealias is a dummy pass that ensures that we
 	     execute TODO_rebuild_alias at this point.  */
 	  NEXT_PASS (pass_build_ealias);
+	  /* Pass group that runs when the function is an offloaded function
+	     containing oacc kernels loops.  Part 1.  */
+	  NEXT_PASS (pass_oacc_kernels);
+	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
+	      NEXT_PASS (pass_ch);
+	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_fre);
+	  /* Pass group that runs when the function is an offloaded function
+	     containing oacc kernels loops.  Part 2.  */
+	  NEXT_PASS (pass_oacc_kernels2);
+	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
+	      /* We use pass_lim to rewrite in-memory iteration and reduction
+	         variable accesses in loops into local variables accesses.  */
+	      NEXT_PASS (pass_lim);
+	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+	      NEXT_PASS (pass_dce);
+	      NEXT_PASS (pass_parallelize_loops_oacc_kernels);
+	      NEXT_PASS (pass_expand_omp_ssa);
+	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_merge_phi);
           NEXT_PASS (pass_dse);
 	  NEXT_PASS (pass_cd_dce);
diff --git a/gcc/tree-ssa-loop-ch.c b/gcc/tree-ssa-loop-ch.c
index 7e618bf..6493fcc 100644
--- a/gcc/tree-ssa-loop-ch.c
+++ b/gcc/tree-ssa-loop-ch.c
@@ -165,6 +165,8 @@ public:
   /* Initialize and finalize loop structures, copying headers inbetween.  */
   virtual unsigned int execute (function *);
 
+  opt_pass * clone () { return new pass_ch (m_ctxt); }
+
 protected:
   /* ch_base method: */
   virtual bool process_loop_p (struct loop *loop);
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 30b53ce..0d82d36 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -43,6 +43,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa-propagate.h"
 #include "trans-mem.h"
 #include "gimple-fold.h"
+#include "tree-scalar-evolution.h"
 
 /* TODO:  Support for predicated code motion.  I.e.
 
@@ -2496,7 +2497,7 @@ tree_ssa_lim_finalize (void)
 /* Moves invariants from loops.  Only "expensive" invariants are moved out --
    i.e. those that are likely to be win regardless of the register pressure.  */
 
-unsigned int
+static unsigned int
 tree_ssa_lim (void)
 {
   unsigned int todo;
@@ -2560,10 +2561,17 @@ public:
 unsigned int
 pass_lim::execute (function *fun)
 {
+  bool in_loop_pipeline = scev_initialized_p ();
+  if (!in_loop_pipeline)
+    loop_optimizer_init (LOOPS_NORMAL | LOOPS_HAVE_RECORDED_EXITS);
+
   if (number_of_loops (fun) <= 1)
     return 0;
+  unsigned int todo = tree_ssa_lim ();
 
-  return tree_ssa_lim ();
+  if (!in_loop_pipeline)
+    loop_optimizer_finalize ();
+  return todo;
 }
 
 } // anon namespace


* [PING][PATCH, 3/16] Ignore reduction clause on kernels directive
  2015-11-09 15:51 ` [PATCH, 3/16] Ignore reduction clause on kernels directive Tom de Vries
@ 2015-11-24 12:25   ` Tom de Vries
  2016-01-18 14:24     ` [PING^2][PATCH, " Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-24 12:25 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener, Thomas Schwinge

On 09/11/15 16:50, Tom de Vries wrote:
> On 09/11/15 16:35, Tom de Vries wrote:
>> Hi,
>>
>> this patch series for stage1 trunk adds support to:
>> - parallelize oacc kernels regions using parloops, and
>> - map the loops onto the oacc gang dimension.
>>
>> The patch series contains these patches:
>>
>>       1    Insert new exit block only when needed in
>>          transform_to_exit_first_loop_alt
>>       2    Make create_parallel_loop return void
>>       3    Ignore reduction clause on kernels directive
>>       4    Implement -foffload-alias
>>       5    Add in_oacc_kernels_region in struct loop
>>       6    Add pass_oacc_kernels
>>       7    Add pass_dominator_oacc_kernels
>>       8    Add pass_ch_oacc_kernels
>>       9    Add pass_parallelize_loops_oacc_kernels
>>      10    Add pass_oacc_kernels pass group in passes.def
>>      11    Update testcases after adding kernels pass group
>>      12    Handle acc loop directive
>>      13    Add c-c++-common/goacc/kernels-*.c
>>      14    Add gfortran.dg/goacc/kernels-*.f95
>>      15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>      16    Add libgomp.oacc-fortran/kernels-*.f95
>>
>> The first 9 patches are more or less independent, but patches 10-16 are
>> intended to be committed at the same time.
>>
>> Bootstrapped and reg-tested on x86_64.
>>
>> Build and reg-tested with nvidia accelerator, in combination with a
>> patch that enables accelerator testing (which is submitted at
>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>
>> I'll post the individual patches in reply to this message.
>
> As discussed here (
> https://gcc.gnu.org/ml/gcc-patches/2015-11/msg00785.html ), the kernels
> directive does not allow the reduction clause.  This patch fixes that.
>

Ping.

Thanks,
- Tom



* Re: [PATCH, 9/16] Add pass_parallelize_loops_oacc_kernels
  2015-11-16 11:59   ` Tom de Vries
@ 2015-11-24 12:27     ` Tom de Vries
  2015-12-13 16:58       ` [PIING][PATCH, " Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-24 12:27 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 3601 bytes --]

On 16/11/15 12:59, Tom de Vries wrote:
> On 09/11/15 20:52, Tom de Vries wrote:
>> On 09/11/15 16:35, Tom de Vries wrote:
>>> Hi,
>>>
>>> this patch series for stage1 trunk adds support to:
>>> - parallelize oacc kernels regions using parloops, and
>>> - map the loops onto the oacc gang dimension.
>>>
>>> The patch series contains these patches:
>>>
>>>       1    Insert new exit block only when needed in
>>>          transform_to_exit_first_loop_alt
>>>       2    Make create_parallel_loop return void
>>>       3    Ignore reduction clause on kernels directive
>>>       4    Implement -foffload-alias
>>>       5    Add in_oacc_kernels_region in struct loop
>>>       6    Add pass_oacc_kernels
>>>       7    Add pass_dominator_oacc_kernels
>>>       8    Add pass_ch_oacc_kernels
>>>       9    Add pass_parallelize_loops_oacc_kernels
>>>      10    Add pass_oacc_kernels pass group in passes.def
>>>      11    Update testcases after adding kernels pass group
>>>      12    Handle acc loop directive
>>>      13    Add c-c++-common/goacc/kernels-*.c
>>>      14    Add gfortran.dg/goacc/kernels-*.f95
>>>      15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>>      16    Add libgomp.oacc-fortran/kernels-*.f95
>>>
>>> The first 9 patches are more or less independent, but patches 10-16 are
>>> intended to be committed at the same time.
>>>
>>> Bootstrapped and reg-tested on x86_64.
>>>
>>> Build and reg-tested with nvidia accelerator, in combination with a
>>> patch that enables accelerator testing (which is submitted at
>>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>>
>>> I'll post the individual patches in reply to this message.
>>
>> This patch adds pass_parallelize_loops_oacc_kernels.
>>
>> There are a number of things we do differently in parloops for oacc
>> kernels:
>> - in normal parloops, we generate code to choose between a parallel
>>    version of the loop, and a sequential (low iteration count) version.
>>    Since the code in oacc kernels region is supposed to run on the
>>    accelerator anyway, we skip this check, and don't add a low iteration
>>    count loop.
>> - in normal parloops, we generate a #pragma omp parallel /
>>    GIMPLE_OMP_RETURN pair to delimit the region which we will split off
>>    into a thread function. Since the oacc kernels region is already
>>    split off, we don't add this pair.
>> - we indicate the parallelization factor by setting the oacc function
>>    attributes
>> - we generate a #pragma oacc loop instead of a #pragma omp for, and
>>    we add the gang clause
>> - in normal parloops, we rewrite the variable accesses in the loop
>>    into accesses relative to a thread function parameter. For the
>>    oacc kernels region, that rewrite has already been done at omp-lower,
>>    so we skip this.
>> - we need to ensure that the entire kernels region can be run in
>>    parallel. The loop independence check is already present, so for oacc
>>    kernels we add a check between blocks outside the loop and the entire
>>    region.
>> - we guard stores in the blocks outside the loop with gang_pos == 0.
>>    There's no need for every gang to write to the same location; we can
>>    do this in just one gang. (Typically this is the write of the final
>>    value of the iteration variable if that one is copied back to the
>>    host).
>>
>
> Reposting with loop optimizer init added in
> pass_parallelize_loops_oacc_kernels::execute.
>

Reposting with loop_optimizer_finalize, scev_initialize and scev_finalize
added in pass_parallelize_loops_oacc_kernels::execute.
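
To illustrate the gang_pos == 0 guard described in the quoted list above,
here is a small stand-alone simulation.  It is only a sketch under stated
assumptions: gangs are run one after another on the host, iterations are
handed out round-robin, and KernelState, exit_block and run_kernel are
hypothetical names that do not exist in GCC.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of memory that is copied back to the host.
struct KernelState
{
  int final_i = -1;     // final value of the iteration variable
  int store_count = 0;  // how many gangs actually performed the store
};

// Code outside the loop.  Every gang executes it, but the store is
// guarded so that only gang 0 writes the shared location.
static void exit_block (int gang_pos, int n, KernelState &mem)
{
  if (gang_pos == 0)
    {
      mem.final_i = n;
      mem.store_count++;
    }
}

// Simulate num_gangs gangs sequentially.  Gang g handles iterations
// g, g + num_gangs, g + 2 * num_gangs, ... (an assumed distribution).
static void run_kernel (int num_gangs, int n, std::vector<int> &a,
                        KernelState &mem)
{
  for (int gang = 0; gang < num_gangs; gang++)
    {
      for (int i = gang; i < n; i += num_gangs)
        a[i] = 2 * i;
      exit_block (gang, n, mem);
    }
}
```

With four simulated gangs and eight iterations, every element of the array
is written by exactly one gang, and the store of the final iteration value
happens once rather than four times.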

Thanks,
- Tom


[-- Attachment #2: 0003-Add-pass_parallelize_loops_oacc_kernels.patch --]
[-- Type: text/x-patch, Size: 30877 bytes --]

Add pass_parallelize_loops_oacc_kernels

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* omp-low.c (set_oacc_fn_attrib): Make extern.
	* omp-low.c (expand_omp_atomic_fetch_op):  Release defs of update stmt.
	* omp-low.h (set_oacc_fn_attrib): Declare.
	* tree-parloops.c (struct reduction_info): Add reduc_addr field.
	(create_call_for_reduction_1): Handle case that reduc_addr is non-NULL.
	(create_parallel_loop, gen_parallel_loop, try_create_reduction_list):
	Add and handle function parameter oacc_kernels_p.
	(get_omp_data_i_param): New function.
	(ref_conflicts_with_region, oacc_entry_exit_ok_1)
	(oacc_entry_exit_single_gang, oacc_entry_exit_ok): New functions.
	(parallelize_loops): Add and handle function parameter oacc_kernels_p.
	Calculate dominance info.  Skip loops that are not in a kernels region
	in oacc_kernels_p mode.  Skip inner loops of parallelized loops.
	(pass_parallelize_loops::execute): Call parallelize_loops with false
	argument.
	(pass_data_parallelize_loops_oacc_kernels): New pass_data.
	(class pass_parallelize_loops_oacc_kernels): New pass.
	(pass_parallelize_loops_oacc_kernels::execute)
	(make_pass_parallelize_loops_oacc_kernels): New function.
	* tree-pass.h (make_pass_parallelize_loops_oacc_kernels): Declare.

---
 gcc/omp-low.c       |   8 +-
 gcc/omp-low.h       |   1 +
 gcc/tree-parloops.c | 700 +++++++++++++++++++++++++++++++++++++++++++++++-----
 gcc/tree-pass.h     |   2 +
 4 files changed, 647 insertions(+), 64 deletions(-)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 0d4c6e5..efe5d3a 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -11925,10 +11925,14 @@ expand_omp_atomic_fetch_op (basic_block load_bb,
   gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_ATOMIC_STORE);
   gsi_remove (&gsi, true);
   gsi = gsi_last_bb (store_bb);
+  stmt = gsi_stmt (gsi);
   gsi_remove (&gsi, true);
 
   if (gimple_in_ssa_p (cfun))
-    update_ssa (TODO_update_ssa_no_phi);
+    {
+      release_defs (stmt);
+      update_ssa (TODO_update_ssa_no_phi);
+    }
 
   return true;
 }
@@ -12302,7 +12306,7 @@ replace_oacc_fn_attrib (tree fn, tree dims)
    function attribute.  Push any that are non-constant onto the ARGS
    list, along with an appropriate GOMP_LAUNCH_DIM tag.  */
 
-static void
+void
 set_oacc_fn_attrib (tree fn, tree clauses, vec<tree> *args)
 {
   /* Must match GOMP_DIM ordering.  */
diff --git a/gcc/omp-low.h b/gcc/omp-low.h
index 194b3d1..1790f40 100644
--- a/gcc/omp-low.h
+++ b/gcc/omp-low.h
@@ -33,6 +33,7 @@ extern tree omp_member_access_dummy_var (tree);
 extern void replace_oacc_fn_attrib (tree, tree);
 extern tree build_oacc_routine_dims (tree);
 extern tree get_oacc_fn_attrib (tree);
+extern void set_oacc_fn_attrib (tree, tree, vec<tree> *);
 extern int get_oacc_ifn_dim_arg (const gimple *);
 extern int get_oacc_fn_dim_size (tree, int);
 
diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
index 9b564ca..0403d3b 100644
--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c
@@ -53,6 +53,10 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa.h"
 #include "params.h"
 #include "params-enum.h"
+#include "tree-ssa-alias.h"
+#include "tree-eh.h"
+#include "gomp-constants.h"
+#include "tree-dfa.h"
 
 /* This pass tries to distribute iterations of loops into several threads.
    The implementation is straightforward -- for each loop we test whether its
@@ -192,6 +196,8 @@ struct reduction_info
 				   of the reduction variable when existing the loop. */
   tree initial_value;		/* The initial value of the reduction var before entering the loop.  */
   tree field;			/*  the name of the field in the parloop data structure intended for reduction.  */
+  tree reduc_addr;		/* The address of the reduction variable for
+				   openacc reductions.  */
   tree init;			/* reduction initialization value.  */
   gphi *new_phi;		/* (helper field) Newly created phi node whose result
 				   will be passed to the atomic operation.  Represents
@@ -1085,10 +1091,29 @@ create_call_for_reduction_1 (reduction_info **slot, struct clsn_data *clsn_data)
   tree tmp_load, name;
   gimple *load;
 
-  load_struct = build_simple_mem_ref (clsn_data->load);
-  t = build3 (COMPONENT_REF, type, load_struct, reduc->field, NULL_TREE);
+  if (reduc->reduc_addr == NULL_TREE)
+    {
+      load_struct = build_simple_mem_ref (clsn_data->load);
+      t = build3 (COMPONENT_REF, type, load_struct, reduc->field, NULL_TREE);
+
+      addr = build_addr (t);
+    }
+  else
+    {
+      /* Set the address for the atomic store.  */
+      addr = reduc->reduc_addr;
 
-  addr = build_addr (t);
+      /* Remove the non-atomic store '*addr = sum'.  */
+      tree res = PHI_RESULT (reduc->keep_res);
+      use_operand_p use_p;
+      gimple *stmt;
+      bool single_use_p = single_imm_use (res, &use_p, &stmt);
+      gcc_assert (single_use_p);
+      replace_uses_by (gimple_vdef (stmt),
+		       gimple_vuse (stmt));
+      gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+      gsi_remove (&gsi, true);
+    }
 
   /* Create phi node.  */
   bb = clsn_data->load_bb;
@@ -1990,7 +2015,8 @@ transform_to_exit_first_loop (struct loop *loop,
 
 static void
 create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
-		      tree new_data, unsigned n_threads, location_t loc)
+		      tree new_data, unsigned n_threads, location_t loc,
+		      bool oacc_kernels_p)
 {
   gimple_stmt_iterator gsi;
   basic_block bb, paral_bb, for_bb, ex_bb, continue_bb;
@@ -2003,19 +2029,33 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
   gomp_continue *omp_cont_stmt;
   tree cvar, cvar_init, initvar, cvar_next, cvar_base, type;
   edge exit, nexit, guard, end, e;
+  tree for_clauses = NULL_TREE;
 
   /* Prepare the GIMPLE_OMP_PARALLEL statement.  */
   bb = loop_preheader_edge (loop)->src;
-  paral_bb = single_pred (bb);
-  gsi = gsi_last_bb (paral_bb);
+  if (!oacc_kernels_p)
+    {
+      paral_bb = single_pred (bb);
+      gsi = gsi_last_bb (paral_bb);
+    }
 
-  t = build_omp_clause (loc, OMP_CLAUSE_NUM_THREADS);
-  OMP_CLAUSE_NUM_THREADS_EXPR (t)
-    = build_int_cst (integer_type_node, n_threads);
-  omp_par_stmt = gimple_build_omp_parallel (NULL, t, loop_fn, data);
-  gimple_set_location (omp_par_stmt, loc);
+  if (!oacc_kernels_p)
+    {
+      t = build_omp_clause (loc, OMP_CLAUSE_NUM_THREADS);
+      OMP_CLAUSE_NUM_THREADS_EXPR (t)
+	= build_int_cst (integer_type_node, n_threads);
+      omp_par_stmt = gimple_build_omp_parallel (NULL, t, loop_fn, data);
+      gimple_set_location (omp_par_stmt, loc);
 
-  gsi_insert_after (&gsi, omp_par_stmt, GSI_NEW_STMT);
+      gsi_insert_after (&gsi, omp_par_stmt, GSI_NEW_STMT);
+    }
+  else
+    {
+      tree clause = build_omp_clause (loc, OMP_CLAUSE_NUM_GANGS);
+      OMP_CLAUSE_NUM_GANGS_EXPR (clause)
+	= build_int_cst (integer_type_node, n_threads);
+      set_oacc_fn_attrib (cfun->decl, clause, NULL);
+    }
 
   /* Initialize NEW_DATA.  */
   if (data)
@@ -2033,12 +2073,18 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
       gsi_insert_before (&gsi, assign_stmt, GSI_SAME_STMT);
     }
 
-  /* Emit GIMPLE_OMP_RETURN for GIMPLE_OMP_PARALLEL.  */
-  bb = split_loop_exit_edge (single_dom_exit (loop));
-  gsi = gsi_last_bb (bb);
-  omp_return_stmt1 = gimple_build_omp_return (false);
-  gimple_set_location (omp_return_stmt1, loc);
-  gsi_insert_after (&gsi, omp_return_stmt1, GSI_NEW_STMT);
+  /* Skip insertion of OMP_RETURN for oacc_kernels_p.  We've already generated
+     one when lowering the oacc kernels directive in
+     pass_lower_omp/lower_omp (). */
+  if (!oacc_kernels_p)
+    {
+      /* Emit GIMPLE_OMP_RETURN for GIMPLE_OMP_PARALLEL.  */
+      bb = split_loop_exit_edge (single_dom_exit (loop));
+      gsi = gsi_last_bb (bb);
+      omp_return_stmt1 = gimple_build_omp_return (false);
+      gimple_set_location (omp_return_stmt1, loc);
+      gsi_insert_after (&gsi, omp_return_stmt1, GSI_NEW_STMT);
+    }
 
   /* Extract data for GIMPLE_OMP_FOR.  */
   gcc_assert (loop->header == single_dom_exit (loop)->src);
@@ -2130,7 +2176,17 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
     OMP_CLAUSE_SCHEDULE_CHUNK_EXPR (t)
       = build_int_cst (integer_type_node, chunk_size);
 
-  for_stmt = gimple_build_omp_for (NULL, GF_OMP_FOR_KIND_FOR, t, 1, NULL);
+  if (1)
+    {
+      /* In combination with the NUM_GANGS on the parallel.  */
+      for_clauses = build_omp_clause (loc, OMP_CLAUSE_GANG);
+    }
+
+  for_stmt = gimple_build_omp_for (NULL,
+				   (oacc_kernels_p
+				    ? GF_OMP_FOR_KIND_OACC_LOOP
+				    : GF_OMP_FOR_KIND_FOR),
+				   for_clauses, 1, NULL);
   gimple_set_location (for_stmt, loc);
   gimple_omp_for_set_index (for_stmt, 0, initvar);
   gimple_omp_for_set_initial (for_stmt, 0, cvar_init);
@@ -2172,7 +2228,8 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
 static void
 gen_parallel_loop (struct loop *loop,
 		   reduction_info_table_type *reduction_list,
-		   unsigned n_threads, struct tree_niter_desc *niter)
+		   unsigned n_threads, struct tree_niter_desc *niter,
+		   bool oacc_kernels_p)
 {
   tree many_iterations_cond, type, nit;
   tree arg_struct, new_arg_struct;
@@ -2253,40 +2310,44 @@ gen_parallel_loop (struct loop *loop,
   if (stmts)
     gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
 
-  if (loop->inner)
-    m_p_thread=2;
-  else
-    m_p_thread=MIN_PER_THREAD;
-
-   many_iterations_cond =
-     fold_build2 (GE_EXPR, boolean_type_node,
-                nit, build_int_cst (type, m_p_thread * n_threads));
-
-  many_iterations_cond
-    = fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
-		   invert_truthvalue (unshare_expr (niter->may_be_zero)),
-		   many_iterations_cond);
-  many_iterations_cond
-    = force_gimple_operand (many_iterations_cond, &stmts, false, NULL_TREE);
-  if (stmts)
-    gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
-  if (!is_gimple_condexpr (many_iterations_cond))
+  if (!oacc_kernels_p)
     {
+      if (loop->inner)
+	m_p_thread=2;
+      else
+	m_p_thread=MIN_PER_THREAD;
+
+      many_iterations_cond =
+	fold_build2 (GE_EXPR, boolean_type_node,
+		     nit, build_int_cst (type, m_p_thread * n_threads));
+
+      many_iterations_cond
+	= fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
+		       invert_truthvalue (unshare_expr (niter->may_be_zero)),
+		       many_iterations_cond);
       many_iterations_cond
-	= force_gimple_operand (many_iterations_cond, &stmts,
-				true, NULL_TREE);
+	= force_gimple_operand (many_iterations_cond, &stmts, false, NULL_TREE);
       if (stmts)
 	gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
-    }
+      if (!is_gimple_condexpr (many_iterations_cond))
+	{
+	  many_iterations_cond
+	    = force_gimple_operand (many_iterations_cond, &stmts,
+				    true, NULL_TREE);
+	  if (stmts)
+	    gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop),
+					      stmts);
+	}
 
-  initialize_original_copy_tables ();
+      initialize_original_copy_tables ();
 
-  /* We assume that the loop usually iterates a lot.  */
-  prob = 4 * REG_BR_PROB_BASE / 5;
-  loop_version (loop, many_iterations_cond, NULL,
-		prob, prob, REG_BR_PROB_BASE - prob, true);
-  update_ssa (TODO_update_ssa);
-  free_original_copy_tables ();
+      /* We assume that the loop usually iterates a lot.  */
+      prob = 4 * REG_BR_PROB_BASE / 5;
+      loop_version (loop, many_iterations_cond, NULL,
+		    prob, prob, REG_BR_PROB_BASE - prob, true);
+      update_ssa (TODO_update_ssa);
+      free_original_copy_tables ();
+    }
 
   /* Base all the induction variables in LOOP on a single control one.  */
   canonicalize_loop_ivs (loop, &nit, true);
@@ -2306,6 +2367,9 @@ gen_parallel_loop (struct loop *loop,
     }
   else
     {
+      if (oacc_kernels_p)
+	n_threads = 1;
+
       /* Fall back on the method that handles more cases, but duplicates the
 	 loop body: move the exit condition of LOOP to the beginning of its
 	 header, and duplicate the part of the last iteration that gets disabled
@@ -2322,19 +2386,34 @@ gen_parallel_loop (struct loop *loop,
   entry = loop_preheader_edge (loop);
   exit = single_dom_exit (loop);
 
-  eliminate_local_variables (entry, exit);
-  /* In the old loop, move all variables non-local to the loop to a structure
-     and back, and create separate decls for the variables used in loop.  */
-  separate_decls_in_region (entry, exit, reduction_list, &arg_struct,
-			    &new_arg_struct, &clsn_data);
+  /* This rewrites the body in terms of new variables.  This has already
+     been done for oacc_kernels_p in pass_lower_omp/lower_omp ().  */
+  if (!oacc_kernels_p)
+    {
+      eliminate_local_variables (entry, exit);
+      /* In the old loop, move all variables non-local to the loop to a
+	 structure and back, and create separate decls for the variables used in
+	 loop.  */
+      separate_decls_in_region (entry, exit, reduction_list, &arg_struct,
+				&new_arg_struct, &clsn_data);
+    }
+  else
+    {
+      arg_struct = NULL_TREE;
+      new_arg_struct = NULL_TREE;
+      clsn_data.load = NULL_TREE;
+      clsn_data.load_bb = exit->dest;
+      clsn_data.store = NULL_TREE;
+      clsn_data.store_bb = NULL;
+    }
 
   /* Create the parallel constructs.  */
   loc = UNKNOWN_LOCATION;
   cond_stmt = last_stmt (loop->header);
   if (cond_stmt)
     loc = gimple_location (cond_stmt);
-  create_parallel_loop (loop, create_loop_fn (loc), arg_struct,
-			new_arg_struct, n_threads, loc);
+  create_parallel_loop (loop, create_loop_fn (loc), arg_struct, new_arg_struct,
+			n_threads, loc, oacc_kernels_p);
   if (reduction_list->elements () > 0)
     create_call_for_reduction (loop, reduction_list, &clsn_data);
 
@@ -2531,12 +2610,21 @@ try_get_loop_niter (loop_p loop, struct tree_niter_desc *niter)
   return true;
 }
 
+static tree
+get_omp_data_i_param (void)
+{
+  tree decl = DECL_ARGUMENTS (cfun->decl);
+  gcc_assert (DECL_CHAIN (decl) == NULL_TREE);
+  return ssa_default_def (cfun, decl);
+}
+
 /* Try to initialize REDUCTION_LIST for code generation part.
    REDUCTION_LIST describes the reductions.  */
 
 static bool
 try_create_reduction_list (loop_p loop,
-			   reduction_info_table_type *reduction_list)
+			   reduction_info_table_type *reduction_list,
+			   bool oacc_kernels_p)
 {
   edge exit = single_dom_exit (loop);
   gphi_iterator gsi;
@@ -2595,6 +2683,7 @@ try_create_reduction_list (loop_p loop,
 			 "  FAILED: it is not a part of reduction.\n");
 	      return false;
 	    }
+	  red->keep_res = phi;
 	  if (dump_file && (dump_flags & TDF_DETAILS))
 	    {
 	      fprintf (dump_file, "reduction phi is  ");
@@ -2629,15 +2718,402 @@ try_create_reduction_list (loop_p loop,
     }
 
 
+  if (oacc_kernels_p)
+    {
+      edge e = loop_preheader_edge (loop);
+
+      for (gsi = gsi_start_phis (loop->header); !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gphi *phi = gsi.phi ();
+	  tree def = PHI_RESULT (phi);
+	  affine_iv iv;
+
+	  if (!virtual_operand_p (def)
+	      && !simple_iv (loop, loop, def, &iv, true))
+	    {
+	      struct reduction_info *red;
+	      red = reduction_phi (reduction_list, phi);
+
+	      /* Look for pattern:
+
+		 <bb preheader>
+		   .omp_data_i = &.omp_data_arr;
+		   addr = .omp_data_i->sum;
+		   sum_a = *addr;
+
+		 <bb header>:
+		   sum_b = PHI <sum_a (preheader), sum_c (latch)>
+
+		 and assign addr to reduc->reduc_addr.  */
+
+	      tree arg = PHI_ARG_DEF_FROM_EDGE (phi, e);
+	      gimple *stmt = SSA_NAME_DEF_STMT (arg);
+	      if (!gimple_assign_single_p (stmt))
+		return false;
+	      tree memref = gimple_assign_rhs1 (stmt);
+	      if (TREE_CODE (memref) != MEM_REF)
+		return false;
+	      tree addr = TREE_OPERAND (memref, 0);
+
+	      gimple *stmt2 = SSA_NAME_DEF_STMT (addr);
+	      if (!gimple_assign_single_p (stmt2))
+		return false;
+	      tree compref = gimple_assign_rhs1 (stmt2);
+	      if (TREE_CODE (compref) != COMPONENT_REF)
+		return false;
+	      tree addr2 = TREE_OPERAND (compref, 0);
+	      if (TREE_CODE (addr2) != MEM_REF)
+		return false;
+	      addr2 = TREE_OPERAND (addr2, 0);
+	      if (TREE_CODE (addr2) != SSA_NAME
+		  || addr2 != get_omp_data_i_param ())
+		return false;
+	      red->reduc_addr = addr;
+	    }
+	}
+    }
+
+  return true;
+}
+
+static bool
+ref_conflicts_with_region (gimple_stmt_iterator gsi, ao_ref *ref,
+			   bool ref_is_store, vec<basic_block> region_bbs,
+			   unsigned int i, gimple *skip_stmt)
+{
+  basic_block bb = region_bbs[i];
+  gsi_next (&gsi);
+
+  while (true)
+    {
+      for (; !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  if (stmt == skip_stmt)
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "skipping reduction store: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+	      continue;
+	    }
+
+	  if (!gimple_vdef (stmt)
+	      && !gimple_vuse (stmt))
+	    continue;
+
+	  if (gimple_code (stmt) == GIMPLE_RETURN)
+	    continue;
+
+	  if (ref_is_store)
+	    {
+	      if (ref_maybe_used_by_stmt_p (stmt, ref))
+		{
+		  if (dump_file)
+		    {
+		      fprintf (dump_file, "Stmt ");
+		      print_gimple_stmt (dump_file, stmt, 0, 0);
+		    }
+		  return true;
+		}
+	    }
+	  else
+	    {
+	      if (stmt_may_clobber_ref_p_1 (stmt, ref))
+		{
+		  if (dump_file)
+		    {
+		      fprintf (dump_file, "Stmt ");
+		      print_gimple_stmt (dump_file, stmt, 0, 0);
+		    }
+		  return true;
+		}
+	    }
+	}
+      i++;
+      if (i == region_bbs.length ())
+	break;
+      bb = region_bbs[i];
+      gsi = gsi_start_bb (bb);
+    }
+
+  return false;
+}
+
+static bool
+oacc_entry_exit_ok_1 (bitmap in_loop_bbs, vec<basic_block> region_bbs,
+		      tree omp_data_i,
+		      reduction_info_table_type *reduction_list,
+		      bitmap reduction_stores)
+{
+  unsigned i;
+  basic_block bb;
+  FOR_EACH_VEC_ELT (region_bbs, i, bb)
+    {
+      if (bitmap_bit_p (in_loop_bbs, bb->index))
+	continue;
+
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  gimple *skip_stmt = NULL;
+
+	  if (is_gimple_debug (stmt)
+	      || gimple_code (stmt) == GIMPLE_COND)
+	    continue;
+
+	  ao_ref ref;
+	  bool ref_is_store = false;
+	  if (gimple_assign_load_p (stmt))
+	    {
+	      tree rhs = gimple_assign_rhs1 (stmt);
+	      tree base = get_base_address (rhs);
+	      if (TREE_CODE (base) == MEM_REF
+		  && operand_equal_p (TREE_OPERAND (base, 0), omp_data_i, 0))
+		continue;
+
+	      tree lhs = gimple_assign_lhs (stmt);
+	      if (TREE_CODE (lhs) == SSA_NAME
+		  && has_single_use (lhs))
+		{
+		  use_operand_p use_p;
+		  gimple *use_stmt;
+		  single_imm_use (lhs, &use_p, &use_stmt);
+		  if (gimple_code (use_stmt) == GIMPLE_PHI)
+		    {
+		      struct reduction_info *red;
+		      red = reduction_phi (reduction_list, use_stmt);
+		      tree val = PHI_RESULT (red->keep_res);
+		      if (has_single_use (val))
+			{
+			  single_imm_use (val, &use_p, &use_stmt);
+			  if (gimple_store_p (use_stmt))
+			    {
+			      unsigned int id
+				= SSA_NAME_VERSION (gimple_vdef (use_stmt));
+			      bitmap_set_bit (reduction_stores, id);
+			      skip_stmt = use_stmt;
+			      if (dump_file)
+				{
+				  fprintf (dump_file, "found reduction load: ");
+				  print_gimple_stmt (dump_file, stmt, 0, 0);
+				}
+			    }
+			}
+		    }
+		}
+
+	      ao_ref_init (&ref, rhs);
+	    }
+	  else if (gimple_store_p (stmt))
+	    {
+	      ao_ref_init (&ref, gimple_assign_lhs (stmt));
+	      ref_is_store = true;
+	    }
+	  else if (gimple_code (stmt) == GIMPLE_OMP_RETURN)
+	    continue;
+	  else if (!gimple_has_side_effects (stmt)
+		   && !gimple_could_trap_p (stmt)
+		   && !stmt_could_throw_p (stmt)
+		   && !gimple_vdef (stmt)
+		   && !gimple_vuse (stmt))
+	    continue;
+	  else if (is_gimple_call (stmt)
+		   && gimple_call_internal_p (stmt)
+		   && gimple_call_internal_fn (stmt) == IFN_GOACC_DIM_POS)
+	    continue;
+	  else if (gimple_code (stmt) == GIMPLE_RETURN)
+	    continue;
+	  else
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "Unhandled stmt in entry/exit: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+	      return false;
+	    }
+
+	  if (ref_conflicts_with_region (gsi, &ref, ref_is_store, region_bbs,
+					 i, skip_stmt))
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "conflicts with entry/exit stmt: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+	      return false;
+	    }
+	}
+    }
+
   return true;
 }
 
+/* Find stores inside REGION_BBS and outside IN_LOOP_BBS, and guard them with
+   gang_pos == 0, except when the stores are REDUCTION_STORES.  Return true
+   if any changes were made.  */
+
+static bool
+oacc_entry_exit_single_gang (bitmap in_loop_bbs, vec<basic_block> region_bbs,
+			     bitmap reduction_stores)
+{
+  tree gang_pos = NULL_TREE;
+  bool changed = false;
+
+  unsigned i;
+  basic_block bb;
+  FOR_EACH_VEC_ELT (region_bbs, i, bb)
+    {
+      if (bitmap_bit_p (in_loop_bbs, bb->index))
+	continue;
+
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);)
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+
+	  if (!gimple_store_p (stmt))
+	    {
+	      /* Update gsi to point to next stmt.  */
+	      gsi_next (&gsi);
+	      continue;
+	    }
+
+	  if (bitmap_bit_p (reduction_stores,
+			    SSA_NAME_VERSION (gimple_vdef (stmt))))
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file,
+			   "skipped reduction store for single-gang"
+			   " neutering: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+
+	      /* Update gsi to point to next stmt.  */
+	      gsi_next (&gsi);
+	      continue;
+	    }
+
+	  changed = true;
+
+	  if (gang_pos == NULL_TREE)
+	    {
+	      tree arg = build_int_cst (integer_type_node, GOMP_DIM_GANG);
+	      gcall *gang_single
+		= gimple_build_call_internal (IFN_GOACC_DIM_POS, 1, arg);
+	      gang_pos = make_ssa_name (integer_type_node);
+	      gimple_call_set_lhs (gang_single, gang_pos);
+	      gimple_stmt_iterator start
+		= gsi_start_bb (single_succ (ENTRY_BLOCK_PTR_FOR_FN (cfun)));
+	      tree vuse = ssa_default_def (cfun, gimple_vop (cfun));
+	      gimple_set_vuse (gang_single, vuse);
+	      gsi_insert_before (&start, gang_single, GSI_SAME_STMT);
+	    }
+
+	  if (dump_file)
+	    {
+	      fprintf (dump_file,
+		       "found store that needs single-gang neutering: ");
+	      print_gimple_stmt (dump_file, stmt, 0, 0);
+	    }
+
+	  {
+	    /* Split block before store.  */
+	    gimple_stmt_iterator gsi2 = gsi;
+	    gsi_prev (&gsi2);
+	    edge e;
+	    if (gsi_end_p (gsi2))
+	      {
+		e = split_block_after_labels (bb);
+		gsi2 = gsi_last_bb (bb);
+	      }
+	    else
+	      e = split_block (bb, gsi_stmt (gsi2));
+	    basic_block bb2 = e->dest;
+
+	    /* Split block after store.  */
+	    gimple_stmt_iterator gsi3 = gsi_start_bb (bb2);
+	    edge e2 = split_block (bb2, gsi_stmt (gsi3));
+	    basic_block bb3 = e2->dest;
+
+	    gimple *cond
+	      = gimple_build_cond (EQ_EXPR, gang_pos, integer_zero_node,
+				   NULL_TREE, NULL_TREE);
+	    gsi_insert_after (&gsi2, cond, GSI_NEW_STMT);
+
+	    edge e3 = make_edge (bb, bb3, EDGE_FALSE_VALUE);
+	    e->flags = EDGE_TRUE_VALUE;
+
+	    tree vdef = gimple_vdef (stmt);
+	    tree vuse = gimple_vuse (stmt);
+
+	    tree phi_res = copy_ssa_name (vdef);
+	    gphi *new_phi = create_phi_node (phi_res, bb3);
+	    replace_uses_by (vdef, phi_res);
+	    add_phi_arg (new_phi, vuse, e3, UNKNOWN_LOCATION);
+	    add_phi_arg (new_phi, vdef, e2, UNKNOWN_LOCATION);
+
+	    /* Update gsi to point to next stmt.  */
+	    bb = bb3;
+	    gsi = gsi_start_bb (bb);
+	  }
+	}
+    }
+
+  return changed;
+}
+
+static bool
+oacc_entry_exit_ok (struct loop *loop,
+		    reduction_info_table_type *reduction_list)
+{
+  basic_block *loop_bbs = get_loop_body_in_dom_order (loop);
+  tree omp_data_i = get_omp_data_i_param ();
+  gcc_assert (omp_data_i != NULL_TREE);
+  vec<basic_block> region_bbs
+    = get_all_dominated_blocks (CDI_DOMINATORS, ENTRY_BLOCK_PTR_FOR_FN (cfun));
+
+  bitmap in_loop_bbs = BITMAP_ALLOC (NULL);
+  bitmap_clear (in_loop_bbs);
+  for (unsigned int i = 0; i < loop->num_nodes; i++)
+    bitmap_set_bit (in_loop_bbs, loop_bbs[i]->index);
+
+  bitmap reduction_stores = BITMAP_ALLOC (NULL);
+  bool res = oacc_entry_exit_ok_1 (in_loop_bbs, region_bbs, omp_data_i,
+				   reduction_list, reduction_stores);
+
+  if (res)
+    {
+      bool changed = oacc_entry_exit_single_gang (in_loop_bbs, region_bbs,
+						  reduction_stores);
+      if (changed)
+	{
+	  free_dominance_info (CDI_DOMINATORS);
+	  calculate_dominance_info (CDI_DOMINATORS);
+	}
+    }
+
+  free (loop_bbs);
+
+  BITMAP_FREE (in_loop_bbs);
+  BITMAP_FREE (reduction_stores);
+
+  return res;
+}
+
 /* Detect parallel loops and generate parallel code using libgomp
    primitives.  Returns true if some loop was parallelized, false
    otherwise.  */
 
 static bool
-parallelize_loops (void)
+parallelize_loops (bool oacc_kernels_p)
 {
   unsigned n_threads = flag_tree_parallelize_loops;
   bool changed = false;
@@ -2649,19 +3125,29 @@ parallelize_loops (void)
   source_location loop_loc;
 
   /* Do not parallelize loops in the functions created by parallelization.  */
-  if (parallelized_function_p (cfun->decl))
+  if (!oacc_kernels_p
+      && parallelized_function_p (cfun->decl))
     return false;
+
+  /* Do not parallelize loops in offloaded functions.  */
+  if (!oacc_kernels_p
+      && get_oacc_fn_attrib (cfun->decl) != NULL)
+    return false;
+
   if (cfun->has_nonlocal_label)
     return false;
 
   gcc_obstack_init (&parloop_obstack);
   reduction_info_table_type reduction_list (10);
 
+  calculate_dominance_info (CDI_DOMINATORS);
+
   FOR_EACH_LOOP (loop, 0)
     {
       if (loop == skip_loop)
 	{
-	  if (dump_file && (dump_flags & TDF_DETAILS))
+	  if (!loop->in_oacc_kernels_region
+	      && dump_file && (dump_flags & TDF_DETAILS))
 	    fprintf (dump_file,
 		     "Skipping loop %d as inner loop of parallelized loop\n",
 		     loop->num);
@@ -2673,6 +3159,22 @@ parallelize_loops (void)
 	skip_loop = NULL;
 
       reduction_list.empty ();
+
+      if (oacc_kernels_p)
+	{
+	  if (!loop->in_oacc_kernels_region)
+	    continue;
+
+	  /* Don't try to parallelize inner loops in an oacc kernels region.  */
+	  if (loop->inner)
+	    skip_loop = loop->inner;
+
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file,
+		     "Trying loop %d with header bb %d in oacc kernels"
+		     " region\n", loop->num, loop->header->index);
+	}
+
       if (dump_file && (dump_flags & TDF_DETAILS))
       {
         fprintf (dump_file, "Trying loop %d as candidate\n",loop->num);
@@ -2714,6 +3216,7 @@ parallelize_loops (void)
       /* FIXME: Bypass this check as graphite doesn't update the
 	 count and frequency correctly now.  */
       if (!flag_loop_parallelize_all
+	  && !oacc_kernels_p
 	  && ((estimated != -1
 	       && estimated <= (HOST_WIDE_INT) n_threads * MIN_PER_THREAD)
 	      /* Do not bother with loops in cold areas.  */
@@ -2723,14 +3226,23 @@ parallelize_loops (void)
       if (!try_get_loop_niter (loop, &niter_desc))
 	continue;
 
-      if (!try_create_reduction_list (loop, &reduction_list))
+      if (!try_create_reduction_list (loop, &reduction_list, oacc_kernels_p))
 	continue;
 
       if (!flag_loop_parallelize_all
 	  && !loop_parallel_p (loop, &parloop_obstack))
 	continue;
 
+      if (oacc_kernels_p
+	  && !oacc_entry_exit_ok (loop, &reduction_list))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "entry/exit not ok: FAILED\n");
+	  continue;
+	}
+
       changed = true;
+      /* Skip inner loop(s) of parallelized loop.  */
       skip_loop = loop->inner;
       if (dump_file && (dump_flags & TDF_DETAILS))
       {
@@ -2743,8 +3255,9 @@ parallelize_loops (void)
 	  fprintf (dump_file, "\nloop at %s:%d: ",
 		   LOCATION_FILE (loop_loc), LOCATION_LINE (loop_loc));
       }
+
       gen_parallel_loop (loop, &reduction_list,
-			 n_threads, &niter_desc);
+			 n_threads, &niter_desc, oacc_kernels_p);
     }
 
   obstack_free (&parloop_obstack, NULL);
@@ -2794,7 +3307,7 @@ pass_parallelize_loops::execute (function *fun)
   if (number_of_loops (fun) <= 1)
     return 0;
 
-  if (parallelize_loops ())
+  if (parallelize_loops (false))
     {
       fun->curr_properties &= ~(PROP_gimple_eomp);
 
@@ -2813,3 +3326,66 @@ make_pass_parallelize_loops (gcc::context *ctxt)
 {
   return new pass_parallelize_loops (ctxt);
 }
+
+namespace {
+
+const pass_data pass_data_parallelize_loops_oacc_kernels =
+{
+  GIMPLE_PASS, /* type */
+  "parloops_oacc_kernels", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_TREE_PARALLELIZE_LOOPS, /* tv_id */
+  ( PROP_cfg | PROP_ssa ), /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_parallelize_loops_oacc_kernels : public gimple_opt_pass
+{
+public:
+  pass_parallelize_loops_oacc_kernels (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_parallelize_loops_oacc_kernels, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *) { return flag_tree_parallelize_loops > 1; }
+  virtual unsigned int execute (function *);
+
+}; // class pass_parallelize_loops_oacc_kernels
+
+unsigned
+pass_parallelize_loops_oacc_kernels::execute (function *fun)
+{
+  unsigned int todo = 0;
+
+  loop_optimizer_init (LOOPS_NORMAL
+		       | LOOPS_HAVE_RECORDED_EXITS);
+
+  if (number_of_loops (fun) <= 1)
+    return 0;
+
+  rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
+
+  scev_initialize ();
+
+  if (parallelize_loops (true))
+    {
+      fun->curr_properties &= ~(PROP_gimple_eomp);
+      todo |= TODO_update_ssa;
+    }
+
+  scev_finalize ();
+  loop_optimizer_finalize ();
+
+  return todo;
+}
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_parallelize_loops_oacc_kernels (gcc::context *ctxt)
+{
+  return new pass_parallelize_loops_oacc_kernels (ctxt);
+}
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 9704918..004db77 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -385,6 +385,8 @@ extern gimple_opt_pass *make_pass_slp_vectorize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_complete_unroll (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_complete_unrolli (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_parallelize_loops (gcc::context *ctxt);
+extern gimple_opt_pass *
+  make_pass_parallelize_loops_oacc_kernels (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_loop_prefetch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_iv_optimize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_loop_done (gcc::context *ctxt);
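
The single-gang neutering done by oacc_entry_exit_single_gang in the patch above can be illustrated outside the compiler. The sketch below is only a model written under assumptions: NUM_GANGS, region_entry and run_region are hypothetical names, and the gangs are simulated with a plain loop on the host. It shows the intended effect of guarding a store outside the parallelized loop with gang_pos == 0: every gang runs the entry/exit code, but the store happens once.

```c
#include <assert.h>

/* Hypothetical number of gangs; in the real pass this is decided at
   launch time, not by a macro.  */
#define NUM_GANGS 4

static int shared_value;   /* memory written by the entry/exit code */
static int store_count;    /* how many gangs actually stored */

/* Models code in the kernels region but outside the parallelized loop.
   All gangs execute it; the store is guarded the way the pass guards
   it with an IFN_GOACC_DIM_POS result compared against zero.  */
static void
region_entry (int gang_pos, int v)
{
  if (gang_pos == 0)
    {
      shared_value = v;
      store_count++;
    }
}

/* Simulate all gangs running the region entry; return the number of
   stores that were performed.  */
int
run_region (int v)
{
  store_count = 0;
  for (int g = 0; g < NUM_GANGS; g++)
    region_entry (g, v);
  assert (store_count == 1);
  return store_count;
}
```

Reduction stores are the exception in the patch: they are skipped by the neutering, since each gang has to contribute its partial result.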


* [PING][PATCH, 12/16] Handle acc loop directive
  2015-11-09 20:06 ` [PATCH, 12/16] Handle acc loop directive Tom de Vries
@ 2015-11-24 12:30   ` Tom de Vries
  2016-01-18 14:27     ` [PING^2][PATCH, " Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-24 12:30 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

On 09/11/15 21:06, Tom de Vries wrote:
> On 09/11/15 16:35, Tom de Vries wrote:
>> Hi,
>>
>> this patch series for stage1 trunk adds support to:
>> - parallelize oacc kernels regions using parloops, and
>> - map the loops onto the oacc gang dimension.
>>
>> The patch series contains these patches:
>>
>>       1    Insert new exit block only when needed in
>>          transform_to_exit_first_loop_alt
>>       2    Make create_parallel_loop return void
>>       3    Ignore reduction clause on kernels directive
>>       4    Implement -foffload-alias
>>       5    Add in_oacc_kernels_region in struct loop
>>       6    Add pass_oacc_kernels
>>       7    Add pass_dominator_oacc_kernels
>>       8    Add pass_ch_oacc_kernels
>>       9    Add pass_parallelize_loops_oacc_kernels
>>      10    Add pass_oacc_kernels pass group in passes.def
>>      11    Update testcases after adding kernels pass group
>>      12    Handle acc loop directive
>>      13    Add c-c++-common/goacc/kernels-*.c
>>      14    Add gfortran.dg/goacc/kernels-*.f95
>>      15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>      16    Add libgomp.oacc-fortran/kernels-*.f95
>>
>> The first 9 patches are more or less independent, but patches 10-16 are
>> intended to be committed at the same time.
>>
>> Bootstrapped and reg-tested on x86_64.
>>
>> Build and reg-tested with nvidia accelerator, in combination with a
>> patch that enables accelerator testing (which is submitted at
>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>
>> I'll post the individual patches in reply to this message.
>
> this patch deals with loops in an oacc kernels region which are
> annotated using "#pragma acc loop". It expands such a loop as a normal
> loop, which has the effect of ignoring the "#pragma acc loop".
>

Ping.

Thanks,
- Tom
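
As an illustration of the kind of input this patch is about, here is a minimal sketch of a kernels region whose loop carries "#pragma acc loop". The array names, N, and the helper check_vec_add are made up for this example and are not taken from the testsuite. With the patch, the annotated loop is expanded as a normal loop; when built without -fopenacc the pragmas are ignored altogether, so the function computes the same result either way.

```c
#include <assert.h>

#define N 1024

/* Illustrative data; names are hypothetical.  */
static unsigned int a[N], b[N], c[N];

/* Vector add in a kernels region with an explicitly annotated loop.  */
void
vec_add (void)
{
#pragma acc kernels copyin (a, b) copyout (c)
  {
#pragma acc loop
    for (int i = 0; i < N; i++)
      c[i] = a[i] + b[i];
  }
}

/* Fill the inputs, run the region, and return the number of
   mismatching elements (expected to be zero).  */
int
check_vec_add (void)
{
  for (int i = 0; i < N; i++)
    {
      a[i] = (unsigned int) i;
      b[i] = 2u * (unsigned int) i;
      c[i] = 0;
    }

  vec_add ();

  int mismatches = 0;
  for (int i = 0; i < N; i++)
    if (c[i] != a[i] + b[i])
      mismatches++;

  assert (mismatches == 0);
  return mismatches;
}
```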


* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-24 12:22                     ` Tom de Vries
@ 2015-11-24 13:19                       ` Richard Biener
  2015-11-24 14:33                         ` Tom de Vries
  2015-11-25 10:44                       ` Richard Biener
  1 sibling, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-24 13:19 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Tue, 24 Nov 2015, Tom de Vries wrote:

> On 23/11/15 11:02, Richard Biener wrote:
> > On Fri, 20 Nov 2015, Tom de Vries wrote:
> > 
> > > On 20/11/15 14:29, Richard Biener wrote:
> > > > I agree it's somewhat of an odd behavior but all passes should
> > > > either be placed in a sub-pipeline with an outer
> > > > loop_optimizer_init()/finalize () call or call both themselves.
> > > 
> > > Hmm, but adding loop_optimizer_finalize at the end of pass_lim breaks the
> > > loop
> > > pipeline.
> > > 
> > > We could use the style used in pass_slp_vectorize::execute:
> > > ...
> > > pass_slp_vectorize::execute (function *fun)
> > > {
> > >    basic_block bb;
> > > 
> > >    bool in_loop_pipeline = scev_initialized_p ();
> > >    if (!in_loop_pipeline)
> > >      {
> > >        loop_optimizer_init (LOOPS_NORMAL);
> > >        scev_initialize ();
> > >      }
> > > 
> > >    ...
> > > 
> > >    if (!in_loop_pipeline)
> > >      {
> > >        scev_finalize ();
> > >        loop_optimizer_finalize ();
> > >      }
> > > ...
> > > 
> > > Although that doesn't strike me as particularly clean.
> > 
> > At least it would be a consistent "unclean" style.  So yes, the
> > above would work for me.
> > 
> 
> Reposting using the in_loop_pipeline style in pass_lim.

The tree-ssa-loop-im.c changes are ok (I suppose the other changes
are in the other patch you posted as well).

Thanks,
Richard.
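
The in_loop_pipeline style agreed on above can be sketched in isolation. This is a mock, not GCC code: mock_loop_init, mock_loop_fini and the counters are stand-ins invented for the example, playing the role of loop_optimizer_init/scev_initialize and their finalize counterparts. The point it shows is that a pass only sets up and tears down the loop infrastructure when it is not already running inside a pipeline that did so.

```c
#include <assert.h>
#include <stdbool.h>

/* Mock state standing in for "scev is initialized"; NOT the GCC API.  */
static bool mock_scev_initialized;
static int mock_init_calls;
static int mock_fini_calls;

static void
mock_loop_init (void)
{
  mock_scev_initialized = true;
  mock_init_calls++;
}

static void
mock_loop_fini (void)
{
  mock_scev_initialized = false;
  mock_fini_calls++;
}

/* A pass body written in the in_loop_pipeline style.  */
void
mock_pass_execute (void)
{
  bool in_loop_pipeline = mock_scev_initialized;
  if (!in_loop_pipeline)
    mock_loop_init ();

  /* ... the actual work of the pass would go here ...  */

  if (!in_loop_pipeline)
    mock_loop_fini ();
}

/* Run the pass standalone and then inside a mock pipeline.
   Returns 1 if init/fini were called exactly as expected.  */
int
check_guard (void)
{
  mock_scev_initialized = false;
  mock_init_calls = mock_fini_calls = 0;

  /* Standalone: the pass initializes and finalizes by itself.  */
  mock_pass_execute ();
  if (mock_init_calls != 1 || mock_fini_calls != 1 || mock_scev_initialized)
    return 0;

  /* Inside a pipeline: the outer level owns init and finalize.  */
  mock_loop_init ();
  mock_pass_execute ();
  if (mock_init_calls != 2 || mock_fini_calls != 1 || !mock_scev_initialized)
    return 0;
  mock_loop_fini ();

  return mock_fini_calls == 2;
}
```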


* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-24 13:19                       ` Richard Biener
@ 2015-11-24 14:33                         ` Tom de Vries
  2015-11-24 14:36                           ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-24 14:33 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

On 24/11/15 14:13, Richard Biener wrote:
> On Tue, 24 Nov 2015, Tom de Vries wrote:
>
>> On 23/11/15 11:02, Richard Biener wrote:
>>> On Fri, 20 Nov 2015, Tom de Vries wrote:
>>>
>>>> On 20/11/15 14:29, Richard Biener wrote:
>>>>> I agree it's somewhat of an odd behavior but all passes should
>>>>> either be placed in a sub-pipeline with an outer
>>>>> loop_optimizer_init()/finalize () call or call both themselves.
>>>>
>>>> Hmm, but adding loop_optimizer_finalize at the end of pass_lim breaks the
>>>> loop pipeline.
>>>>
>>>> We could use the style used in pass_slp_vectorize::execute:
>>>> ...
>>>> pass_slp_vectorize::execute (function *fun)
>>>> {
>>>>    basic_block bb;
>>>>
>>>>    bool in_loop_pipeline = scev_initialized_p ();
>>>>    if (!in_loop_pipeline)
>>>>      {
>>>>        loop_optimizer_init (LOOPS_NORMAL);
>>>>        scev_initialize ();
>>>>      }
>>>>
>>>>    ...
>>>>
>>>>    if (!in_loop_pipeline)
>>>>      {
>>>>        scev_finalize ();
>>>>        loop_optimizer_finalize ();
>>>>      }
>>>> ...
>>>>
>>>> Although that doesn't strike me as particularly clean.
>>>
>>> At least it would be a consistent "unclean" style.  So yes, the
>>> above would work for me.
>>>
>>
>> Reposting using the in_loop_pipeline style in pass_lim.
> The tree-ssa-loop-im.c changes are ok

OK, I'll commit those.

> (I suppose the other changes
> are in the other patch you posted as well).

This ( https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02882.html ) patch 
contains changes related to adding pass_oacc_kernels2. Are those the 
"other changes" you're referring to?

Thanks,
- Tom


* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-24 14:33                         ` Tom de Vries
@ 2015-11-24 14:36                           ` Richard Biener
  2015-11-24 15:05                             ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-24 14:36 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Tue, 24 Nov 2015, Tom de Vries wrote:

> On 24/11/15 14:13, Richard Biener wrote:
> > On Tue, 24 Nov 2015, Tom de Vries wrote:
> > 
> > > On 23/11/15 11:02, Richard Biener wrote:
> > > > On Fri, 20 Nov 2015, Tom de Vries wrote:
> > > >
> > > > > On 20/11/15 14:29, Richard Biener wrote:
> > > > > > I agree it's somewhat of an odd behavior but all passes should
> > > > > > either be placed in a sub-pipeline with an outer
> > > > > > loop_optimizer_init()/finalize () call or call both themselves.
> > > > >
> > > > > Hmm, but adding loop_optimizer_finalize at the end of pass_lim
> > > > > breaks the loop pipeline.
> > > > >
> > > > > We could use the style used in pass_slp_vectorize::execute:
> > > > > ...
> > > > > pass_slp_vectorize::execute (function *fun)
> > > > > {
> > > > >    basic_block bb;
> > > > >
> > > > >    bool in_loop_pipeline = scev_initialized_p ();
> > > > >    if (!in_loop_pipeline)
> > > > >      {
> > > > >        loop_optimizer_init (LOOPS_NORMAL);
> > > > >        scev_initialize ();
> > > > >      }
> > > > >
> > > > >    ...
> > > > >
> > > > >    if (!in_loop_pipeline)
> > > > >      {
> > > > >        scev_finalize ();
> > > > >        loop_optimizer_finalize ();
> > > > >      }
> > > > > ...
> > > > >
> > > > > Although that doesn't strike me as particularly clean.
> > > >
> > > > At least it would be a consistent "unclean" style.  So yes, the
> > > > above would work for me.
> > > >
> > >
> > > Reposting using the in_loop_pipeline style in pass_lim.
> > The tree-ssa-loop-im.c changes are ok
> 
> OK, I'll commit those.
> 
> > (I suppose the other changes
> > are in the other patch you posted as well).
> 
> This ( https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02882.html ) patch
> contains changes related to adding pass_oacc_kernels2. Are those the "other
> changes" you're referring to?

No, the other patch adding oacc_kernels pass group to passes.def.

Btw, at some point splitting patches too much becomes very much
confusing instead of helping.

Richard.


* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-24 14:36                           ` Richard Biener
@ 2015-11-24 15:05                             ` Tom de Vries
  2015-11-25 10:43                               ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-24 15:05 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

On 24/11/15 15:33, Richard Biener wrote:
> On Tue, 24 Nov 2015, Tom de Vries wrote:
>
>> On 24/11/15 14:13, Richard Biener wrote:
>>> On Tue, 24 Nov 2015, Tom de Vries wrote:
>>>
>>>> On 23/11/15 11:02, Richard Biener wrote:
>>>>> On Fri, 20 Nov 2015, Tom de Vries wrote:
>>>>>
>>>>>> On 20/11/15 14:29, Richard Biener wrote:
>>>>>>> I agree it's somewhat of an odd behavior but all passes should
>>>>>>> either be placed in a sub-pipeline with an outer
>>>>>>> loop_optimizer_init()/finalize () call or call both themselves.
>>>>>>
>>>>>> Hmm, but adding loop_optimizer_finalize at the end of pass_lim
>>>>>> breaks the loop pipeline.
>>>>>>
>>>>>> We could use the style used in pass_slp_vectorize::execute:
>>>>>> ...
>>>>>> pass_slp_vectorize::execute (function *fun)
>>>>>> {
>>>>>>    basic_block bb;
>>>>>>
>>>>>>    bool in_loop_pipeline = scev_initialized_p ();
>>>>>>    if (!in_loop_pipeline)
>>>>>>      {
>>>>>>        loop_optimizer_init (LOOPS_NORMAL);
>>>>>>        scev_initialize ();
>>>>>>      }
>>>>>>
>>>>>>    ...
>>>>>>
>>>>>>    if (!in_loop_pipeline)
>>>>>>      {
>>>>>>        scev_finalize ();
>>>>>>        loop_optimizer_finalize ();
>>>>>>      }
>>>>>> ...
>>>>>>
>>>>>> Although that doesn't strike me as particularly clean.
>>>>>
>>>>> At least it would be a consistent "unclean" style.  So yes, the
>>>>> above would work for me.
>>>>>
>>>>
>>>> Reposting using the in_loop_pipeline style in pass_lim.
>>> The tree-ssa-loop-im.c changes are ok
>>
>> OK, I'll commit those.
>>
>>> (I suppose the other changes
>>> are in the other patch you posted as well).
>>
>> This ( https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02882.html ) patch
>> contains changes related to adding pass_oacc_kernels2. Are those the "other
>> changes" you're referring to?
>
> No, the other patch adding oacc_kernels pass group to passes.def.
>

I don't understand. There's only one patch adding oacc_kernels pass 
group to passes.def (which is the one in this thread).

> Btw, at some point splitting patches too much becomes very much
> confusing instead of helping.

Would it help if I merge "Add pass_oacc_kernels" with this patch?

Thanks,
- Tom


* Re: [PATCH, 6/16] Add pass_oacc_kernels
  2015-11-24 12:17       ` Tom de Vries
@ 2015-11-25 10:42         ` Richard Biener
  0 siblings, 0 replies; 133+ messages in thread
From: Richard Biener @ 2015-11-25 10:42 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Tue, 24 Nov 2015, Tom de Vries wrote:

> On 19/11/15 14:50, Tom de Vries wrote:
> > On 11/11/15 11:58, Richard Biener wrote:
> > > On Mon, 9 Nov 2015, Tom de Vries wrote:
> > > 
> > > > On 09/11/15 16:35, Tom de Vries wrote:
> > > > > Hi,
> > > > > 
> > > > > this patch series for stage1 trunk adds support to:
> > > > > - parallelize oacc kernels regions using parloops, and
> > > > > - map the loops onto the oacc gang dimension.
> > > > > 
> > > > > The patch series contains these patches:
> > > > > 
> > > > >        1    Insert new exit block only when needed in
> > > > >           transform_to_exit_first_loop_alt
> > > > >        2    Make create_parallel_loop return void
> > > > >        3    Ignore reduction clause on kernels directive
> > > > >        4    Implement -foffload-alias
> > > > >        5    Add in_oacc_kernels_region in struct loop
> > > > >        6    Add pass_oacc_kernels
> > > > >        7    Add pass_dominator_oacc_kernels
> > > > >        8    Add pass_ch_oacc_kernels
> > > > >        9    Add pass_parallelize_loops_oacc_kernels
> > > > >       10    Add pass_oacc_kernels pass group in passes.def
> > > > >       11    Update testcases after adding kernels pass group
> > > > >       12    Handle acc loop directive
> > > > >       13    Add c-c++-common/goacc/kernels-*.c
> > > > >       14    Add gfortran.dg/goacc/kernels-*.f95
> > > > >       15    Add libgomp.oacc-c-c++-common/kernels-*.c
> > > > >       16    Add libgomp.oacc-fortran/kernels-*.f95
> > > > > 
> > > > > The first 9 patches are more or less independent, but patches 10-16
> > > > > are
> > > > > intended to be committed at the same time.
> > > > > 
> > > > > Bootstrapped and reg-tested on x86_64.
> > > > > 
> > > > > Build and reg-tested with nvidia accelerator, in combination with a
> > > > > patch that enables accelerator testing (which is submitted at
> > > > > https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
> > > > > 
> > > > > I'll post the individual patches in reply to this message.
> > > > 
> > > > this patch adds a pass group pass_oacc_kernels (which will be added
> > > > to the pass list as a whole in patch 10).
> > > 
> > > Just to understand (while also skimming the HSA patches).
> > > 
> > > You are basically relying on autopar for what the HSA patches call
> > > "gridification"?  That is, OMP lowering produces loopy kernels
> > > and autopar then will basically strip the outermost loop?
> > 
> > Short answer: no. In more detail...
> <SNIP>
> 
> Reposting patch, after splitting the pass group into two.

Ok.

Richard.

> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-24 15:05                             ` Tom de Vries
@ 2015-11-25 10:43                               ` Richard Biener
  0 siblings, 0 replies; 133+ messages in thread
From: Richard Biener @ 2015-11-25 10:43 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Tue, 24 Nov 2015, Tom de Vries wrote:

> On 24/11/15 15:33, Richard Biener wrote:
> > On Tue, 24 Nov 2015, Tom de Vries wrote:
> > 
> > > On 24/11/15 14:13, Richard Biener wrote:
> > > > On Tue, 24 Nov 2015, Tom de Vries wrote:
> > > > 
> > > > > On 23/11/15 11:02, Richard Biener wrote:
> > > > > > On Fri, 20 Nov 2015, Tom de Vries wrote:
> > > > > >
> > > > > > > On 20/11/15 14:29, Richard Biener wrote:
> > > > > > > > I agree it's somewhat of an odd behavior but all passes should
> > > > > > > > either be placed in a sub-pipeline with an outer
> > > > > > > > loop_optimizer_init()/finalize () call or call both themselves.
> > > > > > >
> > > > > > > Hmm, but adding loop_optimizer_finalize at the end of pass_lim
> > > > > > > breaks the loop pipeline.
> > > > > > >
> > > > > > > We could use the style used in pass_slp_vectorize::execute:
> > > > > > > ...
> > > > > > > pass_slp_vectorize::execute (function *fun)
> > > > > > > {
> > > > > > >    basic_block bb;
> > > > > > >
> > > > > > >    bool in_loop_pipeline = scev_initialized_p ();
> > > > > > >    if (!in_loop_pipeline)
> > > > > > >      {
> > > > > > >        loop_optimizer_init (LOOPS_NORMAL);
> > > > > > >        scev_initialize ();
> > > > > > >      }
> > > > > > >
> > > > > > >    ...
> > > > > > >
> > > > > > >    if (!in_loop_pipeline)
> > > > > > >      {
> > > > > > >        scev_finalize ();
> > > > > > >        loop_optimizer_finalize ();
> > > > > > >      }
> > > > > > > ...
> > > > > > >
> > > > > > > Although that doesn't strike me as particularly clean.
> > > > > >
> > > > > > At least it would be a consistent "unclean" style.  So yes, the
> > > > > > above would work for me.
> > > > > >
> > > > >
> > > > > Reposting using the in_loop_pipeline style in pass_lim.
> > > > The tree-ssa-loop-im.c changes are ok
> > > 
> > > OK, I'll commit those.
> > > 
> > > > (I suppose the other changes
> > > > are in the other patch you posted as well).
> > > 
> > > This ( https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02882.html ) patch
> > > contains changes related to adding pass_oacc_kernels2. Are those the
> > > "other
> > > changes" you're referring to?
> > 
> > No, the other patch adding oacc_kernels pass group to passes.def.
> > 
> 
> I don't understand. There's only one patch adding oacc_kernels pass group to
> passes.def (which is the one in this thread).
> 
> > Btw, at some point splitting patches too much becomes very much
> > confusing instead of helping.
> 
> Would it help if I merge "Add pass_oacc_kernels" with this patch?

It would have, yes.  As said, the excessive splitting just confuses
the review process.  Will review in the present state anyway.

Richard.

> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-24 12:22                     ` Tom de Vries
  2015-11-24 13:19                       ` Richard Biener
@ 2015-11-25 10:44                       ` Richard Biener
  2015-11-30 17:48                         ` [gomp4] " Thomas Schwinge
  1 sibling, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-25 10:44 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Tue, 24 Nov 2015, Tom de Vries wrote:

> On 23/11/15 11:02, Richard Biener wrote:
> > On Fri, 20 Nov 2015, Tom de Vries wrote:
> > 
> > > On 20/11/15 14:29, Richard Biener wrote:
> > > > I agree it's somewhat of an odd behavior but all passes should
> > > > either be placed in a sub-pipeline with an outer
> > > > loop_optimizer_init()/finalize () call or call both themselves.
> > > 
> > > Hmm, but adding loop_optimizer_finalize at the end of pass_lim breaks the
> > > loop
> > > pipeline.
> > > 
> > > We could use the style used in pass_slp_vectorize::execute:
> > > ...
> > > pass_slp_vectorize::execute (function *fun)
> > > {
> > >    basic_block bb;
> > > 
> > >    bool in_loop_pipeline = scev_initialized_p ();
> > >    if (!in_loop_pipeline)
> > >      {
> > >        loop_optimizer_init (LOOPS_NORMAL);
> > >        scev_initialize ();
> > >      }
> > > 
> > >    ...
> > > 
> > >    if (!in_loop_pipeline)
> > >      {
> > >        scev_finalize ();
> > >        loop_optimizer_finalize ();
> > >      }
> > > ...
> > > 
> > > Although that doesn't strike me as particularly clean.
> > 
> > At least it would be a consistent "unclean" style.  So yes, the
> > above would work for me.
> > 
> 
> Reposting using the in_loop_pipeline style in pass_lim.

Ok.

Thanks,
Richard.


* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-23 11:46                   ` Richard Biener
@ 2015-11-27 11:44                     ` Tom de Vries
  2015-11-27 12:14                       ` Tom de Vries
  2015-12-02  9:46                       ` Jakub Jelinek
  0 siblings, 2 replies; 133+ messages in thread
From: Tom de Vries @ 2015-11-27 11:44 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jakub Jelinek, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 4227 bytes --]

On 23/11/15 12:41, Richard Biener wrote:
> On Sat, 21 Nov 2015, Tom de Vries wrote:
>
>> >On 13/11/15 12:39, Jakub Jelinek wrote:
>>> > >On Fri, Nov 13, 2015 at 12:29:51PM +0100, Richard Biener wrote:
>>>>> > > > >thanks for the explanation. Filed as PR68331 - '[meta-bug] fipa-pta
>>>>> > > > >issues'.
>>>>> > > > >
>>>>> > > > >Any feedback on the '#pragma GCC offload-alias=<none|pointer|all>' bit
>>>>> > > > >above?
>>>>> > > > >Is that sort of what you had in mind?
>>>> > > >
>>>> > > >Yes.  Whether that makes sense is another question of course.  You can
>>>> > > >annotate memory references with MR_DEPENDENCE_BASE/CLIQUE yourself
>>>> > > >as well if you know dependences without the users intervention.
>>> > >
>>> > >I really don't like even the GCC offload-alias, I just don't see anything
>>> > >special on the offload code.  Not to mention that the same issue is already
>>> > >with other outlined functions, like OpenMP tasks or parallel regions, those
>>> > >aren't offloaded, yet they can suffer from worse alias/points-to analysis
>>> > >too.
>> >
>> >AFAIU there is one aspect that is different for offloaded code: the setup of
>> >the data on the device.
>> >
>> >Consider this example:
>> >...
>> >unsigned int a[N];
>> >unsigned int b[N];
>> >unsigned int c[N];
>> >
>> >int
>> >main (void)
>> >{
>> >   ...
>> >
>> >#pragma acc kernels copyin (a) copyin (b) copyout (c)
>> >   {
>> >     for (COUNTERTYPE ii = 0; ii < N; ii++)
>> >       c[ii] = a[ii] + b[ii];
>> >   }
>> >
>> >   ...
>> >...
>> >
>> >At gimple level, we have:
>> >...
>> >#pragma omp target oacc_kernels \
>> >   map(force_from:c [len: 2097152]) \
>> >   map(force_to:b [len: 2097152]) \
>> >   map(force_to:a [len: 2097152])
>> >...
>> >
>> >[ The meaning of the force_from/force_to mappings is given in
>> >include/gomp-constants.h:
>> >...
>> >     /* Allocate.  */
>> >     GOMP_MAP_FORCE_ALLOC = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_ALLOC),
>> >     /* ..., and copy to device.  */
>> >     GOMP_MAP_FORCE_TO = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_TO),
>> >     /* ..., and copy from device.  */
>> >     GOMP_MAP_FORCE_FROM = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_FROM),
>> >     /* ..., and copy to and from device.  */
>> >     GOMP_MAP_FORCE_TOFROM = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_TOFROM),
>> >...  ]
>> >
>> >So before calling the offloaded function, a separate alloc is done for a, b
>> >and c, and the base pointers of the newly allocated objects are passed to the
>> >offloaded function.
>> >
>> >This means we can mark those base pointers as restrict in the offloaded
>> >function.
>> >
>> >Attached proof-of-concept patch implements that.
>> >
>>> > >We simply have some compiler internal interface between the caller and
>>> > >callee of the outlined regions, each interface in between those has
>>> > >its own structure type used to communicate the info;
>>> > >we can attach attributes on the fields, or some flags to indicate some
>>> > >properties interesting from aliasing POV.
>>> > >We don't really need to perform
>>> > >full IPA-PTA, perhaps it would be enough to a) record somewhere in cgraph
>>> > >the relationship in between such callers and callees (for offloading regions
>>> > >we already have "omp target entrypoint" attribute on the callee and a
>>> > >single caller), tell LTO if possible not to split those into different
>>> > >partitions if easily possible, and then just for these pairs perform
>>> > >aliasing/points-to analysis in the caller and the result record using
>>> > >cliques/special attributes/whatever to the callee side, so that the callee
>>> > >(outlined OpenMP/OpenACC/Cilk+ region) can then improve its alias analysis.
>> >
>> >As a start, is the approach of this patch OK?
> Works for me but leaving to Jakub to review for correctness.

Attached patch is a complete version:
- added ChangeLog
- added missing function header comments
- moved analysis to separate function
   omp_target_base_pointers_restrict_p
- added example in comment before analysis
- fixed error in omp_target_base_pointers_restrict_p where I was using
   GOMP_MAP_ALLOC but should have been using GOMP_MAP_FORCE_ALLOC
- added testcases
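
To make the check concrete, here is a minimal standalone C sketch of the
analysis. It is not the patch itself: the enum and struct names are
hypothetical stand-ins for the tree clause accessors and the GOMP_MAP_*
kinds, and the flag_openacc test is left out. It only shows that a single
non-force mapping (such as a plain alloc) is enough to reject the restrict
marking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for OMP_CLAUSE_CODE and OMP_CLAUSE_MAP_KIND
   values; the real ones live in the GCC tree headers and in
   include/gomp-constants.h.  */
enum clause_code { CLAUSE_MAP, CLAUSE_OTHER };
enum map_kind
{
  MAP_ALLOC,
  MAP_PRESENT,
  MAP_FORCE_ALLOC,
  MAP_FORCE_TO,
  MAP_FORCE_FROM,
  MAP_FORCE_TOFROM
};

struct clause
{
  enum clause_code code;
  enum map_kind kind;
};

/* Model of the analysis: return true only if every clause is a map
   clause with one of the four force kinds.  An empty clause list
   yields true, because the loop body never runs.  */
static bool
base_pointers_restrict_p (const struct clause *clauses, size_t n)
{
  assert (clauses != NULL || n == 0);

  for (size_t i = 0; i < n; i++)
    {
      if (clauses[i].code != CLAUSE_MAP)
        return false;

      switch (clauses[i].kind)
        {
        case MAP_FORCE_ALLOC:
        case MAP_FORCE_TO:
        case MAP_FORCE_FROM:
        case MAP_FORCE_TOFROM:
          break;
        default:
          /* Plain alloc, present, ...: the data may already live on
             the device, so no fresh allocation is guaranteed.  */
          return false;
        }
    }

  return true;
}
```

In this model, 'copyin (a) copyout (c)' corresponds to { MAP_FORCE_TO,
MAP_FORCE_FROM } and is accepted, while replacing either entry with
MAP_ALLOC makes the function return false.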

Bootstrapped and reg-tested on x86_64.

OK for stage3 trunk?

Thanks,
- Tom


[-- Attachment #2: 0001-Mark-pointers-to-allocated-target-vars-as-restricted-if-possible.patch --]
[-- Type: text/x-patch, Size: 13735 bytes --]

Mark pointers to allocated target vars as restricted, if possible

2015-11-26  Tom de Vries  <tom@codesourcery.com>

	* omp-low.c (install_var_field_1): New function, factored out of ...
	(install_var_field): ... here.
	(scan_sharing_clauses_1): New function, factored out of ...
	(scan_sharing_clauses): ... here.
	(omp_target_base_pointers_restrict_p): New function.
	(scan_omp_target): Call scan_sharing_clauses_1 instead of
	scan_sharing_clauses, with base_pointers_restrict arg.

	* c-c++-common/goacc/kernels-alias-2.c: New test.
	* c-c++-common/goacc/kernels-alias-3.c: New test.
	* c-c++-common/goacc/kernels-alias-4.c: New test.
	* c-c++-common/goacc/kernels-alias-5.c: New test.
	* c-c++-common/goacc/kernels-alias-6.c: New test.
	* c-c++-common/goacc/kernels-alias-7.c: New test.
	* c-c++-common/goacc/kernels-alias-8.c: New test.
	* c-c++-common/goacc/kernels-alias.c: New test.

---
 gcc/omp-low.c                                      | 109 +++++++++++++++++++--
 gcc/testsuite/c-c++-common/goacc/kernels-alias-2.c |  27 +++++
 gcc/testsuite/c-c++-common/goacc/kernels-alias-3.c |  20 ++++
 gcc/testsuite/c-c++-common/goacc/kernels-alias-4.c |  22 +++++
 gcc/testsuite/c-c++-common/goacc/kernels-alias-5.c |  19 ++++
 gcc/testsuite/c-c++-common/goacc/kernels-alias-6.c |  23 +++++
 gcc/testsuite/c-c++-common/goacc/kernels-alias-7.c |  25 +++++
 gcc/testsuite/c-c++-common/goacc/kernels-alias-8.c |  22 +++++
 gcc/testsuite/c-c++-common/goacc/kernels-alias.c   |  29 ++++++
 9 files changed, 289 insertions(+), 7 deletions(-)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 0d4c6e5..6843c49 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -1366,10 +1366,12 @@ build_sender_ref (tree var, omp_context *ctx)
   return build_sender_ref ((splay_tree_key) var, ctx);
 }
 
-/* Add a new field for VAR inside the structure CTX->SENDER_DECL.  */
+/* Add a new field for VAR inside the structure CTX->SENDER_DECL.  If
+   BASE_POINTERS_RESTRICT, declare the field with restrict.  */
 
 static void
-install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
+install_var_field_1 (tree var, bool by_ref, int mask, omp_context *ctx,
+		     bool base_pointers_restrict)
 {
   tree field, type, sfield = NULL_TREE;
   splay_tree_key key = (splay_tree_key) var;
@@ -1393,7 +1395,11 @@ install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
       type = build_pointer_type (build_pointer_type (type));
     }
   else if (by_ref)
-    type = build_pointer_type (type);
+    {
+      type = build_pointer_type (type);
+      if (base_pointers_restrict)
+	type = build_qualified_type (type, TYPE_QUAL_RESTRICT);
+    }
   else if ((mask & 3) == 1 && is_reference (var))
     type = TREE_TYPE (type);
 
@@ -1457,6 +1463,14 @@ install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
     splay_tree_insert (ctx->sfield_map, key, (splay_tree_value) sfield);
 }
 
+/* As install_var_field_1, but with base_pointers_restrict == false.  */
+
+static void
+install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
+{
+  install_var_field_1 (var, by_ref, mask, ctx, false);
+}
+
 static tree
 install_var_local (tree var, omp_context *ctx)
 {
@@ -1810,10 +1824,12 @@ fixup_child_record_type (omp_context *ctx)
 }
 
 /* Instantiate decls as necessary in CTX to satisfy the data sharing
-   specified by CLAUSES.  */
+   specified by CLAUSES.  If BASE_POINTERS_RESTRICT, install var field with
+   restrict.  */
 
 static void
-scan_sharing_clauses (tree clauses, omp_context *ctx)
+scan_sharing_clauses_1 (tree clauses, omp_context *ctx,
+			bool base_pointers_restrict)
 {
   tree c, decl;
   bool scan_array_reductions = false;
@@ -2070,7 +2086,8 @@ scan_sharing_clauses (tree clauses, omp_context *ctx)
 		      && TREE_CODE (TREE_TYPE (decl)) == ARRAY_TYPE)
 		    install_var_field (decl, true, 7, ctx);
 		  else
-		    install_var_field (decl, true, 3, ctx);
+		    install_var_field_1 (decl, true, 3, ctx,
+					 base_pointers_restrict);
 		  if (is_gimple_omp_offloaded (ctx->stmt))
 		    install_var_local (decl, ctx);
 		}
@@ -2336,6 +2353,14 @@ scan_sharing_clauses (tree clauses, omp_context *ctx)
 	scan_omp (&OMP_CLAUSE_LINEAR_GIMPLE_SEQ (c), ctx);
 }
 
+/* As scan_sharing_clauses_1, but with base_pointers_restrict == false.  */
+
+static void
+scan_sharing_clauses (tree clauses, omp_context *ctx)
+{
+  scan_sharing_clauses_1 (clauses, ctx, false);
+}
+
 /* Create a new name for omp child function.  Returns an identifier.  If
    IS_CILK_FOR is true then the suffix for the child function is
    "_cilk_for_fn."  */
@@ -3032,6 +3057,68 @@ scan_omp_single (gomp_single *stmt, omp_context *outer_ctx)
     layout_type (ctx->record_type);
 }
 
+/* Return true if the CLAUSES of an omp target guarantee that the base pointers
+   used in the corresponding offloaded function are restrict.  */
+
+static bool
+omp_target_base_pointers_restrict_p (tree clauses)
+{
+  /* The analysis relies on the GOMP_MAP_FORCE_* mapping kinds, which are only
+     used by OpenACC.  */
+  if (flag_openacc == 0)
+    return false;
+
+  /* I.  Basic example:
+
+       void foo (void)
+       {
+	 unsigned int a[2], b[2];
+
+	 #pragma acc kernels \
+	   copyout (a) \
+	   copyout (b)
+	 {
+	   a[0] = 0;
+	   b[0] = 1;
+	 }
+       }
+
+     After gimplification, we have:
+
+       #pragma omp target oacc_kernels \
+	 map(force_from:a [len: 8]) \
+	 map(force_from:b [len: 8])
+       {
+	 a[0] = 0;
+	 b[0] = 1;
+       }
+
+     Because both mappings have the force prefix, we know that they will be
+     allocated when calling the corresponding offloaded function, which means we
+     can mark the base pointers for a and b in the offloaded function as
+     restrict.  */
+
+  tree c;
+  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+    {
+      if (OMP_CLAUSE_CODE (c) != OMP_CLAUSE_MAP)
+	return false;
+
+      switch (OMP_CLAUSE_MAP_KIND (c))
+	{
+	case GOMP_MAP_FORCE_ALLOC:
+	case GOMP_MAP_FORCE_TO:
+	case GOMP_MAP_FORCE_FROM:
+	case GOMP_MAP_FORCE_TOFROM:
+	  break;
+	default:
+	  return false;
+	}
+    }
+
+  return true;
+}
+
 /* Scan a GIMPLE_OMP_TARGET.  */
 
 static void
@@ -3053,13 +3140,21 @@ scan_omp_target (gomp_target *stmt, omp_context *outer_ctx)
   DECL_NAMELESS (name) = 1;
   TYPE_NAME (ctx->record_type) = name;
   TYPE_ARTIFICIAL (ctx->record_type) = 1;
+
+  bool base_pointers_restrict = false;
   if (offloaded)
     {
       create_omp_child_function (ctx, false);
       gimple_omp_target_set_child_fn (stmt, ctx->cb.dst_fn);
+
+      base_pointers_restrict = omp_target_base_pointers_restrict_p (clauses);
+      if (base_pointers_restrict
+	  && dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file,
+		 "Base pointers in offloaded function are restrict\n");
     }
 
-  scan_sharing_clauses (clauses, ctx);
+  scan_sharing_clauses_1 (clauses, ctx, base_pointers_restrict);
   scan_omp (gimple_omp_body_ptr (stmt), ctx);
 
   if (TYPE_FIELDS (ctx->record_type) == NULL)
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-alias-2.c b/gcc/testsuite/c-c++-common/goacc/kernels-alias-2.c
new file mode 100644
index 0000000..d437c47
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-alias-2.c
@@ -0,0 +1,27 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-ealias-all" } */
+
+void
+foo (void)
+{
+  unsigned int a;
+  unsigned int b;
+  unsigned int c;
+  unsigned int d;
+
+#pragma acc kernels copyin (a) create (b) copyout (c) copy (d)
+  {
+    a = 0;
+    b = 0;
+    c = 0;
+    d = 0;
+  }
+}
+
+/* { dg-final { scan-tree-dump-times "clique 1 base 1" 4 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 2" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 3" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 4" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 5" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "(?n)clique .* base .*" 8 "ealias" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-alias-3.c b/gcc/testsuite/c-c++-common/goacc/kernels-alias-3.c
new file mode 100644
index 0000000..0eda7e1
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-alias-3.c
@@ -0,0 +1,20 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-ealias-all" } */
+
+void
+foo (void)
+{
+  unsigned int a;
+  unsigned int *p = &a;
+
+#pragma acc kernels pcopyin (a, p[0:1])
+  {
+    a = 0;
+    *p = 1;
+  }
+}
+
+/* Only the omp_data_i related loads should be annotated with cliques.  */
+/* { dg-final { scan-tree-dump-times "clique 1 base 1" 2 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "(?n)clique .* base .*" 2 "ealias" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-alias-4.c b/gcc/testsuite/c-c++-common/goacc/kernels-alias-4.c
new file mode 100644
index 0000000..037901f
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-alias-4.c
@@ -0,0 +1,22 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-ealias-all" } */
+
+#define N 2
+
+void
+foo (void)
+{
+  unsigned int a[N];
+  unsigned int *p = &a[0];
+
+#pragma acc kernels pcopyin (a, p[0:2])
+  {
+    a[0] = 0;
+    *p = 1;
+  }
+}
+
+/* Only the omp_data_i related loads should be annotated with cliques.  */
+/* { dg-final { scan-tree-dump-times "clique 1 base 1" 2 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "(?n)clique .* base .*" 2 "ealias" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-alias-5.c b/gcc/testsuite/c-c++-common/goacc/kernels-alias-5.c
new file mode 100644
index 0000000..69cd3fb
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-alias-5.c
@@ -0,0 +1,19 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-ealias-all" } */
+
+void
+foo (int *a)
+{
+  int *p = a;
+
+#pragma acc kernels pcopyin (a[0:1], p[0:1])
+  {
+    *a = 0;
+    *p = 1;
+  }
+}
+
+/* Only the omp_data_i related loads should be annotated with cliques.  */
+/* { dg-final { scan-tree-dump-times "clique 1 base 1" 2 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "(?n)clique .* base .*" 2 "ealias" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-alias-6.c b/gcc/testsuite/c-c++-common/goacc/kernels-alias-6.c
new file mode 100644
index 0000000..6ebce15
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-alias-6.c
@@ -0,0 +1,23 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-ealias-all" } */
+
+typedef __SIZE_TYPE__ size_t;
+extern void *acc_copyin (void *, size_t);
+
+void
+foo (void)
+{
+  int a = 0;
+  int *p = (int *)acc_copyin (&a, sizeof (a));
+
+#pragma acc kernels deviceptr (p) pcopy(a)
+  {
+    a = 0;
+    *p = 1;
+  }
+}
+
+/* Only the omp_data_i related loads should be annotated with cliques.  */
+/* { dg-final { scan-tree-dump-times "clique 1 base 1" 2 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "(?n)clique .* base .*" 2 "ealias" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-alias-7.c b/gcc/testsuite/c-c++-common/goacc/kernels-alias-7.c
new file mode 100644
index 0000000..40eb235
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-alias-7.c
@@ -0,0 +1,25 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-ealias-all" } */
+
+typedef __SIZE_TYPE__ size_t;
+extern void *acc_copyin (void *, size_t);
+
+#define N 2
+
+void
+foo (void)
+{
+  int a[N];
+  int *p = (int *)acc_copyin (&a[0], sizeof (a));
+
+#pragma acc kernels deviceptr (p) pcopy(a)
+  {
+    a[0] = 0;
+    *p = 1;
+  }
+}
+
+/* Only the omp_data_i related loads should be annotated with cliques.  */
+/* { dg-final { scan-tree-dump-times "clique 1 base 1" 2 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "(?n)clique .* base .*" 2 "ealias" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-alias-8.c b/gcc/testsuite/c-c++-common/goacc/kernels-alias-8.c
new file mode 100644
index 0000000..0b93e35
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-alias-8.c
@@ -0,0 +1,22 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-ealias-all" } */
+
+typedef __SIZE_TYPE__ size_t;
+extern void *acc_copyin (void *, size_t);
+
+void
+foo (int *a, size_t n)
+{
+  int *p = (int *)acc_copyin (&a, n);
+
+#pragma acc kernels deviceptr (p) pcopy(a[0:n])
+  {
+    a = 0;
+    *p = 1;
+  }
+}
+
+/* Only the omp_data_i related loads should be annotated with cliques.  */
+/* { dg-final { scan-tree-dump-times "clique 1 base 1" 2 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "(?n)clique .* base .*" 2 "ealias" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-alias.c b/gcc/testsuite/c-c++-common/goacc/kernels-alias.c
new file mode 100644
index 0000000..25821ab2
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-alias.c
@@ -0,0 +1,29 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-ealias-all" } */
+
+#define N 2
+
+void
+foo (void)
+{
+  unsigned int a[N];
+  unsigned int b[N];
+  unsigned int c[N];
+  unsigned int d[N];
+
+#pragma acc kernels copyin (a) create (b) copyout (c) copy (d)
+  {
+    a[0] = 0;
+    b[0] = 0;
+    c[0] = 0;
+    d[0] = 0;
+  }
+}
+
+/* { dg-final { scan-tree-dump-times "clique 1 base 1" 4 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 2" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 3" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 4" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 5" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "(?n)clique .* base .*" 8 "ealias" } } */
+


* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-27 11:44                     ` Tom de Vries
@ 2015-11-27 12:14                       ` Tom de Vries
  2015-12-02  9:59                         ` Jakub Jelinek
  2015-12-02  9:46                       ` Jakub Jelinek
  1 sibling, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-27 12:14 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jakub Jelinek, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 5720 bytes --]

On 27/11/15 12:42, Tom de Vries wrote:
> On 23/11/15 12:41, Richard Biener wrote:
>> On Sat, 21 Nov 2015, Tom de Vries wrote:
>>
>>> >On 13/11/15 12:39, Jakub Jelinek wrote:
>>>> > >On Fri, Nov 13, 2015 at 12:29:51PM +0100, Richard Biener wrote:
>>>>>> > > > >thanks for the explanation. Filed as PR68331 - '[meta-bug]
>>>>>> fipa-pta
>>>>>> > > > >issues'.
>>>>>> > > > >
>>>>>> > > > >Any feedback on the '#pragma GCC
>>>>>> offload-alias=<none|pointer|all>' bit
>>>>>> > > > >above?
>>>>>> > > > >Is that sort of what you had in mind?
>>>>> > > >
>>>>> > > >Yes.  Whether that makes sense is another question of course.
>>>>> You can
>>>>> > > >annotate memory references with MR_DEPENDENCE_BASE/CLIQUE
>>>>> yourself
>>>>> > > >as well if you know dependences without the users intervention.
>>>> > >
>>>> > >I really don't like even the GCC offload-alias, I just don't see
>>>> anything
>>>> > >special on the offload code.  Not to mention that the same issue
>>>> is already
>>>> > >with other outlined functions, like OpenMP tasks or parallel
>>>> regions, those
>>>> > >aren't offloaded, yet they can suffer from worse alias/points-to
>>>> analysis
>>>> > >too.
>>> >
>>> >AFAIU there is one aspect that is different for offloaded code: the
>>> setup of
>>> >the data on the device.
>>> >
>>> >Consider this example:
>>> >...
>>> >unsigned int a[N];
>>> >unsigned int b[N];
>>> >unsigned int c[N];
>>> >
>>> >int
>>> >main (void)
>>> >{
>>> >   ...
>>> >
>>> >#pragma acc kernels copyin (a) copyin (b) copyout (c)
>>> >   {
>>> >     for (COUNTERTYPE ii = 0; ii < N; ii++)
>>> >       c[ii] = a[ii] + b[ii];
>>> >   }
>>> >
>>> >   ...
>>> >...
>>> >
>>> >At gimple level, we have:
>>> >...
>>> >#pragma omp target oacc_kernels \
>>> >   map(force_from:c [len: 2097152]) \
>>> >   map(force_to:b [len: 2097152]) \
>>> >   map(force_to:a [len: 2097152])
>>> >...
>>> >
>>> >[ The meaning of the force_from/force_to mappings is given in
>>> >include/gomp-constants.h:
>>> >...
>>> >     /* Allocate.  */
>>> >     GOMP_MAP_FORCE_ALLOC = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_ALLOC),
>>> >     /* ..., and copy to device.  */
>>> >     GOMP_MAP_FORCE_TO = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_TO),
>>> >     /* ..., and copy from device.  */
>>> >     GOMP_MAP_FORCE_FROM = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_FROM),
>>> >     /* ..., and copy to and from device.  */
>>> >     GOMP_MAP_FORCE_TOFROM = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_TOFROM),
>>> >...  ]
>>> >
>>> >So before calling the offloaded function, a separate alloc is done
>>> for a, b
>>> >and c, and the base pointers of the newly allocated objects are
>>> passed to the
>>> >offloaded function.
>>> >
>>> >This means we can mark those base pointers as restrict in the offloaded
>>> >function.
>>> >
>>> >Attached proof-of-concept patch implements that.
>>> >
>>>> > >We simply have some compiler internal interface between the
>>>> caller and
>>>> > >callee of the outlined regions, each interface in between those has
>>>> > >its own structure type used to communicate the info;
>>>> > >we can attach attributes on the fields, or some flags to indicate
>>>> some
>>>> > >properties interesting from aliasing POV.
>>>> > >We don't really need to perform
>>>> > >full IPA-PTA, perhaps it would be enough to a) record somewhere
>>>> in cgraph
>>>> > >the relationship in between such callers and callees (for
>>>> offloading regions
>>>> > >we already have "omp target entrypoint" attribute on the callee
>>>> and a
>>>> > >single caller), tell LTO if possible not to split those into
>>>> different
>>>> > >partitions if easily possible, and then just for these pairs perform
>>>> > >aliasing/points-to analysis in the caller and the result record
>>>> using
>>>> > >cliques/special attributes/whatever to the callee side, so that
>>>> the callee
>>>> > >(outlined OpenMP/OpenACC/Cilk+ region) can then improve its alias
>>>> analysis.
>>> >
>>> >As a start, is the approach of this patch OK?
>> Works for me but leaving to Jakub to review for correctness.
>
> Attached patch is a complete version:
> - added ChangeLog
> - added missing function header comments
> - moved analysis to separate function
>    omp_target_base_pointers_restrict_p
> - added example in comment before analysis
> - fixed error in omp_target_base_pointers_restrict_p where I was using
>    GOMP_MAP_ALLOC but should have been using GOMP_MAP_FORCE_ALLOC
> - added testcases
>

This follow-up patch handles the case where we copy from/to pointers
rather than declared variables:
...
        void foo (unsigned int *a, unsigned int *b)
        {
	 #pragma acc kernels copyout (a[0:2]) copyout (b[0:2])
	 {
	   a[0] = 0;
	   b[0] = 1;
	 }
        }
...

After gimplification, we have:
...
      foo (unsigned int * a, unsigned int * b)
      {
        unsigned int * b.0;
        unsigned int * a.1;

        b.0 = b;
        a.1 = a;
        #pragma omp target oacc_kernels \
	 map(force_from:*a.1 (*a) [len: 8]) \
	 map(alloc:a [pointer assign, bias: 0]) \
	 map(force_from:*b.0 (*b) [len: 8]) \
	 map(alloc:b [pointer assign, bias: 0])
        {
	 unsigned int * a.2;
	 unsigned int * b.3;

	 a.2 = a;
	 *a.2 = 0;
	 b.3 = b;
	 *b.3 = 1;
       }
      }
...

We don't bail out of omp_target_base_pointers_restrict_p when 
encountering 'map(alloc:a [pointer assign, bias: 0])', given that we can 
find the matching 'map(force_from:*a.1 (*a) [len: 8])'.
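
The matching rule can be modelled in a few lines of standalone C. This is a
sketch under simplifying assumptions, not the omp-low.c code: clauses are
plain structs, the pointer named by a pointer-assign clause and the base of
the saved original expression are compared as strings (the patch uses
operand_equal_p on trees), and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the relevant GOMP_MAP_* kinds.  */
enum map_kind
{
  MAP_POINTER,
  MAP_FORCE_ALLOC,
  MAP_FORCE_TO,
  MAP_FORCE_FROM,
  MAP_FORCE_TOFROM,
  MAP_OTHER
};

struct map_clause
{
  enum map_kind kind;
  bool decl_p;            /* Models DECL_P (OMP_CLAUSE_DECL (c)).  */
  const char *ptr_name;   /* For MAP_POINTER: the mapped pointer.  */
  const char *orig_base;  /* Base of the saved original expression,
                             e.g. "a" for '*a'; NULL if none.  */
};

/* Return true if pointer clause PTR points to the data mapped by C.  */
static bool
points_to_clause_p (const struct map_clause *ptr, const struct map_clause *c)
{
  assert (ptr->kind == MAP_POINTER);
  return (c->orig_base != NULL && ptr->ptr_name != NULL
          && strcmp (ptr->ptr_name, c->orig_base) == 0);
}

/* Count the pointer clauses in CS that point to C.  */
static unsigned int
nr_ptr_clauses_pointing_to (const struct map_clause *cs, size_t n,
                            const struct map_clause *c)
{
  unsigned int nr = 0;
  for (size_t i = 0; i < n; i++)
    if (cs[i].kind == MAP_POINTER && points_to_clause_p (&cs[i], c))
      nr++;
  return nr;
}

/* Model of the extended analysis: a force mapping of a declared
   variable must have no pointer clause, a force mapping of a
   dereference must have exactly one, and no pointer clause may be
   left unmatched.  */
static bool
base_pointers_restrict_p (const struct map_clause *cs, size_t n)
{
  unsigned int ptr_found = 0, ptr_matched = 0;

  for (size_t i = 0; i < n; i++)
    switch (cs[i].kind)
      {
      case MAP_FORCE_ALLOC:
      case MAP_FORCE_TO:
      case MAP_FORCE_FROM:
      case MAP_FORCE_TOFROM:
        {
          unsigned int nr = nr_ptr_clauses_pointing_to (cs, n, &cs[i]);
          if (cs[i].decl_p ? nr != 0 : nr != 1)
            return false;
          if (!cs[i].decl_p)
            ptr_matched++;
          break;
        }
      case MAP_POINTER:
        ptr_found++;
        break;
      default:
        return false;
      }

  return ptr_found == ptr_matched;
}
```

For the example above, the four clauses (force_from with original '*a',
pointer 'a', force_from with original '*b', pointer 'b') give
ptr_found == ptr_matched == 2, so the model accepts them. A pointer clause
without a matching force mapping leaves the two counters unequal and is
rejected.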

Using this and the previous patch, I'm able to do auto-parallelization 
on all the oacc kernels c test-cases, with the obvious exception of the 
testcases where some of the used variables are mapped using the 'present' 
tag (in other words, missing the force tag).
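
As an illustration of why the restrict marking helps here, the following is
a hand-written C approximation of an outlined kernels region. It is only a
sketch: the record and function names are hypothetical, and the real
receiver record and child function are generated by omp-low.c. The point is
that each field is a restrict-qualified pointer to a separately allocated
object, so the stores to c cannot alias the loads from a and b.

```c
#include <assert.h>
#include <stddef.h>

#define N 4

/* Hypothetical counterpart of the compiler-generated record that
   carries the base pointers to the offloaded function.  With the
   patch, by-reference fields get the restrict qualifier.  */
struct omp_data
{
  unsigned int (*restrict a)[N];
  unsigned int (*restrict b)[N];
  unsigned int (*restrict c)[N];
};

/* Hypothetical counterpart of the outlined function.  Since a, b and
   c are reached through distinct restrict pointers, the loop
   iterations can be treated as independent.  */
static void
kernel_fn (struct omp_data *restrict omp_data_i)
{
  assert (omp_data_i != NULL);

  for (size_t ii = 0; ii < N; ii++)
    (*omp_data_i->c)[ii] = (*omp_data_i->a)[ii] + (*omp_data_i->b)[ii];
}
```

A caller would fill the record with the addresses of three distinct arrays
and pass it to kernel_fn; passing the same array for two fields would
violate the restrict contract, which is what a 'present' mapping cannot
rule out.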

Bootstrapped and reg-tested on x86_64.

OK for stage3 trunk?

Thanks,
- Tom


[-- Attachment #2: 0002-Handle-non-declared-variables-in-kernels-alias-analysis.patch --]
[-- Type: text/x-patch, Size: 10410 bytes --]

Handle non-declared variables in kernels alias analysis

2015-11-27  Tom de Vries  <tom@codesourcery.com>

	* gimplify.c (gimplify_scan_omp_clauses): Initialize
	OMP_CLAUSE_ORIG_DECL.
	* omp-low.c (install_var_field_1): Handle base_pointers_restrict for
	pointers.
	(map_ptr_clause_points_to_clause_p)
	(nr_map_ptr_clauses_pointing_to_clause): New functions.
	(omp_target_base_pointers_restrict_p): Handle GOMP_MAP_POINTER.
	* tree-pretty-print.c (dump_omp_clause): Print OMP_CLAUSE_ORIG_DECL.
	* tree.c (omp_clause_num_ops): Set num_ops for OMP_CLAUSE_MAP to 3.
	* tree.h (OMP_CLAUSE_ORIG_DECL): New macro.

	* c-c++-common/goacc/kernels-alias-10.c: New test.
	* c-c++-common/goacc/kernels-alias-9.c: New test.

---
 gcc/gimplify.c                                     |   1 +
 gcc/omp-low.c                                      | 134 ++++++++++++++++++++-
 .../c-c++-common/goacc/kernels-alias-10.c          |  29 +++++
 gcc/testsuite/c-c++-common/goacc/kernels-alias-9.c |  29 +++++
 gcc/tree-pretty-print.c                            |   8 ++
 gcc/tree.c                                         |   2 +-
 gcc/tree.h                                         |   5 +
 7 files changed, 205 insertions(+), 3 deletions(-)

diff --git a/gcc/gimplify.c b/gcc/gimplify.c
index a3ed378..fcac745 100644
--- a/gcc/gimplify.c
+++ b/gcc/gimplify.c
@@ -6713,6 +6713,7 @@ gimplify_scan_omp_clauses (tree *list_p, gimple_seq *pre_p,
 	  if (!DECL_P (decl))
 	    {
 	      tree d = decl, *pd;
+	      OMP_CLAUSE_ORIG_DECL (c) = copy_node (decl);
 	      if (TREE_CODE (d) == ARRAY_REF)
 		{
 		  while (TREE_CODE (d) == ARRAY_REF)
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 6843c49..8ae08c52 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -1396,6 +1396,9 @@ install_var_field_1 (tree var, bool by_ref, int mask, omp_context *ctx,
     }
   else if (by_ref)
     {
+      if (base_pointers_restrict
+	  && POINTER_TYPE_P (type))
+	type = build_qualified_type (type, TYPE_QUAL_RESTRICT);
       type = build_pointer_type (type);
       if (base_pointers_restrict)
 	type = build_qualified_type (type, TYPE_QUAL_RESTRICT);
@@ -3057,6 +3060,64 @@ scan_omp_single (gomp_single *stmt, omp_context *outer_ctx)
     layout_type (ctx->record_type);
 }
 
+/* Return true if OMP_CLAUSE_DECL (MAP_POINTER_CLAUSE) points to
+   OMP_CLAUSE_DECL (CLAUSE).  */
+
+static bool
+map_ptr_clause_points_to_clause_p (tree map_pointer_clause, tree clause)
+{
+  gcc_assert (OMP_CLAUSE_CODE (map_pointer_clause) == OMP_CLAUSE_MAP);
+  gcc_assert (OMP_CLAUSE_MAP_KIND (map_pointer_clause) == GOMP_MAP_POINTER);
+
+  if (OMP_CLAUSE_CODE (clause) != OMP_CLAUSE_MAP)
+    return false;
+
+  tree orig_decl = OMP_CLAUSE_ORIG_DECL (clause);
+  if (orig_decl == NULL_TREE)
+    return false;
+
+  tree ptr_decl = OMP_CLAUSE_DECL (map_pointer_clause);
+  switch (TREE_CODE (orig_decl))
+    {
+    case ARRAY_REF:
+      if (!integer_zerop (TREE_OPERAND (orig_decl, 1)))
+	return false;
+
+      /* Fall through.  */
+    case INDIRECT_REF:
+      if (!operand_equal_p (ptr_decl, TREE_OPERAND (orig_decl, 0), 0))
+	return false;
+      break;
+    default:
+      return false;
+    }
+
+  return true;
+}
+
+/* Return the number of map_pointer clauses in CLAUSES pointing to CLAUSE.  */
+
+static unsigned int
+nr_map_ptr_clauses_pointing_to_clause (tree clauses, tree clause)
+{
+  unsigned int nr = 0;
+
+  tree c;
+  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+    {
+      if (OMP_CLAUSE_CODE (c) != OMP_CLAUSE_MAP)
+	continue;
+
+      if (OMP_CLAUSE_MAP_KIND (c) != GOMP_MAP_POINTER)
+	continue;
+
+      if (map_ptr_clause_points_to_clause_p (c, clause))
+	nr++;
+    }
+
+  return nr;
+}
+
 /* Return true if the CLAUSES of an omp target guarantee that the base pointers
    used in the corresponding offloaded function are restrict.  */
 
@@ -3096,8 +3157,59 @@ omp_target_base_pointers_restrict_p (tree clauses)
      Because both mappings have the force prefix, we know that they will be
      allocated when calling the corresponding offloaded function, which means we
      can mark the base pointers for a and b in the offloaded function as
-     restrict.  */
+     restrict.
+
+     II.  GOMP_MAP_POINTER example:
 
+       void foo (unsigned int *a, unsigned int *b)
+       {
+	 #pragma acc kernels copyout (a[0:2]) copyout (b[0:2])
+	 {
+	   a[0] = 0;
+	   b[0] = 1;
+	 }
+       }
+
+     After gimplification, we have:
+
+     foo (unsigned int * a, unsigned int * b)
+     {
+       unsigned int * b.0;
+       unsigned int * a.1;
+
+       b.0 = b;
+       a.1 = a;
+       #pragma omp target oacc_kernels \
+	 map(force_from:*a.1 (*a) [len: 8]) \
+	 map(alloc:a [pointer assign, bias: 0]) \
+	 map(force_from:*b.0 (*b) [len: 8]) \
+	 map(alloc:b [pointer assign, bias: 0])
+       {
+	 unsigned int * a.2;
+	 unsigned int * b.3;
+
+	 a.2 = a;
+	 *a.2 = 0;
+	 b.3 = b;
+	 *b.3 = 1;
+       }
+     }
+
+     Because:
+     - we can prove for both pointer assign mappings that they point to a
+       force-prefixed mapping, and
+     - the force-prefixed mappings themselves do not have their OMP_CLAUSE_DECL
+       used in the body,
+     we can mark the base pointers for a and b in the offloaded function as
+     restrict.
+
+     KLUDGE: In order to connect the pointer mapping clause to the force_*
+     clause, we need to save the pre-gimplification OMP_CLAUSE_DECL as
+     OMP_CLAUSE_ORIG_DECL.  Note that OMP_CLAUSE_ORIG_DECL is printed as '(*a)'
+     in 'map(force_from:*a.1 (*a) [len: 8])'.  */
+
+  unsigned int ptr_found = 0;
+  unsigned int ptr_matched = 0;
   tree c;
   for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
     {
@@ -3110,13 +3222,31 @@ omp_target_base_pointers_restrict_p (tree clauses)
 	case GOMP_MAP_FORCE_TO:
 	case GOMP_MAP_FORCE_FROM:
 	case GOMP_MAP_FORCE_TOFROM:
+	  {
+	    unsigned int nr
+	      = nr_map_ptr_clauses_pointing_to_clause (clauses, c);
+	    if (DECL_P (OMP_CLAUSE_DECL (c)))
+	      {
+		if (nr != 0)
+		  return false;
+	      }
+	    else
+	      {
+		if (nr != 1)
+		  return false;
+		ptr_matched++;
+	      }
+	  }
+	  break;
+	case GOMP_MAP_POINTER:
+	  ptr_found++;
 	  break;
 	default:
 	  return false;
 	}
     }
 
-  return true;
+  return ptr_found == ptr_matched;
 }
 
 /* Scan a GIMPLE_OMP_TARGET.  */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-alias-10.c b/gcc/testsuite/c-c++-common/goacc/kernels-alias-10.c
new file mode 100644
index 0000000..ce5bbe8
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-alias-10.c
@@ -0,0 +1,29 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-ealias-all" } */
+
+#define N 2
+
+void
+foo (void)
+{
+  unsigned int a[N];
+  unsigned int b[N];
+  unsigned int c[N];
+  unsigned int d[N];
+
+#pragma acc kernels copyin (a[0:N]) create (b[0:N]) copyout (c[0:N]) copy (d[0:N])
+  {
+    a[0] = 0;
+    b[0] = 0;
+    c[0] = 0;
+    d[0] = 0;
+  }
+}
+
+/* { dg-final { scan-tree-dump-times "clique 1 base 1" 4 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 2" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 3" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 4" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 5" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "(?n)clique .* base .*" 8 "ealias" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-alias-9.c b/gcc/testsuite/c-c++-common/goacc/kernels-alias-9.c
new file mode 100644
index 0000000..7229fd4
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-alias-9.c
@@ -0,0 +1,29 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-ealias-all" } */
+
+#define N 2
+
+void
+foo (unsigned int *a, unsigned int *b, unsigned int *c, unsigned int *d)
+{
+
+#pragma acc kernels copyin (a[0:N]) create (b[0:N]) copyout (c[0:N]) copy (d[0:N])
+  {
+    a[0] = 0;
+    b[0] = 0;
+    c[0] = 0;
+    d[0] = 0;
+  }
+}
+
+/* { dg-final { scan-tree-dump-times "clique 1 base 1" 4 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 2" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 3" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 4" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 5" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 6" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 7" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 8" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 9" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "(?n)clique .* base .*" 12 "ealias" } } */
+
diff --git a/gcc/tree-pretty-print.c b/gcc/tree-pretty-print.c
index caec760..4b94f18 100644
--- a/gcc/tree-pretty-print.c
+++ b/gcc/tree-pretty-print.c
@@ -666,6 +666,14 @@ dump_omp_clause (pretty_printer *pp, tree clause, int spc, int flags)
       pp_colon (pp);
       dump_generic_node (pp, OMP_CLAUSE_DECL (clause),
 			 spc, flags, false);
+      if (OMP_CLAUSE_ORIG_DECL (clause) != NULL_TREE)
+	{
+	  pp_space (pp);
+	  pp_left_paren (pp);
+	  dump_generic_node (pp, OMP_CLAUSE_ORIG_DECL (clause),
+			     spc, flags, false);
+	  pp_right_paren (pp);
+	}
      print_clause_size:
       if (OMP_CLAUSE_SIZE (clause))
 	{
diff --git a/gcc/tree.c b/gcc/tree.c
index 779fe93..45f9a17 100644
--- a/gcc/tree.c
+++ b/gcc/tree.c
@@ -277,7 +277,7 @@ unsigned const char omp_clause_num_ops[] =
   1, /* OMP_CLAUSE_LINK  */
   2, /* OMP_CLAUSE_FROM  */
   2, /* OMP_CLAUSE_TO  */
-  2, /* OMP_CLAUSE_MAP  */
+  3, /* OMP_CLAUSE_MAP  */
   1, /* OMP_CLAUSE_USE_DEVICE_PTR  */
   1, /* OMP_CLAUSE_IS_DEVICE_PTR  */
   2, /* OMP_CLAUSE__CACHE_  */
diff --git a/gcc/tree.h b/gcc/tree.h
index cb52deb..27221ee 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -1382,6 +1382,11 @@ extern void protected_set_expr_location (tree, location_t);
   OMP_CLAUSE_OPERAND (OMP_CLAUSE_RANGE_CHECK (OMP_CLAUSE_CHECK (NODE),	\
 					      OMP_CLAUSE_PRIVATE,	\
 					      OMP_CLAUSE__LOOPTEMP_), 0)
+#define OMP_CLAUSE_ORIG_DECL(NODE)					\
+  OMP_CLAUSE_OPERAND (OMP_CLAUSE_RANGE_CHECK (OMP_CLAUSE_CHECK (NODE),	\
+					      OMP_CLAUSE_PRIVATE,	\
+					      OMP_CLAUSE__LOOPTEMP_), 2)
+
 #define OMP_CLAUSE_HAS_LOCATION(NODE) \
   (LOCATION_LOCUS ((OMP_CLAUSE_CHECK (NODE))->omp_clause.locus)		\
   != UNKNOWN_LOCATION)

* [gomp4] Use pass_ch instead of pass_ch_oacc_kernels (was: [PATCH, 8/16] Add pass_ch_oacc_kernels)
  2015-11-11 20:29   ` Tom de Vries
@ 2015-11-30 12:12     ` Thomas Schwinge
  0 siblings, 0 replies; 133+ messages in thread
From: Thomas Schwinge @ 2015-11-30 12:12 UTC (permalink / raw)
  To: Tom de Vries, gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 9343 bytes --]

Hi!

On Wed, 11 Nov 2015 21:29:10 +0100, Tom de Vries <Tom_deVries@mentor.com> wrote:
> On 09/11/15 19:33, Tom de Vries wrote:
> > On 09/11/15 16:35, Tom de Vries wrote:
> > this patch adds a pass pass_ch_oacc_kernels, which is like pass_ch, but
> > only runs for loops with oacc_kernels_region set.
> >
> > [ But... thinking about it a bit more, I think that we could use a
> > regular pass_ch instead. We only use the kernels pass group for a single
> > loop nest in a kernels region, and we mark all the loops in the loop
> > nest with oacc_kernels_region. So I think that the oacc_kernels_region
> > test in pass_ch_oacc_kernels::process_loop_p evaluates to true. ]
> >
> > So, I'll try to confirm with retesting that we can drop this patch.
> >
> 
> That's confirmed. I can use pass_ch instead of pass_ch_oacc_kernels, so 
> I'm dropping this patch from the series.
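
To illustrate the reasoning quoted above, here is a minimal standalone sketch, not GCC code. The names loop_model, ch_model and ch_oacc_kernels_model are hypothetical stand-ins for struct loop, pass_ch and the dropped pass_ch_oacc_kernels, and the is_do_while test is only assumed to approximate what pass_ch checks. It shows that an extra in_oacc_kernels_region filter selects the same loops as the plain pass whenever every loop carries the flag.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for GCC's struct loop; only the fields used here.
struct loop_model {
  bool in_oacc_kernels_region;
  bool is_do_while;  // assumed stand-in for the shape test pass_ch applies
};

// Stand-in for pass_ch: accepts every loop its own heuristic likes.
struct ch_model {
  virtual ~ch_model() = default;
  virtual bool process_loop_p(const loop_model &l) const {
    return !l.is_do_while;
  }
};

// Stand-in for pass_ch_oacc_kernels: same heuristic plus a region filter.
struct ch_oacc_kernels_model : ch_model {
  bool process_loop_p(const loop_model &l) const override {
    if (!l.in_oacc_kernels_region)
      return false;
    return ch_model::process_loop_p(l);
  }
};

// Returns true when both passes select exactly the same loops.
bool same_selection(const std::vector<loop_model> &loops) {
  ch_model base;
  ch_oacc_kernels_model filtered;
  for (const loop_model &l : loops)
    if (base.process_loop_p(l) != filtered.process_loop_p(l))
      return false;
  return true;
}
```

If a loop outside a kernels region ever reached the pass group, the two models would disagree, which is the condition the retesting mentioned above rules out in practice.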

Committed to gomp-4_0-branch in r231067:

commit 8249e606d83025092e3b0b227360f7e38fe591d4
Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Mon Nov 30 12:05:50 2015 +0000

    Use pass_ch instead of pass_ch_oacc_kernels
    
    	gcc/
    	* passes.def: Use pass_ch instead of pass_ch_oacc_kernels.
    	* tree-pass.h (make_pass_ch_oacc_kernels): Remove.
    	* tree-ssa-loop-ch.c: Revert to trunk r230907 version.
    	gcc/testsuite/
    	* gcc.dg/tree-ssa/copy-headers.c: Update for new pass_ch.
    	* gcc.dg/tree-ssa/foldconst-2.c: Likewise.
    	* gcc.dg/tree-ssa/loop-40.c: Likewise.
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@231067 138bc75d-0d04-0410-961f-82ee72b054a4
---
 gcc/ChangeLog.gomp                           |    6 +++
 gcc/passes.def                               |    2 +-
 gcc/testsuite/ChangeLog.gomp                 |    6 +++
 gcc/testsuite/gcc.dg/tree-ssa/copy-headers.c |    4 +-
 gcc/testsuite/gcc.dg/tree-ssa/foldconst-2.c  |    4 +-
 gcc/testsuite/gcc.dg/tree-ssa/loop-40.c      |    4 +-
 gcc/tree-pass.h                              |    1 -
 gcc/tree-ssa-loop-ch.c                       |   60 +++-----------------------
 8 files changed, 24 insertions(+), 63 deletions(-)

diff --git gcc/ChangeLog.gomp gcc/ChangeLog.gomp
index 54712ab..2c8f0c2 100644
--- gcc/ChangeLog.gomp
+++ gcc/ChangeLog.gomp
@@ -1,3 +1,9 @@
+2015-11-30  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* passes.def: Use pass_ch instead of pass_ch_oacc_kernels.
+	* tree-pass.h (make_pass_ch_oacc_kernels): Remove.
+	* tree-ssa-loop-ch.c: Revert to trunk r230907 version.
+
 2015-11-18  Nathan Sidwell  <nathan@codesourcery.com>
 
 	* config/nvptx/nvptx.c: Remove unneeded #includes. Backport
diff --git gcc/passes.def gcc/passes.def
index e44bfac..f4eb235 100644
--- gcc/passes.def
+++ gcc/passes.def
@@ -93,7 +93,7 @@ along with GCC; see the file COPYING3.  If not see
 	  NEXT_PASS (pass_oacc_kernels);
 	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
 	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
-	      NEXT_PASS (pass_ch_oacc_kernels);
+	      NEXT_PASS (pass_ch);
 	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
 	      NEXT_PASS (pass_tree_loop_init);
 	      NEXT_PASS (pass_lim);
diff --git gcc/testsuite/ChangeLog.gomp gcc/testsuite/ChangeLog.gomp
index dd3b1f5..59733bd 100644
--- gcc/testsuite/ChangeLog.gomp
+++ gcc/testsuite/ChangeLog.gomp
@@ -1,3 +1,9 @@
+2015-11-30  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* gcc.dg/tree-ssa/copy-headers.c: Update for new pass_ch.
+	* gcc.dg/tree-ssa/foldconst-2.c: Likewise.
+	* gcc.dg/tree-ssa/loop-40.c: Likewise.
+
 2015-11-19  Cesar Philippidis  <cesar@codesourcery.com>
 
 	* gfortran.dg/goacc/routine-6.f90: Ensure that the device clause is
diff --git gcc/testsuite/gcc.dg/tree-ssa/copy-headers.c gcc/testsuite/gcc.dg/tree-ssa/copy-headers.c
index 4241b40..a5a8212 100644
--- gcc/testsuite/gcc.dg/tree-ssa/copy-headers.c
+++ gcc/testsuite/gcc.dg/tree-ssa/copy-headers.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */ 
-/* { dg-options "-O2 -fdump-tree-ch-details" } */
+/* { dg-options "-O2 -fdump-tree-ch2-details" } */
 
 extern int foo (int);
 
@@ -12,4 +12,4 @@ void bla (void)
 }
 
 /* There should be a header duplicated.  */
-/* { dg-final { scan-tree-dump-times "Duplicating header" 1 "ch"} } */
+/* { dg-final { scan-tree-dump-times "Duplicating header" 1 "ch2"} } */
diff --git gcc/testsuite/gcc.dg/tree-ssa/foldconst-2.c gcc/testsuite/gcc.dg/tree-ssa/foldconst-2.c
index eb1e6de..e9a6f87 100644
--- gcc/testsuite/gcc.dg/tree-ssa/foldconst-2.c
+++ gcc/testsuite/gcc.dg/tree-ssa/foldconst-2.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-ch" } */
+/* { dg-options "-O2 -fdump-tree-ch2" } */
 typedef union tree_node *tree;
 enum tree_code
 {
@@ -56,4 +56,4 @@ emit_support_tinfos (void)
 }
 /* We should copy loop header to fundamentals[0] and then fold it way into
    known value.  */
-/* { dg-final { scan-tree-dump-not "fundamentals.0" "ch"} } */
+/* { dg-final { scan-tree-dump-not "fundamentals.0" "ch2"} } */
diff --git gcc/testsuite/gcc.dg/tree-ssa/loop-40.c gcc/testsuite/gcc.dg/tree-ssa/loop-40.c
index 8397396..36db565 100644
--- gcc/testsuite/gcc.dg/tree-ssa/loop-40.c
+++ gcc/testsuite/gcc.dg/tree-ssa/loop-40.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-ch-details" } */
+/* { dg-options "-O2 -fdump-tree-ch2-details" } */
 
 int mymax2(int *it, int *end)
 {
@@ -10,4 +10,4 @@ int mymax2(int *it, int *end)
   return max;
 }
 
-/* { dg-final { scan-tree-dump "Duplicating header" "ch" } } */
+/* { dg-final { scan-tree-dump "Duplicating header" "ch2" } } */
diff --git gcc/tree-pass.h gcc/tree-pass.h
index 8ac8e72..004db77 100644
--- gcc/tree-pass.h
+++ gcc/tree-pass.h
@@ -392,7 +392,6 @@ extern gimple_opt_pass *make_pass_iv_optimize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_loop_done (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ch_vect (gcc::context *ctxt);
-extern gimple_opt_pass *make_pass_ch_oacc_kernels (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ccp (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_split_paths (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_phi_only_cprop (gcc::context *ctxt);
diff --git gcc/tree-ssa-loop-ch.c gcc/tree-ssa-loop-ch.c
index 3773e94..6493fcc 100644
--- gcc/tree-ssa-loop-ch.c
+++ gcc/tree-ssa-loop-ch.c
@@ -33,7 +33,6 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-inline.h"
 #include "tree-ssa-scopedtables.h"
 #include "tree-ssa-threadedge.h"
-#include "omp-low.h"
 
 /* Duplicates headers of loops if they are small enough, so that the statements
    in the loop body are always executed when the loop is entered.  This
@@ -125,7 +124,7 @@ do_while_loop_p (struct loop *loop)
 
 namespace {
 
-/* Common superclass for header-copying phases.  */
+/* Common superclass for both header-copying phases.  */
 class ch_base : public gimple_opt_pass
 {
   protected:
@@ -160,16 +159,14 @@ public:
     : ch_base (pass_data_ch, ctxt)
   {}
 
-  pass_ch (pass_data data, gcc::context *ctxt)
-    : ch_base (data, ctxt)
-  {}
-
   /* opt_pass methods: */
   virtual bool gate (function *) { return flag_tree_ch != 0; }
   
   /* Initialize and finalize loop structures, copying headers inbetween.  */
   virtual unsigned int execute (function *);
 
+  opt_pass * clone () { return new pass_ch (m_ctxt); }
+
 protected:
   /* ch_base method: */
   virtual bool process_loop_p (struct loop *loop);
@@ -341,8 +338,6 @@ ch_base::copy_headers (function *fun)
   return changed ? TODO_cleanup_cfg : 0;
 }
 
-} // anon namespace
-
 /* Initialize the loop structures we need, and finalize after.  */
 
 unsigned int
@@ -408,6 +403,8 @@ pass_ch_vect::process_loop_p (struct loop *loop)
   return false;
 }
 
+} // anon namespace
+
 gimple_opt_pass *
 make_pass_ch_vect (gcc::context *ctxt)
 {
@@ -419,50 +416,3 @@ make_pass_ch (gcc::context *ctxt)
 {
   return new pass_ch (ctxt);
 }
-
-namespace {
-
-const pass_data pass_data_ch_oacc_kernels =
-{
-  GIMPLE_PASS, /* type */
-  "ch_oacc_kernels", /* name */
-  OPTGROUP_LOOP, /* optinfo_flags */
-  TV_TREE_CH, /* tv_id */
-  ( PROP_cfg | PROP_ssa ), /* properties_required */
-  0, /* properties_provided */
-  0, /* properties_destroyed */
-  0, /* todo_flags_start */
-  TODO_cleanup_cfg, /* todo_flags_finish */
-};
-
-class pass_ch_oacc_kernels : public pass_ch
-{
-public:
-  pass_ch_oacc_kernels (gcc::context *ctxt)
-    : pass_ch (pass_data_ch_oacc_kernels, ctxt)
-  {}
-
-  /* opt_pass methods: */
-  virtual bool gate (function *) { return true; }
-
-protected:
-  /* ch_base method: */
-  virtual bool process_loop_p (struct loop *loop);
-}; // class pass_ch_oacc_kernels
-
-} // anon namespace
-
-bool
-pass_ch_oacc_kernels::process_loop_p (struct loop *loop)
-{
-  if (!loop->in_oacc_kernels_region)
-    return false;
-
-  return pass_ch::process_loop_p (loop);
-}
-
-gimple_opt_pass *
-make_pass_ch_oacc_kernels (gcc::context *ctxt)
-{
-  return new pass_ch_oacc_kernels (ctxt);
-}


Regards
 Thomas

* [gomp4] Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-25 10:44                       ` Richard Biener
@ 2015-11-30 17:48                         ` Thomas Schwinge
  0 siblings, 0 replies; 133+ messages in thread
From: Thomas Schwinge @ 2015-11-30 17:48 UTC (permalink / raw)
  To: gcc-patches, Tom de Vries; +Cc: Richard Biener, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 5606 bytes --]

Hi!

On Wed, 25 Nov 2015 11:43:14 +0100 (CET), Richard Biener <rguenther@suse.de> wrote:
> On Tue, 24 Nov 2015, Tom de Vries wrote:
> > > [...]
> > 
> > Reposting using the in_loop_pipeline style in pass_lim.
> 
> Ok.

I merged trunk r230907 into gomp-4_0-branch in a very simplistic way,
basically just moving pass_fre in between pass_oacc_kernels and the (new)
pass_oacc_kernels2 pass groups.  We'll want to clean this up later (on
gomp-4_0-branch), once we're more clear on what difference will remain
between the trunk and gomp-4_0-branch pass structures (if any); for now
this makes sure we don't regress OpenACC kernels functionality on
gomp-4_0-branch.  In gomp-4_0-branch r231078, I effectively applied the
following:

commit ffae8a36e195172327a233bd397a4230a7939681
Merge: 8249e60 e1e1688
Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Mon Nov 30 17:28:07 2015 +0000

    svn merge -r 230906:230907 svn+ssh://gcc.gnu.org/svn/gcc/trunk
    
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@231078 138bc75d-0d04-0410-961f-82ee72b054a4

 gcc/ChangeLog           |  6 ++++
 gcc/passes.def          | 13 +++++++--
 gcc/testsuite/ChangeLog | 76 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 92 insertions(+), 3 deletions(-)

[diff --git gcc/ChangeLog gcc/ChangeLog]
diff --git gcc/passes.def gcc/passes.def
index f4eb235..9fe4fec 100644
--- gcc/passes.def
+++ gcc/passes.def
@@ -84,36 +84,43 @@ along with GCC; see the file COPYING3.  If not see
 	  /* After CCP we rewrite no longer addressed locals into SSA
 	     form if possible.  */
 	  NEXT_PASS (pass_forwprop);
 	  NEXT_PASS (pass_sra_early);
 	  /* pass_build_ealias is a dummy pass that ensures that we
 	     execute TODO_rebuild_alias at this point.  */
 	  NEXT_PASS (pass_build_ealias);
-	  /* Pass group that runs when there are oacc kernels in the
-	     function.  */
+	  /* Pass group that runs when the function is an offloaded function
+	     containing oacc kernels loops.  Part 1.  */
 	  NEXT_PASS (pass_oacc_kernels);
 	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
 	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
 	      NEXT_PASS (pass_ch);
 	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+	  POP_INSERT_PASSES ()
+	  NEXT_PASS (pass_fre);
+	  /* Pass group that runs when the function is an offloaded function
+	     containing oacc kernels loops.  Part 2.  */
+	  NEXT_PASS (pass_oacc_kernels2);
+	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
+	      /* We use pass_lim to rewrite in-memory iteration and reduction
+		 variable accesses in loops into local variables accesses.  */
 	      NEXT_PASS (pass_tree_loop_init);
 	      NEXT_PASS (pass_lim);
 	      NEXT_PASS (pass_copy_prop);
 	      NEXT_PASS (pass_lim);
 	      NEXT_PASS (pass_copy_prop);
 	      NEXT_PASS (pass_scev_cprop);
 	      NEXT_PASS (pass_tree_loop_done);
 	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
 	      NEXT_PASS (pass_dce);
 	      NEXT_PASS (pass_tree_loop_init);
       	      NEXT_PASS (pass_parallelize_loops_oacc_kernels);
 	      NEXT_PASS (pass_expand_omp_ssa);
 	      NEXT_PASS (pass_tree_loop_done);
 	  POP_INSERT_PASSES ()
-	  NEXT_PASS (pass_fre);
 	  NEXT_PASS (pass_merge_phi);
           NEXT_PASS (pass_dse);
 	  NEXT_PASS (pass_cd_dce);
 	  NEXT_PASS (pass_early_ipa_sra);
 	  NEXT_PASS (pass_tail_recursion);
 	  NEXT_PASS (pass_convert_switch);
 	  NEXT_PASS (pass_cleanup_eh);
[diff --git gcc/testsuite/ChangeLog gcc/testsuite/ChangeLog]

..., so the following difference from trunk to gomp-4_0-branch remains to
be resolved/reduced (plus the corresponding testsuite tree dump scanning
changes):

--- gcc/passes.def
+++ gcc/passes.def
@@ -89,25 +89,36 @@ along with GCC; see the file COPYING3.  If not see
 	     execute TODO_rebuild_alias at this point.  */
 	  NEXT_PASS (pass_build_ealias);
 	  /* Pass group that runs when the function is an offloaded function
 	     containing oacc kernels loops.  Part 1.  */
 	  NEXT_PASS (pass_oacc_kernels);
 	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
+	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
 	      NEXT_PASS (pass_ch);
+	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
 	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_fre);
 	  /* Pass group that runs when the function is an offloaded function
 	     containing oacc kernels loops.  Part 2.  */
 	  NEXT_PASS (pass_oacc_kernels2);
 	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
 	      /* We use pass_lim to rewrite in-memory iteration and reduction
 		 variable accesses in loops into local variables accesses.  */
+	      NEXT_PASS (pass_tree_loop_init);
 	      NEXT_PASS (pass_lim);
+	      NEXT_PASS (pass_copy_prop);
+	      NEXT_PASS (pass_lim);
+	      NEXT_PASS (pass_copy_prop);
+	      NEXT_PASS (pass_scev_cprop);
+	      NEXT_PASS (pass_tree_loop_done);
 	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
 	      NEXT_PASS (pass_dce);
+	      NEXT_PASS (pass_tree_loop_init);
+      	      NEXT_PASS (pass_parallelize_loops_oacc_kernels);
 	      NEXT_PASS (pass_expand_omp_ssa);
+	      NEXT_PASS (pass_tree_loop_done);
 	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_merge_phi);
           NEXT_PASS (pass_dse);
 	  NEXT_PASS (pass_cd_dce);
 	  NEXT_PASS (pass_early_ipa_sra);
 	  NEXT_PASS (pass_tail_recursion);


Regards
 Thomas

* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-27 11:44                     ` Tom de Vries
  2015-11-27 12:14                       ` Tom de Vries
@ 2015-12-02  9:46                       ` Jakub Jelinek
  2015-12-02 13:11                         ` Tom de Vries
  1 sibling, 1 reply; 133+ messages in thread
From: Jakub Jelinek @ 2015-12-02  9:46 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Richard Biener, gcc-patches

On Fri, Nov 27, 2015 at 12:42:09PM +0100, Tom de Vries wrote:
> --- a/gcc/omp-low.c
> +++ b/gcc/omp-low.c
> @@ -1366,10 +1366,12 @@ build_sender_ref (tree var, omp_context *ctx)
>    return build_sender_ref ((splay_tree_key) var, ctx);
>  }
>  
> -/* Add a new field for VAR inside the structure CTX->SENDER_DECL.  */
> +/* Add a new field for VAR inside the structure CTX->SENDER_DECL.  If
> +   BASE_POINTERS_RESTRICT, declare the field with restrict.  */
>  
>  static void
> -install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
> +install_var_field_1 (tree var, bool by_ref, int mask, omp_context *ctx,
> +		     bool base_pointers_restrict)

Ugh, why the renaming?  Just use default argument:
		bool base_pointers_restrict = false

> +/* As install_var_field_1, but with base_pointers_restrict == false.  */
> +
> +static void
> +install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
> +{
> +  install_var_field_1 (var, by_ref, mask, ctx, false);
> +}

And avoid the wrapper.

>  /* Instantiate decls as necessary in CTX to satisfy the data sharing
> -   specified by CLAUSES.  */
> +   specified by CLAUSES.  If BASE_POINTERS_RESTRICT, install var field with
> +   restrict.  */
>  
>  static void
> -scan_sharing_clauses (tree clauses, omp_context *ctx)
> +scan_sharing_clauses_1 (tree clauses, omp_context *ctx,
> +			bool base_pointers_restrict)

Likewise.

Otherwise LGTM, but I'm worried if this isn't related in any way to
PR68640 and might not make things worse.

	Jakub
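
A minimal sketch of the default-argument suggestion in this review, using a hypothetical miniature rather than the real omp-low.c code. field_model and install_var_field_model are invented for illustration; the real install_var_field takes (tree, bool, int, omp_context *).

```cpp
#include <cassert>

// Hypothetical result type; the real function adds a field to a record type.
struct field_model {
  bool by_ref;
  bool is_restrict;
};

// One function with a defaulted trailing parameter: existing callers that
// omit the flag keep compiling, and new callers opt in explicitly.  No
// renamed _1 variant and no forwarding wrapper are needed.
field_model install_var_field_model(bool by_ref,
                                    bool base_pointers_restrict = false) {
  return field_model{by_ref, base_pointers_restrict};
}
```

A call such as install_var_field_model (true) behaves as before, while install_var_field_model (true, true) requests the restrict-qualified field.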

* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-27 12:14                       ` Tom de Vries
@ 2015-12-02  9:59                         ` Jakub Jelinek
  2016-03-14 13:16                           ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Jakub Jelinek @ 2015-12-02  9:59 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Richard Biener, gcc-patches

On Fri, Nov 27, 2015 at 01:03:52PM +0100, Tom de Vries wrote:
> Handle non-declared variables in kernels alias analysis
> 
> 2015-11-27  Tom de Vries  <tom@codesourcery.com>
> 
> 	* gimplify.c (gimplify_scan_omp_clauses): Initialize
> 	OMP_CLAUSE_ORIG_DECL.
> 	* omp-low.c (install_var_field_1): Handle base_pointers_restrict for
> 	pointers.
> 	(map_ptr_clause_points_to_clause_p)
> 	(nr_map_ptr_clauses_pointing_to_clause): New function.
> 	(omp_target_base_pointers_restrict_p): Handle GOMP_MAP_POINTER.
> 	* tree-pretty-print.c (dump_omp_clause): Print OMP_CLAUSE_ORIG_DECL.
> 	* tree.c (omp_clause_num_ops): Set num_ops for OMP_CLAUSE_MAP to 3.
> 	* tree.h (OMP_CLAUSE_ORIG_DECL): New macro.
> 
> 	* c-c++-common/goacc/kernels-alias-10.c: New test.
> 	* c-c++-common/goacc/kernels-alias-9.c: New test.

I don't like this (mainly the addition of OMP_CLAUSE_ORIG_DECL),
but it also sounds wrong to me.
The primary question is how do you handle GOMP_MAP_POINTER
(which is something we don't use for C/C++ OpenMP anymore,
and Fortran OpenMP will stop using it in GCC 7 or 6.2?) on the OpenACC
libgomp side, does it work like GOMP_MAP_ALLOC or GOMP_MAP_FORCE_ALLOC?
Similarly GOMP_MAP_TO_PSET.  If it works like GOMP_MAP_ALLOC (it does
on the OpenMP side in target.c, so if something is already mapped, no
further pointer assignment happens), then your change looks wrong.
If it works like GOMP_MAP_FORCE_ALLOC, then you should just treat
GOMP_MAP_POINTER on all OpenACC constructs as an opcode that allows the
restrict operation.  If it should behave differently depending on
whether the corresponding array section has been mapped with
GOMP_MAP_FORCE_* or without it, then presumably you should use a
different code for those two.

	Jakub

* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-12-02  9:46                       ` Jakub Jelinek
@ 2015-12-02 13:11                         ` Tom de Vries
  0 siblings, 0 replies; 133+ messages in thread
From: Tom de Vries @ 2015-12-02 13:11 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Richard Biener, gcc-patches

On 02/12/15 10:45, Jakub Jelinek wrote:
> On Fri, Nov 27, 2015 at 12:42:09PM +0100, Tom de Vries wrote:
>> --- a/gcc/omp-low.c
>> +++ b/gcc/omp-low.c
>> @@ -1366,10 +1366,12 @@ build_sender_ref (tree var, omp_context *ctx)
>>     return build_sender_ref ((splay_tree_key) var, ctx);
>>   }
>>
>> -/* Add a new field for VAR inside the structure CTX->SENDER_DECL.  */
>> +/* Add a new field for VAR inside the structure CTX->SENDER_DECL.  If
>> +   BASE_POINTERS_RESTRICT, declare the field with restrict.  */
>>
>>   static void
>> -install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
>> +install_var_field_1 (tree var, bool by_ref, int mask, omp_context *ctx,
>> +		     bool base_pointers_restrict)
>
> Ugh, why the renaming?  Just use default argument:
> 		bool base_pointers_restrict = false
>
>> +/* As install_var_field_1, but with base_pointers_restrict == false.  */
>> +
>> +static void
>> +install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
>> +{
>> +  install_var_field_1 (var, by_ref, mask, ctx, false);
>> +}
>
> And avoid the wrapper.
>
>>   /* Instantiate decls as necessary in CTX to satisfy the data sharing
>> -   specified by CLAUSES.  */
>> +   specified by CLAUSES.  If BASE_POINTERS_RESTRICT, install var field with
>> +   restrict.  */
>>
>>   static void
>> -scan_sharing_clauses (tree clauses, omp_context *ctx)
>> +scan_sharing_clauses_1 (tree clauses, omp_context *ctx,
>> +			bool base_pointers_restrict)
>
> Likewise.
>
> Otherwise LGTM,

Hi Jakub,

thanks for the review.

> but I'm worried if this isn't related in any way to
> PR68640 and might not make things worse.
>

AFAIU, they're sort of opposite cases:
- in the case of the PR, we add restrict in a function argument
   by accident
- in the case of this patch, we add restrict in a function argument
   by analysis
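
For reference, a small sketch of what a restrict qualification on a function argument promises. This is not the patch's mechanism, only the property it establishes. __restrict__ is the GNU C++ spelling of C99 restrict; add_one and demo_add_one are hypothetical names.

```cpp
#include <cassert>

// The qualifier states that dst and src never overlap, so the compiler may
// reorder or vectorize the loads and stores freely.  Adding it is only safe
// when that is known to hold, e.g. because an analysis established it.
void add_one(unsigned *__restrict__ dst, const unsigned *__restrict__ src,
             int n) {
  for (int i = 0; i < n; ++i)
    dst[i] = src[i] + 1;
}

// Demonstration on two distinct arrays, which satisfies the promise.
unsigned demo_add_one() {
  unsigned in[2] = {1, 2};
  unsigned out[2] = {0, 0};
  add_one(out, in, 2);
  return out[0] + out[1];
}
```

Calling add_one with overlapping arrays would break the promise and make the behaviour undefined, which is why a restrict added "by accident" is a bug while one added "by analysis" is an optimization.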

[ Btw, now that this patch (which exploits GOMP_MAP_FORCE_* mappings)
   is OK-ed, the patch "Fix oacc kernels default mapping for scalars" at
   https://gcc.gnu.org/ml/gcc-patches/2015-11/msg03334.html becomes more
   relevant, since that one ensures that scalars by default
   get the GOMP_MAP_FORCE_COPY mapping (rather than the incorrect
   GOMP_MAP_COPY) ]

Thanks,
- Tom

* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-11 11:01     ` Jakub Jelinek
  2015-11-12 16:04       ` Tom de Vries
@ 2015-12-03 11:53       ` Tom de Vries
  1 sibling, 0 replies; 133+ messages in thread
From: Tom de Vries @ 2015-12-03 11:53 UTC (permalink / raw)
  To: Jakub Jelinek, Richard Biener; +Cc: gcc-patches

On 11/11/15 12:00, Jakub Jelinek wrote:
> On Wed, Nov 11, 2015 at 11:51:02AM +0100, Richard Biener wrote:
>>> The option -foffload-alias=pointer instructs the compiler to assume that
>>> object references in an offload region do not alias.
>>>
>>> The option -foffload-alias=all instructs the compiler to make no
>>> assumptions about aliasing in offload regions.
>>>
>>> The default value is -foffload-alias=none.
>>
>> I think global options for this is nonsense.  Please follow what
>> we do for #pragma GCC ivdep for example, thus allow the alias
>> behavior to be specified per "region" (whatever makes sense here
>> in the context of offloading).
>
> Yeah, completely agreed.  I don't see why the offloaded region would be in
> any way special, they are C/C++/Fortran code as any other.
> What we can and should improve is teach IPA aliasing/points to analysis
> about the way we lower the host vs. offloading region boundary, so that
> if alias analysis on the caller of GOMP_target_ext/GOACC_parallel_keyed
> determines something it can be used on the offloaded function side and vice
> versa, but a switch like the above is just wrong.
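
As a reminder of the per-region style referred to in the quoted review, a minimal sketch of #pragma GCC ivdep. The function names are hypothetical; the pragma itself is the documented GCC one. Compilers that do not know the pragma are expected to ignore it, possibly with a warning.

```cpp
#include <cassert>

// The pragma applies only to the loop that follows it and asserts that there
// are no loop-carried dependences preventing SIMD execution of that loop.
// Nothing is assumed about any other loop in the translation unit.
void scale_by_two(int *out, const int *in, int n) {
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    out[i] = in[i] * 2;
}

// Demonstration on two distinct arrays.
int demo_scale() {
  int in[4] = {1, 2, 3, 4};
  int out[4] = {0, 0, 0, 0};
  scale_by_two(out, in, 4);
  return out[0] + out[1] + out[2] + out[3];
}
```

The contrast with a global switch such as -foffload-alias is the scope: the assertion is attached to one loop in the source rather than to every offload region in the compilation.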

Filed the GOMP_target_ext bit as PR 68675 - Handle GOMP_target_ext 
optimally in ipa-pta.

Thanks,
- Tom

* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-13 11:39               ` Jakub Jelinek
  2015-11-21 12:24                 ` Tom de Vries
@ 2015-12-11 12:45                 ` Tom de Vries
  2015-12-11 13:00                   ` Richard Biener
  1 sibling, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-12-11 12:45 UTC (permalink / raw)
  To: Jakub Jelinek, Richard Biener; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1612 bytes --]

On 13/11/15 12:39, Jakub Jelinek wrote:
> We simply have some compiler internal interface between the caller and
> callee of the outlined regions, each interface in between those has
> its own structure type used to communicate the info;
> we can attach attributes on the fields, or some flags to indicate some
> properties interesting from aliasing POV.  We don't really need to perform
> full IPA-PTA, perhaps it would be enough to a) record somewhere in cgraph
> the relationship in between such callers and callees (for offloading regions
> we already have "omp target entrypoint" attribute on the callee and a
> single caller), tell LTO if possible not to split those into different
> partitions if easily possible, and then just for these pairs perform
> aliasing/points-to analysis in the caller and the result record using
> cliques/special attributes/whatever to the callee side, so that the callee
> (outlined OpenMP/OpenACC/Cilk+ region) can then improve its alias analysis.

Hi,

This work-in-progress patch allows me to use IPA PTA information in the 
kernels pass group.

Since:
- I'm running IPA PTA before ealias, and IPA PTA does not interpret
  restrict, and
- compute_may_aliases doesn't run if IPA PTA information is present,
I needed to convince ealias to do the restrict clique/base annotation.
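
The control flow this implies can be modelled in isolation as follows. This is a simplified sketch of the attached patch's logic, not GCC code: alias_state_model and its counters are hypothetical, and the two increments stand in for storing local points-to information and for compute_dependence_clique respectively.

```cpp
#include <cassert>

// Hypothetical model of the state consulted by compute_may_aliases.
struct alias_state_model {
  bool ipa_pta = false;                      // IPA PTA results are present
  bool clique_base_annotation_done = false;  // restrict annotation already ran
  int points_to_updates = 0;  // times local points-to info was stored
  int clique_runs = 0;        // times the restrict annotation ran
};

void compute_may_aliases_model(alias_state_model &s) {
  bool set_points_to_info = true;
  if (s.ipa_pta) {
    if (s.clique_base_annotation_done)
      return;                    // nothing left to do
    set_points_to_info = false;  // keep the IPA PTA solution untouched
  }
  if (set_points_to_info)
    ++s.points_to_updates;
  ++s.clique_runs;               // clique/base annotation runs exactly once
  s.clique_base_annotation_done = true;
}
```

With IPA PTA information present, the first call performs only the annotation and later calls return immediately; without it, the function behaves as before.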

It would be more logical to fit IPA PTA after ealias, but one is an IPA 
pass, the other a regular one-function pass, so I would have to split 
the containing pass groups pass_all_early_optimizations and 
pass_local_optimization_passes. I'll give that a try now.

Any comments?

Thanks,
- Tom

[-- Attachment #2: 0008-Run-pass_ipa_pta-before-pass_local_optimization_passes.patch --]
[-- Type: text/x-patch, Size: 5025 bytes --]

Run pass_ipa_pta before pass_local_optimization_passes

---
 gcc/gimple-ssa.h           |  2 ++
 gcc/passes.def             |  1 +
 gcc/tree-pass.h            |  1 +
 gcc/tree-ssa-structalias.c | 60 +++++++++++++++++++++++++++++++++++++++++++---
 4 files changed, 61 insertions(+), 3 deletions(-)

diff --git a/gcc/gimple-ssa.h b/gcc/gimple-ssa.h
index 39551da..aff2fb7 100644
--- a/gcc/gimple-ssa.h
+++ b/gcc/gimple-ssa.h
@@ -83,6 +83,8 @@ struct GTY(()) gimple_df {
   /* The PTA solution for the ESCAPED artificial variable.  */
   struct pt_solution escaped;
 
+  bool clique_base_annotation_done;
+
   /* A map of decls to artificial ssa-names that point to the partition
      of the decl.  */
   hash_map<tree, tree> * GTY((skip(""))) decls_to_pointers;
diff --git a/gcc/passes.def b/gcc/passes.def
index 678a900..5293be0 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -68,6 +68,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_rebuild_cgraph_edges);
   POP_INSERT_PASSES ()
 
+  NEXT_PASS (pass_ipa_pta_oacc_kernels);
   NEXT_PASS (pass_local_optimization_passes);
   PUSH_INSERT_PASSES_WITHIN (pass_local_optimization_passes)
       NEXT_PASS (pass_fixup_cfg);
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 4566d33..980922e 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -497,6 +497,7 @@ extern ipa_opt_pass_d *make_pass_ipa_devirt (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_reference (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_pure_const (gcc::context *ctxt);
 extern simple_ipa_opt_pass *make_pass_ipa_pta (gcc::context *ctxt);
+extern simple_ipa_opt_pass *make_pass_ipa_pta_oacc_kernels (gcc::context *ctxt);
 extern simple_ipa_opt_pass *make_pass_ipa_tm (gcc::context *ctxt);
 extern simple_ipa_opt_pass *make_pass_target_clone (gcc::context *ctxt);
 extern simple_ipa_opt_pass *make_pass_dispatcher_calls (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-structalias.c b/gcc/tree-ssa-structalias.c
index 7420ce1..dfc0422 100644
--- a/gcc/tree-ssa-structalias.c
+++ b/gcc/tree-ssa-structalias.c
@@ -6939,7 +6939,7 @@ solve_constraints (void)
    at the start of the file for an algorithmic overview.  */
 
 static void
-compute_points_to_sets (void)
+compute_points_to_sets (bool set_points_to_info)
 {
   basic_block bb;
   unsigned i;
@@ -6981,6 +6981,9 @@ compute_points_to_sets (void)
   /* From the constraints compute the points-to sets.  */
   solve_constraints ();
 
+  if (!set_points_to_info)
+    goto done;
+
   /* Compute the points-to set for ESCAPED used for call-clobber analysis.  */
   cfun->gimple_df->escaped = find_what_var_points_to (cfun->decl,
 						      get_varinfo (escaped_id));
@@ -7057,6 +7060,7 @@ compute_points_to_sets (void)
 	}
     }
 
+ done:
   timevar_pop (TV_TREE_PTA);
 }
 
@@ -7289,6 +7293,8 @@ compute_dependence_clique (void)
 unsigned int
 compute_may_aliases (void)
 {
+  bool set_points_to_info = true;
+
   if (cfun->gimple_df->ipa_pta)
     {
       if (dump_file)
@@ -7300,13 +7306,16 @@ compute_may_aliases (void)
 	  dump_alias_info (dump_file);
 	}
 
-      return 0;
+      if (cfun->gimple_df->clique_base_annotation_done)
+	return 0;
+
+      set_points_to_info = false;
     }
 
   /* For each pointer P_i, determine the sets of variables that P_i may
      point-to.  Compute the reachability set of escaped and call-used
      variables.  */
-  compute_points_to_sets ();
+  compute_points_to_sets (set_points_to_info);
 
   /* Debugging dumps.  */
   if (dump_file)
@@ -7314,6 +7323,7 @@ compute_may_aliases (void)
 
   /* Compute restrict-based memory disambiguations.  */
   compute_dependence_clique ();
+  cfun->gimple_df->clique_base_annotation_done = true;
 
   /* Deallocate memory used by aliasing data structures and the internal
      points-to solution.  */
@@ -7816,3 +7826,47 @@ make_pass_ipa_pta (gcc::context *ctxt)
 {
   return new pass_ipa_pta (ctxt);
 }
+
+namespace {
+
+const pass_data pass_data_ipa_pta_oacc_kernels =
+{
+  SIMPLE_IPA_PASS, /* type */
+  "pta_oacc_kernels", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_IPA_PTA, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_ipa_pta_oacc_kernels : public simple_ipa_opt_pass
+{
+public:
+  pass_ipa_pta_oacc_kernels (gcc::context *ctxt)
+    : simple_ipa_opt_pass (pass_data_ipa_pta_oacc_kernels, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+    {
+      return (optimize
+	      && flag_openacc
+	      && flag_tree_parallelize_loops > 1
+	      /* Don't bother doing anything if the program has errors.  */
+	      && !seen_error ());
+    }
+
+  virtual unsigned int execute (function *) { return ipa_pta_execute (); }
+
+}; // class pass_ipa_pta_oacc_kernels
+
+} // anon namespace
+
+simple_ipa_opt_pass *
+make_pass_ipa_pta_oacc_kernels (gcc::context *ctxt)
+{
+  return new pass_ipa_pta_oacc_kernels (ctxt);
+}

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-12-11 12:45                 ` Tom de Vries
@ 2015-12-11 13:00                   ` Richard Biener
  2015-12-13 16:38                     ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-12-11 13:00 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Jakub Jelinek, gcc-patches

On Fri, 11 Dec 2015, Tom de Vries wrote:

> On 13/11/15 12:39, Jakub Jelinek wrote:
> > We simply have some compiler internal interface between the caller and
> > callee of the outlined regions, each interface in between those has
> > its own structure type used to communicate the info;
> > we can attach attributes on the fields, or some flags to indicate some
> > properties interesting from aliasing POV.  We don't really need to perform
> > full IPA-PTA, perhaps it would be enough to a) record somewhere in cgraph
> > the relationship in between such callers and callees (for offloading regions
> > we already have "omp target entrypoint" attribute on the callee and a
> single caller), tell LTO if possible not to split those into different
> > partitions if easily possible, and then just for these pairs perform
> > aliasing/points-to analysis in the caller and the result record using
> > cliques/special attributes/whatever to the callee side, so that the callee
> > (outlined OpenMP/OpenACC/Cilk+ region) can then improve its alias analysis.
> 
> Hi,
> 
> This work-in-progress patch allows me to use IPA PTA information in the
> kernels pass group.
> 
> Since:
> -  I'm running IPA PTA before ealias, and IPA PTA does not interpret
>    restrict, and
> - compute_may_aliases doesn't run if IPA PTA information is present
> I needed to convince ealias to do the restrict clique/base annotation.
> 
> It would be more logical to fit IPA PTA after ealias, but one is an IPA pass,
> the other a regular one-function pass, so I would have to split the containing
> pass groups pass_all_early_optimizations and pass_local_optimization_passes.
> I'll give that a try now.
> 
> Any comments?

I don't think you want to run IPA PTA before early
optimizations, it (and ealias) rely on some initial cleanup to
do anything meaningful with well-spent resources.

The local PTA "hack" also looks more like a waste of resources, but well 
... teaching IPA PTA to honor restrict might be an impossible task
though I didn't think much about it other than handling it only for
nonlocal_p functions (for others we should see all incoming args
if IPA PTA works optimally).  The restrict tags will leak all over
the place of course and in the end no meaningful cliques may remain.
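
To make the restrict point concrete, here is a minimal sketch that is not
taken from the patch. The function and variable names are hypothetical, and
__restrict is the compiler-extension spelling of restrict in C++. The
qualifiers promise that dst and src do not overlap, which is the kind of
fact the restrict clique/base annotation records and which IPA PTA by itself
does not derive from the qualifier.

```cpp
#include <cassert>

// Hypothetical example: with both pointers restrict-qualified, the
// compiler may assume stores through dst never modify src, so loads
// from src can be hoisted or vectorized without an overlap check.
void scale_add (int *__restrict dst, const int *__restrict src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] += 2 * src[i];
}

// Calls scale_add on two distinct arrays, as the qualifiers require.
int scale_add_demo ()
{
  int dst[4] = { 1, 1, 1, 1 };
  int src[4] = { 1, 2, 3, 4 };
  scale_add (dst, src, 4);
  return dst[3];
}
```

Passing overlapping arrays here would break the promise and is undefined
behaviour, which is why the annotation has to be derived carefully.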

Richard.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-12-11 13:00                   ` Richard Biener
@ 2015-12-13 16:38                     ` Tom de Vries
  2015-12-14 13:26                       ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-12-13 16:38 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jakub Jelinek, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2860 bytes --]

On 11/12/15 14:00, Richard Biener wrote:
> On Fri, 11 Dec 2015, Tom de Vries wrote:
>
>> On 13/11/15 12:39, Jakub Jelinek wrote:
>>> We simply have some compiler internal interface between the caller and
>>> callee of the outlined regions, each interface in between those has
>>> its own structure type used to communicate the info;
>>> we can attach attributes on the fields, or some flags to indicate some
>>> properties interesting from aliasing POV.  We don't really need to perform
>>> full IPA-PTA, perhaps it would be enough to a) record somewhere in cgraph
>>> the relationship in between such callers and callees (for offloading regions
>>> we already have "omp target entrypoint" attribute on the callee and a
>>> single caller), tell LTO if possible not to split those into different
>>> partitions if easily possible, and then just for these pairs perform
>>> aliasing/points-to analysis in the caller and the result record using
>>> cliques/special attributes/whatever to the callee side, so that the callee
>>> (outlined OpenMP/OpenACC/Cilk+ region) can then improve its alias analysis.
>>
>> Hi,
>>
>> This work-in-progress patch allows me to use IPA PTA information in the
>> kernels pass group.
>>
>> Since:
>> -  I'm running IPA PTA before ealias, and IPA PTA does not interpret
>>     restrict, and
>> - compute_may_aliases doesn't run if IPA PTA information is present
>> I needed to convince ealias to do the restrict clique/base annotation.
>>
>> It would be more logical to fit IPA PTA after ealias, but one is an IPA pass,
>> the other a regular one-function pass, so I would have to split the containing
>> pass groups pass_all_early_optimizations and pass_local_optimization_passes.
>> I'll give that a try now.
>>

I've tried this approach, but realized that this changes the order in 
which non-openacc functions are processed in the compiler, so I've 
abandoned this idea.

>> Any comments?
>
> I don't think you want to run IPA PTA before early
> optimizations, it (and ealias) rely on some initial cleanup to
> do anything meaningful with well-spent resources.
>
> The local PTA "hack" also looks more like a waste of resources, but well
> ... teaching IPA PTA to honor restrict might be an impossible task
> though I didn't think much about it other than handling it only for
> nonlocal_p functions (for others we should see all incoming args
> if IPA PTA works optimally).  The restrict tags will leak all over
> the place of course and in the end no meaningful cliques may remain.
>

This patch:
- moves the kernels pass group to the first position in the pass list
   after ealias where we're back in ipa mode
- inserts a new ipa pass to contain the gimple pass group called
   pass_oacc_ipa
- inserts a version of ipa-pta before the pass group.

Bootstrapped and reg-tested on x86_64.

OK for stage3 trunk?

Thanks,
- Tom


[-- Attachment #2: 0003-Add-pass_oacc_ipa.patch --]
[-- Type: text/x-patch, Size: 13777 bytes --]

Add pass_oacc_ipa

---
 gcc/passes.def                          | 37 ++++++++++++++-------------
 gcc/testsuite/g++.dg/ipa/devirt-37.C    | 10 ++++----
 gcc/testsuite/g++.dg/ipa/devirt-40.C    |  4 +--
 gcc/testsuite/g++.dg/tree-ssa/pr61034.C | 10 ++++----
 gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c   |  4 +--
 gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c    |  4 +--
 gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c    |  4 +--
 gcc/tree-pass.h                         |  3 ++-
 gcc/tree-ssa-loop.c                     | 40 ++++++++++++++----------------
 gcc/tree-ssa-structalias.c              | 44 +++++++++++++++++++++++++++++++++
 10 files changed, 102 insertions(+), 58 deletions(-)

diff --git a/gcc/passes.def b/gcc/passes.def
index 43ce3d5..579dd63 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -88,24 +88,7 @@ along with GCC; see the file COPYING3.  If not see
 	  /* pass_build_ealias is a dummy pass that ensures that we
 	     execute TODO_rebuild_alias at this point.  */
 	  NEXT_PASS (pass_build_ealias);
-	  /* Pass group that runs when the function is an offloaded function
-	     containing oacc kernels loops.  Part 1.  */
-	  NEXT_PASS (pass_oacc_kernels);
-	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
-	      NEXT_PASS (pass_ch);
-	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_fre);
-	  /* Pass group that runs when the function is an offloaded function
-	     containing oacc kernels loops.  Part 2.  */
-	  NEXT_PASS (pass_oacc_kernels2);
-	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
-	      /* We use pass_lim to rewrite in-memory iteration and reduction
-		 variable accesses in loops into local variables accesses.  */
-	      NEXT_PASS (pass_lim);
-	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
-	      NEXT_PASS (pass_dce);
-	      NEXT_PASS (pass_expand_omp_ssa);
-	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_merge_phi);
           NEXT_PASS (pass_dse);
 	  NEXT_PASS (pass_cd_dce);
@@ -124,6 +107,26 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_rebuild_cgraph_edges);
       NEXT_PASS (pass_inline_parameters);
   POP_INSERT_PASSES ()
+
+  NEXT_PASS (pass_ipa_pta_oacc_kernels);
+  NEXT_PASS (pass_oacc_ipa);
+  PUSH_INSERT_PASSES_WITHIN (pass_oacc_ipa)
+      /* Pass group that runs when the function is an offloaded function
+         containing oacc kernels loops.  */
+      NEXT_PASS (pass_oacc_kernels);
+      PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
+          NEXT_PASS (pass_ch);
+          NEXT_PASS (pass_fre);
+          /* We use pass_lim to rewrite in-memory iteration and reduction
+	     variable accesses in loops into local variables accesses.  */
+          NEXT_PASS (pass_lim);
+          NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+          NEXT_PASS (pass_dce);
+          NEXT_PASS (pass_expand_omp_ssa);
+          NEXT_PASS (pass_rebuild_cgraph_edges);
+      POP_INSERT_PASSES ()
+  POP_INSERT_PASSES ()
+
   NEXT_PASS (pass_ipa_chkp_produce_thunks);
   NEXT_PASS (pass_ipa_auto_profile);
   NEXT_PASS (pass_ipa_free_inline_summary);
diff --git a/gcc/testsuite/g++.dg/ipa/devirt-37.C b/gcc/testsuite/g++.dg/ipa/devirt-37.C
index 9c5287e..b7f52a0 100644
--- a/gcc/testsuite/g++.dg/ipa/devirt-37.C
+++ b/gcc/testsuite/g++.dg/ipa/devirt-37.C
@@ -1,4 +1,4 @@
-/* { dg-options "-fpermissive -O2 -fno-indirect-inlining -fno-devirtualize-speculatively -fdump-tree-fre2-details -fno-early-inlining"  } */
+/* { dg-options "-fpermissive -O2 -fno-indirect-inlining -fno-devirtualize-speculatively -fdump-tree-fre3-details -fno-early-inlining"  } */
 #include <stdlib.h>
 struct A {virtual void test() {abort ();}};
 struct B:A
@@ -30,7 +30,7 @@ t()
 /* After inlining the call within constructor needs to be checked to not go into a basetype.
    We should see the vtbl store and we should notice extcall as possibly clobbering the
    type but ignore it because b is in static storage.  */
-/* { dg-final { scan-tree-dump "No dynamic type change found."  "fre2"  } } */
-/* { dg-final { scan-tree-dump "Checking vtbl store:"  "fre2"  } } */
-/* { dg-final { scan-tree-dump "Function call may change dynamic type:extcall"  "fre2"  } } */
-/* { dg-final { scan-tree-dump "converting indirect call to function virtual void"  "fre2"  } } */
+/* { dg-final { scan-tree-dump "No dynamic type change found."  "fre3"  } } */
+/* { dg-final { scan-tree-dump "Checking vtbl store:"  "fre3"  } } */
+/* { dg-final { scan-tree-dump "Function call may change dynamic type:extcall"  "fre3"  } } */
+/* { dg-final { scan-tree-dump "converting indirect call to function virtual void"  "fre3"  } } */
diff --git a/gcc/testsuite/g++.dg/ipa/devirt-40.C b/gcc/testsuite/g++.dg/ipa/devirt-40.C
index 279a228..5107c29 100644
--- a/gcc/testsuite/g++.dg/ipa/devirt-40.C
+++ b/gcc/testsuite/g++.dg/ipa/devirt-40.C
@@ -1,4 +1,4 @@
-/* { dg-options "-O2 -fdump-tree-fre2-details"  } */
+/* { dg-options "-O2 -fdump-tree-fre3-details"  } */
 typedef enum
 {
 } UErrorCode;
@@ -19,4 +19,4 @@ A::m_fn1 (UnicodeString &, int &p2, UErrorCode &) const
   UnicodeString a[2];
 }
 
-/* { dg-final { scan-tree-dump-not "\\n  OBJ_TYPE_REF" "fre2"  } } */
+/* { dg-final { scan-tree-dump-not "\\n  OBJ_TYPE_REF" "fre3"  } } */
diff --git a/gcc/testsuite/g++.dg/tree-ssa/pr61034.C b/gcc/testsuite/g++.dg/tree-ssa/pr61034.C
index cd4ee05..c06c580 100644
--- a/gcc/testsuite/g++.dg/tree-ssa/pr61034.C
+++ b/gcc/testsuite/g++.dg/tree-ssa/pr61034.C
@@ -1,5 +1,5 @@
 // { dg-do compile }
-// { dg-options "-O2 -fdump-tree-fre2 -fdump-tree-optimized" }
+// { dg-options "-O2 -fdump-tree-fre3 -fdump-tree-optimized" }
 
 #define assume(x) if(!(x))__builtin_unreachable()
 
@@ -42,13 +42,13 @@ bool f(I a, I b, I c, I d) {
 // a bunch of conditional free()s and unreachable()s.
 // This works only if everything is inlined into 'f'.
 
-// { dg-final { scan-tree-dump-times ";; Function" 1 "fre2" } }
-// { dg-final { scan-tree-dump-times "unreachable" 11 "fre2" } }
+// { dg-final { scan-tree-dump-times ";; Function" 1 "fre3" } }
+// { dg-final { scan-tree-dump-times "unreachable" 11 "fre3" } }
 
 // Note that depending on PUSH_ARGS_REVERSED we are presented with
 // a different initial CFG and thus the final outcome is different
 
-// { dg-final { scan-tree-dump-times "free" 10 "fre2" { target x86_64-*-* i?86-*-* } } }
+// { dg-final { scan-tree-dump-times "free" 10 "fre3" { target x86_64-*-* i?86-*-* } } }
 // { dg-final { scan-tree-dump-times "free" 3 "optimized" { target x86_64-*-* i?86-*-* } } }
-// { dg-final { scan-tree-dump-times "free" 14 "fre2" { target aarch64-*-* ia64-*-* arm-*-* hppa*-*-* sparc*-*-* powerpc*-*-* alpha*-*-* } } }
+// { dg-final { scan-tree-dump-times "free" 14 "fre3" { target aarch64-*-* ia64-*-* arm-*-* hppa*-*-* sparc*-*-* powerpc*-*-* alpha*-*-* } } }
 // { dg-final { scan-tree-dump-times "free" 4 "optimized" { target aarch64-*-* ia64-*-* arm-*-* hppa*-*-* sparc*-*-* powerpc*-*-* alpha*-*-* } } }
diff --git a/gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c b/gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c
index f558df3..71b31c4 100644
--- a/gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c
+++ b/gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c
@@ -1,5 +1,5 @@
 /* { dg-do link } */
-/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre2 -fno-ipa-icf" } */
+/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre3 -fno-ipa-icf" } */
 
 static int x, y;
 
@@ -54,7 +54,7 @@ int main()
   local_address_taken (&y);
   /* As we are computing flow- and context-insensitive we may not
      CSE the load of x here.  */
-  /* { dg-final { scan-tree-dump " = x;" "fre2" } } */
+  /* { dg-final { scan-tree-dump " = x;" "fre3" } } */
   return x;
 }
 
diff --git a/gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c b/gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c
index ff6fa57..8655794 100644
--- a/gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c
+++ b/gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre2-details" } */
+/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre3-details" } */
 
 static int __attribute__((noinline,noclone))
 foo (int *p, int *q)
@@ -23,4 +23,4 @@ int main()
 
 /* { dg-final { scan-ipa-dump "foo.arg0 = &a" "pta" } } */
 /* { dg-final { scan-ipa-dump "foo.arg1 = &b" "pta" } } */
-/* { dg-final { scan-tree-dump "Replaced \\\*p_2\\\(D\\\) with 1" "fre2" } } */
+/* { dg-final { scan-tree-dump "Replaced \\\*p_2\\\(D\\\) with 1" "fre3" } } */
diff --git a/gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c b/gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c
index 106e325..c42762a 100644
--- a/gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c
+++ b/gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre2-details" } */
+/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre3-details" } */
 
 int a, b;
 
@@ -28,4 +28,4 @@ int main()
 
 /* { dg-final { scan-ipa-dump "foo.arg0 = &a" "pta" } } */
 /* { dg-final { scan-ipa-dump "foo.arg1 = &b" "pta" } } */
-/* { dg-final { scan-tree-dump "Replaced \\\*p_2\\\(D\\\) with 1" "fre2" } } */
+/* { dg-final { scan-tree-dump "Replaced \\\*p_2\\\(D\\\) with 1" "fre3" } } */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index e1cbce9..1a1da12 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -468,7 +468,7 @@ extern gimple_opt_pass *make_pass_vtable_verify (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ubsan (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_sanopt (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_oacc_kernels (gcc::context *ctxt);
-extern gimple_opt_pass *make_pass_oacc_kernels2 (gcc::context *ctxt);
+extern simple_ipa_opt_pass *make_pass_oacc_ipa (gcc::context *ctxt);
 
 /* IPA Passes */
 extern simple_ipa_opt_pass *make_pass_ipa_lower_emutls (gcc::context *ctxt);
@@ -495,6 +495,7 @@ extern ipa_opt_pass_d *make_pass_ipa_devirt (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_reference (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_pure_const (gcc::context *ctxt);
 extern simple_ipa_opt_pass *make_pass_ipa_pta (gcc::context *ctxt);
+extern simple_ipa_opt_pass *make_pass_ipa_pta_oacc_kernels (gcc::context *ctxt);
 extern simple_ipa_opt_pass *make_pass_ipa_tm (gcc::context *ctxt);
 extern simple_ipa_opt_pass *make_pass_target_clone (gcc::context *ctxt);
 extern simple_ipa_opt_pass *make_pass_dispatcher_calls (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index cf7d94e..0e1dad8 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -206,12 +206,14 @@ make_pass_oacc_kernels (gcc::context *ctxt)
   return new pass_oacc_kernels (ctxt);
 }
 
+/* The oacc ipa superpass.  */
+
 namespace {
 
-const pass_data pass_data_oacc_kernels2 =
+const pass_data pass_data_oacc_ipa =
 {
-  GIMPLE_PASS, /* type */
-  "oacc_kernels2", /* name */
+  SIMPLE_IPA_PASS, /* type */
+  "oacc_ipa", /* name */
   OPTGROUP_LOOP, /* optinfo_flags */
   TV_TREE_LOOP, /* tv_id */
   PROP_cfg, /* properties_required */
@@ -221,34 +223,28 @@ const pass_data pass_data_oacc_kernels2 =
   0, /* todo_flags_finish */
 };
 
-class pass_oacc_kernels2 : public gimple_opt_pass
+class pass_oacc_ipa : public simple_ipa_opt_pass
 {
 public:
-  pass_oacc_kernels2 (gcc::context *ctxt)
-    : gimple_opt_pass (pass_data_oacc_kernels2, ctxt)
+  pass_oacc_ipa (gcc::context *ctxt)
+    : simple_ipa_opt_pass (pass_data_oacc_ipa, ctxt)
   {}
 
   /* opt_pass methods: */
-  virtual bool gate (function *fn) { return gate_oacc_kernels (fn); }
-  virtual unsigned int execute (function *fn)
-    {
-      /* Rather than having a copy of the previous dump, get some use out of
-	 this dump, and try to minimize differences with the following pass
-	 (pass_lim), which will initizalize the loop optimizer with
-	 LOOPS_NORMAL.  */
-      loop_optimizer_init (LOOPS_NORMAL);
-      loop_optimizer_finalize (fn);
-      return 0;
-    }
-
-}; // class pass_oacc_kernels2
+  virtual bool gate (function *)
+  {
+    return (flag_openacc
+	    && flag_tree_parallelize_loops > 1);
+  }
+					     
+}; // class pass_oacc_ipa
 
 } // anon namespace
 
-gimple_opt_pass *
-make_pass_oacc_kernels2 (gcc::context *ctxt)
+simple_ipa_opt_pass *
+make_pass_oacc_ipa (gcc::context *ctxt)
 {
-  return new pass_oacc_kernels2 (ctxt);
+  return new pass_oacc_ipa (ctxt);
 }
 
 /* The no-loop superpass.  */
diff --git a/gcc/tree-ssa-structalias.c b/gcc/tree-ssa-structalias.c
index 7420ce1..b105edc 100644
--- a/gcc/tree-ssa-structalias.c
+++ b/gcc/tree-ssa-structalias.c
@@ -7816,3 +7816,47 @@ make_pass_ipa_pta (gcc::context *ctxt)
 {
   return new pass_ipa_pta (ctxt);
 }
+
+namespace {
+
+const pass_data pass_data_ipa_pta_oacc_kernels =
+{
+  SIMPLE_IPA_PASS, /* type */
+  "pta_oacc_kernels", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_IPA_PTA, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_ipa_pta_oacc_kernels : public simple_ipa_opt_pass
+{
+public:
+  pass_ipa_pta_oacc_kernels (gcc::context *ctxt)
+    : simple_ipa_opt_pass (pass_data_ipa_pta_oacc_kernels, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+    {
+      return (optimize
+	      && flag_openacc
+	      && flag_tree_parallelize_loops > 1
+	      /* Don't bother doing anything if the program has errors.  */
+	      && !seen_error ());
+    }
+
+  virtual unsigned int execute (function *) { return ipa_pta_execute (); }
+
+}; // class pass_ipa_pta_oacc_kernels
+
+} // anon namespace
+
+simple_ipa_opt_pass *
+make_pass_ipa_pta_oacc_kernels (gcc::context *ctxt)
+{
+  return new pass_ipa_pta_oacc_kernels (ctxt);
+}

^ permalink raw reply	[flat|nested] 133+ messages in thread

* [PING][PATCH, 9/16] Add pass_parallelize_loops_oacc_kernels
  2015-11-24 12:27     ` Tom de Vries
@ 2015-12-13 16:58       ` Tom de Vries
  2015-12-14 15:23         ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-12-13 16:58 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

On 24/11/15 13:24, Tom de Vries wrote:
> On 16/11/15 12:59, Tom de Vries wrote:
>> On 09/11/15 20:52, Tom de Vries wrote:
>>> On 09/11/15 16:35, Tom de Vries wrote:
>>>> Hi,
>>>>
>>>> this patch series for stage1 trunk adds support to:
>>>> - parallelize oacc kernels regions using parloops, and
>>>> - map the loops onto the oacc gang dimension.
>>>>
>>>> The patch series contains these patches:
>>>>
>>>>       1    Insert new exit block only when needed in
>>>>          transform_to_exit_first_loop_alt
>>>>       2    Make create_parallel_loop return void
>>>>       3    Ignore reduction clause on kernels directive
>>>>       4    Implement -foffload-alias
>>>>       5    Add in_oacc_kernels_region in struct loop
>>>>       6    Add pass_oacc_kernels
>>>>       7    Add pass_dominator_oacc_kernels
>>>>       8    Add pass_ch_oacc_kernels
>>>>       9    Add pass_parallelize_loops_oacc_kernels
>>>>      10    Add pass_oacc_kernels pass group in passes.def
>>>>      11    Update testcases after adding kernels pass group
>>>>      12    Handle acc loop directive
>>>>      13    Add c-c++-common/goacc/kernels-*.c
>>>>      14    Add gfortran.dg/goacc/kernels-*.f95
>>>>      15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>>>      16    Add libgomp.oacc-fortran/kernels-*.f95
>>>>
>>>> The first 9 patches are more or less independent, but patches 10-16 are
>>>> intended to be committed at the same time.
>>>>
>>>> Bootstrapped and reg-tested on x86_64.
>>>>
>>>> Build and reg-tested with nvidia accelerator, in combination with a
>>>> patch that enables accelerator testing (which is submitted at
>>>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>>>
>>>> I'll post the individual patches in reply to this message.
>>>
>>> This patch adds pass_parallelize_loops_oacc_kernels.
>>>
>>> There are a number of things we do differently in parloops for oacc
>>> kernels:
>>> - in normal parloops, we generate code to choose between a parallel
>>>    version of the loop, and a sequential (low iteration count) version.
>>>    Since the code in oacc kernels region is supposed to run on the
>>>    accelerator anyway, we skip this check, and don't add a low iteration
>>>    count loop.
>>> - in normal parloops, we generate a #pragma omp parallel /
>>>    GIMPLE_OMP_RETURN pair to delimit the region which we will split off
>>>    into a thread function. Since the oacc kernels region is already
>>>    split off, we don't add this pair.
>>> - we indicate the parallelization factor by setting the oacc function
>>>    attributes
>>> - we generate a #pragma oacc loop instead of a #pragma omp for, and
>>>    we add the gang clause
>>> - in normal parloops, we rewrite the variable accesses in the loop
>>>    into accesses relative to a thread function parameter. For the
>>>    oacc kernels region, that rewrite has already been done at omp-lower,
>>>    so we skip this.
>>> - we need to ensure that the entire kernels region can be run in
>>>    parallel. The loop independence check is already present, so for oacc
>>>    kernels we add a check between blocks outside the loop and the entire
>>>    region.
>>> - we guard stores in the blocks outside the loop with gang_pos == 0.
>>>    There's no need for each gang to write to a single location, we can
>>>    do this in just one gang. (Typically this is the write of the final
>>>    value of the iteration variable if that one is copied back to the
>>>    host).
>>>
>>
>> Reposting with loop optimizer init added in
>> pass_parallelize_loops_oacc_kernels::execute.
>>
>
> Reposting with loop_optimizer_finalize, scev_initialize and scev_finalize
>   added in pass_parallelize_loops_oacc_kernels::execute.
>
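
As an illustration of the gang_pos == 0 guard from the last item of the
quoted list, here is a small host-side model. It is only a sketch: the
names region_result and run_region are hypothetical, the gangs are simulated
one after another on the host, and none of this is code from the patch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of one kernels region executed by several gangs.
struct region_result
{
  std::vector<int> a;   // array written by the parallelized loop
  int final_i;          // final value of the iteration variable
  int final_i_stores;   // number of gangs that stored to final_i
};

region_result run_region (int n, int num_gangs)
{
  region_result r;
  r.a.resize (n);
  r.final_i = -1;
  r.final_i_stores = 0;

  // Each pass of this outer loop stands in for one gang.
  for (int gang_pos = 0; gang_pos < num_gangs; gang_pos++)
    {
      // Loop body: iterations are distributed over the gangs.
      for (int i = gang_pos; i < n; i += num_gangs)
        r.a[i] = 2 * i;

      // Block outside the loop: every gang would store the same value,
      // so the store is guarded and done by a single gang only.
      if (gang_pos == 0)
        {
          r.final_i = n;
          r.final_i_stores++;
        }
    }
  return r;
}
```

With n = 10 and four gangs, every element of a is written exactly once by
the distributed loop, while final_i is stored once rather than four times.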

Ping.

Anything I can do to facilitate the review?

Thanks,
  Tom
>

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-12-13 16:38                     ` Tom de Vries
@ 2015-12-14 13:26                       ` Richard Biener
  2015-12-14 15:44                         ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-12-14 13:26 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Jakub Jelinek, gcc-patches

On Sun, 13 Dec 2015, Tom de Vries wrote:

> On 11/12/15 14:00, Richard Biener wrote:
> > On Fri, 11 Dec 2015, Tom de Vries wrote:
> > 
> > > On 13/11/15 12:39, Jakub Jelinek wrote:
> > > > We simply have some compiler internal interface between the caller and
> > > > callee of the outlined regions, each interface in between those has
> > > > its own structure type used to communicate the info;
> > > > we can attach attributes on the fields, or some flags to indicate some
> > > > properties interesting from aliasing POV.  We don't really need to
> > > > perform
> > > > full IPA-PTA, perhaps it would be enough to a) record somewhere in
> > > > cgraph
> > > > the relationship in between such callers and callees (for offloading
> > > > regions
> > > > we already have "omp target entrypoint" attribute on the callee and a
> > > > single caller), tell LTO if possible not to split those into different
> > > > partitions if easily possible, and then just for these pairs perform
> > > > aliasing/points-to analysis in the caller and the result record using
> > > > cliques/special attributes/whatever to the callee side, so that the
> > > > callee
> > > > (outlined OpenMP/OpenACC/Cilk+ region) can then improve its alias
> > > > analysis.
> > > 
> > > Hi,
> > > 
> > > This work-in-progress patch allows me to use IPA PTA information in the
> > > kernels pass group.
> > > 
> > > Since:
> > > -  I'm running IPA PTA before ealias, and IPA PTA does not interpret
> > >     restrict, and
> > > - compute_may_aliases doesn't run if IPA PTA information is present
> > > I needed to convince ealias to do the restrict clique/base annotation.
> > > 
> > > It would be more logical to fit IPA PTA after ealias, but one is an IPA
> > > pass,
> > > the other a regular one-function pass, so I would have to split the
> > > containing
> > > pass groups pass_all_early_optimizations and
> > > pass_local_optimization_passes.
> > > I'll give that a try now.
> > > 
> 
> I've tried this approach, but realized that this changes the order in which
> non-openacc functions are processed in the compiler, so I've abandoned this
> idea.
> 
> > > Any comments?
> > 
> > I don't think you want to run IPA PTA before early
> > optimizations, it (and ealias) rely on some initial cleanup to
> > do anything meaningful with well-spent resources.
> > 
> > The local PTA "hack" also looks more like a waste of resources, but well
> > ... teaching IPA PTA to honor restrict might be an impossible task
> > though I didn't think much about it other than handling it only for
> > nonlocal_p functions (for others we should see all incoming args
> > if IPA PTA works optimally).  The restrict tags will leak all over
> > the place of course and in the end no meaningful cliques may remain.
> > 
> 
> This patch:
> - moves the kernels pass group to the first position in the pass list
>   after ealias where we're back in ipa mode
> - inserts a new ipa pass to contain the gimple pass group called
>   pass_oacc_ipa
> - inserts a version of ipa-pta before the pass group.

In principle I like this a lot, but

+  NEXT_PASS (pass_ipa_pta_oacc_kernels);
+  NEXT_PASS (pass_oacc_ipa);
+  PUSH_INSERT_PASSES_WITHIN (pass_oacc_ipa)

I think you can put pass_ipa_pta_oacc_kernels into the pass_oacc_ipa
group and thus just "clone" ipa_pta?  Sub-passes of IPA passes can
be both IPA passes and non-IPA passes.
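
For illustration, the suggested layout might look roughly like the following
passes.def fragment. This is an untested sketch that reuses the pass names
from the patch above and assumes pass_ipa_pta can be instantiated a second
time (that is, that it provides a clone method); it is not what the posted
patch does.

```cpp
  NEXT_PASS (pass_oacc_ipa);
  PUSH_INSERT_PASSES_WITHIN (pass_oacc_ipa)
      /* Second instance of the existing IPA PTA pass, in place of the
         separate pass_ipa_pta_oacc_kernels.  Assumes pass_ipa_pta can
         be cloned.  */
      NEXT_PASS (pass_ipa_pta);
      /* Pass group that runs when the function is an offloaded function
         containing oacc kernels loops.  */
      NEXT_PASS (pass_oacc_kernels);
      PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
          NEXT_PASS (pass_ch);
          NEXT_PASS (pass_fre);
          NEXT_PASS (pass_lim);
          NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
          NEXT_PASS (pass_dce);
          NEXT_PASS (pass_expand_omp_ssa);
          NEXT_PASS (pass_rebuild_cgraph_edges);
      POP_INSERT_PASSES ()
  POP_INSERT_PASSES ()
```

The gate is left open in this sketch: the new pass in the patch is gated on
flag_openacc and flag_tree_parallelize_loops, so a reused pass_ipa_pta
instance would need to behave equivalently inside the group.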

Thanks,
Richard.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PING][PATCH, 9/16] Add pass_parallelize_loops_oacc_kernels
  2015-12-13 16:58       ` [PING][PATCH, " Tom de Vries
@ 2015-12-14 15:23         ` Richard Biener
  2016-01-16 22:41           ` [Committed] Move pass_expand_omp_ssa out of pass_parallelize_loops Tom de Vries
                             ` (2 more replies)
  0 siblings, 3 replies; 133+ messages in thread
From: Richard Biener @ 2015-12-14 15:23 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek, Richard Biener

On Sun, Dec 13, 2015 at 5:58 PM, Tom de Vries <Tom_deVries@mentor.com> wrote:
> On 24/11/15 13:24, Tom de Vries wrote:
>>
>> On 16/11/15 12:59, Tom de Vries wrote:
>>>
>>> On 09/11/15 20:52, Tom de Vries wrote:
>>>>
>>>> On 09/11/15 16:35, Tom de Vries wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> this patch series for stage1 trunk adds support to:
>>>>> - parallelize oacc kernels regions using parloops, and
>>>>> - map the loops onto the oacc gang dimension.
>>>>>
>>>>> The patch series contains these patches:
>>>>>
>>>>>       1    Insert new exit block only when needed in
>>>>>          transform_to_exit_first_loop_alt
>>>>>       2    Make create_parallel_loop return void
>>>>>       3    Ignore reduction clause on kernels directive
>>>>>       4    Implement -foffload-alias
>>>>>       5    Add in_oacc_kernels_region in struct loop
>>>>>       6    Add pass_oacc_kernels
>>>>>       7    Add pass_dominator_oacc_kernels
>>>>>       8    Add pass_ch_oacc_kernels
>>>>>       9    Add pass_parallelize_loops_oacc_kernels
>>>>>      10    Add pass_oacc_kernels pass group in passes.def
>>>>>      11    Update testcases after adding kernels pass group
>>>>>      12    Handle acc loop directive
>>>>>      13    Add c-c++-common/goacc/kernels-*.c
>>>>>      14    Add gfortran.dg/goacc/kernels-*.f95
>>>>>      15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>>>>      16    Add libgomp.oacc-fortran/kernels-*.f95
>>>>>
>>>>> The first 9 patches are more or less independent, but patches 10-16 are
>>>>> intended to be committed at the same time.
>>>>>
>>>>> Bootstrapped and reg-tested on x86_64.
>>>>>
>>>>> Build and reg-tested with nvidia accelerator, in combination with a
>>>>> patch that enables accelerator testing (which is submitted at
>>>>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>>>>
>>>>> I'll post the individual patches in reply to this message.
>>>>
>>>>
>>>> This patch adds pass_parallelize_loops_oacc_kernels.
>>>>
>>>> There are a number of things we do differently in parloops for oacc
>>>> kernels:
>>>> - in normal parloops, we generate code to choose between a parallel
>>>>    version of the loop, and a sequential (low iteration count) version.
>>>>    Since the code in oacc kernels region is supposed to run on the
>>>>    accelerator anyway, we skip this check, and don't add a low iteration
>>>>    count loop.
>>>> - in normal parloops, we generate a #pragma omp parallel /
>>>>    GIMPLE_OMP_RETURN pair to delimit the region which we will split off
>>>>    into a thread function. Since the oacc kernels region is already
>>>>    split off, we don't add this pair.
>>>> - we indicate the parallelization factor by setting the oacc function
>>>>    attributes
>>>> - we generate a #pragma oacc loop instead of a #pragma omp for, and
>>>>    we add the gang clause
>>>> - in normal parloops, we rewrite the variable accesses in the loop
>>>>    into accesses relative to a thread function parameter. For the
>>>>    oacc kernels region, that rewrite has already been done at omp-lower,
>>>>    so we skip this.
>>>> - we need to ensure that the entire kernels region can be run in
>>>>    parallel. The loop independence check is already present, so for oacc
>>>>    kernels we add a check between blocks outside the loop and the entire
>>>>    region.
>>>> - we guard stores in the blocks outside the loop with gang_pos == 0.
>>>>    There's no need for each gang to write to a single location, we can
>>>>    do this in just one gang. (Typically this is the write of the final
>>>>    value of the iteration variable if that one is copied back to the
>>>>    host).
>>>>
>>>
>>> Reposting with loop optimizer init added in
>>> pass_parallelize_loops_oacc_kernels::execute.
>>>
>>
>> Reposting with loop_optimizer_finalize,scev_initialize and scev_finalize
>>   added in pass_parallelize_loops_oacc_kernels::execute.
>>
>
> Ping.
>
> Anything I can do to facilitate the review?

Document new functions, avoid if (1).

Ideally some refactoring would avoid some of the if (!oacc_kernels_p) spaghetti
but I'm considering tree-parloops.c (and its bugs) yours.

Can the pass not just use a pass parameter to switch between oacc/non-oacc?

Richard.

> Thanks,
>  Tom
>>
>>
>

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-12-14 13:26                       ` Richard Biener
@ 2015-12-14 15:44                         ` Tom de Vries
  2015-12-16 13:16                           ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-12-14 15:44 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jakub Jelinek, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 4194 bytes --]

On 14/12/15 14:26, Richard Biener wrote:
> On Sun, 13 Dec 2015, Tom de Vries wrote:
>
>> On 11/12/15 14:00, Richard Biener wrote:
>>> On Fri, 11 Dec 2015, Tom de Vries wrote:
>>>
>>>> On 13/11/15 12:39, Jakub Jelinek wrote:
>>>>> We simply have some compiler internal interface between the caller and
>>>>> callee of the outlined regions, each interface in between those has
>>>>> its own structure type used to communicate the info;
>>>>> we can attach attributes on the fields, or some flags to indicate some
>>>>> properties interesting from aliasing POV.  We don't really need to
>>>>> perform
>>>>> full IPA-PTA, perhaps it would be enough to a) record somewhere in
>>>>> cgraph
>>>>> the relationship in between such callers and callees (for offloading
>>>>> regions
>>>>> we already have "omp target entrypoint" attribute on the callee and a
>>>>> singler caller), tell LTO if possible not to split those into different
>>>>> partitions if easily possible, and then just for these pairs perform
>>>>> aliasing/points-to analysis in the caller and the result record using
>>>>> cliques/special attributes/whatever to the callee side, so that the
>>>>> callee
>>>>> (outlined OpenMP/OpenACC/Cilk+ region) can then improve its alias
>>>>> analysis.
>>>>
>>>> Hi,
>>>>
>>>> This work-in-progress patch allows me to use IPA PTA information in the
>>>> kernels pass group.
>>>>
>>>> Since:
>>>> -  I'm running IPA PTA before ealias, and IPA PTA does not interpret
>>>>      restrict, and
>>>> - compute_may_alias doesn't run if IPA PTA information is present
>>>> I needed to convince ealias to do the restrict clique/base annotation.
>>>>
>>>> It would be more logical to fit IPA PTA after ealias, but one is an IPA
>>>> pass,
>>>> the other a regular one-function pass, so I would have to split the
>>>> containing
>>>> pass groups pass_all_early_optimizations and
>>>> pass_local_optimization_passes.
>>>> I'll give that a try now.
>>>>
>>
>> I've tried this approach, but realized that this changes the order in which
>> non-openacc functions are processed in the compiler, so I've abandoned this
>> idea.
>>
>>>> Any comments?
>>>
>>> I don't think you want to run IPA PTA before early
>>> optimizations, it (and ealias) rely on some initial cleanup to
>>> do anything meaningful with well-spent ressources.
>>>
>>> The local PTA "hack" also looks more like a waste of resources, but well
>>> ... teaching IPA PTA to honor restrict might be an impossible task
>>> though I didn't think much about it other than handling it only for
>>> nonlocal_p functions (for others we should see all incoming args
>>> if IPA PTA works optimally).  The restrict tags will leak all over
>>> the place of course and in the end no meaningful cliques may remain.
>>>
>>
>> This patch:
>> - moves the kernels pass group to the first position in the pass list
>>    after ealias where we're back in ipa mode
>> - inserts an new ipa pass to contain the gimple pass group called
>>    pass_oacc_ipa
>> - inserts a version of ipa-pta before the pass group.
>
> In principle I like this a lot, but
>
> +  NEXT_PASS (pass_ipa_pta_oacc_kernels);
> +  NEXT_PASS (pass_oacc_ipa);
> +  PUSH_INSERT_PASSES_WITHIN (pass_oacc_ipa)
>
> I think you can put pass_ipa_pta_oacc_kernels into the pass_oacc_ipa
> group and thus just "clone" ipa_pta?

Done. But using a clone means using the same gate function, and that 
means that this pass_ipa_pta instance no longer runs by default for 
openacc.

I've added enabling-by-default of fipa-pta for fopenacc in 
default_options_optimization to fix that.
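
To illustrate the defaulting rule described above, here is a minimal, self-contained sketch. It is not the GCC code: `Options`, `OptionsSet` and `apply_openacc_defaults` are hypothetical names that only model the idea of an "effective value" structure next to an "explicitly set by the user" structure.

```cpp
#include <cassert>

// Hypothetical stand-ins for the compiler's option state:
// `Options` holds the effective values, `OptionsSet` records which
// options were given explicitly on the command line.
struct Options { bool ipa_pta = false; };
struct OptionsSet { bool ipa_pta = false; };

// Turn IPA PTA on by default in OpenACC mode, but never override an
// explicit -fipa-pta / -fno-ipa-pta choice made by the user.
void apply_openacc_defaults(bool openacc_mode, Options &opts,
                            const OptionsSet &opts_set)
{
  if (openacc_mode && !opts_set.ipa_pta)
    opts.ipa_pta = true;
}
```

The point of checking the "set" structure is that a default should only fill in a value the user left unspecified.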

> sub-passes of IPA passes can
> be both ipa passes and non-ipa passes.

Right. It does mean that I need yet another pass (pass_ipa_oacc_kernels) 
to do the IPA/non-IPA transition at the pass/sub-pass boundary:
...
   NEXT_PASS (pass_ipa_oacc);
   PUSH_INSERT_PASSES_WITHIN (pass_ipa_oacc)
       NEXT_PASS (pass_ipa_pta);
       NEXT_PASS (pass_ipa_oacc_kernels);
       PUSH_INSERT_PASSES_WITHIN (pass_ipa_oacc_kernels)
          /* out-of-ipa */
          NEXT_PASS (pass_oacc_kernels);
          PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
...

OK for stage3 if bootstrap and reg-test succeed?

Thanks,
- Tom


[-- Attachment #2: 0003-Add-pass_oacc_ipa.patch --]
[-- Type: text/x-patch, Size: 15289 bytes --]

Add pass_oacc_ipa

2015-12-14  Tom de Vries  <tom@codesourcery.com>

	* opts.c (default_options_optimization): Set fipa-pta on by default for
	fopenacc.
	* passes.def: Move kernels pass group to pass_ipa_oacc.
	* tree-pass.h (make_pass_oacc_kernels2): Remove.
	(make_pass_ipa_oacc, make_pass_ipa_oacc_kernels): Declare.
	* tree-ssa-loop.c (pass_oacc_kernels2, make_pass_oacc_kernels2): Remove.
	(pass_ipa_oacc, pass_ipa_oacc_kernels): New pass.
	(make_pass_ipa_oacc, make_pass_ipa_oacc_kernels): New function.
	* tree-ssa-structalias.c (pass_ipa_pta::clone): New function.

	* g++.dg/ipa/devirt-37.C: Update for new fre2 pass.
	* g++.dg/ipa/devirt-40.C: Same.
	* g++.dg/tree-ssa/pr61034.C: Same.
	* gcc.dg/ipa/ipa-pta-13.c: Same.
	* gcc.dg/ipa/ipa-pta-3.c: Same.
	* gcc.dg/ipa/ipa-pta-4.c: Same.

---
 gcc/opts.c                              |  9 ++++
 gcc/passes.def                          | 41 ++++++++++--------
 gcc/testsuite/g++.dg/ipa/devirt-37.C    | 10 ++---
 gcc/testsuite/g++.dg/ipa/devirt-40.C    |  4 +-
 gcc/testsuite/g++.dg/tree-ssa/pr61034.C | 10 ++---
 gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c   |  4 +-
 gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c    |  4 +-
 gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c    |  4 +-
 gcc/tree-pass.h                         |  3 +-
 gcc/tree-ssa-loop.c                     | 76 ++++++++++++++++++++++++---------
 gcc/tree-ssa-structalias.c              |  2 +
 11 files changed, 110 insertions(+), 57 deletions(-)

diff --git a/gcc/opts.c b/gcc/opts.c
index 3d25f98..42d5566 100644
--- a/gcc/opts.c
+++ b/gcc/opts.c
@@ -560,6 +560,7 @@ default_options_optimization (struct gcc_options *opts,
 {
   unsigned int i;
   int opt2;
+  bool openacc_mode = false;
 
   /* Scan to see what optimization level has been specified.  That will
      determine the default value of many flags.  */
@@ -619,6 +620,10 @@ default_options_optimization (struct gcc_options *opts,
 	  opts->x_optimize_debug = 1;
 	  break;
 
+	case OPT_fopenacc:
+	  openacc_mode = true;
+	  break;
+
 	default:
 	  /* Ignore other options in this prescan.  */
 	  break;
@@ -633,6 +638,10 @@ default_options_optimization (struct gcc_options *opts,
   /* -O2 param settings.  */
   opt2 = (opts->x_optimize >= 2);
 
+  if (openacc_mode
+      && !opts_set->x_flag_ipa_pta)
+    opts->x_flag_ipa_pta = true;
+
   /* Track fields in field-sensitive alias analysis.  */
   maybe_set_param_value
     (PARAM_MAX_FIELDS_FOR_FIELD_SENSITIVE,
diff --git a/gcc/passes.def b/gcc/passes.def
index 43ce3d5..96e18f1 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -88,24 +88,7 @@ along with GCC; see the file COPYING3.  If not see
 	  /* pass_build_ealias is a dummy pass that ensures that we
 	     execute TODO_rebuild_alias at this point.  */
 	  NEXT_PASS (pass_build_ealias);
-	  /* Pass group that runs when the function is an offloaded function
-	     containing oacc kernels loops.  Part 1.  */
-	  NEXT_PASS (pass_oacc_kernels);
-	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
-	      NEXT_PASS (pass_ch);
-	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_fre);
-	  /* Pass group that runs when the function is an offloaded function
-	     containing oacc kernels loops.  Part 2.  */
-	  NEXT_PASS (pass_oacc_kernels2);
-	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
-	      /* We use pass_lim to rewrite in-memory iteration and reduction
-		 variable accesses in loops into local variables accesses.  */
-	      NEXT_PASS (pass_lim);
-	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
-	      NEXT_PASS (pass_dce);
-	      NEXT_PASS (pass_expand_omp_ssa);
-	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_merge_phi);
           NEXT_PASS (pass_dse);
 	  NEXT_PASS (pass_cd_dce);
@@ -124,6 +107,30 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_rebuild_cgraph_edges);
       NEXT_PASS (pass_inline_parameters);
   POP_INSERT_PASSES ()
+
+  NEXT_PASS (pass_ipa_oacc);
+  PUSH_INSERT_PASSES_WITHIN (pass_ipa_oacc)
+      NEXT_PASS (pass_ipa_pta);
+      /* Pass group that runs when the function is an offloaded function
+	 containing oacc kernels loops.	 */
+      NEXT_PASS (pass_ipa_oacc_kernels);
+      PUSH_INSERT_PASSES_WITHIN (pass_ipa_oacc_kernels)
+	  NEXT_PASS (pass_oacc_kernels);
+	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
+	      NEXT_PASS (pass_ch);
+	      NEXT_PASS (pass_fre);
+	      /* We use pass_lim to rewrite in-memory iteration and reduction
+		 variable accesses in loops into local variables accesses.  */
+	      NEXT_PASS (pass_lim);
+	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+	      NEXT_PASS (pass_dce);
+	      /* pass_parallelize_loops_oacc_kernels */
+	      NEXT_PASS (pass_expand_omp_ssa);
+	      NEXT_PASS (pass_rebuild_cgraph_edges);
+	  POP_INSERT_PASSES ()
+      POP_INSERT_PASSES ()
+  POP_INSERT_PASSES ()
+
   NEXT_PASS (pass_ipa_chkp_produce_thunks);
   NEXT_PASS (pass_ipa_auto_profile);
   NEXT_PASS (pass_ipa_free_inline_summary);
diff --git a/gcc/testsuite/g++.dg/ipa/devirt-37.C b/gcc/testsuite/g++.dg/ipa/devirt-37.C
index 9c5287e..b7f52a0 100644
--- a/gcc/testsuite/g++.dg/ipa/devirt-37.C
+++ b/gcc/testsuite/g++.dg/ipa/devirt-37.C
@@ -1,4 +1,4 @@
-/* { dg-options "-fpermissive -O2 -fno-indirect-inlining -fno-devirtualize-speculatively -fdump-tree-fre2-details -fno-early-inlining"  } */
+/* { dg-options "-fpermissive -O2 -fno-indirect-inlining -fno-devirtualize-speculatively -fdump-tree-fre3-details -fno-early-inlining"  } */
 #include <stdlib.h>
 struct A {virtual void test() {abort ();}};
 struct B:A
@@ -30,7 +30,7 @@ t()
 /* After inlining the call within constructor needs to be checked to not go into a basetype.
    We should see the vtbl store and we should notice extcall as possibly clobbering the
    type but ignore it because b is in static storage.  */
-/* { dg-final { scan-tree-dump "No dynamic type change found."  "fre2"  } } */
-/* { dg-final { scan-tree-dump "Checking vtbl store:"  "fre2"  } } */
-/* { dg-final { scan-tree-dump "Function call may change dynamic type:extcall"  "fre2"  } } */
-/* { dg-final { scan-tree-dump "converting indirect call to function virtual void"  "fre2"  } } */
+/* { dg-final { scan-tree-dump "No dynamic type change found."  "fre3"  } } */
+/* { dg-final { scan-tree-dump "Checking vtbl store:"  "fre3"  } } */
+/* { dg-final { scan-tree-dump "Function call may change dynamic type:extcall"  "fre3"  } } */
+/* { dg-final { scan-tree-dump "converting indirect call to function virtual void"  "fre3"  } } */
diff --git a/gcc/testsuite/g++.dg/ipa/devirt-40.C b/gcc/testsuite/g++.dg/ipa/devirt-40.C
index 279a228..5107c29 100644
--- a/gcc/testsuite/g++.dg/ipa/devirt-40.C
+++ b/gcc/testsuite/g++.dg/ipa/devirt-40.C
@@ -1,4 +1,4 @@
-/* { dg-options "-O2 -fdump-tree-fre2-details"  } */
+/* { dg-options "-O2 -fdump-tree-fre3-details"  } */
 typedef enum
 {
 } UErrorCode;
@@ -19,4 +19,4 @@ A::m_fn1 (UnicodeString &, int &p2, UErrorCode &) const
   UnicodeString a[2];
 }
 
-/* { dg-final { scan-tree-dump-not "\\n  OBJ_TYPE_REF" "fre2"  } } */
+/* { dg-final { scan-tree-dump-not "\\n  OBJ_TYPE_REF" "fre3"  } } */
diff --git a/gcc/testsuite/g++.dg/tree-ssa/pr61034.C b/gcc/testsuite/g++.dg/tree-ssa/pr61034.C
index cd4ee05..c06c580 100644
--- a/gcc/testsuite/g++.dg/tree-ssa/pr61034.C
+++ b/gcc/testsuite/g++.dg/tree-ssa/pr61034.C
@@ -1,5 +1,5 @@
 // { dg-do compile }
-// { dg-options "-O2 -fdump-tree-fre2 -fdump-tree-optimized" }
+// { dg-options "-O2 -fdump-tree-fre3 -fdump-tree-optimized" }
 
 #define assume(x) if(!(x))__builtin_unreachable()
 
@@ -42,13 +42,13 @@ bool f(I a, I b, I c, I d) {
 // a bunch of conditional free()s and unreachable()s.
 // This works only if everything is inlined into 'f'.
 
-// { dg-final { scan-tree-dump-times ";; Function" 1 "fre2" } }
-// { dg-final { scan-tree-dump-times "unreachable" 11 "fre2" } }
+// { dg-final { scan-tree-dump-times ";; Function" 1 "fre3" } }
+// { dg-final { scan-tree-dump-times "unreachable" 11 "fre3" } }
 
 // Note that depending on PUSH_ARGS_REVERSED we are presented with
 // a different initial CFG and thus the final outcome is different
 
-// { dg-final { scan-tree-dump-times "free" 10 "fre2" { target x86_64-*-* i?86-*-* } } }
+// { dg-final { scan-tree-dump-times "free" 10 "fre3" { target x86_64-*-* i?86-*-* } } }
 // { dg-final { scan-tree-dump-times "free" 3 "optimized" { target x86_64-*-* i?86-*-* } } }
-// { dg-final { scan-tree-dump-times "free" 14 "fre2" { target aarch64-*-* ia64-*-* arm-*-* hppa*-*-* sparc*-*-* powerpc*-*-* alpha*-*-* } } }
+// { dg-final { scan-tree-dump-times "free" 14 "fre3" { target aarch64-*-* ia64-*-* arm-*-* hppa*-*-* sparc*-*-* powerpc*-*-* alpha*-*-* } } }
 // { dg-final { scan-tree-dump-times "free" 4 "optimized" { target aarch64-*-* ia64-*-* arm-*-* hppa*-*-* sparc*-*-* powerpc*-*-* alpha*-*-* } } }
diff --git a/gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c b/gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c
index f558df3..71b31c4 100644
--- a/gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c
+++ b/gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c
@@ -1,5 +1,5 @@
 /* { dg-do link } */
-/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre2 -fno-ipa-icf" } */
+/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre3 -fno-ipa-icf" } */
 
 static int x, y;
 
@@ -54,7 +54,7 @@ int main()
   local_address_taken (&y);
   /* As we are computing flow- and context-insensitive we may not
      CSE the load of x here.  */
-  /* { dg-final { scan-tree-dump " = x;" "fre2" } } */
+  /* { dg-final { scan-tree-dump " = x;" "fre3" } } */
   return x;
 }
 
diff --git a/gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c b/gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c
index ff6fa57..8655794 100644
--- a/gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c
+++ b/gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre2-details" } */
+/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre3-details" } */
 
 static int __attribute__((noinline,noclone))
 foo (int *p, int *q)
@@ -23,4 +23,4 @@ int main()
 
 /* { dg-final { scan-ipa-dump "foo.arg0 = &a" "pta" } } */
 /* { dg-final { scan-ipa-dump "foo.arg1 = &b" "pta" } } */
-/* { dg-final { scan-tree-dump "Replaced \\\*p_2\\\(D\\\) with 1" "fre2" } } */
+/* { dg-final { scan-tree-dump "Replaced \\\*p_2\\\(D\\\) with 1" "fre3" } } */
diff --git a/gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c b/gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c
index 106e325..c42762a 100644
--- a/gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c
+++ b/gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre2-details" } */
+/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre3-details" } */
 
 int a, b;
 
@@ -28,4 +28,4 @@ int main()
 
 /* { dg-final { scan-ipa-dump "foo.arg0 = &a" "pta" } } */
 /* { dg-final { scan-ipa-dump "foo.arg1 = &b" "pta" } } */
-/* { dg-final { scan-tree-dump "Replaced \\\*p_2\\\(D\\\) with 1" "fre2" } } */
+/* { dg-final { scan-tree-dump "Replaced \\\*p_2\\\(D\\\) with 1" "fre3" } } */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index e1cbce9..dcdbdfd 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -468,7 +468,8 @@ extern gimple_opt_pass *make_pass_vtable_verify (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ubsan (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_sanopt (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_oacc_kernels (gcc::context *ctxt);
-extern gimple_opt_pass *make_pass_oacc_kernels2 (gcc::context *ctxt);
+extern simple_ipa_opt_pass *make_pass_ipa_oacc (gcc::context *ctxt);
+extern simple_ipa_opt_pass *make_pass_ipa_oacc_kernels (gcc::context *ctxt);
 
 /* IPA Passes */
 extern simple_ipa_opt_pass *make_pass_ipa_lower_emutls (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index cf7d94e..1fe2716 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -36,6 +36,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-scalar-evolution.h"
 #include "tree-vectorizer.h"
 #include "omp-low.h"
+#include "diagnostic-core.h"
 
 
 /* A pass making sure loops are fixed up.  */
@@ -206,12 +207,14 @@ make_pass_oacc_kernels (gcc::context *ctxt)
   return new pass_oacc_kernels (ctxt);
 }
 
+/* The ipa oacc superpass.  */
+
 namespace {
 
-const pass_data pass_data_oacc_kernels2 =
+const pass_data pass_data_ipa_oacc =
 {
-  GIMPLE_PASS, /* type */
-  "oacc_kernels2", /* name */
+  SIMPLE_IPA_PASS, /* type */
+  "ipa_oacc", /* name */
   OPTGROUP_LOOP, /* optinfo_flags */
   TV_TREE_LOOP, /* tv_id */
   PROP_cfg, /* properties_required */
@@ -221,34 +224,65 @@ const pass_data pass_data_oacc_kernels2 =
   0, /* todo_flags_finish */
 };
 
-class pass_oacc_kernels2 : public gimple_opt_pass
+class pass_ipa_oacc : public simple_ipa_opt_pass
 {
 public:
-  pass_oacc_kernels2 (gcc::context *ctxt)
-    : gimple_opt_pass (pass_data_oacc_kernels2, ctxt)
+  pass_ipa_oacc (gcc::context *ctxt)
+    : simple_ipa_opt_pass (pass_data_ipa_oacc, ctxt)
   {}
 
   /* opt_pass methods: */
-  virtual bool gate (function *fn) { return gate_oacc_kernels (fn); }
-  virtual unsigned int execute (function *fn)
-    {
-      /* Rather than having a copy of the previous dump, get some use out of
-	 this dump, and try to minimize differences with the following pass
-	 (pass_lim), which will initizalize the loop optimizer with
-	 LOOPS_NORMAL.  */
-      loop_optimizer_init (LOOPS_NORMAL);
-      loop_optimizer_finalize (fn);
-      return 0;
-    }
+  virtual bool gate (function *)
+  {
+    return (optimize
+	    /* Don't bother doing anything if the program has errors.  */
+	    && !seen_error ()
+	    && flag_openacc
+	    && flag_tree_parallelize_loops > 1);
+  }
 
-}; // class pass_oacc_kernels2
+}; // class pass_ipa_oacc
 
 } // anon namespace
 
-gimple_opt_pass *
-make_pass_oacc_kernels2 (gcc::context *ctxt)
+simple_ipa_opt_pass *
+make_pass_ipa_oacc (gcc::context *ctxt)
+{
+  return new pass_ipa_oacc (ctxt);
+}
+
+/* The ipa oacc kernels pass.  */
+
+namespace {
+
+const pass_data pass_data_ipa_oacc_kernels =
+{
+  SIMPLE_IPA_PASS, /* type */
+  "ipa_oacc_kernels", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_TREE_LOOP, /* tv_id */
+  PROP_cfg, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_ipa_oacc_kernels : public simple_ipa_opt_pass
+{
+public:
+  pass_ipa_oacc_kernels (gcc::context *ctxt)
+    : simple_ipa_opt_pass (pass_data_ipa_oacc_kernels, ctxt)
+  {}
+
+}; // class pass_ipa_oacc_kernels
+
+} // anon namespace
+
+simple_ipa_opt_pass *
+make_pass_ipa_oacc_kernels (gcc::context *ctxt)
 {
-  return new pass_oacc_kernels2 (ctxt);
+  return new pass_ipa_oacc_kernels (ctxt);
 }
 
 /* The no-loop superpass.  */
diff --git a/gcc/tree-ssa-structalias.c b/gcc/tree-ssa-structalias.c
index b34c955..5f8c0b6 100644
--- a/gcc/tree-ssa-structalias.c
+++ b/gcc/tree-ssa-structalias.c
@@ -7821,6 +7821,8 @@ public:
 	      && !seen_error ());
     }
 
+  opt_pass * clone () { return new pass_ipa_pta (m_ctxt); }
+
   virtual unsigned int execute (function *) { return ipa_pta_execute (); }
 
 }; // class pass_ipa_pta

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-12-14 15:44                         ` Tom de Vries
@ 2015-12-16 13:16                           ` Richard Biener
  2015-12-16 14:43                             ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-12-16 13:16 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Jakub Jelinek, gcc-patches

On Mon, 14 Dec 2015, Tom de Vries wrote:

> On 14/12/15 14:26, Richard Biener wrote:
> > On Sun, 13 Dec 2015, Tom de Vries wrote:
> > 
> > > On 11/12/15 14:00, Richard Biener wrote:
> > > > On Fri, 11 Dec 2015, Tom de Vries wrote:
> > > > 
> > > > > On 13/11/15 12:39, Jakub Jelinek wrote:
> > > > > > We simply have some compiler internal interface between the caller
> > > > > > and
> > > > > > callee of the outlined regions, each interface in between those has
> > > > > > its own structure type used to communicate the info;
> > > > > > we can attach attributes on the fields, or some flags to indicate
> > > > > > some
> > > > > > properties interesting from aliasing POV.  We don't really need to
> > > > > > perform
> > > > > > full IPA-PTA, perhaps it would be enough to a) record somewhere in
> > > > > > cgraph
> > > > > > the relationship in between such callers and callees (for offloading
> > > > > > regions
> > > > > > we already have "omp target entrypoint" attribute on the callee and
> > > > > > a
> > > > > > singler caller), tell LTO if possible not to split those into
> > > > > > different
> > > > > > partitions if easily possible, and then just for these pairs perform
> > > > > > aliasing/points-to analysis in the caller and the result record
> > > > > > using
> > > > > > cliques/special attributes/whatever to the callee side, so that the
> > > > > > callee
> > > > > > (outlined OpenMP/OpenACC/Cilk+ region) can then improve its alias
> > > > > > analysis.
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > This work-in-progress patch allows me to use IPA PTA information in
> > > > > the
> > > > > kernels pass group.
> > > > > 
> > > > > Since:
> > > > > -  I'm running IPA PTA before ealias, and IPA PTA does not interpret
> > > > >      restrict, and
> > > > > - compute_may_alias doesn't run if IPA PTA information is present
> > > > > I needed to convince ealias to do the restrict clique/base annotation.
> > > > > 
> > > > > It would be more logical to fit IPA PTA after ealias, but one is an
> > > > > IPA
> > > > > pass,
> > > > > the other a regular one-function pass, so I would have to split the
> > > > > containing
> > > > > pass groups pass_all_early_optimizations and
> > > > > pass_local_optimization_passes.
> > > > > I'll give that a try now.
> > > > > 
> > > 
> > > I've tried this approach, but realized that this changes the order in
> > > which
> > > non-openacc functions are processed in the compiler, so I've abandoned
> > > this
> > > idea.
> > > 
> > > > > Any comments?
> > > > 
> > > > I don't think you want to run IPA PTA before early
> > > > optimizations, it (and ealias) rely on some initial cleanup to
> > > > do anything meaningful with well-spent ressources.
> > > > 
> > > > The local PTA "hack" also looks more like a waste of resources, but well
> > > > ... teaching IPA PTA to honor restrict might be an impossible task
> > > > though I didn't think much about it other than handling it only for
> > > > nonlocal_p functions (for others we should see all incoming args
> > > > if IPA PTA works optimally).  The restrict tags will leak all over
> > > > the place of course and in the end no meaningful cliques may remain.
> > > > 
> > > 
> > > This patch:
> > > - moves the kernels pass group to the first position in the pass list
> > >    after ealias where we're back in ipa mode
> > > - inserts an new ipa pass to contain the gimple pass group called
> > >    pass_oacc_ipa
> > > - inserts a version of ipa-pta before the pass group.
> > 
> > In principle I like this a lot, but
> > 
> > +  NEXT_PASS (pass_ipa_pta_oacc_kernels);
> > +  NEXT_PASS (pass_oacc_ipa);
> > +  PUSH_INSERT_PASSES_WITHIN (pass_oacc_ipa)
> > 
> > I think you can put pass_ipa_pta_oacc_kernels into the pass_oacc_ipa
> > group and thus just "clone" ipa_pta?
> 
> Done. But using a clone means using the same gate function, and that means
> that this pass_ipa_pta instance no longer runs by default for openacc by
> default.
> 
> I've added enabling-by-default of fipa-pta for fopenacc in
> default_options_optimization to fix that.

Hmm, but that enables both IPA PTA passes then?  I suppose that's ok,
and if not enabling the "late" IPA PTA you'd want to re-set 
gimple_df->ipa_pta.

> > sub-passes of IPA passes can
> > be both ipa passes and non-ipa passes.
> 
> Right. It does mean that I need yet another pass (pass_ipa_oacc_kernels) to do
> the IPA/non-IPA transition at pass/sub-pass boundary:
> ...
>   NEXT_PASS (pass_ipa_oacc);
>   PUSH_INSERT_PASSES_WITHIN (pass_ipa_oacc)
>       NEXT_PASS (pass_ipa_pta);
>       NEXT_PASS (pass_ipa_oacc_kernels);
>       PUSH_INSERT_PASSES_WITHIN (pass_ipa_oacc_kernels)
>          /* out-of-ipa */
>          NEXT_PASS (pass_oacc_kernels);
>          PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
> ...
> 
> OK for stage3 if bootstrap and reg-test succeeds?

Ok.

Richard.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-12-16 13:16                           ` Richard Biener
@ 2015-12-16 14:43                             ` Tom de Vries
  2015-12-17 12:03                               ` [gomp4] " Thomas Schwinge
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-12-16 14:43 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jakub Jelinek, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 5197 bytes --]

On 16/12/15 14:16, Richard Biener wrote:
> On Mon, 14 Dec 2015, Tom de Vries wrote:
>
>> On 14/12/15 14:26, Richard Biener wrote:
>>> On Sun, 13 Dec 2015, Tom de Vries wrote:
>>>
>>>> On 11/12/15 14:00, Richard Biener wrote:
>>>>> On Fri, 11 Dec 2015, Tom de Vries wrote:
>>>>>
>>>>>> On 13/11/15 12:39, Jakub Jelinek wrote:
>>>>>>> We simply have some compiler internal interface between the caller
>>>>>>> and
>>>>>>> callee of the outlined regions, each interface in between those has
>>>>>>> its own structure type used to communicate the info;
>>>>>>> we can attach attributes on the fields, or some flags to indicate
>>>>>>> some
>>>>>>> properties interesting from aliasing POV.  We don't really need to
>>>>>>> perform
>>>>>>> full IPA-PTA, perhaps it would be enough to a) record somewhere in
>>>>>>> cgraph
>>>>>>> the relationship in between such callers and callees (for offloading
>>>>>>> regions
>>>>>>> we already have "omp target entrypoint" attribute on the callee and
>>>>>>> a
>>>>>>> singler caller), tell LTO if possible not to split those into
>>>>>>> different
>>>>>>> partitions if easily possible, and then just for these pairs perform
>>>>>>> aliasing/points-to analysis in the caller and the result record
>>>>>>> using
>>>>>>> cliques/special attributes/whatever to the callee side, so that the
>>>>>>> callee
>>>>>>> (outlined OpenMP/OpenACC/Cilk+ region) can then improve its alias
>>>>>>> analysis.
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> This work-in-progress patch allows me to use IPA PTA information in
>>>>>> the
>>>>>> kernels pass group.
>>>>>>
>>>>>> Since:
>>>>>> -  I'm running IPA PTA before ealias, and IPA PTA does not interpret
>>>>>>       restrict, and
>>>>>> - compute_may_alias doesn't run if IPA PTA information is present
>>>>>> I needed to convince ealias to do the restrict clique/base annotation.
>>>>>>
>>>>>> It would be more logical to fit IPA PTA after ealias, but one is an
>>>>>> IPA
>>>>>> pass,
>>>>>> the other a regular one-function pass, so I would have to split the
>>>>>> containing
>>>>>> pass groups pass_all_early_optimizations and
>>>>>> pass_local_optimization_passes.
>>>>>> I'll give that a try now.
>>>>>>
>>>>
>>>> I've tried this approach, but realized that this changes the order in
>>>> which
>>>> non-openacc functions are processed in the compiler, so I've abandoned
>>>> this
>>>> idea.
>>>>
>>>>>> Any comments?
>>>>>
>>>>> I don't think you want to run IPA PTA before early
>>>>> optimizations, it (and ealias) rely on some initial cleanup to
>>>>> do anything meaningful with well-spent ressources.
>>>>>
>>>>> The local PTA "hack" also looks more like a waste of resources, but well
>>>>> ... teaching IPA PTA to honor restrict might be an impossible task
>>>>> though I didn't think much about it other than handling it only for
>>>>> nonlocal_p functions (for others we should see all incoming args
>>>>> if IPA PTA works optimally).  The restrict tags will leak all over
>>>>> the place of course and in the end no meaningful cliques may remain.
>>>>>
>>>>
>>>> This patch:
>>>> - moves the kernels pass group to the first position in the pass list
>>>>     after ealias where we're back in ipa mode
>>>> - inserts an new ipa pass to contain the gimple pass group called
>>>>     pass_oacc_ipa
>>>> - inserts a version of ipa-pta before the pass group.
>>>
>>> In principle I like this a lot, but
>>>
>>> +  NEXT_PASS (pass_ipa_pta_oacc_kernels);
>>> +  NEXT_PASS (pass_oacc_ipa);
>>> +  PUSH_INSERT_PASSES_WITHIN (pass_oacc_ipa)
>>>
>>> I think you can put pass_ipa_pta_oacc_kernels into the pass_oacc_ipa
>>> group and thus just "clone" ipa_pta?
>>
>> Done. But using a clone means using the same gate function, and that means
>> that this pass_ipa_pta instance no longer runs by default for openacc by
>> default.
>>
>> I've added enabling-by-default of fipa-pta for fopenacc in
>> default_options_optimization to fix that.
>
> Hmm, but that enables both IPA PTA passes then?

Yes. An alternative could be to:
- have 'NEXT_PASS (pass_ipa_pta, true/false /* oacc_p */)' in the pass
   list,
- declare a new flag fipa-pta-oacc, and
- use fipa-pta or fipa-pta-oacc in the gate function depending on
   oacc_p.
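
A rough sketch of that alternative, to make the idea concrete. This is a standalone model, not a patch: `Flags`, `IpaPtaPass` and the `ipa_pta_oacc` field are hypothetical, and -fipa-pta-oacc is only the flag proposed above, not an existing option.

```cpp
#include <cassert>

// Hypothetical flag state; ipa_pta_oacc models the proposed
// -fipa-pta-oacc flag.
struct Flags {
  bool ipa_pta = false;
  bool ipa_pta_oacc = false;
};

// Hypothetical model of one pass class instantiated twice in the pass
// list, with the instance-specific behaviour selected by a parameter.
class IpaPtaPass {
public:
  // Models the argument given in the pass list, e.g. true for the
  // instance that runs inside the oacc pass group.
  void set_pass_param(unsigned n, bool param)
  {
    assert(n == 0); // this sketch has exactly one parameter
    oacc_p_ = param;
  }

  // Each instance consults its own flag, so the two instances can be
  // enabled independently.
  bool gate(const Flags &f) const
  {
    return oacc_p_ ? f.ipa_pta_oacc : f.ipa_pta;
  }

private:
  bool oacc_p_ = false;
};
```

With separate flags, enabling the oacc instance by default for -fopenacc would not switch on the regular late instance as a side effect.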

> I suppose that's ok,
> and if not enabling the "late" IPA PTA you'd want to re-set
> gimple_df->ipa_pta.
>
>>> sub-passes of IPA passes can
>>> be both ipa passes and non-ipa passes.
>>
>> Right. It does mean that I need yet another pass (pass_ipa_oacc_kernels) to do
>> the IPA/non-IPA transition at pass/sub-pass boundary:
>> ...
>>    NEXT_PASS (pass_ipa_oacc);
>>    PUSH_INSERT_PASSES_WITHIN (pass_ipa_oacc)
>>        NEXT_PASS (pass_ipa_pta);
>>        NEXT_PASS (pass_ipa_oacc_kernels);
>>        PUSH_INSERT_PASSES_WITHIN (pass_ipa_oacc_kernels)
>>           /* out-of-ipa */
>>           NEXT_PASS (pass_oacc_kernels);
>>           PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
>> ...
>>
>> OK for stage3 if bootstrap and reg-test succeed?
>
> Ok.
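
For readers not familiar with pass groups, the nesting discussed above can
be modelled as a tree in which sub-passes are only visited when the gate of
the enclosing pass holds.  The sketch below is a standalone illustration,
not the pass manager: pass_model, run_model and run_oacc_tree_model are
hypothetical names, only part of the pass list is included, and every gate
other than the two shown is assumed to be always true.

```cpp
#include <functional>
#include <string>
#include <vector>

// Standalone model of a gated pass tree; not GCC's pass manager.
struct pass_model
{
  std::string name;
  std::function<bool ()> gate;
  std::vector<pass_model> subs;
};

// Visit a pass and, only if its gate holds, its sub-passes in order.
void
run_model (const pass_model &p, std::vector<std::string> &trace)
{
  if (!p.gate ())
    return;
  trace.push_back (p.name);
  for (const pass_model &s : p.subs)
    run_model (s, trace);
}

// Build a reduced version of the nesting from passes.def and run it.
// 'openacc' models the gate of pass_ipa_oacc, 'has_kernels' models the
// gate of pass_oacc_kernels.
std::vector<std::string>
run_oacc_tree_model (bool openacc, bool has_kernels)
{
  auto always = [] { return true; };
  pass_model tree {
    "pass_ipa_oacc", [=] { return openacc; },
    {
      {"pass_ipa_pta", always, {}},
      {"pass_ipa_oacc_kernels", always,
       {
         {"pass_oacc_kernels", [=] { return has_kernels; },
          {
            {"pass_ch", always, {}},
            {"pass_fre", always, {}},
          }},
       }},
    }
  };

  std::vector<std::string> trace;
  run_model (tree, trace);
  return trace;
}
```

In this model a failing outer gate skips the whole group, while a failing
gate on pass_oacc_kernels still leaves the two IPA-level passes in the
trace.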

Committed as attached, with the following changes:
- test for opt->value of OPT_fopenacc in default_options_optimization,
   to prevent fipa-pta from being switched on by default for -fno-openacc.
- fixed pta -> pta2 scan failures.
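
The intended net effect of the opts.c change can be summarized with a small
standalone model.  This is a simplified sketch, not how GCC actually
processes its command line: opt_model and effective_ipa_pta_model are
hypothetical names, and an explicit fipa-pta/fno-ipa-pta is simply assumed
to win over the default.

```cpp
#include <vector>

// Standalone model of the defaulting rule; not GCC option handling.
enum class opt_model { fopenacc, fno_openacc, fipa_pta, fno_ipa_pta };

// Return whether ipa-pta ends up enabled for the given command line.
bool
effective_ipa_pta_model (const std::vector<opt_model> &cmdline)
{
  bool openacc_mode = false;
  bool ipa_pta_set = false;  // models opts_set->x_flag_ipa_pta
  bool ipa_pta = false;      // models opts->x_flag_ipa_pta

  for (opt_model o : cmdline)
    switch (o)
      {
      case opt_model::fopenacc:
        openacc_mode = true;
        break;
      case opt_model::fno_openacc:
        // Mirrors the 'if (opt->value)' test in the patch: the negative
        // form never switches openacc_mode on.
        break;
      case opt_model::fipa_pta:
        ipa_pta_set = true;
        ipa_pta = true;
        break;
      case opt_model::fno_ipa_pta:
        ipa_pta_set = true;
        ipa_pta = false;
        break;
      }

  // Default ipa-pta to on for openacc unless the user chose explicitly.
  if (openacc_mode && !ipa_pta_set)
    ipa_pta = true;

  return ipa_pta;
}
```

So in this model -fopenacc alone enables ipa-pta, -fno-openacc alone does
not, and -fopenacc -fno-ipa-pta keeps it disabled.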

Thanks,
- Tom



[-- Attachment #2: 0001-Add-pass_oacc_ipa.patch --]
[-- Type: text/x-patch, Size: 26042 bytes --]

Add pass_oacc_ipa

2015-12-14  Tom de Vries  <tom@codesourcery.com>

	* opts.c (default_options_optimization): Set fipa-pta on by default for
	fopenacc.
	* passes.def: Move kernels pass group to pass_ipa_oacc.
	* tree-pass.h (make_pass_oacc_kernels2): Remove.
	(make_pass_ipa_oacc, make_pass_ipa_oacc_kernels): Declare.
	* tree-ssa-loop.c (pass_oacc_kernels2, make_pass_oacc_kernels2): Remove.
	(pass_ipa_oacc, pass_ipa_oacc_kernels): New pass.
	(make_pass_ipa_oacc, make_pass_ipa_oacc_kernels): New function.
	* tree-ssa-structalias.c (pass_ipa_pta::clone): New function.

	* g++.dg/ipa/devirt-37.C: Update for new fre2 pass.
	* g++.dg/ipa/devirt-40.C: Same.
	* g++.dg/tree-ssa/pr61034.C: Same.
	* gcc.dg/ipa/ipa-pta-1.c: Update for new pta1 pass.
	* gcc.dg/ipa/ipa-pta-10.c: Same.
	* gcc.dg/ipa/ipa-pta-11.c: Same.
	* gcc.dg/ipa/ipa-pta-14.c: Same.
	* gcc.dg/ipa/ipa-pta-16.c: Same.
	* gcc.dg/ipa/ipa-pta-2.c: Same.
	* gcc.dg/ipa/ipa-pta-5.c: Same.
	* gcc.dg/ipa/ipa-pta-6.c: Same.
	* gcc.dg/torture/ipa-pta-1.c: Same.
	* gcc.dg/ipa/ipa-pta-13.c: Update for new fre2 and pta1 pass.
	* gcc.dg/ipa/ipa-pta-3.c: Same.
	* gcc.dg/ipa/ipa-pta-4.c: Same.

---
 gcc/opts.c                               | 10 +++++
 gcc/passes.def                           | 41 ++++++++++-------
 gcc/testsuite/g++.dg/ipa/devirt-37.C     | 10 ++---
 gcc/testsuite/g++.dg/ipa/devirt-40.C     |  4 +-
 gcc/testsuite/g++.dg/tree-ssa/pr61034.C  | 10 ++---
 gcc/testsuite/gcc.dg/ipa/ipa-pta-1.c     | 12 ++---
 gcc/testsuite/gcc.dg/ipa/ipa-pta-10.c    |  4 +-
 gcc/testsuite/gcc.dg/ipa/ipa-pta-11.c    | 12 ++---
 gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c    | 14 +++---
 gcc/testsuite/gcc.dg/ipa/ipa-pta-14.c    |  6 +--
 gcc/testsuite/gcc.dg/ipa/ipa-pta-16.c    |  4 +-
 gcc/testsuite/gcc.dg/ipa/ipa-pta-2.c     |  4 +-
 gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c     |  8 ++--
 gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c     |  8 ++--
 gcc/testsuite/gcc.dg/ipa/ipa-pta-5.c     |  2 +-
 gcc/testsuite/gcc.dg/ipa/ipa-pta-6.c     |  4 +-
 gcc/testsuite/gcc.dg/torture/ipa-pta-1.c |  4 +-
 gcc/tree-pass.h                          |  3 +-
 gcc/tree-ssa-loop.c                      | 76 +++++++++++++++++++++++---------
 gcc/tree-ssa-structalias.c               |  2 +
 20 files changed, 146 insertions(+), 92 deletions(-)

diff --git a/gcc/opts.c b/gcc/opts.c
index 3d25f98..d46f304 100644
--- a/gcc/opts.c
+++ b/gcc/opts.c
@@ -560,6 +560,7 @@ default_options_optimization (struct gcc_options *opts,
 {
   unsigned int i;
   int opt2;
+  bool openacc_mode = false;
 
   /* Scan to see what optimization level has been specified.  That will
      determine the default value of many flags.  */
@@ -619,6 +620,11 @@ default_options_optimization (struct gcc_options *opts,
 	  opts->x_optimize_debug = 1;
 	  break;
 
+	case OPT_fopenacc:
+	  if (opt->value)
+	    openacc_mode = true;
+	  break;
+
 	default:
 	  /* Ignore other options in this prescan.  */
 	  break;
@@ -633,6 +639,10 @@ default_options_optimization (struct gcc_options *opts,
   /* -O2 param settings.  */
   opt2 = (opts->x_optimize >= 2);
 
+  if (openacc_mode
+      && !opts_set->x_flag_ipa_pta)
+    opts->x_flag_ipa_pta = true;
+
   /* Track fields in field-sensitive alias analysis.  */
   maybe_set_param_value
     (PARAM_MAX_FIELDS_FOR_FIELD_SENSITIVE,
diff --git a/gcc/passes.def b/gcc/passes.def
index 43ce3d5..96e18f1 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -88,24 +88,7 @@ along with GCC; see the file COPYING3.  If not see
 	  /* pass_build_ealias is a dummy pass that ensures that we
 	     execute TODO_rebuild_alias at this point.  */
 	  NEXT_PASS (pass_build_ealias);
-	  /* Pass group that runs when the function is an offloaded function
-	     containing oacc kernels loops.  Part 1.  */
-	  NEXT_PASS (pass_oacc_kernels);
-	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
-	      NEXT_PASS (pass_ch);
-	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_fre);
-	  /* Pass group that runs when the function is an offloaded function
-	     containing oacc kernels loops.  Part 2.  */
-	  NEXT_PASS (pass_oacc_kernels2);
-	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
-	      /* We use pass_lim to rewrite in-memory iteration and reduction
-		 variable accesses in loops into local variables accesses.  */
-	      NEXT_PASS (pass_lim);
-	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
-	      NEXT_PASS (pass_dce);
-	      NEXT_PASS (pass_expand_omp_ssa);
-	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_merge_phi);
           NEXT_PASS (pass_dse);
 	  NEXT_PASS (pass_cd_dce);
@@ -124,6 +107,30 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_rebuild_cgraph_edges);
       NEXT_PASS (pass_inline_parameters);
   POP_INSERT_PASSES ()
+
+  NEXT_PASS (pass_ipa_oacc);
+  PUSH_INSERT_PASSES_WITHIN (pass_ipa_oacc)
+      NEXT_PASS (pass_ipa_pta);
+      /* Pass group that runs when the function is an offloaded function
+	 containing oacc kernels loops.	 */
+      NEXT_PASS (pass_ipa_oacc_kernels);
+      PUSH_INSERT_PASSES_WITHIN (pass_ipa_oacc_kernels)
+	  NEXT_PASS (pass_oacc_kernels);
+	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
+	      NEXT_PASS (pass_ch);
+	      NEXT_PASS (pass_fre);
+	      /* We use pass_lim to rewrite in-memory iteration and reduction
+		 variable accesses in loops into local variables accesses.  */
+	      NEXT_PASS (pass_lim);
+	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+	      NEXT_PASS (pass_dce);
+	      /* pass_parallelize_loops_oacc_kernels */
+	      NEXT_PASS (pass_expand_omp_ssa);
+	      NEXT_PASS (pass_rebuild_cgraph_edges);
+	  POP_INSERT_PASSES ()
+      POP_INSERT_PASSES ()
+  POP_INSERT_PASSES ()
+
   NEXT_PASS (pass_ipa_chkp_produce_thunks);
   NEXT_PASS (pass_ipa_auto_profile);
   NEXT_PASS (pass_ipa_free_inline_summary);
diff --git a/gcc/testsuite/g++.dg/ipa/devirt-37.C b/gcc/testsuite/g++.dg/ipa/devirt-37.C
index 9c5287e..b7f52a0 100644
--- a/gcc/testsuite/g++.dg/ipa/devirt-37.C
+++ b/gcc/testsuite/g++.dg/ipa/devirt-37.C
@@ -1,4 +1,4 @@
-/* { dg-options "-fpermissive -O2 -fno-indirect-inlining -fno-devirtualize-speculatively -fdump-tree-fre2-details -fno-early-inlining"  } */
+/* { dg-options "-fpermissive -O2 -fno-indirect-inlining -fno-devirtualize-speculatively -fdump-tree-fre3-details -fno-early-inlining"  } */
 #include <stdlib.h>
 struct A {virtual void test() {abort ();}};
 struct B:A
@@ -30,7 +30,7 @@ t()
 /* After inlining the call within constructor needs to be checked to not go into a basetype.
    We should see the vtbl store and we should notice extcall as possibly clobbering the
    type but ignore it because b is in static storage.  */
-/* { dg-final { scan-tree-dump "No dynamic type change found."  "fre2"  } } */
-/* { dg-final { scan-tree-dump "Checking vtbl store:"  "fre2"  } } */
-/* { dg-final { scan-tree-dump "Function call may change dynamic type:extcall"  "fre2"  } } */
-/* { dg-final { scan-tree-dump "converting indirect call to function virtual void"  "fre2"  } } */
+/* { dg-final { scan-tree-dump "No dynamic type change found."  "fre3"  } } */
+/* { dg-final { scan-tree-dump "Checking vtbl store:"  "fre3"  } } */
+/* { dg-final { scan-tree-dump "Function call may change dynamic type:extcall"  "fre3"  } } */
+/* { dg-final { scan-tree-dump "converting indirect call to function virtual void"  "fre3"  } } */
diff --git a/gcc/testsuite/g++.dg/ipa/devirt-40.C b/gcc/testsuite/g++.dg/ipa/devirt-40.C
index 279a228..5107c29 100644
--- a/gcc/testsuite/g++.dg/ipa/devirt-40.C
+++ b/gcc/testsuite/g++.dg/ipa/devirt-40.C
@@ -1,4 +1,4 @@
-/* { dg-options "-O2 -fdump-tree-fre2-details"  } */
+/* { dg-options "-O2 -fdump-tree-fre3-details"  } */
 typedef enum
 {
 } UErrorCode;
@@ -19,4 +19,4 @@ A::m_fn1 (UnicodeString &, int &p2, UErrorCode &) const
   UnicodeString a[2];
 }
 
-/* { dg-final { scan-tree-dump-not "\\n  OBJ_TYPE_REF" "fre2"  } } */
+/* { dg-final { scan-tree-dump-not "\\n  OBJ_TYPE_REF" "fre3"  } } */
diff --git a/gcc/testsuite/g++.dg/tree-ssa/pr61034.C b/gcc/testsuite/g++.dg/tree-ssa/pr61034.C
index cd4ee05..c06c580 100644
--- a/gcc/testsuite/g++.dg/tree-ssa/pr61034.C
+++ b/gcc/testsuite/g++.dg/tree-ssa/pr61034.C
@@ -1,5 +1,5 @@
 // { dg-do compile }
-// { dg-options "-O2 -fdump-tree-fre2 -fdump-tree-optimized" }
+// { dg-options "-O2 -fdump-tree-fre3 -fdump-tree-optimized" }
 
 #define assume(x) if(!(x))__builtin_unreachable()
 
@@ -42,13 +42,13 @@ bool f(I a, I b, I c, I d) {
 // a bunch of conditional free()s and unreachable()s.
 // This works only if everything is inlined into 'f'.
 
-// { dg-final { scan-tree-dump-times ";; Function" 1 "fre2" } }
-// { dg-final { scan-tree-dump-times "unreachable" 11 "fre2" } }
+// { dg-final { scan-tree-dump-times ";; Function" 1 "fre3" } }
+// { dg-final { scan-tree-dump-times "unreachable" 11 "fre3" } }
 
 // Note that depending on PUSH_ARGS_REVERSED we are presented with
 // a different initial CFG and thus the final outcome is different
 
-// { dg-final { scan-tree-dump-times "free" 10 "fre2" { target x86_64-*-* i?86-*-* } } }
+// { dg-final { scan-tree-dump-times "free" 10 "fre3" { target x86_64-*-* i?86-*-* } } }
 // { dg-final { scan-tree-dump-times "free" 3 "optimized" { target x86_64-*-* i?86-*-* } } }
-// { dg-final { scan-tree-dump-times "free" 14 "fre2" { target aarch64-*-* ia64-*-* arm-*-* hppa*-*-* sparc*-*-* powerpc*-*-* alpha*-*-* } } }
+// { dg-final { scan-tree-dump-times "free" 14 "fre3" { target aarch64-*-* ia64-*-* arm-*-* hppa*-*-* sparc*-*-* powerpc*-*-* alpha*-*-* } } }
 // { dg-final { scan-tree-dump-times "free" 4 "optimized" { target aarch64-*-* ia64-*-* arm-*-* hppa*-*-* sparc*-*-* powerpc*-*-* alpha*-*-* } } }
diff --git a/gcc/testsuite/gcc.dg/ipa/ipa-pta-1.c b/gcc/testsuite/gcc.dg/ipa/ipa-pta-1.c
index c183fcb..bc631f8 100644
--- a/gcc/testsuite/gcc.dg/ipa/ipa-pta-1.c
+++ b/gcc/testsuite/gcc.dg/ipa/ipa-pta-1.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O -fipa-pta -fdump-ipa-pta-details" } */
+/* { dg-options "-O -fipa-pta -fdump-ipa-pta2-details" } */
 
 static int __attribute__((noinline))
 foo (int *p, int *q)
@@ -45,8 +45,8 @@ int main()
    not seen by IPA PTA (if the address escapes the unit which we only compute
    during IPA PTA...).  Thus the solution also includes NONLOCAL.  */
 
-/* { dg-final { scan-ipa-dump "fn_1 = { bar foo }" "pta" } } */
-/* { dg-final { scan-ipa-dump "bar.arg0 = { NONLOCAL a }" "pta" } } */
-/* { dg-final { scan-ipa-dump "bar.arg1 = { NONLOCAL a }" "pta" } } */
-/* { dg-final { scan-ipa-dump "foo.arg0 = { NONLOCAL a }" "pta" } } */
-/* { dg-final { scan-ipa-dump "foo.arg1 = { NONLOCAL a }" "pta" } } */
+/* { dg-final { scan-ipa-dump "fn_1 = { bar foo }" "pta2" } } */
+/* { dg-final { scan-ipa-dump "bar.arg0 = { NONLOCAL a }" "pta2" } } */
+/* { dg-final { scan-ipa-dump "bar.arg1 = { NONLOCAL a }" "pta2" } } */
+/* { dg-final { scan-ipa-dump "foo.arg0 = { NONLOCAL a }" "pta2" } } */
+/* { dg-final { scan-ipa-dump "foo.arg1 = { NONLOCAL a }" "pta2" } } */
diff --git a/gcc/testsuite/gcc.dg/ipa/ipa-pta-10.c b/gcc/testsuite/gcc.dg/ipa/ipa-pta-10.c
index 0a6c166..90b7bf8 100644
--- a/gcc/testsuite/gcc.dg/ipa/ipa-pta-10.c
+++ b/gcc/testsuite/gcc.dg/ipa/ipa-pta-10.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details" } */
+/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta2-details" } */
 
 #include <stdarg.h>
 
@@ -26,4 +26,4 @@ int main()
 /* Verify we properly handle variadic arguments and do not let escape
    stuff through it.  */
 
-/* { dg-final { scan-ipa-dump "ESCAPED = { (ESCAPED )?(NONLOCAL )?}" "pta" } } */
+/* { dg-final { scan-ipa-dump "ESCAPED = { (ESCAPED )?(NONLOCAL )?}" "pta2" } } */
diff --git a/gcc/testsuite/gcc.dg/ipa/ipa-pta-11.c b/gcc/testsuite/gcc.dg/ipa/ipa-pta-11.c
index 84dd254..9857d7b 100644
--- a/gcc/testsuite/gcc.dg/ipa/ipa-pta-11.c
+++ b/gcc/testsuite/gcc.dg/ipa/ipa-pta-11.c
@@ -1,25 +1,25 @@
 /* { dg-do link } */
-/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details" } */
+/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta2-details" } */
 
 static int i;
 /* i should not escape here, p should point to i only.  */
-/* { dg-final { scan-ipa-dump "p = { i }" "pta" } } */
+/* { dg-final { scan-ipa-dump "p = { i }" "pta2" } } */
 static int *p = &i;
 
 int j;
 /* q should point to j only.  */
-/* { dg-final { scan-ipa-dump "q = { j }" "pta" } } */
+/* { dg-final { scan-ipa-dump "q = { j }" "pta2" } } */
 static int *q = &j;
 
 static int k;
 /* k should escape here, r should point to NONLOCAL, ESCAPED, k.  */
 int *r = &k;
-/* { dg-final { scan-ipa-dump "r = { ESCAPED NONLOCAL k }" "pta" } } */
+/* { dg-final { scan-ipa-dump "r = { ESCAPED NONLOCAL k }" "pta2" } } */
 
 int l;
 /* s should point to NONLOCAL, ESCAPED, l.  */
 int *s = &l;
-/* { dg-final { scan-ipa-dump "s = { ESCAPED NONLOCAL l }" "pta" } } */
+/* { dg-final { scan-ipa-dump "s = { ESCAPED NONLOCAL l }" "pta2" } } */
 
 /* Make p and q referenced so they do not get optimized out.  */
 int foo() { return &p < &q; }
@@ -32,4 +32,4 @@ int main()
 /* It isn't clear if the escape if l is strictly necessary, if it were
    we should have i, r and s in ESCAPED as well.  */
 
-/* { dg-final { scan-ipa-dump "ESCAPED = { ESCAPED NONLOCAL l k }" "pta" } } */
+/* { dg-final { scan-ipa-dump "ESCAPED = { ESCAPED NONLOCAL l k }" "pta2" } } */
diff --git a/gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c b/gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c
index f558df3..93dd871 100644
--- a/gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c
+++ b/gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c
@@ -1,5 +1,5 @@
 /* { dg-do link } */
-/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre2 -fno-ipa-icf" } */
+/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta2-details -fdump-tree-fre3 -fno-ipa-icf" } */
 
 static int x, y;
 
@@ -19,7 +19,7 @@ void *anyfn_global;
 
 /* Even though not referenced in this TU we should have added constraints
    for the initializer.  */
-/* { dg-final { scan-ipa-dump "ex = &local_address_taken" "pta" } } */
+/* { dg-final { scan-ipa-dump "ex = &local_address_taken" "pta2" } } */
 void (*ex)(int *) = local_address_taken;
 
 extern void link_error (void);
@@ -38,11 +38,11 @@ int main()
      uses to be messed up even further.  */
   /* ???  As we don't expand the ESCAPED solution we either get x printed here
      or not based on the phase of the moon.  */
-  /* { dg-final { scan-ipa-dump "local_address_taken.arg0 = { ESCAPED NONLOCAL y x }" "pta" { xfail *-*-* } } } */
-  /* { dg-final { scan-ipa-dump "local_address_taken.clobber = { ESCAPED NONLOCAL y x }" "pta" { xfail *-*-* } } } */
-  /* { dg-final { scan-ipa-dump "local_address_taken.use = { }" "pta" { xfail *-*-* } } } */
+  /* { dg-final { scan-ipa-dump "local_address_taken.arg0 = { ESCAPED NONLOCAL y x }" "pta2" { xfail *-*-* } } } */
+  /* { dg-final { scan-ipa-dump "local_address_taken.clobber = { ESCAPED NONLOCAL y x }" "pta2" { xfail *-*-* } } } */
+  /* { dg-final { scan-ipa-dump "local_address_taken.use = { }" "pta2" { xfail *-*-* } } } */
   /* ??? But make sure x really escaped.  */
-  /* { dg-final { scan-ipa-dump "ESCAPED = {\[^\n\}\]* x \[^\n\}\]*}" "pta" } } */
+  /* { dg-final { scan-ipa-dump "ESCAPED = {\[^\n\}\]* x \[^\n\}\]*}" "pta2" } } */
   (*anyfn) (&x);
   x = 0;
   local (&y);
@@ -54,7 +54,7 @@ int main()
   local_address_taken (&y);
   /* As we are computing flow- and context-insensitive we may not
      CSE the load of x here.  */
-  /* { dg-final { scan-tree-dump " = x;" "fre2" } } */
+  /* { dg-final { scan-tree-dump " = x;" "fre3" } } */
   return x;
 }
 
diff --git a/gcc/testsuite/gcc.dg/ipa/ipa-pta-14.c b/gcc/testsuite/gcc.dg/ipa/ipa-pta-14.c
index e3333fa..cc2b940 100644
--- a/gcc/testsuite/gcc.dg/ipa/ipa-pta-14.c
+++ b/gcc/testsuite/gcc.dg/ipa/ipa-pta-14.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O2 -fipa-pta -fno-tree-fre -fno-tree-sra -fdump-ipa-pta-details -fdelete-null-pointer-checks" } */
+/* { dg-options "-O2 -fipa-pta -fno-tree-fre -fno-tree-sra -fdump-ipa-pta2-details -fdelete-null-pointer-checks" } */
 
 struct X {
     int i;
@@ -21,8 +21,8 @@ int main()
   void *p;
   a.p = (void *)&c;
   p = foo(&a, &a);
-  /* { dg-final { scan-ipa-dump "foo.result = { NULL a\[^ \]* c\[^ \]* }" "pta" { target { ! keeps_null_pointer_checks } } } } */
-  /* { dg-final { scan-ipa-dump "foo.result = { NONLOCAL a\[^ \]* c\[^ \]* }" "pta" { target { keeps_null_pointer_checks } } } } */
+  /* { dg-final { scan-ipa-dump "foo.result = { NULL a\[^ \]* c\[^ \]* }" "pta2" { target { ! keeps_null_pointer_checks } } } } */
+  /* { dg-final { scan-ipa-dump "foo.result = { NONLOCAL a\[^ \]* c\[^ \]* }" "pta2" { target { keeps_null_pointer_checks } } } } */
   ((struct X *)p)->p = (void *)0;
   if (a.p != (void *)0)
     abort ();
diff --git a/gcc/testsuite/gcc.dg/ipa/ipa-pta-16.c b/gcc/testsuite/gcc.dg/ipa/ipa-pta-16.c
index 5bd6596..83b9cd8 100644
--- a/gcc/testsuite/gcc.dg/ipa/ipa-pta-16.c
+++ b/gcc/testsuite/gcc.dg/ipa/ipa-pta-16.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O2 -fno-tree-sra -fipa-pta -fdump-ipa-pta" } */
+/* { dg-options "-O2 -fno-tree-sra -fipa-pta -fdump-ipa-pta2" } */
 
 struct X
 {
@@ -29,4 +29,4 @@ int main()
   return 0;
 }
 
-/* { dg-final { scan-ipa-dump "y.\[0-9\]*\\\+\[0-9\]* = { i }" "pta" } } */
+/* { dg-final { scan-ipa-dump "y.\[0-9\]*\\\+\[0-9\]* = { i }" "pta2" } } */
diff --git a/gcc/testsuite/gcc.dg/ipa/ipa-pta-2.c b/gcc/testsuite/gcc.dg/ipa/ipa-pta-2.c
index b77864d..0cf2adf 100644
--- a/gcc/testsuite/gcc.dg/ipa/ipa-pta-2.c
+++ b/gcc/testsuite/gcc.dg/ipa/ipa-pta-2.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fipa-pta -fdump-ipa-pta-details" } */
+/* { dg-options "-O -fipa-pta -fdump-ipa-pta2-details" } */
 
 int (*fn)(int *);
 
@@ -21,4 +21,4 @@ int main()
 /* Make sure that when a local function escapes its argument points-to sets
    are properly adjusted.  */
 
-/* { dg-final { scan-ipa-dump "foo.arg0 = { ESCAPED NONLOCAL }" "pta" } } */
+/* { dg-final { scan-ipa-dump "foo.arg0 = { ESCAPED NONLOCAL }" "pta2" } } */
diff --git a/gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c b/gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c
index ff6fa57..68c2144 100644
--- a/gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c
+++ b/gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre2-details" } */
+/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta2-details -fdump-tree-fre3-details" } */
 
 static int __attribute__((noinline,noclone))
 foo (int *p, int *q)
@@ -21,6 +21,6 @@ int main()
 
 /* Verify we can disambiguate *p and *q in foo.  */
 
-/* { dg-final { scan-ipa-dump "foo.arg0 = &a" "pta" } } */
-/* { dg-final { scan-ipa-dump "foo.arg1 = &b" "pta" } } */
-/* { dg-final { scan-tree-dump "Replaced \\\*p_2\\\(D\\\) with 1" "fre2" } } */
+/* { dg-final { scan-ipa-dump "foo.arg0 = &a" "pta2" } } */
+/* { dg-final { scan-ipa-dump "foo.arg1 = &b" "pta2" } } */
+/* { dg-final { scan-tree-dump "Replaced \\\*p_2\\\(D\\\) with 1" "fre3" } } */
diff --git a/gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c b/gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c
index 106e325..2fc8ada 100644
--- a/gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c
+++ b/gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre2-details" } */
+/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta2-details -fdump-tree-fre3-details" } */
 
 int a, b;
 
@@ -26,6 +26,6 @@ int main()
 
 /* Verify we can disambiguate *p and *q in foo.  */
 
-/* { dg-final { scan-ipa-dump "foo.arg0 = &a" "pta" } } */
-/* { dg-final { scan-ipa-dump "foo.arg1 = &b" "pta" } } */
-/* { dg-final { scan-tree-dump "Replaced \\\*p_2\\\(D\\\) with 1" "fre2" } } */
+/* { dg-final { scan-ipa-dump "foo.arg0 = &a" "pta2" } } */
+/* { dg-final { scan-ipa-dump "foo.arg1 = &b" "pta2" } } */
+/* { dg-final { scan-tree-dump "Replaced \\\*p_2\\\(D\\\) with 1" "fre3" } } */
diff --git a/gcc/testsuite/gcc.dg/ipa/ipa-pta-5.c b/gcc/testsuite/gcc.dg/ipa/ipa-pta-5.c
index 625291b..ec12979 100644
--- a/gcc/testsuite/gcc.dg/ipa/ipa-pta-5.c
+++ b/gcc/testsuite/gcc.dg/ipa/ipa-pta-5.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details" } */
+/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta2-details" } */
 
 int **x;
 
diff --git a/gcc/testsuite/gcc.dg/ipa/ipa-pta-6.c b/gcc/testsuite/gcc.dg/ipa/ipa-pta-6.c
index c1c8245..8fd5a43 100644
--- a/gcc/testsuite/gcc.dg/ipa/ipa-pta-6.c
+++ b/gcc/testsuite/gcc.dg/ipa/ipa-pta-6.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O -fipa-pta -fdump-ipa-pta-details" } */
+/* { dg-options "-O -fipa-pta -fdump-ipa-pta2-details" } */
 
 static void __attribute__((noinline,noclone))
 foo (int *p)
@@ -21,4 +21,4 @@ int main()
 /* Verify we correctly compute the units ESCAPED set as empty but
    still properly account for the store via *p in foo.  */
 
-/* { dg-final { scan-ipa-dump "ESCAPED = { }" "pta" } } */
+/* { dg-final { scan-ipa-dump "ESCAPED = { }" "pta2" } } */
diff --git a/gcc/testsuite/gcc.dg/torture/ipa-pta-1.c b/gcc/testsuite/gcc.dg/torture/ipa-pta-1.c
index c31d408..1bf4997 100644
--- a/gcc/testsuite/gcc.dg/torture/ipa-pta-1.c
+++ b/gcc/testsuite/gcc.dg/torture/ipa-pta-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile { target { nonpic } } } */
-/* { dg-options "-fipa-pta -fdump-ipa-pta -fno-ipa-icf" } */
+/* { dg-options "-fipa-pta -fdump-ipa-pta2 -fno-ipa-icf" } */
 /* { dg-skip-if "" { *-*-* } { "-O0" "-fno-fat-lto-objects" } { "" } } */
 
 struct X { char x; char y; };
@@ -42,4 +42,4 @@ void test4 (int a4, char b, char c, char d, char e, char f, char g, char h)
   bar (p);
 }
 
-/* { dg-final { scan-ipa-dump "bar.arg0 = { test4.arg0 test3.arg0 test2.arg0 test1.arg0 }" "pta" } } */
+/* { dg-final { scan-ipa-dump "bar.arg0 = { test4.arg0 test3.arg0 test2.arg0 test1.arg0 }" "pta2" } } */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index e1cbce9..dcdbdfd 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -468,7 +468,8 @@ extern gimple_opt_pass *make_pass_vtable_verify (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ubsan (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_sanopt (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_oacc_kernels (gcc::context *ctxt);
-extern gimple_opt_pass *make_pass_oacc_kernels2 (gcc::context *ctxt);
+extern simple_ipa_opt_pass *make_pass_ipa_oacc (gcc::context *ctxt);
+extern simple_ipa_opt_pass *make_pass_ipa_oacc_kernels (gcc::context *ctxt);
 
 /* IPA Passes */
 extern simple_ipa_opt_pass *make_pass_ipa_lower_emutls (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index cf7d94e..1fe2716 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -36,6 +36,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-scalar-evolution.h"
 #include "tree-vectorizer.h"
 #include "omp-low.h"
+#include "diagnostic-core.h"
 
 
 /* A pass making sure loops are fixed up.  */
@@ -206,12 +207,14 @@ make_pass_oacc_kernels (gcc::context *ctxt)
   return new pass_oacc_kernels (ctxt);
 }
 
+/* The ipa oacc superpass.  */
+
 namespace {
 
-const pass_data pass_data_oacc_kernels2 =
+const pass_data pass_data_ipa_oacc =
 {
-  GIMPLE_PASS, /* type */
-  "oacc_kernels2", /* name */
+  SIMPLE_IPA_PASS, /* type */
+  "ipa_oacc", /* name */
   OPTGROUP_LOOP, /* optinfo_flags */
   TV_TREE_LOOP, /* tv_id */
   PROP_cfg, /* properties_required */
@@ -221,34 +224,65 @@ const pass_data pass_data_oacc_kernels2 =
   0, /* todo_flags_finish */
 };
 
-class pass_oacc_kernels2 : public gimple_opt_pass
+class pass_ipa_oacc : public simple_ipa_opt_pass
 {
 public:
-  pass_oacc_kernels2 (gcc::context *ctxt)
-    : gimple_opt_pass (pass_data_oacc_kernels2, ctxt)
+  pass_ipa_oacc (gcc::context *ctxt)
+    : simple_ipa_opt_pass (pass_data_ipa_oacc, ctxt)
   {}
 
   /* opt_pass methods: */
-  virtual bool gate (function *fn) { return gate_oacc_kernels (fn); }
-  virtual unsigned int execute (function *fn)
-    {
-      /* Rather than having a copy of the previous dump, get some use out of
-	 this dump, and try to minimize differences with the following pass
-	 (pass_lim), which will initizalize the loop optimizer with
-	 LOOPS_NORMAL.  */
-      loop_optimizer_init (LOOPS_NORMAL);
-      loop_optimizer_finalize (fn);
-      return 0;
-    }
+  virtual bool gate (function *)
+  {
+    return (optimize
+	    /* Don't bother doing anything if the program has errors.  */
+	    && !seen_error ()
+	    && flag_openacc
+	    && flag_tree_parallelize_loops > 1);
+  }
 
-}; // class pass_oacc_kernels2
+}; // class pass_ipa_oacc
 
 } // anon namespace
 
-gimple_opt_pass *
-make_pass_oacc_kernels2 (gcc::context *ctxt)
+simple_ipa_opt_pass *
+make_pass_ipa_oacc (gcc::context *ctxt)
+{
+  return new pass_ipa_oacc (ctxt);
+}
+
+/* The ipa oacc kernels pass.  */
+
+namespace {
+
+const pass_data pass_data_ipa_oacc_kernels =
+{
+  SIMPLE_IPA_PASS, /* type */
+  "ipa_oacc_kernels", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_TREE_LOOP, /* tv_id */
+  PROP_cfg, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_ipa_oacc_kernels : public simple_ipa_opt_pass
+{
+public:
+  pass_ipa_oacc_kernels (gcc::context *ctxt)
+    : simple_ipa_opt_pass (pass_data_ipa_oacc_kernels, ctxt)
+  {}
+
+}; // class pass_ipa_oacc_kernels
+
+} // anon namespace
+
+simple_ipa_opt_pass *
+make_pass_ipa_oacc_kernels (gcc::context *ctxt)
 {
-  return new pass_oacc_kernels2 (ctxt);
+  return new pass_ipa_oacc_kernels (ctxt);
 }
 
 /* The no-loop superpass.  */
diff --git a/gcc/tree-ssa-structalias.c b/gcc/tree-ssa-structalias.c
index 7420ce1..53264c3 100644
--- a/gcc/tree-ssa-structalias.c
+++ b/gcc/tree-ssa-structalias.c
@@ -7805,6 +7805,8 @@ public:
 	      && !seen_error ());
     }
 
+  opt_pass * clone () { return new pass_ipa_pta (m_ctxt); }
+
   virtual unsigned int execute (function *) { return ipa_pta_execute (); }
 
 }; // class pass_ipa_pta


* [gomp4] Re: [PATCH, 4/16] Implement -foffload-alias
  2015-12-16 14:43                             ` Tom de Vries
@ 2015-12-17 12:03                               ` Thomas Schwinge
  0 siblings, 0 replies; 133+ messages in thread
From: Thomas Schwinge @ 2015-12-17 12:03 UTC (permalink / raw)
  To: Tom de Vries, gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 31995 bytes --]

Hi!

On Wed, 16 Dec 2015 15:42:55 +0100, Tom de Vries <Tom_deVries@mentor.com> wrote:
> On 16/12/15 14:16, Richard Biener wrote:
> > On Mon, 14 Dec 2015, Tom de Vries wrote:
> >
> >> On 14/12/15 14:26, Richard Biener wrote:
> >>> On Sun, 13 Dec 2015, Tom de Vries wrote:
> >>>> This patch:
> >>>> - moves the kernels pass group to the first position in the pass list
> >>>>     after ealias where we're back in ipa mode
> >>>> - inserts a new ipa pass to contain the gimple pass group called
> >>>>     pass_oacc_ipa
> >>>> - inserts a version of ipa-pta before the pass group.

> Committed as attached [...]

> --- a/gcc/passes.def
> +++ b/gcc/passes.def
> @@ -88,24 +88,7 @@ along with GCC; see the file COPYING3.  If not see
>  	  /* pass_build_ealias is a dummy pass that ensures that we
>  	     execute TODO_rebuild_alias at this point.  */
>  	  NEXT_PASS (pass_build_ealias);
> -	  /* Pass group that runs when the function is an offloaded function
> -	     containing oacc kernels loops.  Part 1.  */
> -	  NEXT_PASS (pass_oacc_kernels);
> -	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
> -	      NEXT_PASS (pass_ch);
> -	  POP_INSERT_PASSES ()
>  	  NEXT_PASS (pass_fre);
> -	  /* Pass group that runs when the function is an offloaded function
> -	     containing oacc kernels loops.  Part 2.  */
> -	  NEXT_PASS (pass_oacc_kernels2);
> -	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
> -	      /* We use pass_lim to rewrite in-memory iteration and reduction
> -		 variable accesses in loops into local variables accesses.  */
> -	      NEXT_PASS (pass_lim);
> -	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
> -	      NEXT_PASS (pass_dce);
> -	      NEXT_PASS (pass_expand_omp_ssa);
> -	  POP_INSERT_PASSES ()
>  	  NEXT_PASS (pass_merge_phi);
>            NEXT_PASS (pass_dse);
>  	  NEXT_PASS (pass_cd_dce);
> @@ -124,6 +107,30 @@ along with GCC; see the file COPYING3.  If not see
>        NEXT_PASS (pass_rebuild_cgraph_edges);
>        NEXT_PASS (pass_inline_parameters);
>    POP_INSERT_PASSES ()
> +
> +  NEXT_PASS (pass_ipa_oacc);
> +  PUSH_INSERT_PASSES_WITHIN (pass_ipa_oacc)
> +      NEXT_PASS (pass_ipa_pta);
> +      /* Pass group that runs when the function is an offloaded function
> +	 containing oacc kernels loops.	 */
> +      NEXT_PASS (pass_ipa_oacc_kernels);
> +      PUSH_INSERT_PASSES_WITHIN (pass_ipa_oacc_kernels)
> +	  NEXT_PASS (pass_oacc_kernels);
> +	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
> +	      NEXT_PASS (pass_ch);
> +	      NEXT_PASS (pass_fre);
> +	      /* We use pass_lim to rewrite in-memory iteration and reduction
> +		 variable accesses in loops into local variables accesses.  */
> +	      NEXT_PASS (pass_lim);
> +	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
> +	      NEXT_PASS (pass_dce);
> +	      /* pass_parallelize_loops_oacc_kernels */
> +	      NEXT_PASS (pass_expand_omp_ssa);
> +	      NEXT_PASS (pass_rebuild_cgraph_edges);
> +	  POP_INSERT_PASSES ()
> +      POP_INSERT_PASSES ()
> +  POP_INSERT_PASSES ()
> +
>    NEXT_PASS (pass_ipa_chkp_produce_thunks);
>    NEXT_PASS (pass_ipa_auto_profile);
>    NEXT_PASS (pass_ipa_free_inline_summary);

Merging this patch into gomp-4_0-branch, as before --
<http://news.gmane.org/find-root.php?message_id=%3C87io4jo0ei.fsf%40kepler.schwinge.homeip.net%3E>
-- I again did not clean up but instead just copied the existing OpenACC
kernels pass structure verbatim, so on gomp-4_0-branch we currently have
these additional passes wired up:

           PUSH_INSERT_PASSES_WITHIN (pass_ipa_oacc_kernels)
              NEXT_PASS (pass_oacc_kernels);
              PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
    +             NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
                  NEXT_PASS (pass_ch);
    +             NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
                  NEXT_PASS (pass_fre);
                  /* We use pass_lim to rewrite in-memory iteration and reduction
                     variable accesses in loops into local variables accesses.  */
    +             NEXT_PASS (pass_tree_loop_init);
                  NEXT_PASS (pass_lim);
    +             NEXT_PASS (pass_copy_prop);
    +             NEXT_PASS (pass_lim);
    +             NEXT_PASS (pass_copy_prop);
    +             NEXT_PASS (pass_scev_cprop);
    +             NEXT_PASS (pass_tree_loop_done);
                  NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
                  NEXT_PASS (pass_dce);
    -             /* pass_parallelize_loops_oacc_kernels */
    +             NEXT_PASS (pass_tree_loop_init);
    +             NEXT_PASS (pass_parallelize_loops_oacc_kernels);
                  NEXT_PASS (pass_expand_omp_ssa);
    +             NEXT_PASS (pass_tree_loop_done);
                  NEXT_PASS (pass_rebuild_cgraph_edges);
              POP_INSERT_PASSES ()
           POP_INSERT_PASSES ()

That is, in r231753 I effectively merged trunk r231690 into
gomp-4_0-branch as follows:

commit 6a16087c1ab71a9539f538b90a2d98cecab90e82
Merge: 49a5b61 f71e6ee
Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Thu Dec 17 11:43:00 2015 +0000

    svn merge -r 231689:231690 svn+ssh://gcc.gnu.org/svn/gcc/trunk
    
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@231753 138bc75d-0d04-0410-961f-82ee72b054a4

 gcc/ChangeLog                            | 12 +++++
 gcc/opts.c                               | 10 ++++
 gcc/passes.def                           | 62 +++++++++++++------------
 gcc/testsuite/ChangeLog                  | 18 ++++++++
 gcc/testsuite/g++.dg/ipa/devirt-37.C     | 10 ++--
 gcc/testsuite/g++.dg/ipa/devirt-40.C     |  4 +-
 gcc/testsuite/g++.dg/tree-ssa/pr61034.C  | 10 ++--
 gcc/testsuite/gcc.dg/ipa/ipa-pta-1.c     | 12 ++---
 gcc/testsuite/gcc.dg/ipa/ipa-pta-10.c    |  4 +-
 gcc/testsuite/gcc.dg/ipa/ipa-pta-11.c    | 12 ++---
 gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c    | 14 +++---
 gcc/testsuite/gcc.dg/ipa/ipa-pta-14.c    |  6 +--
 gcc/testsuite/gcc.dg/ipa/ipa-pta-16.c    |  4 +-
 gcc/testsuite/gcc.dg/ipa/ipa-pta-2.c     |  4 +-
 gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c     |  8 ++--
 gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c     |  8 ++--
 gcc/testsuite/gcc.dg/ipa/ipa-pta-5.c     |  2 +-
 gcc/testsuite/gcc.dg/ipa/ipa-pta-6.c     |  4 +-
 gcc/testsuite/gcc.dg/torture/ipa-pta-1.c |  4 +-
 gcc/tree-pass.h                          |  3 +-
 gcc/tree-ssa-loop.c                      | 78 +++++++++++++++++++++++---------
 gcc/tree-ssa-structalias.c               |  2 +
 22 files changed, 187 insertions(+), 104 deletions(-)

[diff --git gcc/ChangeLog gcc/ChangeLog]
diff --git gcc/opts.c gcc/opts.c
index 3d25f98..d46f304 100644
--- gcc/opts.c
+++ gcc/opts.c
@@ -560,6 +560,7 @@ default_options_optimization (struct gcc_options *opts,
 {
   unsigned int i;
   int opt2;
+  bool openacc_mode = false;
 
   /* Scan to see what optimization level has been specified.  That will
      determine the default value of many flags.  */
@@ -619,6 +620,11 @@ default_options_optimization (struct gcc_options *opts,
 	  opts->x_optimize_debug = 1;
 	  break;
 
+	case OPT_fopenacc:
+	  if (opt->value)
+	    openacc_mode = true;
+	  break;
+
 	default:
 	  /* Ignore other options in this prescan.  */
 	  break;
@@ -633,6 +639,10 @@ default_options_optimization (struct gcc_options *opts,
   /* -O2 param settings.  */
   opt2 = (opts->x_optimize >= 2);
 
+  if (openacc_mode
+      && !opts_set->x_flag_ipa_pta)
+    opts->x_flag_ipa_pta = true;
+
   /* Track fields in field-sensitive alias analysis.  */
   maybe_set_param_value
     (PARAM_MAX_FIELDS_FOR_FIELD_SENSITIVE,
diff --git gcc/passes.def gcc/passes.def
index 9b8fe57..bf7b15e 100644
--- gcc/passes.def
+++ gcc/passes.def
@@ -88,35 +88,7 @@ along with GCC; see the file COPYING3.  If not see
 	  /* pass_build_ealias is a dummy pass that ensures that we
 	     execute TODO_rebuild_alias at this point.  */
 	  NEXT_PASS (pass_build_ealias);
-	  /* Pass group that runs when the function is an offloaded function
-	     containing oacc kernels loops.  Part 1.  */
-	  NEXT_PASS (pass_oacc_kernels);
-	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
-	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
-	      NEXT_PASS (pass_ch);
-	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
-	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_fre);
-	  /* Pass group that runs when the function is an offloaded function
-	     containing oacc kernels loops.  Part 2.  */
-	  NEXT_PASS (pass_oacc_kernels2);
-	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
-	      /* We use pass_lim to rewrite in-memory iteration and reduction
-		 variable accesses in loops into local variables accesses.  */
-	      NEXT_PASS (pass_tree_loop_init);
-	      NEXT_PASS (pass_lim);
-	      NEXT_PASS (pass_copy_prop);
-	      NEXT_PASS (pass_lim);
-	      NEXT_PASS (pass_copy_prop);
-	      NEXT_PASS (pass_scev_cprop);
-	      NEXT_PASS (pass_tree_loop_done);
-	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
-	      NEXT_PASS (pass_dce);
-	      NEXT_PASS (pass_tree_loop_init);
-      	      NEXT_PASS (pass_parallelize_loops_oacc_kernels);
-	      NEXT_PASS (pass_expand_omp_ssa);
-	      NEXT_PASS (pass_tree_loop_done);
-	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_merge_phi);
           NEXT_PASS (pass_dse);
 	  NEXT_PASS (pass_cd_dce);
@@ -135,6 +107,40 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_rebuild_cgraph_edges);
       NEXT_PASS (pass_inline_parameters);
   POP_INSERT_PASSES ()
+
+  NEXT_PASS (pass_ipa_oacc);
+  PUSH_INSERT_PASSES_WITHIN (pass_ipa_oacc)
+      NEXT_PASS (pass_ipa_pta);
+      /* Pass group that runs when the function is an offloaded function
+	 containing oacc kernels loops.	 */
+      NEXT_PASS (pass_ipa_oacc_kernels);
+      PUSH_INSERT_PASSES_WITHIN (pass_ipa_oacc_kernels)
+	  NEXT_PASS (pass_oacc_kernels);
+	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
+	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+	      NEXT_PASS (pass_ch);
+	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+	      NEXT_PASS (pass_fre);
+	      /* We use pass_lim to rewrite in-memory iteration and reduction
+		 variable accesses in loops into local variables accesses.  */
+	      NEXT_PASS (pass_tree_loop_init);
+	      NEXT_PASS (pass_lim);
+	      NEXT_PASS (pass_copy_prop);
+	      NEXT_PASS (pass_lim);
+	      NEXT_PASS (pass_copy_prop);
+	      NEXT_PASS (pass_scev_cprop);
+	      NEXT_PASS (pass_tree_loop_done);
+	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+	      NEXT_PASS (pass_dce);
+	      NEXT_PASS (pass_tree_loop_init);
+	      NEXT_PASS (pass_parallelize_loops_oacc_kernels);
+	      NEXT_PASS (pass_expand_omp_ssa);
+	      NEXT_PASS (pass_tree_loop_done);
+	      NEXT_PASS (pass_rebuild_cgraph_edges);
+	  POP_INSERT_PASSES ()
+      POP_INSERT_PASSES ()
+  POP_INSERT_PASSES ()
+
   NEXT_PASS (pass_ipa_chkp_produce_thunks);
   NEXT_PASS (pass_ipa_auto_profile);
   NEXT_PASS (pass_ipa_free_inline_summary);
[diff --git gcc/testsuite/ChangeLog gcc/testsuite/ChangeLog]
diff --git gcc/testsuite/g++.dg/ipa/devirt-37.C gcc/testsuite/g++.dg/ipa/devirt-37.C
index 9c5287e..b7f52a0 100644
--- gcc/testsuite/g++.dg/ipa/devirt-37.C
+++ gcc/testsuite/g++.dg/ipa/devirt-37.C
@@ -1,4 +1,4 @@
-/* { dg-options "-fpermissive -O2 -fno-indirect-inlining -fno-devirtualize-speculatively -fdump-tree-fre2-details -fno-early-inlining"  } */
+/* { dg-options "-fpermissive -O2 -fno-indirect-inlining -fno-devirtualize-speculatively -fdump-tree-fre3-details -fno-early-inlining"  } */
 #include <stdlib.h>
 struct A {virtual void test() {abort ();}};
 struct B:A
@@ -30,7 +30,7 @@ t()
 /* After inlining the call within constructor needs to be checked to not go into a basetype.
    We should see the vtbl store and we should notice extcall as possibly clobbering the
    type but ignore it because b is in static storage.  */
-/* { dg-final { scan-tree-dump "No dynamic type change found."  "fre2"  } } */
-/* { dg-final { scan-tree-dump "Checking vtbl store:"  "fre2"  } } */
-/* { dg-final { scan-tree-dump "Function call may change dynamic type:extcall"  "fre2"  } } */
-/* { dg-final { scan-tree-dump "converting indirect call to function virtual void"  "fre2"  } } */
+/* { dg-final { scan-tree-dump "No dynamic type change found."  "fre3"  } } */
+/* { dg-final { scan-tree-dump "Checking vtbl store:"  "fre3"  } } */
+/* { dg-final { scan-tree-dump "Function call may change dynamic type:extcall"  "fre3"  } } */
+/* { dg-final { scan-tree-dump "converting indirect call to function virtual void"  "fre3"  } } */
diff --git gcc/testsuite/g++.dg/ipa/devirt-40.C gcc/testsuite/g++.dg/ipa/devirt-40.C
index 279a228..5107c29 100644
--- gcc/testsuite/g++.dg/ipa/devirt-40.C
+++ gcc/testsuite/g++.dg/ipa/devirt-40.C
@@ -1,4 +1,4 @@
-/* { dg-options "-O2 -fdump-tree-fre2-details"  } */
+/* { dg-options "-O2 -fdump-tree-fre3-details"  } */
 typedef enum
 {
 } UErrorCode;
@@ -19,4 +19,4 @@ A::m_fn1 (UnicodeString &, int &p2, UErrorCode &) const
   UnicodeString a[2];
 }
 
-/* { dg-final { scan-tree-dump-not "\\n  OBJ_TYPE_REF" "fre2"  } } */
+/* { dg-final { scan-tree-dump-not "\\n  OBJ_TYPE_REF" "fre3"  } } */
diff --git gcc/testsuite/g++.dg/tree-ssa/pr61034.C gcc/testsuite/g++.dg/tree-ssa/pr61034.C
index cd4ee05..c06c580 100644
--- gcc/testsuite/g++.dg/tree-ssa/pr61034.C
+++ gcc/testsuite/g++.dg/tree-ssa/pr61034.C
@@ -1,5 +1,5 @@
 // { dg-do compile }
-// { dg-options "-O2 -fdump-tree-fre2 -fdump-tree-optimized" }
+// { dg-options "-O2 -fdump-tree-fre3 -fdump-tree-optimized" }
 
 #define assume(x) if(!(x))__builtin_unreachable()
 
@@ -42,13 +42,13 @@ bool f(I a, I b, I c, I d) {
 // a bunch of conditional free()s and unreachable()s.
 // This works only if everything is inlined into 'f'.
 
-// { dg-final { scan-tree-dump-times ";; Function" 1 "fre2" } }
-// { dg-final { scan-tree-dump-times "unreachable" 11 "fre2" } }
+// { dg-final { scan-tree-dump-times ";; Function" 1 "fre3" } }
+// { dg-final { scan-tree-dump-times "unreachable" 11 "fre3" } }
 
 // Note that depending on PUSH_ARGS_REVERSED we are presented with
 // a different initial CFG and thus the final outcome is different
 
-// { dg-final { scan-tree-dump-times "free" 10 "fre2" { target x86_64-*-* i?86-*-* } } }
+// { dg-final { scan-tree-dump-times "free" 10 "fre3" { target x86_64-*-* i?86-*-* } } }
 // { dg-final { scan-tree-dump-times "free" 3 "optimized" { target x86_64-*-* i?86-*-* } } }
-// { dg-final { scan-tree-dump-times "free" 14 "fre2" { target aarch64-*-* ia64-*-* arm-*-* hppa*-*-* sparc*-*-* powerpc*-*-* alpha*-*-* } } }
+// { dg-final { scan-tree-dump-times "free" 14 "fre3" { target aarch64-*-* ia64-*-* arm-*-* hppa*-*-* sparc*-*-* powerpc*-*-* alpha*-*-* } } }
 // { dg-final { scan-tree-dump-times "free" 4 "optimized" { target aarch64-*-* ia64-*-* arm-*-* hppa*-*-* sparc*-*-* powerpc*-*-* alpha*-*-* } } }
diff --git gcc/testsuite/gcc.dg/ipa/ipa-pta-1.c gcc/testsuite/gcc.dg/ipa/ipa-pta-1.c
index c183fcb..bc631f8 100644
--- gcc/testsuite/gcc.dg/ipa/ipa-pta-1.c
+++ gcc/testsuite/gcc.dg/ipa/ipa-pta-1.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O -fipa-pta -fdump-ipa-pta-details" } */
+/* { dg-options "-O -fipa-pta -fdump-ipa-pta2-details" } */
 
 static int __attribute__((noinline))
 foo (int *p, int *q)
@@ -45,8 +45,8 @@ int main()
    not seen by IPA PTA (if the address escapes the unit which we only compute
    during IPA PTA...).  Thus the solution also includes NONLOCAL.  */
 
-/* { dg-final { scan-ipa-dump "fn_1 = { bar foo }" "pta" } } */
-/* { dg-final { scan-ipa-dump "bar.arg0 = { NONLOCAL a }" "pta" } } */
-/* { dg-final { scan-ipa-dump "bar.arg1 = { NONLOCAL a }" "pta" } } */
-/* { dg-final { scan-ipa-dump "foo.arg0 = { NONLOCAL a }" "pta" } } */
-/* { dg-final { scan-ipa-dump "foo.arg1 = { NONLOCAL a }" "pta" } } */
+/* { dg-final { scan-ipa-dump "fn_1 = { bar foo }" "pta2" } } */
+/* { dg-final { scan-ipa-dump "bar.arg0 = { NONLOCAL a }" "pta2" } } */
+/* { dg-final { scan-ipa-dump "bar.arg1 = { NONLOCAL a }" "pta2" } } */
+/* { dg-final { scan-ipa-dump "foo.arg0 = { NONLOCAL a }" "pta2" } } */
+/* { dg-final { scan-ipa-dump "foo.arg1 = { NONLOCAL a }" "pta2" } } */
diff --git gcc/testsuite/gcc.dg/ipa/ipa-pta-10.c gcc/testsuite/gcc.dg/ipa/ipa-pta-10.c
index 0a6c166..90b7bf8 100644
--- gcc/testsuite/gcc.dg/ipa/ipa-pta-10.c
+++ gcc/testsuite/gcc.dg/ipa/ipa-pta-10.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details" } */
+/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta2-details" } */
 
 #include <stdarg.h>
 
@@ -26,4 +26,4 @@ int main()
 /* Verify we properly handle variadic arguments and do not let escape
    stuff through it.  */
 
-/* { dg-final { scan-ipa-dump "ESCAPED = { (ESCAPED )?(NONLOCAL )?}" "pta" } } */
+/* { dg-final { scan-ipa-dump "ESCAPED = { (ESCAPED )?(NONLOCAL )?}" "pta2" } } */
diff --git gcc/testsuite/gcc.dg/ipa/ipa-pta-11.c gcc/testsuite/gcc.dg/ipa/ipa-pta-11.c
index 84dd254..9857d7b 100644
--- gcc/testsuite/gcc.dg/ipa/ipa-pta-11.c
+++ gcc/testsuite/gcc.dg/ipa/ipa-pta-11.c
@@ -1,25 +1,25 @@
 /* { dg-do link } */
-/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details" } */
+/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta2-details" } */
 
 static int i;
 /* i should not escape here, p should point to i only.  */
-/* { dg-final { scan-ipa-dump "p = { i }" "pta" } } */
+/* { dg-final { scan-ipa-dump "p = { i }" "pta2" } } */
 static int *p = &i;
 
 int j;
 /* q should point to j only.  */
-/* { dg-final { scan-ipa-dump "q = { j }" "pta" } } */
+/* { dg-final { scan-ipa-dump "q = { j }" "pta2" } } */
 static int *q = &j;
 
 static int k;
 /* k should escape here, r should point to NONLOCAL, ESCAPED, k.  */
 int *r = &k;
-/* { dg-final { scan-ipa-dump "r = { ESCAPED NONLOCAL k }" "pta" } } */
+/* { dg-final { scan-ipa-dump "r = { ESCAPED NONLOCAL k }" "pta2" } } */
 
 int l;
 /* s should point to NONLOCAL, ESCAPED, l.  */
 int *s = &l;
-/* { dg-final { scan-ipa-dump "s = { ESCAPED NONLOCAL l }" "pta" } } */
+/* { dg-final { scan-ipa-dump "s = { ESCAPED NONLOCAL l }" "pta2" } } */
 
 /* Make p and q referenced so they do not get optimized out.  */
 int foo() { return &p < &q; }
@@ -32,4 +32,4 @@ int main()
 /* It isn't clear if the escape if l is strictly necessary, if it were
    we should have i, r and s in ESCAPED as well.  */
 
-/* { dg-final { scan-ipa-dump "ESCAPED = { ESCAPED NONLOCAL l k }" "pta" } } */
+/* { dg-final { scan-ipa-dump "ESCAPED = { ESCAPED NONLOCAL l k }" "pta2" } } */
diff --git gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c
index f558df3..93dd871 100644
--- gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c
+++ gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c
@@ -1,5 +1,5 @@
 /* { dg-do link } */
-/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre2 -fno-ipa-icf" } */
+/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta2-details -fdump-tree-fre3 -fno-ipa-icf" } */
 
 static int x, y;
 
@@ -19,7 +19,7 @@ void *anyfn_global;
 
 /* Even though not referenced in this TU we should have added constraints
    for the initializer.  */
-/* { dg-final { scan-ipa-dump "ex = &local_address_taken" "pta" } } */
+/* { dg-final { scan-ipa-dump "ex = &local_address_taken" "pta2" } } */
 void (*ex)(int *) = local_address_taken;
 
 extern void link_error (void);
@@ -38,11 +38,11 @@ int main()
      uses to be messed up even further.  */
   /* ???  As we don't expand the ESCAPED solution we either get x printed here
      or not based on the phase of the moon.  */
-  /* { dg-final { scan-ipa-dump "local_address_taken.arg0 = { ESCAPED NONLOCAL y x }" "pta" { xfail *-*-* } } } */
-  /* { dg-final { scan-ipa-dump "local_address_taken.clobber = { ESCAPED NONLOCAL y x }" "pta" { xfail *-*-* } } } */
-  /* { dg-final { scan-ipa-dump "local_address_taken.use = { }" "pta" { xfail *-*-* } } } */
+  /* { dg-final { scan-ipa-dump "local_address_taken.arg0 = { ESCAPED NONLOCAL y x }" "pta2" { xfail *-*-* } } } */
+  /* { dg-final { scan-ipa-dump "local_address_taken.clobber = { ESCAPED NONLOCAL y x }" "pta2" { xfail *-*-* } } } */
+  /* { dg-final { scan-ipa-dump "local_address_taken.use = { }" "pta2" { xfail *-*-* } } } */
   /* ??? But make sure x really escaped.  */
-  /* { dg-final { scan-ipa-dump "ESCAPED = {\[^\n\}\]* x \[^\n\}\]*}" "pta" } } */
+  /* { dg-final { scan-ipa-dump "ESCAPED = {\[^\n\}\]* x \[^\n\}\]*}" "pta2" } } */
   (*anyfn) (&x);
   x = 0;
   local (&y);
@@ -54,7 +54,7 @@ int main()
   local_address_taken (&y);
   /* As we are computing flow- and context-insensitive we may not
      CSE the load of x here.  */
-  /* { dg-final { scan-tree-dump " = x;" "fre2" } } */
+  /* { dg-final { scan-tree-dump " = x;" "fre3" } } */
   return x;
 }
 
diff --git gcc/testsuite/gcc.dg/ipa/ipa-pta-14.c gcc/testsuite/gcc.dg/ipa/ipa-pta-14.c
index e3333fa..cc2b940 100644
--- gcc/testsuite/gcc.dg/ipa/ipa-pta-14.c
+++ gcc/testsuite/gcc.dg/ipa/ipa-pta-14.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O2 -fipa-pta -fno-tree-fre -fno-tree-sra -fdump-ipa-pta-details -fdelete-null-pointer-checks" } */
+/* { dg-options "-O2 -fipa-pta -fno-tree-fre -fno-tree-sra -fdump-ipa-pta2-details -fdelete-null-pointer-checks" } */
 
 struct X {
     int i;
@@ -21,8 +21,8 @@ int main()
   void *p;
   a.p = (void *)&c;
   p = foo(&a, &a);
-  /* { dg-final { scan-ipa-dump "foo.result = { NULL a\[^ \]* c\[^ \]* }" "pta" { target { ! keeps_null_pointer_checks } } } } */
-  /* { dg-final { scan-ipa-dump "foo.result = { NONLOCAL a\[^ \]* c\[^ \]* }" "pta" { target { keeps_null_pointer_checks } } } } */
+  /* { dg-final { scan-ipa-dump "foo.result = { NULL a\[^ \]* c\[^ \]* }" "pta2" { target { ! keeps_null_pointer_checks } } } } */
+  /* { dg-final { scan-ipa-dump "foo.result = { NONLOCAL a\[^ \]* c\[^ \]* }" "pta2" { target { keeps_null_pointer_checks } } } } */
   ((struct X *)p)->p = (void *)0;
   if (a.p != (void *)0)
     abort ();
diff --git gcc/testsuite/gcc.dg/ipa/ipa-pta-16.c gcc/testsuite/gcc.dg/ipa/ipa-pta-16.c
index 5bd6596..83b9cd8 100644
--- gcc/testsuite/gcc.dg/ipa/ipa-pta-16.c
+++ gcc/testsuite/gcc.dg/ipa/ipa-pta-16.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O2 -fno-tree-sra -fipa-pta -fdump-ipa-pta" } */
+/* { dg-options "-O2 -fno-tree-sra -fipa-pta -fdump-ipa-pta2" } */
 
 struct X
 {
@@ -29,4 +29,4 @@ int main()
   return 0;
 }
 
-/* { dg-final { scan-ipa-dump "y.\[0-9\]*\\\+\[0-9\]* = { i }" "pta" } } */
+/* { dg-final { scan-ipa-dump "y.\[0-9\]*\\\+\[0-9\]* = { i }" "pta2" } } */
diff --git gcc/testsuite/gcc.dg/ipa/ipa-pta-2.c gcc/testsuite/gcc.dg/ipa/ipa-pta-2.c
index b77864d..0cf2adf 100644
--- gcc/testsuite/gcc.dg/ipa/ipa-pta-2.c
+++ gcc/testsuite/gcc.dg/ipa/ipa-pta-2.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fipa-pta -fdump-ipa-pta-details" } */
+/* { dg-options "-O -fipa-pta -fdump-ipa-pta2-details" } */
 
 int (*fn)(int *);
 
@@ -21,4 +21,4 @@ int main()
 /* Make sure that when a local function escapes its argument points-to sets
    are properly adjusted.  */
 
-/* { dg-final { scan-ipa-dump "foo.arg0 = { ESCAPED NONLOCAL }" "pta" } } */
+/* { dg-final { scan-ipa-dump "foo.arg0 = { ESCAPED NONLOCAL }" "pta2" } } */
diff --git gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c
index ff6fa57..68c2144 100644
--- gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c
+++ gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre2-details" } */
+/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta2-details -fdump-tree-fre3-details" } */
 
 static int __attribute__((noinline,noclone))
 foo (int *p, int *q)
@@ -21,6 +21,6 @@ int main()
 
 /* Verify we can disambiguate *p and *q in foo.  */
 
-/* { dg-final { scan-ipa-dump "foo.arg0 = &a" "pta" } } */
-/* { dg-final { scan-ipa-dump "foo.arg1 = &b" "pta" } } */
-/* { dg-final { scan-tree-dump "Replaced \\\*p_2\\\(D\\\) with 1" "fre2" } } */
+/* { dg-final { scan-ipa-dump "foo.arg0 = &a" "pta2" } } */
+/* { dg-final { scan-ipa-dump "foo.arg1 = &b" "pta2" } } */
+/* { dg-final { scan-tree-dump "Replaced \\\*p_2\\\(D\\\) with 1" "fre3" } } */
diff --git gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c
index 106e325..2fc8ada 100644
--- gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c
+++ gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre2-details" } */
+/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta2-details -fdump-tree-fre3-details" } */
 
 int a, b;
 
@@ -26,6 +26,6 @@ int main()
 
 /* Verify we can disambiguate *p and *q in foo.  */
 
-/* { dg-final { scan-ipa-dump "foo.arg0 = &a" "pta" } } */
-/* { dg-final { scan-ipa-dump "foo.arg1 = &b" "pta" } } */
-/* { dg-final { scan-tree-dump "Replaced \\\*p_2\\\(D\\\) with 1" "fre2" } } */
+/* { dg-final { scan-ipa-dump "foo.arg0 = &a" "pta2" } } */
+/* { dg-final { scan-ipa-dump "foo.arg1 = &b" "pta2" } } */
+/* { dg-final { scan-tree-dump "Replaced \\\*p_2\\\(D\\\) with 1" "fre3" } } */
diff --git gcc/testsuite/gcc.dg/ipa/ipa-pta-5.c gcc/testsuite/gcc.dg/ipa/ipa-pta-5.c
index 625291b..ec12979 100644
--- gcc/testsuite/gcc.dg/ipa/ipa-pta-5.c
+++ gcc/testsuite/gcc.dg/ipa/ipa-pta-5.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details" } */
+/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta2-details" } */
 
 int **x;
 
diff --git gcc/testsuite/gcc.dg/ipa/ipa-pta-6.c gcc/testsuite/gcc.dg/ipa/ipa-pta-6.c
index c1c8245..8fd5a43 100644
--- gcc/testsuite/gcc.dg/ipa/ipa-pta-6.c
+++ gcc/testsuite/gcc.dg/ipa/ipa-pta-6.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O -fipa-pta -fdump-ipa-pta-details" } */
+/* { dg-options "-O -fipa-pta -fdump-ipa-pta2-details" } */
 
 static void __attribute__((noinline,noclone))
 foo (int *p)
@@ -21,4 +21,4 @@ int main()
 /* Verify we correctly compute the units ESCAPED set as empty but
    still properly account for the store via *p in foo.  */
 
-/* { dg-final { scan-ipa-dump "ESCAPED = { }" "pta" } } */
+/* { dg-final { scan-ipa-dump "ESCAPED = { }" "pta2" } } */
diff --git gcc/testsuite/gcc.dg/torture/ipa-pta-1.c gcc/testsuite/gcc.dg/torture/ipa-pta-1.c
index c31d408..1bf4997 100644
--- gcc/testsuite/gcc.dg/torture/ipa-pta-1.c
+++ gcc/testsuite/gcc.dg/torture/ipa-pta-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile { target { nonpic } } } */
-/* { dg-options "-fipa-pta -fdump-ipa-pta -fno-ipa-icf" } */
+/* { dg-options "-fipa-pta -fdump-ipa-pta2 -fno-ipa-icf" } */
 /* { dg-skip-if "" { *-*-* } { "-O0" "-fno-fat-lto-objects" } { "" } } */
 
 struct X { char x; char y; };
@@ -42,4 +42,4 @@ void test4 (int a4, char b, char c, char d, char e, char f, char g, char h)
   bar (p);
 }
 
-/* { dg-final { scan-ipa-dump "bar.arg0 = { test4.arg0 test3.arg0 test2.arg0 test1.arg0 }" "pta" } } */
+/* { dg-final { scan-ipa-dump "bar.arg0 = { test4.arg0 test3.arg0 test2.arg0 test1.arg0 }" "pta2" } } */
diff --git gcc/tree-pass.h gcc/tree-pass.h
index acb606a..b622f56 100644
--- gcc/tree-pass.h
+++ gcc/tree-pass.h
@@ -471,7 +471,8 @@ extern gimple_opt_pass *make_pass_vtable_verify (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ubsan (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_sanopt (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_oacc_kernels (gcc::context *ctxt);
-extern gimple_opt_pass *make_pass_oacc_kernels2 (gcc::context *ctxt);
+extern simple_ipa_opt_pass *make_pass_ipa_oacc (gcc::context *ctxt);
+extern simple_ipa_opt_pass *make_pass_ipa_oacc_kernels (gcc::context *ctxt);
 
 /* IPA Passes */
 extern simple_ipa_opt_pass *make_pass_ipa_lower_emutls (gcc::context *ctxt);
diff --git gcc/tree-ssa-loop.c gcc/tree-ssa-loop.c
index 8a7ef1b..ce2068a 100644
--- gcc/tree-ssa-loop.c
+++ gcc/tree-ssa-loop.c
@@ -36,6 +36,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-scalar-evolution.h"
 #include "tree-vectorizer.h"
 #include "omp-low.h"
+#include "diagnostic-core.h"
 
 
 /* A pass making sure loops are fixed up.  */
@@ -206,12 +207,14 @@ make_pass_oacc_kernels (gcc::context *ctxt)
   return new pass_oacc_kernels (ctxt);
 }
 
+/* The ipa oacc superpass.  */
+
 namespace {
 
-const pass_data pass_data_oacc_kernels2 =
+const pass_data pass_data_ipa_oacc =
 {
-  GIMPLE_PASS, /* type */
-  "oacc_kernels2", /* name */
+  SIMPLE_IPA_PASS, /* type */
+  "ipa_oacc", /* name */
   OPTGROUP_LOOP, /* optinfo_flags */
   TV_TREE_LOOP, /* tv_id */
   PROP_cfg, /* properties_required */
@@ -221,34 +224,65 @@ const pass_data pass_data_oacc_kernels2 =
   0, /* todo_flags_finish */
 };
 
-class pass_oacc_kernels2 : public gimple_opt_pass
+class pass_ipa_oacc : public simple_ipa_opt_pass
 {
 public:
-  pass_oacc_kernels2 (gcc::context *ctxt)
-    : gimple_opt_pass (pass_data_oacc_kernels2, ctxt)
+  pass_ipa_oacc (gcc::context *ctxt)
+    : simple_ipa_opt_pass (pass_data_ipa_oacc, ctxt)
   {}
 
   /* opt_pass methods: */
-  virtual bool gate (function *fn) { return gate_oacc_kernels (fn); }
-  virtual unsigned int execute (function *fn)
-    {
-      /* Rather than having a copy of the previous dump, get some use out of
-	 this dump, and try to minimize differences with the following pass
-	 (pass_lim), which will initizalize the loop optimizer with
-	 LOOPS_NORMAL.  */
-      loop_optimizer_init (LOOPS_NORMAL);
-      loop_optimizer_finalize (fn);
-      return 0;
-    }
-
-}; // class pass_oacc_kernels2
+  virtual bool gate (function *)
+  {
+    return (optimize
+	    /* Don't bother doing anything if the program has errors.  */
+	    && !seen_error ()
+	    && flag_openacc
+	    && flag_tree_parallelize_loops > 1);
+  }
+
+}; // class pass_ipa_oacc
+
+} // anon namespace
+
+simple_ipa_opt_pass *
+make_pass_ipa_oacc (gcc::context *ctxt)
+{
+  return new pass_ipa_oacc (ctxt);
+}
+
+/* The ipa oacc kernels pass.  */
+
+namespace {
+
+const pass_data pass_data_ipa_oacc_kernels =
+{
+  SIMPLE_IPA_PASS, /* type */
+  "ipa_oacc_kernels", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_TREE_LOOP, /* tv_id */
+  PROP_cfg, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_ipa_oacc_kernels : public simple_ipa_opt_pass
+{
+public:
+  pass_ipa_oacc_kernels (gcc::context *ctxt)
+    : simple_ipa_opt_pass (pass_data_ipa_oacc_kernels, ctxt)
+  {}
+
+}; // class pass_ipa_oacc_kernels
 
 } // anon namespace
 
-gimple_opt_pass *
-make_pass_oacc_kernels2 (gcc::context *ctxt)
+simple_ipa_opt_pass *
+make_pass_ipa_oacc_kernels (gcc::context *ctxt)
 {
-  return new pass_oacc_kernels2 (ctxt);
+  return new pass_ipa_oacc_kernels (ctxt);
 }
 
 /* The no-loop superpass.  */
diff --git gcc/tree-ssa-structalias.c gcc/tree-ssa-structalias.c
index b34c955..5f8c0b6 100644
--- gcc/tree-ssa-structalias.c
+++ gcc/tree-ssa-structalias.c
@@ -7821,6 +7821,8 @@ public:
 	      && !seen_error ());
     }
 
+  opt_pass * clone () { return new pass_ipa_pta (m_ctxt); }
+
   virtual unsigned int execute (function *) { return ipa_pta_execute (); }
 
 }; // class pass_ipa_pta


Grüße
 Thomas

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 472 bytes --]

^ permalink raw reply	[flat|nested] 133+ messages in thread

* [Committed] Move pass_expand_omp_ssa out of pass_parallelize_loops
  2015-12-14 15:23         ` Richard Biener
@ 2016-01-16 22:41           ` Tom de Vries
  2016-01-18 12:59           ` [Committed] Allow pass_parallelize_loops to be run outside the loop pipeline Tom de Vries
  2016-01-18 13:07           ` [committed] Add oacc_kernels_p argument to pass_parallelize_loops Tom de Vries
  2 siblings, 0 replies; 133+ messages in thread
From: Tom de Vries @ 2016-01-16 22:41 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 574 bytes --]

[ was: Re: [PIING][PATCH, 9/16] Add pass_parallelize_loops_oacc_kernels ]

On 14/12/15 16:22, Richard Biener wrote:
> Can the pass not just use a pass parameter to switch between oacc/non-oacc?

It can, but given PR68874 ('Allow pass groups to be cloned'), once we
clone pass_parallelize_loops we can no longer use it as a pass group,
so pass_expand_omp_ssa can no longer be nested inside
pass_parallelize_loops.

This patch moves pass_expand_omp_ssa out of pass_parallelize_loops.
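
As a rough illustration of what un-nesting a sub-pass means for the
execution order, here is a minimal sketch. It is a toy model, not GCC's
pass manager: ToyPass, run_order, nested_layout and flat_layout are
hypothetical names, and the only rule modelled is the assumption that a
nested pass runs only when its parent's gate holds.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of a pass; this is not GCC's opt_pass.
struct ToyPass {
  std::string name;
  bool gate;                 // whether the pass is enabled
  std::vector<ToyPass> sub;  // passes nested "within" this one
};

// Append the names of the passes that would run, in order.  In this model a
// nested pass runs only when its parent's gate holds.
void collect_run_order(const std::vector<ToyPass> &list,
                       std::vector<std::string> &out) {
  for (const ToyPass &p : list) {
    if (!p.gate)
      continue;
    out.push_back(p.name);
    collect_run_order(p.sub, out);
  }
}

std::vector<std::string> run_order(const std::vector<ToyPass> &list) {
  std::vector<std::string> out;
  collect_run_order(list, out);
  return out;
}

// Layout before the change: expand_omp_ssa nested within parloops.
std::vector<ToyPass> nested_layout(bool parloops_gate) {
  return {{"iv_canon", true, {}},
          {"parloops", parloops_gate, {{"expand_omp_ssa", true, {}}}},
          {"ch_vect", true, {}}};
}

// Layout after the change: expand_omp_ssa is a sibling of parloops.
std::vector<ToyPass> flat_layout(bool parloops_gate) {
  return {{"iv_canon", true, {}},
          {"parloops", parloops_gate, {}},
          {"expand_omp_ssa", true, {}},
          {"ch_vect", true, {}}};
}
```

With the parloops gate on, both layouts give the same order in this
model.  With the gate off, the flat layout still visits the sibling
pass, which then has to rely on its own gate (hard-wired to true in the
sketch).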

Bootstrapped and reg-tested on x86_64.

Committed to trunk.

Thanks,
- Tom

[-- Attachment #2: 0004-Move-pass_expand_omp_ssa-out-of-pass_parallelize_loops.patch --]
[-- Type: text/x-patch, Size: 830 bytes --]

Move pass_expand_omp_ssa out of pass_parallelize_loops

2016-01-16  Tom de Vries  <tom@codesourcery.com>

	* passes.def: Move pass_expand_omp_ssa out of pass_parallelize_loops.

---
 gcc/passes.def | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/gcc/passes.def b/gcc/passes.def
index c593851..392a9bc 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -273,9 +273,7 @@ along with GCC; see the file COPYING3.  If not see
 	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_iv_canon);
 	  NEXT_PASS (pass_parallelize_loops);
-	  PUSH_INSERT_PASSES_WITHIN (pass_parallelize_loops)
-	      NEXT_PASS (pass_expand_omp_ssa);
-	  POP_INSERT_PASSES ()
+	  NEXT_PASS (pass_expand_omp_ssa);
 	  NEXT_PASS (pass_ch_vect);
 	  NEXT_PASS (pass_if_conversion);
 	  /* pass_vectorize must immediately follow pass_if_conversion.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* [Committed] Allow pass_parallelize_loops to be run outside the loop pipeline
  2015-12-14 15:23         ` Richard Biener
  2016-01-16 22:41           ` [Committed] Move pass_expand_omp_ssa out of pass_parallelize_loops Tom de Vries
@ 2016-01-18 12:59           ` Tom de Vries
  2016-01-18 13:07           ` [committed] Add oacc_kernels_p argument to pass_parallelize_loops Tom de Vries
  2 siblings, 0 replies; 133+ messages in thread
From: Tom de Vries @ 2016-01-18 12:59 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 370 bytes --]

[ was: Re: [PIING][PATCH, 9/16] Add pass_parallelize_loops_oacc_kernels ]

On 14/12/15 16:22, Richard Biener wrote:
> Can the pass not just use a pass parameter to switch between oacc/non-oacc?

It can, but that means parloops also gets run outside the loop
pipeline.  This patch enables that.
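
The idea of setting up the loop infrastructure only when no enclosing
pipeline has done so can be sketched as follows.  This is a toy model
under stated assumptions: ToyLoopState, ScopedLoopState and
toy_parloops_execute are hypothetical names, and the sketch swaps in an
RAII guard where the patch itself uses explicit init and finalize calls.

```cpp
#include <cassert>

// Hypothetical stand-in for the global loop-optimizer / SCEV state.
struct ToyLoopState {
  bool initialized = false;
  int init_calls = 0;
  int fini_calls = 0;
};

// Sets the state up only if no enclosing pipeline already did, and tears
// down only what it set up itself.
class ScopedLoopState {
public:
  explicit ScopedLoopState(ToyLoopState &s) : s_(s), owns_(!s.initialized) {
    if (owns_) {
      s_.initialized = true;
      ++s_.init_calls;
    }
  }
  ~ScopedLoopState() {
    if (owns_) {
      s_.initialized = false;
      ++s_.fini_calls;
    }
  }
  ScopedLoopState(const ScopedLoopState &) = delete;
  ScopedLoopState &operator=(const ScopedLoopState &) = delete;

private:
  ToyLoopState &s_;
  bool owns_;
};

// Toy pass body: returns 1 when it "parallelized" something, else 0.
unsigned toy_parloops_execute(ToyLoopState &s, int number_of_loops) {
  ScopedLoopState guard(s);
  if (number_of_loops <= 1)
    return 0;  // the guard still runs its teardown on this path
  return 1;
}
```

Called on a state that is already initialized (the "inside the loop
pipeline" case), the guard leaves it untouched.  Called on an
uninitialized state, it pairs each setup with a teardown on every return
path.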

Bootstrapped and reg-tested on x86_64.

Committed to trunk.

Thanks,
- Tom


[-- Attachment #2: 0001-Allow-pass_parallelize_loops-to-be-run-outside-the-loop-pipeline.patch --]
[-- Type: text/x-patch, Size: 1474 bytes --]

Allow pass_parallelize_loops to be run outside the loop pipeline

2016-01-18  Tom de Vries  <tom@codesourcery.com>

	* tree-parloops.c (pass_parallelize_loops::execute): Allow
	pass_parallelize_loops to be run outside the loop pipeline.

---
 gcc/tree-parloops.c | 28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
index 46d70ac..885103e 100644
--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c
@@ -2844,23 +2844,41 @@ public:
 unsigned
 pass_parallelize_loops::execute (function *fun)
 {
-  if (number_of_loops (fun) <= 1)
-    return 0;
-
   tree nthreads = builtin_decl_explicit (BUILT_IN_OMP_GET_NUM_THREADS);
   if (nthreads == NULL_TREE)
     return 0;
 
+  bool in_loop_pipeline = scev_initialized_p ();
+  if (!in_loop_pipeline)
+    loop_optimizer_init (LOOPS_NORMAL
+			 | LOOPS_HAVE_RECORDED_EXITS);
+
+  if (number_of_loops (fun) <= 1)
+    return 0;
+
+  if (!in_loop_pipeline)
+    {
+      rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
+      scev_initialize ();
+    }
+
+  unsigned int todo = 0;
   if (parallelize_loops ())
     {
       fun->curr_properties &= ~(PROP_gimple_eomp);
 
       checking_verify_loop_structure ();
 
-      return TODO_update_ssa;
+      todo |= TODO_update_ssa;
+    }
+
+  if (!in_loop_pipeline)
+    {
+      scev_finalize ();
+      loop_optimizer_finalize ();
     }
 
-  return 0;
+  return todo;
 }
 
 } // anon namespace

^ permalink raw reply	[flat|nested] 133+ messages in thread

* [committed] Add oacc_kernels_p argument to pass_parallelize_loops
  2015-12-14 15:23         ` Richard Biener
  2016-01-16 22:41           ` [Committed] Move pass_expand_omp_ssa out of pass_parallelize_loops Tom de Vries
  2016-01-18 12:59           ` [Committed] Allow pass_parallelize_loops to be run outside the loop pipeline Tom de Vries
@ 2016-01-18 13:07           ` Tom de Vries
  2016-01-18 13:30             ` [committed] Add pass_parallelize_loops to pass_oacc_kernels Tom de Vries
  2016-01-20  8:54             ` [committed] Add oacc_kernels_p argument to pass_parallelize_loops Thomas Schwinge
  2 siblings, 2 replies; 133+ messages in thread
From: Tom de Vries @ 2016-01-18 13:07 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 4836 bytes --]

[was: Re: [PIING][PATCH, 9/16] Add pass_parallelize_loops_oacc_kernels ]

On 14/12/15 16:22, Richard Biener wrote:
> On Sun, Dec 13, 2015 at 5:58 PM, Tom de Vries <Tom_deVries@mentor.com> wrote:
>> On 24/11/15 13:24, Tom de Vries wrote:
>>>
>>> On 16/11/15 12:59, Tom de Vries wrote:
>>>>
>>>> On 09/11/15 20:52, Tom de Vries wrote:
>>>>>
>>>>> On 09/11/15 16:35, Tom de Vries wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> this patch series for stage1 trunk adds support to:
>>>>>> - parallelize oacc kernels regions using parloops, and
>>>>>> - map the loops onto the oacc gang dimension.
>>>>>>
>>>>>> The patch series contains these patches:
>>>>>>
>>>>>>        1    Insert new exit block only when needed in
>>>>>>           transform_to_exit_first_loop_alt
>>>>>>        2    Make create_parallel_loop return void
>>>>>>        3    Ignore reduction clause on kernels directive
>>>>>>        4    Implement -foffload-alias
>>>>>>        5    Add in_oacc_kernels_region in struct loop
>>>>>>        6    Add pass_oacc_kernels
>>>>>>        7    Add pass_dominator_oacc_kernels
>>>>>>        8    Add pass_ch_oacc_kernels
>>>>>>        9    Add pass_parallelize_loops_oacc_kernels
>>>>>>       10    Add pass_oacc_kernels pass group in passes.def
>>>>>>       11    Update testcases after adding kernels pass group
>>>>>>       12    Handle acc loop directive
>>>>>>       13    Add c-c++-common/goacc/kernels-*.c
>>>>>>       14    Add gfortran.dg/goacc/kernels-*.f95
>>>>>>       15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>>>>>       16    Add libgomp.oacc-fortran/kernels-*.f95
>>>>>>
>>>>>> The first 9 patches are more or less independent, but patches 10-16 are
>>>>>> intended to be committed at the same time.
>>>>>>
>>>>>> Bootstrapped and reg-tested on x86_64.
>>>>>>
>>>>>> Build and reg-tested with nvidia accelerator, in combination with a
>>>>>> patch that enables accelerator testing (which is submitted at
>>>>>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>>>>>
>>>>>> I'll post the individual patches in reply to this message.
>>>>>
>>>>>
>>>>> This patch adds pass_parallelize_loops_oacc_kernels.
>>>>>
>>>>> There are a number of things we do differently in parloops for oacc
>>>>> kernels:
>>>>> - in normal parloops, we generate code to choose between a parallel
>>>>>     version of the loop, and a sequential (low iteration count) version.
>>>>>     Since the code in oacc kernels region is supposed to run on the
>>>>>     accelerator anyway, we skip this check, and don't add a low iteration
>>>>>     count loop.
>>>>> - in normal parloops, we generate an #pragma omp parallel /
>>>>>     GIMPLE_OMP_RETURN pair to delimit the region which we will split off
>>>>>     into a thread function. Since the oacc kernels region is already
>>>>>     split off, we don't add this pair.
>>>>> - we indicate the parallelization factor by setting the oacc function
>>>>>     attributes
>>>>> - we generate an #pragma oacc loop instead of an #pragma omp for, and
>>>>>     we add the gang clause
>>>>> - in normal parloops, we rewrite the variable accesses in the loop
>>>>>     into accesses relative to a thread function parameter. For the
>>>>>     oacc kernels region, that rewrite has already been done at omp-lower,
>>>>>     so we skip this.
>>>>> - we need to ensure that the entire kernels region can be run in
>>>>>     parallel. The loop independence check is already present, so for oacc
>>>>>     kernels we add a check between blocks outside the loop and the entire
>>>>>     region.
>>>>> - we guard stores in the blocks outside the loop with gang_pos == 0.
>>>>>     There's no need for each gang to write to a single location; we can
>>>>>     do this in just one gang. (Typically this is the write of the final
>>>>>     value of the iteration variable if that one is copied back to the
>>>>>     host).
>>>>>
>>>>
>>>> Reposting with loop optimizer init added in
>>>> pass_parallelize_loops_oacc_kernels::execute.
>>>>
>>>
>>> Reposting with loop_optimizer_finalize, scev_initialize and scev_finalize
>>>    added in pass_parallelize_loops_oacc_kernels::execute.
>>>
>>
>> Ping.
>>
>> Anything I can do to facilitate the review?
>
> Document new functions.

Done.

> avoid if (1).

Done.

> Ideally some refactoring would avoid some of the if (!oacc_kernels_p) spaghetti

Ack. For now, I've tried to minimize the number of oacc_kernels_p tests 
in the code.

Further suggestions on how to improve here are much appreciated.

> but I'm considering tree-parloops.c (and its bugs) yours.

Ack.

> Can the pass not just use a pass parameter to switch between oacc/non-oacc?
>

This patch introduces the pass parameter oacc_kernels_p (but does not 
instantiate an oacc_kernels_p == true pass version yet).
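
For readers unfamiliar with pass parameters, here is a rough standalone sketch of the idea. It is not the code from the patch: toy_pass, toy_parallelize_loops, instantiate and runs_in_oacc_kernels_mode are hypothetical names made up for illustration, and only the clone / set_pass_param naming is borrowed from the ChangeLog below. The point is that one pass class can be instantiated several times, with each instance receiving its own oacc_kernels_p value.

```cpp
#include <cassert>
#include <memory>

// Toy stand-in for a compiler pass base class (hypothetical, not GCC's
// opt_pass).  A pass can be cloned and can receive a boolean parameter.
struct toy_pass
{
  virtual ~toy_pass () {}
  virtual std::unique_ptr<toy_pass> clone () const = 0;
  virtual void set_pass_param (unsigned n, bool param) = 0;
  virtual bool runs_in_oacc_kernels_mode () const = 0;
};

// Toy parallelization pass with a single parameter, oacc_kernels_p.
struct toy_parallelize_loops : toy_pass
{
  toy_parallelize_loops () : oacc_kernels_p (false) {}

  // Each instantiation in the pass list gets a fresh object.
  std::unique_ptr<toy_pass> clone () const override
  {
    return std::unique_ptr<toy_pass> (new toy_parallelize_loops ());
  }

  // Only parameter number 0 exists in this sketch.
  void set_pass_param (unsigned n, bool param) override
  {
    assert (n == 0);
    oacc_kernels_p = param;
  }

  bool runs_in_oacc_kernels_mode () const override
  {
    return oacc_kernels_p;
  }

private:
  bool oacc_kernels_p;
};

// Hypothetical helper that mimics what a parameterized pass-list entry
// does: clone the prototype, then hand the parameter to the clone.
static std::unique_ptr<toy_pass>
instantiate (const toy_pass &proto, bool param)
{
  std::unique_ptr<toy_pass> p = proto.clone ();
  p->set_pass_param (0, param);
  return p;
}
```

With this shape, instantiate (proto, false) corresponds to the existing use in the loop pipeline, and instantiate (proto, true) would correspond to a later kernels-specific instantiation.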

Bootstrapped and reg-tested on x86_64.

Committed to trunk.

Thanks,
- Tom


[-- Attachment #2: 0002-Add-oacc_kernels_p-argument-to-pass_parallelize_loops.patch --]
[-- Type: text/x-patch, Size: 32703 bytes --]

Add oacc_kernels_p argument to pass_parallelize_loops

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* omp-low.c (set_oacc_fn_attrib): Make extern.
	* omp-low.h (set_oacc_fn_attrib): Declare.
	* tree-parloops.c (struct reduction_info): Add reduc_addr field.
	(create_call_for_reduction_1): Handle case that reduc_addr is non-NULL.
	(create_parallel_loop, gen_parallel_loop, try_create_reduction_list):
	Add and handle function parameter oacc_kernels_p.
	(find_reduc_addr, get_omp_data_i_param): New function.
	(ref_conflicts_with_region, oacc_entry_exit_ok_1)
	(oacc_entry_exit_single_gang, oacc_entry_exit_ok): New function.
	(parallelize_loops): Add and handle function parameter oacc_kernels_p.
	Calculate dominance info.  Skip loops that are not in a kernels region
	in oacc_kernels_p mode.  Skip inner loops of parallelized loops.
	(pass_parallelize_loops::execute): Call parallelize_loops with
	oacc_kernels_p argument.
	(pass_parallelize_loops::clone, pass_parallelize_loops::set_pass_param):
	New member function.
	(pass_parallelize_loops::bool oacc_kernels_p): New member var.
	* passes.def: Add argument to pass_parallelize_loops instantiation.

---
 gcc/omp-low.c       |   2 +-
 gcc/omp-low.h       |   1 +
 gcc/passes.def      |   2 +-
 gcc/tree-parloops.c | 744 ++++++++++++++++++++++++++++++++++++++++++++--------
 4 files changed, 641 insertions(+), 108 deletions(-)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index b391ee0..98470c7 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -12401,7 +12401,7 @@ replace_oacc_fn_attrib (tree fn, tree dims)
    function attribute.  Push any that are non-constant onto the ARGS
    list, along with an appropriate GOMP_LAUNCH_DIM tag.  */
 
-static void
+void
 set_oacc_fn_attrib (tree fn, tree clauses, vec<tree> *args)
 {
   /* Must match GOMP_DIM ordering.  */
diff --git a/gcc/omp-low.h b/gcc/omp-low.h
index 3459c1b..64caef8 100644
--- a/gcc/omp-low.h
+++ b/gcc/omp-low.h
@@ -33,6 +33,7 @@ extern tree omp_member_access_dummy_var (tree);
 extern void replace_oacc_fn_attrib (tree, tree);
 extern tree build_oacc_routine_dims (tree);
 extern tree get_oacc_fn_attrib (tree);
+extern void set_oacc_fn_attrib (tree, tree, vec<tree> *);
 extern int get_oacc_ifn_dim_arg (const gimple *);
 extern int get_oacc_fn_dim_size (tree, int);
 
diff --git a/gcc/passes.def b/gcc/passes.def
index 392a9bc..d9a8c4e 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -272,7 +272,7 @@ along with GCC; see the file COPYING3.  If not see
 	      NEXT_PASS (pass_dce);
 	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_iv_canon);
-	  NEXT_PASS (pass_parallelize_loops);
+	  NEXT_PASS (pass_parallelize_loops, false /* oacc_kernels_p */);
 	  NEXT_PASS (pass_expand_omp_ssa);
 	  NEXT_PASS (pass_ch_vect);
 	  NEXT_PASS (pass_if_conversion);
diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
index 885103e..7749d34 100644
--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c
@@ -53,6 +53,10 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa.h"
 #include "params.h"
 #include "params-enum.h"
+#include "tree-ssa-alias.h"
+#include "tree-eh.h"
+#include "gomp-constants.h"
+#include "tree-dfa.h"
 
 /* This pass tries to distribute iterations of loops into several threads.
    The implementation is straightforward -- for each loop we test whether its
@@ -192,6 +196,8 @@ struct reduction_info
 				   of the reduction variable when existing the loop. */
   tree initial_value;		/* The initial value of the reduction var before entering the loop.  */
   tree field;			/*  the name of the field in the parloop data structure intended for reduction.  */
+  tree reduc_addr;		/* The address of the reduction variable for
+				   openacc reductions.  */
   tree init;			/* reduction initialization value.  */
   gphi *new_phi;		/* (helper field) Newly created phi node whose result
 				   will be passed to the atomic operation.  Represents
@@ -1085,10 +1091,29 @@ create_call_for_reduction_1 (reduction_info **slot, struct clsn_data *clsn_data)
   tree tmp_load, name;
   gimple *load;
 
-  load_struct = build_simple_mem_ref (clsn_data->load);
-  t = build3 (COMPONENT_REF, type, load_struct, reduc->field, NULL_TREE);
+  if (reduc->reduc_addr == NULL_TREE)
+    {
+      load_struct = build_simple_mem_ref (clsn_data->load);
+      t = build3 (COMPONENT_REF, type, load_struct, reduc->field, NULL_TREE);
 
-  addr = build_addr (t);
+      addr = build_addr (t);
+    }
+  else
+    {
+      /* Set the address for the atomic store.  */
+      addr = reduc->reduc_addr;
+
+      /* Remove the non-atomic store '*addr = sum'.  */
+      tree res = PHI_RESULT (reduc->keep_res);
+      use_operand_p use_p;
+      gimple *stmt;
+      bool single_use_p = single_imm_use (res, &use_p, &stmt);
+      gcc_assert (single_use_p);
+      replace_uses_by (gimple_vdef (stmt),
+		       gimple_vuse (stmt));
+      gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+      gsi_remove (&gsi, true);
+    }
 
   /* Create phi node.  */
   bb = clsn_data->load_bb;
@@ -1994,10 +2019,11 @@ transform_to_exit_first_loop (struct loop *loop,
 
 static void
 create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
-		      tree new_data, unsigned n_threads, location_t loc)
+		      tree new_data, unsigned n_threads, location_t loc,
+		      bool oacc_kernels_p)
 {
   gimple_stmt_iterator gsi;
-  basic_block bb, paral_bb, for_bb, ex_bb, continue_bb;
+  basic_block for_bb, ex_bb, continue_bb;
   tree t, param;
   gomp_parallel *omp_par_stmt;
   gimple *omp_return_stmt1, *omp_return_stmt2;
@@ -2009,40 +2035,50 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
   edge exit, nexit, guard, end, e;
 
   /* Prepare the GIMPLE_OMP_PARALLEL statement.  */
-  bb = loop_preheader_edge (loop)->src;
-  paral_bb = single_pred (bb);
-  gsi = gsi_last_bb (paral_bb);
+  if (oacc_kernels_p)
+    {
+      tree clause = build_omp_clause (loc, OMP_CLAUSE_NUM_GANGS);
+      OMP_CLAUSE_NUM_GANGS_EXPR (clause)
+	= build_int_cst (integer_type_node, n_threads);
+      set_oacc_fn_attrib (cfun->decl, clause, NULL);
+    }
+  else
+    {
+      basic_block bb = loop_preheader_edge (loop)->src;
+      basic_block paral_bb = single_pred (bb);
+      gsi = gsi_last_bb (paral_bb);
 
-  t = build_omp_clause (loc, OMP_CLAUSE_NUM_THREADS);
-  OMP_CLAUSE_NUM_THREADS_EXPR (t)
-    = build_int_cst (integer_type_node, n_threads);
-  omp_par_stmt = gimple_build_omp_parallel (NULL, t, loop_fn, data);
-  gimple_set_location (omp_par_stmt, loc);
+      t = build_omp_clause (loc, OMP_CLAUSE_NUM_THREADS);
+      OMP_CLAUSE_NUM_THREADS_EXPR (t)
+	= build_int_cst (integer_type_node, n_threads);
+      omp_par_stmt = gimple_build_omp_parallel (NULL, t, loop_fn, data);
+      gimple_set_location (omp_par_stmt, loc);
 
-  gsi_insert_after (&gsi, omp_par_stmt, GSI_NEW_STMT);
+      gsi_insert_after (&gsi, omp_par_stmt, GSI_NEW_STMT);
 
-  /* Initialize NEW_DATA.  */
-  if (data)
-    {
-      gassign *assign_stmt;
+      /* Initialize NEW_DATA.  */
+      if (data)
+	{
+	  gassign *assign_stmt;
 
-      gsi = gsi_after_labels (bb);
+	  gsi = gsi_after_labels (bb);
 
-      param = make_ssa_name (DECL_ARGUMENTS (loop_fn));
-      assign_stmt = gimple_build_assign (param, build_fold_addr_expr (data));
-      gsi_insert_before (&gsi, assign_stmt, GSI_SAME_STMT);
+	  param = make_ssa_name (DECL_ARGUMENTS (loop_fn));
+	  assign_stmt = gimple_build_assign (param, build_fold_addr_expr (data));
+	  gsi_insert_before (&gsi, assign_stmt, GSI_SAME_STMT);
 
-      assign_stmt = gimple_build_assign (new_data,
-				  fold_convert (TREE_TYPE (new_data), param));
-      gsi_insert_before (&gsi, assign_stmt, GSI_SAME_STMT);
-    }
+	  assign_stmt = gimple_build_assign (new_data,
+					     fold_convert (TREE_TYPE (new_data), param));
+	  gsi_insert_before (&gsi, assign_stmt, GSI_SAME_STMT);
+	}
 
-  /* Emit GIMPLE_OMP_RETURN for GIMPLE_OMP_PARALLEL.  */
-  bb = split_loop_exit_edge (single_dom_exit (loop));
-  gsi = gsi_last_bb (bb);
-  omp_return_stmt1 = gimple_build_omp_return (false);
-  gimple_set_location (omp_return_stmt1, loc);
-  gsi_insert_after (&gsi, omp_return_stmt1, GSI_NEW_STMT);
+      /* Emit GIMPLE_OMP_RETURN for GIMPLE_OMP_PARALLEL.  */
+      bb = split_loop_exit_edge (single_dom_exit (loop));
+      gsi = gsi_last_bb (bb);
+      omp_return_stmt1 = gimple_build_omp_return (false);
+      gimple_set_location (omp_return_stmt1, loc);
+      gsi_insert_after (&gsi, omp_return_stmt1, GSI_NEW_STMT);
+    }
 
   /* Extract data for GIMPLE_OMP_FOR.  */
   gcc_assert (loop->header == single_dom_exit (loop)->src);
@@ -2107,39 +2143,50 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
   PENDING_STMT (e) = NULL;
 
   /* Emit GIMPLE_OMP_FOR.  */
-  gimple_cond_set_lhs (cond_stmt, cvar_base);
-  type = TREE_TYPE (cvar);
-  t = build_omp_clause (loc, OMP_CLAUSE_SCHEDULE);
-  int chunk_size = PARAM_VALUE (PARAM_PARLOOPS_CHUNK_SIZE);
-  enum PARAM_PARLOOPS_SCHEDULE_KIND schedule_type \
-    = (enum PARAM_PARLOOPS_SCHEDULE_KIND) PARAM_VALUE (PARAM_PARLOOPS_SCHEDULE);
-  switch (schedule_type)
+  if (oacc_kernels_p)
+    /* In combination with the NUM_GANGS on the parallel.  */
+    t = build_omp_clause (loc, OMP_CLAUSE_GANG);
+  else
     {
-    case PARAM_PARLOOPS_SCHEDULE_KIND_static:
-      OMP_CLAUSE_SCHEDULE_KIND (t) = OMP_CLAUSE_SCHEDULE_STATIC;
-      break;
-    case PARAM_PARLOOPS_SCHEDULE_KIND_dynamic:
-      OMP_CLAUSE_SCHEDULE_KIND (t) = OMP_CLAUSE_SCHEDULE_DYNAMIC;
-      break;
-    case PARAM_PARLOOPS_SCHEDULE_KIND_guided:
-      OMP_CLAUSE_SCHEDULE_KIND (t) = OMP_CLAUSE_SCHEDULE_GUIDED;
-      break;
-    case PARAM_PARLOOPS_SCHEDULE_KIND_auto:
-      OMP_CLAUSE_SCHEDULE_KIND (t) = OMP_CLAUSE_SCHEDULE_AUTO;
-      chunk_size = 0;
-      break;
-    case PARAM_PARLOOPS_SCHEDULE_KIND_runtime:
-      OMP_CLAUSE_SCHEDULE_KIND (t) = OMP_CLAUSE_SCHEDULE_RUNTIME;
-      chunk_size = 0;
-      break;
-    default:
-      gcc_unreachable ();
+      t = build_omp_clause (loc, OMP_CLAUSE_SCHEDULE);
+      int chunk_size = PARAM_VALUE (PARAM_PARLOOPS_CHUNK_SIZE);
+      enum PARAM_PARLOOPS_SCHEDULE_KIND schedule_type \
+	= (enum PARAM_PARLOOPS_SCHEDULE_KIND) PARAM_VALUE (PARAM_PARLOOPS_SCHEDULE);
+      switch (schedule_type)
+	{
+	case PARAM_PARLOOPS_SCHEDULE_KIND_static:
+	  OMP_CLAUSE_SCHEDULE_KIND (t) = OMP_CLAUSE_SCHEDULE_STATIC;
+	  break;
+	case PARAM_PARLOOPS_SCHEDULE_KIND_dynamic:
+	  OMP_CLAUSE_SCHEDULE_KIND (t) = OMP_CLAUSE_SCHEDULE_DYNAMIC;
+	  break;
+	case PARAM_PARLOOPS_SCHEDULE_KIND_guided:
+	  OMP_CLAUSE_SCHEDULE_KIND (t) = OMP_CLAUSE_SCHEDULE_GUIDED;
+	  break;
+	case PARAM_PARLOOPS_SCHEDULE_KIND_auto:
+	  OMP_CLAUSE_SCHEDULE_KIND (t) = OMP_CLAUSE_SCHEDULE_AUTO;
+	  chunk_size = 0;
+	  break;
+	case PARAM_PARLOOPS_SCHEDULE_KIND_runtime:
+	  OMP_CLAUSE_SCHEDULE_KIND (t) = OMP_CLAUSE_SCHEDULE_RUNTIME;
+	  chunk_size = 0;
+	  break;
+	default:
+	  gcc_unreachable ();
+	}
+      if (chunk_size != 0)
+	OMP_CLAUSE_SCHEDULE_CHUNK_EXPR (t)
+	  = build_int_cst (integer_type_node, chunk_size);
     }
-  if (chunk_size != 0)
-    OMP_CLAUSE_SCHEDULE_CHUNK_EXPR (t)
-      = build_int_cst (integer_type_node, chunk_size);
 
-  for_stmt = gimple_build_omp_for (NULL, GF_OMP_FOR_KIND_FOR, t, 1, NULL);
+  for_stmt = gimple_build_omp_for (NULL,
+				   (oacc_kernels_p
+				    ? GF_OMP_FOR_KIND_OACC_LOOP
+				    : GF_OMP_FOR_KIND_FOR),
+				   t, 1, NULL);
+
+  gimple_cond_set_lhs (cond_stmt, cvar_base);
+  type = TREE_TYPE (cvar);
   gimple_set_location (for_stmt, loc);
   gimple_omp_for_set_index (for_stmt, 0, initvar);
   gimple_omp_for_set_initial (for_stmt, 0, cvar_init);
@@ -2181,7 +2228,8 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
 static void
 gen_parallel_loop (struct loop *loop,
 		   reduction_info_table_type *reduction_list,
-		   unsigned n_threads, struct tree_niter_desc *niter)
+		   unsigned n_threads, struct tree_niter_desc *niter,
+		   bool oacc_kernels_p)
 {
   tree many_iterations_cond, type, nit;
   tree arg_struct, new_arg_struct;
@@ -2262,40 +2310,44 @@ gen_parallel_loop (struct loop *loop,
   if (stmts)
     gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
 
-  if (loop->inner)
-    m_p_thread=2;
-  else
-    m_p_thread=MIN_PER_THREAD;
-
-   many_iterations_cond =
-     fold_build2 (GE_EXPR, boolean_type_node,
-                nit, build_int_cst (type, m_p_thread * n_threads));
-
-  many_iterations_cond
-    = fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
-		   invert_truthvalue (unshare_expr (niter->may_be_zero)),
-		   many_iterations_cond);
-  many_iterations_cond
-    = force_gimple_operand (many_iterations_cond, &stmts, false, NULL_TREE);
-  if (stmts)
-    gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
-  if (!is_gimple_condexpr (many_iterations_cond))
+  if (!oacc_kernels_p)
     {
+      if (loop->inner)
+	m_p_thread=2;
+      else
+	m_p_thread=MIN_PER_THREAD;
+
+      many_iterations_cond =
+	fold_build2 (GE_EXPR, boolean_type_node,
+		     nit, build_int_cst (type, m_p_thread * n_threads));
+
       many_iterations_cond
-	= force_gimple_operand (many_iterations_cond, &stmts,
-				true, NULL_TREE);
+	= fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
+		       invert_truthvalue (unshare_expr (niter->may_be_zero)),
+		       many_iterations_cond);
+      many_iterations_cond
+	= force_gimple_operand (many_iterations_cond, &stmts, false, NULL_TREE);
       if (stmts)
 	gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
-    }
+      if (!is_gimple_condexpr (many_iterations_cond))
+	{
+	  many_iterations_cond
+	    = force_gimple_operand (many_iterations_cond, &stmts,
+				    true, NULL_TREE);
+	  if (stmts)
+	    gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop),
+					      stmts);
+	}
 
-  initialize_original_copy_tables ();
+      initialize_original_copy_tables ();
 
-  /* We assume that the loop usually iterates a lot.  */
-  prob = 4 * REG_BR_PROB_BASE / 5;
-  loop_version (loop, many_iterations_cond, NULL,
-		prob, prob, REG_BR_PROB_BASE - prob, true);
-  update_ssa (TODO_update_ssa);
-  free_original_copy_tables ();
+      /* We assume that the loop usually iterates a lot.  */
+      prob = 4 * REG_BR_PROB_BASE / 5;
+      loop_version (loop, many_iterations_cond, NULL,
+		    prob, prob, REG_BR_PROB_BASE - prob, true);
+      update_ssa (TODO_update_ssa);
+      free_original_copy_tables ();
+    }
 
   /* Base all the induction variables in LOOP on a single control one.  */
   canonicalize_loop_ivs (loop, &nit, true);
@@ -2315,6 +2367,9 @@ gen_parallel_loop (struct loop *loop,
     }
   else
     {
+      if (oacc_kernels_p)
+	n_threads = 1;
+
       /* Fall back on the method that handles more cases, but duplicates the
 	 loop body: move the exit condition of LOOP to the beginning of its
 	 header, and duplicate the part of the last iteration that gets disabled
@@ -2331,19 +2386,34 @@ gen_parallel_loop (struct loop *loop,
   entry = loop_preheader_edge (loop);
   exit = single_dom_exit (loop);
 
-  eliminate_local_variables (entry, exit);
-  /* In the old loop, move all variables non-local to the loop to a structure
-     and back, and create separate decls for the variables used in loop.  */
-  separate_decls_in_region (entry, exit, reduction_list, &arg_struct,
-			    &new_arg_struct, &clsn_data);
+  /* This rewrites the body in terms of new variables.  This has already
+     been done for oacc_kernels_p in pass_lower_omp/lower_omp ().  */
+  if (!oacc_kernels_p)
+    {
+      eliminate_local_variables (entry, exit);
+      /* In the old loop, move all variables non-local to the loop to a
+	 structure and back, and create separate decls for the variables used in
+	 loop.  */
+      separate_decls_in_region (entry, exit, reduction_list, &arg_struct,
+				&new_arg_struct, &clsn_data);
+    }
+  else
+    {
+      arg_struct = NULL_TREE;
+      new_arg_struct = NULL_TREE;
+      clsn_data.load = NULL_TREE;
+      clsn_data.load_bb = exit->dest;
+      clsn_data.store = NULL_TREE;
+      clsn_data.store_bb = NULL;
+    }
 
   /* Create the parallel constructs.  */
   loc = UNKNOWN_LOCATION;
   cond_stmt = last_stmt (loop->header);
   if (cond_stmt)
     loc = gimple_location (cond_stmt);
-  create_parallel_loop (loop, create_loop_fn (loc), arg_struct,
-			new_arg_struct, n_threads, loc);
+  create_parallel_loop (loop, create_loop_fn (loc), arg_struct, new_arg_struct,
+			n_threads, loc, oacc_kernels_p);
   if (reduction_list->elements () > 0)
     create_call_for_reduction (loop, reduction_list, &clsn_data);
 
@@ -2542,12 +2612,65 @@ try_get_loop_niter (loop_p loop, struct tree_niter_desc *niter)
   return true;
 }
 
+/* Return the default def of the first function argument.  */
+
+static tree
+get_omp_data_i_param (void)
+{
+  tree decl = DECL_ARGUMENTS (cfun->decl);
+  gcc_assert (DECL_CHAIN (decl) == NULL_TREE);
+  return ssa_default_def (cfun, decl);
+}
+
+/* For PHI in loop header of LOOP, look for pattern:
+
+   <bb preheader>
+   .omp_data_i = &.omp_data_arr;
+   addr = .omp_data_i->sum;
+   sum_a = *addr;
+
+   <bb header>:
+   sum_b = PHI <sum_a (preheader), sum_c (latch)>
+
+   and return addr.  Otherwise, return NULL_TREE.  */
+
+static tree
+find_reduc_addr (struct loop *loop, gphi *phi)
+{
+  edge e = loop_preheader_edge (loop);
+  tree arg = PHI_ARG_DEF_FROM_EDGE (phi, e);
+  gimple *stmt = SSA_NAME_DEF_STMT (arg);
+  if (!gimple_assign_single_p (stmt))
+    return NULL_TREE;
+  tree memref = gimple_assign_rhs1 (stmt);
+  if (TREE_CODE (memref) != MEM_REF)
+    return NULL_TREE;
+  tree addr = TREE_OPERAND (memref, 0);
+
+  gimple *stmt2 = SSA_NAME_DEF_STMT (addr);
+  if (!gimple_assign_single_p (stmt2))
+    return NULL_TREE;
+  tree compref = gimple_assign_rhs1 (stmt2);
+  if (TREE_CODE (compref) != COMPONENT_REF)
+    return NULL_TREE;
+  tree addr2 = TREE_OPERAND (compref, 0);
+  if (TREE_CODE (addr2) != MEM_REF)
+    return NULL_TREE;
+  addr2 = TREE_OPERAND (addr2, 0);
+  if (TREE_CODE (addr2) != SSA_NAME
+      || addr2 != get_omp_data_i_param ())
+    return NULL_TREE;
+
+  return addr;
+}
+
 /* Try to initialize REDUCTION_LIST for code generation part.
    REDUCTION_LIST describes the reductions.  */
 
 static bool
 try_create_reduction_list (loop_p loop,
-			   reduction_info_table_type *reduction_list)
+			   reduction_info_table_type *reduction_list,
+			   bool oacc_kernels_p)
 {
   edge exit = single_dom_exit (loop);
   gphi_iterator gsi;
@@ -2647,6 +2770,26 @@ try_create_reduction_list (loop_p loop,
 	}
     }
 
+  if (oacc_kernels_p)
+    {
+      for (gsi = gsi_start_phis (loop->header); !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gphi *phi = gsi.phi ();
+	  tree def = PHI_RESULT (phi);
+	  affine_iv iv;
+
+	  if (!virtual_operand_p (def)
+	      && !simple_iv (loop, loop, def, &iv, true))
+	    {
+	      tree addr = find_reduc_addr (loop, phi);
+	      if (addr == NULL_TREE)
+		return false;
+	      struct reduction_info *red = reduction_phi (reduction_list, phi);
+	      red->reduc_addr = addr;
+	    }
+	}
+    }
 
   return true;
 }
@@ -2679,6 +2822,350 @@ loop_has_phi_with_address_arg (struct loop *loop)
       }
  end:
   free (bbs);
+
+  return res;
+}
+
+/* Return true if memory ref REF (corresponding to the stmt at GSI in
+   REGION_BBS[I]) conflicts with the statements in REGION_BBS[I] after GSI,
+   or the statements in REGION_BBS[I + n].  REF_IS_STORE indicates if REF is a
+   store.  Ignore conflicts with SKIP_STMT.  */
+
+static bool
+ref_conflicts_with_region (gimple_stmt_iterator gsi, ao_ref *ref,
+			   bool ref_is_store, vec<basic_block> region_bbs,
+			   unsigned int i, gimple *skip_stmt)
+{
+  basic_block bb = region_bbs[i];
+  gsi_next (&gsi);
+
+  while (true)
+    {
+      for (; !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  if (stmt == skip_stmt)
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "skipping reduction store: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+	      continue;
+	    }
+
+	  if (!gimple_vdef (stmt)
+	      && !gimple_vuse (stmt))
+	    continue;
+
+	  if (gimple_code (stmt) == GIMPLE_RETURN)
+	    continue;
+
+	  if (ref_is_store)
+	    {
+	      if (ref_maybe_used_by_stmt_p (stmt, ref))
+		{
+		  if (dump_file)
+		    {
+		      fprintf (dump_file, "Stmt ");
+		      print_gimple_stmt (dump_file, stmt, 0, 0);
+		    }
+		  return true;
+		}
+	    }
+	  else
+	    {
+	      if (stmt_may_clobber_ref_p_1 (stmt, ref))
+		{
+		  if (dump_file)
+		    {
+		      fprintf (dump_file, "Stmt ");
+		      print_gimple_stmt (dump_file, stmt, 0, 0);
+		    }
+		  return true;
+		}
+	    }
+	}
+      i++;
+      if (i == region_bbs.length ())
+	break;
+      bb = region_bbs[i];
+      gsi = gsi_start_bb (bb);
+    }
+
+  return false;
+}
+
+/* Return true if the bbs in REGION_BBS but not in in_loop_bbs can be executed
+   in parallel with REGION_BBS containing the loop.  Return the stores of
+   reduction results in REDUCTION_STORES.  */
+
+static bool
+oacc_entry_exit_ok_1 (bitmap in_loop_bbs, vec<basic_block> region_bbs,
+		      reduction_info_table_type *reduction_list,
+		      bitmap reduction_stores)
+{
+  tree omp_data_i = get_omp_data_i_param ();
+
+  unsigned i;
+  basic_block bb;
+  FOR_EACH_VEC_ELT (region_bbs, i, bb)
+    {
+      if (bitmap_bit_p (in_loop_bbs, bb->index))
+	continue;
+
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  gimple *skip_stmt = NULL;
+
+	  if (is_gimple_debug (stmt)
+	      || gimple_code (stmt) == GIMPLE_COND)
+	    continue;
+
+	  ao_ref ref;
+	  bool ref_is_store = false;
+	  if (gimple_assign_load_p (stmt))
+	    {
+	      tree rhs = gimple_assign_rhs1 (stmt);
+	      tree base = get_base_address (rhs);
+	      if (TREE_CODE (base) == MEM_REF
+		  && operand_equal_p (TREE_OPERAND (base, 0), omp_data_i, 0))
+		continue;
+
+	      tree lhs = gimple_assign_lhs (stmt);
+	      if (TREE_CODE (lhs) == SSA_NAME
+		  && has_single_use (lhs))
+		{
+		  use_operand_p use_p;
+		  gimple *use_stmt;
+		  single_imm_use (lhs, &use_p, &use_stmt);
+		  if (gimple_code (use_stmt) == GIMPLE_PHI)
+		    {
+		      struct reduction_info *red;
+		      red = reduction_phi (reduction_list, use_stmt);
+		      tree val = PHI_RESULT (red->keep_res);
+		      if (has_single_use (val))
+			{
+			  single_imm_use (val, &use_p, &use_stmt);
+			  if (gimple_store_p (use_stmt))
+			    {
+			      unsigned int id
+				= SSA_NAME_VERSION (gimple_vdef (use_stmt));
+			      bitmap_set_bit (reduction_stores, id);
+			      skip_stmt = use_stmt;
+			      if (dump_file)
+				{
+				  fprintf (dump_file, "found reduction load: ");
+				  print_gimple_stmt (dump_file, stmt, 0, 0);
+				}
+			    }
+			}
+		    }
+		}
+
+	      ao_ref_init (&ref, rhs);
+	    }
+	  else if (gimple_store_p (stmt))
+	    {
+	      ao_ref_init (&ref, gimple_assign_lhs (stmt));
+	      ref_is_store = true;
+	    }
+	  else if (gimple_code (stmt) == GIMPLE_OMP_RETURN)
+	    continue;
+	  else if (!gimple_has_side_effects (stmt)
+		   && !gimple_could_trap_p (stmt)
+		   && !stmt_could_throw_p (stmt)
+		   && !gimple_vdef (stmt)
+		   && !gimple_vuse (stmt))
+	    continue;
+	  else if (is_gimple_call (stmt)
+		   && gimple_call_internal_p (stmt)
+		   && gimple_call_internal_fn (stmt) == IFN_GOACC_DIM_POS)
+	    continue;
+	  else if (gimple_code (stmt) == GIMPLE_RETURN)
+	    continue;
+	  else
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "Unhandled stmt in entry/exit: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+	      return false;
+	    }
+
+	  if (ref_conflicts_with_region (gsi, &ref, ref_is_store, region_bbs,
+					 i, skip_stmt))
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "conflicts with entry/exit stmt: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+	      return false;
+	    }
+	}
+    }
+
+  return true;
+}
+
+/* Find stores inside REGION_BBS and outside IN_LOOP_BBS, and guard them with
+   gang_pos == 0, except when the stores are REDUCTION_STORES.  Return true
+   if any changes were made.  */
+
+static bool
+oacc_entry_exit_single_gang (bitmap in_loop_bbs, vec<basic_block> region_bbs,
+			     bitmap reduction_stores)
+{
+  tree gang_pos = NULL_TREE;
+  bool changed = false;
+
+  unsigned i;
+  basic_block bb;
+  FOR_EACH_VEC_ELT (region_bbs, i, bb)
+    {
+      if (bitmap_bit_p (in_loop_bbs, bb->index))
+	continue;
+
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);)
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+
+	  if (!gimple_store_p (stmt))
+	    {
+	      /* Update gsi to point to next stmt.  */
+	      gsi_next (&gsi);
+	      continue;
+	    }
+
+	  if (bitmap_bit_p (reduction_stores,
+			    SSA_NAME_VERSION (gimple_vdef (stmt))))
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file,
+			   "skipped reduction store for single-gang"
+			   " neutering: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+
+	      /* Update gsi to point to next stmt.  */
+	      gsi_next (&gsi);
+	      continue;
+	    }
+
+	  changed = true;
+
+	  if (gang_pos == NULL_TREE)
+	    {
+	      tree arg = build_int_cst (integer_type_node, GOMP_DIM_GANG);
+	      gcall *gang_single
+		= gimple_build_call_internal (IFN_GOACC_DIM_POS, 1, arg);
+	      gang_pos = make_ssa_name (integer_type_node);
+	      gimple_call_set_lhs (gang_single, gang_pos);
+	      gimple_stmt_iterator start
+		= gsi_start_bb (single_succ (ENTRY_BLOCK_PTR_FOR_FN (cfun)));
+	      tree vuse = ssa_default_def (cfun, gimple_vop (cfun));
+	      gimple_set_vuse (gang_single, vuse);
+	      gsi_insert_before (&start, gang_single, GSI_SAME_STMT);
+	    }
+
+	  if (dump_file)
+	    {
+	      fprintf (dump_file,
+		       "found store that needs single-gang neutering: ");
+	      print_gimple_stmt (dump_file, stmt, 0, 0);
+	    }
+
+	  {
+	    /* Split block before store.  */
+	    gimple_stmt_iterator gsi2 = gsi;
+	    gsi_prev (&gsi2);
+	    edge e;
+	    if (gsi_end_p (gsi2))
+	      {
+		e = split_block_after_labels (bb);
+		gsi2 = gsi_last_bb (bb);
+	      }
+	    else
+	      e = split_block (bb, gsi_stmt (gsi2));
+	    basic_block bb2 = e->dest;
+
+	    /* Split block after store.  */
+	    gimple_stmt_iterator gsi3 = gsi_start_bb (bb2);
+	    edge e2 = split_block (bb2, gsi_stmt (gsi3));
+	    basic_block bb3 = e2->dest;
+
+	    gimple *cond
+	      = gimple_build_cond (EQ_EXPR, gang_pos, integer_zero_node,
+				   NULL_TREE, NULL_TREE);
+	    gsi_insert_after (&gsi2, cond, GSI_NEW_STMT);
+
+	    edge e3 = make_edge (bb, bb3, EDGE_FALSE_VALUE);
+	    e->flags = EDGE_TRUE_VALUE;
+
+	    tree vdef = gimple_vdef (stmt);
+	    tree vuse = gimple_vuse (stmt);
+
+	    tree phi_res = copy_ssa_name (vdef);
+	    gphi *new_phi = create_phi_node (phi_res, bb3);
+	    replace_uses_by (vdef, phi_res);
+	    add_phi_arg (new_phi, vuse, e3, UNKNOWN_LOCATION);
+	    add_phi_arg (new_phi, vdef, e2, UNKNOWN_LOCATION);
+
+	    /* Update gsi to point to next stmt.  */
+	    bb = bb3;
+	    gsi = gsi_start_bb (bb);
+	  }
+	}
+    }
+
+  return changed;
+}
+
+/* Return true if the statements before and after the LOOP can be executed in
+   parallel with the function containing the loop.  Resolve conflicting stores
+   outside LOOP by guarding them such that only a single gang executes them.  */
+
+static bool
+oacc_entry_exit_ok (struct loop *loop,
+		    reduction_info_table_type *reduction_list)
+{
+  basic_block *loop_bbs = get_loop_body_in_dom_order (loop);
+  vec<basic_block> region_bbs
+    = get_all_dominated_blocks (CDI_DOMINATORS, ENTRY_BLOCK_PTR_FOR_FN (cfun));
+
+  bitmap in_loop_bbs = BITMAP_ALLOC (NULL);
+  bitmap_clear (in_loop_bbs);
+  for (unsigned int i = 0; i < loop->num_nodes; i++)
+    bitmap_set_bit (in_loop_bbs, loop_bbs[i]->index);
+
+  bitmap reduction_stores = BITMAP_ALLOC (NULL);
+  bool res = oacc_entry_exit_ok_1 (in_loop_bbs, region_bbs, reduction_list,
+				   reduction_stores);
+
+  if (res)
+    {
+      bool changed = oacc_entry_exit_single_gang (in_loop_bbs, region_bbs,
+						  reduction_stores);
+      if (changed)
+	{
+	  free_dominance_info (CDI_DOMINATORS);
+	  calculate_dominance_info (CDI_DOMINATORS);
+	}
+    }
+
+  free (loop_bbs);
+
+  BITMAP_FREE (in_loop_bbs);
+  BITMAP_FREE (reduction_stores);
+
   return res;
 }
 
@@ -2687,7 +3174,7 @@ loop_has_phi_with_address_arg (struct loop *loop)
    otherwise.  */
 
 static bool
-parallelize_loops (void)
+parallelize_loops (bool oacc_kernels_p)
 {
   unsigned n_threads = flag_tree_parallelize_loops;
   bool changed = false;
@@ -2699,19 +3186,29 @@ parallelize_loops (void)
   source_location loop_loc;
 
   /* Do not parallelize loops in the functions created by parallelization.  */
-  if (parallelized_function_p (cfun->decl))
+  if (!oacc_kernels_p
+      && parallelized_function_p (cfun->decl))
     return false;
+
+  /* Do not parallelize loops in offloaded functions.  */
+  if (!oacc_kernels_p
+      && get_oacc_fn_attrib (cfun->decl) != NULL)
+    return false;
+
   if (cfun->has_nonlocal_label)
     return false;
 
   gcc_obstack_init (&parloop_obstack);
   reduction_info_table_type reduction_list (10);
 
+  calculate_dominance_info (CDI_DOMINATORS);
+
   FOR_EACH_LOOP (loop, 0)
     {
       if (loop == skip_loop)
 	{
-	  if (dump_file && (dump_flags & TDF_DETAILS))
+	  if (!loop->in_oacc_kernels_region
+	      && dump_file && (dump_flags & TDF_DETAILS))
 	    fprintf (dump_file,
 		     "Skipping loop %d as inner loop of parallelized loop\n",
 		     loop->num);
@@ -2723,6 +3220,22 @@ parallelize_loops (void)
 	skip_loop = NULL;
 
       reduction_list.empty ();
+
+      if (oacc_kernels_p)
+	{
+	  if (!loop->in_oacc_kernels_region)
+	    continue;
+
+	  /* Don't try to parallelize inner loops in an oacc kernels region.  */
+	  if (loop->inner)
+	    skip_loop = loop->inner;
+
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file,
+		     "Trying loop %d with header bb %d in oacc kernels"
+		     " region\n", loop->num, loop->header->index);
+	}
+
       if (dump_file && (dump_flags & TDF_DETAILS))
       {
         fprintf (dump_file, "Trying loop %d as candidate\n",loop->num);
@@ -2764,6 +3277,7 @@ parallelize_loops (void)
       /* FIXME: Bypass this check as graphite doesn't update the
 	 count and frequency correctly now.  */
       if (!flag_loop_parallelize_all
+	  && !oacc_kernels_p
 	  && ((estimated != -1
 	       && estimated <= (HOST_WIDE_INT) n_threads * MIN_PER_THREAD)
 	      /* Do not bother with loops in cold areas.  */
@@ -2773,7 +3287,7 @@ parallelize_loops (void)
       if (!try_get_loop_niter (loop, &niter_desc))
 	continue;
 
-      if (!try_create_reduction_list (loop, &reduction_list))
+      if (!try_create_reduction_list (loop, &reduction_list, oacc_kernels_p))
 	continue;
 
       if (loop_has_phi_with_address_arg (loop))
@@ -2783,6 +3297,14 @@ parallelize_loops (void)
 	  && !loop_parallel_p (loop, &parloop_obstack))
 	continue;
 
+      if (oacc_kernels_p
+	&& !oacc_entry_exit_ok (loop, &reduction_list))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "entry/exit not ok: FAILED\n");
+	  continue;
+	}
+
       changed = true;
       skip_loop = loop->inner;
       if (dump_file && (dump_flags & TDF_DETAILS))
@@ -2796,8 +3318,9 @@ parallelize_loops (void)
 	  fprintf (dump_file, "\nloop at %s:%d: ",
 		   LOCATION_FILE (loop_loc), LOCATION_LINE (loop_loc));
       }
+
       gen_parallel_loop (loop, &reduction_list,
-			 n_threads, &niter_desc);
+			 n_threads, &niter_desc, oacc_kernels_p);
     }
 
   obstack_free (&parloop_obstack, NULL);
@@ -2832,13 +3355,22 @@ class pass_parallelize_loops : public gimple_opt_pass
 {
 public:
   pass_parallelize_loops (gcc::context *ctxt)
-    : gimple_opt_pass (pass_data_parallelize_loops, ctxt)
+    : gimple_opt_pass (pass_data_parallelize_loops, ctxt),
+      oacc_kernels_p (false)
   {}
 
   /* opt_pass methods: */
   virtual bool gate (function *) { return flag_tree_parallelize_loops > 1; }
   virtual unsigned int execute (function *);
+  opt_pass * clone () { return new pass_parallelize_loops (m_ctxt); }
+  void set_pass_param (unsigned int n, bool param)
+    {
+      gcc_assert (n == 0);
+      oacc_kernels_p = param;
+    }
 
+ private:
+  bool oacc_kernels_p;
 }; // class pass_parallelize_loops
 
 unsigned
@@ -2863,7 +3395,7 @@ pass_parallelize_loops::execute (function *fun)
     }
 
   unsigned int todo = 0;
-  if (parallelize_loops ())
+  if (parallelize_loops (oacc_kernels_p))
     {
       fun->curr_properties &= ~(PROP_gimple_eomp);
 


* [committed] Add pass_parallelize_loops to pass_oacc_kernels
  2016-01-18 13:07           ` [committed] Add oacc_kernels_p argument to pass_parallelize_loops Tom de Vries
@ 2016-01-18 13:30             ` Tom de Vries
  2016-01-20  8:54             ` [committed] Add oacc_kernels_p argument to pass_parallelize_loops Thomas Schwinge
  1 sibling, 0 replies; 133+ messages in thread
From: Tom de Vries @ 2016-01-18 13:30 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 5426 bytes --]

[ was: Re: [committed] Add oacc_kernels_p argument to 
pass_parallelize_loops ]

On 18/01/16 14:07, Tom de Vries wrote:
> [was: Re: [PIING][PATCH, 9/16] Add pass_parallelize_loops_oacc_kernels ]
>
> On 14/12/15 16:22, Richard Biener wrote:
>> On Sun, Dec 13, 2015 at 5:58 PM, Tom de Vries <Tom_deVries@mentor.com>
>> wrote:
>>> On 24/11/15 13:24, Tom de Vries wrote:
>>>>
>>>> On 16/11/15 12:59, Tom de Vries wrote:
>>>>>
>>>>> On 09/11/15 20:52, Tom de Vries wrote:
>>>>>>
>>>>>> On 09/11/15 16:35, Tom de Vries wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> this patch series for stage1 trunk adds support to:
>>>>>>> - parallelize oacc kernels regions using parloops, and
>>>>>>> - map the loops onto the oacc gang dimension.
>>>>>>>
>>>>>>> The patch series contains these patches:
>>>>>>>
>>>>>>>        1    Insert new exit block only when needed in
>>>>>>>           transform_to_exit_first_loop_alt
>>>>>>>        2    Make create_parallel_loop return void
>>>>>>>        3    Ignore reduction clause on kernels directive
>>>>>>>        4    Implement -foffload-alias
>>>>>>>        5    Add in_oacc_kernels_region in struct loop
>>>>>>>        6    Add pass_oacc_kernels
>>>>>>>        7    Add pass_dominator_oacc_kernels
>>>>>>>        8    Add pass_ch_oacc_kernels
>>>>>>>        9    Add pass_parallelize_loops_oacc_kernels
>>>>>>>       10    Add pass_oacc_kernels pass group in passes.def
>>>>>>>       11    Update testcases after adding kernels pass group
>>>>>>>       12    Handle acc loop directive
>>>>>>>       13    Add c-c++-common/goacc/kernels-*.c
>>>>>>>       14    Add gfortran.dg/goacc/kernels-*.f95
>>>>>>>       15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>>>>>>       16    Add libgomp.oacc-fortran/kernels-*.f95
>>>>>>>
>>>>>>> The first 9 patches are more or less independent, but patches
>>>>>>> 10-16 are intended to be committed at the same time.
>>>>>>>
>>>>>>> Bootstrapped and reg-tested on x86_64.
>>>>>>>
>>>>>>> Build and reg-tested with nvidia accelerator, in combination with a
>>>>>>> patch that enables accelerator testing (which is submitted at
>>>>>>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>>>>>>
>>>>>>> I'll post the individual patches in reply to this message.
>>>>>>
>>>>>>
>>>>>> This patch adds pass_parallelize_loops_oacc_kernels.
>>>>>>
>>>>>> There's a number of things we do differently in parloops for oacc
>>>>>> kernels:
>>>>>> - in normal parloops, we generate code to choose between a parallel
>>>>>>     version of the loop, and a sequential (low iteration count)
>>>>>>     version. Since the code in oacc kernels region is supposed to run
>>>>>>     on the accelerator anyway, we skip this check, and don't add a
>>>>>>     low iteration count loop.
>>>>>> - in normal parloops, we generate an #pragma omp parallel /
>>>>>>     GIMPLE_OMP_RETURN pair to delimit the region which we will split
>>>>>>     off into a thread function. Since the oacc kernels region is
>>>>>>     already split off, we don't add this pair.
>>>>>> - we indicate the parallelization factor by setting the oacc function
>>>>>>     attributes
>>>>>> - we generate an #pragma oacc loop instead of an #pragma omp for, and
>>>>>>     we add the gang clause
>>>>>> - in normal parloops, we rewrite the variable accesses in the loop
>>>>>>     into accesses relative to a thread function parameter. For the
>>>>>>     oacc kernels region, that rewrite has already been done at
>>>>>>     omp-lower, so we skip this.
>>>>>> - we need to ensure that the entire kernels region can be run in
>>>>>>     parallel. The loop independence check is already present, so for
>>>>>>     oacc kernels we add a check between blocks outside the loop and
>>>>>>     the entire region.
>>>>>> - we guard stores in the blocks outside the loop with gang_pos == 0.
>>>>>>     There's no need for each gang to write to a single location, we
>>>>>>     can do this in just one gang. (Typically this is the write of the
>>>>>>     final value of the iteration variable if that one is copied back
>>>>>>     to the host).
>>>>>>
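
The last point above (guarding stores outside the loop with gang_pos == 0)
can be pictured with a small host-side simulation. This is only a sketch
under stated assumptions: kernel_state, after_loop and run_gangs are
hypothetical names, the gangs are run one after another on the host, and
the gang index is passed in as a plain argument instead of coming from
IFN_GOACC_DIM_POS as in the patch.

```cpp
#include <cassert>

// Hypothetical state shared by all gangs of one kernels region.
struct kernel_state
{
  int final_iv = -1;    // location written after the loop
  int store_count = 0;  // how many gangs actually executed the store
};

// Code following the parallelized loop, as a single gang would execute it.
// gang_pos stands in for the result of IFN_GOACC_DIM_POS (GOMP_DIM_GANG).
static void
after_loop (int gang_pos, int n, kernel_state &state)
{
  // Guarded store: only gang 0 writes the final value.
  if (gang_pos == 0)
    {
      state.final_iv = n;
      state.store_count++;
    }
}

// Run the post-loop code once per gang, sequentially for illustration.
static kernel_state
run_gangs (int num_gangs, int n)
{
  assert (num_gangs > 0);
  kernel_state state;
  for (int gang = 0; gang < num_gangs; gang++)
    after_loop (gang, n, state);
  return state;
}
```

Calling run_gangs (32, 100) leaves final_iv at 100 with the store executed
by one gang only, which is the effect the conditional inserted by
oacc_entry_exit_single_gang is meant to have.
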
>>>>>
>>>>> Reposting with loop optimizer init added in
>>>>> pass_parallelize_loops_oacc_kernels::execute.
>>>>>
>>>>
>>>> Reposting with loop_optimizer_finalize, scev_initialize and
>>>> scev_finalize added in pass_parallelize_loops_oacc_kernels::execute.
>>>>
>>>
>>> Ping.
>>>
>>> Anything I can do to facilitate the review?
>>
>> Document new functions.
>
> Done.
>
> avoid if (1).
>
> Done.
>
>> Ideally some refactoring would avoid some of the if (!oacc_kernels_p)
>> spaghetti
>
> Ack. For now, I've tried to minimize the number of oacc_kernels_p tests
> in the code.
>
> Further suggestions on how to improve here are much appreciated.
>
>> but I'm considering tree-parloops.c (and its bugs) yours.
>
> Ack.
>
>> Can the pass not just use a pass parameter to switch between
>> oacc/non-oacc?
>>
>
> This patch introduces the pass parameter oacc_kernels_p (but does not
> instantiate an oacc_kernels_p == true pass version yet).

This patch adds pass_parallelize_loops to pass_oacc_kernels (using pass 
parameter oacc_kernels_p == true).
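
For readers unfamiliar with pass parameters, here is a minimal standalone
sketch of the clone / set_pass_param pattern. It is a toy model, not GCC's
real opt_pass: toy_pass, handles_kernels_region and instantiate are
hypothetical names, and the real pass manager may reuse the original
object for the first instance instead of cloning every time.

```cpp
#include <cassert>
#include <memory>

// Toy model of a pass that is configured per instance.
class toy_pass
{
public:
  toy_pass () : oacc_kernels_p (false) {}

  // Each instantiation gets its own object, so parameters do not leak
  // from one instance to another.
  toy_pass *clone () const { return new toy_pass (); }

  // Mirrors the shape of set_pass_param in the patch: parameter 0 is the
  // only one supported.
  void set_pass_param (unsigned int n, bool param)
  {
    assert (n == 0);
    oacc_kernels_p = param;
  }

  bool handles_kernels_region () const { return oacc_kernels_p; }

private:
  bool oacc_kernels_p;
};

// Hypothetical helper playing the role of NEXT_PASS (pass, param):
// clone the prototype, then set parameter 0 on the clone.
static std::unique_ptr<toy_pass>
instantiate (const toy_pass &prototype, bool param)
{
  std::unique_ptr<toy_pass> p (prototype.clone ());
  p->set_pass_param (0, param);
  return p;
}
```

With this shape, the instance in the kernels pass group is created with
param == true and the regular one with param == false, and both share a
single implementation.
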

As a consequence, the parloops testcases need to be updated to use dump 
file parloops2.
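
To illustrate why the dump file name changes, here is a small sketch. It
assumes that a pass occurring more than once in the pass list gets its
1-based instance number appended to the dump name, which is consistent
with the testcases moving from parloops to parloops2. dump_suffixes is a
hypothetical helper, not a GCC function, and the pass list used with it is
abbreviated.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Given pass names in pass-list order, return the dump suffix of each
// instance.  Assumption: a name that occurs once keeps its plain name,
// a name that occurs several times gets its instance number appended.
static std::vector<std::string>
dump_suffixes (const std::vector<std::string> &passes)
{
  // First count how often each pass name occurs.
  std::map<std::string, int> total;
  for (const std::string &name : passes)
    total[name]++;

  // Then number the instances in order of appearance.
  std::map<std::string, int> seen;
  std::vector<std::string> result;
  for (const std::string &name : passes)
    {
      int instance = ++seen[name];
      if (total[name] > 1)
	result.push_back (name + std::to_string (instance));
      else
	result.push_back (name);
    }
  return result;
}
```

Under that assumption, the instance in the kernels pass group, which comes
first in passes.def, takes number 1, and the regular parloops instance
becomes the second one.
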

Bootstrapped and reg-tested on x86_64.

Built with nvidia accelerator and tested goacc.exp and libgomp.

Committed to trunk.

Thanks,
- Tom


[-- Attachment #2: 0003-Add-pass_parallelize_loops-to-pass_oacc_kernels.patch --]
[-- Type: text/x-patch, Size: 41223 bytes --]

Add pass_parallelize_loops to pass_oacc_kernels

2016-01-18  Tom de Vries  <tom@codesourcery.com>

	* passes.def: Add pass_parallelize_loops to pass_oacc_kernels.

	* gcc.dg/autopar/outer-1.c: Update for new parloops instantiation.
	* gcc.dg/autopar/outer-2.c: Same.
	* gcc.dg/autopar/outer-3.c: Same.
	* gcc.dg/autopar/outer-4.c: Same.
	* gcc.dg/autopar/outer-5.c: Same.
	* gcc.dg/autopar/outer-6.c: Same.
	* gcc.dg/autopar/parallelization-1.c: Same.
	* gcc.dg/autopar/parloops-exit-first-loop-alt-2.c: Same.
	* gcc.dg/autopar/parloops-exit-first-loop-alt-3.c: Same.
	* gcc.dg/autopar/parloops-exit-first-loop-alt-4.c: Same.
	* gcc.dg/autopar/parloops-exit-first-loop-alt-5.c: Same.
	* gcc.dg/autopar/parloops-exit-first-loop-alt-6.c: Same.
	* gcc.dg/autopar/parloops-exit-first-loop-alt-7.c: Same.
	* gcc.dg/autopar/parloops-exit-first-loop-alt-pr66652.c: Same.
	* gcc.dg/autopar/parloops-exit-first-loop-alt.c: Same.
	* gcc.dg/autopar/pr39500-1.c: Same.
	* gcc.dg/autopar/pr39500-2.c: Same.
	* gcc.dg/autopar/pr46193.c: Same.
	* gcc.dg/autopar/pr46194.c: Same.
	* gcc.dg/autopar/pr49580.c: Same.
	* gcc.dg/autopar/pr49960-1.c: Same.
	* gcc.dg/autopar/pr49960.c: Same.
	* gcc.dg/autopar/pr68373.c: Same.
	* gcc.dg/autopar/reduc-1.c: Same.
	* gcc.dg/autopar/reduc-1char.c: Same.
	* gcc.dg/autopar/reduc-1short.c: Same.
	* gcc.dg/autopar/reduc-2.c: Same.
	* gcc.dg/autopar/reduc-2char.c: Same.
	* gcc.dg/autopar/reduc-2short.c: Same.
	* gcc.dg/autopar/reduc-3.c: Same.
	* gcc.dg/autopar/reduc-4.c: Same.
	* gcc.dg/autopar/reduc-6.c: Same.
	* gcc.dg/autopar/reduc-7.c: Same.
	* gcc.dg/autopar/reduc-8.c: Same.
	* gcc.dg/autopar/reduc-9.c: Same.
	* gcc.dg/autopar/uns-outer-4.c: Same.
	* gcc.dg/autopar/uns-outer-5.c: Same.
	* gcc.dg/autopar/uns-outer-6.c: Same.
	* gfortran.dg/parloops-exit-first-loop-alt-2.f95: Same.
	* gfortran.dg/parloops-exit-first-loop-alt.f95: Same.

---
 gcc/passes.def                                                 |  2 +-
 gcc/testsuite/gcc.dg/autopar/outer-1.c                         |  4 ++--
 gcc/testsuite/gcc.dg/autopar/outer-2.c                         |  4 ++--
 gcc/testsuite/gcc.dg/autopar/outer-3.c                         |  4 ++--
 gcc/testsuite/gcc.dg/autopar/outer-4.c                         |  6 +++---
 gcc/testsuite/gcc.dg/autopar/outer-5.c                         |  4 ++--
 gcc/testsuite/gcc.dg/autopar/outer-6.c                         |  6 +++---
 gcc/testsuite/gcc.dg/autopar/parallelization-1.c               |  4 ++--
 gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-2.c  |  4 ++--
 gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-3.c  |  4 ++--
 gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-4.c  |  4 ++--
 gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-5.c  |  4 ++--
 gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-6.c  |  4 ++--
 gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-7.c  |  4 ++--
 .../gcc.dg/autopar/parloops-exit-first-loop-alt-pr66652.c      |  6 +++---
 gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt.c    |  4 ++--
 gcc/testsuite/gcc.dg/autopar/pr39500-1.c                       |  4 ++--
 gcc/testsuite/gcc.dg/autopar/pr39500-2.c                       |  4 ++--
 gcc/testsuite/gcc.dg/autopar/pr46193.c                         |  4 ++--
 gcc/testsuite/gcc.dg/autopar/pr46194.c                         |  4 ++--
 gcc/testsuite/gcc.dg/autopar/pr49580.c                         |  4 ++--
 gcc/testsuite/gcc.dg/autopar/pr49960-1.c                       |  4 ++--
 gcc/testsuite/gcc.dg/autopar/pr49960.c                         |  4 ++--
 gcc/testsuite/gcc.dg/autopar/pr68373.c                         |  4 ++--
 gcc/testsuite/gcc.dg/autopar/reduc-1.c                         |  6 +++---
 gcc/testsuite/gcc.dg/autopar/reduc-1char.c                     |  6 +++---
 gcc/testsuite/gcc.dg/autopar/reduc-1short.c                    |  6 +++---
 gcc/testsuite/gcc.dg/autopar/reduc-2.c                         |  6 +++---
 gcc/testsuite/gcc.dg/autopar/reduc-2char.c                     | 10 +++++-----
 gcc/testsuite/gcc.dg/autopar/reduc-2short.c                    | 10 +++++-----
 gcc/testsuite/gcc.dg/autopar/reduc-3.c                         |  6 +++---
 gcc/testsuite/gcc.dg/autopar/reduc-4.c                         |  2 +-
 gcc/testsuite/gcc.dg/autopar/reduc-6.c                         |  8 ++++----
 gcc/testsuite/gcc.dg/autopar/reduc-7.c                         |  6 +++---
 gcc/testsuite/gcc.dg/autopar/reduc-8.c                         |  6 +++---
 gcc/testsuite/gcc.dg/autopar/reduc-9.c                         |  6 +++---
 gcc/testsuite/gcc.dg/autopar/uns-outer-4.c                     |  4 ++--
 gcc/testsuite/gcc.dg/autopar/uns-outer-5.c                     |  4 ++--
 gcc/testsuite/gcc.dg/autopar/uns-outer-6.c                     |  6 +++---
 gcc/testsuite/gfortran.dg/parloops-exit-first-loop-alt-2.f95   |  4 ++--
 gcc/testsuite/gfortran.dg/parloops-exit-first-loop-alt.f95     |  4 ++--
 41 files changed, 100 insertions(+), 100 deletions(-)

diff --git a/gcc/passes.def b/gcc/passes.def
index d9a8c4e..ab6e083 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -124,7 +124,7 @@ along with GCC; see the file COPYING3.  If not see
 	      NEXT_PASS (pass_lim);
 	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
 	      NEXT_PASS (pass_dce);
-	      /* pass_parallelize_loops_oacc_kernels */
+	      NEXT_PASS (pass_parallelize_loops, true /* oacc_kernels_p */);
 	      NEXT_PASS (pass_expand_omp_ssa);
 	      NEXT_PASS (pass_rebuild_cgraph_edges);
 	  POP_INSERT_PASSES ()
diff --git a/gcc/testsuite/gcc.dg/autopar/outer-1.c b/gcc/testsuite/gcc.dg/autopar/outer-1.c
index d36b557..6607ed0 100644
--- a/gcc/testsuite/gcc.dg/autopar/outer-1.c
+++ b/gcc/testsuite/gcc.dg/autopar/outer-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops2-details -fdump-tree-optimized" } */
 
 void abort (void);
 
@@ -27,5 +27,5 @@ int main(void)
 
 
 /* Check that outer loop is parallelized.  */
-/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops2" } } */
 /* { dg-final { scan-tree-dump-times "loopfn" 4 "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/outer-2.c b/gcc/testsuite/gcc.dg/autopar/outer-2.c
index fd7c2be..9533e8d 100644
--- a/gcc/testsuite/gcc.dg/autopar/outer-2.c
+++ b/gcc/testsuite/gcc.dg/autopar/outer-2.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops2-details -fdump-tree-optimized" } */
 
 void abort (void);
 
@@ -27,5 +27,5 @@ int main(void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops2" } } */
 /* { dg-final { scan-tree-dump-times "loopfn" 4 "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/outer-3.c b/gcc/testsuite/gcc.dg/autopar/outer-3.c
index 55454a4..130a974 100644
--- a/gcc/testsuite/gcc.dg/autopar/outer-3.c
+++ b/gcc/testsuite/gcc.dg/autopar/outer-3.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops2-details -fdump-tree-optimized" } */
 
 void abort (void);
 
@@ -27,5 +27,5 @@ int main(void)
 
 
 /* Check that outer loop is parallelized.  */
-/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops2" } } */
 /* { dg-final { scan-tree-dump-times "loopfn" 4 "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/outer-4.c b/gcc/testsuite/gcc.dg/autopar/outer-4.c
index 681cf85..b002b0e 100644
--- a/gcc/testsuite/gcc.dg/autopar/outer-4.c
+++ b/gcc/testsuite/gcc.dg/autopar/outer-4.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops2-details -fdump-tree-optimized" } */
 
 void abort (void);
 
@@ -25,6 +25,6 @@ parloop (int N)
 }
 
 
-/* { dg-final { scan-tree-dump-times "parallelizing inner loop" 0 "parloops" } } */
-/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump-times "parallelizing inner loop" 0 "parloops2" } } */
+/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops2" { xfail *-*-* } } } */
 /* { dg-final { scan-tree-dump-times "loopfn" 4 "optimized" { xfail *-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/outer-5.c b/gcc/testsuite/gcc.dg/autopar/outer-5.c
index d6e0dd3..84b2de1 100644
--- a/gcc/testsuite/gcc.dg/autopar/outer-5.c
+++ b/gcc/testsuite/gcc.dg/autopar/outer-5.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops2-details -fdump-tree-optimized" } */
 
 void abort (void);
 
@@ -44,5 +44,5 @@ int main(void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops2" { xfail *-*-* } } } */
 /* { dg-final { scan-tree-dump-times "loopfn" 4 "optimized" { xfail *-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/outer-6.c b/gcc/testsuite/gcc.dg/autopar/outer-6.c
index 726794c..fff7bce 100644
--- a/gcc/testsuite/gcc.dg/autopar/outer-6.c
+++ b/gcc/testsuite/gcc.dg/autopar/outer-6.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops2-details -fdump-tree-optimized" } */
 
 void abort (void);
 
@@ -44,6 +44,6 @@ int main(void)
 
 
 /* Check that outer loop is parallelized.  */
-/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops" { xfail *-*-* } } } */
-/* { dg-final { scan-tree-dump-times "parallelizing inner loop" 0 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops2" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump-times "parallelizing inner loop" 0 "parloops2" } } */
 /* { dg-final { scan-tree-dump-times "loopfn" 4 "optimized" { xfail *-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/parallelization-1.c b/gcc/testsuite/gcc.dg/autopar/parallelization-1.c
index 222831a..1b400fb 100644
--- a/gcc/testsuite/gcc.dg/autopar/parallelization-1.c
+++ b/gcc/testsuite/gcc.dg/autopar/parallelization-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops2-details -fdump-tree-optimized" } */
 
 void abort (void);
 
@@ -27,5 +27,5 @@ int main(void)
 
 /* Check that the first loop in parloop got parallelized.  */
 
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops2" } } */
 /* { dg-final { scan-tree-dump-times "loopfn" 4 "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-2.c b/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-2.c
index f988455..fbd3af8 100644
--- a/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-2.c
+++ b/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-2.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-details" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops2-details" } */
 
 /* Constant bound, vector addition.  */
 
@@ -18,4 +18,4 @@ f (void)
       c[i] = a[i] + b[i];
 }
 
-/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops2" } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-3.c b/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-3.c
index 8bba352..f7a7323 100644
--- a/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-3.c
+++ b/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-3.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-details" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops2-details" } */
 
 /* Variable bound, reduction.  */
 
@@ -17,4 +17,4 @@ f (unsigned int n, unsigned int *__restrict__ a)
   return sum;
 }
 
-/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops2" } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-4.c b/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-4.c
index ccb07bc..6b1a776 100644
--- a/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-4.c
+++ b/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-4.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-details" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops2-details" } */
 
 /* Constant bound, reduction.  */
 
@@ -19,4 +19,4 @@ f (void)
   return sum;
 }
 
-/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops2" } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-5.c b/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-5.c
index 68367b1..f3f8a3b 100644
--- a/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-5.c
+++ b/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-5.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-details" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops2-details" } */
 
 /* Variable bound, vector addition, unsigned loop counter, unsigned bound.  */
 
@@ -13,4 +13,4 @@ f (unsigned int n, unsigned int *__restrict__ a, unsigned int *__restrict__ b,
     c[i] = a[i] + b[i];
 }
 
-/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops2" } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-6.c b/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-6.c
index 80d1550..186eab3 100644
--- a/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-6.c
+++ b/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-6.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-details" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops2-details" } */
 
 /* Variable bound, vector addition, unsigned loop counter, signed bound.  */
 
@@ -13,4 +13,4 @@ f (int n, unsigned int *__restrict__ a, unsigned int *__restrict__ b,
     c[i] = a[i] + b[i];
 }
 
-/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops2" } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-7.c b/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-7.c
index 8ecff0c..46c5ac4 100644
--- a/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-7.c
+++ b/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-7.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-details" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops2-details" } */
 
 /* Variable bound, vector addition, signed loop counter, signed bound.  */
 
@@ -13,4 +13,4 @@ f (int n, unsigned int *__restrict__ a, unsigned int *__restrict__ b,
     c[i] = a[i] + b[i];
 }
 
-/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops2" } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-pr66652.c b/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-pr66652.c
index b320628..a02e183 100644
--- a/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-pr66652.c
+++ b/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt-pr66652.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-details" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops2-details" } */
 
 #include <stdio.h>
 #include <stdlib.h>
@@ -21,5 +21,5 @@ f (unsigned int n, unsigned int sum)
   return sum;
 }
 
-/* { dg-final { scan-tree-dump-times "parallelizing inner loop" 1 "parloops" } } */
-/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 0 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "parallelizing inner loop" 1 "parloops2" } } */
+/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 0 "parloops2" } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt.c b/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt.c
index c67d262..dce9c92 100644
--- a/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt.c
+++ b/gcc/testsuite/gcc.dg/autopar/parloops-exit-first-loop-alt.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-details" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops2-details" } */
 
 /* Variable bound, vector addition, signed loop counter, unsigned bound.  */
 
@@ -13,5 +13,5 @@ f (unsigned int n, unsigned int *__restrict__ a, unsigned int *__restrict__ b,
     c[i] = a[i] + b[i];
 }
 
-/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops2" } } */
 
diff --git a/gcc/testsuite/gcc.dg/autopar/pr39500-1.c b/gcc/testsuite/gcc.dg/autopar/pr39500-1.c
index 33b93b3..28f4789 100644
--- a/gcc/testsuite/gcc.dg/autopar/pr39500-1.c
+++ b/gcc/testsuite/gcc.dg/autopar/pr39500-1.c
@@ -1,7 +1,7 @@
 /* pr39500: autopar fails to parallel */
 /* origin: nemokingdom@gmail.com(LiFeng) */
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops2-details" } */
 
 void abort (void);
 
@@ -24,4 +24,4 @@ int main (void)
 
 /* Check that the first loop in parloop got parallelized.  */
 
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops2" } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/pr39500-2.c b/gcc/testsuite/gcc.dg/autopar/pr39500-2.c
index 12fa909..98363e4 100644
--- a/gcc/testsuite/gcc.dg/autopar/pr39500-2.c
+++ b/gcc/testsuite/gcc.dg/autopar/pr39500-2.c
@@ -1,7 +1,7 @@
 /* pr39500: autopar fails to parallel */
 /* origin: nemokingdom@gmail.com(LiFeng) */
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops2-details" } */
 
 int main (void)
 {
@@ -16,4 +16,4 @@ int main (void)
 
 /* This loop cannot be parallelized due to a dependence.  */
 
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 0 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 0 "parloops2" } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/pr46193.c b/gcc/testsuite/gcc.dg/autopar/pr46193.c
index 544a5da..36f89c1 100644
--- a/gcc/testsuite/gcc.dg/autopar/pr46193.c
+++ b/gcc/testsuite/gcc.dg/autopar/pr46193.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-details" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops2-details" } */
 
 extern void abort (void);
 
@@ -35,4 +35,4 @@ foo2 (int count, char **list)
   return maxaddr;
 }
 
-/* { dg-final { scan-tree-dump-times "parallelizing inner loop" 2 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "parallelizing inner loop" 2 "parloops2" } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/pr46194.c b/gcc/testsuite/gcc.dg/autopar/pr46194.c
index 3daf2ab..2a184a0 100644
--- a/gcc/testsuite/gcc.dg/autopar/pr46194.c
+++ b/gcc/testsuite/gcc.dg/autopar/pr46194.c
@@ -1,6 +1,6 @@
 /* PR tree-optimization/46194 */
 /* { dg-do compile } */
-/* { dg-options "-O -ftree-parallelize-loops=2 -fdump-tree-parloops-details" } */
+/* { dg-options "-O -ftree-parallelize-loops=2 -fdump-tree-parloops2-details" } */
 
 #define N 1000
 int a[N];
@@ -20,4 +20,4 @@ int foo (void)
 
 /* This loop cannot be parallelized due to a dependence.  */
 
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 0 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 0 "parloops2" } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/pr49580.c b/gcc/testsuite/gcc.dg/autopar/pr49580.c
index ddb622f..e2c8be8 100644
--- a/gcc/testsuite/gcc.dg/autopar/pr49580.c
+++ b/gcc/testsuite/gcc.dg/autopar/pr49580.c
@@ -1,6 +1,6 @@
 /* PR debug/49580 */
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops2-details" } */
 
 #include <stdarg.h>
 #include <stdlib.h>
@@ -33,5 +33,5 @@ int main (void)
 }
 
 
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops2" } } */
 
diff --git a/gcc/testsuite/gcc.dg/autopar/pr49960-1.c b/gcc/testsuite/gcc.dg/autopar/pr49960-1.c
index 34d5552..bd65c22 100644
--- a/gcc/testsuite/gcc.dg/autopar/pr49960-1.c
+++ b/gcc/testsuite/gcc.dg/autopar/pr49960-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops2-details -fdump-tree-optimized" } */
 
 #include <stdlib.h>
 #include <stdio.h>
@@ -30,5 +30,5 @@ int main()
 }
 /* Check that no loop gets parallelized.  */
 
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 0 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 0 "parloops2" } } */
 /* { dg-final { scan-tree-dump-times "loopfn" 0 "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/pr49960.c b/gcc/testsuite/gcc.dg/autopar/pr49960.c
index c413278..e3fb04d 100644
--- a/gcc/testsuite/gcc.dg/autopar/pr49960.c
+++ b/gcc/testsuite/gcc.dg/autopar/pr49960.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized -fno-partial-inlining" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops2-details -fdump-tree-optimized -fno-partial-inlining" } */
 
 #include <stdio.h>
 #define MB 100
@@ -50,5 +50,5 @@ void main ()
 
 /* Check that the outer most loop doesn't get parallelized (thus no loop gets parallelized)  */
 
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 0 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 0 "parloops2" } } */
 /* { dg-final { scan-tree-dump-times "loopfn" 0 "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/pr68373.c b/gcc/testsuite/gcc.dg/autopar/pr68373.c
index 8e0f8a5..ecac613 100644
--- a/gcc/testsuite/gcc.dg/autopar/pr68373.c
+++ b/gcc/testsuite/gcc.dg/autopar/pr68373.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-details" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops2-details" } */
 
 unsigned int
 foo (int *a, unsigned int n)
@@ -11,4 +11,4 @@ foo (int *a, unsigned int n)
   return i;
 }
 
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops2" } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-1.c b/gcc/testsuite/gcc.dg/autopar/reduc-1.c
index 6e9a280..1e5f923 100644
--- a/gcc/testsuite/gcc.dg/autopar/reduc-1.c
+++ b/gcc/testsuite/gcc.dg/autopar/reduc-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops2-details -fdump-tree-optimized" } */
 
 #include <stdarg.h>
 #include <stdlib.h>
@@ -66,6 +66,6 @@ int main (void)
 }
 
 
-/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloops" } } */
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 4 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloops2" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 4 "parloops2" } } */
 
diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-1char.c b/gcc/testsuite/gcc.dg/autopar/reduc-1char.c
index 48ead88..0d611b96 100644
--- a/gcc/testsuite/gcc.dg/autopar/reduc-1char.c
+++ b/gcc/testsuite/gcc.dg/autopar/reduc-1char.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops2-details -fdump-tree-optimized" } */
 
 #include <stdarg.h>
 #include <stdlib.h>
@@ -60,6 +60,6 @@ int main (void)
 }
 
 
-/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloops" } } */
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 4 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloops2" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 4 "parloops2" } } */
 
diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-1short.c b/gcc/testsuite/gcc.dg/autopar/reduc-1short.c
index f3f547c..92654e3 100644
--- a/gcc/testsuite/gcc.dg/autopar/reduc-1short.c
+++ b/gcc/testsuite/gcc.dg/autopar/reduc-1short.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops2-details -fdump-tree-optimized" } */
 
 #include <stdarg.h>
 #include <stdlib.h>
@@ -59,6 +59,6 @@ int main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloops" } } */
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 4 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloops2" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 4 "parloops2" } } */
 
diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-2.c b/gcc/testsuite/gcc.dg/autopar/reduc-2.c
index 2f4883d..b94b2d4 100644
--- a/gcc/testsuite/gcc.dg/autopar/reduc-2.c
+++ b/gcc/testsuite/gcc.dg/autopar/reduc-2.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops2-details -fdump-tree-optimized" } */
 
 #include <stdarg.h>
 #include <stdlib.h>
@@ -63,6 +63,6 @@ int main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloops" { xfail *-*-* } } } */
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 4 "parloops" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloops2" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 4 "parloops2" { xfail *-*-* } } } */
 
diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-2char.c b/gcc/testsuite/gcc.dg/autopar/reduc-2char.c
index a2dad44..d48d9f9 100644
--- a/gcc/testsuite/gcc.dg/autopar/reduc-2char.c
+++ b/gcc/testsuite/gcc.dg/autopar/reduc-2char.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops2-details -fdump-tree-optimized" } */
 
 #include <stdarg.h>
 #include <stdlib.h>
@@ -61,10 +61,10 @@ int main (void)
 }
 
 
-/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloops" } } */
-/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloops" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloops2" } } */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloops2" { xfail *-*-* } } } */
 
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloops" } } */
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloops2" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops2" { xfail *-*-* } } } */
 
 
diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-2short.c b/gcc/testsuite/gcc.dg/autopar/reduc-2short.c
index a50e14f..f5466f0 100644
--- a/gcc/testsuite/gcc.dg/autopar/reduc-2short.c
+++ b/gcc/testsuite/gcc.dg/autopar/reduc-2short.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops2-details -fdump-tree-optimized" } */
 
 #include <stdarg.h>
 #include <stdlib.h>
@@ -59,8 +59,8 @@ int main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloops" } } */
-/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloops" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloops2" } } */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloops2" { xfail *-*-* } } } */
 
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloops" } } */
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloops2" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops2" { xfail *-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-3.c b/gcc/testsuite/gcc.dg/autopar/reduc-3.c
index 0d4baef..9ed1c90 100644
--- a/gcc/testsuite/gcc.dg/autopar/reduc-3.c
+++ b/gcc/testsuite/gcc.dg/autopar/reduc-3.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops2-details -fdump-tree-optimized" } */
 
 #include <stdarg.h>
 #include <stdlib.h>
@@ -50,6 +50,6 @@ int main (void)
 }
 
 
-/* { dg-final { scan-tree-dump-times "Detected reduction" 1 "parloops" } } */
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 1 "parloops2" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloops2" } } */
 
diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-4.c b/gcc/testsuite/gcc.dg/autopar/reduc-4.c
index 80b15e2..fb331cc 100644
--- a/gcc/testsuite/gcc.dg/autopar/reduc-4.c
+++ b/gcc/testsuite/gcc.dg/autopar/reduc-4.c
@@ -1,4 +1,4 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized --param parloops-chunk-size=100" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops2-details -fdump-tree-optimized --param parloops-chunk-size=100" } */
 
 #include "reduc-3.c"
diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-6.c b/gcc/testsuite/gcc.dg/autopar/reduc-6.c
index 91f679e..22a2e62 100644
--- a/gcc/testsuite/gcc.dg/autopar/reduc-6.c
+++ b/gcc/testsuite/gcc.dg/autopar/reduc-6.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops2-details -fdump-tree-optimized" } */
 
 #include <stdarg.h>
 #include <stdlib.h>
@@ -56,6 +56,6 @@ int main (void)
 
 
 /* need -ffast-math to  parallelize these loops.  */
-/* { dg-final { scan-tree-dump-times "Detected reduction" 0 "parloops" } } */
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
-/* { dg-final { scan-tree-dump-times "FAILED: it is not a part of reduction" 3 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 0 "parloops2" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops2" } } */
+/* { dg-final { scan-tree-dump-times "FAILED: it is not a part of reduction" 3 "parloops2" } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-7.c b/gcc/testsuite/gcc.dg/autopar/reduc-7.c
index 77b99e1..efae736 100644
--- a/gcc/testsuite/gcc.dg/autopar/reduc-7.c
+++ b/gcc/testsuite/gcc.dg/autopar/reduc-7.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops2-details -fdump-tree-optimized" } */
 
 #include <stdlib.h>
 
@@ -84,6 +84,6 @@ int main (void)
 }
 
 
-/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloops" } } */
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloops2" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops2" } } */
 
diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-8.c b/gcc/testsuite/gcc.dg/autopar/reduc-8.c
index 18ba03d..b3c0cda 100644
--- a/gcc/testsuite/gcc.dg/autopar/reduc-8.c
+++ b/gcc/testsuite/gcc.dg/autopar/reduc-8.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops2-details -fdump-tree-optimized" } */
 
 #include <stdlib.h>
 
@@ -85,5 +85,5 @@ main (void)
 }
 
 
-/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloops" } } */
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloops2" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloops2" } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-9.c b/gcc/testsuite/gcc.dg/autopar/reduc-9.c
index 90f4db2..99f9298 100644
--- a/gcc/testsuite/gcc.dg/autopar/reduc-9.c
+++ b/gcc/testsuite/gcc.dg/autopar/reduc-9.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops2-details -fdump-tree-optimized" } */
 
 #include <stdlib.h>
 
@@ -84,5 +84,5 @@ int main (void)
 }
 
 
-/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloops" } } */
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloops2" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops2" } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/uns-outer-4.c b/gcc/testsuite/gcc.dg/autopar/uns-outer-4.c
index 5eb67ea..ee4bb82 100644
--- a/gcc/testsuite/gcc.dg/autopar/uns-outer-4.c
+++ b/gcc/testsuite/gcc.dg/autopar/uns-outer-4.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops2-details -fdump-tree-optimized" } */
 
 void abort (void);
 
@@ -21,5 +21,5 @@ parloop (int N)
   g_sum = sum;
 }
 
-/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops2" } } */
 /* { dg-final { scan-tree-dump-times "loopfn" 4 "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/uns-outer-5.c b/gcc/testsuite/gcc.dg/autopar/uns-outer-5.c
index a929e5d..2d93a2d 100644
--- a/gcc/testsuite/gcc.dg/autopar/uns-outer-5.c
+++ b/gcc/testsuite/gcc.dg/autopar/uns-outer-5.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops2-details -fdump-tree-optimized" } */
 
 void abort (void);
 
@@ -45,5 +45,5 @@ main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops2" { xfail *-*-* } } } */
 /* { dg-final { scan-tree-dump-times "loopfn" 4 "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/uns-outer-6.c b/gcc/testsuite/gcc.dg/autopar/uns-outer-6.c
index 5c745f8..dc2870b 100644
--- a/gcc/testsuite/gcc.dg/autopar/uns-outer-6.c
+++ b/gcc/testsuite/gcc.dg/autopar/uns-outer-6.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops2-details -fdump-tree-optimized" } */
 
 void abort (void);
 
@@ -46,6 +46,6 @@ main (void)
 
 
 /* Check that outer loop is parallelized.  */
-/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops" } } */
-/* { dg-final { scan-tree-dump-times "parallelizing inner loop" 0 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops2" } } */
+/* { dg-final { scan-tree-dump-times "parallelizing inner loop" 0 "parloops2" } } */
 /* { dg-final { scan-tree-dump-times "loopfn" 4 "optimized" } } */
diff --git a/gcc/testsuite/gfortran.dg/parloops-exit-first-loop-alt-2.f95 b/gcc/testsuite/gfortran.dg/parloops-exit-first-loop-alt-2.f95
index 52434f2..236480c 100644
--- a/gcc/testsuite/gfortran.dg/parloops-exit-first-loop-alt-2.f95
+++ b/gcc/testsuite/gfortran.dg/parloops-exit-first-loop-alt-2.f95
@@ -1,7 +1,7 @@
 ! { dg-additional-options "-O2" }
 ! { dg-require-effective-target pthread }
 ! { dg-additional-options "-ftree-parallelize-loops=2" }
-! { dg-additional-options "-fdump-tree-parloops-details" }
+! { dg-additional-options "-fdump-tree-parloops2-details" }
 
 ! Constant bound, vector addition.
 
@@ -16,4 +16,4 @@ subroutine foo ()
   end do
 end subroutine foo
 
-! { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops" } }
+! { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops2" } }
diff --git a/gcc/testsuite/gfortran.dg/parloops-exit-first-loop-alt.f95 b/gcc/testsuite/gfortran.dg/parloops-exit-first-loop-alt.f95
index 1eb9dfd..a33e11d 100644
--- a/gcc/testsuite/gfortran.dg/parloops-exit-first-loop-alt.f95
+++ b/gcc/testsuite/gfortran.dg/parloops-exit-first-loop-alt.f95
@@ -1,7 +1,7 @@
 ! { dg-additional-options "-O2" }
 ! { dg-require-effective-target pthread }
 ! { dg-additional-options "-ftree-parallelize-loops=2" }
-! { dg-additional-options "-fdump-tree-parloops-details" }
+! { dg-additional-options "-fdump-tree-parloops2-details" }
 
 ! Variable bound, vector addition.
 
@@ -17,5 +17,5 @@ subroutine foo (nr)
   end do
 end subroutine foo
 
-! { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops" } }
+! { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops2" } }
 


* [committed] Add oacc kernels tests in goacc
  2015-11-09 20:08 ` [PATCH, 13/16] Add c-c++-common/goacc/kernels-*.c Tom de Vries
@ 2016-01-18 13:33   ` Tom de Vries
  0 siblings, 0 replies; 133+ messages in thread
From: Tom de Vries @ 2016-01-18 13:33 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 1924 bytes --]

[ was: Re: [PATCH, 13/16] Add c-c++-common/goacc/kernels-*.c ]

On 09/11/15 21:07, Tom de Vries wrote:
> On 09/11/15 16:35, Tom de Vries wrote:
>> Hi,
>>
>> this patch series for stage1 trunk adds support to:
>> - parallelize oacc kernels regions using parloops, and
>> - map the loops onto the oacc gang dimension.
>>
>> The patch series contains these patches:
>>
>>       1    Insert new exit block only when needed in
>>          transform_to_exit_first_loop_alt
>>       2    Make create_parallel_loop return void
>>       3    Ignore reduction clause on kernels directive
>>       4    Implement -foffload-alias
>>       5    Add in_oacc_kernels_region in struct loop
>>       6    Add pass_oacc_kernels
>>       7    Add pass_dominator_oacc_kernels
>>       8    Add pass_ch_oacc_kernels
>>       9    Add pass_parallelize_loops_oacc_kernels
>>      10    Add pass_oacc_kernels pass group in passes.def
>>      11    Update testcases after adding kernels pass group
>>      12    Handle acc loop directive
>>      13    Add c-c++-common/goacc/kernels-*.c
>>      14    Add gfortran.dg/goacc/kernels-*.f95
>>      15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>      16    Add libgomp.oacc-fortran/kernels-*.f95
>>
>> The first 9 patches are more or less independent, but patches 10-16 are
>> intended to be committed at the same time.
>>
>> Bootstrapped and reg-tested on x86_64.
>>
>> Build and reg-tested with nvidia accelerator, in combination with a
>> patch that enables accelerator testing (which is submitted at
>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>
>> I'll post the individual patches in reply to this message.
>
> This patch adds C/C++ oacc kernels compilation tests.
>

This reduced patch contains the test-cases that currently pass.
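
For readers skimming the thread, the tests in this patch roughly follow the
shape sketched below. This is a minimal illustration modelled on
kernels-counter-vars-function-scope.c, not a file from the patch: the helper
names vector_add and run_vector_add and the smaller N are assumptions made
for the example. The real tests additionally carry dg-additional-options and
dg-final scan-tree-dump directives that inspect the parloops1 and optimized
dumps.

```c
/* Hypothetical sketch in the style of the goacc kernels tests; not part of
   the patch.  */
#include <stdlib.h>

#define N 1024

/* Add a and b element-wise into c inside an oacc kernels region.  When built
   with -fopenacc, the loop is a candidate for parloops; without it, the
   pragma is ignored (possibly with a warning) and the loop runs serially on
   the host.  */
static void
vector_add (const unsigned int *a, const unsigned int *b, unsigned int *c)
{
#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
  {
    for (unsigned int ii = 0; ii < N; ii++)
      c[ii] = a[ii] + b[ii];
  }
}

/* Allocate and initialize the inputs, run the region, and verify the result.
   Returns 0 on success, 1 on a mismatch or an allocation failure.  */
int
run_vector_add (void)
{
  unsigned int *a = (unsigned int *) malloc (N * sizeof (unsigned int));
  unsigned int *b = (unsigned int *) malloc (N * sizeof (unsigned int));
  unsigned int *c = (unsigned int *) malloc (N * sizeof (unsigned int));
  int status = 1;

  if (a && b && c)
    {
      for (unsigned int i = 0; i < N; i++)
	{
	  a[i] = i * 2;
	  b[i] = i * 4;
	}

      vector_add (a, b, c);

      status = 0;
      for (unsigned int i = 0; i < N; i++)
	if (c[i] != a[i] + b[i])
	  status = 1;
    }

  free (a);
  free (b);
  free (c);

  return status;
}
```

The tests in the patch call abort () on a mismatch instead of returning a
status; the return code here only makes the sketch easier to call from other
code.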

Bootstrapped and reg-tested on x86_64.

Built with nvidia accelerator, and tested goacc.exp and libgomp.

Committed to trunk.

Thanks,
- Tom


[-- Attachment #2: 0004-Add-oacc-kernels-tests-in-goacc.patch --]
[-- Type: text/x-patch, Size: 22445 bytes --]

Add oacc kernels tests in goacc

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* c-c++-common/goacc/kernels-counter-vars-function-scope.c: New test.
	* c-c++-common/goacc/kernels-double-reduction.c: New test.
	* c-c++-common/goacc/kernels-empty.c: New test.
	* c-c++-common/goacc/kernels-eternal.c: New test.
	* c-c++-common/goacc/kernels-loop-2.c: New test.
	* c-c++-common/goacc/kernels-loop-3.c: New test.
	* c-c++-common/goacc/kernels-loop-data-2.c: New test.
	* c-c++-common/goacc/kernels-loop-data-enter-exit-2.c: New test.
	* c-c++-common/goacc/kernels-loop-data-enter-exit.c: New test.
	* c-c++-common/goacc/kernels-loop-data-update.c: New test.
	* c-c++-common/goacc/kernels-loop-data.c: New test.
	* c-c++-common/goacc/kernels-loop-g.c: New test.
	* c-c++-common/goacc/kernels-loop-mod-not-zero.c: New test.
	* c-c++-common/goacc/kernels-loop-n.c: New test.
	* c-c++-common/goacc/kernels-loop-nest.c: New test.
	* c-c++-common/goacc/kernels-loop.c: New test.
	* c-c++-common/goacc/kernels-noreturn.c: New test.
	* c-c++-common/goacc/kernels-one-counter-var.c: New test.
	* c-c++-common/goacc/kernels-parallel-loop-data-enter-exit.c: New test.
	* c-c++-common/goacc/kernels-reduction.c: New test.

---
 .../goacc/kernels-counter-vars-function-scope.c    | 54 +++++++++++++++++
 .../goacc/kernels-double-reduction-n.c             | 37 ++++++++++++
 .../c-c++-common/goacc/kernels-double-reduction.c  | 37 ++++++++++++
 gcc/testsuite/c-c++-common/goacc/kernels-empty.c   |  6 ++
 gcc/testsuite/c-c++-common/goacc/kernels-eternal.c | 11 ++++
 gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c  | 70 ++++++++++++++++++++++
 gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c  | 49 +++++++++++++++
 gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c  | 17 ++++++
 .../c-c++-common/goacc/kernels-loop-mod-not-zero.c | 52 ++++++++++++++++
 gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c  | 56 +++++++++++++++++
 .../c-c++-common/goacc/kernels-loop-nest.c         | 39 ++++++++++++
 gcc/testsuite/c-c++-common/goacc/kernels-loop.c    | 56 +++++++++++++++++
 .../c-c++-common/goacc/kernels-noreturn.c          | 12 ++++
 .../c-c++-common/goacc/kernels-one-counter-var.c   | 54 +++++++++++++++++
 .../c-c++-common/goacc/kernels-reduction.c         | 36 +++++++++++
 15 files changed, 586 insertions(+)

diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-counter-vars-function-scope.c b/gcc/testsuite/c-c++-common/goacc/kernels-counter-vars-function-scope.c
new file mode 100644
index 0000000..e8b5357
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-counter-vars-function-scope.c
@@ -0,0 +1,54 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops1-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+  COUNTERTYPE i;
+  COUNTERTYPE ii;
+
+  a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+  for (i = 0; i < N; i++)
+    a[i] = i * 2;
+
+  for (i = 0; i < N; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+    for (ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops1" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-double-reduction-n.c b/gcc/testsuite/c-c++-common/goacc/kernels-double-reduction-n.c
new file mode 100644
index 0000000..c39d674
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-double-reduction-n.c
@@ -0,0 +1,37 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops1-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N 500
+
+unsigned int a[N][N];
+
+void  __attribute__((noinline,noclone))
+foo (unsigned int n)
+{
+  int i, j;
+  unsigned int sum = 1;
+
+#pragma acc kernels copyin (a[0:n]) copy (sum)
+  {
+    for (i = 0; i < n; ++i)
+      for (j = 0; j < n; ++j)
+	sum += a[i][j];
+  }
+
+  if (sum != 5001)
+    abort ();
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops1" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
+/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops1" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-double-reduction.c b/gcc/testsuite/c-c++-common/goacc/kernels-double-reduction.c
new file mode 100644
index 0000000..3501d0d
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-double-reduction.c
@@ -0,0 +1,37 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops1-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N 500
+
+unsigned int a[N][N];
+
+void  __attribute__((noinline,noclone))
+foo (void)
+{
+  int i, j;
+  unsigned int sum = 1;
+
+#pragma acc kernels copyin (a[0:N]) copy (sum)
+  {
+    for (i = 0; i < N; ++i)
+      for (j = 0; j < N; ++j)
+	sum += a[i][j];
+  }
+
+  if (sum != 5001)
+    abort ();
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops1" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
+/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops1" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-empty.c b/gcc/testsuite/c-c++-common/goacc/kernels-empty.c
new file mode 100644
index 0000000..e91b81c
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-empty.c
@@ -0,0 +1,6 @@
+void
+foo (void)
+{
+#pragma acc kernels
+  ;
+}
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-eternal.c b/gcc/testsuite/c-c++-common/goacc/kernels-eternal.c
new file mode 100644
index 0000000..edc17d2
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-eternal.c
@@ -0,0 +1,11 @@
+int
+main (void)
+{
+#pragma acc kernels
+  {
+    while (1)
+      ;
+  }
+
+  return 0;
+}
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c
new file mode 100644
index 0000000..f97584d
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c
@@ -0,0 +1,70 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops1-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+#pragma acc kernels copyout (a[0:N])
+  {
+#ifdef ACC_LOOP
+    #pragma acc loop
+#endif
+    for (COUNTERTYPE i = 0; i < N; i++)
+      a[i] = i * 2;
+  }
+
+#pragma acc kernels copyout (b[0:N])
+  {
+#ifdef ACC_LOOP
+    #pragma acc loop
+#endif
+    for (COUNTERTYPE i = 0; i < N; i++)
+      b[i] = i * 4;
+  }
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+#ifdef ACC_LOOP
+    #pragma acc loop
+#endif
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only three loops are analyzed, and that all can be
+   parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops1" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.1" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.2" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 3 "parloops1" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c
new file mode 100644
index 0000000..530d62a
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c
@@ -0,0 +1,49 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops1-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int i;
+
+  unsigned int *__restrict c;
+
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    c[i] = i * 2;
+
+#pragma acc kernels copy (c[0:N])
+  {
+#ifdef ACC_LOOP
+    #pragma acc loop
+#endif
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = c[ii] + ii + 1;
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != i * 2 + i + 1)
+      abort ();
+
+  free (c);
+
+  return 0;
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops1" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c
new file mode 100644
index 0000000..4f1c2c5
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c
@@ -0,0 +1,17 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-g" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops1-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include "kernels-loop.c"
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops1" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-mod-not-zero.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-mod-not-zero.c
new file mode 100644
index 0000000..151db51
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-mod-not-zero.c
@@ -0,0 +1,52 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops1-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N ((1024 * 512) + 1)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    a[i] = i * 2;
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops1" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c
new file mode 100644
index 0000000..bee5f5a
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c
@@ -0,0 +1,56 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops1-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N ((1024 * 512) + 1)
+#define COUNTERTYPE unsigned int
+
+int
+foo (COUNTERTYPE n)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (n * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (n * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (n * sizeof (unsigned int));
+
+  for (COUNTERTYPE i = 0; i < n; i++)
+    a[i] = i * 2;
+
+  for (COUNTERTYPE i = 0; i < n; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:n], b[0:n]) copyout (c[0:n])
+  {
+#ifdef ACC_LOOP
+    #pragma acc loop
+#endif
+    for (COUNTERTYPE ii = 0; ii < n; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < n; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops1" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-nest.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-nest.c
new file mode 100644
index 0000000..ea0e342
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-nest.c
@@ -0,0 +1,39 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops1-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+/* Based on autopar/outer-1.c.  */
+
+#include <stdlib.h>
+
+#define N 1000
+
+int
+main (void)
+{
+  int x[N][N];
+
+#pragma acc kernels copyout (x)
+  {
+    for (int ii = 0; ii < N; ii++)
+      for (int jj = 0; jj < N; jj++)
+	x[ii][jj] = ii + jj + 3;
+  }
+
+  for (int i = 0; i < N; i++)
+    for (int j = 0; j < N; j++)
+      if (x[i][j] != i + j + 3)
+	abort ();
+
+  return 0;
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops1" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop.c
new file mode 100644
index 0000000..ab5dfb9
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop.c
@@ -0,0 +1,56 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops1-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    a[i] = i * 2;
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+#ifdef ACC_LOOP
+    #pragma acc loop
+#endif
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops1" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-noreturn.c b/gcc/testsuite/c-c++-common/goacc/kernels-noreturn.c
new file mode 100644
index 0000000..1a8cc67
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-noreturn.c
@@ -0,0 +1,12 @@
+int
+main (void)
+{
+
+#pragma acc kernels
+  {
+    __builtin_abort ();
+  }
+
+  return 0;
+}
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-one-counter-var.c b/gcc/testsuite/c-c++-common/goacc/kernels-one-counter-var.c
new file mode 100644
index 0000000..b16a8cd
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-one-counter-var.c
@@ -0,0 +1,54 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops1-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+  COUNTERTYPE i;
+
+  a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+  for (i = 0; i < N; i++)
+    a[i] = i * 2;
+
+  for (i = 0; i < N; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+    for (i = 0; i < N; i++)
+      c[i] = a[i] + b[i];
+  }
+
+  for (i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops1" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-reduction.c b/gcc/testsuite/c-c++-common/goacc/kernels-reduction.c
new file mode 100644
index 0000000..61c5df3
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-reduction.c
@@ -0,0 +1,36 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops1-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define n 10000
+
+unsigned int a[n];
+
+void  __attribute__((noinline,noclone))
+foo (void)
+{
+  int i;
+  unsigned int sum = 1;
+
+#pragma acc kernels copyin (a[0:n]) copy (sum)
+  {
+    for (i = 0; i < n; ++i)
+      sum += a[i];
+  }
+
+  if (sum != 5001)
+    abort ();
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops1" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
+

* [committed] Add oacc kernels test in libgomp
  2015-11-09 20:11 ` [PATCH, 15/16] Add libgomp.oacc-c-c++-common/kernels-*.c Tom de Vries
@ 2016-01-18 13:39   ` Tom de Vries
  2016-03-09  9:18   ` [PATCH, 15/16] Add libgomp.oacc-c-c++-common/kernels-*.c Tom de Vries
  1 sibling, 0 replies; 133+ messages in thread
From: Tom de Vries @ 2016-01-18 13:39 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 1952 bytes --]

[ was: Re: [PATCH, 15/16] Add libgomp.oacc-c-c++-common/kernels-*.c ]

On 09/11/15 21:10, Tom de Vries wrote:
> On 09/11/15 16:35, Tom de Vries wrote:
>> Hi,
>>
>> this patch series for stage1 trunk adds support to:
>> - parallelize oacc kernels regions using parloops, and
>> - map the loops onto the oacc gang dimension.
>>
>> The patch series contains these patches:
>>
>>       1    Insert new exit block only when needed in
>>          transform_to_exit_first_loop_alt
>>       2    Make create_parallel_loop return void
>>       3    Ignore reduction clause on kernels directive
>>       4    Implement -foffload-alias
>>       5    Add in_oacc_kernels_region in struct loop
>>       6    Add pass_oacc_kernels
>>       7    Add pass_dominator_oacc_kernels
>>       8    Add pass_ch_oacc_kernels
>>       9    Add pass_parallelize_loops_oacc_kernels
>>      10    Add pass_oacc_kernels pass group in passes.def
>>      11    Update testcases after adding kernels pass group
>>      12    Handle acc loop directive
>>      13    Add c-c++-common/goacc/kernels-*.c
>>      14    Add gfortran.dg/goacc/kernels-*.f95
>>      15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>      16    Add libgomp.oacc-fortran/kernels-*.f95
>>
>> The first 9 patches are more or less independent, but patches 10-16 are
>> intended to be committed at the same time.
>>
>> Bootstrapped and reg-tested on x86_64.
>>
>> Build and reg-tested with nvidia accelerator, in combination with a
>> patch that enables accelerator testing (which is submitted at
>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>
>> I'll post the individual patches in reply to this message.
>
> This patch adds C/C++ oacc kernels execution tests.
>

Bootstrapped and reg-tested on x86_64.

Built with nvidia accelerator, and tested goacc.exp and libgomp.

Committed to trunk as attached (AFAICT, no changes compared to original 
posting, other than commit title).
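
As a side note on the expected value in kernels-reduction.c: the test
initializes a[i] to i % 2, starts sum at 1, and aborts unless sum ends up as
5001. A minimal host-only sketch of that arithmetic follows, with no OpenACC
involved. The names NUM_ELEMS, reference_sum and check_reference_sum are
made up for illustration and do not appear in the patch.

```c
#include <assert.h>

/* Hypothetical name; the test itself uses a macro called n.  */
#define NUM_ELEMS 10000

/* Host-only reference for the value checked in kernels-reduction.c:
   sum starts at 1 and each element contributes i % 2.  */
unsigned int
reference_sum (void)
{
  unsigned int sum = 1;

  for (unsigned int i = 0; i < NUM_ELEMS; ++i)
    sum += i % 2;

  return sum;
}

/* Hypothetical helper: trips the assert if the reference disagrees with
   the constant the test compares against.  */
void
check_reference_sum (void)
{
  assert (reference_sum () == 5001);
}
```

Half of the 10000 indices are odd, so the loop adds 5000 to the initial 1,
which is where the 5001 in the test comes from.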

Thanks,
- Tom


[-- Attachment #2: 0005-Add-oacc-kernels-test-in-libgomp.patch --]
[-- Type: text/x-patch, Size: 16242 bytes --]

Add oacc kernels test in libgomp

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c: New test.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-2.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit-2.c:
	Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit.c:
	Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-update.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-parallel-loop-data-enter-exit.c:
	Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c: Same.

---
 .../libgomp.oacc-c-c++-common/kernels-loop-2.c     | 47 ++++++++++++++++++++++
 .../libgomp.oacc-c-c++-common/kernels-loop-3.c     | 34 ++++++++++++++++
 .../kernels-loop-and-seq-2.c                       | 36 +++++++++++++++++
 .../kernels-loop-and-seq-3.c                       | 37 +++++++++++++++++
 .../kernels-loop-and-seq-4.c                       | 36 +++++++++++++++++
 .../kernels-loop-and-seq-5.c                       | 37 +++++++++++++++++
 .../kernels-loop-and-seq-6.c                       | 36 +++++++++++++++++
 .../kernels-loop-and-seq.c                         | 37 +++++++++++++++++
 .../kernels-loop-collapse.c                        | 40 ++++++++++++++++++
 .../libgomp.oacc-c-c++-common/kernels-loop-g.c     |  5 +++
 .../kernels-loop-mod-not-zero.c                    | 41 +++++++++++++++++++
 .../libgomp.oacc-c-c++-common/kernels-loop-n.c     | 47 ++++++++++++++++++++++
 .../libgomp.oacc-c-c++-common/kernels-loop-nest.c  | 26 ++++++++++++
 .../libgomp.oacc-c-c++-common/kernels-loop.c       | 41 +++++++++++++++++++
 .../libgomp.oacc-c-c++-common/kernels-reduction.c  | 37 +++++++++++++++++
 15 files changed, 537 insertions(+)

diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c
new file mode 100644
index 0000000..13e57bd
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c
@@ -0,0 +1,47 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+#pragma acc kernels copyout (a[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      a[i] = i * 2;
+  }
+
+#pragma acc kernels copyout (b[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      b[i] = i * 4;
+  }
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c
new file mode 100644
index 0000000..f61a74a
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c
@@ -0,0 +1,34 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int i;
+
+  unsigned int *__restrict c;
+
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    c[i] = i * 2;
+
+#pragma acc kernels copy (c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = c[ii] + ii + 1;
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != i * 2 + i + 1)
+      abort ();
+
+  free (c);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
new file mode 100644
index 0000000..2e4100f
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
@@ -0,0 +1,36 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N 32
+
+unsigned int
+foo (int n, unsigned int *a)
+{
+#pragma acc kernels copy (a[0:N])
+  {
+    a[0] = a[0] + 1;
+
+    for (int i = 0; i < n; i++)
+      a[i] = 1;
+  }
+
+  return a[0];
+}
+
+int
+main (void)
+{
+  unsigned int a[N];
+  unsigned res, i;
+
+  for (i = 0; i < N; ++i)
+    a[i] = i % 4;
+
+  res = foo (N, a);
+  if (res != 1)
+    abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
new file mode 100644
index 0000000..b3e736b
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
@@ -0,0 +1,37 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N 32
+
+unsigned int
+foo (int n, unsigned int *a)
+{
+
+#pragma acc kernels copy (a[0:N])
+  {
+    for (int i = 0; i < n; i++)
+      a[i] = 1;
+
+    a[0] = 2;
+  }
+
+  return a[0];
+}
+
+int
+main (void)
+{
+  unsigned int a[N];
+  unsigned res, i;
+
+  for (i = 0; i < N; ++i)
+    a[i] = i % 4;
+
+  res = foo (N, a);
+  if (res != 2)
+    abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
new file mode 100644
index 0000000..8b9affa
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
@@ -0,0 +1,36 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N 32
+
+unsigned int
+foo (int n, unsigned int *a)
+{
+#pragma acc kernels copy (a[0:N])
+  {
+    a[0] = 2;
+
+    for (int i = 0; i < n; i++)
+      a[i] = 1;
+  }
+
+  return a[0];
+}
+
+int
+main (void)
+{
+  unsigned int a[N];
+  unsigned res, i;
+
+  for (i = 0; i < N; ++i)
+    a[i] = i % 4;
+
+  res = foo (N, a);
+  if (res != 1)
+    abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
new file mode 100644
index 0000000..83d4e7f
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
@@ -0,0 +1,37 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N 32
+
+unsigned int
+foo (int n, unsigned int *a)
+{
+  int r;
+#pragma acc kernels copyout(r) copy (a[0:N])
+  {
+    r = a[0];
+
+    for (int i = 0; i < n; i++)
+      a[i] = 1;
+  }
+
+  return r;
+}
+
+int
+main (void)
+{
+  unsigned int a[N];
+  unsigned res, i;
+
+  for (i = 0; i < N; ++i)
+    a[i] = i % 4;
+
+  res = foo (N, a);
+  if (res != 0)
+    abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
new file mode 100644
index 0000000..01d5e5e
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
@@ -0,0 +1,36 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N 32
+
+unsigned int
+foo (int n, unsigned int *a)
+{
+#pragma acc kernels copy (a[0:N])
+  {
+    int r = a[0];
+
+    for (int i = 0; i < n; i++)
+      a[i] = 1 + r;
+  }
+
+  return a[0];
+}
+
+int
+main (void)
+{
+  unsigned int a[N];
+  unsigned res, i;
+
+  for (i = 0; i < N; ++i)
+    a[i] = i % 4;
+
+  res = foo (N, a);
+  if (res != 1)
+    abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
new file mode 100644
index 0000000..61d1283
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
@@ -0,0 +1,37 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N 32
+
+unsigned int
+foo (int n, unsigned int *a)
+{
+
+#pragma acc kernels copy (a[0:N])
+  {
+    for (int i = 0; i < n; i++)
+      a[i] = 1;
+
+    a[0] = a[0] + 1;
+  }
+
+  return a[0];
+}
+
+int
+main (void)
+{
+  unsigned int a[N];
+  unsigned res, i;
+
+  for (i = 0; i < N; ++i)
+    a[i] = i % 4;
+
+  res = foo (N, a);
+  if (res != 2)
+    abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
new file mode 100644
index 0000000..f7f04cb
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
@@ -0,0 +1,40 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N 100
+
+int a[N][N];
+
+void __attribute__((noinline, noclone))
+foo (int m, int n)
+{
+  int i, j;
+  #pragma acc kernels
+  {
+#pragma acc loop collapse(2)
+    for (i = 0; i < m; i++)
+      for (j = 0; j < n; j++)
+	a[i][j] = 1;
+  }
+}
+
+int
+main (void)
+{
+  int i, j;
+
+  for (i = 0; i < N; i++)
+    for (j = 0; j < N; j++)
+      a[i][j] = 0;
+
+  foo (N, N);
+
+  for (i = 0; i < N; i++)
+    for (j = 0; j < N; j++)
+      if (a[i][j] != 1)
+	abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c
new file mode 100644
index 0000000..96b6e4e
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c
@@ -0,0 +1,5 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-g" } */
+
+#include "kernels-loop.c"
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c
new file mode 100644
index 0000000..1433cb2
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c
@@ -0,0 +1,41 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N ((1024 * 512) + 1)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    a[i] = i * 2;
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c
new file mode 100644
index 0000000..fd0d5b1
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c
@@ -0,0 +1,47 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N ((1024 * 512) + 1)
+#define COUNTERTYPE unsigned int
+
+static int __attribute__((noinline,noclone))
+foo (COUNTERTYPE n)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (n * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (n * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (n * sizeof (unsigned int));
+
+  for (COUNTERTYPE i = 0; i < n; i++)
+    a[i] = i * 2;
+
+  for (COUNTERTYPE i = 0; i < n; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:n], b[0:n]) copyout (c[0:n])
+  {
+    for (COUNTERTYPE ii = 0; ii < n; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < n; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+int
+main (void)
+{
+  return foo (N);
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c
new file mode 100644
index 0000000..21d2599
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c
@@ -0,0 +1,26 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N 1000
+
+int
+main (void)
+{
+  int x[N][N];
+
+#pragma acc kernels copyout (x)
+  {
+    for (int ii = 0; ii < N; ii++)
+      for (int jj = 0; jj < N; jj++)
+	x[ii][jj] = ii + jj + 3;
+  }
+
+  for (int i = 0; i < N; i++)
+    for (int j = 0; j < N; j++)
+      if (x[i][j] != i + j + 3)
+	abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c
new file mode 100644
index 0000000..3762e5a
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c
@@ -0,0 +1,41 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    a[i] = i * 2;
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c
new file mode 100644
index 0000000..511e25f
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c
@@ -0,0 +1,37 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define n 10000
+
+unsigned int a[n];
+
+void  __attribute__((noinline,noclone))
+foo (void)
+{
+  int i;
+  unsigned int sum = 1;
+
+#pragma acc kernels copyin (a[0:n]) copy (sum)
+  {
+    for (i = 0; i < n; ++i)
+      sum += a[i];
+  }
+
+  if (sum != 5001)
+    abort ();
+}
+
+int
+main ()
+{
+  int i;
+
+  for (i = 0; i < n; ++i)
+    a[i] = i % 2;
+
+  foo ();
+
+  return 0;
+}


* [PING^2][PATCH, 3/16] Ignore reduction clause on kernels directive
  2015-11-24 12:25   ` [PING][PATCH, " Tom de Vries
@ 2016-01-18 14:24     ` Tom de Vries
  2016-01-18 14:26       ` Jakub Jelinek
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2016-01-18 14:24 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener, Thomas Schwinge

On 24/11/15 13:21, Tom de Vries wrote:
> On 09/11/15 16:50, Tom de Vries wrote:
>> On 09/11/15 16:35, Tom de Vries wrote:
>>> Hi,
>>>
>>> this patch series for stage1 trunk adds support to:
>>> - parallelize oacc kernels regions using parloops, and
>>> - map the loops onto the oacc gang dimension.
>>>
>>> The patch series contains these patches:
>>>
>>>       1    Insert new exit block only when needed in
>>>          transform_to_exit_first_loop_alt
>>>       2    Make create_parallel_loop return void
>>>       3    Ignore reduction clause on kernels directive
>>>       4    Implement -foffload-alias
>>>       5    Add in_oacc_kernels_region in struct loop
>>>       6    Add pass_oacc_kernels
>>>       7    Add pass_dominator_oacc_kernels
>>>       8    Add pass_ch_oacc_kernels
>>>       9    Add pass_parallelize_loops_oacc_kernels
>>>      10    Add pass_oacc_kernels pass group in passes.def
>>>      11    Update testcases after adding kernels pass group
>>>      12    Handle acc loop directive
>>>      13    Add c-c++-common/goacc/kernels-*.c
>>>      14    Add gfortran.dg/goacc/kernels-*.f95
>>>      15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>>      16    Add libgomp.oacc-fortran/kernels-*.f95
>>>
>>> The first 9 patches are more or less independent, but patches 10-16 are
>>> intended to be committed at the same time.
>>>
>>> Bootstrapped and reg-tested on x86_64.
>>>
>>> Build and reg-tested with nvidia accelerator, in combination with a
>>> patch that enables accelerator testing (which is submitted at
>>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>>
>>> I'll post the individual patches in reply to this message.
>>
>> As discussed here (
>> https://gcc.gnu.org/ml/gcc-patches/2015-11/msg00785.html ), the kernels
>> directive does not allow the reduction clause.  This patch fixes that.
>>
>

Ping^2.

Thanks,
- Tom



* Re: [PING^2][PATCH, 3/16] Ignore reduction clause on kernels directive
  2016-01-18 14:24     ` [PING^2][PATCH, " Tom de Vries
@ 2016-01-18 14:26       ` Jakub Jelinek
  0 siblings, 0 replies; 133+ messages in thread
From: Jakub Jelinek @ 2016-01-18 14:26 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Richard Biener, Thomas Schwinge

On Mon, Jan 18, 2016 at 03:24:21PM +0100, Tom de Vries wrote:
> >>As discussed here (
> >>https://gcc.gnu.org/ml/gcc-patches/2015-11/msg00785.html ), the kernels
> >>directive does not allow the reduction clause.  This patch fixes that.
> >>
> >
> 
> Ping^2.

Ok.

	Jakub


* [PING^2][PATCH, 12/16] Handle acc loop directive
  2015-11-24 12:30   ` [PING][PATCH, " Tom de Vries
@ 2016-01-18 14:27     ` Tom de Vries
  2016-01-26 12:38       ` [PING^3][PATCH, " Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2016-01-18 14:27 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

On 24/11/15 13:26, Tom de Vries wrote:
> On 09/11/15 21:06, Tom de Vries wrote:
>> On 09/11/15 16:35, Tom de Vries wrote:
>>> Hi,
>>>
>>> this patch series for stage1 trunk adds support to:
>>> - parallelize oacc kernels regions using parloops, and
>>> - map the loops onto the oacc gang dimension.
>>>
>>> The patch series contains these patches:
>>>
>>>       1    Insert new exit block only when needed in
>>>          transform_to_exit_first_loop_alt
>>>       2    Make create_parallel_loop return void
>>>       3    Ignore reduction clause on kernels directive
>>>       4    Implement -foffload-alias
>>>       5    Add in_oacc_kernels_region in struct loop
>>>       6    Add pass_oacc_kernels
>>>       7    Add pass_dominator_oacc_kernels
>>>       8    Add pass_ch_oacc_kernels
>>>       9    Add pass_parallelize_loops_oacc_kernels
>>>      10    Add pass_oacc_kernels pass group in passes.def
>>>      11    Update testcases after adding kernels pass group
>>>      12    Handle acc loop directive
>>>      13    Add c-c++-common/goacc/kernels-*.c
>>>      14    Add gfortran.dg/goacc/kernels-*.f95
>>>      15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>>      16    Add libgomp.oacc-fortran/kernels-*.f95
>>>
>>> The first 9 patches are more or less independent, but patches 10-16 are
>>> intended to be committed at the same time.
>>>
>>> Bootstrapped and reg-tested on x86_64.
>>>
>>> Build and reg-tested with nvidia accelerator, in combination with a
>>> patch that enables accelerator testing (which is submitted at
>>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>>
>>> I'll post the individual patches in reply to this message.
>>
>> this patch deals with loops in an oacc kernels region which are
>> annotated using "#pragma acc loop". It expands such a loop as a normal
>> loop, which has the effect of ignoring the "#pragma acc loop".
>>
>

Ping^2.

Thanks,
- Tom


* Re: [committed] Add oacc_kernels_p argument to pass_parallelize_loops
  2016-01-18 13:07           ` [committed] Add oacc_kernels_p argument to pass_parallelize_loops Tom de Vries
  2016-01-18 13:30             ` [committed] Add pass_parallelize_loops to pass_oacc_kernels Tom de Vries
@ 2016-01-20  8:54             ` Thomas Schwinge
  2016-01-20 10:31               ` Tom de Vries
  1 sibling, 1 reply; 133+ messages in thread
From: Thomas Schwinge @ 2016-01-20  8:54 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek, Richard Biener, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 1702 bytes --]

Hi!

On Mon, 18 Jan 2016 14:07:11 +0100, Tom de Vries <Tom_deVries@mentor.com> wrote:
> Add oacc_kernels_p argument to pass_parallelize_loops

> --- a/gcc/tree-parloops.c
> +++ b/gcc/tree-parloops.c

> @@ -2315,6 +2367,9 @@ gen_parallel_loop (struct loop *loop,

|   /* Ensure that the exit condition is the first statement in the loop.
|      The common case is that latch of the loop is empty (apart from the
|      increment) and immediately follows the loop exit test.  Attempt to move the
|      entry of the loop directly before the exit check and increase the number of
|      iterations of the loop by one.  */
|   if (try_transform_to_exit_first_loop_alt (loop, reduction_list, nit))
|     {
|       if (dump_file
| 	  && (dump_flags & TDF_DETAILS))
| 	fprintf (dump_file,
| 		 "alternative exit-first loop transform succeeded"
| 		 " for loop %d\n", loop->num);
|     }
|   else
|     {
> +      if (oacc_kernels_p)
> +	n_threads = 1;
> +
|       /* Fall back on the method that handles more cases, but duplicates the
| 	 loop body: move the exit condition of LOOP to the beginning of its
| 	 header, and duplicate the part of the last iteration that gets disabled
| 	 to the exit of the loop.  */
|       transform_to_exit_first_loop (loop, reduction_list, nit);
|     }

Just for my own education: this pessimization "n_threads = 1" for OpenACC
kernels is because the duplicated loop bodies generated by
transform_to_exit_first_loop are not appropriate for parallel OpenACC
offloading execution?  (Might add a source code comment here?)  Testing
on gomp-4_0-branch, there are no changes in the testsuite if I remove
this hunk.


Grüße
 Thomas



* Re: [committed] Add oacc_kernels_p argument to pass_parallelize_loops
  2016-01-20  8:54             ` [committed] Add oacc_kernels_p argument to pass_parallelize_loops Thomas Schwinge
@ 2016-01-20 10:31               ` Tom de Vries
  0 siblings, 0 replies; 133+ messages in thread
From: Tom de Vries @ 2016-01-20 10:31 UTC (permalink / raw)
  To: Thomas Schwinge
  Cc: gcc-patches, Jakub Jelinek, Richard Biener, Richard Biener

On 20/01/16 09:54, Thomas Schwinge wrote:
> Hi!
>
> On Mon, 18 Jan 2016 14:07:11 +0100, Tom de Vries <Tom_deVries@mentor.com> wrote:
>> Add oacc_kernels_p argument to pass_parallelize_loops
>
>> --- a/gcc/tree-parloops.c
>> +++ b/gcc/tree-parloops.c
>
>> @@ -2315,6 +2367,9 @@ gen_parallel_loop (struct loop *loop,
>
> |   /* Ensure that the exit condition is the first statement in the loop.
> |      The common case is that latch of the loop is empty (apart from the
> |      increment) and immediately follows the loop exit test.  Attempt to move the
> |      entry of the loop directly before the exit check and increase the number of
> |      iterations of the loop by one.  */
> |   if (try_transform_to_exit_first_loop_alt (loop, reduction_list, nit))
> |     {
> |       if (dump_file
> | 	  && (dump_flags & TDF_DETAILS))
> | 	fprintf (dump_file,
> | 		 "alternative exit-first loop transform succeeded"
> | 		 " for loop %d\n", loop->num);
> |     }
> |   else
> |     {
>> +      if (oacc_kernels_p)
>> +	n_threads = 1;
>> +
> |       /* Fall back on the method that handles more cases, but duplicates the
> | 	 loop body: move the exit condition of LOOP to the beginning of its
> | 	 header, and duplicate the part of the last iteration that gets disabled
> | 	 to the exit of the loop.  */
> |       transform_to_exit_first_loop (loop, reduction_list, nit);
> |     }
>
> Just for my own education: this pessimization "n_threads = 1" for OpenACC
> kernels is because the duplicated loop bodies generated by
> transform_to_exit_first_loop are not appropriate for parallel OpenACC
> offloading execution?

In the case of standard parloops, only the loop is executed in parallel, 
so the duplicated loop body is outside the parallel region.

In the case of oacc parloops, the duplicated body is included in the 
kernels region, and executed in parallel.

The duplicated body for the last iteration can be executed in parallel 
with the loop body in the loop for all the other iterations. We've done 
the dependency analysis for that.

But the duplicated loop body for the last iteration is now executed in 
parallel with itself as well. We've got code that deals with that by 
guarding the side-effects such that they're only executed for a single 
gang. But that code is atm only effective in oacc_entry_exit_ok, before 
transform_to_exit_first_loop_alt introduces the duplicated loop body.

> (Might add a source code comment here?)  Testing
> on gomp-4_0-branch, there are no changes in the testsuite if I remove
> this hunk.

If you want to see the effect of removing the 'n_threads = 1' hunk, make 
try_transform_to_exit_first_loop_alt always return false.

I expect a loop
   for (i = 0; i < N; ++i)
     a[i] = a[i] + 1;
would give incorrect results in a[N - 1].

Thanks,
- Tom


* [PING^3][PATCH, 12/16] Handle acc loop directive
  2016-01-18 14:27     ` [PING^2][PATCH, " Tom de Vries
@ 2016-01-26 12:38       ` Tom de Vries
  2016-01-26 12:50         ` Jakub Jelinek
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2016-01-26 12:38 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

On 18/01/16 15:27, Tom de Vries wrote:
> On 24/11/15 13:26, Tom de Vries wrote:
>> On 09/11/15 21:06, Tom de Vries wrote:
>>> On 09/11/15 16:35, Tom de Vries wrote:
>>>> Hi,
>>>>
>>>> this patch series for stage1 trunk adds support to:
>>>> - parallelize oacc kernels regions using parloops, and
>>>> - map the loops onto the oacc gang dimension.
>>>>
>>>> The patch series contains these patches:
>>>>
>>>>       1    Insert new exit block only when needed in
>>>>          transform_to_exit_first_loop_alt
>>>>       2    Make create_parallel_loop return void
>>>>       3    Ignore reduction clause on kernels directive
>>>>       4    Implement -foffload-alias
>>>>       5    Add in_oacc_kernels_region in struct loop
>>>>       6    Add pass_oacc_kernels
>>>>       7    Add pass_dominator_oacc_kernels
>>>>       8    Add pass_ch_oacc_kernels
>>>>       9    Add pass_parallelize_loops_oacc_kernels
>>>>      10    Add pass_oacc_kernels pass group in passes.def
>>>>      11    Update testcases after adding kernels pass group
>>>>      12    Handle acc loop directive
>>>>      13    Add c-c++-common/goacc/kernels-*.c
>>>>      14    Add gfortran.dg/goacc/kernels-*.f95
>>>>      15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>>>      16    Add libgomp.oacc-fortran/kernels-*.f95
>>>>
>>>> The first 9 patches are more or less independent, but patches 10-16 are
>>>> intended to be committed at the same time.
>>>>
>>>> Bootstrapped and reg-tested on x86_64.
>>>>
>>>> Build and reg-tested with nvidia accelerator, in combination with a
>>>> patch that enables accelerator testing (which is submitted at
>>>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>>>
>>>> I'll post the individual patches in reply to this message.
>>>
>>> this patch deals with loops in an oacc kernels region which are
>>> annotated using "#pragma acc loop". It expands such a loop as a normal
>>> loop, which has the effect of ignoring the "#pragma acc loop".
>>>
>>
>

Ping^3. ( https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01089.html )

Thanks,
- Tom



* Re: [PING^3][PATCH, 12/16] Handle acc loop directive
  2016-01-26 12:38       ` [PING^3][PATCH, " Tom de Vries
@ 2016-01-26 12:50         ` Jakub Jelinek
  2016-02-12 11:11           ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Jakub Jelinek @ 2016-01-26 12:50 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Richard Biener

On Tue, Jan 26, 2016 at 01:38:39PM +0100, Tom de Vries wrote:
> Ping^3. ( https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01089.html )

First of all, I wonder if it wouldn't be far easier to handle these during
gimplification rather than during omp expansion or during parsing.  Inside
kernels, do you need to honor any clauses on the acc loop, like
privatization etc., or can you just ignore it altogether (after parsing them
to ensure they are valid)?
Handling this in expand_omp_for_generic is not really nice, because it will
make an already very complicated function even more complex.
   gomp_ordered *ord_stmt;
+
+  /* True if this is nested inside an OpenACC kernels construct.  */
+  bool inside_kernels_p;
 };

is bad placement: there are other bool/unsigned char fields earlier, and the
smaller fields should be adjacent to reduce padding in the struct.

	Jakub


* Use plain -fopenacc to enable OpenACC kernels processing (was: [PATCH, 6/16] Add pass_oacc_kernels)
  2015-11-09 17:39 ` [PATCH, 6/16] Add pass_oacc_kernels Tom de Vries
  2015-11-11 10:59   ` Richard Biener
@ 2016-02-05 12:06   ` Thomas Schwinge
  2016-02-10 14:40     ` Use plain -fopenacc to enable OpenACC kernels processing Thomas Schwinge
  1 sibling, 1 reply; 133+ messages in thread
From: Thomas Schwinge @ 2016-02-05 12:06 UTC (permalink / raw)
  To: Jakub Jelinek, gcc-patches, Nathan Sidwell; +Cc: Tom de Vries, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 33560 bytes --]

Hi!

On Mon, 9 Nov 2015 18:39:19 +0100, Tom de Vries <Tom_deVries@mentor.com> wrote:
> On 09/11/15 16:35, Tom de Vries wrote:
> > this patch series for stage1 trunk adds support to:
> > - parallelize oacc kernels regions using parloops, and
> > - map the loops onto the oacc gang dimension.

> Atm, the parallelization behaviour for the kernels region is controlled 
> by flag_tree_parallelize_loops, which is also used to control generic 
> auto-parallelization by autopar using omp. That is not ideal, and we may 
> want a separate flag (or param) to control the behaviour for oacc 
> kernels, f.i. -foacc-kernels-gang-parallelize=<n>. I'm open to suggestions.

I suggest using plain -fopenacc to enable OpenACC kernels processing
(which just makes sense, I hope) ;-) and have later processing stages
determine the actual parametrization (currently: number of gangs) (that
is, Nathan's recent "Default compute dimensions" patches).

The code changes are simple enough; OK for trunk?  (This patch depends on
my 'Un-parallelized OpenACC kernels constructs with nvptx offloading:
"avoid offloading"' pending review,
<http://news.gmane.org/find-root.php?message_id=%3C87zivg8rcy.fsf%40hertz.schwinge.homeip.net%3E>.)

Originally, I wanted to use:

    OMP_CLAUSE_NUM_GANGS_EXPR (clause) = build_int_cst (integer_type_node, n_threads == 0 ? -1 : n_threads);

... to store -1 "have the compiler decide" (instead of now 0 "have the
run-time decide", which might prevent some code optimizations, as I
understand it) for the n_threads == 0 case, but it seems that for an
offloaded OpenACC kernels region, gcc/omp-low.c:oacc_validate_dims is
called with the parameter "used" set to 0 instead of "gang", and then the
"Default anything left to 1 or a partitioned default" logic will default
dims["gang"] to oacc_min_dims["gang"] (that is, 1) instead of the
oacc_default_dims["gang"] (that is, 32).  Nathan, does that smell like a
bug (and could you look into that)?

diff --git gcc/tree-parloops.c gcc/tree-parloops.c
index 139e38c..e498e5b 100644
--- gcc/tree-parloops.c
+++ gcc/tree-parloops.c
@@ -2016,7 +2016,8 @@ transform_to_exit_first_loop (struct loop *loop,
 /* Create the parallel constructs for LOOP as described in gen_parallel_loop.
    LOOP_FN and DATA are the arguments of GIMPLE_OMP_PARALLEL.
    NEW_DATA is the variable that should be initialized from the argument
-   of LOOP_FN.  N_THREADS is the requested number of threads.  */
+   of LOOP_FN.  N_THREADS is the requested number of threads, which can be 0 if
+   that number is to be determined later.  */
 
 static void
 create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
@@ -2049,6 +2050,7 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
       basic_block paral_bb = single_pred (bb);
       gsi = gsi_last_bb (paral_bb);
 
+      gcc_checking_assert (n_threads != 0);
       t = build_omp_clause (loc, OMP_CLAUSE_NUM_THREADS);
       OMP_CLAUSE_NUM_THREADS_EXPR (t)
 	= build_int_cst (integer_type_node, n_threads);
@@ -2221,7 +2223,8 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
 }
 
 /* Generates code to execute the iterations of LOOP in N_THREADS
-   threads in parallel.
+   threads in parallel, which can be 0 if that number is to be determined
+   later.
 
    NITER describes number of iterations of LOOP.
    REDUCTION_LIST describes the reductions existent in the LOOP.  */
@@ -2318,6 +2321,7 @@ gen_parallel_loop (struct loop *loop,
       else
 	m_p_thread=MIN_PER_THREAD;
 
+      gcc_checking_assert (n_threads != 0);
       many_iterations_cond =
 	fold_build2 (GE_EXPR, boolean_type_node,
 		     nit, build_int_cst (type, m_p_thread * n_threads));
@@ -3177,7 +3181,7 @@ oacc_entry_exit_ok (struct loop *loop,
 static bool
 parallelize_loops (bool oacc_kernels_p)
 {
-  unsigned n_threads = flag_tree_parallelize_loops;
+  unsigned n_threads;
   bool changed = false;
   struct loop *loop;
   struct loop *skip_loop = NULL;
@@ -3199,6 +3203,13 @@ parallelize_loops (bool oacc_kernels_p)
   if (cfun->has_nonlocal_label)
     return false;
 
+  /* For OpenACC kernels, n_threads will be determined later; otherwise, it's
+     the argument to -ftree-parallelize-loops.  */
+  if (oacc_kernels_p)
+    n_threads = 0;
+  else
+    n_threads = flag_tree_parallelize_loops;
+
   gcc_obstack_init (&parloop_obstack);
   reduction_info_table_type reduction_list (10);
 
@@ -3361,7 +3372,13 @@ public:
   {}
 
   /* opt_pass methods: */
-  virtual bool gate (function *) { return flag_tree_parallelize_loops > 1; }
+  virtual bool gate (function *)
+  {
+    if (oacc_kernels_p)
+      return flag_openacc;
+    else
+      return flag_tree_parallelize_loops > 1;
+  }
   virtual unsigned int execute (function *);
   opt_pass * clone () { return new pass_parallelize_loops (m_ctxt); }
   void set_pass_param (unsigned int n, bool param)
diff --git gcc/tree-ssa-loop.c gcc/tree-ssa-loop.c
index bdbade5..4c39fbc 100644
--- gcc/tree-ssa-loop.c
+++ gcc/tree-ssa-loop.c
@@ -148,7 +148,7 @@ make_pass_tree_loop (gcc::context *ctxt)
 static bool
 gate_oacc_kernels (function *fn)
 {
-  if (flag_tree_parallelize_loops <= 1)
+  if (!flag_openacc)
     return false;
 
   tree oacc_function_attr = get_oacc_fn_attrib (fn->decl);
@@ -230,10 +230,9 @@ public:
   virtual bool gate (function *)
   {
     return (optimize
-	    /* Don't bother doing anything if the program has errors.  */
-	    && !seen_error ()
 	    && flag_openacc
-	    && flag_tree_parallelize_loops > 1);
+	    /* Don't bother doing anything if the program has errors.  */
+	    && !seen_error ());
   }
 
 }; // class pass_ipa_oacc
diff --git gcc/config/nvptx/nvptx.c gcc/config/nvptx/nvptx.c
index fe28154..2fd3d52 100644
--- gcc/config/nvptx/nvptx.c
+++ gcc/config/nvptx/nvptx.c
@@ -4140,7 +4140,7 @@ nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
 	  bool avoid_offloading_p = true;
 	  for (unsigned ix = 0; ix != GOMP_DIM_MAX; ix++)
 	    {
-	      if (dims[ix] > 1)
+	      if (dims[ix] > 1 || dims[ix] == 0)
 		{
 		  avoid_offloading_p = false;
 		  break;
diff --git libgomp/oacc-parallel.c libgomp/oacc-parallel.c
index bc24651..f795bf7 100644
--- libgomp/oacc-parallel.c
+++ libgomp/oacc-parallel.c
@@ -103,6 +103,10 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
       return;
     }
 
+  /* Default: let the runtime choose.  */
+  for (i = 0; i != GOMP_DIM_MAX; i++)
+    dims[i] = 0;
+
   va_start (ap, kinds);
   /* TODO: This will need amending when device_type is implemented.  */
   while ((tag = va_arg (ap, unsigned)) != 0)
diff --git libgomp/plugin/plugin-nvptx.c libgomp/plugin/plugin-nvptx.c
index 7ec1810..3f1bb6d 100644
--- libgomp/plugin/plugin-nvptx.c
+++ libgomp/plugin/plugin-nvptx.c
@@ -894,9 +894,21 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
   /* Initialize the launch dimensions.  Typically this is constant,
      provided by the device compiler, but we must permit runtime
      values.  */
-  for (i = 0; i != 3; i++)
-    if (targ_fn->launch->dim[i])
-      dims[i] = targ_fn->launch->dim[i];
+  int seen_zero = 0;
+  for (i = 0; i != GOMP_DIM_MAX; i++)
+    {
+      if (targ_fn->launch->dim[i])
+       dims[i] = targ_fn->launch->dim[i];
+      if (!dims[i])
+       seen_zero = 1;
+    }
+
+  if (seen_zero)
+    {
+      for (i = 0; i != GOMP_DIM_MAX; i++)
+       if (!dims[i])
+         dims[i] = /* TODO */ 32;
+    }
 
   /* This reserves a chunk of a pre-allocated page of memory mapped on both
      the host and the device. HP is a host pointer to the new chunk, and DP is

The TODO in libgomp/plugin/plugin-nvptx.c:nvptx_exec will be resolved by
Nathan's "Default compute dimensions (runtime)",
<http://news.gmane.org/find-root.php?message_id=%3C56B21D23.5060209%40acm.org%3E>.

The remainder is just "mechanical" updates to the test cases:

diff --git gcc/testsuite/c-c++-common/goacc/kernels-counter-vars-function-scope.c gcc/testsuite/c-c++-common/goacc/kernels-counter-vars-function-scope.c
index e8b5357..17f240e 100644
--- gcc/testsuite/c-c++-common/goacc/kernels-counter-vars-function-scope.c
+++ gcc/testsuite/c-c++-common/goacc/kernels-counter-vars-function-scope.c
@@ -1,5 +1,4 @@
 /* { dg-additional-options "-O2" } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-fdump-tree-parloops1-all" } */
 /* { dg-additional-options "-fdump-tree-optimized" } */
 
@@ -51,4 +50,4 @@ main (void)
 /* Check that the loop has been split off into a function.  */
 /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
 
-/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
diff --git gcc/testsuite/c-c++-common/goacc/kernels-double-reduction-n.c gcc/testsuite/c-c++-common/goacc/kernels-double-reduction-n.c
index c39d674..750f576 100644
--- gcc/testsuite/c-c++-common/goacc/kernels-double-reduction-n.c
+++ gcc/testsuite/c-c++-common/goacc/kernels-double-reduction-n.c
@@ -1,5 +1,4 @@
 /* { dg-additional-options "-O2" } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-fdump-tree-parloops1-all" } */
 /* { dg-additional-options "-fdump-tree-optimized" } */
 
@@ -34,4 +33,4 @@ foo (unsigned int n)
 /* Check that the loop has been split off into a function.  */
 /* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
 
-/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
diff --git gcc/testsuite/c-c++-common/goacc/kernels-double-reduction.c gcc/testsuite/c-c++-common/goacc/kernels-double-reduction.c
index 3501d0d..df60d6a 100644
--- gcc/testsuite/c-c++-common/goacc/kernels-double-reduction.c
+++ gcc/testsuite/c-c++-common/goacc/kernels-double-reduction.c
@@ -1,5 +1,4 @@
 /* { dg-additional-options "-O2" } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-fdump-tree-parloops1-all" } */
 /* { dg-additional-options "-fdump-tree-optimized" } */
 
@@ -34,4 +33,4 @@ foo (void)
 /* Check that the loop has been split off into a function.  */
 /* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
 
-/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c
index f97584d..913d91f 100644
--- gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c
+++ gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c
@@ -1,5 +1,4 @@
 /* { dg-additional-options "-O2" } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-fdump-tree-parloops1-all" } */
 /* { dg-additional-options "-fdump-tree-optimized" } */
 
@@ -67,4 +66,4 @@ main (void)
 /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.1" 1 "optimized" } } */
 /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.2" 1 "optimized" } } */
 
-/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 3 "parloops1" } } */
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 3 "parloops1" } } */
diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c
index 530d62a..1822d2a 100644
--- gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c
+++ gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c
@@ -1,5 +1,4 @@
 /* { dg-additional-options "-O2" } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-fdump-tree-parloops1-all" } */
 /* { dg-additional-options "-fdump-tree-optimized" } */
 
@@ -45,5 +44,4 @@ main (void)
 /* Check that the loop has been split off into a function.  */
 /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
 
-/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
-
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c
index 4f1c2c5..e946319 100644
--- gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c
+++ gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c
@@ -1,6 +1,5 @@
 /* { dg-additional-options "-O2" } */
 /* { dg-additional-options "-g" } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-fdump-tree-parloops1-all" } */
 /* { dg-additional-options "-fdump-tree-optimized" } */
 
@@ -13,5 +12,4 @@
 /* Check that the loop has been split off into a function.  */
 /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
 
-/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
-
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop-mod-not-zero.c gcc/testsuite/c-c++-common/goacc/kernels-loop-mod-not-zero.c
index 151db51..9b63b45 100644
--- gcc/testsuite/c-c++-common/goacc/kernels-loop-mod-not-zero.c
+++ gcc/testsuite/c-c++-common/goacc/kernels-loop-mod-not-zero.c
@@ -1,5 +1,4 @@
 /* { dg-additional-options "-O2" } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-fdump-tree-parloops1-all" } */
 /* { dg-additional-options "-fdump-tree-optimized" } */
 
@@ -49,4 +48,4 @@ main (void)
 /* Check that the loop has been split off into a function.  */
 /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
 
-/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c
index bee5f5a..279f797 100644
--- gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c
+++ gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c
@@ -1,5 +1,4 @@
 /* { dg-additional-options "-O2" } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-fdump-tree-parloops1-all" } */
 /* { dg-additional-options "-fdump-tree-optimized" } */
 
@@ -52,5 +51,4 @@ foo (COUNTERTYPE n)
 /* Check that the loop has been split off into a function.  */
 /* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
 
-/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
-
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop-nest.c gcc/testsuite/c-c++-common/goacc/kernels-loop-nest.c
index ea0e342..db1071f 100644
--- gcc/testsuite/c-c++-common/goacc/kernels-loop-nest.c
+++ gcc/testsuite/c-c++-common/goacc/kernels-loop-nest.c
@@ -1,5 +1,4 @@
 /* { dg-additional-options "-O2" } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-fdump-tree-parloops1-all" } */
 /* { dg-additional-options "-fdump-tree-optimized" } */
 
@@ -36,4 +35,4 @@ main (void)
 /* Check that the loop has been split off into a function.  */
 /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
 
-/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop.c gcc/testsuite/c-c++-common/goacc/kernels-loop.c
index ab5dfb9..abf7a3c 100644
--- gcc/testsuite/c-c++-common/goacc/kernels-loop.c
+++ gcc/testsuite/c-c++-common/goacc/kernels-loop.c
@@ -1,5 +1,4 @@
 /* { dg-additional-options "-O2" } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-fdump-tree-parloops1-all" } */
 /* { dg-additional-options "-fdump-tree-optimized" } */
 
@@ -52,5 +51,4 @@ main (void)
 /* Check that the loop has been split off into a function.  */
 /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
 
-/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
-
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
diff --git gcc/testsuite/c-c++-common/goacc/kernels-one-counter-var.c gcc/testsuite/c-c++-common/goacc/kernels-one-counter-var.c
index b16a8cd..95f4817 100644
--- gcc/testsuite/c-c++-common/goacc/kernels-one-counter-var.c
+++ gcc/testsuite/c-c++-common/goacc/kernels-one-counter-var.c
@@ -1,5 +1,4 @@
 /* { dg-additional-options "-O2" } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-fdump-tree-parloops1-all" } */
 /* { dg-additional-options "-fdump-tree-optimized" } */
 
@@ -50,5 +49,4 @@ main (void)
 /* Check that the loop has been split off into a function.  */
 /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
 
-/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
-
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
diff --git gcc/testsuite/c-c++-common/goacc/kernels-reduction.c gcc/testsuite/c-c++-common/goacc/kernels-reduction.c
index 61c5df3..6f5a418 100644
--- gcc/testsuite/c-c++-common/goacc/kernels-reduction.c
+++ gcc/testsuite/c-c++-common/goacc/kernels-reduction.c
@@ -1,5 +1,4 @@
 /* { dg-additional-options "-O2" } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-fdump-tree-parloops1-all" } */
 /* { dg-additional-options "-fdump-tree-optimized" } */
 
@@ -32,5 +31,4 @@ foo (void)
 /* Check that the loop has been split off into a function.  */
 /* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
 
-/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
-
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop-inner.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop-inner.f95
index 4db3a50..3334741 100644
--- gcc/testsuite/gfortran.dg/goacc/kernels-loop-inner.f95
+++ gcc/testsuite/gfortran.dg/goacc/kernels-loop-inner.f95
@@ -1,5 +1,4 @@
 ! { dg-additional-options "-O2" }
-! { dg-additional-options "-ftree-parallelize-loops=32" }
 
 program main
    implicit none
diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loops-adjacent.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loops-adjacent.f95
index fef3d10..fb92da8 100644
--- gcc/testsuite/gfortran.dg/goacc/kernels-loops-adjacent.f95
+++ gcc/testsuite/gfortran.dg/goacc/kernels-loops-adjacent.f95
@@ -1,5 +1,4 @@
 ! { dg-additional-options "-O2" }
-! { dg-additional-options "-ftree-parallelize-loops=10" }
 
 program main
    implicit none
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c
index 08745fc..366b4f5 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c
@@ -1,6 +1,5 @@
 /* Test that the compiler decides to "avoid offloading".  */
 
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* The ACC_DEVICE_TYPE environment variable gets set in the testing
    framework, and that overrides the "avoid offloading" flag at run time.
    { dg-xfail-run-if "TODO" { openacc_nvidia_accel_selected } } */
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c
index 724228a..a63ec97 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c
@@ -1,8 +1,6 @@
 /* Test that a user can override the compiler's "avoid offloading"
    decision at run time.  */
 
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <openacc.h>
 
 int main(void)
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c
index 2fb5196..da01d02 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c
@@ -1,7 +1,6 @@
 /* Test that a user can override the compiler's "avoid offloading"
    decision at compile time.  */
 
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* Override the compiler's "avoid offloading" decision.
    { dg-additional-options "-foffload-force" } */
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
index 87ca378..39899ab 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
@@ -1,7 +1,5 @@
 /* This test exercises combined directives.  */
 
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 int
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c
index 8f0144c..31da8b1 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c
@@ -1,5 +1,3 @@
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include  <openacc.h>
 
 int test_parallel ()
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c
index 3ef6f9b..51745ba 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c
@@ -1,5 +1,4 @@
 /* { dg-do run { target openacc_nvidia_accel_selected } } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-lcuda -lcublas -lcudart" } */
 
 #include <stdlib.h>
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c
index 614ad33..588e864 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c
@@ -1,5 +1,3 @@
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 int i;
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c
index 13e57bd..c7592d6 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c
@@ -1,6 +1,3 @@
-/* { dg-do run } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define N (1024 * 512)
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c
index f61a74a..31114ac 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c
@@ -1,6 +1,3 @@
-/* { dg-do run } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define N (1024 * 512)
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
index 5cdc200..3ffdfe2 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
@@ -1,5 +1,3 @@
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define N 32
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
index 2e4d4d2..a554d66 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
@@ -1,5 +1,3 @@
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define N 32
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
index 5bf00db..f0144b4 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
@@ -1,5 +1,3 @@
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define N 32
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
index d39b667..4719edd 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
@@ -1,5 +1,3 @@
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define N 32
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
index bb2e85b..ca4f638 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
@@ -1,5 +1,3 @@
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define N 32
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
index e513827..d2fff38 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
@@ -1,5 +1,3 @@
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define N 32
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
index c4791a4..0df4b3f 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
@@ -1,5 +1,3 @@
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define N 100
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c
index 96b6e4e..88258be 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c
@@ -1,5 +1,3 @@
-/* { dg-do run } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-g" } */
 
 #include "kernels-loop.c"
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c
index 1433cb2..147ebb5 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c
@@ -1,6 +1,3 @@
-/* { dg-do run } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define N ((1024 * 512) + 1)
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c
index fd0d5b1..9a3eaca 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c
@@ -1,6 +1,3 @@
-/* { dg-do run } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define N ((1024 * 512) + 1)
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c
index 21d2599..28c725a 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c
@@ -1,6 +1,3 @@
-/* { dg-do run } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define N 1000
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c
index 3762e5a..355123c 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c
@@ -1,6 +1,3 @@
-/* { dg-do run } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define N (1024 * 512)
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c
index 511e25f..8647a94 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c
@@ -1,6 +1,3 @@
-/* { dg-do run } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define n 10000
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
index 94a5ae2..83cddb5 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
@@ -1,5 +1,3 @@
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 int
diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f
index 5f18b94..ca5cd01 100644
--- libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f
+++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f
@@ -2,7 +2,6 @@
 
 ! { dg-do run }
 ! { dg-additional-options "-cpp" }
-! { dg-additional-options "-ftree-parallelize-loops=32" }
 ! The "avoid offloading" warning is only triggered for -O2 and higher.
 ! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }
 ! The ACC_DEVICE_TYPE environment variable gets set in the testing
diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f
index 51801ad..6200b37 100644
--- libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f
+++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f
@@ -3,7 +3,6 @@
 
 ! { dg-do run }
 ! { dg-additional-options "-cpp" }
-! { dg-additional-options "-ftree-parallelize-loops=32" }
 ! The "avoid offloading" warning is only triggered for -O2 and higher.
 ! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }
 
diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f
index bea6ab8..865d09f 100644
--- libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f
+++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f
@@ -3,7 +3,6 @@
 
 ! { dg-do run }
 ! { dg-additional-options "-cpp" }
-! { dg-additional-options "-ftree-parallelize-loops=32" }
 ! Override the compiler's "avoid offloading" decision.
 ! { dg-additional-options "-foffload-force" }
 
diff --git libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90 libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90
index 4b52579..12ff36c 100644
--- libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90
@@ -1,7 +1,6 @@
 ! This test exercises combined directives.
 
 ! { dg-do run }
-! { dg-additional-options "-ftree-parallelize-loops=32" }
 ! The "avoid offloading" warning is only triggered for -O2 and higher.
 ! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }
 
diff --git libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90 libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
index b9298c7..0643e89 100644
--- libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
@@ -2,7 +2,6 @@
 ! offloaded regions are properly mapped using present_or_copy.
 
 ! { dg-do run }
-! { dg-additional-options "-ftree-parallelize-loops=32" }
 ! The "avoid offloading" warning is only triggered for -O2 and higher.
 ! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }
 


Regards
 Thomas

* Re: Use plain -fopenacc to enable OpenACC kernels processing
  2016-02-05 12:06   ` Use plain -fopenacc to enable OpenACC kernels processing (was: [PATCH, 6/16] Add pass_oacc_kernels) Thomas Schwinge
@ 2016-02-10 14:40     ` Thomas Schwinge
  2016-02-15 16:54       ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Thomas Schwinge @ 2016-02-10 14:40 UTC (permalink / raw)
  To: Jakub Jelinek, gcc-patches, Bernd Schmidt
  Cc: Tom de Vries, Richard Biener, Nathan Sidwell

Hi!

Will this patch be acceptable for GCC trunk in the current development
stage?  In its current incarnation, this patch depends on my
'Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid
offloading"' patch,
<http://news.gmane.org/find-root.php?message_id=%3C87zivg8rcy.fsf%40hertz.schwinge.homeip.net%3E>,
which Bernd suggested "has to be considered after gcc-6".  So, I'll have
to re-work this patch here, hence I'm first checking if it generally
meets approval?

On Fri, 5 Feb 2016 13:06:17 +0100, I wrote:
> On Mon, 9 Nov 2015 18:39:19 +0100, Tom de Vries <Tom_deVries@mentor.com> wrote:
> > On 09/11/15 16:35, Tom de Vries wrote:
> > > this patch series for stage1 trunk adds support to:
> > > - parallelize oacc kernels regions using parloops, and
> > > - map the loops onto the oacc gang dimension.
> 
> > Atm, the parallelization behaviour for the kernels region is controlled 
> > by flag_tree_parallelize_loops, which is also used to control generic 
> > auto-parallelization by autopar using omp. That is not ideal, and we may 
> > want a separate flag (or param) to control the behaviour for oacc 
> > kernels, f.i. -foacc-kernels-gang-parallelize=<n>. I'm open to suggestions.
> 
> I suggest using plain -fopenacc to enable OpenACC kernels processing
> (which just makes sense, I hope) ;-) and have later processing stages
> determine the actual parametrization (currently: number of gangs) (that
> is, Nathan's recent "Default compute dimensions" patches).
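
A minimal sketch of the proposal quoted above, assuming a hypothetical
"struct options" in place of GCC's global option variables: the OpenACC
kernels instance of the pass is gated on -fopenacc alone and passes on a
thread count of 0 ("to be determined later"), while generic
auto-parallelization keeps using -ftree-parallelize-loops=<n>.  The helper
names are illustrative only and are not part of the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for GCC's option variables, which in the compiler
   are globals set from -fopenacc and -ftree-parallelize-loops=<n>.  */
struct options
{
  bool flag_openacc;
  unsigned flag_tree_parallelize_loops;
};

/* Decide whether the parallelizer should run at all.  For OpenACC kernels
   only -fopenacc matters; otherwise more than one thread must have been
   requested.  */
static bool
parloops_gate (const struct options *opts, bool oacc_kernels_p)
{
  if (oacc_kernels_p)
    return opts->flag_openacc;
  return opts->flag_tree_parallelize_loops > 1;
}

/* Pick the thread count handed to the parallelizer.  0 means that the
   number is left for a later processing stage to determine.  */
static unsigned
parloops_n_threads (const struct options *opts, bool oacc_kernels_p)
{
  return oacc_kernels_p ? 0 : opts->flag_tree_parallelize_loops;
}
```

With this split, a plain "-fopenacc -O2" compile enables kernels processing
without any -ftree-parallelize-loops option.
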
> 
> The code changes are simple enough; OK for trunk?  (This patch depends on
> my 'Un-parallelized OpenACC kernels constructs with nvptx offloading:
> "avoid offloading"' pending review,
> <http://news.gmane.org/find-root.php?message_id=%3C87zivg8rcy.fsf%40hertz.schwinge.homeip.net%3E>.)
> 
> Originally, I wanted to use:
> 
>     OMP_CLAUSE_NUM_GANGS_EXPR (clause) = build_int_cst (integer_type_node, n_threads == 0 ? -1 : n_threads);
> 
> ... to store -1 "have the compiler decide" (instead of now 0 "have the
> run-time decide", which might prevent some code optimizations, as I
> understand it) for the n_threads == 0 case, but it seems that for an
> offloaded OpenACC kernels region, gcc/omp-low.c:oacc_validate_dims is
> called with the parameter "used" set to 0 instead of "gang", and then the
> "Default anything left to 1 or a partitioned default" logic will default
> dims["gang"] to oacc_min_dims["gang"] (that is, 1) instead of the
> oacc_default_dims["gang"] (that is, 32).  Nathan, does that smell like a
> bug (and could you look into that)?
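
The defaulting behaviour described above can be sketched as follows.  This
is an illustration of the description only, not the actual code of
oacc_validate_dims: the table names, the bitmask encoding of "used", and the
worker/vector values are assumptions; only the gang minimum of 1 and the
gang default of 32 are taken from the text.

```c
/* Hypothetical dimension indices, gang first.  */
enum { DIM_GANG, DIM_WORKER, DIM_VECTOR, DIM_MAX };

/* Illustrative tables.  Only the gang entries (1 and 32) come from the
   description; the other entries are placeholders.  */
static const int min_dims[DIM_MAX] = { 1, 1, 1 };
static const int default_dims[DIM_MAX] = { 32, 1, 1 };

/* Fill in every dimension still marked -1 ("have the compiler decide").
   USED is a bitmask of the dimensions the region partitions over: a
   dimension that is used gets the partitioned default, any other one
   falls back to the minimum.  */
static void
fill_default_dims (int dims[DIM_MAX], unsigned used)
{
  for (int ix = 0; ix != DIM_MAX; ix++)
    if (dims[ix] < 0)
      dims[ix] = (used & (1u << ix)) ? default_dims[ix] : min_dims[ix];
}
```

Under these assumptions, calling the helper with used == 0 yields a gang
dimension of 1, whereas setting the gang bit in "used" yields 32, which is
the difference the question to Nathan is about.
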
> 
> diff --git gcc/tree-parloops.c gcc/tree-parloops.c
> index 139e38c..e498e5b 100644
> --- gcc/tree-parloops.c
> +++ gcc/tree-parloops.c
> @@ -2016,7 +2016,8 @@ transform_to_exit_first_loop (struct loop *loop,
>  /* Create the parallel constructs for LOOP as described in gen_parallel_loop.
>     LOOP_FN and DATA are the arguments of GIMPLE_OMP_PARALLEL.
>     NEW_DATA is the variable that should be initialized from the argument
> -   of LOOP_FN.  N_THREADS is the requested number of threads.  */
> +   of LOOP_FN.  N_THREADS is the requested number of threads, which can be 0 if
> +   that number is to be determined later.  */
>  
>  static void
>  create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
> @@ -2049,6 +2050,7 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
>        basic_block paral_bb = single_pred (bb);
>        gsi = gsi_last_bb (paral_bb);
>  
> +      gcc_checking_assert (n_threads != 0);
>        t = build_omp_clause (loc, OMP_CLAUSE_NUM_THREADS);
>        OMP_CLAUSE_NUM_THREADS_EXPR (t)
>  	= build_int_cst (integer_type_node, n_threads);
> @@ -2221,7 +2223,8 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
>  }
>  
>  /* Generates code to execute the iterations of LOOP in N_THREADS
> -   threads in parallel.
> +   threads in parallel, which can be 0 if that number is to be determined
> +   later.
>  
>     NITER describes number of iterations of LOOP.
>     REDUCTION_LIST describes the reductions existent in the LOOP.  */
> @@ -2318,6 +2321,7 @@ gen_parallel_loop (struct loop *loop,
>        else
>  	m_p_thread=MIN_PER_THREAD;
>  
> +      gcc_checking_assert (n_threads != 0);
>        many_iterations_cond =
>  	fold_build2 (GE_EXPR, boolean_type_node,
>  		     nit, build_int_cst (type, m_p_thread * n_threads));
> @@ -3177,7 +3181,7 @@ oacc_entry_exit_ok (struct loop *loop,
>  static bool
>  parallelize_loops (bool oacc_kernels_p)
>  {
> -  unsigned n_threads = flag_tree_parallelize_loops;
> +  unsigned n_threads;
>    bool changed = false;
>    struct loop *loop;
>    struct loop *skip_loop = NULL;
> @@ -3199,6 +3203,13 @@ parallelize_loops (bool oacc_kernels_p)
>    if (cfun->has_nonlocal_label)
>      return false;
>  
> +  /* For OpenACC kernels, n_threads will be determined later; otherwise, it's
> +     the argument to -ftree-parallelize-loops.  */
> +  if (oacc_kernels_p)
> +    n_threads = 0;
> +  else
> +    n_threads = flag_tree_parallelize_loops;
> +
>    gcc_obstack_init (&parloop_obstack);
>    reduction_info_table_type reduction_list (10);
>  
> @@ -3361,7 +3372,13 @@ public:
>    {}
>  
>    /* opt_pass methods: */
> -  virtual bool gate (function *) { return flag_tree_parallelize_loops > 1; }
> +  virtual bool gate (function *)
> +  {
> +    if (oacc_kernels_p)
> +      return flag_openacc;
> +    else
> +      return flag_tree_parallelize_loops > 1;
> +  }
>    virtual unsigned int execute (function *);
>    opt_pass * clone () { return new pass_parallelize_loops (m_ctxt); }
>    void set_pass_param (unsigned int n, bool param)
> diff --git gcc/tree-ssa-loop.c gcc/tree-ssa-loop.c
> index bdbade5..4c39fbc 100644
> --- gcc/tree-ssa-loop.c
> +++ gcc/tree-ssa-loop.c
> @@ -148,7 +148,7 @@ make_pass_tree_loop (gcc::context *ctxt)
>  static bool
>  gate_oacc_kernels (function *fn)
>  {
> -  if (flag_tree_parallelize_loops <= 1)
> +  if (!flag_openacc)
>      return false;
>  
>    tree oacc_function_attr = get_oacc_fn_attrib (fn->decl);
> @@ -230,10 +230,9 @@ public:
>    virtual bool gate (function *)
>    {
>      return (optimize
> -	    /* Don't bother doing anything if the program has errors.  */
> -	    && !seen_error ()
>  	    && flag_openacc
> -	    && flag_tree_parallelize_loops > 1);
> +	    /* Don't bother doing anything if the program has errors.  */
> +	    && !seen_error ());
>    }
>  
>  }; // class pass_ipa_oacc
> diff --git gcc/config/nvptx/nvptx.c gcc/config/nvptx/nvptx.c
> index fe28154..2fd3d52 100644
> --- gcc/config/nvptx/nvptx.c
> +++ gcc/config/nvptx/nvptx.c
> @@ -4140,7 +4140,7 @@ nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
>  	  bool avoid_offloading_p = true;
>  	  for (unsigned ix = 0; ix != GOMP_DIM_MAX; ix++)
>  	    {
> -	      if (dims[ix] > 1)
> +	      if (dims[ix] > 1 || dims[ix] == 0)
>  		{
>  		  avoid_offloading_p = false;
>  		  break;
> diff --git libgomp/oacc-parallel.c libgomp/oacc-parallel.c
> index bc24651..f795bf7 100644
> --- libgomp/oacc-parallel.c
> +++ libgomp/oacc-parallel.c
> @@ -103,6 +103,10 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
>        return;
>      }
>  
> +  /* Default: let the runtime choose.  */
> +  for (i = 0; i != GOMP_DIM_MAX; i++)
> +    dims[i] = 0;
> +
>    va_start (ap, kinds);
>    /* TODO: This will need amending when device_type is implemented.  */
>    while ((tag = va_arg (ap, unsigned)) != 0)
> diff --git libgomp/plugin/plugin-nvptx.c libgomp/plugin/plugin-nvptx.c
> index 7ec1810..3f1bb6d 100644
> --- libgomp/plugin/plugin-nvptx.c
> +++ libgomp/plugin/plugin-nvptx.c
> @@ -894,9 +894,21 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
>    /* Initialize the launch dimensions.  Typically this is constant,
>       provided by the device compiler, but we must permit runtime
>       values.  */
> -  for (i = 0; i != 3; i++)
> -    if (targ_fn->launch->dim[i])
> -      dims[i] = targ_fn->launch->dim[i];
> +  int seen_zero = 0;
> +  for (i = 0; i != GOMP_DIM_MAX; i++)
> +    {
> +      if (targ_fn->launch->dim[i])
> +       dims[i] = targ_fn->launch->dim[i];
> +      if (!dims[i])
> +       seen_zero = 1;
> +    }
> +
> +  if (seen_zero)
> +    {
> +      for (i = 0; i != GOMP_DIM_MAX; i++)
> +       if (!dims[i])
> +         dims[i] = /* TODO */ 32;
> +    }
>  
>    /* This reserves a chunk of a pre-allocated page of memory mapped on both
>       the host and the device. HP is a host pointer to the new chunk, and DP is
> 
> The TODO in libgomp/plugin/plugin-nvptx.c:nvptx_exec will be resolved by
> Nathan's "Default compute dimensions (runtime)",
> <http://news.gmane.org/find-root.php?message_id=%3C56B21D23.5060209%40acm.org%3E>.
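
For reference, a self-contained sketch of what the nvptx_exec hunk above
does with the launch dimensions, assuming three dimensions and using
hypothetical names (N_DIMS, PLACEHOLDER_DEFAULT, resolve_launch_dims) that
are not part of libgomp.  A non-zero compile-time value overrides the
run-time value, and anything still 0 afterwards gets the placeholder
default of 32 that the TODO refers to.

```c
/* Number of launch dimensions assumed in this sketch.  */
enum { N_DIMS = 3 };

/* Placeholder default, mirroring the "TODO 32" in the patch.  */
#define PLACEHOLDER_DEFAULT 32

/* COMPILED holds the dimensions fixed by the device compiler (0 = not
   fixed); DIMS holds the run-time request (0 = let the runtime choose) and
   is updated in place.  */
static void
resolve_launch_dims (const int compiled[N_DIMS], int dims[N_DIMS])
{
  for (int i = 0; i != N_DIMS; i++)
    {
      if (compiled[i])
	dims[i] = compiled[i];
      if (!dims[i])
	dims[i] = PLACEHOLDER_DEFAULT;
    }
}
```

The patch uses a seen_zero flag and a second loop; the single loop here
produces the same values and is only meant to show the precedence.
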
> 
> The remainder is just "mechanical" updates to the test cases:
> 
> diff --git gcc/testsuite/c-c++-common/goacc/kernels-counter-vars-function-scope.c gcc/testsuite/c-c++-common/goacc/kernels-counter-vars-function-scope.c
> index e8b5357..17f240e 100644
> --- gcc/testsuite/c-c++-common/goacc/kernels-counter-vars-function-scope.c
> +++ gcc/testsuite/c-c++-common/goacc/kernels-counter-vars-function-scope.c
> @@ -1,5 +1,4 @@
>  /* { dg-additional-options "-O2" } */
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  /* { dg-additional-options "-fdump-tree-parloops1-all" } */
>  /* { dg-additional-options "-fdump-tree-optimized" } */
>  
> @@ -51,4 +50,4 @@ main (void)
>  /* Check that the loop has been split off into a function.  */
>  /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
>  
> -/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
> +/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
> diff --git gcc/testsuite/c-c++-common/goacc/kernels-double-reduction-n.c gcc/testsuite/c-c++-common/goacc/kernels-double-reduction-n.c
> index c39d674..750f576 100644
> --- gcc/testsuite/c-c++-common/goacc/kernels-double-reduction-n.c
> +++ gcc/testsuite/c-c++-common/goacc/kernels-double-reduction-n.c
> @@ -1,5 +1,4 @@
>  /* { dg-additional-options "-O2" } */
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  /* { dg-additional-options "-fdump-tree-parloops1-all" } */
>  /* { dg-additional-options "-fdump-tree-optimized" } */
>  
> @@ -34,4 +33,4 @@ foo (unsigned int n)
>  /* Check that the loop has been split off into a function.  */
>  /* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
>  
> -/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
> +/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
> diff --git gcc/testsuite/c-c++-common/goacc/kernels-double-reduction.c gcc/testsuite/c-c++-common/goacc/kernels-double-reduction.c
> index 3501d0d..df60d6a 100644
> --- gcc/testsuite/c-c++-common/goacc/kernels-double-reduction.c
> +++ gcc/testsuite/c-c++-common/goacc/kernels-double-reduction.c
> @@ -1,5 +1,4 @@
>  /* { dg-additional-options "-O2" } */
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  /* { dg-additional-options "-fdump-tree-parloops1-all" } */
>  /* { dg-additional-options "-fdump-tree-optimized" } */
>  
> @@ -34,4 +33,4 @@ foo (void)
>  /* Check that the loop has been split off into a function.  */
>  /* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
>  
> -/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
> +/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
> diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c
> index f97584d..913d91f 100644
> --- gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c
> +++ gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c
> @@ -1,5 +1,4 @@
>  /* { dg-additional-options "-O2" } */
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  /* { dg-additional-options "-fdump-tree-parloops1-all" } */
>  /* { dg-additional-options "-fdump-tree-optimized" } */
>  
> @@ -67,4 +66,4 @@ main (void)
>  /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.1" 1 "optimized" } } */
>  /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.2" 1 "optimized" } } */
>  
> -/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 3 "parloops1" } } */
> +/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 3 "parloops1" } } */
> diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c
> index 530d62a..1822d2a 100644
> --- gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c
> +++ gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c
> @@ -1,5 +1,4 @@
>  /* { dg-additional-options "-O2" } */
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  /* { dg-additional-options "-fdump-tree-parloops1-all" } */
>  /* { dg-additional-options "-fdump-tree-optimized" } */
>  
> @@ -45,5 +44,4 @@ main (void)
>  /* Check that the loop has been split off into a function.  */
>  /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
>  
> -/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
> -
> +/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
> diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c
> index 4f1c2c5..e946319 100644
> --- gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c
> +++ gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c
> @@ -1,6 +1,5 @@
>  /* { dg-additional-options "-O2" } */
>  /* { dg-additional-options "-g" } */
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  /* { dg-additional-options "-fdump-tree-parloops1-all" } */
>  /* { dg-additional-options "-fdump-tree-optimized" } */
>  
> @@ -13,5 +12,4 @@
>  /* Check that the loop has been split off into a function.  */
>  /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
>  
> -/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
> -
> +/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
> diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop-mod-not-zero.c gcc/testsuite/c-c++-common/goacc/kernels-loop-mod-not-zero.c
> index 151db51..9b63b45 100644
> --- gcc/testsuite/c-c++-common/goacc/kernels-loop-mod-not-zero.c
> +++ gcc/testsuite/c-c++-common/goacc/kernels-loop-mod-not-zero.c
> @@ -1,5 +1,4 @@
>  /* { dg-additional-options "-O2" } */
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  /* { dg-additional-options "-fdump-tree-parloops1-all" } */
>  /* { dg-additional-options "-fdump-tree-optimized" } */
>  
> @@ -49,4 +48,4 @@ main (void)
>  /* Check that the loop has been split off into a function.  */
>  /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
>  
> -/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
> +/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
> diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c
> index bee5f5a..279f797 100644
> --- gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c
> +++ gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c
> @@ -1,5 +1,4 @@
>  /* { dg-additional-options "-O2" } */
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  /* { dg-additional-options "-fdump-tree-parloops1-all" } */
>  /* { dg-additional-options "-fdump-tree-optimized" } */
>  
> @@ -52,5 +51,4 @@ foo (COUNTERTYPE n)
>  /* Check that the loop has been split off into a function.  */
>  /* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
>  
> -/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
> -
> +/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
> diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop-nest.c gcc/testsuite/c-c++-common/goacc/kernels-loop-nest.c
> index ea0e342..db1071f 100644
> --- gcc/testsuite/c-c++-common/goacc/kernels-loop-nest.c
> +++ gcc/testsuite/c-c++-common/goacc/kernels-loop-nest.c
> @@ -1,5 +1,4 @@
>  /* { dg-additional-options "-O2" } */
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  /* { dg-additional-options "-fdump-tree-parloops1-all" } */
>  /* { dg-additional-options "-fdump-tree-optimized" } */
>  
> @@ -36,4 +35,4 @@ main (void)
>  /* Check that the loop has been split off into a function.  */
>  /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
>  
> -/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
> +/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
> diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop.c gcc/testsuite/c-c++-common/goacc/kernels-loop.c
> index ab5dfb9..abf7a3c 100644
> --- gcc/testsuite/c-c++-common/goacc/kernels-loop.c
> +++ gcc/testsuite/c-c++-common/goacc/kernels-loop.c
> @@ -1,5 +1,4 @@
>  /* { dg-additional-options "-O2" } */
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  /* { dg-additional-options "-fdump-tree-parloops1-all" } */
>  /* { dg-additional-options "-fdump-tree-optimized" } */
>  
> @@ -52,5 +51,4 @@ main (void)
>  /* Check that the loop has been split off into a function.  */
>  /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
>  
> -/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
> -
> +/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
> diff --git gcc/testsuite/c-c++-common/goacc/kernels-one-counter-var.c gcc/testsuite/c-c++-common/goacc/kernels-one-counter-var.c
> index b16a8cd..95f4817 100644
> --- gcc/testsuite/c-c++-common/goacc/kernels-one-counter-var.c
> +++ gcc/testsuite/c-c++-common/goacc/kernels-one-counter-var.c
> @@ -1,5 +1,4 @@
>  /* { dg-additional-options "-O2" } */
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  /* { dg-additional-options "-fdump-tree-parloops1-all" } */
>  /* { dg-additional-options "-fdump-tree-optimized" } */
>  
> @@ -50,5 +49,4 @@ main (void)
>  /* Check that the loop has been split off into a function.  */
>  /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
>  
> -/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
> -
> +/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
> diff --git gcc/testsuite/c-c++-common/goacc/kernels-reduction.c gcc/testsuite/c-c++-common/goacc/kernels-reduction.c
> index 61c5df3..6f5a418 100644
> --- gcc/testsuite/c-c++-common/goacc/kernels-reduction.c
> +++ gcc/testsuite/c-c++-common/goacc/kernels-reduction.c
> @@ -1,5 +1,4 @@
>  /* { dg-additional-options "-O2" } */
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  /* { dg-additional-options "-fdump-tree-parloops1-all" } */
>  /* { dg-additional-options "-fdump-tree-optimized" } */
>  
> @@ -32,5 +31,4 @@ foo (void)
>  /* Check that the loop has been split off into a function.  */
>  /* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
>  
> -/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
> -
> +/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
> diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop-inner.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop-inner.f95
> index 4db3a50..3334741 100644
> --- gcc/testsuite/gfortran.dg/goacc/kernels-loop-inner.f95
> +++ gcc/testsuite/gfortran.dg/goacc/kernels-loop-inner.f95
> @@ -1,5 +1,4 @@
>  ! { dg-additional-options "-O2" }
> -! { dg-additional-options "-ftree-parallelize-loops=32" }
>  
>  program main
>     implicit none
> diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loops-adjacent.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loops-adjacent.f95
> index fef3d10..fb92da8 100644
> --- gcc/testsuite/gfortran.dg/goacc/kernels-loops-adjacent.f95
> +++ gcc/testsuite/gfortran.dg/goacc/kernels-loops-adjacent.f95
> @@ -1,5 +1,4 @@
>  ! { dg-additional-options "-O2" }
> -! { dg-additional-options "-ftree-parallelize-loops=10" }
>  
>  program main
>     implicit none
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c
> index 08745fc..366b4f5 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c
> @@ -1,6 +1,5 @@
>  /* Test that the compiler decides to "avoid offloading".  */
>  
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  /* The ACC_DEVICE_TYPE environment variable gets set in the testing
>     framework, and that overrides the "avoid offloading" flag at run time.
>     { dg-xfail-run-if "TODO" { openacc_nvidia_accel_selected } } */
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c
> index 724228a..a63ec97 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c
> @@ -1,8 +1,6 @@
>  /* Test that a user can override the compiler's "avoid offloading"
>     decision at run time.  */
>  
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> -
>  #include <openacc.h>
>  
>  int main(void)
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c
> index 2fb5196..da01d02 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c
> @@ -1,7 +1,6 @@
>  /* Test that a user can override the compiler's "avoid offloading"
>     decision at compile time.  */
>  
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  /* Override the compiler's "avoid offloading" decision.
>     { dg-additional-options "-foffload-force" } */
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
> index 87ca378..39899ab 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
> @@ -1,7 +1,5 @@
>  /* This test exercises combined directives.  */
>  
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> -
>  #include <stdlib.h>
>  
>  int
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c
> index 8f0144c..31da8b1 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c
> @@ -1,5 +1,3 @@
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> -
>  #include  <openacc.h>
>  
>  int test_parallel ()
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c
> index 3ef6f9b..51745ba 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c
> @@ -1,5 +1,4 @@
>  /* { dg-do run { target openacc_nvidia_accel_selected } } */
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  /* { dg-additional-options "-lcuda -lcublas -lcudart" } */
>  
>  #include <stdlib.h>
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c
> index 614ad33..588e864 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c
> @@ -1,5 +1,3 @@
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> -
>  #include <stdlib.h>
>  
>  int i;
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c
> index 13e57bd..c7592d6 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c
> @@ -1,6 +1,3 @@
> -/* { dg-do run } */
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> -
>  #include <stdlib.h>
>  
>  #define N (1024 * 512)
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c
> index f61a74a..31114ac 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c
> @@ -1,6 +1,3 @@
> -/* { dg-do run } */
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> -
>  #include <stdlib.h>
>  
>  #define N (1024 * 512)
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
> index 5cdc200..3ffdfe2 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
> @@ -1,5 +1,3 @@
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> -
>  #include <stdlib.h>
>  
>  #define N 32
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
> index 2e4d4d2..a554d66 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
> @@ -1,5 +1,3 @@
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> -
>  #include <stdlib.h>
>  
>  #define N 32
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
> index 5bf00db..f0144b4 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
> @@ -1,5 +1,3 @@
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> -
>  #include <stdlib.h>
>  
>  #define N 32
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
> index d39b667..4719edd 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
> @@ -1,5 +1,3 @@
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> -
>  #include <stdlib.h>
>  
>  #define N 32
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
> index bb2e85b..ca4f638 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
> @@ -1,5 +1,3 @@
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> -
>  #include <stdlib.h>
>  
>  #define N 32
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
> index e513827..d2fff38 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
> @@ -1,5 +1,3 @@
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> -
>  #include <stdlib.h>
>  
>  #define N 32
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
> index c4791a4..0df4b3f 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
> @@ -1,5 +1,3 @@
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> -
>  #include <stdlib.h>
>  
>  #define N 100
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c
> index 96b6e4e..88258be 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c
> @@ -1,5 +1,3 @@
> -/* { dg-do run } */
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  /* { dg-additional-options "-g" } */
>  
>  #include "kernels-loop.c"
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c
> index 1433cb2..147ebb5 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c
> @@ -1,6 +1,3 @@
> -/* { dg-do run } */
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> -
>  #include <stdlib.h>
>  
>  #define N ((1024 * 512) + 1)
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c
> index fd0d5b1..9a3eaca 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c
> @@ -1,6 +1,3 @@
> -/* { dg-do run } */
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> -
>  #include <stdlib.h>
>  
>  #define N ((1024 * 512) + 1)
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c
> index 21d2599..28c725a 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c
> @@ -1,6 +1,3 @@
> -/* { dg-do run } */
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> -
>  #include <stdlib.h>
>  
>  #define N 1000
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c
> index 3762e5a..355123c 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c
> @@ -1,6 +1,3 @@
> -/* { dg-do run } */
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> -
>  #include <stdlib.h>
>  
>  #define N (1024 * 512)
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c
> index 511e25f..8647a94 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c
> @@ -1,6 +1,3 @@
> -/* { dg-do run } */
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> -
>  #include <stdlib.h>
>  
>  #define n 10000
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
> index 94a5ae2..83cddb5 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
> @@ -1,5 +1,3 @@
> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> -
>  #include <stdlib.h>
>  
>  int
> diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f
> index 5f18b94..ca5cd01 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f
> +++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f
> @@ -2,7 +2,6 @@
>  
>  ! { dg-do run }
>  ! { dg-additional-options "-cpp" }
> -! { dg-additional-options "-ftree-parallelize-loops=32" }
>  ! The "avoid offloading" warning is only triggered for -O2 and higher.
>  ! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }
>  ! The ACC_DEVICE_TYPE environment variable gets set in the testing
> diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f
> index 51801ad..6200b37 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f
> +++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f
> @@ -3,7 +3,6 @@
>  
>  ! { dg-do run }
>  ! { dg-additional-options "-cpp" }
> -! { dg-additional-options "-ftree-parallelize-loops=32" }
>  ! The "avoid offloading" warning is only triggered for -O2 and higher.
>  ! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }
>  
> diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f
> index bea6ab8..865d09f 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f
> +++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f
> @@ -3,7 +3,6 @@
>  
>  ! { dg-do run }
>  ! { dg-additional-options "-cpp" }
> -! { dg-additional-options "-ftree-parallelize-loops=32" }
>  ! Override the compiler's "avoid offloading" decision.
>  ! { dg-additional-options "-foffload-force" }
>  
> diff --git libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90 libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90
> index 4b52579..12ff36c 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90
> @@ -1,7 +1,6 @@
>  ! This test exercises combined directives.
>  
>  ! { dg-do run }
> -! { dg-additional-options "-ftree-parallelize-loops=32" }
>  ! The "avoid offloading" warning is only triggered for -O2 and higher.
>  ! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }
>  
> diff --git libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90 libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
> index b9298c7..0643e89 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
> @@ -2,7 +2,6 @@
>  ! offloaded regions are properly mapped using present_or_copy.
>  
>  ! { dg-do run }
> -! { dg-additional-options "-ftree-parallelize-loops=32" }
>  ! The "avoid offloading" warning is only triggered for -O2 and higher.
>  ! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }


Regards
 Thomas

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PING^3][PATCH, 12/16] Handle acc loop directive
  2016-01-26 12:50         ` Jakub Jelinek
@ 2016-02-12 11:11           ` Tom de Vries
  2016-02-22 10:55             ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2016-02-12 11:11 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc-patches, Richard Biener

On 26/01/16 13:49, Jakub Jelinek wrote:
> On Tue, Jan 26, 2016 at 01:38:39PM +0100, Tom de Vries wrote:
>> Ping^3. ( https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01089.html )
>
> First of all, I wonder if it wouldn't be far easier to handle these during
> gimplification rather than during omp expansion or during parsing.  Inside
> kernels, do you need to honor any clauses on the acc loop, like
> privatization etc., or can you just ignore it altogether (after parsing them
> to ensure they are valid)?

The oacc loop clauses are: gang, worker, vector, seq, auto, tile, 
device_type, independent, private, reduction.

AFAIU, they're all safe to ignore. That has largely been the approach 
in the gomp-4_0-branch, and so far I haven't seen any failures due to 
ignoring a loop clause in a kernels region.

But we do want to be able to honor loop clauses in a kernels region at 
some point. F.i., supporting the independent clause would allow more 
test-cases to be parallelized.

At some point we had an implementation of the independent clause in the 
gomp-4_0-branch, but that had to be reverted ( 
https://gcc.gnu.org/ml/gcc-patches/2015-11/msg00696.html ).

Anyway, the implementation of the propagation of the independent 
property was to keep the loop directive with the independent clause 
until omp-expand (where we have cfg), and set a new field 
marked_independent in the corresponding struct loop.

If we want to do the expansion of the loop directive to a normal loop at 
gimplification, I see two issues:
- in general, we don't only check for correctness during parsing,
   there's also checking being done during scan_omp, which happens in
   pass_lower_omp, after gimplification.
- how do we mark the new loop as being independent?

> Handling this in expand_omp_for_generic is not really nice, because it will
> make already very complicated function even more complex.

An alternative would be to copy expand_omp_for_generic, apply the patch, 
and partially evaluate for the single call introduced in the patch.

Do you prefer this approach?

Thanks,
- Tom

>     gomp_ordered *ord_stmt;
> +
> +  /* True if this is nested inside an OpenACC kernels construct.  */
> +  bool inside_kernels_p;
>   };
>
> is bad placement, there are other bool/unsigned char fields earlier and the
> smaller fields should be adjacent for smaller padding of the struct.
>
> 	Jakub
>
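
To illustrate the placement point above, here is a minimal sketch, not the
actual GCC struct.  The member some_flag and the void * type used for
ord_stmt are stand-ins made up for illustration; only the names ord_stmt
and inside_kernels_p come from the quoted hunk.  It compares a layout in
which two bool members are separated by a pointer with one in which they
are adjacent.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical layout: the two bool members are separated by a pointer,
// so each of them is typically followed by its own run of padding.
struct scattered
{
  bool some_flag;         // assumed pre-existing small field
  void *ord_stmt;         // stand-in for the gomp_ordered pointer
  bool inside_kernels_p;  // new field appended at the end
};

// Same members, with the small fields placed next to each other so they
// can share a single run of padding.
struct grouped
{
  void *ord_stmt;
  bool some_flag;
  bool inside_kernels_p;
};

// Returns how many bytes the grouped layout saves on this target.
std::size_t
bytes_saved ()
{
  // Grouping the small fields should never make the struct larger.
  assert (sizeof (grouped) <= sizeof (scattered));
  return sizeof (scattered) - sizeof (grouped);
}
```

On a typical LP64 target I'd expect the grouped layout to be smaller, but
the exact sizes are ABI-dependent, so the sketch only checks that grouping
is never worse.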

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: Use plain -fopenacc to enable OpenACC kernels processing
  2016-02-10 14:40     ` Use plain -fopenacc to enable OpenACC kernels processing Thomas Schwinge
@ 2016-02-15 16:54       ` Tom de Vries
  2016-02-23 15:19         ` Thomas Schwinge
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2016-02-15 16:54 UTC (permalink / raw)
  To: Thomas Schwinge
  Cc: Jakub Jelinek, gcc-patches, Bernd Schmidt, Richard Biener,
	Nathan Sidwell

On 10/02/16 15:40, Thomas Schwinge wrote:
> Hi!
>
> Will this patch be acceptable for GCC trunk in the current development
> stage?  In its current incarnation, this patch depends on my
> 'Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid
> offloading"' patch,
> <http://news.gmane.org/find-root.php?message_id=%3C87zivg8rcy.fsf%40hertz.schwinge.homeip.net%3E>,
> which Bernd suggested "has to be considered after gcc-6".  So, I'll have
> to re-work this patch here, hence I'm first checking if it generally
> meets approval?
>
> On Fri, 5 Feb 2016 13:06:17 +0100, I wrote:
>> On Mon, 9 Nov 2015 18:39:19 +0100, Tom de Vries <Tom_deVries@mentor.com> wrote:
>>> On 09/11/15 16:35, Tom de Vries wrote:
>>>> this patch series for stage1 trunk adds support to:
>>>> - parallelize oacc kernels regions using parloops, and
>>>> - map the loops onto the oacc gang dimension.
>>
>>> Atm, the parallelization behaviour for the kernels region is controlled
>>> by flag_tree_parallelize_loops, which is also used to control generic
>>> auto-parallelization by autopar using omp. That is not ideal, and we may
>>> want a separate flag (or param) to control the behaviour for oacc
>>> kernels, f.i. -foacc-kernels-gang-parallelize=<n>. I'm open to suggestions.
>>
>> I suggest to use plain -fopenacc to enable OpenACC kernels processing
>> (which just makes sense, I hope) ;-) and have later processing stages
>> determine the actual parametrization (currently: number of gangs) (that
>> is, Nathan's recent "Default compute dimensions" patches).
>>

Hi Thomas,

That makes a lot of sense.  Thanks for working on this.

>> The code changes are simple enough; OK for trunk?  (This patch depends on
>> my 'Un-parallelized OpenACC kernels constructs with nvptx offloading:
>> "avoid offloading"' pending review,
>> <http://news.gmane.org/find-root.php?message_id=%3C87zivg8rcy.fsf%40hertz.schwinge.homeip.net%3E>.)
>>
>> Originally, I wanted to use:
>>
>>      OMP_CLAUSE_NUM_GANGS_EXPR (clause) = build_int_cst (integer_type_node, n_threads == 0 ? -1 : n_threads);
>>
>> ... to store -1 "have the compiler decide" (instead of now 0 "have the
>> run-time decide", which might prevent some code optimizations, as I
>> understand it) for the n_threads == 0 case, but it seems that for an
>> offloaded OpenACC kernels region, gcc/omp-low.c:oacc_validate_dims is
>> called with the parameter "used" set to 0 instead of "gang", and then the
>> "Default anything left to 1 or a partitioned default" logic will default
>> dims["gang"] to oacc_min_dims["gang"] (that is, 1) instead of the
>> oacc_default_dims["gang"] (that is, 32).  Nathan, does that smell like a
>> bug (and could you look into that)?
>>
>> diff --git gcc/tree-parloops.c gcc/tree-parloops.c
>> index 139e38c..e498e5b 100644
>> --- gcc/tree-parloops.c
>> +++ gcc/tree-parloops.c
>> @@ -2016,7 +2016,8 @@ transform_to_exit_first_loop (struct loop *loop,
>>   /* Create the parallel constructs for LOOP as described in gen_parallel_loop.
>>      LOOP_FN and DATA are the arguments of GIMPLE_OMP_PARALLEL.
>>      NEW_DATA is the variable that should be initialized from the argument
>> -   of LOOP_FN.  N_THREADS is the requested number of threads.  */
>> +   of LOOP_FN.  N_THREADS is the requested number of threads, which can be 0 if
>> +   that number is to be determined later.  */
>>
>>   static void
>>   create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
>> @@ -2049,6 +2050,7 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
>>         basic_block paral_bb = single_pred (bb);
>>         gsi = gsi_last_bb (paral_bb);
>>
>> +      gcc_checking_assert (n_threads != 0);
>>         t = build_omp_clause (loc, OMP_CLAUSE_NUM_THREADS);
>>         OMP_CLAUSE_NUM_THREADS_EXPR (t)
>>   	= build_int_cst (integer_type_node, n_threads);
>> @@ -2221,7 +2223,8 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
>>   }
>>
>>   /* Generates code to execute the iterations of LOOP in N_THREADS
>> -   threads in parallel.
>> +   threads in parallel, which can be 0 if that number is to be determined
>> +   later.
>>
>>      NITER describes number of iterations of LOOP.
>>      REDUCTION_LIST describes the reductions existent in the LOOP.  */
>> @@ -2318,6 +2321,7 @@ gen_parallel_loop (struct loop *loop,
>>         else
>>   	m_p_thread=MIN_PER_THREAD;
>>
>> +      gcc_checking_assert (n_threads != 0);
>>         many_iterations_cond =
>>   	fold_build2 (GE_EXPR, boolean_type_node,
>>   		     nit, build_int_cst (type, m_p_thread * n_threads));
>> @@ -3177,7 +3181,7 @@ oacc_entry_exit_ok (struct loop *loop,
>>   static bool
>>   parallelize_loops (bool oacc_kernels_p)
>>   {
>> -  unsigned n_threads = flag_tree_parallelize_loops;
>> +  unsigned n_threads;
>>     bool changed = false;
>>     struct loop *loop;
>>     struct loop *skip_loop = NULL;
>> @@ -3199,6 +3203,13 @@ parallelize_loops (bool oacc_kernels_p)
>>     if (cfun->has_nonlocal_label)
>>       return false;
>>
>> +  /* For OpenACC kernels, n_threads will be determined later; otherwise, it's
>> +     the argument to -ftree-parallelize-loops.  */
>> +  if (oacc_kernels_p)
>> +    n_threads = 0;
>> +  else
>> +    n_threads = flag_tree_parallelize_loops;
>> +
>>     gcc_obstack_init (&parloop_obstack);
>>     reduction_info_table_type reduction_list (10);
>>
>> @@ -3361,7 +3372,13 @@ public:
>>     {}
>>
>>     /* opt_pass methods: */
>> -  virtual bool gate (function *) { return flag_tree_parallelize_loops > 1; }
>> +  virtual bool gate (function *)
>> +  {
>> +    if (oacc_kernels_p)
>> +      return flag_openacc;
>> +    else
>> +      return flag_tree_parallelize_loops > 1;
>> +  }

I wouldn't mind using the ternary expression here, but I suppose that's 
a taste thing.
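
For reference, the ternary form could look roughly like the sketch below.
gate_inputs and both function names are hypothetical stand-ins so that the
snippet compiles on its own; in the real pass the values come from the pass
parameter and the global flags.

```cpp
#include <cassert>

// Hypothetical bundle of the values the gate looks at.
struct gate_inputs
{
  bool oacc_kernels_p;              // pass instance handles oacc kernels
  bool flag_openacc;                // -fopenacc given
  int flag_tree_parallelize_loops;  // -ftree-parallelize-loops=<n>
};

// The if/else form, as in the quoted hunk.
bool
gate_if_else (const gate_inputs &in)
{
  if (in.oacc_kernels_p)
    return in.flag_openacc;
  else
    return in.flag_tree_parallelize_loops > 1;
}

// The equivalent ternary form.  The relational operator binds tighter
// than ?:, so the last operand is (flag_tree_parallelize_loops > 1).
bool
gate_ternary (const gate_inputs &in)
{
  return in.oacc_kernels_p
	 ? in.flag_openacc
	 : in.flag_tree_parallelize_loops > 1;
}
```

Both forms return the same value for every combination of inputs, so the
choice is purely stylistic.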

>>     virtual unsigned int execute (function *);
>>     opt_pass * clone () { return new pass_parallelize_loops (m_ctxt); }
>>     void set_pass_param (unsigned int n, bool param)

The oacc-parloops changes look good to me. I approve them for 6.0 stage 
4 (given that using the ftree-parallelize-loops=<n> flag for oacc 
kernels parallelization was just a placeholder waiting to be 
replaced by an oacc-based approach). [ And I'd expect that the 
tree-ssa-loop.c changes and the mechanical testsuite changes can be 
regarded as trivial. ]

Thanks,
- Tom

>> diff --git gcc/tree-ssa-loop.c gcc/tree-ssa-loop.c
>> index bdbade5..4c39fbc 100644
>> --- gcc/tree-ssa-loop.c
>> +++ gcc/tree-ssa-loop.c
>> @@ -148,7 +148,7 @@ make_pass_tree_loop (gcc::context *ctxt)
>>   static bool
>>   gate_oacc_kernels (function *fn)
>>   {
>> -  if (flag_tree_parallelize_loops <= 1)
>> +  if (!flag_openacc)
>>       return false;
>>
>>     tree oacc_function_attr = get_oacc_fn_attrib (fn->decl);
>> @@ -230,10 +230,9 @@ public:
>>     virtual bool gate (function *)
>>     {
>>       return (optimize
>> -	    /* Don't bother doing anything if the program has errors.  */
>> -	    && !seen_error ()
>>   	    && flag_openacc
>> -	    && flag_tree_parallelize_loops > 1);
>> +	    /* Don't bother doing anything if the program has errors.  */
>> +	    && !seen_error ());
>>     }
>>
>>   }; // class pass_ipa_oacc
>> diff --git gcc/config/nvptx/nvptx.c gcc/config/nvptx/nvptx.c
>> index fe28154..2fd3d52 100644
>> --- gcc/config/nvptx/nvptx.c
>> +++ gcc/config/nvptx/nvptx.c
>> @@ -4140,7 +4140,7 @@ nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
>>   	  bool avoid_offloading_p = true;
>>   	  for (unsigned ix = 0; ix != GOMP_DIM_MAX; ix++)
>>   	    {
>> -	      if (dims[ix] > 1)
>> +	      if (dims[ix] > 1 || dims[ix] == 0)
>>   		{
>>   		  avoid_offloading_p = false;
>>   		  break;
>> diff --git libgomp/oacc-parallel.c libgomp/oacc-parallel.c
>> index bc24651..f795bf7 100644
>> --- libgomp/oacc-parallel.c
>> +++ libgomp/oacc-parallel.c
>> @@ -103,6 +103,10 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
>>         return;
>>       }
>>
>> +  /* Default: let the runtime choose.  */
>> +  for (i = 0; i != GOMP_DIM_MAX; i++)
>> +    dims[i] = 0;
>> +
>>     va_start (ap, kinds);
>>     /* TODO: This will need amending when device_type is implemented.  */
>>     while ((tag = va_arg (ap, unsigned)) != 0)
>> diff --git libgomp/plugin/plugin-nvptx.c libgomp/plugin/plugin-nvptx.c
>> index 7ec1810..3f1bb6d 100644
>> --- libgomp/plugin/plugin-nvptx.c
>> +++ libgomp/plugin/plugin-nvptx.c
>> @@ -894,9 +894,21 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
>>     /* Initialize the launch dimensions.  Typically this is constant,
>>        provided by the device compiler, but we must permit runtime
>>        values.  */
>> -  for (i = 0; i != 3; i++)
>> -    if (targ_fn->launch->dim[i])
>> -      dims[i] = targ_fn->launch->dim[i];
>> +  int seen_zero = 0;
>> +  for (i = 0; i != GOMP_DIM_MAX; i++)
>> +    {
>> +      if (targ_fn->launch->dim[i])
>> +       dims[i] = targ_fn->launch->dim[i];
>> +      if (!dims[i])
>> +       seen_zero = 1;
>> +    }
>> +
>> +  if (seen_zero)
>> +    {
>> +      for (i = 0; i != GOMP_DIM_MAX; i++)
>> +       if (!dims[i])
>> +         dims[i] = /* TODO */ 32;
>> +    }
>>
>>     /* This reserves a chunk of a pre-allocated page of memory mapped on both
>>        the host and the device. HP is a host pointer to the new chunk, and DP is
>>
>> The TODO in libgomp/plugin/plugin-nvptx.c:nvptx_exec will be resolved by
>> Nathan's "Default compute dimensions (runtime)",
>> <http://news.gmane.org/find-root.php?message_id=%3C56B21D23.5060209%40acm.org%3E>.
>>
>> The remainder is just "mechanical" updates to the test cases:
>>
>> diff --git gcc/testsuite/c-c++-common/goacc/kernels-counter-vars-function-scope.c gcc/testsuite/c-c++-common/goacc/kernels-counter-vars-function-scope.c
>> index e8b5357..17f240e 100644
>> --- gcc/testsuite/c-c++-common/goacc/kernels-counter-vars-function-scope.c
>> +++ gcc/testsuite/c-c++-common/goacc/kernels-counter-vars-function-scope.c
>> @@ -1,5 +1,4 @@
>>   /* { dg-additional-options "-O2" } */
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>>   /* { dg-additional-options "-fdump-tree-parloops1-all" } */
>>   /* { dg-additional-options "-fdump-tree-optimized" } */
>>
>> @@ -51,4 +50,4 @@ main (void)
>>   /* Check that the loop has been split off into a function.  */
>>   /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
>>
>> -/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
>> +/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
>> diff --git gcc/testsuite/c-c++-common/goacc/kernels-double-reduction-n.c gcc/testsuite/c-c++-common/goacc/kernels-double-reduction-n.c
>> index c39d674..750f576 100644
>> --- gcc/testsuite/c-c++-common/goacc/kernels-double-reduction-n.c
>> +++ gcc/testsuite/c-c++-common/goacc/kernels-double-reduction-n.c
>> @@ -1,5 +1,4 @@
>>   /* { dg-additional-options "-O2" } */
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>>   /* { dg-additional-options "-fdump-tree-parloops1-all" } */
>>   /* { dg-additional-options "-fdump-tree-optimized" } */
>>
>> @@ -34,4 +33,4 @@ foo (unsigned int n)
>>   /* Check that the loop has been split off into a function.  */
>>   /* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
>>
>> -/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
>> +/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
>> diff --git gcc/testsuite/c-c++-common/goacc/kernels-double-reduction.c gcc/testsuite/c-c++-common/goacc/kernels-double-reduction.c
>> index 3501d0d..df60d6a 100644
>> --- gcc/testsuite/c-c++-common/goacc/kernels-double-reduction.c
>> +++ gcc/testsuite/c-c++-common/goacc/kernels-double-reduction.c
>> @@ -1,5 +1,4 @@
>>   /* { dg-additional-options "-O2" } */
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>>   /* { dg-additional-options "-fdump-tree-parloops1-all" } */
>>   /* { dg-additional-options "-fdump-tree-optimized" } */
>>
>> @@ -34,4 +33,4 @@ foo (void)
>>   /* Check that the loop has been split off into a function.  */
>>   /* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
>>
>> -/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
>> +/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
>> diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c
>> index f97584d..913d91f 100644
>> --- gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c
>> +++ gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c
>> @@ -1,5 +1,4 @@
>>   /* { dg-additional-options "-O2" } */
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>>   /* { dg-additional-options "-fdump-tree-parloops1-all" } */
>>   /* { dg-additional-options "-fdump-tree-optimized" } */
>>
>> @@ -67,4 +66,4 @@ main (void)
>>   /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.1" 1 "optimized" } } */
>>   /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.2" 1 "optimized" } } */
>>
>> -/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 3 "parloops1" } } */
>> +/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 3 "parloops1" } } */
>> diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c
>> index 530d62a..1822d2a 100644
>> --- gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c
>> +++ gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c
>> @@ -1,5 +1,4 @@
>>   /* { dg-additional-options "-O2" } */
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>>   /* { dg-additional-options "-fdump-tree-parloops1-all" } */
>>   /* { dg-additional-options "-fdump-tree-optimized" } */
>>
>> @@ -45,5 +44,4 @@ main (void)
>>   /* Check that the loop has been split off into a function.  */
>>   /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
>>
>> -/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
>> -
>> +/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
>> diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c
>> index 4f1c2c5..e946319 100644
>> --- gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c
>> +++ gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c
>> @@ -1,6 +1,5 @@
>>   /* { dg-additional-options "-O2" } */
>>   /* { dg-additional-options "-g" } */
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>>   /* { dg-additional-options "-fdump-tree-parloops1-all" } */
>>   /* { dg-additional-options "-fdump-tree-optimized" } */
>>
>> @@ -13,5 +12,4 @@
>>   /* Check that the loop has been split off into a function.  */
>>   /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
>>
>> -/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
>> -
>> +/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
>> diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop-mod-not-zero.c gcc/testsuite/c-c++-common/goacc/kernels-loop-mod-not-zero.c
>> index 151db51..9b63b45 100644
>> --- gcc/testsuite/c-c++-common/goacc/kernels-loop-mod-not-zero.c
>> +++ gcc/testsuite/c-c++-common/goacc/kernels-loop-mod-not-zero.c
>> @@ -1,5 +1,4 @@
>>   /* { dg-additional-options "-O2" } */
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>>   /* { dg-additional-options "-fdump-tree-parloops1-all" } */
>>   /* { dg-additional-options "-fdump-tree-optimized" } */
>>
>> @@ -49,4 +48,4 @@ main (void)
>>   /* Check that the loop has been split off into a function.  */
>>   /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
>>
>> -/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
>> +/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
>> diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c
>> index bee5f5a..279f797 100644
>> --- gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c
>> +++ gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c
>> @@ -1,5 +1,4 @@
>>   /* { dg-additional-options "-O2" } */
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>>   /* { dg-additional-options "-fdump-tree-parloops1-all" } */
>>   /* { dg-additional-options "-fdump-tree-optimized" } */
>>
>> @@ -52,5 +51,4 @@ foo (COUNTERTYPE n)
>>   /* Check that the loop has been split off into a function.  */
>>   /* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
>>
>> -/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
>> -
>> +/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
>> diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop-nest.c gcc/testsuite/c-c++-common/goacc/kernels-loop-nest.c
>> index ea0e342..db1071f 100644
>> --- gcc/testsuite/c-c++-common/goacc/kernels-loop-nest.c
>> +++ gcc/testsuite/c-c++-common/goacc/kernels-loop-nest.c
>> @@ -1,5 +1,4 @@
>>   /* { dg-additional-options "-O2" } */
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>>   /* { dg-additional-options "-fdump-tree-parloops1-all" } */
>>   /* { dg-additional-options "-fdump-tree-optimized" } */
>>
>> @@ -36,4 +35,4 @@ main (void)
>>   /* Check that the loop has been split off into a function.  */
>>   /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
>>
>> -/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
>> +/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
>> diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop.c gcc/testsuite/c-c++-common/goacc/kernels-loop.c
>> index ab5dfb9..abf7a3c 100644
>> --- gcc/testsuite/c-c++-common/goacc/kernels-loop.c
>> +++ gcc/testsuite/c-c++-common/goacc/kernels-loop.c
>> @@ -1,5 +1,4 @@
>>   /* { dg-additional-options "-O2" } */
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>>   /* { dg-additional-options "-fdump-tree-parloops1-all" } */
>>   /* { dg-additional-options "-fdump-tree-optimized" } */
>>
>> @@ -52,5 +51,4 @@ main (void)
>>   /* Check that the loop has been split off into a function.  */
>>   /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
>>
>> -/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
>> -
>> +/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
>> diff --git gcc/testsuite/c-c++-common/goacc/kernels-one-counter-var.c gcc/testsuite/c-c++-common/goacc/kernels-one-counter-var.c
>> index b16a8cd..95f4817 100644
>> --- gcc/testsuite/c-c++-common/goacc/kernels-one-counter-var.c
>> +++ gcc/testsuite/c-c++-common/goacc/kernels-one-counter-var.c
>> @@ -1,5 +1,4 @@
>>   /* { dg-additional-options "-O2" } */
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>>   /* { dg-additional-options "-fdump-tree-parloops1-all" } */
>>   /* { dg-additional-options "-fdump-tree-optimized" } */
>>
>> @@ -50,5 +49,4 @@ main (void)
>>   /* Check that the loop has been split off into a function.  */
>>   /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
>>
>> -/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
>> -
>> +/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
>> diff --git gcc/testsuite/c-c++-common/goacc/kernels-reduction.c gcc/testsuite/c-c++-common/goacc/kernels-reduction.c
>> index 61c5df3..6f5a418 100644
>> --- gcc/testsuite/c-c++-common/goacc/kernels-reduction.c
>> +++ gcc/testsuite/c-c++-common/goacc/kernels-reduction.c
>> @@ -1,5 +1,4 @@
>>   /* { dg-additional-options "-O2" } */
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>>   /* { dg-additional-options "-fdump-tree-parloops1-all" } */
>>   /* { dg-additional-options "-fdump-tree-optimized" } */
>>
>> @@ -32,5 +31,4 @@ foo (void)
>>   /* Check that the loop has been split off into a function.  */
>>   /* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
>>
>> -/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
>> -
>> +/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
>> diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop-inner.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop-inner.f95
>> index 4db3a50..3334741 100644
>> --- gcc/testsuite/gfortran.dg/goacc/kernels-loop-inner.f95
>> +++ gcc/testsuite/gfortran.dg/goacc/kernels-loop-inner.f95
>> @@ -1,5 +1,4 @@
>>   ! { dg-additional-options "-O2" }
>> -! { dg-additional-options "-ftree-parallelize-loops=32" }
>>
>>   program main
>>      implicit none
>> diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loops-adjacent.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loops-adjacent.f95
>> index fef3d10..fb92da8 100644
>> --- gcc/testsuite/gfortran.dg/goacc/kernels-loops-adjacent.f95
>> +++ gcc/testsuite/gfortran.dg/goacc/kernels-loops-adjacent.f95
>> @@ -1,5 +1,4 @@
>>   ! { dg-additional-options "-O2" }
>> -! { dg-additional-options "-ftree-parallelize-loops=10" }
>>
>>   program main
>>      implicit none
>> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c
>> index 08745fc..366b4f5 100644
>> --- libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c
>> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c
>> @@ -1,6 +1,5 @@
>>   /* Test that the compiler decides to "avoid offloading".  */
>>
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>>   /* The ACC_DEVICE_TYPE environment variable gets set in the testing
>>      framework, and that overrides the "avoid offloading" flag at run time.
>>      { dg-xfail-run-if "TODO" { openacc_nvidia_accel_selected } } */
>> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c
>> index 724228a..a63ec97 100644
>> --- libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c
>> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c
>> @@ -1,8 +1,6 @@
>>   /* Test that a user can override the compiler's "avoid offloading"
>>      decision at run time.  */
>>
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>> -
>>   #include <openacc.h>
>>
>>   int main(void)
>> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c
>> index 2fb5196..da01d02 100644
>> --- libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c
>> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c
>> @@ -1,7 +1,6 @@
>>   /* Test that a user can override the compiler's "avoid offloading"
>>      decision at compile time.  */
>>
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>>   /* Override the compiler's "avoid offloading" decision.
>>      { dg-additional-options "-foffload-force" } */
>>
>> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
>> index 87ca378..39899ab 100644
>> --- libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
>> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
>> @@ -1,7 +1,5 @@
>>   /* This test exercises combined directives.  */
>>
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>> -
>>   #include <stdlib.h>
>>
>>   int
>> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c
>> index 8f0144c..31da8b1 100644
>> --- libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c
>> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c
>> @@ -1,5 +1,3 @@
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>> -
>>   #include  <openacc.h>
>>
>>   int test_parallel ()
>> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c
>> index 3ef6f9b..51745ba 100644
>> --- libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c
>> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c
>> @@ -1,5 +1,4 @@
>>   /* { dg-do run { target openacc_nvidia_accel_selected } } */
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>>   /* { dg-additional-options "-lcuda -lcublas -lcudart" } */
>>
>>   #include <stdlib.h>
>> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c
>> index 614ad33..588e864 100644
>> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c
>> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c
>> @@ -1,5 +1,3 @@
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>> -
>>   #include <stdlib.h>
>>
>>   int i;
>> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c
>> index 13e57bd..c7592d6 100644
>> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c
>> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c
>> @@ -1,6 +1,3 @@
>> -/* { dg-do run } */
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>> -
>>   #include <stdlib.h>
>>
>>   #define N (1024 * 512)
>> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c
>> index f61a74a..31114ac 100644
>> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c
>> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c
>> @@ -1,6 +1,3 @@
>> -/* { dg-do run } */
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>> -
>>   #include <stdlib.h>
>>
>>   #define N (1024 * 512)
>> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
>> index 5cdc200..3ffdfe2 100644
>> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
>> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
>> @@ -1,5 +1,3 @@
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>> -
>>   #include <stdlib.h>
>>
>>   #define N 32
>> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
>> index 2e4d4d2..a554d66 100644
>> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
>> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
>> @@ -1,5 +1,3 @@
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>> -
>>   #include <stdlib.h>
>>
>>   #define N 32
>> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
>> index 5bf00db..f0144b4 100644
>> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
>> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
>> @@ -1,5 +1,3 @@
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>> -
>>   #include <stdlib.h>
>>
>>   #define N 32
>> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
>> index d39b667..4719edd 100644
>> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
>> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
>> @@ -1,5 +1,3 @@
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>> -
>>   #include <stdlib.h>
>>
>>   #define N 32
>> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
>> index bb2e85b..ca4f638 100644
>> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
>> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
>> @@ -1,5 +1,3 @@
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>> -
>>   #include <stdlib.h>
>>
>>   #define N 32
>> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
>> index e513827..d2fff38 100644
>> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
>> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
>> @@ -1,5 +1,3 @@
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>> -
>>   #include <stdlib.h>
>>
>>   #define N 32
>> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
>> index c4791a4..0df4b3f 100644
>> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
>> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
>> @@ -1,5 +1,3 @@
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>> -
>>   #include <stdlib.h>
>>
>>   #define N 100
>> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c
>> index 96b6e4e..88258be 100644
>> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c
>> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c
>> @@ -1,5 +1,3 @@
>> -/* { dg-do run } */
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>>   /* { dg-additional-options "-g" } */
>>
>>   #include "kernels-loop.c"
>> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c
>> index 1433cb2..147ebb5 100644
>> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c
>> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c
>> @@ -1,6 +1,3 @@
>> -/* { dg-do run } */
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>> -
>>   #include <stdlib.h>
>>
>>   #define N ((1024 * 512) + 1)
>> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c
>> index fd0d5b1..9a3eaca 100644
>> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c
>> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c
>> @@ -1,6 +1,3 @@
>> -/* { dg-do run } */
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>> -
>>   #include <stdlib.h>
>>
>>   #define N ((1024 * 512) + 1)
>> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c
>> index 21d2599..28c725a 100644
>> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c
>> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c
>> @@ -1,6 +1,3 @@
>> -/* { dg-do run } */
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>> -
>>   #include <stdlib.h>
>>
>>   #define N 1000
>> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c
>> index 3762e5a..355123c 100644
>> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c
>> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c
>> @@ -1,6 +1,3 @@
>> -/* { dg-do run } */
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>> -
>>   #include <stdlib.h>
>>
>>   #define N (1024 * 512)
>> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c
>> index 511e25f..8647a94 100644
>> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c
>> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c
>> @@ -1,6 +1,3 @@
>> -/* { dg-do run } */
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>> -
>>   #include <stdlib.h>
>>
>>   #define n 10000
>> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
>> index 94a5ae2..83cddb5 100644
>> --- libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
>> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
>> @@ -1,5 +1,3 @@
>> -/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>> -
>>   #include <stdlib.h>
>>
>>   int
>> diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f
>> index 5f18b94..ca5cd01 100644
>> --- libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f
>> +++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f
>> @@ -2,7 +2,6 @@
>>
>>   ! { dg-do run }
>>   ! { dg-additional-options "-cpp" }
>> -! { dg-additional-options "-ftree-parallelize-loops=32" }
>>   ! The "avoid offloading" warning is only triggered for -O2 and higher.
>>   ! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }
>>   ! The ACC_DEVICE_TYPE environment variable gets set in the testing
>> diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f
>> index 51801ad..6200b37 100644
>> --- libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f
>> +++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f
>> @@ -3,7 +3,6 @@
>>
>>   ! { dg-do run }
>>   ! { dg-additional-options "-cpp" }
>> -! { dg-additional-options "-ftree-parallelize-loops=32" }
>>   ! The "avoid offloading" warning is only triggered for -O2 and higher.
>>   ! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }
>>
>> diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f
>> index bea6ab8..865d09f 100644
>> --- libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f
>> +++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f
>> @@ -3,7 +3,6 @@
>>
>>   ! { dg-do run }
>>   ! { dg-additional-options "-cpp" }
>> -! { dg-additional-options "-ftree-parallelize-loops=32" }
>>   ! Override the compiler's "avoid offloading" decision.
>>   ! { dg-additional-options "-foffload-force" }
>>
>> diff --git libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90 libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90
>> index 4b52579..12ff36c 100644
>> --- libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90
>> +++ libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90
>> @@ -1,7 +1,6 @@
>>   ! This test exercises combined directives.
>>
>>   ! { dg-do run }
>> -! { dg-additional-options "-ftree-parallelize-loops=32" }
>>   ! The "avoid offloading" warning is only triggered for -O2 and higher.
>>   ! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }
>>
>> diff --git libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90 libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
>> index b9298c7..0643e89 100644
>> --- libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
>> +++ libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
>> @@ -2,7 +2,6 @@
>>   ! offloaded regions are properly mapped using present_or_copy.
>>
>>   ! { dg-do run }
>> -! { dg-additional-options "-ftree-parallelize-loops=32" }
>>   ! The "avoid offloading" warning is only triggered for -O2 and higher.
>>   ! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }
>

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PING^3][PATCH, 12/16] Handle acc loop directive
  2016-02-12 11:11           ` Tom de Vries
@ 2016-02-22 10:55             ` Tom de Vries
  2016-02-22 10:58               ` Jakub Jelinek
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2016-02-22 10:55 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc-patches, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 2390 bytes --]

On 12/02/16 12:10, Tom de Vries wrote:
> On 26/01/16 13:49, Jakub Jelinek wrote:
>> On Tue, Jan 26, 2016 at 01:38:39PM +0100, Tom de Vries wrote:
>>> Ping^3. ( https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01089.html )
>>
>> First of all, I wonder if it wouldn't be far easier to handle these during
>> gimplification rather than during omp expansion or during parsing. Inside
>> kernels, do you need to honor any clauses on the acc loop, like
>> privatization etc., or can you just ignore it altogether (after parsing
>> them to ensure they are valid)?
>
> The oacc loop clauses are: gang, worker, vector, seq, auto, tile,
> device_type, independent, private, reduction.
>
> AFAIU, they're all safe to ignore. That has largely been the approach
> in the gomp-4_0-branch, and so far I haven't seen any failures due to
> ignoring a loop clause in a kernels region.
>
> But we do want to be able to honor loop clauses in a kernels region at
> some point. For instance, supporting the independent clause would allow
> more test-cases to be parallelized.
>
> At some point we had an implementation of the independent clause in the
> gomp-4_0-branch, but that had to be reverted (
> https://gcc.gnu.org/ml/gcc-patches/2015-11/msg00696.html ).
>
> Anyway, the implementation of the propagation of the independent
> property was to keep the loop directive with the independent clause
> until omp-expand (where we have cfg), and set a new field
> marked_independent in the corresponding struct loop.
>
> If we want to do the expansion of the loop directive to a normal loop at
> gimplification, I see two issues:
> - in general, we don't only check for correctness during parsing,
>    there's also checking being done during scan_omp, which happens in
>    pass_lower_omp, after gimplification.
> - how do we mark the new loop as being independent?
>
>> Handling this in expand_omp_for_generic is not really nice, because it will
>> make an already very complicated function even more complex.
>
> An alternative would be to copy expand_omp_for_generic, apply the patch,
> and partially evaluate for the single call introduced in the patch.
>
> Do you prefer this approach?

Jakub,

Following up on your suggestion to implement this during gimplification, 
I wrote the attached patch.
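
To illustrate the shape the patch produces, here is a hand-written C sketch of a two-deep loop nest in the goto-based form described in the comment in gimplify_omp_for_seq. The function names and the loop body are hypothetical; the real code emits gimple, not C source.

```c
#include <assert.h>

/* Hypothetical example: sum of i * j over a 2-deep nest, written in the
   lowered form.  Each level has an entry label (condition check), a
   fall-through label (loop body) and an exit label.  */
int
sum_products_lowered (int n, int m)
{
  int sum = 0;
  int i, j;

  assert (n >= 0 && m >= 0);

  i = 0;                        /* OMP_FOR_INIT[0] */
  goto outer_entry;
outer_fall_thru:
  j = 0;                        /* OMP_FOR_INIT[1] */
  goto inner_entry;
inner_fall_thru:
  sum += i * j;                 /* OMP_FOR_BODY */
  j++;                          /* OMP_FOR_INCR[1] */
inner_entry:
  if (j < m)                    /* OMP_FOR_COND[1] */
    goto inner_fall_thru;
  else
    goto inner_exit;
inner_exit:
  i++;                          /* OMP_FOR_INCR[0] */
outer_entry:
  if (i < n)                    /* OMP_FOR_COND[0] */
    goto outer_fall_thru;
  else
    goto outer_exit;
outer_exit:
  return sum;
}

/* The same computation as ordinary nested for loops, for comparison.  */
int
sum_products_reference (int n, int m)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
      sum += i * j;
  return sum;
}
```

Both functions return the same value for any non-negative n and m. The condition is tested at the entry label before the first iteration, so a zero-trip loop never executes its body.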

I'll put it through some openacc testing and add testcases. Is this 
approach acceptable for stage4?

Thanks,
- Tom

[-- Attachment #2: 0001-Ignore-acc-loop-directive-in-kernels-region.patch --]
[-- Type: text/x-patch, Size: 3514 bytes --]

Ignore acc loop directive in kernels region

---
 gcc/gimplify.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)

diff --git a/gcc/gimplify.c b/gcc/gimplify.c
index 7be6bd7..cec0627 100644
--- a/gcc/gimplify.c
+++ b/gcc/gimplify.c
@@ -8364,6 +8364,82 @@ find_combined_omp_for (tree *tp, int *walk_subtrees, void *)
   return NULL_TREE;
 }
 
+/* Gimplify the loops with index I and higher in omp_for FOR_STMT as a
+   sequential loop, and append the resulting gimple statements to PRE_P.  */
+
+static void
+gimplify_omp_for_seq (tree for_stmt, gimple_seq *pre_p, unsigned int i)
+{
+  gcc_assert (OMP_FOR_ORIG_DECLS (for_stmt) == NULL_TREE);
+  unsigned int len = TREE_VEC_LENGTH (OMP_FOR_INIT (for_stmt));
+  gcc_assert (i < len);
+
+  /* Gimplify OMP_FOR[i] as:
+
+     if (i == 0)
+       OMP_FOR_PRE_BODY;
+     OMP_FOR_INIT[i];
+     goto <loop_entry_label>;
+     <fall_thru_label>:
+     if (i == len - 1)
+       OMP_FOR_BODY;
+     else
+       OMP_FOR[i+1];
+     OMP_FOR_INCR[i];
+     <loop_entry_label>:
+     if (OMP_FOR_COND[i])
+       goto <fall_thru_label>;
+     else
+       goto <loop_exit_label>;
+     <loop_exit_label>:
+  */
+
+  tree loop_entry_label = create_artificial_label (UNKNOWN_LOCATION);
+  tree fall_thru_label = create_artificial_label (UNKNOWN_LOCATION);
+  tree loop_exit_label = create_artificial_label (UNKNOWN_LOCATION);
+
+  /* if (i == 0) OMP_FOR_PRE_BODY.  */
+  if (i == 0)
+    gimplify_and_add (OMP_FOR_PRE_BODY (for_stmt), pre_p);
+
+  /* OMP_FOR_INIT[i].  */
+  tree init = TREE_VEC_ELT (OMP_FOR_INIT (for_stmt), i);
+  gimplify_stmt (&init, pre_p);
+
+  /* goto <loop_entry_label>.  */
+  gimplify_seq_add_stmt (pre_p, gimple_build_goto (loop_entry_label));
+
+  /* <fall_thru_label>.  */
+  gimplify_seq_add_stmt (pre_p, gimple_build_label (fall_thru_label));
+
+  /* if (i == len - 1) OMP_FOR_BODY
+     else OMP_FOR[i+1].  */
+  if (i == len - 1)
+    gimplify_and_return_first (OMP_FOR_BODY (for_stmt), pre_p);
+  else
+    gimplify_omp_for_seq (for_stmt, pre_p, i + 1);
+
+  /* OMP_FOR_INCR[i].  */
+  tree incr = TREE_VEC_ELT (OMP_FOR_INCR (for_stmt), i);
+  gimplify_stmt (&incr, pre_p);
+
+  /* <loop_entry_label>.  */
+  gimplify_seq_add_stmt (pre_p, gimple_build_label (loop_entry_label));
+
+  /* if (OMP_FOR_COND[i]) goto <fall_thru_label>
+     else goto <loop_exit_label>.  */
+  tree cond = TREE_VEC_ELT (OMP_FOR_COND (for_stmt), i);
+  tree var = TREE_OPERAND (cond, 0);
+  tree final_val = TREE_OPERAND (cond, 1);
+  gimplify_expr (&final_val, pre_p, NULL, is_gimple_val, fb_rvalue);
+  gimple *gimple_cond = gimple_build_cond (TREE_CODE (cond), var, final_val,
+					   fall_thru_label, loop_exit_label);
+  gimplify_seq_add_stmt (pre_p, gimple_cond);
+
+  /* <loop_exit_label>.  */
+  gimplify_seq_add_stmt (pre_p, gimple_build_label (loop_exit_label));
+}
+
 /* Gimplify the gross structure of an OMP_FOR statement.  */
 
 static enum gimplify_status
@@ -8403,6 +8479,15 @@ gimplify_omp_for (tree *expr_p, gimple_seq *pre_p)
       gcc_unreachable ();
     }
 
+  if (ort == ORT_ACC
+      && gimplify_omp_ctxp != NULL
+      && gimplify_omp_ctxp->region_type == ORT_ACC_KERNELS)
+    {
+      /* For now, ignore loop directive in kernels region.  */
+      gimplify_omp_for_seq (for_stmt, pre_p, 0);
+      return GS_ALL_DONE;
+    }
+
   /* Set OMP_CLAUSE_LINEAR_NO_COPYIN flag on explicit linear
      clause for the IV.  */
   if (ort == ORT_SIMD && TREE_VEC_LENGTH (OMP_FOR_INIT (for_stmt)) == 1)

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PING^3][PATCH, 12/16] Handle acc loop directive
  2016-02-22 10:55             ` Tom de Vries
@ 2016-02-22 10:58               ` Jakub Jelinek
  2016-02-29  3:27                 ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Jakub Jelinek @ 2016-02-22 10:58 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Richard Biener

On Mon, Feb 22, 2016 at 11:54:46AM +0100, Tom de Vries wrote:
> Following up on your suggestion to implement this during gimplification, I
> wrote the attached patch.
> 
> I'll put it through some openacc testing and add testcases. Is this approach
> acceptable for stage4?

LGTM.

>  gcc/gimplify.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 85 insertions(+)
> 
> diff --git a/gcc/gimplify.c b/gcc/gimplify.c
> index 7be6bd7..cec0627 100644
> --- a/gcc/gimplify.c
> +++ b/gcc/gimplify.c
> @@ -8364,6 +8364,82 @@ find_combined_omp_for (tree *tp, int *walk_subtrees, void *)
>    return NULL_TREE;
>  }
>  
> +/* Gimplify the loops with index I and higher in omp_for FOR_STMT as a
> +   sequential loop, and append the resulting gimple statements to PRE_P.  */
> +
> +static void
> +gimplify_omp_for_seq (tree for_stmt, gimple_seq *pre_p, unsigned int i)
> +{
> +  gcc_assert (OMP_FOR_ORIG_DECLS (for_stmt) == NULL_TREE);
> +  unsigned int len = TREE_VEC_LENGTH (OMP_FOR_INIT (for_stmt));
> +  gcc_assert (i < len);
> +
> +  /* Gimplify OMP_FOR[i] as:
> +
> +     if (i == 0)
> +       OMP_FOR_PRE_BODY;
> +     OMP_FOR_INIT[i];
> +     goto <loop_entry_label>;
> +     <fall_thru_label>:
> +     if (i == len - 1)
> +       OMP_FOR_BODY;
> +     else
> +       OMP_FOR[i+1];
> +     OMP_FOR_INCR[i];
> +     <loop_entry_label>:
> +     if (OMP_FOR_COND[i])
> +       goto <fall_thru_label>;
> +     else
> +       goto <loop_exit_label>;
> +     <loop_exit_label>:
> +  */
> +
> +  tree loop_entry_label = create_artificial_label (UNKNOWN_LOCATION);
> +  tree fall_thru_label = create_artificial_label (UNKNOWN_LOCATION);
> +  tree loop_exit_label = create_artificial_label (UNKNOWN_LOCATION);
> +
> +  /* if (i == 0) OMP_FOR_PRE_BODY.  */
> +  if (i == 0)
> +    gimplify_and_add (OMP_FOR_PRE_BODY (for_stmt), pre_p);
> +
> +  /* OMP_FOR_INIT[i].  */
> +  tree init = TREE_VEC_ELT (OMP_FOR_INIT (for_stmt), i);
> +  gimplify_stmt (&init, pre_p);
> +
> +  /* goto <loop_entry_label>.  */
> +  gimplify_seq_add_stmt (pre_p, gimple_build_goto (loop_entry_label));
> +
> +  /* <fall_thru_label>.  */
> +  gimplify_seq_add_stmt (pre_p, gimple_build_label (fall_thru_label));
> +
> +  /* if (i == len - 1) OMP_FOR_BODY
> +     else OMP_FOR[i+1].  */
> +  if (i == len - 1)
> +    gimplify_and_return_first (OMP_FOR_BODY (for_stmt), pre_p);
> +  else
> +    gimplify_omp_for_seq (for_stmt, pre_p, i + 1);
> +
> +  /* OMP_FOR_INCR[i].  */
> +  tree incr = TREE_VEC_ELT (OMP_FOR_INCR (for_stmt), i);
> +  gimplify_stmt (&incr, pre_p);
> +
> +  /* <loop_entry_label>.  */
> +  gimplify_seq_add_stmt (pre_p, gimple_build_label (loop_entry_label));
> +
> +  /* if (OMP_FOR_COND[i]) goto <fall_thru_label>
> +     else goto <loop_exit_label>.  */
> +  tree cond = TREE_VEC_ELT (OMP_FOR_COND (for_stmt), i);
> +  tree var = TREE_OPERAND (cond, 0);
> +  tree final_val = TREE_OPERAND (cond, 1);
> +  gimplify_expr (&final_val, pre_p, NULL, is_gimple_val, fb_rvalue);
> +  gimple *gimple_cond = gimple_build_cond (TREE_CODE (cond), var, final_val,
> +					   fall_thru_label, loop_exit_label);
> +  gimplify_seq_add_stmt (pre_p, gimple_cond);
> +
> +  /* <loop_exit_label>.  */
> +  gimplify_seq_add_stmt (pre_p, gimple_build_label (loop_exit_label));
> +}
> +
>  /* Gimplify the gross structure of an OMP_FOR statement.  */
>  
>  static enum gimplify_status
> @@ -8403,6 +8479,15 @@ gimplify_omp_for (tree *expr_p, gimple_seq *pre_p)
>        gcc_unreachable ();
>      }
>  
> +  if (ort == ORT_ACC
> +      && gimplify_omp_ctxp != NULL
> +      && gimplify_omp_ctxp->region_type == ORT_ACC_KERNELS)
> +    {
> +      /* For now, ignore loop directive in kernels region.  */
> +      gimplify_omp_for_seq (for_stmt, pre_p, 0);
> +      return GS_ALL_DONE;
> +    }
> +
>    /* Set OMP_CLAUSE_LINEAR_NO_COPYIN flag on explicit linear
>       clause for the IV.  */
>    if (ort == ORT_SIMD && TREE_VEC_LENGTH (OMP_FOR_INIT (for_stmt)) == 1)


	Jakub

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: Use plain -fopenacc to enable OpenACC kernels processing
  2016-02-15 16:54       ` Tom de Vries
@ 2016-02-23 15:19         ` Thomas Schwinge
  0 siblings, 0 replies; 133+ messages in thread
From: Thomas Schwinge @ 2016-02-23 15:19 UTC (permalink / raw)
  To: Tom de Vries, Nathan Sidwell, gcc-patches
  Cc: Jakub Jelinek, Bernd Schmidt, Richard Biener

Hi!

On Mon, 15 Feb 2016 17:53:58 +0100, Tom de Vries <Tom_deVries@mentor.com> wrote:
> On 10/02/16 15:40, Thomas Schwinge wrote:
> > On Fri, 5 Feb 2016 13:06:17 +0100, I wrote:
> >> On Mon, 9 Nov 2015 18:39:19 +0100, Tom de Vries <Tom_deVries@mentor.com> wrote:
> >>> On 09/11/15 16:35, Tom de Vries wrote:
> >>>> this patch series for stage1 trunk adds support to:
> >>>> - parallelize oacc kernels regions using parloops, and
> >>>> - map the loops onto the oacc gang dimension.
> >>
> >>> Atm, the parallelization behaviour for the kernels region is controlled
> >>> by flag_tree_parallelize_loops, which is also used to control generic
> >>> auto-parallelization by autopar using omp. That is not ideal, and we may
> >>> want a separate flag (or param) to control the behaviour for oacc
> >>> kernels, f.i. -foacc-kernels-gang-parallelize=<n>. I'm open to suggestions.
> >>
> >> I suggest using plain -fopenacc to enable OpenACC kernels processing
> >> (which just makes sense, I hope) ;-) and have later processing stages
> >> determine the actual parametrization (currently: number of gangs) (that
> >> is, Nathan's recent "Default compute dimensions" patches).
> 
> That makes a lot of sense.  Thanks for working on this.
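
To make the effect concrete, here is a small sketch of the kind of code this
is about, modelled loosely on the kernels-loop tests.  The names vec_add and
N are made up for this example.  The assumption is that, with this change, a
loop like this in a kernels region is considered for parallelization when
compiling with -O2 -fopenacc alone, without also passing
-ftree-parallelize-loops=<n>.

```c
#define N 1024

/* Hypothetical example, not one of the testsuite files: element-wise
   vector add inside an OpenACC kernels region.  Without -fopenacc the
   pragma is ignored and the loop simply runs sequentially on the host.  */
void
vec_add (const unsigned int *a, const unsigned int *b, unsigned int *c)
{
#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
  {
    for (unsigned int i = 0; i < N; i++)
      c[i] = a[i] + b[i];
  }
}
```

The result is the same whether or not the region is offloaded; only where
and how the loop executes differs.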

> >> Originally, I wanted to use:
> >>
> >>      OMP_CLAUSE_NUM_GANGS_EXPR (clause) = build_int_cst (integer_type_node, n_threads == 0 ? -1 : n_threads);
> >>
> >> ... to store -1 "have the compiler decide" (instead of now 0 "have the
> >> run-time decide", which might prevent some code optimizations, as I
> >> understand it) for the n_threads == 0 case, but it seems that for an
> >> offloaded OpenACC kernels region, gcc/omp-low.c:oacc_validate_dims is
> >> called with the parameter "used" set to 0 instead of "gang", and then the
> >> "Default anything left to 1 or a partitioned default" logic will default
> >> dims["gang"] to oacc_min_dims["gang"] (that is, 1) instead of the
> >> oacc_default_dims["gang"] (that is, 32).  Nathan, does that smell like a
> >> bug (and could you look into that)?

<https://gcc.gnu.org/PR69921> filed.  (Nathan?)
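
As a rough illustration of the behaviour described in the quoted text, here
is a toy model of the defaulting step.  It is not the code from
gcc/omp-low.c: the toy_* names, the DIM_* enum and the table values other
than the gang entries (1 and 32, as quoted above) are assumptions made for
this sketch.

```c
/* Toy model only; names mirror the discussion but this is not GCC code.  */
enum { DIM_GANG, DIM_WORKER, DIM_VECTOR, DIM_MAX };
#define DIM_MASK(x) (1u << (x))

/* Gang values are the ones quoted above; the others are placeholders.  */
static const int toy_min_dims[DIM_MAX] = { 1, 1, 1 };
static const int toy_default_dims[DIM_MAX] = { 32, 32, 32 };

/* Replace every "compiler decides" (-1) entry in DIMS.  USED is a bit mask
   of the dimensions the region is partitioned over: a used dimension gets
   the partitioned default, an unused one gets the minimum.  */
void
toy_validate_dims (int dims[DIM_MAX], unsigned used)
{
  for (int ix = 0; ix != DIM_MAX; ix++)
    if (dims[ix] < 0)
      dims[ix] = (used & DIM_MASK (ix)) ? toy_default_dims[ix]
                                        : toy_min_dims[ix];
}
```

In this model, calling toy_validate_dims with used == 0 turns a gang entry
of -1 into 1, whereas passing DIM_MASK (DIM_GANG) turns it into 32, which is
the difference the report is about.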

> >> --- gcc/tree-parloops.c
> >> +++ gcc/tree-parloops.c

> The oacc-parloops changes look good to me. I approve them for 6.0 stage 
> 4 (given that using the -ftree-parallelize-loops=<n> flag for oacc 
> kernels parallelization was just a placeholder waiting to be 
> replaced by an oacc-based approach). [ And I'd expect that the 
> tree-ssa-loop.c changes and the mechanical testsuite changes can be 
> regarded as trivial. ]

Thanks; committed (without changes) in r233634:

commit 3a37a410bbfed45d04f06887c348938182369d5a
Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Tue Feb 23 15:07:54 2016 +0000

    Use plain -fopenacc to enable OpenACC kernels processing
    
    	gcc/
    	* tree-parloops.c (create_parallel_loop, gen_parallel_loop)
    	(parallelize_loops): In OpenACC kernels mode, set n_threads to
    	zero.
    	(pass_parallelize_loops::gate): In OpenACC kernels mode, gate on
    	flag_openacc.
    	* tree-ssa-loop.c (gate_oacc_kernels): Likewise.
    	gcc/testsuite/
    	* c-c++-common/goacc/kernels-counter-vars-function-scope.c: Adjust
    	to -ftree-parallelize-loops/-fopenacc changes.
    	* c-c++-common/goacc/kernels-double-reduction-n.c: Likewise.
    	* c-c++-common/goacc/kernels-double-reduction.c: Likewise.
    	* c-c++-common/goacc/kernels-loop-2.c: Likewise.
    	* c-c++-common/goacc/kernels-loop-3.c: Likewise.
    	* c-c++-common/goacc/kernels-loop-g.c: Likewise.
    	* c-c++-common/goacc/kernels-loop-mod-not-zero.c: Likewise.
    	* c-c++-common/goacc/kernels-loop-n.c: Likewise.
    	* c-c++-common/goacc/kernels-loop-nest.c: Likewise.
    	* c-c++-common/goacc/kernels-loop.c: Likewise.
    	* c-c++-common/goacc/kernels-one-counter-var.c: Likewise.
    	* c-c++-common/goacc/kernels-reduction.c: Likewise.
    	* gfortran.dg/goacc/kernels-loop-inner.f95: Likewise.
    	* gfortran.dg/goacc/kernels-loops-adjacent.f95: Likewise.
    	libgomp/
    	* oacc-parallel.c (GOACC_parallel_keyed): Initialize dims.
    	* plugin/plugin-nvptx.c (nvptx_exec): Provide default values for
    	dims.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c: Adjust to
    	-ftree-parallelize-loops/-fopenacc changes.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c:
    	Likewise.
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@233634 138bc75d-0d04-0410-961f-82ee72b054a4
---
 gcc/ChangeLog                                      |  9 ++++++
 gcc/testsuite/ChangeLog                            | 18 ++++++++++++
 .../goacc/kernels-counter-vars-function-scope.c    |  3 +-
 .../goacc/kernels-double-reduction-n.c             |  3 +-
 .../c-c++-common/goacc/kernels-double-reduction.c  |  3 +-
 gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c  |  3 +-
 gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c  |  4 +--
 gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c  |  4 +--
 .../c-c++-common/goacc/kernels-loop-mod-not-zero.c |  3 +-
 gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c  |  4 +--
 .../c-c++-common/goacc/kernels-loop-nest.c         |  3 +-
 gcc/testsuite/c-c++-common/goacc/kernels-loop.c    |  4 +--
 .../c-c++-common/goacc/kernels-one-counter-var.c   |  4 +--
 .../c-c++-common/goacc/kernels-reduction.c         |  4 +--
 .../gfortran.dg/goacc/kernels-loop-inner.f95       |  1 -
 .../gfortran.dg/goacc/kernels-loops-adjacent.f95   |  1 -
 gcc/tree-parloops.c                                | 25 ++++++++++++++---
 gcc/tree-ssa-loop.c                                |  7 ++---
 libgomp/ChangeLog                                  | 32 ++++++++++++++++++++++
 libgomp/oacc-parallel.c                            |  4 +++
 libgomp/plugin/plugin-nvptx.c                      | 18 ++++++++++--
 .../libgomp.oacc-c-c++-common/kernels-loop-2.c     |  3 --
 .../libgomp.oacc-c-c++-common/kernels-loop-3.c     |  3 --
 .../kernels-loop-and-seq-2.c                       |  3 --
 .../kernels-loop-and-seq-3.c                       |  3 --
 .../kernels-loop-and-seq-4.c                       |  3 --
 .../kernels-loop-and-seq-5.c                       |  3 --
 .../kernels-loop-and-seq-6.c                       |  3 --
 .../kernels-loop-and-seq.c                         |  3 --
 .../kernels-loop-collapse.c                        |  3 --
 .../libgomp.oacc-c-c++-common/kernels-loop-g.c     |  2 --
 .../kernels-loop-mod-not-zero.c                    |  3 --
 .../libgomp.oacc-c-c++-common/kernels-loop-n.c     |  3 --
 .../libgomp.oacc-c-c++-common/kernels-loop-nest.c  |  3 --
 .../libgomp.oacc-c-c++-common/kernels-loop.c       |  3 --
 .../libgomp.oacc-c-c++-common/kernels-reduction.c  |  3 --
 36 files changed, 114 insertions(+), 87 deletions(-)

diff --git gcc/ChangeLog gcc/ChangeLog
index ce8d366..0b2149d 100644
--- gcc/ChangeLog
+++ gcc/ChangeLog
@@ -1,3 +1,12 @@
+2016-02-23  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* tree-parloops.c (create_parallel_loop, gen_parallel_loop)
+	(parallelize_loops): In OpenACC kernels mode, set n_threads to
+	zero.
+	(pass_parallelize_loops::gate): In OpenACC kernels mode, gate on
+	flag_openacc.
+	* tree-ssa-loop.c (gate_oacc_kernels): Likewise.
+
 2016-02-23  Richard Biener  <rguenther@suse.de>
 
 	* mem-stats.h (struct mem_usage): Use PRIu64 for printing size_t.
diff --git gcc/testsuite/ChangeLog gcc/testsuite/ChangeLog
index 60372ce..17cf40c 100644
--- gcc/testsuite/ChangeLog
+++ gcc/testsuite/ChangeLog
@@ -1,3 +1,21 @@
+2016-02-23  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* c-c++-common/goacc/kernels-counter-vars-function-scope.c: Adjust
+	to -ftree-parallelize-loops/-fopenacc changes.
+	* c-c++-common/goacc/kernels-double-reduction-n.c: Likewise.
+	* c-c++-common/goacc/kernels-double-reduction.c: Likewise.
+	* c-c++-common/goacc/kernels-loop-2.c: Likewise.
+	* c-c++-common/goacc/kernels-loop-3.c: Likewise.
+	* c-c++-common/goacc/kernels-loop-g.c: Likewise.
+	* c-c++-common/goacc/kernels-loop-mod-not-zero.c: Likewise.
+	* c-c++-common/goacc/kernels-loop-n.c: Likewise.
+	* c-c++-common/goacc/kernels-loop-nest.c: Likewise.
+	* c-c++-common/goacc/kernels-loop.c: Likewise.
+	* c-c++-common/goacc/kernels-one-counter-var.c: Likewise.
+	* c-c++-common/goacc/kernels-reduction.c: Likewise.
+	* gfortran.dg/goacc/kernels-loop-inner.f95: Likewise.
+	* gfortran.dg/goacc/kernels-loops-adjacent.f95: Likewise.
+
 2016-02-23  Rainer Orth  <ro@CeBiTec.Uni-Bielefeld.DE>
 
 	* gcc.target/i386/chkp-hidden-def.c: Require alias support.
diff --git gcc/testsuite/c-c++-common/goacc/kernels-counter-vars-function-scope.c gcc/testsuite/c-c++-common/goacc/kernels-counter-vars-function-scope.c
index e8b5357..17f240e 100644
--- gcc/testsuite/c-c++-common/goacc/kernels-counter-vars-function-scope.c
+++ gcc/testsuite/c-c++-common/goacc/kernels-counter-vars-function-scope.c
@@ -1,5 +1,4 @@
 /* { dg-additional-options "-O2" } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-fdump-tree-parloops1-all" } */
 /* { dg-additional-options "-fdump-tree-optimized" } */
 
@@ -51,4 +50,4 @@ main (void)
 /* Check that the loop has been split off into a function.  */
 /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
 
-/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
diff --git gcc/testsuite/c-c++-common/goacc/kernels-double-reduction-n.c gcc/testsuite/c-c++-common/goacc/kernels-double-reduction-n.c
index c39d674..750f576 100644
--- gcc/testsuite/c-c++-common/goacc/kernels-double-reduction-n.c
+++ gcc/testsuite/c-c++-common/goacc/kernels-double-reduction-n.c
@@ -1,5 +1,4 @@
 /* { dg-additional-options "-O2" } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-fdump-tree-parloops1-all" } */
 /* { dg-additional-options "-fdump-tree-optimized" } */
 
@@ -34,4 +33,4 @@ foo (unsigned int n)
 /* Check that the loop has been split off into a function.  */
 /* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
 
-/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
diff --git gcc/testsuite/c-c++-common/goacc/kernels-double-reduction.c gcc/testsuite/c-c++-common/goacc/kernels-double-reduction.c
index 3501d0d..df60d6a 100644
--- gcc/testsuite/c-c++-common/goacc/kernels-double-reduction.c
+++ gcc/testsuite/c-c++-common/goacc/kernels-double-reduction.c
@@ -1,5 +1,4 @@
 /* { dg-additional-options "-O2" } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-fdump-tree-parloops1-all" } */
 /* { dg-additional-options "-fdump-tree-optimized" } */
 
@@ -34,4 +33,4 @@ foo (void)
 /* Check that the loop has been split off into a function.  */
 /* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
 
-/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c
index f97584d..913d91f 100644
--- gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c
+++ gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c
@@ -1,5 +1,4 @@
 /* { dg-additional-options "-O2" } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-fdump-tree-parloops1-all" } */
 /* { dg-additional-options "-fdump-tree-optimized" } */
 
@@ -67,4 +66,4 @@ main (void)
 /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.1" 1 "optimized" } } */
 /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.2" 1 "optimized" } } */
 
-/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 3 "parloops1" } } */
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 3 "parloops1" } } */
diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c
index 530d62a..1822d2a 100644
--- gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c
+++ gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c
@@ -1,5 +1,4 @@
 /* { dg-additional-options "-O2" } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-fdump-tree-parloops1-all" } */
 /* { dg-additional-options "-fdump-tree-optimized" } */
 
@@ -45,5 +44,4 @@ main (void)
 /* Check that the loop has been split off into a function.  */
 /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
 
-/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
-
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c
index 4f1c2c5..e946319 100644
--- gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c
+++ gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c
@@ -1,6 +1,5 @@
 /* { dg-additional-options "-O2" } */
 /* { dg-additional-options "-g" } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-fdump-tree-parloops1-all" } */
 /* { dg-additional-options "-fdump-tree-optimized" } */
 
@@ -13,5 +12,4 @@
 /* Check that the loop has been split off into a function.  */
 /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
 
-/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
-
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop-mod-not-zero.c gcc/testsuite/c-c++-common/goacc/kernels-loop-mod-not-zero.c
index 151db51..9b63b45 100644
--- gcc/testsuite/c-c++-common/goacc/kernels-loop-mod-not-zero.c
+++ gcc/testsuite/c-c++-common/goacc/kernels-loop-mod-not-zero.c
@@ -1,5 +1,4 @@
 /* { dg-additional-options "-O2" } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-fdump-tree-parloops1-all" } */
 /* { dg-additional-options "-fdump-tree-optimized" } */
 
@@ -49,4 +48,4 @@ main (void)
 /* Check that the loop has been split off into a function.  */
 /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
 
-/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c
index bee5f5a..279f797 100644
--- gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c
+++ gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c
@@ -1,5 +1,4 @@
 /* { dg-additional-options "-O2" } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-fdump-tree-parloops1-all" } */
 /* { dg-additional-options "-fdump-tree-optimized" } */
 
@@ -52,5 +51,4 @@ foo (COUNTERTYPE n)
 /* Check that the loop has been split off into a function.  */
 /* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
 
-/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
-
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop-nest.c gcc/testsuite/c-c++-common/goacc/kernels-loop-nest.c
index ea0e342..db1071f 100644
--- gcc/testsuite/c-c++-common/goacc/kernels-loop-nest.c
+++ gcc/testsuite/c-c++-common/goacc/kernels-loop-nest.c
@@ -1,5 +1,4 @@
 /* { dg-additional-options "-O2" } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-fdump-tree-parloops1-all" } */
 /* { dg-additional-options "-fdump-tree-optimized" } */
 
@@ -36,4 +35,4 @@ main (void)
 /* Check that the loop has been split off into a function.  */
 /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
 
-/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
diff --git gcc/testsuite/c-c++-common/goacc/kernels-loop.c gcc/testsuite/c-c++-common/goacc/kernels-loop.c
index ab5dfb9..abf7a3c 100644
--- gcc/testsuite/c-c++-common/goacc/kernels-loop.c
+++ gcc/testsuite/c-c++-common/goacc/kernels-loop.c
@@ -1,5 +1,4 @@
 /* { dg-additional-options "-O2" } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-fdump-tree-parloops1-all" } */
 /* { dg-additional-options "-fdump-tree-optimized" } */
 
@@ -52,5 +51,4 @@ main (void)
 /* Check that the loop has been split off into a function.  */
 /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
 
-/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
-
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
diff --git gcc/testsuite/c-c++-common/goacc/kernels-one-counter-var.c gcc/testsuite/c-c++-common/goacc/kernels-one-counter-var.c
index b16a8cd..95f4817 100644
--- gcc/testsuite/c-c++-common/goacc/kernels-one-counter-var.c
+++ gcc/testsuite/c-c++-common/goacc/kernels-one-counter-var.c
@@ -1,5 +1,4 @@
 /* { dg-additional-options "-O2" } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-fdump-tree-parloops1-all" } */
 /* { dg-additional-options "-fdump-tree-optimized" } */
 
@@ -50,5 +49,4 @@ main (void)
 /* Check that the loop has been split off into a function.  */
 /* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
 
-/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
-
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
diff --git gcc/testsuite/c-c++-common/goacc/kernels-reduction.c gcc/testsuite/c-c++-common/goacc/kernels-reduction.c
index 61c5df3..6f5a418 100644
--- gcc/testsuite/c-c++-common/goacc/kernels-reduction.c
+++ gcc/testsuite/c-c++-common/goacc/kernels-reduction.c
@@ -1,5 +1,4 @@
 /* { dg-additional-options "-O2" } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-fdump-tree-parloops1-all" } */
 /* { dg-additional-options "-fdump-tree-optimized" } */
 
@@ -32,5 +31,4 @@ foo (void)
 /* Check that the loop has been split off into a function.  */
 /* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
 
-/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops1" } } */
-
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } } */
diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop-inner.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop-inner.f95
index 4db3a50..3334741 100644
--- gcc/testsuite/gfortran.dg/goacc/kernels-loop-inner.f95
+++ gcc/testsuite/gfortran.dg/goacc/kernels-loop-inner.f95
@@ -1,5 +1,4 @@
 ! { dg-additional-options "-O2" }
-! { dg-additional-options "-ftree-parallelize-loops=32" }
 
 program main
    implicit none
diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loops-adjacent.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loops-adjacent.f95
index fef3d10..fb92da8 100644
--- gcc/testsuite/gfortran.dg/goacc/kernels-loops-adjacent.f95
+++ gcc/testsuite/gfortran.dg/goacc/kernels-loops-adjacent.f95
@@ -1,5 +1,4 @@
 ! { dg-additional-options "-O2" }
-! { dg-additional-options "-ftree-parallelize-loops=10" }
 
 program main
    implicit none
diff --git gcc/tree-parloops.c gcc/tree-parloops.c
index 139e38c..e498e5b 100644
--- gcc/tree-parloops.c
+++ gcc/tree-parloops.c
@@ -2016,7 +2016,8 @@ transform_to_exit_first_loop (struct loop *loop,
 /* Create the parallel constructs for LOOP as described in gen_parallel_loop.
    LOOP_FN and DATA are the arguments of GIMPLE_OMP_PARALLEL.
    NEW_DATA is the variable that should be initialized from the argument
-   of LOOP_FN.  N_THREADS is the requested number of threads.  */
+   of LOOP_FN.  N_THREADS is the requested number of threads, which can be 0 if
+   that number is to be determined later.  */
 
 static void
 create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
@@ -2049,6 +2050,7 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
       basic_block paral_bb = single_pred (bb);
       gsi = gsi_last_bb (paral_bb);
 
+      gcc_checking_assert (n_threads != 0);
       t = build_omp_clause (loc, OMP_CLAUSE_NUM_THREADS);
       OMP_CLAUSE_NUM_THREADS_EXPR (t)
 	= build_int_cst (integer_type_node, n_threads);
@@ -2221,7 +2223,8 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
 }
 
 /* Generates code to execute the iterations of LOOP in N_THREADS
-   threads in parallel.
+   threads in parallel, which can be 0 if that number is to be determined
+   later.
 
    NITER describes number of iterations of LOOP.
    REDUCTION_LIST describes the reductions existent in the LOOP.  */
@@ -2318,6 +2321,7 @@ gen_parallel_loop (struct loop *loop,
       else
 	m_p_thread=MIN_PER_THREAD;
 
+      gcc_checking_assert (n_threads != 0);
       many_iterations_cond =
 	fold_build2 (GE_EXPR, boolean_type_node,
 		     nit, build_int_cst (type, m_p_thread * n_threads));
@@ -3177,7 +3181,7 @@ oacc_entry_exit_ok (struct loop *loop,
 static bool
 parallelize_loops (bool oacc_kernels_p)
 {
-  unsigned n_threads = flag_tree_parallelize_loops;
+  unsigned n_threads;
   bool changed = false;
   struct loop *loop;
   struct loop *skip_loop = NULL;
@@ -3199,6 +3203,13 @@ parallelize_loops (bool oacc_kernels_p)
   if (cfun->has_nonlocal_label)
     return false;
 
+  /* For OpenACC kernels, n_threads will be determined later; otherwise, it's
+     the argument to -ftree-parallelize-loops.  */
+  if (oacc_kernels_p)
+    n_threads = 0;
+  else
+    n_threads = flag_tree_parallelize_loops;
+
   gcc_obstack_init (&parloop_obstack);
   reduction_info_table_type reduction_list (10);
 
@@ -3361,7 +3372,13 @@ public:
   {}
 
   /* opt_pass methods: */
-  virtual bool gate (function *) { return flag_tree_parallelize_loops > 1; }
+  virtual bool gate (function *)
+  {
+    if (oacc_kernels_p)
+      return flag_openacc;
+    else
+      return flag_tree_parallelize_loops > 1;
+  }
   virtual unsigned int execute (function *);
   opt_pass * clone () { return new pass_parallelize_loops (m_ctxt); }
   void set_pass_param (unsigned int n, bool param)
diff --git gcc/tree-ssa-loop.c gcc/tree-ssa-loop.c
index bdbade5..4c39fbc 100644
--- gcc/tree-ssa-loop.c
+++ gcc/tree-ssa-loop.c
@@ -148,7 +148,7 @@ make_pass_tree_loop (gcc::context *ctxt)
 static bool
 gate_oacc_kernels (function *fn)
 {
-  if (flag_tree_parallelize_loops <= 1)
+  if (!flag_openacc)
     return false;
 
   tree oacc_function_attr = get_oacc_fn_attrib (fn->decl);
@@ -230,10 +230,9 @@ public:
   virtual bool gate (function *)
   {
     return (optimize
-	    /* Don't bother doing anything if the program has errors.  */
-	    && !seen_error ()
 	    && flag_openacc
-	    && flag_tree_parallelize_loops > 1);
+	    /* Don't bother doing anything if the program has errors.  */
+	    && !seen_error ());
   }
 
 }; // class pass_ipa_oacc
diff --git libgomp/ChangeLog libgomp/ChangeLog
index 1394126..e6a7082 100644
--- libgomp/ChangeLog
+++ libgomp/ChangeLog
@@ -1,3 +1,35 @@
+2016-02-23  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* oacc-parallel.c (GOACC_parallel_keyed): Initialize dims.
+	* plugin/plugin-nvptx.c (nvptx_exec): Provide default values for
+	dims.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c: Adjust to
+	-ftree-parallelize-loops/-fopenacc changes.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c: Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c: Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c: Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-loop.c: Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c:
+	Likewise.
+
 2016-02-22  Cesar Philippidis  <cesar@codesourcery.com>
 
 	* testsuite/libgomp.oacc-c-c++-common/vprop.c: New test.
diff --git libgomp/oacc-parallel.c libgomp/oacc-parallel.c
index bc24651..f795bf7 100644
--- libgomp/oacc-parallel.c
+++ libgomp/oacc-parallel.c
@@ -103,6 +103,10 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
       return;
     }
 
+  /* Default: let the runtime choose.  */
+  for (i = 0; i != GOMP_DIM_MAX; i++)
+    dims[i] = 0;
+
   va_start (ap, kinds);
   /* TODO: This will need amending when device_type is implemented.  */
   while ((tag = va_arg (ap, unsigned)) != 0)
diff --git libgomp/plugin/plugin-nvptx.c libgomp/plugin/plugin-nvptx.c
index 7ec1810..3f1bb6d 100644
--- libgomp/plugin/plugin-nvptx.c
+++ libgomp/plugin/plugin-nvptx.c
@@ -894,9 +894,21 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
   /* Initialize the launch dimensions.  Typically this is constant,
      provided by the device compiler, but we must permit runtime
      values.  */
-  for (i = 0; i != 3; i++)
-    if (targ_fn->launch->dim[i])
-      dims[i] = targ_fn->launch->dim[i];
+  int seen_zero = 0;
+  for (i = 0; i != GOMP_DIM_MAX; i++)
+    {
+      if (targ_fn->launch->dim[i])
+       dims[i] = targ_fn->launch->dim[i];
+      if (!dims[i])
+       seen_zero = 1;
+    }
+
+  if (seen_zero)
+    {
+      for (i = 0; i != GOMP_DIM_MAX; i++)
+       if (!dims[i])
+         dims[i] = /* TODO */ 32;
+    }
 
   /* This reserves a chunk of a pre-allocated page of memory mapped on both
      the host and the device. HP is a host pointer to the new chunk, and DP is
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c
index 13e57bd..c7592d6 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c
@@ -1,6 +1,3 @@
-/* { dg-do run } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define N (1024 * 512)
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c
index f61a74a..31114ac 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c
@@ -1,6 +1,3 @@
-/* { dg-do run } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define N (1024 * 512)
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
index 2e4100f..d36592f 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
@@ -1,6 +1,3 @@
-/* { dg-do run } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define N 32
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
index b3e736b..e622971 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
@@ -1,6 +1,3 @@
-/* { dg-do run } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define N 32
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
index 8b9affa..c731278 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
@@ -1,6 +1,3 @@
-/* { dg-do run } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define N 32
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
index 83d4e7f..67dcce2 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
@@ -1,6 +1,3 @@
-/* { dg-do run } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define N 32
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
index 01d5e5e..b8b5dde 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
@@ -1,6 +1,3 @@
-/* { dg-do run } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define N 32
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
index 61d1283..9d9308a 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
@@ -1,6 +1,3 @@
-/* { dg-do run } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define N 32
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
index f7f04cb..997d6c7 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
@@ -1,6 +1,3 @@
-/* { dg-do run } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define N 100
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c
index 96b6e4e..88258be 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c
@@ -1,5 +1,3 @@
-/* { dg-do run } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-g" } */
 
 #include "kernels-loop.c"
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c
index 1433cb2..147ebb5 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c
@@ -1,6 +1,3 @@
-/* { dg-do run } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define N ((1024 * 512) + 1)
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c
index fd0d5b1..9a3eaca 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c
@@ -1,6 +1,3 @@
-/* { dg-do run } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define N ((1024 * 512) + 1)
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c
index 21d2599..28c725a 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c
@@ -1,6 +1,3 @@
-/* { dg-do run } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define N 1000
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c
index 3762e5a..355123c 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c
@@ -1,6 +1,3 @@
-/* { dg-do run } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define N (1024 * 512)
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c
index 511e25f..8647a94 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c
@@ -1,6 +1,3 @@
-/* { dg-do run } */
-/* { dg-additional-options "-ftree-parallelize-loops=32" } */
-
 #include <stdlib.h>
 
 #define n 10000
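
For readers skimming the libgomp part of the patch above, here is a minimal standalone sketch of how the launch dimensions end up being resolved. It is not the plugin code: `resolve_dims`, `DIM_MAX` and `DEFAULT_DIM` are hypothetical names used only for illustration, `DIM_MAX` stands in for `GOMP_DIM_MAX` (assumed to be 3: gang, worker, vector), and the default of 32 mirrors the TODO placeholder in the patch.

```c
/* Stand-in for libgomp's GOMP_DIM_MAX (assumed: gang, worker, vector).  */
#define DIM_MAX 3

/* Placeholder default, mirroring the "TODO 32" in the patch.  */
#define DEFAULT_DIM 32

/* Resolve the launch dimensions in DIMS.  A nonzero compile-time value in
   LAUNCH_DIM overrides the runtime value; any dimension that is still zero
   afterwards (meaning "let the runtime choose") gets the default.  */
static void
resolve_dims (const int launch_dim[DIM_MAX], int dims[DIM_MAX])
{
  for (int i = 0; i != DIM_MAX; i++)
    {
      if (launch_dim[i])
        dims[i] = launch_dim[i];
      if (!dims[i])
        dims[i] = DEFAULT_DIM;
    }
}
```

With compile-time values {0, 8, 0} and runtime values {4, 2, 0}, this would leave dims as {4, 8, 32}: the runtime gang count survives, the compile-time worker count wins, and the unset vector length falls back to the default.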


Regards
 Thomas


* Re: [PING^3][PATCH, 12/16] Handle acc loop directive
  2016-02-22 10:58               ` Jakub Jelinek
@ 2016-02-29  3:27                 ` Tom de Vries
  2016-03-07  8:22                   ` [PING][PATCH, " Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2016-02-29  3:27 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc-patches, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 1353 bytes --]

On 22-02-16 11:57, Jakub Jelinek wrote:
> On Mon, Feb 22, 2016 at 11:54:46AM +0100, Tom de Vries wrote:
>> Following up on your suggestion to implement this during gimplification, I
>> wrote attached patch.
>>
>> I'll put it through some openacc testing and add testcases. Is this approach
>> acceptable for stage4?
>
> LGTM.

Hi,

While testing this patch I ran into trouble caused by ignoring the private 
clause on the loop directive.

This OpenACC testcase currently compiles without a problem:
...
int
main (void)
{
   int j;
#pragma acc kernels default(none)
   {
#pragma acc loop private (j)
     for (unsigned i = 0; i < 1000; ++i)
       {
	j;
       }
   }
}
...

But when compiling with the patch, which ignores the private clause, we run into 
this error:
...
test.c: In function ‘main’:
test.c:10:2: error: ‘j’ not specified in enclosing OpenACC ‘kernels’ construct
   j;
   ^
test.c:5:9: note: enclosing OpenACC ‘kernels’ construct
  #pragma acc kernels default(none)
...

So I updated the patch to ignore all but the private clause on the loop 
directive during gimplification, and moved the sequential expansion of the 
omp-for construct from gimplify to omp-lower.
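
To make the clause handling concrete, here is a minimal standalone sketch of the filtering idea: walk a singly linked clause chain through a pointer-to-pointer and unlink everything that is not a private clause. The types and names (`struct clause`, `enum clause_code`, `keep_private_clauses`) are hypothetical stand-ins, not GCC's tree representation or its `OMP_CLAUSE_*` accessors.

```c
#include <stddef.h>

/* Hypothetical stand-ins for OMP clause codes.  */
enum clause_code { CLAUSE_PRIVATE, CLAUSE_GANG, CLAUSE_REDUCTION };

/* Hypothetical stand-in for a clause chained to the next clause.  */
struct clause
{
  enum clause_code code;
  struct clause *chain;
};

/* Unlink every clause in *LIST that is not a private clause, keeping the
   relative order of the remaining ones.  Return the number of clauses
   kept.  */
static unsigned
keep_private_clauses (struct clause **list)
{
  unsigned kept = 0;
  struct clause **prev_ptr = list;
  struct clause *probe;

  while ((probe = *prev_ptr) != NULL)
    {
      if (probe->code == CLAUSE_PRIVATE)
        {
          /* Keep the clause: advance to its chain field.  */
          prev_ptr = &probe->chain;
          kept++;
        }
      else
        /* Drop the clause: make the predecessor skip over it.  */
        *prev_ptr = probe->chain;
    }

  return kept;
}
```

Using a pointer to the previous link, rather than a pointer to the previous node, means the head of the list needs no special case when the first clause is dropped.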

Bootstrapped and reg-tested on x86_64.

Built for the nvidia accelerator and reg-tested goacc.exp and the libgomp testsuite.

Is the updated patch still OK for stage4?

Thanks,
- Tom


[-- Attachment #2: 0001-Ignore-acc-loop-directive-in-kernels-region.patch --]
[-- Type: text/x-patch, Size: 19675 bytes --]

Ignore acc loop directive in kernels region

2016-02-29  Tom de Vries  <tom@codesourcery.com>

	* gimplify.c (gimplify_ctx_in_oacc_kernels_region): New function.
	(gimplify_omp_for): Ignore all but private clause on loop directive in
	kernels region.
	* omp-low.c (lower_omp_for_seq): New function.
	(lower_omp_for): Use lower_omp_for_seq in kernels region.  Don't
	generate omp continue/return.

	* c-c++-common/goacc/kernels-acc-loop-reduction.c: New test.
	* c-c++-common/goacc/kernels-acc-loop-smaller-equal.c: Same.
	* c-c++-common/goacc/kernels-loop-2-acc-loop.c: Same.
	* c-c++-common/goacc/kernels-loop-3-acc-loop.c: Same.
	* c-c++-common/goacc/kernels-loop-acc-loop.c: Same.
	* c-c++-common/goacc/kernels-loop-n-acc-loop.c: Same.
	* c-c++-common/goacc/combined-directives.c: Update test.
	* c-c++-common/goacc/loop-private-1.c: Same.
	* gfortran.dg/goacc/combined-directives.f90: Same.
	* gfortran.dg/goacc/gang-static.f95: Same.
	* gfortran.dg/goacc/reduction-2.f95: Same.

---
 gcc/gimplify.c                                     | 41 ++++++++++
 gcc/omp-low.c                                      | 93 ++++++++++++++++++++--
 .../c-c++-common/goacc/combined-directives.c       | 16 ++--
 .../goacc/kernels-acc-loop-reduction.c             | 24 ++++++
 .../goacc/kernels-acc-loop-smaller-equal.c         | 22 +++++
 .../c-c++-common/goacc/kernels-loop-2-acc-loop.c   | 17 ++++
 .../c-c++-common/goacc/kernels-loop-3-acc-loop.c   | 14 ++++
 .../c-c++-common/goacc/kernels-loop-acc-loop.c     | 14 ++++
 .../c-c++-common/goacc/kernels-loop-n-acc-loop.c   | 14 ++++
 gcc/testsuite/c-c++-common/goacc/loop-private-1.c  |  2 +-
 .../gfortran.dg/goacc/combined-directives.f90      | 16 ++--
 gcc/testsuite/gfortran.dg/goacc/gang-static.f95    |  4 +-
 gcc/testsuite/gfortran.dg/goacc/reduction-2.f95    |  3 +-
 13 files changed, 252 insertions(+), 28 deletions(-)

diff --git a/gcc/gimplify.c b/gcc/gimplify.c
index 7be6bd7..4b82305 100644
--- a/gcc/gimplify.c
+++ b/gcc/gimplify.c
@@ -8364,6 +8364,20 @@ find_combined_omp_for (tree *tp, int *walk_subtrees, void *)
   return NULL_TREE;
 }
 
+/* Return true if CTX is (part of) an oacc kernels region.  */
+
+static bool
+gimplify_ctx_in_oacc_kernels_region (gimplify_omp_ctx *ctx)
+{
+  for (;ctx != NULL; ctx = ctx->outer_context)
+    {
+      if (ctx->region_type == ORT_ACC_KERNELS)
+	return true;
+    }
+
+  return false;
+}
+
 /* Gimplify the gross structure of an OMP_FOR statement.  */
 
 static enum gimplify_status
@@ -8403,6 +8417,33 @@ gimplify_omp_for (tree *expr_p, gimple_seq *pre_p)
       gcc_unreachable ();
     }
 
+  /* Skip loop clauses not handled in kernels region.  */
+  if (gimplify_ctx_in_oacc_kernels_region (gimplify_omp_ctxp))
+    {
+      tree *prev_ptr = &OMP_FOR_CLAUSES (for_stmt);
+
+      while (tree probe = *prev_ptr)
+	{
+	  tree *next_ptr = &OMP_CLAUSE_CHAIN (probe);
+
+	  bool keep_clause;
+	  switch (OMP_CLAUSE_CODE (probe))
+	    {
+	    case OMP_CLAUSE_PRIVATE:
+	      keep_clause = true;
+	      break;
+	    default:
+	      keep_clause = false;
+	      break;
+	    }
+
+	  if (keep_clause)
+	    prev_ptr = next_ptr;
+	  else
+	    *prev_ptr = *next_ptr;
+	}
+    }
+
   /* Set OMP_CLAUSE_LINEAR_NO_COPYIN flag on explicit linear
      clause for the IV.  */
   if (ort == ORT_SIMD && TREE_VEC_LENGTH (OMP_FOR_INIT (for_stmt)) == 1)
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index fcbb3e0..bb70ac2 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -14944,6 +14944,75 @@ lower_omp_for_lastprivate (struct omp_for_data *fd, gimple_seq *body_p,
     }
 }
 
+/* Lower the loops with index I and higher in omp_for FOR_STMT as a sequential
+   loop, and append the resulting gimple statements to PRE_P.  */
+
+static void
+lower_omp_for_seq (gimple_seq *pre_p, gimple *for_stmt, unsigned int i)
+{
+  unsigned int len = gimple_omp_for_collapse (for_stmt);
+  gcc_assert (i < len);
+
+  /* Gimplify OMP_FOR[i] as:
+
+     OMP_FOR_INIT[i];
+     goto <loop_entry_label>;
+     <fall_thru_label>:
+     if (i == len - 1)
+       OMP_FOR_BODY;
+     else
+       OMP_FOR[i+1];
+    OMP_FOR_INCR[i];
+    <loop_entry_label>:
+    if (OMP_FOR_COND[i])
+      goto <fall_thru_label>;
+    else
+      goto <loop_exit_label>;
+    <loop_exit_label>:
+  */
+
+  tree loop_entry_label = create_artificial_label (UNKNOWN_LOCATION);
+  tree fall_thru_label = create_artificial_label (UNKNOWN_LOCATION);
+  tree loop_exit_label = create_artificial_label (UNKNOWN_LOCATION);
+
+  /* OMP_FOR_INIT[i].  */
+  tree init = gimple_omp_for_initial (for_stmt, i);
+  tree var = gimple_omp_for_index (for_stmt, i);
+  gimple *g = gimple_build_assign (var, init);
+  gimple_seq_add_stmt (pre_p, g);
+
+  /* goto <loop_entry_label>.  */
+  gimple_seq_add_stmt (pre_p, gimple_build_goto (loop_entry_label));
+
+  /* <fall_thru_label>.  */
+  gimple_seq_add_stmt (pre_p, gimple_build_label (fall_thru_label));
+
+  /* if (i == len - 1) OMP_FOR_BODY
+     else OMP_FOR[i+1].  */
+  if (i == len - 1)
+    gimple_seq_add_seq (pre_p, gimple_omp_body (for_stmt));
+  else
+    lower_omp_for_seq (pre_p, for_stmt, i + 1);
+
+  /* OMP_FOR_INCR[i].  */
+  tree incr = gimple_omp_for_incr (for_stmt, i);
+  g = gimple_build_assign (var, incr);
+  gimple_seq_add_stmt (pre_p, g);
+
+  /* <loop_entry_label>.  */
+  gimple_seq_add_stmt (pre_p, gimple_build_label (loop_entry_label));
+
+  /* if (OMP_FOR_COND[i]) goto <fall_thru_label>
+     else goto <loop_exit_label>.  */
+  enum tree_code cond = gimple_omp_for_cond (for_stmt, i);
+  tree final_val = gimple_omp_for_final (for_stmt, i);
+  gimple *gimple_cond = gimple_build_cond (cond, var, final_val,
+					   fall_thru_label, loop_exit_label);
+  gimple_seq_add_stmt (pre_p, gimple_cond);
+
+  /* <loop_exit_label>.  */
+  gimple_seq_add_stmt (pre_p, gimple_build_label (loop_exit_label));
+}
 
 /* Lower code for an OMP loop directive.  */
 
@@ -14957,6 +15026,8 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
   gimple_seq omp_for_body, body, dlist;
   gimple_seq oacc_head = NULL, oacc_tail = NULL;
   size_t i;
+  bool oacc_kernels_p = (is_gimple_omp_oacc (ctx->stmt)
+			 && ctx_in_oacc_kernels_region (ctx));
 
   push_gimplify_context ();
 
@@ -15065,7 +15136,7 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
   extract_omp_for_data (stmt, &fd, NULL);
 
   if (is_gimple_omp_oacc (ctx->stmt)
-      && !ctx_in_oacc_kernels_region (ctx))
+      && !oacc_kernels_p)
     lower_oacc_head_tail (gimple_location (stmt),
 			  gimple_omp_for_clauses (stmt),
 			  &oacc_head, &oacc_tail, ctx);
@@ -15088,13 +15159,18 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 						ctx);
 	}
 
-  if (!gimple_omp_for_grid_phony (stmt))
-    gimple_seq_add_stmt (&body, stmt);
-  gimple_seq_add_seq (&body, gimple_omp_body (stmt));
+  if (oacc_kernels_p)
+    lower_omp_for_seq (&body, stmt, 0);
+  else if (gimple_omp_for_grid_phony (stmt))
+    gimple_seq_add_seq (&body, gimple_omp_body (stmt));
+  else
+    {
+      gimple_seq_add_stmt (&body, stmt);
+      gimple_seq_add_seq (&body, gimple_omp_body (stmt));
 
-  if (!gimple_omp_for_grid_phony (stmt))
-    gimple_seq_add_stmt (&body, gimple_build_omp_continue (fd.loop.v,
-							   fd.loop.v));
+      gimple_seq_add_stmt (&body, gimple_build_omp_continue (fd.loop.v,
+							     fd.loop.v));
+    }
 
   /* After the loop, add exit clauses.  */
   lower_reduction_clauses (gimple_omp_for_clauses (stmt), &body, ctx);
@@ -15106,7 +15182,8 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   body = maybe_catch_exception (body);
 
-  if (!gimple_omp_for_grid_phony (stmt))
+  if (!gimple_omp_for_grid_phony (stmt)
+      && !oacc_kernels_p)
     {
       /* Region exit marker goes at the end of the loop body.  */
       gimple_seq_add_stmt (&body, gimple_build_omp_return (fd.have_nowait));
diff --git a/gcc/testsuite/c-c++-common/goacc/combined-directives.c b/gcc/testsuite/c-c++-common/goacc/combined-directives.c
index c387285..66b8b65 100644
--- a/gcc/testsuite/c-c++-common/goacc/combined-directives.c
+++ b/gcc/testsuite/c-c++-common/goacc/combined-directives.c
@@ -108,12 +108,12 @@ test ()
 //    ;
 }
 
-// { dg-final { scan-tree-dump-times "acc loop collapse.2. private.j. private.i" 2 "gimple" } }
-// { dg-final { scan-tree-dump-times "acc loop gang" 2 "gimple" } }
-// { dg-final { scan-tree-dump-times "acc loop worker" 2 "gimple" } }
-// { dg-final { scan-tree-dump-times "acc loop vector" 2 "gimple" } }
-// { dg-final { scan-tree-dump-times "acc loop seq" 2 "gimple" } }
-// { dg-final { scan-tree-dump-times "acc loop auto" 2 "gimple" } }
-// { dg-final { scan-tree-dump-times "acc loop tile.2, 3" 2 "gimple" } }
-// { dg-final { scan-tree-dump-times "acc loop independent private.i" 2 "gimple" } }
+// { dg-final { scan-tree-dump-times "acc loop collapse.2. private.j. private.i" 1 "gimple" } }
+// { dg-final { scan-tree-dump-times "acc loop gang" 1 "gimple" } }
+// { dg-final { scan-tree-dump-times "acc loop worker" 1 "gimple" } }
+// { dg-final { scan-tree-dump-times "acc loop vector" 1 "gimple" } }
+// { dg-final { scan-tree-dump-times "acc loop seq" 1 "gimple" } }
+// { dg-final { scan-tree-dump-times "acc loop auto" 1 "gimple" } }
+// { dg-final { scan-tree-dump-times "acc loop tile.2, 3" 1 "gimple" } }
+// { dg-final { scan-tree-dump-times "acc loop independent private.i" 1 "gimple" } }
 // { dg-final { scan-tree-dump-times "private.z" 2 "gimple" } }
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-reduction.c b/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-reduction.c
new file mode 100644
index 0000000..6a9f52b
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-reduction.c
@@ -0,0 +1,24 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-parloops1-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+unsigned int a[1000];
+
+unsigned int
+foo (int n)
+{
+  unsigned int sum = 0;
+
+#pragma acc kernels loop gang reduction(+:sum)
+  for (int i = 0; i < n; i++)
+    sum += a[i];
+
+  return sum;
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops1" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*\\._omp_fn\\.0" 1 "optimized" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-smaller-equal.c b/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-smaller-equal.c
new file mode 100644
index 0000000..d18c779
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-smaller-equal.c
@@ -0,0 +1,22 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-parloops1-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+unsigned int
+foo (int n)
+{
+  unsigned int sum = 1;
+
+  #pragma acc kernels loop
+  for (int i = 1; i <= n; i++)
+    sum += i;
+
+  return sum;
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops1" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*\\._omp_fn\\.0" 1 "optimized" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-2-acc-loop.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-2-acc-loop.c
new file mode 100644
index 0000000..95354e1
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-2-acc-loop.c
@@ -0,0 +1,17 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-parloops1-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+/* Check that loops tagged with '#pragma acc loop' get properly parallelized.  */
+#define ACC_LOOP
+#include "kernels-loop-2.c"
+
+/* Check that only three loops are analyzed, and that all can be
+   parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops1" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.1" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.2" 1 "optimized" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-3-acc-loop.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-3-acc-loop.c
new file mode 100644
index 0000000..1ad3067
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-3-acc-loop.c
@@ -0,0 +1,14 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-parloops1-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+/* Check that loops tagged with '#pragma acc loop' get properly parallelized.  */
+#define ACC_LOOP
+#include "kernels-loop-3.c"
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops1" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-acc-loop.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-acc-loop.c
new file mode 100644
index 0000000..47b8459
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-acc-loop.c
@@ -0,0 +1,14 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-parloops1-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+/* Check that loops tagged with '#pragma acc loop' get properly parallelized.  */
+#define ACC_LOOP
+#include "kernels-loop.c"
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops1" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-n-acc-loop.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-n-acc-loop.c
new file mode 100644
index 0000000..25b56d7
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-n-acc-loop.c
@@ -0,0 +1,14 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-parloops1-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+/* Check that loops tagged with '#pragma acc loop' get properly parallelized.  */
+#define ACC_LOOP
+#include "kernels-loop-n.c"
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops1" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/loop-private-1.c b/gcc/testsuite/c-c++-common/goacc/loop-private-1.c
index 38a4a7d..9b2f7fa 100644
--- a/gcc/testsuite/c-c++-common/goacc/loop-private-1.c
+++ b/gcc/testsuite/c-c++-common/goacc/loop-private-1.c
@@ -10,4 +10,4 @@ f (int i, int j)
       ;
 }
 
-/* { dg-final { scan-tree-dump-times "#pragma acc loop collapse\\(2\\) private\\(j\\) private\\(i\\)" 1 "gimple" } } */
+/* { dg-final { scan-tree-dump-times "#pragma acc loop private\\(j\\) private\\(i\\)" 1 "gimple" } } */
diff --git a/gcc/testsuite/gfortran.dg/goacc/combined-directives.f90 b/gcc/testsuite/gfortran.dg/goacc/combined-directives.f90
index 6977525..e89ddc9 100644
--- a/gcc/testsuite/gfortran.dg/goacc/combined-directives.f90
+++ b/gcc/testsuite/gfortran.dg/goacc/combined-directives.f90
@@ -144,12 +144,12 @@ subroutine test
 !  !$acc end kernels loop
 end subroutine test
 
-! { dg-final { scan-tree-dump-times "acc loop private.i. private.j. collapse.2." 2 "gimple" } }
-! { dg-final { scan-tree-dump-times "acc loop private.i. gang" 2 "gimple" } }
-! { dg-final { scan-tree-dump-times "acc loop private.i. private.j. worker" 2 "gimple" } }
-! { dg-final { scan-tree-dump-times "acc loop private.i. private.j. vector" 2 "gimple" } }
-! { dg-final { scan-tree-dump-times "acc loop private.i. private.j. seq" 2 "gimple" } }
-! { dg-final { scan-tree-dump-times "acc loop private.i. private.j. auto" 2 "gimple" } }
-! { dg-final { scan-tree-dump-times "acc loop private.i. private.j. tile.2, 3" 2 "gimple" } }
-! { dg-final { scan-tree-dump-times "acc loop private.i. independent" 2 "gimple" } }
+! { dg-final { scan-tree-dump-times "acc loop private.i. private.j. collapse.2." 1 "gimple" } }
+! { dg-final { scan-tree-dump-times "acc loop private.i. gang" 1 "gimple" } }
+! { dg-final { scan-tree-dump-times "acc loop private.i. private.j. worker" 1 "gimple" } }
+! { dg-final { scan-tree-dump-times "acc loop private.i. private.j. vector" 1 "gimple" } }
+! { dg-final { scan-tree-dump-times "acc loop private.i. private.j. seq" 1 "gimple" } }
+! { dg-final { scan-tree-dump-times "acc loop private.i. private.j. auto" 1 "gimple" } }
+! { dg-final { scan-tree-dump-times "acc loop private.i. private.j. tile.2, 3" 1 "gimple" } }
+! { dg-final { scan-tree-dump-times "acc loop private.i. independent" 1 "gimple" } }
 ! { dg-final { scan-tree-dump-times "private.z" 2 "gimple" } }
diff --git a/gcc/testsuite/gfortran.dg/goacc/gang-static.f95 b/gcc/testsuite/gfortran.dg/goacc/gang-static.f95
index 3481085..c14b7b2 100644
--- a/gcc/testsuite/gfortran.dg/goacc/gang-static.f95
+++ b/gcc/testsuite/gfortran.dg/goacc/gang-static.f95
@@ -78,5 +78,5 @@ end subroutine test
 ! { dg-final { scan-tree-dump-times "gang\\(static:2\\)" 1 "omplower" } }
 ! { dg-final { scan-tree-dump-times "gang\\(static:5\\)" 1 "omplower" } }
 ! { dg-final { scan-tree-dump-times "gang\\(static:20\\)" 1 "omplower" } }
-! { dg-final { scan-tree-dump-times "gang\\(num: 5 static:\\\*\\)" 1 "omplower" } }
-! { dg-final { scan-tree-dump-times "gang\\(num: 30 static:20\\)" 1 "omplower" } }
+! { dg-final { scan-tree-dump-times "gang\\(num: 5 static:\\\*\\)" 0 "omplower" } }
+! { dg-final { scan-tree-dump-times "gang\\(num: 30 static:20\\)" 0 "omplower" } }
diff --git a/gcc/testsuite/gfortran.dg/goacc/reduction-2.f95 b/gcc/testsuite/gfortran.dg/goacc/reduction-2.f95
index 929fb0e..4c431c8 100644
--- a/gcc/testsuite/gfortran.dg/goacc/reduction-2.f95
+++ b/gcc/testsuite/gfortran.dg/goacc/reduction-2.f95
@@ -11,6 +11,7 @@ subroutine foo ()
   !$acc end parallel loop
   !$acc kernels loop reduction(+:a)
   do k = 2,6
+     a = a + 1
   enddo
   !$acc end kernels loop
 end subroutine
@@ -18,5 +19,5 @@ end subroutine
 ! { dg-final { scan-tree-dump-times "target oacc_parallel firstprivate.a." 1 "gimple" } }
 ! { dg-final { scan-tree-dump-times "acc loop private.p. reduction..:a." 1 "gimple" } }
 ! { dg-final { scan-tree-dump-times "target oacc_kernels map.force_tofrom:a .len: 4.." 1 "gimple" } }
-! { dg-final { scan-tree-dump-times "acc loop private.k. reduction..:a." 1 "gimple" } }
+! { dg-final { scan-tree-dump-times "acc loop private.k." 1 "gimple" } }
 


* [PING][PATCH, 12/16] Handle acc loop directive
  2016-02-29  3:27                 ` Tom de Vries
@ 2016-03-07  8:22                   ` Tom de Vries
  2016-03-14  6:21                     ` [PING^2][PATCH, " Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2016-03-07  8:22 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc-patches, Richard Biener

On 29/02/16 04:26, Tom de Vries wrote:
> On 22-02-16 11:57, Jakub Jelinek wrote:
>> On Mon, Feb 22, 2016 at 11:54:46AM +0100, Tom de Vries wrote:
>>> Following up on your suggestion to implement this during gimplification,
>>> I wrote the attached patch.
>>>
>>> I'll put it through some openacc testing and add testcases. Is this
>>> approach acceptable for stage4?
>>
>> LGTM.
>
> Hi,
>
> I ran into trouble during testing of this patch, with ignoring the
> private clause on the loop directive.
>
> This openacc testcase currently compiles without a problem:
> ...
> int
> main (void)
> {
>    int j;
> #pragma acc kernels default(none)
>    {
> #pragma acc loop private (j)
>      for (unsigned i = 0; i < 1000; ++i)
>        {
>          j;
>        }
>    }
> }
> ...
>
> But when compiling with the patch, and ignoring the private clause, we
> run into this error:
> ...
> test.c: In function ‘main’:
> test.c:10:2: error: ‘j’ not specified in enclosing OpenACC ‘kernels’ construct
>    j;
>    ^
> test.c:5:9: note: enclosing OpenACC ‘kernels’ construct
>   #pragma acc kernels default(none)
> ...
>
> So I updated the patch to ignore all but the private clause on the loop
> directive during gimplification, and moved the sequential expansion of
> the omp-for construct from gimplify to omp-lower.
>
> Bootstrapped and reg-tested on x86_64.
>
> Built for nvidia accelerator and reg-tested goacc.exp and the libgomp
> testsuite.
>
> Updated patch still ok for stage4?
>

Ping. (Submitted here:
https://gcc.gnu.org/ml/gcc-patches/2016-02/msg01903.html )

Thanks,
- Tom

> 0001-Ignore-acc-loop-directive-in-kernels-region.patch
>
>
> Ignore acc loop directive in kernels region
>
> 2016-02-29  Tom de Vries  <tom@codesourcery.com>
>
> 	* gimplify.c (gimplify_ctx_in_oacc_kernels_region): New function.
> 	(gimplify_omp_for): Ignore all but private clause on loop directive in
> 	kernels region.
> 	* omp-low.c (lower_omp_for_seq): New function.
> 	(lower_omp_for): Use lower_omp_for_seq in kernels region.  Don't
> 	generate omp continue/return.
>
> 	* c-c++-common/goacc/kernels-acc-loop-reduction.c: New test.
> 	* c-c++-common/goacc/kernels-acc-loop-smaller-equal.c: Same.
> 	* c-c++-common/goacc/kernels-loop-2-acc-loop.c: Same.
> 	* c-c++-common/goacc/kernels-loop-3-acc-loop.c: Same.
> 	* c-c++-common/goacc/kernels-loop-acc-loop.c: Same.
> 	* c-c++-common/goacc/kernels-loop-n-acc-loop.c: Same.
> 	* c-c++-common/goacc/combined-directives.c: Update test.
> 	* c-c++-common/goacc/loop-private-1.c: Same.
> 	* gfortran.dg/goacc/combined-directives.f90: Same.
> 	* gfortran.dg/goacc/gang-static.f95: Same.
> 	* gfortran.dg/goacc/reduction-2.f95: Same.
>
> ---
>   gcc/gimplify.c                                     | 41 ++++++++++
>   gcc/omp-low.c                                      | 93 ++++++++++++++++++++--
>   .../c-c++-common/goacc/combined-directives.c       | 16 ++--
>   .../goacc/kernels-acc-loop-reduction.c             | 24 ++++++
>   .../goacc/kernels-acc-loop-smaller-equal.c         | 22 +++++
>   .../c-c++-common/goacc/kernels-loop-2-acc-loop.c   | 17 ++++
>   .../c-c++-common/goacc/kernels-loop-3-acc-loop.c   | 14 ++++
>   .../c-c++-common/goacc/kernels-loop-acc-loop.c     | 14 ++++
>   .../c-c++-common/goacc/kernels-loop-n-acc-loop.c   | 14 ++++
>   gcc/testsuite/c-c++-common/goacc/loop-private-1.c  |  2 +-
>   .../gfortran.dg/goacc/combined-directives.f90      | 16 ++--
>   gcc/testsuite/gfortran.dg/goacc/gang-static.f95    |  4 +-
>   gcc/testsuite/gfortran.dg/goacc/reduction-2.f95    |  3 +-
>   13 files changed, 252 insertions(+), 28 deletions(-)
>
> diff --git a/gcc/gimplify.c b/gcc/gimplify.c
> index 7be6bd7..4b82305 100644
> --- a/gcc/gimplify.c
> +++ b/gcc/gimplify.c
> @@ -8364,6 +8364,20 @@ find_combined_omp_for (tree *tp, int *walk_subtrees, void *)
>     return NULL_TREE;
>   }
>
> +/* Return true if CTX is (part of) an oacc kernels region.  */
> +
> +static bool
> +gimplify_ctx_in_oacc_kernels_region (gimplify_omp_ctx *ctx)
> +{
> +  for (;ctx != NULL; ctx = ctx->outer_context)
> +    {
> +      if (ctx->region_type == ORT_ACC_KERNELS)
> +	return true;
> +    }
> +
> +  return false;
> +}
> +
>   /* Gimplify the gross structure of an OMP_FOR statement.  */
>
>   static enum gimplify_status
> @@ -8403,6 +8417,33 @@ gimplify_omp_for (tree *expr_p, gimple_seq *pre_p)
>         gcc_unreachable ();
>       }
>
> +  /* Skip loop clauses not handled in kernels region.  */
> +  if (gimplify_ctx_in_oacc_kernels_region (gimplify_omp_ctxp))
> +    {
> +      tree *prev_ptr = &OMP_FOR_CLAUSES (for_stmt);
> +
> +      while (tree probe = *prev_ptr)
> +	{
> +	  tree *next_ptr = &OMP_CLAUSE_CHAIN (probe);
> +
> +	  bool keep_clause;
> +	  switch (OMP_CLAUSE_CODE (probe))
> +	    {
> +	    case OMP_CLAUSE_PRIVATE:
> +	      keep_clause = true;
> +	      break;
> +	    default:
> +	      keep_clause = false;
> +	      break;
> +	    }
> +
> +	  if (keep_clause)
> +	    prev_ptr = next_ptr;
> +	  else
> +	    *prev_ptr = *next_ptr;
> +	}
> +    }
> +
>     /* Set OMP_CLAUSE_LINEAR_NO_COPYIN flag on explicit linear
>        clause for the IV.  */
>     if (ort == ORT_SIMD && TREE_VEC_LENGTH (OMP_FOR_INIT (for_stmt)) == 1)
> diff --git a/gcc/omp-low.c b/gcc/omp-low.c
> index fcbb3e0..bb70ac2 100644
> --- a/gcc/omp-low.c
> +++ b/gcc/omp-low.c
> @@ -14944,6 +14944,75 @@ lower_omp_for_lastprivate (struct omp_for_data *fd, gimple_seq *body_p,
>       }
>   }
>
> +/* Lower the loops with index I and higher in omp_for FOR_STMT as a sequential
> +   loop, and append the resulting gimple statements to PRE_P.  */
> +
> +static void
> +lower_omp_for_seq (gimple_seq *pre_p, gimple *for_stmt, unsigned int i)
> +{
> +  unsigned int len = gimple_omp_for_collapse (for_stmt);
> +  gcc_assert (i < len);
> +
> +  /* Gimplify OMP_FOR[i] as:
> +
> +     OMP_FOR_INIT[i];
> +     goto <loop_entry_label>;
> +     <fall_thru_label>:
> +     if (i == len - 1)
> +       OMP_FOR_BODY;
> +     else
> +       OMP_FOR[i+1];
> +    OMP_FOR_INCR[i];
> +    <loop_entry_label>:
> +    if (OMP_FOR_COND[i])
> +      goto <fall_thru_label>;
> +    else
> +      goto <loop_exit_label>;
> +    <loop_exit_label>:
> +  */
> +
> +  tree loop_entry_label = create_artificial_label (UNKNOWN_LOCATION);
> +  tree fall_thru_label = create_artificial_label (UNKNOWN_LOCATION);
> +  tree loop_exit_label = create_artificial_label (UNKNOWN_LOCATION);
> +
> +  /* OMP_FOR_INIT[i].  */
> +  tree init = gimple_omp_for_initial (for_stmt, i);
> +  tree var = gimple_omp_for_index (for_stmt, i);
> +  gimple *g = gimple_build_assign (var, init);
> +  gimple_seq_add_stmt (pre_p, g);
> +
> +  /* goto <loop_entry_label>.  */
> +  gimple_seq_add_stmt (pre_p, gimple_build_goto (loop_entry_label));
> +
> +  /* <fall_thru_label>.  */
> +  gimple_seq_add_stmt (pre_p, gimple_build_label (fall_thru_label));
> +
> +  /* if (i == len - 1) OMP_FOR_BODY
> +     else OMP_FOR[i+1].  */
> +  if (i == len - 1)
> +    gimple_seq_add_seq (pre_p, gimple_omp_body (for_stmt));
> +  else
> +    lower_omp_for_seq (pre_p, for_stmt, i + 1);
> +
> +  /* OMP_FOR_INCR[i].  */
> +  tree incr = gimple_omp_for_incr (for_stmt, i);
> +  g = gimple_build_assign (var, incr);
> +  gimple_seq_add_stmt (pre_p, g);
> +
> +  /* <loop_entry_label>.  */
> +  gimple_seq_add_stmt (pre_p, gimple_build_label (loop_entry_label));
> +
> +  /* if (OMP_FOR_COND[i]) goto <fall_thru_label>
> +     else goto <loop_exit_label>.  */
> +  enum tree_code cond = gimple_omp_for_cond (for_stmt, i);
> +  tree final_val = gimple_omp_for_final (for_stmt, i);
> +  gimple *gimple_cond = gimple_build_cond (cond, var, final_val,
> +					   fall_thru_label, loop_exit_label);
> +  gimple_seq_add_stmt (pre_p, gimple_cond);
> +
> +  /* <loop_exit_label>.  */
> +  gimple_seq_add_stmt (pre_p, gimple_build_label (loop_exit_label));
> +}
>
>   /* Lower code for an OMP loop directive.  */
>
> @@ -14957,6 +15026,8 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
>     gimple_seq omp_for_body, body, dlist;
>     gimple_seq oacc_head = NULL, oacc_tail = NULL;
>     size_t i;
> +  bool oacc_kernels_p = (is_gimple_omp_oacc (ctx->stmt)
> +			 && ctx_in_oacc_kernels_region (ctx));
>
>     push_gimplify_context ();
>
> @@ -15065,7 +15136,7 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
>     extract_omp_for_data (stmt, &fd, NULL);
>
>     if (is_gimple_omp_oacc (ctx->stmt)
> -      && !ctx_in_oacc_kernels_region (ctx))
> +      && !oacc_kernels_p)
>       lower_oacc_head_tail (gimple_location (stmt),
>   			  gimple_omp_for_clauses (stmt),
>   			  &oacc_head, &oacc_tail, ctx);
> @@ -15088,13 +15159,18 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
>   						ctx);
>   	}
>
> -  if (!gimple_omp_for_grid_phony (stmt))
> -    gimple_seq_add_stmt (&body, stmt);
> -  gimple_seq_add_seq (&body, gimple_omp_body (stmt));
> +  if (oacc_kernels_p)
> +    lower_omp_for_seq (&body, stmt, 0);
> +  else if (gimple_omp_for_grid_phony (stmt))
> +    gimple_seq_add_seq (&body, gimple_omp_body (stmt));
> +  else
> +    {
> +      gimple_seq_add_stmt (&body, stmt);
> +      gimple_seq_add_seq (&body, gimple_omp_body (stmt));
>
> -  if (!gimple_omp_for_grid_phony (stmt))
> -    gimple_seq_add_stmt (&body, gimple_build_omp_continue (fd.loop.v,
> -							   fd.loop.v));
> +      gimple_seq_add_stmt (&body, gimple_build_omp_continue (fd.loop.v,
> +							     fd.loop.v));
> +    }
>
>     /* After the loop, add exit clauses.  */
>     lower_reduction_clauses (gimple_omp_for_clauses (stmt), &body, ctx);
> @@ -15106,7 +15182,8 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
>
>     body = maybe_catch_exception (body);
>
> -  if (!gimple_omp_for_grid_phony (stmt))
> +  if (!gimple_omp_for_grid_phony (stmt)
> +      && !oacc_kernels_p)
>       {
>         /* Region exit marker goes at the end of the loop body.  */
>         gimple_seq_add_stmt (&body, gimple_build_omp_return (fd.have_nowait));
> diff --git a/gcc/testsuite/c-c++-common/goacc/combined-directives.c b/gcc/testsuite/c-c++-common/goacc/combined-directives.c
> index c387285..66b8b65 100644
> --- a/gcc/testsuite/c-c++-common/goacc/combined-directives.c
> +++ b/gcc/testsuite/c-c++-common/goacc/combined-directives.c
> @@ -108,12 +108,12 @@ test ()
>   //    ;
>   }
>
> -// { dg-final { scan-tree-dump-times "acc loop collapse.2. private.j. private.i" 2 "gimple" } }
> -// { dg-final { scan-tree-dump-times "acc loop gang" 2 "gimple" } }
> -// { dg-final { scan-tree-dump-times "acc loop worker" 2 "gimple" } }
> -// { dg-final { scan-tree-dump-times "acc loop vector" 2 "gimple" } }
> -// { dg-final { scan-tree-dump-times "acc loop seq" 2 "gimple" } }
> -// { dg-final { scan-tree-dump-times "acc loop auto" 2 "gimple" } }
> -// { dg-final { scan-tree-dump-times "acc loop tile.2, 3" 2 "gimple" } }
> -// { dg-final { scan-tree-dump-times "acc loop independent private.i" 2 "gimple" } }
> +// { dg-final { scan-tree-dump-times "acc loop collapse.2. private.j. private.i" 1 "gimple" } }
> +// { dg-final { scan-tree-dump-times "acc loop gang" 1 "gimple" } }
> +// { dg-final { scan-tree-dump-times "acc loop worker" 1 "gimple" } }
> +// { dg-final { scan-tree-dump-times "acc loop vector" 1 "gimple" } }
> +// { dg-final { scan-tree-dump-times "acc loop seq" 1 "gimple" } }
> +// { dg-final { scan-tree-dump-times "acc loop auto" 1 "gimple" } }
> +// { dg-final { scan-tree-dump-times "acc loop tile.2, 3" 1 "gimple" } }
> +// { dg-final { scan-tree-dump-times "acc loop independent private.i" 1 "gimple" } }
>   // { dg-final { scan-tree-dump-times "private.z" 2 "gimple" } }
> diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-reduction.c b/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-reduction.c
> new file mode 100644
> index 0000000..6a9f52b
> --- /dev/null
> +++ b/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-reduction.c
> @@ -0,0 +1,24 @@
> +/* { dg-additional-options "-O2" } */
> +/* { dg-additional-options "-fdump-tree-parloops1-all" } */
> +/* { dg-additional-options "-fdump-tree-optimized" } */
> +
> +unsigned int a[1000];
> +
> +unsigned int
> +foo (int n)
> +{
> +  unsigned int sum = 0;
> +
> +#pragma acc kernels loop gang reduction(+:sum)
> +  for (int i = 0; i < n; i++)
> +    sum += a[i];
> +
> +  return sum;
> +}
> +
> +/* Check that only one loop is analyzed, and that it can be parallelized.  */
> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops1" } } */
> +/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
> +
> +/* Check that the loop has been split off into a function.  */
> +/* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*\\._omp_fn\\.0" 1 "optimized" } } */
> diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-smaller-equal.c b/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-smaller-equal.c
> new file mode 100644
> index 0000000..d18c779
> --- /dev/null
> +++ b/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-smaller-equal.c
> @@ -0,0 +1,22 @@
> +/* { dg-additional-options "-O2" } */
> +/* { dg-additional-options "-fdump-tree-parloops1-all" } */
> +/* { dg-additional-options "-fdump-tree-optimized" } */
> +
> +unsigned int
> +foo (int n)
> +{
> +  unsigned int sum = 1;
> +
> +  #pragma acc kernels loop
> +  for (int i = 1; i <= n; i++)
> +    sum += i;
> +
> +  return sum;
> +}
> +
> +/* Check that only one loop is analyzed, and that it can be parallelized.  */
> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops1" } } */
> +/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
> +
> +/* Check that the loop has been split off into a function.  */
> +/* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*\\._omp_fn\\.0" 1 "optimized" } } */
> diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-2-acc-loop.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-2-acc-loop.c
> new file mode 100644
> index 0000000..95354e1
> --- /dev/null
> +++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-2-acc-loop.c
> @@ -0,0 +1,17 @@
> +/* { dg-additional-options "-O2" } */
> +/* { dg-additional-options "-fdump-tree-parloops1-all" } */
> +/* { dg-additional-options "-fdump-tree-optimized" } */
> +
> +/* Check that loops with '#pragma acc loop' tagged gets properly parallelized.  */
> +#define ACC_LOOP
> +#include "kernels-loop-2.c"
> +
> +/* Check that only three loops are analyzed, and that all can be
> +   parallelized.  */
> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops1" } } */
> +/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
> +
> +/* Check that the loop has been split off into a function.  */
> +/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.1" 1 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.2" 1 "optimized" } } */
> diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-3-acc-loop.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-3-acc-loop.c
> new file mode 100644
> index 0000000..1ad3067
> --- /dev/null
> +++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-3-acc-loop.c
> @@ -0,0 +1,14 @@
> +/* { dg-additional-options "-O2" } */
> +/* { dg-additional-options "-fdump-tree-parloops1-all" } */
> +/* { dg-additional-options "-fdump-tree-optimized" } */
> +
> +/* Check that loops with '#pragma acc loop' tagged gets properly parallelized.  */
> +#define ACC_LOOP
> +#include "kernels-loop-3.c"
> +
> +/* Check that only one loop is analyzed, and that it can be parallelized.  */
> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops1" } } */
> +/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
> +
> +/* Check that the loop has been split off into a function.  */
> +/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
> diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-acc-loop.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-acc-loop.c
> new file mode 100644
> index 0000000..47b8459
> --- /dev/null
> +++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-acc-loop.c
> @@ -0,0 +1,14 @@
> +/* { dg-additional-options "-O2" } */
> +/* { dg-additional-options "-fdump-tree-parloops1-all" } */
> +/* { dg-additional-options "-fdump-tree-optimized" } */
> +
> +/* Check that loops with '#pragma acc loop' tagged gets properly parallelized.  */
> +#define ACC_LOOP
> +#include "kernels-loop.c"
> +
> +/* Check that only one loop is analyzed, and that it can be parallelized.  */
> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops1" } } */
> +/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
> +
> +/* Check that the loop has been split off into a function.  */
> +/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
> diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-n-acc-loop.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-n-acc-loop.c
> new file mode 100644
> index 0000000..25b56d7
> --- /dev/null
> +++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-n-acc-loop.c
> @@ -0,0 +1,14 @@
> +/* { dg-additional-options "-O2" } */
> +/* { dg-additional-options "-fdump-tree-parloops1-all" } */
> +/* { dg-additional-options "-fdump-tree-optimized" } */
> +
> +/* Check that loops with '#pragma acc loop' tagged gets properly parallelized.  */
> +#define ACC_LOOP
> +#include "kernels-loop-n.c"
> +
> +/* Check that only one loop is analyzed, and that it can be parallelized.  */
> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops1" } } */
> +/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
> +
> +/* Check that the loop has been split off into a function.  */
> +/* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
> diff --git a/gcc/testsuite/c-c++-common/goacc/loop-private-1.c b/gcc/testsuite/c-c++-common/goacc/loop-private-1.c
> index 38a4a7d..9b2f7fa 100644
> --- a/gcc/testsuite/c-c++-common/goacc/loop-private-1.c
> +++ b/gcc/testsuite/c-c++-common/goacc/loop-private-1.c
> @@ -10,4 +10,4 @@ f (int i, int j)
>         ;
>   }
>
> -/* { dg-final { scan-tree-dump-times "#pragma acc loop collapse\\(2\\) private\\(j\\) private\\(i\\)" 1 "gimple" } } */
> +/* { dg-final { scan-tree-dump-times "#pragma acc loop private\\(j\\) private\\(i\\)" 1 "gimple" } } */
> diff --git a/gcc/testsuite/gfortran.dg/goacc/combined-directives.f90 b/gcc/testsuite/gfortran.dg/goacc/combined-directives.f90
> index 6977525..e89ddc9 100644
> --- a/gcc/testsuite/gfortran.dg/goacc/combined-directives.f90
> +++ b/gcc/testsuite/gfortran.dg/goacc/combined-directives.f90
> @@ -144,12 +144,12 @@ subroutine test
>   !  !$acc end kernels loop
>   end subroutine test
>
> -! { dg-final { scan-tree-dump-times "acc loop private.i. private.j. collapse.2." 2 "gimple" } }
> -! { dg-final { scan-tree-dump-times "acc loop private.i. gang" 2 "gimple" } }
> -! { dg-final { scan-tree-dump-times "acc loop private.i. private.j. worker" 2 "gimple" } }
> -! { dg-final { scan-tree-dump-times "acc loop private.i. private.j. vector" 2 "gimple" } }
> -! { dg-final { scan-tree-dump-times "acc loop private.i. private.j. seq" 2 "gimple" } }
> -! { dg-final { scan-tree-dump-times "acc loop private.i. private.j. auto" 2 "gimple" } }
> -! { dg-final { scan-tree-dump-times "acc loop private.i. private.j. tile.2, 3" 2 "gimple" } }
> -! { dg-final { scan-tree-dump-times "acc loop private.i. independent" 2 "gimple" } }
> +! { dg-final { scan-tree-dump-times "acc loop private.i. private.j. collapse.2." 1 "gimple" } }
> +! { dg-final { scan-tree-dump-times "acc loop private.i. gang" 1 "gimple" } }
> +! { dg-final { scan-tree-dump-times "acc loop private.i. private.j. worker" 1 "gimple" } }
> +! { dg-final { scan-tree-dump-times "acc loop private.i. private.j. vector" 1 "gimple" } }
> +! { dg-final { scan-tree-dump-times "acc loop private.i. private.j. seq" 1 "gimple" } }
> +! { dg-final { scan-tree-dump-times "acc loop private.i. private.j. auto" 1 "gimple" } }
> +! { dg-final { scan-tree-dump-times "acc loop private.i. private.j. tile.2, 3" 1 "gimple" } }
> +! { dg-final { scan-tree-dump-times "acc loop private.i. independent" 1 "gimple" } }
>   ! { dg-final { scan-tree-dump-times "private.z" 2 "gimple" } }
> diff --git a/gcc/testsuite/gfortran.dg/goacc/gang-static.f95 b/gcc/testsuite/gfortran.dg/goacc/gang-static.f95
> index 3481085..c14b7b2 100644
> --- a/gcc/testsuite/gfortran.dg/goacc/gang-static.f95
> +++ b/gcc/testsuite/gfortran.dg/goacc/gang-static.f95
> @@ -78,5 +78,5 @@ end subroutine test
>   ! { dg-final { scan-tree-dump-times "gang\\(static:2\\)" 1 "omplower" } }
>   ! { dg-final { scan-tree-dump-times "gang\\(static:5\\)" 1 "omplower" } }
>   ! { dg-final { scan-tree-dump-times "gang\\(static:20\\)" 1 "omplower" } }
> -! { dg-final { scan-tree-dump-times "gang\\(num: 5 static:\\\*\\)" 1 "omplower" } }
> -! { dg-final { scan-tree-dump-times "gang\\(num: 30 static:20\\)" 1 "omplower" } }
> +! { dg-final { scan-tree-dump-times "gang\\(num: 5 static:\\\*\\)" 0 "omplower" } }
> +! { dg-final { scan-tree-dump-times "gang\\(num: 30 static:20\\)" 0 "omplower" } }
> diff --git a/gcc/testsuite/gfortran.dg/goacc/reduction-2.f95 b/gcc/testsuite/gfortran.dg/goacc/reduction-2.f95
> index 929fb0e..4c431c8 100644
> --- a/gcc/testsuite/gfortran.dg/goacc/reduction-2.f95
> +++ b/gcc/testsuite/gfortran.dg/goacc/reduction-2.f95
> @@ -11,6 +11,7 @@ subroutine foo ()
>     !$acc end parallel loop
>     !$acc kernels loop reduction(+:a)
>     do k = 2,6
> +     a = a + 1
>     enddo
>     !$acc end kernels loop
>   end subroutine
> @@ -18,5 +19,5 @@ end subroutine
>   ! { dg-final { scan-tree-dump-times "target oacc_parallel firstprivate.a." 1 "gimple" } }
>   ! { dg-final { scan-tree-dump-times "acc loop private.p. reduction..:a." 1 "gimple" } }
>   ! { dg-final { scan-tree-dump-times "target oacc_kernels map.force_tofrom:a .len: 4.." 1 "gimple" } }
> -! { dg-final { scan-tree-dump-times "acc loop private.k. reduction..:a." 1 "gimple" } }
> +! { dg-final { scan-tree-dump-times "acc loop private.k." 1 "gimple" } }
>
>

* Re: [PATCH, 15/16] Add libgomp.oacc-c-c++-common/kernels-*.c
  2015-11-09 20:11 ` [PATCH, 15/16] Add libgomp.oacc-c-c++-common/kernels-*.c Tom de Vries
  2016-01-18 13:39   ` [comitted] Add oacc kernels test in libgomp Tom de Vries
@ 2016-03-09  9:18   ` Tom de Vries
  2016-03-18 12:46     ` Scan for parallelization of the oacc kernels test-cases in gfortran.dg/goacc (was: [PATCH, 15/16] Add libgomp.oacc-c-c++-common/kernels-*.c) Thomas Schwinge
  1 sibling, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2016-03-09  9:18 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

On 09/11/15 21:10, Tom de Vries wrote:
> On 09/11/15 16:35, Tom de Vries wrote:
>> Hi,
>>
>> this patch series for stage1 trunk adds support to:
>> - parallelize oacc kernels regions using parloops, and
>> - map the loops onto the oacc gang dimension.
>>
>> The patch series contains these patches:
>>
>>       1    Insert new exit block only when needed in
>>          transform_to_exit_first_loop_alt
>>       2    Make create_parallel_loop return void
>>       3    Ignore reduction clause on kernels directive
>>       4    Implement -foffload-alias
>>       5    Add in_oacc_kernels_region in struct loop
>>       6    Add pass_oacc_kernels
>>       7    Add pass_dominator_oacc_kernels
>>       8    Add pass_ch_oacc_kernels
>>       9    Add pass_parallelize_loops_oacc_kernels
>>      10    Add pass_oacc_kernels pass group in passes.def
>>      11    Update testcases after adding kernels pass group
>>      12    Handle acc loop directive
>>      13    Add c-c++-common/goacc/kernels-*.c
>>      14    Add gfortran.dg/goacc/kernels-*.f95
>>      15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>      16    Add libgomp.oacc-fortran/kernels-*.f95
>>
>> The first 9 patches are more or less independent, but patches 10-16 are
>> intended to be committed at the same time.
>>
>> Bootstrapped and reg-tested on x86_64.
>>
>> Built and reg-tested with nvidia accelerator, in combination with a
>> patch that enables accelerator testing (which is submitted at
>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>
>> I'll post the individual patches in reply to this message.
>
> This patch adds C/C++ oacc kernels execution tests.
>

Retested on current trunk.

Committed, minus the kernels-parallel-loop-data-enter-exit.f95 test.

Thanks,
- Tom

> 0015-Add-libgomp.oacc-c-c-common-kernels-.c.patch
>
>
> Add libgomp.oacc-c-c++-common/kernels-*.c
>
> 2015-11-09  Tom de Vries  <tom@codesourcery.com>
>
> 	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c: New test.
> 	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c: Same.
> 	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c: Same.
> 	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c: Same.
> 	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c: Same.
> 	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c: Same.
> 	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c: Same.
> 	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c: Same.
> 	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c: Same.
> 	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-2.c: Same.
> 	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit-2.c:
> 	Same.
> 	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit.c:
> 	Same.
> 	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-update.c: Same.
> 	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data.c: Same.
> 	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c: Same.
> 	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c: Same.
> 	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c: Same.
> 	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c: Same.
> 	* testsuite/libgomp.oacc-c-c++-common/kernels-loop.c: Same.
> 	* testsuite/libgomp.oacc-c-c++-common/kernels-parallel-loop-data-enter-exit.c:
> 	Same.
> 	* testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c: Same.
> ---
>   .../libgomp.oacc-c-c++-common/kernels-loop-2.c     | 47 ++++++++++++++++++
>   .../libgomp.oacc-c-c++-common/kernels-loop-3.c     | 34 +++++++++++++
>   .../kernels-loop-and-seq-2.c                       | 36 ++++++++++++++
>   .../kernels-loop-and-seq-3.c                       | 37 ++++++++++++++
>   .../kernels-loop-and-seq-4.c                       | 36 ++++++++++++++
>   .../kernels-loop-and-seq-5.c                       | 37 ++++++++++++++
>   .../kernels-loop-and-seq-6.c                       | 36 ++++++++++++++
>   .../kernels-loop-and-seq.c                         | 37 ++++++++++++++
>   .../kernels-loop-collapse.c                        | 40 ++++++++++++++++
>   .../kernels-loop-data-2.c                          | 56 ++++++++++++++++++++++
>   .../kernels-loop-data-enter-exit-2.c               | 54 +++++++++++++++++++++
>   .../kernels-loop-data-enter-exit.c                 | 51 ++++++++++++++++++++
>   .../kernels-loop-data-update.c                     | 53 ++++++++++++++++++++
>   .../libgomp.oacc-c-c++-common/kernels-loop-data.c  | 50 +++++++++++++++++++
>   .../libgomp.oacc-c-c++-common/kernels-loop-g.c     |  5 ++
>   .../kernels-loop-mod-not-zero.c                    | 41 ++++++++++++++++
>   .../libgomp.oacc-c-c++-common/kernels-loop-n.c     | 47 ++++++++++++++++++
>   .../libgomp.oacc-c-c++-common/kernels-loop-nest.c  | 26 ++++++++++
>   .../libgomp.oacc-c-c++-common/kernels-loop.c       | 41 ++++++++++++++++
>   .../kernels-parallel-loop-data-enter-exit.c        | 52 ++++++++++++++++++++
>   .../libgomp.oacc-c-c++-common/kernels-reduction.c  | 37 ++++++++++++++
>   21 files changed, 853 insertions(+)
>   create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c
>   create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c
>   create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
>   create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
>   create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
>   create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
>   create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
>   create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
>   create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
>   create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-2.c
>   create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit-2.c
>   create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit.c
>   create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-update.c
>   create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data.c
>   create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c
>   create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c
>   create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c
>   create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c
>   create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c
>   create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-parallel-loop-data-enter-exit.c
>   create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c
>
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c
> new file mode 100644
> index 0000000..13e57bd
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c
> @@ -0,0 +1,47 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +
> +#include <stdlib.h>
> +
> +#define N (1024 * 512)
> +#define COUNTERTYPE unsigned int
> +
> +int
> +main (void)
> +{
> +  unsigned int *__restrict a;
> +  unsigned int *__restrict b;
> +  unsigned int *__restrict c;
> +
> +  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +
> +#pragma acc kernels copyout (a[0:N])
> +  {
> +    for (COUNTERTYPE i = 0; i < N; i++)
> +      a[i] = i * 2;
> +  }
> +
> +#pragma acc kernels copyout (b[0:N])
> +  {
> +    for (COUNTERTYPE i = 0; i < N; i++)
> +      b[i] = i * 4;
> +  }
> +
> +#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
> +  {
> +    for (COUNTERTYPE ii = 0; ii < N; ii++)
> +      c[ii] = a[ii] + b[ii];
> +  }
> +
> +  for (COUNTERTYPE i = 0; i < N; i++)
> +    if (c[i] != a[i] + b[i])
> +      abort ();
> +
> +  free (a);
> +  free (b);
> +  free (c);
> +
> +  return 0;
> +}
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c
> new file mode 100644
> index 0000000..f61a74a
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c
> @@ -0,0 +1,34 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +
> +#include <stdlib.h>
> +
> +#define N (1024 * 512)
> +#define COUNTERTYPE unsigned int
> +
> +int
> +main (void)
> +{
> +  unsigned int i;
> +
> +  unsigned int *__restrict c;
> +
> +  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +
> +  for (COUNTERTYPE i = 0; i < N; i++)
> +    c[i] = i * 2;
> +
> +#pragma acc kernels copy (c[0:N])
> +  {
> +    for (COUNTERTYPE ii = 0; ii < N; ii++)
> +      c[ii] = c[ii] + ii + 1;
> +  }
> +
> +  for (COUNTERTYPE i = 0; i < N; i++)
> +    if (c[i] != i * 2 + i + 1)
> +      abort ();
> +
> +  free (c);
> +
> +  return 0;
> +}
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
> new file mode 100644
> index 0000000..2e4100f
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
> @@ -0,0 +1,36 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +
> +#include <stdlib.h>
> +
> +#define N 32
> +
> +unsigned int
> +foo (int n, unsigned int *a)
> +{
> +#pragma acc kernels copy (a[0:N])
> +  {
> +    a[0] = a[0] + 1;
> +
> +    for (int i = 0; i < n; i++)
> +      a[i] = 1;
> +  }
> +
> +  return a[0];
> +}
> +
> +int
> +main (void)
> +{
> +  unsigned int a[N];
> +  unsigned res, i;
> +
> +  for (i = 0; i < N; ++i)
> +    a[i] = i % 4;
> +
> +  res = foo (N, a);
> +  if (res != 1)
> +    abort ();
> +
> +  return 0;
> +}
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
> new file mode 100644
> index 0000000..b3e736b
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
> @@ -0,0 +1,37 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +
> +#include <stdlib.h>
> +
> +#define N 32
> +
> +unsigned int
> +foo (int n, unsigned int *a)
> +{
> +
> +#pragma acc kernels copy (a[0:N])
> +  {
> +    for (int i = 0; i < n; i++)
> +      a[i] = 1;
> +
> +    a[0] = 2;
> +  }
> +
> +  return a[0];
> +}
> +
> +int
> +main (void)
> +{
> +  unsigned int a[N];
> +  unsigned res, i;
> +
> +  for (i = 0; i < N; ++i)
> +    a[i] = i % 4;
> +
> +  res = foo (N, a);
> +  if (res != 2)
> +    abort ();
> +
> +  return 0;
> +}
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
> new file mode 100644
> index 0000000..8b9affa
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
> @@ -0,0 +1,36 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +
> +#include <stdlib.h>
> +
> +#define N 32
> +
> +unsigned int
> +foo (int n, unsigned int *a)
> +{
> +#pragma acc kernels copy (a[0:N])
> +  {
> +    a[0] = 2;
> +
> +    for (int i = 0; i < n; i++)
> +      a[i] = 1;
> +  }
> +
> +  return a[0];
> +}
> +
> +int
> +main (void)
> +{
> +  unsigned int a[N];
> +  unsigned res, i;
> +
> +  for (i = 0; i < N; ++i)
> +    a[i] = i % 4;
> +
> +  res = foo (N, a);
> +  if (res != 1)
> +    abort ();
> +
> +  return 0;
> +}
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
> new file mode 100644
> index 0000000..83d4e7f
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
> @@ -0,0 +1,37 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +
> +#include <stdlib.h>
> +
> +#define N 32
> +
> +unsigned int
> +foo (int n, unsigned int *a)
> +{
> +  int r;
> +#pragma acc kernels copyout(r) copy (a[0:N])
> +  {
> +    r = a[0];
> +
> +    for (int i = 0; i < n; i++)
> +      a[i] = 1;
> +  }
> +
> +  return r;
> +}
> +
> +int
> +main (void)
> +{
> +  unsigned int a[N];
> +  unsigned res, i;
> +
> +  for (i = 0; i < N; ++i)
> +    a[i] = i % 4;
> +
> +  res = foo (N, a);
> +  if (res != 0)
> +    abort ();
> +
> +  return 0;
> +}
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
> new file mode 100644
> index 0000000..01d5e5e
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
> @@ -0,0 +1,36 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +
> +#include <stdlib.h>
> +
> +#define N 32
> +
> +unsigned int
> +foo (int n, unsigned int *a)
> +{
> +#pragma acc kernels copy (a[0:N])
> +  {
> +    int r = a[0];
> +
> +    for (int i = 0; i < n; i++)
> +      a[i] = 1 + r;
> +  }
> +
> +  return a[0];
> +}
> +
> +int
> +main (void)
> +{
> +  unsigned int a[N];
> +  unsigned res, i;
> +
> +  for (i = 0; i < N; ++i)
> +    a[i] = i % 4;
> +
> +  res = foo (N, a);
> +  if (res != 1)
> +    abort ();
> +
> +  return 0;
> +}
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
> new file mode 100644
> index 0000000..61d1283
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
> @@ -0,0 +1,37 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +
> +#include <stdlib.h>
> +
> +#define N 32
> +
> +unsigned int
> +foo (int n, unsigned int *a)
> +{
> +
> +#pragma acc kernels copy (a[0:N])
> +  {
> +    for (int i = 0; i < n; i++)
> +      a[i] = 1;
> +
> +    a[0] = a[0] + 1;
> +  }
> +
> +  return a[0];
> +}
> +
> +int
> +main (void)
> +{
> +  unsigned int a[N];
> +  unsigned res, i;
> +
> +  for (i = 0; i < N; ++i)
> +    a[i] = i % 4;
> +
> +  res = foo (N, a);
> +  if (res != 2)
> +    abort ();
> +
> +  return 0;
> +}
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
> new file mode 100644
> index 0000000..f7f04cb
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
> @@ -0,0 +1,40 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +
> +#include <stdlib.h>
> +
> +#define N 100
> +
> +int a[N][N];
> +
> +void __attribute__((noinline, noclone))
> +foo (int m, int n)
> +{
> +  int i, j;
> +  #pragma acc kernels
> +  {
> +#pragma acc loop collapse(2)
> +    for (i = 0; i < m; i++)
> +      for (j = 0; j < n; j++)
> +	a[i][j] = 1;
> +  }
> +}
> +
> +int
> +main (void)
> +{
> +  int i, j;
> +
> +  for (i = 0; i < N; i++)
> +    for (j = 0; j < N; j++)
> +      a[i][j] = 0;
> +
> +  foo (N, N);
> +
> +  for (i = 0; i < N; i++)
> +    for (j = 0; j < N; j++)
> +      if (a[i][j] != 1)
> +	abort ();
> +
> +  return 0;
> +}
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-2.c
> new file mode 100644
> index 0000000..b889ef9
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-2.c
> @@ -0,0 +1,56 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +
> +#include <stdlib.h>
> +
> +#define N (1024 * 512)
> +#define COUNTERTYPE unsigned int
> +
> +int
> +main (void)
> +{
> +  unsigned int *__restrict a;
> +  unsigned int *__restrict b;
> +  unsigned int *__restrict c;
> +
> +  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +
> +#pragma acc data copyout (a[0:N])
> +  {
> +#pragma acc kernels present (a[0:N])
> +    {
> +      for (COUNTERTYPE i = 0; i < N; i++)
> +	a[i] = i * 2;
> +    }
> +  }
> +
> +#pragma acc data copyout (b[0:N])
> +  {
> +#pragma acc kernels present (b[0:N])
> +    {
> +      for (COUNTERTYPE i = 0; i < N; i++)
> +	b[i] = i * 4;
> +    }
> +  }
> +
> +#pragma acc data copyin (a[0:N], b[0:N]) copyout (c[0:N])
> +  {
> +#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
> +    {
> +      for (COUNTERTYPE ii = 0; ii < N; ii++)
> +	c[ii] = a[ii] + b[ii];
> +    }
> +  }
> +
> +  for (COUNTERTYPE i = 0; i < N; i++)
> +    if (c[i] != a[i] + b[i])
> +      abort ();
> +
> +  free (a);
> +  free (b);
> +  free (c);
> +
> +  return 0;
> +}
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit-2.c
> new file mode 100644
> index 0000000..d508a44
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit-2.c
> @@ -0,0 +1,54 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +
> +#include <stdlib.h>
> +
> +#define N (1024 * 512)
> +#define COUNTERTYPE unsigned int
> +
> +int
> +main (void)
> +{
> +  unsigned int *__restrict a;
> +  unsigned int *__restrict b;
> +  unsigned int *__restrict c;
> +
> +  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +
> +#pragma acc enter data create (a[0:N])
> +#pragma acc kernels present (a[0:N])
> +  {
> +    for (COUNTERTYPE i = 0; i < N; i++)
> +      a[i] = i * 2;
> +  }
> +#pragma acc exit data copyout (a[0:N])
> +
> +#pragma acc enter data create (b[0:N])
> +#pragma acc kernels present (b[0:N])
> +  {
> +    for (COUNTERTYPE i = 0; i < N; i++)
> +      b[i] = i * 4;
> +  }
> +#pragma acc exit data copyout (b[0:N])
> +
> +
> +#pragma acc enter data copyin (a[0:N], b[0:N]) create (c[0:N])
> +#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
> +  {
> +    for (COUNTERTYPE ii = 0; ii < N; ii++)
> +      c[ii] = a[ii] + b[ii];
> +  }
> +#pragma acc exit data copyout (c[0:N])
> +
> +  for (COUNTERTYPE i = 0; i < N; i++)
> +    if (c[i] != a[i] + b[i])
> +      abort ();
> +
> +  free (a);
> +  free (b);
> +  free (c);
> +
> +  return 0;
> +}
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit.c
> new file mode 100644
> index 0000000..11d82f7
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit.c
> @@ -0,0 +1,51 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +
> +#include <stdlib.h>
> +
> +#define N (1024 * 512)
> +#define COUNTERTYPE unsigned int
> +
> +int
> +main (void)
> +{
> +  unsigned int *__restrict a;
> +  unsigned int *__restrict b;
> +  unsigned int *__restrict c;
> +
> +  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +
> +#pragma acc enter data create (a[0:N], b[0:N], c[0:N])
> +
> +#pragma acc kernels present (a[0:N])
> +  {
> +    for (COUNTERTYPE i = 0; i < N; i++)
> +      a[i] = i * 2;
> +  }
> +
> +#pragma acc kernels present (b[0:N])
> +  {
> +    for (COUNTERTYPE i = 0; i < N; i++)
> +      b[i] = i * 4;
> +  }
> +
> +#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
> +  {
> +    for (COUNTERTYPE ii = 0; ii < N; ii++)
> +      c[ii] = a[ii] + b[ii];
> +  }
> +
> +#pragma acc exit data copyout (a[0:N], b[0:N], c[0:N])
> +
> +  for (COUNTERTYPE i = 0; i < N; i++)
> +    if (c[i] != a[i] + b[i])
> +      abort ();
> +
> +  free (a);
> +  free (b);
> +  free (c);
> +
> +  return 0;
> +}
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-update.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-update.c
> new file mode 100644
> index 0000000..a7d4e84
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-update.c
> @@ -0,0 +1,53 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +
> +#include <stdlib.h>
> +
> +#define N (1024 * 512)
> +#define COUNTERTYPE unsigned int
> +
> +int
> +main (void)
> +{
> +  unsigned int *__restrict a;
> +  unsigned int *__restrict b;
> +  unsigned int *__restrict c;
> +
> +  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +
> +#pragma acc enter data create (a[0:N], b[0:N], c[0:N])
> +
> +#pragma acc kernels present (a[0:N])
> +  {
> +    for (COUNTERTYPE i = 0; i < N; i++)
> +      a[i] = i * 2;
> +  }
> +
> +  {
> +    for (COUNTERTYPE i = 0; i < N; i++)
> +      b[i] = i * 4;
> +  }
> +
> +#pragma acc update device (b[0:N])
> +
> +#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
> +  {
> +    for (COUNTERTYPE ii = 0; ii < N; ii++)
> +      c[ii] = a[ii] + b[ii];
> +  }
> +
> +#pragma acc exit data copyout (a[0:N], c[0:N])
> +
> +  for (COUNTERTYPE i = 0; i < N; i++)
> +    if (c[i] != a[i] + b[i])
> +      abort ();
> +
> +  free (a);
> +  free (b);
> +  free (c);
> +
> +  return 0;
> +}
> +
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data.c
> new file mode 100644
> index 0000000..607d7de
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data.c
> @@ -0,0 +1,50 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +
> +#include <stdlib.h>
> +
> +#define N (1024 * 512)
> +#define COUNTERTYPE unsigned int
> +
> +int
> +main (void)
> +{
> +  unsigned int *__restrict a;
> +  unsigned int *__restrict b;
> +  unsigned int *__restrict c;
> +
> +  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +
> +#pragma acc data copyout (a[0:N], b[0:N], c[0:N])
> +  {
> +#pragma acc kernels present (a[0:N])
> +    {
> +      for (COUNTERTYPE i = 0; i < N; i++)
> +	a[i] = i * 2;
> +    }
> +
> +#pragma acc kernels present (b[0:N])
> +    {
> +      for (COUNTERTYPE i = 0; i < N; i++)
> +	b[i] = i * 4;
> +    }
> +
> +#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
> +    {
> +      for (COUNTERTYPE ii = 0; ii < N; ii++)
> +	c[ii] = a[ii] + b[ii];
> +    }
> +  }
> +
> +  for (COUNTERTYPE i = 0; i < N; i++)
> +    if (c[i] != a[i] + b[i])
> +      abort ();
> +
> +  free (a);
> +  free (b);
> +  free (c);
> +
> +  return 0;
> +}
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c
> new file mode 100644
> index 0000000..96b6e4e
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c
> @@ -0,0 +1,5 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* { dg-additional-options "-g" } */
> +
> +#include "kernels-loop.c"
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c
> new file mode 100644
> index 0000000..1433cb2
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c
> @@ -0,0 +1,41 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +
> +#include <stdlib.h>
> +
> +#define N ((1024 * 512) + 1)
> +#define COUNTERTYPE unsigned int
> +
> +int
> +main (void)
> +{
> +  unsigned int *__restrict a;
> +  unsigned int *__restrict b;
> +  unsigned int *__restrict c;
> +
> +  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +
> +  for (COUNTERTYPE i = 0; i < N; i++)
> +    a[i] = i * 2;
> +
> +  for (COUNTERTYPE i = 0; i < N; i++)
> +    b[i] = i * 4;
> +
> +#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
> +  {
> +    for (COUNTERTYPE ii = 0; ii < N; ii++)
> +      c[ii] = a[ii] + b[ii];
> +  }
> +
> +  for (COUNTERTYPE i = 0; i < N; i++)
> +    if (c[i] != a[i] + b[i])
> +      abort ();
> +
> +  free (a);
> +  free (b);
> +  free (c);
> +
> +  return 0;
> +}
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c
> new file mode 100644
> index 0000000..fd0d5b1
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c
> @@ -0,0 +1,47 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +
> +#include <stdlib.h>
> +
> +#define N ((1024 * 512) + 1)
> +#define COUNTERTYPE unsigned int
> +
> +static int __attribute__((noinline,noclone))
> +foo (COUNTERTYPE n)
> +{
> +  unsigned int *__restrict a;
> +  unsigned int *__restrict b;
> +  unsigned int *__restrict c;
> +
> +  a = (unsigned int *__restrict)malloc (n * sizeof (unsigned int));
> +  b = (unsigned int *__restrict)malloc (n * sizeof (unsigned int));
> +  c = (unsigned int *__restrict)malloc (n * sizeof (unsigned int));
> +
> +  for (COUNTERTYPE i = 0; i < n; i++)
> +    a[i] = i * 2;
> +
> +  for (COUNTERTYPE i = 0; i < n; i++)
> +    b[i] = i * 4;
> +
> +#pragma acc kernels copyin (a[0:n], b[0:n]) copyout (c[0:n])
> +  {
> +    for (COUNTERTYPE ii = 0; ii < n; ii++)
> +      c[ii] = a[ii] + b[ii];
> +  }
> +
> +  for (COUNTERTYPE i = 0; i < n; i++)
> +    if (c[i] != a[i] + b[i])
> +      abort ();
> +
> +  free (a);
> +  free (b);
> +  free (c);
> +
> +  return 0;
> +}
> +
> +int
> +main (void)
> +{
> +  return foo (N);
> +}
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c
> new file mode 100644
> index 0000000..21d2599
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c
> @@ -0,0 +1,26 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +
> +#include <stdlib.h>
> +
> +#define N 1000
> +
> +int
> +main (void)
> +{
> +  int x[N][N];
> +
> +#pragma acc kernels copyout (x)
> +  {
> +    for (int ii = 0; ii < N; ii++)
> +      for (int jj = 0; jj < N; jj++)
> +	x[ii][jj] = ii + jj + 3;
> +  }
> +
> +  for (int i = 0; i < N; i++)
> +    for (int j = 0; j < N; j++)
> +      if (x[i][j] != i + j + 3)
> +	abort ();
> +
> +  return 0;
> +}
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c
> new file mode 100644
> index 0000000..3762e5a
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c
> @@ -0,0 +1,41 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +
> +#include <stdlib.h>
> +
> +#define N (1024 * 512)
> +#define COUNTERTYPE unsigned int
> +
> +int
> +main (void)
> +{
> +  unsigned int *__restrict a;
> +  unsigned int *__restrict b;
> +  unsigned int *__restrict c;
> +
> +  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +
> +  for (COUNTERTYPE i = 0; i < N; i++)
> +    a[i] = i * 2;
> +
> +  for (COUNTERTYPE i = 0; i < N; i++)
> +    b[i] = i * 4;
> +
> +#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
> +  {
> +    for (COUNTERTYPE ii = 0; ii < N; ii++)
> +      c[ii] = a[ii] + b[ii];
> +  }
> +
> +  for (COUNTERTYPE i = 0; i < N; i++)
> +    if (c[i] != a[i] + b[i])
> +      abort ();
> +
> +  free (a);
> +  free (b);
> +  free (c);
> +
> +  return 0;
> +}
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-parallel-loop-data-enter-exit.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-parallel-loop-data-enter-exit.c
> new file mode 100644
> index 0000000..767f6c8
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-parallel-loop-data-enter-exit.c
> @@ -0,0 +1,52 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +
> +#include <stdlib.h>
> +
> +#define N (1024 * 512)
> +#define COUNTERTYPE unsigned int
> +
> +int
> +main (void)
> +{
> +  unsigned int *__restrict a;
> +  unsigned int *__restrict b;
> +  unsigned int *__restrict c;
> +
> +  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
> +
> +#pragma acc enter data create (a[0:N], b[0:N], c[0:N])
> +
> +#pragma acc kernels present (a[0:N])
> +  {
> +    for (COUNTERTYPE i = 0; i < N; i++)
> +      a[i] = i * 2;
> +  }
> +
> +#pragma acc parallel present (b[0:N])
> +  {
> +#pragma acc loop
> +    for (COUNTERTYPE i = 0; i < N; i++)
> +      b[i] = i * 4;
> +  }
> +
> +#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
> +  {
> +    for (COUNTERTYPE ii = 0; ii < N; ii++)
> +      c[ii] = a[ii] + b[ii];
> +  }
> +
> +#pragma acc exit data copyout (a[0:N], b[0:N], c[0:N])
> +
> +  for (COUNTERTYPE i = 0; i < N; i++)
> +    if (c[i] != a[i] + b[i])
> +      abort ();
> +
> +  free (a);
> +  free (b);
> +  free (c);
> +
> +  return 0;
> +}
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c
> new file mode 100644
> index 0000000..511e25f
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c
> @@ -0,0 +1,37 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +
> +#include <stdlib.h>
> +
> +#define n 10000
> +
> +unsigned int a[n];
> +
> +void __attribute__((noinline,noclone))
> +foo (void)
> +{
> +  int i;
> +  unsigned int sum = 1;
> +
> +#pragma acc kernels copyin (a[0:n]) copy (sum)
> +  {
> +    for (i = 0; i < n; ++i)
> +      sum += a[i];
> +  }
> +
> +  if (sum != 5001)
> +    abort ();
> +}
> +
> +int
> +main ()
> +{
> +  int i;
> +
> +  for (i = 0; i < n; ++i)
> +    a[i] = i % 2;
> +
> +  foo ();
> +
> +  return 0;
> +}
>

* Re: [PATCH, 16/16] Add libgomp.oacc-fortran/kernels-*.f95
  2015-11-09 20:12 ` [PATCH, 16/16] Add libgomp.oacc-fortran/kernels-*.f95 Tom de Vries
@ 2016-03-09  9:19   ` Tom de Vries
  2016-03-16 13:12     ` Thomas Schwinge
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2016-03-09  9:19 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

On 09/11/15 21:12, Tom de Vries wrote:
> This patch adds Fortran oacc kernels execution tests.

Retested on current trunk.

Committed, minus the kernels-parallel-loop-data-enter-exit.f95 test.

Thanks,
- Tom

> 0016-Add-libgomp.oacc-fortran-kernels-.f95.patch
>
>
> Add libgomp.oacc-fortran/kernels-*.f95
>
> 2015-11-09  Tom de Vries  <tom@codesourcery.com>
>
> 	* testsuite/libgomp.oacc-fortran/kernels-loop-2.f95: New test.
> 	* testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95: Same.
> 	* testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95:
> 	Same.
> 	* testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95: Same.
> 	* testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95: Same.
> 	* testsuite/libgomp.oacc-fortran/kernels-loop-data.f95: Same.
> 	* testsuite/libgomp.oacc-fortran/kernels-loop.f95: Same.
> 	* testsuite/libgomp.oacc-fortran/kernels-parallel-loop-data-enter-exit.f95:
> 	Same.
> ---
>   .../libgomp.oacc-fortran/kernels-loop-2.f95        | 32 ++++++++++++++++++
>   .../libgomp.oacc-fortran/kernels-loop-data-2.f95   | 38 ++++++++++++++++++++++
>   .../kernels-loop-data-enter-exit-2.f95             | 38 ++++++++++++++++++++++
>   .../kernels-loop-data-enter-exit.f95               | 36 ++++++++++++++++++++
>   .../kernels-loop-data-update.f95                   | 36 ++++++++++++++++++++
>   .../libgomp.oacc-fortran/kernels-loop-data.f95     | 36 ++++++++++++++++++++
>   .../libgomp.oacc-fortran/kernels-loop.f95          | 28 ++++++++++++++++
>   .../kernels-parallel-loop-data-enter-exit.f95      | 37 +++++++++++++++++++++
>   8 files changed, 281 insertions(+)
>   create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-2.f95
>   create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95
>   create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95
>   create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95
>   create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95
>   create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data.f95
>   create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop.f95
>   create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/kernels-parallel-loop-data-enter-exit.f95
>
> diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-2.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-2.f95
> new file mode 100644
> index 0000000..1fb40ee
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-2.f95
> @@ -0,0 +1,32 @@
> +! { dg-do run }
> +! { dg-options "-ftree-parallelize-loops=32" }
> +
> +program main
> +  implicit none
> +  integer, parameter         :: n = 1024
> +  integer, dimension (0:n-1) :: a, b, c
> +  integer                    :: i, ii
> +
> +  !$acc kernels copyout (a(0:n-1))
> +  do i = 0, n - 1
> +     a(i) = i * 2
> +  end do
> +  !$acc end kernels
> +
> +  !$acc kernels copyout (b(0:n-1))
> +  do i = 0, n - 1
> +     b(i) = i * 4
> +  end do
> +  !$acc end kernels
> +
> +  !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
> +  do ii = 0, n - 1
> +     c(ii) = a(ii) + b(ii)
> +  end do
> +  !$acc end kernels
> +
> +  do i = 0, n - 1
> +     if (c(i) .ne. a(i) + b(i)) call abort
> +  end do
> +
> +end program main
> diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95
> new file mode 100644
> index 0000000..7b52253
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95
> @@ -0,0 +1,38 @@
> +! { dg-do run }
> +! { dg-options "-ftree-parallelize-loops=32" }
> +
> +program main
> +  implicit none
> +  integer, parameter         :: n = 1024
> +  integer, dimension (0:n-1) :: a, b, c
> +  integer                    :: i, ii
> +
> +  !$acc data copyout (a(0:n-1))
> +  !$acc kernels present (a(0:n-1))
> +  do i = 0, n - 1
> +     a(i) = i * 2
> +  end do
> +  !$acc end kernels
> +  !$acc end data
> +
> +  !$acc data copyout (b(0:n-1))
> +  !$acc kernels present (b(0:n-1))
> +  do i = 0, n - 1
> +     b(i) = i * 4
> +  end do
> +  !$acc end kernels
> +  !$acc end data
> +
> +  !$acc data copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
> +  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
> +  do ii = 0, n - 1
> +     c(ii) = a(ii) + b(ii)
> +  end do
> +  !$acc end kernels
> +  !$acc end data
> +
> +  do i = 0, n - 1
> +     if (c(i) .ne. a(i) + b(i)) call abort
> +  end do
> +
> +end program main
> diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95
> new file mode 100644
> index 0000000..af98efa
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95
> @@ -0,0 +1,38 @@
> +! { dg-do run }
> +! { dg-options "-ftree-parallelize-loops=32" }
> +
> +program main
> +  implicit none
> +  integer, parameter         :: n = 1024
> +  integer, dimension (0:n-1) :: a, b, c
> +  integer                    :: i, ii
> +
> +  !$acc enter data create (a(0:n-1))
> +  !$acc kernels present (a(0:n-1))
> +  do i = 0, n - 1
> +     a(i) = i * 2
> +  end do
> +  !$acc end kernels
> +  !$acc exit data copyout (a(0:n-1))
> +
> +  !$acc enter data create (b(0:n-1))
> +  !$acc kernels present (b(0:n-1))
> +  do i = 0, n - 1
> +     b(i) = i * 4
> +  end do
> +  !$acc end kernels
> +  !$acc exit data copyout (b(0:n-1))
> +
> +  !$acc enter data copyin (a(0:n-1), b(0:n-1)) create (c(0:n-1))
> +  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
> +  do ii = 0, n - 1
> +     c(ii) = a(ii) + b(ii)
> +  end do
> +  !$acc end kernels
> +  !$acc exit data copyout (c(0:n-1))
> +
> +  do i = 0, n - 1
> +     if (c(i) .ne. a(i) + b(i)) call abort
> +  end do
> +
> +end program main
> diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95
> new file mode 100644
> index 0000000..bb6f8dc
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95
> @@ -0,0 +1,36 @@
> +! { dg-do run }
> +! { dg-options "-ftree-parallelize-loops=32" }
> +
> +program main
> +  implicit none
> +  integer, parameter         :: n = 1024
> +  integer, dimension (0:n-1) :: a, b, c
> +  integer                    :: i, ii
> +
> +  !$acc enter data create (a(0:n-1), b(0:n-1), c(0:n-1))
> +
> +  !$acc kernels present (a(0:n-1))
> +  do i = 0, n - 1
> +     a(i) = i * 2
> +  end do
> +  !$acc end kernels
> +
> +  !$acc kernels present (b(0:n-1))
> +  do i = 0, n - 1
> +     b(i) = i * 4
> +  end do
> +  !$acc end kernels
> +
> +  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
> +  do ii = 0, n - 1
> +     c(ii) = a(ii) + b(ii)
> +  end do
> +  !$acc end kernels
> +
> +  !$acc exit data copyout (a(0:n-1), b(0:n-1), c(0:n-1))
> +
> +  do i = 0, n - 1
> +     if (c(i) .ne. a(i) + b(i)) call abort
> +  end do
> +
> +end program main
> diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95
> new file mode 100644
> index 0000000..cab1f2c
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95
> @@ -0,0 +1,36 @@
> +! { dg-do run }
> +! { dg-options "-ftree-parallelize-loops=32" }
> +
> +program main
> +  implicit none
> +  integer, parameter         :: n = 1024
> +  integer, dimension (0:n-1) :: a, b, c
> +  integer                    :: i, ii
> +
> +  !$acc enter data create (a(0:n-1), b(0:n-1), c(0:n-1))
> +
> +  !$acc kernels present (a(0:n-1))
> +  do i = 0, n - 1
> +     a(i) = i * 2
> +  end do
> +  !$acc end kernels
> +
> +  do i = 0, n - 1
> +     b(i) = i * 4
> +  end do
> +
> +  !$acc update device (b(0:n-1))
> +
> +  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
> +  do ii = 0, n - 1
> +     c(ii) = a(ii) + b(ii)
> +  end do
> +  !$acc end kernels
> +
> +  !$acc exit data copyout (a(0:n-1), c(0:n-1))
> +
> +  do i = 0, n - 1
> +     if (c(i) .ne. a(i) + b(i)) call abort
> +  end do
> +
> +end program main
> diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data.f95
> new file mode 100644
> index 0000000..f26671d
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data.f95
> @@ -0,0 +1,36 @@
> +! { dg-do run }
> +! { dg-options "-ftree-parallelize-loops=32" }
> +
> +program main
> +  implicit none
> +  integer, parameter         :: n = 1024
> +  integer, dimension (0:n-1) :: a, b, c
> +  integer                    :: i, ii
> +
> +  !$acc data copyout (a(0:n-1), b(0:n-1), c(0:n-1))
> +
> +  !$acc kernels present (a(0:n-1))
> +  do i = 0, n - 1
> +     a(i) = i * 2
> +  end do
> +  !$acc end kernels
> +
> +  !$acc kernels present (b(0:n-1))
> +  do i = 0, n -1
> +     b(i) = i * 4
> +  end do
> +  !$acc end kernels
> +
> +  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
> +  do ii = 0, n - 1
> +     c(ii) = a(ii) + b(ii)
> +  end do
> +  !$acc end kernels
> +
> +  !$acc end data
> +
> +  do i = 0, n - 1
> +     if (c(i) .ne. a(i) + b(i)) call abort
> +  end do
> +
> +end program main
> diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop.f95
> new file mode 100644
> index 0000000..b02dd57
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop.f95
> @@ -0,0 +1,28 @@
> +! { dg-do run }
> +! { dg-options "-ftree-parallelize-loops=32" }
> +
> +program main
> +  implicit none
> +  integer, parameter         :: n = 1024
> +  integer, dimension (0:n-1) :: a, b, c
> +  integer                    :: i, ii
> +
> +  do i = 0, n - 1
> +     a(i) = i * 2
> +  end do
> +
> +  do i = 0, n -1
> +     b(i) = i * 4
> +  end do
> +
> +  !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
> +  do ii = 0, n - 1
> +     c(ii) = a(ii) + b(ii)
> +  end do
> +  !$acc end kernels
> +
> +  do i = 0, n - 1
> +     if (c(i) .ne. a(i) + b(i)) call abort
> +  end do
> +
> +end program main
> diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-parallel-loop-data-enter-exit.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-parallel-loop-data-enter-exit.f95
> new file mode 100644
> index 0000000..2322152
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-parallel-loop-data-enter-exit.f95
> @@ -0,0 +1,37 @@
> +! { dg-do run }
> +! { dg-options "-ftree-parallelize-loops=32" }
> +
> +program main
> +  implicit none
> +  integer, parameter         :: n = 1024
> +  integer, dimension (0:n-1) :: a, b, c
> +  integer                    :: i, ii
> +
> +  !$acc enter data create (a(0:n-1), b(0:n-1), c(0:n-1))
> +
> +  !$acc kernels present (a(0:n-1))
> +  do i = 0, n - 1
> +     a(i) = i * 2
> +  end do
> +  !$acc end kernels
> +
> +  !$acc parallel present (b(0:n-1))
> +  !$acc loop
> +  do i = 0, n -1
> +     b(i) = i * 4
> +  end do
> +  !$acc end parallel
> +
> +  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
> +  do ii = 0, n - 1
> +     c(ii) = a(ii) + b(ii)
> +  end do
> +  !$acc end kernels
> +
> +  !$acc exit data copyout (a(0:n-1), b(0:n-1), c(0:n-1))
> +
> +  do i = 0, n - 1
> +     if (c(i) .ne. a(i) + b(i)) call abort
> +  end do
> +
> +end program main
>


* [PING^2][PATCH, 12/16] Handle acc loop directive
  2016-03-07  8:22                   ` [PING][PATCH, " Tom de Vries
@ 2016-03-14  6:21                     ` Tom de Vries
  0 siblings, 0 replies; 133+ messages in thread
From: Tom de Vries @ 2016-03-14  6:21 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc-patches, Richard Biener

On 07/03/16 09:21, Tom de Vries wrote:
> On 29/02/16 04:26, Tom de Vries wrote:
>> On 22-02-16 11:57, Jakub Jelinek wrote:
>>> On Mon, Feb 22, 2016 at 11:54:46AM +0100, Tom de Vries wrote:
>>>> Following up on your suggestion to implement this during
>>>> gimplification, I
>>>> wrote attached patch.
>>>>
>>>> I'll put it through some openacc testing and add testcases. Is this
>>>> approach
>>>> acceptable for stage4?
>>>
>>> LGTM.
>>
>> Hi,
>>
>> I ran into trouble during testing of this patch, with ignoring the
>> private clause on the loop directive.
>>
>> This openacc testcase compiles atm without a problem:
>> ...
>> int
>> main (void)
>> {
>>    int j;
>> #pragma acc kernels default(none)
>>    {
>> #pragma acc loop private (j)
>>      for (unsigned i = 0; i < 1000; ++i)
>>        {
>>      j;
>>        }
>>    }
>> }
>> ...
>>
>> But when compiling with the patch, and ignoring the private clause, we
>> run into this error:
>> ...
>> test.c: In function ‘main’:
>> test.c:10:2: error: ‘j’ not specified in enclosing OpenACC ‘kernels’
>> construct
>>    j;
>>    ^
>> test.c:5:9: note: enclosing OpenACC ‘kernels’ construct
>>   #pragma acc kernels default(none)
>> ...
>>
>> So I updated the patch to ignore all but the private clause on the loop
>> directive during gimplification, and moved the sequential expansion of
>> the omp-for construct from gimplify to omp-lower.
>>
>> Bootstrapped and reg-tested on x86_64.
>>
>> Build for nvidia accelerator and reg-tested goacc.exp and libgomp
>> testsuite.
>>
>> Updated patch still ok for stage4?
>>

Ping. ( Submitted here:
https://gcc.gnu.org/ml/gcc-patches/2016-02/msg01903.html )

Thanks,
- Tom

>> 0001-Ignore-acc-loop-directive-in-kernels-region.patch
>>
>>
>> Ignore acc loop directive in kernels region
>>
>> 2016-02-29  Tom de Vries  <tom@codesourcery.com>
>>
>>     * gimplify.c (gimplify_ctx_in_oacc_kernels_region): New function.
>>     (gimplify_omp_for): Ignore all but private clause on loop
>> directive in
>>     kernels region.
>>     * omp-low.c (lower_omp_for_seq): New function.
>>     (lower_omp_for): Use lower_omp_for_seq in kernels region.  Don't
>>     generate omp continue/return.
>>
>>     * c-c++-common/goacc/kernels-acc-loop-reduction.c: New test.
>>     * c-c++-common/goacc/kernels-acc-loop-smaller-equal.c: Same.
>>     * c-c++-common/goacc/kernels-loop-2-acc-loop.c: Same.
>>     * c-c++-common/goacc/kernels-loop-3-acc-loop.c: Same.
>>     * c-c++-common/goacc/kernels-loop-acc-loop.c: Same.
>>     * c-c++-common/goacc/kernels-loop-n-acc-loop.c: Same.
>>     * c-c++-common/goacc/combined-directives.c: Update test.
>>     * c-c++-common/goacc/loop-private-1.c: Same.
>>     * gfortran.dg/goacc/combined-directives.f90: Same.
>>     * gfortran.dg/goacc/gang-static.f95: Same.
>>     * gfortran.dg/goacc/reduction-2.f95: Same.
>>
>> ---
>>   gcc/gimplify.c                                     | 41 ++++++++++
>>   gcc/omp-low.c                                      | 93
>> ++++++++++++++++++++--
>>   .../c-c++-common/goacc/combined-directives.c       | 16 ++--
>>   .../goacc/kernels-acc-loop-reduction.c             | 24 ++++++
>>   .../goacc/kernels-acc-loop-smaller-equal.c         | 22 +++++
>>   .../c-c++-common/goacc/kernels-loop-2-acc-loop.c   | 17 ++++
>>   .../c-c++-common/goacc/kernels-loop-3-acc-loop.c   | 14 ++++
>>   .../c-c++-common/goacc/kernels-loop-acc-loop.c     | 14 ++++
>>   .../c-c++-common/goacc/kernels-loop-n-acc-loop.c   | 14 ++++
>>   gcc/testsuite/c-c++-common/goacc/loop-private-1.c  |  2 +-
>>   .../gfortran.dg/goacc/combined-directives.f90      | 16 ++--
>>   gcc/testsuite/gfortran.dg/goacc/gang-static.f95    |  4 +-
>>   gcc/testsuite/gfortran.dg/goacc/reduction-2.f95    |  3 +-
>>   13 files changed, 252 insertions(+), 28 deletions(-)
>>
>> diff --git a/gcc/gimplify.c b/gcc/gimplify.c
>> index 7be6bd7..4b82305 100644
>> --- a/gcc/gimplify.c
>> +++ b/gcc/gimplify.c
>> @@ -8364,6 +8364,20 @@ find_combined_omp_for (tree *tp, int
>> *walk_subtrees, void *)
>>     return NULL_TREE;
>>   }
>>
>> +/* Return true if CTX is (part of) an oacc kernels region.  */
>> +
>> +static bool
>> +gimplify_ctx_in_oacc_kernels_region (gimplify_omp_ctx *ctx)
>> +{
>> +  for (;ctx != NULL; ctx = ctx->outer_context)
>> +    {
>> +      if (ctx->region_type == ORT_ACC_KERNELS)
>> +    return true;
>> +    }
>> +
>> +  return false;
>> +}
>> +
>>   /* Gimplify the gross structure of an OMP_FOR statement.  */
>>
>>   static enum gimplify_status
>> @@ -8403,6 +8417,33 @@ gimplify_omp_for (tree *expr_p, gimple_seq *pre_p)
>>         gcc_unreachable ();
>>       }
>>
>> +  /* Skip loop clauses not handled in kernels region.  */
>> +  if (gimplify_ctx_in_oacc_kernels_region (gimplify_omp_ctxp))
>> +    {
>> +      tree *prev_ptr = &OMP_FOR_CLAUSES (for_stmt);
>> +
>> +      while (tree probe = *prev_ptr)
>> +    {
>> +      tree *next_ptr = &OMP_CLAUSE_CHAIN (probe);
>> +
>> +      bool keep_clause;
>> +      switch (OMP_CLAUSE_CODE (probe))
>> +        {
>> +        case OMP_CLAUSE_PRIVATE:
>> +          keep_clause = true;
>> +          break;
>> +        default:
>> +          keep_clause = false;
>> +          break;
>> +        }
>> +
>> +      if (keep_clause)
>> +        prev_ptr = next_ptr;
>> +      else
>> +        *prev_ptr = *next_ptr;
>> +    }
>> +    }
>> +
>>     /* Set OMP_CLAUSE_LINEAR_NO_COPYIN flag on explicit linear
>>        clause for the IV.  */
>>     if (ort == ORT_SIMD && TREE_VEC_LENGTH (OMP_FOR_INIT (for_stmt))
>> == 1)
>> diff --git a/gcc/omp-low.c b/gcc/omp-low.c
>> index fcbb3e0..bb70ac2 100644
>> --- a/gcc/omp-low.c
>> +++ b/gcc/omp-low.c
>> @@ -14944,6 +14944,75 @@ lower_omp_for_lastprivate (struct
>> omp_for_data *fd, gimple_seq *body_p,
>>       }
>>   }
>>
>> +/* Lower the loops with index I and higher in omp_for FOR_STMT as a
>> sequential
>> +   loop, and append the resulting gimple statements to PRE_P.  */
>> +
>> +static void
>> +lower_omp_for_seq (gimple_seq *pre_p, gimple *for_stmt, unsigned int i)
>> +{
>> +  unsigned int len = gimple_omp_for_collapse (for_stmt);
>> +  gcc_assert (i < len);
>> +
>> +  /* Gimplify OMP_FOR[i] as:
>> +
>> +     OMP_FOR_INIT[i];
>> +     goto <loop_entry_label>;
>> +     <fall_thru_label>:
>> +     if (i == len - 1)
>> +       OMP_FOR_BODY;
>> +     else
>> +       OMP_FOR[i+1];
>> +    OMP_FOR_INCR[i];
>> +    <loop_entry_label>:
>> +    if (OMP_FOR_COND[i])
>> +      goto <fall_thru_label>;
>> +    else
>> +      goto <loop_exit_label>;
>> +    <loop_exit_label>:
>> +  */
>> +
>> +  tree loop_entry_label = create_artificial_label (UNKNOWN_LOCATION);
>> +  tree fall_thru_label = create_artificial_label (UNKNOWN_LOCATION);
>> +  tree loop_exit_label = create_artificial_label (UNKNOWN_LOCATION);
>> +
>> +  /* OMP_FOR_INIT[i].  */
>> +  tree init = gimple_omp_for_initial (for_stmt, i);
>> +  tree var = gimple_omp_for_index (for_stmt, i);
>> +  gimple *g = gimple_build_assign (var, init);
>> +  gimple_seq_add_stmt (pre_p, g);
>> +
>> +  /* goto <loop_entry_label>.  */
>> +  gimple_seq_add_stmt (pre_p, gimple_build_goto (loop_entry_label));
>> +
>> +  /* <fall_thru_label>.  */
>> +  gimple_seq_add_stmt (pre_p, gimple_build_label (fall_thru_label));
>> +
>> +  /* if (i == len - 1) OMP_FOR_BODY
>> +     else OMP_FOR[i+1].  */
>> +  if (i == len - 1)
>> +    gimple_seq_add_seq (pre_p, gimple_omp_body (for_stmt));
>> +  else
>> +    lower_omp_for_seq (pre_p, for_stmt, i + 1);
>> +
>> +  /* OMP_FOR_INCR[i].  */
>> +  tree incr = gimple_omp_for_incr (for_stmt, i);
>> +  g = gimple_build_assign (var, incr);
>> +  gimple_seq_add_stmt (pre_p, g);
>> +
>> +  /* <loop_entry_label>.  */
>> +  gimple_seq_add_stmt (pre_p, gimple_build_label (loop_entry_label));
>> +
>> +  /* if (OMP_FOR_COND[i]) goto <fall_thru_label>
>> +     else goto <loop_exit_label>.  */
>> +  enum tree_code cond = gimple_omp_for_cond (for_stmt, i);
>> +  tree final_val = gimple_omp_for_final (for_stmt, i);
>> +  gimple *gimple_cond = gimple_build_cond (cond, var, final_val,
>> +                       fall_thru_label, loop_exit_label);
>> +  gimple_seq_add_stmt (pre_p, gimple_cond);
>> +
>> +  /* <loop_exit_label>.  */
>> +  gimple_seq_add_stmt (pre_p, gimple_build_label (loop_exit_label));
>> +}
>>
>>   /* Lower code for an OMP loop directive.  */
>>
>> @@ -14957,6 +15026,8 @@ lower_omp_for (gimple_stmt_iterator *gsi_p,
>> omp_context *ctx)
>>     gimple_seq omp_for_body, body, dlist;
>>     gimple_seq oacc_head = NULL, oacc_tail = NULL;
>>     size_t i;
>> +  bool oacc_kernels_p = (is_gimple_omp_oacc (ctx->stmt)
>> +             && ctx_in_oacc_kernels_region (ctx));
>>
>>     push_gimplify_context ();
>>
>> @@ -15065,7 +15136,7 @@ lower_omp_for (gimple_stmt_iterator *gsi_p,
>> omp_context *ctx)
>>     extract_omp_for_data (stmt, &fd, NULL);
>>
>>     if (is_gimple_omp_oacc (ctx->stmt)
>> -      && !ctx_in_oacc_kernels_region (ctx))
>> +      && !oacc_kernels_p)
>>       lower_oacc_head_tail (gimple_location (stmt),
>>                 gimple_omp_for_clauses (stmt),
>>                 &oacc_head, &oacc_tail, ctx);
>> @@ -15088,13 +15159,18 @@ lower_omp_for (gimple_stmt_iterator *gsi_p,
>> omp_context *ctx)
>>                           ctx);
>>       }
>>
>> -  if (!gimple_omp_for_grid_phony (stmt))
>> -    gimple_seq_add_stmt (&body, stmt);
>> -  gimple_seq_add_seq (&body, gimple_omp_body (stmt));
>> +  if (oacc_kernels_p)
>> +    lower_omp_for_seq (&body, stmt, 0);
>> +  else if (gimple_omp_for_grid_phony (stmt))
>> +    gimple_seq_add_seq (&body, gimple_omp_body (stmt));
>> +  else
>> +    {
>> +      gimple_seq_add_stmt (&body, stmt);
>> +      gimple_seq_add_seq (&body, gimple_omp_body (stmt));
>>
>> -  if (!gimple_omp_for_grid_phony (stmt))
>> -    gimple_seq_add_stmt (&body, gimple_build_omp_continue (fd.loop.v,
>> -                               fd.loop.v));
>> +      gimple_seq_add_stmt (&body, gimple_build_omp_continue (fd.loop.v,
>> +                                 fd.loop.v));
>> +    }
>>
>>     /* After the loop, add exit clauses.  */
>>     lower_reduction_clauses (gimple_omp_for_clauses (stmt), &body, ctx);
>> @@ -15106,7 +15182,8 @@ lower_omp_for (gimple_stmt_iterator *gsi_p,
>> omp_context *ctx)
>>
>>     body = maybe_catch_exception (body);
>>
>> -  if (!gimple_omp_for_grid_phony (stmt))
>> +  if (!gimple_omp_for_grid_phony (stmt)
>> +      && !oacc_kernels_p)
>>       {
>>         /* Region exit marker goes at the end of the loop body.  */
>>         gimple_seq_add_stmt (&body, gimple_build_omp_return
>> (fd.have_nowait));
>> diff --git a/gcc/testsuite/c-c++-common/goacc/combined-directives.c
>> b/gcc/testsuite/c-c++-common/goacc/combined-directives.c
>> index c387285..66b8b65 100644
>> --- a/gcc/testsuite/c-c++-common/goacc/combined-directives.c
>> +++ b/gcc/testsuite/c-c++-common/goacc/combined-directives.c
>> @@ -108,12 +108,12 @@ test ()
>>   //    ;
>>   }
>>
>> -// { dg-final { scan-tree-dump-times "acc loop collapse.2. private.j.
>> private.i" 2 "gimple" } }
>> -// { dg-final { scan-tree-dump-times "acc loop gang" 2 "gimple" } }
>> -// { dg-final { scan-tree-dump-times "acc loop worker" 2 "gimple" } }
>> -// { dg-final { scan-tree-dump-times "acc loop vector" 2 "gimple" } }
>> -// { dg-final { scan-tree-dump-times "acc loop seq" 2 "gimple" } }
>> -// { dg-final { scan-tree-dump-times "acc loop auto" 2 "gimple" } }
>> -// { dg-final { scan-tree-dump-times "acc loop tile.2, 3" 2 "gimple" } }
>> -// { dg-final { scan-tree-dump-times "acc loop independent private.i"
>> 2 "gimple" } }
>> +// { dg-final { scan-tree-dump-times "acc loop collapse.2. private.j.
>> private.i" 1 "gimple" } }
>> +// { dg-final { scan-tree-dump-times "acc loop gang" 1 "gimple" } }
>> +// { dg-final { scan-tree-dump-times "acc loop worker" 1 "gimple" } }
>> +// { dg-final { scan-tree-dump-times "acc loop vector" 1 "gimple" } }
>> +// { dg-final { scan-tree-dump-times "acc loop seq" 1 "gimple" } }
>> +// { dg-final { scan-tree-dump-times "acc loop auto" 1 "gimple" } }
>> +// { dg-final { scan-tree-dump-times "acc loop tile.2, 3" 1 "gimple" } }
>> +// { dg-final { scan-tree-dump-times "acc loop independent private.i"
>> 1 "gimple" } }
>>   // { dg-final { scan-tree-dump-times "private.z" 2 "gimple" } }
>> diff --git
>> a/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-reduction.c
>> b/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-reduction.c
>> new file mode 100644
>> index 0000000..6a9f52b
>> --- /dev/null
>> +++ b/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-reduction.c
>> @@ -0,0 +1,24 @@
>> +/* { dg-additional-options "-O2" } */
>> +/* { dg-additional-options "-fdump-tree-parloops1-all" } */
>> +/* { dg-additional-options "-fdump-tree-optimized" } */
>> +
>> +unsigned int a[1000];
>> +
>> +unsigned int
>> +foo (int n)
>> +{
>> +  unsigned int sum = 0;
>> +
>> +#pragma acc kernels loop gang reduction(+:sum)
>> +  for (int i = 0; i < n; i++)
>> +    sum += a[i];
>> +
>> +  return sum;
>> +}
>> +
>> +/* Check that only one loop is analyzed, and that it can be
>> parallelized.  */
>> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1
>> "parloops1" } } */
>> +/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
>> +
>> +/* Check that the loop has been split off into a function.  */
>> +/* { dg-final { scan-tree-dump-times "(?n);; Function
>> .*foo.*\\._omp_fn\\.0" 1 "optimized" } } */
>> diff --git
>> a/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-smaller-equal.c
>> b/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-smaller-equal.c
>> new file mode 100644
>> index 0000000..d18c779
>> --- /dev/null
>> +++ b/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-smaller-equal.c
>> @@ -0,0 +1,22 @@
>> +/* { dg-additional-options "-O2" } */
>> +/* { dg-additional-options "-fdump-tree-parloops1-all" } */
>> +/* { dg-additional-options "-fdump-tree-optimized" } */
>> +
>> +unsigned int
>> +foo (int n)
>> +{
>> +  unsigned int sum = 1;
>> +
>> +  #pragma acc kernels loop
>> +  for (int i = 1; i <= n; i++)
>> +    sum += i;
>> +
>> +  return sum;
>> +}
>> +
>> +/* Check that only one loop is analyzed, and that it can be
>> parallelized.  */
>> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1
>> "parloops1" } } */
>> +/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
>> +
>> +/* Check that the loop has been split off into a function.  */
>> +/* { dg-final { scan-tree-dump-times "(?n);; Function
>> .*foo.*\\._omp_fn\\.0" 1 "optimized" } } */
>> diff --git
>> a/gcc/testsuite/c-c++-common/goacc/kernels-loop-2-acc-loop.c
>> b/gcc/testsuite/c-c++-common/goacc/kernels-loop-2-acc-loop.c
>> new file mode 100644
>> index 0000000..95354e1
>> --- /dev/null
>> +++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-2-acc-loop.c
>> @@ -0,0 +1,17 @@
>> +/* { dg-additional-options "-O2" } */
>> +/* { dg-additional-options "-fdump-tree-parloops1-all" } */
>> +/* { dg-additional-options "-fdump-tree-optimized" } */
>> +
>> +/* Check that loops with '#pragma acc loop' tagged gets properly
>> parallelized.  */
>> +#define ACC_LOOP
>> +#include "kernels-loop-2.c"
>> +
>> +/* Check that only three loops are analyzed, and that all can be
>> +   parallelized.  */
>> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3
>> "parloops1" } } */
>> +/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
>> +
>> +/* Check that the loop has been split off into a function.  */
>> +/* { dg-final { scan-tree-dump-times "(?n);; Function
>> .*main._omp_fn.0" 1 "optimized" } } */
>> +/* { dg-final { scan-tree-dump-times "(?n);; Function
>> .*main._omp_fn.1" 1 "optimized" } } */
>> +/* { dg-final { scan-tree-dump-times "(?n);; Function
>> .*main._omp_fn.2" 1 "optimized" } } */
>> diff --git
>> a/gcc/testsuite/c-c++-common/goacc/kernels-loop-3-acc-loop.c
>> b/gcc/testsuite/c-c++-common/goacc/kernels-loop-3-acc-loop.c
>> new file mode 100644
>> index 0000000..1ad3067
>> --- /dev/null
>> +++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-3-acc-loop.c
>> @@ -0,0 +1,14 @@
>> +/* { dg-additional-options "-O2" } */
>> +/* { dg-additional-options "-fdump-tree-parloops1-all" } */
>> +/* { dg-additional-options "-fdump-tree-optimized" } */
>> +
>> +/* Check that loops with '#pragma acc loop' tagged gets properly
>> parallelized.  */
>> +#define ACC_LOOP
>> +#include "kernels-loop-3.c"
>> +
>> +/* Check that only one loop is analyzed, and that it can be
>> parallelized.  */
>> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1
>> "parloops1" } } */
>> +/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
>> +
>> +/* Check that the loop has been split off into a function.  */
>> +/* { dg-final { scan-tree-dump-times "(?n);; Function
>> .*main._omp_fn.0" 1 "optimized" } } */
>> diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-acc-loop.c
>> b/gcc/testsuite/c-c++-common/goacc/kernels-loop-acc-loop.c
>> new file mode 100644
>> index 0000000..47b8459
>> --- /dev/null
>> +++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-acc-loop.c
>> @@ -0,0 +1,14 @@
>> +/* { dg-additional-options "-O2" } */
>> +/* { dg-additional-options "-fdump-tree-parloops1-all" } */
>> +/* { dg-additional-options "-fdump-tree-optimized" } */
>> +
>> +/* Check that loops with '#pragma acc loop' tagged gets properly
>> parallelized.  */
>> +#define ACC_LOOP
>> +#include "kernels-loop.c"
>> +
>> +/* Check that only one loop is analyzed, and that it can be
>> parallelized.  */
>> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1
>> "parloops1" } } */
>> +/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
>> +
>> +/* Check that the loop has been split off into a function.  */
>> +/* { dg-final { scan-tree-dump-times "(?n);; Function
>> .*main._omp_fn.0" 1 "optimized" } } */
>> diff --git
>> a/gcc/testsuite/c-c++-common/goacc/kernels-loop-n-acc-loop.c
>> b/gcc/testsuite/c-c++-common/goacc/kernels-loop-n-acc-loop.c
>> new file mode 100644
>> index 0000000..25b56d7
>> --- /dev/null
>> +++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-n-acc-loop.c
>> @@ -0,0 +1,14 @@
>> +/* { dg-additional-options "-O2" } */
>> +/* { dg-additional-options "-fdump-tree-parloops1-all" } */
>> +/* { dg-additional-options "-fdump-tree-optimized" } */
>> +
>> +/* Check that loops with '#pragma acc loop' tagged gets properly
>> parallelized.  */
>> +#define ACC_LOOP
>> +#include "kernels-loop-n.c"
>> +
>> +/* Check that only one loop is analyzed, and that it can be
>> parallelized.  */
>> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1
>> "parloops1" } } */
>> +/* { dg-final { scan-tree-dump-not "FAILED:" "parloops1" } } */
>> +
>> +/* Check that the loop has been split off into a function.  */
>> +/* { dg-final { scan-tree-dump-times "(?n);; Function
>> .*foo.*._omp_fn.0" 1 "optimized" } } */
>> diff --git a/gcc/testsuite/c-c++-common/goacc/loop-private-1.c
>> b/gcc/testsuite/c-c++-common/goacc/loop-private-1.c
>> index 38a4a7d..9b2f7fa 100644
>> --- a/gcc/testsuite/c-c++-common/goacc/loop-private-1.c
>> +++ b/gcc/testsuite/c-c++-common/goacc/loop-private-1.c
>> @@ -10,4 +10,4 @@ f (int i, int j)
>>         ;
>>   }
>>
>> -/* { dg-final { scan-tree-dump-times "#pragma acc loop
>> collapse\\(2\\) private\\(j\\) private\\(i\\)" 1 "gimple" } } */
>> +/* { dg-final { scan-tree-dump-times "#pragma acc loop private\\(j\\)
>> private\\(i\\)" 1 "gimple" } } */
>> diff --git a/gcc/testsuite/gfortran.dg/goacc/combined-directives.f90
>> b/gcc/testsuite/gfortran.dg/goacc/combined-directives.f90
>> index 6977525..e89ddc9 100644
>> --- a/gcc/testsuite/gfortran.dg/goacc/combined-directives.f90
>> +++ b/gcc/testsuite/gfortran.dg/goacc/combined-directives.f90
>> @@ -144,12 +144,12 @@ subroutine test
>>   !  !$acc end kernels loop
>>   end subroutine test
>>
>> -! { dg-final { scan-tree-dump-times "acc loop private.i. private.j.
>> collapse.2." 2 "gimple" } }
>> -! { dg-final { scan-tree-dump-times "acc loop private.i. gang" 2
>> "gimple" } }
>> -! { dg-final { scan-tree-dump-times "acc loop private.i. private.j.
>> worker" 2 "gimple" } }
>> -! { dg-final { scan-tree-dump-times "acc loop private.i. private.j.
>> vector" 2 "gimple" } }
>> -! { dg-final { scan-tree-dump-times "acc loop private.i. private.j.
>> seq" 2 "gimple" } }
>> -! { dg-final { scan-tree-dump-times "acc loop private.i. private.j.
>> auto" 2 "gimple" } }
>> -! { dg-final { scan-tree-dump-times "acc loop private.i. private.j.
>> tile.2, 3" 2 "gimple" } }
>> -! { dg-final { scan-tree-dump-times "acc loop private.i. independent"
>> 2 "gimple" } }
>> +! { dg-final { scan-tree-dump-times "acc loop private.i. private.j.
>> collapse.2." 1 "gimple" } }
>> +! { dg-final { scan-tree-dump-times "acc loop private.i. gang" 1
>> "gimple" } }
>> +! { dg-final { scan-tree-dump-times "acc loop private.i. private.j.
>> worker" 1 "gimple" } }
>> +! { dg-final { scan-tree-dump-times "acc loop private.i. private.j.
>> vector" 1 "gimple" } }
>> +! { dg-final { scan-tree-dump-times "acc loop private.i. private.j.
>> seq" 1 "gimple" } }
>> +! { dg-final { scan-tree-dump-times "acc loop private.i. private.j.
>> auto" 1 "gimple" } }
>> +! { dg-final { scan-tree-dump-times "acc loop private.i. private.j.
>> tile.2, 3" 1 "gimple" } }
>> +! { dg-final { scan-tree-dump-times "acc loop private.i. independent"
>> 1 "gimple" } }
>>   ! { dg-final { scan-tree-dump-times "private.z" 2 "gimple" } }
>> diff --git a/gcc/testsuite/gfortran.dg/goacc/gang-static.f95
>> b/gcc/testsuite/gfortran.dg/goacc/gang-static.f95
>> index 3481085..c14b7b2 100644
>> --- a/gcc/testsuite/gfortran.dg/goacc/gang-static.f95
>> +++ b/gcc/testsuite/gfortran.dg/goacc/gang-static.f95
>> @@ -78,5 +78,5 @@ end subroutine test
>>   ! { dg-final { scan-tree-dump-times "gang\\(static:2\\)" 1
>> "omplower" } }
>>   ! { dg-final { scan-tree-dump-times "gang\\(static:5\\)" 1
>> "omplower" } }
>>   ! { dg-final { scan-tree-dump-times "gang\\(static:20\\)" 1
>> "omplower" } }
>> -! { dg-final { scan-tree-dump-times "gang\\(num: 5 static:\\\*\\)" 1
>> "omplower" } }
>> -! { dg-final { scan-tree-dump-times "gang\\(num: 30 static:20\\)" 1
>> "omplower" } }
>> +! { dg-final { scan-tree-dump-times "gang\\(num: 5 static:\\\*\\)" 0
>> "omplower" } }
>> +! { dg-final { scan-tree-dump-times "gang\\(num: 30 static:20\\)" 0
>> "omplower" } }
>> diff --git a/gcc/testsuite/gfortran.dg/goacc/reduction-2.f95
>> b/gcc/testsuite/gfortran.dg/goacc/reduction-2.f95
>> index 929fb0e..4c431c8 100644
>> --- a/gcc/testsuite/gfortran.dg/goacc/reduction-2.f95
>> +++ b/gcc/testsuite/gfortran.dg/goacc/reduction-2.f95
>> @@ -11,6 +11,7 @@ subroutine foo ()
>>     !$acc end parallel loop
>>     !$acc kernels loop reduction(+:a)
>>     do k = 2,6
>> +     a = a + 1
>>     enddo
>>     !$acc end kernels loop
>>   end subroutine
>> @@ -18,5 +19,5 @@ end subroutine
>>   ! { dg-final { scan-tree-dump-times "target oacc_parallel
>> firstprivate.a." 1 "gimple" } }
>>   ! { dg-final { scan-tree-dump-times "acc loop private.p.
>> reduction..:a." 1 "gimple" } }
>>   ! { dg-final { scan-tree-dump-times "target oacc_kernels
>> map.force_tofrom:a .len: 4.." 1 "gimple" } }
>> -! { dg-final { scan-tree-dump-times "acc loop private.k.
>> reduction..:a." 1 "gimple" } }
>> +! { dg-final { scan-tree-dump-times "acc loop private.k." 1 "gimple" } }
>>
>>
>
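
For readers skimming the quoted patch: the control-flow shape spelled out in
the lower_omp_for_seq comment can be tried out in plain C. The sketch below
is only an illustration of that shape for a two-deep nest; the function name
count_iterations and the bounds n and m are made up for the example, and
nothing here is GCC code.

```c
/* Illustration only: a two-deep loop nest written with the goto shape
   from the lower_omp_for_seq comment.  count_iterations, n and m are
   hypothetical names, not part of the patch.  */
static int
count_iterations (int n, int m)
{
  int count = 0;
  int i, j;

  i = 0;                 /* OMP_FOR_INIT[0] */
  goto entry0;           /* goto <loop_entry_label> */
body0:                   /* <fall_thru_label> */
  j = 0;                 /* OMP_FOR_INIT[1] (the nested OMP_FOR[i+1]) */
  goto entry1;
body1:
  count++;               /* OMP_FOR_BODY, reached only at the innermost level */
  j = j + 1;             /* OMP_FOR_INCR[1] */
entry1:
  if (j < m)             /* OMP_FOR_COND[1] */
    goto body1;
  else
    goto exit1;
exit1:
  i = i + 1;             /* OMP_FOR_INCR[0] */
entry0:                  /* <loop_entry_label> */
  if (i < n)             /* OMP_FOR_COND[0] */
    goto body0;
  else
    goto exit0;
exit0:                   /* <loop_exit_label> */
  return count;
}
```

The condition is tested before the first iteration at each level, so a nest
with a zero trip count never reaches the body.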


* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-12-02  9:59                         ` Jakub Jelinek
@ 2016-03-14 13:16                           ` Tom de Vries
  2016-03-14 23:18                             ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2016-03-14 13:16 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Richard Biener, gcc-patches

On 02/12/15 10:58, Jakub Jelinek wrote:
> On Fri, Nov 27, 2015 at 01:03:52PM +0100, Tom de Vries wrote:
>> Handle non-declared variables in kernels alias analysis
>>
>> 2015-11-27  Tom de Vries  <tom@codesourcery.com>
>>
>> 	* gimplify.c (gimplify_scan_omp_clauses): Initialize
>> 	OMP_CLAUSE_ORIG_DECL.
>> 	* omp-low.c (install_var_field_1): Handle base_pointers_restrict for
>> 	pointers.
>> 	(map_ptr_clause_points_to_clause_p)
>> 	(nr_map_ptr_clauses_pointing_to_clause): New function.
>> 	(omp_target_base_pointers_restrict_p): Handle GOMP_MAP_POINTER.
>> 	* tree-pretty-print.c (dump_omp_clause): Print OMP_CLAUSE_ORIG_DECL.
>> 	* tree.c (omp_clause_num_ops): Set num_ops for OMP_CLAUSE_MAP to 3.
>> 	* tree.h (OMP_CLAUSE_ORIG_DECL): New macro.
>>
>> 	* c-c++-common/goacc/kernels-alias-10.c: New test.
>> 	* c-c++-common/goacc/kernels-alias-9.c: New test.
>
> I don't like this (mainly the addition of OMP_CLAUSE_ORIG_DECL),
> but it also sounds wrong to me.
> The primary question is how do you handle GOMP_MAP_POINTER
> (which is something we don't use for C/C++ OpenMP anymore,
> and Fortran OpenMP will stop using it in GCC 7 or 6.2?) on the OpenACC
> libgomp side, does it work like GOMP_MAP_ALLOC or GOMP_MAP_FORCE_ALLOC?

When a GOMP_MAP_POINTER mapping is encountered, first we check if it has 
been mapped before:
- if it hasn't been mapped before, we check if the area the pointer
   points to has been mapped, and if not, error out. Else we map the
   pointer to a device pointer, and write the device pointer value
   to the device pointer variable.
- if the pointer has been mapped before, we reuse the mapping and write
   the device pointer value to the device pointer variable.
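
To make the two cases concrete, here is a toy model in C. It is a sketch
under assumptions, not the libgomp implementation: the toy_* names, the
fixed-size table and the fresh_dev_addr parameter are all invented for the
example, and the model also reports an error in the reuse case when the
pointed-to area is not mapped, which the description above does not spell
out.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model only; none of these names exist in libgomp.  */
#define TOY_MAX_MAPS 8

struct toy_mapping
{
  uintptr_t host_start;  /* first host address covered */
  uintptr_t host_end;    /* one past the last host address covered */
  uintptr_t dev_start;   /* device address corresponding to host_start */
};

struct toy_env
{
  struct toy_mapping maps[TOY_MAX_MAPS];
  size_t n_maps;
};

enum toy_status { TOY_OK, TOY_ERR_TARGET_NOT_MAPPED, TOY_ERR_TABLE_FULL };

/* Return the mapping that covers HOST_ADDR, or NULL if there is none.  */
static const struct toy_mapping *
toy_lookup (const struct toy_env *env, uintptr_t host_addr)
{
  for (size_t i = 0; i < env->n_maps; i++)
    if (host_addr >= env->maps[i].host_start
        && host_addr < env->maps[i].host_end)
      return &env->maps[i];
  return NULL;
}

/* Handle a pointer mapping for the pointer variable at PTR_VAR_HOST_ADDR
   whose host value is PTR_HOST_VALUE.  On success, *DEV_PTR_VALUE is the
   value that would be written to the device copy of the pointer, and
   *REUSED says whether an existing mapping of the pointer was reused.  */
static enum toy_status
toy_map_pointer (struct toy_env *env, uintptr_t ptr_var_host_addr,
                 uintptr_t ptr_host_value, uintptr_t fresh_dev_addr,
                 uintptr_t *dev_ptr_value, int *reused)
{
  const struct toy_mapping *ptr_map = toy_lookup (env, ptr_var_host_addr);
  const struct toy_mapping *target = toy_lookup (env, ptr_host_value);

  /* The area the pointer points to must already be mapped.  */
  if (target == NULL)
    return TOY_ERR_TARGET_NOT_MAPPED;

  /* Compute the device pointer value before the table is modified.  */
  uintptr_t translated
    = target->dev_start + (ptr_host_value - target->host_start);

  *reused = ptr_map != NULL;
  if (ptr_map == NULL)
    {
      /* First encounter: create a mapping for the pointer variable.  */
      if (env->n_maps == TOY_MAX_MAPS)
        return TOY_ERR_TABLE_FULL;
      struct toy_mapping m = { ptr_var_host_addr,
                               ptr_var_host_addr + sizeof (uintptr_t),
                               fresh_dev_addr };
      env->maps[env->n_maps++] = m;
    }

  /* In both cases the device pointer value is (re)written.  */
  *dev_ptr_value = translated;
  return TOY_OK;
}
```

In this model the only difference between the two cases is whether a new
table entry is created; the device pointer value is written either way.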

> Similarly GOMP_MAP_TO_PSET.
> If it works like GOMP_MAP_ALLOC (it does
> on the OpenMP side in target.c, so if something is already mapped, no
> further pointer assignment happens), then your change looks wrong.
> If it works like GOMP_MAP_FORCE_ALLOC, then you just should treat
> GOMP_MAP_POINTER on all OpenACC constructs as opcode that allows the
> restrict operation.

I guess it works mostly like GOMP_MAP_ALLOC, but I don't understand the 
relevance of the comparison for the patch. What is interesting for the 
restrict optimization is whether what GOMP_MAP_POINTER points to has 
been mapped with or without the force flag during the same mapping sequence.

> If it should behave differently depending on
> if the corresponding array section has been mapped with GOMP_MAP_FORCE_*
> or without it,

The mapping itself shouldn't behave differently.

> then supposedly you should use a different code for
> those two.

I could add, for instance, an unsigned int aux_flags field to struct
tree_omp_clause, set a new POINTS_TO_FORCE_VAR flag when translating the
acc clause into mapping clauses, and use that flag later on when dealing
with the GOMP_MAP_POINTER clause. Is that an acceptable approach?

[ Instead I could define a new gcc-internal-only 
GOMP_MAP_POINTER_POINTS_TO_FORCE kind, but I'd rather avoid this, given 
that it would be handled the same as GOMP_MAP_POINTER everywhere, except 
for a single point in the source code. ]
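
Roughly, the aux_flags variant would look like the sketch below. This is a
stand-alone toy, not a patch: struct toy_clause stands in for struct
tree_omp_clause, the toy_* kinds stand in for the GOMP_MAP_* kinds, and the
rule that a forced mapping is what makes a clause eligible for restrict is
my reading of the discussion above rather than existing code.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the GOMP_MAP_* kinds.  */
enum toy_map_kind
{
  TOY_MAP_TOFROM,
  TOY_MAP_FORCE_TOFROM,
  TOY_MAP_POINTER
};

/* Hypothetical flag bit, set on a pointer clause whose target array
   section was mapped with a force kind.  */
#define TOY_POINTS_TO_FORCE_VAR (1u << 0)

/* Stand-in for struct tree_omp_clause with the proposed extra field.  */
struct toy_clause
{
  enum toy_map_kind kind;
  unsigned int aux_flags;
};

/* Called where the acc clause is translated into mapping clauses:
   record on PTR_CLAUSE whether SECTION_CLAUSE is a forced mapping.  */
static void
toy_mark_pointer_clause (struct toy_clause *ptr_clause,
                         const struct toy_clause *section_clause)
{
  if (section_clause->kind == TOY_MAP_FORCE_TOFROM)
    ptr_clause->aux_flags |= TOY_POINTS_TO_FORCE_VAR;
}

/* Called later, at the single point that decides on restrict: pointer
   clauses are accepted only if the flag was set.  */
static bool
toy_clause_allows_restrict (const struct toy_clause *clause)
{
  switch (clause->kind)
    {
    case TOY_MAP_FORCE_TOFROM:
      return true;
    case TOY_MAP_POINTER:
      return (clause->aux_flags & TOY_POINTS_TO_FORCE_VAR) != 0;
    default:
      return false;
    }
}
```

The mapping kind itself stays a plain pointer mapping, so everything except
the restrict check keeps treating the clause as before.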

Thanks,
- Tom


* Re: [PATCH, 4/16] Implement -foffload-alias
  2016-03-14 13:16                           ` Tom de Vries
@ 2016-03-14 23:18                             ` Tom de Vries
  0 siblings, 0 replies; 133+ messages in thread
From: Tom de Vries @ 2016-03-14 23:18 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Richard Biener, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 3551 bytes --]

On 14/03/16 14:16, Tom de Vries wrote:
> On 02/12/15 10:58, Jakub Jelinek wrote:
>> On Fri, Nov 27, 2015 at 01:03:52PM +0100, Tom de Vries wrote:
>>> Handle non-declared variables in kernels alias analysis
>>>
>>> 2015-11-27  Tom de Vries  <tom@codesourcery.com>
>>>
>>>     * gimplify.c (gimplify_scan_omp_clauses): Initialize
>>>     OMP_CLAUSE_ORIG_DECL.
>>>     * omp-low.c (install_var_field_1): Handle base_pointers_restrict for
>>>     pointers.
>>>     (map_ptr_clause_points_to_clause_p)
>>>     (nr_map_ptr_clauses_pointing_to_clause): New function.
>>>     (omp_target_base_pointers_restrict_p): Handle GOMP_MAP_POINTER.
>>>     * tree-pretty-print.c (dump_omp_clause): Print OMP_CLAUSE_ORIG_DECL.
>>>     * tree.c (omp_clause_num_ops): Set num_ops for OMP_CLAUSE_MAP to 3.
>>>     * tree.h (OMP_CLAUSE_ORIG_DECL): New macro.
>>>
>>>     * c-c++-common/goacc/kernels-alias-10.c: New test.
>>>     * c-c++-common/goacc/kernels-alias-9.c: New test.
>>
>> I don't like this (mainly the addition of OMP_CLAUSE_ORIG_DECL),
>> but it also sounds wrong to me.
>> The primary question is how do you handle GOMP_MAP_POINTER
>> (which is something we don't use for C/C++ OpenMP anymore,
>> and Fortran OpenMP will stop using it in GCC 7 or 6.2?) on the OpenACC
>> libgomp side, does it work like GOMP_MAP_ALLOC or GOMP_MAP_FORCE_ALLOC?
>
> When a GOMP_MAP_POINTER mapping is encountered, first we check if it has
> been mapped before:
> - if it hasn't been mapped before, we check if the area the pointer
>    points to has been mapped, and if not, error out. Else we map the
>    pointer to a device pointer, and write the device pointer value
>    to the device pointer variable.
> - if the pointer has been mapped before, we reuse the mapping and write
>    the device pointer value to the device pointer variable.
>
>> Similarly GOMP_MAP_TO_PSET.
>> If it works like GOMP_MAP_ALLOC (it does
>> on the OpenMP side in target.c, so if something is already mapped, no
>> further pointer assignment happens), then your change looks wrong.
>> If it works like GOMP_MAP_FORCE_ALLOC, then you just should treat
>> GOMP_MAP_POINTER on all OpenACC constructs as opcode that allows the
>> restrict operation.
>
> I guess it works mostly like GOMP_MAP_ALLOC, but I don't understand the
> relevance of the comparison for the patch. What is interesting for the
> restrict optimization is whether what GOMP_MAP_POINTER points to has
> been mapped with or without the force flag during the same mapping
> sequence.
>
>> If it should behave differently depending on
>> if the corresponding array section has been mapped with GOMP_MAP_FORCE_*
>> or without it,
>
> The mapping itself shouldn't behave differently.
>
>> then supposedly you should use a different code for
>> those two.
>
> I could add, for instance, an unsigned int aux_flags to struct tree_omp_clause,
> set a new POINTS_TO_FORCE_VAR flag when translating the acc clause into
> mapping clauses, and use that flag later on when dealing with the
> GOMP_MAP_POINTER clause. Is that an acceptable approach?
>
> [ Instead I could define a new gcc-internal-only
> GOMP_MAP_POINTER_POINTS_TO_FORCE kind, but I'd rather avoid this, given
> that it would be handled the same as GOMP_MAP_POINTER everywhere, except
> for a single point in the source code. ]

I found the examples of OMP_CLAUSE_MAP_ZERO_BIAS_ARRAY_SECTION and 
OMP_CLAUSE_MAP_MAYBE_ZERO_LENGTH_ARRAY_SECTION, which repurpose 
existing but unused fields, and used something similar in the attached 
patch (untested, C-only for the moment).
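
To illustrate the idea, here is a minimal standalone sketch in C of the 
two steps involved. It is a model only, not GCC code: the enum, struct 
map_clause, make_pointer_clause and base_pointers_restrict_p are 
hypothetical stand-ins for the GOMP_MAP_* kinds, the clause tree node, 
the front-end change in handle_omp_array_sections, and the check in 
omp_target_base_pointers_restrict_p.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the GOMP_MAP_* kinds (not the real values).  */
enum map_kind
{
  MAP_ALLOC, MAP_TO, MAP_FROM, MAP_TOFROM,
  MAP_FORCE_ALLOC, MAP_FORCE_TO, MAP_FORCE_FROM, MAP_FORCE_TOFROM,
  MAP_POINTER
};

/* Hypothetical model of a map clause.  The bool models the flag bit that
   OMP_CLAUSE_MAP_POINTER_TO_FORCED reads in the patch.  */
struct map_clause
{
  enum map_kind kind;
  bool pointer_to_forced;
};

bool
is_force_kind (enum map_kind kind)
{
  return (kind == MAP_FORCE_ALLOC || kind == MAP_FORCE_TO
	  || kind == MAP_FORCE_FROM || kind == MAP_FORCE_TOFROM);
}

/* Front-end step: build the pointer clause that accompanies an array
   section clause, recording whether the section uses a force kind.  */
struct map_clause
make_pointer_clause (const struct map_clause *section)
{
  struct map_clause c2 = { MAP_POINTER, is_force_kind (section->kind) };
  return c2;
}

/* Lowering step: base pointers may be marked restrict only if every
   clause is a force kind, or a pointer clause that points to one.  */
bool
base_pointers_restrict_p (const struct map_clause *clauses, size_t n)
{
  assert (clauses != NULL || n == 0);
  for (size_t i = 0; i < n; i++)
    {
      if (is_force_kind (clauses[i].kind))
	continue;
      if (clauses[i].kind == MAP_POINTER && clauses[i].pointer_to_forced)
	continue;
      return false;
    }
  return true;
}
```

In this model, a clause list built from copyout (a[0:2]) style sections 
(force_from plus its pointer clause) qualifies for restrict, while adding 
a single non-force section and its pointer clause disqualifies the whole 
list.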

Thanks,
- Tom


[-- Attachment #2: 0001-Handle-non-declared-variables-in-kernels-alias-analysis.patch --]
[-- Type: text/x-patch, Size: 6844 bytes --]

2016-03-14  Tom de Vries  <tom@codesourcery.com>

	* omp-low.c (install_var_field): Handle base_pointers_restrict for
	pointers.
	(omp_target_base_pointers_restrict_p): Handle GOMP_MAP_POINTER.
	* tree.h (OMP_CLAUSE_MAP_POINTER_TO_FORCED): Define.

	* c-typeck.c (handle_omp_array_sections): Set
	OMP_CLAUSE_MAP_POINTER_TO_FORCED on GOMP_MAP_POINTER clause.

	* c-c++-common/goacc/kernels-alias-10.c: New test.
	* c-c++-common/goacc/kernels-alias-9.c: New test.

Handle non-declared variables in kernels alias analysis

---
 gcc/c/c-typeck.c                                   | 15 ++++++-
 gcc/omp-low.c                                      | 48 ++++++++++++++++++++++
 .../c-c++-common/goacc/kernels-alias-10.c          | 29 +++++++++++++
 gcc/testsuite/c-c++-common/goacc/kernels-alias-9.c | 29 +++++++++++++
 gcc/tree.h                                         |  3 ++
 5 files changed, 123 insertions(+), 1 deletion(-)

diff --git a/gcc/c/c-typeck.c b/gcc/c/c-typeck.c
index 6aa0f03..a05831d 100644
--- a/gcc/c/c-typeck.c
+++ b/gcc/c/c-typeck.c
@@ -12446,7 +12446,20 @@ handle_omp_array_sections (tree c, bool is_omp)
 	  }
       tree c2 = build_omp_clause (OMP_CLAUSE_LOCATION (c), OMP_CLAUSE_MAP);
       if (!is_omp)
-	OMP_CLAUSE_SET_MAP_KIND (c2, GOMP_MAP_POINTER);
+	{
+	  OMP_CLAUSE_SET_MAP_KIND (c2, GOMP_MAP_POINTER);
+	  switch (OMP_CLAUSE_MAP_KIND (c))
+	    {
+	    case GOMP_MAP_FORCE_ALLOC:
+	    case GOMP_MAP_FORCE_TO:
+	    case GOMP_MAP_FORCE_FROM:
+	    case GOMP_MAP_FORCE_TOFROM:
+	      OMP_CLAUSE_MAP_POINTER_TO_FORCED (c2) = 1;
+	      break;
+	    default:
+	      break;
+	    }
+	}
       else if (TREE_CODE (t) == COMPONENT_REF)
 	OMP_CLAUSE_SET_MAP_KIND (c2, GOMP_MAP_ALWAYS_POINTER);
       else
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 82dec9d..f9d953d 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -1429,6 +1429,9 @@ install_var_field (tree var, bool by_ref, int mask, omp_context *ctx,
     }
   else if (by_ref)
     {
+      if (base_pointers_restrict
+	  && POINTER_TYPE_P (type))
+	type = build_qualified_type (type, TYPE_QUAL_RESTRICT);
       type = build_pointer_type (type);
       if (base_pointers_restrict)
 	type = build_qualified_type (type, TYPE_QUAL_RESTRICT);
@@ -3132,6 +3135,47 @@ omp_target_base_pointers_restrict_p (tree clauses)
      Because both mappings have the force prefix, we know that they will be
      allocated when calling the corresponding offloaded function, which means we
      can mark the base pointers for a and b in the offloaded function as
+     restrict.
+
+     II.  GOMP_MAP_POINTER example:
+
+       void foo (unsigned int *a, unsigned int *b)
+       {
+	 #pragma acc kernels copyout (a[0:2]) copyout (b[0:2])
+	 {
+	   a[0] = 0;
+	   b[0] = 1;
+	 }
+       }
+
+     After gimplification, we have:
+
+     foo (unsigned int * a, unsigned int * b)
+     {
+       unsigned int * b.0;
+       unsigned int * a.1;
+
+       b.0 = b;
+       a.1 = a;
+       #pragma omp target oacc_kernels \
+	 map(force_from:*a.1 (*a) [len: 8]) \
+	 map(alloc:a [pointer assign, bias: 0]) \
+	 map(force_from:*b.0 (*b) [len: 8]) \
+	 map(alloc:b [pointer assign, bias: 0])
+       {
+	 unsigned int * a.2;
+	 unsigned int * b.3;
+
+	 a.2 = a;
+	 *a.2 = 0;
+	 b.3 = b;
+	 *b.3 = 1;
+       }
+     }
+
+     By testing for OMP_CLAUSE_MAP_POINTER_TO_FORCED, we can know for both
+     pointer assign mappings that they point to a force-prefixed mapping, so
+     we can mark the base pointers for a and b in the offloaded function as
      restrict.  */
 
   tree c;
@@ -3147,6 +3191,10 @@ omp_target_base_pointers_restrict_p (tree clauses)
 	case GOMP_MAP_FORCE_FROM:
 	case GOMP_MAP_FORCE_TOFROM:
 	  break;
+	case GOMP_MAP_POINTER:
+	  if (!OMP_CLAUSE_MAP_POINTER_TO_FORCED (c))
+	    return false;
+	  break;
 	default:
 	  return false;
 	}
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-alias-10.c b/gcc/testsuite/c-c++-common/goacc/kernels-alias-10.c
new file mode 100644
index 0000000..ce5bbe8
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-alias-10.c
@@ -0,0 +1,29 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-ealias-all" } */
+
+#define N 2
+
+void
+foo (void)
+{
+  unsigned int a[N];
+  unsigned int b[N];
+  unsigned int c[N];
+  unsigned int d[N];
+
+#pragma acc kernels copyin (a[0:N]) create (b[0:N]) copyout (c[0:N]) copy (d[0:N])
+  {
+    a[0] = 0;
+    b[0] = 0;
+    c[0] = 0;
+    d[0] = 0;
+  }
+}
+
+/* { dg-final { scan-tree-dump-times "clique 1 base 1" 4 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 2" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 3" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 4" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 5" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "(?n)clique .* base .*" 8 "ealias" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-alias-9.c b/gcc/testsuite/c-c++-common/goacc/kernels-alias-9.c
new file mode 100644
index 0000000..7229fd4
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-alias-9.c
@@ -0,0 +1,29 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-ealias-all" } */
+
+#define N 2
+
+void
+foo (unsigned int *a, unsigned int *b, unsigned int *c, unsigned int *d)
+{
+
+#pragma acc kernels copyin (a[0:N]) create (b[0:N]) copyout (c[0:N]) copy (d[0:N])
+  {
+    a[0] = 0;
+    b[0] = 0;
+    c[0] = 0;
+    d[0] = 0;
+  }
+}
+
+/* { dg-final { scan-tree-dump-times "clique 1 base 1" 4 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 2" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 3" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 4" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 5" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 6" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 7" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 8" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 9" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "(?n)clique .* base .*" 12 "ealias" } } */
+
diff --git a/gcc/tree.h b/gcc/tree.h
index 544a6a1..bc48ea8 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -1533,6 +1533,9 @@ extern void protected_set_expr_location (tree, location_t);
 #define OMP_CLAUSE_MAP_MAYBE_ZERO_LENGTH_ARRAY_SECTION(NODE) \
   TREE_PROTECTED (OMP_CLAUSE_SUBCODE_CHECK (NODE, OMP_CLAUSE_MAP))
 
+#define OMP_CLAUSE_MAP_POINTER_TO_FORCED(NODE) \
+  TREE_PRIVATE (OMP_CLAUSE_SUBCODE_CHECK (NODE, OMP_CLAUSE_MAP))
+
 #define OMP_CLAUSE_PROC_BIND_KIND(NODE) \
   (OMP_CLAUSE_SUBCODE_CHECK (NODE, OMP_CLAUSE_PROC_BIND)->omp_clause.subcode.proc_bind_kind)
 

* Re: [PATCH, 16/16] Add libgomp.oacc-fortran/kernels-*.f95
  2016-03-09  9:19   ` Tom de Vries
@ 2016-03-16 13:12     ` Thomas Schwinge
  0 siblings, 0 replies; 133+ messages in thread
From: Thomas Schwinge @ 2016-03-16 13:12 UTC (permalink / raw)
  To: Tom de Vries, gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 5851 bytes --]

Hi!

On Wed, 9 Mar 2016 10:19:09 +0100, Tom de Vries <Tom_deVries@mentor.com> wrote:
> On 09/11/15 21:12, Tom de Vries wrote:
> > This patch adds Fortran oacc kernels execution tests.
> 
> Retested on current trunk.
> 
> Committed, minus the kernels-parallel-loop-data-enter-exit.f95 test.

As obvious, committed in r234257:

commit baeaf028bfed958e14abc8b9f3ca10949bacaf97
Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Wed Mar 16 13:10:20 2016 +0000

    Nowadays, we use plain -fopenacc to enable OpenACC kernels processing
    
    	libgomp/
    	* testsuite/libgomp.oacc-fortran/kernels-loop-2.f95: Adjust to
    	-ftree-parallelize-loops/-fopenacc changes.
    	* testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/kernels-loop-data.f95: Likewise.
    	* testsuite/libgomp.oacc-fortran/kernels-loop.f95: Likewise.
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@234257 138bc75d-0d04-0410-961f-82ee72b054a4
---
 libgomp/ChangeLog                                         | 15 +++++++++++++++
 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-2.f95 |  1 -
 .../libgomp.oacc-fortran/kernels-loop-data-2.f95          |  1 -
 .../kernels-loop-data-enter-exit-2.f95                    |  1 -
 .../libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95 |  1 -
 .../libgomp.oacc-fortran/kernels-loop-data-update.f95     |  1 -
 .../testsuite/libgomp.oacc-fortran/kernels-loop-data.f95  |  1 -
 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop.f95   |  1 -
 8 files changed, 15 insertions(+), 7 deletions(-)

diff --git libgomp/ChangeLog libgomp/ChangeLog
index 5a91504..fca65e6 100644
--- libgomp/ChangeLog
+++ libgomp/ChangeLog
@@ -1,3 +1,18 @@
+2016-03-16  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* testsuite/libgomp.oacc-fortran/kernels-loop-2.f95: Adjust to
+	-ftree-parallelize-loops/-fopenacc changes.
+	* testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-loop-data.f95: Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-loop.f95: Likewise.
+
 2016-03-13  Thomas Schwinge  <thomas@codesourcery.com>
 
 	* testsuite/lib/libgomp.exp (libgomp_init): Potentially append to
diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-2.f95 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-2.f95
index 1fb40ee..163e8d5 100644
--- libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-2.f95
+++ libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-2.f95
@@ -1,5 +1,4 @@
 ! { dg-do run }
-! { dg-options "-ftree-parallelize-loops=32" }
 
 program main
   implicit none
diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95
index 7b52253..4c73606 100644
--- libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95
+++ libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95
@@ -1,5 +1,4 @@
 ! { dg-do run }
-! { dg-options "-ftree-parallelize-loops=32" }
 
 program main
   implicit none
diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95
index af98efa..da11aaf 100644
--- libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95
+++ libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95
@@ -1,5 +1,4 @@
 ! { dg-do run }
-! { dg-options "-ftree-parallelize-loops=32" }
 
 program main
   implicit none
diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95
index bb6f8dc..f4b4eb3 100644
--- libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95
+++ libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95
@@ -1,5 +1,4 @@
 ! { dg-do run }
-! { dg-options "-ftree-parallelize-loops=32" }
 
 program main
   implicit none
diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95
index cab1f2c..d2083e2 100644
--- libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95
+++ libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95
@@ -1,5 +1,4 @@
 ! { dg-do run }
-! { dg-options "-ftree-parallelize-loops=32" }
 
 program main
   implicit none
diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data.f95 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data.f95
index f26671d..a908f54 100644
--- libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data.f95
+++ libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data.f95
@@ -1,5 +1,4 @@
 ! { dg-do run }
-! { dg-options "-ftree-parallelize-loops=32" }
 
 program main
   implicit none
diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-loop.f95 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop.f95
index b02dd57..6fb5ba3 100644
--- libgomp/testsuite/libgomp.oacc-fortran/kernels-loop.f95
+++ libgomp/testsuite/libgomp.oacc-fortran/kernels-loop.f95
@@ -1,5 +1,4 @@
 ! { dg-do run }
-! { dg-options "-ftree-parallelize-loops=32" }
 
 program main
   implicit none


Regards
 Thomas

* Scan for parallelization of the oacc kernels test-cases in gfortran.dg/goacc (was: [PATCH, 15/16] Add libgomp.oacc-c-c++-common/kernels-*.c)
  2016-03-09  9:18   ` [PATCH, 15/16] Add libgomp.oacc-c-c++-common/kernels-*.c Tom de Vries
@ 2016-03-18 12:46     ` Thomas Schwinge
  2016-04-05  9:13       ` Scan for parallelization of the oacc kernels test-cases in gfortran.dg/goacc Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Thomas Schwinge @ 2016-03-18 12:46 UTC (permalink / raw)
  To: Tom de Vries, gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 7250 bytes --]

Hi!

On Wed, 9 Mar 2016 10:17:28 +0100, Tom de Vries <Tom_deVries@mentor.com> wrote:
> [Should have cited
> <http://news.gmane.org/find-root.php?message_id=%3C5640FD6A.3080807%40mentor.com%3E>
> instead of the C/C++ tests]

> Retested on current trunk.
> 
> Committed, minus the kernels-parallel-loop-data-enter-exit.f95 test.

Is there a reason why you omitted the following tree scanning tests (as
done for C/C++, and also present for Fortran on gomp-4_0-branch)?  (Note
that I had to XFAIL gfortran.dg/goacc/kernels-loop-n.f95.)  OK to commit?
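
For readers unfamiliar with the directive: scan-tree-dump-times passes 
only if the pattern matches exactly the given number of times in the 
named dump file. A rough sketch of that counting in C follows. It is an 
illustration under simplifying assumptions, not the DejaGnu 
implementation: a literal substring stands in for the regular 
expression, and count_occurrences and scan_times_passes are hypothetical 
helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Count non-overlapping occurrences of NEEDLE in TEXT.  */
size_t
count_occurrences (const char *text, const char *needle)
{
  assert (text != NULL && needle != NULL && needle[0] != '\0');
  size_t count = 0;
  size_t len = strlen (needle);
  for (const char *p = strstr (text, needle); p != NULL;
       p = strstr (p + len, needle))
    count++;
  return count;
}

/* Model of the pass/fail decision: the check succeeds only when the
   number of matches equals EXPECTED exactly, so both too few and too
   many matches count as failures.  */
bool
scan_times_passes (const char *dump, const char *needle, size_t expected)
{
  return count_occurrences (dump, needle) == expected;
}
```

So a test expecting three parallelized kernels would call 
scan_times_passes (dump, "oacc function (0,", 3) on the parloops1 dump 
text, and a dump with only two matches would fail.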

commit f0294eeb30ef285c3930b975ccbc1b6d7052cc03
Author: Thomas Schwinge <thomas@codesourcery.com>
Date:   Fri Mar 18 12:52:37 2016 +0100

    Scan for parallelization of the oacc kernels test-cases in gfortran.dg/goacc
    
    	gcc/testsuite/
    	* gfortran.dg/goacc/kernels-loop-2.f95: Scan for parallelization.
    	* gfortran.dg/goacc/kernels-loop-data-2.f95: Likewise.
    	* gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95: Likewise.
    	* gfortran.dg/goacc/kernels-loop-data-enter-exit.f95: Likewise.
    	* gfortran.dg/goacc/kernels-loop-data-update.f95: Likewise.
    	* gfortran.dg/goacc/kernels-loop-data.f95: Likewise.
    	* gfortran.dg/goacc/kernels-loop.f95: Likewise.
    	* gfortran.dg/goacc/kernels-loop-n.f95: Likewise, XFAILed.
---
 gcc/testsuite/gfortran.dg/goacc/kernels-loop-2.f95                 | 2 ++
 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-2.f95            | 1 +
 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95 | 2 ++
 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit.f95   | 2 ++
 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-update.f95       | 2 ++
 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data.f95              | 2 ++
 gcc/testsuite/gfortran.dg/goacc/kernels-loop-n.f95                 | 7 +++++++
 gcc/testsuite/gfortran.dg/goacc/kernels-loop.f95                   | 2 ++
 8 files changed, 20 insertions(+)

diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop-2.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop-2.f95
index 5cc2e8b..865f7a6 100644
--- gcc/testsuite/gfortran.dg/goacc/kernels-loop-2.f95
+++ gcc/testsuite/gfortran.dg/goacc/kernels-loop-2.f95
@@ -40,3 +40,5 @@ end program main
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.2 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 3 "parloops1" } }
diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-2.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-2.f95
index d1bfc70..c9f3a62 100644
--- gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-2.f95
+++ gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-2.f95
@@ -47,3 +47,4 @@ end program main
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.2 " 1 "optimized" } }
 
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 3 "parloops1" } }
diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95
index feac7b2..3361607 100644
--- gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95
+++ gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95
@@ -46,3 +46,5 @@ end program main
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.2 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 3 "parloops1" } }
diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit.f95
index 632983f..5ba56fb 100644
--- gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit.f95
+++ gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit.f95
@@ -44,3 +44,5 @@ end program main
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.2 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 3 "parloops1" } }
diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-update.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-update.f95
index 41b0d96..a622a96 100644
--- gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-update.f95
+++ gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-update.f95
@@ -43,3 +43,5 @@ end program main
 ! Check that the loop has been split off into a function.
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 2 "parloops1" } }
diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop-data.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data.f95
index 3de2057..4ec2ac3 100644
--- gcc/testsuite/gfortran.dg/goacc/kernels-loop-data.f95
+++ gcc/testsuite/gfortran.dg/goacc/kernels-loop-data.f95
@@ -44,3 +44,5 @@ end program main
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.2 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 3 "parloops1" } }
diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop-n.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop-n.f95
index 21e2e86..90439ca 100644
--- gcc/testsuite/gfortran.dg/goacc/kernels-loop-n.f95
+++ gcc/testsuite/gfortran.dg/goacc/kernels-loop-n.f95
@@ -36,3 +36,10 @@ end module test
 
 ! Check that the loop has been split off into a function.
 ! { dg-final { scan-tree-dump-times "(?n);; Function __test_MOD_foo._omp_fn.0 " 1 "optimized" } }
+
+! TODO, *.parloops1:
+!     SUCCESS: may be parallelized
+!     Stmt *_9 = 0;
+!     conflicts with entry/exit stmt: _7 = *_6;
+!     entry/exit not ok: FAILED
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" { xfail *-*-* } } }
diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop.f95
index f7e14b4..ae2cac6 100644
--- gcc/testsuite/gfortran.dg/goacc/kernels-loop.f95
+++ gcc/testsuite/gfortran.dg/goacc/kernels-loop.f95
@@ -34,3 +34,5 @@ end program main
 
 ! Check that the loop has been split off into a function.
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } }


Regards
 Thomas

* Re: Scan for parallelization of the oacc kernels test-cases in gfortran.dg/goacc
  2016-03-18 12:46     ` Scan for parallelization of the oacc kernels test-cases in gfortran.dg/goacc (was: [PATCH, 15/16] Add libgomp.oacc-c-c++-common/kernels-*.c) Thomas Schwinge
@ 2016-04-05  9:13       ` Tom de Vries
  2016-04-07 15:26         ` Thomas Schwinge
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2016-04-05  9:13 UTC (permalink / raw)
  To: Thomas Schwinge, gcc-patches; +Cc: Jakub Jelinek, Richard Biener

On 18/03/16 13:37, Thomas Schwinge wrote:
> Hi!
>
> On Wed, 9 Mar 2016 10:17:28 +0100, Tom de Vries <Tom_deVries@mentor.com> wrote:
>> [Should have cited
>> <http://news.gmane.org/find-root.php?message_id=%3C5640FD6A.3080807%40mentor.com%3E>
>> instead of the C/C++ tests]
>
>> Retested on current trunk.
>>
>> Committed, minus the kernels-parallel-loop-data-enter-exit.f95 test.
>
> Is there a reason why you omitted the following tree scanning tests (as
> done for C/C++, and also present for Fortran on gomp-4_0-branch)?

I think that was a question of trying to avoid interaction between:
- the tests I was committing and
- removing the dependency of openacc kernels on
   -ftree-parallelize-loops=<n>
which were sort of happening in parallel.

> (Note
> that I had to XFAIL gfortran.dg/goacc/kernels-loop-n.f95.)

Right. I remember looking into this before, and classified it as the 
openacc version of PR68787 - fipa-pta to interpret restrict.

Now that we'll have an xfail for it, I've filed it as PR70545 - 
'[openacc] gfortran.dg/goacc/kernels-loop-n.f95 not parallelized'.

>  OK to commit?
>

Yes please.

Thanks,
- Tom

> commit f0294eeb30ef285c3930b975ccbc1b6d7052cc03
> Author: Thomas Schwinge <thomas@codesourcery.com>
> Date:   Fri Mar 18 12:52:37 2016 +0100
>
>      Scan for parallelization of the oacc kernels test-cases in gfortran.dg/goacc
>
>      	gcc/testsuite/
>      	* gfortran.dg/goacc/kernels-loop-2.f95: Scan for parallelization.
>      	* gfortran.dg/goacc/kernels-loop-data-2.f95: Likewise.
>      	* gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95: Likewise.
>      	* gfortran.dg/goacc/kernels-loop-data-enter-exit.f95: Likewise.
>      	* gfortran.dg/goacc/kernels-loop-data-update.f95: Likewise.
>      	* gfortran.dg/goacc/kernels-loop-data.f95: Likewise.
>      	* gfortran.dg/goacc/kernels-loop.f95: Likewise.
>      	* gfortran.dg/goacc/kernels-loop-n.f95: Likewise, XFAILed.
> ---
>   gcc/testsuite/gfortran.dg/goacc/kernels-loop-2.f95                 | 2 ++
>   gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-2.f95            | 1 +
>   gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95 | 2 ++
>   gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit.f95   | 2 ++
>   gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-update.f95       | 2 ++
>   gcc/testsuite/gfortran.dg/goacc/kernels-loop-data.f95              | 2 ++
>   gcc/testsuite/gfortran.dg/goacc/kernels-loop-n.f95                 | 7 +++++++
>   gcc/testsuite/gfortran.dg/goacc/kernels-loop.f95                   | 2 ++
>   8 files changed, 20 insertions(+)
>
> diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop-2.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop-2.f95
> index 5cc2e8b..865f7a6 100644
> --- gcc/testsuite/gfortran.dg/goacc/kernels-loop-2.f95
> +++ gcc/testsuite/gfortran.dg/goacc/kernels-loop-2.f95
> @@ -40,3 +40,5 @@ end program main
>   ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
>   ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
>   ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.2 " 1 "optimized" } }
> +
> +! { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 3 "parloops1" } }
> diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-2.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-2.f95
> index d1bfc70..c9f3a62 100644
> --- gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-2.f95
> +++ gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-2.f95
> @@ -47,3 +47,4 @@ end program main
>   ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
>   ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.2 " 1 "optimized" } }
>
> +! { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 3 "parloops1" } }
> diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95
> index feac7b2..3361607 100644
> --- gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95
> +++ gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95
> @@ -46,3 +46,5 @@ end program main
>   ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
>   ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
>   ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.2 " 1 "optimized" } }
> +
> +! { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 3 "parloops1" } }
> diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit.f95
> index 632983f..5ba56fb 100644
> --- gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit.f95
> +++ gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit.f95
> @@ -44,3 +44,5 @@ end program main
>   ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
>   ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
>   ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.2 " 1 "optimized" } }
> +
> +! { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 3 "parloops1" } }
> diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-update.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-update.f95
> index 41b0d96..a622a96 100644
> --- gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-update.f95
> +++ gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-update.f95
> @@ -43,3 +43,5 @@ end program main
>   ! Check that the loop has been split off into a function.
>   ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
>   ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
> +
> +! { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 2 "parloops1" } }
> diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop-data.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data.f95
> index 3de2057..4ec2ac3 100644
> --- gcc/testsuite/gfortran.dg/goacc/kernels-loop-data.f95
> +++ gcc/testsuite/gfortran.dg/goacc/kernels-loop-data.f95
> @@ -44,3 +44,5 @@ end program main
>   ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
>   ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
>   ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.2 " 1 "optimized" } }
> +
> +! { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 3 "parloops1" } }
> diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop-n.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop-n.f95
> index 21e2e86..90439ca 100644
> --- gcc/testsuite/gfortran.dg/goacc/kernels-loop-n.f95
> +++ gcc/testsuite/gfortran.dg/goacc/kernels-loop-n.f95
> @@ -36,3 +36,10 @@ end module test
>
>   ! Check that the loop has been split off into a function.
>   ! { dg-final { scan-tree-dump-times "(?n);; Function __test_MOD_foo._omp_fn.0 " 1 "optimized" } }
> +
> +! TODO, *.parloops1:
> +!     SUCCESS: may be parallelized
> +!     Stmt *_9 = 0;
> +!     conflicts with entry/exit stmt: _7 = *_6;
> +!     entry/exit not ok: FAILED
> +! { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" { xfail *-*-* } } }
> diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop.f95
> index f7e14b4..ae2cac6 100644
> --- gcc/testsuite/gfortran.dg/goacc/kernels-loop.f95
> +++ gcc/testsuite/gfortran.dg/goacc/kernels-loop.f95
> @@ -34,3 +34,5 @@ end program main
>
>   ! Check that the loop has been split off into a function.
>   ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
> +
> +! { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } }
>
>
> Regards
>   Thomas
>


* Re: Scan for parallelization of the oacc kernels test-cases in gfortran.dg/goacc
  2016-04-05  9:13       ` Scan for parallelization of the oacc kernels test-cases in gfortran.dg/goacc Tom de Vries
@ 2016-04-07 15:26         ` Thomas Schwinge
  0 siblings, 0 replies; 133+ messages in thread
From: Thomas Schwinge @ 2016-04-07 15:26 UTC (permalink / raw)
  To: Tom de Vries, gcc-patches; +Cc: Jakub Jelinek, Richard Biener

Hi!

On Tue, 5 Apr 2016 11:12:44 +0200, Tom de Vries <Tom_deVries@mentor.com> wrote:
> On 18/03/16 13:37, Thomas Schwinge wrote:
> > On Wed, 9 Mar 2016 10:17:28 +0100, Tom de Vries <Tom_deVries@mentor.com> wrote:
> >> [Should have cited
> >> <http://news.gmane.org/find-root.php?message_id=%3C5640FD6A.3080807%40mentor.com%3E>
> >> instead of the C/C++ tests]
> >
> >> Retested on current trunk.
> >>
> >> Committed, minus the kernels-parallel-loop-data-enter-exit.f95 test.
> >
> > [tree scanning tests (as
> > done for C/C++, and also present for Fortran on gomp-4_0-branch)]

> > (Note
> > that I had to XFAIL gfortran.dg/goacc/kernels-loop-n.f95.)
> 
> Right. I remember looking into this before, and classified it as the 
> openacc version of PR68787 - fipa-pta to interpret restrict.
> 
> Now that we'll have an xfail for it, I've filed it as PR70545 - 
> '[openacc] gfortran.dg/goacc/kernels-loop-n.f95 not parallelized'.

Would it make sense to mark PR70545 as "Depends on: PR68787", and to add
"Keywords: openacc"?

> >  OK to commit?
> 
> Yes please.

With the XFAIL noted in the respective test case, committed in r234809:

commit 1b61585a37935375c252a27648089c37018f459e
Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Thu Apr 7 15:21:37 2016 +0000

    Scan for parallelization of the oacc kernels test-cases in gfortran.dg/goacc
    
    	gcc/testsuite/
    	* gfortran.dg/goacc/kernels-loop-2.f95: Scan for parallelization.
    	* gfortran.dg/goacc/kernels-loop-data-2.f95: Likewise.
    	* gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95: Likewise.
    	* gfortran.dg/goacc/kernels-loop-data-enter-exit.f95: Likewise.
    	* gfortran.dg/goacc/kernels-loop-data-update.f95: Likewise.
    	* gfortran.dg/goacc/kernels-loop-data.f95: Likewise.
    	* gfortran.dg/goacc/kernels-loop.f95: Likewise.
    	* gfortran.dg/goacc/kernels-loop-n.f95: Likewise, XFAILed.
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@234809 138bc75d-0d04-0410-961f-82ee72b054a4
---
 gcc/testsuite/ChangeLog                                      | 12 ++++++++++++
 gcc/testsuite/gfortran.dg/goacc/kernels-loop-2.f95           |  2 ++
 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-2.f95      |  1 +
 .../gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95     |  2 ++
 .../gfortran.dg/goacc/kernels-loop-data-enter-exit.f95       |  2 ++
 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-update.f95 |  2 ++
 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data.f95        |  2 ++
 gcc/testsuite/gfortran.dg/goacc/kernels-loop-n.f95           |  3 +++
 gcc/testsuite/gfortran.dg/goacc/kernels-loop.f95             |  2 ++
 9 files changed, 28 insertions(+)

diff --git gcc/testsuite/ChangeLog gcc/testsuite/ChangeLog
index d3c74ed..7688a83 100644
--- gcc/testsuite/ChangeLog
+++ gcc/testsuite/ChangeLog
@@ -1,3 +1,15 @@
+2016-04-07  Thomas Schwinge  <thomas@codesourcery.com>
+	    Tom de Vries  <tom@codesourcery.com>
+
+	* gfortran.dg/goacc/kernels-loop-2.f95: Scan for parallelization.
+	* gfortran.dg/goacc/kernels-loop-data-2.f95: Likewise.
+	* gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95: Likewise.
+	* gfortran.dg/goacc/kernels-loop-data-enter-exit.f95: Likewise.
+	* gfortran.dg/goacc/kernels-loop-data-update.f95: Likewise.
+	* gfortran.dg/goacc/kernels-loop-data.f95: Likewise.
+	* gfortran.dg/goacc/kernels-loop.f95: Likewise.
+	* gfortran.dg/goacc/kernels-loop-n.f95: Likewise, XFAILed.
+
 2016-04-06  Patrick Palka  <ppalka@gcc.gnu.org>
 
 	PR c/70436
diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop-2.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop-2.f95
index 5cc2e8b..865f7a6 100644
--- gcc/testsuite/gfortran.dg/goacc/kernels-loop-2.f95
+++ gcc/testsuite/gfortran.dg/goacc/kernels-loop-2.f95
@@ -40,3 +40,5 @@ end program main
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.2 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 3 "parloops1" } }
diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-2.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-2.f95
index d1bfc70..c9f3a62 100644
--- gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-2.f95
+++ gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-2.f95
@@ -47,3 +47,4 @@ end program main
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.2 " 1 "optimized" } }
 
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 3 "parloops1" } }
diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95
index feac7b2..3361607 100644
--- gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95
+++ gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95
@@ -46,3 +46,5 @@ end program main
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.2 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 3 "parloops1" } }
diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit.f95
index 632983f..5ba56fb 100644
--- gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit.f95
+++ gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit.f95
@@ -44,3 +44,5 @@ end program main
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.2 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 3 "parloops1" } }
diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-update.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-update.f95
index 41b0d96..a622a96 100644
--- gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-update.f95
+++ gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-update.f95
@@ -43,3 +43,5 @@ end program main
 ! Check that the loop has been split off into a function.
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 2 "parloops1" } }
diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop-data.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data.f95
index 3de2057..4ec2ac3 100644
--- gcc/testsuite/gfortran.dg/goacc/kernels-loop-data.f95
+++ gcc/testsuite/gfortran.dg/goacc/kernels-loop-data.f95
@@ -44,3 +44,5 @@ end program main
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.2 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 3 "parloops1" } }
diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop-n.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop-n.f95
index 21e2e86..409fe6f 100644
--- gcc/testsuite/gfortran.dg/goacc/kernels-loop-n.f95
+++ gcc/testsuite/gfortran.dg/goacc/kernels-loop-n.f95
@@ -36,3 +36,6 @@ end module test
 
 ! Check that the loop has been split off into a function.
 ! { dg-final { scan-tree-dump-times "(?n);; Function __test_MOD_foo._omp_fn.0 " 1 "optimized" } }
+
+! TODO, PR70545.
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" { xfail *-*-* } } }
diff --git gcc/testsuite/gfortran.dg/goacc/kernels-loop.f95 gcc/testsuite/gfortran.dg/goacc/kernels-loop.f95
index f7e14b4..ae2cac6 100644
--- gcc/testsuite/gfortran.dg/goacc/kernels-loop.f95
+++ gcc/testsuite/gfortran.dg/goacc/kernels-loop.f95
@@ -34,3 +34,5 @@ end program main
 
 ! Check that the loop has been split off into a function.
 ! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(0," 1 "parloops1" } }


Regards
 Thomas
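
For readers unfamiliar with the directives added above: each
scan-tree-dump-times line passes only if the pattern occurs exactly the given
number of times in the named dump file, and the xfail marker turns an expected
mismatch into XFAIL rather than FAIL. The following is a rough Python sketch of
that counting logic, not the DejaGnu implementation. The helper name
scan_dump_times and the sample dump text are made up for illustration, and
re.MULTILINE only approximates the Tcl "(?n)" newline-sensitive flag.

```python
import re


def scan_dump_times(dump_text, pattern, expected, xfail=False):
    """Hypothetical helper: approximate the outcome of scan-tree-dump-times.

    Counts non-overlapping matches of `pattern` in `dump_text` and compares
    the count with `expected`. This is an illustration only, not DejaGnu code.
    """
    # re.MULTILINE stands in for Tcl's "(?n)" flag (an approximation).
    count = len(re.findall(pattern, dump_text, flags=re.MULTILINE))
    matched = count == expected
    if xfail:
        # An expected failure that unexpectedly succeeds is reported as XPASS.
        return "XPASS" if matched else "XFAIL"
    return "PASS" if matched else "FAIL"


# After Tcl unescaping, "oacc function \\(0," is the regex below.
PATTERN = r"oacc function \(0,"

# Made-up dump excerpt (an assumption, not real parloops1 output).
sample_dump = (
    ";; Function MAIN__._omp_fn.0\n"
    "__attribute__((oacc function (0, , ), omp target entrypoint))\n"
)

print(scan_dump_times(sample_dump, PATTERN, 1))
print(scan_dump_times(";; Function foo\n", PATTERN, 1, xfail=True))
```

With this reading, the kernels-loop-n.f95 directive reports XFAIL for as long
as the loop is not parallelized, and would report XPASS once it is.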


Thread overview: 133+ messages
2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
2015-11-09 15:44 ` [PATCH, 1/16] Insert new exit block only when needed in transform_to_exit_first_loop_alt Tom de Vries
2015-11-11 10:50   ` Richard Biener
2015-11-09 15:45 ` [PATCH, 2/16] Make create_parallel_loop return void Tom de Vries
2015-11-11 10:50   ` Richard Biener
2015-11-09 15:51 ` [PATCH, 3/16] Ignore reduction clause on kernels directive Tom de Vries
2015-11-24 12:25   ` [PING][PATCH, " Tom de Vries
2016-01-18 14:24     ` [PING^2][PATCH, " Tom de Vries
2016-01-18 14:26       ` Jakub Jelinek
2015-11-09 16:10 ` [PATCH, 4/16] Implement -foffload-alias Tom de Vries
2015-11-11 10:53   ` Richard Biener
2015-11-11 11:01     ` Jakub Jelinek
2015-11-12 16:04       ` Tom de Vries
2015-11-13  8:46         ` Richard Biener
2015-11-13 11:03           ` Tom de Vries
2015-11-13 11:30             ` Richard Biener
2015-11-13 11:39               ` Jakub Jelinek
2015-11-21 12:24                 ` Tom de Vries
2015-11-23 11:46                   ` Richard Biener
2015-11-27 11:44                     ` Tom de Vries
2015-11-27 12:14                       ` Tom de Vries
2015-12-02  9:59                         ` Jakub Jelinek
2016-03-14 13:16                           ` Tom de Vries
2016-03-14 23:18                             ` Tom de Vries
2015-12-02  9:46                       ` Jakub Jelinek
2015-12-02 13:11                         ` Tom de Vries
2015-12-11 12:45                 ` Tom de Vries
2015-12-11 13:00                   ` Richard Biener
2015-12-13 16:38                     ` Tom de Vries
2015-12-14 13:26                       ` Richard Biener
2015-12-14 15:44                         ` Tom de Vries
2015-12-16 13:16                           ` Richard Biener
2015-12-16 14:43                             ` Tom de Vries
2015-12-17 12:03                               ` [gomp4] " Thomas Schwinge
2015-12-03 11:53       ` Tom de Vries
2015-11-09 16:31 ` [PATCH, 5/16] Add in_oacc_kernels_region in struct loop Tom de Vries
2015-11-11 10:57   ` Richard Biener
2015-11-16 11:39     ` Tom de Vries
2015-11-16 12:41       ` Richard Biener
2015-11-16 11:39     ` Tom de Vries
2015-11-09 17:39 ` [PATCH, 6/16] Add pass_oacc_kernels Tom de Vries
2015-11-11 10:59   ` Richard Biener
2015-11-19 13:51     ` Tom de Vries
2015-11-24 12:17       ` Tom de Vries
2015-11-25 10:42         ` Richard Biener
2016-02-05 12:06   ` Use plain -fopenacc to enable OpenACC kernels processing (was: [PATCH, 6/16] Add pass_oacc_kernels) Thomas Schwinge
2016-02-10 14:40     ` Use plain -fopenacc to enable OpenACC kernels processing Thomas Schwinge
2016-02-15 16:54       ` Tom de Vries
2016-02-23 15:19         ` Thomas Schwinge
2015-11-09 18:14 ` [PATCH, 7/16] Add pass_dominator_oacc_kernels Tom de Vries
2015-11-11 11:05   ` Richard Biener
2015-11-16 12:04     ` Tom de Vries
2015-11-09 18:34 ` [PATCH, 8/16] Add pass_ch_oacc_kernels Tom de Vries
2015-11-11 20:29   ` Tom de Vries
2015-11-30 12:12     ` [gomp4] Use pass_ch instead of pass_ch_oacc_kernels (was: [PATCH, 8/16] Add pass_ch_oacc_kernels) Thomas Schwinge
2015-11-09 19:53 ` [PATCH, 9/16] Add pass_parallelize_loops_oacc_kernels Tom de Vries
2015-11-16 11:59   ` Tom de Vries
2015-11-24 12:27     ` Tom de Vries
2015-12-13 16:58       ` [PIING][PATCH, " Tom de Vries
2015-12-14 15:23         ` Richard Biener
2016-01-16 22:41           ` [Committed] Move pass_expand_omp_ssa out of pass_parallelize_loops Tom de Vries
2016-01-18 12:59           ` [Committed] Allow pass_parallelize_loops to be run outside the loop pipeline Tom de Vries
2016-01-18 13:07           ` [committed] Add oacc_kernels_p argument to pass_parallelize_loops Tom de Vries
2016-01-18 13:30             ` [committed] Add pass_parallelize_loops to pass_oacc_kernels Tom de Vries
2016-01-20  8:54             ` [committed] Add oacc_kernels_p argument to pass_parallelize_loops Thomas Schwinge
2016-01-20 10:31               ` Tom de Vries
2015-11-09 19:59 ` [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def Tom de Vries
2015-11-11 11:03   ` Richard Biener
2015-11-16 11:55     ` Tom de Vries
2015-11-16 12:45       ` Richard Biener
2015-11-16 23:21         ` Tom de Vries
2015-11-17 10:05           ` Richard Biener
2015-11-17 14:54             ` Tom de Vries
2015-11-17 15:18               ` Richard Biener
2015-11-17 15:39                 ` Tom de Vries
2015-11-17 22:21                   ` [PATCH, PR68373 ] Call scev_const_prop in pass_parallelize_loops::execute Tom de Vries
2015-11-19  9:36                     ` Tom de Vries
2015-11-20 10:15                       ` Richard Biener
2015-11-18  8:30                   ` [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def Richard Biener
2015-11-18 16:22                     ` Bernhard Reutner-Fischer
2015-11-20 12:53                       ` [committed, trivial] Fix typo and trailing whitespace in dump-file strings in parloops Tom de Vries
2015-11-19  0:35               ` [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def Tom de Vries
2015-11-20 10:28                 ` Richard Biener
2015-11-21  8:42                   ` Tom de Vries
2015-11-23 11:31                     ` Richard Biener
2015-11-23 15:53                       ` Tom de Vries
2015-11-23 16:38                         ` Richard Biener
2015-11-19 10:31         ` Tom de Vries
2015-11-20 10:37           ` Richard Biener
2015-11-20 13:27             ` Tom de Vries
2015-11-20 13:29               ` Richard Biener
2015-11-20 16:34                 ` Tom de Vries
2015-11-23 10:11                   ` Richard Biener
2015-11-24 12:22                     ` Tom de Vries
2015-11-24 13:19                       ` Richard Biener
2015-11-24 14:33                         ` Tom de Vries
2015-11-24 14:36                           ` Richard Biener
2015-11-24 15:05                             ` Tom de Vries
2015-11-25 10:43                               ` Richard Biener
2015-11-25 10:44                       ` Richard Biener
2015-11-30 17:48                         ` [gomp4] " Thomas Schwinge
2015-11-22 23:37             ` [PATCH] Don't reapply loops flags if unnecessary in loop_optimizer_init Tom de Vries
2015-11-23 10:33               ` Richard Biener
2015-11-23 11:27                 ` Tom de Vries
2015-11-09 20:02 ` [PATCH, 11/16] Update testcases after adding kernels pass group Tom de Vries
2015-11-11 11:03   ` Richard Biener
2015-11-12 14:32     ` Tom de Vries
2015-11-12 14:43       ` Richard Biener
2015-11-12 15:42         ` David Malcolm
2015-11-13  9:44           ` Richard Biener
2015-11-09 20:06 ` [PATCH, 12/16] Handle acc loop directive Tom de Vries
2015-11-24 12:30   ` [PING][PATCH, " Tom de Vries
2016-01-18 14:27     ` [PING^2][PATCH, " Tom de Vries
2016-01-26 12:38       ` [PING^3][PATCH, " Tom de Vries
2016-01-26 12:50         ` Jakub Jelinek
2016-02-12 11:11           ` Tom de Vries
2016-02-22 10:55             ` Tom de Vries
2016-02-22 10:58               ` Jakub Jelinek
2016-02-29  3:27                 ` Tom de Vries
2016-03-07  8:22                   ` [PING][PATCH, " Tom de Vries
2016-03-14  6:21                     ` [PING^2][PATCH, " Tom de Vries
2015-11-09 20:08 ` [PATCH, 13/16] Add c-c++-common/goacc/kernels-*.c Tom de Vries
2016-01-18 13:33   ` [committed] Add oacc kernels tests in goacc Tom de Vries
2015-11-09 20:09 ` [PATCH, 14/16] Add gfortran.dg/goacc/kernels-*.f95 Tom de Vries
2015-11-09 20:11 ` [PATCH, 15/16] Add libgomp.oacc-c-c++-common/kernels-*.c Tom de Vries
2016-01-18 13:39   ` [comitted] Add oacc kernels test in libgomp Tom de Vries
2016-03-09  9:18   ` [PATCH, 15/16] Add libgomp.oacc-c-c++-common/kernels-*.c Tom de Vries
2016-03-18 12:46     ` Scan for parallelization of the oacc kernels test-cases in gfortran.dg/goacc (was: [PATCH, 15/16] Add libgomp.oacc-c-c++-common/kernels-*.c) Thomas Schwinge
2016-04-05  9:13       ` Scan for parallelization of the oacc kernels test-cases in gfortran.dg/goacc Tom de Vries
2016-04-07 15:26         ` Thomas Schwinge
2015-11-09 20:12 ` [PATCH, 16/16] Add libgomp.oacc-fortran/kernels-*.f95 Tom de Vries
2016-03-09  9:19   ` Tom de Vries
2016-03-16 13:12     ` Thomas Schwinge
