public inbox for gcc-patches@gcc.gnu.org
* [PATCH series, 16] Use parloops to parallelize oacc kernels regions
@ 2015-11-09 15:35 Tom de Vries
  2015-11-09 15:44 ` [PATCH, 1/16] Insert new exit block only when needed in transform_to_exit_first_loop_alt Tom de Vries
                   ` (15 more replies)
  0 siblings, 16 replies; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 15:35 UTC
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 1382 bytes --]

Hi,

this patch series for stage1 trunk adds support to:
- parallelize oacc kernels regions using parloops, and
- map the loops onto the oacc gang dimension.
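
For illustration only, a kernels region of the kind the series targets
might look like the sketch below.  The function name vec_add and the data
clauses are assumptions made for this example; it is not one of the
testcases in the series.  Note that there is no acc loop directive: the
intent is that parloops analyzes the loop and, if it can prove the
iterations independent, parallelizes it across gangs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example, not taken from the patch series.
   Without -fopenacc the pragma is ignored and the loop simply runs
   sequentially on the host.  */
void
vec_add (size_t n, const unsigned int *a, const unsigned int *b,
	 unsigned int *c)
{
#pragma acc kernels copyin (a[0:n], b[0:n]) copyout (c[0:n])
  {
    for (size_t i = 0; i < n; i++)
      c[i] = a[i] + b[i];
  }
}
```

Calling vec_add (4, a, b, c) on three distinct four-element arrays leaves
c[i] == a[i] + b[i], whether or not the region is offloaded.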

The patch series contains these patches:

      1	Insert new exit block only when needed in
         transform_to_exit_first_loop_alt
      2	Make create_parallel_loop return void
      3	Ignore reduction clause on kernels directive
      4	Implement -foffload-alias
      5	Add in_oacc_kernels_region in struct loop
      6	Add pass_oacc_kernels
      7	Add pass_dominator_oacc_kernels
      8	Add pass_ch_oacc_kernels
      9	Add pass_parallelize_loops_oacc_kernels
     10	Add pass_oacc_kernels pass group in passes.def
     11	Update testcases after adding kernels pass group
     12	Handle acc loop directive
     13	Add c-c++-common/goacc/kernels-*.c
     14	Add gfortran.dg/goacc/kernels-*.f95
     15	Add libgomp.oacc-c-c++-common/kernels-*.c
     16	Add libgomp.oacc-fortran/kernels-*.f95

The first 9 patches are more or less independent, but patches 10-16 are 
intended to be committed at the same time.

Bootstrapped and reg-tested on x86_64.

Built and reg-tested with an nvidia accelerator, in combination with a
patch that enables accelerator testing (which is submitted at 
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).

I'll post the individual patches in reply to this message.

Thanks,
- Tom

[-- Attachment #2: patch-series-summmary.txt --]
[-- Type: text/plain, Size: 10935 bytes --]

---

1
Insert new exit block only when needed in transform_to_exit_first_loop_alt

2015-06-30  Tom de Vries  <tom@codesourcery.com>

	* tree-parloops.c (transform_to_exit_first_loop_alt): Insert new exit
	block only when needed.
---

2
Make create_parallel_loop return void

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* tree-parloops.c (create_parallel_loop): Return void.
---

3
Ignore reduction clause on kernels directive

2015-11-08  Tom de Vries  <tom@codesourcery.com>

	* c-omp.c (c_oacc_split_loop_clauses): Don't copy OMP_CLAUSE_REDUCTION,
	classify as loop clause.
---

4
Implement -foffload-alias

2015-11-03  Tom de Vries  <tom@codesourcery.com>

	* common.opt (foffload-alias): New option.
	* flag-types.h (enum offload_alias): New enum.
	* omp-low.c (install_var_field): Handle flag_offload_alias.
	* doc/invoke.texi (@item Code Generation Options): Add -foffload-alias.
	(@item -foffload-alias): New item.

	* c-c++-common/goacc/kernels-loop-offload-alias-none.c: New test.
	* c-c++-common/goacc/kernels-loop-offload-alias-ptr.c: New test.
---

5
Add in_oacc_kernels_region in struct loop

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* cfgloop.h (struct loop): Add in_oacc_kernels_region field.
	* omp-low.c (mark_loops_in_oacc_kernels_region): New function.
	(expand_omp_target): Call mark_loops_in_oacc_kernels_region.
---

6
Add pass_oacc_kernels

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* tree-pass.h (make_pass_oacc_kernels): Declare.
	* tree-ssa-loop.c (gate_oacc_kernels): New static function.
	(pass_data_oacc_kernels): New pass_data.
	(class pass_oacc_kernels): New pass.
	(make_pass_oacc_kernels): New function.
---

7
Add pass_dominator_oacc_kernels

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* tree-pass.h (make_pass_dominator_oacc_kernels): Declare.
	* tree-ssa-dom.c (class dominator_base): New class.  Factor out of ...
	(class pass_dominator): ... here.
	(dominator_base::may_peel_loop_headers_p)
        (pass_dominator::may_peel_loop_headers_p): New function.
	(pass_dominator_oacc_kernels): New pass.
	(make_pass_dominator_oacc_kernels): New function.
	(dominator_base::execute): Use may_peel_loop_headers_p.
---

8
Add pass_ch_oacc_kernels

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* tree-pass.h (make_pass_ch_oacc_kernels): Declare.
	* tree-ssa-loop-ch.c (pass_ch::pass_ch (pass_data, gcc::context)): New
	constructor.
	(pass_data_ch_oacc_kernels): New pass_data.
	(class pass_ch_oacc_kernels): New pass.
	(pass_ch_oacc_kernels::process_loop_p): New function.
	(make_pass_ch_oacc_kernels): New function.
---

9
Add pass_parallelize_loops_oacc_kernels

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* omp-low.c (set_oacc_fn_attrib): Make extern.
	* omp-low.c (expand_omp_atomic_fetch_op):  Release defs of update stmt.
	* omp-low.h (set_oacc_fn_attrib): Declare.
	* tree-parloops.c (struct reduction_info): Add reduc_addr field.
        (create_call_for_reduction_1): Handle case that reduc_addr is non-NULL.
	(create_parallel_loop, gen_parallel_loop, try_create_reduction_list):
	Add and handle function parameter oacc_kernels_p.
	(get_omp_data_i_param): New function.
	(ref_conflicts_with_region, oacc_entry_exit_ok_1)
	(oacc_entry_exit_single_gang, oacc_entry_exit_ok): New function.
	(parallelize_loops): Add and handle function parameter oacc_kernels_p.
	Calculate dominance info.  Skip loops that are not in a kernels region
	in oacc_kernels_p mode.  Skip inner loops of parallelized loops.
	(pass_parallelize_loops::execute): Call parallelize_loops with false
	argument.
	(pass_data_parallelize_loops_oacc_kernels): New pass_data.
	(class pass_parallelize_loops_oacc_kernels): New pass.
	(pass_parallelize_loops_oacc_kernels::execute)
	(make_pass_parallelize_loops_oacc_kernels): New function.
	* tree-pass.h (make_pass_parallelize_loops_oacc_kernels): Declare.
---

10
Add pass_oacc_kernels pass group in passes.def

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* omp-low.c (pass_expand_omp_ssa::clone): New function.
	* tree-ssa-loop.c (pass_scev_cprop::clone, pass_tree_loop_init::clone)
	(pass_tree_loop_done::clone): New function.
	* passes.def: Add pass_oacc_kernels pass group.
---

11
Update testcases after adding kernels pass group

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* c-c++-common/restrict-2.c: Update after adding pass_oacc_kernels pass
	group.
	* c-c++-common/restrict-4.c: Same.
	* g++.dg/tree-ssa/copyprop-1.C: Same.
	* g++.dg/tree-ssa/pr33615.C: Same.
	* g++.dg/tree-ssa/restrict1.C: Same.
	* gcc.dg/gomp/notify-new-function-3.c: Same.
	* gcc.dg/pr23911.c: Same.
	* gcc.dg/pr41488.c: Same.
	* gcc.dg/tm/pub-safety-1.c: Same.
	* gcc.dg/tm/reg-promotion.c: Same.
	* gcc.dg/tree-ssa/20030709-2.c: Same.
	* gcc.dg/tree-ssa/20030731-2.c: Same.
	* gcc.dg/tree-ssa/20040729-1.c: Same.
	* gcc.dg/tree-ssa/20050314-1.c: Same.
	* gcc.dg/tree-ssa/cfgcleanup-1.c: Same.
	* gcc.dg/tree-ssa/loop-17.c: Same.
	* gcc.dg/tree-ssa/loop-32.c: Same.
	* gcc.dg/tree-ssa/loop-33.c: Same.
	* gcc.dg/tree-ssa/loop-34.c: Same.
	* gcc.dg/tree-ssa/loop-35.c: Same.
	* gcc.dg/tree-ssa/loop-36.c: Same.
	* gcc.dg/tree-ssa/loop-39.c: Same.
	* gcc.dg/tree-ssa/loop-7.c: Same.
	* gcc.dg/tree-ssa/pr21086.c: Same.
	* gcc.dg/tree-ssa/pr23109.c: Same.
	* gcc.dg/tree-ssa/restrict-3.c: Same.
	* gcc.dg/tree-ssa/restrict-5.c: Same.
	* gcc.dg/tree-ssa/scev-7.c: Same.
	* gcc.dg/tree-ssa/ssa-dce-1.c: Same.
	* gcc.dg/tree-ssa/ssa-dce-2.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-1.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-10.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-11.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-12.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-2.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-3.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-6.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-7.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-8.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-9.c: Same.
	* gcc.dg/tree-ssa/structopt-1.c: Same.
	* gcc.dg/vect/pr26359.c: Same.
	* gfortran.dg/pr32921.f: Same.
---

12
Handle acc loop directive

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* omp-low.c (struct omp_region): Add inside_kernels_p field.
	(expand_omp_for_generic): Only set address taken for istart0
	and end0 unless necessary.  Adjust to generate a 'sequential' loop
	when GOMP builtin arguments are BUILT_IN_NONE.
	(expand_omp_for): Use expand_omp_for_generic() to generate a
	non-parallelized loop for OMP_FORs inside OpenACC kernels regions.
	(expand_omp): Mark inside_kernels_p field true for regions
	nested inside OpenACC kernels constructs.
---

13
Add c-c++-common/goacc/kernels-*.c

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* c-c++-common/goacc/kernels-acc-loop-reduction.c: New test.
	* c-c++-common/goacc/kernels-acc-loop-smaller-equal.c: New test.
	* c-c++-common/goacc/kernels-counter-var-redundant-load.c: New test.
	* c-c++-common/goacc/kernels-counter-vars-function-scope.c: New test.
	* c-c++-common/goacc/kernels-double-reduction.c: New test.
	* c-c++-common/goacc/kernels-empty.c: New test.
	* c-c++-common/goacc/kernels-eternal.c: New test.
	* c-c++-common/goacc/kernels-loop-2-acc-loop.c: New test.
	* c-c++-common/goacc/kernels-loop-2.c: New test.
	* c-c++-common/goacc/kernels-loop-3-acc-loop.c: New test.
	* c-c++-common/goacc/kernels-loop-3.c: New test.
	* c-c++-common/goacc/kernels-loop-acc-loop.c: New test.
	* c-c++-common/goacc/kernels-loop-data-2.c: New test.
	* c-c++-common/goacc/kernels-loop-data-enter-exit-2.c: New test.
	* c-c++-common/goacc/kernels-loop-data-enter-exit.c: New test.
	* c-c++-common/goacc/kernels-loop-data-update.c: New test.
	* c-c++-common/goacc/kernels-loop-data.c: New test.
	* c-c++-common/goacc/kernels-loop-g.c: New test.
	* c-c++-common/goacc/kernels-loop-mod-not-zero.c: New test.
	* c-c++-common/goacc/kernels-loop-n-acc-loop.c: New test.
	* c-c++-common/goacc/kernels-loop-n.c: New test.
	* c-c++-common/goacc/kernels-loop-nest.c: New test.
	* c-c++-common/goacc/kernels-loop.c: New test.
	* c-c++-common/goacc/kernels-noreturn.c: New test.
	* c-c++-common/goacc/kernels-one-counter-var.c: New test.
	* c-c++-common/goacc/kernels-parallel-loop-data-enter-exit.c: New test.
	* c-c++-common/goacc/kernels-reduction.c: New test.
---

14
Add gfortran.dg/goacc/kernels-*.f95

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* gfortran.dg/goacc/kernels-loop-2.f95: New test.
	* gfortran.dg/goacc/kernels-loop-data-2.f95: New test.
	* gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95: New test.
	* gfortran.dg/goacc/kernels-loop-data-enter-exit.f95: New test.
	* gfortran.dg/goacc/kernels-loop-data-update.f95: New test.
	* gfortran.dg/goacc/kernels-loop-data.f95: New test.
	* gfortran.dg/goacc/kernels-loop.f95: New test.
	* gfortran.dg/goacc/kernels-parallel-loop-data-enter-exit.f95: New test.
---

15
Add libgomp.oacc-c-c++-common/kernels-*.c

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c: New test.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-2.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit-2.c:
	Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit.c:
	Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-update.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-parallel-loop-data-enter-exit.c:
	Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c: Same.
---

16
Add libgomp.oacc-fortran/kernels-*.f95

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* testsuite/libgomp.oacc-fortran/kernels-loop-2.f95: New test.
	* testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95: Same.
	* testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95:
	Same.
	* testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95: Same.
	* testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95: Same.
	* testsuite/libgomp.oacc-fortran/kernels-loop-data.f95: Same.
	* testsuite/libgomp.oacc-fortran/kernels-loop.f95: Same.
	* testsuite/libgomp.oacc-fortran/kernels-parallel-loop-data-enter-exit.f95:
	Same.
---

* [PATCH, 1/16] Insert new exit block only when needed in transform_to_exit_first_loop_alt
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
@ 2015-11-09 15:44 ` Tom de Vries
  2015-11-11 10:50   ` Richard Biener
  2015-11-09 15:45 ` [PATCH, 2/16] Make create_parallel_loop return void Tom de Vries
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 15:44 UTC
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 1801 bytes --]

In transform_to_exit_first_loop_alt we insert a new exit block in
between the new loop header and the old exit block. Currently, we do
this even when it is not necessary.

This patch inserts the new exit block only when it is needed, that is,
when the old exit block has more than one predecessor.
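
The decision can be sketched on a toy CFG as follows.  The toy_block and
toy_edge types and both function names are stand-ins invented for this
sketch, not GCC's basic_block, edge or split_edge.  The reasoning, as far
as the patch comment suggests, is that code placed in an exit block with a
single predecessor already runs only on loop exit, so no extra block is
required in that case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for GCC's basic_block and edge; not the real types.  */
struct toy_block
{
  int n_preds;			/* Number of incoming edges.  */
};

struct toy_edge
{
  struct toy_block *src;
  struct toy_block *dest;
};

/* Mirrors the !single_pred_p (exit->dest) test in the patch: the old exit
   block can be used as-is when EXIT is its only incoming edge.  */
bool
need_new_exit_block (const struct toy_edge *exit)
{
  return exit->dest->n_preds != 1;
}

/* Toy version of the conditional split.  If a new block is needed, redirect
   EXIT to NEW_BLOCK and return it; otherwise return NULL and leave EXIT
   untouched.  The edge from NEW_BLOCK to the old exit block is not
   modelled; it would replace EXIT there, so the old exit block's
   predecessor count stays the same.  */
struct toy_block *
maybe_split_exit (struct toy_edge *exit, struct toy_block *new_block)
{
  if (!need_new_exit_block (exit))
    return NULL;

  new_block->n_preds = 1;
  exit->dest = new_block;
  return new_block;
}
```

With an exit block that has two predecessors, maybe_split_exit returns the
new block; with an exit block that has one, it returns NULL, which
corresponds to new_exit_block staying NULL in the patch.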

Thanks,
- Tom


[-- Attachment #2: 0001-Insert-new-exit-block-only-when-needed-in-transform_.patch --]
[-- Type: text/x-patch, Size: 3396 bytes --]

Insert new exit block only when needed in transform_to_exit_first_loop_alt

2015-06-30  Tom de Vries  <tom@codesourcery.com>

	* tree-parloops.c (transform_to_exit_first_loop_alt): Insert new exit
	block only when needed.
---
 gcc/tree-parloops.c | 42 ++++++++++++++++++++++++++++--------------
 1 file changed, 28 insertions(+), 14 deletions(-)

diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
index 3d41275..6a49aa9 100644
--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c
@@ -1695,10 +1695,15 @@ transform_to_exit_first_loop_alt (struct loop *loop,
   /* Set the latch arguments of the new phis to ivtmp/sum_b.  */
   flush_pending_stmts (post_inc_edge);
 
-  /* Create a new empty exit block, inbetween the new loop header and the old
-     exit block.  The function separate_decls_in_region needs this block to
-     insert code that is active on loop exit, but not any other path.  */
-  basic_block new_exit_block = split_edge (exit);
+
+  basic_block new_exit_block = NULL;
+  if (!single_pred_p (exit->dest))
+    {
+      /* Create a new empty exit block, inbetween the new loop header and the
+	 old exit block.  The function separate_decls_in_region needs this block
+	 to insert code that is active on loop exit, but not any other path.  */
+      new_exit_block = split_edge (exit);
+    }
 
   /* Insert and register the reduction exit phis.  */
   for (gphi_iterator gsi = gsi_start_phis (exit_block);
@@ -1706,17 +1711,24 @@ transform_to_exit_first_loop_alt (struct loop *loop,
        gsi_next (&gsi))
     {
       gphi *phi = gsi.phi ();
+      gphi *nphi = NULL;
       tree res_z = PHI_RESULT (phi);
+      tree res_c;
 
-      /* Now that we have a new exit block, duplicate the phi of the old exit
-	 block in the new exit block to preserve loop-closed ssa.  */
-      edge succ_new_exit_block = single_succ_edge (new_exit_block);
-      edge pred_new_exit_block = single_pred_edge (new_exit_block);
-      tree res_y = copy_ssa_name (res_z, phi);
-      gphi *nphi = create_phi_node (res_y, new_exit_block);
-      tree res_c = PHI_ARG_DEF_FROM_EDGE (phi, succ_new_exit_block);
-      add_phi_arg (nphi, res_c, pred_new_exit_block, UNKNOWN_LOCATION);
-      add_phi_arg (phi, res_y, succ_new_exit_block, UNKNOWN_LOCATION);
+      if (new_exit_block != NULL)
+	{
+	  /* Now that we have a new exit block, duplicate the phi of the old
+	     exit block in the new exit block to preserve loop-closed ssa.  */
+	  edge succ_new_exit_block = single_succ_edge (new_exit_block);
+	  edge pred_new_exit_block = single_pred_edge (new_exit_block);
+	  tree res_y = copy_ssa_name (res_z, phi);
+	  nphi = create_phi_node (res_y, new_exit_block);
+	  res_c = PHI_ARG_DEF_FROM_EDGE (phi, succ_new_exit_block);
+	  add_phi_arg (nphi, res_c, pred_new_exit_block, UNKNOWN_LOCATION);
+	  add_phi_arg (phi, res_y, succ_new_exit_block, UNKNOWN_LOCATION);
+	}
+      else
+	res_c = PHI_ARG_DEF_FROM_EDGE (phi, exit);
 
       if (virtual_operand_p (res_z))
 	continue;
@@ -1724,7 +1736,9 @@ transform_to_exit_first_loop_alt (struct loop *loop,
       gimple *reduc_phi = SSA_NAME_DEF_STMT (res_c);
       struct reduction_info *red = reduction_phi (reduction_list, reduc_phi);
       if (red != NULL)
-	red->keep_res = nphi;
+	red->keep_res = (nphi != NULL
+			 ? nphi
+			 : phi);
     }
 
   /* We're going to cancel the loop at the end of gen_parallel_loop, but until
-- 
1.9.1


* [PATCH, 2/16] Make create_parallel_loop return void
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
  2015-11-09 15:44 ` [PATCH, 1/16] Insert new exit block only when needed in transform_to_exit_first_loop_alt Tom de Vries
@ 2015-11-09 15:45 ` Tom de Vries
  2015-11-11 10:50   ` Richard Biener
  2015-11-09 15:51 ` [PATCH, 3/16] Ignore reduction clause on kernels directive Tom de Vries
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 15:45 UTC
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 1618 bytes --]

This patch makes create_parallel_loop return void, since its return
value is currently unused.

Thanks,
- Tom


[-- Attachment #2: 0002-Make-create_parallel_loop-return-void.patch --]
[-- Type: text/x-patch, Size: 1362 bytes --]

Make create_parallel_loop return void

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* tree-parloops.c (create_parallel_loop): Return void.
---
 gcc/tree-parloops.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
index 6a49aa9..17415a8 100644
--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c
@@ -1986,10 +1986,9 @@ transform_to_exit_first_loop (struct loop *loop,
 /* Create the parallel constructs for LOOP as described in gen_parallel_loop.
    LOOP_FN and DATA are the arguments of GIMPLE_OMP_PARALLEL.
    NEW_DATA is the variable that should be initialized from the argument
-   of LOOP_FN.  N_THREADS is the requested number of threads.  Returns the
-   basic block containing GIMPLE_OMP_PARALLEL tree.  */
+   of LOOP_FN.  N_THREADS is the requested number of threads.  */
 
-static basic_block
+static void
 create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
 		      tree new_data, unsigned n_threads, location_t loc)
 {
@@ -2162,8 +2161,6 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
   /* After the above dom info is hosed.  Re-compute it.  */
   free_dominance_info (CDI_DOMINATORS);
   calculate_dominance_info (CDI_DOMINATORS);
-
-  return paral_bb;
 }
 
 /* Generates code to execute the iterations of LOOP in N_THREADS
-- 
1.9.1


* [PATCH, 3/16] Ignore reduction clause on kernels directive
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
  2015-11-09 15:44 ` [PATCH, 1/16] Insert new exit block only when needed in transform_to_exit_first_loop_alt Tom de Vries
  2015-11-09 15:45 ` [PATCH, 2/16] Make create_parallel_loop return void Tom de Vries
@ 2015-11-09 15:51 ` Tom de Vries
  2015-11-24 12:25   ` [PING][PATCH, " Tom de Vries
  2015-11-09 16:10 ` [PATCH, 4/16] Implement -foffload-alias Tom de Vries
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 15:51 UTC
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener, Thomas Schwinge

[-- Attachment #1: Type: text/plain, Size: 1698 bytes --]

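As discussed here (
https://gcc.gnu.org/ml/gcc-patches/2015-11/msg00785.html ), the kernels
directive does not allow the reduction clause.  This patch therefore
stops duplicating the reduction clause onto the enclosing construct in
c_oacc_split_loop_clauses and classifies it as a loop clause only.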
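
The resulting classification can be sketched on a toy clause list as
follows.  The toy_clause type, the TOY_* codes and toy_split_loop_clauses
are invented for this sketch; the real function works on OMP_CLAUSE trees
and handles more clause kinds.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for OMP clause codes; not GCC's enum.  */
enum toy_clause_code
{
  TOY_GANG,
  TOY_PRIVATE,
  TOY_REDUCTION,
  TOY_COPYIN
};

struct toy_clause
{
  enum toy_clause_code code;
  struct toy_clause *chain;	/* Next clause, like OMP_CLAUSE_CHAIN.  */
};

/* Split CLAUSES into loop clauses (returned) and all other clauses (stored
   in *NOT_LOOP_CLAUSES).  A reduction clause is moved to the loop list
   only; it is not copied to the other list.  Both lists come out in
   reverse order, as in the original function.  */
struct toy_clause *
toy_split_loop_clauses (struct toy_clause *clauses,
			struct toy_clause **not_loop_clauses)
{
  struct toy_clause *next, *loop_clauses = NULL;

  *not_loop_clauses = NULL;
  for (; clauses; clauses = next)
    {
      next = clauses->chain;

      switch (clauses->code)
	{
	  /* Loop clauses.  */
	case TOY_GANG:
	case TOY_PRIVATE:
	case TOY_REDUCTION:
	  clauses->chain = loop_clauses;
	  loop_clauses = clauses;
	  break;

	  /* Parallel/kernels clauses.  */
	default:
	  clauses->chain = *not_loop_clauses;
	  *not_loop_clauses = clauses;
	  break;
	}
    }

  return loop_clauses;
}
```

For the input list gang, reduction, copyin, the loop list becomes
reduction, gang and the other list contains only copyin.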

Thanks,
- Tom


[-- Attachment #2: 0003-Ignore-reduction-clause-on-kernels-directive.patch --]
[-- Type: text/x-patch, Size: 1306 bytes --]

Ignore reduction clause on kernels directive

2015-11-08  Tom de Vries  <tom@codesourcery.com>

	* c-omp.c (c_oacc_split_loop_clauses): Don't copy OMP_CLAUSE_REDUCTION,
	classify as loop clause.
---
 gcc/c-family/c-omp.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/gcc/c-family/c-omp.c b/gcc/c-family/c-omp.c
index 3e93b59..a3b99b2 100644
--- a/gcc/c-family/c-omp.c
+++ b/gcc/c-family/c-omp.c
@@ -867,7 +867,7 @@ c_omp_check_loop_iv_exprs (location_t stmt_loc, tree declv, tree decl,
 tree
 c_oacc_split_loop_clauses (tree clauses, tree *not_loop_clauses)
 {
-  tree next, loop_clauses, t;
+  tree next, loop_clauses;
 
   loop_clauses = *not_loop_clauses = NULL_TREE;
   for (; clauses ; clauses = next)
@@ -886,16 +886,11 @@ c_oacc_split_loop_clauses (tree clauses, tree *not_loop_clauses)
 	case OMP_CLAUSE_SEQ:
 	case OMP_CLAUSE_INDEPENDENT:
 	case OMP_CLAUSE_PRIVATE:
+	case OMP_CLAUSE_REDUCTION:
 	  OMP_CLAUSE_CHAIN (clauses) = loop_clauses;
 	  loop_clauses = clauses;
 	  break;
 
-	  /* Reductions belong in both constructs.  */
-	case OMP_CLAUSE_REDUCTION:
-	  t = copy_node (clauses);
-	  OMP_CLAUSE_CHAIN (t) = loop_clauses;
-	  loop_clauses = t;
-
 	  /* Parallel/kernels clauses.  */
 	default:
 	  OMP_CLAUSE_CHAIN (clauses) = *not_loop_clauses;
-- 
1.9.1


* [PATCH, 4/16] Implement -foffload-alias
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
                   ` (2 preceding siblings ...)
  2015-11-09 15:51 ` [PATCH, 3/16] Ignore reduction clause on kernels directive Tom de Vries
@ 2015-11-09 16:10 ` Tom de Vries
  2015-11-11 10:53   ` Richard Biener
  2015-11-09 16:31 ` [PATCH, 5/16] Add in_oacc_kernels_region in struct loop Tom de Vries
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 16:10 UTC
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 2652 bytes --]

This patch addresses the problem that once the offloading region has
been split off from the original function, alias analysis can no longer
use information available in the original function that would allow it
to do a more precise analysis for the offloading function. [ At some
point we could use -fipa-pta for that, as discussed in PR46032, but
that's not feasible now. ]

The basic idea behind the patch is that for typical usage, the base 
pointers used in an offloaded region are non-aliasing. The patch works 
by adding restrict to the types of the fields used to pass data to an 
offloading region.
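
As a rough source-level analogy (the patch itself changes the field types
that omp-low.c builds internally, not user code), the effect resembles
passing the data through a record whose pointer fields are
restrict-qualified.  The names toy_omp_data and toy_region are
hypothetical, and whether a given compiler exploits restrict on struct
fields is not guaranteed.

```c
#include <assert.h>

/* Hypothetical stand-in for the record used to pass data to an offloaded
   region.  The restrict qualifiers are a promise by the caller that A and
   B point to distinct objects.  */
struct toy_omp_data
{
  int *restrict a;
  int *restrict b;
};

/* Stand-in for the split-off region.  Given the restrict promise, a
   compiler may assume the store through B leaves *A unchanged and forward
   the value 1 to the return.  If A and B did alias, the promise would be
   broken and the behavior undefined.  */
int
toy_region (struct toy_omp_data *data)
{
  *data->a = 1;
  *data->b = 2;
  return *data->a;
}
```

Called with two distinct int objects, toy_region returns 1 and leaves 2 in
the second object.  Passing the same object for both fields is exactly the
case that the non-aliasing assumption rules out.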


The patch implements a new option
-foffload-alias=<none|pointer|all>.

The option -foffload-alias=none instructs the compiler to assume that
object references and pointer dereferences in an offload region do not
alias.

The option -foffload-alias=pointer instructs the compiler to assume that
object references in an offload region do not alias.

The option -foffload-alias=all instructs the compiler to make no
assumptions about aliasing in offload regions.

The default value is -foffload-alias=none.

Thanks,
- Tom


[-- Attachment #2: 0004-Implement-foffload-alias.patch --]
[-- Type: text/x-patch, Size: 9102 bytes --]

Implement -foffload-alias

2015-11-03  Tom de Vries  <tom@codesourcery.com>

	* common.opt (foffload-alias): New option.
	* flag-types.h (enum offload_alias): New enum.
	* omp-low.c (install_var_field): Handle flag_offload_alias.
	* doc/invoke.texi (@item Code Generation Options): Add -foffload-alias.
	(@item -foffload-alias): New item.

	* c-c++-common/goacc/kernels-loop-offload-alias-none.c: New test.
	* c-c++-common/goacc/kernels-loop-offload-alias-ptr.c: New test.
---
 gcc/common.opt                                     | 16 ++++++
 gcc/doc/invoke.texi                                | 11 ++++
 gcc/flag-types.h                                   |  7 +++
 gcc/omp-low.c                                      | 28 +++++++++-
 .../goacc/kernels-loop-offload-alias-none.c        | 61 ++++++++++++++++++++++
 .../goacc/kernels-loop-offload-alias-ptr.c         | 44 ++++++++++++++++
 6 files changed, 165 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-offload-alias-none.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-offload-alias-ptr.c

diff --git a/gcc/common.opt b/gcc/common.opt
index 961a1b6..7135b1a 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1735,6 +1735,22 @@ Enum(offload_abi) String(ilp32) Value(OFFLOAD_ABI_ILP32)
 EnumValue
 Enum(offload_abi) String(lp64) Value(OFFLOAD_ABI_LP64)
 
+foffload-alias=
+Common Joined RejectNegative Enum(offload_alias) Var(flag_offload_alias) Init(OFFLOAD_ALIAS_NONE)
+-foffload-alias=[all|pointer|none]     Assume non-aliasing in an offload region
+
+Enum
+Name(offload_alias) Type(enum offload_alias) UnknownError(unknown offload aliasing %qs)
+
+EnumValue
+Enum(offload_alias) String(all) Value(OFFLOAD_ALIAS_ALL)
+
+EnumValue
+Enum(offload_alias) String(pointer) Value(OFFLOAD_ALIAS_POINTER)
+
+EnumValue
+Enum(offload_alias) String(none) Value(OFFLOAD_ALIAS_NONE)
+
 fomit-frame-pointer
 Common Report Var(flag_omit_frame_pointer) Optimization
 When possible do not generate stack frames.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 2e5953b..6928efd 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -1143,6 +1143,7 @@ See S/390 and zSeries Options.
 -finstrument-functions-exclude-function-list=@var{sym},@var{sym},@dots{} @gol
 -finstrument-functions-exclude-file-list=@var{file},@var{file},@dots{} @gol
 -fno-common  -fno-ident @gol
+-foffload-alias=@r{[}none@r{|}pointer@r{|}all@r{]} @gol
 -fpcc-struct-return  -fpic  -fPIC -fpie -fPIE -fno-plt @gol
 -fno-jump-tables @gol
 -frecord-gcc-switches @gol
@@ -23852,6 +23853,16 @@ The options @option{-ftrapv} and @option{-fwrapv} override each other, so using
 using @option{-ftrapv} @option{-fwrapv} @option{-fno-wrapv} on the command-line
 results in @option{-ftrapv} being effective.
 
+@item -foffload-alias=@r{[}none@r{|}pointer@r{|}all@r{]}
+@opindex foffload-alias
+The option @option{-foffload-alias=none} instructs the compiler to assume that
+object references and pointer dereferences in an offload region do not alias.
+The option @option{-foffload-alias=pointer} instructs the compiler to assume
+that object references in an offload region do not alias.  The option
+@option{-foffload-alias=all} instructs the compiler to make no assumptions about
+aliasing in offload regions.  The default value is
+@option{-foffload-alias=none}.
+
 @item -fexceptions
 @opindex fexceptions
 Enable exception handling.  Generates extra code needed to propagate
diff --git a/gcc/flag-types.h b/gcc/flag-types.h
index 6301cea..87b1677 100644
--- a/gcc/flag-types.h
+++ b/gcc/flag-types.h
@@ -293,5 +293,12 @@ enum gfc_convert
   GFC_FLAG_CONVERT_LITTLE
 };
 
+enum offload_alias
+{
+  OFFLOAD_ALIAS_ALL,
+  OFFLOAD_ALIAS_POINTER,
+  OFFLOAD_ALIAS_NONE
+};
+
 
 #endif /* ! GCC_FLAG_TYPES_H */
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 45d1927..d052c13 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -1371,6 +1371,14 @@ install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
   tree field, type, sfield = NULL_TREE;
   splay_tree_key key = (splay_tree_key) var;
 
+  /* We use flag_offload_alias only for the oacc kernels region for the
+     moment.  */
+  bool offload_alias_p = is_oacc_kernels (ctx);
+  bool no_alias_var_p
+    = offload_alias_p && flag_offload_alias != OFFLOAD_ALIAS_ALL;
+  bool no_alias_ptr_p
+    = offload_alias_p && flag_offload_alias == OFFLOAD_ALIAS_NONE;
+
   if ((mask & 8) != 0)
     {
       key = (splay_tree_key) &DECL_UID (var);
@@ -1387,10 +1395,26 @@ install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
   if (mask & 4)
     {
       gcc_assert (TREE_CODE (type) == ARRAY_TYPE);
-      type = build_pointer_type (build_pointer_type (type));
+
+      type = build_pointer_type (type);
+      if (no_alias_var_p)
+	type = build_qualified_type (type, TYPE_QUAL_RESTRICT);
+
+      type = build_pointer_type (type);
+      if (no_alias_var_p)
+	type = build_qualified_type (type, TYPE_QUAL_RESTRICT);
     }
   else if (by_ref)
-    type = build_pointer_type (type);
+    {
+      if (no_alias_ptr_p
+	  && POINTER_TYPE_P (type))
+	type = build_qualified_type (type, TYPE_QUAL_RESTRICT);
+
+      type = build_pointer_type (type);
+
+      if (no_alias_var_p)
+	type = build_qualified_type (type, TYPE_QUAL_RESTRICT);
+    }
   else if ((mask & 3) == 1 && is_reference (var))
     type = TREE_TYPE (type);
 
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-offload-alias-none.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-offload-alias-none.c
new file mode 100644
index 0000000..79d8daa
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-offload-alias-none.c
@@ -0,0 +1,61 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+/* { dg-additional-options "-fdump-tree-alias-all" } */
+/* { dg-additional-options "-foffload-alias=none" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+static void
+foo (unsigned int *a, unsigned int *b, unsigned int *c)
+{
+  for (COUNTERTYPE i = 0; i < N; i++)
+    a[i] = i * 2;
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+}
+
+int
+main (void)
+{
+  unsigned int *a;
+  unsigned int *b;
+  unsigned int *c;
+
+  a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+  foo (a, b, c);
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*\\._omp_fn\\.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "clique 1 base 1" 3 "alias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 2" 1 "alias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 3" 1 "alias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 4" 1 "alias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 5" 1 "alias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 6" 1 "alias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 7" 1 "alias" } } */
+/* { dg-final { scan-tree-dump-times "(?n)clique .* base .*" 9 "alias" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-offload-alias-ptr.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-offload-alias-ptr.c
new file mode 100644
index 0000000..de4f45a
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-offload-alias-ptr.c
@@ -0,0 +1,44 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+/* { dg-additional-options "-fdump-tree-alias-all" } */
+/* { dg-additional-options "-foffload-alias=pointer" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+unsigned int a[N];
+unsigned int b[N];
+unsigned int c[N];
+
+int
+main (void)
+{
+  for (COUNTERTYPE i = 0; i < N; i++)
+    a[i] = i * 2;
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  return 0;
+}
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "clique 1 base 1" 3 "alias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 2" 1 "alias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 3" 1 "alias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 4" 1 "alias" } } */
+/* { dg-final { scan-tree-dump-times "(?n)clique .* base .*" 6 "alias" } } */
-- 
1.9.1


* [PATCH, 5/16] Add in_oacc_kernels_region in struct loop
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
                   ` (3 preceding siblings ...)
  2015-11-09 16:10 ` [PATCH, 4/16] Implement -foffload-alias Tom de Vries
@ 2015-11-09 16:31 ` Tom de Vries
  2015-11-11 10:57   ` Richard Biener
  2015-11-09 17:39 ` [PATCH, 6/16] Add pass_oacc_kernels Tom de Vries
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 16:31 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 1976 bytes --]

This patch adds and initializes the in_oacc_kernels_region field in
struct loop.

The field is used to signal to subsequent passes that we're dealing with
a loop in a kernels region that we're trying to parallelize.

Note that we do not parallelize kernels regions with more than one loop
nest. [ In general, kernels regions with more than one loop nest should
be split up into separate kernels regions, but that's not supported atm. ]
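
The marking rule can be sketched on a toy CFG as follows. This is a
simplified illustration, not the code in the patch: the immediate
dominator array, struct toy_loop and all function names are
hypothetical stand-ins for GCC's dominance info and loop tree. A block
counts as inside the region if the region entry dominates it and the
region exit does not, and loops are only marked if exactly one outer
loop has its header inside the region:

```c
#include <assert.h>
#include <stdbool.h>

#define NO_BLOCK (-1)

/* Toy stand-in for struct loop; all names here are hypothetical.  */
struct toy_loop
{
  int header;      /* index of the loop header block */
  bool outer;      /* true if the loop sits directly under the tree root */
  bool in_region;  /* plays the role of in_oacc_kernels_region */
};

/* Return true if block A dominates block B.  IDOM[x] is the immediate
   dominator of block x, or NO_BLOCK for the function entry.  */
static bool
dominates_p (const int *idom, int a, int b)
{
  for (; b != NO_BLOCK; b = idom[b])
    if (b == a)
      return true;
  return false;
}

/* A block is in the region if the region entry dominates it and the
   region exit (if any) does not.  */
static bool
in_region_p (const int *idom, int entry, int exit_bb, int bb)
{
  return (dominates_p (idom, entry, bb)
          && (exit_bb == NO_BLOCK || !dominates_p (idom, exit_bb, bb)));
}

/* Mark the loops in the region, but only if the region contains exactly
   one outer loop.  Return the number of loops marked.  */
int
mark_loops (const int *idom, int entry, int exit_bb,
            struct toy_loop *loops, int n_loops)
{
  int nr_outer_loops = 0, marked = 0;

  assert (n_loops >= 0);
  for (int i = 0; i < n_loops; i++)
    if (loops[i].outer
        && in_region_p (idom, entry, exit_bb, loops[i].header))
      nr_outer_loops++;
  if (nr_outer_loops != 1)
    return 0;

  for (int i = 0; i < n_loops; i++)
    if (in_region_p (idom, entry, exit_bb, loops[i].header))
      {
        loops[i].in_region = true;
        marked++;
      }
  return marked;
}

/* Toy CFG: block 1 is the region entry, block 4 the region exit.  Loop
   headers are blocks 2 and 3 (inside the region) and 5 (after it).  */
int
demo_mark (bool second_loop_is_outer)
{
  const int idom[] = { NO_BLOCK, 0, 1, 2, 2, 4 };
  struct toy_loop loops[] = {
    { 2, true, false },
    { 3, second_loop_is_outer, false },
    { 5, true, false },
  };
  int marked = mark_loops (idom, 1, 4, loops, 3);

  /* The loop after the region must never be marked.  */
  assert (!loops[2].in_region);
  return marked;
}
```

With a single loop nest in the region, both the outer and the inner
loop get marked; with two outer loops in the region, nothing is marked.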

Thanks,
- Tom


[-- Attachment #2: 0005-Add-in_oacc_kernels_region-in-struct-loop.patch --]
[-- Type: text/x-patch, Size: 3333 bytes --]

Add in_oacc_kernels_region in struct loop

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* cfgloop.h (struct loop): Add in_oacc_kernels_region field.
	* omp-low.c (mark_loops_in_oacc_kernels_region): New function.
	(expand_omp_target): Call mark_loops_in_oacc_kernels_region.
---
 gcc/cfgloop.h |  3 +++
 gcc/omp-low.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+)

diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 6af6893..ee73bf9 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -191,6 +191,9 @@ struct GTY ((chain_next ("%h.next"))) loop {
   /* True if we should try harder to vectorize this loop.  */
   bool force_vectorize;
 
+  /* True if the loop is part of an oacc kernels region.  */
+  bool in_oacc_kernels_region;
+
   /* For SIMD loops, this is a unique identifier of the loop, referenced
      by IFN_GOMP_SIMD_VF, IFN_GOMP_SIMD_LANE and IFN_GOMP_SIMD_LAST_LANE
      builtins.  */
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index d052c13..7121d73 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -12429,6 +12429,61 @@ get_oacc_ifn_dim_arg (const gimple *stmt)
   return (int) axis;
 }
 
+/* Mark the loops inside the kernels region starting at REGION_ENTRY and ending
+   at REGION_EXIT.  */
+
+static void
+mark_loops_in_oacc_kernels_region (basic_block region_entry,
+				   basic_block region_exit)
+{
+  bitmap dominated_bitmap = BITMAP_GGC_ALLOC ();
+  bitmap excludes_bitmap = BITMAP_GGC_ALLOC ();
+  unsigned di;
+  basic_block bb;
+
+  bitmap_clear (dominated_bitmap);
+  bitmap_clear (excludes_bitmap);
+
+  /* Get all the blocks dominated by the region entry.  That will include the
+     entire region.  */
+  vec<basic_block> dominated
+    = get_all_dominated_blocks (CDI_DOMINATORS, region_entry);
+  FOR_EACH_VEC_ELT (dominated, di, bb)
+      bitmap_set_bit (dominated_bitmap, bb->index);
+
+  /* Exclude all the blocks which are not in the region: the blocks dominated by
+     the region exit.  */
+  if (region_exit != NULL)
+    {
+      vec<basic_block> excludes
+	= get_all_dominated_blocks (CDI_DOMINATORS, region_exit);
+      FOR_EACH_VEC_ELT (excludes, di, bb)
+	bitmap_set_bit (excludes_bitmap, bb->index);
+    }
+
+  /* Don't parallelize the kernels region if it contains more than one outer
+     loop.  */
+  unsigned int nr_outer_loops = 0;
+  struct loop *loop;
+  FOR_EACH_LOOP (loop, 0)
+    {
+      if (loop_outer (loop) != current_loops->tree_root)
+	continue;
+
+      if (bitmap_bit_p (dominated_bitmap, loop->header->index)
+	  && !bitmap_bit_p (excludes_bitmap, loop->header->index))
+	nr_outer_loops++;
+    }
+  if (nr_outer_loops != 1)
+    return;
+
+  /* Mark the loops in the region.  */
+  FOR_EACH_LOOP (loop, 0)
+    if (bitmap_bit_p (dominated_bitmap, loop->header->index)
+	&& !bitmap_bit_p (excludes_bitmap, loop->header->index))
+      loop->in_oacc_kernels_region = true;
+}
+
 /* Expand the GIMPLE_OMP_TARGET starting at REGION.  */
 
 static void
@@ -12483,6 +12538,9 @@ expand_omp_target (struct omp_region *region)
   entry_bb = region->entry;
   exit_bb = region->exit;
 
+  if (gimple_omp_target_kind (entry_stmt) == GF_OMP_TARGET_KIND_OACC_KERNELS)
+    mark_loops_in_oacc_kernels_region (region->entry, region->exit);
+
   if (offloaded)
     {
       unsigned srcidx, dstidx, num;
-- 
1.9.1


* [PATCH, 6/16] Add pass_oacc_kernels
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
                   ` (4 preceding siblings ...)
  2015-11-09 16:31 ` [PATCH, 5/16] Add in_oacc_kernels_region in struct loop Tom de Vries
@ 2015-11-09 17:39 ` Tom de Vries
  2015-11-11 10:59   ` Richard Biener
  2016-02-05 12:06   ` Use plain -fopenacc to enable OpenACC kernels processing (was: [PATCH, 6/16] Add pass_oacc_kernels) Thomas Schwinge
  2015-11-09 18:14 ` [PATCH, 7/16] Add pass_dominator_oacc_kernels Tom de Vries
                   ` (9 subsequent siblings)
  15 siblings, 2 replies; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 17:39 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 3473 bytes --]

This patch adds a pass group pass_oacc_kernels (which will be added to
the pass list as a whole in patch 10).

Atm, the parallelization behaviour for the kernels region is controlled 
by flag_tree_parallelize_loops, which is also used to control generic 
auto-parallelization by autopar using omp. That is not ideal, and we may 
want a separate flag (or param) to control the behaviour for oacc 
kernels, f.i. -foacc-kernels-gang-parallelize=<n>. I'm open to suggestions.

The purpose of the pass group as a whole is to massage the offloaded
function into a shape that parloops can deal with, and then run
parloops on it.

Consider a testcase with a reduction, and a loop counter declared 
outside the offload region:
...
unsigned int a[n];

unsigned int
foo (void)
{
   int i;
   unsigned int sum = 1;

#pragma acc kernels copyin (a[0:n]) copy (sum)
   {
     for (i = 0; i < n; ++i)
       sum += a[i];
   }

   return sum;
}
...

After ealias, the loop body looks like this:
...
   <bb 5>:
   _8 = *.omp_data_i_3(D).a;
   _9 = *.omp_data_i_3(D).i;
   _10 = *_9;
   _11 = *_8[_10];
   _12 = *.omp_data_i_3(D).sum;
   sum.0_13 = *_12;
   sum.1_14 = _11 + sum.0_13;
   _15 = *.omp_data_i_3(D).sum;
   *_15 = sum.1_14;
   _17 = *.omp_data_i_3(D).i;
   _18 = *_17;
   _19 = *.omp_data_i_3(D).i;
   _20 = _18 + 1;
   *_19 = _20;
   goto <bb 6>;
...
In other words, the iteration variable is in memory, as is the reduction 
variable, and the body contains lots of loop invariant loads.

At the end of the pass group, just before parloops, the body has been 
rewritten to have a local iteration variable and a local reduction 
variable, and all the loop invariant loads have been moved out of the loop:
...
   <bb 4>:
   # _27 = PHI <0(2), _20(5)>
   # D__lsm.7_28 = PHI <D__lsm.7_29(2), sum.1_14(5)>
   _11 = *_8[_27];
   sum.1_14 = _11 + D__lsm.7_28;
   _20 = _27 + 1;
   if (_20 <= 9999)
     goto <bb 5>;
   else
     goto <bb 3>;
...
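
At source level, the effect of the pass group on this loop is roughly
the difference between the two functions below. This is only a sketch
under simplifying assumptions: struct omp_data is a hypothetical
stand-in for the .omp_data_i record (with a as a plain element
pointer), and the write-back of i and sum after the loop is added so
that both versions leave memory in the same state:

```c
#include <assert.h>

/* Hypothetical stand-in for the .omp_data_i record: everything the
   region uses is reached through pointers.  */
struct omp_data
{
  unsigned int *a;
  int *i;
  unsigned int *sum;
};

/* Shape after ealias: the counter and the sum live in memory, and the
   pointers are reloaded from the record on every iteration.  */
void
body_before (struct omp_data *d, int n)
{
  for (*d->i = 0; *d->i < n; ++*d->i)
    *d->sum += d->a[*d->i];
}

/* Shape just before parloops: invariant loads are hoisted, and the
   counter and the reduction variable are locals.  */
void
body_after (struct omp_data *d, int n)
{
  unsigned int *a = d->a;
  unsigned int sum = *d->sum;
  int i;

  for (i = 0; i < n; ++i)
    sum += a[i];

  /* Write the results back once, so memory ends up as in body_before.  */
  *d->sum = sum;
  *d->i = i;
}

/* Hypothetical driver: sum four elements starting from sum == 1.  */
unsigned int
run_body (int use_after)
{
  unsigned int a[] = { 1, 2, 3, 4 };
  unsigned int sum = 1;
  int i = -1;
  struct omp_data d = { a, &i, &sum };

  if (use_after)
    body_after (&d, 4);
  else
    body_before (&d, 4);

  assert (i == 4);
  return sum;
}
```

The second form is the one in which the induction variable and the
reduction are visible as such, which is what parloops needs to see.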

Thanks,
- Tom


[-- Attachment #2: 0006-Add-pass_oacc_kernels.patch --]
[-- Type: text/x-patch, Size: 2986 bytes --]

Add pass_oacc_kernels

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* tree-pass.h (make_pass_oacc_kernels): Declare.
	* tree-ssa-loop.c (gate_oacc_kernels): New static function.
	(pass_data_oacc_kernels): New pass_data.
	(class pass_oacc_kernels): New pass.
	(make_pass_oacc_kernels): New function.
---
 gcc/tree-pass.h     |  1 +
 gcc/tree-ssa-loop.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 66 insertions(+)

diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 49e22a9..4ed8da6 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -463,6 +463,7 @@ extern gimple_opt_pass *make_pass_strength_reduction (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_vtable_verify (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ubsan (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_sanopt (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_oacc_kernels (gcc::context *ctxt);
 
 /* IPA Passes */
 extern simple_ipa_opt_pass *make_pass_ipa_lower_emutls (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index 8ecd140..b51cac2 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -35,6 +35,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-inline.h"
 #include "tree-scalar-evolution.h"
 #include "tree-vectorizer.h"
+#include "omp-low.h"
 
 
 /* A pass making sure loops are fixed up.  */
@@ -141,6 +142,70 @@ make_pass_tree_loop (gcc::context *ctxt)
   return new pass_tree_loop (ctxt);
 }
 
+/* Gate for oacc kernels pass group.  */
+
+static bool
+gate_oacc_kernels (function *fn)
+{
+  if (flag_tree_parallelize_loops <= 1)
+    return false;
+
+  tree oacc_function_attr = get_oacc_fn_attrib (fn->decl);
+  if (oacc_function_attr == NULL_TREE)
+    return false;
+
+  tree val = TREE_VALUE (oacc_function_attr);
+  while (val != NULL_TREE && TREE_VALUE (val) == NULL_TREE)
+    val = TREE_CHAIN (val);
+
+  if (val != NULL_TREE)
+    return false;
+
+  struct loop *loop;
+  FOR_EACH_LOOP (loop, 0)
+    if (loop->in_oacc_kernels_region)
+      return true;
+
+  return false;
+}
+
+/* The oacc kernels superpass.  */
+
+namespace {
+
+const pass_data pass_data_oacc_kernels =
+{
+  GIMPLE_PASS, /* type */
+  "oacc_kernels", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_TREE_LOOP, /* tv_id */
+  PROP_cfg, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_oacc_kernels : public gimple_opt_pass
+{
+public:
+  pass_oacc_kernels (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_oacc_kernels, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *fn) { return gate_oacc_kernels (fn); }
+
+}; // class pass_oacc_kernels
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_oacc_kernels (gcc::context *ctxt)
+{
+  return new pass_oacc_kernels (ctxt);
+}
+
 /* The no-loop superpass.  */
 
 namespace {
-- 
1.9.1


* [PATCH, 7/16] Add pass_dominator_oacc_kernels
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
                   ` (5 preceding siblings ...)
  2015-11-09 17:39 ` [PATCH, 6/16] Add pass_oacc_kernels Tom de Vries
@ 2015-11-09 18:14 ` Tom de Vries
  2015-11-11 11:05   ` Richard Biener
  2015-11-09 18:34 ` [PATCH, 8/16] Add pass_ch_oacc_kernels Tom de Vries
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 18:14 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 2069 bytes --]

This patch adds pass_dominator_oacc_kernels, to be used in the kernels
pass group. (We may as well call it pass_dominator_no_peel_loop_headers,
since it doesn't do anything oacc-kernels-specific.)

The reason I'm adding a new pass instead of using pass_dominator is that 
pass_dominator uses first_pass_instance. So adding a pass_dominator 
instance A before a pass_dominator instance B has the unexpected 
consequence that it may change the behaviour of instance B. I've filed 
PR68247 - "Remove pass_first_instance" to note this issue.
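
The structure of the fix can be sketched in plain C with a function
pointer standing in for the virtual may_peel_loop_headers_p hook. This
is an illustration, not the patch itself: struct dom_pass, dom_execute
and the run_* helpers are hypothetical, and the static variable only
models GCC's first_pass_instance global. The shared execute code asks
the hook instead of reading the global directly, so the new pass can
answer "never" regardless of which instance runs first:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for GCC's first_pass_instance global.  */
static bool first_pass_instance;

/* Hypothetical C rendering of dominator_base: shared execute code plus
   one overridable hook.  */
struct dom_pass
{
  const char *name;
  bool (*may_peel_loop_headers_p) (void);
};

/* Hook for the regular pass: depends on the instance number.  */
static bool
peel_if_first_instance (void)
{
  return first_pass_instance;
}

/* Hook for the kernels variant: independent of the instance number.  */
static bool
never_peel (void)
{
  return false;
}

static const struct dom_pass pass_dominator
  = { "dom", peel_if_first_instance };
static const struct dom_pass pass_dominator_oacc_kernels
  = { "dom_oacc_kernels", never_peel };

/* Shared body: return the value that would be passed on to
   thread_through_all_blocks.  */
static bool
dom_execute (const struct dom_pass *pass)
{
  assert (pass != NULL && pass->may_peel_loop_headers_p != NULL);
  return pass->may_peel_loop_headers_p ();
}

bool
run_dom (bool first)
{
  first_pass_instance = first;
  return dom_execute (&pass_dominator);
}

bool
run_dom_oacc_kernels (bool first)
{
  first_pass_instance = first;
  return dom_execute (&pass_dominator_oacc_kernels);
}
```

Only run_dom changes its answer with first_pass_instance, which is the
dependency that a separate pass avoids for the kernels pass group.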

Thanks,
- Tom


[-- Attachment #2: 0007-Add-pass_dominator_oacc_kernels.patch --]
[-- Type: text/x-patch, Size: 4482 bytes --]

Add pass_dominator_oacc_kernels

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* tree-pass.h (make_pass_dominator_oacc_kernels): Declare.
	* tree-ssa-dom.c (class dominator_base): New class.  Factor out of ...
	(class pass_dominator): ... here.
	(dominator_base::may_peel_loop_headers_p)
        (pass_dominator::may_peel_loop_headers_p): New function.
	(pass_dominator_oacc_kernels): New pass.
	(make_pass_dominator_oacc_kernels): New function.
	(dominator_base::execute): Use may_peel_loop_headers_p.
---
 gcc/tree-pass.h    |  1 +
 gcc/tree-ssa-dom.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 53 insertions(+), 5 deletions(-)

diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 4ed8da6..2825aea 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -395,6 +395,7 @@ extern gimple_opt_pass *make_pass_build_ssa (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_build_alias (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_build_ealias (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_dominator (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_dominator_oacc_kernels (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_dce (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_cd_dce (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_call_cdce (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-dom.c b/gcc/tree-ssa-dom.c
index 3887bbe1..e4ff63a 100644
--- a/gcc/tree-ssa-dom.c
+++ b/gcc/tree-ssa-dom.c
@@ -519,6 +519,19 @@ private:
 
 namespace {
 
+class dominator_base : public gimple_opt_pass
+{
+ protected:
+  dominator_base (pass_data data, gcc::context *ctxt)
+    : gimple_opt_pass (data, ctxt)
+  {}
+
+  unsigned int execute (function *);
+
+ protected:
+  virtual bool may_peel_loop_headers_p (void) { return true; }
+}; // class dominator_base
+
 const pass_data pass_data_dominator =
 {
   GIMPLE_PASS, /* type */
@@ -532,22 +545,23 @@ const pass_data pass_data_dominator =
   ( TODO_cleanup_cfg | TODO_update_ssa ), /* todo_flags_finish */
 };
 
-class pass_dominator : public gimple_opt_pass
+class pass_dominator : public dominator_base
 {
 public:
   pass_dominator (gcc::context *ctxt)
-    : gimple_opt_pass (pass_data_dominator, ctxt)
+    : dominator_base (pass_data_dominator, ctxt)
   {}
 
   /* opt_pass methods: */
   opt_pass * clone () { return new pass_dominator (m_ctxt); }
   virtual bool gate (function *) { return flag_tree_dom != 0; }
-  virtual unsigned int execute (function *);
 
+ protected:
+  virtual bool may_peel_loop_headers_p (void) { return first_pass_instance; }
 }; // class pass_dominator
 
 unsigned int
-pass_dominator::execute (function *fun)
+dominator_base::execute (function *fun)
 {
   memset (&opt_stats, 0, sizeof (opt_stats));
 
@@ -619,7 +633,7 @@ pass_dominator::execute (function *fun)
   free_all_edge_infos ();
 
   /* Thread jumps, creating duplicate blocks as needed.  */
-  cfg_altered |= thread_through_all_blocks (first_pass_instance);
+  cfg_altered |= thread_through_all_blocks (may_peel_loop_headers_p ());
 
   if (cfg_altered)
     free_dominance_info (CDI_DOMINATORS);
@@ -700,6 +714,34 @@ pass_dominator::execute (function *fun)
   return 0;
 }
 
+const pass_data pass_data_dominator_oacc_kernels =
+{
+  GIMPLE_PASS, /* type */
+  "dom_oacc_kernels", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_TREE_SSA_DOMINATOR_OPTS, /* tv_id */
+  ( PROP_cfg | PROP_ssa ), /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  ( TODO_cleanup_cfg | TODO_update_ssa ), /* todo_flags_finish */
+};
+
+class pass_dominator_oacc_kernels : public dominator_base
+{
+public:
+  pass_dominator_oacc_kernels (gcc::context *ctxt)
+    : dominator_base (pass_data_dominator_oacc_kernels, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  opt_pass * clone () { return new pass_dominator_oacc_kernels (m_ctxt); }
+  virtual bool gate (function *) { return true; }
+
+ protected:
+  virtual bool may_peel_loop_headers_p (void) { return false; }
+}; // class pass_dominator_oacc_kernels
+
 } // anon namespace
 
 gimple_opt_pass *
@@ -708,6 +750,11 @@ make_pass_dominator (gcc::context *ctxt)
   return new pass_dominator (ctxt);
 }
 
+gimple_opt_pass *
+make_pass_dominator_oacc_kernels (gcc::context *ctxt)
+{
+  return new pass_dominator_oacc_kernels (ctxt);
+}
 
 /* Given a conditional statement CONDSTMT, convert the
    condition to a canonical form.  */
-- 
1.9.1


* [PATCH, 8/16] Add pass_ch_oacc_kernels
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
                   ` (6 preceding siblings ...)
  2015-11-09 18:14 ` [PATCH, 7/16] Add pass_dominator_oacc_kernels Tom de Vries
@ 2015-11-09 18:34 ` Tom de Vries
  2015-11-11 20:29   ` Tom de Vries
  2015-11-09 19:53 ` [PATCH, 9/16] Add pass_parallelize_loops_oacc_kernels Tom de Vries
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 18:34 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 2076 bytes --]

This patch adds a pass pass_ch_oacc_kernels, which is like pass_ch, but
only runs for loops with in_oacc_kernels_region set.

[ But... thinking about it a bit more, I think that we could use a
regular pass_ch instead. We only use the kernels pass group for a single
loop nest in a kernels region, and we mark all the loops in the loop
nest with in_oacc_kernels_region. So I think that the
in_oacc_kernels_region test in pass_ch_oacc_kernels::process_loop_p
always evaluates to true. ]

So, I'll try to confirm with retesting that we can drop this patch.

Thanks,
- Tom


[-- Attachment #2: 0008-Add-pass_ch_oacc_kernels.patch --]
[-- Type: text/x-patch, Size: 3397 bytes --]

Add pass_ch_oacc_kernels

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* tree-pass.h (make_pass_ch_oacc_kernels): Declare.
	* tree-ssa-loop-ch.c (pass_ch::pass_ch (pass_data, gcc::context)): New
	constructor.
	(pass_data_ch_oacc_kernels): New pass_data.
	(class pass_ch_oacc_kernels): New pass.
	(pass_ch_oacc_kernels::process_loop_p): New function.
	(make_pass_ch_oacc_kernels): New function.
---
 gcc/tree-pass.h        |  1 +
 gcc/tree-ssa-loop-ch.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 2825aea..f95a820 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -389,6 +389,7 @@ extern gimple_opt_pass *make_pass_iv_optimize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_loop_done (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ch_vect (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_ch_oacc_kernels (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ccp (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_phi_only_cprop (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_build_ssa (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-loop-ch.c b/gcc/tree-ssa-loop-ch.c
index 7e618bf..8bf47fe 100644
--- a/gcc/tree-ssa-loop-ch.c
+++ b/gcc/tree-ssa-loop-ch.c
@@ -33,6 +33,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-inline.h"
 #include "tree-ssa-scopedtables.h"
 #include "tree-ssa-threadedge.h"
+#include "omp-low.h"
 
 /* Duplicates headers of loops if they are small enough, so that the statements
    in the loop body are always executed when the loop is entered.  This
@@ -124,7 +125,7 @@ do_while_loop_p (struct loop *loop)
 
 namespace {
 
-/* Common superclass for both header-copying phases.  */
+/* Common superclass for header-copying phases.  */
 class ch_base : public gimple_opt_pass
 {
   protected:
@@ -159,6 +160,10 @@ public:
     : ch_base (pass_data_ch, ctxt)
   {}
 
+  pass_ch (pass_data data, gcc::context *ctxt)
+    : ch_base (data, ctxt)
+  {}
+
   /* opt_pass methods: */
   virtual bool gate (function *) { return flag_tree_ch != 0; }
   
@@ -414,3 +419,50 @@ make_pass_ch (gcc::context *ctxt)
 {
   return new pass_ch (ctxt);
 }
+
+namespace {
+
+const pass_data pass_data_ch_oacc_kernels =
+{
+  GIMPLE_PASS, /* type */
+  "ch_oacc_kernels", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_TREE_CH, /* tv_id */
+  ( PROP_cfg | PROP_ssa ), /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_cleanup_cfg, /* todo_flags_finish */
+};
+
+class pass_ch_oacc_kernels : public pass_ch
+{
+public:
+  pass_ch_oacc_kernels (gcc::context *ctxt)
+    : pass_ch (pass_data_ch_oacc_kernels, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *) { return true; }
+
+protected:
+  /* ch_base method: */
+  virtual bool process_loop_p (struct loop *loop);
+}; // class pass_ch_oacc_kernels
+
+} // anon namespace
+
+bool
+pass_ch_oacc_kernels::process_loop_p (struct loop *loop)
+{
+  if (!loop->in_oacc_kernels_region)
+    return false;
+
+  return pass_ch::process_loop_p (loop);
+}
+
+gimple_opt_pass *
+make_pass_ch_oacc_kernels (gcc::context *ctxt)
+{
+  return new pass_ch_oacc_kernels (ctxt);
+}
-- 
1.9.1


^ permalink raw reply	[flat|nested] 133+ messages in thread

* [PATCH, 9/16] Add pass_parallelize_loops_oacc_kernels
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
                   ` (7 preceding siblings ...)
  2015-11-09 18:34 ` [PATCH, 8/16] Add pass_ch_oacc_kernels Tom de Vries
@ 2015-11-09 19:53 ` Tom de Vries
  2015-11-16 11:59   ` Tom de Vries
  2015-11-09 19:59 ` [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def Tom de Vries
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 19:53 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 3122 bytes --]

On 09/11/15 16:35, Tom de Vries wrote:
> Hi,
>
> this patch series for stage1 trunk adds support to:
> - parallelize oacc kernels regions using parloops, and
> - map the loops onto the oacc gang dimension.
>
> The patch series contains these patches:
>
>       1    Insert new exit block only when needed in
>          transform_to_exit_first_loop_alt
>       2    Make create_parallel_loop return void
>       3    Ignore reduction clause on kernels directive
>       4    Implement -foffload-alias
>       5    Add in_oacc_kernels_region in struct loop
>       6    Add pass_oacc_kernels
>       7    Add pass_dominator_oacc_kernels
>       8    Add pass_ch_oacc_kernels
>       9    Add pass_parallelize_loops_oacc_kernels
>      10    Add pass_oacc_kernels pass group in passes.def
>      11    Update testcases after adding kernels pass group
>      12    Handle acc loop directive
>      13    Add c-c++-common/goacc/kernels-*.c
>      14    Add gfortran.dg/goacc/kernels-*.f95
>      15    Add libgomp.oacc-c-c++-common/kernels-*.c
>      16    Add libgomp.oacc-fortran/kernels-*.f95
>
> The first 9 patches are more or less independent, but patches 10-16 are
> intended to be committed at the same time.
>
> Bootstrapped and reg-tested on x86_64.
>
> Build and reg-tested with nvidia accelerator, in combination with a
> patch that enables accelerator testing (which is submitted at
> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>
> I'll post the individual patches in reply to this message.

This patch adds pass_parallelize_loops_oacc_kernels.

There are a number of things we do differently in parloops for oacc kernels:
- in normal parloops, we generate code to choose between a parallel
   version of the loop, and a sequential (low iteration count) version.
   Since the code in an oacc kernels region is supposed to run on the
   accelerator anyway, we skip this check, and don't add a low iteration
   count loop.
- in normal parloops, we generate a #pragma omp parallel /
   GIMPLE_OMP_RETURN pair to delimit the region which we will split off
   into a thread function. Since the oacc kernels region is already
   split off, we don't add this pair.
- we indicate the parallelization factor by setting the oacc function
   attributes.
- we generate a #pragma acc loop instead of a #pragma omp for, and
   we add the gang clause.
- in normal parloops, we rewrite the variable accesses in the loop
   into accesses relative to a thread function parameter. For the
   oacc kernels region, that rewrite has already been done at omp-lower,
   so we skip this.
- we need to ensure that the entire kernels region can be run in
   parallel. The loop independence check is already present, so for oacc
   kernels we add a check between blocks outside the loop and the entire
   region.
- we guard stores in the blocks outside the loop with gang_pos == 0.
   There's no need for every gang to write to the same location; we can
   do this in just one gang. (Typically this is the write of the final
   value of the iteration variable, if that one is copied back to the
   host.)
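
The effect of the gang_pos == 0 guard can be sketched with a small
host-side model. This is not the generated code: region_result and
run_region are hypothetical names, the sequential loop over gang_pos
merely stands in for gangs running concurrently, and the cyclic
distribution of iterations over gangs is an assumption made for the
illustration.

```cpp
#include <cassert>
#include <vector>

// Outcome of running the modelled kernels region on all gangs.
struct region_result
{
  std::vector<int> a;     // array written inside the loop
  int final_i;            // final iteration variable value copied back
  int stores_to_final_i;  // number of gangs that performed that store
};

// Models a region of the form
//   for (i = 0; i < n; i++) a[i] = i;
// followed by a store of the final value of i, executed by NUM_GANGS gangs.
// N is expected to be non-negative.
region_result run_region (int n, int num_gangs, bool guard_store)
{
  region_result r{std::vector<int> (n, -1), -1, 0};

  for (int gang_pos = 0; gang_pos < num_gangs; gang_pos++)
    {
      // Loop body: iterations split over the gangs (assumed cyclic).
      for (int i = gang_pos; i < n; i += num_gangs)
	r.a[i] = i;

      // Store outside the loop: with the guard, only gang 0 executes it.
      if (!guard_store || gang_pos == 0)
	{
	  r.final_i = n;
	  r.stores_to_final_i++;
	}
    }

  return r;
}

// Small self-check: guarded and unguarded runs produce the same value,
// but the guarded run stores it only once.
void check_examples ()
{
  assert (run_region (8, 4, true).final_i == run_region (8, 4, false).final_i);
  assert (run_region (8, 4, true).stores_to_final_i == 1);
}
```

With the guard, the value copied back is unchanged while the store count
drops from one per gang to one in total.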

Thanks,
- Tom


[-- Attachment #2: 0009-Add-pass_parallelize_loops_oacc_kernels.patch --]
[-- Type: text/x-patch, Size: 30668 bytes --]

Add pass_parallelize_loops_oacc_kernels

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* omp-low.c (set_oacc_fn_attrib): Make extern.
	(expand_omp_atomic_fetch_op): Release defs of update stmt.
	* omp-low.h (set_oacc_fn_attrib): Declare.
	* tree-parloops.c (struct reduction_info): Add reduc_addr field.
	(create_call_for_reduction_1): Handle case that reduc_addr is non-NULL.
	(create_parallel_loop, gen_parallel_loop, try_create_reduction_list):
	Add and handle function parameter oacc_kernels_p.
	(get_omp_data_i_param): New function.
	(ref_conflicts_with_region, oacc_entry_exit_ok_1)
	(oacc_entry_exit_single_gang, oacc_entry_exit_ok): New functions.
	(parallelize_loops): Add and handle function parameter oacc_kernels_p.
	Calculate dominance info.  Skip loops that are not in a kernels region
	in oacc_kernels_p mode.  Skip inner loops of parallelized loops.
	(pass_parallelize_loops::execute): Call parallelize_loops with false
	argument.
	(pass_data_parallelize_loops_oacc_kernels): New pass_data.
	(class pass_parallelize_loops_oacc_kernels): New pass.
	(pass_parallelize_loops_oacc_kernels::execute)
	(make_pass_parallelize_loops_oacc_kernels): New functions.
	* tree-pass.h (make_pass_parallelize_loops_oacc_kernels): Declare.
---
 gcc/omp-low.c       |   8 +-
 gcc/omp-low.h       |   1 +
 gcc/tree-parloops.c | 689 +++++++++++++++++++++++++++++++++++++++++++++++-----
 gcc/tree-pass.h     |   2 +
 4 files changed, 636 insertions(+), 64 deletions(-)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 39c12c1..13fa456 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -11967,10 +11967,14 @@ expand_omp_atomic_fetch_op (basic_block load_bb,
   gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_ATOMIC_STORE);
   gsi_remove (&gsi, true);
   gsi = gsi_last_bb (store_bb);
+  stmt = gsi_stmt (gsi);
   gsi_remove (&gsi, true);
 
   if (gimple_in_ssa_p (cfun))
-    update_ssa (TODO_update_ssa_no_phi);
+    {
+      release_defs (stmt);
+      update_ssa (TODO_update_ssa_no_phi);
+    }
 
   return true;
 }
@@ -12344,7 +12348,7 @@ replace_oacc_fn_attrib (tree fn, tree dims)
    function attribute.  Push any that are non-constant onto the ARGS
    list, along with an appropriate GOMP_LAUNCH_DIM tag.  */
 
-static void
+void
 set_oacc_fn_attrib (tree fn, tree clauses, vec<tree> *args)
 {
   /* Must match GOMP_DIM ordering.  */
diff --git a/gcc/omp-low.h b/gcc/omp-low.h
index ee0f8ac..fa5396d 100644
--- a/gcc/omp-low.h
+++ b/gcc/omp-low.h
@@ -31,6 +31,7 @@ extern bool make_gimple_omp_edges (basic_block, struct omp_region **, int *);
 extern void omp_finish_file (void);
 extern tree omp_member_access_dummy_var (tree);
 extern tree get_oacc_fn_attrib (tree);
+extern void set_oacc_fn_attrib (tree, tree, vec<tree> *);
 extern int get_oacc_ifn_dim_arg (const gimple *);
 extern int get_oacc_fn_dim_size (tree, int);
 
diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
index 17415a8..0222016 100644
--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c
@@ -53,6 +53,10 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa.h"
 #include "params.h"
 #include "params-enum.h"
+#include "tree-ssa-alias.h"
+#include "tree-eh.h"
+#include "gomp-constants.h"
+#include "tree-dfa.h"
 
 /* This pass tries to distribute iterations of loops into several threads.
    The implementation is straightforward -- for each loop we test whether its
@@ -192,6 +196,8 @@ struct reduction_info
 				   of the reduction variable when existing the loop. */
   tree initial_value;		/* The initial value of the reduction var before entering the loop.  */
   tree field;			/*  the name of the field in the parloop data structure intended for reduction.  */
+  tree reduc_addr;		/* The address of the reduction variable for
+				   openacc reductions.  */
   tree init;			/* reduction initialization value.  */
   gphi *new_phi;		/* (helper field) Newly created phi node whose result
 				   will be passed to the atomic operation.  Represents
@@ -1085,10 +1091,29 @@ create_call_for_reduction_1 (reduction_info **slot, struct clsn_data *clsn_data)
   tree tmp_load, name;
   gimple *load;
 
-  load_struct = build_simple_mem_ref (clsn_data->load);
-  t = build3 (COMPONENT_REF, type, load_struct, reduc->field, NULL_TREE);
+  if (reduc->reduc_addr == NULL_TREE)
+    {
+      load_struct = build_simple_mem_ref (clsn_data->load);
+      t = build3 (COMPONENT_REF, type, load_struct, reduc->field, NULL_TREE);
+
+      addr = build_addr (t);
+    }
+  else
+    {
+      /* Set the address for the atomic store.  */
+      addr = reduc->reduc_addr;
 
-  addr = build_addr (t);
+      /* Remove the non-atomic store '*addr = sum'.  */
+      tree res = PHI_RESULT (reduc->keep_res);
+      use_operand_p use_p;
+      gimple *stmt;
+      bool single_use_p = single_imm_use (res, &use_p, &stmt);
+      gcc_assert (single_use_p);
+      replace_uses_by (gimple_vdef (stmt),
+		       gimple_vuse (stmt));
+      gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+      gsi_remove (&gsi, true);
+    }
 
   /* Create phi node.  */
   bb = clsn_data->load_bb;
@@ -1990,7 +2015,8 @@ transform_to_exit_first_loop (struct loop *loop,
 
 static void
 create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
-		      tree new_data, unsigned n_threads, location_t loc)
+		      tree new_data, unsigned n_threads, location_t loc,
+		      bool oacc_kernels_p)
 {
   gimple_stmt_iterator gsi;
   basic_block bb, paral_bb, for_bb, ex_bb, continue_bb;
@@ -2003,19 +2029,33 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
   gomp_continue *omp_cont_stmt;
   tree cvar, cvar_init, initvar, cvar_next, cvar_base, type;
   edge exit, nexit, guard, end, e;
+  tree for_clauses = NULL_TREE;
 
   /* Prepare the GIMPLE_OMP_PARALLEL statement.  */
   bb = loop_preheader_edge (loop)->src;
-  paral_bb = single_pred (bb);
-  gsi = gsi_last_bb (paral_bb);
+  if (!oacc_kernels_p)
+    {
+      paral_bb = single_pred (bb);
+      gsi = gsi_last_bb (paral_bb);
+    }
 
-  t = build_omp_clause (loc, OMP_CLAUSE_NUM_THREADS);
-  OMP_CLAUSE_NUM_THREADS_EXPR (t)
-    = build_int_cst (integer_type_node, n_threads);
-  omp_par_stmt = gimple_build_omp_parallel (NULL, t, loop_fn, data);
-  gimple_set_location (omp_par_stmt, loc);
+  if (!oacc_kernels_p)
+    {
+      t = build_omp_clause (loc, OMP_CLAUSE_NUM_THREADS);
+      OMP_CLAUSE_NUM_THREADS_EXPR (t)
+	= build_int_cst (integer_type_node, n_threads);
+      omp_par_stmt = gimple_build_omp_parallel (NULL, t, loop_fn, data);
+      gimple_set_location (omp_par_stmt, loc);
 
-  gsi_insert_after (&gsi, omp_par_stmt, GSI_NEW_STMT);
+      gsi_insert_after (&gsi, omp_par_stmt, GSI_NEW_STMT);
+    }
+  else
+    {
+      tree clause = build_omp_clause (loc, OMP_CLAUSE_NUM_GANGS);
+      OMP_CLAUSE_NUM_GANGS_EXPR (clause)
+	= build_int_cst (integer_type_node, n_threads);
+      set_oacc_fn_attrib (cfun->decl, clause, NULL);
+    }
 
   /* Initialize NEW_DATA.  */
   if (data)
@@ -2033,12 +2073,18 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
       gsi_insert_before (&gsi, assign_stmt, GSI_SAME_STMT);
     }
 
-  /* Emit GIMPLE_OMP_RETURN for GIMPLE_OMP_PARALLEL.  */
-  bb = split_loop_exit_edge (single_dom_exit (loop));
-  gsi = gsi_last_bb (bb);
-  omp_return_stmt1 = gimple_build_omp_return (false);
-  gimple_set_location (omp_return_stmt1, loc);
-  gsi_insert_after (&gsi, omp_return_stmt1, GSI_NEW_STMT);
+  /* Skip insertion of OMP_RETURN for oacc_kernels_p.  We've already generated
+     one when lowering the oacc kernels directive in
+     pass_lower_omp/lower_omp (). */
+  if (!oacc_kernels_p)
+    {
+      /* Emit GIMPLE_OMP_RETURN for GIMPLE_OMP_PARALLEL.  */
+      bb = split_loop_exit_edge (single_dom_exit (loop));
+      gsi = gsi_last_bb (bb);
+      omp_return_stmt1 = gimple_build_omp_return (false);
+      gimple_set_location (omp_return_stmt1, loc);
+      gsi_insert_after (&gsi, omp_return_stmt1, GSI_NEW_STMT);
+    }
 
   /* Extract data for GIMPLE_OMP_FOR.  */
   gcc_assert (loop->header == single_dom_exit (loop)->src);
@@ -2130,7 +2176,17 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
     OMP_CLAUSE_SCHEDULE_CHUNK_EXPR (t)
       = build_int_cst (integer_type_node, chunk_size);
 
-  for_stmt = gimple_build_omp_for (NULL, GF_OMP_FOR_KIND_FOR, t, 1, NULL);
+  if (1)
+    {
+      /* In combination with the NUM_GANGS on the parallel.  */
+      for_clauses = build_omp_clause (loc, OMP_CLAUSE_GANG);
+    }
+
+  for_stmt = gimple_build_omp_for (NULL,
+				   (oacc_kernels_p
+				    ? GF_OMP_FOR_KIND_OACC_LOOP
+				    : GF_OMP_FOR_KIND_FOR),
+				   for_clauses, 1, NULL);
   gimple_set_location (for_stmt, loc);
   gimple_omp_for_set_index (for_stmt, 0, initvar);
   gimple_omp_for_set_initial (for_stmt, 0, cvar_init);
@@ -2172,7 +2228,8 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
 static void
 gen_parallel_loop (struct loop *loop,
 		   reduction_info_table_type *reduction_list,
-		   unsigned n_threads, struct tree_niter_desc *niter)
+		   unsigned n_threads, struct tree_niter_desc *niter,
+		   bool oacc_kernels_p)
 {
   tree many_iterations_cond, type, nit;
   tree arg_struct, new_arg_struct;
@@ -2253,40 +2310,44 @@ gen_parallel_loop (struct loop *loop,
   if (stmts)
     gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
 
-  if (loop->inner)
-    m_p_thread=2;
-  else
-    m_p_thread=MIN_PER_THREAD;
-
-   many_iterations_cond =
-     fold_build2 (GE_EXPR, boolean_type_node,
-                nit, build_int_cst (type, m_p_thread * n_threads));
-
-  many_iterations_cond
-    = fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
-		   invert_truthvalue (unshare_expr (niter->may_be_zero)),
-		   many_iterations_cond);
-  many_iterations_cond
-    = force_gimple_operand (many_iterations_cond, &stmts, false, NULL_TREE);
-  if (stmts)
-    gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
-  if (!is_gimple_condexpr (many_iterations_cond))
+  if (!oacc_kernels_p)
     {
+      if (loop->inner)
+	m_p_thread=2;
+      else
+	m_p_thread=MIN_PER_THREAD;
+
+      many_iterations_cond =
+	fold_build2 (GE_EXPR, boolean_type_node,
+		     nit, build_int_cst (type, m_p_thread * n_threads));
+
+      many_iterations_cond
+	= fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
+		       invert_truthvalue (unshare_expr (niter->may_be_zero)),
+		       many_iterations_cond);
       many_iterations_cond
-	= force_gimple_operand (many_iterations_cond, &stmts,
-				true, NULL_TREE);
+	= force_gimple_operand (many_iterations_cond, &stmts, false, NULL_TREE);
       if (stmts)
 	gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
-    }
+      if (!is_gimple_condexpr (many_iterations_cond))
+	{
+	  many_iterations_cond
+	    = force_gimple_operand (many_iterations_cond, &stmts,
+				    true, NULL_TREE);
+	  if (stmts)
+	    gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop),
+					      stmts);
+	}
 
-  initialize_original_copy_tables ();
+      initialize_original_copy_tables ();
 
-  /* We assume that the loop usually iterates a lot.  */
-  prob = 4 * REG_BR_PROB_BASE / 5;
-  loop_version (loop, many_iterations_cond, NULL,
-		prob, prob, REG_BR_PROB_BASE - prob, true);
-  update_ssa (TODO_update_ssa);
-  free_original_copy_tables ();
+      /* We assume that the loop usually iterates a lot.  */
+      prob = 4 * REG_BR_PROB_BASE / 5;
+      loop_version (loop, many_iterations_cond, NULL,
+		    prob, prob, REG_BR_PROB_BASE - prob, true);
+      update_ssa (TODO_update_ssa);
+      free_original_copy_tables ();
+    }
 
   /* Base all the induction variables in LOOP on a single control one.  */
   canonicalize_loop_ivs (loop, &nit, true);
@@ -2306,6 +2367,9 @@ gen_parallel_loop (struct loop *loop,
     }
   else
     {
+      if (oacc_kernels_p)
+	n_threads = 1;
+
       /* Fall back on the method that handles more cases, but duplicates the
 	 loop body: move the exit condition of LOOP to the beginning of its
 	 header, and duplicate the part of the last iteration that gets disabled
@@ -2322,19 +2386,34 @@ gen_parallel_loop (struct loop *loop,
   entry = loop_preheader_edge (loop);
   exit = single_dom_exit (loop);
 
-  eliminate_local_variables (entry, exit);
-  /* In the old loop, move all variables non-local to the loop to a structure
-     and back, and create separate decls for the variables used in loop.  */
-  separate_decls_in_region (entry, exit, reduction_list, &arg_struct,
-			    &new_arg_struct, &clsn_data);
+  /* This rewrites the body in terms of new variables.  This has already
+     been done for oacc_kernels_p in pass_lower_omp/lower_omp ().  */
+  if (!oacc_kernels_p)
+    {
+      eliminate_local_variables (entry, exit);
+      /* In the old loop, move all variables non-local to the loop to a
+	 structure and back, and create separate decls for the variables used in
+	 loop.  */
+      separate_decls_in_region (entry, exit, reduction_list, &arg_struct,
+				&new_arg_struct, &clsn_data);
+    }
+  else
+    {
+      arg_struct = NULL_TREE;
+      new_arg_struct = NULL_TREE;
+      clsn_data.load = NULL_TREE;
+      clsn_data.load_bb = exit->dest;
+      clsn_data.store = NULL_TREE;
+      clsn_data.store_bb = NULL;
+    }
 
   /* Create the parallel constructs.  */
   loc = UNKNOWN_LOCATION;
   cond_stmt = last_stmt (loop->header);
   if (cond_stmt)
     loc = gimple_location (cond_stmt);
-  create_parallel_loop (loop, create_loop_fn (loc), arg_struct,
-			new_arg_struct, n_threads, loc);
+  create_parallel_loop (loop, create_loop_fn (loc), arg_struct, new_arg_struct,
+			n_threads, loc, oacc_kernels_p);
   if (reduction_list->elements () > 0)
     create_call_for_reduction (loop, reduction_list, &clsn_data);
 
@@ -2527,12 +2606,21 @@ try_get_loop_niter (loop_p loop, struct tree_niter_desc *niter)
   return true;
 }
 
+static tree
+get_omp_data_i_param (void)
+{
+  tree decl = DECL_ARGUMENTS (cfun->decl);
+  gcc_assert (DECL_CHAIN (decl) == NULL_TREE);
+  return ssa_default_def (cfun, decl);
+}
+
 /* Try to initialize REDUCTION_LIST for code generation part.
    REDUCTION_LIST describes the reductions.  */
 
 static bool
 try_create_reduction_list (loop_p loop,
-			   reduction_info_table_type *reduction_list)
+			   reduction_info_table_type *reduction_list,
+			   bool oacc_kernels_p)
 {
   edge exit = single_dom_exit (loop);
   gphi_iterator gsi;
@@ -2588,6 +2676,7 @@ try_create_reduction_list (loop_p loop,
 			 "  FAILED: it is not a part of reduction.\n");
 	      return false;
 	    }
+	  red->keep_res = phi;
 	  if (dump_file && (dump_flags & TDF_DETAILS))
 	    {
 	      fprintf (dump_file, "reduction phi is  ");
@@ -2622,15 +2711,402 @@ try_create_reduction_list (loop_p loop,
     }
 
 
+  if (oacc_kernels_p)
+    {
+      edge e = loop_preheader_edge (loop);
+
+      for (gsi = gsi_start_phis (loop->header); !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gphi *phi = gsi.phi ();
+	  tree def = PHI_RESULT (phi);
+	  affine_iv iv;
+
+	  if (!virtual_operand_p (def)
+	      && !simple_iv (loop, loop, def, &iv, true))
+	    {
+	      struct reduction_info *red;
+	      red = reduction_phi (reduction_list, phi);
+
+	      /* Look for pattern:
+
+		 <bb preheader>
+		   .omp_data_i = &.omp_data_arr;
+		   addr = .omp_data_i->sum;
+		   sum_a = *addr;
+
+		 <bb header>:
+		   sum_b = PHI <sum_a (preheader), sum_c (latch)>
+
+		 and assign addr to reduc->reduc_addr.  */
+
+	      tree arg = PHI_ARG_DEF_FROM_EDGE (phi, e);
+	      gimple *stmt = SSA_NAME_DEF_STMT (arg);
+	      if (!gimple_assign_single_p (stmt))
+		return false;
+	      tree memref = gimple_assign_rhs1 (stmt);
+	      if (TREE_CODE (memref) != MEM_REF)
+		return false;
+	      tree addr = TREE_OPERAND (memref, 0);
+
+	      gimple *stmt2 = SSA_NAME_DEF_STMT (addr);
+	      if (!gimple_assign_single_p (stmt2))
+		return false;
+	      tree compref = gimple_assign_rhs1 (stmt2);
+	      if (TREE_CODE (compref) != COMPONENT_REF)
+		return false;
+	      tree addr2 = TREE_OPERAND (compref, 0);
+	      if (TREE_CODE (addr2) != MEM_REF)
+		return false;
+	      addr2 = TREE_OPERAND (addr2, 0);
+	      if (TREE_CODE (addr2) != SSA_NAME
+		  || addr2 != get_omp_data_i_param ())
+		return false;
+	      red->reduc_addr = addr;
+	    }
+	}
+    }
+
+  return true;
+}
+
+static bool
+ref_conflicts_with_region (gimple_stmt_iterator gsi, ao_ref *ref,
+			   bool ref_is_store, vec<basic_block> region_bbs,
+			   unsigned int i, gimple *skip_stmt)
+{
+  basic_block bb = region_bbs[i];
+  gsi_next (&gsi);
+
+  while (true)
+    {
+      for (; !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  if (stmt == skip_stmt)
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "skipping reduction store: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+	      continue;
+	    }
+
+	  if (!gimple_vdef (stmt)
+	      && !gimple_vuse (stmt))
+	    continue;
+
+	  if (gimple_code (stmt) == GIMPLE_RETURN)
+	    continue;
+
+	  if (ref_is_store)
+	    {
+	      if (ref_maybe_used_by_stmt_p (stmt, ref))
+		{
+		  if (dump_file)
+		    {
+		      fprintf (dump_file, "Stmt ");
+		      print_gimple_stmt (dump_file, stmt, 0, 0);
+		    }
+		  return true;
+		}
+	    }
+	  else
+	    {
+	      if (stmt_may_clobber_ref_p_1 (stmt, ref))
+		{
+		  if (dump_file)
+		    {
+		      fprintf (dump_file, "Stmt ");
+		      print_gimple_stmt (dump_file, stmt, 0, 0);
+		    }
+		  return true;
+		}
+	    }
+	}
+      i++;
+      if (i == region_bbs.length ())
+	break;
+      bb = region_bbs[i];
+      gsi = gsi_start_bb (bb);
+    }
+
+  return false;
+}
+
+static bool
+oacc_entry_exit_ok_1 (bitmap in_loop_bbs, vec<basic_block> region_bbs,
+		      tree omp_data_i,
+		      reduction_info_table_type *reduction_list,
+		      bitmap reduction_stores)
+{
+  unsigned i;
+  basic_block bb;
+  FOR_EACH_VEC_ELT (region_bbs, i, bb)
+    {
+      if (bitmap_bit_p (in_loop_bbs, bb->index))
+	continue;
+
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  gimple *skip_stmt = NULL;
+
+	  if (is_gimple_debug (stmt)
+	      || gimple_code (stmt) == GIMPLE_COND)
+	    continue;
+
+	  ao_ref ref;
+	  bool ref_is_store = false;
+	  if (gimple_assign_load_p (stmt))
+	    {
+	      tree rhs = gimple_assign_rhs1 (stmt);
+	      tree base = get_base_address (rhs);
+	      if (TREE_CODE (base) == MEM_REF
+		  && operand_equal_p (TREE_OPERAND (base, 0), omp_data_i, 0))
+		continue;
+
+	      tree lhs = gimple_assign_lhs (stmt);
+	      if (TREE_CODE (lhs) == SSA_NAME
+		  && has_single_use (lhs))
+		{
+		  use_operand_p use_p;
+		  gimple *use_stmt;
+		  single_imm_use (lhs, &use_p, &use_stmt);
+		  if (gimple_code (use_stmt) == GIMPLE_PHI)
+		    {
+		      struct reduction_info *red;
+		      red = reduction_phi (reduction_list, use_stmt);
+		      tree val = PHI_RESULT (red->keep_res);
+		      if (has_single_use (val))
+			{
+			  single_imm_use (val, &use_p, &use_stmt);
+			  if (gimple_store_p (use_stmt))
+			    {
+			      unsigned int id
+				= SSA_NAME_VERSION (gimple_vdef (use_stmt));
+			      bitmap_set_bit (reduction_stores, id);
+			      skip_stmt = use_stmt;
+			      if (dump_file)
+				{
+				  fprintf (dump_file, "found reduction load: ");
+				  print_gimple_stmt (dump_file, stmt, 0, 0);
+				}
+			    }
+			}
+		    }
+		}
+
+	      ao_ref_init (&ref, rhs);
+	    }
+	  else if (gimple_store_p (stmt))
+	    {
+	      ao_ref_init (&ref, gimple_assign_lhs (stmt));
+	      ref_is_store = true;
+	    }
+	  else if (gimple_code (stmt) == GIMPLE_OMP_RETURN)
+	    continue;
+	  else if (!gimple_has_side_effects (stmt)
+		   && !gimple_could_trap_p (stmt)
+		   && !stmt_could_throw_p (stmt)
+		   && !gimple_vdef (stmt)
+		   && !gimple_vuse (stmt))
+	    continue;
+	  else if (is_gimple_call (stmt)
+		   && gimple_call_internal_p (stmt)
+		   && gimple_call_internal_fn (stmt) == IFN_GOACC_DIM_POS)
+	    continue;
+	  else if (gimple_code (stmt) == GIMPLE_RETURN)
+	    continue;
+	  else
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "Unhandled stmt in entry/exit: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+	      return false;
+	    }
+
+	  if (ref_conflicts_with_region (gsi, &ref, ref_is_store, region_bbs,
+					 i, skip_stmt))
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "conflicts with entry/exit stmt: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+	      return false;
+	    }
+	}
+    }
+
   return true;
 }
 
+/* Find stores inside REGION_BBS and outside IN_LOOP_BBS, and guard them with
+   gang_pos == 0, except when the stores are REDUCTION_STORES.  Return true
+   if any changes were made.  */
+
+static bool
+oacc_entry_exit_single_gang (bitmap in_loop_bbs, vec<basic_block> region_bbs,
+			     bitmap reduction_stores)
+{
+  tree gang_pos = NULL_TREE;
+  bool changed = false;
+
+  unsigned i;
+  basic_block bb;
+  FOR_EACH_VEC_ELT (region_bbs, i, bb)
+    {
+      if (bitmap_bit_p (in_loop_bbs, bb->index))
+	continue;
+
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);)
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+
+	  if (!gimple_store_p (stmt))
+	    {
+	      /* Update gsi to point to next stmt.  */
+	      gsi_next (&gsi);
+	      continue;
+	    }
+
+	  if (bitmap_bit_p (reduction_stores,
+			    SSA_NAME_VERSION (gimple_vdef (stmt))))
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file,
+			   "skipped reduction store for single-gang"
+			   " neutering: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+
+	      /* Update gsi to point to next stmt.  */
+	      gsi_next (&gsi);
+	      continue;
+	    }
+
+	  changed = true;
+
+	  if (gang_pos == NULL_TREE)
+	    {
+	      tree arg = build_int_cst (integer_type_node, GOMP_DIM_GANG);
+	      gcall *gang_single
+		= gimple_build_call_internal (IFN_GOACC_DIM_POS, 1, arg);
+	      gang_pos = make_ssa_name (integer_type_node);
+	      gimple_call_set_lhs (gang_single, gang_pos);
+	      gimple_stmt_iterator start
+		= gsi_start_bb (single_succ (ENTRY_BLOCK_PTR_FOR_FN (cfun)));
+	      tree vuse = ssa_default_def (cfun, gimple_vop (cfun));
+	      gimple_set_vuse (gang_single, vuse);
+	      gsi_insert_before (&start, gang_single, GSI_SAME_STMT);
+	    }
+
+	  if (dump_file)
+	    {
+	      fprintf (dump_file,
+		       "found store that needs single-gang neutering: ");
+	      print_gimple_stmt (dump_file, stmt, 0, 0);
+	    }
+
+	  {
+	    /* Split block before store.  */
+	    gimple_stmt_iterator gsi2 = gsi;
+	    gsi_prev (&gsi2);
+	    edge e;
+	    if (gsi_end_p (gsi2))
+	      {
+		e = split_block_after_labels (bb);
+		gsi2 = gsi_last_bb (bb);
+	      }
+	    else
+	      e = split_block (bb, gsi_stmt (gsi2));
+	    basic_block bb2 = e->dest;
+
+	    /* Split block after store.  */
+	    gimple_stmt_iterator gsi3 = gsi_start_bb (bb2);
+	    edge e2 = split_block (bb2, gsi_stmt (gsi3));
+	    basic_block bb3 = e2->dest;
+
+	    gimple *cond
+	      = gimple_build_cond (EQ_EXPR, gang_pos, integer_zero_node,
+				   NULL_TREE, NULL_TREE);
+	    gsi_insert_after (&gsi2, cond, GSI_NEW_STMT);
+
+	    edge e3 = make_edge (bb, bb3, EDGE_FALSE_VALUE);
+	    e->flags = EDGE_TRUE_VALUE;
+
+	    tree vdef = gimple_vdef (stmt);
+	    tree vuse = gimple_vuse (stmt);
+
+	    tree phi_res = copy_ssa_name (vdef);
+	    gphi *new_phi = create_phi_node (phi_res, bb3);
+	    replace_uses_by (vdef, phi_res);
+	    add_phi_arg (new_phi, vuse, e3, UNKNOWN_LOCATION);
+	    add_phi_arg (new_phi, vdef, e2, UNKNOWN_LOCATION);
+
+	    /* Update gsi to point to next stmt.  */
+	    bb = bb3;
+	    gsi = gsi_start_bb (bb);
+	  }
+	}
+    }
+
+  return changed;
+}
+
+static bool
+oacc_entry_exit_ok (struct loop *loop,
+		    reduction_info_table_type *reduction_list)
+{
+  basic_block *loop_bbs = get_loop_body_in_dom_order (loop);
+  tree omp_data_i = get_omp_data_i_param ();
+  gcc_assert (omp_data_i != NULL_TREE);
+  vec<basic_block> region_bbs
+    = get_all_dominated_blocks (CDI_DOMINATORS, ENTRY_BLOCK_PTR_FOR_FN (cfun));
+
+  bitmap in_loop_bbs = BITMAP_ALLOC (NULL);
+  bitmap_clear (in_loop_bbs);
+  for (unsigned int i = 0; i < loop->num_nodes; i++)
+    bitmap_set_bit (in_loop_bbs, loop_bbs[i]->index);
+
+  bitmap reduction_stores = BITMAP_ALLOC (NULL);
+  bool res = oacc_entry_exit_ok_1 (in_loop_bbs, region_bbs, omp_data_i,
+				   reduction_list, reduction_stores);
+
+  if (res)
+    {
+      bool changed = oacc_entry_exit_single_gang (in_loop_bbs, region_bbs,
+						  reduction_stores);
+      if (changed)
+	{
+	  free_dominance_info (CDI_DOMINATORS);
+	  calculate_dominance_info (CDI_DOMINATORS);
+	}
+    }
+
+  free (loop_bbs);
+
+  BITMAP_FREE (in_loop_bbs);
+  BITMAP_FREE (reduction_stores);
+
+  return res;
+}
+
 /* Detect parallel loops and generate parallel code using libgomp
    primitives.  Returns true if some loop was parallelized, false
    otherwise.  */
 
 static bool
-parallelize_loops (void)
+parallelize_loops (bool oacc_kernels_p)
 {
   unsigned n_threads = flag_tree_parallelize_loops;
   bool changed = false;
@@ -2642,19 +3118,29 @@ parallelize_loops (void)
   source_location loop_loc;
 
   /* Do not parallelize loops in the functions created by parallelization.  */
-  if (parallelized_function_p (cfun->decl))
+  if (!oacc_kernels_p
+      && parallelized_function_p (cfun->decl))
     return false;
+
+  /* Do not parallelize loops in offloaded functions.  */
+  if (!oacc_kernels_p
+      && get_oacc_fn_attrib (cfun->decl) != NULL)
+     return false;
+
   if (cfun->has_nonlocal_label)
     return false;
 
   gcc_obstack_init (&parloop_obstack);
   reduction_info_table_type reduction_list (10);
 
+  calculate_dominance_info (CDI_DOMINATORS);
+
   FOR_EACH_LOOP (loop, 0)
     {
       if (loop == skip_loop)
 	{
-	  if (dump_file && (dump_flags & TDF_DETAILS))
+	  if (!loop->in_oacc_kernels_region
+	      && dump_file && (dump_flags & TDF_DETAILS))
 	    fprintf (dump_file,
 		     "Skipping loop %d as inner loop of parallelized loop\n",
 		     loop->num);
@@ -2666,6 +3152,22 @@ parallelize_loops (void)
 	skip_loop = NULL;
 
       reduction_list.empty ();
+
+      if (oacc_kernels_p)
+	{
+	  if (!loop->in_oacc_kernels_region)
+	    continue;
+
+	  /* Don't try to parallelize inner loops in an oacc kernels region.  */
+	  if (loop->inner)
+	    skip_loop = loop->inner;
+
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file,
+		     "Trying loop %d with header bb %d in oacc kernels"
+		     " region\n", loop->num, loop->header->index);
+	}
+
       if (dump_file && (dump_flags & TDF_DETAILS))
       {
         fprintf (dump_file, "Trying loop %d as candidate\n",loop->num);
@@ -2707,6 +3209,7 @@ parallelize_loops (void)
       /* FIXME: Bypass this check as graphite doesn't update the
 	 count and frequency correctly now.  */
       if (!flag_loop_parallelize_all
+	  && !oacc_kernels_p
 	  && ((estimated != -1
 	       && estimated <= (HOST_WIDE_INT) n_threads * MIN_PER_THREAD)
 	      /* Do not bother with loops in cold areas.  */
@@ -2716,14 +3219,23 @@ parallelize_loops (void)
       if (!try_get_loop_niter (loop, &niter_desc))
 	continue;
 
-      if (!try_create_reduction_list (loop, &reduction_list))
+      if (!try_create_reduction_list (loop, &reduction_list, oacc_kernels_p))
 	continue;
 
       if (!flag_loop_parallelize_all
 	  && !loop_parallel_p (loop, &parloop_obstack))
 	continue;
 
+      if (oacc_kernels_p
+	&& !oacc_entry_exit_ok (loop, &reduction_list))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "entry/exit not ok: FAILED\n");
+	  continue;
+	}
+
       changed = true;
+      /* Skip inner loop(s) of parallelized loop.  */
       skip_loop = loop->inner;
       if (dump_file && (dump_flags & TDF_DETAILS))
       {
@@ -2736,8 +3248,9 @@ parallelize_loops (void)
 	  fprintf (dump_file, "\nloop at %s:%d: ",
 		   LOCATION_FILE (loop_loc), LOCATION_LINE (loop_loc));
       }
+
       gen_parallel_loop (loop, &reduction_list,
-			 n_threads, &niter_desc);
+			 n_threads, &niter_desc, oacc_kernels_p);
     }
 
   obstack_free (&parloop_obstack, NULL);
@@ -2787,7 +3300,7 @@ pass_parallelize_loops::execute (function *fun)
   if (number_of_loops (fun) <= 1)
     return 0;
 
-  if (parallelize_loops ())
+  if (parallelize_loops (false))
     {
       fun->curr_properties &= ~(PROP_gimple_eomp);
 
@@ -2806,3 +3319,55 @@ make_pass_parallelize_loops (gcc::context *ctxt)
 {
   return new pass_parallelize_loops (ctxt);
 }
+
+namespace {
+
+const pass_data pass_data_parallelize_loops_oacc_kernels =
+{
+  GIMPLE_PASS, /* type */
+  "parloops_oacc_kernels", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_TREE_PARALLELIZE_LOOPS, /* tv_id */
+  ( PROP_cfg | PROP_ssa ), /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_parallelize_loops_oacc_kernels : public gimple_opt_pass
+{
+public:
+  pass_parallelize_loops_oacc_kernels (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_parallelize_loops_oacc_kernels, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *) { return flag_tree_parallelize_loops > 1; }
+  virtual unsigned int execute (function *);
+
+}; // class pass_parallelize_loops_oacc_kernels
+
+unsigned
+pass_parallelize_loops_oacc_kernels::execute (function *fun)
+{
+  if (number_of_loops (fun) <= 1)
+    return 0;
+
+  if (parallelize_loops (true))
+    {
+      fun->curr_properties &= ~(PROP_gimple_eomp);
+
+      return TODO_update_ssa;
+    }
+
+  return 0;
+}
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_parallelize_loops_oacc_kernels (gcc::context *ctxt)
+{
+  return new pass_parallelize_loops_oacc_kernels (ctxt);
+}
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index f95a820..8eaf678 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -384,6 +384,8 @@ extern gimple_opt_pass *make_pass_slp_vectorize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_complete_unroll (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_complete_unrolli (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_parallelize_loops (gcc::context *ctxt);
+extern gimple_opt_pass *
+  make_pass_parallelize_loops_oacc_kernels (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_loop_prefetch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_iv_optimize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_loop_done (gcc::context *ctxt);
-- 
1.9.1


* [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
                   ` (8 preceding siblings ...)
  2015-11-09 19:53 ` [PATCH, 9/16] Add pass_parallelize_loops_oacc_kernels Tom de Vries
@ 2015-11-09 19:59 ` Tom de Vries
  2015-11-11 11:03   ` Richard Biener
  2015-11-09 20:02 ` [PATCH, 11/16] Update testcases after adding kernels pass group Tom de Vries
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 19:59 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 1764 bytes --]

This patch adds the pass_oacc_kernels pass group to the pass list in 
passes.def.

Note that the pass_lim/pass_copy_prop pair appears twice. The first pair
handles the inner loop of a loop nest, the second the outer loop.
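
For context on the clone methods added here: as far as I understand the
pass manager, the second and later instances of a pass listed in
passes.def are created by calling clone () on the first instance, and a
pass that does not override clone () cannot be listed twice. The sketch
below is a toy model of that pattern, not GCC code; toy_pass,
toy_loop_init, toy_once_only and second_instance are hypothetical names
used for illustration only.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Toy stand-in for a pass base class (hypothetical, not GCC's opt_pass).
struct toy_pass
{
  explicit toy_pass (std::string n) : name (std::move (n)) {}
  virtual ~toy_pass () = default;

  // Default behaviour: a pass that did not opt in cannot be duplicated.
  virtual std::unique_ptr<toy_pass> clone () const
  {
    throw std::logic_error ("pass " + name + " does not support cloning");
  }

  std::string name;
};

// A pass that may appear several times in a pass list: it overrides
// clone () to hand out a fresh instance of itself.
struct toy_loop_init : toy_pass
{
  toy_loop_init () : toy_pass ("loopinit") {}

  std::unique_ptr<toy_pass> clone () const override
  {
    return std::unique_ptr<toy_pass> (new toy_loop_init ());
  }
};

// A pass that keeps the default and so can only be listed once.
struct toy_once_only : toy_pass
{
  toy_once_only () : toy_pass ("once") {}
};

// Models a repeated entry in a pass list: the first instance serves as
// the template for every later one.
std::unique_ptr<toy_pass>
second_instance (const toy_pass &first)
{
  std::unique_ptr<toy_pass> copy = first.clone ();
  assert (copy && copy.get () != &first);
  return copy;
}
```

In this model, calling second_instance on a toy_loop_init yields a
distinct object with the same name, while calling it on a toy_once_only
throws.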

Thanks,
- Tom


[-- Attachment #2: 0010-Add-pass_oacc_kernels-pass-group-in-passes.def.patch --]
[-- Type: text/x-patch, Size: 3025 bytes --]

Add pass_oacc_kernels pass group in passes.def

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* omp-low.c (pass_expand_omp_ssa::clone): New function.
	* tree-ssa-loop.c (pass_scev_cprop::clone, pass_tree_loop_init::clone)
	(pass_tree_loop_done::clone): New function.
	* passes.def: Add pass_oacc_kernels pass group.
---
 gcc/omp-low.c       |  1 +
 gcc/passes.def      | 21 +++++++++++++++++++++
 gcc/tree-ssa-loop.c |  3 +++
 3 files changed, 25 insertions(+)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 13fa456..1283cc7 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -13360,6 +13360,7 @@ public:
       return !(fun->curr_properties & PROP_gimple_eomp);
     }
   virtual unsigned int execute (function *) { return execute_expand_omp (); }
+  opt_pass * clone () { return new pass_expand_omp_ssa (m_ctxt); }
 
 }; // class pass_expand_omp_ssa
 
diff --git a/gcc/passes.def b/gcc/passes.def
index c0ab6b9..b7a5424 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -86,6 +86,27 @@ along with GCC; see the file COPYING3.  If not see
 	  /* pass_build_ealias is a dummy pass that ensures that we
 	     execute TODO_rebuild_alias at this point.  */
 	  NEXT_PASS (pass_build_ealias);
+	  /* Pass group that runs when there are oacc kernels in the
+	     function.  */
+	  NEXT_PASS (pass_oacc_kernels);
+	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
+	      NEXT_PASS (pass_dominator_oacc_kernels);
+	      NEXT_PASS (pass_ch_oacc_kernels);
+	      NEXT_PASS (pass_dominator_oacc_kernels);
+	      NEXT_PASS (pass_tree_loop_init);
+	      NEXT_PASS (pass_lim);
+	      NEXT_PASS (pass_copy_prop);
+	      NEXT_PASS (pass_lim);
+	      NEXT_PASS (pass_copy_prop);
+	      NEXT_PASS (pass_scev_cprop);
+	      NEXT_PASS (pass_tree_loop_done);
+	      NEXT_PASS (pass_dominator_oacc_kernels);
+	      NEXT_PASS (pass_dce);
+	      NEXT_PASS (pass_tree_loop_init);
+	      NEXT_PASS (pass_parallelize_loops_oacc_kernels);
+	      NEXT_PASS (pass_expand_omp_ssa);
+	      NEXT_PASS (pass_tree_loop_done);
+	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_fre);
 	  NEXT_PASS (pass_merge_phi);
           NEXT_PASS (pass_dse);
diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index b51cac2..0557f99 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -270,6 +270,7 @@ public:
 
   /* opt_pass methods: */
   virtual unsigned int execute (function *);
+  opt_pass * clone () { return new pass_tree_loop_init (m_ctxt); }
 
 }; // class pass_tree_loop_init
 
@@ -374,6 +375,7 @@ public:
   /* opt_pass methods: */
   virtual bool gate (function *) { return flag_tree_scev_cprop; }
   virtual unsigned int execute (function *) { return scev_const_prop (); }
+  opt_pass * clone () { return new pass_scev_cprop (m_ctxt); }
 
 }; // class pass_scev_cprop
 
@@ -516,6 +518,7 @@ public:
 
   /* opt_pass methods: */
   virtual unsigned int execute (function *) { return tree_ssa_loop_done (); }
+  opt_pass * clone () { return new pass_tree_loop_done (m_ctxt); }
 
 }; // class pass_tree_loop_done
 
-- 
1.9.1


* [PATCH, 11/16] Update testcases after adding kernels pass group
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
                   ` (9 preceding siblings ...)
  2015-11-09 19:59 ` [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def Tom de Vries
@ 2015-11-09 20:02 ` Tom de Vries
  2015-11-11 11:03   ` Richard Biener
  2015-11-09 20:06 ` [PATCH, 12/16] Handle acc loop directive Tom de Vries
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 20:02 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 1658 bytes --]

This patch updates the dump names used in existing testcases (for
instance, lim1 becomes lim3 and sccp becomes sccp2) to account for the
pass instances that patch 10 adds to the pass list.
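
To illustrate why the names shift: a pass that occurs once in the pass
list dumps under its plain name (sccp), while a pass that occurs several
times gets numbered instances in list order (lim1, lim2, ...). The
following is a toy model of that numbering under those assumptions, not
the pass manager's implementation; dump_names is a hypothetical helper.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy model: given pass names in execution order, return the dump name
// of each instance.  A name that occurs once stays unnumbered; a name
// that occurs several times is numbered from 1 in list order.
std::vector<std::string>
dump_names (const std::vector<std::string> &passes)
{
  // First walk: count how often each pass name occurs.
  std::map<std::string, int> total;
  for (const std::string &p : passes)
    {
      assert (!p.empty ());
      ++total[p];
    }

  // Second walk: number the instances of repeated passes.
  std::map<std::string, int> seen;
  std::vector<std::string> names;
  for (const std::string &p : passes)
    {
      int n = ++seen[p];
      names.push_back (total[p] > 1 ? p + std::to_string (n) : p);
    }
  return names;
}
```

For a list such as lim, sccp, lim, this gives lim1, sccp, lim2. Putting
two more lim instances and one more sccp instance in front turns the old
lim1 into lim3 and the old sccp into sccp2, which matches the renames in
this patch.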

Thanks,
- Tom


[-- Attachment #2: 0011-Update-testcases-after-adding-kernels-pass-group.patch --]
[-- Type: text/x-patch, Size: 35345 bytes --]

Update testcases after adding kernels pass group

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* c-c++-common/restrict-2.c: Update after adding pass_oacc_kernels pass
	group.
	* c-c++-common/restrict-4.c: Same.
	* g++.dg/tree-ssa/copyprop-1.C: Same.
	* g++.dg/tree-ssa/pr33615.C: Same.
	* g++.dg/tree-ssa/restrict1.C: Same.
	* gcc.dg/gomp/notify-new-function-3.c: Same.
	* gcc.dg/pr23911.c: Same.
	* gcc.dg/pr41488.c: Same.
	* gcc.dg/tm/pub-safety-1.c: Same.
	* gcc.dg/tm/reg-promotion.c: Same.
	* gcc.dg/tree-ssa/20030709-2.c: Same.
	* gcc.dg/tree-ssa/20030731-2.c: Same.
	* gcc.dg/tree-ssa/20040729-1.c: Same.
	* gcc.dg/tree-ssa/20050314-1.c: Same.
	* gcc.dg/tree-ssa/cfgcleanup-1.c: Same.
	* gcc.dg/tree-ssa/loop-17.c: Same.
	* gcc.dg/tree-ssa/loop-32.c: Same.
	* gcc.dg/tree-ssa/loop-33.c: Same.
	* gcc.dg/tree-ssa/loop-34.c: Same.
	* gcc.dg/tree-ssa/loop-35.c: Same.
	* gcc.dg/tree-ssa/loop-36.c: Same.
	* gcc.dg/tree-ssa/loop-39.c: Same.
	* gcc.dg/tree-ssa/loop-7.c: Same.
	* gcc.dg/tree-ssa/pr21086.c: Same.
	* gcc.dg/tree-ssa/pr23109.c: Same.
	* gcc.dg/tree-ssa/restrict-3.c: Same.
	* gcc.dg/tree-ssa/restrict-5.c: Same.
	* gcc.dg/tree-ssa/scev-7.c: Same.
	* gcc.dg/tree-ssa/ssa-dce-1.c: Same.
	* gcc.dg/tree-ssa/ssa-dce-2.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-1.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-10.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-11.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-12.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-2.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-3.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-6.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-7.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-8.c: Same.
	* gcc.dg/tree-ssa/ssa-lim-9.c: Same.
	* gcc.dg/tree-ssa/structopt-1.c: Same.
	* gcc.dg/vect/pr26359.c: Same.
	* gfortran.dg/pr32921.f: Same.
---
 gcc/testsuite/c-c++-common/restrict-2.c           | 4 ++--
 gcc/testsuite/c-c++-common/restrict-4.c           | 4 ++--
 gcc/testsuite/g++.dg/tree-ssa/copyprop-1.C        | 4 ++--
 gcc/testsuite/g++.dg/tree-ssa/pr33615.C           | 4 ++--
 gcc/testsuite/g++.dg/tree-ssa/restrict1.C         | 4 ++--
 gcc/testsuite/gcc.dg/gomp/notify-new-function-3.c | 2 +-
 gcc/testsuite/gcc.dg/pr23911.c                    | 6 +++---
 gcc/testsuite/gcc.dg/pr41488.c                    | 4 ++--
 gcc/testsuite/gcc.dg/tm/pub-safety-1.c            | 4 ++--
 gcc/testsuite/gcc.dg/tm/reg-promotion.c           | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/20030709-2.c        | 8 ++++----
 gcc/testsuite/gcc.dg/tree-ssa/20030731-2.c        | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/20040729-1.c        | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/20050314-1.c        | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/cfgcleanup-1.c      | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/loop-17.c           | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/loop-32.c           | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/loop-33.c           | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/loop-34.c           | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/loop-35.c           | 6 +++---
 gcc/testsuite/gcc.dg/tree-ssa/loop-36.c           | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/loop-39.c           | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/loop-7.c            | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/pr21086.c           | 6 +++---
 gcc/testsuite/gcc.dg/tree-ssa/pr23109.c           | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/restrict-3.c        | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/restrict-5.c        | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/scev-7.c            | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dce-1.c         | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dce-2.c         | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-1.c         | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-10.c        | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-11.c        | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-12.c        | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-2.c         | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-3.c         | 6 +++---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-6.c         | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-7.c         | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-8.c         | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-9.c         | 4 ++--
 gcc/testsuite/gcc.dg/tree-ssa/structopt-1.c       | 4 ++--
 gcc/testsuite/gcc.dg/vect/pr26359.c               | 4 ++--
 gcc/testsuite/gfortran.dg/pr32921.f               | 4 ++--
 43 files changed, 91 insertions(+), 91 deletions(-)

diff --git a/gcc/testsuite/c-c++-common/restrict-2.c b/gcc/testsuite/c-c++-common/restrict-2.c
index 5e8bca7..183a0de 100644
--- a/gcc/testsuite/c-c++-common/restrict-2.c
+++ b/gcc/testsuite/c-c++-common/restrict-2.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fno-strict-aliasing -fdump-tree-lim1-details" } */
+/* { dg-options "-O -fno-strict-aliasing -fdump-tree-lim3-details" } */
 
 void foo (float * __restrict__ a, float * __restrict__ b, int n, int j)
 {
@@ -10,4 +10,4 @@ void foo (float * __restrict__ a, float * __restrict__ b, int n, int j)
 
 /* We should move the RHS of the store out of the loop.  */
 
-/* { dg-final { scan-tree-dump-times "Moving statement" 11 "lim1" } } */
+/* { dg-final { scan-tree-dump-times "Moving statement" 11 "lim3" } } */
diff --git a/gcc/testsuite/c-c++-common/restrict-4.c b/gcc/testsuite/c-c++-common/restrict-4.c
index cea6cd8..8dd597c 100644
--- a/gcc/testsuite/c-c++-common/restrict-4.c
+++ b/gcc/testsuite/c-c++-common/restrict-4.c
@@ -1,5 +1,5 @@
 /* { dg-do compile }  */
-/* { dg-options "-O2 -fdump-tree-lim1-details" } */
+/* { dg-options "-O2 -fdump-tree-lim3-details" } */
 
 struct Foo
 {
@@ -15,4 +15,4 @@ void bar(struct Foo f, int * __restrict__ q)
     }
 }
 
-/* { dg-final { scan-tree-dump "Executing store motion" "lim1" } } */
+/* { dg-final { scan-tree-dump "Executing store motion" "lim3" } } */
diff --git a/gcc/testsuite/g++.dg/tree-ssa/copyprop-1.C b/gcc/testsuite/g++.dg/tree-ssa/copyprop-1.C
index 5ff289c..34a9f7b 100644
--- a/gcc/testsuite/g++.dg/tree-ssa/copyprop-1.C
+++ b/gcc/testsuite/g++.dg/tree-ssa/copyprop-1.C
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fdump-tree-dce2" } */
+/* { dg-options "-O -fdump-tree-dce3" } */
 
 /* Verify that we can eliminate the useless conversions to/from
    const qualified pointer types
@@ -27,4 +27,4 @@ int foo(Object&o)
 
 /* Remaining should be two loads.  */
 
-/* { dg-final { scan-tree-dump-times " = \[^\n\]*;" 2 "dce2" } } */
+/* { dg-final { scan-tree-dump-times " = \[^\n\]*;" 2 "dce3" } } */
diff --git a/gcc/testsuite/g++.dg/tree-ssa/pr33615.C b/gcc/testsuite/g++.dg/tree-ssa/pr33615.C
index f1b7a64..dd2bbb2 100644
--- a/gcc/testsuite/g++.dg/tree-ssa/pr33615.C
+++ b/gcc/testsuite/g++.dg/tree-ssa/pr33615.C
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fnon-call-exceptions -fdump-tree-lim1-details -w" } */
+/* { dg-options "-O -fnon-call-exceptions -fdump-tree-lim3-details -w" } */
 
 extern volatile int y;
 
@@ -16,4 +16,4 @@ foo (double a, int x)
 
 // The expression 1.0 / 0.0 should not be treated as a loop invariant
 // if it may throw an exception.
-// { dg-final { scan-tree-dump-times "invariant up to" 0 "lim1" } }
+// { dg-final { scan-tree-dump-times "invariant up to" 0 "lim3" } }
diff --git a/gcc/testsuite/g++.dg/tree-ssa/restrict1.C b/gcc/testsuite/g++.dg/tree-ssa/restrict1.C
index 5952fca..718d1ec 100644
--- a/gcc/testsuite/g++.dg/tree-ssa/restrict1.C
+++ b/gcc/testsuite/g++.dg/tree-ssa/restrict1.C
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-lim1-details" } */
+/* { dg-options "-O2 -fdump-tree-lim3-details" } */
 
 struct Foo
 {
@@ -16,4 +16,4 @@ void bar(Foo f, int * __restrict__ q)
     }
 }
 
-/* { dg-final { scan-tree-dump "Executing store motion" "lim1" } } */
+/* { dg-final { scan-tree-dump "Executing store motion" "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/gomp/notify-new-function-3.c b/gcc/testsuite/gcc.dg/gomp/notify-new-function-3.c
index a8f24b1..033a407 100644
--- a/gcc/testsuite/gcc.dg/gomp/notify-new-function-3.c
+++ b/gcc/testsuite/gcc.dg/gomp/notify-new-function-3.c
@@ -11,4 +11,4 @@ foo (int *__restrict a, int *__restrict b, int *__restrict c)
 
 
 /* Check for new function notification in ompexpssa dump.  */
-/* { dg-final { scan-tree-dump-times "Added new ssa gimple function foo\\.\[\\\$_\]loopfn\\.0 to callgraph" 1 "ompexpssa" } } */
+/* { dg-final { scan-tree-dump-times "Added new ssa gimple function foo\\.\[\\\$_\]loopfn\\.0 to callgraph" 1 "ompexpssa2" } } */
diff --git a/gcc/testsuite/gcc.dg/pr23911.c b/gcc/testsuite/gcc.dg/pr23911.c
index 2c27397..3fa0412 100644
--- a/gcc/testsuite/gcc.dg/pr23911.c
+++ b/gcc/testsuite/gcc.dg/pr23911.c
@@ -1,7 +1,7 @@
 /* This was a missed optimization in tree constant propagation
    that CSE would catch later on.  */
 /* { dg-do compile } */
-/* { dg-options "-O -fdump-tree-dce2" } */
+/* { dg-options "-O -fdump-tree-dce3" } */
 
 double _Complex *a; 
 static const double _Complex b[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; 
@@ -16,5 +16,5 @@ test (void)
 
 /* After DCE2 which runs after FRE, the expressions should be fully
    constant folded.  There should be no loads from b left.  */
-/* { dg-final { scan-tree-dump-times "__complex__ \\\(1.0e\\\+0, 0.0\\\)" 2 "dce2" } } */
-/* { dg-final { scan-tree-dump-times "= b" 0 "dce2" } } */
+/* { dg-final { scan-tree-dump-times "__complex__ \\\(1.0e\\\+0, 0.0\\\)" 2 "dce3" } } */
+/* { dg-final { scan-tree-dump-times "= b" 0 "dce3" } } */
diff --git a/gcc/testsuite/gcc.dg/pr41488.c b/gcc/testsuite/gcc.dg/pr41488.c
index b9bc718..6c7686b 100644
--- a/gcc/testsuite/gcc.dg/pr41488.c
+++ b/gcc/testsuite/gcc.dg/pr41488.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-sccp-scev" } */
+/* { dg-options "-O2 -fdump-tree-sccp2-scev" } */
 
 struct struct_t
 {
@@ -14,4 +14,4 @@ void foo (struct struct_t* sp, int start, int end)
     sp->data[i+start] = 0;
 }
 
-/* { dg-final { scan-tree-dump-times "Simplify PEELED_CHREC into POLYNOMIAL_CHREC" 1 "sccp" } } */
+/* { dg-final { scan-tree-dump-times "Simplify PEELED_CHREC into POLYNOMIAL_CHREC" 1 "sccp2" } } */
diff --git a/gcc/testsuite/gcc.dg/tm/pub-safety-1.c b/gcc/testsuite/gcc.dg/tm/pub-safety-1.c
index c95111c..3841c08 100644
--- a/gcc/testsuite/gcc.dg/tm/pub-safety-1.c
+++ b/gcc/testsuite/gcc.dg/tm/pub-safety-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-fgnu-tm -O1 -fdump-tree-lim1" } */
+/* { dg-options "-fgnu-tm -O1 -fdump-tree-lim3" } */
 
 /* Test that thread visible loads do not get hoisted out of loops if
    the load would not have occurred on each path out of the loop.  */
@@ -20,4 +20,4 @@ void reader()
     }
 }
 
-/* { dg-final { scan-tree-dump-times "Cannot hoist.*DATA_DATA because it is in a transaction" 1 "lim1" } } */
+/* { dg-final { scan-tree-dump-times "Cannot hoist.*DATA_DATA because it is in a transaction" 1 "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tm/reg-promotion.c b/gcc/testsuite/gcc.dg/tm/reg-promotion.c
index 0200600..e0e5f62 100644
--- a/gcc/testsuite/gcc.dg/tm/reg-promotion.c
+++ b/gcc/testsuite/gcc.dg/tm/reg-promotion.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-fgnu-tm -O2 -fdump-tree-lim1" } */
+/* { dg-options "-fgnu-tm -O2 -fdump-tree-lim3" } */
 
 /* Test that `count' is not written to unless p->data>0.  */
 
@@ -20,4 +20,4 @@ void func()
   }
 }
 
-/* { dg-final { scan-tree-dump-times "Cannot hoist conditional load of count because it is in a transaction" 1 "lim1" } } */
+/* { dg-final { scan-tree-dump-times "Cannot hoist conditional load of count because it is in a transaction" 1 "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/20030709-2.c b/gcc/testsuite/gcc.dg/tree-ssa/20030709-2.c
index d4f42f9..5009cd6 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/20030709-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/20030709-2.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fdump-tree-dce2" } */
+/* { dg-options "-O -fdump-tree-dce3" } */
   
 struct rtx_def;
 typedef struct rtx_def *rtx;
@@ -42,13 +42,13 @@ get_alias_set (t)
 
 /* There should be precisely one load of ->decl.rtl.  If there is
    more than, then the dominator optimizations failed.  */
-/* { dg-final { scan-tree-dump-times "->decl\\.rtl" 1 "dce2"} } */
+/* { dg-final { scan-tree-dump-times "->decl\\.rtl" 1 "dce3"} } */
   
 /* There should be no loads of .rtmem since the complex return statement
    is just "return 0".  */
-/* { dg-final { scan-tree-dump-times ".rtmem" 0 "dce2"} } */
+/* { dg-final { scan-tree-dump-times ".rtmem" 0 "dce3"} } */
   
 /* There should be one IF statement (the complex return statement should
    collapse down to a simple return 0 without any conditionals).  */
-/* { dg-final { scan-tree-dump-times "if " 1 "dce2"} } */
+/* { dg-final { scan-tree-dump-times "if " 1 "dce3"} } */
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/20030731-2.c b/gcc/testsuite/gcc.dg/tree-ssa/20030731-2.c
index bdb22ff..069f953 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/20030731-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/20030731-2.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-dce1" } */
+/* { dg-options "-O2 -fdump-tree-dce2" } */
 
 void foo (void);
 
@@ -15,4 +15,4 @@ bar (int i, int partial, int args_addr)
 
 /* There should be only one IF conditional since the first does nothing
    useful.  */
-/* { dg-final { scan-tree-dump-times "if " 1 "dce1"} } */
+/* { dg-final { scan-tree-dump-times "if " 1 "dce2"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/20040729-1.c b/gcc/testsuite/gcc.dg/tree-ssa/20040729-1.c
index 6e7ffbb..812887a 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/20040729-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/20040729-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O1 -fdump-tree-dce2" } */
+/* { dg-options "-O1 -fdump-tree-dce3" } */
 
 int
 foo ()
@@ -16,4 +16,4 @@ foo ()
    compiler was mistakenly thinking that the statement had volatile
    operands.  But 'p' itself is not volatile and taking the address of
    a volatile does not constitute a volatile operand.  */
-/* { dg-final { scan-tree-dump-times "&x" 0 "dce2"} } */
+/* { dg-final { scan-tree-dump-times "&x" 0 "dce3"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/20050314-1.c b/gcc/testsuite/gcc.dg/tree-ssa/20050314-1.c
index fe220cd..1ad61f1 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/20050314-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/20050314-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O1 -fdump-tree-lim1-details --param allow-store-data-races=1" } */
+/* { dg-options "-O1 -fdump-tree-lim3-details --param allow-store-data-races=1" } */
 
 float a[100];
 
@@ -17,4 +17,4 @@ void xxx (void)
 /* Store motion may be applied to the assignment to a[k], since sinf
    cannot read nor write the memory.  */
 
-/* { dg-final { scan-tree-dump-times "Moving statement" 1 "lim1" } } */
+/* { dg-final { scan-tree-dump-times "Moving statement" 1 "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/cfgcleanup-1.c b/gcc/testsuite/gcc.dg/tree-ssa/cfgcleanup-1.c
index 4d22a42..53ce973 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/cfgcleanup-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/cfgcleanup-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */ 
-/* { dg-options "-O2 -fdump-tree-dce1" } */
+/* { dg-options "-O2 -fdump-tree-dce2" } */
 void
 cleanup (int a, int b)
 {
@@ -15,4 +15,4 @@ cleanup (int a, int b)
   return;
 }
 /* Dce should get rid of the initializers and cfgcleanup should elliminate ifs  */
-/* { dg-final { scan-tree-dump-times "if " 0 "dce1"} } */
+/* { dg-final { scan-tree-dump-times "if " 0 "dce2"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-17.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-17.c
index 588cf4c..4cb1438 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/loop-17.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-17.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fdump-tree-sccp-details" } */
+/* { dg-options "-O -fdump-tree-sccp2-details" } */
 
 /* To determine the number of iterations in this loop we need to fold
    p_4 + 4B > p_4 + 8B to false.  This transformation has caused
@@ -15,4 +15,4 @@ int foo (int *p)
   return i;
 }
 
-/* { dg-final { scan-tree-dump "# of iterations 1, bounded by 1" "sccp" } } */
+/* { dg-final { scan-tree-dump "# of iterations 1, bounded by 1" "sccp2" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-32.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-32.c
index 9953bb5..9b69c73 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/loop-32.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-32.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-lim1-details" } */
+/* { dg-options "-O2 -fdump-tree-lim3-details" } */
 
 int x;
 int a[100];
@@ -42,4 +42,4 @@ void test3(struct a *A)
     }
 }
 
-/* { dg-final { scan-tree-dump-times "Executing store motion of" 3 "lim1" } } */
+/* { dg-final { scan-tree-dump-times "Executing store motion of" 3 "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-33.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-33.c
index 2cf4c5a..98a16fb 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/loop-33.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-33.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-lim1-details" } */
+/* { dg-options "-O2 -fdump-tree-lim3-details" } */
 
 int x;
 int a[100];
@@ -36,4 +36,4 @@ void test5(struct a *A, unsigned b)
     }
 }
 
-/* { dg-final { scan-tree-dump-times "Executing store motion of" 4 "lim1" { xfail { lp64 || llp64 } } } } */
+/* { dg-final { scan-tree-dump-times "Executing store motion of" 4 "lim3" { xfail { lp64 || llp64 } } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-34.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-34.c
index 67493a5..26fb281 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/loop-34.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-34.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-lim1-details" } */
+/* { dg-options "-O2 -fdump-tree-lim3-details" } */
 
 int r[6];
 
@@ -17,4 +17,4 @@ void f (int n)
 }
 
 
-/* { dg-final { scan-tree-dump-times "Executing store motion of r" 6 "lim1" } } */
+/* { dg-final { scan-tree-dump-times "Executing store motion of r" 6 "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-35.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-35.c
index 70557c5..87d105a 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/loop-35.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-35.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-lim1-details" } */
+/* { dg-options "-O2 -fdump-tree-lim3-details" } */
 
 int x;
 int a[100];
@@ -67,5 +67,5 @@ void test4(struct a *A, unsigned LONG b)
     }
 }
 /* long index not hoisted for avr target PR 36561 */
-/* { dg-final { scan-tree-dump-times "Executing store motion of" 8 "lim1" { xfail { "avr-*-*" } } } } */
-/* { dg-final { scan-tree-dump-times "Executing store motion of" 6 "lim1" { target { "avr-*-*" } } } } */
+/* { dg-final { scan-tree-dump-times "Executing store motion of" 8 "lim3" { xfail { "avr-*-*" } } } } */
+/* { dg-final { scan-tree-dump-times "Executing store motion of" 6 "lim3" { target { "avr-*-*" } } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-36.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-36.c
index d922991..516cad9 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/loop-36.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-36.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-dce2" } */
+/* { dg-options "-O2 -fdump-tree-dce3" } */
 
 struct X { float array[2]; };
 
@@ -18,4 +18,4 @@ float foobar () {
 
 /* The temporary structure should have been promoted to registers
    by FRE after the loops have been unrolled by the early unrolling pass.  */
-/* { dg-final { scan-tree-dump-not "c\.array" "dce2" } } */
+/* { dg-final { scan-tree-dump-not "c\.array" "dce3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-39.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-39.c
index 53680dd..d1edbd5 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/loop-39.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-39.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-sccp-details" } */
+/* { dg-options "-O2 -fdump-tree-sccp2-details" } */
 
 int
 foo (unsigned int n)
@@ -22,4 +22,4 @@ foo (unsigned int n)
   return r + n;
 }
 
-/* { dg-final { scan-tree-dump "# of iterations \[^\n\r]*, bounded by 8" "sccp" } } */
+/* { dg-final { scan-tree-dump "# of iterations \[^\n\r]*, bounded by 8" "sccp2" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-7.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-7.c
index 26fb4ec..e28e4c9 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/loop-7.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-7.c
@@ -1,6 +1,6 @@
 /* PR tree-optimization/19828 */
 /* { dg-do compile } */
-/* { dg-options "-O1 -fdump-tree-lim1-details" } */
+/* { dg-options "-O1 -fdump-tree-lim3-details" } */
 
 int cst_fun1 (int) __attribute__((__const__));
 int cst_fun2 (int) __attribute__((__const__));
@@ -31,4 +31,4 @@ int xxx (void)
    Calls to cst_fun2 and pure_fun2 should not be, since calling
    with k = 0 may be invalid.  */
 
-/* { dg-final { scan-tree-dump-times "Moving statement" 2 "lim1" } } */
+/* { dg-final { scan-tree-dump-times "Moving statement" 2 "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr21086.c b/gcc/testsuite/gcc.dg/tree-ssa/pr21086.c
index 26ea817..e8b62c2 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr21086.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr21086.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-vrp1 -fdump-tree-dce1 -fdelete-null-pointer-checks" } */
+/* { dg-options "-O2 -fdump-tree-vrp1 -fdump-tree-dce2 -fdelete-null-pointer-checks" } */
 
 int
 foo (int *p)
@@ -18,5 +18,5 @@ foo (int *p)
 /* Target disabling -fdelete-null-pointer-checks should not fold checks */
 /* { dg-final { scan-tree-dump "Folding predicate " "vrp1" { target { ! keeps_null_pointer_checks } } } } */
 /* { dg-final { scan-tree-dump-times "Folding predicate " 0 "vrp1" { target {   keeps_null_pointer_checks } } } } */
-/* { dg-final { scan-tree-dump-not "b_. =" "dce1" { target { ! avr-*-* } } } } */
-/* { dg-final { scan-tree-dump "b_. =" "dce1" { target { avr-*-* } } } } */
+/* { dg-final { scan-tree-dump-not "b_. =" "dce2" { target { ! avr-*-* } } } } */
+/* { dg-final { scan-tree-dump "b_. =" "dce2" { target { avr-*-* } } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr23109.c b/gcc/testsuite/gcc.dg/tree-ssa/pr23109.c
index 8281a98..040f3ae 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr23109.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr23109.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -funsafe-math-optimizations -ftrapping-math -fdump-tree-recip -fdump-tree-lim1" } */
+/* { dg-options "-O2 -funsafe-math-optimizations -ftrapping-math -fdump-tree-recip -fdump-tree-lim3" } */
 /* { dg-warning "-fassociative-math disabled" "" { target *-*-* } 1 } */
 
 double F[2] = { 0., 0. }, e = 0.;
@@ -29,6 +29,6 @@ int main()
 /* LIM only performs the transformation in the no-trapping-math case.  In
    the future we will do it for trapping-math as well in recip, check that
    this is not wrongly optimized.  */
-/* { dg-final { scan-tree-dump-not "reciptmp" "lim1" } } */
+/* { dg-final { scan-tree-dump-not "reciptmp" "lim3" } } */
 /* { dg-final { scan-tree-dump-not "reciptmp" "recip" } } */
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/restrict-3.c b/gcc/testsuite/gcc.dg/tree-ssa/restrict-3.c
index e9e1438..a352129 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/restrict-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/restrict-3.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fno-strict-aliasing -fdump-tree-lim1-details" } */
+/* { dg-options "-O -fno-strict-aliasing -fdump-tree-lim3-details" } */
 
 void f(int * __restrict__ r,
        int a[__restrict__ 16][16],
@@ -14,4 +14,4 @@ void f(int * __restrict__ r,
 
 /* We should apply store motion to the store to *r.  */
 
-/* { dg-final { scan-tree-dump "Executing store motion of \\\*r" "lim1" } } */
+/* { dg-final { scan-tree-dump "Executing store motion of \\\*r" "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/restrict-5.c b/gcc/testsuite/gcc.dg/tree-ssa/restrict-5.c
index 6dd4c99..2e0edab 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/restrict-5.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/restrict-5.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fno-strict-aliasing -fdump-tree-lim1-details" } */
+/* { dg-options "-O -fno-strict-aliasing -fdump-tree-lim3-details" } */
 
 static inline __attribute__((always_inline))
 void f(int * __restrict__ r,
@@ -20,4 +20,4 @@ void g(int *r, int a[16][16], int b[16][16], int i, int j)
 
 /* We should apply store motion to the store to *r.  */
 
-/* { dg-final { scan-tree-dump "Executing store motion of \\\*r" "lim1" } } */
+/* { dg-final { scan-tree-dump "Executing store motion of \\\*r" "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/scev-7.c b/gcc/testsuite/gcc.dg/tree-ssa/scev-7.c
index 5dfc7b1..ead68d0 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/scev-7.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/scev-7.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-sccp-scev" } */
+/* { dg-options "-O2 -fdump-tree-sccp2-scev" } */
 
 struct struct_t
 {
@@ -14,4 +14,4 @@ void foo (struct struct_t* sp, int start, int end)
     sp->data[i+start] = 0;
 }
 
-/* { dg-final { scan-tree-dump-times "Simplify PEELED_CHREC into POLYNOMIAL_CHREC" 1 "sccp" } } */
+/* { dg-final { scan-tree-dump-times "Simplify PEELED_CHREC into POLYNOMIAL_CHREC" 1 "sccp2" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dce-1.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dce-1.c
index 4a8c6b6..0c478d1 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dce-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dce-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O1 -fdump-tree-dce2" } */
+/* { dg-options "-O1 -fdump-tree-dce3" } */
 
 int t() __attribute__ ((const));
 void
@@ -10,4 +10,4 @@ q()
     i = t();
 }
 /* There should be no IF conditionals.  */
-/* { dg-final { scan-tree-dump-times "if " 0 "dce2"} } */
+/* { dg-final { scan-tree-dump-times "if " 0 "dce3"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dce-2.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dce-2.c
index 6281a1e..b3f5073 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dce-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dce-2.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-dce2" } */
+/* { dg-options "-O2 -fdump-tree-dce3" } */
 
 /* We should notice constantness of this function. */
 static int __attribute__((noinline)) t(int a) 
@@ -13,4 +13,4 @@ void q(void)
     i = t(1);
 }
 /* There should be no IF conditionals.  */
-/* { dg-final { scan-tree-dump-times "if " 0 "dce2"} } */
+/* { dg-final { scan-tree-dump-times "if " 0 "dce3"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-1.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-1.c
index 1b387cd..6a4b819 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fdump-tree-lim1" } */
+/* { dg-options "-O -fdump-tree-lim3" } */
 
 /* This is a variant that does cause fold to place a cast to
    int before testing bit 1.  */
@@ -18,4 +18,4 @@ quantum_toffoli (int control1, int control2, int target,
     }
 }
 
-/* { dg-final { scan-tree-dump-times "1 <<" 3 "lim1" } } */
+/* { dg-final { scan-tree-dump-times "1 <<" 3 "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-10.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-10.c
index 79ea042..afa547c 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-10.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-10.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-lim1-details" } */
+/* { dg-options "-O2 -fdump-tree-lim3-details" } */
 
 int *l, *r;
 int test_func(void)
@@ -27,4 +27,4 @@ int test_func(void)
   return i;
 }
 
-/* { dg-final { scan-tree-dump "Executing store motion of pos" "lim1" } } */
+/* { dg-final { scan-tree-dump "Executing store motion of pos" "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-11.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-11.c
index eadf71c..d55f644 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-11.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-11.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fprofile-arcs -fdump-tree-lim1-details" } */
+/* { dg-options "-O -fprofile-arcs -fdump-tree-lim3-details" } */
 /* { dg-require-profiling "-fprofile-generate" } */
 
 struct thread_param
@@ -22,4 +22,4 @@ void access_buf(struct thread_param* p)
     }
 }
 
-/* { dg-final { scan-tree-dump-times "Executing store motion of __gcov0.access_buf\\\[\[01\]\\\] from loop 1" 2 "lim1" } } */
+/* { dg-final { scan-tree-dump-times "Executing store motion of __gcov0.access_buf\\\[\[01\]\\\] from loop 1" 2 "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-12.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-12.c
index 35f17d5..18b055f 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-12.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-12.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fdump-tree-lim1" } */
+/* { dg-options "-O -fdump-tree-lim3" } */
 
 int a[1024];
 
@@ -23,4 +23,4 @@ void bar (int x, int z)
     }
 }
 
-/* { dg-final { scan-tree-dump-times "!= 0 ? " 2 "lim1" } } */
+/* { dg-final { scan-tree-dump-times "!= 0 ? " 2 "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-2.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-2.c
index 8e72f78..9ef7bae 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-2.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fdump-tree-lim1" } */
+/* { dg-options "-O -fdump-tree-lim3" } */
 
 /* This is a variant that doesn't cause fold to place a cast to
    int before testing bit 1.  */
@@ -18,4 +18,4 @@ int size)
     }
 }
 
-/* { dg-final { scan-tree-dump-times "1 <<" 3 "lim1" } } */
+/* { dg-final { scan-tree-dump-times "1 <<" 3 "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-3.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-3.c
index 2035215..dc7f41a 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-3.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fdump-tree-lim1-details" } */
+/* { dg-options "-O -fdump-tree-lim3-details" } */
 
 struct { int x; int y; } global;
 void foo(int n)
@@ -9,5 +9,5 @@ void foo(int n)
     global.y += global.x*global.x;
 }
 
-/* { dg-final { scan-tree-dump "Executing store motion of global.y" "lim1" } } */
-/* { dg-final { scan-tree-dump "Moving statement.*global.x.*out of loop 1" "lim1" } } */
+/* { dg-final { scan-tree-dump "Executing store motion of global.y" "lim3" } } */
+/* { dg-final { scan-tree-dump "Moving statement.*global.x.*out of loop 1" "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-6.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-6.c
index 283d206..535d627 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-6.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-6.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-lim1-details" } */
+/* { dg-options "-O2 -fdump-tree-lim3-details" } */
 
 double a[16][64], y[64], x[16];
 void foo(void)
@@ -10,4 +10,4 @@ void foo(void)
       y[j] = y[j] + a[i][j] * x[i];
 }
 
-/* { dg-final { scan-tree-dump "Executing store motion of y" "lim1" } } */
+/* { dg-final { scan-tree-dump "Executing store motion of y" "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-7.c
index f9d685e..bf4e8ec 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-7.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-7.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fdump-tree-lim1-details" } */
+/* { dg-options "-O -fdump-tree-lim3-details" } */
 
 extern const int srcshift;
 
@@ -11,4 +11,4 @@ void foo (int *srcdata, int *dstdata)
     dstdata[i] = srcdata[i] << srcshift;
 }
 
-/* { dg-final { scan-tree-dump "Moving statement" "lim1" } } */
+/* { dg-final { scan-tree-dump "Moving statement" "lim3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-8.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-8.c
index aaad0f0..fb69af3 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-8.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-8.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fdump-tree-lim1-details" } */
+/* { dg-options "-O -fdump-tree-lim3-details" } */
 
 void bar (int);
 void foo (int n, int m)
@@ -16,4 +16,4 @@ void foo (int n, int m)
     }
 }
 
-/* { dg-final { scan-tree-dump-times "Moving PHI node" 1 "lim1"  } } */
+/* { dg-final { scan-tree-dump-times "Moving PHI node" 1 "lim3"  } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-9.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-9.c
index 8abc2c7..9d2e817 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-9.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-9.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fdump-tree-lim1-details" } */
+/* { dg-options "-O -fdump-tree-lim3-details" } */
 
 void bar (int);
 void foo (int n, int m)
@@ -16,4 +16,4 @@ void foo (int n, int m)
     }
 }
 
-/* { dg-final { scan-tree-dump-times "Moving PHI node" 1 "lim1"  } } */
+/* { dg-final { scan-tree-dump-times "Moving PHI node" 1 "lim3"  } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/structopt-1.c b/gcc/testsuite/gcc.dg/tree-ssa/structopt-1.c
index 0582e26..6abcb6c 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/structopt-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/structopt-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-lim1-details" } */
+/* { dg-options "-O2 -fdump-tree-lim3-details" } */
 int x; int y;
 struct { int x; int y; } global;
 int foo() {
@@ -10,5 +10,5 @@ int foo() {
 		global.y += global.x*global.x;
 }
 
-/* { dg-final { scan-tree-dump-times "Executing store motion of global.y" 1 "lim1" } } */
+/* { dg-final { scan-tree-dump-times "Executing store motion of global.y" 1 "lim3" } } */
 /* XXX: We should also check for the load motion of global.x, but there is no easy way to do this.  */
diff --git a/gcc/testsuite/gcc.dg/vect/pr26359.c b/gcc/testsuite/gcc.dg/vect/pr26359.c
index 597ee7e..5b445a9 100644
--- a/gcc/testsuite/gcc.dg/vect/pr26359.c
+++ b/gcc/testsuite/gcc.dg/vect/pr26359.c
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
 /* { dg-require-effective-target vect_int } */
-/* { dg-additional-options "-fdump-tree-dce5-details" } */
+/* { dg-additional-options "-fdump-tree-dce6-details" } */
 
 int a[256], b[256], c[256];
 
@@ -13,4 +13,4 @@ foo () {
   }
 }
 
-/* { dg-final { scan-tree-dump-times "Deleting : vect_" 0 "dce5" } } */
+/* { dg-final { scan-tree-dump-times "Deleting : vect_" 0 "dce6" } } */
diff --git a/gcc/testsuite/gfortran.dg/pr32921.f b/gcc/testsuite/gfortran.dg/pr32921.f
index 1c45d1e..e7264b7 100644
--- a/gcc/testsuite/gfortran.dg/pr32921.f
+++ b/gcc/testsuite/gfortran.dg/pr32921.f
@@ -1,5 +1,5 @@
 ! { dg-do compile }
-! { dg-options "-O2 -fdump-tree-lim1" }
+! { dg-options "-O2 -fdump-tree-lim3" }
 ! gfortran -c -m32 -O2 -S junk.f
 !
       MODULE LES3D_DATA
@@ -45,4 +45,4 @@
 
       RETURN
       END
-! { dg-final { scan-tree-dump-times "stride" 4 "lim1" } }
+! { dg-final { scan-tree-dump-times "stride" 4 "lim3" } }
-- 
1.9.1



* [PATCH, 12/16] Handle acc loop directive
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
                   ` (10 preceding siblings ...)
  2015-11-09 20:02 ` [PATCH, 11/16] Update testcases after adding kernels pass group Tom de Vries
@ 2015-11-09 20:06 ` Tom de Vries
  2015-11-24 12:30   ` [PING][PATCH, " Tom de Vries
  2015-11-09 20:08 ` [PATCH, 13/16] Add c-c++-common/goacc/kernels-*.c Tom de Vries
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 20:06 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 1733 bytes --]

On 09/11/15 16:35, Tom de Vries wrote:
> Hi,
>
> this patch series for stage1 trunk adds support to:
> - parallelize oacc kernels regions using parloops, and
> - map the loops onto the oacc gang dimension.
>
> The patch series contains these patches:
>
>       1    Insert new exit block only when needed in
>          transform_to_exit_first_loop_alt
>       2    Make create_parallel_loop return void
>       3    Ignore reduction clause on kernels directive
>       4    Implement -foffload-alias
>       5    Add in_oacc_kernels_region in struct loop
>       6    Add pass_oacc_kernels
>       7    Add pass_dominator_oacc_kernels
>       8    Add pass_ch_oacc_kernels
>       9    Add pass_parallelize_loops_oacc_kernels
>      10    Add pass_oacc_kernels pass group in passes.def
>      11    Update testcases after adding kernels pass group
>      12    Handle acc loop directive
>      13    Add c-c++-common/goacc/kernels-*.c
>      14    Add gfortran.dg/goacc/kernels-*.f95
>      15    Add libgomp.oacc-c-c++-common/kernels-*.c
>      16    Add libgomp.oacc-fortran/kernels-*.f95
>
> The first 9 patches are more or less independent, but patches 10-16 are
> intended to be committed at the same time.
>
> Bootstrapped and reg-tested on x86_64.
>
> Build and reg-tested with nvidia accelerator, in combination with a
> patch that enables accelerator testing (which is submitted at
> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>
> I'll post the individual patches in reply to this message.

This patch handles loops in an oacc kernels region that are annotated
with "#pragma acc loop". It expands such a loop as a normal loop, which
has the effect of ignoring the "#pragma acc loop".
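
As a minimal sketch of the kind of input this affects (the function name
vec_add and the size N are made up for illustration and do not come from
the patch or its testcases), consider a kernels region whose loop carries
an explicit loop directive. Under the behaviour described above, the loop
below should be expanded as if the "#pragma acc loop" line were absent:

```c
#define N 1024

/* Hypothetical example function; not taken from the patch.  */
void
vec_add (const int *a, const int *b, int *c)
{
#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
  {
    /* Inside a kernels region the directive below is effectively
       ignored: the loop is expanded as an ordinary sequential loop,
       and any parallelization is left to the later kernels passes.  */
#pragma acc loop
    for (int i = 0; i < N; i++)
      c[i] = a[i] + b[i];
  }
}
```

Built without -fopenacc, the pragmas are not acted on and the function is
plain C, so the result should be the same as for the sequential loop
either way.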

Thanks,
- Tom


[-- Attachment #2: 0012-Handle-acc-loop-directive.patch --]
[-- Type: text/x-patch, Size: 8736 bytes --]

Handle acc loop directive

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* omp-low.c (struct omp_region): Add inside_kernels_p field.
	(expand_omp_for_generic): Only set address taken for istart0
	and iend0 when necessary.  Adjust to generate a 'sequential' loop
	when GOMP builtin arguments are BUILT_IN_NONE.
	(expand_omp_for): Use expand_omp_for_generic() to generate a
	non-parallelized loop for OMP_FORs inside OpenACC kernels regions.
	(expand_omp): Mark inside_kernels_p field true for regions
	nested inside OpenACC kernels constructs.
---
 gcc/omp-low.c | 127 ++++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 87 insertions(+), 40 deletions(-)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 1283cc7..859a2eb 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -136,6 +136,9 @@ struct omp_region
   /* The ordered stmt if type is GIMPLE_OMP_ORDERED and it has
      a depend clause.  */
   gomp_ordered *ord_stmt;
+
+  /* True if this is nested inside an OpenACC kernels construct.  */
+  bool inside_kernels_p;
 };
 
 /* Context structure.  Used to store information about each parallel
@@ -8238,6 +8241,7 @@ expand_omp_for_generic (struct omp_region *region,
   gassign *assign_stmt;
   bool in_combined_parallel = is_combined_parallel (region);
   bool broken_loop = region->cont == NULL;
+  bool seq_loop = (start_fn == BUILT_IN_NONE || next_fn == BUILT_IN_NONE);
   edge e, ne;
   tree *counts = NULL;
   int i;
@@ -8335,8 +8339,12 @@ expand_omp_for_generic (struct omp_region *region,
   type = TREE_TYPE (fd->loop.v);
   istart0 = create_tmp_var (fd->iter_type, ".istart0");
   iend0 = create_tmp_var (fd->iter_type, ".iend0");
-  TREE_ADDRESSABLE (istart0) = 1;
-  TREE_ADDRESSABLE (iend0) = 1;
+
+  if (!seq_loop)
+    {
+      TREE_ADDRESSABLE (istart0) = 1;
+      TREE_ADDRESSABLE (iend0) = 1;
+    }
 
   /* See if we need to bias by LLONG_MIN.  */
   if (fd->iter_type == long_long_unsigned_type_node
@@ -8366,7 +8374,20 @@ expand_omp_for_generic (struct omp_region *region,
   gsi_prev (&gsif);
 
   tree arr = NULL_TREE;
-  if (in_combined_parallel)
+  if (seq_loop)
+    {
+      tree n1 = fold_convert (fd->iter_type, fd->loop.n1);
+      tree n2 = fold_convert (fd->iter_type, fd->loop.n2);
+
+      assign_stmt = gimple_build_assign (istart0, n1);
+      gsi_insert_before (&gsi, assign_stmt, GSI_SAME_STMT);
+
+      assign_stmt = gimple_build_assign (iend0, n2);
+      gsi_insert_before (&gsi, assign_stmt, GSI_SAME_STMT);
+
+      t = fold_build2 (NE_EXPR, boolean_type_node, istart0, iend0);
+    }
+  else if (in_combined_parallel)
     {
       gcc_assert (fd->ordered == 0);
       /* In a combined parallel loop, emit a call to
@@ -8788,39 +8809,45 @@ expand_omp_for_generic (struct omp_region *region,
 	collapse_bb = extract_omp_for_update_vars (fd, cont_bb, l1_bb);
 
       /* Emit code to get the next parallel iteration in L2_BB.  */
-      gsi = gsi_start_bb (l2_bb);
+      if (!seq_loop)
+	{
+	  gsi = gsi_start_bb (l2_bb);
 
-      t = build_call_expr (builtin_decl_explicit (next_fn), 2,
-			   build_fold_addr_expr (istart0),
-			   build_fold_addr_expr (iend0));
-      t = force_gimple_operand_gsi (&gsi, t, true, NULL_TREE,
-				    false, GSI_CONTINUE_LINKING);
-      if (TREE_TYPE (t) != boolean_type_node)
-	t = fold_build2 (NE_EXPR, boolean_type_node,
-			 t, build_int_cst (TREE_TYPE (t), 0));
-      gcond *cond_stmt = gimple_build_cond_empty (t);
-      gsi_insert_after (&gsi, cond_stmt, GSI_CONTINUE_LINKING);
+	  t = build_call_expr (builtin_decl_explicit (next_fn), 2,
+			       build_fold_addr_expr (istart0),
+			       build_fold_addr_expr (iend0));
+	  t = force_gimple_operand_gsi (&gsi, t, true, NULL_TREE,
+					false, GSI_CONTINUE_LINKING);
+	  if (TREE_TYPE (t) != boolean_type_node)
+	    t = fold_build2 (NE_EXPR, boolean_type_node,
+			     t, build_int_cst (TREE_TYPE (t), 0));
+	  gcond *cond_stmt = gimple_build_cond_empty (t);
+	  gsi_insert_after (&gsi, cond_stmt, GSI_CONTINUE_LINKING);
+	}
     }
 
   /* Add the loop cleanup function.  */
   gsi = gsi_last_bb (exit_bb);
-  if (gimple_omp_return_nowait_p (gsi_stmt (gsi)))
-    t = builtin_decl_explicit (BUILT_IN_GOMP_LOOP_END_NOWAIT);
-  else if (gimple_omp_return_lhs (gsi_stmt (gsi)))
-    t = builtin_decl_explicit (BUILT_IN_GOMP_LOOP_END_CANCEL);
-  else
-    t = builtin_decl_explicit (BUILT_IN_GOMP_LOOP_END);
-  gcall *call_stmt = gimple_build_call (t, 0);
-  if (gimple_omp_return_lhs (gsi_stmt (gsi)))
-    gimple_call_set_lhs (call_stmt, gimple_omp_return_lhs (gsi_stmt (gsi)));
-  gsi_insert_after (&gsi, call_stmt, GSI_SAME_STMT);
-  if (fd->ordered)
+  if (!seq_loop)
     {
-      tree arr = counts[fd->ordered];
-      tree clobber = build_constructor (TREE_TYPE (arr), NULL);
-      TREE_THIS_VOLATILE (clobber) = 1;
-      gsi_insert_after (&gsi, gimple_build_assign (arr, clobber),
-			GSI_SAME_STMT);
+      if (gimple_omp_return_nowait_p (gsi_stmt (gsi)))
+	t = builtin_decl_explicit (BUILT_IN_GOMP_LOOP_END_NOWAIT);
+      else if (gimple_omp_return_lhs (gsi_stmt (gsi)))
+	t = builtin_decl_explicit (BUILT_IN_GOMP_LOOP_END_CANCEL);
+      else
+	t = builtin_decl_explicit (BUILT_IN_GOMP_LOOP_END);
+      gcall *call_stmt = gimple_build_call (t, 0);
+      if (gimple_omp_return_lhs (gsi_stmt (gsi)))
+	gimple_call_set_lhs (call_stmt, gimple_omp_return_lhs (gsi_stmt (gsi)));
+      gsi_insert_after (&gsi, call_stmt, GSI_SAME_STMT);
+      if (fd->ordered)
+	{
+	  tree arr = counts[fd->ordered];
+	  tree clobber = build_constructor (TREE_TYPE (arr), NULL);
+	  TREE_THIS_VOLATILE (clobber) = 1;
+	  gsi_insert_after (&gsi, gimple_build_assign (arr, clobber),
+			    GSI_SAME_STMT);
+	}
     }
   gsi_remove (&gsi, true);
 
@@ -8833,7 +8860,9 @@ expand_omp_for_generic (struct omp_region *region,
       gimple_seq phis;
 
       e = find_edge (cont_bb, l3_bb);
-      ne = make_edge (l2_bb, l3_bb, EDGE_FALSE_VALUE);
+      ne = make_edge (l2_bb, l3_bb, (seq_loop
+				     ? EDGE_FALLTHRU
+				     : EDGE_FALSE_VALUE));
 
       phis = phi_nodes (l3_bb);
       for (gsi = gsi_start (phis); !gsi_end_p (gsi); gsi_next (&gsi))
@@ -8873,7 +8902,8 @@ expand_omp_for_generic (struct omp_region *region,
 	  e = find_edge (cont_bb, l2_bb);
 	  e->flags = EDGE_FALLTHRU;
 	}
-      make_edge (l2_bb, l0_bb, EDGE_TRUE_VALUE);
+      if (!seq_loop)
+	make_edge (l2_bb, l0_bb, EDGE_TRUE_VALUE);
 
       if (gimple_in_ssa_p (cfun))
 	{
@@ -8929,12 +8959,16 @@ expand_omp_for_generic (struct omp_region *region,
 
       add_bb_to_loop (l2_bb, outer_loop);
 
-      /* We've added a new loop around the original loop.  Allocate the
-	 corresponding loop struct.  */
-      struct loop *new_loop = alloc_loop ();
-      new_loop->header = l0_bb;
-      new_loop->latch = l2_bb;
-      add_loop (new_loop, outer_loop);
+      struct loop *new_loop = NULL;
+      if (!seq_loop)
+	{
+	  /* We've added a new loop around the original loop.  Allocate the
+	     corresponding loop struct.  */
+	  new_loop = alloc_loop ();
+	  new_loop->header = l0_bb;
+	  new_loop->latch = l2_bb;
+	  add_loop (new_loop, outer_loop);
+	}
 
       /* Allocate a loop structure for the original loop unless we already
 	 had one.  */
@@ -8944,7 +8978,9 @@ expand_omp_for_generic (struct omp_region *region,
 	  struct loop *orig_loop = alloc_loop ();
 	  orig_loop->header = l1_bb;
 	  /* The loop may have multiple latches.  */
-	  add_loop (orig_loop, new_loop);
+	  add_loop (orig_loop, (new_loop != NULL
+				? new_loop
+				: outer_loop));
 	}
     }
 }
@@ -11348,7 +11384,10 @@ expand_omp_for (struct omp_region *region, gimple *inner_stmt)
        original loops from being detected.  Fix that up.  */
     loops_state_set (LOOPS_NEED_FIXUP);
 
-  if (gimple_omp_for_kind (fd.for_stmt) & GF_OMP_FOR_SIMD)
+  if (region->inside_kernels_p)
+    expand_omp_for_generic (region, &fd, BUILT_IN_NONE, BUILT_IN_NONE,
+			    inner_stmt);
+  else if (gimple_omp_for_kind (fd.for_stmt) & GF_OMP_FOR_SIMD)
     expand_omp_simd (region, &fd);
   else if (gimple_omp_for_kind (fd.for_stmt) == GF_OMP_FOR_KIND_CILKFOR)
     expand_cilk_for (region, &fd);
@@ -13030,6 +13069,14 @@ expand_omp (struct omp_region *region)
       if (region->type == GIMPLE_OMP_PARALLEL)
 	determine_parallel_type (region);
 
+      if (region->type == GIMPLE_OMP_TARGET && region->inner)
+	{
+	  gomp_target *entry = as_a <gomp_target *> (last_stmt (region->entry));
+	  if (gimple_omp_target_kind (entry) == GF_OMP_TARGET_KIND_OACC_KERNELS
+	      || region->inside_kernels_p)
+	    region->inner->inside_kernels_p = true;
+	}
+
       if (region->type == GIMPLE_OMP_FOR
 	  && gimple_omp_for_combined_p (last_stmt (region->entry)))
 	inner_stmt = last_stmt (region->inner->entry);
-- 
1.9.1



* [PATCH, 13/16] Add c-c++-common/goacc/kernels-*.c
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
                   ` (11 preceding siblings ...)
  2015-11-09 20:06 ` [PATCH, 12/16] Handle acc loop directive Tom de Vries
@ 2015-11-09 20:08 ` Tom de Vries
  2016-01-18 13:33   ` [committed] Add oacc kernels tests in goacc Tom de Vries
  2015-11-09 20:09 ` [PATCH, 14/16] Add gfortran.dg/goacc/kernels-*.f95 Tom de Vries
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 20:08 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 1587 bytes --]

On 09/11/15 16:35, Tom de Vries wrote:
> Hi,
>
> this patch series for stage1 trunk adds support to:
> - parallelize oacc kernels regions using parloops, and
> - map the loops onto the oacc gang dimension.
>
> The patch series contains these patches:
>
>       1    Insert new exit block only when needed in
>          transform_to_exit_first_loop_alt
>       2    Make create_parallel_loop return void
>       3    Ignore reduction clause on kernels directive
>       4    Implement -foffload-alias
>       5    Add in_oacc_kernels_region in struct loop
>       6    Add pass_oacc_kernels
>       7    Add pass_dominator_oacc_kernels
>       8    Add pass_ch_oacc_kernels
>       9    Add pass_parallelize_loops_oacc_kernels
>      10    Add pass_oacc_kernels pass group in passes.def
>      11    Update testcases after adding kernels pass group
>      12    Handle acc loop directive
>      13    Add c-c++-common/goacc/kernels-*.c
>      14    Add gfortran.dg/goacc/kernels-*.f95
>      15    Add libgomp.oacc-c-c++-common/kernels-*.c
>      16    Add libgomp.oacc-fortran/kernels-*.f95
>
> The first 9 patches are more or less independent, but patches 10-16 are
> intended to be committed at the same time.
>
> Bootstrapped and reg-tested on x86_64.
>
> Build and reg-tested with nvidia accelerator, in combination with a
> patch that enables accelerator testing (which is submitted at
> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>
> I'll post the individual patches in reply to this message.

This patch adds C/C++ oacc kernels compilation tests.

Thanks,
- Tom


[-- Attachment #2: 0013-Add-c-c-common-goacc-kernels-.c.patch --]
[-- Type: text/x-patch, Size: 46798 bytes --]

Add c-c++-common/goacc/kernels-*.c

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* c-c++-common/goacc/kernels-acc-loop-reduction.c: New test.
	* c-c++-common/goacc/kernels-acc-loop-smaller-equal.c: New test.
	* c-c++-common/goacc/kernels-counter-var-redundant-load.c: New test.
	* c-c++-common/goacc/kernels-counter-vars-function-scope.c: New test.
	* c-c++-common/goacc/kernels-double-reduction.c: New test.
	* c-c++-common/goacc/kernels-empty.c: New test.
	* c-c++-common/goacc/kernels-eternal.c: New test.
	* c-c++-common/goacc/kernels-loop-2-acc-loop.c: New test.
	* c-c++-common/goacc/kernels-loop-2.c: New test.
	* c-c++-common/goacc/kernels-loop-3-acc-loop.c: New test.
	* c-c++-common/goacc/kernels-loop-3.c: New test.
	* c-c++-common/goacc/kernels-loop-acc-loop.c: New test.
	* c-c++-common/goacc/kernels-loop-data-2.c: New test.
	* c-c++-common/goacc/kernels-loop-data-enter-exit-2.c: New test.
	* c-c++-common/goacc/kernels-loop-data-enter-exit.c: New test.
	* c-c++-common/goacc/kernels-loop-data-update.c: New test.
	* c-c++-common/goacc/kernels-loop-data.c: New test.
	* c-c++-common/goacc/kernels-loop-g.c: New test.
	* c-c++-common/goacc/kernels-loop-mod-not-zero.c: New test.
	* c-c++-common/goacc/kernels-loop-n-acc-loop.c: New test.
	* c-c++-common/goacc/kernels-loop-n.c: New test.
	* c-c++-common/goacc/kernels-loop-nest.c: New test.
	* c-c++-common/goacc/kernels-loop.c: New test.
	* c-c++-common/goacc/kernels-noreturn.c: New test.
	* c-c++-common/goacc/kernels-one-counter-var.c: New test.
	* c-c++-common/goacc/kernels-parallel-loop-data-enter-exit.c: New test.
	* c-c++-common/goacc/kernels-reduction.c: New test.
---
 .../goacc/kernels-acc-loop-reduction.c             | 25 ++++++++
 .../goacc/kernels-acc-loop-smaller-equal.c         | 25 ++++++++
 .../goacc/kernels-counter-var-redundant-load.c     | 36 +++++++++++
 .../goacc/kernels-counter-vars-function-scope.c    | 54 +++++++++++++++++
 .../c-c++-common/goacc/kernels-double-reduction.c  | 37 ++++++++++++
 gcc/testsuite/c-c++-common/goacc/kernels-empty.c   |  6 ++
 gcc/testsuite/c-c++-common/goacc/kernels-eternal.c | 11 ++++
 .../c-c++-common/goacc/kernels-loop-2-acc-loop.c   | 21 +++++++
 gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c  | 70 ++++++++++++++++++++++
 .../c-c++-common/goacc/kernels-loop-3-acc-loop.c   | 17 ++++++
 gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c  | 49 +++++++++++++++
 .../c-c++-common/goacc/kernels-loop-acc-loop.c     | 17 ++++++
 .../c-c++-common/goacc/kernels-loop-data-2.c       | 70 ++++++++++++++++++++++
 .../goacc/kernels-loop-data-enter-exit-2.c         | 68 +++++++++++++++++++++
 .../goacc/kernels-loop-data-enter-exit.c           | 65 ++++++++++++++++++++
 .../c-c++-common/goacc/kernels-loop-data-update.c  | 65 ++++++++++++++++++++
 .../c-c++-common/goacc/kernels-loop-data.c         | 64 ++++++++++++++++++++
 gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c  | 17 ++++++
 .../c-c++-common/goacc/kernels-loop-mod-not-zero.c | 52 ++++++++++++++++
 .../c-c++-common/goacc/kernels-loop-n-acc-loop.c   | 17 ++++++
 gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c  | 56 +++++++++++++++++
 .../c-c++-common/goacc/kernels-loop-nest.c         | 39 ++++++++++++
 gcc/testsuite/c-c++-common/goacc/kernels-loop.c    | 56 +++++++++++++++++
 .../c-c++-common/goacc/kernels-noreturn.c          | 12 ++++
 .../c-c++-common/goacc/kernels-one-counter-var.c   | 54 +++++++++++++++++
 .../goacc/kernels-parallel-loop-data-enter-exit.c  | 66 ++++++++++++++++++++
 .../c-c++-common/goacc/kernels-reduction.c         | 36 +++++++++++
 27 files changed, 1105 insertions(+)
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-reduction.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-smaller-equal.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-counter-var-redundant-load.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-counter-vars-function-scope.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-double-reduction.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-empty.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-eternal.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-2-acc-loop.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-3-acc-loop.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-acc-loop.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-data-2.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-data-enter-exit-2.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-data-enter-exit.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-data-update.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-data.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-mod-not-zero.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-n-acc-loop.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop-nest.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-loop.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-noreturn.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-one-counter-var.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-parallel-loop-data-enter-exit.c
 create mode 100644 gcc/testsuite/c-c++-common/goacc/kernels-reduction.c

diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-reduction.c b/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-reduction.c
new file mode 100644
index 0000000..dcc5891
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-reduction.c
@@ -0,0 +1,25 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+unsigned int
+foo (int n, unsigned int *a)
+{
+  unsigned int sum = 0;
+
+#pragma acc kernels loop gang reduction(+:sum)
+  for (int i = 0; i < n; i++)
+    sum += a[i];
+
+  return sum;
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*\\._omp_fn\\.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-smaller-equal.c b/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-smaller-equal.c
new file mode 100644
index 0000000..c05c694
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-acc-loop-smaller-equal.c
@@ -0,0 +1,25 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+unsigned int
+foo (int n)
+{
+  unsigned int sum = 1;
+
+  #pragma acc kernels loop
+  for (int i = 1; i <= n; i++)
+    sum += i;
+
+  return sum;
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*\\._omp_fn\\.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-counter-var-redundant-load.c b/gcc/testsuite/c-c++-common/goacc/kernels-counter-var-redundant-load.c
new file mode 100644
index 0000000..ad101dd
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-counter-var-redundant-load.c
@@ -0,0 +1,36 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-dom_oacc_kernels3" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+COUNTERTYPE
+foo (unsigned int *c)
+{
+  COUNTERTYPE ii;
+
+#pragma acc kernels copyout (c[0:N])
+  {
+    for (ii = 0; ii < N; ii++)
+      c[ii] = 1;
+  }
+
+  return ii;
+}
+
+/* We're expecting:
+
+   .omp_data_i_10 = &.omp_data_arr.3;
+   _11 = .omp_data_i_10->ii;
+   *_11 = 0;
+   _15 = .omp_data_i_10->c;
+   c.1_16 = *_15;
+
+   Check that there is one load from an anonymous ssa-name, which we assume to
+   be:
+   - the one to read c.  */
+
+/* { dg-final { scan-tree-dump-times "(?n)\\*_\[0-9\]\[0-9\]*;$" 1 "dom_oacc_kernels3" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-counter-vars-function-scope.c b/gcc/testsuite/c-c++-common/goacc/kernels-counter-vars-function-scope.c
new file mode 100644
index 0000000..650fb8ca
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-counter-vars-function-scope.c
@@ -0,0 +1,54 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+  COUNTERTYPE i;
+  COUNTERTYPE ii;
+
+  a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+  for (i = 0; i < N; i++)
+    a[i] = i * 2;
+
+  for (i = 0; i < N; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+    for (ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-double-reduction.c b/gcc/testsuite/c-c++-common/goacc/kernels-double-reduction.c
new file mode 100644
index 0000000..da20f34
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-double-reduction.c
@@ -0,0 +1,37 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N 500
+
+unsigned int a[N][N];
+
+void  __attribute__((noinline,noclone))
+foo (void)
+{
+  int i, j;
+  unsigned int sum = 1;
+
+#pragma acc kernels copyin (a[0:N]) copy (sum)
+  {
+    for (i = 0; i < N; ++i)
+      for (j = 0; j < N; ++j)
+	sum += a[i][j];
+  }
+
+  if (sum != 5001)
+    abort ();
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-empty.c b/gcc/testsuite/c-c++-common/goacc/kernels-empty.c
new file mode 100644
index 0000000..e91b81c
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-empty.c
@@ -0,0 +1,6 @@
+void
+foo (void)
+{
+#pragma acc kernels
+  ;
+}
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-eternal.c b/gcc/testsuite/c-c++-common/goacc/kernels-eternal.c
new file mode 100644
index 0000000..edc17d2
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-eternal.c
@@ -0,0 +1,11 @@
+int
+main (void)
+{
+#pragma acc kernels
+  {
+    while (1)
+      ;
+  }
+
+  return 0;
+}
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-2-acc-loop.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-2-acc-loop.c
new file mode 100644
index 0000000..6a4fb1f
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-2-acc-loop.c
@@ -0,0 +1,21 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+/* Check that loops tagged with '#pragma acc loop' get properly parallelized.  */
+#define ACC_LOOP
+#include "kernels-loop-2.c"
+
+/* Check that only three loops are analyzed, and that all can be
+   parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loops have been split off into functions.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.1" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.2" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 3 "parloops_oacc_kernels" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c
new file mode 100644
index 0000000..514591e
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-2.c
@@ -0,0 +1,70 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+#pragma acc kernels copyout (a[0:N])
+  {
+#ifdef ACC_LOOP
+    #pragma acc loop
+#endif
+    for (COUNTERTYPE i = 0; i < N; i++)
+      a[i] = i * 2;
+  }
+
+#pragma acc kernels copyout (b[0:N])
+  {
+#ifdef ACC_LOOP
+    #pragma acc loop
+#endif
+    for (COUNTERTYPE i = 0; i < N; i++)
+      b[i] = i * 4;
+  }
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+#ifdef ACC_LOOP
+    #pragma acc loop
+#endif
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only three loops are analyzed, and that all can be
+   parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loops have been split off into functions.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.1" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.2" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 3 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-3-acc-loop.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-3-acc-loop.c
new file mode 100644
index 0000000..a9e81ee
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-3-acc-loop.c
@@ -0,0 +1,17 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+/* Check that loops tagged with '#pragma acc loop' get properly parallelized.  */
+#define ACC_LOOP
+#include "kernels-loop-3.c"
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c
new file mode 100644
index 0000000..790add9
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-3.c
@@ -0,0 +1,49 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int i;
+
+  unsigned int *__restrict c;
+
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    c[i] = i * 2;
+
+#pragma acc kernels copy (c[0:N])
+  {
+#ifdef ACC_LOOP
+    #pragma acc loop
+#endif
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = c[ii] + ii + 1;
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != i * 2 + i + 1)
+      abort ();
+
+  free (c);
+
+  return 0;
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-acc-loop.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-acc-loop.c
new file mode 100644
index 0000000..516598f
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-acc-loop.c
@@ -0,0 +1,17 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+/* Check that loops tagged with '#pragma acc loop' get properly parallelized.  */
+#define ACC_LOOP
+#include "kernels-loop.c"
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-data-2.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-data-2.c
new file mode 100644
index 0000000..095ed6c
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-data-2.c
@@ -0,0 +1,70 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+#pragma acc data copyout (a[0:N])
+  {
+#pragma acc kernels present (a[0:N])
+    {
+      for (COUNTERTYPE i = 0; i < N; i++)
+	a[i] = i * 2;
+    }
+  }
+
+#pragma acc data copyout (b[0:N])
+  {
+#pragma acc kernels present (b[0:N])
+    {
+      for (COUNTERTYPE i = 0; i < N; i++)
+	b[i] = i * 4;
+    }
+  }
+
+#pragma acc data copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
+    {
+      for (COUNTERTYPE ii = 0; ii < N; ii++)
+	c[ii] = a[ii] + b[ii];
+    }
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only three loops are analyzed, and that all can be
+   parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loops have been split off into functions.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.1" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.2" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 3 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-data-enter-exit-2.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-data-enter-exit-2.c
new file mode 100644
index 0000000..9efffac
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-data-enter-exit-2.c
@@ -0,0 +1,68 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+#pragma acc enter data create (a[0:N])
+#pragma acc kernels present (a[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      a[i] = i * 2;
+  }
+#pragma acc exit data copyout (a[0:N])
+
+#pragma acc enter data create (b[0:N])
+#pragma acc kernels present (b[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      b[i] = i * 4;
+  }
+#pragma acc exit data copyout (b[0:N])
+
+
+#pragma acc enter data copyin (a[0:N], b[0:N]) create (c[0:N])
+#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+#pragma acc exit data copyout (c[0:N])
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only three loops are analyzed, and that all can be
+   parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loops have been split off into functions.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.1" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.2" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 3 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-data-enter-exit.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-data-enter-exit.c
new file mode 100644
index 0000000..2da20b4
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-data-enter-exit.c
@@ -0,0 +1,65 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+#pragma acc enter data create (a[0:N], b[0:N], c[0:N])
+
+#pragma acc kernels present (a[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      a[i] = i * 2;
+  }
+
+#pragma acc kernels present (b[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      b[i] = i * 4;
+  }
+
+#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+#pragma acc exit data copyout (a[0:N], c[0:N])
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only three loops are analyzed, and that all can be
+   parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loops have been split off into functions.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.1" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.2" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 3 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-data-update.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-data-update.c
new file mode 100644
index 0000000..09b63e5
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-data-update.c
@@ -0,0 +1,65 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+#pragma acc enter data create (a[0:N], b[0:N], c[0:N])
+
+#pragma acc kernels present (a[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      a[i] = i * 2;
+  }
+
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      b[i] = i * 4;
+  }
+
+#pragma acc update device (b[0:N])
+
+#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+#pragma acc exit data copyout (a[0:N], c[0:N])
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only two loops are analyzed, and that both can be
+   parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loops have been split off into functions.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.1" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 2 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-data.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-data.c
new file mode 100644
index 0000000..437fd73
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-data.c
@@ -0,0 +1,64 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+#pragma acc data copyout (a[0:N], b[0:N], c[0:N])
+  {
+#pragma acc kernels present (a[0:N])
+    {
+      for (COUNTERTYPE i = 0; i < N; i++)
+	a[i] = i * 2;
+    }
+
+#pragma acc kernels present (b[0:N])
+    {
+      for (COUNTERTYPE i = 0; i < N; i++)
+	b[i] = i * 4;
+    }
+
+#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
+    {
+      for (COUNTERTYPE ii = 0; ii < N; ii++)
+	c[ii] = a[ii] + b[ii];
+    }
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only three loops are analyzed, and that all can be
+   parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loops have been split off into functions.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.1" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.2" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 3 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c
new file mode 100644
index 0000000..27e23f8
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-g.c
@@ -0,0 +1,17 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-g" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include "kernels-loop.c"
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-mod-not-zero.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-mod-not-zero.c
new file mode 100644
index 0000000..940341d
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-mod-not-zero.c
@@ -0,0 +1,52 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N ((1024 * 512) + 1)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    a[i] = i * 2;
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-n-acc-loop.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-n-acc-loop.c
new file mode 100644
index 0000000..64e59a2
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-n-acc-loop.c
@@ -0,0 +1,17 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+/* Check that loops tagged with '#pragma acc loop' get properly parallelized.  */
+#define ACC_LOOP
+#include "kernels-loop-n.c"
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c
new file mode 100644
index 0000000..73c6142
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-n.c
@@ -0,0 +1,56 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N ((1024 * 512) + 1)
+#define COUNTERTYPE unsigned int
+
+int
+foo (COUNTERTYPE n)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (n * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (n * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (n * sizeof (unsigned int));
+
+  for (COUNTERTYPE i = 0; i < n; i++)
+    a[i] = i * 2;
+
+  for (COUNTERTYPE i = 0; i < n; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:n], b[0:n]) copyout (c[0:n])
+  {
+#ifdef ACC_LOOP
+    #pragma acc loop
+#endif
+    for (COUNTERTYPE ii = 0; ii < n; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < n; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop-nest.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop-nest.c
new file mode 100644
index 0000000..d2aeda6
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop-nest.c
@@ -0,0 +1,39 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+/* Based on autopar/outer-1.c.  */
+
+#include <stdlib.h>
+
+#define N 1000
+
+int
+main (void)
+{
+  int x[N][N];
+
+#pragma acc kernels copyout (x)
+  {
+    for (int ii = 0; ii < N; ii++)
+      for (int jj = 0; jj < N; jj++)
+	x[ii][jj] = ii + jj + 3;
+  }
+
+  for (int i = 0; i < N; i++)
+    for (int j = 0; j < N; j++)
+      if (x[i][j] != i + j + 3)
+	abort ();
+
+  return 0;
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-loop.c b/gcc/testsuite/c-c++-common/goacc/kernels-loop.c
new file mode 100644
index 0000000..925a84e
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-loop.c
@@ -0,0 +1,56 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    a[i] = i * 2;
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+#ifdef ACC_LOOP
+    #pragma acc loop
+#endif
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-noreturn.c b/gcc/testsuite/c-c++-common/goacc/kernels-noreturn.c
new file mode 100644
index 0000000..1a8cc67
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-noreturn.c
@@ -0,0 +1,12 @@
+int
+main (void)
+{
+
+#pragma acc kernels
+  {
+    __builtin_abort ();
+  }
+
+  return 0;
+}
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-one-counter-var.c b/gcc/testsuite/c-c++-common/goacc/kernels-one-counter-var.c
new file mode 100644
index 0000000..b000a8c
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-one-counter-var.c
@@ -0,0 +1,54 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+  COUNTERTYPE i;
+
+  a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+  for (i = 0; i < N; i++)
+    a[i] = i * 2;
+
+  for (i = 0; i < N; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+    for (i = 0; i < N; i++)
+      c[i] = a[i] + b[i];
+  }
+
+  for (i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-parallel-loop-data-enter-exit.c b/gcc/testsuite/c-c++-common/goacc/kernels-parallel-loop-data-enter-exit.c
new file mode 100644
index 0000000..31b06bd
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-parallel-loop-data-enter-exit.c
@@ -0,0 +1,66 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+#pragma acc enter data create (a[0:N], b[0:N], c[0:N])
+
+#pragma acc kernels present (a[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      a[i] = i * 2;
+  }
+
+#pragma acc parallel present (b[0:N])
+  {
+#pragma acc loop
+    for (COUNTERTYPE i = 0; i < N; i++)
+      b[i] = i * 4;
+  }
+
+#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+#pragma acc exit data copyout (a[0:N], b[0:N], c[0:N])
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+/* Check that only two loops are analyzed, and that both can be
+   parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loops have been split off into functions.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.0" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.1" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*main._omp_fn.2" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 2 "parloops_oacc_kernels" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-reduction.c b/gcc/testsuite/c-c++-common/goacc/kernels-reduction.c
new file mode 100644
index 0000000..6a0b7a2
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-reduction.c
@@ -0,0 +1,36 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+#include <stdlib.h>
+
+#define n 10000
+
+unsigned int a[n];
+
+void __attribute__((noinline,noclone))
+foo (void)
+{
+  int i;
+  unsigned int sum = 1;
+
+#pragma acc kernels copyin (a[0:n]) copy (sum)
+  {
+    for (i = 0; i < n; ++i)
+      sum += a[i];
+  }
+
+  if (sum != 5001)
+    abort ();
+}
+
+/* Check that only one loop is analyzed, and that it can be parallelized.  */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } } */
+/* { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } } */
+
+/* Check that the loop has been split off into a function.  */
+/* { dg-final { scan-tree-dump-times "(?n);; Function .*foo.*._omp_fn.0" 1 "optimized" } } */
+
+/* { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } } */
+
-- 
1.9.1


^ permalink raw reply	[flat|nested] 133+ messages in thread

* [PATCH, 14/16] Add gfortran.dg/goacc/kernels-*.f95
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
                   ` (12 preceding siblings ...)
  2015-11-09 20:08 ` [PATCH, 13/16] Add c-c++-common/goacc/kernels-*.c Tom de Vries
@ 2015-11-09 20:09 ` Tom de Vries
  2015-11-09 20:11 ` [PATCH, 15/16] Add libgomp.oacc-c-c++-common/kernels-*.c Tom de Vries
  2015-11-09 20:12 ` [PATCH, 16/16] Add libgomp.oacc-fortran/kernels-*.f95 Tom de Vries
  15 siblings, 0 replies; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 20:09 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 1589 bytes --]

On 09/11/15 16:35, Tom de Vries wrote:
> Hi,
>
> this patch series for stage1 trunk adds support to:
> - parallelize oacc kernels regions using parloops, and
> - map the loops onto the oacc gang dimension.
>
> The patch series contains these patches:
>
>       1    Insert new exit block only when needed in
>          transform_to_exit_first_loop_alt
>       2    Make create_parallel_loop return void
>       3    Ignore reduction clause on kernels directive
>       4    Implement -foffload-alias
>       5    Add in_oacc_kernels_region in struct loop
>       6    Add pass_oacc_kernels
>       7    Add pass_dominator_oacc_kernels
>       8    Add pass_ch_oacc_kernels
>       9    Add pass_parallelize_loops_oacc_kernels
>      10    Add pass_oacc_kernels pass group in passes.def
>      11    Update testcases after adding kernels pass group
>      12    Handle acc loop directive
>      13    Add c-c++-common/goacc/kernels-*.c
>      14    Add gfortran.dg/goacc/kernels-*.f95
>      15    Add libgomp.oacc-c-c++-common/kernels-*.c
>      16    Add libgomp.oacc-fortran/kernels-*.f95
>
> The first 9 patches are more or less independent, but patches 10-16 are
> intended to be committed at the same time.
>
> Bootstrapped and reg-tested on x86_64.
>
> Build and reg-tested with nvidia accelerator, in combination with a
> patch that enables accelerator testing (which is submitted at
> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>
> I'll post the individual patches in reply to this message.

This patch adds Fortran oacc kernels compilation tests.
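
As a rough illustration of what the dg-final checks in these tests amount
to, here is a minimal C sketch.  It assumes literal substring matching,
whereas scan-tree-dump-times and scan-tree-dump-not take regular
expressions.  The helper names count_occurrences and dump_passes are made
up for this example and are not part of the testsuite.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count non-overlapping occurrences of NEEDLE in HAYSTACK.
   Hypothetical helper: literal matching only, no regexp support.  */
size_t
count_occurrences (const char *haystack, const char *needle)
{
  size_t count = 0;
  size_t len;

  assert (haystack != NULL && needle != NULL);

  len = strlen (needle);
  if (len == 0)
    return 0;

  for (const char *p = strstr (haystack, needle); p != NULL;
       p = strstr (p + len, needle))
    count++;

  return count;
}

/* Return 1 if DUMP reports exactly EXPECTED_LOOPS parallelizable loops
   and no failure, 0 otherwise.  This mirrors the pair of checks used on
   the parloops_oacc_kernels dump in the tests.  */
int
dump_passes (const char *dump, size_t expected_loops)
{
  return (count_occurrences (dump, "SUCCESS: may be parallelized")
	  == expected_loops
	  && count_occurrences (dump, "FAILED:") == 0);
}
```

For a test with three kernels regions, the expectation would correspond
to dump_passes (dump, 3) holding for the text of the parloops dump.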

Thanks,
- Tom


[-- Attachment #2: 0014-Add-gfortran.dg-goacc-kernels-.f95.patch --]
[-- Type: text/x-patch, Size: 16544 bytes --]

Add gfortran.dg/goacc/kernels-*.f95

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* gfortran.dg/goacc/kernels-loop-2.f95: New test.
	* gfortran.dg/goacc/kernels-loop-data-2.f95: New test.
	* gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95: New test.
	* gfortran.dg/goacc/kernels-loop-data-enter-exit.f95: New test.
	* gfortran.dg/goacc/kernels-loop-data-update.f95: New test.
	* gfortran.dg/goacc/kernels-loop-data.f95: New test.
	* gfortran.dg/goacc/kernels-loop.f95: New test.
	* gfortran.dg/goacc/kernels-parallel-loop-data-enter-exit.f95: New test.
---
 gcc/testsuite/gfortran.dg/goacc/kernels-loop-2.f95 | 45 +++++++++++++++++++
 .../gfortran.dg/goacc/kernels-loop-data-2.f95      | 51 ++++++++++++++++++++++
 .../goacc/kernels-loop-data-enter-exit-2.f95       | 51 ++++++++++++++++++++++
 .../goacc/kernels-loop-data-enter-exit.f95         | 49 +++++++++++++++++++++
 .../gfortran.dg/goacc/kernels-loop-data-update.f95 | 48 ++++++++++++++++++++
 .../gfortran.dg/goacc/kernels-loop-data.f95        | 49 +++++++++++++++++++++
 gcc/testsuite/gfortran.dg/goacc/kernels-loop.f95   | 39 +++++++++++++++++
 .../kernels-parallel-loop-data-enter-exit.f95      | 50 +++++++++++++++++++++
 8 files changed, 382 insertions(+)
 create mode 100644 gcc/testsuite/gfortran.dg/goacc/kernels-loop-2.f95
 create mode 100644 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-2.f95
 create mode 100644 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95
 create mode 100644 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit.f95
 create mode 100644 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-update.f95
 create mode 100644 gcc/testsuite/gfortran.dg/goacc/kernels-loop-data.f95
 create mode 100644 gcc/testsuite/gfortran.dg/goacc/kernels-loop.f95
 create mode 100644 gcc/testsuite/gfortran.dg/goacc/kernels-parallel-loop-data-enter-exit.f95

diff --git a/gcc/testsuite/gfortran.dg/goacc/kernels-loop-2.f95 b/gcc/testsuite/gfortran.dg/goacc/kernels-loop-2.f95
new file mode 100644
index 0000000..7fd6d4e
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/goacc/kernels-loop-2.f95
@@ -0,0 +1,45 @@
+! { dg-additional-options "-O2" }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" }
+! { dg-additional-options "-fdump-tree-optimized" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc kernels copyout (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+
+  !$acc kernels copyout (b(0:n-1))
+  do i = 0, n - 1
+     b(i) = i * 4
+  end do
+  !$acc end kernels
+
+  !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
+
+! Check that only three loops are analyzed, and that all can be parallelized.
+! { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops_oacc_kernels" } }
+! { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } }
+
+! Check that the loops have been split off into functions.
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.2 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 3 "parloops_oacc_kernels" } }
diff --git a/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-2.f95 b/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-2.f95
new file mode 100644
index 0000000..f788f67
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-2.f95
@@ -0,0 +1,51 @@
+! { dg-additional-options "-O2" }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" }
+! { dg-additional-options "-fdump-tree-optimized" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc data copyout (a(0:n-1))
+  !$acc kernels present (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+  !$acc end data
+
+  !$acc data copyout (b(0:n-1))
+  !$acc kernels present (b(0:n-1))
+  do i = 0, n - 1
+     b(i) = i * 4
+  end do
+  !$acc end kernels
+  !$acc end data
+
+  !$acc data copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
+  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+  !$acc end data
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
+
+! Check that only three loops are analyzed, and that all can be parallelized.
+! { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops_oacc_kernels" } }
+! { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } }
+
+! Check that the loops have been split off into functions.
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.2 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 3 "parloops_oacc_kernels" } }
diff --git a/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95 b/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95
new file mode 100644
index 0000000..3599052
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95
@@ -0,0 +1,51 @@
+! { dg-additional-options "-O2" }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" }
+! { dg-additional-options "-fdump-tree-optimized" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc enter data create (a(0:n-1))
+  !$acc kernels present (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+  !$acc exit data copyout (a(0:n-1))
+
+  !$acc enter data create (b(0:n-1))
+  !$acc kernels present (b(0:n-1))
+  do i = 0, n - 1
+     b(i) = i * 4
+  end do
+  !$acc end kernels
+  !$acc exit data copyout (b(0:n-1))
+
+  !$acc enter data copyin (a(0:n-1), b(0:n-1)) create (c(0:n-1))
+  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+  !$acc exit data copyout (c(0:n-1))
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
+
+! Check that only three loops are analyzed, and that all can be parallelized.
+! { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops_oacc_kernels" } }
+! { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } }
+
+! Check that the loops have been split off into functions.
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.2 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 3 "parloops_oacc_kernels" } }
diff --git a/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit.f95 b/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit.f95
new file mode 100644
index 0000000..562422e
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-enter-exit.f95
@@ -0,0 +1,49 @@
+! { dg-additional-options "-O2" }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" }
+! { dg-additional-options "-fdump-tree-optimized" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc enter data create (a(0:n-1), b(0:n-1), c(0:n-1))
+
+  !$acc kernels present (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+
+  !$acc kernels present (b(0:n-1))
+  do i = 0, n - 1
+     b(i) = i * 4
+  end do
+  !$acc end kernels
+
+  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+
+  !$acc exit data copyout (a(0:n-1), b(0:n-1), c(0:n-1))
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
+
+! Check that only three loops are analyzed, and that all can be parallelized.
+! { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops_oacc_kernels" } }
+! { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } }
+
+! Check that the loops have been split off into functions.
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.2 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 3 "parloops_oacc_kernels" } }
diff --git a/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-update.f95 b/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-update.f95
new file mode 100644
index 0000000..ed18fe1
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data-update.f95
@@ -0,0 +1,48 @@
+! { dg-additional-options "-O2" }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" }
+! { dg-additional-options "-fdump-tree-optimized" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc enter data create (a(0:n-1), b(0:n-1), c(0:n-1))
+
+  !$acc kernels present (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+
+  do i = 0, n - 1
+     b(i) = i * 4
+  end do
+
+  !$acc update device (b(0:n-1))
+
+  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+
+  !$acc exit data copyout (a(0:n-1), c(0:n-1))
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
+
+! Check that only two loops are analyzed, and that both can be parallelized.
+! { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloops_oacc_kernels" } }
+! { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } }
+
+! Check that the loops have been split off into functions.
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 2 "parloops_oacc_kernels" } }
diff --git a/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data.f95 b/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data.f95
new file mode 100644
index 0000000..177aa64
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/goacc/kernels-loop-data.f95
@@ -0,0 +1,49 @@
+! { dg-additional-options "-O2" }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" }
+! { dg-additional-options "-fdump-tree-optimized" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc data copyout (a(0:n-1), b(0:n-1), c(0:n-1))
+
+  !$acc kernels present (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+
+  !$acc kernels present (b(0:n-1))
+  do i = 0, n - 1
+     b(i) = i * 4
+  end do
+  !$acc end kernels
+
+  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+
+  !$acc end data
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
+
+! Check that only three loops are analyzed, and that all can be parallelized.
+! { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops_oacc_kernels" } }
+! { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } }
+
+! Check that the loops have been split off into functions.
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.2 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 3 "parloops_oacc_kernels" } }
diff --git a/gcc/testsuite/gfortran.dg/goacc/kernels-loop.f95 b/gcc/testsuite/gfortran.dg/goacc/kernels-loop.f95
new file mode 100644
index 0000000..c9364dd
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/goacc/kernels-loop.f95
@@ -0,0 +1,39 @@
+! { dg-additional-options "-O2" }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" }
+! { dg-additional-options "-fdump-tree-optimized" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+
+  do i = 0, n - 1
+     b(i) = i * 4
+  end do
+
+  !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
+
+! Check that only one loop is analyzed, and that it can be parallelized.
+! { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops_oacc_kernels" } }
+! { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } }
+
+! Check that the loop has been split off into a function.
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 1 "parloops_oacc_kernels" } }
diff --git a/gcc/testsuite/gfortran.dg/goacc/kernels-parallel-loop-data-enter-exit.f95 b/gcc/testsuite/gfortran.dg/goacc/kernels-parallel-loop-data-enter-exit.f95
new file mode 100644
index 0000000..d805938
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/goacc/kernels-parallel-loop-data-enter-exit.f95
@@ -0,0 +1,50 @@
+! { dg-additional-options "-O2" }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! { dg-additional-options "-fdump-tree-parloops_oacc_kernels-all" }
+! { dg-additional-options "-fdump-tree-optimized" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc enter data create (a(0:n-1), b(0:n-1), c(0:n-1))
+
+  !$acc kernels present (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+
+  !$acc parallel present (b(0:n-1))
+  !$acc loop
+  do i = 0, n - 1
+     b(i) = i * 4
+  end do
+  !$acc end parallel
+
+  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+
+  !$acc exit data copyout (a(0:n-1), b(0:n-1), c(0:n-1))
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
+
+! Check that only two loops are analyzed, and that both can be parallelized.
+! { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloops_oacc_kernels" } }
+! { dg-final { scan-tree-dump-not "FAILED:" "parloops_oacc_kernels" } }
+
+! Check that the loops have been split off into functions.
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.0 " 1 "optimized" } }
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.1 " 1 "optimized" } }
+! { dg-final { scan-tree-dump-times "(?n);; Function MAIN__._omp_fn.2 " 1 "optimized" } }
+
+! { dg-final { scan-tree-dump-times "(?n)oacc function \\(32," 2 "parloops_oacc_kernels" } }
-- 
1.9.1


* [PATCH, 15/16] Add libgomp.oacc-c-c++-common/kernels-*.c
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
                   ` (13 preceding siblings ...)
  2015-11-09 20:09 ` [PATCH, 14/16] Add gfortran.dg/goacc/kernels-*.f95 Tom de Vries
@ 2015-11-09 20:11 ` Tom de Vries
  2016-01-18 13:39   ` [comitted] Add oacc kernels test in libgomp Tom de Vries
  2016-03-09  9:18   ` [PATCH, 15/16] Add libgomp.oacc-c-c++-common/kernels-*.c Tom de Vries
  2015-11-09 20:12 ` [PATCH, 16/16] Add libgomp.oacc-fortran/kernels-*.f95 Tom de Vries
  15 siblings, 2 replies; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 20:11 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 1585 bytes --]

This patch adds C/C++ oacc kernels execution tests.
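
As a rough illustration of the shape most of these tests take (a minimal
sketch, not one of the files in the patch): the host initializes the
inputs, a kernels region computes the result, and the host verifies it.
The function name vector_add_check and the return-code convention are
assumptions made for this example; the actual tests call abort () on a
mismatch.  When built without -fopenacc the pragma is ignored and the
loop simply runs on the host.

```c
#include <stdlib.h>

/* Hypothetical helper, not part of the patch.  Returns 0 if the
   kernels region produced the expected sums, 1 on a mismatch and
   -1 if an allocation failed.  */
int
vector_add_check (unsigned int n)
{
  unsigned int *a = (unsigned int *) malloc (n * sizeof (unsigned int));
  unsigned int *b = (unsigned int *) malloc (n * sizeof (unsigned int));
  unsigned int *c = (unsigned int *) malloc (n * sizeof (unsigned int));
  int res = 0;

  if (a == NULL || b == NULL || c == NULL)
    {
      free (a);
      free (b);
      free (c);
      return -1;
    }

  /* Initialize the inputs on the host.  */
  for (unsigned int i = 0; i < n; i++)
    {
      a[i] = i * 2;
      b[i] = i * 4;
    }

  /* Inputs are copied to the device, the result is copied back.  */
#pragma acc kernels copyin (a[0:n], b[0:n]) copyout (c[0:n])
  {
    for (unsigned int ii = 0; ii < n; ii++)
      c[ii] = a[ii] + b[ii];
  }

  /* Verify on the host.  */
  for (unsigned int i = 0; i < n; i++)
    if (c[i] != a[i] + b[i])
      res = 1;

  free (a);
  free (b);
  free (c);

  return res;
}
```

Calling it with a size that is not a multiple of 32, such as
(1024 * 512) + 1, corresponds to what kernels-loop-mod-not-zero.c
exercises.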

Thanks,
- Tom


[-- Attachment #2: 0015-Add-libgomp.oacc-c-c-common-kernels-.c.patch --]
[-- Type: text/x-patch, Size: 27286 bytes --]

Add libgomp.oacc-c-c++-common/kernels-*.c

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c: New test.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-2.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit-2.c:
	Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit.c:
	Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-update.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-data.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-loop.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-parallel-loop-data-enter-exit.c:
	Same.
	* testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c: Same.
---
 .../libgomp.oacc-c-c++-common/kernels-loop-2.c     | 47 ++++++++++++++++++
 .../libgomp.oacc-c-c++-common/kernels-loop-3.c     | 34 +++++++++++++
 .../kernels-loop-and-seq-2.c                       | 36 ++++++++++++++
 .../kernels-loop-and-seq-3.c                       | 37 ++++++++++++++
 .../kernels-loop-and-seq-4.c                       | 36 ++++++++++++++
 .../kernels-loop-and-seq-5.c                       | 37 ++++++++++++++
 .../kernels-loop-and-seq-6.c                       | 36 ++++++++++++++
 .../kernels-loop-and-seq.c                         | 37 ++++++++++++++
 .../kernels-loop-collapse.c                        | 40 ++++++++++++++++
 .../kernels-loop-data-2.c                          | 56 ++++++++++++++++++++++
 .../kernels-loop-data-enter-exit-2.c               | 54 +++++++++++++++++++++
 .../kernels-loop-data-enter-exit.c                 | 51 ++++++++++++++++++++
 .../kernels-loop-data-update.c                     | 53 ++++++++++++++++++++
 .../libgomp.oacc-c-c++-common/kernels-loop-data.c  | 50 +++++++++++++++++++
 .../libgomp.oacc-c-c++-common/kernels-loop-g.c     |  5 ++
 .../kernels-loop-mod-not-zero.c                    | 41 ++++++++++++++++
 .../libgomp.oacc-c-c++-common/kernels-loop-n.c     | 47 ++++++++++++++++++
 .../libgomp.oacc-c-c++-common/kernels-loop-nest.c  | 26 ++++++++++
 .../libgomp.oacc-c-c++-common/kernels-loop.c       | 41 ++++++++++++++++
 .../kernels-parallel-loop-data-enter-exit.c        | 52 ++++++++++++++++++++
 .../libgomp.oacc-c-c++-common/kernels-reduction.c  | 37 ++++++++++++++
 21 files changed, 853 insertions(+)
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-2.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit-2.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-update.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-parallel-loop-data-enter-exit.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c

diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c
new file mode 100644
index 0000000..13e57bd
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-2.c
@@ -0,0 +1,47 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+#pragma acc kernels copyout (a[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      a[i] = i * 2;
+  }
+
+#pragma acc kernels copyout (b[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      b[i] = i * 4;
+  }
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c
new file mode 100644
index 0000000..f61a74a
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-3.c
@@ -0,0 +1,34 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int i;
+
+  unsigned int *__restrict c;
+
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    c[i] = i * 2;
+
+#pragma acc kernels copy (c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = c[ii] + ii + 1;
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != i * 2 + i + 1)
+      abort ();
+
+  free (c);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
new file mode 100644
index 0000000..2e4100f
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
@@ -0,0 +1,36 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N 32
+
+unsigned int
+foo (int n, unsigned int *a)
+{
+#pragma acc kernels copy (a[0:N])
+  {
+    a[0] = a[0] + 1;
+
+    for (int i = 0; i < n; i++)
+      a[i] = 1;
+  }
+
+  return a[0];
+}
+
+int
+main (void)
+{
+  unsigned int a[N];
+  unsigned res, i;
+
+  for (i = 0; i < N; ++i)
+    a[i] = i % 4;
+
+  res = foo (N, a);
+  if (res != 1)
+    abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
new file mode 100644
index 0000000..b3e736b
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
@@ -0,0 +1,37 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N 32
+
+unsigned int
+foo (int n, unsigned int *a)
+{
+
+#pragma acc kernels copy (a[0:N])
+  {
+    for (int i = 0; i < n; i++)
+      a[i] = 1;
+
+    a[0] = 2;
+  }
+
+  return a[0];
+}
+
+int
+main (void)
+{
+  unsigned int a[N];
+  unsigned res, i;
+
+  for (i = 0; i < N; ++i)
+    a[i] = i % 4;
+
+  res = foo (N, a);
+  if (res != 2)
+    abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
new file mode 100644
index 0000000..8b9affa
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
@@ -0,0 +1,36 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N 32
+
+unsigned int
+foo (int n, unsigned int *a)
+{
+#pragma acc kernels copy (a[0:N])
+  {
+    a[0] = 2;
+
+    for (int i = 0; i < n; i++)
+      a[i] = 1;
+  }
+
+  return a[0];
+}
+
+int
+main (void)
+{
+  unsigned int a[N];
+  unsigned res, i;
+
+  for (i = 0; i < N; ++i)
+    a[i] = i % 4;
+
+  res = foo (N, a);
+  if (res != 1)
+    abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
new file mode 100644
index 0000000..83d4e7f
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
@@ -0,0 +1,37 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N 32
+
+unsigned int
+foo (int n, unsigned int *a)
+{
+  int r;
+#pragma acc kernels copyout(r) copy (a[0:N])
+  {
+    r = a[0];
+
+    for (int i = 0; i < n; i++)
+      a[i] = 1;
+  }
+
+  return r;
+}
+
+int
+main (void)
+{
+  unsigned int a[N];
+  unsigned res, i;
+
+  for (i = 0; i < N; ++i)
+    a[i] = i % 4;
+
+  res = foo (N, a);
+  if (res != 0)
+    abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
new file mode 100644
index 0000000..01d5e5e
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
@@ -0,0 +1,36 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N 32
+
+unsigned int
+foo (int n, unsigned int *a)
+{
+#pragma acc kernels copy (a[0:N])
+  {
+    int r = a[0];
+
+    for (int i = 0; i < n; i++)
+      a[i] = 1 + r;
+  }
+
+  return a[0];
+}
+
+int
+main (void)
+{
+  unsigned int a[N];
+  unsigned res, i;
+
+  for (i = 0; i < N; ++i)
+    a[i] = i % 4;
+
+  res = foo (N, a);
+  if (res != 1)
+    abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
new file mode 100644
index 0000000..61d1283
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
@@ -0,0 +1,37 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N 32
+
+unsigned int
+foo (int n, unsigned int *a)
+{
+
+#pragma acc kernels copy (a[0:N])
+  {
+    for (int i = 0; i < n; i++)
+      a[i] = 1;
+
+    a[0] = a[0] + 1;
+  }
+
+  return a[0];
+}
+
+int
+main (void)
+{
+  unsigned int a[N];
+  unsigned res, i;
+
+  for (i = 0; i < N; ++i)
+    a[i] = i % 4;
+
+  res = foo (N, a);
+  if (res != 2)
+    abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
new file mode 100644
index 0000000..f7f04cb
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
@@ -0,0 +1,40 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N 100
+
+int a[N][N];
+
+void __attribute__((noinline, noclone))
+foo (int m, int n)
+{
+  int i, j;
+  #pragma acc kernels
+  {
+#pragma acc loop collapse(2)
+    for (i = 0; i < m; i++)
+      for (j = 0; j < n; j++)
+	a[i][j] = 1;
+  }
+}
+
+int
+main (void)
+{
+  int i, j;
+
+  for (i = 0; i < N; i++)
+    for (j = 0; j < N; j++)
+      a[i][j] = 0;
+
+  foo (N, N);
+
+  for (i = 0; i < N; i++)
+    for (j = 0; j < N; j++)
+      if (a[i][j] != 1)
+	abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-2.c
new file mode 100644
index 0000000..b889ef9
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-2.c
@@ -0,0 +1,56 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+#pragma acc data copyout (a[0:N])
+  {
+#pragma acc kernels present (a[0:N])
+    {
+      for (COUNTERTYPE i = 0; i < N; i++)
+	a[i] = i * 2;
+    }
+  }
+
+#pragma acc data copyout (b[0:N])
+  {
+#pragma acc kernels present (b[0:N])
+    {
+      for (COUNTERTYPE i = 0; i < N; i++)
+	b[i] = i * 4;
+    }
+  }
+
+#pragma acc data copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
+    {
+      for (COUNTERTYPE ii = 0; ii < N; ii++)
+	c[ii] = a[ii] + b[ii];
+    }
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit-2.c
new file mode 100644
index 0000000..d508a44
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit-2.c
@@ -0,0 +1,54 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+#pragma acc enter data create (a[0:N])
+#pragma acc kernels present (a[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      a[i] = i * 2;
+  }
+#pragma acc exit data copyout (a[0:N])
+
+#pragma acc enter data create (b[0:N])
+#pragma acc kernels present (b[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      b[i] = i * 4;
+  }
+#pragma acc exit data copyout (b[0:N])
+
+
+#pragma acc enter data copyin (a[0:N], b[0:N]) create (c[0:N])
+#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+#pragma acc exit data copyout (c[0:N])
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit.c
new file mode 100644
index 0000000..11d82f7
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-enter-exit.c
@@ -0,0 +1,51 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+#pragma acc enter data create (a[0:N], b[0:N], c[0:N])
+
+#pragma acc kernels present (a[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      a[i] = i * 2;
+  }
+
+#pragma acc kernels present (b[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      b[i] = i * 4;
+  }
+
+#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+#pragma acc exit data copyout (a[0:N], b[0:N], c[0:N])
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-update.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-update.c
new file mode 100644
index 0000000..a7d4e84
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data-update.c
@@ -0,0 +1,53 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+#pragma acc enter data create (a[0:N], b[0:N], c[0:N])
+
+#pragma acc kernels present (a[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      a[i] = i * 2;
+  }
+
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      b[i] = i * 4;
+  }
+
+#pragma acc update device (b[0:N])
+
+#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+#pragma acc exit data copyout (a[0:N], c[0:N])
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data.c
new file mode 100644
index 0000000..607d7de
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-data.c
@@ -0,0 +1,50 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+#pragma acc data copyout (a[0:N], b[0:N], c[0:N])
+  {
+#pragma acc kernels present (a[0:N])
+    {
+      for (COUNTERTYPE i = 0; i < N; i++)
+	a[i] = i * 2;
+    }
+
+#pragma acc kernels present (b[0:N])
+    {
+      for (COUNTERTYPE i = 0; i < N; i++)
+	b[i] = i * 4;
+    }
+
+#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
+    {
+      for (COUNTERTYPE ii = 0; ii < N; ii++)
+	c[ii] = a[ii] + b[ii];
+    }
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c
new file mode 100644
index 0000000..96b6e4e
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-g.c
@@ -0,0 +1,5 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* { dg-additional-options "-g" } */
+
+#include "kernels-loop.c"
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c
new file mode 100644
index 0000000..1433cb2
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-mod-not-zero.c
@@ -0,0 +1,41 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N ((1024 * 512) + 1)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    a[i] = i * 2;
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c
new file mode 100644
index 0000000..fd0d5b1
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-n.c
@@ -0,0 +1,47 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N ((1024 * 512) + 1)
+#define COUNTERTYPE unsigned int
+
+static int __attribute__((noinline,noclone))
+foo (COUNTERTYPE n)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (n * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (n * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (n * sizeof (unsigned int));
+
+  for (COUNTERTYPE i = 0; i < n; i++)
+    a[i] = i * 2;
+
+  for (COUNTERTYPE i = 0; i < n; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:n], b[0:n]) copyout (c[0:n])
+  {
+    for (COUNTERTYPE ii = 0; ii < n; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < n; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
+
+int
+main (void)
+{
+  return foo (N);
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c
new file mode 100644
index 0000000..21d2599
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c
@@ -0,0 +1,26 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N 1000
+
+int
+main (void)
+{
+  int x[N][N];
+
+#pragma acc kernels copyout (x)
+  {
+    for (int ii = 0; ii < N; ii++)
+      for (int jj = 0; jj < N; jj++)
+	x[ii][jj] = ii + jj + 3;
+  }
+
+  for (int i = 0; i < N; i++)
+    for (int j = 0; j < N; j++)
+      if (x[i][j] != i + j + 3)
+	abort ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c
new file mode 100644
index 0000000..3762e5a
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop.c
@@ -0,0 +1,41 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    a[i] = i * 2;
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    b[i] = i * 4;
+
+#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-parallel-loop-data-enter-exit.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-parallel-loop-data-enter-exit.c
new file mode 100644
index 0000000..767f6c8
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-parallel-loop-data-enter-exit.c
@@ -0,0 +1,52 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define N (1024 * 512)
+#define COUNTERTYPE unsigned int
+
+int
+main (void)
+{
+  unsigned int *__restrict a;
+  unsigned int *__restrict b;
+  unsigned int *__restrict c;
+
+  a = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  b = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+  c = (unsigned int *__restrict)malloc (N * sizeof (unsigned int));
+
+#pragma acc enter data create (a[0:N], b[0:N], c[0:N])
+
+#pragma acc kernels present (a[0:N])
+  {
+    for (COUNTERTYPE i = 0; i < N; i++)
+      a[i] = i * 2;
+  }
+
+#pragma acc parallel present (b[0:N])
+  {
+#pragma acc loop
+    for (COUNTERTYPE i = 0; i < N; i++)
+      b[i] = i * 4;
+  }
+
+#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
+  {
+    for (COUNTERTYPE ii = 0; ii < N; ii++)
+      c[ii] = a[ii] + b[ii];
+  }
+
+#pragma acc exit data copyout (a[0:N], b[0:N], c[0:N])
+
+  for (COUNTERTYPE i = 0; i < N; i++)
+    if (c[i] != a[i] + b[i])
+      abort ();
+
+  free (a);
+  free (b);
+  free (c);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c
new file mode 100644
index 0000000..511e25f
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction.c
@@ -0,0 +1,37 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <stdlib.h>
+
+#define n 10000
+
+unsigned int a[n];
+
+void  __attribute__((noinline,noclone))
+foo (void)
+{
+  int i;
+  unsigned int sum = 1;
+
+#pragma acc kernels copyin (a[0:n]) copy (sum)
+  {
+    for (i = 0; i < n; ++i)
+      sum += a[i];
+  }
+
+  if (sum != 5001)
+    abort ();
+}
+
+int
+main ()
+{
+  int i;
+
+  for (i = 0; i < n; ++i)
+    a[i] = i % 2;
+
+  foo ();
+
+  return 0;
+}
-- 
1.9.1


* [PATCH, 16/16] Add libgomp.oacc-fortran/kernels-*.f95
  2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
                   ` (14 preceding siblings ...)
  2015-11-09 20:11 ` [PATCH, 15/16] Add libgomp.oacc-c-c++-common/kernels-*.c Tom de Vries
@ 2015-11-09 20:12 ` Tom de Vries
  2016-03-09  9:19   ` Tom de Vries
  15 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-09 20:12 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 1587 bytes --]

This patch adds Fortran oacc kernels execution tests.

Thanks,
- Tom


[-- Attachment #2: 0016-Add-libgomp.oacc-fortran-kernels-.f95.patch --]
[-- Type: text/x-patch, Size: 10459 bytes --]

Add libgomp.oacc-fortran/kernels-*.f95

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* testsuite/libgomp.oacc-fortran/kernels-loop-2.f95: New test.
	* testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95: Same.
	* testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95:
	Same.
	* testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95: Same.
	* testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95: Same.
	* testsuite/libgomp.oacc-fortran/kernels-loop-data.f95: Same.
	* testsuite/libgomp.oacc-fortran/kernels-loop.f95: Same.
	* testsuite/libgomp.oacc-fortran/kernels-parallel-loop-data-enter-exit.f95:
	Same.
---
 .../libgomp.oacc-fortran/kernels-loop-2.f95        | 32 ++++++++++++++++++
 .../libgomp.oacc-fortran/kernels-loop-data-2.f95   | 38 ++++++++++++++++++++++
 .../kernels-loop-data-enter-exit-2.f95             | 38 ++++++++++++++++++++++
 .../kernels-loop-data-enter-exit.f95               | 36 ++++++++++++++++++++
 .../kernels-loop-data-update.f95                   | 36 ++++++++++++++++++++
 .../libgomp.oacc-fortran/kernels-loop-data.f95     | 36 ++++++++++++++++++++
 .../libgomp.oacc-fortran/kernels-loop.f95          | 28 ++++++++++++++++
 .../kernels-parallel-loop-data-enter-exit.f95      | 37 +++++++++++++++++++++
 8 files changed, 281 insertions(+)
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-2.f95
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data.f95
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/kernels-loop.f95
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/kernels-parallel-loop-data-enter-exit.f95

diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-2.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-2.f95
new file mode 100644
index 0000000..1fb40ee
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-2.f95
@@ -0,0 +1,32 @@
+! { dg-do run }
+! { dg-options "-ftree-parallelize-loops=32" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc kernels copyout (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+
+  !$acc kernels copyout (b(0:n-1))
+  do i = 0, n -1
+     b(i) = i * 4
+  end do
+  !$acc end kernels
+
+  !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95
new file mode 100644
index 0000000..7b52253
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95
@@ -0,0 +1,38 @@
+! { dg-do run }
+! { dg-options "-ftree-parallelize-loops=32" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc data copyout (a(0:n-1))
+  !$acc kernels present (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+  !$acc end data
+
+  !$acc data copyout (b(0:n-1))
+  !$acc kernels present (b(0:n-1))
+  do i = 0, n -1
+     b(i) = i * 4
+  end do
+  !$acc end kernels
+  !$acc end data
+
+  !$acc data copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
+  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+  !$acc end data
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95
new file mode 100644
index 0000000..af98efa
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95
@@ -0,0 +1,38 @@
+! { dg-do run }
+! { dg-options "-ftree-parallelize-loops=32" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc enter data create (a(0:n-1))
+  !$acc kernels present (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+  !$acc exit data copyout (a(0:n-1))
+
+  !$acc enter data create (b(0:n-1))
+  !$acc kernels present (b(0:n-1))
+  do i = 0, n -1
+     b(i) = i * 4
+  end do
+  !$acc end kernels
+  !$acc exit data copyout (b(0:n-1))
+
+  !$acc enter data copyin (a(0:n-1), b(0:n-1)) create (c(0:n-1))
+  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+  !$acc exit data copyout (c(0:n-1))
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95
new file mode 100644
index 0000000..bb6f8dc
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95
@@ -0,0 +1,36 @@
+! { dg-do run }
+! { dg-options "-ftree-parallelize-loops=32" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc enter data create (a(0:n-1), b(0:n-1), c(0:n-1))
+
+  !$acc kernels present (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+
+  !$acc kernels present (b(0:n-1))
+  do i = 0, n -1
+     b(i) = i * 4
+  end do
+  !$acc end kernels
+
+  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+
+  !$acc exit data copyout (a(0:n-1), b(0:n-1), c(0:n-1))
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95
new file mode 100644
index 0000000..cab1f2c
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95
@@ -0,0 +1,36 @@
+! { dg-do run }
+! { dg-options "-ftree-parallelize-loops=32" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc enter data create (a(0:n-1), b(0:n-1), c(0:n-1))
+
+  !$acc kernels present (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+
+  do i = 0, n -1
+     b(i) = i * 4
+  end do
+
+  !$acc update device (b(0:n-1))
+
+  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+
+  !$acc exit data copyout (a(0:n-1), c(0:n-1))
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data.f95
new file mode 100644
index 0000000..f26671d
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data.f95
@@ -0,0 +1,36 @@
+! { dg-do run }
+! { dg-options "-ftree-parallelize-loops=32" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc data copyout (a(0:n-1), b(0:n-1), c(0:n-1))
+
+  !$acc kernels present (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+
+  !$acc kernels present (b(0:n-1))
+  do i = 0, n -1
+     b(i) = i * 4
+  end do
+  !$acc end kernels
+
+  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+
+  !$acc end data
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop.f95
new file mode 100644
index 0000000..b02dd57
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop.f95
@@ -0,0 +1,28 @@
+! { dg-do run }
+! { dg-options "-ftree-parallelize-loops=32" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+
+  do i = 0, n -1
+     b(i) = i * 4
+  end do
+
+  !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-parallel-loop-data-enter-exit.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-parallel-loop-data-enter-exit.f95
new file mode 100644
index 0000000..2322152
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-parallel-loop-data-enter-exit.f95
@@ -0,0 +1,37 @@
+! { dg-do run }
+! { dg-options "-ftree-parallelize-loops=32" }
+
+program main
+  implicit none
+  integer, parameter         :: n = 1024
+  integer, dimension (0:n-1) :: a, b, c
+  integer                    :: i, ii
+
+  !$acc enter data create (a(0:n-1), b(0:n-1), c(0:n-1))
+
+  !$acc kernels present (a(0:n-1))
+  do i = 0, n - 1
+     a(i) = i * 2
+  end do
+  !$acc end kernels
+
+  !$acc parallel present (b(0:n-1))
+  !$acc loop
+  do i = 0, n -1
+     b(i) = i * 4
+  end do
+  !$acc end parallel
+
+  !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
+  do ii = 0, n - 1
+     c(ii) = a(ii) + b(ii)
+  end do
+  !$acc end kernels
+
+  !$acc exit data copyout (a(0:n-1), b(0:n-1), c(0:n-1))
+
+  do i = 0, n - 1
+     if (c(i) .ne. a(i) + b(i)) call abort
+  end do
+
+end program main
-- 
1.9.1



* Re: [PATCH, 2/16] Make create_parallel_loop return void
  2015-11-09 15:45 ` [PATCH, 2/16] Make create_parallel_loop return void Tom de Vries
@ 2015-11-11 10:50   ` Richard Biener
  0 siblings, 0 replies; 133+ messages in thread
From: Richard Biener @ 2015-11-11 10:50 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Mon, 9 Nov 2015, Tom de Vries wrote:

> On 09/11/15 16:35, Tom de Vries wrote:
> > Hi,
> > 
> > this patch series for stage1 trunk adds support to:
> > - parallelize oacc kernels regions using parloops, and
> > - map the loops onto the oacc gang dimension.
> > 
> > The patch series contains these patches:
> > 
> >       1    Insert new exit block only when needed in
> >          transform_to_exit_first_loop_alt
> >       2    Make create_parallel_loop return void
> >       3    Ignore reduction clause on kernels directive
> >       4    Implement -foffload-alias
> >       5    Add in_oacc_kernels_region in struct loop
> >       6    Add pass_oacc_kernels
> >       7    Add pass_dominator_oacc_kernels
> >       8    Add pass_ch_oacc_kernels
> >       9    Add pass_parallelize_loops_oacc_kernels
> >      10    Add pass_oacc_kernels pass group in passes.def
> >      11    Update testcases after adding kernels pass group
> >      12    Handle acc loop directive
> >      13    Add c-c++-common/goacc/kernels-*.c
> >      14    Add gfortran.dg/goacc/kernels-*.f95
> >      15    Add libgomp.oacc-c-c++-common/kernels-*.c
> >      16    Add libgomp.oacc-fortran/kernels-*.f95
> > 
> > The first 9 patches are more or less independent, but patches 10-16 are
> > intended to be committed at the same time.
> > 
> > Bootstrapped and reg-tested on x86_64.
> > 
> > Build and reg-tested with nvidia accelerator, in combination with a
> > patch that enables accelerator testing (which is submitted at
> > https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
> > 
> > I'll post the individual patches in reply to this message.
> 
> this patch makes create_parallel_loop return void.  The result is currently
> unused.

Ok.

Richard.

> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: [PATCH, 1/16] Insert new exit block only when needed in transform_to_exit_first_loop_alt
  2015-11-09 15:44 ` [PATCH, 1/16] Insert new exit block only when needed in transform_to_exit_first_loop_alt Tom de Vries
@ 2015-11-11 10:50   ` Richard Biener
  0 siblings, 0 replies; 133+ messages in thread
From: Richard Biener @ 2015-11-11 10:50 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Mon, 9 Nov 2015, Tom de Vries wrote:

> On 09/11/15 16:35, Tom de Vries wrote:
> > Hi,
> > 
> > this patch series for stage1 trunk adds support to:
> > - parallelize oacc kernels regions using parloops, and
> > - map the loops onto the oacc gang dimension.
> > 
> > The patch series contains these patches:
> > 
> >       1    Insert new exit block only when needed in
> >          transform_to_exit_first_loop_alt
> >       2    Make create_parallel_loop return void
> >       3    Ignore reduction clause on kernels directive
> >       4    Implement -foffload-alias
> >       5    Add in_oacc_kernels_region in struct loop
> >       6    Add pass_oacc_kernels
> >       7    Add pass_dominator_oacc_kernels
> >       8    Add pass_ch_oacc_kernels
> >       9    Add pass_parallelize_loops_oacc_kernels
> >      10    Add pass_oacc_kernels pass group in passes.def
> >      11    Update testcases after adding kernels pass group
> >      12    Handle acc loop directive
> >      13    Add c-c++-common/goacc/kernels-*.c
> >      14    Add gfortran.dg/goacc/kernels-*.f95
> >      15    Add libgomp.oacc-c-c++-common/kernels-*.c
> >      16    Add libgomp.oacc-fortran/kernels-*.f95
> > 
> > The first 9 patches are more or less independent, but patches 10-16 are
> > intended to be committed at the same time.
> > 
> > Bootstrapped and reg-tested on x86_64.
> > 
> > Build and reg-tested with nvidia accelerator, in combination with a
> > patch that enables accelerator testing (which is submitted at
> > https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
> > 
> > I'll post the individual patches in reply to this message.
> > 
> 
> In transform_to_exit_first_loop_alt we insert a new exit block in between the
> new loop header and the old exit block. Currently, we do this even when it is
> not necessary.
> 
> This patch figures out when we need to insert a new exit block, and only then
> inserts it.

Ok.

Richard.

> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-09 16:10 ` [PATCH, 4/16] Implement -foffload-alias Tom de Vries
@ 2015-11-11 10:53   ` Richard Biener
  2015-11-11 11:01     ` Jakub Jelinek
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-11 10:53 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Mon, 9 Nov 2015, Tom de Vries wrote:

> On 09/11/15 16:35, Tom de Vries wrote:
> > Hi,
> > 
> > this patch series for stage1 trunk adds support to:
> > - parallelize oacc kernels regions using parloops, and
> > - map the loops onto the oacc gang dimension.
> > 
> > The patch series contains these patches:
> > 
> >       1    Insert new exit block only when needed in
> >          transform_to_exit_first_loop_alt
> >       2    Make create_parallel_loop return void
> >       3    Ignore reduction clause on kernels directive
> >       4    Implement -foffload-alias
> >       5    Add in_oacc_kernels_region in struct loop
> >       6    Add pass_oacc_kernels
> >       7    Add pass_dominator_oacc_kernels
> >       8    Add pass_ch_oacc_kernels
> >       9    Add pass_parallelize_loops_oacc_kernels
> >      10    Add pass_oacc_kernels pass group in passes.def
> >      11    Update testcases after adding kernels pass group
> >      12    Handle acc loop directive
> >      13    Add c-c++-common/goacc/kernels-*.c
> >      14    Add gfortran.dg/goacc/kernels-*.f95
> >      15    Add libgomp.oacc-c-c++-common/kernels-*.c
> >      16    Add libgomp.oacc-fortran/kernels-*.f95
> > 
> > The first 9 patches are more or less independent, but patches 10-16 are
> > intended to be committed at the same time.
> > 
> > Bootstrapped and reg-tested on x86_64.
> > 
> > Build and reg-tested with nvidia accelerator, in combination with a
> > patch that enables accelerator testing (which is submitted at
> > https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
> > 
> > I'll post the individual patches in reply to this message.
> 
> this patch addresses the problem that once the offloading region has been
> split off from the original function, alias analysis can no longer use
> information available in the original function that would allow it to do a
> more precise analysis for the offloading function. [ At some point we could
> use fipa-pta for that, as discussed in PR46032, but that's not feasible now. ]
> 
> The basic idea behind the patch is that for typical usage, the base pointers
> used in an offloaded region are non-aliasing. The patch works by adding
> restrict to the types of the fields used to pass data to an offloading region.
> 
> 
> The patch implements a new option
> -foffload-alias=<none|pointer|all>.
> 
> The option -foffload-alias=none instructs the compiler to assume that
> object references and pointer dereferences in an offload region do not
> alias.
> 
> The option -foffload-alias=pointer instructs the compiler to assume that
> object references in an offload region do not alias.
> 
> The option -foffload-alias=all instructs the compiler to make no
> assumptions about aliasing in offload regions.
> 
> The default value is -foffload-alias=none.

I think a global option for this is nonsense.  Please follow what
we do for #pragma GCC ivdep for example, thus allow the alias
behavior to be specified per "region" (whatever makes sense here
in the context of offloading).
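
For readers unfamiliar with the per-loop annotation mentioned above, here
is a minimal sketch of how #pragma GCC ivdep is attached to a single loop
rather than to a whole translation unit.  The function name add_offset and
its arguments are made up for illustration and are not taken from the patch
series; how an equivalent per-region annotation for offloading would look
is left open here.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the patch series.  */
void
add_offset (int *a, const int *b, size_t n)
{
  /* The pragma applies only to the loop that immediately follows it:
     the programmer asserts there are no loop-carried dependencies
     that would prevent vectorization of this loop.  */
#pragma GCC ivdep
  for (size_t i = 0; i < n; ++i)
    a[i] = b[i] + 1;
}
```

The point of the sketch is the scope: the assertion is local to one loop,
so other loops in the same file keep the default, conservative treatment.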

Thanks,
Richard.

> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: [PATCH, 5/16] Add in_oacc_kernels_region in struct loop
  2015-11-09 16:31 ` [PATCH, 5/16] Add in_oacc_kernels_region in struct loop Tom de Vries
@ 2015-11-11 10:57   ` Richard Biener
  2015-11-16 11:39     ` Tom de Vries
  2015-11-16 11:39     ` Tom de Vries
  0 siblings, 2 replies; 133+ messages in thread
From: Richard Biener @ 2015-11-11 10:57 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Mon, 9 Nov 2015, Tom de Vries wrote:

> On 09/11/15 16:35, Tom de Vries wrote:
> > Hi,
> > 
> > this patch series for stage1 trunk adds support to:
> > - parallelize oacc kernels regions using parloops, and
> > - map the loops onto the oacc gang dimension.
> > 
> > The patch series contains these patches:
> > 
> >       1    Insert new exit block only when needed in
> >          transform_to_exit_first_loop_alt
> >       2    Make create_parallel_loop return void
> >       3    Ignore reduction clause on kernels directive
> >       4    Implement -foffload-alias
> >       5    Add in_oacc_kernels_region in struct loop
> >       6    Add pass_oacc_kernels
> >       7    Add pass_dominator_oacc_kernels
> >       8    Add pass_ch_oacc_kernels
> >       9    Add pass_parallelize_loops_oacc_kernels
> >      10    Add pass_oacc_kernels pass group in passes.def
> >      11    Update testcases after adding kernels pass group
> >      12    Handle acc loop directive
> >      13    Add c-c++-common/goacc/kernels-*.c
> >      14    Add gfortran.dg/goacc/kernels-*.f95
> >      15    Add libgomp.oacc-c-c++-common/kernels-*.c
> >      16    Add libgomp.oacc-fortran/kernels-*.f95
> > 
> > The first 9 patches are more or less independent, but patches 10-16 are
> > intended to be committed at the same time.
> > 
> > Bootstrapped and reg-tested on x86_64.
> > 
> > Build and reg-tested with nvidia accelerator, in combination with a
> > patch that enables accelerator testing (which is submitted at
> > https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
> > 
> > I'll post the individual patches in reply to this message.
> 
> this patch adds and initializes the in_oacc_kernels_region field in
> struct loop.
> 
> The field is used to signal to subsequent passes that we're dealing with a
> loop in a kernels region that we're trying to parallelize.
> 
> Note that we do not parallelize kernels regions with more than one loop nest.
> [ In general, kernels regions with more than one loop nest should be split up
> into separate kernels regions, but that's not supported atm. ]

I think mark_loops_in_oacc_kernels_region can be greatly simplified.

Both region entry and exit should have the same ->loop_father (a SESE
region).  Then you can just walk that loop's inner loops (and their
siblings), checking the dominance relation of their headers with the
region entry and exit (only necessary for direct inner loops).
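
A rough sketch of that walk, on a toy model rather than on GCC's real data
structures.  Everything prefixed toy_ is hypothetical: the struct only
mimics the inner/next/header shape of struct loop, dominance is looked up
through a plain immediate-dominator array, and the membership test
(header dominated by the entry but not by the exit) is an assumption about
what the check would look like.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct loop; only the fields the sketch needs.  */
struct toy_loop
{
  struct toy_loop *inner;   /* First nested loop, or NULL.  */
  struct toy_loop *next;    /* Next sibling loop, or NULL.  */
  int header;               /* Index of the loop header block.  */
  bool in_oacc_kernels_region;
};

/* Return true if block BB is dominated by block DOM.  IDOM maps each
   block to its immediate dominator; the root is its own idom.  */
bool
toy_dominated_by (const int *idom, int bb, int dom)
{
  for (;;)
    {
      if (bb == dom)
        return true;
      if (idom[bb] == bb)
        return false;
      bb = idom[bb];
    }
}

/* Mark LOOP and every loop nested in it, without further checks.  */
void
toy_mark_nest (struct toy_loop *loop)
{
  loop->in_oacc_kernels_region = true;
  for (struct toy_loop *l = loop->inner; l; l = l->next)
    toy_mark_nest (l);
}

/* FATHER is the loop containing both region entry and region exit.
   Only its direct inner loops need the dominance check.  */
void
toy_mark_loops_in_region (struct toy_loop *father, const int *idom,
                          int entry, int exit_bb)
{
  for (struct toy_loop *l = father->inner; l; l = l->next)
    if (toy_dominated_by (idom, l->header, entry)
        && !toy_dominated_by (idom, l->header, exit_bb))
      toy_mark_nest (l);
}
```

Once a direct inner loop is known to lie inside the region, its nested
loops are marked without any further dominance queries.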

Richard.

> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: [PATCH, 6/16] Add pass_oacc_kernels
  2015-11-09 17:39 ` [PATCH, 6/16] Add pass_oacc_kernels Tom de Vries
@ 2015-11-11 10:59   ` Richard Biener
  2015-11-19 13:51     ` Tom de Vries
  2016-02-05 12:06   ` Use plain -fopenacc to enable OpenACC kernels processing (was: [PATCH, 6/16] Add pass_oacc_kernels) Thomas Schwinge
  1 sibling, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-11 10:59 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Mon, 9 Nov 2015, Tom de Vries wrote:

> On 09/11/15 16:35, Tom de Vries wrote:
> > Hi,
> > 
> > this patch series for stage1 trunk adds support to:
> > - parallelize oacc kernels regions using parloops, and
> > - map the loops onto the oacc gang dimension.
> > 
> > The patch series contains these patches:
> > 
> >       1    Insert new exit block only when needed in
> >          transform_to_exit_first_loop_alt
> >       2    Make create_parallel_loop return void
> >       3    Ignore reduction clause on kernels directive
> >       4    Implement -foffload-alias
> >       5    Add in_oacc_kernels_region in struct loop
> >       6    Add pass_oacc_kernels
> >       7    Add pass_dominator_oacc_kernels
> >       8    Add pass_ch_oacc_kernels
> >       9    Add pass_parallelize_loops_oacc_kernels
> >      10    Add pass_oacc_kernels pass group in passes.def
> >      11    Update testcases after adding kernels pass group
> >      12    Handle acc loop directive
> >      13    Add c-c++-common/goacc/kernels-*.c
> >      14    Add gfortran.dg/goacc/kernels-*.f95
> >      15    Add libgomp.oacc-c-c++-common/kernels-*.c
> >      16    Add libgomp.oacc-fortran/kernels-*.f95
> > 
> > The first 9 patches are more or less independent, but patches 10-16 are
> > intended to be committed at the same time.
> > 
> > Bootstrapped and reg-tested on x86_64.
> > 
> > Build and reg-tested with nvidia accelerator, in combination with a
> > patch that enables accelerator testing (which is submitted at
> > https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
> > 
> > I'll post the individual patches in reply to this message.
> 
> this patch adds a pass group pass_oacc_kernels (which will be added to the
> pass list as a whole in patch 10).

Just to understand (while also skimming the HSA patches).

You are basically relying on autopar for what the HSA patches call
"gridification"?  That is, OMP lowering produces loopy kernels
and autopar then will basically strip the outermost loop?
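
To make the question concrete, here is a generic sketch of what "stripping
the outermost loop" can mean: the sequential loop is replaced by a body
that is run once per gang, each gang covering one chunk of the iteration
space.  The names sum_sequential and sum_for_gang and the chunking scheme
are assumptions made for illustration; this is not a description of the
code that parloops actually emits.

```c
#include <stddef.h>

/* "Loopy" form: a single thread walks all N iterations.  */
unsigned int
sum_sequential (const unsigned int *a, size_t n)
{
  unsigned int sum = 0;
  for (size_t i = 0; i < n; ++i)
    sum += a[i];
  return sum;
}

/* Per-gang form: the loop over gangs is implicit in the launch, and
   each gang handles one contiguous chunk.  The partial sums still
   have to be combined by the caller.  */
unsigned int
sum_for_gang (const unsigned int *a, size_t n, size_t gang,
              size_t num_gangs)
{
  size_t chunk = (n + num_gangs - 1) / num_gangs;  /* Rounded up.  */
  size_t begin = gang * chunk;
  size_t end = begin + chunk < n ? begin + chunk : n;
  unsigned int sum = 0;
  for (size_t i = begin; i < end; ++i)
    sum += a[i];
  return sum;
}
```

Adding up the results of sum_for_gang for gang = 0 .. num_gangs - 1 gives
the same value as sum_sequential on the same array.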

Richard.

> Atm, the parallelization behaviour for the kernels region is controlled by
> flag_tree_parallelize_loops, which is also used to control generic
> auto-parallelization by autopar using omp. That is not ideal, and we may want
> a separate flag (or param) to control the behaviour for oacc kernels, f.i.
> -foacc-kernels-gang-parallelize=<n>. I'm open to suggestions.
> 
> The purpose of the pass group as a whole is to massage the offloaded function
> into a shape that parloops can deal with, and then run parloops on it.
> 
> Consider a testcase with a reduction, and a loop counter declared outside the
> offload region:
> ...
> unsigned int a[n];
> 
> unsigned int
> foo (void)
> {
>   int i;
>   unsigned int sum = 1;
> 
> #pragma acc kernels copyin (a[0:n]) copy (sum)
>   {
>     for (i = 0; i < n; ++i)
>       sum += a[i];
>   }
> 
>   return sum;
> }
> ...
> 
> After ealias, the loop body looks like this:
> ...
>   <bb 5>:
>   _8 = *.omp_data_i_3(D).a;
>   _9 = *.omp_data_i_3(D).i;
>   _10 = *_9;
>   _11 = *_8[_10];
>   _12 = *.omp_data_i_3(D).sum;
>   sum.0_13 = *_12;
>   sum.1_14 = _11 + sum.0_13;
>   _15 = *.omp_data_i_3(D).sum;
>   *_15 = sum.1_14;
>   _17 = *.omp_data_i_3(D).i;
>   _18 = *_17;
>   _19 = *.omp_data_i_3(D).i;
>   _20 = _18 + 1;
>   *_19 = _20;
>   goto <bb 6>;
> ...
> In other words, the iteration variable is in memory, as is the reduction
> variable, and the body contains lots of loop invariant loads.
> 
> At the end of the pass group, just before parloops, the body has been
> rewritten to have a local iteration variable and a local reduction variable,
> and all the loop invariant loads have been moved out of the loop:
> ...
>   <bb 4>:
>   # _27 = PHI <0(2), _20(5)>
>   # D__lsm.7_28 = PHI <D__lsm.7_29(2), sum.1_14(5)>
>   _11 = *_8[_27];
>   sum.1_14 = _11 + D__lsm.7_28;
>   _20 = _27 + 1;
>   if (_20 <= 9999)
>     goto <bb 5>;
>   else
>     goto <bb 3>;
> ...
> 
> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-11 10:53   ` Richard Biener
@ 2015-11-11 11:01     ` Jakub Jelinek
  2015-11-12 16:04       ` Tom de Vries
  2015-12-03 11:53       ` Tom de Vries
  0 siblings, 2 replies; 133+ messages in thread
From: Jakub Jelinek @ 2015-11-11 11:01 UTC (permalink / raw)
  To: Richard Biener; +Cc: Tom de Vries, gcc-patches

On Wed, Nov 11, 2015 at 11:51:02AM +0100, Richard Biener wrote:
> > The option -foffload-alias=pointer instructs the compiler to assume that
> > object references in an offload region do not alias.
> > 
> > The option -foffload-alias=all instructs the compiler to make no
> > assumptions about aliasing in offload regions.
> > 
> > The default value is -foffload-alias=none.
> 
> I think global options for this is nonsense.  Please follow what
> we do for #pragma GCC ivdep for example, thus allow the alias
> behavior to be specified per "region" (whatever makes sense here
> in the context of offloading).

Yeah, completely agreed.  I don't see why offloaded regions would be in
any way special; they are C/C++/Fortran code like any other.
What we can and should do is teach the IPA aliasing/points-to analysis
about the way we lower the host vs. offloading region boundary, so that
whatever alias analysis determines in the caller of
GOMP_target_ext/GOACC_parallel_keyed can be used on the offloaded function
side and vice versa, but a switch like the above is just wrong.

	Jakub


* Re: [PATCH, 11/16] Update testcases after adding kernels pass group
  2015-11-09 20:02 ` [PATCH, 11/16] Update testcases after adding kernels pass group Tom de Vries
@ 2015-11-11 11:03   ` Richard Biener
  2015-11-12 14:32     ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-11 11:03 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Mon, 9 Nov 2015, Tom de Vries wrote:

> On 09/11/15 16:35, Tom de Vries wrote:
> > Hi,
> > 
> > this patch series for stage1 trunk adds support to:
> > - parallelize oacc kernels regions using parloops, and
> > - map the loops onto the oacc gang dimension.
> > 
> > The patch series contains these patches:
> > 
> >       1    Insert new exit block only when needed in
> >          transform_to_exit_first_loop_alt
> >       2    Make create_parallel_loop return void
> >       3    Ignore reduction clause on kernels directive
> >       4    Implement -foffload-alias
> >       5    Add in_oacc_kernels_region in struct loop
> >       6    Add pass_oacc_kernels
> >       7    Add pass_dominator_oacc_kernels
> >       8    Add pass_ch_oacc_kernels
> >       9    Add pass_parallelize_loops_oacc_kernels
> >      10    Add pass_oacc_kernels pass group in passes.def
> >      11    Update testcases after adding kernels pass group
> >      12    Handle acc loop directive
> >      13    Add c-c++-common/goacc/kernels-*.c
> >      14    Add gfortran.dg/goacc/kernels-*.f95
> >      15    Add libgomp.oacc-c-c++-common/kernels-*.c
> >      16    Add libgomp.oacc-fortran/kernels-*.f95
> > 
> > The first 9 patches are more or less independent, but patches 10-16 are
> > intended to be committed at the same time.
> > 
> > Bootstrapped and reg-tested on x86_64.
> > 
> > Build and reg-tested with nvidia accelerator, in combination with a
> > patch that enables accelerator testing (which is submitted at
> > https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
> > 
> > I'll post the individual patches in reply to this message.
> 
> This patch updates existing testcases with new pass numbers, given the passes
> that were added in the pass list in patch 10.

I think it would be nice to be able to specify the number in the .def
file instead so we can avoid this kind of churn every time we do this.

> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-09 19:59 ` [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def Tom de Vries
@ 2015-11-11 11:03   ` Richard Biener
  2015-11-16 11:55     ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-11 11:03 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Mon, 9 Nov 2015, Tom de Vries wrote:

> This patch adds the pass_oacc_kernels pass group to the pass list in
> passes.def.
> 
> Note the repetition of pass_lim/pass_copy_prop. The first pair is for an inner
> loop in a loop nest, the second for an outer loop in a loop nest.

@@ -86,6 +86,27 @@ along with GCC; see the file COPYING3.  If not see
          /* pass_build_ealias is a dummy pass that ensures that we
             execute TODO_rebuild_alias at this point.  */
          NEXT_PASS (pass_build_ealias);
+         /* Pass group that runs when there are oacc kernels in the
+            function.  */
+         NEXT_PASS (pass_oacc_kernels);
+         PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
+             NEXT_PASS (pass_dominator_oacc_kernels);
+             NEXT_PASS (pass_ch_oacc_kernels);
+             NEXT_PASS (pass_dominator_oacc_kernels);
+             NEXT_PASS (pass_tree_loop_init);
+             NEXT_PASS (pass_lim);
+             NEXT_PASS (pass_copy_prop);
+             NEXT_PASS (pass_lim);
+             NEXT_PASS (pass_copy_prop);

iterate lim/copyprop twice?!  Why's that needed?

+             NEXT_PASS (pass_scev_cprop);

What's that for?  It's supposed to help remove loops, and I don't
expect kernels to vanish.

+             NEXT_PASS (pass_tree_loop_done);
+             NEXT_PASS (pass_dominator_oacc_kernels);

Three times DOM?  No please.  I wonder why you don't run oacc_kernels
after FRE and drop the initial DOM(s).

+             NEXT_PASS (pass_dce);
+             NEXT_PASS (pass_tree_loop_init);
+             NEXT_PASS (pass_parallelize_loops_oacc_kernels);
+             NEXT_PASS (pass_expand_omp_ssa);
+             NEXT_PASS (pass_tree_loop_done);

The switches into/out of tree_loop also look odd to me, but well
(they'll be controlled by -ftree-loop-optimize).

+         POP_INSERT_PASSES ()

Please get some more sense into this pass pipeline.

Richard.


> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: [PATCH, 7/16] Add pass_dominator_oacc_kernels
  2015-11-09 18:14 ` [PATCH, 7/16] Add pass_dominator_oacc_kernels Tom de Vries
@ 2015-11-11 11:05   ` Richard Biener
  2015-11-16 12:04     ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-11 11:05 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Mon, 9 Nov 2015, Tom de Vries wrote:

> this patch adds pass_dominator_oacc_kernels (which we may as well call
> pass_dominator_no_peel_loop_headers, since it doesn't do anything
> oacc-kernels-specific), to be used in the kernels pass group.
> 
> The reason I'm adding a new pass instead of using pass_dominator is that
> pass_dominator uses first_pass_instance. So adding a pass_dominator instance A
> before a pass_dominator instance B has the unexpected consequence that it may
> change the behaviour of instance B. I've filed PR68247 - "Remove
> pass_first_instance" to note this issue.

This looks ok (minus my comments to patch #10)

Richard.

> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: [PATCH, 8/16] Add pass_ch_oacc_kernels
  2015-11-09 18:34 ` [PATCH, 8/16] Add pass_ch_oacc_kernels Tom de Vries
@ 2015-11-11 20:29   ` Tom de Vries
  2015-11-30 12:12     ` [gomp4] Use pass_ch instead of pass_ch_oacc_kernels (was: [PATCH, 8/16] Add pass_ch_oacc_kernels) Thomas Schwinge
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-11 20:29 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

On 09/11/15 19:33, Tom de Vries wrote:
> this patch adds a pass pass_ch_oacc_kernels, which is like pass_ch, but
> only runs for loops with oacc_kernels_region set.
>
> [ But... thinking about it a bit more, I think that we could use a
> regular pass_ch instead. We only use the kernels pass group for a single
> loop nest in a kernels region, and we mark all the loops in the loop
> nest with oacc_kernels_region. So I think that the oacc_kernels_region
> test in pass_ch_oacc_kernels::process_loop_p evaluates to true. ]
>
> So, I'll try to confirm with retesting that we can drop this patch.
>

That's confirmed. I can use pass_ch instead of pass_ch_oacc_kernels, so 
I'm dropping this patch from the series.
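
The reasoning can be sketched as follows. This is a simplified model that only looks at the region test and ignores any other conditions the real process_loop_p applies. The struct and function names are hypothetical stand-ins, not the code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct loop, reduced to the flag added in
   patch 5.  */
struct loop_sketch
{
  bool in_oacc_kernels_region;
};

/* Models only the extra test in pass_ch_oacc_kernels.  */
bool
oacc_kernels_filter (const struct loop_sketch *loop)
{
  return loop->in_oacc_kernels_region;
}

/* Returns true if the extra test accepts every loop in LOOPS, in which
   case it filters nothing and the plain pass behaves the same.  */
bool
filter_is_redundant (const struct loop_sketch *loops, size_t n)
{
  assert (loops != NULL || n == 0);

  for (size_t i = 0; i < n; i++)
    if (!oacc_kernels_filter (&loops[i]))
      return false;
  return true;
}
```

Since the pass group only sees a loop nest in which every loop is marked, the filter accepts all of them, which matches the retest result.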

Thanks,
- Tom


* Re: [PATCH, 11/16] Update testcases after adding kernels pass group
  2015-11-11 11:03   ` Richard Biener
@ 2015-11-12 14:32     ` Tom de Vries
  2015-11-12 14:43       ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-12 14:32 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

On 11/11/15 12:03, Richard Biener wrote:
> On Mon, 9 Nov 2015, Tom de Vries wrote:
>
>> This patch updates existing testcases with new pass numbers, given the passes
>> that were added in the pass list in patch 10.
>
> I think it would be nice to be able to specify the number in the .def
> file instead so we can avoid this kind of churn everytime we do this.

How about something along the lines of:
...
   /* pass_build_ealias is a dummy pass that ensures that we
      execute TODO_rebuild_alias at this point.  */
   NEXT_PASS (pass_build_ealias);
   /* Pass group that runs when there are oacc kernels in the
   function.  */
   NEXT_PASS (pass_oacc_kernels);
   PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
   PUSH_ID ("oacc_kernels")
     ...
   POP_ID ()
   POP_INSERT_PASSES ()
   NEXT_PASS (pass_fre);
...

where the PUSH_ID/POP_ID pair has the effect that, for all the
contained passes:
- the id is prefixed to the dump file name, so the dump file of pass_ch,
   which normally is "ch", becomes "oacc_kernels_ch", and
- the pass name in pass_instances.def becomes pass_oacc_kernels_ch, such
   that it doesn't count as a numbered instance of pass_ch
?
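
A minimal sketch of the naming rule, assuming the id and the pass's own dump name are simply joined with an underscore. PUSH_ID/POP_ID do not exist yet, and the helper name below is made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the dump name for a pass whose own name is
   BASE into OUT.  If ID is a non-empty string (we are inside a
   PUSH_ID/POP_ID pair), the result is "<id>_<base>", otherwise just
   "<base>".  Returns 0 on success, -1 if OUT is too small.  */
int
prefixed_dump_name (char *out, size_t out_size, const char *id,
                    const char *base)
{
  assert (out != NULL && base != NULL);

  int n;
  if (id != NULL && id[0] != '\0')
    n = snprintf (out, out_size, "%s_%s", id, base);
  else
    n = snprintf (out, out_size, "%s", base);

  return (n < 0 || (size_t) n >= out_size) ? -1 : 0;
}
```

With the id "oacc_kernels", the "ch" pass inside the group gets the name "oacc_kernels_ch", while a "ch" pass outside the group keeps its plain name and its instance number.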

Thanks,
- Tom


* Re: [PATCH, 11/16] Update testcases after adding kernels pass group
  2015-11-12 14:32     ` Tom de Vries
@ 2015-11-12 14:43       ` Richard Biener
  2015-11-12 15:42         ` David Malcolm
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-12 14:43 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

On Thu, Nov 12, 2015 at 3:31 PM, Tom de Vries <Tom_deVries@mentor.com> wrote:
> On 11/11/15 12:03, Richard Biener wrote:
>>
>> On Mon, 9 Nov 2015, Tom de Vries wrote:
>>
>>> This patch updates existing testcases with new pass numbers, given the
>>> passes
>>> that were added in the pass list in patch 10.
>>
>>
>> I think it would be nice to be able to specify the number in the .def
>> file instead so we can avoid this kind of churn everytime we do this.
>
>
> How about something along the lines of:
> ...
>   /* pass_build_ealias is a dummy pass that ensures that we
>      execute TODO_rebuild_alias at this point.  */
>   NEXT_PASS (pass_build_ealias);
>   /* Pass group that runs when there are oacc kernels in the
>   function.  */
>   NEXT_PASS (pass_oacc_kernels);
>   PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
>   PUSH_ID ("oacc_kernels")
>     ...
>   POP_ID ()
>   POP_INSERT_PASSES ()
>   NEXT_PASS (pass_fre);
> ...
>
> where the PUSH_ID/POP_ID pair has the functionality that all the contained
> passes:
> - have the id prefixed to the dump file, so the dump file of pass_ch
>   which normally is "ch" becomes "oacc_kernels_ch", and
> - the pass name in pass_instances.def becomes pass_oacc_kernels_ch, such
>   that it doesn't count as numbered instance of pass_ch
> ?

Hmm.  I'd like to have something that allows me to add "slp" to both
pass_slp_vectorize instances, having them share the suffix (as no function
appears in both dumps).

We similarly have "duplicates" across the -Og vs. the -O[0-3] pipeline.

Basically, make all dump file name suffixes manually specified, which means
moving them from the class definition to the actual instance.

Well, just an idea.  In a distant future I'd like our pass pipeline to become
more dynamic, getting away from a static passes.def towards, say, a pass
"script" (to be able to say "if inlining did nothing, skip this group" or
similar).

Richard.


> Thanks,
> - Tom


* Re: [PATCH, 11/16] Update testcases after adding kernels pass group
  2015-11-12 14:43       ` Richard Biener
@ 2015-11-12 15:42         ` David Malcolm
  2015-11-13  9:44           ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: David Malcolm @ 2015-11-12 15:42 UTC (permalink / raw)
  To: Richard Biener; +Cc: Tom de Vries, Richard Biener, gcc-patches, Jakub Jelinek

On Thu, 2015-11-12 at 15:43 +0100, Richard Biener wrote:
> On Thu, Nov 12, 2015 at 3:31 PM, Tom de Vries <Tom_deVries@mentor.com> wrote:
> > On 11/11/15 12:03, Richard Biener wrote:
> >>
> >> On Mon, 9 Nov 2015, Tom de Vries wrote:
> >>
> >>> This patch updates existing testcases with new pass numbers, given the
> >>> passes
> >>> that were added in the pass list in patch 10.
> >>
> >>
> >> I think it would be nice to be able to specify the number in the .def
> >> file instead so we can avoid this kind of churn everytime we do this.
> >
> >
> > How about something along the lines of:
> > ...
> >   /* pass_build_ealias is a dummy pass that ensures that we
> >      execute TODO_rebuild_alias at this point.  */
> >   NEXT_PASS (pass_build_ealias);
> >   /* Pass group that runs when there are oacc kernels in the
> >   function.  */
> >   NEXT_PASS (pass_oacc_kernels);
> >   PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
> >   PUSH_ID ("oacc_kernels")
> >     ...
> >   POP_ID ()
> >   POP_INSERT_PASSES ()
> >   NEXT_PASS (pass_fre);
> > ...
> >
> > where the PUSH_ID/POP_ID pair has the functionality that all the contained
> > passes:
> > - have the id prefixed to the dump file, so the dump file of pass_ch
> >   which normally is "ch" becomes "oacc_kernels_ch", and
> > - the pass name in pass_instances.def becomes pass_oacc_kernels_ch, such
> >   that it doesn't count as numbered instance of pass_ch
> > ?
> 
> Hmm.  I'd like to have sth that allows me to add "slp" to both
> pass_slp_vectorize
> instances, having them share the suffix (as no two functions are in both dumps).
> 
> We similarly have "duplicates" across the -Og vs. the -O[0-3] pipeline.
> 
> Basically make all dump file name suffixes manually specified which means moving
> them from the class definition to the actual instance.
> 
> Well, just an idea.  In a distant future I like our pass pipeline to become more
> dynamic, getting away from a static passes.def towards, say, a pass "script"
> (to be able to say "if inlining did nothing skip this group" or similar).

Can't that be done by having a parent pass to hold them, with a gate
function?

Or are you thinking of having another domain-specific language?

Thinking aloud, I've sometimes wondered if it would be helpful to be
able to subclass pass_manager, so that multiple passes.def files could
generate alternative pass_manager subclasses, with the precise choice of
pass_manager subclass being determined by options+target.  I don't know
if that latter idea is useful though.
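
The parent-pass idea can be sketched like this. It is a toy model loosely based on the gate-plus-sub-passes arrangement, not the real opt_pass interface. All type and function names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy per-function state shared by the gate and the sub-passes.  */
struct fn_state
{
  bool inlined;      /* Did inlining change anything?  */
  int cleanups_run;  /* How many cleanup passes have run.  */
};

typedef void (*pass_fn) (struct fn_state *);

/* A parent pass: a gate plus the sub-passes it guards.  */
struct pass_group
{
  bool (*gate) (const struct fn_state *);
  const pass_fn *sub_passes;
  size_t n_sub_passes;
};

/* Example gate: "if inlining did nothing, skip this group".  */
bool
inlining_did_something (const struct fn_state *s)
{
  return s->inlined;
}

/* Example sub-pass that just records that it ran.  */
void
cleanup_pass (struct fn_state *s)
{
  s->cleanups_run++;
}

/* Run the group's sub-passes if the gate allows it; return how many
   sub-passes were executed.  */
size_t
run_group (const struct pass_group *g, struct fn_state *s)
{
  assert (g != NULL && s != NULL);

  if (g->gate != NULL && !g->gate (s))
    return 0;
  for (size_t i = 0; i < g->n_sub_passes; i++)
    g->sub_passes[i] (s);
  return g->n_sub_passes;
}
```

The limitation is that the gate is decided once per group, so anything more flexible than "run or skip this group" would still need a different mechanism.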

Dave


* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-11 11:01     ` Jakub Jelinek
@ 2015-11-12 16:04       ` Tom de Vries
  2015-11-13  8:46         ` Richard Biener
  2015-12-03 11:53       ` Tom de Vries
  1 sibling, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-12 16:04 UTC (permalink / raw)
  To: Jakub Jelinek, Richard Biener; +Cc: gcc-patches

On 11/11/15 12:00, Jakub Jelinek wrote:
> On Wed, Nov 11, 2015 at 11:51:02AM +0100, Richard Biener wrote:
>>> The option -foffload-alias=pointer instructs the compiler to assume that
>>> objects references in an offload region do not alias.
>>>
>>> The option -foffload-alias=all instructs the compiler to make no
>>> assumptions about aliasing in offload regions.
>>>
>>> The default value is -foffload-alias=none.
>>
>> I think global options for this is nonsense.  Please follow what
>> we do for #pragma GCC ivdep for example, thus allow the alias
>> behavior to be specified per "region" (whatever makes sense here
>> in the context of offloading).

So, IIUC, instead of a global option -foffload-alias, you're saying
something like the following would be acceptable:
...
#pragma GCC offload-alias=<none|pointer|all>
#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
   {
     #pragma acc loop
     for (COUNTERTYPE ii = 0; ii < N; ii++)
       c[ii] = a[ii] + b[ii];
   }
...
?

I suppose that would work (though a global option would allow us to
easily switch between none/pointer/all values in a large number of
files, something that might be useful when, e.g., running an OpenACC test
suite).
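
For comparison, this is roughly how the existing per-loop annotation that Richard refers to looks. The offload-alias pragma above is only a proposal and does not exist. The function name below is made up, and the sketch assumes a compiler that understands #pragma GCC ivdep.

```c
#include <assert.h>
#include <stddef.h>

/* The pragma applies to the one loop that follows it, and asserts that
   the loop has no loop-carried dependencies that would prevent
   vectorization.  The assertion covers only this loop, not the whole
   translation unit.  */
void
vec_add_ivdep (const int *a, const int *b, int *c, size_t n)
{
  assert (n == 0 || (a != NULL && b != NULL && c != NULL));

#pragma GCC ivdep
  for (size_t i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}
```

A region-level pragma for offloading would presumably follow the same pattern: the annotation sits directly on the construct it describes, so different regions in one file can make different aliasing promises.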

> Yeah, completely agreed.  I don't see why the offloaded region would be in
> any way special, they are C/C++/Fortran code as any other.
> What we can and should improve is teach IPA aliasing/points to analysis
> about the way we lower the host vs. offloading region boundary, so that
> if alias analysis on the caller of GOMP_target_ext/GOACC_parallel_keyed
> determines something it can be used on the offloaded function side and vice
> versa,

I agree this would be a nice way to solve the aliasing info problem, but 
considering the remark of Richard at 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032#c19 :
...
Not that I think IPA PTA is anywhere near production ready
...
I haven't considered proceeding in that direction.

Thanks,
- Tom

> but a switch like the above is just wrong.


* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-12 16:04       ` Tom de Vries
@ 2015-11-13  8:46         ` Richard Biener
  2015-11-13 11:03           ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-13  8:46 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Jakub Jelinek, gcc-patches

On Thu, 12 Nov 2015, Tom de Vries wrote:

> On 11/11/15 12:00, Jakub Jelinek wrote:
> > On Wed, Nov 11, 2015 at 11:51:02AM +0100, Richard Biener wrote:
> > > > The option -foffload-alias=pointer instructs the compiler to assume that
> > > > objects references in an offload region do not alias.
> > > > 
> > > > The option -foffload-alias=all instructs the compiler to make no
> > > > assumptions about aliasing in offload regions.
> > > > 
> > > > The default value is -foffload-alias=none.
> > > 
> > > I think global options for this is nonsense.  Please follow what
> > > we do for #pragma GCC ivdep for example, thus allow the alias
> > > behavior to be specified per "region" (whatever makes sense here
> > > in the context of offloading).
> 
> So, IIUC, instead of a global option foffload-alias, you're saying something
> like the following would be acceptable:
> ...
> #pragma GCC offload-alias=<none|pointer|all>
> #pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
>   {
>     #pragma acc loop
>     for (COUNTERTYPE ii = 0; ii < N; ii++)
>       c[ii] = a[ii] + b[ii];
>   }
> ...
> ?
> 
> I suppose that would work (though a global option would allow us to easily
> switch between none/pointer/all values in a large number of files, something
> that might be useful when f.i. running an openacc  test suite).
> 
> > Yeah, completely agreed.  I don't see why the offloaded region would be in
> > any way special, they are C/C++/Fortran code as any other.
> > What we can and should improve is teach IPA aliasing/points to analysis
> > about the way we lower the host vs. offloading region boundary, so that
> > if alias analysis on the caller of GOMP_target_ext/GOACC_parallel_keyed
> > determines something it can be used on the offloaded function side and vice
> > versa,
> 
> I agree this would be a nice way to solve the aliasing info problem, but
> considering the remark of Richard at
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032#c19 :
> ...
> Not that I think IPA PTA is anywhere near production ready

Just to clarify on that sentence:
 1) we lack good testing coverage for IPA PTA, so wrong-code bugs might
    still exist
 2) IPA PTA can use a _lot_ of memory and compile time
 3) for existing wrong-code issues I have merely dumbed down the use of
    the analysis result, resulting in weaker alias analysis compared to
    the local PTA (for some cases)

Because of 2), and there being no good way to avoid it, I decided not to
make fixing 3) a priority (and 1) still holds).

Richard.

> ...
> I haven't considered proceeding in that direction.
> 
> Thanks,
> - Tom
> 
> > but a switch like the above is just wrong.
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: [PATCH, 11/16] Update testcases after adding kernels pass group
  2015-11-12 15:42         ` David Malcolm
@ 2015-11-13  9:44           ` Richard Biener
  0 siblings, 0 replies; 133+ messages in thread
From: Richard Biener @ 2015-11-13  9:44 UTC (permalink / raw)
  To: David Malcolm; +Cc: Richard Biener, Tom de Vries, gcc-patches, Jakub Jelinek

On Thu, 12 Nov 2015, David Malcolm wrote:

> On Thu, 2015-11-12 at 15:43 +0100, Richard Biener wrote:
> > On Thu, Nov 12, 2015 at 3:31 PM, Tom de Vries <Tom_deVries@mentor.com> wrote:
> > > On 11/11/15 12:03, Richard Biener wrote:
> > >>
> > >> On Mon, 9 Nov 2015, Tom de Vries wrote:
> > >>
> > >>> This patch updates existing testcases with new pass numbers, given the
> > >>> passes
> > >>> that were added in the pass list in patch 10.
> > >>
> > >>
> > >> I think it would be nice to be able to specify the number in the .def
> > >> file instead so we can avoid this kind of churn everytime we do this.
> > >
> > >
> > > How about something along the lines of:
> > > ...
> > >   /* pass_build_ealias is a dummy pass that ensures that we
> > >      execute TODO_rebuild_alias at this point.  */
> > >   NEXT_PASS (pass_build_ealias);
> > >   /* Pass group that runs when there are oacc kernels in the
> > >   function.  */
> > >   NEXT_PASS (pass_oacc_kernels);
> > >   PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
> > >   PUSH_ID ("oacc_kernels")
> > >     ...
> > >   POP_ID ()
> > >   POP_INSERT_PASSES ()
> > >   NEXT_PASS (pass_fre);
> > > ...
> > >
> > > where the PUSH_ID/POP_ID pair has the functionality that all the contained
> > > passes:
> > > - have the id prefixed to the dump file, so the dump file of pass_ch
> > >   which normally is "ch" becomes "oacc_kernels_ch", and
> > > - the pass name in pass_instances.def becomes pass_oacc_kernels_ch, such
> > >   that it doesn't count as numbered instance of pass_ch
> > > ?
> > 
> > Hmm.  I'd like to have sth that allows me to add "slp" to both
> > pass_slp_vectorize
> > instances, having them share the suffix (as no two functions are in both dumps).
> > 
> > We similarly have "duplicates" across the -Og vs. the -O[0-3] pipeline.
> > 
> > Basically make all dump file name suffixes manually specified which means moving
> > them from the class definition to the actual instance.
> > 
> > Well, just an idea.  In a distant future I like our pass pipeline to become more
> > dynamic, getting away from a static passes.def towards, say, a pass "script"
> > (to be able to say "if inlining did nothing skip this group" or similar).
> 
> Can't that be done by having a parent pass to hold them, with a gate
> function?

Sure, that's how we do it for the loop sub-pipeline for example.

> Or are you thinking of having another domain-specific language?

Kind of.  I'm thinking of the pass pipeline being dynamic in the
sense of a program controlling execution of passes.  Basically
"scripting" the pass manager itself (yes, also with the idea to
give users and us more control).

Of course specific features can be implemented in the pass manager
itself (it's a "script" with static configuration).

> Thinking aloud, I've sometimes wondered if it would be helpful to be
> able to subclass pass_manager, so that multiple passes.def files could
> generate alternative pass_manager subclasses, with the precise choice of
> pass_manager subclass being determined by options+target.  I don't know
> if that latter idea is useful though.

I think the "use" of passes.def is simply too static.  We shouldn't bother
to create all the instances and dump file metadata until we need it.

The first thing to do is of course making the pass manager really
control the flow of compilation rather than various bits of
cgraph infrastructure executing specific (sub-)pass queues.

Richard.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-13  8:46         ` Richard Biener
@ 2015-11-13 11:03           ` Tom de Vries
  2015-11-13 11:30             ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-13 11:03 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jakub Jelinek, gcc-patches

On 13/11/15 09:46, Richard Biener wrote:
> On Thu, 12 Nov 2015, Tom de Vries wrote:
>
>> On 11/11/15 12:00, Jakub Jelinek wrote:
>>> On Wed, Nov 11, 2015 at 11:51:02AM +0100, Richard Biener wrote:
>>>>> The option -foffload-alias=pointer instructs the compiler to assume that
>>>>> object references in an offload region do not alias.
>>>>>
>>>>> The option -foffload-alias=all instructs the compiler to make no
>>>>> assumptions about aliasing in offload regions.
>>>>>
>>>>> The default value is -foffload-alias=none.
>>>>
>>>> I think global options for this is nonsense.  Please follow what
>>>> we do for #pragma GCC ivdep for example, thus allow the alias
>>>> behavior to be specified per "region" (whatever makes sense here
>>>> in the context of offloading).
>>
>> So, IIUC, instead of a global option foffload-alias, you're saying something
>> like the following would be acceptable:
>> ...
>> #pragma GCC offload-alias=<none|pointer|all>
>> #pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
>>    {
>>      #pragma acc loop
>>      for (COUNTERTYPE ii = 0; ii < N; ii++)
>>        c[ii] = a[ii] + b[ii];
>>    }
>> ...
>> ?
>>
>> I suppose that would work (though a global option would allow us to easily
>> switch between none/pointer/all values in a large number of files, something
>> that might be useful when f.i. running an openacc  test suite).
>>
>>> Yeah, completely agreed.  I don't see why the offloaded region would be in
>>> any way special, they are C/C++/Fortran code as any other.
>>> What we can and should improve is teach IPA aliasing/points to analysis
>>> about the way we lower the host vs. offloading region boundary, so that
>>> if alias analysis on the caller of GOMP_target_ext/GOACC_parallel_keyed
>>> determines something it can be used on the offloaded function side and vice
>>> versa,
>>
>> I agree this would be a nice way to solve the aliasing info problem, but
>> considering the remark of Richard at
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032#c19 :
>> ...
>> Not that I think IPA PTA is anywhere near production ready
>
> Just to clarify on that sentence:
>   1) we lack good testing coverage for IPA PTA so wrong-code bugs might
> still exist
>   2) IPA PTA can use a _lot_ of memory and compile-time
>   3) for existing wrong-code issues I have merely dumbed down the
> use of the analysis result resulting in weaker alias analysis compared to
> the local PTA (for some cases)
>
> Because of 2) and no good way to avoid this I decided to not make
> fixing 3) a priority (and 1) still holds).
>

Hi,

thanks for the explanation. Filed as PR68331 - '[meta-bug] fipa-pta issues'.

Any feedback on the '#pragma GCC offload-alias=<none|pointer|all>' bit 
above? Is that sort of what you had in mind?

Thanks,
- Tom



^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-13 11:03           ` Tom de Vries
@ 2015-11-13 11:30             ` Richard Biener
  2015-11-13 11:39               ` Jakub Jelinek
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-13 11:30 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Jakub Jelinek, gcc-patches

On Fri, 13 Nov 2015, Tom de Vries wrote:

> On 13/11/15 09:46, Richard Biener wrote:
> > On Thu, 12 Nov 2015, Tom de Vries wrote:
> > 
> > > On 11/11/15 12:00, Jakub Jelinek wrote:
> > > > On Wed, Nov 11, 2015 at 11:51:02AM +0100, Richard Biener wrote:
> > > > > > The option -foffload-alias=pointer instructs the compiler to assume
> > > > > > that
> > > > > > object references in an offload region do not alias.
> > > > > > 
> > > > > > The option -foffload-alias=all instructs the compiler to make no
> > > > > > assumptions about aliasing in offload regions.
> > > > > > 
> > > > > > The default value is -foffload-alias=none.
> > > > > 
> > > > > I think global options for this is nonsense.  Please follow what
> > > > > we do for #pragma GCC ivdep for example, thus allow the alias
> > > > > behavior to be specified per "region" (whatever makes sense here
> > > > > in the context of offloading).
> > > 
> > > So, IIUC, instead of a global option foffload-alias, you're saying
> > > something
> > > like the following would be acceptable:
> > > ...
> > > #pragma GCC offload-alias=<none|pointer|all>
> > > #pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
> > >    {
> > >      #pragma acc loop
> > >      for (COUNTERTYPE ii = 0; ii < N; ii++)
> > >        c[ii] = a[ii] + b[ii];
> > >    }
> > > ...
> > > ?
> > > 
> > > I suppose that would work (though a global option would allow us to easily
> > > switch between none/pointer/all values in a large number of files,
> > > something
> > > that might be useful when f.i. running an openacc  test suite).
> > > 
> > > > Yeah, completely agreed.  I don't see why the offloaded region would be
> > > > in
> > > > any way special, they are C/C++/Fortran code as any other.
> > > > What we can and should improve is teach IPA aliasing/points to analysis
> > > > about the way we lower the host vs. offloading region boundary, so that
> > > > if alias analysis on the caller of GOMP_target_ext/GOACC_parallel_keyed
> > > > determines something it can be used on the offloaded function side and
> > > > vice
> > > > versa,
> > > 
> > > I agree this would be a nice way to solve the aliasing info problem, but
> > > considering the remark of Richard at
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032#c19 :
> > > ...
> > > Not that I think IPA PTA is anywhere near production ready
> > 
> > Just to clarify on that sentence:
> >   1) we lack good testing coverage for IPA PTA so wrong-code bugs might
> > still exist
> >   2) IPA PTA can use a _lot_ of memory and compile-time
> >   3) for existing wrong-code issues I have merely dumbed down the
> > use of the analysis result resulting in weaker alias analysis compared to
> > the local PTA (for some cases)
> > 
> > Because of 2) and no good way to avoid this I decided to not make
> > fixing 3) a priority (and 1) still holds).
> > 
> 
> Hi,
> 
> thanks for the explanation. Filed as PR68331 - '[meta-bug] fipa-pta issues'.
> 
> Any feedback on the '#pragma GCC offload-alias=<none|pointer|all>' bit above?
> Is that sort of what you had in mind?

Yes.  Whether that makes sense is another question of course.  You can
annotate memory references with MR_DEPENDENCE_BASE/CLIQUE yourself
as well if you know dependences without the user's intervention.
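
As a rough illustration of what the clique/base annotation buys: two
references in the same non-zero clique with different bases are treated as
not aliasing, and anything else gives no extra information.  The snippet
below is a standalone toy model of that rule only; toy_mem_ref and
toy_refs_may_alias_p are hypothetical names, not the real alias oracle.

```cpp
#include <cassert>

// Toy stand-in for the dependence info carried by a memory reference.
// Clique 0 means "no dependence info recorded".
struct toy_mem_ref
{
  unsigned short clique;
  unsigned short base;
};

// Return false only when the clique/base pair proves independence:
// same non-zero clique, different base.  Otherwise conservatively
// answer "may alias".
static bool
toy_refs_may_alias_p (const toy_mem_ref &a, const toy_mem_ref &b)
{
  if (a.clique != 0 && a.clique == b.clique && a.base != b.base)
    return false;
  return true;
}
```

So references to a[], b[] and c[] in one kernels region could share a clique
and get distinct bases, assuming the dependences are actually known.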

Richard.

> Thanks,
> - Tom
> 
> 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-13 11:30             ` Richard Biener
@ 2015-11-13 11:39               ` Jakub Jelinek
  2015-11-21 12:24                 ` Tom de Vries
  2015-12-11 12:45                 ` Tom de Vries
  0 siblings, 2 replies; 133+ messages in thread
From: Jakub Jelinek @ 2015-11-13 11:39 UTC (permalink / raw)
  To: Richard Biener; +Cc: Tom de Vries, gcc-patches

On Fri, Nov 13, 2015 at 12:29:51PM +0100, Richard Biener wrote:
> > thanks for the explanation. Filed as PR68331 - '[meta-bug] fipa-pta issues'.
> > 
> > Any feedback on the '#pragma GCC offload-alias=<none|pointer|all>' bit above?
> > Is that sort of what you had in mind?
> 
> Yes.  Whether that makes sense is another question of course.  You can
> annotate memory references with MR_DEPENDENCE_BASE/CLIQUE yourself
> as well if you know dependences without the user's intervention.

I really don't like even the GCC offload-alias, I just don't see anything
special about the offload code.  Not to mention that the same issue already
exists with other outlined functions, like OpenMP tasks or parallel regions;
those aren't offloaded, yet they can suffer from worse alias/points-to
analysis too.

We simply have some compiler internal interface between the caller and
callee of the outlined regions, each interface in between those has
its own structure type used to communicate the info;
we can attach attributes on the fields, or some flags to indicate some
properties interesting from aliasing POV.  We don't really need to perform
full IPA-PTA, perhaps it would be enough to a) record somewhere in cgraph
the relationship in between such callers and callees (for offloading regions
we already have "omp target entrypoint" attribute on the callee and a
single caller), tell LTO if possible not to split those into different
partitions if easily possible, and then just for these pairs perform
aliasing/points-to analysis in the caller and the result record using
cliques/special attributes/whatever to the callee side, so that the callee
(outlined OpenMP/OpenACC/Cilk+ region) can then improve its alias analysis.

	Jakub

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 5/16] Add in_oacc_kernels_region in struct loop
  2015-11-11 10:57   ` Richard Biener
  2015-11-16 11:39     ` Tom de Vries
@ 2015-11-16 11:39     ` Tom de Vries
  1 sibling, 0 replies; 133+ messages in thread
From: Tom de Vries @ 2015-11-16 11:39 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 2655 bytes --]

On 11/11/15 11:55, Richard Biener wrote:
> On Mon, 9 Nov 2015, Tom de Vries wrote:
>
>> On 09/11/15 16:35, Tom de Vries wrote:
>>> Hi,
>>>
>>> this patch series for stage1 trunk adds support to:
>>> - parallelize oacc kernels regions using parloops, and
>>> - map the loops onto the oacc gang dimension.
>>>
>>> The patch series contains these patches:
>>>
>>>        1    Insert new exit block only when needed in
>>>           transform_to_exit_first_loop_alt
>>>        2    Make create_parallel_loop return void
>>>        3    Ignore reduction clause on kernels directive
>>>        4    Implement -foffload-alias
>>>        5    Add in_oacc_kernels_region in struct loop
>>>        6    Add pass_oacc_kernels
>>>        7    Add pass_dominator_oacc_kernels
>>>        8    Add pass_ch_oacc_kernels
>>>        9    Add pass_parallelize_loops_oacc_kernels
>>>       10    Add pass_oacc_kernels pass group in passes.def
>>>       11    Update testcases after adding kernels pass group
>>>       12    Handle acc loop directive
>>>       13    Add c-c++-common/goacc/kernels-*.c
>>>       14    Add gfortran.dg/goacc/kernels-*.f95
>>>       15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>>       16    Add libgomp.oacc-fortran/kernels-*.f95
>>>
>>> The first 9 patches are more or less independent, but patches 10-16 are
>>> intended to be committed at the same time.
>>>
>>> Bootstrapped and reg-tested on x86_64.
>>>
>>> Build and reg-tested with nvidia accelerator, in combination with a
>>> patch that enables accelerator testing (which is submitted at
>>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>>
>>> I'll post the individual patches in reply to this message.
>>
>> this patch adds and initializes the in_oacc_kernels_region field in
>> struct loop.
>>
>> The field is used to signal to subsequent passes that we're dealing with a
>> loop in a kernels region that we're trying to parallelize.
>>
>> Note that we do not parallelize kernels regions with more than one loop nest.
>> [ In general, kernels regions with more than one loop nest should be split up
>> into separate kernels regions, but that's not supported atm. ]
>
> I think mark_loops_in_oacc_kernels_region can be greatly simplified.
>
> Both region entry and exit should have the same ->loop_father (a SESE
> region).  Then you can just walk that loop's inner (and their sibling)
> loops checking their header domination relation with the region entry
> and exit (only necessary for direct inner loops).

Updated patch to use the loops structure.  Atm I'm also skipping loops 
containing sibling loops, since I have no test-cases for that yet.
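
To make the walk easier to follow, here is a standalone toy version of it.
It is only a sketch: toy_loop and toy_mark_loops are made-up names, and the
header_in_region flag stands in for the two dominated_by_p checks against
the region entry and exit.

```cpp
#include <cassert>

// Toy loop-tree node.  'inner' is the first child loop, 'next' the next
// sibling.  'header_in_region' is an assumption replacing the dominance
// checks on the loop header.
struct toy_loop
{
  toy_loop *inner = nullptr;
  toy_loop *next = nullptr;
  bool header_in_region = false;
  bool in_oacc_kernels_region = false;
};

// Mark the single loop nest inside the region whose enclosing loop is OUTER.
// Return false and mark nothing if the region has zero or several outer
// loops, or if any level of the nest has sibling loops.
static bool
toy_mark_loops (toy_loop *outer)
{
  assert (outer != nullptr);

  unsigned int nr_outer_loops = 0;
  toy_loop *single_outer = nullptr;
  for (toy_loop *loop = outer->inner; loop != nullptr; loop = loop->next)
    if (loop->header_in_region)
      {
        nr_outer_loops++;
        single_outer = loop;
      }
  if (nr_outer_loops != 1)
    return false;

  // Skip nests that contain sibling loops at any depth.
  for (toy_loop *loop = single_outer->inner; loop != nullptr;
       loop = loop->inner)
    if (loop->next != nullptr)
      return false;

  for (toy_loop *loop = single_outer; loop != nullptr; loop = loop->inner)
    loop->in_oacc_kernels_region = true;
  return true;
}
```

A perfectly nested two-level nest gets both loops marked; two loop nests in
one region, or an inner level with two sibling loops, leave everything
unmarked.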

Thanks,
- Tom


[-- Attachment #2: 0003-Add-in_oacc_kernels_region-in-struct-loop.patch --]
[-- Type: text/x-patch, Size: 2785 bytes --]

Add in_oacc_kernels_region in struct loop

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* cfgloop.h (struct loop): Add in_oacc_kernels_region field.
	* omp-low.c (mark_loops_in_oacc_kernels_region): New function.
	(expand_omp_target): Call mark_loops_in_oacc_kernels_region.

---
 gcc/cfgloop.h |  3 +++
 gcc/omp-low.c | 43 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+)

diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 6af6893..ee73bf9 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -191,6 +191,9 @@ struct GTY ((chain_next ("%h.next"))) loop {
   /* True if we should try harder to vectorize this loop.  */
   bool force_vectorize;
 
+  /* True if the loop is part of an oacc kernels region.  */
+  bool in_oacc_kernels_region;
+
   /* For SIMD loops, this is a unique identifier of the loop, referenced
      by IFN_GOMP_SIMD_VF, IFN_GOMP_SIMD_LANE and IFN_GOMP_SIMD_LAST_LANE
      builtins.  */
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 5f76434..fba7bbd 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -12450,6 +12450,46 @@ get_oacc_ifn_dim_arg (const gimple *stmt)
   return (int) axis;
 }
 
+/* Mark the loops inside the kernels region starting at REGION_ENTRY and ending
+   at REGION_EXIT.  */
+
+static void
+mark_loops_in_oacc_kernels_region (basic_block region_entry,
+				   basic_block region_exit)
+{
+  struct loop *outer = region_entry->loop_father;
+  gcc_assert (region_exit == NULL || outer == region_exit->loop_father);
+
+  /* Don't parallelize the kernels region if it contains more than one outer
+     loop.  */
+  unsigned int nr_outer_loops = 0;
+  struct loop *single_outer = NULL;
+  for (struct loop *loop = outer->inner; loop != NULL; loop = loop->next)
+    {
+      gcc_assert (loop_outer (loop) == outer);
+
+      if (!dominated_by_p (CDI_DOMINATORS, loop->header, region_entry))
+	continue;
+
+      if (region_exit != NULL
+	  && dominated_by_p (CDI_DOMINATORS, loop->header, region_exit))
+	continue;
+
+      nr_outer_loops++;
+      single_outer = loop;
+    }
+  if (nr_outer_loops != 1)
+    return;
+
+  for (struct loop *loop = single_outer->inner; loop != NULL; loop = loop->inner)
+    if (loop->next)
+      return;
+
+  /* Mark the loops in the region.  */
+  for (struct loop *loop = single_outer; loop != NULL; loop = loop->inner)
+    loop->in_oacc_kernels_region = true;
+}
+
 /* Expand the GIMPLE_OMP_TARGET starting at REGION.  */
 
 static void
@@ -12505,6 +12545,9 @@ expand_omp_target (struct omp_region *region)
   entry_bb = region->entry;
   exit_bb = region->exit;
 
+  if (gimple_omp_target_kind (entry_stmt) == GF_OMP_TARGET_KIND_OACC_KERNELS)
+    mark_loops_in_oacc_kernels_region (region->entry, region->exit);
+
   if (offloaded)
     {
       unsigned srcidx, dstidx, num;

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-11 11:03   ` Richard Biener
@ 2015-11-16 11:55     ` Tom de Vries
  2015-11-16 12:45       ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-16 11:55 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 4549 bytes --]

On 11/11/15 12:02, Richard Biener wrote:
> On Mon, 9 Nov 2015, Tom de Vries wrote:
>
>> On 09/11/15 16:35, Tom de Vries wrote:
>>> Hi,
>>>
>>> this patch series for stage1 trunk adds support to:
>>> - parallelize oacc kernels regions using parloops, and
>>> - map the loops onto the oacc gang dimension.
>>>
>>> The patch series contains these patches:
>>>
>>>        1    Insert new exit block only when needed in
>>>           transform_to_exit_first_loop_alt
>>>        2    Make create_parallel_loop return void
>>>        3    Ignore reduction clause on kernels directive
>>>        4    Implement -foffload-alias
>>>        5    Add in_oacc_kernels_region in struct loop
>>>        6    Add pass_oacc_kernels
>>>        7    Add pass_dominator_oacc_kernels
>>>        8    Add pass_ch_oacc_kernels
>>>        9    Add pass_parallelize_loops_oacc_kernels
>>>       10    Add pass_oacc_kernels pass group in passes.def
>>>       11    Update testcases after adding kernels pass group
>>>       12    Handle acc loop directive
>>>       13    Add c-c++-common/goacc/kernels-*.c
>>>       14    Add gfortran.dg/goacc/kernels-*.f95
>>>       15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>>       16    Add libgomp.oacc-fortran/kernels-*.f95
>>>
>>> The first 9 patches are more or less independent, but patches 10-16 are
>>> intended to be committed at the same time.
>>>
>>> Bootstrapped and reg-tested on x86_64.
>>>
>>> Build and reg-tested with nvidia accelerator, in combination with a
>>> patch that enables accelerator testing (which is submitted at
>>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>>
>>> I'll post the individual patches in reply to this message.
>>>
>>
>> This patch adds the pass_oacc_kernels pass group to the pass list in
>> passes.def.
>>
>> Note the repetition of pass_lim/pass_copy_prop. The first pair is for an inner
>> loop in a loop nest, the second for an outer loop in a loop nest.
>
> @@ -86,6 +86,27 @@ along with GCC; see the file COPYING3.  If not see
>            /* pass_build_ealias is a dummy pass that ensures that we
>               execute TODO_rebuild_alias at this point.  */
>            NEXT_PASS (pass_build_ealias);
> +         /* Pass group that runs when there are oacc kernels in the
> +            function.  */
> +         NEXT_PASS (pass_oacc_kernels);
> +         PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
> +             NEXT_PASS (pass_dominator_oacc_kernels);
> +             NEXT_PASS (pass_ch_oacc_kernels);
> +             NEXT_PASS (pass_dominator_oacc_kernels);
> +             NEXT_PASS (pass_tree_loop_init);
> +             NEXT_PASS (pass_lim);
> +             NEXT_PASS (pass_copy_prop);
> +             NEXT_PASS (pass_lim);
> +             NEXT_PASS (pass_copy_prop);
>
> iterate lim/copyprop twice?!  Why's that needed?
>

I've managed to eliminate the last pass_copy_prop, but not pass_lim. 
I've added a comment:
...
   /* We use pass_lim to rewrite in-memory iteration and reduction
      variable accesses in loops into local variable accesses.
      However, a single pass instantiation manages to do this only for
      one loop level, so we use pass_lim twice to at least be able to
      handle a loop nest with a depth of two.  */
   NEXT_PASS (pass_lim);
   NEXT_PASS (pass_copy_prop);
   NEXT_PASS (pass_lim);
...

> +             NEXT_PASS (pass_scev_cprop);
>
> What's that for?  It's supposed to help removing loops - I don't
> expect kernels to vanish.

I'm using pass_scev_cprop for the "final value replacement" 
functionality. Added comment.

>
> +             NEXT_PASS (pass_tree_loop_done);
> +             NEXT_PASS (pass_dominator_oacc_kernels);
>
> Three times DOM?  No please.  I wonder why you don't run oacc_kernels
> after FRE and drop the initial DOM(s).
>

Done. There's just one pass_dominator_oacc_kernels left now.

> +             NEXT_PASS (pass_dce);
> +             NEXT_PASS (pass_tree_loop_init);
> +             NEXT_PASS (pass_parallelize_loops_oacc_kernels);
> +             NEXT_PASS (pass_expand_omp_ssa);
> +             NEXT_PASS (pass_tree_loop_done);
>
> The switches into/outof tree_loop also look odd to me, but well
> (they'll be controlled by -ftree-loop-optimize).
>

I've eliminated all the uses for pass_tree_loop_init/pass_tree_loop_done 
in the pass group. Instead, I've added conditional loop optimizer setup in:
- pass_lim and pass_scev_cprop (added in this patch), and
- pass_parallelize_loops_oacc_kernels (added in patch "Add
  pass_parallelize_loops_oacc_kernels").
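
The conditional setup boils down to "initialize only if the required loop
state is not there yet".  A standalone toy sketch of that pattern follows;
the toy_* names and the flag values are made up for illustration and are
not the real loops_state_satisfies_p/loop_optimizer_init interfaces.

```cpp
#include <cassert>

// Hypothetical state bits, loosely mirroring the flags checked in the patch.
enum toy_loop_state : unsigned
{
  TOY_LOOPS_NORMAL = 1u << 0,
  TOY_LOOPS_HAVE_RECORDED_EXITS = 1u << 1,
  TOY_LOOP_CLOSED_SSA = 1u << 2
};

struct toy_function
{
  unsigned loop_state = 0;  // Bitmask of toy_loop_state flags.
  unsigned init_count = 0;  // How often the setup actually ran.
};

// Return true if all FLAGS are set for FN.
static bool
toy_state_satisfies_p (const toy_function &fn, unsigned flags)
{
  return (fn.loop_state & flags) == flags;
}

// Set up the loop optimizer state, but only when it is missing.  This is
// what lets a pass run both inside and outside a loop pipeline.
static void
toy_ensure_loop_optimizer (toy_function &fn)
{
  const unsigned required = (TOY_LOOPS_NORMAL
                             | TOY_LOOPS_HAVE_RECORDED_EXITS
                             | TOY_LOOP_CLOSED_SSA);
  if (toy_state_satisfies_p (fn, required))
    return;
  fn.loop_state |= required;
  fn.init_count++;
}

// Stand-in for a pass body such as pass_lim: ensure the state, then work.
static unsigned
toy_run_lim (toy_function &fn)
{
  toy_ensure_loop_optimizer (fn);
  return fn.init_count;
}
```

Running the toy pass twice on the same function does the setup once; a
function that already has the state (as inside the loop pipeline) is left
alone.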

Thanks,
- Tom


[-- Attachment #2: 0007-Add-pass_oacc_kernels-pass-group-in-passes.def.patch --]
[-- Type: text/x-patch, Size: 5177 bytes --]

Add pass_oacc_kernels pass group in passes.def

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* omp-low.c (pass_expand_omp_ssa::clone): New function.
	* passes.def: Add pass_oacc_kernels pass group.
	* tree-ssa-loop-ch.c (pass_ch::clone): New function.
	* tree-ssa-loop-im.c (tree_ssa_lim): Allow to run outside
	pass_tree_loop.
	* tree-ssa-loop.c (pass_scev_cprop::clone): New function.
	(pass_scev_cprop::execute): Allow to run outside pass_tree_loop.

---
 gcc/omp-low.c          |  1 +
 gcc/passes.def         | 25 +++++++++++++++++++++++++
 gcc/tree-ssa-loop-ch.c |  2 ++
 gcc/tree-ssa-loop-im.c | 14 ++++++++++++++
 gcc/tree-ssa-loop.c    | 22 +++++++++++++++++++++-
 5 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 9eae09a..8078afb 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -13385,6 +13385,7 @@ public:
       return !(fun->curr_properties & PROP_gimple_eomp);
     }
   virtual unsigned int execute (function *) { return execute_expand_omp (); }
+  opt_pass * clone () { return new pass_expand_omp_ssa (m_ctxt); }
 
 }; // class pass_expand_omp_ssa
 
diff --git a/gcc/passes.def b/gcc/passes.def
index db822d3..d76cfd3 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -87,6 +87,31 @@ along with GCC; see the file COPYING3.  If not see
 	     execute TODO_rebuild_alias at this point.  */
 	  NEXT_PASS (pass_build_ealias);
 	  NEXT_PASS (pass_fre);
+	  /* Pass group that runs when the function is an offloaded function
+	     containing oacc kernels loops.  */
+	  NEXT_PASS (pass_oacc_kernels);
+	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
+	      /* We need pass_ch here, because pass_lim has no effect on
+	         exit-first loops (PR65442).  Ideally we want to remove both
+		 this pass instantiation, and the reverse transformation
+		 transform_to_exit_first_loop_alt, which is done in
+		 pass_parallelize_loops_oacc_kernels. */
+	      NEXT_PASS (pass_ch);
+	      /* We use pass_lim to rewrite in-memory iteration and reduction
+	         variable accesses in loops into local variables accesses.
+		 However, a single pass instantiation manages to do this only for
+		 one loop level, so we use pass_lim twice to at least be able to
+		 handle a loop nest with a depth of two.  */
+	      NEXT_PASS (pass_lim);
+	      NEXT_PASS (pass_copy_prop);
+	      NEXT_PASS (pass_lim);
+	      /* We use pass_scev_cprop here for final value replacement.  */
+	      NEXT_PASS (pass_scev_cprop);
+	      NEXT_PASS (pass_dominator_oacc_kernels);
+	      NEXT_PASS (pass_dce);
+	      NEXT_PASS (pass_parallelize_loops_oacc_kernels);
+	      NEXT_PASS (pass_expand_omp_ssa);
+	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_merge_phi);
           NEXT_PASS (pass_dse);
 	  NEXT_PASS (pass_cd_dce);
diff --git a/gcc/tree-ssa-loop-ch.c b/gcc/tree-ssa-loop-ch.c
index 7e618bf..6493fcc 100644
--- a/gcc/tree-ssa-loop-ch.c
+++ b/gcc/tree-ssa-loop-ch.c
@@ -165,6 +165,8 @@ public:
   /* Initialize and finalize loop structures, copying headers inbetween.  */
   virtual unsigned int execute (function *);
 
+  opt_pass * clone () { return new pass_ch (m_ctxt); }
+
 protected:
   /* ch_base method: */
   virtual bool process_loop_p (struct loop *loop);
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 30b53ce..48810f3 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -43,6 +43,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa-propagate.h"
 #include "trans-mem.h"
 #include "gimple-fold.h"
+#include "tree-scalar-evolution.h"
 
 /* TODO:  Support for predicated code motion.  I.e.
 
@@ -2501,6 +2502,19 @@ tree_ssa_lim (void)
 {
   unsigned int todo;
 
+  if (!loops_state_satisfies_p (LOOPS_NORMAL
+				| LOOPS_HAVE_RECORDED_EXITS
+				| LOOP_CLOSED_SSA))
+    {
+      loop_optimizer_init (LOOPS_NORMAL
+			   | LOOPS_HAVE_RECORDED_EXITS);
+      rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
+
+      /* We might discover new loops, e.g. when turning irreducible
+	 regions into reducible.  */
+      scev_initialize ();
+    }
+
   tree_ssa_lim_initialize ();
 
   /* Gathers information about memory accesses in the loops.  */
diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index b51cac2..570406f 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -373,10 +373,30 @@ public:
 
   /* opt_pass methods: */
   virtual bool gate (function *) { return flag_tree_scev_cprop; }
-  virtual unsigned int execute (function *) { return scev_const_prop (); }
+  virtual unsigned int execute (function *);
+  opt_pass * clone () { return new pass_scev_cprop (m_ctxt); }
 
 }; // class pass_scev_cprop
 
+unsigned int
+pass_scev_cprop::execute (function *)
+{
+  if (!loops_state_satisfies_p (LOOPS_NORMAL
+				| LOOPS_HAVE_RECORDED_EXITS
+				| LOOP_CLOSED_SSA))
+    {
+      loop_optimizer_init (LOOPS_NORMAL
+			   | LOOPS_HAVE_RECORDED_EXITS);
+      rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
+
+      /* We might discover new loops, e.g. when turning irreducible
+	 regions into reducible.  */
+      scev_initialize ();
+    }
+
+  return scev_const_prop ();
+}
+
 } // anon namespace
 
 gimple_opt_pass *

* Re: [PATCH, 9/16] Add pass_parallelize_loops_oacc_kernels
  2015-11-09 19:53 ` [PATCH, 9/16] Add pass_parallelize_loops_oacc_kernels Tom de Vries
@ 2015-11-16 11:59   ` Tom de Vries
  2015-11-24 12:27     ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-16 11:59 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 3350 bytes --]

On 09/11/15 20:52, Tom de Vries wrote:
> On 09/11/15 16:35, Tom de Vries wrote:
>> Hi,
>>
>> this patch series for stage1 trunk adds support to:
>> - parallelize oacc kernels regions using parloops, and
>> - map the loops onto the oacc gang dimension.
>>
>> The patch series contains these patches:
>>
>>       1    Insert new exit block only when needed in
>>          transform_to_exit_first_loop_alt
>>       2    Make create_parallel_loop return void
>>       3    Ignore reduction clause on kernels directive
>>       4    Implement -foffload-alias
>>       5    Add in_oacc_kernels_region in struct loop
>>       6    Add pass_oacc_kernels
>>       7    Add pass_dominator_oacc_kernels
>>       8    Add pass_ch_oacc_kernels
>>       9    Add pass_parallelize_loops_oacc_kernels
>>      10    Add pass_oacc_kernels pass group in passes.def
>>      11    Update testcases after adding kernels pass group
>>      12    Handle acc loop directive
>>      13    Add c-c++-common/goacc/kernels-*.c
>>      14    Add gfortran.dg/goacc/kernels-*.f95
>>      15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>      16    Add libgomp.oacc-fortran/kernels-*.f95
>>
>> The first 9 patches are more or less independent, but patches 10-16 are
>> intended to be committed at the same time.
>>
>> Bootstrapped and reg-tested on x86_64.
>>
>> Build and reg-tested with nvidia accelerator, in combination with a
>> patch that enables accelerator testing (which is submitted at
>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>
>> I'll post the individual patches in reply to this message.
>
> This patch adds pass_parallelize_loops_oacc_kernels.
>
> There are a number of things we do differently in parloops for oacc kernels:
> - in normal parloops, we generate code to choose between a parallel
>    version of the loop, and a sequential (low iteration count) version.
>    Since the code in oacc kernels region is supposed to run on the
>    accelerator anyway, we skip this check, and don't add a low iteration
>    count loop.
> - in normal parloops, we generate a #pragma omp parallel /
>    GIMPLE_OMP_RETURN pair to delimit the region which we will split off
>    into a thread function. Since the oacc kernels region is already
>    split off, we don't add this pair.
> - we indicate the parallelization factor by setting the oacc function
>    attributes
> - we generate a #pragma acc loop instead of a #pragma omp for, and
>    we add the gang clause
> - in normal parloops, we rewrite the variable accesses in the loop
>    into accesses relative to a thread function parameter. For the
>    oacc kernels region, that rewrite has already been done at omp-lower,
>    so we skip this.
> - we need to ensure that the entire kernels region can be run in
>    parallel. The loop independence check is already present, so for oacc
>    kernels we add a check between blocks outside the loop and the entire
>    region.
> - we guard stores in the blocks outside the loop with gang_pos == 0.
>    There's no need for every gang to write to the same location; it is
>    enough to do this in just one gang. (Typically this is the write of
>    the final value of the iteration variable, if that one is copied back
>    to the host.)
>
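
For reference, the kind of input the new pass targets looks roughly like
the sketch below. It is only an illustration modelled on the kernels-loop
style testcases in this series; the names vec_add and run_vec_add and the
size N are made up for the example. Built without -fopenacc the pragma is
ignored and the loop simply runs sequentially on the host.

```c
#include <assert.h>

#define N 1024

/* Hypothetical example, not part of the patch: a vector add in an oacc
   kernels region.  */
static void
vec_add (const unsigned int *a, const unsigned int *b, unsigned int *c)
{
#pragma acc kernels copyin (a[0:N], b[0:N]) copyout (c[0:N])
  {
    for (int ii = 0; ii < N; ii++)
      c[ii] = a[ii] + b[ii];
  }
}

/* Fill the inputs, run the region, and return the number of wrong
   results.  */
static int
run_vec_add (void)
{
  static unsigned int a[N], b[N], c[N];
  int errors = 0;

  for (int i = 0; i < N; i++)
    {
      a[i] = i * 2;
      b[i] = i * 4;
    }

  vec_add (a, b, c);

  for (int i = 0; i < N; i++)
    if (c[i] != a[i] + b[i])
      errors++;

  assert (errors == 0);
  return errors;
}
```

The loop has no cross-iteration dependences, so it is the sort of loop
the independence check in parloops is expected to accept.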

Reposting with loop optimizer init added in 
pass_parallelize_loops_oacc_kernels::execute.
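
As a rough illustration of the gang_pos == 0 guarding described in the
quoted text, here is a small host-side model. It is a sketch under
assumptions, not what the pass emits: region_for_gang, simulate_region,
exit_stores, N and NUM_GANGS are hypothetical names, the gangs are run one
after the other, and the cyclic distribution of iterations is only chosen
to keep the example short.

```c
#include <assert.h>

#define N 10
#define NUM_GANGS 4

/* Counts how often the store outside the loop is executed.  */
static int exit_stores;

/* Hypothetical model of the offloaded function as run by one gang.
   GANG_POS plays the role of the value returned by GOACC_DIM_POS.  */
static void
region_for_gang (int gang_pos, const int *a, int *c, int *final_ii)
{
  /* Loop iterations are divided over the gangs.  The cyclic distribution
     is an assumption of this sketch.  */
  for (int ii = gang_pos; ii < N; ii += NUM_GANGS)
    c[ii] = a[ii] + 1;

  /* Store outside the loop, guarded with gang_pos == 0.  */
  if (gang_pos == 0)
    {
      *final_ii = N;
      exit_stores++;
    }
}

/* Run all gangs one after the other on the host.  */
static void
simulate_region (const int *a, int *c, int *final_ii)
{
  exit_stores = 0;
  for (int gang = 0; gang < NUM_GANGS; gang++)
    region_for_gang (gang, a, c, final_ii);
  assert (exit_stores == 1);
}
```

Every gang computes its share of c, but the final value of the iteration
variable is written once, by gang 0 only.
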

Thanks,
- Tom

[-- Attachment #2: 0006-Add-pass_parallelize_loops_oacc_kernels.patch --]
[-- Type: text/x-patch, Size: 30773 bytes --]

Add pass_parallelize_loops_oacc_kernels

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* omp-low.c (set_oacc_fn_attrib): Make extern.
	(expand_omp_atomic_fetch_op): Release defs of update stmt.
	* omp-low.h (set_oacc_fn_attrib): Declare.
	* tree-parloops.c (struct reduction_info): Add reduc_addr field.
	(create_call_for_reduction_1): Handle case that reduc_addr is non-NULL.
	(create_parallel_loop, gen_parallel_loop, try_create_reduction_list):
	Add and handle function parameter oacc_kernels_p.
	(get_omp_data_i_param): New function.
	(ref_conflicts_with_region, oacc_entry_exit_ok_1)
	(oacc_entry_exit_single_gang, oacc_entry_exit_ok): New functions.
	(parallelize_loops): Add and handle function parameter oacc_kernels_p.
	Calculate dominance info.  Skip loops that are not in a kernels region
	in oacc_kernels_p mode.  Skip inner loops of parallelized loops.
	(pass_parallelize_loops::execute): Call parallelize_loops with false
	argument.
	(pass_data_parallelize_loops_oacc_kernels): New pass_data.
	(class pass_parallelize_loops_oacc_kernels): New pass.
	(pass_parallelize_loops_oacc_kernels::execute)
	(make_pass_parallelize_loops_oacc_kernels): New functions.
	* tree-pass.h (make_pass_parallelize_loops_oacc_kernels): Declare.

---
 gcc/omp-low.c       |   8 +-
 gcc/omp-low.h       |   1 +
 gcc/tree-parloops.c | 693 +++++++++++++++++++++++++++++++++++++++++++++++-----
 gcc/tree-pass.h     |   2 +
 4 files changed, 640 insertions(+), 64 deletions(-)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index fba7bbd..9eae09a 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -11944,10 +11944,14 @@ expand_omp_atomic_fetch_op (basic_block load_bb,
   gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_ATOMIC_STORE);
   gsi_remove (&gsi, true);
   gsi = gsi_last_bb (store_bb);
+  stmt = gsi_stmt (gsi);
   gsi_remove (&gsi, true);
 
   if (gimple_in_ssa_p (cfun))
-    update_ssa (TODO_update_ssa_no_phi);
+    {
+      release_defs (stmt);
+      update_ssa (TODO_update_ssa_no_phi);
+    }
 
   return true;
 }
@@ -12321,7 +12325,7 @@ replace_oacc_fn_attrib (tree fn, tree dims)
    function attribute.  Push any that are non-constant onto the ARGS
    list, along with an appropriate GOMP_LAUNCH_DIM tag.  */
 
-static void
+void
 set_oacc_fn_attrib (tree fn, tree clauses, vec<tree> *args)
 {
   /* Must match GOMP_DIM ordering.  */
diff --git a/gcc/omp-low.h b/gcc/omp-low.h
index 194b3d1..1790f40 100644
--- a/gcc/omp-low.h
+++ b/gcc/omp-low.h
@@ -33,6 +33,7 @@ extern tree omp_member_access_dummy_var (tree);
 extern void replace_oacc_fn_attrib (tree, tree);
 extern tree build_oacc_routine_dims (tree);
 extern tree get_oacc_fn_attrib (tree);
+extern void set_oacc_fn_attrib (tree, tree, vec<tree> *);
 extern int get_oacc_ifn_dim_arg (const gimple *);
 extern int get_oacc_fn_dim_size (tree, int);
 
diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
index 17415a8..96b8415 100644
--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c
@@ -53,6 +53,10 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa.h"
 #include "params.h"
 #include "params-enum.h"
+#include "tree-ssa-alias.h"
+#include "tree-eh.h"
+#include "gomp-constants.h"
+#include "tree-dfa.h"
 
 /* This pass tries to distribute iterations of loops into several threads.
    The implementation is straightforward -- for each loop we test whether its
@@ -192,6 +196,8 @@ struct reduction_info
 				   of the reduction variable when existing the loop. */
   tree initial_value;		/* The initial value of the reduction var before entering the loop.  */
   tree field;			/*  the name of the field in the parloop data structure intended for reduction.  */
+  tree reduc_addr;		/* The address of the reduction variable for
+				   openacc reductions.  */
   tree init;			/* reduction initialization value.  */
   gphi *new_phi;		/* (helper field) Newly created phi node whose result
 				   will be passed to the atomic operation.  Represents
@@ -1085,10 +1091,29 @@ create_call_for_reduction_1 (reduction_info **slot, struct clsn_data *clsn_data)
   tree tmp_load, name;
   gimple *load;
 
-  load_struct = build_simple_mem_ref (clsn_data->load);
-  t = build3 (COMPONENT_REF, type, load_struct, reduc->field, NULL_TREE);
+  if (reduc->reduc_addr == NULL_TREE)
+    {
+      load_struct = build_simple_mem_ref (clsn_data->load);
+      t = build3 (COMPONENT_REF, type, load_struct, reduc->field, NULL_TREE);
+
+      addr = build_addr (t);
+    }
+  else
+    {
+      /* Set the address for the atomic store.  */
+      addr = reduc->reduc_addr;
 
-  addr = build_addr (t);
+      /* Remove the non-atomic store '*addr = sum'.  */
+      tree res = PHI_RESULT (reduc->keep_res);
+      use_operand_p use_p;
+      gimple *stmt;
+      bool single_use_p = single_imm_use (res, &use_p, &stmt);
+      gcc_assert (single_use_p);
+      replace_uses_by (gimple_vdef (stmt),
+		       gimple_vuse (stmt));
+      gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+      gsi_remove (&gsi, true);
+    }
 
   /* Create phi node.  */
   bb = clsn_data->load_bb;
@@ -1990,7 +2015,8 @@ transform_to_exit_first_loop (struct loop *loop,
 
 static void
 create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
-		      tree new_data, unsigned n_threads, location_t loc)
+		      tree new_data, unsigned n_threads, location_t loc,
+		      bool oacc_kernels_p)
 {
   gimple_stmt_iterator gsi;
   basic_block bb, paral_bb, for_bb, ex_bb, continue_bb;
@@ -2003,19 +2029,33 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
   gomp_continue *omp_cont_stmt;
   tree cvar, cvar_init, initvar, cvar_next, cvar_base, type;
   edge exit, nexit, guard, end, e;
+  tree for_clauses = NULL_TREE;
 
   /* Prepare the GIMPLE_OMP_PARALLEL statement.  */
   bb = loop_preheader_edge (loop)->src;
-  paral_bb = single_pred (bb);
-  gsi = gsi_last_bb (paral_bb);
+  if (!oacc_kernels_p)
+    {
+      paral_bb = single_pred (bb);
+      gsi = gsi_last_bb (paral_bb);
+    }
 
-  t = build_omp_clause (loc, OMP_CLAUSE_NUM_THREADS);
-  OMP_CLAUSE_NUM_THREADS_EXPR (t)
-    = build_int_cst (integer_type_node, n_threads);
-  omp_par_stmt = gimple_build_omp_parallel (NULL, t, loop_fn, data);
-  gimple_set_location (omp_par_stmt, loc);
+  if (!oacc_kernels_p)
+    {
+      t = build_omp_clause (loc, OMP_CLAUSE_NUM_THREADS);
+      OMP_CLAUSE_NUM_THREADS_EXPR (t)
+	= build_int_cst (integer_type_node, n_threads);
+      omp_par_stmt = gimple_build_omp_parallel (NULL, t, loop_fn, data);
+      gimple_set_location (omp_par_stmt, loc);
 
-  gsi_insert_after (&gsi, omp_par_stmt, GSI_NEW_STMT);
+      gsi_insert_after (&gsi, omp_par_stmt, GSI_NEW_STMT);
+    }
+  else
+    {
+      tree clause = build_omp_clause (loc, OMP_CLAUSE_NUM_GANGS);
+      OMP_CLAUSE_NUM_GANGS_EXPR (clause)
+	= build_int_cst (integer_type_node, n_threads);
+      set_oacc_fn_attrib (cfun->decl, clause, NULL);
+    }
 
   /* Initialize NEW_DATA.  */
   if (data)
@@ -2033,12 +2073,18 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
       gsi_insert_before (&gsi, assign_stmt, GSI_SAME_STMT);
     }
 
-  /* Emit GIMPLE_OMP_RETURN for GIMPLE_OMP_PARALLEL.  */
-  bb = split_loop_exit_edge (single_dom_exit (loop));
-  gsi = gsi_last_bb (bb);
-  omp_return_stmt1 = gimple_build_omp_return (false);
-  gimple_set_location (omp_return_stmt1, loc);
-  gsi_insert_after (&gsi, omp_return_stmt1, GSI_NEW_STMT);
+  /* Skip insertion of OMP_RETURN for oacc_kernels_p.  We've already generated
+     one when lowering the oacc kernels directive in
+     pass_lower_omp/lower_omp ().  */
+  if (!oacc_kernels_p)
+    {
+      /* Emit GIMPLE_OMP_RETURN for GIMPLE_OMP_PARALLEL.  */
+      bb = split_loop_exit_edge (single_dom_exit (loop));
+      gsi = gsi_last_bb (bb);
+      omp_return_stmt1 = gimple_build_omp_return (false);
+      gimple_set_location (omp_return_stmt1, loc);
+      gsi_insert_after (&gsi, omp_return_stmt1, GSI_NEW_STMT);
+    }
 
   /* Extract data for GIMPLE_OMP_FOR.  */
   gcc_assert (loop->header == single_dom_exit (loop)->src);
@@ -2130,7 +2176,17 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
     OMP_CLAUSE_SCHEDULE_CHUNK_EXPR (t)
       = build_int_cst (integer_type_node, chunk_size);
 
-  for_stmt = gimple_build_omp_for (NULL, GF_OMP_FOR_KIND_FOR, t, 1, NULL);
+  if (1)
+    {
+      /* In combination with the NUM_GANGS on the parallel.  */
+      for_clauses = build_omp_clause (loc, OMP_CLAUSE_GANG);
+    }
+
+  for_stmt = gimple_build_omp_for (NULL,
+				   (oacc_kernels_p
+				    ? GF_OMP_FOR_KIND_OACC_LOOP
+				    : GF_OMP_FOR_KIND_FOR),
+				   for_clauses, 1, NULL);
   gimple_set_location (for_stmt, loc);
   gimple_omp_for_set_index (for_stmt, 0, initvar);
   gimple_omp_for_set_initial (for_stmt, 0, cvar_init);
@@ -2172,7 +2228,8 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
 static void
 gen_parallel_loop (struct loop *loop,
 		   reduction_info_table_type *reduction_list,
-		   unsigned n_threads, struct tree_niter_desc *niter)
+		   unsigned n_threads, struct tree_niter_desc *niter,
+		   bool oacc_kernels_p)
 {
   tree many_iterations_cond, type, nit;
   tree arg_struct, new_arg_struct;
@@ -2253,40 +2310,44 @@ gen_parallel_loop (struct loop *loop,
   if (stmts)
     gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
 
-  if (loop->inner)
-    m_p_thread=2;
-  else
-    m_p_thread=MIN_PER_THREAD;
-
-   many_iterations_cond =
-     fold_build2 (GE_EXPR, boolean_type_node,
-                nit, build_int_cst (type, m_p_thread * n_threads));
-
-  many_iterations_cond
-    = fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
-		   invert_truthvalue (unshare_expr (niter->may_be_zero)),
-		   many_iterations_cond);
-  many_iterations_cond
-    = force_gimple_operand (many_iterations_cond, &stmts, false, NULL_TREE);
-  if (stmts)
-    gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
-  if (!is_gimple_condexpr (many_iterations_cond))
+  if (!oacc_kernels_p)
     {
+      if (loop->inner)
+	m_p_thread=2;
+      else
+	m_p_thread=MIN_PER_THREAD;
+
+      many_iterations_cond =
+	fold_build2 (GE_EXPR, boolean_type_node,
+		     nit, build_int_cst (type, m_p_thread * n_threads));
+
+      many_iterations_cond
+	= fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
+		       invert_truthvalue (unshare_expr (niter->may_be_zero)),
+		       many_iterations_cond);
       many_iterations_cond
-	= force_gimple_operand (many_iterations_cond, &stmts,
-				true, NULL_TREE);
+	= force_gimple_operand (many_iterations_cond, &stmts, false, NULL_TREE);
       if (stmts)
 	gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
-    }
+      if (!is_gimple_condexpr (many_iterations_cond))
+	{
+	  many_iterations_cond
+	    = force_gimple_operand (many_iterations_cond, &stmts,
+				    true, NULL_TREE);
+	  if (stmts)
+	    gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop),
+					      stmts);
+	}
 
-  initialize_original_copy_tables ();
+      initialize_original_copy_tables ();
 
-  /* We assume that the loop usually iterates a lot.  */
-  prob = 4 * REG_BR_PROB_BASE / 5;
-  loop_version (loop, many_iterations_cond, NULL,
-		prob, prob, REG_BR_PROB_BASE - prob, true);
-  update_ssa (TODO_update_ssa);
-  free_original_copy_tables ();
+      /* We assume that the loop usually iterates a lot.  */
+      prob = 4 * REG_BR_PROB_BASE / 5;
+      loop_version (loop, many_iterations_cond, NULL,
+		    prob, prob, REG_BR_PROB_BASE - prob, true);
+      update_ssa (TODO_update_ssa);
+      free_original_copy_tables ();
+    }
 
   /* Base all the induction variables in LOOP on a single control one.  */
   canonicalize_loop_ivs (loop, &nit, true);
@@ -2306,6 +2367,9 @@ gen_parallel_loop (struct loop *loop,
     }
   else
     {
+      if (oacc_kernels_p)
+	n_threads = 1;
+
       /* Fall back on the method that handles more cases, but duplicates the
 	 loop body: move the exit condition of LOOP to the beginning of its
 	 header, and duplicate the part of the last iteration that gets disabled
@@ -2322,19 +2386,34 @@ gen_parallel_loop (struct loop *loop,
   entry = loop_preheader_edge (loop);
   exit = single_dom_exit (loop);
 
-  eliminate_local_variables (entry, exit);
-  /* In the old loop, move all variables non-local to the loop to a structure
-     and back, and create separate decls for the variables used in loop.  */
-  separate_decls_in_region (entry, exit, reduction_list, &arg_struct,
-			    &new_arg_struct, &clsn_data);
+  /* This rewrites the body in terms of new variables.  This has already
+     been done for oacc_kernels_p in pass_lower_omp/lower_omp ().  */
+  if (!oacc_kernels_p)
+    {
+      eliminate_local_variables (entry, exit);
+      /* In the old loop, move all variables non-local to the loop to a
+	 structure and back, and create separate decls for the variables used in
+	 loop.  */
+      separate_decls_in_region (entry, exit, reduction_list, &arg_struct,
+				&new_arg_struct, &clsn_data);
+    }
+  else
+    {
+      arg_struct = NULL_TREE;
+      new_arg_struct = NULL_TREE;
+      clsn_data.load = NULL_TREE;
+      clsn_data.load_bb = exit->dest;
+      clsn_data.store = NULL_TREE;
+      clsn_data.store_bb = NULL;
+    }
 
   /* Create the parallel constructs.  */
   loc = UNKNOWN_LOCATION;
   cond_stmt = last_stmt (loop->header);
   if (cond_stmt)
     loc = gimple_location (cond_stmt);
-  create_parallel_loop (loop, create_loop_fn (loc), arg_struct,
-			new_arg_struct, n_threads, loc);
+  create_parallel_loop (loop, create_loop_fn (loc), arg_struct, new_arg_struct,
+			n_threads, loc, oacc_kernels_p);
   if (reduction_list->elements () > 0)
     create_call_for_reduction (loop, reduction_list, &clsn_data);
 
@@ -2527,12 +2606,21 @@ try_get_loop_niter (loop_p loop, struct tree_niter_desc *niter)
   return true;
 }
 
+static tree
+get_omp_data_i_param (void)
+{
+  tree decl = DECL_ARGUMENTS (cfun->decl);
+  gcc_assert (DECL_CHAIN (decl) == NULL_TREE);
+  return ssa_default_def (cfun, decl);
+}
+
 /* Try to initialize REDUCTION_LIST for code generation part.
    REDUCTION_LIST describes the reductions.  */
 
 static bool
 try_create_reduction_list (loop_p loop,
-			   reduction_info_table_type *reduction_list)
+			   reduction_info_table_type *reduction_list,
+			   bool oacc_kernels_p)
 {
   edge exit = single_dom_exit (loop);
   gphi_iterator gsi;
@@ -2588,6 +2676,7 @@ try_create_reduction_list (loop_p loop,
 			 "  FAILED: it is not a part of reduction.\n");
 	      return false;
 	    }
+	  red->keep_res = phi;
 	  if (dump_file && (dump_flags & TDF_DETAILS))
 	    {
 	      fprintf (dump_file, "reduction phi is  ");
@@ -2622,15 +2711,402 @@ try_create_reduction_list (loop_p loop,
     }
 
 
+  if (oacc_kernels_p)
+    {
+      edge e = loop_preheader_edge (loop);
+
+      for (gsi = gsi_start_phis (loop->header); !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gphi *phi = gsi.phi ();
+	  tree def = PHI_RESULT (phi);
+	  affine_iv iv;
+
+	  if (!virtual_operand_p (def)
+	      && !simple_iv (loop, loop, def, &iv, true))
+	    {
+	      struct reduction_info *red;
+	      red = reduction_phi (reduction_list, phi);
+
+	      /* Look for pattern:
+
+		 <bb preheader>
+		   .omp_data_i = &.omp_data_arr;
+		   addr = .omp_data_i->sum;
+		   sum_a = *addr;
+
+		 <bb header>:
+		   sum_b = PHI <sum_a (preheader), sum_c (latch)>
+
+		 and assign addr to reduc->reduc_addr.  */
+
+	      tree arg = PHI_ARG_DEF_FROM_EDGE (phi, e);
+	      gimple *stmt = SSA_NAME_DEF_STMT (arg);
+	      if (!gimple_assign_single_p (stmt))
+		return false;
+	      tree memref = gimple_assign_rhs1 (stmt);
+	      if (TREE_CODE (memref) != MEM_REF)
+		return false;
+	      tree addr = TREE_OPERAND (memref, 0);
+
+	      gimple *stmt2 = SSA_NAME_DEF_STMT (addr);
+	      if (!gimple_assign_single_p (stmt2))
+		return false;
+	      tree compref = gimple_assign_rhs1 (stmt2);
+	      if (TREE_CODE (compref) != COMPONENT_REF)
+		return false;
+	      tree addr2 = TREE_OPERAND (compref, 0);
+	      if (TREE_CODE (addr2) != MEM_REF)
+		return false;
+	      addr2 = TREE_OPERAND (addr2, 0);
+	      if (TREE_CODE (addr2) != SSA_NAME
+		  || addr2 != get_omp_data_i_param ())
+		return false;
+	      red->reduc_addr = addr;
+	    }
+	}
+    }
+
   return true;
 }
 
+static bool
+ref_conflicts_with_region (gimple_stmt_iterator gsi, ao_ref *ref,
+			   bool ref_is_store, vec<basic_block> region_bbs,
+			   unsigned int i, gimple *skip_stmt)
+{
+  basic_block bb = region_bbs[i];
+  gsi_next (&gsi);
+
+  while (true)
+    {
+      for (; !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  if (stmt == skip_stmt)
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "skipping reduction store: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+	      continue;
+	    }
+
+	  if (!gimple_vdef (stmt)
+	      && !gimple_vuse (stmt))
+	    continue;
+
+	  if (gimple_code (stmt) == GIMPLE_RETURN)
+	    continue;
+
+	  if (ref_is_store)
+	    {
+	      if (ref_maybe_used_by_stmt_p (stmt, ref))
+		{
+		  if (dump_file)
+		    {
+		      fprintf (dump_file, "Stmt ");
+		      print_gimple_stmt (dump_file, stmt, 0, 0);
+		    }
+		  return true;
+		}
+	    }
+	  else
+	    {
+	      if (stmt_may_clobber_ref_p_1 (stmt, ref))
+		{
+		  if (dump_file)
+		    {
+		      fprintf (dump_file, "Stmt ");
+		      print_gimple_stmt (dump_file, stmt, 0, 0);
+		    }
+		  return true;
+		}
+	    }
+	}
+      i++;
+      if (i == region_bbs.length ())
+	break;
+      bb = region_bbs[i];
+      gsi = gsi_start_bb (bb);
+    }
+
+  return false;
+}
+
+static bool
+oacc_entry_exit_ok_1 (bitmap in_loop_bbs, vec<basic_block> region_bbs,
+		      tree omp_data_i,
+		      reduction_info_table_type *reduction_list,
+		      bitmap reduction_stores)
+{
+  unsigned i;
+  basic_block bb;
+  FOR_EACH_VEC_ELT (region_bbs, i, bb)
+    {
+      if (bitmap_bit_p (in_loop_bbs, bb->index))
+	continue;
+
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  gimple *skip_stmt = NULL;
+
+	  if (is_gimple_debug (stmt)
+	      || gimple_code (stmt) == GIMPLE_COND)
+	    continue;
+
+	  ao_ref ref;
+	  bool ref_is_store = false;
+	  if (gimple_assign_load_p (stmt))
+	    {
+	      tree rhs = gimple_assign_rhs1 (stmt);
+	      tree base = get_base_address (rhs);
+	      if (TREE_CODE (base) == MEM_REF
+		  && operand_equal_p (TREE_OPERAND (base, 0), omp_data_i, 0))
+		continue;
+
+	      tree lhs = gimple_assign_lhs (stmt);
+	      if (TREE_CODE (lhs) == SSA_NAME
+		  && has_single_use (lhs))
+		{
+		  use_operand_p use_p;
+		  gimple *use_stmt;
+		  single_imm_use (lhs, &use_p, &use_stmt);
+		  if (gimple_code (use_stmt) == GIMPLE_PHI)
+		    {
+		      struct reduction_info *red;
+		      red = reduction_phi (reduction_list, use_stmt);
+		      tree val = PHI_RESULT (red->keep_res);
+		      if (has_single_use (val))
+			{
+			  single_imm_use (val, &use_p, &use_stmt);
+			  if (gimple_store_p (use_stmt))
+			    {
+			      unsigned int id
+				= SSA_NAME_VERSION (gimple_vdef (use_stmt));
+			      bitmap_set_bit (reduction_stores, id);
+			      skip_stmt = use_stmt;
+			      if (dump_file)
+				{
+				  fprintf (dump_file, "found reduction load: ");
+				  print_gimple_stmt (dump_file, stmt, 0, 0);
+				}
+			    }
+			}
+		    }
+		}
+
+	      ao_ref_init (&ref, rhs);
+	    }
+	  else if (gimple_store_p (stmt))
+	    {
+	      ao_ref_init (&ref, gimple_assign_lhs (stmt));
+	      ref_is_store = true;
+	    }
+	  else if (gimple_code (stmt) == GIMPLE_OMP_RETURN)
+	    continue;
+	  else if (!gimple_has_side_effects (stmt)
+		   && !gimple_could_trap_p (stmt)
+		   && !stmt_could_throw_p (stmt)
+		   && !gimple_vdef (stmt)
+		   && !gimple_vuse (stmt))
+	    continue;
+	  else if (is_gimple_call (stmt)
+		   && gimple_call_internal_p (stmt)
+		   && gimple_call_internal_fn (stmt) == IFN_GOACC_DIM_POS)
+	    continue;
+	  else if (gimple_code (stmt) == GIMPLE_RETURN)
+	    continue;
+	  else
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "Unhandled stmt in entry/exit: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+	      return false;
+	    }
+
+	  if (ref_conflicts_with_region (gsi, &ref, ref_is_store, region_bbs,
+					 i, skip_stmt))
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "conflicts with entry/exit stmt: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+	      return false;
+	    }
+	}
+    }
+
+  return true;
+}
+
+/* Find stores inside REGION_BBS and outside IN_LOOP_BBS, and guard them with
+   gang_pos == 0, except when the stores are REDUCTION_STORES.  Return true
+   if any changes were made.  */
+
+static bool
+oacc_entry_exit_single_gang (bitmap in_loop_bbs, vec<basic_block> region_bbs,
+			     bitmap reduction_stores)
+{
+  tree gang_pos = NULL_TREE;
+  bool changed = false;
+
+  unsigned i;
+  basic_block bb;
+  FOR_EACH_VEC_ELT (region_bbs, i, bb)
+    {
+      if (bitmap_bit_p (in_loop_bbs, bb->index))
+	continue;
+
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);)
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+
+	  if (!gimple_store_p (stmt))
+	    {
+	      /* Update gsi to point to next stmt.  */
+	      gsi_next (&gsi);
+	      continue;
+	    }
+
+	  if (bitmap_bit_p (reduction_stores,
+			    SSA_NAME_VERSION (gimple_vdef (stmt))))
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file,
+			   "skipped reduction store for single-gang"
+			   " neutering: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+
+	      /* Update gsi to point to next stmt.  */
+	      gsi_next (&gsi);
+	      continue;
+	    }
+
+	  changed = true;
+
+	  if (gang_pos == NULL_TREE)
+	    {
+	      tree arg = build_int_cst (integer_type_node, GOMP_DIM_GANG);
+	      gcall *gang_single
+		= gimple_build_call_internal (IFN_GOACC_DIM_POS, 1, arg);
+	      gang_pos = make_ssa_name (integer_type_node);
+	      gimple_call_set_lhs (gang_single, gang_pos);
+	      gimple_stmt_iterator start
+		= gsi_start_bb (single_succ (ENTRY_BLOCK_PTR_FOR_FN (cfun)));
+	      tree vuse = ssa_default_def (cfun, gimple_vop (cfun));
+	      gimple_set_vuse (gang_single, vuse);
+	      gsi_insert_before (&start, gang_single, GSI_SAME_STMT);
+	    }
+
+	  if (dump_file)
+	    {
+	      fprintf (dump_file,
+		       "found store that needs single-gang neutering: ");
+	      print_gimple_stmt (dump_file, stmt, 0, 0);
+	    }
+
+	  {
+	    /* Split block before store.  */
+	    gimple_stmt_iterator gsi2 = gsi;
+	    gsi_prev (&gsi2);
+	    edge e;
+	    if (gsi_end_p (gsi2))
+	      {
+		e = split_block_after_labels (bb);
+		gsi2 = gsi_last_bb (bb);
+	      }
+	    else
+	      e = split_block (bb, gsi_stmt (gsi2));
+	    basic_block bb2 = e->dest;
+
+	    /* Split block after store.  */
+	    gimple_stmt_iterator gsi3 = gsi_start_bb (bb2);
+	    edge e2 = split_block (bb2, gsi_stmt (gsi3));
+	    basic_block bb3 = e2->dest;
+
+	    gimple *cond
+	      = gimple_build_cond (EQ_EXPR, gang_pos, integer_zero_node,
+				   NULL_TREE, NULL_TREE);
+	    gsi_insert_after (&gsi2, cond, GSI_NEW_STMT);
+
+	    edge e3 = make_edge (bb, bb3, EDGE_FALSE_VALUE);
+	    e->flags = EDGE_TRUE_VALUE;
+
+	    tree vdef = gimple_vdef (stmt);
+	    tree vuse = gimple_vuse (stmt);
+
+	    tree phi_res = copy_ssa_name (vdef);
+	    gphi *new_phi = create_phi_node (phi_res, bb3);
+	    replace_uses_by (vdef, phi_res);
+	    add_phi_arg (new_phi, vuse, e3, UNKNOWN_LOCATION);
+	    add_phi_arg (new_phi, vdef, e2, UNKNOWN_LOCATION);
+
+	    /* Update gsi to point to next stmt.  */
+	    bb = bb3;
+	    gsi = gsi_start_bb (bb);
+	  }
+	}
+    }
+
+  return changed;
+}
+
+static bool
+oacc_entry_exit_ok (struct loop *loop,
+		    reduction_info_table_type *reduction_list)
+{
+  basic_block *loop_bbs = get_loop_body_in_dom_order (loop);
+  tree omp_data_i = get_omp_data_i_param ();
+  gcc_assert (omp_data_i != NULL_TREE);
+  vec<basic_block> region_bbs
+    = get_all_dominated_blocks (CDI_DOMINATORS, ENTRY_BLOCK_PTR_FOR_FN (cfun));
+
+  bitmap in_loop_bbs = BITMAP_ALLOC (NULL);
+  bitmap_clear (in_loop_bbs);
+  for (unsigned int i = 0; i < loop->num_nodes; i++)
+    bitmap_set_bit (in_loop_bbs, loop_bbs[i]->index);
+
+  bitmap reduction_stores = BITMAP_ALLOC (NULL);
+  bool res = oacc_entry_exit_ok_1 (in_loop_bbs, region_bbs, omp_data_i,
+				   reduction_list, reduction_stores);
+
+  if (res)
+    {
+      bool changed = oacc_entry_exit_single_gang (in_loop_bbs, region_bbs,
+						  reduction_stores);
+      if (changed)
+	{
+	  free_dominance_info (CDI_DOMINATORS);
+	  calculate_dominance_info (CDI_DOMINATORS);
+	}
+    }
+
+  free (loop_bbs);
+
+  BITMAP_FREE (in_loop_bbs);
+  BITMAP_FREE (reduction_stores);
+
+  return res;
+}
+
 /* Detect parallel loops and generate parallel code using libgomp
    primitives.  Returns true if some loop was parallelized, false
    otherwise.  */
 
 static bool
-parallelize_loops (void)
+parallelize_loops (bool oacc_kernels_p)
 {
   unsigned n_threads = flag_tree_parallelize_loops;
   bool changed = false;
@@ -2642,19 +3118,29 @@ parallelize_loops (void)
   source_location loop_loc;
 
   /* Do not parallelize loops in the functions created by parallelization.  */
-  if (parallelized_function_p (cfun->decl))
+  if (!oacc_kernels_p
+      && parallelized_function_p (cfun->decl))
     return false;
+
+  /* Do not parallelize loops in offloaded functions.  */
+  if (!oacc_kernels_p
+      && get_oacc_fn_attrib (cfun->decl) != NULL)
+     return false;
+
   if (cfun->has_nonlocal_label)
     return false;
 
   gcc_obstack_init (&parloop_obstack);
   reduction_info_table_type reduction_list (10);
 
+  calculate_dominance_info (CDI_DOMINATORS);
+
   FOR_EACH_LOOP (loop, 0)
     {
       if (loop == skip_loop)
 	{
-	  if (dump_file && (dump_flags & TDF_DETAILS))
+	  if (!loop->in_oacc_kernels_region
+	      && dump_file && (dump_flags & TDF_DETAILS))
 	    fprintf (dump_file,
 		     "Skipping loop %d as inner loop of parallelized loop\n",
 		     loop->num);
@@ -2666,6 +3152,22 @@ parallelize_loops (void)
 	skip_loop = NULL;
 
       reduction_list.empty ();
+
+      if (oacc_kernels_p)
+	{
+	  if (!loop->in_oacc_kernels_region)
+	    continue;
+
+	  /* Don't try to parallelize inner loops in an oacc kernels region.  */
+	  if (loop->inner)
+	    skip_loop = loop->inner;
+
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file,
+		     "Trying loop %d with header bb %d in oacc kernels"
+		     " region\n", loop->num, loop->header->index);
+	}
+
       if (dump_file && (dump_flags & TDF_DETAILS))
       {
         fprintf (dump_file, "Trying loop %d as candidate\n",loop->num);
@@ -2707,6 +3209,7 @@ parallelize_loops (void)
       /* FIXME: Bypass this check as graphite doesn't update the
 	 count and frequency correctly now.  */
       if (!flag_loop_parallelize_all
+	  && !oacc_kernels_p
 	  && ((estimated != -1
 	       && estimated <= (HOST_WIDE_INT) n_threads * MIN_PER_THREAD)
 	      /* Do not bother with loops in cold areas.  */
@@ -2716,14 +3219,23 @@ parallelize_loops (void)
       if (!try_get_loop_niter (loop, &niter_desc))
 	continue;
 
-      if (!try_create_reduction_list (loop, &reduction_list))
+      if (!try_create_reduction_list (loop, &reduction_list, oacc_kernels_p))
 	continue;
 
       if (!flag_loop_parallelize_all
 	  && !loop_parallel_p (loop, &parloop_obstack))
 	continue;
 
+      if (oacc_kernels_p
+	  && !oacc_entry_exit_ok (loop, &reduction_list))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "entry/exit not ok: FAILED\n");
+	  continue;
+	}
+
       changed = true;
+      /* Skip inner loop(s) of parallelized loop.  */
       skip_loop = loop->inner;
       if (dump_file && (dump_flags & TDF_DETAILS))
       {
@@ -2736,8 +3248,9 @@ parallelize_loops (void)
 	  fprintf (dump_file, "\nloop at %s:%d: ",
 		   LOCATION_FILE (loop_loc), LOCATION_LINE (loop_loc));
       }
+
       gen_parallel_loop (loop, &reduction_list,
-			 n_threads, &niter_desc);
+			 n_threads, &niter_desc, oacc_kernels_p);
     }
 
   obstack_free (&parloop_obstack, NULL);
@@ -2787,7 +3300,7 @@ pass_parallelize_loops::execute (function *fun)
   if (number_of_loops (fun) <= 1)
     return 0;
 
-  if (parallelize_loops ())
+  if (parallelize_loops (false))
     {
       fun->curr_properties &= ~(PROP_gimple_eomp);
 
@@ -2806,3 +3319,59 @@ make_pass_parallelize_loops (gcc::context *ctxt)
 {
   return new pass_parallelize_loops (ctxt);
 }
+
+namespace {
+
+const pass_data pass_data_parallelize_loops_oacc_kernels =
+{
+  GIMPLE_PASS, /* type */
+  "parloops_oacc_kernels", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_TREE_PARALLELIZE_LOOPS, /* tv_id */
+  ( PROP_cfg | PROP_ssa ), /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_parallelize_loops_oacc_kernels : public gimple_opt_pass
+{
+public:
+  pass_parallelize_loops_oacc_kernels (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_parallelize_loops_oacc_kernels, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *) { return flag_tree_parallelize_loops > 1; }
+  virtual unsigned int execute (function *);
+
+}; // class pass_parallelize_loops_oacc_kernels
+
+unsigned
+pass_parallelize_loops_oacc_kernels::execute (function *fun)
+{
+  loop_optimizer_init (LOOPS_NORMAL
+		       | LOOPS_HAVE_RECORDED_EXITS);
+  rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
+
+  if (number_of_loops (fun) <= 1)
+    return 0;
+
+  if (parallelize_loops (true))
+    {
+      fun->curr_properties &= ~(PROP_gimple_eomp);
+
+      return TODO_update_ssa;
+    }
+
+  return 0;
+}
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_parallelize_loops_oacc_kernels (gcc::context *ctxt)
+{
+  return new pass_parallelize_loops_oacc_kernels (ctxt);
+}
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 395a93a..f5803d0 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -384,6 +384,8 @@ extern gimple_opt_pass *make_pass_slp_vectorize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_complete_unroll (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_complete_unrolli (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_parallelize_loops (gcc::context *ctxt);
+extern gimple_opt_pass *
+  make_pass_parallelize_loops_oacc_kernels (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_loop_prefetch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_iv_optimize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_loop_done (gcc::context *ctxt);

* Re: [PATCH, 7/16] Add pass_dominator_oacc_kernels
  2015-11-11 11:05   ` Richard Biener
@ 2015-11-16 12:04     ` Tom de Vries
  0 siblings, 0 replies; 133+ messages in thread
From: Tom de Vries @ 2015-11-16 12:04 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

On 11/11/15 12:05, Richard Biener wrote:
> On Mon, 9 Nov 2015, Tom de Vries wrote:
>
>> This patch adds pass_dominator_oacc_kernels, to be used in the kernels pass
>> group.  (We may as well call it pass_dominator_no_peel_loop_headers; it
>> doesn't do anything oacc-kernels-specific.)
>>
>> The reason I'm adding a new pass instead of using pass_dominator is that
>> pass_dominator uses first_pass_instance. So adding a pass_dominator instance A
>> before a pass_dominator instance B has the unexpected consequence that it may
>> change the behaviour of instance B. I've filed PR68247 - "Remove
>> pass_first_instance" to note this issue.
>
> This looks ok (minus my comments to patch #10)
>

AFAIU, if "Remove first_pass_instance from pass_dominator" gets approved 
and committed, we can drop this patch and use this pass instantiation 
in the oacc_kernels pass group instead:
...
   NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
...

Thanks,
- Tom

* Re: [PATCH, 5/16] Add in_oacc_kernels_region in struct loop
  2015-11-16 11:39     ` Tom de Vries
@ 2015-11-16 12:41       ` Richard Biener
  0 siblings, 0 replies; 133+ messages in thread
From: Richard Biener @ 2015-11-16 12:41 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Mon, 16 Nov 2015, Tom de Vries wrote:

> On 11/11/15 11:55, Richard Biener wrote:
> > On Mon, 9 Nov 2015, Tom de Vries wrote:
> > 
> > > This patch adds and initializes the field in_oacc_kernels_region in
> > > struct loop.
> > > 
> > > The field is used to signal to subsequent passes that we're dealing with a
> > > loop in a kernels region that we're trying to parallelize.
> > > 
> > > Note that we do not parallelize kernels regions with more than one loop
> > > nest.
> > > [ In general, kernels regions with more than one loop nest should be split
> > > up into separate kernels regions, but that's not supported at the moment. ]
> > 
> > I think mark_loops_in_oacc_kernels_region can be greatly simplified.
> > 
> > Both region entry and exit should have the same ->loop_father (a SESE
> > region).  Then you can just walk that loop's inner loops (and their
> > siblings), checking the domination relation of their headers with the
> > region entry and exit (only necessary for direct inner loops).
> 
> Updated patch to use the loops structure.  Atm I'm also skipping loops
> containing sibling loops, since I have no test-cases for that yet.

Looks ok to me now.  You want to update copy_loop_info btw.

Richard.

> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-16 11:55     ` Tom de Vries
@ 2015-11-16 12:45       ` Richard Biener
  2015-11-16 23:21         ` Tom de Vries
  2015-11-19 10:31         ` Tom de Vries
  0 siblings, 2 replies; 133+ messages in thread
From: Richard Biener @ 2015-11-16 12:45 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Mon, 16 Nov 2015, Tom de Vries wrote:

> On 11/11/15 12:02, Richard Biener wrote:
> > On Mon, 9 Nov 2015, Tom de Vries wrote:
> > 
> > > This patch adds the pass_oacc_kernels pass group to the pass list in
> > > passes.def.
> > > 
> > > Note the repetition of pass_lim/pass_copy_prop. The first pair is for an
> > > inner loop in a loop nest, the second for an outer loop in a loop nest.
> > 
> > @@ -86,6 +86,27 @@ along with GCC; see the file COPYING3.  If not see
> >            /* pass_build_ealias is a dummy pass that ensures that we
> >               execute TODO_rebuild_alias at this point.  */
> >            NEXT_PASS (pass_build_ealias);
> > +         /* Pass group that runs when there are oacc kernels in the
> > +            function.  */
> > +         NEXT_PASS (pass_oacc_kernels);
> > +         PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
> > +             NEXT_PASS (pass_dominator_oacc_kernels);
> > +             NEXT_PASS (pass_ch_oacc_kernels);
> > +             NEXT_PASS (pass_dominator_oacc_kernels);
> > +             NEXT_PASS (pass_tree_loop_init);
> > +             NEXT_PASS (pass_lim);
> > +             NEXT_PASS (pass_copy_prop);
> > +             NEXT_PASS (pass_lim);
> > +             NEXT_PASS (pass_copy_prop);
> > 
> > iterate lim/copyprop twice?!  Why's that needed?
> > 
> 
> I've managed to eliminate the last pass_copy_prop, but not pass_lim. I've
> added a comment:
> ...
>   /* We use pass_lim to rewrite in-memory iteration and reduction
>      variable accesses in loops into local variable accesses.
>      However, a single pass instantiation manages to do this only for
>      one loop level, so we use pass_lim twice to at least be able to
>      handle a loop nest with a depth of two.  */
>   NEXT_PASS (pass_lim);
>   NEXT_PASS (pass_copy_prop);
>   NEXT_PASS (pass_lim);
> ...

Huh.  Testcase?  LIM is perfectly able to handle nests.

> > +             NEXT_PASS (pass_scev_cprop);
> > 
> > What's that for?  It's supposed to help removing loops - I don't
> > expect kernels to vanish.
> 
> I'm using pass_scev_cprop for the "final value replacement" functionality.
> Added comment.

That functionality is intended to enable loop removal.

> > 
> > +             NEXT_PASS (pass_tree_loop_done);
> > +             NEXT_PASS (pass_dominator_oacc_kernels);
> > 
> > Three times DOM?  No please.  I wonder why you don't run oacc_kernels
> > after FRE and drop the initial DOM(s).
> > 
> 
> Done. There's just one pass_dominator_oacc_kernels left now.
> 
> > +             NEXT_PASS (pass_dce);
> > +             NEXT_PASS (pass_tree_loop_init);
> > +             NEXT_PASS (pass_parallelize_loops_oacc_kernels);
> > +             NEXT_PASS (pass_expand_omp_ssa);
> > +             NEXT_PASS (pass_tree_loop_done);
> > 
> > The switches into/out of tree_loop also look odd to me, but well
> > (they'll be controlled by -ftree-loop-optimize).
> > 
> 
> I've eliminated all the uses for pass_tree_loop_init/pass_tree_loop_done in
> the pass group. Instead, I've added conditional loop optimizer setup in:
> -  pass_lim and pass_scev_cprop (added in this patch), and
> - pass_parallelize_loops_oacc_kernels (added in patch "Add
>   pass_parallelize_loops_oacc_kernels").

You're missing a call to scev_finalize ().

Much better otherwise.  I still wonder about scev_cprop and LIM two
times.

Richard.

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-16 12:45       ` Richard Biener
@ 2015-11-16 23:21         ` Tom de Vries
  2015-11-17 10:05           ` Richard Biener
  2015-11-19 10:31         ` Tom de Vries
  1 sibling, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-16 23:21 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

On 16/11/15 13:45, Richard Biener wrote:
>>> +             NEXT_PASS (pass_scev_cprop);
>>>
>>> What's that for?  It's supposed to help removing loops - I don't
>>> expect kernels to vanish.
>>
>> I'm using pass_scev_cprop for the "final value replacement" functionality.
>> Added comment.

> That functionality is intended to enable loop removal.

Let me try to explain in a bit more detail.


I.

Consider a parloops testcase test.c, with a use of the final value of 
the iteration variable (return i):
...
unsigned int
foo (int n, int *a)
{
   int i;
   for (i = 0; i < n; ++i)
     a[i] = 1;

   return i;
}
...

Say we compile with:
...
$ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details
...

We can see here in the parloops dump-file that the loop was parallelized:
...
   SUCCESS: may be parallelized
...

Now say that we run with -fno-tree-scev-cprop in addition. Instead we 
find in the parloops dump-file:
...
phi is i_1 = PHI <i_10(4)>
arg of phi to exit:   value i_10 used outside loop
   checking if it a part of reduction pattern:
   FAILED: it is not a part of reduction.
...

Auto-parallelization fails in this case because there is a loop exit phi 
(the one in bb 6 defining i_1) which is not part of a reduction:
...
   <bb 4>:
   # i_13 = PHI <0(3), i_10(5)>
   _5 = (long unsigned int) i_13;
   _6 = _5 * 4;
   _8 = a_7(D) + _6;
   *_8 = 1;
   i_10 = i_13 + 1;
   if (n_4(D) > i_10)
     goto <bb 5>;
   else
     goto <bb 6>;

   <bb 5>:
   goto <bb 4>;

   <bb 6>:
   # i_1 = PHI <i_10(4)>
   _20 = (unsigned int) i_1;
...

With -ftree-scev-cprop, we find in the pass_scev_cprop dump-file:
...
final value replacement:
   i_1 = PHI <i_10(4)>
   with
   i_1 = n_4(D);
...

And the resulting loop no longer has any loop exit phis, so 
auto-parallelization succeeds:
...
   <bb 4>:
   # i_13 = PHI <0(3), i_10(5)>
   _5 = (long unsigned int) i_13;
   _6 = _5 * 4;
   _8 = a_7(D) + _6;
   *_8 = 1;
   i_10 = i_13 + 1;
   if (n_4(D) > i_10)
     goto <bb 5>;
   else
     goto <bb 6>;

   <bb 5>:
   goto <bb 4>;

   <bb 6>:
   _20 = (unsigned int) n_4(D);
...

[ I've filed PR68373 - "autopar fails on loop exit phi with argument 
defined outside loop", for a slightly different testcase where despite 
the final value replacement autopar still fails. ]
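
As a source-level illustration of what final value replacement does 
here, a hand-written sketch follows. The names foo_orig and foo_fvr are 
hypothetical, and the rewrite is done by hand rather than taken from any 
dump. The value returned after the loop is recomputed from n, so nothing 
defined inside the loop is used after it. The 'n > 0' guard is an 
assumption that stands in for the loop-entry check that the real 
transformation can rely on.

```c
/* Original form: the final value of i is used after the loop.  */
unsigned int
foo_orig (int n, int *a)
{
  int i;
  for (i = 0; i < n; ++i)
    a[i] = 1;

  return i;
}

/* Hand-applied final value replacement (illustrative only): the
   returned value is expressed in terms of the loop bound, so the
   loop has no value that is live at its exit.  */
unsigned int
foo_fvr (int n, int *a)
{
  int i;
  for (i = 0; i < n; ++i)
    a[i] = 1;

  /* If the loop ran at all, i ended up equal to n; otherwise it is 0.  */
  return n > 0 ? (unsigned int) n : 0u;
}
```

Both functions return the same value for any n; only the second one 
leaves the loop without a use of the iteration variable after it.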


II.

Now, back to oacc kernels.

Consider test-case kernels-loop-n.f95 (will add this one to the test-cases):
...
module test
contains
   subroutine foo(n)
     implicit none
     integer :: n
     integer, dimension (0:n-1) :: a, b, c
     integer                    :: i, ii
     do i = 0, n - 1
        a(i) = i * 2
     end do

     do i = 0, n -1
        b(i) = i * 4
     end do

     !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
     do ii = 0, n - 1
        c(ii) = a(ii) + b(ii)
     end do
     !$acc end kernels

     do i = 0, n - 1
        if (c(i) .ne. a(i) + b(i)) call abort
     end do

   end subroutine foo
end module test
...

The loop at the start of the kernels pass group contains an in-memory 
iteration variable, which is written back in every iteration by the 
store '*_9 = _38'.
...
   <bb 4>:
   _13 = *.omp_data_i_4(D).c;
   c.21_14 = *_13;
   _16 = *_9;
   _17 = (integer(kind=8)) _16;
   _18 = *.omp_data_i_4(D).a;
   a.22_19 = *_18;
   _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
   _24 = *.omp_data_i_4(D).b;
   b.23_25 = *_24;
   _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
   _30 = _23 + _29;
   MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
   _38 = _16 + 1;
   *_9 = _38;
   if (_8 == _16)
     goto <bb 3>;
   else
     goto <bb 4>;
...

After pass_lim/pass_copy_prop, we've rewritten that into using a local 
iteration variable, but we've generated a read of the final value of the 
iteration variable outside the loop, which means auto-parallelization 
will fail:
...
   <bb 5>:
   # D__lsm.29_12 = PHI <D__lsm.29_15(4), _38(7)>
   _17 = (integer(kind=8)) D__lsm.29_12;
   _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
   _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
   _30 = _23 + _29;
   MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
   _38 = D__lsm.29_12 + 1;
   if (_8 == D__lsm.29_12)
     goto <bb 6>;
   else
     goto <bb 7>;

   <bb 6>:
   # D__lsm.29_27 = PHI <_38(5)>
   *_9 = D__lsm.29_27;
   goto <bb 3>;

   <bb 7>:
   goto <bb 5>;
...

This makes it similar to the parloops example above, and that's why I've 
added pass_scev_cprop in the kernels pass group.

[ And for some kernels test-cases with constant loop bound, it's not the 
final value replacement bit that does the substitution, but the first 
bit in scev_const_prop using resolve_mixers. So that's a related reason 
to use pass_scev_cprop. ]
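
To make the effect of the pass_lim rewrite concrete at source level, 
here is a rough C model of the two loop shapes. It is hand-written, the 
function names are hypothetical, and it assumes that ii does not alias 
a, b or c. The first function keeps the iteration variable in memory; 
the second uses a local counter and a single store after the loop, and 
that store is what reads the final value of the counter.

```c
/* Before: the iteration variable lives in memory, so every iteration
   loads it and stores it back.  */
void
add_in_memory (int n, int *ii, const int *a, const int *b, int *c)
{
  for (*ii = 0; *ii < n; ++*ii)
    c[*ii] = a[*ii] + b[*ii];
}

/* After store motion (hand-written model): the counter is a local
   variable and memory is updated once, after the loop.  */
void
add_store_motion (int n, int *ii, const int *a, const int *b, int *c)
{
  int lsm;
  for (lsm = 0; lsm < n; ++lsm)
    c[lsm] = a[lsm] + b[lsm];

  /* This store uses the final value of the counter outside the loop.  */
  *ii = lsm;
}
```

Under the no-aliasing assumption both functions leave the same values in 
c and in *ii.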

Thanks,
- Tom

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-16 23:21         ` Tom de Vries
@ 2015-11-17 10:05           ` Richard Biener
  2015-11-17 14:54             ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-17 10:05 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

On Tue, Nov 17, 2015 at 12:20 AM, Tom de Vries <Tom_deVries@mentor.com> wrote:
> On 16/11/15 13:45, Richard Biener wrote:
>>>>
>>>> +             NEXT_PASS (pass_scev_cprop);
>>>>
>>>> What's that for?  It's supposed to help removing loops - I don't
>>>> expect kernels to vanish.
>>>
>>> I'm using pass_scev_cprop for the "final value replacement"
>>> functionality.
>>> Added comment.
>
>
>> That functionality is intended to enable loop removal.
>
>
> Let me try to explain in a bit more detail.
>
>
> After pass_lim/pass_copy_prop, we've rewritten that into using a local
> iteration variable, but we've generated a read of the final value of the
> iteration variable outside the loop, which means auto-parallelization will
> fail:
> ...
>   <bb 5>:
>   # D__lsm.29_12 = PHI <D__lsm.29_15(4), _38(7)>
>   _17 = (integer(kind=8)) D__lsm.29_12;
>   _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
>   _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
>   _30 = _23 + _29;
>   MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
>   _38 = D__lsm.29_12 + 1;
>   if (_8 == D__lsm.29_12)
>     goto <bb 6>;
>   else
>     goto <bb 7>;
>
>   <bb 6>:
>   # D__lsm.29_27 = PHI <_38(5)>
>   *_9 = D__lsm.29_27;
>   goto <bb 3>;

So this store is not actually necessary?  Or just in an inconvenient place?

>
>   <bb 7>:
>   goto <bb 5>;
> ...
>
> This makes it similar to the parloops example above, and that's why I've
> added pass_scev_cprop in the kernels pass group.
>
> [ And for some kernels test-cases with constant loop bound, it's not the
> final value replacement bit that does the substitution, but the first bit in
> scev_const_prop using resolve_mixers. So that's a related reason to use
> pass_scev_cprop. ]

IMHO autopar needs to handle induction itself.  And the above LIM example
does not show why you need two LIM passes...

Richard.

> Thanks,
> - Tom

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-17 10:05           ` Richard Biener
@ 2015-11-17 14:54             ` Tom de Vries
  2015-11-17 15:18               ` Richard Biener
  2015-11-19  0:35               ` [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def Tom de Vries
  0 siblings, 2 replies; 133+ messages in thread
From: Tom de Vries @ 2015-11-17 14:54 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

On 17/11/15 11:05, Richard Biener wrote:
> On Tue, Nov 17, 2015 at 12:20 AM, Tom de Vries <Tom_deVries@mentor.com> wrote:
>> On 16/11/15 13:45, Richard Biener wrote:
>>>>>
>>>>> +             NEXT_PASS (pass_scev_cprop);
>>>>>
>>>>> What's that for?  It's supposed to help removing loops - I don't
>>>>> expect kernels to vanish.
>>>>
>>>> I'm using pass_scev_cprop for the "final value replacement"
>>>> functionality.
>>>> Added comment.
>>
>>
>>> That functionality is intended to enable loop removal.
>>
>>
>> Let me try to explain in a bit more detail.
>>
>>
>> After pass_lim/pass_copy_prop, we've rewritten that into using a local
>> iteration variable, but we've generated a read of the final value of the
>> iteration variable outside the loop, which means auto-parallelization will
>> fail:
>> ...
>>    <bb 5>:
>>    # D__lsm.29_12 = PHI <D__lsm.29_15(4), _38(7)>
>>    _17 = (integer(kind=8)) D__lsm.29_12;
>>    _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
>>    _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
>>    _30 = _23 + _29;
>>    MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
>>    _38 = D__lsm.29_12 + 1;
>>    if (_8 == D__lsm.29_12)
>>      goto <bb 6>;
>>    else
>>      goto <bb 7>;
>>
>>    <bb 6>:
>>    # D__lsm.29_27 = PHI <_38(5)>
>>    *_9 = D__lsm.29_27;
>>    goto <bb 3>;
>
> So this store is not actually necessary?

a.
In the case of this example, the store is dead.

There is a corresponding load at the point that we split off the region:
...
   <bb 9>:
   #pragma omp return

   <bb 10>:
   D.3635 = .omp_data_arr.25.ii;
   ii = *D.3635;
...

This load is later removed, given that ii is unused after the region. 
But once the region is split off, there's nothing in the context of the 
store to suggest that it's dead.

And to get rid of the load of ii before the region is split off, we 
would have to implement some sort of liveness analysis on pre-ssa code.

b.
There's the case where there is an explicit use of ii after the region, 
in which case the store is not dead.

c.
And there's the case where we use a data clause on the region, e.g. 
'create (ii)', to indicate that the variable is neither copied in nor 
copied out of the region (the default for a scalar in a kernels region 
is 'copy', meaning copy-in-and-out).

[ This means the value of ii after the region is uninitialized. So even 
if there's a read from ii after the region, we cannot consider it 
connected to the store, given that the value written by the store on the 
accelerator will not be copied back to the host. ]

In this case, we already don't have any load of ii after the region:
...
   <bb 9>:
   #pragma omp return

   <bb 10>:
   .omp_data_sizes.28 = {CLOBBER};
   .omp_data_arr.27 = {CLOBBER};
...

We could insert clobbers for the bits of .omp_data_arr at the end of the 
region to indicate that those are not used. That might enable dse to get 
rid of the dead store.


But, I think we want a generic solution that handles cases a, b and c, 
which means we have to solve the most difficult case, which is b, where 
the store is not dead.

>  Or just in an inconvenient place?

I don't think the place of the store is inconvenient, it would be worse 
to have the store in the loop.

What is inconvenient about the store is the fact that it reads the final 
value of the iteration variable (which inhibits parloops).
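
A rough C analogue of the two GIMPLE forms above may make this more 
concrete. It is only a sketch: the function names are made up for 
illustration, 'ii' stands in for the pointer '_9', and it assumes that 
ii does not alias a, b or c.

```c
#include <assert.h>

/* Hypothetical analogue of the loop before pass_lim: the iteration
   variable lives in memory and is loaded and stored on every iteration
   (compare '_16 = *_9' and '*_9 = _38').  */
static void
add_in_memory (int *ii, int n, const int *a, const int *b, int *c)
{
  for (*ii = 0; *ii < n; ++*ii)
    c[*ii] = a[*ii] + b[*ii];
}

/* Hypothetical analogue of the loop after pass_lim/pass_copy_prop: a
   local iteration variable, plus a single store after the loop.  That
   store reads the final value of i, which corresponds to the exit phi
   feeding '*_9 = D__lsm.29_27' in bb 6.  */
static void
add_store_sunk (int *ii, int n, const int *a, const int *b, int *c)
{
  int i;
  for (i = 0; i < n; ++i)
    c[i] = a[i] + b[i];
  *ii = i;
}

/* Both forms leave the same values in c and in *ii.  */
static void
check_store_motion (void)
{
  int a[4] = { 0, 2, 4, 6 }, b[4] = { 0, 4, 8, 12 };
  int c1[4], c2[4], ii1 = -1, ii2 = -1;

  add_in_memory (&ii1, 4, a, b, c1);
  add_store_sunk (&ii2, 4, a, b, c2);

  assert (ii1 == ii2);
  for (int k = 0; k < 4; ++k)
    assert (c1[k] == c2[k]);
}
```

The loop body of the second form no longer touches memory for the 
iteration variable; the only remaining obstacle is the use of i after 
the loop.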

>>    <bb 7>:
>>    goto <bb 5>;
>> ...
>>
>> This makes it similar to the parloops example above, and that's why I've
>> added pass_scev_cprop in the kernels pass group.
>>
>> [ And for some kernels test-cases with constant loop bound, it's not the
>> final value replacement bit that does the substitution, but the first bit in
>> scev_const_prop using resolve_mixers. So that's a related reason to use
>> pass_scev_cprop. ]
>
> IMHO autopar needs to handle induction itself.

I'm not sure what you mean. Could you elaborate?  Autopar handles 
induction variables, but it doesn't handle exit phis reading the final 
value of the induction variable. Is that what you want fixed? How?

> And the above LIM example
> is none for why you need two LIM passes...

Indeed. I'm planning a separate reply to explain in more detail the need 
for the two pass_lims.

Thanks,
- Tom


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-17 14:54             ` Tom de Vries
@ 2015-11-17 15:18               ` Richard Biener
  2015-11-17 15:39                 ` Tom de Vries
  2015-11-19  0:35               ` [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def Tom de Vries
  1 sibling, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-17 15:18 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

On Tue, 17 Nov 2015, Tom de Vries wrote:

> On 17/11/15 11:05, Richard Biener wrote:
> > On Tue, Nov 17, 2015 at 12:20 AM, Tom de Vries <Tom_deVries@mentor.com>
> > wrote:
> > > On 16/11/15 13:45, Richard Biener wrote:
> > > > > > 
> > > > > > +             NEXT_PASS (pass_scev_cprop);
> > > > > > > > 
> > > > > > > > What's that for?  It's supposed to help removing loops - I don't
> > > > > > > > expect kernels to vanish.
> > > > > 
> > > > > > 
> > > > > > I'm using pass_scev_cprop for the "final value replacement"
> > > > > > functionality.
> > > > > > Added comment.
> > > 
> > > 
> > > > That functionality is intended to enable loop removal.
> > > 
> > > 
> > > Let me try to explain in a bit more detail.
> > > 
> > > 
> > > I.
> > > 
> > > Consider a parloops testcase test.c, with a use of the final value of the
> > > iteration variable (return i):
> > > ...
> > > unsigned int
> > > foo (int n, int *a)
> > > {
> > >    int i;
> > >    for (i = 0; i < n; ++i)
> > >      a[i] = 1;
> > > 
> > >    return i;
> > > }
> > > ...
> > > 
> > > Say we compile with:
> > > ...
> > > $ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details
> > > ...
> > > 
> > > We can see here in the parloops dump-file that the loop was parallelized:
> > > ...
> > >    SUCCESS: may be parallelized
> > > ...
> > > 
> > > Now say that we run with -fno-tree-scev-cprop in addition. Instead we find
> > > in the parloops dump-file:
> > > ...
> > > phi is i_1 = PHI <i_10(4)>
> > > arg of phi to exit:   value i_10 used outside loop
> > >    checking if it a part of reduction pattern:
> > >    FAILED: it is not a part of reduction.
> > > ...
> > > 
> > > Auto-parallelization fails in this case because there is a loop exit phi
> > > (the one in bb 6 defining i_1) which is not part of a reduction:
> > > ...
> > >    <bb 4>:
> > >    # i_13 = PHI <0(3), i_10(5)>
> > >    _5 = (long unsigned int) i_13;
> > >    _6 = _5 * 4;
> > >    _8 = a_7(D) + _6;
> > >    *_8 = 1;
> > >    i_10 = i_13 + 1;
> > >    if (n_4(D) > i_10)
> > >      goto <bb 5>;
> > >    else
> > >      goto <bb 6>;
> > > 
> > >    <bb 5>:
> > >    goto <bb 4>;
> > > 
> > >    <bb 6>:
> > >    # i_1 = PHI <i_10(4)>
> > >    _20 = (unsigned int) i_1;
> > > ...
> > > 
> > > With -ftree-scev-cprop, we find in the pass_scev_cprop dump-file:
> > > ...
> > > final value replacement:
> > >    i_1 = PHI <i_10(4)>
> > >    with
> > >    i_1 = n_4(D);
> > > ...
> > > 
> > > And the resulting loop no longer has any loop exit phis, so
> > > auto-parallelization succeeds:
> > > ...
> > >    <bb 4>:
> > >    # i_13 = PHI <0(3), i_10(5)>
> > >    _5 = (long unsigned int) i_13;
> > >    _6 = _5 * 4;
> > >    _8 = a_7(D) + _6;
> > >    *_8 = 1;
> > >    i_10 = i_13 + 1;
> > >    if (n_4(D) > i_10)
> > >      goto <bb 5>;
> > >    else
> > >      goto <bb 6>;
> > > 
> > >    <bb 5>:
> > >    goto <bb 4>;
> > > 
> > >    <bb 6>:
> > >    _20 = (unsigned int) n_4(D);
> > > ...
> > > 
> > > [ I've filed PR68373 - "autopar fails on loop exit phi with argument
> > > defined outside loop", for a slightly different testcase where despite
> > > the final value replacement autopar still fails. ]
> > > 
> > > 
> > > II.
> > > 
> > > Now, back to oacc kernels.
> > > 
> > > Consider test-case kernels-loop-n.f95 (will add this one to the
> > > test-cases):
> > > ...
> > > module test
> > > contains
> > >    subroutine foo(n)
> > >      implicit none
> > >      integer :: n
> > >      integer, dimension (0:n-1) :: a, b, c
> > >      integer                    :: i, ii
> > >      do i = 0, n - 1
> > >         a(i) = i * 2
> > >      end do
> > > 
> > >      do i = 0, n -1
> > >         b(i) = i * 4
> > >      end do
> > > 
> > >      !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
> > >      do ii = 0, n - 1
> > >         c(ii) = a(ii) + b(ii)
> > >      end do
> > >      !$acc end kernels
> > > 
> > >      do i = 0, n - 1
> > >         if (c(i) .ne. a(i) + b(i)) call abort
> > >      end do
> > > 
> > >    end subroutine foo
> > > end module test
> > > ...
> > > 
> > > The loop at the start of the kernels pass group contains an in-memory
> > > iteration variable, with a store to '*_9 = _38'.
> > > ...
> > >    <bb 4>:
> > >    _13 = *.omp_data_i_4(D).c;
> > >    c.21_14 = *_13;
> > >    _16 = *_9;
> > >    _17 = (integer(kind=8)) _16;
> > >    _18 = *.omp_data_i_4(D).a;
> > >    a.22_19 = *_18;
> > >    _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
> > >    _24 = *.omp_data_i_4(D).b;
> > >    b.23_25 = *_24;
> > >    _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
> > >    _30 = _23 + _29;
> > >    MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
> > >    _38 = _16 + 1;
> > >    *_9 = _38;
> > >    if (_8 == _16)
> > >      goto <bb 3>;
> > >    else
> > >      goto <bb 4>;
> > > ...
> > > 
> > > After pass_lim/pass_copy_prop, we've rewritten that into using a local
> > > iteration variable, but we've generated a read of the final value of the
> > > iteration variable outside the loop, which means auto-parallelization will
> > > fail:
> > > ...
> > >    <bb 5>:
> > >    # D__lsm.29_12 = PHI <D__lsm.29_15(4), _38(7)>
> > >    _17 = (integer(kind=8)) D__lsm.29_12;
> > >    _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
> > >    _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
> > >    _30 = _23 + _29;
> > >    MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
> > >    _38 = D__lsm.29_12 + 1;
> > >    if (_8 == D__lsm.29_12)
> > >      goto <bb 6>;
> > >    else
> > >      goto <bb 7>;
> > > 
> > >    <bb 6>:
> > >    # D__lsm.29_27 = PHI <_38(5)>
> > >    *_9 = D__lsm.29_27;
> > >    goto <bb 3>;
> > 
> > So this store is not actually necessary?
> 
> a.
> In the case of this example, the store is dead.
> 
> There is a corresponding load at the point that we split off the region:
> ...
>   <bb 9>:
>   #pragma omp return
> 
>   <bb 10>:
>   D.3635 = .omp_data_arr.25.ii;
>   ii = *D.3635;
> ...
> 
> This load is later removed, given that ii is unused after the region. But once
> the region is split off,  there's nothing in the context of the store to
> suggest that it's dead.
> 
> And to get rid of the load of ii before the region is split off, we would have
> to implement some sort of liveness analysis on pre-ssa code.
> 
> b.
> There's the case where there is an explicit use of ii after the region, in
> which case the store is not dead.
> 
> c.
> And there's the case where we use a data clause on the region, e.g. 'create
> (ii)', to indicate that the variable is neither copied in nor copied out of
> the region (the default for a scalar in a kernels region is 'copy', meaning
> copy-in-and-out).
> 
> [ This means the value of ii after the region is uninitialized. So even if
> there's a read from ii after the region, we cannot consider it connected to
> the store, given that the value written by the store on the accelerator will
> not be copied back to the host. ]
> 
> In this case, we already don't have any load of ii after the region:
> ...
>   <bb 9>:
>   #pragma omp return
> 
>   <bb 10>:
>   .omp_data_sizes.28 = {CLOBBER};
>   .omp_data_arr.27 = {CLOBBER};
> ...
> 
> We could insert clobbers for the bits of .omp_data_arr at the end of the
> region to indicate that those are not used. That might enable dse to get rid
> of the dead store.
> 
> 
> But, I think we want a generic solution that handles cases a, b and c, which
> means we have to solve the most difficult case, which is b, where the store is
> not dead.
> 
> >  Or just in an inconvenient place?
> 
> I don't think the place of the store is inconvenient, it would be worse to
> have the store in the loop.
> 
> What is inconvenient about the store is the fact that it reads the final value
> of the iteration variable (which inhibits parloops).
> 
> > >    <bb 7>:
> > >    goto <bb 5>;
> > > ...
> > > 
> > > This makes it similar to the parloops example above, and that's why I've
> > > added pass_scev_cprop in the kernels pass group.
> > > 
> > > [ And for some kernels test-cases with constant loop bound, it's not the
> > > final value replacement bit that does the substitution, but the first bit
> > > in scev_const_prop using resolve_mixers. So that's a related reason to use
> > > pass_scev_cprop. ]
> > 
> > IMHO autopar needs to handle induction itself.
> 
> I'm not sure what you mean. Could you elaborate?  Autopar handles induction
> variables, but it doesn't handle exit phis reading the final value of the
> induction variable. Is that what you want fixed? How?

Yes.  Perform final value replacement.

> > And the above LIM example
> > is none for why you need two LIM passes...
> 
> Indeed. I'm planning a separate reply to explain in more detail the need for
> the two pass_lims.

Thanks.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-17 15:18               ` Richard Biener
@ 2015-11-17 15:39                 ` Tom de Vries
  2015-11-17 22:21                   ` [PATCH, PR68373 ] Call scev_const_prop in pass_parallelize_loops::execute Tom de Vries
  2015-11-18  8:30                   ` [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def Richard Biener
  0 siblings, 2 replies; 133+ messages in thread
From: Tom de Vries @ 2015-11-17 15:39 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

On 17/11/15 16:18, Richard Biener wrote:
>>> IMHO autopar needs to handle induction itself.
>>
>> I'm not sure what you mean. Could you elaborate?  Autopar handles induction
>> variables, but it doesn't handle exit phis reading the final value of the
>> induction variable. Is that what you want fixed? How?
> Yes.  Perform final value replacement.
>

I see. Calling scev_const_prop in pass_parallelize_loops_oacc_kernels 
seems to work fine.

Doing the same for pass_parallelize_loops like this:
...
diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
index 17415a8..d944395 100644
--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c
@@ -2787,6 +2787,9 @@ pass_parallelize_loops::execute (function *fun)
    if (number_of_loops (fun) <= 1)
      return 0;

+  unsigned int sccp_todo = scev_const_prop ();
+  gcc_assert (sccp_todo == 0);
+
    if (parallelize_loops ())
      {
        fun->curr_properties &= ~(PROP_gimple_eomp);
...
seems to fix PR 68373 - "autopar fails on loop exit phi with argument 
defined outside loop".

The new scev_const_prop call in autopar rewrites this phi into an 
assignment, and that allows parloops to succeed:
...
final value replacement:
   n_2 = PHI <n_4(D)(4)>
   with
   n_2 = n_4(D);
...

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 133+ messages in thread

* [PATCH, PR68373 ] Call scev_const_prop in pass_parallelize_loops::execute
  2015-11-17 15:39                 ` Tom de Vries
@ 2015-11-17 22:21                   ` Tom de Vries
  2015-11-19  9:36                     ` Tom de Vries
  2015-11-18  8:30                   ` [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def Richard Biener
  1 sibling, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-17 22:21 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 1562 bytes --]

[ was: Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def ]

Hi,

Consider test-case test.c, with a use of the final value of the 
iteration variable (return i):
...
unsigned int
foo (int *a, unsigned int n)
{
   unsigned int i;
   for (i = 0; i < n; ++i)
     a[i] = 1;

   return i;
}
...

Compiled with:
...
$ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details
...

Before parloops, we have:
...
  <bb 4>:
   # i_12 = PHI <0(3), i_10(5)>
   _5 = (long unsigned int) i_12;
   _6 = _5 * 4;
   _8 = a_7(D) + _6;
   *_8 = 1;
   i_10 = i_12 + 1;
   if (n_4(D) > i_10)
     goto <bb 5>;
   else
     goto <bb 6>;

   <bb 5>:
   goto <bb 4>;

   <bb 6>:
   # i_14 = PHI <n_4(D)(4), 0(2)>
...

Parloops will fail because:
...
phi is n_2 = PHI <n_4(D)(4)>
arg of phi to exit:   value n_4(D) used outside loop
   checking if it a part of reduction pattern:
   FAILED: it is not a part of reduction....
...
[ note that the phi looks slightly different. In 
gather_scalar_reductions -> vect_analyze_loop_form -> 
vect_analyze_loop_form_1 -> split_loop_exit_edge we split the edge from 
bb4 to bb6. ]

This patch uses scev_const_prop at the start of parloops. 
scev_const_prop also first splits the exit edge, and then replaces the 
phi with an assignment:
...
  final value replacement:
   n_2 = PHI <n_4(D)(4)>
   with
   n_2 = n_4(D);
...

This allows parloops to succeed.
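
As a source-level sketch of what the final value replacement amounts to 
for this test-case (the function names are hypothetical, and this is 
only an analogy to the GIMPLE rewrite, not what the pass emits): since i 
and n are both unsigned, the loop exits with i == n, including when n is 
0, so the use of i after the loop can be expressed in terms of n.

```c
#include <assert.h>

/* The test-case as written: the return reads the final value of i.  */
static unsigned int
foo_final_use (int *a, unsigned int n)
{
  unsigned int i;
  for (i = 0; i < n; ++i)
    a[i] = 1;

  return i;
}

/* Hand-written analogue of the result of final value replacement: the
   value used after the loop is the loop bound n, so nothing computed in
   the loop is live after it.  */
static unsigned int
foo_replaced (int *a, unsigned int n)
{
  unsigned int i;
  for (i = 0; i < n; ++i)
    a[i] = 1;

  return n;
}

/* The two forms return the same value for every trip count tried.  */
static void
check_final_value (void)
{
  int a1[8], a2[8];
  for (unsigned int n = 0; n <= 8; ++n)
    assert (foo_final_use (a1, n) == foo_replaced (a2, n));
}
```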

And there's a similar story when we compile with -fno-tree-scev-cprop in 
addition.

Bootstrapped and reg-tested on x86_64.

OK for stage3/stage1?

Thanks,
- Tom


[-- Attachment #2: 0005-Call-scev_const_prop-in-pass_parallelize_loops-execute.patch --]
[-- Type: text/x-patch, Size: 1372 bytes --]

Call scev_const_prop in pass_parallelize_loops::execute

2015-11-17  Tom de Vries  <tom@codesourcery.com>

	PR tree-optimization/68373
	* tree-parloops.c (pass_parallelize_loops::execute): Call
	scev_const_prop.

	* gcc.dg/autopar/pr68373.c: New test.

---
 gcc/testsuite/gcc.dg/autopar/pr68373.c | 14 ++++++++++++++
 gcc/tree-parloops.c                    |  3 +++
 2 files changed, 17 insertions(+)

diff --git a/gcc/testsuite/gcc.dg/autopar/pr68373.c b/gcc/testsuite/gcc.dg/autopar/pr68373.c
new file mode 100644
index 0000000..8e0f8a5
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/autopar/pr68373.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-details" } */
+
+unsigned int
+foo (int *a, unsigned int n)
+{
+  unsigned int i;
+  for (i = 0; i < n; ++i)
+    a[i] = 1;
+
+  return i;
+}
+
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
index 17415a8..d944395 100644
--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c
@@ -2787,6 +2787,9 @@ pass_parallelize_loops::execute (function *fun)
   if (number_of_loops (fun) <= 1)
     return 0;
 
+  unsigned int sccp_todo = scev_const_prop ();
+  gcc_assert (sccp_todo == 0);
+
   if (parallelize_loops ())
     {
       fun->curr_properties &= ~(PROP_gimple_eomp);

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-17 15:39                 ` Tom de Vries
  2015-11-17 22:21                   ` [PATCH, PR68373 ] Call scev_const_prop in pass_parallelize_loops::execute Tom de Vries
@ 2015-11-18  8:30                   ` Richard Biener
  2015-11-18 16:22                     ` Bernhard Reutner-Fischer
  1 sibling, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-18  8:30 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

On Tue, 17 Nov 2015, Tom de Vries wrote:

> On 17/11/15 16:18, Richard Biener wrote:
> > > > IMHO autopar needs to handle induction itself.
> > >
> > > I'm not sure what you mean. Could you elaborate?  Autopar handles induction
> > > variables, but it doesn't handle exit phis reading the final value of the
> > > induction variable. Is that what you want fixed? How?
> > Yes.  Perform final value replacement.
> > 
> 
> I see. Calling scev_const_prop in pass_parallelize_loops_oacc_kernels seems to
> work fine.
> 
> Doing the same for pass_parallelize_loops like this:
> ...
> diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
> index 17415a8..d944395 100644
> --- a/gcc/tree-parloops.c
> +++ b/gcc/tree-parloops.c
> @@ -2787,6 +2787,9 @@ pass_parallelize_loops::execute (function *fun)
>    if (number_of_loops (fun) <= 1)
>      return 0;
> 
> +  unsigned int sccp_todo = scev_const_prop ();
> +  gcc_assert (sccp_todo == 0);
> +
>    if (parallelize_loops ())
>      {
>        fun->curr_properties &= ~(PROP_gimple_eomp);
> ...
> seems to fix PR 68373 - "autopar fails on loop exit phi with argument defined
> outside loop".
> 
> The new scev_const_prop call in autopar rewrites this phi into an assignment,
> and that allows parloops to succeed:
> ...
> final value replacement:
>   n_2 = PHI <n_4(D)(4)>
>   with
>   n_2 = n_4(D);
> ...

That works for me, but please factor out the final value replacement
code from scev_const_prop.  I think it would be best to have a
helper that does final value replacement for a single loop, so you
can call it only for the loops to parallelize.

Richard.

> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-18  8:30                   ` [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def Richard Biener
@ 2015-11-18 16:22                     ` Bernhard Reutner-Fischer
  2015-11-20 12:53                       ` [committed, trivial] Fix typo and trailing whitespace in dump-file strings in parloops Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Bernhard Reutner-Fischer @ 2015-11-18 16:22 UTC (permalink / raw)
  To: Richard Biener, Tom de Vries; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

On November 18, 2015 9:30:23 AM GMT+01:00, Richard Biener <rguenther@suse.de> wrote:
>On Tue, 17 Nov 2015, Tom de Vries wrote:
>
>> On 17/11/15 16:18, Richard Biener wrote:
>> > > > IMHO autopar needs to handle induction itself.
>> > >
>> > > I'm not sure what you mean. Could you elaborate?  Autopar handles
>> > > induction variables, but it doesn't handle exit phis reading the final
>> > > value of the induction variable. Is that what you want fixed? How?
>> > Yes.  Perform final value replacement.
>> > 
>> 
>> I see. Calling scev_const_prop in pass_parallelize_loops_oacc_kernels
>> seems to work fine.
>> 
>> Doing the same for pass_parallelize_loops like this:
>> ...
>> diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
>> index 17415a8..d944395 100644
>> --- a/gcc/tree-parloops.c
>> +++ b/gcc/tree-parloops.c
>> @@ -2787,6 +2787,9 @@ pass_parallelize_loops::execute (function *fun)
>>    if (number_of_loops (fun) <= 1)
>>      return 0;
>> 
>> +  unsigned int sccp_todo = scev_const_prop ();
>> +  gcc_assert (sccp_todo == 0);
>> +
>>    if (parallelize_loops ())
>>      {
>>        fun->curr_properties &= ~(PROP_gimple_eomp);
>> ...
>> seems to fix PR 68373 - "autopar fails on loop exit phi with argument
>> defined outside loop".
>> 
>> The new scev_const_prop call in autopar rewrites this phi into an
>> assignment, and that allows parloops to succeed:
>> ...
>> final value replacement:
>>   n_2 = PHI <n_4(D)(4)>
>>   with
>>   n_2 = n_4(D);
>> ...
>
>That works for me but please factor out the final value replacement
>code from scev_const_prop.  I think best would be to have a
>helper that does final value replacement for a single loop so you
>can call it for loops to parallelize only.

Bonus points for fixing the dump_file to parse in:

>Parloops will fail because:
>...
>phi is n_2 = PHI <n_4(D)(4)>
>arg of phi to exit: value n_4(D) used outside loop
>checking if it a part of reduction pattern:

s/it a/it is/

>FAILED: it is not a part of reduction....
>...

TIA,
>
>Richard.
>
>> Thanks,
>> - Tom
>> 
>> 


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-17 14:54             ` Tom de Vries
  2015-11-17 15:18               ` Richard Biener
@ 2015-11-19  0:35               ` Tom de Vries
  2015-11-20 10:28                 ` Richard Biener
  1 sibling, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-19  0:35 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

On 17/11/15 15:53, Tom de Vries wrote:
>> And the above LIM example
>> is none for why you need two LIM passes...
>
> Indeed. I'm planning a separate reply to explain in more detail the need
> for the two pass_lims.

I.

I managed to get rid of the two pass_lims for the motivating example 
that I used until now (goacc/kernels-double-reduction.c). I found that 
by adding a pass_dominator instance after pass_ch, I could get rid of 
the second pass_lim (and pass_copyprop as well).

But... then I wrote a counter example 
(goacc/kernels-double-reduction-n.c), and I'm back at two pass_lims (and 
two pass_dominators).
Also, I've split the pass group into a part before and a part after pass_fre.

So, the current pass group looks like:
...
NEXT_PASS (pass_build_ealias);

/* Pass group that runs when the function is an offloaded function
    containing oacc kernels loops.  Part 1.  */
NEXT_PASS (pass_oacc_kernels);
PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
     /* We need pass_ch here, because pass_lim has no effect on
        exit-first loops (PR65442).  Ideally we want to remove both
        this pass instantiation, and the reverse transformation
        transform_to_exit_first_loop_alt, which is done in
        pass_parallelize_loops_oacc_kernels. */
     NEXT_PASS (pass_ch);
POP_INSERT_PASSES ()

NEXT_PASS (pass_fre);

/* Pass group that runs when the function is an offloaded function
    containing oacc kernels loops.  Part 2.  */
NEXT_PASS (pass_oacc_kernels2);
PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
     /* We use pass_lim to rewrite in-memory iteration and reduction
        variable accesses in loops into local variable accesses.  */
     NEXT_PASS (pass_lim);
     NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
     NEXT_PASS (pass_lim);
     NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
     NEXT_PASS (pass_dce);
     NEXT_PASS (pass_parallelize_loops_oacc_kernels);
     NEXT_PASS (pass_expand_omp_ssa);
POP_INSERT_PASSES ()
NEXT_PASS (pass_merge_phi);
...


II.

The motivating test-case kernels-double-reduction-n.c:
...
#include <stdlib.h>

#define N 500

unsigned int a[N][N];

void  __attribute__((noinline,noclone))
foo (unsigned int n)
{
   int i, j;
   unsigned int sum = 1;

#pragma acc kernels copyin (a[0:n]) copy (sum)
   {
     for (i = 0; i < n; ++i)
       for (j = 0; j < n; ++j)
         sum += a[i][j];
   }

   if (sum != 5001)
     abort ();
}
...


III.

Before the first pass_lim. Note that there are no phis on the inner or 
outer loop header for the iteration variables or the reduction variable:
...
   <bb 2>:
   _5 = *.omp_data_i_4(D).i;
   *_5 = 0;
   _44 = *.omp_data_i_4(D).n;
   _45 = *_44;
   if (_45 != 0)
     goto <bb 4>;
   else
     goto <bb 3>;

   <bb 4>: outer loop header
   _12 = *.omp_data_i_4(D).j;
   *_12 = 0;
   if (_45 != 0)
     goto <bb 6>;
   else
     goto <bb 5>;

   <bb 6>: inner loop header, latch
   _19 = *.omp_data_i_4(D).a;
   _21 = *_5;
   _23 = *_12;
   _24 = *_19[_21][_23];
   _25 = *.omp_data_i_4(D).sum;
   sum.0_26 = *_25;
   sum.1_27 = _24 + sum.0_26;
   *_25 = sum.1_27;
   _33 = _23 + 1;
   *_12 = _33;
   j.2_16 = (unsigned int) _33;
   if (j.2_16 < _45)
     goto <bb 6>;
   else
     goto <bb 5>;

   <bb 5>: outer loop latch
   _36 = *_5;
   _38 = _36 + 1;
   *_5 = _38;
   i.3_9 = (unsigned int) _38;
   if (i.3_9 < _45)
     goto <bb 4>;
   else
     goto <bb 3>;

   <bb 3>:
   return;
...


IV.

After first pass_lim/pass_dom pair. Note there are phis on the inner 
loop header for the reduction and the iteration variable, but not on the 
outer loop header:
...
   <bb 2>:
   _5 = *.omp_data_i_4(D).i;
   *_5 = 0;
   _44 = *.omp_data_i_4(D).n;
   _45 = *_44;
   if (_45 != 0)
     goto <bb 4>;
   else
     goto <bb 3>;

   <bb 4>:
   _12 = *.omp_data_i_4(D).j;
   _19 = *.omp_data_i_4(D).a;
   D__lsm.10_50 = *_12;
   D__lsm.11_51 = 0;
   _25 = *.omp_data_i_4(D).sum;

   <bb 5>: outer loop header
   D__lsm.10_20 = 0;
   D__lsm.11_22 = 1;
   _21 = *_5;
   D__lsm.12_28 = *_25;
   D__lsm.13_30 = 0;
   goto <bb 7>;

   <bb 7>: inner loop header, latch
   # D__lsm.10_47 = PHI <0(5), _33(7)>
   # D__lsm.12_49 = PHI <D__lsm.12_28(5), sum.1_27(7)>
   _23 = D__lsm.10_47;
   _24 = *_19[_21][D__lsm.10_47];
   sum.0_26 = D__lsm.12_49;
   sum.1_27 = _24 + D__lsm.12_49;
   D__lsm.12_31 = sum.1_27;
   D__lsm.13_32 = 1;
   _33 = D__lsm.10_47 + 1;
   D__lsm.10_14 = _33;
   D__lsm.11_15 = 1;
   j.2_16 = (unsigned int) _33;
   if (j.2_16 < _45)
     goto <bb 7>;
   else
     goto <bb 8>;

   <bb 8>: outer loop latch
   # D__lsm.10_35 = PHI <_33(7)>
   # D__lsm.11_37 = PHI <1(7)>
   # D__lsm.12_7 = PHI <sum.1_27(7)>
   # D__lsm.13_8 = PHI <1(7)>
   *_25 = sum.1_27;
   _36 = *_5;
   _38 = _36 + 1;
   *_5 = _38;
   i.3_9 = (unsigned int) _38;
   if (i.3_9 < _45)
     goto <bb 5>;
   else
     goto <bb 6>;

   <bb 6>:
   # D__lsm.10_10 = PHI <_33(8)>
   # D__lsm.11_11 = PHI <1(8)>
   *_12 = _33;
   goto <bb 3>;

   <bb 3>:
   return;
...


V.

After second pass_lim/pass_dom pair. Note there are phis on the inner 
and outer loop header for the reduction and the iteration variables:
...
   <bb 2>:
   _5 = *.omp_data_i_4(D).i;
   *_5 = 0;
   _44 = *.omp_data_i_4(D).n;
   _45 = *_44;
   if (_45 != 0)
     goto <bb 4>;
   else
     goto <bb 3>;

   <bb 4>:
   _12 = *.omp_data_i_4(D).j;
   _19 = *.omp_data_i_4(D).a;
   D__lsm.10_50 = *_12;
   D__lsm.11_51 = 0;
   _25 = *.omp_data_i_4(D).sum;
   D__lsm.14_40 = 0;
   D__lsm.15_2 = 0;
   D__lsm.16_1 = *_25;
   D__lsm.17_46 = 0;

   <bb 5>: outer loop header
   # D__lsm.14_13 = PHI <0(4), _38(8)>
   # D__lsm.16_34 = PHI <D__lsm.16_1(4), sum.1_27(8)>
   D__lsm.10_20 = 0;
   D__lsm.11_22 = 1;
   _21 = D__lsm.14_13;
   D__lsm.12_28 = D__lsm.16_34;
   D__lsm.13_30 = 0;
   goto <bb 7>;

   <bb 7>: inner loop header, latch
   # D__lsm.10_47 = PHI <0(5), _33(7)>
   # D__lsm.12_49 = PHI <D__lsm.16_34(5), sum.1_27(7)>
   _23 = D__lsm.10_47;
   _24 = *_19[D__lsm.14_13][D__lsm.10_47];
   sum.0_26 = D__lsm.12_49;
   sum.1_27 = _24 + D__lsm.12_49;
   D__lsm.12_31 = sum.1_27;
   D__lsm.13_32 = 1;
   _33 = D__lsm.10_47 + 1;
   D__lsm.10_14 = _33;
   D__lsm.11_15 = 1;
   j.2_16 = (unsigned int) _33;
   if (j.2_16 < _45)
     goto <bb 7>;
   else
     goto <bb 8>;

   <bb 8>: outer loop latch
   # D__lsm.10_35 = PHI <_33(7)>
   # D__lsm.11_37 = PHI <1(7)>
   # D__lsm.12_7 = PHI <sum.1_27(7)>
   # D__lsm.13_8 = PHI <1(7)>
   # sum.1_48 = PHI <sum.1_27(7)>
   # _53 = PHI <_33(7)>
   D__lsm.16_56 = sum.1_27;
   D__lsm.17_57 = 1;
   _36 = D__lsm.14_13;
   _38 = D__lsm.14_13 + 1;
   D__lsm.14_58 = _38;
   D__lsm.15_59 = 1;
   i.3_9 = (unsigned int) _38;
   if (i.3_9 < _45)
     goto <bb 5>;
   else
     goto <bb 6>;

   <bb 6>:
   # D__lsm.10_10 = PHI <_33(8)>
   # D__lsm.11_11 = PHI <1(8)>
   # _43 = PHI <_33(8)>
   # D__lsm.16_62 = PHI <sum.1_27(8)>
   # D__lsm.17_63 = PHI <1(8)>
   # D__lsm.14_64 = PHI <_38(8)>
   # D__lsm.15_65 = PHI <1(8)>
   *_5 = _38;
   *_25 = sum.1_27;
   *_12 = _33;
   goto <bb 3>;

   <bb 3>:
   return;
...


VI.

After pass_dce, so before parloops-oacc-kernels:
...
   <bb 2>:
   _5 = *.omp_data_i_4(D).i;
   *_5 = 0;
   _44 = *.omp_data_i_4(D).n;
   _45 = *_44;
   if (_45 != 0)
     goto <bb 4>;
   else
     goto <bb 3>;

   <bb 4>:
   _12 = *.omp_data_i_4(D).j;
   _19 = *.omp_data_i_4(D).a;
   _25 = *.omp_data_i_4(D).sum;
   D__lsm.16_1 = *_25;

   <bb 5>: outer loop header
   # D__lsm.14_13 = PHI <0(4), _38(8)>
   # D__lsm.16_34 = PHI <D__lsm.16_1(4), sum.1_27(8)>
   goto <bb 7>;

   <bb 7>: inner loop header, latch
   # D__lsm.10_47 = PHI <0(5), _33(7)>
   # D__lsm.12_49 = PHI <D__lsm.16_34(5), sum.1_27(7)>
   _24 = *_19[D__lsm.14_13][D__lsm.10_47];
   sum.1_27 = _24 + D__lsm.12_49;
   _33 = D__lsm.10_47 + 1;
   j.2_16 = (unsigned int) _33;
   if (j.2_16 < _45)
     goto <bb 7>;
   else
     goto <bb 8>;

   <bb 8>: outer loop latch
   _38 = D__lsm.14_13 + 1;
   i.3_9 = (unsigned int) _38;
   if (i.3_9 < _45)
     goto <bb 5>;
   else
     goto <bb 6>;

   <bb 6>:
   *_5 = _38;
   *_25 = sum.1_27;
   *_12 = _33;
   goto <bb 3>;

   <bb 3>:
   return;
...
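
To summarize the three stages at source level, here is a hand-written C 
approximation of the dumps in III, IV and VI. It is a sketch only: the 
function names and the small array size M are made up for illustration, 
the pointers i, j and sum stand in for the .omp_data_i fields, and the 
code assumes that they alias neither each other nor the array.

```c
#include <assert.h>

#define M 4  /* Hypothetical small size; the test-case uses N 500.  */

/* III: i, j and sum all live in memory.  */
static void
sum_in_memory (unsigned int a[M][M], unsigned int n,
	       int *i, int *j, unsigned int *sum)
{
  for (*i = 0; (unsigned int) *i < n; ++*i)
    for (*j = 0; (unsigned int) *j < n; ++*j)
      *sum += a[*i][*j];
}

/* IV: after the first pass_lim/pass_dom pair, j and sum are local in
   the inner loop, but i is still accessed in memory, and sum is still
   loaded and stored once per outer iteration.  */
static void
sum_after_first_lim (unsigned int a[M][M], unsigned int n,
		     int *i, int *j, unsigned int *sum)
{
  *i = 0;
  if (n == 0)
    return;

  int jl = 0;
  for (; (unsigned int) *i < n; ++*i)
    {
      unsigned int s = *sum;
      for (jl = 0; (unsigned int) jl < n; ++jl)
	s += a[*i][jl];
      *sum = s;
    }
  *j = jl;
}

/* VI: after the second pair and pass_dce, both loops use only locals,
   and the three stores are done once after the outer loop.  */
static void
sum_after_second_lim (unsigned int a[M][M], unsigned int n,
		      int *i, int *j, unsigned int *sum)
{
  *i = 0;
  if (n == 0)
    return;

  int il, jl = 0;
  unsigned int s = *sum;
  for (il = 0; (unsigned int) il < n; ++il)
    for (jl = 0; (unsigned int) jl < n; ++jl)
      s += a[il][jl];
  *i = il;
  *sum = s;
  *j = jl;
}

/* All three forms leave the same values in i, j and sum.  */
static void
check_double_reduction (void)
{
  unsigned int a[M][M];
  for (int r = 0; r < M; ++r)
    for (int c = 0; c < M; ++c)
      a[r][c] = (unsigned int) (r * M + c);

  for (unsigned int n = 0; n <= M; ++n)
    {
      int i1 = -1, j1 = -1, i2 = -1, j2 = -1, i3 = -1, j3 = -1;
      unsigned int s1 = 1, s2 = 1, s3 = 1;

      sum_in_memory (a, n, &i1, &j1, &s1);
      sum_after_first_lim (a, n, &i2, &j2, &s2);
      sum_after_second_lim (a, n, &i3, &j3, &s3);

      assert (i1 == i2 && i2 == i3);
      assert (j1 == j2 && j2 == j3);
      assert (s1 == s2 && s2 == s3);
    }
}
```

The first rewrite only localizes what the inner loop touches, which is 
why a second pass_lim is needed before the outer loop gets its own phis 
for i and sum.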

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, PR68373 ] Call scev_const_prop in pass_parallelize_loops::execute
  2015-11-17 22:21                   ` [PATCH, PR68373 ] Call scev_const_prop in pass_parallelize_loops::execute Tom de Vries
@ 2015-11-19  9:36                     ` Tom de Vries
  2015-11-20 10:15                       ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-19  9:36 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 2068 bytes --]

On 17/11/15 23:20, Tom de Vries wrote:
> [ was: Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def ]
>
> Hi,
>
> Consider test-case test.c, with a use of the final value of the
> iteration variable (return i):
> ...
> unsigned int
> foo (int *a, unsigned int n)
> {
>    unsigned int i;
>    for (i = 0; i < n; ++i)
>      a[i] = 1;
>
>    return i;
> }
> ...
>
> Compiled with:
> ...
> $ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details
> ...
>
> Before parloops, we have:
> ...
>   <bb 4>:
>    # i_12 = PHI <0(3), i_10(5)>
>    _5 = (long unsigned int) i_12;
>    _6 = _5 * 4;
>    _8 = a_7(D) + _6;
>    *_8 = 1;
>    i_10 = i_12 + 1;
>    if (n_4(D) > i_10)
>      goto <bb 5>;
>    else
>      goto <bb 6>;
>
>    <bb 5>:
>    goto <bb 4>;
>
>    <bb 6>:
>    # i_14 = PHI <n_4(D)(4), 0(2)>
> ...
>
> Parloops will fail because:
> ...
> phi is n_2 = PHI <n_4(D)(4)>
> arg of phi to exit:   value n_4(D) used outside loop
>    checking if it a part of reduction pattern:
>    FAILED: it is not a part of reduction....
> ...
> [ note that the phi looks slightly different. In
> gather_scalar_reductions -> vect_analyze_loop_form ->
> vect_analyze_loop_form_1 -> split_loop_exit_edge we split the edge from
> bb4 to bb6. ]
>
> This patch uses scev_const_prop at the start of parloops.
> scev_const_prop first also splits the exit edge, and then replaces the
> phi with an assignment:
> ...
>   final value replacement:
>    n_2 = PHI <n_4(D)(4)>
>    with
>    n_2 = n_4(D);
> ...
>
> This allows parloops to succeed.
>
> And there's a similar story when we compile with -fno-tree-scev-cprop in
> addition.
>
> Bootstrapped and reg-tested on x86_64.
>
> OK for stage3/stage1?

The patch has been updated to do the final value replacement only for
the loop that parloops is processing, as suggested in the review comment at
https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02166.html .

That means the patch is now also required for the kernels patch series.
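
At source level, the effect of the final value replacement on the
test-case can be pictured as follows. This is only an illustrative
sketch; the names foo_before and foo_after are hypothetical, and the
real transformation happens on the exit phi in GIMPLE rather than on
the source.

```c
#include <assert.h>

/* Original shape: the final value of i is used after the loop, so the
   exit phi carries a loop-defined value out of the loop.  */
unsigned int
foo_before (int *a, unsigned int n)
{
  unsigned int i;
  for (i = 0; i < n; ++i)
    a[i] = 1;

  return i;
}

/* Conceptual shape after final value replacement: the value used
   after the loop is computed outside the loop (here simply n), so
   nothing defined in the loop is live at the exit.  */
unsigned int
foo_after (int *a, unsigned int n)
{
  unsigned int i;
  for (i = 0; i < n; ++i)
    a[i] = 1;

  return n;
}
```

Both functions return the same value for any n, including n == 0.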

Bootstrapped and reg-tested on x86_64.

OK for stage 3 trunk?

Thanks,
- Tom

[-- Attachment #2: 0001-Do-final-value-replacement-in-try_create_reduction_list.patch --]
[-- Type: text/x-patch, Size: 10690 bytes --]

Do final value replacement in try_create_reduction_list

2015-11-18  Tom de Vries  <tom@codesourcery.com>

	* tree-scalar-evolution.c (final_value_replacement_loop): Factor out of ...
	(scev_const_prop): ... here.
	* tree-scalar-evolution.h (final_value_replacement_loop): Declare.
	* tree-parloops.c (try_create_reduction_list): Call
	final_value_replacement_loop.

	* gcc.dg/autopar/pr68373.c: New test.

---
 gcc/testsuite/gcc.dg/autopar/pr68373.c |  14 ++
 gcc/tree-parloops.c                    |   3 +
 gcc/tree-scalar-evolution.c            | 248 +++++++++++++++++----------------
 gcc/tree-scalar-evolution.h            |   1 +
 4 files changed, 145 insertions(+), 121 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/autopar/pr68373.c b/gcc/testsuite/gcc.dg/autopar/pr68373.c
new file mode 100644
index 0000000..8e0f8a5
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/autopar/pr68373.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-details" } */
+
+unsigned int
+foo (int *a, unsigned int n)
+{
+  unsigned int i;
+  for (i = 0; i < n; ++i)
+    a[i] = 1;
+
+  return i;
+}
+
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
index 17415a8..8d7912d 100644
--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c
@@ -2539,6 +2539,9 @@ try_create_reduction_list (loop_p loop,
 
   gcc_assert (exit);
 
+  /* Try to get rid of exit phis.  */
+  final_value_replacement_loop (loop);
+
   gather_scalar_reductions (loop, reduction_list);
 
 
diff --git a/gcc/tree-scalar-evolution.c b/gcc/tree-scalar-evolution.c
index 27630f0..9b33693 100644
--- a/gcc/tree-scalar-evolution.c
+++ b/gcc/tree-scalar-evolution.c
@@ -3417,6 +3417,131 @@ expression_expensive_p (tree expr)
     }
 }
 
+/* Do final value replacement for LOOP.  */
+
+void
+final_value_replacement_loop (struct loop *loop)
+{
+  /* If we do not know exact number of iterations of the loop, we cannot
+     replace the final value.  */
+  edge exit = single_exit (loop);
+  if (!exit)
+    return;
+
+  tree niter = number_of_latch_executions (loop);
+  if (niter == chrec_dont_know)
+    return;
+
+  /* Ensure that it is possible to insert new statements somewhere.  */
+  if (!single_pred_p (exit->dest))
+    split_loop_exit_edge (exit);
+
+  /* Set stmt insertion pointer.  All stmts are inserted before this point.  */
+  gimple_stmt_iterator gsi = gsi_after_labels (exit->dest);
+
+  struct loop *ex_loop
+    = superloop_at_depth (loop,
+			  loop_depth (exit->dest->loop_father) + 1);
+
+  gphi_iterator psi;
+  for (psi = gsi_start_phis (exit->dest); !gsi_end_p (psi); )
+    {
+      gphi *phi = psi.phi ();
+      tree rslt = PHI_RESULT (phi);
+      tree def = PHI_ARG_DEF_FROM_EDGE (phi, exit);
+      if (virtual_operand_p (def))
+	{
+	  gsi_next (&psi);
+	  continue;
+	}
+
+      if (!POINTER_TYPE_P (TREE_TYPE (def))
+	  && !INTEGRAL_TYPE_P (TREE_TYPE (def)))
+	{
+	  gsi_next (&psi);
+	  continue;
+	}
+
+      bool folded_casts;
+      def = analyze_scalar_evolution_in_loop (ex_loop, loop, def,
+					      &folded_casts);
+      def = compute_overall_effect_of_inner_loop (ex_loop, def);
+      if (!tree_does_not_contain_chrecs (def)
+	  || chrec_contains_symbols_defined_in_loop (def, ex_loop->num)
+	  /* Moving the computation from the loop may prolong life range
+	     of some ssa names, which may cause problems if they appear
+	     on abnormal edges.  */
+	  || contains_abnormal_ssa_name_p (def)
+	  /* Do not emit expensive expressions.  The rationale is that
+	     when someone writes a code like
+
+	     while (n > 45) n -= 45;
+
+	     he probably knows that n is not large, and does not want it
+	     to be turned into n %= 45.  */
+	  || expression_expensive_p (def))
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    {
+	      fprintf (dump_file, "not replacing:\n  ");
+	      print_gimple_stmt (dump_file, phi, 0, 0);
+	      fprintf (dump_file, "\n");
+	    }
+	  gsi_next (&psi);
+	  continue;
+	}
+
+      /* Eliminate the PHI node and replace it by a computation outside
+	 the loop.  */
+      if (dump_file)
+	{
+	  fprintf (dump_file, "\nfinal value replacement:\n  ");
+	  print_gimple_stmt (dump_file, phi, 0, 0);
+	  fprintf (dump_file, "  with\n  ");
+	}
+      def = unshare_expr (def);
+      remove_phi_node (&psi, false);
+
+      /* If def's type has undefined overflow and there were folded
+	 casts, rewrite all stmts added for def into arithmetics
+	 with defined overflow behavior.  */
+      if (folded_casts && ANY_INTEGRAL_TYPE_P (TREE_TYPE (def))
+	  && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (def)))
+	{
+	  gimple_seq stmts;
+	  gimple_stmt_iterator gsi2;
+	  def = force_gimple_operand (def, &stmts, true, NULL_TREE);
+	  gsi2 = gsi_start (stmts);
+	  while (!gsi_end_p (gsi2))
+	    {
+	      gimple *stmt = gsi_stmt (gsi2);
+	      gimple_stmt_iterator gsi3 = gsi2;
+	      gsi_next (&gsi2);
+	      gsi_remove (&gsi3, false);
+	      if (is_gimple_assign (stmt)
+		  && arith_code_with_undefined_signed_overflow
+		  (gimple_assign_rhs_code (stmt)))
+		gsi_insert_seq_before (&gsi,
+				       rewrite_to_defined_overflow (stmt),
+				       GSI_SAME_STMT);
+	      else
+		gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
+	    }
+	}
+      else
+	def = force_gimple_operand_gsi (&gsi, def, false, NULL_TREE,
+					true, GSI_SAME_STMT);
+
+      gassign *ass = gimple_build_assign (rslt, def);
+      gsi_insert_before (&gsi, ass, GSI_SAME_STMT);
+      if (dump_file)
+	{
+	  print_gimple_stmt (dump_file, ass, 0, 0);
+	  fprintf (dump_file, "\n");
+	}
+    }
+}
+
 /* Replace ssa names for that scev can prove they are constant by the
    appropriate constants.  Also perform final value replacement in loops,
    in case the replacement expressions are cheap.
@@ -3430,8 +3555,7 @@ scev_const_prop (void)
   basic_block bb;
   tree name, type, ev;
   gphi *phi;
-  gassign *ass;
-  struct loop *loop, *ex_loop;
+  struct loop *loop;
   bitmap ssa_names_to_remove = NULL;
   unsigned i;
   gphi_iterator psi;
@@ -3507,126 +3631,8 @@ scev_const_prop (void)
 
   /* Now the regular final value replacement.  */
   FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
-    {
-      edge exit;
-      tree def, rslt, niter;
-      gimple_stmt_iterator gsi;
-
-      /* If we do not know exact number of iterations of the loop, we cannot
-	 replace the final value.  */
-      exit = single_exit (loop);
-      if (!exit)
-	continue;
-
-      niter = number_of_latch_executions (loop);
-      if (niter == chrec_dont_know)
-	continue;
-
-      /* Ensure that it is possible to insert new statements somewhere.  */
-      if (!single_pred_p (exit->dest))
-	split_loop_exit_edge (exit);
-      gsi = gsi_after_labels (exit->dest);
+    final_value_replacement_loop (loop);
 
-      ex_loop = superloop_at_depth (loop,
-				    loop_depth (exit->dest->loop_father) + 1);
-
-      for (psi = gsi_start_phis (exit->dest); !gsi_end_p (psi); )
-	{
-	  phi = psi.phi ();
-	  rslt = PHI_RESULT (phi);
-	  def = PHI_ARG_DEF_FROM_EDGE (phi, exit);
-	  if (virtual_operand_p (def))
-	    {
-	      gsi_next (&psi);
-	      continue;
-	    }
-
-	  if (!POINTER_TYPE_P (TREE_TYPE (def))
-	      && !INTEGRAL_TYPE_P (TREE_TYPE (def)))
-	    {
-	      gsi_next (&psi);
-	      continue;
-	    }
-
-	  bool folded_casts;
-	  def = analyze_scalar_evolution_in_loop (ex_loop, loop, def,
-						  &folded_casts);
-	  def = compute_overall_effect_of_inner_loop (ex_loop, def);
-	  if (!tree_does_not_contain_chrecs (def)
-	      || chrec_contains_symbols_defined_in_loop (def, ex_loop->num)
-	      /* Moving the computation from the loop may prolong life range
-		 of some ssa names, which may cause problems if they appear
-		 on abnormal edges.  */
-	      || contains_abnormal_ssa_name_p (def)
-	      /* Do not emit expensive expressions.  The rationale is that
-		 when someone writes a code like
-
-		 while (n > 45) n -= 45;
-
-		 he probably knows that n is not large, and does not want it
-		 to be turned into n %= 45.  */
-	      || expression_expensive_p (def))
-	    {
-	      if (dump_file && (dump_flags & TDF_DETAILS))
-		{
-	          fprintf (dump_file, "not replacing:\n  ");
-	          print_gimple_stmt (dump_file, phi, 0, 0);
-	          fprintf (dump_file, "\n");
-		}
-	      gsi_next (&psi);
-	      continue;
-	    }
-
-	  /* Eliminate the PHI node and replace it by a computation outside
-	     the loop.  */
-	  if (dump_file)
-	    {
-	      fprintf (dump_file, "\nfinal value replacement:\n  ");
-	      print_gimple_stmt (dump_file, phi, 0, 0);
-	      fprintf (dump_file, "  with\n  ");
-	    }
-	  def = unshare_expr (def);
-	  remove_phi_node (&psi, false);
-
-	  /* If def's type has undefined overflow and there were folded
-	     casts, rewrite all stmts added for def into arithmetics
-	     with defined overflow behavior.  */
-	  if (folded_casts && ANY_INTEGRAL_TYPE_P (TREE_TYPE (def))
-	      && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (def)))
-	    {
-	      gimple_seq stmts;
-	      gimple_stmt_iterator gsi2;
-	      def = force_gimple_operand (def, &stmts, true, NULL_TREE);
-	      gsi2 = gsi_start (stmts);
-	      while (!gsi_end_p (gsi2))
-		{
-		  gimple *stmt = gsi_stmt (gsi2);
-		  gimple_stmt_iterator gsi3 = gsi2;
-		  gsi_next (&gsi2);
-		  gsi_remove (&gsi3, false);
-		  if (is_gimple_assign (stmt)
-		      && arith_code_with_undefined_signed_overflow
-					(gimple_assign_rhs_code (stmt)))
-		    gsi_insert_seq_before (&gsi,
-					   rewrite_to_defined_overflow (stmt),
-					   GSI_SAME_STMT);
-		  else
-		    gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
-		}
-	    }
-	  else
-	    def = force_gimple_operand_gsi (&gsi, def, false, NULL_TREE,
-					    true, GSI_SAME_STMT);
-
-	  ass = gimple_build_assign (rslt, def);
-	  gsi_insert_before (&gsi, ass, GSI_SAME_STMT);
-	  if (dump_file)
-	    {
-	      print_gimple_stmt (dump_file, ass, 0, 0);
-	      fprintf (dump_file, "\n");
-	    }
-	}
-    }
   return 0;
 }
 
diff --git a/gcc/tree-scalar-evolution.h b/gcc/tree-scalar-evolution.h
index 6d31280..29c7cd4 100644
--- a/gcc/tree-scalar-evolution.h
+++ b/gcc/tree-scalar-evolution.h
@@ -33,6 +33,7 @@ extern tree analyze_scalar_evolution (struct loop *, tree);
 extern tree instantiate_scev (basic_block, struct loop *, tree);
 extern tree resolve_mixers (struct loop *, tree, bool *);
 extern void gather_stats_on_scev_database (void);
+extern void final_value_replacement_loop (struct loop *);
 extern unsigned int scev_const_prop (void);
 extern bool expression_expensive_p (tree);
 extern bool simple_iv (struct loop *, struct loop *, tree, struct affine_iv *,


* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-16 12:45       ` Richard Biener
  2015-11-16 23:21         ` Tom de Vries
@ 2015-11-19 10:31         ` Tom de Vries
  2015-11-20 10:37           ` Richard Biener
  1 sibling, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-19 10:31 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 1148 bytes --]

On 16/11/15 13:45, Richard Biener wrote:
>> I've eliminated all the uses for pass_tree_loop_init/pass_tree_loop_done in
>> >the pass group. Instead, I've added conditional loop optimizer setup in:
>> >-  pass_lim and pass_scev_cprop (added in this patch), and

Reposting the "Add pass_oacc_kernels pass group in passes.def" patch.

pass_scev_cprop is no longer part of the pass group.

And I've dropped the scev_initialize in pass_lim.

Pass_lim is part of the pass_tree_loop pass group, where AFAIU scev info 
is initialized at the start of the pass group and updated or reset by 
passes in the pass group if necessary, such that it's always available, 
or can be recalculated on the spot.

First, pass_lim doesn't invalidate scev info. And second, AFAIU pass_lim 
doesn't use scev info. So there doesn't seem to be a need to do anything 
about scev info for using pass_lim outside pass_tree_loop.

>> >- pass_parallelize_loops_oacc_kernels (added in patch "Add
>> >   pass_parallelize_loops_oacc_kernels").
> You miss calling scev_finalize ().

I've added the scev_finalize () in patch "Add 
pass_parallelize_loops_oacc_kernels".

Thanks,
- Tom


[-- Attachment #2: 0005-Add-pass_oacc_kernels-pass-group-in-passes.def.patch --]
[-- Type: text/x-patch, Size: 4035 bytes --]

Add pass_oacc_kernels pass group in passes.def

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* omp-low.c (pass_expand_omp_ssa::clone): New function.
	* passes.def: Add pass_oacc_kernels pass group.
	* tree-ssa-loop-ch.c (pass_ch::clone): New function.
	* tree-ssa-loop-im.c (tree_ssa_lim): Make static.
	(pass_lim::execute): Allow to run outside pass_tree_loop.

---
 gcc/omp-low.c          |  1 +
 gcc/passes.def         | 25 +++++++++++++++++++++++++
 gcc/tree-ssa-loop-ch.c |  2 ++
 gcc/tree-ssa-loop-im.c | 10 +++++++++-
 4 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 9c27396..d2f88b3 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -13385,6 +13385,7 @@ public:
       return !(fun->curr_properties & PROP_gimple_eomp);
     }
   virtual unsigned int execute (function *) { return execute_expand_omp (); }
+  opt_pass * clone () { return new pass_expand_omp_ssa (m_ctxt); }
 
 }; // class pass_expand_omp_ssa
 
diff --git a/gcc/passes.def b/gcc/passes.def
index 17027786..00446c3 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -88,7 +88,32 @@ along with GCC; see the file COPYING3.  If not see
 	  /* pass_build_ealias is a dummy pass that ensures that we
 	     execute TODO_rebuild_alias at this point.  */
 	  NEXT_PASS (pass_build_ealias);
+	  /* Pass group that runs when the function is an offloaded function
+	     containing oacc kernels loops.  Part 1.  */
+	  NEXT_PASS (pass_oacc_kernels);
+	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
+	      /* We need pass_ch here, because pass_lim has no effect on
+	         exit-first loops (PR65442).  Ideally we want to remove both
+		 this pass instantiation, and the reverse transformation
+		 transform_to_exit_first_loop_alt, which is done in
+		 pass_parallelize_loops_oacc_kernels. */
+	      NEXT_PASS (pass_ch);
+	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_fre);
+	  /* Pass group that runs when the function is an offloaded function
+	     containing oacc kernels loops.  Part 2.  */
+	  NEXT_PASS (pass_oacc_kernels2);
+	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
+	      /* We use pass_lim to rewrite in-memory iteration and reduction
+	         variable accesses in loops into local variables accesses.  */
+	      NEXT_PASS (pass_lim);
+	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+	      NEXT_PASS (pass_lim);
+	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+	      NEXT_PASS (pass_dce);
+	      NEXT_PASS (pass_parallelize_loops_oacc_kernels);
+	      NEXT_PASS (pass_expand_omp_ssa);
+	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_merge_phi);
           NEXT_PASS (pass_dse);
 	  NEXT_PASS (pass_cd_dce);
diff --git a/gcc/tree-ssa-loop-ch.c b/gcc/tree-ssa-loop-ch.c
index 7e618bf..6493fcc 100644
--- a/gcc/tree-ssa-loop-ch.c
+++ b/gcc/tree-ssa-loop-ch.c
@@ -165,6 +165,8 @@ public:
   /* Initialize and finalize loop structures, copying headers inbetween.  */
   virtual unsigned int execute (function *);
 
+  opt_pass * clone () { return new pass_ch (m_ctxt); }
+
 protected:
   /* ch_base method: */
   virtual bool process_loop_p (struct loop *loop);
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 30b53ce..96f05f2 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -2496,7 +2496,7 @@ tree_ssa_lim_finalize (void)
 /* Moves invariants from loops.  Only "expensive" invariants are moved out --
    i.e. those that are likely to be win regardless of the register pressure.  */
 
-unsigned int
+static unsigned int
 tree_ssa_lim (void)
 {
   unsigned int todo;
@@ -2560,9 +2560,17 @@ public:
 unsigned int
 pass_lim::execute (function *fun)
 {
+  if (!loops_state_satisfies_p (LOOPS_NORMAL
+				| LOOPS_HAVE_RECORDED_EXITS))
+    loop_optimizer_init (LOOPS_NORMAL
+			 | LOOPS_HAVE_RECORDED_EXITS);
+
   if (number_of_loops (fun) <= 1)
     return 0;
 
+  if (!loops_state_satisfies_p (LOOP_CLOSED_SSA))
+    rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
+
   return tree_ssa_lim ();
 }
 


* Re: [PATCH, 6/16] Add pass_oacc_kernels
  2015-11-11 10:59   ` Richard Biener
@ 2015-11-19 13:51     ` Tom de Vries
  2015-11-24 12:17       ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-19 13:51 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

On 11/11/15 11:58, Richard Biener wrote:
> On Mon, 9 Nov 2015, Tom de Vries wrote:
>
>> On 09/11/15 16:35, Tom de Vries wrote:
>>> Hi,
>>>
>>> this patch series for stage1 trunk adds support to:
>>> - parallelize oacc kernels regions using parloops, and
>>> - map the loops onto the oacc gang dimension.
>>>
>>> The patch series contains these patches:
>>>
>>>        1    Insert new exit block only when needed in
>>>           transform_to_exit_first_loop_alt
>>>        2    Make create_parallel_loop return void
>>>        3    Ignore reduction clause on kernels directive
>>>        4    Implement -foffload-alias
>>>        5    Add in_oacc_kernels_region in struct loop
>>>        6    Add pass_oacc_kernels
>>>        7    Add pass_dominator_oacc_kernels
>>>        8    Add pass_ch_oacc_kernels
>>>        9    Add pass_parallelize_loops_oacc_kernels
>>>       10    Add pass_oacc_kernels pass group in passes.def
>>>       11    Update testcases after adding kernels pass group
>>>       12    Handle acc loop directive
>>>       13    Add c-c++-common/goacc/kernels-*.c
>>>       14    Add gfortran.dg/goacc/kernels-*.f95
>>>       15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>>       16    Add libgomp.oacc-fortran/kernels-*.f95
>>>
>>> The first 9 patches are more or less independent, but patches 10-16 are
>>> intended to be committed at the same time.
>>>
>>> Bootstrapped and reg-tested on x86_64.
>>>
>>> Build and reg-tested with nvidia accelerator, in combination with a
>>> patch that enables accelerator testing (which is submitted at
>>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>>
>>> I'll post the individual patches in reply to this message.
>>
>> this patch adds a pass group pass_oacc_kernels (which will be added to the
>> pass list as a whole in patch 10).
>
> Just to understand (while also skimming the HSA patches).
>
> You are basically relying on autopar for what the HSA patches call
> "gridification"?  That is, OMP lowering produces loopy kernels
> and autopar then will basically strip the outermost loop?

Short answer: no. In more detail...

Existing openmp support maps explicitly independent loops (annotated with
omp-for) in omp-parallel regions onto pthreads. It generates thread 
functions containing sequential loops that iterate on a subset of data 
of the original loop.

Parloops maps sequential loops onto pthreads by:
- proving the loop is independent
- identifying reductions
- rewriting the loop into an omp-for annotated loop
- wrapping the loop in an omp-parallel region
- rewriting the variable accesses in the loop such that they are
   relative to base pointers passed into the region
   (note: this bit is done by omplower for omp-for loops from source)
- rewriting the preloop-read and postloop-write pair of a reduction
   variable into an atomic update
- letting a subsequent ompexpand expand the omp-for and omp-parallel
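
The end result of these steps can be pictured with a small hand-written
sketch. It is not what GCC emits: struct region_data, thread_fn,
run_region and NTHREADS are hypothetical names, the chunking scheme is
an assumption, and the "threads" are run one after another so that the
example needs no threading library. It only shows a thread function
that iterates over a subset of the data, accesses relative to base
pointers passed into the region, and a reduction combined through an
atomic update.

```c
#include <assert.h>
#include <stdatomic.h>

#define NTHREADS 2   /* hypothetical; think -ftree-parallelize-loops=2 */

/* Hypothetical record of base pointers passed into the region.  */
struct region_data
{
  const unsigned int *a;
  unsigned int n;
  atomic_uint *sum;
};

/* Hypothetical thread function: a sequential loop over one chunk of
   the original iteration space.  */
static void
thread_fn (struct region_data *d, unsigned int tid)
{
  unsigned int chunk = (d->n + NTHREADS - 1) / NTHREADS;
  unsigned int lo = tid * chunk;
  unsigned int hi = lo + chunk < d->n ? lo + chunk : d->n;

  unsigned int local = 0;
  for (unsigned int i = lo; i < hi; i++)
    local += d->a[i];

  /* The preloop-read/postloop-write pair becomes one atomic update.  */
  atomic_fetch_add (d->sum, local);
}

/* Stand-in for the region: calls the thread function once per thread
   id, sequentially, instead of launching real threads.  */
static unsigned int
run_region (const unsigned int *a, unsigned int n, unsigned int init)
{
  atomic_uint sum;
  atomic_init (&sum, init);

  struct region_data d = { a, n, &sum };
  for (unsigned int tid = 0; tid < NTHREADS; tid++)
    thread_fn (&d, tid);

  return atomic_load (&sum);
}
```

For example, summing { 1, 2, 3, 4, 5 } with an initial value of 1 gives
16, the same result as the original sequential loop.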

The HSA support maps explicitly independent loops in openmp target
regions onto a shared memory accelerator. By default, it generates
kernel functions containing sequential loops that iterate on a subset of 
data of the original loop. The control flow has a performance penalty on 
the accelerator, so there's a concept called gridification (explained 
here: https://gcc.gnu.org/ml/gcc-patches/2015-11/msg00586.html ). [ I'm 
not sure if it is an additional transformation or a different style of 
generation ].  The gridification increases the launch dimensions of the 
kernels to a point that there's only one iteration left in the loop, 
which means that the control flow can be eliminated.
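
As a rough illustration of the difference (my own sketch, not taken
from the HSA patches): the kernel_* functions below model the code one
work item executes, and the launch_* functions are hypothetical
host-side loops standing in for the accelerator launch. The strided
distribution of iterations in the looping variant is an assumption.

```c
#include <assert.h>

/* Default style: each of nitems work items runs a sequential loop over
   its own subset of the iterations.  */
static void
kernel_looping (int *a, unsigned int n, unsigned int id,
                unsigned int nitems)
{
  for (unsigned int i = id; i < n; i += nitems)
    a[i] = 1;
}

/* Gridified style: the launch has one work item per iteration, so the
   loop control flow disappears from the kernel.  */
static void
kernel_gridified (int *a, unsigned int id)
{
  a[id] = 1;
}

/* Hypothetical stand-ins for launching the kernel on the device.  */
static void
launch_looping (int *a, unsigned int n, unsigned int nitems)
{
  for (unsigned int id = 0; id < nitems; id++)
    kernel_looping (a, n, id, nitems);
}

static void
launch_gridified (int *a, unsigned int n)
{
  for (unsigned int id = 0; id < n; id++)
    kernel_gridified (a, id);
}
```

Both launches write the same elements; only the launch dimension and
the presence of the loop inside the kernel differ.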

The openacc kernels support maps loops in an oacc kernels region onto a 
non-shared memory accelerator. These loops can be unannotated loops, or 
acc-loop annotated loops. If an acc-loop directive contains the 
independent clause, the loop is explicitly independent.

The current oacc kernels implementation mostly ignores the acc-loop
directive, in order to unify handling of annotated and unannotated
loops. The patch "Handle acc loop directive" (at
https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01089.html ) expands the
annotated loop as a sequential loop.
At the point that we get to pass_parallelize_loops_oacc_kernels, we have 
sequential loops in an offloaded function (atm, there's no support for 
the independent clause yet).

So pass_parallelize_loops_oacc_kernels transforms sequential loops in an 
offloaded function originating from a kernels region into explicitly 
independent loops by:
- proving the loop is independent
- identifying reductions
- rewriting the loop into an acc-loop annotated loop
- annotating the offloaded function with kernel launch dimensions
- rewriting the preloop-load and postloop-store pair of a reduction
   variable into an atomic update
- letting a subsequent ompexpand expand the acc-loop

I'd say there is no explicit gridification in there.

AFAIU, gridification is something that can result from determining the
launch dimensions of the offloaded function, and optimizing for those
dimensions. Currently pass_parallelize_loops_oacc_kernels is a place
where we set launch dimensions, but we're not optimizing for them there;
that happens later on. (And I'm starting to wonder whether I can get rid
of the setting of the gang dimension in pass_parallelize_loops_oacc_kernels).

Thanks,
- Tom


* Re: [PATCH, PR68373 ] Call scev_const_prop in pass_parallelize_loops::execute
  2015-11-19  9:36                     ` Tom de Vries
@ 2015-11-20 10:15                       ` Richard Biener
  0 siblings, 0 replies; 133+ messages in thread
From: Richard Biener @ 2015-11-20 10:15 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

On Thu, 19 Nov 2015, Tom de Vries wrote:

> On 17/11/15 23:20, Tom de Vries wrote:
> > [ was: Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def ]
> > 
> > Hi,
> > 
> > Consider test-case test.c, with a use of the final value of the
> > iteration variable (return i):
> > ...
> > unsigned int
> > foo (int *a, unsigned int n)
> > {
> >    unsigned int i;
> >    for (i = 0; i < n; ++i)
> >      a[i] = 1;
> > 
> >    return i;
> > }
> > ...
> > 
> > Compiled with:
> > ...
> > $ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details
> > ...
> > 
> > Before parloops, we have:
> > ...
> >   <bb 4>:
> >    # i_12 = PHI <0(3), i_10(5)>
> >    _5 = (long unsigned int) i_12;
> >    _6 = _5 * 4;
> >    _8 = a_7(D) + _6;
> >    *_8 = 1;
> >    i_10 = i_12 + 1;
> >    if (n_4(D) > i_10)
> >      goto <bb 5>;
> >    else
> >      goto <bb 6>;
> > 
> >    <bb 5>:
> >    goto <bb 4>;
> > 
> >    <bb 6>:
> >    # i_14 = PHI <n_4(D)(4), 0(2)>
> > ...
> > 
> > Parloops will fail because:
> > ...
> > phi is n_2 = PHI <n_4(D)(4)>
> > arg of phi to exit:   value n_4(D) used outside loop
> >    checking if it a part of reduction pattern:
> >    FAILED: it is not a part of reduction....
> > ...
> > [ note that the phi looks slightly different. In
> > gather_scalar_reductions -> vect_analyze_loop_form ->
> > vect_analyze_loop_form_1 -> split_loop_exit_edge we split the edge from
> > bb4 to bb6. ]
> > 
> > This patch uses scev_const_prop at the start of parloops.
> > scev_const_prop first also splits the exit edge, and then replaces the
> > phi with an assignment:
> > ...
> >   final value replacement:
> >    n_2 = PHI <n_4(D)(4)>
> >    with
> >    n_2 = n_4(D);
> > ...
> > 
> > This allows parloops to succeed.
> > 
> > And there's a similar story when we compile with -fno-tree-scev-cprop in
> > addition.
> > 
> > Bootstrapped and reg-tested on x86_64.
> > 
> > OK for stage3/stage1?
> 
> The patch has been updated to do the final value replacement only for the loop
> that parloops is processing, as suggested in review comment at
> https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02166.html .
> 
> That means the patch is now also required for the kernels patch series.
> 
> Bootstrapped and reg-tested on x86_64.
> 
> OK for stage 3 trunk?

Ok.  Please mention tree-optimization/68373 in the changelog.

Thanks,
Richard.

> Thanks,
> - Tom
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-19  0:35               ` [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def Tom de Vries
@ 2015-11-20 10:28                 ` Richard Biener
  2015-11-21  8:42                   ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-20 10:28 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

On Thu, 19 Nov 2015, Tom de Vries wrote:

> On 17/11/15 15:53, Tom de Vries wrote:
> > > And the above LIM example
> > > is none for why you need two LIM passes...
> > 
> > Indeed. I'm planning a separate reply to explain in more detail the need
> > for the two pass_lims.
> 
> I.
> 
> I managed to get rid of the two pass_lims for the motivating example that I
> used until now (goacc/kernels-double-reduction.c). I found that by adding a
> pass_dominator instance after pass_ch, I could get rid of the second pass_lim
> (and pass_copyprop as well).
> 
> But... then I wrote a counter example (goacc/kernels-double-reduction-n.c),
> and I'm back at two pass_lims (and two pass_dominators).
> Also I've split the pass group into a bit before and after pass_fre.
> 
> So, the current pass group looks like:
> ...
> NEXT_PASS (pass_build_ealias);
> 
> /* Pass group that runs when the function is an offloaded function
>    containing oacc kernels loops.  Part 1.  */
> NEXT_PASS (pass_oacc_kernels);
> PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
>     /* We need pass_ch here, because pass_lim has no effect on
>        exit-first loops (PR65442).  Ideally we want to remove both
>        this pass instantiation, and the reverse transformation
>        transform_to_exit_first_loop_alt, which is done in
>        pass_parallelize_loops_oacc_kernels. */
>     NEXT_PASS (pass_ch);
> POP_INSERT_PASSES ()
> 
> NEXT_PASS (pass_fre);
> 
> /* Pass group that runs when the function is an offloaded function
>    containing oacc kernels loops.  Part 2.  */
> NEXT_PASS (pass_oacc_kernels2);
> PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
>     /* We use pass_lim to rewrite in-memory iteration and reduction
>        variable accesses in loops into local variables accesses.  */
>     NEXT_PASS (pass_lim);
>     NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
>     NEXT_PASS (pass_lim);
>     NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
>     NEXT_PASS (pass_dce);
>     NEXT_PASS (pass_parallelize_loops_oacc_kernels);
>     NEXT_PASS (pass_expand_omp_ssa);
> POP_INSERT_PASSES ()
> NEXT_PASS (pass_merge_phi);
> ...
> 
> 
> II.
> 
> The motivating test-case kernels-double-reduction-n.c:
> ...
> #include <stdlib.h>
> 
> #define N 500
> 
> unsigned int a[N][N];
> 
> void  __attribute__((noinline,noclone))
> foo (unsigned int n)
> {
>   int i, j;
>   unsigned int sum = 1;
> 
> #pragma acc kernels copyin (a[0:n]) copy (sum)
>   {
>     for (i = 0; i < n; ++i)
>       for (j = 0; j < n; ++j)
>         sum += a[i][j];
>   }
> 
>   if (sum != 5001)
>     abort ();
> }
> ...
> 
> 
> III.
> 
> Before first pass_lim. Note no phis on inner or outer loop header for
> iteration variables or reduction variable:
> ...
>   <bb 2>:
>   _5 = *.omp_data_i_4(D).i;
>   *_5 = 0;
>   _44 = *.omp_data_i_4(D).n;
>   _45 = *_44;
>   if (_45 != 0)
>     goto <bb 4>;
>   else
>     goto <bb 3>;
> 
>   <bb 4>: outer loop header
>   _12 = *.omp_data_i_4(D).j;
>   *_12 = 0;
>   if (_45 != 0)
>     goto <bb 6>;
>   else
>     goto <bb 5>;
> 
>   <bb 6>: inner loop header, latch
>   _19 = *.omp_data_i_4(D).a;
>   _21 = *_5;
>   _23 = *_12;
>   _24 = *_19[_21][_23];
>   _25 = *.omp_data_i_4(D).sum;
>   sum.0_26 = *_25;
>   sum.1_27 = _24 + sum.0_26;
>   *_25 = sum.1_27;
>   _33 = _23 + 1;
>   *_12 = _33;
>   j.2_16 = (unsigned int) _33;
>   if (j.2_16 < _45)
>     goto <bb 6>;
>   else
>     goto <bb 5>;
> 
>   <bb 5>: outer loop latch
>   _36 = *_5;
>   _38 = _36 + 1;
>   *_5 = _38;
>   i.3_9 = (unsigned int) _38;
>   if (i.3_9 < _45)
>     goto <bb 4>;
>   else
>     goto <bb 3>;
> 
>   <bb 3>:
>   return;
> ...
> 
> 
> IV.
> 
> After first pass_lim/pass_dom pair. Note there are phis on the inner loop
> header for the reduction and the iteration variable, but not on the outer loop
> header:
> ...
>   <bb 2>:
>   _5 = *.omp_data_i_4(D).i;
>   *_5 = 0;
>   _44 = *.omp_data_i_4(D).n;
>   _45 = *_44;
>   if (_45 != 0)
>     goto <bb 4>;
>   else
>     goto <bb 3>;
> 
>   <bb 4>:
>   _12 = *.omp_data_i_4(D).j;
>   _19 = *.omp_data_i_4(D).a;
>   D__lsm.10_50 = *_12;
>   D__lsm.11_51 = 0;
>   _25 = *.omp_data_i_4(D).sum;
> 
>   <bb 5>: outer loop header
>   D__lsm.10_20 = 0;
>   D__lsm.11_22 = 1;
>   _21 = *_5;
>   D__lsm.12_28 = *_25;
>   D__lsm.13_30 = 0;
>   goto <bb 7>;
> 
>   <bb 7>: inner loop header, latch
>   # D__lsm.10_47 = PHI <0(5), _33(7)>
>   # D__lsm.12_49 = PHI <D__lsm.12_28(5), sum.1_27(7)>
>   _23 = D__lsm.10_47;
>   _24 = *_19[_21][D__lsm.10_47];
>   sum.0_26 = D__lsm.12_49;
>   sum.1_27 = _24 + D__lsm.12_49;
>   D__lsm.12_31 = sum.1_27;
>   D__lsm.13_32 = 1;
>   _33 = D__lsm.10_47 + 1;
>   D__lsm.10_14 = _33;
>   D__lsm.11_15 = 1;
>   j.2_16 = (unsigned int) _33;
>   if (j.2_16 < _45)
>     goto <bb 7>;
>   else
>     goto <bb 8>;
> 
>   <bb 8>: outer loop latch
>   # D__lsm.10_35 = PHI <_33(7)>
>   # D__lsm.11_37 = PHI <1(7)>
>   # D__lsm.12_7 = PHI <sum.1_27(7)>
>   # D__lsm.13_8 = PHI <1(7)>
>   *_25 = sum.1_27;
>   _36 = *_5;
>   _38 = _36 + 1;
>   *_5 = _38;
>   i.3_9 = (unsigned int) _38;
>   if (i.3_9 < _45)
>     goto <bb 5>;
>   else
>     goto <bb 6>;
> 
>   <bb 6>:
>   # D__lsm.10_10 = PHI <_33(8)>
>   # D__lsm.11_11 = PHI <1(8)>
>   *_12 = _33;
>   goto <bb 3>;
> 
>   <bb 3>:
>   return;
> ...
> 
> 
> V.
> 
> After second pass_lim/pass_dom pair. Note there are phis on the inner and
> outer loop header for the reduction and the iteration variables:
> ...
>   <bb 2>:
>   _5 = *.omp_data_i_4(D).i;
>   *_5 = 0;
>   _44 = *.omp_data_i_4(D).n;
>   _45 = *_44;
>   if (_45 != 0)
>     goto <bb 4>;
>   else
>     goto <bb 3>;
> 
>   <bb 4>:
>   _12 = *.omp_data_i_4(D).j;
>   _19 = *.omp_data_i_4(D).a;
>   D__lsm.10_50 = *_12;
>   D__lsm.11_51 = 0;
>   _25 = *.omp_data_i_4(D).sum;
>   D__lsm.14_40 = 0;
>   D__lsm.15_2 = 0;
>   D__lsm.16_1 = *_25;
>   D__lsm.17_46 = 0;
> 
>   <bb 5>: outer loop header
>   # D__lsm.14_13 = PHI <0(4), _38(8)>
>   # D__lsm.16_34 = PHI <D__lsm.16_1(4), sum.1_27(8)>
>   D__lsm.10_20 = 0;
>   D__lsm.11_22 = 1;
>   _21 = D__lsm.14_13;
>   D__lsm.12_28 = D__lsm.16_34;
>   D__lsm.13_30 = 0;
>   goto <bb 7>;
> 
>   <bb 7>: inner loop header, latch
>   # D__lsm.10_47 = PHI <0(5), _33(7)>
>   # D__lsm.12_49 = PHI <D__lsm.16_34(5), sum.1_27(7)>
>   _23 = D__lsm.10_47;
>   _24 = *_19[D__lsm.14_13][D__lsm.10_47];
>   sum.0_26 = D__lsm.12_49;
>   sum.1_27 = _24 + D__lsm.12_49;
>   D__lsm.12_31 = sum.1_27;
>   D__lsm.13_32 = 1;
>   _33 = D__lsm.10_47 + 1;
>   D__lsm.10_14 = _33;
>   D__lsm.11_15 = 1;
>   j.2_16 = (unsigned int) _33;
>   if (j.2_16 < _45)
>     goto <bb 7>;
>   else
>     goto <bb 8>;
> 
>   <bb 8>: outer loop latch
>   # D__lsm.10_35 = PHI <_33(7)>
>   # D__lsm.11_37 = PHI <1(7)>
>   # D__lsm.12_7 = PHI <sum.1_27(7)>
>   # D__lsm.13_8 = PHI <1(7)>
>   # sum.1_48 = PHI <sum.1_27(7)>
>   # _53 = PHI <_33(7)>
>   D__lsm.16_56 = sum.1_27;
>   D__lsm.17_57 = 1;
>   _36 = D__lsm.14_13;
>   _38 = D__lsm.14_13 + 1;
>   D__lsm.14_58 = _38;
>   D__lsm.15_59 = 1;
>   i.3_9 = (unsigned int) _38;
>   if (i.3_9 < _45)
>     goto <bb 5>;
>   else
>     goto <bb 6>;
> 
>   <bb 6>:
>   # D__lsm.10_10 = PHI <_33(8)>
>   # D__lsm.11_11 = PHI <1(8)>
>   # _43 = PHI <_33(8)>
>   # D__lsm.16_62 = PHI <sum.1_27(8)>
>   # D__lsm.17_63 = PHI <1(8)>
>   # D__lsm.14_64 = PHI <_38(8)>
>   # D__lsm.15_65 = PHI <1(8)>
>   *_5 = _38;
>   *_25 = sum.1_27;
>   *_12 = _33;
>   goto <bb 3>;
> 
>   <bb 3>:
>   return;
> ...

Sorry, but staring at the dumps doesn't help me understand the issue you're
running into.  Where can I reproduce this if I have time to look at it?

From the dump below I understand you want no memory references in
the outer loop?  So the issue seems to be that store motion fails
to insert the preheader load / exit store to the outermost loop
possible and thus another LIM pass is needed to "store motion" those
again?  But a simple testcase

int a;
int *p = &a;
int foo (int n)
{
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < 100; ++j)
      *p += j + i;
  return a;
}

shows that LIM can do this in one step.  Which means it should
be investigated why it doesn't do this properly for your testcase
(store motion of *_25).
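
For illustration, the effect that store motion is expected to have on a
testcase of this shape can be written out by hand.  This is only a sketch:
foo_sm and check_equivalent are hypothetical names, and the unconditional
load and store are valid here only because p is known to point at a.

```c
#include <assert.h>

static int a;
static int *p = &a;

/* Original shape: *p is loaded and stored in every inner iteration.  */
static int
foo (int n)
{
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < 100; ++j)
      *p += j + i;
  return a;
}

/* Hand-written equivalent (hypothetical): one load before the outermost
   loop, one store after it, and a local accumulator in between.  */
static int
foo_sm (int n)
{
  int *q = p;
  int tmp = *q;			/* preheader load */
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < 100; ++j)
      tmp += j + i;
  *q = tmp;			/* exit store */
  return a;
}

/* Run both versions from the same starting value and compare.  */
static void
check_equivalent (int n)
{
  a = 0;
  int expected = foo (n);
  a = 0;
  int actual = foo_sm (n);
  assert (expected == actual);
}
```

In foo_sm neither loop contains a memory reference to *p, which is the
property wanted for the outer loop of the kernels testcase.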

Simply adding two LIM passes either papers over a wrong-code
bug (in LIM or in DOM) or over a missed-optimization in LIM.

Richard.
 
> 
> VI.
> 
> After pass_dce, so before parloops-oacc-kernels:
> ...
>   <bb 2>:
>   _5 = *.omp_data_i_4(D).i;
>   *_5 = 0;
>   _44 = *.omp_data_i_4(D).n;
>   _45 = *_44;
>   if (_45 != 0)
>     goto <bb 4>;
>   else
>     goto <bb 3>;
> 
>   <bb 4>:
>   _12 = *.omp_data_i_4(D).j;
>   _19 = *.omp_data_i_4(D).a;
>   _25 = *.omp_data_i_4(D).sum;
>   D__lsm.16_1 = *_25;
> 
>   <bb 5>: outer loop header
>   # D__lsm.14_13 = PHI <0(4), _38(8)>
>   # D__lsm.16_34 = PHI <D__lsm.16_1(4), sum.1_27(8)>
>   goto <bb 7>;
> 
>   <bb 7>: inner loop header, latch
>   # D__lsm.10_47 = PHI <0(5), _33(7)>
>   # D__lsm.12_49 = PHI <D__lsm.16_34(5), sum.1_27(7)>
>   _24 = *_19[D__lsm.14_13][D__lsm.10_47];
>   sum.1_27 = _24 + D__lsm.12_49;
>   _33 = D__lsm.10_47 + 1;
>   j.2_16 = (unsigned int) _33;
>   if (j.2_16 < _45)
>     goto <bb 7>;
>   else
>     goto <bb 8>;
> 
>   <bb 8>: outer loop latch
>   _38 = D__lsm.14_13 + 1;
>   i.3_9 = (unsigned int) _38;
>   if (i.3_9 < _45)
>     goto <bb 5>;
>   else
>     goto <bb 6>;
> 
>   <bb 6>:
>   *_5 = _38;
>   *_25 = sum.1_27;
>   *_12 = _33;
>   goto <bb 3>;
> 
>   <bb 3>:
>   return;
> ...
> 
> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-19 10:31         ` Tom de Vries
@ 2015-11-20 10:37           ` Richard Biener
  2015-11-20 13:27             ` Tom de Vries
  2015-11-22 23:37             ` [PATCH] Don't reapply loops flags if unnecessary in loop_optimizer_init Tom de Vries
  0 siblings, 2 replies; 133+ messages in thread
From: Richard Biener @ 2015-11-20 10:37 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Thu, 19 Nov 2015, Tom de Vries wrote:

> On 16/11/15 13:45, Richard Biener wrote:
> > > I've eliminated all the uses for pass_tree_loop_init/pass_tree_loop_done
> > > in
> > > >the pass group. Instead, I've added conditional loop optimizer setup in:
> > > >-  pass_lim and pass_scev_cprop (added in this patch), and
> 
> Reposting the "Add pass_oacc_kernels pass group in passes.def" patch.
> 
> pass_scev_cprop is no longer part of the pass group.
> 
> And I've dropped the scev_initialize in pass_lim.
> 
> Pass_lim is part of the pass_tree_loop pass group, where AFAIU scev info is
> initialized at the start of the pass group and updated or reset by passes in
> the pass group if necessary, such that it's always available, or can be
> recalculated on the spot.
> 
> First, pass_lim doesn't invalidate scev info. And second, AFAIU pass_lim
> doesn't use scev info. So there doesn't seem to be a need to do anything about
> scev info for using pass_lim outside pass_tree_loop.
> 
> > > >- pass_parallelize_loops_oacc_kernels (added in patch "Add
> > > >   pass_parallelize_loops_oacc_kernels").
> > You miss calling scev_finalize ().
> 
> I've added the scev_finalize () in patch "Add
> pass_parallelize_loops_oacc_kernels".

 pass_lim::execute (function *fun)
 {
+  if (!loops_state_satisfies_p (LOOPS_NORMAL
+                               | LOOPS_HAVE_RECORDED_EXITS))
+    loop_optimizer_init (LOOPS_NORMAL
+                        | LOOPS_HAVE_RECORDED_EXITS);
+

Note that, when not in the loop pipeline, this will not properly
fix up loops if LOOPS_NEED_FIXUP is set (that doesn't clear other
loop flags).  I'd rather make loop_optimizer_init do nothing
if the requested flags are already set and no fixup is needed, and
call the above unconditionally.  Thus something like:

Index: gcc/loop-init.c
===================================================================
--- gcc/loop-init.c     (revision 230649)
+++ gcc/loop-init.c     (working copy)
@@ -103,7 +103,11 @@ loop_optimizer_init (unsigned flags)
       calculate_dominance_info (CDI_DOMINATORS);
 
       if (!needs_fixup)
-       checking_verify_loop_structure ();
+       {
+         checking_verify_loop_structure ();
+         if (loops_state_satisfies_p (flags))
+           goto out;
+       }
 
       /* Clear all flags.  */
       if (recorded_exits)
@@ -122,11 +126,12 @@ loop_optimizer_init (unsigned flags)
   /* Apply flags to loops.  */
   apply_loop_flags (flags);
 
+  checking_verify_loop_structure ();
+
+out:
   /* Dump loops.  */
   flow_loops_dump (dump_file, NULL, 1);
 
-  checking_verify_loop_structure ();
-
   timevar_pop (TV_LOOP_INIT);
 }
 



   if (number_of_loops (fun) <= 1)
     return 0;
 
+  if (!loops_state_satisfies_p (LOOP_CLOSED_SSA))
+    rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
+
   return tree_ssa_lim ();
 }

That looks bogus.  The into-loop-closed SSA rewrite should
only be done if the state _satisfies_ it.  I understand LIM doesn't
require loop-closed SSA, but it obviously doesn't destroy it either.
So just remove that.



> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

* [committed, trivial] Fix typo and trailing whitespace in dump-file strings in parloops
  2015-11-18 16:22                     ` Bernhard Reutner-Fischer
@ 2015-11-20 12:53                       ` Tom de Vries
  0 siblings, 0 replies; 133+ messages in thread
From: Tom de Vries @ 2015-11-20 12:53 UTC (permalink / raw)
  To: Bernhard Reutner-Fischer, Richard Biener
  Cc: Richard Biener, gcc-patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 572 bytes --]

[ was: Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def ]

On 18/11/15 17:22, Bernhard Reutner-Fischer wrote:
> Bonus points for fixing the dump_file to parse in:
>
>> >Parloops will fail because:
>> >...
>> >phi is n_2 = PHI <n_4(D)(4)>
>> >arg of phi to exit: value n_4(D) used outside loop
>> >checking if it a part of reduction pattern:
> s/it a/it is/
>

This patch fixes a typo and trailing whitespace in dump-file strings in 
parloops.

Built for C and Fortran; tested the -fdump-tree-parloops testcases.

Committed to trunk as trivial.

Thanks,
- Tom

[-- Attachment #2: 0001-Fix-typo-and-trailing-whitespace-in-dump-file-strings-in-parloops.patch --]
[-- Type: text/x-patch, Size: 1244 bytes --]

Fix typo and trailing whitespace in dump-file strings in parloops

2015-11-19  Tom de Vries  <tom@codesourcery.com>

	* tree-parloops.c (build_new_reduction): Fix trailing whitespace in
	dump-file string.
	(try_create_reduction_list): Same.  Fix typo in dump-file string.

---
 gcc/tree-parloops.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
index 8d7912d..aca2370 100644
--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c
@@ -2383,7 +2383,7 @@ build_new_reduction (reduction_info_table_type *reduction_list,
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
       fprintf (dump_file,
-	       "Detected reduction. reduction stmt is: \n");
+	       "Detected reduction. reduction stmt is:\n");
       print_gimple_stmt (dump_file, reduc_stmt, 0, 0);
       fprintf (dump_file, "\n");
     }
@@ -2564,7 +2564,7 @@ try_create_reduction_list (loop_p loop,
 	      print_generic_expr (dump_file, val, 0);
 	      fprintf (dump_file, " used outside loop\n");
 	      fprintf (dump_file,
-		       "  checking if it a part of reduction pattern:  \n");
+		       "  checking if it is part of reduction pattern:\n");
 	    }
 	  if (reduction_list->elements () == 0)
 	    {

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-20 10:37           ` Richard Biener
@ 2015-11-20 13:27             ` Tom de Vries
  2015-11-20 13:29               ` Richard Biener
  2015-11-22 23:37             ` [PATCH] Don't reapply loops flags if unnecessary in loop_optimizer_init Tom de Vries
  1 sibling, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-20 13:27 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

On 20/11/15 11:37, Richard Biener wrote:
>    I'd rather make loop_optimizer_init do nothing
> if requested flags are already set and no fixup is needed

> Thus sth like
>
> Index: gcc/loop-init.c
> ===================================================================
> --- gcc/loop-init.c     (revision 230649)
> +++ gcc/loop-init.c     (working copy)
> @@ -103,7 +103,11 @@ loop_optimizer_init (unsigned flags)
>         calculate_dominance_info (CDI_DOMINATORS);
>
>         if (!needs_fixup)
> -       checking_verify_loop_structure ();
> +       {
> +         checking_verify_loop_structure ();
> +         if (loops_state_satisfies_p (flags))
> +           goto out;

What about flags that are present in the loops state, but not requested 
in flags? Should we try to clear those flags?

Thanks,
- Tom

> +       }
>
>         /* Clear all flags.  */
>         if (recorded_exits)
> @@ -122,11 +126,12 @@ loop_optimizer_init (unsigned flags)
>     /* Apply flags to loops.  */
>     apply_loop_flags (flags);
>
> +  checking_verify_loop_structure ();
> +
> +out:
>     /* Dump loops.  */
>     flow_loops_dump (dump_file, NULL, 1);
>
> -  checking_verify_loop_structure ();
> -
>     timevar_pop (TV_LOOP_INIT);
>   }

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-20 13:27             ` Tom de Vries
@ 2015-11-20 13:29               ` Richard Biener
  2015-11-20 16:34                 ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-20 13:29 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Fri, 20 Nov 2015, Tom de Vries wrote:

> On 20/11/15 11:37, Richard Biener wrote:
> >    I'd rather make loop_optimizer_init do nothing
> > if requested flags are already set and no fixup is needed
> 
> > Thus sth like
> > 
> > Index: gcc/loop-init.c
> > ===================================================================
> > --- gcc/loop-init.c     (revision 230649)
> > +++ gcc/loop-init.c     (working copy)
> > @@ -103,7 +103,11 @@ loop_optimizer_init (unsigned flags)
> >         calculate_dominance_info (CDI_DOMINATORS);
> > 
> >         if (!needs_fixup)
> > -       checking_verify_loop_structure ();
> > +       {
> > +         checking_verify_loop_structure ();
> > +         if (loops_state_satisfies_p (flags))
> > +           goto out;
> 
> What about flags that are present in the loops state, but not requested in
> flags? Should we try to clear those flags?

No, I don't think so, that would break in-loop-pipeline LIM, dropping
loop-closed SSA for example.

I agree it's somewhat of an odd behavior but all passes should
either be placed in a sub-pipeline with an outer 
loop_optimizer_init()/finalize () call or call both themselves.
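
The behavior under discussion can be modelled with a small toy, sketched
here under loud assumptions: every toy_* name and flag value below is
hypothetical and only mimics the shape of the proposed early-out; it is
not the GCC implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for loop state flags; not the GCC definitions.  */
enum
{
  TOY_NORMAL = 1u << 0,
  TOY_RECORDED_EXITS = 1u << 1,
  TOY_NEED_FIXUP = 1u << 2,
  TOY_CLOSED_SSA = 1u << 3
};

static unsigned toy_state;
static int toy_reapply_count;

static bool
toy_state_satisfies_p (unsigned flags)
{
  return (toy_state & flags) == flags;
}

static void
toy_optimizer_init (unsigned flags)
{
  bool needs_fixup = toy_state_satisfies_p (TOY_NEED_FIXUP);

  /* Early out: nothing to fix and everything requested is present.
     Flags that were not requested are deliberately left alone.  */
  if (!needs_fixup && toy_state_satisfies_p (flags))
    return;

  /* Slow path: drop everything, then apply exactly what was asked for.  */
  toy_state = flags;
  toy_reapply_count++;
}

/* Walk through the two cases discussed in the thread.  */
static void
toy_demo (void)
{
  /* In-pipeline case: the extra flag survives the call.  */
  toy_state = TOY_NORMAL | TOY_RECORDED_EXITS | TOY_CLOSED_SSA;
  toy_reapply_count = 0;
  toy_optimizer_init (TOY_NORMAL | TOY_RECORDED_EXITS);
  assert (toy_state_satisfies_p (TOY_CLOSED_SSA));
  assert (toy_reapply_count == 0);

  /* Fixup pending: the slow path runs and the extra flag is dropped.  */
  toy_state |= TOY_NEED_FIXUP;
  toy_optimizer_init (TOY_NORMAL | TOY_RECORDED_EXITS);
  assert (!toy_state_satisfies_p (TOY_CLOSED_SSA));
  assert (toy_reapply_count == 1);
}
```

In this model, clearing unrequested flags on the early-out path would be
the step that loses the toy equivalent of loop-closed SSA.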

Richard.

> Thanks,
> - Tom
> 
> > +       }
> > 
> >         /* Clear all flags.  */
> >         if (recorded_exits)
> > @@ -122,11 +126,12 @@ loop_optimizer_init (unsigned flags)
> >     /* Apply flags to loops.  */
> >     apply_loop_flags (flags);
> > 
> > +  checking_verify_loop_structure ();
> > +
> > +out:
> >     /* Dump loops.  */
> >     flow_loops_dump (dump_file, NULL, 1);
> > 
> > -  checking_verify_loop_structure ();
> > -
> >     timevar_pop (TV_LOOP_INIT);
> >   }
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-20 13:29               ` Richard Biener
@ 2015-11-20 16:34                 ` Tom de Vries
  2015-11-23 10:11                   ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-20 16:34 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

On 20/11/15 14:29, Richard Biener wrote:
> I agree it's somewhat of an odd behavior but all passes should
> either be placed in a sub-pipeline with an outer
> loop_optimizer_init()/finalize () call or call both themselves.

Hmm, but adding loop_optimizer_finalize at the end of pass_lim breaks 
the loop pipeline.

We could use the style used in pass_slp_vectorize::execute:
...
pass_slp_vectorize::execute (function *fun)
{
   basic_block bb;

   bool in_loop_pipeline = scev_initialized_p ();
   if (!in_loop_pipeline)
     {
       loop_optimizer_init (LOOPS_NORMAL);
       scev_initialize ();
     }

   ...

   if (!in_loop_pipeline)
     {
       scev_finalize ();
       loop_optimizer_finalize ();
     }
...

Although that doesn't strike me as particularly clean.
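
Reduced to its control flow, that style amounts to "finalize only what
you initialized yourself".  A minimal sketch, assuming a single boolean
that records whether the surrounding pipeline has already set things up
(all toy_* names are hypothetical and stand in for the real init and
finalize calls):

```c
#include <assert.h>
#include <stdbool.h>

static bool toy_infra_ready;
static int toy_init_calls;
static int toy_fini_calls;

static void
toy_infra_init (void)
{
  toy_infra_ready = true;
  toy_init_calls++;
}

static void
toy_infra_fini (void)
{
  toy_infra_ready = false;
  toy_fini_calls++;
}

static void
toy_pass_execute (void)
{
  /* Remember whether an enclosing pipeline already did the setup.  */
  bool in_pipeline = toy_infra_ready;
  if (!in_pipeline)
    toy_infra_init ();

  /* The pass body may rely on the infrastructure from here on.  */
  assert (toy_infra_ready);

  /* Tear down only if this pass did the setup itself.  */
  if (!in_pipeline)
    toy_infra_fini ();
}
```

Run standalone, the pass leaves no state behind; run inside a pipeline,
it leaves the pipeline's state untouched.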

Thanks,
- Tom

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-20 10:28                 ` Richard Biener
@ 2015-11-21  8:42                   ` Tom de Vries
  2015-11-23 11:31                     ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-21  8:42 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 10151 bytes --]

On 20/11/15 11:28, Richard Biener wrote:
> On Thu, 19 Nov 2015, Tom de Vries wrote:
>
>> >On 17/11/15 15:53, Tom de Vries wrote:
>>>> > > >And the above LIM example
>>>> > > >is none for why you need two LIM passes...
>>> > >
>>> > >Indeed. I'm planning a separate reply to explain in more detail the need
>>> > >for the two pass_lims.
>> >
>> >I.
>> >
>> >I managed to get rid of the two pass_lims for the motivating example that I
>> >used until now (goacc/kernels-double-reduction.c). I found that by adding a
>> >pass_dominator instance after pass_ch, I could get rid of the second pass_lim
>> >(and pass_copyprop as well).
>> >
>> >But... then I wrote a counter example (goacc/kernels-double-reduction-n.c),
>> >and I'm back at two pass_lims (and two pass_dominators).
>> >Also I've split the pass group into a bit before and after pass_fre.
>> >
>> >So, the current pass group looks like:
>> >...
>> >NEXT_PASS (pass_build_ealias);
>> >
>> >/* Pass group that runs when the function is an offloaded function
>> >    containing oacc kernels loops.  Part 1.  */
>> >NEXT_PASS (pass_oacc_kernels);
>> >PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
>> >     /* We need pass_ch here, because pass_lim has no effect on
>> >        exit-first loops (PR65442).  Ideally we want to remove both
>> >        this pass instantiation, and the reverse transformation
>> >        transform_to_exit_first_loop_alt, which is done in
>> >        pass_parallelize_loops_oacc_kernels. */
>> >     NEXT_PASS (pass_ch);
>> >POP_INSERT_PASSES ()
>> >
>> >NEXT_PASS (pass_fre);
>> >
>> >/* Pass group that runs when the function is an offloaded function
>> >    containing oacc kernels loops.  Part 2.  */
>> >NEXT_PASS (pass_oacc_kernels2);
>> >PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
>> >     /* We use pass_lim to rewrite in-memory iteration and reduction
>> >        variable accesses in loops into local variables accesses.  */
>> >     NEXT_PASS (pass_lim);
>> >     NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
>> >     NEXT_PASS (pass_lim);
>> >     NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
>> >     NEXT_PASS (pass_dce);
>> >     NEXT_PASS (pass_parallelize_loops_oacc_kernels);
>> >     NEXT_PASS (pass_expand_omp_ssa);
>> >POP_INSERT_PASSES ()
>> >NEXT_PASS (pass_merge_phi);
>> >...
>> >
>> >
>> >II.
>> >
>> >The motivating test-case kernels-double-reduction-n.c:
>> >...
>> >#include <stdlib.h>
>> >
>> >#define N 500
>> >
>> >unsigned int a[N][N];
>> >
>> >void  __attribute__((noinline,noclone))
>> >foo (unsigned int n)
>> >{
>> >   int i, j;
>> >   unsigned int sum = 1;
>> >
>> >#pragma acc kernels copyin (a[0:n]) copy (sum)
>> >   {
>> >     for (i = 0; i < n; ++i)
>> >       for (j = 0; j < n; ++j)
>> >         sum += a[i][j];
>> >   }
>> >
>> >   if (sum != 5001)
>> >     abort ();
>> >}
>> >...
>> >
>> >
>> >III.
>> >
>> >Before first pass_lim. Note no phis on inner or outer loop header for
>> >iteration variables or reduction variable:
>> >...
>> >   <bb 2>:
>> >   _5 = *.omp_data_i_4(D).i;
>> >   *_5 = 0;
>> >   _44 = *.omp_data_i_4(D).n;
>> >   _45 = *_44;
>> >   if (_45 != 0)
>> >     goto <bb 4>;
>> >   else
>> >     goto <bb 3>;
>> >
>> >   <bb 4>: outer loop header
>> >   _12 = *.omp_data_i_4(D).j;
>> >   *_12 = 0;
>> >   if (_45 != 0)
>> >     goto <bb 6>;
>> >   else
>> >     goto <bb 5>;
>> >
>> >   <bb 6>: inner loop header, latch
>> >   _19 = *.omp_data_i_4(D).a;
>> >   _21 = *_5;
>> >   _23 = *_12;
>> >   _24 = *_19[_21][_23];
>> >   _25 = *.omp_data_i_4(D).sum;
>> >   sum.0_26 = *_25;
>> >   sum.1_27 = _24 + sum.0_26;
>> >   *_25 = sum.1_27;
>> >   _33 = _23 + 1;
>> >   *_12 = _33;
>> >   j.2_16 = (unsigned int) _33;
>> >   if (j.2_16 < _45)
>> >     goto <bb 6>;
>> >   else
>> >     goto <bb 5>;
>> >
>> >   <bb 5>: outer loop latch
>> >   _36 = *_5;
>> >   _38 = _36 + 1;
>> >   *_5 = _38;
>> >   i.3_9 = (unsigned int) _38;
>> >   if (i.3_9 < _45)
>> >     goto <bb 4>;
>> >   else
>> >     goto <bb 3>;
>> >
>> >   <bb 3>:
>> >   return;
>> >...
>> >
>> >
>> >IV.
>> >
>> >After first pass_lim/pass_dom pair. Note there are phis on the inner loop
>> >header for the reduction and the iteration variable, but not on the outer loop
>> >header:
>> >...
>> >   <bb 2>:
>> >   _5 = *.omp_data_i_4(D).i;
>> >   *_5 = 0;
>> >   _44 = *.omp_data_i_4(D).n;
>> >   _45 = *_44;
>> >   if (_45 != 0)
>> >     goto <bb 4>;
>> >   else
>> >     goto <bb 3>;
>> >
>> >   <bb 4>:
>> >   _12 = *.omp_data_i_4(D).j;
>> >   _19 = *.omp_data_i_4(D).a;
>> >   D__lsm.10_50 = *_12;
>> >   D__lsm.11_51 = 0;
>> >   _25 = *.omp_data_i_4(D).sum;
>> >
>> >   <bb 5>: outer loop header
>> >   D__lsm.10_20 = 0;
>> >   D__lsm.11_22 = 1;
>> >   _21 = *_5;
>> >   D__lsm.12_28 = *_25;
>> >   D__lsm.13_30 = 0;
>> >   goto <bb 7>;
>> >
>> >   <bb 7>: inner loop header, latch
>> >   # D__lsm.10_47 = PHI <0(5), _33(7)>
>> >   # D__lsm.12_49 = PHI <D__lsm.12_28(5), sum.1_27(7)>
>> >   _23 = D__lsm.10_47;
>> >   _24 = *_19[_21][D__lsm.10_47];
>> >   sum.0_26 = D__lsm.12_49;
>> >   sum.1_27 = _24 + D__lsm.12_49;
>> >   D__lsm.12_31 = sum.1_27;
>> >   D__lsm.13_32 = 1;
>> >   _33 = D__lsm.10_47 + 1;
>> >   D__lsm.10_14 = _33;
>> >   D__lsm.11_15 = 1;
>> >   j.2_16 = (unsigned int) _33;
>> >   if (j.2_16 < _45)
>> >     goto <bb 7>;
>> >   else
>> >     goto <bb 8>;
>> >
>> >   <bb 8>: outer loop latch
>> >   # D__lsm.10_35 = PHI <_33(7)>
>> >   # D__lsm.11_37 = PHI <1(7)>
>> >   # D__lsm.12_7 = PHI <sum.1_27(7)>
>> >   # D__lsm.13_8 = PHI <1(7)>
>> >   *_25 = sum.1_27;
>> >   _36 = *_5;
>> >   _38 = _36 + 1;
>> >   *_5 = _38;
>> >   i.3_9 = (unsigned int) _38;
>> >   if (i.3_9 < _45)
>> >     goto <bb 5>;
>> >   else
>> >     goto <bb 6>;
>> >
>> >   <bb 6>:
>> >   # D__lsm.10_10 = PHI <_33(8)>
>> >   # D__lsm.11_11 = PHI <1(8)>
>> >   *_12 = _33;
>> >   goto <bb 3>;
>> >
>> >   <bb 3>:
>> >   return;
>> >...
>> >
>> >
>> >V.
>> >
>> >After second pass_lim/pass_dom pair. Note there are phis on the inner and
>> >outer loop header for the reduction and the iteration variables:
>> >...
>> >   <bb 2>:
>> >   _5 = *.omp_data_i_4(D).i;
>> >   *_5 = 0;
>> >   _44 = *.omp_data_i_4(D).n;
>> >   _45 = *_44;
>> >   if (_45 != 0)
>> >     goto <bb 4>;
>> >   else
>> >     goto <bb 3>;
>> >
>> >   <bb 4>:
>> >   _12 = *.omp_data_i_4(D).j;
>> >   _19 = *.omp_data_i_4(D).a;
>> >   D__lsm.10_50 = *_12;
>> >   D__lsm.11_51 = 0;
>> >   _25 = *.omp_data_i_4(D).sum;
>> >   D__lsm.14_40 = 0;
>> >   D__lsm.15_2 = 0;
>> >   D__lsm.16_1 = *_25;
>> >   D__lsm.17_46 = 0;
>> >
>> >   <bb 5>: outer loop header
>> >   # D__lsm.14_13 = PHI <0(4), _38(8)>
>> >   # D__lsm.16_34 = PHI <D__lsm.16_1(4), sum.1_27(8)>
>> >   D__lsm.10_20 = 0;
>> >   D__lsm.11_22 = 1;
>> >   _21 = D__lsm.14_13;
>> >   D__lsm.12_28 = D__lsm.16_34;
>> >   D__lsm.13_30 = 0;
>> >   goto <bb 7>;
>> >
>> >   <bb 7>: inner loop header, latch
>> >   # D__lsm.10_47 = PHI <0(5), _33(7)>
>> >   # D__lsm.12_49 = PHI <D__lsm.16_34(5), sum.1_27(7)>
>> >   _23 = D__lsm.10_47;
>> >   _24 = *_19[D__lsm.14_13][D__lsm.10_47];
>> >   sum.0_26 = D__lsm.12_49;
>> >   sum.1_27 = _24 + D__lsm.12_49;
>> >   D__lsm.12_31 = sum.1_27;
>> >   D__lsm.13_32 = 1;
>> >   _33 = D__lsm.10_47 + 1;
>> >   D__lsm.10_14 = _33;
>> >   D__lsm.11_15 = 1;
>> >   j.2_16 = (unsigned int) _33;
>> >   if (j.2_16 < _45)
>> >     goto <bb 7>;
>> >   else
>> >     goto <bb 8>;
>> >
>> >   <bb 8>: outer loop latch
>> >   # D__lsm.10_35 = PHI <_33(7)>
>> >   # D__lsm.11_37 = PHI <1(7)>
>> >   # D__lsm.12_7 = PHI <sum.1_27(7)>
>> >   # D__lsm.13_8 = PHI <1(7)>
>> >   # sum.1_48 = PHI <sum.1_27(7)>
>> >   # _53 = PHI <_33(7)>
>> >   D__lsm.16_56 = sum.1_27;
>> >   D__lsm.17_57 = 1;
>> >   _36 = D__lsm.14_13;
>> >   _38 = D__lsm.14_13 + 1;
>> >   D__lsm.14_58 = _38;
>> >   D__lsm.15_59 = 1;
>> >   i.3_9 = (unsigned int) _38;
>> >   if (i.3_9 < _45)
>> >     goto <bb 5>;
>> >   else
>> >     goto <bb 6>;
>> >
>> >   <bb 6>:
>> >   # D__lsm.10_10 = PHI <_33(8)>
>> >   # D__lsm.11_11 = PHI <1(8)>
>> >   # _43 = PHI <_33(8)>
>> >   # D__lsm.16_62 = PHI <sum.1_27(8)>
>> >   # D__lsm.17_63 = PHI <1(8)>
>> >   # D__lsm.14_64 = PHI <_38(8)>
>> >   # D__lsm.15_65 = PHI <1(8)>
>> >   *_5 = _38;
>> >   *_25 = sum.1_27;
>> >   *_12 = _33;
>> >   goto <bb 3>;
>> >
>> >   <bb 3>:
>> >   return;
>> >...
> Sorry but staring at dumps doesn't make me understand the issue you
> run into.  Where can I reproduce this if I have time to look at this?

I've posted the state of the patch series that reproduces this problem 
at 
https://github.com/vries/gcc/commits/vries/master-port-kernels-test-rb , 
run goacc.exp, testcase kernels-double-reduction-n.c.

> From the dump below I understand you want no memory references in
> the outer loop?
> So the issue seems to be that store motion fails
> to insert the preheader load / exit store to the outermost loop
> possible and thus another LIM pass is needed to "store motion" those
> again?

Yep.

>  But a simple testcase
>
> int a;
> int *p = &a;
> int foo (int n)
> {
>    for (int i = 0; i < n; ++i)
>      for (int j = 0; j < 100; ++j)
>        *p += j + i;
>    return a;
> }
>
> shows that LIM can do this in one step.

For the record, I've filed PR68465 - "pass_lim doesn't detect identical
loop entry conditions", with a test-case where that doesn't happen (when
using -fno-tree-dominator-opts).

> Which means it should
> be investigated why it doesn't do this properly for your testcase
> (store motion of *_25).

There seem to be two related problems:
1. the store has tree_could_trap_p (ref->mem.ref) true, which should be
    false. I'll work on a fix for this.
2. Given that the store can trap, I was running into PR68465. I managed
    to eliminate the 2nd pass_lim by moving the pass_dominator instance
    before the pass_lim instance.
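
One way to read the interaction between the two points, sketched in plain
C rather than GIMPLE: a reference that may trap can only be moved out of
a loop if it is known to execute whenever the moved load and store do.
The function names below are hypothetical, and the explicit guard stands
in for the loop entry condition that has to be recognized.

```c
#include <assert.h>
#include <stddef.h>

/* Direct form: *p is only touched if the loop body runs.  */
static void
accumulate (unsigned *p, const unsigned *v, size_t n)
{
  for (size_t i = 0; i < n; ++i)
    *p += v[i];
}

/* Store motion by hand.  The guard keeps the hoisted load and the sunk
   store from executing when the loop body would not have run.  */
static void
accumulate_sm (unsigned *p, const unsigned *v, size_t n)
{
  if (n == 0)
    return;

  unsigned tmp = *p;		/* hoisted load */
  for (size_t i = 0; i < n; ++i)
    tmp += v[i];
  *p = tmp;			/* sunk store */
}

static void
accumulate_self_check (void)
{
  const unsigned v[3] = { 1, 2, 3 };
  unsigned x = 10, y = 10;

  accumulate (&x, v, 3);
  accumulate_sm (&y, v, 3);
  assert (x == 16 && y == 16);

  /* With zero iterations neither version dereferences p, so passing a
     null pointer is harmless.  Without the guard, accumulate_sm would
     dereference it.  */
  accumulate (NULL, v, 0);
  accumulate_sm (NULL, v, 0);
}
```

In a loop nest where the inner loop sits behind its own copy of the entry
test, the same argument has to be made for the outer loop, which is where
recognizing identical entry conditions comes in.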

The attached patch shows the pass group with only one pass_lim. I hope to be
able to eliminate the first pass_dominator instance before pass_lim once
I fix problem 1.

> Simply adding two LIM passes either papers over a wrong-code
> bug (in LIM or in DOM) or over a missed-optimization in LIM.

AFAIU now, it's PR68465, a missed optimization in LIM.

Thanks,
- Tom



[-- Attachment #2: 0005-Add-pass_oacc_kernels-pass-group-in-passes.def.patch --]
[-- Type: text/x-patch, Size: 4721 bytes --]

Add pass_oacc_kernels pass group in passes.def

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* loop-init.c (loop_optimizer_init): If loops state doesn't need fixup,
	and requested flags are present in the loops state, don't reapply flags.
	* omp-low.c (pass_expand_omp_ssa::clone): New function.
	* passes.def: Add pass_oacc_kernels pass group.
	* tree-ssa-loop-ch.c (pass_ch::clone): New function.
	* tree-ssa-loop-im.c (tree_ssa_lim): Make static.
	(pass_lim::execute): Allow to run outside pass_tree_loop.

---
 gcc/loop-init.c        | 11 ++++++++---
 gcc/omp-low.c          |  1 +
 gcc/passes.def         | 24 ++++++++++++++++++++++++
 gcc/tree-ssa-loop-ch.c |  2 ++
 gcc/tree-ssa-loop-im.c |  4 +++-
 5 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/gcc/loop-init.c b/gcc/loop-init.c
index e32c94a..5bc0c54 100644
--- a/gcc/loop-init.c
+++ b/gcc/loop-init.c
@@ -103,7 +103,11 @@ loop_optimizer_init (unsigned flags)
       calculate_dominance_info (CDI_DOMINATORS);
 
       if (!needs_fixup)
-	checking_verify_loop_structure ();
+	{
+	  checking_verify_loop_structure ();
+	  if (loops_state_satisfies_p (flags))
+	    goto out;
+	}
 
       /* Clear all flags.  */
       if (recorded_exits)
@@ -122,11 +126,12 @@ loop_optimizer_init (unsigned flags)
   /* Apply flags to loops.  */
   apply_loop_flags (flags);
 
+  checking_verify_loop_structure ();
+
+ out:
   /* Dump loops.  */
   flow_loops_dump (dump_file, NULL, 1);
 
-  checking_verify_loop_structure ();
-
   timevar_pop (TV_LOOP_INIT);
 }
 
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 9c27396..d2f88b3 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -13385,6 +13385,7 @@ public:
       return !(fun->curr_properties & PROP_gimple_eomp);
     }
   virtual unsigned int execute (function *) { return execute_expand_omp (); }
+  opt_pass * clone () { return new pass_expand_omp_ssa (m_ctxt); }
 
 }; // class pass_expand_omp_ssa
 
diff --git a/gcc/passes.def b/gcc/passes.def
index 17027786..67f6829 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -88,7 +88,31 @@ along with GCC; see the file COPYING3.  If not see
 	  /* pass_build_ealias is a dummy pass that ensures that we
 	     execute TODO_rebuild_alias at this point.  */
 	  NEXT_PASS (pass_build_ealias);
+	  /* Pass group that runs when the function is an offloaded function
+	     containing oacc kernels loops.  Part 1.  */
+	  NEXT_PASS (pass_oacc_kernels);
+	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
+	      /* We need pass_ch here, because pass_lim has no effect on
+	         exit-first loops (PR65442).  Ideally we want to remove both
+		 this pass instantiation, and the reverse transformation
+		 transform_to_exit_first_loop_alt, which is done in
+		 pass_parallelize_loops_oacc_kernels. */
+	      NEXT_PASS (pass_ch);
+	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_fre);
+	  /* Pass group that runs when the function is an offloaded function
+	     containing oacc kernels loops.  Part 2.  */
+	  NEXT_PASS (pass_oacc_kernels2);
+	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
+	      /* We use pass_lim to rewrite in-memory iteration and reduction
+	         variable accesses in loops into local variables accesses.  */
+	      NEXT_PASS (pass_lim);
+	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+	      NEXT_PASS (pass_dce);
+	      NEXT_PASS (pass_parallelize_loops_oacc_kernels);
+	      NEXT_PASS (pass_expand_omp_ssa);
+	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_merge_phi);
           NEXT_PASS (pass_dse);
 	  NEXT_PASS (pass_cd_dce);
diff --git a/gcc/tree-ssa-loop-ch.c b/gcc/tree-ssa-loop-ch.c
index 7e618bf..6493fcc 100644
--- a/gcc/tree-ssa-loop-ch.c
+++ b/gcc/tree-ssa-loop-ch.c
@@ -165,6 +165,8 @@ public:
   /* Initialize and finalize loop structures, copying headers inbetween.  */
   virtual unsigned int execute (function *);
 
+  opt_pass * clone () { return new pass_ch (m_ctxt); }
+
 protected:
   /* ch_base method: */
   virtual bool process_loop_p (struct loop *loop);
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 30b53ce..2435da6 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -2496,7 +2496,7 @@ tree_ssa_lim_finalize (void)
 /* Moves invariants from loops.  Only "expensive" invariants are moved out --
    i.e. those that are likely to be win regardless of the register pressure.  */
 
-unsigned int
+static unsigned int
 tree_ssa_lim (void)
 {
   unsigned int todo;
@@ -2560,6 +2560,8 @@ public:
 unsigned int
 pass_lim::execute (function *fun)
 {
+  loop_optimizer_init (LOOPS_NORMAL | LOOPS_HAVE_RECORDED_EXITS);
+
   if (number_of_loops (fun) <= 1)
     return 0;
 


* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-13 11:39               ` Jakub Jelinek
@ 2015-11-21 12:24                 ` Tom de Vries
  2015-11-23 11:46                   ` Richard Biener
  2015-12-11 12:45                 ` Tom de Vries
  1 sibling, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-21 12:24 UTC (permalink / raw)
  To: Jakub Jelinek, Richard Biener; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 3387 bytes --]

On 13/11/15 12:39, Jakub Jelinek wrote:
> On Fri, Nov 13, 2015 at 12:29:51PM +0100, Richard Biener wrote:
>>> thanks for the explanation. Filed as PR68331 - '[meta-bug] fipa-pta issues'.
>>>
>>> Any feedback on the '#pragma GCC offload-alias=<none|pointer|all>' bit above?
>>> Is that sort of what you had in mind?
>>
>> Yes.  Whether that makes sense is another question of course.  You can
>> annotate memory references with MR_DEPENDENCE_BASE/CLIQUE yourself
>> as well if you know dependences without the user's intervention.
>
> I really don't like even the GCC offload-alias, I just don't see anything
> special on the offload code.  Not to mention that the same issue is already
> with other outlined functions, like OpenMP tasks or parallel regions, those
> aren't offloaded, yet they can suffer from worse alias/points-to analysis
> too.

AFAIU there is one aspect that is different for offloaded code: the 
setup of the data on the device.

Consider this example:
...
unsigned int a[N];
unsigned int b[N];
unsigned int c[N];

int
main (void)
{
   ...

#pragma acc kernels copyin (a) copyin (b) copyout (c)
   {
     for (COUNTERTYPE ii = 0; ii < N; ii++)
       c[ii] = a[ii] + b[ii];
   }

   ...
...

At gimple level, we have:
...
#pragma omp target oacc_kernels \
   map(force_from:c [len: 2097152]) \
   map(force_to:b [len: 2097152]) \
   map(force_to:a [len: 2097152])
...

[ The meaning of the force_from/force_to mappings is given in 
include/gomp-constants.h:
...
     /* Allocate.  */
     GOMP_MAP_FORCE_ALLOC = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_ALLOC),
     /* ..., and copy to device.  */
     GOMP_MAP_FORCE_TO = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_TO),
     /* ..., and copy from device.  */
     GOMP_MAP_FORCE_FROM = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_FROM),
     /* ..., and copy to and from device.  */
     GOMP_MAP_FORCE_TOFROM = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_TOFROM),
...  ]

So before calling the offloaded function, a separate alloc is done for 
a, b and c, and the base pointers of the newly allocated objects are 
passed to the offloaded function.

This means we can mark those base pointers as restrict in the offloaded 
function.
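
To illustrate the effect (a minimal sketch, not taken from the patch): if
each mapped array gets its own allocation on the device, the pointers
stored in the record passed to the offloaded function can be
restrict-qualified, which tells alias analysis that stores through c
cannot modify a or b.  The struct and function names below are
hypothetical stand-ins, not the identifiers that omp-low.c generates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the record that carries the base pointers
   of the mapped arrays to the offloaded function.  */
struct omp_data_sketch
{
  const unsigned int *restrict a;
  const unsigned int *restrict b;
  unsigned int *restrict c;
};

/* Hypothetical stand-in for the offloaded function body.  The restrict
   qualifiers promise that a, b and c point to disjoint objects.  */
void
add_arrays (const struct omp_data_sketch *data, size_t n)
{
  const unsigned int *restrict a = data->a;
  const unsigned int *restrict b = data->b;
  unsigned int *restrict c = data->c;

  assert (a != NULL && b != NULL && c != NULL);

  for (size_t ii = 0; ii < n; ii++)
    c[ii] = a[ii] + b[ii];
}
```

With the qualifiers in place the compiler may treat the loop iterations
as free of cross-array dependences; without them it has to assume that
c might overlap a or b.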

Attached proof-of-concept patch implements that.

> We simply have some compiler internal interface between the caller and
> callee of the outlined regions, each interface in between those has
> its own structure type used to communicate the info;
> we can attach attributes on the fields, or some flags to indicate some
> properties interesting from aliasing POV.
> We don't really need to perform
> full IPA-PTA, perhaps it would be enough to a) record somewhere in cgraph
> the relationship in between such callers and callees (for offloading regions
> we already have "omp target entrypoint" attribute on the callee and a
> single caller), tell LTO if possible not to split those into different
> partitions if easily possible, and then just for these pairs perform
> aliasing/points-to analysis in the caller and the result record using
> cliques/special attributes/whatever to the callee side, so that the callee
> (outlined OpenMP/OpenACC/Cilk+ region) can then improve its alias analysis.

As a start, is the approach of this patch OK?

It will allow us to commit the oacc kernels patch series with the 
ability to parallelize non-trivial testcases, and work on improving the 
alias bit after that.

Thanks,
- Tom




[-- Attachment #2: 0018-Mark-pointers-to-allocated-target-vars-as-restricted-if-possible.patch --]
[-- Type: text/x-patch, Size: 4201 bytes --]

Mark pointers to allocated target vars as restricted, if possible

---
 gcc/omp-low.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 62 insertions(+), 5 deletions(-)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 268b67b..0ce822d 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -1372,7 +1372,8 @@ build_sender_ref (tree var, omp_context *ctx)
 /* Add a new field for VAR inside the structure CTX->SENDER_DECL.  */
 
 static void
-install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
+install_var_field_1 (tree var, bool by_ref, int mask, omp_context *ctx,
+		     bool base_pointers_restrict)
 {
   tree field, type, sfield = NULL_TREE;
   splay_tree_key key = (splay_tree_key) var;
@@ -1396,7 +1397,11 @@ install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
       type = build_pointer_type (build_pointer_type (type));
     }
   else if (by_ref)
-    type = build_pointer_type (type);
+    {
+      type = build_pointer_type (type);
+      if (base_pointers_restrict)
+	type = build_qualified_type (type, TYPE_QUAL_RESTRICT);
+    }
   else if ((mask & 3) == 1 && is_reference (var))
     type = TREE_TYPE (type);
 
@@ -1460,6 +1465,12 @@ install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
     splay_tree_insert (ctx->sfield_map, key, (splay_tree_value) sfield);
 }
 
+static void
+install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
+{
+  install_var_field_1 (var, by_ref, mask, ctx, false);
+}
+
 static tree
 install_var_local (tree var, omp_context *ctx)
 {
@@ -1816,7 +1827,8 @@ fixup_child_record_type (omp_context *ctx)
    specified by CLAUSES.  */
 
 static void
-scan_sharing_clauses (tree clauses, omp_context *ctx)
+scan_sharing_clauses_1 (tree clauses, omp_context *ctx,
+			bool base_pointers_restrict)
 {
   tree c, decl;
   bool scan_array_reductions = false;
@@ -2073,7 +2085,7 @@ scan_sharing_clauses (tree clauses, omp_context *ctx)
 		      && TREE_CODE (TREE_TYPE (decl)) == ARRAY_TYPE)
 		    install_var_field (decl, true, 7, ctx);
 		  else
-		    install_var_field (decl, true, 3, ctx);
+		    install_var_field_1 (decl, true, 3, ctx, base_pointers_restrict);
 		  if (is_gimple_omp_offloaded (ctx->stmt))
 		    install_var_local (decl, ctx);
 		}
@@ -2339,6 +2351,12 @@ scan_sharing_clauses (tree clauses, omp_context *ctx)
 	scan_omp (&OMP_CLAUSE_LINEAR_GIMPLE_SEQ (c), ctx);
 }
 
+static void
+scan_sharing_clauses (tree clauses, omp_context *ctx)
+{
+  scan_sharing_clauses_1 (clauses, ctx, false);
+}
+
 /* Create a new name for omp child function.  Returns an identifier.  If
    IS_CILK_FOR is true then the suffix for the child function is
    "_cilk_for_fn."  */
@@ -3056,13 +3074,52 @@ scan_omp_target (gomp_target *stmt, omp_context *outer_ctx)
   DECL_NAMELESS (name) = 1;
   TYPE_NAME (ctx->record_type) = name;
   TYPE_ARTIFICIAL (ctx->record_type) = 1;
+
+  bool base_pointers_restrict = false;
   if (offloaded)
     {
       create_omp_child_function (ctx, false);
       gimple_omp_target_set_child_fn (stmt, ctx->cb.dst_fn);
+
+      /* If all the clauses force allocation, we can be certain that the objects
+	 on the target are disjoint, and therefore mark the base pointers as
+	 restrict.  */
+      base_pointers_restrict = true;
+      tree c;
+      for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+	{
+	  switch (OMP_CLAUSE_CODE (c))
+	    {
+	    case OMP_CLAUSE_MAP:
+	      switch (OMP_CLAUSE_MAP_KIND (c))
+		{
+		case GOMP_MAP_ALLOC:
+		case GOMP_MAP_FORCE_TO:
+		case GOMP_MAP_FORCE_FROM:
+		case GOMP_MAP_FORCE_TOFROM:
+		  break;
+		default:
+		  base_pointers_restrict = false;
+		  break;
+		}
+	      break;
+
+	    default:
+	      base_pointers_restrict = false;
+	      break;
+	    }
+
+	  if (!base_pointers_restrict)
+	    break;
+	}
+      if (base_pointers_restrict)
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file, "Base pointers in offloaded function are restrict\n");
+	}
     }
 
-  scan_sharing_clauses (clauses, ctx);
+  scan_sharing_clauses_1 (clauses, ctx, base_pointers_restrict);
   scan_omp (gimple_omp_body_ptr (stmt), ctx);
 
   if (TYPE_FIELDS (ctx->record_type) == NULL)


* [PATCH] Don't reapply loops flags if unnecessary in loop_optimizer_init
  2015-11-20 10:37           ` Richard Biener
  2015-11-20 13:27             ` Tom de Vries
@ 2015-11-22 23:37             ` Tom de Vries
  2015-11-23 10:33               ` Richard Biener
  1 sibling, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-22 23:37 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 1459 bytes --]

[ was: Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def ]

On 20/11/15 11:37, Richard Biener wrote:
> I'd rather make loop_optimizer_init do nothing
> if requested flags are already set and no fixup is needed and
> call the above unconditionally.  Thus sth like
>
> Index: gcc/loop-init.c
> ===================================================================
> --- gcc/loop-init.c     (revision 230649)
> +++ gcc/loop-init.c     (working copy)
> @@ -103,7 +103,11 @@ loop_optimizer_init (unsigned flags)
>         calculate_dominance_info (CDI_DOMINATORS);
>
>         if (!needs_fixup)
> -       checking_verify_loop_structure ();
> +       {
> +         checking_verify_loop_structure ();
> +         if (loops_state_satisfies_p (flags))
> +           goto out;
> +       }
>
>         /* Clear all flags.  */
>         if (recorded_exits)
> @@ -122,11 +126,12 @@ loop_optimizer_init (unsigned flags)
>     /* Apply flags to loops.  */
>     apply_loop_flags (flags);
>
> +  checking_verify_loop_structure ();
> +
> +out:
>     /* Dump loops.  */
>     flow_loops_dump (dump_file, NULL, 1);
>
> -  checking_verify_loop_structure ();
> -
>     timevar_pop (TV_LOOP_INIT);
>   }

This patch implements that approach, but the patch is slightly more 
complicated because of the need to handle 
LOOPS_MAY_HAVE_MULTIPLE_LATCHES differently than the rest of the flags.
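
One way to read the extra condition, as a sketch with hypothetical flag
values (not GCC's real enum): LOOPS_MAY_HAVE_MULTIPLE_LATCHES appears to
act as a permission rather than a guarantee, so a plain "are all
requested bits set" test is not enough.  If the current state still
allows multiple latches but the caller did not ask for that, the flags
have to be reapplied.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values, chosen for illustration only.  */
enum
{
  SK_HAVE_PREHEADERS = 1 << 0,
  SK_HAVE_SIMPLE_LATCHES = 1 << 1,
  SK_MAY_HAVE_MULTIPLE_LATCHES = 1 << 2
};

/* True if every bit in FLAGS is set in STATE.  */
static bool
state_satisfies_p (unsigned state, unsigned flags)
{
  return (state & flags) == flags;
}

/* Mirror of the need_reapply computation in the patch: reapply if a
   requested guarantee is missing, or if the state permits multiple
   latches while the request does not.  */
bool
need_reapply_p (unsigned state, unsigned requested)
{
  bool missing_guarantee
    = !state_satisfies_p (state,
			  requested & ~SK_MAY_HAVE_MULTIPLE_LATCHES);
  bool stale_permission
    = (state_satisfies_p (state, SK_MAY_HAVE_MULTIPLE_LATCHES)
       && (requested & SK_MAY_HAVE_MULTIPLE_LATCHES) == 0);

  return missing_guarantee || stale_permission;
}
```

For example, need_reapply_p (SK_MAY_HAVE_MULTIPLE_LATCHES, 0) is true
even though no requested bit is missing from the state.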

Bootstrapped and reg-tested on x86_64.

OK for stage3 trunk?

Thanks,
- Tom


[-- Attachment #2: 0002-Don-t-reapply-loops-flags-if-unnecessary-in-loop_optimizer_init.patch --]
[-- Type: text/x-patch, Size: 1546 bytes --]

Don't reapply loops flags if unnecessary in loop_optimizer_init

2015-11-22  Tom de Vries  <tom@codesourcery.com>

	* loop-init.c (loop_optimizer_init): Don't reapply loops flags if
	unnecessary.

---
 gcc/loop-init.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/gcc/loop-init.c b/gcc/loop-init.c
index e32c94a..4b72cab 100644
--- a/gcc/loop-init.c
+++ b/gcc/loop-init.c
@@ -85,6 +85,8 @@ loop_optimizer_init (unsigned flags)
 {
   timevar_push (TV_LOOP_INIT);
 
+  gcc_checking_assert ((flags & (LOOP_CLOSED_SSA | LOOPS_NEED_FIXUP)) == 0);
+
   if (!current_loops)
     {
       gcc_assert (!(cfun->curr_properties & PROP_loops));
@@ -103,7 +105,17 @@ loop_optimizer_init (unsigned flags)
       calculate_dominance_info (CDI_DOMINATORS);
 
       if (!needs_fixup)
-	checking_verify_loop_structure ();
+	{
+	  checking_verify_loop_structure ();
+
+	  bool need_reapply
+	    = (!loops_state_satisfies_p (flags
+					 & (~LOOPS_MAY_HAVE_MULTIPLE_LATCHES))
+	       || (loops_state_satisfies_p (LOOPS_MAY_HAVE_MULTIPLE_LATCHES)
+		   && ((flags & LOOPS_MAY_HAVE_MULTIPLE_LATCHES) == 0)));
+	  if (!need_reapply)
+	    goto out;
+	}
 
       /* Clear all flags.  */
       if (recorded_exits)
@@ -122,11 +134,12 @@ loop_optimizer_init (unsigned flags)
   /* Apply flags to loops.  */
   apply_loop_flags (flags);
 
+  checking_verify_loop_structure ();
+
+ out:
   /* Dump loops.  */
   flow_loops_dump (dump_file, NULL, 1);
 
-  checking_verify_loop_structure ();
-
   timevar_pop (TV_LOOP_INIT);
 }
 


* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-20 16:34                 ` Tom de Vries
@ 2015-11-23 10:11                   ` Richard Biener
  2015-11-24 12:22                     ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-23 10:11 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Fri, 20 Nov 2015, Tom de Vries wrote:

> On 20/11/15 14:29, Richard Biener wrote:
> > I agree it's somewhat of an odd behavior but all passes should
> > either be placed in a sub-pipeline with an outer
> > loop_optimizer_init()/finalize () call or call both themselves.
> 
> Hmm, but adding loop_optimizer_finalize at the end of pass_lim breaks the loop
> pipeline.
> 
> We could use the style used in pass_slp_vectorize::execute:
> ...
> pass_slp_vectorize::execute (function *fun)
> {
>   basic_block bb;
> 
>   bool in_loop_pipeline = scev_initialized_p ();
>   if (!in_loop_pipeline)
>     {
>       loop_optimizer_init (LOOPS_NORMAL);
>       scev_initialize ();
>     }
> 
>   ...
> 
>   if (!in_loop_pipeline)
>     {
>       scev_finalize ();
>       loop_optimizer_finalize ();
>     }
> ...
> 
> Although that doesn't strike me as particularly clean.

At least it would be a consistent "unclean" style.  So yes, the
above would work for me.

Thanks,
Richard.


* Re: [PATCH] Don't reapply loops flags if unnecessary in loop_optimizer_init
  2015-11-22 23:37             ` [PATCH] Don't reapply loops flags if unnecessary in loop_optimizer_init Tom de Vries
@ 2015-11-23 10:33               ` Richard Biener
  2015-11-23 11:27                 ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-23 10:33 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Mon, 23 Nov 2015, Tom de Vries wrote:

> [ was: Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def ]
> 
> On 20/11/15 11:37, Richard Biener wrote:
> > I'd rather make loop_optimizer_init do nothing
> > if requested flags are already set and no fixup is needed and
> > call the above unconditionally.  Thus sth like
> > 
> > Index: gcc/loop-init.c
> > ===================================================================
> > --- gcc/loop-init.c     (revision 230649)
> > +++ gcc/loop-init.c     (working copy)
> > @@ -103,7 +103,11 @@ loop_optimizer_init (unsigned flags)
> >         calculate_dominance_info (CDI_DOMINATORS);
> > 
> >         if (!needs_fixup)
> > -       checking_verify_loop_structure ();
> > +       {
> > +         checking_verify_loop_structure ();
> > +         if (loops_state_satisfies_p (flags))
> > +           goto out;
> > +       }
> > 
> >         /* Clear all flags.  */
> >         if (recorded_exits)
> > @@ -122,11 +126,12 @@ loop_optimizer_init (unsigned flags)
> >     /* Apply flags to loops.  */
> >     apply_loop_flags (flags);
> > 
> > +  checking_verify_loop_structure ();
> > +
> > +out:
> >     /* Dump loops.  */
> >     flow_loops_dump (dump_file, NULL, 1);
> > 
> > -  checking_verify_loop_structure ();
> > -
> >     timevar_pop (TV_LOOP_INIT);
> >   }
> 
> This patch implements that approach, but the patch is slightly more
> complicated because of the need to handle LOOPS_MAY_HAVE_MULTIPLE_LATCHES
> differently than the rest of the flags.
> 
> Bootstrapped and reg-tested on x86_64.
> 
> OK for stage3 trunk?

Let's revisit this during stage1 if the scev_initialized_p () thing
SLP vectorization uses works, ok?

Thanks,
Richard.

> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: [PATCH] Don't reapply loops flags if unnecessary in loop_optimizer_init
  2015-11-23 10:33               ` Richard Biener
@ 2015-11-23 11:27                 ` Tom de Vries
  0 siblings, 0 replies; 133+ messages in thread
From: Tom de Vries @ 2015-11-23 11:27 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 2818 bytes --]

On 23/11/15 11:29, Richard Biener wrote:
> On Mon, 23 Nov 2015, Tom de Vries wrote:
>
>> [ was: Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def ]
>>
>> On 20/11/15 11:37, Richard Biener wrote:
>>> I'd rather make loop_optimizer_init do nothing
>>> if requested flags are already set and no fixup is needed and
>>> call the above unconditionally.  Thus sth like
>>>
>>> Index: gcc/loop-init.c
>>> ===================================================================
>>> --- gcc/loop-init.c     (revision 230649)
>>> +++ gcc/loop-init.c     (working copy)
>>> @@ -103,7 +103,11 @@ loop_optimizer_init (unsigned flags)
>>>          calculate_dominance_info (CDI_DOMINATORS);
>>>
>>>          if (!needs_fixup)
>>> -       checking_verify_loop_structure ();
>>> +       {
>>> +         checking_verify_loop_structure ();
>>> +         if (loops_state_satisfies_p (flags))
>>> +           goto out;
>>> +       }
>>>
>>>          /* Clear all flags.  */
>>>          if (recorded_exits)
>>> @@ -122,11 +126,12 @@ loop_optimizer_init (unsigned flags)
>>>      /* Apply flags to loops.  */
>>>      apply_loop_flags (flags);
>>>
>>> +  checking_verify_loop_structure ();
>>> +
>>> +out:
>>>      /* Dump loops.  */
>>>      flow_loops_dump (dump_file, NULL, 1);
>>>
>>> -  checking_verify_loop_structure ();
>>> -
>>>      timevar_pop (TV_LOOP_INIT);
>>>    }
>>
>> This patch implements that approach, but the patch is slightly more
>> complicated because of the need to handle LOOPS_MAY_HAVE_MULTIPLE_LATCHES
>> differently than the rest of the flags.
>>
>> Bootstrapped and reg-tested on x86_64.
>>
>> OK for stage3 trunk?
>
> Let's revisit this during stage1 if the scev_initialized_p () thing
> SLP vectorization uses works, ok?
>

OK, I'll give that a try.

FTR, the two attached patches are an attempt at a cleaner solution for 
pass_slp_vectorize::execute (in combination with the patch "Don't 
reapply loops flags if unnecessary in loop_optimizer_init").

The first patch introduces a property PROP_scev, set for the duration of 
the loop pipeline. It allows us to call scev_initialize and 
scev_finalize unconditionally. Outside the loop pipeline calling the 
functions has the usual effect. Inside the loop pipeline, calling the 
functions has no effect.

The second patch introduces a property PROP_loops_normal_re_lcssa, set 
for the duration of the loop pipeline. It allows us (in combination with 
"Don't reapply loops flags if unnecessary in loop_optimizer_init") to 
call loop_optimizer_init and loop_optimizer_finalize unconditionally.
Outside the loop pipeline, calling the functions has the usual effect. 
Inside the loop pipeline, calling loop_optimizer_finalize has no effect, 
and calling loop_optimizer_init has no effect unless a fixup or a 
new loop property is needed.
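
The property-guarded pattern can be sketched in isolation as follows.
This is a simplified model under stated assumptions, not the code in the
attachments: struct fn_sketch, SK_PROP_SCEV and the sk_* functions are
hypothetical names, and the counters exist only to make the behaviour
observable.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical property bit, modelled on PROP_scev.  */
#define SK_PROP_SCEV (1u << 16)

/* Hypothetical stand-in for the per-function state.  */
struct fn_sketch
{
  unsigned curr_properties;
  bool scev_live;
  int init_count;
  int fini_count;
};

/* No-op while the property is set, i.e. inside the loop pipeline.  */
void
sk_scev_initialize (struct fn_sketch *fn)
{
  if (fn->curr_properties & SK_PROP_SCEV)
    {
      assert (fn->scev_live);
      return;
    }

  fn->scev_live = true;
  fn->init_count++;
}

/* No-op while the property is set; otherwise tears down if live.  */
void
sk_scev_finalize (struct fn_sketch *fn)
{
  if (fn->curr_properties & SK_PROP_SCEV)
    {
      assert (fn->scev_live);
      return;
    }

  if (!fn->scev_live)
    return;

  fn->scev_live = false;
  fn->fini_count++;
}
```

A pass can then call both functions unconditionally: the pipeline sets
the property after its own initialization and clears it just before its
own finalization, so nested calls in between leave the state untouched.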

Thanks,
- Tom


[-- Attachment #2: 0020-Add-PROP_scev.patch --]
[-- Type: text/x-patch, Size: 3142 bytes --]

Add PROP_scev

---
 gcc/tree-pass.h             |  1 +
 gcc/tree-scalar-evolution.c | 13 +++++++++++++
 gcc/tree-ssa-loop.c         |  3 ++-
 gcc/tree-vectorizer.c       |  4 ++--
 4 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 004db77..4e66b2c 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -227,6 +227,7 @@ protected:
 						   of math functions; the
 						   current choices have
 						   been optimized.  */
+#define PROP_scev		(1 << 16)	/* preserve scev info.  */
 
 #define PROP_trees \
   (PROP_gimple_any | PROP_gimple_lcf | PROP_gimple_leh | PROP_gimple_lomp)
diff --git a/gcc/tree-scalar-evolution.c b/gcc/tree-scalar-evolution.c
index 9b33693..5d5e354 100644
--- a/gcc/tree-scalar-evolution.c
+++ b/gcc/tree-scalar-evolution.c
@@ -280,6 +280,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "params.h"
 #include "tree-ssa-propagate.h"
 #include "gimple-fold.h"
+#include "tree-pass.h"
 
 static tree analyze_scalar_evolution_1 (struct loop *, tree, tree);
 static tree analyze_scalar_evolution_for_address_of (struct loop *loop,
@@ -3168,6 +3169,12 @@ scev_initialize (void)
 {
   struct loop *loop;
 
+  if (cfun->curr_properties & PROP_scev)
+    {
+      gcc_assert (scev_initialized_p ());
+      return;
+    }
+
   scalar_evolution_info = hash_table<scev_info_hasher>::create_ggc (100);
 
   initialize_scalar_evolutions_analyzer ();
@@ -3367,6 +3374,12 @@ simple_iv (struct loop *wrto_loop, struct loop *use_loop, tree op,
 void
 scev_finalize (void)
 {
+  if (cfun->curr_properties & PROP_scev)
+    {
+      gcc_assert (scev_initialized_p ());
+      return;
+    }
+
   if (!scalar_evolution_info)
     return;
   scalar_evolution_info->empty ();
diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index d30e3c8..739fda7 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -290,7 +290,7 @@ const pass_data pass_data_tree_loop_init =
   OPTGROUP_LOOP, /* optinfo_flags */
   TV_NONE, /* tv_id */
   PROP_cfg, /* properties_required */
-  0, /* properties_provided */
+  PROP_scev, /* properties_provided */
   0, /* properties_destroyed */
   0, /* todo_flags_start */
   0, /* todo_flags_finish */
@@ -524,6 +524,7 @@ make_pass_iv_optimize (gcc::context *ctxt)
 static unsigned int
 tree_ssa_loop_done (void)
 {
+  cfun->curr_properties &= ~PROP_scev;
   free_numbers_of_iterations_estimates (cfun);
   scev_finalize ();
   loop_optimizer_finalize ();
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index b721c56..b06433d 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -731,8 +731,8 @@ pass_slp_vectorize::execute (function *fun)
   if (!in_loop_pipeline)
     {
       loop_optimizer_init (LOOPS_NORMAL);
-      scev_initialize ();
     }
+  scev_initialize ();
 
   /* Mark all stmts as not belonging to the current region and unvisited.  */
   FOR_EACH_BB_FN (bb, fun)
@@ -757,9 +757,9 @@ pass_slp_vectorize::execute (function *fun)
 
   free_stmt_vec_info_vec ();
 
+  scev_finalize ();
   if (!in_loop_pipeline)
     {
-      scev_finalize ();
       loop_optimizer_finalize ();
     }
 

[-- Attachment #3: 0021-Add-PROP_loops_normal_re_lcssa.patch --]
[-- Type: text/x-patch, Size: 3449 bytes --]

Add PROP_loops_normal_re_lcssa

---
 gcc/loop-init.c       | 13 +++++++++++++
 gcc/tree-pass.h       |  3 +++
 gcc/tree-ssa-loop.c   |  4 ++--
 gcc/tree-vectorizer.c | 11 ++---------
 4 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/gcc/loop-init.c b/gcc/loop-init.c
index 4b72cab..9ce3e9e 100644
--- a/gcc/loop-init.c
+++ b/gcc/loop-init.c
@@ -100,6 +100,10 @@ loop_optimizer_init (unsigned flags)
       bool needs_fixup = loops_state_satisfies_p (LOOPS_NEED_FIXUP);
 
       gcc_assert (cfun->curr_properties & PROP_loops);
+      if (cfun->curr_properties & PROP_loops_normal_re_lcssa)
+	gcc_assert (loops_state_satisfies_p (LOOPS_NORMAL
+					     | LOOPS_HAVE_RECORDED_EXITS
+					     | LOOP_CLOSED_SSA));
 
       /* Ensure that the dominators are computed, like flow_loops_find does.  */
       calculate_dominance_info (CDI_DOMINATORS);
@@ -151,6 +155,15 @@ loop_optimizer_finalize (struct function *fn)
   struct loop *loop;
   basic_block bb;
 
+  if (fn->curr_properties & PROP_loops_normal_re_lcssa)
+    {
+      gcc_assert (loops_state_satisfies_p (fn, LOOPS_NORMAL
+					   | LOOPS_HAVE_RECORDED_EXITS
+					   | LOOP_CLOSED_SSA));
+      gcc_assert (!loops_state_satisfies_p (fn, LOOPS_NEED_FIXUP));
+      return;
+    }
+
   timevar_push (TV_LOOP_FINI);
 
   if (loops_state_satisfies_p (fn, LOOPS_HAVE_RECORDED_EXITS))
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 4e66b2c..c43a5f3 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -228,6 +228,9 @@ protected:
 						   current choices have
 						   been optimized.  */
 #define PROP_scev		(1 << 16)	/* preserve scev info.  */
+/* preserve loop structures in LOOPS_NORMAL with recorded exits, and in loop
+   closed ssa.  */
+#define PROP_loops_normal_re_lcssa	(1 << 17)
 
 #define PROP_trees \
   (PROP_gimple_any | PROP_gimple_lcf | PROP_gimple_leh | PROP_gimple_lomp)
diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index 739fda7..73fbb43 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -290,7 +290,7 @@ const pass_data pass_data_tree_loop_init =
   OPTGROUP_LOOP, /* optinfo_flags */
   TV_NONE, /* tv_id */
   PROP_cfg, /* properties_required */
-  PROP_scev, /* properties_provided */
+  PROP_loops_normal_re_lcssa | PROP_scev, /* properties_provided */
   0, /* properties_destroyed */
   0, /* todo_flags_start */
   0, /* todo_flags_finish */
@@ -524,7 +524,7 @@ make_pass_iv_optimize (gcc::context *ctxt)
 static unsigned int
 tree_ssa_loop_done (void)
 {
-  cfun->curr_properties &= ~PROP_scev;
+  cfun->curr_properties &= ~(PROP_loops_normal_re_lcssa | PROP_scev);
   free_numbers_of_iterations_estimates (cfun);
   scev_finalize ();
   loop_optimizer_finalize ();
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index b06433d..503f227 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -727,11 +727,7 @@ pass_slp_vectorize::execute (function *fun)
 {
   basic_block bb;
 
-  bool in_loop_pipeline = scev_initialized_p ();
-  if (!in_loop_pipeline)
-    {
-      loop_optimizer_init (LOOPS_NORMAL);
-    }
+  loop_optimizer_init (LOOPS_NORMAL);
   scev_initialize ();
 
   /* Mark all stmts as not belonging to the current region and unvisited.  */
@@ -758,10 +754,7 @@ pass_slp_vectorize::execute (function *fun)
   free_stmt_vec_info_vec ();
 
   scev_finalize ();
-  if (!in_loop_pipeline)
-    {
-      loop_optimizer_finalize ();
-    }
+  loop_optimizer_finalize ();
 
   return 0;
 }


* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-21  8:42                   ` Tom de Vries
@ 2015-11-23 11:31                     ` Richard Biener
  2015-11-23 15:53                       ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-23 11:31 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

On Sat, 21 Nov 2015, Tom de Vries wrote:

> On 20/11/15 11:28, Richard Biener wrote:
> > On Thu, 19 Nov 2015, Tom de Vries wrote:
> > 
> > > >On 17/11/15 15:53, Tom de Vries wrote:
> > > > > > > >And the above LIM example
> > > > > > > >is none for why you need two LIM passes...
> > > > > >
> > > > > >Indeed. I'm planning a separate reply to explain in more detail the
> > > > need
> > > > > >for the two pass_lims.
> > > >
> > > >I.
> > > >
> > > >I managed to get rid of the two pass_lims for the motivating example that
> > > I
> > > >used until now (goacc/kernels-double-reduction.c). I found that by adding
> > > a
> > > >pass_dominator instance after pass_ch, I could get rid of the second
> > > pass_lim
> > > >(and pass_copyprop as well).
> > > >
> > > >But... then I wrote a counter example
> > > (goacc/kernels-double-reduction-n.c),
> > > >and I'm back at two pass_lims (and two pass_dominators).
> > > >Also I've split the pass group into a bit before and after pass_fre.
> > > >
> > > >So, the current pass group looks like:
> > > >...
> > > >NEXT_PASS (pass_build_ealias);
> > > >
> > > >/* Pass group that runs when the function is an offloaded function
> > > >    containing oacc kernels loops.  Part 1.  */
> > > >NEXT_PASS (pass_oacc_kernels);
> > > >PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
> > > >     /* We need pass_ch here, because pass_lim has no effect on
> > > >        exit-first loops (PR65442).  Ideally we want to remove both
> > > >        this pass instantiation, and the reverse transformation
> > > >        transform_to_exit_first_loop_alt, which is done in
> > > >        pass_parallelize_loops_oacc_kernels. */
> > > >     NEXT_PASS (pass_ch);
> > > >POP_INSERT_PASSES ()
> > > >
> > > >NEXT_PASS (pass_fre);
> > > >
> > > >/* Pass group that runs when the function is an offloaded function
> > > >    containing oacc kernels loops.  Part 2.  */
> > > >NEXT_PASS (pass_oacc_kernels2);
> > > >PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
> > > >     /* We use pass_lim to rewrite in-memory iteration and reduction
> > > >        variable accesses in loops into local variables accesses.  */
> > > >     NEXT_PASS (pass_lim);
> > > >     NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
> > > >     NEXT_PASS (pass_lim);
> > > >     NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
> > > >     NEXT_PASS (pass_dce);
> > > >     NEXT_PASS (pass_parallelize_loops_oacc_kernels);
> > > >     NEXT_PASS (pass_expand_omp_ssa);
> > > >POP_INSERT_PASSES ()
> > > >NEXT_PASS (pass_merge_phi);
> > > >...
> > > >
> > > >
> > > >II.
> > > >
> > > >The motivating test-case kernels-double-reduction-n.c:
> > > >...
> > > >#include <stdlib.h>
> > > >
> > > >#define N 500
> > > >
> > > >unsigned int a[N][N];
> > > >
> > > >void  __attribute__((noinline,noclone))
> > > >foo (unsigned int n)
> > > >{
> > > >   int i, j;
> > > >   unsigned int sum = 1;
> > > >
> > > >#pragma acc kernels copyin (a[0:n]) copy (sum)
> > > >   {
> > > >     for (i = 0; i < n; ++i)
> > > >       for (j = 0; j < n; ++j)
> > > >         sum += a[i][j];
> > > >   }
> > > >
> > > >   if (sum != 5001)
> > > >     abort ();
> > > >}
> > > >...
> > > >
> > > >
> > > >III.
> > > >
> > > >Before first pass_lim. Note no phis on inner or outer loop header for
> > > >iteration variables or reduction variable:
> > > >...
> > > >   <bb 2>:
> > > >   _5 = *.omp_data_i_4(D).i;
> > > >   *_5 = 0;
> > > >   _44 = *.omp_data_i_4(D).n;
> > > >   _45 = *_44;
> > > >   if (_45 != 0)
> > > >     goto <bb 4>;
> > > >   else
> > > >     goto <bb 3>;
> > > >
> > > >   <bb 4>: outer loop header
> > > >   _12 = *.omp_data_i_4(D).j;
> > > >   *_12 = 0;
> > > >   if (_45 != 0)
> > > >     goto <bb 6>;
> > > >   else
> > > >     goto <bb 5>;
> > > >
> > > >   <bb 6>: inner loop header, latch
> > > >   _19 = *.omp_data_i_4(D).a;
> > > >   _21 = *_5;
> > > >   _23 = *_12;
> > > >   _24 = *_19[_21][_23];
> > > >   _25 = *.omp_data_i_4(D).sum;
> > > >   sum.0_26 = *_25;
> > > >   sum.1_27 = _24 + sum.0_26;
> > > >   *_25 = sum.1_27;
> > > >   _33 = _23 + 1;
> > > >   *_12 = _33;
> > > >   j.2_16 = (unsigned int) _33;
> > > >   if (j.2_16 < _45)
> > > >     goto <bb 6>;
> > > >   else
> > > >     goto <bb 5>;
> > > >
> > > >   <bb 5>: outer loop latch
> > > >   _36 = *_5;
> > > >   _38 = _36 + 1;
> > > >   *_5 = _38;
> > > >   i.3_9 = (unsigned int) _38;
> > > >   if (i.3_9 < _45)
> > > >     goto <bb 4>;
> > > >   else
> > > >     goto <bb 3>;
> > > >
> > > >   <bb 3>:
> > > >   return;
> > > >...
> > > >
> > > >
> > > >IV.
> > > >
> > > >After first pass_lim/pass_dom pair. Note there are phis on the inner loop
> > > >header for the reduction and the iteration variable, but not on the outer
> > > loop
> > > >header:
> > > >...
> > > >   <bb 2>:
> > > >   _5 = *.omp_data_i_4(D).i;
> > > >   *_5 = 0;
> > > >   _44 = *.omp_data_i_4(D).n;
> > > >   _45 = *_44;
> > > >   if (_45 != 0)
> > > >     goto <bb 4>;
> > > >   else
> > > >     goto <bb 3>;
> > > >
> > > >   <bb 4>:
> > > >   _12 = *.omp_data_i_4(D).j;
> > > >   _19 = *.omp_data_i_4(D).a;
> > > >   D__lsm.10_50 = *_12;
> > > >   D__lsm.11_51 = 0;
> > > >   _25 = *.omp_data_i_4(D).sum;
> > > >
> > > >   <bb 5>: outer loop header
> > > >   D__lsm.10_20 = 0;
> > > >   D__lsm.11_22 = 1;
> > > >   _21 = *_5;
> > > >   D__lsm.12_28 = *_25;
> > > >   D__lsm.13_30 = 0;
> > > >   goto <bb 7>;
> > > >
> > > >   <bb 7>: inner loop header, latch
> > > >   # D__lsm.10_47 = PHI <0(5), _33(7)>
> > > >   # D__lsm.12_49 = PHI <D__lsm.12_28(5), sum.1_27(7)>
> > > >   _23 = D__lsm.10_47;
> > > >   _24 = *_19[_21][D__lsm.10_47];
> > > >   sum.0_26 = D__lsm.12_49;
> > > >   sum.1_27 = _24 + D__lsm.12_49;
> > > >   D__lsm.12_31 = sum.1_27;
> > > >   D__lsm.13_32 = 1;
> > > >   _33 = D__lsm.10_47 + 1;
> > > >   D__lsm.10_14 = _33;
> > > >   D__lsm.11_15 = 1;
> > > >   j.2_16 = (unsigned int) _33;
> > > >   if (j.2_16 < _45)
> > > >     goto <bb 7>;
> > > >   else
> > > >     goto <bb 8>;
> > > >
> > > >   <bb 8>: outer loop latch
> > > >   # D__lsm.10_35 = PHI <_33(7)>
> > > >   # D__lsm.11_37 = PHI <1(7)>
> > > >   # D__lsm.12_7 = PHI <sum.1_27(7)>
> > > >   # D__lsm.13_8 = PHI <1(7)>
> > > >   *_25 = sum.1_27;
> > > >   _36 = *_5;
> > > >   _38 = _36 + 1;
> > > >   *_5 = _38;
> > > >   i.3_9 = (unsigned int) _38;
> > > >   if (i.3_9 < _45)
> > > >     goto <bb 5>;
> > > >   else
> > > >     goto <bb 6>;
> > > >
> > > >   <bb 6>:
> > > >   # D__lsm.10_10 = PHI <_33(8)>
> > > >   # D__lsm.11_11 = PHI <1(8)>
> > > >   *_12 = _33;
> > > >   goto <bb 3>;
> > > >
> > > >   <bb 3>:
> > > >   return;
> > > >...
> > > >
> > > >
> > > >V.
> > > >
> > > >After second pass_lim/pass_dom pair. Note there are phis on the inner and
> > > >outer loop header for the reduction and the iteration variables:
> > > >...
> > > >   <bb 2>:
> > > >   _5 = *.omp_data_i_4(D).i;
> > > >   *_5 = 0;
> > > >   _44 = *.omp_data_i_4(D).n;
> > > >   _45 = *_44;
> > > >   if (_45 != 0)
> > > >     goto <bb 4>;
> > > >   else
> > > >     goto <bb 3>;
> > > >
> > > >   <bb 4>:
> > > >   _12 = *.omp_data_i_4(D).j;
> > > >   _19 = *.omp_data_i_4(D).a;
> > > >   D__lsm.10_50 = *_12;
> > > >   D__lsm.11_51 = 0;
> > > >   _25 = *.omp_data_i_4(D).sum;
> > > >   D__lsm.14_40 = 0;
> > > >   D__lsm.15_2 = 0;
> > > >   D__lsm.16_1 = *_25;
> > > >   D__lsm.17_46 = 0;
> > > >
> > > >   <bb 5>: outer loop header
> > > >   # D__lsm.14_13 = PHI <0(4), _38(8)>
> > > >   # D__lsm.16_34 = PHI <D__lsm.16_1(4), sum.1_27(8)>
> > > >   D__lsm.10_20 = 0;
> > > >   D__lsm.11_22 = 1;
> > > >   _21 = D__lsm.14_13;
> > > >   D__lsm.12_28 = D__lsm.16_34;
> > > >   D__lsm.13_30 = 0;
> > > >   goto <bb 7>;
> > > >
> > > >   <bb 7>: inner loop header, latch
> > > >   # D__lsm.10_47 = PHI <0(5), _33(7)>
> > > >   # D__lsm.12_49 = PHI <D__lsm.16_34(5), sum.1_27(7)>
> > > >   _23 = D__lsm.10_47;
> > > >   _24 = *_19[D__lsm.14_13][D__lsm.10_47];
> > > >   sum.0_26 = D__lsm.12_49;
> > > >   sum.1_27 = _24 + D__lsm.12_49;
> > > >   D__lsm.12_31 = sum.1_27;
> > > >   D__lsm.13_32 = 1;
> > > >   _33 = D__lsm.10_47 + 1;
> > > >   D__lsm.10_14 = _33;
> > > >   D__lsm.11_15 = 1;
> > > >   j.2_16 = (unsigned int) _33;
> > > >   if (j.2_16 < _45)
> > > >     goto <bb 7>;
> > > >   else
> > > >     goto <bb 8>;
> > > >
> > > >   <bb 8>: outer loop latch
> > > >   # D__lsm.10_35 = PHI <_33(7)>
> > > >   # D__lsm.11_37 = PHI <1(7)>
> > > >   # D__lsm.12_7 = PHI <sum.1_27(7)>
> > > >   # D__lsm.13_8 = PHI <1(7)>
> > > >   # sum.1_48 = PHI <sum.1_27(7)>
> > > >   # _53 = PHI <_33(7)>
> > > >   D__lsm.16_56 = sum.1_27;
> > > >   D__lsm.17_57 = 1;
> > > >   _36 = D__lsm.14_13;
> > > >   _38 = D__lsm.14_13 + 1;
> > > >   D__lsm.14_58 = _38;
> > > >   D__lsm.15_59 = 1;
> > > >   i.3_9 = (unsigned int) _38;
> > > >   if (i.3_9 < _45)
> > > >     goto <bb 5>;
> > > >   else
> > > >     goto <bb 6>;
> > > >
> > > >   <bb 6>:
> > > >   # D__lsm.10_10 = PHI <_33(8)>
> > > >   # D__lsm.11_11 = PHI <1(8)>
> > > >   # _43 = PHI <_33(8)>
> > > >   # D__lsm.16_62 = PHI <sum.1_27(8)>
> > > >   # D__lsm.17_63 = PHI <1(8)>
> > > >   # D__lsm.14_64 = PHI <_38(8)>
> > > >   # D__lsm.15_65 = PHI <1(8)>
> > > >   *_5 = _38;
> > > >   *_25 = sum.1_27;
> > > >   *_12 = _33;
> > > >   goto <bb 3>;
> > > >
> > > >   <bb 3>:
> > > >   return;
> > > >...
> > Sorry but staring at dumps doesn't make me understand the issue you
> > run into.  Where can I reproduce this if I have time to look at this?
> 
> I've posted the state of the patch series that reproduces this problem at
> https://github.com/vries/gcc/commits/vries/master-port-kernels-test-rb , run
> goacc.exp, testcase kernels-double-reduction-n.c.
> 
> > From the dump below I understand you want no memory references in
> > the outer loop?
> > So the issue seems to be that store motion fails
> > to insert the preheader load / exit store to the outermost loop
> > possible and thus another LIM pass is needed to "store motion" those
> > again?
> 
> Yep.
> 
> >  But a simple testcase
> > 
> > int a;
> > int *p = &a;
> > int foo (int n)
> > {
> >    for (int i = 0; i < n; ++i)
> >      for (int j = 0; j < 100; ++j)
> >        *p += j + i;
> >    return a;
> > }
> > 
> > shows that LIM can do this in one step.
> 
> I've filed a FTR PR68465 - "pass_lim doesn't detect identical loop entry
> conditions" for a test-case where that doesn't happen (when using
> -fno-tree-dominator-opts).
> 
> > Which means it should
> > be investigated why it doesn't do this properly for your testcase
> > (store motion of *_25).
> 
> There seem to be two related problems:
> 1. the store has tree_could_trap_p (ref->mem.ref) true, which should be
>    false. I'll work on a fix for this.
> 2. Given that the store can trap, I was running into PR68465. I managed
>    to eliminate the 2nd pass_lim by moving the pass_dominator instance
>    before the pass_lim instance.
> 
> Attached patch shows the pass group with only one pass_lim. I hope to be able
> to eliminate the first pass_dominator instance before pass_lim once I fix 1.
> 
> > Simply adding two LIM passes either papers over a wrong-code
> > bug (in LIM or in DOM) or over a missed-optimization in LIM.
> 
> AFAIU now, it's PR68465, a missed optimization in LIM.

OK, it's not really LIM's job to clean up loop header copying that way.

DOM performs jump-threading for this but FRE should also be able
to handle this just fine.  Ah, it doesn't because the outer loop
header directly contains the condition

Index: gcc/tree-ssa-sccvn.c
===================================================================
--- gcc/tree-ssa-sccvn.c        (revision 230737)
+++ gcc/tree-ssa-sccvn.c        (working copy)
@@ -4357,20 +4402,32 @@ sccvn_dom_walker::before_dom_children (b
 
   /* If we have a single predecessor record the equivalence from a
      possible condition on the predecessor edge.  */
-  if (single_pred_p (bb))
+  edge pred_e = NULL;
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    {
+      if (e->flags & EDGE_DFS_BACK)
+       continue;
+      if (! pred_e)
+       pred_e = e;
+      else
+       {
+         pred_e = NULL;
+         break;
+       }
+    }
+  if (pred_e)
     {
-      edge e = single_pred_edge (bb);
       /* Check if there are multiple executable successor edges in
         the source block.  Otherwise there is no additional info
         to be recorded.  */
       edge e2;
-      FOR_EACH_EDGE (e2, ei, e->src->succs)
-       if (e2 != e
+      FOR_EACH_EDGE (e2, ei, pred_e->src->succs)
+       if (e2 != pred_e
            && e2->flags & EDGE_EXECUTABLE)
          break;
       if (e2 && (e2->flags & EDGE_EXECUTABLE))
        {
-         gimple *stmt = last_stmt (e->src);
+         gimple *stmt = last_stmt (pred_e->src);
          if (stmt
              && gimple_code (stmt) == GIMPLE_COND)
            {
@@ -4378,11 +4435,11 @@ sccvn_dom_walker::before_dom_children (b
              tree lhs = gimple_cond_lhs (stmt);
              tree rhs = gimple_cond_rhs (stmt);
              record_conds (bb, code, lhs, rhs,
-                           (e->flags & EDGE_TRUE_VALUE) != 0);
+                           (pred_e->flags & EDGE_TRUE_VALUE) != 0);
              code = invert_tree_comparison (code, HONOR_NANS (lhs));
              if (code != ERROR_MARK)
                record_conds (bb, code, lhs, rhs,
-                             (e->flags & EDGE_TRUE_VALUE) == 0);
+                             (pred_e->flags & EDGE_TRUE_VALUE) == 0);
            }
        }
     }

fixes this for me (for a small testcase).  Does it help yours?

Otherwise untested of course (I hope EDGE_DFS_BACK is good enough,
it's supposed to match edges that have the src dominated by the dest).
Testing the above now.

Thanks,
Richard.
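
[ Not part of the original message: a minimal sketch of the predecessor
selection that the tree-ssa-sccvn.c patch above performs, written against toy
stand-ins rather than GCC's real edge and basic_block types. The names Edge,
dfs_back and single_forward_pred are invented for illustration; only the
control flow is meant to mirror the patch. ]

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for a CFG edge; not GCC's real `edge` type.
struct Edge {
  int src;        // index of the source block
  bool dfs_back;  // plays the role of (e->flags & EDGE_DFS_BACK)
};

// Return the unique predecessor edge that is not a back edge, or nullptr
// when there is no such edge or more than one.
const Edge *single_forward_pred(const std::vector<Edge> &preds) {
  const Edge *pred_e = nullptr;
  for (const Edge &e : preds) {
    if (e.dfs_back)
      continue;          // ignore the loop latch edge
    if (pred_e)
      return nullptr;    // second forward predecessor: give up
    pred_e = &e;
  }
  return pred_e;
}

// A loop header with one entry edge and one back edge: the entry edge is
// picked, so a condition on it could be recorded for the header block.
void self_check() {
  std::vector<Edge> header_preds = {{2, false}, {8, true}};
  assert(single_forward_pred(header_preds) == &header_preds[0]);
}
```

With the old single_pred_p test such a loop header has two predecessors and
is skipped; ignoring the back edge is what lets the entry condition be used.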


* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-21 12:24                 ` Tom de Vries
@ 2015-11-23 11:46                   ` Richard Biener
  2015-11-27 11:44                     ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-23 11:46 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Jakub Jelinek, gcc-patches

On Sat, 21 Nov 2015, Tom de Vries wrote:

> On 13/11/15 12:39, Jakub Jelinek wrote:
> > On Fri, Nov 13, 2015 at 12:29:51PM +0100, Richard Biener wrote:
> > > > thanks for the explanation. Filed as PR68331 - '[meta-bug] fipa-pta
> > > > issues'.
> > > > 
> > > > Any feedback on the '#pragma GCC offload-alias=<none|pointer|all>' bit
> > > > above?
> > > > Is that sort of what you had in mind?
> > > 
> > > Yes.  Whether that makes sense is another question of course.  You can
> > > annotate memory references with MR_DEPENDENCE_BASE/CLIQUE yourself
> > > > as well if you know dependences without the user's intervention.
> > 
> > > I really don't like even the GCC offload-alias; I just don't see anything
> > > special about the offload code.  Not to mention that the same issue already
> > > exists with other outlined functions, like OpenMP tasks or parallel regions;
> > > those aren't offloaded, yet they can suffer from worse alias/points-to
> > > analysis too.
> 
> AFAIU there is one aspect that is different for offloaded code: the setup of
> the data on the device.
> 
> Consider this example:
> ...
> unsigned int a[N];
> unsigned int b[N];
> unsigned int c[N];
> 
> int
> main (void)
> {
>   ...
> 
> #pragma acc kernels copyin (a) copyin (b) copyout (c)
>   {
>     for (COUNTERTYPE ii = 0; ii < N; ii++)
>       c[ii] = a[ii] + b[ii];
>   }
> 
>   ...
> ...
> 
> At gimple level, we have:
> ...
> #pragma omp target oacc_kernels \
>   map(force_from:c [len: 2097152]) \
>   map(force_to:b [len: 2097152]) \
>   map(force_to:a [len: 2097152])
> ...
> 
> [ The meaning of the force_from/force_to mappings is given in
> include/gomp-constants.h:
> ...
>     /* Allocate.  */
>     GOMP_MAP_FORCE_ALLOC = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_ALLOC),
>     /* ..., and copy to device.  */
>     GOMP_MAP_FORCE_TO = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_TO),
>     /* ..., and copy from device.  */
>     GOMP_MAP_FORCE_FROM = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_FROM),
>     /* ..., and copy to and from device.  */
>     GOMP_MAP_FORCE_TOFROM = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_TOFROM),
> ...  ]
> 
> So before calling the offloaded function, a separate alloc is done for a, b
> and c, and the base pointers of the newly allocated objects are passed to the
> offloaded function.
> 
> This means we can mark those base pointers as restrict in the offloaded
> function.
> 
> Attached proof-of-concept patch implements that.
> 
> > We simply have some compiler internal interface between the caller and
> > callee of the outlined regions, each interface in between those has
> > its own structure type used to communicate the info;
> > we can attach attributes on the fields, or some flags to indicate some
> > properties interesting from aliasing POV.
> > We don't really need to perform
> > full IPA-PTA, perhaps it would be enough to a) record somewhere in cgraph
> > the relationship in between such callers and callees (for offloading regions
> > we already have "omp target entrypoint" attribute on the callee and a
> > single caller), tell LTO if possible not to split those into different
> > partitions if easily possible, and then just for these pairs perform
> > aliasing/points-to analysis in the caller and the result record using
> > cliques/special attributes/whatever to the callee side, so that the callee
> > (outlined OpenMP/OpenACC/Cilk+ region) can then improve its alias analysis.
> 
> As a start, is the approach of this patch OK?

Works for me but leaving to Jakub to review for correctness.

Richard.
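
[ Not part of the original message: a minimal sketch of what "mark those base
pointers as restrict" means for the vector-add example quoted above. The
function name kernel_body is invented, and __restrict__ is the GNU C++
spelling of C's restrict; the code GCC actually outlines for the kernels
region looks different. The point is only that separately allocated a, b and
c cannot overlap, so the qualifier is a promise the compiler may rely on. ]

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the outlined kernels region. Each pointer is
// promised to be the only way to reach its array inside this function.
void kernel_body(unsigned int *__restrict__ c,
                 const unsigned int *__restrict__ a,
                 const unsigned int *__restrict__ b,
                 std::size_t n) {
  for (std::size_t ii = 0; ii < n; ii++)
    c[ii] = a[ii] + b[ii];  // no store to c can change a later a[] or b[] load
}

// Usage with three distinct arrays, which is the only case the promise covers.
void self_check() {
  unsigned int a[3] = {1, 2, 3};
  unsigned int b[3] = {10, 20, 30};
  unsigned int c[3] = {0, 0, 0};
  kernel_body(c, a, b, 3);
  assert(c[0] == 11 && c[1] == 22 && c[2] == 33);
}
```

Without the qualifier the compiler has to assume that a store to c[ii] might
modify a[] or b[], which is the kind of dependence that can stop a loop from
being parallelized.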

> It will allow us to commit the oacc kernels patch series with the ability to
> parallelize non-trivial testcases, and work on improving the alias bit after
> that.
> 
> Thanks,
> - Tom
> 
> 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-23 11:31                     ` Richard Biener
@ 2015-11-23 15:53                       ` Tom de Vries
  2015-11-23 16:38                         ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-23 15:53 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

On 23/11/15 12:31, Richard Biener wrote:
>>>  From the dump below I understand you want no memory references in
>>> > >the outer loop?
>>> > >So the issue seems to be that store motion fails
>>> > >to insert the preheader load / exit store to the outermost loop
>>> > >possible and thus another LIM pass is needed to "store motion" those
>>> > >again?
>> >
>> >Yep.
>> >
>>> > >  But a simple testcase
>>> > >
>>> > >int a;
>>> > >int *p = &a;
>>> > >int foo (int n)
>>> > >{
>>> > >    for (int i = 0; i < n; ++i)
>>> > >      for (int j = 0; j < 100; ++j)
>>> > >        *p += j + i;
>>> > >    return a;
>>> > >}
>>> > >
>>> > >shows that LIM can do this in one step.
>> >
>> >I've filed a FTR PR68465 - "pass_lim doesn't detect identical loop entry
>> >conditions" for a test-case where that doesn't happen (when using
>> >-fno-tree-dominator-opts).
>> >
>>> > >Which means it should
>>> > >be investigated why it doesn't do this properly for your testcase
>>> > >(store motion of *_25).
>> >
>> >There seem to be two related problems:
>> >1. the store has tree_could_trap_p (ref->mem.ref) true, which should be
>> >    false. I'll work on a fix for this.
>> >2. Given that the store can trap, I was running into PR68465. I managed
>> >    to eliminate the 2nd pass_lim by moving the pass_dominator instance
>> >    before the pass_lim instance.
>> >
>> >Attached patch shows the pass group with only one pass_lim. I hope to be able
>> >to eliminate the first pass_dominator instance before pass_lim once I fix 1.
>> >
>>> > >Simply adding two LIM passes either papers over a wrong-code
>>> > >bug (in LIM or in DOM) or over a missed-optimization in LIM.
>> >
>> >AFAIU now, it's PR68465, a missed optimization in LIM.
> OK, it's not really LIM's job to clean up loop header copying that way.
>
> DOM performs jump-threading for this but FRE should also be able
> to handle this just fine.  Ah, it doesn't because the outer loop
> header directly contains the condition
>
> Index: gcc/tree-ssa-sccvn.c
> ===================================================================
> --- gcc/tree-ssa-sccvn.c        (revision 230737)
> +++ gcc/tree-ssa-sccvn.c        (working copy)
> @@ -4357,20 +4402,32 @@ sccvn_dom_walker::before_dom_children (b
>
>     /* If we have a single predecessor record the equivalence from a
>        possible condition on the predecessor edge.  */
> -  if (single_pred_p (bb))
> +  edge pred_e = NULL;
> +  FOR_EACH_EDGE (e, ei, bb->preds)
> +    {
> +      if (e->flags & EDGE_DFS_BACK)
> +       continue;
> +      if (! pred_e)
> +       pred_e = e;
> +      else
> +       {
> +         pred_e = NULL;
> +         break;
> +       }
> +    }
> +  if (pred_e)
>       {
> -      edge e = single_pred_edge (bb);
>         /* Check if there are multiple executable successor edges in
>           the source block.  Otherwise there is no additional info
>           to be recorded.  */
>         edge e2;
> -      FOR_EACH_EDGE (e2, ei, e->src->succs)
> -       if (e2 != e
> +      FOR_EACH_EDGE (e2, ei, pred_e->src->succs)
> +       if (e2 != pred_e
>              && e2->flags & EDGE_EXECUTABLE)
>            break;
>         if (e2 && (e2->flags & EDGE_EXECUTABLE))
>          {
> -         gimple *stmt = last_stmt (e->src);
> +         gimple *stmt = last_stmt (pred_e->src);
>            if (stmt
>                && gimple_code (stmt) == GIMPLE_COND)
>              {
> @@ -4378,11 +4435,11 @@ sccvn_dom_walker::before_dom_children (b
>                tree lhs = gimple_cond_lhs (stmt);
>                tree rhs = gimple_cond_rhs (stmt);
>                record_conds (bb, code, lhs, rhs,
> -                           (e->flags & EDGE_TRUE_VALUE) != 0);
> +                           (pred_e->flags & EDGE_TRUE_VALUE) != 0);
>                code = invert_tree_comparison (code, HONOR_NANS (lhs));
>                if (code != ERROR_MARK)
>                  record_conds (bb, code, lhs, rhs,
> -                             (e->flags & EDGE_TRUE_VALUE) == 0);
> +                             (pred_e->flags & EDGE_TRUE_VALUE) == 0);
>              }
>          }
>       }
>
> fixes this for me (for a small testcase).  Does it help yours?
>

Yes, it has the desired effect (of not needing pass_dominator before
pass_lim). But the patch "Mark by_ref mem_ref in build_receiver_ref as
non-trapping", committed as r230738, also has that effect, so AFAIU I
don't require this tree-ssa-sccvn.c fix.

Thanks,
- Tom

> Otherwise untested of course (I hope EDGE_DFS_BACK is good enough,
> it's supposed to match edges that have the src dominated by the dest).
> Testing the above now.


* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-23 15:53                       ` Tom de Vries
@ 2015-11-23 16:38                         ` Richard Biener
  0 siblings, 0 replies; 133+ messages in thread
From: Richard Biener @ 2015-11-23 16:38 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Richard Biener, gcc-patches, Jakub Jelinek

On November 23, 2015 4:37:18 PM GMT+01:00, Tom de Vries <Tom_deVries@mentor.com> wrote:
>On 23/11/15 12:31, Richard Biener wrote:
>>>>  From the dump below I understand you want no memory references in
>>>> > >the outer loop?
>>>> > >So the issue seems to be that store motion fails
>>>> > >to insert the preheader load / exit store to the outermost loop
>>>> > >possible and thus another LIM pass is needed to "store motion" those
>>>> > >again?
>>> >
>>> >Yep.
>>> >
>>>> > >  But a simple testcase
>>>> > >
>>>> > >int a;
>>>> > >int *p = &a;
>>>> > >int foo (int n)
>>>> > >{
>>>> > >    for (int i = 0; i < n; ++i)
>>>> > >      for (int j = 0; j < 100; ++j)
>>>> > >        *p += j + i;
>>>> > >    return a;
>>>> > >}
>>>> > >
>>>> > >shows that LIM can do this in one step.
>>> >
>>> >I've filed a FTR PR68465 - "pass_lim doesn't detect identical loop entry
>>> >conditions" for a test-case where that doesn't happen (when using
>>> >-fno-tree-dominator-opts).
>>> >
>>>> > >Which means it should
>>>> > >be investigated why it doesn't do this properly for your testcase
>>>> > >(store motion of *_25).
>>> >
>>> >There seem to be two related problems:
>>> >1. the store has tree_could_trap_p (ref->mem.ref) true, which should be
>>> >    false. I'll work on a fix for this.
>>> >2. Given that the store can trap, I was running into PR68465. I managed
>>> >    to eliminate the 2nd pass_lim by moving the pass_dominator instance
>>> >    before the pass_lim instance.
>>> >
>>> >Attached patch shows the pass group with only one pass_lim. I hope to be able
>>> >to eliminate the first pass_dominator instance before pass_lim once I fix 1.
>>> >
>>>> > >Simply adding two LIM passes either papers over a wrong-code
>>>> > >bug (in LIM or in DOM) or over a missed-optimization in LIM.
>>> >
>>> >AFAIU now, it's PR68465, a missed optimization in LIM.
>> OK, it's not really LIM's job to clean up loop header copying that way.
>>
>> DOM performs jump-threading for this but FRE should also be able
>> to handle this just fine.  Ah, it doesn't because the outer loop
>> header directly contains the condition
>>
>> Index: gcc/tree-ssa-sccvn.c
>> ===================================================================
>> --- gcc/tree-ssa-sccvn.c        (revision 230737)
>> +++ gcc/tree-ssa-sccvn.c        (working copy)
>> @@ -4357,20 +4402,32 @@ sccvn_dom_walker::before_dom_children (b
>>
>>     /* If we have a single predecessor record the equivalence from a
>>        possible condition on the predecessor edge.  */
>> -  if (single_pred_p (bb))
>> +  edge pred_e = NULL;
>> +  FOR_EACH_EDGE (e, ei, bb->preds)
>> +    {
>> +      if (e->flags & EDGE_DFS_BACK)
>> +       continue;
>> +      if (! pred_e)
>> +       pred_e = e;
>> +      else
>> +       {
>> +         pred_e = NULL;
>> +         break;
>> +       }
>> +    }
>> +  if (pred_e)
>>       {
>> -      edge e = single_pred_edge (bb);
>>         /* Check if there are multiple executable successor edges in
>>           the source block.  Otherwise there is no additional info
>>           to be recorded.  */
>>         edge e2;
>> -      FOR_EACH_EDGE (e2, ei, e->src->succs)
>> -       if (e2 != e
>> +      FOR_EACH_EDGE (e2, ei, pred_e->src->succs)
>> +       if (e2 != pred_e
>>              && e2->flags & EDGE_EXECUTABLE)
>>            break;
>>         if (e2 && (e2->flags & EDGE_EXECUTABLE))
>>          {
>> -         gimple *stmt = last_stmt (e->src);
>> +         gimple *stmt = last_stmt (pred_e->src);
>>            if (stmt
>>                && gimple_code (stmt) == GIMPLE_COND)
>>              {
>> @@ -4378,11 +4435,11 @@ sccvn_dom_walker::before_dom_children (b
>>                tree lhs = gimple_cond_lhs (stmt);
>>                tree rhs = gimple_cond_rhs (stmt);
>>                record_conds (bb, code, lhs, rhs,
>> -                           (e->flags & EDGE_TRUE_VALUE) != 0);
>> +                           (pred_e->flags & EDGE_TRUE_VALUE) != 0);
>>                code = invert_tree_comparison (code, HONOR_NANS (lhs));
>>                if (code != ERROR_MARK)
>>                  record_conds (bb, code, lhs, rhs,
>> -                             (e->flags & EDGE_TRUE_VALUE) == 0);
>> +                             (pred_e->flags & EDGE_TRUE_VALUE) == 0);
>>              }
>>          }
>>       }
>>
>> fixes this for me (for a small testcase).  Does it help yours?
>>
>
>Yes, it has the desired effect (of not needing pass_dominator before
>pass_lim). But the patch "Mark by_ref mem_ref in build_receiver_ref as
>non-trapping", committed as r230738, also has that effect, so AFAIU I
>don't require this tree-ssa-sccvn.c fix.

OK, I committed it anyway already.

Richard.

>Thanks,
>- Tom
>
>> Otherwise untested of course (I hope EDGE_DFS_BACK is good enough,
>> it's supposed to match edges that have the src dominated by the dest).
>> Testing the above now.
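
[ Not part of the original message: a minimal sketch of the "store motion"
discussed in this subthread, applied by hand to the small testcase quoted
above. foo is that testcase unchanged; foo_store_motion and
check_equivalence are invented names, and the hand-written version is an
assumption about the shape LIM aims for (one load in the preheader of the
outermost loop, one store at its exit), not output copied from a GCC dump. ]

```cpp
#include <cassert>

int a;
int *p = &a;

// The testcase from the thread: *p is loaded and stored on every inner
// iteration.
int foo(int n) {
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < 100; ++j)
      *p += j + i;
  return a;
}

// Hand-applied store motion to the outermost loop.
int foo_store_motion(int n) {
  if (n <= 0)              // loop entry guard: do not touch *p if no iteration runs
    return a;
  int tmp = *p;            // preheader load
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < 100; ++j)
      tmp += j + i;        // the loop nest now works on a local only
  *p = tmp;                // exit store
  return a;
}

// Both versions must leave the same value in a.
void check_equivalence() {
  a = 0;
  int before = foo(3);
  a = 0;
  int after = foo_store_motion(3);
  assert(before == after);
}
```

Once the loop nest contains no memory references for the reduction, the
accumulation shows up as a phi on the loop headers, which is the form the
dumps earlier in the thread end up in.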



* Re: [PATCH, 6/16] Add pass_oacc_kernels
  2015-11-19 13:51     ` Tom de Vries
@ 2015-11-24 12:17       ` Tom de Vries
  2015-11-25 10:42         ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-24 12:17 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 2285 bytes --]

On 19/11/15 14:50, Tom de Vries wrote:
> On 11/11/15 11:58, Richard Biener wrote:
>> On Mon, 9 Nov 2015, Tom de Vries wrote:
>>
>>> On 09/11/15 16:35, Tom de Vries wrote:
>>>> Hi,
>>>>
>>>> this patch series for stage1 trunk adds support to:
>>>> - parallelize oacc kernels regions using parloops, and
>>>> - map the loops onto the oacc gang dimension.
>>>>
>>>> The patch series contains these patches:
>>>>
>>>>        1    Insert new exit block only when needed in
>>>>           transform_to_exit_first_loop_alt
>>>>        2    Make create_parallel_loop return void
>>>>        3    Ignore reduction clause on kernels directive
>>>>        4    Implement -foffload-alias
>>>>        5    Add in_oacc_kernels_region in struct loop
>>>>        6    Add pass_oacc_kernels
>>>>        7    Add pass_dominator_oacc_kernels
>>>>        8    Add pass_ch_oacc_kernels
>>>>        9    Add pass_parallelize_loops_oacc_kernels
>>>>       10    Add pass_oacc_kernels pass group in passes.def
>>>>       11    Update testcases after adding kernels pass group
>>>>       12    Handle acc loop directive
>>>>       13    Add c-c++-common/goacc/kernels-*.c
>>>>       14    Add gfortran.dg/goacc/kernels-*.f95
>>>>       15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>>>       16    Add libgomp.oacc-fortran/kernels-*.f95
>>>>
>>>> The first 9 patches are more or less independent, but patches 10-16 are
>>>> intended to be committed at the same time.
>>>>
>>>> Bootstrapped and reg-tested on x86_64.
>>>>
>>>> Build and reg-tested with nvidia accelerator, in combination with a
>>>> patch that enables accelerator testing (which is submitted at
>>>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>>>
>>>> I'll post the individual patches in reply to this message.
>>>
>>> this patch adds a pass group pass_oacc_kernels (which will be added to the
>>> pass list as a whole in patch 10).
>>
>> Just to understand (while also skimming the HSA patches).
>>
>> You are basically relying on autopar for what the HSA patches call
>> "gridification"?  That is, OMP lowering produces loopy kernels
>> and autopar then will basically strip the outermost loop?
>
> Short answer: no. In more detail...
<SNIP>

Reposting patch, after splitting the pass group into two.

Thanks,
- Tom


[-- Attachment #2: 0002-Add-pass_oacc_kernels.patch --]
[-- Type: text/x-patch, Size: 4336 bytes --]

Add pass_oacc_kernels

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* tree-pass.h (make_pass_oacc_kernels, make_pass_oacc_kernels2):
	Declare.
	* tree-ssa-loop.c (gate_oacc_kernels): New static function.
	(pass_data_oacc_kernels, pass_data_oacc_kernels2): New pass_data.
	(class pass_oacc_kernels, class pass_oacc_kernels2): New pass.
	(make_pass_oacc_kernels, make_pass_oacc_kernels2): New function.

---
 gcc/tree-pass.h     |   2 +
 gcc/tree-ssa-loop.c | 110 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 112 insertions(+)

diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index dcd2d5e..9704918 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -465,6 +465,8 @@ extern gimple_opt_pass *make_pass_strength_reduction (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_vtable_verify (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ubsan (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_sanopt (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_oacc_kernels (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_oacc_kernels2 (gcc::context *ctxt);
 
 /* IPA Passes */
 extern simple_ipa_opt_pass *make_pass_ipa_lower_emutls (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index afdef12..cf7d94e 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -35,6 +35,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-inline.h"
 #include "tree-scalar-evolution.h"
 #include "tree-vectorizer.h"
+#include "omp-low.h"
 
 
 /* A pass making sure loops are fixed up.  */
@@ -141,6 +142,115 @@ make_pass_tree_loop (gcc::context *ctxt)
   return new pass_tree_loop (ctxt);
 }
 
+/* Gate for oacc kernels pass group.  */
+
+static bool
+gate_oacc_kernels (function *fn)
+{
+  if (flag_tree_parallelize_loops <= 1)
+    return false;
+
+  tree oacc_function_attr = get_oacc_fn_attrib (fn->decl);
+  if (oacc_function_attr == NULL_TREE)
+    return false;
+
+  tree val = TREE_VALUE (oacc_function_attr);
+  while (val != NULL_TREE && TREE_VALUE (val) == NULL_TREE)
+    val = TREE_CHAIN (val);
+
+  if (val != NULL_TREE)
+    return false;
+
+  struct loop *loop;
+  FOR_EACH_LOOP (loop, 0)
+    if (loop->in_oacc_kernels_region)
+      return true;
+
+  return false;
+}
+
+/* The oacc kernels superpass.  */
+
+namespace {
+
+const pass_data pass_data_oacc_kernels =
+{
+  GIMPLE_PASS, /* type */
+  "oacc_kernels", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_TREE_LOOP, /* tv_id */
+  PROP_cfg, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_oacc_kernels : public gimple_opt_pass
+{
+public:
+  pass_oacc_kernels (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_oacc_kernels, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *fn) { return gate_oacc_kernels (fn); }
+
+}; // class pass_oacc_kernels
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_oacc_kernels (gcc::context *ctxt)
+{
+  return new pass_oacc_kernels (ctxt);
+}
+
+namespace {
+
+const pass_data pass_data_oacc_kernels2 =
+{
+  GIMPLE_PASS, /* type */
+  "oacc_kernels2", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_TREE_LOOP, /* tv_id */
+  PROP_cfg, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_oacc_kernels2 : public gimple_opt_pass
+{
+public:
+  pass_oacc_kernels2 (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_oacc_kernels2, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *fn) { return gate_oacc_kernels (fn); }
+  virtual unsigned int execute (function *fn)
+    {
+      /* Rather than having a copy of the previous dump, get some use out of
+	 this dump, and try to minimize differences with the following pass
+	 (pass_lim), which will initialize the loop optimizer with
+	 LOOPS_NORMAL.  */
+      loop_optimizer_init (LOOPS_NORMAL);
+      loop_optimizer_finalize (fn);
+      return 0;
+    }
+
+}; // class pass_oacc_kernels2
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_oacc_kernels2 (gcc::context *ctxt)
+{
+  return new pass_oacc_kernels2 (ctxt);
+}
+
 /* The no-loop superpass.  */
 
 namespace {


* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-23 10:11                   ` Richard Biener
@ 2015-11-24 12:22                     ` Tom de Vries
  2015-11-24 13:19                       ` Richard Biener
  2015-11-25 10:44                       ` Richard Biener
  0 siblings, 2 replies; 133+ messages in thread
From: Tom de Vries @ 2015-11-24 12:22 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 1113 bytes --]

On 23/11/15 11:02, Richard Biener wrote:
> On Fri, 20 Nov 2015, Tom de Vries wrote:
>
>> On 20/11/15 14:29, Richard Biener wrote:
>>> I agree it's somewhat of an odd behavior but all passes should
>>> either be placed in a sub-pipeline with an outer
>>> loop_optimizer_init()/finalize () call or call both themselves.
>>
>> Hmm, but adding loop_optimizer_finalize at the end of pass_lim breaks the loop
>> pipeline.
>>
>> We could use the style used in pass_slp_vectorize::execute:
>> ...
>> pass_slp_vectorize::execute (function *fun)
>> {
>>    basic_block bb;
>>
>>    bool in_loop_pipeline = scev_initialized_p ();
>>    if (!in_loop_pipeline)
>>      {
>>        loop_optimizer_init (LOOPS_NORMAL);
>>        scev_initialize ();
>>      }
>>
>>    ...
>>
>>    if (!in_loop_pipeline)
>>      {
>>        scev_finalize ();
>>        loop_optimizer_finalize ();
>>      }
>> ...
>>
>> Although that doesn't strike me as particularly clean.
>
> At least it would be a consistent "unclean" style.  So yes, the
> above would work for me.
>

Reposting using the in_loop_pipeline style in pass_lim.

Thanks,
- Tom
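
[ Not part of the original message: a minimal sketch of the in_loop_pipeline
style quoted above, using a toy state object instead of GCC's global loop
optimizer state. LoopState, toy_init, toy_finalize and run_pass are invented
names; in the quoted pass_slp_vectorize code the check is
scev_initialized_p () and the calls are loop_optimizer_init/finalize and
scev_initialize/finalize. ]

```cpp
#include <cassert>

// Toy model of "is the loop optimizer set up?"; not GCC's real API.
struct LoopState {
  bool initialized = false;
  int init_calls = 0;
  int finalize_calls = 0;
};

void toy_init(LoopState &s) {
  assert(!s.initialized);   // initializing twice would be a bug
  s.initialized = true;
  ++s.init_calls;
}

void toy_finalize(LoopState &s) {
  assert(s.initialized);    // finalizing without init would be a bug
  s.initialized = false;
  ++s.finalize_calls;
}

// A pass body that can run both inside an enclosing loop pipeline (state
// already set up by the pipeline) and standalone (sets up and tears down
// the state itself).
void run_pass(LoopState &s) {
  bool in_loop_pipeline = s.initialized;
  if (!in_loop_pipeline)
    toy_init(s);

  // ... the pass's real work would go here ...

  if (!in_loop_pipeline)
    toy_finalize(s);        // only undo what this pass set up
}
```

The key property is that the pass leaves the state exactly as it found it:
finalizing unconditionally at the end of the pass would tear down state that
the enclosing loop pipeline still needs.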


[-- Attachment #2: 0004-Add-pass_oacc_kernels-pass-group-in-passes.def.patch --]
[-- Type: text/x-patch, Size: 3891 bytes --]

Add pass_oacc_kernels pass group in passes.def

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* omp-low.c (pass_expand_omp_ssa::clone): New function.
	* passes.def: Add pass_oacc_kernels pass group.
	* tree-ssa-loop-ch.c (pass_ch::clone): New function.
	* tree-ssa-loop-im.c (tree_ssa_lim): Make static.
	(pass_lim::execute): Allow to run outside pass_tree_loop.

---
 gcc/omp-low.c          |  1 +
 gcc/passes.def         | 18 ++++++++++++++++++
 gcc/tree-ssa-loop-ch.c |  2 ++
 gcc/tree-ssa-loop-im.c | 12 ++++++++++--
 4 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index efe5d3a..7318b0e 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -13366,6 +13366,7 @@ public:
       return !(fun->curr_properties & PROP_gimple_eomp);
     }
   virtual unsigned int execute (function *) { return execute_expand_omp (); }
+  opt_pass * clone () { return new pass_expand_omp_ssa (m_ctxt); }
 
 }; // class pass_expand_omp_ssa
 
diff --git a/gcc/passes.def b/gcc/passes.def
index 17027786..f1969c0 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -88,7 +88,25 @@ along with GCC; see the file COPYING3.  If not see
 	  /* pass_build_ealias is a dummy pass that ensures that we
 	     execute TODO_rebuild_alias at this point.  */
 	  NEXT_PASS (pass_build_ealias);
+	  /* Pass group that runs when the function is an offloaded function
+	     containing oacc kernels loops.  Part 1.  */
+	  NEXT_PASS (pass_oacc_kernels);
+	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
+	      NEXT_PASS (pass_ch);
+	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_fre);
+	  /* Pass group that runs when the function is an offloaded function
+	     containing oacc kernels loops.  Part 2.  */
+	  NEXT_PASS (pass_oacc_kernels2);
+	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
+	      /* We use pass_lim to rewrite in-memory iteration and reduction
+	         variable accesses in loops into local variables accesses.  */
+	      NEXT_PASS (pass_lim);
+	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+	      NEXT_PASS (pass_dce);
+	      NEXT_PASS (pass_parallelize_loops_oacc_kernels);
+	      NEXT_PASS (pass_expand_omp_ssa);
+	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_merge_phi);
           NEXT_PASS (pass_dse);
 	  NEXT_PASS (pass_cd_dce);
diff --git a/gcc/tree-ssa-loop-ch.c b/gcc/tree-ssa-loop-ch.c
index 7e618bf..6493fcc 100644
--- a/gcc/tree-ssa-loop-ch.c
+++ b/gcc/tree-ssa-loop-ch.c
@@ -165,6 +165,8 @@ public:
   /* Initialize and finalize loop structures, copying headers inbetween.  */
   virtual unsigned int execute (function *);
 
+  opt_pass * clone () { return new pass_ch (m_ctxt); }
+
 protected:
   /* ch_base method: */
   virtual bool process_loop_p (struct loop *loop);
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 30b53ce..0d82d36 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -43,6 +43,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa-propagate.h"
 #include "trans-mem.h"
 #include "gimple-fold.h"
+#include "tree-scalar-evolution.h"
 
 /* TODO:  Support for predicated code motion.  I.e.
 
@@ -2496,7 +2497,7 @@ tree_ssa_lim_finalize (void)
 /* Moves invariants from loops.  Only "expensive" invariants are moved out --
    i.e. those that are likely to be win regardless of the register pressure.  */
 
-unsigned int
+static unsigned int
 tree_ssa_lim (void)
 {
   unsigned int todo;
@@ -2560,10 +2561,17 @@ public:
 unsigned int
 pass_lim::execute (function *fun)
 {
+  bool in_loop_pipeline = scev_initialized_p ();
+  if (!in_loop_pipeline)
+    loop_optimizer_init (LOOPS_NORMAL | LOOPS_HAVE_RECORDED_EXITS);
+
   if (number_of_loops (fun) <= 1)
     return 0;
+  unsigned int todo = tree_ssa_lim ();
 
-  return tree_ssa_lim ();
+  if (!in_loop_pipeline)
+    loop_optimizer_finalize ();
+  return todo;
 }
 
 } // anon namespace


* [PING][PATCH, 3/16] Ignore reduction clause on kernels directive
  2015-11-09 15:51 ` [PATCH, 3/16] Ignore reduction clause on kernels directive Tom de Vries
@ 2015-11-24 12:25   ` Tom de Vries
  2016-01-18 14:24     ` [PING^2][PATCH, " Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-24 12:25 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener, Thomas Schwinge

On 09/11/15 16:50, Tom de Vries wrote:
> On 09/11/15 16:35, Tom de Vries wrote:
>> Hi,
>>
>> this patch series for stage1 trunk adds support to:
>> - parallelize oacc kernels regions using parloops, and
>> - map the loops onto the oacc gang dimension.
>>
>> The patch series contains these patches:
>>
>>       1    Insert new exit block only when needed in
>>          transform_to_exit_first_loop_alt
>>       2    Make create_parallel_loop return void
>>       3    Ignore reduction clause on kernels directive
>>       4    Implement -foffload-alias
>>       5    Add in_oacc_kernels_region in struct loop
>>       6    Add pass_oacc_kernels
>>       7    Add pass_dominator_oacc_kernels
>>       8    Add pass_ch_oacc_kernels
>>       9    Add pass_parallelize_loops_oacc_kernels
>>      10    Add pass_oacc_kernels pass group in passes.def
>>      11    Update testcases after adding kernels pass group
>>      12    Handle acc loop directive
>>      13    Add c-c++-common/goacc/kernels-*.c
>>      14    Add gfortran.dg/goacc/kernels-*.f95
>>      15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>      16    Add libgomp.oacc-fortran/kernels-*.f95
>>
>> The first 9 patches are more or less independent, but patches 10-16 are
>> intended to be committed at the same time.
>>
>> Bootstrapped and reg-tested on x86_64.
>>
>> Build and reg-tested with nvidia accelerator, in combination with a
>> patch that enables accelerator testing (which is submitted at
>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>
>> I'll post the individual patches in reply to this message.
>
> As discussed here (
> https://gcc.gnu.org/ml/gcc-patches/2015-11/msg00785.html ), the kernels
> directive does not allow the reduction clause.  This patch fixes that.
>

Ping.

Thanks,
- Tom



* Re: [PATCH, 9/16] Add pass_parallelize_loops_oacc_kernels
  2015-11-16 11:59   ` Tom de Vries
@ 2015-11-24 12:27     ` Tom de Vries
  2015-12-13 16:58       ` [PIING][PATCH, " Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-24 12:27 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 3601 bytes --]

On 16/11/15 12:59, Tom de Vries wrote:
> On 09/11/15 20:52, Tom de Vries wrote:
>> On 09/11/15 16:35, Tom de Vries wrote:
>>> Hi,
>>>
>>> this patch series for stage1 trunk adds support to:
>>> - parallelize oacc kernels regions using parloops, and
>>> - map the loops onto the oacc gang dimension.
>>>
>>> The patch series contains these patches:
>>>
>>>       1    Insert new exit block only when needed in
>>>          transform_to_exit_first_loop_alt
>>>       2    Make create_parallel_loop return void
>>>       3    Ignore reduction clause on kernels directive
>>>       4    Implement -foffload-alias
>>>       5    Add in_oacc_kernels_region in struct loop
>>>       6    Add pass_oacc_kernels
>>>       7    Add pass_dominator_oacc_kernels
>>>       8    Add pass_ch_oacc_kernels
>>>       9    Add pass_parallelize_loops_oacc_kernels
>>>      10    Add pass_oacc_kernels pass group in passes.def
>>>      11    Update testcases after adding kernels pass group
>>>      12    Handle acc loop directive
>>>      13    Add c-c++-common/goacc/kernels-*.c
>>>      14    Add gfortran.dg/goacc/kernels-*.f95
>>>      15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>>      16    Add libgomp.oacc-fortran/kernels-*.f95
>>>
>>> The first 9 patches are more or less independent, but patches 10-16 are
>>> intended to be committed at the same time.
>>>
>>> Bootstrapped and reg-tested on x86_64.
>>>
>>> Build and reg-tested with nvidia accelerator, in combination with a
>>> patch that enables accelerator testing (which is submitted at
>>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>>
>>> I'll post the individual patches in reply to this message.
>>
>> This patch adds pass_parallelize_loops_oacc_kernels.
>>
>> There are a number of things we do differently in parloops for oacc
>> kernels:
>> - in normal parloops, we generate code to choose between a parallel
>>    version of the loop, and a sequential (low iteration count) version.
>>    Since the code in oacc kernels region is supposed to run on the
>>    accelerator anyway, we skip this check, and don't add a low iteration
>>    count loop.
>> - in normal parloops, we generate an #pragma omp parallel /
>>    GIMPLE_OMP_RETURN pair to delimit the region which we will split off
>>    into a thread function. Since the oacc kernels region is already
>>    split off, we don't add this pair.
>> - we indicate the parallelization factor by setting the oacc function
>>    attributes
>> - we generate an #pragma oacc loop instead of an #pragma omp for, and
>>    we add the gang clause
>> - in normal parloops, we rewrite the variable accesses in the loop
>>    into accesses relative to a thread function parameter. For the
>>    oacc kernels region, that rewrite has already been done at omp-lower,
>>    so we skip this.
>> - we need to ensure that the entire kernels region can be run in
>>    parallel. The loop independence check is already present, so for oacc
>>    kernels we add a check between blocks outside the loop and the entire
>>    region.
>> - we guard stores in the blocks outside the loop with gang_pos == 0.
>>    There's no need for each gang to write to a single location; we can
>>    do this in just one gang. (Typically this is the write of the final
>>    value of the iteration variable if that one is copied back to the
>>    host).
>>
>
> Reposting with loop optimizer init added in
> pass_parallelize_loops_oacc_kernels::execute.
>

Reposting with loop_optimizer_finalize, scev_initialize and scev_finalize
added in pass_parallelize_loops_oacc_kernels::execute.
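
To make the gang_pos == 0 guard from the description above concrete,
here is a small host-side model.  It is only a sketch under simplifying
assumptions: RegionResult and run_region are hypothetical names, the
gangs are simulated one after another rather than run in parallel, and
loop iterations are distributed cyclically, which is not necessarily
what the accelerator does.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of a kernels region executed by several gangs.
struct RegionResult {
  std::vector<int> a;  // array written inside the parallelized loop
  int final_i;         // scalar stored outside the loop (copied back to host)
  int scalar_stores;   // number of gangs that performed the scalar store
};

RegionResult run_region(int n, int num_gangs) {
  RegionResult r{std::vector<int>(n, 0), -1, 0};

  // Every gang executes the whole region; here the gangs run sequentially.
  for (int gang_pos = 0; gang_pos < num_gangs; ++gang_pos) {
    // Inside the loop, iterations are split over the gangs.
    for (int i = gang_pos; i < n; i += num_gangs)
      r.a[i] = 2 * i;

    // Outside the loop, every gang would store the same value, so the
    // store is guarded and performed by gang 0 only.
    if (gang_pos == 0) {
      r.final_i = n;
      ++r.scalar_stores;
    }
  }
  return r;
}
```

With, say, run_region (10, 4), each element of a is written by exactly
one gang, while the final value of the iteration variable is stored
once instead of four times.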

Thanks,
- Tom


[-- Attachment #2: 0003-Add-pass_parallelize_loops_oacc_kernels.patch --]
[-- Type: text/x-patch, Size: 30877 bytes --]

Add pass_parallelize_loops_oacc_kernels

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* omp-low.c (set_oacc_fn_attrib): Make extern.
	* omp-low.c (expand_omp_atomic_fetch_op):  Release defs of update stmt.
	* omp-low.h (set_oacc_fn_attrib): Declare.
	* tree-parloops.c (struct reduction_info): Add reduc_addr field.
	(create_call_for_reduction_1): Handle case that reduc_addr is non-NULL.
	(create_parallel_loop, gen_parallel_loop, try_create_reduction_list):
	Add and handle function parameter oacc_kernels_p.
	(get_omp_data_i_param): New function.
	(ref_conflicts_with_region, oacc_entry_exit_ok_1)
	(oacc_entry_exit_single_gang, oacc_entry_exit_ok): New function.
	(parallelize_loops): Add and handle function parameter oacc_kernels_p.
	Calculate dominance info.  Skip loops that are not in a kernels region
	in oacc_kernels_p mode.  Skip inner loops of parallelized loops.
	(pass_parallelize_loops::execute): Call parallelize_loops with false
	argument.
	(pass_data_parallelize_loops_oacc_kernels): New pass_data.
	(class pass_parallelize_loops_oacc_kernels): New pass.
	(pass_parallelize_loops_oacc_kernels::execute)
	(make_pass_parallelize_loops_oacc_kernels): New function.
	* tree-pass.h (make_pass_parallelize_loops_oacc_kernels): Declare.

---
 gcc/omp-low.c       |   8 +-
 gcc/omp-low.h       |   1 +
 gcc/tree-parloops.c | 700 +++++++++++++++++++++++++++++++++++++++++++++++-----
 gcc/tree-pass.h     |   2 +
 4 files changed, 647 insertions(+), 64 deletions(-)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 0d4c6e5..efe5d3a 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -11925,10 +11925,14 @@ expand_omp_atomic_fetch_op (basic_block load_bb,
   gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_ATOMIC_STORE);
   gsi_remove (&gsi, true);
   gsi = gsi_last_bb (store_bb);
+  stmt = gsi_stmt (gsi);
   gsi_remove (&gsi, true);
 
   if (gimple_in_ssa_p (cfun))
-    update_ssa (TODO_update_ssa_no_phi);
+    {
+      release_defs (stmt);
+      update_ssa (TODO_update_ssa_no_phi);
+    }
 
   return true;
 }
@@ -12302,7 +12306,7 @@ replace_oacc_fn_attrib (tree fn, tree dims)
    function attribute.  Push any that are non-constant onto the ARGS
    list, along with an appropriate GOMP_LAUNCH_DIM tag.  */
 
-static void
+void
 set_oacc_fn_attrib (tree fn, tree clauses, vec<tree> *args)
 {
   /* Must match GOMP_DIM ordering.  */
diff --git a/gcc/omp-low.h b/gcc/omp-low.h
index 194b3d1..1790f40 100644
--- a/gcc/omp-low.h
+++ b/gcc/omp-low.h
@@ -33,6 +33,7 @@ extern tree omp_member_access_dummy_var (tree);
 extern void replace_oacc_fn_attrib (tree, tree);
 extern tree build_oacc_routine_dims (tree);
 extern tree get_oacc_fn_attrib (tree);
+extern void set_oacc_fn_attrib (tree, tree, vec<tree> *);
 extern int get_oacc_ifn_dim_arg (const gimple *);
 extern int get_oacc_fn_dim_size (tree, int);
 
diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
index 9b564ca..0403d3b 100644
--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c
@@ -53,6 +53,10 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa.h"
 #include "params.h"
 #include "params-enum.h"
+#include "tree-ssa-alias.h"
+#include "tree-eh.h"
+#include "gomp-constants.h"
+#include "tree-dfa.h"
 
 /* This pass tries to distribute iterations of loops into several threads.
    The implementation is straightforward -- for each loop we test whether its
@@ -192,6 +196,8 @@ struct reduction_info
 				   of the reduction variable when existing the loop. */
   tree initial_value;		/* The initial value of the reduction var before entering the loop.  */
   tree field;			/*  the name of the field in the parloop data structure intended for reduction.  */
+  tree reduc_addr;		/* The address of the reduction variable for
+				   openacc reductions.  */
   tree init;			/* reduction initialization value.  */
   gphi *new_phi;		/* (helper field) Newly created phi node whose result
 				   will be passed to the atomic operation.  Represents
@@ -1085,10 +1091,29 @@ create_call_for_reduction_1 (reduction_info **slot, struct clsn_data *clsn_data)
   tree tmp_load, name;
   gimple *load;
 
-  load_struct = build_simple_mem_ref (clsn_data->load);
-  t = build3 (COMPONENT_REF, type, load_struct, reduc->field, NULL_TREE);
+  if (reduc->reduc_addr == NULL_TREE)
+    {
+      load_struct = build_simple_mem_ref (clsn_data->load);
+      t = build3 (COMPONENT_REF, type, load_struct, reduc->field, NULL_TREE);
+
+      addr = build_addr (t);
+    }
+  else
+    {
+      /* Set the address for the atomic store.  */
+      addr = reduc->reduc_addr;
 
-  addr = build_addr (t);
+      /* Remove the non-atomic store '*addr = sum'.  */
+      tree res = PHI_RESULT (reduc->keep_res);
+      use_operand_p use_p;
+      gimple *stmt;
+      bool single_use_p = single_imm_use (res, &use_p, &stmt);
+      gcc_assert (single_use_p);
+      replace_uses_by (gimple_vdef (stmt),
+		       gimple_vuse (stmt));
+      gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+      gsi_remove (&gsi, true);
+    }
 
   /* Create phi node.  */
   bb = clsn_data->load_bb;
@@ -1990,7 +2015,8 @@ transform_to_exit_first_loop (struct loop *loop,
 
 static void
 create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
-		      tree new_data, unsigned n_threads, location_t loc)
+		      tree new_data, unsigned n_threads, location_t loc,
+		      bool oacc_kernels_p)
 {
   gimple_stmt_iterator gsi;
   basic_block bb, paral_bb, for_bb, ex_bb, continue_bb;
@@ -2003,19 +2029,33 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
   gomp_continue *omp_cont_stmt;
   tree cvar, cvar_init, initvar, cvar_next, cvar_base, type;
   edge exit, nexit, guard, end, e;
+  tree for_clauses = NULL_TREE;
 
   /* Prepare the GIMPLE_OMP_PARALLEL statement.  */
   bb = loop_preheader_edge (loop)->src;
-  paral_bb = single_pred (bb);
-  gsi = gsi_last_bb (paral_bb);
+  if (!oacc_kernels_p)
+    {
+      paral_bb = single_pred (bb);
+      gsi = gsi_last_bb (paral_bb);
+    }
 
-  t = build_omp_clause (loc, OMP_CLAUSE_NUM_THREADS);
-  OMP_CLAUSE_NUM_THREADS_EXPR (t)
-    = build_int_cst (integer_type_node, n_threads);
-  omp_par_stmt = gimple_build_omp_parallel (NULL, t, loop_fn, data);
-  gimple_set_location (omp_par_stmt, loc);
+  if (!oacc_kernels_p)
+    {
+      t = build_omp_clause (loc, OMP_CLAUSE_NUM_THREADS);
+      OMP_CLAUSE_NUM_THREADS_EXPR (t)
+	= build_int_cst (integer_type_node, n_threads);
+      omp_par_stmt = gimple_build_omp_parallel (NULL, t, loop_fn, data);
+      gimple_set_location (omp_par_stmt, loc);
 
-  gsi_insert_after (&gsi, omp_par_stmt, GSI_NEW_STMT);
+      gsi_insert_after (&gsi, omp_par_stmt, GSI_NEW_STMT);
+    }
+  else
+    {
+      tree clause = build_omp_clause (loc, OMP_CLAUSE_NUM_GANGS);
+      OMP_CLAUSE_NUM_GANGS_EXPR (clause)
+	= build_int_cst (integer_type_node, n_threads);
+      set_oacc_fn_attrib (cfun->decl, clause, NULL);
+    }
 
   /* Initialize NEW_DATA.  */
   if (data)
@@ -2033,12 +2073,18 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
       gsi_insert_before (&gsi, assign_stmt, GSI_SAME_STMT);
     }
 
-  /* Emit GIMPLE_OMP_RETURN for GIMPLE_OMP_PARALLEL.  */
-  bb = split_loop_exit_edge (single_dom_exit (loop));
-  gsi = gsi_last_bb (bb);
-  omp_return_stmt1 = gimple_build_omp_return (false);
-  gimple_set_location (omp_return_stmt1, loc);
-  gsi_insert_after (&gsi, omp_return_stmt1, GSI_NEW_STMT);
+  /* Skip insertion of OMP_RETURN for oacc_kernels_p.  We've already generated
+     one when lowering the oacc kernels directive in
+     pass_lower_omp/lower_omp (). */
+  if (!oacc_kernels_p)
+    {
+      /* Emit GIMPLE_OMP_RETURN for GIMPLE_OMP_PARALLEL.  */
+      bb = split_loop_exit_edge (single_dom_exit (loop));
+      gsi = gsi_last_bb (bb);
+      omp_return_stmt1 = gimple_build_omp_return (false);
+      gimple_set_location (omp_return_stmt1, loc);
+      gsi_insert_after (&gsi, omp_return_stmt1, GSI_NEW_STMT);
+    }
 
   /* Extract data for GIMPLE_OMP_FOR.  */
   gcc_assert (loop->header == single_dom_exit (loop)->src);
@@ -2130,7 +2176,17 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
     OMP_CLAUSE_SCHEDULE_CHUNK_EXPR (t)
       = build_int_cst (integer_type_node, chunk_size);
 
-  for_stmt = gimple_build_omp_for (NULL, GF_OMP_FOR_KIND_FOR, t, 1, NULL);
+  if (1)
+    {
+      /* In combination with the NUM_GANGS on the parallel.  */
+      for_clauses = build_omp_clause (loc, OMP_CLAUSE_GANG);
+    }
+
+  for_stmt = gimple_build_omp_for (NULL,
+				   (oacc_kernels_p
+				    ? GF_OMP_FOR_KIND_OACC_LOOP
+				    : GF_OMP_FOR_KIND_FOR),
+				   for_clauses, 1, NULL);
   gimple_set_location (for_stmt, loc);
   gimple_omp_for_set_index (for_stmt, 0, initvar);
   gimple_omp_for_set_initial (for_stmt, 0, cvar_init);
@@ -2172,7 +2228,8 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
 static void
 gen_parallel_loop (struct loop *loop,
 		   reduction_info_table_type *reduction_list,
-		   unsigned n_threads, struct tree_niter_desc *niter)
+		   unsigned n_threads, struct tree_niter_desc *niter,
+		   bool oacc_kernels_p)
 {
   tree many_iterations_cond, type, nit;
   tree arg_struct, new_arg_struct;
@@ -2253,40 +2310,44 @@ gen_parallel_loop (struct loop *loop,
   if (stmts)
     gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
 
-  if (loop->inner)
-    m_p_thread=2;
-  else
-    m_p_thread=MIN_PER_THREAD;
-
-   many_iterations_cond =
-     fold_build2 (GE_EXPR, boolean_type_node,
-                nit, build_int_cst (type, m_p_thread * n_threads));
-
-  many_iterations_cond
-    = fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
-		   invert_truthvalue (unshare_expr (niter->may_be_zero)),
-		   many_iterations_cond);
-  many_iterations_cond
-    = force_gimple_operand (many_iterations_cond, &stmts, false, NULL_TREE);
-  if (stmts)
-    gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
-  if (!is_gimple_condexpr (many_iterations_cond))
+  if (!oacc_kernels_p)
     {
+      if (loop->inner)
+	m_p_thread=2;
+      else
+	m_p_thread=MIN_PER_THREAD;
+
+      many_iterations_cond =
+	fold_build2 (GE_EXPR, boolean_type_node,
+		     nit, build_int_cst (type, m_p_thread * n_threads));
+
+      many_iterations_cond
+	= fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
+		       invert_truthvalue (unshare_expr (niter->may_be_zero)),
+		       many_iterations_cond);
       many_iterations_cond
-	= force_gimple_operand (many_iterations_cond, &stmts,
-				true, NULL_TREE);
+	= force_gimple_operand (many_iterations_cond, &stmts, false, NULL_TREE);
       if (stmts)
 	gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
-    }
+      if (!is_gimple_condexpr (many_iterations_cond))
+	{
+	  many_iterations_cond
+	    = force_gimple_operand (many_iterations_cond, &stmts,
+				    true, NULL_TREE);
+	  if (stmts)
+	    gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop),
+					      stmts);
+	}
 
-  initialize_original_copy_tables ();
+      initialize_original_copy_tables ();
 
-  /* We assume that the loop usually iterates a lot.  */
-  prob = 4 * REG_BR_PROB_BASE / 5;
-  loop_version (loop, many_iterations_cond, NULL,
-		prob, prob, REG_BR_PROB_BASE - prob, true);
-  update_ssa (TODO_update_ssa);
-  free_original_copy_tables ();
+      /* We assume that the loop usually iterates a lot.  */
+      prob = 4 * REG_BR_PROB_BASE / 5;
+      loop_version (loop, many_iterations_cond, NULL,
+		    prob, prob, REG_BR_PROB_BASE - prob, true);
+      update_ssa (TODO_update_ssa);
+      free_original_copy_tables ();
+    }
 
   /* Base all the induction variables in LOOP on a single control one.  */
   canonicalize_loop_ivs (loop, &nit, true);
@@ -2306,6 +2367,9 @@ gen_parallel_loop (struct loop *loop,
     }
   else
     {
+      if (oacc_kernels_p)
+	n_threads = 1;
+
       /* Fall back on the method that handles more cases, but duplicates the
 	 loop body: move the exit condition of LOOP to the beginning of its
 	 header, and duplicate the part of the last iteration that gets disabled
@@ -2322,19 +2386,34 @@ gen_parallel_loop (struct loop *loop,
   entry = loop_preheader_edge (loop);
   exit = single_dom_exit (loop);
 
-  eliminate_local_variables (entry, exit);
-  /* In the old loop, move all variables non-local to the loop to a structure
-     and back, and create separate decls for the variables used in loop.  */
-  separate_decls_in_region (entry, exit, reduction_list, &arg_struct,
-			    &new_arg_struct, &clsn_data);
+  /* This rewrites the body in terms of new variables.  This has already
+     been done for oacc_kernels_p in pass_lower_omp/lower_omp ().  */
+  if (!oacc_kernels_p)
+    {
+      eliminate_local_variables (entry, exit);
+      /* In the old loop, move all variables non-local to the loop to a
+	 structure and back, and create separate decls for the variables used in
+	 loop.  */
+      separate_decls_in_region (entry, exit, reduction_list, &arg_struct,
+				&new_arg_struct, &clsn_data);
+    }
+  else
+    {
+      arg_struct = NULL_TREE;
+      new_arg_struct = NULL_TREE;
+      clsn_data.load = NULL_TREE;
+      clsn_data.load_bb = exit->dest;
+      clsn_data.store = NULL_TREE;
+      clsn_data.store_bb = NULL;
+    }
 
   /* Create the parallel constructs.  */
   loc = UNKNOWN_LOCATION;
   cond_stmt = last_stmt (loop->header);
   if (cond_stmt)
     loc = gimple_location (cond_stmt);
-  create_parallel_loop (loop, create_loop_fn (loc), arg_struct,
-			new_arg_struct, n_threads, loc);
+  create_parallel_loop (loop, create_loop_fn (loc), arg_struct, new_arg_struct,
+			n_threads, loc, oacc_kernels_p);
   if (reduction_list->elements () > 0)
     create_call_for_reduction (loop, reduction_list, &clsn_data);
 
@@ -2531,12 +2610,21 @@ try_get_loop_niter (loop_p loop, struct tree_niter_desc *niter)
   return true;
 }
 
+static tree
+get_omp_data_i_param (void)
+{
+  tree decl = DECL_ARGUMENTS (cfun->decl);
+  gcc_assert (DECL_CHAIN (decl) == NULL_TREE);
+  return ssa_default_def (cfun, decl);
+}
+
 /* Try to initialize REDUCTION_LIST for code generation part.
    REDUCTION_LIST describes the reductions.  */
 
 static bool
 try_create_reduction_list (loop_p loop,
-			   reduction_info_table_type *reduction_list)
+			   reduction_info_table_type *reduction_list,
+			   bool oacc_kernels_p)
 {
   edge exit = single_dom_exit (loop);
   gphi_iterator gsi;
@@ -2595,6 +2683,7 @@ try_create_reduction_list (loop_p loop,
 			 "  FAILED: it is not a part of reduction.\n");
 	      return false;
 	    }
+	  red->keep_res = phi;
 	  if (dump_file && (dump_flags & TDF_DETAILS))
 	    {
 	      fprintf (dump_file, "reduction phi is  ");
@@ -2629,15 +2718,402 @@ try_create_reduction_list (loop_p loop,
     }
 
 
+  if (oacc_kernels_p)
+    {
+      edge e = loop_preheader_edge (loop);
+
+      for (gsi = gsi_start_phis (loop->header); !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gphi *phi = gsi.phi ();
+	  tree def = PHI_RESULT (phi);
+	  affine_iv iv;
+
+	  if (!virtual_operand_p (def)
+	      && !simple_iv (loop, loop, def, &iv, true))
+	    {
+	      struct reduction_info *red;
+	      red = reduction_phi (reduction_list, phi);
+
+	      /* Look for pattern:
+
+		 <bb preheader>
+		   .omp_data_i = &.omp_data_arr;
+		   addr = .omp_data_i->sum;
+		   sum_a = *addr;
+
+		 <bb header>:
+		   sum_b = PHI <sum_a (preheader), sum_c (latch)>
+
+		 and assign addr to reduc->reduc_addr.  */
+
+	      tree arg = PHI_ARG_DEF_FROM_EDGE (phi, e);
+	      gimple *stmt = SSA_NAME_DEF_STMT (arg);
+	      if (!gimple_assign_single_p (stmt))
+		return false;
+	      tree memref = gimple_assign_rhs1 (stmt);
+	      if (TREE_CODE (memref) != MEM_REF)
+		return false;
+	      tree addr = TREE_OPERAND (memref, 0);
+
+	      gimple *stmt2 = SSA_NAME_DEF_STMT (addr);
+	      if (!gimple_assign_single_p (stmt2))
+		return false;
+	      tree compref = gimple_assign_rhs1 (stmt2);
+	      if (TREE_CODE (compref) != COMPONENT_REF)
+		return false;
+	      tree addr2 = TREE_OPERAND (compref, 0);
+	      if (TREE_CODE (addr2) != MEM_REF)
+		return false;
+	      addr2 = TREE_OPERAND (addr2, 0);
+	      if (TREE_CODE (addr2) != SSA_NAME
+		  || addr2 != get_omp_data_i_param ())
+		return false;
+	      red->reduc_addr = addr;
+	    }
+	}
+    }
+
+  return true;
+}
+
+static bool
+ref_conflicts_with_region (gimple_stmt_iterator gsi, ao_ref *ref,
+			   bool ref_is_store, vec<basic_block> region_bbs,
+			   unsigned int i, gimple *skip_stmt)
+{
+  basic_block bb = region_bbs[i];
+  gsi_next (&gsi);
+
+  while (true)
+    {
+      for (; !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  if (stmt == skip_stmt)
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "skipping reduction store: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+	      continue;
+	    }
+
+	  if (!gimple_vdef (stmt)
+	      && !gimple_vuse (stmt))
+	    continue;
+
+	  if (gimple_code (stmt) == GIMPLE_RETURN)
+	    continue;
+
+	  if (ref_is_store)
+	    {
+	      if (ref_maybe_used_by_stmt_p (stmt, ref))
+		{
+		  if (dump_file)
+		    {
+		      fprintf (dump_file, "Stmt ");
+		      print_gimple_stmt (dump_file, stmt, 0, 0);
+		    }
+		  return true;
+		}
+	    }
+	  else
+	    {
+	      if (stmt_may_clobber_ref_p_1 (stmt, ref))
+		{
+		  if (dump_file)
+		    {
+		      fprintf (dump_file, "Stmt ");
+		      print_gimple_stmt (dump_file, stmt, 0, 0);
+		    }
+		  return true;
+		}
+	    }
+	}
+      i++;
+      if (i == region_bbs.length ())
+	break;
+      bb = region_bbs[i];
+      gsi = gsi_start_bb (bb);
+    }
+
+  return false;
+}
+
+static bool
+oacc_entry_exit_ok_1 (bitmap in_loop_bbs, vec<basic_block> region_bbs,
+		      tree omp_data_i,
+		      reduction_info_table_type *reduction_list,
+		      bitmap reduction_stores)
+{
+  unsigned i;
+  basic_block bb;
+  FOR_EACH_VEC_ELT (region_bbs, i, bb)
+    {
+      if (bitmap_bit_p (in_loop_bbs, bb->index))
+	continue;
+
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  gimple *skip_stmt = NULL;
+
+	  if (is_gimple_debug (stmt)
+	      || gimple_code (stmt) == GIMPLE_COND)
+	    continue;
+
+	  ao_ref ref;
+	  bool ref_is_store = false;
+	  if (gimple_assign_load_p (stmt))
+	    {
+	      tree rhs = gimple_assign_rhs1 (stmt);
+	      tree base = get_base_address (rhs);
+	      if (TREE_CODE (base) == MEM_REF
+		  && operand_equal_p (TREE_OPERAND (base, 0), omp_data_i, 0))
+		continue;
+
+	      tree lhs = gimple_assign_lhs (stmt);
+	      if (TREE_CODE (lhs) == SSA_NAME
+		  && has_single_use (lhs))
+		{
+		  use_operand_p use_p;
+		  gimple *use_stmt;
+		  single_imm_use (lhs, &use_p, &use_stmt);
+		  if (gimple_code (use_stmt) == GIMPLE_PHI)
+		    {
+		      struct reduction_info *red;
+		      red = reduction_phi (reduction_list, use_stmt);
+		      tree val = PHI_RESULT (red->keep_res);
+		      if (has_single_use (val))
+			{
+			  single_imm_use (val, &use_p, &use_stmt);
+			  if (gimple_store_p (use_stmt))
+			    {
+			      unsigned int id
+				= SSA_NAME_VERSION (gimple_vdef (use_stmt));
+			      bitmap_set_bit (reduction_stores, id);
+			      skip_stmt = use_stmt;
+			      if (dump_file)
+				{
+				  fprintf (dump_file, "found reduction load: ");
+				  print_gimple_stmt (dump_file, stmt, 0, 0);
+				}
+			    }
+			}
+		    }
+		}
+
+	      ao_ref_init (&ref, rhs);
+	    }
+	  else if (gimple_store_p (stmt))
+	    {
+	      ao_ref_init (&ref, gimple_assign_lhs (stmt));
+	      ref_is_store = true;
+	    }
+	  else if (gimple_code (stmt) == GIMPLE_OMP_RETURN)
+	    continue;
+	  else if (!gimple_has_side_effects (stmt)
+		   && !gimple_could_trap_p (stmt)
+		   && !stmt_could_throw_p (stmt)
+		   && !gimple_vdef (stmt)
+		   && !gimple_vuse (stmt))
+	    continue;
+	  else if (is_gimple_call (stmt)
+		   && gimple_call_internal_p (stmt)
+		   && gimple_call_internal_fn (stmt) == IFN_GOACC_DIM_POS)
+	    continue;
+	  else if (gimple_code (stmt) == GIMPLE_RETURN)
+	    continue;
+	  else
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "Unhandled stmt in entry/exit: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+	      return false;
+	    }
+
+	  if (ref_conflicts_with_region (gsi, &ref, ref_is_store, region_bbs,
+					 i, skip_stmt))
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "conflicts with entry/exit stmt: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+	      return false;
+	    }
+	}
+    }
+
   return true;
 }
 
+/* Find stores inside REGION_BBS and outside IN_LOOP_BBS, and guard them with
+   gang_pos == 0, except when the stores are REDUCTION_STORES.  Return true
+   if any changes were made.  */
+
+static bool
+oacc_entry_exit_single_gang (bitmap in_loop_bbs, vec<basic_block> region_bbs,
+			     bitmap reduction_stores)
+{
+  tree gang_pos = NULL_TREE;
+  bool changed = false;
+
+  unsigned i;
+  basic_block bb;
+  FOR_EACH_VEC_ELT (region_bbs, i, bb)
+    {
+      if (bitmap_bit_p (in_loop_bbs, bb->index))
+	continue;
+
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);)
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+
+	  if (!gimple_store_p (stmt))
+	    {
+	      /* Update gsi to point to next stmt.  */
+	      gsi_next (&gsi);
+	      continue;
+	    }
+
+	  if (bitmap_bit_p (reduction_stores,
+			    SSA_NAME_VERSION (gimple_vdef (stmt))))
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file,
+			   "skipped reduction store for single-gang"
+			   " neutering: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+
+	      /* Update gsi to point to next stmt.  */
+	      gsi_next (&gsi);
+	      continue;
+	    }
+
+	  changed = true;
+
+	  if (gang_pos == NULL_TREE)
+	    {
+	      tree arg = build_int_cst (integer_type_node, GOMP_DIM_GANG);
+	      gcall *gang_single
+		= gimple_build_call_internal (IFN_GOACC_DIM_POS, 1, arg);
+	      gang_pos = make_ssa_name (integer_type_node);
+	      gimple_call_set_lhs (gang_single, gang_pos);
+	      gimple_stmt_iterator start
+		= gsi_start_bb (single_succ (ENTRY_BLOCK_PTR_FOR_FN (cfun)));
+	      tree vuse = ssa_default_def (cfun, gimple_vop (cfun));
+	      gimple_set_vuse (gang_single, vuse);
+	      gsi_insert_before (&start, gang_single, GSI_SAME_STMT);
+	    }
+
+	  if (dump_file)
+	    {
+	      fprintf (dump_file,
+		       "found store that needs single-gang neutering: ");
+	      print_gimple_stmt (dump_file, stmt, 0, 0);
+	    }
+
+	  {
+	    /* Split block before store.  */
+	    gimple_stmt_iterator gsi2 = gsi;
+	    gsi_prev (&gsi2);
+	    edge e;
+	    if (gsi_end_p (gsi2))
+	      {
+		e = split_block_after_labels (bb);
+		gsi2 = gsi_last_bb (bb);
+	      }
+	    else
+	      e = split_block (bb, gsi_stmt (gsi2));
+	    basic_block bb2 = e->dest;
+
+	    /* Split block after store.  */
+	    gimple_stmt_iterator gsi3 = gsi_start_bb (bb2);
+	    edge e2 = split_block (bb2, gsi_stmt (gsi3));
+	    basic_block bb3 = e2->dest;
+
+	    gimple *cond
+	      = gimple_build_cond (EQ_EXPR, gang_pos, integer_zero_node,
+				   NULL_TREE, NULL_TREE);
+	    gsi_insert_after (&gsi2, cond, GSI_NEW_STMT);
+
+	    edge e3 = make_edge (bb, bb3, EDGE_FALSE_VALUE);
+	    e->flags = EDGE_TRUE_VALUE;
+
+	    tree vdef = gimple_vdef (stmt);
+	    tree vuse = gimple_vuse (stmt);
+
+	    tree phi_res = copy_ssa_name (vdef);
+	    gphi *new_phi = create_phi_node (phi_res, bb3);
+	    replace_uses_by (vdef, phi_res);
+	    add_phi_arg (new_phi, vuse, e3, UNKNOWN_LOCATION);
+	    add_phi_arg (new_phi, vdef, e2, UNKNOWN_LOCATION);
+
+	    /* Update gsi to point to next stmt.  */
+	    bb = bb3;
+	    gsi = gsi_start_bb (bb);
+	  }
+	}
+    }
+
+  return changed;
+}
+
+static bool
+oacc_entry_exit_ok (struct loop *loop,
+		    reduction_info_table_type *reduction_list)
+{
+  basic_block *loop_bbs = get_loop_body_in_dom_order (loop);
+  tree omp_data_i = get_omp_data_i_param ();
+  gcc_assert (omp_data_i != NULL_TREE);
+  vec<basic_block> region_bbs
+    = get_all_dominated_blocks (CDI_DOMINATORS, ENTRY_BLOCK_PTR_FOR_FN (cfun));
+
+  bitmap in_loop_bbs = BITMAP_ALLOC (NULL);
+  bitmap_clear (in_loop_bbs);
+  for (unsigned int i = 0; i < loop->num_nodes; i++)
+    bitmap_set_bit (in_loop_bbs, loop_bbs[i]->index);
+
+  bitmap reduction_stores = BITMAP_ALLOC (NULL);
+  bool res = oacc_entry_exit_ok_1 (in_loop_bbs, region_bbs, omp_data_i,
+				   reduction_list, reduction_stores);
+
+  if (res)
+    {
+      bool changed = oacc_entry_exit_single_gang (in_loop_bbs, region_bbs,
+						  reduction_stores);
+      if (changed)
+	{
+	  free_dominance_info (CDI_DOMINATORS);
+	  calculate_dominance_info (CDI_DOMINATORS);
+	}
+    }
+
+  free (loop_bbs);
+
+  BITMAP_FREE (in_loop_bbs);
+  BITMAP_FREE (reduction_stores);
+
+  return res;
+}
+
 /* Detect parallel loops and generate parallel code using libgomp
    primitives.  Returns true if some loop was parallelized, false
    otherwise.  */
 
 static bool
-parallelize_loops (void)
+parallelize_loops (bool oacc_kernels_p)
 {
   unsigned n_threads = flag_tree_parallelize_loops;
   bool changed = false;
@@ -2649,19 +3125,29 @@ parallelize_loops (void)
   source_location loop_loc;
 
   /* Do not parallelize loops in the functions created by parallelization.  */
-  if (parallelized_function_p (cfun->decl))
+  if (!oacc_kernels_p
+      && parallelized_function_p (cfun->decl))
     return false;
+
+  /* Do not parallelize loops in offloaded functions.  */
+  if (!oacc_kernels_p
+      && get_oacc_fn_attrib (cfun->decl) != NULL)
+     return false;
+
   if (cfun->has_nonlocal_label)
     return false;
 
   gcc_obstack_init (&parloop_obstack);
   reduction_info_table_type reduction_list (10);
 
+  calculate_dominance_info (CDI_DOMINATORS);
+
   FOR_EACH_LOOP (loop, 0)
     {
       if (loop == skip_loop)
 	{
-	  if (dump_file && (dump_flags & TDF_DETAILS))
+	  if (!loop->in_oacc_kernels_region
+	      && dump_file && (dump_flags & TDF_DETAILS))
 	    fprintf (dump_file,
 		     "Skipping loop %d as inner loop of parallelized loop\n",
 		     loop->num);
@@ -2673,6 +3159,22 @@ parallelize_loops (void)
 	skip_loop = NULL;
 
       reduction_list.empty ();
+
+      if (oacc_kernels_p)
+	{
+	  if (!loop->in_oacc_kernels_region)
+	    continue;
+
+	  /* Don't try to parallelize inner loops in an oacc kernels region.  */
+	  if (loop->inner)
+	    skip_loop = loop->inner;
+
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file,
+		     "Trying loop %d with header bb %d in oacc kernels"
+		     " region\n", loop->num, loop->header->index);
+	}
+
       if (dump_file && (dump_flags & TDF_DETAILS))
       {
         fprintf (dump_file, "Trying loop %d as candidate\n",loop->num);
@@ -2714,6 +3216,7 @@ parallelize_loops (void)
       /* FIXME: Bypass this check as graphite doesn't update the
 	 count and frequency correctly now.  */
       if (!flag_loop_parallelize_all
+	  && !oacc_kernels_p
 	  && ((estimated != -1
 	       && estimated <= (HOST_WIDE_INT) n_threads * MIN_PER_THREAD)
 	      /* Do not bother with loops in cold areas.  */
@@ -2723,14 +3226,23 @@ parallelize_loops (void)
       if (!try_get_loop_niter (loop, &niter_desc))
 	continue;
 
-      if (!try_create_reduction_list (loop, &reduction_list))
+      if (!try_create_reduction_list (loop, &reduction_list, oacc_kernels_p))
 	continue;
 
       if (!flag_loop_parallelize_all
 	  && !loop_parallel_p (loop, &parloop_obstack))
 	continue;
 
+      if (oacc_kernels_p
+	&& !oacc_entry_exit_ok (loop, &reduction_list))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "entry/exit not ok: FAILED\n");
+	  continue;
+	}
+
       changed = true;
+      /* Skip inner loop(s) of parallelized loop.  */
       skip_loop = loop->inner;
       if (dump_file && (dump_flags & TDF_DETAILS))
       {
@@ -2743,8 +3255,9 @@ parallelize_loops (void)
 	  fprintf (dump_file, "\nloop at %s:%d: ",
 		   LOCATION_FILE (loop_loc), LOCATION_LINE (loop_loc));
       }
+
       gen_parallel_loop (loop, &reduction_list,
-			 n_threads, &niter_desc);
+			 n_threads, &niter_desc, oacc_kernels_p);
     }
 
   obstack_free (&parloop_obstack, NULL);
@@ -2794,7 +3307,7 @@ pass_parallelize_loops::execute (function *fun)
   if (number_of_loops (fun) <= 1)
     return 0;
 
-  if (parallelize_loops ())
+  if (parallelize_loops (false))
     {
       fun->curr_properties &= ~(PROP_gimple_eomp);
 
@@ -2813,3 +3326,66 @@ make_pass_parallelize_loops (gcc::context *ctxt)
 {
   return new pass_parallelize_loops (ctxt);
 }
+
+namespace {
+
+const pass_data pass_data_parallelize_loops_oacc_kernels =
+{
+  GIMPLE_PASS, /* type */
+  "parloops_oacc_kernels", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_TREE_PARALLELIZE_LOOPS, /* tv_id */
+  ( PROP_cfg | PROP_ssa ), /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_parallelize_loops_oacc_kernels : public gimple_opt_pass
+{
+public:
+  pass_parallelize_loops_oacc_kernels (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_parallelize_loops_oacc_kernels, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *) { return flag_tree_parallelize_loops > 1; }
+  virtual unsigned int execute (function *);
+
+}; // class pass_parallelize_loops_oacc_kernels
+
+unsigned
+pass_parallelize_loops_oacc_kernels::execute (function *fun)
+{
+  unsigned int todo = 0;
+
+  loop_optimizer_init (LOOPS_NORMAL
+		       | LOOPS_HAVE_RECORDED_EXITS);
+
+  if (number_of_loops (fun) <= 1)
+    return 0;
+
+  rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
+
+  scev_initialize ();
+
+  if (parallelize_loops (true))
+    {
+      fun->curr_properties &= ~(PROP_gimple_eomp);
+      todo |= TODO_update_ssa;
+    }
+
+  scev_finalize ();
+  loop_optimizer_finalize ();
+
+  return todo;
+}
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_parallelize_loops_oacc_kernels (gcc::context *ctxt)
+{
+  return new pass_parallelize_loops_oacc_kernels (ctxt);
+}
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 9704918..004db77 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -385,6 +385,8 @@ extern gimple_opt_pass *make_pass_slp_vectorize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_complete_unroll (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_complete_unrolli (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_parallelize_loops (gcc::context *ctxt);
+extern gimple_opt_pass *
+  make_pass_parallelize_loops_oacc_kernels (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_loop_prefetch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_iv_optimize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_loop_done (gcc::context *ctxt);
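
For readers of the oacc_entry_exit_single_gang hunk above, the effect of the single-gang neutering can be sketched in plain C. This is only an illustration under assumptions, not code from the patch: entry_code, simulate and store_count are hypothetical names, and the gang position is passed in as a parameter instead of coming from IFN_GOACC_DIM_POS. Every simulated gang runs the code outside the loop, but only gang 0 performs the store:

```c
#include <assert.h>

/* Counts the stores actually performed (illustration only).  */
static int store_count;

/* Hypothetical code outside the parallelized loop, after neutering:
   the store is guarded with gang_pos == 0.  */
static void
entry_code (int gang_pos, int *p)
{
  if (gang_pos == 0)
    {
      *p = 42;
      store_count++;
    }
}

/* Run entry_code once per simulated gang, sequentially.  Return the
   number of stores that were performed.  */
static int
simulate (int num_gangs, int *p)
{
  assert (num_gangs > 0);
  store_count = 0;
  for (int g = 0; g < num_gangs; g++)
    entry_code (g, p);
  return store_count;
}
```

With eight simulated gangs, simulate (8, &x) performs a single store; without the guard it would perform eight.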

^ permalink raw reply	[flat|nested] 133+ messages in thread

* [PING][PATCH, 12/16] Handle acc loop directive
  2015-11-09 20:06 ` [PATCH, 12/16] Handle acc loop directive Tom de Vries
@ 2015-11-24 12:30   ` Tom de Vries
  2016-01-18 14:27     ` [PING^2][PATCH, " Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-24 12:30 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

On 09/11/15 21:06, Tom de Vries wrote:
> On 09/11/15 16:35, Tom de Vries wrote:
>> Hi,
>>
>> this patch series for stage1 trunk adds support to:
>> - parallelize oacc kernels regions using parloops, and
>> - map the loops onto the oacc gang dimension.
>>
>> The patch series contains these patches:
>>
>>       1    Insert new exit block only when needed in
>>          transform_to_exit_first_loop_alt
>>       2    Make create_parallel_loop return void
>>       3    Ignore reduction clause on kernels directive
>>       4    Implement -foffload-alias
>>       5    Add in_oacc_kernels_region in struct loop
>>       6    Add pass_oacc_kernels
>>       7    Add pass_dominator_oacc_kernels
>>       8    Add pass_ch_oacc_kernels
>>       9    Add pass_parallelize_loops_oacc_kernels
>>      10    Add pass_oacc_kernels pass group in passes.def
>>      11    Update testcases after adding kernels pass group
>>      12    Handle acc loop directive
>>      13    Add c-c++-common/goacc/kernels-*.c
>>      14    Add gfortran.dg/goacc/kernels-*.f95
>>      15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>      16    Add libgomp.oacc-fortran/kernels-*.f95
>>
>> The first 9 patches are more or less independent, but patches 10-16 are
>> intended to be committed at the same time.
>>
>> Bootstrapped and reg-tested on x86_64.
>>
>> Build and reg-tested with nvidia accelerator, in combination with a
>> patch that enables accelerator testing (which is submitted at
>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>
>> I'll post the individual patches in reply to this message.
>
> this patch deals with loops in an oacc kernels region which are
> annotated using "#pragma acc loop". It expands such a loop as a normal
> loop, which has the effect of ignoring the "#pragma acc loop".
>

Ping.

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-24 12:22                     ` Tom de Vries
@ 2015-11-24 13:19                       ` Richard Biener
  2015-11-24 14:33                         ` Tom de Vries
  2015-11-25 10:44                       ` Richard Biener
  1 sibling, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-24 13:19 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Tue, 24 Nov 2015, Tom de Vries wrote:

> On 23/11/15 11:02, Richard Biener wrote:
> > On Fri, 20 Nov 2015, Tom de Vries wrote:
> > 
> > > On 20/11/15 14:29, Richard Biener wrote:
> > > > I agree it's somewhat of an odd behavior but all passes should
> > > > either be placed in a sub-pipeline with an outer
> > > > loop_optimizer_init()/finalize () call or call both themselves.
> > > 
> > > Hmm, but adding loop_optimizer_finalize at the end of pass_lim breaks the
> > > loop
> > > pipeline.
> > > 
> > > We could use the style used in pass_slp_vectorize::execute:
> > > ...
> > > pass_slp_vectorize::execute (function *fun)
> > > {
> > >    basic_block bb;
> > > 
> > >    bool in_loop_pipeline = scev_initialized_p ();
> > >    if (!in_loop_pipeline)
> > >      {
> > >        loop_optimizer_init (LOOPS_NORMAL);
> > >        scev_initialize ();
> > >      }
> > > 
> > >    ...
> > > 
> > >    if (!in_loop_pipeline)
> > >      {
> > >        scev_finalize ();
> > >        loop_optimizer_finalize ();
> > >      }
> > > ...
> > > 
> > > Although that doesn't strike me as particularly clean.
> > 
> > At least it would be a consistent "unclean" style.  So yes, the
> > above would work for me.
> > 
> 
> Reposting using the in_loop_pipeline style in pass_lim.

The tree-ssa-loop-im.c changes are ok (I suppose the other changes
are in the other patch you posted as well).

Thanks,
Richard.
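
The in_loop_pipeline style quoted above can be tried in isolation. What follows is a minimal sketch in plain C, assuming mock stand-ins for the GCC functions: all mock_* names and the counters are hypothetical and only model whether the loop optimizer and scev state are set up. The pass initializes and finalizes that state itself only when it is not already running inside the loop pipeline:

```c
#include <assert.h>
#include <stdbool.h>

/* Mock state standing in for the loop optimizer and scev state.  */
static bool loops_initialized;
static bool scev_initialized;
static int init_calls;
static int fini_calls;

static void mock_loop_optimizer_init (void)
{ loops_initialized = true; init_calls++; }
static void mock_loop_optimizer_finalize (void)
{ loops_initialized = false; fini_calls++; }
static void mock_scev_initialize (void) { scev_initialized = true; }
static void mock_scev_finalize (void) { scev_initialized = false; }
static bool mock_scev_initialized_p (void) { return scev_initialized; }

/* Mock pass body using the in_loop_pipeline style.  */
static unsigned
mock_pass_execute (void)
{
  bool in_loop_pipeline = mock_scev_initialized_p ();
  if (!in_loop_pipeline)
    {
      mock_loop_optimizer_init ();
      mock_scev_initialize ();
    }

  /* The actual work of the pass would go here; it may rely on both
     pieces of state being available.  */
  assert (loops_initialized && scev_initialized);

  if (!in_loop_pipeline)
    {
      mock_scev_finalize ();
      mock_loop_optimizer_finalize ();
    }
  return 0;
}
```

Run standalone, the mock pass sets up and tears down the state itself; run after an outer init, it leaves the state alone, so the surrounding pipeline stays intact.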

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-24 13:19                       ` Richard Biener
@ 2015-11-24 14:33                         ` Tom de Vries
  2015-11-24 14:36                           ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-24 14:33 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

On 24/11/15 14:13, Richard Biener wrote:
> On Tue, 24 Nov 2015, Tom de Vries wrote:
>
>> On 23/11/15 11:02, Richard Biener wrote:
>>> On Fri, 20 Nov 2015, Tom de Vries wrote:
>>>
>>>> On 20/11/15 14:29, Richard Biener wrote:
>>>>> I agree it's somewhat of an odd behavior but all passes should
>>>>> either be placed in a sub-pipeline with an outer
>>>>> loop_optimizer_init()/finalize () call or call both themselves.
>>>>
>>>> Hmm, but adding loop_optimizer_finalize at the end of pass_lim breaks the
>>>> loop pipeline.
>>>>
>>>> We could use the style used in pass_slp_vectorize::execute:
>>>> ...
>>>> pass_slp_vectorize::execute (function *fun)
>>>> {
>>>>    basic_block bb;
>>>>
>>>>    bool in_loop_pipeline = scev_initialized_p ();
>>>>    if (!in_loop_pipeline)
>>>>      {
>>>>        loop_optimizer_init (LOOPS_NORMAL);
>>>>        scev_initialize ();
>>>>      }
>>>>
>>>>    ...
>>>>
>>>>    if (!in_loop_pipeline)
>>>>      {
>>>>        scev_finalize ();
>>>>        loop_optimizer_finalize ();
>>>>      }
>>>> ...
>>>>
>>>> Although that doesn't strike me as particularly clean.
>>>
>>> At least it would be a consistent "unclean" style.  So yes, the
>>> above would work for me.
>>>
>>
>> Reposting using the in_loop_pipeline style in pass_lim.
> The tree-ssa-loop-im.c changes are ok

OK, I'll commit those.

> (I suppose the other changes
> are in the other patch you posted as well).

This ( https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02882.html ) patch 
contains changes related to adding pass_oacc_kernels2. Are those the 
"other changes" you're referring to?

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-24 14:33                         ` Tom de Vries
@ 2015-11-24 14:36                           ` Richard Biener
  2015-11-24 15:05                             ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-24 14:36 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Tue, 24 Nov 2015, Tom de Vries wrote:

> On 24/11/15 14:13, Richard Biener wrote:
> > On Tue, 24 Nov 2015, Tom de Vries wrote:
> > 
> > > >On 23/11/15 11:02, Richard Biener wrote:
> > > > > >On Fri, 20 Nov 2015, Tom de Vries wrote:
> > > > > >
> > > > > > > >On 20/11/15 14:29, Richard Biener wrote:
> > > > > > > > > >I agree it's somewhat of an odd behavior but all passes
> > > > > > should
> > > > > > > > > >either be placed in a sub-pipeline with an outer
> > > > > > > > > >loop_optimizer_init()/finalize () call or call both
> > > > > > themselves.
> > > > > > > >
> > > > > > > >Hmm, but adding loop_optimizer_finalize at the end of pass_lim
> > > > > breaks the
> > > > > > > >loop
> > > > > > > >pipeline.
> > > > > > > >
> > > > > > > >We could use the style used in pass_slp_vectorize::execute:
> > > > > > > >...
> > > > > > > >pass_slp_vectorize::execute (function *fun)
> > > > > > > >{
> > > > > > > >    basic_block bb;
> > > > > > > >
> > > > > > > >    bool in_loop_pipeline = scev_initialized_p ();
> > > > > > > >    if (!in_loop_pipeline)
> > > > > > > >      {
> > > > > > > >        loop_optimizer_init (LOOPS_NORMAL);
> > > > > > > >        scev_initialize ();
> > > > > > > >      }
> > > > > > > >
> > > > > > > >    ...
> > > > > > > >
> > > > > > > >    if (!in_loop_pipeline)
> > > > > > > >      {
> > > > > > > >        scev_finalize ();
> > > > > > > >        loop_optimizer_finalize ();
> > > > > > > >      }
> > > > > > > >...
> > > > > > > >
> > > > > > > >Although that doesn't strike me as particularly clean.
> > > > > >
> > > > > >At least it would be a consistent "unclean" style.  So yes, the
> > > > > >above would work for me.
> > > > > >
> > > >
> > > >Reposting using the in_loop_pipeline style in pass_lim.
> > The tree-ssa-loop-im.c changes are ok
> 
> OK, I'll commit those.
> 
> > (I suppose the other changes
> > are in the other patch you posted as well).
> 
> This ( https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02882.html ) patch
> contains changes related to adding pass_oacc_kernels2. Are those the "other
> changes" you're referring to?

No, the other patch adding oacc_kernels pass group to passes.def.

Btw, at some point splitting patches too much becomes very much
confusing instead of helping.

Richard.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-24 14:36                           ` Richard Biener
@ 2015-11-24 15:05                             ` Tom de Vries
  2015-11-25 10:43                               ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-24 15:05 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

On 24/11/15 15:33, Richard Biener wrote:
> On Tue, 24 Nov 2015, Tom de Vries wrote:
>
>> On 24/11/15 14:13, Richard Biener wrote:
>>> On Tue, 24 Nov 2015, Tom de Vries wrote:
>>>
>>>>> On 23/11/15 11:02, Richard Biener wrote:
>>>>>>> On Fri, 20 Nov 2015, Tom de Vries wrote:
>>>>>>>
>>>>>>>>> On 20/11/15 14:29, Richard Biener wrote:
>>>>>>>>>>> I agree it's somewhat of an odd behavior but all passes
>>>>>>> should
>>>>>>>>>>> either be placed in a sub-pipeline with an outer
>>>>>>>>>>> loop_optimizer_init()/finalize () call or call both
>>>>>>> themselves.
>>>>>>>>>
>>>>>>>>> Hmm, but adding loop_optimizer_finalize at the end of pass_lim
>>>>>> breaks the
>>>>>>>>> loop
>>>>>>>>> pipeline.
>>>>>>>>>
>>>>>>>>> We could use the style used in pass_slp_vectorize::execute:
>>>>>>>>> ...
>>>>>>>>> pass_slp_vectorize::execute (function *fun)
>>>>>>>>> {
>>>>>>>>>     basic_block bb;
>>>>>>>>>
>>>>>>>>>     bool in_loop_pipeline = scev_initialized_p ();
>>>>>>>>>     if (!in_loop_pipeline)
>>>>>>>>>       {
>>>>>>>>>         loop_optimizer_init (LOOPS_NORMAL);
>>>>>>>>>         scev_initialize ();
>>>>>>>>>       }
>>>>>>>>>
>>>>>>>>>     ...
>>>>>>>>>
>>>>>>>>>     if (!in_loop_pipeline)
>>>>>>>>>       {
>>>>>>>>>         scev_finalize ();
>>>>>>>>>         loop_optimizer_finalize ();
>>>>>>>>>       }
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>> Although that doesn't strike me as particularly clean.
>>>>>>>
>>>>>>> At least it would be a consistent "unclean" style.  So yes, the
>>>>>>> above would work for me.
>>>>>>>
>>>>>
>>>>> Reposting using the in_loop_pipeline style in pass_lim.
>>> The tree-ssa-loop-im.c changes are ok
>>
>> OK, I'll commit those.
>>
>>> (I suppose the other changes
>>> are in the other patch you posted as well).
>>
>> This ( https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02882.html ) patch
>> contains changes related to adding pass_oacc_kernels2. Are those the "other
>> changes" you're referring to?
>
> No, the other patch adding oacc_kernels pass group to passes.def.
>

I don't understand. There's only one patch adding oacc_kernels pass
group to passes.def (which is the one in this thread).

> Btw, at some point splitting patches too much becomes very much
> confusing instead of helping.

Would it help if I merge "Add pass_oacc_kernels" with this patch?

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 6/16] Add pass_oacc_kernels
  2015-11-24 12:17       ` Tom de Vries
@ 2015-11-25 10:42         ` Richard Biener
  0 siblings, 0 replies; 133+ messages in thread
From: Richard Biener @ 2015-11-25 10:42 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Tue, 24 Nov 2015, Tom de Vries wrote:

> On 19/11/15 14:50, Tom de Vries wrote:
> > On 11/11/15 11:58, Richard Biener wrote:
> > > On Mon, 9 Nov 2015, Tom de Vries wrote:
> > > 
> > > > On 09/11/15 16:35, Tom de Vries wrote:
> > > > > Hi,
> > > > > 
> > > > > this patch series for stage1 trunk adds support to:
> > > > > - parallelize oacc kernels regions using parloops, and
> > > > > - map the loops onto the oacc gang dimension.
> > > > > 
> > > > > The patch series contains these patches:
> > > > > 
> > > > >        1    Insert new exit block only when needed in
> > > > >           transform_to_exit_first_loop_alt
> > > > >        2    Make create_parallel_loop return void
> > > > >        3    Ignore reduction clause on kernels directive
> > > > >        4    Implement -foffload-alias
> > > > >        5    Add in_oacc_kernels_region in struct loop
> > > > >        6    Add pass_oacc_kernels
> > > > >        7    Add pass_dominator_oacc_kernels
> > > > >        8    Add pass_ch_oacc_kernels
> > > > >        9    Add pass_parallelize_loops_oacc_kernels
> > > > >       10    Add pass_oacc_kernels pass group in passes.def
> > > > >       11    Update testcases after adding kernels pass group
> > > > >       12    Handle acc loop directive
> > > > >       13    Add c-c++-common/goacc/kernels-*.c
> > > > >       14    Add gfortran.dg/goacc/kernels-*.f95
> > > > >       15    Add libgomp.oacc-c-c++-common/kernels-*.c
> > > > >       16    Add libgomp.oacc-fortran/kernels-*.f95
> > > > > 
> > > > > The first 9 patches are more or less independent, but patches 10-16
> > > > > are
> > > > > intended to be committed at the same time.
> > > > > 
> > > > > Bootstrapped and reg-tested on x86_64.
> > > > > 
> > > > > Build and reg-tested with nvidia accelerator, in combination with a
> > > > > patch that enables accelerator testing (which is submitted at
> > > > > https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
> > > > > 
> > > > > I'll post the individual patches in reply to this message.
> > > > 
> > > > > this patch adds a pass group pass_oacc_kernels (which will be added
> > > > to the
> > > > pass list as a whole in patch 10).
> > > 
> > > Just to understand (while also skimming the HSA patches).
> > > 
> > > You are basically relying on autopar for what the HSA patches call
> > > "gridification"?  That is, OMP lowering produces loopy kernels
> > > and autopar then will basically strip the outermost loop?
> > 
> > Short answer: no. In more detail...
> <SNIP>
> 
> Reposting patch, after splitting the pass group into two.

Ok.

Richard.

> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-24 15:05                             ` Tom de Vries
@ 2015-11-25 10:43                               ` Richard Biener
  0 siblings, 0 replies; 133+ messages in thread
From: Richard Biener @ 2015-11-25 10:43 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Tue, 24 Nov 2015, Tom de Vries wrote:

> On 24/11/15 15:33, Richard Biener wrote:
> > On Tue, 24 Nov 2015, Tom de Vries wrote:
> > 
> > > On 24/11/15 14:13, Richard Biener wrote:
> > > > On Tue, 24 Nov 2015, Tom de Vries wrote:
> > > > 
> > > > > > On 23/11/15 11:02, Richard Biener wrote:
> > > > > > > > On Fri, 20 Nov 2015, Tom de Vries wrote:
> > > > > > > > 
> > > > > > > > > > On 20/11/15 14:29, Richard Biener wrote:
> > > > > > > > > > > > I agree it's somewhat of an odd behavior but all passes
> > > > > > > > should
> > > > > > > > > > > > either be placed in a sub-pipeline with an outer
> > > > > > > > > > > > loop_optimizer_init()/finalize () call or call both
> > > > > > > > themselves.
> > > > > > > > > > 
> > > > > > > > > > Hmm, but adding loop_optimizer_finalize at the end of
> > > > > > > > > > pass_lim
> > > > > > > breaks the
> > > > > > > > > > loop
> > > > > > > > > > pipeline.
> > > > > > > > > > 
> > > > > > > > > > We could use the style used in pass_slp_vectorize::execute:
> > > > > > > > > > ...
> > > > > > > > > > pass_slp_vectorize::execute (function *fun)
> > > > > > > > > > {
> > > > > > > > > >     basic_block bb;
> > > > > > > > > > 
> > > > > > > > > >     bool in_loop_pipeline = scev_initialized_p ();
> > > > > > > > > >     if (!in_loop_pipeline)
> > > > > > > > > >       {
> > > > > > > > > >         loop_optimizer_init (LOOPS_NORMAL);
> > > > > > > > > >         scev_initialize ();
> > > > > > > > > >       }
> > > > > > > > > > 
> > > > > > > > > >     ...
> > > > > > > > > > 
> > > > > > > > > >     if (!in_loop_pipeline)
> > > > > > > > > >       {
> > > > > > > > > >         scev_finalize ();
> > > > > > > > > >         loop_optimizer_finalize ();
> > > > > > > > > >       }
> > > > > > > > > > ...
> > > > > > > > > > 
> > > > > > > > > > Although that doesn't strike me as particularly clean.
> > > > > > > > 
> > > > > > > > At least it would be a consistent "unclean" style.  So yes, the
> > > > > > > > above would work for me.
> > > > > > > > 
> > > > > > 
> > > > > > Reposting using the in_loop_pipeline style in pass_lim.
> > > > The tree-ssa-loop-im.c changes are ok
> > > 
> > > OK, I'll commit those.
> > > 
> > > > (I suppose the other changes
> > > > are in the other patch you posted as well).
> > > 
> > > This ( https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02882.html ) patch
> > > contains changes related to adding pass_oacc_kernels2. Are those the
> > > "other
> > > changes" you're referring to?
> > 
> > No, the other patch adding oacc_kernels pass group to passes.def.
> > 
> 
> I don't understand. There's only one patch adding oacc_kernels pass group to
> passes.def (which is the one in this thread).
> 
> > Btw, at some point splitting patches too much becomes very much
> > confusing instead of helping.
> 
> Would it help if I merge "Add pass_oacc_kernels" with this patch?

It would have, yes.  As said, the excessive splitting just confuses
the review process.  Will review in the present state anyway.

Richard.

> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-24 12:22                     ` Tom de Vries
  2015-11-24 13:19                       ` Richard Biener
@ 2015-11-25 10:44                       ` Richard Biener
  2015-11-30 17:48                         ` [gomp4] " Thomas Schwinge
  1 sibling, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-11-25 10:44 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On Tue, 24 Nov 2015, Tom de Vries wrote:

> On 23/11/15 11:02, Richard Biener wrote:
> > On Fri, 20 Nov 2015, Tom de Vries wrote:
> > 
> > > On 20/11/15 14:29, Richard Biener wrote:
> > > > I agree it's somewhat of an odd behavior but all passes should
> > > > either be placed in a sub-pipeline with an outer
> > > > loop_optimizer_init()/finalize () call or call both themselves.
> > > 
> > > Hmm, but adding loop_optimizer_finalize at the end of pass_lim breaks the
> > > loop
> > > pipeline.
> > > 
> > > We could use the style used in pass_slp_vectorize::execute:
> > > ...
> > > pass_slp_vectorize::execute (function *fun)
> > > {
> > >    basic_block bb;
> > > 
> > >    bool in_loop_pipeline = scev_initialized_p ();
> > >    if (!in_loop_pipeline)
> > >      {
> > >        loop_optimizer_init (LOOPS_NORMAL);
> > >        scev_initialize ();
> > >      }
> > > 
> > >    ...
> > > 
> > >    if (!in_loop_pipeline)
> > >      {
> > >        scev_finalize ();
> > >        loop_optimizer_finalize ();
> > >      }
> > > ...
> > > 
> > > Although that doesn't strike me as particularly clean.
> > 
> > At least it would be a consistent "unclean" style.  So yes, the
> > above would work for me.
> > 
> 
> Reposting using the in_loop_pipeline style in pass_lim.

Ok.

Thanks,
Richard.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-23 11:46                   ` Richard Biener
@ 2015-11-27 11:44                     ` Tom de Vries
  2015-11-27 12:14                       ` Tom de Vries
  2015-12-02  9:46                       ` Jakub Jelinek
  0 siblings, 2 replies; 133+ messages in thread
From: Tom de Vries @ 2015-11-27 11:44 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jakub Jelinek, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 4227 bytes --]

On 23/11/15 12:41, Richard Biener wrote:
> On Sat, 21 Nov 2015, Tom de Vries wrote:
>
>> On 13/11/15 12:39, Jakub Jelinek wrote:
>>> On Fri, Nov 13, 2015 at 12:29:51PM +0100, Richard Biener wrote:
>>>>> thanks for the explanation. Filed as PR68331 - '[meta-bug] fipa-pta
>>>>> issues'.
>>>>>
>>>>> Any feedback on the '#pragma GCC offload-alias=<none|pointer|all>' bit
>>>>> above?
>>>>> Is that sort of what you had in mind?
>>>>
>>>> Yes.  Whether that makes sense is another question of course.  You can
>>>> annotate memory references with MR_DEPENDENCE_BASE/CLIQUE yourself
>>>> as well if you know dependences without the user's intervention.
>>>
>>> I really don't like even the GCC offload-alias, I just don't see anything
>>> special on the offload code.  Not to mention that the same issue is already
>>> with other outlined functions, like OpenMP tasks or parallel regions, those
>>> aren't offloaded, yet they can suffer from worse alias/points-to analysis
>>> too.
>>
>> AFAIU there is one aspect that is different for offloaded code: the setup of
>> the data on the device.
>>
>> Consider this example:
>> ...
>> unsigned int a[N];
>> unsigned int b[N];
>> unsigned int c[N];
>>
>> int
>> main (void)
>> {
>>   ...
>>
>> #pragma acc kernels copyin (a) copyin (b) copyout (c)
>>   {
>>     for (COUNTERTYPE ii = 0; ii < N; ii++)
>>       c[ii] = a[ii] + b[ii];
>>   }
>>
>>   ...
>> ...
>>
>> At gimple level, we have:
>> ...
>> #pragma omp target oacc_kernels \
>>   map(force_from:c [len: 2097152]) \
>>   map(force_to:b [len: 2097152]) \
>>   map(force_to:a [len: 2097152])
>> ...
>>
>> [ The meaning of the force_from/force_to mappings is given in
>> include/gomp-constants.h:
>> ...
>>     /* Allocate.  */
>>     GOMP_MAP_FORCE_ALLOC = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_ALLOC),
>>     /* ..., and copy to device.  */
>>     GOMP_MAP_FORCE_TO = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_TO),
>>     /* ..., and copy from device.  */
>>     GOMP_MAP_FORCE_FROM = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_FROM),
>>     /* ..., and copy to and from device.  */
>>     GOMP_MAP_FORCE_TOFROM = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_TOFROM),
>> ...  ]
>>
>> So before calling the offloaded function, a separate alloc is done for a, b
>> and c, and the base pointers of the newly allocated objects are passed to the
>> offloaded function.
>>
>> This means we can mark those base pointers as restrict in the offloaded
>> function.
>>
>> Attached proof-of-concept patch implements that.
>>
>>> We simply have some compiler internal interface between the caller and
>>> callee of the outlined regions, each interface in between those has
>>> its own structure type used to communicate the info;
>>> we can attach attributes on the fields, or some flags to indicate some
>>> properties interesting from aliasing POV.
>>> We don't really need to perform
>>> full IPA-PTA, perhaps it would be enough to a) record somewhere in cgraph
>>> the relationship in between such callers and callees (for offloading regions
>>> we already have "omp target entrypoint" attribute on the callee and a
>>> single caller), tell LTO if possible not to split those into different
>>> partitions if easily possible, and then just for these pairs perform
>>> aliasing/points-to analysis in the caller and the result record using
>>> cliques/special attributes/whatever to the callee side, so that the callee
>>> (outlined OpenMP/OpenACC/Cilk+ region) can then improve its alias analysis.
>>
>> As a start, is the approach of this patch OK?
> Works for me but leaving to Jakub to review for correctness.

Attached patch is a complete version:
- added ChangeLog
- added missing function header comments
- moved analysis to separate function
   omp_target_base_pointers_restrict_p
- added example in comment before analysis
- fixed error in omp_target_base_pointers_restrict_p where I was using
   GOMP_MAP_ALLOC but should have been using GOMP_MAP_FORCE_ALLOC
- added testcases
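
[ As a rough illustration of the analysis, and not part of the patch: the
sketch below models the clause walk on a plain linked list.  The types
struct clause, enum clause_code and enum map_kind, and the function names,
are hypothetical stand-ins for the tree-based OMP_CLAUSE_* accessors; only
the decision logic is meant to correspond to
omp_target_base_pointers_restrict_p. ]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the clause codes and map kinds; these are
   not the GCC definitions.  */
enum clause_code { CLAUSE_MAP, CLAUSE_OTHER };

enum map_kind
{
  MAP_ALLOC,
  MAP_FORCE_ALLOC,
  MAP_FORCE_TO,
  MAP_FORCE_FROM,
  MAP_FORCE_TOFROM,
  MAP_FORCE_PRESENT
};

struct clause
{
  enum clause_code code;
  enum map_kind kind;      /* Only meaningful for CLAUSE_MAP.  */
  struct clause *chain;    /* Next clause, or NULL.  */
};

/* Return true if every clause in CLAUSES is a map clause whose kind
   implies a fresh allocation on the device.  */
static bool
base_pointers_restrict_p (const struct clause *clauses)
{
  for (const struct clause *c = clauses; c != NULL; c = c->chain)
    {
      if (c->code != CLAUSE_MAP)
        return false;

      switch (c->kind)
        {
        case MAP_FORCE_ALLOC:
        case MAP_FORCE_TO:
        case MAP_FORCE_FROM:
        case MAP_FORCE_TOFROM:
          break;
        default:
          return false;
        }
    }

  return true;
}

/* Usage example: the two copyout clauses from the comment in the patch,
   then the same list with one mapping changed to a 'present' kind.  */
static void
example_basic (void)
{
  struct clause b = { CLAUSE_MAP, MAP_FORCE_FROM, NULL };
  struct clause a = { CLAUSE_MAP, MAP_FORCE_FROM, &b };
  assert (base_pointers_restrict_p (&a));

  b.kind = MAP_FORCE_PRESENT;
  assert (!base_pointers_restrict_p (&a));
}
```

[ The sketch is deliberately conservative in the same way as the patch: a
single clause that is not a force_alloc/to/from/tofrom mapping disables the
restrict marking for the whole region. ]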

Bootstrapped and reg-tested on x86_64.

OK for stage3 trunk?

Thanks,
- Tom


[-- Attachment #2: 0001-Mark-pointers-to-allocated-target-vars-as-restricted-if-possible.patch --]
[-- Type: text/x-patch, Size: 13735 bytes --]

Mark pointers to allocated target vars as restricted, if possible

2015-11-26  Tom de Vries  <tom@codesourcery.com>

	* omp-low.c (install_var_field_1): New function, factored out of ...
	(install_var_field): ... here.
	(scan_sharing_clauses_1): New function, factored out of ...
	(scan_sharing_clauses): ... here.
	(omp_target_base_pointers_restrict_p): New function.
	(scan_omp_target): Call scan_sharing_clauses_1 instead of
	scan_sharing_clauses, with base_pointers_restrict arg.

	* c-c++-common/goacc/kernels-alias-2.c: New test.
	* c-c++-common/goacc/kernels-alias-3.c: New test.
	* c-c++-common/goacc/kernels-alias-4.c: New test.
	* c-c++-common/goacc/kernels-alias-5.c: New test.
	* c-c++-common/goacc/kernels-alias-6.c: New test.
	* c-c++-common/goacc/kernels-alias-7.c: New test.
	* c-c++-common/goacc/kernels-alias-8.c: New test.
	* c-c++-common/goacc/kernels-alias.c: New test.

---
 gcc/omp-low.c                                      | 109 +++++++++++++++++++--
 gcc/testsuite/c-c++-common/goacc/kernels-alias-2.c |  27 +++++
 gcc/testsuite/c-c++-common/goacc/kernels-alias-3.c |  20 ++++
 gcc/testsuite/c-c++-common/goacc/kernels-alias-4.c |  22 +++++
 gcc/testsuite/c-c++-common/goacc/kernels-alias-5.c |  19 ++++
 gcc/testsuite/c-c++-common/goacc/kernels-alias-6.c |  23 +++++
 gcc/testsuite/c-c++-common/goacc/kernels-alias-7.c |  25 +++++
 gcc/testsuite/c-c++-common/goacc/kernels-alias-8.c |  22 +++++
 gcc/testsuite/c-c++-common/goacc/kernels-alias.c   |  29 ++++++
 9 files changed, 289 insertions(+), 7 deletions(-)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 0d4c6e5..6843c49 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -1366,10 +1366,12 @@ build_sender_ref (tree var, omp_context *ctx)
   return build_sender_ref ((splay_tree_key) var, ctx);
 }
 
-/* Add a new field for VAR inside the structure CTX->SENDER_DECL.  */
+/* Add a new field for VAR inside the structure CTX->SENDER_DECL.  If
+   BASE_POINTERS_RESTRICT, declare the field with restrict.  */
 
 static void
-install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
+install_var_field_1 (tree var, bool by_ref, int mask, omp_context *ctx,
+		     bool base_pointers_restrict)
 {
   tree field, type, sfield = NULL_TREE;
   splay_tree_key key = (splay_tree_key) var;
@@ -1393,7 +1395,11 @@ install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
       type = build_pointer_type (build_pointer_type (type));
     }
   else if (by_ref)
-    type = build_pointer_type (type);
+    {
+      type = build_pointer_type (type);
+      if (base_pointers_restrict)
+	type = build_qualified_type (type, TYPE_QUAL_RESTRICT);
+    }
   else if ((mask & 3) == 1 && is_reference (var))
     type = TREE_TYPE (type);
 
@@ -1457,6 +1463,14 @@ install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
     splay_tree_insert (ctx->sfield_map, key, (splay_tree_value) sfield);
 }
 
+/* As install_var_field_1, but with base_pointers_restrict == false.  */
+
+static void
+install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
+{
+  install_var_field_1 (var, by_ref, mask, ctx, false);
+}
+
 static tree
 install_var_local (tree var, omp_context *ctx)
 {
@@ -1810,10 +1824,12 @@ fixup_child_record_type (omp_context *ctx)
 }
 
 /* Instantiate decls as necessary in CTX to satisfy the data sharing
-   specified by CLAUSES.  */
+   specified by CLAUSES.  If BASE_POINTERS_RESTRICT, install var field with
+   restrict.  */
 
 static void
-scan_sharing_clauses (tree clauses, omp_context *ctx)
+scan_sharing_clauses_1 (tree clauses, omp_context *ctx,
+			bool base_pointers_restrict)
 {
   tree c, decl;
   bool scan_array_reductions = false;
@@ -2070,7 +2086,8 @@ scan_sharing_clauses (tree clauses, omp_context *ctx)
 		      && TREE_CODE (TREE_TYPE (decl)) == ARRAY_TYPE)
 		    install_var_field (decl, true, 7, ctx);
 		  else
-		    install_var_field (decl, true, 3, ctx);
+		    install_var_field_1 (decl, true, 3, ctx,
+					 base_pointers_restrict);
 		  if (is_gimple_omp_offloaded (ctx->stmt))
 		    install_var_local (decl, ctx);
 		}
@@ -2336,6 +2353,14 @@ scan_sharing_clauses (tree clauses, omp_context *ctx)
 	scan_omp (&OMP_CLAUSE_LINEAR_GIMPLE_SEQ (c), ctx);
 }
 
+/* As scan_sharing_clauses_1, but with base_pointers_restrict == false.  */
+
+static void
+scan_sharing_clauses (tree clauses, omp_context *ctx)
+{
+  scan_sharing_clauses_1 (clauses, ctx, false);
+}
+
 /* Create a new name for omp child function.  Returns an identifier.  If
    IS_CILK_FOR is true then the suffix for the child function is
    "_cilk_for_fn."  */
@@ -3032,6 +3057,68 @@ scan_omp_single (gomp_single *stmt, omp_context *outer_ctx)
     layout_type (ctx->record_type);
 }
 
+/* Return true if the CLAUSES of an omp target guarantee that the base pointers
+   used in the corresponding offloaded function are restrict.  */
+
+static bool
+omp_target_base_pointers_restrict_p (tree clauses)
+{
+  /* The analysis relies on the GOMP_MAP_FORCE_* mapping kinds, which are only
+     used by OpenACC.  */
+  if (flag_openacc == 0)
+    return false;
+
+  /* I.  Basic example:
+
+       void foo (void)
+       {
+	 unsigned int a[2], b[2];
+
+	 #pragma acc kernels \
+	   copyout (a) \
+	   copyout (b)
+	 {
+	   a[0] = 0;
+	   b[0] = 1;
+	 }
+       }
+
+     After gimplification, we have:
+
+       #pragma omp target oacc_kernels \
+	 map(force_from:a [len: 8]) \
+	 map(force_from:b [len: 8])
+       {
+	 a[0] = 0;
+	 b[0] = 1;
+       }
+
+     Because both mappings have the force prefix, we know that they will be
+     allocated when calling the corresponding offloaded function, which means we
+     can mark the base pointers for a and b in the offloaded function as
+     restrict.  */
+
+  tree c;
+  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+    {
+      if (OMP_CLAUSE_CODE (c) != OMP_CLAUSE_MAP)
+	return false;
+
+      switch (OMP_CLAUSE_MAP_KIND (c))
+	{
+	case GOMP_MAP_FORCE_ALLOC:
+	case GOMP_MAP_FORCE_TO:
+	case GOMP_MAP_FORCE_FROM:
+	case GOMP_MAP_FORCE_TOFROM:
+	  break;
+	default:
+	  return false;
+	}
+    }
+
+  return true;
+}
+
 /* Scan a GIMPLE_OMP_TARGET.  */
 
 static void
@@ -3053,13 +3140,21 @@ scan_omp_target (gomp_target *stmt, omp_context *outer_ctx)
   DECL_NAMELESS (name) = 1;
   TYPE_NAME (ctx->record_type) = name;
   TYPE_ARTIFICIAL (ctx->record_type) = 1;
+
+  bool base_pointers_restrict = false;
   if (offloaded)
     {
       create_omp_child_function (ctx, false);
       gimple_omp_target_set_child_fn (stmt, ctx->cb.dst_fn);
+
+      base_pointers_restrict = omp_target_base_pointers_restrict_p (clauses);
+      if (base_pointers_restrict
+	  && dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file,
+		 "Base pointers in offloaded function are restrict\n");
     }
 
-  scan_sharing_clauses (clauses, ctx);
+  scan_sharing_clauses_1 (clauses, ctx, base_pointers_restrict);
   scan_omp (gimple_omp_body_ptr (stmt), ctx);
 
   if (TYPE_FIELDS (ctx->record_type) == NULL)
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-alias-2.c b/gcc/testsuite/c-c++-common/goacc/kernels-alias-2.c
new file mode 100644
index 0000000..d437c47
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-alias-2.c
@@ -0,0 +1,27 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-ealias-all" } */
+
+void
+foo (void)
+{
+  unsigned int a;
+  unsigned int b;
+  unsigned int c;
+  unsigned int d;
+
+#pragma acc kernels copyin (a) create (b) copyout (c) copy (d)
+  {
+    a = 0;
+    b = 0;
+    c = 0;
+    d = 0;
+  }
+}
+
+/* { dg-final { scan-tree-dump-times "clique 1 base 1" 4 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 2" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 3" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 4" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 5" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "(?n)clique .* base .*" 8 "ealias" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-alias-3.c b/gcc/testsuite/c-c++-common/goacc/kernels-alias-3.c
new file mode 100644
index 0000000..0eda7e1
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-alias-3.c
@@ -0,0 +1,20 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-ealias-all" } */
+
+void
+foo (void)
+{
+  unsigned int a;
+  unsigned int *p = &a;
+
+#pragma acc kernels pcopyin (a, p[0:1])
+  {
+    a = 0;
+    *p = 1;
+  }
+}
+
+/* Only the omp_data_i related loads should be annotated with cliques.  */
+/* { dg-final { scan-tree-dump-times "clique 1 base 1" 2 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "(?n)clique .* base .*" 2 "ealias" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-alias-4.c b/gcc/testsuite/c-c++-common/goacc/kernels-alias-4.c
new file mode 100644
index 0000000..037901f
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-alias-4.c
@@ -0,0 +1,22 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-ealias-all" } */
+
+#define N 2
+
+void
+foo (void)
+{
+  unsigned int a[N];
+  unsigned int *p = &a[0];
+
+#pragma acc kernels pcopyin (a, p[0:2])
+  {
+    a[0] = 0;
+    *p = 1;
+  }
+}
+
+/* Only the omp_data_i related loads should be annotated with cliques.  */
+/* { dg-final { scan-tree-dump-times "clique 1 base 1" 2 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "(?n)clique .* base .*" 2 "ealias" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-alias-5.c b/gcc/testsuite/c-c++-common/goacc/kernels-alias-5.c
new file mode 100644
index 0000000..69cd3fb
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-alias-5.c
@@ -0,0 +1,19 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-ealias-all" } */
+
+void
+foo (int *a)
+{
+  int *p = a;
+
+#pragma acc kernels pcopyin (a[0:1], p[0:1])
+  {
+    *a = 0;
+    *p = 1;
+  }
+}
+
+/* Only the omp_data_i related loads should be annotated with cliques.  */
+/* { dg-final { scan-tree-dump-times "clique 1 base 1" 2 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "(?n)clique .* base .*" 2 "ealias" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-alias-6.c b/gcc/testsuite/c-c++-common/goacc/kernels-alias-6.c
new file mode 100644
index 0000000..6ebce15
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-alias-6.c
@@ -0,0 +1,23 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-ealias-all" } */
+
+typedef __SIZE_TYPE__ size_t;
+extern void *acc_copyin (void *, size_t);
+
+void
+foo (void)
+{
+  int a = 0;
+  int *p = (int *)acc_copyin (&a, sizeof (a));
+
+#pragma acc kernels deviceptr (p) pcopy(a)
+  {
+    a = 0;
+    *p = 1;
+  }
+}
+
+/* Only the omp_data_i related loads should be annotated with cliques.  */
+/* { dg-final { scan-tree-dump-times "clique 1 base 1" 2 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "(?n)clique .* base .*" 2 "ealias" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-alias-7.c b/gcc/testsuite/c-c++-common/goacc/kernels-alias-7.c
new file mode 100644
index 0000000..40eb235
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-alias-7.c
@@ -0,0 +1,25 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-ealias-all" } */
+
+typedef __SIZE_TYPE__ size_t;
+extern void *acc_copyin (void *, size_t);
+
+#define N 2
+
+void
+foo (void)
+{
+  int a[N];
+  int *p = (int *)acc_copyin (&a[0], sizeof (a));
+
+#pragma acc kernels deviceptr (p) pcopy(a)
+  {
+    a[0] = 0;
+    *p = 1;
+  }
+}
+
+/* Only the omp_data_i related loads should be annotated with cliques.  */
+/* { dg-final { scan-tree-dump-times "clique 1 base 1" 2 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "(?n)clique .* base .*" 2 "ealias" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-alias-8.c b/gcc/testsuite/c-c++-common/goacc/kernels-alias-8.c
new file mode 100644
index 0000000..0b93e35
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-alias-8.c
@@ -0,0 +1,22 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-ealias-all" } */
+
+typedef __SIZE_TYPE__ size_t;
+extern void *acc_copyin (void *, size_t);
+
+void
+foo (int *a, size_t n)
+{
+  int *p = (int *)acc_copyin (&a, n);
+
+#pragma acc kernels deviceptr (p) pcopy(a[0:n])
+  {
+    a = 0;
+    *p = 1;
+  }
+}
+
+/* Only the omp_data_i related loads should be annotated with cliques.  */
+/* { dg-final { scan-tree-dump-times "clique 1 base 1" 2 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "(?n)clique .* base .*" 2 "ealias" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-alias.c b/gcc/testsuite/c-c++-common/goacc/kernels-alias.c
new file mode 100644
index 0000000..25821ab2
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-alias.c
@@ -0,0 +1,29 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-ealias-all" } */
+
+#define N 2
+
+void
+foo (void)
+{
+  unsigned int a[N];
+  unsigned int b[N];
+  unsigned int c[N];
+  unsigned int d[N];
+
+#pragma acc kernels copyin (a) create (b) copyout (c) copy (d)
+  {
+    a[0] = 0;
+    b[0] = 0;
+    c[0] = 0;
+    d[0] = 0;
+  }
+}
+
+/* { dg-final { scan-tree-dump-times "clique 1 base 1" 4 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 2" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 3" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 4" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 5" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "(?n)clique .* base .*" 8 "ealias" } } */
+

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-27 11:44                     ` Tom de Vries
@ 2015-11-27 12:14                       ` Tom de Vries
  2015-12-02  9:59                         ` Jakub Jelinek
  2015-12-02  9:46                       ` Jakub Jelinek
  1 sibling, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-11-27 12:14 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jakub Jelinek, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 5720 bytes --]

On 27/11/15 12:42, Tom de Vries wrote:
> On 23/11/15 12:41, Richard Biener wrote:
>> On Sat, 21 Nov 2015, Tom de Vries wrote:
>>
>>> >On 13/11/15 12:39, Jakub Jelinek wrote:
>>>> > >On Fri, Nov 13, 2015 at 12:29:51PM +0100, Richard Biener wrote:
>>>>>> > > > >thanks for the explanation. Filed as PR68331 - '[meta-bug]
>>>>>> fipa-pta
>>>>>> > > > >issues'.
>>>>>> > > > >
>>>>>> > > > >Any feedback on the '#pragma GCC
>>>>>> offload-alias=<none|pointer|all>' bit
>>>>>> > > > >above?
>>>>>> > > > >Is that sort of what you had in mind?
>>>>> > > >
>>>>> > > >Yes.  Whether that makes sense is another question of course.
>>>>> You can
>>>>> > > >annotate memory references with MR_DEPENDENCE_BASE/CLIQUE
>>>>> yourself
>>>>> > > >as well if you know dependences without the users intervention.
>>>> > >
>>>> > >I really don't like even the GCC offload-alias, I just don't see
>>>> anything
>>>> > >special on the offload code.  Not to mention that the same issue
>>>> is already
>>>> > >with other outlined functions, like OpenMP tasks or parallel
>>>> regions, those
>>>> > >aren't offloaded, yet they can suffer from worse alias/points-to
>>>> analysis
>>>> > >too.
>>> >
>>> >AFAIU there is one aspect that is different for offloaded code: the
>>> setup of
>>> >the data on the device.
>>> >
>>> >Consider this example:
>>> >...
>>> >unsigned int a[N];
>>> >unsigned int b[N];
>>> >unsigned int c[N];
>>> >
>>> >int
>>> >main (void)
>>> >{
>>> >   ...
>>> >
>>> >#pragma acc kernels copyin (a) copyin (b) copyout (c)
>>> >   {
>>> >     for (COUNTERTYPE ii = 0; ii < N; ii++)
>>> >       c[ii] = a[ii] + b[ii];
>>> >   }
>>> >
>>> >   ...
>>> >...
>>> >
>>> >At gimple level, we have:
>>> >...
>>> >#pragma omp target oacc_kernels \
>>> >   map(force_from:c [len: 2097152]) \
>>> >   map(force_to:b [len: 2097152]) \
>>> >   map(force_to:a [len: 2097152])
>>> >...
>>> >
>>> >[ The meaning of the force_from/force_to mappings is given in
>>> >include/gomp-constants.h:
>>> >...
>>> >     /* Allocate.  */
>>> >     GOMP_MAP_FORCE_ALLOC = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_ALLOC),
>>> >     /* ..., and copy to device.  */
>>> >     GOMP_MAP_FORCE_TO = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_TO),
>>> >     /* ..., and copy from device.  */
>>> >     GOMP_MAP_FORCE_FROM = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_FROM),
>>> >     /* ..., and copy to and from device.  */
>>> >     GOMP_MAP_FORCE_TOFROM = (GOMP_MAP_FLAG_FORCE | GOMP_MAP_TOFROM),
>>> >...  ]
>>> >
>>> >So before calling the offloaded function, a separate alloc is done
>>> for a, b
>>> >and c, and the base pointers of the newly allocated objects are
>>> passed to the
>>> >offloaded function.
>>> >
>>> >This means we can mark those base pointers as restrict in the offloaded
>>> >function.
>>> >
>>> >Attached proof-of-concept patch implements that.
>>> >
>>>> > >We simply have some compiler internal interface between the
>>>> caller and
>>>> > >callee of the outlined regions, each interface in between those has
>>>> > >its own structure type used to communicate the info;
>>>> > >we can attach attributes on the fields, or some flags to indicate
>>>> some
>>>> > >properties interesting from aliasing POV.
>>>> > >We don't really need to perform
>>>> > >full IPA-PTA, perhaps it would be enough to a) record somewhere
>>>> in cgraph
>>>> > >the relationship in between such callers and callees (for
>>>> offloading regions
>>>> > >we already have "omp target entrypoint" attribute on the callee
>>>> and a
>>>> > >singler caller), tell LTO if possible not to split those into
>>>> different
>>>> > >partitions if easily possible, and then just for these pairs perform
>>>> > >aliasing/points-to analysis in the caller and the result record
>>>> using
>>>> > >cliques/special attributes/whatever to the callee side, so that
>>>> the callee
>>>> > >(outlined OpenMP/OpenACC/Cilk+ region) can then improve its alias
>>>> analysis.
>>> >
>>> >As a start, is the approach of this patch OK?
>> Works for me but leaving to Jakub to review for correctness.
>
> Attached patch is a complete version:
> - added ChangeLog
> - added missing function header comments
> - moved analysis to separate function
>    omp_target_base_pointers_restrict_p
> - added example in comment before analysis
> - fixed error in omp_target_base_pointers_restrict_p where I was using
>    GOMP_MAP_ALLOC but should have been using GOMP_MAP_FORCE_ALLOC
> - added testcases
>

This follow-up patch handles the case that we copy from/to pointers 
rather than declared variables:
...
        void foo (unsigned int *a, unsigned int *b)
        {
	 #pragma acc kernels copyout (a[0:2]) copyout (b[0:2])
	 {
	   a[0] = 0;
	   b[0] = 1;
	 }
        }
...

After gimplification, we have:
...
      foo (unsigned int * a, unsigned int * b)
      {
        unsigned int * b.0;
        unsigned int * a.1;

        b.0 = b;
        a.1 = a;
        #pragma omp target oacc_kernels \
	 map(force_from:*a.1 (*a) [len: 8]) \
	 map(alloc:a [pointer assign, bias: 0]) \
	 map(force_from:*b.0 (*b) [len: 8]) \
	 map(alloc:b [pointer assign, bias: 0])
        {
	 unsigned int * a.2;
	 unsigned int * b.3;

	 a.2 = a;
	 *a.2 = 0;
	 b.3 = b;
	 *b.3 = 1;
       }
      }
...

We don't bail out of omp_target_base_pointers_restrict_p when 
encountering 'map(alloc:a [pointer assign, bias: 0])', given that we can 
find the matching 'map(force_from:*a.1 (*a) [len: 8])'.
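
[ A simplified model of that matching, for illustration only and not taken
from the patch: clauses are a linked list, declared variables and pointers
are identified by name, and 'orig_base' plays the role of the pointer
found through OMP_CLAUSE_ORIG_DECL.  All type, field and function names
below are hypothetical. ]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical map kinds; not the GOMP_MAP_* definitions.  */
enum map_kind
{
  MAP_FORCE_ALLOC,
  MAP_FORCE_TO,
  MAP_FORCE_FROM,
  MAP_FORCE_TOFROM,
  MAP_POINTER,
  MAP_OTHER
};

struct map_clause
{
  enum map_kind kind;
  const char *decl;       /* Name of the mapped declared variable, or NULL
                             if the mapped object is '*ptr' / 'ptr[0:n]'.  */
  const char *orig_base;  /* For '*ptr' / 'ptr[0:n]': the name of 'ptr'.  */
  struct map_clause *chain;
};

/* Return true if pointer clause PTR maps the pointer that C is based on.  */
static bool
pointer_clause_points_to_p (const struct map_clause *ptr,
                            const struct map_clause *c)
{
  assert (ptr->kind == MAP_POINTER && ptr->decl != NULL);
  return c->orig_base != NULL && strcmp (ptr->decl, c->orig_base) == 0;
}

/* Count the pointer clauses in CLAUSES that point to C.  */
static unsigned int
nr_pointer_clauses_pointing_to (const struct map_clause *clauses,
                                const struct map_clause *c)
{
  unsigned int nr = 0;

  for (const struct map_clause *p = clauses; p != NULL; p = p->chain)
    if (p->kind == MAP_POINTER && pointer_clause_points_to_p (p, c))
      nr++;

  return nr;
}

/* Declared variables must have no pointer clause pointing to them;
   every other force_* mapping must have exactly one, and every pointer
   clause must be accounted for.  */
static bool
base_pointers_restrict_p (const struct map_clause *clauses)
{
  unsigned int ptr_found = 0;
  unsigned int ptr_matched = 0;

  for (const struct map_clause *c = clauses; c != NULL; c = c->chain)
    switch (c->kind)
      {
      case MAP_FORCE_ALLOC:
      case MAP_FORCE_TO:
      case MAP_FORCE_FROM:
      case MAP_FORCE_TOFROM:
        {
          unsigned int nr = nr_pointer_clauses_pointing_to (clauses, c);
          if (c->decl != NULL)
            {
              if (nr != 0)
                return false;
            }
          else
            {
              if (nr != 1)
                return false;
              ptr_matched++;
            }
        }
        break;
      case MAP_POINTER:
        ptr_found++;
        break;
      default:
        return false;
      }

  return ptr_found == ptr_matched;
}
```

[ With the four clauses of the example above (force_from:*a, pointer a,
force_from:*b, pointer b) the model returns true.  If both pointer clauses
named the same pointer, the count for one force_from mapping would be 2 and
the analysis would give up. ]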

Using this and the previous patch, I'm able to do auto-parallelization 
on all the oacc kernels C testcases, with the obvious exception of the 
testcases where some of the used variables are mapped using the 'present' 
tag (in other words, missing the force tag).
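
[ To show why the restrict marking matters for auto-parallelization, here
is a hand-written analogue of an offloaded function body.  It is only a
sketch: struct omp_data and kernel_body are hypothetical names, and the
real record and child function are generated by omp-low.c.  With the
fields declared restrict, the compiler may assume that the store to c[i]
does not alias the loads from a[i] and b[i], so the iterations are
independent. ]

```c
#include <assert.h>
#include <stddef.h>

#define N 4

/* Hypothetical analogue of the record passed to the offloaded function.
   Each field holds the base pointer of a separately allocated object.  */
struct omp_data
{
  unsigned int *restrict a;
  unsigned int *restrict b;
  unsigned int *restrict c;
};

/* Analogue of the outlined kernels region: c = a + b, element-wise.
   The caller must pass three non-overlapping arrays of at least LEN
   elements, which is what the restrict qualifiers promise.  */
static void
kernel_body (const struct omp_data *d, size_t len)
{
  assert (d != NULL);

  for (size_t i = 0; i < len; i++)
    d->c[i] = d->a[i] + d->b[i];
}
```

[ When a variable is mapped with 'present', the data may have been set up
by an earlier mapping that overlaps another one, so no such promise can be
made and the fields have to stay unqualified. ]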

Bootstrapped and reg-tested on x86_64.

OK for stage3 trunk?

Thanks,
- Tom


[-- Attachment #2: 0002-Handle-non-declared-variables-in-kernels-alias-analysis.patch --]
[-- Type: text/x-patch, Size: 10410 bytes --]

Handle non-declared variables in kernels alias analysis

2015-11-27  Tom de Vries  <tom@codesourcery.com>

	* gimplify.c (gimplify_scan_omp_clauses): Initialize
	OMP_CLAUSE_ORIG_DECL.
	* omp-low.c (install_var_field_1): Handle base_pointers_restrict for
	pointers.
	(map_ptr_clause_points_to_clause_p)
	(nr_map_ptr_clauses_pointing_to_clause): New functions.
	(omp_target_base_pointers_restrict_p): Handle GOMP_MAP_POINTER.
	* tree-pretty-print.c (dump_omp_clause): Print OMP_CLAUSE_ORIG_DECL.
	* tree.c (omp_clause_num_ops): Set num_ops for OMP_CLAUSE_MAP to 3.
	* tree.h (OMP_CLAUSE_ORIG_DECL): New macro.

	* c-c++-common/goacc/kernels-alias-10.c: New test.
	* c-c++-common/goacc/kernels-alias-9.c: New test.

---
 gcc/gimplify.c                                     |   1 +
 gcc/omp-low.c                                      | 134 ++++++++++++++++++++-
 .../c-c++-common/goacc/kernels-alias-10.c          |  29 +++++
 gcc/testsuite/c-c++-common/goacc/kernels-alias-9.c |  29 +++++
 gcc/tree-pretty-print.c                            |   8 ++
 gcc/tree.c                                         |   2 +-
 gcc/tree.h                                         |   5 +
 7 files changed, 205 insertions(+), 3 deletions(-)

diff --git a/gcc/gimplify.c b/gcc/gimplify.c
index a3ed378..fcac745 100644
--- a/gcc/gimplify.c
+++ b/gcc/gimplify.c
@@ -6713,6 +6713,7 @@ gimplify_scan_omp_clauses (tree *list_p, gimple_seq *pre_p,
 	  if (!DECL_P (decl))
 	    {
 	      tree d = decl, *pd;
+	      OMP_CLAUSE_ORIG_DECL (c) = copy_node (decl);
 	      if (TREE_CODE (d) == ARRAY_REF)
 		{
 		  while (TREE_CODE (d) == ARRAY_REF)
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 6843c49..8ae08c52 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -1396,6 +1396,9 @@ install_var_field_1 (tree var, bool by_ref, int mask, omp_context *ctx,
     }
   else if (by_ref)
     {
+      if (base_pointers_restrict
+	  && POINTER_TYPE_P (type))
+	type = build_qualified_type (type, TYPE_QUAL_RESTRICT);
       type = build_pointer_type (type);
       if (base_pointers_restrict)
 	type = build_qualified_type (type, TYPE_QUAL_RESTRICT);
@@ -3057,6 +3060,64 @@ scan_omp_single (gomp_single *stmt, omp_context *outer_ctx)
     layout_type (ctx->record_type);
 }
 
+/* Return true if OMP_CLAUSE_DECL (MAP_POINTER_CLAUSE) points to
+   OMP_CLAUSE_DECL (CLAUSE).  */
+
+static bool
+map_ptr_clause_points_to_clause_p (tree map_pointer_clause, tree clause)
+{
+  gcc_assert (OMP_CLAUSE_CODE (map_pointer_clause) == OMP_CLAUSE_MAP);
+  gcc_assert (OMP_CLAUSE_MAP_KIND (map_pointer_clause) == GOMP_MAP_POINTER);
+
+  if (OMP_CLAUSE_CODE (clause) != OMP_CLAUSE_MAP)
+    return false;
+
+  tree orig_decl = OMP_CLAUSE_ORIG_DECL (clause);
+  if (orig_decl == NULL_TREE)
+    return false;
+
+  tree ptr_decl = OMP_CLAUSE_DECL (map_pointer_clause);
+  switch (TREE_CODE (orig_decl))
+    {
+    case ARRAY_REF:
+      if (!integer_zerop (TREE_OPERAND (orig_decl, 1)))
+	return false;
+
+      /* Fall through.  */
+    case INDIRECT_REF:
+      if (!operand_equal_p (ptr_decl, TREE_OPERAND (orig_decl, 0), 0))
+	return false;
+      break;
+    default:
+      return false;
+    }
+
+  return true;
+}
+
+/* Return the number of map_pointer clauses in CLAUSES pointing to CLAUSE.  */
+
+static unsigned int
+nr_map_ptr_clauses_pointing_to_clause (tree clauses, tree clause)
+{
+  unsigned int nr = 0;
+
+  tree c;
+  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+    {
+      if (OMP_CLAUSE_CODE (c) != OMP_CLAUSE_MAP)
+	continue;
+
+      if (OMP_CLAUSE_MAP_KIND (c) != GOMP_MAP_POINTER)
+	continue;
+
+      if (map_ptr_clause_points_to_clause_p (c, clause))
+	nr++;
+    }
+
+  return nr;
+}
+
 /* Return true if the CLAUSES of an omp target guarantee that the base pointers
    used in the corresponding offloaded function are restrict.  */
 
@@ -3096,8 +3157,59 @@ omp_target_base_pointers_restrict_p (tree clauses)
      Because both mappings have the force prefix, we know that they will be
      allocated when calling the corresponding offloaded function, which means we
      can mark the base pointers for a and b in the offloaded function as
-     restrict.  */
+     restrict.
+
+     II.  GOMP_MAP_POINTER example:
 
+       void foo (unsigned int *a, unsigned int *b)
+       {
+	 #pragma acc kernels copyout (a[0:2]) copyout (b[0:2])
+	 {
+	   a[0] = 0;
+	   b[0] = 1;
+	 }
+       }
+
+     After gimplification, we have:
+
+     foo (unsigned int * a, unsigned int * b)
+     {
+       unsigned int * b.0;
+       unsigned int * a.1;
+
+       b.0 = b;
+       a.1 = a;
+       #pragma omp target oacc_kernels \
+	 map(force_from:*a.1 (*a) [len: 8]) \
+	 map(alloc:a [pointer assign, bias: 0]) \
+	 map(force_from:*b.0 (*b) [len: 8]) \
+	 map(alloc:b [pointer assign, bias: 0])
+       {
+	 unsigned int * a.2;
+	 unsigned int * b.3;
+
+	 a.2 = a;
+	 *a.2 = 0;
+	 b.3 = b;
+	 *b.3 = 1;
+       }
+     }
+
+     Because:
+     - we can prove for both pointer assign mappings that they point to a
+       force-prefixed mapping, and
+     - the force-prefixed mappings themselves do not have their OMP_CLAUSE_DECL
+       used in the body,
+     we can mark the base pointers for a and b in the offloaded function as
+     restrict.
+
+     KLUDGE: In order to connect the pointer mapping clause to the force_*
+     clause, we need to save the pre-gimplification OMP_CLAUSE_DECL as
+     OMP_CLAUSE_ORIG_DECL.  Note that OMP_CLAUSE_ORIG_DECL is printed as '(*a)'
+     in 'map(force_from:*a.1 (*a) [len: 8])'.  */
+
+  unsigned int ptr_found = 0;
+  unsigned int ptr_matched = 0;
   tree c;
   for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
     {
@@ -3110,13 +3222,31 @@ omp_target_base_pointers_restrict_p (tree clauses)
 	case GOMP_MAP_FORCE_TO:
 	case GOMP_MAP_FORCE_FROM:
 	case GOMP_MAP_FORCE_TOFROM:
+	  {
+	    unsigned int nr
+	      = nr_map_ptr_clauses_pointing_to_clause (clauses, c);
+	    if (DECL_P (OMP_CLAUSE_DECL (c)))
+	      {
+		if (nr != 0)
+		  return false;
+	      }
+	    else
+	      {
+		if (nr != 1)
+		  return false;
+		ptr_matched++;
+	      }
+	  }
+	  break;
+	case GOMP_MAP_POINTER:
+	  ptr_found++;
 	  break;
 	default:
 	  return false;
 	}
     }
 
-  return true;
+  return ptr_found == ptr_matched;
 }
 
 /* Scan a GIMPLE_OMP_TARGET.  */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-alias-10.c b/gcc/testsuite/c-c++-common/goacc/kernels-alias-10.c
new file mode 100644
index 0000000..ce5bbe8
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-alias-10.c
@@ -0,0 +1,29 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-ealias-all" } */
+
+#define N 2
+
+void
+foo (void)
+{
+  unsigned int a[N];
+  unsigned int b[N];
+  unsigned int c[N];
+  unsigned int d[N];
+
+#pragma acc kernels copyin (a[0:N]) create (b[0:N]) copyout (c[0:N]) copy (d[0:N])
+  {
+    a[0] = 0;
+    b[0] = 0;
+    c[0] = 0;
+    d[0] = 0;
+  }
+}
+
+/* { dg-final { scan-tree-dump-times "clique 1 base 1" 4 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 2" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 3" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 4" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 5" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "(?n)clique .* base .*" 8 "ealias" } } */
+
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-alias-9.c b/gcc/testsuite/c-c++-common/goacc/kernels-alias-9.c
new file mode 100644
index 0000000..7229fd4
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-alias-9.c
@@ -0,0 +1,29 @@
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fdump-tree-ealias-all" } */
+
+#define N 2
+
+void
+foo (unsigned int *a, unsigned int *b, unsigned int *c, unsigned int *d)
+{
+
+#pragma acc kernels copyin (a[0:N]) create (b[0:N]) copyout (c[0:N]) copy (d[0:N])
+  {
+    a[0] = 0;
+    b[0] = 0;
+    c[0] = 0;
+    d[0] = 0;
+  }
+}
+
+/* { dg-final { scan-tree-dump-times "clique 1 base 1" 4 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 2" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 3" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 4" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 5" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 6" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 7" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 8" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "clique 1 base 9" 1 "ealias" } } */
+/* { dg-final { scan-tree-dump-times "(?n)clique .* base .*" 12 "ealias" } } */
+
diff --git a/gcc/tree-pretty-print.c b/gcc/tree-pretty-print.c
index caec760..4b94f18 100644
--- a/gcc/tree-pretty-print.c
+++ b/gcc/tree-pretty-print.c
@@ -666,6 +666,14 @@ dump_omp_clause (pretty_printer *pp, tree clause, int spc, int flags)
       pp_colon (pp);
       dump_generic_node (pp, OMP_CLAUSE_DECL (clause),
 			 spc, flags, false);
+      if (OMP_CLAUSE_ORIG_DECL (clause) != NULL_TREE)
+	{
+	  pp_space (pp);
+	  pp_left_paren (pp);
+	  dump_generic_node (pp, OMP_CLAUSE_ORIG_DECL (clause),
+			     spc, flags, false);
+	  pp_right_paren (pp);
+	}
      print_clause_size:
       if (OMP_CLAUSE_SIZE (clause))
 	{
diff --git a/gcc/tree.c b/gcc/tree.c
index 779fe93..45f9a17 100644
--- a/gcc/tree.c
+++ b/gcc/tree.c
@@ -277,7 +277,7 @@ unsigned const char omp_clause_num_ops[] =
   1, /* OMP_CLAUSE_LINK  */
   2, /* OMP_CLAUSE_FROM  */
   2, /* OMP_CLAUSE_TO  */
-  2, /* OMP_CLAUSE_MAP  */
+  3, /* OMP_CLAUSE_MAP  */
   1, /* OMP_CLAUSE_USE_DEVICE_PTR  */
   1, /* OMP_CLAUSE_IS_DEVICE_PTR  */
   2, /* OMP_CLAUSE__CACHE_  */
diff --git a/gcc/tree.h b/gcc/tree.h
index cb52deb..27221ee 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -1382,6 +1382,11 @@ extern void protected_set_expr_location (tree, location_t);
   OMP_CLAUSE_OPERAND (OMP_CLAUSE_RANGE_CHECK (OMP_CLAUSE_CHECK (NODE),	\
 					      OMP_CLAUSE_PRIVATE,	\
 					      OMP_CLAUSE__LOOPTEMP_), 0)
+#define OMP_CLAUSE_ORIG_DECL(NODE)					\
+  OMP_CLAUSE_OPERAND (OMP_CLAUSE_RANGE_CHECK (OMP_CLAUSE_CHECK (NODE),	\
+					      OMP_CLAUSE_PRIVATE,	\
+					      OMP_CLAUSE__LOOPTEMP_), 2)
+
 #define OMP_CLAUSE_HAS_LOCATION(NODE) \
   (LOCATION_LOCUS ((OMP_CLAUSE_CHECK (NODE))->omp_clause.locus)		\
   != UNKNOWN_LOCATION)


* [gomp4] Use pass_ch instead of pass_ch_oacc_kernels (was: [PATCH, 8/16] Add pass_ch_oacc_kernels)
  2015-11-11 20:29   ` Tom de Vries
@ 2015-11-30 12:12     ` Thomas Schwinge
  0 siblings, 0 replies; 133+ messages in thread
From: Thomas Schwinge @ 2015-11-30 12:12 UTC (permalink / raw)
  To: Tom de Vries, gcc-patches; +Cc: Jakub Jelinek, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 9343 bytes --]

Hi!

On Wed, 11 Nov 2015 21:29:10 +0100, Tom de Vries <Tom_deVries@mentor.com> wrote:
> On 09/11/15 19:33, Tom de Vries wrote:
> > On 09/11/15 16:35, Tom de Vries wrote:
> > this patch adds a pass pass_ch_oacc_kernels, which is like pass_ch, but
> > only runs for loops with oacc_kernels_region set.
> >
> > [ But... thinking about it a bit more, I think that we could use a
> > regular pass_ch instead. We only use the kernels pass group for a single
> > loop nest in a kernels region, and we mark all the loops in the loop
> > nest with oacc_kernels_region. So I think that the oacc_kernels_region
> > test in pass_ch_oacc_kernels::process_loop_p evaluates to true. ]
> >
> > So, I'll try to confirm with retesting that we can drop this patch.
> >
> 
> That's confirmed. I can use pass_ch instead of pass_ch_oacc_kernels, so 
> I'm dropping this patch from the series.

Committed to gomp-4_0-branch in r231067:

commit 8249e606d83025092e3b0b227360f7e38fe591d4
Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Mon Nov 30 12:05:50 2015 +0000

    Use pass_ch instead of pass_ch_oacc_kernels
    
    	gcc/
    	* passes.def: Use pass_ch instead of pass_ch_oacc_kernels.
    	* tree-pass.h (make_pass_ch_oacc_kernels): Remove.
    	* tree-ssa-loop-ch.c: Revert to trunk r230907 version.
    	gcc/testsuite/
    	* gcc.dg/tree-ssa/copy-headers.c: Update for new pass_ch.
    	* gcc.dg/tree-ssa/foldconst-2.c: Likewise.
    	* gcc.dg/tree-ssa/loop-40.c: Likewise.
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@231067 138bc75d-0d04-0410-961f-82ee72b054a4
---
 gcc/ChangeLog.gomp                           |    6 +++
 gcc/passes.def                               |    2 +-
 gcc/testsuite/ChangeLog.gomp                 |    6 +++
 gcc/testsuite/gcc.dg/tree-ssa/copy-headers.c |    4 +-
 gcc/testsuite/gcc.dg/tree-ssa/foldconst-2.c  |    4 +-
 gcc/testsuite/gcc.dg/tree-ssa/loop-40.c      |    4 +-
 gcc/tree-pass.h                              |    1 -
 gcc/tree-ssa-loop-ch.c                       |   60 +++-----------------------
 8 files changed, 24 insertions(+), 63 deletions(-)

diff --git gcc/ChangeLog.gomp gcc/ChangeLog.gomp
index 54712ab..2c8f0c2 100644
--- gcc/ChangeLog.gomp
+++ gcc/ChangeLog.gomp
@@ -1,3 +1,9 @@
+2015-11-30  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* passes.def: Use pass_ch instead of pass_ch_oacc_kernels.
+	* tree-pass.h (make_pass_ch_oacc_kernels): Remove.
+	* tree-ssa-loop-ch.c: Revert to trunk r230907 version.
+
 2015-11-18  Nathan Sidwell  <nathan@codesourcery.com>
 
 	* config/nvptx/nvptx.c: Remove unneeded #includes. Backport
diff --git gcc/passes.def gcc/passes.def
index e44bfac..f4eb235 100644
--- gcc/passes.def
+++ gcc/passes.def
@@ -93,7 +93,7 @@ along with GCC; see the file COPYING3.  If not see
 	  NEXT_PASS (pass_oacc_kernels);
 	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
 	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
-	      NEXT_PASS (pass_ch_oacc_kernels);
+	      NEXT_PASS (pass_ch);
 	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
 	      NEXT_PASS (pass_tree_loop_init);
 	      NEXT_PASS (pass_lim);
diff --git gcc/testsuite/ChangeLog.gomp gcc/testsuite/ChangeLog.gomp
index dd3b1f5..59733bd 100644
--- gcc/testsuite/ChangeLog.gomp
+++ gcc/testsuite/ChangeLog.gomp
@@ -1,3 +1,9 @@
+2015-11-30  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* gcc.dg/tree-ssa/copy-headers.c: Update for new pass_ch.
+	* gcc.dg/tree-ssa/foldconst-2.c: Likewise.
+	* gcc.dg/tree-ssa/loop-40.c: Likewise.
+
 2015-11-19  Cesar Philippidis  <cesar@codesourcery.com>
 
 	* gfortran.dg/goacc/routine-6.f90: Ensure that the device clause is
diff --git gcc/testsuite/gcc.dg/tree-ssa/copy-headers.c gcc/testsuite/gcc.dg/tree-ssa/copy-headers.c
index 4241b40..a5a8212 100644
--- gcc/testsuite/gcc.dg/tree-ssa/copy-headers.c
+++ gcc/testsuite/gcc.dg/tree-ssa/copy-headers.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */ 
-/* { dg-options "-O2 -fdump-tree-ch-details" } */
+/* { dg-options "-O2 -fdump-tree-ch2-details" } */
 
 extern int foo (int);
 
@@ -12,4 +12,4 @@ void bla (void)
 }
 
 /* There should be a header duplicated.  */
-/* { dg-final { scan-tree-dump-times "Duplicating header" 1 "ch"} } */
+/* { dg-final { scan-tree-dump-times "Duplicating header" 1 "ch2"} } */
diff --git gcc/testsuite/gcc.dg/tree-ssa/foldconst-2.c gcc/testsuite/gcc.dg/tree-ssa/foldconst-2.c
index eb1e6de..e9a6f87 100644
--- gcc/testsuite/gcc.dg/tree-ssa/foldconst-2.c
+++ gcc/testsuite/gcc.dg/tree-ssa/foldconst-2.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-ch" } */
+/* { dg-options "-O2 -fdump-tree-ch2" } */
 typedef union tree_node *tree;
 enum tree_code
 {
@@ -56,4 +56,4 @@ emit_support_tinfos (void)
 }
 /* We should copy loop header to fundamentals[0] and then fold it way into
    known value.  */
-/* { dg-final { scan-tree-dump-not "fundamentals.0" "ch"} } */
+/* { dg-final { scan-tree-dump-not "fundamentals.0" "ch2"} } */
diff --git gcc/testsuite/gcc.dg/tree-ssa/loop-40.c gcc/testsuite/gcc.dg/tree-ssa/loop-40.c
index 8397396..36db565 100644
--- gcc/testsuite/gcc.dg/tree-ssa/loop-40.c
+++ gcc/testsuite/gcc.dg/tree-ssa/loop-40.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-ch-details" } */
+/* { dg-options "-O2 -fdump-tree-ch2-details" } */
 
 int mymax2(int *it, int *end)
 {
@@ -10,4 +10,4 @@ int mymax2(int *it, int *end)
   return max;
 }
 
-/* { dg-final { scan-tree-dump "Duplicating header" "ch" } } */
+/* { dg-final { scan-tree-dump "Duplicating header" "ch2" } } */
diff --git gcc/tree-pass.h gcc/tree-pass.h
index 8ac8e72..004db77 100644
--- gcc/tree-pass.h
+++ gcc/tree-pass.h
@@ -392,7 +392,6 @@ extern gimple_opt_pass *make_pass_iv_optimize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_loop_done (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ch_vect (gcc::context *ctxt);
-extern gimple_opt_pass *make_pass_ch_oacc_kernels (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ccp (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_split_paths (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_phi_only_cprop (gcc::context *ctxt);
diff --git gcc/tree-ssa-loop-ch.c gcc/tree-ssa-loop-ch.c
index 3773e94..6493fcc 100644
--- gcc/tree-ssa-loop-ch.c
+++ gcc/tree-ssa-loop-ch.c
@@ -33,7 +33,6 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-inline.h"
 #include "tree-ssa-scopedtables.h"
 #include "tree-ssa-threadedge.h"
-#include "omp-low.h"
 
 /* Duplicates headers of loops if they are small enough, so that the statements
    in the loop body are always executed when the loop is entered.  This
@@ -125,7 +124,7 @@ do_while_loop_p (struct loop *loop)
 
 namespace {
 
-/* Common superclass for header-copying phases.  */
+/* Common superclass for both header-copying phases.  */
 class ch_base : public gimple_opt_pass
 {
   protected:
@@ -160,16 +159,14 @@ public:
     : ch_base (pass_data_ch, ctxt)
   {}
 
-  pass_ch (pass_data data, gcc::context *ctxt)
-    : ch_base (data, ctxt)
-  {}
-
   /* opt_pass methods: */
   virtual bool gate (function *) { return flag_tree_ch != 0; }
   
   /* Initialize and finalize loop structures, copying headers inbetween.  */
   virtual unsigned int execute (function *);
 
+  opt_pass * clone () { return new pass_ch (m_ctxt); }
+
 protected:
   /* ch_base method: */
   virtual bool process_loop_p (struct loop *loop);
@@ -341,8 +338,6 @@ ch_base::copy_headers (function *fun)
   return changed ? TODO_cleanup_cfg : 0;
 }
 
-} // anon namespace
-
 /* Initialize the loop structures we need, and finalize after.  */
 
 unsigned int
@@ -408,6 +403,8 @@ pass_ch_vect::process_loop_p (struct loop *loop)
   return false;
 }
 
+} // anon namespace
+
 gimple_opt_pass *
 make_pass_ch_vect (gcc::context *ctxt)
 {
@@ -419,50 +416,3 @@ make_pass_ch (gcc::context *ctxt)
 {
   return new pass_ch (ctxt);
 }
-
-namespace {
-
-const pass_data pass_data_ch_oacc_kernels =
-{
-  GIMPLE_PASS, /* type */
-  "ch_oacc_kernels", /* name */
-  OPTGROUP_LOOP, /* optinfo_flags */
-  TV_TREE_CH, /* tv_id */
-  ( PROP_cfg | PROP_ssa ), /* properties_required */
-  0, /* properties_provided */
-  0, /* properties_destroyed */
-  0, /* todo_flags_start */
-  TODO_cleanup_cfg, /* todo_flags_finish */
-};
-
-class pass_ch_oacc_kernels : public pass_ch
-{
-public:
-  pass_ch_oacc_kernels (gcc::context *ctxt)
-    : pass_ch (pass_data_ch_oacc_kernels, ctxt)
-  {}
-
-  /* opt_pass methods: */
-  virtual bool gate (function *) { return true; }
-
-protected:
-  /* ch_base method: */
-  virtual bool process_loop_p (struct loop *loop);
-}; // class pass_ch_oacc_kernels
-
-} // anon namespace
-
-bool
-pass_ch_oacc_kernels::process_loop_p (struct loop *loop)
-{
-  if (!loop->in_oacc_kernels_region)
-    return false;
-
-  return pass_ch::process_loop_p (loop);
-}
-
-gimple_opt_pass *
-make_pass_ch_oacc_kernels (gcc::context *ctxt)
-{
-  return new pass_ch_oacc_kernels (ctxt);
-}


Regards
 Thomas


* [gomp4] Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
  2015-11-25 10:44                       ` Richard Biener
@ 2015-11-30 17:48                         ` Thomas Schwinge
  0 siblings, 0 replies; 133+ messages in thread
From: Thomas Schwinge @ 2015-11-30 17:48 UTC (permalink / raw)
  To: gcc-patches, Tom de Vries; +Cc: Richard Biener, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 5606 bytes --]

Hi!

On Wed, 25 Nov 2015 11:43:14 +0100 (CET), Richard Biener <rguenther@suse.de> wrote:
> On Tue, 24 Nov 2015, Tom de Vries wrote:
> > > [...]
> > 
> > Reposting using the in_loop_pipeline style in pass_lim.
> 
> Ok.

I merged trunk r230907 into gomp-4_0-branch in a very simplistic way,
basically just moving pass_fre in between pass_oacc_kernels and the (new)
pass_oacc_kernels2 pass groups.  We'll want to clean this up later (on
gomp-4_0-branch), once we're more clear on what difference will remain
between the trunk and gomp-4_0-branch pass structures (if any); for now
this makes sure we don't regress OpenACC kernels functionality on
gomp-4_0-branch.  In gomp-4_0-branch r231078, I effectively applied the
following:

commit ffae8a36e195172327a233bd397a4230a7939681
Merge: 8249e60 e1e1688
Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Mon Nov 30 17:28:07 2015 +0000

    svn merge -r 230906:230907 svn+ssh://gcc.gnu.org/svn/gcc/trunk
    
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@231078 138bc75d-0d04-0410-961f-82ee72b054a4

 gcc/ChangeLog           |  6 ++++
 gcc/passes.def          | 13 +++++++--
 gcc/testsuite/ChangeLog | 76 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 92 insertions(+), 3 deletions(-)

[diff --git gcc/ChangeLog gcc/ChangeLog]
diff --git gcc/passes.def gcc/passes.def
index f4eb235..9fe4fec 100644
--- gcc/passes.def
+++ gcc/passes.def
@@ -84,36 +84,43 @@ along with GCC; see the file COPYING3.  If not see
 	  /* After CCP we rewrite no longer addressed locals into SSA
 	     form if possible.  */
 	  NEXT_PASS (pass_forwprop);
 	  NEXT_PASS (pass_sra_early);
 	  /* pass_build_ealias is a dummy pass that ensures that we
 	     execute TODO_rebuild_alias at this point.  */
 	  NEXT_PASS (pass_build_ealias);
-	  /* Pass group that runs when there are oacc kernels in the
-	     function.  */
+	  /* Pass group that runs when the function is an offloaded function
+	     containing oacc kernels loops.  Part 1.  */
 	  NEXT_PASS (pass_oacc_kernels);
 	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
 	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
 	      NEXT_PASS (pass_ch);
 	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+	  POP_INSERT_PASSES ()
+	  NEXT_PASS (pass_fre);
+	  /* Pass group that runs when the function is an offloaded function
+	     containing oacc kernels loops.  Part 2.  */
+	  NEXT_PASS (pass_oacc_kernels2);
+	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
+	      /* We use pass_lim to rewrite in-memory iteration and reduction
+		 variable accesses in loops into local variables accesses.  */
 	      NEXT_PASS (pass_tree_loop_init);
 	      NEXT_PASS (pass_lim);
 	      NEXT_PASS (pass_copy_prop);
 	      NEXT_PASS (pass_lim);
 	      NEXT_PASS (pass_copy_prop);
 	      NEXT_PASS (pass_scev_cprop);
 	      NEXT_PASS (pass_tree_loop_done);
 	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
 	      NEXT_PASS (pass_dce);
 	      NEXT_PASS (pass_tree_loop_init);
       	      NEXT_PASS (pass_parallelize_loops_oacc_kernels);
 	      NEXT_PASS (pass_expand_omp_ssa);
 	      NEXT_PASS (pass_tree_loop_done);
 	  POP_INSERT_PASSES ()
-	  NEXT_PASS (pass_fre);
 	  NEXT_PASS (pass_merge_phi);
           NEXT_PASS (pass_dse);
 	  NEXT_PASS (pass_cd_dce);
 	  NEXT_PASS (pass_early_ipa_sra);
 	  NEXT_PASS (pass_tail_recursion);
 	  NEXT_PASS (pass_convert_switch);
 	  NEXT_PASS (pass_cleanup_eh);
[diff --git gcc/testsuite/ChangeLog gcc/testsuite/ChangeLog]

..., so the following difference from trunk to gomp-4_0-branch remains to
be resolved/reduced (plus the corresponding testsuite tree dump scanning
changes):

--- gcc/passes.def
+++ gcc/passes.def
@@ -89,25 +89,36 @@ along with GCC; see the file COPYING3.  If not see
 	     execute TODO_rebuild_alias at this point.  */
 	  NEXT_PASS (pass_build_ealias);
 	  /* Pass group that runs when the function is an offloaded function
 	     containing oacc kernels loops.  Part 1.  */
 	  NEXT_PASS (pass_oacc_kernels);
 	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
+	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
 	      NEXT_PASS (pass_ch);
+	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
 	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_fre);
 	  /* Pass group that runs when the function is an offloaded function
 	     containing oacc kernels loops.  Part 2.  */
 	  NEXT_PASS (pass_oacc_kernels2);
 	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
 	      /* We use pass_lim to rewrite in-memory iteration and reduction
 		 variable accesses in loops into local variables accesses.  */
+	      NEXT_PASS (pass_tree_loop_init);
 	      NEXT_PASS (pass_lim);
+	      NEXT_PASS (pass_copy_prop);
+	      NEXT_PASS (pass_lim);
+	      NEXT_PASS (pass_copy_prop);
+	      NEXT_PASS (pass_scev_cprop);
+	      NEXT_PASS (pass_tree_loop_done);
 	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
 	      NEXT_PASS (pass_dce);
+	      NEXT_PASS (pass_tree_loop_init);
+      	      NEXT_PASS (pass_parallelize_loops_oacc_kernels);
 	      NEXT_PASS (pass_expand_omp_ssa);
+	      NEXT_PASS (pass_tree_loop_done);
 	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_merge_phi);
           NEXT_PASS (pass_dse);
 	  NEXT_PASS (pass_cd_dce);
 	  NEXT_PASS (pass_early_ipa_sra);
 	  NEXT_PASS (pass_tail_recursion);


Regards
 Thomas


* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-27 11:44                     ` Tom de Vries
  2015-11-27 12:14                       ` Tom de Vries
@ 2015-12-02  9:46                       ` Jakub Jelinek
  2015-12-02 13:11                         ` Tom de Vries
  1 sibling, 1 reply; 133+ messages in thread
From: Jakub Jelinek @ 2015-12-02  9:46 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Richard Biener, gcc-patches

On Fri, Nov 27, 2015 at 12:42:09PM +0100, Tom de Vries wrote:
> --- a/gcc/omp-low.c
> +++ b/gcc/omp-low.c
> @@ -1366,10 +1366,12 @@ build_sender_ref (tree var, omp_context *ctx)
>    return build_sender_ref ((splay_tree_key) var, ctx);
>  }
>  
> -/* Add a new field for VAR inside the structure CTX->SENDER_DECL.  */
> +/* Add a new field for VAR inside the structure CTX->SENDER_DECL.  If
> +   BASE_POINTERS_RESTRICT, declare the field with restrict.  */
>  
>  static void
> -install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
> +install_var_field_1 (tree var, bool by_ref, int mask, omp_context *ctx,
> +		     bool base_pointers_restrict)

Ugh, why the renaming?  Just use default argument:
		bool base_pointers_restrict = false

> +/* As install_var_field_1, but with base_pointers_restrict == false.  */
> +
> +static void
> +install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
> +{
> +  install_var_field_1 (var, by_ref, mask, ctx, false);
> +}

And avoid the wrapper.
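
Roughly like this (a minimal sketch, not the actual omp-low.c code:
`omp_context` is reduced to a hypothetical stand-in with two counters, and
`var` is an int rather than a tree, purely so the snippet compiles on its
own):

```cpp
// Hypothetical stand-in for GCC's omp_context; only the call shape matters.
struct omp_context
{
  int restrict_fields = 0;
  int plain_fields = 0;
};

// One function with a defaulted trailing parameter replaces the
// install_var_field / install_var_field_1 pair.
static void
install_var_field (int var, bool by_ref, int mask, omp_context *ctx,
                   bool base_pointers_restrict = false)
{
  (void) var;
  (void) by_ref;
  (void) mask;
  // Placeholder body: just record which flavour of field was requested.
  if (base_pointers_restrict)
    ctx->restrict_fields++;
  else
    ctx->plain_fields++;
}
```

Existing callers keep passing four arguments and get the old behaviour;
only the new caller passes `true` as the fifth.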

>  /* Instantiate decls as necessary in CTX to satisfy the data sharing
> -   specified by CLAUSES.  */
> +   specified by CLAUSES.  If BASE_POINTERS_RESTRICT, install var field with
> +   restrict.  */
>  
>  static void
> -scan_sharing_clauses (tree clauses, omp_context *ctx)
> +scan_sharing_clauses_1 (tree clauses, omp_context *ctx,
> +			bool base_pointers_restrict)

Likewise.

Otherwise LGTM, but I'm worried that this may be related to
PR68640 and could make things worse.

	Jakub


* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-27 12:14                       ` Tom de Vries
@ 2015-12-02  9:59                         ` Jakub Jelinek
  2016-03-14 13:16                           ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Jakub Jelinek @ 2015-12-02  9:59 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Richard Biener, gcc-patches

On Fri, Nov 27, 2015 at 01:03:52PM +0100, Tom de Vries wrote:
> Handle non-declared variables in kernels alias analysis
> 
> 2015-11-27  Tom de Vries  <tom@codesourcery.com>
> 
> 	* gimplify.c (gimplify_scan_omp_clauses): Initialize
> 	OMP_CLAUSE_ORIG_DECL.
> 	* omp-low.c (install_var_field_1): Handle base_pointers_restrict for
> 	pointers.
> 	(map_ptr_clause_points_to_clause_p)
> 	(nr_map_ptr_clauses_pointing_to_clause): New function.
> 	(omp_target_base_pointers_restrict_p): Handle GOMP_MAP_POINTER.
> 	* tree-pretty-print.c (dump_omp_clause): Print OMP_CLAUSE_ORIG_DECL.
> 	* tree.c (omp_clause_num_ops): Set num_ops for OMP_CLAUSE_MAP to 3.
> 	* tree.h (OMP_CLAUSE_ORIG_DECL): New macro.
> 
> 	* c-c++-common/goacc/kernels-alias-10.c: New test.
> 	* c-c++-common/goacc/kernels-alias-9.c: New test.

I don't like this (mainly the addition of OMP_CLAUSE_ORIG_DECL),
but it also sounds wrong to me.
The primary question is how do you handle GOMP_MAP_POINTER
(which is something we don't use for C/C++ OpenMP anymore,
and Fortran OpenMP will stop using it in GCC 7 or 6.2?) on the OpenACC
libgomp side, does it work like GOMP_MAP_ALLOC or GOMP_MAP_FORCE_ALLOC?
Similarly GOMP_MAP_TO_PSET.  If it works like GOMP_MAP_ALLOC (it does
on the OpenMP side in target.c, so if something is already mapped, no
further pointer assignment happens), then your change looks wrong.
If it works like GOMP_MAP_FORCE_ALLOC, then you should just treat
GOMP_MAP_POINTER on all OpenACC constructs as an opcode that allows the
restrict operation.  If it should behave differently depending on
whether the corresponding array section has been mapped with
GOMP_MAP_FORCE_* or without it, then presumably you should use a
different code for those two.
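
To make the distinction concrete, here is a toy model (not libgomp code:
`DeviceTable`, `map_present_or_alloc` and `map_force_alloc` are
hypothetical names, and it is an assumption of the sketch that a forced
mapping always yields a fresh device allocation):

```cpp
#include <cstddef>
#include <map>

// Toy model only; none of these names exist in libgomp.
struct DeviceTable
{
  std::map<const void *, std::size_t> mapped; // host address -> device slot
  std::size_t next_slot = 0;

  // "Present-or" behaviour: reuse an existing mapping if there is one.
  std::size_t map_present_or_alloc (const void *host)
  {
    auto it = mapped.find (host);
    if (it != mapped.end ())
      return it->second;
    return mapped[host] = next_slot++;
  }

  // Assumed "force" behaviour: always hand out a fresh device slot.
  std::size_t map_force_alloc (const void *host)
  {
    return mapped[host] = next_slot++;
  }
};
```

Under this model, two map clauses that name the same host object get the
same device slot with the present-or variant, so restrict on the
corresponding base pointers would not be justified; with the assumed force
variant they cannot share a slot.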

	Jakub


* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-12-02  9:46                       ` Jakub Jelinek
@ 2015-12-02 13:11                         ` Tom de Vries
  0 siblings, 0 replies; 133+ messages in thread
From: Tom de Vries @ 2015-12-02 13:11 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Richard Biener, gcc-patches

On 02/12/15 10:45, Jakub Jelinek wrote:
> On Fri, Nov 27, 2015 at 12:42:09PM +0100, Tom de Vries wrote:
>> --- a/gcc/omp-low.c
>> +++ b/gcc/omp-low.c
>> @@ -1366,10 +1366,12 @@ build_sender_ref (tree var, omp_context *ctx)
>>     return build_sender_ref ((splay_tree_key) var, ctx);
>>   }
>>
>> -/* Add a new field for VAR inside the structure CTX->SENDER_DECL.  */
>> +/* Add a new field for VAR inside the structure CTX->SENDER_DECL.  If
>> +   BASE_POINTERS_RESTRICT, declare the field with restrict.  */
>>
>>   static void
>> -install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
>> +install_var_field_1 (tree var, bool by_ref, int mask, omp_context *ctx,
>> +		     bool base_pointers_restrict)
>
> Ugh, why the renaming?  Just use default argument:
> 		bool base_pointers_restrict = false
>
>> +/* As install_var_field_1, but with base_pointers_restrict == false.  */
>> +
>> +static void
>> +install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
>> +{
>> +  install_var_field_1 (var, by_ref, mask, ctx, false);
>> +}
>
> And avoid the wrapper.
>
>>   /* Instantiate decls as necessary in CTX to satisfy the data sharing
>> -   specified by CLAUSES.  */
>> +   specified by CLAUSES.  If BASE_POINTERS_RESTRICT, install var field with
>> +   restrict.  */
>>
>>   static void
>> -scan_sharing_clauses (tree clauses, omp_context *ctx)
>> +scan_sharing_clauses_1 (tree clauses, omp_context *ctx,
>> +			bool base_pointers_restrict)
>
> Likewise.
>
> Otherwise LGTM,

Hi Jakub,

thanks for the review.

> but I'm worried that this may be related to
> PR68640 and could make things worse.
>

AFAIU, they're sort of opposite cases:
- in the case of the PR, we add restrict in a function argument
   by accident
- in the case of this patch, we add restrict in a function argument
   by analysis
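
As an illustration of the second case (a hypothetical sketch, not code
generated by the patch; `omp_data_s` and `outlined_kernel` are made-up
names): once analysis shows that the mapped objects are distinct, the
fields that carry the base pointers into the outlined function can be
declared restrict:

```cpp
// Hypothetical shape of the data block passed to an outlined kernels region.
// __restrict is the GCC/Clang spelling of C99 restrict in C++.
struct omp_data_s
{
  unsigned int *__restrict a;
  unsigned int *__restrict b;
};

// Stand-in for the outlined function: with restrict-qualified fields the
// compiler may assume the store through a does not change b[0].
static unsigned int
outlined_kernel (omp_data_s *d)
{
  d->b[0] = 1;
  d->a[0] = 2;
  return d->b[0];
}
```

The qualifier is only valid because the caller really does pass distinct
objects; if `a` and `b` pointed to the same array, the behaviour would be
undefined, which is why the patch adds it only after analysis.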

[ Btw, now that this patch (which exploits GOMP_MAP_FORCE_* mappings)
   is OK-ed, the patch "Fix oacc kernels default mapping for scalars" at
   https://gcc.gnu.org/ml/gcc-patches/2015-11/msg03334.html becomes more
   relevant, since that one ensures that scalars by default
   get the GOMP_MAP_FORCE_COPY mapping (rather than the incorrect
   GOMP_MAP_COPY) ]

Thanks,
- Tom


* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-11 11:01     ` Jakub Jelinek
  2015-11-12 16:04       ` Tom de Vries
@ 2015-12-03 11:53       ` Tom de Vries
  1 sibling, 0 replies; 133+ messages in thread
From: Tom de Vries @ 2015-12-03 11:53 UTC (permalink / raw)
  To: Jakub Jelinek, Richard Biener; +Cc: gcc-patches

On 11/11/15 12:00, Jakub Jelinek wrote:
> On Wed, Nov 11, 2015 at 11:51:02AM +0100, Richard Biener wrote:
>>> The option -foffload-alias=pointer instructs the compiler to assume that
>>> objects referenced in an offload region do not alias.
>>>
>>> The option -foffload-alias=all instructs the compiler to make no
>>> assumptions about aliasing in offload regions.
>>>
>>> The default value is -foffload-alias=none.
>>
>> I think global options for this is nonsense.  Please follow what
>> we do for #pragma GCC ivdep for example, thus allow the alias
>> behavior to be specified per "region" (whatever makes sense here
>> in the context of offloading).
>
> Yeah, completely agreed.  I don't see why the offloaded region would be in
> any way special, they are C/C++/Fortran code as any other.
> What we can and should improve is teach IPA aliasing/points to analysis
> about the way we lower the host vs. offloading region boundary, so that
> if alias analysis on the caller of GOMP_target_ext/GOACC_parallel_keyed
> determines something it can be used on the offloaded function side and vice
> versa, but a switch like the above is just wrong.

Filed the GOMP_target_ext bit as PR 68675 - Handle GOMP_target_ext 
optimally in ipa-pta.

Thanks,
- Tom


* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-11-13 11:39               ` Jakub Jelinek
  2015-11-21 12:24                 ` Tom de Vries
@ 2015-12-11 12:45                 ` Tom de Vries
  2015-12-11 13:00                   ` Richard Biener
  1 sibling, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-12-11 12:45 UTC (permalink / raw)
  To: Jakub Jelinek, Richard Biener; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1612 bytes --]

On 13/11/15 12:39, Jakub Jelinek wrote:
> We simply have some compiler internal interface between the caller and
> callee of the outlined regions, each interface in between those has
> its own structure type used to communicate the info;
> we can attach attributes on the fields, or some flags to indicate some
> properties interesting from aliasing POV.  We don't really need to perform
> full IPA-PTA, perhaps it would be enough to a) record somewhere in cgraph
> the relationship in between such callers and callees (for offloading regions
> we already have "omp target entrypoint" attribute on the callee and a
> single caller), tell LTO if possible not to split those into different
> partitions if easily possible, and then just for these pairs perform
> aliasing/points-to analysis in the caller and the result record using
> cliques/special attributes/whatever to the callee side, so that the callee
> (outlined OpenMP/OpenACC/Cilk+ region) can then improve its alias analysis.

Hi,

This work-in-progress patch allows me to use IPA PTA information in the 
kernels pass group.

Since:
- I'm running IPA PTA before ealias, and IPA PTA does not interpret
  restrict, and
- compute_may_aliases returns early if IPA PTA information is present,
I needed to convince ealias to do the restrict clique/base annotation.
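
The resulting control flow in compute_may_aliases can be modelled roughly
as follows (a toy sketch: the flag names mirror the patch, but
`gimple_df_model`, `alias_result` and `compute_may_aliases_model` are
hypothetical, and the bodies are placeholders rather than GCC code):

```cpp
// Toy model of the per-function state consulted by compute_may_aliases.
struct gimple_df_model
{
  bool ipa_pta = false;                     // IPA PTA info is present
  bool clique_base_annotation_done = false; // flag added by the patch
};

// Records which steps a single invocation performed.
struct alias_result
{
  bool solved_constraints = false; // local points-to solve ran
  bool set_points_to_info = false; // local results were written back
  bool annotated_cliques = false;  // restrict clique/base annotation ran
};

static alias_result
compute_may_aliases_model (gimple_df_model &df)
{
  alias_result r;
  bool set_points_to_info = true;

  if (df.ipa_pta)
    {
      // Annotation already done: nothing left to do.
      if (df.clique_base_annotation_done)
        return r;
      // Keep the IPA PTA results; solve only to feed the annotation.
      set_points_to_info = false;
    }

  r.solved_constraints = true;
  r.set_points_to_info = set_points_to_info;
  r.annotated_cliques = true;
  df.clique_base_annotation_done = true;
  return r;
}
```

So with IPA PTA information present, the first ealias run still solves the
constraints and annotates cliques but leaves the points-to sets alone, and
later runs return immediately.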

It would be more logical to fit IPA PTA after ealias, but one is an IPA 
pass, the other a regular one-function pass, so I would have to split 
the containing pass groups pass_all_early_optimizations and 
pass_local_optimization_passes. I'll give that a try now.

Any comments?

Thanks,
- Tom

[-- Attachment #2: 0008-Run-pass_ipa_pta-before-pass_local_optimization_passes.patch --]
[-- Type: text/x-patch, Size: 5025 bytes --]

Run pass_ipa_pta before pass_local_optimization_passes

---
 gcc/gimple-ssa.h           |  2 ++
 gcc/passes.def             |  1 +
 gcc/tree-pass.h            |  1 +
 gcc/tree-ssa-structalias.c | 60 +++++++++++++++++++++++++++++++++++++++++++---
 4 files changed, 61 insertions(+), 3 deletions(-)

diff --git a/gcc/gimple-ssa.h b/gcc/gimple-ssa.h
index 39551da..aff2fb7 100644
--- a/gcc/gimple-ssa.h
+++ b/gcc/gimple-ssa.h
@@ -83,6 +83,8 @@ struct GTY(()) gimple_df {
   /* The PTA solution for the ESCAPED artificial variable.  */
   struct pt_solution escaped;
 
+  bool clique_base_annotation_done;
+
   /* A map of decls to artificial ssa-names that point to the partition
      of the decl.  */
   hash_map<tree, tree> * GTY((skip(""))) decls_to_pointers;
diff --git a/gcc/passes.def b/gcc/passes.def
index 678a900..5293be0 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -68,6 +68,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_rebuild_cgraph_edges);
   POP_INSERT_PASSES ()
 
+  NEXT_PASS (pass_ipa_pta_oacc_kernels);
   NEXT_PASS (pass_local_optimization_passes);
   PUSH_INSERT_PASSES_WITHIN (pass_local_optimization_passes)
       NEXT_PASS (pass_fixup_cfg);
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 4566d33..980922e 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -497,6 +497,7 @@ extern ipa_opt_pass_d *make_pass_ipa_devirt (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_reference (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_pure_const (gcc::context *ctxt);
 extern simple_ipa_opt_pass *make_pass_ipa_pta (gcc::context *ctxt);
+extern simple_ipa_opt_pass *make_pass_ipa_pta_oacc_kernels (gcc::context *ctxt);
 extern simple_ipa_opt_pass *make_pass_ipa_tm (gcc::context *ctxt);
 extern simple_ipa_opt_pass *make_pass_target_clone (gcc::context *ctxt);
 extern simple_ipa_opt_pass *make_pass_dispatcher_calls (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-structalias.c b/gcc/tree-ssa-structalias.c
index 7420ce1..dfc0422 100644
--- a/gcc/tree-ssa-structalias.c
+++ b/gcc/tree-ssa-structalias.c
@@ -6939,7 +6939,7 @@ solve_constraints (void)
    at the start of the file for an algorithmic overview.  */
 
 static void
-compute_points_to_sets (void)
+compute_points_to_sets (bool set_points_to_info)
 {
   basic_block bb;
   unsigned i;
@@ -6981,6 +6981,9 @@ compute_points_to_sets (void)
   /* From the constraints compute the points-to sets.  */
   solve_constraints ();
 
+  if (!set_points_to_info)
+    goto done;
+
   /* Compute the points-to set for ESCAPED used for call-clobber analysis.  */
   cfun->gimple_df->escaped = find_what_var_points_to (cfun->decl,
 						      get_varinfo (escaped_id));
@@ -7057,6 +7060,7 @@ compute_points_to_sets (void)
 	}
     }
 
+ done:
   timevar_pop (TV_TREE_PTA);
 }
 
@@ -7289,6 +7293,8 @@ compute_dependence_clique (void)
 unsigned int
 compute_may_aliases (void)
 {
+  bool set_points_to_info = true;
+
   if (cfun->gimple_df->ipa_pta)
     {
       if (dump_file)
@@ -7300,13 +7306,16 @@ compute_may_aliases (void)
 	  dump_alias_info (dump_file);
 	}
 
-      return 0;
+      if (cfun->gimple_df->clique_base_annotation_done)
+	return 0;
+
+      set_points_to_info = false;
     }
 
   /* For each pointer P_i, determine the sets of variables that P_i may
      point-to.  Compute the reachability set of escaped and call-used
      variables.  */
-  compute_points_to_sets ();
+  compute_points_to_sets (set_points_to_info);
 
   /* Debugging dumps.  */
   if (dump_file)
@@ -7314,6 +7323,7 @@ compute_may_aliases (void)
 
   /* Compute restrict-based memory disambiguations.  */
   compute_dependence_clique ();
+  cfun->gimple_df->clique_base_annotation_done = true;
 
   /* Deallocate memory used by aliasing data structures and the internal
      points-to solution.  */
@@ -7816,3 +7826,47 @@ make_pass_ipa_pta (gcc::context *ctxt)
 {
   return new pass_ipa_pta (ctxt);
 }
+
+namespace {
+
+const pass_data pass_data_ipa_pta_oacc_kernels =
+{
+  SIMPLE_IPA_PASS, /* type */
+  "pta_oacc_kernels", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_IPA_PTA, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_ipa_pta_oacc_kernels : public simple_ipa_opt_pass
+{
+public:
+  pass_ipa_pta_oacc_kernels (gcc::context *ctxt)
+    : simple_ipa_opt_pass (pass_data_ipa_pta_oacc_kernels, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+    {
+      return (optimize
+	      && flag_openacc
+	      && flag_tree_parallelize_loops > 1
+	      /* Don't bother doing anything if the program has errors.  */
+	      && !seen_error ());
+    }
+
+  virtual unsigned int execute (function *) { return ipa_pta_execute (); }
+
+}; // class pass_ipa_pta_oacc_kernels
+
+} // anon namespace
+
+simple_ipa_opt_pass *
+make_pass_ipa_pta_oacc_kernels (gcc::context *ctxt)
+{
+  return new pass_ipa_pta_oacc_kernels (ctxt);
+}
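
The compute_may_aliases hunk in the patch above can be read as a small state machine. The following is a minimal, self-contained C++ sketch of that control flow only. ToyAliasState, toy_compute_points_to_sets and toy_compute_may_aliases are hypothetical names made up for illustration, and the assumption that the solver still runs but skips storing its results when set_points_to_info is false is inferred from the parameter name and the new "done" label, not confirmed by the hunks shown.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-function state the patch touches.
struct ToyAliasState {
  bool ipa_pta = false;                      // IPA PTA results are present
  bool clique_base_annotation_done = false;  // restrict annotation already ran
  std::vector<std::string> log;              // records which steps ran
};

// Assumption: the solver always runs, but only stores its points-to
// results when set_points_to_info is true.
void toy_compute_points_to_sets(ToyAliasState &s, bool set_points_to_info) {
  s.log.push_back(set_points_to_info ? "solve+store" : "solve-only");
}

// Mirrors the order of checks in the patched compute_may_aliases.
void toy_compute_may_aliases(ToyAliasState &s) {
  bool set_points_to_info = true;
  if (s.ipa_pta) {
    // With IPA PTA info present, only the restrict annotation is missing,
    // and it is needed at most once.
    if (s.clique_base_annotation_done)
      return;
    set_points_to_info = false;
  }
  toy_compute_points_to_sets(s, set_points_to_info);
  s.log.push_back("dependence-clique");
  s.clique_base_annotation_done = true;
}
```

In this model a function without IPA PTA information goes through the full local analysis on every call, while a function with IPA PTA information gets one solver run for the restrict annotation and nothing on later calls.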

* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-12-11 12:45                 ` Tom de Vries
@ 2015-12-11 13:00                   ` Richard Biener
  2015-12-13 16:38                     ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-12-11 13:00 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Jakub Jelinek, gcc-patches

On Fri, 11 Dec 2015, Tom de Vries wrote:

> On 13/11/15 12:39, Jakub Jelinek wrote:
> > We simply have some compiler internal interface between the caller and
> > callee of the outlined regions, each interface in between those has
> > its own structure type used to communicate the info;
> > we can attach attributes on the fields, or some flags to indicate some
> > properties interesting from aliasing POV.  We don't really need to perform
> > full IPA-PTA, perhaps it would be enough to a) record somewhere in cgraph
> > the relationship in between such callers and callees (for offloading regions
> > we already have "omp target entrypoint" attribute on the callee and a
> single caller), tell LTO if possible not to split those into different
> > partitions if easily possible, and then just for these pairs perform
> > aliasing/points-to analysis in the caller and the result record using
> > cliques/special attributes/whatever to the callee side, so that the callee
> > (outlined OpenMP/OpenACC/Cilk+ region) can then improve its alias analysis.
> 
> Hi,
> 
> This work-in-progress patch allows me to use IPA PTA information in the
> kernels pass group.
> 
> Since:
> -  I'm running IPA PTA before ealias, and IPA PTA does not interpret
>    restrict, and
> - compute_may_aliases doesn't run if IPA PTA information is present
> I needed to convince ealias to do the restrict clique/base annotation.
> 
> It would be more logical to fit IPA PTA after ealias, but one is an IPA pass,
> the other a regular one-function pass, so I would have to split the containing
> pass groups pass_all_early_optimizations and pass_local_optimization_passes.
> I'll give that a try now.
> 
> Any comments?

I don't think you want to run IPA PTA before early
optimizations, it (and ealias) rely on some initial cleanup to
do anything meaningful with well-spent resources.

The local PTA "hack" also looks more like a waste of resources, but well 
... teaching IPA PTA to honor restrict might be an impossible task
though I didn't think much about it other than handling it only for
nonlocal_p functions (for others we should see all incoming args
if IPA PTA works optimally).  The restrict tags will leak all over
the place of course and in the end no meaningful cliques may remain.

Richard.

* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-12-11 13:00                   ` Richard Biener
@ 2015-12-13 16:38                     ` Tom de Vries
  2015-12-14 13:26                       ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-12-13 16:38 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jakub Jelinek, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2860 bytes --]

On 11/12/15 14:00, Richard Biener wrote:
> On Fri, 11 Dec 2015, Tom de Vries wrote:
>
>> On 13/11/15 12:39, Jakub Jelinek wrote:
>>> We simply have some compiler internal interface between the caller and
>>> callee of the outlined regions, each interface in between those has
>>> its own structure type used to communicate the info;
>>> we can attach attributes on the fields, or some flags to indicate some
>>> properties interesting from aliasing POV.  We don't really need to perform
>>> full IPA-PTA, perhaps it would be enough to a) record somewhere in cgraph
>>> the relationship in between such callers and callees (for offloading regions
>>> we already have "omp target entrypoint" attribute on the callee and a
>>> single caller), tell LTO if possible not to split those into different
>>> partitions if easily possible, and then just for these pairs perform
>>> aliasing/points-to analysis in the caller and the result record using
>>> cliques/special attributes/whatever to the callee side, so that the callee
>>> (outlined OpenMP/OpenACC/Cilk+ region) can then improve its alias analysis.
>>
>> Hi,
>>
>> This work-in-progress patch allows me to use IPA PTA information in the
>> kernels pass group.
>>
>> Since:
>> -  I'm running IPA PTA before ealias, and IPA PTA does not interpret
>>     restrict, and
>> - compute_may_aliases doesn't run if IPA PTA information is present
>> I needed to convince ealias to do the restrict clique/base annotation.
>>
>> It would be more logical to fit IPA PTA after ealias, but one is an IPA pass,
>> the other a regular one-function pass, so I would have to split the containing
>> pass groups pass_all_early_optimizations and pass_local_optimization_passes.
>> I'll give that a try now.
>>

I've tried this approach, but realized that this changes the order in 
which non-openacc functions are processed in the compiler, so I've 
abandoned this idea.

>> Any comments?
>
> I don't think you want to run IPA PTA before early
> optimizations, it (and ealias) rely on some initial cleanup to
> do anything meaningful with well-spent resources.
>
> The local PTA "hack" also looks more like a waste of resources, but well
> ... teaching IPA PTA to honor restrict might be an impossible task
> though I didn't think much about it other than handling it only for
> nonlocal_p functions (for others we should see all incoming args
> if IPA PTA works optimally).  The restrict tags will leak all over
> the place of course and in the end no meaningful cliques may remain.
>

This patch:
- moves the kernels pass group to the first position in the pass list
   after ealias where we're back in ipa mode
- inserts a new ipa pass, called pass_oacc_ipa, to contain the gimple
   pass group
- inserts a version of ipa-pta before the pass group.
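
As a rough illustration of the layout described in the list above (an IPA pass wrapping a per-function pass group, with an IPA points-to pass in front of it), here is a minimal self-contained C++ sketch. ToyPass, run_on_function, run_ipa and make_pipeline are hypothetical names invented for this sketch. They are not GCC's pass manager, the sub-pass list is shortened, and the fixed boolean gate is a simplification of the real per-function gates.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for a pass; hypothetical, not GCC's opt_pass.
struct ToyPass {
  std::string name;
  bool is_ipa;               // true: runs once for the whole unit
  bool gate;                 // result of the pass's gate function
  std::vector<ToyPass> sub;  // sub-passes, as in PUSH_INSERT_PASSES_WITHIN
};

using Log = std::vector<std::string>;

// Run function-level passes (and their sub-passes) on one function.
void run_on_function(const std::vector<ToyPass> &passes,
                     const std::string &fn, Log &log) {
  for (const ToyPass &p : passes) {
    if (!p.gate) continue;  // a closed gate also skips the sub-passes
    log.push_back(p.name + "(" + fn + ")");
    run_on_function(p.sub, fn, log);
  }
}

// Run IPA-level passes; a function-level sub-pass runs once per function.
void run_ipa(const std::vector<ToyPass> &passes,
             const std::vector<std::string> &fns, Log &log) {
  for (const ToyPass &p : passes) {
    if (!p.gate) continue;
    if (p.is_ipa) {
      log.push_back(p.name + "(*)");
      run_ipa(p.sub, fns, log);
    } else {
      for (const std::string &fn : fns)
        run_on_function(std::vector<ToyPass>{p}, fn, log);
    }
  }
}

// Builds: pta_oacc_kernels, then oacc_ipa { oacc_kernels { ch, fre, lim } }.
std::vector<ToyPass> make_pipeline(bool oacc_enabled) {
  ToyPass kernels{"oacc_kernels", false, true, {}};
  for (const char *n : {"ch", "fre", "lim"})
    kernels.sub.push_back(ToyPass{n, false, true, {}});
  ToyPass pta{"pta_oacc_kernels", true, oacc_enabled, {}};
  ToyPass group{"oacc_ipa", true, oacc_enabled, {}};
  group.sub.push_back(kernels);
  std::vector<ToyPass> pipeline;
  pipeline.push_back(pta);
  pipeline.push_back(group);
  return pipeline;
}
```

With the gates open, run_ipa (make_pipeline (true), {"f", "g"}, log) records the two IPA passes once each, followed by the kernels group for f and then for g. With the gates closed nothing is recorded, so in this model functions outside the group are unaffected.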

Bootstrapped and reg-tested on x86_64.

OK for stage3 trunk?

Thanks,
- Tom


[-- Attachment #2: 0003-Add-pass_oacc_ipa.patch --]
[-- Type: text/x-patch, Size: 13777 bytes --]

Add pass_oacc_ipa

---
 gcc/passes.def                          | 37 ++++++++++++++-------------
 gcc/testsuite/g++.dg/ipa/devirt-37.C    | 10 ++++----
 gcc/testsuite/g++.dg/ipa/devirt-40.C    |  4 +--
 gcc/testsuite/g++.dg/tree-ssa/pr61034.C | 10 ++++----
 gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c   |  4 +--
 gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c    |  4 +--
 gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c    |  4 +--
 gcc/tree-pass.h                         |  3 ++-
 gcc/tree-ssa-loop.c                     | 40 ++++++++++++++----------------
 gcc/tree-ssa-structalias.c              | 44 +++++++++++++++++++++++++++++++++
 10 files changed, 102 insertions(+), 58 deletions(-)

diff --git a/gcc/passes.def b/gcc/passes.def
index 43ce3d5..579dd63 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -88,24 +88,7 @@ along with GCC; see the file COPYING3.  If not see
 	  /* pass_build_ealias is a dummy pass that ensures that we
 	     execute TODO_rebuild_alias at this point.  */
 	  NEXT_PASS (pass_build_ealias);
-	  /* Pass group that runs when the function is an offloaded function
-	     containing oacc kernels loops.  Part 1.  */
-	  NEXT_PASS (pass_oacc_kernels);
-	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
-	      NEXT_PASS (pass_ch);
-	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_fre);
-	  /* Pass group that runs when the function is an offloaded function
-	     containing oacc kernels loops.  Part 2.  */
-	  NEXT_PASS (pass_oacc_kernels2);
-	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
-	      /* We use pass_lim to rewrite in-memory iteration and reduction
-		 variable accesses in loops into local variables accesses.  */
-	      NEXT_PASS (pass_lim);
-	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
-	      NEXT_PASS (pass_dce);
-	      NEXT_PASS (pass_expand_omp_ssa);
-	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_merge_phi);
           NEXT_PASS (pass_dse);
 	  NEXT_PASS (pass_cd_dce);
@@ -124,6 +107,26 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_rebuild_cgraph_edges);
       NEXT_PASS (pass_inline_parameters);
   POP_INSERT_PASSES ()
+
+  NEXT_PASS (pass_ipa_pta_oacc_kernels);
+  NEXT_PASS (pass_oacc_ipa);
+  PUSH_INSERT_PASSES_WITHIN (pass_oacc_ipa)
+      /* Pass group that runs when the function is an offloaded function
+         containing oacc kernels loops.  */
+      NEXT_PASS (pass_oacc_kernels);
+      PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
+          NEXT_PASS (pass_ch);
+          NEXT_PASS (pass_fre);
+          /* We use pass_lim to rewrite in-memory iteration and reduction
+	     variable accesses in loops into local variables accesses.  */
+          NEXT_PASS (pass_lim);
+          NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+          NEXT_PASS (pass_dce);
+          NEXT_PASS (pass_expand_omp_ssa);
+          NEXT_PASS (pass_rebuild_cgraph_edges);
+      POP_INSERT_PASSES ()
+  POP_INSERT_PASSES ()
+
   NEXT_PASS (pass_ipa_chkp_produce_thunks);
   NEXT_PASS (pass_ipa_auto_profile);
   NEXT_PASS (pass_ipa_free_inline_summary);
diff --git a/gcc/testsuite/g++.dg/ipa/devirt-37.C b/gcc/testsuite/g++.dg/ipa/devirt-37.C
index 9c5287e..b7f52a0 100644
--- a/gcc/testsuite/g++.dg/ipa/devirt-37.C
+++ b/gcc/testsuite/g++.dg/ipa/devirt-37.C
@@ -1,4 +1,4 @@
-/* { dg-options "-fpermissive -O2 -fno-indirect-inlining -fno-devirtualize-speculatively -fdump-tree-fre2-details -fno-early-inlining"  } */
+/* { dg-options "-fpermissive -O2 -fno-indirect-inlining -fno-devirtualize-speculatively -fdump-tree-fre3-details -fno-early-inlining"  } */
 #include <stdlib.h>
 struct A {virtual void test() {abort ();}};
 struct B:A
@@ -30,7 +30,7 @@ t()
 /* After inlining the call within constructor needs to be checked to not go into a basetype.
    We should see the vtbl store and we should notice extcall as possibly clobbering the
    type but ignore it because b is in static storage.  */
-/* { dg-final { scan-tree-dump "No dynamic type change found."  "fre2"  } } */
-/* { dg-final { scan-tree-dump "Checking vtbl store:"  "fre2"  } } */
-/* { dg-final { scan-tree-dump "Function call may change dynamic type:extcall"  "fre2"  } } */
-/* { dg-final { scan-tree-dump "converting indirect call to function virtual void"  "fre2"  } } */
+/* { dg-final { scan-tree-dump "No dynamic type change found."  "fre3"  } } */
+/* { dg-final { scan-tree-dump "Checking vtbl store:"  "fre3"  } } */
+/* { dg-final { scan-tree-dump "Function call may change dynamic type:extcall"  "fre3"  } } */
+/* { dg-final { scan-tree-dump "converting indirect call to function virtual void"  "fre3"  } } */
diff --git a/gcc/testsuite/g++.dg/ipa/devirt-40.C b/gcc/testsuite/g++.dg/ipa/devirt-40.C
index 279a228..5107c29 100644
--- a/gcc/testsuite/g++.dg/ipa/devirt-40.C
+++ b/gcc/testsuite/g++.dg/ipa/devirt-40.C
@@ -1,4 +1,4 @@
-/* { dg-options "-O2 -fdump-tree-fre2-details"  } */
+/* { dg-options "-O2 -fdump-tree-fre3-details"  } */
 typedef enum
 {
 } UErrorCode;
@@ -19,4 +19,4 @@ A::m_fn1 (UnicodeString &, int &p2, UErrorCode &) const
   UnicodeString a[2];
 }
 
-/* { dg-final { scan-tree-dump-not "\\n  OBJ_TYPE_REF" "fre2"  } } */
+/* { dg-final { scan-tree-dump-not "\\n  OBJ_TYPE_REF" "fre3"  } } */
diff --git a/gcc/testsuite/g++.dg/tree-ssa/pr61034.C b/gcc/testsuite/g++.dg/tree-ssa/pr61034.C
index cd4ee05..c06c580 100644
--- a/gcc/testsuite/g++.dg/tree-ssa/pr61034.C
+++ b/gcc/testsuite/g++.dg/tree-ssa/pr61034.C
@@ -1,5 +1,5 @@
 // { dg-do compile }
-// { dg-options "-O2 -fdump-tree-fre2 -fdump-tree-optimized" }
+// { dg-options "-O2 -fdump-tree-fre3 -fdump-tree-optimized" }
 
 #define assume(x) if(!(x))__builtin_unreachable()
 
@@ -42,13 +42,13 @@ bool f(I a, I b, I c, I d) {
 // a bunch of conditional free()s and unreachable()s.
 // This works only if everything is inlined into 'f'.
 
-// { dg-final { scan-tree-dump-times ";; Function" 1 "fre2" } }
-// { dg-final { scan-tree-dump-times "unreachable" 11 "fre2" } }
+// { dg-final { scan-tree-dump-times ";; Function" 1 "fre3" } }
+// { dg-final { scan-tree-dump-times "unreachable" 11 "fre3" } }
 
 // Note that depending on PUSH_ARGS_REVERSED we are presented with
 // a different initial CFG and thus the final outcome is different
 
-// { dg-final { scan-tree-dump-times "free" 10 "fre2" { target x86_64-*-* i?86-*-* } } }
+// { dg-final { scan-tree-dump-times "free" 10 "fre3" { target x86_64-*-* i?86-*-* } } }
 // { dg-final { scan-tree-dump-times "free" 3 "optimized" { target x86_64-*-* i?86-*-* } } }
-// { dg-final { scan-tree-dump-times "free" 14 "fre2" { target aarch64-*-* ia64-*-* arm-*-* hppa*-*-* sparc*-*-* powerpc*-*-* alpha*-*-* } } }
+// { dg-final { scan-tree-dump-times "free" 14 "fre3" { target aarch64-*-* ia64-*-* arm-*-* hppa*-*-* sparc*-*-* powerpc*-*-* alpha*-*-* } } }
 // { dg-final { scan-tree-dump-times "free" 4 "optimized" { target aarch64-*-* ia64-*-* arm-*-* hppa*-*-* sparc*-*-* powerpc*-*-* alpha*-*-* } } }
diff --git a/gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c b/gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c
index f558df3..71b31c4 100644
--- a/gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c
+++ b/gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c
@@ -1,5 +1,5 @@
 /* { dg-do link } */
-/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre2 -fno-ipa-icf" } */
+/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre3 -fno-ipa-icf" } */
 
 static int x, y;
 
@@ -54,7 +54,7 @@ int main()
   local_address_taken (&y);
   /* As we are computing flow- and context-insensitive we may not
      CSE the load of x here.  */
-  /* { dg-final { scan-tree-dump " = x;" "fre2" } } */
+  /* { dg-final { scan-tree-dump " = x;" "fre3" } } */
   return x;
 }
 
diff --git a/gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c b/gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c
index ff6fa57..8655794 100644
--- a/gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c
+++ b/gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre2-details" } */
+/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre3-details" } */
 
 static int __attribute__((noinline,noclone))
 foo (int *p, int *q)
@@ -23,4 +23,4 @@ int main()
 
 /* { dg-final { scan-ipa-dump "foo.arg0 = &a" "pta" } } */
 /* { dg-final { scan-ipa-dump "foo.arg1 = &b" "pta" } } */
-/* { dg-final { scan-tree-dump "Replaced \\\*p_2\\\(D\\\) with 1" "fre2" } } */
+/* { dg-final { scan-tree-dump "Replaced \\\*p_2\\\(D\\\) with 1" "fre3" } } */
diff --git a/gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c b/gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c
index 106e325..c42762a 100644
--- a/gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c
+++ b/gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre2-details" } */
+/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre3-details" } */
 
 int a, b;
 
@@ -28,4 +28,4 @@ int main()
 
 /* { dg-final { scan-ipa-dump "foo.arg0 = &a" "pta" } } */
 /* { dg-final { scan-ipa-dump "foo.arg1 = &b" "pta" } } */
-/* { dg-final { scan-tree-dump "Replaced \\\*p_2\\\(D\\\) with 1" "fre2" } } */
+/* { dg-final { scan-tree-dump "Replaced \\\*p_2\\\(D\\\) with 1" "fre3" } } */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index e1cbce9..1a1da12 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -468,7 +468,7 @@ extern gimple_opt_pass *make_pass_vtable_verify (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ubsan (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_sanopt (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_oacc_kernels (gcc::context *ctxt);
-extern gimple_opt_pass *make_pass_oacc_kernels2 (gcc::context *ctxt);
+extern simple_ipa_opt_pass *make_pass_oacc_ipa (gcc::context *ctxt);
 
 /* IPA Passes */
 extern simple_ipa_opt_pass *make_pass_ipa_lower_emutls (gcc::context *ctxt);
@@ -495,6 +495,7 @@ extern ipa_opt_pass_d *make_pass_ipa_devirt (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_reference (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_pure_const (gcc::context *ctxt);
 extern simple_ipa_opt_pass *make_pass_ipa_pta (gcc::context *ctxt);
+extern simple_ipa_opt_pass *make_pass_ipa_pta_oacc_kernels (gcc::context *ctxt);
 extern simple_ipa_opt_pass *make_pass_ipa_tm (gcc::context *ctxt);
 extern simple_ipa_opt_pass *make_pass_target_clone (gcc::context *ctxt);
 extern simple_ipa_opt_pass *make_pass_dispatcher_calls (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index cf7d94e..0e1dad8 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -206,12 +206,14 @@ make_pass_oacc_kernels (gcc::context *ctxt)
   return new pass_oacc_kernels (ctxt);
 }
 
+/* The oacc ipa superpass.  */
+
 namespace {
 
-const pass_data pass_data_oacc_kernels2 =
+const pass_data pass_data_oacc_ipa =
 {
-  GIMPLE_PASS, /* type */
-  "oacc_kernels2", /* name */
+  SIMPLE_IPA_PASS, /* type */
+  "oacc_ipa", /* name */
   OPTGROUP_LOOP, /* optinfo_flags */
   TV_TREE_LOOP, /* tv_id */
   PROP_cfg, /* properties_required */
@@ -221,34 +223,28 @@ const pass_data pass_data_oacc_kernels2 =
   0, /* todo_flags_finish */
 };
 
-class pass_oacc_kernels2 : public gimple_opt_pass
+class pass_oacc_ipa : public simple_ipa_opt_pass
 {
 public:
-  pass_oacc_kernels2 (gcc::context *ctxt)
-    : gimple_opt_pass (pass_data_oacc_kernels2, ctxt)
+  pass_oacc_ipa (gcc::context *ctxt)
+    : simple_ipa_opt_pass (pass_data_oacc_ipa, ctxt)
   {}
 
   /* opt_pass methods: */
-  virtual bool gate (function *fn) { return gate_oacc_kernels (fn); }
-  virtual unsigned int execute (function *fn)
-    {
-      /* Rather than having a copy of the previous dump, get some use out of
-	 this dump, and try to minimize differences with the following pass
-	 (pass_lim), which will initizalize the loop optimizer with
-	 LOOPS_NORMAL.  */
-      loop_optimizer_init (LOOPS_NORMAL);
-      loop_optimizer_finalize (fn);
-      return 0;
-    }
-
-}; // class pass_oacc_kernels2
+  virtual bool gate (function *)
+  {
+    return (flag_openacc
+	    && flag_tree_parallelize_loops > 1);
+  }
+					     
+}; // class pass_oacc_ipa
 
 } // anon namespace
 
-gimple_opt_pass *
-make_pass_oacc_kernels2 (gcc::context *ctxt)
+simple_ipa_opt_pass *
+make_pass_oacc_ipa (gcc::context *ctxt)
 {
-  return new pass_oacc_kernels2 (ctxt);
+  return new pass_oacc_ipa (ctxt);
 }
 
 /* The no-loop superpass.  */
diff --git a/gcc/tree-ssa-structalias.c b/gcc/tree-ssa-structalias.c
index 7420ce1..b105edc 100644
--- a/gcc/tree-ssa-structalias.c
+++ b/gcc/tree-ssa-structalias.c
@@ -7816,3 +7816,47 @@ make_pass_ipa_pta (gcc::context *ctxt)
 {
   return new pass_ipa_pta (ctxt);
 }
+
+namespace {
+
+const pass_data pass_data_ipa_pta_oacc_kernels =
+{
+  SIMPLE_IPA_PASS, /* type */
+  "pta_oacc_kernels", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_IPA_PTA, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_ipa_pta_oacc_kernels : public simple_ipa_opt_pass
+{
+public:
+  pass_ipa_pta_oacc_kernels (gcc::context *ctxt)
+    : simple_ipa_opt_pass (pass_data_ipa_pta_oacc_kernels, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+    {
+      return (optimize
+	      && flag_openacc
+	      && flag_tree_parallelize_loops > 1
+	      /* Don't bother doing anything if the program has errors.  */
+	      && !seen_error ());
+    }
+
+  virtual unsigned int execute (function *) { return ipa_pta_execute (); }
+
+}; // class pass_ipa_pta_oacc_kernels
+
+} // anon namespace
+
+simple_ipa_opt_pass *
+make_pass_ipa_pta_oacc_kernels (gcc::context *ctxt)
+{
+  return new pass_ipa_pta_oacc_kernels (ctxt);
+}

* [PIING][PATCH, 9/16] Add pass_parallelize_loops_oacc_kernels
  2015-11-24 12:27     ` Tom de Vries
@ 2015-12-13 16:58       ` Tom de Vries
  2015-12-14 15:23         ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-12-13 16:58 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Richard Biener

On 24/11/15 13:24, Tom de Vries wrote:
> On 16/11/15 12:59, Tom de Vries wrote:
>> On 09/11/15 20:52, Tom de Vries wrote:
>>> On 09/11/15 16:35, Tom de Vries wrote:
>>>> Hi,
>>>>
>>>> this patch series for stage1 trunk adds support to:
>>>> - parallelize oacc kernels regions using parloops, and
>>>> - map the loops onto the oacc gang dimension.
>>>>
>>>> The patch series contains these patches:
>>>>
>>>>       1    Insert new exit block only when needed in
>>>>          transform_to_exit_first_loop_alt
>>>>       2    Make create_parallel_loop return void
>>>>       3    Ignore reduction clause on kernels directive
>>>>       4    Implement -foffload-alias
>>>>       5    Add in_oacc_kernels_region in struct loop
>>>>       6    Add pass_oacc_kernels
>>>>       7    Add pass_dominator_oacc_kernels
>>>>       8    Add pass_ch_oacc_kernels
>>>>       9    Add pass_parallelize_loops_oacc_kernels
>>>>      10    Add pass_oacc_kernels pass group in passes.def
>>>>      11    Update testcases after adding kernels pass group
>>>>      12    Handle acc loop directive
>>>>      13    Add c-c++-common/goacc/kernels-*.c
>>>>      14    Add gfortran.dg/goacc/kernels-*.f95
>>>>      15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>>>      16    Add libgomp.oacc-fortran/kernels-*.f95
>>>>
>>>> The first 9 patches are more or less independent, but patches 10-16 are
>>>> intended to be committed at the same time.
>>>>
>>>> Bootstrapped and reg-tested on x86_64.
>>>>
>>>> Build and reg-tested with nvidia accelerator, in combination with a
>>>> patch that enables accelerator testing (which is submitted at
>>>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>>>
>>>> I'll post the individual patches in reply to this message.
>>>
>>> This patch adds pass_parallelize_loops_oacc_kernels.
>>>
>>> There's a number of things we do differently in parloops for oacc
>>> kernels:
>>> - in normal parloops, we generate code to choose between a parallel
>>>    version of the loop, and a sequential (low iteration count) version.
>>>    Since the code in oacc kernels region is supposed to run on the
>>>    accelerator anyway, we skip this check, and don't add a low iteration
>>>    count loop.
>>> - in normal parloops, we generate a #pragma omp parallel /
>>>    GIMPLE_OMP_RETURN pair to delimit the region which we will split off
>>>    into a thread function. Since the oacc kernels region is already
>>>    split off, we don't add this pair.
>>> - we indicate the parallelization factor by setting the oacc function
>>>    attributes
>>> - we generate a #pragma oacc loop instead of a #pragma omp for, and
>>>    we add the gang clause
>>> - in normal parloops, we rewrite the variable accesses in the loop
>>>    into accesses relative to a thread function parameter. For the
>>>    oacc kernels region, that rewrite has already been done at omp-lower,
>>>    so we skip this.
>>> - we need to ensure that the entire kernels region can be run in
>>>    parallel. The loop independence check is already present, so for oacc
>>>    kernels we add a check between blocks outside the loop and the entire
>>>    region.
>>> - we guard stores in the blocks outside the loop with gang_pos == 0.
>>>    There's no need for each gang to write to a single location, we can
>>>    do this in just one gang. (Typically this is the write of the final
>>>    value of the iteration variable if that one is copied back to the
>>>    host).
>>>
>>
>> Reposting with loop optimizer init added in
>> pass_parallelize_loops_oacc_kernels::execute.
>>
>
> Reposting with loop_optimizer_finalize, scev_initialize and scev_finalize
>   added in pass_parallelize_loops_oacc_kernels::execute.
>

Ping.

Anything I can do to facilitate the review?

Thanks,
  Tom

* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-12-13 16:38                     ` Tom de Vries
@ 2015-12-14 13:26                       ` Richard Biener
  2015-12-14 15:44                         ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-12-14 13:26 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Jakub Jelinek, gcc-patches

On Sun, 13 Dec 2015, Tom de Vries wrote:

> On 11/12/15 14:00, Richard Biener wrote:
> > On Fri, 11 Dec 2015, Tom de Vries wrote:
> > 
> > > On 13/11/15 12:39, Jakub Jelinek wrote:
> > > > We simply have some compiler internal interface between the caller and
> > > > callee of the outlined regions, each interface in between those has
> > > > its own structure type used to communicate the info;
> > > > we can attach attributes on the fields, or some flags to indicate some
> > > > properties interesting from aliasing POV.  We don't really need to
> > > > perform
> > > > full IPA-PTA, perhaps it would be enough to a) record somewhere in
> > > > cgraph
> > > > the relationship in between such callers and callees (for offloading
> > > > regions
> > > > we already have "omp target entrypoint" attribute on the callee and a
> > > > single caller), tell LTO if possible not to split those into different
> > > > partitions if easily possible, and then just for these pairs perform
> > > > aliasing/points-to analysis in the caller and the result record using
> > > > cliques/special attributes/whatever to the callee side, so that the
> > > > callee
> > > > (outlined OpenMP/OpenACC/Cilk+ region) can then improve its alias
> > > > analysis.
> > > 
> > > Hi,
> > > 
> > > This work-in-progress patch allows me to use IPA PTA information in the
> > > kernels pass group.
> > > 
> > > Since:
> > > -  I'm running IPA PTA before ealias, and IPA PTA does not interpret
> > >     restrict, and
> > > - compute_may_aliases doesn't run if IPA PTA information is present
> > > I needed to convince ealias to do the restrict clique/base annotation.
> > > 
> > > It would be more logical to fit IPA PTA after ealias, but one is an IPA
> > > pass,
> > > the other a regular one-function pass, so I would have to split the
> > > containing
> > > pass groups pass_all_early_optimizations and
> > > pass_local_optimization_passes.
> > > I'll give that a try now.
> > > 
> 
> I've tried this approach, but realized that this changes the order in which
> non-openacc functions are processed in the compiler, so I've abandoned this
> idea.
> 
> > > Any comments?
> > 
> > I don't think you want to run IPA PTA before early
> > optimizations, it (and ealias) rely on some initial cleanup to
> > do anything meaningful with well-spent resources.
> > 
> > The local PTA "hack" also looks more like a waste of resources, but well
> > ... teaching IPA PTA to honor restrict might be an impossible task
> > though I didn't think much about it other than handling it only for
> > nonlocal_p functions (for others we should see all incoming args
> > if IPA PTA works optimally).  The restrict tags will leak all over
> > the place of course and in the end no meaningful cliques may remain.
> > 
> 
> This patch:
> - moves the kernels pass group to the first position in the pass list
>   after ealias where we're back in ipa mode
> - inserts a new ipa pass, called pass_oacc_ipa, to contain the gimple
>   pass group
> - inserts a version of ipa-pta before the pass group.

In principle I like this a lot, but

+  NEXT_PASS (pass_ipa_pta_oacc_kernels);
+  NEXT_PASS (pass_oacc_ipa);
+  PUSH_INSERT_PASSES_WITHIN (pass_oacc_ipa)

I think you can put pass_ipa_pta_oacc_kernels into the pass_oacc_ipa
group and thus just "clone" ipa_pta?  sub-passes of IPA passes can
be both ipa passes and non-ipa passes.
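
For illustration, the suggested nesting might look roughly like the passes.def fragment below. This is a sketch only, not a tested change and not a standalone program: it reuses the pass names from the patch under review, and with a plain clone the first sub-pass would presumably be written as NEXT_PASS (pass_ipa_pta) instead of a separate pass_ipa_pta_oacc_kernels.

```cpp
  /* Sketch only: a passes.def fragment, not compilable on its own.  */
  NEXT_PASS (pass_oacc_ipa);
  PUSH_INSERT_PASSES_WITHIN (pass_oacc_ipa)
      /* IPA sub-pass: points-to analysis for the kernels pass group.  */
      NEXT_PASS (pass_ipa_pta_oacc_kernels);
      /* Pass group that runs when the function is an offloaded function
         containing oacc kernels loops.  */
      NEXT_PASS (pass_oacc_kernels);
      PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
          NEXT_PASS (pass_ch);
          NEXT_PASS (pass_fre);
          NEXT_PASS (pass_lim);
          NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
          NEXT_PASS (pass_dce);
          NEXT_PASS (pass_expand_omp_ssa);
          NEXT_PASS (pass_rebuild_cgraph_edges);
      POP_INSERT_PASSES ()
  POP_INSERT_PASSES ()
```

The gate of pass_oacc_ipa would then cover the points-to pass as well, so the separate gate on pass_ipa_pta_oacc_kernels might become redundant.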

Thanks,
Richard.

* Re: [PIING][PATCH, 9/16] Add pass_parallelize_loops_oacc_kernels
  2015-12-13 16:58       ` [PIING][PATCH, " Tom de Vries
@ 2015-12-14 15:23         ` Richard Biener
  2016-01-16 22:41           ` [Committed] Move pass_expand_omp_ssa out of pass_parallelize_loops Tom de Vries
                             ` (2 more replies)
  0 siblings, 3 replies; 133+ messages in thread
From: Richard Biener @ 2015-12-14 15:23 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek, Richard Biener

On Sun, Dec 13, 2015 at 5:58 PM, Tom de Vries <Tom_deVries@mentor.com> wrote:
> On 24/11/15 13:24, Tom de Vries wrote:
>>
>> On 16/11/15 12:59, Tom de Vries wrote:
>>>
>>> On 09/11/15 20:52, Tom de Vries wrote:
>>>>
>>>> On 09/11/15 16:35, Tom de Vries wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> this patch series for stage1 trunk adds support to:
>>>>> - parallelize oacc kernels regions using parloops, and
>>>>> - map the loops onto the oacc gang dimension.
>>>>>
>>>>> The patch series contains these patches:
>>>>>
>>>>>       1    Insert new exit block only when needed in
>>>>>          transform_to_exit_first_loop_alt
>>>>>       2    Make create_parallel_loop return void
>>>>>       3    Ignore reduction clause on kernels directive
>>>>>       4    Implement -foffload-alias
>>>>>       5    Add in_oacc_kernels_region in struct loop
>>>>>       6    Add pass_oacc_kernels
>>>>>       7    Add pass_dominator_oacc_kernels
>>>>>       8    Add pass_ch_oacc_kernels
>>>>>       9    Add pass_parallelize_loops_oacc_kernels
>>>>>      10    Add pass_oacc_kernels pass group in passes.def
>>>>>      11    Update testcases after adding kernels pass group
>>>>>      12    Handle acc loop directive
>>>>>      13    Add c-c++-common/goacc/kernels-*.c
>>>>>      14    Add gfortran.dg/goacc/kernels-*.f95
>>>>>      15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>>>>      16    Add libgomp.oacc-fortran/kernels-*.f95
>>>>>
>>>>> The first 9 patches are more or less independent, but patches 10-16 are
>>>>> intended to be committed at the same time.
>>>>>
>>>>> Bootstrapped and reg-tested on x86_64.
>>>>>
>>>>> Build and reg-tested with nvidia accelerator, in combination with a
>>>>> patch that enables accelerator testing (which is submitted at
>>>>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>>>>
>>>>> I'll post the individual patches in reply to this message.
>>>>
>>>>
>>>> This patch adds pass_parallelize_loops_oacc_kernels.
>>>>
>>>> There's a number of things we do differently in parloops for oacc
>>>> kernels:
>>>> - in normal parloops, we generate code to choose between a parallel
>>>>    version of the loop, and a sequential (low iteration count) version.
>>>>    Since the code in oacc kernels region is supposed to run on the
>>>>    accelerator anyway, we skip this check, and don't add a low iteration
>>>>    count loop.
>>>> - in normal parloops, we generate a #pragma omp parallel /
>>>>    GIMPLE_OMP_RETURN pair to delimit the region which we will split off
>>>>    into a thread function. Since the oacc kernels region is already
>>>>    split off, we don't add this pair.
>>>> - we indicate the parallelization factor by setting the oacc function
>>>>    attributes
>>>> - we generate a #pragma oacc loop instead of a #pragma omp for, and
>>>>    we add the gang clause
>>>> - in normal parloops, we rewrite the variable accesses in the loop
>>>>    into accesses relative to a thread function parameter. For the
>>>>    oacc kernels region, that rewrite has already been done at omp-lower,
>>>>    so we skip this.
>>>> - we need to ensure that the entire kernels region can be run in
>>>>    parallel. The loop independence check is already present, so for oacc
>>>>    kernels we add a check between blocks outside the loop and the entire
>>>>    region.
>>>> - we guard stores in the blocks outside the loop with gang_pos == 0.
>>>>    There's no need for each gang to write to a single location, we can
>>>>    do this in just one gang. (Typically this is the write of the final
>>>>    value of the iteration variable if that one is copied back to the
>>>>    host).
>>>>
>>>
>>> Reposting with loop optimizer init added in
>>> pass_parallelize_loops_oacc_kernels::execute.
>>>
>>
>> Reposting with loop_optimizer_finalize, scev_initialize and scev_finalize
>>   added in pass_parallelize_loops_oacc_kernels::execute.
>>
>
> Ping.
>
> Anything I can do to facilitate the review?

Document new functions, avoid if (1).

Ideally some refactoring would avoid some of the if (!oacc_kernels_p) spaghetti
but I'm considering tree-parloops.c (and its bugs) yours.

Can the pass not just use a pass parameter to switch between oacc/non-oacc?

Richard.

> Thanks,
>  Tom
>>
>>
>

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-12-14 13:26                       ` Richard Biener
@ 2015-12-14 15:44                         ` Tom de Vries
  2015-12-16 13:16                           ` Richard Biener
  0 siblings, 1 reply; 133+ messages in thread
From: Tom de Vries @ 2015-12-14 15:44 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jakub Jelinek, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 4194 bytes --]

On 14/12/15 14:26, Richard Biener wrote:
> On Sun, 13 Dec 2015, Tom de Vries wrote:
>
>> On 11/12/15 14:00, Richard Biener wrote:
>>> On Fri, 11 Dec 2015, Tom de Vries wrote:
>>>
>>>> On 13/11/15 12:39, Jakub Jelinek wrote:
>>>>> We simply have some compiler internal interface between the caller and
>>>>> callee of the outlined regions, each interface in between those has
>>>>> its own structure type used to communicate the info;
>>>>> we can attach attributes on the fields, or some flags to indicate some
>>>>> properties interesting from aliasing POV.  We don't really need to
>>>>> perform
>>>>> full IPA-PTA, perhaps it would be enough to a) record somewhere in
>>>>> cgraph
>>>>> the relationship in between such callers and callees (for offloading
>>>>> regions
>>>>> we already have "omp target entrypoint" attribute on the callee and a
>>>>> single caller), tell LTO if possible not to split those into different
>>>>> partitions if easily possible, and then just for these pairs perform
>>>>> aliasing/points-to analysis in the caller and the result record using
>>>>> cliques/special attributes/whatever to the callee side, so that the
>>>>> callee
>>>>> (outlined OpenMP/OpenACC/Cilk+ region) can then improve its alias
>>>>> analysis.
>>>>
>>>> Hi,
>>>>
>>>> This work-in-progress patch allows me to use IPA PTA information in the
>>>> kernels pass group.
>>>>
>>>> Since:
>>>> -  I'm running IPA PTA before ealias, and IPA PTA does not interpret
>>>>      restrict, and
>>>> - compute_may_alias doesn't run if IPA PTA information is present
>>>> I needed to convince ealias to do the restrict clique/base annotation.
>>>>
>>>> It would be more logical to fit IPA PTA after ealias, but one is an IPA
>>>> pass,
>>>> the other a regular one-function pass, so I would have to split the
>>>> containing
>>>> pass groups pass_all_early_optimizations and
>>>> pass_local_optimization_passes.
>>>> I'll give that a try now.
>>>>
>>
>> I've tried this approach, but realized that this changes the order in which
>> non-openacc functions are processed in the compiler, so I've abandoned this
>> idea.
>>
>>>> Any comments?
>>>
>>> I don't think you want to run IPA PTA before early
>>> optimizations, it (and ealias) rely on some initial cleanup to
>>> do anything meaningful with well-spent resources.
>>>
>>> The local PTA "hack" also looks more like a waste of resources, but well
>>> ... teaching IPA PTA to honor restrict might be an impossible task
>>> though I didn't think much about it other than handling it only for
>>> nonlocal_p functions (for others we should see all incoming args
>>> if IPA PTA works optimally).  The restrict tags will leak all over
>>> the place of course and in the end no meaningful cliques may remain.
>>>
>>
>> This patch:
>> - moves the kernels pass group to the first position in the pass list
>>    after ealias where we're back in ipa mode
>> - inserts a new ipa pass to contain the gimple pass group called
>>    pass_oacc_ipa
>> - inserts a version of ipa-pta before the pass group.
>
> In principle I like this a lot, but
>
> +  NEXT_PASS (pass_ipa_pta_oacc_kernels);
> +  NEXT_PASS (pass_oacc_ipa);
> +  PUSH_INSERT_PASSES_WITHIN (pass_oacc_ipa)
>
> I think you can put pass_ipa_pta_oacc_kernels into the pass_oacc_ipa
> group and thus just "clone" ipa_pta?

Done. But using a clone means using the same gate function, and that
means that this pass_ipa_pta instance no longer runs by default for
openacc.

I've added enabling-by-default of fipa-pta for fopenacc in 
default_options_optimization to fix that.
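
As a rough illustration of that opts.c change (a simplified, self-contained
model, not the actual GCC code: the `options` struct and
`apply_openacc_defaults` are made-up names standing in for `gcc_options` and
the `opts` / `opts_set` pair), the idea is to switch the flag on for
-fopenacc only when the user did not set it explicitly:

```cpp
#include <cassert>

// Hypothetical stand-in for the one field of struct gcc_options we care
// about here.
struct options
{
  bool flag_ipa_pta = false;
};

// Hypothetical helper.  OPTS holds the current flag values; OPTS_SET
// records which flags were given explicitly on the command line
// (either -fipa-pta or -fno-ipa-pta).
void
apply_openacc_defaults (options &opts, const options &opts_set,
                        bool openacc_mode)
{
  // Only override the default; an explicit user choice wins.
  if (openacc_mode && !opts_set.flag_ipa_pta)
    opts.flag_ipa_pta = true;
}
```

So under this model an explicit -fno-ipa-pta would still be honoured, while
plain -fopenacc would get IPA PTA without having to ask for it.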

> sub-passes of IPA passes can
> be both ipa passes and non-ipa passes.

Right. It does mean that I need yet another pass (pass_ipa_oacc_kernels) 
to do the IPA/non-IPA transition at pass/sub-pass boundary:
...
   NEXT_PASS (pass_ipa_oacc);
   PUSH_INSERT_PASSES_WITHIN (pass_ipa_oacc)
       NEXT_PASS (pass_ipa_pta);
       NEXT_PASS (pass_ipa_oacc_kernels);
       PUSH_INSERT_PASSES_WITHIN (pass_ipa_oacc_kernels)
          /* out-of-ipa */
          NEXT_PASS (pass_oacc_kernels);
          PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
...
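
To make the nesting a bit more concrete, here is a toy model of how a gated
pass group behaves (a sketch only: `toy_pass`, `run_pass` and
`make_ipa_oacc` are invented for illustration and are not the pass manager
API, and the toy ignores the IPA vs. per-function distinction that makes
the extra wrapper pass necessary in the first place):

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical miniature pass: a name, an optional gate, nested sub-passes.
struct toy_pass
{
  std::string name;
  std::function<bool ()> gate;   // empty means "always run"
  std::vector<toy_pass> sub;
};

// Run PASS and, only if its gate holds, its sub-passes.  The names of the
// passes that actually ran are appended to LOG.
void
run_pass (const toy_pass &pass, std::vector<std::string> &log)
{
  if (pass.gate && !pass.gate ())
    return;
  log.push_back (pass.name);
  for (const toy_pass &p : pass.sub)
    run_pass (p, log);
}

// Build the nesting from the snippet above.  OPENACC models the gate of the
// outer group; the real pta pass has its own gate, which is left out here.
toy_pass
make_ipa_oacc (bool openacc)
{
  toy_pass kernels { "oacc_kernels", nullptr, {} };
  toy_pass ipa_kernels { "ipa_oacc_kernels", nullptr, { kernels } };
  toy_pass pta { "pta", nullptr, {} };
  return { "ipa_oacc", [openacc] { return openacc; }, { pta, ipa_kernels } };
}
```

The point of the model is just that when the outer gate fails, nothing in
the group runs, so non-openacc compiles should not see the extra pta
instance or the kernels passes.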

OK for stage3 if bootstrap and reg-test succeed?

Thanks,
- Tom


[-- Attachment #2: 0003-Add-pass_oacc_ipa.patch --]
[-- Type: text/x-patch, Size: 15289 bytes --]

Add pass_oacc_ipa

2015-12-14  Tom de Vries  <tom@codesourcery.com>

	* opts.c (default_options_optimization): Set fipa-pta on by default for
	fopenacc.
	* passes.def: Move kernels pass group to pass_ipa_oacc.
	* tree-pass.h (make_pass_oacc_kernels2): Remove.
	(make_pass_ipa_oacc, make_pass_ipa_oacc_kernels): Declare.
	* tree-ssa-loop.c (pass_oacc_kernels2, make_pass_oacc_kernels2): Remove.
	(pass_ipa_oacc, pass_ipa_oacc_kernels): New pass.
	(make_pass_ipa_oacc, make_pass_ipa_oacc_kernels): New function.
	* tree-ssa-structalias.c (pass_ipa_pta::clone): New function.

	* g++.dg/ipa/devirt-37.C: Update for new fre2 pass.
	* g++.dg/ipa/devirt-40.C: Same.
	* g++.dg/tree-ssa/pr61034.C: Same.
	* gcc.dg/ipa/ipa-pta-13.c: Same.
	* gcc.dg/ipa/ipa-pta-3.c: Same.
	* gcc.dg/ipa/ipa-pta-4.c: Same.

---
 gcc/opts.c                              |  9 ++++
 gcc/passes.def                          | 41 ++++++++++--------
 gcc/testsuite/g++.dg/ipa/devirt-37.C    | 10 ++---
 gcc/testsuite/g++.dg/ipa/devirt-40.C    |  4 +-
 gcc/testsuite/g++.dg/tree-ssa/pr61034.C | 10 ++---
 gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c   |  4 +-
 gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c    |  4 +-
 gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c    |  4 +-
 gcc/tree-pass.h                         |  3 +-
 gcc/tree-ssa-loop.c                     | 76 ++++++++++++++++++++++++---------
 gcc/tree-ssa-structalias.c              |  2 +
 11 files changed, 110 insertions(+), 57 deletions(-)

diff --git a/gcc/opts.c b/gcc/opts.c
index 3d25f98..42d5566 100644
--- a/gcc/opts.c
+++ b/gcc/opts.c
@@ -560,6 +560,7 @@ default_options_optimization (struct gcc_options *opts,
 {
   unsigned int i;
   int opt2;
+  bool openacc_mode = false;
 
   /* Scan to see what optimization level has been specified.  That will
      determine the default value of many flags.  */
@@ -619,6 +620,10 @@ default_options_optimization (struct gcc_options *opts,
 	  opts->x_optimize_debug = 1;
 	  break;
 
+	case OPT_fopenacc:
+	  openacc_mode = true;
+	  break;
+
 	default:
 	  /* Ignore other options in this prescan.  */
 	  break;
@@ -633,6 +638,10 @@ default_options_optimization (struct gcc_options *opts,
   /* -O2 param settings.  */
   opt2 = (opts->x_optimize >= 2);
 
+  if (openacc_mode
+      && !opts_set->x_flag_ipa_pta)
+    opts->x_flag_ipa_pta = true;
+
   /* Track fields in field-sensitive alias analysis.  */
   maybe_set_param_value
     (PARAM_MAX_FIELDS_FOR_FIELD_SENSITIVE,
diff --git a/gcc/passes.def b/gcc/passes.def
index 43ce3d5..96e18f1 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -88,24 +88,7 @@ along with GCC; see the file COPYING3.  If not see
 	  /* pass_build_ealias is a dummy pass that ensures that we
 	     execute TODO_rebuild_alias at this point.  */
 	  NEXT_PASS (pass_build_ealias);
-	  /* Pass group that runs when the function is an offloaded function
-	     containing oacc kernels loops.  Part 1.  */
-	  NEXT_PASS (pass_oacc_kernels);
-	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
-	      NEXT_PASS (pass_ch);
-	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_fre);
-	  /* Pass group that runs when the function is an offloaded function
-	     containing oacc kernels loops.  Part 2.  */
-	  NEXT_PASS (pass_oacc_kernels2);
-	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
-	      /* We use pass_lim to rewrite in-memory iteration and reduction
-		 variable accesses in loops into local variables accesses.  */
-	      NEXT_PASS (pass_lim);
-	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
-	      NEXT_PASS (pass_dce);
-	      NEXT_PASS (pass_expand_omp_ssa);
-	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_merge_phi);
           NEXT_PASS (pass_dse);
 	  NEXT_PASS (pass_cd_dce);
@@ -124,6 +107,30 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_rebuild_cgraph_edges);
       NEXT_PASS (pass_inline_parameters);
   POP_INSERT_PASSES ()
+
+  NEXT_PASS (pass_ipa_oacc);
+  PUSH_INSERT_PASSES_WITHIN (pass_ipa_oacc)
+      NEXT_PASS (pass_ipa_pta);
+      /* Pass group that runs when the function is an offloaded function
+	 containing oacc kernels loops.	 */
+      NEXT_PASS (pass_ipa_oacc_kernels);
+      PUSH_INSERT_PASSES_WITHIN (pass_ipa_oacc_kernels)
+	  NEXT_PASS (pass_oacc_kernels);
+	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
+	      NEXT_PASS (pass_ch);
+	      NEXT_PASS (pass_fre);
+	      /* We use pass_lim to rewrite in-memory iteration and reduction
+		 variable accesses in loops into local variables accesses.  */
+	      NEXT_PASS (pass_lim);
+	      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+	      NEXT_PASS (pass_dce);
+	      /* pass_parallelize_loops_oacc_kernels */
+	      NEXT_PASS (pass_expand_omp_ssa);
+	      NEXT_PASS (pass_rebuild_cgraph_edges);
+	  POP_INSERT_PASSES ()
+      POP_INSERT_PASSES ()
+  POP_INSERT_PASSES ()
+
   NEXT_PASS (pass_ipa_chkp_produce_thunks);
   NEXT_PASS (pass_ipa_auto_profile);
   NEXT_PASS (pass_ipa_free_inline_summary);
diff --git a/gcc/testsuite/g++.dg/ipa/devirt-37.C b/gcc/testsuite/g++.dg/ipa/devirt-37.C
index 9c5287e..b7f52a0 100644
--- a/gcc/testsuite/g++.dg/ipa/devirt-37.C
+++ b/gcc/testsuite/g++.dg/ipa/devirt-37.C
@@ -1,4 +1,4 @@
-/* { dg-options "-fpermissive -O2 -fno-indirect-inlining -fno-devirtualize-speculatively -fdump-tree-fre2-details -fno-early-inlining"  } */
+/* { dg-options "-fpermissive -O2 -fno-indirect-inlining -fno-devirtualize-speculatively -fdump-tree-fre3-details -fno-early-inlining"  } */
 #include <stdlib.h>
 struct A {virtual void test() {abort ();}};
 struct B:A
@@ -30,7 +30,7 @@ t()
 /* After inlining the call within constructor needs to be checked to not go into a basetype.
    We should see the vtbl store and we should notice extcall as possibly clobbering the
    type but ignore it because b is in static storage.  */
-/* { dg-final { scan-tree-dump "No dynamic type change found."  "fre2"  } } */
-/* { dg-final { scan-tree-dump "Checking vtbl store:"  "fre2"  } } */
-/* { dg-final { scan-tree-dump "Function call may change dynamic type:extcall"  "fre2"  } } */
-/* { dg-final { scan-tree-dump "converting indirect call to function virtual void"  "fre2"  } } */
+/* { dg-final { scan-tree-dump "No dynamic type change found."  "fre3"  } } */
+/* { dg-final { scan-tree-dump "Checking vtbl store:"  "fre3"  } } */
+/* { dg-final { scan-tree-dump "Function call may change dynamic type:extcall"  "fre3"  } } */
+/* { dg-final { scan-tree-dump "converting indirect call to function virtual void"  "fre3"  } } */
diff --git a/gcc/testsuite/g++.dg/ipa/devirt-40.C b/gcc/testsuite/g++.dg/ipa/devirt-40.C
index 279a228..5107c29 100644
--- a/gcc/testsuite/g++.dg/ipa/devirt-40.C
+++ b/gcc/testsuite/g++.dg/ipa/devirt-40.C
@@ -1,4 +1,4 @@
-/* { dg-options "-O2 -fdump-tree-fre2-details"  } */
+/* { dg-options "-O2 -fdump-tree-fre3-details"  } */
 typedef enum
 {
 } UErrorCode;
@@ -19,4 +19,4 @@ A::m_fn1 (UnicodeString &, int &p2, UErrorCode &) const
   UnicodeString a[2];
 }
 
-/* { dg-final { scan-tree-dump-not "\\n  OBJ_TYPE_REF" "fre2"  } } */
+/* { dg-final { scan-tree-dump-not "\\n  OBJ_TYPE_REF" "fre3"  } } */
diff --git a/gcc/testsuite/g++.dg/tree-ssa/pr61034.C b/gcc/testsuite/g++.dg/tree-ssa/pr61034.C
index cd4ee05..c06c580 100644
--- a/gcc/testsuite/g++.dg/tree-ssa/pr61034.C
+++ b/gcc/testsuite/g++.dg/tree-ssa/pr61034.C
@@ -1,5 +1,5 @@
 // { dg-do compile }
-// { dg-options "-O2 -fdump-tree-fre2 -fdump-tree-optimized" }
+// { dg-options "-O2 -fdump-tree-fre3 -fdump-tree-optimized" }
 
 #define assume(x) if(!(x))__builtin_unreachable()
 
@@ -42,13 +42,13 @@ bool f(I a, I b, I c, I d) {
 // a bunch of conditional free()s and unreachable()s.
 // This works only if everything is inlined into 'f'.
 
-// { dg-final { scan-tree-dump-times ";; Function" 1 "fre2" } }
-// { dg-final { scan-tree-dump-times "unreachable" 11 "fre2" } }
+// { dg-final { scan-tree-dump-times ";; Function" 1 "fre3" } }
+// { dg-final { scan-tree-dump-times "unreachable" 11 "fre3" } }
 
 // Note that depending on PUSH_ARGS_REVERSED we are presented with
 // a different initial CFG and thus the final outcome is different
 
-// { dg-final { scan-tree-dump-times "free" 10 "fre2" { target x86_64-*-* i?86-*-* } } }
+// { dg-final { scan-tree-dump-times "free" 10 "fre3" { target x86_64-*-* i?86-*-* } } }
 // { dg-final { scan-tree-dump-times "free" 3 "optimized" { target x86_64-*-* i?86-*-* } } }
-// { dg-final { scan-tree-dump-times "free" 14 "fre2" { target aarch64-*-* ia64-*-* arm-*-* hppa*-*-* sparc*-*-* powerpc*-*-* alpha*-*-* } } }
+// { dg-final { scan-tree-dump-times "free" 14 "fre3" { target aarch64-*-* ia64-*-* arm-*-* hppa*-*-* sparc*-*-* powerpc*-*-* alpha*-*-* } } }
 // { dg-final { scan-tree-dump-times "free" 4 "optimized" { target aarch64-*-* ia64-*-* arm-*-* hppa*-*-* sparc*-*-* powerpc*-*-* alpha*-*-* } } }
diff --git a/gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c b/gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c
index f558df3..71b31c4 100644
--- a/gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c
+++ b/gcc/testsuite/gcc.dg/ipa/ipa-pta-13.c
@@ -1,5 +1,5 @@
 /* { dg-do link } */
-/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre2 -fno-ipa-icf" } */
+/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre3 -fno-ipa-icf" } */
 
 static int x, y;
 
@@ -54,7 +54,7 @@ int main()
   local_address_taken (&y);
   /* As we are computing flow- and context-insensitive we may not
      CSE the load of x here.  */
-  /* { dg-final { scan-tree-dump " = x;" "fre2" } } */
+  /* { dg-final { scan-tree-dump " = x;" "fre3" } } */
   return x;
 }
 
diff --git a/gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c b/gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c
index ff6fa57..8655794 100644
--- a/gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c
+++ b/gcc/testsuite/gcc.dg/ipa/ipa-pta-3.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre2-details" } */
+/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre3-details" } */
 
 static int __attribute__((noinline,noclone))
 foo (int *p, int *q)
@@ -23,4 +23,4 @@ int main()
 
 /* { dg-final { scan-ipa-dump "foo.arg0 = &a" "pta" } } */
 /* { dg-final { scan-ipa-dump "foo.arg1 = &b" "pta" } } */
-/* { dg-final { scan-tree-dump "Replaced \\\*p_2\\\(D\\\) with 1" "fre2" } } */
+/* { dg-final { scan-tree-dump "Replaced \\\*p_2\\\(D\\\) with 1" "fre3" } } */
diff --git a/gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c b/gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c
index 106e325..c42762a 100644
--- a/gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c
+++ b/gcc/testsuite/gcc.dg/ipa/ipa-pta-4.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre2-details" } */
+/* { dg-options "-O2 -fipa-pta -fdump-ipa-pta-details -fdump-tree-fre3-details" } */
 
 int a, b;
 
@@ -28,4 +28,4 @@ int main()
 
 /* { dg-final { scan-ipa-dump "foo.arg0 = &a" "pta" } } */
 /* { dg-final { scan-ipa-dump "foo.arg1 = &b" "pta" } } */
-/* { dg-final { scan-tree-dump "Replaced \\\*p_2\\\(D\\\) with 1" "fre2" } } */
+/* { dg-final { scan-tree-dump "Replaced \\\*p_2\\\(D\\\) with 1" "fre3" } } */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index e1cbce9..dcdbdfd 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -468,7 +468,8 @@ extern gimple_opt_pass *make_pass_vtable_verify (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ubsan (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_sanopt (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_oacc_kernels (gcc::context *ctxt);
-extern gimple_opt_pass *make_pass_oacc_kernels2 (gcc::context *ctxt);
+extern simple_ipa_opt_pass *make_pass_ipa_oacc (gcc::context *ctxt);
+extern simple_ipa_opt_pass *make_pass_ipa_oacc_kernels (gcc::context *ctxt);
 
 /* IPA Passes */
 extern simple_ipa_opt_pass *make_pass_ipa_lower_emutls (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index cf7d94e..1fe2716 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -36,6 +36,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-scalar-evolution.h"
 #include "tree-vectorizer.h"
 #include "omp-low.h"
+#include "diagnostic-core.h"
 
 
 /* A pass making sure loops are fixed up.  */
@@ -206,12 +207,14 @@ make_pass_oacc_kernels (gcc::context *ctxt)
   return new pass_oacc_kernels (ctxt);
 }
 
+/* The ipa oacc superpass.  */
+
 namespace {
 
-const pass_data pass_data_oacc_kernels2 =
+const pass_data pass_data_ipa_oacc =
 {
-  GIMPLE_PASS, /* type */
-  "oacc_kernels2", /* name */
+  SIMPLE_IPA_PASS, /* type */
+  "ipa_oacc", /* name */
   OPTGROUP_LOOP, /* optinfo_flags */
   TV_TREE_LOOP, /* tv_id */
   PROP_cfg, /* properties_required */
@@ -221,34 +224,65 @@ const pass_data pass_data_oacc_kernels2 =
   0, /* todo_flags_finish */
 };
 
-class pass_oacc_kernels2 : public gimple_opt_pass
+class pass_ipa_oacc : public simple_ipa_opt_pass
 {
 public:
-  pass_oacc_kernels2 (gcc::context *ctxt)
-    : gimple_opt_pass (pass_data_oacc_kernels2, ctxt)
+  pass_ipa_oacc (gcc::context *ctxt)
+    : simple_ipa_opt_pass (pass_data_ipa_oacc, ctxt)
   {}
 
   /* opt_pass methods: */
-  virtual bool gate (function *fn) { return gate_oacc_kernels (fn); }
-  virtual unsigned int execute (function *fn)
-    {
-      /* Rather than having a copy of the previous dump, get some use out of
-	 this dump, and try to minimize differences with the following pass
-	 (pass_lim), which will initizalize the loop optimizer with
-	 LOOPS_NORMAL.  */
-      loop_optimizer_init (LOOPS_NORMAL);
-      loop_optimizer_finalize (fn);
-      return 0;
-    }
+  virtual bool gate (function *)
+  {
+    return (optimize
+	    /* Don't bother doing anything if the program has errors.  */
+	    && !seen_error ()
+	    && flag_openacc
+	    && flag_tree_parallelize_loops > 1);
+  }
 
-}; // class pass_oacc_kernels2
+}; // class pass_ipa_oacc
 
 } // anon namespace
 
-gimple_opt_pass *
-make_pass_oacc_kernels2 (gcc::context *ctxt)
+simple_ipa_opt_pass *
+make_pass_ipa_oacc (gcc::context *ctxt)
+{
+  return new pass_ipa_oacc (ctxt);
+}
+
+/* The ipa oacc kernels pass.  */
+
+namespace {
+
+const pass_data pass_data_ipa_oacc_kernels =
+{
+  SIMPLE_IPA_PASS, /* type */
+  "ipa_oacc_kernels", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_TREE_LOOP, /* tv_id */
+  PROP_cfg, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_ipa_oacc_kernels : public simple_ipa_opt_pass
+{
+public:
+  pass_ipa_oacc_kernels (gcc::context *ctxt)
+    : simple_ipa_opt_pass (pass_data_ipa_oacc_kernels, ctxt)
+  {}
+
+}; // class pass_ipa_oacc_kernels
+
+} // anon namespace
+
+simple_ipa_opt_pass *
+make_pass_ipa_oacc_kernels (gcc::context *ctxt)
 {
-  return new pass_oacc_kernels2 (ctxt);
+  return new pass_ipa_oacc_kernels (ctxt);
 }
 
 /* The no-loop superpass.  */
diff --git a/gcc/tree-ssa-structalias.c b/gcc/tree-ssa-structalias.c
index b34c955..5f8c0b6 100644
--- a/gcc/tree-ssa-structalias.c
+++ b/gcc/tree-ssa-structalias.c
@@ -7821,6 +7821,8 @@ public:
 	      && !seen_error ());
     }
 
+  opt_pass * clone () { return new pass_ipa_pta (m_ctxt); }
+
   virtual unsigned int execute (function *) { return ipa_pta_execute (); }
 
 }; // class pass_ipa_pta

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH, 4/16] Implement -foffload-alias
  2015-12-14 15:44                         ` Tom de Vries
@ 2015-12-16 13:16                           ` Richard Biener
  2015-12-16 14:43                             ` Tom de Vries
  0 siblings, 1 reply; 133+ messages in thread
From: Richard Biener @ 2015-12-16 13:16 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Jakub Jelinek, gcc-patches

On Mon, 14 Dec 2015, Tom de Vries wrote:

> On 14/12/15 14:26, Richard Biener wrote:
> > On Sun, 13 Dec 2015, Tom de Vries wrote:
> > 
> > > On 11/12/15 14:00, Richard Biener wrote:
> > > > On Fri, 11 Dec 2015, Tom de Vries wrote:
> > > > 
> > > > > On 13/11/15 12:39, Jakub Jelinek wrote:
> > > > > > We simply have some compiler internal interface between the caller
> > > > > > and
> > > > > > callee of the outlined regions, each interface in between those has
> > > > > > its own structure type used to communicate the info;
> > > > > > we can attach attributes on the fields, or some flags to indicate
> > > > > > some
> > > > > > properties interesting from aliasing POV.  We don't really need to
> > > > > > perform
> > > > > > full IPA-PTA, perhaps it would be enough to a) record somewhere in
> > > > > > cgraph
> > > > > > the relationship in between such callers and callees (for offloading
> > > > > > regions
> > > > > > we already have "omp target entrypoint" attribute on the callee and
> > > > > > a
> > > > > > > single caller), tell LTO if possible not to split those into
> > > > > > different
> > > > > > partitions if easily possible, and then just for these pairs perform
> > > > > > aliasing/points-to analysis in the caller and the result record
> > > > > > using
> > > > > > cliques/special attributes/whatever to the callee side, so that the
> > > > > > callee
> > > > > > (outlined OpenMP/OpenACC/Cilk+ region) can then improve its alias
> > > > > > analysis.
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > This work-in-progress patch allows me to use IPA PTA information in
> > > > > the
> > > > > kernels pass group.
> > > > > 
> > > > > Since:
> > > > > -  I'm running IPA PTA before ealias, and IPA PTA does not interpret
> > > > >      restrict, and
> > > > > - compute_may_alias doesn't run if IPA PTA information is present
> > > > > I needed to convince ealias to do the restrict clique/base annotation.
> > > > > 
> > > > > It would be more logical to fit IPA PTA after ealias, but one is an
> > > > > IPA
> > > > > pass,
> > > > > the other a regular one-function pass, so I would have to split the
> > > > > containing
> > > > > pass groups pass_all_early_optimizations and
> > > > > pass_local_optimization_passes.
> > > > > I'll give that a try now.
> > > > > 
> > > 
> > > I've tried this approach, but realized that this changes the order in
> > > which
> > > non-openacc functions are processed in the compiler, so I've abandoned
> > > this
> > > idea.
> > > 
> > > > > Any comments?
> > > > 
> > > > I don't think you want to run IPA PTA before early
> > > > optimizations, it (and ealias) rely on some initial cleanup to
> > > > > do anything meaningful with well-spent resources.
> > > > 
> > > > The local PTA "hack" also looks more like a waste of resources, but well
> > > > ... teaching IPA PTA to honor restrict might be an impossible task
> > > > though I didn't think much about it other than handling it only for
> > > > nonlocal_p functions (for others we should see all incoming args
> > > > if IPA PTA works optimally).  The restrict tags will leak all over
> > > > the place of course and in the end no meaningful cliques may remain.
> > > > 
> > > 
> > > This patch:
> > > - moves the kernels pass group to the first position in the pass list
> > >    after ealias where we're back in ipa mode
> > > - inserts a new ipa pass to contain the gimple pass group called
> > >    pass_oacc_ipa
> > > - inserts a version of ipa-pta before the pass group.
> > 
> > In principle I like this a lot, but
> > 
> > +  NEXT_PASS (pass_ipa_pta_oacc_kernels);
> > +  NEXT_PASS (pass_oacc_ipa);
> > +  PUSH_INSERT_PASSES_WITHIN (pass_oacc_ipa)
> > 
> > I think you can put pass_ipa_pta_oacc_kernels into the pass_oacc_ipa
> > group and thus just "clone" ipa_pta?
> 
> Done. But using a clone means using the same gate function, and that means
> that this pass_ipa_pta instance no longer runs by default for openacc.
> 
> I've added enabling-by-default of fipa-pta for fopenacc in
> default_options_optimization to fix that.

Hmm, but that enables both IPA PTA passes then?  I suppose that's ok,
and if not enabling the "late" IPA PTA you'd want to re-set 
gimple_df->ipa_pta.

> > sub-passes of IPA passes can
> > be both ipa passes and non-ipa passes.
> 
> Right. It does mean that I need yet another pass (pass_ipa_oacc_kernels) to do
> the IPA/non-IPA transition at pass/sub-pass boundary:
> ...
>   NEXT_PASS (pass_ipa_oacc);
>   PUSH_INSERT_PASSES_WITHIN (pass_ipa_oacc)
>       NEXT_PASS (pass_ipa_pta);
>       NEXT_PASS (pass_ipa_oacc_kernels);
>       PUSH_INSERT_PASSES_WITHIN (pass_ipa_oacc_kernels)
>          /* out-of-ipa */
>          NEXT_PASS (pass_oacc_kernels);
>          PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
> ...
> 
> OK for stage3 if bootstrap and reg-test succeed?

Ok.

Richard.

^ permalink